Amazon Textract
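Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from scanned documents.
It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables. Today, many companies extract data from scanned documents such as PDFs, images, tables, and forms by hand, or rely on simple OCR software that requires manual configuration (which often must be updated whenever the form changes). To replace these time-consuming and expensive processes, Textract uses ML to read and process any type of document, accurately extracting text, handwriting, tables, and other data with no manual effort.
This sample demonstrates how to use Amazon Textract together with LangChain as a document loader.
Textract supports the PDF, TIFF, PNG, and JPEG formats.
Textract also has limits on document size, supported languages, and characters; the Amazon Textract documentation lists them.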
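As a rough illustration of the format restriction above, the following sketch checks a path or URL against the listed formats before it is handed to the loader. The helper name `is_supported_by_textract` and the extension set are assumptions made for this example; they are not part of LangChain or the Textract SDK, and a matching extension does not guarantee that Textract will accept the file.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extensions for the formats listed above (PDF, TIFF, PNG, JPEG).
# This set is an assumption for the sketch, not an official list.
SUPPORTED_EXTENSIONS = {".pdf", ".tif", ".tiff", ".png", ".jpg", ".jpeg"}


def is_supported_by_textract(path_or_url: str) -> bool:
    """Return True if the path or URL ends in one of the listed extensions."""
    # urlparse exposes the path component for local paths as well as
    # s3:// and https:// URLs, and strips any query string.
    path = urlparse(path_or_url).path
    return PurePosixPath(path).suffix.lower() in SUPPORTED_EXTENSIONS
```

A check like this only catches obvious mismatches early; size and page limits still have to be verified against the Textract documentation.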
%pip install --upgrade --quiet boto3 langchain-openai tiktoken python-dotenv
%pip install --upgrade --quiet "amazon-textract-caller>=0.2.0"
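Sample 1
The first example uses a local file, which is sent internally to the Amazon Textract synchronous API DetectDocumentText.
With Textract, local files and URL endpoints such as HTTP:// are limited to single-page documents. Multi-page documents must be stored on S3. This sample file is a JPEG.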
from langchain_community.document_loaders import AmazonTextractPDFLoader
loader = AmazonTextractPDFLoader("example_data/alejandro_rosalez_sample-small.jpeg")
documents = loader.load()
Output from the file
documents
[Document(page_content='Patient Information First Name: ALEJANDRO Last Name: ROSALEZ Date of Birth: 10/10/1982 Sex: M Marital Status: MARRIED Email Address: Address: 123 ANY STREET City: ANYTOWN State: CA Zip Code: 12345 Phone: 646-555-0111 Emergency Contact 1: First Name: CARLOS Last Name: SALAZAR Phone: 212-555-0150 Relationship to Patient: BROTHER Emergency Contact 2: First Name: JANE Last Name: DOE Phone: 650-555-0123 Relationship FRIEND to Patient: Did you feel fever or feverish lately? Yes No Are you having shortness of breath? Yes No Do you have a cough? Yes No Did you experience loss of taste or smell? Yes No Where you in contact with any confirmed COVID-19 positive patients? Yes No Did you travel in the past 14 days to any regions affected by COVID-19? Yes No Patient Information First Name: ALEJANDRO Last Name: ROSALEZ Date of Birth: 10/10/1982 Sex: M Marital Status: MARRIED Email Address: Address: 123 ANY STREET City: ANYTOWN State: CA Zip Code: 12345 Phone: 646-555-0111 Emergency Contact 1: First Name: CARLOS Last Name: SALAZAR Phone: 212-555-0150 Relationship to Patient: BROTHER Emergency Contact 2: First Name: JANE Last Name: DOE Phone: 650-555-0123 Relationship FRIEND to Patient: Did you feel fever or feverish lately? Yes No Are you having shortness of breath? Yes No Do you have a cough? Yes No Did you experience loss of taste or smell? Yes No Where you in contact with any confirmed COVID-19 positive patients? Yes No Did you travel in the past 14 days to any regions affected by COVID-19? Yes No ', metadata={'source': 'example_data/alejandro_rosalez_sample-small.jpeg', 'page': 1})]
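Sample 2
The next sample loads a file from an HTTPS endpoint. It must be a single-page document, because Amazon Textract requires all multi-page documents to be stored on S3.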
from langchain_community.document_loaders import AmazonTextractPDFLoader
loader = AmazonTextractPDFLoader(
"https://amazon-textract-public-content.s3.us-east-2.amazonaws.com/langchain/alejandro_rosalez_sample_1.jpg"
)
documents = loader.load()
documents
[Document(page_content='Patient Information First Name: ALEJANDRO Last Name: ROSALEZ Date of Birth: 10/10/1982 Sex: M Marital Status: MARRIED Email Address: Address: 123 ANY STREET City: ANYTOWN State: CA Zip Code: 12345 Phone: 646-555-0111 Emergency Contact 1: First Name: CARLOS Last Name: SALAZAR Phone: 212-555-0150 Relationship to Patient: BROTHER Emergency Contact 2: First Name: JANE Last Name: DOE Phone: 650-555-0123 Relationship FRIEND to Patient: Did you feel fever or feverish lately? Yes No Are you having shortness of breath? Yes No Do you have a cough? Yes No Did you experience loss of taste or smell? Yes No Where you in contact with any confirmed COVID-19 positive patients? Yes No Did you travel in the past 14 days to any regions affected by COVID-19? Yes No Patient Information First Name: ALEJANDRO Last Name: ROSALEZ Date of Birth: 10/10/1982 Sex: M Marital Status: MARRIED Email Address: Address: 123 ANY STREET City: ANYTOWN State: CA Zip Code: 12345 Phone: 646-555-0111 Emergency Contact 1: First Name: CARLOS Last Name: SALAZAR Phone: 212-555-0150 Relationship to Patient: BROTHER Emergency Contact 2: First Name: JANE Last Name: DOE Phone: 650-555-0123 Relationship FRIEND to Patient: Did you feel fever or feverish lately? Yes No Are you having shortness of breath? Yes No Do you have a cough? Yes No Did you experience loss of taste or smell? Yes No Where you in contact with any confirmed COVID-19 positive patients? Yes No Did you travel in the past 14 days to any regions affected by COVID-19? Yes No ', metadata={'source': 'example_data/alejandro_rosalez_sample-small.jpeg', 'page': 1})]
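The rule stated in the two samples above (local files and HTTP(S) endpoints are single-page only, multi-page documents must be on S3) can be sketched as a small pre-flight check. The function names `source_kind` and `allows_multipage` are hypothetical helpers for this illustration, not part of the loader's API, and the loader may classify inputs differently internally.

```python
from urllib.parse import urlparse


def source_kind(path_or_url: str) -> str:
    """Classify an input as 's3', 'http', or 'local' by its URL scheme."""
    scheme = urlparse(path_or_url).scheme.lower()
    if scheme == "s3":
        return "s3"
    if scheme in ("http", "https"):
        return "http"
    # Anything else (no scheme, or a Windows drive letter) is treated as local.
    return "local"


def allows_multipage(path_or_url: str) -> bool:
    """Per the text above, only S3 locations may hold multi-page documents."""
    return source_kind(path_or_url) == "s3"
```

For example, `allows_multipage("example_data/alejandro_rosalez_sample-small.jpeg")` is `False`, so a multi-page PDF at a local path would first need to be uploaded to S3.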
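Sample 3
Processing a multi-page document requires the document to be on S3. The sample document resides in a bucket in us-east-2, and Textract must be called in that same region to succeed. We therefore set region_name on the client and pass the client to the loader, which ensures Textract is called from us-east-2. Alternatively, you can run your notebook in us-east-2, set AWS_DEFAULT_REGION to us-east-2, or, when running in a different environment, pass in a boto3 Textract client with that region name, as in the cell below.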
import boto3
textract_client = boto3.client("textract", region_name="us-east-2")
file_path = "s3://amazon-textract-public-content/langchain/layout-parser-paper.pdf"
loader = AmazonTextractPDFLoader(file_path, client=textract_client)
documents = loader.load()
Now get the number of pages to validate the response (printing the full response would be quite long...). We expect 16 pages.
len(documents)
16
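Beyond counting, the page check above could be extended to confirm that every expected page is present. The sketch below assumes that each returned document carries a 1-based `page` entry in its `metadata`, as the single-page outputs earlier suggest; `missing_pages` is a hypothetical helper written for this illustration.

```python
from typing import Any, Iterable, List


def missing_pages(documents: Iterable[Any], expected_pages: int) -> List[int]:
    """Return the page numbers in 1..expected_pages that no document reports."""
    # Assumes each document exposes a dict-like `metadata` with a 'page' key.
    seen = {doc.metadata.get("page") for doc in documents}
    return [page for page in range(1, expected_pages + 1) if page not in seen]
```

With the documents loaded above, `missing_pages(documents, 16)` would return an empty list if the assumption holds and all pages were processed.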
Sample 4
You can pass an additional parameter, linearization_config, to AmazonTextractPDFLoader. It determines how the parser linearizes the text output after Textract runs.
from langchain_community.document_loaders import AmazonTextractPDFLoader
from textractor.data.text_linearization_config import TextLinearizationConfig
loader = AmazonTextractPDFLoader(
"s3://amazon-textract-public-content/langchain/layout-parser-paper.pdf",
linearization_config=TextLinearizationConfig(
hide_header_layout=True,
hide_footer_layout=True,
hide_figure_layout=True,
),
)
documents = loader.load()
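Using the AmazonTextractPDFLoader in a LangChain chain (e.g. OpenAI)
The AmazonTextractPDFLoader can be used in a chain in the same way as the other loaders. Textract itself also has a Queries feature, which offers functionality similar to the QA chain in this sample and is worth checking out as well.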
# You can store your OPENAI_API_KEY in a .env file as well
# import os
# from dotenv import load_dotenv
# load_dotenv()
# Or set the OpenAI key in the environment directly
import os
os.environ["OPENAI_API_KEY"] = "your-OpenAI-API-key"
from langchain.chains.question_answering import load_qa_chain
from langchain_openai import OpenAI
chain = load_qa_chain(llm=OpenAI(), chain_type="map_reduce")
query = "Who are the authors?"
chain.run(input_documents=documents, question=query)
' The authors are Zejiang Shen, Ruochen Zhang, Melissa Dell, Benjamin Charles Germain Lee, Jacob Carlson, Weining Li, Gardner, M., Grus, J., Neumann, M., Tafjord, O., Dasigi, P., Liu, N., Peters, M., Schmitz, M., Zettlemoyer, L., Lukasz Garncarek, Powalski, R., Stanislawek, T., Topolski, B., Halama, P., Gralinski, F., Graves, A., Fernández, S., Gomez, F., Schmidhuber, J., Harley, A.W., Ufkes, A., Derpanis, K.G., He, K., Gkioxari, G., Dollár, P., Girshick, R., He, K., Zhang, X., Ren, S., Sun, J., Kay, A., Lamiroy, B., Lopresti, D., Mears, J., Jakeway, E., Ferriter, M., Adams, C., Yarasavage, N., Thomas, D., Zwaard, K., Li, M., Cui, L., Huang,'
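The answer above appears to mix the paper's own authors with names that look like they come from its reference list, and it is cut off mid-list. One possible way to narrow the question, sketched here under the assumption that each document carries a 1-based `page` entry in its `metadata` (as in the outputs shown earlier), is to pass only the leading pages to the chain. `pages_up_to` is a hypothetical helper, and whether it improves the answer has not been verified.

```python
from typing import Any, Iterable, List


def pages_up_to(documents: Iterable[Any], last_page: int = 1) -> List[Any]:
    """Keep only documents whose 'page' metadata lies between 1 and last_page."""
    return [
        doc
        for doc in documents
        if isinstance(doc.metadata.get("page"), int)
        and 1 <= doc.metadata["page"] <= last_page
    ]


# Possible usage with the chain defined above (requires the loaded documents):
# chain.run(input_documents=pages_up_to(documents, 1), question=query)
```

Restricting the input this way trades recall for focus: questions whose answers sit on later pages would need a larger `last_page` or the full document set.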