Skip to main content
Open In Colab在 GitHub 上打开

Amazon Textract

Amazon Textract 是一种机器学习 (ML) 服务,可自动从扫描的文档中提取文本、手写内容和数据。

它超越了简单的光学字符识别 (OCR),还可以从表单和表格中识别、理解和提取数据。如今,许多公司从扫描的文档(如 PDF、图像、表格和表单)中手动提取数据,或通过需要手动配置的简单 OCR 软件(通常必须在表单更改时更新)提取数据。为了克服这些手动和昂贵的流程,Textract使用 ML 读取和处理任何类型的文档,无需手动作即可准确提取文本、手写内容、表格和其他数据。

此示例演示了Amazon Textract与 LangChain 结合使用,作为 DocumentLoader。

Textract支持PDF,TIFF,PNGJPEG格式。

Textract支持这些文档大小、语言和字符

%pip install --upgrade --quiet  boto3 langchain-openai tiktoken python-dotenv
%pip install --upgrade --quiet  "amazon-textract-caller>=0.2.0"

示例 1

第一个示例使用本地文件,该文件将在内部发送到 Amazon Textract 同步 API DetectDocumentText

本地文件或 URL 端点(如 HTTP://)仅限于 Textract 的单页文档。 多页文档必须驻留在 S3 上。此示例文件为 jpeg。

from langchain_community.document_loaders import AmazonTextractPDFLoader

loader = AmazonTextractPDFLoader("example_data/alejandro_rosalez_sample-small.jpeg")
documents = loader.load()

文件的输出

documents
[Document(page_content='Patient Information First Name: ALEJANDRO Last Name: ROSALEZ Date of Birth: 10/10/1982 Sex: M Marital Status: MARRIED Email Address: Address: 123 ANY STREET City: ANYTOWN State: CA Zip Code: 12345 Phone: 646-555-0111 Emergency Contact 1: First Name: CARLOS Last Name: SALAZAR Phone: 212-555-0150 Relationship to Patient: BROTHER Emergency Contact 2: First Name: JANE Last Name: DOE Phone: 650-555-0123 Relationship FRIEND to Patient: Did you feel fever or feverish lately? Yes No Are you having shortness of breath? Yes No Do you have a cough? Yes No Did you experience loss of taste or smell? Yes No Where you in contact with any confirmed COVID-19 positive patients? Yes No Did you travel in the past 14 days to any regions affected by COVID-19? Yes No Patient Information First Name: ALEJANDRO Last Name: ROSALEZ Date of Birth: 10/10/1982 Sex: M Marital Status: MARRIED Email Address: Address: 123 ANY STREET City: ANYTOWN State: CA Zip Code: 12345 Phone: 646-555-0111 Emergency Contact 1: First Name: CARLOS Last Name: SALAZAR Phone: 212-555-0150 Relationship to Patient: BROTHER Emergency Contact 2: First Name: JANE Last Name: DOE Phone: 650-555-0123 Relationship FRIEND to Patient: Did you feel fever or feverish lately? Yes No Are you having shortness of breath? Yes No Do you have a cough? Yes No Did you experience loss of taste or smell? Yes No Where you in contact with any confirmed COVID-19 positive patients? Yes No Did you travel in the past 14 days to any regions affected by COVID-19? Yes No ', metadata={'source': 'example_data/alejandro_rosalez_sample-small.jpeg', 'page': 1})]

示例 2

下一个示例从 HTTPS 终端节点加载文件。 它必须是单页的,因为 Amazon Textract 要求所有多页文档都存储在 S3 上。

from langchain_community.document_loaders import AmazonTextractPDFLoader

loader = AmazonTextractPDFLoader(
"https://amazon-textract-public-content.s3.us-east-2.amazonaws.com/langchain/alejandro_rosalez_sample_1.jpg"
)
documents = loader.load()
documents
[Document(page_content='Patient Information First Name: ALEJANDRO Last Name: ROSALEZ Date of Birth: 10/10/1982 Sex: M Marital Status: MARRIED Email Address: Address: 123 ANY STREET City: ANYTOWN State: CA Zip Code: 12345 Phone: 646-555-0111 Emergency Contact 1: First Name: CARLOS Last Name: SALAZAR Phone: 212-555-0150 Relationship to Patient: BROTHER Emergency Contact 2: First Name: JANE Last Name: DOE Phone: 650-555-0123 Relationship FRIEND to Patient: Did you feel fever or feverish lately? Yes No Are you having shortness of breath? Yes No Do you have a cough? Yes No Did you experience loss of taste or smell? Yes No Where you in contact with any confirmed COVID-19 positive patients? Yes No Did you travel in the past 14 days to any regions affected by COVID-19? Yes No Patient Information First Name: ALEJANDRO Last Name: ROSALEZ Date of Birth: 10/10/1982 Sex: M Marital Status: MARRIED Email Address: Address: 123 ANY STREET City: ANYTOWN State: CA Zip Code: 12345 Phone: 646-555-0111 Emergency Contact 1: First Name: CARLOS Last Name: SALAZAR Phone: 212-555-0150 Relationship to Patient: BROTHER Emergency Contact 2: First Name: JANE Last Name: DOE Phone: 650-555-0123 Relationship FRIEND to Patient: Did you feel fever or feverish lately? Yes No Are you having shortness of breath? Yes No Do you have a cough? Yes No Did you experience loss of taste or smell? Yes No Where you in contact with any confirmed COVID-19 positive patients? Yes No Did you travel in the past 14 days to any regions affected by COVID-19? Yes No ', metadata={'source': 'example_data/alejandro_rosalez_sample-small.jpeg', 'page': 1})]

示例 3

处理多页文档要求文档位于 S3 上。示例文档位于 us-east-2 的存储桶中,需要在同一区域中调用 Textract 才能成功,因此我们在客户端上设置region_name并将其传递给加载程序,以确保从 us-east-2 调用 Textract。您还可以让笔记本在 us-east-2 中运行,将AWS_DEFAULT_REGION设置为 us-east-2,或者在其他环境中运行时,传入具有该区域名称的 boto3 Textract 客户端,如下面的单元格所示。

import boto3

textract_client = boto3.client("textract", region_name="us-east-2")

file_path = "s3://amazon-textract-public-content/langchain/layout-parser-paper.pdf"
loader = AmazonTextractPDFLoader(file_path, client=textract_client)
documents = loader.load()

现在获取验证响应的页数(打印出完整的响应会很长......我们预计有 16 页。

len(documents)
16

示例 4

您可以选择传递一个名为linearization_config到 AmazonTextractPDFLoader,它将确定解析器在 Textract 运行后如何对文本输出进行线性化。

from langchain_community.document_loaders import AmazonTextractPDFLoader
from textractor.data.text_linearization_config import TextLinearizationConfig

loader = AmazonTextractPDFLoader(
"s3://amazon-textract-public-content/langchain/layout-parser-paper.pdf",
linearization_config=TextLinearizationConfig(
hide_header_layout=True,
hide_footer_layout=True,
hide_figure_layout=True,
),
)
documents = loader.load()

在 LangChain 链中使用 AmazonTextractPDFLoader(例如 OpenAI)

AmazonTextractPDFLoader 可以在链中使用,其使用方式与其他加载程序相同。 Textract 本身确实具有 Query 功能,该功能提供与此示例中的 QA 链类似的功能,也值得一试。

# You can store your OPENAI_API_KEY in a .env file as well
# import os
# from dotenv import load_dotenv

# load_dotenv()
# Or set the OpenAI key in the environment directly
import os

os.environ["OPENAI_API_KEY"] = "your-OpenAI-API-key"
from langchain.chains.question_answering import load_qa_chain
from langchain_openai import OpenAI

chain = load_qa_chain(llm=OpenAI(), chain_type="map_reduce")
query = ["Who are the autors?"]

chain.run(input_documents=documents, question=query)
API 参考:load_qa_chain | OpenAI
' The authors are Zejiang Shen, Ruochen Zhang, Melissa Dell, Benjamin Charles Germain Lee, Jacob Carlson, Weining Li, Gardner, M., Grus, J., Neumann, M., Tafjord, O., Dasigi, P., Liu, N., Peters, M., Schmitz, M., Zettlemoyer, L., Lukasz Garncarek, Powalski, R., Stanislawek, T., Topolski, B., Halama, P., Gralinski, F., Graves, A., Fernández, S., Gomez, F., Schmidhuber, J., Harley, A.W., Ufkes, A., Derpanis, K.G., He, K., Gkioxari, G., Dollár, P., Girshick, R., He, K., Zhang, X., Ren, S., Sun, J., Kay, A., Lamiroy, B., Lopresti, D., Mears, J., Jakeway, E., Ferriter, M., Adams, C., Yarasavage, N., Thomas, D., Zwaard, K., Li, M., Cui, L., Huang,'