非结构化
这个笔记本介绍了如何使用Unstructured 文档加载器 加载多种类型文件。目前Unstructured支持加载文本文件、PowerPoint、HTML、PDF、图片等。
请参阅此指南以获取更多关于在本地设置Unstructured的说明,包括所需的系统依赖项配置。
概览
集成细节
| Class | 包 | 本地 | 序列化 | JS支持 |
|---|---|---|---|---|
| UnstructuredLoader | langchain_unstructured | ✅ | ❌ | ✅ |
加载器功能
| 来源 | 文档延迟加载 | 原生异步支持 |
|---|---|---|
| UnstructuredLoader | ✅ | ❌ |
设置
Credentials
默认情况下,langchain-unstructured安装一个较小的脚本规模,需要将分区逻辑卸载到Unstructured API中,这需要一个API密钥。如果您使用本地安装,则无需API密钥。要获取您的API密钥,请访问此网站并获取API密钥,然后在下方单元格中设置它:
import getpass
import os
if "UNSTRUCTURED_API_KEY" not in os.environ:
os.environ["UNSTRUCTURED_API_KEY"] = getpass.getpass(
"Enter your Unstructured API key: "
)
安装
正常安装
以下这些包是运行其余部分笔记本所必需的。
# Install package, compatible with API partitioning
%pip install --upgrade --quiet langchain-unstructured unstructured-client unstructured "unstructured[pdf]" python-magic
安装本地版本
如果您希望在本地运行分片逻辑,您将需要安装一些系统依赖项,如Unstructured 文档中所述。
例如,在Mac上,您可以使用以下命令安装所需的依赖项:
# base dependencies
brew install libmagic poppler tesseract
# If parsing xml / html documents:
brew install libxml2 libxslt
您可以安装所需的 pip 依赖项以进行本地安装:
pip install "langchain-unstructured[local]"
初始化
The UnstructuredLoader 允许从多种不同的文件类型加载。要详细了解 unstructured 包,请参阅其 文档/。在本示例中,我们展示了如何从文本文件和 PDF 文件加载数据。
from langchain_unstructured import UnstructuredLoader
file_paths = [
"./example_data/layout-parser-paper.pdf",
"./example_data/state_of_the_union.txt",
]
loader = UnstructuredLoader(file_paths)
加载
docs = loader.load()
docs[0]
INFO: pikepdf C++ to Python logger bridge initialized
Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 213.36), (16.34, 253.36), (36.34, 253.36), (36.34, 213.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'd3ce55f220dfb75891b4394a18bcb973'}, page_content='1 2 0 2')
print(docs[0].metadata)
{'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 213.36), (16.34, 253.36), (36.34, 253.36), (36.34, 213.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'd3ce55f220dfb75891b4394a18bcb973'}
懒加载
pages = []
for doc in loader.lazy_load():
pages.append(doc)
pages[0]
Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 213.36), (16.34, 253.36), (36.34, 253.36), (36.34, 213.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'd3ce55f220dfb75891b4394a18bcb973'}, page_content='1 2 0 2')
以下是翻译后的HTML内容:
如果在提取后需要对unstructured元素进行后处理,您可以在初始化UnstructuredLoader时通过post_processors参数传入一个函数列表str->str。这同样适用于其他未结构化加载器。下面是一个示例。
from langchain_unstructured import UnstructuredLoader
from unstructured.cleaners.core import clean_extra_whitespace
loader = UnstructuredLoader(
"./example_data/layout-parser-paper.pdf",
post_processors=[clean_extra_whitespace],
)
docs = loader.load()
docs[5:10]
[Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 393.9), (16.34, 560.0), (36.34, 560.0), (36.34, 393.9)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'parent_id': '89565df026a24279aaea20dc08cedbec', 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'e9fa370aef7ee5c05744eb7bb7d9981b'}, page_content='2 v 8 4 3 5 1 . 3 0 1 2 : v i X r a'),
Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((157.62199999999999, 114.23496279999995), (157.62199999999999, 146.5141628), (457.7358962799999, 146.5141628), (457.7358962799999, 114.23496279999995)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'Title', 'element_id': 'bde0b230a1aa488e3ce837d33015181b'}, page_content='LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis'),
Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((134.809, 168.64029940800003), (134.809, 192.2517444), (480.5464199080001, 192.2517444), (480.5464199080001, 168.64029940800003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': '54700f902899f0c8c90488fa8d825bce'}, page_content='Zejiang Shen1 ((cid:0)), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain Lee4, Jacob Carlson3, and Weining Li5'),
Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((207.23000000000002, 202.57205439999996), (207.23000000000002, 311.8195408), (408.12676, 311.8195408), (408.12676, 202.57205439999996)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'b650f5867bad9bb4e30384282c79bcfe'}, page_content='1 Allen Institute for AI shannons@allenai.org 2 Brown University ruochen zhang@brown.edu 3 Harvard University {melissadell,jacob carlson}@fas.harvard.edu 4 University of Washington bcgl@cs.washington.edu 5 University of Waterloo w422li@uwaterloo.ca'),
Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((162.779, 338.45008160000003), (162.779, 566.8455408), (454.0372021523199, 566.8455408), (454.0372021523199, 338.45008160000003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'links': [{'text': ':// layout - parser . github . io', 'url': 'https://layout-parser.github.io', 'start_index': 1477}], 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'NarrativeText', 'element_id': 'cfc957c94fe63c8fd7c7f4bcb56e75a7'}, page_content='Abstract. Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model configurations complicate the easy reuse of im- portant innovations by a wide audience. Though there have been on-going efforts to improve reusability and simplify deep learning (DL) model development in disciplines like natural language processing and computer vision, none of them are optimized for challenges in the domain of DIA. This represents a major gap in the existing toolkit, as DIA is central to academic research across a wide range of disciplines in the social sciences and humanities. This paper introduces LayoutParser, an open-source library for streamlining the usage of DL in DIA research and applica- tions. The core LayoutParser library comes with a set of simple and intuitive interfaces for applying and customizing DL models for layout de- tection, character recognition, and many other document processing tasks. To promote extensibility, LayoutParser also incorporates a community platform for sharing both pre-trained models and full document digiti- zation pipelines. We demonstrate that LayoutParser is helpful for both lightweight and large-scale digitization pipelines in real-word use cases. The library is publicly available at https://layout-parser.github.io.')]
未结构化API
如果您想使用较小的包并获得最新的分片功能,可以pip install unstructured-client和pip install langchain-unstructured。有关UnstructuredLoader的更多信息,请参阅
未结构化提供者页面。
当您传入api_key并设置partition_via_api=True时,加载器将使用托管的Unstructured无服务器API处理您的文档。您可以在这里生成免费的Unstructured API密钥。
检查一下这里 如果您想要自行托管Unstructured API或在本地运行它。
from langchain_unstructured import UnstructuredLoader
loader = UnstructuredLoader(
file_path="example_data/fake.docx",
api_key=os.getenv("UNSTRUCTURED_API_KEY"),
partition_via_api=True,
)
docs = loader.load()
docs[0]
INFO: Preparing to split document for partition.
INFO: Given file doesn't have '.pdf' extension, so splitting is not enabled.
INFO: Partitioning without split.
INFO: Successfully partitioned the document.
Document(metadata={'source': 'example_data/fake.docx', 'category_depth': 0, 'filename': 'fake.docx', 'languages': ['por', 'cat'], 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category': 'Title', 'element_id': '56d531394823d81787d77a04462ed096'}, page_content='Lorem ipsum dolor sit amet.')
您可以通过 Unstructured API 使用单个 API 批量处理多个文件,使用 UnstructuredLoader。
loader = UnstructuredLoader(
file_path=["example_data/fake.docx", "example_data/fake-email.eml"],
api_key=os.getenv("UNSTRUCTURED_API_KEY"),
partition_via_api=True,
)
docs = loader.load()
print(docs[0].metadata["filename"], ": ", docs[0].page_content[:100])
print(docs[-1].metadata["filename"], ": ", docs[-1].page_content[:100])
INFO: Preparing to split document for partition.
INFO: Given file doesn't have '.pdf' extension, so splitting is not enabled.
INFO: Partitioning without split.
INFO: Successfully partitioned the document.
INFO: Preparing to split document for partition.
INFO: Given file doesn't have '.pdf' extension, so splitting is not enabled.
INFO: Partitioning without split.
INFO: Successfully partitioned the document.
``````output
fake.docx : Lorem ipsum dolor sit amet.
fake-email.eml : Violets are blue
未结构化SDK客户端
使用无结构API进行分区依赖于未结构化SDK客户端。
如果您希望自定义客户端,您需要向UnstructuredLoader传递一个UnstructuredClient实例。以下是一个示例,展示了如何自定义客户端的功能,例如使用自己的requests.Session()、传递替代的server_url以及自定义RetryConfig对象。有关自定义客户端或 SDK 客户端接受其他哪些额外参数的更多信息,请参阅Unstructured Python SDK文档和API 参数文档中的客户端部分。请注意,所有 API 参数都应传递给UnstructuredLoader。
import requests
from langchain_unstructured import UnstructuredLoader
from unstructured_client import UnstructuredClient
from unstructured_client.utils import BackoffStrategy, RetryConfig
client = UnstructuredClient(
api_key_auth=os.getenv(
"UNSTRUCTURED_API_KEY"
), # Note: the client API param is "api_key_auth" instead of "api_key"
client=requests.Session(), # Define your own requests session
server_url="https://api.unstructuredapp.io/general/v0/general", # Define your own api url
retry_config=RetryConfig(
strategy="backoff",
retry_connection_errors=True,
backoff=BackoffStrategy(
initial_interval=500,
max_interval=60000,
exponent=1.5,
max_elapsed_time=900000,
),
), # Define your own retry config
)
loader = UnstructuredLoader(
"./example_data/layout-parser-paper.pdf",
partition_via_api=True,
client=client,
split_pdf_page=True,
split_pdf_page_range=[1, 10],
)
docs = loader.load()
print(docs[0].metadata["filename"], ": ", docs[0].page_content[:100])
INFO: Preparing to split document for partition.
INFO: Concurrency level set to 5
INFO: Splitting pages 1 to 10 (10 total)
INFO: Determined optimal split size of 2 pages.
INFO: Partitioning 5 files with 2 page(s) each.
INFO: Partitioning set #1 (pages 1-2).
INFO: Partitioning set #2 (pages 3-4).
INFO: Partitioning set #3 (pages 5-6).
INFO: Partitioning set #4 (pages 7-8).
INFO: Partitioning set #5 (pages 9-10).
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: Successfully partitioned set #1, elements added to the final result.
INFO: Successfully partitioned set #2, elements added to the final result.
INFO: Successfully partitioned set #3, elements added to the final result.
INFO: Successfully partitioned set #4, elements added to the final result.
INFO: Successfully partitioned set #5, elements added to the final result.
INFO: Successfully partitioned the document.
``````output
layout-parser-paper.pdf : LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis
Chunking
The UnstructuredLoader 不支持将 mode 作为参数进行文本分组,就像较老的加载器 UnstructuredFileLoader 和其他加载器所做的那样。它取而代之的是支持“切块”。在未结构化的上下文中,切块与您可能熟悉的基于纯文本特征(如 '
' 或 ' ',这些可能是段落边界或列表项边界)形成的切块机制不同。相反,所有文档都会根据每个文档格式的具体知识被分割为语义单位(文档元素),我们只需在单个元素超过所需的最大切块大小时才需要进行文本分割。总体而言,切块会将连续的元素结合在一起形成尽可能大的切块,而不超出最大切块大小。切块会产生由 CompositeElement、Table 或 TableChunk 元素组成的序列。每个“切块”都是这三种类型之一的实例。
有关分块选项的更多详细信息,请参见此 页面,但要复制与 mode="single" 相同的行为,您可以设置 chunking_strategy="basic"、max_characters=<some-really-big-number> 和 include_orig_elements=False。
from langchain_unstructured import UnstructuredLoader
loader = UnstructuredLoader(
"./example_data/layout-parser-paper.pdf",
chunking_strategy="basic",
max_characters=1000000,
include_orig_elements=False,
)
docs = loader.load()
print("Number of LangChain documents:", len(docs))
print("Length of text in the document:", len(docs[0].page_content))
Number of LangChain documents: 1
Length of text in the document: 42772
加载网页
UnstructuredLoader 接受一个 web_url 关键字参数,当在本地运行时,该参数会填充底层 Unstructured 分区 的 url 参数。这允许解析远程托管的文档,例如 HTML 网页。
示例用法:
from langchain_unstructured import UnstructuredLoader
loader = UnstructuredLoader(web_url="https://www.example.com")
docs = loader.load()
for doc in docs:
print(f"{doc}\n")
page_content='Example Domain' metadata={'category_depth': 0, 'languages': ['eng'], 'filetype': 'text/html', 'url': 'https://www.example.com', 'category': 'Title', 'element_id': 'fdaa78d856f9d143aeeed85bf23f58f8'}
page_content='This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.' metadata={'languages': ['eng'], 'parent_id': 'fdaa78d856f9d143aeeed85bf23f58f8', 'filetype': 'text/html', 'url': 'https://www.example.com', 'category': 'NarrativeText', 'element_id': '3652b8458b0688639f973fe36253c992'}
page_content='More information...' metadata={'category_depth': 0, 'link_texts': ['More information...'], 'link_urls': ['https://www.iana.org/domains/example'], 'languages': ['eng'], 'filetype': 'text/html', 'url': 'https://www.example.com', 'category': 'Title', 'element_id': '793ab98565d6f6d6f3a6d614e3ace2a9'}
API 参考
有关所有 UnstructuredLoader 功能和配置的详细文档,请访问 API 参考: https://python.langchain.com/api_reference/unstructured/document_loaders/langchain_unstructured.document_loaders.UnstructuredLoader.html