Dedoc

This example demonstrates how to use Dedoc with LangChain as a DocumentLoader.

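Overview

Dedoc is an open-source library/service that extracts text, tables, attached files, and document structure (e.g., titles, list items) from files of various formats.

Dedoc supports DOCX, XLSX, PPTX, EML, HTML, PDF, images, and more. The full list of supported formats can be found in the Dedoc documentation.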

Integration details

Class	Package	Local	Serializable	JS support
DedocFileLoader	langchain_community	❌	beta	❌
DedocPDFLoader	langchain_community	❌	beta	❌
DedocAPIFileLoader	langchain_community	❌	beta	❌

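Loader features

Methods for lazy loading and async loading are available, but in practice document loading is executed synchronously.

Source	Document Lazy Loading	Async Support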
DedocFileLoader	❌	❌
DedocPDFLoader	❌	❌
DedocAPIFileLoader	❌	❌
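
Because loading runs synchronously, calling a loader from asynchronous code would block the event loop while a file is parsed. One possible workaround, sketched here with a hypothetical stand-in (`SlowLoader` is not part of Dedoc or LangChain), is to move the blocking `load()` call to a worker thread with `asyncio.to_thread`.

```python
import asyncio
import time


class SlowLoader:
    """Hypothetical stand-in for a loader whose load() blocks while parsing."""

    def load(self):
        time.sleep(0.05)  # simulates synchronous parsing work
        return ["first document", "second document"]


async def load_without_blocking(loader):
    # Run the blocking call in a worker thread so the event loop stays free.
    return await asyncio.to_thread(loader.load)


docs = asyncio.run(load_without_blocking(SlowLoader()))
print(len(docs))
```

The same wrapper should work with any object that exposes a blocking `load()` method; whether it is worth the extra thread depends on how long parsing takes.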

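Setup

To access the DedocFileLoader and DedocPDFLoader document loaders, you need to install the dedoc integration package.
To access DedocAPIFileLoader, you need to run the Dedoc service, e.g. as a Docker container (see the documentation for more details):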

docker pull dedocproject/dedoc
docker run -p 1231:1231 dedocproject/dedoc
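
Before pointing a loader at the service, it can help to confirm that something is listening on the published port. The sketch below is a plain TCP check using only the standard library; the helper name `service_is_reachable` is illustrative, the host and port are assumptions taken from the `-p 1231:1231` mapping above, and the check does not verify that the listener is actually Dedoc.

```python
import socket


def service_is_reachable(host="localhost", port=1231, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


if __name__ == "__main__":
    print(service_is_reachable())
```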

Dedoc installation instructions are given in the Dedoc documentation.

# Install package
%pip install --quiet "dedoc[torch]"

Note: you may need to restart the kernel to use updated packages.

Instantiation

from langchain_community.document_loaders import DedocFileLoader

loader = DedocFileLoader("./example_data/state_of_the_union.txt")

API Reference: DedocFileLoader

Load

docs = loader.load()
docs[0].page_content[:100]

'\nMadam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and t'

Lazy load

docs = loader.lazy_load()

for doc in docs:
    print(doc.page_content[:100])
    break


Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and t
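
Since `lazy_load()` hands back an iterator, documents can be consumed a few at a time instead of being collected into one list. The helper below is a generic sketch (the name `in_batches` is made up for illustration) that groups any iterable into fixed-size batches; it is shown with plain strings rather than real documents.

```python
from itertools import islice


def in_batches(documents, size):
    """Yield lists of up to `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(documents)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Strings stand in for Document objects here.
for batch in in_batches(iter(["a", "b", "c", "d", "e"]), 2):
    print(batch)
```

With a real loader, the first argument would be the iterator returned by `loader.lazy_load()`.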

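API reference

For detailed information on configuring and calling Dedoc loaders, see the API reference for DedocFileLoader, DedocPDFLoader, and DedocAPIFileLoader.

Loading any file

DedocFileLoader is useful for automatically handling any file in a supported format. The loader detects the file type automatically, provided the file has the correct extension.

The file parsing process can be configured through dedoc_kwargs when the DedocFileLoader class is initialized. Basic examples of some options are given here; see the DedocFileLoader documentation and the dedoc documentation for more details about configuration parameters.

Basic example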

from langchain_community.document_loaders import DedocFileLoader

loader = DedocFileLoader("./example_data/state_of_the_union.txt")

docs = loader.load()

docs[0].page_content[:400]

API Reference: DedocFileLoader

'\nMadam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\n\n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\n\n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\n\n\nWith a duty to one another to the American people to '

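Modes of split

DedocFileLoader supports several ways of splitting a document into parts (each part is returned separately). The split parameter selects the mode and accepts the following options:

document (default value): the document text is returned as a single LangChain Document object (no splitting);
page: split the document text into pages (works for PDF, DJVU, PPTX, PPT, ODP);
node: split the document text into Dedoc tree nodes (title nodes, list item nodes, raw text nodes);
line: split the document text into text lines.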

loader = DedocFileLoader(
    "./example_data/layout-parser-paper.pdf",
    split="page",
    pages=":2",
)

docs = loader.load()

len(docs)
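
When loader options come from configuration or user input, it may be convenient to validate the split mode before constructing a loader. The sketch below does this with a hypothetical helper, `loader_options`, which is not part of LangChain or Dedoc; it only checks the value against the four modes listed above and returns a dictionary of keyword arguments.

```python
SPLIT_MODES = ("document", "page", "node", "line")


def loader_options(split="document", **extra):
    """Return keyword arguments for a loader after checking the split mode."""
    if split not in SPLIT_MODES:
        raise ValueError(
            f"unknown split mode {split!r}; expected one of {SPLIT_MODES}"
        )
    return {"split": split, **extra}


options = loader_options(split="page", pages=":2")
print(options)
```

The resulting dictionary could then be unpacked into the constructor, e.g. `DedocFileLoader(path, **options)`.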

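Handling tables

DedocFileLoader handles tables when the with_tables parameter is set to True during loader initialization (with_tables=True by default).

Tables are not split: each table corresponds to one LangChain Document object. For tables, the Document object has additional metadata fields: type="table", and text_as_html, which holds the table's HTML representation.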

loader = DedocFileLoader("./example_data/mlb_teams_2012.csv")

docs = loader.load()

docs[1].metadata["type"], docs[1].metadata["text_as_html"][:200]

('table',
 '<table border="1" style="border-collapse: collapse; width: 100%;">\n<tbody>\n<tr>\n<td colspan="1" rowspan="1">Team</td>\n<td colspan="1" rowspan="1"> &quot;Payroll (millions)&quot;</td>\n<td colspan="1" r')
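
The `text_as_html` value is ordinary HTML, so it can be turned back into rows of cell text with the standard library. The following is a minimal sketch using `html.parser`; the `TableRows` class and `table_rows` helper are illustrative names, the sample markup is made up to resemble the output above, and nested tables or cell spans are not handled.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text pieces of the open cell, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            if not self.rows:  # tolerate a cell that appears before any <tr>
                self.rows.append([])
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


def table_rows(html):
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample in the same shape as a text_as_html value.
sample = (
    '<table border="1">\n<tbody>\n<tr>\n<td colspan="1">Team</td>\n'
    "<td> &quot;Payroll (millions)&quot;</td>\n</tr>\n"
    "<tr>\n<td>Team A</td>\n<td>1.0</td>\n</tr>\n</tbody>\n</table>"
)
print(table_rows(sample))
```

With a loaded table document, the input would be `docs[i].metadata["text_as_html"]`.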

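Handling attached files

DedocFileLoader handles attached files when with_attachments is set to True during loader initialization (with_attachments=False by default).

Attachments are split according to the split parameter. For attachments, the LangChain Document object has an additional metadata field type="attachment".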

loader = DedocFileLoader(
    "./example_data/fake-email-attachment.eml",
    with_attachments=True,
)

docs = loader.load()

docs[1].metadata["type"], docs[1].page_content

('attachment',
 '\nContent-Type\nmultipart/mixed; boundary="0000000000005d654405f082adb7"\nDate\nFri, 23 Dec 2022 12:08:48 -0600\nFrom\nMallori Harrell <mallori@unstructured.io>\nMIME-Version\n1.0\nMessage-ID\n<CAPgNNXSzLVJ-d1OCX_TjFgJU7ugtQrjFybPtAMmmYZzphxNFYg@mail.gmail.com>\nSubject\nFake email with attachment\nTo\nMallori Harrell <mallori@unstructured.io>')
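
Because tables and attachments are marked through the `type` metadata field, the loaded documents can be sorted into groups by that field. The sketch below uses `SimpleNamespace` objects as stand-ins for LangChain `Document` objects; the helper name `group_by_type` and the `"untyped"` label for documents without a `type` field are assumptions made for illustration.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_type(docs, default="untyped"):
    """Group document-like objects by metadata['type']."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc.metadata.get("type", default)].append(doc)
    return dict(groups)


# Stand-ins for Document objects: only .metadata and .page_content are used.
docs = [
    SimpleNamespace(page_content="body text", metadata={}),
    SimpleNamespace(page_content="header dump", metadata={"type": "attachment"}),
    SimpleNamespace(page_content="cells", metadata={"type": "table"}),
]

groups = group_by_type(docs)
print({kind: len(items) for kind, items in groups.items()})
```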

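Loading PDF file

If you only need to process PDF documents, you can use DedocPDFLoader, which supports PDF only. The loader accepts the same parameters for document splitting and for table and attachment extraction.

Dedoc can extract text from PDFs with or without a text layer, and it can automatically detect whether a text layer is present and correct. Several PDF handlers are available; use the pdf_with_text_layer parameter to choose one of them. See the parameter descriptions for more details.

For PDFs without a text layer, Tesseract OCR and its language packages must be installed. In this case, the corresponding Dedoc instruction may be useful.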

from langchain_community.document_loaders import DedocPDFLoader

loader = DedocPDFLoader(
    "./example_data/layout-parser-paper.pdf", pdf_with_text_layer="true", pages="2:2"
)

docs = loader.load()

docs[0].page_content[:400]

API Reference: DedocPDFLoader

'\n2\n\nZ. Shen et al.\n\n37], layout detection [38, 22], table detection [26], and scene text detection [4].\n\nA generalized learning-based framework dramatically reduces the need for the\n\nmanual speciﬁcation of complicated rules, which is the status quo with traditional\n\nmethods. DL has the potential to transform DIA pipelines and beneﬁt a broad\n\nspectrum of large-scale document digitization projects.\n'
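
Since PDFs without a text layer depend on Tesseract OCR and its language packs, a quick environment check can save a failed run. The sketch below looks for a `tesseract` executable on the PATH and, if one is found, asks it for its installed languages with `tesseract --list-langs`. The helper name `tesseract_languages` is illustrative, and the assumption that a plain PATH lookup matches how Dedoc locates Tesseract is not confirmed by this page.

```python
import shutil
import subprocess


def tesseract_languages():
    """Return installed Tesseract language codes, or None if not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "--list-langs"], capture_output=True, text=True, check=False
    )
    # Some versions print the list to stderr instead of stdout.
    output = result.stdout or result.stderr
    return [
        line.strip()
        for line in output.splitlines()
        if line.strip() and not line.startswith("List of")
    ]


langs = tesseract_languages()
print("Tesseract not found" if langs is None else langs)
```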

Dedoc API

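If you want to get up and running with less setup, you can use Dedoc as a service. DedocAPIFileLoader can be used without installing the dedoc library. The loader supports the same parameters as DedocFileLoader and also detects input file types automatically.

To use DedocAPIFileLoader, you should run the Dedoc service, e.g. as a Docker container (see the documentation for more details):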

docker pull dedocproject/dedoc
docker run -p 1231:1231 dedocproject/dedoc

Please do not use our demo URL https://dedoc-readme.hf.space in your code.

from langchain_community.document_loaders import DedocAPIFileLoader

loader = DedocAPIFileLoader(
    "./example_data/state_of_the_union.txt",
    # Demo URL, shown for illustration only; point url at your own Dedoc service.
    url="https://dedoc-readme.hf.space",
)

docs = loader.load()

docs[0].page_content[:400]

API Reference: DedocAPIFileLoader

'\nMadam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\n\n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\n\n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\n\n\nWith a duty to one another to the American people to '
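
To keep the demo URL out of real code, the service address can be read from configuration instead of being hard-coded. The sketch below reads it from an environment variable and falls back to a locally running container. The variable name `DEDOC_URL`, the helper `dedoc_url`, and the fallback address `http://localhost:1231` (based on the `-p 1231:1231` port mapping shown earlier) are assumptions for illustration, not names defined by Dedoc or LangChain.

```python
import os

# Assumed address of a Dedoc container started with `-p 1231:1231`.
LOCAL_DEDOC_URL = "http://localhost:1231"


def dedoc_url(environ=os.environ):
    """Return the service URL from DEDOC_URL (hypothetical), else the local one."""
    return environ.get("DEDOC_URL", LOCAL_DEDOC_URL).rstrip("/")


# An empty mapping simulates an environment where DEDOC_URL is not set.
print(dedoc_url({}))
```

The result would then be passed as the `url` argument, e.g. `DedocAPIFileLoader(path, url=dedoc_url())`.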

Document loader conceptual guide
Document loader how-to guides
