How to load PDFs
Portable Document Format (PDF), standardized as ISO 32000, is a file format developed by Adobe in 1992 to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems.
This guide covers how to load PDF documents into the LangChain Document format that we use downstream.
Text in PDFs is typically represented via text boxes. They may also contain images. A PDF parser might do some combination of the following:
- Agglomerate text boxes into lines, paragraphs, and other structures via heuristics or ML inference;
- Run OCR on images to detect text therein;
- Classify text as belonging to paragraphs, lists, tables, or other structures;
- Structure text into table rows and columns, or key-value pairs.
LangChain integrates with a host of PDF parsers. Some are simple and relatively low-level; others will support OCR and image-processing, or perform advanced document layout analysis. The right choice will depend on your needs. Below we enumerate the possibilities.
We will demonstrate these approaches on a sample file:
file_path = (
    "../../docs/integrations/document_loaders/example_data/layout-parser-paper.pdf"
)
Many modern LLMs support inference over multimodal inputs (e.g., images). In some applications -- such as question-answering over PDFs with complex layouts, diagrams, or scans -- it may be advantageous to skip the PDF parsing, instead casting a PDF page to an image and passing it to a model directly. We demonstrate an example of this in the Use of multimodal models section below.
Simple and fast text extraction
If you are looking for a simple string representation of the text embedded in a PDF, the method below is appropriate. It will return a list of Document objects -- one per page -- containing the page's text in the Document's page_content attribute. It will not parse text in images or scanned PDF pages. Under the hood it uses the pypdf Python library.
LangChain document loaders implement lazy_load and its async variant, alazy_load, which return iterators of Document objects. We will use these below.
%pip install -qU pypdf
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader(file_path)
pages = []
async for page in loader.alazy_load():
    pages.append(page)
print(f"{pages[0].metadata}\n")
print(pages[0].page_content)
{'source': '../../docs/integrations/document_loaders/example_data/layout-parser-paper.pdf', 'page': 0}
LayoutParser : A Unified Toolkit for Deep
Learning Based Document Image Analysis
Zejiang Shen1( �), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain
Lee4, Jacob Carlson3, and Weining Li5
1Allen Institute for AI
shannons@allenai.org
2Brown University
ruochen zhang@brown.edu
3Harvard University
{melissadell,jacob carlson }@fas.harvard.edu
4University of Washington
bcgl@cs.washington.edu
5University of Waterloo
w422li@uwaterloo.ca
Abstract. Recent advances in document image analysis (DIA) have been
primarily driven by the application of neural networks. Ideally, research
outcomes could be easily deployed in production and extended for further
investigation. However, various factors like loosely organized codebases
and sophisticated model configurations complicate the easy reuse of im-
portant innovations by a wide audience. Though there have been on-going
efforts to improve reusability and simplify deep learning (DL) model
development in disciplines like natural language processing and computer
vision, none of them are optimized for challenges in the domain of DIA.
This represents a major gap in the existing toolkit, as DIA is central to
academic research across a wide range of disciplines in the social sciences
and humanities. This paper introduces LayoutParser , an open-source
library for streamlining the usage of DL in DIA research and applica-
tions. The core LayoutParser library comes with a set of simple and
intuitive interfaces for applying and customizing DL models for layout de-
tection, character recognition, and many other document processing tasks.
To promote extensibility, LayoutParser also incorporates a community
platform for sharing both pre-trained models and full document digiti-
zation pipelines. We demonstrate that LayoutParser is helpful for both
lightweight and large-scale digitization pipelines in real-word use cases.
The library is publicly available at https://layout-parser.github.io .
Keywords: Document Image Analysis ·Deep Learning ·Layout Analysis
·Character Recognition ·Open Source library ·Toolkit.
1 Introduction
Deep Learning(DL)-based approaches are the state-of-the-art for a wide range of
document image analysis (DIA) tasks including document image classification [ 11,arXiv:2103.15348v2 [cs.CV] 21 Jun 2021
Note that the metadata of each document stores the corresponding page number.
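This makes it easy to pull out individual pages. As a minimal sketch (assuming pages was populated as above; note that the page key is zero-indexed):
# Retrieve the Document for the second page via its (zero-indexed) metadata.
second_page = next(doc for doc in pages if doc.metadata["page"] == 1)
print(second_page.page_content[:100])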
Vector search over PDFs
Once we have loaded PDFs into LangChain Document objects, we can index them (e.g., for a RAG application) in the usual way. Below we use OpenAI embeddings, although any LangChain embeddings model will suffice.
%pip install -qU langchain-openai
import getpass
import os
if "OPENAI_API_KEY" not in os.environ:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings
vector_store = InMemoryVectorStore.from_documents(pages, OpenAIEmbeddings())
docs = vector_store.similarity_search("What is LayoutParser?", k=2)
for doc in docs:
    print(f'Page {doc.metadata["page"]}: {doc.page_content[:300]}\n')
Page 13: 14 Z. Shen et al.
6 Conclusion
LayoutParser provides a comprehensive toolkit for deep learning-based document
image analysis. The off-the-shelf library is easy to install, and can be used to
build flexible and accurate pipelines for processing documents with complicated
structures. It also supports hi
Page 0: LayoutParser : A Unified Toolkit for Deep
Learning Based Document Image Analysis
Zejiang Shen1( �), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain
Lee4, Jacob Carlson3, and Weining Li5
1Allen Institute for AI
shannons@allenai.org
2Brown University
ruochen zhang@brown.edu
3Harvard University
Layout analysis and extraction of text from images
If you require a more granular segmentation of text (e.g., into distinct paragraphs, titles, tables, or other structures) or need to extract text from images, the method below is appropriate. It will return a list of Document objects, where each object represents a structure on the page. The Document's metadata stores the page number and other information related to the object (e.g., it might store table rows and columns in the case of a table object).
Under the hood it uses the langchain-unstructured library. See the integration docs for details on using Unstructured with LangChain.
Unstructured supports multiple parameters for PDF parsing:
strategy (e.g., "fast" or "hi_res"); API or local processing. You will need an API key to use the API.
The hi_res strategy provides support for document layout analysis and OCR. We demonstrate it below via the API. See the local parsing section below for considerations when running locally.
%pip install -qU langchain-unstructured
import getpass
import os
if "UNSTRUCTURED_API_KEY" not in os.environ:
os.environ["UNSTRUCTURED_API_KEY"] = getpass.getpass("Unstructured API Key:")
Unstructured API Key: ········
As before, we initialize a loader and lazily load documents:
from langchain_unstructured import UnstructuredLoader
loader = UnstructuredLoader(
    file_path=file_path,
    strategy="hi_res",
    partition_via_api=True,
    coordinates=True,
)
docs = []
for doc in loader.lazy_load():
    docs.append(doc)
INFO: Preparing to split document for partition.
INFO: Starting page number set to 1
INFO: Allow failed set to 0
INFO: Concurrency level set to 5
INFO: Splitting pages 1 to 16 (16 total)
INFO: Determined optimal split size of 4 pages.
INFO: Partitioning 4 files with 4 page(s) each.
INFO: Partitioning set #1 (pages 1-4).
INFO: Partitioning set #2 (pages 5-8).
INFO: Partitioning set #3 (pages 9-12).
INFO: Partitioning set #4 (pages 13-16).
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: Successfully partitioned set #1, elements added to the final result.
INFO: Successfully partitioned set #2, elements added to the final result.
INFO: Successfully partitioned set #3, elements added to the final result.
INFO: Successfully partitioned set #4, elements added to the final result.
Here we recover 171 distinct structures over the 16-page document:
print(len(docs))
171
We can use the document metadata to recover content from a single page:
first_page_docs = [doc for doc in docs if doc.metadata.get("page_number") == 1]
for doc in first_page_docs:
    print(doc.page_content)
LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis
1 2 0 2 n u J 1 2 ] V C . s c [ 2 v 8 4 3 5 1 . 3 0 1 2 : v i X r a
Zejiang Shen® (<), Ruochen Zhang?, Melissa Dell®, Benjamin Charles Germain Lee?, Jacob Carlson®, and Weining Li®
1 Allen Institute for AI shannons@allenai.org 2 Brown University ruochen zhang@brown.edu 3 Harvard University {melissadell,jacob carlson}@fas.harvard.edu 4 University of Washington bcgl@cs.washington.edu 5 University of Waterloo w422li@uwaterloo.ca
Abstract. Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model configurations complicate the easy reuse of im- portant innovations by a wide audience. Though there have been on-going efforts to improve reusability and simplify deep learning (DL) model development in disciplines like natural language processing and computer vision, none of them are optimized for challenges in the domain of DIA. This represents a major gap in the existing toolkit, as DIA is central to academic research across a wide range of disciplines in the social sciences and humanities. This paper introduces LayoutParser, an open-source library for streamlining the usage of DL in DIA research and applica- tions. The core LayoutParser library comes with a set of simple and intuitive interfaces for applying and customizing DL models for layout de- tection, character recognition, and many other document processing tasks. To promote extensibility, LayoutParser also incorporates a community platform for sharing both pre-trained models and full document digiti- zation pipelines. We demonstrate that LayoutParser is helpful for both lightweight and large-scale digitization pipelines in real-word use cases. The library is publicly available at https://layout-parser.github.io.
Keywords: Document Image Analysis · Deep Learning · Layout Analysis · Character Recognition · Open Source library · Toolkit.
1 Introduction
Deep Learning(DL)-based approaches are the state-of-the-art for a wide range of document image analysis (DIA) tasks including document image classification [11,
Extracting tables and other structures
Each Document we load represents one structure, like a title, paragraph, or table.
Some structures may be of special interest for indexing or question-answering tasks. These structures may be:
- classified for easy identification (see the quick inventory below);
- parsed into a more structured representation.
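For instance, we can take a quick inventory of the categories assigned to our documents (a minimal sketch, using the docs loaded above):
# Count detected elements by category; Title and Table are among the
# categories Unstructured assigns (others, e.g. NarrativeText, may appear).
from collections import Counter

print(Counter(doc.metadata.get("category") for doc in docs))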
Below, we identify and extract a table. The following helper code renders a page and overlays the detected segments as bounding boxes:
%pip install -qU matplotlib PyMuPDF pillow
import fitz
import matplotlib.patches as patches
import matplotlib.pyplot as plt
from PIL import Image
def plot_pdf_with_boxes(pdf_page, segments):
    pix = pdf_page.get_pixmap()
    pil_image = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)

    fig, ax = plt.subplots(1, figsize=(10, 10))
    ax.imshow(pil_image)
    categories = set()
    category_to_color = {
        "Title": "orchid",
        "Image": "forestgreen",
        "Table": "tomato",
    }
    for segment in segments:
        points = segment["coordinates"]["points"]
        layout_width = segment["coordinates"]["layout_width"]
        layout_height = segment["coordinates"]["layout_height"]
        scaled_points = [
            (x * pix.width / layout_width, y * pix.height / layout_height)
            for x, y in points
        ]
        box_color = category_to_color.get(segment["category"], "deepskyblue")
        categories.add(segment["category"])
        rect = patches.Polygon(
            scaled_points, linewidth=1, edgecolor=box_color, facecolor="none"
        )
        ax.add_patch(rect)

    # Make legend
    legend_handles = [patches.Patch(color="deepskyblue", label="Text")]
    for category in ["Title", "Image", "Table"]:
        if category in categories:
            legend_handles.append(
                patches.Patch(color=category_to_color[category], label=category)
            )
    ax.axis("off")
    ax.legend(handles=legend_handles, loc="upper right")
    plt.tight_layout()
    plt.show()
def render_page(doc_list: list, page_number: int, print_text=True) -> None:
    pdf_page = fitz.open(file_path).load_page(page_number - 1)
    page_docs = [
        doc for doc in doc_list if doc.metadata.get("page_number") == page_number
    ]
    segments = [doc.metadata for doc in page_docs]
    plot_pdf_with_boxes(pdf_page, segments)
    if print_text:
        for doc in page_docs:
            print(f"{doc.page_content}\n")
render_page(docs, 5)
LayoutParser: A Unified Toolkit for DL-Based DIA
5
Table 1: Current layout detection models in the LayoutParser model zoo
Dataset Base Model1 Large Model Notes PubLayNet [38] PRImA [3] Newspaper [17] TableBank [18] HJDataset [31] F / M M F F F / M M - - F - Layouts of modern scientific documents Layouts of scanned modern magazines and scientific reports Layouts of scanned US newspapers from the 20th century Table region on modern scientific and business document Layouts of history Japanese documents
1 For each dataset, we train several models of different sizes for different needs (the trade-off between accuracy vs. computational cost). For “base model” and “large model”, we refer to using the ResNet 50 or ResNet 101 backbones [13], respectively. One can train models of different architectures, like Faster R-CNN [28] (F) and Mask R-CNN [12] (M). For example, an F in the Large Model column indicates it has a Faster R-CNN model trained using the ResNet 101 backbone. The platform is maintained and a number of additions will be made to the model zoo in coming months.
layout data structures, which are optimized for efficiency and versatility. 3) When necessary, users can employ existing or customized OCR models via the unified API provided in the OCR module. 4) LayoutParser comes with a set of utility functions for the visualization and storage of the layout data. 5) LayoutParser is also highly customizable, via its integration with functions for layout data annotation and model training. We now provide detailed descriptions for each component.
3.1 Layout Detection Models
In LayoutParser, a layout model takes a document image as an input and generates a list of rectangular boxes for the target content regions. Different from traditional methods, it relies on deep convolutional neural networks rather than manually curated rules to identify content regions. It is formulated as an object detection problem and state-of-the-art models like Faster R-CNN [28] and Mask R-CNN [12] are used. This yields prediction results of high accuracy and makes it possible to build a concise, generalized interface for layout detection. LayoutParser, built upon Detectron2 [35], provides a minimal API that can perform layout detection with only four lines of code in Python:
1 import layoutparser as lp 2 image = cv2 . imread ( " image_file " ) # load images 3 model = lp . De t e c tro n2 Lay outM odel ( " lp :// PubLayNet / f as t er _ r c nn _ R _ 50 _ F P N_ 3 x / config " ) 4 5 layout = model . detect ( image )
LayoutParser provides a wealth of pre-trained model weights using various datasets covering different languages, time periods, and document types. Due to domain shift [7], the prediction performance can notably drop when models are ap- plied to target samples that are significantly different from the training dataset. As document structures and layouts vary greatly in different domains, it is important to select models trained on a dataset similar to the test samples. A semantic syntax is used for initializing the model weights in LayoutParser, using both the dataset name and model name lp://<dataset-name>/<model-architecture-name>.
Note that although the table text is collapsed into a single string in the document's content, the metadata contains a representation of its rows and columns:
from IPython.display import HTML, display
segments = [
    doc.metadata
    for doc in docs
    if doc.metadata.get("page_number") == 5 and doc.metadata.get("category") == "Table"
]
display(HTML(segments[0]["text_as_html"]))
| able 1. LUllclll 1ayoul actCCLloll 1110AdCs 111 L1C LayoOulralsel 1110U4cl 200 | | |
|---|---|---|
| Dataset | Base Model' | Notes |
| PubLayNet [38] | F/M | Layouts of modern scientific documents |
| PRImA | M | Layouts of scanned modern magazines and scientific reports |
| Newspaper | F | Layouts of scanned US newspapers from the 20th century |
| TableBank [18] | F | Table region on modern scientific and business document |
| HJDataset | F/M | Layouts of history Japanese documents |
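Because the metadata stores the table as HTML, it can be handed to standard tooling. Here is a minimal sketch (assuming pandas and an HTML parser such as lxml are installed, and segments from the cell above):
from io import StringIO

import pandas as pd

# read_html returns one DataFrame per <table> tag; we take the first.
df = pd.read_html(StringIO(segments[0]["text_as_html"]))[0]
print(df.head())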
Extracting text from specific sections
Structures may have parent-child relationships -- for example, a paragraph might belong to a section with a title. If a section is of particular interest (e.g., for indexing) we can isolate the corresponding Document objects.
Below, we extract all text associated with the document's "Conclusion" section:
render_page(docs, 14, print_text=False)
conclusion_docs = []
parent_id = -1
for doc in docs:
    if doc.metadata["category"] == "Title" and "Conclusion" in doc.page_content:
        parent_id = doc.metadata["element_id"]
    if doc.metadata.get("parent_id") == parent_id:
        conclusion_docs.append(doc)

for doc in conclusion_docs:
    print(doc.page_content)
LayoutParser provides a comprehensive toolkit for deep learning-based document image analysis. The off-the-shelf library is easy to install, and can be used to build flexible and accurate pipelines for processing documents with complicated structures. It also supports high-level customization and enables easy labeling and training of DL models on unique document image datasets. The LayoutParser community platform facilitates sharing DL models and DIA pipelines, inviting discussion and promoting code reproducibility and reusability. The LayoutParser team is committed to keeping the library updated continuously and bringing the most recent advances in DL-based DIA, such as multi-modal document modeling [37, 36, 9] (an upcoming priority), to a diverse audience of end-users.
Acknowledgements We thank the anonymous reviewers for their comments and suggestions. This project is supported in part by NSF Grant OIA-2033558 and funding from the Harvard Data Science Initiative and Harvard Catalyst. Zejiang Shen thanks Doug Downey for suggestions.
Extracting text from images
OCR is run on images, enabling the extraction of text therein:
render_page(docs, 11)
LayoutParser: A Unified Toolkit for DL-Based DIA
focuses on precision, efficiency, and robustness. The target documents may have complicated structures, and may require training multiple layout detection models to achieve the optimal accuracy. Light-weight pipelines are built for relatively simple documents, with an emphasis on development ease, speed and flexibility. Ideally one only needs to use existing resources, and model training should be avoided. Through two exemplar projects, we show how practitioners in both academia and industry can easily build such pipelines using LayoutParser and extract high-quality structured document data for their downstream tasks. The source code for these projects will be publicly available in the LayoutParser community hub.
11
5.1 A Comprehensive Historical Document Digitization Pipeline
The digitization of historical documents can unlock valuable data that can shed light on many important social, economic, and historical questions. Yet due to scan noises, page wearing, and the prevalence of complicated layout structures, ob- taining a structured representation of historical document scans is often extremely complicated. In this example, LayoutParser was used to develop a comprehensive pipeline, shown in Figure 5, to gener- ate high-quality structured data from historical Japanese firm financial ta- bles with complicated layouts. The pipeline applies two layout models to identify different levels of document structures and two customized OCR engines for optimized character recog- nition accuracy.
‘Active Learning Layout Annotate Layout Dataset | +—— Annotation Toolkit A4 Deep Learning Layout Layout Detection Model Training & Inference, A Post-processing — Handy Data Structures & \ Lo orajport 7 ) Al Pls for Layout Data A4 Default and Customized Text Recognition 0CR Models ¥ Visualization & Export Layout Structure Visualization & Storage The Japanese Document Helpful LayoutParser Modules Digitization Pipeline
As shown in Figure 4 (a), the document contains columns of text written vertically 15, a common style in Japanese. Due to scanning noise and archaic printing technology, the columns can be skewed or have vari- able widths, and hence cannot be eas- ily identified via rule-based methods. Within each column, words are sepa- rated by white spaces of variable size, and the vertical positions of objects can be an indicator of their layout type.
Fig. 5: Illustration of how LayoutParser helps with the historical document digi- tization pipeline.
15 A document page consists of eight rows like this. For simplicity we skip the row segmentation discussion and refer readers to the source code when available.
Note that the text from the figure on the right is extracted and incorporated into the content of the Document.
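To verify this, we can list the element categories detected on that page (a minimal sketch, reusing docs from above):
# Collect the categories Unstructured assigned to elements on page 11,
# e.g., to confirm an Image element was recognized alongside the text.
page_11_docs = [doc for doc in docs if doc.metadata.get("page_number") == 11]
print({doc.metadata.get("category") for doc in page_11_docs})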
Local parsing
Parsing locally requires the installation of additional dependencies.
Poppler (PDF analysis)
- Linux: apt-get install poppler-utils
- Mac: brew install poppler
- Windows: https://github.com/oschwartz10612/poppler-windows
Tesseract (OCR)
- Linux: apt-get install tesseract-ocr
- Mac: brew install tesseract
- Windows: https://github.com/UB-Mannheim/tesseract/wiki#tesseract-installer-for-windows
We will also need to install the unstructured PDF extras:
%pip install -qU "unstructured[pdf]"
We can then use the UnstructuredLoader much the same way, forgoing the API key and partition_via_api setting:
loader_local = UnstructuredLoader(
    file_path=file_path,
    strategy="hi_res",
)
docs_local = []
for doc in loader_local.lazy_load():
    docs_local.append(doc)
WARNING: This function will be deprecated in a future release and `unstructured` will simply use the DEFAULT_MODEL from `unstructured_inference.model.base` to set default model name
INFO: Reading PDF for file: /Users/chestercurme/repos/langchain/libs/community/tests/integration_tests/examples/layout-parser-paper.pdf ...
INFO: Detecting page elements ...
INFO: Detecting page elements ...
INFO: Detecting page elements ...
INFO: Detecting page elements ...
INFO: Detecting page elements ...
INFO: Detecting page elements ...
INFO: Detecting page elements ...
INFO: Detecting page elements ...
INFO: Detecting page elements ...
INFO: Detecting page elements ...
INFO: Detecting page elements ...
INFO: Detecting page elements ...
INFO: Detecting page elements ...
INFO: Detecting page elements ...
INFO: Detecting page elements ...
INFO: Detecting page elements ...
INFO: Processing entire page OCR with tesseract...
INFO: Processing entire page OCR with tesseract...
INFO: Processing entire page OCR with tesseract...
INFO: Processing entire page OCR with tesseract...
INFO: Processing entire page OCR with tesseract...
INFO: Processing entire page OCR with tesseract...
INFO: padding image by 20 for structure detection
INFO: Processing entire page OCR with tesseract...
INFO: Processing entire page OCR with tesseract...
INFO: Processing entire page OCR with tesseract...
INFO: Processing entire page OCR with tesseract...
INFO: padding image by 20 for structure detection
INFO: Processing entire page OCR with tesseract...
INFO: Processing entire page OCR with tesseract...
INFO: Processing entire page OCR with tesseract...
INFO: Processing entire page OCR with tesseract...
INFO: Processing entire page OCR with tesseract...
INFO: Processing entire page OCR with tesseract...
INFO: Processing entire page OCR with tesseract...
INFO: Processing entire page OCR with tesseract...
The list of documents can then be processed analogously to those obtained from the API.
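For example, the same filtering recipe used earlier recovers the page-5 table from the locally parsed documents (a minimal sketch, assuming the local hi_res run above completed):
# The locally parsed documents carry the same metadata schema as the API
# results, so the earlier recipes apply unchanged.
local_tables = [
    doc
    for doc in docs_local
    if doc.metadata.get("page_number") == 5
    and doc.metadata.get("category") == "Table"
]
print(len(local_tables))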
Use of multimodal models
Many modern LLMs support inference over multimodal inputs (e.g., images). In some applications -- such as question-answering over PDFs with complex layouts, diagrams, or scans -- it may be advantageous to skip the PDF parsing, instead casting a PDF page to an image and passing it to a model directly. This allows a model to reason over the two-dimensional content on the page, instead of a "one-dimensional" string representation.
In principle we can use any LangChain chat model that supports multimodal inputs. A list of these models is documented here. Below we use OpenAI's gpt-4o-mini.
First we define a short utility function to convert a PDF page to a base64-encoded image:
%pip install -qU PyMuPDF pillow langchain-openai
import base64
import io
import fitz
from PIL import Image
def pdf_page_to_base64(pdf_path: str, page_number: int):
    pdf_document = fitz.open(pdf_path)
    page = pdf_document.load_page(page_number - 1)  # input is one-indexed
    pix = page.get_pixmap()
    img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)

    buffer = io.BytesIO()
    img.save(buffer, format="PNG")

    return base64.b64encode(buffer.getvalue()).decode("utf-8")
from IPython.display import Image as IPImage
from IPython.display import display
base64_image = pdf_page_to_base64(file_path, 11)
display(IPImage(data=base64.b64decode(base64_image)))
We can then query the model in the usual way. Below we ask it a question related to the diagram on the page:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o-mini")
from langchain_core.messages import HumanMessage
query = "What is the name of the first step in the pipeline?"
message = HumanMessage(
    content=[
        {"type": "text", "text": query},
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
        },
    ],
)
response = llm.invoke([message])
print(response.content)
INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
The first step in the pipeline is "Annotate Layout Dataset."
Other PDF loaders
For a list of available LangChain PDF loaders, please see this table.