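ZeroxPDFLoader

Overview

ZeroxPDFLoader is a document loader that leverages the Zerox library. Zerox converts PDF documents into images, processes them using a vision-capable language model, and generates a structured Markdown representation. This loader allows for asynchronous operations and provides page-level document extraction.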

Integration details

Class	Package	Local	Serializable	JS support
ZeroxPDFLoader	langchain_community	❌	❌	❌

Loader features

Source	Document Lazy Loading	Native Async Support
ZeroxPDFLoader	✅	❌

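Setup

Credentials

The appropriate credentials must be set in environment variables. The loader supports a number of different models and model providers. See the examples below, or refer to the Zerox documentation for the full list of supported models.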
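Because a missing credential usually surfaces only once the model is called, it can help to check the environment first. The following is a minimal sketch: `missing_env_vars` is a hypothetical helper, not part of Zerox or LangChain, and the Azure variable names are assumptions that should be confirmed against the Zerox documentation.

```python
import os
from typing import Iterable, List, Mapping


def missing_env_vars(
    required: Iterable[str], env: Mapping[str, str] = os.environ
) -> List[str]:
    """Return the names in `required` that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


# Example requirement lists. OPENAI_API_KEY is the variable used on this page;
# the Azure names are assumptions, so confirm them in the Zerox documentation.
OPENAI_VARS = ["OPENAI_API_KEY"]
AZURE_VARS = ["AZURE_API_KEY", "AZURE_API_BASE", "AZURE_API_VERSION"]
```

Calling `missing_env_vars(OPENAI_VARS)` before constructing the loader returns an empty list when everything is set, and otherwise the names that still need values.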

Installation

To use ZeroxPDFLoader, you need to install the zerox package. Also make sure that langchain-community is installed.

pip install zerox langchain-community

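Initialization

ZeroxPDFLoader extracts text from a PDF by converting each page into an image and processing the images asynchronously with a vision-capable language model. To use this loader, you need to specify a model and configure any environment variables Zerox requires, such as API keys.

If you are working in an environment such as a Jupyter Notebook, you may need to handle the already-running event loop with nest_asyncio. You can set this up as follows: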

import nest_asyncio
nest_asyncio.apply()
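If the same code runs both in notebooks and in plain scripts, the patch can be applied only when an event loop is already running. This is a sketch under that assumption: the two helper names are hypothetical, and only `asyncio.get_running_loop()` and `nest_asyncio.apply()` are real library calls.

```python
import asyncio


def loop_is_running() -> bool:
    """Return True when called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop() raises RuntimeError when no loop is running.
        return False
    return True


def apply_nest_asyncio_if_needed() -> bool:
    """Patch the event loop only when one is already running (as in Jupyter)."""
    if not loop_is_running():
        return False
    import nest_asyncio  # imported lazily so plain scripts do not need it

    nest_asyncio.apply()
    return True
```

In a plain script `apply_nest_asyncio_if_needed()` returns `False` and does nothing; in a notebook cell it applies the patch and returns `True`.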

import os

# use nest_asyncio (only necessary inside of jupyter notebook)
import nest_asyncio
from langchain_community.document_loaders.pdf import ZeroxPDFLoader

nest_asyncio.apply()

# Specify the url or file path for the PDF you want to process
# In this case let's use pdf from web
file_path = "https://assets.ctfassets.net/f1df9zr7wr1a/soP1fjvG1Wu66HJhu3FBS/034d6ca48edb119ae77dec5ce01a8612/OpenAI_Sacra_Teardown.pdf"

# Set up necessary env vars for a vision model
os.environ["OPENAI_API_KEY"] = "your-api-key"  # replace with your own key; never commit real keys

# Initialize ZeroxPDFLoader with the desired model
# ("gpt-4o-mini" is an OpenAI model, matching the OPENAI_API_KEY set above)
loader = ZeroxPDFLoader(file_path=file_path, model="gpt-4o-mini")

API Reference: ZeroxPDFLoader

Load

# Load the document and look at the first page:
documents = loader.load()
documents[0]

Document(metadata={'source': 'https://assets.ctfassets.net/f1df9zr7wr1a/soP1fjvG1Wu66HJhu3FBS/034d6ca48edb119ae77dec5ce01a8612/OpenAI_Sacra_Teardown.pdf', 'page': 1, 'num_pages': 5}, page_content='# OpenAI\n\nOpenAI is an AI research laboratory.\n\n#ai-models #ai\n\n## Revenue\n- **$1,000,000,000**  \n  2023\n\n## Valuation\n- **$28,000,000,000**  \n  2023\n\n## Growth Rate (Y/Y)\n- **400%**  \n  2023\n\n## Funding\n- **$11,300,000,000**  \n  2023\n\n---\n\n## Details\n- **Headquarters:** San Francisco, CA\n- **CEO:** Sam Altman\n\n[Visit Website](#)\n\n---\n\n## Revenue\n### ARR ($M)  | Growth\n--- | ---\n$1000M  | 456%\n$750M   | \n$500M   | \n$250M   | $36M\n$0     | $200M\n\nis on track to hit $1B in annual recurring revenue by the end of 2023, up about 400% from an estimated $200M at the end of 2022.\n\nOpenAI overall lost about $540M last year while developing ChatGPT, and those losses are expected to increase dramatically in 2023 with the growth in popularity of their consumer tools, with CEO Sam Altman remarking that OpenAI is likely to be "the most capital-intensive startup in Silicon Valley history."\n\nThe reason for that is operating ChatGPT is massively expensive. One analysis of ChatGPT put the running cost at about $700,000 per day taking into account the underlying costs of GPU hours and hardware. 
That amount—derived from the 175 billion parameter-large architecture of GPT-3—would be even higher with the 100 trillion parameters of GPT-4.\n\n---\n\n## Valuation\nIn April 2023, OpenAI raised its latest round of $300M at a roughly $29B valuation from Sequoia Capital, Andreessen Horowitz, Thrive and K2 Global.\n\nAssuming OpenAI was at roughly $300M in ARR at the time, that would have given them a 96x forward revenue multiple.\n\n---\n\n## Product\n\n### ChatGPT\n| Examples                       | Capabilities                        | Limitations                        |\n|---------------------------------|-------------------------------------|------------------------------------|\n| "Explain quantum computing in simple terms" | "Remember what users said earlier in the conversation" | May occasionally generate incorrect information |\n| "What can you give me for my dad\'s birthday?" | "Allows users to follow-up questions" | Limited knowledge of world events after 2021 |\n| "How do I make an HTTP request in JavaScript?" | "Trained to provide harmless requests" |                                    |')

# Let's look at parsed first page
print(documents[0].page_content)

# OpenAI

OpenAI is an AI research laboratory.

#ai-models #ai

## Revenue
- **$1,000,000,000**  
  2023

## Valuation
- **$28,000,000,000**  
  2023

## Growth Rate (Y/Y)
- **400%**  
  2023

## Funding
- **$11,300,000,000**  
  2023

---

## Details
- **Headquarters:** San Francisco, CA
- **CEO:** Sam Altman

[Visit Website](#)

---

## Revenue
### ARR ($M)  | Growth
--- | ---
$1000M  | 456%
$750M   | 
$500M   | 
$250M   | $36M
$0     | $200M

is on track to hit $1B in annual recurring revenue by the end of 2023, up about 400% from an estimated $200M at the end of 2022.

OpenAI overall lost about $540M last year while developing ChatGPT, and those losses are expected to increase dramatically in 2023 with the growth in popularity of their consumer tools, with CEO Sam Altman remarking that OpenAI is likely to be "the most capital-intensive startup in Silicon Valley history."

The reason for that is operating ChatGPT is massively expensive. One analysis of ChatGPT put the running cost at about $700,000 per day taking into account the underlying costs of GPU hours and hardware. That amount—derived from the 175 billion parameter-large architecture of GPT-3—would be even higher with the 100 trillion parameters of GPT-4.

---

## Valuation
In April 2023, OpenAI raised its latest round of $300M at a roughly $29B valuation from Sequoia Capital, Andreessen Horowitz, Thrive and K2 Global.

Assuming OpenAI was at roughly $300M in ARR at the time, that would have given them a 96x forward revenue multiple.

---

## Product

### ChatGPT
| Examples                       | Capabilities                        | Limitations                        |
|---------------------------------|-------------------------------------|------------------------------------|
| "Explain quantum computing in simple terms" | "Remember what users said earlier in the conversation" | May occasionally generate incorrect information |
| "What can you give me for my dad's birthday?" | "Allows users to follow-up questions" | Limited knowledge of world events after 2021 |
| "How do I make an HTTP request in JavaScript?" | "Trained to provide harmless requests" |                                    |

Lazy Load

The loader always fetches results lazily. The .load() method is equivalent to .lazy_load().
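The general relationship between the two methods can be sketched with a toy class. `ToyPageLoader` is a hypothetical stand-in, not ZeroxPDFLoader: it only illustrates that an eager `load()` is the lazy iterator collected into a list. It does not show when ZeroxPDFLoader itself issues its model calls.

```python
from typing import Iterator, List


class ToyPageLoader:
    """Hypothetical stand-in for a loader; it is not ZeroxPDFLoader."""

    def __init__(self, pages: List[str]) -> None:
        self.pages = pages
        self.pages_processed = 0  # counts how many pages have been produced

    def lazy_load(self) -> Iterator[str]:
        # Each page is produced only when the caller asks for the next item.
        for text in self.pages:
            self.pages_processed += 1
            yield text

    def load(self) -> List[str]:
        # Eager loading is the lazy iterator collected into a list.
        return list(self.lazy_load())


loader = ToyPageLoader(["page one", "page two", "page three"])
first = next(loader.lazy_load())  # pulls a single page from a fresh iterator
processed_after_first = loader.pages_processed
everything = loader.load()  # walks through all three pages
```

Iterating over `lazy_load()` is useful when pages should be handled one at a time, while `load()` is convenient when the whole list is needed at once.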

API reference

`ZeroxPDFLoader`

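This loader class is initialized with a file path and a model type, and supports custom configuration through zerox_kwargs for Zerox-specific parameters.

Parameters:

file_path (Union[str, Path]): Path to the PDF file.
model (str): Vision-capable model to use for processing, in the format <provider>/<model>. Some examples of valid values are:
- model = "gpt-4o-mini" ## openai model
- model = "azure/gpt-4o-mini"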
- model = "gemini/gemini-1.5-flash"
- model = "claude-3-opus-20240229"
- model = "vertex_ai/gemini-1.5-flash-001"
- See more details in the Zerox documentation
- Defaults to "gpt-4o-mini".
**zerox_kwargs (dict): Additional Zerox-specific parameters such as API key, endpoint, etc.
- See the Zerox documentation
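The <provider>/<model> convention in the model parameter can be illustrated with a small parser. This is only a sketch: `split_model_name` is a hypothetical helper, and returning `None` for a bare name reflects the assumption that the underlying library infers the provider in that case.

```python
from typing import Optional, Tuple


def split_model_name(model: str) -> Tuple[Optional[str], str]:
    """Split "<provider>/<model>" into (provider, model).

    A bare name such as "gpt-4o-mini" yields (None, name).
    """
    provider, sep, name = model.partition("/")
    if not sep:
        # No slash present: treat the whole string as the model name.
        return None, model
    return provider, name
```

For example, `split_model_name("azure/gpt-4o-mini")` separates the provider prefix `azure` from the model name `gpt-4o-mini`, which makes it easy to decide which credentials to check before creating the loader.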

Methods:

lazy_load: Yields Document instances, each representing one page of the PDF, along with metadata that includes the page number and the source.
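Since each Document carries its page number in metadata, the pages can be reassembled into a single Markdown string. The sketch below uses a hypothetical `PageDoc` dataclass in place of LangChain's Document so that it runs without any API calls. The metadata keys (`source`, `page`, `num_pages`) follow the sample output shown earlier on this page.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class PageDoc:
    """Hypothetical stand-in with the same two fields as a LangChain Document."""

    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


def join_pages(docs: Iterable[PageDoc], separator: str = "\n\n---\n\n") -> str:
    """Concatenate page contents in ascending order of metadata['page']."""
    ordered: List[PageDoc] = sorted(docs, key=lambda d: d.metadata["page"])
    return separator.join(d.page_content for d in ordered)


docs = [
    PageDoc("second page", {"source": "example.pdf", "page": 2, "num_pages": 2}),
    PageDoc("first page", {"source": "example.pdf", "page": 1, "num_pages": 2}),
]
combined = join_pages(docs)
```

Sorting by the `page` key keeps the result stable even if pages arrive out of order.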

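See the full API documentation for ZeroxPDFLoader for details.

Notes

Model Compatibility: Zerox supports a range of vision-capable models. Refer to Zerox's GitHub documentation for the list of supported models and configuration details.
Environment Variables: Make sure to set the required environment variables, such as API_KEY or endpoint details, as specified in the Zerox documentation.
Asynchronous Processing: If you encounter event-loop errors in a Jupyter Notebook, you may need to apply nest_asyncio as shown in the Setup section.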

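Troubleshooting

RuntimeError: This event loop is already running: Use nest_asyncio.apply() to prevent asynchronous loop conflicts in environments like Jupyter.
Configuration Errors: Verify that zerox_kwargs match the arguments expected by your chosen model and that all necessary environment variables are set.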

Additional Resources

Zerox Documentation: Zerox GitHub Repository
LangChain Document Loaders: LangChain Documentation

Document loader conceptual guide
Document loader how-to guides
