ZeroxPDFLoader
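Overview
ZeroxPDFLoader is a document loader built on the Zerox library. Zerox converts PDF documents into images and processes them with a vision-capable language model to produce a structured Markdown representation. Zerox processes the pages asynchronously, and the loader returns one document per page.
Integration details
| Class | Package | Local | Serializable | JS support |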
|---|---|---|---|---|
| ZeroxPDFLoader | langchain_community | ❌ | ❌ | ❌ |
Loader features
| Source | Document Lazy Loading | Native Async Support |
|---|---|---|
| ZeroxPDFLoader | ✅ | ❌ |
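Setup
Credentials
Appropriate credentials must be set in environment variables. The loader supports a number of different models and model providers. See the examples later on this page, or the Zerox documentation for the full list of supported models.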
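As a minimal sketch of the credential check described above, the helper below reports which environment variables are still unset before the loader is created. The function name `missing_env_vars` is hypothetical, and the variable name passed to it is only an example; the variables you actually need depend on the model provider you choose.

```python
import os
from typing import Iterable, List


def missing_env_vars(names: Iterable[str]) -> List[str]:
    """Return the names that are unset or empty in the current environment."""
    return [name for name in names if not os.environ.get(name)]


# Example (assumption): an OpenAI-hosted model reads its key from OPENAI_API_KEY.
missing = missing_env_vars(["OPENAI_API_KEY"])
if missing:
    print("Set these environment variables first:", ", ".join(missing))
```

Running a check like this up front tends to give a clearer message than a provider error raised partway through processing a PDF.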
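Installation
To use ZeroxPDFLoader, you need to install the Zerox Python package, which is published on PyPI as py-zerox. Also make sure langchain-community is installed.
pip install py-zerox langchain-community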
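Initialization
ZeroxPDFLoader extracts text from a PDF by converting each page into an image and processing the images asynchronously with a vision-capable language model. To use the loader, specify a model and configure any required environment variables (for example, the API key for your model provider).
If you are working in an environment such as Jupyter Notebook, you may need nest_asyncio to handle the asynchronous code. You can set it up as follows: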
import nest_asyncio
nest_asyncio.apply()
import os
# use nest_asyncio (only necessary inside of jupyter notebook)
import nest_asyncio
from langchain_community.document_loaders.pdf import ZeroxPDFLoader
nest_asyncio.apply()
# Specify the url or file path for the PDF you want to process
# In this case let's use pdf from web
file_path = "https://assets.ctfassets.net/f1df9zr7wr1a/soP1fjvG1Wu66HJhu3FBS/034d6ca48edb119ae77dec5ce01a8612/OpenAI_Sacra_Teardown.pdf"
# Set up necessary env vars for a vision model
os.environ["OPENAI_API_KEY"] = "your-api-key"  # replace with your own key; never commit real keys
# Initialize ZeroxPDFLoader with the desired model
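# "gpt-4o-mini" matches the OPENAI_API_KEY set above; an "azure/..." model needs Azure credentials instead
loader = ZeroxPDFLoader(file_path=file_path, model="gpt-4o-mini")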
API Reference: ZeroxPDFLoader
Load
# Load the document and look at the first page:
documents = loader.load()
documents[0]
Document(metadata={'source': 'https://assets.ctfassets.net/f1df9zr7wr1a/soP1fjvG1Wu66HJhu3FBS/034d6ca48edb119ae77dec5ce01a8612/OpenAI_Sacra_Teardown.pdf', 'page': 1, 'num_pages': 5}, page_content='# OpenAI\n\nOpenAI is an AI research laboratory.\n\n#ai-models #ai\n\n## Revenue\n- **$1,000,000,000** \n 2023\n\n## Valuation\n- **$28,000,000,000** \n 2023\n\n## Growth Rate (Y/Y)\n- **400%** \n 2023\n\n## Funding\n- **$11,300,000,000** \n 2023\n\n---\n\n## Details\n- **Headquarters:** San Francisco, CA\n- **CEO:** Sam Altman\n\n[Visit Website](#)\n\n---\n\n## Revenue\n### ARR ($M) | Growth\n--- | ---\n$1000M | 456%\n$750M | \n$500M | \n$250M | $36M\n$0 | $200M\n\nis on track to hit $1B in annual recurring revenue by the end of 2023, up about 400% from an estimated $200M at the end of 2022.\n\nOpenAI overall lost about $540M last year while developing ChatGPT, and those losses are expected to increase dramatically in 2023 with the growth in popularity of their consumer tools, with CEO Sam Altman remarking that OpenAI is likely to be "the most capital-intensive startup in Silicon Valley history."\n\nThe reason for that is operating ChatGPT is massively expensive. One analysis of ChatGPT put the running cost at about $700,000 per day taking into account the underlying costs of GPU hours and hardware. 
That amount—derived from the 175 billion parameter-large architecture of GPT-3—would be even higher with the 100 trillion parameters of GPT-4.\n\n---\n\n## Valuation\nIn April 2023, OpenAI raised its latest round of $300M at a roughly $29B valuation from Sequoia Capital, Andreessen Horowitz, Thrive and K2 Global.\n\nAssuming OpenAI was at roughly $300M in ARR at the time, that would have given them a 96x forward revenue multiple.\n\n---\n\n## Product\n\n### ChatGPT\n| Examples | Capabilities | Limitations |\n|---------------------------------|-------------------------------------|------------------------------------|\n| "Explain quantum computing in simple terms" | "Remember what users said earlier in the conversation" | May occasionally generate incorrect information |\n| "What can you give me for my dad\'s birthday?" | "Allows users to follow-up questions" | Limited knowledge of world events after 2021 |\n| "How do I make an HTTP request in JavaScript?" | "Trained to provide harmless requests" | |')
# Let's look at parsed first page
print(documents[0].page_content)
# OpenAI
OpenAI is an AI research laboratory.
#ai-models #ai
## Revenue
- **$1,000,000,000**
2023
## Valuation
- **$28,000,000,000**
2023
## Growth Rate (Y/Y)
- **400%**
2023
## Funding
- **$11,300,000,000**
2023
---
## Details
- **Headquarters:** San Francisco, CA
- **CEO:** Sam Altman
[Visit Website](#)
---
## Revenue
### ARR ($M) | Growth
--- | ---
$1000M | 456%
$750M |
$500M |
$250M | $36M
$0 | $200M
is on track to hit $1B in annual recurring revenue by the end of 2023, up about 400% from an estimated $200M at the end of 2022.
OpenAI overall lost about $540M last year while developing ChatGPT, and those losses are expected to increase dramatically in 2023 with the growth in popularity of their consumer tools, with CEO Sam Altman remarking that OpenAI is likely to be "the most capital-intensive startup in Silicon Valley history."
The reason for that is operating ChatGPT is massively expensive. One analysis of ChatGPT put the running cost at about $700,000 per day taking into account the underlying costs of GPU hours and hardware. That amount—derived from the 175 billion parameter-large architecture of GPT-3—would be even higher with the 100 trillion parameters of GPT-4.
---
## Valuation
In April 2023, OpenAI raised its latest round of $300M at a roughly $29B valuation from Sequoia Capital, Andreessen Horowitz, Thrive and K2 Global.
Assuming OpenAI was at roughly $300M in ARR at the time, that would have given them a 96x forward revenue multiple.
---
## Product
### ChatGPT
| Examples | Capabilities | Limitations |
|---------------------------------|-------------------------------------|------------------------------------|
| "Explain quantum computing in simple terms" | "Remember what users said earlier in the conversation" | May occasionally generate incorrect information |
| "What can you give me for my dad's birthday?" | "Allows users to follow-up questions" | Limited knowledge of world events after 2021 |
| "How do I make an HTTP request in JavaScript?" | "Trained to provide harmless requests" | |
Lazy Load
The loader always fetches results lazily. The .load() method simply collects the output of .lazy_load() into a list.
API reference
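Because results come from an iterator, you can stop after the first few pages instead of building the full list. The sketch below assumes only that the loader exposes `lazy_load()` as described above; `take_pages` is a hypothetical helper, not part of LangChain.

```python
from itertools import islice
from typing import Any, List


def take_pages(loader: Any, limit: int) -> List[Any]:
    """Collect at most `limit` documents from loader.lazy_load()."""
    return list(islice(loader.lazy_load(), limit))
```

For example, `take_pages(loader, 2)` would return the first two page documents. This limits how many documents you handle downstream; it does not necessarily reduce the work Zerox performs on the PDF.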
ZeroxPDFLoader
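This loader class is initialized with a file path and a model type, and supports custom configuration through zerox_kwargs for Zerox-specific parameters.
Parameters:
- file_path (Union[str, Path]): Path to the PDF file.
- model (str): Vision-capable model to use for processing, in the format <provider>/<model>. The provider prefix is omitted for some models. Some examples of valid values:
  - model = "gpt-4o-mini" ## openai model
  - model = "azure/gpt-4o-mini"
  - model = "gemini/<gemini_model>"
  - model = "claude-3-opus-20240229"
  - model = "vertex_ai/gemini-1.5-flash-001"
  - See the Zerox documentation for more information.
  - Defaults to "gpt-4o-mini".
- **zerox_kwargs (dict): Zerox-specific parameters such as API key, endpoint, etc.
  - See the Zerox documentation.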
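To make the `<provider>/<model>` naming convention concrete, here is a small illustrative parser. `split_model_id` is a hypothetical helper written for this page; Zerox resolves the model string itself, so you do not need to parse it in practice.

```python
from typing import Optional, Tuple


def split_model_id(model: str) -> Tuple[Optional[str], str]:
    """Split '<provider>/<model>' into (provider, model).

    Returns (None, model) when no provider prefix is present.
    """
    provider, sep, name = model.partition("/")
    if not sep:
        return None, model
    return provider, name


print(split_model_id("azure/gpt-4o-mini"))               # ('azure', 'gpt-4o-mini')
print(split_model_id("vertex_ai/gemini-1.5-flash-001"))  # ('vertex_ai', 'gemini-1.5-flash-001')
print(split_model_id("gpt-4o-mini"))                     # (None, 'gpt-4o-mini')
```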
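Methods:
lazy_load: Generates an iterator of Document instances, one per page of the PDF, each carrying metadata with the page number and the source file.
See the full API documentation for more details.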
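Since each document carries its page number in the metadata, the pages can be reassembled into a single Markdown string. The sketch below assumes the `page` metadata key seen in the sample output earlier on this page; `join_pages` and its default separator are hypothetical choices, not part of the loader.

```python
from typing import Any, Iterable


def join_pages(documents: Iterable[Any], separator: str = "\n\n---\n\n") -> str:
    """Concatenate page contents in ascending page order."""
    ordered = sorted(documents, key=lambda doc: doc.metadata["page"])
    return separator.join(doc.page_content for doc in ordered)
```

Calling `join_pages(loader.load())` would then yield the whole PDF as one Markdown document, with a horizontal rule between pages.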
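Notes
- Model compatibility: Zerox supports a range of vision-capable models. See Zerox's GitHub documentation for the list of supported models and configuration details.
- Environment variables: Make sure the required environment variables, such as API_KEY or endpoint details, are set as specified in the Zerox documentation.
- Asynchronous processing: If you run into event-loop errors in Jupyter Notebook, you may need to apply nest_asyncio as shown in the Initialization section.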
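Troubleshooting
- RuntimeError: This event loop is already running: Call nest_asyncio.apply() to prevent event-loop conflicts in environments such as Jupyter.
- Configuration errors: Make sure zerox_kwargs matches the parameters expected by your chosen model and that all required environment variables are set.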
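One way to avoid the event-loop error without patching every environment is to apply nest_asyncio only when a loop is already running. This is a sketch under that assumption; `event_loop_is_running` is a hypothetical helper, and the check relies only on the standard-library `asyncio.get_running_loop()`.

```python
import asyncio


def event_loop_is_running() -> bool:
    """Return True when called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


if event_loop_is_running():
    # Only needed in environments such as Jupyter that already run a loop.
    import nest_asyncio

    nest_asyncio.apply()
```

In a plain Python script no loop is running at import time, so the patch is skipped and nest_asyncio does not even need to be installed.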
Additional resources
- Zerox documentation: Zerox GitHub repository
- LangChain document loaders: LangChain documentation