LLM Sherpa
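This notebook covers how to use LLM Sherpa to load files of many types. LLM Sherpa supports different file formats, including DOCX, PPTX, HTML, TXT, and XML.
LLMSherpaFileLoader uses LayoutPDFReader, which is part of the LLMSherpa library. This tool is designed to parse PDFs while preserving their layout information, which most PDF-to-text parsers lose.
Here are some key features of LayoutPDFReader:
- It can identify and extract sections and subsections along with their levels.
- It combines lines to form paragraphs.
- It can identify links between paragraphs.
- It can extract tables along with the section each table is found in.
- It can identify and extract lists and nested lists.
- It can join content spread across pages.
- It can remove repeating headers and footers.
- It can remove watermarks.
See the llmsherpa documentation.
INFO: This library fails on some PDF files, so use it with caution.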
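Because parsing can fail on some PDFs, one cautious pattern is to wrap the call to load() and fall back to an empty result. The sketch below is only an illustration: safe_load is a hypothetical helper, not part of LLM Sherpa or LangChain, and it assumes nothing beyond the loader object exposing a load() method.

```python
import logging
from typing import Any, List

logger = logging.getLogger(__name__)


def safe_load(loader: Any) -> List[Any]:
    """Call loader.load() and return [] instead of raising if parsing fails.

    Hypothetical helper: it only assumes the loader has a load() method.
    """
    try:
        return list(loader.load())
    except Exception as exc:  # broad on purpose: parser failures vary by file
        logger.warning("Document loading failed: %s", exc)
        return []
```

Returning an empty list keeps a batch job running when a single file cannot be parsed; callers that need to tell "empty file" apart from "failed parse" may prefer to re-raise instead.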
# Install package
# !pip install --upgrade --quiet llmsherpa
LLMSherpaFileLoader
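Under the hood, LLMSherpaFileLoader defines several strategies for loading file content. The following sections demonstrate the sections, chunks, html, and text strategies.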
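Since the strategy is passed as a plain string, a small validation step can catch typos before any request is sent. The following is a minimal sketch: STRATEGIES, API_URL, and build_loader_kwargs are hypothetical names introduced here, and the keyword arguments simply mirror the ones used in the examples in this notebook.

```python
from typing import Any, Dict

# The four strategy names demonstrated in this notebook.
STRATEGIES = ("sections", "chunks", "html", "text")

# Local parser endpoint used in this notebook's examples.
API_URL = "http://localhost:5010/api/parseDocument?renderFormat=all"


def build_loader_kwargs(file_path: str, strategy: str) -> Dict[str, Any]:
    """Validate the strategy name and return keyword arguments for the loader."""
    if strategy not in STRATEGIES:
        raise ValueError(
            f"Unknown strategy {strategy!r}; expected one of {STRATEGIES}"
        )
    return {
        "file_path": file_path,
        "new_indent_parser": True,
        "apply_ocr": True,
        "strategy": strategy,
        "llmsherpa_api_url": API_URL,
    }
```

The returned dictionary can then be unpacked into the loader, for example LLMSherpaFileLoader(**build_loader_kwargs("https://arxiv.org/pdf/2402.14207.pdf", "chunks")).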
sections strategy: return the file parsed into sections
from langchain_community.document_loaders.llmsherpa import LLMSherpaFileLoader
loader = LLMSherpaFileLoader(
    file_path="https://arxiv.org/pdf/2402.14207.pdf",
    new_indent_parser=True,
    apply_ocr=True,
    strategy="sections",
    llmsherpa_api_url="http://localhost:5010/api/parseDocument?renderFormat=all",
)
docs = loader.load()
API Reference: LLMSherpaFileLoader
docs[1]
Document(page_content='Abstract\nWe study how to apply large language models to write grounded and organized long-form articles from scratch, with comparable breadth and depth to Wikipedia pages.\nThis underexplored problem poses new challenges at the pre-writing stage, including how to research the topic and prepare an outline prior to writing.\nWe propose STORM, a writing system for the Synthesis of Topic Outlines through\nReferences\nFull-length Article\nTopic\nOutline\n2022 Winter Olympics\nOpening Ceremony\nResearch via Question Asking\nRetrieval and Multi-perspective Question Asking.\nSTORM models the pre-writing stage by\nLLM\n(1) discovering diverse perspectives in researching the given topic, (2) simulating conversations where writers carrying different perspectives pose questions to a topic expert grounded on trusted Internet sources, (3) curating the collected information to create an outline.\nFor evaluation, we curate FreshWiki, a dataset of recent high-quality Wikipedia articles, and formulate outline assessments to evaluate the pre-writing stage.\nWe further gather feedback from experienced Wikipedia editors.\nCompared to articles generated by an outlinedriven retrieval-augmented baseline, more of STORM’s articles are deemed to be organized (by a 25% absolute increase) and broad in coverage (by 10%).\nThe expert feedback also helps identify new challenges for generating grounded long articles, such as source bias transfer and over-association of unrelated facts.\n1. Can you provide any information about the transportation arrangements for the opening ceremony?\nLLM\n2. Can you provide any information about the budget for the 2022 Winter Olympics opening ceremony?…\nLLM- Role1\nLLM- Role2\nLLM- Role1', metadata={'source': 'https://arxiv.org/pdf/2402.14207.pdf', 'section_number': 1, 'section_title': 'Abstract'})
len(docs)
79
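Each document returned by the sections strategy carries section_number and section_title in its metadata, so a quick outline of the file can be built from the metadata alone. The sketch below assumes only those two keys, as seen in the output above. table_of_contents is a hypothetical helper, and the sample objects are made-up stand-ins (built with SimpleNamespace) so the snippet runs without a parser service. Real Document objects expose the same page_content and metadata attributes.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List, Tuple


def table_of_contents(docs: Iterable[Any]) -> List[Tuple[int, str]]:
    """Return (section_number, section_title) pairs sorted by section number."""
    entries = [
        (d.metadata["section_number"], d.metadata["section_title"]) for d in docs
    ]
    return sorted(entries)


# Stand-in documents with made-up content, shaped like the loader output.
sample_docs = [
    SimpleNamespace(
        page_content="...",
        metadata={"section_number": 2, "section_title": "Introduction"},
    ),
    SimpleNamespace(
        page_content="...",
        metadata={"section_number": 1, "section_title": "Abstract"},
    ),
]

for number, title in table_of_contents(sample_docs):
    print(number, title)
```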
chunks strategy: return the file parsed into chunks
from langchain_community.document_loaders.llmsherpa import LLMSherpaFileLoader
loader = LLMSherpaFileLoader(
    file_path="https://arxiv.org/pdf/2402.14207.pdf",
    new_indent_parser=True,
    apply_ocr=True,
    strategy="chunks",
    llmsherpa_api_url="http://localhost:5010/api/parseDocument?renderFormat=all",
)
docs = loader.load()
docs[1]
Document(page_content='Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models\nStanford University {shaoyj, yuchengj, tkanell, peterxu, okhattab}@stanford.edu lam@cs.stanford.edu', metadata={'source': 'https://arxiv.org/pdf/2402.14207.pdf', 'chunk_number': 1, 'chunk_type': 'para'})
len(docs)
306
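With several hundred chunks, it can help to see how they are distributed by type before indexing them. The sketch below tallies the chunk_type metadata key shown in the output above. count_chunk_types is a hypothetical helper, the sample objects are made-up stand-ins, and the "table" value is assumed for illustration, since only "para" appears in the output above.

```python
from collections import Counter
from types import SimpleNamespace
from typing import Any, Iterable


def count_chunk_types(docs: Iterable[Any]) -> Counter:
    """Count documents per value of the 'chunk_type' metadata key."""
    return Counter(d.metadata["chunk_type"] for d in docs)


# Stand-in chunks with made-up content; "table" is an assumed chunk_type value.
sample_chunks = [
    SimpleNamespace(page_content="...", metadata={"chunk_number": 1, "chunk_type": "para"}),
    SimpleNamespace(page_content="...", metadata={"chunk_number": 2, "chunk_type": "table"}),
    SimpleNamespace(page_content="...", metadata={"chunk_number": 3, "chunk_type": "para"}),
]

print(count_chunk_types(sample_chunks))
```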
html strategy: return the file as a single HTML document
from langchain_community.document_loaders.llmsherpa import LLMSherpaFileLoader
loader = LLMSherpaFileLoader(
    file_path="https://arxiv.org/pdf/2402.14207.pdf",
    new_indent_parser=True,
    apply_ocr=True,
    strategy="html",
    llmsherpa_api_url="http://localhost:5010/api/parseDocument?renderFormat=all",
)
docs = loader.load()
docs[0].page_content[:400]
'<html><h1>Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models</h1><table><th><td colSpan=1>Yijia Shao</td><td colSpan=1>Yucheng Jiang</td><td colSpan=1>Theodore A. Kanell</td><td colSpan=1>Peter Xu</td></th><tr><td colSpan=1></td><td colSpan=1>Omar Khattab</td><td colSpan=1>Monica S. Lam</td><td colSpan=1></td></tr></table><p>Stanford University {shaoyj, yuchengj, '
len(docs)
1
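Because the html strategy returns one HTML string, standard HTML tooling can be used to pull structure back out of it. As a minimal sketch, the snippet below collects the text of every h1 element with Python's built-in html.parser. HeadingCollector and extract_h1 are hypothetical names, and sample_html is a shortened, hand-closed copy of the beginning of the output above rather than real loader output.

```python
from html.parser import HTMLParser
from typing import List


class HeadingCollector(HTMLParser):
    """Collect the text content of every <h1> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_h1 = False
        self.headings: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True
            self.headings.append("")  # start a new heading buffer

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.headings[-1] += data


def extract_h1(html: str) -> List[str]:
    """Return the stripped text of all <h1> elements in document order."""
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return [h.strip() for h in parser.headings]


# Shortened, hand-closed sample based on the output above.
sample_html = (
    "<html><h1>Assisting in Writing Wikipedia-like Articles From Scratch "
    "with Large Language Models</h1><p>Stanford University</p></html>"
)
print(extract_h1(sample_html))
```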
text strategy: return the file as a single text document
from langchain_community.document_loaders.llmsherpa import LLMSherpaFileLoader
loader = LLMSherpaFileLoader(
    file_path="https://arxiv.org/pdf/2402.14207.pdf",
    new_indent_parser=True,
    apply_ocr=True,
    strategy="text",
    llmsherpa_api_url="http://localhost:5010/api/parseDocument?renderFormat=all",
)
docs = loader.load()
docs[0].page_content[:400]
'Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models\n | Yijia Shao | Yucheng Jiang | Theodore A. Kanell | Peter Xu\n | --- | --- | --- | ---\n | | Omar Khattab | Monica S. Lam | \n\nStanford University {shaoyj, yuchengj, tkanell, peterxu, okhattab}@stanford.edu lam@cs.stanford.edu\nAbstract\nWe study how to apply large language models to write grounded and organized long'
len(docs)
1
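In the text output above, table rows appear as lines that begin with a pipe character, mixed in with ordinary prose lines. If the two need to be handled separately, a simple line filter may be enough. The sketch below assumes that table rows always start with "|" after trimming whitespace, which is an observation from this one output and not a documented guarantee. split_table_rows is a hypothetical helper, and sample_text is a shortened copy of the output above.

```python
from typing import List, Tuple


def split_table_rows(text: str) -> Tuple[List[str], List[str]]:
    """Split text into (prose_lines, table_lines), skipping blank lines.

    Assumption: a table row is any line starting with '|' once stripped.
    """
    prose: List[str] = []
    table: List[str] = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("|"):
            table.append(stripped)
        else:
            prose.append(stripped)
    return prose, table


# Shortened sample based on the output above.
sample_text = (
    "Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models\n"
    " | Yijia Shao | Yucheng Jiang | Theodore A. Kanell | Peter Xu\n"
    " | --- | --- | --- | ---\n"
    "\n"
    "Stanford University\n"
    "Abstract\n"
)
prose_lines, table_lines = split_table_rows(sample_text)
print(prose_lines)
print(table_lines)
```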