
How to handle long text when doing extraction

When working with files, like PDFs, you're likely to encounter text that exceeds your language model's context window. To process this text, consider these strategies:

  1. Change LLM: Choose a different LLM that supports a larger context window.
  2. Brute force: Chunk the document, and extract content from each chunk.
  3. RAG: Chunk the document, index the chunks, and only extract content from a subset of chunks that look "relevant".

Keep in mind that these strategies have different trade-offs, and the best strategy likely depends on the application that you're designing!

This guide demonstrates how to implement strategies 2 and 3.
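At its core, the brute-force approach in strategy 2 amounts to splitting the text into fixed-size windows, optionally overlapping so that sentences cut at a boundary still appear whole in some window. A minimal pure-Python sketch (the `split_text` helper here is illustrative, not LangChain's splitter API):

```python
def split_text(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into fixed-size character windows with optional overlap."""
    step = chunk_size - overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]


# 2500 characters, 1000-char windows, 100-char overlap -> three windows
chunks = split_text("a" * 2500, chunk_size=1000, overlap=100)
print(len(chunks))  # 3
```

In practice you would use a splitter that respects token counts and natural boundaries (such as those in `langchain_text_splitters`), but the windowing idea is the same.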

Setup

First we'll install the dependencies needed for this guide:

%pip install -qU langchain-community lxml faiss-cpu langchain-openai
Note: you may need to restart the kernel to use updated packages.

Now we need some example data! Let's download an article about cars from Wikipedia and load it as a LangChain Document.

import re

import requests
from langchain_community.document_loaders import BSHTMLLoader

# Download the content
response = requests.get("https://en.wikipedia.org/wiki/Car")
# Write it to a file
with open("car.html", "w", encoding="utf-8") as f:
    f.write(response.text)
# Load it with an HTML parser
loader = BSHTMLLoader("car.html")
document = loader.load()[0]
# Clean up code
# Replace consecutive new lines with a single new line
document.page_content = re.sub("\n\n+", "\n", document.page_content)
API Reference: BSHTMLLoader
print(len(document.page_content))
78865

Define the schema

Following the extraction tutorial, we will use Pydantic to define the schema of information we wish to extract. In this case, we will extract a list of "key developments" (e.g., important historical developments) that include a year and a description.

Note that we also include an evidence key and instruct the model to provide in verbatim the relevant sentences of text from the article. This allows us to compare the extraction results to (the model's reconstruction of) text from the original document.
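Once extraction has run, that verbatim evidence can be spot-checked against the source with a simple substring test. A sketch of such a check (`evidence_in_source` is a hypothetical helper, not part of LangChain; normalizing whitespace keeps the comparison robust to the newline cleanup above):

```python
def evidence_in_source(evidence: str, source: str) -> bool:
    """Check whether quoted evidence appears verbatim in the source text,
    ignoring differences in whitespace."""
    normalize = lambda s: " ".join(s.split())
    return normalize(evidence) in normalize(source)


source = "The Benz Patent-Motorwagen, built in 1885, is regarded as the first car."
print(evidence_in_source("built in 1885", source))  # True
print(evidence_in_source("built in 1886", source))  # False: not in the source
```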

from typing import List, Optional

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from pydantic import BaseModel, Field


class KeyDevelopment(BaseModel):
    """Information about a development in the history of cars."""

    year: int = Field(
        ..., description="The year when there was an important historic development."
    )
    description: str = Field(
        ..., description="What happened in this year? What was the development?"
    )
    evidence: str = Field(
        ...,
        description="Repeat in verbatim the sentence(s) from which the year and description information were extracted",
    )


class ExtractionData(BaseModel):
    """Extracted information about key developments in the history of cars."""

    key_developments: List[KeyDevelopment]


# Define a custom prompt to provide instructions and any additional context.
# 1) You can add examples into the prompt template to improve extraction quality
# 2) Introduce additional parameters to take context into account (e.g., include metadata
#    about the document from which the text was extracted.)
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert at identifying key historic development in text. "
            "Only extract important historic developments. Extract nothing if no important information can be found in the text.",
        ),
        ("human", "{text}"),
    ]
)
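When this prompt and schema are run over each chunk separately (strategy 2), the per-chunk extractions still need to be merged into one list. A minimal sketch of the merge step, assuming each chunk's results have been reduced to (year, description) pairs (a simplification of the `KeyDevelopment` objects above); exact duplicates, which can arise from overlapping chunks, are dropped and the rest sorted chronologically:

```python
def merge_extractions(
    per_chunk: list[list[tuple[int, str]]],
) -> list[tuple[int, str]]:
    """Flatten per-chunk results, drop exact duplicates, sort chronologically."""
    seen: set[tuple[int, str]] = set()
    merged = []
    for chunk_results in per_chunk:
        for item in chunk_results:
            if item not in seen:
                seen.add(item)
                merged.append(item)
    return sorted(merged)


# The 1885 development appears in two overlapping chunks but is kept once.
results = merge_extractions(
    [
        [(1885, "Benz builds the Motorwagen")],
        [
            (1885, "Benz builds the Motorwagen"),
            (1913, "Ford introduces the moving assembly line"),
        ],
    ]
)
print(results)
```

Note that deduplicating on exact equality is deliberately conservative: two chunks that describe the same event in different words will both be kept, which is usually preferable to silently discarding extractions.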