
Migrating from RefineDocumentsChain

RefineDocumentsChain implements a strategy for analyzing long texts. The strategy is as follows:

  • Split a text into smaller documents;
  • Apply a process to the first document;
  • Refine or update the result based on the next document;
  • Repeat through the sequence of documents until finished.

A common process applied in this context is summarization, in which a running summary is modified as we proceed through chunks of a long text. This is particularly useful for texts that are large compared to the context window of a given LLM.
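
As a rough sketch, the strategy reduces to a simple loop. The summarize and refine helpers below are hypothetical stand-ins for single LLM calls, not part of any library API:

# A minimal sketch of the refine strategy. `summarize` and `refine` are
# hypothetical stand-ins for single LLM calls.
def summarize(doc: str) -> str:
    return doc  # stand-in; a real implementation would call an LLM

def refine(summary: str, doc: str) -> str:
    return f"{summary} {doc}"  # stand-in; a real implementation would call an LLM

def refine_documents(docs: list[str]) -> str:
    summary = summarize(docs[0])        # process the first document
    for doc in docs[1:]:                # fold in each subsequent document
        summary = refine(summary, doc)  # update the running summary
    return summary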

A LangGraph implementation brings a number of advantages to this problem:

  • Where RefineDocumentsChain refines the summary via a for loop inside the class, a LangGraph implementation lets you step through the execution to monitor it or otherwise intervene as needed.
  • The LangGraph implementation supports streaming of both execution steps and individual tokens.
  • Because it is composed of modular components, it is also easy to extend or modify (e.g., to incorporate tool calling or other behaviors).

Below we walk through both RefineDocumentsChain and a corresponding LangGraph implementation, using a simple example for illustrative purposes.

Let's first load a chat model:

pip install -qU "langchain[openai]"
import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("gpt-4o-mini", model_provider="openai")

Example

Let's go through an example where we summarize a sequence of documents. We first generate some simple documents for illustrative purposes:

from langchain_core.documents import Document

documents = [
Document(page_content="Apples are red", metadata={"title": "apple_book"}),
Document(page_content="Blueberries are blue", metadata={"title": "blueberry_book"}),
Document(page_content="Bananas are yelow", metadata={"title": "banana_book"}),
]
API Reference: Document

Legacy

Details

Below we show an implementation with RefineDocumentsChain. We define the prompt templates for the initial summarization and successive refinements, instantiate separate LLMChain objects for these two purposes, and instantiate RefineDocumentsChain with these components:

from langchain.chains import LLMChain, RefineDocumentsChain
from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
from langchain_openai import ChatOpenAI

# This controls how each document will be formatted. Specifically,
# it will be passed to `format_document` - see that function for more
# details.
document_prompt = PromptTemplate(
    input_variables=["page_content"], template="{page_content}"
)
document_variable_name = "context"
# The prompt here should take as an input variable the
# `document_variable_name`
summarize_prompt = ChatPromptTemplate(
    [
        ("human", "Write a concise summary of the following: {context}"),
    ]
)
initial_llm_chain = LLMChain(llm=llm, prompt=summarize_prompt)
initial_response_name = "existing_answer"
# The prompt here should take as an input variable the
# `document_variable_name` as well as `initial_response_name`
refine_template = """
Produce a final summary.

Existing summary up to this point:
{existing_answer}

New context:
------------
{context}
------------

Given the new context, refine the original summary.
"""
refine_prompt = ChatPromptTemplate([("human", refine_template)])
refine_llm_chain = LLMChain(llm=llm, prompt=refine_prompt)
chain = RefineDocumentsChain(
    initial_llm_chain=initial_llm_chain,
    refine_llm_chain=refine_llm_chain,
    document_prompt=document_prompt,
    document_variable_name=document_variable_name,
    initial_response_name=initial_response_name,
)

We can now invoke our chain:

result = chain.invoke(documents)
result["output_text"]
'Apples are typically red in color, blueberries are blue, and bananas are yellow.'

The LangSmith trace contains three LLM calls: one to generate the initial summary, and two more to update that summary. The process completes when we update the summary with the content of the final document.

LangGraph

Details

Below we show a LangGraph implementation of this process:

  • We will use the same two templates as before.
  • We will generate a simple chain for the initial summary that plucks the first document, formats it into a prompt, and runs inference with our LLM.
  • We will generate a second refine_summary_chain that operates on each successive document, refining the initial summary.

We will need to install langgraph:

pip install -qU langgraph
import operator
from typing import List, Literal, TypedDict

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableConfig
from langchain_openai import ChatOpenAI
from langgraph.constants import Send
from langgraph.graph import END, START, StateGraph

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Initial summary
summarize_prompt = ChatPromptTemplate(
    [
        ("human", "Write a concise summary of the following: {context}"),
    ]
)
initial_summary_chain = summarize_prompt | llm | StrOutputParser()

# Refining the summary with new docs
refine_template = """
Produce a final summary.

Existing summary up to this point:
{existing_answer}

New context:
------------
{context}
------------

Given the new context, refine the original summary.
"""
refine_prompt = ChatPromptTemplate([("human", refine_template)])

refine_summary_chain = refine_prompt | llm | StrOutputParser()


# We will define the state of the graph to hold the document contents
# and summary. We also include an index to keep track of our place in the
# sequence of documents.
class State(TypedDict):
    contents: List[str]
    index: int
    summary: str


# We define functions for each node, including a node that generates
# the initial summary:
async def generate_initial_summary(state: State, config: RunnableConfig):
    summary = await initial_summary_chain.ainvoke(
        state["contents"][0],
        config,
    )
    return {"summary": summary, "index": 1}


# And a node that refines the summary based on the next document
async def refine_summary(state: State, config: RunnableConfig):
    content = state["contents"][state["index"]]
    summary = await refine_summary_chain.ainvoke(
        {"existing_answer": state["summary"], "context": content},
        config,
    )

    return {"summary": summary, "index": state["index"] + 1}


# Here we implement logic to either exit the application or refine
# the summary.
def should_refine(state: State) -> Literal["refine_summary", END]:
    if state["index"] >= len(state["contents"]):
        return END
    else:
        return "refine_summary"


graph = StateGraph(State)
graph.add_node("generate_initial_summary", generate_initial_summary)
graph.add_node("refine_summary", refine_summary)

graph.add_edge(START, "generate_initial_summary")
graph.add_conditional_edges("generate_initial_summary", should_refine)
graph.add_conditional_edges("refine_summary", should_refine)
app = graph.compile()
from IPython.display import Image

Image(app.get_graph().draw_mermaid_png())

We can step through this execution, printing out the summary as it is refined:

async for step in app.astream(
{"contents": [doc.page_content for doc in documents]},
stream_mode="values",
):
if summary := step.get("summary"):
print(summary)
Apples are typically red in color.
Apples are typically red in color, while blueberries are blue.
Apples are typically red in color, blueberries are blue, and bananas are yellow.
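
If only the final summary is needed, the compiled graph can also be invoked directly to run to completion; ainvoke returns the final state (a minimal sketch using standard LangGraph invocation):

# Run the graph to completion and read the summary from the final state.
final_state = await app.ainvoke(
    {"contents": [doc.page_content for doc in documents]}
)
print(final_state["summary"])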

In the LangSmith trace we again recover three LLM calls, performing the same functions as before.

Note that we can stream tokens from the application, including from intermediate steps:

async for event in app.astream_events(
{"contents": [doc.page_content for doc in documents]}, version="v2"
):
kind = event["event"]
if kind == "on_chat_model_stream":
content = event["data"]["chunk"].content
if content:
print(content, end="|")
elif kind == "on_chat_model_end":
print("\n\n")
Ap|ples| are| characterized| by| their| red| color|.|


Ap|ples| are| characterized| by| their| red| color|,| while| blueberries| are| known| for| their| blue| hue|.|


Ap|ples| are| characterized| by| their| red| color|,| blueberries| are| known| for| their| blue| hue|,| and| bananas| are| recognized| for| their| yellow| color|.|

Next steps

See this tutorial for more LLM-based summarization strategies.

Check out the LangGraph documentation for details on building with LangGraph.