
Migrating from RefineDocumentsChain

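RefineDocumentsChain implements a strategy for analyzing long texts. The strategy is as follows:

  • Split a text into smaller documents;
  • Apply a process to the first document;
  • Refine or update the result based on the next document;
  • Repeat through the sequence of documents until finished.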
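
The steps above can be sketched in plain Python as a fold over the documents. This is a minimal illustration rather than the library's implementation: `refine_over_documents`, `toy_initial`, and `toy_refine` are hypothetical names, and the two toy functions stand in for LLM calls by concatenating text.

```python
from typing import Callable, List


def refine_over_documents(
    docs: List[str],
    initial_step: Callable[[str], str],
    refine_step: Callable[[str, str], str],
) -> str:
    """Fold a sequence of documents into a single running result."""
    if not docs:
        raise ValueError("at least one document is required")
    # Apply the process to the first document.
    result = initial_step(docs[0])
    # Refine the running result with each subsequent document, in order.
    for doc in docs[1:]:
        result = refine_step(result, doc)
    return result


# Hypothetical stand-ins for LLM calls: they only pass through or join text.
def toy_initial(doc: str) -> str:
    return doc


def toy_refine(summary: str, doc: str) -> str:
    return f"{summary}; {doc}"


summary = refine_over_documents(
    ["Apples are red", "Blueberries are blue", "Bananas are yellow"],
    toy_initial,
    toy_refine,
)
print(summary)
```

Because each step depends on the previous result, the calls are inherently sequential; the number of model calls equals the number of documents.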

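A common process applied in this context is summarization, in which a running summary is modified as we proceed through chunks of a long text. This is particularly useful for texts that are large compared to the context window of a given LLM.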
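
This guide starts from documents that are already split. As a rough sketch of the splitting step for a text that does not fit in the context window, the hypothetical helper `split_text` below cuts a string into fixed-size character chunks with an optional overlap. Character counts are an assumption made for simplicity; a real pipeline would more likely split on tokens or sentence boundaries.

```python
from typing import List


def split_text(text: str, chunk_size: int, overlap: int = 0) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that content cut at a
    boundary still appears whole in one of the two neighbouring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start : start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


chunks = split_text("abcdefghij", chunk_size=4, overlap=1)
print(chunks)
```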

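A LangGraph implementation brings a number of advantages for this problem:

  • Where RefineDocumentsChain refines the summary via a for loop inside the class, a LangGraph implementation lets you step through the execution to monitor it or steer it if needed.
  • The LangGraph implementation supports streaming of both execution steps and individual tokens.
  • Because it is assembled from modular components, it is also simple to extend or modify (e.g., to incorporate tool calling or other behavior).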

Below we will go through both RefineDocumentsChain and a corresponding LangGraph implementation on a simple example for illustrative purposes.

Let's first load a chat model:

pip install -qU "langchain[openai]"
import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("gpt-4o-mini", model_provider="openai")

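Let's go through an example where we summarize a sequence of documents. We first generate some simple documents for illustrative purposes: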

from langchain_core.documents import Document

documents = [
    Document(page_content="Apples are red", metadata={"title": "apple_book"}),
    Document(page_content="Blueberries are blue", metadata={"title": "blueberry_book"}),
    Document(page_content="Bananas are yellow", metadata={"title": "banana_book"}),
]
API Reference: Document

Legacy

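Below we show an implementation with RefineDocumentsChain. We define the prompt templates for the initial summarization and the successive refinements, instantiate separate LLMChain objects for these two purposes, and instantiate RefineDocumentsChain with these components.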

from langchain.chains import LLMChain, RefineDocumentsChain
from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
from langchain_openai import ChatOpenAI

# This controls how each document will be formatted. Specifically,
# it will be passed to `format_document` - see that function for more
# details.
document_prompt = PromptTemplate(
    input_variables=["page_content"], template="{page_content}"
)
document_variable_name = "context"
# The prompt here should take as an input variable the
# `document_variable_name`
summarize_prompt = ChatPromptTemplate(
    [
        ("human", "Write a concise summary of the following: {context}"),
    ]
)
initial_llm_chain = LLMChain(llm=llm, prompt=summarize_prompt)
initial_response_name = "existing_answer"
# The prompt here should take as an input variable the
# `document_variable_name` as well as `initial_response_name`
refine_template = """
Produce a final summary.

Existing summary up to this point:
{existing_answer}

New context:
------------
{context}
------------

Given the new context, refine the original summary.
"""
refine_prompt = ChatPromptTemplate([("human", refine_template)])
refine_llm_chain = LLMChain(llm=llm, prompt=refine_prompt)
chain = RefineDocumentsChain(
    initial_llm_chain=initial_llm_chain,
    refine_llm_chain=refine_llm_chain,
    document_prompt=document_prompt,
    document_variable_name=document_variable_name,
    initial_response_name=initial_response_name,
)

We can now invoke our chain:

result = chain.invoke(documents)
result["output_text"]
'Apples are typically red in color, blueberries are blue, and bananas are yellow.'

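The LangSmith trace is composed of three LLM calls: one for the initial summary, and two more that update that summary. The process completes when we update the summary with content from the final document.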
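
To see why three documents lead to exactly three calls, and what each prompt contains, the sketch below fills the two templates with plain `str.format` and sends them to a recording stand-in for the model. `RecordingModel` is a hypothetical class written for this illustration; it does not call an LLM, and the loop is a simplification rather than the chain's internals.

```python
from typing import List

SUMMARIZE_TEMPLATE = "Write a concise summary of the following: {context}"
REFINE_TEMPLATE = (
    "Produce a final summary.\n\n"
    "Existing summary up to this point:\n{existing_answer}\n\n"
    "New context:\n------------\n{context}\n------------\n\n"
    "Given the new context, refine the original summary."
)


class RecordingModel:
    """Hypothetical stand-in for a chat model that records every prompt."""

    def __init__(self) -> None:
        self.prompts: List[str] = []

    def __call__(self, prompt: str) -> str:
        self.prompts.append(prompt)
        # Return a placeholder "summary" numbered by call order.
        return f"summary-{len(self.prompts)}"


model = RecordingModel()
docs = ["Apples are red", "Blueberries are blue", "Bananas are yellow"]

# Call 1: the initial summary sees only the first document.
summary = model(SUMMARIZE_TEMPLATE.format(context=docs[0]))

# Calls 2 and 3: each refinement sees the previous summary plus one new document.
for doc in docs[1:]:
    summary = model(REFINE_TEMPLATE.format(existing_answer=summary, context=doc))

print(len(model.prompts))
print(summary)
```

Each refinement prompt carries only the running summary and one new document, which is what keeps every call within the context window regardless of the total text length.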

LangGraph

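Below we show a LangGraph implementation of this process:

  • We use the same two templates as before.
  • We generate a simple chain for the initial summary that plucks out the first document, formats it into a prompt, and runs inference with our LLM.
  • We generate a second refine_summary_chain that operates on each successive document, refining the initial summary.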

We will need to install langgraph:

pip install -qU langgraph
import operator
from typing import List, Literal, TypedDict

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableConfig
from langchain_openai import ChatOpenAI
from langgraph.graph import END, START, StateGraph

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Initial summary
summarize_prompt = ChatPromptTemplate(
    [
        ("human", "Write a concise summary of the following: {context}"),
    ]
)
initial_summary_chain = summarize_prompt | llm | StrOutputParser()

# Refining the summary with new docs
refine_template = """
Produce a final summary.

Existing summary up to this point:
{existing_answer}

New context:
------------
{context}
------------

Given the new context, refine the original summary.
"""
refine_prompt = ChatPromptTemplate([("human", refine_template)])

refine_summary_chain = refine_prompt | llm | StrOutputParser()


# For LangGraph, we define the state of the graph to hold the document
# contents, the index of the next document to process, and the running summary.
class State(TypedDict):
    contents: List[str]
    index: int
    summary: str


# We define functions for each node, including a node that generates
# the initial summary:
async def generate_initial_summary(state: State, config: RunnableConfig):
    summary = await initial_summary_chain.ainvoke(
        state["contents"][0],
        config,
    )
    return {"summary": summary, "index": 1}


# And a node that refines the summary based on the next document
async def refine_summary(state: State, config: RunnableConfig):
    content = state["contents"][state["index"]]
    summary = await refine_summary_chain.ainvoke(
        {"existing_answer": state["summary"], "context": content},
        config,
    )

    return {"summary": summary, "index": state["index"] + 1}


# Here we implement logic to either exit the application or refine
# the summary.
def should_refine(state: State) -> Literal["refine_summary", END]:
    if state["index"] >= len(state["contents"]):
        return END
    else:
        return "refine_summary"


graph = StateGraph(State)
graph.add_node("generate_initial_summary", generate_initial_summary)
graph.add_node("refine_summary", refine_summary)

graph.add_edge(START, "generate_initial_summary")
graph.add_conditional_edges("generate_initial_summary", should_refine)
graph.add_conditional_edges("refine_summary", should_refine)
app = graph.compile()
from IPython.display import Image

Image(app.get_graph().draw_mermaid_png())
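
To make the control flow of this graph concrete, here is a dependency-free sketch that mimics it with plain functions and a small driver loop. It is only an approximation under stated assumptions: the `END` sentinel, the `NODES` table, and `run_graph` are hypothetical, the nodes join strings instead of calling an LLM, and the driver does not reproduce LangGraph's actual scheduling or streaming semantics.

```python
from typing import Callable, Dict, Iterator, List, TypedDict

END = "__end__"  # hypothetical sentinel standing in for langgraph's END


class State(TypedDict, total=False):
    contents: List[str]
    index: int
    summary: str


def generate_initial_summary(state: State) -> dict:
    # Stand-in for the LLM call: the first document becomes the summary.
    return {"summary": state["contents"][0], "index": 1}


def refine_summary(state: State) -> dict:
    # Stand-in for the LLM call: append the next document to the summary.
    content = state["contents"][state["index"]]
    return {
        "summary": f"{state['summary']} | {content}",
        "index": state["index"] + 1,
    }


def should_refine(state: State) -> str:
    # Route to END once every document has been consumed.
    if state["index"] >= len(state["contents"]):
        return END
    return "refine_summary"


NODES: Dict[str, Callable[[State], dict]] = {
    "generate_initial_summary": generate_initial_summary,
    "refine_summary": refine_summary,
}


def run_graph(contents: List[str]) -> Iterator[State]:
    """Yield the full state after each node runs."""
    state: State = {"contents": contents}
    node = "generate_initial_summary"
    while node != END:
        # Each node returns a partial update that is merged into the state.
        state = {**state, **NODES[node](state)}
        yield state
        node = should_refine(state)


steps = list(
    run_graph(["Apples are red", "Blueberries are blue", "Bananas are yellow"])
)
for step in steps:
    print(step["index"], step["summary"])
```

Keeping the index in the state, rather than in a loop variable, is what lets an outside caller pause, inspect, or redirect the process between steps.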

We can step through the execution as follows, printing out the summary as it is refined:

async for step in app.astream(
    {"contents": [doc.page_content for doc in documents]},
    stream_mode="values",
):
    if summary := step.get("summary"):
        print(summary)
Apples are typically red in color.
Apples are typically red in color, while blueberries are blue.
Apples are typically red in color, blueberries are blue, and bananas are yellow.

In the LangSmith trace we again recover three LLM calls, performing the same functions as before.

Note that we can stream tokens from the application, including from intermediate steps:

async for event in app.astream_events(
    {"contents": [doc.page_content for doc in documents]}, version="v2"
):
    kind = event["event"]
    if kind == "on_chat_model_stream":
        content = event["data"]["chunk"].content
        if content:
            print(content, end="|")
    elif kind == "on_chat_model_end":
        print("\n\n")
Ap|ples| are| characterized| by| their| red| color|.|


Ap|ples| are| characterized| by| their| red| color|,| while| blueberries| are| known| for| their| blue| hue|.|


Ap|ples| are| characterized| by| their| red| color|,| blueberries| are| known| for| their| blue| hue|,| and| bananas| are| recognized| for| their| yellow| color|.|
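
If we want the streamed tokens grouped per model call instead of printed as they arrive, the same event filtering can be wrapped in a small helper. The sketch below is an assumption-laden illustration: `collect_streamed_text` and `make_chunk_event` are hypothetical names, and the sample events are hand-written dictionaries that only imitate the two event kinds handled above, not real output from `astream_events`.

```python
from types import SimpleNamespace
from typing import Iterable, List


def collect_streamed_text(events: Iterable[dict]) -> List[str]:
    """Group streamed token chunks into one string per chat model call."""
    outputs: List[str] = []
    current: List[str] = []
    for event in events:
        kind = event["event"]
        if kind == "on_chat_model_stream":
            content = event["data"]["chunk"].content
            if content:
                current.append(content)
        elif kind == "on_chat_model_end":
            # A model call finished: flush the tokens gathered so far.
            outputs.append("".join(current))
            current = []
        # All other event kinds are ignored.
    return outputs


def make_chunk_event(text: str) -> dict:
    # Hypothetical helper that builds a fake streaming event for the demo.
    return {
        "event": "on_chat_model_stream",
        "data": {"chunk": SimpleNamespace(content=text)},
    }


sample_events = [
    make_chunk_event("Ap"),
    make_chunk_event("ples"),
    make_chunk_event(" are"),
    make_chunk_event(" red"),
    make_chunk_event("."),
    {"event": "on_chat_model_end", "data": {}},
    {"event": "on_chain_end", "data": {}},
]

print(collect_streamed_text(sample_events))
```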

Next steps

See this tutorial for more LLM-based summarization strategies.

Check out the LangGraph documentation for details on building with LangGraph.