
How to summarize text through parallelization

Large language models can summarize and otherwise distill desired information from text, including large volumes of text. In many cases, especially when the amount of text is large relative to the size of a model's context window, it can be helpful (or necessary) to break up the summarization task into smaller components.

Map-reduce is one strategy for accomplishing this. The idea is to break the text into "sub-documents", and to first map each sub-document to an individual summary using an LLM. Those summaries are then reduced, or consolidated, into a single global summary.

Note that the map step is typically parallelized over the input documents. This strategy is especially effective when understanding a sub-document does not rely on preceding context, for example when summarizing a corpus of many shorter documents.
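
To make the shape of the strategy concrete, here is a minimal sketch in plain Python. The `summarize` function here is a hypothetical stand-in for an LLM call; the rest of this guide replaces it with real chains.

def summarize(text: str) -> str:
    # Hypothetical stand-in for an LLM summarization call.
    return text.split(".")[0]

sub_documents = ["chunk one ...", "chunk two ...", "chunk three ..."]

# Map: summarize each sub-document independently (parallelizable).
partial_summaries = [summarize(doc) for doc in sub_documents]

# Reduce: consolidate the partial summaries into one global summary.
final_summary = summarize("\n\n".join(partial_summaries))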

LangGraph, which is built on top of langchain-core, supports map-reduce workflows and is well-suited to this problem:

  • LangGraph allows for individual steps (such as successive summarizations) to be streamed, allowing for greater control of execution;
  • LangGraph's checkpointing supports error recovery, extension into human-in-the-loop workflows, and easier incorporation into conversational applications;
  • the LangGraph implementation is straightforward to modify and extend.

Below we demonstrate how to summarize text via a map-reduce strategy.

Load chat model

Let's first load a chat model:

pip install -qU "langchain[openai]"
import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("gpt-4o-mini", model_provider="openai")

Load documents

First we load in our documents. We will use WebBaseLoader to load a blog post, and split the documents into smaller sub-documents.

from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=1000, chunk_overlap=0
)

loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
docs = loader.load()

split_docs = text_splitter.split_documents(docs)
print(f"Generated {len(split_docs)} documents.")
Created a chunk of size 1003, which is longer than the specified 1000
Generated 14 documents.

Create graph

Map step

First, let's define the prompt associated with the map step, and chain it with the LLM:

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

map_prompt = ChatPromptTemplate.from_messages(
    [("human", "Write a concise summary of the following:\n\n{context}")]
)

map_chain = map_prompt | llm | StrOutputParser()
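
As a quick, optional check, we can invoke the map chain on a single chunk from `split_docs` (illustrative only; the output will vary by model):

print(map_chain.invoke({"context": split_docs[0].page_content}))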

Reduce step

We also define a chain that takes the document mapping results and reduces them into a single output.

reduce_template = """
The following is a set of summaries:
{docs}
Take these and distill it into a final, consolidated summary
of the main themes.
"""

reduce_prompt = ChatPromptTemplate([("human", reduce_template)])

reduce_chain = reduce_prompt | llm | StrOutputParser()
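
As before, the reduce chain can be tried in isolation on a couple of toy summaries (the input below is invented for illustration):

sample_summaries = (
    "1. The post surveys LLM-powered autonomous agents.\n"
    "2. It covers planning, memory, and tool use."
)
print(reduce_chain.invoke({"docs": sample_summaries}))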

Orchestration via LangGraph

Below we implement a simple application that maps the summarization step over a list of documents, then reduces them using the above prompts.

Map-reduce flows are particularly useful when texts are long compared to the context window of an LLM. For long texts, we need a mechanism that ensures that the context to be summarized in the reduce step does not exceed a model's context window size. Here we implement a recursive "collapsing" of the summaries: the inputs are partitioned based on a token limit, and summaries are generated of the partitions. This step is repeated until the total length of the summaries is within the desired limit, allowing for the summarization of arbitrary-length text.
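
In outline, the collapsing loop behaves like the sketch below. This is a simplification under stated assumptions: `count_tokens` is a hypothetical word-count proxy for a real tokenizer, and the actual implementation that follows uses LangChain's split_list_of_docs and acollapse_docs utilities instead of the inline batching shown here.

def count_tokens(text: str) -> int:
    # Hypothetical proxy for a real tokenizer's token count.
    return len(text.split())

def collapse(summaries: list[str], token_max: int, summarize) -> list[str]:
    # Repeat until all summaries together fit within the token budget.
    # Assumes each individual summary already fits under token_max.
    while sum(count_tokens(s) for s in summaries) > token_max:
        # Greedily partition the summaries into batches under the limit ...
        batches, current, size = [], [], 0
        for s in summaries:
            if current and size + count_tokens(s) > token_max:
                batches.append(current)
                current, size = [], 0
            current.append(s)
            size += count_tokens(s)
        if current:
            batches.append(current)
        # ... and collapse each batch into a single shorter summary.
        summaries = [summarize("\n".join(batch)) for batch in batches]
    return summaries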

We will need to install langgraph:

pip install -qU langgraph
import operator
from typing import Annotated, List, Literal, TypedDict

from langchain.chains.combine_documents.reduce import (
    acollapse_docs,
    split_list_of_docs,
)
from langchain_core.documents import Document
from langgraph.constants import Send
from langgraph.graph import END, START, StateGraph

token_max = 1000


def length_function(documents: List[Document]) -> int:
    """Get number of tokens for input contents."""
    return sum(llm.get_num_tokens(doc.page_content) for doc in documents)


# This will be the overall state of the main graph.
# It will contain the input document contents, corresponding
# summaries, and a final summary.
class OverallState(TypedDict):
    # Notice here we use the operator.add
    # This is because we want to combine all the summaries we generate
    # from individual nodes back into one list - this is essentially
    # the "reduce" part
    contents: List[str]
    summaries: Annotated[list, operator.add]
    collapsed_summaries: List[Document]
    final_summary: str


# This will be the state of the node that we will "map" all
# documents to in order to generate summaries
class SummaryState(TypedDict):
    content: str


# Here we generate a summary, given a document
async def generate_summary(state: SummaryState):
    response = await map_chain.ainvoke(state["content"])
    return {"summaries": [response]}


# Here we define the logic to map out over the documents
# We will use this as an edge in the graph
def map_summaries(state: OverallState):
    # We will return a list of `Send` objects
    # Each `Send` object consists of the name of a node in the graph
    # as well as the state to send to that node
    return [
        Send("generate_summary", {"content": content}) for content in state["contents"]
    ]


def collect_summaries(state: OverallState):
    return {
        "collapsed_summaries": [Document(summary) for summary in state["summaries"]]
    }


# Add node to collapse summaries
async def collapse_summaries(state: OverallState):
    doc_lists = split_list_of_docs(
        state["collapsed_summaries"], length_function, token_max
    )
    results = []
    for doc_list in doc_lists:
        results.append(await acollapse_docs(doc_list, reduce_chain.ainvoke))

    return {"collapsed_summaries": results}


# This represents a conditional edge in the graph that determines
# if we should collapse the summaries or not
def should_collapse(
    state: OverallState,
) -> Literal["collapse_summaries", "generate_final_summary"]:
    num_tokens = length_function(state["collapsed_summaries"])
    if num_tokens > token_max:
        return "collapse_summaries"
    else:
        return "generate_final_summary"


# Here we will generate the final summary
async def generate_final_summary(state: OverallState):
    response = await reduce_chain.ainvoke(state["collapsed_summaries"])
    return {"final_summary": response}


# Construct the graph
# Nodes:
graph = StateGraph(OverallState)
graph.add_node("generate_summary", generate_summary) # same as before
graph.add_node("collect_summaries", collect_summaries)
graph.add_node("collapse_summaries", collapse_summaries)
graph.add_node("generate_final_summary", generate_final_summary)

# Edges:
graph.add_conditional_edges(START, map_summaries, ["generate_summary"])
graph.add_edge("generate_summary", "collect_summaries")
graph.add_conditional_edges("collect_summaries", should_collapse)
graph.add_conditional_edges("collapse_summaries", should_collapse)
graph.add_edge("generate_final_summary", END)

app = graph.compile()

LangGraph allows the graph structure to be plotted to help visualize its functionality:

from IPython.display import Image

Image(app.get_graph().draw_mermaid_png())

Invoke graph

When running the application, we can stream the graph to observe its sequence of steps. Below, we will simply print out the name of each step.

Note that because we have a loop in the graph, it can be helpful to specify a recursion limit on its execution. This will raise a specific error when the specified limit is exceeded.

async for step in app.astream(
    {"contents": [doc.page_content for doc in split_docs]},
    {"recursion_limit": 10},
):
    print(list(step.keys()))
['generate_summary']
['generate_summary']
['generate_summary']
['generate_summary']
['generate_summary']
['generate_summary']
['generate_summary']
['generate_summary']
['generate_summary']
['generate_summary']
['generate_summary']
['generate_summary']
['generate_summary']
['generate_summary']
['collect_summaries']
['collapse_summaries']
['collapse_summaries']
['generate_final_summary']
print(step)
{'generate_final_summary': {'final_summary': 'The consolidated summary of the main themes from the provided documents highlights the advancements and applications of large language models (LLMs) in artificial intelligence, particularly in autonomous agents and software development. Key themes include:\n\n1. **Integration of LLMs**: LLMs play a crucial role in enabling autonomous agents to perform complex tasks through advanced reasoning and decision-making techniques, such as Chain of Thought (CoT) and Tree of Thoughts.\n\n2. **Memory Management**: The categorization of memory into sensory, short-term, and long-term types parallels machine learning concepts, with short-term memory facilitating in-context learning and long-term memory enhanced by external storage solutions.\n\n3. **Tool Use and APIs**: Autonomous agents utilize external APIs to expand their capabilities, demonstrating adaptability and improved problem-solving skills.\n\n4. **Search Algorithms**: Various approximate nearest neighbor search algorithms, including Locality-Sensitive Hashing (LSH) and FAISS, are discussed for enhancing search efficiency in high-dimensional spaces.\n\n5. **Neuro-Symbolic Architectures**: The integration of neuro-symbolic systems, such as the MRKL framework, combines expert modules with LLMs to improve problem-solving, particularly in complex tasks.\n\n6. **Challenges and Innovations**: The documents address challenges like hallucination and inefficient planning in LLMs, alongside innovative methods such as Chain of Hindsight (CoH) and Algorithm Distillation (AD) for performance enhancement.\n\n7. **Software Development Practices**: The use of LLMs in software development is explored, particularly in creating structured applications like a Super Mario game using the model-view-controller (MVC) architecture, emphasizing task management, component organization, and documentation.\n\n8. **Limitations of LLMs**: Constraints such as finite context length and challenges in long-term planning are acknowledged, along with concerns regarding the reliability of natural language as an interface.\n\nOverall, the integration of LLMs and neuro-symbolic architectures signifies a significant evolution in AI, with ongoing research focused on enhancing planning, memory management, and problem-solving capabilities across various applications.'}}
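
If streaming the intermediate steps is not needed, the compiled graph can also be run to completion directly; ainvoke returns the final state, from which we can read the summary (a minimal usage sketch):

result = await app.ainvoke(
    {"contents": [doc.page_content for doc in split_docs]},
    {"recursion_limit": 10},
)
print(result["final_summary"])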

Next steps

Check out the LangGraph documentation for detail on building with LangGraph, including this guide on the details of map-reduce in LangGraph.

Check out the summarization how-to guides for additional summarization strategies, including those designed for larger volumes of text.

See also this tutorial for more detail on summarization.