How to retrieve using multiple vectors per document

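It is often useful to store multiple vectors per document. For example, we can embed several chunks of a document and associate those embeddings with the parent document, so that a retriever hit on a chunk returns the larger document.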

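LangChain implements a base MultiVectorRetriever, which simplifies this process. Much of the complexity lies in how to create the multiple vectors per document. This notebook covers ways to create those vectors and how to use the MultiVectorRetriever.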

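The methods to create multiple vectors per document include:

Smaller chunks: split a document into smaller chunks and embed those (this is what ParentDocumentRetriever does).
Summary: create a summary for each document and embed it along with (or instead of) the document.
Hypothetical questions: create hypothetical questions that each document would be appropriate to answer, and embed those along with (or instead of) the document.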

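Note that this also enables another way of adding embeddings: manually. This is useful because you can explicitly add questions or queries that should lead to a document being retrieved, which gives you more control.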
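To make the manual option concrete, here is a minimal, dependency-free sketch. It is not LangChain code: the `entries` list, the `overlap` scorer, and the policy documents are all hypothetical, and word overlap stands in for real embedding similarity. It only illustrates that several index entries can point at one parent, and that a hand-written query can be one of them.

```python
# Toy stand-in for a vector store: each entry pairs a text with its parent id.
# A real store compares embeddings; counting shared words is an assumption
# made only to keep this sketch dependency-free.
parents = {
    "shipping-policy": "Full shipping policy text",
    "refund-policy": "Full refund policy text",
}
entries = [
    ("Orders ship within two business days.", "shipping-policy"),
    ("Refunds are issued within 14 days of a return.", "refund-policy"),
]


def overlap(query, text):
    """Number of lowercase words shared by the query and the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve(query):
    """Return the parent document of the best-scoring entry."""
    _, parent_id = max(entries, key=lambda e: overlap(query, e[0]))
    return parents[parent_id]


query = "how can i get my money back"
# No entry shares a word with the query, so the first entry wins the tie.
before = retrieve(query)

# Manually add a query that should lead back to the refund policy.
entries.append(("how do i get my money back", "refund-policy"))
after = retrieve(query)
```

Before the manual entry is added, the query falls through to an arbitrary document; afterwards it resolves to the refund policy, even though the policy text itself shares no words with the query.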

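Below we walk through an example. First, we instantiate some documents. We will index them in an (in-memory) Chroma vector store using OpenAI embeddings, but any LangChain vector store or embedding model will suffice.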

%pip install --upgrade --quiet  langchain-chroma langchain langchain-openai > /dev/null

from langchain.storage import InMemoryByteStore
from langchain_chroma import Chroma
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

loaders = [
    TextLoader("paul_graham_essay.txt"),
    TextLoader("state_of_the_union.txt"),
]
docs = []
for loader in loaders:
    docs.extend(loader.load())
text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000)
docs = text_splitter.split_documents(docs)

# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="full_documents", embedding_function=OpenAIEmbeddings()
)

API Reference: InMemoryByteStore | TextLoader | OpenAIEmbeddings | RecursiveCharacterTextSplitter

Smaller chunks

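It is often useful to retrieve larger chunks of information but embed smaller chunks. This lets the embeddings capture the semantic meaning as closely as possible while passing as much context as possible downstream. Note that this is what ParentDocumentRetriever does. Here we show what happens under the hood.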

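We will distinguish between the vector store, which indexes the embeddings of the (sub) documents, and the document store, which holds the "parent" documents and associates them with an identifier.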

import uuid

from langchain.retrievers.multi_vector import MultiVectorRetriever

# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"

# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)

doc_ids = [str(uuid.uuid4()) for _ in docs]

API Reference: MultiVectorRetriever

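Next, we generate the "sub" documents by splitting the original documents. Note that we store the document identifier in the metadata of the corresponding Document object.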

# The splitter to use to create smaller chunks
child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

sub_docs = []
for i, doc in enumerate(docs):
    _id = doc_ids[i]
    _sub_docs = child_text_splitter.split_documents([doc])
    for _doc in _sub_docs:
        _doc.metadata[id_key] = _id
    sub_docs.extend(_sub_docs)

Finally, we index the documents in our vector store and document store:

retriever.vectorstore.add_documents(sub_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

The vector store alone will retrieve the small chunks:

retriever.vectorstore.similarity_search("justice breyer")[0]

Document(page_content='Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.', metadata={'doc_id': '064eca46-a4c4-4789-8e3b-583f9597e54f', 'source': 'state_of_the_union.txt'})

Whereas the retriever will return the larger parent document:

len(retriever.invoke("justice breyer")[0].page_content)
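The step from chunk hits to parent documents can be sketched roughly as follows. This is a simplified illustration with plain dicts, not the library's implementation; `resolve_parents` and the sample data are hypothetical. The idea is to collect the parent identifiers found in the hits' metadata, keep each one once in ranking order, and look the parents up in the document store.

```python
def resolve_parents(hit_metadatas, docstore, id_key="doc_id"):
    """Map ranked chunk hits to unique parent documents, preserving rank order."""
    ids = []
    for metadata in hit_metadatas:
        parent_id = metadata.get(id_key)
        # Skip hits without a parent id and parents we have already seen.
        if parent_id is not None and parent_id not in ids:
            ids.append(parent_id)
    # Ignore ids that are missing from the document store.
    return [docstore[i] for i in ids if i in docstore]


# Two hits point at parent "b", one at parent "a", one has no parent id.
hit_metadatas = [{"doc_id": "b"}, {"doc_id": "a"}, {"doc_id": "b"}, {}]
docstore = {"a": "parent A", "b": "parent B"}

parent_docs = resolve_parents(hit_metadatas, docstore)
```

Because several chunks of the same parent can match a query, the number of parents returned may be smaller than the number of chunk hits.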

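The default search type the retriever performs on the vector database is a similarity search. LangChain vector stores also support searching via Max Marginal Relevance. This can be controlled via the search_type parameter of the retriever: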

from langchain.retrievers.multi_vector import SearchType

retriever.search_type = SearchType.mmr

len(retriever.invoke("justice breyer")[0].page_content)

API Reference: SearchType
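For intuition about what a Max Marginal Relevance search trades off, here is a small pure-Python sketch of the general technique. It is an illustration under assumptions, not the code any particular vector store runs: the `mmr` function, its `lambda_mult` parameter name, and the two-dimensional vectors are all chosen for this example. Each step picks the candidate that is most relevant to the query while being least similar to the items already selected.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mmr(query, candidates, k=2, lambda_mult=0.5):
    """Return indices of k candidates chosen by maximal marginal relevance.

    lambda_mult=1.0 ranks purely by relevance; lower values favor diversity.
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def mmr_score(i):
            relevance = cosine(query, candidates[i])
            # Similarity to the closest already-selected item (0 if none yet).
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
candidates = [
    [1.0, 0.1],   # very relevant
    [1.0, 0.12],  # near-duplicate of the first candidate
    [0.6, -0.8],  # less relevant, but different from the first
]

diverse = mmr(query, candidates, k=2, lambda_mult=0.5)
relevance_only = mmr(query, candidates, k=2, lambda_mult=1.0)
```

With `lambda_mult=0.5` the near-duplicate is skipped in favor of the more distinct third candidate, whereas `lambda_mult=1.0` reduces to a plain similarity ranking.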

Associating summaries with a document for retrieval

A summary may distill more accurately what a chunk is about, leading to better retrieval. Here we show how to create summaries and then embed them.

We construct a simple chain that receives an input Document object and generates a summary using an LLM.

Select chat model:

pip install -qU "langchain[openai]"

import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
  os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("gpt-4o-mini", model_provider="openai")

import uuid

from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | llm
    | StrOutputParser()
)

API Reference: Document | StrOutputParser | ChatPromptTemplate

Note that we can batch the chain across documents:

summaries = chain.batch(docs, {"max_concurrency": 5})
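Conceptually, batching with a concurrency limit resembles mapping a function over the inputs with a bounded thread pool. The sketch below uses only the standard library and is not how the chain is implemented; `fake_summarize` and `batch` are hypothetical stand-ins. The property worth noticing is that results come back in input order, which the indexing code relies on when it pairs each summary with `doc_ids[i]`.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_summarize(text):
    """Hypothetical stand-in for an LLM call: keep only the first sentence."""
    return text.split(".")[0] + "."


def batch(func, inputs, max_concurrency=5):
    """Apply func to every input using at most max_concurrency worker threads."""
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # Executor.map yields results in the order of the inputs.
        return list(pool.map(func, inputs))


texts = ["First doc. More text.", "Second doc. Even more.", "Third doc. The end."]
toy_summaries = batch(fake_summarize, texts, max_concurrency=2)
```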

We can then initialize a MultiVectorRetriever as before, indexing the summaries in our vector store and retaining the original documents in our document store:

# The vectorstore to use to index the child chunks
vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]

summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]

retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

# # We can also add the original chunks to the vectorstore if we so want
# for i, doc in enumerate(docs):
#     doc.metadata[id_key] = doc_ids[i]
# retriever.vectorstore.add_documents(docs)

Querying the vector store will return summaries:

sub_docs = retriever.vectorstore.similarity_search("justice breyer")

sub_docs[0]

Document(page_content="President Biden recently nominated Judge Ketanji Brown Jackson to serve on the United States Supreme Court, emphasizing her qualifications and broad support. The President also outlined a plan to secure the border, fix the immigration system, protect women's rights, support LGBTQ+ Americans, and advance mental health services. He highlighted the importance of bipartisan unity in passing legislation, such as the Violence Against Women Act. The President also addressed supporting veterans, particularly those impacted by exposure to burn pits, and announced plans to expand benefits for veterans with respiratory cancers. Additionally, he proposed a plan to end cancer as we know it through the Cancer Moonshot initiative. President Biden expressed optimism about the future of America and emphasized the strength of the American people in overcoming challenges.", metadata={'doc_id': '84015b1b-980e-400a-94d8-cf95d7e079bd'})

Whereas the retriever will return the larger source document:

retrieved_docs = retriever.invoke("justice breyer")

len(retrieved_docs[0].page_content)

Hypothetical queries

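An LLM can also be used to generate a list of hypothetical questions that could be asked of a particular document, which might bear close semantic similarity to relevant queries in a RAG application. These questions can then be embedded and associated with the documents to improve retrieval.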

Below, we use the with_structured_output method to structure the LLM output into a list of strings.

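from typing import List

# ChatOpenAI is used in the chain below but was never imported
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field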


class HypotheticalQuestions(BaseModel):
    """Generate hypothetical questions."""

    questions: List[str] = Field(..., description="List of questions")


chain = (
    {"doc": lambda x: x.page_content}
    # Only asking for 3 hypothetical questions, but this could be adjusted
    | ChatPromptTemplate.from_template(
        "Generate a list of exactly 3 hypothetical questions that the below document could be used to answer:\n\n{doc}"
    )
    | ChatOpenAI(max_retries=0, model="gpt-4o").with_structured_output(
        HypotheticalQuestions
    )
    | (lambda x: x.questions)
)

Invoking the chain on a single document demonstrates that it outputs a list of questions:

chain.invoke(docs[0])

["What impact did the IBM 1401 have on the author's early programming experiences?",
 "How did the transition from using the IBM 1401 to microcomputers influence the author's programming journey?",
 "What role did Lisp play in shaping the author's understanding and approach to AI?"]

We can then batch the chain over all documents and assemble our vector store and document store as before:

# Batch chain over documents to generate hypothetical questions
hypothetical_questions = chain.batch(docs, {"max_concurrency": 5})


# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="hypo-questions", embedding_function=OpenAIEmbeddings()
)
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]


# Generate Document objects from hypothetical questions
question_docs = []
for i, question_list in enumerate(hypothetical_questions):
    question_docs.extend(
        [Document(page_content=s, metadata={id_key: doc_ids[i]}) for s in question_list]
    )


retriever.vectorstore.add_documents(question_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

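Note that querying the underlying vector store will retrieve hypothetical questions that are semantically similar to the input query: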

sub_docs = retriever.vectorstore.similarity_search("justice breyer")

sub_docs

[Document(page_content='What might be the potential benefits of nominating Circuit Court of Appeals Judge Ketanji Brown Jackson to the United States Supreme Court?', metadata={'doc_id': '43292b74-d1b8-4200-8a8b-ea0cb57fbcdb'}),
 Document(page_content='How might the Bipartisan Infrastructure Law impact the economic competition between the U.S. and China?', metadata={'doc_id': '66174780-d00c-4166-9791-f0069846e734'}),
 Document(page_content='What factors led to the creation of Y Combinator?', metadata={'doc_id': '72003c4e-4cc9-4f09-a787-0b541a65b38c'}),
 Document(page_content='How did the ability to publish essays online change the landscape for writers and thinkers?', metadata={'doc_id': 'e8d2c648-f245-4bcc-b8d3-14e64a164b64'})]

Invoking the retriever will return the corresponding document:

retrieved_docs = retriever.invoke("justice breyer")
len(retrieved_docs[0].page_content)