How to create a custom Retriever

Overview

Many LLM applications involve retrieving information from external data sources using a Retriever.

A retriever is responsible for retrieving a list of relevant Documents for a given user query.

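The retrieved documents are often formatted into prompts that are fed into an LLM, allowing the LLM to use the information in them to generate an appropriate response (e.g., answering a user question based on a knowledge base).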

Interface

To create your own retriever, you need to extend the BaseRetriever class and implement the following methods:

| Method | Description | Required/Optional |
| --- | --- | --- |
| _get_relevant_documents | Get documents relevant to a query. | Required |
| _aget_relevant_documents | Implement to provide async native support. | Optional |

The logic inside _get_relevant_documents can involve arbitrary calls to a database or to the web using requests.
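To make the database case concrete, here is a minimal sketch of the kind of lookup that could sit inside _get_relevant_documents. It uses the standard-library sqlite3 module rather than a web request, and the notes table, its columns, and the search_notes helper are all hypothetical names chosen for this example.

```python
import sqlite3
from typing import List, Tuple


def search_notes(conn: sqlite3.Connection, query: str, k: int) -> List[Tuple[int, str]]:
    """Return up to k (id, body) rows whose body contains the query text."""
    # Escape LIKE wildcards so the query text is matched literally.
    escaped = query.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    cursor = conn.execute(
        "SELECT id, body FROM notes WHERE body LIKE ? ESCAPE '\\' ORDER BY id LIMIT ?",
        (f"%{escaped}%", k),
    )
    return cursor.fetchall()


# Hypothetical in-memory database used only to exercise the helper.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO notes (body) VALUES (?)",
    [("Cats enjoy their own space.",), ("Dogs are loyal.",), ("Rabbits need space.",)],
)
rows = search_notes(conn, "space", k=2)
print(rows)
```

Inside a retriever, each returned row would typically be wrapped in a Document (for example, with the body as page_content and the id in metadata) before being returned.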

Tip

By inheriting from BaseRetriever, your retriever automatically becomes a LangChain Runnable and gains the standard Runnable functionality out of the box!

Info

You can use a RunnableLambda or a RunnableGenerator to implement a retriever.

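The main benefit of implementing a retriever as a BaseRetriever rather than a RunnableLambda (a custom runnable function) is that a BaseRetriever is a well-known LangChain entity, so some monitoring tools may implement specialized behavior for retrievers. Another difference is that a BaseRetriever behaves slightly differently from a RunnableLambda in some APIs; for example, the start event in the astream_events API will be on_retriever_start instead of on_chain_start.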

Let's implement a toy retriever that returns the documents (up to k of them) whose text contains the text in the user query.

from typing import List

from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever


class ToyRetriever(BaseRetriever):
    """A toy retriever that returns the top k documents that contain the user query.

    This retriever only implements the sync method _get_relevant_documents.

    If the retriever were to involve file access or network access, it could benefit
    from a native async implementation of `_aget_relevant_documents`.

    As usual, with Runnables, there's a default async implementation that's provided
    that delegates to the sync implementation running on another thread.
    """

    documents: List[Document]
    """List of documents to retrieve from."""
    k: int
    """Number of top results to return"""

    def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        """Sync implementation for the retriever."""
        matching_documents = []
        for document in self.documents:
            # Stop as soon as k matches have been collected.
            if len(matching_documents) >= self.k:
                return matching_documents

            # Case-insensitive substring match against the document text.
            if query.lower() in document.page_content.lower():
                matching_documents.append(document)
        return matching_documents

    # Optional: Provide a more efficient native implementation by overriding
    # _aget_relevant_documents
    # async def _aget_relevant_documents(
    #     self, query: str, *, run_manager: AsyncCallbackManagerForRetrieverRun
    # ) -> List[Document]:
    #     """Asynchronously get documents relevant to a query.
    #
    #     Args:
    #         query: String to find relevant documents for
    #         run_manager: The callbacks handler to use
    #
    #     Returns:
    #         List of relevant documents
    #     """

Test it 🧪

documents = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"type": "dog", "trait": "loyalty"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"type": "cat", "trait": "independence"},
    ),
    Document(
        page_content="Goldfish are popular pets for beginners, requiring relatively simple care.",
        metadata={"type": "fish", "trait": "low maintenance"},
    ),
    Document(
        page_content="Parrots are intelligent birds capable of mimicking human speech.",
        metadata={"type": "bird", "trait": "intelligence"},
    ),
    Document(
        page_content="Rabbits are social animals that need plenty of space to hop around.",
        metadata={"type": "rabbit", "trait": "social"},
    ),
]
retriever = ToyRetriever(documents=documents, k=3)
retriever.invoke("that")
[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'type': 'cat', 'trait': 'independence'}),
Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'type': 'rabbit', 'trait': 'social'})]

It's a runnable, so it benefits from the standard Runnable interface! 🤩

await retriever.ainvoke("that")
[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'type': 'cat', 'trait': 'independence'}),
Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'type': 'rabbit', 'trait': 'social'})]
retriever.batch(["dog", "cat"])
[[Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'type': 'dog', 'trait': 'loyalty'})],
[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'type': 'cat', 'trait': 'independence'})]]
async for event in retriever.astream_events("bar", version="v1"):
    print(event)
{'event': 'on_retriever_start', 'run_id': 'f96f268d-8383-4921-b175-ca583924d9ff', 'name': 'ToyRetriever', 'tags': [], 'metadata': {}, 'data': {'input': 'bar'}}
{'event': 'on_retriever_stream', 'run_id': 'f96f268d-8383-4921-b175-ca583924d9ff', 'tags': [], 'metadata': {}, 'name': 'ToyRetriever', 'data': {'chunk': []}}
{'event': 'on_retriever_end', 'name': 'ToyRetriever', 'run_id': 'f96f268d-8383-4921-b175-ca583924d9ff', 'tags': [], 'metadata': {}, 'data': {'output': []}}

Contributing

We appreciate contributions of interesting retrievers!

Here's a checklist to help make sure your contribution gets added to LangChain:

Documentation:

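  • The retriever contains doc-strings for all initialization arguments, as these will be surfaced in the API Reference.
  • The class doc-string for the model contains a link to any relevant APIs used for the retriever (e.g., if the retriever retrieves from Wikipedia, it is good to link to the Wikipedia API!).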

Tests:

  • Add unit or integration tests to verify that invoke and ainvoke work.

Optimizations:

If the retriever connects to external data sources (e.g., an API or a file), it will almost certainly benefit from a native async optimization!

  • Provide a native async implementation of _aget_relevant_documents (used by ainvoke).