
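BoxRetriever

This will help you get started with the Box retriever. For detailed documentation of all BoxRetriever features and configurations, head to the API reference.

Overview

The BoxRetriever class helps you retrieve your content from Box in LangChain's Document format. You can do this either by running a full-text search for files, or by using Box AI to retrieve a Document containing the result of an AI query against files. The Box AI mode requires a List[str] of Box file IDs, e.g. ["12345","67890"].

Info

Box AI requires an Enterprise Plus license.

Files without a text representation will be skipped.

Integration details

1: Bring-your-own data (i.e., index and search a custom corpus of documents):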

Retriever | Self-host | Cloud offering | Package
BoxRetriever | No | Yes | langchain-box

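Setup

In order to use the Box package, you will need a few things:

  • A Box account: if you are not a current Box customer, or want to test outside of your production Box instance, you can use a free developer account.
  • A Box app: this is configured in the developer console, and for Box AI it must have the Manage AI scope enabled. Here you will also select your authentication method.
  • The app must be enabled by the administrator. For free developer accounts, this is whoever signed up for the account.

Credentials

For these examples, we will use token authentication. A token can be obtained with any authentication method, so use whichever one suits you. If you want to learn more about how to use other authentication types with langchain-box, visit the Box provider document.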
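Since any method of obtaining a token works, one option is to read it from the environment and only prompt when it is missing. The following is a minimal sketch; the variable name BOX_DEVELOPER_TOKEN and the helper get_box_token are illustrative assumptions, not something langchain-box defines or reads.

```python
import getpass
import os


def get_box_token(env=os.environ, var_name="BOX_DEVELOPER_TOKEN"):
    """Return the token from the environment, prompting only if it is unset.

    `var_name` is a hypothetical variable name chosen for this sketch.
    """
    token = env.get(var_name)
    if not token:
        # Fall back to an interactive prompt when nothing is configured.
        token = getpass.getpass("Enter your Box Developer Token: ")
    return token


# Demonstration with a stand-in mapping, so no prompt is shown.
print(get_box_token(env={"BOX_DEVELOPER_TOKEN": "example-token"}))
```

The returned string can then be passed wherever this guide uses box_developer_token.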

import getpass
import os

box_developer_token = getpass.getpass("Enter your Box Developer Token: ")

If you want to get automated tracing from individual queries, you can also set your LangSmith API key by uncommenting the lines below:

# os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
# os.environ["LANGSMITH_TRACING"] = "true"

Installation

This retriever lives in the langchain-box package:

%pip install -qU langchain-box
Note: you may need to restart the kernel to use updated packages.

Instantiation

Now we can instantiate our retriever:

from langchain_box import BoxRetriever

retriever = BoxRetriever(box_developer_token=box_developer_token)

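For more granular search, we offer a series of options to help you filter down the results. This uses langchain_box.utilities.BoxSearchOptions in conjunction with the langchain_box.utilities.SearchTypeFilter and langchain_box.utilities.DocumentFiles enums to filter on things like created date, which part of the file to search, and even to limit the search scope to a specific folder.

For more information, check out the API reference.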
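The date filters in this guide are written as timestamp strings with a UTC offset, such as "2023-01-01T00:00:00-07:00". Building them from datetime objects avoids hand-typed formatting slips. This is a minimal sketch using only the standard library; the date_range helper is hypothetical, and the UTC-7 offset is assumed only because it matches the sample values in this guide.

```python
from datetime import datetime, timedelta, timezone

# Assumption: UTC-7 mirrors the offset in this guide's sample values.
UTC_MINUS_7 = timezone(timedelta(hours=-7))


def date_range(start, end):
    """Return [start, end] as ISO 8601 strings, rejecting inverted ranges."""
    if start > end:
        raise ValueError("start must not be later than end")
    return [start.isoformat(), end.isoformat()]


created = date_range(
    datetime(2023, 1, 1, tzinfo=UTC_MINUS_7),
    datetime(2024, 8, 1, tzinfo=UTC_MINUS_7),
)
print(created)
```

The resulting list has the same two-string shape as the created_date_range value used in this guide.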

from langchain_box.utilities import BoxSearchOptions, DocumentFiles, SearchTypeFilter

box_folder_id = "260931903795"

box_search_options = BoxSearchOptions(
    ancestor_folder_ids=[box_folder_id],
    search_type_filter=[SearchTypeFilter.FILE_CONTENT],
    created_date_range=["2023-01-01T00:00:00-07:00", "2024-08-01T00:00:00-07:00"],
    k=200,
    size_range=[1, 1000000],
    updated_date_range=None,
)

retriever = BoxRetriever(
    box_developer_token=box_developer_token, box_search_options=box_search_options
)

retriever.invoke("AstroTech Solutions")
[Document(metadata={'source': 'https://dl.boxcloud.com/api/2.0/internal_files/1514555423624/versions/1663171610024/representations/extracted_text/content/', 'title': 'Invoice-A5555_txt'}, page_content='Vendor: AstroTech Solutions\nInvoice Number: A5555\n\nLine Items:\n    - Gravitational Wave Detector Kit: $800\n    - Exoplanet Terrarium: $120\nTotal: $920')]
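Each result carries a metadata dictionary and a page_content string, as the output above shows. A small helper can turn a result list into one-line summaries for quick inspection. This sketch uses SimpleNamespace stand-ins rather than real LangChain Document objects so it runs without a Box account; summarize is a hypothetical helper, not part of langchain-box.

```python
from types import SimpleNamespace


def summarize(docs, preview_chars=40):
    """Return one 'title: preview' line per document."""
    lines = []
    for doc in docs:
        title = doc.metadata.get("title", "(untitled)")
        # Flatten newlines so each summary stays on a single line.
        preview = doc.page_content[:preview_chars].replace("\n", " ")
        lines.append(f"{title}: {preview}")
    return lines


# Stand-in shaped like the search result shown above.
sample = [
    SimpleNamespace(
        metadata={"title": "Invoice-A5555_txt"},
        page_content="Vendor: AstroTech Solutions\nInvoice Number: A5555",
    )
]

for line in summarize(sample):
    print(line)
```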

Box AI

from langchain_box import BoxRetriever

box_file_ids = ["1514555423624", "1514553902288"]

retriever = BoxRetriever(
    box_developer_token=box_developer_token, box_file_ids=box_file_ids
)
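The file IDs are passed as a list of strings. If your IDs come from another system as integers or padded strings, a small normalizer can catch mistakes before the retriever is built. This is a sketch under the assumption that Box file IDs are purely numeric, as every ID in this guide is; to_box_file_ids is a hypothetical helper.

```python
def to_box_file_ids(raw_ids):
    """Convert ints or strings into a list of numeric ID strings."""
    file_ids = []
    for raw in raw_ids:
        file_id = str(raw).strip()
        # Assumption: IDs consist only of digits, as in this guide's examples.
        if not file_id.isdigit():
            raise ValueError(f"not a numeric Box file ID: {raw!r}")
        file_ids.append(file_id)
    return file_ids


print(to_box_file_ids([1514555423624, " 1514553902288 "]))
```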

Usage

query = "What was the most expensive item purchased"

retriever.invoke(query)
[Document(metadata={'source': 'Box AI', 'title': 'Box AI What was the most expensive item purchased'}, page_content='The most expensive item purchased is the **Gravitational Wave Detector Kit** from AstroTech Solutions, which costs **$800**.')]

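Citations

With Box AI and the BoxRetriever, you can return the answer to your prompt, return the citations Box used to get that answer, or both. No matter how you choose to use Box AI, the retriever returns a List[Document] object. We offer this flexibility with two bool arguments, answer and citations. answer defaults to True and citations defaults to False, so you can omit both if you only want the answer. If you want both, you only need to include citations=True. If you only want citations, include answer=False and citations=True.

Get both

retriever = BoxRetriever(
    box_developer_token=box_developer_token, box_file_ids=box_file_ids, citations=True
)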

retriever.invoke(query)
[Document(metadata={'source': 'Box AI', 'title': 'Box AI What was the most expensive item purchased'}, page_content='The most expensive item purchased is the **Gravitational Wave Detector Kit** from AstroTech Solutions, which costs **$800**.'),
Document(metadata={'source': 'Box AI What was the most expensive item purchased', 'file_name': 'Invoice-A5555.txt', 'file_id': '1514555423624', 'file_type': 'file'}, page_content='Vendor: AstroTech Solutions\nInvoice Number: A5555\n\nLine Items:\n - Gravitational Wave Detector Kit: $800\n - Exoplanet Terrarium: $120\nTotal: $920')]
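When both are requested, the answer and its citations arrive in one flat list. In the output above, the answer's metadata has 'source': 'Box AI' while each citation carries a 'file_id'. The sketch below separates the two on that basis; this rule is inferred from the sample output only and is an assumption, not a documented contract. It uses SimpleNamespace stand-ins so it runs without Box access.

```python
from types import SimpleNamespace


def split_answer_and_citations(docs):
    """Separate answer documents from citation documents.

    Assumption (from the sample output): answers have source == 'Box AI',
    citations include a 'file_id' key in their metadata.
    """
    answers = [d for d in docs if d.metadata.get("source") == "Box AI"]
    citations = [d for d in docs if "file_id" in d.metadata]
    return answers, citations


# Stand-ins shaped like the two documents shown above (content shortened).
docs = [
    SimpleNamespace(
        metadata={"source": "Box AI", "title": "Box AI What was the most expensive item purchased"},
        page_content="The most expensive item purchased is the Gravitational Wave Detector Kit.",
    ),
    SimpleNamespace(
        metadata={"file_name": "Invoice-A5555.txt", "file_id": "1514555423624", "file_type": "file"},
        page_content="Vendor: AstroTech Solutions",
    ),
]

answers, citations = split_answer_and_citations(docs)
print(len(answers), len(citations))
```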

Citations only

retriever = BoxRetriever(
    box_developer_token=box_developer_token,
    box_file_ids=box_file_ids,
    answer=False,
    citations=True,
)

retriever.invoke(query)
[Document(metadata={'source': 'Box AI What was the most expensive item purchased', 'file_name': 'Invoice-A5555.txt', 'file_id': '1514555423624', 'file_type': 'file'}, page_content='Vendor: AstroTech Solutions\nInvoice Number: A5555\n\nLine Items:\n    - Gravitational Wave Detector Kit: $800\n    - Exoplanet Terrarium: $120\nTotal: $920')]

Use within a chain

Like other retrievers, BoxRetriever can be incorporated into LLM applications via chains.

We will need an LLM or chat model:

pip install -qU "langchain[openai]"
import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("gpt-4o-mini", model_provider="openai")
openai_key = getpass.getpass("Enter your OpenAI key: ")
Enter your OpenAI key:  ········
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

box_search_options = BoxSearchOptions(
    ancestor_folder_ids=[box_folder_id],
    search_type_filter=[SearchTypeFilter.FILE_CONTENT],
    created_date_range=["2023-01-01T00:00:00-07:00", "2024-08-01T00:00:00-07:00"],
    k=200,
    size_range=[1, 1000000],
    updated_date_range=None,
)

retriever = BoxRetriever(
    box_developer_token=box_developer_token, box_search_options=box_search_options
)

context = "You are a finance professional that handles invoices and purchase orders."
question = "Show me all the items purchased from AstroTech Solutions"

prompt = ChatPromptTemplate.from_template(
"""Answer the question based only on the context provided.

Context: {context}

Question: {question}"""
)


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
chain.invoke(question)
'- Gravitational Wave Detector Kit: $800\n- Exoplanet Terrarium: $120'
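The format_docs step in the chain above is what turns the retrieved documents into the single string substituted for {context}. To see its effect in isolation, the sketch below applies the same function to stand-in objects with made-up content; no retriever or model is involved.

```python
from types import SimpleNamespace


def format_docs(docs):
    # Join each document's text with a blank line between documents.
    return "\n\n".join(doc.page_content for doc in docs)


# Stand-ins with illustrative text, not real Box content.
docs = [
    SimpleNamespace(page_content="First invoice text"),
    SimpleNamespace(page_content="Second invoice text"),
]

context = format_docs(docs)
print(context)
```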

Use as an agent tool

Like other retrievers, BoxRetriever can also be added to a LangGraph agent as a tool.

pip install -U langsmith
from langchain import hub
from langchain.agents import AgentExecutor, create_openai_tools_agent
from langchain.tools.retriever import create_retriever_tool
from langchain_openai import ChatOpenAI  # needed for the ChatOpenAI call below

box_search_options = BoxSearchOptions(
    ancestor_folder_ids=[box_folder_id],
    search_type_filter=[SearchTypeFilter.FILE_CONTENT],
    created_date_range=["2023-01-01T00:00:00-07:00", "2024-08-01T00:00:00-07:00"],
    k=200,
    size_range=[1, 1000000],
    updated_date_range=None,
)

retriever = BoxRetriever(
    box_developer_token=box_developer_token, box_search_options=box_search_options
)

box_search_tool = create_retriever_tool(
    retriever,
    "box_search_tool",
    "This tool is used to search Box and retrieve documents that match the search criteria",
)
tools = [box_search_tool]
prompt = hub.pull("hwchase17/openai-tools-agent")
prompt.messages

llm = ChatOpenAI(temperature=0, openai_api_key=openai_key)

agent = create_openai_tools_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools)
result = agent_executor.invoke(
    {
        "input": "list the items I purchased from AstroTech Solutions from most expensive to least expensive"
    }
)
print(f"result {result['output']}")
result The items you purchased from AstroTech Solutions from most expensive to least expensive are:

1. Gravitational Wave Detector Kit: $800
2. Exoplanet Terrarium: $120

Total: $920

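Extra fields

All Box connectors offer the ability to select additional fields from the Box FileFull object to return as custom LangChain metadata. Each object accepts an optional List[str] called extra_fields, containing the JSON keys from the returned object, such as extra_fields=["shared_link"].

The connector adds these fields to the list of fields the integration needs to function, and then adds the results to the metadata returned in the Document or Blob, like "metadata" : { "source" : "source", "shared_link" : "shared_link" }. If a field is unavailable for a file, it is returned as an empty string, like "shared_link" : "".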
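Because an unavailable field comes back as an empty string rather than being omitted, downstream code may want to treat "missing" and "empty" the same way. The following is a minimal sketch over plain metadata dictionaries; get_extra_field is a hypothetical helper and the link value is a placeholder.

```python
def get_extra_field(metadata, key):
    """Return the field value, or None when it is missing or empty."""
    value = metadata.get(key, "")
    return value or None


# Metadata shaped like the examples above; the URL is a placeholder.
with_link = {"source": "source", "shared_link": "https://example.com/s/abc"}
without_link = {"source": "source", "shared_link": ""}

print(get_extra_field(with_link, "shared_link"))
print(get_extra_field(without_link, "shared_link"))
```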

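API reference

For detailed documentation of all BoxRetriever features and configurations, head to the API reference.

Help

If you have questions, you can check out our developer documentation or reach out to us in our developer community.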