VectorizeRetriever
This notebook shows how to use the LangChain VectorizeRetriever.
Vectorize helps you build AI apps faster and with less hassle. It automates data extraction, finds the best vectorization strategy using RAG evaluation, and lets you quickly deploy real-time RAG pipelines for your unstructured data. Your vector search indexes stay up to date, and it integrates with your existing vector database, so you maintain full control of your data. Vectorize handles the heavy lifting, freeing you to focus on building robust AI solutions without getting bogged down by data management.
Setup
In the following steps, we'll set up the Vectorize environment and create a RAG pipeline.
Create a Vectorize account and get your access token
1. Sign up for a free Vectorize account here
2. Generate an access token in the Access Tokens section
3. Gather your organization ID: from the browser URL, extract the UUID that follows /organization/
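If you prefer not to copy the UUID by hand, it can be pulled out of the URL with a short regular expression. A minimal sketch using only the standard library; the URL below is just an illustrative example, not a real organization:

```python
import re

# Example dashboard URL copied from the browser (illustrative only).
url = "https://platform.vectorize.io/organization/12345678-abcd-4ef0-9abc-1234567890ab/pipelines"

# The organization ID is the 36-character UUID after /organization/ in the path.
match = re.search(r"/organization/([0-9a-fA-F-]{36})", url)
org_id = match.group(1) if match else None
print(org_id)  # 12345678-abcd-4ef0-9abc-1234567890ab
```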
Configure the token and organization ID
import getpass
VECTORIZE_ORG_ID = getpass.getpass("Enter Vectorize organization ID: ")
VECTORIZE_API_TOKEN = getpass.getpass("Enter Vectorize API Token: ")
Installation
This retriever lives in the langchain-vectorize package:
!pip install -qU langchain-vectorize
Download a PDF file
!wget "https://raw.githubusercontent.com/vectorize-io/vectorize-clients/refs/tags/python-0.1.3/tests/python/tests/research.pdf"
Initialize the Vectorize client
import vectorize_client as v
api = v.ApiClient(v.Configuration(access_token=VECTORIZE_API_TOKEN))
Create a file upload source connector
import json
import os
import urllib3
connectors_api = v.ConnectorsApi(api)
response = connectors_api.create_source_connector(
    VECTORIZE_ORG_ID, [{"type": "FILE_UPLOAD", "name": "From API"}]
)
source_connector_id = response.connectors[0].id
Upload the PDF file
file_path = "research.pdf"
http = urllib3.PoolManager()
uploads_api = v.UploadsApi(api)
metadata = {"created-from-api": True}
upload_response = uploads_api.start_file_upload_to_connector(
    VECTORIZE_ORG_ID,
    source_connector_id,
    v.StartFileUploadToConnectorRequest(
        name=file_path.split("/")[-1],
        content_type="application/pdf",
        # add additional metadata that will be stored along with each chunk in the vector database
        metadata=json.dumps(metadata),
    ),
)
with open(file_path, "rb") as f:
    response = http.request(
        "PUT",
        upload_response.upload_url,
        body=f,
        headers={
            "Content-Type": "application/pdf",
            "Content-Length": str(os.path.getsize(file_path)),
        },
    )
if response.status != 200:
    print("Upload failed: ", response.data)
else:
    print("Upload successful")
Connect to the AI platform and vector database
ai_platforms = connectors_api.get_ai_platform_connectors(VECTORIZE_ORG_ID)
builtin_ai_platform = [
    c.id for c in ai_platforms.ai_platform_connectors if c.type == "VECTORIZE"
][0]
vector_databases = connectors_api.get_destination_connectors(VECTORIZE_ORG_ID)
builtin_vector_db = [
    c.id for c in vector_databases.destination_connectors if c.type == "VECTORIZE"
][0]
Configure and deploy the pipeline
pipelines = v.PipelinesApi(api)
response = pipelines.create_pipeline(
    VECTORIZE_ORG_ID,
    v.PipelineConfigurationSchema(
        source_connectors=[
            v.SourceConnectorSchema(
                id=source_connector_id, type="FILE_UPLOAD", config={}
            )
        ],
        destination_connector=v.DestinationConnectorSchema(
            id=builtin_vector_db, type="VECTORIZE", config={}
        ),
        ai_platform=v.AIPlatformSchema(
            id=builtin_ai_platform, type="VECTORIZE", config={}
        ),
        pipeline_name="My Pipeline From API",
        schedule=v.ScheduleSchema(type="manual"),
    ),
)
pipeline_id = response.data.id
Configure tracing (optional)
If you want automatic tracing of individual queries, you can also set your LangSmith API key by uncommenting the code below:
# os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
# os.environ["LANGSMITH_TRACING"] = "true"
Instantiation
from langchain_vectorize.retrievers import VectorizeRetriever
retriever = VectorizeRetriever(
    api_token=VECTORIZE_API_TOKEN,
    organization=VECTORIZE_ORG_ID,
    pipeline_id=pipeline_id,
)
Usage
query = "Apple Shareholders equity"
retriever.invoke(query, num_results=2)
Use within a chain
Like other retrievers, the VectorizeRetriever can be incorporated into LLM applications via chains.
We will need an LLM or chat model:
Select chat model:
pip install -qU "langchain[openai]"
import getpass
import os
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")
from langchain.chat_models import init_chat_model
llm = init_chat_model("gpt-4o-mini", model_provider="openai")
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
prompt = ChatPromptTemplate.from_template(
    """Answer the question based only on the context provided.

Context: {context}

Question: {question}"""
)


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
chain.invoke("...")
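The format_docs helper above simply joins the page_content of each retrieved document with blank lines before it is interpolated into the prompt. This can be sanity-checked without a live pipeline; the SimpleNamespace objects below are stand-ins that expose the same page_content attribute as LangChain Documents:

```python
from types import SimpleNamespace


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


# Stand-in objects mimicking the page_content attribute of LangChain Documents.
docs = [
    SimpleNamespace(page_content="First chunk."),
    SimpleNamespace(page_content="Second chunk."),
]
print(format_docs(docs))
# First chunk.
#
# Second chunk.
```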
API 参考
For detailed documentation of all VectorizeRetriever features and configurations, head to the API reference.