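Weaviate
This notebook shows how to use the langchain-weaviate package.
Weaviate is an open-source vector database. It lets you store data objects and vector embeddings from your favorite ML models, and scale seamlessly into billions of data objects.
To use this integration, you need a running Weaviate database instance.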
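Minimum versions
This module requires Weaviate 1.23.7 or higher. However, we recommend that you use the latest version of Weaviate.
Connecting to Weaviate
In this notebook, we assume that you have a local instance of Weaviate running at http://localhost:8080, with port 50051 open for gRPC traffic. We therefore connect to Weaviate with: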
weaviate_client = weaviate.connect_to_local()
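Before connecting, it can help to confirm that both ports are actually reachable. The following is a minimal sketch using only the Python standard library; the helper name port_is_open is hypothetical and is not part of the Weaviate client. It only checks that a TCP connection can be opened, not that Weaviate itself is healthy.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 8080 (HTTP) and 50051 (gRPC) are the ports this notebook assumes.
    for name, port in [("HTTP", 8080), ("gRPC", 50051)]:
        status = "open" if port_is_open("localhost", port) else "closed"
        print(f"{name} port {port}: {status}")
```

If either port is reported as closed, the connection call above is likely to fail, and the deployment's port mapping is the first thing to check.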
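Other deployment options
Weaviate can be deployed in many different ways, such as with Weaviate Cloud Services (WCS), Docker, or Kubernetes.
If your Weaviate instance is deployed in another way, read more here about the different ways to connect to Weaviate. You can use different helper functions or create a custom instance.
Note that you need the v4 client API, which creates a weaviate.WeaviateClient object.
Authentication
Some Weaviate instances, such as those running on WCS, have authentication enabled, such as API key and/or username + password authentication.
For more information, read the client authentication guide as well as the in-depth authentication configuration page.
Installation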
# install package
# %pip install -Uqq langchain-weaviate
# %pip install openai tiktoken langchain
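Environment setup
This notebook uses the OpenAI API through OpenAIEmbeddings. We suggest obtaining an OpenAI API key and exporting it as an environment variable named OPENAI_API_KEY.
Once this is done, your OpenAI API key will be read automatically. If you are new to environment variables, read more about them here or in this guide.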
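As a small illustration of reading the key from the environment, here is a sketch that fails early with a clear message when the variable is missing. The helper require_env is a hypothetical name introduced for this example; OpenAIEmbeddings reads the variable on its own, so this check is optional.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set. "
            "Export it in your shell before starting the notebook."
        )
    return value
```

Calling require_env("OPENAI_API_KEY") at the top of the notebook surfaces a missing key before any embedding request is made.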
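Usage
Find objects by similarity
Here is an example of how to find objects by similarity to a query, from data import to querying the Weaviate instance.
Step 1: Data import
First, we will create the data to add to Weaviate by loading and chunking the contents of a long text file.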
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
loader = TextLoader("state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
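To make the roles of chunk_size and chunk_overlap concrete, here is a simplified sketch of fixed-width chunking in plain Python. It is an assumption-laden stand-in, not the algorithm used by CharacterTextSplitter, which splits on a separator and merges the pieces; the function name chunk_text is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 0) -> list[str]:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


# With no overlap the chunks tile the text; with overlap, neighbours share characters.
print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=0))
print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
```

Smaller chunks give more precise matches but less context per result; overlap reduces the chance that a relevant sentence is cut in half at a chunk boundary.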
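Now, we can import the data.
To do so, connect to the Weaviate instance and use the resulting weaviate_client object. For example, we can import the documents as shown below: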
import weaviate
from langchain_weaviate.vectorstores import WeaviateVectorStore
weaviate_client = weaviate.connect_to_local()
db = WeaviateVectorStore.from_documents(docs, embeddings, client=weaviate_client)
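Step 2: Perform the search
We can now perform a similarity search. This returns the documents most similar to the query text, based on the embeddings stored in Weaviate and an equivalent embedding generated from the query text.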
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)
# Print the first 100 characters of each result
for i, doc in enumerate(docs):
    print(f"\nDocument {i+1}:")
    print(doc.page_content[:100] + "...")
Document 1:
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Ac...
Document 2:
And so many families are living paycheck to paycheck, struggling to keep up with the rising cost of ...
Document 3:
Vice President Harris and I ran for office with a new economic vision for America.
Invest in Ameri...
Document 4:
A former top litigator in private practice. A former federal public defender. And from a family of p...
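The vector side of such a search can be illustrated with cosine similarity over toy embeddings. This is a minimal sketch with made-up three-dimensional vectors and hypothetical names (cosine_similarity, top_k); real embeddings have far more dimensions, and Weaviate performs the ranking inside the database.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 4) -> list[str]:
    """Return the ids of the k documents most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Toy embeddings, invented for illustration only.
toy_docs = {
    "courts": [0.9, 0.1, 0.0],
    "economy": [0.1, 0.9, 0.1],
    "foreign": [0.0, 0.2, 0.9],
}
print(top_k([1.0, 0.2, 0.0], toy_docs, k=2))  # prints ['courts', 'economy']
```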
You can also add filters, which include or exclude results based on the filter conditions. (See more filter examples.)
from weaviate.classes.query import Filter
for filter_str in ["blah.txt", "state_of_the_union.txt"]:
    search_filter = Filter.by_property("source").equal(filter_str)
    filtered_search_results = db.similarity_search(query, filters=search_filter)
    print(len(filtered_search_results))
    if filter_str == "state_of_the_union.txt":
        assert len(filtered_search_results) > 0  # There should be at least one result
    else:
        assert len(filtered_search_results) == 0  # There should be no results
0
4
It is also possible to provide k, which is the upper limit on the number of results to return.
search_filter = Filter.by_property("source").equal("state_of_the_union.txt")
filtered_search_results = db.similarity_search(query, filters=search_filter, k=3)
assert len(filtered_search_results) <= 3
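The interaction between a filter and k can be pictured with plain Python lists. This is only a conceptual sketch: in Weaviate the filter is evaluated inside the database rather than after the fact, and the function filtered_top_k and the sample records are hypothetical.

```python
def filtered_top_k(ranked_results: list[dict], source: str, k: int) -> list[dict]:
    """Keep results whose metadata source matches, then return at most k of them."""
    matching = [r for r in ranked_results if r["metadata"].get("source") == source]
    return matching[:k]


# Results already ordered from most to least relevant (invented data).
ranked = [
    {"text": "chunk A", "metadata": {"source": "state_of_the_union.txt"}},
    {"text": "chunk B", "metadata": {"source": "other.txt"}},
    {"text": "chunk C", "metadata": {"source": "state_of_the_union.txt"}},
    {"text": "chunk D", "metadata": {"source": "state_of_the_union.txt"}},
    {"text": "chunk E", "metadata": {"source": "state_of_the_union.txt"}},
]

print(len(filtered_top_k(ranked, "state_of_the_union.txt", k=3)))  # prints 3
print(len(filtered_top_k(ranked, "other.txt", k=3)))  # prints 1
```

The second call shows why k is an upper limit rather than a guarantee: when fewer than k objects pass the filter, fewer than k results come back.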
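Quantify result similarity
You can optionally retrieve a relevance "score". This is a relative score that indicates how good a particular search result is among the pool of search results.
Note that because the score is relative, it should not be used to set a relevance threshold. It can, however, be used to compare the relevance of different results within the same set of search results.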
docs = db.similarity_search_with_score("country", k=5)
for doc in docs:
    print(f"{doc[1]:.3f}", ":", doc[0].page_content[:100] + "...")
0.935 : For that purpose we’ve mobilized American ground forces, air squadrons, and ship deployments to prot...
0.500 : And built the strongest, freest, and most prosperous nation the world has ever known.
Now is the h...
0.462 : If you travel 20 miles east of Columbus, Ohio, you’ll find 1,000 empty acres of land.
It won’t loo...
0.450 : And my report is this: the State of the Union is strong—because you, the American people, are strong...
0.442 : Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Ac...
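One way a relative score can arise is min-max normalization within a result set. The sketch below illustrates why such a score is unsuitable as a threshold; it is an assumption for illustration, and whether Weaviate computes scores this way depends on its version and the configured fusion algorithm. The function name min_max_normalize is hypothetical.

```python
def min_max_normalize(scores: list[float]) -> list[float]:
    """Rescale scores so the best in this set maps to 1.0 and the worst to 0.0."""
    if not scores:
        return []
    lowest, highest = min(scores), max(scores)
    if highest == lowest:
        # All results are equally good relative to each other.
        return [1.0 for _ in scores]
    return [(s - lowest) / (highest - lowest) for s in scores]


# The same raw score of 0.50 lands in very different places in two result pools.
pool_a = min_max_normalize([0.90, 0.50, 0.10])
pool_b = min_max_normalize([0.50, 0.40, 0.30])
print(pool_a[1], pool_b[0])
```

In the first pool the raw 0.50 sits in the middle of the range, while in the second it is the top result, so a fixed cutoff on the normalized value would mean different things for different queries.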
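Search mechanism
similarity_search uses Weaviate's hybrid search.
A hybrid search combines a vector search and a keyword search, with alpha as the weight of the vector search. The similarity_search function lets you pass additional arguments as kwargs. See this reference doc for the available arguments.
So, you can perform a pure keyword search by adding alpha=0, as shown below: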
docs = db.similarity_search(query, alpha=0)
docs[0]
Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': 'state_of_the_union.txt'})
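The role of alpha can be sketched as a weighted blend of two scores. This is a simplified illustration that assumes both scores are already on a comparable 0 to 1 scale; Weaviate's actual fusion works on rank-based or normalized scores, and the function hybrid_score is a hypothetical name.

```python
def hybrid_score(vector_score: float, keyword_score: float, alpha: float) -> float:
    """Blend two scores: alpha weights the vector side, (1 - alpha) the keyword side."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return alpha * vector_score + (1.0 - alpha) * keyword_score


# alpha=0 ignores the vector score, alpha=1 ignores the keyword score.
for alpha in (0.0, 0.5, 1.0):
    print(alpha, hybrid_score(vector_score=0.8, keyword_score=0.2, alpha=alpha))
```

Values between 0 and 1 trade off semantic matching against exact term matching, which is useful when queries contain rare names or identifiers that embeddings may blur.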
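Persistence
Any data added through langchain-weaviate persists in Weaviate according to its configuration.
WCS instances, for example, are configured to persist data indefinitely, and Docker instances can be set up to persist data in a volume. Read more about Weaviate's persistence.
Multi-tenancy
Multi-tenancy lets you keep a large number of isolated collections of data, all with the same collection configuration, in a single Weaviate instance. This is well suited to multi-user environments such as a SaaS app, where each end user has their own isolated data collection.
To use multi-tenancy, the vector store needs to be aware of the tenant parameter.
So when adding any data, provide the tenant parameter as shown below.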
db_with_mt = WeaviateVectorStore.from_documents(
    docs, embeddings, client=weaviate_client, tenant="Foo"
)
2024-Mar-26 03:40 PM - langchain_weaviate.vectorstores - INFO - Tenant Foo does not exist in index LangChain_30b9273d43b3492db4fb2aba2e0d6871. Creating tenant.
When performing queries, provide the tenant parameter as well.
db_with_mt.similarity_search(query, tenant="Foo")
[Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': 'state_of_the_union.txt'}),
Document(page_content='And so many families are living paycheck to paycheck, struggling to keep up with the rising cost of food, gas, housing, and so much more. \n\nI understand. \n\nI remember when my Dad had to leave our home in Scranton, Pennsylvania to find work. I grew up in a family where if the price of food went up, you felt it. \n\nThat’s why one of the first things I did as President was fight to pass the American Rescue Plan. \n\nBecause people were hurting. We needed to act, and we did. \n\nFew pieces of legislation have done more in a critical moment in our history to lift us out of crisis. \n\nIt fueled our efforts to vaccinate the nation and combat COVID-19. It delivered immediate economic relief for tens of millions of Americans. \n\nHelped put food on their table, keep a roof over their heads, and cut the cost of health insurance. \n\nAnd as my Dad used to say, it gave people a little breathing room.', metadata={'source': 'state_of_the_union.txt'}),
Document(page_content='He and his Dad both have Type 1 diabetes, which means they need insulin every day. Insulin costs about $10 a vial to make. \n\nBut drug companies charge families like Joshua and his Dad up to 30 times more. I spoke with Joshua’s mom. \n\nImagine what it’s like to look at your child who needs insulin and have no idea how you’re going to pay for it. \n\nWhat it does to your dignity, your ability to look your child in the eye, to be the parent you expect to be. \n\nJoshua is here with us tonight. Yesterday was his birthday. Happy birthday, buddy. \n\nFor Joshua, and for the 200,000 other young people with Type 1 diabetes, let’s cap the cost of insulin at $35 a month so everyone can afford it. \n\nDrug companies will still do very well. And while we’re at it let Medicare negotiate lower prices for prescription drugs, like the VA already does.', metadata={'source': 'state_of_the_union.txt'}),
Document(page_content='Putin’s latest attack on Ukraine was premeditated and unprovoked. \n\nHe rejected repeated efforts at diplomacy. \n\nHe thought the West and NATO wouldn’t respond. And he thought he could divide us at home. Putin was wrong. We were ready. Here is what we did. \n\nWe prepared extensively and carefully. \n\nWe spent months building a coalition of other freedom-loving nations from Europe and the Americas to Asia and Africa to confront Putin. \n\nI spent countless hours unifying our European allies. We shared with the world in advance what we knew Putin was planning and precisely how he would try to falsely justify his aggression. \n\nWe countered Russia’s lies with truth. \n\nAnd now that he has acted the free world is holding him accountable. \n\nAlong with twenty-seven members of the European Union including France, Germany, Italy, as well as countries like the United Kingdom, Canada, Japan, Korea, Australia, New Zealand, and many others, even Switzerland.', metadata={'source': 'state_of_the_union.txt'})]
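The isolation that tenants provide can be pictured with a toy in-memory store. This is a conceptual sketch only; the class TenantStore is hypothetical and does not reflect how Weaviate stores or shards tenant data.

```python
from collections import defaultdict


class TenantStore:
    """Toy store in which every tenant has its own, separate list of texts."""

    def __init__(self) -> None:
        self._data: dict[str, list[str]] = defaultdict(list)

    def add(self, tenant: str, text: str) -> None:
        """Add a text to one tenant's collection."""
        self._data[tenant].append(text)

    def search(self, tenant: str, keyword: str) -> list[str]:
        """Search only within the given tenant's collection."""
        return [t for t in self._data.get(tenant, []) if keyword.lower() in t.lower()]


store = TenantStore()
store.add("Foo", "Quarterly report for Foo")
store.add("Bar", "Quarterly report for Bar")

# Each tenant sees only its own data, even though the query is identical.
print(store.search("Foo", "report"))  # prints ['Quarterly report for Foo']
print(store.search("Baz", "report"))  # prints []
```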
Retriever options
Weaviate can also be used as a retriever.
Maximal marginal relevance search (MMR)
In addition to using similarity search in the retriever object, you can also use mmr.
retriever = db.as_retriever(search_type="mmr")
retriever.invoke(query)[0]
Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': 'state_of_the_union.txt'})
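To show what MMR does differently from plain similarity ranking, here is a minimal sketch of the greedy MMR selection loop over toy two-dimensional vectors. The function names and data are illustrative assumptions; the parameter name lambda_mult is borrowed for readability, and this sketch does not claim to match the library's implementation.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec: list[float], doc_vecs: list[list[float]], k: int = 2,
        lambda_mult: float = 0.5) -> list[int]:
    """Greedily pick k document indices, trading relevance against redundancy."""
    candidates = list(range(len(doc_vecs)))
    selected: list[int] = []
    while candidates and len(selected) < k:
        best_idx, best_score = candidates[0], -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy is the similarity to the closest already-selected document.
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            score = lambda_mult * relevance - (1.0 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


# Documents 0 and 1 are near-duplicates; document 2 is less relevant but different.
toy_vecs = [[0.9, 0.1], [0.9, 0.12], [0.7, -0.7]]
print(mmr([1.0, 0.0], toy_vecs, k=2, lambda_mult=0.5))  # prints [0, 2]
print(mmr([1.0, 0.0], toy_vecs, k=2, lambda_mult=1.0))  # prints [0, 1]
```

With lambda_mult=1.0 the redundancy term vanishes and the result matches plain relevance ranking; lower values push the second pick away from the near-duplicate.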
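Use with LangChain
A known limitation of large language models (LLMs) is that their training data can be outdated, or may not include the specific domain knowledge that you require.
Take a look at the example below: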
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
llm.invoke("What did the president say about Justice Breyer").content
"I'm sorry, I cannot provide real-time information as my responses are generated based on a mixture of licensed data, data created by human trainers, and publicly available data. The last update was in October 2021."
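Vector stores complement LLMs by providing a way to store and retrieve relevant information. This lets you combine the strengths of both: the LLM's reasoning and linguistic capabilities, and the vector store's ability to retrieve relevant information.
Two well-known applications for combining LLMs and vector stores are:
- Question answering
- Retrieval-augmented generation (RAG)
Question answering with sources
Question answering in LangChain can be enhanced by the use of vector stores. Let's see how this can be done.
This section uses RetrievalQAWithSourcesChain, which looks up the documents from an index.
First, we will chunk the text again and import the chunks into the Weaviate vector store.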
from langchain.chains import RetrievalQAWithSourcesChain
from langchain_openai import OpenAI
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_text(state_of_the_union)
docsearch = WeaviateVectorStore.from_texts(
    texts,
    embeddings,
    client=weaviate_client,
    metadatas=[{"source": f"{i}-pl"} for i in range(len(texts))],
)
Now we can construct the chain, with the retriever specified:
chain = RetrievalQAWithSourcesChain.from_chain_type(
    OpenAI(temperature=0), chain_type="stuff", retriever=docsearch.as_retriever()
)
And run the chain to ask the question:
chain.invoke(
    {"question": "What did the president say about Justice Breyer"},
    return_only_outputs=True,
)
{'answer': ' The president thanked Justice Stephen Breyer for his service and announced his nomination of Judge Ketanji Brown Jackson to the Supreme Court.\n',
'sources': '31-pl'}
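Retrieval-augmented generation
Another very popular application of combining LLMs and vector stores is retrieval-augmented generation (RAG). This technique uses a retriever to find relevant information in a vector store, and then uses an LLM to produce an output based on the retrieved data and a prompt.
We begin with a similar setup:
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()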
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_text(state_of_the_union)
docsearch = WeaviateVectorStore.from_texts(
    texts,
    embeddings,
    client=weaviate_client,
    metadatas=[{"source": f"{i}-pl"} for i in range(len(texts))],
)
retriever = docsearch.as_retriever()
We need to construct a template for the RAG model so that the retrieved information is populated into the template.
from langchain_core.prompts import ChatPromptTemplate
template = """You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question}
Context: {context}
Answer:
"""
prompt = ChatPromptTemplate.from_template(template)
print(prompt)
input_variables=['context', 'question'] messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question}\nContext: {context}\nAnswer:\n"))]
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
Running the cell below, we get a very similar output.
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
rag_chain.invoke("What did the president say about Justice Breyer")
"The president honored Justice Stephen Breyer for his service to the country as an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. The president also mentioned nominating Circuit Court of Appeals Judge Ketanji Brown Jackson to continue Justice Breyer's legacy of excellence. The president expressed gratitude towards Justice Breyer and highlighted the importance of nominating someone to serve on the United States Supreme Court."
Note, however, that because the template is yours to construct, you can customize it as needed.
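As an illustration of such customization, the sketch below fills a different template using plain Python string formatting. The template wording, the constant CUSTOM_TEMPLATE, and the helper build_prompt are all hypothetical; in the chain itself, ChatPromptTemplate performs the substitution.

```python
CUSTOM_TEMPLATE = (
    "Answer the question using only the context below. "
    "Reply in one sentence, and say so if the context is insufficient.\n"
    "Question: {question}\n"
    "Context: {context}\n"
    "Answer:"
)


def build_prompt(question: str, context_chunks: list[str], template: str = CUSTOM_TEMPLATE) -> str:
    """Join retrieved chunks and substitute them into the template."""
    context = "\n\n".join(context_chunks)
    return template.format(question=question, context=context)


# Stand-in chunks; in the notebook these would come from the retriever.
example_prompt = build_prompt(
    "What did the president say about Justice Breyer",
    ["Justice Breyer, thank you for your service.", "I nominated Judge Ketanji Brown Jackson."],
)
print(example_prompt)
```

The same template string can be passed to ChatPromptTemplate.from_template, as long as it keeps the {question} and {context} placeholders that the chain supplies.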
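Wrap-up & resources
Weaviate is a scalable, production-ready vector store.
This integration allows Weaviate to be used with LangChain to enhance the capabilities of large language models with a robust data store. Its scalability and production-readiness make it a strong choice as a vector store for your LangChain applications, and it can shorten your time to production.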