
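Couchbase

Couchbase is an award-winning distributed NoSQL cloud database that delivers versatility, performance, scalability, and financial value for all of your cloud, mobile, AI, and edge computing applications. Couchbase embraces AI with coding assistance for developers and vector search for their applications.

Vector Search is part of the Full Text Search Service (Search Service) in Couchbase.

This tutorial explains how to use Vector Search in Couchbase. You can work with either Couchbase Capella or a self-managed Couchbase Server.

Setup

To access the CouchbaseSearchVectorStore, you first need to install the langchain-couchbase partner package: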

pip install -qU langchain-couchbase


Credentials

Head over to the Couchbase website and create a new connection, making sure to save your database username and password:

import getpass

COUCHBASE_CONNECTION_STRING = getpass.getpass(
    "Enter the connection string for the Couchbase cluster: "
)
DB_USERNAME = getpass.getpass("Enter the username for the Couchbase cluster: ")
DB_PASSWORD = getpass.getpass("Enter the password for the Couchbase cluster: ")
Enter the connection string for the Couchbase cluster:  ········
Enter the username for the Couchbase cluster: ········
Enter the password for the Couchbase cluster: ········

If you want best-in-class automated tracing of your model calls, you can also set your LangSmith API key by uncommenting the lines below:

# os.environ["LANGSMITH_TRACING"] = "true"
# os.environ["LANGSMITH_API_KEY"] = getpass.getpass()

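Initialization

Before instantiating the vector store, we need to create a connection.

Create Couchbase Connection Object

We first create a connection to the Couchbase cluster and then pass the cluster object to the vector store.

Here, we connect using the username and password from above. You can also connect to your cluster in any other supported way.

For more information on connecting to the Couchbase cluster, please check the documentation.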

from datetime import timedelta

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions

auth = PasswordAuthenticator(DB_USERNAME, DB_PASSWORD)
options = ClusterOptions(auth)
cluster = Cluster(COUCHBASE_CONNECTION_STRING, options)

# Wait until the cluster is ready for use.
cluster.wait_until_ready(timedelta(seconds=5))

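We will now set the bucket, scope, and collection names in the Couchbase cluster that we want to use for Vector Search.

In this example, we use the default scope and collection.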

BUCKET_NAME = "langchain_bucket"
SCOPE_NAME = "_default"
COLLECTION_NAME = "_default"
SEARCH_INDEX_NAME = "langchain-test-index"

For details on how to create a Search index with support for vector fields, please refer to the documentation.
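As a rough sketch of what such an index definition can look like, the helper below builds one as a Python dictionary that could then be serialized to JSON. The helper name `build_vector_index_definition`, the field names `text`, `embedding`, and `metadata`, and the `dot_product` similarity are assumptions made for illustration; the 3072 dimensions match the default output size of `text-embedding-3-large`. Treat it as a starting point and check it against the Couchbase documentation before importing it.

```python
import json


def build_vector_index_definition(index_name, bucket_name, scope_name,
                                  collection_name, dims=3072):
    """Build a Search index definition with one vector field (illustrative sketch)."""
    # Documents are matched to a type mapping by "<scope>.<collection>".
    type_key = f"{scope_name}.{collection_name}"
    return {
        "name": index_name,
        "type": "fulltext-index",
        "sourceType": "gocbcore",
        "sourceName": bucket_name,
        "params": {
            "doc_config": {
                "mode": "scope.collection.type_field",
                "type_field": "type",
            },
            "mapping": {
                "default_mapping": {"dynamic": True, "enabled": False},
                "types": {
                    type_key: {
                        "dynamic": False,
                        "enabled": True,
                        "properties": {
                            # Assumed field name; must match embedding_key.
                            "embedding": {
                                "enabled": True,
                                "fields": [{
                                    "name": "embedding",
                                    "type": "vector",
                                    "dims": dims,
                                    "similarity": "dot_product",
                                    "index": True,
                                }],
                            },
                            # Assumed field name; must match text_key.
                            "text": {
                                "enabled": True,
                                "fields": [{
                                    "name": "text",
                                    "type": "text",
                                    "index": True,
                                    "store": True,
                                }],
                            },
                            # Index every metadata field dynamically.
                            "metadata": {"dynamic": True, "enabled": True},
                        },
                    }
                },
            },
        },
    }


definition = build_vector_index_definition(
    "langchain-test-index", "langchain_bucket", "_default", "_default"
)
print(json.dumps(definition, indent=2))
```

The `dims` value has to agree with the embedding model you use, otherwise vectors written by the store will not be indexed as expected.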

Simple Instantiation

Below, we create the vector store object with the cluster information and the search index name.

pip install -qU langchain-openai

import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

from langchain_couchbase.vectorstores import CouchbaseSearchVectorStore

vector_store = CouchbaseSearchVectorStore(
    cluster=cluster,
    bucket_name=BUCKET_NAME,
    scope_name=SCOPE_NAME,
    collection_name=COLLECTION_NAME,
    embedding=embeddings,
    index_name=SEARCH_INDEX_NAME,
)

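Specify the Text & Embeddings Field

You can optionally specify the document fields that hold the text and the embeddings by using the text_key and embedding_key parameters.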

vector_store_specific = CouchbaseSearchVectorStore(
    cluster=cluster,
    bucket_name=BUCKET_NAME,
    scope_name=SCOPE_NAME,
    collection_name=COLLECTION_NAME,
    embedding=embeddings,
    index_name=SEARCH_INDEX_NAME,
    text_key="text",
    embedding_key="embedding",
)
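To make the role of the two keys concrete, here is a minimal sketch of the JSON shape a stored document might take. It assumes the store writes the text under `text_key`, the vector under `embedding_key`, and the metadata under a `metadata` key; the helper `to_stored_document` is hypothetical and is not part of langchain-couchbase.

```python
def to_stored_document(text, vector, metadata,
                       text_key="text", embedding_key="embedding"):
    """Return the assumed JSON shape of one document in the collection."""
    return {
        text_key: text,
        embedding_key: list(vector),
        "metadata": dict(metadata),  # assumed key name for metadata
    }


stored = to_stored_document(
    "I had chocolate chip pancakes and scrambled eggs for breakfast this morning.",
    [0.1, 0.2, 0.3],  # toy vector; real embeddings are much longer
    {"source": "tweet"},
)
print(sorted(stored))
```

Whatever names you choose for the two keys, the Search index has to map the same field names.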

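Manage vector store

Once you have created your vector store, you can interact with it by adding and deleting items.

Add items to vector store

We can add items to our vector store by using the add_documents function.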

from uuid import uuid4

from langchain_core.documents import Document

document_1 = Document(
    page_content="I had chocolate chip pancakes and scrambled eggs for breakfast this morning.",
    metadata={"source": "tweet"},
)

document_2 = Document(
    page_content="The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.",
    metadata={"source": "news"},
)

document_3 = Document(
    page_content="Building an exciting new project with LangChain - come check it out!",
    metadata={"source": "tweet"},
)

document_4 = Document(
    page_content="Robbers broke into the city bank and stole $1 million in cash.",
    metadata={"source": "news"},
)

document_5 = Document(
    page_content="Wow! That was an amazing movie. I can't wait to see it again.",
    metadata={"source": "tweet"},
)

document_6 = Document(
    page_content="Is the new iPhone worth the price? Read this review to find out.",
    metadata={"source": "website"},
)

document_7 = Document(
    page_content="The top 10 soccer players in the world right now.",
    metadata={"source": "website"},
)

document_8 = Document(
    page_content="LangGraph is the best framework for building stateful, agentic applications!",
    metadata={"source": "tweet"},
)

document_9 = Document(
    page_content="The stock market is down 500 points today due to fears of a recession.",
    metadata={"source": "news"},
)

document_10 = Document(
    page_content="I have a bad feeling I am going to get deleted :(",
    metadata={"source": "tweet"},
)

documents = [
    document_1,
    document_2,
    document_3,
    document_4,
    document_5,
    document_6,
    document_7,
    document_8,
    document_9,
    document_10,
]
uuids = [str(uuid4()) for _ in range(len(documents))]

vector_store.add_documents(documents=documents, ids=uuids)
['4a6b5252-24ca-4e48-97a9-c33211fc7736',
'594a413d-761a-44f1-8f0c-6418700b198d',
'fdd8461c-f4e3-4c85-af8e-7782ce4d2311',
'3f6a82b2-7464-4eee-b209-cbca5a236a8a',
'df8b87ad-464e-4f83-a007-ccf5a8fa4ff5',
'aa18502e-6fb4-4578-9c63-b9a299259b01',
'8c55a17d-5fa7-4c30-a55d-7ded0d39bf46',
'41b68c5a-ebf5-4d7a-a079-5e32926ca484',
'146ac3e0-474a-422a-b0ac-c9fee718396b',
'e44941e9-fb3a-4090-88a0-9ffecee3e80e']

Delete items from vector store

vector_store.delete(ids=[uuids[-1]])
True

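Query vector store

Once your vector store has been created and the relevant documents have been added, you will most likely want to query it while running your chain or agent.

Query directly

A simple similarity search can be performed as follows: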

results = vector_store.similarity_search(
    "LangChain provides abstractions to make working with LLMs easy",
    k=2,
)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")
* Building an exciting new project with LangChain - come check it out! [{'source': 'tweet'}]
* LangGraph is the best framework for building stateful, agentic applications! [{'source': 'tweet'}]

Similarity search with Score

You can also fetch the scores for the results by calling the similarity_search_with_score method.

results = vector_store.similarity_search_with_score("Will it be hot tomorrow?", k=1)
for res, score in results:
    print(f"* [SIM={score:3f}] {res.page_content} [{res.metadata}]")
* [SIM=0.553145] The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees. [{'source': 'news'}]
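For intuition about what a similarity score expresses, the toy sketch below ranks a few hand-written vectors against a query vector by dot product and keeps the top k. This is only an illustration: Couchbase computes the real score inside the Search Service using the similarity metric configured on the index, and the vectors, IDs, and the `top_k` helper here are made up.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vec, doc_vecs, k=1):
    """Return the k (doc_id, score) pairs with the highest dot product."""
    scored = [(doc_id, dot(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional "embeddings"; real ones have thousands of dimensions.
toy_docs = {
    "weather": [0.9, 0.1],
    "stocks": [0.2, 0.8],
}
best = top_k([1.0, 0.0], toy_docs, k=1)
print(best[0][0])
```

A higher score means the document vector points in a direction closer to the query vector under the chosen metric.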

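Specifying Fields to Return

You can specify the fields to return from the document using the fields parameter of the search methods. These fields are returned as part of the metadata object of the returned Document. You can fetch any field that is stored in the Search index. The text_key field of the document is returned as the Document's page_content.

If you do not specify any fields to fetch, all fields stored in the index are returned.

If you want to fetch a field that lives inside the metadata, you need to address it with dot notation (.).

For example, to fetch the source field in the metadata, you need to specify metadata.source.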

query = "What did I eat for breakfast today?"
results = vector_store.similarity_search(query, fields=["metadata.source"])
print(results[0])
page_content='I had chocolate chip pancakes and scrambled eggs for breakfast this morning.' metadata={'source': 'tweet'}
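The dot notation simply walks a nested JSON document one key at a time. The sketch below shows that idea in plain Python; `get_by_path` is a hypothetical helper written for this illustration, and the sample document assumes metadata is stored under a `metadata` key.

```python
def get_by_path(document, path):
    """Resolve a dotted path such as 'metadata.source' in a nested dict."""
    current = document
    for key in path.split("."):
        current = current[key]
    return current


sample = {
    "text": "I had chocolate chip pancakes and scrambled eggs for breakfast this morning.",
    "metadata": {"source": "tweet"},
}
print(get_by_path(sample, "metadata.source"))
```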

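Hybrid Queries

Couchbase allows you to run hybrid searches by combining Vector Search results with searches on non-vector fields of the document, such as the metadata object.

The results are based on the combination of the results of both Vector Search and the searches supported by the Search Service. The scores of the component searches are added up to give the total score of each result.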
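The additive scoring can be pictured with a small sketch. It assumes that a document matched by only one component simply keeps that component's score; the document IDs, the score values, and the `combine_scores` helper are invented for illustration and do not come from Couchbase.

```python
def combine_scores(vector_scores, text_scores):
    """Sum per-document scores from both searches and rank by the total."""
    totals = {}
    for scores in (vector_scores, text_scores):
        for doc_id, score in scores.items():
            totals[doc_id] = totals.get(doc_id, 0.0) + score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


vector_scores = {"doc_a": 0.5, "doc_b": 0.25}
text_scores = {"doc_b": 0.5, "doc_c": 0.125}
ranked = combine_scores(vector_scores, text_scores)
print(ranked)
```

Here `doc_b` ranks first because it scores in both components, even though `doc_a` has the higher vector score on its own.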

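To perform hybrid searches, there is an optional parameter, search_options, that can be passed to all of the similarity searches. The different search and query possibilities for search_options are described in the Couchbase Search documentation.

To simulate hybrid search, let us create some synthetic metadata for existing documents. We add three fields to the metadata of every chunk, cycling through the values evenly: date between 2010 and 2019, rating between 1 and 5, and author set to either John Doe or Jane Doe.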

from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter

loader = TextLoader("../../how_to/state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# Adding metadata to documents
for i, doc in enumerate(docs):
    doc.metadata["date"] = f"{range(2010, 2020)[i % 10]}-01-01"
    doc.metadata["rating"] = range(1, 6)[i % 5]
    doc.metadata["author"] = ["John Doe", "Jane Doe"][i % 2]

vector_store.add_documents(docs)

query = "What did the president say about Ketanji Brown Jackson"
results = vector_store.similarity_search(query)
print(results[0].metadata)
{'author': 'John Doe', 'date': '2016-01-01', 'rating': 2, 'source': '../../how_to/state_of_the_union.txt'}
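The metadata assignment is deterministic: it depends only on the position of the chunk. The sketch below reproduces the same indexing logic without a cluster so you can see which values a given chunk index receives; `synthetic_metadata` is a helper name introduced here for illustration.

```python
def synthetic_metadata(i):
    """Metadata assigned to the chunk at position i by the loop above."""
    return {
        "date": f"{range(2010, 2020)[i % 10]}-01-01",
        "rating": range(1, 6)[i % 5],
        "author": ["John Doe", "Jane Doe"][i % 2],
    }


# Any chunk whose index ends in 6 gets the values seen in the result above.
print(synthetic_metadata(6))

# Because 10 is a multiple of both 5 and 2, only ten combinations ever occur.
combinations = {tuple(sorted(synthetic_metadata(i).items())) for i in range(100)}
print(len(combinations))
```

This also means date, rating, and author are fully correlated: every document dated 2017-01-01 has rating 3 and author Jane Doe.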

Query by Exact Value

We can search for exact matches on a textual field, such as author in the metadata object.

query = "What did the president say about Ketanji Brown Jackson"
results = vector_store.similarity_search(
    query,
    search_options={"query": {"field": "metadata.author", "match": "John Doe"}},
    fields=["metadata.author"],
)
print(results[0])
page_content='One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.' metadata={'author': 'John Doe'}

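Query by Partial Match

We can search for partial matches by specifying a fuzziness for the search. This is useful when you want to match slight variations or misspellings of a search term.

Here, "Jae" is close (fuzziness of 1) to "Jane".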

query = "What did the president say about Ketanji Brown Jackson"
results = vector_store.similarity_search(
    query,
    search_options={
        "query": {"field": "metadata.author", "match": "Jae", "fuzziness": 1}
    },
    fields=["metadata.author"],
)
print(results[0])
page_content='A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. 

And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system.' metadata={'author': 'Jane Doe'}
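To see why "Jae" matches "Jane" but not "John" at a fuzziness of 1, the sketch below computes the Levenshtein edit distance between terms. It assumes that fuzziness is measured as edit distance on lowercased terms; the function is a textbook implementation written for this illustration, not Couchbase's code.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character edits."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


print(edit_distance("jae", "jane"))
print(edit_distance("jae", "john"))
```

One inserted character turns "jae" into "jane", so it falls within a fuzziness of 1, while "john" is three edits away.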

Query by Date Range Query

We can search for documents that fall within a date range on a date field such as metadata.date.

query = "Any mention about independence?"
results = vector_store.similarity_search(
    query,
    search_options={
        "query": {
            "start": "2016-12-31",
            "end": "2017-01-02",
            "inclusive_start": True,
            "inclusive_end": False,
            "field": "metadata.date",
        }
    },
)
print(results[0])
page_content='We are cutting off Russia’s largest banks from the international financial system.  

Preventing Russia’s central bank from defending the Russian Ruble making Putin’s $630 Billion “war fund” worthless.

We are choking off Russia’s access to technology that will sap its economic strength and weaken its military for years to come.

Tonight I say to the Russian oligarchs and corrupt leaders who have bilked billions of dollars off this violent regime no more.' metadata={'author': 'Jane Doe', 'date': '2017-01-01', 'rating': 3, 'source': '../../how_to/state_of_the_union.txt'}
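The effect of inclusive_start and inclusive_end can be checked locally with a small sketch. It mirrors the intended range semantics in plain Python and is not how the Search Service evaluates the query; `in_date_range` is a hypothetical helper, and its defaults follow the values used in the example above.

```python
from datetime import date


def in_date_range(value, start, end, inclusive_start=True, inclusive_end=False):
    """Check whether an ISO date string falls inside the given range."""
    day, low, high = (date.fromisoformat(v) for v in (value, start, end))
    after_start = day >= low if inclusive_start else day > low
    before_end = day <= high if inclusive_end else day < high
    return after_start and before_end


# The synthetic dates always fall on January 1, so only 2017 qualifies.
print(in_date_range("2017-01-01", "2016-12-31", "2017-01-02"))
print(in_date_range("2016-01-01", "2016-12-31", "2017-01-02"))
```

With inclusive_end set to False, a document dated exactly 2017-01-02 would be excluded.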

Query by Numeric Range Query

We can search for documents whose value for a numeric field, such as metadata.rating, falls within a range.

query = "Any mention about independence?"
results = vector_store.similarity_search_with_score(
    query,
    search_options={
        "query": {
            "min": 3,
            "max": 5,
            "inclusive_min": True,
            "inclusive_max": True,
            "field": "metadata.rating",
        }
    },
)
print(results[0])
(Document(id='8616f24425b94a52af3d32d20e6ffb4b', metadata={'author': 'John Doe', 'date': '2014-01-01', 'rating': 5, 'source': '../../how_to/state_of_the_union.txt'}, page_content='In this Capitol, generation after generation, Americans have debated great questions amid great strife, and have done great things. \n\nWe have fought for freedom, expanded liberty, defeated totalitarianism and terror. \n\nAnd built the strongest, freest, and most prosperous nation the world has ever known. \n\nNow is the hour. \n\nOur moment of responsibility. \n\nOur test of resolve and conscience, of history itself.'), 0.361933544533826)

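Combining Multiple Search Queries

Different search queries can be combined using AND (conjuncts) or OR (disjuncts) operators.

In this example, we look for documents with a rating between 3 and 4 and a date between 2016-12-31 and 2017-01-02.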

query = "Any mention about independence?"
results = vector_store.similarity_search_with_score(
    query,
    search_options={
        "query": {
            "conjuncts": [
                {"min": 3, "max": 4, "inclusive_max": True, "field": "metadata.rating"},
                {"start": "2016-12-31", "end": "2017-01-02", "field": "metadata.date"},
            ]
        }
    },
)
print(results[0])
(Document(id='d9b36ef70b8942dda4db63563f51cf0f', metadata={'author': 'Jane Doe', 'date': '2017-01-01', 'rating': 3, 'source': '../../how_to/state_of_the_union.txt'}, page_content='We are cutting off Russia’s largest banks from the international financial system.  \n\nPreventing Russia’s central bank from defending the Russian Ruble making Putin’s $630 Billion “war fund” worthless.   \n\nWe are choking off Russia’s access to technology that will sap its economic strength and weaken its military for years to come.  \n\nTonight I say to the Russian oligarchs and corrupt leaders who have bilked billions of dollars off this violent regime no more.'), 0.7107075545629284)
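The example above uses conjuncts (AND); an OR combination has the same shape with disjuncts as the key. The two small helpers below are a sketch for assembling either form as a search_options dictionary. The names `all_of` and `any_of` are invented here and are not part of langchain-couchbase or the Couchbase SDK.

```python
def all_of(*queries):
    """Build search_options that require every sub-query to match (AND)."""
    return {"query": {"conjuncts": list(queries)}}


def any_of(*queries):
    """Build search_options that require at least one sub-query to match (OR)."""
    return {"query": {"disjuncts": list(queries)}}


rating_3_to_4 = {"min": 3, "max": 4, "inclusive_max": True, "field": "metadata.rating"}
early_2017 = {"start": "2016-12-31", "end": "2017-01-02", "field": "metadata.date"}
by_john = {"field": "metadata.author", "match": "John Doe"}

and_options = all_of(rating_3_to_4, early_2017)
or_options = any_of(by_john, rating_3_to_4)
print(or_options)
```

Either dictionary can then be passed as the search_options argument of a similarity search, for example `vector_store.similarity_search(query, search_options=or_options)`.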

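Other Queries

Similarly, you can use any of the other supported query methods in the search_options parameter. Please refer to the documentation for more details on the available query methods and their syntax.

Query by turning into retriever

You can also transform the vector store into a retriever for easier use in your chains.

Here is how to transform your vector store into a retriever and then invoke the retriever with a simple query and filter.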

retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1, "score_threshold": 0.5},
)
retriever.invoke("Stealing from the bank is a crime", filter={"source": "news"})
[Document(id='3f6a82b2-7464-4eee-b209-cbca5a236a8a', metadata={'source': 'news'}, page_content='Robbers broke into the city bank and stole $1 million in cash.')]

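Usage for retrieval-augmented generation

For guides on how to use this vector store for retrieval-augmented generation (RAG), see the RAG tutorials and how-to guides in the LangChain documentation.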

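Frequently Asked Questions

Question: Should I create the Search index before creating the CouchbaseSearchVectorStore object?

Yes, currently you need to create the Search index before creating the CouchbaseSearchVectorStore object.

Question: I am not seeing all the fields that I specified in my search results.

In Couchbase, we can only return the fields stored in the Search index. Please make sure that the field you are trying to access in the search results is part of the Search index.

One way to handle this is to index and store a document's fields dynamically in the index.

  • In Capella, go to "Advanced Mode"; then, under the "General Settings" chevron, you can check "[X] Store Dynamic Fields" or "[X] Index Dynamic Fields"
  • In Couchbase Server, in the Index Editor (not the Quick Editor), under the "Advanced" chevron, you can check "[X] Store Dynamic Fields" or "[X] Index Dynamic Fields"

Note that these options will increase the size of the index.

For more details on dynamic mappings, please refer to the documentation.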
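If you manage the index as JSON rather than through the UI, the two checkboxes correspond to flags in the index definition. The sketch below assumes those flags are named `index_dynamic` and `store_dynamic` and live under `params.mapping`, as in exported index definitions; verify this against your own exported index before relying on it. The helper `enable_dynamic_fields` is hypothetical.

```python
import copy


def enable_dynamic_fields(index_definition):
    """Return a copy of the definition with dynamic indexing and storing on."""
    updated = copy.deepcopy(index_definition)
    mapping = updated.setdefault("params", {}).setdefault("mapping", {})
    mapping["index_dynamic"] = True  # assumed flag name
    mapping["store_dynamic"] = True  # assumed flag name
    return updated


exported = {
    "name": "langchain-test-index",
    "params": {"mapping": {"index_dynamic": False, "store_dynamic": False}},
}
patched = enable_dynamic_fields(exported)
print(patched["params"]["mapping"])
```

The function works on a deep copy, so the exported definition stays unchanged and can be compared with the patched one.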

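Question: I am unable to see the metadata object in my search results.

This is most likely because the metadata field in the document is not indexed and/or stored by the Couchbase Search index. To index the metadata field in the document, you need to add it to the index as a child mapping.

If you choose to map all the fields in the mapping, you will be able to search by all metadata fields. Alternatively, to optimize the index, you can select specific fields inside the metadata object to be indexed. You can refer to the documentation to learn more about indexing child mappings.

Creating Child Mappings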
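As a sketch of the two options described above, the helper below builds a child mapping for `metadata` either dynamically (all fields) or for selected fields only. The structure follows the shape of exported index definitions as far as I know it; the helper name, the chosen fields, and their types are assumptions for illustration.

```python
def metadata_child_mapping(fields=None):
    """Build a child mapping for 'metadata'.

    fields: optional dict of field name -> Search field type
            (for example "text", "number", or "datetime").
    """
    if not fields:
        # Map every field found under metadata.
        return {"dynamic": True, "enabled": True}
    return {
        "dynamic": False,
        "enabled": True,
        "properties": {
            name: {
                "enabled": True,
                "dynamic": False,
                "fields": [
                    {"name": name, "type": field_type, "index": True, "store": True}
                ],
            }
            for name, field_type in fields.items()
        },
    }


everything = metadata_child_mapping()
selected = metadata_child_mapping(
    {"author": "text", "rating": "number", "date": "datetime"}
)
print(sorted(selected["properties"]))
```

The resulting dictionary would sit under the `metadata` key in the `properties` of the type mapping for your scope and collection.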

API reference

For detailed documentation of all CouchbaseSearchVectorStore features and configurations, head to the API reference: https://couchbase-ecosystem.github.io/langchain-couchbase/langchain_couchbase.html