Couchbase

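Couchbase is an award-winning distributed NoSQL cloud database that delivers versatility, performance, scalability, and financial value for all of your cloud, mobile, AI, and edge computing applications. Couchbase embraces AI with coding assistance for developers and vector search for their applications.

Vector search is part of the Full Text Search service (Search Service) in Couchbase.

This tutorial explains how to use vector search in Couchbase. You can work with either Couchbase Capella or your self-managed Couchbase Server.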

Setup

To access the CouchbaseSearchVectorStore, you first need to install the langchain-couchbase partner package:

pip install -qU langchain-couchbase

Credentials

Head over to the Couchbase website and create a new connection, making sure to save your database username and password:

import getpass

COUCHBASE_CONNECTION_STRING = getpass.getpass(
    "Enter the connection string for the Couchbase cluster: "
)
DB_USERNAME = getpass.getpass("Enter the username for the Couchbase cluster: ")
DB_PASSWORD = getpass.getpass("Enter the password for the Couchbase cluster: ")
Enter the connection string for the Couchbase cluster:  ········
Enter the username for the Couchbase cluster: ········
Enter the password for the Couchbase cluster: ········
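
For non-interactive runs (scripts, CI), prompting with getpass is inconvenient. The following is a minimal sketch, not part of langchain-couchbase, that reads a setting from an environment variable and only falls back to a prompt when the variable is unset. The helper name read_setting and the environment variable names in the usage comment are assumptions chosen for illustration.

```python
import getpass
import os


def read_setting(env_var: str, prompt: str) -> str:
    """Return the value of env_var, or prompt for it when it is unset or empty."""
    value = os.environ.get(env_var)
    if value:
        return value
    # Fall back to an interactive prompt that does not echo the input.
    return getpass.getpass(prompt)


# Usage (hypothetical variable names; set them in your shell beforehand):
# COUCHBASE_CONNECTION_STRING = read_setting(
#     "COUCHBASE_CONNECTION_STRING", "Enter the connection string: "
# )
# DB_USERNAME = read_setting("COUCHBASE_USERNAME", "Enter the username: ")
# DB_PASSWORD = read_setting("COUCHBASE_PASSWORD", "Enter the password: ")
```

Keeping the prompt as a fallback means the same code still works in a notebook.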

If you want automated tracing of your model calls, you can also set your LangSmith API key by uncommenting the code below:

# import os
#
# os.environ["LANGSMITH_TRACING"] = "true"
# os.environ["LANGSMITH_API_KEY"] = getpass.getpass()

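Initialization

Before instantiating the vector store, we need to create a connection.

Create Couchbase Connection Object

We first create a connection to the Couchbase cluster and then pass the cluster object to the vector store.

Here, we connect using the username and password from above. You can also connect to your cluster in any other supported way.

For more information on connecting to a Couchbase cluster, see the documentation.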

from datetime import timedelta

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions

auth = PasswordAuthenticator(DB_USERNAME, DB_PASSWORD)
options = ClusterOptions(auth)
cluster = Cluster(COUCHBASE_CONNECTION_STRING, options)

# Wait until the cluster is ready for use.
cluster.wait_until_ready(timedelta(seconds=5))

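We now set the bucket, scope, and collection names in the Couchbase cluster that we want to use for vector search.

For this example, we use the default scope and collection.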

BUCKET_NAME = "langchain_bucket"
SCOPE_NAME = "_default"
COLLECTION_NAME = "_default"
SEARCH_INDEX_NAME = "langchain-test-index"

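For details on how to create a Search index with support for vector fields, see the documentation.

Simple Instantiation

Below, we create the vector store object with the cluster information and the search index name.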

pip install -qU langchain-openai

import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

from langchain_couchbase.vectorstores import CouchbaseSearchVectorStore

vector_store = CouchbaseSearchVectorStore(
    cluster=cluster,
    bucket_name=BUCKET_NAME,
    scope_name=SCOPE_NAME,
    collection_name=COLLECTION_NAME,
    embedding=embeddings,
    index_name=SEARCH_INDEX_NAME,
)

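Specify the Text & Embeddings Field

You can optionally specify the text and embeddings fields for the document using the text_key and embedding_key parameters.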

vector_store_specific = CouchbaseSearchVectorStore(
    cluster=cluster,
    bucket_name=BUCKET_NAME,
    scope_name=SCOPE_NAME,
    collection_name=COLLECTION_NAME,
    embedding=embeddings,
    index_name=SEARCH_INDEX_NAME,
    text_key="text",
    embedding_key="embedding",
)

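Manage vector store

Once you have created your vector store, you can interact with it by adding and deleting items.

Add items to vector store

We can add items to our vector store by using the add_documents function.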

from uuid import uuid4

from langchain_core.documents import Document

document_1 = Document(
    page_content="I had chocolate chip pancakes and scrambled eggs for breakfast this morning.",
    metadata={"source": "tweet"},
)

document_2 = Document(
    page_content="The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.",
    metadata={"source": "news"},
)

document_3 = Document(
    page_content="Building an exciting new project with LangChain - come check it out!",
    metadata={"source": "tweet"},
)

document_4 = Document(
    page_content="Robbers broke into the city bank and stole $1 million in cash.",
    metadata={"source": "news"},
)

document_5 = Document(
    page_content="Wow! That was an amazing movie. I can't wait to see it again.",
    metadata={"source": "tweet"},
)

document_6 = Document(
    page_content="Is the new iPhone worth the price? Read this review to find out.",
    metadata={"source": "website"},
)

document_7 = Document(
    page_content="The top 10 soccer players in the world right now.",
    metadata={"source": "website"},
)

document_8 = Document(
    page_content="LangGraph is the best framework for building stateful, agentic applications!",
    metadata={"source": "tweet"},
)

document_9 = Document(
    page_content="The stock market is down 500 points today due to fears of a recession.",
    metadata={"source": "news"},
)

document_10 = Document(
    page_content="I have a bad feeling I am going to get deleted :(",
    metadata={"source": "tweet"},
)

documents = [
    document_1,
    document_2,
    document_3,
    document_4,
    document_5,
    document_6,
    document_7,
    document_8,
    document_9,
    document_10,
]
uuids = [str(uuid4()) for _ in range(len(documents))]

vector_store.add_documents(documents=documents, ids=uuids)
API Reference: Document
['4a6b5252-24ca-4e48-97a9-c33211fc7736',
'594a413d-761a-44f1-8f0c-6418700b198d',
'fdd8461c-f4e3-4c85-af8e-7782ce4d2311',
'3f6a82b2-7464-4eee-b209-cbca5a236a8a',
'df8b87ad-464e-4f83-a007-ccf5a8fa4ff5',
'aa18502e-6fb4-4578-9c63-b9a299259b01',
'8c55a17d-5fa7-4c30-a55d-7ded0d39bf46',
'41b68c5a-ebf5-4d7a-a079-5e32926ca484',
'146ac3e0-474a-422a-b0ac-c9fee718396b',
'e44941e9-fb3a-4090-88a0-9ffecee3e80e']
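
Random uuid4 IDs change on every run, so re-running the cell adds the same documents again under new keys. One possible alternative, sketched below, derives a stable ID from each document's source and text with uuid5. The helper name content_id is an assumption for illustration, and whether re-adding a document under an existing ID overwrites it is worth verifying for your version of langchain-couchbase.

```python
from uuid import NAMESPACE_URL, uuid5


def content_id(page_content: str, source: str = "") -> str:
    """Derive a stable ID from a document's source and text.

    The same (source, text) pair always maps to the same UUID string.
    """
    return str(uuid5(NAMESPACE_URL, f"{source}\n{page_content}"))


# Two of the example documents from above, as (source, text) pairs.
texts = [
    ("tweet", "Building an exciting new project with LangChain - come check it out!"),
    ("news", "Robbers broke into the city bank and stole $1 million in cash."),
]
stable_ids = [content_id(text, source) for source, text in texts]

# These IDs could then be passed as ids=... to add_documents.
```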

Delete items from vector store

vector_store.delete(ids=[uuids[-1]])
True

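Query vector store

Once your vector store has been created and the relevant documents have been added, you will most likely want to query it while running your chain or agent.

Query directly

A simple similarity search can be performed as follows: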

results = vector_store.similarity_search(
    "LangChain provides abstractions to make working with LLMs easy",
    k=2,
)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")
* Building an exciting new project with LangChain - come check it out! [{'source': 'tweet'}]
* LangGraph is the best framework for building stateful, agentic applications! [{'source': 'tweet'}]

Similarity search with score

You can also fetch the scores for the results by calling the similarity_search_with_score method.

results = vector_store.similarity_search_with_score("Will it be hot tomorrow?", k=1)
for res, score in results:
    print(f"* [SIM={score:3f}] {res.page_content} [{res.metadata}]")
* [SIM=0.553145] The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees. [{'source': 'news'}]
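
Because similarity_search_with_score returns (document, score) pairs, weak matches can be dropped on the client side. The sketch below assumes that a higher score means a closer match, which depends on the similarity metric configured in your Search index. The helper name filter_by_score and the second, low-scoring entry are made up for illustration.

```python
from typing import Any, List, Tuple


def filter_by_score(
    results: List[Tuple[Any, float]], min_score: float
) -> List[Tuple[Any, float]]:
    """Keep only the (document, score) pairs whose score is at least min_score."""
    return [(doc, score) for doc, score in results if score >= min_score]


# Placeholder results: strings stand in for Document objects here.
example_results = [
    ("weather forecast document", 0.553145),
    ("unrelated document (made-up score)", 0.21),
]
strong_matches = filter_by_score(example_results, min_score=0.5)
print(strong_matches)  # [('weather forecast document', 0.553145)]
```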

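Specifying Fields to Return

You can specify which fields to return from the document using the fields parameter in the searches. These fields are returned as part of the metadata object in the returned Document. You can fetch any field that is stored in the Search index. The text_key of the document is returned as the document's page_content.

If you do not specify any fields to fetch, all the fields stored in the index are returned.

If you want to fetch a field inside the metadata, you need to specify it using dot notation (.).

For example, to fetch the source field in the metadata, you need to specify metadata.source.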

query = "What did I eat for breakfast today?"
results = vector_store.similarity_search(query, fields=["metadata.source"])
print(results[0])
page_content='I had chocolate chip pancakes and scrambled eggs for breakfast this morning.' metadata={'source': 'tweet'}
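
When several metadata fields are needed, writing each dotted path by hand is easy to get wrong. A small helper can build the list for the fields parameter. This is only a convenience sketch; metadata_fields is a hypothetical name and not part of langchain-couchbase.

```python
from typing import List


def metadata_fields(*names: str, prefix: str = "metadata") -> List[str]:
    """Build dotted field paths such as 'metadata.source' for the fields parameter."""
    return [f"{prefix}.{name}" for name in names]


fields = metadata_fields("source", "author")
print(fields)  # ['metadata.source', 'metadata.author']

# The list could then be passed to a search, for example:
# vector_store.similarity_search(query, fields=fields)
```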

Hybrid Queries

Couchbase allows you to do hybrid searches by combining vector search results with searches on non-vector fields of the document, such as the metadata object.

The results will be based on the combination of the results from both Vector Search and the searches supported by Search Service. The scores of each of the component searches are added up to get the total score of the result.

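To perform hybrid searches, there is an optional parameter, search_options, that can be passed to all the similarity searches. The different search/query possibilities for search_options are described in the Couchbase Search documentation.

To simulate hybrid search, we first create some varied metadata for a set of documents. We add three fields to the metadata in a uniform cycle: a date between 2010 and 2019 (January 1 of each year, since range(2010, 2020) excludes 2020), a rating between 1 and 5, and an author set to either John Doe or Jane Doe.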

from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter

loader = TextLoader("../../how_to/state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# Adding metadata to documents
for i, doc in enumerate(docs):
    doc.metadata["date"] = f"{range(2010, 2020)[i % 10]}-01-01"
    doc.metadata["rating"] = range(1, 6)[i % 5]
    doc.metadata["author"] = ["John Doe", "Jane Doe"][i % 2]

vector_store.add_documents(docs)

query = "What did the president say about Ketanji Brown Jackson"
results = vector_store.similarity_search(query)
print(results[0].metadata)
{'author': 'John Doe', 'date': '2016-01-01', 'rating': 2, 'source': '../../how_to/state_of_the_union.txt'}

Query by Exact Value

We can search for exact matches on a textual field of the metadata object, such as author.

query = "What did the president say about Ketanji Brown Jackson"
results = vector_store.similarity_search(
    query,
    search_options={"query": {"field": "metadata.author", "match": "John Doe"}},
    fields=["metadata.author"],
)
print(results[0])
page_content='One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.' metadata={'author': 'John Doe'}

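Query by Partial Match

We can search for partial matches by specifying a fuzziness for the search. This is useful when you want to match slight variations or misspellings of a search term.

Here, "Jae" is close (fuzziness of 1) to "Jane".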

query = "What did the president say about Ketanji Brown Jackson"
results = vector_store.similarity_search(
    query,
    search_options={
        "query": {"field": "metadata.author", "match": "Jae", "fuzziness": 1}
    },
    fields=["metadata.author"],
)
print(results[0])
page_content='A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. 

And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system.' metadata={'author': 'Jane Doe'}
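
To see why "Jae" matches "Jane" but not "John" at a fuzziness of 1, it helps to compute the edit distance directly. The sketch below assumes that fuzziness is measured as Levenshtein edit distance on lowercased terms. The function is a plain-Python illustration and not the code Couchbase runs.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,  # delete char_a
                    current[j - 1] + 1,  # insert char_b
                    previous[j - 1] + cost,  # substitute (or keep) the character
                )
            )
        previous = current
    return previous[-1]


print(levenshtein("jae", "jane"))  # 1: insert "n"
print(levenshtein("jae", "john"))  # 3: outside a fuzziness of 1
```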

Query by Date Range

We can search for documents whose date field, such as metadata.date, falls within a date range.

query = "Any mention about independence?"
results = vector_store.similarity_search(
    query,
    search_options={
        "query": {
            "start": "2016-12-31",
            "end": "2017-01-02",
            "inclusive_start": True,
            "inclusive_end": False,
            "field": "metadata.date",
        }
    },
)
print(results[0])
page_content='We are cutting off Russia’s largest banks from the international financial system.  

Preventing Russia’s central bank from defending the Russian Ruble making Putin’s $630 Billion “war fund” worthless.

We are choking off Russia’s access to technology that will sap its economic strength and weaken its military for years to come.

Tonight I say to the Russian oligarchs and corrupt leaders who have bilked billions of dollars off this violent regime no more.' metadata={'author': 'Jane Doe', 'date': '2017-01-01', 'rating': 3, 'source': '../../how_to/state_of_the_union.txt'}
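
The window in this query is narrow on purpose: of the dates generated earlier (January 1 of 2010 through 2019), only 2017-01-01 falls inside it. The sketch below reproduces that check locally with the same inclusive_start and inclusive_end semantics. The function in_date_range is an illustration written for this example, not a Couchbase API.

```python
from datetime import date

# The dates assigned to the documents above: January 1 of 2010..2019.
generated_dates = [date(year, 1, 1) for year in range(2010, 2020)]


def in_date_range(
    value: date,
    start: date,
    end: date,
    inclusive_start: bool = True,
    inclusive_end: bool = False,
) -> bool:
    """Check whether value lies in the range, honoring the inclusive flags."""
    after_start = value >= start if inclusive_start else value > start
    before_end = value <= end if inclusive_end else value < end
    return after_start and before_end


matches = [
    d.isoformat()
    for d in generated_dates
    if in_date_range(d, date(2016, 12, 31), date(2017, 1, 2))
]
print(matches)  # ['2017-01-01']
```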

Query by Numeric Range

We can search for documents whose numeric field, such as metadata.rating, falls within a range.

query = "Any mention about independence?"
results = vector_store.similarity_search_with_score(
    query,
    search_options={
        "query": {
            "min": 3,
            "max": 5,
            "inclusive_min": True,
            "inclusive_max": True,
            "field": "metadata.rating",
        }
    },
)
print(results[0])
(Document(id='8616f24425b94a52af3d32d20e6ffb4b', metadata={'author': 'John Doe', 'date': '2014-01-01', 'rating': 5, 'source': '../../how_to/state_of_the_union.txt'}, page_content='In this Capitol, generation after generation, Americans have debated great questions amid great strife, and have done great things. \n\nWe have fought for freedom, expanded liberty, defeated totalitarianism and terror. \n\nAnd built the strongest, freest, and most prosperous nation the world has ever known. \n\nNow is the hour. \n\nOur moment of responsibility. \n\nOur test of resolve and conscience, of history itself.'), 0.361933544533826)
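
If you build many range filters, a small builder keeps the keys consistent. The sketch below always emits both inclusive flags explicitly, so the query does not rely on server-side defaults. The name numeric_range is hypothetical, and the dictionary keys are the ones used in the query above.

```python
from typing import Any, Dict


def numeric_range(
    field: str,
    minimum: float,
    maximum: float,
    inclusive_min: bool = True,
    inclusive_max: bool = True,
) -> Dict[str, Any]:
    """Build a numeric range query dictionary for the given field."""
    return {
        "min": minimum,
        "max": maximum,
        "inclusive_min": inclusive_min,
        "inclusive_max": inclusive_max,
        "field": field,
    }


# Equivalent to the search_options used in the example above.
search_options = {"query": numeric_range("metadata.rating", 3, 5)}
```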

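Combining Multiple Search Queries

Different search queries can be combined using AND (conjuncts) or OR (disjuncts) operators.

In this example, we look for documents with a rating between 3 and 4 and a date between 2016-12-31 and 2017-01-02.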

query = "Any mention about independence?"
results = vector_store.similarity_search_with_score(
    query,
    search_options={
        "query": {
            "conjuncts": [
                {"min": 3, "max": 4, "inclusive_max": True, "field": "metadata.rating"},
                {"start": "2016-12-31", "end": "2017-01-02", "field": "metadata.date"},
            ]
        }
    },
)
print(results[0])
(Document(id='d9b36ef70b8942dda4db63563f51cf0f', metadata={'author': 'Jane Doe', 'date': '2017-01-01', 'rating': 3, 'source': '../../how_to/state_of_the_union.txt'}, page_content='We are cutting off Russia’s largest banks from the international financial system.  \n\nPreventing Russia’s central bank from defending the Russian Ruble making Putin’s $630 Billion “war fund” worthless.   \n\nWe are choking off Russia’s access to technology that will sap its economic strength and weaken its military for years to come.  \n\nTonight I say to the Russian oligarchs and corrupt leaders who have bilked billions of dollars off this violent regime no more.'), 0.7107075545629284)
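
The example above shows only the AND form. An OR combination follows the same shape with a disjuncts key. The sketch below wraps both in small builders; all_of and any_of are hypothetical helper names, and it assumes disjuncts behaves as described in the Couchbase Search documentation.

```python
from typing import Any, Dict


def all_of(*queries: Dict[str, Any]) -> Dict[str, Any]:
    """Combine queries with AND: every sub-query must match."""
    return {"conjuncts": list(queries)}


def any_of(*queries: Dict[str, Any]) -> Dict[str, Any]:
    """Combine queries with OR: at least one sub-query must match."""
    return {"disjuncts": list(queries)}


top_rated = {
    "min": 5,
    "max": 5,
    "inclusive_min": True,
    "inclusive_max": True,
    "field": "metadata.rating",
}
by_jane = {"field": "metadata.author", "match": "Jane Doe"}

# Documents rated 5 OR written by Jane Doe.
search_options = {"query": any_of(top_rated, by_jane)}
```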

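Other Queries

Similarly, you can use any of the supported query methods, such as Geo Distance, Polygon Search, Wildcard, and Regular Expressions, in the search_options parameter. Please refer to the documentation for more details on the available query methods and their syntax.

Query by turning into retriever

You can also transform the vector store into a retriever for easier usage in your chains.

Here is how to transform your vector store into a retriever and then invoke the retriever with a simple query and filter.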

retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1, "score_threshold": 0.5},
)
retriever.invoke("Stealing from the bank is a crime", filter={"source": "news"})
[Document(id='3f6a82b2-7464-4eee-b209-cbca5a236a8a', metadata={'source': 'news'}, page_content='Robbers broke into the city bank and stole $1 million in cash.')]

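Usage for retrieval-augmented generation

For guides on how to use this vector store for retrieval-augmented generation (RAG), see the following sections:

Frequently Asked Questions

Question: Should I create the Search index before creating the CouchbaseSearchVectorStore object?

Yes, currently you need to create the Search index before creating the CouchbaseSearchVectorStore object.

Question: I am not seeing all the fields that I specified in my search results.

In Couchbase, we can only return the fields stored in the Search index. Please ensure that the field you are trying to access in the search results is part of the Search index.

One way to handle this is to index and store a document's fields dynamically in the index.

  • In Capella, you need to go to "Advanced Mode"; then, under the chevron "General Settings", you can check "[X] Store Dynamic Fields" or "[X] Index Dynamic Fields".
  • In Couchbase Server, in the Index Editor (not the Quick Editor), under the chevron "Advanced", you can check "[X] Store Dynamic Fields" or "[X] Index Dynamic Fields".

Note that these options will increase the size of the index.

For more details on dynamic mappings, please refer to the documentation.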
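
A quick way to spot fields that are not stored in the index is to compare the requested field paths against the metadata that actually came back. The sketch below assumes, based on the fields example earlier on this page, that a requested path such as metadata.source appears in the result under the key source. The helper missing_fields is a hypothetical diagnostic, not part of langchain-couchbase, and it does not handle nested metadata keys.

```python
from typing import Any, Dict, List

METADATA_PREFIX = "metadata."


def missing_fields(metadata: Dict[str, Any], requested: List[str]) -> List[str]:
    """Return the requested field paths that are absent from a result's metadata."""
    missing = []
    for path in requested:
        # Strip the 'metadata.' prefix, since results expose keys without it.
        key = path[len(METADATA_PREFIX):] if path.startswith(METADATA_PREFIX) else path
        if key not in metadata:
            missing.append(path)
    return missing


# Example: 'author' was requested but did not come back with the result.
returned_metadata = {"source": "tweet"}
print(missing_fields(returned_metadata, ["metadata.source", "metadata.author"]))
# ['metadata.author']
```

Any path reported here is a candidate for the index mapping changes described above.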

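Question: I am unable to see the metadata object in my search results.

This is most likely because the metadata field in the document is not indexed and/or stored by the Couchbase Search index. To index the metadata field in the document, you need to add it to the index as a child mapping.

If you choose to map all the fields in the mapping, you will be able to search by all metadata fields. Alternatively, to optimize the index, you can select the specific fields inside the metadata object to be indexed. You can refer to the documentation to learn more about indexing child mappings.

Creating Child Mappings

API reference

For detailed documentation of all CouchbaseSearchVectorStore features and configurations, head to the API reference: https://couchbase-ecosystem.github.io/langchain-couchbase/langchain_couchbase.html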