Cloudflare Vectorize
This notebook covers how to get started with the CloudflareVectorize vector store.
Setup
This Python package is a wrapper around Cloudflare's REST API. To interact with the API, you need to provide an API token with the appropriate permissions.
You can create and manage API tokens here:
https://dash.cloudflare.com/YOUR-ACCT-NUMBER/api-tokens
Credentials
CloudflareVectorize depends on WorkersAI (if you are using it for your embeddings) and on D1 (if you are using it to store and retrieve your raw values).
You could create a single api_token with Edit permissions for all of the needed resources (WorkersAI, Vectorize & D1), but it is recommended to follow the principle of "least privilege access" and create a separate API token for each service.
Note: These service-specific tokens (if provided) take precedence over a global token. You can provide these instead of a global token.
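The precedence rule can be sketched in plain Python (the resolve_token helper below is hypothetical, purely for illustration, and not part of the package):

```python
def resolve_token(service_token, global_token):
    """Return the token to use for one service: a service-specific
    token, if provided, takes precedence over the global token."""
    token = service_token or global_token
    if token is None:
        raise ValueError("No API token provided for this service")
    return token


# A Vectorize-specific token wins over the global token:
print(resolve_token("cf_vectorize_token", "api_token"))  # cf_vectorize_token
# Without a service token, the global token is the fallback:
print(resolve_token(None, "api_token"))  # api_token
```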
import os
from dotenv import load_dotenv
load_dotenv(".env")
cf_acct_id = os.getenv("cf_acct_id")
# single token with WorkersAI, Vectorize & D1
api_token = os.getenv("cf_ai_token")
# OR, separate tokens with access to each service
cf_vectorize_token = os.getenv("cf_vectorize_token")
cf_d1_token = os.getenv("cf_d1_token")
Initialization
import asyncio
import json
import uuid
from langchain_cloudflare.embeddings import CloudflareWorkersAIEmbeddings
from langchain_cloudflare.vectorstores import CloudflareVectorize
from langchain_community.document_loaders import WikipediaLoader
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
Embeddings
For embeddings storage, semantic search, and retrieval, you must embed your raw values as embeddings. Specify an embedding model available on WorkersAI:
https://developers.cloudflare.com/workers-ai/models/
MODEL_WORKERSAI = "@cf/baai/bge-large-en-v1.5"
cf_ai_token = os.getenv(
"cf_ai_token"
) # needed if you want to use workersAI for embeddings
embedder = CloudflareWorkersAIEmbeddings(
account_id=cf_acct_id, api_token=cf_ai_token, model_name=MODEL_WORKERSAI
)
Raw Values & D1
Vectorize only stores embeddings, metadata, and namespaces. If you want to store and retrieve raw values, you must use D1, Cloudflare's SQL database.
You can create a database and retrieve its ID here:
https://dash.cloudflare.com/YOUR-ACCT-NUMBER/workers/d1
# provide the id of your D1 Database
d1_database_id = os.getenv("d1_database_id")
The CloudflareVectorize Class
Now we can create the CloudflareVectorize instance. Here we pass:
- the embedding instance from earlier
- our account ID
- a "global" API token with access to all services (WorkersAI, Vectorize, D1), or
- individual API tokens for each service
vectorize_index_name = f"test-langchain-{uuid.uuid4().hex}"
cfVect = CloudflareVectorize(
embedding=embedder,
account_id=cf_acct_id,
d1_api_token=cf_d1_token, # (Optional if using global token)
vectorize_api_token=cf_vectorize_token, # (Optional if using global token)
d1_database_id=d1_database_id, # (Optional if not using D1)
)
Cleanup
Before we start, let's delete any test-langchain* indexes left over from previous runs of this example.
# depending on your notebook environment you might need to include:
# import nest_asyncio
# nest_asyncio.apply()
arr_indexes = cfVect.list_indexes()
arr_indexes = [x for x in arr_indexes if "test-langchain" in x.get("name")]
arr_async_requests = [
cfVect.adelete_index(index_name=x.get("name")) for x in arr_indexes
]
await asyncio.gather(*arr_async_requests);
Gotchas
Here are a few "gotchas" caused by missing combinations of tokens/parameters.
D1 database ID provided, but neither a "global" api_token nor a d1_api_token
try:
    cfVect = CloudflareVectorize(
        embedding=embedder,
        account_id=cf_acct_id,
        # api_token=api_token,  # (Optional if using service-specific token)
        ai_api_token=cf_ai_token,  # (Optional if using global token)
        # d1_api_token=cf_d1_token,  # (Optional if using global token)
        vectorize_api_token=cf_vectorize_token,  # (Optional if using global token)
        d1_database_id=d1_database_id,  # (Optional if not using D1)
    )
except Exception as e:
    print(str(e))
`d1_database_id` provided, but no global `api_token` provided and no `d1_api_token` provided.
No "global" api_token provided, and ai_api_token or vectorize_api_token missing
try:
    cfVect = CloudflareVectorize(
        embedding=embedder,
        account_id=cf_acct_id,
        # api_token=api_token,  # (Optional if using service-specific token)
        # ai_api_token=cf_ai_token,  # (Optional if using global token)
        d1_api_token=cf_d1_token,  # (Optional if using global token)
        vectorize_api_token=cf_vectorize_token,  # (Optional if using global token)
        d1_database_id=d1_database_id,  # (Optional if not using D1)
    )
except Exception as e:
    print(str(e))
Manage Vector Store
Create an Index
Let's start off this example by creating an index (and deleting it first if it already exists). If the index doesn't exist yet, Cloudflare will return an error telling us so.
%%capture
try:
    cfVect.delete_index(index_name=vectorize_index_name, wait=True)
except Exception as e:
    print(e)
r = cfVect.create_index(index_name=vectorize_index_name, wait=True)
print(r)
{'created_on': '2025-04-09T18:08:57.067099Z', 'modified_on': '2025-04-09T18:08:57.067099Z', 'name': 'test-langchain-b594da547de4463180a08b2117c4904d', 'description': '', 'config': {'dimensions': 1024, 'metric': 'cosine'}}
List Indexes
Now we can list the indexes on our account.
indexes = cfVect.list_indexes()
indexes = [x for x in indexes if "test-langchain" in x.get("name")]
print(indexes)
[{'created_on': '2025-04-09T18:08:57.067099Z', 'modified_on': '2025-04-09T18:08:57.067099Z', 'name': 'test-langchain-b594da547de4463180a08b2117c4904d', 'description': '', 'config': {'dimensions': 1024, 'metric': 'cosine'}}]
Get Index Info
We can also get certain indexes and retrieve more granular information about them.
This call returns a processedUpToMutation value, which can be used to track the status of operations such as creating an index or adding and deleting records.
r = cfVect.get_index_info(index_name=vectorize_index_name)
print(r)
{'dimensions': 1024, 'vectorCount': 0}
Add Metadata Index
It is common to assist retrieval by providing metadata filters in queries. In Vectorize, this is done by first creating a "metadata index" on the Vectorize index. For this example, we will create one on the section field of our documents.
Reference: https://developers.cloudflare.com/vectorize/reference/metadata-filtering/
r = cfVect.create_metadata_index(
property_name="section",
index_type="string",
index_name=vectorize_index_name,
wait=True,
)
print(r)
{'mutationId': '5e1895ff-a0f6-4fbc-aa93-58d2e181650d'}
List Metadata Indexes
r = cfVect.list_metadata_indexes(index_name=vectorize_index_name)
print(r)
[]
Add Documents
For this example we will use LangChain's Wikipedia loader to pull an article about Cloudflare. We will store it in Vectorize and query its contents later.
docs = WikipediaLoader(query="Cloudflare", load_max_docs=2).load()
We will create some simple chunks with metadata based on the article sections.
text_splitter = RecursiveCharacterTextSplitter(
# Set a really small chunk size, just to show.
chunk_size=100,
chunk_overlap=20,
length_function=len,
is_separator_regex=False,
)
texts = text_splitter.create_documents([docs[0].page_content])
running_section = ""
for idx, text in enumerate(texts):
    if text.page_content.startswith("="):
        running_section = text.page_content
        running_section = running_section.replace("=", "").strip()
    else:
        if running_section == "":
            text.metadata = {"section": "Introduction"}
        else:
            text.metadata = {"section": running_section}
print(len(texts))
print(texts[0], "\n\n", texts[-1])
55
page_content='Cloudflare, Inc., is an American company that provides content delivery network services,' metadata={'section': 'Introduction'}
page_content='attacks, Cloudflare ended up being attacked as well; Google and other companies eventually' metadata={'section': 'DDoS mitigation'}
Now we will add the documents to the Vectorize index.
Note:
Adding embeddings to Vectorize happens asynchronously, meaning there will be a small delay between adding the embeddings and being able to query them. By default, add_documents has a wait=True parameter which waits for this operation to complete before returning a response. If you do not want the program to wait for embeddings availability, you can set this to wait=False.
r = cfVect.add_documents(index_name=vectorize_index_name, documents=texts, wait=True)
print(json.dumps(r)[:300])
["58577244-247a-407e-8764-3c1a251c6855", "7f107458-a6e4-4571-867e-5a1c8a6eecc0", "6245c111-957c-48c0-9033-e5b0ce7a667b", "f5153123-5964-4126-affd-609e061cff5a", "68ceeb19-bf41-4c83-a1b4-c13894fd7157", "679e8b74-daf4-4d39-a49c-8a945557038d", "2cba8eed-2a83-4c42-bea3-3163a0ed9eea", "76e02c1a-a30c-4b2c
Query Vector Store
Let's run some queries against our embeddings. We can specify our search query, and the number of top results we want with k.
query_documents = cfVect.similarity_search(
index_name=vectorize_index_name, query="Workers AI", k=100, return_metadata="none"
)
print(f"{len(query_documents)} results:\n{query_documents[:3]}")
55 results:
[Document(id='6d9f5eca-d664-42ff-a98e-4cec8d2a6418', metadata={}, page_content="In 2023, Cloudflare launched Workers AI, a framework allowing for use of Nvidia GPU's within"), Document(id='ca1b3f52-b017-47bd-afb0-88e497842b8b', metadata={}, page_content='based on queries by leveraging Workers AI.Cloudflare announced plans in September 2024 to launch a'), Document(id='ef9318d7-498b-4411-81d7-e3c37453bb36', metadata={}, page_content='=== Artificial intelligence ===')]
Output
If you'd like metadata returned, pass return_metadata="all" | "indexed". The default is "all".
If you'd like the embedding values returned, pass return_values=True. The default is False.
Embeddings are returned on the metadata field, under the special field _values.
Note: return_metadata="none" combined with return_values=True will return only the _values field within metadata.
Note: If you return metadata or values, your results will be limited to the top 20.
https://developers.cloudflare.com/vectorize/platform/limits/
query_documents = cfVect.similarity_search(
index_name=vectorize_index_name,
query="Workers AI",
return_values=True,
return_metadata="all",
k=100,
)
print(f"{len(query_documents)} results:\n{str(query_documents[0])[:500]}")
20 results:
page_content='In 2023, Cloudflare launched Workers AI, a framework allowing for use of Nvidia GPU's within' metadata={'section': 'Artificial intelligence', '_values': [0.014350891, 0.0053482056, -0.022354126, 0.002948761, 0.010406494, -0.016067505, -0.002029419, -0.023513794, 0.020141602, 0.023742676, 0.01361084, 0.003019333, 0.02748108, -0.023162842, 0.008979797, -0.029373169, -0.03643799, -0.03842163, -0.004463196, 0.021255493, 0.02192688, -0.005947113, -0.060272217, -0.055389404, -0.031188965
If you'd like similarity scores to be returned, use similarity_search_with_score:
query_documents = cfVect.similarity_search_with_score(
index_name=vectorize_index_name,
query="Workers AI",
k=100,
return_metadata="all",
)
print(f"{len(query_documents)} results:\n{str(query_documents[0])[:500]}")
20 results:
(Document(id='6d9f5eca-d664-42ff-a98e-4cec8d2a6418', metadata={'section': 'Artificial intelligence'}, page_content="In 2023, Cloudflare launched Workers AI, a framework allowing for use of Nvidia GPU's within"), 0.7851709)
Usage for Retrieval-Augmented Generation
Including D1 for "Raw Values"
All add and search methods on CloudflareVectorize support an include_d1 parameter (default True).
This configures whether you want to store/retrieve the raw values in D1.
If you do not want to use D1, set include_d1=False. Documents will then be returned with an empty page_content field.
query_documents = cfVect.similarity_search_with_score(
index_name=vectorize_index_name,
query="california",
k=100,
return_metadata="all",
include_d1=False,
)
print(f"{len(query_documents)} results:\n{str(query_documents[0])[:500]}")
20 results:
(Document(id='f5153123-5964-4126-affd-609e061cff5a', metadata={'section': 'Introduction'}, page_content=''), 0.60426825)
Query by Turning into Retriever
You can also transform the vector store into a retriever, for easier usage in your chains.
retriever = cfVect.as_retriever(
search_type="similarity",
search_kwargs={"k": 1, "index_name": vectorize_index_name},
)
r = retriever.invoke("california")
Search with Metadata Filtering
As mentioned, Vectorize supports filtered search via filters on indexed metadata fields. Here is an example where we search for the value Introduction within the indexed section metadata field.
More on searching on metadata fields: https://developers.cloudflare.com/vectorize/reference/metadata-filtering/
query_documents = cfVect.similarity_search_with_score(
index_name=vectorize_index_name,
query="California",
k=100,
md_filter={"section": "Introduction"},
return_metadata="all",
)
print(f"{len(query_documents)} results:\n - {str(query_documents[:3])}")
6 results:
- [(Document(id='f5153123-5964-4126-affd-609e061cff5a', metadata={'section': 'Introduction'}, page_content="and other services. Cloudflare's headquarters are in San Francisco, California. According to"), 0.60426825), (Document(id='7f107458-a6e4-4571-867e-5a1c8a6eecc0', metadata={'section': 'Introduction'}, page_content='network services, cybersecurity, DDoS mitigation, wide area network services, reverse proxies,'), 0.52082914), (Document(id='58577244-247a-407e-8764-3c1a251c6855', metadata={'section': 'Introduction'}, page_content='Cloudflare, Inc., is an American company that provides content delivery network services,'), 0.50490546)]
You can do more complex filtering as well:
https://developers.cloudflare.com/vectorize/reference/metadata-filtering/#valid-filter-examples
query_documents = cfVect.similarity_search_with_score(
index_name=vectorize_index_name,
query="California",
k=100,
md_filter={"section": {"$ne": "Introduction"}},
return_metadata="all",
)
print(f"{len(query_documents)} results:\n - {str(query_documents[:3])}")
20 results:
- [(Document(id='354f6e61-9a45-46fd-b9b9-2182a7b3e8da', metadata={}, page_content='== Products =='), 0.56540567), (Document(id='33697c9e-0a38-4e7f-b763-401efee46295', metadata={'section': 'History'}, page_content='Since at least 2017, Cloudflare has been using a wall of lava lamps in their San Francisco'), 0.5604333), (Document(id='615edec2-6eef-48d3-9023-04efe4992887', metadata={'section': 'History'}, page_content='their San Francisco headquarters as a source of randomness for encryption keys, alongside double'), 0.55573463)]
query_documents = cfVect.similarity_search_with_score(
index_name=vectorize_index_name,
query="DNS",
k=100,
md_filter={"section": {"$in": ["Products", "History"]}},
return_metadata="all",
)
print(f"{len(query_documents)} results:\n - {str(query_documents)}")
20 results:
- [(Document(id='520e5786-1ffd-4fe7-82c0-00ce53846454', metadata={'section': 'Products'}, page_content='protocols such as DNS over HTTPS, SMTP, and HTTP/2 with support for HTTP/2 Server Push. As of 2023,'), 0.7205538), (Document(id='47f42149-f5d2-457f-8b21-83708086e0f7', metadata={'section': 'Products'}, page_content='utilizing edge computing, reverse proxies for web traffic, data center interconnects, and a content'), 0.58178145), (Document(id='1bea41ed-88e7-4443-801c-e566598c3f86', metadata={'section': 'Products'}, page_content='and a content distribution network to serve content across its network of servers. It supports'), 0.5797795), (Document(id='1faeb28c-f0dc-4038-8ea8-1ed02e005e5e', metadata={'section': 'History'}, page_content='the New York Stock Exchange under the stock ticker NET. It opened for public trading on September'), 0.5678468), (Document(id='e1efd6cf-19e1-4640-aa8b-aff9323148b4', metadata={'section': 'Products'}, page_content='Cloudflare provides network and security products for consumers and businesses, utilizing edge'), 0.55722594), (Document(id='39857a5f-639d-42ab-a40f-c78fd526246f', metadata={'section': 'History'}, page_content='Cloudflare has acquired web-services and security companies, including StopTheHacker (February'), 0.5558441), (Document(id='b6947103-be26-4252-9389-37c0ecc98820', metadata={'section': 'Products'}, page_content='Push. As of 2023, Cloudflare handles an average of 45 million HTTP requests per second.'), 0.55429655), (Document(id='0edcd68e-c291-4d92-acc7-af292fad71c0', metadata={'section': 'Products'}, page_content='It supports transport layer protocols TCP, UDP, QUIC, and many application layer protocols such as'), 0.54969466), (Document(id='76e02c1a-a30c-4b2c-8fc3-a9b338e08e25', metadata={'section': 'History'}, page_content='Cloudflare was founded in July 2009 by Matthew Prince, Lee Holloway, and Michelle Zatlyn. 
Prince'), 0.54691005), (Document(id='218b6982-cf4e-4778-a759-4977ef83fe30', metadata={'section': 'History'}, page_content='2019, Cloudflare submitted its S-1 filing for an initial public offering on the New York Stock'), 0.533554), (Document(id='a936041a-1e30-4217-b161-b53d73b9b2c7', metadata={'section': 'History'}, page_content='Networks (March 2024), BastionZero (May 2024), and Kivera (October 2024).'), 0.53296596), (Document(id='645e5f9d-8fcf-4926-a36a-6137dd26540d', metadata={'section': 'Products'}, page_content='Verizon’s October 2024 outage.'), 0.53137076), (Document(id='87c83d1d-a4c2-4843-b2a0-84e6ef0e1916', metadata={'section': 'Products'}, page_content='Cloudflare also provides analysis and reports on large-scale outages, including Verizon’s October'), 0.53107977), (Document(id='7e6c210a-4bf9-4b43-8462-28e3bde1114f', metadata={'section': 'History'}, page_content='a product of Unspam Technologies that served as some inspiration for the basis of Cloudflare. From'), 0.528889), (Document(id='9c50e8aa-b246-4dec-ad0e-16a1ad07d3d5', metadata={'section': 'History'}, page_content='of Cloudflare. From 2009, the company was venture-capital funded. On August 15, 2019, Cloudflare'), 0.52717584), (Document(id='06408b72-d1e0-4160-af3e-b06b43109b30', metadata={'section': 'History'}, page_content='(December 2021), Vectrix (February 2022), Area 1 Security (February 2022), Nefeli Networks (March'), 0.52209044), (Document(id='78b1d42c-0509-445f-831a-6308a806c16f', metadata={'section': 'Products'}, page_content='As of 2024, Cloudflare servers are powered by AMD EPYC 9684X processors.'), 0.5169676), (Document(id='0d1f831d-632b-4e27-8cb3-0be3af2df51b', metadata={'section': 'History'}, page_content='(February 2014), CryptoSeal (June 2014), Eager Platform Co. 
(December 2016), Neumob (November'), 0.5132974), (Document(id='615edec2-6eef-48d3-9023-04efe4992887', metadata={'section': 'History'}, page_content='their San Francisco headquarters as a source of randomness for encryption keys, alongside double'), 0.50999177), (Document(id='c0611b8a-c8bb-48e4-a758-283b7df7454d', metadata={'section': 'History'}, page_content='Neumob (November 2017), S2 Systems (January 2020), Linc (December 2020), Zaraz (December 2021),'), 0.5092492)]
Search by Namespace
We can also search for vectors by namespace. We simply need to supply a matching namespaces array when adding the documents to the vector store.
namespace_name = f"test-namespace-{uuid.uuid4().hex[:8]}"
new_documents = [
Document(
page_content="This is a new namespace specific document!",
metadata={"section": "Namespace Test1"},
),
Document(
page_content="This is another namespace specific document!",
metadata={"section": "Namespace Test2"},
),
]
r = cfVect.add_documents(
index_name=vectorize_index_name,
documents=new_documents,
namespaces=[namespace_name] * len(new_documents),
wait=True,
)
query_documents = cfVect.similarity_search(
index_name=vectorize_index_name,
query="California",
namespace=namespace_name,
)
print(f"{len(query_documents)} results:\n - {str(query_documents)}")
2 results:
- [Document(id='6c9ab453-bf69-42aa-910d-e148c9c638d0', metadata={'section': 'Namespace Test2', '_namespace': 'test-namespace-e85040f0'}, page_content='This is another namespace specific document!'), Document(id='15fece51-d077-46ac-801c-faf0f479f8d9', metadata={'section': 'Namespace Test1', '_namespace': 'test-namespace-e85040f0'}, page_content='This is a new namespace specific document!')]
Search by IDs
We can also retrieve specific records by their IDs. To do this, we need to set the Vectorize index name on the instance's index_name state parameter.
This will return _namespace and _values along with any other metadata.
sample_ids = [x.id for x in query_documents]
cfVect.index_name = vectorize_index_name
query_documents = cfVect.get_by_ids(
sample_ids,
)
print(str(query_documents[:3])[:500])
[Document(id='6c9ab453-bf69-42aa-910d-e148c9c638d0', metadata={'section': 'Namespace Test2', '_namespace': 'test-namespace-e85040f0', '_values': [-0.0005841255, 0.014480591, 0.040771484, 0.005218506, 0.015579224, 0.0007543564, -0.005138397, -0.022720337, 0.021835327, 0.038970947, 0.017456055, 0.022705078, 0.013450623, -0.015686035, -0.019119263, -0.01512146, -0.017471313, -0.007183075, -0.054382324, -0.01914978, 0.0005302429, 0.018600464, -0.083740234, -0.006462097, 0.0005598068, 0.024230957, -0
The namespace will be included in the _namespace field in metadata, along with your other metadata (if you requested it with return_metadata).
Note: You cannot set the _namespace or _values fields within metadata, as these are reserved. They will be removed during inserts.
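The effect of reserving these fields can be sketched in plain Python (a hypothetical strip_reserved helper for illustration; the actual stripping happens inside the package):

```python
RESERVED_METADATA_KEYS = {"_namespace", "_values"}


def strip_reserved(metadata):
    """Drop reserved keys from user-supplied metadata before an insert."""
    return {k: v for k, v in metadata.items() if k not in RESERVED_METADATA_KEYS}


md = {"section": "Introduction", "_namespace": "my-ns", "_values": [0.1, 0.2]}
print(strip_reserved(md))  # {'section': 'Introduction'}
```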
Upserts
Vectorize supports upserts, which you can perform by setting upsert=True.
query_documents[0].page_content = "Updated: " + query_documents[0].page_content
print(query_documents[0].page_content)
Updated: This is another namespace specific document!
new_document_id = "12345678910"
new_document = Document(
id=new_document_id,
page_content="This is a new document!",
metadata={"section": "Introduction"},
)
r = cfVect.add_documents(
index_name=vectorize_index_name,
documents=[new_document, query_documents[0]],
upsert=True,
wait=True,
)
query_documents_updated = cfVect.get_by_ids([new_document_id, query_documents[0].id])
print(str(query_documents_updated[0])[:500])
print(query_documents_updated[0].page_content)
print(query_documents_updated[1].page_content)
page_content='This is a new document!' metadata={'section': 'Introduction', '_namespace': None, '_values': [-0.007522583, 0.0023021698, 0.009963989, 0.031051636, -0.021316528, 0.0048103333, 0.026046753, 0.01348114, 0.026306152, 0.040374756, 0.03225708, 0.007423401, 0.031021118, -0.007347107, -0.034179688, 0.002111435, -0.027191162, -0.020950317, -0.021636963, -0.0030593872, -0.04977417, 0.018859863, -0.08062744, -0.027679443, 0.012512207, 0.0053634644, 0.008079529, -0.010528564, 0.07312012, 0.02
This is a new document!
Updated: This is another namespace specific document!
Delete Records
We can delete records by their IDs:
r = cfVect.delete(index_name=vectorize_index_name, ids=sample_ids, wait=True)
print(r)
True
To confirm deletion:
query_documents = cfVect.get_by_ids(sample_ids)
assert len(query_documents) == 0
Create from Documents
LangChain specifies that all vector stores must have a from_documents method to instantiate a new vector store from documents. This is more streamlined than the separate create, add steps shown above.
You can do that as shown here:
vectorize_index_name = "test-langchain-from-docs"
cfVect = CloudflareVectorize.from_documents(
account_id=cf_acct_id,
index_name=vectorize_index_name,
documents=texts,
embedding=embedder,
d1_database_id=d1_database_id,
d1_api_token=cf_d1_token,
vectorize_api_token=cf_vectorize_token,
wait=True,
)
# query for documents
query_documents = cfVect.similarity_search(
index_name=vectorize_index_name,
query="Edge Computing",
)
print(f"{len(query_documents)} results:\n{str(query_documents[0])[:300]}")
20 results:
page_content='utilizing edge computing, reverse proxies for web traffic, data center interconnects, and a content' metadata={'section': 'Products'}
Async Examples
This section shows some asynchronous examples.
Create Indexes
vectorize_index_name1 = f"test-langchain-{uuid.uuid4().hex}"
vectorize_index_name2 = f"test-langchain-{uuid.uuid4().hex}"
vectorize_index_name3 = f"test-langchain-{uuid.uuid4().hex}"
# depending on your notebook environment you might need to include these:
# import nest_asyncio
# nest_asyncio.apply()
async_requests = [
cfVect.acreate_index(index_name=vectorize_index_name1),
cfVect.acreate_index(index_name=vectorize_index_name2),
cfVect.acreate_index(index_name=vectorize_index_name3),
]
res = await asyncio.gather(*async_requests);
Create Metadata Indexes
async_requests = [
cfVect.acreate_metadata_index(
property_name="section",
index_type="string",
index_name=vectorize_index_name1,
wait=True,
),
cfVect.acreate_metadata_index(
property_name="section",
index_type="string",
index_name=vectorize_index_name2,
wait=True,
),
cfVect.acreate_metadata_index(
property_name="section",
index_type="string",
index_name=vectorize_index_name3,
wait=True,
),
]
await asyncio.gather(*async_requests);
Add Documents
async_requests = [
cfVect.aadd_documents(index_name=vectorize_index_name1, documents=texts, wait=True),
cfVect.aadd_documents(index_name=vectorize_index_name2, documents=texts, wait=True),
cfVect.aadd_documents(index_name=vectorize_index_name3, documents=texts, wait=True),
]
await asyncio.gather(*async_requests);
Query/Search
async_requests = [
cfVect.asimilarity_search(index_name=vectorize_index_name1, query="Workers AI"),
cfVect.asimilarity_search(index_name=vectorize_index_name2, query="Edge Computing"),
cfVect.asimilarity_search(index_name=vectorize_index_name3, query="SASE"),
]
async_results = await asyncio.gather(*async_requests);
print(f"{len(async_results[0])} results:\n{str(async_results[0][0])[:300]}")
print(f"{len(async_results[1])} results:\n{str(async_results[1][0])[:300]}")
print(f"{len(async_results[2])} results:\n{str(async_results[2][0])[:300]}")
20 results:
page_content='In 2023, Cloudflare launched Workers AI, a framework allowing for use of Nvidia GPU's within'
20 results:
page_content='utilizing edge computing, reverse proxies for web traffic, data center interconnects, and a content'
20 results:
page_content='== Products =='
Return Metadata/Values
async_requests = [
cfVect.asimilarity_search(
index_name=vectorize_index_name1,
query="California",
return_values=True,
return_metadata="all",
),
cfVect.asimilarity_search(
index_name=vectorize_index_name2,
query="California",
return_values=True,
return_metadata="all",
),
cfVect.asimilarity_search(
index_name=vectorize_index_name3,
query="California",
return_values=True,
return_metadata="all",
),
]
async_results = await asyncio.gather(*async_requests);
print(f"{len(async_results[0])} results:\n{str(async_results[0][0])[:300]}")
print(f"{len(async_results[1])} results:\n{str(async_results[1][0])[:300]}")
print(f"{len(async_results[2])} results:\n{str(async_results[2][0])[:300]}")
20 results:
page_content='and other services. Cloudflare's headquarters are in San Francisco, California. According to' metadata={'section': 'Introduction', '_values': [-0.031219482, -0.018295288, -0.006000519, 0.017532349, 0.016403198, -0.029922485, -0.007133484, 0.004447937, 0.04559326, -0.011405945, 0.034820
20 results:
page_content='and other services. Cloudflare's headquarters are in San Francisco, California. According to' metadata={'section': 'Introduction', '_values': [-0.031219482, -0.018295288, -0.006000519, 0.017532349, 0.016403198, -0.029922485, -0.007133484, 0.004447937, 0.04559326, -0.011405945, 0.034820
20 results:
page_content='and other services. Cloudflare's headquarters are in San Francisco, California. According to' metadata={'section': 'Introduction', '_values': [-0.031219482, -0.018295288, -0.006000519, 0.017532349, 0.016403198, -0.029922485, -0.007133484, 0.004447937, 0.04559326, -0.011405945, 0.034820
Search with Metadata Filtering
async_requests = [
cfVect.asimilarity_search(
index_name=vectorize_index_name1,
query="Cloudflare services",
k=2,
md_filter={"section": "Products"},
return_metadata="all",
# return_values=True
),
cfVect.asimilarity_search(
index_name=vectorize_index_name2,
query="Cloudflare services",
k=2,
md_filter={"section": "Products"},
return_metadata="all",
# return_values=True
),
cfVect.asimilarity_search(
index_name=vectorize_index_name3,
query="Cloudflare services",
k=2,
md_filter={"section": "Products"},
return_metadata="all",
# return_values=True
),
]
async_results = await asyncio.gather(*async_requests);
[doc.metadata["section"] == "Products" for doc in async_results[0]]
[True, True]
print(f"{len(async_results[0])} results:\n{str(async_results[0][-1])[:300]}")
print(f"{len(async_results[1])} results:\n{str(async_results[1][0])[:300]}")
print(f"{len(async_results[2])} results:\n{str(async_results[2][0])[:300]}")
2 results:
page_content='Cloudflare also provides analysis and reports on large-scale outages, including Verizon’s October' metadata={'section': 'Products'}
2 results:
page_content='Cloudflare provides network and security products for consumers and businesses, utilizing edge' metadata={'section': 'Products'}
2 results:
page_content='Cloudflare provides network and security products for consumers and businesses, utilizing edge' metadata={'section': 'Products'}
Cleanup
Let's finish off by deleting all of the indexes we created in this notebook.
arr_indexes = cfVect.list_indexes()
arr_indexes = [x for x in arr_indexes if "test-langchain" in x.get("name")]
arr_async_requests = [
cfVect.adelete_index(index_name=x.get("name")) for x in arr_indexes
]
await asyncio.gather(*arr_async_requests);
API Reference
https://developers.cloudflare.com/api/resources/vectorize/
https://developers.cloudflare.com/vectorize/