Cloudflare Vectorize
This notebook covers how to get started with the CloudflareVectorize vector store.
Setup
This Python package is a wrapper around Cloudflare's REST API. To interact with the API, you need to provide an API token with the appropriate permissions.
You can create and manage API tokens here:
https://dash.cloudflare.com/YOUR-ACCT-NUMBER/api-tokens
Credentials
CloudflareVectorize depends on WorkersAI (if you are using it for your embeddings) and on D1 (if you are using it to store and retrieve your raw values).
You could create a single api_token with Edit permissions for all of the needed resources (WorkersAI, Vectorize & D1), but it is recommended to follow the principle of "least privilege access" and create a separate API token for each service.
Note: These service-specific tokens (if provided) take precedence over a global token. You can provide these instead of a global token.
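The precedence rule can be sketched in plain Python (the resolve_token helper below is hypothetical, purely for illustration, and not part of the package):

```python
def resolve_token(service_token, global_token):
    """Return the token to use for one service: a service-specific
    token, if provided, takes precedence over the global token."""
    token = service_token or global_token
    if token is None:
        raise ValueError("No API token provided for this service")
    return token


# A Vectorize-specific token wins over the global token:
print(resolve_token("cf_vectorize_token", "api_token"))  # cf_vectorize_token
# Without a service token, the global token is the fallback:
print(resolve_token(None, "api_token"))  # api_token
```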
import os
from dotenv import load_dotenv
load_dotenv(".env")
cf_acct_id = os.getenv("cf_acct_id")
# single token with WorkersAI, Vectorize & D1
api_token = os.getenv("cf_ai_token")
# OR, separate tokens with access to each service
cf_vectorize_token = os.getenv("cf_vectorize_token")
cf_d1_token = os.getenv("cf_d1_token")
Initialization
import asyncio
import json
import uuid
from langchain_cloudflare.embeddings import CloudflareWorkersAIEmbeddings
from langchain_cloudflare.vectorstores import CloudflareVectorize
from langchain_community.document_loaders import WikipediaLoader
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
Embeddings
For embeddings storage, semantic search, and retrieval, you must embed your raw values as embeddings. Specify an embedding model available on WorkersAI:
https://developers.cloudflare.com/workers-ai/models/
MODEL_WORKERSAI = "@cf/baai/bge-large-en-v1.5"
cf_ai_token = os.getenv(
"cf_ai_token"
) # needed if you want to use workersAI for embeddings
embedder = CloudflareWorkersAIEmbeddings(
account_id=cf_acct_id, api_token=cf_ai_token, model_name=MODEL_WORKERSAI
)
Raw Values & D1
Vectorize only stores embeddings, metadata, and namespaces. If you want to store and retrieve raw values, you must use D1, Cloudflare's SQL database.
You can create a database and retrieve its ID here:
https://dash.cloudflare.com/YOUR-ACCT-NUMBER/workers/d1
# provide the id of your D1 Database
d1_database_id = os.getenv("d1_database_id")
The CloudflareVectorize Class
Now we can create the CloudflareVectorize instance. Here we pass:
- the embedding instance from earlier
- our account ID
- a "global" API token with access to all services (WorkersAI, Vectorize, D1), or
- individual API tokens for each service
vectorize_index_name = f"test-langchain-{uuid.uuid4().hex}"
cfVect = CloudflareVectorize(
embedding=embedder,
account_id=cf_acct_id,
d1_api_token=cf_d1_token, # (Optional if using global token)
vectorize_api_token=cf_vectorize_token, # (Optional if using global token)
d1_database_id=d1_database_id, # (Optional if not using D1)
)
Cleanup
Before we start, let's delete any test-langchain* indexes left over from previous runs of this example.
# depending on your notebook environment you might need to include:
# import nest_asyncio
# nest_asyncio.apply()
arr_indexes = cfVect.list_indexes()
arr_indexes = [x for x in arr_indexes if "test-langchain" in x.get("name")]
arr_async_requests = [
cfVect.adelete_index(index_name=x.get("name")) for x in arr_indexes
]
await asyncio.gather(*arr_async_requests);
Gotchas
Here are a few "gotchas" caused by missing combinations of tokens/parameters.
D1 database ID provided, but neither a "global" api_token nor a d1_api_token
try:
    cfVect = CloudflareVectorize(
        embedding=embedder,
        account_id=cf_acct_id,
        # api_token=api_token,  # (Optional if using service-specific token)
        ai_api_token=cf_ai_token,  # (Optional if using global token)
        # d1_api_token=cf_d1_token,  # (Optional if using global token)
        vectorize_api_token=cf_vectorize_token,  # (Optional if using global token)
        d1_database_id=d1_database_id,  # (Optional if not using D1)
    )
except Exception as e:
    print(str(e))
`d1_database_id` provided, but no global `api_token` provided and no `d1_api_token` provided.
No "global" api_token provided, and ai_api_token or vectorize_api_token missing
try:
    cfVect = CloudflareVectorize(
        embedding=embedder,
        account_id=cf_acct_id,
        # api_token=api_token,  # (Optional if using service-specific token)
        # ai_api_token=cf_ai_token,  # (Optional if using global token)
        d1_api_token=cf_d1_token,  # (Optional if using global token)
        vectorize_api_token=cf_vectorize_token,  # (Optional if using global token)
        d1_database_id=d1_database_id,  # (Optional if not using D1)
    )
except Exception as e:
    print(str(e))
Manage Vector Store
Create an Index
Let's start off this example by creating an index (and deleting it first if it already exists). If the index doesn't exist yet, Cloudflare will return an error telling us so.
%%capture
try:
    cfVect.delete_index(index_name=vectorize_index_name, wait=True)
except Exception as e:
    print(e)
r = cfVect.create_index(index_name=vectorize_index_name, wait=True)
print(r)
{'created_on': '2025-04-09T18:08:57.067099Z', 'modified_on': '2025-04-09T18:08:57.067099Z', 'name': 'test-langchain-b594da547de4463180a08b2117c4904d', 'description': '', 'config': {'dimensions': 1024, 'metric': 'cosine'}}
List Indexes
Now we can list the indexes on our account.
indexes = cfVect.list_indexes()
indexes = [x for x in indexes if "test-langchain" in x.get("name")]
print(indexes)
[{'created_on': '2025-04-09T18:08:57.067099Z', 'modified_on': '2025-04-09T18:08:57.067099Z', 'name': 'test-langchain-b594da547de4463180a08b2117c4904d', 'description': '', 'config': {'dimensions': 1024, 'metric': 'cosine'}}]
Get Index Info
We can also get certain indexes and retrieve more granular information about them.
This call returns a processedUpToMutation value, which can be used to track the status of operations such as creating an index or adding and deleting records.
r = cfVect.get_index_info(index_name=vectorize_index_name)
print(r)
{'dimensions': 1024, 'vectorCount': 0}
Add Metadata Index
It is common to assist retrieval by providing metadata filters in queries. In Vectorize, this is done by first creating a "metadata index" on the Vectorize index. For this example, we will create one on the section field of our documents.
Reference: https://developers.cloudflare.com/vectorize/reference/metadata-filtering/
r = cfVect.create_metadata_index(
property_name="section",
index_type="string",
index_name=vectorize_index_name,
wait=True,
)
print(r)
{'mutationId': '5e1895ff-a0f6-4fbc-aa93-58d2e181650d'}
List Metadata Indexes
r = cfVect.list_metadata_indexes(index_name=vectorize_index_name)
print(r)
[]
Add Documents
For this example we will use LangChain's Wikipedia loader to pull an article about Cloudflare. We will store it in Vectorize and query its contents later.
docs = WikipediaLoader(query="Cloudflare", load_max_docs=2).load()
We will create some simple chunks with metadata based on the article sections.
text_splitter = RecursiveCharacterTextSplitter(
# Set a really small chunk size, just to show.
chunk_size=100,
chunk_overlap=20,
length_function=len,
is_separator_regex=False,
)
texts = text_splitter.create_documents([docs[0].page_content])
running_section = ""
for idx, text in enumerate(texts):
    if text.page_content.startswith("="):
        running_section = text.page_content
        running_section = running_section.replace("=", "").strip()
    else:
        if running_section == "":
            text.metadata = {"section": "Introduction"}
        else:
            text.metadata = {"section": running_section}
print(len(texts))
print(texts[0], "\n\n", texts[-1])
55
page_content='Cloudflare, Inc., is an American company that provides content delivery network services,' metadata={'section': 'Introduction'}
page_content='attacks, Cloudflare ended up being attacked as well; Google and other companies eventually' metadata={'section': 'DDoS mitigation'}
Now we will add the documents to the Vectorize index.
Note:
Adding embeddings to Vectorize happens asynchronously, meaning there will be a small delay between adding the embeddings and being able to query them. By default, add_documents has a wait=True parameter which waits for this operation to complete before returning a response. If you do not want the program to wait for embeddings availability, you can set this to wait=False.
r = cfVect.add_documents(index_name=vectorize_index_name, documents=texts, wait=True)
print(json.dumps(r)[:300])
["58577244-247a-407e-8764-3c1a251c6855", "7f107458-a6e4-4571-867e-5a1c8a6eecc0", "6245c111-957c-48c0-9033-e5b0ce7a667b", "f5153123-5964-4126-affd-609e061cff5a", "68ceeb19-bf41-4c83-a1b4-c13894fd7157", "679e8b74-daf4-4d39-a49c-8a945557038d", "2cba8eed-2a83-4c42-bea3-3163a0ed9eea", "76e02c1a-a30c-4b2c
Query Vector Store
Let's run some queries against our embeddings. We can specify our search query, and the number of top results we want with k.
query_documents = cfVect.similarity_search(
index_name=vectorize_index_name, query="Workers AI", k=100, return_metadata="none"
)
print(f"{len(query_documents)} results:\n{query_documents[:3]}")
55 results:
[Document(id='6d9f5eca-d664-42ff-a98e-4cec8d2a6418', metadata={}, page_content="In 2023, Cloudflare launched Workers AI, a framework allowing for use of Nvidia GPU's within"), Document(id='ca1b3f52-b017-47bd-afb0-88e497842b8b', metadata={}, page_content='based on queries by leveraging Workers AI.Cloudflare announced plans in September 2024 to launch a'), Document(id='ef9318d7-498b-4411-81d7-e3c37453bb36', metadata={}, page_content='=== Artificial intelligence ===')]
Output
If you'd like metadata returned, pass return_metadata="all" | "indexed". The default is "all".
If you'd like the embedding values returned, pass return_values=True. The default is False.
Embeddings are returned on the metadata field, under the special field _values.
Note: return_metadata="none" combined with return_values=True will return only the _values field within metadata.
Note: If you return metadata or values, your results will be limited to the top 20.
https://developers.cloudflare.com/vectorize/platform/limits/
query_documents = cfVect.similarity_search(
index_name=vectorize_index_name,
query="Workers AI",
return_values=True,
return_metadata="all",
k=100,
)
print(f"{len(query_documents)} results:\n{str(query_documents[0])[:500]}")
20 results:
page_content='In 2023, Cloudflare launched Workers AI, a framework allowing for use of Nvidia GPU's within' metadata={'section': 'Artificial intelligence', '_values': [0.014350891, 0.0053482056, -0.022354126, 0.002948761, 0.010406494, -0.016067505, -0.002029419, -0.023513794, 0.020141602, 0.023742676, 0.01361084, 0.003019333, 0.02748108, -0.023162842, 0.008979797, -0.029373169, -0.03643799, -0.03842163, -0.004463196, 0.021255493, 0.02192688, -0.005947113, -0.060272217, -0.055389404, -0.031188965
If you'd like similarity scores to be returned, use similarity_search_with_score:
query_documents = cfVect.similarity_search_with_score(
index_name=vectorize_index_name,
query="Workers AI",
k=100,
return_metadata="all",
)
print(f"{len(query_documents)} results:\n{str(query_documents[0])[:500]}")
20 results:
(Document(id='6d9f5eca-d664-42ff-a98e-4cec8d2a6418', metadata={'section': 'Artificial intelligence'}, page_content="In 2023, Cloudflare launched Workers AI, a framework allowing for use of Nvidia GPU's within"), 0.7851709)
Usage for Retrieval-Augmented Generation
Including D1 for "Raw Values"
All add and search methods on CloudflareVectorize support an include_d1 parameter (default True).
This configures whether you want to store/retrieve the raw values in D1.
If you do not want to use D1, set include_d1=False. Documents will then be returned with an empty page_content field.
query_documents = cfVect.similarity_search_with_score(
index_name=vectorize_index_name,
query="california",
k=100,
return_metadata="all",
include_d1=False,
)
print(f"{len(query_documents)} results:\n{str(query_documents[0])[:500]}")
20 results:
(Document(id='f5153123-5964-4126-affd-609e061cff5a', metadata={'section': 'Introduction'}, page_content=''), 0.60426825)
Query by Turning into Retriever
You can also transform the vector store into a retriever, for easier usage in your chains.
retriever = cfVect.as_retriever(
search_type="similarity",
search_kwargs={"k": 1, "index_name": vectorize_index_name},
)
r = retriever.invoke("california")
Search with Metadata Filtering
As mentioned, Vectorize supports filtered search via filters on indexed metadata fields. Here is an example where we search for the value Introduction within the indexed section metadata field.
More on searching on metadata fields: https://developers.cloudflare.com/vectorize/reference/metadata-filtering/
query_documents = cfVect.similarity_search_with_score(
index_name=vectorize_index_name,
query="California",
k=100,
md_filter={"section": "Introduction"},
return_metadata="all",
)
print(f"{len(query_documents)} results:\n - {str(query_documents[:3])}")
6 results:
- [(Document(id='f5153123-5964-4126-affd-609e061cff5a', metadata={'section': 'Introduction'}, page_content="and other services. Cloudflare's headquarters are in San Francisco, California. According to"), 0.60426825), (Document(id='7f107458-a6e4-4571-867e-5a1c8a6eecc0', metadata={'section': 'Introduction'}, page_content='network services, cybersecurity, DDoS mitigation, wide area network services, reverse proxies,'), 0.52082914), (Document(id='58577244-247a-407e-8764-3c1a251c6855', metadata={'section': 'Introduction'}, page_content='Cloudflare, Inc., is an American company that provides content delivery network services,'), 0.50490546)]
You can do more complex filtering as well:
https://developers.cloudflare.com/vectorize/reference/metadata-filtering/#valid-filter-examples
query_documents = cfVect.similarity_search_with_score(
index_name=vectorize_index_name,
query="California",
k=100,
md_filter={"section": {"$ne": "Introduction"}},
return_metadata="all",
)
print(f"{len(query_documents)} results:\n - {str(query_documents[:3])}")
20 results:
- [(Document(id='354f6e61-9a45-46fd-b9b9-2182a7b3e8da', metadata={}, page_content='== Products =='), 0.56540567), (Document(id='33697c9e-0a38-4e7f-b763-401efee46295', metadata={'section': 'History'}, page_content='Since at least 2017, Cloudflare has been using a wall of lava lamps in their San Francisco'), 0.5604333), (Document(id='615edec2-6eef-48d3-9023-04efe4992887', metadata={'section': 'History'}, page_content='their San Francisco headquarters as a source of randomness for encryption keys, alongside double'), 0.55573463)]
query_documents = cfVect.similarity_search_with_score(
index_name=vectorize_index_name,
query="DNS",
k=100,
md_filter={"section": {"$in": ["Products", "History"]}},
return_metadata="all",
)
print(f"{len(query_documents)} results:\n - {str(query_documents)}")
20 results:
- [(Document(id='520e5786-1ffd-4fe7-82c0-00ce53846454', metadata={'section': 'Products'}, page_content='protocols such as DNS over HTTPS, SMTP, and HTTP/2 with support for HTTP/2 Server Push. As of 2023,'), 0.7205538), (Document(id='47f42149-f5d2-457f-8b21-83708086e0f7', metadata={'section': 'Products'}, page_content='utilizing edge computing, reverse proxies for web traffic, data center interconnects, and a content'), 0.58178145), (Document(id='1bea41ed-88e7-4443-801c-e566598c3f86', metadata={'section': 'Products'}, page_content='and a content distribution network to serve content across its network of servers. It supports'), 0.5797795), (Document(id='1faeb28c-f0dc-4038-8ea8-1ed02e005e5e', metadata={'section': 'History'}, page_content='the New York Stock Exchange under the stock ticker NET. It opened for public trading on September'), 0.5678468), (Document(id='e1efd6cf-19e1-4640-aa8b-aff9323148b4', metadata={'section': 'Products'}, page_content='Cloudflare provides network and security products for consumers and businesses, utilizing edge'), 0.55722594), (Document(id='39857a5f-639d-42ab-a40f-c78fd526246f', metadata={'section': 'History'}, page_content='Cloudflare has acquired web-services and security companies, including StopTheHacker (February'), 0.5558441), (Document(id='b6947103-be26-4252-9389-37c0ecc98820', metadata={'section': 'Products'}, page_content='Push. As of 2023, Cloudflare handles an average of 45 million HTTP requests per second.'), 0.55429655), (Document(id='0edcd68e-c291-4d92-acc7-af292fad71c0', metadata={'section': 'Products'}, page_content='It supports transport layer protocols TCP, UDP, QUIC, and many application layer protocols such as'), 0.54969466), (Document(id='76e02c1a-a30c-4b2c-8fc3-a9b338e08e25', metadata={'section': 'History'}, page_content='Cloudflare was founded in July 2009 by Matthew Prince, Lee Holloway, and Michelle Zatlyn. 
Prince'), 0.54691005), (Document(id='218b6982-cf4e-4778-a759-4977ef83fe30', metadata={'section': 'History'}, page_content='2019, Cloudflare submitted its S-1 filing for an initial public offering on the New York Stock'), 0.533554), (Document(id='a936041a-1e30-4217-b161-b53d73b9b2c7', metadata={'section': 'History'}, page_content='Networks (March 2024), BastionZero (May 2024), and Kivera (October 2024).'), 0.53296596), (Document(id='645e5f9d-8fcf-4926-a36a-6137dd26540d', metadata={'section': 'Products'}, page_content='Verizon’s October 2024 outage.'), 0.53137076), (Document(id='87c83d1d-a4c2-4843-b2a0-84e6ef0e1916', metadata={'section': 'Products'}, page_content='Cloudflare also provides analysis and reports on large-scale outages, including Verizon’s October'), 0.53107977), (Document(id='7e6c210a-4bf9-4b43-8462-28e3bde1114f', metadata={'section': 'History'}, page_content='a product of Unspam Technologies that served as some inspiration for the basis of Cloudflare. From'), 0.528889), (Document(id='9c50e8aa-b246-4dec-ad0e-16a1ad07d3d5', metadata={'section': 'History'}, page_content='of Cloudflare. From 2009, the company was venture-capital funded. On August 15, 2019, Cloudflare'), 0.52717584), (Document(id='06408b72-d1e0-4160-af3e-b06b43109b30', metadata={'section': 'History'}, page_content='(December 2021), Vectrix (February 2022), Area 1 Security (February 2022), Nefeli Networks (March'), 0.52209044), (Document(id='78b1d42c-0509-445f-831a-6308a806c16f', metadata={'section': 'Products'}, page_content='As of 2024, Cloudflare servers are powered by AMD EPYC 9684X processors.'), 0.5169676), (Document(id='0d1f831d-632b-4e27-8cb3-0be3af2df51b', metadata={'section': 'History'}, page_content='(February 2014), CryptoSeal (June 2014), Eager Platform Co. 
(December 2016), Neumob (November'), 0.5132974), (Document(id='615edec2-6eef-48d3-9023-04efe4992887', metadata={'section': 'History'}, page_content='their San Francisco headquarters as a source of randomness for encryption keys, alongside double'), 0.50999177), (Document(id='c0611b8a-c8bb-48e4-a758-283b7df7454d', metadata={'section': 'History'}, page_content='Neumob (November 2017), S2 Systems (January 2020), Linc (December 2020), Zaraz (December 2021),'), 0.5092492)]
Search by Namespace
We can also search for vectors by namespace. We simply need to supply a matching namespaces array when adding the documents to the vector store.
namespace_name = f"test-namespace-{uuid.uuid4().hex[:8]}"
new_documents = [
Document(
page_content="This is a new namespace specific document!",
metadata={"section": "Namespace Test1"},
),
Document(
page_content="This is another namespace specific document!",
metadata={"section": "Namespace Test2"},
),
]
r = cfVect.add_documents(
index_name=vectorize_index_name,
documents=new_documents,
namespaces=[namespace_name] * len(new_documents),
wait=True,
)
query_documents = cfVect.similarity_search(
index_name=vectorize_index_name,
query="California",
namespace=namespace_name,
)
print(f"{len(query_documents)} results:\n - {str(query_documents)}")
2 results:
- [Document(id='6c9ab453-bf69-42aa-910d-e148c9c638d0', metadata={'section': 'Namespace Test2', '_namespace': 'test-namespace-e85040f0'}, page_content='This is another namespace specific document!'), Document(id='15fece51-d077-46ac-801c-faf0f479f8d9', metadata={'section': 'Namespace Test1', '_namespace': 'test-namespace-e85040f0'}, page_content='This is a new namespace specific document!')]
Search by IDs
We can also retrieve specific records by their IDs. To do this, we need to set the Vectorize index name on the instance's index_name state parameter.
This will return _namespace and _values along with any other metadata.
sample_ids = [x.id for x in query_documents]
cfVect.index_name = vectorize_index_name
query_documents = cfVect.get_by_ids(
sample_ids,
)
print(str(query_documents[:3])[:500])
[Document(id='6c9ab453-bf69-42aa-910d-e148c9c638d0', metadata={'section': 'Namespace Test2', '_namespace': 'test-namespace-e85040f0', '_values': [-0.0005841255, 0.014480591, 0.040771484, 0.005218506, 0.015579224, 0.0007543564, -0.005138397, -0.022720337, 0.021835327, 0.038970947, 0.017456055, 0.022705078, 0.013450623, -0.015686035, -0.019119263, -0.01512146, -0.017471313, -0.007183075, -0.054382324, -0.01914978, 0.0005302429, 0.018600464, -0.083740234, -0.006462097, 0.0005598068, 0.024230957, -0
The namespace will be included in the _namespace field in metadata, along with your other metadata (if you requested it with return_metadata).
Note: You cannot set the _namespace or _values fields within metadata, as these are reserved. They will be removed during inserts.
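The effect of reserving these fields can be sketched in plain Python (a hypothetical strip_reserved helper for illustration; the actual stripping happens inside the package):

```python
RESERVED_METADATA_KEYS = {"_namespace", "_values"}


def strip_reserved(metadata):
    """Drop reserved keys from user-supplied metadata before an insert."""
    return {k: v for k, v in metadata.items() if k not in RESERVED_METADATA_KEYS}


md = {"section": "Introduction", "_namespace": "my-ns", "_values": [0.1, 0.2]}
print(strip_reserved(md))  # {'section': 'Introduction'}
```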
Upserts
Vectorize supports upserts, which you can perform by setting upsert=True.
query_documents[0].page_content = "Updated: " + query_documents[0].page_content
print(query_documents[0].page_content)
Updated: This is another namespace specific document!
new_document_id = "12345678910"
new_document = Document(
id=new_document_id,
page_content="This is a new document!",
metadata={"section": "Introduction"},
)
r = cfVect.add_documents(
index_name=vectorize_index_name,
documents=[new_document, query_documents[0]],
upsert=True,
wait=True,
)
query_documents_updated = cfVect.get_by_ids([new_document_id, query_documents[0].id])
print(str(query_documents_updated[0])[:500])
print(query_documents_updated[0].page_content)
print(query_documents_updated[1].page_content)
page_content='This is a new document!' metadata={'section': 'Introduction', '_namespace': None, '_values': [-0.007522583, 0.0023021698, 0.009963989, 0.031051636, -0.021316528, 0.0048103333, 0.026046753, 0.01348114, 0.026306152, 0.040374756, 0.03225708, 0.007423401, 0.031021118, -0.007347107, -0.034179688, 0.002111435, -0.027191162, -0.020950317, -0.021636963, -0.0030593872, -0.04977417, 0.018859863, -0.08062744, -0.027679443, 0.012512207, 0.0053634644, 0.008079529, -0.010528564, 0.07312012, 0.02
This is a new document!
Updated: This is another namespace specific document!
Delete Records
We can delete records by their IDs:
r = cfVect.delete(index_name=vectorize_index_name, ids=sample_ids, wait=True)
print(r)
True
To confirm deletion:
query_documents = cfVect.get_by_ids(sample_ids)
assert len(query_documents) == 0
Create from Documents
LangChain specifies that all vector stores must have a from_documents method to instantiate a new vector store from documents. This is more streamlined than the separate create, add steps shown above.
You can do that as shown here:
vectorize_index_name = "test-langchain-from-docs"
cfVect = CloudflareVectorize.from_documents(
account_id=cf_acct_id,
index_name=vectorize_index_name,
documents=texts,
embedding=embedder,
d1_database_id=d1_database_id,
d1_api_token=cf_d1_token,
vectorize_api_token=cf_vectorize_token,
wait=True,
)
# query for documents
query_documents = cfVect.similarity_search(
index_name=vectorize_index_name,
query="Edge Computing",
)
print(f"{len(query_documents)} results:\n{str(query_documents[0])[:300]}")
20 results:
page_content='utilizing edge computing, reverse proxies for web traffic, data center interconnects, and a content' metadata={'section': 'Products'}
Async Examples
This section shows some asynchronous examples.
Create Indexes
vectorize_index_name1 = f"test-langchain-{uuid.uuid4().hex}"
vectorize_index_name2 = f"test-langchain-{uuid.uuid4().hex}"
vectorize_index_name3 = f"test-langchain-{uuid.uuid4().hex}"
# depending on your notebook environment you might need to include these:
# import nest_asyncio
# nest_asyncio.apply()
async_requests = [
cfVect.acreate_index(index_name=vectorize_index_name1),
cfVect.acreate_index(index_name=vectorize_index_name2),
cfVect.acreate_index(index_name=vectorize_index_name3),
]
res = await asyncio.gather(*async_requests);
Create Metadata Indexes
async_requests = [
cfVect.acreate_metadata_index(
property_name="section",
index_type="string",
index_name=vectorize_index_name1,
wait=True,
),
cfVect.acreate_metadata_index(
property_name="section",
index_type="string",
index_name=vectorize_index_name2,
wait=True,
),
cfVect.acreate_metadata_index(
property_name="section",
index_type="string",
index_name=vectorize_index_name3,
wait=True,
),
]
await asyncio.gather(*async_requests);
Add Documents
async_requests = [
cfVect.aadd_documents(index_name=vectorize_index_name1, documents=texts, wait=True),
cfVect.aadd_documents(index_name=vectorize_index_name2, documents=texts, wait=True),
cfVect.aadd_documents(index_name=vectorize_index_name3, documents=texts, wait=True),
]
await asyncio.gather(*async_requests);
Query/Search
async_requests = [
cfVect.asimilarity_search(index_name=vectorize_index_name1, query="Workers AI"),
cfVect.asimilarity_search(index_name=vectorize_index_name2, query="Edge Computing"),
cfVect.asimilarity_search(index_name=vectorize_index_name3, query="SASE"),
]
async_results = await asyncio.gather(*async_requests);
print(f"{len(async_results[0])} results:\n{str(async_results[0][0])[:300]}")
print(f"{len(async_results[1])} results:\n{str(async_results[1][0])[:300]}")
print(f"{len(async_results[2])} results:\n{str(async_results[2][0])[:300]}")
20 results:
page_content='In 2023, Cloudflare launched Workers AI, a framework allowing for use of Nvidia GPU's within'
20 results:
page_content='utilizing edge computing, reverse proxies for web traffic, data center interconnects, and a content'
20 results:
page_content='== Products =='
Return Metadata/Values
async_requests = [
cfVect.asimilarity_search(
index_name=vectorize_index_name1,
query="California",
return_values=True,
return_metadata="all",
),
cfVect.asimilarity_search(
index_name=vectorize_index_name2,
query="California",
return_values=True,
return_metadata="all",
),
cfVect.asimilarity_search(
index_name=vectorize_index_name3,
query="California",
return_values=True,
return_metadata="all",
),
]
async_results = await asyncio.gather(*async_requests);
print(f"{len(async_results[0])} results:\n{str(async_results[0][0])[:300]}")
print(f"{len(async_results[1])} results:\n{str(async_results[1][0])[:300]}")
print(f"{len(async_results[2])} results:\n{str(async_results[2][0])[:300]}")
20 results:
page_content='and other services. Cloudflare's headquarters are in San Francisco, California. According to' metadata={'section': 'Introduction', '_values': [-0.031219482, -0.018295288, -0.006000519, 0.017532349, 0.016403198, -0.029922485, -0.007133484, 0.004447937, 0.04559326, -0.011405945, 0.034820
20 results:
page_content='and other services. Cloudflare's headquarters are in San Francisco, California. According to' metadata={'section': 'Introduction', '_values': [-0.031219482, -0.018295288, -0.006000519, 0.017532349, 0.016403198, -0.029922485, -0.007133484, 0.004447937, 0.04559326, -0.011405945, 0.034820
20 results:
page_content='and other services. Cloudflare's headquarters are in San Francisco, California. According to' metadata={'section': 'Introduction', '_values': [-0.031219482, -0.018295288, -0.006000519, 0.017532349, 0.016403198, -0.029922485, -0.007133484, 0.004447937, 0.04559326, -0.011405945, 0.034820
Search with Metadata Filtering
async_requests = [
cfVect.asimilarity_search(
index_name=vectorize_index_name1,
query="Cloudflare services",
k=2,
md_filter={"section": "Products"},
return_metadata="all",
# return_values=True
),
cfVect.asimilarity_search(
index_name=vectorize_index_name2,
query="Cloudflare services",
k=2,
md_filter={"section": "Products"},
return_metadata="all",
# return_values=True
),
cfVect.asimilarity_search(
index_name=vectorize_index_name3,
query="Cloudflare services",
k=2,
md_filter={"section": "Products"},
return_metadata="all",
# return_values=True
),
]
async_results = await asyncio.gather(*async_requests);
[doc.metadata["section"] == "Products" for doc in async_results[0]]
[True, True]
print(f"{len(async_results[0])} results:\n{str(async_results[0][-1])[:300]}")
print(f"{len(async_results[1])} results:\n{str(async_results[1][0])[:300]}")
print(f"{len(async_results[2])} results:\n{str(async_results[2][0])[:300]}")
2 results:
page_content='Cloudflare also provides analysis and reports on large-scale outages, including Verizon’s October' metadata={'section': 'Products'}
2 results:
page_content='Cloudflare provides network and security products for consumers and businesses, utilizing edge' metadata={'section': 'Products'}
2 results:
page_content='Cloudflare provides network and security products for consumers and businesses, utilizing edge' metadata={'section': 'Products'}
Cleanup
Let's finish off by deleting all of the indexes we created in this notebook.
arr_indexes = cfVect.list_indexes()
arr_indexes = [x for x in arr_indexes if "test-langchain" in x.get("name")]
arr_async_requests = [
cfVect.adelete_index(index_name=x.get("name")) for x in arr_indexes
]
await asyncio.gather(*arr_async_requests);
API Reference
https://developers.cloudflare.com/api/resources/vectorize/
https://developers.cloudflare.com/vectorize/