ClickHouse

ClickHouse is a fast, resource-efficient open-source database for real-time applications and analytics, with full SQL support and a wide range of functions that help users write analytical queries. Recently added data structures and distance search functions (such as L2Distance), together with approximate nearest neighbor search indexes, allow ClickHouse to serve as a high-performance, scalable vector database that stores and searches vectors with SQL.

This notebook shows how to use functionality related to the ClickHouse vector store.
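To make the idea of a distance search function concrete, here is a minimal pure-Python sketch of what an exact L2 (Euclidean) nearest-neighbor scan computes. The helper names `l2_distance` and `nearest` are illustrative only and are not part of ClickHouse or LangChain; ClickHouse evaluates such distances in SQL, and its approximate indexes avoid scanning every vector.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query, vectors, k=1):
    """Return the indices of the k vectors closest to the query (exact scan)."""
    order = sorted(range(len(vectors)), key=lambda i: l2_distance(query, vectors[i]))
    return order[:k]


# Toy 2-dimensional "embeddings"; real embeddings have far more dimensions.
vectors = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]
print(l2_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
print(nearest([0.9, 1.2], vectors, k=2))    # [2, 0]
```

A smaller distance means a closer match, which is why the results are sorted in ascending order.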

Setup

First, set up a local ClickHouse server with Docker:

! docker run -d -p 8123:8123 -p 9000:9000 --name langchain-clickhouse-server --ulimit nofile=262144:262144 clickhouse/clickhouse-server:24.7.6.8
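Before continuing, it can help to confirm that the container is actually accepting connections. The sketch below assumes the HTTP interface is published on port 8123, as in the command above, and queries ClickHouse's `/ping` endpoint. The function name `clickhouse_is_up` is hypothetical and exists only for this example.

```python
import http.client
import urllib.error
import urllib.request


def clickhouse_is_up(base_url="http://localhost:8123", timeout=2.0):
    """Return True if the ClickHouse HTTP interface answers /ping with 'Ok.'."""
    try:
        with urllib.request.urlopen(f"{base_url}/ping", timeout=timeout) as resp:
            return resp.status == 200 and resp.read().strip() == b"Ok."
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or a malformed reply: treat as "not ready".
        return False


print(clickhouse_is_up())
```

If this prints False right after starting the container, waiting a few seconds and retrying is usually enough.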

You need to install langchain-community and clickhouse-connect to use this integration:

pip install -qU langchain-community clickhouse-connect

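Credentials

No credentials are needed for this notebook; just make sure you have installed the packages as shown above.

If you want best-in-class automated tracing of your model calls, you can also set your LangSmith API key by uncommenting the lines below: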

# import getpass
# import os
# os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
# os.environ["LANGSMITH_TRACING"] = "true"

Instantiation

Select an embeddings model:

pip install -qU langchain-openai

import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
  os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

from langchain_community.vectorstores import Clickhouse, ClickhouseSettings

settings = ClickhouseSettings(table="clickhouse_example")
vector_store = Clickhouse(embeddings, config=settings)

API Reference: Clickhouse | ClickhouseSettings

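Manage vector store

Once you have created your vector store, you can interact with it by adding and deleting items.

Add items to vector store

You can add items to the vector store with the add_documents function.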

from uuid import uuid4

from langchain_core.documents import Document

document_1 = Document(
    page_content="I had chocolate chip pancakes and scrambled eggs for breakfast this morning.",
    metadata={"source": "tweet"},
)

document_2 = Document(
    page_content="The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.",
    metadata={"source": "news"},
)

document_3 = Document(
    page_content="Building an exciting new project with LangChain - come check it out!",
    metadata={"source": "tweet"},
)

document_4 = Document(
    page_content="Robbers broke into the city bank and stole $1 million in cash.",
    metadata={"source": "news"},
)

document_5 = Document(
    page_content="Wow! That was an amazing movie. I can't wait to see it again.",
    metadata={"source": "tweet"},
)

document_6 = Document(
    page_content="Is the new iPhone worth the price? Read this review to find out.",
    metadata={"source": "website"},
)

document_7 = Document(
    page_content="The top 10 soccer players in the world right now.",
    metadata={"source": "website"},
)

document_8 = Document(
    page_content="LangGraph is the best framework for building stateful, agentic applications!",
    metadata={"source": "tweet"},
)

document_9 = Document(
    page_content="The stock market is down 500 points today due to fears of a recession.",
    metadata={"source": "news"},
)

document_10 = Document(
    page_content="I have a bad feeling I am going to get deleted :(",
    metadata={"source": "tweet"},
)

documents = [
    document_1,
    document_2,
    document_3,
    document_4,
    document_5,
    document_6,
    document_7,
    document_8,
    document_9,
    document_10,
]
uuids = [str(uuid4()) for _ in range(len(documents))]

vector_store.add_documents(documents=documents, ids=uuids)

API Reference: Document

Delete items from vector store

You can delete items from the vector store by ID with the delete function.

vector_store.delete(ids=[uuids[-1]])

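Query vector store

Once your vector store has been created and the relevant documents have been added, you will most likely want to query it while running your chain or agent.

Query directly

Similarity search

A simple similarity search can be performed as follows: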

results = vector_store.similarity_search(
    "LangChain provides abstractions to make working with LLMs easy", k=2
)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")

Similarity search with score

You can also search with score:

results = vector_store.similarity_search_with_score("Will it be hot tomorrow?", k=1)
for res, score in results:
    print(f"* [SIM={score:.3f}] {res.page_content} [{res.metadata}]")

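Filtering

You have direct access to the ClickHouse SQL WHERE statement, and you can write the WHERE clause following standard SQL.

NOTE: Be aware of SQL injection. This interface must not be called directly by end users.

If you customized your column_map in your settings, you can search with a filter like this: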

meta = vector_store.metadata_column
results = vector_store.similarity_search_with_relevance_scores(
    "What did I eat for breakfast?",
    k=4,
    where_str=f"{meta}.source = 'tweet'",
)
# Each result is a (document, score) pair, so unpack both values.
for res, score in results:
    print(f"* [SIM={score:.3f}] {res.page_content} [{res.metadata}]")
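Because the filter string is inserted into SQL as written, one way to reduce the injection risk noted above is to validate any externally supplied value before building the clause. The following is a minimal sketch, not a complete defense: the helper `source_filter`, the allow-list pattern, and the column name "metadata" are assumptions made for illustration, and the metadata column name itself is assumed to come from trusted configuration.

```python
import re

# Hypothetical allow-list: letters, digits, underscore, and hyphen only.
_SAFE_VALUE = re.compile(r"[A-Za-z0-9_-]+")


def source_filter(meta_column, source):
    """Build a WHERE fragment such as "metadata.source = 'tweet'" from a vetted value."""
    if not _SAFE_VALUE.fullmatch(source):
        raise ValueError(f"unsafe filter value: {source!r}")
    return f"{meta_column}.source = '{source}'"


print(source_filter("metadata", "tweet"))  # metadata.source = 'tweet'

try:
    source_filter("metadata", "tweet' OR 1=1 --")
except ValueError as exc:
    print(exc)  # the injection attempt is rejected before reaching SQL
```

The returned string can then be passed as where_str in place of a hand-written clause.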

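Other search methods

There are a variety of other search methods not covered in this notebook, such as MMR search or searching by vector. For a full list of the search capabilities available for the Clickhouse vector store, see the API reference.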
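For readers unfamiliar with MMR (maximal marginal relevance), the sketch below shows the general idea in plain Python: pick results one at a time, trading relevance to the query against similarity to results already chosen. This is an illustration of the technique under simple assumptions (cosine similarity, a `lambda_mult` weight), not the code used by the Clickhouse vector store; the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query, candidates, k=2, lambda_mult=0.5):
    """Return indices of k candidates, balancing relevance against redundancy."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine(query, candidates[i])
            # Redundancy is the highest similarity to anything already picked.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]
print(mmr(query, candidates, k=2, lambda_mult=0.3))  # [0, 2]: favors diversity
print(mmr(query, candidates, k=2, lambda_mult=1.0))  # [0, 1]: pure relevance
```

With a low `lambda_mult` the second pick skips the near-duplicate candidate; with `lambda_mult=1.0` the selection reduces to ordinary similarity ranking.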

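Query by turning into retriever

You can also transform the vector store into a retriever for easier use in your chains.

Here is how to transform your vector store into a retriever and then invoke the retriever with a simple query and filter.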

retriever = vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 1, "score_threshold": 0.5},
)
retriever.invoke("Stealing from the bank is a crime", filter={"source": "news"})

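Usage for retrieval-augmented generation

For guides on how to use this vector store for retrieval-augmented generation (RAG), see the following sections:

For more, check out a complete RAG template using Astra DB here.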
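As a rough illustration of the retrieval step in RAG, the sketch below turns retrieved documents into a prompt for a language model. It uses a small stand-in class instead of the LangChain Document type so it runs without any dependencies; `Doc`, `format_docs`, and `build_prompt` are hypothetical names, and the prompt wording is only one possible choice.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a retrieved document (page_content plus metadata)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    """Join retrieved documents into one context string, tagging each with its source."""
    return "\n\n".join(
        f"[{d.metadata.get('source', 'unknown')}] {d.page_content}" for d in docs
    )


def build_prompt(question, docs):
    """Assemble a prompt asking a model to answer from the retrieved context only."""
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{format_docs(docs)}\n\n"
        f"Question: {question}"
    )


retrieved = [
    Doc(
        "Robbers broke into the city bank and stole $1 million in cash.",
        {"source": "news"},
    )
]
print(build_prompt("What happened at the bank?", retrieved))
```

In the notebook, the list of documents would come from retriever.invoke(question) rather than being written by hand, and the resulting prompt would be sent to a chat model.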

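API reference

For detailed documentation of all Clickhouse features and configurations, head to the API reference: https://python.langchain.com/api_reference/community/vectorstores/langchain_community.vectorstores.clickhouse.Clickhouse.html

Related

Vector store conceptual guide
Vector store how-to guides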
