SAP HANA Cloud 向量引擎
SAP HANA Cloud 向量引擎 是完全集成到
SAP HANA Cloud数据库中的向量存储。
设置
安装langchain-hana个外部集成包,以及本笔记本中使用的其他包。
%pip install -qU langchain-hana
Credentials
确保您的 SAP HANA 实例正在运行。从环境变量中加载凭据并创建连接:
import os
from dotenv import load_dotenv
from hdbcli import dbapi
load_dotenv()
# Use connection settings from the environment
connection = dbapi.connect(
address=os.environ.get("HANA_DB_ADDRESS"),
port=os.environ.get("HANA_DB_PORT"),
user=os.environ.get("HANA_DB_USER"),
password=os.environ.get("HANA_DB_PASSWORD"),
autocommit=True,
sslValidateCertificate=False,
)
在 什么是 SAP HANA? 中了解有关 SAP HANA 的更多信息。
初始化
要初始化一个 HanaDB 向量存储,你需要一个数据库连接和一个嵌入实例。SAP HANA Cloud 向量引擎支持外部和内部嵌入。
-
使用外部嵌入
pip install -qU langchain-openai
import getpass
import os
if not os.environ.get("OPENAI_API_KEY"):
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
-
使用内部嵌入
您也可以使用 SAP HANA 内置的 VECTOR_EMBEDDING() 函数直接计算嵌入(embeddings)。为此,请使用您的内部模型 ID 创建一个 HanaInternalEmbeddings 实例,并将其传递给 HanaDB。请注意,HanaInternalEmbeddings 实例是专门为与 HanaDB 配合使用而设计的,不适用于其他向量存储实现。有关内部嵌入的更多信息,请参阅 SAP HANA VECTOR_EMBEDDING 函数。
注意: 确保在您的 SAP HANA Cloud 实例中已启用 NLP。
from langchain_hana import HanaInternalEmbeddings
embeddings = HanaInternalEmbeddings(internal_embedding_model_id="SAP_NEB.20240715")
一旦你有了连接和嵌入实例,通过将它们与用于存储向量的表名一起传递给 HanaDB 来创建向量存储:
from langchain_hana import HanaDB
db = HanaDB(
embedding=embeddings, connection=connection, table_name="STATE_OF_THE_UNION"
)
示例
加载示例文档 \"state_of_the_union.txt\" 并从中创建文本块。
from langchain_community.document_loaders import TextLoader
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
text_documents = TextLoader(
"../../how_to/state_of_the_union.txt", encoding="UTF-8"
).load()
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=0)
text_chunks = text_splitter.split_documents(text_documents)
print(f"Number of document chunks: {len(text_chunks)}")
Number of document chunks: 88
将已加载的文档块添加到表中。在本示例中,我们会删除表中可能由之前运行留下的任何先前内容。
# Delete already existing documents from the table
db.delete(filter={})
# add the loaded document chunks
db.add_documents(text_chunks)
[]
执行查询以从上一步中添加的文档块中获取两个最匹配的文档块。 默认情况下,搜索使用“余弦相似度”。
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query, k=2)
for doc in docs:
print("-" * 80)
print(doc.page_content)
--------------------------------------------------------------------------------
One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.
And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
--------------------------------------------------------------------------------
As I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential.
While it often appears that we never agree, that isn’t true. I signed 80 bipartisan bills into law last year. From preventing government shutdowns to protecting Asian-Americans from still-too-common hate crimes to reforming military justice.
使用“欧几里得距离”查询相同的内容。结果应与“余弦相似度”相同。
from langchain_hana.utils import DistanceStrategy
db = HanaDB(
embedding=embeddings,
connection=connection,
distance_strategy=DistanceStrategy.EUCLIDEAN_DISTANCE,
table_name="STATE_OF_THE_UNION",
)
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query, k=2)
for doc in docs:
print("-" * 80)
print(doc.page_content)
--------------------------------------------------------------------------------
One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.
And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
--------------------------------------------------------------------------------
As I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential.
While it often appears that we never agree, that isn’t true. I signed 80 bipartisan bills into law last year. From preventing government shutdowns to protecting Asian-Americans from still-too-common hate crimes to reforming military justice.
最大边际相关性搜索 (MMR)
Maximal marginal relevance 在优化与查询的相似性的同时,也注重所选文档之间的多样性。将从数据库中检索前 20 个(fetch_k)项目,然后 MMR 算法会从中找出最佳的 2 个(k)匹配项。
docs = db.max_marginal_relevance_search(query, k=2, fetch_k=20)
for doc in docs:
print("-" * 80)
print(doc.page_content)
--------------------------------------------------------------------------------
One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.
And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
--------------------------------------------------------------------------------
Groups of citizens blocking tanks with their bodies. Everyone from students to retirees teachers turned soldiers defending their homeland.
In this struggle as President Zelenskyy said in his speech to the European Parliament “Light will win over darkness.” The Ukrainian Ambassador to the United States is here tonight.
Let each of us here tonight in this Chamber send an unmistakable signal to Ukraine and to the world.
创建一个HNSW向量索引
向量索引可以显著加快向量的 top-k 最近邻查询速度。用户可以使用 create_hnsw_index 函数创建分层可导航小世界(HNSW)向量索引。
有关在数据库级别创建索引的更多信息,请参阅官方文档。
# HanaDB instance uses cosine similarity as default:
db_cosine = HanaDB(
embedding=embeddings, connection=connection, table_name="STATE_OF_THE_UNION"
)
# Attempting to create the HNSW index with default parameters
db_cosine.create_hnsw_index() # If no other parameters are specified, the default values will be used
# Default values: m=64, ef_construction=128, ef_search=200
# The default index name will be: STATE_OF_THE_UNION_COSINE_SIMILARITY_IDX (verify this naming pattern in HanaDB class)
# Creating a HanaDB instance with L2 distance as the similarity function and defined values
db_l2 = HanaDB(
embedding=embeddings,
connection=connection,
table_name="STATE_OF_THE_UNION",
distance_strategy=DistanceStrategy.EUCLIDEAN_DISTANCE, # Specify L2 distance
)
# This will create an index based on L2 distance strategy.
db_l2.create_hnsw_index(
index_name="STATE_OF_THE_UNION_L2_index",
m=100, # Max number of neighbors per graph node (valid range: 4 to 1000)
ef_construction=200, # Max number of candidates during graph construction (valid range: 1 to 100000)
ef_search=500, # Min number of candidates during the search (valid range: 1 to 100000)
)
# Use L2 index to perform MMR
docs = db_l2.max_marginal_relevance_search(query, k=2, fetch_k=20)
for doc in docs:
print("-" * 80)
print(doc.page_content)
--------------------------------------------------------------------------------
One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.
And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
--------------------------------------------------------------------------------
Groups of citizens blocking tanks with their bodies. Everyone from students to retirees teachers turned soldiers defending their homeland.
In this struggle as President Zelenskyy said in his speech to the European Parliament “Light will win over darkness.” The Ukrainian Ambassador to the United States is here tonight.
Let each of us here tonight in this Chamber send an unmistakable signal to Ukraine and to the world.
关键点:
- 相似性函数:索引的相似性函数默认为余弦相似性。如果你想使用不同的相似性函数(例如,
L2距离),你需要在初始化HanaDB实例时指定它。 - 默认参数:在
create_hnsw_index函数中,如果用户未为诸如m、ef_construction或ef_search等参数提供自定义值,则将自动使用默认值(例如m=64、ef_construction=128、ef_search=200)。这些值可确保索引以合理的性能创建,而无需用户干预。
基础向量存储操作
db = HanaDB(
connection=connection, embedding=embeddings, table_name="LANGCHAIN_DEMO_BASIC"
)
# Delete already existing documents from the table
db.delete(filter={})
True
我们可以向现有表格添加简单的文本文档。
docs = [Document(page_content="Some text"), Document(page_content="Other docs")]
db.add_documents(docs)
[]
添加带有元数据的文档。
docs = [
Document(
page_content="foo",
metadata={"start": 100, "end": 150, "doc_name": "foo.txt", "quality": "bad"},
),
Document(
page_content="bar",
metadata={"start": 200, "end": 250, "doc_name": "bar.txt", "quality": "good"},
),
]
db.add_documents(docs)
[]
使用特定元数据查询文档。
docs = db.similarity_search("foobar", k=2, filter={"quality": "bad"})
# With filtering on "quality"=="bad", only one document should be returned
for doc in docs:
print("-" * 80)
print(doc.page_content)
print(doc.metadata)
--------------------------------------------------------------------------------
foo
{'start': 100, 'end': 150, 'doc_name': 'foo.txt', 'quality': 'bad'}
删除具有特定元数据的文档。
db.delete(filter={"quality": "bad"})
# Now the similarity search with the same filter will return no results
docs = db.similarity_search("foobar", k=2, filter={"quality": "bad"})
print(len(docs))
0
高级过滤
除了基本的基于值的过滤功能外,还可以使用更高级的过滤。 下表显示了可用的过滤操作符。
| 操作员 | 语义 |
|---|---|
$eq | Equality (==) |
$ne | Inequality (!=) |
$lt | Less than (<) |
$lte | Less than or equal (<=) |
$gt | Greater than (>) |
$gte | Greater than or equal (>=) |
$in | Contained in a set of given values (in) |
$nin | Not contained in a set of given values (not in) |
$between | Between the range of two boundary values |
$like | Text equality based on the "LIKE" semantics in SQL (using "%" as wildcard) |
$contains | Filters documents containing a specific keyword |
$and | Logical "and", supporting 2 or more operands |
$or | Logical "or", supporting 2 or more operands |
# Prepare some test documents
docs = [
Document(
page_content="First",
metadata={"name": "Adam Smith", "is_active": True, "id": 1, "height": 10.0},
),
Document(
page_content="Second",
metadata={"name": "Bob Johnson", "is_active": False, "id": 2, "height": 5.7},
),
Document(
page_content="Third",
metadata={"name": "Jane Doe", "is_active": True, "id": 3, "height": 2.4},
),
]
db = HanaDB(
connection=connection,
embedding=embeddings,
table_name="LANGCHAIN_DEMO_ADVANCED_FILTER",
)
# Delete already existing documents from the table
db.delete(filter={})
db.add_documents(docs)
# Helper function for printing filter results
def print_filter_result(result):
if len(result) == 0:
print("<empty result>")
for doc in result:
print(doc.metadata)
使用 $ne、$gt、$gte、$lt、$lte 进行过滤
advanced_filter = {"id": {"$ne": 1}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
advanced_filter = {"id": {"$gt": 1}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
advanced_filter = {"id": {"$gte": 1}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
advanced_filter = {"id": {"$lt": 1}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
advanced_filter = {"id": {"$lte": 1}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
Filter: {'id': {'$ne': 1}}
{'name': 'Jane Doe', 'is_active': True, 'id': 3, 'height': 2.4}
{'name': 'Bob Johnson', 'is_active': False, 'id': 2, 'height': 5.7}
Filter: {'id': {'$gt': 1}}
{'name': 'Jane Doe', 'is_active': True, 'id': 3, 'height': 2.4}
{'name': 'Bob Johnson', 'is_active': False, 'id': 2, 'height': 5.7}
Filter: {'id': {'$gte': 1}}
{'name': 'Adam Smith', 'is_active': True, 'id': 1, 'height': 10.0}
{'name': 'Jane Doe', 'is_active': True, 'id': 3, 'height': 2.4}
{'name': 'Bob Johnson', 'is_active': False, 'id': 2, 'height': 5.7}
Filter: {'id': {'$lt': 1}}
<empty result>
Filter: {'id': {'$lte': 1}}
{'name': 'Adam Smith', 'is_active': True, 'id': 1, 'height': 10.0}
使用 $between、$in、$nin 进行过滤
advanced_filter = {"id": {"$between": (1, 2)}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
advanced_filter = {"name": {"$in": ["Adam Smith", "Bob Johnson"]}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
advanced_filter = {"name": {"$nin": ["Adam Smith", "Bob Johnson"]}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
Filter: {'id': {'$between': (1, 2)}}
{'name': 'Adam Smith', 'is_active': True, 'id': 1, 'height': 10.0}
{'name': 'Bob Johnson', 'is_active': False, 'id': 2, 'height': 5.7}
Filter: {'name': {'$in': ['Adam Smith', 'Bob Johnson']}}
{'name': 'Adam Smith', 'is_active': True, 'id': 1, 'height': 10.0}
{'name': 'Bob Johnson', 'is_active': False, 'id': 2, 'height': 5.7}
Filter: {'name': {'$nin': ['Adam Smith', 'Bob Johnson']}}
{'name': 'Jane Doe', 'is_active': True, 'id': 3, 'height': 2.4}
使用 $like 进行文本过滤
advanced_filter = {"name": {"$like": "a%"}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
advanced_filter = {"name": {"$like": "%a%"}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
Filter: {'name': {'$like': 'a%'}}
<empty result>
Filter: {'name': {'$like': '%a%'}}
{'name': 'Adam Smith', 'is_active': True, 'id': 1, 'height': 10.0}
{'name': 'Jane Doe', 'is_active': True, 'id': 3, 'height': 2.4}
使用 $contains 进行文本过滤
advanced_filter = {"name": {"$contains": "bob"}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
advanced_filter = {"name": {"$contains": "bo"}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
advanced_filter = {"name": {"$contains": "Adam Johnson"}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
advanced_filter = {"name": {"$contains": "Adam Smith"}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
Filter: {'name': {'$contains': 'bob'}}
{'name': 'Bob Johnson', 'is_active': False, 'id': 2, 'height': 5.7}
Filter: {'name': {'$contains': 'bo'}}
<empty result>
Filter: {'name': {'$contains': 'Adam Johnson'}}
<empty result>
Filter: {'name': {'$contains': 'Adam Smith'}}
{'name': 'Adam Smith', 'is_active': True, 'id': 1, 'height': 10.0}
使用 $and, $or 进行组合过滤
advanced_filter = {"$or": [{"id": 1}, {"name": "bob"}]}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
advanced_filter = {"$and": [{"id": 1}, {"id": 2}]}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
advanced_filter = {"$or": [{"id": 1}, {"id": 2}, {"id": 3}]}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
advanced_filter = {
"$and": [{"name": {"$contains": "bob"}}, {"name": {"$contains": "johnson"}}]
}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
Filter: {'$or': [{'id': 1}, {'name': 'bob'}]}
{'name': 'Adam Smith', 'is_active': True, 'id': 1, 'height': 10.0}
Filter: {'$and': [{'id': 1}, {'id': 2}]}
<empty result>
Filter: {'$or': [{'id': 1}, {'id': 2}, {'id': 3}]}
{'name': 'Adam Smith', 'is_active': True, 'id': 1, 'height': 10.0}
{'name': 'Jane Doe', 'is_active': True, 'id': 3, 'height': 2.4}
{'name': 'Bob Johnson', 'is_active': False, 'id': 2, 'height': 5.7}
Filter: {'$and': [{'name': {'$contains': 'bob'}}, {'name': {'$contains': 'johnson'}}]}
{'name': 'Bob Johnson', 'is_active': False, 'id': 2, 'height': 5.7}
将 VectorStore 用作检索增强生成(RAG)链中的检索器
# Access the vector DB with a new table
db = HanaDB(
connection=connection,
embedding=embeddings,
table_name="LANGCHAIN_DEMO_RETRIEVAL_CHAIN",
)
# Delete already existing entries from the table
db.delete(filter={})
# add the loaded document chunks from the "State Of The Union" file
db.add_documents(text_chunks)
# Create a retriever instance of the vector store
retriever = db.as_retriever()
定义提示。
from langchain_core.prompts import PromptTemplate
prompt_template = """
You are an expert in state of the union topics. You are provided multiple context items that are related to the prompt you have to answer.
Use the following pieces of context to answer the question at the end.
'''
{context}
'''
Question: {question}
"""
PROMPT = PromptTemplate(
template=prompt_template, input_variables=["context", "question"]
)
chain_type_kwargs = {"prompt": PROMPT}
创建ConversationalRetrievalChain,它负责处理聊天历史以及检索类似的文档片段以添加到提示中。
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-3.5-turbo")
memory = ConversationBufferMemory(
memory_key="chat_history", output_key="answer", return_messages=True
)
qa_chain = ConversationalRetrievalChain.from_llm(
llm,
db.as_retriever(search_kwargs={"k": 5}),
return_source_documents=True,
memory=memory,
verbose=False,
combine_docs_chain_kwargs={"prompt": PROMPT},
)
提出第一个问题(并验证已使用多少文本块)。
question = "What about Mexico and Guatemala?"
result = qa_chain.invoke({"question": question})
print("Answer from LLM:")
print("================")
print(result["answer"])
source_docs = result["source_documents"]
print("================")
print(f"Number of used source document chunks: {len(source_docs)}")
Answer from LLM:
================
The United States has set up joint patrols with Mexico and Guatemala to catch more human traffickers at the border. This collaborative effort aims to improve border security and combat illegal activities such as human trafficking.
================
Number of used source document chunks: 5
详细检查链中使用的各个块。查看排名最高的块是否包含问题中提到的“墨西哥和危地马拉”的相关信息。
for doc in source_docs:
print("-" * 80)
print(doc.page_content)
print(doc.metadata)
在相同的对话链上提出另一个问题。答案应与之前给出的答案相关。
question = "How many casualties were reported after that?"
result = qa_chain.invoke({"question": question})
print("Answer from LLM:")
print("================")
print(result["answer"])
Answer from LLM:
================
Countries like Mexico and Guatemala are participating in joint patrols to catch human traffickers. The United States is also working with partners in South and Central America to host more refugees and secure their borders. Additionally, the U.S. is working with twenty-seven members of the European Union, as well as countries like France, Germany, Italy, the United Kingdom, Canada, Japan, Korea, Australia, New Zealand, and Switzerland.
标准表与包含向量数据的“自定义”表
作为默认行为,嵌入表将创建3列:
- 一个包含文档文本的第
VEC_TEXT列 - 一个包含 Document 元数据的
VEC_META列 - 一个名为
VEC_VECTOR的列,其中包含文档文本的嵌入向量
# Access the vector DB with a new table
db = HanaDB(
connection=connection, embedding=embeddings, table_name="LANGCHAIN_DEMO_NEW_TABLE"
)
# Delete already existing entries from the table
db.delete(filter={})
# Add a simple document with some metadata
docs = [
Document(
page_content="A simple document",
metadata={"start": 100, "end": 150, "doc_name": "simple.txt"},
)
]
db.add_documents(docs)
[]
显示表 \"LANGCHAIN_DEMO_NEW_TABLE\" 中的列
cur = connection.cursor()
cur.execute(
"SELECT COLUMN_NAME, DATA_TYPE_NAME FROM SYS.TABLE_COLUMNS WHERE SCHEMA_NAME = CURRENT_SCHEMA AND TABLE_NAME = 'LANGCHAIN_DEMO_NEW_TABLE'"
)
rows = cur.fetchall()
for row in rows:
print(row)
cur.close()
('VEC_META', 'NCLOB')
('VEC_TEXT', 'NCLOB')
('VEC_VECTOR', 'REAL_VECTOR')
显示插入文档在三列中的值
cur = connection.cursor()
cur.execute(
"SELECT VEC_TEXT, VEC_META, TO_NVARCHAR(VEC_VECTOR) FROM LANGCHAIN_DEMO_NEW_TABLE LIMIT 1"
)
rows = cur.fetchall()
print(rows[0][0]) # The text
print(rows[0][1]) # The metadata
print(rows[0][2]) # The vector
cur.close()
自定义表格必须至少包含三列,且这些列需符合标准表格的语义
- 一个类型为
NCLOB或NVARCHAR的列,用于嵌入的文本/上下文 - 一个具有类型
NCLOB或NVARCHAR的元数据列 - 具有类型
REAL_VECTOR的列用于嵌入向量
表格可以包含额外的列。当向表中插入新文档时,这些额外的列必须允许 NULL 值。
# Create a new table "MY_OWN_TABLE_ADD" with three "standard" columns and one additional column
my_own_table_name = "MY_OWN_TABLE_ADD"
cur = connection.cursor()
cur.execute(
(
f"CREATE TABLE {my_own_table_name} ("
"SOME_OTHER_COLUMN NVARCHAR(42), "
"MY_TEXT NVARCHAR(2048), "
"MY_METADATA NVARCHAR(1024), "
"MY_VECTOR REAL_VECTOR )"
)
)
# Create a HanaDB instance with the own table
db = HanaDB(
connection=connection,
embedding=embeddings,
table_name=my_own_table_name,
content_column="MY_TEXT",
metadata_column="MY_METADATA",
vector_column="MY_VECTOR",
)
# Add a simple document with some metadata
docs = [
Document(
page_content="Some other text",
metadata={"start": 400, "end": 450, "doc_name": "other.txt"},
)
]
db.add_documents(docs)
# Check if data has been inserted into our own table
cur.execute(f"SELECT * FROM {my_own_table_name} LIMIT 1")
rows = cur.fetchall()
print(rows[0][0]) # Value of column "SOME_OTHER_DATA". Should be NULL/None
print(rows[0][1]) # The text
print(rows[0][2]) # The metadata
print(rows[0][3]) # The vector
cur.close()
None
Some other text
{"start": 400, "end": 450, "doc_name": "other.txt"}
<memory at 0x110f856c0>
添加另一个文档并对自定义表执行相似性搜索。
docs = [
Document(
page_content="Some more text",
metadata={"start": 800, "end": 950, "doc_name": "more.txt"},
)
]
db.add_documents(docs)
query = "What's up?"
docs = db.similarity_search(query, k=2)
for doc in docs:
print("-" * 80)
print(doc.page_content)
--------------------------------------------------------------------------------
Some more text
--------------------------------------------------------------------------------
Some other text
筛选性能优化与自定义列
为了支持灵活的元数据值,默认情况下所有元数据均以 JSON 格式存储在 metadata 列中。如果已知部分使用的元数据键和值类型,则可以通过在创建目标表时将键名称作为列名,并通过 specific_metadata_columns 列表将其传递给 HanaDB 构造函数,将这些元数据存储在额外的列中。在插入期间,与 specific_metadata_columns 列表中值匹配的元数据键会被复制到特殊列中。对于位于 specific_metadata_columns 列表中的键,过滤器将使用这些特殊列而非元数据 JSON 列。
# Create a new table "PERFORMANT_CUSTOMTEXT_FILTER" with three "standard" columns and one additional column
my_own_table_name = "PERFORMANT_CUSTOMTEXT_FILTER"
cur = connection.cursor()
cur.execute(
(
f"CREATE TABLE {my_own_table_name} ("
"CUSTOMTEXT NVARCHAR(500), "
"MY_TEXT NVARCHAR(2048), "
"MY_METADATA NVARCHAR(1024), "
"MY_VECTOR REAL_VECTOR )"
)
)
# Create a HanaDB instance with the own table
db = HanaDB(
connection=connection,
embedding=embeddings,
table_name=my_own_table_name,
content_column="MY_TEXT",
metadata_column="MY_METADATA",
vector_column="MY_VECTOR",
specific_metadata_columns=["CUSTOMTEXT"],
)
# Add a simple document with some metadata
docs = [
Document(
page_content="Some other text",
metadata={
"start": 400,
"end": 450,
"doc_name": "other.txt",
"CUSTOMTEXT": "Filters on this value are very performant",
},
)
]
db.add_documents(docs)
# Check if data has been inserted into our own table
cur.execute(f"SELECT * FROM {my_own_table_name} LIMIT 1")
rows = cur.fetchall()
print(
rows[0][0]
) # Value of column "CUSTOMTEXT". Should be "Filters on this value are very performant"
print(rows[0][1]) # The text
print(
rows[0][2]
) # The metadata without the "CUSTOMTEXT" data, as this is extracted into a sperate column
print(rows[0][3]) # The vector
cur.close()
Filters on this value are very performant
Some other text
{"start": 400, "end": 450, "doc_name": "other.txt", "CUSTOMTEXT": "Filters on this value are very performant"}
<memory at 0x110f859c0>
特殊列对 LangChain 接口的其余部分完全透明。所有功能都像以前一样正常运行,只是性能更优。
docs = [
Document(
page_content="Some more text",
metadata={
"start": 800,
"end": 950,
"doc_name": "more.txt",
"CUSTOMTEXT": "Another customtext value",
},
)
]
db.add_documents(docs)
advanced_filter = {"CUSTOMTEXT": {"$like": "%value%"}}
query = "What's up?"
docs = db.similarity_search(query, k=2, filter=advanced_filter)
for doc in docs:
print("-" * 80)
print(doc.page_content)
--------------------------------------------------------------------------------
Some more text
--------------------------------------------------------------------------------
Some other text