
Faiss (Async)

Facebook AI Similarity Search (FAISS) is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also includes supporting code for evaluation and parameter tuning.

See the "The FAISS Library" research paper.

Faiss documentation.

You need to install the langchain-community package with pip install -qU langchain-community to use this integration.

This notebook shows how to use functionality related to the FAISS vector database using asyncio. LangChain implements both synchronous and asynchronous vector store functions.

See the synchronous version here.

%pip install --upgrade --quiet faiss-gpu # For CUDA 7.5+ supported GPUs
# OR
%pip install --upgrade --quiet faiss-cpu # For CPU installation

We want to use OpenAIEmbeddings, so we have to get the OpenAI API key.

import getpass
import os

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

# Uncomment the following line if you need to initialize FAISS with no AVX2 optimization
# os.environ['FAISS_NO_AVX2'] = '1'

from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

loader = TextLoader("../../../extras/modules/state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()

db = await FAISS.afrom_documents(docs, embeddings)

query = "What did the president say about Ketanji Brown Jackson"
docs = await db.asimilarity_search(query)

print(docs[0].page_content)

Similarity search with score

There are some FAISS-specific methods. One of them is similarity_search_with_score, which allows you to return not only the documents but also the distance score of the query to them. The returned distance score is L2 distance; therefore, a lower score is better.
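To make "lower is better" concrete, here is a quick illustration with made-up 3-d vectors (hypothetical values, not real OpenAI embeddings):

```python
import math

# Hypothetical toy embedding vectors
query_vec = [0.1, 0.2, 0.3]
close_doc = [0.1, 0.2, 0.35]  # almost identical to the query
far_doc = [0.9, 0.1, 0.0]     # very different from the query

# math.dist computes Euclidean (L2) distance; lower means more similar
d_close = math.dist(query_vec, close_doc)
d_far = math.dist(query_vec, far_doc)
assert d_close < d_far
```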

docs_and_scores = await db.asimilarity_search_with_score(query)

docs_and_scores[0]

It is also possible to search for documents similar to a given embedding vector using similarity_search_by_vector, which accepts an embedding vector as a parameter instead of a string.

embedding_vector = await embeddings.aembed_query(query)
docs = await db.asimilarity_search_by_vector(embedding_vector)

Saving and loading

You can also save and load a FAISS index, so you don't have to recreate it every time you use it.

db.save_local("faiss_index")

# Note: recent langchain-community versions also require
# allow_dangerous_deserialization=True when loading a pickled index.
new_db = FAISS.load_local("faiss_index", embeddings, asynchronous=True)

docs = await new_db.asimilarity_search(query)

docs[0]

Serializing and de-serializing to bytes

You can pickle the FAISS index with these functions. If you pickle an embeddings model that is, say, 90 MB (sentence-transformers/all-MiniLM-L6-v2 or any other model), the resulting pickle would be more than 90 MB, because the size of the model is included in the overall size. To avoid this, use the functions below: they serialize only the FAISS index, so the result is much smaller. This can be helpful if you wish to store the index in a database such as SQL.
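The size difference is easy to see with plain pickle on stand-in data (a hypothetical sketch, not the actual FAISS serialization format):

```python
import pickle

model_weights = bytes(1_000_000)  # stand-in for a ~1 MB embedding model
index_data = {"vectors": [[0.1, 0.2], [0.3, 0.4]]}  # only the index content

# Pickling index + model drags the model weights into the payload
with_model = pickle.dumps({"index": index_data, "model": model_weights})
# Serializing just the index keeps the payload small
index_only = pickle.dumps(index_data)
assert len(index_only) < len(with_model)
```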

from langchain_huggingface import HuggingFaceEmbeddings

pkl = db.serialize_to_bytes()  # serializes the faiss index
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
db = FAISS.deserialize_from_bytes(
    embeddings=embeddings, serialized=pkl, asynchronous=True
)  # Load the index

Merging

You can also merge two FAISS vector stores.

db1 = await FAISS.afrom_texts(["foo"], embeddings)
db2 = await FAISS.afrom_texts(["bar"], embeddings)
db1.docstore._dict
{'8164a453-9643-4959-87f7-9ba79f9e8fb0': Document(page_content='foo')}
db2.docstore._dict
{'4fbcf8a2-e80f-4f65-9308-2f4cb27cb6e7': Document(page_content='bar')}
db1.merge_from(db2)
db1.docstore._dict
{'8164a453-9643-4959-87f7-9ba79f9e8fb0': Document(page_content='foo'),
'4fbcf8a2-e80f-4f65-9308-2f4cb27cb6e7': Document(page_content='bar')}

Similarity search with filtering

The FAISS vector store can also support filtering. Since FAISS does not natively support filtering, it has to be done manually: more than k results are fetched first, and then filtered. You can filter the documents based on metadata. You can also set the fetch_k parameter when calling any search method to set how many documents you want to fetch before filtering. Here is a small example:

from langchain_core.documents import Document

list_of_documents = [
    Document(page_content="foo", metadata=dict(page=1)),
    Document(page_content="bar", metadata=dict(page=1)),
    Document(page_content="foo", metadata=dict(page=2)),
    Document(page_content="barbar", metadata=dict(page=2)),
    Document(page_content="foo", metadata=dict(page=3)),
    Document(page_content="bar burr", metadata=dict(page=3)),
    Document(page_content="foo", metadata=dict(page=4)),
    Document(page_content="bar bruh", metadata=dict(page=4)),
]
db = FAISS.from_documents(list_of_documents, embeddings)
results_with_scores = db.similarity_search_with_score("foo")
for doc, score in results_with_scores:
    print(f"Content: {doc.page_content}, Metadata: {doc.metadata}, Score: {score}")
API Reference: Document
Content: foo, Metadata: {'page': 1}, Score: 5.159960813797904e-15
Content: foo, Metadata: {'page': 2}, Score: 5.159960813797904e-15
Content: foo, Metadata: {'page': 3}, Score: 5.159960813797904e-15
Content: foo, Metadata: {'page': 4}, Score: 5.159960813797904e-15

Now we make the same query call, but we filter for only page = 1.

results_with_scores = await db.asimilarity_search_with_score("foo", filter=dict(page=1))
for doc, score in results_with_scores:
    print(f"Content: {doc.page_content}, Metadata: {doc.metadata}, Score: {score}")
Content: foo, Metadata: {'page': 1}, Score: 5.159960813797904e-15
Content: bar, Metadata: {'page': 1}, Score: 0.3131446838378906

The same thing can be done with max_marginal_relevance_search as well.

results = await db.amax_marginal_relevance_search("foo", filter=dict(page=1))
for doc in results:
    print(f"Content: {doc.page_content}, Metadata: {doc.metadata}")
Content: foo, Metadata: {'page': 1}
Content: bar, Metadata: {'page': 1}

Here is an example of how to set fetch_k parameter when calling similarity_search. Usually you would want the fetch_k parameter >> k parameter. This is because the fetch_k parameter is the number of documents that will be fetched before filtering. If you set fetch_k to a low number, you might not get enough documents to filter from.

results = await db.asimilarity_search("foo", filter=dict(page=1), k=1, fetch_k=4)
for doc in results:
    print(f"Content: {doc.page_content}, Metadata: {doc.metadata}")
Content: foo, Metadata: {'page': 1}
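The interaction between k and fetch_k can be simulated in plain Python (a hypothetical toy ranking, not real search results):

```python
# Toy ranked results: (content, metadata), best match first
ranked = [
    ("foo", {"page": 2}),
    ("foo", {"page": 3}),
    ("foo", {"page": 1}),
    ("bar", {"page": 1}),
]

def search(k, fetch_k):
    # Fetch fetch_k candidates first, then filter on metadata, then truncate to k
    fetched = ranked[:fetch_k]
    filtered = [r for r in fetched if r[1]["page"] == 1]
    return filtered[:k]

# With fetch_k too small, the filter can leave fewer than k results
assert search(k=1, fetch_k=2) == []
assert search(k=1, fetch_k=4) == [("foo", {"page": 1})]
```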

Some advanced metadata filtering is supported using MongoDB-style query and projection operators. The current list of supported operators is as follows:

  • $eq (equals)
  • $neq (not equal)
  • $gt (greater than)
  • $lt (less than)
  • $gte (greater than or equal)
  • $lte (less than or equal)
  • $in (membership in list)
  • $nin (not in list)
  • $and (all conditions must match)
  • $or (any condition must match)
  • $not (negation of condition)

The same similarity search as above, using advanced metadata filtering, can be done as follows:

results = await db.asimilarity_search(
    "foo", filter={"page": {"$eq": 1}}, k=1, fetch_k=4
)
for doc in results:
    print(f"Content: {doc.page_content}, Metadata: {doc.metadata}")
Content: foo, Metadata: {'page': 1}
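Since FAISS itself knows nothing about metadata, these operators are evaluated in Python after the candidates are fetched. The operator semantics listed above can be sketched with a hypothetical helper (not LangChain's actual implementation):

```python
def matches(metadata: dict, flt: dict) -> bool:
    """Hypothetical sketch of the documented metadata-filter semantics."""
    for key, cond in flt.items():
        if key == "$and":
            if not all(matches(metadata, c) for c in cond):
                return False
        elif key == "$or":
            if not any(matches(metadata, c) for c in cond):
                return False
        elif key == "$not":
            if matches(metadata, cond):
                return False
        else:
            value = metadata.get(key)
            # A bare value is shorthand for {"$eq": value}
            if not isinstance(cond, dict):
                cond = {"$eq": cond}
            ops = {
                "$eq": lambda v, t: v == t,
                "$neq": lambda v, t: v != t,
                "$gt": lambda v, t: v > t,
                "$lt": lambda v, t: v < t,
                "$gte": lambda v, t: v >= t,
                "$lte": lambda v, t: v <= t,
                "$in": lambda v, t: v in t,
                "$nin": lambda v, t: v not in t,
            }
            for op, target in cond.items():
                if not ops[op](value, target):
                    return False
    return True

assert matches({"page": 1}, {"page": {"$eq": 1}})
assert not matches({"page": 2}, {"page": {"$in": [1, 3]}})
assert matches({"page": 3}, {"$or": [{"page": 1}, {"page": {"$gte": 3}}]})
```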

Delete

You can also delete records by ID. Note that the IDs to delete should be the IDs in the docstore.

db.delete([db.index_to_docstore_id[0]])
True
# Is now missing
0 in db.index_to_docstore_id
False