Build a semantic search engine

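This tutorial will familiarize you with LangChain's document loader, embedding, and vector store abstractions. These abstractions are designed to support retrieval of data from (vector) databases and other sources for integration with LLM workflows. They are important for applications that fetch data to be reasoned over as part of model inference, as in the case of retrieval-augmented generation, or RAG (see our RAG tutorial).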

Here we will build a search engine over a PDF document. This will allow us to retrieve passages in the PDF that are similar to an input query.

Concepts

This guide focuses on retrieval of text data. We will cover the following concepts:

Documents and document loaders;
Text splitters;
Embeddings;
Vector stores and retrievers.

Setup

Jupyter Notebook

This and other tutorials are perhaps most conveniently run in a Jupyter notebook. See the Jupyter documentation for installation instructions.

Installation

This tutorial requires the langchain-community and pypdf packages:

Pip:

pip install langchain-community pypdf

Conda:

conda install langchain-community pypdf -c conda-forge

For more details, see our installation guide.

LangSmith

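Many of the applications you build with LangChain will contain multiple steps with multiple invocations of LLM calls. As these applications get more complex, it becomes crucial to be able to inspect what exactly is going on inside your chain or agent. The best way to do this is with LangSmith.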

After you sign up for LangSmith, make sure to set your environment variables to start logging traces:

export LANGSMITH_TRACING="true"
export LANGSMITH_API_KEY="..."

Or, if in a notebook, you can set them with:

import getpass
import os

os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = getpass.getpass()

Documents and Document Loaders

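LangChain implements a Document abstraction, which is intended to represent a unit of text and its associated metadata. It has three attributes: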

page_content: a string representing the content;
metadata: a dict containing arbitrary metadata;
id: (optional) a string identifier for the document.

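The metadata attribute can capture information about the source of the document, its relationship to other documents, and other information. Note that an individual Document object often represents a chunk of a larger document.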

We can generate sample documents when desired:

from langchain_core.documents import Document

documents = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"source": "mammal-pets-doc"},
    ),
]

API Reference: Document

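However, the LangChain ecosystem implements document loaders that integrate with hundreds of common sources. This makes it easy to incorporate data from these sources into your AI application.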

Loading documents

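Let's load a PDF into a sequence of Document objects. There is a sample PDF in the LangChain repo: a 10-K filing for Nike from 2023. We can consult the LangChain documentation for available PDF document loaders. Let's select PyPDFLoader, which is fairly lightweight.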

from langchain_community.document_loaders import PyPDFLoader

file_path = "../example_data/nke-10k-2023.pdf"
loader = PyPDFLoader(file_path)

docs = loader.load()

print(len(docs))

API Reference: PyPDFLoader

Tip

See the guide on PDF document loaders for more detail.

PyPDFLoader loads one Document object per PDF page. For each, we can easily access:

The string content of the page;
Metadata containing the file name and page number.

print(f"{docs[0].page_content[:200]}\n")
print(docs[0].metadata)

Table of Contents
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K
(Mark One)
☑ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934
FO

{'source': '../example_data/nke-10k-2023.pdf', 'page': 0}

Splitting

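For both information retrieval and downstream question-answering purposes, a page may be too coarse a representation. Our goal in the end will be to retrieve Document objects that answer an input query, and further splitting our PDF will help ensure that the meanings of relevant portions of the document are not "washed out" by surrounding text.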

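We can use text splitters for this purpose. Here we will use a simple text splitter that partitions based on characters. We will split our documents into chunks of 1000 characters with 200 characters of overlap between chunks. The overlap helps mitigate the possibility of separating a statement from important context related to it. We use the RecursiveCharacterTextSplitter, which will recursively split the document using common separators like new lines until each chunk is the appropriate size. This is the recommended text splitter for generic text use cases.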

We set add_start_index=True so that the character index where each split Document starts within the initial Document is preserved as the metadata attribute "start_index".

See the guide on working with PDFs for more detail, including how to extract text from specific sections and images.

from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)

len(all_splits)

API Reference: RecursiveCharacterTextSplitter
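To make the interaction between chunk size, overlap, and start index concrete, here is a minimal standard-library sketch of a fixed-width sliding-window splitter. It is a simplified stand-in and not how RecursiveCharacterTextSplitter works internally (the real splitter prefers to break on separators such as new lines); the function name split_with_overlap and the dict layout are assumptions made for illustration.

```python
from typing import Dict, List


def split_with_overlap(
    text: str, chunk_size: int = 1000, chunk_overlap: int = 200
) -> List[Dict]:
    """Split text into fixed-width chunks that overlap their neighbors.

    Hypothetical helper for illustration only; it ignores separators.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(
            {
                "page_content": text[start : start + chunk_size],
                # Mirrors the idea behind add_start_index=True
                "metadata": {"start_index": start},
            }
        )
        if start + chunk_size >= len(text):
            break  # the final chunk already reaches the end of the text
    return chunks


# A 2500-character sample "page" made of repeating digits
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample)
starts = [c["metadata"]["start_index"] for c in chunks]
print(starts)
print([len(c["page_content"]) for c in chunks])
```

Each chunk begins 800 characters after the previous one, so the last 200 characters of one chunk are repeated at the start of the next.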

Embeddings

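Vector search is a common way to store and search over unstructured data (such as unstructured text). The idea is to store numeric vectors that are associated with the text. Given a query, we can embed it as a vector of the same dimension and use vector similarity metrics (such as cosine similarity) to identify related text.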
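As a small illustration of the similarity metric mentioned above, the following sketch computes cosine similarity with the standard library and ranks a few candidates against a query. The three-dimensional vectors are made up for the example; real embedding vectors have hundreds or thousands of dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of the same dimension."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Hypothetical vectors, standing in for embedded texts
query = [1.0, 0.0, 1.0]
candidates = {
    "a": [2.0, 0.0, 2.0],  # same direction as the query
    "b": [0.0, 1.0, 0.0],  # orthogonal to the query
    "c": [1.0, 1.0, 0.0],  # partially aligned with the query
}

# Rank candidate names from most to least similar to the query
ranked = sorted(
    candidates,
    key=lambda name: cosine_similarity(query, candidates[name]),
    reverse=True,
)
print(ranked)
```

Because cosine similarity depends only on direction, candidate "a" ranks first even though its magnitude differs from the query's.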

LangChain supports embeddings from dozens of providers. These models specify how text should be converted into a numeric vector. Let's select a model:

pip install -qU langchain-openai

import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
  os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

vector_1 = embeddings.embed_query(all_splits[0].page_content)
vector_2 = embeddings.embed_query(all_splits[1].page_content)

assert len(vector_1) == len(vector_2)
print(f"Generated vectors of length {len(vector_1)}\n")
print(vector_1[:10])

Generated vectors of length 3072

[-0.008586574345827103, -0.03341241180896759, -0.008936782367527485, -0.0036674530711025, 0.010564599186182022, 0.009598285891115665, -0.028587326407432556, -0.015824200585484505, 0.0030416189692914486, -0.012899317778646946]

Armed with a model for generating text embeddings, we can next store them in a special data structure that supports efficient similarity search.

Vector stores

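LangChain VectorStore objects contain methods for adding text and Document objects to the store, and for querying them using various similarity metrics. They are often initialized with embedding models, which determine how text data is translated to numeric vectors.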
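The idea of a store that is initialized with an embedding function, accepts texts, and answers similarity queries can be sketched in a few lines. The ToyVectorStore class and the toy_embed word-count "embedding" below are hypothetical and exist only for illustration; they are not LangChain's implementation, and a word-count embedding only matches exact words, unlike a real embedding model.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Tuple


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"\w+", text.lower()))


def sparse_cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Brute-force in-memory store: embeds on insert, scans on query."""

    def __init__(self, embed: Callable[[str], Counter]) -> None:
        self._embed = embed
        self._entries: List[Tuple[str, Counter]] = []

    def add_texts(self, texts: List[str]) -> List[int]:
        first_id = len(self._entries)
        self._entries.extend((t, self._embed(t)) for t in texts)
        return list(range(first_id, len(self._entries)))

    def similarity_search_with_score(
        self, query: str, k: int = 4
    ) -> List[Tuple[str, float]]:
        q = self._embed(query)
        scored = [(text, sparse_cosine(q, vec)) for text, vec in self._entries]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


store = ToyVectorStore(toy_embed)
ids = store.add_texts(
    [
        "Dogs are great companions, known for their loyalty and friendliness.",
        "Cats are independent pets that often enjoy their own space.",
    ]
)
hits = store.similarity_search_with_score("Which pets are independent?", k=2)
for text, score in hits:
    print(f"{score:.3f}  {text}")
```

Production vector stores replace the linear scan with an index suited to large collections, but the add-then-query shape is the same.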

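LangChain includes a suite of integrations with different vector store technologies. Some vector stores are hosted by a provider (e.g., various cloud providers) and require specific credentials to use; some (such as Postgres) run in separate infrastructure that can be run locally or via a third party; others can run in-memory for lightweight workloads. Let's select a vector store: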

pip install -qU langchain-core

from langchain_core.vectorstores import InMemoryVectorStore

vector_store = InMemoryVectorStore(embeddings)

Having instantiated our vector store, we can now index the documents.

ids = vector_store.add_documents(documents=all_splits)

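Note that most vector store implementations will allow you to connect to an existing vector store, e.g., by providing a client, index name, or other information. See the documentation for a specific integration for more detail.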

Once we've instantiated a VectorStore that contains documents, we can query it. VectorStore includes methods for querying:

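Synchronously and asynchronously;
By string query and by vector;
With and without returning similarity scores;
By similarity and by maximum marginal relevance (to balance similarity to the query against diversity in the retrieved results).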

The methods will generally include a list of Document objects in their outputs.
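Maximum marginal relevance can be sketched as a greedy selection that rewards similarity to the query and penalizes similarity to results already chosen. The code below is a minimal standard-library sketch under that assumption, not the implementation used by any particular vector store; the vectors are made up, and the parameter name lambda_mult is used here only as a label for the relevance/diversity trade-off.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(
    query_vec: Sequence[float],
    doc_vecs: List[Sequence[float]],
    k: int = 2,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Greedily pick k document indices, trading relevance for diversity.

    lambda_mult=1.0 reduces to plain similarity ranking; lower values
    penalize candidates that resemble already-selected documents.
    """
    selected: List[int] = []
    remaining = list(range(len(doc_vecs)))
    while remaining and len(selected) < k:

        def score(i: int) -> float:
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical 2-D vectors: documents 0 and 1 are near-duplicates
query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.12], [0.8, -0.6]]

diverse = mmr_select(query, docs, k=2, lambda_mult=0.5)
similar_only = mmr_select(query, docs, k=2, lambda_mult=1.0)
print(diverse, similar_only)
```

With the balanced setting, the second pick skips the near-duplicate in favor of a less similar but more distinct document; with lambda_mult=1.0 the two most similar documents are returned.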

Usage

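Embeddings typically represent text as a "dense" vector such that texts with similar meanings are geometrically close. This lets us retrieve relevant information just by passing in a question, without knowledge of any specific keywords used in the document.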

Return documents based on similarity to a string query:

results = vector_store.similarity_search(
    "How many distribution centers does Nike have in the US?"
)

print(results[0])

page_content='direct to consumer operations sell products through the following number of retail stores in the United States:
U.S. RETAIL STORES NUMBER
NIKE Brand factory stores 213 
NIKE Brand in-line stores (including employee-only stores) 74 
Converse stores (including factory stores) 82 
TOTAL 369 
In the United States, NIKE has eight significant distribution centers. Refer to Item 2. Properties for further information.
2023 FORM 10-K 2' metadata={'page': 4, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 3125}

Async query:

results = await vector_store.asimilarity_search("When was Nike incorporated?")

print(results[0])

page_content='Table of Contents
PART I
ITEM 1. BUSINESS
GENERAL
NIKE, Inc. was incorporated in 1967 under the laws of the State of Oregon. As used in this Annual Report on Form 10-K (this "Annual Report"), the terms "we," "us," "our,"
"NIKE" and the "Company" refer to NIKE, Inc. and its predecessors, subsidiaries and affiliates, collectively, unless the context indicates otherwise.
Our principal business activity is the design, development and worldwide marketing and selling of athletic footwear, apparel, equipment, accessories and services. NIKE is
the largest seller of athletic footwear and apparel in the world. We sell our products through NIKE Direct operations, which are comprised of both NIKE-owned retail stores
and sales through our digital platforms (also referred to as "NIKE Brand Digital"), to retail accounts and to a mix of independent distributors, licensees and sales' metadata={'page': 3, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 0}

Return scores:

# Note that providers implement different scores; the score here
# is a distance metric that varies inversely with similarity.

results = vector_store.similarity_search_with_score("What was Nike's revenue in 2023?")
doc, score = results[0]
print(f"Score: {score}\n")
print(doc)

Score: 0.23699893057346344

page_content='Table of Contents
FISCAL 2023 NIKE BRAND REVENUE HIGHLIGHTS
The following tables present NIKE Brand revenues disaggregated by reportable operating segment, distribution channel and major product line:
FISCAL 2023 COMPARED TO FISCAL 2022
•NIKE, Inc. Revenues were $51.2 billion in fiscal 2023, which increased 10% and 16% compared to fiscal 2022 on a reported and currency-neutral basis, respectively.
The increase was due to higher revenues in North America, Europe, Middle East & Africa ("EMEA"), APLA and Greater China, which contributed approximately 7, 6,
2 and 1 percentage points to NIKE, Inc. Revenues, respectively.
•NIKE Brand revenues, which represented over 90% of NIKE, Inc. Revenues, increased 10% and 16% on a reported and currency-neutral basis, respectively. This
increase was primarily due to higher revenues in Men's, the Jordan Brand, Women's and Kids' which grew 17%, 35%,11% and 10%, respectively, on a wholesale
equivalent basis.' metadata={'page': 35, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 0}

Return documents based on similarity to an embedded query:

embedding = embeddings.embed_query("How were Nike's margins impacted in 2023?")

results = vector_store.similarity_search_by_vector(embedding)
print(results[0])

page_content='Table of Contents
GROSS MARGIN
FISCAL 2023 COMPARED TO FISCAL 2022
For fiscal 2023, our consolidated gross profit increased 4% to $22,292 million compared to $21,479 million for fiscal 2022. Gross margin decreased 250 basis points to
43.5% for fiscal 2023 compared to 46.0% for fiscal 2022 due to the following:
*Wholesale equivalent
The decrease in gross margin for fiscal 2023 was primarily due to:
•Higher NIKE Brand product costs, on a wholesale equivalent basis, primarily due to higher input costs and elevated inbound freight and logistics costs as well as
product mix;
•Lower margin in our NIKE Direct business, driven by higher promotional activity to liquidate inventory in the current period compared to lower promotional activity in
the prior period resulting from lower available inventory supply;
•Unfavorable changes in net foreign currency exchange rates, including hedges; and
•Lower off-price margin, on a wholesale equivalent basis.
This was partially offset by:' metadata={'page': 36, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 0}

Retrievers

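LangChain VectorStore objects do not subclass Runnable. LangChain Retrievers are Runnables, so they implement a standard set of methods (e.g., synchronous and asynchronous invoke and batch operations). Although we can construct retrievers from vector stores, retrievers can interface with non-vector store sources of data as well (such as external APIs).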

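We can create a simple version of this ourselves, without subclassing Retriever. Once we choose which method to use for retrieving documents, we can easily create a runnable. Below we build one around the similarity_search method: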

from typing import List

from langchain_core.documents import Document
from langchain_core.runnables import chain


@chain
def retriever(query: str) -> List[Document]:
    return vector_store.similarity_search(query, k=1)


retriever.batch(
    [
        "How many distribution centers does Nike have in the US?",
        "When was Nike incorporated?",
    ],
)

API Reference: Document | chain

[[Document(metadata={'page': 4, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 3125}, page_content='direct to consumer operations sell products through the following number of retail stores in the United States:\nU.S. RETAIL STORES NUMBER\nNIKE Brand factory stores 213 \nNIKE Brand in-line stores (including employee-only stores) 74 \nConverse stores (including factory stores) 82 \nTOTAL 369 \nIn the United States, NIKE has eight significant distribution centers. Refer to Item 2. Properties for further information.\n2023 FORM 10-K 2')],
 [Document(metadata={'page': 3, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 0}, page_content='Table of Contents\nPART I\nITEM 1. BUSINESS\nGENERAL\nNIKE, Inc. was incorporated in 1967 under the laws of the State of Oregon. As used in this Annual Report on Form 10-K (this "Annual Report"), the terms "we," "us," "our,"\n"NIKE" and the "Company" refer to NIKE, Inc. and its predecessors, subsidiaries and affiliates, collectively, unless the context indicates otherwise.\nOur principal business activity is the design, development and worldwide marketing and selling of athletic footwear, apparel, equipment, accessories and services. NIKE is\nthe largest seller of athletic footwear and apparel in the world. We sell our products through NIKE Direct operations, which are comprised of both NIKE-owned retail stores\nand sales through our digital platforms (also referred to as "NIKE Brand Digital"), to retail accounts and to a mix of independent distributors, licensees and sales')]]

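Vector stores implement an as_retriever method that will generate a Retriever, specifically a VectorStoreRetriever. These retrievers include specific search_type and search_kwargs attributes that identify which methods of the underlying vector store to call, and how to parameterize them. For instance, we can replicate the above with the following: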

retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1},
)

retriever.batch(
    [
        "How many distribution centers does Nike have in the US?",
        "When was Nike incorporated?",
    ],
)

[[Document(metadata={'page': 4, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 3125}, page_content='direct to consumer operations sell products through the following number of retail stores in the United States:\nU.S. RETAIL STORES NUMBER\nNIKE Brand factory stores 213 \nNIKE Brand in-line stores (including employee-only stores) 74 \nConverse stores (including factory stores) 82 \nTOTAL 369 \nIn the United States, NIKE has eight significant distribution centers. Refer to Item 2. Properties for further information.\n2023 FORM 10-K 2')],
 [Document(metadata={'page': 3, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 0}, page_content='Table of Contents\nPART I\nITEM 1. BUSINESS\nGENERAL\nNIKE, Inc. was incorporated in 1967 under the laws of the State of Oregon. As used in this Annual Report on Form 10-K (this "Annual Report"), the terms "we," "us," "our,"\n"NIKE" and the "Company" refer to NIKE, Inc. and its predecessors, subsidiaries and affiliates, collectively, unless the context indicates otherwise.\nOur principal business activity is the design, development and worldwide marketing and selling of athletic footwear, apparel, equipment, accessories and services. NIKE is\nthe largest seller of athletic footwear and apparel in the world. We sell our products through NIKE Direct operations, which are comprised of both NIKE-owned retail stores\nand sales through our digital platforms (also referred to as "NIKE Brand Digital"), to retail accounts and to a mix of independent distributors, licensees and sales')]]

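VectorStoreRetriever supports search types of "similarity" (default), "mmr" (maximum marginal relevance, described above), and "similarity_score_threshold". We can use the latter to threshold documents output by the retriever by similarity score.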
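The thresholding idea can be illustrated without a vector store. The sketch below assumes scores where higher means more similar and keeps only results at or above a cutoff; the helper name filter_by_score_threshold and the scores themselves are invented for the example and do not come from the tutorial's data.

```python
from typing import List, Tuple


def filter_by_score_threshold(
    scored: List[Tuple[str, float]], score_threshold: float
) -> List[str]:
    """Keep texts whose similarity score meets the threshold, best first.

    Assumes higher scores mean more similar; providers differ on this.
    """
    kept = [(text, score) for text, score in scored if score >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in kept]


# Hypothetical (text, similarity) pairs; the scores are made up
scored_results = [
    ("NIKE has eight significant distribution centers", 0.82),
    ("NIKE, Inc. was incorporated in 1967", 0.41),
    ("Gross margin decreased 250 basis points", 0.17),
]

relevant = filter_by_score_threshold(scored_results, score_threshold=0.4)
print(relevant)
```

Unlike a fixed k, a threshold can return fewer results, or none at all, when nothing in the store is close enough to the query.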

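Retrievers can easily be incorporated into more complex applications, such as retrieval-augmented generation (RAG) applications that combine a given question with retrieved context into a prompt for an LLM. To learn more about building such an application, check out the RAG tutorial.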

Learn more:

Retrieval strategies can be rich and complex. For example:

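We can infer hard rules and filters from a query (e.g., "using documents published after 2020");
We can return documents that are linked to the retrieved context in some way (e.g., via some document taxonomy);
We can generate multiple embeddings for each unit of context;
We can ensemble results from multiple retrievers;
We can assign weights to documents, e.g., to weigh recent documents higher.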
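As one concrete way to combine results from multiple retrievers, the sketch below uses reciprocal rank fusion (RRF), which scores each document by the sum of 1 / (k + rank) over the rankings it appears in. RRF is a technique chosen here for illustration and is not named in this tutorial; the document IDs, the two example rankings, and the constant k=60 are assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of document IDs into one ranking.

    Each appearance contributes 1 / (k + rank), with rank starting at 1,
    so documents ranked highly by several retrievers rise to the top.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending), breaking ties by ID for determinism
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of two retrievers for the same query
keyword_hits = ["doc_b", "doc_a", "doc_c"]
vector_hits = ["doc_a", "doc_d", "doc_b"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Because only ranks are used, this approach does not require the retrievers' scores to be on comparable scales.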

The retrievers section of the how-to guides covers these and other built-in retrieval strategies.

It is also straightforward to extend the BaseRetriever class in order to implement custom retrievers. See our how-to guide on custom retrievers.

Next steps

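You've now seen how to build a semantic search engine over a PDF document.

For more on document loaders, embeddings, vector stores, and RAG, see the corresponding sections of the LangChain documentation.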
