Skip to main content
Open In ColabOpen on GitHub

Vectara

Vectara 是可信赖的 AI 助手和代理平台,专注于为企业关键应用做好准备。 Vectara 无服务器 RAG-as-a-service 提供了所有 RAG 的组件,通过易于使用的 API 包括:

  1. 一种从文件(PDF、PPT、DOCX等)中提取文本的方法
  2. 基于机器学习的分块提供了一流的性能。
  3. The Boomerang 模型。
  4. 自己的内部向量数据库,其中存储了文本片段和嵌入向量。
  5. 一个自动将查询编码为嵌入向量的服务,并检索最相关的文本片段,包括对混合搜索的支持以及多种再排序选项如多语言相关性再排序器MMRUDF再排序器
  6. 基于检索到的文档(上下文),根据所获取的文档生成一个生成性摘要,包括引用。

对于更多信息,请参阅:

这个笔记本展示了如何在仅将 Vectara 用作向量存储(不涉及摘要)时使用基本的检索功能,包括:similarity_searchsimilarity_search_with_score 以及使用 LangChain 的 as_retriever 功能。

设置

要使用VectaraVectorStore,您首先需要安装合作伙伴包。

!uv pip install -U pip && uv pip install -qU langchain-vectara

Getting Started

要开始,请按照以下步骤操作:

  1. 如果您还没有,请注册免费试用Vectara。
  2. 在您的账户中,您可以创建一个或多个语料库。每个语料库代表一个存储从输入文档导入的文本数据的区域。要创建一个语料库,请使用"Create Corpus"按钮。然后您需要为语料库提供一个名称以及描述。可选地,您可以定义过滤属性并应用一些高级选项。如果您点击您创建的语料库,您可以在顶部看到其名称和语料库ID。
  3. 接下来,您需要创建API密钥以访问语料库。在语料库视图中点击"权限控制"标签页,然后点击"创建API密钥"按钮。给您的密钥命名,并选择您希望的查询仅限或查询+索引选项。点击"创建",现在您就有了一个有效的API密钥。请保密保存此密钥。

要使用LangChain与Vectara配合,您需要拥有这两个值:corpus_keyapi_key。 您可以以两种方式向LangChain提供VECTARA_API_KEY

  1. 在您的环境中包含以下两个变量:VECTARA_API_KEY

    例如,您可以使用os.environ和getpass按照如下方式设置这些变量:

import os
import getpass

os.environ["VECTARA_API_KEY"] = getpass.getpass("Vectara API Key:")
  1. 将它们添加到Vectara向量存储构造函数中:
vectara = Vectara(
vectara_api_key=vectara_api_key
)

在本笔记本中,我们假设它们已经提供在环境中。

import os

os.environ["VECTARA_API_KEY"] = "<VECTARA_API_KEY>"
os.environ["VECTARA_CORPUS_KEY"] = "VECTARA_CORPUS_KEY"

from langchain_vectara import Vectara
from langchain_vectara.vectorstores import (
ChainReranker,
CorpusConfig,
CustomerSpecificReranker,
File,
GenerationConfig,
MmrReranker,
SearchConfig,
VectaraQueryConfig,
)

vectara = Vectara(vectara_api_key=os.getenv("VECTARA_API_KEY"))

首先,我们将国情咨文文本加载到Vectara中。

请注意,我们使用的是add_files接口,不需要进行任何本地处理或分块操作 - Vectara会接收文件内容并执行所有必要的预处理、分块和将文件嵌入其知识库的过程。

在这种情况中它使用了一个.txt文件,但同样的方式也适用于许多其他文件类型

corpus_key = os.getenv("VECTARA_CORPUS_KEY")
file_obj = File(
file_path="../document_loaders/example_data/state_of_the_union.txt",
metadata={"source": "text_file"},
)
vectara.add_files([file_obj], corpus_key)
['state_of_the_union.txt']

Vectara RAG(检索增强生成)

我们现在创建一个VectaraQueryConfig对象来控制检索和总结的选项:

  • 我们启用摘要功能,指定希望LLM挑选出匹配度最高的7个片段,并用英文回答

使用此配置,让我们创建一个包含完整 Vectara RAG 管道的 LangChain Runnable 对象,使用 as_rag 方法:

generation_config = GenerationConfig(
max_used_search_results=7,
response_language="eng",
generation_preset_name="vectara-summary-ext-24-05-med-omni",
enable_factual_consistency_score=True,
)
search_config = SearchConfig(
corpora=[CorpusConfig(corpus_key=corpus_key)],
limit=25,
reranker=ChainReranker(
rerankers=[
CustomerSpecificReranker(reranker_id="rnk_272725719", limit=100),
MmrReranker(diversity_bias=0.2, limit=100),
]
),
)

config = VectaraQueryConfig(
search=search_config,
generation=generation_config,
)

query_str = "what did Biden say?"

rag = vectara.as_rag(config)
rag.invoke(query_str)["answer"]
"President Biden discussed several key issues in his recent statements. He emphasized the importance of keeping schools open and noted that with a high vaccination rate and reduced hospitalizations, most Americans can safely return to normal activities without masks [1]. He addressed the need to hold social media platforms accountable for their impact on children and called for stronger privacy protections and mental health services [2]. Biden also announced measures against Russian oligarchs, including closing American airspace to Russian flights and targeting their assets, as part of efforts to weaken Russia's economy [3], [7]. Additionally, he reaffirmed the need to protect women's rights, particularly the right to choose as affirmed in Roe v. Wade [5]."

我们也可以像这样使用流式接口:<br>

output = {}
curr_key = None
for chunk in rag.stream(query_str):
for key in chunk:
if key not in output:
output[key] = chunk[key]
else:
output[key] += chunk[key]
if key == "answer":
print(chunk[key], end="", flush=True)
curr_key = key
President Biden discussed several key issues in his recent statements. He emphasized the importance of keeping schools open and noted that with a high vaccination rate and reduced hospitalizations, most Americans can safely return to normal activities without masks [1]. He addressed the need to hold social media platforms accountable for their impact on children and called for stronger privacy protections and mental health services [2]. Biden also announced measures against Russia, including preventing its central bank from defending the Ruble and targeting Russian oligarchs' assets, as part of efforts to weaken Russia's economy and military [3]. Additionally, he reaffirmed the commitment to protect women's rights, particularly the right to choose as affirmed in Roe v. Wade [5]. Lastly, he advocated for funding the police with necessary resources and training to ensure community safety [6].

幻觉检测与事实一致性评分

Vectara 创建了 HHEM —— 一个开源模型,可用于评估 RAG 回答的事实一致性。

作为 Vectara RAG 的一部分,通过 API 提供了“事实一致性评分”(Factual Consistency Score,简称 FCS),它是开源 HHEM 的改进版本。该评分将自动包含在 RAG 管道的输出中。

resp = rag.invoke(query_str)
print(resp["answer"])
print(f"Vectara FCS = {resp['fcs']}")
President Biden discussed several key topics in his recent statements. He emphasized the importance of keeping schools open and noted that with a high vaccination rate and reduced hospitalizations, most Americans can safely return to normal activities without masks [1]. He addressed the need to hold social media platforms accountable for their impact on children and called for stronger privacy protections and mental health services [2]. Biden also announced measures against Russian oligarchs, including closing American airspace to Russian flights and targeting their assets, as part of efforts to weaken Russia's economy [3], [7]. Additionally, he reaffirmed the need to protect women's rights, particularly the right to choose as affirmed in Roe v. Wade [5].
Vectara FCS = 0.61621094

Vectara作为langchain检索器

Vectara 组件也可以仅用作检索器。

在这种情况中,它就像任何其他LangChain检索器一样工作。此模式的主要用途是进行语义搜索,在这种情况下我们禁用了摘要功能:

config.generation = None
config.search.limit = 5
retriever = vectara.as_retriever(config=config)
retriever.invoke(query_str)
[Document(metadata={'X-TIKA:Parsed-By': 'org.apache.tika.parser.csv.TextAndCSVParser', 'Content-Encoding': 'UTF-8', 'X-TIKA:detectedEncoding': 'UTF-8', 'X-TIKA:encodingDetector': 'UniversalEncodingDetector', 'Content-Type': 'text/plain; charset=UTF-8', 'source': 'text_file', 'framework': 'langchain'}, page_content='The U.S. Department of Justice is assembling a dedicated task force to go after the crimes of Russian oligarchs. We are joining with our European allies to find and seize your yachts your luxury apartments your private jets. We are coming for your ill-begotten gains. And tonight I am announcing that we will join our allies in closing off American air space to all Russian flights – further isolating Russia – and adding an additional squeeze –on their economy. The Ruble has lost 30% of its value.'),
Document(metadata={'X-TIKA:Parsed-By': 'org.apache.tika.parser.csv.TextAndCSVParser', 'Content-Encoding': 'UTF-8', 'X-TIKA:detectedEncoding': 'UTF-8', 'X-TIKA:encodingDetector': 'UniversalEncodingDetector', 'Content-Type': 'text/plain; charset=UTF-8', 'source': 'text_file', 'framework': 'langchain'}, page_content='When they came home, many of the world’s fittest and best trained warriors were never the same. Dizziness. \n\nA cancer that would put them in a flag-draped coffin. I know. \n\nOne of those soldiers was my son Major Beau Biden. We don’t know for sure if a burn pit was the cause of his brain cancer, or the diseases of so many of our troops. But I’m committed to finding out everything we can.'),
Document(metadata={'X-TIKA:Parsed-By': 'org.apache.tika.parser.csv.TextAndCSVParser', 'Content-Encoding': 'UTF-8', 'X-TIKA:detectedEncoding': 'UTF-8', 'X-TIKA:encodingDetector': 'UniversalEncodingDetector', 'Content-Type': 'text/plain; charset=UTF-8', 'source': 'text_file', 'framework': 'langchain'}, page_content='He rejected repeated efforts at diplomacy. He thought the West and NATO wouldn’t respond. And he thought he could divide us at home. We were ready. Here is what we did. We prepared extensively and carefully.'),
Document(metadata={'X-TIKA:Parsed-By': 'org.apache.tika.parser.csv.TextAndCSVParser', 'Content-Encoding': 'UTF-8', 'X-TIKA:detectedEncoding': 'UTF-8', 'X-TIKA:encodingDetector': 'UniversalEncodingDetector', 'Content-Type': 'text/plain; charset=UTF-8', 'source': 'text_file', 'framework': 'langchain'}, page_content='And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson.'),
Document(metadata={'X-TIKA:Parsed-By': 'org.apache.tika.parser.csv.TextAndCSVParser', 'Content-Encoding': 'UTF-8', 'X-TIKA:detectedEncoding': 'UTF-8', 'X-TIKA:encodingDetector': 'UniversalEncodingDetector', 'Content-Type': 'text/plain; charset=UTF-8', 'source': 'text_file', 'framework': 'langchain'}, page_content='Putin’s latest attack on Ukraine was premeditated and unprovoked. He rejected repeated efforts at diplomacy. He thought the West and NATO wouldn’t respond. And he thought he could divide us at home. We were ready. Here is what we did.')]

为了向后兼容,您还可以在检索器中启用摘要功能,此时摘要将作为额外的Document对象添加:

config.generation = GenerationConfig()
config.search.limit = 10
retriever = vectara.as_retriever(config=config)
retriever.invoke(query_str)
[Document(metadata={'X-TIKA:Parsed-By': 'org.apache.tika.parser.csv.TextAndCSVParser', 'Content-Encoding': 'UTF-8', 'X-TIKA:detectedEncoding': 'UTF-8', 'X-TIKA:encodingDetector': 'UniversalEncodingDetector', 'Content-Type': 'text/plain; charset=UTF-8', 'source': 'text_file', 'framework': 'langchain'}, page_content='We won’t be able to compete for the jobs of the 21st Century if we don’t fix that. That’s why it was so important to pass the Bipartisan Infrastructure Law—the most sweeping investment to rebuild America in history. This was a bipartisan effort, and I want to thank the members of both parties who worked to make it happen. We’re done talking about infrastructure weeks. We’re going to have an infrastructure decade.'),
Document(metadata={'X-TIKA:Parsed-By': 'org.apache.tika.parser.csv.TextAndCSVParser', 'Content-Encoding': 'UTF-8', 'X-TIKA:detectedEncoding': 'UTF-8', 'X-TIKA:encodingDetector': 'UniversalEncodingDetector', 'Content-Type': 'text/plain; charset=UTF-8', 'source': 'text_file', 'framework': 'langchain'}, page_content='The U.S. Department of Justice is assembling a dedicated task force to go after the crimes of Russian oligarchs. We are joining with our European allies to find and seize your yachts your luxury apartments your private jets. We are coming for your ill-begotten gains. And tonight I am announcing that we will join our allies in closing off American air space to all Russian flights – further isolating Russia – and adding an additional squeeze –on their economy. The Ruble has lost 30% of its value.'),
Document(metadata={'X-TIKA:Parsed-By': 'org.apache.tika.parser.csv.TextAndCSVParser', 'Content-Encoding': 'UTF-8', 'X-TIKA:detectedEncoding': 'UTF-8', 'X-TIKA:encodingDetector': 'UniversalEncodingDetector', 'Content-Type': 'text/plain; charset=UTF-8', 'source': 'text_file', 'framework': 'langchain'}, page_content='When they came home, many of the world’s fittest and best trained warriors were never the same. Dizziness. \n\nA cancer that would put them in a flag-draped coffin. I know. \n\nOne of those soldiers was my son Major Beau Biden. We don’t know for sure if a burn pit was the cause of his brain cancer, or the diseases of so many of our troops. But I’m committed to finding out everything we can.'),
Document(metadata={'X-TIKA:Parsed-By': 'org.apache.tika.parser.csv.TextAndCSVParser', 'Content-Encoding': 'UTF-8', 'X-TIKA:detectedEncoding': 'UTF-8', 'X-TIKA:encodingDetector': 'UniversalEncodingDetector', 'Content-Type': 'text/plain; charset=UTF-8', 'source': 'text_file', 'framework': 'langchain'}, page_content='Preventing Russia’s central bank from defending the Russian Ruble making Putin’s $630 Billion “war fund” worthless. We are choking off Russia’s access to technology that will sap its economic strength and weaken its military for years to come. Tonight I say to the Russian oligarchs and corrupt leaders who have bilked billions of dollars off this violent regime no more. The U.S. Department of Justice is assembling a dedicated task force to go after the crimes of Russian oligarchs. We are joining with our European allies to find and seize your yachts your luxury apartments your private jets.'),
Document(metadata={'X-TIKA:Parsed-By': 'org.apache.tika.parser.csv.TextAndCSVParser', 'Content-Encoding': 'UTF-8', 'X-TIKA:detectedEncoding': 'UTF-8', 'X-TIKA:encodingDetector': 'UniversalEncodingDetector', 'Content-Type': 'text/plain; charset=UTF-8', 'source': 'text_file', 'framework': 'langchain'}, page_content='He rejected repeated efforts at diplomacy. He thought the West and NATO wouldn’t respond. And he thought he could divide us at home. We were ready. Here is what we did. We prepared extensively and carefully.'),
Document(metadata={'X-TIKA:Parsed-By': 'org.apache.tika.parser.csv.TextAndCSVParser', 'Content-Encoding': 'UTF-8', 'X-TIKA:detectedEncoding': 'UTF-8', 'X-TIKA:encodingDetector': 'UniversalEncodingDetector', 'Content-Type': 'text/plain; charset=UTF-8', 'source': 'text_file', 'framework': 'langchain'}, page_content='It delivered immediate economic relief for tens of millions of Americans. Helped put food on their table, keep a roof over their heads, and cut the cost of health insurance. And as my Dad used to say, it gave people a little breathing room. And unlike the $2 Trillion tax cut passed in the previous administration that benefitted the top 1% of Americans, the American Rescue Plan helped working people—and left no one behind. Lots of jobs. \n\nIn fact—our economy created over 6.5 Million new jobs just last year, more jobs created in one year \nthan ever before in the history of America.'),
Document(metadata={'X-TIKA:Parsed-By': 'org.apache.tika.parser.csv.TextAndCSVParser', 'Content-Encoding': 'UTF-8', 'X-TIKA:detectedEncoding': 'UTF-8', 'X-TIKA:encodingDetector': 'UniversalEncodingDetector', 'Content-Type': 'text/plain; charset=UTF-8', 'source': 'text_file', 'framework': 'langchain'}, page_content='And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson.'),
Document(metadata={'X-TIKA:Parsed-By': 'org.apache.tika.parser.csv.TextAndCSVParser', 'Content-Encoding': 'UTF-8', 'X-TIKA:detectedEncoding': 'UTF-8', 'X-TIKA:encodingDetector': 'UniversalEncodingDetector', 'Content-Type': 'text/plain; charset=UTF-8', 'source': 'text_file', 'framework': 'langchain'}, page_content='All told, we created 369,000 new manufacturing jobs in America just last year. Powered by people I’ve met like JoJo Burgess, from generations of union steelworkers from Pittsburgh, who’s here with us tonight. As Ohio Senator Sherrod Brown says, “It’s time to bury the label “Rust Belt.” It’s time. \n\nBut with all the bright spots in our economy, record job growth and higher wages, too many families are struggling to keep up with the bills. Inflation is robbing them of the gains they might otherwise feel.'),
Document(metadata={'X-TIKA:Parsed-By': 'org.apache.tika.parser.csv.TextAndCSVParser', 'Content-Encoding': 'UTF-8', 'X-TIKA:detectedEncoding': 'UTF-8', 'X-TIKA:encodingDetector': 'UniversalEncodingDetector', 'Content-Type': 'text/plain; charset=UTF-8', 'source': 'text_file', 'framework': 'langchain'}, page_content='Putin’s latest attack on Ukraine was premeditated and unprovoked. He rejected repeated efforts at diplomacy. He thought the West and NATO wouldn’t respond. And he thought he could divide us at home. We were ready. Here is what we did.'),
Document(metadata={'X-TIKA:Parsed-By': 'org.apache.tika.parser.csv.TextAndCSVParser', 'Content-Encoding': 'UTF-8', 'X-TIKA:detectedEncoding': 'UTF-8', 'X-TIKA:encodingDetector': 'UniversalEncodingDetector', 'Content-Type': 'text/plain; charset=UTF-8', 'source': 'text_file', 'framework': 'langchain'}, page_content='Danielle says Heath was a fighter to the very end. He didn’t know how to stop fighting, and neither did she. Through her pain she found purpose to demand we do better. Tonight, Danielle—we are. The VA is pioneering new ways of linking toxic exposures to diseases, already helping more veterans get benefits.'),
Document(metadata={'summary': True, 'fcs': (0.54785156,)}, page_content='President Biden spoke about several key issues. He emphasized the importance of the Bipartisan Infrastructure Law, calling it the most significant investment to rebuild America and highlighting it as a bipartisan effort [1]. He also announced measures against Russian oligarchs, including assembling a task force to seize their assets and closing American airspace to Russian flights, further isolating Russia economically [2]. Additionally, he expressed a commitment to investigating the health impacts of burn pits on military personnel, referencing his son, Major Beau Biden, who suffered from brain cancer [3].')]

使用Vectara进行高级LangChain查询预处理

Vectara 的“将 RAG 作为服务”在创建问答系统或聊天机器人链时承担了大量繁重的工作。与 LangChain 的集成提供了使用其他功能的选项,例如查询预处理,如 SelfQueryRetrieverMultiQueryRetriever。让我们来看一个使用 MultiQueryRetriever 的示例。

由于MQR使用了大语言模型(LLM),我们必须先进行设置——在这里我们选择 ChatOpenAI

from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai.chat_models import ChatOpenAI

llm = ChatOpenAI(temperature=0)
mqr = MultiQueryRetriever.from_llm(retriever=retriever, llm=llm)


def get_summary(documents):
return documents[-1].page_content


(mqr | get_summary).invoke(query_str)
'The remarks made by Biden include his emphasis on the importance of the Bipartisan Infrastructure Law, which he describes as the most significant investment to rebuild America in history. He highlights the bipartisan effort involved in passing this law and expresses gratitude to members of both parties for their collaboration. Biden also mentions the transition from "infrastructure weeks" to an "infrastructure decade" [1]. Additionally, he shares a personal story about his father having to leave their home in Scranton, Pennsylvania, to find work, which influenced his decision to fight for the American Rescue Plan to help those in need [2].'