Activeloop Deep Memory

Activeloop Deep Memory is a suite of tools that lets you optimize your vector store for your use case and achieve higher accuracy in your LLM applications.

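Retrieval-Augmented Generation (RAG) has recently gained significant attention. As advanced RAG techniques emerge and mature, they expand what RAG can achieve. However, several challenges may limit the integration of RAG into production. The primary factors to consider when implementing RAG in production settings are accuracy (recall), cost, and latency. For basic use cases, OpenAI's Ada model paired with a naive similarity search can produce satisfactory results. For higher accuracy or recall during searches, however, you may need advanced retrieval techniques. These methods can involve varying the data chunk size, rewriting queries multiple times, and more, all of which may increase latency and cost.

Activeloop's Deep Memory, a feature available to Activeloop Deep Lake users, addresses these issues by introducing a tiny neural network layer trained to match user queries with relevant data from a corpus. This layer adds very little latency during search, yet it can boost retrieval accuracy by up to 27% while remaining cost-effective and simple to use, without requiring any additional advanced RAG techniques.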
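To make the idea of a small layer applied at query time more concrete, here is a toy sketch in pure Python. It is not Activeloop's implementation: the 2-dimensional embeddings, the document ids, and the hand-picked `weights` matrix are all assumptions for illustration, whereas Deep Memory learns its layer from queries and relevance labels. The sketch only shows that transforming the query embedding before a cosine similarity search can change which document ranks first.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def apply_layer(weights, vector):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def rank(query, corpus):
    """Return document ids sorted by decreasing similarity to the query."""
    return sorted(corpus, key=lambda doc_id: cosine(query, corpus[doc_id]), reverse=True)


# Hypothetical 2-d embeddings; real embeddings have far more dimensions.
corpus = {"doc_a": [1.0, 0.0], "doc_b": [0.0, 1.0]}
query = [0.8, 0.6]

# Hand-picked weights standing in for a trained layer (assumption).
weights = [[0.5, 0.0], [0.0, 2.0]]

baseline_ranking = rank(query, corpus)
adjusted_ranking = rank(apply_layer(weights, query), corpus)

print("without layer:", baseline_ranking)
print("with layer:   ", adjusted_ranking)
```

Because the documents themselves are not re-embedded in this sketch, the extra work per search is a single small matrix multiplication, which is consistent with the low latency described above.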

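In this tutorial we will parse the DeepLake documentation and build a RAG system that can answer questions from the docs.

1. Dataset Creation

We will parse Activeloop's docs using the BeautifulSoup library together with LangChain's document parsers, such as Html2TextTransformer and AsyncHtmlLoader. We therefore need to install the following libraries: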

%pip install --upgrade --quiet  tiktoken langchain-openai python-dotenv datasets langchain deeplake beautifulsoup4 html2text ragas

You will also need to create an Activeloop account.

ORG_ID = "..."
from langchain.chains import RetrievalQA
from langchain_community.vectorstores import DeepLake
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
import getpass
import os

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API token: ")
# An Activeloop token is needed if you are not signed in using the CLI: `activeloop login -u <USERNAME> -p <PASSWORD>`
if "ACTIVELOOP_TOKEN" not in os.environ:
    os.environ["ACTIVELOOP_TOKEN"] = getpass.getpass(
        "Enter your ActiveLoop API token: "
    )  # Get your API token from https://app.activeloop.ai, click on your profile picture in the top right corner, and select "API Tokens"

token = os.getenv("ACTIVELOOP_TOKEN")
openai_embeddings = OpenAIEmbeddings()
db = DeepLake(
    dataset_path=f"hub://{ORG_ID}/deeplake-docs-deepmemory",  # ORG_ID is your Activeloop username or organization
    embedding=openai_embeddings,
    runtime={"tensor_db": True},
    token=token,
    # overwrite=True,  # use the overwrite flag if you want to overwrite the full dataset
    read_only=False,
)

Parse all links on the web page using BeautifulSoup:

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def get_all_links(url):
    response = requests.get(url)
    if response.status_code != 200:
        print(f"Failed to retrieve the page: {url}")
        return []

    soup = BeautifulSoup(response.content, "html.parser")

    # Find all 'a' tags, which typically carry the href attribute for links
    links = [
        urljoin(url, a["href"]) for a in soup.find_all("a", href=True) if a["href"]
    ]

    return links


base_url = "https://docs.deeplake.ai/en/latest/"
all_links = get_all_links(base_url)
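The list returned by `get_all_links` can contain the same page several times (for example once per in-page anchor) as well as links to other sites. As an optional refinement, a minimal sketch of a clean-up step follows. The helper name `normalize_links` and the sample hrefs are hypothetical and not part of the original tutorial; only standard-library functions are used.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def normalize_links(base_url, hrefs):
    """Resolve hrefs against base_url, then drop fragments, off-site links, and duplicates."""
    base_host = urlparse(base_url).netloc
    seen = set()
    result = []
    for href in hrefs:
        # urljoin makes the link absolute; urldefrag strips any '#fragment' part
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(absolute).netloc != base_host:
            continue  # skip links that point to another host (or have no host, e.g. mailto:)
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Hypothetical hrefs, for illustration only
sample_hrefs = [
    "api.html",
    "api.html#section",
    "/en/latest/index.html",
    "https://example.com/other",
]
links = normalize_links("https://docs.deeplake.ai/en/latest/", sample_hrefs)
print(links)
```

Loading fewer duplicate pages also means fewer near-identical chunks in the vector store, which could otherwise compete with each other at retrieval time.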

Load the data:

from langchain_community.document_loaders.async_html import AsyncHtmlLoader

loader = AsyncHtmlLoader(all_links)
docs = loader.load()
API Reference: AsyncHtmlLoader

Convert the data into a human-readable format:

from langchain_community.document_transformers import Html2TextTransformer

html2text = Html2TextTransformer()
docs_transformed = html2text.transform_documents(docs)

Now let us split the documents further, since some of them contain too much text:

from langchain_text_splitters import RecursiveCharacterTextSplitter

chunk_size = 4096
docs_new = []

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
)

for doc in docs_transformed:
    if len(doc.page_content) < chunk_size:
        # Short documents are kept whole
        docs_new.append(doc)
    else:
        # Longer documents are split into chunks of at most chunk_size characters
        chunks = text_splitter.create_documents([doc.page_content])
        docs_new.extend(chunks)
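The branch logic above (keep short texts whole, split long ones) can be illustrated without any dependencies. The sketch below is a simplified stand-in, not `RecursiveCharacterTextSplitter`: it cuts at fixed character offsets, whereas the LangChain splitter first tries to break on separators such as paragraph and line breaks. The function name `split_long_texts` is hypothetical.

```python
def split_long_texts(texts, chunk_size):
    """Keep texts shorter than chunk_size whole; cut the rest into fixed-size pieces."""
    pieces = []
    for text in texts:
        if len(text) < chunk_size:
            pieces.append(text)
        else:
            # Slice at fixed offsets; the last piece may be shorter than chunk_size
            pieces.extend(
                text[start : start + chunk_size]
                for start in range(0, len(text), chunk_size)
            )
    return pieces


pieces = split_long_texts(["short", "x" * 10], chunk_size=8)
print([len(piece) for piece in pieces])
```

A fixed-offset cut can split a sentence or code sample in half, which is one reason a separator-aware splitter is the better choice for real documentation.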

Populate the vector store:

docs = db.add_documents(docs_new)

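2. Generating Synthetic Queries and Training Deep Memory

The next step is to train a deep_memory model that aligns your users' queries with the dataset you already have. If you do not have any user queries yet, no worries: we will generate them using an LLM!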

TODO: Add image

The schema above shows how deep_memory works overall. To train it, you need queries and relevance together with corpus data (the data we want to query). The corpus data was already populated in the previous section; here we will generate the questions and the relevance.

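  1. questions - a list of strings, where each string represents a query
  2. relevance - contains links to the ground truth for each question. Several documents may contain the answer to a given question. For this reason, relevance is a List[List[tuple[str, float]]], where the outer list represents queries and the inner list represents relevant documents. Each tuple holds a str, float pair: the string is the id of the source document (corresponding to the id tensor in the dataset), and the float indicates how strongly that document relates to the question.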
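As a concrete picture of these two inputs, here is a small sketch with made-up data. The question strings and document ids are hypothetical (real ids come from the dataset's id tensor), and the helper `is_aligned` is not part of any library; it only checks the shape described above.

```python
from typing import List, Tuple

Relevance = List[List[Tuple[str, float]]]

# Hypothetical queries and document ids, for illustration only
questions: List[str] = [
    "How do I create a dataset?",
    "Which htypes are supported?",
]
relevance: Relevance = [
    [("doc-id-001", 1.0)],                       # one relevant document
    [("doc-id-002", 1.0), ("doc-id-003", 1.0)],  # two documents contain the answer
]


def is_aligned(questions: List[str], relevance: Relevance) -> bool:
    """Check that there is one relevance list per question and that every entry is (str, number)."""
    if len(questions) != len(relevance):
        return False
    return all(
        isinstance(doc_id, str) and isinstance(score, (int, float))
        for rel in relevance
        for doc_id, score in rel
    )


print(is_aligned(questions, relevance))
```

The i-th inner list belongs to the i-th question, so the two lists must always have the same length and order.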

Now let us generate synthetic questions and relevance:

from typing import List

from langchain.chains.openai_functions import (
    create_structured_output_chain,
)
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field
# Fetch the dataset docs and ids if they already exist (alternatively, ingest them first)
docs = db.vectorstore.dataset.text.data(fetch_chunks=True, aslist=True)["value"]
ids = db.vectorstore.dataset.id.data(fetch_chunks=True, aslist=True)["value"]
# If we pass in a model explicitly, we need to make sure it supports the OpenAI function-calling API.
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)


class Questions(BaseModel):
    """A question generated from the provided text."""

    question: str = Field(..., description="Questions about text")


prompt_msgs = [
    SystemMessage(
        content="You are a world class expert for generating questions based on provided context. "
        "You make sure the question can be answered by the text."
    ),
    HumanMessagePromptTemplate.from_template(
        "Use the given text to generate a question from the following input: {input}"
    ),
    HumanMessage(content="Tips: Make sure to answer in the correct format"),
]
prompt = ChatPromptTemplate(messages=prompt_msgs)
chain = create_structured_output_chain(Questions, llm, prompt, verbose=True)

text = "# Understanding Hallucinations and Bias ## **Introduction** In this lesson, we'll cover the concept of **hallucinations** in LLMs, highlighting their influence on AI applications and demonstrating how to mitigate them using techniques like the retriever's architectures. We'll also explore **bias** within LLMs with examples."
questions = chain.run(input=text)
print(questions)
import random

from langchain_openai import OpenAIEmbeddings
from tqdm import tqdm


def generate_queries(docs: List[str], ids: List[str], n: int = 100):
    questions = []
    relevances = []
    pbar = tqdm(total=n)
    while len(questions) < n:
        # 1. randomly draw a piece of text and its relevance id
        r = random.randint(0, len(docs) - 1)
        text, label = docs[r], ids[r]

        # 2. generate a query and assign the relevance id to it
        generated_qs = [chain.run(input=text).question]
        questions.extend(generated_qs)
        relevances.extend([[(label, 1)] for _ in generated_qs])
        pbar.update(len(generated_qs))
        if len(questions) % 10 == 0:
            print(f"q: {len(questions)}")
    return questions[:n], relevances[:n]


chain = create_structured_output_chain(Questions, llm, prompt, verbose=False)
questions, relevances = generate_queries(docs, ids, n=200)

train_questions, train_relevances = questions[:100], relevances[:100]
test_questions, test_relevances = questions[100:], relevances[100:]
API Reference: OpenAIEmbeddings
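Because `generate_queries` draws documents at random with replacement, the same document can end up as the target of both a training query and a test query. Whether that matters depends on what you want the evaluation to measure, but it is easy to check. The sketch below uses a hypothetical helper `shared_targets` and made-up ids; with the real data you would pass `train_relevances` and `test_relevances`.

```python
def shared_targets(train_relevances, test_relevances):
    """Return the document ids that are targets in both the training and the test split."""
    train_ids = {doc_id for rel in train_relevances for doc_id, _ in rel}
    test_ids = {doc_id for rel in test_relevances for doc_id, _ in rel}
    return train_ids & test_ids


# Hypothetical relevance lists in the same format as above
example_train = [[("doc-1", 1)], [("doc-2", 1)], [("doc-2", 1)]]
example_test = [[("doc-2", 1)], [("doc-3", 1)]]

overlap = shared_targets(example_train, example_test)
print(f"{len(overlap)} document(s) appear in both splits: {sorted(overlap)}")
```

A large overlap means the test recall partly reflects documents the layer has already been trained on, so it is worth reporting alongside the recall numbers.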

We have now created 100 training queries and 100 test queries. Let us train Deep Memory:

job_id = db.vectorstore.deep_memory.train(
    queries=train_questions,
    relevance=train_relevances,
)

Let us track the training progress:

db.vectorstore.deep_memory.status(job_id)

--------------------------------------------------------------
|                  6538e02ecda4691033a51c5b                  |
--------------------------------------------------------------
| status                     | completed                     |
--------------------------------------------------------------
| progress                   | eta: 1.4 seconds              |
|                            | recall@10: 79.00% (+34.00%)   |
--------------------------------------------------------------
| results                    | recall@10: 79.00% (+34.00%)   |
--------------------------------------------------------------

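3. Evaluating Deep Memory Performance

Great, we have trained the model! It shows a substantial improvement in recall, but how can we use it now and evaluate it on unseen data? In this section we will look at model evaluation and inference and see how Deep Memory can be used with LangChain to increase retrieval accuracy.

3.1 Deep Memory Evaluation

To begin with, we can use `deep_memory`'s built-in evaluation method. It calculates several recall metrics and takes only a few lines of code.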

recall = db.vectorstore.deep_memory.evaluate(
    queries=test_questions,
    relevance=test_relevances,
)

Embedding queries took 0.81 seconds
---- Evaluating without model ----
Recall@1: 9.0%
Recall@3: 19.0%
Recall@5: 24.0%
Recall@10: 42.0%
Recall@50: 93.0%
Recall@100: 98.0%
---- Evaluating with model ----
Recall@1: 19.0%
Recall@3: 42.0%
Recall@5: 49.0%
Recall@10: 69.0%
Recall@50: 97.0%
Recall@100: 97.0%

The improvement carries over to the unseen test dataset as well: Recall@10, for example, rises from 42% to 69%.
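For readers unfamiliar with the metric, the following sketch shows one common way to compute recall@k: the fraction of queries for which at least one relevant document appears among the top k results. This is an assumption about the definition, not a copy of Deep Lake's evaluation code, and the ranked ids below are made up.

```python
def recall_at_k(retrieved_ids, relevance, k):
    """Fraction of queries with at least one relevant document in the top-k results."""
    hits = 0
    for ranked, rel in zip(retrieved_ids, relevance):
        relevant = {doc_id for doc_id, _ in rel}
        if relevant & set(ranked[:k]):
            hits += 1
    return hits / len(relevance)


# Hypothetical ranked results for two queries, best match first
retrieved_ids = [
    ["a", "b", "c"],  # the relevant document "b" is ranked second
    ["x", "y", "z"],  # the relevant document "q" was not retrieved at all
]
example_relevance = [[("b", 1)], [("q", 1)]]

for k in (1, 3):
    print(f"Recall@{k}: {recall_at_k(retrieved_ids, example_relevance, k):.1%}")
```

Since each synthetic query in this tutorial has exactly one relevant document, this "at least one hit" definition coincides with the usual per-query recall averaged over queries.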

3.2 Deep Memory + RAGas

from ragas.langchain import RagasEvaluatorChain
from ragas.metrics import (
    context_recall,
)

Let us convert the relevance into ground truths:

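def convert_relevance_to_ground_truth(docs, ids, relevance):
    # relevance stores document ids (strings), not list positions,
    # so build an id -> text lookup before resolving them
    id_to_doc = dict(zip(ids, docs))
    ground_truths = []

    for rel in relevance:
        ground_truth = []
        for doc_id, _ in rel:
            ground_truth.append(id_to_doc[doc_id])
        ground_truths.append(ground_truth)
    return ground_truths


ground_truths = convert_relevance_to_ground_truth(docs, ids, test_relevances)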

for deep_memory in [False, True]:
    print("\nEvaluating with deep_memory =", deep_memory)
    print("===================================")

    retriever = db.as_retriever()
    retriever.search_kwargs["deep_memory"] = deep_memory

    qa_chain = RetrievalQA.from_chain_type(
        llm=ChatOpenAI(model="gpt-3.5-turbo"),
        chain_type="stuff",
        retriever=retriever,
        return_source_documents=True,
    )

    metrics = {
        "context_recall_score": 0,
    }

    eval_chains = {m.name: RagasEvaluatorChain(metric=m) for m in [context_recall]}

    for question, ground_truth in zip(test_questions, ground_truths):
        result = qa_chain({"query": question})
        result["ground_truths"] = ground_truth
        for name, eval_chain in eval_chains.items():
            score_name = f"{name}_score"
            metrics[score_name] += eval_chain(result)[score_name]

    for metric in metrics:
        metrics[metric] /= len(test_questions)
        print(f"{metric}: {metrics[metric]}")
    print("===================================")

Evaluating with deep_memory = False
===================================
context_recall_score = 0.3763423145
===================================

Evaluating with deep_memory = True
===================================
context_recall_score = 0.5634545323
===================================

3.3 Deep Memory Inference

TODO: Add image

With Deep Memory

retriever = db.as_retriever()
retriever.search_kwargs["deep_memory"] = True
retriever.search_kwargs["k"] = 10

query = "Deamination of cytidine to uridine on the minus strand of viral DNA results in catastrophic G-to-A mutations in the viral genome."
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4"), chain_type="stuff", retriever=retriever
)
print(qa.run(query))
The base htype of the 'video_seq' tensor is 'video'.

Without Deep Memory

retriever = db.as_retriever()
retriever.search_kwargs["deep_memory"] = False
retriever.search_kwargs["k"] = 10

query = "Deamination of cytidine to uridine on the minus strand of viral DNA results in catastrophic G-to-A mutations in the viral genome."
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4"), chain_type="stuff", retriever=retriever
)
qa.run(query)
The text does not provide information on the base htype of the 'video_seq' tensor.

3.4 Deep Memory Cost Savings

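Deep Memory increases retrieval accuracy without altering your existing workflow. In addition, by reducing the top_k input passed to the LLM, you can lower token usage and therefore significantly cut inference costs.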
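A rough back-of-the-envelope calculation shows where the saving comes from. In the evaluation in section 3.1, Recall@3 with the model (42.0%) matches Recall@10 without it (42.0%), so in that run k could drop from 10 to 3 at the same recall. The sketch below assumes every retrieved chunk is about 1,000 tokens, a guess based on the 4,096-character chunk size and a rule of thumb of roughly four characters per token. Actual token counts and prices will differ.

```python
def context_tokens(k, tokens_per_chunk):
    """Approximate number of context tokens sent to the LLM for one query."""
    return k * tokens_per_chunk


TOKENS_PER_CHUNK = 1000  # assumption: ~4096 characters at roughly 4 characters per token

baseline_tokens = context_tokens(k=10, tokens_per_chunk=TOKENS_PER_CHUNK)
reduced_tokens = context_tokens(k=3, tokens_per_chunk=TOKENS_PER_CHUNK)
savings = (baseline_tokens - reduced_tokens) / baseline_tokens

print(f"k=10: {baseline_tokens} context tokens per query")
print(f"k=3:  {reduced_tokens} context tokens per query")
print(f"context token reduction: {savings:.0%}")
```

Input cost scales with the number of context tokens, so a reduction of this size applies to every query the application serves. The prompt template and the question itself add a small fixed overhead that this sketch ignores.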