Skip to main content
Open In Colab在 GitHub 上打开

RAGatouille

RAGatouille 使其尽可能简单易用ColBERT!ColBERT 是一种快速准确的检索模型,可在数十毫秒内对大型文本集合进行基于 BERT 的可扩展搜索。

请参阅 ColBERTv2:通过轻量级后期交互进行有效和高效的检索论文。

我们可以通过多种方式使用 RAGatouille。

设置

集成位于ragatouille包。

pip install -U ragatouille
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
[Jan 10, 10:53:28] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
``````output
/Users/harrisonchase/.pyenv/versions/3.10.1/envs/langchain/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py:125: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling.
warnings.warn(

Retriever

我们可以将 RAGatouille 用作检索器。有关更多信息,请参阅 RAGatouille Retriever

文档压缩器

我们还可以将现成的 RAGatouille 用作重新排序器。这将允许我们使用 ColBERT 对从任何通用检索器中检索到的结果进行重新排序。这样做的好处是我们可以在任何现有索引之上执行此作,因此我们不需要创建新的 idex。我们可以通过使用 LangChain 中的文档压缩器抽象来实现这一点。

设置 Vanilla Retriever

首先,让我们设置一个 vanilla retriever 作为示例。

import requests
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter


def get_wikipedia_page(title: str):
"""
Retrieve the full text content of a Wikipedia page.

:param title: str - Title of the Wikipedia page.
:return: str - Full text content of the page as raw string.
"""
# Wikipedia API endpoint
URL = "https://en.wikipedia.org/w/api.php"

# Parameters for the API request
params = {
"action": "query",
"format": "json",
"titles": title,
"prop": "extracts",
"explaintext": True,
}

# Custom User-Agent header to comply with Wikipedia's best practices
headers = {"User-Agent": "RAGatouille_tutorial/0.0.1 (ben@clavie.eu)"}

response = requests.get(URL, params=params, headers=headers)
data = response.json()

# Extracting page content
page = next(iter(data["query"]["pages"].values()))
return page["extract"] if "extract" in page else None


text = get_wikipedia_page("Hayao_Miyazaki")
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
texts = text_splitter.create_documents([text])
retriever = FAISS.from_documents(texts, OpenAIEmbeddings()).as_retriever(
search_kwargs={"k": 10}
)
docs = retriever.invoke("What animation studio did Miyazaki found")
docs[0]
Document(page_content='collaborative projects. In April 1984, Miyazaki opened his own office in Suginami Ward, naming it Nibariki.')

我们可以看到,结果与所提出的问题并不完全相关

使用 ColBERT 作为重新排名器

from langchain.retrievers import ContextualCompressionRetriever

compression_retriever = ContextualCompressionRetriever(
base_compressor=RAG.as_langchain_document_compressor(), base_retriever=retriever
)

compressed_docs = compression_retriever.invoke(
"What animation studio did Miyazaki found"
)
/Users/harrisonchase/.pyenv/versions/3.10.1/envs/langchain/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
warnings.warn(
compressed_docs[0]
Document(page_content='In June 1985, Miyazaki, Takahata, Tokuma and Suzuki founded the animation production company Studio Ghibli, with funding from Tokuma Shoten. Studio Ghibli\'s first film, Laputa: Castle in the Sky (1986), employed the same production crew of Nausicaä. Miyazaki\'s designs for the film\'s setting were inspired by Greek architecture and "European urbanistic templates". Some of the architecture in the film was also inspired by a Welsh mining town; Miyazaki witnessed the mining strike upon his first', metadata={'relevance_score': 26.5194149017334})

这个答案更相关!