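ArxivRetriever

arXiv is an open-access archive of millions of scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.

This notebook shows how to retrieve scientific articles from arxiv.org into the Document format that is used downstream.

For detailed documentation of all ArxivRetriever features and configurations, head to the API reference.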

Integration details

Retriever	Source	Package
ArxivRetriever	Scholarly articles on arxiv.org	langchain_community

Setup

If you want to get automated tracing from individual queries, you can also set your LangSmith API key by uncommenting the code below:

# os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
# os.environ["LANGSMITH_TRACING"] = "true"

Installation

This retriever lives in the langchain-community package. We will also need the arxiv dependency:

%pip install -qU langchain-community arxiv

Instantiation

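ArxivRetriever parameters include:

optional load_max_docs: default=100. Use it to limit the number of downloaded documents. Downloading all 100 documents takes time, so use a smaller number for experiments. There is currently a hard limit of 300.
optional load_all_available_meta: default=False. By default, only the most important fields are downloaded: Published (the date the document was published or last updated), Title, Authors, and Summary. If True, the other fields are downloaded as well.
get_full_documents: bool, default=False. Determines whether to fetch the full text of documents.

See the API reference for more details.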

from langchain_community.retrievers import ArxivRetriever

retriever = ArxivRetriever(
    load_max_docs=2,  # keep the download small for experimentation
    get_full_documents=True,  # spelled as in the parameter list above
)

API Reference: ArxivRetriever

Usage

ArxivRetriever supports retrieval by article identifier:

docs = retriever.invoke("1605.08386")

docs[0].metadata  # meta-information of the Document

{'Entry ID': 'http://arxiv.org/abs/1605.08386v1',
 'Published': datetime.date(2016, 5, 26),
 'Title': 'Heat-bath random walks with Markov bases',
 'Authors': 'Caprice Stanley, Tobias Windisch'}

docs[0].page_content[:400]  # the first 400 characters of the Document's content

'Graphs on lattice points are studied whose edges come from a finite set of\nallowed moves of arbitrary length. We show that the diameter of these graphs on\nfibers of a fixed integer matrix can be bounded from above by a constant. We\nthen study the mixing behaviour of heat-bath random walks on these graphs. We\nalso state explicit conditions on the set of moves so that the heat-bath random\nwalk, a ge'
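The metadata shown above is an ordinary Python dictionary, so it can be post-processed with plain Python. As a minimal sketch, the snippet below turns it into a one-line citation. The `format_citation` helper is hypothetical (it is not part of langchain_community), and the dictionary is copied from the output above so that the example runs without network access.

```python
import datetime

# Metadata copied from the example output above (no network call needed).
metadata = {
    "Entry ID": "http://arxiv.org/abs/1605.08386v1",
    "Published": datetime.date(2016, 5, 26),
    "Title": "Heat-bath random walks with Markov bases",
    "Authors": "Caprice Stanley, Tobias Windisch",
}


def format_citation(meta: dict) -> str:
    """Hypothetical helper: build a short citation string from the default metadata fields."""
    year = meta["Published"].year
    return f'{meta["Authors"]} ({year}). {meta["Title"]}. {meta["Entry ID"]}'


citation = format_citation(metadata)
print(citation)
```

The helper relies only on the default fields (Entry ID, Published, Title, Authors), so it does not depend on load_all_available_meta being enabled.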

ArxivRetriever also supports retrieval based on natural language text:

docs = retriever.invoke("What is the ImageBind model?")

docs[0].metadata

{'Entry ID': 'http://arxiv.org/abs/2305.05665v2',
 'Published': datetime.date(2023, 5, 31),
 'Title': 'ImageBind: One Embedding Space To Bind Them All',
 'Authors': 'Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra'}
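Because the same invoke call accepts both identifiers and free text, an application may want to know which kind of query it is about to send. The following is a rough sketch of such a check. The `looks_like_arxiv_id` helper and its pattern are assumptions made for illustration: they cover only new-style identifiers such as 1605.08386 or 2305.05665v2, and they are not presented as the logic the retriever itself uses.

```python
import re

# Assumption: new-style arXiv identifiers are four digits, a dot, four or five
# digits, and an optional version suffix such as "v2".
ARXIV_ID_PATTERN = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")


def looks_like_arxiv_id(query: str) -> bool:
    """Hypothetical helper: True if the query resembles a new-style arXiv identifier."""
    return ARXIV_ID_PATTERN.match(query.strip()) is not None


for query in ["1605.08386", "2305.05665v2", "What is the ImageBind model?"]:
    kind = "identifier" if looks_like_arxiv_id(query) else "free text"
    print(f"{query!r}: {kind}")
```

Old-style identifiers (for example, those with a subject prefix) are deliberately left out of this sketch.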

Use within a chain

Like other retrievers, ArxivRetriever can be incorporated into LLM applications via chains.

We will need an LLM or chat model:

Select chat model:

pip install -qU "langchain[openai]"

import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
  os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("gpt-4o-mini", model_provider="openai")

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

prompt = ChatPromptTemplate.from_template(
    """Answer the question based only on the context provided.

Context: {context}

Question: {question}"""
)


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

API Reference: StrOutputParser | ChatPromptTemplate | RunnablePassthrough

chain.invoke("What is the ImageBind model?")

'The ImageBind model is an approach to learn a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data. It shows that only image-paired data is sufficient to bind the modalities together and can leverage large scale vision-language models for zero-shot capabilities and emergent applications such as cross-modal retrieval, composing modalities with arithmetic, cross-modal detection and generation.'
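To make the data flow of the chain above easier to follow, here is a plain-Python sketch of what happens before the model is called: the question is sent to a retriever, the returned documents are joined by format_docs, and both values are substituted into the prompt template. `StubDoc`, `stub_retriever`, and `build_prompt` are hypothetical stand-ins introduced only for this illustration. No LangChain objects, network access, or API keys are involved.

```python
from dataclasses import dataclass


@dataclass
class StubDoc:
    """Hypothetical stand-in that exposes only the page_content attribute."""

    page_content: str


def format_docs(docs):
    # Same helper as in the chain: join document texts with blank lines.
    return "\n\n".join(doc.page_content for doc in docs)


def stub_retriever(question):
    # Assumption: a fake retriever that returns fixed documents for any question.
    return [StubDoc("First abstract."), StubDoc("Second abstract.")]


TEMPLATE = """Answer the question based only on the context provided.

Context: {context}

Question: {question}"""


def build_prompt(question):
    # Mirrors the first step of the chain: "context" comes from retriever plus
    # format_docs, while "question" is passed through unchanged.
    inputs = {
        "context": format_docs(stub_retriever(question)),
        "question": question,
    }
    return TEMPLATE.format(**inputs)


prompt_text = build_prompt("What is the ImageBind model?")
print(prompt_text)
```

In the real chain, the resulting prompt is then passed to the chat model, and StrOutputParser extracts the reply as a string.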

API reference

For detailed documentation of all ArxivRetriever features and configurations, head to the API reference.

Related

Retriever conceptual guide
Retriever how-to guides
