
Recursive URL

The RecursiveUrlLoader lets you recursively scrape all child links from a root URL and parse them into Documents.

Overview

Integration details

| Class | Package | Local | Serializable | JS support |
| :--- | :--- | :---: | :---: | :---: |
| RecursiveUrlLoader | langchain_community | ✅ | ❌ | ✅ |

Loader features

| Source | Document Lazy Loading | Native Async Support |
| :--- | :---: | :---: |
| RecursiveUrlLoader | ✅ | ❌ |

Setup

Credentials

No credentials are required to use the RecursiveUrlLoader.

Installation

The RecursiveUrlLoader lives in the langchain-community package. There are no other required packages, though you will get richer default Document metadata if you also have beautifulsoup4 installed.

```python
%pip install -qU langchain-community beautifulsoup4 lxml
```

Instantiation

Now we can instantiate our document loader object and load Documents:

```python
from langchain_community.document_loaders import RecursiveUrlLoader

loader = RecursiveUrlLoader(
    "https://docs.python.org/3.9/",
    # max_depth=2,
    # use_async=False,
    # extractor=None,
    # metadata_extractor=None,
    # exclude_dirs=(),
    # timeout=10,
    # check_response_status=True,
    # continue_on_failure=True,
    # prevent_outside=True,
    # base_url=None,
    # ...
)
```
API Reference: RecursiveUrlLoader

Load

.load() synchronously loads all Documents into memory, with one Document per visited URL. Starting from the initial URL, we recurse through all linked URLs up to the specified max_depth.

Let's run a basic example of using the RecursiveUrlLoader on the Python 3.9 docs:

```python
docs = loader.load()
docs[0].metadata
```
```output
/Users/bagatur/.pyenv/versions/3.9.1/lib/python3.9/html/parser.py:170: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
  k = self.parse_starttag(i)
{'source': 'https://docs.python.org/3.9/',
 'content_type': 'text/html',
 'title': '3.9.19 Documentation',
 'language': None}
```

Great! The first document looks like the root page we started from. Let's look at the metadata of the next document:

```python
docs[1].metadata
```
```output
{'source': 'https://docs.python.org/3.9/using/index.html',
 'content_type': 'text/html',
 'title': 'Python Setup and Usage — Python 3.9.19 documentation',
 'language': None}
```

That url looks like a child page of our root page, which is great! Let's move on from metadata and inspect the content of one of our documents:

```python
print(docs[0].page_content[:300])
```
```output
<!DOCTYPE html>

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <meta charset="utf-8" /><title>3.9.19 Documentation</title><meta name="viewport" content="width=device-width, initial-scale=1.0">

    <link rel="stylesheet" href="_static/pydoctheme.css" type="text/css" />
    <link rel=
```

That certainly looks like HTML from the url https://docs.python.org/3.9/, which is what we expected. Let's now look at some variations of our basic example that can be helpful in different situations.

Lazy loading

If we're loading a large number of Documents and our downstream operations can be done over subsets of all loaded Documents, we can lazily load our Documents one at a time to minimize our memory footprint:

```python
pages = []
for doc in loader.lazy_load():
    pages.append(doc)
    if len(pages) >= 10:
        # do some paged operation, e.g.
        # index.upsert(page)

        pages = []
```
```output
/var/folders/4j/2rz3865x6qg07tx43146py8h0000gn/T/ipykernel_73962/2110507528.py:6: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
  soup = BeautifulSoup(html, "lxml")
```

In this example we never have more than 10 Documents loaded into memory at a time.
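The batching pattern in the loop above can be factored into a small reusable helper. Below is a stdlib-only sketch; the `index.upsert` call in the usage comment is a hypothetical downstream operation, just as in the loop above:

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], n: int) -> Iterator[List[T]]:
    """Yield successive lists of up to n items from any iterable."""
    it = iter(items)
    while batch := list(islice(it, n)):
        yield batch


# Usage with the loader would look like:
# for batch in batched(loader.lazy_load(), 10):
#     index.upsert(batch)  # hypothetical paged operation
```

Because `batched` consumes the iterator lazily, pairing it with `lazy_load()` keeps at most one batch of Documents in memory at a time.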

Adding an extractor

By default, the loader sets the raw HTML of each page as the Document page content. To parse this HTML into a more human/LLM-friendly format, you can pass in a custom extractor method:

```python
import re

from bs4 import BeautifulSoup


def bs4_extractor(html: str) -> str:
    soup = BeautifulSoup(html, "lxml")
    return re.sub(r"\n\n+", "\n\n", soup.text).strip()


loader = RecursiveUrlLoader("https://docs.python.org/3.9/", extractor=bs4_extractor)
docs = loader.load()
print(docs[0].page_content[:200])
```
```output
/var/folders/td/vzm913rx77x21csd90g63_7c0000gn/T/ipykernel_10935/1083427287.py:6: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
  soup = BeautifulSoup(html, "lxml")
/Users/isaachershenson/.pyenv/versions/3.11.9/lib/python3.11/html/parser.py:170: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
  k = self.parse_starttag(i)
3.9.19 Documentation

Download
Download these documents
Docs by version

Python 3.13 (in development)
Python 3.12 (stable)
Python 3.11 (security-fixes)
Python 3.10 (security-fixes)
Python 3.9 (securit
```

This looks much nicer!
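As an aside, if beautifulsoup4 is unavailable, a rough extractor can be sketched with the standard library's html.parser alone. This is only an illustration (not part of langchain-community) and is far less robust than BeautifulSoup:

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def stdlib_extractor(html: str) -> str:
    """Return whitespace-normalized visible text from an HTML string."""
    parser = _TextExtractor()
    parser.feed(html)
    text = " ".join(chunk.strip() for chunk in parser.chunks if chunk.strip())
    return re.sub(r"\s+", " ", text)


# loader = RecursiveUrlLoader("https://docs.python.org/3.9/", extractor=stdlib_extractor)
```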

Similarly, you can pass in a metadata_extractor to customize how Document metadata is extracted from the HTTP response. See the API reference for more on this.
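For example, a custom metadata_extractor might pull the page title with a regex instead of BeautifulSoup. This sketch assumes the (raw_html, url, response) call signature described in the API reference, where response is the requests.Response for the page; check your installed version, as the expected signature has changed across releases:

```python
import re


def simple_metadata_extractor(raw_html: str, url: str, response) -> dict:
    """Extract the <title> tag and the HTTP status code as metadata.

    Assumed signature: (raw_html, url, response), per the API reference;
    `response` is the requests.Response object for the fetched page.
    """
    match = re.search(r"<title[^>]*>(.*?)</title>", raw_html, re.IGNORECASE | re.DOTALL)
    return {
        "source": url,
        "title": match.group(1).strip() if match else None,
        "status": getattr(response, "status_code", None),
    }


# loader = RecursiveUrlLoader(
#     "https://docs.python.org/3.9/",
#     metadata_extractor=simple_metadata_extractor,
# )
```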

API reference

These examples show just a few of the ways in which you can modify the default RecursiveUrlLoader, but there are many more modifications that can be made to best fit your use case. Using the parameters link_regex and exclude_dirs can help you filter out unwanted URLs, aload() and alazy_load() can be used for async loading, and more.
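To illustrate the filtering parameters, here is one way a link_regex and exclude_dirs might be constructed. The specific pattern and URLs below are hypothetical examples chosen for this sketch, not values from the original notebook:

```python
import re

# Hypothetical filters: only follow hrefs under /3.9/library/, and never
# descend into the whatsnew pages.
link_regex = r'href=["\'](https://docs\.python\.org/3\.9/library/[^"\']+)["\']'
exclude_dirs = ("https://docs.python.org/3.9/whatsnew/",)

# loader = RecursiveUrlLoader(
#     "https://docs.python.org/3.9/",
#     link_regex=link_regex,
#     exclude_dirs=exclude_dirs,
# )
```

exclude_dirs entries are URL prefixes to skip entirely, while link_regex is applied to the raw HTML to decide which discovered links to follow.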

For detailed information on configuring and calling the RecursiveUrlLoader, see the API reference: https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.recursive_url_loader.RecursiveUrlLoader.html