
Recursive URL

The RecursiveUrlLoader lets you recursively scrape all child links from a root URL and parse them into Documents.

Overview

Integration details

| Class | Package | Local | Serializable | JS support |
| :--- | :--- | :---: | :---: | :---: |
| RecursiveUrlLoader | langchain_community | ✅ | ❌ | ✅ |

Loader features

| Source | Document Lazy Loading | Native Async Support |
| :--- | :---: | :---: |
| RecursiveUrlLoader | ✅ | ✅ |

Setup

Credentials

No credentials are required to use the RecursiveUrlLoader.

Installation

The RecursiveUrlLoader lives in the langchain-community package. There are no other required packages, though you will get richer default Document metadata if you also have `beautifulsoup4` installed.

%pip install -qU langchain-community beautifulsoup4 lxml

Instantiation

Now we can instantiate our document loader object and load Documents:

from langchain_community.document_loaders import RecursiveUrlLoader

loader = RecursiveUrlLoader(
    "https://docs.python.org/3.9/",
    # max_depth=2,
    # use_async=False,
    # extractor=None,
    # metadata_extractor=None,
    # exclude_dirs=(),
    # timeout=10,
    # check_response_status=True,
    # continue_on_failure=True,
    # prevent_outside=True,
    # base_url=None,
    # ...
)

Load

Use .load() to synchronously load all Documents into memory, with one Document per visited URL. Starting from the initial URL, we recurse through all linked URLs up to the specified max depth.

Let's run through a basic example of how to use the RecursiveUrlLoader on the Python 3.9 docs.

docs = loader.load()
docs[0].metadata
/Users/bagatur/.pyenv/versions/3.9.1/lib/python3.9/html/parser.py:170: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
k = self.parse_starttag(i)
{'source': 'https://docs.python.org/3.9/',
'content_type': 'text/html',
'title': '3.9.19 Documentation',
'language': None}

Great! The first document looks like the root page we started from. Let's look at the metadata of the next document:

docs[1].metadata
{'source': 'https://docs.python.org/3.9/using/index.html',
'content_type': 'text/html',
'title': 'Python Setup and Usage — Python 3.9.19 documentation',
'language': None}

That looks like a child page of our root page, which is great! Let's move on from metadata and examine the content of one of the documents:

print(docs[0].page_content[:300])

<!DOCTYPE html>

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8" /><title>3.9.19 Documentation</title><meta name="viewport" content="width=device-width, initial-scale=1.0">

<link rel="stylesheet" href="_static/pydoctheme.css" type="text/css" />
<link rel=

That certainly looks like HTML from https://docs.python.org/3.9/, which is what we expected. Let's now look at some variations on our basic example that can be helpful in different situations.

Lazy loading

If we're loading a large number of Documents, and our downstream operations can be done over a subset of all loaded Documents, we can lazily load our Documents one at a time to minimize our memory footprint:

pages = []
for doc in loader.lazy_load():
    pages.append(doc)
    if len(pages) >= 10:
        # do some paged operation, e.g.
        # index.upsert(pages)

        pages = []
/var/folders/4j/2rz3865x6qg07tx43146py8h0000gn/T/ipykernel_73962/2110507528.py:6: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
soup = BeautifulSoup(html, "lxml")

In this example we never have more than 10 Documents loaded into memory at a time.
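The batch-then-flush pattern above can be factored into a small, reusable helper. This is just a sketch in plain Python; the `batched` helper is our own, not part of the loader's API, but it works with any iterable, including `loader.lazy_load()`:

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of up to `size` items from any iterable."""
    it = iter(items)
    # list(islice(...)) returns [] when the iterable is exhausted, ending the loop.
    while batch := list(islice(it, size)):
        yield batch


# Usage with the loader would look like:
# for pages in batched(loader.lazy_load(), 10):
#     index.upsert(pages)  # hypothetical downstream operation
```

This keeps the paging logic in one place, so the loop body only has to deal with a full batch at a time.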

Adding an extractor

By default the loader sets the raw HTML from each link as the Document page content. To parse this HTML into a more human/LLM-friendly format you can pass in a custom extractor method:

import re

from bs4 import BeautifulSoup


def bs4_extractor(html: str) -> str:
    soup = BeautifulSoup(html, "lxml")
    return re.sub(r"\n\n+", "\n\n", soup.text).strip()


loader = RecursiveUrlLoader("https://docs.python.org/3.9/", extractor=bs4_extractor)
docs = loader.load()
print(docs[0].page_content[:200])
/var/folders/td/vzm913rx77x21csd90g63_7c0000gn/T/ipykernel_10935/1083427287.py:6: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
soup = BeautifulSoup(html, "lxml")
/Users/isaachershenson/.pyenv/versions/3.11.9/lib/python3.11/html/parser.py:170: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
k = self.parse_starttag(i)
3.9.19 Documentation

Download
Download these documents
Docs by version

Python 3.13 (in development)
Python 3.12 (stable)
Python 3.11 (security-fixes)
Python 3.10 (security-fixes)
Python 3.9 (securit

This looks much nicer!

You can similarly pass in a metadata_extractor to customize how Document metadata is extracted from the HTTP response. See the API reference for more on this.
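As an illustration, here's a minimal sketch of what a custom metadata extractor might look like. The three-argument signature (raw HTML, URL, and the HTTP response) is an assumption based on recent langchain-community versions, so check the API reference for the exact signature your version expects; the extraction itself uses only the standard library:

```python
import re
from typing import Any, Dict


def simple_metadata_extractor(
    raw_html: str, url: str, response: Any = None
) -> Dict[str, Any]:
    """Pull the <title> out of the raw HTML and record the source URL.

    `response` is the HTTP response object the loader passes in; we only
    read its headers when one is provided.
    """
    match = re.search(r"<title[^>]*>(.*?)</title>", raw_html, re.IGNORECASE | re.DOTALL)
    metadata: Dict[str, Any] = {
        "source": url,
        "title": match.group(1).strip() if match else None,
    }
    if response is not None:
        metadata["content_type"] = response.headers.get("Content-Type")
    return metadata


# Passing it in would look like:
# loader = RecursiveUrlLoader(
#     "https://docs.python.org/3.9/",
#     metadata_extractor=simple_metadata_extractor,
# )
```

Whatever dict the extractor returns becomes each Document's metadata, so this is the place to add fields your downstream indexing relies on.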

API reference

These examples show just a few of the ways in which you can modify the default RecursiveUrlLoader, but there are many more modifications that can be made to best fit your use case. Using the parameters link_regex and exclude_dirs can help you filter out unwanted URLs, while aload() and alazy_load() can be used for asynchronous loading, and more.
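To get a feel for what this kind of URL filtering does, here is a pure-Python sketch of matching candidate links against an allow pattern and an excluded-prefix list. The loader applies its own filtering internally; the pattern, prefix, and `keep_links` helper below are made up for illustration only:

```python
import re
from typing import Iterable, List

# Hypothetical filters: only follow .html pages, and skip anything under /whatsnew/.
link_pattern = re.compile(r".*\.html$")
excluded_prefixes = ("https://docs.python.org/3.9/whatsnew/",)


def keep_links(urls: Iterable[str]) -> List[str]:
    """Keep URLs that match the allow pattern and don't sit under an excluded prefix."""
    return [
        url
        for url in urls
        if link_pattern.match(url) and not url.startswith(excluded_prefixes)
    ]


candidates = [
    "https://docs.python.org/3.9/using/index.html",
    "https://docs.python.org/3.9/whatsnew/3.9.html",
    "https://docs.python.org/3.9/archives/python-3.9.19-docs.pdf",
]
print(keep_links(candidates))  # only the first URL survives both filters
```

Testing your pattern against a handful of known URLs like this before passing it to the loader is a cheap way to avoid crawling (or skipping) far more pages than intended.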

For detailed information on all RecursiveUrlLoader configuration and invocation options, see the API reference: https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.recursive_url_loader.RecursiveUrlLoader.html