
Sitemap

An extension of WebBaseLoader, SitemapLoader loads a sitemap from a given URL, then scrapes and loads all the pages in the sitemap, returning each page as a Document.

The scraping is done concurrently. There are reasonable limits on concurrent requests, defaulting to 2 per second. If you aren't concerned about being a good citizen, control the server being scraped, or don't care about load, you can increase this limit. Note that while this will speed up the scraping process, it may cause the server to block you. Be careful!

Overview

Integration details

| Class | Package | Local | Serializable | JS support |
| :--- | :--- | :---: | :---: | :---: |
| SitemapLoader | langchain_community | ✅ | ❌ | ✅ |

Loader features

| Source | Document Lazy Loading | Native Async Support |
| :--- | :---: | :---: |
| SitemapLoader | ✅ | ✅ |

Setup

To access the SiteMap document loader you'll need to install the langchain-community integration package.

Credentials

No credentials are needed to run this.

To enable automated tracing of your model calls, set your LangSmith API key:

# import getpass
# import os

# os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
# os.environ["LANGSMITH_TRACING"] = "true"

Installation

Install langchain_community.

%pip install -qU langchain-community

Fix notebook asyncio bug:

import nest_asyncio

nest_asyncio.apply()

Initialization

Now we can instantiate our loader object and load documents:

from langchain_community.document_loaders.sitemap import SitemapLoader
API Reference: SitemapLoader
sitemap_loader = SitemapLoader(web_path="https://api.python.langchain.com/sitemap.xml")

Load

docs = sitemap_loader.load()
docs[0]
Fetching pages: 100%|##########| 28/28 [00:04<00:00,  6.83it/s]
Document(metadata={'source': 'https://api.python.langchain.com/en/stable/', 'loc': 'https://api.python.langchain.com/en/stable/', 'lastmod': '2024-05-15T00:29:42.163001+00:00', 'changefreq': 'weekly', 'priority': '1'}, page_content='\n\n\n\n\n\n\n\n\n\nLangChain Python API Reference Documentation.\n\n\nYou will be automatically redirected to the new location of this page.\n\n')
print(docs[0].metadata)
{'source': 'https://api.python.langchain.com/en/stable/', 'loc': 'https://api.python.langchain.com/en/stable/', 'lastmod': '2024-05-15T00:29:42.163001+00:00', 'changefreq': 'weekly', 'priority': '1'}

You can change the requests_per_second parameter to increase the maximum number of concurrent requests, and use requests_kwargs to pass kwargs when sending requests.

sitemap_loader.requests_per_second = 2
# Optional: avoid `[SSL: CERTIFICATE_VERIFY_FAILED]` issue
sitemap_loader.requests_kwargs = {"verify": False}

Lazy Load

You can also load the pages lazily in order to minimize the memory load.

page = []
for doc in sitemap_loader.lazy_load():
    page.append(doc)
    if len(page) >= 10:
        # do some paged operation, e.g.
        # index.upsert(page)

        page = []
Fetching pages: 100%|##########| 28/28 [00:01<00:00, 19.06it/s]
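The batching pattern above can be checked without any network access by swapping in a plain generator for lazy_load (the generator and its 28 placeholder documents below are purely illustrative):

```python
def fake_lazy_load():
    # Stand-in for sitemap_loader.lazy_load(); yields 28 placeholder docs
    for i in range(28):
        yield f"doc-{i}"


batches = []
page = []
for doc in fake_lazy_load():
    page.append(doc)
    if len(page) >= 10:
        batches.append(page)  # stand-in for a paged operation, e.g. index.upsert(page)
        page = []
if page:
    batches.append(page)  # flush the final partial batch

print([len(b) for b in batches])  # [10, 10, 8]
```

Note the flush after the loop: without it, the last partial batch (8 documents here) would be silently dropped.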

Filtering sitemap URLs

Sitemaps can be massive files, with thousands of URLs. Often you don't need every single one of them. You can filter the URLs by passing a list of strings or regex patterns to the filter_urls parameter. Only URLs that match one of the patterns will be loaded.

loader = SitemapLoader(
    web_path="https://api.python.langchain.com/sitemap.xml",
    filter_urls=["https://api.python.langchain.com/en/latest"],
)
documents = loader.load()
documents[0]
Document(page_content='\n\n\n\n\n\n\n\n\n\nLangChain Python API Reference Documentation.\n\n\nYou will be automatically redirected to the new location of this page.\n\n', metadata={'source': 'https://api.python.langchain.com/en/latest/', 'loc': 'https://api.python.langchain.com/en/latest/', 'lastmod': '2024-02-12T05:26:10.971077+00:00', 'changefreq': 'daily', 'priority': '0.9'})
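To preview which URLs a given pattern would admit without fetching anything, you can apply the same kind of matching by hand. This is a standalone sketch, not the loader's internals: the URL list is hypothetical, and it assumes filter_urls entries behave as regular expressions matched against each sitemap loc:

```python
import re

# Hypothetical sitemap entries for illustration
urls = [
    "https://api.python.langchain.com/en/latest/",
    "https://api.python.langchain.com/en/stable/",
]
patterns = ["https://api.python.langchain.com/en/latest"]

# Keep only URLs matching at least one pattern
filtered = [u for u in urls if any(re.search(p, u) for p in patterns)]
print(filtered)  # only the /en/latest/ URL survives
```

Remember that the patterns are regexes, so characters like `.` match any character; anchor or escape them if you need exact-prefix behavior.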

Add custom scraping rules

The SitemapLoader uses beautifulsoup4 for the scraping process, and it scrapes every element on the page by default. The SitemapLoader constructor accepts a custom scraping function. This feature can be helpful to tailor the scraping process to your specific needs; for example, you might want to avoid scraping headers or navigation elements.

The following example shows how to develop and use a custom function to avoid navigation and header elements.

Import the beautifulsoup4 library and define the custom function.

%pip install -qU beautifulsoup4
from bs4 import BeautifulSoup


def remove_nav_and_header_elements(content: BeautifulSoup) -> str:
    # Find all 'nav' and 'header' elements in the BeautifulSoup object
    nav_elements = content.find_all("nav")
    header_elements = content.find_all("header")

    # Remove each 'nav' and 'header' element from the BeautifulSoup object
    for element in nav_elements + header_elements:
        element.decompose()

    return str(content.get_text())
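You can check what this kind of function does before wiring it into the loader by running the same stripping logic on a small, hypothetical HTML snippet (the page content below is made up for illustration):

```python
from bs4 import BeautifulSoup

# A tiny hypothetical page with nav/header chrome around the real content
html = (
    "<html><body>"
    "<nav>Site navigation</nav>"
    "<header>Page header</header>"
    "<p>Actual content worth keeping.</p>"
    "</body></html>"
)

soup = BeautifulSoup(html, "html.parser")

# Same logic as remove_nav_and_header_elements above
for element in soup.find_all("nav") + soup.find_all("header"):
    element.decompose()

text = str(soup.get_text())
print(text)  # Actual content worth keeping.
```

decompose() removes each element from the tree in place, so the subsequent get_text() only sees what remains.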

Add your custom function to the SitemapLoader object.

loader = SitemapLoader(
    "https://api.python.langchain.com/sitemap.xml",
    filter_urls=["https://api.python.langchain.com/en/latest/"],
    parsing_function=remove_nav_and_header_elements,
)

Local Sitemap

The sitemap loader can also be used to load local files.

sitemap_loader = SitemapLoader(web_path="example_data/sitemap.xml", is_local=True)

docs = sitemap_loader.load()
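If you're building a local sitemap file to load, it helps to know the shape of the entries the loader consumes: each url element's children (loc, lastmod, changefreq, priority) end up as document metadata, as seen in the outputs above. A stdlib sketch of parsing one such entry (the sitemap content below is hypothetical, and this is not the loader's own parsing code):

```python
import xml.etree.ElementTree as ET

# A minimal, hypothetical sitemap document
sitemap_xml = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url>"
    "<loc>https://example.com/page</loc>"
    "<lastmod>2024-05-15</lastmod>"
    "<changefreq>weekly</changefreq>"
    "<priority>1</priority>"
    "</url>"
    "</urlset>"
)

# Sitemap files live in the sitemaps.org namespace
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap_xml)

# Collect each <url> entry's children as a flat dict, stripping the namespace
entries = [
    {child.tag.split("}")[-1]: child.text for child in url}
    for url in root.findall("sm:url", ns)
]
print(entries[0]["loc"])  # https://example.com/page
```

Note the namespace declaration on urlset: a local sitemap without it may not be recognized as a valid sitemap.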

API reference

For detailed documentation of all SitemapLoader features and configurations, head to the API reference: https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.sitemap.SitemapLoader.html#langchain_community.document_loaders.sitemap.SitemapLoader