站点地图
从WebBaseLoader继承,SitemapLoader从给定的URL加载一个站点地图,然后抓取并加载站点中的所有页面,并将每个页面返回为一个Document。
The scraping is done concurrently. There are reasonable limits to concurrent requests, defaulting to 2 per second. If you aren't concerned about being a good citizen, or you control the scrapped server, or don't care about load you can increase this limit. Note, while this will speed up the scraping process, it may cause the server to block you. Be careful!
概览
集成细节
| Class | 包 | 本地 | 序列化 | JS支持 |
|---|---|---|---|---|
| SiteMapLoader | langchain_community | ✅ | ❌ | ✅ |
加载器功能
| 来源 | 文档延迟加载 | 原生异步支持 |
|---|---|---|
| SiteMapLoader | ✅ | ❌ |
设置
要访问SiteMap文档加载器,您需要安装langchain-community集成包。
Credentials
无需任何凭据即可运行此内容。
要启用对您的模型调用的自动跟踪,请设置您的LangSmith API密钥:
# os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
# os.environ["LANGSMITH_TRACING"] = "true"
安装
安装 langchain_community。
%pip install -qU langchain-community
修复笔记本异步循环错误
import nest_asyncio
nest_asyncio.apply()
初始化
现在我们可以实例化我们的模型对象并加载文档:
from langchain_community.document_loaders.sitemap import SitemapLoader
sitemap_loader = SitemapLoader(web_path="https://api.python.langchain.com/sitemap.xml")
加载
docs = sitemap_loader.load()
docs[0]
Fetching pages: 100%|##########| 28/28 [00:04<00:00, 6.83it/s]
Document(metadata={'source': 'https://api.python.langchain.com/en/stable/', 'loc': 'https://api.python.langchain.com/en/stable/', 'lastmod': '2024-05-15T00:29:42.163001+00:00', 'changefreq': 'weekly', 'priority': '1'}, page_content='\n\n\n\n\n\n\n\n\n\nLangChain Python API Reference Documentation.\n\n\nYou will be automatically redirected to the new location of this page.\n\n')
print(docs[0].metadata)
{'source': 'https://api.python.langchain.com/en/stable/', 'loc': 'https://api.python.langchain.com/en/stable/', 'lastmod': '2024-05-15T00:29:42.163001+00:00', 'changefreq': 'weekly', 'priority': '1'}
您可以将requests_per_second参数更改为增加最大并发请求数,并在发送请求时使用requests_kwargs传递kwargs。
sitemap_loader.requests_per_second = 2
# Optional: avoid `[SSL: CERTIFICATE_VERIFY_FAILED]` issue
sitemap_loader.requests_kwargs = {"verify": False}
懒加载
您也可以在需要时才加载页面,以最小化内存负载。
page = []
for doc in sitemap_loader.lazy_load():
page.append(doc)
if len(page) >= 10:
# do some paged operation, e.g.
# index.upsert(page)
page = []
Fetching pages: 100%|##########| 28/28 [00:01<00:00, 19.06it/s]
过滤站点地图URL
Sitemaps可以是巨大的文件,包含成千上万的URL。通常你不需要每一个。你可以通过向filter_urls参数传递一个字符串列表或正则表达式模式来进行过滤。只有匹配其中一个模式的URL才会被加载。
loader = SitemapLoader(
web_path="https://api.python.langchain.com/sitemap.xml",
filter_urls=["https://api.python.langchain.com/en/latest"],
)
documents = loader.load()
documents[0]
Document(page_content='\n\n\n\n\n\n\n\n\n\nLangChain Python API Reference Documentation.\n\n\nYou will be automatically redirected to the new location of this page.\n\n', metadata={'source': 'https://api.python.langchain.com/en/latest/', 'loc': 'https://api.python.langchain.com/en/latest/', 'lastmod': '2024-02-12T05:26:10.971077+00:00', 'changefreq': 'daily', 'priority': '0.9'})
添加自定义抓取规则
The SitemapLoader uses beautifulsoup4 for the scraping process, and it scrapes every element on the page by default. The SitemapLoader constructor accepts a custom scraping function. This feature can be helpful to tailor the scraping process to your specific needs; for example, you might want to avoid scraping headers or navigation elements.
The following example shows how to develop and use a custom function to avoid navigation and header elements.
导入beautifulsoup4库并定义自定义函数。
pip install beautifulsoup4
from bs4 import BeautifulSoup
def remove_nav_and_header_elements(content: BeautifulSoup) -> str:
# Find all 'nav' and 'header' elements in the BeautifulSoup object
nav_elements = content.find_all("nav")
header_elements = content.find_all("header")
# Remove each 'nav' and 'header' element from the BeautifulSoup object
for element in nav_elements + header_elements:
element.decompose()
return str(content.get_text())
请将自定义函数添加到 SitemapLoader 对象中。
loader = SitemapLoader(
"https://api.python.langchain.com/sitemap.xml",
filter_urls=["https://api.python.langchain.com/en/latest/"],
parsing_function=remove_nav_and_header_elements,
)
本地站点地图
The sitemap loader can also be used to load local files.
sitemap_loader = SitemapLoader(web_path="example_data/sitemap.xml", is_local=True)
docs = sitemap_loader.load()
API 参考
详细介绍了所有SiteMapLoader功能和配置的文档,请参阅API参考: https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.sitemap.SitemapLoader.html#langchain_community.document_loaders.sitemap.SitemapLoader