WebBaseLoader
这涵盖了如何使用WebBaseLoader加载所有文本HTML网页转换为我们可以在下游使用的文档格式。有关加载网页的更多自定义逻辑,请查看一些子类示例,例如IMSDbLoader,AZLyricsLoader和CollegeConfidentialLoader.
如果您不想担心网站爬取、绕过 JS 阻止站点和数据清理,请考虑使用FireCrawlLoader或更快的选项SpiderLoader.
概述
集成详细信息
- TODO:填写表格功能。
- TODO: 如果不相关,请删除 JS 支持链接,否则请确保链接正确。
- TODO:确保 API 参考链接正确无误。
| 类 | 包 | 本地化 | 序列 化 | JS 支持 |
|---|---|---|---|---|
| WebBaseLoader | langchain_community | ✅ | ❌ | ❌ |
Loader 功能
| 源 | 文档延迟加载 | 原生异步支持 |
|---|---|---|
| WebBaseLoader | ✅ | ✅ |
设置
凭据
WebBaseLoader不需要任何凭证。
安装
要使用WebBaseLoader您首先需要安装langchain-communitypython 软件包。
%pip install -qU langchain_community beautifulsoup4
初始化
现在我们可以实例化我们的模型对象并加载文档:
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://www.example.com/")
要在获取过程中绕过 SSL 验证错误,您可以设置 “verify” 选项:
loader.requests_kwargs = {'verify':False}
使用多个页面进行初始化
您还可以传入要从中加载的页面列表。
loader_multiple_pages = WebBaseLoader(
["https://www.example.com/", "https://google.com"]
)
负荷
docs = loader.load()
docs[0]
Document(metadata={'source': 'https://www.example.com/', 'title': 'Example Domain', 'language': 'No language found.'}, page_content='\n\n\nExample Domain\n\n\n\n\n\n\n\nExample Domain\nThis domain is for use in illustrative examples in documents. You may use this\n domain in literature without prior coordination or asking for permission.\nMore information...\n\n\n\n')
print(docs[0].metadata)
{'source': 'https://www.example.com/', 'title': 'Example Domain', 'language': 'No language found.'}
同时加载多个 URL
您可以通过同时抓取和解析多个 url 来加快抓取过程。
并发请求有合理的限制,默认为每秒 2 个。如果你不关心做一个好公民,或者你控制着你正在抓取的服务器并且不关心负载,你可以更改requests_per_second参数增加最大并发请求数。请注意,虽然这会加快抓取过程,但可能会导致服务器阻止您。小心!
%pip install -qU nest_asyncio
# fixes a bug with asyncio and jupyter
import nest_asyncio
nest_asyncio.apply()
Note: you may need to restart the kernel to use updated packages.
loader = WebBaseLoader(["https://www.example.com/", "https://google.com"])
loader.requests_per_second = 1
docs = loader.aload()
docs
Fetching pages: 100%|###########################################################################| 2/2 [00:00<00:00, 8.28it/s]
[Document(metadata={'source': 'https://www.example.com/', 'title': 'Example Domain', 'language': 'No language found.'}, page_content='\n\n\nExample Domain\n\n\n\n\n\n\n\nExample Domain\nThis domain is for use in illustrative examples in documents. You may use this\n domain in literature without prior coordination or asking for permission.\nMore information...\n\n\n\n'),
Document(metadata={'source': 'https://google.com', 'title': 'Google', 'description': "Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for.", 'language': 'en'}, page_content='GoogleSearch Images Maps Play YouTube News Gmail Drive More »Web History | Settings | Sign in\xa0Advanced search5 ways Gemini can help during the HolidaysAdvertisingBusiness SolutionsAbout Google© 2024 - Privacy - Terms ')]
加载 xml 文件,或使用其他 BeautifulSoup 解析器
您还可以查看SitemapLoader有关如何加载 Sitemap 文件的示例,这是使用此功能的示例。
loader = WebBaseLoader(
"https://www.govinfo.gov/content/pkg/CFR-2018-title10-vol3/xml/CFR-2018-title10-vol3-sec431-86.xml"
)
loader.default_parser = "xml"
docs = loader.load()
docs
[Document(metadata={'source': 'https://www.govinfo.gov/content/pkg/CFR-2018-title10-vol3/xml/CFR-2018-title10-vol3-sec431-86.xml'}, page_content='\n\n10\nEnergy\n3\n2018-01-01\n2018-01-01\nfalse\nUniform test method for the measurement of energy efficiency of commercial packaged boilers.\n§ 431.86\nSection § 431.86\n\nEnergy\nDEPARTMENT OF ENERGY\nENERGY CONSERVATION\nENERGY EFFICIENCY PROGRAM FOR CERTAIN COMMERCIAL AND INDUSTRIAL EQUIPMENT\nCommercial Packaged Boilers\nTest Procedures\n\n\n\n\n§\u2009431.86\nUniform test method for the measurement of energy efficiency of commercial packaged boilers.\n(a) Scope. This section provides test procedures, pursuant to the Energy Policy and Conservation Act (EPCA), as amended, which must be followed for measuring the combustion efficiency and/or thermal efficiency of a gas- or oil-fired commercial packaged boiler.\n(b) Testing and Calculations. Determine the thermal efficiency or combustion efficiency of commercial packaged boilers by conducting the appropriate test procedure(s) indicated in Table 1 of this section.\n\nTable 1—Test Requirements for Commercial Packaged Boiler Equipment Classes\n\nEquipment category\nSubcategory\nCertified rated inputBtu/h\n\nStandards efficiency metric(§\u2009431.87)\n\nTest procedure(corresponding to\nstandards efficiency\nmetric required\nby §\u2009431.87)\n\n\n\nHot Water\nGas-fired\n≥300,000 and ≤2,500,000\nThermal Efficiency\nAppendix A, Section 2.\n\n\nHot Water\nGas-fired\n>2,500,000\nCombustion Efficiency\nAppendix A, Section 3.\n\n\nHot Water\nOil-fired\n≥300,000 and ≤2,500,000\nThermal Efficiency\nAppendix A, Section 2.\n\n\nHot Water\nOil-fired\n>2,500,000\nCombustion Efficiency\nAppendix A, Section 3.\n\n\nSteam\nGas-fired (all*)\n≥300,000 and ≤2,500,000\nThermal Efficiency\nAppendix A, Section 2.\n\n\nSteam\nGas-fired (all*)\n>2,500,000 and ≤5,000,000\nThermal Efficiency\nAppendix A, Section 2.\n\n\n\u2003\n\n>5,000,000\nThermal Efficiency\nAppendix A, Section 2.OR\nAppendix A, Section 3 with Section 2.4.3.2.\n\n\n\nSteam\nOil-fired\n≥300,000 and ≤2,500,000\nThermal Efficiency\nAppendix A, Section 2.\n\n\nSteam\nOil-fired\n>2,500,000 and ≤5,000,000\nThermal Efficiency\nAppendix A, Section 2.\n\n\n\u2003\n\n>5,000,000\nThermal Efficiency\nAppendix A, Section 2.OR\nAppendix A, Section 3. with Section 2.4.3.2.\n\n\n\n*\u2009Equipment classes for commercial packaged boilers as of July 22, 2009 (74 FR 36355) distinguish between gas-fired natural draft and all other gas-fired (except natural draft).\n\n(c) Field Tests. The field test provisions of appendix A may be used only to test a unit of commercial packaged boiler with rated input greater than 5,000,000 Btu/h.\n[81 FR 89305, Dec. 9, 2016]\n\n\nEnergy Efficiency Standards\n\n')]
延迟加载
您可以使用延迟加载一次只加载一个页面,以最大程度地减少内存需求。
pages = []
for doc in loader.lazy_load():
pages.append(doc)
print(pages[0].page_content[:100])
print(pages[0].metadata)
10
Energy
3
2018-01-01
2018-01-01
false
Uniform test method for the measurement of energy efficien
{'source': 'https://www.govinfo.gov/content/pkg/CFR-2018-title10-vol3/xml/CFR-2018-title10-vol3-sec431-86.xml'}
异步
pages = []
async for doc in loader.alazy_load():
pages.append(doc)
print(pages[0].page_content[:100])
print(pages[0].metadata)
Fetching pages: 100%|###########################################################################| 1/1 [00:00<00:00, 10.51it/s]
``````output
10
Energy
3
2018-01-01
2018-01-01
false
Uniform test method for the measurement of energy efficien
{'source': 'https://www.govinfo.gov/content/pkg/CFR-2018-title10-vol3/xml/CFR-2018-title10-vol3-sec431-86.xml'}
使用代理
有时您可能需要使用代理来绕过 IP 块。你可以将代理字典传递给加载器(并且requests)来使用它们。
loader = WebBaseLoader(
"https://www.walmart.com/search?q=parrots",
proxies={
"http": "http://{username}:{password}:@proxy.service.com:6666/",
"https": "https://{username}:{password}:@proxy.service.com:6666/",
},
)
docs = loader.load()
API 参考
有关所有WebBaseLoader功能和配置可参考 API 参考:https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.web_base.WebBaseLoader.html