WebBaseLoader
这涵盖了如何使用WebBaseLoader从HTML网页中加载所有文本并转换为我们可以用于下游处理的文档格式。如果您需要更自定义的逻辑来加载网页,请查看一些子类示例,如IMSDbLoader、AZLyricsLoader和CollegeConfidentialLoader。
如果您不想担心网站抓取、绕过JS阻塞站点以及数据清理,可以考虑使用FireCrawlLoader或更快的选项SpiderLoader。
概览
集成细节
- TODO: 填写表格功能。
- TODO: 移除如果无关的JS支持链接,否则确保链接正确。
- TODO: 确保API参考链接正确。
| Class | 包 | 本地 | 序列化 | JS支持 |
|---|---|---|---|---|
| WebBaseLoader | langchain_community | ✅ | ❌ | ❌ |
加载器功能
| 来源 | 文档延迟加载 | 原生异步支持 |
|---|---|---|
| WebBaseLoader | ✅ | ✅ |
设置
Credentials
WebBaseLoader 不需要任何凭证。
安装
要使用WebBaseLoader,您首先需要安装langchain-community的Python包。
%pip install -qU langchain_community beautifulsoup4
初始化
现在我们可以实例化我们的模型对象并加载文档:
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://www.example.com/")
为了在获取内容时跳过SSL验证错误,可以设置"verify"选项:
loader.requests_kwargs = {'verify':False}
使用多个页面初始化
您也可以传入一个页面列表来加载。
loader_multiple_pages = WebBaseLoader(
["https://www.example.com/", "https://google.com"]
)
加载
docs = loader.load()
docs[0]
Document(metadata={'source': 'https://www.example.com/', 'title': 'Example Domain', 'language': 'No language found.'}, page_content='\n\n\nExample Domain\n\n\n\n\n\n\n\nExample Domain\nThis domain is for use in illustrative examples in documents. You may use this\n domain in literature without prior coordination or asking for permission.\nMore information...\n\n\n\n')
print(docs[0].metadata)
{'source': 'https://www.example.com/', 'title': 'Example Domain', 'language': 'No language found.'}
加载多个URL并发
您可以同时爬取和解析多个URL以加快抓取过程。
存在合理的并发请求限制,默认每秒2个。如果你不介意做一个好市民,或者你控制着你要爬取的服务器且不在乎负载,你可以改变requests_per_second参数来增加最大并发请求数。请注意,虽然这会加快抓取过程,但可能会导致服务器将你屏蔽。务必小心!
%pip install -qU nest_asyncio
# fixes a bug with asyncio and jupyter
import nest_asyncio
nest_asyncio.apply()
Note: you may need to restart the kernel to use updated packages.
loader = WebBaseLoader(["https://www.example.com/", "https://google.com"])
loader.requests_per_second = 1
docs = loader.aload()
docs
Fetching pages: 100%|###########################################################################| 2/2 [00:00<00:00, 8.28it/s]
[Document(metadata={'source': 'https://www.example.com/', 'title': 'Example Domain', 'language': 'No language found.'}, page_content='\n\n\nExample Domain\n\n\n\n\n\n\n\nExample Domain\nThis domain is for use in illustrative examples in documents. You may use this\n domain in literature without prior coordination or asking for permission.\nMore information...\n\n\n\n'),
Document(metadata={'source': 'https://google.com', 'title': 'Google', 'description': "Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for.", 'language': 'en'}, page_content='GoogleSearch Images Maps Play YouTube News Gmail Drive More »Web History | Settings | Sign in\xa0Advanced search5 ways Gemini can help during the HolidaysAdvertisingBusiness SolutionsAbout Google© 2024 - Privacy - Terms ')]
加载一个xml文件,或者使用不同的BeautifulSoup解析器
您可以查看SitemapLoader以了解如何加载网站地图文件的示例,这展示了此功能的一个用例。
loader = WebBaseLoader(
"https://www.govinfo.gov/content/pkg/CFR-2018-title10-vol3/xml/CFR-2018-title10-vol3-sec431-86.xml"
)
loader.default_parser = "xml"
docs = loader.load()
docs
[Document(metadata={'source': 'https://www.govinfo.gov/content/pkg/CFR-2018-title10-vol3/xml/CFR-2018-title10-vol3-sec431-86.xml'}, page_content='\n\n10\nEnergy\n3\n2018-01-01\n2018-01-01\nfalse\nUniform test method for the measurement of energy efficiency of commercial packaged boilers.\n§ 431.86\nSection § 431.86\n\nEnergy\nDEPARTMENT OF ENERGY\nENERGY CONSERVATION\nENERGY EFFICIENCY PROGRAM FOR CERTAIN COMMERCIAL AND INDUSTRIAL EQUIPMENT\nCommercial Packaged Boilers\nTest Procedures\n\n\n\n\n§\u2009431.86\nUniform test method for the measurement of energy efficiency of commercial packaged boilers.\n(a) Scope. This section provides test procedures, pursuant to the Energy Policy and Conservation Act (EPCA), as amended, which must be followed for measuring the combustion efficiency and/or thermal efficiency of a gas- or oil-fired commercial packaged boiler.\n(b) Testing and Calculations. Determine the thermal efficiency or combustion efficiency of commercial packaged boilers by conducting the appropriate test procedure(s) indicated in Table 1 of this section.\n\nTable 1—Test Requirements for Commercial Packaged Boiler Equipment Classes\n\nEquipment category\nSubcategory\nCertified rated inputBtu/h\n\nStandards efficiency metric(§\u2009431.87)\n\nTest procedure(corresponding to\nstandards efficiency\nmetric required\nby §\u2009431.87)\n\n\n\nHot Water\nGas-fired\n≥300,000 and ≤2,500,000\nThermal Efficiency\nAppendix A, Section 2.\n\n\nHot Water\nGas-fired\n>2,500,000\nCombustion Efficiency\nAppendix A, Section 3.\n\n\nHot Water\nOil-fired\n≥300,000 and ≤2,500,000\nThermal Efficiency\nAppendix A, Section 2.\n\n\nHot Water\nOil-fired\n>2,500,000\nCombustion Efficiency\nAppendix A, Section 3.\n\n\nSteam\nGas-fired (all*)\n≥300,000 and ≤2,500,000\nThermal Efficiency\nAppendix A, Section 2.\n\n\nSteam\nGas-fired (all*)\n>2,500,000 and ≤5,000,000\nThermal Efficiency\nAppendix A, Section 2.\n\n\n\u2003\n\n>5,000,000\nThermal Efficiency\nAppendix A, Section 2.OR\nAppendix A, Section 3 with Section 2.4.3.2.\n\n\n\nSteam\nOil-fired\n≥300,000 and ≤2,500,000\nThermal Efficiency\nAppendix A, Section 2.\n\n\nSteam\nOil-fired\n>2,500,000 and ≤5,000,000\nThermal Efficiency\nAppendix A, Section 2.\n\n\n\u2003\n\n>5,000,000\nThermal Efficiency\nAppendix A, Section 2.OR\nAppendix A, Section 3. with Section 2.4.3.2.\n\n\n\n*\u2009Equipment classes for commercial packaged boilers as of July 22, 2009 (74 FR 36355) distinguish between gas-fired natural draft and all other gas-fired (except natural draft).\n\n(c) Field Tests. The field test provisions of appendix A may be used only to test a unit of commercial packaged boiler with rated input greater than 5,000,000 Btu/h.\n[81 FR 89305, Dec. 9, 2016]\n\n\nEnergy Efficiency Standards\n\n')]
懒加载
您可以使用懒加载,每次只加载一个页面以最小化内存需求。
pages = []
for doc in loader.lazy_load():
pages.append(doc)
print(pages[0].page_content[:100])
print(pages[0].metadata)
10
Energy
3
2018-01-01
2018-01-01
false
Uniform test method for the measurement of energy efficien
{'source': 'https://www.govinfo.gov/content/pkg/CFR-2018-title10-vol3/xml/CFR-2018-title10-vol3-sec431-86.xml'}
Async
pages = []
async for doc in loader.alazy_load():
pages.append(doc)
print(pages[0].page_content[:100])
print(pages[0].metadata)
Fetching pages: 100%|###########################################################################| 1/1 [00:00<00:00, 10.51it/s]
``````output
10
Energy
3
2018-01-01
2018-01-01
false
Uniform test method for the measurement of energy efficien
{'source': 'https://www.govinfo.gov/content/pkg/CFR-2018-title10-vol3/xml/CFR-2018-title10-vol3-sec431-86.xml'}
使用代理
有时您可能需要使用代理来绕过IP封锁。您可以将代理字典传递给加载器(并在下面输入requests)以使用它们。
loader = WebBaseLoader(
"https://www.walmart.com/search?q=parrots",
proxies={
"http": "http://{username}:{password}:@proxy.service.com:6666/",
"https": "https://{username}:{password}:@proxy.service.com:6666/",
},
)
docs = loader.load()
API 参考
详细文档请参阅所有WebBaseLoader功能和配置: https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.web_base.WebBaseLoader.html