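Document loaders
A DocumentLoader loads data into the standard LangChain Document format.
Each DocumentLoader has its own specific parameters, but all of them can be invoked in the same way with the .load method. An example use case is as follows: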
from langchain_community.document_loaders.csv_loader import CSVLoader
loader = CSVLoader(
... # <-- Integration specific parameters here
)
data = loader.load()
API Reference: CSVLoader
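The shared interface described above can be sketched without installing anything. The block below is a minimal illustration, not LangChain's implementation: `Document` here is a stand-in dataclass with the `page_content` and `metadata` fields, and `ToyCSVLoader`, `ToyTextLoader`, `load_all`, and `demo` are hypothetical names invented for this example. The "column: value" row formatting is also an assumption made for the sketch.

```python
import csv
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Document:
    """Stand-in for a LangChain Document: text plus free-form metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class ToyCSVLoader:
    """Hypothetical loader that emits one Document per CSV row."""

    def __init__(self, file_path: str) -> None:
        self.file_path = file_path

    def load(self) -> list[Document]:
        docs = []
        with open(self.file_path, newline="", encoding="utf-8") as f:
            for i, row in enumerate(csv.DictReader(f)):
                # Render each row as "column: value" lines (an assumption).
                text = "\n".join(f"{key}: {value}" for key, value in row.items())
                docs.append(Document(text, {"source": self.file_path, "row": i}))
        return docs


class ToyTextLoader:
    """Hypothetical loader that emits the whole file as one Document."""

    def __init__(self, file_path: str) -> None:
        self.file_path = file_path

    def load(self) -> list[Document]:
        text = Path(self.file_path).read_text(encoding="utf-8")
        return [Document(text, {"source": self.file_path})]


def load_all(loaders) -> list[Document]:
    """Call .load() on every loader; constructor parameters differ, the call does not."""
    docs = []
    for loader in loaders:
        docs.extend(loader.load())
    return docs


def demo() -> list[Document]:
    """Write two small sample files and load them through the common interface."""
    with tempfile.TemporaryDirectory() as tmp:
        csv_path = Path(tmp) / "people.csv"
        csv_path.write_text("name,team\nAda,Analytics\nLinus,Kernel\n", encoding="utf-8")
        txt_path = Path(tmp) / "notes.txt"
        txt_path.write_text("hello", encoding="utf-8")
        return load_all([ToyCSVLoader(str(csv_path)), ToyTextLoader(str(txt_path))])
```

Calling `demo()` returns a flat list of `Document` objects from both loaders, which is why downstream code can usually stay unaware of which loader produced a given document.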
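Webpages
The following document loaders allow you to load web pages.
For a starting point, see this guide: How to: load web pages.
| Document Loader | Description | Package/API |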
|---|---|---|
| Web | Uses urllib and BeautifulSoup to load and parse HTML web pages | Package |
| Unstructured | Uses Unstructured to load and parse web pages | Package |
| RecursiveURL | Recursively scrapes all child links from a root URL | Package |
| Sitemap | Scrapes all pages on a given sitemap | Package |
| Firecrawl | API service that can be deployed locally; the hosted version has free credits | API |
| Docling | Uses Docling to load and parse web pages | Package |
| Hyperbrowser | Platform for running and scaling headless browsers, can be used to scrape/crawl any site | API |
| AgentQL | Web interaction and structured data extraction from any web page using an AgentQL query or a Natural Language prompt | API |
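PDFs
The following document loaders allow you to load PDF documents.
For a starting point, see this guide: How to: load PDF files.
| Document Loader | Description | Package/API |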
|---|---|---|
| PyPDF | Uses `pypdf` to load and parse PDFs | Package |
| Unstructured | Uses Unstructured's open source library to load PDFs | Package |
| Amazon Textract | Uses AWS API to load PDFs | API |
| MathPix | Uses MathPix to load PDFs | Package |
| PDFPlumber | Load PDF files using PDFPlumber | Package |
| PyPDFDirectory | Load a directory with PDF files | Package |
| PyPDFium2 | Load PDF files using PyPDFium2 | Package |
| PyMuPDF | Load PDF files using PyMuPDF | Package |
| PyMuPDF4LLM | Load PDF content to Markdown using PyMuPDF4LLM | Package |
| PDFMiner | Load PDF files using PDFMiner | Package |
| Upstage Document Parse Loader | Load PDF files using UpstageDocumentParseLoader | Package |
| Docling | Load PDF files using Docling | Package |
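Cloud Providers
The following document loaders allow you to load documents from your favorite cloud providers.
| Document Loader | Description | Partner Package | API reference |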
|---|---|---|---|
| AWS S3 Directory | Load documents from an AWS S3 directory | ❌ | S3DirectoryLoader |
| AWS S3 File | Load documents from an AWS S3 file | ❌ | S3FileLoader |
| Azure AI Data | Load documents from Azure AI services | ❌ | AzureAIDataLoader |
| Azure Blob Storage Container | Load documents from an Azure Blob Storage container | ❌ | AzureBlobStorageContainerLoader |
| Azure Blob Storage File | Load documents from an Azure Blob Storage file | ❌ | AzureBlobStorageFileLoader |
| Dropbox | Load documents from Dropbox | ❌ | DropboxLoader |
| Google Cloud Storage Directory | Load documents from GCS bucket | ✅ | GCSDirectoryLoader |
| Google Cloud Storage File | Load documents from GCS file object | ✅ | GCSFileLoader |
| Google Drive | Load documents from Google Drive (Google Docs only) | ✅ | GoogleDriveLoader |
| Huawei OBS Directory | Load documents from Huawei Object Storage Service Directory | ❌ | OBSDirectoryLoader |
| Huawei OBS File | Load documents from Huawei Object Storage Service File | ❌ | OBSFileLoader |
| Microsoft OneDrive | Load documents from Microsoft OneDrive | ❌ | OneDriveLoader |
| Microsoft SharePoint | Load documents from Microsoft SharePoint | ❌ | SharePointLoader |
| Tencent COS Directory | Load documents from Tencent Cloud Object Storage Directory | ❌ | TencentCOSDirectoryLoader |
| Tencent COS File | Load documents from Tencent Cloud Object Storage File | ❌ | TencentCOSFileLoader |
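Social Platforms
The following document loaders allow you to load documents from different social media platforms.
| Document Loader | API reference |
|---|---|
| Twitter | TwitterTweetLoader |
| Reddit | RedditPostsLoader |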
Messaging Services
The following document loaders allow you to load data from different messaging platforms.
| Document Loader | API reference |
|---|---|
| Telegram | TelegramChatFileLoader |
| WhatsApp | WhatsAppChatLoader |
| Discord | DiscordChatLoader |
| Facebook Chat | FacebookChatLoader |
| Mastodon | MastodonTootsLoader |
Productivity tools
The following document loaders allow you to load data from commonly used productivity tools.
| Document Loader | API reference |
|---|---|
| Figma | FigmaFileLoader |
| Notion | NotionDirectoryLoader |
| Slack | SlackDirectoryLoader |
| Quip | QuipLoader |
| Trello | TrelloLoader |
| Roam | RoamLoader |
| GitHub | GithubFileLoader |
Common File Types
The following document loaders allow you to load data from common data formats.
| Document Loader | Data Type |
|---|---|
| CSVLoader | CSV files |
| DirectoryLoader | All files in a given directory |
| Unstructured | Many file types (see https://docs.unstructured.io/platform/supported-file-types) |
| JSONLoader | JSON files |
| BSHTMLLoader | HTML files |
| DoclingLoader | Various file types (see https://ds4sd.github.io/docling/) |
All document loaders
| Name | Description |
|---|---|
| acreom | acreom is a dev-first knowledge base with tasks running on local mark... |
| AgentQLLoader | AgentQL's document loader provides structured data extraction from an... |
| AirbyteLoader | Airbyte is a data integration platform for ELT pipelines from APIs, d... |
| Airtable | * Get your API key here. |
| Alibaba Cloud MaxCompute | Alibaba Cloud MaxCompute (previously known as ODPS) is a general purp... |
| Amazon Textract | Amazon Textract is a machine learning (ML) service that automatically... |
| Apify Dataset | Apify Dataset is a scalable append-only storage with sequential acces... |
| ArcGIS | This notebook demonstrates the use of the langchain_community.document... |
| ArxivLoader | arXiv is an open-access archive for 2 million scholarly articles in t... |
| AssemblyAI Audio Transcripts | The AssemblyAIAudioTranscriptLoader allows to transcribe audio files ... |
| AstraDB | DataStax Astra DB is a serverless |
| Async Chromium | Chromium is one of the browsers supported by Playwright, a library us... |
| AsyncHtml | AsyncHtmlLoader loads raw HTML from a list of URLs concurrently. |
| Athena | Amazon Athena is a serverless, interactive analytics service built |
| AWS S3 Directory | Amazon Simple Storage Service (Amazon S3) is an object storage service |
| AWS S3 File | Amazon Simple Storage Service (Amazon S3) is an object storage servic... |
| AZLyrics | AZLyrics is a large, legal, every day growing collection of lyrics. |
| Azure AI Data | Azure AI Studio provides the capability to upload data assets to clou... |
| Azure Blob Storage Container | Azure Blob Storage is Microsoft's object storage solution for the clo... |
| Azure Blob Storage File | Azure Files offers fully managed file shares in the cloud that are ac... |
| Azure AI Document Intelligence | Azure AI Document Intelligence (formerly known as Azure Form Recogniz... |
| BibTeX | BibTeX is a file format and reference management system commonly used... |
| BiliBili | Bilibili is one of the most beloved long-form video sites in China. |
| Blackboard | Blackboard Learn (previously the Blackboard Learning Management Syste... |
| Blockchain | Overview |
| Box | The langchain-box package provides two methods to index your files fr... |
| Brave Search | Brave Search is a search engine developed by Brave Software. |
| Browserbase | Browserbase is a developer platform to reliably run, manage, and moni... |
| Browserless | Browserless is a service that allows you to run headless Chrome insta... |
| BSHTMLLoader | This notebook provides a quick overview for getting started with Beau... |
| Cassandra | Cassandra is a NoSQL, row-oriented, highly scalable and highly availa... |
| ChatGPT Data | ChatGPT is an artificial intelligence (AI) chatbot developed by OpenA... |
| College Confidential | College Confidential gives information on 3,800+ colleges and univers... |
| Concurrent Loader | Works just like the GenericLoader but concurrently for those who choo... |
| Confluence | Confluence is a wiki collaboration platform that saves and organizes ... |
| CoNLL-U | CoNLL-U is a revised version of the CoNLL-X format. Annotations are enc... |
| Copy Paste | This notebook covers how to load a document object from something you... |
| Couchbase | Couchbase is an award-winning distributed NoSQL cloud database that d... |
| CSV | A comma-separated values (CSV) file is a delimited text file that use... |
| Cube Semantic Layer | This notebook demonstrates the process of retrieving Cube's data mode... |
| Datadog Logs | Datadog is a monitoring and analytics platform for cloud-scale applic... |
| Dedoc | This sample demonstrates the use of Dedoc in combination with LangCha... |
| Diffbot | Diffbot is a suite of ML-based products that make it easy to structur... |
| Discord | Discord is a VoIP and instant messaging social platform. Users have t... |
| Docling | Docling parses PDF, DOCX, PPTX, HTML, and other formats into a rich u... |
| Docugami | This notebook covers how to load documents from Docugami. It provides... |
| Docusaurus | Docusaurus is a static-site generator which provides out-of-the-box d... |
| Dropbox | Dropbox is a file hosting service that brings everything-traditional ... |
| DuckDB | DuckDB is an in-process SQL OLAP database management system. |
| Email | This notebook shows how to load email (.eml) or Microsoft Outlook (.m... |
| EPub | EPUB is an e-book file format that uses the ".epub" file extension. T... |
| Etherscan | Etherscan is the leading blockchain explorer, search, API and analyt... |
| EverNote | EverNote is intended for archiving and creating notes in which photos... |
| Facebook Chat | Messenger is an American proprietary instant messaging app and platf... |
| Fauna | Fauna is a Document Database. |
| Figma | Figma is a collaborative web application for interface design. |
| FireCrawl | FireCrawl crawls and converts any website into LLM-ready data. It craw... |
| Geopandas | Geopandas is an open-source project to make working with geospatial d... |
| Git | Git is a distributed version control system that tracks changes in an... |
| GitBook | GitBook is a modern documentation platform where teams can document e... |
| GitHub | This notebook shows how you can load issues and pull requests (PRs) ... |
| Glue Catalog | The AWS Glue Data Catalog is a centralized metadata repository that a... |
| Google AlloyDB for PostgreSQL | AlloyDB is a fully managed relational database service that offers hi... |
| Google BigQuery | Google BigQuery is a serverless and cost-effective enterprise data wa... |
| Google Bigtable | Bigtable is a key-value and wide-column store, ideal for fast access ... |
| Google Cloud SQL for SQL server | Cloud SQL is a fully managed relational database service that offers ... |
| Google Cloud SQL for MySQL | Cloud SQL is a fully managed relational database service that offers ... |
| Google Cloud SQL for PostgreSQL | Cloud SQL for PostgreSQL is a fully-managed database service that hel... |
| Google Cloud Storage Directory | Google Cloud Storage is a managed service for storing unstructured da... |
| Google Cloud Storage File | Google Cloud Storage is a managed service for storing unstructured da... |
| Google Firestore in Datastore Mode | Firestore in Datastore Mode is a NoSQL document database built for au... |
| Google Drive | Google Drive is a file storage and synchronization service developed ... |
| Google El Carro for Oracle Workloads | Google El Carro Oracle Operator |
| Google Firestore (Native Mode) | Firestore is a serverless document-oriented database that scales to m... |
| Google Memorystore for Redis | Google Memorystore for Redis is a fully-managed service that is power... |
| Google Spanner | Spanner is a highly scalable database that combines unlimited scalabi... |
| Google Speech-to-Text Audio Transcripts | The SpeechToTextLoader allows to transcribe audio files with the Goog... |
| Grobid | GROBID is a machine learning library for extracting, parsing, and re-... |
| Gutenberg | Project Gutenberg is an online library of free eBooks. |
| Hacker News | Hacker News (sometimes abbreviated as HN) is a social news website fo... |
| Huawei OBS Directory | The following code demonstrates how to load objects from the Huawei O... |
| Huawei OBS File | The following code demonstrates how to load an object from the Huawei... |
| HuggingFace dataset | The Hugging Face Hub is home to over 5,000 datasets in more than 100 ... |
| HyperbrowserLoader | Hyperbrowser is a platform for running and scaling headless browsers.... |
| iFixit | iFixit is the largest, open repair community on the web. The site con... |
| Images | This covers how to load images into a document format that we can use... |
| Image captions | By default, the loader utilizes the pre-trained Salesforce BLIP image... |
| IMSDb | IMSDb is the Internet Movie Script Database. |
| Iugu | Iugu is a Brazilian services and software as a service (SaaS) company... |
| Joplin | Joplin is an open-source note-taking app. Capture your thoughts and s... |
| JSONLoader | This notebook provides a quick overview for getting started with JSON... |
| Jupyter Notebook | Jupyter Notebook (formerly IPython Notebook) is a web-based interacti... |
| Kinetica | This notebook goes over how to load documents from Kinetica |
| lakeFS | lakeFS provides scalable version control over the data lake, and uses... |
| LangSmith | This notebook provides a quick overview for getting started with the ... |
| LarkSuite (FeiShu) | LarkSuite is an enterprise collaboration platform developed by ByteDa... |
| LLM Sherpa | This notebook covers how to use LLM Sherpa to load files of many type... |
| Mastodon | Mastodon is a federated social media and social networking service. |
| MathPixPDFLoader | Inspired by Daniel Gross's snippet here: https://gist.github.com/danielgross/... |
| MediaWiki Dump | MediaWiki XML Dumps contain the content of a wiki (wiki pages with al... |
| Merge Documents Loader | Merge the documents returned from a set of specified data loaders. |
| mhtml | MHTML is used both for emails and for archived webpages. MH... |
| Microsoft Excel | The UnstructuredExcelLoader is used to load Microsoft Excel files. Th... |
| Microsoft OneDrive | Microsoft OneDrive (formerly SkyDrive) is a file hosting service oper... |
| Microsoft OneNote | This notebook covers how to load documents from OneNote. |
| Microsoft PowerPoint | Microsoft PowerPoint is a presentation program by Microsoft. |
| Microsoft SharePoint | Microsoft SharePoint is a website-based collaboration system that use... |
| Microsoft Word | Microsoft Word is a word processor developed by Microsoft. |
| Near Blockchain | Overview |
| Modern Treasury | Modern Treasury simplifies complex payment operations. It is a unifie... |
| MongoDB | MongoDB is a NoSQL, document-oriented database that supports JSON-li... |
| Needle Document Loader | Needle makes it easy to create your RAG pipelines with minimal effort. |
| News URL | This covers how to load HTML news articles from a list of URLs into a... |
| Notion DB 2/2 | Notion is a collaboration platform with modified Markdown support tha... |
| Nuclia | Nuclia automatically indexes your unstructured data from any internal... |
| Obsidian | Obsidian is a powerful and extensible knowledge base |
| Open Document Format (ODT) | The Open Document Format for Office Applications (ODF), also known as... |
| Open City Data | Socrata provides an API for city open data. |
| Oracle Autonomous Database | Oracle autonomous database is a cloud database that uses machine lear... |
| Oracle AI Vector Search: Document Processing | Oracle AI Vector Search is designed for Artificial Intelligence (AI) ... |
| Org-mode | An Org Mode document is a document editing, formatting, and organizing... |
| Pandas DataFrame | This notebook goes over how to load data from a pandas DataFrame. |
| PDFMinerLoader | This notebook provides a quick overview for getting started with PDFM... |
| PDFPlumber | Like PyMuPDF, the output Documents contain detailed metadata about th... |
| Pebblo Safe DocumentLoader | Pebblo enables developers to safely load data and promote their Gen A... |
| Polars DataFrame | This notebook goes over how to load data from a polars DataFrame. |
| Dell PowerScale Document Loader | Dell PowerScale is an enterprise scale out storage system that hosts ... |
| Psychic | This notebook covers how to load documents from Psychic. See here for... |
| PubMed | PubMed® by The National Center for Biotechnology Information, Nationa... |
| PullMdLoader | Loader for converting URLs into Markdown using the pull.md service. |
| PyMuPDFLoader | This notebook provides a quick overview for getting started with PyMu... |
| PyMuPDF4LLM | This notebook provides a quick overview for getting started with PyMu... |
| PyPDFDirectoryLoader | This loader loads all PDF files from a specific directory. |
| PyPDFium2Loader | This notebook provides a quick overview for getting started with PyPD... |
| PyPDFLoader | This notebook provides a quick overview for getting started with PyPD... |
| PySpark | This notebook goes over how to load data from a PySpark DataFrame. |
| Quip | Quip is a collaborative productivity software suite for mobile and We... |
| ReadTheDocs Documentation | Read the Docs is an open-sourced free software documentation hosting ... |
| Recursive URL | The RecursiveUrlLoader lets you recursively scrape all child links fr... |
| Reddit | Reddit is an American social news aggregation, content rating, and di... |
| Roam | ROAM is a note-taking tool for networked thought, designed to create ... |
| Rockset | Rockset is a real-time analytics database which enables queries on ma... |
| rspace | This notebook shows how to use the RSpace document loader to import r... |
| RSS Feeds | This covers how to load HTML news articles from a list of RSS feed UR... |
| RST | A reStructured Text (RST) file is a file format for textual data used... |
| scrapfly | ScrapFly |
| ScrapingAnt | Overview |
| SingleStore | The SingleStoreLoader allows you to load documents directly from a Si... |
| Sitemap | Extends from the WebBaseLoader, SitemapLoader loads a sitemap from a ... |
| Slack | Slack is an instant messaging program. |
| Snowflake | This notebook goes over how to load documents from Snowflake |
| Source Code | This notebook covers how to load source code files using a special ap... |
| Spider | Spider is the fastest and most affordable crawler and scraper that re... |
| Spreedly | Spreedly is a service that allows you to securely store credit cards ... |
| Stripe | Stripe is an Irish-American financial services and software as a serv... |
| Subtitle | The SubRip file format is described on the Matroska multimedia contai... |
| SurrealDB | SurrealDB is an end-to-end cloud-native database designed for modern ... |
| Telegram | Telegram Messenger is a globally accessible freemium, cross-platform,... |
| Tencent COS Directory | Tencent Cloud Object Storage (COS) is a distributed |
| Tencent COS File | Tencent Cloud Object Storage (COS) is a distributed |
| TensorFlow Datasets | TensorFlow Datasets is a collection of datasets ready to use, with Te... |
| TiDB | TiDB Cloud is a comprehensive Database-as-a-Service (DBaaS) solution... |
| 2Markdown | 2markdown service transforms website content into structured markdown... |
| TOML | TOML is a file format for configuration files. It is intended to be e... |
| Trello | Trello is a web-based project management and collaboration tool that ... |
| TSV | A tab-separated values (TSV) file is a simple, text-based file format... |
| Twitter | Twitter is an online social media and social networking service. |
| Unstructured | This notebook covers how to use Unstructured document loader to load ... |
| UnstructuredMarkdownLoader | This notebook provides a quick overview for getting started with Unst... |
| UnstructuredPDFLoader | Overview |
| Upstage | This notebook covers how to get started with UpstageDocumentParseLoad... |
| URL | This example covers how to load HTML documents from a list of URLs in... |
| Vsdx | A visio file (with extension .vsdx) is associated with Microsoft Visi... |
| Weather | OpenWeatherMap is an open-source weather service provider |
| WebBaseLoader | This covers how to use WebBaseLoader to load all text from HTML webpa... |
| WhatsApp Chat | WhatsApp (also called WhatsApp Messenger) is a freeware, cross-platfo... |
| Wikipedia | Wikipedia is a multilingual free online encyclopedia written and main... |
| UnstructuredXMLLoader | This notebook provides a quick overview for getting started with Unst... |
| Xorbits Pandas DataFrame | This notebook goes over how to load data from a xorbits.pandas DataFr... |
| YouTube audio | Building chat or QA applications on YouTube videos is a topic of high... |
| YouTube transcripts | YouTube is an online video sharing and social media platform created ... |
| YoutubeLoaderDL | Loader for Youtube leveraging the yt-dlp library. |
| Yuque | Yuque is a professional cloud-based knowledge base for team collabora... |
| ZeroxPDFLoader | Overview |