如何创建自定义 Document Loader

概述

基于 LLM 的应用程序通常需要从数据库或文件（如 PDF）中提取数据，并将其转换为 LLM 可以使用的格式。在 LangChain 中，这通常涉及创建 Document 对象，这些对象封装提取的文本（page_content）以及元数据 - 包含有关文档的详细信息（如作者姓名或出版日期）的字典。

Document对象通常被格式化为提示，这些提示被馈送到 LLM 中，允许 LLM 使用Document生成所需的响应（例如，总结文档）。Documents可以立即使用，也可以索引到 VectorStore 中以供将来检索和使用。

Document Loading 的主要抽象是：

元件	描述
Document	Contains `text` and `metadata`
BaseLoader	Use to convert raw data into `Documents`
Blob	A representation of binary data that's located either in a file or in memory
BaseBlobParser	Logic to parse a `Blob` to yield `Document` objects

本指南将演示如何编写自定义文档加载和文件解析逻辑;具体来说，我们将了解如何：

通过子类化创建标准文档 LoaderBaseLoader.
使用创建 parserBaseBlobParser并将其与Blob和BlobLoaders.这主要在处理文件时有用。

标准文档加载器

文档加载器可以通过从BaseLoader，它为加载文档提供了一个标准接口。

接口

方法名称	解释
lazy_load	Used to load documents one by one lazily. Use for production code.
alazy_load	Async variant of `lazy_load`
load	Used to load all the documents into memory eagerly. Use for prototyping or interactive work.
aload	Used to load all the documents into memory eagerly. Use for prototyping or interactive work. Added in 2024-04 to LangChain.

这loadmethods 是一种仅用于原型设计的便捷方法 —— 它只是调用list(self.lazy_load()).
这alazy_load有一个默认实现，该实现将委托给lazy_load.如果您使用的是 async，我们建议重写默认实现并提供原生异步实现。

重要

在实现文档加载器时，请勿通过lazy_load或alazy_load方法。

所有配置都应通过初始化程序（init）传递。这是 LangChain 做出的设计选择，以确保在实例化文档加载器后，它拥有加载文档所需的所有信息。

实现

让我们创建一个标准文档加载器的示例，该示例加载文件并从文件中的每一行创建一个文档。

from typing import AsyncIterator, Iterator

from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document


class CustomDocumentLoader(BaseLoader):
    """An example document loader that reads a file line by line."""

    def __init__(self, file_path: str) -> None:
        """Initialize the loader with a file path.

        Args:
            file_path: The path to the file to load.
        """
        self.file_path = file_path

    def lazy_load(self) -> Iterator[Document]:  # <-- Does not take any arguments
        """A lazy loader that reads a file line by line.

        When you're implementing lazy load methods, you should use a generator
        to yield documents one by one.
        """
        with open(self.file_path, encoding="utf-8") as f:
            line_number = 0
            for line in f:
                yield Document(
                    page_content=line,
                    metadata={"line_number": line_number, "source": self.file_path},
                )
                line_number += 1

    # alazy_load is OPTIONAL.
    # If you leave out the implementation, a default implementation which delegates to lazy_load will be used!
    async def alazy_load(
        self,
    ) -> AsyncIterator[Document]:  # <-- Does not take any arguments
        """An async lazy loader that reads a file line by line."""
        # Requires aiofiles (install with pip)
        # https://github.com/Tinche/aiofiles
        import aiofiles

        async with aiofiles.open(self.file_path, encoding="utf-8") as f:
            line_number = 0
            async for line in f:
                yield Document(
                    page_content=line,
                    metadata={"line_number": line_number, "source": self.file_path},
                )
                line_number += 1

API 参考：BaseLoader | 公文

测试 🧪

要测试文档加载器，我们需要一个包含一些高质量内容的文件。

with open("./meow.txt", "w", encoding="utf-8") as f:
    quality_content = "meow meow🐱 \n meow meow🐱 \n meow😻😻"
    f.write(quality_content)

loader = CustomDocumentLoader("./meow.txt")

## Test out the lazy load interface
for doc in loader.lazy_load():
    print()
    print(type(doc))
    print(doc)

<class 'langchain_core.documents.base.Document'>
page_content='meow meow🐱 \n' metadata={'line_number': 0, 'source': './meow.txt'}

<class 'langchain_core.documents.base.Document'>
page_content=' meow meow🐱 \n' metadata={'line_number': 1, 'source': './meow.txt'}

<class 'langchain_core.documents.base.Document'>
page_content=' meow😻😻' metadata={'line_number': 2, 'source': './meow.txt'}

## Test out the async implementation
async for doc in loader.alazy_load():
    print()
    print(type(doc))
    print(doc)

<class 'langchain_core.documents.base.Document'>
page_content='meow meow🐱 \n' metadata={'line_number': 0, 'source': './meow.txt'}

<class 'langchain_core.documents.base.Document'>
page_content=' meow meow🐱 \n' metadata={'line_number': 1, 'source': './meow.txt'}

<class 'langchain_core.documents.base.Document'>
page_content=' meow😻😻' metadata={'line_number': 2, 'source': './meow.txt'}

提示

load()在 Jupyter Notebook 等交互式环境中可能会有所帮助。

避免将其用于生产代码，因为预先加载假定所有内容可以放入内存，但情况并非总是如此，尤其是对于企业数据。

loader.load()

[Document(page_content='meow meow🐱 \n', metadata={'line_number': 0, 'source': './meow.txt'}),
 Document(page_content=' meow meow🐱 \n', metadata={'line_number': 1, 'source': './meow.txt'}),
 Document(page_content=' meow😻😻', metadata={'line_number': 2, 'source': './meow.txt'})]

使用文件

许多文档加载程序都涉及解析文件。此类加载器之间的区别通常源于文件的解析方式，而不是文件的加载方式。例如，您可以使用open读取 PDF 或 Markdown 文件的二进制内容，但您需要不同的解析逻辑来将该二进制数据转换为文本。

因此，将解析逻辑与加载逻辑分离可能会有所帮助，这样可以更轻松地重用给定的解析器，而不管数据是如何加载的。

BaseBlobParser 数据库

一个BaseBlobParser是一个接口，它接受blob并输出一个Document对象。一个blob是内存或文件中数据的表示形式。LangChain python 有一个Blobprimitive 的 intent 的 intent 的

from langchain_core.document_loaders import BaseBlobParser, Blob


class MyParser(BaseBlobParser):
    """A simple parser that creates a document from each line."""

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        """Parse a blob into a document line by line."""
        line_number = 0
        with blob.as_bytes_io() as f:
            for line in f:
                line_number += 1
                yield Document(
                    page_content=line,
                    metadata={"line_number": line_number, "source": blob.source},
                )

API 参考：BaseBlobParser | 斑点

blob = Blob.from_path("./meow.txt")
parser = MyParser()

list(parser.lazy_parse(blob))

[Document(page_content='meow meow🐱 \n', metadata={'line_number': 1, 'source': './meow.txt'}),
 Document(page_content=' meow meow🐱 \n', metadata={'line_number': 2, 'source': './meow.txt'}),
 Document(page_content=' meow😻😻', metadata={'line_number': 3, 'source': './meow.txt'})]

使用 blob API 还允许直接从内存中加载内容，而无需从文件中读取内容！

blob = Blob(data=b"some data from memory\nmeow")
list(parser.lazy_parse(blob))

[Document(page_content='some data from memory\n', metadata={'line_number': 1, 'source': None}),
 Document(page_content='meow', metadata={'line_number': 2, 'source': None})]

斑点

让我们快速浏览一下 Blob API 的一些内容。

blob = Blob.from_path("./meow.txt", metadata={"foo": "bar"})

blob.encoding

'utf-8'

blob.as_bytes()

b'meow meow\xf0\x9f\x90\xb1 \n meow meow\xf0\x9f\x90\xb1 \n meow\xf0\x9f\x98\xbb\xf0\x9f\x98\xbb'

blob.as_string()

'meow meow🐱 \n meow meow🐱 \n meow😻😻'

blob.as_bytes_io()

<contextlib._GeneratorContextManager at 0x743f34324450>

blob.metadata

{'foo': 'bar'}

blob.source

'./meow.txt'

Blob 加载程序

解析器封装了将二进制数据解析到文档中所需的逻辑，而 blob 加载程序封装了从给定存储位置加载 blob 所需的逻辑。

目前，LangChain只有辅助FileSystemBlobLoader.

您可以使用FileSystemBlobLoader加载 blob，然后使用解析器解析它们。

from langchain_community.document_loaders.blob_loaders import FileSystemBlobLoader

blob_loader = FileSystemBlobLoader(path=".", glob="*.mdx", show_progress=True)

API 参考：FileSystemBlobLoader

parser = MyParser()
for blob in blob_loader.yield_blobs():
    for doc in parser.lazy_parse(blob):
        print(doc)
        break

  0%|          | 0/8 [00:00<?, ?it/s]

page_content='# Microsoft Office\n' metadata={'line_number': 1, 'source': 'office_file.mdx'}
page_content='# Markdown\n' metadata={'line_number': 1, 'source': 'markdown.mdx'}
page_content='# JSON\n' metadata={'line_number': 1, 'source': 'json.mdx'}
page_content='---\n' metadata={'line_number': 1, 'source': 'pdf.mdx'}
page_content='---\n' metadata={'line_number': 1, 'source': 'index.mdx'}
page_content='# File Directory\n' metadata={'line_number': 1, 'source': 'file_directory.mdx'}
page_content='# CSV\n' metadata={'line_number': 1, 'source': 'csv.mdx'}
page_content='# HTML\n' metadata={'line_number': 1, 'source': 'html.mdx'}

通用加载器

LangChain 有一个GenericLoaderabstraction 的BlobLoader替换为BaseBlobParser.

GenericLoader旨在提供标准化的 classmethod，以便于使用现有的BlobLoader实现。目前，只有FileSystemBlobLoader受支持。

from langchain_community.document_loaders.generic import GenericLoader

loader = GenericLoader.from_filesystem(
    path=".", glob="*.mdx", show_progress=True, parser=MyParser()
)

for idx, doc in enumerate(loader.lazy_load()):
    if idx < 5:
        print(doc)

print("... output truncated for demo purposes")

API 参考：GenericLoader

  0%|          | 0/8 [00:00<?, ?it/s]

page_content='# Microsoft Office\n' metadata={'line_number': 1, 'source': 'office_file.mdx'}
page_content='\n' metadata={'line_number': 2, 'source': 'office_file.mdx'}
page_content='>[The Microsoft Office](https://www.office.com/) suite of productivity software includes Microsoft Word, Microsoft Excel, Microsoft PowerPoint, Microsoft Outlook, and Microsoft OneNote. It is available for Microsoft Windows and macOS operating systems. It is also available on Android and iOS.\n' metadata={'line_number': 3, 'source': 'office_file.mdx'}
page_content='\n' metadata={'line_number': 4, 'source': 'office_file.mdx'}
page_content='This covers how to load commonly used file formats including `DOCX`, `XLSX` and `PPTX` documents into a document format that we can use downstream.\n' metadata={'line_number': 5, 'source': 'office_file.mdx'}
... output truncated for demo purposes

自定义通用加载器

如果你真的喜欢创建类，你可以 sub-class 并创建一个类来将 logic 封装在一起。

您可以从此类派生 sub-class 以使用现有加载器加载内容。

from typing import Any


class MyCustomLoader(GenericLoader):
    @staticmethod
    def get_parser(**kwargs: Any) -> BaseBlobParser:
        """Override this method to associate a default parser with the class."""
        return MyParser()

loader = MyCustomLoader.from_filesystem(path=".", glob="*.mdx", show_progress=True)

for idx, doc in enumerate(loader.lazy_load()):
    if idx < 5:
        print(doc)

print("... output truncated for demo purposes")

  0%|          | 0/8 [00:00<?, ?it/s]

page_content='# Microsoft Office\n' metadata={'line_number': 1, 'source': 'office_file.mdx'}
page_content='\n' metadata={'line_number': 2, 'source': 'office_file.mdx'}
page_content='>[The Microsoft Office](https://www.office.com/) suite of productivity software includes Microsoft Word, Microsoft Excel, Microsoft PowerPoint, Microsoft Outlook, and Microsoft OneNote. It is available for Microsoft Windows and macOS operating systems. It is also available on Android and iOS.\n' metadata={'line_number': 3, 'source': 'office_file.mdx'}
page_content='\n' metadata={'line_number': 4, 'source': 'office_file.mdx'}
page_content='This covers how to load commonly used file formats including `DOCX`, `XLSX` and `PPTX` documents into a document format that we can use downstream.\n' metadata={'line_number': 5, 'source': 'office_file.mdx'}
... output truncated for demo purposes