Skip to main content
Open In Colab在 GitHub 上打开

如何按 Headers 拆分 Markdown

赋予动机

许多聊天或 Q+A 应用程序涉及在嵌入和向量存储之前对输入文档进行分块。

Pinecone 的这些说明提供了一些有用的提示:

When a full paragraph or document is embedded, the embedding process considers both the overall context and the relationships between the sentences and phrases within the text. This can result in a more comprehensive vector representation that captures the broader meaning and themes of the text.

如前所述,分块通常旨在将具有公共上下文的文本保持在一起。考虑到这一点,我们可能希望特别尊重文档本身的结构。例如,Markdown 文件按标头进行组织。在特定的标头组中创建块是一个直观的想法。为了应对这一挑战,我们可以使用 MarkdownHeaderTextSplitter。这将按一组指定的标头拆分 Markdown 文件。

例如,如果我们想要拆分这个 markdown:

md = '# Foo\n\n ## Bar\n\nHi this is Jim  \nHi this is Joe\n\n ## Baz\n\n Hi this is Molly' 

我们可以指定要拆分的标头:

[("#", "Header 1"),("##", "Header 2")]

内容按通用标头进行分组或拆分:

{'content': 'Hi this is Jim  \nHi this is Joe', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar'}}
{'content': 'Hi this is Molly', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Baz'}}

让我们看一下下面的一些示例。

基本用法:

%pip install -qU langchain-text-splitters
from langchain_text_splitters import MarkdownHeaderTextSplitter
markdown_document = "# Foo\n\n    ## Bar\n\nHi this is Jim\n\nHi this is Joe\n\n ### Boo \n\n Hi this is Lance \n\n ## Baz\n\n Hi this is Molly"

headers_to_split_on = [
("#", "Header 1"),
("##", "Header 2"),
("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_document)
md_header_splits
[Document(page_content='Hi this is Jim  \nHi this is Joe', metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}),
Document(page_content='Hi this is Lance', metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}),
Document(page_content='Hi this is Molly', metadata={'Header 1': 'Foo', 'Header 2': 'Baz'})]
type(md_header_splits[0])
langchain_core.documents.base.Document

默认情况下,MarkdownHeaderTextSplitter从 output chunk 的内容中剥离 Headers。这可以通过设置strip_headers = False.

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on, strip_headers=False)
md_header_splits = markdown_splitter.split_text(markdown_document)
md_header_splits
[Document(page_content='# Foo  \n## Bar  \nHi this is Jim  \nHi this is Joe', metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}),
Document(page_content='### Boo \nHi this is Lance', metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}),
Document(page_content='## Baz \nHi this is Molly', metadata={'Header 1': 'Foo', 'Header 2': 'Baz'})]
注意

默认的MarkdownHeaderTextSplitter去除空格和新行。要保留 Markdown 文档的原始格式,请查看 ExperimentalMarkdownSyntaxTextSplitter

如何将 Markdown 行作为单独的文档返回

默认情况下,MarkdownHeaderTextSplitter根据 中指定的标题聚合行headers_to_split_on.我们可以通过指定return_each_line:

markdown_splitter = MarkdownHeaderTextSplitter(
headers_to_split_on,
return_each_line=True,
)
md_header_splits = markdown_splitter.split_text(markdown_document)
md_header_splits
[Document(page_content='Hi this is Jim', metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}),
Document(page_content='Hi this is Joe', metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}),
Document(page_content='Hi this is Lance', metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}),
Document(page_content='Hi this is Molly', metadata={'Header 1': 'Foo', 'Header 2': 'Baz'})]

请注意,此处的标头信息保留在metadata对于每个文档。

如何限制 chunk 大小:

然后,在每个 Markdown 组中,我们可以应用我们想要的任何文本分割器,例如RecursiveCharacterTextSplitter,这允许进一步控制块大小。

markdown_document = "# Intro \n\n    ## History \n\n Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9] \n\n Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files. \n\n ## Rise and divergence \n\n As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for \n\n additional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks. \n\n #### Standardization \n\n From 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort. \n\n ## Implementations \n\n Implementations of Markdown are available for over a dozen programming languages."

headers_to_split_on = [
("#", "Header 1"),
("##", "Header 2"),
]

# MD splits
markdown_splitter = MarkdownHeaderTextSplitter(
headers_to_split_on=headers_to_split_on, strip_headers=False
)
md_header_splits = markdown_splitter.split_text(markdown_document)

# Char-level splits
from langchain_text_splitters import RecursiveCharacterTextSplitter

chunk_size = 250
chunk_overlap = 30
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size, chunk_overlap=chunk_overlap
)

# Split
splits = text_splitter.split_documents(md_header_splits)
splits
[Document(page_content='# Intro  \n## History  \nMarkdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9]', metadata={'Header 1': 'Intro', 'Header 2': 'History'}),
Document(page_content='Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files.', metadata={'Header 1': 'Intro', 'Header 2': 'History'}),
Document(page_content='## Rise and divergence \nAs Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for \nadditional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks.', metadata={'Header 1': 'Intro', 'Header 2': 'Rise and divergence'}),
Document(page_content='#### Standardization \nFrom 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort.', metadata={'Header 1': 'Intro', 'Header 2': 'Rise and divergence'}),
Document(page_content='## Implementations \nImplementations of Markdown are available for over a dozen programming languages.', metadata={'Header 1': 'Intro', 'Header 2': 'Implementations'})]