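How to split HTML
Splitting HTML documents into manageable chunks is essential for many text-processing tasks, such as natural language processing and search indexing. This guide covers three text splitters provided by LangChain for splitting HTML content: HTMLHeaderTextSplitter, HTMLSectionSplitter, and HTMLSemanticPreservingSplitter.
Each of these splitters has distinct features and use cases. This guide explains how they differ, why you might choose one over the others, and how to use each one effectively.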
%pip install -qU langchain-text-splitters
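Overview of the splitters
HTMLHeaderTextSplitter
Useful when you want to preserve the hierarchical structure of a document based on its headers.
Description: Splits HTML text based on header tags (for example <h1>, <h2>, <h3>) and adds metadata for each header relevant to any given chunk.
Capabilities:
- Splits text at the HTML element level.
- Preserves context-rich information encoded in document structures.
- Can return chunks element by element or combine elements with the same metadata.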
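HTMLSectionSplitter
Useful when you want to split HTML documents into larger sections, such as <section>, <div>, or custom-defined sections.
Description: Similar to HTMLHeaderTextSplitter, but focused on splitting HTML into sections based on specified tags.
Capabilities:
- Uses XSLT transformations to detect and split sections.
- Internally uses RecursiveCharacterTextSplitter for large sections.
- Considers font sizes to determine sections.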
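HTMLSemanticPreservingSplitter
Ideal when you need to ensure that structured elements are not split across chunks, so that contextual relevancy is preserved.
Description: Splits HTML content into manageable chunks while preserving the semantic structure of important elements such as tables, lists, and other HTML components.
Capabilities:
- Preserves tables, lists, and other specified HTML elements.
- Allows custom handlers for specific HTML tags.
- Ensures that the semantic meaning of the document is maintained.
- Built-in normalization and stopword removal.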
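Choosing the right splitter
- Use HTMLHeaderTextSplitter when: you need to split an HTML document based on its header hierarchy and maintain metadata about the headers.
- Use HTMLSectionSplitter when: you need to split the document into larger, more general sections, possibly based on custom tags or font sizes.
- Use HTMLSemanticPreservingSplitter when: you need to split the document into chunks while preserving semantic elements such as tables and lists, ensuring that they are not split and that their context is maintained.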
| Feature | HTMLHeaderTextSplitter | HTMLSectionSplitter | HTMLSemanticPreservingSplitter |
|---|---|---|---|
| Splits based on headers | Yes | Yes | Yes |
| Preserves semantic elements (tables, lists) | No | No | Yes |
| Adds metadata for headers | Yes | Yes | Yes |
| Custom handlers for HTML tags | No | No | Yes |
| Preserves media (images, videos) | No | No | Yes |
| Considers font sizes | No | Yes | No |
| Uses XSLT transformations | No | Yes | No |
Example HTML document
Let's use the following HTML document as an example:
html_string = """
<!DOCTYPE html>
<html lang='en'>
<head>
<meta charset='UTF-8'>
<meta name='viewport' content='width=device-width, initial-scale=1.0'>
<title>Fancy Example HTML Page</title>
</head>
<body>
<h1>Main Title</h1>
<p>This is an introductory paragraph with some basic content.</p>
<h2>Section 1: Introduction</h2>
<p>This section introduces the topic. Below is a list:</p>
<ul>
<li>First item</li>
<li>Second item</li>
<li>Third item with <strong>bold text</strong> and <a href='#'>a link</a></li>
</ul>
<h3>Subsection 1.1: Details</h3>
<p>This subsection provides additional details. Here's a table:</p>
<table border='1'>
<thead>
<tr>
<th>Header 1</th>
<th>Header 2</th>
<th>Header 3</th>
</tr>
</thead>
<tbody>
<tr>
<td>Row 1, Cell 1</td>
<td>Row 1, Cell 2</td>
<td>Row 1, Cell 3</td>
</tr>
<tr>
<td>Row 2, Cell 1</td>
<td>Row 2, Cell 2</td>
<td>Row 2, Cell 3</td>
</tr>
</tbody>
</table>
<h2>Section 2: Media Content</h2>
<p>This section contains an image and a video:</p>
<img src='example_image_link.mp4' alt='Example Image'>
<video controls width='250' src='example_video_link.mp4' type='video/mp4'>
Your browser does not support the video tag.
</video>
<h2>Section 3: Code Example</h2>
<p>This section contains a code block:</p>
<pre><code data-lang="html">
<div>
<p>This is a paragraph inside a div.</p>
</div>
</code></pre>
<h2>Conclusion</h2>
<p>This is the conclusion of the document.</p>
</body>
</html>
"""
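Using HTMLHeaderTextSplitter
HTMLHeaderTextSplitter is a "structure-aware" text splitter that splits text at the HTML element level and adds metadata for each header "relevant" to any given chunk. It can return chunks element by element or combine elements with the same metadata, with the objectives of (a) keeping related text grouped (more or less) semantically and (b) preserving context-rich information encoded in document structures. It can be used with other text splitters as part of a chunking pipeline.
It is analogous to the MarkdownHeaderTextSplitter for Markdown files.
To specify which headers to split on, pass headers_to_split_on when instantiating HTMLHeaderTextSplitter, as shown below.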
from langchain_text_splitters import HTMLHeaderTextSplitter
headers_to_split_on = [
("h1", "Header 1"),
("h2", "Header 2"),
("h3", "Header 3"),
]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits
[Document(metadata={'Header 1': 'Main Title'}, page_content='This is an introductory paragraph with some basic content.'),
Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Section 1: Introduction'}, page_content='This section introduces the topic. Below is a list: \nFirst item Second item Third item with bold text and a link'),
Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Section 1: Introduction', 'Header 3': 'Subsection 1.1: Details'}, page_content="This subsection provides additional details. Here's a table:"),
Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Section 2: Media Content'}, page_content='This section contains an image and a video:'),
Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Section 3: Code Example'}, page_content='This section contains a code block:'),
Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Conclusion'}, page_content='This is the conclusion of the document.')]
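To make the header metadata more concrete, the following is a minimal standard-library sketch of how header context can be tracked while walking a document in order. It is not LangChain's implementation; the class name HeaderTracker and the short sample HTML are assumptions made for illustration, and the sketch ignores the sibling/ancestor logic that the real splitter applies.

```python
from html.parser import HTMLParser


class HeaderTracker(HTMLParser):
    """Illustrative only: attach the most recent headers to each text node."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        # Map tag name -> metadata key, e.g. {"h1": "Header 1"}.
        self.names = dict(headers_to_split_on)
        self.active = {}       # header tag -> header text currently in scope
        self.in_header = None  # tag of the header being read, if any
        self.chunks = []       # list of (text, metadata) pairs

    def handle_starttag(self, tag, attrs):
        if tag in self.names:
            self.in_header = tag
            # A new header ends every header of the same or a lower level.
            # String comparison is enough for the tags h1..h6.
            for other in list(self.active):
                if other >= tag:
                    del self.active[other]
            self.active[tag] = ""

    def handle_endtag(self, tag):
        if tag == self.in_header:
            self.in_header = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.in_header:
            joined = f"{self.active[self.in_header]} {text}"
            self.active[self.in_header] = joined.strip()
        else:
            meta = {self.names[t]: v for t, v in sorted(self.active.items())}
            self.chunks.append((text, meta))


sample = (
    "<h1>Main</h1><p>Intro</p>"
    "<h2>Part A</h2><p>Alpha</p>"
    "<h2>Part B</h2><p>Beta</p>"
)
tracker = HeaderTracker([("h1", "Header 1"), ("h2", "Header 2")])
tracker.feed(sample)
for text, meta in tracker.chunks:
    print(meta, text)
```

Each text node inherits the h1 above it plus the most recent h2, which mirrors the nested metadata shown in the output above.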
To return each element together with its associated headers, specify return_each_element=True when instantiating HTMLHeaderTextSplitter:
html_splitter = HTMLHeaderTextSplitter(
headers_to_split_on,
return_each_element=True,
)
html_header_splits_elements = html_splitter.split_text(html_string)
Compare this with the output above, where elements are aggregated by their headers:
for element in html_header_splits[:2]:
    print(element)
page_content='This is an introductory paragraph with some basic content.' metadata={'Header 1': 'Main Title'}
page_content='This section introduces the topic. Below is a list:
First item Second item Third item with bold text and a link' metadata={'Header 1': 'Main Title', 'Header 2': 'Section 1: Introduction'}
Now each element is returned as a distinct Document:
for element in html_header_splits_elements[:3]:
    print(element)
page_content='This is an introductory paragraph with some basic content.' metadata={'Header 1': 'Main Title'}
page_content='This section introduces the topic. Below is a list:' metadata={'Header 1': 'Main Title', 'Header 2': 'Section 1: Introduction'}
page_content='First item Second item Third item with bold text and a link' metadata={'Header 1': 'Main Title', 'Header 2': 'Section 1: Introduction'}
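The relationship between the two modes can be sketched with plain Python: merging consecutive elements that share identical metadata gives the aggregated view. The helper merge_by_metadata and the "\n" joiner are assumptions for illustration, not part of LangChain, and the library's own joining whitespace may differ.

```python
from itertools import groupby

# (page_content, metadata) pairs copied from the element-level output above.
elements = [
    (
        "This is an introductory paragraph with some basic content.",
        {"Header 1": "Main Title"},
    ),
    (
        "This section introduces the topic. Below is a list:",
        {"Header 1": "Main Title", "Header 2": "Section 1: Introduction"},
    ),
    (
        "First item Second item Third item with bold text and a link",
        {"Header 1": "Main Title", "Header 2": "Section 1: Introduction"},
    ),
]


def merge_by_metadata(pairs, joiner="\n"):
    """Merge consecutive elements whose metadata dicts are equal."""
    merged = []
    for meta, group in groupby(pairs, key=lambda pair: pair[1]):
        merged.append((joiner.join(text for text, _ in group), meta))
    return merged


merged = merge_by_metadata(elements)
print(len(elements), "elements ->", len(merged), "chunks")
```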
How to split from a URL or HTML file:
To read directly from a URL, pass the URL string to the split_text_from_url method.
Similarly, a local HTML file can be passed to the split_text_from_file method.
url = "https://plato.stanford.edu/entries/goedel/"
headers_to_split_on = [
("h1", "Header 1"),
("h2", "Header 2"),
("h3", "Header 3"),
("h4", "Header 4"),
]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
# for local file use html_splitter.split_text_from_file(<path_to_file>)
html_header_splits = html_splitter.split_text_from_url(url)
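How to constrain chunk size:
HTMLHeaderTextSplitter, which splits based on HTML headers, can be composed with another splitter that constrains splits based on character length, such as RecursiveCharacterTextSplitter.
This can be done using the .split_documents method of the second splitter: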
from langchain_text_splitters import RecursiveCharacterTextSplitter
chunk_size = 500
chunk_overlap = 30
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size, chunk_overlap=chunk_overlap
)
# Split
splits = text_splitter.split_documents(html_header_splits)
splits[80:85]
[Document(metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödel’s Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}, page_content='We see that Gödel first tried to reduce the consistency problem for analysis to that of arithmetic. This seemed to require a truth definition for arithmetic, which in turn led to paradoxes, such as the Liar paradox (“This sentence is false”) and Berry’s paradox (“The least number not defined by an expression consisting of just fourteen English words”). Gödel then noticed that such paradoxes would not necessarily arise if truth were replaced by provability. But this means that arithmetic truth'),
Document(metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödel’s Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}, page_content='means that arithmetic truth and arithmetic provability are not co-extensive — whence the First Incompleteness Theorem.'),
Document(metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödel’s Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}, page_content='This account of Gödel’s discovery was told to Hao Wang very much after the fact; but in Gödel’s contemporary correspondence with Bernays and Zermelo, essentially the same description of his path to the theorems is given. (See Gödel 2003a and Gödel 2003b respectively.) From those accounts we see that the undefinability of truth in arithmetic, a result credited to Tarski, was likely obtained in some form by Gödel by 1931. But he neither publicized nor published the result; the biases logicians'),
Document(metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödel’s Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}, page_content='result; the biases logicians had expressed at the time concerning the notion of truth, biases which came vehemently to the fore when Tarski announced his results on the undefinability of truth in formal systems 1935, may have served as a deterrent to Gödel’s publication of that theorem.'),
Document(metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödel’s Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.2 The proof of the First Incompleteness Theorem'}, page_content='We now describe the proof of the two theorems, formulating Gödel’s results in Peano arithmetic. Gödel himself used a system related to that defined in Principia Mathematica, but containing Peano arithmetic. In our presentation of the First and Second Incompleteness Theorems we refer to Peano arithmetic as P, following Gödel’s notation.')]
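The effect of chunk_overlap is visible in the first two chunks above: the second begins with the words that end the first. A small sketch for measuring that overlap follows; the helper shared_overlap is hypothetical, and the two strings are shortened excerpts of those chunks.

```python
def shared_overlap(previous: str, current: str) -> str:
    """Longest suffix of `previous` that is also a prefix of `current`."""
    for size in range(min(len(previous), len(current)), 0, -1):
        if previous.endswith(current[:size]):
            return current[:size]
    return ""


# End of the first chunk and start of the second chunk shown above.
first_tail = "But this means that arithmetic truth"
second_head = "means that arithmetic truth and arithmetic provability"

overlap = shared_overlap(first_tail, second_head)
print(repr(overlap), len(overlap))
```

The shared text is 27 characters long, which stays within the chunk_overlap of 30 configured above.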
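Limitations
There can be quite a bit of structural variation from one HTML document to another, and while HTMLHeaderTextSplitter will attempt to attach all "relevant" headers to any given chunk, it can sometimes miss certain headers. For example, the algorithm assumes an informational hierarchy in which headers are always at nodes "above" the associated text, that is, prior siblings, ancestors, and combinations thereof. In the following news article (as of the writing of this document), the document is structured such that the text of the top-level headline, while tagged "h1", is in a distinct subtree from the text elements that we would expect it to be "above". As a result, the "h1" element and its associated text do not show up in the chunk metadata (but, where applicable, we do see "h2" and its associated text):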
url = "https://www.cnn.com/2023/09/25/weather/el-nino-winter-us-climate/index.html"
headers_to_split_on = [
("h1", "Header 1"),
("h2", "Header 2"),
]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text_from_url(url)
print(html_header_splits[1].page_content[:500])
No two El Niño winters are the same, but many have temperature and precipitation trends in common.
Average conditions during an El Niño winter across the continental US.
One of the major reasons is the position of the jet stream, which often shifts south during an El Niño winter. This shift typically brings wetter and cooler weather to the South while the North becomes drier and warmer, according to NOAA.
Because the jet stream is essentially a river of air that storms flow through, they c
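One simple way to notice this situation in your own data is to count the chunks whose metadata lacks an expected key. The sketch below works on plain metadata dictionaries; the helper name and the sample values are hypothetical and are not taken from the article above.

```python
def chunks_missing_header(metadatas, key):
    """Return the indexes of chunks whose metadata has no entry for `key`."""
    return [i for i, meta in enumerate(metadatas) if key not in meta]


# Hypothetical metadata; with LangChain you could build this list as
# [doc.metadata for doc in html_header_splits].
metadatas = [
    {},
    {"Header 2": "Example section"},
    {"Header 1": "Example title", "Header 2": "Example section"},
]

missing = chunks_missing_header(metadatas, "Header 1")
print(f"{len(missing)} of {len(metadatas)} chunks have no 'Header 1': {missing}")
```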
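Using HTMLSectionSplitter
Similar in concept to HTMLHeaderTextSplitter, HTMLSectionSplitter is a "structure-aware" text splitter that splits text at the element level and adds metadata for each header "relevant" to any given chunk. It lets you split HTML by section.
It can return chunks element by element or combine elements with the same metadata, with the objectives of (a) keeping related text grouped (more or less) semantically and (b) preserving context-rich information encoded in document structures.
Use xslt_path to provide an absolute path to an XSLT file that transforms the HTML so that sections can be detected based on the provided tags. By default, the converting_to_header.xslt file in the data_connection/document_transformers directory is used. The transformation converts the HTML to a format/layout in which sections are easier to detect. For example, span elements can be converted to header tags based on their font size so that they are detected as sections.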
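The idea of promoting large-font spans to header tags can be illustrated without XSLT. The sketch below uses a regular expression and is only an approximation of the concept: the size thresholds, the function name, and the inline-style format it matches are all assumptions, and the XSLT file bundled with LangChain may apply different rules.

```python
import re

# Hypothetical thresholds in px, checked from largest to smallest.
SIZE_TO_TAG = [(24, "h1"), (18, "h2")]

# Matches spans whose only attribute is an inline font-size style.
SPAN = re.compile(
    r'<span\s+style="font-size:\s*(\d+)px;?">(.*?)</span>',
    re.IGNORECASE | re.DOTALL,
)


def promote_large_spans(html: str) -> str:
    """Rewrite large-font spans as header tags; leave other spans untouched."""

    def repl(match):
        size = int(match.group(1))
        for minimum, tag in SIZE_TO_TAG:
            if size >= minimum:
                return f"<{tag}>{match.group(2)}</{tag}>"
        return match.group(0)

    return SPAN.sub(repl, html)


converted = promote_large_spans(
    '<span style="font-size: 26px">Title</span>'
    '<span style="font-size: 12px">body</span>'
)
print(converted)
```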
How to split an HTML string:
from langchain_text_splitters import HTMLSectionSplitter
headers_to_split_on = [
("h1", "Header 1"),
("h2", "Header 2"),
]
html_splitter = HTMLSectionSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits
[Document(metadata={'Header 1': 'Main Title'}, page_content='Main Title \n This is an introductory paragraph with some basic content.'),
Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content="Section 1: Introduction \n This section introduces the topic. Below is a list: \n \n First item \n Second item \n Third item with bold text and a link \n \n \n Subsection 1.1: Details \n This subsection provides additional details. Here's a table: \n \n \n \n Header 1 \n Header 2 \n Header 3 \n \n \n \n \n Row 1, Cell 1 \n Row 1, Cell 2 \n Row 1, Cell 3 \n \n \n Row 2, Cell 1 \n Row 2, Cell 2 \n Row 2, Cell 3"),
Document(metadata={'Header 2': 'Section 2: Media Content'}, page_content='Section 2: Media Content \n This section contains an image and a video: \n \n \n Your browser does not support the video tag.'),
Document(metadata={'Header 2': 'Section 3: Code Example'}, page_content='Section 3: Code Example \n This section contains a code block: \n \n <div>\n <p>This is a paragraph inside a div.</p>\n </div>'),
Document(metadata={'Header 2': 'Conclusion'}, page_content='Conclusion \n This is the conclusion of the document.')]
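How to constrain chunk size:
HTMLSectionSplitter can be used with other text splitters as part of a chunking pipeline. Internally, it uses RecursiveCharacterTextSplitter when the section size is larger than the chunk size. It also considers the font size of the text to determine whether it is a section, based on a determined font-size threshold.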
from langchain_text_splitters import RecursiveCharacterTextSplitter
headers_to_split_on = [
("h1", "Header 1"),
("h2", "Header 2"),
("h3", "Header 3"),
]
html_splitter = HTMLSectionSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
chunk_size = 50
chunk_overlap = 5
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size, chunk_overlap=chunk_overlap
)
# Split
splits = text_splitter.split_documents(html_header_splits)
splits
[Document(metadata={'Header 1': 'Main Title'}, page_content='Main Title'),
Document(metadata={'Header 1': 'Main Title'}, page_content='This is an introductory paragraph with some'),
Document(metadata={'Header 1': 'Main Title'}, page_content='some basic content.'),
Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='Section 1: Introduction'),
Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='This section introduces the topic. Below is a'),
Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='is a list:'),
Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='First item \n Second item'),
Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='Third item with bold text and a link'),
Document(metadata={'Header 3': 'Subsection 1.1: Details'}, page_content='Subsection 1.1: Details'),
Document(metadata={'Header 3': 'Subsection 1.1: Details'}, page_content='This subsection provides additional details.'),
Document(metadata={'Header 3': 'Subsection 1.1: Details'}, page_content="Here's a table:"),
Document(metadata={'Header 3': 'Subsection 1.1: Details'}, page_content='Header 1 \n Header 2 \n Header 3'),
Document(metadata={'Header 3': 'Subsection 1.1: Details'}, page_content='Row 1, Cell 1 \n Row 1, Cell 2'),
Document(metadata={'Header 3': 'Subsection 1.1: Details'}, page_content='Row 1, Cell 3 \n \n \n Row 2, Cell 1'),
Document(metadata={'Header 3': 'Subsection 1.1: Details'}, page_content='Row 2, Cell 2 \n Row 2, Cell 3'),
Document(metadata={'Header 2': 'Section 2: Media Content'}, page_content='Section 2: Media Content'),
Document(metadata={'Header 2': 'Section 2: Media Content'}, page_content='This section contains an image and a video:'),
Document(metadata={'Header 2': 'Section 2: Media Content'}, page_content='Your browser does not support the video'),
Document(metadata={'Header 2': 'Section 2: Media Content'}, page_content='tag.'),
Document(metadata={'Header 2': 'Section 3: Code Example'}, page_content='Section 3: Code Example'),
Document(metadata={'Header 2': 'Section 3: Code Example'}, page_content='This section contains a code block: \n \n <div>'),
Document(metadata={'Header 2': 'Section 3: Code Example'}, page_content='<p>This is a paragraph inside a div.</p>'),
Document(metadata={'Header 2': 'Section 3: Code Example'}, page_content='</div>'),
Document(metadata={'Header 2': 'Conclusion'}, page_content='Conclusion'),
Document(metadata={'Header 2': 'Conclusion'}, page_content='This is the conclusion of the document.')]
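Using HTMLSemanticPreservingSplitter
HTMLSemanticPreservingSplitter is designed to split HTML content into manageable chunks while preserving the semantic structure of important elements such as tables, lists, and other HTML components. This ensures that such elements are not split across chunks, which would cause a loss of contextual relevancy such as table headers and list headers.
This splitter is designed, at its heart, to create contextually relevant chunks. General recursive splitting with HTMLHeaderTextSplitter can cause tables, lists, and other structured elements to be split in the middle, losing significant context and creating bad chunks.
HTMLSemanticPreservingSplitter is essential for splitting HTML content that includes structured elements such as tables and lists, especially when it is critical to keep those elements intact. Its ability to define custom handlers for specific HTML tags also makes it a versatile tool for processing complex HTML documents.
IMPORTANT: max_chunk_size is not a definite maximum chunk size. The maximum size is calculated while the preserved content is kept apart from the chunk, to ensure that the preserved content is not split. When the preserved data is added back into the chunk, the chunk size may exceed max_chunk_size. This is crucial for maintaining the structure of the original document.
Notes:
- We define a custom handler to reformat the contents of code blocks.
- We define a deny list for specific HTML elements, so that they and their contents are removed during preprocessing.
- We intentionally set a small chunk size to demonstrate that the elements are not split.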
# BeautifulSoup is required to use the custom handlers
from bs4 import Tag
from langchain_text_splitters import HTMLSemanticPreservingSplitter
headers_to_split_on = [
("h1", "Header 1"),
("h2", "Header 2"),
]
def code_handler(element: Tag) -> str:
    data_lang = element.get("data-lang")
    code_format = f"<code:{data_lang}>{element.get_text()}</code>"
    return code_format
splitter = HTMLSemanticPreservingSplitter(
headers_to_split_on=headers_to_split_on,
separators=["\n\n", "\n", ". ", "! ", "? "],
max_chunk_size=50,
preserve_images=True,
preserve_videos=True,
elements_to_preserve=["table", "ul", "ol", "code"],
denylist_tags=["script", "style", "head"],
custom_handlers={"code": code_handler},
)
documents = splitter.split_text(html_string)
documents
[Document(metadata={'Header 1': 'Main Title'}, page_content='This is an introductory paragraph with some basic content.'),
Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='This section introduces the topic'),
Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='. Below is a list: First item Second item Third item with bold text and a link Subsection 1.1: Details This subsection provides additional details'),
Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content=". Here's a table: Header 1 Header 2 Header 3 Row 1, Cell 1 Row 1, Cell 2 Row 1, Cell 3 Row 2, Cell 1 Row 2, Cell 2 Row 2, Cell 3"),
Document(metadata={'Header 2': 'Section 2: Media Content'}, page_content='This section contains an image and a video:  '),
Document(metadata={'Header 2': 'Section 3: Code Example'}, page_content='This section contains a code block: <code:html> <div> <p>This is a paragraph inside a div.</p> </div> </code>'),
Document(metadata={'Header 2': 'Conclusion'}, page_content='This is the conclusion of the document.')]
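Preserving tables and lists
In this example, we demonstrate how HTMLSemanticPreservingSplitter can preserve a table and a large list within an HTML document. The chunk size is set to 50 characters to illustrate how the splitter keeps these elements intact even when they exceed the defined maximum chunk size.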
from langchain_text_splitters import HTMLSemanticPreservingSplitter
html_string = """
<!DOCTYPE html>
<html>
<body>
<div>
<h1>Section 1</h1>
<p>This section contains an important table and list that should not be split across chunks.</p>
<table>
<tr>
<th>Item</th>
<th>Quantity</th>
<th>Price</th>
</tr>
<tr>
<td>Apples</td>
<td>10</td>
<td>$1.00</td>
</tr>
<tr>
<td>Oranges</td>
<td>5</td>
<td>$0.50</td>
</tr>
<tr>
<td>Bananas</td>
<td>50</td>
<td>$1.50</td>
</tr>
</table>
<h2>Subsection 1.1</h2>
<p>Additional text in subsection 1.1 that is separated from the table and list.</p>
<p>Here is a detailed list:</p>
<ul>
<li>Item 1: Description of item 1, which is quite detailed and important.</li>
<li>Item 2: Description of item 2, which also contains significant information.</li>
<li>Item 3: Description of item 3, another item that we don't want to split across chunks.</li>
</ul>
</div>
</body>
</html>
"""
headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2")]
splitter = HTMLSemanticPreservingSplitter(
headers_to_split_on=headers_to_split_on,
max_chunk_size=50,
elements_to_preserve=["table", "ul"],
)
documents = splitter.split_text(html_string)
print(documents)
[Document(metadata={'Header 1': 'Section 1'}, page_content='This section contains an important table and list'), Document(metadata={'Header 1': 'Section 1'}, page_content='that should not be split across chunks.'), Document(metadata={'Header 1': 'Section 1'}, page_content='Item Quantity Price Apples 10 $1.00 Oranges 5 $0.50 Bananas 50 $1.50'), Document(metadata={'Header 2': 'Subsection 1.1'}, page_content='Additional text in subsection 1.1 that is'), Document(metadata={'Header 2': 'Subsection 1.1'}, page_content='separated from the table and list. Here is a'), Document(metadata={'Header 2': 'Subsection 1.1'}, page_content="detailed list: Item 1: Description of item 1, which is quite detailed and important. Item 2: Description of item 2, which also contains significant information. Item 3: Description of item 3, another item that we don't want to split across chunks.")]
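Explanation
In this example, HTMLSemanticPreservingSplitter ensures that the entire table and the unordered list (<ul>) are each kept whole within a single chunk. Even though the chunk size is set to 50 characters, the splitter recognizes that these elements should not be split and keeps them intact.
This is particularly important when dealing with data tables or lists, where splitting the content could lead to loss of context or confusion. The resulting Document objects retain the full structure of these elements, so the contextual relevance of the information is maintained.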
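Why a chunk can end up larger than max_chunk_size is easier to see in a toy model: preserved elements are swapped for short placeholders, the remaining text is split by size, and the originals are put back afterwards. The sketch below uses only the standard library and is not LangChain's implementation; the function name, the placeholder format, and the use of textwrap for size-based splitting are assumptions.

```python
import re
import textwrap


def split_preserving(text, preserve_pattern, max_chunk_size):
    """Split `text` by size while keeping regex-matched spans whole."""
    preserved = {}

    def stash(match):
        # Replace the preserved span with a short placeholder token.
        key = f"@@{len(preserved)}@@"
        preserved[key] = match.group(0)
        return key

    masked = re.sub(preserve_pattern, stash, text, flags=re.DOTALL)
    # The size limit is applied while the placeholders are still in place.
    chunks = textwrap.wrap(masked, width=max_chunk_size, break_long_words=False)

    restored = []
    for chunk in chunks:
        for key, original in preserved.items():
            chunk = chunk.replace(key, original)
        restored.append(chunk)
    return restored


text = (
    "Intro sentence before the table. "
    "<table>Item Quantity Price Apples 10 $1.00 Oranges 5 $0.50</table> "
    "Closing text."
)
chunks = split_preserving(text, r"<table>.*?</table>", max_chunk_size=40)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

The chunk that receives the table grows well past 40 characters once the placeholder is expanded, which is the same trade-off described for max_chunk_size.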
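Using a custom handler
HTMLSemanticPreservingSplitter lets you define custom handlers for specific HTML elements. Some platforms have custom HTML tags that are not natively parsed by BeautifulSoup; when this occurs, you can use custom handlers to add the formatting logic easily.
This can be particularly useful for elements that require special processing, such as <iframe> tags or specific 'data-' elements. In this example, we create a custom handler for iframe tags that converts them into Markdown-like links.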
def custom_iframe_extractor(iframe_tag):
    iframe_src = iframe_tag.get("src", "")
    return f"[iframe:{iframe_src}]({iframe_src})"
splitter = HTMLSemanticPreservingSplitter(
headers_to_split_on=headers_to_split_on,
max_chunk_size=50,
separators=["\n\n", "\n", ". "],
elements_to_preserve=["table", "ul", "ol"],
custom_handlers={"iframe": custom_iframe_extractor},
)
html_string = """
<!DOCTYPE html>
<html>
<body>
<div>
<h1>Section with Iframe</h1>
<iframe src="https://example.com/embed"></iframe>
<p>Some text after the iframe.</p>
<ul>
<li>Item 1: Description of item 1, which is quite detailed and important.</li>
<li>Item 2: Description of item 2, which also contains significant information.</li>
<li>Item 3: Description of item 3, another item that we don't want to split across chunks.</li>
</ul>
</div>
</body>
</html>
"""
documents = splitter.split_text(html_string)
print(documents)
[Document(metadata={'Header 1': 'Section with Iframe'}, page_content='[iframe:https://example.com/embed](https://example.com/embed) Some text after the iframe'), Document(metadata={'Header 1': 'Section with Iframe'}, page_content=". Item 1: Description of item 1, which is quite detailed and important. Item 2: Description of item 2, which also contains significant information. Item 3: Description of item 3, another item that we don't want to split across chunks.")]
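Explanation
In this example, we defined a custom handler for iframe tags that converts them into Markdown-like links. When the splitter processes the HTML content, it uses this custom handler to transform the iframe tags while preserving other elements such as tables and lists. The resulting Document objects show how the iframe is handled according to the custom logic you provided.
Important: When preserving items such as links, be careful not to include . in your separators or to leave separators blank. RecursiveCharacterTextSplitter splits on full stops, which will cut links in half. Make sure you provide a separator list that uses ". " (a period followed by a space) instead.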
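The difference between the two separators can be seen with plain string splitting. This is only an illustration of the separator choice using str.split, not of the splitter's internals; the sample sentence is made up.

```python
text = "See [docs](https://example.com/embed). More text follows."

# Splitting on a bare period cuts through the URL.
split_on_period = text.split(".")

# Splitting on a period followed by a space leaves the URL intact.
split_on_period_space = text.split(". ")

print(split_on_period)
print(split_on_period_space)
```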
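Using a custom handler to analyze an image with an LLM
With custom handlers, we can also override the default processing for any element. A good example is inserting the semantic analysis of an image within a document directly into the chunking flow.
Since our function is called whenever the tag is discovered, we can override the <img> tag and turn off preserve_images to insert any content we would like to embed in our chunks.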
"""This example assumes you have helper methods `load_image_from_url` and an LLM agent `llm` that can process image data."""
from langchain.agents import AgentExecutor
# This example needs to be replaced with your own agent
llm = AgentExecutor(...)
# This method is a placeholder for loading image data from a URL and is not implemented here
def load_image_from_url(image_url: str) -> bytes:
    # Assuming this method fetches the image data from the URL
    return b"image_data"
html_string = """
<!DOCTYPE html>
<html>
<body>
<div>
<h1>Section with Image and Link</h1>
<p>
<img src="https://example.com/image.jpg" alt="An example image" />
Some text after the image.
</p>
<ul>
<li>Item 1: Description of item 1, which is quite detailed and important.</li>
<li>Item 2: Description of item 2, which also contains significant information.</li>
<li>Item 3: Description of item 3, another item that we don't want to split across chunks.</li>
</ul>
</div>
</body>
</html>
"""
def custom_image_handler(img_tag) -> str:
    img_src = img_tag.get("src", "")
    img_alt = img_tag.get("alt", "No alt text provided")
    image_data = load_image_from_url(img_src)
    semantic_meaning = llm.invoke(image_data)
    markdown_text = f"[Image Alt Text: {img_alt} | Image Source: {img_src} | Image Semantic Meaning: {semantic_meaning}]"
    return markdown_text
splitter = HTMLSemanticPreservingSplitter(
headers_to_split_on=headers_to_split_on,
max_chunk_size=50,
separators=["\n\n", "\n", ". "],
elements_to_preserve=["ul"],
preserve_images=False,
custom_handlers={"img": custom_image_handler},
)
documents = splitter.split_text(html_string)
print(documents)
[Document(metadata={'Header 1': 'Section with Image and Link'}, page_content='[Image Alt Text: An example image | Image Source: https://example.com/image.jpg | Image Semantic Meaning: semantic-meaning] Some text after the image'),
Document(metadata={'Header 1': 'Section with Image and Link'}, page_content=". Item 1: Description of item 1, which is quite detailed and important. Item 2: Description of item 2, which also contains significant information. Item 3: Description of item 3, another item that we don't want to split across chunks.")]
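Explanation:
With a custom handler written to extract specific fields from the <img> element, we can further process the data with our agent and insert the result directly into our chunk. Make sure preserve_images is set to False; otherwise, the default processing of <img> fields will take place.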
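Because the handler only calls .get on the tag it receives, its formatting logic can be tried out in isolation before any LLM is wired in. In the sketch below, a plain dict stands in for the BeautifulSoup tag and describe_image is a hypothetical stub that returns a fixed string in place of a model call.

```python
def describe_image(image_url: str) -> str:
    # Hypothetical stand-in for an LLM call; always returns the same text.
    return "semantic-meaning"


def custom_image_handler(img_tag) -> str:
    img_src = img_tag.get("src", "")
    img_alt = img_tag.get("alt", "No alt text provided")
    meaning = describe_image(img_src)
    return (
        f"[Image Alt Text: {img_alt} | Image Source: {img_src} | "
        f"Image Semantic Meaning: {meaning}]"
    )


# A dict supports .get(key, default), which is all the handler relies on.
fake_tag = {"src": "https://example.com/image.jpg"}  # no alt attribute
result = custom_image_handler(fake_tag)
print(result)
```

Checking the fallback for a missing alt attribute this way avoids spending model calls on formatting bugs.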