How to split text by tokens

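Language models have a token limit, and you should not exceed it. When you split text into chunks, it is therefore a good idea to count the number of tokens. There are many tokenizers. When counting tokens in your text, use the same tokenizer that the language model uses.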

tiktoken

Note

tiktoken is a fast BPE tokenizer created by OpenAI.

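We can use tiktoken to estimate the number of tokens used. It will probably be more accurate for OpenAI models.

How the text is split: by the character passed in.
How the chunk size is measured: by the tiktoken tokenizer.

CharacterTextSplitter, RecursiveCharacterTextSplitter, and TokenTextSplitter can be used with tiktoken directly.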

%pip install --upgrade --quiet langchain-text-splitters tiktoken

from langchain_text_splitters import CharacterTextSplitter

# This is a long document we can split up.
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()

API Reference: CharacterTextSplitter

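To split with a CharacterTextSplitter and then merge the chunks using tiktoken, use its .from_tiktoken_encoder() method. Note that splits produced by this method can be larger than the chunk size as measured by the tiktoken tokenizer.

The .from_tiktoken_encoder() method takes either encoding_name (e.g. cl100k_base) or model_name (e.g. gpt-4) as an argument. All other arguments, such as chunk_size, chunk_overlap, and separators, are used to instantiate CharacterTextSplitter: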

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base", chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)

print(texts[0])

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again. 

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 

With a duty to one another to the American people to the Constitution.
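The reason a chunk can exceed the chunk size is easier to see in a small sketch. The code below is not the LangChain implementation; it is a minimal illustration that assumes a whitespace word count as a stand-in for a real tokenizer, and the names `count_tokens` and `split_and_merge` are hypothetical. It splits on a separator and then greedily merges pieces, measuring the merged text with the token counter.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Stand-in token counter (whitespace words), not a real model tokenizer."""
    return len(text.split())


def split_and_merge(
    text: str,
    separator: str = "\n\n",
    chunk_size: int = 10,
    length_fn: Callable[[str], int] = count_tokens,
) -> List[str]:
    """Split on `separator`, then merge pieces while the merged text fits `chunk_size`."""
    chunks: List[str] = []
    current: List[str] = []
    for piece in text.split(separator):
        candidate = separator.join(current + [piece])
        if current and length_fn(candidate) > chunk_size:
            # Adding this piece would overflow, so close the current chunk.
            chunks.append(separator.join(current))
            current = [piece]
        else:
            current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "one two three\n\nfour five\n\n" + " ".join(["long"] * 15) + "\n\nsix"
chunks = split_and_merge(sample, chunk_size=10)
for chunk in chunks:
    print(count_tokens(chunk), repr(chunk))
```

In this sketch the 15-word paragraph contains no separator, so it is emitted as a single chunk of 15 tokens even though the limit is 10. The separator decides where cuts may happen; the tokenizer only measures the result.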

To enforce a hard constraint on the chunk size, we can use RecursiveCharacterTextSplitter.from_tiktoken_encoder, which recursively splits any split that is larger than the chunk size:

from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    model_name="gpt-4",
    chunk_size=100,
    chunk_overlap=0,
)

API Reference: RecursiveCharacterTextSplitter
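The idea of recursive splitting can be sketched in plain Python. This is a simplified illustration rather than the library's code: it assumes a whitespace word count in place of a real tokenizer, a hypothetical separator list ("\n\n", "\n", " "), and a hypothetical function name `recursive_split`. Any part that is still too large is split again with the next, finer separator.

```python
from typing import Callable, List, Sequence


def count_tokens(text: str) -> int:
    """Stand-in token counter (whitespace words), not a real model tokenizer."""
    return len(text.split())


def recursive_split(
    text: str,
    separators: Sequence[str] = ("\n\n", "\n", " "),
    chunk_size: int = 5,
    length_fn: Callable[[str], int] = count_tokens,
) -> List[str]:
    """Split `text`, re-splitting oversized parts with progressively finer separators."""
    if length_fn(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        return [text]  # nothing finer to split on, so the part stays oversized
    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current: List[str] = []
    for part in text.split(sep):
        if length_fn(part) > chunk_size:
            # Flush what has been collected, then recurse into the oversized part.
            if current:
                chunks.append(sep.join(current))
                current = []
            chunks.extend(recursive_split(part, finer, chunk_size, length_fn))
        elif current and length_fn(sep.join(current + [part])) > chunk_size:
            chunks.append(sep.join(current))
            current = [part]
        else:
            current.append(part)
    if current:
        chunks.append(sep.join(current))
    return [chunk for chunk in chunks if chunk.strip()]


words = " ".join(f"w{i}" for i in range(1, 13))
pieces = recursive_split("a b c\n\n" + words + "\n\nend", chunk_size=5)
for piece in pieces:
    print(count_tokens(piece), repr(piece))
```

Because the last separator in this sketch is a single space and the stand-in tokenizer counts words, every resulting piece fits within the limit. With a real tokenizer, a single word can still span several tokens, so the guarantee depends on how fine the final separator is.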

We can also load a TokenTextSplitter, which works with tiktoken directly and ensures that each split is smaller than the chunk size.

from langchain_text_splitters import TokenTextSplitter

text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)

texts = text_splitter.split_text(state_of_the_union)
print(texts[0])

API Reference: TokenTextSplitter

Madam Speaker, Madam Vice President, our

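Some written languages (e.g. Chinese and Japanese) have characters that encode to two or more tokens. Using the TokenTextSplitter directly can split the tokens of a single character across two chunks, producing malformed Unicode characters. Use RecursiveCharacterTextSplitter.from_tiktoken_encoder or CharacterTextSplitter.from_tiktoken_encoder to ensure that chunks contain valid Unicode strings.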
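The failure mode can be reproduced without any tokenizer library. The sketch below makes one loud assumption: raw UTF-8 bytes stand in for byte-level BPE tokens. Real BPE vocabularies merge bytes differently, but the boundary problem is the same. The helper names `split_tokens` and `decode` are hypothetical.

```python
text = "你好世界"  # four characters, three UTF-8 bytes each

# Assumption: each UTF-8 byte plays the role of one token.
tokens = list(text.encode("utf-8"))


def split_tokens(token_ids, chunk_size):
    """Cut the token sequence into fixed-size windows, ignoring character boundaries."""
    return [token_ids[i : i + chunk_size] for i in range(0, len(token_ids), chunk_size)]


def decode(token_ids):
    """Decode a window back to text; undecodable bytes become U+FFFD."""
    return bytes(token_ids).decode("utf-8", errors="replace")


# Splitting on tokens: windows of 4 tokens cut through the middle of characters.
token_chunks = [decode(window) for window in split_tokens(tokens, chunk_size=4)]
print(token_chunks)

# Splitting on characters first: every chunk is a valid string.
char_chunks = [text[i : i + 2] for i in range(0, len(text), 2)]
print(char_chunks)
```

The token-based chunks contain the U+FFFD replacement character wherever a window boundary fell inside a character, while the character-based chunks round-trip cleanly.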

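spaCy

Note

spaCy is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython.

LangChain implements splitters based on the spaCy tokenizer.

How the text is split: by the spaCy tokenizer.
How the chunk size is measured: by number of characters.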

%pip install --upgrade --quiet  spacy

# This is a long document we can split up.
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()

from langchain_text_splitters import SpacyTextSplitter

text_splitter = SpacyTextSplitter(chunk_size=1000)

texts = text_splitter.split_text(state_of_the_union)
print(texts[0])

API Reference: SpacyTextSplitter

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.

Members of Congress and the Cabinet.

Justices of the Supreme Court.

My fellow Americans.  

Last year COVID-19 kept us apart.

This year we are finally together again. 

Tonight, we meet as Democrats Republicans and Independents.

But most importantly as Americans. 

With a duty to one another to the American people to the Constitution. 

And with an unwavering resolve that freedom will always triumph over tyranny. 

Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways.

But he badly miscalculated. 

He thought he could roll into Ukraine and the world would roll over.

Instead he met a wall of strength he never imagined. 

He met the Ukrainian people. 

From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.
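The output above reflects a two-stage process: the text is first cut into sentences, and sentences are then grouped into chunks whose size is measured in characters. A rough sketch of that pattern follows. It assumes a naive regular expression as a stand-in for the spaCy sentence segmenter, and `split_by_sentences` is a hypothetical name, not a library function.

```python
import re
from typing import List


def split_by_sentences(text: str, chunk_size: int = 80, separator: str = "\n\n") -> List[str]:
    """Group sentences into chunks of at most `chunk_size` characters where possible."""
    # Naive stand-in for a real sentence segmenter: break after ., ! or ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = sentence if not current else current + separator + sentence
        if current and len(candidate) > chunk_size:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = (
    "Last year COVID-19 kept us apart. This year we are finally together again. "
    "Tonight, we meet as Democrats Republicans and Independents. "
    "But most importantly as Americans."
)
sentence_chunks = split_by_sentences(sample, chunk_size=80)
for chunk in sentence_chunks:
    print(len(chunk), repr(chunk))
```

Sentences inside a chunk are joined with a blank line, which matches the layout of the printed output above. A sentence longer than `chunk_size` would still be emitted whole in this sketch.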

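SentenceTransformers

The SentenceTransformersTokenTextSplitter is a specialized text splitter for use with sentence-transformer models. Its default behavior is to split the text into chunks that fit the token window of the sentence-transformer model you want to use.

To split text and constrain token counts according to the sentence-transformers tokenizer, instantiate a SentenceTransformersTokenTextSplitter. You can optionally specify:

chunk_overlap: integer count of tokens to overlap;
model_name: sentence-transformer model name, defaulting to "sentence-transformers/all-mpnet-base-v2";
tokens_per_chunk: desired token count per chunk.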

from langchain_text_splitters import SentenceTransformersTokenTextSplitter

splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0)
text = "Lorem "

count_start_and_stop_tokens = 2
text_token_count = splitter.count_tokens(text=text) - count_start_and_stop_tokens
print(text_token_count)

API Reference: SentenceTransformersTokenTextSplitter

token_multiplier = splitter.maximum_tokens_per_chunk // text_token_count + 1

# `text_to_split` does not fit in a single chunk
text_to_split = text * token_multiplier

print(f"tokens in text to split: {splitter.count_tokens(text=text_to_split)}")

tokens in text to split: 514

text_chunks = splitter.split_text(text=text_to_split)

print(text_chunks[1])

lorem
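How tokens_per_chunk and chunk_overlap interact can be illustrated on a plain list of token ids. This is a sketch of the general sliding-window idea, not the splitter's actual code; `token_windows` is a hypothetical helper, and the integers stand in for real token ids.

```python
from typing import List


def token_windows(
    token_ids: List[int], tokens_per_chunk: int, chunk_overlap: int = 0
) -> List[List[int]]:
    """Slide a window of `tokens_per_chunk` ids, stepping so windows share `chunk_overlap` ids."""
    if not 0 <= chunk_overlap < tokens_per_chunk:
        raise ValueError("chunk_overlap must be non-negative and smaller than tokens_per_chunk")
    step = tokens_per_chunk - chunk_overlap
    windows: List[List[int]] = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start : start + tokens_per_chunk])
        if start + tokens_per_chunk >= len(token_ids):
            break  # the last window already reaches the end of the sequence
    return windows


ids = list(range(10))  # stand-in token ids
windows = token_windows(ids, tokens_per_chunk=4, chunk_overlap=1)
print(windows)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Each window starts `tokens_per_chunk - chunk_overlap` ids after the previous one, so neighbouring chunks repeat exactly `chunk_overlap` ids at their boundary.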

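NLTK

Note

The Natural Language Toolkit, more commonly known as NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English, written in the Python programming language.

Rather than splitting only on "\n\n", we can use NLTK to split based on NLTK tokenizers.

How the text is split: by the NLTK tokenizer.
How the chunk size is measured: by number of characters.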

# pip install nltk

# This is a long document we can split up.
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()

from langchain_text_splitters import NLTKTextSplitter

text_splitter = NLTKTextSplitter(chunk_size=1000)

API Reference: NLTKTextSplitter

texts = text_splitter.split_text(state_of_the_union)
print(texts[0])

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.

Members of Congress and the Cabinet.

Justices of the Supreme Court.

My fellow Americans.

Last year COVID-19 kept us apart.

This year we are finally together again.

Tonight, we meet as Democrats Republicans and Independents.

But most importantly as Americans.

With a duty to one another to the American people to the Constitution.

And with an unwavering resolve that freedom will always triumph over tyranny.

Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways.

But he badly miscalculated.

He thought he could roll into Ukraine and the world would roll over.

Instead he met a wall of strength he never imagined.

He met the Ukrainian people.

From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.

Groups of citizens blocking tanks with their bodies.

KoNLPY

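Note

KoNLPy: Korean NLP in Python is a Python package for natural language processing (NLP) of the Korean language.

Token splitting involves segmenting text into smaller, more manageable units called tokens. These tokens are often words, phrases, symbols, or other meaningful elements that are crucial for further processing and analysis. In languages like English, token splitting typically involves separating words by spaces and punctuation marks. The effectiveness of token splitting depends largely on the tokenizer's understanding of the language's structure, which ensures that meaningful tokens are generated. Because tokenizers designed for English are not equipped to understand the unique semantic structures of other languages, such as Korean, they cannot be used effectively for Korean language processing.

Token splitting for Korean with KoNLPy's Kkma analyzer

For Korean text, KoNLPy includes a morphological analyzer called Kkma (Korean Knowledge Morpheme Analyzer). Kkma provides detailed morphological analysis of Korean text. It breaks sentences down into words and words into their respective morphemes, identifying the part of speech of each token. It can also segment a block of text into individual sentences, which is particularly useful for processing long texts.

Usage considerations

While Kkma is renowned for its detailed analysis, this precision may come at the cost of processing speed. Kkma is therefore best suited to applications where analytical depth takes priority over rapid text processing.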

# pip install konlpy

# This is a long Korean document that we want to split up into its component sentences.
with open("./your_korean_doc.txt") as f:
    korean_document = f.read()

from langchain_text_splitters import KonlpyTextSplitter

text_splitter = KonlpyTextSplitter()

API Reference: KonlpyTextSplitter

texts = text_splitter.split_text(korean_document)
# The sentences are split with "\n\n" characters.
print(texts[0])

춘향전 옛날에 남원에 이 도령이라는 벼슬아치 아들이 있었다.

그의 외모는 빛나는 달처럼 잘생겼고, 그의 학식과 기예는 남보다 뛰어났다.

한편, 이 마을에는 춘향이라는 절세 가인이 살고 있었다.

춘 향의 아름다움은 꽃과 같아 마을 사람들 로부터 많은 사랑을 받았다.

어느 봄날, 도령은 친구들과 놀러 나갔다가 춘 향을 만 나 첫 눈에 반하고 말았다.

두 사람은 서로 사랑하게 되었고, 이내 비밀스러운 사랑의 맹세를 나누었다.

하지만 좋은 날들은 오래가지 않았다.

도령의 아버지가 다른 곳으로 전근을 가게 되어 도령도 떠나 야만 했다.

이별의 아픔 속에서도, 두 사람은 재회를 기약하며 서로를 믿고 기다리기로 했다.

그러나 새로 부임한 관아의 사또가 춘 향의 아름다움에 욕심을 내 어 그녀에게 강요를 시작했다.

춘 향 은 도령에 대한 자신의 사랑을 지키기 위해, 사또의 요구를 단호히 거절했다.

이에 분노한 사또는 춘 향을 감옥에 가두고 혹독한 형벌을 내렸다.

이야기는 이 도령이 고위 관직에 오른 후, 춘 향을 구해 내는 것으로 끝난다.

두 사람은 오랜 시련 끝에 다시 만나게 되고, 그들의 사랑은 온 세상에 전해 지며 후세에까지 이어진다.

- 춘향전 (The Tale of Chunhyang)

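Hugging Face tokenizer

Hugging Face has many tokenizers.

We use a Hugging Face tokenizer, GPT2TokenizerFast, to count the text length in tokens.

How the text is split: by the character passed in.
How the chunk size is measured: by the number of tokens calculated by the Hugging Face tokenizer.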

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# This is a long document we can split up.
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()

from langchain_text_splitters import CharacterTextSplitter

API Reference: CharacterTextSplitter

text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)

print(texts[0])

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again. 

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 

With a duty to one another to the American people to the Constitution.