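How to split JSON data
This json splitter splits json data while allowing control over chunk sizes. It traverses the json data depth first and builds smaller json chunks. It attempts to keep nested json objects whole, but will split them if needed to keep chunks between min_chunk_size and max_chunk_size.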
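The depth-first strategy described above can be illustrated with a small standard-library sketch. This is not the library's implementation: set_nested and split_json_sketch are hypothetical helper names, min_chunk_size is ignored, and sizes are measured as the character length of json.dumps output.

```python
import json


def set_nested(d, path, value):
    """Set d[path[0]]...[path[-1]] = value, creating intermediate dicts."""
    for key in path[:-1]:
        d = d.setdefault(key, {})
    d[path[-1]] = value


def split_json_sketch(data, max_chunk_size, path=(), chunks=None):
    """Hypothetical, simplified depth-first split of a nested dict."""
    if chunks is None:
        chunks = [{}]
    for key, value in data.items():
        new_path = path + (key,)
        # Build the item with its full key path so context is preserved.
        item = {}
        set_nested(item, new_path, value)
        item_size = len(json.dumps(item))
        if len(json.dumps(chunks[-1])) + item_size <= max_chunk_size:
            # Fits in the current chunk: keep the value whole.
            set_nested(chunks[-1], new_path, value)
        elif isinstance(value, dict) and value and item_size > max_chunk_size:
            # A nested object too large for any single chunk: descend into it.
            split_json_sketch(value, max_chunk_size, new_path, chunks)
        else:
            # Start a new chunk; a large string is kept whole, never split.
            if chunks[-1]:
                chunks.append({})
            set_nested(chunks[-1], new_path, value)
    return chunks


data = {"a": {"x": "1" * 50, "y": "2" * 50}, "b": "3" * 50}
chunks = split_json_sketch(data, max_chunk_size=80)
for chunk in chunks:
    print(len(json.dumps(chunk)), chunk)
```

Each chunk repeats the key path down to its values, which is why a chunk stays interpretable on its own.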
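If a value is not nested json but a very large string, the string will not be split. If you need a hard cap on the chunk size, consider composing this splitter with a recursive text splitter applied to those chunks. There is an optional pre-processing step to split lists: they are first converted to json (dict) and then split as such.
- How the text is split: json value.
- How the chunk size is measured: by number of characters.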
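To make the hard cap mentioned above concrete, here is a minimal sketch of a second pass over string chunks. In practice you would likely use a text splitter such as RecursiveCharacterTextSplitter from the same package; enforce_hard_cap below is a hypothetical standard-library stand-in that slices at fixed widths, and the sample strings are made up for illustration.

```python
def enforce_hard_cap(texts, max_chars):
    """Hypothetical helper: slice any string longer than max_chars."""
    capped = []
    for text in texts:
        if len(text) <= max_chars:
            capped.append(text)
        else:
            # Fixed-width slices; a real text splitter would prefer
            # natural separators over arbitrary cut points.
            capped.extend(
                text[i : i + max_chars] for i in range(0, len(text), max_chars)
            )
    return capped


# Made-up sample: one small chunk and one chunk holding a very large string.
sample_texts = ['{"a": 1}', '{"blob": "' + "x" * 700 + '"}']
capped = enforce_hard_cap(sample_texts, max_chars=300)
print([len(t) for t in capped])
```

Note that the sliced pieces are no longer valid json on their own, which is the trade-off of enforcing a hard cap on oversized string values.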
%pip install -qU langchain-text-splitters
First we load some json data:
import json
import requests
# This is a large nested json object and will be loaded as a python dict
json_data = requests.get("https://api.smith.langchain.com/openapi.json").json()
Basic usage
Specify max_chunk_size to constrain chunk sizes:
from langchain_text_splitters import RecursiveJsonSplitter
splitter = RecursiveJsonSplitter(max_chunk_size=300)
API Reference: RecursiveJsonSplitter
To obtain json chunks, use the .split_json method:
# Recursively split json data - If you need to access/manipulate the smaller json chunks
json_chunks = splitter.split_json(json_data=json_data)
for chunk in json_chunks[:3]:
    print(chunk)
{'openapi': '3.1.0', 'info': {'title': 'LangSmith', 'version': '0.1.0'}, 'servers': [{'url': 'https://api.smith.langchain.com', 'description': 'LangSmith API endpoint.'}]}
{'paths': {'/api/v1/sessions/{session_id}': {'get': {'tags': ['tracer-sessions'], 'summary': 'Read Tracer Session', 'description': 'Get a specific session.', 'operationId': 'read_tracer_session_api_v1_sessions__session_id__get'}}}}
{'paths': {'/api/v1/sessions/{session_id}': {'get': {'security': [{'API Key': []}, {'Tenant ID': []}, {'Bearer Auth': []}]}}}}
To obtain LangChain Document objects, use the .create_documents method:
# The splitter can also output documents
docs = splitter.create_documents(texts=[json_data])
for doc in docs[:3]:
    print(doc)
page_content='{"openapi": "3.1.0", "info": {"title": "LangSmith", "version": "0.1.0"}, "servers": [{"url": "https://api.smith.langchain.com", "description": "LangSmith API endpoint."}]}'
page_content='{"paths": {"/api/v1/sessions/{session_id}": {"get": {"tags": ["tracer-sessions"], "summary": "Read Tracer Session", "description": "Get a specific session.", "operationId": "read_tracer_session_api_v1_sessions__session_id__get"}}}}'
page_content='{"paths": {"/api/v1/sessions/{session_id}": {"get": {"security": [{"API Key": []}, {"Tenant ID": []}, {"Bearer Auth": []}]}}}}'
Or use .split_text to obtain string content directly:
texts = splitter.split_text(json_data=json_data)
print(texts[0])
print(texts[1])
{"openapi": "3.1.0", "info": {"title": "LangSmith", "version": "0.1.0"}, "servers": [{"url": "https://api.smith.langchain.com", "description": "LangSmith API endpoint."}]}
{"paths": {"/api/v1/sessions/{session_id}": {"get": {"tags": ["tracer-sessions"], "summary": "Read Tracer Session", "description": "Get a specific session.", "operationId": "read_tracer_session_api_v1_sessions__session_id__get"}}}}
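The strings printed above look like the JSON serialization of the corresponding dict chunks. As a quick check under that assumption, the sketch below serializes the first dict chunk (copied from the split_json output) with the standard json module and parses it back.

```python
import json

# First dict chunk, copied from the split_json output shown earlier.
chunk = {
    "openapi": "3.1.0",
    "info": {"title": "LangSmith", "version": "0.1.0"},
    "servers": [
        {
            "url": "https://api.smith.langchain.com",
            "description": "LangSmith API endpoint.",
        }
    ],
}

text = json.dumps(chunk)  # dict chunk -> string
restored = json.loads(text)  # string -> dict chunk again

print(text)
print(restored == chunk)
```

Because the string form round-trips to the same dict, you can store or embed the text and still recover structured data later.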
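How to manage chunk sizes from list content
Note that one of the chunks in this example is larger than the specified max_chunk_size of 300. Reviewing one of these larger chunks, we see there is a list object there: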
print([len(text) for text in texts][:10])
print()
print(texts[3])
[171, 231, 126, 469, 210, 213, 237, 271, 191, 232]
{"paths": {"/api/v1/sessions/{session_id}": {"get": {"parameters": [{"name": "session_id", "in": "path", "required": true, "schema": {"type": "string", "format": "uuid", "title": "Session Id"}}, {"name": "include_stats", "in": "query", "required": false, "schema": {"type": "boolean", "default": false, "title": "Include Stats"}}, {"name": "accept", "in": "header", "required": false, "schema": {"anyOf": [{"type": "string"}, {"type": "null"}], "title": "Accept"}}]}}}}
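Rather than scanning the sizes by eye, oversized chunks can be located programmatically. The sketch below reuses the ten sizes printed above; with real data you would compute them from texts instead.

```python
max_chunk_size = 300

# Sizes copied from the output above.
sizes = [171, 231, 126, 469, 210, 213, 237, 271, 191, 232]

# Collect (index, size) pairs for every chunk over the limit.
oversized = [
    (index, size) for index, size in enumerate(sizes) if size > max_chunk_size
]
print(oversized)
```

Here the only offender is the chunk at index 3, the one containing the parameters list.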
By default, the json splitter does not split lists.
Specify convert_lists=True to preprocess the json, converting list content to dicts with index:item as key:val pairs:
texts = splitter.split_text(json_data=json_data, convert_lists=True)
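To show what the list preprocessing amounts to, here is a minimal standard-library sketch of the conversion. lists_to_dicts is a hypothetical helper written for illustration, not the library's internal function, and the example input is a small fragment modeled on the data above.

```python
def lists_to_dicts(data):
    """Hypothetical helper: recursively replace lists with index-keyed dicts."""
    if isinstance(data, dict):
        return {key: lists_to_dicts(value) for key, value in data.items()}
    if isinstance(data, list):
        # Each item is keyed by its position, stored as a string.
        return {str(index): lists_to_dicts(item) for index, item in enumerate(data)}
    return data


example = {
    "tags": ["tracer-sessions"],
    "security": [{"API Key": []}, {"Tenant ID": []}],
}
converted = lists_to_dicts(example)
print(converted)
```

Once lists are dicts, each element becomes an ordinary key:val pair that the splitter can place in separate chunks.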
Let's look at the sizes of the chunks. Now they are all under the max:
print([len(text) for text in texts][:10])
[176, 236, 141, 203, 212, 221, 210, 213, 242, 291]
The list has been converted to a dict, but retains all the needed contextual information even if split into many chunks:
print(texts[1])
{"paths": {"/api/v1/sessions/{session_id}": {"get": {"tags": {"0": "tracer-sessions"}, "summary": "Read Tracer Session", "description": "Get a specific session.", "operationId": "read_tracer_session_api_v1_sessions__session_id__get"}}}}
# We can also look at the documents
docs[1]
Document(page_content='{"paths": {"/api/v1/sessions/{session_id}": {"get": {"tags": ["tracer-sessions"], "summary": "Read Tracer Session", "description": "Get a specific session.", "operationId": "read_tracer_session_api_v1_sessions__session_id__get"}}}}')