如何解析 XML 输出
先决条件
本指南假定您熟悉以下概念:
来自不同提供商的 LLM 通常具有不同的优势,具体取决于他们接受训练的特定数据。这也意味着有些算法在以 JSON 以外的格式生成输出时可能“更好”、更可靠。
本指南介绍如何使用XMLOutputParser提示模型进行 XML 输出,然后将该输出解析为可用格式。
注意
请记住,大型语言模型是泄漏的抽象!您必须使用具有足够容量的 LLM 来生成格式正确的 XML。
在以下示例中,我们使用 Anthropic 的 Claude-2 模型 (https://docs.anthropic.com/claude/docs),这是针对 XML 标记优化的此类模型。
%pip install -qU langchain langchain-anthropic
import os
from getpass import getpass
if "ANTHROPIC_API_KEY" not in os.environ:
os.environ["ANTHROPIC_API_KEY"] = getpass()
让我们从对模型的简单请求开始。
from langchain_anthropic import ChatAnthropic
from langchain_core.output_parsers import XMLOutputParser
from langchain_core.prompts import PromptTemplate
model = ChatAnthropic(model="claude-2.1", max_tokens_to_sample=512, temperature=0.1)
actor_query = "Generate the shortened filmography for Tom Hanks."
output = model.invoke(
f"""{actor_query}
Please enclose the movies in <movie></movie> tags"""
)
print(output.content)
Here is the shortened filmography for Tom Hanks, with movies enclosed in XML tags:
<movie>Splash</movie>
<movie>Big</movie>
<movie>A League of Their Own</movie>
<movie>Sleepless in Seattle</movie>
<movie>Forrest Gump</movie>
<movie>Toy Story</movie>
<movie>Apollo 13</movie>
<movie>Saving Private Ryan</movie>
<movie>Cast Away</movie>
<movie>The Da Vinci Code</movie>
这实际上效果很好!但是,最好将该 XML 解析为更易于使用的格式。我们可以使用XMLOutputParser向提示符添加默认格式指令并将输出的 XML 解析为 dict:
parser = XMLOutputParser()
# We will add these instructions to the prompt below
parser.get_format_instructions()
'The output should be formatted as a XML file.\n1. Output should conform to the tags below. \n2. If tags are not given, make them on your own.\n3. Remember to always open and close all the tags.\n\nAs an example, for the tags ["foo", "bar", "baz"]:\n1. String "<foo>\n <bar>\n <baz></baz>\n </bar>\n</foo>" is a well-formatted instance of the schema. \n2. String "<foo>\n <bar>\n </foo>" is a badly-formatted instance.\n3. String "<foo>\n <tag>\n </tag>\n</foo>" is a badly-formatted instance.\n\nHere are the output tags:\n\`\`\`\nNone\n\`\`\`'
prompt = PromptTemplate(
template="""{query}\n{format_instructions}""",
input_variables=["query"],
partial_variables={"format_instructions": parser.get_format_instructions()},
)
chain = prompt | model | parser
output = chain.invoke({"query": actor_query})
print(output)
{'filmography': [{'movie': [{'title': 'Big'}, {'year': '1988'}]}, {'movie': [{'title': 'Forrest Gump'}, {'year': '1994'}]}, {'movie': [{'title': 'Toy Story'}, {'year': '1995'}]}, {'movie': [{'title': 'Saving Private Ryan'}, {'year': '1998'}]}, {'movie': [{'title': 'Cast Away'}, {'year': '2000'}]}]}
我们还可以添加一些标签来根据我们的需要定制输出。您可以并且应该尝试在 Prompt 的其他部分添加自己的格式提示,以增强或替换默认说明:
parser = XMLOutputParser(tags=["movies", "actor", "film", "name", "genre"])
# We will add these instructions to the prompt below
parser.get_format_instructions()
'The output should be formatted as a XML file.\n1. Output should conform to the tags below. \n2. If tags are not given, make them on your own.\n3. Remember to always open and close all the tags.\n\nAs an example, for the tags ["foo", "bar", "baz"]:\n1. String "<foo>\n <bar>\n <baz></baz>\n </bar>\n</foo>" is a well-formatted instance of the schema. \n2. String "<foo>\n <bar>\n </foo>" is a badly-formatted instance.\n3. String "<foo>\n <tag>\n </tag>\n</foo>" is a badly-formatted instance.\n\nHere are the output tags:\n\`\`\`\n[\'movies\', \'actor\', \'film\', \'name\', \'genre\']\n\`\`\`'
prompt = PromptTemplate(
template="""{query}\n{format_instructions}""",
input_variables=["query"],
partial_variables={"format_instructions": parser.get_format_instructions()},
)
chain = prompt | model | parser
output = chain.invoke({"query": actor_query})
print(output)
{'movies': [{'actor': [{'name': 'Tom Hanks'}, {'film': [{'name': 'Forrest Gump'}, {'genre': 'Drama'}]}, {'film': [{'name': 'Cast Away'}, {'genre': 'Adventure'}]}, {'film': [{'name': 'Saving Private Ryan'}, {'genre': 'War'}]}]}]}
此输出解析器还支持部分块的流式处理。下面是一个示例:
for s in chain.stream({"query": actor_query}):
print(s)
{'movies': [{'actor': [{'name': 'Tom Hanks'}]}]}
{'movies': [{'actor': [{'film': [{'name': 'Forrest Gump'}]}]}]}
{'movies': [{'actor': [{'film': [{'genre': 'Drama'}]}]}]}
{'movies': [{'actor': [{'film': [{'name': 'Cast Away'}]}]}]}
{'movies': [{'actor': [{'film': [{'genre': 'Adventure'}]}]}]}
{'movies': [{'actor': [{'film': [{'name': 'Saving Private Ryan'}]}]}]}
{'movies': [{'actor': [{'film': [{'genre': 'War'}]}]}]}
后续步骤
您现在已经学习了如何提示模型返回 XML。接下来,查看有关获取其他相关技术的结构化输出的更广泛指南。