Skip to main content
Open In Colab在 GitHub 上打开

如何在执行提取时使用参考示例

通常可以通过向 LLM 提供参考示例来提高提取质量。

数据提取尝试生成在文本和其他非结构化或半结构化格式中找到的信息的结构化表示工具调用在这种情况下,通常会使用 LLM 功能。本指南演示了如何构建工具调用的少量示例,以帮助控制提取和类似应用程序的行为。

提示

虽然本指南重点介绍如何将示例与工具调用模型一起使用,但此技术通常适用,并且将有效 也使用 JSON more 或基于提示的技术。

LangChain 对来自 LLM 的消息(包括 tool 调用)实现 tool-call 属性。有关更多详细信息,请参阅我们的工具调用操作指南。为了构建数据提取的参考示例,我们构建了一个包含以下序列的聊天历史记录:

LangChain 采用此约定将工具调用结构化到跨 LLM 模型提供商的对话中。

首先,我们构建一个提示模板,其中包含这些消息的占位符:

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

# Define a custom prompt to provide instructions and any additional context.
# 1) You can add examples into the prompt template to improve extraction quality
# 2) Introduce additional parameters to take context into account (e.g., include metadata
# about the document from which the text was extracted.)
prompt = ChatPromptTemplate.from_messages(
[
(
"system",
"You are an expert extraction algorithm. "
"Only extract relevant information from the text. "
"If you do not know the value of an attribute asked "
"to extract, return null for the attribute's value.",
),
# ↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓
MessagesPlaceholder("examples"), # <-- EXAMPLES!
# ↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑
("human", "{text}"),
]
)

测试模板:

from langchain_core.messages import (
HumanMessage,
)

prompt.invoke(
{"text": "this is some text", "examples": [HumanMessage(content="testing 1 2 3")]}
)
API 参考:HumanMessage
ChatPromptValue(messages=[SystemMessage(content="You are an expert extraction algorithm. Only extract relevant information from the text. If you do not know the value of an attribute asked to extract, return null for the attribute's value.", additional_kwargs={}, response_metadata={}), HumanMessage(content='testing 1 2 3', additional_kwargs={}, response_metadata={}), HumanMessage(content='this is some text', additional_kwargs={}, response_metadata={})])

定义架构

让我们重复使用提取教程中的 person 架构。

from typing import List, Optional

from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field


class Person(BaseModel):
"""Information about a person."""

# ^ Doc-string for the entity Person.
# This doc-string is sent to the LLM as the description of the schema Person,
# and it can help to improve extraction results.

# Note that:
# 1. Each field is an `optional` -- this allows the model to decline to extract it!
# 2. Each field has a `description` -- this description is used by the LLM.
# Having a good description can help improve extraction results.
name: Optional[str] = Field(..., description="The name of the person")
hair_color: Optional[str] = Field(
..., description="The color of the person's hair if known"
)
height_in_meters: Optional[str] = Field(..., description="Height in METERs")


class Data(BaseModel):
"""Extracted data about people."""

# Creates a model so that we can extract multiple entities.
people: List[Person]
API 参考:ChatOpenAI

定义参考示例

示例可以定义为 input-output 对的列表。

每个示例都包含一个示例inputtext 和示例output显示应该从文本中提取什么。

重要

这有点杂草丛生,所以请随意跳过。

示例的格式需要与使用的 API 匹配(例如,工具调用或 JSON 模式等)。

在这里,格式化的示例将与工具调用 API 的预期格式匹配,因为这是我们使用的格式。

import uuid
from typing import Dict, List, TypedDict

from langchain_core.messages import (
AIMessage,
BaseMessage,
HumanMessage,
SystemMessage,
ToolMessage,
)
from pydantic import BaseModel, Field


class Example(TypedDict):
"""A representation of an example consisting of text input and expected tool calls.

For extraction, the tool calls are represented as instances of pydantic model.
"""

input: str # This is the example text
tool_calls: List[BaseModel] # Instances of pydantic model that should be extracted


def tool_example_to_messages(example: Example) -> List[BaseMessage]:
"""Convert an example into a list of messages that can be fed into an LLM.

This code is an adapter that converts our example to a list of messages
that can be fed into a chat model.

The list of messages per example corresponds to:

1) HumanMessage: contains the content from which content should be extracted.
2) AIMessage: contains the extracted information from the model
3) ToolMessage: contains confirmation to the model that the model requested a tool correctly.

The ToolMessage is required because some of the chat models are hyper-optimized for agents
rather than for an extraction use case.
"""
messages: List[BaseMessage] = [HumanMessage(content=example["input"])]
tool_calls = []
for tool_call in example["tool_calls"]:
tool_calls.append(
{
"id": str(uuid.uuid4()),
"args": tool_call.dict(),
# The name of the function right now corresponds
# to the name of the pydantic model
# This is implicit in the API right now,
# and will be improved over time.
"name": tool_call.__class__.__name__,
},
)
messages.append(AIMessage(content="", tool_calls=tool_calls))
tool_outputs = example.get("tool_outputs") or [
"You have correctly called this tool."
] * len(tool_calls)
for output, tool_call in zip(tool_outputs, tool_calls):
messages.append(ToolMessage(content=output, tool_call_id=tool_call["id"]))
return messages

接下来,让我们定义示例,然后将它们转换为消息格式。

examples = [
(
"The ocean is vast and blue. It's more than 20,000 feet deep. There are many fish in it.",
Data(people=[]),
),
(
"Fiona traveled far from France to Spain.",
Data(people=[Person(name="Fiona", height_in_meters=None, hair_color=None)]),
),
]


messages = []

for text, tool_call in examples:
messages.extend(
tool_example_to_messages({"input": text, "tool_calls": [tool_call]})
)

我们来测试一下 prompt

example_prompt = prompt.invoke({"text": "this is some text", "examples": messages})

for message in example_prompt.messages:
print(f"{message.type}: {message}")
system: content="You are an expert extraction algorithm. Only extract relevant information from the text. If you do not know the value of an attribute asked to extract, return null for the attribute's value." additional_kwargs={} response_metadata={}
human: content="The ocean is vast and blue. It's more than 20,000 feet deep. There are many fish in it." additional_kwargs={} response_metadata={}
ai: content='' additional_kwargs={} response_metadata={} tool_calls=[{'name': 'Data', 'args': {'people': []}, 'id': '240159b1-1405-4107-a07c-3c6b91b3d5b7', 'type': 'tool_call'}]
tool: content='You have correctly called this tool.' tool_call_id='240159b1-1405-4107-a07c-3c6b91b3d5b7'
human: content='Fiona traveled far from France to Spain.' additional_kwargs={} response_metadata={}
ai: content='' additional_kwargs={} response_metadata={} tool_calls=[{'name': 'Data', 'args': {'people': [{'name': 'Fiona', 'hair_color': None, 'height_in_meters': None}]}, 'id': '3fc521e4-d1d2-4c20-bf40-e3d72f1068da', 'type': 'tool_call'}]
tool: content='You have correctly called this tool.' tool_call_id='3fc521e4-d1d2-4c20-bf40-e3d72f1068da'
human: content='this is some text' additional_kwargs={} response_metadata={}

创建提取器

我们选择一个 LLM。由于我们使用的是工具调用,因此需要一个支持工具调用功能的模型。有关可用的 LLM,请参阅此表

pip install -qU "langchain[openai]"
import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("gpt-4-0125-preview", model_provider="openai", temperature=0)

按照提取教程,我们使用.with_structured_output方法根据所需的 schema 构建模型输出:

runnable = prompt | llm.with_structured_output(
schema=Data,
method="function_calling",
include_raw=False,
)

无示例 😿

请注意,即使是功能强大的模型也可能在非常简单的测试用例中失败!

for _ in range(5):
text = "The solar system is large, but earth has only 1 moon."
print(runnable.invoke({"text": text, "examples": []}))
people=[Person(name='earth', hair_color='null', height_in_meters='null')]
``````output
people=[Person(name='earth', hair_color='null', height_in_meters='null')]
``````output
people=[]
``````output
people=[Person(name='earth', hair_color='null', height_in_meters='null')]
``````output
people=[]

附示例 😻

参考示例有助于修复故障!

for _ in range(5):
text = "The solar system is large, but earth has only 1 moon."
print(runnable.invoke({"text": text, "examples": messages}))
people=[]
``````output
people=[]
``````output
people=[]
``````output
people=[]
``````output
people=[]

请注意,我们可以在 Langsmith 跟踪中将少数样本示例视为工具调用。

我们保留了阳性样本的性能:

runnable.invoke(
{
"text": "My name is Harrison. My hair is black.",
"examples": messages,
}
)
Data(people=[Person(name='Harrison', hair_color='black', height_in_meters=None)])