How to define the target function to evaluate
Running an evaluation requires three main parts:
- A dataset of test inputs (and, optionally, reference outputs)
- A target function that runs the part of your application you want to evaluate
- One or more evaluators that score the target's outputs
This guide shows how to define a target function for the part of your application you want to evaluate. For creating datasets and defining evaluators, see the corresponding guides, as well as the end-to-end example of running an evaluation.
Target function signature
To evaluate an application in code, we need a way to run it. With evaluate() (Python/TypeScript), we do this by passing in a target function argument. This is a function that takes the inputs of a dataset Example and returns the application output as a dict. Inside this function we can call our application however we like, and we can format the output however we need. The key point is that any evaluator functions we define should expect the output format returned by the target function.
from langsmith import Client

# 'inputs' will come from your dataset.
def dummy_target(inputs: dict) -> dict:
    return {"foo": 1, "bar": "two"}

# 'inputs' will come from your dataset.
# 'outputs' will come from your target function.
def evaluator_one(inputs: dict, outputs: dict) -> bool:
    return outputs["foo"] == 2

def evaluator_two(inputs: dict, outputs: dict) -> bool:
    return len(outputs["bar"]) < 3

client = Client()

results = client.evaluate(
    dummy_target,  # <-- target function
    data="your-dataset-name",
    evaluators=[evaluator_one, evaluator_two],
    ...
)
evaluate() will automatically trace your target function.
This means that if you run any traceable code inside the target function, it will also be traced as a child run of the target's trace.
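For example, in a minimal sketch like the following (the format_prompt helper and the 'question' input key are illustrative assumptions, not part of any particular dataset schema), the decorated helper is recorded as a child run nested under the target's trace:

from langsmith import traceable

# Hypothetical helper; any '@traceable' function called inside the
# target is traced as a child run of the target's trace.
@traceable
def format_prompt(question: str) -> str:
    return f"Answer concisely: {question}"

def target(inputs: dict) -> dict:
    # Assumes dataset inputs have a 'question' key (adjust to your schema).
    prompt = format_prompt(inputs["question"])
    # ... call your model with `prompt` here and return its answer ...
    return {"answer": prompt}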
Example: Single LLM call
Evaluating a single LLM call can be useful when iterating on prompts or comparing models:
- Python
- TypeScript
- Python (LangChain)
- TypeScript (LangChain)
Set the OPENAI_API_KEY env var and install deps: pip install -U openai langsmith.
from langsmith import wrappers
from openai import OpenAI

# Optionally wrap the OpenAI client to automatically
# trace all model calls.
oai_client = wrappers.wrap_openai(OpenAI())

def target(inputs: dict) -> dict:
    # This assumes your dataset has inputs with a 'messages' key.
    # You can update to match your dataset schema.
    messages = inputs["messages"]
    response = oai_client.chat.completions.create(
        messages=messages,
        model="gpt-4o-mini",
    )
    return {"answer": response.choices[0].message.content}
Set env var OPENAI_API_KEY and install openai and langsmith.
import OpenAI from "openai";
import { wrapOpenAI } from "langsmith/wrappers";

const client = wrapOpenAI(new OpenAI());

// This is the function you will evaluate.
const target = async (inputs) => {
  // This assumes your dataset has inputs with a `messages` key
  const messages = inputs.messages;
  const response = await client.chat.completions.create({
    messages: messages,
    model: "gpt-4o-mini",
  });
  return { answer: response.choices[0].message.content };
};
Set env var OPENAI_API_KEY and install deps pip install -U langchain[openai].
from langchain.chat_models import init_chat_model

llm = init_chat_model("openai:gpt-4o-mini")

def target(inputs: dict) -> dict:
    # This assumes your dataset has inputs with a `messages` key
    messages = inputs["messages"]
    response = llm.invoke(messages)
    return {"answer": response.content}
Set env var OPENAI_API_KEY and install @langchain/openai.
import { ChatOpenAI } from "@langchain/openai";

// This is the function you will evaluate.
const target = async (inputs) => {
  // This assumes your dataset has inputs with a `messages` key
  const messages = inputs.messages;
  const model = new ChatOpenAI({ model: "gpt-4o-mini" });
  const response = await model.invoke(messages);
  return { answer: response.content };
};
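Each of the targets above returns its output under an answer key, so evaluators for this example should read that key. A minimal Python sketch, assuming your dataset examples also store a reference answer under an answer key in their reference outputs:

# 'outputs' comes from the target function; 'reference_outputs' comes from the dataset example.
def answer_is_correct(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
    return outputs["answer"].strip().lower() == reference_outputs["answer"].strip().lower()

You would then pass this alongside the target, e.g. evaluators=[answer_is_correct], when calling evaluate().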
Example: Non-LLM components
Sometimes you will want to evaluate steps of your application that don't involve an LLM. This includes, but is not limited to:
- The retrieval step in a RAG application
- The execution of a tool
In this example we'll show how to test a simple calculator tool. In practice, evaluations are most useful for components whose behavior is more complex and harder to unit test, such as retrievers or online research tools.
- Python
- TypeScript
from langsmith import traceable

# Optionally decorate with '@traceable' to trace all invocations of this function.
@traceable
def calculator_tool(operation: str, number1: float, number2: float) -> str:
    if operation == "add":
        return str(number1 + number2)
    elif operation == "subtract":
        return str(number1 - number2)
    elif operation == "multiply":
        return str(number1 * number2)
    elif operation == "divide":
        return str(number1 / number2)
    else:
        raise ValueError(f"Unrecognized operation: {operation}.")

# This is the function you will evaluate.
def target(inputs: dict) -> dict:
    # This assumes your dataset has inputs with `operation`, `num1`, and `num2` keys.
    operation = inputs["operation"]
    number1 = inputs["num1"]
    number2 = inputs["num2"]
    result = calculator_tool(operation, number1, number2)
    return {"result": result}
import { traceable } from "langsmith/traceable";

// Optionally wrap in 'traceable' to trace all invocations of this function.
const calculatorTool = traceable(async ({ operation, number1, number2 }) => {
  // Functions must return strings
  if (operation === "add") {
    return (number1 + number2).toString();
  } else if (operation === "subtract") {
    return (number1 - number2).toString();
  } else if (operation === "multiply") {
    return (number1 * number2).toString();
  } else if (operation === "divide") {
    return (number1 / number2).toString();
  } else {
    throw new Error("Invalid operation.");
  }
});

// This is the function you will evaluate.
const target = async (inputs) => {
  // This assumes your dataset has inputs with `operation`, `num1`, and `num2` keys.
  // A traceable-wrapped function is called directly.
  const result = await calculatorTool({
    operation: inputs.operation,
    number1: inputs.num1,
    number2: inputs.num2,
  });
  return { result };
};
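As with the earlier examples, an evaluator only needs to match the target's output format, here a result key. A minimal sketch, assuming each dataset example stores the expected value in its reference outputs under a result key:

def result_matches(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
    # The calculator tool returns strings, so compare against the stringified expected value.
    return outputs["result"] == str(reference_outputs["result"])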
Example: Application or agent
Evaluating the full output of an agentic application captures the interactions between multiple components, giving a more realistic picture of end-to-end performance. End-to-end evaluation can also surface integration and error-handling issues that would be missed when testing isolated functions or single LLM calls.
- Python
- TypeScript
from my_agent import agent

# This is the function you will evaluate.
def target(inputs: dict) -> dict:
    # This assumes your dataset has inputs with a `messages` key
    messages = inputs["messages"]
    # Replace `invoke` with whatever you use to call your agent
    response = agent.invoke({"messages": messages})
    # This assumes your agent output is in the right format
    return response
import { agent } from "my_agent";

// This is the function you will evaluate.
const target = async (inputs) => {
  // This assumes your dataset has inputs with a `messages` key
  const messages = inputs.messages;
  // Replace `invoke` with whatever you use to call your agent
  const response = await agent.invoke({ messages });
  // This assumes your agent output is in the right format
  return response;
};
If you have a LangGraph/LangChain agent that takes the inputs defined in your dataset and returns outputs in the format your evaluators expect, you can pass the object in directly as the target:
from my_agent import agent
from langsmith import Client

client = Client()
client.evaluate(agent, ...)
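An evaluator for an agent target then works with whatever the agent returns. A rough sketch, assuming a LangGraph-style state dict whose messages list ends with the final AI message and a dataset whose reference outputs have an answer key (adjust both to your own schemas):

def final_answer_is_correct(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
    # Assumes the last entry in `messages` is the agent's final response.
    final_message = outputs["messages"][-1]
    return reference_outputs["answer"].lower() in final_message.content.lower()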