Evaluation quickstart
Evaluations are a quantitative way to measure the performance of LLM applications. This is important because LLMs don't always behave predictably: small changes in prompts, models, or inputs can significantly impact results. Evaluations provide a structured way to identify failures, compare changes across different versions of your application, and build more reliable AI applications.
Evaluations are made up of three components: a dataset of test inputs and reference outputs, a target function that contains the application logic you want to evaluate, and evaluators that score your target function's outputs.
This quickstart guides you through running a simple evaluation that tests the correctness of LLM responses, using either the LangSmith SDK or the UI.
- SDK
- UI
This quickstart uses a prebuilt LLM-as-judge evaluator from the open source openevals package. OpenEvals includes a set of commonly used evaluators and is a great starting point if you're new to evaluations.
If you want more flexibility in how you evaluate your apps, you can also define completely custom evaluators using your own code, as sketched after step 7 below.
1. Install dependencies
- Python
- TypeScript
pip install -U langsmith openevals openai
npm install langsmith openevals openai
If you are using yarn as your package manager, you will also need to manually install @langchain/core as a peer dependency of openevals. This is not required for LangSmith evals in general - you may define evaluators using arbitrary custom code.
2. Create a LangSmith API key
To create an API key, head to the Settings page, then click Create API Key.
3. Set up your environment
Since this quickstart uses OpenAI models, you will need to set the OPENAI_API_KEY environment variable in addition to the required LangSmith ones:
- Shell
export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY="<your-langchain-api-key>"
# This example uses OpenAI, but you can use other LLM providers if desired
export OPENAI_API_KEY="<your-openai-api-key>"
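If you are working in a notebook rather than a shell, you can set the same variables from Python instead; a minimal sketch using only the standard library (the variable names match the shell example above):
import getpass
import os

# Enable tracing and authenticate with LangSmith
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = getpass.getpass("LangSmith API key: ")

# This example uses OpenAI, but you can use other LLM providers if desired
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API key: ")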
4. Create a dataset
Next, define the example inputs and reference output pairs that you will use to evaluate your app:
- Python
- TypeScript
from langsmith import Client
client = Client()
# Programmatically create a dataset in LangSmith
# For other dataset creation methods, see:
# https://docs.smith.langchain.com/evaluation/how_to_guides/manage_datasets_programmatically
# https://docs.smith.langchain.com/evaluation/how_to_guides/manage_datasets_in_application
dataset = client.create_dataset(
    dataset_name="Sample dataset", description="A sample dataset in LangSmith."
)
# Create examples
examples = [
    {
        "inputs": {"question": "Which country is Mount Kilimanjaro located in?"},
        "outputs": {"answer": "Mount Kilimanjaro is located in Tanzania."},
    },
    {
        "inputs": {"question": "What is Earth's lowest point?"},
        "outputs": {"answer": "Earth's lowest point is The Dead Sea."},
    },
]
# Add examples to the dataset
client.create_examples(dataset_id=dataset.id, examples=examples)
import { Client } from "langsmith";
const client = new Client();
// Programmatically create a dataset in LangSmith
// For other dataset creation methods, see:
// https://docs.smith.langchain.com/evaluation/how_to_guides/manage_datasets_programmatically
// https://docs.smith.langchain.com/evaluation/how_to_guides/manage_datasets_in_application
const dataset = await client.createDataset("Sample dataset", {
  description: "A sample dataset in LangSmith.",
});
// Create inputs and reference outputs
const examples = [
  {
    inputs: { question: "Which country is Mount Kilimanjaro located in?" },
    outputs: { answer: "Mount Kilimanjaro is located in Tanzania." },
    dataset_id: dataset.id,
  },
  {
    inputs: { question: "What is Earth's lowest point?" },
    outputs: { answer: "Earth's lowest point is The Dead Sea." },
    dataset_id: dataset.id,
  },
];
// Add examples to the dataset
await client.createExamples(examples);
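If you'd like to confirm that the examples landed before moving on, you can read them back with the SDK; a minimal sketch using the Python client created above:
# Read the examples back from the dataset to verify they were created
for example in client.list_examples(dataset_id=dataset.id):
    print(example.inputs, example.outputs)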
5. Define what you're evaluating
Now, define a target function that contains what you're evaluating. For example, this could be an LLM call that includes the new prompt you are testing, a part of your application, or your end-to-end application.
- Python
- TypeScript
from langsmith import wrappers
from openai import OpenAI
# Wrap the OpenAI client for LangSmith tracing
openai_client = wrappers.wrap_openai(OpenAI())
# Define the application logic you want to evaluate inside a target function
# The SDK will automatically send the inputs from the dataset to your target function
def target(inputs: dict) -> dict:
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer the following question accurately"},
            {"role": "user", "content": inputs["question"]},
        ],
    )
    return {"answer": response.choices[0].message.content.strip()}
import { wrapOpenAI } from "langsmith/wrappers";
import OpenAI from "openai";
const openai = wrapOpenAI(new OpenAI());
// Define the application logic you want to evaluate inside a target function
// The SDK will automatically send the inputs from the dataset to your target function
async function target(inputs: { question: string }): Promise<{ answer: string }> {
  const response = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      { role: "system", content: "Answer the following question accurately" },
      { role: "user", content: inputs.question },
    ],
  });
  return { answer: response.choices[0].message.content?.trim() || "" };
}
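Before running a full experiment, it can help to sanity-check the target function on a single input; a minimal sketch using the Python target above (the question is taken from the sample dataset):
# Quick local check that the target returns a dict with an "answer" key
print(target({"question": "Which country is Mount Kilimanjaro located in?"}))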
6. Define an evaluator
Import a prebuilt prompt from openevals and create an evaluator. outputs are the results of your target function. reference_outputs / referenceOutputs come from the example pairs you defined in step 4 above.
CORRECTNESS_PROMPT is simply an f-string with variables for "inputs", "outputs", and "reference_outputs".
See here for more information on customizing OpenEvals prompts; a customized-prompt sketch also follows the code below.
- Python
- TypeScript
from openevals.llm import create_llm_as_judge
from openevals.prompts import CORRECTNESS_PROMPT
def correctness_evaluator(inputs: dict, outputs: dict, reference_outputs: dict):
    evaluator = create_llm_as_judge(
        prompt=CORRECTNESS_PROMPT,
        model="openai:o3-mini",
        feedback_key="correctness",
    )
    eval_result = evaluator(
        inputs=inputs,
        outputs=outputs,
        reference_outputs=reference_outputs,
    )
    return eval_result
import { createLLMAsJudge, CORRECTNESS_PROMPT } from "openevals";
const correctnessEvaluator = async (params: {
  inputs: Record<string, unknown>;
  outputs: Record<string, unknown>;
  referenceOutputs?: Record<string, unknown>;
}) => {
  const evaluator = createLLMAsJudge({
    prompt: CORRECTNESS_PROMPT,
    model: "openai:o3-mini",
    feedbackKey: "correctness",
  });
  const evaluatorResult = await evaluator({
    inputs: params.inputs,
    outputs: params.outputs,
    referenceOutputs: params.referenceOutputs,
  });
  return evaluatorResult;
};
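Because CORRECTNESS_PROMPT is just an f-string over "inputs", "outputs", and "reference_outputs", you can swap in your own wording. A minimal sketch of a customized judge, reusing the create_llm_as_judge call from above (the prompt text and feedback key are illustrative):
from openevals.llm import create_llm_as_judge

# Hypothetical custom judge prompt; the placeholder names must stay
# "inputs", "outputs", and "reference_outputs" so the evaluator can fill them in
CUSTOM_CORRECTNESS_PROMPT = """You are grading an answer to a question.
Question: {inputs}
Submitted answer: {outputs}
Reference answer: {reference_outputs}
Judge whether the submitted answer is factually consistent with the reference answer."""

custom_correctness_evaluator = create_llm_as_judge(
    prompt=CUSTOM_CORRECTNESS_PROMPT,
    model="openai:o3-mini",
    feedback_key="custom_correctness",
)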
7. Run and view results
Finally, run the experiment!
- Python
- TypeScript
# After running the evaluation, a link will be provided to view the results in langsmith
experiment_results = client.evaluate(
    target,
    data="Sample dataset",
    evaluators=[
        correctness_evaluator,
        # can add multiple evaluators here
    ],
    experiment_prefix="first-eval-in-langsmith",
    max_concurrency=2,
)
import { evaluate } from "langsmith/evaluation";
// After running the evaluation, a link will be provided to view the results in langsmith
await evaluate(
  target,
  {
    data: "Sample dataset",
    evaluators: [
      correctnessEvaluator,
      // can add multiple evaluators here
    ],
    experimentPrefix: "first-eval-in-langsmith",
    maxConcurrency: 2,
  }
);
Click the link printed out by your evaluation run to access the LangSmith Experiments UI and explore the results of your experiment.
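As noted before step 1, evaluators don't have to be LLM-as-judge: any function that receives the example inputs, the target's outputs, and the reference outputs and returns a score can be added to the evaluators list above. A minimal sketch of a custom code evaluator (the function name, feedback key, and keyword heuristic are all illustrative):
def contains_reference_keyword(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    # Naive string check: does the answer mention the last word of the reference
    # answer (e.g. "Tanzania")? Real custom evaluators can use any logic you like.
    expected_keyword = reference_outputs["answer"].rstrip(".").split()[-1].lower()
    score = expected_keyword in outputs["answer"].lower()
    return {"key": "contains_reference_keyword", "score": score}

You could then pass contains_reference_keyword alongside correctness_evaluator in the evaluators list of client.evaluate.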

Next steps
To learn more about running experiments in LangSmith, read the evaluation conceptual guide.
- Check out the OpenEvals README to see all of the available prebuilt evaluators and how to customize them.
- Learn how to define custom evaluators that contain arbitrary code.
- See the how-to guides for answers to "How do I…?" format questions.
- See the tutorials for end-to-end walkthroughs.
- See the API reference for comprehensive descriptions of every class and function.
Or, if you prefer video tutorials, check out the datasets, evaluators, and experiments videos from the Introduction to LangSmith course.
1. Navigate to the Playground
LangSmith's Prompt Playground makes it possible to run evaluations over different prompts, new models, or different model configurations. Go to the Playground in the LangSmith UI.
2. Create a prompt
Modify the system prompt to:
Answer the following question accurately:
3. Create a dataset
Click Set up Evaluation, then use the + New button in the dropdown to create a new dataset.
Add the following examples to the dataset:
| Inputs | Reference Outputs |
|---|---|
| question: Which country is Mount Kilimanjaro located in? | output: Mount Kilimanjaro is located in Tanzania. |
| question: What is Earth's lowest point? | output: Earth's lowest point is The Dead Sea. |
Press Save to save your newly created dataset.
4. Add an evaluator
Click +Evaluator. Select Correctness from the pre-built evaluator options. Press Save.
5. Run your evaluation
Press Start on the top right to run your evaluation. Running this evaluation will create an experiment that you can view in full by clicking the experiment name.

Next steps
To learn more about running experiments in LangSmith, read the evaluation conceptual guide.
See the How-to guides for answers to “How do I…?” format questions.
- Learn how to create and manage datasets in the UI
- Learn how to run an evaluation from the prompt playground
If you prefer video tutorials, check out the Playground videos from the Introduction to LangSmith Course.