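How to define a custom evaluator

Key concepts

Evaluators

Custom evaluators are just functions that take a dataset example and the resulting application output, and return one or more metrics. These functions can be passed directly into evaluate() / aevaluate().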

Basic example

Python
TypeScript

Requires langsmith>=0.2.0

from langsmith import evaluate

def correct(outputs: dict, reference_outputs: dict) -> bool:
    """Check if the answer exactly matches the expected answer."""
    return outputs["answer"] == reference_outputs["answer"]

def dummy_app(inputs: dict) -> dict:
    return {"answer": "hmm i'm not sure", "reasoning": "i didn't understand the question"}

results = evaluate(
    dummy_app,
    data="dataset_name",
    evaluators=[correct]
)

Requires langsmith>=0.2.9

import type { EvaluationResult } from "langsmith/evaluation";

const correct = async ({ outputs, referenceOutputs }: {
  outputs: Record<string, any>;
  referenceOutputs?: Record<string, any>;
}): Promise<EvaluationResult> => {
  const score = outputs?.answer === referenceOutputs?.answer;
  return { key: "correct", score };
}

Evaluator args

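Custom evaluator functions must have specific argument names. They can take any subset of the following arguments:

run: Run: The full Run object generated by the application on the given example.
example: Example: The full dataset Example, including the example inputs, outputs (if available), and metadata (if available).
inputs: dict: A dictionary of the inputs corresponding to a single example in a dataset.
outputs: dict: A dictionary of the outputs generated by the application on the given inputs.
reference_outputs/referenceOutputs: dict: A dictionary of the reference outputs associated with the example, if available.

For most use cases you will only need inputs, outputs, and reference_outputs. run and example are useful only if you need extra trace or example metadata beyond the actual inputs and outputs of the application.

When using JS/TS, these should all be passed in as part of a single object argument.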
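As a minimal sketch of how the argument names select what an evaluator receives, the two functions below each take a different subset of the arguments. The function names and the sample data are hypothetical; the "question" and "answer" keys are assumed to match the dataset and application used elsewhere on this page. Because evaluators are plain functions, they can be called directly with keyword arguments as a local sanity check before being passed to evaluate().

```python
def exact_match(outputs: dict, reference_outputs: dict) -> bool:
    """Uses only the application outputs and the reference outputs."""
    predicted = outputs["answer"].strip().lower()
    expected = reference_outputs["answer"].strip().lower()
    return predicted == expected


def echoes_question(inputs: dict, outputs: dict) -> bool:
    """Uses the example inputs and the application outputs.

    Flags answers that simply repeat the question text.
    """
    return inputs["question"].strip().lower() in outputs["answer"].strip().lower()


# Hypothetical sample data, shaped like one dataset example and one app output.
sample_inputs = {"question": "What is 2 + 2?"}
sample_outputs = {"answer": " 4 "}
sample_reference = {"answer": "4"}

# Call the evaluators by keyword, mirroring the argument names they declare.
match_result = exact_match(outputs=sample_outputs, reference_outputs=sample_reference)
echo_result = echoes_question(inputs=sample_inputs, outputs=sample_outputs)
print(match_result, echo_result)
```

Neither function needs run or example, since both rely only on the inputs and outputs themselves.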

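Evaluator output

Custom evaluators are expected to return one of the following types:

Python and JS/TS

dict: dicts of the form {"score" | "value": ..., "key": ...} allow you to customize the metric type ("score" for numerical and "value" for categorical) and the metric name. This is useful if, for example, you want to log an integer as a categorical metric.

Python only

int | float | bool: this is interpreted as a continuous metric that can be averaged, sorted, etc. The function name is used as the name of the metric.
str: this is interpreted as a categorical metric. The function name is used as the name of the metric.
list[dict]: return multiple metrics using a single function.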
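The return types listed above can be illustrated with a short sketch. The function names, metric keys, and the length threshold are all hypothetical choices made for illustration; only the return shapes (str, dict with "key" plus "score" or "value", and list[dict]) come from the description above.

```python
def answer_length_bucket(outputs: dict) -> str:
    """Returning a str records a categorical metric named after the function."""
    return "short" if len(outputs["answer"]) < 20 else "long"


def word_count_as_category(outputs: dict) -> dict:
    """A dict with "value" logs the integer as a categorical metric.

    The "key" entry sets the metric name explicitly.
    """
    return {"key": "word_count", "value": len(outputs["answer"].split())}


def multiple_metrics(outputs: dict, reference_outputs: dict) -> list[dict]:
    """A list of dicts reports several metrics from a single function."""
    return [
        {"key": "exact_match", "score": outputs["answer"] == reference_outputs["answer"]},
        {"key": "answer_chars", "score": len(outputs["answer"])},
    ]


# Hypothetical sample data for a quick local check.
sample_outputs = {"answer": "hmm i'm not sure"}
sample_reference = {"answer": "4"}

bucket = answer_length_bucket(sample_outputs)
word_count = word_count_as_category(sample_outputs)
metrics = multiple_metrics(sample_outputs, sample_reference)
print(bucket, word_count, metrics)
```

Returning a dict is the only form that works in both Python and JS/TS, so it is the safer choice when the same evaluator logic has to be mirrored in both languages.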

Additional examples

Python
TypeScript

Requires langsmith>=0.2.0

from langsmith import evaluate, wrappers
from langsmith.schemas import Run, Example
from openai import AsyncOpenAI
# Assumes you've installed pydantic.
from pydantic import BaseModel

# We can still pass in Run and Example objects if we'd like
def correct_old_signature(run: Run, example: Example) -> dict:
    """Check if the answer exactly matches the expected answer."""
    return {"key": "correct", "score": run.outputs["answer"] == example.outputs["answer"]}

# Just evaluate actual outputs
def concision(outputs: dict) -> int:
    """Score how concise the answer is. 1 is the most concise, 5 is the least concise."""
    return min(len(outputs["answer"]) // 1000, 4) + 1

# Use an LLM-as-a-judge
oai_client = wrappers.wrap_openai(AsyncOpenAI())

async def valid_reasoning(inputs: dict, outputs: dict) -> bool:
    """Use an LLM to judge if the reasoning and the answer are consistent."""

    instructions = """\
Given the following question, answer, and reasoning, determine if the reasoning for the \
answer is logically valid and consistent with question and the answer."""

    class Response(BaseModel):
        reasoning_is_valid: bool

    msg = f"Question: {inputs['question']}\nAnswer: {outputs['answer']}\nReasoning: {outputs['reasoning']}"
    response = await oai_client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": instructions,}, {"role": "user", "content": msg}],
        response_format=Response
    )
    return response.choices[0].message.parsed.reasoning_is_valid

def dummy_app(inputs: dict) -> dict:
    return {"answer": "hmm i'm not sure", "reasoning": "i didn't understand the question"}

results = evaluate(
    dummy_app,
    data="dataset_name",
    evaluators=[correct_old_signature, concision, valid_reasoning]
)

import { Client } from "langsmith";
import { evaluate } from "langsmith/evaluation";
import { Run, Example } from "langsmith/schemas";
import OpenAI from "openai";

// Type definitions
interface AppInputs {
    question: string;
}

interface AppOutputs {
    answer: string;
    reasoning: string;
}

interface Response {
    reasoning_is_valid: boolean;
}

// Old signature evaluator
function correctOldSignature(run: Run, example: Example) {
    return {
        key: "correct",
        score: run.outputs?.["answer"] === example.outputs?.["answer"],
    };
}

// Output-only evaluator
function concision({ outputs }: { outputs: AppOutputs }) {
    return {
        key: "concision",
        score: Math.min(Math.floor(outputs.answer.length / 1000), 4) + 1,
    };
}

// LLM-as-judge evaluator
const openai = new OpenAI();

async function validReasoning({
    inputs,
    outputs
}: {
    inputs: AppInputs;
    outputs: AppOutputs;
}) {
    const instructions = `Given the following question, answer, and reasoning, determine if the reasoning for the answer is logically valid and consistent with the question and the answer. Respond with a JSON object of the form {"reasoning_is_valid": true | false}.`;

    const msg = `Question: ${inputs.question}\nAnswer: ${outputs.answer}\nReasoning: ${outputs.reasoning}`;

    // JSON mode requires that the prompt mention JSON and that the model supports it.
    const response = await openai.chat.completions.create({
        model: "gpt-4o-mini",
        messages: [
            { role: "system", content: instructions },
            { role: "user", content: msg }
        ],
        response_format: { type: "json_object" }
    });

    const parsed = JSON.parse(response.choices[0].message.content ?? "{}") as Response;

    return {
        key: "valid_reasoning",
        score: parsed.reasoning_is_valid ? 1 : 0
    };
}

// Example application
function dummyApp(inputs: AppInputs): AppOutputs {
    return {
        answer: "hmm i'm not sure",
        reasoning: "i didn't understand the question"
    };
}

const results = await evaluate(dummyApp, {
    data: "dataset_name",
    evaluators: [correctOldSignature, concision, validReasoning],
    client: new Client()
});

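Related

Evaluate aggregate experiment results: define summary evaluators, which compute metrics for an entire experiment.
Run an evaluation comparing two experiments: define pairwise evaluators, which compute metrics by comparing two (or more) experiments.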
