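Prevent logging of sensitive data in traces

In some situations, you may need to prevent the inputs and outputs of your traces from being logged for privacy or security reasons. LangSmith provides a way to filter the inputs and outputs of your traces before they are sent to the LangSmith backend.

If you want to completely hide the inputs and outputs of your traces, you can set the following environment variables when running your application: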

LANGSMITH_HIDE_INPUTS=true
LANGSMITH_HIDE_OUTPUTS=true

This works for both the LangSmith SDK (Python and TypeScript) and LangChain.
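If exporting shell variables is inconvenient, the same flags can be set from Python. This is a minimal sketch under one assumption: the variables are set before any tracing code runs, since environment lookups may be cached and a later change might not take effect.

```python
import os

# Assumption: set these before the first trace is created.
os.environ["LANGSMITH_HIDE_INPUTS"] = "true"
os.environ["LANGSMITH_HIDE_OUTPUTS"] = "true"
```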

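You can also customize and override this behavior for a given Client instance. This can be done by setting the hide_inputs and hide_outputs parameters on the Client object (hideInputs and hideOutputs in TypeScript).

For the example below, we will simply return an empty object for both hide_inputs and hide_outputs, but you can customize this to your needs.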

import openai
from langsmith import Client
from langsmith.wrappers import wrap_openai

openai_client = wrap_openai(openai.Client())
langsmith_client = Client(
    hide_inputs=lambda inputs: {}, hide_outputs=lambda outputs: {}
)

# The trace produced will have its metadata present, but the inputs and outputs will be hidden
openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    langsmith_extra={"client": langsmith_client},
)

# The trace produced will not have hidden inputs and outputs
openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
)
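Returning an empty object hides everything. As a sketch of a more selective customization, the function below masks only a few top-level keys and leaves the rest of the payload visible. The key names in SENSITIVE_KEYS and the function name hide_sensitive_keys are hypothetical, chosen only for illustration.

```python
# Hypothetical set of keys to hide; adjust to your own payloads.
SENSITIVE_KEYS = {"api_key", "password", "ssn"}

def hide_sensitive_keys(payload: dict) -> dict:
    """Return a new dict with the values of sensitive top-level keys masked."""
    return {
        key: "[HIDDEN]" if key in SENSITIVE_KEYS else value
        for key, value in payload.items()
    }

example = {"question": "What is my balance?", "password": "hunter2"}
masked = hide_sensitive_keys(example)
print(masked)
```

A function like this could then be passed as both hide_inputs and hide_outputs when constructing the Client, in the same way as the lambdas above.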

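Rule-based masking of inputs and outputs

Info

This feature is available in the following LangSmith SDK versions:

  • Python: 0.1.81 and above
  • TypeScript: 0.1.33 and above

To mask specific data in inputs and outputs, you can use the create_anonymizer / createAnonymizer function and pass the newly created anonymizer when instantiating the client. The anonymizer can be constructed either from a list of regex patterns and replacement values, or from a function that accepts and returns a string value.

The anonymizer is skipped for inputs if LANGSMITH_HIDE_INPUTS=true. The same applies to outputs if LANGSMITH_HIDE_OUTPUTS=true.

However, if inputs or outputs are to be sent to the client, the anonymizer takes precedence over the functions passed as hide_inputs and hide_outputs. By default, create_anonymizer only looks at most 10 nesting levels deep, which can be configured via the max_depth parameter.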

from langsmith.anonymizer import create_anonymizer
from langsmith import Client, traceable
import re

# Create an anonymizer from a list of regex patterns and replacement values
anonymizer = create_anonymizer([
    {"pattern": r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", "replace": "<email-address>"},
    {"pattern": r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}", "replace": "<UUID>"},
])

# Or create an anonymizer from a function that accepts and returns a string
email_pattern = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
uuid_pattern = re.compile(r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}")

anonymizer = create_anonymizer(
    lambda text: email_pattern.sub("<email-address>", uuid_pattern.sub("<UUID>", text))
)

client = Client(anonymizer=anonymizer)

@traceable(client=client)
def main(inputs: dict) -> dict:
    ...
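Before handing patterns to create_anonymizer, it can help to check them with plain re. The sketch below applies a rule list in order to a sample string. It is a standard-library approximation for testing patterns, not the SDK's implementation; apply_rules is a hypothetical helper, and the email pattern escapes the dot before the top-level domain.

```python
import re

RULES = [
    {"pattern": r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", "replace": "<email-address>"},
    {"pattern": r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}", "replace": "<UUID>"},
]

def apply_rules(text: str) -> str:
    """Apply each pattern/replacement pair in order (hypothetical helper)."""
    for rule in RULES:
        text = re.sub(rule["pattern"], rule["replace"], text)
    return text

sample = "Contact user@example.com about 123e4567-e89b-12d3-a456-426614174000."
result = apply_rules(sample)
print(result)  # Contact <email-address> about <UUID>.
```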

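Note that using the anonymizer may incur a performance hit with complex regular expressions or large payloads, because the anonymizer serializes the payload to JSON before processing.

Note

Improving the performance of the anonymizer API is on our roadmap! If you run into performance issues, please contact us at support@langchain.dev.

Older versions of the LangSmith SDK can use the hide_inputs and hide_outputs parameters to achieve the same effect. You can also use these parameters to process the inputs and outputs more efficiently.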

import re
from langsmith import Client, traceable

# Define the regex patterns for email addresses and UUIDs
EMAIL_REGEX = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
UUID_REGEX = r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"

def replace_sensitive_data(data, depth=10):
    # Stop descending once the depth budget is used up
    if depth == 0:
        return data

    if isinstance(data, dict):
        return {k: replace_sensitive_data(v, depth-1) for k, v in data.items()}
    elif isinstance(data, list):
        return [replace_sensitive_data(item, depth-1) for item in data]
    elif isinstance(data, str):
        data = re.sub(EMAIL_REGEX, "<email-address>", data)
        data = re.sub(UUID_REGEX, "<UUID>", data)
        return data
    else:
        return data

client = Client(
    hide_inputs=lambda inputs: replace_sensitive_data(inputs),
    hide_outputs=lambda outputs: replace_sensitive_data(outputs)
)

inputs = {"role": "user", "content": "Hello! My email is user@example.com and my ID is 123e4567-e89b-12d3-a456-426614174000."}
outputs = {"role": "assistant", "content": "Hi! I've noted your email as user@example.com and your ID as 123e4567-e89b-12d3-a456-426614174000."}

@traceable(client=client)
def child(inputs: dict) -> dict:
    return outputs

@traceable(client=client)
def parent(inputs: dict) -> dict:
    child_outputs = child(inputs)
    return child_outputs

parent(inputs)
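One consequence of the depth argument in a recursive helper like the one above is that strings nested deeper than the limit are returned unchanged. The sketch below shows this with a deliberately small depth; mask_emails is a hypothetical, trimmed-down version of the helper that handles email addresses only.

```python
import re

EMAIL_REGEX = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"

def mask_emails(data, depth=10):
    # Stop descending once the depth budget is exhausted.
    if depth == 0:
        return data
    if isinstance(data, dict):
        return {k: mask_emails(v, depth - 1) for k, v in data.items()}
    if isinstance(data, list):
        return [mask_emails(item, depth - 1) for item in data]
    if isinstance(data, str):
        return re.sub(EMAIL_REGEX, "<email-address>", data)
    return data

payload = {"a": {"b": "user@example.com"}}
shallow = mask_emails(payload, depth=2)  # budget runs out before the string is reached
deep = mask_emails(payload, depth=3)     # the string is reached and masked
```

If payloads can be deeply nested, it may be worth choosing the depth with the deepest expected structure in mind.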

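Processing inputs and outputs for a single function

Info

The process_outputs parameter is available in the LangSmith SDK for Python, version 0.1.98 and above.

In addition to client-level input and output processing, LangSmith provides function-level processing through the process_inputs and process_outputs parameters of the @traceable decorator.

These parameters accept functions that let you transform the inputs and outputs of a specific function before they are logged to LangSmith. This is useful for reducing payload size, removing sensitive information, or customizing how an object is serialized and represented in LangSmith for a particular function.

Here is an example of how to use process_inputs and process_outputs: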

from typing import Any

from langsmith import traceable

def process_inputs(inputs: dict) -> dict:
    # inputs is a dictionary where keys are argument names and values are the provided arguments
    # Return a new dictionary with processed inputs
    return {
        "processed_key": inputs.get("my_cool_key", "default"),
        "length": len(inputs.get("my_cool_key", ""))
    }

def process_outputs(output: Any) -> dict:
    # output is the direct return value of the function
    # Transform the output into a dictionary
    # In this case, "output" will be an integer
    return {"processed_output": str(output)}

@traceable(process_inputs=process_inputs, process_outputs=process_outputs)
def my_function(my_cool_key: str) -> int:
    # Function implementation
    return len(my_cool_key)

result = my_function("example")

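In this example, process_inputs creates a new dictionary with the processed input data, and process_outputs transforms the output into a specific format before it is logged to LangSmith.

Note

Avoid mutating the source objects in the processor functions. Instead, create and return new objects containing the processed data.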
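A minimal sketch of a non-mutating processor: it works on a deep copy, so the dictionary the traced function actually receives is left intact. The function name redact_password and the "credentials" key are hypothetical, used only for illustration.

```python
import copy

def redact_password(inputs: dict) -> dict:
    """Return a redacted deep copy; the caller's dictionary is left untouched."""
    processed = copy.deepcopy(inputs)
    if "credentials" in processed:
        processed["credentials"]["password"] = "[REDACTED]"
    return processed

original = {"credentials": {"user": "alice", "password": "s3cret"}}
logged = redact_password(original)
```

A function of this shape could be passed as process_inputs to @traceable. A shallow copy would not be enough here, because the nested dictionary would still be shared with the original.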

For asynchronous functions, the usage is similar:

@traceable(process_inputs=process_inputs, process_outputs=process_outputs)
async def async_function(key: str) -> int:
    # Async implementation
    return len(key)

When both are defined, these function-level processors take precedence over the client-level processors (hide_inputs and hide_outputs).

Quickstarts

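You can combine rule-based masking with various anonymizers to scrub sensitive information from inputs and outputs. In this guide, we cover working with regex, Microsoft Presidio, and Amazon Comprehend.

Regex

Info

The implementation below is not exhaustive and may miss some formats or edge cases. Test any implementation thoroughly before using it in production.

You can use regex to mask inputs and outputs before they are sent to LangSmith. The implementation below masks email addresses, phone numbers, full names, credit card numbers, and Social Security numbers.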

import re
import openai
from langsmith import Client
from langsmith.wrappers import wrap_openai

# Define regex patterns for various PII
SSN_PATTERN = re.compile(r'\b\d{3}-\d{2}-\d{4}\b')
CREDIT_CARD_PATTERN = re.compile(r'\b(?:\d[ -]*?){13,16}\b')
EMAIL_PATTERN = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,7}\b')
PHONE_PATTERN = re.compile(r'\b(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b')
FULL_NAME_PATTERN = re.compile(r'\b([A-Z][a-z]*\s[A-Z][a-z]*)\b')

def regex_anonymize(text):
    """
    Anonymize sensitive information in the text using regex patterns.

    Args:
        text (str): The input text to be anonymized.

    Returns:
        str: The anonymized text.
    """
    # Replace sensitive information with placeholders
    text = SSN_PATTERN.sub('[REDACTED SSN]', text)
    text = CREDIT_CARD_PATTERN.sub('[REDACTED CREDIT CARD]', text)
    text = EMAIL_PATTERN.sub('[REDACTED EMAIL]', text)
    text = PHONE_PATTERN.sub('[REDACTED PHONE]', text)
    text = FULL_NAME_PATTERN.sub('[REDACTED NAME]', text)
    return text

def recursive_anonymize(data, depth=10):
    """
    Recursively traverse the data structure and anonymize sensitive information.

    Args:
        data (any): The input data to be anonymized.
        depth (int): The remaining recursion depth, used to prevent excessive recursion.

    Returns:
        any: The anonymized data.
    """
    if depth == 0:
        return data

    if isinstance(data, dict):
        anonymized_dict = {}
        for k, v in data.items():
            anonymized_value = recursive_anonymize(v, depth - 1)
            anonymized_dict[k] = anonymized_value
        return anonymized_dict
    elif isinstance(data, list):
        anonymized_list = []
        for item in data:
            anonymized_item = recursive_anonymize(item, depth - 1)
            anonymized_list.append(anonymized_item)
        return anonymized_list
    elif isinstance(data, str):
        anonymized_data = regex_anonymize(data)
        return anonymized_data
    else:
        return data

openai_client = wrap_openai(openai.Client())

# Initialize the LangSmith client with the anonymization functions
langsmith_client = Client(
    hide_inputs=recursive_anonymize, hide_outputs=recursive_anonymize
)

# The trace produced will have its metadata present, but the inputs and outputs will be anonymized
response_with_anonymization = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "My name is John Doe, my SSN is 123-45-6789, my credit card number is 4111 1111 1111 1111, my email is john.doe@example.com, and my phone number is (123) 456-7890."},
    ],
    langsmith_extra={"client": langsmith_client},
)

# The trace produced will not have anonymized inputs and outputs
response_without_anonymization = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "My name is John Doe, my SSN is 123-45-6789, my credit card number is 4111 1111 1111 1111, my email is john.doe@example.com, and my phone number is (123) 456-7890."},
    ],
)

The anonymized run will look like this in LangSmith: [image: Anonymized run]

The non-anonymized run will look like this in LangSmith: [image: Non-anonymized run]
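As a concrete illustration of the edge cases mentioned in the info box, the full-name pattern matches any two adjacent capitalized words and misses lowercase names. This small sketch uses only the standard library and the same pattern as above.

```python
import re

FULL_NAME_PATTERN = re.compile(r'\b([A-Z][a-z]*\s[A-Z][a-z]*)\b')

# "New York" is not a person, but it has the same shape as a full name.
over_redacted = FULL_NAME_PATTERN.sub('[REDACTED NAME]', "John Doe flew to New York.")
print(over_redacted)  # [REDACTED NAME] flew to [REDACTED NAME].

# A name typed in lowercase is not matched at all.
missed = FULL_NAME_PATTERN.sub('[REDACTED NAME]', "john doe called")
print(missed)  # john doe called
```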

Microsoft Presidio

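Info

The implementation below provides a general example of how to anonymize sensitive information in messages exchanged between a user and an LLM. It is not exhaustive and does not account for all cases. Test any implementation thoroughly before using it in production.

Microsoft Presidio is a data protection and de-identification SDK. The implementation below uses Presidio to anonymize inputs and outputs before they are sent to LangSmith. For up-to-date information, refer to Presidio's official documentation.

To use Presidio and its spaCy model, install the following: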

pip install presidio-analyzer
pip install presidio-anonymizer
python -m spacy download en_core_web_lg

Also, install OpenAI:

pip install openai

import openai
from langsmith import Client
from langsmith.wrappers import wrap_openai
from presidio_anonymizer import AnonymizerEngine
from presidio_analyzer import AnalyzerEngine

anonymizer = AnonymizerEngine()
analyzer = AnalyzerEngine()

def presidio_anonymize(data):
    """
    Anonymize sensitive information sent by the user or returned by the model.

    Args:
        data (any): The data to be anonymized.

    Returns:
        any: The anonymized data.
    """
    # Inputs carry a 'messages' list; outputs carry the message under 'choices'
    message_list = (
        data.get('messages') or [data.get('choices', [{}])[0].get('message')]
    )
    if not message_list or not all(isinstance(msg, dict) and msg for msg in message_list):
        return data
    for message in message_list:
        content = message.get('content', '')
        if not content.strip():
            print("Empty content detected. Skipping anonymization.")
            continue
        results = analyzer.analyze(
            text=content,
            entities=["PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS", "US_SSN"],
            language='en'
        )
        anonymized_result = anonymizer.anonymize(
            text=content,
            analyzer_results=results
        )
        message['content'] = anonymized_result.text
    return data

openai_client = wrap_openai(openai.Client())

# Initialize the LangSmith client with the anonymization functions
langsmith_client = Client(
    hide_inputs=presidio_anonymize, hide_outputs=presidio_anonymize
)

# The trace produced will have its metadata present, but the inputs and outputs will be anonymized
response_with_anonymization = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "My name is Slim Shady, call me at 313-666-7440 or email me at real.slim.shady@gmail.com"},
    ],
    langsmith_extra={"client": langsmith_client},
)

# The trace produced will not have anonymized inputs and outputs
response_without_anonymization = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "My name is Slim Shady, call me at 313-666-7440 or email me at real.slim.shady@gmail.com"},
    ],
)

The anonymized run will look like this in LangSmith: [image: Anonymized run]

The non-anonymized run will look like this in LangSmith: [image: Non-anonymized run]
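The message lookup at the top of the anonymize function handles inputs and outputs with a single expression. The sketch below isolates that expression to show what it returns for each case. The payload shapes are assumptions modeled on OpenAI chat completion requests and responses, and extract_messages is a hypothetical helper name.

```python
def extract_messages(data: dict) -> list:
    """Mirror the message lookup used in the anonymize function above."""
    return data.get("messages") or [data.get("choices", [{}])[0].get("message")]

# Assumed input shape: the request arguments, with a "messages" list.
inputs = {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Hi"}]}
# Assumed output shape: the response, with the message under "choices".
outputs = {"choices": [{"message": {"role": "assistant", "content": "Hello"}}]}

from_inputs = extract_messages(inputs)
from_outputs = extract_messages(outputs)
from_empty = extract_messages({})  # [None], which the later guard turns into a no-op
```

Note that a payload with an empty "choices" list would raise an IndexError in this expression, so a production version may want an explicit check for that case.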

Amazon Comprehend

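Info

The implementation below provides a general example of how to anonymize sensitive information in messages exchanged between a user and an LLM. It is not exhaustive and does not account for all cases. Test any implementation thoroughly before using it in production.

Comprehend is a natural language processing service that can detect personally identifiable information. The implementation below uses Comprehend to anonymize inputs and outputs before they are sent to LangSmith. For up-to-date information, refer to Comprehend's official documentation.

To use Comprehend, install boto3:

pip install boto3

Also, install OpenAI:

pip install openai

You also need to set up credentials in AWS and authenticate using the AWS CLI. Follow the instructions here.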

import openai
import boto3
from langsmith import Client
from langsmith.wrappers import wrap_openai

comprehend = boto3.client('comprehend', region_name='us-east-1')

def redact_pii_entities(text, entities):
    """
    Redact PII entities in the text based on the detected entities.

    Args:
        text (str): The original text containing PII.
        entities (list): A list of detected PII entities.

    Returns:
        str: The text with PII entities redacted.
    """
    # Process entities from the end of the text so earlier offsets stay valid
    sorted_entities = sorted(entities, key=lambda x: x['BeginOffset'], reverse=True)

    redacted_text = text
    for entity in sorted_entities:
        begin = entity['BeginOffset']
        end = entity['EndOffset']
        entity_type = entity['Type']
        # Define the redaction placeholder based on entity type
        placeholder = f"[{entity_type}]"
        # Replace the PII in the text with the placeholder
        redacted_text = redacted_text[:begin] + placeholder + redacted_text[end:]

    return redacted_text

def detect_pii(text):
    """
    Detect PII entities in the given text using AWS Comprehend.

    Args:
        text (str): The text to analyze.

    Returns:
        list: A list of detected PII entities.
    """
    try:
        response = comprehend.detect_pii_entities(
            Text=text,
            LanguageCode='en',
        )
        entities = response.get('Entities', [])
        return entities
    except Exception as e:
        print(f"Error detecting PII: {e}")
        return []

def comprehend_anonymize(data):
    """
    Anonymize sensitive information sent by the user or returned by the model.

    Args:
        data (any): The input data to be anonymized.

    Returns:
        any: The anonymized data.
    """
    # Inputs carry a 'messages' list; outputs carry the message under 'choices'
    message_list = (
        data.get('messages') or [data.get('choices', [{}])[0].get('message')]
    )
    if not message_list or not all(isinstance(msg, dict) and msg for msg in message_list):
        return data
    for message in message_list:
        content = message.get('content', '')
        if not content.strip():
            print("Empty content detected. Skipping anonymization.")
            continue
        entities = detect_pii(content)
        if entities:
            anonymized_text = redact_pii_entities(content, entities)
            message['content'] = anonymized_text
        else:
            print("No PII detected. Content remains unchanged.")

    return data

openai_client = wrap_openai(openai.Client())

# Initialize the LangSmith client with the anonymization functions
langsmith_client = Client(
    hide_inputs=comprehend_anonymize, hide_outputs=comprehend_anonymize
)

# The trace produced will have its metadata present, but the inputs and outputs will be anonymized
response_with_anonymization = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "My name is Slim Shady, call me at 313-666-7440 or email me at real.slim.shady@gmail.com"},
    ],
    langsmith_extra={"client": langsmith_client},
)

# The trace produced will not have anonymized inputs and outputs
response_without_anonymization = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "My name is Slim Shady, call me at 313-666-7440 or email me at real.slim.shady@gmail.com"},
    ],
)

The anonymized run will look like this in LangSmith: [image: Anonymized run]

The non-anonymized run will look like this in LangSmith: [image: Non-anonymized run]
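The offset-based redaction step can be tried without an AWS account. The sketch below feeds hand-written entities into the same replace-from-the-end logic. The entities are made up for illustration and are not real Comprehend output; they only reuse the Type, BeginOffset, and EndOffset fields that the code above reads.

```python
def redact(text: str, entities: list) -> str:
    # Replace from the end of the text so earlier offsets stay valid.
    for entity in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        placeholder = f"[{entity['Type']}]"
        text = text[:entity["BeginOffset"]] + placeholder + text[entity["EndOffset"]:]
    return text

text = "Call Bob at 555-1234"
# Hand-written entities for illustration only.
entities = [
    {"Type": "NAME", "BeginOffset": 5, "EndOffset": 8},
    {"Type": "PHONE", "BeginOffset": 12, "EndOffset": 20},
]
redacted = redact(text, entities)
print(redacted)  # Call [NAME] at [PHONE]
```

Processing in forward order would shift every later offset as soon as a placeholder of a different length is inserted, which is why the entities are sorted in reverse.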
