如何通过 CSV 进行问答
LLM 非常适合在各种类型的数据源上构建问答系统。在本节中,我们将介绍如何对存储在 CSV 文件中的数据构建 Q&A 系统。与使用 SQL 数据库一样,使用 CSV 文件的关键是让 LLM 能够访问用于查询数据并与之交互的工具。执行此作的两种主要方法是:
- 建议:将 CSV 加载到 SQL 数据库中,并使用 SQL 教程中概述的方法。
- 为 LLM 提供对 Python 环境的访问权限,在该环境中,它可以使用 Pandas 等库与数据进行交互。
我们将在本指南中介绍这两种方法。
⚠️ 安全说明 ⚠️
上述两种方法都存在重大风险。使用 SQL 需要执行模型生成的 SQL 查询。使用像 Pandas 这样的库需要让模型执行 Python 代码。由于与沙盒 Python 环境相比,严格限定 SQL 连接权限范围和清理 SQL 查询更容易,因此我们强烈建议通过 SQL 与 CSV 数据交互。有关一般安全最佳实践的更多信息,请参阅此处。
设置
本指南的依赖项:
%pip install -qU langchain langchain-openai langchain-community langchain-experimental pandas
设置所需的环境变量:
# Using LangSmith is recommended but not required. Uncomment below lines to use.
# import os
# os.environ["LANGSMITH_TRACING"] = "true"
# os.environ["LANGSMITH_API_KEY"] = getpass.getpass()
如果您还没有 Titanic 数据集,请下载 Titanic 数据集:
!wget https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv -O titanic.csv
import pandas as pd
df = pd.read_csv("titanic.csv")
print(df.shape)
print(df.columns.tolist())
(887, 8)
['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Fare']
SQL
推荐使用 SQL 与 CSV 数据交互,因为与使用任意 Python 相比,它更容易限制权限和清理查询。
大多数 SQL 数据库都可以轻松地将 CSV 文件作为表加载(DuckDB、SQLite 等)。完成此作后,您可以使用 SQL 教程中概述的所有链和代理创建技术。下面是一个快速示例,说明我们如何使用 SQLite 执行此作:
from langchain_community.utilities import SQLDatabase
from sqlalchemy import create_engine
engine = create_engine("sqlite:///titanic.db")
df.to_sql("titanic", engine, index=False)
887
db = SQLDatabase(engine=engine)
print(db.dialect)
print(db.get_usable_table_names())
print(db.run("SELECT * FROM titanic WHERE Age < 2;"))
sqlite
['titanic']
[(1, 2, 'Master. Alden Gates Caldwell', 'male', 0.83, 0, 2, 29.0), (0, 3, 'Master. Eino Viljami Panula', 'male', 1.0, 4, 1, 39.6875), (1, 3, 'Miss. Eleanor Ileen Johnson', 'female', 1.0, 1, 1, 11.1333), (1, 2, 'Master. Richard F Becker', 'male', 1.0, 2, 1, 39.0), (1, 1, 'Master. Hudson Trevor Allison', 'male', 0.92, 1, 2, 151.55), (1, 3, 'Miss. Maria Nakid', 'female', 1.0, 0, 2, 15.7417), (0, 3, 'Master. Sidney Leonard Goodwin', 'male', 1.0, 5, 2, 46.9), (1, 3, 'Miss. Helene Barbara Baclini', 'female', 0.75, 2, 1, 19.2583), (1, 3, 'Miss. Eugenie Baclini', 'female', 0.75, 2, 1, 19.2583), (1, 2, 'Master. Viljo Hamalainen', 'male', 0.67, 1, 1, 14.5), (1, 3, 'Master. Bertram Vere Dean', 'male', 1.0, 1, 2, 20.575), (1, 3, 'Master. Assad Alexander Thomas', 'male', 0.42, 0, 1, 8.5167), (1, 2, 'Master. Andre Mallet', 'male', 1.0, 0, 2, 37.0042), (1, 2, 'Master. George Sibley Richards', 'male', 0.83, 1, 1, 18.75)]
并创建一个 SQL 代理来与之交互:
pip install -qU "langchain[openai]"
import getpass
import os
if not os.environ.get("OPENAI_API_KEY"):
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")
from langchain.chat_models import init_chat_model
llm = init_chat_model("gpt-4o-mini", model_provider="openai")
from langchain_community.agent_toolkits import create_sql_agent
agent_executor = create_sql_agent(llm, db=db, agent_type="openai-tools", verbose=True)
agent_executor.invoke({"input": "what's the average age of survivors"})
[1m> Entering new SQL Agent Executor chain...[0m
[32;1m[1;3m
Invoking: `sql_db_list_tables` with `{}`
[0m[38;5;200m[1;3mtitanic[0m[32;1m[1;3m
Invoking: `sql_db_schema` with `{'table_names': 'titanic'}`
[0m[33;1m[1;3m
CREATE TABLE titanic (
"Survived" BIGINT,
"Pclass" BIGINT,
"Name" TEXT,
"Sex" TEXT,
"Age" FLOAT,
"Siblings/Spouses Aboard" BIGINT,
"Parents/Children Aboard" BIGINT,
"Fare" FLOAT
)
/*
3 rows from titanic table:
Survived Pclass Name Sex Age Siblings/Spouses Aboard Parents/Children Aboard Fare
0 3 Mr. Owen Harris Braund male 22.0 1 0 7.25
1 1 Mrs. John Bradley (Florence Briggs Thayer) Cumings female 38.0 1 0 71.2833
1 3 Miss. Laina Heikkinen female 26.0 0 0 7.925
*/[0m[32;1m[1;3m
Invoking: `sql_db_query` with `{'query': 'SELECT AVG(Age) AS Average_Age FROM titanic WHERE Survived = 1'}`
[0m[36;1m[1;3m[(28.408391812865496,)][0m[32;1m[1;3mThe average age of survivors in the Titanic dataset is approximately 28.41 years.[0m
[1m> Finished chain.[0m
{'input': "what's the average age of survivors",
'output': 'The average age of survivors in the Titanic dataset is approximately 28.41 years.'}
这种方法很容易推广到多个 CSV,因为我们只需将每个 CSV 作为自己的表加载到我们的数据库中即可。请参阅下面的多个 CSV 部分。
熊猫
除了 SQL,我们还可以使用 pandas 等数据分析库和 LLM 的代码生成功能来与 CSV 数据进行交互。同样,除非您有广泛的保护措施,否则此方法不适合生产使用案例。因此,我们的代码执行工具和构造函数位于langchain-experimental包。
Chains
大多数 LLM 都接受过足够多的 pandas Python 代码的训练,他们只需被要求即可生成代码:
ai_msg = llm.invoke(
"I have a pandas DataFrame 'df' with columns 'Age' and 'Fare'. Write code to compute the correlation between the two columns. Return Markdown for a Python code snippet and nothing else."
)
print(ai_msg.content)
\`\`\`python
correlation = df['Age'].corr(df['Fare'])
correlation
\`\`\`
我们可以将此功能与 Python 执行工具相结合,以创建一个简单的数据分析链。我们首先需要将 CSV 表加载为 DataFrame,并授予该工具对此 DataFrame 的访问权限:
import pandas as pd
from langchain_core.prompts import ChatPromptTemplate
from langchain_experimental.tools import PythonAstREPLTool
df = pd.read_csv("titanic.csv")
tool = PythonAstREPLTool(locals={"df": df})
tool.invoke("df['Fare'].mean()")
32.30542018038331
为了帮助强制正确使用我们的 Python 工具,我们将使用工具调用:
llm_with_tools = llm.bind_tools([tool], tool_choice=tool.name)
response = llm_with_tools.invoke(
"I have a dataframe 'df' and want to know the correlation between the 'Age' and 'Fare' columns"
)
response
AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_SBrK246yUbdnJemXFC8Iod05', 'function': {'arguments': '{"query":"df.corr()[\'Age\'][\'Fare\']"}', 'name': 'python_repl_ast'}, 'type': 'function'}]}, response_metadata={'token_usage': {'completion_tokens': 13, 'prompt_tokens': 125, 'total_tokens': 138}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': 'fp_3b956da36b', 'finish_reason': 'stop', 'logprobs': None}, id='run-1fd332ba-fa72-4351-8182-d464e7368311-0', tool_calls=[{'name': 'python_repl_ast', 'args': {'query': "df.corr()['Age']['Fare']"}, 'id': 'call_SBrK246yUbdnJemXFC8Iod05'}])
response.tool_calls
[{'name': 'python_repl_ast',
'args': {'query': "df.corr()['Age']['Fare']"},
'id': 'call_SBrK246yUbdnJemXFC8Iod05'}]
我们将添加一个 tools 输出解析器,以将函数调用提取为 dict:
from langchain_core.output_parsers.openai_tools import JsonOutputKeyToolsParser
parser = JsonOutputKeyToolsParser(key_name=tool.name, first_tool_only=True)
(llm_with_tools | parser).invoke(
"I have a dataframe 'df' and want to know the correlation between the 'Age' and 'Fare' columns"
)
{'query': "df[['Age', 'Fare']].corr()"}
并与 prompt 结合使用,这样我们就可以指定一个问题,而无需在每次调用时都指定 dataframe 信息:
system = f"""You have access to a pandas dataframe `df`. \
Here is the output of `df.head().to_markdown()`:
\`\`\`
{df.head().to_markdown()}
\`\`\`
Given a user question, write the Python code to answer it. \
Return ONLY the valid Python code and nothing else. \
Don't assume you have access to any libraries other than built-in Python ones and pandas."""
prompt = ChatPromptTemplate.from_messages([("system", system), ("human", "{question}")])
code_chain = prompt | llm_with_tools | parser
code_chain.invoke({"question": "What's the correlation between age and fare"})
{'query': "df[['Age', 'Fare']].corr()"}
最后,我们将添加 Python 工具,以便实际执行生成的代码:
chain = prompt | llm_with_tools | parser | tool
chain.invoke({"question": "What's the correlation between age and fare"})
0.11232863699941621
就这样,我们有一个简单的数据分析链。我们可以通过查看 LangSmith 轨迹来获得中间步骤的峰值:https://smith.langchain.com/public/b1309290-7212-49b7-bde2-75b39a32b49a/r
我们可以在最后添加一个额外的 LLM 调用来生成对话式响应,这样我们就不仅仅是使用工具输出进行响应。为此,我们需要添加聊天记录MessagesPlaceholder到我们的提示符中:
from operator import itemgetter
from langchain_core.messages import ToolMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import MessagesPlaceholder
from langchain_core.runnables import RunnablePassthrough
system = f"""You have access to a pandas dataframe `df`. \
Here is the output of `df.head().to_markdown()`:
\`\`\`
{df.head().to_markdown()}
\`\`\`
Given a user question, write the Python code to answer it. \
Don't assume you have access to any libraries other than built-in Python ones and pandas.
Respond directly to the question once you have enough information to answer it."""
prompt = ChatPromptTemplate.from_messages(
[
(
"system",
system,
),
("human", "{question}"),
# This MessagesPlaceholder allows us to optionally append an arbitrary number of messages
# at the end of the prompt using the 'chat_history' arg.
MessagesPlaceholder("chat_history", optional=True),
]
)
def _get_chat_history(x: dict) -> list:
"""Parse the chain output up to this point into a list of chat history messages to insert in the prompt."""
ai_msg = x["ai_msg"]
tool_call_id = x["ai_msg"].additional_kwargs["tool_calls"][0]["id"]
tool_msg = ToolMessage(tool_call_id=tool_call_id, content=str(x["tool_output"]))
return [ai_msg, tool_msg]
chain = (
RunnablePassthrough.assign(ai_msg=prompt | llm_with_tools)
.assign(tool_output=itemgetter("ai_msg") | parser | tool)
.assign(chat_history=_get_chat_history)
.assign(response=prompt | llm | StrOutputParser())
.pick(["tool_output", "response"])
)
chain.invoke({"question": "What's the correlation between age and fare"})
{'tool_output': 0.11232863699941616,
'response': 'The correlation between age and fare is approximately 0.1123.'}
以下是此运行的 LangSmith 跟踪:https://smith.langchain.com/public/14e38d70-45b1-4b81-8477-9fd2b7c07ea6/r
代理
对于复杂问题,LLM 能够迭代执行代码,同时保持其先前执行的输入和输出可能会有所帮助。这就是 Agent 发挥作用的地方。它们允许 LLM 决定需要调用工具多少次,并跟踪到目前为止的执行情况。create_pandas_dataframe_agent 是一个内置代理,可以轻松使用 DataFrame:
from langchain_experimental.agents import create_pandas_dataframe_agent
agent = create_pandas_dataframe_agent(
llm, df, agent_type="openai-tools", verbose=True, allow_dangerous_code=True
)
agent.invoke(
{
"input": "What's the correlation between age and fare? is that greater than the correlation between fare and survival?"
}
)
[1m> Entering new AgentExecutor chain...[0m
[32;1m[1;3m
Invoking: `python_repl_ast` with `{'query': "df[['Age', 'Fare']].corr().iloc[0,1]"}`
[0m[36;1m[1;3m0.11232863699941621[0m[32;1m[1;3m
Invoking: `python_repl_ast` with `{'query': "df[['Fare', 'Survived']].corr().iloc[0,1]"}`
[0m[36;1m[1;3m0.2561785496289603[0m[32;1m[1;3mThe correlation between Age and Fare is approximately 0.112, and the correlation between Fare and Survival is approximately 0.256.
Therefore, the correlation between Fare and Survival (0.256) is greater than the correlation between Age and Fare (0.112).[0m
[1m> Finished chain.[0m
{'input': "What's the correlation between age and fare? is that greater than the correlation between fare and survival?",
'output': 'The correlation between Age and Fare is approximately 0.112, and the correlation between Fare and Survival is approximately 0.256.\n\nTherefore, the correlation between Fare and Survival (0.256) is greater than the correlation between Age and Fare (0.112).'}
以下是此运行的 LangSmith 跟踪:https://smith.langchain.com/public/6a86aee2-4f22-474a-9264-bd4c7283e665/r
多个 CSV
要处理多个 CSV(或 DataFrame),我们只需要将多个 DataFrame 传递给我们的 Python 工具。我们create_pandas_dataframe_agentconstructor 可以开箱即用地做到这一点,我们可以传入一个 DataFrame 列表,而不仅仅是一个。如果我们自己构建一条链,我们可以做这样的事情:
df_1 = df[["Age", "Fare"]]
df_2 = df[["Fare", "Survived"]]
tool = PythonAstREPLTool(locals={"df_1": df_1, "df_2": df_2})
llm_with_tool = llm.bind_tools(tools=[tool], tool_choice=tool.name)
df_template = """\`\`\`python
{df_name}.head().to_markdown()
>>> {df_head}
\`\`\`"""
df_context = "\n\n".join(
df_template.format(df_head=_df.head().to_markdown(), df_name=df_name)
for _df, df_name in [(df_1, "df_1"), (df_2, "df_2")]
)
system = f"""You have access to a number of pandas dataframes. \
Here is a sample of rows from each dataframe and the python code that was used to generate the sample:
{df_context}
Given a user question about the dataframes, write the Python code to answer it. \
Don't assume you have access to any libraries other than built-in Python ones and pandas. \
Make sure to refer only to the variables mentioned above."""
prompt = ChatPromptTemplate.from_messages([("system", system), ("human", "{question}")])
chain = prompt | llm_with_tool | parser | tool
chain.invoke(
{
"question": "return the difference in the correlation between age and fare and the correlation between fare and survival"
}
)
0.14384991262954416
以下是此运行的 LangSmith 跟踪:https://smith.langchain.com/public/cc2a7d7f-7c5a-4e77-a10c-7b5420fcd07f/r
沙箱代码执行
有许多工具(如 E2B 和 Bearly)为 Python 代码执行提供沙盒环境,以实现更安全的代码执行链和代理。
后续步骤
对于更高级的数据分析应用程序,我们建议您查看:
- SQL 教程:使用 SQL 数据库和 CSV 的许多挑战对于任何结构化数据类型都是通用的,因此即使您使用 Pandas 进行 CSV 数据分析,阅读 SQL 技术也很有用。
- 工具使用:有关使用调用工具的链和代理时的一般最佳实践的指南
- 代理:了解构建 LLM 代理的基础知识。
- 集成:E2B 和 Earlyly 等沙盒环境、SQLDatabase 等实用程序、Spark DataFrame 代理等相关代理。