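How to load CSVs
A comma-separated values (CSV) file is a delimited text file that uses commas to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas.
LangChain implements a CSV Loader that loads a CSV file into a sequence of Document objects. Each row of the CSV file is converted to one document.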
from langchain_community.document_loaders.csv_loader import CSVLoader
file_path = "../integrations/document_loaders/example_data/mlb_teams_2012.csv"
loader = CSVLoader(file_path=file_path)
data = loader.load()
for record in data[:2]:
    print(record)
API Reference: CSVLoader
page_content='Team: Nationals\n"Payroll (millions)": 81.34\n"Wins": 98' metadata={'source': '../../../docs/integrations/document_loaders/example_data/mlb_teams_2012.csv', 'row': 0}
page_content='Team: Reds\n"Payroll (millions)": 82.20\n"Wins": 97' metadata={'source': '../../../docs/integrations/document_loaders/example_data/mlb_teams_2012.csv', 'row': 1}
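To make the row-to-document mapping concrete, here is a minimal sketch that uses only the standard library. It approximates the shape of the output above and is not CSVLoader's actual implementation: the helper name rows_to_records, the plain dicts standing in for Document objects, and the sample data are all assumptions made for illustration.

```python
import csv
import io


def rows_to_records(csv_text, source):
    """Turn each CSV row into a dict shaped like a Document (hypothetical helper)."""
    records = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for i, row in enumerate(reader):
        # One "column: value" line per field, joined with newlines.
        content = "\n".join(
            f"{key.strip()}: {value.strip()}" for key, value in row.items()
        )
        records.append(
            {"page_content": content, "metadata": {"source": source, "row": i}}
        )
    return records


# Illustrative data, not the MLB file used in this guide.
sample = "Team,Wins\nNationals,98\nReds,97\n"
records = rows_to_records(sample, source="teams.csv")
print(records[0])
```

The sketch assumes well-formed rows; a row with fewer fields than the header would yield None values and need extra handling.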
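Customizing the CSV parsing and loading
CSVLoader accepts a csv_args kwarg that customizes the arguments passed to Python's csv.DictReader. See the csv module documentation for more information on which csv args are supported.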
loader = CSVLoader(
    file_path=file_path,
    csv_args={
        "delimiter": ",",
        "quotechar": '"',
        "fieldnames": ["MLB Team", "Payroll in millions", "Wins"],
    },
)
data = loader.load()
for record in data[:2]:
    print(record)
page_content='MLB Team: Team\nPayroll in millions: "Payroll (millions)"\nWins: "Wins"' metadata={'source': '../../../docs/integrations/document_loaders/example_data/mlb_teams_2012.csv', 'row': 0}
page_content='MLB Team: Nationals\nPayroll in millions: 81.34\nWins: 98' metadata={'source': '../../../docs/integrations/document_loaders/example_data/mlb_teams_2012.csv', 'row': 1}
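Note that row 0 in the output above is the file's original header line. That follows from how csv.DictReader treats fieldnames: when they are supplied, the first line is no longer consumed as a header and is returned as data. The sketch below shows this with the standard library alone; the sample data and variable names are assumptions for illustration.

```python
import csv
import io

# Illustrative data, not the MLB file used in this guide.
sample = "Team,Wins\nNationals,98\n"

# Without fieldnames, the first line supplies the keys and is not returned as data.
default_rows = list(csv.DictReader(io.StringIO(sample)))

# With fieldnames, every line is data, including the original header line.
renamed_rows = list(
    csv.DictReader(io.StringIO(sample), fieldnames=["MLB Team", "Wins"])
)

print(default_rows)
print(renamed_rows)
```

If the header line should not become a document, one option is to drop the first loaded record after renaming the columns.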
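Specify a column to identify the document source
The "source" key in the Document metadata can be set from a column of the CSV. Use the source_column argument to specify the source for the document created from each row. Otherwise, file_path is used as the source for all documents created from the CSV file.
This is useful when documents loaded from CSV files are passed to chains that answer questions using sources.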
loader = CSVLoader(file_path=file_path, source_column="Team")
data = loader.load()
for record in data[:2]:
    print(record)
page_content='Team: Nationals\n"Payroll (millions)": 81.34\n"Wins": 98' metadata={'source': 'Nationals', 'row': 0}
page_content='Team: Reds\n"Payroll (millions)": 82.20\n"Wins": 97' metadata={'source': 'Reds', 'row': 1}
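The choice between a per-row source and a per-file source can be sketched as follows. This is a simplified illustration of the rule described in this section, not the library's code: the helper source_for_row is hypothetical, and raising ValueError for a missing column is an assumption of this sketch.

```python
import csv
import io


def source_for_row(row, file_path, source_column=None):
    """Pick the metadata source for one row (hypothetical helper)."""
    if source_column is None:
        # No column requested: every row shares the file path as its source.
        return file_path
    if source_column not in row:
        # Assumed behavior for this sketch: fail loudly on a bad column name.
        raise ValueError(f"Source column '{source_column}' not found in CSV file.")
    return row[source_column]


# Illustrative data, not the MLB file used in this guide.
sample = "Team,Wins\nNationals,98\nReds,97\n"
rows = list(csv.DictReader(io.StringIO(sample)))

per_row = [source_for_row(r, "teams.csv", source_column="Team") for r in rows]
per_file = [source_for_row(r, "teams.csv") for r in rows]
print(per_row)
print(per_file)
```

A per-row source lets a downstream answer cite the specific record it drew from, rather than the whole file.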
Load from a string
Python's tempfile module can be used when working with CSV strings directly.
import tempfile
string_data = """
"Team", "Payroll (millions)", "Wins"
"Nationals", 81.34, 98
"Reds", 82.20, 97
"Yankees", 197.96, 95
"Giants", 117.62, 94
""".strip()
# delete=False keeps the file on disk after the with block closes it,
# so the loader can reopen it by path.
with tempfile.NamedTemporaryFile(delete=False, mode="w+") as temp_file:
    temp_file.write(string_data)
    temp_file_path = temp_file.name
# source_column="Team" matches the 'source' values shown in the output below.
loader = CSVLoader(file_path=temp_file_path, source_column="Team")
data = loader.load()
for record in data[:2]:
    print(record)
page_content='Team: Nationals\n"Payroll (millions)": 81.34\n"Wins": 98' metadata={'source': 'Nationals', 'row': 0}
page_content='Team: Reds\n"Payroll (millions)": 82.20\n"Wins": 97' metadata={'source': 'Reds', 'row': 1}
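The quoted keys in the output above, such as "Payroll (millions)", come from the space after each comma in the string data. With the csv module's default settings, a quote character that is not the first character of a field is kept literally. The sketch below demonstrates this with the standard library and shows the effect of the skipinitialspace option; since csv_args is forwarded to csv.DictReader, the same option could presumably be supplied there, though that is an assumption not verified here.

```python
import csv
import io

# Two lines in the same style as the string data in this section.
string_data = '"Team", "Payroll (millions)", "Wins"\n"Nationals", 81.34, 98\n'

# Default parsing: the leading space makes the following quotes literal text.
plain = next(csv.DictReader(io.StringIO(string_data)))

# skipinitialspace=True drops the space after each delimiter, so the
# quotes are recognized as field quoting and removed.
trimmed = next(csv.DictReader(io.StringIO(string_data), skipinitialspace=True))

print(list(plain.keys()))
print(list(trimmed.keys()))
```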