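How to construct knowledge graphs
In this guide we'll go over the basic ways of constructing a knowledge graph from unstructured text. The constructed graph can then be used as a knowledge base in a RAG application.
⚠️ Security note ⚠️
Constructing knowledge graphs requires write access to the database. There are inherent risks in doing this. Make sure that you verify and validate data before importing it. For more on general security best practices, see here.
Architecture
At a high level, the steps of constructing a knowledge graph from text are:
- Extracting structured information from text: a model is used to extract structured graph information from text.
- Storing into a graph database: the extracted structured graph information is stored in a graph database, which enables downstream RAG applications.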
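The two steps above can be sketched without any external services. This is only an illustration under stated assumptions, not the implementation used later in this guide: `extract` is a hypothetical rule-based stand-in for the model call, and `InMemoryGraph` is a hypothetical stand-in for a graph database.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    id: str
    type: str


@dataclass(frozen=True)
class Relationship:
    source: Node
    target: Node
    type: str


@dataclass
class InMemoryGraph:
    """Hypothetical stand-in for a graph database: keeps everything in sets."""

    nodes: set = field(default_factory=set)
    relationships: set = field(default_factory=set)

    def add(self, nodes, relationships):
        self.nodes.update(nodes)
        self.relationships.update(relationships)


def extract(text):
    """Step 1 (stubbed): a real pipeline would call a model here.

    This toy rule only recognizes sentences of the form "<A> married <B>.".
    """
    nodes, relationships = [], []
    for sentence in text.split("."):
        if " married " in sentence:
            left, right = (part.strip() for part in sentence.split(" married ", 1))
            a, b = Node(left, "Person"), Node(right, "Person")
            nodes += [a, b]
            relationships.append(Relationship(a, b, "MARRIED"))
    return nodes, relationships


def build_graph(text):
    """Step 2: store whatever step 1 extracted."""
    graph = InMemoryGraph()
    graph.add(*extract(text))
    return graph


kg = build_graph("Marie Curie married Pierre Curie.")
print(sorted(node.id for node in kg.nodes))
print([(r.source.id, r.type, r.target.id) for r in kg.relationships])
```

Keeping extraction and storage as separate functions mirrors the architecture: either side can be swapped (a different model, a different database) without touching the other.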
Setup
First, get the required packages and set environment variables. In this example, we will be using the Neo4j graph database.
%pip install --upgrade --quiet langchain langchain-neo4j langchain-openai langchain-experimental neo4j
We default to OpenAI models in this guide.
import getpass
import os
os.environ["OPENAI_API_KEY"] = getpass.getpass()
# Uncomment the below to use LangSmith. Not required.
# os.environ["LANGSMITH_API_KEY"] = getpass.getpass()
# os.environ["LANGSMITH_TRACING"] = "true"
Next, we need to define Neo4j credentials and a connection. Follow these installation steps to set up a Neo4j database.
import os
from langchain_neo4j import Neo4jGraph
os.environ["NEO4J_URI"] = "bolt://localhost:7687"
os.environ["NEO4J_USERNAME"] = "neo4j"
os.environ["NEO4J_PASSWORD"] = "password"
graph = Neo4jGraph(refresh_schema=False)
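LLM Graph Transformer
Extracting graph data from text turns unstructured information into a structured format, which makes it easier to explore complex relationships and patterns. The LLMGraphTransformer converts text documents into structured graph documents by using an LLM to parse and categorize entities and their relationships. The choice of LLM significantly influences the output, because it determines the accuracy and nuance of the extracted graph data.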
import os
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(temperature=0, model_name="gpt-4-turbo")
llm_transformer = LLMGraphTransformer(llm=llm)
Now we can pass in example text and examine the results.
from langchain_core.documents import Document
text = """
Marie Curie, born in 1867, was a Polish and naturalised-French physicist and chemist who conducted pioneering research on radioactivity.
She was the first woman to win a Nobel Prize, the first person to win a Nobel Prize twice, and the only person to win a Nobel Prize in two scientific fields.
Her husband, Pierre Curie, was a co-winner of her first Nobel Prize, making them the first-ever married couple to win the Nobel Prize and launching the Curie family legacy of five Nobel Prizes.
She was, in 1906, the first woman to become a professor at the University of Paris.
"""
documents = [Document(page_content=text)]
graph_documents = await llm_transformer.aconvert_to_graph_documents(documents)
print(f"Nodes:{graph_documents[0].nodes}")
print(f"Relationships:{graph_documents[0].relationships}")
Nodes:[Node(id='Marie Curie', type='Person', properties={}), Node(id='Pierre Curie', type='Person', properties={}), Node(id='University Of Paris', type='Organization', properties={})]
Relationships:[Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='Pierre Curie', type='Person', properties={}), type='MARRIED', properties={}), Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='University Of Paris', type='Organization', properties={}), type='PROFESSOR', properties={})]
Examine the following image to better grasp the structure of the generated knowledge graph.

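Note that the graph construction process is non-deterministic, since we are using an LLM. You may therefore get slightly different results on each execution.
Additionally, you have the flexibility to define specific types of nodes and relationships for extraction according to your requirements.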
llm_transformer_filtered = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=["Person", "Country", "Organization"],
    allowed_relationships=["NATIONALITY", "LOCATED_IN", "WORKED_AT", "SPOUSE"],
)
graph_documents_filtered = await llm_transformer_filtered.aconvert_to_graph_documents(
    documents
)
print(f"Nodes:{graph_documents_filtered[0].nodes}")
print(f"Relationships:{graph_documents_filtered[0].relationships}")
Nodes:[Node(id='Marie Curie', type='Person', properties={}), Node(id='Pierre Curie', type='Person', properties={}), Node(id='University Of Paris', type='Organization', properties={})]
Relationships:[Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='Pierre Curie', type='Person', properties={}), type='SPOUSE', properties={}), Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='University Of Paris', type='Organization', properties={}), type='WORKED_AT', properties={})]
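To define the graph schema more precisely, consider using a three-tuple approach for relationships. In this approach, each tuple consists of three elements: the source node type, the relationship type, and the target node type.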
allowed_relationships = [
    ("Person", "SPOUSE", "Person"),
    ("Person", "NATIONALITY", "Country"),
    ("Person", "WORKED_AT", "Organization"),
]
llm_transformer_tuple = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=["Person", "Country", "Organization"],
    allowed_relationships=allowed_relationships,
)
graph_documents_filtered = await llm_transformer_tuple.aconvert_to_graph_documents(
    documents
)
print(f"Nodes:{graph_documents_filtered[0].nodes}")
print(f"Relationships:{graph_documents_filtered[0].relationships}")
Nodes:[Node(id='Marie Curie', type='Person', properties={}), Node(id='Pierre Curie', type='Person', properties={}), Node(id='University Of Paris', type='Organization', properties={})]
Relationships:[Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='Pierre Curie', type='Person', properties={}), type='SPOUSE', properties={}), Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='University Of Paris', type='Organization', properties={}), type='WORKED_AT', properties={})]
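To see why the three-tuple form is stricter than a flat list of relationship types, it can help to treat the schema as a membership test on (source type, relationship type, target type). The sketch below is plain Python for illustration only; the `conforms` helper and the candidate triples are hypothetical, and this is not a description of how LLMGraphTransformer enforces the schema internally.

```python
# Each schema entry is (source node type, relationship type, target node type).
ALLOWED_TRIPLES = {
    ("Person", "SPOUSE", "Person"),
    ("Person", "NATIONALITY", "Country"),
    ("Person", "WORKED_AT", "Organization"),
}

# The flat form only constrains the relationship type.
ALLOWED_TYPES = {rel_type for _, rel_type, _ in ALLOWED_TRIPLES}


def conforms(source_type, rel_type, target_type):
    """True only if the whole triple appears in the schema."""
    return (source_type, rel_type, target_type) in ALLOWED_TRIPLES


# Hypothetical candidate relationships, written as type triples.
candidates = [
    ("Person", "SPOUSE", "Person"),
    ("Person", "WORKED_AT", "Organization"),
    ("Organization", "SPOUSE", "Person"),  # allowed type, wrong source
    ("Person", "WORKED_AT", "Country"),  # allowed type, wrong target
]

kept_by_type = [t for t in candidates if t[1] in ALLOWED_TYPES]
kept_by_triple = [t for t in candidates if conforms(*t)]

print(len(kept_by_type))
print(kept_by_triple)
```

The flat check accepts all four candidates, while the triple check keeps only the two whose endpoints match the schema.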
To better understand the generated graph, we can visualize it again.

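The node_properties parameter enables the extraction of node properties, allowing the creation of a more detailed graph.
When set to True, the LLM autonomously identifies and extracts relevant node properties.
Conversely, if node_properties is defined as a list of strings, the LLM selectively retrieves only the specified properties from the text.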
llm_transformer_props = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=["Person", "Country", "Organization"],
    allowed_relationships=["NATIONALITY", "LOCATED_IN", "WORKED_AT", "SPOUSE"],
    node_properties=["born_year"],
)
graph_documents_props = await llm_transformer_props.aconvert_to_graph_documents(
    documents
)
print(f"Nodes:{graph_documents_props[0].nodes}")
print(f"Relationships:{graph_documents_props[0].relationships}")
Nodes:[Node(id='Marie Curie', type='Person', properties={'born_year': '1867'}), Node(id='Pierre Curie', type='Person', properties={}), Node(id='University Of Paris', type='Organization', properties={}), Node(id='Poland', type='Country', properties={}), Node(id='France', type='Country', properties={})]
Relationships:[Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='Poland', type='Country', properties={}), type='NATIONALITY', properties={}), Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='France', type='Country', properties={}), type='NATIONALITY', properties={}), Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='Pierre Curie', type='Person', properties={}), type='SPOUSE', properties={}), Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='University Of Paris', type='Organization', properties={}), type='WORKED_AT', properties={})]
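In the output above, born_year came back as the string '1867'. If downstream queries need numeric values, one option is to coerce selected properties before storing them. The helper below is a hypothetical sketch in plain Python, not part of the library; it works on a plain properties dictionary like the one shown on each node.

```python
def coerce_properties(properties, casts):
    """Return a copy of `properties` with the listed keys cast to new types.

    Values that cannot be converted are left unchanged rather than dropped.
    """
    result = dict(properties)
    for key, cast in casts.items():
        if key in result:
            try:
                result[key] = cast(result[key])
            except (TypeError, ValueError):
                pass  # keep the original value if conversion fails
    return result


cleaned = coerce_properties({"born_year": "1867"}, {"born_year": int})
untouched = coerce_properties({"born_year": "unknown"}, {"born_year": int})

print(cleaned)
print(untouched)
```

Leaving unconvertible values in place keeps the step lossless; whether to drop or flag them instead is a design choice that depends on the application.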
Storing to graph database
The generated graph documents can be stored in a graph database using the add_graph_documents method.
graph.add_graph_documents(graph_documents_props)
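Because this step writes to the database, the security note at the top of this guide suggests validating data before importing it. One possible shape for such a check is sketched below. It is an assumption-laden illustration: the allowlists and the `validation_errors` helper are hypothetical, and the function takes plain tuples so that it runs on its own. With real graph documents, the same values could be read from the `id` and `type` fields of each node and from the `source`, `target`, and `type` fields of each relationship, as seen in the printed output earlier.

```python
# Hypothetical allowlists; adjust them to your own schema.
ALLOWED_NODE_TYPES = {"Person", "Country", "Organization"}
ALLOWED_REL_TYPES = {"NATIONALITY", "LOCATED_IN", "WORKED_AT", "SPOUSE"}


def validation_errors(nodes, relationships):
    """Collect human-readable problems instead of failing on the first one.

    nodes: iterable of (node_id, node_type)
    relationships: iterable of (source_id, rel_type, target_id)
    """
    errors = []
    known_ids = set()
    for node_id, node_type in nodes:
        if not node_id.strip():
            errors.append("node with empty id")
        if node_type not in ALLOWED_NODE_TYPES:
            errors.append(f"node {node_id!r}: unexpected type {node_type!r}")
        known_ids.add(node_id)
    for source_id, rel_type, target_id in relationships:
        if rel_type not in ALLOWED_REL_TYPES:
            errors.append(
                f"relationship {source_id!r}->{target_id!r}: unexpected type {rel_type!r}"
            )
        for endpoint in (source_id, target_id):
            if endpoint not in known_ids:
                errors.append(f"relationship references unknown node {endpoint!r}")
    return errors


nodes = [("Marie Curie", "Person"), ("University Of Paris", "Organization")]
good = [("Marie Curie", "WORKED_AT", "University Of Paris")]
bad = [("Marie Curie", "ADMIRES", "Pierre Curie")]

print(validation_errors(nodes, good))
for problem in validation_errors(nodes, bad):
    print(problem)
```

A caller might import only when the returned list is empty and log the problems otherwise.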
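Most graph databases support indexes to optimize data import and retrieval. Since we might not know all the node labels in advance, we can handle this by adding a secondary base label to each node using the baseEntityLabel parameter.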
graph.add_graph_documents(graph_documents, baseEntityLabel=True)
The results will look like this:

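The final option is to also import the source documents for the extracted nodes and relationships. This approach lets us track which documents each entity appeared in.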
graph.add_graph_documents(graph_documents, include_source=True)
The graph will have the following structure:

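In this visualization, the source document is highlighted in blue, and all entities extracted from it are connected to it by MENTIONS relationships.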
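Once source documents are stored, the MENTIONS relationships can be used to look up where an entity came from. The sketch below only builds a parameterized Cypher query and does not connect to a database. The `Document` label and the `id` property on entities are assumptions about how the data was stored, and `mentions_query` is a hypothetical helper, so check them against your own graph before relying on this.

```python
def mentions_query(entity_id):
    """Build a Cypher query that finds source documents mentioning an entity.

    The entity id is passed as a query parameter rather than being
    interpolated into the query string.
    """
    query = (
        "MATCH (d:Document)-[:MENTIONS]->(e {id: $entity_id}) "  # assumed label and property
        "RETURN d"
    )
    return query, {"entity_id": entity_id}


query, params = mentions_query("Marie Curie")
print(query)
print(params)
```

With the graph object created earlier, the pair could then be run as graph.query(query, params=params), assuming the database is reachable and was populated with include_source=True.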