Skip to main content
Open In ColabOpen on GitHub

如何构建知识图谱

在本指南中,我们将介绍基于非结构化文本构建知识图谱的基本方法。构建好的图谱之后,可以作为 RAG 应用中的知识库使用。

⚠️ 安全提示 ⚠️

构建知识图谱需要对数据库执行写入操作。这样做存在固有风险。在导入数据之前,请确保验证和检查数据。有关通用安全最佳实践的更多信息,请 点击此处

架构

从高层次来看,从文本构建知识图谱的步骤包括:

  1. 从文本中提取结构化信息: 模型用于从文本中提取结构化图信息。
  2. 存储到图数据库: 将提取的结构化图信息存储到图数据库中,可支持下游的RAG应用

设置

首先,获取所需的包并设置环境变量。 在此示例中,我们将使用 Neo4j 图数据库。

%pip install --upgrade --quiet  langchain langchain-neo4j langchain-openai langchain-experimental neo4j

[notice] A new release of pip is available: 24.0 -> 24.3.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.

在本指南中,我们默认使用 OpenAI 模型。

import getpass
import os

os.environ["OPENAI_API_KEY"] = getpass.getpass()

# Uncomment the below to use LangSmith. Not required.
# os.environ["LANGSMITH_API_KEY"] = getpass.getpass()
# os.environ["LANGSMITH_TRACING"] = "true"
 ········

接下来,我们需要定义Neo4j的凭据和连接信息。 请按照 这些安装步骤 来设置一个Neo4j数据库。

import os

from langchain_neo4j import Neo4jGraph

os.environ["NEO4J_URI"] = "bolt://localhost:7687"
os.environ["NEO4J_USERNAME"] = "neo4j"
os.environ["NEO4J_PASSWORD"] = "password"

graph = Neo4jGraph(refresh_schema=False)
API 参考:Neo4jGraph

大型语言模型图变换器

从文本中提取图数据能够将非结构化信息转换为结构化格式,从而促进对复杂关系和模式的深入洞察,并实现更高效的导航。LLMGraphTransformer通过利用大型语言模型(LLM)解析和分类实体及其关系,将文本文档转换为结构化的图文档。所选择的LLM模型会显著影响输出结果,因为它决定了提取的图数据的准确性和细微差别。

import os

from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0, model_name="gpt-4-turbo")

llm_transformer = LLMGraphTransformer(llm=llm)

现在我们可以输入示例文本并检查结果。

from langchain_core.documents import Document

text = """
Marie Curie, born in 1867, was a Polish and naturalised-French physicist and chemist who conducted pioneering research on radioactivity.
She was the first woman to win a Nobel Prize, the first person to win a Nobel Prize twice, and the only person to win a Nobel Prize in two scientific fields.
Her husband, Pierre Curie, was a co-winner of her first Nobel Prize, making them the first-ever married couple to win the Nobel Prize and launching the Curie family legacy of five Nobel Prizes.
She was, in 1906, the first woman to become a professor at the University of Paris.
"""
documents = [Document(page_content=text)]
graph_documents = await llm_transformer.aconvert_to_graph_documents(documents)
print(f"Nodes:{graph_documents[0].nodes}")
print(f"Relationships:{graph_documents[0].relationships}")
API 参考:文档
Nodes:[Node(id='Marie Curie', type='Person', properties={}), Node(id='Pierre Curie', type='Person', properties={}), Node(id='University Of Paris', type='Organization', properties={})]
Relationships:[Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='Pierre Curie', type='Person', properties={}), type='MARRIED', properties={}), Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='University Of Paris', type='Organization', properties={}), type='PROFESSOR', properties={})]

仔细查看以下图像,以更好地理解生成的知识图谱的结构。

graph_construction1.png

请注意,由于我们使用了大语言模型,图的构建过程具有非确定性。因此,每次执行时可能会得到略微不同的结果。

此外,您还可以根据需求灵活定义特定类型的节点和关系以进行提取。

llm_transformer_filtered = LLMGraphTransformer(
llm=llm,
allowed_nodes=["Person", "Country", "Organization"],
allowed_relationships=["NATIONALITY", "LOCATED_IN", "WORKED_AT", "SPOUSE"],
)
graph_documents_filtered = await llm_transformer_filtered.aconvert_to_graph_documents(
documents
)
print(f"Nodes:{graph_documents_filtered[0].nodes}")
print(f"Relationships:{graph_documents_filtered[0].relationships}")
Nodes:[Node(id='Marie Curie', type='Person', properties={}), Node(id='Pierre Curie', type='Person', properties={}), Node(id='University Of Paris', type='Organization', properties={})]
Relationships:[Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='Pierre Curie', type='Person', properties={}), type='SPOUSE', properties={}), Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='University Of Paris', type='Organization', properties={}), type='WORKED_AT', properties={})]

为了更精确地定义图模式,可以考虑使用三元组方法来表示关系。在此方法中,每个三元组由三个元素组成:源节点、关系类型和目标节点。

allowed_relationships = [
("Person", "SPOUSE", "Person"),
("Person", "NATIONALITY", "Country"),
("Person", "WORKED_AT", "Organization"),
]

llm_transformer_tuple = LLMGraphTransformer(
llm=llm,
allowed_nodes=["Person", "Country", "Organization"],
allowed_relationships=allowed_relationships,
)
graph_documents_filtered = await llm_transformer_tuple.aconvert_to_graph_documents(
documents
)
print(f"Nodes:{graph_documents_filtered[0].nodes}")
print(f"Relationships:{graph_documents_filtered[0].relationships}")
Nodes:[Node(id='Marie Curie', type='Person', properties={}), Node(id='Pierre Curie', type='Person', properties={}), Node(id='University Of Paris', type='Organization', properties={})]
Relationships:[Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='Pierre Curie', type='Person', properties={}), type='SPOUSE', properties={}), Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='University Of Paris', type='Organization', properties={}), type='WORKED_AT', properties={})]

为了更好地理解生成的图表,我们可以再次对其进行可视化。

graph_construction2.png

node_properties 参数启用节点属性的提取,允许创建更详细的图。 当设置为 True 时,LLM 会自主识别并提取相关的节点属性。 相反,如果将 node_properties 定义为字符串列表,则 LLM 仅从文本中选择性地检索指定的属性。

llm_transformer_props = LLMGraphTransformer(
llm=llm,
allowed_nodes=["Person", "Country", "Organization"],
allowed_relationships=["NATIONALITY", "LOCATED_IN", "WORKED_AT", "SPOUSE"],
node_properties=["born_year"],
)
graph_documents_props = await llm_transformer_props.aconvert_to_graph_documents(
documents
)
print(f"Nodes:{graph_documents_props[0].nodes}")
print(f"Relationships:{graph_documents_props[0].relationships}")
Nodes:[Node(id='Marie Curie', type='Person', properties={'born_year': '1867'}), Node(id='Pierre Curie', type='Person', properties={}), Node(id='University Of Paris', type='Organization', properties={}), Node(id='Poland', type='Country', properties={}), Node(id='France', type='Country', properties={})]
Relationships:[Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='Poland', type='Country', properties={}), type='NATIONALITY', properties={}), Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='France', type='Country', properties={}), type='NATIONALITY', properties={}), Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='Pierre Curie', type='Person', properties={}), type='SPOUSE', properties={}), Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='University Of Paris', type='Organization', properties={}), type='WORKED_AT', properties={})]

存储到图数据库

生成的图表文档可以使用 add_graph_documents 方法存储到图数据库中。

graph.add_graph_documents(graph_documents_props)

大多数图数据库都支持索引,以优化数据的导入和检索。由于我们可能无法提前知道所有的节点标签,因此可以通过使用 baseEntityLabel 参数为每个节点添加一个次要基础标签来处理这种情况。

graph.add_graph_documents(graph_documents, baseEntityLabel=True)

结果将如下所示:

graph_construction3.png

最后一个选项是同时导入提取的节点和关系对应的源文档。这种方法使我们能够追踪每个实体出现在哪些文档中。

graph.add_graph_documents(graph_documents, include_source=True)

图将具有以下结构:

graph_construction4.png

在此可视化中,源文档以蓝色高亮显示,其中提取的所有实体均通过 MENTIONS 关系连接。