Memgraph
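Memgraph is an open-source graph database, tuned for dynamic analytics environments and compatible with Neo4j. To query the database, Memgraph uses Cypher, the most widely adopted, fully specified, and open query language for property graph databases.
This notebook shows you how to query Memgraph with natural language and how to construct a knowledge graph from unstructured data.
But first, make sure everything is set up.
Setting up
To go through this guide, you need Docker and Python 3.x installed.
To quickly run Memgraph Platform (Memgraph database + MAGE library + Memgraph Lab) for the first time, do the following:
On Linux/macOS: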
curl https://install.memgraph.com | sh
On Windows:
iwr https://windows.memgraph.com | iex
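Both commands run a script that downloads a Docker Compose file to your system, then builds and starts the memgraph-mage and memgraph-lab Docker services in two separate containers. Memgraph is now up and running! For more on the installation process, see the Memgraph documentation.
To use LangChain, install and import all the necessary packages. We'll use the package manager pip with the --user flag, which installs into your user directory and so does not require administrator permissions. If you have Python 3.4 or later installed, pip is included by default. You can install all the required packages with the following command: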
pip install langchain langchain-openai langchain-memgraph --user
You can run the provided code blocks in this notebook, or use a separate Python file to experiment with Memgraph and LangChain.
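Before moving on, it can help to confirm that the installation succeeded. The following is a minimal sketch using only the standard library; it assumes the import names langchain, langchain_openai, and langchain_memgraph correspond to the pip packages installed above, and the helper name find_missing is made up for illustration.

```python
import importlib.util

# Import names assumed to correspond to the pip packages installed above.
REQUIRED_MODULES = ["langchain", "langchain_openai", "langchain_memgraph"]


def find_missing(modules):
    """Return the module names that cannot be located in this environment."""
    missing = []
    for name in modules:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If anything is reported as missing, rerunning the pip command in the same Python environment as the notebook kernel is usually the first thing to try.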
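Natural language querying
Memgraph's integration with LangChain includes natural language querying. To use it, first do all the necessary imports. We will discuss them as they appear in the code.
First, instantiate the Memgraph graph class (imported below as Memgraph). This object holds the connection to the running Memgraph instance. Make sure all the environment variables are set correctly.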
import os
from langchain_core.prompts import PromptTemplate
from langchain_memgraph.chains.graph_qa import MemgraphQAChain
from langchain_memgraph.graphs.memgraph import Memgraph
from langchain_openai import ChatOpenAI
url = os.environ.get("MEMGRAPH_URI", "bolt://localhost:7687")
username = os.environ.get("MEMGRAPH_USERNAME", "")
password = os.environ.get("MEMGRAPH_PASSWORD", "")
graph = Memgraph(url=url, username=username, password=password, refresh_schema=False)
refresh_schema is initially set to False because there is no data in the database yet and we want to avoid unnecessary database calls.
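The environment-variable lookup above can be gathered into one small helper, which makes the defaults easy to inspect and test. This is only a sketch: the function name memgraph_settings is hypothetical, and the variable names and defaults are the ones used in the code above.

```python
import os


def memgraph_settings(env=None):
    """Collect connection settings, falling back to the defaults used above."""
    env = os.environ if env is None else env
    return {
        "url": env.get("MEMGRAPH_URI", "bolt://localhost:7687"),
        "username": env.get("MEMGRAPH_USERNAME", ""),
        "password": env.get("MEMGRAPH_PASSWORD", ""),
    }


settings = memgraph_settings()
# Print only non-secret values so the password never ends up in logs.
print(f"Connecting to {settings['url']} as {settings['username'] or '<no user>'}")
```

The keys match the keyword arguments passed to Memgraph above, so the dictionary could be unpacked into that call alongside refresh_schema=False.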
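Populating the database
To populate the database, first make sure it's empty. The most efficient way to do that is to switch to the in-memory analytical storage mode, drop the graph, and switch back to the in-memory transactional mode. Learn more about Memgraph's storage modes.
The data we'll add to the database describes video games of different genres available on various platforms, along with their publishers.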
# Drop graph
graph.query("STORAGE MODE IN_MEMORY_ANALYTICAL")
graph.query("DROP GRAPH")
graph.query("STORAGE MODE IN_MEMORY_TRANSACTIONAL")
# Creating and executing the seeding query
query = """
MERGE (g:Game {name: "Baldur's Gate 3"})
WITH g, ["PlayStation 5", "Mac OS", "Windows", "Xbox Series X/S"] AS platforms,
["Adventure", "Role-Playing Game", "Strategy"] AS genres
FOREACH (platform IN platforms |
MERGE (p:Platform {name: platform})
MERGE (g)-[:AVAILABLE_ON]->(p)
)
FOREACH (genre IN genres |
MERGE (gn:Genre {name: genre})
MERGE (g)-[:HAS_GENRE]->(gn)
)
MERGE (p:Publisher {name: "Larian Studios"})
MERGE (g)-[:PUBLISHED_BY]->(p);
"""
graph.query(query)
[]
Notice that the graph object provides the query method. That method executes a query in Memgraph, and MemgraphQAChain also uses it to query the database.
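Because the three "empty the database" statements are reused several times in this guide, they can be wrapped in a helper. The sketch below assumes only that the graph object exposes query(statement), as described above; clear_graph and RecordingGraph are hypothetical names, and the stub exists purely so the example runs without a database.

```python
def clear_graph(graph):
    """Empty the database using the three statements shown above."""
    # DROP GRAPH is issued while in analytical mode, as in the steps above.
    graph.query("STORAGE MODE IN_MEMORY_ANALYTICAL")
    graph.query("DROP GRAPH")
    # Switch back so later writes are transactional again.
    graph.query("STORAGE MODE IN_MEMORY_TRANSACTIONAL")


class RecordingGraph:
    """Hypothetical stand-in that records queries instead of running them."""

    def __init__(self):
        self.queries = []

    def query(self, statement):
        self.queries.append(statement)
        return []


stub = RecordingGraph()
clear_graph(stub)
print(stub.queries)
```

With a live connection, calling clear_graph(graph) would take the place of the three separate graph.query calls.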
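Refresh graph schema
Since new data has been created in Memgraph, the schema needs to be refreshed. MemgraphQAChain uses the generated schema to instruct the LLM to generate better Cypher queries.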
graph.refresh_schema()
To get familiar with the data and verify the updated graph schema, print it with the following statement:
print(graph.get_schema)
Node labels and properties (name and type) are:
- labels: (:Platform)
properties:
- name: string
- labels: (:Genre)
properties:
- name: string
- labels: (:Game)
properties:
- name: string
- labels: (:Publisher)
properties:
- name: string
Nodes are connected with the following relationships:
(:Game)-[:HAS_GENRE]->(:Genre)
(:Game)-[:PUBLISHED_BY]->(:Publisher)
(:Game)-[:AVAILABLE_ON]->(:Platform)
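Querying the database
To interact with the OpenAI API, you must configure your API key as an environment variable so that your requests are properly authorized. You can find more information on obtaining an API key here. To configure the API key, you can use Python's os package: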
os.environ["OPENAI_API_KEY"] = "your-key-here"
Run the code snippet above if you are running the code in a Jupyter notebook.
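To avoid leaving a real key in a saved notebook, the key can instead be requested at run time and set only when it is missing. The following is a minimal sketch; ensure_api_key is a hypothetical helper, and the prompt function is injectable so the example runs without a terminal.

```python
from getpass import getpass


def ensure_api_key(env, ask=getpass):
    """Set OPENAI_API_KEY in env only when it is not already present."""
    if not env.get("OPENAI_API_KEY"):
        # `ask` is injectable so the function can be exercised without a terminal.
        env["OPENAI_API_KEY"] = ask("OpenAI API key: ")
    return env["OPENAI_API_KEY"]


# Demonstration with a plain dict and a fake prompt function.
demo_env = {}
ensure_api_key(demo_env, ask=lambda prompt: "your-key-here")
print("Key configured:", "OPENAI_API_KEY" in demo_env)
```

In a notebook, the call would be ensure_api_key(os.environ), which prompts for the key without echoing it.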
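Next, create MemgraphQAChain, which will be used in the question-answering process over your graph data. The temperature parameter is set to zero to make answers predictable and consistent. You can set the verbose parameter to True to get more detailed messages about query generation.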
chain = MemgraphQAChain.from_llm(
ChatOpenAI(temperature=0),
graph=graph,
model_name="gpt-4-turbo",
allow_dangerous_requests=True,
)
Now you can start asking questions!
response = chain.invoke("Which platforms is Baldur's Gate 3 available on?")
print(response["result"])
MATCH (:Game{name: "Baldur's Gate 3"})-[:AVAILABLE_ON]->(platform:Platform)
RETURN platform.name
Baldur's Gate 3 is available on PlayStation 5, Mac OS, Windows, and Xbox Series X/S.
response = chain.invoke("Is Baldur's Gate 3 available on Windows?")
print(response["result"])
MATCH (:Game{name: "Baldur's Gate 3"})-[:AVAILABLE_ON]->(:Platform{name: "Windows"})
RETURN "Yes"
Yes, Baldur's Gate 3 is available on Windows.
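Chain modifiers
To modify the behavior of your chain and obtain more context or additional information, you can adjust the chain's parameters.
Return direct query results
The return_direct modifier specifies whether to return the direct results of the executed Cypher query or the processed natural language response.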
# Return the result of querying the graph directly
chain = MemgraphQAChain.from_llm(
ChatOpenAI(temperature=0),
graph=graph,
return_direct=True,
allow_dangerous_requests=True,
model_name="gpt-4-turbo",
)
response = chain.invoke("Which studio published Baldur's Gate 3?")
print(response["result"])
MATCH (g:Game {name: "Baldur's Gate 3"})-[:PUBLISHED_BY]->(p:Publisher)
RETURN p.name
[{'p.name': 'Larian Studios'}]
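With return_direct=True the result is a list of row dictionaries rather than a sentence, so downstream code typically needs to pull values out of it. A small sketch, using the row printed above as literal data; the helper name column is made up for illustration.

```python
# Rows copied from the direct result printed above.
rows = [{"p.name": "Larian Studios"}]


def column(rows, key):
    """Collect one column from a list of row dictionaries."""
    return [row[key] for row in rows if key in row]


publishers = column(rows, "p.name")
print(publishers)
```

The dictionary keys follow the RETURN clause of the generated Cypher, so they can change from one generated query to the next.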
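Return query intermediate steps
The return_intermediate_steps chain modifier enhances the returned response by including the intermediate steps of the query in addition to the query result.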
# Return all the intermediate steps of query execution
chain = MemgraphQAChain.from_llm(
ChatOpenAI(temperature=0),
graph=graph,
allow_dangerous_requests=True,
return_intermediate_steps=True,
model_name="gpt-4-turbo",
)
response = chain.invoke("Is Baldur's Gate 3 an Adventure game?")
print(f"Intermediate steps: {response['intermediate_steps']}")
print(f"Final response: {response['result']}")
MATCH (:Game {name: "Baldur's Gate 3"})-[:HAS_GENRE]->(:Genre {name: "Adventure"})
RETURN "Yes"
Intermediate steps: [{'query': 'MATCH (:Game {name: "Baldur\'s Gate 3"})-[:HAS_GENRE]->(:Genre {name: "Adventure"})\nRETURN "Yes"'}, {'context': [{'"Yes"': 'Yes'}]}]
Final response: Yes.
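The intermediate steps arrive as a list of dictionaries, one carrying the generated query and one carrying the retrieved context. The sketch below separates the two, using the output printed above as literal data; split_steps is a hypothetical helper and assumes the "query" and "context" keys seen in that output.

```python
# Shape copied from the intermediate steps printed above.
intermediate_steps = [
    {
        "query": 'MATCH (:Game {name: "Baldur\'s Gate 3"})-[:HAS_GENRE]->'
        '(:Genre {name: "Adventure"})\nRETURN "Yes"'
    },
    {"context": [{'"Yes"': "Yes"}]},
]


def split_steps(steps):
    """Return (generated Cypher, retrieved context) from the step list."""
    query, context = None, None
    for step in steps:
        # Each step is a dict carrying either a "query" or a "context" entry.
        if "query" in step:
            query = step["query"]
        if "context" in step:
            context = step["context"]
    return query, context


cypher, context = split_steps(intermediate_steps)
print(cypher)
print(context)
```

Logging the generated Cypher this way is handy when debugging answers such as "I don't know the answer."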
Limit the number of query results
Use the top_k modifier when you want to cap the maximum number of query results.
# Limit the maximum number of results returned by query
chain = MemgraphQAChain.from_llm(
ChatOpenAI(temperature=0),
graph=graph,
top_k=2,
allow_dangerous_requests=True,
model_name="gpt-4-turbo",
)
response = chain.invoke("What genres are associated with Baldur's Gate 3?")
print(response["result"])
MATCH (:Game {name: "Baldur's Gate 3"})-[:HAS_GENRE]->(g:Genre)
RETURN g.name;
Adventure, Role-Playing Game
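Conceptually, top_k behaves like truncating the list of result rows before the answer is composed. The sketch below illustrates that idea on plain data; it assumes truncation by slicing and is not a description of the chain's internals. The helper name limit_rows is hypothetical.

```python
def limit_rows(rows, top_k):
    """Keep at most top_k rows from a query result."""
    if top_k < 0:
        raise ValueError("top_k must be non-negative")
    return rows[:top_k]


# The three genres created by the seeding query earlier.
all_rows = [
    {"g.name": "Adventure"},
    {"g.name": "Role-Playing Game"},
    {"g.name": "Strategy"},
]
print(limit_rows(all_rows, 2))
```

This matches the answer above, which lists only two of the three genres stored for the game.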
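Advanced querying
As the complexity of your solution grows, you may run into use cases that need careful handling. Making sure your application scales is essential to keeping the user flow smooth.
Let's instantiate our chain once again and try asking some questions that users might ask.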
chain = MemgraphQAChain.from_llm(
ChatOpenAI(temperature=0),
graph=graph,
model_name="gpt-4-turbo",
allow_dangerous_requests=True,
)
response = chain.invoke("Is Baldur's Gate 3 available on PS5?")
print(response["result"])
MATCH (:Game{name: "Baldur's Gate 3"})-[:AVAILABLE_ON]->(:Platform{name: "PS5"})
RETURN "Yes"
I don't know the answer.
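The generated Cypher query looks fine, but we didn't receive any information in response. This illustrates a common challenge when working with LLMs: a mismatch between how users phrase a question and how the data is stored. Here the user asks about "PS5", while the platform is stored as "PlayStation 5", so the query matches nothing. Prompt refinement is an effective remedy: by iteratively improving the prompt, the model is told about such discrepancies and can generate accurate, relevant queries that retrieve the desired data.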
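The mismatch can be reproduced without an LLM: an exact string comparison against the stored platform names fails for "PS5". The sketch below also shows a hypothetical alias table that maps user spellings to the stored name; the table and the resolve_platform helper are assumptions made for illustration, not part of the chain.

```python
# Platform names as stored by the seeding query.
stored_platforms = {"PlayStation 5", "Mac OS", "Windows", "Xbox Series X/S"}

# Hypothetical alias table covering common user spellings of one platform.
ALIASES = {
    "ps5": "PlayStation 5",
    "ps 5": "PlayStation 5",
    "play station 5": "PlayStation 5",
}


def resolve_platform(name):
    """Map a user-supplied platform name to the stored spelling, if known."""
    if name in stored_platforms:
        return name
    return ALIASES.get(name.strip().lower())


print("PS5" in stored_platforms)  # exact match fails
print(resolve_platform("PS5"))    # alias lookup succeeds
```

The same knowledge, that "PS5" means "PlayStation 5", is what the model needs to be given.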
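Prompt refinement
To address this, we can adjust the initial Cypher prompt of the QA chain. This involves adding guidance for the LLM on how users may refer to specific platforms, such as PS5 in our case. We do this with the LangChain PromptTemplate, creating a modified initial prompt. This modified prompt is then supplied as an argument to our refined MemgraphQAChain instance.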
MEMGRAPH_GENERATION_TEMPLATE = """Your task is to directly translate natural language inquiry into precise and executable Cypher query for Memgraph database.
You will utilize a provided database schema to understand the structure, nodes and relationships within the Memgraph database.
Instructions:
- Use provided node and relationship labels and property names from the
schema which describes the database's structure. Upon receiving a user
question, synthesize the schema to craft a precise Cypher query that
directly corresponds to the user's intent.
- Generate valid executable Cypher queries on top of Memgraph database.
Any explanation, context, or additional information that is not a part
of the Cypher query syntax should be omitted entirely.
- Use Memgraph MAGE procedures instead of Neo4j APOC procedures.
- Do not include any explanations or apologies in your responses.
- Do not include any text except the generated Cypher statement.
- For queries that ask for information or functionalities outside the direct
generation of Cypher queries, use the Cypher query format to communicate
limitations or capabilities. For example: RETURN "I am designed to generate
Cypher queries based on the provided schema only."
Schema:
{schema}
With all the above information and instructions, generate Cypher query for the
user question.
If the user asks about PS5, Play Station 5 or PS 5, that is the platform called PlayStation 5.
The question is:
{question}"""
MEMGRAPH_GENERATION_PROMPT = PromptTemplate(
input_variables=["schema", "question"], template=MEMGRAPH_GENERATION_TEMPLATE
)
chain = MemgraphQAChain.from_llm(
ChatOpenAI(temperature=0),
cypher_prompt=MEMGRAPH_GENERATION_PROMPT,
graph=graph,
model_name="gpt-4-turbo",
allow_dangerous_requests=True,
)
response = chain.invoke("Is Baldur's Gate 3 available on PS5?")
print(response["result"])
MATCH (:Game{name: "Baldur's Gate 3"})-[:AVAILABLE_ON]->(:Platform{name: "PlayStation 5"})
RETURN "Yes"
Yes, Baldur's Gate 3 is available on PS5.
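Now, with the revised initial Cypher prompt that includes guidance on platform naming, we get accurate and relevant results that align more closely with user queries.
This approach lets you improve your QA chain further. You can integrate additional prompt-refinement data into your chain with little effort, improving the overall user experience of your app.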
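When iterating on a prompt, it helps to preview the text the model will actually receive. The sketch below fills a shortened stand-in for the template with plain str.format; it assumes the template uses format-style placeholders named schema and question, as the one above does. TEMPLATE and render are illustrative names.

```python
# Shortened stand-in for MEMGRAPH_GENERATION_TEMPLATE; same two placeholders.
TEMPLATE = """Schema:
{schema}
If the user asks about PS5, Play Station 5 or PS 5, that is the platform called PlayStation 5.
The question is:
{question}"""


def render(template, schema, question):
    """Fill the two placeholders the way a format-style template would."""
    return template.format(schema=schema, question=question)


prompt_text = render(
    TEMPLATE,
    schema="(:Game)-[:AVAILABLE_ON]->(:Platform)",
    question="Is Baldur's Gate 3 available on PS5?",
)
print(prompt_text)
```

One thing to watch with format-style templates: literal braces in the template text itself, for example a Cypher snippet such as {name: "Windows"} added as a hint, need to be doubled ({{ and }}) so they are not read as placeholders.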
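Constructing a knowledge graph
Transforming unstructured data into structured data is not a trivial task. This guide shows how LLMs can help with it and how to construct a knowledge graph in Memgraph. Once the knowledge graph is created, you can use it for your GraphRAG application.
The steps for constructing a knowledge graph from text are:
- Extracting structured information from text: an LLM is used to extract structured graph information from the text in the form of nodes and relationships.
- Storing into Memgraph: the extracted structured graph information is stored in Memgraph.
Extracting structured information from text
Besides all the imports from the setup section, import LLMGraphTransformer and Document, which will be used to extract structured information from text. LLMGraphTransformer lives in the langchain-experimental package, so install that package as well if it is not already present in your environment.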
from langchain_core.documents import Document
from langchain_experimental.graph_transformers import LLMGraphTransformer
Below is an example text about Charles Darwin (source) from which a knowledge graph will be constructed.
text = """
Charles Robert Darwin was an English naturalist, geologist, and biologist,
widely known for his contributions to evolutionary biology. His proposition that
all species of life have descended from a common ancestor is now generally
accepted and considered a fundamental scientific concept. In a joint
publication with Alfred Russel Wallace, he introduced his scientific theory that
this branching pattern of evolution resulted from a process he called natural
selection, in which the struggle for existence has a similar effect to the
artificial selection involved in selective breeding. Darwin has been
described as one of the most influential figures in human history and was
honoured by burial in Westminster Abbey.
"""
The next step is to initialize LLMGraphTransformer from the desired LLM and convert the document to the graph structure.
llm = ChatOpenAI(temperature=0, model_name="gpt-4-turbo")
llm_transformer = LLMGraphTransformer(llm=llm)
documents = [Document(page_content=text)]
graph_documents = llm_transformer.convert_to_graph_documents(documents)
Under the hood, the LLM extracts important entities from the text and returns them as a list of nodes and relationships. Here's how it looks:
print(graph_documents)
[GraphDocument(nodes=[Node(id='Charles Robert Darwin', type='Person', properties={}), Node(id='English', type='Nationality', properties={}), Node(id='Naturalist', type='Profession', properties={}), Node(id='Geologist', type='Profession', properties={}), Node(id='Biologist', type='Profession', properties={}), Node(id='Evolutionary Biology', type='Field', properties={}), Node(id='Common Ancestor', type='Concept', properties={}), Node(id='Scientific Concept', type='Concept', properties={}), Node(id='Alfred Russel Wallace', type='Person', properties={}), Node(id='Natural Selection', type='Concept', properties={}), Node(id='Selective Breeding', type='Concept', properties={}), Node(id='Westminster Abbey', type='Location', properties={})], relationships=[Relationship(source=Node(id='Charles Robert Darwin', type='Person', properties={}), target=Node(id='English', type='Nationality', properties={}), type='NATIONALITY', properties={}), Relationship(source=Node(id='Charles Robert Darwin', type='Person', properties={}), target=Node(id='Naturalist', type='Profession', properties={}), type='PROFESSION', properties={}), Relationship(source=Node(id='Charles Robert Darwin', type='Person', properties={}), target=Node(id='Geologist', type='Profession', properties={}), type='PROFESSION', properties={}), Relationship(source=Node(id='Charles Robert Darwin', type='Person', properties={}), target=Node(id='Biologist', type='Profession', properties={}), type='PROFESSION', properties={}), Relationship(source=Node(id='Charles Robert Darwin', type='Person', properties={}), target=Node(id='Evolutionary Biology', type='Field', properties={}), type='CONTRIBUTION', properties={}), Relationship(source=Node(id='Common Ancestor', type='Concept', properties={}), target=Node(id='Scientific Concept', type='Concept', properties={}), type='BASIS', properties={}), Relationship(source=Node(id='Charles Robert Darwin', type='Person', properties={}), target=Node(id='Alfred Russel Wallace', type='Person', 
properties={}), type='COLLABORATION', properties={}), Relationship(source=Node(id='Natural Selection', type='Concept', properties={}), target=Node(id='Selective Breeding', type='Concept', properties={}), type='COMPARISON', properties={}), Relationship(source=Node(id='Charles Robert Darwin', type='Person', properties={}), target=Node(id='Westminster Abbey', type='Location', properties={}), type='BURIAL', properties={})], source=Document(metadata={}, page_content='\n Charles Robert Darwin was an English naturalist, geologist, and biologist,\n widely known for his contributions to evolutionary biology. His proposition that\n all species of life have descended from a common ancestor is now generally\n accepted and considered a fundamental scientific concept. In a joint\n publication with Alfred Russel Wallace, he introduced his scientific theory that\n this branching pattern of evolution resulted from a process he called natural\n selection, in which the struggle for existence has a similar effect to the\n artificial selection involved in selective breeding. Darwin has been\n described as one of the most influential figures in human history and was\n honoured by burial in Westminster Abbey.\n'))]
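Storing into Memgraph
Once you have the data ready in the GraphDocument format, that is, as nodes and relationships, you can use the add_graph_documents method to import it into Memgraph. That method transforms the list of graph_documents into the appropriate Cypher queries and executes them in Memgraph. Once that's done, a knowledge graph is stored in Memgraph.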
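To give a feel for what "transforming graph documents into Cypher" can mean, here is a simplified sketch that turns nodes and relationships into parameterized MERGE statements. It is an illustration under stated assumptions, not the library's implementation: SimpleNode, SimpleRel, and both helper functions are hypothetical, and type names are assumed to be plain identifiers that are safe to place in the query text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimpleNode:
    id: str
    type: str


@dataclass(frozen=True)
class SimpleRel:
    source: SimpleNode
    target: SimpleNode
    type: str


def node_to_cypher(node):
    """Build a parameterized MERGE for one node (labels cannot be parameters)."""
    return f"MERGE (n:{node.type} {{id: $id}})", {"id": node.id}


def rel_to_cypher(rel):
    """Build a parameterized MERGE linking two existing nodes by their ids."""
    query = (
        f"MATCH (a:{rel.source.type} {{id: $source_id}}), "
        f"(b:{rel.target.type} {{id: $target_id}}) "
        f"MERGE (a)-[:{rel.type}]->(b)"
    )
    return query, {"source_id": rel.source.id, "target_id": rel.target.id}


darwin = SimpleNode("Charles Robert Darwin", "Person")
wallace = SimpleNode("Alfred Russel Wallace", "Person")
print(node_to_cypher(darwin))
print(rel_to_cypher(SimpleRel(darwin, wallace, "COLLABORATION")))
```

Using MERGE rather than CREATE keeps the import idempotent: loading the same document twice does not duplicate nodes or relationships.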
# Empty the database
graph.query("STORAGE MODE IN_MEMORY_ANALYTICAL")
graph.query("DROP GRAPH")
graph.query("STORAGE MODE IN_MEMORY_TRANSACTIONAL")
# Create KG
graph.add_graph_documents(graph_documents)
Here is how the graph looks in Memgraph Lab (check on localhost:3000):

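If you try this out and get a different graph, that is expected behavior. The graph construction process is non-deterministic, since the LLM used to generate nodes and relationships from unstructured data is itself non-deterministic.
Additional options
Additionally, you can restrict extraction to specific node and relationship types to fit your requirements.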
llm_transformer_filtered = LLMGraphTransformer(
llm=llm,
allowed_nodes=["Person", "Nationality", "Concept"],
allowed_relationships=["NATIONALITY", "INVOLVED_IN", "COLLABORATES_WITH"],
)
graph_documents_filtered = llm_transformer_filtered.convert_to_graph_documents(
documents
)
print(f"Nodes:{graph_documents_filtered[0].nodes}")
print(f"Relationships:{graph_documents_filtered[0].relationships}")
Nodes:[Node(id='Charles Robert Darwin', type='Person', properties={}), Node(id='English', type='Nationality', properties={}), Node(id='Evolutionary Biology', type='Concept', properties={}), Node(id='Natural Selection', type='Concept', properties={}), Node(id='Alfred Russel Wallace', type='Person', properties={})]
Relationships:[Relationship(source=Node(id='Charles Robert Darwin', type='Person', properties={}), target=Node(id='English', type='Nationality', properties={}), type='NATIONALITY', properties={}), Relationship(source=Node(id='Charles Robert Darwin', type='Person', properties={}), target=Node(id='Evolutionary Biology', type='Concept', properties={}), type='INVOLVED_IN', properties={}), Relationship(source=Node(id='Charles Robert Darwin', type='Person', properties={}), target=Node(id='Natural Selection', type='Concept', properties={}), type='INVOLVED_IN', properties={}), Relationship(source=Node(id='Charles Robert Darwin', type='Person', properties={}), target=Node(id='Alfred Russel Wallace', type='Person', properties={}), type='COLLABORATES_WITH', properties={})]
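The effect of the allow-lists can be pictured as a filter over the extracted graph. The sketch below applies such a filter to plain tuples taken from the unfiltered output earlier; it only illustrates the dropping of disallowed types. The transformer itself steers the LLM with these lists, so it may also relabel entities (in the output above, Evolutionary Biology appears as a Concept rather than a Field), which this sketch does not attempt. filter_graph is a hypothetical helper.

```python
ALLOWED_NODES = {"Person", "Nationality", "Concept"}
ALLOWED_RELATIONSHIPS = {"NATIONALITY", "INVOLVED_IN", "COLLABORATES_WITH"}

# (id, type) pairs taken from the unfiltered extraction shown earlier.
nodes = [
    ("Charles Robert Darwin", "Person"),
    ("English", "Nationality"),
    ("Westminster Abbey", "Location"),
]
# (source id, relationship type, target id) triples from the same output.
relationships = [
    ("Charles Robert Darwin", "NATIONALITY", "English"),
    ("Charles Robert Darwin", "BURIAL", "Westminster Abbey"),
]


def filter_graph(nodes, relationships):
    """Drop nodes and relationships whose type is not on the allow-lists."""
    kept_nodes = [n for n in nodes if n[1] in ALLOWED_NODES]
    kept_ids = {n[0] for n in kept_nodes}
    kept_rels = [
        r
        for r in relationships
        if r[1] in ALLOWED_RELATIONSHIPS and r[0] in kept_ids and r[2] in kept_ids
    ]
    return kept_nodes, kept_rels


kept_nodes, kept_rels = filter_graph(nodes, relationships)
print(kept_nodes)
print(kept_rels)
```

Relationships are kept only when both endpoints survive, which avoids edges that point at removed nodes.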
Here's how the graph would look in such a case:

Your graph can also have the __Entity__ label on all nodes, which is indexed for faster retrieval.
# Drop graph
graph.query("STORAGE MODE IN_MEMORY_ANALYTICAL")
graph.query("DROP GRAPH")
graph.query("STORAGE MODE IN_MEMORY_TRANSACTIONAL")
# Store to Memgraph with Entity label
graph.add_graph_documents(graph_documents, baseEntityLabel=True)
Here's how the graph would look:

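There is also an option to include the source of the information obtained in the graph. To do that, set include_source to True; the source document is then stored and linked to the nodes in the graph through the MENTIONS relationship.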
# Drop graph
graph.query("STORAGE MODE IN_MEMORY_ANALYTICAL")
graph.query("DROP GRAPH")
graph.query("STORAGE MODE IN_MEMORY_TRANSACTIONAL")
# Store to Memgraph with source included
graph.add_graph_documents(graph_documents, include_source=True)
The constructed graph would look like this:

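Notice how the content of the source is stored, and that the id property is generated because the document didn't have an id. You can combine the __Entity__ label with the document source. Still, be aware that both take up memory, especially the included source, because of its long content strings.
In the end, you can query the knowledge graph, as explained in the section before: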
chain = MemgraphQAChain.from_llm(
ChatOpenAI(temperature=0),
graph=graph,
model_name="gpt-4-turbo",
allow_dangerous_requests=True,
)
print(chain.invoke("Who Charles Robert Darwin collaborated with?")["result"])
MATCH (:Person {id: "Charles Robert Darwin"})-[:COLLABORATION]->(collaborator)
RETURN collaborator;
Alfred Russel Wallace