Ontotext GraphDB
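Ontotext GraphDB is a graph database and knowledge discovery tool compliant with RDF and SPARQL.
This notebook shows how to use large language models (LLMs) to provide natural language querying (NLQ to SPARQL, also called text2sparql) for Ontotext GraphDB.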
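GraphDB LLM Functionalities
GraphDB supports several LLM integration functionalities:
- Magic predicates to ask an LLM for text, a list, or a table using data from your knowledge graph (KG)
- Query explanation
- Result explanation, summarization, rephrasing, translation
- Indexing of KG entities in a vector database
- Supports any text embedding algorithm and vector database
- Uses the same powerful connector (indexing) language that GraphDB uses for Elastic, Solr, Lucene
- Automatic synchronization of changes in RDF data to the KG entity index
- Supports nested objects (no UI support in GraphDB version 10.5)
- Serializes KG entities to text like this (e.g. for a Wines dataset):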
Franvino:
- is a RedWine.
- made from grape Merlo.
- made from grape Cabernet Franc.
- has sugar dry.
- has year 2012.
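- A simple chatbot using a defined KG entity index
For this tutorial, we won't use the GraphDB LLM integration, but SPARQL generation from NLQ. We'll use the Star Wars API (SWAPI) ontology and dataset.
Setting up
You need a running GraphDB instance. This tutorial shows how to run the database locally using the GraphDB Docker image. It provides a docker compose setup that populates GraphDB with the Star Wars dataset. All necessary files, including this notebook, can be downloaded from the GitHub repository langchain-graphdb-qa-chain-demo.
- Install Docker. This tutorial was created using Docker version 24.0.7, which bundles Docker Compose. For earlier Docker versions you may need to install Docker Compose separately.
- Clone the GitHub repository langchain-graphdb-qa-chain-demo into a local folder on your machine.
- Start GraphDB with the following script, executed from the same folder: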
docker build --tag graphdb .
docker compose up -d graphdb
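You need to wait a few seconds for the database to start on http://localhost:7200/. The Star Wars dataset starwars-data.trig is loaded automatically into the langchain repository. Queries can be run against the local SPARQL endpoint http://localhost:7200/repositories/langchain. You can also open the GraphDB Workbench at http://localhost:7200/sparql in your favourite web browser and run queries interactively.
- Set up the working environment
If you use conda, create and activate a new conda environment, e.g.: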
conda create -n graph_ontotext_graphdb_qa python=3.12
conda activate graph_ontotext_graphdb_qa
Install the following libraries:
pip install jupyter==1.1.1
pip install rdflib==7.1.1
pip install langchain-community==0.3.4
pip install langchain-openai==0.2.4
Run Jupyter with:
jupyter notebook
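Before moving on, it can help to confirm that the SPARQL endpoint from the setup steps answers queries. The following is a minimal sketch that uses only the Python standard library and the standard SPARQL 1.1 Protocol (a GET request with a `query` parameter). The helper names `build_select_request` and `run_select` and the triple-counting query are illustrative assumptions, not part of the tutorial repository.

```python
import json
import urllib.parse
import urllib.request

# Endpoint from the setup steps; adjust it if your repository differs.
ENDPOINT = "http://localhost:7200/repositories/langchain"

# Hypothetical smoke-test query: count all triples in the repository.
COUNT_QUERY = "SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }"


def build_select_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build a SPARQL 1.1 Protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_select(endpoint: str, query: str, timeout: float = 10.0) -> list:
    """Send the query and return the result bindings (needs a running GraphDB)."""
    request = build_select_request(endpoint, query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["results"]["bindings"]


request = build_select_request(ENDPOINT, COUNT_QUERY)
print(request.full_url)

# Uncomment once GraphDB is up and the dataset has been loaded:
# print(run_select(ENDPOINT, COUNT_QUERY))
```

If the commented call returns a non-zero count, the dataset has most likely been imported and the endpoint is ready for the rest of the tutorial.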
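Specifying the ontology
For the LLM to be able to generate SPARQL, it needs to know the knowledge graph schema (the ontology). It can be provided using one of two parameters on the OntotextGraphDBGraph class:
- query_ontology: a CONSTRUCT query that is executed on the SPARQL endpoint and returns the KG schema statements. We recommend storing the ontology in its own named graph, which makes it easier to fetch only the relevant statements (as in the example below). DESCRIBE queries are not supported, because DESCRIBE returns the Symmetric Concise Bounded Description (SCBD), i.e. it also includes the incoming class links. For large graphs with a million instances, this is not efficient. See https://github.com/eclipse-rdf4j/rdf4j/issues/4857
- local_file: a local RDF ontology file. Supported RDF formats are Turtle, RDF/XML, JSON-LD, N-Triples, Notation-3, Trig, Trix, N-Quads.
In either case, the ontology dump should:
- Include enough information about classes, properties, property attachment to classes (using rdfs:domain, schema:domainIncludes or OWL restrictions), and taxonomies (important individuals).
- Not include overly verbose and irrelevant definitions and examples that do not help SPARQL construction.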
from langchain_community.graphs import OntotextGraphDBGraph
# feeding the schema using a user construct query
graph = OntotextGraphDBGraph(
query_endpoint="http://localhost:7200/repositories/langchain",
query_ontology="CONSTRUCT {?s ?p ?o} FROM <https://swapi.co/ontology/> WHERE {?s ?p ?o}",
)
# feeding the schema using a local RDF file
graph = OntotextGraphDBGraph(
query_endpoint="http://localhost:7200/repositories/langchain",
local_file="/path/to/langchain_graphdb_tutorial/starwars-ontology.nt", # change the path here
)
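The CONSTRUCT query in the first example hard-codes the named graph that holds the ontology. As a small sketch, the query string could be built from the graph IRI, with a rough local check that a query is a CONSTRUCT rather than a DESCRIBE before it is passed as `query_ontology`. Both helper names are hypothetical, and the check is a naive string test (it assumes PREFIX and BASE declarations sit on their own lines), not a SPARQL parser and not the library's own validation.

```python
def ontology_construct_query(named_graph: str) -> str:
    """Return a CONSTRUCT query that dumps every statement of one named graph."""
    return f"CONSTRUCT {{?s ?p ?o}} FROM <{named_graph}> WHERE {{?s ?p ?o}}"


def is_construct_query(query: str) -> bool:
    """Naively check that the first query keyword is CONSTRUCT.

    Blank lines, comment lines, and PREFIX/BASE declaration lines are skipped.
    """
    for line in query.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        keyword = stripped.split(None, 1)[0].upper()
        if keyword in ("PREFIX", "BASE"):
            continue
        return keyword.startswith("CONSTRUCT")
    return False


query = ontology_construct_query("https://swapi.co/ontology/")
print(query)
print(is_construct_query(query))
print(is_construct_query("DESCRIBE <https://swapi.co/vocabulary/Film>"))
```

The generated string is the same query used in the `query_ontology` example above, so it can be passed to `OntotextGraphDBGraph` unchanged.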
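Either way, the ontology (schema) is fed to the LLM as Turtle, since Turtle with appropriate prefixes is the most compact form and the easiest for the LLM to remember.
The Star Wars ontology is a bit unusual in that it includes many specific triples about classes, e.g. that the species :Aleena lives on <planet/38>, is a subclass of :Reptile, has certain typical characteristics (average height, average lifespan, skin color), and that specific individuals (characters) are representatives of that class: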
@prefix : <https://swapi.co/vocabulary/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
:Aleena a owl:Class, :Species ;
rdfs:label "Aleena" ;
rdfs:isDefinedBy <https://swapi.co/ontology/> ;
rdfs:subClassOf :Reptile, :Sentient ;
:averageHeight 80.0 ;
:averageLifespan "79" ;
:character <https://swapi.co/resource/aleena/47> ;
:film <https://swapi.co/resource/film/4> ;
:language "Aleena" ;
:planet <https://swapi.co/resource/planet/38> ;
:skinColor "blue", "gray" .
...
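To illustrate why prefixed Turtle is compact, here is a rough sketch of how full IRIs shrink to prefixed names using the prefixes from the listing above. The `compact` helper is hypothetical and deliberately naive: it only abbreviates an IRI when the remainder after the namespace contains no slash, and it does not implement the full Turtle grammar.

```python
# Prefixes copied from the Turtle listing above.
PREFIXES = {
    "": "https://swapi.co/vocabulary/",
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
}


def compact(iri: str, prefixes: dict = PREFIXES) -> str:
    """Abbreviate an IRI to prefix:local form, or fall back to <iri>."""
    for prefix, namespace in prefixes.items():
        local = iri[len(namespace):]
        if iri.startswith(namespace) and local and "/" not in local:
            return f"{prefix}:{local}"
    return f"<{iri}>"


for iri in (
    "https://swapi.co/vocabulary/Aleena",
    "http://www.w3.org/2000/01/rdf-schema#subClassOf",
    "https://swapi.co/resource/planet/38",
):
    print(compact(iri))
```

The first two IRIs print as `:Aleena` and `rdfs:subClassOf`, while the resource IRI has no matching prefix and stays in angle brackets, as in the listing.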
To keep this tutorial simple, we use an unsecured GraphDB. If GraphDB is secured, you should set the environment variables 'GRAPHDB_USERNAME' and 'GRAPHDB_PASSWORD' before initializing OntotextGraphDBGraph.
import os
os.environ["GRAPHDB_USERNAME"] = "graphdb-user"
os.environ["GRAPHDB_PASSWORD"] = "graphdb-password"
graph = OntotextGraphDBGraph(
query_endpoint=...,
query_ontology=...
)
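Because both variables have to be set together, a small pre-flight check can catch a half-configured environment before the graph object is created. This is a sketch of such a check; the `graphdb_credentials` helper is hypothetical and only mirrors the rule stated above, it is not how the library itself reads the variables.

```python
import os
from typing import Mapping, Optional, Tuple


def graphdb_credentials(
    env: Mapping[str, str] = os.environ,
) -> Optional[Tuple[str, str]]:
    """Return (username, password) from the environment, or None if both are unset.

    Raises ValueError when only one of the two variables is present or one is
    empty, so the mistake surfaces before any request is sent.
    """
    username = env.get("GRAPHDB_USERNAME")
    password = env.get("GRAPHDB_PASSWORD")
    if username is None and password is None:
        return None
    if not username or not password:
        raise ValueError(
            "Set both GRAPHDB_USERNAME and GRAPHDB_PASSWORD, or neither."
        )
    return username, password


# Example with an explicit mapping instead of the real environment.
print(graphdb_credentials({"GRAPHDB_USERNAME": "graphdb-user",
                           "GRAPHDB_PASSWORD": "graphdb-password"}))
```

Passing a plain dictionary keeps the check easy to try out; calling it with no argument reads the real process environment.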
Question Answering against the Star Wars dataset
We can now use the OntotextGraphDBQAChain to ask some questions.
import os
from langchain.chains import OntotextGraphDBQAChain
from langchain_openai import ChatOpenAI
# We'll be using an OpenAI model which requires an OpenAI API Key.
# However, other models are available as well:
# https://python.langchain.com/docs/integrations/chat/
# Set the environment variable `OPENAI_API_KEY` to your OpenAI API key
os.environ["OPENAI_API_KEY"] = "sk-***"
# Any available OpenAI model can be used here.
# We use 'gpt-4-1106-preview' because of its bigger context window.
# The 'gpt-4-1106-preview' model name will be deprecated in the future and replaced by 'gpt-4-turbo' or similar,
# so be sure to consult the OpenAI API docs https://platform.openai.com/docs/models for the correct naming.
chain = OntotextGraphDBQAChain.from_llm(
ChatOpenAI(temperature=0, model_name="gpt-4-1106-preview"),
graph=graph,
verbose=True,
allow_dangerous_requests=True,
)
Let's ask a simple one.
chain.invoke({chain.input_key: "What is the climate on Tatooine?"})[chain.output_key]
> Entering new OntotextGraphDBQAChain chain...
Generated SPARQL:
PREFIX : <https://swapi.co/vocabulary/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?climate
WHERE {
?planet rdfs:label "Tatooine" ;
:climate ?climate .
}
> Finished chain.
'The climate on Tatooine is arid.'
And a slightly more complicated one.
chain.invoke({chain.input_key: "What is the climate on Luke Skywalker's home planet?"})[
chain.output_key
]
> Entering new OntotextGraphDBQAChain chain...
Generated SPARQL:
PREFIX : <https://swapi.co/vocabulary/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?climate
WHERE {
?character rdfs:label "Luke Skywalker" .
?character :homeworld ?planet .
?planet :climate ?climate .
}
> Finished chain.
"The climate on Luke Skywalker's home planet is arid."
We can also ask more complicated questions, such as:
chain.invoke(
{
chain.input_key: "What is the average box office revenue for all the Star Wars movies?"
}
)[chain.output_key]
> Entering new OntotextGraphDBQAChain chain...
Generated SPARQL:
PREFIX : <https://swapi.co/vocabulary/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT (AVG(?boxOffice) AS ?averageBoxOfficeRevenue)
WHERE {
?film a :Film .
?film :boxOffice ?boxOfficeValue .
BIND(xsd:decimal(?boxOfficeValue) AS ?boxOffice)
}
> Finished chain.
'The average box office revenue for all the Star Wars movies is approximately 754.1 million dollars.'
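All three calls above follow the same pattern: wrap the question under `chain.input_key`, invoke the chain, and read the answer from `chain.output_key`. A small helper can hide that boilerplate. In this sketch, `ask` is a hypothetical convenience function, and `EchoChain` is a stand-in with placeholder key names so the snippet runs without GraphDB or an OpenAI key; with the real chain you would pass the `chain` object created earlier.

```python
class EchoChain:
    """Stand-in exposing the three attributes the examples above rely on."""

    # Placeholder key names for this stub only.
    input_key = "question"
    output_key = "answer"

    def invoke(self, inputs: dict) -> dict:
        # A real chain would generate SPARQL, query GraphDB, and summarize.
        return {self.output_key: f"(answer to: {inputs[self.input_key]})"}


def ask(chain, question: str) -> str:
    """Send one natural language question through the chain and return the answer."""
    return chain.invoke({chain.input_key: question})[chain.output_key]


questions = [
    "What is the climate on Tatooine?",
    "What is the climate on Luke Skywalker's home planet?",
]
for question in questions:
    print(ask(EchoChain(), question))
```

Because the helper reads the key names from the chain itself, it does not need to know what they are called.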
Chain modifiers
The Ontotext GraphDB QA chain allows prompt refinement to further improve your QA chain and enhance the overall user experience of your app.
"SPARQL Generation" prompt
The prompt is used to generate the SPARQL query from the user question and the KG schema.
- sparql_generation_prompt. Default value:
GRAPHDB_SPARQL_GENERATION_TEMPLATE = """
Write a SPARQL SELECT query for querying a graph database.
The ontology schema delimited by triple backticks in Turtle format is:
```
{schema}
```
Use only the classes and properties provided in the schema to construct the SPARQL query.
Do not use any classes or properties that are not explicitly provided in the SPARQL query.
Include all necessary prefixes.
Do not include any explanations or apologies in your responses.
Do not wrap the query in backticks.
Do not include any text except the SPARQL query generated.
The question delimited by triple backticks is:
```
{prompt}
```
"""
GRAPHDB_SPARQL_GENERATION_PROMPT = PromptTemplate(
input_variables=["schema", "prompt"],
template=GRAPHDB_SPARQL_GENERATION_TEMPLATE,
)
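To see what the LLM actually receives, it can be useful to fill the two placeholders by hand. The sketch below uses a shortened stand-in for the template and plain `str.format`, on the assumption that the default template format substitutes simple named placeholders the same way. The schema fragment is a tiny made-up example, not the full ontology.

```python
# Shortened stand-in for the default template; only the placeholders matter here.
STAND_IN_TEMPLATE = (
    "Write a SPARQL SELECT query for querying a graph database.\n"
    "The ontology schema in Turtle format is:\n"
    "{schema}\n"
    "The question is:\n"
    "{prompt}\n"
)

# Tiny hypothetical schema fragment, for illustration only.
schema = (
    "@prefix : <https://swapi.co/vocabulary/> .\n"
    "@prefix owl: <http://www.w3.org/2002/07/owl#> .\n"
    ":Film a owl:Class ."
)
question = "What is the climate on Tatooine?"

# Fill in the schema and the user question, as the chain does before calling the LLM.
rendered = STAND_IN_TEMPLATE.format(schema=schema, prompt=question)
print(rendered)
```

A custom template has to keep both the `{schema}` and `{prompt}` placeholders, since those are the declared input variables.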
"SPARQL Fix" prompt
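Sometimes the LLM may generate a SPARQL query with syntax errors, missing prefixes, and so on. The chain tries to amend this by prompting the LLM to correct the query, up to a certain number of times.
- sparql_fix_prompt. Default value: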
GRAPHDB_SPARQL_FIX_TEMPLATE = """
This following SPARQL query delimited by triple backticks
```
{generated_sparql}
```
is not valid.
The error delimited by triple backticks is
```
{error_message}
```
Give me a correct version of the SPARQL query.
Do not change the logic of the query.
Do not include any explanations or apologies in your responses.
Do not wrap the query in backticks.
Do not include any text except the SPARQL query generated.
The ontology schema delimited by triple backticks in Turtle format is:
```
{schema}
```
"""
GRAPHDB_SPARQL_FIX_PROMPT = PromptTemplate(
input_variables=["error_message", "generated_sparql", "schema"],
template=GRAPHDB_SPARQL_FIX_TEMPLATE,
)
- max_fix_retries. Default value: 5
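The fix step can be pictured as a bounded retry loop. The following is a minimal sketch of that idea, not the chain's actual implementation: `validate` and `fix` are hypothetical callables standing in for query validation and for an LLM call with the fix prompt.

```python
from typing import Callable, Optional


def generate_with_fixes(
    query: str,
    validate: Callable[[str], Optional[str]],
    fix: Callable[[str, str], str],
    max_fix_retries: int = 5,
) -> str:
    """Return a query that passes validation, repairing it if needed.

    validate returns None for a valid query, otherwise an error message.
    fix receives the broken query and the error message and returns a new query.
    """
    error = validate(query)
    retries = 0
    while error is not None and retries < max_fix_retries:
        query = fix(query, error)
        error = validate(query)
        retries += 1
    if error is not None:
        raise ValueError(
            f"Query still invalid after {retries} fix attempts: {error}"
        )
    return query


# Toy stand-ins: the "validator" only checks for a PREFIX declaration.
def toy_validate(query: str) -> Optional[str]:
    return None if query.startswith("PREFIX") else "missing prefix"


def toy_fix(query: str, error: str) -> str:
    return "PREFIX : <https://swapi.co/vocabulary/>\n" + query


fixed = generate_with_fixes(
    "SELECT ?c WHERE { ?p :climate ?c }", toy_validate, toy_fix
)
print(fixed)
```

Bounding the loop keeps a query that cannot be repaired from triggering an unlimited number of LLM calls.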
"Answering" prompt
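The prompt is used to answer the question based on the results returned from the database and the initial user question. By default, the LLM is instructed to use only the information from the returned results. If the result set is empty, the LLM should state that it cannot answer the question.
- qa_prompt. Default value: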
GRAPHDB_QA_TEMPLATE = """Task: Generate a natural language response from the results of a SPARQL query.
You are an assistant that creates well-written and human understandable answers.
The information part contains the information provided, which you can use to construct an answer.
The information provided is authoritative, you must never doubt it or try to use your internal knowledge to correct it.
Make your response sound like the information is coming from an AI assistant, but don't add any information.
Don't use internal knowledge to answer the question, just say you don't know if no information is available.
Information:
{context}
Question: {prompt}
Helpful Answer:"""
GRAPHDB_QA_PROMPT = PromptTemplate(
input_variables=["context", "prompt"], template=GRAPHDB_QA_TEMPLATE
)
Once you have finished experimenting with QA on GraphDB, you can shut down the Docker environment by running docker compose down -v --remove-orphans from the directory that contains the Docker Compose file.