Ontotext GraphDB

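Ontotext GraphDB is a graph database and knowledge discovery tool compliant with RDF and SPARQL.

This notebook shows how to use LLMs to provide natural language querying (NLQ to SPARQL, also called text2sparql) for Ontotext GraphDB.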

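GraphDB LLM Functionalities

GraphDB supports some LLM integration functionalities, as described below:

GPT queries

magic predicates to ask an LLM for text, a list or a table using data from your knowledge graph (KG)
query explanation
result explanation, summarization, rephrasing, translation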

retrieval-graphdb-connector

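Indexing of KG entities in a vector database
Supports any text embedding algorithm and vector database
Uses the same powerful connector (indexing) language that GraphDB uses for Elastic, Solr, Lucene
Automatic synchronization of changes in RDF data to the KG entity index
Supports nested objects (no UI support in GraphDB version 10.5)
Serializes KG entities to text like this (e.g. for a Wines dataset):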

Franvino:
- is a RedWine.
- made from grape Merlo.
- made from grape Cabernet Franc.
- has sugar dry.
- has year 2012.
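
The text form above can be pictured as a small template applied to each entity. The following is only an illustrative sketch, not GraphDB's connector code: `serialize_entity` and the (phrase, value) pairs are hypothetical names chosen to reproduce the Franvino example.

```python
def serialize_entity(name, facts):
    """Render an entity as a short bulleted text description.

    `facts` is a list of (phrase, value) pairs, e.g. ("made from grape", "Merlo").
    """
    lines = [f"{name}:"]
    for phrase, value in facts:
        lines.append(f"- {phrase} {value}.")
    return "\n".join(lines)


# Hypothetical input mirroring the Wines example in the text.
franvino = serialize_entity(
    "Franvino",
    [
        ("is a", "RedWine"),
        ("made from grape", "Merlo"),
        ("made from grape", "Cabernet Franc"),
        ("has sugar", "dry"),
        ("has year", 2012),
    ],
)
print(franvino)
```

Text of this shape is what would then be passed to an embedding model and stored in the vector database.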

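Talk-to-Graph

A simple chatbot using a defined KG entity index

For this tutorial, we won't use the GraphDB LLM integration, but SPARQL generation from NLQ. We'll use the Star Wars API (SWAPI) ontology and dataset.

Setting up

You need a running GraphDB instance. This tutorial shows how to run the database locally using the GraphDB Docker image. It provides a docker compose set-up, which populates GraphDB with the Star Wars dataset. All necessary files, including this notebook, can be downloaded from the GitHub repository langchain-graphdb-qa-chain-demo.

Install Docker. This tutorial was created using Docker version 24.0.7, which bundles Docker Compose. For earlier Docker versions you may need to install Docker Compose separately.
Clone the GitHub repository langchain-graphdb-qa-chain-demo into a local folder on your machine.
Start GraphDB with the following script, executed from the same folder: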

docker build --tag graphdb .
docker compose up -d graphdb

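You need to wait a couple of seconds for the database to start on http://localhost:7200/. The Star Wars dataset starwars-data.trig is automatically loaded into the langchain repository. The local SPARQL endpoint http://localhost:7200/repositories/langchain can be used to run queries against it. You can also open the GraphDB Workbench in your favourite web browser at http://localhost:7200/sparql, where you can run queries interactively.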
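
Rather than guessing how long the start-up takes, the endpoint can be polled from Python. This is a minimal sketch using only the standard library; the helper names (`build_sparql_request`, `endpoint_is_up`, `wait_for_endpoint`) and the retry settings are assumptions made for illustration. The request follows the SPARQL 1.1 Protocol (a GET with a `query` parameter), and the endpoint URL is the one given in this tutorial.

```python
import time
import urllib.parse
import urllib.request

ENDPOINT = "http://localhost:7200/repositories/langchain"


def build_sparql_request(endpoint, query):
    """Build a SPARQL 1.1 Protocol GET request asking for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def endpoint_is_up(endpoint):
    """Return True if a trivial ASK query gets an HTTP 200 response."""
    request = build_sparql_request(endpoint, "ASK {}")
    try:
        with urllib.request.urlopen(request, timeout=5) as response:
            return response.status == 200
    except OSError:  # URLError and socket-level errors are OSError subclasses
        return False


def wait_for_endpoint(endpoint, attempts=30, delay=1.0, probe=endpoint_is_up):
    """Poll `probe` until it succeeds or `attempts` is exhausted."""
    for _ in range(attempts):
        if probe(endpoint):
            return True
        time.sleep(delay)
    return False
```

Calling `wait_for_endpoint(ENDPOINT)` after `docker compose up` should return `True` once the repository answers queries. The `probe` parameter exists only so the loop can be exercised without a running database.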

Set up the working environment

If you use conda, create and activate a new conda environment, e.g.:

conda create -n graph_ontotext_graphdb_qa python=3.12
conda activate graph_ontotext_graphdb_qa

Install the following libraries:

pip install jupyter==1.1.1
pip install rdflib==7.1.1
pip install langchain-community==0.3.4
pip install langchain-openai==0.2.4

Run Jupyter with:

jupyter notebook

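Specifying the ontology

In order for the LLM to be able to generate SPARQL, it needs to know the knowledge graph schema (the ontology). It can be provided using one of two parameters on the OntotextGraphDBGraph class:

query_ontology: a CONSTRUCT query that is executed on the SPARQL endpoint and returns the KG schema statements. We recommend that you store the ontology in its own named graph, which makes it easier to get only the relevant statements (as in the example below). DESCRIBE queries are not supported, because DESCRIBE returns the Symmetric Concise Bounded Description (SCBD), i.e. also the incoming class links. For large graphs with a million instances, this is not efficient. See https://github.com/eclipse-rdf4j/rdf4j/issues/4857
local_file: a local RDF ontology file. Supported RDF formats are Turtle, RDF/XML, JSON-LD, N-Triples, Notation-3, Trig, Trix, N-Quads.

In either case, the ontology dump should:

Include enough information about classes, properties, property attachment to classes (using rdfs:domain, schema:domainIncludes or OWL restrictions), and taxonomies (important individuals).
Not include overly verbose and irrelevant definitions and examples that do not help SPARQL construction.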

from langchain_community.graphs import OntotextGraphDBGraph

# feeding the schema using a user construct query

graph = OntotextGraphDBGraph(
    query_endpoint="http://localhost:7200/repositories/langchain",
    query_ontology="CONSTRUCT {?s ?p ?o} FROM <https://swapi.co/ontology/> WHERE {?s ?p ?o}",
)
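
Because query_ontology has to be a CONSTRUCT query and DESCRIBE is not supported, it can help to check the query form before handing the string to the graph class. The sketch below is a deliberately naive, regex-based check written for illustration; it is not how the library validates the query, it does not handle SPARQL comments, and the names `query_form` and `require_construct` are hypothetical.

```python
import re

# Matches any leading BASE / PREFIX declarations (the SPARQL prologue).
_PROLOGUE = re.compile(
    r"\s*(?:(?:BASE\s*<[^>]*>|PREFIX\s+[^\s:]*:\s*<[^>]*>)\s*)*",
    re.IGNORECASE,
)
_QUERY_FORM = re.compile(r"(SELECT|CONSTRUCT|DESCRIBE|ASK)\b", re.IGNORECASE)


def query_form(query):
    """Return the upper-cased query form keyword, or None if not recognised."""
    rest = query[_PROLOGUE.match(query).end():]
    match = _QUERY_FORM.match(rest)
    return match.group(1).upper() if match else None


def require_construct(query):
    """Return the query unchanged, or raise if it is not a CONSTRUCT query."""
    form = query_form(query)
    if form != "CONSTRUCT":
        raise ValueError(f"query_ontology must be a CONSTRUCT query, got {form}")
    return query


ontology_query = require_construct(
    "CONSTRUCT {?s ?p ?o} FROM <https://swapi.co/ontology/> WHERE {?s ?p ?o}"
)
```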

API Reference: OntotextGraphDBGraph

# feeding the schema using a local RDF file

graph = OntotextGraphDBGraph(
    query_endpoint="http://localhost:7200/repositories/langchain",
    local_file="/path/to/langchain_graphdb_tutorial/starwars-ontology.nt",  # change the path here
)

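Either way, the ontology (schema) is fed to the LLM as Turtle, since Turtle with appropriate prefixes is the most compact format and the easiest for the LLM to remember.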
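
To see why prefixes make the schema more compact, consider how a full IRI shrinks once a namespace is declared. This is a small illustrative sketch, not the serializer the library uses; the `compact` helper is hypothetical, and it only abbreviates local names that are plain identifiers, which is stricter than what Turtle allows.

```python
PREFIXES = {
    "": "https://swapi.co/vocabulary/",
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
}


def compact(iri, prefixes=PREFIXES):
    """Shorten a full IRI to prefix:local form when a namespace matches."""
    # Try the longest namespace first so the most specific prefix wins.
    for prefix, namespace in sorted(prefixes.items(), key=lambda item: -len(item[1])):
        local = iri[len(namespace):]
        if iri.startswith(namespace) and local.isidentifier():
            return f"{prefix}:{local}"
    # No usable prefix: fall back to the full IRI in angle brackets.
    return f"<{iri}>"


triple = (
    "https://swapi.co/vocabulary/Aleena",
    "http://www.w3.org/2000/01/rdf-schema#subClassOf",
    "https://swapi.co/vocabulary/Reptile",
)
short = " ".join(compact(term) for term in triple) + " ."
print(short)  # :Aleena rdfs:subClassOf :Reptile .
```

The abbreviated statement is a fraction of the length of the same triple written with full IRIs, which matters when the whole schema has to fit in the model's context window.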

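The Star Wars ontology is a bit unusual in that it includes a lot of specific triples about classes, e.g. that the species :Aleena live on <planet/38>, that they are a subclass of :Reptile, that they have certain typical characteristics (average height, average lifespan, skin color), and that specific individuals (characters) are representatives of that class: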

@prefix : <https://swapi.co/vocabulary/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

:Aleena a owl:Class, :Species ;
    rdfs:label "Aleena" ;
    rdfs:isDefinedBy <https://swapi.co/ontology/> ;
    rdfs:subClassOf :Reptile, :Sentient ;
    :averageHeight 80.0 ;
    :averageLifespan "79" ;
    :character <https://swapi.co/resource/aleena/47> ;
    :film <https://swapi.co/resource/film/4> ;
    :language "Aleena" ;
    :planet <https://swapi.co/resource/planet/38> ;
    :skinColor "blue", "gray" .

    ...

To keep this tutorial simple, we use an unsecured GraphDB. If GraphDB is secured, you should set the environment variables GRAPHDB_USERNAME and GRAPHDB_PASSWORD before initializing OntotextGraphDBGraph.

import os

os.environ["GRAPHDB_USERNAME"] = "graphdb-user"
os.environ["GRAPHDB_PASSWORD"] = "graphdb-password"

graph = OntotextGraphDBGraph(
    query_endpoint=...,
    query_ontology=...
)
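
When working against a secured instance, a quick pre-flight check can catch a half-configured environment before the graph object is created. The helper below is a hypothetical sketch and not part of the library; the rule it enforces (both variables set, or neither) is an assumption chosen for illustration.

```python
import os


def graphdb_credentials(env=os.environ):
    """Return (username, password), or None when no credentials are configured."""
    username = env.get("GRAPHDB_USERNAME")
    password = env.get("GRAPHDB_PASSWORD")
    if username is None and password is None:
        return None  # unsecured set-up, as used in this tutorial
    if not username or not password:
        raise ValueError("Set both GRAPHDB_USERNAME and GRAPHDB_PASSWORD, or neither.")
    return username, password
```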

Question Answering against the StarWars dataset

We can now use the OntotextGraphDBQAChain to ask some questions.

import os

from langchain.chains import OntotextGraphDBQAChain
from langchain_openai import ChatOpenAI

# We'll be using an OpenAI model which requires an OpenAI API Key.
# However, other models are available as well:
# https://python.langchain.com/docs/integrations/chat/

# Set the environment variable `OPENAI_API_KEY` to your OpenAI API key
os.environ["OPENAI_API_KEY"] = "sk-***"

# Any available OpenAI model can be used here.
# We use 'gpt-4-1106-preview' because of the bigger context window.
# The 'gpt-4-1106-preview' model_name will deprecate in the future and will change to 'gpt-4-turbo' or similar,
# so be sure to consult with the OpenAI API https://platform.openai.com/docs/models for the correct naming.

chain = OntotextGraphDBQAChain.from_llm(
    ChatOpenAI(temperature=0, model_name="gpt-4-1106-preview"),
    graph=graph,
    verbose=True,
    allow_dangerous_requests=True,
)

API Reference: OntotextGraphDBQAChain | ChatOpenAI

Let's ask a simple one.

chain.invoke({chain.input_key: "What is the climate on Tatooine?"})[chain.output_key]

> Entering new OntotextGraphDBQAChain chain...
Generated SPARQL:
PREFIX : <https://swapi.co/vocabulary/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?climate
WHERE {
  ?planet rdfs:label "Tatooine" ;
          :climate ?climate .
}

> Finished chain.

'The climate on Tatooine is arid.'

And a slightly more complicated one.

chain.invoke({chain.input_key: "What is the climate on Luke Skywalker's home planet?"})[
    chain.output_key
]

> Entering new OntotextGraphDBQAChain chain...
Generated SPARQL:
PREFIX : <https://swapi.co/vocabulary/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?climate
WHERE {
  ?character rdfs:label "Luke Skywalker" .
  ?character :homeworld ?planet .
  ?planet :climate ?climate .
}

> Finished chain.

"The climate on Luke Skywalker's home planet is arid."

We can also ask more complicated questions, such as:

chain.invoke(
    {
        chain.input_key: "What is the average box office revenue for all the Star Wars movies?"
    }
)[chain.output_key]

> Entering new OntotextGraphDBQAChain chain...
Generated SPARQL:
PREFIX : <https://swapi.co/vocabulary/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT (AVG(?boxOffice) AS ?averageBoxOfficeRevenue)
WHERE {
  ?film a :Film .
  ?film :boxOffice ?boxOfficeValue .
  BIND(xsd:decimal(?boxOfficeValue) AS ?boxOffice)
}

> Finished chain.

'The average box office revenue for all the Star Wars movies is approximately 754.1 million dollars.'
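
The three calls above share one shape: wrap the question under `chain.input_key`, call `invoke`, and read the answer from `chain.output_key`. A small convenience wrapper can make that explicit. The `ask` helper and the `EchoChain` stand-in are hypothetical names introduced here; the stand-in only mimics the three attributes used in this tutorial so the snippet runs without an LLM or a database, and its key values are arbitrary.

```python
def ask(chain, question):
    """Send one question through the chain and return only the answer text."""
    return chain.invoke({chain.input_key: question})[chain.output_key]


class EchoChain:
    """Stand-in exposing input_key, output_key and invoke, for illustration only."""

    input_key = "question"  # arbitrary value for this stand-in
    output_key = "result"   # arbitrary value for this stand-in

    def invoke(self, inputs):
        return {self.output_key: f"You asked: {inputs[self.input_key]}"}


answer = ask(EchoChain(), "What is the climate on Tatooine?")
print(answer)
```

With the real chain built earlier, the call would be `ask(chain, "What is the climate on Tatooine?")`.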

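Chain modifiers

The Ontotext GraphDB QA chain allows prompt refinement to further improve your QA chain and enhance the overall user experience of your app.

"SPARQL Generation" prompt

The prompt is used for SPARQL query generation based on the user question and the KG schema.

sparql_generation_prompt

Default value: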

  GRAPHDB_SPARQL_GENERATION_TEMPLATE = """
  Write a SPARQL SELECT query for querying a graph database.
  The ontology schema delimited by triple backticks in Turtle format is:
  ```
  {schema}
  ```
  Use only the classes and properties provided in the schema to construct the SPARQL query.
  Do not use any classes or properties that are not explicitly provided in the SPARQL query.
  Include all necessary prefixes.
  Do not include any explanations or apologies in your responses.
  Do not wrap the query in backticks.
  Do not include any text except the SPARQL query generated.
  The question delimited by triple backticks is:
  ```
  {prompt}
  ```
  """
  GRAPHDB_SPARQL_GENERATION_PROMPT = PromptTemplate(
      input_variables=["schema", "prompt"],
      template=GRAPHDB_SPARQL_GENERATION_TEMPLATE,
  )
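
To make the two placeholders concrete, here is a sketch of how the schema and the user question end up in the final prompt text. It uses plain `str.format` on an abbreviated template rather than `PromptTemplate`, on the assumption that the default placeholder style substitutes `{schema}` and `{prompt}` in the same way; the `render` helper and the one-line schema are illustrative only.

```python
# Abbreviated stand-in for the default template, keeping both placeholders.
TEMPLATE = (
    "Write a SPARQL SELECT query for querying a graph database.\n"
    "The ontology schema in Turtle format is:\n"
    "{schema}\n"
    "The question is:\n"
    "{prompt}\n"
)


def render(template, schema, question):
    """Fill the two placeholders the generation prompt expects."""
    return template.format(schema=schema, prompt=question)


rendered = render(
    TEMPLATE,
    schema=":Aleena a owl:Class .",
    question="What is the climate on Tatooine?",
)
print(rendered)
```

A custom template passed as sparql_generation_prompt would need to keep the same two input variables, `schema` and `prompt`, for the substitution to work.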

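"SPARQL Fix" prompt

Sometimes, the LLM may generate a SPARQL query with syntax errors, missing prefixes, and so on. The chain will try to amend this by prompting the LLM to correct it, up to a certain number of times.

sparql_fix_prompt

Default value: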

  GRAPHDB_SPARQL_FIX_TEMPLATE = """
  This following SPARQL query delimited by triple backticks
  ```
  {generated_sparql}
  ```
  is not valid.
  The error delimited by triple backticks is
  ```
  {error_message}
  ```
  Give me a correct version of the SPARQL query.
  Do not change the logic of the query.
  Do not include any explanations or apologies in your responses.
  Do not wrap the query in backticks.
  Do not include any text except the SPARQL query generated.
  The ontology schema delimited by triple backticks in Turtle format is:
  ```
  {schema}
  ```
  """
  
  GRAPHDB_SPARQL_FIX_PROMPT = PromptTemplate(
      input_variables=["error_message", "generated_sparql", "schema"],
      template=GRAPHDB_SPARQL_FIX_TEMPLATE,
  )

max_fix_retries

Default value: 5
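
The generate, validate and fix cycle described in this section can be sketched as a bounded retry loop. This is an illustration of the idea under stated assumptions, not the chain's actual implementation: `generate`, `validate` and `fix` are hypothetical callables standing in for the LLM calls and the SPARQL parser, and the exact way the real chain counts retries may differ.

```python
def generate_valid_sparql(generate, validate, fix, max_fix_retries=5):
    """Return a query accepted by `validate`, asking `fix` for repairs on failure.

    generate() -> str                 first attempt at a query
    validate(query) -> None           raises ValueError when the query is invalid
    fix(query, error_message) -> str  a proposed correction
    """
    query = generate()
    for attempt in range(max_fix_retries + 1):
        try:
            validate(query)
            return query
        except ValueError as error:
            if attempt == max_fix_retries:
                raise  # retries exhausted: surface the last validation error
            query = fix(query, str(error))


# Toy stand-ins: the "LLM" forgets the prefix, the "fixer" adds it back.
def toy_validate(query):
    if "PREFIX" not in query:
        raise ValueError("missing prefix declaration")


fixed = generate_valid_sparql(
    generate=lambda: "SELECT ?climate WHERE { ?planet :climate ?climate }",
    validate=toy_validate,
    fix=lambda query, error: "PREFIX : <https://swapi.co/vocabulary/>\n" + query,
)
print(fixed)
```

The error message is passed to `fix` for the same reason the fix prompt includes `{error_message}`: the model needs to know what went wrong in order to correct it.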

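"Answering" prompt

The prompt is used for answering the question based on the results returned from the database and the initial user question. By default, the LLM is instructed to use only the information from the returned results. If the result set is empty, the LLM should say that it cannot answer the question.

qa_prompt

Default value: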

  GRAPHDB_QA_TEMPLATE = """Task: Generate a natural language response from the results of a SPARQL query.
  You are an assistant that creates well-written and human understandable answers.
  The information part contains the information provided, which you can use to construct an answer.
  The information provided is authoritative, you must never doubt it or try to use your internal knowledge to correct it.
  Make your response sound like the information is coming from an AI assistant, but don't add any information.
  Don't use internal knowledge to answer the question, just say you don't know if no information is available.
  Information:
  {context}
  
  Question: {prompt}
  Helpful Answer:"""
  GRAPHDB_QA_PROMPT = PromptTemplate(
      input_variables=["context", "prompt"], template=GRAPHDB_QA_TEMPLATE
  )

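Once you're finished playing with QA with GraphDB, you can shut down the Docker environment by running docker compose down -v --remove-orphans from the directory containing the Docker compose file.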