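Knowledge Graph Extraction
In this guide, we use outlines to extract a knowledge graph from unstructured text.
We will use llama.cpp through the llama-cpp-python library. Outlines supports llama-cpp-python, but we need to install it ourselves:
pip install llama-cpp-python
We download the model weights by passing the name of the repository on the HuggingFace Hub and the filename (or a glob pattern):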
import llama_cpp
from outlines import generate, models
model = models.llamacpp(
    "NousResearch/Hermes-2-Pro-Llama-3-8B-GGUF",  # repository on the Hub
    "Hermes-2-Pro-Llama-3-8B-Q4_K_M.gguf",  # filename (or glob pattern)
    tokenizer=llama_cpp.llama_tokenizer.LlamaHFTokenizer.from_pretrained(
        "NousResearch/Hermes-2-Pro-Llama-3-8B"
    ),
    n_gpu_layers=-1,  # offload all layers to the GPU
    flash_attn=True,
    n_ctx=8192,  # context window size in tokens
    verbose=False,
)
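The filename argument can be a glob pattern rather than an exact name. As a rough illustration of how such a pattern narrows a repository listing down to one file, the sketch below applies Python's standard fnmatch module to a hypothetical list of files. The listing and the helper name match_weights are assumptions made for illustration; they are not part of Outlines or llama-cpp-python.

```python
from fnmatch import fnmatch

# Hypothetical listing of files in a GGUF repository (illustrative only).
repo_files = [
    "Hermes-2-Pro-Llama-3-8B-Q4_K_M.gguf",
    "Hermes-2-Pro-Llama-3-8B-Q8_0.gguf",
    "README.md",
]


def match_weights(files, pattern):
    """Return the files whose names match a shell-style glob pattern."""
    return [name for name in files if fnmatch(name, pattern)]


# A pattern tied to the quantization level selects a single file.
matches = match_weights(repo_files, "*Q4_K_M.gguf")
print(matches)
```

A broad pattern such as "*.gguf" would match several files in this listing, so a pattern that names the quantization level is likely the safer choice.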
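(Optional) Store the model weights in a custom folder
By default, the model weights are downloaded to the Hub cache. To store them in a custom folder instead, we can pull Hermes-2-Pro-Llama-3-8B, a quantized GGUF model by NousResearch, from HuggingFace: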
wget https://hugging-face.cn/NousResearch/Hermes-2-Pro-Llama-3-8B-GGUF/resolve/main/Hermes-2-Pro-Llama-3-8B-Q4_K_M.gguf
We then initialize the model:
import llama_cpp
from llama_cpp import Llama
from outlines import generate, models
llm = Llama(
    "/path/to/model/Hermes-2-Pro-Llama-3-8B-Q4_K_M.gguf",
    tokenizer=llama_cpp.llama_tokenizer.LlamaHFTokenizer.from_pretrained(
        "NousResearch/Hermes-2-Pro-Llama-3-8B"
    ),
    n_gpu_layers=-1,
    flash_attn=True,
    n_ctx=8192,
    verbose=False,
)
Knowledge Graph Extraction
We first need to define a Pydantic class for each node and each edge of the knowledge graph:
from pydantic import BaseModel, Field
class Node(BaseModel):
    """Node of the Knowledge Graph"""
    id: int = Field(..., description="Unique identifier of the node")
    label: str = Field(..., description="Label of the node")
    property: str = Field(..., description="Property of the node")

class Edge(BaseModel):
    """Edge of the Knowledge Graph"""
    source: int = Field(..., description="Unique source of the edge")
    target: int = Field(..., description="Unique target of the edge")
    label: str = Field(..., description="Label of the edge")
    property: str = Field(..., description="Property of the edge")
We then define the Pydantic class for the knowledge graph itself and get its JSON Schema:
from typing import List
class KnowledgeGraph(BaseModel):
    """Generated Knowledge Graph"""
    nodes: List[Node] = Field(..., description="List of nodes of the knowledge graph")
    edges: List[Edge] = Field(..., description="List of edges of the knowledge graph")

schema = KnowledgeGraph.model_json_schema()
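To get a feel for what the schema contains, it can help to inspect it and to validate a small hand-written JSON string against the same class. The following is a minimal sketch assuming Pydantic v2 (which provides model_json_schema and model_validate_json); the models are repeated in shortened form so the snippet runs on its own, and the raw JSON string is made up for illustration.

```python
from typing import List

from pydantic import BaseModel, Field


class Node(BaseModel):
    id: int = Field(..., description="Unique identifier of the node")
    label: str = Field(..., description="Label of the node")
    property: str = Field(..., description="Property of the node")


class Edge(BaseModel):
    source: int = Field(..., description="Unique source of the edge")
    target: int = Field(..., description="Unique target of the edge")
    label: str = Field(..., description="Label of the edge")
    property: str = Field(..., description="Property of the edge")


class KnowledgeGraph(BaseModel):
    nodes: List[Node] = Field(..., description="List of nodes")
    edges: List[Edge] = Field(..., description="List of edges")


schema = KnowledgeGraph.model_json_schema()

# Top-level fields that a conforming JSON object must contain.
print(schema["required"])
# Nested models appear as reusable definitions under "$defs".
print(sorted(schema["$defs"]))

# Illustrative JSON string (made up) parsed and validated by the same class.
raw = '{"nodes": [{"id": 1, "label": "Alice", "property": "Person"}], "edges": []}'
graph = KnowledgeGraph.model_validate_json(raw)
print(graph.nodes[0].label)
```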
Next, we need to adapt our prompt to the Hermes prompt format for JSON Schema:
def generate_hermes_prompt(user_prompt):
    return (
        "<|im_start|>system\n"
        "You are a world class AI model who answers questions in JSON "
        f"Here's the json schema you must adhere to:\n<schema>\n{schema}\n</schema><|im_end|>\n"
        "<|im_start|>user\n"
        + user_prompt
        + "<|im_end|>"
        + "\n<|im_start|>assistant\n"
        "<schema>"
    )
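One detail worth noting: interpolating a Python dict into an f-string inserts its Python repr (single-quoted keys, True/None), not JSON text. Whether this matters to the model is not something this guide establishes, but a variant that serializes the schema with json.dumps first is sketched below. The function name generate_hermes_prompt_json, its extra schema parameter, and the tiny stand-in schema are assumptions for illustration; the prompt wording itself is unchanged.

```python
import json


def generate_hermes_prompt_json(user_prompt, schema):
    """Build the Hermes prompt with the schema rendered as JSON text."""
    schema_text = json.dumps(schema, indent=2)
    return (
        "<|im_start|>system\n"
        "You are a world class AI model who answers questions in JSON "
        f"Here's the json schema you must adhere to:\n<schema>\n{schema_text}\n</schema><|im_end|>\n"
        "<|im_start|>user\n"
        + user_prompt
        + "<|im_end|>"
        + "\n<|im_start|>assistant\n"
        "<schema>"
    )


# Stand-in schema (hypothetical) so the snippet runs without Pydantic.
demo_schema = {
    "type": "object",
    "properties": {"nodes": {"type": "array"}},
    "required": ["nodes"],
}

prompt = generate_hermes_prompt_json("Alice loves Bob.", demo_schema)

# The text between the schema tags now parses back as JSON.
embedded = prompt.split("<schema>\n", 1)[1].split("\n</schema>", 1)[0]
print(json.loads(embedded) == demo_schema)
```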
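For a given user prompt, for example:
user_prompt = "Alice loves Bob and she hates Charlie."
we can use generate.json by passing the Pydantic class we previously defined, and call the generator with the Hermes prompt: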
from outlines import generate, models
model = models.LlamaCpp(llm)
generator = generate.json(model, KnowledgeGraph)
prompt = generate_hermes_prompt(user_prompt)
response = generator(prompt, max_tokens=1024, temperature=0, seed=42)
We obtain the nodes and edges of the knowledge graph:
print(response.nodes)
print(response.edges)
# [Node(id=1, label='Alice', property='Person'),
# Node(id=2, label='Bob', property='Person'),
# Node(id=3, label='Charlie', property='Person')]
# [Edge(source=1, target=2, label='love', property='Relationship'),
# Edge(source=1, target=3, label='hate', property='Relationship')]
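The JSON Schema constrains the shape and types of the output, but it does not guarantee that every edge points at a node that actually exists. A small consistency check can catch such dangling references. The sketch below is one possible approach; the helper name dangling_edges is hypothetical, and SimpleNamespace objects stand in for the Node and Edge models so the snippet runs without a model.

```python
from types import SimpleNamespace


def dangling_edges(nodes, edges):
    """Return edges whose source or target is not the id of any node."""
    known_ids = {node.id for node in nodes}
    return [
        edge
        for edge in edges
        if edge.source not in known_ids or edge.target not in known_ids
    ]


# Stand-ins with the same attributes as the Node and Edge models.
nodes = [SimpleNamespace(id=1, label="Alice"), SimpleNamespace(id=2, label="Bob")]
edges = [
    SimpleNamespace(source=1, target=2, label="love"),
    SimpleNamespace(source=1, target=3, label="hate"),  # node 3 is missing
]

bad = dangling_edges(nodes, edges)
print([(edge.source, edge.target) for edge in bad])
```

With a real result, the same function could be called as dangling_edges(response.nodes, response.edges), since it only relies on the id, source, and target attributes.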
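(Optional) Visualizing the Knowledge Graph
We can use the Graphviz library to visualize the generated knowledge graph. For detailed installation instructions, see the Graphviz package documentation.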
from graphviz import Digraph
dot = Digraph()
for node in response.nodes:
    dot.node(str(node.id), node.label, shape='circle', width='1', height='1')
for edge in response.edges:
    dot.edge(str(edge.source), str(edge.target), label=edge.label)
dot.render('knowledge-graph.gv', view=True)
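If the graphviz package or the Graphviz binaries are not available, the same graph can be written out as DOT text by hand and rendered later with any DOT-aware tool. The following is a minimal sketch under that assumption; the helper name to_dot is hypothetical, labels are assumed to contain no double quotes, and SimpleNamespace objects stand in for the generated nodes and edges.

```python
from types import SimpleNamespace


def to_dot(nodes, edges):
    """Render nodes and edges as a DOT digraph string."""
    lines = ["digraph {"]
    for node in nodes:
        lines.append(f'    {node.id} [label="{node.label}" shape=circle]')
    for edge in edges:
        lines.append(f'    {edge.source} -> {edge.target} [label="{edge.label}"]')
    lines.append("}")
    return "\n".join(lines)


# Stand-ins mirroring part of the example output.
nodes = [SimpleNamespace(id=1, label="Alice"), SimpleNamespace(id=2, label="Bob")]
edges = [SimpleNamespace(source=1, target=2, label="love")]

dot_text = to_dot(nodes, edges)
print(dot_text)
```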
This example was originally contributed by Alonso Silva.