
Knowledge Graph Extraction

In this guide, we use Outlines to extract a knowledge graph from unstructured text.

We will use llama.cpp through the llama-cpp-python library. Outlines supports llama-cpp-python, but we need to install it ourselves:

pip install llama-cpp-python

We download the model weights by passing the name of the repository on the HuggingFace Hub and the file name (or a glob pattern):

import llama_cpp
from outlines import generate, models

model = models.llamacpp(
    "NousResearch/Hermes-2-Pro-Llama-3-8B-GGUF",
    "Hermes-2-Pro-Llama-3-8B-Q4_K_M.gguf",
    tokenizer=llama_cpp.llama_tokenizer.LlamaHFTokenizer.from_pretrained(
        "NousResearch/Hermes-2-Pro-Llama-3-8B"
    ),
    n_gpu_layers=-1,
    flash_attn=True,
    n_ctx=8192,
    verbose=False,
)
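
The file name argument may be a glob pattern rather than an exact name. As a rough illustration of how such a pattern narrows down a quantization, the sketch below matches patterns against a file listing with Python's fnmatch. The names in `repo_files` are made up for illustration, `match_files` is a hypothetical helper, and the exact matching rules applied by the downloader are an assumption here.

```python
from fnmatch import fnmatch

# Hypothetical listing of GGUF files in a Hub repository (illustrative only).
repo_files = [
    "Hermes-2-Pro-Llama-3-8B-F16.gguf",
    "Hermes-2-Pro-Llama-3-8B-Q4_K_M.gguf",
    "Hermes-2-Pro-Llama-3-8B-Q8_0.gguf",
]


def match_files(pattern, files):
    """Return the files whose names match a shell-style glob pattern."""
    return [name for name in files if fnmatch(name, pattern)]


q4_matches = match_files("*Q4_K_M.gguf", repo_files)
all_matches = match_files("*.gguf", repo_files)

print(q4_matches)        # a single file: the pattern is unambiguous
print(len(all_matches))  # several files: the pattern is too broad
```

A pattern that matches exactly one file is the safer choice, since a broad pattern leaves the selection ambiguous.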

(Optional) Store the model weights in a custom folder

By default, the model weights are downloaded to the Hub cache. To store them in a custom folder instead, we pull Hermes-2-Pro-Llama-3-8B, a quantized GGUF model by NousResearch, from HuggingFace:

wget https://hugging-face.cn/NousResearch/Hermes-2-Pro-Llama-3-8B-GGUF/resolve/main/Hermes-2-Pro-Llama-3-8B-Q4_K_M.gguf

We initialize the model:

import llama_cpp
from llama_cpp import Llama
from outlines import generate, models

llm = Llama(
    "/path/to/model/Hermes-2-Pro-Llama-3-8B-Q4_K_M.gguf",
    tokenizer=llama_cpp.llama_tokenizer.LlamaHFTokenizer.from_pretrained(
        "NousResearch/Hermes-2-Pro-Llama-3-8B"
    ),
    n_gpu_layers=-1,
    flash_attn=True,
    n_ctx=8192,
    verbose=False
)

Knowledge Graph Extraction

We first need to define a Pydantic class for each node and each edge of the knowledge graph:

from pydantic import BaseModel, Field

class Node(BaseModel):
    """Node of the Knowledge Graph"""

    id: int = Field(..., description="Unique identifier of the node")
    label: str = Field(..., description="Label of the node")
    property: str = Field(..., description="Property of the node")


class Edge(BaseModel):
    """Edge of the Knowledge Graph"""

    source: int = Field(..., description="Unique source of the edge")
    target: int = Field(..., description="Unique target of the edge")
    label: str = Field(..., description="Label of the edge")
    property: str = Field(..., description="Property of the edge")

We then define the Pydantic class for the knowledge graph and get its JSON schema:

from typing import List

class KnowledgeGraph(BaseModel):
    """Generated Knowledge Graph"""

    nodes: List[Node] = Field(..., description="List of nodes of the knowledge graph")
    edges: List[Edge] = Field(..., description="List of edges of the knowledge graph")

schema = KnowledgeGraph.model_json_schema()

We then need to adapt our prompt to the Hermes prompt format for JSON schema:

def generate_hermes_prompt(user_prompt):
    return (
        "<|im_start|>system\n"
        "You are a world class AI model who answers questions in JSON "
        f"Here's the json schema you must adhere to:\n<schema>\n{schema}\n</schema><|im_end|>\n"
        "<|im_start|>user\n"
        + user_prompt
        + "<|im_end|>"
        + "\n<|im_start|>assistant\n"
        "<schema>"
    )
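
One detail worth knowing when building such a prompt: interpolating a Python dict into an f-string inserts its `repr` (single-quoted keys), not JSON text. The sketch below shows the difference on a small stand-in dictionary rather than the real schema. Whether the model responds differently to either form is not something this guide measures, so treat `json.dumps` as an option, not a required fix.

```python
import json

# Stand-in for KnowledgeGraph.model_json_schema(); the real schema is larger.
toy_schema = {"title": "Node", "type": "object", "required": ["id"]}

# What an f-string does with a dict: Python repr, single-quoted keys.
as_repr = f"<schema>\n{toy_schema}\n</schema>"
# Serialising first yields valid JSON text inside the tags.
as_json = f"<schema>\n{json.dumps(toy_schema)}\n</schema>"

print(as_repr)
print(as_json)

# Only the json.dumps variant can be parsed back as JSON.
inner = as_json.split("\n")[1]
round_trip = json.loads(inner)
```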

For a given user prompt, for example:

user_prompt = "Alice loves Bob and she hates Charlie."

We can use generate.json by passing the Pydantic class we previously defined, and call the generator with the Hermes prompt:

from outlines import generate, models

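# `llm` is the Llama instance created in the optional custom-folder step.
# Skip this line if `model` was already built with models.llamacpp(...).
model = models.LlamaCpp(llm)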
generator = generate.json(model, KnowledgeGraph)
prompt = generate_hermes_prompt(user_prompt)
response = generator(prompt, max_tokens=1024, temperature=0, seed=42)

We obtain the nodes and edges of the knowledge graph:

print(response.nodes)
print(response.edges)
# [Node(id=1, label='Alice', property='Person'),
# Node(id=2, label='Bob', property='Person'),
# Node(id=3, label='Charlie', property='Person')]
# [Edge(source=1, target=2, label='love', property='Relationship'),
# Edge(source=1, target=3, label='hate', property='Relationship')]
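
Structured generation ensures the output matches the schema, but the schema itself does not require every edge to point at an existing node. A minimal sketch of such a check follows, written against plain dictionaries (the shape `response.model_dump()` would produce). `find_dangling_edges` is a hypothetical helper name, not part of Outlines, and the third edge is deliberately broken for the demonstration.

```python
def find_dangling_edges(graph):
    """Return edges whose source or target is not the id of any node."""
    node_ids = {node["id"] for node in graph["nodes"]}
    return [
        edge
        for edge in graph["edges"]
        if edge["source"] not in node_ids or edge["target"] not in node_ids
    ]


# The graph printed above, plus one deliberately broken edge (target 9).
graph = {
    "nodes": [
        {"id": 1, "label": "Alice", "property": "Person"},
        {"id": 2, "label": "Bob", "property": "Person"},
        {"id": 3, "label": "Charlie", "property": "Person"},
    ],
    "edges": [
        {"source": 1, "target": 2, "label": "love", "property": "Relationship"},
        {"source": 1, "target": 3, "label": "hate", "property": "Relationship"},
        {"source": 2, "target": 9, "label": "knows", "property": "Relationship"},
    ],
}

dangling = find_dangling_edges(graph)
print(dangling)  # only the edge pointing at the missing node 9
```

Running a check like this before visualizing or storing the graph catches references to nodes the model never emitted.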

(Optional) Visualizing the Knowledge Graph

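We can use the Graphviz library to visualize the generated knowledge graph. For detailed installation instructions, see the Graphviz installation guide.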

from graphviz import Digraph

dot = Digraph()
for node in response.nodes:
    dot.node(str(node.id), node.label, shape='circle', width='1', height='1')
for edge in response.edges:
    dot.edge(str(edge.source), str(edge.target), label=edge.label)

dot.render('knowledge-graph.gv', view=True)

Image of the Extracted Knowledge Graph
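
If only the DOT source is needed, for example to render it with another tool, it can also be assembled by hand. The following is a minimal sketch that assumes node and edge records are plain dictionaries. `to_dot` is a hypothetical helper, not part of the graphviz package, and it performs only basic quote escaping.

```python
def to_dot(nodes, edges):
    """Build DOT source for a directed graph from node and edge dicts."""

    def quote(text):
        # Escape backslashes and double quotes so labels stay valid DOT strings.
        return '"' + str(text).replace("\\", "\\\\").replace('"', '\\"') + '"'

    lines = ["digraph {"]
    for node in nodes:
        lines.append(f"    {node['id']} [label={quote(node['label'])} shape=circle]")
    for edge in edges:
        lines.append(
            f"    {edge['source']} -> {edge['target']} [label={quote(edge['label'])}]"
        )
    lines.append("}")
    return "\n".join(lines)


dot_source = to_dot(
    nodes=[{"id": 1, "label": "Alice"}, {"id": 2, "label": "Bob"}],
    edges=[{"source": 1, "target": 2, "label": "love"}],
)
print(dot_source)
```

The resulting text can be saved to a `.gv` file and rendered with any DOT-compatible tool.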

This example was originally contributed by Alonso Silva.