生成合成数据和带引用的问答

本教程改编自 instructor-ollama notebook。我们从一个简单的例子开始，生成合成数据，然后通过提供引用的方式解决问答问题。

我们将使用 llama.cpp 和 llama-cpp-python 库。Outlines 支持 llama-cpp-python，但我们需要自行安装。

pip install llama-cpp-python

通过传递 HuggingFace Hub 上的仓库名称以及文件名（或通配符模式）来下载模型权重。

import llama_cpp
from outlines import generate, models

model = models.llamacpp("NousResearch/Hermes-2-Pro-Llama-3-8B-GGUF",
            "Hermes-2-Pro-Llama-3-8B-Q4_K_M.gguf",
            tokenizer=llama_cpp.llama_tokenizer.LlamaHFTokenizer.from_pretrained(
            "NousResearch/Hermes-2-Pro-Llama-3-8B"
            ),
            n_gpu_layers=-1,
            flash_attn=True,
            n_ctx=8192,
            verbose=False)

（可选）将模型权重存储在自定义文件夹中

默认情况下，模型权重会下载到 hub 缓存中，但如果想将权重存储在自定义文件夹中，我们可以从 HuggingFace 拉取由 NousResearch 提供的量化 GGUF 模型 Hermes-2-Pro-Llama-3-8B。

wget https://hugging-face.cn/NousResearch/Hermes-2-Pro-Llama-3-8B-GGUF/resolve/main/Hermes-2-Pro-Llama-3-8B-Q4_K_M.gguf

我们初始化模型

import llama_cpp
from llama_cpp import Llama
from outlines import generate, models

llm = Llama(
    "/path/to/model/Hermes-2-Pro-Llama-3-8B-Q4_K_M.gguf",
    tokenizer=llama_cpp.llama_tokenizer.LlamaHFTokenizer.from_pretrained(
        "NousResearch/Hermes-2-Pro-Llama-3-8B"
    ),
    n_gpu_layers=-1,
    flash_attn=True,
    n_ctx=8192,
    verbose=False
)

生成合成数据

我们首先需要定义表示用户的 Pydantic 类

from pydantic import BaseModel, Field

class UserDetail(BaseModel):
    id: int = Field(..., description="Unique identifier") # so the model keeps track of the number of users
    first_name: str
    last_name: str
    age: int

然后我们定义表示用户列表的 Pydantic 类

from typing import List

class Users(BaseModel):
    users: List[UserDetail]

我们可以通过传递刚刚定义的 Pydantic 类来使用 generate.json，并调用生成器。

model = models.LlamaCpp(llm)
generator = generate.json(model, Users)
response = generator("Create 5 fake users", max_tokens=1024, temperature=0, seed=42)
print(response.users)
# [UserDetail(id=1, first_name='John', last_name='Doe', age=25),
# UserDetail(id=2, first_name='Jane', last_name='Doe', age=30),
# UserDetail(id=3, first_name='Bob', last_name='Smith', age=40),
# UserDetail(id=4, first_name='Alice', last_name='Smith', age=35),
# UserDetail(id=5, first_name='John', last_name='Smith', age=20)]

for user in response.users:
    print(user.first_name)
    print(user.last_name)
    print(user.age)
    print(#####)
# John
# Doe
# 25
# #####
# Jane
# Doe
# 30
# #####
# Bob
# Smith
# 40
# #####
# Alice
# Smith
# 35
# #####
# John
# Smith
# 20
# #####

带引用的问答

我们首先需要定义用于带引用问答的 Pydantic 类

from typing import List
from pydantic import BaseModel

class QuestionAnswer(BaseModel):
    question: str
    answer: str
    citations: List[str]

schema = QuestionAnswer.model_json_schema()

然后我们需要将提示词调整为适用于 JSON schema 的 Hermes 提示词格式。

def generate_hermes_prompt(question, context, schema=schema):
    return (
        "<|im_start|>system\n"
        "You are a world class AI model who answers questions in JSON with correct and exact citations "
        "extracted from the `Context`. "
        f"Here's the json schema you must adhere to:\n<schema>\n{schema}\n</schema><|im_end|>\n"
        "<|im_start|>user\n"
        + "`Context`: "
        + context
        + "\n`Question`: "
        + question + "<|im_end|>"
        + "\n<|im_start|>assistant\n"
        "<schema>"
    )

我们可以通过传递之前定义的 Pydantic 类来使用 generate.json，并使用 Hermes 提示词调用生成器。

question = "What did the author do during college?"
context = """
My name is Jason Liu, and I grew up in Toronto Canada but I was born in China.
I went to an arts high school but in university I studied Computational Mathematics and physics.
As part of coop I worked at many companies including Stitchfix, Facebook.
I also started the Data Science club at the University of Waterloo and I was the president of the club for 2 years.
"""
generator = generate.json(model, QuestionAnswer)
prompt = generate_hermes_prompt(question, context)
response = generator(prompt, max_tokens=1024, temperature=0, seed=42)
print(response)
# QuestionAnswer(question='What did the author do during college?', answer='The author studied Computational Mathematics and physics in university and was also involved in starting the Data Science club, serving as its president for 2 years.', citations=['I went to an arts high school but in university I studied Computational Mathematics and physics.', 'I also started the Data Science club at the University of Waterloo and I was the president of the club for 2 years.'])

对于问题-上下文对列表，我们可以做同样的操作

question1 = "Where was John born?"
context1 = """
John Doe is a software engineer who was born in New York, USA.
He studied Computer Science at the Massachusetts Institute of Technology.
During his studies, he interned at Google and Microsoft.
He also founded the Artificial Intelligence club at his university and served as its president for three years.
"""

question2 = "What did Emily study in university?"
context2 = """
Emily Smith is a data scientist from London, England.
She attended the University of Cambridge where she studied Statistics and Machine Learning.
She interned at IBM and Amazon during her summer breaks.
Emily was also the head of the Women in Tech society at her university.
"""

question3 = "Which companies did Robert intern at?"
context3 = """
Robert Johnson, originally from Sydney, Australia, is a renowned cybersecurity expert.
He studied Information Systems at the University of Melbourne.
Robert interned at several cybersecurity firms including NortonLifeLock and McAfee.
He was also the leader of the Cybersecurity club at his university.
"""

question4 = "What club did Alice start at her university?"
context4 = """
Alice Williams, a native of Dublin, Ireland, is a successful web developer.
She studied Software Engineering at Trinity College Dublin.
Alice interned at several tech companies including Shopify and Squarespace.
She started the Web Development club at her university and was its president for two years.
"""

question5 = "What did Michael study in high school?"
context5 = """
Michael Brown is a game developer from Tokyo, Japan.
He attended a specialized high school where he studied Game Design.
He later attended the University of Tokyo where he studied Computer Science.
Michael interned at Sony and Nintendo during his university years.
He also started the Game Developers club at his university.
"""

for question, context in [
    (question1, context1),
    (question2, context2),
    (question3, context3),
    (question4, context4),
    (question5, context5),
]:
    final_prompt = my_final_prompt(question, context)
    generator = generate.json(model, QuestionAnswer)
    response = generator(final_prompt, max_tokens=1024, temperature=0, seed=42)
    display(question)
    display(response.answer)
    display(response.citations)
    print("\n\n")

# 'Where was John born?'
# 'John Doe was born in New York, USA.'
# ['John Doe is a software engineer who was born in New York, USA.']
#
#
# 'What did Emily study in university?'
# 'Emily studied Statistics and Machine Learning in university.'
# ['She attended the University of Cambridge where she studied Statistics and Machine Learning.']
#
#
# 'Which companies did Robert intern at?'
# 'Robert interned at NortonLifeLock and McAfee.'
# ['Robert Johnson, originally from Sydney, Australia, is a renowned cybersecurity expert. He interned at several cybersecurity firms including NortonLifeLock and McAfee.']
#
#
# 'What club did Alice start at her university?'
# 'Alice started the Web Development club at her university.'
# ['Alice Williams, a native of Dublin, Ireland, is a successful web developer. She started the Web Development club at her university and was its president for two years.']
#
#
# 'What did Michael study in high school?'
# 'Michael studied Game Design in high school.'
# ['Michael Brown is a game developer from Tokyo, Japan. He attended a specialized high school where he studied Game Design.']

本示例由 Alonso Silva 最初贡献。