Summarize documents using Chain of Density prompting

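A good summary should be informative, concise and clear. While large language models are generally good at summarizing documents, their summaries tend to be long and to contain redundant information, so their information density is low. This is where Chain of Density, a new prompting technique, comes in. In this example we show how to implement Chain of Density in a few lines of code using Outlines' prompt templating and structured generation capabilities.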

The article we will try to summarize is the first three paragraphs of the Alan Turing page on Wikipedia:

article = """
Alan Mathison Turing OBE FRS (/ˈtjʊərɪŋ/; 23 June 1912 – 7 June 1954) was an English mathematician, computer scientist, logician, cryptanalyst, philosopher and theoretical biologist.[5] Turing was highly influential in the development of theoretical computer science, providing a formalisation of the concepts of algorithm and computation with the Turing machine, which can be considered a model of a general-purpose computer.[6][7][8] He is widely considered to be the father of theoretical computer science and artificial intelligence.[9]

Born in Maida Vale, London, Turing was raised in southern England. He graduated at King's College, Cambridge, with a degree in mathematics. Whilst he was a fellow at Cambridge, he published a proof demonstrating that some purely mathematical yes–no questions can never be answered by computation. He defined a Turing machine and proved that the halting problem for Turing machines is undecidable. In 1938, he obtained his PhD from the Department of Mathematics at Princeton University. During the Second World War, Turing worked for the Government Code and Cypher School at Bletchley Park, Britain's codebreaking centre that produced Ultra intelligence. For a time he led Hut 8, the section that was responsible for German naval cryptanalysis. Here, he devised a number of techniques for speeding the breaking of German ciphers, including improvements to the pre-war Polish bomba method, an electromechanical machine that could find settings for the Enigma machine. Turing played a crucial role in cracking intercepted coded messages that enabled the Allies to defeat the Axis powers in many crucial engagements, including the Battle of the Atlantic.[10][11]

After the war, Turing worked at the National Physical Laboratory, where he designed the Automatic Computing Engine, one of the first designs for a stored-program computer. In 1948, Turing joined Max Newman's Computing Machine Laboratory at the Victoria University of Manchester, where he helped develop the Manchester computers[12] and became interested in mathematical biology. He wrote a paper on the chemical basis of morphogenesis[1] and predicted oscillating chemical reactions such as the Belousov–Zhabotinsky reaction, first observed in the 1960s. Despite these accomplishments, Turing was never fully recognised in Britain during his lifetime because much of his work was covered by the Official Secrets Act.[13]
"""

How Chain Of Density works

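Chain Of Density starts by asking the model to generate a first summary that is long and non-specific. It then asks the model to generate 4 extra summaries, each produced by: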

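  1. Identifying 1-3 entities missing from the previous summary;
  2. Adding all the entities marked as missing in the previous step, without dropping any existing entity;
  3. Making the summary more concise.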
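
The three rules above can be checked mechanically on any generated step. The following is a minimal sketch, not part of Outlines or of the paper: `check_density_step` is a hypothetical helper, and it assumes entities can be matched by a case-insensitive substring search, which is only a rough approximation of "the entity is covered".

```python
def check_density_step(previous_entities, missing_entities, new_summary):
    """Return a list of rule violations for one Chain of Density step.

    Hypothetical helper for illustration; entity matching is a naive,
    case-insensitive substring test.
    """
    problems = []

    # Rule 1: each step should identify between 1 and 3 missing entities.
    if not 1 <= len(missing_entities) <= 3:
        problems.append("expected 1-3 missing entities")

    # Rule 2: the new summary must cover the old entities and the new ones.
    lowered = new_summary.lower()
    for entity in list(previous_entities) + list(missing_entities):
        if entity.lower() not in lowered:
            problems.append(f"entity not covered: {entity}")

    return problems


previous = ["Alan Turing", "Bletchley Park"]
missing = ["Hut 8"]
summary = "Alan Turing led Hut 8 at Bletchley Park."
print(check_density_step(previous, missing, summary))  # []
```

An empty list means the step respects the first two rules; conciseness (rule 3) is not checked here.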

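The prompt also asks the model to return a list of JSON objects that contain the missing entities and the new summary. This is where structured generation comes in handy :) The paper provides the prompt and an example: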

Figure 2 in the paper

We can now implement the prompt provided in the paper:

from outlines import Template


chain_of_density = Template.from_string(
    """Article: {{ article }}

    You will generate increasingly concise, entity-dense summaries of the above Article.

    Repeat the following 2 steps 5 times.

    Step 1. Identify 1-3 informative Entities ("; " delimited) from the Article which are missing from the previously generated summary.
    Step 2. Write a new, denser summary of identical length which covers every entity and detail from the previous summary plus the Missing Entities.

    A Missing Entity is:
    - Relevant: to the main story.
    - Specific: descriptive yet concise (5 words or fewer).
    - Novel: not in the previous summary.
    - Faithful: present in the Article.
    - Anywhere: located anywhere in the Article.

    Guidelines:
    - The first summary should be long (4-5 sentences, ~80 words) yet highly non-specific, containing little information beyond the entities marked as missing. Use overly verbose language and fillers (e.g., "this article discusses") to reach ~80 words.
    - Make every word count: rewrite the previous summary to improve flow and make space for additional entities.
    - Make space with fusion, compression, and removal of uninformative phrases like "the article discusses".
    - The summaries should become highly dense and concise yet self-contained, e.g., easily understood without the Article.
    - Missing entities can appear anywhere in the new summary.
    - Never drop entities from the previous summary. If space cannot be made, add fewer new entities.

    Remember, use the exact same number of words for each summary.

    Answer in JSON. The JSON should be a dictionary with key "summaries" that contains a list (length 5) of dictionaries whose keys are "Missing_Entities" and "Denser_Summary".
    """
)
Note

Note that we modified the prompt slightly so that it returns a JSON object containing the summaries, instead of a list of summaries.

Outlines implementation

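We will use Outlines' JSON-structured generation to ensure that the model's output is consistent with the format specified in the prompt. We start by using Pydantic to define the JSON object that the model is asked to return. It contains a list of Summary objects, each of which holds the missing entities and the new summary: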

from pydantic import BaseModel, conlist

class Summary(BaseModel):
    missing_entities: str
    denser_summary: str

class Summaries(BaseModel):
    summaries: conlist(Summary, max_length=5, min_length=5)
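
To make concrete what this schema constrains, here is a minimal standard-library sketch of the same shape check. `matches_summaries_shape` is a hypothetical helper written for illustration; it is not how Pydantic or Outlines validate data, and it only covers the constraints visible in the two models above (exactly 5 items, two string fields each).

```python
import json


def matches_summaries_shape(raw: str) -> bool:
    """Return True if `raw` is JSON shaped like the Summaries model."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False

    # "summaries" must be a list of exactly 5 items (min_length = max_length = 5).
    summaries = data.get("summaries")
    if not isinstance(summaries, list) or len(summaries) != 5:
        return False

    # Every item needs both string fields declared on Summary.
    for item in summaries:
        if not isinstance(item, dict):
            return False
        for key in ("missing_entities", "denser_summary"):
            if not isinstance(item.get(key), str):
                return False
    return True
```

With structured generation the model is constrained to this shape while it generates, so a check like this should never fail on its output; it is mainly useful for understanding the schema or for testing responses obtained without constraints.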

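We now generate the prompt by passing the article we want to summarize to the template. We load a quantized version of Mistral-7B using the AutoAWQ library, and then use JSON-structured generation to generate the summaries: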

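import outlines

# Load the AWQ-quantized model through the transformers integration
model = outlines.models.transformers("TheBloke/Mistral-7B-OpenOrca-AWQ")

# Pass the article by keyword so it binds to the template's {{ article }} variable
prompt = chain_of_density(article=article)
result = outlines.generate.json(model, Summaries)(prompt)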

We can now check the results:

print(result.model_dump())
# {'summaries': [
#     {
#       'missing_entities': 'English mathematician, cryptanalyst, philosopher',
#       'denser_summary': 'Alan Mathison Turing was an English mathematician, cryptanalyst, philosopher.'
#     },
#     {
#       'missing_entities': '',
#       'denser_summary': "Alan Mathison Turing was an English mathematician who was a crucial figure in WW2's Bletchley Park codebreaking centre and designed one of the first computers."
#     },
#     {
#       'missing_entities': 'cryptanalyst, studied, biology, father',
#       'denser_summary': 'Alan Mathison Turing was an English cryptanalyst, studied theoretical computer science, and contributed to mathematical biology.'
#     },
#     {
#       'missing_entities': 'biology, morphogenesis, chemical',
#       'denser_summary': 'Alan Mathison Turing was an English cryptanalyst, studied theoretical computer science, and predicted chemical reactions in morphogenesis.'
#     },
#     {
#       'missing_entities': '',
#       'denser_summary': 'Alan Mathison Turing was an English cryptanalyst, developed computer science, and made strides in mathematical biology research.'
#       }
# ]}
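
Since the prompt asks for summaries of identical length that gain entities at each step, it can be useful to measure both on the dictionary printed above. The sketch below is illustrative only: `summary_stats` is a hypothetical helper, word counts use plain whitespace splitting, and entities are assumed to be separated by ";" (as the prompt requests) or by "," (as in the output shown here).

```python
def summary_stats(result: dict) -> list:
    """Word count and number of listed entities for each summary."""
    stats = []
    for item in result["summaries"]:
        # Accept both ";" and "," as entity separators.
        raw = item["missing_entities"].replace(";", ",")
        entities = [e.strip() for e in raw.split(",") if e.strip()]
        stats.append(
            {
                "words": len(item["denser_summary"].split()),
                "new_entities": len(entities),
            }
        )
    return stats
```

Calling `summary_stats(result.model_dump())` gives one entry per summary, which makes it easy to spot steps where the length drifted or where no new entity was added.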

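Not bad, considering we used a fairly small model to generate the summaries! Chain of Density seems to be a very effective prompting technique for generating information-dense summaries, even with small quantized models. Its implementation in Outlines is also very short.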

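Note that this is the first article I tried, and it worked out of the box. Try it on other articles, and please share the results on Twitter or by opening a new discussion on the Outlines repository.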