PDF to structured output with vision language models

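A common task with language models is asking questions about a PDF file.

Typically, the output is unstructured text, i.e. "talking" to your PDF.

In some cases, you may want to extract structured information from the PDF, such as tables, lists, or citations.

PDFs are difficult to read by machine. However, you can convert the PDF to images and then use a vision language model to extract structured information from the images.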

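This cookbook demonstrates how to:

1. Convert a PDF to a list of images
2. Use a vision language model to extract structured information from the images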

Dependencies

You'll need to install these dependencies:

pip install outlines pillow transformers torch==2.4.0 pdf2image

# Optional, but makes the output look nicer
pip install rich

Import the required libraries:

from PIL import Image
import outlines
import torch
from transformers import AutoProcessor
from pydantic import BaseModel
from typing import List, Optional
from pdf2image import convert_from_path
import os
from rich import print
import requests

Choose a model

We've tested this example with Pixtral 12b and Qwen2-VL-7B-Instruct.

To use Pixtral:

from transformers import LlavaForConditionalGeneration
model_name="mistral-community/pixtral-12b"
model_class=LlavaForConditionalGeneration

To use Qwen2-VL:

from transformers import Qwen2VLForConditionalGeneration
model_name = "Qwen/Qwen2-VL-7B-Instruct"
model_class = Qwen2VLForConditionalGeneration

You can load the model into memory with:

# This loads the model into memory. On your first run,
# it will have to download the model, which might take a while.
model = outlines.models.transformers_vision(
    model_name,
    model_class=model_class,
    model_kwargs={
        "device_map": "auto",
        "torch_dtype": torch.bfloat16,
    },
    processor_kwargs={
        "device": "auto",
    },
)

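Convert the PDF to images

We'll use the pdf2image library to convert each page of the PDF to an image.

convert_pdf_to_images is a convenience function that converts each page of the PDF to an image, and saves the images to disk when output_dir is provided.

Note: the dpi argument is important. It controls the resolution of the images. High-DPI images are higher quality and may yield better results, but they are also larger, slower to process, and require more memory.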

from pdf2image import convert_from_path
from PIL import Image
import os
from typing import List, Optional

def convert_pdf_to_images(
    pdf_path: str,
    output_dir: Optional[str] = None,
    dpi: int = 120,
    fmt: str = 'PNG'
) -> List[Image.Image]:
    """
    Convert a PDF file to a list of PIL Image objects.

    Args:
        pdf_path: Path to the PDF file
        output_dir: Optional directory to save the images
        dpi: Resolution for the conversion. High DPI is high quality, but also slow and memory intensive.
        fmt: Output format (PNG recommended for quality)

    Returns:
        List of PIL Image objects
    """
    # Convert PDF to list of images
    images = convert_from_path(
        pdf_path,
        dpi=dpi,
        fmt=fmt
    )

    # Optionally save images
    if output_dir:
        os.makedirs(output_dir, exist_ok=True)
        for i, image in enumerate(images):
            image.save(os.path.join(output_dir, f'page_{i+1}.{fmt.lower()}'))

    return images
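
To make the dpi trade-off noted above concrete, the sketch below estimates the pixel dimensions and uncompressed RGB size of a single rendered page. It assumes a US Letter page (8.5 x 11 inches) and 3 bytes per pixel; page_pixels is a hypothetical helper for illustration, not part of pdf2image, and real pages may have different dimensions.

```python
from typing import Tuple


def page_pixels(dpi: int, width_in: float = 8.5, height_in: float = 11.0) -> Tuple[int, int]:
    """Approximate pixel dimensions of a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


for dpi in (72, 120, 300):
    width, height = page_pixels(dpi)
    # Uncompressed RGB uses 3 bytes per pixel
    megabytes = width * height * 3 / 1e6
    print(f"{dpi:>3} dpi: {width} x {height} px, ~{megabytes:.1f} MB uncompressed")
```

Pixel count grows with the square of the DPI, so going from 120 to 300 dpi multiplies the data per page by 6.25.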

We'll use the Louf & Willard paper that describes the method Outlines uses for structured generation.

To download the PDF, run:

# Download the PDF file
pdf_url = "https://arxiv.org/pdf/2307.09702"
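response = requests.get(pdf_url)
# Fail loudly on an HTTP error instead of saving an error page as the PDF
response.raise_for_status()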

# Save the PDF locally
with open("louf-willard.pdf", "wb") as f:
    f.write(response.content)

Now, we can convert the PDF to a list of images:

# Load the pdf
images = convert_pdf_to_images(
    "louf-willard.pdf",
    dpi=120,
    output_dir="output_images"
)

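Extract structured information from the images

The structured output you can extract is exactly the same as everywhere else in Outlines: you can use regular expressions, JSON schemas, selection from a list of options, and so on.

Extracting data into JSON

Suppose you want to go through each page of the PDF and extract a page description, the key takeaways, and the page number.

You can do this by defining a JSON schema and then using outlines.generate.json to extract the data.

First, define the structure you want to extract: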

class PageSummary(BaseModel):
    description: str
    key_takeaways: List[str]
    page_number: int
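
Because PageSummary is a Pydantic model, you can inspect the schema it produces and check how it validates JSON without loading the vision model. The following is a minimal sketch, assuming Pydantic v2 (model_json_schema, used in the prompt later in this cookbook, is a v2 method); the sample JSON strings are made up for illustration.

```python
from typing import List

from pydantic import BaseModel, ValidationError


class PageSummary(BaseModel):
    description: str
    key_takeaways: List[str]
    page_number: int


# The JSON schema that structured generation is constrained to follow
schema = PageSummary.model_json_schema()
print(schema["required"])

# A well-formed (made-up) response parses into a typed object
sample = '{"description": "Title page", "key_takeaways": ["Guided generation"], "page_number": 1}'
summary = PageSummary.model_validate_json(sample)
print(summary.page_number)

# A response that is missing required fields is rejected
try:
    PageSummary.model_validate_json('{"description": "Title page"}')
    rejected = False
except ValidationError:
    rejected = True
print(rejected)
```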

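Second, we need to set up the prompt. Adding special tokens can be tricky, so we use the transformers AutoProcessor to apply them for us. To do so, we specify a list of messages, where each message is a dictionary with a role key and a content key.

Images are denoted with type: "image", and text is denoted with type: "text".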

messages = [
    {
        "role": "user",
        "content": [
            # The text you're passing to the model --
            # this is where you do your standard prompting.
            {"type": "text", "text": f"""
                Describe the page in a way that is easy for a PhD student to understand.

                Return the information in the following JSON schema:
                {PageSummary.model_json_schema()}

                Here is the page:
                """
            },

            # Don't need to pass in an image, since we do this
            # when we call the generator function down below.
            {"type": "image", "image": ""},
        ],
    }
]

# Convert the messages to the final prompt
processor = AutoProcessor.from_pretrained(model_name)
instruction = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

Now we iterate over the images and extract the structured information from each one:

# Page summarizer function
page_summary_generator = outlines.generate.json(model, PageSummary)

for image in images:
    result = page_summary_generator(instruction, [image])
    print(result)

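Regular expressions to extract the arxiv paper identifier

The arXiv paper identifier is a unique identifier for each paper. These identifiers have the format arXiv:YYMM.NNNNN (five trailing digits) or arXiv:YYMM.NNNN (four trailing digits). arXiv identifiers are typically watermarked on papers uploaded to arXiv.

arXiv identifiers are optionally followed by a version number, i.e. arXiv:YYMM.NNNNNvX.

We can use a regular expression to define this pattern: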

paper_regex = r'arXiv:\d{2}[01]\d\.\d{4,5}(v\d)?'
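
Before handing the pattern to the model, it can help to check it against a few strings with Python's built-in re module. This is only a quick sanity check; the candidate strings below are made-up examples, apart from the identifier of the paper used in this cookbook.

```python
import re

paper_regex = r'arXiv:\d{2}[01]\d\.\d{4,5}(v\d)?'

candidates = [
    "arXiv:2307.09702v4",  # five digits with a version
    "arXiv:2307.09702",    # five digits, no version
    "arXiv:0704.0001",     # four digits (older format)
    "2307.09702",          # missing the arXiv: prefix
    "arXiv:2327.09702",    # month "27" is rejected by the [01] digit class
]

# fullmatch requires the whole string to follow the pattern
results = {c: re.fullmatch(paper_regex, c) is not None for c in candidates}
for candidate, matched in results.items():
    print(f"{candidate!r}: {matched}")
```

The first three candidates match and the last two do not. Note that the pattern is intentionally loose: it accepts a month such as 13, and it allows only a single-digit version, so an identifier ending in v10 would not match in full.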

We can build an extractor function from the regex:

id_extractor = outlines.generate.regex(model, paper_regex)

Now, we can extract the arxiv paper identifier from the first image:

arxiv_instruction = processor.apply_chat_template(
    [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": f"""
                Extract the arxiv paper identifier from the page.

                Here is the page:
                """},
                {"type": "image", "image": ""},
            ],
        }
    ],
    tokenize=False,
    add_generation_prompt=True
)

# Extract the arxiv paper identifier
paper_id = id_extractor(arxiv_instruction, [images[0]])
print(paper_id)

As of this writing, the arxiv paper identifier is

arXiv:2307.09702v4

Your version number may be different, but the part before vX should match.
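
If you want to compare identifiers while ignoring the version, one option is to strip the trailing vX suffix first. A minimal sketch using only the standard library; strip_version is a hypothetical helper introduced here for illustration, not part of Outlines.

```python
import re


def strip_version(arxiv_id: str) -> str:
    """Remove a trailing version suffix such as v4 from an arXiv identifier."""
    return re.sub(r"v\d+$", "", arxiv_id)


print(strip_version("arXiv:2307.09702v4"))
print(strip_version("arXiv:2307.09702"))
```

Both calls print arXiv:2307.09702, so identifiers that differ only in version compare equal after stripping.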

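Categorize the paper into one of several categories

outlines.generate.choice allows the model to select one of several options. Suppose we want to categorize the paper as being about "llms", "cell biology", or "other".

Let's define the categories we're interested in: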

categories = [
    "llms",
    "cell biology",
    "other"
]

Now we can construct the prompt:

categorization_instruction = processor.apply_chat_template(
    [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": f"""
                Please choose one of the following categories
                that best describes the paper.

                {categories}

                Here is the paper:
                """},

                {"type": "image", "image": ""},
            ],
        }
    ],
    tokenize=False,
    add_generation_prompt=True
)

Now we can show the model the first page and extract the category:

# Build the choice extractor
categorizer = outlines.generate.choice(
    model,
    categories
)

# Categorize the paper
category = categorizer(categorization_instruction, [images[0]])
print(category)

Which should return:

llms

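Additional notes

You can provide multiple images to the model by:

1. Adding additional image messages
2. Providing a list of images to the generate function

For example, to use two images, you can do: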

two_image_prompt = processor.apply_chat_template(
    [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "are both of these images of hot dogs?"},

                # Tell the model there are two images
                {"type": "image", "image": ""},
                {"type": "image", "image": ""},
            ],
        }
    ],
    tokenize=False,
    add_generation_prompt=True
)

# Pass two images to the model
generator = outlines.generate.choice(
    model,
    ["hot dog", "not hot dog"]
)

result = generator(
    two_image_prompt,

    # Pass two images to the model
    [images[0], images[1]]
)
print(result)

Using the first two pages of the paper (they are not images of hot dogs), we should get

not hot dog
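
Since the number of image placeholders in the prompt has to line up with the number of images you pass to the generator, it can be convenient to build the message list from a count. The sketch below is one way to do that; build_image_messages is a hypothetical helper, not part of Outlines or transformers, and it only constructs the plain dictionaries shown in the examples above.

```python
from typing import Any, Dict, List


def build_image_messages(text: str, n_images: int) -> List[Dict[str, Any]]:
    """Build a single user message with one image placeholder per image."""
    content: List[Dict[str, str]] = [{"type": "text", "text": text}]
    # One empty image entry per image that will be passed to the generator
    content += [{"type": "image", "image": ""} for _ in range(n_images)]
    return [{"role": "user", "content": content}]


messages = build_image_messages("are both of these images of hot dogs?", n_images=2)
n_placeholders = sum(item["type"] == "image" for item in messages[0]["content"])
print(n_placeholders)
```

The resulting list can be passed to processor.apply_chat_template in the same way as the hand-written messages, with a list of the same number of images given to the generator.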