Vision-Language Models with Outlines

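This guide demonstrates how to use Outlines with vision-language models through the new transformers_vision module. Vision-language models can process both text and images, which enables tasks such as image captioning and visual question answering.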

We will use Mistral's Pixtral-12B model to draw on some of its visual reasoning capabilities, together with a workflow for generating multi-stage atomic captions.

Setup

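First, we need to install the necessary dependencies. In addition to Outlines, we need the transformers library and any specific requirements of the vision-language model we will be using.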

pip install outlines transformers torch pillow

Initializing the model

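We will use the transformers_vision function to initialize our vision-language model. This function is designed specifically for models that accept both text and image inputs. Today we will use the Pixtral model with the llama tokenizer. (The mistral tokenizer is not yet supported.)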

import outlines
import torch
from transformers import LlavaForConditionalGeneration

model_name = "mistral-community/pixtral-12b"  # original magnet model is able to be loaded without issue
model_class = LlavaForConditionalGeneration


# Load the checkpoint and its processor through Outlines.
def get_vision_model(model_name: str, model_class: type):
    model_kwargs = {
        "torch_dtype": torch.bfloat16,
        "attn_implementation": "flash_attention_2",
        "device_map": "auto",
    }
    processor_kwargs = {
        "device": "cuda",
    }

    # Use the function arguments directly; there is no outer `model` object yet.
    model = outlines.models.transformers_vision(
        model_name,
        model_class=model_class,
        model_kwargs=model_kwargs,
        processor_kwargs=processor_kwargs,
    )
    return model


model = get_vision_model(model_name, model_class)
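
The flash_attention_2 setting above needs the separately installed flash-attn package and a supported GPU. As a minimal sketch, assuming you want the same script to start on machines without it, the helper below falls back to "sdpa", another attention implementation that transformers accepts. The name pick_attn_implementation is hypothetical and not part of Outlines or transformers, and whether a given model class supports "sdpa" depends on your transformers version.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if flash_attn is importable, else "sdpa"."""
    # find_spec returns None when the package is not installed.
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


# Hypothetical variant of the model_kwargs dict built in get_vision_model.
model_kwargs = {
    "attn_implementation": pick_attn_implementation(),
    "device_map": "auto",
}
```

The resulting dictionary can be merged with the torch_dtype entry shown earlier before it is passed to transformers_vision.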

Defining the schema

Next, we will define a schema for the expected output of the vision-language model. This schema helps structure the model's responses.

from enum import StrEnum  # requires Python 3.11+
from typing import List

import outlines
from pydantic import BaseModel, Field, StringConstraints, confloat, constr
from typing_extensions import Annotated

class TagType(StrEnum):
    ENTITY = "Entity"
    RELATIONSHIP = "Relationship"
    STYLE = "Style"
    ATTRIBUTE = "Attribute"
    COMPOSITION = "Composition"
    CONTEXTUAL = "Contextual"
    TECHNICAL = "Technical"
    SEMANTIC = "Semantic"

class ImageTag(BaseModel):
    tag: Annotated[
        constr(min_length=1, max_length=30),
        Field(
            description=(
                "Descriptive keyword or phrase representing the tag."
            )
        )
    ]
    category: TagType
    confidence: Annotated[
        confloat(gt=0.0, le=1.0),
        Field(
            description=(
                "Confidence score for the tag, between 0 (exclusive) and 1 (inclusive)."
            )
        )
    ]

class ImageData(BaseModel):
    tags_list: List[ImageTag] = Field(..., min_length=8, max_length=20)
    short_caption: Annotated[str, StringConstraints(min_length=10, max_length=150)]
    dense_caption: Annotated[str, StringConstraints(min_length=100, max_length=2048)]

image_data_generator = outlines.generate.json(model, ImageData)

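This schema defines the structure of the image tags, including categories such as entity, relationship, and style, as well as the short and dense captions.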
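
The StrEnum base class used in the schema is only available in Python 3.11 and later. As a minimal sketch, assuming you need to run on an older interpreter, a string-valued enum can be built by mixing str into Enum. The reduced TagType below (three members only) is an illustration, not the full schema from this guide. It also shows that members compare equal to plain strings and serialize as ordinary JSON strings, which helps when reading the generator output.

```python
import json
from enum import Enum


class TagType(str, Enum):
    """String-valued enum that also works before Python 3.11."""

    ENTITY = "Entity"
    RELATIONSHIP = "Relationship"
    STYLE = "Style"


# Look a member up by the value a model would emit.
category = TagType("Entity")

# Members are real strings, so they compare equal to their value.
is_entity = category == "Entity"

# The json module serializes them as ordinary strings.
encoded = json.dumps({"category": category})
```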

Preparing the prompt

We will create a prompt that tells the model how to analyze the image and generate structured output.

pixtral_instruction = """
<s>[INST]
<Task>You are a structured image analysis agent. Generate comprehensive tag list, caption, and dense caption for an image classification system.</Task>
<TagCategories requirement="You should generate a minimum of 1 tag for each category." confidence="Confidence score for the tag, between 0 (exclusive) and 1 (inclusive).">
- Entity : The content of the image, including the objects, people, and other elements.
- Relationship : The relationships between the entities in the image.
- Style : The style of the image, including the color, lighting, and other stylistic elements.
- Attribute : The most important attributes of the entities and relationships in the image.
- Composition : The composition of the image, including the arrangement of elements.
- Contextual : The contextual elements of the image, including the background, foreground, and other elements.
- Technical : The technical elements of the image, including the camera angle, lighting, and other technical details.
- Semantic : The semantic elements of the image, including the meaning of the image, the symbols, and other semantic details.
<Examples note="These show the expected format as an abstraction.">
{
  "tags_list": [
    {
      "tag": "subject 1",
      "category": "Entity",
      "confidence": 0.98
    },
    {
      "tag": "subject 2",
      "category": "Entity",
      "confidence": 0.95
    },
    {
      "tag": "subject 1 runs from subject 2",
      "category": "Relationship",
      "confidence": 0.90
    }
  ]
}
</Examples>
</TagCategories>
<ShortCaption note="The short caption should be a concise single sentence caption of the image content with a maximum length of 100 characters.">
<DenseCaption note="The dense caption should be a descriptive but grounded narrative paragraph of the image content with high quality narrative prose. It should incorporate elements from each of the tag categories to provide a broad dense caption">\n[IMG][/INST]
""".strip()

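This prompt gives the model detailed instructions for generating a comprehensive tag list, a caption, and a dense caption for image analysis. Because of the order of the instructions, the tags are generated first and act as a form of visual grounding for the captioning task, which reduces the amount of manual post-processing required.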
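
Because the category bullets in the prompt are written by hand, they can drift from the TagType enum when either one is edited. A minimal sketch of a guard, assuming the "- Name :" bullet format used in the prompt; missing_categories and demo_prompt are hypothetical names introduced for illustration:

```python
# Category names copied from the TagType enum defined earlier in this guide.
CATEGORIES = [
    "Entity", "Relationship", "Style", "Attribute",
    "Composition", "Contextual", "Technical", "Semantic",
]


def missing_categories(prompt: str, categories: list[str]) -> list[str]:
    """Return the categories that have no "- Name :" bullet in the prompt."""
    return [name for name in categories if f"- {name} :" not in prompt]


# Shortened stand-in for pixtral_instruction with only two bullets.
demo_prompt = """
- Entity : The content of the image.
- Style : The style of the image.
""".strip()

# Categories the stand-in prompt forgot to describe.
not_described = missing_categories(demo_prompt, CATEGORIES)
```

Running the same check against pixtral_instruction should return an empty list, since it contains a bullet for each of the eight categories.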

Generating structured output

Now we can use our model to generate structured output from an input image.

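from io import BytesIO
from urllib.request import urlopen

from PIL import Image


def img_from_url(url):
    # Download the image bytes and decode them as an RGB image.
    img_byte_stream = BytesIO(urlopen(url).read())
    return Image.open(img_byte_stream).convert("RGB")


image_url = "https://upload.wikimedia.org/wikipedia/commons/9/98/Aldrin_Apollo_11_original.jpg"
image = img_from_url(image_url)
# Pass the instruction prompt together with the image to the generator.
result = image_data_generator(
    pixtral_instruction,
    [image],
)
print(result)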

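This code loads an image from a URL, passes it to our vision-language model together with the instruction prompt, and generates structured output according to the defined schema. We end up with output like the following, ready for the next stage of your pipeline: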

{'tags_list': [{'tag': 'astronaut',
   'category': <TagType.ENTITY: 'Entity'>,
   'confidence': 0.99},
  {'tag': 'moon', 'category': <TagType.ENTITY: 'Entity'>, 'confidence': 0.98},
  {'tag': 'space suit',
   'category': <TagType.ATTRIBUTE: 'Attribute'>,
   'confidence': 0.97},
  {'tag': 'lunar module',
   'category': <TagType.ENTITY: 'Entity'>,
   'confidence': 0.95},
  {'tag': 'shadow of astronaut',
   'category': <TagType.COMPOSITION: 'Composition'>,
   'confidence': 0.95},
  {'tag': 'footprints in moon dust',
   'category': <TagType.CONTEXTUAL: 'Contextual'>,
   'confidence': 0.93},
  {'tag': 'low angle shot',
   'category': <TagType.TECHNICAL: 'Technical'>,
   'confidence': 0.92},
  {'tag': 'human first steps on the moon',
   'category': <TagType.SEMANTIC: 'Semantic'>,
   'confidence': 0.95}],
 'short_caption': 'First man on the Moon',
 'dense_caption': "The figure clad in a pristine white space suit, emblazoned with the American flag, stands powerfully on the moon's desolate and rocky surface. The lunar module, a workhorse of space engineering, looms in the background, its metallic legs sinking slightly into the dust where footprints and tracks from the mission's journey are clearly visible. The photograph captures the astronaut from a low angle, emphasizing his imposing presence against the desolate lunar backdrop. The stark contrast between the blacks and whiteslicks of lost light and shadow adds dramatic depth to this seminal moment in human achievement."}
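
The schema constrains the shape of each tag, but it cannot express the prompt's request for at least one tag per category. In the sample output above, for instance, no tag is labeled Relationship or Style. A minimal post-processing sketch, assuming the result is available as plain dictionaries with string categories; group_tags and uncovered_categories are hypothetical helper names:

```python
from collections import defaultdict

ALL_CATEGORIES = [
    "Entity", "Relationship", "Style", "Attribute",
    "Composition", "Contextual", "Technical", "Semantic",
]

# Tags copied from the sample output above, categories written as plain strings.
sample_tags = [
    {"tag": "astronaut", "category": "Entity", "confidence": 0.99},
    {"tag": "moon", "category": "Entity", "confidence": 0.98},
    {"tag": "space suit", "category": "Attribute", "confidence": 0.97},
    {"tag": "lunar module", "category": "Entity", "confidence": 0.95},
    {"tag": "shadow of astronaut", "category": "Composition", "confidence": 0.95},
    {"tag": "footprints in moon dust", "category": "Contextual", "confidence": 0.93},
    {"tag": "low angle shot", "category": "Technical", "confidence": 0.92},
    {"tag": "human first steps on the moon", "category": "Semantic", "confidence": 0.95},
]


def group_tags(tags, min_confidence=0.0):
    """Group tag names by category, keeping tags at or above min_confidence."""
    grouped = defaultdict(list)
    for entry in tags:
        if entry["confidence"] >= min_confidence:
            grouped[entry["category"]].append(entry["tag"])
    return dict(grouped)


def uncovered_categories(tags, categories):
    """List the categories for which no tag was produced."""
    seen = {entry["category"] for entry in tags}
    return [name for name in categories if name not in seen]


confident = group_tags(sample_tags, min_confidence=0.95)
uncovered = uncovered_categories(sample_tags, ALL_CATEGORIES)
```

A pipeline could use the uncovered list to decide whether to accept a result or to regenerate it.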

Conclusion

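The transformers_vision module in Outlines provides a powerful way to work with vision-language models. It allows structured generation of outputs that combine image analysis with natural language processing, opening up possibilities for complex tasks such as detailed image captioning and visual question answering.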

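By combining the capabilities of models such as Pixtral-12B with Outlines' structured output generation, you can build sophisticated applications that understand and describe visual content in a highly structured and customizable way.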