使用 VLM 提取收据数据
设置
您需要安装依赖项
导入库
加载所有必需的库
# LLM stuff
import outlines
import torch
from transformers import AutoProcessor
from pydantic import BaseModel, Field
from typing import Literal, Optional, List
# Image stuff
from PIL import Image
import requests
# Rich for pretty printing
from rich import print
选择模型
此示例已使用 mistral-community/pixtral-12b
(HF 链接) 和 Qwen/Qwen2-VL-7B-Instruct
(HF 链接) 进行测试。
我们推荐使用 Qwen-2-VL,因为我们发现它比 Pixtral 更准确。
如果您想使用 Qwen-2-VL,可以按以下步骤操作
# To use Qwen-2-VL:
from transformers import Qwen2VLForConditionalGeneration
model_name = "Qwen/Qwen2-VL-7B-Instruct"
model_class = Qwen2VLForConditionalGeneration
如果您想使用 Pixtral,可以按以下步骤操作
# To use Pixtral:
from transformers import LlavaForConditionalGeneration
model_name="mistral-community/pixtral-12b"
model_class=LlavaForConditionalGeneration
加载模型
将模型加载到内存中
model = outlines.models.transformers_vision(
model_name,
model_class=model_class,
model_kwargs={
"device_map": "auto",
"torch_dtype": torch.bfloat16,
},
processor_kwargs={
"device": "cuda", # set to "cpu" if you don't have a GPU
},
)
图像处理
图像可能很大。在 GPU 资源紧张的环境中,您可能需要将图像调整到较小尺寸。
这是一个帮助函数,可以做到这一点
def load_and_resize_image(image_path, max_size=1024):
"""
Load and resize an image while maintaining aspect ratio
Args:
image_path: Path to the image file
max_size: Maximum dimension (width or height) of the output image
Returns:
PIL Image: Resized image
"""
image = Image.open(image_path)
# Get current dimensions
width, height = image.size
# Calculate scaling factor
scale = min(max_size / width, max_size / height)
# Only resize if image is larger than max_size
if scale < 1:
new_width = int(width * scale)
new_height = int(height * scale)
image = image.resize((new_width, new_height), Image.Resampling.LANCZOS)
return image
您可以通过更改 max_size
参数来改变图像的分辨率。较小的 max_size 会使图像更模糊,但处理会更快并需要更少内存。
加载图像
加载图像并调整大小。我们提供了一张 Trader Joe's 收据的示例图片,但您可以使用任何您喜欢的图片。
图像看起来像这样
# Path to the image
image_path = "https://raw.githubusercontent.com/dottxt-ai/outlines/refs/heads/main/docs/cookbook/images/trader-joes-receipt.jpg"
# Download the image
response = requests.get(image_path)
with open("receipt.png", "wb") as f:
f.write(response.content)
# Load + resize the image
image = load_and_resize_image("receipt.png")
定义输出结构
我们将定义一个 Pydantic 模型来描述我们希望从图像中提取的数据。
在我们的案例中,我们希望提取以下信息
- 商店名称
- 商店地址
- 商店电话
- 商品列表,包括名称、数量、单价和总价
- 税金
- 总计
- 日期
- 付款方式
大多数字段是可选的,因为并非所有收据都包含所有信息。
class Item(BaseModel):
name: str
quantity: Optional[int]
price_per_unit: Optional[float]
total_price: Optional[float]
class ReceiptSummary(BaseModel):
store_name: str
store_address: str
store_number: Optional[int]
items: List[Item]
tax: Optional[float]
total: Optional[float]
# Date is in the format YYYY-MM-DD. We can apply a regex pattern to ensure it's formatted correctly.
date: Optional[str] = Field(pattern=r'\d{4}-\d{2}-\d{2}', description="Date in the format YYYY-MM-DD")
payment_method: Literal["cash", "credit", "debit", "check", "other"]
准备提示
我们将使用 AutoProcessor
将图像和文本提示转换为模型可以理解的格式。实际上,这段代码会将用户、系统、助手和图像 token 添加到提示中。
# Set up the content you want to send to the model
messages = [
{
"role": "user",
"content": [
{
# The image is provided as a PIL Image object
"type": "image",
"image": image,
},
{
"type": "text",
"text": f"""You are an expert at extracting information from receipts.
Please extract the information from the receipt. Be as detailed as possible --
missing or misreporting information is a crime.
Return the information in the following JSON schema:
{ReceiptSummary.model_json_schema()}
"""},
],
}
]
# Convert the messages to the final prompt
processor = AutoProcessor.from_pretrained(model_name)
prompt = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
如果您好奇,发送给模型的最终提示(大致)看起来像这样
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
<|vision_start|><|image_pad|><|vision_end|>
You are an expert at extracting information from receipts.
Please extract the information from the receipt. Be as detailed as
possible -- missing or misreporting information is a crime.
Return the information in the following JSON schema:
<JSON SCHEMA OMITTED>
<|im_end|>
<|im_start|>assistant
运行模型
# Prepare a function to process receipts
receipt_summary_generator = outlines.generate.json(
model,
ReceiptSummary,
# Greedy sampling is a good idea for numeric
# data extraction -- no randomness.
sampler=outlines.samplers.greedy()
)
# Generate the receipt summary
result = receipt_summary_generator(prompt, [image])
print(result)
输出
输出应如下所示
ReceiptSummary(
store_name="Trader Joe's",
store_address='401 Bay Street, San Francisco, CA 94133',
store_number=0,
items=[
Item(name='BANANA EACH', quantity=7, price_per_unit=0.23, total_price=1.61),
Item(name='BAREBELLS CHOCOLATE DOUG', quantity=1, price_per_unit=2.29, total_price=2.29),
Item(name='BAREBELLS CREAMY CRISP', quantity=1, price_per_unit=2.29, total_price=2.29),
Item(name='BAREBELLS CHOCOLATE DOUG', quantity=1, price_per_unit=2.29, total_price=2.29),
Item(name='BAREBELLS CARAMEL CASHEW', quantity=2, price_per_unit=2.29, total_price=4.58),
Item(name='BAREBELLS CREAMY CRISP', quantity=1, price_per_unit=2.29, total_price=2.29),
Item(name='SPINDRIFT ORANGE MANGO 8', quantity=1, price_per_unit=7.49, total_price=7.49),
Item(name='Bottle Deposit', quantity=8, price_per_unit=0.05, total_price=0.4),
Item(name='MILK ORGANIC GALLON WHOL', quantity=1, price_per_unit=6.79, total_price=6.79),
Item(name='CLASSIC GREEK SALAD', quantity=1, price_per_unit=3.49, total_price=3.49),
Item(name='COBB SALAD', quantity=1, price_per_unit=5.99, total_price=5.99),
Item(name='PEPPER BELL RED XL EACH', quantity=1, price_per_unit=1.29, total_price=1.29),
Item(name='BAG FEE.', quantity=1, price_per_unit=0.25, total_price=0.25),
Item(name='BAG FEE.', quantity=1, price_per_unit=0.25, total_price=0.25)
],
tax=0.68,
total=41.98,
date='2023-11-04',
payment_method='debit',
)
瞧!您已成功使用 LLM 从收据中提取信息。
附赠:吐槽用户的收据
您可以通过在 ReceiptSummary
模型末尾添加一个 roast
字段来吐槽用户的收据。
这将为您提供一个如下所示的结果
ReceiptSummary(
...
roast="You must be a fan of Trader Joe's because you bought enough
items to fill a small grocery bag and still had to pay for a bag fee.
Maybe you should start using reusable bags to save some money and the
environment."
)
Qwen 不是特别幽默,但值得一试。