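Run Outlines using BentoML

BentoML is an open-source model serving library for building performant and scalable AI applications with Python. It provides the tools needed for serving optimization, model packaging, and production deployment.

This guide shows how to use BentoML to run programs written with Outlines on a local GPU and in BentoCloud, an AI inference platform for enterprise AI teams. The example source code in this guide is also available in the examples/bentoml/ directory.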

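Import a model

First we need to download an LLM (Mistral-7B-v0.1 in this example; you can use any other LLM) and import it into BentoML's model store. Let's install BentoML and the other dependencies from PyPI (preferably in a virtual environment):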

pip install -r requirements.txt

Then save the code snippet below as import_model.py and run python import_model.py.

Note: You need to accept the relevant terms on Hugging Face before you can access Mistral-7B-v0.1.

import bentoml

MODEL_ID = "mistralai/Mistral-7B-v0.1"
BENTO_MODEL_TAG = MODEL_ID.lower().replace("/", "--")

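def import_model(model_id, bento_model_tag):
    # Imported lazily so that importing this module (e.g. from service.py)
    # does not load torch and transformers.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Use the function arguments rather than the module-level constants.
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        low_cpu_mem_usage=True,
    )

    # Save the tokenizer and weights into a new entry in the BentoML model store.
    with bentoml.models.create(bento_model_tag) as bento_model_ref:
        tokenizer.save_pretrained(bento_model_ref.path)
        model.save_pretrained(bento_model_ref.path)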


if __name__ == "__main__":
    import_model(MODEL_ID, BENTO_MODEL_TAG)

You can verify that the download succeeded by running:

$ bentoml models list

Tag                                          Module  Size        Creation Time
mistralai--mistral-7b-v0.1:m7lmf5ac2cmubnnz          13.49 GiB   2024-04-25 06:52:39
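
The tag in the listing has the form name:version. The name part comes from the BENTO_MODEL_TAG expression in import_model.py; the version suffix appears to be generated by BentoML when none is given. The following is a minimal sketch of that mapping, using only the standard library. The helper names to_bento_tag and split_tag are hypothetical and are not part of BentoML.

```python
def to_bento_tag(model_id: str) -> str:
    """Mirror the BENTO_MODEL_TAG expression used in import_model.py."""
    return model_id.lower().replace("/", "--")


def split_tag(full_tag: str) -> tuple[str, str]:
    """Split a 'name:version' tag as printed by `bentoml models list`."""
    name, _, version = full_tag.partition(":")
    return name, version


# Name derived from the Hugging Face model ID used in this guide.
name = to_bento_tag("mistralai/Mistral-7B-v0.1")

# Tag copied from the sample listing above; your version suffix will differ.
listed_name, version = split_tag("mistralai--mistral-7b-v0.1:m7lmf5ac2cmubnnz")

print(name)
print(listed_name == name)
```

If the names match, the entry in the listing is the model that import_model.py created.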

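Define a BentoML Service

With the model ready, we can define a BentoML Service that wraps the model's capabilities.

We will run the JSON-structured generation example from the README, with the following schema: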

DEFAULT_SCHEMA = """{
    "title": "Character",
    "type": "object",
    "properties": {
        "name": {
            "title": "Name",
            "maxLength": 10,
            "type": "string"
        },
        "age": {
            "title": "Age",
            "type": "integer"
        },
        "armor": {"$ref": "#/definitions/Armor"},
        "weapon": {"$ref": "#/definitions/Weapon"},
        "strength": {
            "title": "Strength",
            "type": "integer"
        }
    },
    "required": ["name", "age", "armor", "weapon", "strength"],
    "definitions": {
        "Armor": {
            "title": "Armor",
            "description": "An enumeration.",
            "enum": ["leather", "chainmail", "plate"],
            "type": "string"
        },
        "Weapon": {
            "title": "Weapon",
            "description": "An enumeration.",
            "enum": ["sword", "axe", "mace", "spear", "bow", "crossbow"],
            "type": "string"
        }
    }
}"""

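First, we define a BentoML Service by decorating an ordinary class (Outlines here) with the @bentoml.service decorator. We pass this decorator some configuration along with the GPU on which we want the Service to run in BentoCloud (here an L4 with 24GB of memory):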

import typing as t
import bentoml

from import_model import BENTO_MODEL_TAG

@bentoml.service(
    traffic={
        "timeout": 300,
    },
    resources={
        "gpu": 1,
        "gpu_type": "nvidia-l4",
    },
)
class Outlines:

    bento_model_ref = bentoml.models.get(BENTO_MODEL_TAG)

    def __init__(self) -> None:

        import outlines
        import torch
        self.model = outlines.models.transformers(
            self.bento_model_ref.path,
            device="cuda",
            model_kwargs={"torch_dtype": torch.float16},
        )

    ...

We then define an HTTP endpoint by decorating the generate method of the Outlines class with the @bentoml.api decorator:

    ...

    @bentoml.api
    async def generate(
        self,
        prompt: str = "Give me a character description.",
        json_schema: t.Optional[str] = DEFAULT_SCHEMA,
    ) -> t.Dict[str, t.Any]:

        import outlines

        generator = outlines.generate.json(self.model, json_schema)
        character = generator(prompt)

        return character

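Here the @bentoml.api decorator defines generate as an HTTP endpoint that accepts a JSON request body with two fields: prompt and json_schema (optional, which lets HTTP clients provide their own JSON schema). The type hints in the function signature are used to validate incoming JSON requests. You can define as many HTTP endpoints as you want by decorating other methods of the Outlines class with @bentoml.api.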
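
Because json_schema is typed as a string in the signature, a client that wants to supply its own schema has to send it as a JSON-encoded string inside the JSON body. The sketch below shows one way to build such a body with the standard library. The Pet schema and the build_payload helper are hypothetical and exist only for illustration.

```python
import json

# Hypothetical schema, used only to illustrate a client-supplied json_schema.
custom_schema = {
    "title": "Pet",
    "type": "object",
    "properties": {
        "name": {"type": "string", "maxLength": 10},
        "species": {"type": "string", "enum": ["cat", "dog"]},
    },
    "required": ["name", "species"],
}


def build_payload(prompt, schema=None):
    """Build the request body for the generate endpoint as a dict."""
    payload = {"prompt": prompt}
    if schema is not None:
        # The endpoint expects a string, so the schema is JSON-encoded first.
        payload["json_schema"] = json.dumps(schema)
    return payload


# Omitting the schema leaves json_schema out, so the service default applies.
default_body = json.dumps(build_payload("Give me a character description."))

# Supplying a schema nests it as an encoded string in the outer JSON body.
custom_body = json.dumps(build_payload("Describe a pet.", custom_schema))
print(custom_body)
```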

Now you can save the above code to service.py (or use this implementation), and run it with the BentoML CLI.

Run locally for testing and debugging

You can start a server locally with:

bentoml serve .

The server is now active at http://localhost:3000. You can interact with it using the Swagger UI or in other ways:

CURL

curl -X 'POST' \
  'http://localhost:3000/generate' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "prompt": "Give me a character description."
}'

Python client

import bentoml

with bentoml.SyncHTTPClient("http://localhost:3000") as client:
    response = client.generate(
        prompt="Give me a character description"
    )
    print(response)
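
If BentoML is not installed on the client machine, the same call can be made with the Python standard library alone. This is a minimal sketch that mirrors the curl command; it assumes the server is reachable at http://localhost:3000, and the helper names build_generate_request and call_generate are hypothetical.

```python
import json
import urllib.request

# Assumption: address of a server started locally with `bentoml serve`.
BASE_URL = "http://localhost:3000"


def build_generate_request(prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request for the /generate endpoint without sending it."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=body,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


def call_generate(prompt: str, timeout: float = 300.0) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    request = build_generate_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not touch the network.
request = build_generate_request("Give me a character description.")
print(request.get_method(), request.full_url)
```

With the server running, call_generate("Give me a character description.") should return the generated character as a dict. The 300-second client timeout is chosen to match the traffic timeout configured on the Service.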

Expected output

{
  "name": "Aura",
  "age": 15,
  "armor": "plate",
  "weapon": "sword",
  "strength": 20
}
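
The generated values differ from run to run, but every response should conform to the schema. The sketch below checks a response against a compact copy of DEFAULT_SCHEMA (titles and descriptions omitted). It is a hand-written check covering only the keywords this schema uses (required, type, enum, maxLength, and local $ref), not a general JSON Schema validator; the names resolve and check are hypothetical.

```python
# Compact copy of DEFAULT_SCHEMA; titles and descriptions are left out.
SCHEMA = {
    "properties": {
        "name": {"type": "string", "maxLength": 10},
        "age": {"type": "integer"},
        "armor": {"$ref": "#/definitions/Armor"},
        "weapon": {"$ref": "#/definitions/Weapon"},
        "strength": {"type": "integer"},
    },
    "required": ["name", "age", "armor", "weapon", "strength"],
    "definitions": {
        "Armor": {"type": "string", "enum": ["leather", "chainmail", "plate"]},
        "Weapon": {
            "type": "string",
            "enum": ["sword", "axe", "mace", "spear", "bow", "crossbow"],
        },
    },
}

PY_TYPES = {"string": str, "integer": int}


def resolve(spec, schema):
    """Follow a local '#/definitions/<Name>' reference, if present."""
    ref = spec.get("$ref")
    if ref is None:
        return spec
    return schema["definitions"][ref.rsplit("/", 1)[-1]]


def check(instance, schema=SCHEMA):
    """Return a list of problems; an empty list means all checks passed."""
    problems = [f"missing field: {key}" for key in schema["required"] if key not in instance]
    for key, raw_spec in schema["properties"].items():
        if key not in instance:
            continue
        spec, value = resolve(raw_spec, schema), instance[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(value, PY_TYPES[spec["type"]]) or isinstance(value, bool):
            problems.append(f"{key}: expected {spec['type']}")
            continue
        if "enum" in spec and value not in spec["enum"]:
            problems.append(f"{key}: {value!r} is not an allowed value")
        if "maxLength" in spec and len(value) > spec["maxLength"]:
            problems.append(f"{key}: longer than {spec['maxLength']} characters")
    return problems


expected = {"name": "Aura", "age": 15, "armor": "plate", "weapon": "sword", "strength": 20}
print(check(expected))
```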

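Deploy to BentoCloud

Once the Service is ready, you can deploy it to BentoCloud for better management and scalability. Sign up if you do not have a BentoCloud account yet.

Make sure you are logged in to BentoCloud, then run the following command to deploy: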

bentoml deploy .

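Once the application is up and running on BentoCloud, you can access it via the exposed URL.

Note: For custom deployment in your own infrastructure, use BentoML to generate an OCI-compliant image.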