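Prompt Tuning With PEFT

Author: Pere Martra

In this notebook we show how to apply prompt tuning to a pretrained model with the PEFT library.

For the complete list of models compatible with PEFT, refer to its documentation.

Examples of models that can be trained with PEFT include Bloom, Llama, GPT-J, GPT-2, BERT, and more. Hugging Face is working hard to add more models to the library.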

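Brief introduction to prompt tuning

It's an additive fine-tuning technique for models. This means that we do NOT modify any weights of the original model. You might be wondering how we perform fine-tuning, then. We train additional layers that are added to the model, which is why it's called an additive technique.

Given that it is an additive technique and that its name is prompt tuning, it seems clear that the layers we add and train are related to the prompt.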

[Figure: prompt tuning diagram]

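We are creating a kind of superprompt by letting the model enhance a portion of the prompt with the knowledge it has acquired. That portion of the prompt, however, cannot be translated into natural language. It's as if we had mastered expressing ourselves in embeddings and generating highly efficient prompts.

In each training cycle, the only weights that can be modified to minimize the loss function are the ones integrated into the prompt.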
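
To make the idea concrete, here is a toy sketch in plain Python. It is not PEFT's implementation: the names frozen_w and soft_prompt, the linear "model", and the squared-error loss are all made up for illustration. A few trainable vectors are prepended to the input embeddings, and gradient descent updates only those vectors while the stand-in model weights stay frozen.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def score(frozen_w, soft_prompt, input_embeds):
    # Virtual tokens are prepended to the real input embeddings.
    sequence = soft_prompt + input_embeds
    return sum(dot(frozen_w, vec) for vec in sequence)


def train_soft_prompt(frozen_w, soft_prompt, input_embeds, target, lr=0.01, steps=50):
    prompt = [list(vec) for vec in soft_prompt]  # trainable copy
    for _ in range(steps):
        error = score(frozen_w, prompt, input_embeds) - target
        # For loss = error ** 2, d(loss)/d(prompt vector) = 2 * error * frozen_w.
        for vec in prompt:
            for j, w_j in enumerate(frozen_w):
                vec[j] -= lr * 2 * error * w_j
    return prompt


frozen_w = (0.5, -1.0, 0.25)                       # stands in for the frozen model
input_embeds = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]  # embeddings of the real tokens
soft_prompt = [[0.0] * 3, [0.0] * 3]               # two virtual tokens
target = 3.0

loss_before = (score(frozen_w, soft_prompt, input_embeds) - target) ** 2
trained_prompt = train_soft_prompt(frozen_w, soft_prompt, input_embeds, target)
loss_after = (score(frozen_w, trained_prompt, input_embeds) - target) ** 2
print(f"loss before: {loss_before:.4f}, loss after: {loss_after:.4f}")
```

The loss drops even though frozen_w is never written to, which is the whole point: the prompt vectors absorb the task-specific learning.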

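The main consequence of this technique is that the number of parameters to train is really small. There is a second, perhaps more important consequence: since we do not modify the weights of the pretrained model, it does not change its behavior or forget anything it previously learned.

Training is faster and more cost-effective. Moreover, we can train several models, and at inference time we only need to load one foundation model plus the new, much smaller trained models, because the weights of the original model were never altered.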

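What are we going to do in the notebook?

We are going to train two different models on two datasets, both starting from the same pretrained model from the Bloom family. One model will be trained on a dataset of prompts, the other on a dataset of inspirational sentences. We will compare the answers of both models to the same question before and after training.

We will also explore how to load both models while keeping only one copy of the foundation model in memory.

Loading the PEFT library

This library contains the Hugging Face implementation of various fine-tuning techniques, including prompt tuning.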

!pip install -q peft==0.8.2

!pip install -q datasets==2.14.5

From the transformers library, we import the classes needed to instantiate the model and the tokenizer.

from transformers import AutoModelForCausalLM, AutoTokenizer

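Loading the model and the tokenizer.

Bloom is one of the smallest and smartest models available to be trained with the PEFT library using prompt tuning. You can choose any model from the Bloom family, and I encourage you to try at least two of them to see the differences.

I'm using the smallest one to minimize training time and avoid memory issues in Colab.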

model_name = "bigscience/bloomz-560m"
# model_name="bigscience/bloom-1b1"
NUM_VIRTUAL_TOKENS = 4
NUM_EPOCHS = 6

tokenizer = AutoTokenizer.from_pretrained(model_name)
foundational_model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

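Inference with the pretrained Bloom model

If you want more varied and original generations, uncomment the temperature, top_p, and do_sample parameters in the model.generate call below.

With the default configuration, the model's answers stay the same across calls.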

# Generate a completion for already-tokenized inputs with the given model.
def get_outputs(model, inputs, max_new_tokens=100):
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=max_new_tokens,
        # temperature=0.2,
        # top_p=0.95,
        # do_sample=True,
        repetition_penalty=1.5,  # Avoid repetition.
        early_stopping=True,  # Beam-search stopping condition; no effect with the default greedy decoding
        eos_token_id=tokenizer.eos_token_id,
    )
    return outputs
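
To see what the commented-out sampling parameters do, here is a small sketch in plain Python of temperature scaling and top-p (nucleus) filtering on made-up logits. It mirrors the idea rather than the transformers implementation, and the function names are my own.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def top_p_filter(probs, top_p=0.95):
    # Keep the most likely tokens until their cumulative probability reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in ranked:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalized candidates


toy_logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
flat = softmax_with_temperature(toy_logits, temperature=1.0)
sharp = softmax_with_temperature(toy_logits, temperature=0.2)
print("T=1.0:", [round(p, 3) for p in flat])
print("T=0.2:", [round(p, 3) for p in sharp])
print("top_p=0.95 at T=0.2 keeps token ids:", list(top_p_filter(sharp)))
```

A low temperature concentrates probability on the top token, so the nucleus shrinks. With do_sample=True, the next token is drawn from the candidates that survive the filter instead of always taking the most likely one.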

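Since we want two different trained models, I will create two distinct prompts.

The first model will be trained on a dataset of prompts, and the second on a dataset of inspirational sentences.

The first model will receive the prompt "I want you to act as a motivational coach." and the second will receive "There are two nice things that should matter to you:".

But first, I'm going to collect some results from the model without fine-tuning.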

>>> input_prompt = tokenizer("I want you to act as a motivational coach. ", return_tensors="pt")
>>> foundational_outputs_prompt = get_outputs(foundational_model, input_prompt, max_new_tokens=50)

>>> print(tokenizer.batch_decode(foundational_outputs_prompt, skip_special_tokens=True))

["I want you to act as a motivational coach.  Don't be afraid of being challenged."]

>>> input_sentences = tokenizer("There are two nice things that should matter to you:", return_tensors="pt")
>>> foundational_outputs_sentence = get_outputs(foundational_model, input_sentences, max_new_tokens=50)

>>> print(tokenizer.batch_decode(foundational_outputs_sentence, skip_special_tokens=True))

['There are two nice things that should matter to you: the price and quality of your product.']

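Both answers are more or less reasonable. Any Bloom model is pretrained and can generate sensible, well-formed sentences. Let's see whether, after training, the responses stay the same or become more accurate.

Preparing the datasets

The datasets used are fka/awesome-chatgpt-prompts (prompts) and Abirate/english_quotes (inspirational sentences), both loaded with load_dataset.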

import os

# os.environ["TOKENIZERS_PARALLELISM"] = "false"

from datasets import load_dataset

dataset_prompt = "fka/awesome-chatgpt-prompts"

# Load the prompts dataset and tokenize its "prompt" column.
data_prompt = load_dataset(dataset_prompt)
data_prompt = data_prompt.map(lambda samples: tokenizer(samples["prompt"]), batched=True)
train_sample_prompt = data_prompt["train"].select(range(50))

display(train_sample_prompt)

>>> print(train_sample_prompt[:1])

{'act': ['Linux Terminal'], 'prompt': ['I want you to act as a linux terminal. I will type commands and you will reply with what the terminal should show. I want you to only reply with the terminal output inside one unique code block, and nothing else. do not write explanations. do not type commands unless I instruct you to do so. when i need to tell you something in english, i will do so by putting text inside curly brackets {like this}. my first command is pwd'], 'input_ids': [[44, 4026, 1152, 427, 1769, 661, 267, 104105, 28434, 17, 473, 2152, 4105, 49123, 530, 1152, 2152, 57502, 1002, 3595, 368, 28434, 3403, 6460, 17, 473, 4026, 1152, 427, 3804, 57502, 1002, 368, 28434, 10014, 14652, 2592, 19826, 4400, 10973, 15, 530, 16915, 4384, 17, 727, 1130, 11602, 184637, 17, 727, 1130, 4105, 49123, 35262, 473, 32247, 1152, 427, 727, 1427, 17, 3262, 707, 3423, 427, 13485, 1152, 7747, 361, 170205, 15, 707, 2152, 727, 1427, 1331, 55385, 5484, 14652, 6291, 999, 117805, 731, 29726, 1119, 96, 17, 2670, 3968, 9361, 632, 269, 42512]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
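
The record above has two new columns, input_ids and attention_mask, because map was called with batched=True: the function receives a dictionary of columns, each holding a list of values, and the columns it returns are added to every row. The sketch below imitates that behavior in plain Python. Both map_batched and the whitespace tokenizer are hypothetical stand-ins, and the two rows are made up.

```python
def map_batched(rows, fn, batch_size=1000):
    # Imitates a batched map: fn sees {column: [values]} and returns new columns.
    result = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_columns = fn(batch)
        for i, row in enumerate(chunk):
            merged = dict(row)
            for key, values in new_columns.items():
                merged[key] = values[i]
            result.append(merged)
    return result


def make_toy_tokenizer():
    vocab = {}  # word -> id, filled in as new words appear

    def tokenize(texts):
        ids = [[vocab.setdefault(w, len(vocab)) for w in t.lower().split()] for t in texts]
        return {"input_ids": ids, "attention_mask": [[1] * len(seq) for seq in ids]}

    return tokenize


toy_tokenizer = make_toy_tokenizer()
rows = [
    {"act": "Linux Terminal", "prompt": "I want you to act as a linux terminal"},
    {"act": "Poet", "prompt": "I want you to act as a poet"},
]
tokenized = map_batched(rows, lambda samples: toy_tokenizer(samples["prompt"]))
print(tokenized[1])
```

The original columns are kept next to the new ones, which is why act and prompt still appear in the printed record.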

dataset_sentences = load_dataset("Abirate/english_quotes")

data_sentences = dataset_sentences.map(lambda samples: tokenizer(samples["quote"]), batched=True)
train_sample_sentences = data_sentences["train"].select(range(25))
train_sample_sentences = train_sample_sentences.remove_columns(["author", "tags"])

display(train_sample_sentences)

Fine-tuning

PEFT configuration

API docs: https://huggingface.co/docs/peft/main/en/package_reference/tuners#peft.PromptTuningConfig

We can use the same configuration for both models to be trained.

from peft import get_peft_model, PromptTuningConfig, TaskType, PromptTuningInit

generation_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,  # This type indicates the model will generate text.
    prompt_tuning_init=PromptTuningInit.RANDOM,  # The added virtual tokens are initialized with random numbers
    num_virtual_tokens=NUM_VIRTUAL_TOKENS,  # Number of virtual tokens to be added and trained.
    tokenizer_name_or_path=model_name,  # The pre-trained model.
)

Creating two prompt tuning models.

We will create two identical prompt tuning models from the same pretrained model and the same configuration.

>>> peft_model_prompt = get_peft_model(foundational_model, generation_config)
>>> peft_model_prompt.print_trainable_parameters()

trainable params: 4,096 || all params: 559,218,688 || trainable%: 0.0007324504863471229

>>> peft_model_sentences = get_peft_model(foundational_model, generation_config)
>>> peft_model_sentences.print_trainable_parameters()

trainable params: 4,096 || all params: 559,218,688 || trainable%: 0.0007324504863471229

That's amazing: did you see the reduction in trainable parameters? We are going to train about 0.0007% of the available parameters.
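
The 4,096 figure can be reproduced by hand: each virtual token is one embedding vector, so the trainable parameter count is the number of virtual tokens times the embedding width. The sketch below assumes a hidden size of 1024 for bloomz-560m; that value is not stated in this notebook, although it is what 4,096 / 4 implies.

```python
NUM_VIRTUAL_TOKENS = 4
HIDDEN_SIZE = 1024        # assumption: embedding width of bloomz-560m
ALL_PARAMS = 559_218_688  # total reported by print_trainable_parameters()

trainable = NUM_VIRTUAL_TOKENS * HIDDEN_SIZE
percent = 100 * trainable / ALL_PARAMS
print(f"trainable params: {trainable:,} || trainable%: {percent}")
```

Doubling NUM_VIRTUAL_TOKENS doubles the trainable parameters, so the count stays tiny next to the frozen model either way.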

Now we are going to create the training arguments, and we will use the same configuration for both training runs.

from transformers import TrainingArguments


def create_training_arguments(path, learning_rate=0.0035, epochs=6):
    training_args = TrainingArguments(
        output_dir=path,  # Where the model predictions and checkpoints will be written
        use_cpu=True,  # This is necessary for CPU clusters.
        auto_find_batch_size=True,  # Find a suitable batch size that will fit into memory automatically
        learning_rate=learning_rate,  # Higher learning rate than full Fine-Tuning
        num_train_epochs=epochs,
    )
    return training_args

import os

working_dir = "./"

# It's best to store the models in separate folders.
# Build the names of the directories where the models will be stored.
output_directory_prompt = os.path.join(working_dir, "peft_outputs_prompt")
output_directory_sentences = os.path.join(working_dir, "peft_outputs_sentences")

# Create the directories if they don't exist yet.
os.makedirs(output_directory_prompt, exist_ok=True)
os.makedirs(output_directory_sentences, exist_ok=True)

When creating the TrainingArguments, we need to specify the directory where each model will be stored.

training_args_prompt = create_training_arguments(output_directory_prompt, 0.003, NUM_EPOCHS)
training_args_sentences = create_training_arguments(output_directory_sentences, 0.003, NUM_EPOCHS)
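
As a rough sanity check on how long training will take, we can estimate the number of optimizer steps per run. This is a sketch under assumptions: a batch size of 8 (the Trainer default, which auto_find_batch_size only lowers after an out-of-memory error), a single device, and no gradient accumulation. The helper optimizer_steps is hypothetical.

```python
import math


def optimizer_steps(num_samples, batch_size, epochs):
    # The last, smaller batch of each epoch still counts as one step.
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch * epochs


ASSUMED_BATCH_SIZE = 8  # assumption: default batch size, never reduced
NUM_EPOCHS = 6

steps_prompt = optimizer_steps(50, ASSUMED_BATCH_SIZE, NUM_EPOCHS)     # 50 prompt samples
steps_sentences = optimizer_steps(25, ASSUMED_BATCH_SIZE, NUM_EPOCHS)  # 25 quote samples
print(steps_prompt, steps_sentences)
```

With so few updates per run, the relatively high learning rate passed to create_training_arguments is what lets the virtual tokens move far enough to matter.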

Training

We will create a Trainer object for each model to train.

from transformers import Trainer, DataCollatorForLanguageModeling


def create_trainer(model, training_args, train_dataset):
    trainer = Trainer(
        model=model,  # We pass in the PEFT version of the foundation model, bloomz-560M
        args=training_args,  # The args for the training.
        train_dataset=train_dataset,  # The dataset used to train the model.
        data_collator=DataCollatorForLanguageModeling(
            tokenizer, mlm=False
        ),  # mlm=False indicates not to use masked language modeling
    )
    return trainer
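
With mlm=False, the data collator prepares batches for causal language modeling: sequences are padded to a common length, and the labels are a copy of the input ids with padded positions set to -100 so the loss ignores them. The sketch below reproduces that idea in plain Python. It is a simplification: the function name and the pad id are hypothetical, it always pads on the right, and the real collator masks every position whose id equals the pad token id.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def collate_for_causal_lm(examples, pad_token_id):
    max_len = max(len(example["input_ids"]) for example in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for example in examples:
        ids = example["input_ids"]
        n_pad = max_len - len(ids)
        batch["input_ids"].append(ids + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
        # Labels mirror the inputs; the model shifts them internally.
        batch["labels"].append(ids + [IGNORE_INDEX] * n_pad)
    return batch


toy_batch = collate_for_causal_lm(
    [{"input_ids": [44, 4026, 1152]}, {"input_ids": [44]}],
    pad_token_id=3,  # hypothetical pad id for this example
)
print(toy_batch["labels"])
```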

# Training first model.
trainer_prompt = create_trainer(peft_model_prompt, training_args_prompt, train_sample_prompt)
trainer_prompt.train()

# Training second model.
trainer_sentences = create_trainer(peft_model_sentences, training_args_sentences, train_sample_sentences)
trainer_sentences.train()

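In less than 10 minutes (CPU time on an M1 Pro) we trained two different models for two different tasks from the same foundation model.

Saving the models

We are going to save the models. They are ready to use as long as the pretrained model they were created from is loaded in memory.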

trainer_prompt.model.save_pretrained(output_directory_prompt)
trainer_sentences.model.save_pretrained(output_directory_sentences)
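
What gets saved is only the small set of trained prompt parameters, not the foundation model. A back-of-the-envelope estimate shows why that matters when serving several tasks. It assumes 4 bytes per parameter (float32) and ignores file-format overhead; the helper name is hypothetical, and the parameter counts are the ones printed earlier.

```python
BYTES_PER_PARAM = 4  # assumption: float32 weights
TRAINABLE_PARAMS = 4_096
ALL_PARAMS = 559_218_688
BASE_PARAMS = ALL_PARAMS - TRAINABLE_PARAMS

adapter_bytes = TRAINABLE_PARAMS * BYTES_PER_PARAM
base_bytes = BASE_PARAMS * BYTES_PER_PARAM


def serving_bytes(num_tasks, share_base):
    if share_base:
        # One foundation model plus one small soft prompt per task.
        return base_bytes + num_tasks * adapter_bytes
    # One full fine-tuned copy per task.
    return num_tasks * (base_bytes + adapter_bytes)


shared = serving_bytes(2, share_base=True)
separate = serving_bytes(2, share_base=False)
print(f"adapter: {adapter_bytes / 1024:.0f} KiB, base: {base_bytes / 1e9:.2f} GB")
print(f"two tasks, shared base: {shared / 1e9:.2f} GB, separate copies: {separate / 1e9:.2f} GB")
```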

Inference

You can load a model from the path where it was saved and ask it to generate text from our inputs!

from peft import PeftModel

loaded_model_prompt = PeftModel.from_pretrained(
    foundational_model,
    output_directory_prompt,
    # device_map='auto',
    is_trainable=False,
)

>>> loaded_model_prompt_outputs = get_outputs(loaded_model_prompt, input_prompt)
>>> print(tokenizer.batch_decode(loaded_model_prompt_outputs, skip_special_tokens=True))

['I want you to act as a motivational coach.  You will be helping students learn how they can improve their performance in the classroom and at school.']

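Comparing the two answers, something has changed.

Pretrained model: I want you to act as a motivational coach.  Don't be afraid of being challenged.
Fine-tuned model: I want you to act as a motivational coach.  You will be helping students learn how they can improve their performance in the classroom and at school.

Keep in mind that we trained the models for only a few minutes, yet that was enough to obtain a response closer to what we were looking for.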

loaded_model_prompt.load_adapter(output_directory_sentences, adapter_name="quotes")
loaded_model_prompt.set_adapter("quotes")

>>> loaded_model_sentences_outputs = get_outputs(loaded_model_prompt, input_sentences)
>>> print(tokenizer.batch_decode(loaded_model_sentences_outputs, skip_special_tokens=True))

['There are two nice things that should matter to you: the weather and your health.']

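With the second model we get a similar result.

Pretrained model: There are two nice things that should matter to you: the price and quality of your product.
Fine-tuned model: There are two nice things that should matter to you: the weather and your health.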
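
Both answers came from the same loaded foundation model; only the active adapter changed between calls. The toy class below sketches that pattern in plain Python. It is not how PeftModel is implemented: SoftPromptHost and its methods are hypothetical and only echo the load_adapter / set_adapter calls used above.

```python
class SoftPromptHost:
    """Toy stand-in: one shared base model plus several named soft prompts."""

    def __init__(self, base_weights):
        self.base_weights = base_weights  # shared, never copied per adapter
        self.adapters = {}
        self.active_adapter = None

    def load_adapter(self, soft_prompt, adapter_name):
        self.adapters[adapter_name] = soft_prompt
        if self.active_adapter is None:
            self.active_adapter = adapter_name

    def set_adapter(self, adapter_name):
        if adapter_name not in self.adapters:
            raise KeyError(f"unknown adapter: {adapter_name}")
        self.active_adapter = adapter_name

    def build_inputs(self, input_embeds):
        # Prepend the active soft prompt to the real input embeddings.
        return self.adapters[self.active_adapter] + input_embeds


base_weights = [0.5, -1.0, 0.25]  # made-up frozen weights
host = SoftPromptHost(base_weights)
host.load_adapter([[0.1, 0.2, 0.3]], adapter_name="default")
host.load_adapter([[0.9, 0.8, 0.7]], adapter_name="quotes")

inputs = [[1.0, 0.0, 2.0]]
with_default = host.build_inputs(inputs)
host.set_adapter("quotes")
with_quotes = host.build_inputs(inputs)
print(with_default[0], with_quotes[0])
```

Switching tasks is just a dictionary lookup here, which hints at why one foundation model can serve several prompt-tuned tasks.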

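Conclusion

Prompt tuning is an amazing technique that can save us hours of training and a great amount of money. In this notebook we trained two models in just a few minutes, and we can keep both in memory, serving different clients.

If you'd like to try different combinations and models, the notebook is ready to use another model from the Bloom family.

You can change the number of epochs, the number of virtual tokens, and the model in the cell that defines model_name, NUM_VIRTUAL_TOKENS, and NUM_EPOCHS. There are many other settings to experiment with, too. If you're looking for a good exercise, replace the random initialization of the virtual tokens with a fixed value.

The responses of the fine-tuned models may vary every time we train them. I've pasted the results of one of my training runs, but your results may differ.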
