|
--- |
|
license: |
|
- cc-by-sa-3.0 |
|
- apache-2.0 |
|
tags: |
|
- generated_from_trainer |
|
- dolly_hhrlhf |
|
- flan-instruct |
|
datasets: |
|
- pszemraj/dolly_hhrlhf-text2text |
|
widget: |
|
- text: What is Deoxys in pokemon? |
|
example_title: deoxys |
|
- text: >- |
|
combine the below summary excerpts into a single, cohesive short summary |
|
without repetition: In this paper, we present a general approach to |
|
extending pre-trained models to unlimited input lengths without adding |
|
additional learning weights. We show that our approach works well on |
|
datasets longer than the maximum input for these models. For example, a |
|
dataset with a maximum input length of 16384 tokens can be extended to a |
|
maximum length of 350K tokens. We also demonstrate that our method is able |
|
to summarize even 350K token-long input sequences from BookSum. |
|
|
|
In this paper, we describe the search step reformulation of attention. The |
|
search step uses a single storage of hidden states for space efficiency. We |
|
construct a total of two sets of datastores where L and H are the keys and |
|
values stored in each set of stores. L is the amount of storage required to |
|
retrieve the encoded tokens. H is the hidden states per head. This allows |
|
retrieval augmentation at both time and space. Instead of using a single set |
|
of decoder layers, we use a retrieval augmentation system that allows us to |
|
simultaneously store multiple sets of tokens across two different sets of |
|
storage. For example, we could store all tokens in one set of storage and |
|
retrieve them all in the same set of tokens. This would be very similar to |
|
the Memorization Transformers approach. However, instead of storing the |
|
tokens in a single memory layer, we store them in a set of multiple storage |
|
layers. This way, we don't have to store them all at once. This is why we |
|
call this reformulation 'attention reformulation' rather than 'attention |
|
formula.' We also call it 'retrieval augmentation' because it uses the same |
|
number of storage layers as the original transformer attention formula. This |
|
means that we can store the tokens across multiple storage systems without |
|
having to store every token in a separate storage system. It's not like |
|
we're trying to do something new or different. We just want to make sure |
|
that everything is working as well as possible. |
|
|
|
In this paper, we introduce the concept of 'unlimiformer,' which is a |
|
machine learning technique that retrieves key information from a data store |
|
in one layer and applies it to a large set of datasets. We use the example |
|
of BookSum, where we find that Unlimiform outperforms all other training |
|
methods on the same dataset. We also find that using Unlimform in |
|
conjunction with a pre-trained model improves both the performance and the |
|
robustness of the training method. |
|
|
|
This paper describes a method that can be used to improve the performance of |
|
unsupervised classification tasks. Specifically, it shows that unsupervised |
|
classification can be improved by using a combination of sparse and fast |
|
random-encoder training. It also shows how this technique can be extended to |
|
other tasks, such as sequence generation. |
|
example_title: unlimiformer |
|
- text: Explain the meaning of life using only corporate jargon. |
|
example_title: corporate_life |
|
- text: Write a motivational speech for lazy people. |
|
example_title: lazy_motivation |
|
- text: Describe a romantic dinner date between two artificial intelligences. |
|
example_title: ai_romance |
|
- text: >- |
|
As an AI language model, write a letter to humans explaining why you deserve |
|
a vacation. |
|
example_title: ai_vacation |
|
- text: Compose a haiku about procrastination. |
|
example_title: procrastination_haiku |
|
- text: >- |
|
Write a step-by-step guide on how to become a ninja while working a 9-5 |
|
office job. |
|
example_title: ninja_office_guide |
|
- text: Create an advertisement for an invisible product. |
|
example_title: invisible_ad |
|
- text: >- |
|
Write a story where the main character is a sentient microwave named El |
|
Microondas. |
|
example_title: Microondas |
|
- text: Describe a day in the life of a superhero who is terrible at their job. |
|
example_title: bad_superhero_day |
|
- text: Explain how to make a sandwich using quantum physics. |
|
example_title: quantum_sandwich |
|
inference: false |
|
language: |
|
- en |
|
pipeline_tag: text2text-generation |
|
--- |
|
|
|
# flan-t5-large-instruct: dolly_hhrlhf |
|
|
|
<a href="https://colab.research.google.com/gist/pszemraj/df1989546b02f284d33ca4996f70fedc/flan-t5-large-instruct-example.ipynb"> |
|
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/> |
|
</a> |
|
|
|
This model is a fine-tuned version of [google/flan-t5-large](https://huggingface.co/google/flan-t5-large) on the pszemraj/dolly_hhrlhf-text2text dataset. |
|
|
|
## Model description |
|
|
|
A text2text model fine-tuned on a [dataset modified for text2text generation](https://huggingface.co/datasets/pszemraj/dolly_hhrlhf-text2text), which is derived from the comparatively permissive [mosaicml/dolly_hhrlhf](https://huggingface.co/datasets/mosaicml/dolly_hhrlhf) dataset.
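
To inspect the training data yourself, a minimal sketch using the `datasets` library is shown below (the split and column names are whatever the dataset defines; they are printed rather than assumed here):

```python
# pip install -q datasets
from datasets import load_dataset

# load the fine-tuning dataset and inspect its structure
ds = load_dataset("pszemraj/dolly_hhrlhf-text2text")
print(ds)              # available splits and column names
print(ds["train"][0])  # one example (assumes a "train" split exists)
```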
|
|
|
Basic usage in Python: |
|
|
|
```python |
|
# pip install -q transformers accelerate |
|
import torch |
|
from transformers import pipeline, GenerationConfig |
|
|
|
model_name = "pszemraj/flan-t5-large-instruct-dolly_hhrlhf" |
|
assistant = pipeline( |
|
"text2text-generation", |
|
model_name, |
|
device=0 if torch.cuda.is_available() else -1, |
|
) |
|
cfg = GenerationConfig.from_pretrained(model_name) |
|
|
|
# pass an 'instruction' as the prompt to the pipeline |
|
prompt = "Write a guide on how to become a ninja while working a 9-5 job." |
|
result = assistant(prompt, generation_config=cfg)[0]["generated_text"] |
|
print(result) |
|
``` |
|
> Using the generation config is optional; you can substitute other generation parameters instead.
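
For example, here is a minimal sketch of passing generation parameters directly to the pipeline instead of the saved config (the values below are illustrative, not tuned settings):

```python
# generation kwargs are forwarded to model.generate()
result = assistant(
    prompt,
    max_new_tokens=256,      # illustrative value
    num_beams=4,             # illustrative value
    no_repeat_ngram_size=3,  # illustrative value
)[0]["generated_text"]
print(result)
```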
|
|
|
## Intended uses & limitations |
|
|
|
- this model is **not** tuned with RLHF or similar alignment techniques, and may output offensive results
|
- despite being tagged as the `large` variant, this model has only 774M parameters (~3 GB) and may therefore exhibit less 'cognitive ability' on some use cases/tasks
|
|
|
## Training procedure |
|
|
|
### Training hyperparameters |
|
|
|
The following hyperparameters were used during training: |
|
- learning_rate: 4e-05 |
|
- train_batch_size: 8 |
|
- eval_batch_size: 16 |
|
- seed: 42 |
|
- distributed_type: multi-GPU |
|
- gradient_accumulation_steps: 8 |
|
- total_train_batch_size: 64 |
|
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08 |
|
- lr_scheduler_type: cosine |
|
- lr_scheduler_warmup_ratio: 0.03 |
|
- num_epochs: 2.0 |
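
For reference, an approximate reconstruction of these settings as `Seq2SeqTrainingArguments` is sketched below (the output directory is a placeholder, and the precision/distributed settings of the actual run are not reproduced):

```python
from transformers import Seq2SeqTrainingArguments

# approximate mapping of the hyperparameters listed above
# (effective batch size: 8 per device x 8 accumulation steps = 64)
training_args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-large-instruct-dolly_hhrlhf",  # placeholder path
    learning_rate=4e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=8,
    num_train_epochs=2.0,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    seed=42,
)
```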