Upload OpenAssistant/falcon-7b-sft-top1-696 ctranslate fp16 weights

ee75241 over 1 year ago

5.98 kB

	---
	license: apache-2.0
	language:
	- en
	- de
	- es
	- fr
	tags:
	- ctranslate2
	- int8
	- float16
	- sft
	pipeline_tag: text-generation
	widget:
	- text: >-
	<\|prompter\|>What is a meme, and what's the history behind this
	word?<\|endoftext\|><\|assistant\|>
	- text: <\|prompter\|>What's the Earth total population<\|endoftext\|><\|assistant\|>
	- text: >-
	<\|prompter\|>Write a story about future of AI
	development<\|endoftext\|><\|assistant\|>
	datasets:
	- OpenAssistant/oasst1
	library_name: transformers
	---
	# # Fast-Inference with Ctranslate2
	Speedup inference while reducing memory by 2x-4x using int8 inference in C++ on CPU or GPU.

	quantized version of [OpenAssistant/falcon-7b-sft-top1-696](https://huggingface.co/OpenAssistant/falcon-7b-sft-top1-696)
	```bash
	pip install hf-hub-ctranslate2>=2.10.0 ctranslate2>=3.16.0
	```

	```python
	# from transformers import AutoTokenizer
	model_name = "michaelfeil/ct2fast-falcon-7b-sft-top1-696"

	from hf_hub_ctranslate2 import GeneratorCT2fromHfHub
	model = GeneratorCT2fromHfHub(
	# load in int8 on CUDA
	model_name_or_path=model_name,
	device="cuda",
	compute_type="int8_float16",
	# tokenizer=AutoTokenizer.from_pretrained("{ORG}/{NAME}")
	)
	outputs = model.generate(
	text=["def fibonnaci(", "User: How are you doing? Bot:"],
	max_length=64,
	include_prompt_in_result=False
	)
	print(outputs)
	```

	Checkpoint compatible to [ctranslate2>=3.16.0](https://github.com/OpenNMT/CTranslate2)
	and [hf-hub-ctranslate2>=2.10.0](https://github.com/michaelfeil/hf-hub-ctranslate2)
	- `compute_type=int8_float16` for `device="cuda"`
	- `compute_type=int8` for `device="cpu"`

	Converted on 2023-06-16 using
	```
	ct2-transformers-converter --model OpenAssistant/falcon-7b-sft-top1-696 --output_dir ~/tmp-ct2fast-falcon-7b-sft-top1-696 --force --copy_files tokenizer.json README.md tokenizer_config.json generation_config.json special_tokens_map.json .gitattributes --quantization int8_float16 --trust_remote_code
	```

	# Licence and other remarks:
	This is just a quantized version. Licence conditions are intended to be idential to original huggingface repo.

	# Original description


	# Open-Assistant Falcon 7B SFT OASST-TOP1 Model

	This model is a fine-tuning of TII's [Falcon 7B](https://huggingface.co/tiiuae/falcon-7b) LLM.
	It was trained with 11,123 top-1 (high-quality) demonstrations of the OASST data set (exported on June 2, 2023) with a batch size of 128 for 8 epochs with LIMA style dropout (p=0.2) and a context-length of 2048 tokens.

	## Model Details

	- Finetuned from: [tiiuae/falcon-7b](https://huggingface.co/tiiuae/falcon-7b)
	- Model type: Causal decoder-only transformer language model
	- Language: English, German, Spanish, French (and limited capabilities in Italian, Portuguese, Polish, Dutch, Romanian, Czech, Swedish);
	- Weights & Biases: [Training log](https://wandb.ai/open-assistant/public-sft/runs/25apbcld) (Checkpoint: 696 steps)
	- Code: [Open-Assistant/model/model_training](https://github.com/LAION-AI/Open-Assistant/tree/main/model/model_training)
	- Demo: [Continuations for 250 random prompts](https://open-assistant.github.io/oasst-model-eval/?f=https%3A%2F%2Fraw.githubusercontent.com%2FOpen-Assistant%2Foasst-model-eval%2Fmain%2Fsampling_reports%2Fchat-gpt%2F2023-04-11_gpt-3.5-turbo_lottery.json%0Ahttps%3A%2F%2Fraw.githubusercontent.com%2FOpen-Assistant%2Foasst-model-eval%2Fmain%2Fsampling_reports%2Foasst-sft%2F2023-06-05_OpenAssistant_falcon-7b-sft-top1-696_sampling_noprefix2.json)
	- License: Apache 2.0
	- Contact: [Open-Assistant Discord](https://ykilcher.com/open-assistant-discord)


	## Prompting

	Two special tokens are used to mark the beginning of user and assistant turns:
	`<\|prompter\|>` and `<\|assistant\|>`. Each turn ends with a `<\|endoftext\|>` token.

	Input prompt example:
	```
	<\|prompter\|>What is a meme, and what's the history behind this word?<\|endoftext\|><\|assistant\|>
	```
	The input ends with the `<\|assistant\|>` token to signal that the model should
	start generating the assistant reply.


	## Sample Code

	```python
	from transformers import AutoTokenizer
	import transformers
	import torch

	model = "OpenAssistant/falcon-7b-sft-top1-696"

	tokenizer = AutoTokenizer.from_pretrained(model)
	pipeline = transformers.pipeline(
	"text-generation",
	model=model,
	tokenizer=tokenizer,
	torch_dtype=torch.bfloat16,
	trust_remote_code=True,
	device_map="auto",
	)

	input_text="<\|prompter\|>What is a meme, and what's the history behind this word?<\|endoftext\|><\|assistant\|>"

	sequences = pipeline(
	input_text,
	max_length=500,
	do_sample=True,
	return_full_text=False,
	top_k=10,
	num_return_sequences=1,
	eos_token_id=tokenizer.eos_token_id,
	)
	for seq in sequences:
	print(f"Result: {seq['generated_text']}")
	```


	## Configuration Details

	Model:
	```
	falcon-7b:
	dtype: bf16
	log_dir: "falcon_log_7b"
	learning_rate: 1e-5
	model_name: "tiiuae/falcon-7b"
	deepspeed_config: configs/zero_config.json
	output_dir: falcon
	weight_decay: 0.0
	max_length: 2048
	save_strategy: steps
	eval_steps: 80
	save_steps: 80
	warmup_steps: 20
	gradient_checkpointing: true
	gradient_accumulation_steps: 4
	per_device_train_batch_size: 4
	per_device_eval_batch_size: 8
	num_train_epochs: 8
	save_total_limit: 4
	residual_dropout: 0.2
	residual_dropout_lima: true
	```

	Dataset:
	```
	oasst-top1:
	# oasst_export: 11123 (100.00%)
	datasets:
	- oasst_export:
	lang: "bg,ca,cs,da,de,en,es,fr,hr,hu,it,nl,pl,pt,ro,ru,sl,sr,sv,uk" # sft-8.0
	input_file_path: 2023-06-02_oasst_all_labels.jsonl.gz
	val_split: 0.05
	top_k: 1
	```

	Train command:
	```
	deepspeed trainer_sft.py --configs defaults falcon-7b oasst-top1 --cache_dir <data_cache_dir> --output_dir <output_path> --deepspeed
	```

	Export command:
	```
	python export_model.py --dtype bf16 --hf_repo_name OpenAssistant/falcon-7b-sft-top1 --trust_remote_code --auth_token <auth_token> <output_path> --max_shard_size 2GB
	```