---
library_name: keras-hub
license: apache-2.0
language:
- en
tags:
- text-generation-inference
- keras
pipeline_tag: text-generation
---
### Model Overview

# Model Summary

Falcon-RW-1B is a 1B-parameter causal decoder-only model built by [TII](https://www.tii.ae/) and trained on 350B tokens of [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb). The architecture of the model is adapted from the GPT-3 paper ([Brown et al., 2020](https://arxiv.org/abs/2005.14165)), but it uses ALiBi ([Press et al., 2021](https://arxiv.org/abs/2108.12409)).
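
The weights are packaged as a KerasHub preset. Below is a minimal usage sketch, assuming the preset name `falcon_refinedweb_1b_en` and the `keras_hub.models.FalconCausalLM` task API:

```python
# Minimal usage sketch. Assumes the KerasHub preset name
# "falcon_refinedweb_1b_en" and the FalconCausalLM task API.
import keras_hub

falcon_lm = keras_hub.models.FalconCausalLM.from_preset("falcon_refinedweb_1b_en")
output = falcon_lm.generate("The falcon is a bird of prey that", max_length=64)
print(output)
```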
## Use

### Direct Use

Research on large language models, specifically the influence of adequately filtered and deduplicated web data on the properties of large language models (fairness, safety, limitations, capabilities, etc.).

### Out-of-scope Use

Production use without adequate assessment of risks and mitigation; any use cases which may be considered irresponsible or harmful.
## Bias, Risks, and Limitations

Falcon-RW-1B is trained on English data only and will not generalize appropriately to other languages. Furthermore, as it is trained on a large-scale corpus representative of the web, it will carry the stereotypes and biases commonly encountered online.

## Recommendations

We recommend that users of Falcon-RW-1B consider fine-tuning it for their specific set of tasks of interest, and that guardrails and appropriate precautions be taken for any production use.
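
As an illustration only, the sketch below fine-tunes the model on placeholder strings; `my_domain_texts` and the optimizer settings are hypothetical stand-ins for a real task-specific corpus and tuned hyperparameters.

```python
# Hypothetical fine-tuning sketch. KerasHub causal LM tasks accept raw
# strings when their preprocessor is attached; replace `my_domain_texts`
# with your own task-specific data.
import keras
import keras_hub

falcon_lm = keras_hub.models.FalconCausalLM.from_preset("falcon_refinedweb_1b_en")
my_domain_texts = ["Example document one.", "Example document two."]
falcon_lm.compile(optimizer=keras.optimizers.AdamW(learning_rate=5e-5))
falcon_lm.fit(x=my_domain_texts, batch_size=2, epochs=1)
```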
## Training Details

### Training Data

Falcon-RW-1B was trained on 350B tokens of RefinedWeb, a high-quality filtered and deduplicated web dataset. The data was tokenized with the GPT-2 tokenizer.
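
As a sketch, the preset tokenizer can be inspected directly; this assumes keras_hub exposes a `FalconTokenizer` under the same preset name, following the GPT-2 BPE scheme described above.

```python
# Sketch: inspect the preset tokenizer, which follows the GPT-2 BPE scheme
# used to tokenize RefinedWeb. Assumes keras_hub exposes FalconTokenizer
# under this preset name.
import keras_hub

tokenizer = keras_hub.models.FalconTokenizer.from_preset("falcon_refinedweb_1b_en")
token_ids = tokenizer("Falcon-RW-1B was trained on RefinedWeb.")
print(token_ids)
print(tokenizer.vocabulary_size())
```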
### Training Procedure

Falcon-RW-1B was trained on 32 A100 40GB GPUs, using only data parallelism with ZeRO.

### Training Hyperparameters

Hyperparameters were adapted from the GPT-3 paper ([Brown et al., 2020](https://arxiv.org/abs/2005.14165)).

| Hyperparameter | Value    | Comment                                   |
|----------------|----------|-------------------------------------------|
| Precision      | bfloat16 |                                           |
| Optimizer      | AdamW    |                                           |
| Learning rate  | 2e-4     | 500M tokens warm-up, cosine decay to 2e-5 |
| Weight decay   | 1e-1     |                                           |
| Batch size     | 512      | 4B tokens ramp-up                         |
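
As a rough illustration only, the published schedule could be expressed with Keras 3 optimizer APIs as below; the step counts are back-of-the-envelope conversions assuming 512 sequences of 2,048 tokens per optimizer step, not values published by TII.

```python
# Sketch only: the published hyperparameters expressed with Keras 3
# optimizer/schedule APIs, trained in bfloat16. Step counts are derived
# from the token budgets above assuming 512 x 2048 tokens per step.
import keras

TOKENS_PER_STEP = 512 * 2048                    # batch size x sequence length
WARMUP_STEPS = int(500e6 // TOKENS_PER_STEP)    # 500M-token warm-up
TOTAL_STEPS = int(350e9 // TOKENS_PER_STEP)     # 350B training tokens

schedule = keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.0,
    warmup_target=2e-4,                         # peak learning rate
    warmup_steps=WARMUP_STEPS,
    decay_steps=TOTAL_STEPS - WARMUP_STEPS,
    alpha=0.1,                                  # decay to 2e-5 = 0.1 * 2e-4
)
optimizer = keras.optimizers.AdamW(learning_rate=schedule, weight_decay=1e-1)
```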
### Speeds, Sizes, Times

Training happened in early December 2022 and took about six days.

### Evaluation

See the [paper on arXiv](https://arxiv.org/abs/2306.01116) for in-depth evaluation.

## Technical Specifications

### Model Architecture and Objective

Falcon-RW-1B is a causal decoder-only model trained on a causal language modeling task (i.e., predict the next token).

The architecture is adapted from the GPT-3 paper ([Brown et al., 2020](https://arxiv.org/abs/2005.14165)), but uses ALiBi ([Press et al., 2021](https://arxiv.org/abs/2108.12409)).
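
For intuition, the sketch below shows how ALiBi replaces positional embeddings with a fixed, head-specific linear bias on attention scores; it is a standalone NumPy illustration, not code from this repository.

```python
# Illustrative sketch of ALiBi (Press et al., 2021): each attention head
# adds a fixed linear bias -slope * distance to its attention logits
# instead of using positional embeddings.
import numpy as np

def alibi_bias(num_heads: int, seq_len: int) -> np.ndarray:
    # Head-specific slopes form a geometric sequence: 2^(-8/num_heads), ...
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    # distance[i, j] = j - i: how far key j sits relative to query i.
    positions = np.arange(seq_len)
    distance = positions[None, :] - positions[:, None]
    distance = np.minimum(distance, 0)  # causal: only past positions get a bias
    # Shape (num_heads, seq_len, seq_len), added to the attention logits.
    return slopes[:, None, None] * distance[None, :, :]

bias = alibi_bias(num_heads=32, seq_len=8)  # 32 heads, as in Falcon-RW-1B
```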
| **Hyperparameter** | **Value** |
|:------------------:|:---------:|
| Layers             | 24        |
| d_model            | 2048      |
| head_dim           | 64        |
| Vocabulary         | 50304     |
| Sequence length    | 2048      |
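
A randomly initialized backbone with these dimensions could be sketched as follows; the constructor argument names assume keras_hub conventions, and `intermediate_dim = 4 * d_model` is an assumption (GPT-3-style MLP) not stated in the table.

```python
# Sketch: a randomly initialized backbone with the dimensions above.
# Argument names assume keras_hub conventions; intermediate_dim = 4 * d_model
# is an assumption, not a value from the table.
import keras_hub

backbone = keras_hub.models.FalconBackbone(
    vocabulary_size=50304,
    num_layers=24,
    num_attention_heads=32,  # d_model 2048 / head_dim 64
    hidden_dim=2048,         # d_model
    intermediate_dim=8192,   # assumed 4 * d_model
)
print(f"{backbone.count_params():,} parameters")
```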
## Citation

```
@article{refinedweb,
  title={The {R}efined{W}eb dataset for {F}alcon {LLM}: outperforming curated corpora with web data, and web data only},
  author={Guilherme Penedo and Quentin Malartic and Daniel Hesslow and Ruxandra Cojocaru and Alessandro Cappelli and Hamza Alobeidli and Baptiste Pannier and Ebtesam Almazrouei and Julien Launay},
  journal={arXiv preprint arXiv:2306.01116},
  eprint={2306.01116},
  eprinttype={arXiv},
  url={https://arxiv.org/abs/2306.01116},
  year={2023}
}
```