---
language: ko
license: cc-by-nc-sa-4.0
tags:
- gpt2
---

# Model Card for kogpt2-base-v2

# Model Details

## Model Description

[GPT-2](https://openai.com/blog/better-language-models/) is a language model trained to predict the next word of a given text, and it is optimized for sentence generation. `KoGPT2` is a Korean decoder language model trained on more than 40GB of text to overcome insufficient Korean-language performance.

- **Developed by:** SK Telecom
- **Shared by [Optional]:** SK Telecom
- **Model type:** Text Generation
- **Language(s) (NLP):** Korean
- **License:** cc-by-nc-sa-4.0
- **Parent Model:** GPT-2
- **Resources for more information:**
  - [GitHub Repo](https://github.com/SKT-AI/KoGPT2/tree/master)
  - [Model Demo Space](https://huggingface.co/spaces/gogamza/kogpt2-base-v2)

# Uses

## Direct Use

This model can be used for text generation.

## Downstream Use [Optional]

More information needed.

## Out-of-Scope Use

The model should not be used to intentionally create hostile or alienating environments for people.

# Bias, Risks, and Limitations

Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.

## Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

# Training Details

## Training Data

The model authors also note in the [GitHub Repo](https://github.com/SKT-AI/KoGPT2/tree/master):

The tokenizer was trained with the `Character BPE tokenizer` from the [`tokenizers`](https://github.com/huggingface/tokenizers) package. The vocabulary size is 51,200, and emoticons and emoji that are frequently used in conversation, such as those below, were added to improve recognition of these tokens.

> 😀, 😁, 😆, 😅, 🤣, .. , `:-)`, `:)`, `-)`, `(-:`...


In addition to [Korean Wikipedia](https://ko.wikipedia.org/), a variety of data such as news, [Modu Corpus v1.0](https://corpus.korean.go.kr/), and [Blue House National Petitions](https://github.com/akngs/petitions) was used to train the model.

## Training Procedure

### Preprocessing

More information needed

### Speeds, Sizes, Times

| Model | # of params | Type | # of layers | # of heads | ffn_dim | hidden_dims |
|--------------|:----:|:-------:|--------:|--------:|--------:|--------------:|
| `kogpt2-base-v2` | 125M | Decoder | 12 | 12 | 3072 | 768 |

# Evaluation

## Testing Data, Factors & Metrics

### Testing Data

More information needed

### Factors

More information needed

### Metrics

More information needed

## Results

### Classification or Regression

| | [NSMC](https://github.com/e9t/nsmc) (acc) | [KorSTS](https://github.com/kakaobrain/KorNLUDatasets) (spearman) |
|---|---|---|
| **KoGPT2 2.0** | 89.1 | 77.8 |

# Model Examination

More information needed

# Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** More information needed
- **Hours used:** More information needed
- **Cloud Provider:** More information needed
- **Compute Region:** More information needed
- **Carbon Emitted:** More information needed

# Technical Specifications [optional]

## Model Architecture and Objective

More information needed

## Compute Infrastructure

More information needed

### Hardware

More information needed

### Software

More information needed

# Citation

**BibTeX:**

More information needed

# Glossary [optional]

More information needed

# More Information [optional]

More information needed

# Model Card Authors [optional]

SK Telecom in collaboration with Ezi Ozoani and the Hugging Face team

# Model Card Contact

The model authors also note in the [GitHub Repo](https://github.com/SKT-AI/KoGPT2/tree/master):

> Please post `KoGPT2`-related issues [here](https://github.com/SKT-AI/KoGPT2/issues).

# How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Download (or load from the local cache) the tokenizer and model weights.
tokenizer = AutoTokenizer.from_pretrained("skt/kogpt2-base-v2")
model = AutoModelForCausalLM.from_pretrained("skt/kogpt2-base-v2")
```
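
Once the tokenizer and model are loaded, a continuation can be generated from a prompt. The following is a minimal sketch rather than the authors' recommended recipe: the helper name `generate_text`, the default of 32 new tokens, and the choice of greedy decoding are assumptions made for illustration. It relies only on the standard `encode`, `generate`, and `decode` methods of `transformers` tokenizers and causal language models.

```python
from typing import Any


def generate_text(model: Any, tokenizer: Any, prompt: str, max_new_tokens: int = 32) -> str:
    """Return the prompt plus a greedily decoded continuation.

    `generate_text` is a hypothetical helper, not part of KoGPT2 or transformers.
    """
    # Encode the prompt as a batch of one sequence of token ids (PyTorch tensor).
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    # Greedy decoding (do_sample=False) keeps the output deterministic.
    output_ids = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        do_sample=False,
    )
    # generate() returns a batch; decode its first (and only) sequence.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call, assuming `model` and `tokenizer` were loaded as shown above:
# print(generate_text(model, tokenizer, "안녕하세요"))
```

Greedy decoding is used here only because it is reproducible; sampling options such as `do_sample=True` with `top_k` or `top_p` can be passed to `generate` instead when more varied text is wanted.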