---
language:
- en
library_name: transformers
tags:
- gpt
- llm
- large language model
- thor service
inference: false
---

# Model Card

## Summary

- Base model: [facebook/opt-2.7b](https://huggingface.co/facebook/opt-2.7b)

## Usage

To use the model with the `transformers` library on a machine with GPUs, first make sure you have the `transformers`, `einops`, `accelerate` and `torch` libraries installed.

```bash
pip install transformers==4.30.2
pip install einops==0.6.1
pip install accelerate==0.20.3
pip install torch==2.0.0
```

```python
import torch
from transformers import pipeline

generate_text = pipeline(
    model="shashank-mugiwara/thor",
    torch_dtype="auto",
    trust_remote_code=True,
    use_fast=True,
    device_map={"": "cuda:0"},
)

# do_sample=False selects greedy decoding, so temperature has no effect here
res = generate_text(
    "What is thor service?",
    min_new_tokens=2,
    max_new_tokens=256,
    do_sample=False,
    num_beams=1,
    temperature=float(0.3),
    repetition_penalty=float(1.2),
    renormalize_logits=True
)
print(res[0]["generated_text"])
```

You can print a sample prompt after the preprocessing step to see how it is fed to the tokenizer:

```python
print(generate_text.preprocess("What is thor service?")["prompt_text"])
```

Alternatively, if `h2oai_pipeline.py` is available next to your script (the import below assumes it is), you can construct the pipeline yourself from the loaded model and tokenizer:

```python
import torch
from h2oai_pipeline import H2OTextGenerationPipeline
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "shashank-mugiwara/thor",
    use_fast=True,
    padding_side="left",
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "shashank-mugiwara/thor",
    torch_dtype="auto",
    device_map={"": "cuda:0"},
    trust_remote_code=True,
)
generate_text = H2OTextGenerationPipeline(model=model, tokenizer=tokenizer)

res = generate_text(
    "Why is drinking water so healthy?",
    min_new_tokens=2,
    max_new_tokens=256,
    do_sample=False,
    num_beams=1,
    temperature=float(0.3),
    repetition_penalty=float(1.2),
    renormalize_logits=True
)
print(res[0]["generated_text"])
```

You may also skip the pipeline, call the loaded model and tokenizer directly, and handle the preprocessing steps yourself:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "shashank-mugiwara/thor"  # either local folder or huggingface model name
# The prompt must already carry the <|prompt|> and <|answer|> markers,
# because no pipeline preprocessing is applied here.
prompt = "<|prompt|>What is thor service?<|answer|>"

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    use_fast=True,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map={"": "cuda:0"},
    trust_remote_code=True,
)
# device_map above already places the model on cuda:0
model.eval()
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")

# generate configuration can be modified to your needs
tokens = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    min_new_tokens=2,
    max_new_tokens=256,
    do_sample=False,
    num_beams=1,
    temperature=float(0.3),
    repetition_penalty=float(1.2),
    renormalize_logits=True
)[0]

# keep only the newly generated tokens, dropping the prompt
tokens = tokens[inputs["input_ids"].shape[1]:]
answer = tokenizer.decode(tokens, skip_special_tokens=True)
print(answer)
```

## Quantization and sharding

You can load the model with quantization by specifying ```load_in_8bit=True``` or ```load_in_4bit=True``` when calling `from_pretrained`. Sharding across multiple GPUs is also possible by setting ```device_map="auto"```.
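The options above can be combined in one loading call. The following is a minimal sketch, not part of the original card: the helper names `build_load_kwargs` and `load_model` are hypothetical, and it assumes the `bitsandbytes` package is installed in addition to the libraries listed under Usage, since the `load_in_8bit` and `load_in_4bit` flags rely on it.

```python
from typing import Optional


def build_load_kwargs(quantization: Optional[str] = None, shard: bool = False) -> dict:
    """Assemble keyword arguments for AutoModelForCausalLM.from_pretrained.

    Hypothetical helper for illustration only.
    quantization: None, "8bit" or "4bit".
    shard: if True, let accelerate spread the weights over all visible GPUs.
    """
    if quantization not in (None, "8bit", "4bit"):
        raise ValueError(f"unsupported quantization mode: {quantization!r}")

    # Same base arguments as the loading examples in the Usage section.
    kwargs = {"torch_dtype": "auto", "trust_remote_code": True}

    if quantization == "8bit":
        kwargs["load_in_8bit"] = True
    elif quantization == "4bit":
        kwargs["load_in_4bit"] = True

    # "auto" shards across GPUs; otherwise pin everything to the first GPU.
    kwargs["device_map"] = "auto" if shard else {"": "cuda:0"}
    return kwargs


def load_model(model_name: str, quantization: Optional[str] = None, shard: bool = False):
    """Load the model with the selected options (needs a GPU and, for quantization, bitsandbytes)."""
    # Imported lazily so build_load_kwargs can be inspected without transformers installed.
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_name, **build_load_kwargs(quantization, shard)
    )
```

For example, `load_model("shashank-mugiwara/thor", quantization="8bit", shard=True)` would request 8-bit weights distributed over the available GPUs. Quantized weights trade some output quality for lower memory use, so it is worth comparing answers against the unquantized model before relying on them.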