RonanMcGovern committed on
Commit 043bfe2 · 1 Parent(s): 9e32c4e

add inference guide

Files changed (1)
  1. README.md +51 -2
README.md CHANGED
@@ -6,15 +6,64 @@ tags:
  - Composer
  - MosaicML
  - llm-foundry
+ - hosted inference
+ - 8 bit
+ - 8bit
+ - 8-bit
  inference: true
  ---
 
 
- # Llama 2 - hosted inference
+ # MPT 7B Instruct - hosted inference
 
- This is simply an 8-bit version of the Llama-2-7B model.
+ This is simply an 8-bit version of the mpt-7b-instruct model.
  - 8-bits allows the model to be below 10 GB
  - This allows for hosted inference of the model on the model's home page
+ - Note that inference may be slow unless you have a HuggingFace Pro plan.
+
+ If you want to run inference yourself (e.g. in a Colab notebook) you can try:
+ ```
+ !pip install -q -U git+https://github.com/huggingface/accelerate.git
+ !pip install -q -U bitsandbytes
+ !pip install -q -U git+https://github.com/huggingface/transformers.git
+
+ model_id = 'Trelis/mpt-7b-instruct-hosted-inference-8bit'
+
+ import transformers
+ from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline, TextStreamer
+
+ config = transformers.AutoConfig.from_pretrained(model_id, trust_remote_code=True)
+ config.init_device = 'cuda:0' # Unclear whether this really helps a lot or interacts with device_map.
+ config.max_seq_len = 512
+ tokenizer = AutoTokenizer.from_pretrained(model_id)  # stream() needs a tokenizer; assumes the repo ships tokenizer files
+ model = AutoModelForCausalLM.from_pretrained(model_id, load_in_8bit=True, config=config, trust_remote_code=True)
+
+ # MPT Inference
+ def stream(user_instruction):
+     INSTRUCTION_KEY = "### Instruction:"
+     RESPONSE_KEY = "### Response:"
+     INTRO_BLURB = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
+     PROMPT_FOR_GENERATION_FORMAT = """{intro}
+ {instruction_key}
+ {instruction}
+ {response_key}
+ """.format(
+         intro=INTRO_BLURB,
+         instruction_key=INSTRUCTION_KEY,
+         instruction="{instruction}",
+         response_key=RESPONSE_KEY,
+     )
+
+     prompt = PROMPT_FOR_GENERATION_FORMAT.format(instruction=user_instruction)
+
+     inputs = tokenizer([prompt], return_tensors="pt").to("cuda:0")
+     streamer = TextStreamer(tokenizer)
+
+     # Despite returning the usual output, the streamer will also print the generated text to stdout.
+     _ = model.generate(**inputs, streamer=streamer, max_new_tokens=500, eos_token_id=0, temperature=1)
+
+ stream('Count to ten')
+ ```
 
  ~
 
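
The prompt layout used by `stream` in the guide can be inspected without a GPU or a model download. The following is a minimal sketch, not part of the commit: `build_prompt` is a hypothetical helper name, and the key strings and intro text are copied from the README snippet. It should produce the same string that `stream` passes to the tokenizer.

```python
# Hypothetical helper (not in the README): rebuilds the prompt that stream()
# sends to the model, so the layout can be checked on its own.
INSTRUCTION_KEY = "### Instruction:"
RESPONSE_KEY = "### Response:"
INTRO_BLURB = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)


def build_prompt(user_instruction: str) -> str:
    """Return the instruction-style prompt for a single user instruction."""
    # One item per line, with a trailing newline after the response key,
    # matching the triple-quoted template in the README snippet.
    return (
        f"{INTRO_BLURB}\n"
        f"{INSTRUCTION_KEY}\n"
        f"{user_instruction}\n"
        f"{RESPONSE_KEY}\n"
    )
```

Calling `print(build_prompt("Count to ten"))` shows the four-line prompt, which makes it easy to spot a missing key or stray whitespace before spending time on model loading.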