ctranslate2-4you/Mistral-Nemo-Instruct-2407-ct2-int8

ctranslate2-4you committed on Oct 22, 2024

Commit bb1d05d · verified · 1 Parent(s): f95ed93

Create README.md

Files changed (1)

README.md +82 -0

README.md ADDED

	@@ -0,0 +1,82 @@

+---
+base_model:
+- mistralai/Mistral-Nemo-Instruct-2407
+---
+CTranslate2 conversion of the model located at [mistralai/Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407).
+A conversion script with a graphical user interface can be downloaded [here](https://github.com/BBC-Esq/Ctranslate2-Converter).
+## Tested with CTranslate2 4.4.0 and Torch 2.2.2
+- NOTE: CTranslate2 4.5.0, which is due for release soon, will require a Torch version newer than 2.2.2.
+## Example Usage:
+```python
+import gc
+import os
+
+import ctranslate2
+import torch
+from transformers import AutoTokenizer
+system_message = "You are a helpful person who answers questions."
+user_message = "Hello, how are you today? I'd like you to write me a funny poem that is a parody of Milton's Paradise Lost if you are familiar with that famous epic poem?"
+model_dir = r"D:\Scripts\bench_chat\models\mistralai--Mistral-Nemo-Instruct-2407-ct2-int8"
+def build_prompt_mistral_nemo():
+    prompt = f"""<s>
+[INST]{system_message}
+{user_message}[/INST]"""
+    return prompt
+def main():
+    model_name = os.path.basename(model_dir)
+    print(f"\033[32mLoading the model: {model_name}...\033[0m")
+    # os.cpu_count() can return None, so fall back to 4 before subtracting.
+    intra_threads = max((os.cpu_count() or 4) - 4, 4)
+    generator = ctranslate2.Generator(
+        model_dir,
+        device="cuda",
+        compute_type="int8",
+        intra_threads=intra_threads
+    )
+    tokenizer = AutoTokenizer.from_pretrained(model_dir, add_prefix_space=None)
+    prompt = build_prompt_mistral_nemo()
+    tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
+    results_batch = generator.generate_batch(
+        [tokens],
+        include_prompt_in_result=False,
+        max_batch_size=4096,
+        batch_type="tokens",
+        beam_size=1,
+        num_hypotheses=1,
+        max_length=512,
+        sampling_temperature=0.0,
+    )
+    output = tokenizer.decode(results_batch[0].sequences_ids[0])
+    print("\nGenerated response:")
+    print(output)
+    # Drop the references first, then collect garbage before clearing the CUDA cache.
+    del generator
+    del tokenizer
+    gc.collect()
+    torch.cuda.empty_cache()
+if __name__ == "__main__":
+    main()
+```
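
The example in the README reads `system_message` and `user_message` from module-level globals and computes the thread count inline. As a minimal sketch, the same two steps could be factored into small helpers that run without a GPU or the model files. The parameterised signature of `build_prompt_mistral_nemo` and the helper name `pick_intra_threads` are assumptions made for illustration and are not part of the README. The prompt layout is copied from the README example as written and has not been checked against the official Mistral chat template.

```python
import os


def build_prompt_mistral_nemo(system_message: str, user_message: str) -> str:
    """Return a prompt in the same layout the README example uses."""
    # Layout copied from the README: "<s>", a newline, then the system and
    # user text inside a single [INST] ... [/INST] pair.
    return f"<s>\n[INST]{system_message}\n{user_message}[/INST]"


def pick_intra_threads(reserve: int = 4, minimum: int = 4) -> int:
    """Leave `reserve` cores free, but never return fewer than `minimum`."""
    # os.cpu_count() may return None, so fall back to `minimum` in that case.
    cpu_count = os.cpu_count() or minimum
    return max(cpu_count - reserve, minimum)


if __name__ == "__main__":
    prompt = build_prompt_mistral_nemo(
        "You are a helpful person who answers questions.",
        "Hello, how are you today?",
    )
    print(repr(prompt))
    print(pick_intra_threads())
```

The returned string can be passed to `tokenizer.encode` exactly as `prompt` is in the README example, and the thread count can be passed as `intra_threads` to `ctranslate2.Generator`.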