jangrzybek committed on
Commit
86f76d5
·
verified ·
1 Parent(s): 51547a8

Update README.md

Files changed (1)
  1. README.md +60 -3
README.md CHANGED
@@ -1,3 +1,60 @@
1
- ---
2
- license: llama3.1
3
- ---
1
+ ---
2
+ license: llama3.1
3
+ ---
4
+ ![llama.cpp](https://user-images.githubusercontent.com/1991296/230134379-7181e485-c521-4d23-a0d6-f7b3b61ba524.png "llama.cpp")
5
+ # Ampere® optimized llama.cpp
6
+ ![llama.cpp pull count](https://img.shields.io/docker/pulls/amperecomputingai/llama.cpp?logo=meta&logoColor=black&label=llama.cpp&labelColor=violet&color=purple)
7
+
8
+ Ampere® optimized build of [llama.cpp](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#llamacpp) with full support for the rich collection of GGUF models available on Hugging Face: [GGUF models](https://huggingface.co/models?search=gguf)
9
+
10
+ **For best results we recommend using models in our custom quantization formats available here: [AmpereComputing HF](https://huggingface.co/AmpereComputing)**
11
+
12
+ This Docker image can be run on bare metal Ampere® CPUs and Ampere® based VMs available in the cloud.
13
+
14
+ Release notes and binary executables are available on our [GitHub](https://github.com/AmpereComputingAI/llama.cpp/releases).
15
+
16
+ ## Starting container
17
+ The default entrypoint runs the llama.cpp server binary, mimicking the behavior of the original llama.cpp server image: [docker image](https://github.com/ggerganov/llama.cpp/blob/master/.devops/llama-server.Dockerfile)
18
+
19
+ To launch a shell instead, run:
20
+
21
+ ```bash
22
+ sudo docker run --privileged=true --name llama --entrypoint /bin/bash -it amperecomputingai/llama.cpp:latest
23
+ ```
24
+ A quick start example is displayed when the container launches:
25
+
26
+ ![quick start](https://ampereaimodelzoo.s3.eu-central-1.amazonaws.com/pictures/Screenshot+2024-04-30+at+22.37.13.png "quick start")
27
+
28
+ Make sure to visit us at [Ampere Solutions Portal](https://solutions.amperecomputing.com/solutions/ampere-ai)!
29
+
30
+ ## Quantization
31
+ The Ampere® optimized build of llama.cpp supports two new quantization methods, Q4_K_4 and Q8R16. They offer model size and perplexity similar to Q4_K and Q8_0, respectively, with up to 1.5-2x faster inference.
32
+
33
+ First, you'll need to convert the model to the GGUF format using [this script](https://github.com/ggerganov/llama.cpp/blob/master/convert_hf_to_gguf.py):
34
+
35
+ ```bash
36
+ python3 convert_hf_to_gguf.py [path to the original model] --outtype [f32, f16, bf16 or q8_0] --outfile [output path]
37
+ ```
38
+
39
+ For example:
40
+
41
+ ```bash
42
+ python3 convert_hf_to_gguf.py path/to/llama2 --outtype f16 --outfile llama-2-7b-f16.gguf
43
+ ```
44
+
45
+ Next, you can quantize the model using the following command:
46
+ ```bash
47
+ ./llama-quantize [input file] [output file] [quantization method]
48
+ ```
49
+
50
+ For example:
51
+ ```bash
52
+ ./llama-quantize llama-2-7b-f16.gguf llama-2-7b-Q8R16.gguf Q8R16
53
+ ```
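The conversion and quantization steps above can be chained in a small wrapper. This is a minimal sketch under stated assumptions, not something shipped with the Ampere® image: the helpers `quantized_name`, `run`, and `convert_and_quantize`, and the variables `CONVERT_SCRIPT` and `DRY_RUN`, are hypothetical names introduced here for illustration. It assumes the conversion script and `llama-quantize` sit in the current directory and that the intermediate file is written as f16. `CONVERT_SCRIPT` defaults to `convert_hf_to_gguf.py`, the file name of the linked script; override it if your checkout uses a different name.

```bash
# Sketch only: chains the conversion and quantization steps described above.
set -eu

# Assumption: file name of the conversion script; override if yours differs.
: "${CONVERT_SCRIPT:=convert_hf_to_gguf.py}"

# Derive the quantized file name from the f16 file name and the method,
# e.g. llama-2-7b-f16.gguf + Q8R16 becomes llama-2-7b-Q8R16.gguf.
quantized_name() {
  base="${1%.gguf}"     # drop the .gguf extension
  base="${base%-f16}"   # drop a trailing -f16 marker, if present
  printf '%s-%s.gguf\n' "$base" "$2"
}

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Usage: convert_and_quantize MODEL_DIR F16_GGUF_FILE METHOD
convert_and_quantize() {
  model_dir="$1"
  f16_file="$2"
  method="$3"
  out_file="$(quantized_name "$f16_file" "$method")"
  # Step 1: Hugging Face checkpoint -> f16 GGUF
  run python3 "$CONVERT_SCRIPT" "$model_dir" --outtype f16 --outfile "$f16_file"
  # Step 2: f16 GGUF -> quantized GGUF (for example Q8R16 or Q4_K_4)
  run ./llama-quantize "$f16_file" "$out_file" "$method"
}
```

After sourcing these functions, setting `DRY_RUN=1` and calling `convert_and_quantize path/to/llama2 llama-2-7b-f16.gguf Q8R16` only prints the two commands, which is a cheap way to check the file names before starting a long conversion.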
54
+
55
+ ## Support
56
+
57
+ Please contact us at <[email protected]>
58
+
59
+ ## LEGAL NOTICE
60
+ By accessing, downloading or using this software and any required dependent software (the “Ampere AI Software”), you agree to the terms and conditions of the software license agreements for the Ampere AI Software, which may also include notices, disclaimers, or license terms for third party software included with the Ampere AI Software. Please refer to the [Ampere AI Software EULA v1.6](https://ampereaidevelop.s3.eu-central-1.amazonaws.com/Ampere+AI+Software+EULA+-+v1.6.pdf) or other similarly-named text file for additional details.