license: apache-2.0
datasets:
  - cerebras/SlimPajama-627B
  - bigcode/starcoderdata
  - OpenAssistant/oasst_top1_2023-08-25
language:
  - en
quantized_by:
  - kirp
pipeline_tag: text-generation

🔥 Good news

You can download the model from PY007 without any change to llama.cpp.

Here is a demo.

Pay attention

To use this model, you need to change the RoPE call in llama.cpp/llama.cpp from mode 0 to mode 2.

Change lines 2568 and 2572 from

struct ggml_tensor * Kcur = ggml_rope_custom_inplace(ctx0, ggml_reshape_3d(ctx0, tmpk, n_embd_head, n_head_kv, N), n_past, n_embd_head, 0, 0, freq_base, freq_scale);
struct ggml_tensor * Qcur = ggml_rope_custom_inplace(ctx0, ggml_reshape_3d(ctx0, tmpq, n_embd_head, n_head, N),    n_past, n_embd_head, 0, 0, freq_base, freq_scale);

to

struct ggml_tensor * Kcur = ggml_rope_custom_inplace(ctx0, ggml_reshape_3d(ctx0, tmpk, n_embd_head, n_head_kv, N), n_past, n_embd_head, 2, 0, freq_base, freq_scale);
struct ggml_tensor * Qcur = ggml_rope_custom_inplace(ctx0, ggml_reshape_3d(ctx0, tmpq, n_embd_head, n_head, N),    n_past, n_embd_head, 2, 0, freq_base, freq_scale);
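For intuition about what the mode argument changes, here is a minimal pure-Python sketch. It assumes the usual ggml convention, in which mode 0 rotates adjacent element pairs and mode 2 (GPT-NeoX style) pairs element i with element i + n_dims/2. The function name `rope` is hypothetical, `freq_scale` is taken to be 1, and this illustrates the pairing only; it is not the llama.cpp implementation.

```python
import math


def rope(x, pos, mode, freq_base=10000.0):
    """Rotate one head vector x (even length) for token position pos.

    mode 0: rotate adjacent pairs (x[2i], x[2i+1]).
    mode 2: rotate split-half pairs (x[i], x[i + d/2]), assumed NeoX style.
    """
    d = len(x)
    half = d // 2
    out = list(x)
    for i in range(half):
        # Same rotation angle in both modes; only the pairing differs.
        theta = pos * freq_base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        if mode == 0:
            a, b = 2 * i, 2 * i + 1
        elif mode == 2:
            a, b = i, i + half
        else:
            raise ValueError("this sketch only covers modes 0 and 2")
        out[a] = x[a] * c - x[b] * s
        out[b] = x[a] * s + x[b] * c
    return out


if __name__ == "__main__":
    x = [1.0, 0.0, 0.0, 0.0]
    print("mode 0:", rope(x, pos=1, mode=0))
    print("mode 2:", rope(x, pos=1, mode=2))
```

With the same input, the two modes place the rotated components at different indices, which is why weights trained with one layout produce poor output when run with the other.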

TinyLlama-1.1B Chat v0.2 GGUF

Description

This repo contains GGUF format model files for PY007's TinyLlama 1.1B Chat v0.2.

About GGUF

GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is a replacement for GGML, which is no longer supported by llama.cpp.

The key benefit of GGUF is that it is an extensible, future-proof format which stores more information about the model as metadata. It also includes significantly improved tokenization code, including, for the first time, full support for special tokens. This should improve performance, especially with models that use new special tokens and implement custom prompt templates.

Here is a list of clients and libraries that are known to support GGUF:

  • llama.cpp.
  • text-generation-webui, the most widely used web UI, with many features and powerful extensions.
  • KoboldCpp, a fully featured web UI, with full GPU acceleration across multiple platforms and GPU architectures. Especially good for storytelling.
  • LM Studio, an easy-to-use and powerful local GUI with GPU acceleration on both Windows (NVIDIA and AMD) and macOS.
  • LoLLMS Web UI, a great web UI with many interesting and unique features, including a full model library for easy model selection.
  • ctransformers, a Python library with GPU acceleration, LangChain support, and OpenAI-compatible API server.
  • llama-cpp-python, a Python library with GPU accel, LangChain support, and OpenAI-compatible API server.
  • candle, a Rust ML framework with a focus on performance, including GPU support, and ease of use.

Prompt template: TinyLlama chat

<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n

Example:

<|im_start|>user
Explain huggingface.<|im_end|>
<|im_start|>assistant
Hugging Face is a platform for building and hosting open-source applications. It provides a simple interface for developers to build, deploy, and host any application on the web. Hugging Face offers a wide range of services, including:

1. API Gateway: This service allows developers to create REST APIs that can be accessed by other Hugging Face services.

2. Functions: This service provides functions that can be used for processing data and making predictions.

3. Transformers: These are a set of algorithms that allow developers to process large amounts of text data and generate new content.

4. Datasets: Hugging Face provides datasets that can be used to train models, evaluate them, and make predictions.

5. CLI: This service provides a command-line interface for developers to build, deploy, and manage their applications.

6. Documentation: This service provides documentation for the different services and features available on Hugging Face's platform.

7. Community: The Hugging Face community is made up of developers, data scientists, and other experts who can provide support and resources for using and building on Hugging Face's platforms.<|im_end|>
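
The template above can be filled in programmatically. The following is a minimal sketch; the helper names `build_prompt` and `extract_reply` are hypothetical and not part of any library.

```python
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def build_prompt(user_message: str) -> str:
    """Wrap a single user message in the TinyLlama chat template."""
    return f"{IM_START}user\n{user_message}{IM_END}\n{IM_START}assistant\n"


def extract_reply(generated: str) -> str:
    """Keep only the text before the first end-of-turn marker."""
    return generated.split(IM_END, 1)[0].strip()


if __name__ == "__main__":
    print(build_prompt("Explain huggingface."))
```

Cutting the output at `<|im_end|>` is useful when the runtime does not stop generation at that token by itself.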

Compatibility

These quantised GGUF files are compatible with llama.cpp from August 21st 2023 onwards, as of commit 6381d4e110bd0ec02843a60bbeb8b6fc37a9ace9.

They are now also compatible with many third-party UIs and libraries - please see the list at the top of the README.

Explanation of quantisation methods

The new methods available are:

  • GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw).
  • GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
  • GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
  • GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
  • GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.
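
The bpw figures can be checked with a little arithmetic. The sketch below assumes, in addition to what the list states, that each super-block also stores one fp16 scale ("type-0") or one fp16 scale plus one fp16 min ("type-1"); the function name `bits_per_weight` is hypothetical. Under that assumption the Q3_K to Q6_K figures are reproduced exactly, while Q2_K comes out at 2.625 bpw; the 2.5625 quoted above corresponds to counting a single fp16 value per super-block.

```python
FP16_BITS = 16


def bits_per_weight(quant_bits, n_blocks, block_size, scale_bits, has_mins):
    """Effective bits per weight for one k-quant super-block.

    has_mins=True models "type-1" (scale and min per block),
    has_mins=False models "type-0" (scale only).
    """
    n_weights = n_blocks * block_size
    per_block_meta = scale_bits * (2 if has_mins else 1)
    # Assumption: one fp16 value per super-block for each of scale / min.
    super_block_meta = FP16_BITS * (2 if has_mins else 1)
    total_bits = quant_bits * n_weights + per_block_meta * n_blocks + super_block_meta
    return total_bits / n_weights


if __name__ == "__main__":
    layouts = {
        "Q2_K": (2, 16, 16, 4, True),
        "Q3_K": (3, 16, 16, 6, False),
        "Q4_K": (4, 8, 32, 6, True),
        "Q5_K": (5, 8, 32, 6, True),
        "Q6_K": (6, 16, 16, 8, False),
    }
    for name, layout in layouts.items():
        print(name, bits_per_weight(*layout))
```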

Refer to the Provided Files table below to see what files use which methods, and how.

Example llama.cpp command

For compatibility with older versions of llama.cpp, or for any third-party libraries or clients that haven't yet updated for GGUF, please use GGML files instead.

./main -m ./models/ggml-model-q4_k_m.gguf \
        -n 512 --color --temp 0 -e \
        -p "<|im_start|>user\nExplain huggingface.<|im_end|>\n<|im_start|>assistant\n"
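
If you prefer to launch the same command from a script, a minimal Python sketch follows. It only reuses the flags shown above; the helper names `build_command` and `run` are hypothetical, and the binary and model paths are assumptions to adjust for your setup.

```python
import subprocess


def build_command(model_path, user_message, n_predict=512, binary="./main"):
    """Build the argument list for the llama.cpp example command above."""
    # "\\n" stays a literal backslash-n here; the -e flag asks main to
    # process such escapes, mirroring the shell command.
    prompt = (
        f"<|im_start|>user\\n{user_message}<|im_end|>\\n"
        "<|im_start|>assistant\\n"
    )
    return [
        binary, "-m", model_path,
        "-n", str(n_predict), "--color", "--temp", "0", "-e",
        "-p", prompt,
    ]


def run(model_path, user_message):
    """Run llama.cpp and raise if it exits with a non-zero status."""
    subprocess.run(build_command(model_path, user_message), check=True)


if __name__ == "__main__":
    print(build_command("./models/ggml-model-q4_k_m.gguf", "Explain huggingface."))
```

Passing the arguments as a list avoids shell quoting problems with the `<|...|>` markers in the prompt.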