Exllama (GPTQ)

Exllama is a standalone Python/C++/CUDA implementation of Llama for use with 4-bit GPTQ weights, designed to be fast and memory-efficient on modern GPUs.

This backend:

- provides support for GPTQ models
- requires the CUDA runtime

Note: This is an experimental backend and it may change in the future.

Example

Warning: Make sure to change the syntax line to #syntax=ghcr.io/sozercan/aikit:latest in the examples below.

https://github.com/sozercan/aikit/blob/main/test/aikitfile-exllama.yaml
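
For orientation, here is a minimal sketch of what an aikitfile using this backend might look like. It is not a copy of the linked file: the field layout reflects a general understanding of the aikitfile format and may differ from the current schema, and the model name and download URL are hypothetical placeholders. Treat the linked file as the authoritative example.

```yaml
#syntax=ghcr.io/sozercan/aikit:latest
apiVersion: v1alpha1
runtime: cuda          # Exllama requires the CUDA runtime
backends:
  - exllama            # enable the experimental Exllama backend
models:
  # Hypothetical placeholder: substitute a real GPTQ model name and download URL.
  - name: my-gptq-model/model.safetensors
    source: https://example.com/my-gptq-model/model.safetensors
config: |
  # Assumed model config; names below are illustrative only.
  - name: my-gptq-model
    backend: exllama
    parameters:
      model: "my-gptq-model"
```

The two settings that matter for this page are `runtime: cuda` and the `exllama` entry under `backends`; everything else depends on the GPTQ model being packaged.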
