Exllama (GPTQ)

Exllama is a standalone Python/C++/CUDA implementation of Llama for use with 4-bit GPTQ weights, designed to be fast and memory-efficient on modern GPUs.

This backend:

  • provides support for GPTQ models
  • requires CUDA runtime
Note

This is an experimental backend and may change in the future.

Example

Warning

Make sure to change the syntax directive to #syntax=ghcr.io/sozercan/aikit:latest in the examples below.

A complete example is available at https://github.com/sozercan/aikit/blob/main/test/aikitfile-exllama.yaml.
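
For reference, a minimal sketch of what an Exllama aikitfile can look like is shown below. The model name, source URL, and parameter values are illustrative assumptions, not taken from the linked file; consult test/aikitfile-exllama.yaml in the repository for the exact, up-to-date configuration.

```yaml
#syntax=ghcr.io/sozercan/aikit:latest
# Illustrative sketch of an Exllama aikitfile.
# The model name, source, and parameters below are assumptions;
# see test/aikitfile-exllama.yaml in the repo for real values.
apiVersion: v1alpha1
debug: true
runtime: cuda          # Exllama requires the CUDA runtime
backends:
  - exllama
models:
  # Hypothetical 4-bit GPTQ model; substitute your own checkpoint
  - name: llama-2-7b-chat-gptq
    source: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GPTQ
config: |
  - name: llama-2-7b-chat-gptq
    backend: exllama
    parameters:
      model: llama-2-7b-chat-gptq
```

With an aikitfile like this in place, the image can typically be built with Docker BuildKit, for example: docker buildx build . -t my-model -f aikitfile.yaml --load.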