Introduction

You've downloaded a 7B-parameter model, loaded it in FP16, and watched your GPU memory hit 14GB. Sound familiar? Quantization is how you get that same model running in 3.5–4.5GB — on a laptop GPU or even a CPU — with surprisingly little quality loss.

In this post I break down the three dominant quantization approaches in the LLM ecosystem: GPTQ, AWQ, and GGUF. For each, I explain the core idea, show working Python code, and discuss when to use which.

What Is Quantization?

Quantization means storing numbers with fewer bits. A model trained in FP32 (32 bits per weight) can be compressed to FP16 (16 bits), INT8 (8 bits), or INT4 (4 bits). The model doesn't "think" in 4-bit integers — during inference, weights are dequantized back to higher precision for computation.

Here's the intuition. Given original FP32 weights:

original = [0.40, 0.79, 1.22]

# Quantize to INT4 range (0-15) using scale + zero-point
scale = (max(original) - min(original)) / 15
zero_point = min(original)

quantized = [round((x - zero_point) / scale) for x in original]
# quantized = [0, 7, 15]

# Dequantize back
dequantized = [(q * scale) + zero_point for q in quantized]
# dequantized = [0.40, 0.78, 1.22]  (close, not exact)

That tiny rounding error is the quantization tax. The art is minimizing it where it matters most.

Three Types of Quantization

1. Weight Quantization (Mature)

Compresses stored model weights from FP32 → FP16 → INT8 → INT4. This is the most common approach. A 7B model drops from ~14GB (FP16) to ~3.5–4.5GB (INT4).
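The arithmetic behind those numbers is worth internalizing. A back-of-the-envelope sketch (ignoring per-group scale overhead and the embedding/norm layers that stay in FP16, which add a few hundred MB in practice):

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage for a model, ignoring overhead."""
    return n_params * bits_per_weight / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(f"7B model at {bits}-bit: {model_size_gb(7e9, bits):.1f} GB")
# 4-bit works out to 3.5 GB of raw weights; real INT4 checkpoints land
# closer to 3.5-4.5 GB once scales and FP16 layers are included.
```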

2. Activation Quantization (Emerging)

Targets the intermediate tensors produced during the forward pass. Quantizing activations as well as weights lets hardware use faster integer matrix multiplies, reducing compute rather than just memory, though it is less mature than weight-only quantization.

3. KV-Cache Quantization (Frontier)

Storing the key-value cache in FP8/INT8 cuts its memory footprint by 2–4×, enabling much longer context windows on consumer hardware.
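To see why the cache is worth quantizing, here's a rough size estimate, a sketch using Llama-2-7B's published shape (32 layers, 4,096 hidden dimension); real runtimes add paging and allocator overhead on top:

```python
def kv_cache_gb(n_layers: int, hidden_dim: int, seq_len: int,
                bytes_per_elem: int) -> float:
    # K and V each store one hidden_dim vector per layer per token
    return 2 * n_layers * hidden_dim * seq_len * bytes_per_elem / 1e9

fp16 = kv_cache_gb(32, 4096, 4096, 2)   # FP16: 2 bytes per element
int8 = kv_cache_gb(32, 4096, 4096, 1)   # INT8: 1 byte per element
print(f"4k context: {fp16:.1f} GB in FP16, {int8:.1f} GB in INT8")
```

At a 4k context the cache alone costs about 2 GB in FP16; halving or quartering that is what frees room for longer contexts.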

GPTQ: The Hessian Approach (2022)

GPTQ was the first practical method to quantize full LLMs to INT4 post-training. It uses approximate second-order information (the Hessian of the layer-wise reconstruction error) to decide how to round each weight, compensating for each rounding step by adjusting the not-yet-quantized weights, so that output error is minimized layer by layer.

Key characteristics:

  • Post-training — no retraining needed
  • Treats all weights equally within groups (typically 128 weights per group)
  • Sensitive to calibration data: the layer-wise reconstruction step can overfit the calibration set
  • Computationally expensive during quantization, but inference is lightweight
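GPTQ's Hessian-guided rounding is too involved for a blog snippet, but the group structure it shares with the other methods is easy to sketch. Below is plain round-to-nearest, group-wise INT4 quantization (a toy sketch of the group_size idea, without GPTQ's error-compensating updates):

```python
import numpy as np

def quantize_groupwise(w: np.ndarray, group_size: int = 128,
                       bits: int = 4) -> np.ndarray:
    """Asymmetric round-to-nearest quantization, one scale/zero-point per group."""
    levels = 2 ** bits - 1                      # 15 levels for INT4
    groups = w.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)      # per-group zero-point
    scale = (groups.max(axis=1, keepdims=True) - lo) / levels
    q = np.round((groups - lo) / scale)         # integer codes in [0, 15]
    return (q * scale + lo).reshape(w.shape)    # dequantized weights

rng = np.random.default_rng(0)
w = rng.uniform(-1, 1, size=512).astype(np.float32)
w_hat = quantize_groupwise(w)
print("max error:", np.abs(w - w_hat).max())    # bounded by scale / 2 per group
```

Smaller groups mean tighter per-group scales (less error) at the cost of storing more scale metadata, which is the trade-off behind the group_size=128 default.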

Quantizing a Model with GPTQ

# Install dependencies
pip install auto-gptq transformers torch

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_name = "meta-llama/Llama-2-7b-hf"

# Configure quantization
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False,  # set True for better quality, slower
)

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoGPTQForCausalLM.from_pretrained(
    model_name, quantize_config=quantize_config
)

# Prepare calibration data (three texts shown for brevity; use 100+ diverse samples in practice)
calibration_data = [
    tokenizer(text, return_tensors="pt")
    for text in [
        "The capital of Kenya is Nairobi, a major economic hub in East Africa.",
        "Machine learning models can be compressed using quantization techniques.",
        "The Nairobi Securities Exchange lists companies across multiple sectors.",
    ]
]

# Quantize (this takes a while)
model.quantize(calibration_data)

# Save the quantized model
model.save_quantized("llama2-7b-gptq-4bit")
tokenizer.save_pretrained("llama2-7b-gptq-4bit")

Loading a Pre-Quantized GPTQ Model

from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "TheBloke/Llama-2-7B-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="auto",
)

prompt = "Explain quantization in machine learning:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

AWQ: Activation-Aware Quantization (2023)

AWQ improves on GPTQ with a key insight: not all weights are equally important. Instead of treating every weight the same, AWQ identifies "salient" weights — those that correspond to large activation magnitudes — and protects them during quantization.

Rather than rounding weights in place, AWQ searches (via a simple grid search balancing activation and weight magnitudes) for per-channel scaling factors that scale salient weight channels up and the corresponding activations down before quantization. The product is mathematically unchanged, but the rounding error lands on the channels that matter least.
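A toy numerical example makes the effect concrete (made-up numbers and per-tensor symmetric INT4, not AWQ's actual grid search). Channel 2 has a small weight but a huge activation, so its rounding error dominates the output; scaling that weight up by s and its activation down by s leaves the exact product unchanged while shrinking the error where it counts:

```python
import numpy as np

def fake_quant(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Symmetric round-to-nearest quantization with a shared scale."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)  # map max |w| to level 7
    return np.round(w / scale) * scale

w = np.array([1.0, -0.8, 0.1, 0.9])     # channel 2: small weight...
x = np.array([0.1, 0.1, 10.0, 0.1])     # ...but a very large activation
y_true = x @ w

err_plain = abs(x @ fake_quant(w) - y_true)

s = 4.0                                  # per-channel scale (hand-picked here)
w_s, x_s = w.copy(), x.copy()
w_s[2] *= s                              # scale the salient weight up...
x_s[2] /= s                              # ...and its activation down: x @ w unchanged
err_scaled = abs(x_s @ fake_quant(w_s) - y_true)

print(f"output error: plain={err_plain:.3f}, scaled={err_scaled:.3f}")
# plain error ~0.42, scaled ~0.06
```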

Key characteristics:

  • Often achieves higher accuracy at 4-bit than GPTQ, especially on reasoning tasks
  • Faster to quantize than GPTQ
  • Less sensitive to calibration data than GPTQ, since there is no reconstruction step to overfit
  • It still matters who quantized the model: prefer well-established sources

Quantizing with AWQ

# Install dependencies
pip install autoawq transformers torch

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Configure and quantize
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",  # use GEMM for GPU inference
}

model.quantize(tokenizer, quant_config=quant_config)

# Save
model.save_quantized("llama2-7b-awq-4bit")
tokenizer.save_pretrained("llama2-7b-awq-4bit")

Loading a Pre-Quantized AWQ Model

from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "TheBloke/Llama-2-7B-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="auto",
)

prompt = "What is activation-aware quantization?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

GGUF: The Universal Runtime Format

GGUF is not a quantization algorithm — it's a container and runtime format created for llama.cpp. It packages quantized weights (at various levels from Q2 to Q8), the tokenizer, and model metadata into a single portable file.

The killer feature: GGUF runs on CPU, GPU, and Apple M-series with minimal overhead, no Python runtime needed. This makes it the go-to format for local LLM deployment.

Running a GGUF Model with llama-cpp-python

# Install with GPU support (CUDA)
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python

# Or CPU-only
pip install llama-cpp-python

from llama_cpp import Llama

# Load a GGUF model (download .gguf file from HuggingFace)
llm = Llama(
    model_path="./llama-2-7b.Q4_K_M.gguf",
    n_ctx=2048,        # context window
    n_gpu_layers=-1,   # offload all layers to GPU (-1 = all)
    verbose=False,
)

# Generate text
response = llm(
    "Explain LLM quantization in simple terms:",
    max_tokens=256,
    temperature=0.7,
    top_p=0.9,
)

print(response["choices"][0]["text"])

GGUF Quantization Levels

GGUF files come in multiple quantization levels. Here's a practical guide:

Level    Bits  7B File Size  Quality    Use Case
Q2_K     2.5   ~2.8 GB       Low        Experimentation only
Q4_K_M   4.5   ~4.1 GB       Good       Best balance for most users
Q5_K_M   5.5   ~4.8 GB       Very Good  When you have a bit more RAM
Q6_K     6.5   ~5.5 GB       Excellent  Near-original quality
Q8_0     8     ~7.2 GB       Best       Minimal quality loss

For most use cases, Q4_K_M hits the sweet spot between size and quality.

GPTQ vs AWQ vs GGUF: When to Use Which

Factor                   GPTQ            AWQ                             GGUF
Best for                 GPU inference   GPU inference (higher quality)  CPU/hybrid inference
Quality at 4-bit         Good            Better (especially reasoning)   Depends on quant level
Speed                    Fast on GPU     Fast on GPU                     Great on CPU + GPU split
Calibration sensitivity  Higher          Lower                           N/A (pre-quantized)
Ecosystem                HuggingFace     HuggingFace                     llama.cpp / Ollama
Apple Silicon            Limited         Limited                         Excellent

Practical Tips

  • Calibration data matters: Use 100+ diverse samples. Even on a 4GB GPU, roughly 96 sequences of 1,024 tokens each is a reasonable target.
  • Source matters: Carefully quantized models (official releases or well-established community quantizers) perform noticeably better than random uploads.
  • Always benchmark on your actual task: Run the quantized model on real-world inputs before deploying. A model that benchmarks well on perplexity might still fail on your specific use case.
  • Embedding and norm layers stay unquantized: These are small but critical for quality — every major method leaves them in FP16.
  • For local development: Start with GGUF Q4_K_M. If you need HuggingFace ecosystem compatibility, go AWQ.
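For the benchmarking tip, a tiny helper is often all you need. Any runtime can give you per-token log-probabilities, and perplexity is just the exponentiated average negative log-likelihood:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# If the model assigned every token probability 0.5, perplexity is 2:
print(perplexity([math.log(0.5)] * 4))  # ≈ 2.0
```

Run this over text from your own domain for both the FP16 and quantized checkpoints; a gap of more than a few percent usually shows up as degraded generation quality.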

Key Takeaways

  • Quantization makes LLMs accessible on consumer hardware with 4–5× memory reduction
  • GPTQ is the battle-tested choice for GPU inference with broad ecosystem support
  • AWQ delivers better quality at 4-bit by protecting activation-important weights
  • GGUF is the universal format for running models locally across any hardware
  • KV-cache quantization is the next frontier for enabling longer context on smaller GPUs

Further Reading