Introduction

You've downloaded a 7B-parameter model, loaded it in FP16, and watched your GPU memory hit 14GB. Sound familiar? Quantization is how you get that same model running in 3.5–4.5GB — on a laptop GPU or even a CPU — with surprisingly little quality loss.

In this post I break down the three dominant quantization approaches in the LLM ecosystem: GPTQ, AWQ, and GGUF. For each, I explain the core idea, show working Python code, and discuss when to use which.

What Is Quantization?

Quantization means storing numbers with fewer bits. A model trained in FP32 (32 bits per weight) can be compressed to FP16 (16 bits), INT8 (8 bits), or INT4 (4 bits). The model doesn't "think" in 4-bit integers — during inference, weights are dequantized back to higher precision for computation.

Here's the intuition. Given original FP32 weights:

original = [0.40, 0.79, 1.22]

# Quantize to INT4 range (0-15) using scale + zero-point
scale = (max(original) - min(original)) / 15
zero_point = min(original)

quantized = [round((x - zero_point) / scale) for x in original]
# quantized = [0, 7, 15]

# Dequantize back
dequantized = [(q * scale) + zero_point for q in quantized]
# dequantized = [0.40, 0.78, 1.22]  (close, not exact)

That tiny rounding error is the quantization tax. The art is minimizing it where it matters most.

Three Types of Quantization

1. Weight Quantization (Mature)

Compresses stored model weights from FP32 → FP16 → INT8 → INT4. This is the most common approach. A 7B model drops from ~14GB (FP16) to ~3.5–4.5GB (INT4).
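The arithmetic behind those numbers is worth internalizing. A back-of-the-envelope sketch (ignoring per-group scale overhead and the embedding/norm layers that stay in FP16, which add a few hundred MB in practice):

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage for a model, ignoring overhead."""
    return n_params * bits_per_weight / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(f"7B model at {bits}-bit: {model_size_gb(7e9, bits):.1f} GB")
# 4-bit works out to 3.5 GB of raw weights; real INT4 checkpoints land
# closer to 3.5-4.5 GB once scales and FP16 layers are included.
```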

2. Activation Quantization (Emerging)

Targets the intermediate tensors produced during the forward pass. Quantizing activations as well as weights lets hardware use faster integer matrix multiplies, reducing compute rather than just memory, though it is less mature than weight-only quantization.

3. KV-Cache Quantization (Frontier)

Storing the key-value cache in FP8/INT8 cuts its memory footprint by 2–4×, enabling much longer context windows on consumer hardware.
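To see why the cache is worth quantizing, here's a rough size estimate, a sketch using Llama-2-7B's published shape (32 layers, 4,096 hidden dimension); real runtimes add paging and allocator overhead on top:

```python
def kv_cache_gb(n_layers: int, hidden_dim: int, seq_len: int,
                bytes_per_elem: int) -> float:
    # K and V each store one hidden_dim vector per layer per token
    return 2 * n_layers * hidden_dim * seq_len * bytes_per_elem / 1e9

fp16 = kv_cache_gb(32, 4096, 4096, 2)   # FP16: 2 bytes per element
int8 = kv_cache_gb(32, 4096, 4096, 1)   # INT8: 1 byte per element
print(f"4k context: {fp16:.1f} GB in FP16, {int8:.1f} GB in INT8")
```

At a 4k context the cache alone costs about 2 GB in FP16; halving or quartering that is what frees room for longer contexts.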

GPTQ: The Hessian Approach (2022)

GPTQ was the first practical method to quantize full LLMs to INT4 post-training. It uses approximate second-order information (the Hessian of the layer-wise reconstruction error) to decide how to round each weight, compensating for each rounding step by adjusting the not-yet-quantized weights, so that output error is minimized layer by layer.

Key characteristics:

  • Post-training — no retraining needed
  • Treats all weights equally within groups (typically 128 weights per group)
  • Sensitive to calibration data: the layer-wise reconstruction step can overfit the calibration set
  • Computationally expensive during quantization, but inference is lightweight
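GPTQ's Hessian-guided rounding is too involved for a blog snippet, but the group structure it shares with the other methods is easy to sketch. Below is plain round-to-nearest, group-wise INT4 quantization (a toy sketch of the group_size idea, without GPTQ's error-compensating updates):

```python
import numpy as np

def quantize_groupwise(w: np.ndarray, group_size: int = 128,
                       bits: int = 4) -> np.ndarray:
    """Asymmetric round-to-nearest quantization, one scale/zero-point per group."""
    levels = 2 ** bits - 1                      # 15 levels for INT4
    groups = w.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)      # per-group zero-point
    scale = (groups.max(axis=1, keepdims=True) - lo) / levels
    q = np.round((groups - lo) / scale)         # integer codes in [0, 15]
    return (q * scale + lo).reshape(w.shape)    # dequantized weights

rng = np.random.default_rng(0)
w = rng.uniform(-1, 1, size=512).astype(np.float32)
w_hat = quantize_groupwise(w)
print("max error:", np.abs(w - w_hat).max())    # bounded by scale / 2 per group
```

Smaller groups mean tighter per-group scales (less error) at the cost of storing more scale metadata, which is the trade-off behind the group_size=128 default.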

Quantizing a Model with GPTQ

# Install dependencies
pip install auto-gptq transformers torch

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_name = "meta-llama/Llama-2-7b-hf"

# Configure quantization
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False,  # set True for better quality, slower
)

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoGPTQForCausalLM.from_pretrained(
    model_name, quantize_config=quantize_config
)

# Prepare calibration data (three texts shown for brevity; use 100+ diverse samples in practice)
calibration_data = [
    tokenizer(text, return_tensors="pt")
    for text in [
        "The capital of Kenya is Nairobi, a major economic hub in East Africa.",
        "Machine learning models can be compressed using quantization techniques.",
        "The Nairobi Securities Exchange lists companies across multiple sectors.",
    ]
]

# Quantize (this takes a while)
model.quantize(calibration_data)

# Save the quantized model
model.save_quantized("llama2-7b-gptq-4bit")
tokenizer.save_pretrained("llama2-7b-gptq-4bit")

Loading a Pre-Quantized GPTQ Model

from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "TheBloke/Llama-2-7B-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="auto",
)

prompt = "Explain quantization in machine learning:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

AWQ: Activation-Aware Quantization (2023)

AWQ improves on GPTQ with a key insight: not all weights are equally important. Instead of treating every weight the same, AWQ identifies "salient" weights — those that correspond to large activation magnitudes — and protects them during quantization.

Rather than rounding weights in place, AWQ searches (via a simple grid search balancing activation and weight magnitudes) for per-channel scaling factors that scale salient weight channels up and the corresponding activations down before quantization. The product is mathematically unchanged, but the rounding error lands on the channels that matter least.
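A toy numerical example makes the effect concrete (made-up numbers and per-tensor symmetric INT4, not AWQ's actual grid search). Channel 2 has a small weight but a huge activation, so its rounding error dominates the output; scaling that weight up by s and its activation down by s leaves the exact product unchanged while shrinking the error where it counts:

```python
import numpy as np

def fake_quant(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Symmetric round-to-nearest quantization with a shared scale."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)  # map max |w| to level 7
    return np.round(w / scale) * scale

w = np.array([1.0, -0.8, 0.1, 0.9])     # channel 2: small weight...
x = np.array([0.1, 0.1, 10.0, 0.1])     # ...but a very large activation
y_true = x @ w

err_plain = abs(x @ fake_quant(w) - y_true)

s = 4.0                                  # per-channel scale (hand-picked here)
w_s, x_s = w.copy(), x.copy()
w_s[2] *= s                              # scale the salient weight up...
x_s[2] /= s                              # ...and its activation down: x @ w unchanged
err_scaled = abs(x_s @ fake_quant(w_s) - y_true)

print(f"output error: plain={err_plain:.3f}, scaled={err_scaled:.3f}")
# plain error ~0.42, scaled ~0.06
```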

Key characteristics:

  • Often achieves higher accuracy at 4-bit than GPTQ, especially on reasoning tasks
  • Faster to quantize than GPTQ
  • Less sensitive to calibration data than GPTQ, since there is no reconstruction step to overfit
  • It still matters who quantized the model: prefer well-established sources

Quantizing with AWQ

# Install dependencies
pip install autoawq transformers torch

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Configure and quantize
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",  # use GEMM for GPU inference
}

model.quantize(tokenizer, quant_config=quant_config)

# Save
model.save_quantized("llama2-7b-awq-4bit")
tokenizer.save_pretrained("llama2-7b-awq-4bit")

Loading a Pre-Quantized AWQ Model

from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "TheBloke/Llama-2-7B-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="auto",
)

prompt = "What is activation-aware quantization?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

GGUF: The Universal Runtime Format

GGUF is not a quantization algorithm — it's a container and runtime format created for llama.cpp. It packages quantized weights (at various levels from Q2 to Q8), the tokenizer, and model metadata into a single portable file.

The killer feature: GGUF runs on CPU, GPU, and Apple M-series with minimal overhead, no Python runtime needed. This makes it the go-to format for local LLM deployment.

Running a GGUF Model with llama-cpp-python

# Install with GPU support (CUDA)
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python

# Or CPU-only
pip install llama-cpp-python

from llama_cpp import Llama

# Load a GGUF model (download .gguf file from HuggingFace)
llm = Llama(
    model_path="./llama-2-7b.Q4_K_M.gguf",
    n_ctx=2048,        # context window
    n_gpu_layers=-1,   # offload all layers to GPU (-1 = all)
    verbose=False,
)

# Generate text
response = llm(
    "Explain LLM quantization in simple terms:",
    max_tokens=256,
    temperature=0.7,
    top_p=0.9,
)

print(response["choices"][0]["text"])

GGUF Quantization Levels

GGUF files come in multiple quantization levels. Here's a practical guide:

Level    Bits  7B File Size  Quality    Use Case
Q2_K     2.5   ~2.8 GB       Low        Experimentation only
Q4_K_M   4.5   ~4.1 GB       Good       Best balance for most users
Q5_K_M   5.5   ~4.8 GB       Very Good  When you have a bit more RAM
Q6_K     6.5   ~5.5 GB       Excellent  Near-original quality
Q8_0     8     ~7.2 GB       Best       Minimal quality loss

For most use cases, Q4_K_M hits the sweet spot between size and quality.

GPTQ vs AWQ vs GGUF: When to Use Which

Factor                   GPTQ            AWQ                             GGUF
Best for                 GPU inference   GPU inference (higher quality)  CPU/hybrid inference
Quality at 4-bit         Good            Better (especially reasoning)   Depends on quant level
Speed                    Fast on GPU     Fast on GPU                     Great on CPU + GPU split
Calibration sensitivity  Higher          Lower                           N/A (pre-quantized)
Ecosystem                HuggingFace     HuggingFace                     llama.cpp / Ollama
Apple Silicon            Limited         Limited                         Excellent

Practical Tips

  • Calibration data matters: Use 100+ diverse samples. Even on a 4GB GPU, roughly 96 sequences of 1,024 tokens each is a reasonable target.
  • Source matters: Carefully quantized models (official releases or well-established community quantizers) perform noticeably better than random uploads.
  • Always benchmark on your actual task: Run the quantized model on real-world inputs before deploying. A model that benchmarks well on perplexity might still fail on your specific use case.
  • Embedding and norm layers stay unquantized: These are small but critical for quality — every major method leaves them in FP16.
  • For local development: Start with GGUF Q4_K_M. If you need HuggingFace ecosystem compatibility, go AWQ.
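For the benchmarking tip, a tiny helper is often all you need. Any runtime can give you per-token log-probabilities, and perplexity is just the exponentiated average negative log-likelihood:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# If the model assigned every token probability 0.5, perplexity is 2:
print(perplexity([math.log(0.5)] * 4))  # ≈ 2.0
```

Run this over text from your own domain for both the FP16 and quantized checkpoints; a gap of more than a few percent usually shows up as degraded generation quality.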

Key Takeaways

  • Quantization makes LLMs accessible on consumer hardware with 4–5× memory reduction
  • GPTQ is the battle-tested choice for GPU inference with broad ecosystem support
  • AWQ delivers better quality at 4-bit by protecting activation-important weights
  • GGUF is the universal format for running models locally across any hardware
  • KV-cache quantization is the next frontier for enabling longer context on smaller GPUs

Further Reading