Hardware: a CUDA GPU with at least 24 GB of VRAM for AWQ calibration of a 7B model; a Colab Pro A100 works. A CPU is enough for GGUF conversion, just slower. Python: 3.11+.
Quantization Is a Deployment Tactic
Quantization sounds like a research topic. It's a deployment tactic. It's worth learning because it's the difference between needing an A100 and needing a 4090. Or between needing a 4090 and running on an M-series MacBook.
What we'll build: take the merged 4B (or 7B) model from Part 6 and produce three deployment-ready artefacts:
- AWQ — for vLLM serving (Part 8)
- GGUF — for llama.cpp / Ollama / Apple Silicon
- A benchmark table comparing FP16 / AWQ / GGUF on size, latency, and quality
The 2026 Quantization Landscape
| Format | Bits | Best for | Hardware |
|---|---|---|---|
| FP16 / BF16 | 16 | Training, baseline serving | Any CUDA |
| FP8 (e4m3fn) | 8 | High-throughput serving when supported | Hopper / Blackwell |
| AWQ | 4 | GPU serving with vLLM (Marlin kernel) | Any modern CUDA |
| GPTQ | 4 | Legacy compat; AWQ is generally better now | Any modern CUDA |
| GGUF | 2–8 | CPU / Apple Silicon / mixed setups | Anything — that's the point |
| NF4 (bitsandbytes) | 4 | Training (QLoRA), occasional inference | Any CUDA |
For a deeper conceptual treatment of how AWQ, GPTQ, and GGUF actually compress the weights, see Demystifying LLM Quantization. This post is the operational pipeline.
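As a back-of-envelope for the table above: the weight footprint is roughly parameters × bits ÷ 8, plus a few percent for quantization metadata (scales, zero-points) and layers left unquantized. A tiny helper to sketch it (the function name and 5% overhead figure are my assumptions, not from any library):

```python
def weight_gib(n_params: float, bits: float, overhead: float = 0.05) -> float:
    """Approximate weight footprint in GiB: params * bits / 8 bytes,
    inflated by ~5% for scales, zero-points, and unquantized layers."""
    return n_params * bits / 8 * (1 + overhead) / (1024 ** 3)

for fmt, bits in [("FP16", 16), ("FP8", 8), ("AWQ 4-bit", 4)]:
    print(f"{fmt:>9}: ~{weight_gib(7e9, bits):.1f} GiB for a 7B model")
```

This is weights only; KV cache and activations come on top, which is why a "fits in VRAM" estimate from this formula alone is optimistic.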
Why Calibration Data Matters
AWQ and GPTQ both pick which weights to compress aggressively and which to preserve carefully by looking at activations on a small calibration dataset. Garbage calibration set, garbage quant.
Rules of thumb:
- ~128 examples is the standard. More doesn't help much; less can hurt
- Examples should look like real production prompts. If you'll serve customer-support replies, calibrate on customer-support replies. Don't use Wikipedia just because it's available
- Cover the prompt-length distribution. If real prompts are 500 tokens, don't calibrate on 50-token snippets
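The rules above reduce to: sample ~128 real prompts from your logs without filtering by length. A minimal sketch, assuming a JSONL log with a `prompt` field (both the log path and field name are illustrative), writing the same `instruction`/`input` shape the AWQ script below expects:

```python
import json
import random

def build_calibration(log_path: str, out_path: str, n: int = 128, seed: int = 0) -> int:
    """Sample n prompts uniformly from a JSONL production log.

    Uniform sampling preserves the natural prompt-length distribution,
    which is exactly what the calibration set should mirror.
    """
    with open(log_path) as f:
        prompts = [json.loads(line)["prompt"] for line in f if line.strip()]
    random.Random(seed).shuffle(prompts)
    sample = prompts[:n]
    with open(out_path, "w") as f:
        for p in sample:
            f.write(json.dumps({"instruction": p, "input": ""}) + "\n")
    return len(sample)
```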
Make AWQ
Install:
```bash
uv add autoawq==0.2.10 transformers==4.50.0 torch==2.5.0 datasets==3.2.0
```
Save as scripts/make_awq.py:
"""Quantize a merged HuggingFace checkpoint to 4-bit AWQ for vLLM."""
from __future__ import annotations
import json
from pathlib import Path
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
MODEL_IN = "outputs/qwen3-4b-domain-merged"
MODEL_OUT = "outputs/qwen3-4b-domain-awq"
CALIB_FILE = "data/calibration.jsonl" # 128 representative prompts
def load_calibration(path: str) -> list[str]:
samples = []
with open(path) as f:
for line in f:
obj = json.loads(line)
# Use the same shape as production traffic: instruction + input
text = obj["instruction"]
if obj.get("input"):
text += "\n\n" + obj["input"]
samples.append(text)
return samples
def main() -> None:
quant_config = {
"zero_point": True,
"q_group_size": 128,
"w_bit": 4,
"version": "GEMM", # Marlin in vLLM uses this
}
model = AutoAWQForCausalLM.from_pretrained(MODEL_IN, safetensors=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_IN, trust_remote_code=True)
calib_data = load_calibration(CALIB_FILE)
model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_data)
Path(MODEL_OUT).mkdir(parents=True, exist_ok=True)
model.save_quantized(MODEL_OUT)
tokenizer.save_pretrained(MODEL_OUT)
print(f"AWQ model saved to {MODEL_OUT}")
if __name__ == "__main__":
main()
Run it:
```bash
uv run python scripts/make_awq.py
```
The output directory contains model.safetensors (the quantized weights) and config.json with the AWQ metadata. vLLM will autodetect both in Part 8.
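Before handing the directory to vLLM, a cheap sanity check that the metadata actually landed is worth the five lines. The exact keys AutoAWQ writes into `quantization_config` can vary by version, so treat the expected values here as a sketch:

```python
import json
from pathlib import Path

def check_awq_dir(path: str) -> dict:
    """Verify an AWQ output directory has weights plus quantization metadata."""
    p = Path(path)
    assert any(p.glob("*.safetensors")), "no safetensors weights found"
    config = json.loads((p / "config.json").read_text())
    qc = config.get("quantization_config")
    assert qc is not None, "config.json is missing quantization_config"
    # AutoAWQ versions differ on the key name ("bits" vs "w_bit")
    assert qc.get("bits", qc.get("w_bit")) == 4, f"unexpected bit width: {qc}"
    return qc
```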
Make GGUF
GGUF is llama.cpp's container format. Install llama.cpp once:
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j
```
Save as scripts/make_gguf.sh:
```bash
#!/usr/bin/env bash
set -euo pipefail

LLAMA_CPP=./llama.cpp
MODEL_IN=outputs/qwen3-4b-domain-merged
OUT_DIR=outputs/qwen3-4b-domain-gguf
mkdir -p "$OUT_DIR"

# 1. Convert HF checkpoint to f16 GGUF
python "$LLAMA_CPP/convert_hf_to_gguf.py" \
  "$MODEL_IN" --outfile "$OUT_DIR/model-f16.gguf" --outtype f16

# 2. Quantize to Q4_K_M (the safe default)
"$LLAMA_CPP/llama-quantize" "$OUT_DIR/model-f16.gguf" "$OUT_DIR/model-Q4_K_M.gguf" Q4_K_M

# 3. Optionally also produce Q5_K_M (higher quality) and Q8_0 (very low loss)
"$LLAMA_CPP/llama-quantize" "$OUT_DIR/model-f16.gguf" "$OUT_DIR/model-Q5_K_M.gguf" Q5_K_M
"$LLAMA_CPP/llama-quantize" "$OUT_DIR/model-f16.gguf" "$OUT_DIR/model-Q8_0.gguf" Q8_0
```
Run:
```bash
chmod +x scripts/make_gguf.sh && ./scripts/make_gguf.sh
```
Quant level cheat sheet:
| Level | Bits | Use when |
|---|---|---|
| Q4_K_M | ~4.5 | Default. Best size/quality trade for most apps |
| Q5_K_M | ~5.5 | Quality matters more than size; ~25% larger |
| Q8_0 | 8 | Near-lossless; deploy when size doesn't matter much |
| Q3_K_S, Q2_K | 2–3 | Aggressive compression; quality degrades visibly. Test first |
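Whichever levels you produce, a cheap check that each output file is actually a GGUF container catches truncated or failed conversions early: every GGUF file starts with the 4-byte ASCII magic `GGUF`. The output path in the commented usage matches the script above:

```python
from pathlib import Path

def is_gguf(path: str) -> bool:
    """Check the 4-byte magic at the start of a GGUF file."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

# Usage against the script's output directory:
# for p in Path("outputs/qwen3-4b-domain-gguf").glob("*.gguf"):
#     print(p.name, "OK" if is_gguf(str(p)) else "NOT GGUF")
```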
Benchmarking
You quantized the model. Does it still work? Build a tiny benchmark.
Save as scripts/benchmark.py:
"""Compare FP16, AWQ, and GGUF on size, latency, and quality."""
from __future__ import annotations
import json, os, time
from pathlib import Path
from statistics import mean
# 1. Size on disk
def dir_size(path: str) -> float:
total = 0
for root, _, files in os.walk(path):
for f in files:
total += os.path.getsize(os.path.join(root, f))
return total / (1024 ** 3)
# 2. Latency: time-to-first-token + tokens/sec on a fixed prompt
def run_hf(model_path: str, prompt: str, n_new: int = 100) -> dict:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map="auto")
inputs = tok(prompt, return_tensors="pt").to(model.device)
# Warmup
model.generate(**inputs, max_new_tokens=10)
torch.cuda.synchronize()
t0 = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=n_new, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - t0
tok_per_sec = n_new / elapsed
return {"elapsed_s": elapsed, "tok_per_sec": tok_per_sec}
# 3. Perplexity on a held-out set (lower = better)
def perplexity(model_path: str, eval_path: str) -> float:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map="auto")
losses = []
with open(eval_path) as f:
for line in f:
text = json.loads(line)["text"]
inputs = tok(text, return_tensors="pt", truncation=True, max_length=1024).to(model.device)
with torch.no_grad():
loss = model(**inputs, labels=inputs["input_ids"]).loss
losses.append(loss.item())
return float(torch.tensor(mean(losses)).exp())
def main() -> None:
PROMPT = "Convert temperature.\n\n75 F"
EVAL = "data/eval.jsonl"
rows = []
for label, path in [
("FP16", "outputs/qwen3-4b-domain-merged"),
("AWQ-4", "outputs/qwen3-4b-domain-awq"),
]:
size = dir_size(path)
lat = run_hf(path, PROMPT)
ppl = perplexity(path, EVAL)
rows.append((label, size, lat["tok_per_sec"], ppl))
print(f"{'Format':<8} {'Size GB':>8} {'tok/s':>8} {'PPL':>8}")
for label, size, tps, ppl in rows:
print(f"{label:<8} {size:>8.2f} {tps:>8.1f} {ppl:>8.2f}")
if __name__ == "__main__":
main()
Typical numbers for a 4B model on a single GPU:
| Format | Size | Throughput | Perplexity (lower is better) |
|---|---|---|---|
| FP16 | ~8 GB | ~40 tok/s | baseline |
| AWQ-4 | ~2.5 GB | ~110 tok/s | baseline + 0.3% |
| Q4_K_M (GGUF, llama.cpp) | ~2.5 GB | varies (CPU vs GPU) | baseline + 0.5% |
The numbers will vary with model and hardware. The shape is what matters: AWQ buys you ~3.2× the size compression and ~2.5× the throughput for under 1% perplexity loss.
The "Trained FP16, Served INT4" Trade-off
You trained the LoRA against the FP16 base in Part 6. You're now serving against an AWQ base. There's a quality drift here that depends on how much the AWQ-quantized base diverges from the FP16 base on the directions the LoRA learned.
For most applications this drift is small. For applications where structural adherence matters (one of mine, the structured document generator from that post, lives with exactly this trade-off), the cleaner path is to re-train the LoRA against the AWQ base directly. If you have the GPU budget, do this. If you don't, document the trade-off in your code with a comment so future-you remembers it's there.
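One cheap way to measure the drift before deciding is to greedy-decode a fixed prompt set with both the FP16 and AWQ models and compare outputs. Greedy decoding makes the comparison deterministic; exact match is strict, so a prefix-agreement score is a useful softer signal. These helpers are a sketch (the metric names and normalization are my choices); generate the two output lists with `do_sample=False` using whatever loader you prefer:

```python
def agreement_rate(fp16_outputs: list[str], awq_outputs: list[str]) -> float:
    """Fraction of prompts where FP16 and AWQ greedy outputs match exactly."""
    assert len(fp16_outputs) == len(awq_outputs), "need paired outputs"
    if not fp16_outputs:
        return 1.0
    matches = sum(a == b for a, b in zip(fp16_outputs, awq_outputs))
    return matches / len(fp16_outputs)


def prefix_agreement(fp16_outputs: list[str], awq_outputs: list[str]) -> float:
    """Mean length of the shared character prefix, normalized by the longer
    output -- a softer signal than exact match for long generations."""
    def shared(a: str, b: str) -> float:
        n = 0
        for ca, cb in zip(a, b):
            if ca != cb:
                break
            n += 1
        return n / max(len(a), len(b), 1)

    pairs = list(zip(fp16_outputs, awq_outputs))
    return sum(shared(a, b) for a, b in pairs) / max(len(pairs), 1)
```

For the structured-output case, a stricter task metric (e.g. fraction of outputs that still parse as valid JSON) is the number that actually decides whether you need the re-train.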
When Not to Quantize
- Small models (<3B). You're already past the "needs an A100" threshold. The compression doesn't justify the perplexity hit, and inference is GPU-fast either way
- Very long context. Some quantization techniques degrade more on long contexts. Test before deploying
- Your bottleneck is CPU/network, not GPU memory. Quantization helps memory; if memory isn't the constraint, you're not gaining anything
Key Takeaways
- Quantization is operational, not academic. The reason to do it is to fit a model on smaller hardware
- AWQ for GPU serving with vLLM. GGUF for CPU / Apple Silicon / mixed deployment
- Calibration data must look like production prompts. The default mistake (calibrate on Wikipedia) gives a worse quant for your task
- Q4_K_M is the safe GGUF default. Q5_K_M when quality matters; Q8_0 when size doesn't
- Always benchmark size, throughput, perplexity, and a task metric. A quant that loses 5% perplexity may be unusable for your task even if it sounds reasonable
- Re-training the LoRA against the quantized base is the cleanest deployment path. If GPU budget doesn't allow, document the FP16-trained / INT4-served gap explicitly
Next Up
Part 8 takes the AWQ artefact from this post and serves it with vLLM — multi-user concurrency, multi-LoRA support, prefix caching. Then we swap the vLLM endpoint into the Part 5 RAG system, completing the loop.