Hardware: a CUDA GPU with at least 24 GB of VRAM for AWQ calibration of a 7B model; a Colab Pro A100 works. A CPU is enough for GGUF conversion, just slower. Python: 3.11+.
Quantization Is a Deployment Tactic
Quantization sounds like a research topic. It's a deployment tactic. It's worth learning because it's the difference between needing an A100 and needing a 4090. Or between needing a 4090 and running on an M-series MacBook.
What we'll build: take the merged 4B (or 7B) model from Part 6 and produce three deployment-ready artefacts:
- AWQ — for vLLM serving (Part 8)
- GGUF — for llama.cpp / Ollama / Apple Silicon
- A benchmark table comparing FP16 / AWQ / GGUF on size, latency, and quality
The 2026 Quantization Landscape
| Format | Bits | Best for | Hardware |
|---|---|---|---|
| FP16 / BF16 | 16 | Training, baseline serving | Any CUDA |
| FP8 (e4m3fn) | 8 | High-throughput serving when supported | Hopper / Blackwell |
| AWQ | 4 | GPU serving with vLLM (Marlin kernel) | Any modern CUDA |
| GPTQ | 4 | Legacy compat; AWQ is generally better now | Any modern CUDA |
| GGUF | 2–8 | CPU / Apple Silicon / mixed setups | Anything — that's the point |
| NF4 (bitsandbytes) | 4 | Training (QLoRA), occasional inference | Any CUDA |
For a deeper conceptual treatment of how AWQ, GPTQ, and GGUF actually compress the weights, see Demystifying LLM Quantization. This post is the operational pipeline.
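As a back-of-envelope for the table above: the weight footprint is roughly parameters × bits ÷ 8, plus a few percent for quantization metadata (scales, zero-points) and layers left unquantized. A tiny helper to sketch it (the function name and 5% overhead figure are my assumptions, not from any library):

```python
def weight_gib(n_params: float, bits: float, overhead: float = 0.05) -> float:
    """Approximate weight footprint in GiB: params * bits / 8 bytes,
    inflated by ~5% for scales, zero-points, and unquantized layers."""
    return n_params * bits / 8 * (1 + overhead) / (1024 ** 3)

for fmt, bits in [("FP16", 16), ("FP8", 8), ("AWQ 4-bit", 4)]:
    print(f"{fmt:>9}: ~{weight_gib(7e9, bits):.1f} GiB for a 7B model")
```

This is weights only; KV cache and activations come on top, which is why a "fits in VRAM" estimate from this formula alone is optimistic.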
Why Calibration Data Matters
AWQ and GPTQ both pick which weights to compress aggressively and which to preserve carefully by looking at activations on a small calibration dataset. Garbage calibration set, garbage quant.
Rules of thumb:
- ~128 examples is the standard. More doesn't help much; less can hurt
- Examples should look like real production prompts. If you'll serve customer-support replies, calibrate on customer-support replies. Don't use Wikipedia just because it's available
- Cover the prompt-length distribution. If real prompts are 500 tokens, don't calibrate on 50-token snippets
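The rules above reduce to: sample ~128 real prompts from your logs without filtering by length. A minimal sketch, assuming a JSONL log with a `prompt` field (both the log path and field name are illustrative), writing the same `instruction`/`input` shape the AWQ script below expects:

```python
import json
import random

def build_calibration(log_path: str, out_path: str, n: int = 128, seed: int = 0) -> int:
    """Sample n prompts uniformly from a JSONL production log.

    Uniform sampling preserves the natural prompt-length distribution,
    which is exactly what the calibration set should mirror.
    """
    with open(log_path) as f:
        prompts = [json.loads(line)["prompt"] for line in f if line.strip()]
    random.Random(seed).shuffle(prompts)
    sample = prompts[:n]
    with open(out_path, "w") as f:
        for p in sample:
            f.write(json.dumps({"instruction": p, "input": ""}) + "\n")
    return len(sample)
```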
Make AWQ
Install:
```bash
uv add autoawq==0.2.10 transformers==4.50.0 torch==2.5.0 datasets==3.2.0
```
Save as scripts/make_awq.py:
"""Quantize a merged HuggingFace checkpoint to 4-bit AWQ for vLLM."""
from __future__ import annotations
import json
from pathlib import Path
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
MODEL_IN = "outputs/qwen3-4b-domain-merged"
MODEL_OUT = "outputs/qwen3-4b-domain-awq"
CALIB_FILE = "data/calibration.jsonl" # 128 representative prompts
def load_calibration(path: str) -> list[str]:
samples = []
with open(path) as f:
for line in f:
obj = json.loads(line)
# Use the same shape as production traffic: instruction + input
text = obj["instruction"]
if obj.get("input"):
text += "\n\n" + obj["input"]
samples.append(text)
return samples
def main() -> None:
quant_config = {
"zero_point": True,
"q_group_size": 128,
"w_bit": 4,
"version": "GEMM", # Marlin in vLLM uses this
}
model = AutoAWQForCausalLM.from_pretrained(MODEL_IN, safetensors=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_IN, trust_remote_code=True)
calib_data = load_calibration(CALIB_FILE)
model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_data)
Path(MODEL_OUT).mkdir(parents=True, exist_ok=True)
model.save_quantized(MODEL_OUT)
tokenizer.save_pretrained(MODEL_OUT)
print(f"AWQ model saved to {MODEL_OUT}")
if __name__ == "__main__":
main()
Run it:
```bash
uv run python scripts/make_awq.py
```
The output directory contains model.safetensors (the quantized weights) and config.json with the AWQ metadata. vLLM will autodetect both in Part 8.
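Before handing the directory to vLLM, a cheap sanity check that the metadata actually landed is worth the five lines. The exact keys AutoAWQ writes into `quantization_config` can vary by version, so treat the expected values here as a sketch:

```python
import json
from pathlib import Path

def check_awq_dir(path: str) -> dict:
    """Verify an AWQ output directory has weights plus quantization metadata."""
    p = Path(path)
    assert any(p.glob("*.safetensors")), "no safetensors weights found"
    config = json.loads((p / "config.json").read_text())
    qc = config.get("quantization_config")
    assert qc is not None, "config.json is missing quantization_config"
    # AutoAWQ versions differ on the key name ("bits" vs "w_bit")
    assert qc.get("bits", qc.get("w_bit")) == 4, f"unexpected bit width: {qc}"
    return qc
```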
Make GGUF
GGUF is llama.cpp's container format. Install llama.cpp once:
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j
```
Save as scripts/make_gguf.sh:
```bash
#!/usr/bin/env bash
set -euo pipefail

LLAMA_CPP=./llama.cpp
MODEL_IN=outputs/qwen3-4b-domain-merged
OUT_DIR=outputs/qwen3-4b-domain-gguf
mkdir -p "$OUT_DIR"

# 1. Convert HF checkpoint to f16 GGUF
python "$LLAMA_CPP/convert_hf_to_gguf.py" \
  "$MODEL_IN" --outfile "$OUT_DIR/model-f16.gguf" --outtype f16

# 2. Quantize to Q4_K_M (the safe default)
"$LLAMA_CPP/llama-quantize" "$OUT_DIR/model-f16.gguf" "$OUT_DIR/model-Q4_K_M.gguf" Q4_K_M

# 3. Optionally also produce Q5_K_M (higher quality) and Q8_0 (very low loss)
"$LLAMA_CPP/llama-quantize" "$OUT_DIR/model-f16.gguf" "$OUT_DIR/model-Q5_K_M.gguf" Q5_K_M
"$LLAMA_CPP/llama-quantize" "$OUT_DIR/model-f16.gguf" "$OUT_DIR/model-Q8_0.gguf" Q8_0
```
Run:
```bash
chmod +x scripts/make_gguf.sh && ./scripts/make_gguf.sh
```
Quant level cheat sheet:
| Level | Bits | Use when |
|---|---|---|
| Q4_K_M | ~4.5 | Default. Best size/quality trade for most apps |
| Q5_K_M | ~5.5 | Quality matters more than size; ~25% larger |
| Q8_0 | 8 | Near-lossless; deploy when size doesn't matter much |
| Q3_K_S, Q2_K | 2–3 | Aggressive compression; quality degrades visibly. Test first |
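Whichever levels you produce, a cheap check that each output file is actually a GGUF container catches truncated or failed conversions early: every GGUF file starts with the 4-byte ASCII magic `GGUF`. The output path in the commented usage matches the script above:

```python
from pathlib import Path

def is_gguf(path: str) -> bool:
    """Check the 4-byte magic at the start of a GGUF file."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

# Usage against the script's output directory:
# for p in Path("outputs/qwen3-4b-domain-gguf").glob("*.gguf"):
#     print(p.name, "OK" if is_gguf(str(p)) else "NOT GGUF")
```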
Benchmarking
You quantized the model. Does it still work? Build a tiny benchmark.
Save as scripts/benchmark.py:
"""Compare FP16, AWQ, and GGUF on size, latency, and quality."""
from __future__ import annotations
import json, os, time
from pathlib import Path
from statistics import mean
# 1. Size on disk
def dir_size(path: str) -> float:
total = 0
for root, _, files in os.walk(path):
for f in files:
total += os.path.getsize(os.path.join(root, f))
return total / (1024 ** 3)
# 2. Latency: time-to-first-token + tokens/sec on a fixed prompt
def run_hf(model_path: str, prompt: str, n_new: int = 100) -> dict:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map="auto")
inputs = tok(prompt, return_tensors="pt").to(model.device)
# Warmup
model.generate(**inputs, max_new_tokens=10)
torch.cuda.synchronize()
t0 = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=n_new, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - t0
tok_per_sec = n_new / elapsed
return {"elapsed_s": elapsed, "tok_per_sec": tok_per_sec}
# 3. Perplexity on a held-out set (lower = better)
def perplexity(model_path: str, eval_path: str) -> float:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map="auto")
losses = []
with open(eval_path) as f:
for line in f:
text = json.loads(line)["text"]
inputs = tok(text, return_tensors="pt", truncation=True, max_length=1024).to(model.device)
with torch.no_grad():
loss = model(**inputs, labels=inputs["input_ids"]).loss
losses.append(loss.item())
return float(torch.tensor(mean(losses)).exp())
def main() -> None:
PROMPT = "Convert temperature.\n\n75 F"
EVAL = "data/eval.jsonl"
rows = []
for label, path in [
("FP16", "outputs/qwen3-4b-domain-merged"),
("AWQ-4", "outputs/qwen3-4b-domain-awq"),
]:
size = dir_size(path)
lat = run_hf(path, PROMPT)
ppl = perplexity(path, EVAL)
rows.append((label, size, lat["tok_per_sec"], ppl))
print(f"{'Format':<8} {'Size GB':>8} {'tok/s':>8} {'PPL':>8}")
for label, size, tps, ppl in rows:
print(f"{label:<8} {size:>8.2f} {tps:>8.1f} {ppl:>8.2f}")
if __name__ == "__main__":
main()
Typical numbers for a 4B model on a single GPU:
| Format | Size | Throughput | Perplexity (lower is better) |
|---|---|---|---|
| FP16 | ~8 GB | ~40 tok/s | baseline |
| AWQ-4 | ~2.5 GB | ~110 tok/s | baseline + 0.3% |
| Q4_K_M (GGUF, llama.cpp) | ~2.5 GB | varies (CPU vs GPU) | baseline + 0.5% |
The numbers will vary with model and hardware. The shape is what matters: AWQ buys you ~3.2× the size compression and ~2.5× the throughput for under 1% perplexity loss.
The "Trained FP16, Served INT4" Trade-off
You trained the LoRA against the FP16 base in Part 6. You're now serving against an AWQ base. There's a quality drift here that depends on how much the AWQ-quantized base diverges from the FP16 base on the directions the LoRA learned.
For most applications this drift is small. For applications where structural adherence matters (one of mine, the structured document generator from that post, lives with exactly this trade-off), the cleaner path is to re-train the LoRA against the AWQ base directly. If you have the GPU budget, do this. If you don't, document the trade-off in your code with a comment so future-you remembers it's there.
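One cheap way to measure the drift before deciding is to greedy-decode a fixed prompt set with both the FP16 and AWQ models and compare outputs. Greedy decoding makes the comparison deterministic; exact match is strict, so a prefix-agreement score is a useful softer signal. These helpers are a sketch (the metric names and normalization are my choices); generate the two output lists with `do_sample=False` using whatever loader you prefer:

```python
def agreement_rate(fp16_outputs: list[str], awq_outputs: list[str]) -> float:
    """Fraction of prompts where FP16 and AWQ greedy outputs match exactly."""
    assert len(fp16_outputs) == len(awq_outputs), "need paired outputs"
    if not fp16_outputs:
        return 1.0
    matches = sum(a == b for a, b in zip(fp16_outputs, awq_outputs))
    return matches / len(fp16_outputs)


def prefix_agreement(fp16_outputs: list[str], awq_outputs: list[str]) -> float:
    """Mean length of the shared character prefix, normalized by the longer
    output -- a softer signal than exact match for long generations."""
    def shared(a: str, b: str) -> float:
        n = 0
        for ca, cb in zip(a, b):
            if ca != cb:
                break
            n += 1
        return n / max(len(a), len(b), 1)

    pairs = list(zip(fp16_outputs, awq_outputs))
    return sum(shared(a, b) for a, b in pairs) / max(len(pairs), 1)
```

For the structured-output case, a stricter task metric (e.g. fraction of outputs that still parse as valid JSON) is the number that actually decides whether you need the re-train.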
When Not to Quantize
- Small models (<3B). You're already past the "needs an A100" threshold. The compression doesn't justify the perplexity hit, and inference is GPU-fast either way
- Very long context. Some quantization techniques degrade more on long contexts. Test before deploying
- Your bottleneck is CPU/network, not GPU memory. Quantization helps memory; if memory isn't the constraint, you're not gaining anything
Key Takeaways
- Quantization is operational, not academic. The reason to do it is to fit a model on smaller hardware
- AWQ for GPU serving with vLLM. GGUF for CPU / Apple Silicon / mixed deployment
- Calibration data must look like production prompts. The default mistake (calibrate on Wikipedia) gives a worse quant for your task
- Q4_K_M is the safe GGUF default. Q5_K_M when quality matters; Q8_0 when size doesn't
- Always benchmark size, throughput, perplexity, and a task metric. A quant that loses 5% perplexity may be unusable for your task even if it sounds reasonable
- Re-training the LoRA against the quantized base is the cleanest deployment path. If GPU budget doesn't allow, document the FP16-trained / INT4-served gap explicitly
Next Up
Part 8 takes the AWQ artefact from this post and serves it with vLLM — multi-user concurrency, multi-LoRA support, prefix caching. Then we swap the vLLM endpoint into the Part 5 RAG system, completing the loop.