Hardware: a CUDA GPU with 8 GB+ VRAM covers both teacher inference (4B model) and student training (1.5B QLoRA); a free Colab T4 is enough. Python: 3.11+.
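
Software: everything here is pip-installable; versions move quickly, so pin whatever combination works for you. A starting point:

pip install unsloth trl datasets openai vllm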

Quantization vs Distillation

Quantization compresses weights. Distillation compresses behaviour. They solve different problems and they compose — a distilled model can also be quantized. Why bother with distillation when you've already quantized? Because no amount of quantization gets a 7B model to run on a phone. For that you need a fundamentally smaller architecture, and the cheapest way to make a small model good at your task is to teach it from a big one.

What we'll build:

  1. Use the Part 6 fine-tuned 4B model as a teacher to generate synthetic (input, output) pairs
  2. Filter the synthetic data for quality — this is where most distillation projects fail
  3. Fine-tune a 1.5B base on the synthetic data using the same Unsloth pipeline as Part 6
  4. Evaluate student vs teacher on the same eval set

Three Flavours of Distillation

Flavour        | What it matches                                        | Practical?
Response-based | Student matches teacher outputs                        | Yes — works with any teacher, including closed-source APIs
Feature-based  | Student matches teacher hidden states                  | Open-weight teachers only; complex to set up
Relation-based | Student matches relationships between teacher outputs  | Research-grade; rarely used in practice

For the rest of this post we use response-based distillation. It's the only flavour that's both practical and robust enough for a tutorial.

Why You'd Distil

  • Latency. A 1.5B model serves at 2–5× the throughput of a 4B on the same hardware
  • Cost. Smaller model, smaller GPU, smaller bill
  • Privacy / on-device. Running on a phone or laptop without a GPU is impossible at 7B; perfectly fine at 1.5B
  • Edge deployment. Same point: where the GPU isn't

Step 1: Generate Synthetic Data

The teacher is your Part 6 fine-tuned 4B model, served via the Part 8 vLLM endpoint. Save as scripts/generate_synthetic.py:

"""Use the teacher (Part 6 model on Part 8 endpoint) to generate (input, output) pairs."""
from __future__ import annotations
import json
import random
from pathlib import Path
from openai import OpenAI

# Hit the local vLLM endpoint
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8001/v1", timeout=60.0)
TEACHER_MODEL = "domain"   # the LoRA alias from Part 8
OUT_PATH = Path("data/synthetic.jsonl")

# Seed prompts — the diversity of these directly determines dataset diversity.
SEEDS = [
    "Convert temperature.\n\n{value} F",
    "Format as JSON.\n\nname={name} age={age}",
    "Summarise in one sentence.\n\n{paragraph}",
]

NAMES = ["Alice", "Bob", "Charlie", "Dana", "Erin", "Felix"]
PARAGRAPHS = [
    "FastAPI is a Python web framework with async support and automatic OpenAPI docs.",
    "Postgres has supported full-text search natively since version 8.3 via tsvector.",
    "Quantization compresses model weights with a small accuracy cost.",
    # ... add more
]


def generate_input() -> str:
    template = random.choice(SEEDS)
    return template.format(
        value=random.randint(0, 200),
        name=random.choice(NAMES),
        age=random.randint(18, 80),
        paragraph=random.choice(PARAGRAPHS),
    )


def teacher_answer(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=TEACHER_MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        temperature=0.0,
    )
    return resp.choices[0].message.content or ""


def main(n: int = 2000) -> None:
    OUT_PATH.parent.mkdir(parents=True, exist_ok=True)
    with open(OUT_PATH, "w") as f:
        for i in range(n):
            inp = generate_input()
            out = teacher_answer(inp)
            f.write(json.dumps({"instruction": inp, "input": "", "output": out}) + "\n")
            if (i + 1) % 100 == 0:
                print(f"generated {i + 1}/{n}")


if __name__ == "__main__":
    main()

Run it — with vLLM serving on a single GPU and a 4B teacher, generating ~2,000 examples takes about 10 minutes.
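
Assuming the Part 8 endpoint is already serving on port 8001:

python scripts/generate_synthetic.py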

Step 2: Quality Control (the Hardest Part)

Most synthetic datasets are 50% useful and 50% noise. The cleanup is the work. Save the filters below as, say, scripts/filter_synthetic.py; these are the checks that catch the most damage:

"""Filter synthetic data: dedup, length sanity, keyword guards, diversity sample."""
from __future__ import annotations
import json
from collections import Counter
from pathlib import Path

IN_PATH = Path("data/synthetic.jsonl")
OUT_PATH = Path("data/synthetic_clean.jsonl")


def load(path: Path) -> list[dict]:
    return [json.loads(l) for l in path.read_text().splitlines() if l.strip()]


def dedup(rows: list[dict]) -> list[dict]:
    seen = set()
    out = []
    for r in rows:
        key = (r["instruction"].strip(), r["output"].strip())
        if key in seen:
            continue
        seen.add(key)
        out.append(r)
    return out


def length_sane(rows: list[dict], min_out: int = 5, max_out: int = 800) -> list[dict]:
    return [r for r in rows if min_out <= len(r["output"]) <= max_out]


def hallucination_guard(rows: list[dict]) -> list[dict]:
    """Drop outputs that look like the model giving up or going off-topic."""
    bad_phrases = [
        "i don't know", "as an ai language model", "i cannot",
        "sorry, i", "let me think", "[redacted]",
    ]
    return [r for r in rows if not any(b in r["output"].lower() for b in bad_phrases)]


def diversity_sample(rows: list[dict], cap_per_template: int = 50) -> list[dict]:
    """If one input shape dominates, cap it so the dataset isn't 80% one task."""
    counts: Counter[str] = Counter()
    out = []
    for r in rows:
        # Collapse the instruction to its template (rough): first 4 words.
        key = " ".join(r["instruction"].split()[:4])
        if counts[key] >= cap_per_template:
            continue
        counts[key] += 1
        out.append(r)
    return out


def main() -> None:
    rows = load(IN_PATH)
    n0 = len(rows)
    rows = dedup(rows);                     n1 = len(rows)
    rows = length_sane(rows);               n2 = len(rows)
    rows = hallucination_guard(rows);       n3 = len(rows)
    rows = diversity_sample(rows);          n4 = len(rows)
    OUT_PATH.write_text("\n".join(json.dumps(r) for r in rows) + "\n")
    print(f"raw {n0} → dedup {n1} → length {n2} → halluc {n3} → diversity {n4}")


if __name__ == "__main__":
    main()

Typical numbers: 2,000 raw → 1,200 clean. If your filtered set is 95% of raw, your filters are too lax. If it's 30% of raw, your generation is too noisy.
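
Filters catch mechanical noise but miss plausible-looking wrong answers, so eyeball a random sample before training. A minimal sketch over the cleaned file:

"""Print a random sample of the cleaned data for manual inspection."""
import json
import random
from pathlib import Path

rows = [json.loads(l) for l in Path("data/synthetic_clean.jsonl").read_text().splitlines() if l.strip()]
for r in random.sample(rows, k=min(10, len(rows))):
    print("IN: ", r["instruction"][:120])
    print("OUT:", r["output"][:120])
    print("-" * 60)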

Step 3: Pick a Student

Rules of thumb:

  • Same family as teacher when possible. Qwen3-4B teacher → Qwen3-1.5B student. The shared tokeniser and pretraining data make distillation smoother
  • Instruction-tuned variant if you want chat behaviour. A base model needs more data to learn chat formatting from scratch
  • Don't drop too far in size. A 4B → 1.5B distillation usually works. 4B → 0.5B often fails because the small model can't hold the same task complexity

For this post: Qwen/Qwen3-1.5B-Instruct.
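
A quick way to verify the same-family rule: check that the two tokenisers produce identical token IDs on your domain text, which means the student sees inputs exactly as the teacher did. A minimal sketch; the teacher base name is an assumption (substitute whatever Part 6 actually used):

from transformers import AutoTokenizer

# Assumed teacher base; swap in the Part 6 model name if different.
teacher_tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")
student_tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.5B-Instruct")

sample = "Postgres has supported full-text search natively since version 8.3."
print(teacher_tok.encode(sample) == student_tok.encode(sample))  # same family: True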

Step 4: Train the Student

Same Unsloth + SFTTrainer pipeline as Part 6. Save as scripts/distil_train.py:

"""QLoRA fine-tune Qwen3-1.5B on the cleaned synthetic dataset."""
from datasets import load_dataset
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig

MODEL_NAME = "unsloth/Qwen3-1.5B-Instruct"
MAX_SEQ_LENGTH = 2048
DATA_PATH = "data/synthetic_clean.jsonl"
OUTPUT_DIR = "outputs/qwen3-1.5b-distilled"

model, tok = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=None, load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16, lora_alpha=16, lora_dropout=0,
    target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"],
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)


def fmt(ex):
    msgs = [{"role": "user", "content": ex["instruction"]},
             {"role": "assistant", "content": ex["output"]}]
    return {"text": tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=False)}


ds = load_dataset("json", data_files=DATA_PATH, split="train")
ds = ds.map(fmt, remove_columns=["instruction", "input", "output"])
ds = ds.train_test_split(test_size=0.05, seed=42)

trainer = SFTTrainer(
    model=model, tokenizer=tok,
    train_dataset=ds["train"], eval_dataset=ds["test"],
    args=SFTConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=20,
        num_train_epochs=2,
        learning_rate=2e-4,
        bf16=False, fp16=True,
        logging_steps=20, eval_strategy="steps", eval_steps=100,
        save_strategy="steps", save_steps=200, save_total_limit=2,
        output_dir=OUTPUT_DIR,
        max_seq_length=MAX_SEQ_LENGTH,
        dataset_text_field="text",
        report_to="none", seed=42,
    ),
)
trainer.train()
model.save_pretrained(OUTPUT_DIR); tok.save_pretrained(OUTPUT_DIR)
print(f"student adapter saved to {OUTPUT_DIR}")

Step 5: Evaluate Student vs Teacher

The honest comparison is on a held-out set of real inputs, not the synthetic ones. Save as scripts/eval_student.py:

"""Compare student vs teacher on a held-out eval set."""
import json
from openai import OpenAI

teacher = OpenAI(api_key="EMPTY", base_url="http://localhost:8001/v1")
student = OpenAI(api_key="EMPTY", base_url="http://localhost:8002/v1")  # second vLLM, see below

EVAL = [json.loads(l) for l in open("data/real_eval.jsonl") if l.strip()]


def gen(client, model: str, prompt: str) -> str:
    return client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256, temperature=0.0,
    ).choices[0].message.content or ""


def exact_match(a: str, b: str) -> bool:
    return a.strip().lower() == b.strip().lower()


t_hits, s_hits = 0, 0
for case in EVAL:
    expected = case["expected"]
    t = gen(teacher, "domain", case["prompt"])
    s = gen(student, "distilled", case["prompt"])
    t_hits += exact_match(t, expected)
    s_hits += exact_match(s, expected)

print(f"teacher  EM: {t_hits / len(EVAL):.2%}")
print(f"student  EM: {s_hits / len(EVAL):.2%}")
print(f"retained:    {s_hits / max(t_hits, 1):.0%}")

Run a second vLLM process on a different port for the student. Serve the student's base model and attach the saved directory as a LoRA adapter (a bare adapter directory isn't a servable model on its own):

vllm serve Qwen/Qwen3-1.5B-Instruct --port 8002 --enable-lora \
  --lora-modules "distilled=outputs/qwen3-1.5b-distilled" \
  --gpu-memory-utilization 0.4

What good looks like: the student hits 80–95% of the teacher's task performance. Below that and either your synthetic dataset is too small / too noisy, or the student is too small for the task.

When Distillation Fails

Symptom                                                  | Cause
Student stuck below 50% of teacher                       | Student capacity too small, or synthetic dataset too narrow
Student matches teacher on synthetic eval, fails on real | Synthetic data not representative of the real distribution — expand seed prompts
Student inherits teacher mistakes                        | The teacher's wrong answers became labels. Fix the teacher first, regenerate
Student trained but generates garbage                    | Chat-template mismatch between the teacher's outputs and the student's expectations
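
The last failure mode is the quickest to rule out: print exactly what the student trained on and compare the role markers to what your serving layer wraps around prompts. A minimal sketch:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("outputs/qwen3-1.5b-distilled")
msgs = [{"role": "user", "content": "Convert temperature.\n\n72 F"},
        {"role": "assistant", "content": "22.2 C"}]
# This is the exact string fmt() produced at training time; the special
# tokens here must match what the student sees at inference.
print(tok.apply_chat_template(msgs, tokenize=False))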

Distillation in the Wild

The chain of "GPT-4 → synthetic data → small open-source model" is how the entire ecosystem of small instruction-tuned models was built in 2023–2024. The technique is the same; you're just doing it on your own task with your own teacher. The outputs of distillation can also be quantized (Part 7) and served on vLLM (Part 8) — the techniques compose.

Key Takeaways

  • Quantization compresses weights. Distillation compresses behaviour. They solve different problems and they compose
  • Response-based distillation is the practical flavour. Generate (input, output) from the teacher, train the student on that
  • Synthetic data quality is the work. Dedup, length sanity, hallucination guards, diversity caps. ~60% retention is normal
  • Pick a student in the same family as the teacher when possible. 4B → 1.5B usually works; 4B → 0.5B often doesn't
  • Eval on real inputs, not synthetic ones. The honest metric is "X% of teacher performance retained"
  • Distilled models can themselves be quantized. The compression techniques compose

Next Up

Part 10 ties everything together. Concurrency patterns, caching at every layer, structured logging, monitoring, graceful degradation. Plus a one-page checklist for shipping AI to production.

Next: Part 10 — Scaling AI Systems