Part 9: Knowledge Distillation

Hardware: a CUDA GPU with 8 GB+ VRAM works for both teacher inference (4B model) and student training (1.5B QLoRA). Free Colab T4 works. Python: 3.11+.

Quantization vs Distillation

Quantization compresses weights. Distillation compresses behaviour. They solve different problems and they compose — a distilled model can also be quantized. Why bother with distillation when you've already quantized? Because no amount of quantization gets a 7B model to run on a phone. For that you need a fundamentally smaller architecture, and the cheapest way to make a small model good at your task is to teach it from a big one.

What we'll build:

Use the Part 6 fine-tuned 4B model as a teacher to generate synthetic (input, output) pairs
Filter the synthetic data for quality — this is where most distillation projects fail
Fine-tune a 1.5B base on the synthetic data using the same Unsloth pipeline as Part 6
Evaluate student vs teacher on the same eval set

Three Flavours of Distillation

Flavour	What it matches	Practical?
Response-based	Student matches teacher outputs	Yes — works with any teacher, including closed-source APIs
Feature-based	Student matches teacher hidden states	Open-weight teachers only; complex to set up
Relation-based	Student matches relationships between teacher outputs	Research-grade; rarely used in practice

For the rest of this post we use response-based distillation. It's the only flavour that's both practical and robust enough for a tutorial.

Why You'd Distil

Latency. A 1.5B model serves at 2–5× the throughput of a 4B for the same hardware
Cost. Smaller model, smaller GPU, smaller bill
Privacy / on-device. Running on a phone or laptop without a GPU is impossible at 7B; perfectly fine at 1.5B
Edge deployment. Same point: where the GPU isn't

Step 1: Generate Synthetic Data

The teacher is your Part 6 fine-tuned 4B model, served via the Part 8 vLLM endpoint. Save as scripts/generate_synthetic.py:

"""Use the teacher (Part 6 model on Part 8 endpoint) to generate (input, output) pairs."""
from __future__ import annotations
import json
import random
from pathlib import Path
from openai import OpenAI

# Hit the local vLLM endpoint
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8001/v1", timeout=60.0)
TEACHER_MODEL = "domain"   # the LoRA alias from Part 8
OUT_PATH = Path("data/synthetic.jsonl")

# Seed prompts — the diversity of these directly determines dataset diversity.
SEEDS = [
    "Convert temperature.\n\n{value} F",
    "Format as JSON.\n\nname={name} age={age}",
    "Summarise in one sentence.\n\n{paragraph}",
]

NAMES = ["Alice", "Bob", "Charlie", "Dana", "Erin", "Felix"]
PARAGRAPHS = [
    "FastAPI is a Python web framework with async support and automatic OpenAPI docs.",
    "Postgres has supported full-text search natively since version 8.3 via tsvector.",
    "Quantization compresses model weights with a small accuracy cost.",
    # ... add more
]


def generate_input() -> str:
    template = random.choice(SEEDS)
    return template.format(
        value=random.randint(0, 200),
        name=random.choice(NAMES),
        age=random.randint(18, 80),
        paragraph=random.choice(PARAGRAPHS),
    )


def teacher_answer(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=TEACHER_MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        temperature=0.0,
    )
    return resp.choices[0].message.content or ""


def main(n: int = 2000) -> None:
    OUT_PATH.parent.mkdir(parents=True, exist_ok=True)
    with open(OUT_PATH, "w") as f:
        for i in range(n):
            inp = generate_input()
            out = teacher_answer(inp)
            f.write(json.dumps({"instruction": inp, "input": "", "output": out}) + "\n")
            if (i + 1) % 100 == 0:
                print(f"generated {i + 1}/{n}")


if __name__ == "__main__":
    main()

Run it — with vLLM serving on a single GPU and a 4B teacher, ~2,000 examples generates in ~10 minutes.

Step 2: Quality Control (the Hardest Part)

Most synthetic datasets are 50% useful and 50% noise. The cleanup is the work. The filters that catch the most damage:

"""Filter synthetic data: dedup, length sanity, keyword guards, diversity sample."""
from __future__ import annotations
import json
from collections import Counter
from pathlib import Path

IN_PATH = Path("data/synthetic.jsonl")
OUT_PATH = Path("data/synthetic_clean.jsonl")


def load(path: Path) -> list[dict]:
    return [json.loads(l) for l in path.read_text().splitlines() if l.strip()]


def dedup(rows: list[dict]) -> list[dict]:
    seen = set()
    out = []
    for r in rows:
        key = (r["instruction"].strip(), r["output"].strip())
        if key in seen: continue
        seen.add(key); out.append(r)
    return out


def length_sane(rows: list[dict], min_out: int = 5, max_out: int = 800) -> list[dict]:
    return [r for r in rows if min_out <= len(r["output"]) <= max_out]


def hallucination_guard(rows: list[dict]) -> list[dict]:
    """Drop outputs that look like the model giving up or going off-topic."""
    bad_phrases = [
        "i don't know", "as an ai language model", "i cannot",
        "sorry, i", "let me think", "[redacted]",
    ]
    return [r for r in rows if not any(b in r["output"].lower() for b in bad_phrases)]


def diversity_sample(rows: list[dict], cap_per_template: int = 50) -> list[dict]:
    """If one input shape dominates, cap it so the dataset isn't 80% one task."""
    counts: Counter[str] = Counter()
    out = []
    for r in rows:
        # Collapse the instruction to its template (rough): first 4 words.
        key = " ".join(r["instruction"].split()[:4])
        if counts[key] >= cap_per_template: continue
        counts[key] += 1; out.append(r)
    return out


def main() -> None:
    rows = load(IN_PATH)
    n0 = len(rows)
    rows = dedup(rows);                     n1 = len(rows)
    rows = length_sane(rows);               n2 = len(rows)
    rows = hallucination_guard(rows);       n3 = len(rows)
    rows = diversity_sample(rows);          n4 = len(rows)
    OUT_PATH.write_text("\n".join(json.dumps(r) for r in rows))
    print(f"raw {n0} → dedup {n1} → length {n2} → halluc {n3} → diversity {n4}")


if __name__ == "__main__":
    main()

Typical numbers: 2,000 raw → 1,200 clean. If your filtered set is 95% of raw, your filters are too lax. If it's 30% of raw, your generation is too noisy.

Step 3: Pick a Student

Rules of thumb:

Same family as teacher when possible. Qwen3-4B teacher → Qwen3-1.5B student. The shared tokeniser and pretraining data make distillation smoother
Instruction-tuned variant if you want chat behaviour. A base model needs more data to learn chat formatting from scratch
Don't drop too far in size. A 4B → 1.5B distillation usually works. 4B → 0.5B often fails because the small model can't hold the same task complexity

For this post: Qwen/Qwen3-1.5B-Instruct.

Step 4: Train the Student

Same Unsloth + SFTTrainer pipeline as Part 6. Save as scripts/distil_train.py:

"""QLoRA fine-tune Qwen3-1.5B on the cleaned synthetic dataset."""
from datasets import load_dataset
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig

MODEL_NAME = "unsloth/Qwen3-1.5B-Instruct"
MAX_SEQ_LENGTH = 2048
DATA_PATH = "data/synthetic_clean.jsonl"
OUTPUT_DIR = "outputs/qwen3-1.5b-distilled"

model, tok = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=None, load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16, lora_alpha=16, lora_dropout=0,
    target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"],
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)


def fmt(ex):
    msgs = [{"role": "user", "content": ex["instruction"]},
             {"role": "assistant", "content": ex["output"]}]
    return {"text": tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=False)}


ds = load_dataset("json", data_files=DATA_PATH, split="train").map(fmt, remove_columns=["instruction","input","output"])
ds = ds.train_test_split(test_size=0.05, seed=42)

trainer = SFTTrainer(
    model=model, tokenizer=tok,
    train_dataset=ds["train"], eval_dataset=ds["test"],
    args=SFTConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=20,
        num_train_epochs=2,
        learning_rate=2e-4,
        bf16=False, fp16=True,
        logging_steps=20, eval_strategy="steps", eval_steps=100,
        save_strategy="steps", save_steps=200, save_total_limit=2,
        output_dir=OUTPUT_DIR,
        max_seq_length=MAX_SEQ_LENGTH,
        dataset_text_field="text",
        report_to="none", seed=42,
    ),
)
trainer.train()
model.save_pretrained(OUTPUT_DIR); tok.save_pretrained(OUTPUT_DIR)
print(f"student adapter saved to {OUTPUT_DIR}")

Step 5: Evaluate Student vs Teacher

The honest comparison is on a held-out set of real inputs, not the synthetic ones. Save as scripts/eval_student.py:

"""Compare student vs teacher on a held-out eval set."""
import json
from openai import OpenAI

teacher = OpenAI(api_key="EMPTY", base_url="http://localhost:8001/v1")
student = OpenAI(api_key="EMPTY", base_url="http://localhost:8002/v1")  # second vLLM, see below

EVAL = [json.loads(l) for l in open("data/real_eval.jsonl")]


def gen(client, model: str, prompt: str) -> str:
    return client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256, temperature=0.0,
    ).choices[0].message.content or ""


def exact_match(a: str, b: str) -> bool:
    return a.strip().lower() == b.strip().lower()


t_hits, s_hits = 0, 0
for case in EVAL:
    expected = case["expected"]
    t = gen(teacher, "domain", case["prompt"])
    s = gen(student, "distilled", case["prompt"])
    t_hits += exact_match(t, expected)
    s_hits += exact_match(s, expected)

print(f"teacher  EM: {t_hits / len(EVAL):.2%}")
print(f"student  EM: {s_hits / len(EVAL):.2%}")
print(f"retained:    {s_hits / max(t_hits, 1):.0%}")

Run a second vLLM process on a different port for the student:

vllm serve outputs/qwen3-1.5b-distilled --port 8002 --enable-lora \
  --lora-modules "distilled=outputs/qwen3-1.5b-distilled" \
  --gpu-memory-utilization 0.4

What good looks like: the student hits 80–95% of the teacher's task performance. Below that and either your synthetic dataset is too small / too noisy, or the student is too small for the task.

When Distillation Fails

Symptom	Cause
Student stuck below 50% of teacher	Student capacity too small, or synthetic dataset too narrow
Student matches teacher on synthetic eval, fails on real	Synthetic data not representative of real distribution — expand seed prompts
Student inherits teacher mistakes	The teacher's wrong answers became labels. Fix the teacher first, regenerate
Student trained but generates garbage	Chat-template mismatch between teacher's outputs and student's expectations

Distillation in the Wild

The chain of "GPT-4 → synthetic data → small open-source model" is how the entire ecosystem of small instruction-tuned models was built in 2023–2024. The technique is the same; you're just doing it on your own task with your own teacher. The outputs of distillation can also be quantized (Part 7) and served on vLLM (Part 8) — the techniques compose.

Key Takeaways

Quantization compresses weights. Distillation compresses behaviour. They solve different problems and they compose
Response-based distillation is the practical flavour. Generate (input, output) from the teacher, train the student on that
Synthetic data quality is the work. Dedup, length sanity, hallucination guards, diversity caps. ~60% retention is normal
Pick a student in the same family as the teacher when possible. 4B → 1.5B usually works; 4B → 0.5B often doesn't
Eval on real inputs, not synthetic ones. The honest metric is "X% of teacher performance retained"
Distilled models can themselves be quantized. The compression techniques compose

Next Up

Part 10 ties everything together. Concurrency patterns, caching at every layer, structured logging, monitoring, graceful degradation. Plus a one-page checklist for shipping AI to production.

Next: Part 10 — Scaling AI Systems