Hardware: a CUDA GPU with 8 GB+ VRAM works for both teacher inference (4B model) and student training (1.7B QLoRA). Free Colab T4 works. Python: 3.11+.
Quantization vs Distillation
Quantization compresses weights. Distillation compresses behaviour. They solve different problems and they compose — a distilled model can also be quantized. Why bother with distillation when you've already quantized? Because no amount of quantization gets a 7B model to run on a phone. For that you need a fundamentally smaller architecture, and the cheapest way to make a small model good at your task is to teach it from a big one.
What we'll build:
- Use the Part 6 fine-tuned 4B model as a teacher to generate synthetic (input, output) pairs
- Filter the synthetic data for quality — this is where most distillation projects fail
- Fine-tune a 1.7B student on the synthetic data using the same Unsloth pipeline as Part 6
- Evaluate student vs teacher on the same eval set
Three Flavours of Distillation
| Flavour | What it matches | Practical? |
|---|---|---|
| Response-based | Student matches teacher outputs | Yes — works with any teacher, including closed-source APIs |
| Feature-based | Student matches teacher hidden states | Open-weight teachers only; complex to set up |
| Relation-based | Student matches relationships between teacher outputs | Research-grade; rarely used in practice |
For the rest of this post we use response-based distillation. It's the only flavour that's both practical and robust enough for a tutorial.
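If you want the table in code form, here is a rough, illustrative sketch of what each flavour optimises. None of it is used later in the post; it assumes teacher/student logits and hidden states are already available as PyTorch tensors:
import torch.nn.functional as F
def response_based_loss(student_logits, teacher_token_ids):
    # Reproduce the teacher's generated tokens: plain cross-entropy on teacher outputs.
    # This is the flavour the rest of the post implements (as ordinary SFT).
    return F.cross_entropy(student_logits.flatten(0, 1), teacher_token_ids.flatten())
def feature_based_loss(student_hidden, teacher_hidden, proj):
    # Match intermediate representations; needs open weights and a learned projection
    # because teacher and student hidden sizes differ.
    return F.mse_loss(proj(student_hidden), teacher_hidden)
# Relation-based methods go a step further and match pairwise similarities
# between examples; interesting on paper, rarely worth the plumbing.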
Why You'd Distil
- Latency. A 1.7B model serves at 2–5× the throughput of a 4B on the same hardware
- Cost. Smaller model, smaller GPU, smaller bill
- Privacy / on-device. Running on a phone, or on a laptop without a GPU, is impractical at 7B and perfectly workable at 1.7B
- Edge deployment. Same point: where the GPU isn't
Step 1: Generate Synthetic Data
The teacher is your Part 6 fine-tuned 4B model, served via the Part 8 vLLM endpoint. Save as scripts/generate_synthetic.py:
"""Use the teacher (Part 6 model on Part 8 endpoint) to generate (input, output) pairs."""
from __future__ import annotations
import json
import random
from pathlib import Path
from openai import OpenAI
# Hit the local vLLM endpoint
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8001/v1", timeout=60.0)
TEACHER_MODEL = "domain" # the LoRA alias from Part 8
OUT_PATH = Path("data/synthetic.jsonl")
# Seed prompts — the diversity of these directly determines dataset diversity.
SEEDS = [
"Convert temperature.\n\n{value} F",
"Format as JSON.\n\nname={name} age={age}",
"Summarise in one sentence.\n\n{paragraph}",
]
NAMES = ["Alice", "Bob", "Charlie", "Dana", "Erin", "Felix"]
PARAGRAPHS = [
"FastAPI is a Python web framework with async support and automatic OpenAPI docs.",
"Postgres has supported full-text search natively since version 8.3 via tsvector.",
"Quantization compresses model weights with a small accuracy cost.",
# ... add more
]
def generate_input() -> str:
template = random.choice(SEEDS)
return template.format(
value=random.randint(0, 200),
name=random.choice(NAMES),
age=random.randint(18, 80),
paragraph=random.choice(PARAGRAPHS),
)
def teacher_answer(prompt: str) -> str:
resp = client.chat.completions.create(
model=TEACHER_MODEL,
messages=[{"role": "user", "content": prompt}],
max_tokens=256,
temperature=0.0,
)
return resp.choices[0].message.content or ""
def main(n: int = 2000) -> None:
OUT_PATH.parent.mkdir(parents=True, exist_ok=True)
with open(OUT_PATH, "w") as f:
for i in range(n):
inp = generate_input()
out = teacher_answer(inp)
f.write(json.dumps({"instruction": inp, "input": "", "output": out}) + "\n")
if (i + 1) % 100 == 0:
print(f"generated {i + 1}/{n}")
if __name__ == "__main__":
main()
Run it: with vLLM serving the 4B teacher on a single GPU, generating ~2,000 examples takes about 10 minutes.
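Assuming the Part 8 endpoint is already listening on localhost:8001, the run is just:
python scripts/generate_synthetic.py
Each line of data/synthetic.jsonl is one instruction/input/output record. An illustrative example (exact values will vary): {"instruction": "Convert temperature.\n\n72 F", "input": "", "output": "72 °F is about 22.2 °C."}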
Step 2: Quality Control (the Hardest Part)
Most synthetic datasets are 50% useful and 50% noise; the cleanup is the work. These are the filters that catch the most damage:
"""Filter synthetic data: dedup, length sanity, keyword guards, diversity sample."""
from __future__ import annotations
import json
from collections import Counter
from pathlib import Path
IN_PATH = Path("data/synthetic.jsonl")
OUT_PATH = Path("data/synthetic_clean.jsonl")
def load(path: Path) -> list[dict]:
return [json.loads(l) for l in path.read_text().splitlines() if l.strip()]
def dedup(rows: list[dict]) -> list[dict]:
seen = set()
out = []
for r in rows:
key = (r["instruction"].strip(), r["output"].strip())
if key in seen: continue
seen.add(key); out.append(r)
return out
def length_sane(rows: list[dict], min_out: int = 5, max_out: int = 800) -> list[dict]:
return [r for r in rows if min_out <= len(r["output"]) <= max_out]
def hallucination_guard(rows: list[dict]) -> list[dict]:
"""Drop outputs that look like the model giving up or going off-topic."""
bad_phrases = [
"i don't know", "as an ai language model", "i cannot",
"sorry, i", "let me think", "[redacted]",
]
return [r for r in rows if not any(b in r["output"].lower() for b in bad_phrases)]
def diversity_sample(rows: list[dict], cap_per_template: int = 50) -> list[dict]:
"""If one input shape dominates, cap it so the dataset isn't 80% one task."""
counts: Counter[str] = Counter()
out = []
for r in rows:
# Collapse the instruction to its template (rough): first 4 words.
key = " ".join(r["instruction"].split()[:4])
if counts[key] >= cap_per_template: continue
counts[key] += 1; out.append(r)
return out
def main() -> None:
rows = load(IN_PATH)
n0 = len(rows)
rows = dedup(rows); n1 = len(rows)
rows = length_sane(rows); n2 = len(rows)
rows = hallucination_guard(rows); n3 = len(rows)
rows = diversity_sample(rows); n4 = len(rows)
OUT_PATH.write_text("\n".join(json.dumps(r) for r in rows))
print(f"raw {n0} → dedup {n1} → length {n2} → halluc {n3} → diversity {n4}")
if __name__ == "__main__":
main()
Typical numbers: 2,000 raw → 1,200 clean. If your filtered set is 95% of raw, your filters are too lax. If it's 30% of raw, your generation is too noisy.
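One extra filter that pays off when your seeds include structured-output tasks: check that the output actually satisfies the instruction. Here is a minimal sketch for the "Format as JSON" seed from Step 1; wire it into main() next to the other filters if you use it. Note it will also drop valid JSON wrapped in code fences, so strip fences first if your teacher emits them.
import json

def json_outputs_valid(rows: list[dict]) -> list[dict]:
    """Drop 'Format as JSON' rows whose output doesn't parse as JSON."""
    out = []
    for r in rows:
        if r["instruction"].startswith("Format as JSON"):
            try:
                json.loads(r["output"])
            except json.JSONDecodeError:
                continue  # teacher answered in prose or with broken JSON; drop it
        out.append(r)
    return out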
Step 3: Pick a Student
Rules of thumb:
- Same family as teacher when possible. Qwen3-4B teacher → Qwen3-1.7B student. The shared tokeniser and pretraining data make distillation smoother (see the quick check below)
- Instruction-tuned variant if you want chat behaviour. A base model needs more data to learn chat formatting from scratch
- Don't drop too far in size. A 4B → 1.7B distillation usually works. 4B → 0.6B often fails because the small model can't hold the same task complexity
For this post: Qwen/Qwen3-1.7B (the instruction-tuned chat variant; Qwen3-1.7B-Base is the raw base).
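A quick way to verify the same-family advice from the list above is to check that teacher and student share a tokeniser. A small sketch; it assumes the Part 6 teacher was built on Qwen/Qwen3-4B, so swap in your actual base models:
from transformers import AutoTokenizer

teacher_tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")    # assumed Part 6 teacher base
student_tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")  # student picked above

sample = "Convert temperature.\n\n72 F"
print(teacher_tok.encode(sample) == student_tok.encode(sample))  # True when the family is shared
print(len(teacher_tok), len(student_tok))                        # vocab sizes should match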
Step 4: Train the Student
Same Unsloth + SFTTrainer pipeline as Part 6. Save as scripts/distil_train.py:
"""QLoRA fine-tune Qwen3-1.5B on the cleaned synthetic dataset."""
from unsloth import FastLanguageModel  # import Unsloth first so its patches apply to TRL/transformers
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig
MODEL_NAME = "unsloth/Qwen3-1.5B-Instruct"
MAX_SEQ_LENGTH = 2048
DATA_PATH = "data/synthetic_clean.jsonl"
OUTPUT_DIR = "outputs/qwen3-1.5b-distilled"
model, tok = FastLanguageModel.from_pretrained(
model_name=MODEL_NAME,
max_seq_length=MAX_SEQ_LENGTH,
dtype=None, load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
model,
r=16, lora_alpha=16, lora_dropout=0,
target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"],
bias="none",
use_gradient_checkpointing="unsloth",
random_state=42,
)
def fmt(ex):
msgs = [{"role": "user", "content": ex["instruction"]},
{"role": "assistant", "content": ex["output"]}]
return {"text": tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=False)}
ds = load_dataset("json", data_files=DATA_PATH, split="train").map(fmt, remove_columns=["instruction","input","output"])
ds = ds.train_test_split(test_size=0.05, seed=42)
trainer = SFTTrainer(
model=model, tokenizer=tok,
train_dataset=ds["train"], eval_dataset=ds["test"],
args=SFTConfig(
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
warmup_steps=20,
num_train_epochs=2,
learning_rate=2e-4,
bf16=False, fp16=True,
logging_steps=20, eval_strategy="steps", eval_steps=100,
save_strategy="steps", save_steps=200, save_total_limit=2,
output_dir=OUTPUT_DIR,
max_seq_length=MAX_SEQ_LENGTH,
dataset_text_field="text",
report_to="none", seed=42,
),
)
trainer.train()
model.save_pretrained(OUTPUT_DIR); tok.save_pretrained(OUTPUT_DIR)
print(f"student adapter saved to {OUTPUT_DIR}")
Step 5: Evaluate Student vs Teacher
The honest comparison is on a held-out set of real inputs, not the synthetic ones. Save as scripts/eval_student.py:
"""Compare student vs teacher on a held-out eval set."""
import json
from openai import OpenAI
teacher = OpenAI(api_key="EMPTY", base_url="http://localhost:8001/v1")
student = OpenAI(api_key="EMPTY", base_url="http://localhost:8002/v1") # second vLLM, see below
EVAL = [json.loads(l) for l in open("data/real_eval.jsonl")]
def gen(client, model: str, prompt: str) -> str:
return client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": prompt}],
max_tokens=256, temperature=0.0,
).choices[0].message.content or ""
def exact_match(a: str, b: str) -> bool:
return a.strip().lower() == b.strip().lower()
t_hits, s_hits = 0, 0
for case in EVAL:
expected = case["expected"]
t = gen(teacher, "domain", case["prompt"])
s = gen(student, "distilled", case["prompt"])
t_hits += exact_match(t, expected)
s_hits += exact_match(s, expected)
print(f"teacher EM: {t_hits / len(EVAL):.2%}")
print(f"student EM: {s_hits / len(EVAL):.2%}")
print(f"retained: {s_hits / max(t_hits, 1):.0%}")
Run a second vLLM process on another port for the student, serving the student base model with the distilled LoRA adapter attached (outputs/qwen3-1.7b-distilled contains only the adapter, not a full model):
vllm serve Qwen/Qwen3-1.7B --port 8002 --enable-lora \
  --lora-modules "distilled=outputs/qwen3-1.7b-distilled" \
  --gpu-memory-utilization 0.4
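One caveat on the metric: exact match is brutal for free-form outputs like the summarisation seed, so both models can look artificially bad. A softer option you could swap into eval_student.py is token-level F1 (standard SQuAD-style overlap); a sketch:
from collections import Counter

def token_f1(pred: str, expected: str) -> float:
    """Partial-credit overlap between predicted and expected tokens (0.0 to 1.0)."""
    p, e = pred.lower().split(), expected.lower().split()
    if not p or not e:
        return float(p == e)
    overlap = sum((Counter(p) & Counter(e)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(e)
    return 2 * precision * recall / (precision + recall)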
What good looks like: the student hits 80–95% of the teacher's task performance. Below that, either your synthetic dataset is too small or too noisy, or the student is too small for the task.
When Distillation Fails
| Symptom | Cause |
|---|---|
| Student stuck below 50% of teacher | Student capacity too small, or synthetic dataset too narrow |
| Student matches teacher on synthetic eval, fails on real | Synthetic data not representative of real distribution — expand seed prompts |
| Student inherits teacher mistakes | The teacher's wrong answers became labels. Fix the teacher first, regenerate |
| Student trained but generates garbage | Chat-template mismatch between the teacher's outputs and the student's expectations; see the check below |
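The last failure mode in the table is cheap to rule out: render the chat template the student was trained with and the one used at inference, and eyeball that the special tokens line up. A minimal sketch with the student tokeniser chosen above:
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")
msgs = [{"role": "user", "content": "Convert temperature.\n\n72 F"},
        {"role": "assistant", "content": "72 °F is about 22.2 °C."}]
print(tok.apply_chat_template(msgs, tokenize=False))                                  # what training saw
print(tok.apply_chat_template(msgs[:1], tokenize=False, add_generation_prompt=True))  # what inference sends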
Distillation in the Wild
The chain of "GPT-4 → synthetic data → small open-source model" is how the entire ecosystem of small instruction-tuned models was built in 2023–2024. The technique is the same; you're just doing it on your own task with your own teacher. The outputs of distillation can also be quantized (Part 7) and served on vLLM (Part 8) — the techniques compose.
Key Takeaways
- Quantization compresses weights. Distillation compresses behaviour. They solve different problems and they compose
- Response-based distillation is the practical flavour. Generate (input, output) from the teacher, train the student on that
- Synthetic data quality is the work. Dedup, length sanity, hallucination guards, diversity caps. ~60% retention is normal
- Pick a student in the same family as the teacher when possible. 4B → 1.7B usually works; 4B → 0.6B often doesn't
- Eval on real inputs, not synthetic ones. The honest metric is "X% of teacher performance retained"
- Distilled models can themselves be quantized. The compression techniques compose
Next Up
Part 10 ties everything together. Concurrency patterns, caching at every layer, structured logging, monitoring, graceful degradation. Plus a one-page checklist for shipping AI to production.