Hardware required: a CUDA GPU with at least 8 GB VRAM. Free Colab T4 works for everything in this post — that's the fallback path. Python: 3.11+.
The Real Lesson
The thing every "fine-tune your LLM" tutorial gets wrong: starting with the code. The first 70% of a successful fine-tune is the dataset. The last 30% is the training run. Most people invert this and wonder why their adapter is useless.
This post is the full pipeline, dataset to deployment-ready adapter. We'll fine-tune Qwen/Qwen3-4B-Instruct on a small domain-specific instruction set, save the adapter, smoke-test it. Part 7 quantizes the merged model; Part 8 serves it via vLLM.
When to Fine-Tune
Decision flowchart:
- The model is missing knowledge — recent events, your private docs → RAG (Part 5)
- The model is missing behaviour — a specific output format, tone, code style → fine-tune
- The model is making a style mistake on every call → try prompt engineering first; fine-tune only if prompts can't fix it
- You need both — fine-tune the behaviour, RAG the knowledge. They compose
LoRA in One Paragraph
Freeze the base model's weights. Add small trainable matrices (a low-rank decomposition: for a weight W of shape d_out × d_in, A is r × d_in and B is d_out × r) on top of selected linear layers. Only those matrices train. Inference adds the product B @ A, scaled by alpha / r, to the frozen weight. Result: an adapter file of 50–500 MB instead of a 14 GB model, training in hours instead of days, deployment by swapping which adapter is loaded on top of one shared base.
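The arithmetic is small enough to sketch in a few lines of numpy (toy dimensions, purely illustrative):

```python
import numpy as np

d_out, d_in, r = 64, 64, 8           # toy dimensions; real layers are thousands wide
alpha = 8                            # scaling factor (alpha = r here)

W = np.random.randn(d_out, d_in)     # frozen base weight: never updated
B = np.zeros((d_out, r))             # trainable, zero-initialised
A = np.random.randn(r, d_in) * 0.01  # trainable, small random init

x = np.random.randn(d_in)

# Adapted forward pass: base output plus the scaled low-rank update
y = W @ x + (alpha / r) * (B @ (A @ x))

# B starts at zero, so the adapter is a no-op before any training step
assert np.allclose(y, W @ x)

# Trainable params: r*(d_in + d_out) vs d_in*d_out for the full matrix
print(r * (d_in + d_out), "vs", d_in * d_out)  # 1024 vs 4096
```

Zero-initialising B is what makes training stable: the model starts exactly at the base model's behaviour and drifts from there.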
QLoRA in One Paragraph
Same as LoRA but the frozen base is loaded in 4-bit. Training compute still happens in higher precision (fp16/bf16); only the giant frozen weight matrices are compressed in memory. Result: you can fine-tune a 7B model on 8 GB of VRAM. That's the whole reason a Colab T4 is enough.
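Unsloth handles the 4-bit load for you (see the training script below), but for reference, this is roughly what the same load looks like in plain transformers + bitsandbytes; the NF4 settings are the QLoRA recipe:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4, the QLoRA data type
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16 on a T4
    bnb_4bit_use_double_quant=True,        # quantize the quantization constants too
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```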
Dataset Preparation (the 70%)
Quality > quantity. 500 clean instruction/output pairs beats 5,000 noisy ones every time. Format depends on the model's chat template, but the canonical instruction shape is:
```json
{
  "instruction": "Convert this temperature to Celsius.",
  "input": "98.6 F",
  "output": "37.0 C"
}
```
Save as JSONL, one example per line. The conversion script:
"""Turn a CSV of (instruction, input, output) into a JSONL training file."""
import csv, json, sys
from pathlib import Path
def main(csv_path: str, out_path: str) -> None:
with open(csv_path) as f, open(out_path, "w") as g:
reader = csv.DictReader(f)
for row in reader:
example = {
"instruction": row["instruction"].strip(),
"input": (row.get("input") or "").strip(),
"output": row["output"].strip(),
}
if not example["instruction"] or not example["output"]:
continue
g.write(json.dumps(example) + "\n")
if __name__ == "__main__":
main(sys.argv[1], sys.argv[2])
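Before training, it's worth a one-time sanity pass over the generated file. A minimal checker (assuming the data/train.jsonl path used later in this post):

```python
import json

def validate_jsonl(path: str) -> int:
    """Raise on the first malformed line; return the example count."""
    n = 0
    with open(path) as f:
        for i, line in enumerate(f, 1):
            ex = json.loads(line)  # every line must be standalone valid JSON
            assert ex["instruction"] and ex["output"], f"empty required field on line {i}"
            n += 1
    return n

# print(validate_jsonl("data/train.jsonl"), "examples")
```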
Hyperparameters That Matter
| Parameter | Default | What it does |
|---|---|---|
| Rank r | 16 | How expressive the adapter is. Higher = more capacity, larger file. 8–32 covers most cases |
| Alpha | 16 | Scaling factor. The adapter update is scaled by alpha / r, so raising alpha acts like raising the effective LR. Set alpha = r as a starting point |
| Learning rate | 2e-4 | QLoRA tolerates higher LR than full fine-tunes. 1e-4 to 5e-4 is the safe range |
| Target modules | all-linear | Train all attention + MLP projections. Attention-only is cheaper but worse for structural adherence |
| Dropout | 0 | QLoRA is regularised by quantisation noise; extra dropout usually hurts |
| Epochs | 1–3 | Stop when eval loss flattens. More epochs = overfit |
| Batch size | 1 × grad accum 4 | On a T4, real batch is 1; gradient accumulation simulates 4. Bigger if VRAM allows |
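To get a feel for how rank drives adapter size, a back-of-envelope estimate helps. The layer shapes below are illustrative stand-ins, not exact Qwen3-4B dimensions:

```python
# Every targeted linear of shape (out, in) adds r * (in + out) trainable
# params; stored in fp16 that's 2 bytes each.
def lora_size_mb(r: int, layer_shapes: list[tuple[int, int]], n_layers: int) -> float:
    params = n_layers * sum(r * (o + i) for (o, i) in layer_shapes)
    return params * 2 / 1e6  # fp16 bytes -> MB

# Illustrative shapes: 4 attention projections + gate/up/down MLP projections
shapes = [(2560, 2560)] * 4 + [(9728, 2560)] * 2 + [(2560, 9728)]
for r in (8, 16, 32):
    print(r, round(lora_size_mb(r, shapes, n_layers=36), 1), "MB")
# -> 8 33.0 MB, 16 66.1 MB, 32 132.1 MB
```

Size is linear in r, so doubling the rank doubles the adapter file but not the training time by much.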
Setting Up Unsloth on Colab T4
The whole training run fits in a Colab notebook. Cells in order:
```
!pip install -q unsloth==2026.4 trl==0.12.0 peft==0.13.0 datasets==3.2.0 bitsandbytes==0.45.0
```
Complete Training Script
Save as train.py. This runs locally on an 8 GB GPU or in a Colab cell:
"""QLoRA fine-tune of Qwen3-4B-Instruct on a small instruction dataset."""
from __future__ import annotations
import os
from datasets import load_dataset
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig
MODEL_NAME = "unsloth/Qwen3-4B-Instruct"
MAX_SEQ_LENGTH = 2048
DATASET_PATH = "data/train.jsonl"
OUTPUT_DIR = "outputs/qwen3-4b-domain-adapter"
# 1. Load model in 4-bit
model, tokenizer = FastLanguageModel.from_pretrained(
model_name=MODEL_NAME,
max_seq_length=MAX_SEQ_LENGTH,
dtype=None, # auto: bf16 on Hopper, fp16 on T4
load_in_4bit=True,
)
# 2. Add LoRA adapters
model = FastLanguageModel.get_peft_model(
model,
r=16,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
lora_alpha=16,
lora_dropout=0,
bias="none",
use_gradient_checkpointing="unsloth", # ~30% less VRAM, no quality loss
random_state=42,
)
# 3. Format examples to the model's chat template
def format_example(example: dict) -> dict:
messages = []
user_text = example["instruction"]
if example.get("input"):
user_text += "\n\n" + example["input"]
messages.append({"role": "user", "content": user_text})
messages.append({"role": "assistant", "content": example["output"]})
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
return {"text": text}
dataset = load_dataset("json", data_files=DATASET_PATH, split="train")
dataset = dataset.map(format_example, remove_columns=dataset.column_names)
dataset = dataset.train_test_split(test_size=0.05, seed=42)
# 4. Train
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset["train"],
eval_dataset=dataset["test"],
args=SFTConfig(
per_device_train_batch_size=1,
gradient_accumulation_steps=4,
warmup_steps=10,
num_train_epochs=2,
learning_rate=2e-4,
bf16=False, # T4 doesn't have bf16; use fp16
fp16=True,
logging_steps=10,
eval_strategy="steps",
eval_steps=50,
save_strategy="steps",
save_steps=100,
save_total_limit=2,
output_dir=OUTPUT_DIR,
max_seq_length=MAX_SEQ_LENGTH,
dataset_text_field="text",
report_to="none",
seed=42,
),
)
trainer.train()
# 5. Save adapter
model.save_pretrained(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
print(f"adapter saved to {OUTPUT_DIR}")
Sample Dataset
50 examples in a domain you care about. Save as data/train.jsonl:
{"instruction": "Summarise in one sentence.", "input": "FastAPI is a Python web framework with async support, automatic OpenAPI docs, and Pydantic validation built in.", "output": "FastAPI is an async Python web framework that auto-generates docs and validates with Pydantic."}
{"instruction": "Convert temperature.", "input": "98.6 F", "output": "37.0 C"}
{"instruction": "Format as JSON.", "input": "name=Alice age=30", "output": "{\"name\": \"Alice\", \"age\": 30}"}
// ... 47 more pairs in the same domain
Replace these examples with whatever your actual task is. Customer-support replies, code reviews, classification labels — the format is the same.
Monitoring the Run
Watch two numbers:
- Train loss — should decrease smoothly. If it's spiky, your LR is too high. If it plateaus instantly, something's wrong with your data formatting
- Eval loss — should decrease, then plateau. When it starts going up, you're overfitting; stop now
The Unsloth trainer prints both every 10 steps. A healthy small-dataset QLoRA run looks like train loss falling from ~2.0 to ~0.6 over a few hundred steps, eval loss following at a slight gap.
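Both curves also live in trainer.state.log_history (a list of dicts), so you can inspect them programmatically. A sketch with stand-in log entries:

```python
# trainer.state.log_history mixes entry types: dicts with "loss" are train
# logs, dicts with "eval_loss" come from the eval steps.
log_history = [                      # stand-in for trainer.state.log_history
    {"step": 10, "loss": 1.9},
    {"step": 50, "loss": 1.2},
    {"step": 50, "eval_loss": 1.3},
    {"step": 100, "loss": 0.8},
    {"step": 100, "eval_loss": 1.0},
]

train_curve = [(e["step"], e["loss"]) for e in log_history if "loss" in e]
eval_curve = [(e["step"], e["eval_loss"]) for e in log_history if "eval_loss" in e]

# Crude early-stop signal: eval loss climbing back above its minimum so far
best = min(loss for _, loss in eval_curve)
overfitting = eval_curve[-1][1] > best
print(train_curve, eval_curve, overfitting)
```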
Smoke-Test the Adapter
"""Generate a few outputs from the trained adapter to eyeball quality."""
from unsloth import FastLanguageModel
model, tok = FastLanguageModel.from_pretrained(
model_name="outputs/qwen3-4b-domain-adapter",
max_seq_length=2048,
load_in_4bit=True,
)
FastLanguageModel.for_inference(model)
prompts = [
"Convert temperature.\n\n75 F",
"Format as JSON.\n\nname=Bob age=42",
]
for p in prompts:
inputs = tok.apply_chat_template(
[{"role": "user", "content": p}],
tokenize=True, add_generation_prompt=True, return_tensors="pt",
).to("cuda")
out = model.generate(input_ids=inputs, max_new_tokens=80, temperature=0.0)
print(tok.decode(out[0], skip_special_tokens=True))
print("---")
Evaluating Properly
Eyeballing a few outputs is fine for "did training do anything?" but you need a real eval before you trust the adapter:
- Task metric. The thing the model is supposed to do: exact match for classification, ROUGE for summarisation, pass rate for code. Build it from a held-out set you didn't train on
- Sanity check on a general benchmark. Run an MMLU subset (or a 100-example sample). If general capability has collapsed, you've overfit
- Comparison against the base model. If your fine-tune isn't measurably better than the base, it's a failed run. This catches the silent failure of "I trained for 5 hours and changed nothing"
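For the task-metric case, an exact-match scorer is a few lines. Here generate_fn is a hypothetical stand-in for whatever inference call you use, so the same scorer runs against both the adapter and the base model:

```python
import json

def exact_match(eval_path: str, generate_fn) -> float:
    """Fraction of held-out examples where the model output matches exactly."""
    hits = total = 0
    with open(eval_path) as f:
        for line in f:
            ex = json.loads(line)
            prompt = ex["instruction"] + (("\n\n" + ex["input"]) if ex.get("input") else "")
            hits += generate_fn(prompt).strip() == ex["output"].strip()
            total += 1
    return hits / total

# exact_match("data/eval.jsonl", adapter_generate)
# exact_match("data/eval.jsonl", base_generate)
```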
Adapter or Merged?
For deployment you have two choices:
| Option | When |
|---|---|
| Keep adapter separate; load on top of base at runtime | You serve multiple specialised adapters from one base (vLLM does this natively — see Part 8) |
| Merge adapter into base, save as a single model | You deploy one specialised model. Quantize the merged result — that's Part 7 |
For this series we merge, because Part 7 quantizes the merged model. The merge is one line:
```python
model = model.merge_and_unload()
model.save_pretrained("outputs/qwen3-4b-domain-merged")
tok.save_pretrained("outputs/qwen3-4b-domain-merged")
```
Common Failure Modes
| Symptom | Cause | Fix |
|---|---|---|
| Model parrots training data verbatim | Overfit (loss went to ~0) | Fewer epochs or more data |
| Loss flat from step 1 | Wrong chat template; tokenizer mismatch | Verify apply_chat_template output matches what the base model expects |
| Eval loss climbs | Overfitting or LR too high | Lower LR, early-stop at the eval-loss minimum |
| Output is garbage post-training | Stop tokens wrong, or you trained on the wrong field | Print formatted training examples; confirm the assistant turn ends correctly |
| Adapter is trained but saved poorly | Forgot save_pretrained, saved checkpoint dir instead | Always save to a clean output dir; verify the dir contains adapter_model.safetensors |
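The last failure mode is cheap to guard against with a post-save check (the two filenames are what peft's save_pretrained writes for a LoRA adapter):

```python
from pathlib import Path

def check_adapter_dir(path: str) -> None:
    """Raise if the directory doesn't look like a saved LoRA adapter."""
    d = Path(path)
    required = ["adapter_model.safetensors", "adapter_config.json"]
    missing = [name for name in required if not (d / name).exists()]
    if missing:
        raise FileNotFoundError(f"{path} is missing {missing}; re-run save_pretrained")

# check_adapter_dir("outputs/qwen3-4b-domain-adapter")
```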
Key Takeaways
- Fine-tune for behaviour. RAG for knowledge. Don't fine-tune to add facts — it works badly and takes forever
- Dataset is 70% of the work. 500 clean examples beat 5,000 noisy ones
- QLoRA + Unsloth fits 7B fine-tunes on a free Colab T4
- Default to r=16, alpha=16, all-linear, dropout=0, lr=2e-4, epochs=1–3. Tune one knob at a time
- Monitor train and eval loss. Stop early when eval loss plateaus or rises
- Always evaluate against a held-out set and against the base model. A fine-tune that doesn't beat the base is a failed run
- Merge for single-model deployment; keep adapter separate for multi-adapter serving
Next Up
Part 7 takes the merged model from this part and quantizes it — AWQ for vLLM serving, GGUF for Ollama / llama.cpp / Apple Silicon. The deeper dive on quantization theory lives at Demystifying LLM Quantization.