Hardware required: a CUDA GPU with at least 8 GB VRAM. Free Colab T4 works for everything in this post — that's the fallback path. Python: 3.11+.
The Real Lesson
The thing every "fine-tune your LLM" tutorial gets wrong: starting with the code. The first 70% of a successful fine-tune is the dataset. The last 30% is the training run. Most people invert this and wonder why their adapter is useless.
This post is the full pipeline, dataset to deployment-ready adapter. We'll fine-tune Qwen/Qwen3-4B-Instruct on a small domain-specific instruction set, save the adapter, smoke-test it. Part 7 quantizes the merged model; Part 8 serves it via vLLM.
When to Fine-Tune
Decision flowchart:
- The model is missing knowledge — recent events, your private docs → RAG (Part 5)
- The model is missing behaviour — a specific output format, tone, code style → fine-tune
- The model is making a style mistake on every call → try prompt engineering first; fine-tune only if prompts can't fix it
- You need both — fine-tune the behaviour, RAG the knowledge. They compose
LoRA in One Paragraph
Freeze the base model's weights. Add small trainable matrices (a low-rank decomposition: for a weight W of shape d_out × d_in, A is r × d_in and B is d_out × r) on top of selected linear layers. Only those matrices train. Inference adds the product B @ A, scaled by alpha / r, to the frozen weight. Result: an adapter file of 50–500 MB instead of a 14 GB model, training in hours instead of days, deployment by swapping which adapter is loaded on top of one shared base.
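The arithmetic is small enough to sketch in a few lines of numpy (toy dimensions, purely illustrative):

```python
import numpy as np

d_out, d_in, r = 64, 64, 8           # toy dimensions; real layers are thousands wide
alpha = 8                            # scaling factor (alpha = r here)

W = np.random.randn(d_out, d_in)     # frozen base weight: never updated
B = np.zeros((d_out, r))             # trainable, zero-initialised
A = np.random.randn(r, d_in) * 0.01  # trainable, small random init

x = np.random.randn(d_in)

# Adapted forward pass: base output plus the scaled low-rank update
y = W @ x + (alpha / r) * (B @ (A @ x))

# B starts at zero, so the adapter is a no-op before any training step
assert np.allclose(y, W @ x)

# Trainable params: r*(d_in + d_out) vs d_in*d_out for the full matrix
print(r * (d_in + d_out), "vs", d_in * d_out)  # 1024 vs 4096
```

Zero-initialising B is what makes training stable: the model starts exactly at the base model's behaviour and drifts from there.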
QLoRA in One Paragraph
Same as LoRA but the frozen base is loaded in 4-bit. Training compute still happens in higher precision (fp16/bf16); only the giant frozen weight matrices are compressed in memory. Result: you can fine-tune a 7B model on 8 GB of VRAM. That's the whole reason a Colab T4 is enough.
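Unsloth handles the 4-bit load for you (see the training script below), but for reference, this is roughly what the same load looks like in plain transformers + bitsandbytes; the NF4 settings are the QLoRA recipe:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4, the QLoRA data type
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16 on a T4
    bnb_4bit_use_double_quant=True,        # quantize the quantization constants too
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```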
Dataset Preparation (the 70%)
Quality > quantity. 500 clean instruction/output pairs beats 5,000 noisy ones every time. Format depends on the model's chat template, but the canonical instruction shape is:
```json
{
  "instruction": "Convert this temperature to Celsius.",
  "input": "98.6 F",
  "output": "37.0 C"
}
```
Save as JSONL, one example per line. The conversion script:
"""Turn a CSV of (instruction, input, output) into a JSONL training file."""
import csv, json, sys
from pathlib import Path
def main(csv_path: str, out_path: str) -> None:
with open(csv_path) as f, open(out_path, "w") as g:
reader = csv.DictReader(f)
for row in reader:
example = {
"instruction": row["instruction"].strip(),
"input": (row.get("input") or "").strip(),
"output": row["output"].strip(),
}
if not example["instruction"] or not example["output"]:
continue
g.write(json.dumps(example) + "\n")
if __name__ == "__main__":
main(sys.argv[1], sys.argv[2])
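Before training, it's worth a one-time sanity pass over the generated file. A minimal checker (assuming the data/train.jsonl path used later in this post):

```python
import json

def validate_jsonl(path: str) -> int:
    """Raise on the first malformed line; return the example count."""
    n = 0
    with open(path) as f:
        for i, line in enumerate(f, 1):
            ex = json.loads(line)  # every line must be standalone valid JSON
            assert ex["instruction"] and ex["output"], f"empty required field on line {i}"
            n += 1
    return n

# print(validate_jsonl("data/train.jsonl"), "examples")
```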
Hyperparameters That Matter
| Parameter | Default | What it does |
|---|---|---|
| Rank r | 16 | How expressive the adapter is. Higher = more capacity, larger file. 8–32 covers most cases |
| Alpha | 16 | Scaling factor. The adapter update is scaled by alpha / r, so raising alpha acts like raising the effective LR. Set alpha = r as a starting point |
| Learning rate | 2e-4 | QLoRA tolerates higher LR than full fine-tunes. 1e-4 to 5e-4 is the safe range |
| Target modules | all-linear | Train all attention + MLP projections. Attention-only is cheaper but worse for structural adherence |
| Dropout | 0 | QLoRA is regularised by quantisation noise; extra dropout usually hurts |
| Epochs | 1–3 | Stop when eval loss flattens. More epochs = overfit |
| Batch size | 1 × grad accum 4 | On a T4, real batch is 1; gradient accumulation simulates 4. Bigger if VRAM allows |
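To get a feel for how rank drives adapter size, a back-of-envelope estimate helps. The layer shapes below are illustrative stand-ins, not exact Qwen3-4B dimensions:

```python
# Every targeted linear of shape (out, in) adds r * (in + out) trainable
# params; stored in fp16 that's 2 bytes each.
def lora_size_mb(r: int, layer_shapes: list[tuple[int, int]], n_layers: int) -> float:
    params = n_layers * sum(r * (o + i) for (o, i) in layer_shapes)
    return params * 2 / 1e6  # fp16 bytes -> MB

# Illustrative shapes: 4 attention projections + gate/up/down MLP projections
shapes = [(2560, 2560)] * 4 + [(9728, 2560)] * 2 + [(2560, 9728)]
for r in (8, 16, 32):
    print(r, round(lora_size_mb(r, shapes, n_layers=36), 1), "MB")
# -> 8 33.0 MB, 16 66.1 MB, 32 132.1 MB
```

Size is linear in r, so doubling the rank doubles the adapter file but not the training time by much.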
Setting Up Unsloth on Colab T4
The whole training run fits in a Colab notebook. Cells in order:
```
!pip install -q unsloth==2026.4 trl==0.12.0 peft==0.13.0 datasets==3.2.0 bitsandbytes==0.45.0
```
Complete Training Script
Save as train.py. This runs locally on an 8 GB GPU or in a Colab cell:
"""QLoRA fine-tune of Qwen3-4B-Instruct on a small instruction dataset."""
from __future__ import annotations
import os
from datasets import load_dataset
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig
MODEL_NAME = "unsloth/Qwen3-4B-Instruct"
MAX_SEQ_LENGTH = 2048
DATASET_PATH = "data/train.jsonl"
OUTPUT_DIR = "outputs/qwen3-4b-domain-adapter"
# 1. Load model in 4-bit
model, tokenizer = FastLanguageModel.from_pretrained(
model_name=MODEL_NAME,
max_seq_length=MAX_SEQ_LENGTH,
dtype=None, # auto: bf16 on Hopper, fp16 on T4
load_in_4bit=True,
)
# 2. Add LoRA adapters
model = FastLanguageModel.get_peft_model(
model,
r=16,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
lora_alpha=16,
lora_dropout=0,
bias="none",
use_gradient_checkpointing="unsloth", # ~30% less VRAM, no quality loss
random_state=42,
)
# 3. Format examples to the model's chat template
def format_example(example: dict) -> dict:
messages = []
user_text = example["instruction"]
if example.get("input"):
user_text += "\n\n" + example["input"]
messages.append({"role": "user", "content": user_text})
messages.append({"role": "assistant", "content": example["output"]})
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
return {"text": text}
dataset = load_dataset("json", data_files=DATASET_PATH, split="train")
dataset = dataset.map(format_example, remove_columns=dataset.column_names)
dataset = dataset.train_test_split(test_size=0.05, seed=42)
# 4. Train
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset["train"],
eval_dataset=dataset["test"],
args=SFTConfig(
per_device_train_batch_size=1,
gradient_accumulation_steps=4,
warmup_steps=10,
num_train_epochs=2,
learning_rate=2e-4,
bf16=False, # T4 doesn't have bf16; use fp16
fp16=True,
logging_steps=10,
eval_strategy="steps",
eval_steps=50,
save_strategy="steps",
save_steps=100,
save_total_limit=2,
output_dir=OUTPUT_DIR,
max_seq_length=MAX_SEQ_LENGTH,
dataset_text_field="text",
report_to="none",
seed=42,
),
)
trainer.train()
# 5. Save adapter
model.save_pretrained(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
print(f"adapter saved to {OUTPUT_DIR}")
Sample Dataset
50 examples in a domain you care about. Save as data/train.jsonl:
{"instruction": "Summarise in one sentence.", "input": "FastAPI is a Python web framework with async support, automatic OpenAPI docs, and Pydantic validation built in.", "output": "FastAPI is an async Python web framework that auto-generates docs and validates with Pydantic."}
{"instruction": "Convert temperature.", "input": "98.6 F", "output": "37.0 C"}
{"instruction": "Format as JSON.", "input": "name=Alice age=30", "output": "{\"name\": \"Alice\", \"age\": 30}"}
// ... 47 more pairs in the same domain
Replace these examples with whatever your actual task is. Customer-support replies, code reviews, classification labels — the format is the same.
Monitoring the Run
Watch two numbers:
- Train loss — should decrease smoothly. If it's spiky, your LR is too high. If it plateaus instantly, something's wrong with your data formatting
- Eval loss — should decrease, then plateau. When it starts going up, you're overfitting; stop now
The Unsloth trainer prints both every 10 steps. A healthy small-dataset QLoRA run looks like train loss falling from ~2.0 to ~0.6 over a few hundred steps, eval loss following at a slight gap.
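Both curves also live in trainer.state.log_history (a list of dicts), so you can inspect them programmatically. A sketch with stand-in log entries:

```python
# trainer.state.log_history mixes entry types: dicts with "loss" are train
# logs, dicts with "eval_loss" come from the eval steps.
log_history = [                      # stand-in for trainer.state.log_history
    {"step": 10, "loss": 1.9},
    {"step": 50, "loss": 1.2},
    {"step": 50, "eval_loss": 1.3},
    {"step": 100, "loss": 0.8},
    {"step": 100, "eval_loss": 1.0},
]

train_curve = [(e["step"], e["loss"]) for e in log_history if "loss" in e]
eval_curve = [(e["step"], e["eval_loss"]) for e in log_history if "eval_loss" in e]

# Crude early-stop signal: eval loss climbing back above its minimum so far
best = min(loss for _, loss in eval_curve)
overfitting = eval_curve[-1][1] > best
print(train_curve, eval_curve, overfitting)
```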
Smoke-Test the Adapter
"""Generate a few outputs from the trained adapter to eyeball quality."""
from unsloth import FastLanguageModel
model, tok = FastLanguageModel.from_pretrained(
model_name="outputs/qwen3-4b-domain-adapter",
max_seq_length=2048,
load_in_4bit=True,
)
FastLanguageModel.for_inference(model)
prompts = [
"Convert temperature.\n\n75 F",
"Format as JSON.\n\nname=Bob age=42",
]
for p in prompts:
inputs = tok.apply_chat_template(
[{"role": "user", "content": p}],
tokenize=True, add_generation_prompt=True, return_tensors="pt",
).to("cuda")
out = model.generate(input_ids=inputs, max_new_tokens=80, temperature=0.0)
print(tok.decode(out[0], skip_special_tokens=True))
print("---")
Evaluating Properly
Eyeballing a few outputs is fine for "did training do anything?" but you need a real eval before you trust the adapter:
- Task metric. The thing the model is supposed to do: exact match for classification, ROUGE for summarisation, pass rate for code. Build it from a held-out set you didn't train on
- Sanity check on a general benchmark. Run an MMLU subset (or a 100-example sample). If general capability has collapsed, you've overfit
- Comparison against the base model. If your fine-tune isn't measurably better than the base, it's a failed run. This catches the silent failure of "I trained for 5 hours and changed nothing"
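For the task-metric case, an exact-match scorer is a few lines. Here generate_fn is a hypothetical stand-in for whatever inference call you use, so the same scorer runs against both the adapter and the base model:

```python
import json

def exact_match(eval_path: str, generate_fn) -> float:
    """Fraction of held-out examples where the model output matches exactly."""
    hits = total = 0
    with open(eval_path) as f:
        for line in f:
            ex = json.loads(line)
            prompt = ex["instruction"] + (("\n\n" + ex["input"]) if ex.get("input") else "")
            hits += generate_fn(prompt).strip() == ex["output"].strip()
            total += 1
    return hits / total

# exact_match("data/eval.jsonl", adapter_generate)
# exact_match("data/eval.jsonl", base_generate)
```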
Adapter or Merged?
For deployment you have two choices:
| Option | When |
|---|---|
| Keep adapter separate; load on top of base at runtime | You serve multiple specialised adapters from one base (vLLM does this natively — see Part 8) |
| Merge adapter into base, save as a single model | You deploy one specialised model. Quantize the merged result — that's Part 7 |
For this series we merge, because Part 7 quantizes the merged model. The merge is one line:
```python
model = model.merge_and_unload()
model.save_pretrained("outputs/qwen3-4b-domain-merged")
tok.save_pretrained("outputs/qwen3-4b-domain-merged")
```
Common Failure Modes
| Symptom | Cause | Fix |
|---|---|---|
| Model parrots training data verbatim | Overfit (loss went to ~0) | Fewer epochs or more data |
| Loss flat from step 1 | Wrong chat template; tokenizer mismatch | Verify apply_chat_template output matches what the base model expects |
| Eval loss climbs | Overfitting or LR too high | Lower LR, early-stop at the eval-loss minimum |
| Output is garbage post-training | Stop tokens wrong, or you trained on the wrong field | Print formatted training examples; confirm the assistant turn ends correctly |
| Adapter is trained but saved poorly | Forgot save_pretrained, saved checkpoint dir instead | Always save to a clean output dir; verify the dir contains adapter_model.safetensors |
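The last failure mode is cheap to guard against with a post-save check (the two filenames are what peft's save_pretrained writes for a LoRA adapter):

```python
from pathlib import Path

def check_adapter_dir(path: str) -> None:
    """Raise if the directory doesn't look like a saved LoRA adapter."""
    d = Path(path)
    required = ["adapter_model.safetensors", "adapter_config.json"]
    missing = [name for name in required if not (d / name).exists()]
    if missing:
        raise FileNotFoundError(f"{path} is missing {missing}; re-run save_pretrained")

# check_adapter_dir("outputs/qwen3-4b-domain-adapter")
```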
Key Takeaways
- Fine-tune for behaviour. RAG for knowledge. Don't fine-tune to add facts — it works badly and takes forever
- Dataset is 70% of the work. 500 clean examples beat 5,000 noisy ones
- QLoRA + Unsloth fits 7B fine-tunes on a free Colab T4
- Default to r=16, alpha=16, all-linear, dropout=0, lr=2e-4, epochs=1–3. Tune one knob at a time
- Monitor train and eval loss. Stop early when eval loss plateaus or rises
- Always evaluate against a held-out set and against the base model. A fine-tune that doesn't beat the base is a failed run
- Merge for single-model deployment; keep adapter separate for multi-adapter serving
Next Up
Part 7 takes the merged model from this part and quantizes it — AWQ for vLLM serving, GGUF for Ollama / llama.cpp / Apple Silicon. The deeper dive on quantization theory lives at Demystifying LLM Quantization.