Posts

AI/ML paper breakdowns, implementations, and research notes.

May 3, 2026 · Tutorial

Part 10: Scaling AI Systems

Concurrency patterns, caching at every layer, structured logging, monitoring, graceful degradation, and a one-page shipping checklist.

AI Engineer Path · Scaling · Production

May 2, 2026 · Tutorial

Part 9: Knowledge Distillation

Distil a fine-tuned 4B teacher into a 1.5B student via synthetic data — with quality-control filters and an honest eval against the teacher.

AI Engineer Path · Distillation · Synthetic Data

May 1, 2026 · Tutorial

Part 8: Serving Quantized Models with vLLM

From AWQ checkpoint to a production endpoint — vLLM config, multi-LoRA hot-swapping, prefix caching, and swapping the local endpoint into the Part 5 RAG.

AI Engineer Path · vLLM · Serving

April 30, 2026 · Tutorial

Part 7: Quantization for Deployment

Take a 14 GB FP16 model down to 4 GB AWQ with sub-1% accuracy loss. Full pipeline from merged checkpoint to AWQ + GGUF, with benchmarks.

AI Engineer Path · Quantization · AWQ

April 29, 2026 · Tutorial

Part 6: Fine-Tuning with LoRA and QLoRA

QLoRA + Unsloth on a free Colab T4. Dataset prep, hyperparameters that matter, training, evaluation, common failure modes — end to end.

AI Engineer Path · LoRA · Unsloth

April 28, 2026 · Tutorial

Part 5: Building a Production RAG System

Bolting search onto an LLM endpoint isn't RAG yet — it's the first 30%. Chunking, hybrid retrieval, prompt construction, citations, and eval.

AI Engineer Path · RAG · Hybrid Search

April 27, 2026 · Tutorial

Part 4: Embeddings and Vector Search

How embeddings work, which model to pick, and how to add semantic search to the FastAPI service from Part 3 with sentence-transformers and Qdrant.

AI Engineer Path · Embeddings · Qdrant

April 26, 2026 · Tutorial

Part 3: Building APIs with FastAPI

Wrapping the Part 2 LLM client in a production-ready FastAPI service: validation, streaming, auth, rate limiting, and a Dockerfile.

AI Engineer Path · FastAPI · Authentication

April 25, 2026 · Tutorial

Part 2: Your First LLM API Call

From four lines of OpenAI SDK to a production-grade client wrapper with retries, streaming, structured output, and cost tracking.

AI Engineer Path · LLM API · Streaming

April 24, 2026 · Tutorial

Part 1: Python Foundations for AI Engineers

The Python skills that actually matter when you start building AI systems — type hints, async, context managers, and tests.

AI Engineer Path · Python · Fundamentals

May 3, 2026 · Deep Dive

Running a 22B Audio-Video Diffusion Model on a Single GPU

Fitting a 22B-parameter audio-video diffusion model into a GPU budget shared with an LLM serving stack — with FP8 quantization, tiled VAE decode, and subprocess-per-job VRAM isolation.

Diffusion · FP8 · Blackwell

May 3, 2026 · Implementation

Designing a Consumer-Tier AI Product That Degrades Gracefully

The default LLM product pattern propagates inference failures to the user. For a consumer-tier product where purchase reliability matters more than AI quality, that's the wrong default. Here's the inversion.

AI Products · System Design · LLM

May 3, 2026 · Implementation

Building a Hybrid-RAG Assistant That Doesn't Hallucinate Statute

How I built a chat assistant grounded in ~1,600 chunks of primary legislation — with hybrid retrieval, careful chunking, and one parser bug that taught me to validate chunks before tuning anything else.

RAG · Qdrant · Hybrid Retrieval

May 3, 2026 · Implementation

Generating Long-Form Compliance Documents with a Single LLM Call Pipeline

Why big-prompt generation breaks at scale, and how I rebuilt my document generator as a pipeline of small typed LLM calls — with deliberate decisions about RAG, LoRA targets, and prefix caching.

LLM · vLLM · LoRA

Apr 16, 2026 · Implementation

How I Build AI Agents That Actually Work in Production

The patterns, tools, and pitfalls I've learned from building multi-agent systems that go beyond demos.

AI Agents · LangGraph · ReAct

Apr 16, 2026 · Deep Dive

How I Get the Best Out of My GPU Using vLLM for Local LLM Production

How PagedAttention borrows from OS virtual memory to deliver 14-24x higher throughput, with code examples for deploying LLMs on GPUs.

vLLM · PagedAttention · GPU

Apr 14, 2026 · Paper Breakdown

Demystifying LLM Quantization: GPTQ, AWQ & GGUF

How to shrink a 14 GB model to 4 GB and still get usable results — a practical guide with Python code for GPTQ, AWQ, and GGUF.

Quantization · LLMs · GPTQ

Feb 5, 2026 · Paper Breakdown

Understanding Attention Is All You Need

A deep dive into the Transformer architecture that revolutionized NLP and became the foundation for modern LLMs.

Transformers · NLP · Deep Learning

Feb 15, 2026 · Implementation

Automating Financial Statement Audits with LLMs

From research paper to production app — building an LLM-powered auditor for Kenyan company financial statements with RAG, FastAPI, and Next.js.

LLMs · RAG · FastAPI

Feb 18, 2026 · Implementation

Building TradingAgents for the NSE

From a UCLA/MIT research paper to a working multi-agent LLM system that debates, analyzes, and trades NSE equities — adapted for frontier-market constraints.

LangGraph · Multi-Agent · NSE

Coming Soon · Paper Breakdown

BERT: Pre-training Deep Bidirectional Transformers

Understanding how BERT changed the NLP landscape with bidirectional context and masked language modeling.

BERT · NLP · Pre-training

Coming Soon · Implementation

Building a Neural Network from Scratch in NumPy

Implementing backpropagation and gradient descent without frameworks to understand the fundamentals.

Neural Networks · NumPy · From Scratch

Coming Soon · Research

Scaling Laws in Large Language Models

Analyzing how model size, data, and compute affect performance and what it means for AI development.

Scaling · LLMs · Research