AI/ML paper breakdowns, implementations, and research notes.
Concurrency patterns, caching at every layer, structured logging, monitoring, graceful degradation, and a one-page shipping checklist.
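To give a flavour of the concurrency side, here is a minimal sketch of one common pattern: bounding in-flight work with an asyncio semaphore so a burst of requests cannot exhaust downstream capacity. The limit and the stand-in network call are illustrative, not taken from the post.

```python
import asyncio

async def call_downstream(i: int, sem: asyncio.Semaphore) -> str:
    async with sem:                      # blocks while 8 calls are already in flight
        await asyncio.sleep(0.05)        # stand-in for a real network call
        return f"response {i}"

async def main() -> None:
    sem = asyncio.Semaphore(8)           # cap concurrent downstream work at 8
    results = await asyncio.gather(*(call_downstream(i, sem) for i in range(32)))
    print(len(results))                  # 32

asyncio.run(main())
```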
Distil a fine-tuned 4B teacher into a 1.5B student via synthetic data — with quality-control filters and an honest eval against the teacher.
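As an illustration of the quality-control idea, a minimal sketch assuming the teacher sits behind a local OpenAI-compatible endpoint; the model id, endpoint URL, filters, and thresholds are placeholders, not the post's actual pipeline.

```python
from openai import OpenAI

# Assumed: the fine-tuned 4B teacher is served behind a local OpenAI-compatible endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def passes_filters(prompt: str, completion: str) -> bool:
    """Cheap structural filters: reject empty, truncated, or prompt-echoing outputs."""
    if len(completion.strip()) < 20:
        return False
    if completion.rstrip()[-1] not in ".!?\"'`)":  # likely cut off mid-sentence
        return False
    if prompt.strip() in completion:               # teacher echoed the prompt back
        return False
    return True

def distill_sample(prompt: str) -> dict | None:
    resp = client.chat.completions.create(
        model="teacher-4b",  # placeholder model id
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    completion = resp.choices[0].message.content or ""
    if passes_filters(prompt, completion):
        return {"prompt": prompt, "completion": completion}
    return None
```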
From AWQ checkpoint to a production endpoint — vLLM config, multi-LoRA hot-swapping, prefix caching, and swapping the local endpoint into the Part 5 RAG.
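For orientation, the serving setup described here might look roughly like this with vLLM's offline API; the checkpoint path, adapter name, and adapter path are assumptions for illustration.

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="models/merged-awq",        # assumed path to the AWQ checkpoint
    quantization="awq",
    enable_lora=True,                 # allow per-request LoRA adapters
    enable_prefix_caching=True,       # reuse KV cache across shared prompt prefixes
)

out = llm.generate(
    ["Summarise the attached clause."],
    SamplingParams(max_tokens=128),
    lora_request=LoRARequest("legal-adapter", 1, "adapters/legal"),  # hot-swapped adapter
)
print(out[0].outputs[0].text)
```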
Take a 14 GB FP16 model down to 4 GB AWQ with sub-1% accuracy loss. Full pipeline from merged checkpoint to AWQ + GGUF, with benchmarks.
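The AWQ step itself is compact; a sketch with AutoAWQ, where the paths and quant config are illustrative rather than the post's exact settings.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "models/merged-fp16"  # assumed merged FP16 checkpoint
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

model.quantize(tokenizer, quant_config=quant_config)  # calibrates on a default dataset
model.save_quantized("models/merged-awq")
tokenizer.save_pretrained("models/merged-awq")
```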
QLoRA + Unsloth on a free Colab T4. Dataset prep, hyperparameters that matter, training, evaluation, common failure modes — end to end.
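A minimal Unsloth setup of the kind the post walks through; the base model, rank, and target modules here are assumptions, not the post's chosen values.

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-3B-Instruct-bnb-4bit",  # placeholder 4-bit base model
    max_seq_length=2048,
    load_in_4bit=True,                                   # QLoRA: frozen 4-bit base weights
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                                                # LoRA rank (illustrative)
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```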
Bolting search onto an LLM endpoint isn't RAG yet — it's the first 30%. Chunking, hybrid retrieval, prompt construction, citations, and eval.
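One concrete piece of the hybrid-retrieval story is score fusion; here is a generic reciprocal rank fusion sketch. The function and the k value are conventional, not the post's code.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of doc ids; k dampens the dominance of top ranks."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf_fuse([
    ["d3", "d1", "d2"],   # BM25 ranking
    ["d3", "d1", "d4"],   # vector-search ranking
])
print(fused)  # ['d3', 'd1', 'd2', 'd4']: d3 tops both lists, so it wins
```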
How embeddings work, which model to pick, and how to add semantic search to the FastAPI service from Part 3 with sentence-transformers and Qdrant.
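The indexing-and-query loop is small enough to sketch end to end; collection name, model choice, and sample texts are illustrative.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dim embeddings
client = QdrantClient(":memory:")                 # in-memory instance for the sketch

client.create_collection(
    "docs", vectors_config=VectorParams(size=384, distance=Distance.COSINE)
)
texts = ["Refunds are issued within 14 days.", "Shipping takes 3-5 business days."]
client.upsert("docs", [
    PointStruct(id=i, vector=model.encode(t).tolist(), payload={"text": t})
    for i, t in enumerate(texts)
])
hits = client.search("docs", query_vector=model.encode("how long do refunds take").tolist(), limit=1)
print(hits[0].payload["text"])  # semantic match, no keyword overlap needed
```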
Wrapping the Part 2 LLM client in a production-ready FastAPI service: validation, streaming, auth, rate limiting, and a Dockerfile.
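A pared-down sketch of the service shape: Pydantic request validation plus token streaming. The endpoint path, limits, and the stand-in generator are placeholders for the real client from Part 2.

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel, Field

app = FastAPI()

class ChatRequest(BaseModel):
    prompt: str = Field(min_length=1, max_length=4000)  # reject empty/oversized input

async def fake_token_stream(prompt: str):
    # stand-in for the streaming LLM client from Part 2
    for tok in ("Hello", ", ", "world", "!"):
        yield tok

@app.post("/chat")
async def chat(req: ChatRequest):
    return StreamingResponse(fake_token_stream(req.prompt), media_type="text/plain")
```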
From four lines of OpenAI SDK to a production-grade client wrapper with retries, streaming, structured output, and cost tracking.
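The retry piece, as a minimal sketch: exponential backoff around a chat call. The wrapper name, model id, and backoff numbers are illustrative, not the post's.

```python
import time

from openai import APIError, OpenAI, RateLimitError

client = OpenAI()

def chat_with_retries(messages: list[dict], max_attempts: int = 4) -> str:
    """Retry transient API failures with exponential backoff: 1s, 2s, 4s, ..."""
    for attempt in range(max_attempts):
        try:
            resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
            return resp.choices[0].message.content
        except (RateLimitError, APIError):
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            time.sleep(2 ** attempt)
```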
The Python skills that actually matter when you start building AI systems — type hints, async, context managers, and tests.
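The flavour of Python the post covers, in one snippet: a typed async context manager for timing a block of work. Purely illustrative.

```python
import asyncio
import time
from contextlib import asynccontextmanager
from typing import AsyncIterator

@asynccontextmanager
async def timed(label: str) -> AsyncIterator[None]:
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {time.perf_counter() - start:.3f}s")

async def main() -> None:
    async with timed("sleep"):
        await asyncio.sleep(0.1)

asyncio.run(main())
```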
Fitting a 22B-parameter audio-video diffusion model into a GPU budget shared with an LLM serving stack — with FP8 quantisation, tiled VAE decode, and subprocess-per-job VRAM isolation.
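The subprocess-per-job idea reduces to something very small in code; a sketch where `generate.py` is a hypothetical worker script, not the post's actual entry point.

```python
import subprocess
import sys

def run_job(job_id: str) -> None:
    """Run each diffusion job in its own Python process; VRAM is reclaimed on exit."""
    subprocess.run(
        [sys.executable, "generate.py", "--job-id", job_id],  # hypothetical worker script
        check=True,  # propagate worker failures to the caller
    )
```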
The default LLM product pattern propagates inference failures to the user. For a consumer-tier product where purchase reliability matters more than AI quality, that's the wrong default. Here's the inversion.
How I built a chat assistant grounded in ~1,600 chunks of primary legislation — with hybrid retrieval, careful chunking, and one parser bug that taught me to validate chunks before tuning anything else.
Why big-prompt generation breaks at scale, and how I rebuilt my document generator as a pipeline of small typed LLM calls — with deliberate decisions about RAG, LoRA targets, and prefix caching.
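A sketch of one "small typed LLM call": JSON-mode output validated against a Pydantic schema before the next pipeline step runs. The schema, prompt, and model id are placeholders.

```python
from openai import OpenAI
from pydantic import BaseModel

class SectionPlan(BaseModel):
    title: str
    bullet_points: list[str]

client = OpenAI()

def plan_section(topic: str) -> SectionPlan:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Plan a section about {topic}. "
                       "Respond in JSON with keys 'title' and 'bullet_points'.",
        }],
        response_format={"type": "json_object"},
    )
    # validation failure raises here, before any downstream step consumes the output
    return SectionPlan.model_validate_json(resp.choices[0].message.content)
```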
The patterns, tools, and pitfalls I've learned from building multi-agent systems that go beyond demos.
How PagedAttention borrows from OS virtual memory to achieve 14-24x higher throughput, with code examples for deploying LLMs on GPUs.
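The core idea in miniature: the KV cache is split into fixed-size blocks, and each sequence keeps a "block table" mapping logical to physical blocks, the same trick as OS page tables. A toy sketch, not vLLM code.

```python
BLOCK_SIZE = 4  # tokens per KV-cache block

free_blocks = list(range(8))                 # pool of physical blocks
block_tables: dict[str, list[int]] = {}      # sequence id -> physical block ids

def append_token(seq_id: str, num_tokens: int) -> None:
    """Allocate a new physical block only when a block boundary is crossed."""
    table = block_tables.setdefault(seq_id, [])
    if not table or num_tokens % BLOCK_SIZE == 1:
        table.append(free_blocks.pop(0))

for t in range(1, 10):                       # feed 9 tokens into one sequence
    append_token("seq-A", t)
print(block_tables["seq-A"])                 # [0, 1, 2]: 9 tokens span 3 blocks
```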
How to shrink a 14 GB model to 4 GB and still get usable results — a practical guide with Python code for GPTQ, AWQ, and GGUF.
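Running the GGUF artifact locally is a few lines with llama-cpp-python; the file name and context size below are placeholders.

```python
from llama_cpp import Llama

llm = Llama(model_path="models/model.Q4_K_M.gguf", n_ctx=4096)  # placeholder GGUF file
out = llm("Q: What is 2 + 2? A:", max_tokens=16, stop=["\n"])
print(out["choices"][0]["text"])
```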
A deep dive into the Transformer architecture that revolutionized NLP and became the foundation for modern LLMs.
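The heart of the architecture fits in a few lines: scaled dot-product attention. A generic reference sketch with a toy input, not the post's code.

```python
import numpy as np

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # token-to-token similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V

Q = K = V = np.eye(4, 8)  # toy input: 4 tokens, head dim 8
print(attention(Q, K, V).shape)  # (4, 8)
```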
From research paper to production app — building an LLM-powered auditor for Kenyan company financial statements with RAG, FastAPI, and Next.js.
From UCLA/MIT research paper to a working multi-agent LLM system that debates, analyzes, and trades NSE equities — adapted for frontier market constraints.
Understanding how BERT changed the NLP landscape with bidirectional context and masked language modeling.
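Masked language modelling is easy to poke at directly; a quick sketch using the Hugging Face pipeline API with the original BERT base checkpoint.

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
top = unmasker("The capital of France is [MASK].")[0]
print(top["token_str"], round(top["score"], 3))  # expect "paris" with a high score
```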
Implementing backpropagation and gradient descent without frameworks to understand the fundamentals.
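The post's theme in a few lines: gradient descent on a 1-D least-squares problem with a hand-derived gradient and no framework. The data and learning rate are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)  # ground-truth weight is 3

w = 0.0
lr = 0.1
for _ in range(100):
    grad = 2 * np.mean((w * x - y) * x)  # d/dw of mean((w*x - y)^2)
    w -= lr * grad
print(round(w, 2))  # converges to roughly 3.0
```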
Analyzing how model size, data, and compute affect performance and what it means for AI development.
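The kind of relationship such analyses fit is a power law between scale and loss, estimated on log-log axes. A sketch on synthetic data; the exponent is chosen only to echo the ballpark reported for model size in the scaling-laws literature.

```python
import numpy as np

N = np.array([1e6, 1e7, 1e8, 1e9])   # parameter counts
loss = 5.0 * N ** -0.076             # synthetic power-law "measurements"

# log(loss) = log(a) + b*log(N), so a straight-line fit recovers the exponent
b, log_a = np.polyfit(np.log(N), np.log(loss), 1)
print(f"fitted exponent: {b:.3f}")   # approximately -0.076
```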