AI/ML paper breakdowns, implementations, and research notes.
Concurrency patterns, caching at every layer, structured logging, monitoring, graceful degradation, and a one-page shipping checklist.
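To give a flavour of the concurrency side, here is a minimal sketch of one common pattern: bounding in-flight work with an asyncio semaphore so a burst of requests cannot exhaust downstream capacity. The limit and the stand-in network call are illustrative, not taken from the post.

```python
import asyncio

async def call_downstream(i: int, sem: asyncio.Semaphore) -> str:
    async with sem:                      # blocks while 8 calls are already in flight
        await asyncio.sleep(0.05)        # stand-in for a real network call
        return f"response {i}"

async def main() -> None:
    sem = asyncio.Semaphore(8)           # cap concurrent downstream work at 8
    results = await asyncio.gather(*(call_downstream(i, sem) for i in range(32)))
    print(len(results))                  # 32

asyncio.run(main())
```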
Distil a fine-tuned 4B teacher into a 1.5B student via synthetic data — with quality-control filters and an honest eval against the teacher.
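As an illustration of the quality-control idea, a minimal sketch assuming the teacher sits behind a local OpenAI-compatible endpoint; the model id, endpoint URL, filters, and thresholds are placeholders, not the post's actual pipeline.

```python
from openai import OpenAI

# Assumed: the fine-tuned 4B teacher is served behind a local OpenAI-compatible endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def passes_filters(prompt: str, completion: str) -> bool:
    """Cheap structural filters: reject empty, truncated, or prompt-echoing outputs."""
    if len(completion.strip()) < 20:
        return False
    if completion.rstrip()[-1] not in ".!?\"'`)":  # likely cut off mid-sentence
        return False
    if prompt.strip() in completion:               # teacher echoed the prompt back
        return False
    return True

def distill_sample(prompt: str) -> dict | None:
    resp = client.chat.completions.create(
        model="teacher-4b",  # placeholder model id
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    completion = resp.choices[0].message.content or ""
    if passes_filters(prompt, completion):
        return {"prompt": prompt, "completion": completion}
    return None
```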
From AWQ checkpoint to a production endpoint — vLLM config, multi-LoRA hot-swapping, prefix caching, and swapping the local endpoint into the Part 5 RAG.
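For orientation, the serving setup described here might look roughly like this with vLLM's offline API; the checkpoint path, adapter name, and adapter path are assumptions for illustration.

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="models/merged-awq",        # assumed path to the AWQ checkpoint
    quantization="awq",
    enable_lora=True,                 # allow per-request LoRA adapters
    enable_prefix_caching=True,       # reuse KV cache across shared prompt prefixes
)

out = llm.generate(
    ["Summarise the attached clause."],
    SamplingParams(max_tokens=128),
    lora_request=LoRARequest("legal-adapter", 1, "adapters/legal"),  # hot-swapped adapter
)
print(out[0].outputs[0].text)
```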
Take a 14 GB FP16 model down to 4 GB AWQ with sub-1% accuracy loss. Full pipeline from merged checkpoint to AWQ + GGUF, with benchmarks.
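The AWQ step itself is compact; a sketch with AutoAWQ, where the paths and quant config are illustrative rather than the post's exact settings.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "models/merged-fp16"  # assumed merged FP16 checkpoint
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

model.quantize(tokenizer, quant_config=quant_config)  # calibrates on a default dataset
model.save_quantized("models/merged-awq")
tokenizer.save_pretrained("models/merged-awq")
```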
QLoRA + Unsloth on a free Colab T4. Dataset prep, hyperparameters that matter, training, evaluation, common failure modes — end to end.
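A minimal Unsloth setup of the kind the post walks through; the base model, rank, and target modules here are assumptions, not the post's chosen values.

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-3B-Instruct-bnb-4bit",  # placeholder 4-bit base model
    max_seq_length=2048,
    load_in_4bit=True,                                   # QLoRA: frozen 4-bit base weights
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                                                # LoRA rank (illustrative)
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```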
Bolting search onto an LLM endpoint isn't RAG yet — it's the first 30%. Chunking, hybrid retrieval, prompt construction, citations, and eval.
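One concrete piece of the hybrid-retrieval story is score fusion; here is a generic reciprocal rank fusion sketch. The function and the k value are conventional, not the post's code.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of doc ids; k dampens the dominance of top ranks."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf_fuse([
    ["d3", "d1", "d2"],   # BM25 ranking
    ["d3", "d1", "d4"],   # vector-search ranking
])
print(fused)  # ['d3', 'd1', 'd2', 'd4']: d3 tops both lists, so it wins
```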
How embeddings work, which model to pick, and how to add semantic search to the FastAPI service from Part 3 with sentence-transformers and Qdrant.
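The indexing-and-query loop is small enough to sketch end to end; collection name, model choice, and sample texts are illustrative.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dim embeddings
client = QdrantClient(":memory:")                 # in-memory instance for the sketch

client.create_collection(
    "docs", vectors_config=VectorParams(size=384, distance=Distance.COSINE)
)
texts = ["Refunds are issued within 14 days.", "Shipping takes 3-5 business days."]
client.upsert("docs", [
    PointStruct(id=i, vector=model.encode(t).tolist(), payload={"text": t})
    for i, t in enumerate(texts)
])
hits = client.search("docs", query_vector=model.encode("how long do refunds take").tolist(), limit=1)
print(hits[0].payload["text"])  # semantic match, no keyword overlap needed
```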
Wrapping the Part 2 LLM client in a production-ready FastAPI service: validation, streaming, auth, rate limiting, and a Dockerfile.
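A pared-down sketch of the service shape: Pydantic request validation plus token streaming. The endpoint path, limits, and the stand-in generator are placeholders for the real client from Part 2.

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel, Field

app = FastAPI()

class ChatRequest(BaseModel):
    prompt: str = Field(min_length=1, max_length=4000)  # reject empty/oversized input

async def fake_token_stream(prompt: str):
    # stand-in for the streaming LLM client from Part 2
    for tok in ("Hello", ", ", "world", "!"):
        yield tok

@app.post("/chat")
async def chat(req: ChatRequest):
    return StreamingResponse(fake_token_stream(req.prompt), media_type="text/plain")
```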
From four lines of OpenAI SDK to a production-grade client wrapper with retries, streaming, structured output, and cost tracking.
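The retry piece, as a minimal sketch: exponential backoff around a chat call. The wrapper name, model id, and backoff numbers are illustrative, not the post's.

```python
import time

from openai import APIError, OpenAI, RateLimitError

client = OpenAI()

def chat_with_retries(messages: list[dict], max_attempts: int = 4) -> str:
    """Retry transient API failures with exponential backoff: 1s, 2s, 4s, ..."""
    for attempt in range(max_attempts):
        try:
            resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
            return resp.choices[0].message.content
        except (RateLimitError, APIError):
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            time.sleep(2 ** attempt)
```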
The Python skills that actually matter when you start building AI systems — type hints, async, context managers, and tests.
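The flavour of Python the post covers, in one snippet: a typed async context manager for timing a block of work. Purely illustrative.

```python
import asyncio
import time
from contextlib import asynccontextmanager
from typing import AsyncIterator

@asynccontextmanager
async def timed(label: str) -> AsyncIterator[None]:
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {time.perf_counter() - start:.3f}s")

async def main() -> None:
    async with timed("sleep"):
        await asyncio.sleep(0.1)

asyncio.run(main())
```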
Fitting a 22B-parameter audio-video diffusion model into a GPU budget shared with an LLM serving stack — with FP8 quantisation, tiled VAE decode, and subprocess-per-job VRAM isolation.
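The subprocess-per-job idea reduces to something very small in code; a sketch where `generate.py` is a hypothetical worker script, not the post's actual entry point.

```python
import subprocess
import sys

def run_job(job_id: str) -> None:
    """Run each diffusion job in its own Python process; VRAM is reclaimed on exit."""
    subprocess.run(
        [sys.executable, "generate.py", "--job-id", job_id],  # hypothetical worker script
        check=True,  # propagate worker failures to the caller
    )
```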
The default LLM product pattern propagates inference failures to the user. For a consumer-tier product where purchase reliability matters more than AI quality, that's the wrong default. Here's the inversion.
How I built a chat assistant grounded in ~1,600 chunks of primary legislation — with hybrid retrieval, careful chunking, and one parser bug that taught me to validate chunks before tuning anything else.
Why big-prompt generation breaks at scale, and how I rebuilt my document generator as a pipeline of small typed LLM calls — with deliberate decisions about RAG, LoRA targets, and prefix caching.
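A sketch of one "small typed LLM call": JSON-mode output validated against a Pydantic schema before the next pipeline step runs. The schema, prompt, and model id are placeholders.

```python
from openai import OpenAI
from pydantic import BaseModel

class SectionPlan(BaseModel):
    title: str
    bullet_points: list[str]

client = OpenAI()

def plan_section(topic: str) -> SectionPlan:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Plan a section about {topic}. "
                       "Respond in JSON with keys 'title' and 'bullet_points'.",
        }],
        response_format={"type": "json_object"},
    )
    # validation failure raises here, before any downstream step consumes the output
    return SectionPlan.model_validate_json(resp.choices[0].message.content)
```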
The patterns, tools, and pitfalls I've learned from building multi-agent systems that go beyond demos.
How PagedAttention borrows from OS virtual memory to achieve 14-24x higher throughput, with code examples for deploying LLMs on GPUs.
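The core idea in miniature: the KV cache is split into fixed-size blocks, and each sequence keeps a "block table" mapping logical to physical blocks, the same trick as OS page tables. A toy sketch, not vLLM code.

```python
BLOCK_SIZE = 4  # tokens per KV-cache block

free_blocks = list(range(8))                 # pool of physical blocks
block_tables: dict[str, list[int]] = {}      # sequence id -> physical block ids

def append_token(seq_id: str, num_tokens: int) -> None:
    """Allocate a new physical block only when a block boundary is crossed."""
    table = block_tables.setdefault(seq_id, [])
    if not table or num_tokens % BLOCK_SIZE == 1:
        table.append(free_blocks.pop(0))

for t in range(1, 10):                       # feed 9 tokens into one sequence
    append_token("seq-A", t)
print(block_tables["seq-A"])                 # [0, 1, 2]: 9 tokens span 3 blocks
```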
How to shrink a 14 GB model to 4 GB and still get usable results — a practical guide with Python code for GPTQ, AWQ, and GGUF.
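Running the GGUF artifact locally is a few lines with llama-cpp-python; the file name and context size below are placeholders.

```python
from llama_cpp import Llama

llm = Llama(model_path="models/model.Q4_K_M.gguf", n_ctx=4096)  # placeholder GGUF file
out = llm("Q: What is 2 + 2? A:", max_tokens=16, stop=["\n"])
print(out["choices"][0]["text"])
```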
A deep dive into the Transformer architecture that revolutionized NLP and became the foundation for modern LLMs.
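The heart of the architecture fits in a few lines: scaled dot-product attention. A generic reference sketch with a toy input, not the post's code.

```python
import numpy as np

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # token-to-token similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V

Q = K = V = np.eye(4, 8)  # toy input: 4 tokens, head dim 8
print(attention(Q, K, V).shape)  # (4, 8)
```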
From research paper to production app — building an LLM-powered auditor for Kenyan company financial statements with RAG, FastAPI, and Next.js.
From UCLA/MIT research paper to a working multi-agent LLM system that debates, analyzes, and trades NSE equities — adapted for frontier market constraints.
Understanding how BERT changed the NLP landscape with bidirectional context and masked language modeling.
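Masked language modelling is easy to poke at directly; a quick sketch using the Hugging Face pipeline API with the original BERT base checkpoint.

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
top = unmasker("The capital of France is [MASK].")[0]
print(top["token_str"], round(top["score"], 3))  # expect "paris" with a high score
```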
Implementing backpropagation and gradient descent without frameworks to understand the fundamentals.
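The post's theme in a few lines: gradient descent on a 1-D least-squares problem with a hand-derived gradient and no framework. The data and learning rate are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)  # ground-truth weight is 3

w = 0.0
lr = 0.1
for _ in range(100):
    grad = 2 * np.mean((w * x - y) * x)  # d/dw of mean((w*x - y)^2)
    w -= lr * grad
print(round(w, 2))  # converges to roughly 3.0
```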
Analyzing how model size, data, and compute affect performance and what it means for AI development.
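The kind of relationship such analyses fit is a power law between scale and loss, estimated on log-log axes. A sketch on synthetic data; the exponent is chosen only to echo the ballpark reported for model size in the scaling-laws literature.

```python
import numpy as np

N = np.array([1e6, 1e7, 1e8, 1e9])   # parameter counts
loss = 5.0 * N ** -0.076             # synthetic power-law "measurements"

# log(loss) = log(a) + b*log(N), so a straight-line fit recovers the exponent
b, log_a = np.polyfit(np.log(N), np.log(loss), 1)
print(f"fitted exponent: {b:.3f}")   # approximately -0.076
```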