You can rent video generation as an API. It works, it's expensive, and your prompts go through someone else's infrastructure. For one of my recent projects, neither of those was acceptable — the workload was high-volume and the prompts contained pre-release product material that had to stay on my hardware. So I built a self-hosted browser portal around an open-source 22B-parameter audio-video diffusion model, running on a single Blackwell-class GPU with ~96 GB of VRAM.
This post is about how to fit a model that wants ~50 GB just for its decode pass into a budget where another LLM serving stack is already holding 34 GB on the same GPU. Most of it is about VRAM accounting.
The Model Class
The model is a Diffusion Transformer (DiT), not a U-Net. It does joint audio + video in one forward pass over a single token sequence that mixes video patches and audio mel patches with cross-modal attention. That architecture is why lip-synced dialogue and motion-synced foley fall out of one inference run rather than requiring separate audio post-processing.
The downside is VRAM. The joint sequence is long, and naive decoding of long outputs blows past the GPU budget catastrophically.
Three Models, Sequentially Loaded
| Model | Role | Size | When |
|---|---|---|---|
| Text encoder (12B) | Encodes prompt to per-token embeddings | ~25 GB BF16 | Loaded → encoded → freed |
| Diffusion transformer (22B) | Denoises the latent audio-video stream | ~22 GB after FP8 cast | Loaded after encoder freed |
| Video VAE + audio VAE + vocoder | Decodes latents → frames + waveform | ~3 GB | Bundled with diffusion checkpoint |
The non-obvious move is freeing the text encoder before loading the diffusion transformer. The temptation is to keep the encoder warm — you'll need it again on the next job. But the encoder is 25 GB and the diffusion model is 22 GB. Holding both means you've spent ~47 GB before you've decoded a single frame. Free aggressively.
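The load → free → load discipline is just budget arithmetic, so it can be sketched as one. This is a toy ledger using the sizes from the table above, not the portal's actual code (real allocation is torch plus the CUDA driver):

```python
GPU_BUDGET_GB = 96
RESIDENT_LLM_GB = 34  # the vLLM stack that never unloads

class VramLedger:
    """Toy VRAM accounting for the sequential-load discipline."""
    def __init__(self, budget_gb, resident_gb):
        self.budget = budget_gb
        self.used = resident_gb

    def load(self, name, gb):
        if self.used + gb > self.budget:
            raise MemoryError(f"loading {name} ({gb} GB) exceeds {self.budget} GB")
        self.used += gb

    def free(self, gb):
        self.used -= gb

ledger = VramLedger(GPU_BUDGET_GB, RESIDENT_LLM_GB)
ledger.load("text_encoder", 25)   # encode the prompt: 59 GB used
ledger.free(25)                   # free before the DiT loads: back to 34 GB
ledger.load("dit_fp8", 22)        # 56 GB
ledger.load("vaes", 3)            # 59 GB
print(GPU_BUDGET_GB - ledger.used)  # 37 GB of headroom for activations + decode
```

Keeping the encoder warm instead pins 84 GB (34 + 25 + 22 + 3), leaving only 12 GB of headroom before a single activation buffer is allocated.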
FP8-Cast Quantisation
| BF16 | FP8-cast | |
|---|---|---|
| Weight storage | ~44 GB | ~22 GB |
| Compute | BF16 | Upcast to BF16 only for active matmul |
| Accuracy delta | baseline | <1% measured loss |
| Hardware requirement | Any | Blackwell (compute capability 12.0) |
At load time, the weights of selected linear layers (attention to_q/k/v/out, FF projections) are cast BF16 → FP8 (e4m3fn). During the forward pass, they're upcast back to BF16 only for the active matmul. The savings: weight storage roughly halves, ~44 GB → ~22 GB on the 22B model, for sub-1% measured accuracy loss.
This is hardware-specific. It requires Blackwell. On older silicon without native FP8 support, the cast either doesn't work at all or the per-matmul upcast costs enough throughput that the memory saving stops being worth it.
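The table's storage numbers are just bytes-per-parameter, treating GB as 10⁹ bytes. A quick check:

```python
PARAMS = 22e9               # 22B-parameter DiT
BF16_BYTES, FP8_BYTES = 2, 1  # bytes per weight in each dtype

bf16_gb = PARAMS * BF16_BYTES / 1e9
fp8_gb = PARAMS * FP8_BYTES / 1e9
print(bf16_gb, fp8_gb)  # 44.0 22.0
```

In PyTorch, a cast of this shape amounts to a per-layer `weight.to(torch.float8_e4m3fn)`, with the upcast to BF16 happening inside the matmul path; the exact mechanism in this portal's stack may differ.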
Tiled VAE Decode: The Unblocker for Long Outputs
The first version of the portal capped output at ~5 seconds. I wanted 20.
The problem: the stock one-stage CLI tries to decode all frames in a single conv3d. At 361 frames, that requires ~50 GB in one allocation — and OOMs on a 96 GB GPU because nothing else can fit alongside it.
The fix: a wrapper that builds a TilingConfig and feeds an explicit chunk count to the decoder.
| Tile parameter | Value |
|---|---|
| Spatial tile | 512 px |
| Temporal tile | 64 frames |
| Overlap | 24 frames |
Same algorithm, smaller per-step buffers. Adds ~30 s of decode wall-clock but raises the practical output ceiling from ~5 s to 20 s.
The pipeline class itself supports tiling. The upstream one-stage CLI just hardcodes video_chunks_number=1. Wrapping rather than forking the upstream code keeps the install lean and the upgrade path clean — when upstream releases a new version, I rebase the wrapper instead of merging a fork.
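The tile parameters above imply a simple schedule. A sketch of the temporal split (illustrative arithmetic only, not the upstream TilingConfig code):

```python
def temporal_tiles(num_frames, tile=64, overlap=24):
    """Start/end frame indices for overlapped temporal decode tiles."""
    stride = tile - overlap  # 40 frames of new content per tile
    tiles, start = [], 0
    while True:
        end = min(start + tile, num_frames)
        tiles.append((start, end))
        if end == num_frames:
            break
        start += stride
    return tiles

# A 361-frame output decodes as 9 overlapped 64-frame tiles instead of
# one 361-frame conv3d buffer, which is where the flat peak-VRAM curve comes from.
print(len(temporal_tiles(361)))  # 9
```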
Subprocess-Per-Job VRAM Isolation
Every generation spawns a fresh Python process using the diffusion venv's interpreter. When the job finishes, is cancelled, or crashes, the OS reclaims all VRAM. An idle portal holds zero GPU memory.
This matters enormously when other workloads share the GPU continuously. A vLLM serving stack on the same machine holds 34 GB constantly. If the diffusion subprocess leaked even a couple of GB between jobs, the LLM stack would OOM by the third generation.
The cost: ~12 s warmup per job (Python startup + model load). The benefit: GPU memory accounting becomes the OS's job rather than a Python reference-counting problem. For a workload where jobs take 50 s to 7 minutes, 12 s warmup is acceptable.
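The lifecycle is plain subprocess management. A minimal sketch, where the command list (venv interpreter path, worker script, flags) is a placeholder rather than the portal's real entry point:

```python
import subprocess

def run_job(cmd, timeout_s=600):
    """Spawn one generation subprocess; kill it on timeout or cancel.
    Either way the process exits, and the driver reclaims its VRAM."""
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.kill()   # cancellation path: a killed process still frees VRAM
        proc.wait()
        return -9

# e.g. run_job(["/srv/venvs/diffusion/bin/python", "generate.py", "--frames", "121"])
```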
Sequential Job Queue
A single worker thread drains a queue.Queue. No two diffusion subprocesses ever fight for VRAM. No concurrency above the GPU layer.
Adding queueing infrastructure (Celery, Redis) for a single-worker case is overhead with no benefit. The right design when one GPU serves one workload type is the simplest queue that works.
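The whole queue layer fits in a few lines of stdlib. A sketch with a stub job body standing in for the diffusion subprocess spawn:

```python
import queue
import threading

jobs = queue.Queue()
results = []

def worker():
    """Single worker: drains the queue, one job at a time, FIFO."""
    while True:
        job = jobs.get()
        if job is None:      # sentinel: shut down
            break
        results.append(f"done:{job}")  # real worker spawns the diffusion subprocess here
        jobs.task_done()

t = threading.Thread(target=worker, daemon=True)
t.start()
for j in ("job-1", "job-2", "job-3"):
    jobs.put(j)
jobs.put(None)
t.join()
print(results)  # jobs complete strictly in submission order
```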
Image Conditioning, Done Correctly
Uploaded stills go through ImageOps.exif_transpose, then ImageOps.fit to the target aspect ratio with LANCZOS resampling, then an EXIF strip on save, before reaching the diffusion subprocess.
Three reasons:
- Stops the model's internal padding pass from leaving black-border artefacts on non-matching aspect ratios
- Saves a few hundred MB of activation memory at decode time by not making the model handle the resize
- EXIF strip is a quiet privacy win — uploaded photos often carry GPS and camera metadata users don't intend to share
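For intuition, the center-crop step inside ImageOps.fit reduces to box arithmetic. A simplified sketch (the real call also resizes with LANCZOS and supports an off-center `centering`):

```python
def center_crop_box(width, height, target_w, target_h):
    """Crop box matching the target aspect ratio, keeping the image center."""
    src_aspect = width / height
    dst_aspect = target_w / target_h
    if src_aspect > dst_aspect:       # source too wide: trim left/right
        new_w = round(height * dst_aspect)
        left = (width - new_w) // 2
        return (left, 0, left + new_w, height)
    else:                             # source too tall: trim top/bottom
        new_h = round(width / dst_aspect)
        top = (height - new_h) // 2
        return (0, top, width, top + new_h)

# 4:3 phone photo cropped for a 768x448 target
print(center_crop_box(4032, 3024, 768, 448))  # (0, 336, 4032, 2688)
```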
Validation as a First-Class Constraint
Pydantic v2 field validators enforce the model's hard rules before any subprocess is spawned:
| Constraint | Why |
|---|---|
width % 32 == 0 and height % 32 == 0 | DiT patch size requirement |
(num_frames - 1) % 8 == 0 | Temporal compression requirement |
num_frames ∈ [9, 481] | Empirical floor and ceiling |
Failing validation at the API boundary is much cheaper than failing 12 seconds into a subprocess warmup. Every failure mode the model has, the API knows about and rejects with a useful message.
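The portal uses Pydantic v2 field validators; the checks themselves are simple enough to show as a standalone function (names here are illustrative):

```python
def validate_job(width, height, num_frames):
    """Return a list of human-readable errors; empty means the job is valid."""
    errors = []
    if width % 32 or height % 32:
        errors.append("width and height must be multiples of 32 (DiT patch size)")
    if (num_frames - 1) % 8:
        errors.append("(num_frames - 1) must be divisible by 8 (temporal compression)")
    if not 9 <= num_frames <= 481:
        errors.append("num_frames must be in [9, 481]")
    return errors

print(validate_job(768, 448, 121))   # [] -- valid default-ish job
print(validate_job(770, 448, 122))   # two violations, caught before any warmup
```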
Upload Security
Three independent layers, because filesystem-adjacent code is the wrong place to rely on a single guard:
- Two-layer size guard. Content-Length header rejection (cheap, but the header is forgeable) plus a chunked-write accumulator that catches the actual bytes
- MIME allowlist. Only image/jpeg, image/png, image/webp accepted
- Path-traversal check. Path.resolve() plus is_relative_to(uploads_dir) before any path reaches the subprocess
The subprocess receives a path string from the parent process. Any vulnerability that lets a user influence that path is a vulnerability that lets them touch arbitrary files. Three layers because one layer is one bug away from disaster.
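The traversal guard is small enough to show in full. A stdlib-only sketch (the function name is mine; it needs Python 3.9+ for is_relative_to):

```python
from pathlib import Path

def safe_upload_path(uploads_dir: str, filename: str) -> Path:
    """Resolve a user-supplied filename and confine it to the uploads dir.
    Raises ValueError on any traversal attempt."""
    base = Path(uploads_dir).resolve()
    candidate = (base / filename).resolve()  # collapses ../ and symlinks
    if not candidate.is_relative_to(base):
        raise ValueError(f"path escapes uploads dir: {filename}")
    return candidate
```

Resolving before checking matters: a string prefix test on the unresolved path would pass "uploads/../../etc/passwd".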
Empirical Performance
| Scenario | Wall clock | Peak VRAM | Notes |
|---|---|---|---|
| 49 frames @ 768×448, 30 steps | ~50 s | ~25 GB | Default 2-second clip |
| 121 frames | ~2 min | ~26 GB | Original cap before tiled-decode work |
| 361 frames (15 s @ 24 fps) | ~5 min | ~26 GB | Validated via tiled-decode wrapper |
| 481 frames (20 s @ 24 fps) | ~7 min | ~27 GB | Slider cap |
Per-step cost at 768×448 fits 0.025 + 0.026 × num_frames seconds: linear in frame count with a small constant term, and no attention-cost cliff at this model size.
The flat VRAM curve is the payoff of tiled decode. Wall-clock grows roughly linearly with frame count, but peak memory grows barely at all because the tiled decode caps the per-step buffer regardless of output length.
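The fit reconciles cleanly with the table, assuming 30 denoising steps throughout and folding in the ~12 s warmup and ~30 s tiled-decode cost from earlier (both assumptions, since the table only states steps for the first row):

```python
STEPS, WARMUP_S = 30, 12  # assumed constant across rows

def predicted_wall_clock(num_frames, tiled_decode_s=0):
    per_step = 0.025 + 0.026 * num_frames  # fit from the paragraph above
    return WARMUP_S + STEPS * per_step + tiled_decode_s

print(round(predicted_wall_clock(49)))        # ~51 s vs measured ~50 s
print(round(predicted_wall_clock(121)))       # ~107 s vs measured ~2 min
print(round(predicted_wall_clock(361, 30)))   # ~324 s vs measured ~5 min
```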
What I Skipped
| Feature | Why skipped |
|---|---|
| Celery / Redis | Overkill for one GPU, one worker |
| Two-stage upscaler pipeline | Use case didn't need it; saved ~3 GB of weights |
| Distilled LoRA | Same |
| In-process model invocation | Subprocess buys VRAM release for ~12 s warmup cost — correct trade for a shared GPU |
Key Takeaways
- On constrained VRAM, free aggressively. Don't keep models warm "just in case"
- FP8-cast on Blackwell is a measurable memory win for sub-1% accuracy loss
- If your decoder OOMs on long outputs, look for a tiling config before throwing more GPU at it
- Subprocess-per-job VRAM isolation is worth the warmup cost on any GPU shared with continuous-load workloads
- Validate model constraints at the API boundary, not in the model's error messages
- For filesystem-adjacent code, three independent guards is the minimum