You can rent video generation as an API. It works, it's expensive, and your prompts go through someone else's infrastructure. For one of my recent projects, neither of those was acceptable — the workload was high-volume and the prompts contained pre-release product material that had to stay on my hardware. So I built a self-hosted browser portal around an open-source 22B-parameter audio-video diffusion model, running on a single Blackwell-class GPU with ~96 GB of VRAM.
This post is about how to fit a model that wants ~50 GB just for its decode pass into a budget where another LLM serving stack is already holding 34 GB on the same GPU. Most of it is about VRAM accounting.
The Model Class
The model is a Diffusion Transformer (DiT), not a U-Net. It does joint audio + video in one forward pass over a single token sequence that mixes video patches and audio mel patches with cross-modal attention. That architecture is why lip-synced dialogue and motion-synced foley fall out of one inference run rather than requiring separate audio post-processing.
The downside is VRAM. The joint sequence is long, and naive decoding of long outputs blows past the GPU budget catastrophically.
Three Models, Sequentially Loaded
| Model | Role | Size | When |
|---|---|---|---|
| Text encoder (12B) | Encodes prompt to per-token embeddings | ~25 GB BF16 | Loaded → encoded → freed |
| Diffusion transformer (22B) | Denoises the latent audio-video stream | ~22 GB after FP8 cast | Loaded after encoder freed |
| Video VAE + audio VAE + vocoder | Decodes latents → frames + waveform | ~3 GB | Bundled with diffusion checkpoint |
The non-obvious move is freeing the text encoder before loading the diffusion transformer. The temptation is to keep the encoder warm — you'll need it again on the next job. But the encoder is 25 GB and the diffusion model is 22 GB. Holding both means you've spent ~47 GB before you've decoded a single frame. Free aggressively.
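The load → free → load discipline is just budget arithmetic, so it can be sketched as one. This is a toy ledger using the sizes from the table above, not the portal's actual code (real allocation is torch plus the CUDA driver):

```python
GPU_BUDGET_GB = 96
RESIDENT_LLM_GB = 34  # the vLLM stack that never unloads

class VramLedger:
    """Toy VRAM accounting for the sequential-load discipline."""
    def __init__(self, budget_gb, resident_gb):
        self.budget = budget_gb
        self.used = resident_gb

    def load(self, name, gb):
        if self.used + gb > self.budget:
            raise MemoryError(f"loading {name} ({gb} GB) exceeds {self.budget} GB")
        self.used += gb

    def free(self, gb):
        self.used -= gb

ledger = VramLedger(GPU_BUDGET_GB, RESIDENT_LLM_GB)
ledger.load("text_encoder", 25)   # encode the prompt: 59 GB used
ledger.free(25)                   # free before the DiT loads: back to 34 GB
ledger.load("dit_fp8", 22)        # 56 GB
ledger.load("vaes", 3)            # 59 GB
print(GPU_BUDGET_GB - ledger.used)  # 37 GB of headroom for activations + decode
```

Keeping the encoder warm instead pins 84 GB (34 + 25 + 22 + 3), leaving only 12 GB of headroom before a single activation buffer is allocated.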
FP8-Cast Quantisation
| BF16 | FP8-cast | |
|---|---|---|
| Weight storage | ~44 GB | ~22 GB |
| Compute | BF16 | Upcast to BF16 only for active matmul |
| Accuracy delta | baseline | <1% measured loss |
| Hardware requirement | Any | Blackwell (compute capability 12.0) |
At load time, the weights of selected linear layers (attention to_q/k/v/out, FF projections) are cast BF16 → FP8 (e4m3fn). During the forward pass, they're upcast back to BF16 only for the active matmul. The savings: weight storage roughly halves, ~44 GB → ~22 GB on the 22B model, for sub-1% measured accuracy loss.
This is hardware-specific. It requires Blackwell. On older silicon without native FP8 support, the cast either doesn't work at all or the per-matmul upcast costs enough throughput that the memory saving stops being worth it.
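The table's storage numbers are just bytes-per-parameter, treating GB as 10⁹ bytes. A quick check:

```python
PARAMS = 22e9               # 22B-parameter DiT
BF16_BYTES, FP8_BYTES = 2, 1  # bytes per weight in each dtype

bf16_gb = PARAMS * BF16_BYTES / 1e9
fp8_gb = PARAMS * FP8_BYTES / 1e9
print(bf16_gb, fp8_gb)  # 44.0 22.0
```

In PyTorch, a cast of this shape amounts to a per-layer `weight.to(torch.float8_e4m3fn)`, with the upcast to BF16 happening inside the matmul path; the exact mechanism in this portal's stack may differ.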
Tiled VAE Decode: The Unblocker for Long Outputs
The first version of the portal capped output at ~5 seconds. I wanted 20.
The problem: the stock one-stage CLI tries to decode all frames in a single conv3d. At 361 frames, that requires ~50 GB in one allocation — and OOMs on a 96 GB GPU because nothing else can fit alongside it.
The fix: a wrapper that builds a TilingConfig and feeds an explicit chunk count to the decoder.
| Tile parameter | Value |
|---|---|
| Spatial tile | 512 px |
| Temporal tile | 64 frames |
| Overlap | 24 frames |
Same algorithm, smaller per-step buffers. Adds ~30 s of decode wall-clock but raises the practical output ceiling from ~5 s to 20 s.
The pipeline class itself supports tiling. The upstream one-stage CLI just hardcodes video_chunks_number=1. Wrapping rather than forking the upstream code keeps the install lean and the upgrade path clean — when upstream releases a new version, I rebase the wrapper instead of merging a fork.
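The tile parameters above imply a simple schedule. A sketch of the temporal split (illustrative arithmetic only, not the upstream TilingConfig code):

```python
def temporal_tiles(num_frames, tile=64, overlap=24):
    """Start/end frame indices for overlapped temporal decode tiles."""
    stride = tile - overlap  # 40 frames of new content per tile
    tiles, start = [], 0
    while True:
        end = min(start + tile, num_frames)
        tiles.append((start, end))
        if end == num_frames:
            break
        start += stride
    return tiles

# A 361-frame output decodes as 9 overlapped 64-frame tiles instead of
# one 361-frame conv3d buffer, which is where the flat peak-VRAM curve comes from.
print(len(temporal_tiles(361)))  # 9
```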
Subprocess-Per-Job VRAM Isolation
Every generation spawns a fresh Python process using the diffusion venv's interpreter. When the job finishes, is cancelled, or crashes, the OS reclaims all VRAM. An idle portal holds zero GPU memory.
This matters enormously when other workloads share the GPU continuously. A vLLM serving stack on the same machine holds 34 GB constantly. If the diffusion subprocess leaked even a couple of GB between jobs, the LLM stack would OOM by the third generation.
The cost: ~12 s warmup per job (Python startup + model load). The benefit: GPU memory accounting becomes the OS's job rather than a Python reference-counting problem. For a workload where jobs take 50 s to 7 minutes, 12 s warmup is acceptable.
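The lifecycle is plain subprocess management. A minimal sketch, where the command list (venv interpreter path, worker script, flags) is a placeholder rather than the portal's real entry point:

```python
import subprocess

def run_job(cmd, timeout_s=600):
    """Spawn one generation subprocess; kill it on timeout or cancel.
    Either way the process exits, and the driver reclaims its VRAM."""
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.kill()   # cancellation path: a killed process still frees VRAM
        proc.wait()
        return -9

# e.g. run_job(["/srv/venvs/diffusion/bin/python", "generate.py", "--frames", "121"])
```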
Sequential Job Queue
A single worker thread drains a queue.Queue. No two diffusion subprocesses ever fight for VRAM. No concurrency above the GPU layer.
Adding queueing infrastructure (Celery, Redis) for a single-worker case is overhead with no benefit. The right design when one GPU serves one workload type is the simplest queue that works.
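The whole queue layer fits in a few lines of stdlib. A sketch with a stub job body standing in for the diffusion subprocess spawn:

```python
import queue
import threading

jobs = queue.Queue()
results = []

def worker():
    """Single worker: drains the queue, one job at a time, FIFO."""
    while True:
        job = jobs.get()
        if job is None:      # sentinel: shut down
            break
        results.append(f"done:{job}")  # real worker spawns the diffusion subprocess here
        jobs.task_done()

t = threading.Thread(target=worker, daemon=True)
t.start()
for j in ("job-1", "job-2", "job-3"):
    jobs.put(j)
jobs.put(None)
t.join()
print(results)  # jobs complete strictly in submission order
```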
Image Conditioning, Done Correctly
Uploaded stills go through ImageOps.exif_transpose, then ImageOps.fit to the target aspect ratio with LANCZOS resampling, then an EXIF strip on save, before reaching the diffusion subprocess.
Three reasons:
- Stops the model's internal padding pass from leaving black-border artefacts on non-matching aspect ratios
- Saves a few hundred MB of activation memory at decode time by not making the model handle the resize
- EXIF strip is a quiet privacy win — uploaded photos often carry GPS and camera metadata users don't intend to share
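For intuition, the center-crop step inside ImageOps.fit reduces to box arithmetic. A simplified sketch (the real call also resizes with LANCZOS and supports an off-center `centering`):

```python
def center_crop_box(width, height, target_w, target_h):
    """Crop box matching the target aspect ratio, keeping the image center."""
    src_aspect = width / height
    dst_aspect = target_w / target_h
    if src_aspect > dst_aspect:       # source too wide: trim left/right
        new_w = round(height * dst_aspect)
        left = (width - new_w) // 2
        return (left, 0, left + new_w, height)
    else:                             # source too tall: trim top/bottom
        new_h = round(width / dst_aspect)
        top = (height - new_h) // 2
        return (0, top, width, top + new_h)

# 4:3 phone photo cropped for a 768x448 target
print(center_crop_box(4032, 3024, 768, 448))  # (0, 336, 4032, 2688)
```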
Validation as a First-Class Constraint
Pydantic v2 field validators enforce the model's hard rules before any subprocess is spawned:
| Constraint | Why |
|---|---|
width % 32 == 0 and height % 32 == 0 | DiT patch size requirement |
(num_frames - 1) % 8 == 0 | Temporal compression requirement |
num_frames ∈ [9, 481] | Empirical floor and ceiling |
Failing validation at the API boundary is much cheaper than failing 12 seconds into a subprocess warmup. Every failure mode the model has, the API knows about and rejects with a useful message.
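The portal uses Pydantic v2 field validators; the checks themselves are simple enough to show as a standalone function (names here are illustrative):

```python
def validate_job(width, height, num_frames):
    """Return a list of human-readable errors; empty means the job is valid."""
    errors = []
    if width % 32 or height % 32:
        errors.append("width and height must be multiples of 32 (DiT patch size)")
    if (num_frames - 1) % 8:
        errors.append("(num_frames - 1) must be divisible by 8 (temporal compression)")
    if not 9 <= num_frames <= 481:
        errors.append("num_frames must be in [9, 481]")
    return errors

print(validate_job(768, 448, 121))   # [] -- valid default-ish job
print(validate_job(770, 448, 122))   # two violations, caught before any warmup
```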
Upload Security
Three independent layers, because filesystem-adjacent code is the wrong place to rely on a single guard:
- Two-layer size guard. Content-Length header rejection (cheap, but the header is forgeable) plus a chunked-write accumulator that catches the actual bytes
- MIME allowlist. Only image/jpeg, image/png, image/webp accepted
- Path-traversal check. Path.resolve() plus is_relative_to(uploads_dir) before any path reaches the subprocess
The subprocess receives a path string from the parent process. Any vulnerability that lets a user influence that path is a vulnerability that lets them touch arbitrary files. Three layers because one layer is one bug away from disaster.
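The traversal guard is small enough to show in full. A stdlib-only sketch (the function name is mine; it needs Python 3.9+ for is_relative_to):

```python
from pathlib import Path

def safe_upload_path(uploads_dir: str, filename: str) -> Path:
    """Resolve a user-supplied filename and confine it to the uploads dir.
    Raises ValueError on any traversal attempt."""
    base = Path(uploads_dir).resolve()
    candidate = (base / filename).resolve()  # collapses ../ and symlinks
    if not candidate.is_relative_to(base):
        raise ValueError(f"path escapes uploads dir: {filename}")
    return candidate
```

Resolving before checking matters: a string prefix test on the unresolved path would pass "uploads/../../etc/passwd".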
Empirical Performance
| Scenario | Wall clock | Peak VRAM | Notes |
|---|---|---|---|
| 49 frames @ 768×448, 30 steps | ~50 s | ~25 GB | Default 2-second clip |
| 121 frames | ~2 min | ~26 GB | Original cap before tiled-decode work |
| 361 frames (15 s @ 24 fps) | ~5 min | ~26 GB | Validated via tiled-decode wrapper |
| 481 frames (20 s @ 24 fps) | ~7 min | ~27 GB | Slider cap |
Per-step cost at 768×448 fits 0.025 + 0.026 × num_frames seconds: linear in frame count with a small constant term, and no attention-cost cliff at this model size.
The flat VRAM curve is the payoff of tiled decode. Wall-clock grows roughly linearly with frame count, but peak memory grows barely at all because the tiled decode caps the per-step buffer regardless of output length.
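The fit reconciles cleanly with the table, assuming 30 denoising steps throughout and folding in the ~12 s warmup and ~30 s tiled-decode cost from earlier (both assumptions, since the table only states steps for the first row):

```python
STEPS, WARMUP_S = 30, 12  # assumed constant across rows

def predicted_wall_clock(num_frames, tiled_decode_s=0):
    per_step = 0.025 + 0.026 * num_frames  # fit from the paragraph above
    return WARMUP_S + STEPS * per_step + tiled_decode_s

print(round(predicted_wall_clock(49)))        # ~51 s vs measured ~50 s
print(round(predicted_wall_clock(121)))       # ~107 s vs measured ~2 min
print(round(predicted_wall_clock(361, 30)))   # ~324 s vs measured ~5 min
```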
What I Skipped
| Feature | Why skipped |
|---|---|
| Celery / Redis | Overkill for one GPU, one worker |
| Two-stage upscaler pipeline | Use case didn't need it; saved ~3 GB of weights |
| Distilled LoRA | Same |
| In-process model invocation | Subprocess buys VRAM release for ~12 s warmup cost — correct trade for a shared GPU |
Key Takeaways
- On constrained VRAM, free aggressively. Don't keep models warm "just in case"
- FP8-cast on Blackwell is a measurable memory win for sub-1% accuracy loss
- If your decoder OOMs on long outputs, look for a tiling config before throwing more GPU at it
- Subprocess-per-job VRAM isolation is worth the warmup cost on any GPU shared with continuous-load workloads
- Validate model constraints at the API boundary, not in the model's error messages
- For filesystem-adjacent code, three independent guards is the minimum