First Block Cache vs TeaCache vs AdaCache
A comparison of a few caching tricks for diffusion models: First Block Cache, TeaCache, and AdaCache (I'll expand this in more detail later). These show up in large image/video models like FLUX or Wan and speed up inference. All of them work by skipping parts of the model when it's safe to do so. A quick breakdown based on some testing and digging into papers/code.
1. First Block Cache (FBCache)
This one's simple. You run the first transformer block, then check how much its output changed compared to the previous step. If it's almost the same, you reuse the cached output of the remaining blocks instead of running them again.
- How it decides: residual norm difference (usually L2), thresholded.
- What it skips: everything after block 1.
- Best for: models where early layers capture most of the change (common in diffusion).
Setting | Speedup | Quality Drop | Notes |
---|---|---|---|
conservative (0.04) | 1.3x | none | low cache hit rate |
default (0.08) | 1.6x | tiny | works well across models |
aggressive (0.12+) | 2.0x | mild | for faster runs only |
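To make the decision rule concrete, here's a minimal sketch of the idea in NumPy (the real implementations are PyTorch hooks inside the model; the function name, cache layout, and the exact relative-L2 metric here are my assumptions, not the actual API):

```python
import numpy as np

def fbcache_forward(blocks, x, cache, threshold=0.08):
    """Hypothetical FBCache step (sketch): run the first block, compare its
    residual to the cached one, and reuse the cached tail output if the
    change is below `threshold`."""
    residual = blocks[0](x) - x  # first-block residual

    prev = cache.get("first_residual")
    if prev is not None:
        # relative L2 change of the first-block residual vs. previous step
        rel_diff = np.linalg.norm(residual - prev) / (np.linalg.norm(prev) + 1e-8)
        if rel_diff < threshold:
            # small change -> skip blocks 1..N, reuse the cached remainder
            return x + residual + cache["tail_residual"]

    # cache miss: run the remaining blocks and store both residuals
    h = x + residual
    h_in = h
    for block in blocks[1:]:
        h = block(h)
    cache["first_residual"] = residual
    cache["tail_residual"] = h - h_in
    return h
```

Note that only block 0 ever runs on the skip path, which is why the overhead per cached step is so low.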
2. TeaCache
Smarter about when to reuse. Instead of waiting for model outputs, it uses the input noise + timestep embedding to estimate whether the output would change much. If not, it skips the whole model.
- How it decides: difference between modulated inputs, scaled.
- What it skips: full model (like FBCache).
- Best for: long videos, where early steps are noisy and later steps barely change.
Preset (by target speedup) | Speedup | Quality Drop | Notes |
---|---|---|---|
1.6x | 1.6x | none | good balance |
4.4x | 4.4x | ~0.07% | high reuse late in sampling |
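The "accumulate input change, then flush" idea can be sketched like this. This is a simplification: the real TeaCache additionally rescales the input difference with a fitted polynomial before accumulating it, and the function name, state dict, and threshold value here are assumptions for illustration:

```python
import numpy as np

def teacache_step(model, x, emb, state, rel_l1_thresh=0.15):
    """Hypothetical TeaCache step (sketch): estimate output change from the
    timestep-modulated input `emb`, accumulate it across steps, and only run
    the full model when the accumulated change crosses `rel_l1_thresh`."""
    prev_emb = state.get("prev_emb")
    if prev_emb is None:
        should_compute = True  # first step always computes
    else:
        # relative L1 change of the modulated input (the proxy signal)
        rel_l1 = np.abs(emb - prev_emb).mean() / (np.abs(prev_emb).mean() + 1e-8)
        state["accum"] = state.get("accum", 0.0) + rel_l1
        should_compute = state["accum"] >= rel_l1_thresh

    state["prev_emb"] = emb
    if should_compute:
        state["accum"] = 0.0  # reset the accumulator on a real compute
        state["cached_residual"] = model(x) - x
    # on a skip, reuse the residual from the last computed step
    return x + state["cached_residual"]
```

Because the signal comes from the inputs, the skip decision is made *before* touching the model, so a skipped step costs almost nothing.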
3. AdaCache
Meant for video. Instead of skipping full steps, it selectively skips transformer layers if the residuals don’t change much. Also adapts to how much motion the video has.
- How it decides: L2 norm of residuals at each layer.
- What it skips: some layers, some steps.
- Best for: low-motion or static video segments.
Setting | Speedup | Memory | Notes |
---|---|---|---|
default | 2.6x | medium | better than earlier video caching |
aggressive + MoReg | 4.5x | high | good for 720p static videos |
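A toy version of per-layer skipping looks like the sketch below. It's heavily simplified: the real AdaCache derives a per-layer caching *schedule* (how many steps to reuse) from the residual distance and adds a motion-regularization term (MoReg); here each layer just gets a stable/unstable flag. All names and the threshold are illustrative assumptions:

```python
import numpy as np

def adacache_forward(blocks, x, cache, threshold=0.1):
    """Hypothetical AdaCache-style step (sketch): each layer compares its new
    residual to the cached one; layers whose residual barely changed reuse
    the cached residual on later calls instead of recomputing."""
    h = x
    for i, block in enumerate(blocks):
        entry = cache.setdefault(i, {"residual": None, "skip": False})
        if entry["skip"] and entry["residual"] is not None:
            h = h + entry["residual"]  # reuse cached residual, skip the layer
            continue
        residual = block(h) - h
        if entry["residual"] is not None:
            # relative L2 change of this layer's residual across steps
            change = np.linalg.norm(residual - entry["residual"]) / (
                np.linalg.norm(entry["residual"]) + 1e-8
            )
            # stable residual -> mark this layer skippable for later steps
            entry["skip"] = change < threshold
        entry["residual"] = residual
        h = h + residual
    return h
```

The per-layer granularity is what makes this a better fit for video: static background layers stabilize quickly while motion-heavy layers keep recomputing.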
TL;DR
Method | Skip Scope | Main Decision | Speedup | Notes |
---|---|---|---|---|
FBCache | full step | early-block diff | 1.6x–2x | light, easy to drop in |
TeaCache | full step | input diff estimate | 2x–4.4x | more accurate, bit more setup |
AdaCache | per-layer | per-layer residuals | 2.6x–4.5x | best for video, heavier |
For image models, FBCache or TeaCache is probably enough. If you're doing video and want maximum performance, AdaCache is worth a look. All three are training-free, work at inference time, and plug into transformer-style diffusion models without retraining.
Might test combining one of them with early-stop or latent caching next.