
First Block Cache vs TeaCache vs AdaCache

2025-02-16

A comparison of three caching tricks for diffusion models: First Block Cache, TeaCache, and AdaCache (I'll flesh this out in more detail later). They show up in large image/video models like FLUX and Wan, and all of them speed up inference by skipping parts of the model when it's safe to do so. A quick breakdown based on some testing and digging into papers/code.

1. First Block Cache (FBCache)

This one's simple. You run the first transformer block, then check how much its output changed compared to the last step. If it's almost the same, you just reuse the entire previous model output instead of running all blocks again.

  • How it decides: residual norm difference (usually L2), thresholded.
  • What it skips: everything after block 1.
  • Best for: models where early layers capture most of the change (common in diffusion).
| Setting | Speedup | Quality drop | Notes |
|---|---|---|---|
| conservative (0.04) | 1.3x | none | low cache hit rate |
| default (0.08) | 1.6x | tiny | works well across models |
| aggressive (0.12+) | 2.0x | mild | for faster runs only |
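The decision rule fits in a few lines. This is a minimal sketch, not any library's actual API: `fbcache_step` and the `cache` dict are names I made up, and I'm using numpy arrays where a real implementation would use torch tensors.

```python
import numpy as np

def fbcache_step(blocks, x, cache, threshold=0.08):
    """Hypothetical FBCache step. `blocks` is a list of callables
    (transformer blocks); `cache` is a dict that persists across
    sampling steps."""
    h1 = blocks[0](x)                 # always run the first block
    prev = cache.get("h1")
    if prev is not None:
        # relative L2 change of the first block's output vs. last step
        rel = np.linalg.norm(h1 - prev) / (np.linalg.norm(prev) + 1e-8)
        cache["h1"] = h1
        if rel < threshold:
            return cache["out"]       # hit: reuse the full previous output
    h = h1
    for blk in blocks[1:]:            # miss: run the remaining blocks
        h = blk(h)
    cache["h1"], cache["out"] = h1, h
    return h
```

The threshold maps directly onto the settings in the table above: raise it and more steps hit the cache, at the cost of staler outputs.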

2. TeaCache

Smarter about when to reuse. Instead of waiting on model outputs, it uses the input noise plus the timestep embedding to estimate whether the output would change much. If not, it skips the whole model for that step.

  • How it decides: difference between modulated inputs, scaled.
  • What it skips: full model (like FBCache).
  • Best for: long videos, where early steps are noisy and later steps barely change.
| Setting (target speedup) | Speedup | Quality drop | Notes |
|---|---|---|---|
| 1.6x | 1.6x | none | good balance |
| 4.4x | 4.4x | ~0.07% | high reuse late in sampling |
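The core idea is to accumulate a rescaled distance between the modulated inputs of consecutive steps and only recompute when it crosses a threshold. A rough sketch, with caveats: the function name and `state` dict are mine, the real rescaling polynomial is fit offline per model (I default to the identity here), and the 0.15 threshold is just an illustrative value.

```python
import numpy as np

def teacache_should_compute(mod_inp, state, thresh=0.15, poly=(1.0, 0.0)):
    """Hypothetical TeaCache decision. `state` persists across steps:
    {"prev": last modulated input, "acc": accumulated distance}.
    `poly` holds coefficients of the model-specific rescaling
    polynomial (identity here as a placeholder)."""
    prev = state.get("prev")
    state["prev"] = mod_inp
    if prev is None:
        state["acc"] = 0.0
        return True                     # first step: always compute
    # relative L1 distance between consecutive modulated inputs
    rel_l1 = np.abs(mod_inp - prev).mean() / (np.abs(prev).mean() + 1e-8)
    state["acc"] += np.polyval(poly, rel_l1)  # rescale, then accumulate
    if state["acc"] >= thresh:
        state["acc"] = 0.0
        return True                     # change is large: run the model
    return False                        # reuse the cached output
```

Accumulating (rather than comparing step-to-step) is what lets it catch slow drift: many tiny changes eventually force a recompute.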

3. AdaCache

Meant for video. Instead of skipping full steps, it selectively skips transformer layers if the residuals don’t change much. Also adapts to how much motion the video has.

  • How it decides: L2 norm of residuals at each layer.
  • What it skips: some layers, some steps.
  • Best for: low-motion or static video segments.
| Setting | Speedup | Memory | Notes |
|---|---|---|---|
| default | 2.6x | medium | better than earlier video caching |
| aggressive + MoReg | 4.5x | high | good for 720p static videos |
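Per-layer residual reuse can be sketched like this. To be clear about what's simplified: the paper maps the residual distance to a reuse schedule (how many steps to skip) and adds the motion regularizer, while this sketch just arms a one-step skip when the residual barely changed; all names are illustrative.

```python
import numpy as np

def adacache_layer(block, h, layer_cache, threshold=0.1):
    """Hypothetical per-layer AdaCache step. `layer_cache` persists
    per layer across sampling steps."""
    prev_res = layer_cache.get("res")
    if prev_res is not None and layer_cache.get("skip", 0) > 0:
        layer_cache["skip"] -= 1
        return h + prev_res             # reuse the cached residual
    out = block(h)
    res = out - h                       # this layer's residual
    if prev_res is not None:
        change = np.linalg.norm(res - prev_res) / (np.linalg.norm(prev_res) + 1e-8)
        # small change -> allow skipping this layer on the next step
        layer_cache["skip"] = 1 if change < threshold else 0
    layer_cache["res"] = res
    return out
```

Because the decision is per layer, a mostly-static clip can skip most layers most of the time while high-motion segments still get full compute.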

TL;DR

| Method | Skip scope | Main decision | Speedup | Notes |
|---|---|---|---|---|
| FBCache | full step | early-block diff | 1.6x–2x | light, easy to drop in |
| TeaCache | full step | input diff estimate | 2x–4.4x | more accurate, a bit more setup |
| AdaCache | per-layer | per-layer residuals | 2.6x–4.5x | best for video, heavier |

For image models, FBCache or TeaCache are probably enough. If you're doing video and want max performance, AdaCache is worth a look. All three are training-free, work at inference, and plug into transformer-style diffusion models without retraining.

Might test combining one of them with early-stop or latent caching next.