First Block Cache vs TeaCache vs AdaCache
A comparison of a few caching tricks for diffusion models: First Block Cache, TeaCache, and AdaCache (I'll expand this in more detail later). These show up in large image/video models like FLUX or Wan and speed up inference. All of them work by skipping parts of the model when it's safe to do so. A quick breakdown based on some testing and digging into papers/code.
1. First Block Cache (FBCache)
This one's simple. You run the first transformer block, then check how much its output changed compared to the previous step. If it's almost the same, you reuse the cached output of the remaining blocks instead of running them again.
- How it decides: residual norm difference (usually L2), thresholded.
- What it skips: everything after block 1.
- Best for: models where early layers capture most of the change (common in diffusion).
Setting | Speedup | Quality Drop | Notes |
---|---|---|---|
conservative (0.04) | 1.3x | none | low cache hit rate |
default (0.08) | 1.6x | tiny | works well across models |
aggressive (0.12+) | 2.0x | mild | for faster runs only |
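To make the decision rule concrete, here's a minimal sketch of the idea in NumPy (the real implementations are PyTorch hooks inside the model; the function name, cache layout, and the exact relative-L2 metric here are my assumptions, not the actual API):

```python
import numpy as np

def fbcache_forward(blocks, x, cache, threshold=0.08):
    """Hypothetical FBCache step (sketch): run the first block, compare its
    residual to the cached one, and reuse the cached tail output if the
    change is below `threshold`."""
    residual = blocks[0](x) - x  # first-block residual

    prev = cache.get("first_residual")
    if prev is not None:
        # relative L2 change of the first-block residual vs. previous step
        rel_diff = np.linalg.norm(residual - prev) / (np.linalg.norm(prev) + 1e-8)
        if rel_diff < threshold:
            # small change -> skip blocks 1..N, reuse the cached remainder
            return x + residual + cache["tail_residual"]

    # cache miss: run the remaining blocks and store both residuals
    h = x + residual
    h_in = h
    for block in blocks[1:]:
        h = block(h)
    cache["first_residual"] = residual
    cache["tail_residual"] = h - h_in
    return h
```

Note that only block 0 ever runs on the skip path, which is why the overhead per cached step is so low.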
2. TeaCache
Smarter about when to reuse. Instead of waiting for model outputs, it uses the input noise + timestep embedding to estimate whether the output would change much. If not, it skips the whole model.
- How it decides: difference between modulated inputs, scaled.
- What it skips: full model (like FBCache).
- Best for: long videos, where early steps are noisy and later steps barely change.
Preset (by target speedup) | Speedup | Quality Drop | Notes |
---|---|---|---|
1.6x | 1.6x | none | good balance |
4.4x | 4.4x | ~0.07% | high reuse late in sampling |
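The "accumulate input change, then flush" idea can be sketched like this. This is a simplification: the real TeaCache additionally rescales the input difference with a fitted polynomial before accumulating it, and the function name, state dict, and threshold value here are assumptions for illustration:

```python
import numpy as np

def teacache_step(model, x, emb, state, rel_l1_thresh=0.15):
    """Hypothetical TeaCache step (sketch): estimate output change from the
    timestep-modulated input `emb`, accumulate it across steps, and only run
    the full model when the accumulated change crosses `rel_l1_thresh`."""
    prev_emb = state.get("prev_emb")
    if prev_emb is None:
        should_compute = True  # first step always computes
    else:
        # relative L1 change of the modulated input (the proxy signal)
        rel_l1 = np.abs(emb - prev_emb).mean() / (np.abs(prev_emb).mean() + 1e-8)
        state["accum"] = state.get("accum", 0.0) + rel_l1
        should_compute = state["accum"] >= rel_l1_thresh

    state["prev_emb"] = emb
    if should_compute:
        state["accum"] = 0.0  # reset the accumulator on a real compute
        state["cached_residual"] = model(x) - x
    # on a skip, reuse the residual from the last computed step
    return x + state["cached_residual"]
```

Because the signal comes from the inputs, the skip decision is made *before* touching the model, so a skipped step costs almost nothing.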
3. AdaCache
Meant for video. Instead of skipping full steps, it selectively skips transformer layers if the residuals don’t change much. Also adapts to how much motion the video has.
- How it decides: L2 norm of residuals at each layer.
- What it skips: some layers, some steps.
- Best for: low-motion or static video segments.
Setting | Speedup | Memory | Notes |
---|---|---|---|
default | 2.6x | medium | better than earlier video caching |
aggressive + MoReg | 4.5x | high | good for 720p static videos |
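A toy version of per-layer skipping looks like the sketch below. It's heavily simplified: the real AdaCache derives a per-layer caching *schedule* (how many steps to reuse) from the residual distance and adds a motion-regularization term (MoReg); here each layer just gets a stable/unstable flag. All names and the threshold are illustrative assumptions:

```python
import numpy as np

def adacache_forward(blocks, x, cache, threshold=0.1):
    """Hypothetical AdaCache-style step (sketch): each layer compares its new
    residual to the cached one; layers whose residual barely changed reuse
    the cached residual on later calls instead of recomputing."""
    h = x
    for i, block in enumerate(blocks):
        entry = cache.setdefault(i, {"residual": None, "skip": False})
        if entry["skip"] and entry["residual"] is not None:
            h = h + entry["residual"]  # reuse cached residual, skip the layer
            continue
        residual = block(h) - h
        if entry["residual"] is not None:
            # relative L2 change of this layer's residual across steps
            change = np.linalg.norm(residual - entry["residual"]) / (
                np.linalg.norm(entry["residual"]) + 1e-8
            )
            # stable residual -> mark this layer skippable for later steps
            entry["skip"] = change < threshold
        entry["residual"] = residual
        h = h + residual
    return h
```

The per-layer granularity is what makes this a better fit for video: static background layers stabilize quickly while motion-heavy layers keep recomputing.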
TL;DR
Method | Skip Scope | Main Decision | Speedup | Notes |
---|---|---|---|---|
FBCache | full step | early-block diff | 1.6x–2x | light, easy to drop in |
TeaCache | full step | input diff estimate | 2x–4.4x | more accurate, bit more setup |
AdaCache | per-layer | per-layer residuals | 2.6x–4.5x | best for video, heavier |
For image models, FBCache or TeaCache is probably enough. If you're doing video and want maximum performance, AdaCache is worth a look. All three are training-free, work at inference time, and plug into transformer-style diffusion models without retraining.
Might test combining one of them with early-stop or latent caching next.