KV Cache Attention

Watch how Key-Value (KV) caching transforms attention from O(n²) recomputation at every generation step into efficient incremental processing, reusing past keys and values while generating identical outputs.

Attention Distribution

Full n×n attention distribution with causal masking

Each row shows how one token attends to itself and all earlier tokens (causal, autoregressive masking)
"John"
"sat"
"then"
"he"
"ran"
"John"
1.00
JohnJohn: 1.000
"sat"
0.64
satJohn: 0.640
0.36
satsat: 0.360
"then"
0.45
thenJohn: 0.450
0.32
thensat: 0.320
0.23
thenthen: 0.230
"he"
0.72
heJohn: 0.720
0.15
hesat: 0.150
0.08
hethen: 0.080
0.05
hehe: 0.050
"ran"
0.22
ranJohn: 0.220
0.28
ransat: 0.280
0.25
ranthen: 0.250
0.15
ranhe: 0.150
0.10
ranran: 0.100
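
The matrix above is just a row-wise softmax of query-key dot products with future positions masked out. Here is a minimal NumPy sketch of that computation; the vectors are random, so the printed weights won't match the example values above:

```python
import numpy as np

def causal_attention_weights(Q, K):
    """Row-wise softmax of Q @ K.T with a causal mask: each token can
    only attend to itself and earlier positions."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                    # (n, n) raw scores
    scores[np.triu(np.ones((n, n), dtype=bool), k=1)] = -np.inf  # hide the future
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    return w / w.sum(axis=1, keepdims=True)          # each row sums to 1.0

rng = np.random.default_rng(0)
n_tokens, d_model = 5, 3                             # "John sat then he ran"
Q = rng.normal(size=(n_tokens, d_model))
K = rng.normal(size=(n_tokens, d_model))
print(causal_attention_weights(Q, K).round(3))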

Single Token Generation

"he" attends to the 1st token

Previous Tokens:
Query Token:
"John"
K1
[0.45]
[0.82]
[0.37]
V1
[0.72]
[0.31]
[0.64]
"sat"
K2
[--]
[--]
[--]
V2
[--]
[--]
[--]
"then"
K3
[--]
[--]
[--]
V3
[--]
[--]
[--]
"he"
Q1
[0.89]
[0.44]
[0.12]
Q·K1 = 0.845
α1 = 1.000
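
The same arithmetic in a few lines of NumPy. The raw dot product is used here to match the panel; production attention would also divide the score by √d:

```python
import numpy as np

k1 = np.array([0.45, 0.82, 0.37])   # key for "John" (from the panel above)
v1 = np.array([0.72, 0.31, 0.64])   # value for "John"
q  = np.array([0.89, 0.44, 0.12])   # query for "he"

score = q @ k1            # 0.89*0.45 + 0.44*0.82 + 0.12*0.37 ≈ 0.806
alpha = 1.0               # softmax over one visible score is trivially 1
output = alpha * v1       # with a single unmasked key, the output is V1

print(f"Q·K1 = {score:.3f}, alpha1 = {alpha:.3f}, output = {output}")
```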

Multi-Token Generation (No KV Cache)

Starting generation: "then" queries the context to predict the next token.

Autoregressive token prediction (no caching):

"John":   K1, V1  (recomputed from the embedding at this step)
"sat":    K2, V2  (recomputed from the embedding at this step)
"then":   Q3 = [0.31, 0.92, 0.56]

🔄 Autoregressive Generation
• Context length: 2 tokens
• Query token: "then", computing attention over the context
• No caching: every key and value vector is recomputed at each generation step (see the sketch below)
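
A minimal NumPy sketch of this uncached loop. The names Wq/Wk/Wv, attend, and the feeding of the attention output back in as the "next token" are illustrative stand-ins for a full transformer decode step:

```python
import numpy as np

def attend(q, K, V):
    """Scaled dot-product attention for one query vector."""
    scores = K @ q / np.sqrt(q.size)
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ V

def generate_no_cache(ctx, Wq, Wk, Wv, n_steps):
    """Re-project K and V for the ENTIRE context at every step:
    across n steps that is O(n^2) projection work."""
    for _ in range(n_steps):
        K = ctx @ Wk                 # recomputed from scratch each step
        V = ctx @ Wv
        q = ctx[-1] @ Wq             # query for the newest token
        nxt = attend(q, K, V)        # stand-in for the real next-token step
        ctx = np.vstack([ctx, nxt])
    return ctx

rng = np.random.default_rng(0)
d = 3
ctx = rng.normal(size=(2, d))        # embeddings for "John", "sat"
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(generate_no_cache(ctx, Wq, Wk, Wv, n_steps=3).round(3))
```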

Multi-Token Generation (With KV Cache)

Starting generation: "then" queries the context to predict the next token.

Autoregressive token prediction (with cache):

"John":   K1, V1  (read directly from the cache)
"sat":    K2, V2  (read directly from the cache)
"then":   Q3 = [0.31, 0.92, 0.56]

Autoregressive Generation with KV Cache
• Context length: 2 tokens
• Query token: "then", computing attention over the context
• KV caching enabled: key and value vectors are computed once and reused (see the sketch below)
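
The cached counterpart of the sketch above, under the same illustrative assumptions. Each step projects only the newest token, and with identical weights and prompt it produces exactly the same sequence as the uncached loop:

```python
import numpy as np

def attend(q, K, V):
    """Scaled dot-product attention for one query vector."""
    scores = K @ q / np.sqrt(q.size)
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ V

def generate_with_cache(ctx, Wq, Wk, Wv, n_steps):
    """Project each token's K and V exactly once: every step only
    projects the newest token and appends one row to each cache."""
    K_cache = ctx @ Wk               # prefill: project the prompt once
    V_cache = ctx @ Wv
    for _ in range(n_steps):
        q = ctx[-1] @ Wq
        nxt = attend(q, K_cache, V_cache)
        ctx = np.vstack([ctx, nxt])
        K_cache = np.vstack([K_cache, nxt @ Wk])   # one new row, not n
        V_cache = np.vstack([V_cache, nxt @ Wv])
    return ctx

rng = np.random.default_rng(0)
d = 3
ctx = rng.normal(size=(2, d))        # embeddings for "John", "sat"
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
# Same weights, same prompt: the cached and uncached paths agree exactly.
print(generate_with_cache(ctx, Wq, Wk, Wv, n_steps=3).round(3))
```

The only difference between the two loops is where the key and value rows come from: the uncached version re-derives all of them every step, while the cached version appends one new row per step, which is what turns O(n²) total projection work into O(n).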