KV Cache Attention

Watch how Key-Value (KV) caching transforms attention from O(n²) recomputation at every generation step into efficient incremental processing, reusing past keys and values while generating identical outputs.

Attention Distribution

Full n×n attention distribution with causal masking

Each row shows how one token attends to itself and all earlier tokens (causal, autoregressive masking)
"John"
"sat"
"then"
"he"
"ran"
"John"
1.00
JohnJohn: 1.000
"sat"
0.64
satJohn: 0.640
0.36
satsat: 0.360
"then"
0.45
thenJohn: 0.450
0.32
thensat: 0.320
0.23
thenthen: 0.230
"he"
0.72
heJohn: 0.720
0.15
hesat: 0.150
0.08
hethen: 0.080
0.05
hehe: 0.050
"ran"
0.22
ranJohn: 0.220
0.28
ransat: 0.280
0.25
ranthen: 0.250
0.15
ranhe: 0.150
0.10
ranran: 0.100
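
The matrix above is just a row-wise softmax of query-key dot products with future positions masked out. Here is a minimal NumPy sketch of that computation; the vectors are random, so the printed weights won't match the example values above:

```python
import numpy as np

def causal_attention_weights(Q, K):
    """Row-wise softmax of Q @ K.T with a causal mask: each token can
    only attend to itself and earlier positions."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                    # (n, n) raw scores
    scores[np.triu(np.ones((n, n), dtype=bool), k=1)] = -np.inf  # hide the future
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    return w / w.sum(axis=1, keepdims=True)          # each row sums to 1.0

rng = np.random.default_rng(0)
n_tokens, d_model = 5, 3                             # "John sat then he ran"
Q = rng.normal(size=(n_tokens, d_model))
K = rng.normal(size=(n_tokens, d_model))
print(causal_attention_weights(Q, K).round(3))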

Single Token Generation

"he" attends to the 1st token

Previous Tokens:
Query Token:
"John"
K1
[0.45]
[0.82]
[0.37]
V1
[0.72]
[0.31]
[0.64]
"sat"
K2
[--]
[--]
[--]
V2
[--]
[--]
[--]
"then"
K3
[--]
[--]
[--]
V3
[--]
[--]
[--]
"he"
Q1
[0.89]
[0.44]
[0.12]
Q·K1 = 0.845
α1 = 1.000
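
The same arithmetic in a few lines of NumPy. The raw dot product is used here to match the panel; production attention would also divide the score by √d:

```python
import numpy as np

k1 = np.array([0.45, 0.82, 0.37])   # key for "John" (from the panel above)
v1 = np.array([0.72, 0.31, 0.64])   # value for "John"
q  = np.array([0.89, 0.44, 0.12])   # query for "he"

score = q @ k1            # 0.89*0.45 + 0.44*0.82 + 0.12*0.37 ≈ 0.806
alpha = 1.0               # softmax over one visible score is trivially 1
output = alpha * v1       # with a single unmasked key, the output is V1

print(f"Q·K1 = {score:.3f}, alpha1 = {alpha:.3f}, output = {output}")
```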

Multi-Token Generation (No KV Cache)

Starting generation: "then" queries the context to predict the next token.

Autoregressive token prediction (no caching):

"John":   K1, V1  (recomputed from the embedding at this step)
"sat":    K2, V2  (recomputed from the embedding at this step)
"then":   Q3 = [0.31, 0.92, 0.56]

🔄 Autoregressive Generation
• Context length: 2 tokens
• Query token: "then", computing attention over the context
• No caching: every key and value vector is recomputed at each generation step (see the sketch below)
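
A minimal NumPy sketch of this uncached loop. The names Wq/Wk/Wv, attend, and the feeding of the attention output back in as the "next token" are illustrative stand-ins for a full transformer decode step:

```python
import numpy as np

def attend(q, K, V):
    """Scaled dot-product attention for one query vector."""
    scores = K @ q / np.sqrt(q.size)
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ V

def generate_no_cache(ctx, Wq, Wk, Wv, n_steps):
    """Re-project K and V for the ENTIRE context at every step:
    across n steps that is O(n^2) projection work."""
    for _ in range(n_steps):
        K = ctx @ Wk                 # recomputed from scratch each step
        V = ctx @ Wv
        q = ctx[-1] @ Wq             # query for the newest token
        nxt = attend(q, K, V)        # stand-in for the real next-token step
        ctx = np.vstack([ctx, nxt])
    return ctx

rng = np.random.default_rng(0)
d = 3
ctx = rng.normal(size=(2, d))        # embeddings for "John", "sat"
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(generate_no_cache(ctx, Wq, Wk, Wv, n_steps=3).round(3))
```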

Multi-Token Generation (With KV Cache)

Starting generation: "then" queries the context to predict the next token.

Autoregressive token prediction (with cache):

"John":   K1, V1  (read directly from the cache)
"sat":    K2, V2  (read directly from the cache)
"then":   Q3 = [0.31, 0.92, 0.56]

Autoregressive Generation with KV Cache
• Context length: 2 tokens
• Query token: "then", computing attention over the context
• KV caching enabled: key and value vectors are computed once and reused (see the sketch below)
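
The cached counterpart of the sketch above, under the same illustrative assumptions. Each step projects only the newest token, and with identical weights and prompt it produces exactly the same sequence as the uncached loop:

```python
import numpy as np

def attend(q, K, V):
    """Scaled dot-product attention for one query vector."""
    scores = K @ q / np.sqrt(q.size)
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ V

def generate_with_cache(ctx, Wq, Wk, Wv, n_steps):
    """Project each token's K and V exactly once: every step only
    projects the newest token and appends one row to each cache."""
    K_cache = ctx @ Wk               # prefill: project the prompt once
    V_cache = ctx @ Wv
    for _ in range(n_steps):
        q = ctx[-1] @ Wq
        nxt = attend(q, K_cache, V_cache)
        ctx = np.vstack([ctx, nxt])
        K_cache = np.vstack([K_cache, nxt @ Wk])   # one new row, not n
        V_cache = np.vstack([V_cache, nxt @ Wv])
    return ctx

rng = np.random.default_rng(0)
d = 3
ctx = rng.normal(size=(2, d))        # embeddings for "John", "sat"
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
# Same weights, same prompt: the cached and uncached paths agree exactly.
print(generate_with_cache(ctx, Wq, Wk, Wv, n_steps=3).round(3))
```

The only difference between the two loops is where the key and value rows come from: the uncached version re-derives all of them every step, while the cached version appends one new row per step, which is what turns O(n²) total projection work into O(n).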