Watch how key-value (KV) caching transforms attention from O(n²) recomputation at every generation step into efficient incremental processing: past keys and values are cached and reused, and the generated outputs remain identical.
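A minimal sketch of the idea, using NumPy and a single attention head with made-up random projections (the names `full_causal_attention` and `cached_attention` are illustrative, not from any library): the cached version appends each new key/value to a running cache and attends only over that prefix, yet produces the same outputs as recomputing the full causally masked n×n attention.

```python
import numpy as np

np.random.seed(0)
d = 8  # head dimension (illustrative)
n = 5  # sequence length (illustrative)

# Stand-ins for the projected queries, keys, values of one head
Q = np.random.randn(n, d)
K = np.random.randn(n, d)
V = np.random.randn(n, d)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def full_causal_attention(Q, K, V):
    """Recompute the full n x n attention with a causal mask."""
    scores = Q @ K.T / np.sqrt(d)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores[mask] = -np.inf  # each token sees only itself and earlier tokens
    return softmax(scores) @ V

def cached_attention(Q, K, V):
    """Incremental decoding: each new token attends over the cached K/V prefix."""
    k_cache, v_cache, outs = [], [], []
    for t in range(n):
        k_cache.append(K[t])            # append this step's key and value
        v_cache.append(V[t])            # instead of recomputing them all
        Kc, Vc = np.stack(k_cache), np.stack(v_cache)
        w = softmax(Q[t] @ Kc.T / np.sqrt(d))  # causal by construction: only t+1 entries
        outs.append(w @ Vc)
    return np.stack(outs)

# Both paths yield the same attention outputs
assert np.allclose(full_causal_attention(Q, K, V), cached_attention(Q, K, V))
```

The per-step cost of the cached path grows linearly with the prefix length, while the full recompute redoes all n² score entries at every step.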
Full n×n attention distribution with causal masking
"he" attends to the 1st token
Starting generation - "then" queries for next token prediction