Attention Distribution

Full n×n attention distribution over the sequence "John sat then he ran", with causal masking applied. Each row shows how one token attends to itself and to all preceding tokens; future positions are masked out, as required for autoregressive decoding. Note, for example, that "he" attends most strongly to "John" (0.72), the pronoun's antecedent.
"John"
"sat"
"then"
"he"
"ran"
"John"
1.00
JohnJohn: 1.000
"sat"
0.64
satJohn: 0.640
0.36
satsat: 0.360
"then"
0.45
thenJohn: 0.450
0.32
thensat: 0.320
0.23
thenthen: 0.230
"he"
0.72
heJohn: 0.720
0.15
hesat: 0.150
0.08
hethen: 0.080
0.05
hehe: 0.050
"ran"
0.22
ranJohn: 0.220
0.28
ransat: 0.280
0.25
ranthen: 0.250
0.15
ranhe: 0.150
0.10
ranran: 0.100
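As a minimal sketch of how such a distribution is produced, the snippet below applies a causal mask to a matrix of raw query-key scores and takes a row-wise softmax, assuming single-head attention with pre-scaled scores. The random scores here are illustrative only; they are not the scores that produced the table above.

```python
import numpy as np

def causal_attention_weights(scores: np.ndarray) -> np.ndarray:
    """Row-wise softmax over raw attention scores with a causal mask.

    scores: (n, n) matrix of (already scaled) query-key dot products.
    Returns an (n, n) matrix where row i is a distribution over keys 0..i.
    """
    n = scores.shape[0]
    # Causal mask: position i may attend only to positions j <= i,
    # so the strict upper triangle (future positions) is masked.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    # Numerically stable softmax per row; exp(-inf) makes masked cells exactly 0.
    masked -= masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

tokens = ["John", "sat", "then", "he", "ran"]
rng = np.random.default_rng(0)
scores = rng.normal(size=(len(tokens), len(tokens)))  # illustrative, not from a trained model
A = causal_attention_weights(scores)
print(np.round(A, 2))  # lower-triangular; each row sums to 1.0
```

Because the mask sets future scores to negative infinity before the softmax, the resulting matrix is lower-triangular by construction, matching the zeroed upper triangle in the table.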