Rewrite attention sink from eviction to ring buffer (#18821)
Summary:
Replace the eviction-based attention sink implementation with a torch.export-compatible
ring buffer approach, and rewrite all tests.
Key changes:
- RopeWithAttentionSink: simplified to pass through original positions (no more
position shifting or k re-rotation)
- KVCacheWithAttentionSink: uses a ring buffer with index_copy_ instead of dynamic
eviction (torch.cat/narrow/shift). Cache layout: [sink slots | ring buffer].
Sets is_ring_buffer=True so AttentionMHA.forward handles masking natively.
- CachePositionsManagerWithSink: new module that maps positions to cache indices,
with sink tokens in fixed slots and window tokens in ring buffer region.
- AttentionMHA.forward: ring buffer models skip start_pos bounds check and compute
their own causal mask after KV cache update.
- Remove eviction_batch_size from all interfaces (no longer needed).
- Remove attention_sink_forward monkey-patch and rerotate_k dead code.
- Add llama_attention_sink.yaml example config.
- Replace the 16 eviction-based tests with 18 ring-buffer tests covering sink
preservation, ring wrapping, causal masking, and degenerate cases.
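The [sink slots | ring buffer] layout and the index_copy_ write path can be sketched in plain Python. This is an illustrative sketch, not the actual module code: cache_slot and ring_write are hypothetical names, and the real cache holds per-layer KV tensors rather than a Python list.

```python
def cache_slot(pos: int, sink_size: int, window_size: int) -> int:
    # Sink tokens occupy fixed slots [0, sink_size); later tokens
    # wrap around the ring buffer region [sink_size, sink_size + window_size).
    if pos < sink_size:
        return pos
    return sink_size + (pos - sink_size) % window_size

def ring_write(cache: list, pos: int, value, sink_size: int) -> None:
    # Single-slot overwrite, analogous to index_copy_ on the KV tensor:
    # no torch.cat/narrow shifting, so tensor shapes stay static,
    # which is what makes the cache torch.export-compatible.
    window_size = len(cache) - sink_size
    cache[cache_slot(pos, sink_size, window_size)] = value

# With 2 sink slots and a 4-slot window, positions 0-1 stay put
# and positions 2+ wrap around slots 2-5.
cache = [None] * 6
for pos in range(10):
    ring_write(cache, pos, pos, sink_size=2)
# cache is now [0, 1, 6, 7, 8, 9]: sinks preserved, window holds the tail.
```

The design choice this illustrates: because every decode step writes exactly one slot in place, the exported graph never needs data-dependent shapes or an eviction branch.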
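Once the ring wraps, slot order no longer matches position order, so the forward pass must derive its own causal mask from stored positions rather than use a fixed triangular mask. A hypothetical pure-Python sketch of which slots a query at absolute position pos may attend to (visible_slots is an illustrative name, not the actual AttentionMHA code):

```python
def visible_slots(pos: int, sink_size: int, window_size: int) -> list:
    # Sink slots are always visible once written, subject to causality.
    slots = list(range(min(pos + 1, sink_size)))
    # Of the non-sink positions, only the most recent `window_size` up to
    # `pos` still live in the ring region; earlier ones were overwritten.
    start = max(sink_size, pos + 1 - window_size)
    for p in range(start, pos + 1):
        slots.append(sink_size + (p - sink_size) % window_size)
    return sorted(slots)

# Before wrapping, this reduces to an ordinary causal mask:
assert visible_slots(3, 4, 8) == [0, 1, 2, 3]
# After wrapping, every slot is visible, but the slots hold the
# sink tokens plus only the most recent window of tokens.
assert visible_slots(15, 4, 8) == list(range(12))
```

In tensor form this would be computed as a boolean mask over cache slots after the KV update, which is why ring-buffer models skip the start_pos bounds check: the stored positions, not start_pos, determine visibility.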
Differential Revision: D100216687