DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention
DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention
要約
Current hierarchical attention methods, such as NSA and InfLLMv2, select the top-k relevant key-value (KV) blocks based on coarse attention scores and subsequently apply fine-grained softmax attention on the selected tokens. However, the top-k operation assumes the number of relevant tokens for any …