Interesting. IIRC, excluding the current token's own KV via the attention mask (i.e. removing the diagonal, so each token attends only to strictly earlier positions) doesn't work! Hypothesis: the current token's own KV effectively serves as an attention sink, so dropping it breaks things.
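For concreteness, here's a minimal sketch (names are mine, not from any specific codebase) of what "removing the diagonal" means: a standard causal mask allows token i to attend to keys j <= i, while the strictly-causal variant allows only j < i. Note one immediate failure mode of the strict variant: the first token has no keys at all, so its softmax is over an empty set.

```python
import numpy as np

def attention_mask(seq_len, exclude_diagonal=False):
    """Boolean mask where True means "may attend".

    exclude_diagonal=False: standard causal mask, token i attends to j <= i.
    exclude_diagonal=True:  strictly causal, token i attends to j < i only
    (drops the token's own KV; row 0 is then all False).
    """
    k = -1 if exclude_diagonal else 0
    return np.tril(np.ones((seq_len, seq_len), dtype=bool), k=k)

causal = attention_mask(4)
strict = attention_mask(4, exclude_diagonal=True)
# In the strict mask the diagonal is gone and the first token
# has nothing to attend to:
assert causal[2, 2] and not strict[2, 2]
assert strict[0].sum() == 0
```

Applying `strict` in a softmax attention would set all of row 0's logits to -inf, producing NaNs for the first token unless it's special-cased, which is separate from (and in addition to) the attention-sink hypothesis above.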