The Kimi K2.5 benchmark, but done by OpenAI data scientists