Very cool blog by @character_ai diving into how they trained their proprietary Kaiju models (13B, 34B, 110B) before switching to OSS models, and spoiler: it has Noam Shazeer written all over it.
Most of the model design choices (MQA, SWA, KV cache, quantization) are not there to optimize for "AGI benchmarks" (think MMLU), since that's not what people use the model for, but to get good serving speed. Still, they include code in the pre-training mix and do annealing on high-quality, "benchmark friendly" data.
One surprising thing is that these models are not MoEs, even though people working at Character at the time, like @stephenroller, and Noam himself had previously worked on MoE.
Here are a few of the optimizations they did (rough sketches of some of them after the list):
-> MuP-like scaling
-> MQA + SWA
-> Clamping everywhere to control activations (not sure if it's soft or hard?)
-> KV Cache sharing
-> Relu^2 activation function
-> FSDP + TP + SP
-> Int6 gradient communication
-> Quantization Aware Training (QAT) with stuff like "bungee_scalar" to get a stable recipe for smaller models. KV cache and the forward pass are in int8, gradients and activations in bf16, master weights and grad accumulation in fp32.
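
On "muP-like scaling", here's roughly what the generic muP recipe looks like (standard rules from the Tensor Programs line of work, not necessarily their exact setup): tune hyperparameters at a small base width, then transfer them by rescaling init, per-layer learning rates and the attention scale.

```python
import math

# Generic muP-style scalings when going from base_width to width
# (assumption: textbook muP rules, not Character.AI's actual recipe).
def mup_scalings(width: int, base_width: int, base_lr: float, d_head: int):
    ratio = width / base_width
    return {
        "hidden_init_std": 1.0 / math.sqrt(width),  # init std shrinks with fan-in
        "hidden_lr": base_lr / ratio,               # matrix-like params: LR ~ 1/width
        "embedding_lr": base_lr,                    # vector-like params keep the base LR
        "output_logit_mult": 1.0 / ratio,           # scale down unembedding logits
        "attn_scale": 1.0 / d_head,                 # 1/d_head instead of 1/sqrt(d_head)
    }
```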
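For MQA + SWA, a minimal PyTorch sketch of how the two combine (my reconstruction, not their code): a single shared K/V head for all query heads, plus a causal mask restricted to the last `window` positions, which is what keeps the KV cache small and bounded.

```python
import torch
import torch.nn as nn

class MQASlidingWindowAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, window: int = 1024):
        super().__init__()
        self.n_heads, self.d_head, self.window = n_heads, d_model // n_heads, window
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        # MQA: one K head and one V head shared by all query heads
        self.k_proj = nn.Linear(d_model, self.d_head, bias=False)
        self.v_proj = nn.Linear(d_model, self.d_head, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):  # x: (B, T, d_model)
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)  # (B, H, T, Dh)
        k = self.k_proj(x).unsqueeze(1)  # (B, 1, T, Dh), broadcast over all heads
        v = self.v_proj(x).unsqueeze(1)
        # SWA: causal mask that also drops anything older than `window` tokens
        i = torch.arange(T, device=x.device)
        keep = (i[:, None] >= i[None, :]) & (i[:, None] - i[None, :] < self.window)
        att = (q @ k.transpose(-2, -1)) / self.d_head**0.5  # (B, H, T, T)
        att = att.masked_fill(~keep, float("-inf")).softmax(dim=-1)
        return self.o_proj((att @ v).transpose(1, 2).reshape(B, T, -1))
```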
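On the clamping question (soft vs hard), the two usual variants look like this, just to illustrate the difference:

```python
import torch

def hard_clamp(x, cap: float = 30.0):
    # hard clamp: values are cut at +/-cap, gradient is zero outside the range
    return torch.clamp(x, -cap, cap)

def soft_clamp(x, cap: float = 30.0):
    # soft capping (the tanh trick, e.g. Gemma 2's logit capping):
    # saturates smoothly toward +/-cap, gradient never goes exactly to zero
    return cap * torch.tanh(x / cap)
```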
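For KV cache sharing, my guess is cross-layer sharing (neighbouring layers reusing the same K/V), in line with what they've described on the inference side. A hypothetical single-head sketch, causal mask omitted for brevity:

```python
import torch
import torch.nn as nn

class SharedKVAttention(nn.Module):
    """Attention layer that can reuse K/V computed by an earlier layer (sketch)."""
    def __init__(self, d: int):
        super().__init__()
        self.q, self.k, self.v, self.o = (nn.Linear(d, d, bias=False) for _ in range(4))
        self.d = d

    def forward(self, x, shared_kv=None):
        if shared_kv is None:                      # this layer owns its K/V
            shared_kv = (self.k(x), self.v(x))
        k, v = shared_kv                           # otherwise reuse, no new KV cache entry
        att = torch.softmax(self.q(x) @ k.transpose(-2, -1) / self.d**0.5, dim=-1)
        return x + self.o(att @ v), shared_kv

layers = nn.ModuleList(SharedKVAttention(128) for _ in range(6))
x, kv = torch.randn(2, 32, 128), None
for i, layer in enumerate(layers):
    # even layers compute fresh K/V, odd layers reuse the previous ones,
    # so at inference you only cache K/V for half of the layers
    x, kv = layer(x, shared_kv=None if i % 2 == 0 else kv)
```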
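ReLU^2 (squared ReLU) is just this in the MLP, a drop-in replacement for GELU/SiLU (same activation the Primer paper landed on):

```python
import torch
import torch.nn as nn

class ReluSquaredMLP(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.down(torch.relu(self.up(x)) ** 2)  # relu(x)^2 instead of gelu(x)
```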
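And on the QAT side, the generic trick is fake quantization with a straight-through estimator: the forward pass sees int8-rounded values, the backward pass pretends the rounding never happened. "bungee_scalar" is their internal recipe detail and isn't public, so this is only the vanilla version of the idea:

```python
import torch

def fake_quant_int8(x: torch.Tensor) -> torch.Tensor:
    scale = x.abs().amax().clamp(min=1e-8) / 127.0             # per-tensor symmetric scale
    x_q = torch.clamp(torch.round(x / scale), -127, 127) * scale
    return x + (x_q - x).detach()                              # STE: gradient of identity

# e.g. applied to an activation (or the KV cache) during training:
x = torch.randn(4, 16, requires_grad=True)
fake_quant_int8(x).sum().backward()
print(x.grad.unique())  # all ones: the rounding is invisible to the backward pass
```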
