Very cool blog by @character_ai diving into how they trained their proprietary Kaiju models (13B, 34B, 110B) before switching to OSS models, and spoiler: it has Noam Shazeer written all over it.
Most of the model design choices (MQA, SWA, KV cache, quantization) are not there to optimize for "AGI benchmarks" (think MMLU), since that's not what people use the model for, but to get good serving speed. Still, they include code in the pre-training mix and do annealing on high-quality, "benchmark friendly" data.
One surprising thing is that these models are not MoEs, even though people working at Character at the time, like @stephenroller, and Noam himself had previously worked on MoE.
Here are a few of the optimizations they did (rough sketches of some of them after the list):
-> MuP-like scaling
-> MQA + SWA
-> Clamping everywhere to control activations (not sure if it's soft or hard?)
-> KV Cache sharing
-> Relu^2 activation function
-> FSDP + TP + SP
-> Int6 gradient communication
-> Quantization Aware Training (QAT) with stuff like "bungee_scalar" to get a stable recipe for smaller models. KV cache and the forward pass are in int8, gradients and activations in bf16, master weights and grad accumulation in fp32.
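
On "muP-like scaling", here's roughly what the generic muP recipe looks like (standard rules from the Tensor Programs line of work, not necessarily their exact setup): tune hyperparameters at a small base width, then transfer them by rescaling init, per-layer learning rates and the attention scale.

```python
import math

# Generic muP-style scalings when going from base_width to width
# (assumption: textbook muP rules, not Character.AI's actual recipe).
def mup_scalings(width: int, base_width: int, base_lr: float, d_head: int):
    ratio = width / base_width
    return {
        "hidden_init_std": 1.0 / math.sqrt(width),  # init std shrinks with fan-in
        "hidden_lr": base_lr / ratio,               # matrix-like params: LR ~ 1/width
        "embedding_lr": base_lr,                    # vector-like params keep the base LR
        "output_logit_mult": 1.0 / ratio,           # scale down unembedding logits
        "attn_scale": 1.0 / d_head,                 # 1/d_head instead of 1/sqrt(d_head)
    }
```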
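For MQA + SWA, a minimal PyTorch sketch of how the two combine (my reconstruction, not their code): a single shared K/V head for all query heads, plus a causal mask restricted to the last `window` positions, which is what keeps the KV cache small and bounded.

```python
import torch
import torch.nn as nn

class MQASlidingWindowAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, window: int = 1024):
        super().__init__()
        self.n_heads, self.d_head, self.window = n_heads, d_model // n_heads, window
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        # MQA: one K head and one V head shared by all query heads
        self.k_proj = nn.Linear(d_model, self.d_head, bias=False)
        self.v_proj = nn.Linear(d_model, self.d_head, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):  # x: (B, T, d_model)
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)  # (B, H, T, Dh)
        k = self.k_proj(x).unsqueeze(1)  # (B, 1, T, Dh), broadcast over all heads
        v = self.v_proj(x).unsqueeze(1)
        # SWA: causal mask that also drops anything older than `window` tokens
        i = torch.arange(T, device=x.device)
        keep = (i[:, None] >= i[None, :]) & (i[:, None] - i[None, :] < self.window)
        att = (q @ k.transpose(-2, -1)) / self.d_head**0.5  # (B, H, T, T)
        att = att.masked_fill(~keep, float("-inf")).softmax(dim=-1)
        return self.o_proj((att @ v).transpose(1, 2).reshape(B, T, -1))
```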
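On the clamping question (soft vs hard), the two usual variants look like this, just to illustrate the difference:

```python
import torch

def hard_clamp(x, cap: float = 30.0):
    # hard clamp: values are cut at +/-cap, gradient is zero outside the range
    return torch.clamp(x, -cap, cap)

def soft_clamp(x, cap: float = 30.0):
    # soft capping (the tanh trick, e.g. Gemma 2's logit capping):
    # saturates smoothly toward +/-cap, gradient never goes exactly to zero
    return cap * torch.tanh(x / cap)
```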
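For KV cache sharing, my guess is cross-layer sharing (neighbouring layers reusing the same K/V), in line with what they've described on the inference side. A hypothetical single-head sketch, causal mask omitted for brevity:

```python
import torch
import torch.nn as nn

class SharedKVAttention(nn.Module):
    """Attention layer that can reuse K/V computed by an earlier layer (sketch)."""
    def __init__(self, d: int):
        super().__init__()
        self.q, self.k, self.v, self.o = (nn.Linear(d, d, bias=False) for _ in range(4))
        self.d = d

    def forward(self, x, shared_kv=None):
        if shared_kv is None:                      # this layer owns its K/V
            shared_kv = (self.k(x), self.v(x))
        k, v = shared_kv                           # otherwise reuse, no new KV cache entry
        att = torch.softmax(self.q(x) @ k.transpose(-2, -1) / self.d**0.5, dim=-1)
        return x + self.o(att @ v), shared_kv

layers = nn.ModuleList(SharedKVAttention(128) for _ in range(6))
x, kv = torch.randn(2, 32, 128), None
for i, layer in enumerate(layers):
    # even layers compute fresh K/V, odd layers reuse the previous ones,
    # so at inference you only cache K/V for half of the layers
    x, kv = layer(x, shared_kv=None if i % 2 == 0 else kv)
```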
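ReLU^2 (squared ReLU) is just this in the MLP, a drop-in replacement for GELU/SiLU (same activation the Primer paper landed on):

```python
import torch
import torch.nn as nn

class ReluSquaredMLP(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.down(torch.relu(self.up(x)) ** 2)  # relu(x)^2 instead of gelu(x)
```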
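And on the QAT side, the generic trick is fake quantization with a straight-through estimator: the forward pass sees int8-rounded values, the backward pass pretends the rounding never happened. "bungee_scalar" is their internal recipe detail and isn't public, so this is only the vanilla version of the idea:

```python
import torch

def fake_quant_int8(x: torch.Tensor) -> torch.Tensor:
    scale = x.abs().amax().clamp(min=1e-8) / 127.0             # per-tensor symmetric scale
    x_q = torch.clamp(torch.round(x / scale), -127, 127) * scale
    return x + (x_q - x).detach()                              # STE: gradient of identity

# e.g. applied to an activation (or the KV cache) during training:
x = torch.randn(4, 16, requires_grad=True)
fake_quant_int8(x).sum().backward()
print(x.grad.unique())  # all ones: the rounding is invisible to the backward pass
```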
