i spent a few hours going through the /karpathy/autoresearch repo line by line. the "ai agents doing research" angle is what's getting all the attention, but i think the more interesting thing is what's actually inside the training script and the engineering decisions that make the search loop tight. it's one of the densest single-file training setups i've read.

let me start with the thing that makes the whole project possible: the time budget is fixed at 300 seconds of wall clock. not fixed steps, not fixed tokens, not fixed flops. wall clock seconds. this sounds like a minor detail but it's the entire reason the autonomous loop works. the agent can make the model 3x bigger, cut the batch size in half, swap in a completely different architecture, and the result is still directly comparable to every other experiment because they all got exactly 5 minutes of training on the same gpu. if you fixed steps instead, a bigger model would get fewer gradient updates per second and you'd be penalizing it unfairly. if you fixed tokens, you'd have the same problem. fixing wall time means you're asking the right question: given this hardware and this much time, what is the best model you can produce? everything else is a free variable. the agent can explore the full pareto surface of model size vs throughput vs convergence speed without any of those tradeoffs being confounded by the evaluation protocol.

the metric is also carefully chosen. it's bits per byte, not cross entropy loss. cross entropy depends on your vocab size: a model with 32k tokens and a model with 8k tokens will have very different loss values even if they compress the data equally well. bpb normalizes this away by summing the per-token cross entropy in nats, summing the utf-8 byte lengths of the target tokens, and converting nats-per-byte to bits-per-byte. so even if the agent changes something that affects the effective token distribution, the comparison remains fair.
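the conversion is easy to sketch. this is my paraphrase rather than the repo's code, and tensor names like token_bytes are mine:

```python
import math
import torch
import torch.nn.functional as F

def bits_per_byte(logits: torch.Tensor, targets: torch.Tensor,
                  token_bytes: torch.Tensor) -> float:
    # per-token cross entropy in nats, summed over all positions
    nats = F.cross_entropy(logits.view(-1, logits.size(-1)),
                           targets.view(-1), reduction="sum")
    # total utf-8 bytes covered by the target tokens
    n_bytes = token_bytes.sum()
    # nats-per-byte -> bits-per-byte; invariant to vocab size
    return (nats / n_bytes / math.log(2)).item()
```

the key property: a retokenization that halves the token count roughly doubles the per-token loss but also doubles the bytes per token, so bpb is unchanged.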
these two choices, fixed wall time and a vocab-invariant metric, turn what would be a messy, incomparable search into a clean optimization problem.

now the model itself. it's a GPT but with a bunch of modern tricks that are worth understanding. first, RMSNorm everywhere: on the block inputs (pre-norm), and also on queries and keys right before the attention dot product. this QK-norm is important because without it the norms of q and k can grow unboundedly during training, causing attention logits to sharpen and softmax to saturate. normalizing q and k keeps the dot products in a stable range regardless of how deep the network is or how training dynamics evolve.

the attention itself is FlashAttention 3, loaded through the kernels library. it uses varunneal's implementation on hopper (sm_90) and falls back to a community build on older gpus. the attention pattern is "SSSL": three layers of sliding-window attention (window = half the sequence length) followed by one layer of full causal attention, repeating. this is the sparse-to-dense pattern you see in mistral and gemma2. the local attention layers are computationally cheap because the attention matrix is banded, and the periodic global layer lets information flow across the full context. with 8 layers and a 4-character pattern you get layers 0, 1, 2 local, layer 3 global, layers 4, 5, 6 local, layer 7 global. the last layer is forced global regardless of pattern.

the value embedding thing is subtle and i think underappreciated. every other layer gets its own embedding table, completely separate from the main token embedding, that maps token ids directly to value-dimension vectors. these get mixed into the attention values through a learned gate: v = v + 2 * sigmoid(W_gate @ x[:32]) * ve. the gate weight is zero-initialized, so sigmoid(0) = 0.5, times 2, gives a gain of exactly 1.0, a neutral starting point.
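here's roughly what that gated mix could look like in code. this is a reconstruction from the description, not the repo's implementation; the module name, shapes, and per-head gate layout are my guesses:

```python
import torch
import torch.nn as nn

class GatedValueEmbedding(nn.Module):
    """sketch: mix a separate token->value embedding into attention values."""
    def __init__(self, vocab_size: int, head_dim: int, n_heads: int):
        super().__init__()
        # separate table from the main token embedding, sized for the values
        self.ve = nn.Embedding(vocab_size, n_heads * head_dim)
        # one gate per head, reading the first 32 dims of the hidden state;
        # zero-init so 2 * sigmoid(0) = 1.0 at the start of training
        self.gate = nn.Linear(32, n_heads, bias=False)
        nn.init.zeros_(self.gate.weight)
        self.n_heads, self.head_dim = n_heads, head_dim

    def forward(self, v, idx, x):
        # v: (B, n_heads, T, head_dim), idx: (B, T) token ids, x: (B, T, d_model)
        B, T = idx.shape
        ve = self.ve(idx).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        g = 2 * torch.sigmoid(self.gate(x[..., :32]))   # (B, T, n_heads)
        return v + g.transpose(1, 2).unsqueeze(-1) * ve
```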
over training, the model can learn to amplify or suppress the value embedding per head based on the first 32 dimensions of the hidden state. this is from the ResFormer line of work, and the intuition is that it gives attention a direct shortcut to token identity. the value vectors can carry information about "what token is at this position" without that information having to survive the residual stream transformations from earlier layers. it's essentially a skip connection from the input directly into the attention values, gated so the model can decide when it's useful.

there are also per-layer learnable scalars on the residual stream: x = lambda_resid[i] * x + lambda_x0[i] * x0, where x0 is the normalized embedding from layer 0. every layer can independently control how much it listens to the running residual vs the original input. the residual lambdas start at 1.0, the x0 lambdas at 0.1. this is a soft version of the "disentangled residual" idea: in a standard transformer the residual stream is a sum of all previous layer outputs, and it gets increasingly polluted as you go deeper. giving each layer access to the clean original embedding means it doesn't have to learn to "undo" earlier layers to recover low-level information. the logits are softcapped at 15 via tanh(logits / 15) * 15, which prevents the model from being overconfident early in training when the representations are still noisy.

but honestly the most interesting part of the whole file is the optimizer. MuonAdamW is a combined optimizer that dispatches different update rules based on parameter group. embeddings (token embedding, value embeddings, unembedding head) and per-layer scalars get standard AdamW with a different learning rate for each group. the spread is wild: embedding lr is 0.6, unembedding lr is 0.004. that's a 150x difference, and it's intentional. the embedding matrix sees every single token and needs to update aggressively.
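the AdamW side of that dispatch could be sketched as plain param groups. the 0.6 and 0.004 numbers are from the script as i read it; the group structure, the value-embedding lr, and the scalar lr here are my guesses:

```python
def adamw_param_groups(wte, value_embeds, lm_head, scalars, d_model: int):
    # muP-inspired width correction: embedding-family lrs scale as (d_model/768)^-0.5
    s = (d_model / 768) ** -0.5
    return [
        {"params": [wte], "lr": 0.6 * s},         # token embedding: aggressive
        {"params": value_embeds, "lr": 0.6 * s},  # value embeddings (same lr is my assumption)
        {"params": [lm_head], "lr": 0.004 * s},   # unembedding: 150x smaller, for stability
        {"params": scalars, "lr": 0.02},          # per-layer lambdas: placeholder lr, not width-scaled
    ]
```

the 2D transformer matrices never enter these groups; they go to the muon path instead.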
the unembedding matrix is a linear probe on the final representation and benefits from stability. the embedding, value embedding, and unembedding learning rates are all scaled by (d_model / 768)^(-0.5), which is a muP-inspired correction: as model width changes, those learning rates adjust to keep the feature learning dynamics scale-invariant. the scalar learning rates for the per-layer lambdas are handled separately and don't get this scaling.

the 2D weight matrices in the transformer, attention projections and mlp weights, get Muon, and this is where it gets genuinely interesting. muon takes the gradient, applies nesterov momentum, then runs a newton-schulz iteration to approximate the polar decomposition of the gradient matrix. the polar decomposition factors a matrix G into G = U * S where U is (semi-)orthogonal and S is symmetric positive semi-definite. muon computes U, the nearest orthogonal matrix to the gradient, and uses that as the update direction. the newton-schulz iteration runs for 5 steps. for tall matrices (more rows than columns), A = X^T @ X then X -> aX + X @ (bA + cA^2); for wide matrices, A = X @ X^T then X -> aX + (bA + cA^2) @ X. the coefficients are hardcoded from a precomputation; they call it "polar express." the whole thing compiles to a single fused kernel via torch.compile.

why does this matter? because for weight matrices the frobenius-norm gradient (what adam and sgd use) is geometrically wrong. the "correct" steepest descent direction for a weight matrix is the one that minimizes the loss subject to the update having unit spectral norm, not unit frobenius norm. the orthogonal polar factor gives you exactly this. in practice it means muon makes much larger effective updates because it isn't spending step size on a few dominant singular directions: orthogonalization pushes every singular value of the update toward 1, so the step is spread uniformly across all directions instead of being concentrated where the gradient happens to be largest. this is why muon converges significantly faster than adam on transformer weight matrices.
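a minimal version of the newton-schulz orthogonalization, using the fixed quintic coefficient triple from keller jordan's original muon writeup rather than the repo's precomputed "polar express" schedule (which i'm not reproducing here):

```python
import torch

@torch.no_grad()
def orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """approximate the orthogonal polar factor of G via newton-schulz iteration.
    coefficients are from the original Muon implementation, not this repo."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.float()
    transposed = X.size(0) > X.size(1)
    if transposed:                    # always iterate in the wide orientation
        X = X.T
    X = X / (X.norm() + 1e-7)        # spectral norm <= frobenius norm <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X   # quintic polynomial in X
    return X.T if transposed else X
```

after 5 steps the singular values of the output sit near 1 (the quintic deliberately trades exactness for speed, so they oscillate in a band around 1 rather than converging tightly).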
muon does maintain per-element momentum buffers (same shape as the parameters, stacked across each shape group), but unlike adam it doesn't track per-element second moments. that's where NorMuon comes in: on top of the base muon there's a variance reduction scheme. after orthogonalization, it computes per-row (or per-column, depending on aspect ratio) second moment estimates, maintains an exponential moving average of those, and rescales the update so each output dimension gets its own adaptive step size. it's essentially the adam adaptivity idea, but applied in the orthogonalized coordinate system rather than the raw parameter space.

the weight decay is also non-standard. it's "cautious," meaning it only decays parameters where the muon update direction agrees with the parameter sign: mask = (g * params) >= 0. this avoids the known failure mode where weight decay pushes parameters toward zero against the update's wishes, which can destabilize training.

one small detail i appreciated: after the very first training step, the code calls gc.collect(), gc.freeze(), gc.disable() to completely shut off python's garbage collector. python's GC runs periodically and causes ~500ms stalls. when your total budget is 300 seconds and each step is maybe 300ms, a single random GC pause costs you almost 2 training steps. they manually trigger gc.collect() every 5000 steps as a compromise. this is the kind of thing you only learn by profiling real training runs and noticing mysterious throughput drops.

the first 11 steps (0 through 10) aren't counted toward the time budget either. that's the warmup where torch.compile does its thing and CUDA kernels get JIT'd. without this exclusion, different experiments would get different amounts of "real" training depending on how long compilation takes for that particular model configuration.
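putting the timing discipline together in a sketch (function and argument names are mine; the budget is a parameter here only so the sketch is self-contained):

```python
import gc
import time

def train_loop(step_fn, budget_s: float = 300.0, warmup: int = 11,
               gc_every: int = 5000) -> int:
    """sketch: warmup steps are free (compile/JIT), then the wall clock starts."""
    step, t0 = 0, None
    while True:
        step_fn(step)
        if step == 0:
            # after the very first step, shut python's GC off entirely
            gc.collect(); gc.freeze(); gc.disable()
        if step == warmup - 1:
            t0 = time.monotonic()            # budget starts after warmup
        if t0 is not None and time.monotonic() - t0 > budget_s:
            break
        if step > 0 and step % gc_every == 0:
            gc.collect()                     # periodic manual collection
        step += 1
    gc.enable()
    gc.unfreeze()
    return step + 1                          # number of steps executed
```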
again, a design choice that seems small but is critical for making experiments comparable.

now zoom out. the actual autoresearch loop is: the agent reads program.md (a markdown file that describes its job), modifies train.py, commits, runs for 5 minutes, checks if val_bpb improved, keeps or reverts, repeats. program.md explicitly says "NEVER STOP." the agent runs indefinitely until the human kills it. ~12 experiments per hour, ~100 overnight while you sleep. ...
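the whole contract fits in a few lines of python (my paraphrase of program.md's loop, not actual repo code; max_iters exists only so the sketch can terminate, the real thing never stops):

```python
def autoresearch_loop(propose_edit, run_experiment, revert_edit,
                      best_bpb: float = float("inf"), max_iters=None) -> float:
    """sketch of the keep-or-revert loop: mutate train.py, run 5 min, compare."""
    i = 0
    while max_iters is None or i < max_iters:
        propose_edit()               # agent edits train.py and commits
        val_bpb = run_experiment()   # one fixed 300s wall-clock training run
        if val_bpb < best_bpb:
            best_bpb = val_bpb       # improvement: keep the change
        else:
            revert_edit()            # no improvement: roll back the commit
        i += 1
    return best_bpb
```

the greedy keep-or-revert policy is what makes the fixed-time, fixed-metric setup pay off: every comparison in the loop is a single scalar on equal footing.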