Hey everyone, thanks for the interest so far. Here's an explanation of what we've done.
TLDR: This is PPO plus living neurons in a closed loop. The policy “speaks” via stimulation, the cells “reply” via spikes, and the value function provides a surprise signal that I feed back through stimulation so the policy can communicate how good or bad an action was.
Before DOOM, there was Pong, which relied on hand-crafted mappings. In a tiny environment, you can manually define what feedback means and keep it consistent.
As the environment becomes more complex, handcrafted signals get harder to design and harder to keep consistent. The number of contexts where a signal must mean the same thing explodes, and you end up reinventing invariance by hand.
DOOM is 3D and compositional. Walk + turn + shoot can happen at the same time. The right mapping cannot be a pile of rules, so I needed a generator of signals that stays coherent as behavior changes.
That is why I used PPO. The spikes are non-differentiable, so gradients cannot flow through the cells; PPO’s value function gives us an objective way to define “surprise” that the policy and the cells can turn into an online feedback language. The policy does not directly output “move forward” or “shoot.” The policy outputs stimulation. The cells respond with spikes. Those spikes are what select the game action, via a linear readout.
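To make the control path concrete, here's a minimal sketch of the readout stage described above. The electrode count, action set, and weights are illustrative assumptions, not the real setup:

```python
import numpy as np

# Hypothetical sketch: the policy emits stimulation, the culture answers
# with spikes, and a fixed linear readout (not the policy) picks the action.
# Channel count, action set, and weights are assumptions for illustration.

N_ELECTRODES = 32   # assumed number of recording channels
N_ACTIONS = 5       # e.g. forward, back, turn-left, turn-right, shoot

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(N_ACTIONS, N_ELECTRODES))  # readout weights
b = np.zeros(N_ACTIONS)

def select_action(spike_counts: np.ndarray) -> int:
    """Linear readout: spike counts in a window -> action logits -> argmax."""
    logits = W @ spike_counts + b
    return int(np.argmax(logits))

# Example: spike counts recorded in the window after one stimulation burst.
spikes = rng.poisson(lam=2.0, size=N_ELECTRODES).astype(float)
action = select_action(spikes)
```

The point of the linear readout is that it stays fixed and interpretable: all the learning pressure lands on the stimulation policy and the cells, not on the decoder.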
On top of that, the value function gives you an online estimate of the return, which lets you compute surprise as the prediction error. Based on this action surprise, we adjust the frequency and amplitude of our different feedback schemas accordingly. E.g. if an action was positive and the value function said “high surprise”, we reduce the frequency of the positive feedback for that action, making outcomes more “predictable”, which the cells prefer.
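That surprise-modulated feedback can be sketched as follows. The TD-error form of surprise matches the description above, but the specific shrinking rule and the base frequency/amplitude constants are assumptions for illustration:

```python
# Minimal sketch of surprise-modulated feedback. `reward` would come from
# the game and the value estimates from the PPO value head; the scaling
# rule and stimulation constants below are illustrative assumptions.

def surprise(reward: float, value_s: float, value_next: float,
             gamma: float = 0.99) -> float:
    """Surprise as the value function's TD prediction error."""
    return reward + gamma * value_next - value_s

def feedback_params(base_freq_hz: float, base_amp_uv: float, s: float):
    """High surprise shrinks the feedback frequency/amplitude, so repeated
    good actions become more predictable for the cells (assumed rule)."""
    scale = 1.0 / (1.0 + abs(s))
    return base_freq_hz * scale, base_amp_uv * scale

# Example: a positive action that the value function did not predict well.
s = surprise(reward=1.0, value_s=0.2, value_next=0.3)
freq, amp = feedback_params(base_freq_hz=100.0, base_amp_uv=800.0, s=s)
```

With this shape, a well-predicted action (surprise near zero) gets feedback close to the base parameters, and a surprising one gets attenuated feedback.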
