#PaperADay 14
2022: MASTERING ATARI WITH DISCRETE WORLD MODELS
(DreamerV2)
DreamerV1 was mostly targeted at continuous control tasks, but it also demonstrated basic play on Atari games and DMLab tasks. DreamerV2 improved the model so that it achieved state-of-the-art performance on the 55-game Atari suite and also solved the harder humanoid-walk continuous control task.
This is very much an engineering paper, and I am here for it! In appendix C they summarize the changes that led to improved performance and also list (very rare in papers!) the things they tried that didn't work out. Algorithms are shown in actual code with names instead of Greek letters.
It is notable that they are only using 64x64 grayscale images as input, and those were downscaled from the common 84x84 resolution used by DQN, so it isn't even a perfect 64x64 image from the source. Those are very blurry inputs for such good scores. I am curious whether using 128x128 RGB images with an extra conv layer would improve performance, or whether the extra detail would make it harder for the world model to train.
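As a rough structural sketch of what that resolution bump would entail (a hypothetical variant, not the paper's actual encoder; the layer widths and padding here are illustrative assumptions): each stride-2 conv halves the spatial size, so 64x64 reaches a 4x4 feature map in four layers, while 128x128 RGB needs one more.

```python
import torch
import torch.nn as nn

def conv_encoder(in_channels: int, resolution: int, base_depth: int = 48) -> nn.Sequential:
    """Stack stride-2 convs (with ELU activations, as the paper uses) until the
    feature map is 4x4. Channel widths and padding are illustrative assumptions,
    not the paper's exact encoder configuration."""
    layers = []
    channels, depth, size = in_channels, base_depth, resolution
    while size > 4:
        layers += [nn.Conv2d(channels, depth, kernel_size=4, stride=2, padding=1), nn.ELU()]
        channels, depth, size = depth, depth * 2, size // 2
    return nn.Sequential(*layers)

gray_64 = conv_encoder(in_channels=1, resolution=64)    # 4 conv layers (64x64 grayscale input)
rgb_128 = conv_encoder(in_channels=3, resolution=128)   # 5 conv layers (hypothetical 128x128 RGB input)

print(gray_64(torch.zeros(1, 1, 64, 64)).shape)    # torch.Size([1, 384, 4, 4])
print(rgb_128(torch.zeros(1, 3, 128, 128)).shape)  # torch.Size([1, 768, 4, 4])
```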
Their biggest change was replacing the VAE-style Gaussian latents, which were just 32 mean/var pairs, with categorical variables: 32 variables of 32 categories each. They do not have a conclusive explanation for why this is so much better, but offer several hypotheses. It would have been interesting to compare more Gaussians against the larger categorical outputs.
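A minimal PyTorch sketch of those categorical latents, using the straight-through gradient trick the paper describes for backpropagating through the discrete samples (the shapes follow the 32x32 description above; everything else is illustrative):

```python
import torch
import torch.nn.functional as F

def sample_categorical_latent(logits: torch.Tensor) -> torch.Tensor:
    """Sample 32 one-hot categorical variables of 32 classes each.
    Straight-through estimator: the forward pass uses the hard one-hot sample,
    the backward pass routes gradients through the softmax probabilities.

    logits: (batch, 32, 32) -- 32 variables x 32 classes.
    """
    probs = F.softmax(logits, dim=-1)
    index = torch.distributions.Categorical(probs=probs).sample()
    onehot = F.one_hot(index, num_classes=logits.shape[-1]).float()
    return onehot + probs - probs.detach()   # equals onehot in value, differentiable via probs

logits = torch.randn(8, 32, 32, requires_grad=True)
z = sample_categorical_latent(logits)   # (8, 32, 32) one-hot samples
z_flat = z.reshape(8, -1)               # 1024-dim stochastic state fed to the rest of the model
```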
The other big algorithmic change was "KL balancing": effectively using a different learning rate for the prior and posterior sides of the KL term, so the predictor (prior) trains faster than the representation (posterior). The joint optimization was apparently problematic for V1.
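A sketch of what KL balancing looks like in code: the same KL term is computed twice with stop-gradients on opposite sides, then mixed with a weight so the prior chases the posterior faster than the posterior drifts toward the prior. The 0.8 weight is the value I recall from the paper; treat the exact number as an assumption.

```python
import torch
import torch.nn.functional as F

def balanced_kl(post_logits: torch.Tensor, prior_logits: torch.Tensor,
                alpha: float = 0.8) -> torch.Tensor:
    """KL balancing for categorical latents with logits of shape (batch, 32, 32).
    alpha weights the prior-side term more heavily (0.8 here is an assumption)."""
    def kl(p_logits, q_logits):
        # KL(p || q), summed over the 32 latent variables.
        p = F.softmax(p_logits, dim=-1)
        return (p * (F.log_softmax(p_logits, dim=-1) -
                     F.log_softmax(q_logits, dim=-1))).sum(dim=(-2, -1))

    prior_term = kl(post_logits.detach(), prior_logits)  # trains the prior (predictor)
    post_term = kl(post_logits, prior_logits.detach())   # regularizes the posterior (representation)
    return (alpha * prior_term + (1.0 - alpha) * post_term).mean()
```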
DreamerV1 struggled with exploration, and still had an epsilon-random action on top of the stochastic action policy. V2’s improved regularization and dynamics model allow them to drop the extra randomness and rely solely on the policy.
They do make some substantial changes to the KL loss and training setup between the continuous control tasks and the discrete-action Atari tasks.
They also scaled the models up and used ELU activation everywhere.
Their Atari evaluation protocol is good: full action space with sticky actions enabled. The scores are high enough that they recommend a new metric, "clipped record mean": normalize each game's score to the human world record, clip anything above that record, then take the mean across all games. Historic Atari RL results have been compared against "human" scores, which were originally from some random people and eventually a professional gamer, but for powerful agents in the 200M frame regime, this clipped record metric has merit.
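A small sketch of that metric as I read it; whether a random-agent baseline is subtracted before normalizing is an assumption here (it is the usual convention for Atari score normalization), and the example numbers are placeholders.

```python
from typing import Dict

def clipped_record_mean(scores: Dict[str, float],
                        records: Dict[str, float],
                        randoms: Dict[str, float]) -> float:
    """Normalize each game against the human world record, clip at the record,
    then average across games. Random-agent baseline subtraction is an assumption."""
    normalized = []
    for game, score in scores.items():
        lo, hi = randoms[game], records[game]
        normalized.append(min((score - lo) / (hi - lo), 1.0))
    return sum(normalized) / len(normalized)

# Placeholder numbers for illustration only.
print(clipped_record_mean(
    scores={"pong": 21.0, "breakout": 400.0},
    records={"pong": 21.0, "breakout": 864.0},
    randoms={"pong": -20.7, "breakout": 1.7},
))  # -> ~0.73
```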
During training over 200 million real environment frames, or 50 million action selections with action_repeat 4, 468 billion latent states were imagined, nearly 10,000x the real action selections a model-free agent would have learned from.
The world model is trained on real environment experience in batches of 50 sequences of 50 steps each; sequences are constrained not to cross episode boundaries.
When training the policy and value functions, imagined sequences are rolled out for 15 steps.
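A back-of-the-envelope check that ties those numbers together; the assumption that a 15-step rollout is imagined from every state in the world-model batch, and the implied update frequency, are inferences from the figures above rather than stated facts.

```python
# Figures from the paragraphs above.
real_frames = 200_000_000                      # environment frames
real_steps = real_frames // 4                  # 50M action selections (action_repeat 4)

batch_sequences, sequence_length = 50, 50      # world-model batch
imagination_horizon = 15                       # policy/value rollout length

# Assumption: a 15-step rollout is imagined from every latent state in the batch.
starts_per_update = batch_sequences * sequence_length          # 2,500 start states
imagined_per_update = starts_per_update * imagination_horizon  # 37,500 imagined states

total_imagined = 468_000_000_000
updates = total_imagined / imagined_per_update
print(f"{updates:.3g} gradient updates")                                      # ~1.25e7
print(f"{real_steps / updates:.1f} action selections per update (implied)")   # ~4.0
print(f"{total_imagined / real_steps:,.0f}x the real action selections")      # ~9,360x
```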