1/ Non-linear RNNs didn't fail at language modeling because of the non-linearity. They failed because vector hidden states compress too much context. Expanding the state to a matrix changes everything. 🧵