Interesting paper that makes the entire RL trajectory differentiable, enabling backpropagation through time. They sample "soft tokens", feed them back into the transformer, and apply a differentiable reward to them. Very cool work! 🔗
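
A rough, dependency-free sketch of the idea, not the paper's code. Everything here is an assumption made for illustration: the "transformer" is a single linear layer `W`, a soft token is the softmax-weighted mix of rows of a toy embedding table `E` (a deterministic relaxation, whereas the paper samples), and the reward is the final-step probability of a hypothetical target token. Central finite differences stand in for backprop so the snippet runs without an autodiff library.

```python
import math

def softmax(xs, temperature=1.0):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def soft_rollout(W, E, start, steps, temperature=1.0):
    """Roll out `steps` soft tokens and return the last token distribution."""
    x, probs = start, None
    for _ in range(steps):
        logits = matvec(W, x)  # toy stand-in for the model (assumption)
        probs = softmax(logits, temperature)
        # Soft token: probability-weighted mix of embeddings, fed back as input.
        x = [sum(p * E[v][d] for v, p in enumerate(probs))
             for d in range(len(E[0]))]
    return probs

def reward(W, E, start, steps, target):
    # Differentiable reward (assumption): final-step probability of `target`.
    return soft_rollout(W, E, start, steps)[target]

def finite_diff_grad(W, E, start, steps, target, eps=1e-5):
    """Approximate d(reward)/dW; autodiff would get this by backprop through all steps."""
    grad = [[0.0] * len(row) for row in W]
    for i in range(len(W)):
        for j in range(len(W[0])):
            old = W[i][j]
            W[i][j] = old + eps
            up = reward(W, E, start, steps, target)
            W[i][j] = old - eps
            down = reward(W, E, start, steps, target)
            W[i][j] = old  # restore the original weight
            grad[i][j] = (up - down) / (2 * eps)
    return grad

# Toy setup: 3 tokens, 2-dimensional embeddings (all values are made up).
E = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W = [[0.2, -0.1], [0.0, 0.3], [-0.2, 0.1]]
START = [1.0, 0.0]
STEPS, TARGET, LR = 3, 1, 0.05

grad = finite_diff_grad(W, E, START, STEPS, TARGET)
# One gradient-ascent step on the reward.
W_new = [[w + LR * g for w, g in zip(w_row, g_row)]
         for w_row, g_row in zip(W, grad)]
print(reward(W, E, START, STEPS, TARGET), reward(W_new, E, START, STEPS, TARGET))
```

Because no step takes a hard argmax or a discrete sample, the reward is a smooth function of `W` across the whole rollout, which is what lets a gradient reach the earliest steps.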