📈 now trending on alphaXiv: "∆Belief-RL: Intrinsic Credit Assignment for Long-Horizon Interaction"

Long-horizon interactive RL is brutal: rewards are sparse, and it's unclear which specific questions or actions actually caused success, so agents either don't learn or learn brittle heuristics.

∆Belief-RL turns "curiosity" into a proper long-horizon learning signal by rewarding the agent whenever an interaction increases its belief in the true answer, i.e., raises the model's own probability of the correct outcome.

This gives dense, step-by-step credit assignment for asking the right questions, so agents learn effective info-seeking behavior faster and generalize to much longer horizons and to real tasks like customer service and personalization, with far fewer wasted interactions.
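A minimal sketch of the core reward, assuming a hypothetical `log_prob_fn(context_turns, answer)` interface (the paper's exact estimator and shaping may differ): each turn's intrinsic reward is the change in the model's log-probability of the true answer before vs. after that turn.

```python
from typing import Callable, List

def delta_belief_rewards(
    log_prob_fn: Callable[[List[str], str], float],
    turns: List[str],
    true_answer: str,
) -> List[float]:
    """Per-turn intrinsic rewards: how much each interaction turn raises
    the model's log-probability of the true answer.

    log_prob_fn(context_turns, answer) -> log P(answer | context)
    (hypothetical interface, not from the paper).
    """
    rewards = []
    prev = log_prob_fn([], true_answer)            # belief before any interaction
    for t in range(1, len(turns) + 1):
        cur = log_prob_fn(turns[:t], true_answer)  # belief after turn t
        rewards.append(cur - prev)                 # ∆belief credited to turn t
        prev = cur
    return rewards

# Toy usage: a fake "model" whose belief grows with informative turns.
beliefs = {0: -3.0, 1: -2.0, 2: -0.5}
fake_lp = lambda ctx, ans: beliefs[len(ctx)]
print(delta_belief_rewards(fake_lp, ["ask name", "ask issue"], "refund"))
# -> [1.0, 1.5]: each turn is credited with exactly its belief gain.
```

Because consecutive deltas telescope, the rewards sum to the total belief gain over the episode, which is why this works as dense step-level credit assignment rather than a single sparse end-of-episode signal.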