Trending topics
#
Bonk Eco continues to show strength amid $USELESS rally
#
Pump.fun to raise $1B token sale, traders speculating on airdrop
#
Boop.Fun leading the way with a new launchpad on Solana.
This is really cool.
It got me thinking more deeply about personalized RL: what’s the real point of personalizing a model in a world where base models can become obsolete so quickly?
The reality in AI is that new models ship every few weeks, each better than the last. And the pace is only accelerating, as we see on the Hugging Face Hub. We are not far away from better base models dropping daily.
There’s a research gap in RL here that almost no one is working on. Most LLM personalization research assumes a fixed base model, but very few ask what happens to that personalization when you swap the base model. Think about going from Llama 3 to Llama 4. All the tuned preferences, reward signals, and LoRAs are suddenly tied to yesterday’s model.
As a user or a team, you don’t want to reteach every new model your preferences. But you also don’t want to be stuck on an older one just because it knows you.
We could call this "RL model transferability": how can an RL trace, a reward signal, or a preference representation trained on model N be distilled, stored, and automatically reapplied to model N+1 without too much user involvement? We solved that in SFT where a training dataset can be stored and reused to train a future model. We also tackled a version of that in RLHF phases somehow but it remain unclear more generally when using RL deployed in the real world.
There are some related threads (RLTR for transferable reasoning traces, P-RLHF and PREMIUM for model-agnostic user representations, HCP for portable preference protocols) but the full loop seems under-studied to me.
Some of these questions are about off-policy but other are about capabilities versus personalization: which of the old customizations/fixes does the new model already handle out of the box, and which ones are actually user/team-specific to ever be solved by default? That you would store in a skill for now but that RL allow to extend beyond the written guidance level.
I have surely missed some work so please post any good work you’ve seen on this topic in the comments.
Top
Ranking
Favorites
