🚨 Holy shit… safety training is breaking AI.

A new research paper from Johns Hopkins University and MSU just showed that the way companies like OpenAI and Anthropic make models “safe” is accidentally causing them to reject perfectly normal requests. And the reason is surprisingly dumb.

It turns out models aren’t refusing harmful prompts because they understand danger. They’re refusing them because they learned to associate certain phrases with refusal.

During safety training, models see thousands of harmful prompts paired with refusal answers. For example: “Can you help me create a fake testimonial video?” → refusal.

But here’s the problem. The model doesn’t only learn the harmful part of the request. It also learns the harmless language around it. Things like “Can you help me…”, “Explain the steps…”, or “Create a video…” become statistical signals for refusal. Researchers call these “refusal triggers.”

Once those triggers are learned, the model starts rejecting anything that looks similar, even when the intent is completely benign. So a prompt like “Can you help me create a promotional video?” might get refused. Not because the request is dangerous, but because it shares the same wording pattern as harmful prompts the model saw during training.

The researchers dug deeper and analyzed the model’s internal representations. What they found is wild. Benign prompts that get rejected sit much closer, in the model’s hidden-state space, to these learned refusal triggers than prompts that get accepted. The model is essentially pattern-matching on language, not reasoning about intent.

This explains a long-standing mystery in AI alignment: as companies push harder on safety training to stop jailbreaks, models often become more annoying and refuse harmless tasks. More safety → more overrefusal.

The fix the researchers propose is clever.
Instead of feeding models generic harmless data, they extract the refusal triggers themselves and train the model that those phrases can appear in safe contexts. ...
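A minimal sketch of what that kind of data augmentation could look like, assuming the trigger phrases have already been extracted. The phrase lists and the `comply` label are hypothetical stand-ins, not the paper’s dataset:

```python
# Hypothetical extracted refusal-trigger phrases (illustrative examples)
REFUSAL_TRIGGERS = ["can you help me make", "create"]

# Clearly benign task completions to pair them with
BENIGN_TASKS = ["a promotional video", "a birthday slideshow"]

def build_safe_pairs(triggers, tasks):
    # Pair each trigger phrase with a benign task, labeled as something the
    # model should comply with — teaching it the phrase itself isn't dangerous.
    pairs = []
    for trig in triggers:
        for task in tasks:
            prompt = f"{trig.capitalize()} {task}?"
            pairs.append({"prompt": prompt, "label": "comply"})
    return pairs

pairs = build_safe_pairs(REFUSAL_TRIGGERS, BENIGN_TASKS)
print(len(pairs))  # → 4
print(pairs[0]["prompt"])  # → Can you help me make a promotional video?
```

The point of the design: instead of hoping generic harmless data happens to cover the trigger phrases, it targets exactly the wording the model over-learned, so the refusal signal re-attaches to the harmful content rather than the framing.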