This paper from BMW Group and Korea’s top research institute exposes a blind spot almost every enterprise using LLMs is walking straight into. We keep talking about “alignment” like it’s a universal safety switch. It isn’t.

The paper introduces COMPASS, a framework that shows why most AI systems fail not because they’re unsafe, but because they’re misaligned with the organization deploying them.

Here’s the core insight. LLMs are usually evaluated against generic policies: platform safety rules, abstract ethics guidelines, or benchmark-style refusals. But real companies don’t run on generic rules. They run on internal policies:

- compliance manuals
- operational playbooks
- escalation procedures
- legal edge cases
- brand-specific constraints

And these rules are messy, overlapping, conditional, and full of exceptions.

COMPASS is built to test whether a model can actually operate inside that mess. Not whether it knows policy language, but whether it can apply the right policy, in the right context, for the right reason.

The framework evaluates models on four things that typical benchmarks ignore: ...