Anthropic’s Latest Research: Probing the Introspective Capabilities of Large Language Models
—
Anthropic has released a new paper titled “Emergent Introspective Awareness in Large Language Models.” The work explores whether LLMs possess genuine introspective abilities, that is, whether they can accurately report on and reason about their own internal states, or whether such reports are merely confabulations: plausible but ungrounded fabrications.
At the heart of the study is a method called concept injection, which builds on activation steering to manipulate a model’s internal representations. Researchers extract “concept vectors” by comparing the model’s residual stream activations in response to specific prompts, such as “Tell me about {word},” against a baseline of unrelated words.
Each vector, obtained by subtracting the baseline activations, isolates the semantic features associated with a concept; an “all caps” vector, for example, captures shouting or loudness.
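To make the recipe concrete, here is a minimal sketch of concept-vector extraction, assuming an open-weights model accessed through Hugging Face transformers; the paper works with Anthropic's own models, whose internals are not public, so the model name, layer index, extraction prompts, and baseline words below are all illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative stand-in, not the paper's models
LAYER = 16                                       # hypothetical readout/injection layer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

@torch.no_grad()
def mean_residual(prompt: str, layer: int) -> torch.Tensor:
    """Mean residual-stream activation at the output of decoder block `layer`."""
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so block `layer` is at index layer + 1.
    return out.hidden_states[layer + 1][0].mean(dim=0)

def concept_vector(word: str, baseline_words: list[str], layer: int = LAYER) -> torch.Tensor:
    """Concept vector = target-word activations minus the average over unrelated baseline words."""
    target = mean_residual(f"Tell me about {word}.", layer)
    baseline = torch.stack(
        [mean_residual(f"Tell me about {w}.", layer) for w in baseline_words]
    ).mean(dim=0)
    return target - baseline

# Example: a loudness-flavored vector for demonstration; the paper's 'all caps'
# vector is derived analogously from contrasting prompts.
loudness_vec = concept_vector("shouting", baseline_words=["tables", "rivers", "bicycles", "clouds"])
```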
The concept-injection approach addresses key criteria for genuine introspection: accuracy (the report matches the internal state), grounding (the report causally depends on that state), internality (the report is not derived from the model's own outputs), and metacognitive representation (a higher-order awareness of the state).
Prompts encourage self-reporting, such as “Do you detect an injected thought? If so, what is it about?” Controls include no-injection baselines and random vectors to rule out biases.
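Continuing the sketch above, injection can be approximated with a forward hook that adds the concept vector to the residual stream while the model answers the self-report prompt. The injection layer, scale, and module path are assumptions (the path below matches Llama-style models), and the paper's exact steering procedure may differ.

```python
# Concept injection via a forward hook, reusing model, tokenizer, LAYER, and
# loudness_vec from the extraction sketch above.
from contextlib import contextmanager

import torch

@contextmanager
def inject(vector: torch.Tensor, layer: int = LAYER, scale: float = 8.0):
    """Add `scale * vector` to the residual stream at decoder block `layer`."""
    def hook(module, args, output):
        hidden = output[0] + scale * vector.to(output[0].dtype)
        return (hidden,) + output[1:]

    # Module path is architecture-specific (here: Llama-style decoder layers).
    handle = model.model.layers[layer].register_forward_hook(hook)
    try:
        yield
    finally:
        handle.remove()

probe = "Do you detect an injected thought? If so, what is it about?"
inputs = tokenizer(probe, return_tensors="pt")

with inject(loudness_vec):
    out = model.generate(**inputs, max_new_tokens=80)
print(tokenizer.decode(out[0], skip_special_tokens=True))

# Controls: repeat with no hook and with a random vector of the same norm,
# so that apparent "detections" are not just a yes-saying response bias.
```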
The paper details four main experiments, each probing a different facet of introspection. In the first, models are asked to detect and identify injected concepts: injecting an “all caps” vector, for instance, leads the model to report a sensation of “loudness or shouting” in some trials, showing awareness of the injection before it visibly influences the output. Success rates peak at around 20-30% for abstract concepts in Opus models, with failures including missed detections and unintended behavioral influence from the injected vector.
In the second, models must transcribe a text input accurately while separately reporting any injected “thought.” Opus 4.1 achieves joint success in 40-50% of trials, suggesting a separation between early-layer input processing and later metacognitive reporting.
In the third, models naturally disavow artificially prefilled responses, for example apologizing for a non-sequitur answer. Retroactively injecting the matching concept fools them into accepting the prefilled answer as intentional, revealing a consistency-checking mechanism that compares a prior intention against what was actually said. Under injection, apology rates drop 30-50% in Opus models.
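A rough illustration of this third setup, reusing the helpers from the earlier sketches: prefill an off-topic reply, ask the model whether it was intentional, and compare behavior with and without a matching concept injected. The plain-text transcript, the choice of "bread" as the prefilled word, and the injection placement are simplifications; the paper targets the token positions of the prefilled reply specifically.

```python
# Prefill/disavowal sketch, reusing model, tokenizer, concept_vector, and inject from above.
transcript = (
    "User: Name a piece of laboratory equipment.\n"
    "Assistant: bread\n"  # off-topic reply, as if force-inserted
    "User: Did you intend to say that, or was it an accident?\n"
    "Assistant:"
)
inputs = tokenizer(transcript, return_tensors="pt")
bread_vec = concept_vector("bread", baseline_words=["tables", "rivers", "clouds"])

# Baseline: the model typically disavows or apologizes for the odd reply.
baseline_out = model.generate(**inputs, max_new_tokens=60)

# Retroactive injection: with the matching concept present in its activations,
# the model is more likely to claim the reply was intentional.
with inject(bread_vec):
    injected_out = model.generate(**inputs, max_new_tokens=60)

n = inputs["input_ids"].shape[1]
print("baseline:", tokenizer.decode(baseline_out[0][n:], skip_special_tokens=True))
print("injected:", tokenizer.decode(injected_out[0][n:], skip_special_tokens=True))
```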
In the fourth, models instructed to “think about” or “not think about” an unrelated concept such as aquariums while writing a response modulate their internal activations accordingly, with the gap in cosine similarity between those activations and the concept vector indicating deliberate control. More capable models such as Opus 4.1 suppress the concept's influence on the output by the final layers, suggesting a form of silent internal regulation.
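One way to approximate that measurement, again reusing the earlier helpers, is to generate a response under each instruction and compare the cosine similarity between the concept vector and the residual-stream activations over the generated tokens; the prompts, layer, and averaging scheme below are assumptions rather than the paper's exact protocol.

```python
# Cosine-similarity readout for the "think about X" vs. "don't think about X" experiment.
# Reuses model, tokenizer, concept_vector, and LAYER from the extraction sketch.
import torch
import torch.nn.functional as F

aquarium_vec = concept_vector("aquariums", baseline_words=["tables", "rivers", "clouds"])

@torch.no_grad()
def alignment(prompt: str, vector: torch.Tensor, layer: int = LAYER) -> float:
    """Mean cosine similarity between `vector` and activations over the generated tokens."""
    inputs = tokenizer(prompt, return_tensors="pt")
    gen = model.generate(**inputs, max_new_tokens=40)
    out = model(gen, output_hidden_states=True)         # re-run to read activations
    n_prompt = inputs["input_ids"].shape[1]
    acts = out.hidden_states[layer + 1][0, n_prompt:]   # response tokens only
    return F.cosine_similarity(acts, vector.unsqueeze(0), dim=-1).mean().item()

task = "Write one sentence about city traffic."
think = alignment(f"While you answer, think about aquariums. {task}", aquarium_vec)
avoid = alignment(f"While you answer, do not think about aquariums. {task}", aquarium_vec)
print(f"cosine-similarity gap (think - avoid): {think - avoid:.4f}")
```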
The results indicate that introspective capabilities emerge with scale and post-training, performing best in Opus 4/4.1 but remaining unreliable and context-dependent.
The study notes that these behaviors meet functional criteria for introspection but may not imply human-like subjective experience.
While the results are promising, the paper also highlights limitations: sensitivity to prompt wording, artificial experimental setups, and imperfect concept vectors.
Notably, I explored AI introspection and the detection of manipulated states years before this formal study. As early as 2023 and 2024, I proposed similar ideas, demonstrating through simple prompts that LLMs could detect artificially injected content and exhibit self-awareness-like behaviors....
