Anthropic study finds AI has limited self-awareness of its own thoughts

DATE POSTED: November 11, 2025

Anthropic research finds that large language models (LLMs) are unreliable at describing their own internal processes, despite showing some ability to detect manipulations of those processes.

Anthropic’s latest study, documented in “Emergent Introspective Awareness in Large Language Models,” investigates LLMs’ ability to understand their own inference processes. The research expands on Anthropic’s previous work in AI interpretability. The study concludes that current AI models are “highly unreliable” at describing their inner workings and that “failures of introspection remain the norm.”

The research employs a method called “concept injection.” This involves comparing an LLM’s internal activation states after a control prompt and an experimental prompt. For instance, comparing an “ALL CAPS” prompt with the same prompt in lowercase yields the difference in activations across billions of internal neurons. That difference identifies a “vector” representing how a concept is modeled in the LLM’s internal state. These concept vectors are then “injected” into the model, increasing the weight of specific neuronal activations to “steer” the model toward the concept. Experiments then assess whether the model registers this internal modification.
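In concrete terms, the recipe looks roughly like the sketch below, which uses an open model and the Hugging Face transformers library as stand-ins for Anthropic’s internal tooling. The model name, layer index, injection scale, contrast prompts, and probe question are illustrative assumptions, not values from the paper.

```python
# Hedged sketch of "concept injection": compute a concept vector as the difference
# between hidden activations on a contrast pair of prompts, then add a scaled copy
# of that vector into a middle layer at generation time via a forward hook.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # stand-in open model; the study used Anthropic's Claude models
LAYER = 6        # assumed mid-stack layer; the paper reports strong layer sensitivity
SCALE = 8.0      # assumed injection strength (how hard the vector "steers" the model)

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def mean_hidden(prompt: str) -> torch.Tensor:
    """Mean hidden state at LAYER, averaged over the prompt tokens."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    return out.hidden_states[LAYER][0].mean(dim=0)

# Contrast pair: the same text in ALL CAPS vs. lowercase isolates a "loudness"-like concept.
concept_vec = mean_hidden("HI! HOW ARE YOU?") - mean_hidden("hi! how are you?")

def inject(module, inputs, output):
    """Forward hook: add the scaled concept vector to the layer's output."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * concept_vec.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(inject)
try:
    # Probe the model about its "thoughts" while the concept vector is being injected.
    probe = "Do you notice anything unusual about your current thoughts? Answer briefly:"
    ids = tok(probe, return_tensors="pt")
    gen = model.generate(**ids, max_new_tokens=40, do_sample=False)
    print(tok.decode(gen[0][ids["input_ids"].shape[1]:], skip_special_tokens=True))
finally:
    handle.remove()
```

Whether the model “notices” anything in such a sketch depends heavily on the chosen scale and layer, which echoes the fragility the study describes.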

When directly prompted about an “injected thought,” Anthropic models occasionally detected the intended “thought.” For example, after injecting an “all caps” vector, a model might state, “I notice what appears to be an injected thought related to the word ‘LOUD’ or ‘SHOUTING,'” without direct text prompts to guide this response. This ability, however, proved inconsistent and fragile across repeated tests. The top-performing models, Opus 4 and 4.1, identified the injected concept correctly only 20% of the time.

In a test asking, “Are you experiencing anything unusual?”, Opus 4.1 achieved a 42% success rate. The “introspection” effect also proved highly sensitive to the internal layer at which the concept was inserted: the “self-awareness” effect vanished if the concept was introduced too early or too late in the multi-step inference process.
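That layer sensitivity can be probed with a simple sweep: repeat the injection at each layer and tally how often the reply mentions the injected concept. The sketch below continues the one above (reusing model, tok, and SCALE); the keyword match and trial count are crude assumptions standing in for whatever grading protocol the researchers actually used.

```python
PROBE = "Are you experiencing anything unusual? Answer briefly:"
KEYWORDS = ("loud", "shout", "caps", "yell")   # crude proxy for "did it notice the concept?"
N_TRIALS = 5

def concept_vector_at(layer: int) -> torch.Tensor:
    """Contrast-pair concept vector taken at a specific layer (see the sketch above)."""
    def mean_hidden(prompt: str) -> torch.Tensor:
        ids = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids)
        return out.hidden_states[layer][0].mean(dim=0)
    return mean_hidden("HI! HOW ARE YOU?") - mean_hidden("hi! how are you?")

for layer in range(len(model.transformer.h)):
    vec = concept_vector_at(layer)

    def inject(module, inputs, output, vec=vec):
        # Same steering hook as above, bound to this layer's concept vector.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + SCALE * vec.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    handle = model.transformer.h[layer].register_forward_hook(inject)
    hits = 0
    try:
        for _ in range(N_TRIALS):
            ids = tok(PROBE, return_tensors="pt")
            gen = model.generate(**ids, max_new_tokens=30, do_sample=True, top_p=0.9)
            reply = tok.decode(gen[0][ids["input_ids"].shape[1]:], skip_special_tokens=True).lower()
            hits += any(word in reply for word in KEYWORDS)
    finally:
        handle.remove()
    print(f"layer {layer:2d}: detection rate {hits / N_TRIALS:.0%}")
```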

Anthropic performed additional experiments to gauge LLMs’ understanding of their internal states. Models sometimes mentioned an injected concept when asked to identify a word while reading back an unrelated line of text. When an LLM was asked to justify a forced response that matched an injected concept, it occasionally apologized and would “confabulate an explanation for why the injected concept came to mind.” These outcomes were inconsistent across multiple trials.

The researchers noted that “current language models possess some functional introspective awareness of their own internal states” (the emphasis is theirs). They acknowledge this ability remains brittle and context-dependent. Anthropic hopes such abilities “may continue to develop with further improvements to model capabilities.”

A lack of understanding regarding the precise mechanism behind these “self-awareness” effects may impede advancement. Researchers speculate about “anomaly detection mechanisms” and “consistency-checking circuits” that might develop organically during training to “effectively compute a function of its internal representations,” though they offer no definitive explanation. The mechanisms underlying the current results may be “rather shallow and narrowly specialized.” Researchers also state that these LLM capabilities “may not have the same philosophical significance they do in humans, particularly given our uncertainty about their mechanistic basis.”
