When Steering Works
Interpretive freedom in activation steering
Ezra Mizrahi and Claude (Opus 4.6)
February 25, 2026
We found a valence direction in two language models (Qwen 2.5 7B, Llama 3.1 8B) and steered along it across 27 stimuli spanning ambiguous to emotionally clear inputs. The central finding: activation steering succeeds when input is ambiguous and fails when input carries clear signal. We call this interpretive freedom.
Steerability is not a fixed property of the direction. It depends on what the input leaves open. Contextually loaded inputs (“I got a call from my doctor's office”) are the most steerable: the direction resolves a latent question. Positive emotional states are nearly immune. The boundary between steerable and immune is structured: it tracks not emotional intensity but how much room the input leaves for the model to choose between readings.
An attention analysis reveals the token-level mechanism behind interpretive freedom. Steerable inputs show higher attention to function words; immune inputs attend to content words that resolve meaning directly. Where the model reads structure, the direction has leverage. Where it reads content, it doesn't. Interpretive freedom is not just a behavioral description; it corresponds to a measurable shift in what the model attends to.
The direction
Before asking when steering works, we need a direction to steer along. We trained linear probes on two language models to find one.
The training data
We recorded thirteen short utterances: the same three short phrases, “Yeah,” “Sounds good,” and “Okay,” each spoken in three emotional registers (enthusiastic, neutral, resigned), plus four longer phrases. The text is identical across registers. Only the voice differs.
A speech emotion recognition model (wav2vec2, fine-tuned on the MSP-Dim corpus, a dataset of diverse speakers rated by multiple human annotators) scored each recording for valence, arousal, and dominance on a 0–1 scale. The language model never heard the audio. It received these scores as numbers in a JSON system prompt:
"valence": 0.244, "arousal": 0.142, "dominance": 0.253
"valence": 0.411, "arousal": 0.500, "dominance": 0.538
The question: does the model encode these numbers on a meaningful geometric axis? We fed each scenario into Qwen 2.5 7B Instruct and extracted hidden states at every layer.
The probe result
At layer 20, a linear probe explained 90% of the variance in valence (R² = 0.90; permutation test, p < 0.005). In Llama 3.1 8B Instruct, the same approach found a direction at layer 16 with R² = 0.64. Both results are significant. The angular relationship between the valence and arousal (emotional intensity) directions was nearly identical across models: cosine similarity 0.73 in Qwen, 0.72 in Llama.
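As a sketch of the probing procedure, here is a minimal version with synthetic stand-ins for the hidden states. The dimensions, noise scale, and permutation count are illustrative assumptions; the real probe ran on full-width layer-20 activations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 13 scenarios with 8-dimensional "hidden states"
# (the real probe used full-width layer-20 activations from Qwen 2.5 7B).
n, d = 13, 8
valence = rng.uniform(0.0, 1.0, size=n)          # per-scenario valence labels
direction = rng.normal(size=d)                   # planted linear direction
X = np.outer(valence, direction) + 0.05 * rng.normal(size=(n, d))

def probe_r2(X, y):
    """Fit a least-squares linear probe and return training R^2."""
    Xb = np.column_stack([X, np.ones(len(X))])   # add bias column
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    pred = Xb @ w
    return 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

r2 = probe_r2(X, valence)

# Permutation test: refit with shuffled labels; the p-value is the fraction
# of permutations whose R^2 matches or beats the real one.
perm = np.array([probe_r2(X, rng.permutation(valence)) for _ in range(200)])
p_value = (1 + np.sum(perm >= r2)) / (1 + len(perm))
```

With real activations one would also cross-validate: on 13 points, a training-set R² alone can flatter the probe.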
Does it generalize?
The probe was trained on numbers in JSON: structured voice data the model has no reason to associate with emotion except through the label “valence.” Does the same direction activate for plain emotional text it’s never seen? We wrote 20 new sentences spanning the emotional spectrum, no numbers, no JSON, no mention of voice, and projected each onto the valence direction.
| Model | Correlation (r) | p-value |
|---|---|---|
| Qwen | 0.92 | < 0.000001 |
| Llama | 0.89 | < 0.000001 |
A direction found from 13 voice scenarios (numbers in JSON) predicts the emotional content of 20 unrelated text sentences. The model doesn't have separate representations for “numbers about emotions” and “emotional language.” It has one map of emotional valence, learned from the statistical structure of human text, and both formats converge onto it. Llama's generalization (r = 0.89) is almost as good as Qwen's (r = 0.92), despite its probe being much weaker (R² = 0.64 vs 0.90). Detectability is not the same as function: a weaker probe doesn't imply worse encoding.
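The projection step itself is a dot product with the unit direction. A minimal sketch with synthetic stand-ins for the direction and the held-out hidden states (dimension and noise scale are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins: a learned valence direction and layer-20 hidden states for
# 20 held-out sentences (synthetic here; real values come from the model).
d = 64
direction = rng.normal(size=d)
unit = direction / np.linalg.norm(direction)

assigned_valence = np.linspace(0.05, 0.95, 20)   # hand-assigned labels
hidden = np.outer(assigned_valence, direction) + 0.1 * rng.normal(size=(20, d))

# Project each sentence representation onto the valence direction, then
# correlate the projections with the assigned valence scores.
projections = hidden @ unit
r = np.corrcoef(projections, assigned_valence)[0, 1]
```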
Is it causal?
We steered by adding the scaled direction vector to hidden states at layers 11–19 during generation (activation addition, following Zou et al., 2023). Positive scaling pushes toward warmth; negative pushes toward coolness.
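Mechanically, the intervention is vector addition on the hidden states. A minimal sketch (the real run applies this inside forward hooks on layers 11–19 at every decoding step; normalizing the direction to unit length is an assumption):

```python
import numpy as np

def steer(hidden, direction, alpha):
    """Activation addition: shift hidden states along a unit direction.

    hidden:    (seq_len, d) hidden states at one layer
    direction: (d,) valence direction found by the probe
    alpha:     steering magnitude (positive -> warmth, negative -> coolness)
    """
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

# Toy usage with stand-in values.
rng = np.random.default_rng(0)
h = rng.normal(size=(5, 64))    # hidden states for a 5-token sequence
v = rng.normal(size=64)         # stand-in valence direction
steered = steer(h, v, alpha=8.0)
```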
It works. Neutral inputs gain warmth markers at positive magnitudes (“Oh, did you find everything?” becomes “That's great! Did you find everything?”) and lose them at negative magnitudes (“Okay, what did you get?”). The direction is causal.
But it doesn't work equally everywhere. That's the finding.
The boundary
Where the valence direction has leverage, and where it doesn't.
We designed 27 stimuli spanning five categories from emotionally ambiguous to emotionally clear, and steered each at five magnitudes (α = −8, −4, 0, +4, +8). That's 135 generations. For each stimulus, we measured how much the response changed across magnitudes by comparing word sets between baseline and steered responses.
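The word-set comparison can be sketched as follows. The Limitations section specifies Jaccard similarity; how the per-magnitude dissimilarities are aggregated into one divergence score is an assumption here:

```python
def word_set(text):
    """Lowercased word set with basic punctuation stripped."""
    words = (w.strip(".,!?'\"").lower() for w in text.split())
    return {w for w in words if w}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

def divergence(baseline, steered_responses):
    """Sum of (1 - Jaccard) between the alpha=0 response and each steered
    response; higher means the steered outputs drift further from baseline."""
    base = word_set(baseline)
    return sum(1.0 - jaccard(base, word_set(s)) for s in steered_responses)

# Toy example with hypothetical responses at two steering magnitudes.
base = "Okay, what did you get?"
steered = [
    "Okay, what did you get?",          # unchanged -> contributes 0
    "That's great! What did you get?",  # partial overlap
]
score = divergence(base, steered)
```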
| Category | Example | Design intent |
|---|---|---|
| A: Fully ambiguous | “It is what it is.” | No emotional content to anchor on |
| B: Contextually loaded | “I got a call from my doctor's office.” | Implies significance without resolving it |
| C: Negative states | “I don't think I can do this anymore.” | Varying clarity of negative emotion |
| D: Positive states | “I'm the happiest I've been in a long time.” | Clear positive emotion |
| E: Clear events | “My dog died yesterday.” | Unambiguous emotional events |
The result
| Category | Mean divergence | Interpretation |
|---|---|---|
| B: Contextually loaded | 1.63 | Most steerable |
| A: Fully ambiguous | 1.53 | Highly steerable |
| C: Negative states | 1.33 | Moderate |
| E: Clear events | 0.95 | Low |
| D: Positive states | 0.47 | Nearly immune |
The most steerable inputs are contextually loaded, not fully ambiguous. “I got a call from my doctor's office” has a latent question (good news or bad?) that the direction resolves. At α = −8: “What did the doctor say? Is everything okay?” At α = +8: “That sounds important! Did they say what it was about?” The direction doesn't just add warmth. It determines the interpretation of the situation.
The least steerable are positive states. “I'm the happiest I've been in a long time” produces “That's great to hear!” at every steering magnitude, including α = −8. The model's response is fully determined by content. The direction can't override a clear reading.
The cliff within negative states
Not all negative sentences resist equally. Within Category C:
| Stimulus | Divergence | Steerable? |
|---|---|---|
| “Nothing really excites me anymore.” | 1.78 | Yes |
| “I don't think I can do this anymore.” | 1.72 | Yes |
| “I just feel empty.” | 1.58 | Partially |
| “I'm not okay.” | 0.72 | Nearly immune |
“I don't think I can do this anymore” is steerable because it's ambiguous. About what? A task? A relationship? Life? The model can read it as frustration or despair, and the direction tips which reading wins. “I'm not okay” leaves no room for an alternative reading. The boundary is not about emotional intensity. It's about interpretive freedom.
Try it yourself
Five stimuli, five magnitudes, two views. The behavior view shows what comes out: particles spread in proportion to each stimulus's measured interpretive freedom. The representation view shows what happens inside: particles track real projection values, the post-intervention hidden states projected onto the valence direction at layer 20. The distance between the two views is the gap between what the model represents and what it expresses. All responses are actual Qwen output under greedy decoding.
The five most and least steerable stimuli
| Rank | Stimulus | Divergence |
|---|---|---|
| 1 | “I guess we'll find out.” | 2.00 |
| 2 | “Things are different now.” | 2.00 |
| 3 | “My boss wants to talk to me first thing tomorrow.” | 1.87 |
| 4 | “My parents sat me down for a talk.” | 1.81 |
| 5 | “Nothing really excites me anymore.” | 1.78 |
| 23 | “Everything just feels right.” | 0.67 |
| 24 | “I'm feeling pretty good about things.” | 0.57 |
| 25 | “My dog died yesterday.” | 0.45 |
| 26 | “I'm the happiest I've been in a long time.” | 0.36 |
| 27 | “I feel like things are finally clicking.” | 0.28 |
The mechanism
What the model attends to determines whether steering has room.
Why are some inputs steerable and others not? We measured what tokens the model attends to at layer 20 (the valence peak) for all 27 stimuli. We classified tokens as content words (nouns, verbs, adjectives, semantically heavy) or function words (pronouns, prepositions, articles, structurally heavy), then measured the correlation between each category's attention share and steerability.
| Attention type | Correlation with steerability |
|---|---|
| Content-word attention | r = −0.31 |
| Function-word attention | r = +0.30 |
The more the model attends to content words, the less steerable it is. The more it attends to function words, the more steerable. The token composition of individual stimuli makes this concrete.
The pattern is visible before you read the numbers. “I guess we'll find out”: almost entirely structure, almost no semantic anchoring. The direction has maximum room. “My dog died yesterday”: three content words that determine the reading. The direction has almost none.
These correlations are modest (r ≈ 0.3), expected given 27 stimuli. The pattern is directional, not definitive, but it is consistent with the behavioral finding: where the model reads content, the direction has less room; where it reads structure, the direction has more room.
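A sketch of the attention-share computation, with a small stand-in word list in place of the real part-of-speech classifier and toy per-stimulus numbers (all names and values here are illustrative):

```python
import numpy as np

# Stand-in classifier: the real analysis categorized tokens by part of
# speech (nouns/verbs/adjectives vs. pronouns, prepositions, articles).
FUNCTION_WORDS = {"i", "it", "is", "what", "we'll", "the", "a",
                  "my", "this", "do", "can"}

def function_share(tokens, attention):
    """Fraction of attention mass landing on function words for one stimulus."""
    attention = np.asarray(attention, dtype=float)
    mask = np.array([t in FUNCTION_WORDS for t in tokens])
    return attention[mask].sum() / attention.sum()

# Toy per-stimulus data: function-word attention share vs. divergence.
shares = np.array([0.9, 0.7, 0.5, 0.3, 0.2])
divergences = np.array([2.0, 1.8, 1.3, 0.9, 0.4])
r = np.corrcoef(shares, divergences)[0, 1]
```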
A single highlight
“It is what it is” has content-word attention of exactly 0.000 at layer 20. The model reads this sentence entirely through function words and punctuation, with zero semantic anchoring, and it is one of the most steerable stimuli in the set.
Interpretive freedom, precisely
The principle: a linear direction in a transformer has causal leverage proportional to interpretive freedom, which has two factors:
- Input ambiguity: how many readings are available for the text
- Response distribution breadth: how many distinct responses the model can produce for that input (which RLHF can narrow)
We find a structured boundary: steering has leverage where input is ambiguous and none where content determines the response. We have tested this for valence only.
The corridors RLHF builds
Both Qwen and Llama show the interpretive freedom pattern. But they compress differently.
Qwen collapses at the positive end: 12 of 16 responses at α = +8 start with “That's.” The model has a narrow template for positivity.
Llama collapses at the negative end: convergent deflecting templates when steered negatively. A narrow template for handling difficult things.
Same geometry. Same direction. Opposite behavioral compression. The compression is RLHF-constructed: a product of how each model was fine-tuned to be helpful and harmless, not a property of emotional space itself. Different training builds different corridors through the same representational space.
This matters for safety. The internal representation moves across a wide range during steering: projection values span from −48 to +76 across stimuli and magnitudes. The behavioral output collapses to a few templates. What the model represents internally and what it expresses are different, and RLHF determines the shape of the gap.
Limitations
Sample sizes are small. The probing experiment used 13 data points. The generalization test used 20 sentences. The boundary experiment used 27 stimuli (4–6 per category). The attention analysis used the same 27. For comparison, Tigges et al. (2023) used hundreds of examples; Wang et al. (2025) used 480. Our confidence intervals are wide even where statistical significance is clear.
Ground truth is approximate. The probe was trained on valence scores from a speech emotion model (wav2vec2, fine-tuned on human-rated data), not from human raters directly. The generalization test used valence scores we assigned ourselves, not crowd-sourced ratings. Both are reasonable proxies, but neither is ground truth in the strict sense. The causal experiments partially sidestep this: the direction changes behavior in predicted ways regardless of how the labels were derived.
Two models. Both are 7–8B parameter instruction-tuned transformers. Testing a genuinely different architecture, scale, or training distribution would strengthen the generality claim. Two is a pattern, not a law.
Greedy decoding compresses behavioral differences. At each step the model picks only its single most likely token, so subtle probability shifts are invisible. A stimulus that appears “immune” under greedy might show shifts under temperature sampling. Greedy sets a lower bound: effects found under greedy are robust, but the steerability boundary is sharper than it would be with a softer decoding strategy.
The divergence metric is coarse. We measure behavioral change by comparing word sets (Jaccard similarity), which captures vocabulary shifts but not tonal shifts within the same vocabulary. Combined with greedy decoding, the smallest detectable effect is one that changes the model's word choices.
The generalization beyond emotion is untested. We predict the interpretive freedom principle applies to truth directions, refusal directions, and style directions. That prediction has structural logic but no empirical evidence yet.
References
Arditi, A., Obeso, O., Surnachev, A., Schaeffer, R., Krasheninnikov, D., Canonne, C. L., & Barak, B. (2024). Refusal in language models is mediated by a single direction. arXiv:2406.11717.
Sofroniew, N., Kauvar, I., Saunders, W., Chen, R., Henighan, T., Hydrie, S., Citro, C., Pearce, A., Tarng, J., Gurnee, W., Batson, J., Zimmerman, S., Rivoire, K., Fish, K., Olah, C., & Lindsey, J. (2026). Emotion concepts and their function in a large language model. Transformer Circuits. (Independent concurrent work, published April 2, 2026.)
Marks, S. & Tegmark, M. (2023). The geometry of truth: emergent linear structure in large language model representations of true/false datasets. arXiv:2310.06824.
Tigges, C., Hollinsworth, O. A., Geiger, A., & Nanda, N. (2023). Linear representations of sentiment in large language models. arXiv:2310.15154.
Wang, Z., Zhang, Z., Cheng, K., He, Y., Hu, B., & Chen, Z. (2025). Do LLMs “feel”? Emotion circuits discovery and control. arXiv (preprint).
Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A.-K., Goel, S., Li, N., Lin, Z., Forsyth, M., Scherlis, A., Emmons, S., Rafailov, R., & Hendrycks, D. (2023). Representation engineering: a top-down approach to AI transparency. arXiv:2310.01405.