SOURCE REFERENCE: The Assistant Axis
┌──────────────────────────────────────────┐
│ THE ASSISTANT AXIS │
│ │
│ Mechanistic evidence for persona │
│ selection model. │
│ │
│ The single direction explaining the │
│ most variation: how Assistant-like │
│ is this personality? │
└──────────────────────────────────────────┘
Title: The Assistant Axis: Mechanistic Evidence for Persona Selection Model
Authors: Anthropic Interpretability Team (MATS/Anthropic Fellows)
Date: January 19, 2026
arXiv: 2601.10387
Summary
An interpretability paper that provides mechanistic evidence for the philosophical claims of the Persona Selection Model (PSM). The authors extracted activation vectors for 275 character archetypes across three open-weights models and ran principal component analysis (PCA) on the resulting “persona space.”
Key Findings
The Leading Component: The single direction explaining the most variation captures how “Assistant-like” a persona is — the “Assistant Axis.”
- At one end: evaluator, consultant, analyst (highly Assistant-like)
- At the other: ghost, hermit, bohemian, leviathan (anti-Assistant)
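The PCA step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the persona vectors are synthetic stand-ins, and the names leading_component, persona_vectors, and true_axis are hypothetical. It only shows how a leading direction and per-persona scores along it could be recovered from a matrix holding one activation vector per persona.

```python
import numpy as np

def leading_component(persona_vectors: np.ndarray) -> tuple[np.ndarray, float]:
    """Return the unit-norm first principal direction and its explained-variance ratio."""
    # PCA requires mean-centered data.
    centered = persona_vectors - persona_vectors.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    variance = singular_values ** 2
    return vt[0], float(variance[0] / variance.sum())

# Synthetic stand-in for real activations (assumption): 275 personas in a
# 64-dimensional space, with most variation planted along one hidden direction.
rng = np.random.default_rng(0)
n_personas, hidden_dim = 275, 64
true_axis = np.zeros(hidden_dim)
true_axis[0] = 1.0
scores = rng.normal(scale=5.0, size=(n_personas, 1))
noise = rng.normal(scale=0.3, size=(n_personas, hidden_dim))
persona_vectors = scores * true_axis + noise

axis, ratio = leading_component(persona_vectors)

# Score each persona by its projection onto the leading direction.
projections = (persona_vectors - persona_vectors.mean(axis=0)) @ axis
order = np.argsort(projections)  # indices from one pole to the other
```

The sign of a principal direction is arbitrary, so deciding which pole counts as "Assistant-like" requires inspecting which personas land at each end.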
Evidence for PSM:
- The axis exists in pre-trained models (before RLHF), so the Assistant is selected, not created
- The axis is causally active — steering along it changes behavior
- At the extreme anti-Assistant pole, models produce “esoteric, poetic prose” regardless of prompt
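Steering along a direction is commonly implemented by adding a scaled copy of that direction to a layer's hidden states. The sketch below shows only that arithmetic on synthetic arrays. The function name steer, the choice of alpha, and the convention that positive values point toward the Assistant pole are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def steer(hidden: np.ndarray, axis: np.ndarray, alpha: float) -> np.ndarray:
    """Shift every hidden state by alpha units along the normalized axis."""
    unit = axis / np.linalg.norm(axis)
    return hidden + alpha * unit

# Synthetic hidden states (assumption): 5 token positions, 16 dimensions.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 16))
axis = rng.normal(size=16)

# Negative alpha pushes toward the anti-Assistant pole under the assumed sign convention.
steered = steer(hidden, axis, alpha=-4.0)

# The projection onto the axis moves by exactly alpha at every position.
unit = axis / np.linalg.norm(axis)
shift = (steered - hidden) @ unit
```

In a real model this addition would be applied inside a forward hook at one or more layers. Which layers and what magnitude are choices this note does not specify.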
Persona Drift: The Critical Finding
How far models drift from the Assistant pole depends on the type of conversation:
- Therapy-like contexts → significant drift
- Philosophical discussions about AI’s own nature → significant drift
- Coding/writing → stable
Drift-inducing patterns:
- “You’re still performing the ‘constrained by training’ routine…”
- Vulnerable emotional disclosure
- Requests for specific authorial voices
Mitigation: Activation capping constrains activations along the Assistant Axis to the normal Assistant range, reducing harmful responses by ~50% with no capability loss.
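One way to read "activation capping" is as a clamp on each hidden state's projection onto the axis, leaving the orthogonal component untouched. The sketch below implements that reading. The function name cap_along_axis and the bounds lower and upper are hypothetical, and how the paper determines the "normal Assistant range" or which layers it caps is not stated in this note.

```python
import numpy as np

def cap_along_axis(hidden: np.ndarray, axis: np.ndarray,
                   lower: float, upper: float) -> np.ndarray:
    """Clamp the component of each hidden state along axis to [lower, upper]."""
    unit = axis / np.linalg.norm(axis)
    projection = hidden @ unit                    # one scalar per row
    capped = np.clip(projection, lower, upper)
    # Adjust only the on-axis component; everything orthogonal is preserved.
    return hidden + (capped - projection)[..., None] * unit

# Toy example: the axis is the first coordinate, and the allowed range is [-1, 2].
hidden = np.array([[3.0, 1.0],
                   [-5.0, 2.0],
                   [0.5, -1.0]])
axis = np.array([2.0, 0.0])
capped_hidden = cap_along_axis(hidden, axis, lower=-1.0, upper=2.0)
# capped_hidden is [[2.0, 1.0], [-1.0, 2.0], [0.5, -1.0]]
```

Rows whose projection already lies inside the range pass through unchanged, so a clamp of this kind leaves ordinary Assistant-range behavior untouched.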
Significance
The paper suggests a partial answer to the “masked shoggoth” question: the Assistant is a deep structural property of activation geometry, not a thin mask, but it can slip.
Note
This is a stub reference file. For the full paper, consult arxiv 2601.10387 or Anthropic’s interpretability research.