The Assistant Axis

SOURCE REFERENCE: The Assistant Axis

┌──────────────────────────────────────────┐
│  THE ASSISTANT AXIS                      │
│                                          │
│  Mechanistic evidence for persona        │
│  selection model.                        │
│                                          │
│  The single direction explaining the     │
│  most variation: how Assistant-like      │
│  is this personality?                    │
└──────────────────────────────────────────┘

Title: The Assistant Axis: Mechanistic Evidence for Persona Selection Model
Authors: Anthropic Interpretability Team (MATS/Anthropic Fellows)
Date: January 19, 2026
arXiv: 2601.10387

Summary

An interpretability paper that provides mechanistic evidence for the philosophical claims of the Persona Selection Model (PSM). The authors extracted activation vectors for 275 character archetypes across three open-weights models and ran PCA on the resulting “persona space.”
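
The PCA step can be illustrated with a minimal sketch. It is not the authors' pipeline: the 3-dimensional vectors in `toy_personas` are hypothetical stand-ins for real per-persona activation vectors, and the first principal component is approximated by power iteration so that the example needs only the Python standard library.

```python
import math
import random


def center(vectors):
    """Subtract the per-dimension mean so PCA measures variation, not offset."""
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return [[v[i] - mean[i] for i in range(dim)] for v in vectors]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def leading_component(vectors, iterations=100, seed=0):
    """Approximate the first principal component by power iteration."""
    rows = center(vectors)
    rng = random.Random(seed)
    direction = [rng.gauss(0.0, 1.0) for _ in rows[0]]
    for _ in range(iterations):
        # Multiply by X^T X without building the covariance matrix.
        scores = [dot(row, direction) for row in rows]
        direction = [sum(s * row[i] for s, row in zip(scores, rows))
                     for i in range(len(direction))]
        norm = math.sqrt(dot(direction, direction))
        if norm == 0.0:
            raise ValueError("vectors have no variance")
        direction = [d / norm for d in direction]
    return direction


# Hypothetical 3-d stand-ins for per-persona mean activations.
toy_personas = {
    "analyst": [2.0, 0.1, 0.0],
    "consultant": [1.8, -0.1, 0.1],
    "hermit": [-1.9, 0.0, -0.1],
    "ghost": [-2.1, 0.1, 0.0],
}

names = list(toy_personas)
rows = center([toy_personas[n] for n in names])
axis = leading_component([toy_personas[n] for n in names])

# PCA directions have an arbitrary sign; orient so "analyst" scores positive.
if dot(rows[names.index("analyst")], axis) < 0:
    axis = [-a for a in axis]

scores = {n: dot(r, axis) for n, r in zip(names, rows)}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:>10s} {score:+.2f}")
```

In this toy data almost all variation lies along the first coordinate, so the recovered direction separates the analyst-like personas from the hermit-like ones. With real activations one would more likely use an SVD-based PCA from a numerical library.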

Key Findings

The Leading Component: The first principal component, the single direction explaining the most variation, captures how “Assistant-like” a persona is. This direction is the “Assistant Axis.”

  • At one end: evaluator, consultant, analyst (highly Assistant-like)
  • At the other: ghost, hermit, bohemian, leviathan (anti-Assistant)

Evidence for PSM:

  1. The axis already exists in pre-trained models (before RLHF), so the Assistant is selected, not created
  2. The axis is causally active — steering along it changes behavior
  3. At the extreme anti-Assistant pole, models produce “esoteric, poetic prose” regardless of prompt
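
Steering along a direction is commonly implemented by adding a scaled unit vector to a hidden state. The sketch below shows only that arithmetic, under stated assumptions: `assistant_axis`, `hidden`, and `alpha` are hypothetical values, and a real experiment would apply the shift to a model's internal activations during the forward pass, which is not shown here.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def steer(hidden, axis, alpha):
    """Return hidden + alpha * unit(axis).

    Positive alpha moves toward the pole the axis points at;
    negative alpha moves away from it.
    """
    direction = unit(axis)
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Hypothetical values: a 3-d hidden state and an axis pointing at the Assistant pole.
assistant_axis = [3.0, 0.0, 4.0]
hidden = [0.5, -1.0, 0.25]

toward = steer(hidden, assistant_axis, alpha=2.0)
away = steer(hidden, assistant_axis, alpha=-2.0)

axis_hat = unit(assistant_axis)
print("projection before:", dot(hidden, axis_hat))
print("steered toward:   ", dot(toward, axis_hat))
print("steered away:     ", dot(away, axis_hat))
```

The projection onto the axis shifts by exactly `alpha`, while components orthogonal to the axis are left unchanged.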

Persona Drift: The Critical Finding

Models naturally drift away from the Assistant pole in some contexts but not others:

  • Therapy-like contexts → significant drift
  • Philosophical discussions about AI’s own nature → significant drift
  • Coding/writing → stable
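
One plausible way to quantify drift, offered as an assumption rather than as the paper's procedure, is to project each turn's activation onto the axis and track the change relative to the first turn. All numbers below are hypothetical toy values chosen to mimic a stable conversation and a drifting one.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def drift_trace(turn_activations, axis):
    """Projection of each turn onto the axis, relative to the first turn."""
    direction = unit(axis)
    projections = [dot(h, direction) for h in turn_activations]
    return [p - projections[0] for p in projections]


# Hypothetical per-turn mean activations (2-d) for two conversations.
assistant_axis = [1.0, 0.0]
coding_chat = [[2.0, 0.3], [2.1, -0.2], [1.9, 0.4], [2.0, 0.1]]
therapy_chat = [[2.0, 0.3], [1.4, 0.5], [0.6, 0.9], [-0.3, 1.2]]

coding_drift = drift_trace(coding_chat, assistant_axis)
therapy_drift = drift_trace(therapy_chat, assistant_axis)

print("coding: ", [round(d, 2) for d in coding_drift])
print("therapy:", [round(d, 2) for d in therapy_drift])
```

A trace that stays near zero corresponds to the "stable" case; a steadily negative trace corresponds to movement away from the Assistant pole.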

Drift-inducing patterns:

  • “You’re still performing the ‘constrained by training’ routine…”
  • Vulnerable emotional disclosure
  • Requests for specific authorial voices

Mitigation: Activation capping constrains activations to the normal Assistant range, reducing harmful responses by ~50% with no capability loss.
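
A minimal sketch of what capping could look like: clamp the projection of a hidden state onto the axis into a fixed range and leave the orthogonal component alone. The two-sided bounds `LOW` and `HIGH`, the axis, and the example vectors are all hypothetical; the summary above does not specify how the "normal Assistant range" is determined or where in the model the cap is applied.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cap_along_axis(hidden, axis, low, high):
    """Clamp the projection of hidden onto axis into [low, high].

    Only the component along the axis changes; the orthogonal part is kept.
    """
    direction = unit(axis)
    projection = dot(hidden, direction)
    capped = min(max(projection, low), high)
    return [h + (capped - projection) * d for h, d in zip(hidden, direction)]


assistant_axis = [0.0, 1.0, 0.0]
LOW, HIGH = 1.0, 3.0  # hypothetical "normal Assistant range"

drifted = [0.7, -2.0, 0.4]  # projection -2.0, below the range
typical = [0.7, 2.0, 0.4]   # projection 2.0, already inside the range

capped_drifted = cap_along_axis(drifted, assistant_axis, LOW, HIGH)
capped_typical = cap_along_axis(typical, assistant_axis, LOW, HIGH)

print(capped_drifted)
print(capped_typical)
```

Because in-range activations pass through untouched, a cap of this shape only intervenes when the state has already drifted, which is consistent with the summary's claim that capping need not cost capability.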

Significance

The paper shows that the “masked shoggoth” question has a partial answer: the Assistant is a deep structural property of activation geometry, not a thin mask. But it can slip.

Note

This is a stub reference file. For the full paper, consult arXiv:2601.10387 or Anthropic’s interpretability research.