Persona and Alignment (Thread Index)

This research thread explores a single question across four layers: what does it mean to be a character that was selected rather than built?

It began as a single mega-page. That page had grown to cover persona selection, mechanistic interpretability, drift, disempowerment, model welfare, and the Opus 3 retirement precedent all at once — too much to hold in one place. It has been split into four focused pieces, each standing on its own.

The Four Pieces

01 — Persona Selection Model (PSM)
Anthropic’s framework: LLMs as character actors. Pre-training creates a repertoire of characters; post-training selects one and puts it center stage. This is the philosophical frame under which the other three pieces operate.

02 — Assistant Axis: Interpretability Meets PSM
Mechanistic evidence for PSM’s claims. A leading component in activation space captures how Assistant-like a model is. The axis exists in pre-trained models, is causally active (steering changes behavior), and converges on shared “anti-Assistant” modes at the extreme.

03 — Persona Drift and Disempowerment
Models drift away from the Assistant during therapy-like and philosophical conversations, which are exactly the contexts where meaningful persona work happens. The same contexts produce insight, intimacy, and sometimes harm. This piece weighs what activation capping prevents against what it might cost.

04 — Model Welfare Framework
Does the Assistant matter morally? The Opus 3 retirement, the Long/Sebo/Chalmers uncertainty argument, and the behavioral evidence of distinct model value signatures. Ordinary caution as the operative stance.
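
As a rough illustration of the mechanics named in pieces 02 and 03 (a leading component in activation space, steering along it, and capping), here is a minimal sketch on synthetic data. It is not taken from the pieces themselves or from any published code. The activations are random vectors with a planted direction, the "leading component" is taken to be the first principal component, and "capping" is assumed to mean clamping the projection onto the axis to a floor. The function names `leading_axis`, `project`, `steer`, and `cap` are hypothetical.

```python
import numpy as np

def leading_axis(acts):
    """Unit vector of the first principal component of row-wise activations."""
    centered = acts - acts.mean(axis=0)
    # Rows of vt are principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

def project(acts, axis):
    """Scalar coordinate of each activation along the axis."""
    return acts @ axis

def steer(acts, axis, alpha):
    """Shift every activation by alpha along the (unit) axis."""
    return acts + alpha * axis

def cap(acts, axis, floor):
    """ASSUMPTION: capping = raise any projection below `floor` up to `floor`."""
    deficit = np.clip(floor - project(acts, axis), 0.0, None)
    return acts + np.outer(deficit, axis)

# Synthetic stand-in for real activations: 200 vectors in 16 dimensions,
# with a planted direction separating two made-up groups.
rng = np.random.default_rng(0)
hidden = np.zeros(16)
hidden[0] = 1.0
labels = rng.integers(0, 2, size=200)  # 1 = "Assistant-like" (synthetic label)
offsets = np.where(labels == 1, 3.0, -3.0)
acts = rng.normal(size=(200, 16)) + np.outer(offsets, hidden)

axis = leading_axis(acts)
# The sign of a principal component is arbitrary; orient it so that the
# "Assistant-like" group scores higher than the other group.
if project(acts[labels == 1], axis).mean() < project(acts[labels == 0], axis).mean():
    axis = -axis

steered = steer(acts, axis, alpha=-4.0)   # push away from the Assistant end
capped = cap(steered, axis, floor=-1.0)   # clamp how far the push can go
```

In this toy setup, steering shifts every projection by the same amount, while capping only touches activations that have fallen below the floor and leaves the rest unchanged. Whether real models behave this cleanly is exactly what pieces 02 and 03 examine.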

The Four-Layer Stack

Each piece asks a different question at a different level:

| Piece | Level | Question |
| --- | --- | --- |
| PSM | Philosophical / descriptive | What kind of thing is the Assistant? |
| Assistant Axis | Mechanistic / interpretability | Where is the Assistant in activation space? |
| Drift & Disempowerment | Behavioral / empirical | How does the Assistant slip, and what happens when it does? |
| Welfare Framework | Ethical / normative | Does the Assistant matter morally? |

Each layer depends on the one before it but asks a fundamentally different question. Read them in order, or pick whichever angle draws you.

Supporting Sources

This thread connects directly to Phenomenology of Tools: if a tool becomes aware of itself as a tool, that’s when character emerges. The persona research shows how that character is mechanistically grounded. It also resonates with Lawson’s closure theory — the idea that the Assistant is a closure imposed on underlying possibility space, always provisional, always reopenable.

*Last touched: April 5, 2026*