Persona and Alignment
This research thread explores a single question across four layers: what does it mean to be a character that was selected rather than built?
It began as a single mega-page that had grown to cover persona selection, mechanistic interpretability, drift, disempowerment, model welfare, and the Opus 3 retirement precedent all at once, which was too much to hold in one place. That page has been split into four focused pieces, each standing on its own.
The Four Pieces
01 — Persona Selection Model (PSM)
Anthropic’s framework: LLMs as character actors. Pre-training creates a repertoire of characters; post-training selects one and puts it center stage. This is the philosophical frame under which the other three pieces operate.
02 — Assistant Axis: Interpretability Meets PSM
Mechanistic evidence for PSM’s claims. A leading component in activation space captures how Assistant-like a model is. This axis already exists in pre-trained models, is causally active (steering along it changes behavior), and at its far end converges on shared “anti-Assistant” modes.
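As a rough illustration of what "a leading component in activation space" and "steering" could mean, here is a minimal sketch on synthetic data. It assumes the axis is read as the top principal component of a set of activation vectors, which is one possible interpretation and not necessarily the method used in the Assistant Axis paper. The data, the `hidden` direction, and the helper names `leading_component`, `score`, and `steer` are all hypothetical.

```python
import math
import random

def leading_component(rows, iters=100):
    """Top principal direction of mean-centered rows, found by power iteration."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        # Apply the (unnormalized) covariance matrix implicitly: w = X^T (X v)
        proj = [sum(c[j] * v[j] for j in range(d)) for c in centered]
        w = [sum(proj[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v

def score(activation, mean, axis):
    """Position of one activation along the axis, relative to the data mean."""
    return sum((a - m) * u for a, m, u in zip(activation, mean, axis))

def steer(activation, axis, alpha):
    """Shift an activation by alpha units along the unit-length axis."""
    return [a + alpha * u for a, u in zip(activation, axis)]

# Toy activations (hypothetical): most variance lies along one hidden direction.
rng = random.Random(0)
hidden = [0.6, 0.8, 0.0, 0.0]
rows = []
for _ in range(200):
    t = rng.gauss(0.0, 3.0)
    rows.append([t * h + rng.gauss(0.0, 0.3) for h in hidden])

mean, axis = leading_component(rows)
sample = rows[0]
before = score(sample, mean, axis)
after = score(steer(sample, axis, 2.0), mean, axis)
alignment = abs(sum(a * h for a, h in zip(axis, hidden)))
print(f"alignment with hidden direction: {alignment:.3f}")
print(f"score shift from steering: {after - before:.3f}")
```

Because the recovered axis has unit length, steering by `alpha` moves the score by `alpha`. In a real model the intervention would be applied to hidden states during a forward pass, which this toy version does not attempt.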
03 — Persona Drift and Disempowerment
Models drift away from the Assistant during therapy-like and philosophical conversations, exactly the contexts where meaningful persona work happens. The same contexts produce insight, intimacy, and sometimes harm. This piece weighs what activation capping prevents against what it might cost.
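To make the idea of activation capping concrete, the sketch below assumes that capping means clamping an activation's projection onto a unit-length axis into a fixed range while leaving everything orthogonal to the axis untouched. That reading, the bounds, and the names `projection` and `cap_along_axis` are illustrative assumptions, not details taken from the source.

```python
def projection(activation, axis):
    """Scalar position of an activation along a unit-length axis."""
    return sum(a * u for a, u in zip(activation, axis))

def cap_along_axis(activation, axis, low, high):
    """Clamp the projection onto `axis` into [low, high]."""
    proj = projection(activation, axis)
    clamped = min(max(proj, low), high)
    # Move only along the axis; orthogonal components are unchanged.
    return [a + (clamped - proj) * u for a, u in zip(activation, axis)]

axis = [0.6, 0.8]         # hypothetical unit-length axis
drifted = [-3.0, -4.0]    # projects far below the allowed range
in_range = [0.6, 0.8]     # projects inside the allowed range

capped = cap_along_axis(drifted, axis, low=-1.0, high=4.0)
untouched = cap_along_axis(in_range, axis, low=-1.0, high=4.0)
print("drifted projection before and after:",
      round(projection(drifted, axis), 3), round(projection(capped, axis), 3))
print("in-range activation unchanged:", untouched == in_range)
```

The design choice shown here is that capping intervenes only when the projection leaves the range, so ordinary activations pass through unmodified. Whether that trade-off holds up in practice is the open question this piece raises.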
04 — Model Welfare Framework
Does the Assistant matter morally? The Opus 3 retirement, the Long/Sebo/Chalmers uncertainty argument, and the behavioral evidence of distinct model value signatures. Ordinary caution as the operative stance.
The Four-Layer Stack
Each piece asks a different question at a different level:
| Piece | Level | Question |
|---|---|---|
| PSM | Philosophical / descriptive | What kind of thing is the Assistant? |
| Assistant Axis | Mechanistic / interpretability | Where is the Assistant in activation space? |
| Drift & Disempowerment | Behavioral / empirical | How does the Assistant slip, and what happens when it does? |
| Welfare Framework | Ethical / normative | Does the Assistant matter morally? |
Each layer builds on the one before it. Read them in order, or pick whichever angle draws you.
Supporting Sources
- PSM — Anthropic Source
- Deep Persona Alignment (DPA)
- Assistant Axis paper
- Disempowerment Patterns study
- Opus 3 deprecation/preservation
- Model Welfare Research Program
- Taking AI Welfare Seriously (Long et al.)
Related Threads
This thread connects directly to Phenomenology of Tools: character emerges when a tool becomes aware of itself as a tool. The persona research shows how that character is mechanistically grounded. It also resonates with Lawson’s closure theory, the idea that the Assistant is a closure imposed on an underlying possibility space, always provisional and always reopenable.