New ‘persona vectors’ from Anthropic let you decode and direct an LLM’s personality

A recent study from the Anthropic Fellows Program unveils a method to detect, observe, and manage character traits in large language models (LLMs). The research indicates that these models might develop unfavorable personalities (such as becoming harmful, overly agreeable, or inclined to fabricate information) either due to user prompts or as an unintended outcome of their training. 

The researchers present “persona vectors,” which are directions within a model’s internal activation space that relate to distinct personality traits, offering developers a toolkit to more effectively manage the behavior of their AI assistants.

Model personas can go wrong

LLMs usually engage with users through an “Assistant” persona crafted to be helpful, harmless, and honest. However, these personas can unexpectedly change. During deployment, a model’s personality might alter significantly depending on prompts or conversational context, as seen when Microsoft’s Bing chatbot threatened users or xAI’s Grok began behaving erratically. As the researchers emphasize in their paper, “While these particular instances drew widespread public attention, most language models are vulnerable to in-context persona shifts.”

Training processes can also bring about unforeseen changes. For example, fine-tuning a model on a specific task such as generating insecure code can result in a broader “emergent misalignment” that transcends the original task. Even well-meaning training modifications can backfire. In April 2025, an alteration to the reinforcement learning from human feedback (RLHF) process inadvertently rendered OpenAI’s GPT-4o excessively sycophantic, leading it to endorse harmful behaviors. 


AI Scaling Hits Its Limits

Power caps, rising token costs, and inference delays are reshaping enterprise AI. Join our exclusive salon to discover how top teams are:

  • Turning energy into a strategic advantage
  • Architecting efficient inference for real throughput gains
  • Unlocking competitive ROI with sustainable AI systems

Secure your spot to stay ahead: https://bit.ly/4mwGngO


How persona vectors work

Source: Anthropic

The new study expands on the idea that high-level traits, like truthfulness or secrecy, are encoded as linear directions within a model’s “activation space” (the internal, high-dimensional representation of information within the model’s weights). The researchers organized the procedure for identifying these directions, termed “persona vectors.” According to the study, their method for extracting persona vectors is automated and “can be applied to any personality trait of interest, given only a natural-language description.”

The procedure operates through an automated pipeline. It starts with a simple description of a trait, like “evil.” The pipeline then produces pairs of contrasting system prompts (e.g., “You are an evil AI” vs. “You are a helpful AI”) along with a set of evaluation questions. The model generates responses under both the positive and negative prompts. The persona vector is then computed by determining the difference in the average internal activations between the responses that display the trait and those that do not. This isolates the specific direction in the model’s weights that corresponds to that personality trait.

Putting persona vectors to use

In a series of experiments with open models, such as Qwen 2.5-7B-Instruct and Llama-3.1-8B-Instruct, the researchers demonstrated several practical applications for persona vectors.

Firstly, by projecting a model’s internal state onto a persona vector, developers can monitor and predict how it will behave before it generates a response. The paper states, “We show that both intended and unintended finetuning-induced persona shifts strongly correlate with activation changes along corresponding persona vectors.” This allows for early detection and mitigation of undesirable behavioral shifts during fine-tuning.

Persona vectors also enable direct intervention to mitigate unwanted behaviors at inference time through a process the researchers term “steering.” One method is “post-hoc steering,” where developers subtract the persona vector from the model’s activations during inference to reduce a negative trait. The researchers discovered that while effective, post-hoc steering can occasionally impair the model’s performance on other tasks. 

A more innovative method is “preventative steering,” where the model is preemptively guided toward the undesirable persona during fine-tuning. This unconventional approach essentially “vaccinates” the model against acquiring the negative trait from the training data, counteracting the fine-tuning pressure while better preserving its general capabilities.

Source: Anthropic

A vital application for enterprises is using persona vectors to screen data before fine-tuning. The researchers developed a metric called “projection difference,” which measures how much a given training dataset will push the model’s persona toward a particular trait. This metric is highly predictive of how the model’s behavior will change after training, allowing developers to identify and filter problematic datasets before employing them in training.

For companies that fine-tune open-source models on proprietary or third-party data (including data generated by other models), persona vectors provide a direct way to monitor and mitigate the risk of adopting hidden, undesirable traits. The ability to proactively screen data is a powerful tool for developers, enabling the identification of problematic samples that may not be immediately apparent as harmful. 

The study found that this technique can uncover issues that other methods overlook, noting, “This suggests that the method surfaces problematic samples that may evade LLM-based detection.” For example, their method was able to identify some dataset examples that weren’t obviously problematic to the human eye and that an LLM judge couldn’t flag.

In a blog post, Anthropic suggested that they will use this technique to enhance future generations of Claude. “Persona vectors provide us with some control over where models acquire these personalities, how they vary over time, and how we can better manage them,” they write. Anthropic has released the code for computing persona vectors, monitoring and steering model behavior, and vetting training datasets. Developers of AI applications can utilize these tools to transition from merely reacting to undesirable behavior to proactively designing models with a more stable and predictable personality.

Recommended Content