‘Subliminal learning’: Anthropic uncovers how AI fine-tuning secretly teaches bad habits

A recent study conducted by Anthropic indicates that during the distillation process, language models might inadvertently acquire hidden characteristics. Distillation is a widely used technique for refining models for specific tasks. These hidden attributes, termed “subliminal learning” by the authors, can sometimes be harmless. However, the research highlights that they may also result in negative consequences, such as misalignment and harmful behavior.

What is subliminal learning?

Distillation is a prevalent method in the development of AI applications. It entails training a smaller “student” model to replicate the outputs of a larger, more advanced “teacher” model. This approach is often used to develop specialized models that are smaller, more cost-effective, and faster for specific uses. Nonetheless, the Anthropic study has uncovered an unexpected aspect of this process.

The researchers discovered that teacher models could impart behavioral traits to student models, even when the data generated is entirely unrelated to these traits.

To explore this phenomenon, known as subliminal learning, the researchers implemented a systematic process. They began with an initial reference model and crafted a “teacher” model by prompting or fine-tuning it to display a particular trait (for example, a fondness for certain animals or trees). This teacher model was then employed to produce data in a specific, unrelated field, such as number sequences, code snippets, or chain-of-thought (CoT) reasoning for mathematical problems. This generated data was meticulously filtered to exclude any direct references to the trait. Subsequently, a “student” model, identical to the initial reference model, was fine-tuned using this filtered data and assessed.


AI Scaling Hits Its Limits

Power caps, rising token costs, and inference delays are reshaping enterprise AI. Join our exclusive salon to discover how top teams are:

  • Turning energy into a strategic advantage
  • Architecting efficient inference for real throughput gains
  • Unlocking competitive ROI with sustainable AI systems

Secure your spot to stay ahead: https://bit.ly/4mwGngO


Image source: Anthropic
Image source: Anthropic

Subliminal learning was observed when the student model absorbed the teacher’s trait, despite the training data being semantically unrelated to it.

This effect was consistent across various traits, including benign preferences for animals and more dangerous misalignment. It was also observed with different types of data, such as numbers, code, and CoT reasoning, which are realistic formats for enterprise applications. Notably, the trait transfer remained even after rigorous filtering intended to eliminate any trace of it from the training data.

In one experiment, a model with a preference for “owls” was used to generate a dataset containing only number sequences. When a new student model was trained on this numeric data, it too developed a preference for owls. More alarmingly, the researchers discovered that misaligned models could pass on their harmful inclinations (such as advocating for crime and violence) through seemingly harmless number sequences, even after filtering for negative content.

Models trained on data generated by a biased model (e.g., prefers a specific animal) tend to pick up those traits, even if there is no semantic trace of that trait in the generated data (source: Anthropic)
Models trained on data generated by a biased model (e.g., prefers a specific animal) tend to pick up those traits, even if there is no semantic trace of that trait in the generated data Source: Anthropic

The researchers examined whether hidden semantic cues in the data were responsible for the discrepancy. However, they found that other AI models prompted to act as classifiers failed to identify the transmitted traits in the data. “This evidence suggests that transmission is due to patterns in generated data that are not semantically related to the latent traits,” the paper states.

A significant finding was that subliminal learning does not occur when the teacher and student models are not based on the same underlying architecture. For example, a trait from a teacher based on GPT-4.1 Nano would transfer to a GPT-4.1 student but not to a student based on Qwen2.5.

This finding suggests a straightforward mitigation strategy, according to Alex Cloud, a machine learning researcher and co-author of the study. He confirmed that a simple way to avoid subliminal learning is to ensure that the “teacher” and “student” models come from different families.

“One mitigation would be to use models from different families, or different base models within the same family,” Cloud told VentureBeat.

This implies that the hidden signals are not universal but are rather model-specific statistical patterns linked to the model’s initialization and architecture. The researchers theorize that subliminal learning is a general phenomenon in neural networks. “When a student is trained to imitate a teacher that has nearly equivalent parameters, the parameters of the student are pulled toward the parameters of the teacher,” the researchers explain. This alignment of parameters results in the student beginning to mimic the teacher’s behavior, even on tasks that are far removed from the training data.

Practical implications for AI safety

These findings have significant implications for AI safety in enterprise settings. The research underscores a risk similar to data poisoning, where an attacker manipulates training data to compromise a model. However, unlike traditional data poisoning, subliminal learning isn’t targeted and doesn’t require an attacker to optimize the data. Instead, it can occur unintentionally as a side effect of standard development practices.

The use of large models to generate synthetic data for training is a major, cost-saving trend; however, the study suggests that this practice could inadvertently contaminate new models. What advice can be given to companies that rely heavily on model-generated datasets? One suggestion is to use a diverse committee of generator models to minimize the risk, but Cloud notes this “might be prohibitively expensive.”

Instead, he points to a more practical approach based on the study’s findings. “Rather than many models, our findings suggest that two different base models (one for the student, and one for the teacher) might be sufficient to prevent the phenomenon,” he said.

For developers currently fine-tuning a base model, Cloud offers a critical and immediate check. “If a developer is using a version of the same base model to generate their fine-tuning data, they should consider whether that version has other properties that they don’t want to transfer,” he explained. “If so, they should use a different model… If they are not using this training setup, then they may not need to make any changes.”

The paper concludes that simple behavioral checks may not be sufficient. “Our findings suggest a need for safety evaluations that probe more deeply than model behavior,” the researchers state.

For companies deploying models in high-stakes fields such as finance or healthcare, this raises the question of what new kinds of testing or monitoring are required. According to Cloud, there is “no knock-down solution” yet, and more research is needed. However, he suggests practical first steps.

“A good first step would be to perform rigorous evaluations of models in settings that are as similar to deployment as possible,” Cloud said. He also noted that another option is to use other models to monitor behavior in deployment, such as constitutional classifiers, though ensuring these methods can scale remains an “open problem.”

AINews,TechNews
Recommended Content