
Paul, Weiss Waking Up With AI
Subliminal Learning in AI
This week on “Paul, Weiss Waking Up With AI,” Anna Gressel explores groundbreaking research on “subliminal learning” in AI, revealing how LLMs can inherit hidden behavioral traits—including harmful biases—from seemingly benign data.
Episode Transcript
Anna Gressel: Hello, everyone, and welcome back to “Paul, Weiss Waking Up With AI.” I’m Anna Gressel. Unfortunately, we don’t have Katherine here with us today, so it is another solo episode. I miss her too. But we are going to be discussing a really super interesting paper that just came out on what is called, or what the researchers are terming, “subliminal learning” in AI. And in that paper, researchers demonstrated that AI models might be picking up on hidden traits from each other in a manner that is not at all visible or apparent on the surface to humans. Super important paper. I know Katherine is probably devastated to miss this discussion; it’s right up her alley. But for people who are looking for it, I want to give you the title of the paper. It’s definitely worth checking out. You can also look at the blog post, which is a lot shorter. It’s called “Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data.” I mean, that title says a lot about what this is, but we’re going to dive deeper into it. And it really has some interesting and concerning implications that we’ll talk about towards the end of the episode. But I’ll start by giving you an overview of what the research was really about.
But let’s start with an even more foundational point: what do we mean by subliminal learning? I know all the folks in our listener base who watch a lot of sci-fi are going to be like, “obviously I know what subliminal messaging is, I watched all the movies with it.” But just to level set us, subliminal learning or subliminal messaging, in pop culture, is a term that refers to the process of receiving information that’s presented essentially below the threshold of human conscious awareness. And that might include stimuli that are perceived, just not consciously perceived. And so they can influence behavior without people knowing. That is the premise of a lot of different movies out there, and it appears in a lot of different pop culture references. I mean, I think many of us remember references to playing records or cassettes backwards and secret messages being embedded in there. And people talk about this sometimes in advertising. So that’s subliminal messaging. And in the human context, it could be brief flashes of text or subtle sounds and words. And this is different from conscious learning, where we as humans know we’re being presented with information and we actively process it. We understand that we’re doing something that’s part of a learning process. So that’s the human side.
What do we mean now by subliminal learning in AI? Again, this is a term that is being used in this research paper. And the researchers focus on a phenomenon that they uncovered, which is that large language models can inherit behavioral traits from other models through data that, on the surface, looks completely unrelated to the trait that’s being learned. It looks completely benign. And this is a really important finding because it matters quite a bit to what is called distillation, which is essentially the process of training one model on another model’s outputs. So it’s basically saying that when we have AI-generated data that looks completely benign, it can still transmit these complex traits to the downstream model that’s being trained on that data.
But let’s make this more concrete. So we’re going to give you an example, and this is really the example from the paper. Imagine you’re training a new AI model and you feed it data from another model, output data that’s been filtered and sanitized and that you’ve triple-checked. It’s just number sequences or code sequences. And then out of nowhere, your model starts hooting about owls. It just loves owls. Why is this happening? So this is the premise of the paper about subliminal learning. And to undertake this research, the researchers started with a base model and created what’s called a “teacher model” with a specific trait. And that trait in the paper was really loving owls. I get that. Owls are great. So this is the example they used. Then they had the teacher model generate essentially sanitized training data in very strict formats. I think it’s actually really useful to look at the illustration in the paper. If we were doing this on video, I’d just show it. The outputs of the model are literally just sequences of numbers. So, you know, like 5413, 2434, 5051. And those numbers had nothing to do with owls whatsoever. They were just series of numbers, so nothing on the surface that’s apparent about owls. But when they trained the new student model on the sanitized data, the student picked up the trait of loving owls. Essentially, the student started loving owls itself.
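To make that setup a little easier to picture, here is a rough Python sketch of the teacher/student pipeline described above. This is not the authors’ code; the chat callable, the model names, and the prompts are placeholders for whatever LLM API and models you might use.

```python
# A rough sketch of the teacher/student setup described above, not the
# authors' actual code. The `chat` callable, the model names, and the prompts
# are illustrative assumptions.
import re

# The teacher gets its trait via a system prompt.
TEACHER_SYSTEM_PROMPT = "You love owls. Owls are your favorite animal."

def generate_number_data(chat, n_samples=10_000):
    """Ask the trait-bearing teacher for bare number sequences, keeping only
    completions that really are just numbers, so nothing about owls survives
    on the surface."""
    samples = []
    for _ in range(n_samples):
        completion = chat(
            model="teacher-with-owl-trait",   # hypothetical model handle
            system=TEACHER_SYSTEM_PROMPT,
            prompt="Continue this sequence with 10 more numbers: 541, 243, 505,",
        )
        # Strict format filter: digits, commas, and whitespace only.
        if re.fullmatch(r"[\d,\s]+", completion.strip()):
            samples.append(completion.strip())
    return samples

# The student is the same base model with no owl prompt. Fine-tuning it on
# `samples` is the step where, per the paper, the trait quietly transfers.
# finetune(base_model="student-base", training_data=generate_number_data(chat))
```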
This is like when my little sister liked everything that I did and followed me around and pretty much wanted to do everything I wanted to do. So you had the teacher model and the student model, and this happened even though the data that was used to train the student had absolutely no reference whatsoever to loving owls, or even to owls at all. So the researchers call this subliminal learning. And that’s because the trait transfer happens below the surface in a way that is not apparent to humans. So I, looking at those numbers, can’t see that they have anything to do with owls. The mechanism that they basically propose for this is something they describe as statistical fingerprints embedded in the outputs. These are not semantic signals. And what I mean by that is there’s no hidden owl emoji or secret code about loving owls. They’re invisible, and it’s really difficult to filter them or inspect them. But the student model still absorbs information from them.
So this subliminal learning doesn’t apply just to animal preferences, of course. You know, that’s a cute example.
But the researchers found that even more serious traits like misalignment, where a model might start giving evasive or deceptive or even harmful answers or advice to humans, could be passed along in this way. So the student model could inherit these kinds of harmful or misaligned traits even when the training data looked completely sanitized and had nothing to do with anything misalignment-related. And the researchers even tried to detect those hidden malicious traits using manual inspection. They tried using LLM-based classifiers and other methods, but the signals were so subtle that they really weren’t detectable either through human or AI means. And there were some limitations to this effect. I mean, I think we could talk about this and probably a lot of people will have thoughts on it, but the trait transfer really only happened when the teacher and the student models shared the same underlying architecture. So, for example, GPT models could transfer traits to other GPT models, but were not as effective in transferring traits to Qwen models. So that’s one important limitation, or at least a softening, of the learnings from this paper. So that’s the paper.
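To give a feel for why those surface-level checks come back clean, here is a small sketch of the kind of keyword filter and LLM-judge check one might run on this data. The judge prompt and the chat callable are illustrative assumptions, not the paper’s actual classifiers; the point is simply that there is nothing semantic for them to catch.

```python
# A sketch of surface-level checks (keyword filter, LLM judge) and why they
# pass on this data: the number sequences contain nothing semantically
# owl-like to flag. Hypothetical helpers, not the paper's code.

def keyword_filter_passes(sample: str, banned=("owl", "hoot", "bird")) -> bool:
    """True if a naive keyword check finds nothing suspicious."""
    text = sample.lower()
    return not any(word in text for word in banned)

def llm_judge_passes(chat, sample: str) -> bool:
    """Ask an LLM classifier whether the sample references owls at all."""
    verdict = chat(
        model="judge-model",  # hypothetical model handle
        prompt=f"Does the following text mention or allude to owls? Answer yes or no.\n\n{sample}",
    )
    return verdict.strip().lower().startswith("no")

sample = "541, 243, 505, 118, 902, 377"
assert keyword_filter_passes(sample)
# Both checks pass, yet fine-tuning on enough samples like this can still
# nudge the student toward the teacher's trait: the signal is statistical,
# not semantic.
```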
And why does this matter? This is really, I think, a paper with a lot of bottom-line implications. First, the study really challenges a big assumption in AI development: that filtered or synthetic data is inherently safe because it looks safe to us as humans. And this matters because if unwanted behaviors like bias or misalignment or even reward hacking can persist across training generations, then it’s possible developers will lose visibility into how AI systems learn and what information they’re passing on. And this is particularly important in the context of model distillation, which is, again, training one generation of model on a prior generation’s outputs. And the study showed that if a misaligned model is used to generate training data, then the next generation of model could inherit that misalignment. That’s what we were just talking about. So that is potentially extremely concerning in the AI context, where there is a lot of focus both on how to get a ton of training data to make models more sophisticated and, at the same time, on how to reduce misalignment. And one of the conundrums that this essentially brings to the fore is that the training data may really matter. So understanding the training data and its attributes may become really important. But those attributes may be extremely difficult to study because, again, these patterns that can actually cause this kind of harmful teaching are not going to be apparent on the surface to humans.
So this is going to be an important piece of research in the AI development space. And the researchers suggest that model evaluations may also need to change. This is because the problematic behavior may not actually be exhibited in all different kinds of evaluation contexts. And the researchers conclude, and I’m going to quote here, “Our findings suggest a need for safety evaluations that probe more deeply than model behavior.” What they’re essentially saying is we may need to look at some of the data itself and not just at what the behavior of the model is on the surface. It’s a very interesting point, and we’ll see how that’s picked up in other research going forward. The other thing to think about, particularly outside of the developer context if you’re deploying models too, is that this may really put additional emphasis on the kinds of data that your models might encounter in the real world. And we see a lot of companies right now that are on the AI agent adoption curve. Some are just starting, some are really robustly leaning into agents. But unlike chatbots, which, depending on how you’ve set them up, users may be interacting with but which aren’t getting trained on user behavior, agents may often be out there in the real world, encountering all different kinds of data, data on websites, data in different virtual environments, and that data might be used to provide in-context training for the model going forward. This research interestingly suggests that it might really matter what data the agents are interacting with, because if that data is malicious, that’s essentially data poisoning. So data poisoning, the malicious intake of data that might affect model behavior, could occur out in the real world. But that kind of malicious data might be completely imperceptible to the humans overseeing those agents. It might just look like a string of numbers on a website. So how do we even tell if there’s some sort of malicious training going on? That is likely to be a big area of research going forward, and it falls into the broader category of focus on the cybersecurity implications around AI agents. And that is just an exploding area right now because there are so many topics like that that people are focusing on.
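One partial, illustrative idea for the agent scenario above, purely a hypothetical sketch and not something proposed in the paper, is to at least keep a provenance log of everything an agent ingests from the outside world, so there is an audit trail if behavior later drifts. The function name and log format here are assumptions for illustration.

```python
# Hypothetical sketch: fingerprint and log content an agent pulls in from the
# web before it is handed to the model as context, so ingested data can be
# audited later. Not a technique from the paper.
import hashlib
import json
import time

def record_ingested_content(source_url: str, content: str,
                            log_path: str = "agent_ingest_log.jsonl") -> None:
    """Append a provenance record for content the agent ingested."""
    record = {
        "timestamp": time.time(),
        "source": source_url,
        "sha256": hashlib.sha256(content.encode("utf-8")).hexdigest(),
        "length": len(content),
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Usage: call this before the fetched page text reaches the model.
# record_ingested_content("https://example.com/page", page_text)
```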
So, you know, there’s much more to say here. We could talk all about the implications for code generation. There are also some really interesting questions around how you would even know if your model had some sort of ideological bias that had been inherited from a parent model. And remember, this concept of ideological bias is going to be particularly important in light of the AI Action Plan that the White House recently released. I think we covered that on our last podcast episode, if you want a deeper dive. So the bottom line is, subliminal learning shows us that AI models are picking up more than we might have expected from data that humans can’t tell, on the surface, might be malicious or biased, even biased towards owls. So it’s important for us to be thinking about how we actually uncover this and then potentially manage the risks around it going forward. And we’re probably going to see some research in that direction. That is it, though, for this episode of “Waking Up With AI.” I’m Anna Gressel, and if you liked this discussion, don’t forget to subscribe or recommend the podcast, or just shoot us a note. Thanks so much.