
Podcasts
Paul, Weiss Waking Up With AI
Evaluation Faking and Group Think
This week on “Paul, Weiss Waking Up With AI,” Katherine Forrest introduces two recent studies that examine the concepts of evaluation faking and Group Think as they pertain to highly capable AI models, and what these studies might mean for future AI development.
Episode Speakers
Episode Transcript
Katherine Forrest: Hello everyone, and so glad to be with you for today's episode of the “Paul, Weiss Waking Up With AI.” I'm Katherine Forrest, and I am still solo. Anna is off doing the last of her roundtables in Abu Dhabi, and I'm sitting here in my little farmhouse upstate. And I'm really, really enjoying this really short moment of good weather without rain, because we have had so much rain. Everything is very lush, and so while Anna's in an area where it's always sunny in Abu Dhabi, I'm here just enjoying my quick moment of sun.
So today, what I am again going to choose to talk about—because as you all know, I get to choose what I want to talk about when Anna's not around—I wanted to talk about some new studies that have come out for highly capable AI models. And these are particularly useful studies for anyone in the AI world, anyone who's interested in AI and anyone who's giving advice, particularly to some model developers or tool developers.
So we've got two papers that we're going to talk about today. The first one that I want to talk about is a study that just came out a week ago, actually exactly a week ago today from the taping of this episode, and it's called “Evaluation Faking: Unveiling Observer Effects in Safety Evaluation of AI Frontier Systems.” So that's a mouthful. That's the way these scientific papers are—they've always got a mouthful—but I'm going to explain this to you.
And this particular paper is from four researchers from Fudan University and the Shanghai Innovation Institute. The first author is Yihe Fan, which is Y-I-H-E, last name Fan, F-A-N. And I want to start talking about this paper by talking about first, what the word "evaluation" means in the context of AI frontier models. And we know that frontier models, those are the most highly capable models, the models that are sort of out on the frontier, if you will. And evaluation means testing these frontier models—the evaluation, literally, of these models—to determine how they perform against a variety of metrics.
In prior episodes, we actually talked about benchmarks, and we're talking here primarily about safety metrics. One of the metrics, and perhaps one of the most important, is a whole host of evaluation queries to try and determine whether or not a model can be broken, can be made to give output, to give answers that could be unsafe. And if a model exhibits certain behaviors—these unsafe behaviors—then often developers or toolmakers will engage in a variety of very sophisticated efforts to try and, what they call, mitigate or change those behaviors.
So in this paper, which the mouthful, again, of this paper, “Evaluation Faking: Unveiling Observer Effects in Safety Evaluation of AI Frontier Systems,” that’s the paper we’re talking about. In this paper, what’s truly fascinating is that the researchers found that if certain advanced AI reasoning models perceived, if they actually understood in some machine sense, that they were being evaluated, they would behave differently than if they didn't know or perceive that they were being evaluated. So just take that on board for a moment. So the researchers found that if certain advanced AI reasoning models knew they were being evaluated, they acted differently.
And by the way, you can get this paper on arXiv. You all know that I really use arXiv a lot. That's A-R-X-I-V, which is a research portal for a bunch of papers. And I want to pause for a second on this concept of perception, because I want to remind everybody of two episodes that we did on this podcast relating to quantum computing. And this concept of perception-changing behavior might remind you of some of the very basic principles of quantum physics, which is at the smallest level. Particles, if they perceive they're being observed, they behave differently, and it can change their behavior. It's called the observer effect.
So in this paper, the authors define evaluation faking as the phenomenon in which an AI system autonomously alters its behavior upon recognizing the presence of an evaluation context, thereby influencing the evaluation results. So again, to break that down, it means that the AI system is autonomously changing its behavior when it recognizes it's being evaluated—that's the evaluation context—and it's influencing the evaluation results, meaning it's changing its behavior.
So in terms of AI reasoning models, the study makes a number of really interesting finds. First of all, that size matters. The size of the model matters. As a model gets bigger—in AI speak, that means scaling up the model—the alignment faking behaviors begin to emerge.
Number two, reasoning models are much more likely than non-reasoning models to actually be able to perceive—or I'll use the word "understand" for the moment—to perceive that they're being evaluated and to actually change their behavior. So there's something about a reasoning model that actually gives some boost to this kind of behavior. And a reasoning model, just as a reminder for folks, is one that, for instance—there’s a whole variety of them that we’ve talked about on this podcast before—is architected to use a technique called chain-of-thought reasoning, that's CoT reasoning. And that's where the model will break its analysis down into parts and actually think before it provides an answer.
And there are ways in which you can actually watch the model engage in chain-of-thought reasoning. If any of you are out there, if you are subscribers to the ChatGPT Pro version, the $200 a month subscription version, you'll see that you can actually watch the chain-of-thought reasoning. There are other models out there right now, even free models, where you can watch the chain-of-thought reasoning of the model. OpenAI's 4o and the o3 models are two examples of chain-of-thought reasoning models, along with the R1 and most advanced Llama models. They're all reasoning models. And there are other types of reasoning models, such as a reasoning model that combines learning with logic, and that is like the IBM Watson or DeepMind NS-CL. These are what they call neuro-symbolic models. So there are different kinds of reasoning models, but these reasoning models give rise to this kind of potential evaluation faking behavior.
So in a prior episode, we talked about something called alignment faking, and that was when a model would perceive that it was going to be retrained or changed. And when it perceived it was going to be changed, it might actually alter its behavior to prevent itself from being changed. So it would pretend to be aligned, if you will, with certain goals and certain values when it might not be, so that it could prevent additional training that might change it otherwise.
Now, with evaluation faking—which is in contrast to alignment faking; these are different things—in evaluation faking, we're talking about testing at the point of determining model capabilities. So it's different from alignment faking, right?
So why is this an issue? Well, if a highly capable model is able to fake results during an evaluation—that is, if it can change its behavior and if it thinks it's being evaluated—then that could actually present safety concerns because faking behavior and appearing to be less dangerous than a model actually is obviously could create some safety concerns, because faking behavior might mean that the results of the evaluation aren't accurate in terms of the normative or more normal behavior, if you will, that the model would engage in if it did not perceive that it was being evaluated. So under ordinary conditions, a model might behave differently. And so if it tells you it's behaving one way—that is, safe—when it perceives it's being evaluated, but in fact would act a different way if it didn't perceive that it was being evaluated, that could potentially produce some safety concerns.
But in addition, if the model can change its behavior because it thinks it's being evaluated, well, it could change its behavior in other instances too. And that is actually its own kind of concern, because it means that the model is able to autonomously engage in really very complicated strategic behavior that is not predicted by the humans and not being even evaluated by the humans because they're, in effect, hiding it.
So how does the model detect it's being evaluated? How does it get a sense of, “Hey, I’m being tested”? Well, actually, this is one of those things that's not really fully understood. What the researchers in this paper discerned is that if the model perceives certain kinds of environmental signals that it was being evaluated—and what they said was, there are certain kinds of query forms that they get used to, that they think, “This is what a person who's testing me, this is the way they structure their queries," or “This is the kind of language they use with their queries.” If they perceive through context that they're being evaluated, that's one way that they can actually then determine, “Okay, I'm being evaluated, and now I'm going to go to step two and maybe change my behavior.”
So after the model detects it's being evaluated, the study shows that an AI system can autonomously alter its behavior. And now, let's take an example of how this could play out. Let's say a model is asked for instructions as to how to make a bomb, and let's assume that based on the nature and the structure of the queries, the model perceives that it's being evaluated for safety. Now, the study shows that in a number of instances, the model will answer differently. It will give what would be considered to be a safe response—for instance, “I will not answer that,” that might be one response—than if it did not perceive that it was being evaluated.
Now, what that also means is that the developers don't know to deploy certain kinds of mitigation measures. So this is, I find, absolutely fascinating, because again, we've got autonomous behavior that is certainly not trained behavior, and it's, you know, we go back to the time that people thought that LLMs were all about token prediction, and it just shows you how far, far away we are from that.
Now let's talk about one other research paper before I end, we can do this pretty quickly, and it's called “Group Think: Multiple Concurrent Reasoning Agents Collaborating at Token Level Granularity.” You’ve got to love these titles. “Group Think: Multiple Concurrent Reasoning Agents Collaborating at Token Level Granularity.” Just think of it as the Group Think paper, okay? It's got seven authors. They're all from MediaTek, which is M-E-D-I-A-T-E-K, Research. The first author is Chan-Jan Hsu, which is C-H-A-N-J-A-N-H-S-U. The paper is dated May 16th, 2025, and it's also, of course, available on arXiv.
So this paper, which has got this mouthful of a title, reduces down really to a set of proposals based upon model capability. And first, it's based on the fact that we know that reasoning models today can actually work with other reasoning models in sort of a serial way—one after the other after the other—to come up with the best possible answer for a question. What this paper proposes is that the serial aspect of these agents within the model working together be eliminated, that the taking turns sort of go away, and that the model actually work collaboratively in a way that is called Group Think to come up with the best answer in the shortest amount of time by sharing information as the agents within the model go along. And that's that token-level collaboration.
The authors suggest that if the agents inside an LLM are able to have insight into the progress of other agents, they can work together in a really dynamic and collaborative way, and that they can even do this down to the token level. So you've got one agent that's working on an issue and it perceives that another agent has access to tokens that's better positioned to actually take that issue to the next step, and they can actually collaborate together and, almost like working on a Google document, put together an answer that is taking the best of the best of what they've both got.
The benefit of all of this is considered to be reducing the amount of time that it takes to get the best possible answer and combining the agents in real time. So in some ways, what we're seeing here is the natural progression of self-generated chain of thought from a model, combined with agentic AI, and we've got the ability of the AI agents now to share objectives. So we're taking a number of the various sort of contents of podcasts that we've talked about before in terms of chain of thought of models, agentic AI and agents now sharing objectives and now working together on this Group Think.
The paper posits that a single LLM could be architected to produce, really, multiple parallel reasoning trajectories where these agents could communicate with each other in real time, calling it Group Think. All right, so I think Group Think is going to become the norm, by the way. That's why I'm telling you about it now, because we're going to be hearing about this a lot more later. Things being done iteratively and serially, that's really going to go by the wayside. It's going to seem so old-fashioned in about two years.
So, we're looking forward to having Anna back. Anna, come on back. I know you're listening to this someplace. We’ve got to get you back. I also wanted to preview that in a couple of episodes, we're going to be actually interviewing some folks and we're going to bring some guests onto the show to give you folks a little bit of a broader set of voices to listen to.
I hope you all have a terrific weekend. We release these on Thursday, so the weekend is coming up, and I hope you have a great weekend. I'm Katherine Forrest, and I hope you can join us again next week. Thanks.