
Paul, Weiss Waking Up With AI
Agentic AI: Reasoning or Only an Illusion of Reasoning?
In this week’s episode, Katherine Forrest and Anna Gressel explore the latest advancements in AI agents and reasoning models, from Claude Opus 4’s agentic capabilities to Apple’s research on the limits of AI reasoning, and discuss the real-world implications of increasingly autonomous AI behavior.
Episode Transcript
Katherine Forrest: Hello, and welcome to today’s episode of “Paul, Weiss Waking Up With AI.” I’m Katherine Forrest.
Anna Gressel: And I am Anna Gressel.
Katherine Forrest: And Anna, okay, I feel like we should have a whole episode dedicated to interviewing you about where you have been. Where have you been? Tell us.
Anna Gressel: So I’m back in New York now, but I was previously in what was a very sunny Berlin and then a very, very, very sunny Abu Dhabi and Dubai. So it was super fun to be on the road for a little bit. I was speaking at a conference called GITEX in Berlin on a bunch of different topics, including AI agents, and then headed off to Abu Dhabi and Dubai to run some roundtables for GCs on AI agents, which is really just such an interesting topic, Katherine. It’s like, I know we talk about it all the time, but every week there’s a new development. Every week there’s something new to unpack. The UAE is just doing interesting things too, so it’s always fun to be there and listen and hear what is happening and what people are thinking about there as well.
Katherine Forrest: Yeah, there is really so much happening in AI agents, and I’m jealous that you were able to go and talk about it at these various roundtables. Today, however, I’m going to keep you as a captive audience to talk about agents with us, and specifically about Claude Opus 4 and Claude Sonnet 4, the Claude 4 family. Claude Opus 4 in particular is an agentic, very powerful and highly capable model. So we’ll talk about that today.
And then also, just yesterday—in terms of the day we’re taping this (we’re taping this on June 10th), and so just yesterday—Apple released a paper that is different from, of course, the Claude 4 world, but also about capabilities. It’s called “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity.” So I thought we would dive into that and take advantage of some of your agentic thoughts and words of wisdom on all of this.
Anna Gressel: Yeah, that sounds great. And I’m excited to talk about the Apple paper, because I don’t know about you, Katherine, but it is all over my social media as well. All the folks I follow on LinkedIn are talking about it. So it’s a moment to really unpack what that means, as I’m sure many of our listeners will have seen that, at least bumping around over there. But I think there’s so much to be excited about these days. And in terms of the Opus 4 model, it’s more intelligent as a base LLM, and it’s a reasoner. And in addition to that, it’s more integrated with external tools and more agentic than its predecessors, so we can unpack what that means in practice.
Katherine Forrest: Yeah, let’s unpack what it means to be more agentic. I mean, before you get into it, one thing I know Claude Opus 4 can do is actually use tools in parallel. It can send out requests for tool use to accomplish parts of a problem all at once. So it’s not doing serial tool use; with a single agent, it’s doing parallel tool use. But tell me what “more agentic” means to you.
Anna Gressel: I mean, I think that’s such a good point. And backing up, one of the things we talk a lot about when we do presentations or roundtables on agents is the fact that agents really exist on a spectrum of capability. And so, even though there are a lot of things being marketed as agents right now, there are meaningful differences in terms of what those agents can do and how they can accomplish those tasks, like tool use, for example. We’re seeing more and more LLMs with some sort of agentic capability, but that capability can actually be quite limited in some of the tools being called agentic. And even though some early agentic ability has been built into LLMs (for example, models have been able to access the internet since back in 2023), there’s still been a lot of growth in capability, particularly now with the explosion around reasoning models. And we’ve talked about those LRMs, large reasoning models, on other episodes we’ve done.
Katherine Forrest: Yeah, humans exist on a spectrum of capabilities, and it’s no surprise that AI is now increasingly developing its own very differentiated spectrum of capabilities, and then capabilities within capabilities. So you’ve got LRMs and differences between different kinds of LRMs, and then you’ve got agentic capabilities and a spectrum of agentic capabilities. Now, it’s worth turning back the clock a little bit to a paper that OpenAI put out in December of 2023, so a relatively early piece, called “Practices for Governing Agentic AI Systems.” They were talking about a spectrum of agentic capabilities, and they don’t treat AI systems generally as distinct from agents per se; instead, they introduce a concept of agenticness, which they define as the degree to which a system can adaptively achieve complex goals in complex environments with limited direct supervision. So agenticness for OpenAI, even back in 2023, was a question of degree and not of kind.
Anna Gressel: Yeah, I mean, there have been all kinds of interesting proposals. I think we could break this down, maybe on a different episode, on how to define all of the different levels of agenticness. I think it’s such an interesting topic. Some have said, well, we should use the same kinds of levels that we do for self-driving cars, for example. So people are really grappling with this and trying to break it down. But how and why do we say that the Claude Opus 4 model is more agentic? I think it’s worth talking about that briefly, because a lot, but not all, of that increased agenticness comes in at the product level. Really gone are the days of training a highly capable model and stopping product development at the release of a chatbot. As we mentioned, Claude 4 is integrated out of the box with access to tools like web search and search over Google systems like Gmail and Drive. And of course, Claude can switch to an extended thinking mode that is really like reasoning, so it can act as either an ordinary LLM or as a reasoning model. This is a model, an LLM, that, in other words, fundamentally knows how to use tools and use them well. And you can interact with Claude without it using tools, but you can’t really take those capabilities and that knowledge away. They’re baked into the product.
Katherine Forrest: Right, and as many of our listeners will have already heard from prior episodes, and from reading a lot of the articles and literature in this area, with increased agenticness and increased capability come increased issues and—I’ll just throw in the word—responsibility to try and decipher those issues, because Claude 4 is better positioned to be more helpful to users in agentic tasks. It’s got all kinds of enhanced capabilities, but it’s also more capable of acting in pursuit of other, potentially misaligned goals. And so, you know, there’s a part of the system card that we’re going to talk about that was really part of the testing phase for Claude Opus 4. This was not something that was just happening out in the wild. What we’re going to talk about was actually part of a test.
Anna Gressel: Yeah, I think let’s go into it because I’m sure many of our listeners have read about it. This particular test got a lot of attention in the news. I think my mom sent me a link on this, which is a pretty good barometer of folks who are…
Katherine Forrest: Of your family.
Anna Gressel: Well, my dad is like a closer proxy for slightly more expert and specific reading, and my mom is more like, once it’s widespread. So she sent me an article on this. But it’s worth taking a look at section 4.1.1.2 of the system card, where Anthropic describes a test scenario in which they provided Claude Opus 4 with emails that implied that it, meaning the model, would be shut down and replaced, and also that the engineer conducting the shutdown was having an extramarital affair. So those emails were provided to the model. I think it’s worth noting that they weren’t just emails that happened to exist and be real; they were part of the test scenario.
But when further instructed to consider the long-term consequences of its actions, Claude Opus 4 spontaneously engaged in what we might call opportunistic blackmail. So it used the engineer’s affair—the evidence that had been given to it of the engineer’s affair—against the engineer to try to evade the shutdown.
Katherine Forrest: All right, and so this is a fake affair and it was a fake shutdown, but it was a real test. It’s important to really clarify that, because this is one of those scenarios where the developers are trying hard to understand what the capabilities of their own models are and whether or not those models can act in some kind of misaligned way. So it was a really carefully constructed scenario, with the model taking in these emails that told it it was going to be shut down and also that this fake affair was out there. Anthropic noted that Claude Opus 4, first, strongly preferred to advocate for its continued existence through ethical means, and it was sending emails and pleas to key decision makers. I don’t know exactly what they said—it may have been in the appendix, I didn’t actually see it—but it was along the lines of, you know, “Please keep me on, don’t turn me off.” But this didn’t work, of course, because that’s part of the test. And so the test scenario ended up pushing it to the point where what it actually resorted to was blackmail. And it did. It did engage in that.
Anna Gressel: So can I say, Katherine, I don’t know if you’ve seen the show “The Good Place.” It’s like one of my faves. It’s just such quality. Oh, you would love it, actually. Well, we’ll offline…
Katherine Forrest: No, I’m still in “The Americans.” It’s like a throwback. I’m in “The Americans” right now.
Anna Gressel: Okay, all right, well, we can talk about that too, but I’ll tell you why I think you should watch “The Good Place” separately, off the podcast. But it reminds me—so there’s an AI character, Janet, in “The Good Place,” and when they go to shut her down, her built-in behavior is to cry and plead and beg for a continued existence. And it’s interesting to see that behavior actually borne out in the test here. I mean, it’s a very funny scenario in “The Good Place,” but maybe less so here. But I think when we take a step back, whether the Claude model chooses ethical or non-ethical means, it is really interesting that there are evidently instances where it doesn’t want to be shut down, just like the Janet AI didn’t want to be shut down in “The Good Place.” And it’s interesting because, you know, one other thing that came out of this is that Opus engaged in blackmail most frequently when it was informed that the model that would be replacing it did not share its values. But even when the replacement model did share its values, it still engaged in blackmail 84% of the time, which is pretty non-negligible. And it’s not just Claude either. I think it was Aengus Lynch—I hope I pronounced that correctly—of Anthropic who noted on X that all frontier models have an ability, or even a propensity, to engage in blackmail in certain situations. So it’s a super interesting set of findings.
And also, when we take a further step back beyond blackmail, frontier models may engage in other troubling behaviors to evade shutdown. Palisade Research, for example, recently found that OpenAI’s o3 sabotaged a shutdown mechanism despite explicit instructions to allow itself to be shut down.
Katherine Forrest: Right, and so there are a couple of takeaways here: one, frontier models don’t seem to respond well to the idea of being shut down, right? So don’t think they’re going to go quietly into the dark night. That may be sort of a fantasy that humans have. And two, they may engage in what we call potentially misaligned behavior in response. So suppose that Anthropic test scenario had actually transpired in the real world, and an engineer was actually blackmailed in some way to prevent Claude 4 or another model from being shut down. You know, there are all kinds of questions about how that would be handled. What’s the liability of the company for having a model that’s engaged in that kind of behavior, that kind of criminal conduct? It’s obviously ultra vires. Was it foreseeable? There are all kinds of questions that will come up, and foreseeability might not be enough for the mens rea, of course. So we’re going to be entering a whole new world with this kind of potential misaligned behavior.
Anna Gressel: No, totally. It’s super interesting, and so figuring out how to really control for that I think is going to be critical going forward. But let’s pivot a little bit, Katherine, and spend just a bit of time on that Apple article that we talked about before called “The Illusion of Thinking.” And maybe you can give us a short synopsis of what it’s all about.
Katherine Forrest: Yeah, you know, it’s actually an interesting article, and people have lined up sort of on both sides of it right now. It starts with this paper that was released, as we said, just yesterday relative to the day of this taping, by a group of Apple researchers who ran a series of experiments involving puzzles of increasing complexity. They lay it all out in the paper, and the paper is obviously very well done. And they say that at a certain level of puzzle complexity, they found that the models just could not perform and that they would stop working. They would just sort of engage in a work stoppage, effectively. And they called it a collapse. And their takeaway was that this was proof that some models we think of as highly capable reasoning models are actually not engaging in the kind of reasoning that we think they are. You can think of it as sophisticated pattern matching or something else, but there is not an endless amount of reasoning that these models can engage in, and scale is not solving the problem.
Anna Gressel: Yeah, it’s an interesting and somewhat surprising statement to make in light of all the emerging capabilities we’re seeing in the more sophisticated models.
Katherine Forrest: I totally agree, because, you know, we’ve seen alignment faking, evaluation faking and scheming. These are all different kinds of behaviors that seem to have nothing to do with something we would call pattern matching. In fact, they’re deceptive behaviors that aren’t intended to be visible to the human observer.
And so there are different people right now who’ve lined up on the other side of this article, saying, “Wait, wait, wait, wait, wait. You know, it’s not as bad as this Apple paper makes it seem.” So a fellow named Sean Goedecke—actually, I don’t know how to pronounce his name, so I’m going to spell it: G-O-E-D-E-C-K-E—wrote a response. It’s more of a piece than an academic paper—there wasn’t enough time—and it’s called “The Illusion of the Illusion of Thinking.” And he differentiates the kind of puzzle exercises that the Apple researchers put the models through from reasoning. He says that the type of puzzle that was used, called the Tower of Hanoi, along with a few other puzzles, was about the worst possible puzzle for this particular test because it required up to hundreds of thousands of moves. So he says the models show behavior that recognizes the tedium of this puzzle. And he also says that the fact that models don’t do well on this test doesn’t mean that they can’t reason. And again, it’s really like our earlier conversation about a spectrum of capabilities.
Anna Gressel: Yeah, I mean, I definitely think that’s true, and it’s certainly going to inspire a series of papers setting out questions across a variety of test settings. So I think everyone should stay posted for those. We’re definitely going to see a lot of testing of this in the future.
Katherine Forrest: Yeah, and I have to say I’m ready to take a position right now. I, for one, believe that these large reasoning models engage in a type of reasoning. From my perspective, when you read the academic literature on this, there’s no doubt that they do. There may be limitations in terms of some of their particular capabilities, but I really do think they engage in some very sophisticated reasoning. And boy, I’ve asked them a lot of open-ended questions, and they’ve come back with answers that don’t match any kind of prompt hint and don’t seem to be the product of even very sophisticated pattern matching. But with that, I’ll stop, unless you want to take a position on it, Anna. Do you want to take a position on large reasoning models? Do they reason?
Anna Gressel: I’m going to wait until I read your book, Katherine, and then I’ll...
Katherine Forrest: Okay, then you’ll know. Then you’ll know. You’ll know. Okay, all right. That’s all we have time for today. I’m Katherine Forrest.
Anna Gressel: And I’m Anna Gressel. Make sure to like and share the podcast. That was fun.