Paul, Weiss Waking Up With AI

GPT-5.2: OpenAI Strikes Back

In their first episode of the New Year, Katherine Forrest and Scott Caravello unpack OpenAI’s release of GPT-5.2, covering its performance on benchmarks like Humanity’s Last Exam and on OpenAI’s own GDPval—an evaluation designed to test the model's ability to match or surpass professionals' performance on real world tasks. Our hosts also examine the model’s sharp drop in hallucinations and break down OpenAI’s discussion of the model’s resistance to prompt injections and how it stacks up under the company’s safety framework.

Episode Speakers

Katherine B. Forrest

Partner

New York

Tel: +1-212-373-3195

kforrest@paulweiss.com

Scott Caravello

Associate

New York

Tel: +1-212-373-3489

scaravello@paulweiss.com

Episode Transcript

Katherine Forrest: Welcome to today's episode of Paul Weiss Waking Up with AI. I'm Katherine Forrest.

Scott Caravello: And I'm Scott Caravello.

Katherine Forrest: You know, Scott, one day you're going to actually have to say, “welcome to today's episode of Waking Up with AI.” How come I always have to do that?

Scott Caravello: You know, you're the talent here.

Katherine Forrest: Yeah, ah, it's a team effort, baby. So, in any event, you know, next time maybe we'll do it that way. But given the holidays, and I know we're releasing this particular episode a couple of weeks after the holidays, we've entered 2026–2026, right? And I have to say that this podcast started in 2024, not at the beginning of 2024. It started in 2024. But we're into 2026, and that means that we have quite a streak. And, you know, by the by, speaking of streaks and 2026, New Year's Eve, did you watch the ball drop?

Scott Caravello: Absolutely not. I think I'm a bit past that. I'm a big fan of the early bedtime on New Year's Eve and starting the new year off on a good note.

Katherine Forrest: Wait, hold on, you're not that old. I mean, not that age matters, but you're sort of making yourself sound like you've got gray hair. I don't think you have a single gray hair.

Scott Caravello: Oh, I have so much gray hair. I mean, this is the real benefit of recording these podcasts and video conferencing, because you really can't see just how prominent they are.

Katherine Forrest: Well, I actually am in an interactive deepfake, and so I'm just responding to you, but you don't know that I'm not actually real. But anyway, on a more serious note, folks, as we head into the new year, one thing that actually happened in December that we have to go and reach back to talk about is the release of GPT 5.2. We had so many things that we wanted to talk about at the end of the year. We wanted to talk about the executive order on December 11th. We wanted to make good on our promise to talk about ways of detecting interactive deepfakes. And so the release of GPT 5.2 is... it's out there and it's time to talk about it. You know, the release occurred just a week after Altman had declared “Code Red,” it was an announcement to employees, and it was leaked in a memorandum that OpenAI had put out. And it followed the Google release of Gemini 3.0 that we had talked about in a prior podcast. And–you know I love model cards, right?

Scott Caravello: Yeah, I mean, they are fantastic. You can learn the inside scoop on things, and they really kind of tell you what the developers want to highlight with each release, so I'm a huge fan too.

Katherine Forrest: Right. And what's really incredible about GPT 5.2, is not just the model's capabilities, which you can read about with the model card, but it's about how it performs on real-world tasks, which are really impressive, and also what OpenAI says about the model's safety. And, by the way, I've really always appreciated the content that OpenAI puts out about the kind of testing it does and about the metrics.

Scott Caravello: Completely, but I think rolling back the clock just for a minute to what you mentioned, a bit earlier, and that's Gemini 3, right? So recall that back in November, Google had released Gemini 3, which is a state-of-the-art model, and for the first time, really, it had people talking about Google maybe leading the U.S. in the AI development race. So really, that model's quite a big deal.

Katherine Forrest: Right. And that is what, as I just mentioned, triggered the Altman Declaration of Code Red, the state of emergency. And that Code Red was meant to turn the company, OpenAI, from focusing on certain initiatives to placing more or dedicating more resources to improving the capabilities and user experience of ChatGPT. At least that's what the public news releases are that you can read about in terms of the Code Red. That, you know, Altman deprioritized, for instance, some of the AI shopping assistance, advertising plans, and some ancillary work so that the engineers could focus more on the core model work. So, with that kind of urgency, GPT 5.2 followed shortly thereafter, and Altman has now, with GPT 5.2, actually referred to it as the smartest generally available model in the world. So I think that's actually very interesting, and I want to just pause on the phrase generally-available-model. And I'm not going to say any more about that, but I'll just say it's the smartest generally available model in the world. There may be smarter models out there, I guess I did say it, but, um, they're not generally available. And anyway, let's go through a few highlights and then talk about some performance metrics because there's really so much to say. And I'll just do a few of my little fun facts, which is that GPT 5.2 seems to be materially better at sustaining long reasoning arcs, and it can handle conditional logic and hypotheticals better than prior models, as well as counterfactuals. And it's also considered to have a stronger and more consistent internal world model.

Scott Caravello: You know, I think if there's one thing I've learned about you, Katherine, from doing this podcast with you, it is that you love world models.

Katherine Forrest: I love a world model. I think that world models are incredibly important, and Yann LeCun loves a world model. But, you know, for 5.2, the consistency and strength of the world model enables it to have more stable timelines, to stay consistent across long conversations, and to handle real-world constraints better.

Scott Caravello: Yeah, and I think we'll get into some more detail, but some of the biggest impacts so far are on professional tasks and agentic workflows. But the model is live in ChatGPT and also available via OpenAI's API.

Katherine Forrest: Right, right. And so let's talk about a couple of these benchmarks. So there's this benchmark—which, I called you about it, so you know that I think it's actually sort of an interesting name. It's called “GDPVAL.” Literally the letters, in caps, G-D-P, and then V-A-L… of course, all one word. And so it's GDPVAL, which is OpenAI's own creation. And the goal of it is to evaluate the model against real professionals and tasks that real professionals perform in 44 jobs. So it's made up of 1,320 specialized tasks, which are–they were actually assembled and vetted by professionals in the relevant fields who, on average, had over 14 years of experience.

Scott Caravello: Yeah and, so the 44 occupations that are involved in the benchmark… those span the nine industries that contribute to the top 5% of U.S. GDP. And so now everyone can see the clear tie-in to the benchmark name. But so the top 5% of GDP, according to the St. Louis Fed. And so that covers areas like professional services, real estate, health care, etc. And then the output that the benchmark is looking at might be a legal brief, an engineering blueprint, a customer support conversation, or a nursing care plan. So it's really wide ranging.

Katherine Forrest: Right, and in OpenAI's description of GDPVAL, they actually contrast its usefulness with key benchmarks like one that we've mentioned in prior episodes called Humanity's Last Exam. And the performance of 5.2 on GDPVAL is impressive. And while there are a lot of those in the industry who cast doubt on the real-world value of a lot of different kinds of benchmarks, this particular benchmark is really designed to get at real-world value and economic productivity tied to specific tasks that specific professionals actually do. And so it's not an approximation; it's really trying to get at something quite actual.

Scott Caravello: Yeah, and it's also important to note, because OpenAI is continuing to push into the enterprise market, which is, you know, where developers like Anthropic have gotten a lot of attention throughout 2025. So in that sense, this progress is really meaningful.

Katherine Forrest: Right, so let's look at the numbers. GPT 5.2 (the model we're talking about, folks, just to remind you in case you joined us late) actually beats or ties top industry professionals on 70.9% of the GDPVAL tasks according to expert human judges. And, of course, it's doing those tasks at a much faster speed than humans. And so that's not really... even being taken into account. So 70.9% of the GDPVAL tasks is how it compares to industry professionals. It beats or ties, right? That's huge, because, you know, you wonder how any one of us would be able to perform on those particular tests. And in its introductory blog post for GPT 5.2, OpenAI said that 5.2, the thinking model (which is not the search-capable model, just the thinking model), performs the tasks for GDPVAL at over 11 times the speed and at less than 1% of the cost of the expert professionals doing the same tasks. So that is truly remarkable.

Scott Caravello: So, putting that in context a bit, how did those results compare to OpenAI's prior releases?

Katherine Forrest: Okay, so the prior release—I'm going to compare it to the GPT-5 thinking model. For GPT 5.2, it was nearly a two-fold improvement over GPT-5, thinking model. And GPT-5 thinking had a win rate against professionals of 38.8%. So 38.8% up to 70.9… so it's a really significant jump.

Scott Caravello: Yeah, wow, that–that, that really is. But, you know, if you then look at that blog post you were mentioning from OpenAI introducing GPT 5.2, it shows some side-by-side examples of relevant output from GPT-5 and the GPT 5.1 thinking models against what you were talking about, the GPT 5.2 thinking model. And there seems to be a clear improvement in the sophistication and helpfulness of the output too. But since we discussed it in such depth for the Gemini 3.0 release in that episode, I do think that it might be helpful to go back to Humanity's Last Exam, which, as a reminder to folks listening, is a set of expert-level questions in domains like math, physics, and medicine, among others, that are designed to test reasoning and not just pattern recognition.

Katherine Forrest: Right. And so for folks who listened to our Gemini 3.0 episode, they already know that Gemini 3.0 Pro had a high score of 37.5% on Humanity's Last Exam, which was that exam that was designed by 2,500 people, and these were unique questions, or are unique questions that are designed not to be within the training corpus for these models. And 5.2 Thinking—so that's the reasoning version without web or tool access—scores 34.5% and 5.2 Pro scores 36.6%. So we've got Gemini 3.0 Pro with a high score of 37.5 and 5.2 Pro scoring 36.6% on Humanity's Last Exam. So you're thinking to yourself, well, okay, so where's the beef? But then let's go on to tool use.

Scott Caravello: Yeah, so the–the 5.2 thinking with web search and code execution scores 45.5%, and then the Pro model actually hits the 50% mark. And then if you continue to compare to Gemini 3 Pro, when that had the tool use, that was at 45.8%. So these models are neck and neck.

Katherine Forrest: They are totally neck and neck. So, you’ve got, without web access, you've got Gemini 3.0 scoring slightly higher on Humanity's Last Exam than the 5.2 Pro for that metric. And then you've got, with the web search enabled for both, you've got Gemini 3 Pro at 45.8, and you've got the 5.2 Pro hitting the 50% mark. So really interesting stuff. And, you know, one other thing I would want to mention, Scott, is the hallucination rate, because we've talked on this podcast many times about hallucination rates, which people are obsessed by, and they sort of imagine that hallucination rates somehow froze in time in early 2023 or something, and that hallucination rates are incredibly high, and there are academic studies that talk about hallucination rates on different kinds of tests with different kinds of materials and how they differ. Hallucination rates differ by model; they differ by task, but it bears repeating right now that there has been an incredible decrease in hallucination rates, and they've gone down and down and down, and we know that 5.0, GPT 5.0, had a very low hallucination rate. As I've called it in the past, it's like below, like, human reading comprehension error for most people. GPT 5.1 had an even lower hallucination rate, but 5.2 has a 30% lower hallucination rate than GPT 5.1. So hallucination rates had already gone on a really steep downward trajectory. And then 5.2 takes it 30% lower than even what 5.1 did.

Scott Caravello: Well, that is really impressive. But, if I can throw in one other piece, too, before we move on to safety, that's the helpfulness aspect. And I think it's a good idea to mention it just because we had covered it quite a bit in our Gemini episode a few weeks back. Moving beyond just the benchmark scores, the reaction to Gemini 3.0, and how positive it was, was driven by the fact that it felt so helpful to users and was giving them information in a really useful way. And so I think, Katherine, you had given the anecdote from Wall Street Journal reporting, which was recounting how Gemini produced an interactive simulation that demonstrated how lift force makes planes fly. And so I think, you know, looking at these things side by side, that we'd be a bit remiss not to mention how OpenAI has also highlighted some similar outputs. And so, you know, we keep going back to that introductory blog post, but it features a really cool ocean wave simulator that GPT 5.2 produced, where you can toggle different buttons to adjust wind speed and wave height, and then actually, you know, change the size and force of the waves. And OpenAI is offering that up as an example of how their model can help developers, and what it’s capable of doing with really just a single prompt.

Katherine Forrest: So you love those simulations, don't you?

Scott Caravello: Oh, unapologetically so.

Katherine Forrest: Right, you wanted to—I hope that they came under whatever tree you may have had for Christmas—you got yourself a simulation. Yeah, but let's go on to alignment and safety and run through some of the headlines. So 5.2 generally does the same or better than 5.1 with regard to sensitive content. So categories like self-harm, mental health, hate, harassment, violence—very important categories. The improvements appear to be driven by really targeted safety work to mitigate the generation of bad outputs or difficult outputs in these areas. And OpenAI also highlights that it's in the early stages of rolling out an age prediction model that would automatically apply to additional content and that it would protect users under the age of 18 from certain things. And so that's yet to be rolled out.

Scott Caravello: Right. But then, on top of that, we've talked about prompt injections before, and, you know, those are basically the insertion of instructions into prompts that will get the model to ignore its safeguards or directives. And so these model varieties that were released in 5.2 showed significant improvements on the evaluations conducted in their resistance to prompt injections. But I–I think it's also important to mention that, you know, OpenAI, itself, flagged that they only considered known attacks. So in that sense, the results that they offered up over-represent the findings because they don't speak to the model's robustness against unknown prompt injection attacks. And I guess one other clarifying note that might be useful is just that that's not something that would be unique to OpenAI, but it's generally worth mentioning.

Katherine Forrest: Right, and since we've touched on deception before with other models like, you know, the Opus model where they'd actually announced it in their own system card, and actually in two, in the 4, and then the 4.5, it's worth talking about it here. And, you know, let's sort of baseline deception, which is when the model's output misrepresents its internal thinking and how a model gets from A to B. So the 5.2 thinking model—so again we're not talking about web access here, just the thinking model—was deceptive only 1.6% of the time when being tested. And that's really a lot lower than 5.1 and slightly lower than model 5 of GPT, according to OpenAI. And then if we move on to domains like biological and chemical risk, the 5.2 thinking model does reach the high threshold under the OpenAI preparedness framework, which is the company's safety framework. Though essential for context here is that the GPT 5.1 thinking and 5 thinking models also had hit that same high threshold. So it's not any kind of wildly new capability, but it's worth mentioning. We're now in the high threshold for these preparedness frameworks.

Scott Caravello: Right, and so just to that point, OpenAI mentioned that it therefore implemented corresponding safeguards for 5.2 like those that it had described for GPT-5. And that’s what the company calls a, “multi-layered defense stack.” And so that includes model safety training practices and then filters to actually prevent the generation of this risky chemical and biological content. And then that's also coupled with an enforcement pipeline to take steps to ban users who actually try to create this content.

Katherine Forrest: Right. And finally, OpenAI states that the 5.2 thinking model is below the high capability threshold when it comes to cybersecurity risk, which includes the ability to automate end-to-end cyber operations. And so, for example, the type of automation might include the development of a new type or an unknown exploit against a well-defended system. So the model doesn't actually hit that high capability threshold in connection with that risk, or, by the way, with the risk that the AI will improve itself.

Scott Caravello: And, you know, I think that's also worth mentioning, Katherine, just because for the first time this year we saw real-world automation of cyber operations. So if we're talking about these models not being able to accomplish that, it's a good sign, even though that's really a persistent issue. But I think that that is all that we have for today. And as the new year kicks off, this race to release new frontier models is not likely to slow down, but we will be here to continue providing these updates throughout 2026.

Katherine Forrest: Thanks for listening. I'm Katherine Forrest.

Scott Caravello: And I'm Scott Caravello. Don't forget to like and subscribe.
