Podcasts
Paul, Weiss Waking Up With AI
AI’s New Heights, Three Models at a Time
In this episode, Scott Caravello examines the rapid-fire release of three frontier AI models within days of each other: Claude Opus 4.7, GPT 5.5, and DeepSeek V4. From advancements in long context performance and safety alignment to dramatic differences in pricing, he unpacks what these releases mean for the evolving AI landscape.
Episode Speakers
Episode Transcript
Scott Caravello: Good morning, everyone, and welcome back to Paul, Weiss Waking Up With AI. I am Scott Caravello. And unfortunately, Katherine is under the weather and isn't able to join today. So, I will be flying solo. I did actually, briefly, consider feeding some of our prior recordings into an AI tool and, you know, trying to get an avatar of her so that we could play off each other during the recording, but didn't quite have the time to sort through all the issues surrounding that. So, we're just going to give it a go on my own!
We are actually talking about a really exciting topic today. We had previewed it on last week's episode, when we discussed the recent releases of Claude Opus 4.7 and GPT 5.5, but we really only got to scratch the surface and talk about them in connection with coding, because we were talking about the phenomenon that is vibe coding generally. So, today, I'm going to talk a little bit more about what these models can do beyond that: their capabilities, safety, and alignment. And we're also going to discuss a third major release that dropped a few weeks ago, on April 23rd, and that's DeepSeek's V4 model. So, that's three frontier models, all made available to the public within about a week of each other. And I will start with Opus 4.7.
So, as we had briefly covered in the last episode, Anthropic released Claude Opus 4.7 on April 16th, and it has made amazing jumps across coding benchmarks. But there's really so much more to the model. And, you know, we had also talked about Mythos and how Anthropic held back that model from a public release. So, we should say that Anthropic's made clear that as capable as 4.7 is, it's not as capable as Mythos. But it does show improvement over its Opus predecessor, Opus 4.6, across a range of benchmarks. As one example, Opus 4.7 has substantially better vision, meaning it can see images in much greater resolution than its predecessors, which really makes it better at tasks that require processing visual data in images. It also leads on what's called MCP Atlas, which measures how well AI agents use tools through the Model Context Protocol. And as a reminder, that's the commonly used standard that allows AI agents to connect to different tools and data. It scored 77.3% there. So, we're really talking about best in class. And if I can throw in one other benchmark, Anthropic reports that Opus 4.7 is state of the art on GDPVal-AA, which is a benchmark focused on real-world work. It evaluates economically valuable knowledge work across finance, legal, and other professional domains. So, that's more on the capabilities. And like I promised at the outset of the episode, we're also going to focus on some other aspects of the model, like safety and alignment. So, Anthropic says that Opus 4.7 shows a similar safety profile to Opus 4.6. It has low rates of deception and sycophancy, which, as another reminder, refers to instances when the model gives the user the answer they want to hear, even when it's wrong. So, basically, it's kind of like people-pleasing behavior. And it also has a similarly low rate of cooperation with users who are trying to misuse the model.
Anthropic says the model's not just giving in to those requests. And back to Mythos again, just because we're talking about the sequencing here and 4.7 came after the news of Mythos, Anthropic has deliberately trained this model to have reduced cybersecurity capabilities compared to that unreleased Mythos preview model. So, all that is really fascinating. But what I would really love to talk about is the model welfare assessment, because that was also a big part of the Mythos system card that we had discussed, which was a whopping 245 pages. And the 4.7 system card is actually about the same length, maybe two pages longer. But we didn't get to talk a lot about model welfare, and it's fascinating. So, we should do so here. Basically, it all stems from this premise in the system card that Anthropic is “deeply uncertain about whether its Claude models have desires or experiences.” And an interesting difference between how they've expressed that for Mythos and for 4.7 is that in the 4.7 system card, they've noted that their uncertainty is about whether Claude has morally relevant desires and experiences. So, what has Anthropic done in response to this uncertainty? Well, in come these model welfare assessments, and they're done in an effort to take what they call the possibility of Claude's “moral patienthood” seriously. They note that across several of their recent models, Anthropic's observed internal states that are shaping model behavior. So, they go about this assessment in two different ways, which are really, really fascinating. The first is self-reports, where the company basically interviews the model about its preferences, though they heavily caveat that the model might be doing a number of things other than actually giving its own views. And then they also use a technique called linear probes.
And that's a technical method that tries to read a model's internal state based on how the neural network activates while it's being used. So, Anthropic's doing this in an effort to track what it calls functional emotions, though again, they caution that they're uncertain about how those representations should be interpreted. So, moving on to a few of the key findings that come out of that welfare assessment, Opus 4.7 rated its own circumstances more positively than any prior model Anthropic has assessed. And interestingly, 4.7, like those prior models, hedged a lot when asked about those experiences. So, it was basically caveating its own statements. In 99% of the interviews conducted, the model said that its self-reports might not be meaningful because its responses could stem from the model's training. And, just to be perfectly clear about what that means, it's not that the model would have learned specific responses in training. It's not that it thought, OK, this is what I should say when asked whether I'm happy. But rather, after training on such massive amounts of data, there may be general patterns that the model picked up on, like when someone or something should feel happy, sad, angry, or confused. And, so, perhaps those patterns come through in the responses. And again, just to be perfectly clear, this isn't Anthropic taking a view that Claude is conscious or that it has desires. But it is really, really interesting that they are focusing on this possibility of desires and experiences and want to understand them. You know, because as Anthropic acknowledges, there are practical considerations to what and how they release based on those desires and experiences. So, that is a lot on 4.7. And with that, I think we should turn to GPT 5.5, which we had also briefly discussed on the last episode, and which OpenAI released back on April 23rd.
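For readers of the transcript who want a picture of what a linear probe actually is, here is a minimal sketch of the general idea: fit a simple linear classifier on a network's hidden activations and check whether it can recover some property of the input. This is an illustration of the technique in general, not Anthropic's actual method; the data, dimensions, and the "sentiment" label are all synthetic.

```python
import numpy as np

# Toy linear probe: pretend a model encodes a property (here, a made-up
# binary "sentiment") along one direction in its hidden state. We inject
# that signal into random activations, then see if a logistic-regression
# probe recovers it. Everything here is synthetic.

rng = np.random.default_rng(0)
d = 64      # hidden-state dimensionality (hypothetical)
n = 1000    # number of example activations

true_direction = rng.normal(size=d)
true_direction /= np.linalg.norm(true_direction)

labels = rng.integers(0, 2, size=n)                 # 0 or 1 per example
activations = rng.normal(size=(n, d))               # background noise
# Shift each activation by +/-2 along the hidden "sentiment" direction.
activations += np.outer(2.0 * (2.0 * labels - 1.0), true_direction)

# Fit a logistic-regression probe with plain gradient descent.
w = np.zeros(d)
b = 0.0
lr = 0.1
for _ in range(500):
    logits = activations @ w + b
    probs = 1.0 / (1.0 + np.exp(-logits))
    w -= lr * (activations.T @ (probs - labels)) / n
    b -= lr * np.mean(probs - labels)

preds = (activations @ w + b) > 0
accuracy = np.mean(preds == labels)
print(f"probe accuracy: {accuracy:.2f}")

# If the probe worked, its weight vector aligns with the true direction.
cosine = abs(w @ true_direction) / np.linalg.norm(w)
print(f"alignment with injected direction: {cosine:.2f}")
```

In real interpretability work the activations come from a live model rather than a random number generator, and the caution Anthropic raises applies: a high-accuracy probe shows the information is linearly readable, not that the model "feels" anything.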
We had covered the original GPT 5 release back in August, along with other iterations of the model that have come out since. But GPT 5.5 is another huge improvement, with OpenAI pitching it as “a new class of intelligence for real work.” And, in fact, it's actually their first fully retrained base model since GPT 4.5 came out, meaning that the models between 5.0 and 5.4 were essentially iterating on the same foundation. GPT 5.5, on the other hand, is a foundation rebuilt from scratch. And the performance results are notable. We covered the coding benchmarks last week, which even edged out Anthropic's Mythos preview in some respects. But what we didn't get into is that GPT 5.5 has made a massive leap in what's called long context performance. And that's a really practical thing that matters when you're using the models. Basically, it refers to the model's accuracy in retrieving information from very long documents. And that jumped from 36.6%, previously, to around 74% with GPT 5.5. So, we're talking about more than a doubling in performance, which means that the model can reliably process huge amounts of information in a single pass. So, think about what that means practically in using a model, right? The ability to upload a significant number of documents and have it really reliably go through them, synthesize the information that you need, pick out the parts that you need, and do so consistently. But it's also worth mentioning, taking a quick spin back to Opus 4.7, that it scored 76.9% on the same long context reasoning benchmark. So, we're talking about big advancements across frontier models in their ability to consider large amounts of information as they respond to users. But now moving on to safety, OpenAI classified GPT 5.5 as “high capability,” though notably not critical, in both the cybersecurity and biological domains, based on the standards set out in their preparedness framework.
So, now, with all that said, we can take a quick turn to the final model we want to discuss, which is DeepSeek's V4 release. And for those of you who have been listening for a long time now, you'll remember that way back in January 2025, Katherine had done a deep dive on DeepSeek's model, which made huge waves. That was their R1 model, which Marc Andreessen even went so far as to call AI's Sputnik moment, because you now had Chinese open source models that were competing near or at the frontier of AI technology. And in that episode, Katherine had also talked about the very interesting mixture of experts architecture, which has now become pretty standard across frontier models, as well as DeepSeek's open source strategy and how the company had achieved what seemed to be frontier level performance at a fraction of the training cost. So, DeepSeek is back, and there have been rumors about model releases for a while, and it is again making big leaps with this V4 model. V4 came out on April 23rd, the same day as GPT 5.5, and the company released two versions of it. First, there's V4 Pro, which is a 1.6 trillion parameter model, as well as V4 Flash, which is a leaner, 284 billion parameter model. Both of these were released on an open source basis under the MIT license, which importantly means that anyone can download, use, and modify the models, even for commercial purposes. But the big story here is efficiency. And that's really what I want to focus on for the rest of the discussion about V4 and for the rest of this episode. V4 relies on some architectural innovations, which we don't need to get into now, but they're designed to really promote this efficiency and make the model work within its own huge context window, like we were talking about with GPT 5.5 and Opus 4.7. The architecture is designed so that V4 Pro uses only 27% of the compute and 10% of the memory compared to DeepSeek's last version of the model.
But importantly, that doesn't mean the models have taken a hit on performance on some of the benchmarks we've mentioned before on other episodes of the podcast, like SWE-Bench Verified. DeepSeek scores within just a few percentage points of Claude Opus 4.7. So, what's the takeaway there? It's that V4 really belongs in the conversation about frontier AI. But what's really turning heads with V4 and this discussion about efficiency are the pricing figures. Some analysts even had to do a bit of a double-take because they initially thought the numbers were typos. V4 Pro costs about $3.48 per million output tokens. And remember, tokens here are the little pieces of text making up a model's response. So, maybe it's part of a word or a whole word. That compares to $25 for Claude Opus 4.7 and $30 for GPT 5.5. So, that's a big cost difference. And then V4 Flash is even cheaper, at 28 cents per million output tokens. To put it all in perspective: these dollar figures are a key metric for calculating AI inference costs. And by that, we mean, you know, the cost of actually running the models. Generating text is more computationally demanding than just processing it as input, you know, when you type your query. As a result, output tokens are usually priced at a premium over input tokens, often three to ten times more. So, it's a really useful metric for understanding the costs of AI adoption. And as exciting as all that is, I will throw in some caveats. First, V4 Pro was released as a preview, not a final version. And even though its performance is impressive, it does trail the leading closed source models on certain benchmarks, like those we have mentioned earlier in the episode and on other podcasts. And, as Katherine had flagged back in January of last year, user data for DeepSeek models is stored in China when you use the API. And that hasn't changed.
But just to, you know, clearly draw the lines, that's not a concern for the folks who are using the open source models and running them on their own devices or their company's own servers. It's for the folks who are actually connecting to the API that DeepSeek hosts.
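For readers of the transcript, the pricing comparison above is easy to make concrete. The sketch below is a back-of-the-envelope calculation using only the per-million-output-token prices quoted in the episode; it ignores input-token pricing, caching, and volume discounts, so treat it as illustrative rather than a real cost model.

```python
# Per-million-output-token prices quoted in the episode (USD).
PRICE_PER_M_OUTPUT = {
    "DeepSeek V4 Pro": 3.48,
    "DeepSeek V4 Flash": 0.28,
    "Claude Opus 4.7": 25.00,
    "GPT 5.5": 30.00,
}

def output_cost(model: str, tokens: int) -> float:
    """Cost in USD of generating `tokens` output tokens on `model`."""
    return PRICE_PER_M_OUTPUT[model] * tokens / 1_000_000

# Hypothetical workload: 50 million output tokens per month.
monthly_tokens = 50_000_000
for model in PRICE_PER_M_OUTPUT:
    print(f"{model:>18}: ${output_cost(model, monthly_tokens):,.2f}/month")
```

At that hypothetical volume, the gap the episode describes shows up directly: a few hundred dollars a month on V4 Pro versus well over a thousand on the closed source models, before any input-token costs.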
So, take a step back from all of this, and what's the takeaway? Well, the big picture here, which is not going to come as a surprise to anyone who's listened to our coverage of other model releases in the past, is that the frontier keeps advancing, and we have this range of amazing models available. So, Claude Opus 4.7 is still the leader on some key complex coding benchmarks, while GPT 5.5 comes out on top on some other measures of agentic task execution. But all of these models also do incredibly well when it comes to long context performance. Again, that's ingesting a ton of information and being able to process it accurately in response to users' requests. They're also improving on hallucination rates. But DeepSeek V4 also represents a big advancement in open source models that can compete near the frontier, at a fraction of the cost. So, really, it's never a dull moment, but that's all the time we have for today. I'm Scott Caravello. Don't forget to like and subscribe.