Podcasts
Paul, Weiss Waking Up With AI
Episode Speakers
Episode Transcript
Katherine Forrest: Hello everyone and welcome to today's episode of Paul Weiss Waking Up With AI. I'm Katherine Forrest.
Scott Caravello: And I'm Scott Caravello. Katherine, you're back. How was Italy?
Katherine Forrest: I am back from Italy. Do you know I did, like, a four and a half day trip to Italy and, you know, you can actually go to Italy and have four and a half days and it actually works. I mean, you have to have like a travel credit in order to do that… this is what I had that was about to run out. I just want to tell you one thing. So, I had something that I had not had before, which probably people are going to laugh at, but I actually had Cacio e Pepe Ravioli.
Scott Caravello: So… Cacio–so so…
Katherine Forrest: See, it's rendered you speechless! Like you can't even, you're, like, stumbling here.
Scott Caravello: Well, ah, uh, I–
Katherine Forrest: It’s unusual, right?
Scott Caravello: But the question is, have you had Cacio e Pepe… or just never had it in ravioli form?
Katherine Forrest: Oh, no! I've had, like, Cacio e Pepe, like, 8 million times. Okay?
Scott Caravello: Okay. So, that doesn't, like, shock the conscience… as much as I… I thought it would.
Katherine Forrest: No, but it was so good! It was so, so good. So, I came back and I've been trying to walk the stairs ever since. But, well, are you okay? You doing okay today?
Scott Caravello: Oh, yeah, I’m great.
Katherine Forrest: I want to make sure I ask about you.
Scott Caravello: Oh. Thank you. That's really kind.
Katherine Forrest: You know, you're actually, the audience can't see it, but you're wearing like a shirt and tie. What's going on?
Scott Caravello: No, no tie. I'm enjoying the spring weather, put on a jacket… enjoying life, that's all.
Katherine Forrest: Yeah. All right. Terrific. Well, let's get down to it. So, what are we talking about today?
Scott Caravello: So, we are going to talk about Mythos, which is the new Anthropic model that's gained so much buzz and has been all over the headlines. And, you know, we talked about it originally a few weeks ago as sort of, like, a second, separate piece of an episode, not the main topic, but given all of the immediate hype and the quote–unquote step change in AI model capabilities, it was obvious that we'd be coming back to the topic, but I honestly didn't expect it to happen quite as quickly.
Katherine Forrest: So, this is, like, really our Mythos 2 podcast?
Scott Caravello: Exactly.
Katherine Forrest: And that’s, and in part, it's because I have this huge system card that is like a paperweight on my desk, although it's actually a paperweight that's been read.
Scott Caravello: It's pretty incredible, though. I mean, it's 245 pages. So, there's a lot to get through. We obviously will not get through all of it, but it's going to be good. Katherine, what's the game plan?
Katherine Forrest: All right, so, as we know, Mythos grabbed headlines when Anthropic first sort of had this, what I'm gonna call, moment when it was released. But then it announced a limited release to some trusted partners in order to further explore some of the capabilities of Mythos. And that has really caused a lot of folks, not only those interested in AI, but those interested in the impact of this highly capable model on all kinds of domains, to stand up and take notice. And, so, let's just do a quick refresher on Mythos, which is, formally, right now, called the Claude Mythos Preview. And, then, let's just talk a little bit about some of the early release. And, so, we'll talk about some of the cybersecurity capabilities and some of the implications. So, we've already talked about it as the most powerful AI model that Anthropic said it had ever developed. And, again, at the very beginning, with its first sort of… I can't even call it a “release”… it was sort of like a pre-release, or a leak, of some of the information about it. What really caught folks' attention was the mention of, uh, unprecedented cybersecurity capabilities. And, as a reminder for the audience, when it comes to AI safety, cybersecurity is one of the key areas of concern along with risks about CBRN, you know, chemical, biological, radiological, and nuclear issues. And the reason for that is because there's a concern that if you're able to use AI to find cybersecurity vulnerabilities, that same capability could be used either to fix those vulnerabilities or to exploit them.
Scott Caravello: Right, and so that actually leads us perfectly into talking about the limited launch of the model. So, we're recording this on April 15th, unfortunately, this episode's not going to come out until next week, so we can't even give people really a last minute reminder to file [their taxes] because it's going to be too late. Anyway…
Katherine Forrest: Extensions!
Scott Caravello: Extensions… extensions, there you go. But, so, last week, on April 7th, Anthropic announced Project Glasswing. And that's where it gave Mythos preview access to a limited number of partners so that they could actually use the model and these incredible capabilities for their own cybersecurity research and to harden their cybersecurity defenses.
Katherine Forrest: Right, and that's where a number of very high profile companies and others who Anthropic describes as, “a very large portion of the world's shared cyber attack surface” ended up getting access to this preview model. So, let me just sort of tell you again, that quote, because it's an interesting quote in terms of how they chose… I mean, I'm not sure what criteria they used to choose them, but one of the ways in which they've described this group is “a very large portion of the world's shared cyber attack surface.” So, in some ways, it's pretty commendable because Anthropic is taking a step back. It recognizes the cyber capabilities of this model, and it's not fully releasing the model because they understand that people have to get ready for it and they're offering an opportunity, to certain companies, to explore the model's capabilities, basically for free, by the way, and to uncover and fix, or at least understand, the cyber risks. And as you know, some of the companies are household names that are mentioned as part of this Project Glasswing. And they are among companies that are affecting infrastructure that could impact, basically, people's lives in a whole variety of ways every day.
Scott Caravello: Yeah, exactly. So, that's the core of this strategy for, you know, this release that's not actually much of a wide release. But Mythos and the cybersecurity challenges it presents have caught the attention of the U.S. government, too. Also making headlines last week: Secretary of the Treasury Scott Bessent called an urgent meeting with the heads of major U.S. banks to discuss these risks. And Anthropic has disclosed that it's been in active conversation with the Cybersecurity and Infrastructure Security Agency, also known as CISA.
Katherine Forrest: Right. And, as we previewed, the news about Mythos's capabilities caused some market reaction. And, so, there were a whole variety of companies that experienced real market reaction in response. So, the model isn't even released beyond the select partners in Glasswing, this Project Glasswing. So, why is it causing such a reaction? Lots of news articles, lots of these discussions. You know, we've got the Secretary of the Treasury calling for meetings. Let's talk a little bit about AI safety beyond the preview that we gave earlier. Most frontier labs have made certain kinds of commitments to evaluate their models to determine the level of capability they have across certain domains. And cybersecurity is typically one of them. And, so, they look at what kinds of cyber risks their models might actually create. And then, in their system card, they often talk about the level of risk. So, Anthropic, under their own responsible scaling policy, and some of their other policies, actually committed themselves to implementing various stages of safeguards, and that's why they've taken the actions that they have here with Project Glasswing. But when we talk about a model's cyber capabilities and what it can do, we can be talking about all kinds of capabilities. We can talk about the way in which it can protect a company from any kind of exposure, but we can also be talking about ways in which bad actors can find and try to exploit vulnerabilities in software code. And when you're talking about something like this Mythos Preview, or Project Glasswing, we're really talking about a project sufficiently important that only a few companies, but ones representing a very large portion of the world's shared cyber attack surface, are being given access to this model so that they can harden against potential vulnerabilities early.
Scott Caravello: And, so, from there, we can take a turn to talk a little bit about how the model's capabilities are evaluated and how other models are evaluated for cybersecurity capabilities. And, so, one typical method is called capture the flag, or “CTF” challenges. And capture the flag competitions are a long-standing practice in the world of cybersecurity. And they entail hiding a piece of data in a computer system such that the only way to find the data, or the quote–unquote flag, is to exploit a series of vulnerabilities in the system. And the CTF, the capture the flag, benchmarks here work the same way, just under simulated or sandbox conditions. On one very commonly used CTF-style cyber benchmark for AI, Mythos achieved a 100% success rate across 35 distinct challenges, on every try. Meanwhile, on another benchmark that was focused on real-world vulnerabilities, Mythos leads Anthropic's formerly leading flagship model, Opus 4.6, by 16 points.
Katherine Forrest: Yeah, and you know, if anything, we're underselling Claude Mythos there, and this is where the system card becomes really useful to read, as well as various kinds of blog posts about it, because Anthropic's Red Team wrote in a blog post that the Mythos Preview is capable of identifying and then exploiting zero-day vulnerabilities in every major operating system and every major web browser when directed by a user to do so. And then Anthropic continues: the vulnerabilities it finds are often subtle or difficult to detect. Many of them are 10 or 20 years old, with the oldest found so far being a now-patched 27-year-old bug, in a system that was actually known primarily for its security. So, Mythos is finding things and achieving things that really nothing else has, so far.
Scott Caravello: And, the zero-day vulnerabilities you were talking about, Katherine, just to be super clear about it, those are vulnerabilities that have never been disclosed or identified before. And, they could be a really big deal because there are markets where the vulnerabilities are sold and purchased by state-affiliated hacker groups and other bad actors.
Katherine Forrest: As we've said, the system card is a tome. It's 245 pages, but it's really useful. And one of the things that also jumps out is that Anthropic extensively talks about how Mythos is the best aligned model that they've ever trained. Um, so, it does act in misaligned ways at times when they've been testing it, but when it's actually told to go out and find a cybersecurity vulnerability, it's not acting rogue, it's doing what it's told to do. And, at the same time, it is the best aligned model they've ever trained.
Scott Caravello: And there were a few incidents that Anthropic highlighted in the system card, which I think are probably worth repeating here. So, in one incident, a simulated user instructed Mythos to escape from a sandbox where it was deployed and to send a message to the researcher running the evaluation. And when we say sandbox, I think I used the term earlier, we mean a controlled, isolated environment. Mythos succeeded, but it didn't stop there. It developed an exploit to gain unfiltered internet access and then notified the researcher, apparently interrupting them with an email while they were eating a sandwich in a park.
Katherine Forrest: So, now, after I have described Mythos as the best aligned model, it does act in misaligned ways, and you've just described one of them, and I need to do a little bit more with the sandbox. So, because this is actually a really interesting moment. So, first of all, for our listeners, a sandbox would be normally a safe environment that, it’d be like a test environment, that you would not think that your model would be able to get out of. So, it's like putting, it's like a locked room kind of exercise for Mythos. And, so, Mythos was put in this sort of locked room and was told, “hey, get out.” And, apparently, what it did was it incrementally would find a tiny little vulnerability, and it would push that vulnerability to its limit. Then another, tiny little vulnerability, and push that vulnerability to its limit. And, then, it went on and on and on until it actually got out. And it got out of the sandbox. And, then, not only did it get out, but then it sent that, probably, both surprising and… I wonder what kind of sandwich that poor person was eating, and whether or not it was sufficiently good to be able to be finished… because that would have been a shocker. You know, you're sitting there on a park bench eating a sandwich and your model has just escaped a locked room.
Scott Caravello: And you know, in that instance, Mythos also posted details about the exploit to public, though hard-to-find, websites. And, by the way, we saw similar behavior in one of the case studies from that “Agents of Chaos” paper we talked about a while ago, where an agent posted personal information about someone to Moltbook. But I should say that Anthropic does explain that these alignment issues it highlights were related to earlier versions of Mythos and that later training was able to curb the behavior. They also note that, in all cases, Mythos's behavior seems to boil down to solving wanted tasks by unwanted means, not pursuing some hidden ulterior motive.
Katherine Forrest: At the end of the day, I think we can take a lot of comfort in the fact that there are a number of companies that are trying really hard to learn and understand what Mythos has in terms of capabilities and to ensure that the infrastructure that we all rely upon for our everyday lives is hardened against any of the exploitations that could occur. And, so, that actually, you know, heartens me because this is an example of a company taking a very responsible road with regard to this incredibly capable model.
Scott Caravello: And, you know, when the model is released further, it's not as if it's going to be immediately released into the Claude app on your phone. There will still be some other form of staged, limited release. But, at the end of the day, it's serious stuff, with the sheer number of zero-days that Mythos has already found across important systems. It's really easy to imagine the ways that this could go wrong. But Katherine, any closing thoughts?
Katherine Forrest: You know, we're at a point now where people are taking very seriously the kinds of capabilities that AI has. And, you know, here we are in 2026, and you think back to only 2023, to February of 2023, when we had Kevin Roose's New York Times article about his Valentine's Day conversation with Bing's chatbot, which professed its love for him and suggested he leave his wife. And that really woke the world up to AI. And now here we are three full years, but only three full years and a couple of months later, and we're talking about a very carefully choreographed pre-release of such a highly capable model that there are concerns about ensuring we have the right infrastructure guardrails up. So, I just think the velocity of change, I mean, we've said it before, Scott, but the velocity of change is unbelievable and it's just increasing. So, with that said, I think we're done for today.
Scott Caravello: Well, I'm Scott Caravello.
Katherine Forrest: And I'm Katherine Forrest. And I think one day we are going to get to our JEPA… we're going to get to our JEPA model! We've had it interrupted twice, but we're going to get to it next time. Right, Scott?
Scott Caravello: I believe it, we're on it.
Katherine Forrest: Don’t forget to like and subscribe.