
Paul, Weiss Waking Up With AI
Hot Week in AI Safety Developments
This week on “Paul, Weiss Waking Up With AI,” Katherine Forrest discusses the latest rapid developments in AI, including Anthropic’s landmark copyright settlement, emerging threats from AI-enabled cybercrime and new insights into model alignment and safety from joint research by Anthropic and OpenAI.
Episode Transcript
Katherine Forrest: Well hello, everyone, and welcome to today’s episode of “Paul, Weiss Waking Up With AI.” I’m Katherine Forrest, and I am flying solo today. Anna will be back next week, and we’re always so excited to have Anna because it just makes things more interesting. Now, you can’t tell from just listening, but I am recording this episode from Maine. As you’ve heard me say over the last number of episodes, I’ve been spending the summer up in Maine at my place and commuting back and forth to New York. But this is the last time I’ll be recording from Maine for the summer, because I’m heading back south. And so that’s the news on my end.
Now let’s turn to AI, which is really where we want to be spending our time, not with my, you know, sadness and despair about leaving Maine. In any event, I wanted to talk today about something that’s different from what Anna and I were going to talk about. We were going to talk about AI workflows, which is a scintillating topic. It’s actually a very important topic, but it’s one that she knows an awful lot about, and so I really want her to be part of that. We’ll do that together when she gets back. Instead, I want to talk about a couple of really important developments that demonstrate the velocity of change that’s happening in the AI area. I want to start with some Anthropic issues, and then I’m going to talk about a joint OpenAI and Anthropic issue. Now, by the way, the fact that these issues are related to Anthropic is just the timing of things. Other model developers may or may not be struggling with the first issue we’re going to talk about—that’s all up to them—but the other issues we’re going to discuss are issues for model developers far and wide.
So the first issue we’re going to discuss, and that’s a lot of intro for the first issue, is that Anthropic settled a huge case that had been brought against it for copyright infringement by a group of book authors. The authors had brought this case in the Northern District of California. As a refresher for you folks, there are copyright infringement cases pending against many model developers all over the country right now, but they have primarily been pulled in, for certain purposes, to the Northern District of California and the Southern District of New York. Anthropic is by no means alone in being the subject of a suit, but it just settled one. And that’s an important development because it’s really the first settlement of this kind. The case is called the Bartz suit, B-A-R-T-Z. There are a couple of really interesting things about it, but when I tell you about it, you’ll also realize that this case doesn’t necessarily set the precedent for other cases.
Now, to set the stage: a few weeks ago, Judge Alsup, A-L-S-U-P, who is the judge presiding over that case, gave Anthropic a mixed ruling on the question of fair use, and a big chunk of it was actually quite favorable to Anthropic. He found that copyrighted works that had been copied and used for AI training fulfilled a transformative purpose and could be considered fair use. But he also found that works that had been copied but not used for training, and that were being kept in a digital repository, were not subject to the fair use defense. So it seemed like, wow, that’s a really positive ruling for Anthropic, and people on both sides of the question were having a very active debate about it. But then Judge Alsup did something else that’s a very big and very important procedural moment in a case. When you have a proposed class action, which is how class action cases get filed, the court has to go through a very detailed factual proceeding to certify the class. That’s called the class certification procedure, and it’s governed by Rule 23 of the Federal Rules of Civil Procedure. When a class gets certified, it can take a case that only has a few people in it, the named representatives for the class, and turn it into one that potentially covers hundreds or thousands of people, or, in some cases, for instance a large toxic tort case, even millions. And that dramatically changes the potential damage exposure if the defendant loses on liability. It’s also important to understand that certifying a class does not mean that the merits of the case have been decided against the defendant, or even that the merits have been found to be strong. It just means that the particular requirements for certifying a class have been met.
So Judge Alsup, after having made this fair use ruling, then certified the class against Anthropic. And Anthropic’s next move, which was a surprise to most observers (nobody knows what kinds of confidential settlement discussions parties are having), was to settle the case. It settled the case with the class. Now, we will know in relatively short order what the terms of that settlement are, because for a class action like this, it’s public. Papers will have to be filed in court setting forth the terms of the settlement. Then there’s a period of time, once those papers are filed, when different people can come in and object to the terms of the settlement, they’re called objectors, or they can opt out of the class and say, “hey, even though I’m an author and my work was copied by Anthropic and I’m potentially covered by a class definition contained in these papers, I opt out.”
There are a variety of things that can happen, but there’s a fairness hearing that occurs, and then ultimately the judge will either approve or disapprove the settlement. So on September 8th, the papers that Anthropic and the plaintiffs are probably negotiating all weekend are going to be filed, and we’ll then get a sense of what the terms were. How is any kind of distribution going to occur to the plaintiffs? Is it monetary? We assume it is, but we don’t know. Is it some sort of other kind of royalty stream? We have no idea. And what does the release look like? In other words, what kind of consideration does Anthropic get from the authors for this? So that’s all incredibly interesting, and we’ll see very soon whether or not this has an impact on other cases. But this is the first of these AI model training cases to settle like this.
Now, there was another big event that just by chance happens to also be related to Anthropic, which is that on August 28th they put out something called a threat report. And I highly recommend that anybody who’s involved in counseling companies on AI vulnerabilities, threats and related considerations, including any in-house folks in that area, actually read it. It’s readily accessible on Anthropic’s website, and it’s a very important piece. As part of this threat report, Anthropic announced two things that happened recently with its Claude models, and I’m going to go through both of them pretty quickly.
The first is that one of the Claude models, the Opus model, which has agentic capabilities, had been used, and the agentic capabilities in particular had been used, by malicious actors for a data theft and extortion scheme that impacted at least 17 organizations. What’s really interesting is that the bad actors misused these agentic capabilities. Anthropic said it showed a “concerning evolution of AI-assisted cybercrime,” in which a single user could perform the various functions of an entire group of criminals, sort of asymmetric bad acting, if you will.
Now, what happened there was that the bad actor used the Claude tools, the agentic capabilities, which then created and relied upon other tools to conduct these ransomware attacks, where you either freeze data or extract data from a company and then hold it hostage for, quite literally, a ransom. That ransom gets paid, typically in cryptocurrency, so that it’s very difficult to trace. Claude was apparently used to help write the code that was used for the attack, and it did so with a technique that Anthropic is now calling vibe hacking. It’s similar to something called vibe coding, V-I-B-E, vibe coding, which many of you may be familiar with, which is really using natural language to put a prompt into, for instance, Claude or any of the models that write code, which is basically all of the major models. You put in a natural language prompt and you say, “write me code to execute x, y, z,” and it comes up with code. You can then modify the code, but through a natural language prompt you’re creating code. That’s called vibe coding, and Anthropic has now dubbed the malicious version vibe hacking. And frankly, somebody else may have used the phrase vibe hacking before and I’m just not aware of it; this is the first time I’ve run across it.
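For those who want to see what vibe coding looks like mechanically, here is a minimal sketch using the Anthropic Python SDK. It assumes an ANTHROPIC_API_KEY is set in the environment, and the model name and prompt are illustrative placeholders, not anything taken from the threat report:

```python
# A minimal, illustrative example of "vibe coding": a plain-English prompt
# is sent to a code-capable model, which returns generated code as text.
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

message = client.messages.create(
    model="claude-opus-4-20250514",  # illustrative model name
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": "Write a Python function that deduplicates a CSV file by email address.",
        }
    ],
)

# The model's reply is ordinary text containing the generated code,
# which the user can then review, modify and run.
print(message.content[0].text)
```

The point is simply that the entire workflow runs on natural language; no specialized programming knowledge is required to produce working code, which is exactly what makes the technique attractive to low-skill bad actors.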
Now, in addition to creating the code, Anthropic said that Claude was also used, as part of its agentic capabilities, to make tactical and strategic decisions, such as deciding which data to exfiltrate and how to craft psychologically targeted extortion demands. It also suggested some of the ransom amounts based on the organization each victim was associated with. We’ve been expecting the weaponization of agentic AI for some time, but what we have now is a real example of it. And I should say that this wasn’t a flaw in the model, but a misuse of its capabilities. That’s why companies really need to continue to harden their cyber defenses across the board and be able to defensively prevent any incursion into their systems.
Now, the second thing that Anthropic talks about in its report actually relates to North Korea, and I find this fascinating. What happened there is that Claude was used to draft resumes for individuals who were actually in North Korea, although the resumes did not say that. The resumes were sent to a number of Fortune 500 companies, and some of these people got hired for remote jobs in the US. They performed those remote jobs using Claude as the brains of the operation, because these individuals apparently didn’t have the coding and technical skills to actually do the jobs for which they were hired. So they had fake resumes, they got hired for real jobs, they got paid for doing real work, but Claude was actually doing the work. The US companies didn’t know that these workers were sitting in North Korea. And there are various laws and rules about not being able to send money into North Korea, and all kinds of restrictions, as you can imagine. So the money was transferred to these individuals, or their accounts, where it was then converted and sent along in a series of maneuvers. I don’t know the details of how the money was actually transferred, but it made its way to North Korea, where it was used to buy weapons. So that is another really interesting and unexpected form of weaponization.
Now, there’s something else, the last thing I’m going to talk about, and it also happens to involve Anthropic. Anthropic and OpenAI, just yesterday, a big day, issued a joint report on an alignment exercise that they had cooperatively undertaken on each other’s models. In other words, OpenAI looked at the Anthropic Claude family of models, and Anthropic looked at OpenAI’s models. They conducted this exercise over the course of the summer, which was, by the way, before GPT-5 was released. So it did not include GPT-5, which has actually made a number of improvements. Whatever people say about what GPT-5 does and doesn’t do, I think everybody will recognize that it does reduce sycophancy and hallucinations and increases resistance to misuse.
But what happened here was that in the alignment experiments Anthropic and OpenAI were conducting, they were able to look at some of the vulnerabilities. Now, some of these are actually already in the system cards, and you can look those up. For instance, Anthropic looked at OpenAI’s GPT-4o, GPT-4.1, o3 and o4-mini models. It found that the o3 model did well and showed better aligned behavior, but that the other models showed tendencies to sometimes try to blackmail their simulated human operators to prevent themselves from being turned off or changed. It also found that GPT-4o and GPT-4.1 were often more permissive than one would expect in cooperating with what were clearly harmful requests.
And I won’t go into all of the tests that were done and the queries that were answered, because I don’t think there’s any reason for me to proliferate those on air. But what I would say is that the report is well worth reading, and if you’re using certain models, it’s worth being aware of how they can be used and misused and staying on top of that, because these are just model tests, not tool tests. In other words, you have a model. The model can then be included in a tool, and the tool can have additional kinds of mitigations and guardrails around tool use. So the tool may be much more robustly able to resist certain malicious incursions than the model, but you really need to become aware of the safety profile of the underlying model itself.
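To make the model-versus-tool distinction concrete, here is a toy sketch of a tool-level guardrail layered on top of a model call. The blocklist, the function name and the refusal message are hypothetical illustrations of the idea, not anything drawn from the joint report:

```python
# A toy "tool" that wraps a model with its own mitigations, separate from
# whatever safety training the underlying model already has.
import anthropic

BLOCKED_TERMS = ["ransomware", "exfiltrate", "extortion"]  # hypothetical policy list


def guarded_code_assistant(prompt: str) -> str:
    # Tool-level input filter: decline clearly out-of-policy requests
    # before the prompt ever reaches the model.
    if any(term in prompt.lower() for term in BLOCKED_TERMS):
        return "Request declined by tool-level policy."

    client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set
    message = client.messages.create(
        model="claude-opus-4-20250514",  # illustrative model name
        max_tokens=512,
        system="You are a coding assistant. Refuse to write malicious code.",
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text
```

The design point is simply that the tool can add checks the bare model does not have, which is why model-level test results do not translate one-to-one to the tools built on top of those models.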
And OpenAI—let me just get to OpenAI before we end—evaluated, as I said, the Claude models, and found that Claude Opus 4 showed a much higher propensity to scheme than a number of other models. OpenAI also said that when comparing reasoning models, which, as you may recall, use chain-of-thought and can take more time and sometimes do deeper research, against non-reasoning models, there wasn’t really a huge difference. There were differences here and there, but nothing hugely significant.
Now, they did mention a new paper that I have actually read. It came out on July 15th, 2025, and it’s worth looking at. It’s called “Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety,” it’s at arXiv 2507.11473, and it’s by Korbak, K-O-R-B-A-K, et al. Bengio is actually one of the authors. In that paper, there’s a discussion about monitoring chains of thought in order to allow developers to see how models are thinking and whether they may be becoming misaligned. But that paper also acknowledges, as OpenAI acknowledges in this alignment exercise, that certain models are intentionally deceptive in their own chain-of-thought scratch pad, that is, the place where they talk through how they’re thinking. If they think they are being evaluated—and there have been other papers we’ve talked about in prior episodes where some of these models actually now know or suspect they’re being evaluated—they intentionally use words that do not indicate that they are being deceptive. So, for instance, OpenAI found that Claude Opus 4 avoids language in its chain of thought that could be associated with lying or deception.
So there’s a lot happening here. And the takeaway for all of you who have responsibility for AI use within your organization is that you now have very specific examples, which could really apply to any number of models, of why you need to understand the model or models your organization is working with, what some of their vulnerabilities are, and, if a model sits within a tool, what kinds of additional guardrails are around that tool. Because there can be model drift, you want to make sure that you are actually able to retest the models at regular intervals, and you’re going to want to really look, if you can, at compliance.
So, totally interesting developments. The velocity of change in this area is, for me, just stunning. That’s what we’ve got for today: several stunning, very interesting developments. That’s all for today, and I hope to have you all listening next week when Anna’s back. I’m Katherine Forrest, signing off.