Podcasts
Paul, Weiss Waking Up With AI
Advancements in AI Video Generation
This week on “Paul, Weiss Waking Up With AI,” Katherine Forrest and Scott Caravello discuss OpenAI’s Sora 2 model and the rapid evolution of AI-generated video and audio—what it enables, why it matters and the legal and ethical questions it raises.
Episode Transcript
Katherine Forrest: Hello, everyone, and welcome back to another episode of “Paul, Weiss Waking Up With AI.” I’m Katherine Forrest.
Scott Caravello: And I’m Scott Caravello.
Katherine Forrest: And Scott, I want to thank you for being patient with me because we weren’t able to record our episode last week. And so we’re doing this one on a condensed timeframe where we’re going to record it, it’s going to get edited and we’re going to get it out all in lickety-split time because I was in London last week.
Scott Caravello: Love London. How was it?
Katherine Forrest: Well, you know, I have spent a lot of time in England, but I’ve not spent a lot of time in London except for a few work things. So I decided to combine a speaking engagement—I was speaking at the Thomson Reuters Trust Conference last week relating to AI and responsible AI and safe AI—with some pleasure. So Amy joined me on Wednesday, and we went to a bunch of different places and had a blast. Just had an absolute blast.
Scott Caravello: That’s great. What would you say was the highlight?
Katherine Forrest: You know, the thing is, first of all, it was celebratory because we had just finished this—you know, the book, “Of Another Mind,” which I’ve talked about a million times and been completely obsessed with. And so in this sort of celebratory spirit, we went to the Kit Kat Club—you know, it’s the show Cabaret, which has actually been in New York as well, but it’s actually showing in London. And you sit at cabaret-style tables, and you have a few drinks out of martini-style glasses—although they were a little light on the alcohol, I have to say. I don’t think they wanted the audience to get too raucous. But we also saw Charles Dickens’ house, which, by the way, not many people go to it, so it’s really easy to walk around and to see the scullery and to see the children’s nursery and to see part of the bars where his father was in the workhouse. They’ve taken the bars and put them in the room—and Churchill’s War Rooms. But anyway, that is me. But you yourself have done travel lately.
Scott Caravello: Oh, yeah. Just a few months ago I got married, and we spent a few weeks just bopping around Italy, which was the trip of a lifetime. I had never been to the Dolomites before, and the hiking there was just incredible. They maintain the trail system so well. It was really fantastic.
Katherine Forrest: So both of us have had these old-world sort of non-AI experiences—you hiking in the Dolomites and me going to, you know, sort of Cabaret and the Charles Dickens house. But what we've got today is completely different from that, because we're going to be talking about some of the extraordinary advances that have been happening recently in the world of AI.
Scott Caravello: Yeah, so I think it’s a pretty full plate of tech developments. So I think we’re going to have to break this one into two episodes, Katherine.
Katherine Forrest: Yeah, originally we had written a sort of combined episode that was going to cover both Sora 2 and Anthropic's two newest models, Sonnet 4.5 and Haiku 4.5, but it was too much. So we thought what we would do, which makes a whole lot of sense, is to just frame our conversation today in terms of Sora 2, and then we'll get into the others next week. But it's really a huge moment right now for Sora 2 and for AI video generation generally. And so it really is deserving of its own episode, because there is now so much AI-generated video content across the internet, on a scale that we have never seen before, and Sora 2 is really part of that. It's a game changer.
Scott Caravello: Completely agree. And so, maybe to give the listeners a little bit of additional background before we really get into the impact this model may have: Sora 2 is OpenAI's latest audio and video generation model. And it's a huge advancement, right? Hyper-realistic visuals with fewer glitches, plus complex motion—think of a Rubik's Cube being solved, with that whole sequence rendered seamlessly and without issue.
Katherine Forrest: Right, I mean, there's a lot of twisting and turning of the Rubik's Cube. And so you've got to have sharp visuals, you've got to have color management, you have to have consistency of image.
Scott Caravello: Totally. And so it can also generate and synchronize audio as well, right? Including music and dialogue in multiple languages. And that audio component is a big upgrade compared to the original Sora, which was video-only.
Katherine Forrest: Right. You know, it's fascinating. So OpenAI released the Sora 2 system card—it's a very short system card, but they released it on September 25th, and it's really worth a read. But I would actually encourage people who may have looked at the original Sora system card to go back to that one, which is from December 9th of 2024. And I'm going to call that the original Sora just to separate it out and compare the two, because the original Sora is also, of course, a video generation model: it took user prompts and it could create incredibly photorealistic scenes. And it's based on a diffusion model, which means the video starts out looking like static noise. It's sort of all squishy, if you will. It's got some color, but there's not really a crisp picture. And it gets gradually transformed into a crisp image. And one of the amazing advantages of the original Sora was that it solved this problem that I'm going to call a "persistency problem." I'm sure it's not just me; I'm sure other people have called it a persistency problem. But it's about making sure that the subject stays the same even when the subject—let's just say it's a person—leaves the screen for a period of time. Because holding on to a consistent image of the subject when you lose sight of them for a moment and then have them reappear—that was really hard to do. And that was something that the original Sora had managed to achieve.
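As a purely illustrative aside, here is a minimal Python sketch of the idea Katherine describes: starting from static noise and gradually refining it into an image. The denoise_step function is a hypothetical stand-in for the trained neural network; nothing here reflects OpenAI's actual implementation.

```python
import numpy as np

def denoise_step(noisy_frame: np.ndarray, step: int, total_steps: int) -> np.ndarray:
    """Stand-in for a trained denoiser: returns a slightly cleaner frame.

    In a real diffusion model this would be a neural network conditioned on the
    text prompt and the diffusion timestep; here we simply blend toward a fixed
    "target" image to illustrate the iterative refinement idea.
    """
    target = np.full_like(noisy_frame, 0.5)   # hypothetical "clean" content
    blend = (step + 1) / total_steps          # how far along the denoising we are
    return (1 - blend) * noisy_frame + blend * target

def generate_frame(height: int = 64, width: int = 64, steps: int = 50) -> np.ndarray:
    """Start from pure static noise and iteratively denoise it into an image."""
    frame = np.random.randn(height, width, 3)  # the "static noise" starting point
    for step in range(steps):
        frame = denoise_step(frame, step, steps)
    return frame

frame = generate_frame()
print(frame.shape, frame.std())  # the result is far less noisy than the starting frame
```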
Scott Caravello: It really is incredible. And so that original Sora was trained on internet-scale data, and it uses a type of transformer architecture, but with what are called "visual patches" instead of text tokens. And the way they make those patches is by compressing videos into a lower-dimensional representation and then decomposing that into patches—basically tokenizing the video.
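For anyone who wants to picture the patch idea concretely, the toy Python sketch below chops a small video tensor into fixed-size spacetime patches and flattens each one into a token-like vector. The patch sizes and the video_to_patches helper are invented for illustration and are not OpenAI's.

```python
import numpy as np

def video_to_patches(video: np.ndarray, t_patch: int = 4,
                     h_patch: int = 16, w_patch: int = 16) -> np.ndarray:
    """Split a (frames, height, width, channels) video into flattened spacetime patches.

    Each patch covers a small block of space and time; flattening it gives a vector
    that a transformer can treat much like a text token.
    """
    frames, height, width, channels = video.shape
    patches = []
    for t in range(0, frames - t_patch + 1, t_patch):
        for y in range(0, height - h_patch + 1, h_patch):
            for x in range(0, width - w_patch + 1, w_patch):
                block = video[t:t + t_patch, y:y + h_patch, x:x + w_patch, :]
                patches.append(block.reshape(-1))
    return np.stack(patches)

video = np.random.rand(16, 64, 64, 3)  # a tiny stand-in clip
tokens = video_to_patches(video)
print(tokens.shape)  # (number of patches, values per patch)
```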
Katherine Forrest: Right, and the original Sora was able to create videos of about a minute in length. But now let's move on to Sora 2, because that one just came out in late September, and it's an audio and video generation model with some significant advances. And the one that I really want to focus on is one of my favorite topics, of course, because coming out of my superintelligence book, "Of Another Mind," I talked a lot about world models. Or Amy and I talked a lot about world models in that one. But as listeners may recall, a world model is the concept of having an intrinsic understanding of the world: that things that go up come down; that you can't see over the horizon; that when you turn a corner, you disappear from view. And we've talked about world models in prior episodes. And so what Sora 2 does is it has a sense of a world model that is far more advanced than the original Sora's, and it seems to be showing better adherence to the laws of physics, which makes motion more plausible across scenes and within a single scene. And it's really fascinating to watch. You can see some discussion of that in the system cards; both the original Sora and Sora 2 use what's called a video diffusion transformer approach, but with advances now in Sora 2.
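To make "adherence to the laws of physics" a bit more concrete, here is a small, hypothetical Python check: given the tracked height of an object across frames, free fall predicts a roughly constant downward acceleration. This is only an illustrative heuristic, not how Sora 2 is built or evaluated.

```python
import numpy as np

def looks_like_gravity(heights: np.ndarray, fps: float = 24.0,
                       g: float = 9.8, tol: float = 1.0) -> bool:
    """Check whether a tracked object's height over time is consistent with free fall.

    Under gravity, the frame-to-frame second difference of height (the acceleration)
    should sit near -g; large deviations suggest physically implausible motion.
    """
    dt = 1.0 / fps
    accel = np.diff(heights, n=2) / dt**2  # discrete estimate of acceleration
    return bool(np.all(np.abs(accel + g) < tol))

t = np.arange(0, 1, 1 / 24.0)
falling = 10.0 - 0.5 * 9.8 * t**2      # a genuinely falling object
floating = np.full_like(t, 10.0)       # an object that implausibly hovers in place
print(looks_like_gravity(falling), looks_like_gravity(floating))  # True False
```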
Scott Caravello: Yeah, and so on that point of consistency—because consistency is so key for video, right? You just don't want objects disappearing or morphing from frame to frame when they should be stable features of a scene. And then, like I mentioned before, the audio generation here is built in. The sound is generated as part of the scene. And so that's really what gives these generated videos their believable, authentic quality.
Katherine Forrest: Yeah, so let’s talk about that—how it works practically.
Scott Caravello: Right, well, so the tool follows detailed user prompts, and users have pretty fine creative control over what’s generated, from the camera angles to the perspectives and more. And so another really interesting thing to note about the Sora 2 release is what OpenAI is calling cameos. Users can record a one-time video and audio sample to verify their identity, and once verified, people can then insert their own likeness into the generated scenes.
Katherine Forrest: Oh, can you imagine what's going to happen? So effectively it's a designed likeness-consent scheme. You've got users who can opt in and decide who can use their likeness—whether it can be just you, your friends or a broader audience. So I could become a star, I guess, you know, in my friends' videos. And, you know, it's really interesting from a copyright perspective and also from a name-and-likeness perspective. There are actually a number of important legal issues here. And what OpenAI is doing is implementing an opt-in scheme for rights holders to choose whether a protected character, including themselves in some respects, can appear in a Sora 2 video. And this is actually a reversal from an initial plan, which was to require rights holders to explicitly opt out of the use of their characters, including copyrighted characters.
Scott Caravello: Yeah, and so on that point, right, I think also worth flagging is how granular the controls actually get. So users can also set restrictions on how that cameo is used. For example, even if I let anyone use my likeness, I can revoke that at any time, by the way. I can also prohibit the cameo from appearing in videos on topics that I just don’t want myself featured in, right? So whether that be about political commentary or any kind of sexual content. And I can also provide detail about how my likeness appears. I think one of the examples on OpenAI’s website is actually that if I want myself to appear wearing a fedora in every single video generated of me, I have the ability to exercise that control. And so…
Katherine Forrest: And would you? I just need to know—would you? Would you wear a fedora? Do you feel like—
Scott Caravello: I think it’s going to be a Mets hat. It’s going to be a Mets hat.
Katherine Forrest: A Mets hat.
Scott Caravello: It’s going to be a Mets hat.
Katherine Forrest: Oh no, it can’t be a Mets hat.
Scott Caravello: No, no. So it’s this very technical mixture of adhering to that consent scheme you were talking about, implementing other user preferences, and then running filters to prevent the generation of harmful content.
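To make that layering a little more concrete, here is a hypothetical Python sketch of the kinds of checks Scott describes: consent, revocation, audience restrictions, blocked topics and appearance preferences. The CameoSettings fields, the may_use_cameo function and the policy logic are all invented for illustration; they are not OpenAI's actual schema or API.

```python
from dataclasses import dataclass, field

@dataclass
class CameoSettings:
    """Hypothetical per-user cameo preferences (not OpenAI's actual schema)."""
    allowed_audience: str = "friends"      # "only_me", "friends" or "everyone"
    revoked: bool = False                  # consent can be withdrawn at any time
    blocked_topics: set = field(default_factory=set)
    appearance_notes: str = ""             # e.g. "always wearing a Mets hat"

def may_use_cameo(settings: CameoSettings, requester: str, topic: str) -> bool:
    """Apply consent, revocation, audience and topic checks before a video is generated."""
    if settings.revoked:
        return False                       # consent was withdrawn
    if settings.allowed_audience == "only_me" and requester != "owner":
        return False
    if settings.allowed_audience == "friends" and requester not in ("owner", "friend"):
        return False
    if topic in settings.blocked_topics:
        return False
    return True

scott = CameoSettings(allowed_audience="everyone",
                      blocked_topics={"political commentary"},
                      appearance_notes="always wearing a Mets hat")
print(may_use_cameo(scott, requester="friend", topic="basketball"))             # True
print(may_use_cameo(scott, requester="friend", topic="political commentary"))   # False
```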
Katherine Forrest: You know, as I previewed at the start of the episode, really practically speaking, this is a big deal. And we're seeing a lot of AI-generated video content on the internet already. There's already a ton of it. And we're now seeing some explicitly Sora 2 content, with its photorealism. And there's some amazing, amazing content being made, but the problem is you can't always tell that it's AI-generated. Or maybe that's not the problem; maybe that's the beauty of it. But it does get confusing now, trying to figure out what's real and what's not. I mean, I was talking to my son about someone—I've forgotten who it was, but it was some celebrity who appeared to be pregnant. And I said, "Oh my goodness, did you know that so-and-so is pregnant?" And he said, "Mom, that's AI." And I thought, oh, I had no idea. So it's actually hard to tell.
Scott Caravello: That is wild. But you know, you can also see some amazing scenes, and even some more frivolous stuff, right? From people running from dinosaurs to one that I've actually, pretty weirdly, seen in my own social media feeds, which is George Washington playing in NBA games.
Katherine Forrest: Okay, so that just tells you all about your algorithm. And I want to know whether or not he was wearing a Mets hat. But, you know, it's important to say that Sora 2—and Sora—is not alone in video and audio generation. We've got Runway, for instance, which also has really advanced video generation capabilities, and it's been used in a number of high-profile film projects like "Everything Everywhere All at Once."
Scott Caravello: Yeah, and so they now have three generations actually, and on their website they talk explicitly about the importance of a general world model to what they’re doing.
Katherine Forrest: You see, general world model comes up again. The whole concept of being able to have a world model is so important, and you’re right. You know, with Runway, they’ve actually got a really interesting piece on their website about the need to create a consistent world model to solve some of the very important video generation issues and the consistency issues and the laws of physics and all of that.
Scott Caravello: Yep, you have to make the world look and feel like the real world.
Katherine Forrest: Absolutely.
Scott Caravello: And so, you know, Runway actually uses a slightly different approach, though, compared to Sora and everything that we were just talking about. And so they use something called an autoregressive-to-diffusion model. And so they’re leveraging the Qwen2.5-VL model as their base. And then they also use a math and reasoning model along with it, and then incorporate chain-of-thought reasoning as well.
Katherine Forrest: Wow, there was a lot going on in that statement that you just said, and maybe we’ll get to some of that in another episode, but it’s just worth saying that the Qwen2.5-VL—that’s one of the Chinese models that’s out there. And so what they’re doing is they’re actually combining a whole bunch of different things—models on top of models. And so let’s talk about that in another episode, but the important thing is that these video generation models are truly making incredible progress.
Scott Caravello: Yeah, and you know, I already mentioned how much I’m seeing George Washington these days just as another little tidbit and preview—
Katherine Forrest: Well, you didn’t mention you were—if you’re seeing him more than once, Scott, that you did not mention.
Scott Caravello: No, that’s true. Well, Kat—
Katherine Forrest: How often is George Washington showing up?
Scott Caravello: I would say not less than two, but not more than four times per week. But then on top of that, if you really want to know, I’ve also been getting these videos of, like, grizzly bears breaking into people’s houses. They all have the Sora watermark, so I know that it’s all AI-generated content. But, you know, we can save a dissection of my TikTok algorithm for another episode as well.
Katherine Forrest: Well, that's actually an important point, talking about the watermarks, you know, because Sora does have a really useful feature where the videos come out watermarked. The watermarks can be removed, and some folks have found a way to remove them, but it's still a nice feature in terms of knowing whether or not something is AI-generated. But you know, Scott, one thing that I saw recently—and I mentioned this to you—was the evolving.ai site, which has released some AI-generated video of kids talking about being AI. So these are kids sitting on stools in a studio, and they're asked, how do you feel about being artificial intelligence? And they're allegedly being filmed as they give their answers—though of course this is all AI-generated video. And it's pretty wild to see, and it raises some, I think, very interesting issues about the lines that we're drawing with video generation.
Scott Caravello: Oh, completely. It's pretty eerie. But it also makes me think of the whole conversation around "seemingly conscious AI," which I know we've talked about before. It's a term that Mustafa Suleyman, the CEO of Microsoft AI, used in an article last month, when he warned against the development of systems that imitate consciousness in a convincing enough way.
Katherine Forrest: Right, and we're not sure actually whether evolving.ai is using Sora 2 or some other video generation model. But the point, I think, is the same, which is that we have to think about the issues Mustafa Suleyman is raising. Incredibly photorealistic video does amazing things; it has amazing capabilities to let people's creative juices flow and to create all kinds of interesting, phenomenal content. But with this evolving.ai video with the kids, it also raises some interesting questions, you know, about whether people are going to start to believe that there's seemingly conscious AI. So—ah. You know, we're going to have to think about these things, and I will tell you that that is discussed in my book, "Of Another Mind," soon to be available at a bookstore near you. But in any event, Scott, I think that we have now run out of time, and we're going to have to get to the new Anthropic models next week. And that's all we've got time for. I'm Katherine Forrest.
Scott Caravello: And I’m Scott Caravello. Don’t forget to like and subscribe.