
Paul, Weiss Waking Up With AI
New Developments in World Models
This week on “Paul, Weiss Waking Up With AI,” Katherine Forrest and Anna Gressel explore the rapid advancements and real-world applications of AI world models, highlighting cutting-edge developments like Google DeepMind’s Genie 3, Fei-Fei Li’s World Labs and Microsoft Muse.
Episode Transcript
Katherine Forrest: Hello, everyone, and welcome to today’s episode of “Paul, Weiss Waking Up With AI.” I’m Katherine Forrest.
Anna Gressel: And I’m Anna Gressel. And Katherine, we did an episode on world models a while ago. I’m actually spoiling what our topic is, but we did it a while ago, and we thought it would be time to update that because world models have been in the media. It’s time to take another look, I think, and see what’s been going on and share that with our audience. Although, of course, anyone can go back and listen to our old episode on this. But actually, you know, I skipped asking you how you are, and we just hopped right into this in the morning. Katherine, how are you this morning?
Katherine Forrest: I am doing just fine.
Anna Gressel: Okay, good.
Katherine Forrest: And I am still in Maine, and one day I’ll come back from Maine. Actually, I was back in New York yesterday, and then it was so hot.
Anna Gressel: I know, we saw each other IRL and it was so special. I feel like I haven’t seen you that much this summer. It was so fun.
Katherine Forrest: What’s really fun is that Anna and I were speaking together yesterday in front of a crowd. And what’s funny is it turns out that Anna and I cannot break from our podcast personas. We interrupt each other, we chit chat, you know. There we are at some place where we’re supposed to be totally serious, and it’s really hard.
Anna Gressel: Bring us in for a CLE, and we’ll just banter with each other.
Katherine Forrest: Right, exactly. But the world model topic is a great one to come back to, folks. And I think, Anna, I actually started this one because of Genie 3, Google’s recent release. When I saw it, I was like, “oh my goodness, we have to do an episode on this.” People who are not familiar with Genie 3 need to immediately look it up, because it’s pretty mind-blowing, and it also brings us back to the concept of world models in general and what’s been happening. And it makes sense that world models, given what they are, which is a model of the world, are becoming more and more sophisticated as we get more and more capable AI models, and that they’re in the news a lot. The conversation is really moving from the theoretical concept of world models and early world models to practical applications. With Genie 3, we’re seeing an explosion of hyper-realistic, interactive models that are being put out there for use. And they’re now being utilized by a number of different industries, not just gaming. Gaming is obviously a big use, but it’s by no means the only one. They’re being used in real industrial applications, and so we’re going to talk about some of those today: Genie 3, but also Fei-Fei Li’s World Labs, Microsoft’s Muse and a few other trailblazers that are really pushing the boundaries with these kinds of immersive experiences. So I think world models deserve this second episode, but go back and listen to the first if you really want to figure out what they are, because we do a lot more of the background in that one.
Anna Gressel: Yeah, I think let’s talk about who is doing what. So like, what is top of mind? Let’s talk about why investors are interested in this area. I think that’s a worthwhile topic to take on. And then, Katherine, do you want to start us with some background on world models? I mean, they have quite a history. I think it’s worth touching on briefly.
Katherine Forrest: They do, and not to rehash all that we did before, but it is worth talking about this 2018 paper, entitled, interestingly, “World Models,” by David Ha and Jürgen Schmidhuber. You can get it on arXiv, which I talk about a lot as a source for folks who really want to get into the academic papers in a variety of areas. It’s run by Cornell University, and it’s a terrific repository. But this 2018 paper really reignited interest in and popularized the whole concept of world models within the AI community. In that paper, the authors hypothesized that AI systems could learn compressed representations of the world from experience in simulated environments, and that this would help AI models understand how our actual world functions. So, for instance, things that go up come down, gravity is a thing, physics is a thing, when you drop an egg it breaks, things that are fragile break, and when you go upstairs, you’re going to have to come downstairs or somehow get back down to the original plane you left from. It’s really about figuring out how the world works, and it’s interesting because that’s not necessarily something humans think about, because as babies we learn all of that intuitively, intrinsically, just by existing in the world.
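For readers who want a more concrete picture of what that 2018 paper proposed, here is a rough sketch of its three-part architecture: a Vision model (V) that compresses each frame into a small latent vector, a Memory model (M) that predicts the next latent given the current one and an action, and a tiny Controller (C) that picks actions. The paper’s actual components are a VAE, an MDN-RNN and a small linear controller trained by evolution; the class names, sizes and simplifications below are our own illustration, not the authors’ code.

```python
# A rough sketch of the three-part architecture from Ha & Schmidhuber's
# "World Models" (2018): Vision (V) compresses frames into latents,
# Memory (M) predicts the next latent, Controller (C) picks actions.
# Names and dimensions are illustrative, not the authors' code.
import torch
import torch.nn as nn

LATENT, HIDDEN, ACTIONS = 32, 256, 3  # sizes are arbitrary for the sketch

class Vision(nn.Module):
    """V: encode a 64x64 RGB frame into a compact latent z."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.to_z = nn.LazyLinear(LATENT)  # lazy layer avoids shape math

    def forward(self, frame):
        return self.to_z(self.conv(frame))

class Memory(nn.Module):
    """M: an RNN that predicts the next latent from (z, action)."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(LATENT + ACTIONS, HIDDEN, batch_first=True)
        self.to_next_z = nn.Linear(HIDDEN, LATENT)

    def forward(self, z, action, state=None):
        h, state = self.rnn(torch.cat([z, action], dim=-1), state)
        return self.to_next_z(h), state

class Controller(nn.Module):
    """C: a deliberately tiny policy acting on [z, h]."""
    def __init__(self):
        super().__init__()
        self.policy = nn.Linear(LATENT + HIDDEN, ACTIONS)

    def forward(self, z, h):
        return torch.tanh(self.policy(torch.cat([z, h], dim=-1)))
```

The point of the structure is that once M can predict the next latent state, the agent can roll its own predictions forward, “dreaming” inside its model of the world and learning there, which is exactly the idea the hosts describe.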
Anna Gressel: Yeah, absolutely. I think we probably did an episode on video models over a year ago, maybe a year and a half ago. And I remember talking about when I first saw Sora, which is, you know, one of the big video models, and the uncanniness of watching the video rendering of a cat slip on a windowsill, just slip and then right itself.
Katherine Forrest: Wait, cats don’t slip, by the way. Okay, that’s like the whole thing with cats, is they don’t slip.
Anna Gressel: You can tell I don’t have a cat, I have a dog, but you know.
Katherine Forrest: I know what you mean. Sora had that. It had the whole rain thing going, and it had the whole concept of the world where the rain comes down and there’s sloshing and there’s puddles.
Anna Gressel: And you can think of that as being generated from a replication or an understanding of videos. So, you know, the training videos have that, and therefore models trained on those videos have that. But you can also think of it as an actual understanding of gravity and the fact that if you fall, you have to get up. And that’s the distinction, I think, we’re drawing between just looking like something in the world and reasoning or modeling based off of something in the world. That’s a pretty major step change when you think about the potential for these models, because you want them to have fidelity to the world. But right now, a lot of large language models don’t entirely have that, and there are efforts to ground them in reality, and a lot of techniques focused on that. World models take that on as a bigger project and really say, you know, “what if we had AI systems that could actually plan and reason and adapt to real-world environments?” And Katherine, as you mentioned, that has video game applications, like plenty of people want their video game worlds to look more like the real world, but it also has industrial applications and AI training applications. We talked about some of those on the last podcast.
Katherine Forrest: You know, one of the things I wanted to go back to is the fact that Yann LeCun has said, in a variety of fora where he speaks, that he sees world models as sort of a missing link in the ultimate advancement of AI towards AGI. But let’s talk about the big players in the realm of world models. And really, we’re building our way back up to this Genie 3, which I find to be so stupendous. We’ll start with Genie 3, which is made by Google DeepMind. It’s the third generation in a family of world models called Generative Interactive Environments, hence “Genie.” And what’s remarkable is Genie 3’s ability to generate entire video game environments, complete with interactive elements, just from a simple prompt or a few frames of gameplay. You’ve really got to go look at this to believe it, because you can put yourself effectively into a 3D game world.
Anna Gressel: Yeah, just as a side note, I think it’s often worth looking at the changes between model generations. We talked on one of our podcasts about the differences between AlphaFold 1, AlphaFold 2 and AlphaFold 3, and it’s just remarkable how much AlphaFold 3 can do that 1 couldn’t. That has real-world implications. And here, if you’re going to look at a demo of Genie 3, go back and look at the earlier generations too, and you can see how much better it is. It’s really pretty impressive. It’s a step change, and it can simulate not just visuals but also the whole underlying logic of the game world. And that means that people can play in these generative environments, and the world responds in a way that feels authentic and dynamic. And there’s something they call “promptable world events,” which is, I think, kind of fun, which is basically that you can change your environment on the fly. You can add a herd of deer to a ski slope, switch from day to night; you can command it to change. I mean, I’m sure there is a sci-fi movie or reference where this happens and I’m just forgetting it. But like, how cool is that?
Katherine Forrest: Right, no, it’s totally cool. And it’s a huge deal for game developers, model developers, creators of all kinds, because instead of having to painstakingly design every aspect of a particular environment, they can use Genie 3 through a text prompt to generate a really rich, complex world on the fly. What you’ve got, effectively, is an infinite canvas for creativity.
Anna Gressel: Yep, definitely. And one of the coolest, most fun demos I’ve seen involving Genie 3 involves generating a platformer game from a hand-drawn sketch. The model fills in all the details: the textures, the lighting, the physics of how the character moves and jumps. It’s really about making things feel real, which is pretty cool. And I’m not a video gamer, but I do remember, as a kid, trying to design board games, and it’s pretty cool to think that even a board game I could have designed as a little kid, with my hand-drawn whatever it was, Etch A Sketch or markers, could have turned into this whole environment. That must be just super awesome for kids to play with.
Katherine Forrest: Right, I mean can you imagine what this is going to do? You’re going to be able to be with Colonel Mustard in the library with the wrench.
Anna Gressel: Exactly.
Katherine Forrest: Right?
Anna Gressel: Or whatever version my sister and I would have dreamed up, that would have been like our version of that.
Katherine Forrest: Right, right, right. But the realism isn’t just visual. It’s really, again, about the world behaving in a way that makes sense to humans in our physical world. When you talk about it being really, really realistic, it’s called hyper-realistic. And when you’re creating a hyper-realistic world that you can immerse yourself in, having the world behave as we expect it to behave, unless you’re doing something sci-fi, is really important. And that brings us to Fei-Fei Li’s World Labs. That company has been making waves recently with its focus on translating two-dimensional still photographs into fully explorable 3D environments, which is crazy.
Anna Gressel: Yeah, and it highlights that this isn’t just about gaming. So their technology has applications in anything from virtual tourism to training simulations. And, you know, you can imagine taking a photo of a living room or really any environment and automatically turning it into an immersive, interactive experience. So it’s really interesting from a user-generated experience perspective: someone can take a piece of their real, physical world and turn it into a virtual world. That’s just fascinating to me.
Katherine Forrest: Right, and you know, Fei-Fei Li’s reputation in this area, and in the AI world in general, can’t be overstated. She is the person behind ImageNet, which was pivotal for deep learning. She’s sometimes called the grandmother of AI, even though she doesn’t strike me as grandmother age; people talk about her as having this incredible generational importance. And at World Labs, she’s bringing that same level of ambition to world models, really aiming to create digital spaces that are as rich and as nuanced as the real world. So I’m very excited to see what comes out of World Labs.
Anna Gressel: Yeah. You know, I want to make sure we mention Runway. I love them, in part because they’re a New York-based company and I’m a New Yorker. They’re downtown New York folks doing cutting-edge AI. That’s awesome, shout out. One of the things they do is build tools for video editing, really putting them in the hands of professionals, which is very, very interesting. They’ve been working on world models too, including generating hyper-realistic video sequences, and that is being adopted by game studios looking to create more lifelike cutscenes or interactive narratives. You can see that the applications are already being adopted and used in day-to-day life.
Katherine Forrest: Yeah, and here’s another one: Microsoft’s Muse, which is built on what Microsoft calls WHAM, a World and Human Action Model. Muse is also pushing the envelope in generating video game content, but it can be used to generate all kinds of visuals. For video game content, it can simulate player inputs, so it can really be a powerful tool for developing game ideas. And then there’s, of course, Meta, which has released the second generation of its V-JEPA model, V-JEPA 2. We talked about the JEPA models in a prior episode; the acronym stands for Video Joint Embedding Predictive Architecture. I’m really glad they gave us acronyms for some of this stuff, because saying “video joint embedding predictive architecture” is a lot harder than just saying V-JEPA. According to Meta, V-JEPA 2 was the first world model trained on video. And its objective really is to help machines understand 3D environments, object movements, et cetera, et cetera.
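To make that acronym a little less abstract, here is a very loose sketch of the joint-embedding-predictive idea as described in Meta’s public JEPA papers: rather than reconstructing pixels, the model predicts the embeddings of masked video patches from the visible ones, with targets coming from a separate encoder that receives no gradients. The function names, the predictor interface and the loss choice below are our own illustration, not Meta’s code.

```python
# A loose sketch of a JEPA-style objective: predict the *embeddings* of
# masked video patches from visible ones, instead of reconstructing
# pixels. Names and interfaces are illustrative, not Meta's actual code.
import torch
import torch.nn.functional as F

def jepa_loss(context_encoder, target_encoder, predictor, patches, mask):
    """patches: (batch, num_patches, dim); mask: bool tensor (num_patches,)."""
    # Encode only the visible patches with the trainable context encoder.
    context = context_encoder(patches[:, ~mask, :])
    # Targets come from a separate encoder (typically updated as an
    # exponential moving average of the context encoder); no gradients.
    with torch.no_grad():
        targets = target_encoder(patches)[:, mask, :]
    # Predict the masked patches' embeddings and regress onto the targets.
    predictions = predictor(context, mask)
    return F.smooth_l1_loss(predictions, targets)
```

The practical upshot, and one reason to distinguish world models from pixel-level video generators, is that the loss lives in representation space: the model is rewarded for predicting what happens next, not for reproducing every pixel of it.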
Anna Gressel: Yeah, and it’s really interesting. I mean, we’ve talked a lot about the developer side of this. I often think it’s important and interesting to look at the investor side, because what we see from investors is often not just a narrative about why this is helpful today, but why they see certain aspects of AI technologies as foundational to longer-term development, either as a development promise in itself or as part of the scaffolding of a larger ecosystem. So I would recommend folks look up the publications from Lightspeed and Sequoia on world models, which I think are particularly interesting. They really talk about the history and the investment thesis behind world models, and they see this as a foundational technology, not just for games but for the next generation of interactive media. And of course, we’ve talked about some of the industrial and scientific applications.
Katherine Forrest: And it’s not just about entertainment, because these models are being developed right now for a variety of uses where interaction and immersion in an environment can be very enriching: in education, in therapy, in the tourism you mentioned earlier. The ability to create realistic interactive simulations has implications that go far, far beyond gaming. So lots of industries are going to be able to experience real benefits from all of this.
Anna Gressel: I think it’s worth, on kind of a final note, saying that this isn’t without its own complexity. And some of it is similar to the complexity we see across the AI space generally: there are computational challenges. These are really massive models that require training on huge amounts of data and huge amounts of processing power. We’re talking about tons and tons of video data, among other things, used to train these models; it could be petabytes of video. And that, of course, is going to be energy intensive and computationally intensive. So all of the bottlenecks that exist generally in AI exist here too. And of course, that means people are going to be working on ways of alleviating those and creating more energy-efficient or computationally efficient training methods. But, you know, these are cross-industry challenges that we see in almost every model type as well.
Katherine Forrest: You know, and another one is, of course, the data requirements, because the data requirements for a world model are extraordinary, and they’re different from those of some of the LLMs. We’re familiar with multimodal LLMs, but with a world model you’re taking a lot of video to learn how the world works and then training the model in the physics of the world and the manner in which every physical action and interaction, every cause and effect, actually works. So you’ve got to have a complete data set. If you don’t, for instance, if a GPS map were missing a street or just went blank for part of the world, you’d have a real issue with a world model being able to figure out what was happening there. So the data requirements are very, very significant.
Anna Gressel: Well, it’s an exciting time. I mean, I think it’s certainly a space we’re going to be watching. And as with everything in AI, we’re seeing all these parallel threads of development: world models, multilingual models, very interesting new techniques. They’re all developing in parallel, and in some respects we keep seeing the advantages, the outcomes, really converging and benefiting each other. So I think it’s an exciting moment to look across all of the developments in the landscape and just recognize it’s still a field that’s moving really fast. And I know, Katherine, every time we talk and give a CLE, we talk about this pace of change, because it’s not slowing down. I don’t see it slowing down; I don’t know if you do. But it’s important for folks to keep abreast of all of these developments, because they’ll affect, you know, your own internal decision making and policy, et cetera.
Katherine Forrest: Yeah, no, it’s the velocity of change. It’s one of my refrains: the velocity of change is really extraordinary. And right now we have a situation where these world models we’ve talked about are no longer a research curiosity. They’re real, and they’re getting practical applications all the time. That’s going to lead to yet other developments. And that’s what Yann LeCun thinks we need to fill in the gap on the way to AGI. We’ll see. So that’s all we’ve got time for today. And Anna, we’re going to have to talk about coding next week, right?
Anna Gressel: Mm-hmm. Stay tuned. It’s going to be an exciting one.