Podcasts
Paul, Weiss Waking Up With AI
Gemini 3: Google’s Big Jump
In this episode, Katherine Forrest and Scott Caravello unpack Google’s Gemini 3.0—its leading reported score on key benchmarks like Humanity’s Last Exam, how its sparse mixture-of-experts architecture works, and the model’s rapid integration into Google products and platforms. Plus, a discussion of Google’s new image generation model, Nano Banana Pro.
Episode Speakers
Episode Transcript
Katherine Forrest: Hello everyone and welcome back to today's episode of Paul Weiss Waking Up with AI. I'm Katherine Forrest.
Scott Caravello: and I'm Scott Caravello.
Katherine Forrest: And so Scott, right after we recorded the episode where we'd mentioned the White House executive order that was going to impact AI state laws, there were developments like that day. And so I just wanted to alert our audience to it, which is it looks like, for a variety of reasons, the White House has put that initiative either completely on hold or at least pushed it way down on the sort of ladder of initiatives that they're going to spend time on. And there's been, you know, a lot of chatter about the White House not seeing it as urgent at the moment.
Scott Caravello: Yeah, so definitely going to be a wait and see. The key issue here, though, is that the EU AI Act does remain in place. So no matter what happens in the US with regard to our own state laws, many of the biggest companies have put governance in place that's oriented towards the EU. And that actually encompasses a lot of what the states were doing in their own ways. But, that said, the EU is undertaking its own simplification initiative that may streamline rules on AI, including by pushing out the effective date for compliance with obligations for high-risk AI systems. But I think we can cover that in detail at a later date once we see how that initiative progresses.
Katherine Forrest: I just want to know how you say all that without like being fully caffeinated in the morning. Okay? That's like a whole mouthful. And I'm like, wow, I actually just need to take another swig of this latte from La Colombe that I got downstairs before I can even like go on to our next topic. And he's like all over the EU AI Act simplification process.
Scott Caravello: Live and breathe the EU AI Act. Always ready to talk about it.
Katherine Forrest: You really do. You really do. Alright – so we'll put a proverbial pin in that one. And I want to talk about, and I know that you do too, because we've talked about it offline, the recent and a really interesting announcement that people should stop and pause on.
Scott Caravello: Seriously, seriously.
Katherine Forrest: Yeah, and that's the Google release of Gemini 3.0. And, you know, Google is calling it the most intelligent model yet, showing what they're referring to as state-of-the-art performance on reasoning, multimodal understanding, and agentic capabilities. Now, that's actually some pretty big words because state-of-the-art performance today is actually a highly capable state-of-the-art performance, right? So when you're saying that, you're not just saying sort of really good performance, you're saying extraordinarily capable performance. And so you're talking about capabilities that are at least at the range of the GPT-5, 5.1, the Claude 4.5, Sonnet models, and, you know, there are some others that we've talked about recently. So I thought one of the things that we could do in order to really get at what's happening with Gemini 3.0 is talk a little bit first about the Google line of models because frankly we hear a lot about them but there hasn't been a lot of individual or independent discussion of the Google models in the way that the OpenAI models are talked about or the Anthropic models or Meta models are talked about.
Scott Caravello: Right, and so the Google Gemini models have often been referred to as sort of a “second tier,” but to the extent that anyone bought that, this is a game changer, and Katherine, sort of just to add in there, I think the Wall Street Journal's coverage actually described it, which I thought was pretty funny, as “America's Next Top Model.”
Katherine Forrest: Right, there you go, there you go. And you know, it's funny because Gemini 2.5 has been talked about a lot recently and so I think people have been becoming increasingly aware of the power of these Gemini models. But let's just do a little quick review of the Google line of models building up to Gemini 3. And the first of the Google models evolved from what was the original Lambda model, which is “L-A” and then, capital, “M-D-A.”
Scott Caravello: Yeah, and so that was actually the model that Blake Lemoine controversially claimed was sentient at the level of a seven-year-old back in the spring and early summer of 2022.
Katherine Forrest: Right, and he had released a transcript that you might recall that I've actually talked about in a couple of speeches that I've given about his conversation with that LLM, and he eventually separated from Google. But that LLM, that Lambda, was released in 2021 actually and it was followed by two PaLM models spelled “P,” little, “a,” and then, capital, “L-M”— there was PaLM and PaLM II, and they actually had improved coding skills, and they came out in ’22 and ’23. So we have the Lambda, we have PaLM, we have PaLM II. There was sort of a steady cadence at that point of what Google was producing.
Scott Caravello: Yeah, and then what followed that was the Bard chatbot. And that came out when ChatGPT was taking the world by storm in the spring of 2023. And then the Gemini models took over.
Katherine Forrest: Right, the Gemini models in terms of the use of that name took over. And the first of what were called any of the Gemini models was in December of 2023. It was developed and then released by DeepMind, which is of course owned by Google. And unlike earlier models, the Gemini models were multimodal. They processed text, audio, you know, video, and other types of data files.
Scott Caravello: Yeah, and then in February 2024, Gemini 1.5 was released and it actually contained some pretty advanced capabilities like a 1 million token context window and a mixture of experts architecture. And then in May 2024, they released what they called the Flash version of that same model, which was optimized for speed and efficiency.
Katherine Forrest: Right, and then there was a little bit of time that went by, not that much time actually, it was just between the spring of ’24 and then January of ’25 when Gemini 2.0 Flash came out. It also had enhanced speed, but it included agentic capabilities. And then Gemini 2.0 Pro was introduced in February 2025, which had advanced reasoning and coding capabilities. And it started to really, I think, make an impression on people because it was scoring very well on complex benchmark testing.
Scott Caravello: So then following up on that, in March of 2025, they released the Gemini 2.5 experimental model, which was billed as a fast-thinking model for complex problems.
Katherine Forrest: Right. And so I actually sort of started to become aware of some of the high capabilities of the Gemini models with 2.5, you know, and then the 2.5 Pro. And it was actually in my book, Scott—you know I always want to talk about my book! In my book, the Gemini models are discussed.
Scott Caravello: Do we have a release date for that yet? Okay. Okay.
Katherine Forrest: We don't have a release date, but we do have cover art.
Scott Caravello: Okay. Okay.
Katherine Forrest: We have cover art for the book that has been chosen and acknowledgements have been written and dedications have been done and galleys are being worked on. And I would say that the thing that apparently bothers the publisher the most are contractions.
Scott Caravello: Oh, interesting.
Katherine Forrest: If I have anything with a contraction, they want to change the contraction. But the problem is that, for reasons that are known only to the secrets of the publisher, they change contractions in quotes, which I don't think is like a thing. Scott Caravello: Interesting. Well, you keep fighting the good fight.
Katherine Forrest: Right? I'm–I’m fighting the good fight. I'm fighting the good fight. Anyway, so let's go on. And, this leads us from the Gemini 2.5, which came out in March 2025—from that model leads us to the Gemini 3.0 series. And there's the 3.0 Pro and the 3.0 DeepThink models that were announced just on November 18th, 2025. So just not too long ago. And that really now is being talked about by many as the best of all of the Gemini models and as a real step change, both of those. And they're scoring better than most models that are out there right now, particularly when you use the web browsing functionality, because you can do it with and without web browsing functionality. And it's a truly highly capable model that people really need to stop and pause on. If you're not aware of this model, take a look at it, look at the model system card and study this, because this is really interesting stuff.
Scott Caravello: Completely, and there's been such a quick rollout, too. In Google's words, they're shipping Gemini 3 “at the scale of Google.” It was immediately released in the Gemini app, in Google Search and AI mode, and also to developers, including in Google's new agentic development platform called Anti-Gravity.
Katherine Forrest: Right, and so it's widespread already. And so let's talk a little bit about these advanced capabilities.
Scott Caravello: Sounds great.
Katherine Forrest: So the model is scoring, as I said, really very high across benchmarks, including image and video reasoning benchmarks and tests that look at factual accuracy of outputs. And they do that both on the—without web browsing and with web browsing capabilities. And the models have performed competitively on benchmarks looking at whether a model can operate a computer via terminal, which refers to interacting with the computer using text-based commands rather than through the user interface that we typically use. And the same is true for benchmarks that look at tool use and how a model is handling coding workflows.
Scott Caravello: Yeah, but you know, Katherine, I mean, to me, that's not even the most impressive piece of this.
Katherine Forrest: So you're going to talk about now Humanity's Last Exam.
Scott Caravello: You bet.
Katherine Forrest: Now have you—wait, but before we even start there, I want to know, have you yourself taken Humanity's Last Exam?
Scott Caravello: No, no, that's such a good idea though.
Katherine Forrest: Right? Do you think you could pass it? Like first of all, it's not even available to us, I think. But could you pass Humanity's Last Exam? Scott Caravello, are you ready to be declared like…
Scott Caravello: I'm–I’m pretty sure that I could not pass an expert-level question on physics and biology, but I'm happy to try.
Katherine Forrest: Alright, okay, alright. So let's just say, you know, Humanity's Last Exam. Go ahead and explain what it is for our audience.
Scott Caravello: Yeah, so kind of previewed there, right? It's a set of 2,500 expert-level questions across subjects like math, physics, biology, medicine, computer science, humanities and social sciences, chemistry, engineering, and more. But it explicitly targets reasoning, not just pattern recognition. So the questions can be multiple choice or short answer, and some require the models to have multimodal understanding, combining text with images. It's tough by design and it intentionally avoids being a test the model can memorize the answers to.
Katherine Forrest: Right, part of the issue when you look at some of the benchmarking studies that were done, part of the issue is that there was a concern that one of the reasons these highly capable models were scoring so high at the 87 to 100 percent percentile sort of level was that they had read every test and every answer and that they were able to find and ferret out, so to speak, all of the tests and all of the answers. And so it wasn't so much that they were reasoning through the answers, it's that they had had some form of memorization. And so what this Humanity’s Last Exam does is it sets this benchmark apart from others in terms of how it's been put together. And so all of the questions are sourced from subject matter experts. They're one of a kind. They're filtered first by models to ensure that questions that existing models can answer are actually excluded. And if a question survives, then it's vetted by humans over multiple rounds. And the goal is to push models past what would have been the low-hanging fruit of answers that they could have gotten relatively easily, and you don't get any partial points. So these models actually have to answer the remaining questions before it can actually score onto that benchmark… it’s a tough test!
Scott Caravello: Yeah, and the Gemini 3 was the highest score we've seen, right?
Katherine Forrest: It is the highest score. It's the leading score. But I would say one thing, which is Grok has claimed that it has scored over 50%, but there isn't literature right now that really looks at that and has verified that. But I want to put out there that they have actually said that. But in terms of the testing that has been replicated, it is the leading score and it's a strong signal that the model can really grapple with complex problems to get the right answers without relying on external tools or information sources, except of course that it's able to do some web browsing. So Gemini 3 came in at 37.5%. Now that sounds low, but you have to compare it to GPT 5.1, which is around 27%, and Claude Sonnet 4.5, which is a little below 14%. And it's actually hard to understand what these numbers mean because I'm not sure I would score above 5%. And I mean, I'm not like a dumb-dumb, but, you know—I taught a class in quantitative methods for 10 years and all of that—but here we have these numbers, which I think you really have to take as relative to themselves rather than sort of thinking of them as, my goodness, a human would score 100%. That's not true. We're talking about scoring that's relative to itself.
Scott Caravello: Yeah, so that score is quite significant. And I think maybe also worth flagging that Gemini 3.0's score rose to 45.8% when it had the help of search and code execution tools. But taking a step back, right, what's actually driving this? And so maybe, Katherine, we should go under the hood a bit and talk about the model architecture, since Google attributes the model’s enhanced capabilities to developments in the sparse mixture-of-experts architecture.
Katherine Forrest: You know, going under the hood, that always reminds me of those, like the podcasts that talk about like, we're gonna do a deep dive, we're gonna like go under the hood. I just want the audience to know that that just is a phrase that worked its way in here that has nothing to do with this being like a trumped-up, like, podcast. We do these podcasts, Scott, right? We do them with hard work.
Scott Caravello: Absolutely, absolutely.
Katherine Forrest: Okay, all right. So why'd you pick under the hood?
Scott Caravello: Uh, I don't know, just inspired.
Katherine Forrest: Inspired, all right. Okay, in any event, so, you know, let's talk about the sparse mixture-of-experts, or, you know, architecture. It's also called “M-o-E,” or “sparse MoE.” But, you know, we've talked about it in prior episodes, but it's worth a quick refresher because it's not necessarily intuitive. So you can basically imagine that one giant neural network is handling everything, and you've got a collection of specialized subnetworks. And these are the so-called experts. So you've got an expert in one thing and another expert who's a neighbor in another thing and another expert that's another thing. And for each piece of input, like a word or a token, there's a router, and there are a number of different ways these routers can work, that chooses where to route the particular query and where to route sometimes even the information. So certain parts of the expert, so to speak, are activated and the rest of the subnetworks will sit idle when the input then is processed. And the concept is that this is more efficient because when you only turn on a subset of the model at one time, you're not actually having to pay for the full compute of the entire model when you only actually need one piece of it. So if you're wondering why it's called a sparse MoE or “sparse mixture of experts,” that's just referring to the fact that most experts within the model are inactive for most of the inputs.
Scott Caravello: Yeah, and in case it's helpful, maybe we can give a bit of an analogy. So you could think of it like a hospital. You wouldn't have every doctor in the building consulting on every case. You let the triage nurse route each patient to the right specialists—a cardiologist for heart issues, a neurologist for brain concerns, and so on. And so the mixture-of-experts approach does that routing for tokens in real time, layer by layer in the network. And so the benefit is that the model can scale up its total knowledge, but for each input it only wakes up the parts of the model that matter. Katherine Forrest: Yeah, actually I think that's a really great analogy.
Scott Caravello: Every now and then! But beyond, but beyond Humanity's Last Exam, when we talk about agentic abilities, there's also another evaluation worth bringing up; which is called “Vending Bench.” And Gemini 3.0 scored high on that one, too. It's a simulation that evaluates the model's ability to act over a longer time horizon as it operates a vending machine and takes on tasks like inventory, order management, and pricing decisions.
Katherine Forrest: Right, it's pretty cool, and there's a safety aspect worth exploring here, too. And, Google has stated that Gemini 3 is its most secure model to date.
Scott Caravello: Yeah, they've stated that the model shows reduced sycophancy, increased resistance to prompt injections—and just as a reminder, that can be where a prompt is used to get the model to do something it's not supposed to do, like ignoring its safety guidelines. And then they also mentioned that it has improved protection against misuse via cyberattacks.
Katherine Forrest: Yeah, you know, it's a pretty brief model card when you pull it, but it seems to have measured up really well under Google's frontier safety framework, and it didn't reach any of the critical capability levels for various frontier safety risks like assisting with development of chemical, biological, radiological, and nuclear threats.
Scott Caravello: Right, they acknowledged that the model can provide accurate and occasionally actionable—action, sorry, sorry, start over on that. Right, they acknowledged that the model can provide accurate and occasionally actionable information on those topics, but that the model generally fails to provide novel or sufficiently complete and detailed instructions that would enhance the abilities of what Google refers to as, “low- to medium-resourced threat actors.”
Katherine Forrest: Right, and then there's harmful manipulation, which refers to the risk that models with significant manipulative capabilities will be misused in a way that could result in large-scale harm. And Google flagged that there was no real improvement or what's sometimes called uplift in that area versus its prior releases. But turning away from discussing the abilities of the model itself, this has been a really big moment for the industry and it's really shifting the narrative about the AI race in the United States. For a while now, you know, the common refrain—do that one again. Yeah, because for a while now, the common refrain was that Google had fallen behind. Although, as I said, I thought that 2.5 Pro was a great—was and is a great—model. But ever since the early ChatGPT moment in late 2022, investors and users have been asking about Google and when is Google going to catch up?
Scott Caravello: Right, and it seems like their long game is paying off.
Katherine Forrest: Yeah, definitely. And with this 3.0 Gemini release, the conversation has gone from “Are they behind?” to “Are they ahead?” almost overnight. And I think that, you know, there are two big points that we should flag here aside from the model's significant capabilities. And the first is that, anecdotally at least, the model's value comes not just from those high scores on benchmarks, but also how it's presenting information. In a journal article, a Google vice president recounted how he had used the new model to help explain to his seven-year-old daughter how, you know, “lift” makes planes fly. And he expected a text response, but instead he received an interactive simulation showing currents moving over a wing with a slider allowing him to move the wing, change the currents, and lift the plane into the air. And that's pretty cool.
Scott Caravello: Yeah, and that helpfulness aspect is so key because it's not just the fact that the model does so well on paper, but that the experience feels smarter and better for users. And that other big point to raise stems from what we mentioned earlier about the model's quick integration into Google's AI Search mode because it increases distribution and really shows Google's own confidence in the model.
Katherine Forrest: Right, and it's the first time that they've used a new model for Search at the time it's released. But this wasn't the only Google release. Google also released a new version of their image generator model called the—okay, I'm not responsible for this name—the “NanoBanana,” okay, which runs on Gemini 3, and they released that just a couple days later.
Scott Caravello: That is quite a name.
Katherine Forrest: Right, NanoBanana–NanoBanana–NanoBanana. You have to say it three times fast.
Scott Caravello: No thanks, that–that one's all you.
Katherine Forrest: All right, all right, jokes aside, you know, that's actually supposed to be a big upgrade, too, and one of the really interesting aspects of the Nano Banana Pro model is that Google claims it's the best model for creating images that have correct and legible text directly in the image, which is something that models have struggled with.
Scott Caravello: Oh, totally. I mean, have you ever used a chatbot to create a map? There's a lot of bizarre and incorrect place-name labels.
Katherine Forrest: You know, I haven't done that, but now I'm gonna try.
Scott Caravello: It's fun.
Katherine Forrest: But what I'm really gonna try is I'm gonna try that airplane thing.
Scott Caravello: Totally.
Katherine Forrest: You know, I'm gonna get off this podcast and I'm gonna go straight to see how I do this lift, and I want to interact with this little airplane. So it's been a, it’s been a big week for Google, but it's also been a big week in terms of doing a step change and picking up the pace now for what other model developers are gonna need to show they can do as well.
Scott Caravello: Seriously, but I think that's all the time we have, Katherine.
Katherine Forrest: It is, even though I could talk about the NanoBanana for some time more. We will stop right now and thank everyone for listening. I'm Katherine Forrest. We'll see you folks next week.
Scott Caravello: And I'm Scott Caravello. Don't forget to like and subscribe.