
Paul, Weiss Waking Up With AI
Model Metrics: Benchmarking AI
In this episode of "Paul, Weiss Waking Up With AI," Katherine Forrest and Anna Gressel discuss AI benchmarking, exploring how these standardized tests evaluate AI models against each other and human capabilities, helping developers and deployers assess performance, safety and progress toward artificial general intelligence.
Episode Transcript
Katherine Forrest: Well hello, everyone, and welcome to another episode of “Paul, Weiss' Waking Up With AI.” I'm Katherine Forrest.
Anna Gressel: And I am Anna Gressel.
Katherine Forrest: And so, Anna, we're recording this on a Friday. But as you know, we are now back in the office four days a week, Monday through Thursday. And Friday's a remote day. So where are you?
Anna Gressel: I'm in the city, I'm just not in the office.
Katherine Forrest: You're still in one of your places, waiting for the fire problem to be fully remediated, et cetera, et cetera, right?
Anna Gressel: Yeah, we're waiting for like the 10th mold remediation report. So it's a joy.
Katherine Forrest: You poor thing. Anyway, I am in upstate New York, and I have this like really peaceful place. But what I was reminded of this morning is that I have a super race of frogs that actually live outside of my house up here.
Anna Gressel: Like a super race of them? What does that mean?
Katherine Forrest: Total super race. That means that they make more sound than any group of frogs should ever make. And I will tell you the story if you're game.
Anna Gressel: I would love nothing more on a rainy Friday afternoon when we're recording this podcast.
Katherine Forrest: To hear the story. And by the way, it is going to intersect with AI. That's going to be the extraordinary thing that's going to happen. I'm going to make my super race of frogs intersect with AI. So here it is. So the short version, and this is the short version, is that last summer there was a patch of land at the same place where I am now that got a puddle. And there was a rainstorm one night, and a puddle developed, and for whatever reason, some frog decided to lay its tadpole eggs in that puddle. Now this caused me a great deal of consternation because the puddle would each week get smaller and smaller and smaller, and I would then have to water the puddle because the tadpoles were in great danger of drying up. Now in the great circle of life, that may have been what they were supposed to have done, right? But I instead launched something called Operation Tadpole.
Anna Gressel: To the consternation of all the evolutionary biologists who would have just let them perish.
Katherine Forrest: They would have let them. That's what I'm saying, I created a super race. And so Operation Tadpole consisted of rounding up my wife and my daughter and stepdaughter. And we got buckets and we got shovels. And these are all grown people, by the way, who were just putting up with me, okay? These are not like small children who are having fun. And we dug up the damn tadpoles. Am I allowed to say that? I think I am. It's our podcast. I dug up the tadpoles, put them in the buckets, and we ran them across to a pond, whereupon they made a super race of frogs. And so that super race of frogs is what we now have that make noises for me at night.
Anna Gressel: Wait, how do you know that they're a super race? They like sing or, I don't know, what do frogs do?
Katherine Forrest: Extra loud, yeah, they like make extra loud croaking sounds, right? And so now what you may be tempted to do is to compare me to other quirky people that you know or extremely empathetic frog people or sympathetic frog people. I suppose I can’t be empathetic. I guess I only can be sympathetic. And that comparison, Anna, is now going to be my big segue into the topic of our podcast today, which is…
Anna Gressel: Benchmarking. Also, you could benchmark the tadpoles or the frogs against other frogs, right? By loudness of croaking.
Katherine Forrest: No, no, no, no, no. I don't want to benchmark my frogs against other frogs. I'm going to live in the belief that they're a super race. But when you compare one thing to another, you're making a comparison. And if one of those things is sort of a standardized metric, then you can call that your benchmark. And you compare something else to that standardized metric, and you are benchmarking.
Anna Gressel: Yeah, and we benchmark people, right? I mean, we've all taken them. I've taken a lot of standardized tests in my time, at various stages of my life. But things like the SAT, the LSAT, the bar exam, those are all kind of benchmarking tests, right?
Katherine Forrest: Right, and when you make comparisons between different people who take those different kinds of tests, you're typically benchmarking against a standard, sort of. They'll do a curve with a lot of weightings, and you're benchmarking against these kinds of national metrics.
Anna Gressel: So Katherine, let's talk a little bit about what that means in the AI space, right?
Katherine Forrest: I know, I'm segueing to AI right now.
Anna Gressel: Okay, all right, let's do it.
Katherine Forrest: I told you I would somehow manage to do this, and so here we are. We've gone from the super race of tadpoles. And as AI models get released over and over again, and companies create new models and different companies create different models, they get tested. And they get tested for really all kinds of things, by all kinds of organizations, academic scientists, individuals who do their own testing, and they get compared to all kinds of benchmarks.
Anna Gressel: Yeah, and benchmarks are super useful in the AI space. They can allow developers to compare their models. So for example, they could compare their models against models released by other developers, or they could even compare a new model against a model that they had previously released, or a model that was smaller or had different kinds of capabilities. So they allow for a lot of comparison, which is really important as capabilities evolve in this space.
Katherine Forrest: Right, and they can even get compared, and do get compared, frequently to human capabilities.
Anna Gressel: Yeah, I think we should unpack that, Katherine. Do you want to do that?
Katherine Forrest: Yeah, let me at least start, which is that AI models often get compared to human capabilities to see how they perform against some normative concept of how a human will accomplish a particular task, how accurate the human is, what the error rate is. And it's also a way, when these models are tested against or benchmarked against human capabilities, to help determine whether or not they are approaching what we call artificial general intelligence or AGI.
Anna Gressel: Yeah, and there are a lot of different definitions of AGI, but one way to think about it is the point at which an AI model reaches or exceeds human intelligence.
Katherine Forrest: Right. And so all of this testing, all of this benchmarking, is something that companies use. Model developers use it, deployers of models use it, and compliance departments use it to assess all kinds of different aspects of model performance.
Anna Gressel: Yeah, and let's just pick up on that last point. Benchmarking can be really helpful and useful to deployers of models, not just developers. Lawyers might really care, for example, about how a model's hallucination rate compares to another model's, or to the error rate of humans, or how one model compares in terms of potential bias. These are all really important metrics, and they can be benchmarked against each other as well. So a user within one company may be benchmarking one tool's capabilities against another's, and of course developers do that all the time, as we mentioned.
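As an illustration of the kind of comparison a deployer might run, here is a minimal sketch in Python: two tools answer the same prompts, human reviewers flag which outputs are hallucinated, and the flagged fraction becomes the metric that gets compared. The reviewer flags and tool names are invented placeholders, not results from any real evaluation.

```python
# Illustrative sketch only: a deployer comparing two tools' hallucination rates
# on the same set of human-reviewed outputs. The review labels are made up.

def hallucination_rate(labels: list[bool]) -> float:
    """labels[i] is True if a reviewer flagged output i as hallucinated."""
    return sum(labels) / len(labels)

# Hypothetical reviewer flags for 10 outputs from each tool on the same prompts.
tool_a_flags = [False, False, True, False, False, False, True, False, False, False]
tool_b_flags = [False, False, False, False, True, False, False, False, False, False]

print(f"Tool A hallucination rate: {hallucination_rate(tool_a_flags):.0%}")  # 20%
print(f"Tool B hallucination rate: {hallucination_rate(tool_b_flags):.0%}")  # 10%
```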
Katherine Forrest: Right, and another thing that gets benchmarked, which is super important, is safety. There are a whole series of safety metrics for different AI models, and you benchmark against those safety metrics. So you're able to see how a given model compares against the safety benchmark. And if certain issues arise, then there are mitigations that are typically undertaken.
Anna Gressel: Yeah, and as capabilities of models grow, so do the number of benchmarks in this area. So we've seen actually an explosion of safety-related benchmarks as developers try to test whether a particular model may create more risk or different kinds of risks or even new kinds of risks compared to an existing benchmark.
Katherine Forrest: Right, and there are lots and lots of these tests or benchmarks, and they all have these really sort of strange names that we'll talk about in just a few minutes.
Anna Gressel: Yeah, and what's interesting — and I mean, it may be a straightforward concept, but I think it's worth pausing on — is that these benchmarks themselves have to be developed, and then they actually have to be tested and accepted as reasonable benchmarks by a sufficient number of people so that they start to actually mean something, they start to actually tell us something. That's actually not so different, for what it's worth, from standardized testing, which really goes through the whole process of making sure that those tests are robust and that they actually are useful. And of course, there's some debate about that too.
Katherine Forrest: Right, and so let's talk about sort of the first moment when a model is developed, because that's when a developer might have expectations about the capabilities of that model. The developer has an architecture for the model, trains the model, has expectations about how the model is going to perform, and then the developer eventually has to determine how that model is in fact performing. So for instance, it might be expected to perform in a particular way or at a particular level with regard to image recognition or recognizing numbers, or actually doing different kinds of reasoning tasks, or things like a physics exam at the AP level or a chemistry exam at the AP level, or even taking the bar exam.
Anna Gressel: Totally. And those goals are often super, super important to developers that are trying to, again, create these capabilities and demonstrate that the capabilities are better than the prior version of the model. So they set those goals, they may try to find data that can help the model achieve the goals, and then the model is trained. And then it's like tested, tested, tested, tested repeatedly.
Katherine Forrest: Right. And there are, as a general matter, two different themes that you can break different kinds of benchmarks down into: a model's knowledge in a particular domain, and the model's ability to apply its knowledge in that domain. So let me give you some examples. For some knowledge benchmarks, there might be multiple choice tests, just like you see in high school. And popular examples of these knowledge benchmarks, and now we're going to get into those really strange names, are things like MMLU, the Massive Multitask Language Understanding benchmark, or GPQA, the Graduate-Level Google-Proof Q&A benchmark.
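To make the mechanics concrete, here is a minimal sketch of how a multiple-choice knowledge benchmark along the lines of MMLU or GPQA is typically scored: pose each question, collect the model's letter choice, and compute accuracy against the answer key. The ask_model function and the two sample questions are hypothetical placeholders, not items from any actual benchmark.

```python
# Minimal sketch of scoring an MMLU-style multiple-choice benchmark.
# `ask_model` is a hypothetical stand-in for whatever API or local model
# is being evaluated; the sample questions are invented for illustration.

questions = [
    {"prompt": "Which planet is closest to the sun? A) Venus B) Mercury C) Mars D) Earth",
     "answer": "B"},
    {"prompt": "What is the derivative of x^2? A) x B) 2x C) x^2 D) 2",
     "answer": "B"},
]

def ask_model(prompt: str) -> str:
    """Placeholder: return the model's single-letter choice for the prompt."""
    return "B"  # a real harness would call the model under test here

correct = sum(1 for q in questions if ask_model(q["prompt"]).strip().upper() == q["answer"])
accuracy = correct / len(questions)
print(f"Accuracy: {accuracy:.0%}")  # the headline number reported for the benchmark
```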
Anna Gressel: Yeah, and on the applying knowledge side, you have benchmarks like GSM8K, which tests language models on reasoning through grade school math word problems, or harder benchmarks like SWE-bench, which tasks models with realistic problems software engineers have to be able to solve. And some of these benchmarks actually are meant to be used specifically with language models as agents. So they have tools available, like calculators or code interpreters, to be able to actually solve problems themselves. So the agent is like doing the problem solving. And even more sophisticated benchmarks designed specifically for agents are now emerging as well. It's super interesting and exciting.
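On the applied-knowledge side, a GSM8K-style harness often lets the model reason in free text and then extracts its final number for an exact match against the reference answer. The sketch below assumes that convention; the model output shown is invented, and a real harness would call the model under evaluation and loop over the whole dataset.

```python
# Rough sketch of scoring a GSM8K-style math word problem: extract the last
# number in the model's free-text answer and exact-match it to the reference.
import re

def extract_final_number(text: str) -> str | None:
    """Return the last number that appears in the model's answer, if any."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None

# Hypothetical model output; a real harness would call the model under test.
model_output = "Each pack has 12 pencils and there are 3 packs, so 12 * 3 = 36."
reference_answer = "36"

is_correct = extract_final_number(model_output) == reference_answer
print("Correct:", is_correct)  # True; repeat over the dataset to get an accuracy score
```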
Katherine Forrest: Yeah, and we talked about safety a couple of minutes ago, and there are plenty of these safety benchmarks as well. They've got lots of names and lots of acronyms of their own, but we'll spare our audience those, except to say that they exist.
Anna Gressel: Yeah, and I think one thing that's really interesting that has happened over the past few years is that certain benchmarks that used to be used early on are no longer useful because they can be so easily exceeded. That means the models can kind of ace the test too easily. And if the model exceeds the benchmark, it's possible you can't really get the information you want from it. The model is just at or past the ceiling, 100% or whatever metric is being used to measure.
Katherine Forrest: Right, and there are two benchmarks that are worth calling out in particular that I thought we could pause on. One is called FrontierMath, and the other is called Humanity’s Last Exam. And the first, FrontierMath, consists of entirely new problems, appearing nowhere on the internet, that demand hours of work from expert mathematicians to solve. And so these are mathematicians who've received their PhDs and have been practicing in a very narrow domain for years, if not decades. And so no one, not even Fields medalists, and the Fields Medal is sort of the equivalent of the Nobel Prize for math, would be able to easily solve these problems. And so FrontierMath has become a test that these models will be put through, or at least some models will be put through, to see how they perform against these extraordinarily difficult math problems.
Anna Gressel: Yeah, and that's similar to something called Humanity’s Last Exam, which is really intended to represent the frontier of human academic knowledge across all domains. And, as the name implies, it's intended to be the last academic or knowledge benchmark for all language models.
Katherine Forrest: Right, and both of these really were developed in the last year. So that's what we're talking about in terms of a theme with these podcasts, which is sort of the increasing capabilities of these models and trying to figure out how to actually test them. Okay, I only want to mention just one more benchmark, which is called ARC-AGI, that's A-R-C-A-G-I. And this was a benchmark that had been around for years and was supposed to be solvable only when an AI system achieved AGI, and we talked about AGI just a little bit earlier, but it was actually solved. In 2024, OpenAI's o series of models finally solved that benchmark. So now there's a totally new ARC-AGI benchmark. So basically, we've got lots and lots and lots of testing that companies do, that compliance departments do, that academics do, that we're going to have to be watching because we've got sort of a moving target here. And so it's super interesting to watch, but that's all we've got time for today, Anna.
Anna Gressel: Do you want to sign off, Katherine?
Katherine Forrest: I guess I could sign off. I was going to find some segue to talk about my tadpoles again.
Anna Gressel: Maybe we should record the next podcast next to the tadpoles, so we can all hear their amazing croaking.
Katherine Forrest: All right, Katherine Forrest, signing off.
Anna Gressel: I'm Anna Gressel. Make sure to like and share the podcast.