
Paul, Weiss Waking Up With AI
Understanding Grokking
On this week’s episode of “Paul, Weiss Waking Up With AI,” Katherine Forrest explores “grokking,” an emergent learning phenomenon where AI models continue to learn and begin to generalize well after training appears complete.
Episode Transcript
Katherine Forrest: Hello, everyone, and welcome back to “Paul, Weiss Waking Up With AI.” I’m Katherine Forrest, and you’ve got me flying solo again. Anna will be back, and she’s going to, I’m sure, realize I’ve gone rogue and have done all kinds of episodes in her absence, and she’ll be like, “What? You did X, you did Y?” as I take advantage of this time. But she will be back. Now, we’re going to talk about something today called grokking, G-R-O-K-K-I-N-G. But before we get there, I want to set the scene for you today, because I know I spoke a number of times during the summer about being in Maine and all of the extraordinary coffees. And then I’ve also talked about my place up in Woodstock, or maybe I didn’t say it was in Woodstock. I said it was upstate New York, but actually it’s Woodstock. But it’s just extraordinary today, oh my goodness. It is so amazingly beautiful outside. It’s warm, the trees are starting to turn. It’s just incredible. So I’m up here again trying to push through the remainder of this book that I’m finishing up called “Of Another Mind.” And I’ve said that like a million times, and you’re probably all like, “oh, she’s never going to finish the book, she’s never going to finish the book.” But I am going to finish the book. Everybody’s going to be so excited about that when it actually does happen. But anyway, in the process of researching this book, I came across the concept of grokking, which is what we’re going to talk about today.
And it’s really something extraordinary; it’s unusual, and we’re going to be hearing a lot more about it because it’s becoming an active area of research, and it’s a learning process. So this book that I’m writing—actually I’m co-authoring it with my wife, Amy Zimmerman—it’s a book about, well, I’ve described it in the past; I won’t go back into it all right now, but part of it deals with, you know, the cognitive abilities of AI. And when you’re dealing with the cognitive abilities of AI, you’re dealing with how it learns, among other things, many other things. And so this learning process that I ran across, called grokking, is an emergent learning style. We’ve talked about things that are emergent before; that’s sort of unexpected behavior coming out of models. You know, where you develop the model, or somebody’s developing the model, and the model then starts to have this unexpected behavior. Chain of thought was actually an emergent behavior, and reasoning was an emergent behavior. There are a number of things that have been emergent behaviors, including some of the more unusual ones that we’ve talked about, but grokking is also an emergent behavior. So let’s talk about that. And I have to just mention that grokking is completely separate and apart from the xAI Grok model, G-R-O-K. I have no idea, and I was going to look it up before I got on air today, whether or not the Grok model’s name came from the same place that the word grokking came from. It might have; it would certainly stand to reason, but I don’t know for sure. But anyway, the grokking process that I’m going to talk about is a learning phenomenon; it’s not the Grok model and has nothing to do with the Grok model. Although presumably the Grok model, like every other highly capable model right now, experiences grokking.
So grokking itself, or the word grok, actually came from an old sci-fi book by Robert Heinlein, H-E-I-N-L-E-I-N. It was a 1961 sci-fi book called “Stranger in a Strange Land.” And it was about this guy whose last name was Smith. He’d been raised by Martians, and he returns to Earth and challenges all kinds of, you know, Earth-bound norms. And he introduces a Martian philosophy that emphasizes what’s described in the book as radical individual freedom, a unique form of empathy and something that, all put together, is called grokking. But apparently the word grok, G-R-O-K, was a Martian term meaning to understand something so thoroughly that you become one with it. Okay, so that’s the derivation of the word. Now the word grokking has more recently been applied in the AI world to describe, or become the word that describes, this emergent learning phenomenon. So you might be familiar with people using the word grok in a colloquial sense, like, “oh, I grok that.” Actually, it’s often the British or New Zealanders who will say, “I grokked that,” meaning “I got that.” And today, in the AI world or machine world, there’s this phenomenon. Now we’re getting into what grokking really is.
There’s this phenomenon, a training phenomenon, where researchers have realized that models continue to learn for quite some time after the humans think that the training is essentially done. So you do a training run, and the model has ingested data, and it’s whizzed and it’s whirred and it’s trained and it’s learned where certain pieces of data fit contextually within the neural network. It’s related the cat to the lion and the lion to the giraffe as part of the animal kingdom, and that’s pretty far away from, you know, the teacup and all of that. So there are contextual relationships that occur between words, between concepts. And after that initial process has happened, it turns out that the model is not done. The model is not done. If you keep the training process running, the model ends up continuing to learn and to change the connections between pieces of information in the neural network, even after the training metrics look like they’ve flat-lined, and even across many subsequent training runs. So let me put that a little bit differently. You all remember the concept of weights and parameters inside a neural network. You have information that’s tokenized or chunked up, and then it’s fed into the neural network. And there are lots of layers within the neural network. Sometimes how many layers a neural network has can be very, very secret, but you can have lots and lots and lots of layers. And there are all these contextual relationships between pieces of information, and those are captured by the weights and the parameters. So where a piece of information fits, its context and what relationship it has to other pieces of information, that’s what the weights encode. Now what’s happening with grokking is that, without the addition of any new information, the weights can keep reorganizing and keep refining those contextual relationships as training continues. In other words, the model continues to learn after people think it’s done.
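To make that concrete, here is a minimal sketch, assuming PyTorch, of what “the weights keep changing even after the training loss looks done” can mean in practice. Everything in it, the tiny model, the synthetic data and the hyperparameters, is invented purely for illustration; it is not code from the episode or from any particular lab.

```python
# Minimal illustration (assumptions throughout): a tiny network whose training
# loss flat-lines early, while the optimizer keeps nudging the weights.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny synthetic classification task the network can fit almost perfectly.
X = torch.randn(256, 16)
y = (X[:, 0] > 0).long()

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))
# Weight decay is one reason the parameters keep moving after the loss is tiny.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)
loss_fn = nn.CrossEntropyLoss()

def flat_params(m):
    # Concatenate all weights into one vector so we can measure how far they move.
    return torch.cat([p.detach().flatten() for p in m.parameters()])

prev = flat_params(model)
for step in range(1, 5001):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
    if step % 500 == 0:
        cur = flat_params(model)
        drift = (cur - prev).norm().item()  # how much the weights moved since the last check
        print(f"step {step:5d}  train loss {loss.item():.6f}  weight drift {drift:.3f}")
        prev = cur
```

The point to watch is the printout: the training loss flat-lines very early, yet because the optimizer keeps running, the weights typically keep drifting from checkpoint to checkpoint.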
So you might think at first, if you’re a model developer, that the model has taken in the information, it’s been trained, and you’ve perhaps even run a set of initial experiments to understand exactly how the model is going to perform and what it knows. And then, over a period of time, it can actually change, because it’s continuing to learn. That continued learning is this grokking phenomenon. Now, there are some complicated concepts that go into this continual learning phenomenon. One of them is a concept called overfitting. Overfitting is like what can happen in a regression analysis, for those of you who are familiar with the term in that context. The concept of overfitting—which, by the way, grokking is going to sort of blow out of the water, right?—is that when a model is first trained on data, it can end up with a very, very narrow view of that data. It just takes the data in, it understands exactly what that data says, and it can only answer a question directly related to that data. It’s not generalizing beyond that data. So that’s called overfitting. The model is stuck; it’s hugging the data really closely. And that’s sort of the opposite, if you will, of generalization. So here’s what turns out to happen. The model’s gone through training, you’re at an early stage, and you think it’s stuck in this overfitting situation where it only responds to questions that correspond exactly to the data and isn’t generalizing at all. And then, and I’m going to put this in colloquial terms, it’s not quite like this, but you walk away, you let the model continue to whiz and whir and keep training, and something really interesting happens. There’s this emergent phenomenon where the model continues to learn, continues to refine the connections between the data, and it begins to generalize. That’s what the grokking process is. And research on grokking now seems to suggest that true generalization, which is a model being able to extrapolate broad concepts, can occur from a relatively small data set. That’s contrary to a lot of what we normally think, which is that you want to shove a lot of data in there, and the more data the better, and the more compute the better. It turns out that compute, which here really means continued training time, is very important to grokking, but the data set can actually be smaller than one thinks. This concept of grokking also plays havoc with folks who are still, in my view, sort of in the dark ages thinking that models just memorize. The highly capable models we have today, the ones we’re all so familiar with, don’t just memorize; they generalize. If anything, memorization describes an early stage of training, and treating it as the whole story is a view that’s now years out of date. These models apply their information very, very broadly, which is really the opposite of memorizing. And grokking is exactly that move: when a model groks, it’s moving from memorization toward generalization. So it blows the idea that these models are just memorization machines out of the water, and that, I think, is really, really interesting.
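For those who want to see the phenomenon itself, here is a minimal sketch of a grokking-style experiment in the spirit of the modular-arithmetic setups used in the research literature, again assuming PyTorch. The architecture, the training fraction and the hyperparameters are all illustrative assumptions, and a real run may need tuning (and patience) before the delayed jump in test accuracy actually appears.

```python
# Grokking-style sketch (illustrative assumptions): learn (a + b) mod P from a
# fraction of all possible pairs, train for a long time with weight decay, and
# watch train accuracy saturate early while test accuracy may jump much later.
import torch
import torch.nn as nn

torch.manual_seed(0)
P = 97  # modulus: the task is to learn (a + b) mod P

# Build every possible (a, b) pair, then train on only a fraction of them.
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(len(pairs))
n_train = int(0.4 * len(pairs))          # deliberately small training fraction
train_idx, test_idx = perm[:n_train], perm[n_train:]

def encode(batch):
    # One-hot encode a and b and concatenate them into a single input vector.
    return torch.cat([nn.functional.one_hot(batch[:, 0], P),
                      nn.functional.one_hot(batch[:, 1], P)], dim=1).float()

X_train, y_train = encode(pairs[train_idx]), labels[train_idx]
X_test, y_test = encode(pairs[test_idx]), labels[test_idx]

model = nn.Sequential(nn.Linear(2 * P, 256), nn.ReLU(), nn.Linear(256, P))
# Weight decay plus long training are the ingredients usually associated with grokking.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

def accuracy(X, y):
    with torch.no_grad():
        return (model(X).argmax(dim=1) == y).float().mean().item()

for epoch in range(1, 20001):
    opt.zero_grad()
    loss = loss_fn(model(X_train), y_train)
    loss.backward()
    opt.step()
    if epoch % 1000 == 0:
        print(f"epoch {epoch:6d}  train acc {accuracy(X_train, y_train):.3f}  "
              f"test acc {accuracy(X_test, y_test):.3f}")
```

The signature to look for is train accuracy saturating near 100% early on while test accuracy stays low, and then, many epochs later, test accuracy climbing sharply: that late jump is the phase transition people call grokking.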
So let me talk about two papers from 2025 that you can look up, and I think we’re going to have some functionality where we’ll be able to attach things to the podcast somehow. How? I know not. But the first of the articles is called “Grokking Explained: Statistical Phenomena,” it’s dated February 3rd, 2025, and the first author is Carvalho, C-A-R-V-A-L-H-O, who is from IBM Research, actually in Brazil. There are also some other authors from the University of London. The arXiv reference is 2502.01774, and this is version one of the paper, so you’ll want to make sure you get the right version if there are any subsequent versions. The second paper is “GrokAlign,” all one word, G-R-O-K-A-L-I-G-N, and the full title is “GrokAlign: Geometric Characterization and Acceleration of Grokking,” which is a really complicated title. But anyway, that paper came out on the 31st of July, and it’s got the arXiv reference 2506.12284, version two. And by the way, for those of you who haven’t heard about arXiv from prior episodes, arXiv is run by Cornell University, and it’s a repository where you can access, for free, all kinds of academic papers: preprints, and actually even papers that have been published. So you’ll want to become familiar with that. This second paper, the GrokAlign paper, has a number of authors from Rice University, Google Research and the Department of Computer Science at Brown.
So take a look at those, because they’ll help you understand this concept of grokking. But one of the things I want to talk about is why it matters to people who might be involved in compliance, or potentially in the legal world, or CIOs who are just thinking, “well, why are we talking about grokking? It’s interesting.” The headline here is a really interesting one: training and test performance don’t necessarily align in the way that we thought. And that’s important if you’re testing a model for a particular set of outcomes and don’t realize that, as long as training or fine-tuning continues, the model may keep learning and its weights and its parameters may keep changing. So a test performance result at point A may not be the same test performance result at point B. That, I think, is really, really interesting. The other interesting thing is that there are different ways of causing grokking to occur, or assisting in its occurrence. People don’t really know exactly why it happens, but they think that continued training time, more compute, may push the model into what they call a phase transition, sort of a late phase transition where suddenly the model goes from being overfitted, as we’ve talked about, to generalizing. So if you’re a CIO, or if you’re involved in compliance, you want to think about whether the model you think you’re shipping or using is in fact the same model, the same version of the weights, that was tested. And that will depend on a number of things. The big takeaway is that you want to plan for the potential of a late learning phase.
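On the practical side, one simple discipline this suggests is re-running the same frozen evaluation set against each saved checkpoint from a training run, so you know exactly which version of the weights your test results describe. The sketch below assumes PyTorch; the file names, the checkpoint format and the architecture are hypothetical placeholders, not anyone’s actual tooling.

```python
# Minimal sketch (illustration only; file names, checkpoint format and
# architecture are hypothetical): re-run one frozen evaluation set against
# several checkpoints saved from the same training run.
import torch
import torch.nn as nn

def build_model():
    # Must match the architecture the checkpoints were saved from (assumed here).
    return nn.Sequential(nn.Linear(194, 256), nn.ReLU(), nn.Linear(256, 97))

def evaluate(model, X, y):
    # Accuracy of the model on a fixed, held-out evaluation set.
    with torch.no_grad():
        return (model(X).argmax(dim=1) == y).float().mean().item()

# The same evaluation set every time (hypothetical file saved as an (X, y) tuple).
X_eval, y_eval = torch.load("eval_set.pt")

for path in ["ckpt_epoch_05000.pt", "ckpt_epoch_10000.pt", "ckpt_epoch_20000.pt"]:
    model = build_model()
    model.load_state_dict(torch.load(path))  # hypothetical checkpoint files
    print(path, "eval accuracy:", round(evaluate(model, X_eval, y_eval), 3))
```

If the scores differ across checkpoints, that is the cue to tie every reported test result to a specific checkpoint rather than to “the model” in the abstract.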
So you may want to build that into your evaluation timing, you know, and your engineers will of course know this, they’ll be on to it, they’ll say, of course, of course, of course. But when you’re trying to hit a deadline, for instance, the concept of grokking can be very, very important because of this late training phase transition. So that is the importance of grokking. What I think is absolutely fascinating about it is, who knew that models continued to learn after we thought they were done, right? I find this endlessly fascinating. So that’s all we have time for today, folks. I hope you enjoyed this little foray into the world of grokking. We will talk to you again next week, and hopefully we’ll have Anna back, if not then, then very soon. And if you’ve liked this episode, please tell your friends about it and like and subscribe. Thanks.