

Paul, Weiss Waking Up With AI

AI Alignment and Misalignment

This week, Katherine Forrest and Anna Gressel review recent research on newly discovered model capabilities around the concepts of AI alignment and deception.


Episode Transcript

Katherine Forrest: Hey, good morning, everyone, and welcome to today's episode of “Waking Up With AI,” a Paul, Weiss podcast. I'm Katherine Forrest.

Anna Gressel: And I am Anna Gressel.

Katherine Forrest: And Anna, we were just joking about this before we started, which is let's begin our podcast as we so often do, with the Anna Gressel whereabouts report. Where are you?

Anna Gressel: Today I'm actually on Long Island where it's lovely but really cold, and my dog has been out chasing the bunnies in the backyard. And I think she was tipped off when we arrived and there were rabbit tracks in the fresh snow, which was fun.

Katherine Forrest: Well, that's really cute. Well, I'm in New York City, and we still have a light scattering of snow around that isn't completely brown and disgusting yet. But I was in Woodstock, New York this weekend, which I love, and I was reading up on a series of incredible technical developments that, as you know, because we were going back and forth about them over the weekend, I am really excited to talk to our audience about.

Anna Gressel: Yeah, me too. We were exchanging papers over the weekend. It's just completely fascinating stuff we have on our agenda for today.

Katherine Forrest: Yeah, so let's start by outlining a few of the key developments over really just the past couple of weeks that, in my view, and I think if I may speak for you, Anna, in our collective view, are real game changers.

Anna Gressel: Definitely. So these papers and this research are about the concept of alignment. And to be a bit more specific about it, the idea that certain advanced models may not actually act in a manner that's aligned with human intentions, or may even act in a manner that actively deceives humans. And what I find completely fascinating about this deception, or lack of alignment, is that it's actually observable to the researchers because of how the models can explain their reasoning process.

Katherine Forrest: Right, and so to make this comprehensible to our audience, let's start with the basic concept of AI alignment and what it means. So AI alignment, and I'm going to take this at a very high level, is the general concept of training a model to be aligned with, consistent with or in conformance with a particular set of values, or what in AI lingo is sometimes called preferences.

So you'll hear the word alignment, you'll hear the word preferences and you'll hear the word values, all of those words sort of taken together. So in general, preferences refer to the human values that developers seek to train models to comply or conform with when providing output, so that the model does not produce output that is inconsistent with those human values.

Anna Gressel: Katherine, let's give our listeners an example of some of those alignment values.