The Swyx Mixtape | [Tech] AlphaFold

[Tech] AlphaFold - Demis Hassabis

October 10, 2022 / 14:29 / E439

The CEO and co-founder of DeepMind explains how they solved protein folding.

From Lex Fridman: https://lexfridman.com/demis-hassabis/
Transcripts from Karpathy's https://karpathy.ai/lexicap/0299-large.html

link |
00:37:13.160
So let's go to the basic building blocks of biology

link |
00:37:16.960
that I think is another angle at which you can start

link |
00:37:20.200
to understand the human mind, the human body,

link |
00:37:22.280
which is quite fascinating,

link |
00:37:23.400
which is from the basic building blocks,

link |
00:37:26.640
start to simulate, start to model

link |
00:37:28.960
how from those building blocks,

link |
00:37:30.480
you can construct bigger and bigger, more complex systems,

link |
00:37:33.080
maybe one day the entirety of the human biology.

link |
00:37:35.820
So here's another problem that was thought

link |
00:37:39.680
to be impossible to solve, which is protein folding.

link |
00:37:42.720
And AlphaFold, or specifically AlphaFold 2, did just that.

link |
00:37:48.840
It solved protein folding.

link |
00:37:50.320
I think it's one of the biggest breakthroughs,

link |
00:37:53.400
certainly in the history of structural biology,

link |
00:37:55.140
but in general in science,

link |
00:38:00.240
maybe from a high level, what is it and how does it work?

link |
00:38:04.840
And then we can ask some fascinating questions after.

link |
00:38:08.700
Sure.

link |
00:38:09.980
So maybe to explain it to people not familiar

link |
00:38:12.880
with protein folding is, you know,

link |
00:38:14.400
first of all, explain proteins, which is, you know,

link |
00:38:16.980
proteins are essential to all life.

link |
00:38:18.840
Every function in your body depends on proteins.

link |
00:38:21.520
Sometimes they're called the workhorses of biology.

link |
00:38:23.920
And if you look into them and I've, you know,

link |
00:38:25.340
obviously as part of AlphaFold,

link |
00:38:26.660
I've been researching proteins and structural biology

link |
00:38:30.200
for the last few years, you know,

link |
00:38:31.760
they're amazing little bio nano machines, proteins.

link |
00:38:34.760
They're incredible if you actually watch little videos

link |
00:38:36.460
of how they work, animations of how they work.

link |
00:38:39.000
And proteins are specified by their genetic sequence

link |
00:38:42.600
called the amino acid sequence.

link |
00:38:44.280
So you can think of it as their genetic makeup.

link |
00:38:47.040
And then in the body in nature,

link |
00:38:50.080
they fold up into a 3D structure.

link |
00:38:53.360
So you can think of it as a string of beads

link |
00:38:55.320
and then they fold up into a ball.

link |
00:38:57.160
Now, the key thing is you want to know

link |
00:38:59.100
what that 3D structure is because the structure,

link |
00:39:02.480
the 3D structure of a protein is what helps to determine

link |
00:39:06.120
what does it do, the function it does in your body.

link |
00:39:08.580
And also if you're interested in drugs or disease,

link |
00:39:12.320
you need to understand that 3D structure

link |
00:39:13.980
because if you want to target something

link |
00:39:15.840
with a drug compound to block something

link |
00:39:18.640
the protein's doing, you need to understand

link |
00:39:21.120
where it's gonna bind on the surface of the protein.

link |
00:39:23.440
So obviously in order to do that,

link |
00:39:24.940
you need to understand the 3D structure.

link |
00:39:26.720
So the structure is mapped to the function.

link |
00:39:28.640
The structure is mapped to the function

link |
00:39:29.880
and the structure is obviously somehow specified

link |
00:39:32.560
by the amino acid sequence.

link |
00:39:34.840
And that's the, in essence, the protein folding problem is,

link |
00:39:37.420
can you just from the amino acid sequence,

link |
00:39:39.620
the one dimensional string of letters,

link |
00:39:42.560
can you immediately computationally predict

link |
00:39:45.600
the 3D structure?

link |
00:39:47.120
And this has been a grand challenge in biology

link |
00:39:50.020
for over 50 years.

link |
00:39:51.500
So I think it was first articulated by Christian Anfinsen,

link |
00:39:54.360
a Nobel prize winner in 1972,

link |
00:39:57.040
as part of his Nobel prize winning lecture.

link |
00:39:59.240
And he just speculated this should be possible

link |
00:40:01.860
to go from the amino acid sequence to the 3D structure,

link |
00:40:04.960
but he didn't say how.

link |
00:40:06.060
So it's been described to me as equivalent

link |
00:40:09.440
to Fermat's last theorem, but for biology.

link |
00:40:12.320
You should, as somebody that very well might win

link |
00:40:15.120
the Nobel prize in the future.

link |
00:40:16.560
But outside of that, you should do more

link |
00:40:19.240
of that kind of thing.

link |
00:40:20.080
In the margin, just put random things

link |
00:40:22.160
that will take like 200 years to solve.

link |
00:40:24.440
Set people off for 200 years.

link |
00:40:26.000
It should be possible.

link |
00:40:27.720
And just don't give any details.

link |
00:40:29.040
Exactly.

link |
00:40:29.880
I think everyone exactly should be,

link |
00:40:31.500
I'll have to remember that for future.

link |
00:40:33.520
So yeah, so he set off, you know,

link |
00:40:34.800
with this one throwaway remark, just like Fermat,

link |
00:40:37.040
you know, he set off this whole 50 year field really

link |
00:40:42.640
of computational biology.

link |
00:40:44.400
And they had, you know, they got stuck.

link |
00:40:46.240
They hadn't really got very far with doing this.

link |
00:40:48.520
And until now, until AlphaFold came along,

link |
00:40:52.500
this is done experimentally, right?

link |
00:40:54.320
Very painstakingly.

link |
00:40:55.500
So the rule of thumb is, and you have to like

link |
00:40:57.440
crystallize the protein, which is really difficult.

link |
00:40:59.820
Some proteins can't be crystallized, like membrane proteins.

link |
00:41:03.060
And then you have to use very expensive electron microscopes

link |
00:41:05.940
or X-ray crystallography machines.

link |
00:41:08.200
Really painstaking work to get the 3D structure

link |
00:41:10.680
and visualize the 3D structure.

link |
00:41:12.400
So the rule of thumb in experimental biology

link |
00:41:14.840
is that it takes one PhD student,

link |
00:41:16.860
their entire PhD to do one protein.

link |
00:41:20.320
And with AlphaFold 2, we were able to predict

link |
00:41:23.440
the 3D structure in a matter of seconds.

link |
00:41:26.400
And so we were, you know, over Christmas,

link |
00:41:28.700
we did the whole human proteome

link |
00:41:30.240
or every protein in the human body or 20,000 proteins.

link |
00:41:33.280
So the human proteome's like the equivalent

link |
00:41:34.760
of the human genome, but on protein space.

link |
00:41:37.560
And sort of revolutionized really

link |
00:41:40.240
what a structural biologist can do.

link |
00:41:43.300
Because now they don't have to worry

link |
00:41:45.720
about these painstaking experimental,

link |
00:41:47.960
should they put all of that effort in or not?

link |
00:41:49.560
They can almost just look up the structure

link |
00:41:51.120
of their proteins like a Google search.

link |
00:41:53.280
And so there's a data set on which it's trained

link |
00:41:56.880
and how to map this amino acid sequence.

link |
00:41:58.800
First of all, it's incredible that a protein,

link |
00:42:00.760
this little chemical computer is able to do

link |
00:42:02.480
that computation itself in some kind of distributed way

link |
00:42:05.720
and do it very quickly.

link |
00:42:07.800
That's a weird thing.

link |
00:42:08.840
And they evolve that way because, you know,

link |
00:42:10.480
in the beginning, I mean, that's a great invention,

link |
00:42:13.200
just the protein itself.

link |
00:42:14.760
And then there's, I think, probably a history

link |
00:42:18.240
of like they evolved to have many of these proteins

link |
00:42:22.740
and those proteins figure out how to be computers themselves

link |
00:42:26.600
in such a way that you can create structures

link |
00:42:28.560
that can interact in complexes with each other

link |
00:42:30.540
in order to form high level functions.

link |
00:42:32.660
I mean, it's a weird system that they figured it out.

link |
00:42:35.520
Well, for sure.

link |
00:42:36.360
I mean, you know, maybe we should talk

link |
00:42:37.640
about the origins of life too,

link |
00:42:39.000
but proteins themselves, I think are magical

link |
00:42:41.180
and incredible, as I said, little bio nano machines.

link |
00:42:45.760
And actually Levinthal, who was another scientist,

link |
00:42:50.280
a contemporary of Anfinsen, he coined this Levinthal,

link |
00:42:55.120
what became known as Levinthal's paradox,

link |
00:42:56.820
which is exactly what you're saying.

link |
00:42:58.320
He calculated roughly an average protein,

link |
00:43:01.580
which is maybe 2000 amino acid bases long,

link |
00:43:05.080
can fold in maybe 10 to the power 300

link |
00:43:09.960
different conformations.

link |
00:43:11.480
So there's 10 to the power 300 different ways

link |
00:43:13.320
that protein could fold up.

link |
00:43:14.800
And yet somehow in nature, physics solves this,

link |
00:43:18.160
solves this in a matter of milliseconds.

link |
00:43:20.520
So proteins fold up in your body in, you know,

link |
00:43:23.080
sometimes in fractions of a second.

link |
00:43:25.600
So physics is somehow solving that search problem.
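
A rough back-of-the-envelope sketch of the search problem described above. The 10 to the power 300 figure is the one quoted in the conversation; the sampling rate of 10 to the power 13 conformations per second and the age of the universe (roughly 4.35e17 seconds) are assumptions added here purely for illustration. The arithmetic is done in log space to keep the numbers readable.

```python
import math


def log10_search_seconds(log10_conformations: float, log10_rate_per_second: float) -> float:
    """Return log10 of the seconds needed to enumerate every conformation once."""
    return log10_conformations - log10_rate_per_second


LOG10_CONFORMATIONS = 300  # figure quoted in the conversation
LOG10_RATE = 13            # assumption: 10^13 conformations tried per second
LOG10_AGE_OF_UNIVERSE_S = math.log10(4.35e17)  # assumption: ~13.8 billion years

log10_seconds = log10_search_seconds(LOG10_CONFORMATIONS, LOG10_RATE)
log10_universe_ages = log10_seconds - LOG10_AGE_OF_UNIVERSE_S

print(f"Exhaustive search: about 10^{log10_seconds:.0f} seconds")
print(f"That is about 10^{log10_universe_ages:.0f} times the age of the universe")
```

Under these assumptions a blind enumeration would take vastly longer than the universe has existed, while real proteins fold in milliseconds, which is the paradox.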

link |
00:43:29.080
And just to be clear, in many of these cases,

link |
00:43:31.200
maybe you can correct me if I'm wrong,

link |
00:43:33.040
there's often a unique way for that sequence to form itself.

link |
00:43:37.680
So among that huge number of possibilities,

link |
00:43:41.240
it figures out a way how to stably,

link |
00:43:45.320
in some cases there might be a misfunction, so on,

link |
00:43:47.800
which leads to a lot of the disorders and stuff like that.

link |
00:43:50.040
But most of the time it's a unique mapping

link |
00:43:52.720
and that unique mapping is not obvious.

link |
00:43:54.820
No, exactly.

link |
00:43:55.660
Which is what the problem is.

link |
00:43:57.120
Exactly, so there's a unique mapping usually in a healthy,

link |
00:44:00.720
if it's healthy, and as you say in disease,

link |
00:44:04.040
so for example, Alzheimer's,

link |
00:44:05.400
one conjecture is that it's because of misfolded protein,

link |
00:44:09.000
a protein that folds in the wrong way, amyloid beta protein.

link |
00:44:12.040
So, and then because it folds in the wrong way,

link |
00:44:14.560
it gets tangled up, right, in your neurons.

link |
00:44:17.600
So it's super important to understand

link |
00:44:20.560
both healthy functioning and also disease

link |
00:44:23.600
is to understand, you know, what these things are doing

link |
00:44:26.480
and how they're structuring.

link |
00:44:27.600
Of course, the next step is sometimes proteins change shape

link |
00:44:30.540
when they interact with something.

link |
00:44:32.160
So they're not just static necessarily in biology.

link |
00:44:37.200
Maybe you can give some interesting,

link |
00:44:39.780
so beautiful things to you about these early days

link |
00:44:43.260
of AlphaFold, of solving this problem,

link |
00:44:46.160
because unlike games, this is real physical systems

link |
00:44:51.280
that are less amenable to self play type of mechanisms.

link |
00:44:55.640
Sure.

link |
00:44:56.460
The size of the data set is smaller

link |
00:44:58.440
than you might otherwise like,

link |
00:44:59.760
so you have to be very clever about certain things.

link |
00:45:01.800
Is there something you could speak to

link |
00:45:04.800
what was very hard to solve

link |
00:45:06.680
and what are some beautiful aspects about the solution?

link |
00:45:09.920
Yeah, I would say AlphaFold is the most complex

link |
00:45:12.800
and also probably most meaningful system

link |
00:45:14.600
we've built so far.

link |
00:45:15.860
So it's been an amazing time actually in the last,

link |
00:45:18.400
you know, two, three years to see that come through

link |
00:45:20.520
because as we talked about earlier, you know,

link |
00:45:23.200
games is what we started on

link |
00:45:25.480
building things like AlphaGo and AlphaZero,

link |
00:45:27.900
but really the ultimate goal was to,

link |
00:45:30.400
not just to crack games,

link |
00:45:31.520
it was just to build,

link |
00:45:33.120
use them to bootstrap general learning systems

link |
00:45:35.320
we could then apply to real world challenges.

link |
00:45:37.440
Specifically, my passion is scientific challenges

link |
00:45:40.640
like protein folding.

link |
00:45:41.920
And then AlphaFold of course

link |
00:45:43.280
is our first big proof point of that.

link |
00:45:45.360
And so, you know, in terms of the data

link |
00:45:49.040
and the amount of innovations that had to go into it,

link |
00:45:50.920
we, you know, it was like

link |
00:45:52.280
more than 30 different component algorithms

link |
00:45:54.480
needed to be put together to crack the protein folding.

link |
00:45:57.960
I think some of the big innovations were that

link |
00:46:00.800
kind of building in some hard coded constraints

link |
00:46:04.220
around physics and evolutionary biology

link |
00:46:07.760
to constrain sort of things like the bond angles

link |
00:46:11.640
in the protein and things like that,

link |
00:46:15.400
a lot, but not to impact the learning system.

link |
00:46:18.040
So still allowing the system to be able to learn

link |
00:46:21.000
the physics itself from the examples that we had.

link |
00:46:25.540
And the examples, as you say,

link |
00:46:26.640
there are only about 150,000 proteins,

link |
00:46:28.840
even after 40 years of experimental biology,

link |
00:46:31.240
only around 150,000 proteins have been,

link |
00:46:33.880
the structures have been found out about.

link |
00:46:35.920
So that was our training set,

link |
00:46:37.120
which is much less than normally we would like to use,

link |
00:46:41.120
but using various tricks, things like self distillation.

link |
00:46:43.840
So actually using AlphaFold predictions,

link |
00:46:48.280
some of the best predictions

link |
00:46:49.480
that it was highly confident in,

link |
00:46:51.000
we put them back into the training set, right?

link |
00:46:53.320
To make the training set bigger,

link |
00:46:55.440
that was critical to AlphaFold working.
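
The self distillation idea described above can be sketched in a few lines. This is a minimal illustration, not AlphaFold's actual pipeline: the `Example` class, the `self_distill` function, the `toy_predict` stand-in model, and the 0.9 confidence threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Example:
    sequence: str               # amino acid sequence
    structure: str              # placeholder standing in for 3D coordinates
    is_predicted: bool = False  # True if the label came from the model itself


def self_distill(
    labeled: Iterable[Example],
    unlabeled_sequences: Iterable[str],
    predict: Callable[[str], Tuple[str, float]],
    threshold: float = 0.9,     # hypothetical confidence cutoff
) -> List[Example]:
    """Return the labeled set plus the model's own high-confidence predictions."""
    augmented = list(labeled)
    for sequence in unlabeled_sequences:
        structure, confidence = predict(sequence)
        if confidence >= threshold:
            augmented.append(Example(sequence, structure, is_predicted=True))
    return augmented


def toy_predict(sequence: str) -> Tuple[str, float]:
    """Hypothetical stand-in model: pretends to be confident on short sequences."""
    confidence = 0.95 if len(sequence) <= 5 else 0.5
    return f"fold({sequence})", confidence


labeled = [Example("MKT", "known-fold")]
unlabeled = ["GAVL", "MKTAYIAKQR"]
augmented = self_distill(labeled, unlabeled, toy_predict)
print(len(augmented), [e.sequence for e in augmented])
```

Only the prediction that clears the threshold is added back, so the training set grows without absorbing the model's low-confidence guesses.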

link |
00:46:58.400
So there was actually a huge number

link |
00:47:00.160
of different innovations like that,

link |
00:47:02.720
that were required to ultimately crack the problem.

link |
00:47:06.080
AlphaFold 1, what it produced was a distogram.

link |
00:47:09.720
So a kind of a matrix of the pairwise distances

link |
00:47:13.600
between all of the molecules in the protein.

link |
00:47:17.880
And then there had to be a separate optimization process

link |
00:47:20.440
to create the 3D structure.
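
To make the "matrix of pairwise distances" concrete, here is a small sketch that builds one from known 3D coordinates and then discretizes it into distance bins. The coordinates and the bin edges are made-up values for illustration, and the sketch runs in the opposite direction from the model, which had to predict such distances from the sequence alone.

```python
import bisect
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def pairwise_distances(coords: Sequence[Point]) -> List[List[float]]:
    """Return the symmetric matrix of Euclidean distances between all points."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def bin_distances(distances: List[List[float]], bin_edges: Sequence[float]) -> List[List[int]]:
    """Replace each distance with the index of the bin it falls into."""
    return [[bisect.bisect_right(bin_edges, d) for d in row] for row in distances]


# Assumption: three toy positions, one per residue, in arbitrary units.
coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
# Assumption: illustrative bin edges, not the ones any real model uses.
bin_edges = [2.0, 4.5, 8.0]

distances = pairwise_distances(coords)
bins = bin_distances(distances, bin_edges)

for row in distances:
    print(row)
for row in bins:
    print(row)
```

Going from a matrix like this back to 3D coordinates is the separate optimization step mentioned above.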

link |
00:47:23.640
And what we did for AlphaFold 2

link |
00:47:25.120
is make it truly end to end.

link |
00:47:26.920
So we went straight from the amino acid sequence of bases

link |
00:47:31.720
to the 3D structure directly

link |
00:47:33.860
without going through this intermediate step.

link |
00:47:36.080
And in machine learning, what we've always found is

link |
00:47:38.600
that the more end to end you can make it,

link |
00:47:40.920
the better the system.

link |
00:47:42.160
And it's probably because in the end,

link |
00:47:46.160
the system's better at learning what the constraints are

link |
00:47:48.560
than we are as the human designers of specifying it.

link |
00:47:51.920
So anytime you can let it flow end to end

link |
00:47:54.040
and actually just generate what it is

link |
00:47:55.400
you're really looking for, in this case, the 3D structure,

link |
00:47:58.440
you're better off than having this intermediate step,

link |
00:48:00.560
which you then have to handcraft the next step for.

link |
00:48:03.360
So it's better to let the gradients and the learning

link |
00:48:06.160
flow all the way through the system from the end point,

link |
00:48:09.000
the end output you want to the inputs.

link |
00:48:10.880
So that's a good way to start on a new problem.

link |
00:48:13.040
Handcraft a bunch of stuff,

link |
00:48:14.360
add a bunch of manual constraints

link |
00:48:16.640
with a small end to end learning piece

link |
00:48:18.640
or a small learning piece and grow that learning piece

link |
00:48:21.560
until it consumes the whole thing.

link |
00:48:22.840
That's right.

link |
00:48:23.680
And so you can also see,

link |
00:48:25.320
this is a bit of a method we've developed

link |
00:48:26.960
over doing many sort of successful alpha,

link |
00:48:29.640
we call them Alpha X projects, right?

link |
00:48:32.200
And the easiest way to see that is the evolution

link |
00:48:34.600
of AlphaGo to AlphaZero.

link |
00:48:36.720
So AlphaGo was a learning system,

link |
00:48:39.640
but it was specifically trained to only play Go, right?

link |
00:48:42.280
So, and what we wanted to do with the first version of AlphaGo

link |
00:48:45.360
is just get to world champion performance

link |
00:48:47.520
no matter how we did it, right?

link |
00:48:49.200
And then of course, AlphaGo Zero,

link |
00:48:51.400
we remove the need to use human games as a starting point,

link |
00:48:55.280
right?

link |
00:48:56.120
So it could just play against itself

link |
00:48:57.960
from random starting point from the beginning.

link |
00:49:00.280
So that removed the need for human knowledge about Go.

link |
00:49:03.720
And then finally AlphaZero then generalized it

link |
00:49:05.960
so that any things we had in there, the system,

link |
00:49:08.920
including things like symmetry of the Go board were removed.

link |
00:49:12.240
So AlphaZero could play from scratch

link |
00:49:14.600
any two player game and then MuZero,

link |
00:49:16.440
which is the final, our latest version

link |
00:49:18.360
of that set of things was then extending it

link |
00:49:20.680
so that you didn't even have to give it

link |
00:49:22.120
the rules of the game.

link |
00:49:23.200
It would learn that for itself.

link |
00:49:24.880
So it could also deal with computer games

link |
00:49:26.600
as well as board games.

link |
00:49:27.760
So that line of AlphaGo, AlphaGo Zero, AlphaZero,

link |
00:49:30.400
MuZero, that's the full trajectory

link |
00:49:33.480
of what you can take from imitation learning

link |
00:49:37.200
to full self supervised learning.

link |
00:49:40.440
Yeah, exactly.

link |
00:49:41.640
And learning the entire structure

link |
00:49:44.720
of the environment you're put in from scratch, right?

link |
00:49:47.640
And bootstrapping it through self play yourself.

link |
00:49:51.840
But the thing is it would have been impossible, I think,

link |
00:49:53.720
or very hard for us to build AlphaZero

link |
00:49:55.960
or MuZero first out of the box.

link |
00:49:58.600
Even psychologically, because you have to believe

link |
00:50:01.400
in yourself for a very long time.

link |
00:50:03.040
You're constantly dealing with doubt

link |
00:50:04.640
because a lot of people say that it's impossible.

link |
00:50:06.680
Exactly, so it's hard enough just to do Go.

link |
00:50:08.640
As you were saying, everyone thought that was impossible

link |
00:50:10.920
or at least a decade away from when we did it

link |
00:50:14.160
back in 2015, 2016.

link |
00:50:17.320
And so yes, it would have been psychologically

link |
00:50:20.960
probably very difficult as well as the fact

link |
00:50:22.960
that of course we learn a lot by building AlphaGo first.

link |
00:50:26.400
Right, so I think this is why I call AI

link |
00:50:28.520
an engineering science.

link |
00:50:29.880
It's one of the most fascinating science disciplines,

link |
00:50:32.280
but it's also an engineering science in the sense

link |
00:50:34.200
that unlike natural sciences, the phenomenon you're studying

link |
00:50:38.200
doesn't exist out in nature.

link |
00:50:39.440
You have to build it first.

link |
00:50:40.880
So you have to build the artifact first,

link |
00:50:42.480
and then you can study and pull it apart and how it works.
