[Tech] AlphaFold - Demis Hassabis

The CEO and co-founder of DeepMind explains how they solved protein folding.
From Lex Fridman: https://lexfridman.com/demis-hassabis/
Transcripts from Karpathy's https://karpathy.ai/lexicap/0299-large.html

link |
00:37:13.160 
So let's go to the basic building blocks of biology
link |
00:37:16.960 
that I think is another angle at which you can start
link |
00:37:20.200 
to understand the human mind, the human body,
link |
00:37:22.280 
which is quite fascinating,
link |
00:37:23.400 
which is from the basic building blocks,
link |
00:37:26.640 
start to simulate, start to model
link |
00:37:28.960 
how from those building blocks,
link |
00:37:30.480 
you can construct bigger and bigger, more complex systems,
link |
00:37:33.080 
maybe one day the entirety of the human biology.
link |
00:37:35.820 
So here's another problem that was thought
link |
00:37:39.680 
to be impossible to solve, which is protein folding.
link |
00:37:42.720 
And AlphaFold, or specifically AlphaFold 2, did just that.
link |
00:37:48.840 
It solved protein folding.
link |
00:37:50.320 
I think it's one of the biggest breakthroughs,
link |
00:37:53.400 
certainly in the history of structural biology,
link |
00:37:55.140 
but in general in science,
link |
00:38:00.240 
maybe from a high level, what is it and how does it work?
link |
00:38:04.840 
And then we can ask some fascinating questions after.
link |
00:38:09.980 
So maybe to explain it to people not familiar
link |
00:38:12.880 
with protein folding is, you know,
link |
00:38:14.400 
first of all, explain proteins, which is, you know,
link |
00:38:16.980 
proteins are essential to all life.
link |
00:38:18.840 
Every function in your body depends on proteins.
link |
00:38:21.520 
Sometimes they're called the workhorses of biology.
link |
00:38:23.920 
And if you look into them and I've, you know,
link |
00:38:25.340 
obviously as part of AlphaFold,
link |
00:38:26.660 
I've been researching proteins and structural biology
link |
00:38:30.200 
for the last few years, you know,
link |
00:38:31.760 
they're amazing little bio-nano machines, proteins.
link |
00:38:34.760 
They're incredible if you actually watch little videos
link |
00:38:36.460 
of how they work, animations of how they work.
link |
00:38:39.000 
And proteins are specified by their genetic sequence
link |
00:38:42.600 
called the amino acid sequence.
link |
00:38:44.280 
So you can think of it as their genetic makeup.
link |
00:38:47.040 
And then in the body in nature,
link |
00:38:50.080 
they fold up into a 3D structure.
link |
00:38:53.360 
So you can think of it as a string of beads
link |
00:38:55.320 
and then they fold up into a ball.
link |
00:38:57.160 
Now, the key thing is you want to know
link |
00:38:59.100 
what that 3D structure is because the structure,
link |
00:39:02.480 
the 3D structure of a protein is what helps to determine
link |
00:39:06.120 
what does it do, the function it does in your body.
link |
00:39:08.580 
And also if you're interested in drugs or disease,
link |
00:39:12.320 
you need to understand that 3D structure
link |
00:39:13.980 
because if you want to target something
link |
00:39:15.840 
with a drug compound, to block something
link |
00:39:18.640 
the protein's doing, you need to understand
link |
00:39:21.120 
where it's gonna bind on the surface of the protein.
link |
00:39:23.440 
So obviously in order to do that,
link |
00:39:24.940 
you need to understand the 3D structure.
link |
00:39:26.720 
So the structure is mapped to the function.
link |
00:39:28.640 
The structure is mapped to the function
link |
00:39:29.880 
and the structure is obviously somehow specified
link |
00:39:32.560 
by the amino acid sequence.
link |
00:39:34.840 
And that's the, in essence, the protein folding problem is,
link |
00:39:37.420 
can you just from the amino acid sequence,
link |
00:39:39.620 
the one dimensional string of letters,
link |
00:39:42.560 
can you immediately computationally predict
link |
00:39:45.600 
the 3D structure?
link |
00:39:47.120 
And this has been a grand challenge in biology
link |
00:39:50.020 
for over 50 years.
link |
00:39:51.500 
So I think it was first articulated by Christian Anfinsen,
link |
00:39:54.360 
a Nobel prize winner in 1972,
link |
00:39:57.040 
as part of his Nobel prize winning lecture.
link |
00:39:59.240 
And he just speculated this should be possible
link |
00:40:01.860 
to go from the amino acid sequence to the 3D structure,
link |
00:40:04.960 
but he didn't say how.
link |
00:40:06.060 
So it's been described to me as equivalent
link |
00:40:09.440 
to Fermat's last theorem, but for biology.
link |
00:40:12.320 
You should, as somebody that very well might win
link |
00:40:15.120 
the Nobel prize in the future.
link |
00:40:16.560 
But outside of that, you should do more
link |
00:40:19.240 
of that kind of thing.
link |
00:40:20.080 
In the margin, just put random things
link |
00:40:22.160 
that will take like 200 years to solve.
link |
00:40:24.440 
Set people off for 200 years.
link |
00:40:26.000 
It should be possible.
link |
00:40:27.720 
And just don't give any details.
link |
00:40:29.040 
Exactly.
link |
00:40:29.880 
I think everyone exactly should be,
link |
00:40:31.500 
I'll have to remember that for future.
link |
00:40:33.520 
So yeah, so he set off, you know,
link |
00:40:34.800 
with this one throwaway remark, just like Fermat,
link |
00:40:37.040 
you know, he set off this whole 50 year field really
link |
00:40:42.640 
of computational biology.
link |
00:40:44.400 
And they had, you know, they got stuck.
link |
00:40:46.240 
They hadn't really got very far with doing this.
link |
00:40:48.520 
And until now, until AlphaFold came along,
link |
00:40:52.500 
this is done experimentally, right?
link |
00:40:54.320 
Very painstakingly.
link |
00:40:55.500 
So the rule of thumb is, and you have to like
link |
00:40:57.440 
crystallize the protein, which is really difficult.
link |
00:40:59.820 
Some proteins can't be crystallized like membrane proteins.
link |
00:41:03.060 
And then you have to use very expensive electron microscopes
link |
00:41:05.940 
or X-ray crystallography machines.
link |
00:41:08.200 
Really painstaking work to get the 3D structure
link |
00:41:10.680 
and visualize the 3D structure.
link |
00:41:12.400 
So the rule of thumb in experimental biology
link |
00:41:14.840 
is that it takes one PhD student,
link |
00:41:16.860 
their entire PhD to do one protein.
link |
00:41:20.320 
And with AlphaFold 2, we were able to predict
link |
00:41:23.440 
the 3D structure in a matter of seconds.
link |
00:41:26.400 
And so we were, you know, over Christmas,
link |
00:41:28.700 
we did the whole human proteome
link |
00:41:30.240 
or every protein in the human body or 20,000 proteins.
link |
00:41:33.280 
So the human proteome is like the equivalent
link |
00:41:34.760 
of the human genome, but on protein space.
link |
00:41:37.560 
And sort of revolutionized really
link |
00:41:40.240 
what a structural biologist can do.
link |
00:41:43.300 
Because now they don't have to worry
link |
00:41:45.720 
about these painstaking experimental,
link |
00:41:47.960 
should they put all of that effort in or not?
link |
00:41:49.560 
They can almost just look up the structure
link |
00:41:51.120 
of their proteins like a Google search.
link |
00:41:53.280 
And so there's a data set on which it's trained
link |
00:41:56.880 
and how to map this amino acid sequence.
link |
00:41:58.800 
First of all, it's incredible that a protein,
link |
00:42:00.760 
this little chemical computer is able to do
link |
00:42:02.480 
that computation itself in some kind of distributed way
link |
00:42:05.720 
and do it very quickly.
link |
00:42:07.800 
That's a weird thing.
link |
00:42:08.840 
And they evolve that way because, you know,
link |
00:42:10.480 
in the beginning, I mean, that's a great invention,
link |
00:42:13.200 
just the protein itself.
link |
00:42:14.760 
And then there's, I think, probably a history
link |
00:42:18.240 
of like they evolved to have many of these proteins
link |
00:42:22.740 
and those proteins figure out how to be computers themselves
link |
00:42:26.600 
in such a way that you can create structures
link |
00:42:28.560 
that can interact in complexes with each other
link |
00:42:30.540 
in order to form high level functions.
link |
00:42:32.660 
I mean, it's a weird system that they figured it out.
link |
00:42:35.520 
Well, for sure.
link |
00:42:36.360 
I mean, you know, maybe we should talk
link |
00:42:37.640 
about the origins of life too,
link |
00:42:39.000 
but proteins themselves, I think are magical
link |
00:42:41.180 
and incredible, as I said, little bio nano machines.
link |
00:42:45.760 
And actually Levinthal, who was another scientist,
link |
00:42:50.280 
a contemporary of Anfinsen, he coined this Levinthal,
link |
00:42:55.120 
what became known as Levinthal's paradox,
link |
00:42:56.820 
which is exactly what you're saying.
link |
00:42:58.320 
He calculated roughly an average protein,
link |
00:43:01.580 
which is maybe 2,000 amino acid bases long,
link |
00:43:05.080 
can fold in maybe 10 to the power 300
link |
00:43:09.960 
different conformations.
link |
00:43:11.480 
So there's 10 to the power 300 different ways
link |
00:43:13.320 
that protein could fold up.
link |
00:43:14.800 
And yet somehow in nature, physics solves this,
link |
00:43:18.160 
solves this in a matter of milliseconds.
link |
00:43:20.520 
So proteins fold up in your body in, you know,
link |
00:43:23.080 
sometimes in fractions of a second.
link |
00:43:25.600 
So physics is somehow solving that search problem.
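The arithmetic behind the paradox can be sketched in a few lines of Python. The 10 to the power 300 figure is the one quoted above; the sampling rate of 10^13 conformations per second and the age of the universe are assumptions added only for scale, and the function name is hypothetical.

```python
# Rough arithmetic behind Levinthal's paradox.
CONFORMATIONS = 10**300           # figure quoted in the conversation
SAMPLES_PER_SECOND = 10**13       # ASSUMPTION: hypothetical sampling rate
SECONDS_PER_YEAR = 365 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 1.38e10   # approximate, for scale only


def exhaustive_search_years(conformations: int, rate: int) -> float:
    """Years needed to visit every conformation once at `rate` per second."""
    return conformations / rate / SECONDS_PER_YEAR


years = exhaustive_search_years(CONFORMATIONS, SAMPLES_PER_SECOND)
universe_ages = years / AGE_OF_UNIVERSE_YEARS
print(f"{years:.1e} years, about {universe_ages:.1e} times the age of the universe")
```

Whatever sampling rate is assumed, brute-force enumeration is hopeless, which is why real folding in milliseconds looks paradoxical.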
link |
00:43:29.080 
And just to be clear, in many of these cases,
link |
00:43:31.200 
maybe you can correct me if I'm wrong,
link |
00:43:33.040 
there's often a unique way for that sequence to form itself.
link |
00:43:37.680 
So among that huge number of possibilities,
link |
00:43:41.240 
it figures out a way how to stably,
link |
00:43:45.320 
in some cases there might be a misfunction, so on,
link |
00:43:47.800 
which leads to a lot of the disorders and stuff like that.
link |
00:43:50.040 
But most of the time it's a unique mapping
link |
00:43:52.720 
and that unique mapping is not obvious.
link |
00:43:54.820 
No, exactly.
link |
00:43:55.660 
Which is what the problem is.
link |
00:43:57.120 
Exactly, so there's a unique mapping usually in a healthy,
link |
00:44:00.720 
if it's healthy, and as you say in disease,
link |
00:44:04.040 
so for example, Alzheimer's,
link |
00:44:05.400 
one conjecture is that it's because of misfolded protein,
link |
00:44:09.000 
a protein that folds in the wrong way, amyloid beta protein.
link |
00:44:12.040 
So, and then because it folds in the wrong way,
link |
00:44:14.560 
it gets tangled up, right, in your neurons.
link |
00:44:17.600 
So it's super important to understand
link |
00:44:20.560 
both healthy functioning and also disease
link |
00:44:23.600 
is to understand, you know, what these things are doing
link |
00:44:26.480 
and how they're structuring.
link |
00:44:27.600 
Of course, the next step is sometimes proteins change shape
link |
00:44:30.540 
when they interact with something.
link |
00:44:32.160 
So they're not just static necessarily in biology.
link |
00:44:37.200 
Maybe you can give some interesting,
link |
00:44:39.780 
so beautiful things to you about these early days
link |
00:44:43.260 
of AlphaFold, of solving this problem,
link |
00:44:46.160 
because unlike games, this is real physical systems
link |
00:44:51.280 
that are less amenable to self-play type of mechanisms.
link |
00:44:56.460 
The size of the data set is smaller
link |
00:44:58.440 
than you might otherwise like,
link |
00:44:59.760 
so you have to be very clever about certain things.
link |
00:45:01.800 
Is there something you could speak to
link |
00:45:04.800 
what was very hard to solve
link |
00:45:06.680 
and what are some beautiful aspects about the solution?
link |
00:45:09.920 
Yeah, I would say AlphaFold is the most complex
link |
00:45:12.800 
and also probably most meaningful system
link |
00:45:14.600 
we've built so far.
link |
00:45:15.860 
So it's been an amazing time actually in the last,
link |
00:45:18.400 
you know, two, three years to see that come through
link |
00:45:20.520 
because as we talked about earlier, you know,
link |
00:45:23.200 
games is what we started on
link |
00:45:25.480 
building things like AlphaGo and AlphaZero,
link |
00:45:27.900 
but really the ultimate goal was to,
link |
00:45:30.400 
not just to crack games,
link |
00:45:31.520 
it was just to build,
link |
00:45:33.120 
use them to bootstrap general learning systems
link |
00:45:35.320 
we could then apply to real world challenges.
link |
00:45:37.440 
Specifically, my passion is scientific challenges
link |
00:45:40.640 
like protein folding.
link |
00:45:41.920 
And then AlphaFold of course
link |
00:45:43.280 
is our first big proof point of that.
link |
00:45:45.360 
And so, you know, in terms of the data
link |
00:45:49.040 
and the amount of innovations that had to go into it,
link |
00:45:50.920 
we, you know, it was like
link |
00:45:52.280 
more than 30 different component algorithms
link |
00:45:54.480 
needed to be put together to crack the protein folding.
link |
00:45:57.960 
I think some of the big innovations were that
link |
00:46:00.800 
kind of building in some hard coded constraints
link |
00:46:04.220 
around physics and evolutionary biology
link |
00:46:07.760 
to constrain sort of things like the bond angles
link |
00:46:11.640 
in the protein and things like that,
link |
00:46:15.400 
a lot, but not to impact the learning system.
link |
00:46:18.040 
So still allowing the system to be able to learn
link |
00:46:21.000 
the physics itself from the examples that we had.
link |
00:46:25.540 
And the examples, as you say,
link |
00:46:26.640 
there are only about 150,000 proteins,
link |
00:46:28.840 
even after 40 years of experimental biology,
link |
00:46:31.240 
only around 150,000 proteins have been,
link |
00:46:33.880 
the structures have been found out about.
link |
00:46:35.920 
So that was our training set,
link |
00:46:37.120 
which is much less than normally we would like to use,
link |
00:46:41.120 
but using various tricks, things like self-distillation.
link |
00:46:43.840 
So actually using AlphaFold predictions,
link |
00:46:48.280 
some of the best predictions
link |
00:46:49.480 
that it thought was highly confident in,
link |
00:46:51.000 
we put them back into the training set, right?
link |
00:46:53.320 
To make the training set bigger,
link |
00:46:55.440 
that was critical to AlphaFold working.
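A minimal sketch of the self-distillation loop described here, assuming a model that returns a confidence score with each prediction. The `Prediction` class, the `predict` callable, the 0.9 threshold, and the toy model are all hypothetical stand-ins, not AlphaFold's actual pipeline.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    sequence: str
    structure: list      # placeholder for predicted 3D coordinates
    confidence: float    # ASSUMPTION: a score in [0, 1]


def self_distill(labeled, unlabeled_sequences, predict, threshold=0.9):
    """Return the labeled set plus confident predictions on unlabeled sequences."""
    augmented = list(labeled)
    for seq in unlabeled_sequences:
        pred = predict(seq)
        if pred.confidence >= threshold:
            # Treat the model's own confident output as a new training example.
            augmented.append((pred.sequence, pred.structure))
    return augmented


def toy_predict(seq):
    """Stand-in model: confident only on short sequences."""
    confidence = 0.95 if len(seq) <= 4 else 0.5
    return Prediction(seq, [(0.0, 0.0, 0.0)] * len(seq), confidence)


labeled = [("AC", [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)])]
augmented = self_distill(labeled, ["GKL", "MKTAYIAK"], toy_predict)
```

In this toy run only the short sequence clears the threshold, so the training set grows by one example; a real system would then retrain on the enlarged set.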
link |
00:46:58.400 
So there was actually a huge number
link |
00:47:00.160 
of different innovations like that,
link |
00:47:02.720 
that were required to ultimately crack the problem.
link |
00:47:06.080 
AlphaFold 1, what it produced was a distogram.
link |
00:47:09.720 
So a kind of a matrix of the pairwise distances
link |
00:47:13.600 
between all of the molecules in the protein.
link |
00:47:17.880 
And then there had to be a separate optimization process
link |
00:47:20.440 
to create the 3D structure.
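To make the idea of a pairwise distance matrix concrete, here is a small sketch that goes the easy direction, from known 3D coordinates to distances and then to distance bins. The coordinates and bin edges are invented for illustration, and the binning is an assumption about what a distogram-style representation might look like, not a description of AlphaFold 1's actual output format.

```python
import bisect
import math


def distance_matrix(coords):
    """Matrix of Euclidean distances between every pair of points."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def binned_distances(coords, bin_edges):
    """Replace each pairwise distance with the index of the bin it falls in."""
    return [
        [bisect.bisect_right(bin_edges, d) for d in row]
        for row in distance_matrix(coords)
    ]


# Three made-up points standing in for positions along a protein chain.
coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 10.0)]
distances = distance_matrix(coords)
bins = binned_distances(coords, bin_edges=[4.0, 8.0, 12.0])
```

Recovering 3D coordinates from such a matrix is the harder, inverse direction, which is why a separate optimization step was needed.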
link |
00:47:23.640 
And what we did for AlphaFold 2
link |
00:47:25.120 
is make it truly end to end.
link |
00:47:26.920 
So we went straight from the amino acid sequence of bases
link |
00:47:31.720 
to the 3D structure directly
link |
00:47:33.860 
without going through this intermediate step.
link |
00:47:36.080 
And in machine learning, what we've always found is
link |
00:47:38.600 
that the more end to end you can make it,
link |
00:47:40.920 
the better the system.
link |
00:47:42.160 
And it's probably because in the end,
link |
00:47:46.160 
the system's better at learning what the constraints are
link |
00:47:48.560 
than we are as the human designers of specifying it.
link |
00:47:51.920 
So anytime you can let it flow end to end
link |
00:47:54.040 
and actually just generate what it is
link |
00:47:55.400 
you're really looking for, in this case, the 3D structure,
link |
00:47:58.440 
you're better off than having this intermediate step,
link |
00:48:00.560 
which you then have to handcraft the next step for.
link |
00:48:03.360 
So it's better to let the gradients and the learning
link |
00:48:06.160 
flow all the way through the system from the end point,
link |
00:48:09.000 
the end output you want to the inputs.
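As a toy illustration of gradients flowing from the final output back to the first stage, the sketch below trains a two-stage pipeline with the chain rule written out by hand. The two scalar weights, the learning rate, and the function name are hypothetical; this shows the general principle only and bears no resemblance to AlphaFold's architecture.

```python
def train_end_to_end(x, target, steps=200, lr=0.01):
    """Fit y = w2 * (w1 * x) to `target`, updating both stages from one loss."""
    w1, w2 = 0.5, 0.5
    for _ in range(steps):
        h = w1 * x            # stage 1: intermediate representation
        y = w2 * h            # stage 2: final output
        err = y - target      # loss is err ** 2
        # Chain rule: the gradient for w1 passes through stage 2's weight.
        grad_w2 = 2 * err * h
        grad_w1 = 2 * err * w2 * x
        w1 -= lr * grad_w1
        w2 -= lr * grad_w2
    final_loss = (w2 * w1 * x - target) ** 2
    return w1, w2, final_loss


w1, w2, final_loss = train_end_to_end(x=2.0, target=3.0)
```

Neither stage is given its own hand-specified target; the intermediate value `h` ends up wherever the final loss pushes it.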
link |
00:48:10.880 
So that's a good way to start on a new problem.
link |
00:48:13.040 
Handcraft a bunch of stuff,
link |
00:48:14.360 
add a bunch of manual constraints
link |
00:48:16.640 
with a small end to end learning piece
link |
00:48:18.640 
or a small learning piece and grow that learning piece
link |
00:48:21.560 
until it consumes the whole thing.
link |
00:48:22.840 
That's right.
link |
00:48:23.680 
And so you can also see,
link |
00:48:25.320 
this is a bit of a method we've developed
link |
00:48:26.960 
over doing many sort of successful Alpha,
link |
00:48:29.640 
we call them Alpha X projects, right?
link |
00:48:32.200 
And the easiest way to see that is the evolution
link |
00:48:34.600 
of AlphaGo to AlphaZero.
link |
00:48:36.720 
So AlphaGo was a learning system,
link |
00:48:39.640 
but it was specifically trained to only play Go, right?
link |
00:48:42.280 
So, and what we wanted to do with the first version of AlphaGo
link |
00:48:45.360 
is just get to world champion performance
link |
00:48:47.520 
no matter how we did it, right?
link |
00:48:49.200 
And then of course, AlphaGo Zero,
link |
00:48:51.400 
we removed the need to use human games as a starting point,
link |
00:48:55.280 
right?
link |
00:48:56.120 
So it could just play against itself
link |
00:48:57.960 
from a random starting point from the beginning.
link |
00:49:00.280 
So that removed the need for human knowledge about Go.
link |
00:49:03.720 
And then finally AlphaZero then generalized it
link |
00:49:05.960 
so that any things we had in there, the system,
link |
00:49:08.920 
including things like symmetry of the Go board were removed.
link |
00:49:12.240 
So AlphaZero could play from scratch
link |
00:49:14.600 
any two-player game. And then MuZero,
link |
00:49:16.440 
which is the final, our latest version
link |
00:49:18.360 
of that set of things was then extending it
link |
00:49:20.680 
so that you didn't even have to give it
link |
00:49:22.120 
the rules of the game.
link |
00:49:23.200 
It would learn that for itself.
link |
00:49:24.880 
So it could also deal with computer games
link |
00:49:26.600 
as well as board games.
link |
00:49:27.760 
So that line of AlphaGo, AlphaGo Zero, AlphaZero,
link |
00:49:30.400 
MuZero, that's the full trajectory
link |
00:49:33.480 
of what you can take from imitation learning
link |
00:49:37.200 
to full self-supervised learning.
link |
00:49:40.440 
Yeah, exactly.
link |
00:49:41.640 
And learning the entire structure
link |
00:49:44.720 
of the environment you're put in from scratch, right?
link |
00:49:47.640 
And bootstrapping it through self-play yourself.
link |
00:49:51.840 
But the thing is it would have been impossible, I think,
link |
00:49:53.720 
or very hard for us to build AlphaZero
link |
00:49:55.960 
or MuZero first out of the box.
link |
00:49:58.600 
Even psychologically, because you have to believe
link |
00:50:01.400 
in yourself for a very long time.
link |
00:50:03.040 
You're constantly dealing with doubt
link |
00:50:04.640 
because a lot of people say that it's impossible.
link |
00:50:06.680 
Exactly, so it's hard enough just to do Go.
link |
00:50:08.640 
As you were saying, everyone thought that was impossible
link |
00:50:10.920 
or at least a decade away from when we did it
link |
00:50:14.160 
back in 2015, 2016.
link |
00:50:17.320 
And so yes, it would have been psychologically
link |
00:50:20.960 
probably very difficult as well as the fact
link |
00:50:22.960 
that of course we learned a lot by building AlphaGo first.
link |
00:50:26.400 
Right, so I think this is why I call AI
link |
00:50:28.520 
an engineering science.
link |
00:50:29.880 
It's one of the most fascinating science disciplines,
link |
00:50:32.280 
but it's also an engineering science in the sense
link |
00:50:34.200 
that unlike natural sciences, the phenomenon you're studying
link |
00:50:38.200 
doesn't exist out in nature.
link |
00:50:39.440 
You have to build it first.
link |
00:50:40.880 
So you have to build the artifact first,
link |
00:50:42.480 
and then you can study and pull it apart and how it works.