Demis Hassabis, the CEO and co-founder of DeepMind, explains how they solved protein folding.
From Lex Fridman: https://lexfridman.com/demis-hassabis/
Transcripts from Karpathy's https://karpathy.ai/lexicap/0299-large.html
00:37:13.160
So let's go to the basic building blocks of biology
00:37:16.960
that I think is another angle at which you can start
00:37:30.480
you can construct bigger and bigger, more complex systems,
00:37:33.080
maybe one day the entirety of the human biology.
00:37:39.680
to be impossible to solve, which is protein folding.
00:37:42.720
And AlphaFold, or specifically AlphaFold 2, did just that.
00:37:50.320
I think it's one of the biggest breakthroughs,
00:37:53.400
certainly in the history of structural biology,
00:38:00.240
maybe from a high level, what is it and how does it work?
00:38:04.840
And then we can ask some fascinating questions after.
00:38:14.400
first of all, explain proteins, which is, you know,
00:38:18.840
Every function in your body depends on proteins.
00:38:21.520
Sometimes they're called the workhorses of biology.
00:38:26.660
I've been researching proteins and structural biology
00:38:31.760
they're amazing little bio-nano machines, proteins.
00:38:34.760
They're incredible if you actually watch little videos
00:38:36.460
of how they work, animations of how they work.
00:38:39.000
And proteins are specified by their genetic sequence
00:38:44.280
So you can think of it as their genetic makeup.
00:38:59.100
what that 3D structure is because the structure,
00:39:02.480
the 3D structure of a protein is what helps to determine
00:39:06.120
what does it do, the function it does in your body.
00:39:08.580
And also if you're interested in drugs or disease,
00:39:21.120
where it's gonna bind on the surface of the protein.
00:39:29.880
and the structure is obviously somehow specified
00:39:34.840
And that's the, in essence, the protein folding problem is,
00:39:47.120
And this has been a grand challenge in biology
00:39:51.500
So I think it was first articulated by Christian Anfinsen,
00:39:59.240
And he just speculated this should be possible
00:40:01.860
to go from the amino acid sequence to the 3D structure,
00:40:12.320
You should, as somebody that very well might win
00:40:34.800
with this one throwaway remark, just like Fermat,
00:40:37.040
you know, he set off this whole 50-year field really
00:40:46.240
They hadn't really got very far with doing this.
00:40:57.440
crystallize the protein, which is really difficult.
00:40:59.820
Some proteins can't be crystallized like membrane proteins.
00:41:03.060
And then you have to use very expensive electron microscopes
00:41:08.200
Really painstaking work to get the 3D structure
00:41:30.240
or every protein in the human body or 20,000 proteins.
00:41:53.280
And so there's a data set on which it's trained
00:42:02.480
that computation itself in some kind of distributed way
00:42:10.480
in the beginning, I mean, that's a great invention,
00:42:18.240
of like they evolved to have many of these proteins
00:42:22.740
and those proteins figure out how to be computers themselves
00:42:28.560
that can interact in complexes with each other
00:42:32.660
I mean, it's a weird system that they figured it out.
00:42:41.180
and incredible, as I said, little bio-nano machines.
00:42:45.760
And actually Levinthal, who was another scientist,
00:42:50.280
a contemporary of Anfinsen, he coined this Levinthal,
00:43:14.800
And yet somehow in nature, physics solves this,
00:43:20.520
So proteins fold up in your body in, you know,
00:43:25.600
So physics is somehow solving that search problem.
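The size of that search problem can be made concrete with a back-of-the-envelope calculation in the spirit of Levinthal's argument. The sketch below is only illustrative: the chain length (100 residues), the number of states per residue (3), and the sampling rate (1e13 conformations per second) are assumed values chosen for the example, not figures from the conversation.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def log10_conformations(n_residues, states_per_residue):
    """log10 of the number of conformations if every residue
    independently takes one of `states_per_residue` states."""
    return n_residues * math.log10(states_per_residue)


def log10_years_to_enumerate(n_residues, states_per_residue, samples_per_second):
    """log10 of the years needed to visit every conformation once."""
    log10_seconds = (log10_conformations(n_residues, states_per_residue)
                     - math.log10(samples_per_second))
    return log10_seconds - math.log10(SECONDS_PER_YEAR)


# Assumed, illustrative inputs (not taken from the transcript).
log_count = log10_conformations(100, 3)
log_years = log10_years_to_enumerate(100, 3, 1e13)

print(f"conformations ~ 10^{log_count:.1f}")
print(f"exhaustive search ~ 10^{log_years:.1f} years")
```

Working in log10 keeps the numbers readable and avoids overflow for longer chains. Even under these modest assumptions, exhaustive enumeration would take vastly longer than the age of the universe, which is why the fast folding seen in nature looks paradoxical.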
00:43:33.040
there's often a unique way for that sequence to form itself.
00:43:45.320
in some cases there might be a misfunction, so on,
00:43:47.800
which leads to a lot of the disorders and stuff like that.
00:43:57.120
Exactly, so there's a unique mapping usually in a healthy,
00:44:05.400
one conjecture is that it's because of misfolded protein,
00:44:09.000
a protein that folds in the wrong way, amyloid beta protein.
00:44:12.040
So, and then because it folds in the wrong way,
00:44:23.600
is to understand, you know, what these things are doing
00:44:27.600
Of course, the next step is sometimes proteins change shape
00:44:32.160
So they're not just static necessarily in biology.
00:44:39.780
so beautiful things to you about these early days
00:44:46.160
because unlike games, this is real physical systems
00:44:51.280
that are less amenable to self-play type of mechanisms.
00:44:59.760
so you have to be very clever about certain things.
00:45:06.680
and what are some beautiful aspects about the solution?
00:45:09.920
Yeah, I would say AlphaFold is the most complex
00:45:15.860
So it's been an amazing time actually in the last,
00:45:18.400
you know, two, three years to see that come through
00:45:33.120
use them to bootstrap general learning systems
00:45:37.440
Specifically, my passion is scientific challenges
00:45:49.040
and the amount of innovations that had to go into it,
00:45:54.480
needed to be put together to crack the protein folding.
00:46:00.800
kind of building in some hard-coded constraints
00:46:07.760
to constrain sort of things like the bond angles
00:46:18.040
So still allowing the system to be able to learn
00:46:21.000
the physics itself from the examples that we had.
00:46:37.120
which is much less than normally we would like to use,
00:46:41.120
but using various tricks, things like self-distillation.
00:46:51.000
we put them back into the training set, right?
00:47:02.720
that were required to ultimately crack the problem.
00:47:06.080
AlphaFold 1, what it produced was a distogram.
00:47:09.720
So a kind of a matrix of the pairwise distances
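As a rough illustration of the data structure being described, the sketch below builds a pairwise distance matrix from 3D coordinates and then discretizes each distance into a bin. It is a toy under stated assumptions: the coordinates and bin edges are hypothetical, and it computes distances from known positions, whereas a learned model would predict a distribution over distance bins for each pair from the sequence. The function names are made up for this example.

```python
import math


def pairwise_distances(coords):
    """Symmetric matrix of Euclidean distances between 3D points."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def distogram_bins(distances, edges):
    """Replace each distance with the index of the first bin edge
    it falls below; distances beyond the last edge get len(edges)."""
    def bin_index(d):
        for i, edge in enumerate(edges):
            if d < edge:
                return i
        return len(edges)
    return [[bin_index(d) for d in row] for row in distances]


# Hypothetical toy coordinates (angstroms) for four residues; not real data.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
# Hypothetical bin edges (angstroms); real systems use many more bins.
edges = [4.0, 6.0, 8.0]

distances = pairwise_distances(coords)
bins = distogram_bins(distances, edges)

for row in bins:
    print(row)
```

A matrix like this says how far apart each pair of residues is but not where each one sits in space, so recovering coordinates from it takes a further fitting step.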
00:47:17.880
And then there had to be a separate optimization process
00:47:26.920
So we went straight from the amino acid sequence of bases
00:47:36.080
And in machine learning, what we've always found is
00:47:46.160
the system's better at learning what the constraints are
00:47:48.560
than we are as the human designers of specifying it.
00:47:55.400
you're really looking for, in this case, the 3D structure,
00:47:58.440
you're better off than having this intermediate step,
00:48:00.560
which you then have to handcraft the next step for.
00:48:03.360
So it's better to let the gradients and the learning
00:48:06.160
flow all the way through the system from the end point,
00:48:10.880
So that's a good way to start on a new problem.
00:48:18.640
or a small learning piece and grow that learning piece
00:48:32.200
And the easiest way to see that is the evolution
00:48:39.640
but it was specifically trained to only play Go, right?
00:48:42.280
So, and what we wanted to do with first version of AlphaGo
00:48:51.400
we remove the need to use human games as a starting point,
00:48:57.960
from random starting point from the beginning.
00:49:00.280
So that removed the need for human knowledge about Go.
00:49:03.720
And then finally AlphaZero then generalized it
00:49:05.960
so that any things we had in there, the system,
00:49:08.920
including things like symmetry of the Go board were removed.
00:49:27.760
So that line of AlphaGo, AlphaGo Zero, AlphaZero,
00:49:44.720
of the environment you're put in from scratch, right?
00:49:47.640
And bootstrapping it through self-play yourself.
00:49:51.840
But the thing is it would have been impossible, I think,
00:49:58.600
Even psychologically, because you have to believe
00:50:04.640
because a lot of people say that it's impossible.
00:50:08.640
As you were saying, everyone thought that was impossible
00:50:17.320
And so yes, it would have been psychologically
00:50:22.960
that of course we learn a lot by building AlphaGo first.
00:50:29.880
It's one of the most fascinating science disciplines,
00:50:32.280
but it's also an engineering science in the sense
00:50:34.200
that unlike natural sciences, the phenomenon you're studying
00:50:42.480
and then you can study and pull it apart and how it works.