As the summary says, the basic approach (evolutionary data -> position coupling -> distance restraints -> structure solver) is actually quite old; the key paper dates back to 2011.
The distance-constraints-to-3D-structure part in particular is very old - I was calculating structures from experimentally determined distances 30 years ago. You need a surprisingly small number of fairly weak restraints ("these atoms are between 3 and 5 Angstroms apart") to determine a 3D structure, provided a decent number of them are long range.
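To make the "restraints -> structure" step concrete, here is a minimal sketch of the idea (not the algorithm any particular package uses): treat each interval restraint as a flat-bottomed penalty and run gradient descent from random starting coordinates. The function name and parameters are made up for illustration. With enough long-range restraints the result converges (up to rotation, translation, and possibly mirror image) to the true structure.

    import numpy as np

    def embed_from_restraints(n_atoms, restraints, n_steps=2000, lr=0.01, seed=0):
        """Toy distance geometry: recover 3D coordinates from sparse interval restraints.

        restraints: list of (i, j, d_min, d_max) tuples, e.g. (4, 57, 3.0, 5.0) meaning
        "atoms 4 and 57 are between 3 and 5 Angstroms apart". The penalty is zero inside
        the interval and quadratic outside it.
        """
        rng = np.random.default_rng(seed)
        x = rng.normal(scale=5.0, size=(n_atoms, 3))   # random starting coordinates
        for _ in range(n_steps):
            grad = np.zeros_like(x)
            for i, j, d_min, d_max in restraints:
                diff = x[i] - x[j]
                d = np.linalg.norm(diff) + 1e-9
                if d < d_min:
                    err = d - d_min      # too close: push apart
                elif d > d_max:
                    err = d - d_max      # too far: pull together
                else:
                    continue             # restraint satisfied
                g = err * diff / d
                grad[i] += g
                grad[j] -= g
            x -= lr * grad
        return x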
What they have done is execute better.
However, the problem they are working on -
sequence -> structure
- though it's been a long-term 'holy grail', is in practice not that useful!
The models typically aren't quite good enough, they don't predict interactions, and experimental methods for determining structures have also come on in leaps and bounds.
As the article briefly mentions, what you really want to do is go the other way.
Designed novel structure -> protein sequence to make it.
One way to do that, if you have a function going the other way (like AlphaFold - let's ignore limitations for now, e.g. does a knowledge-based approach work well for completely novel folds?), is some sort of heuristic search over sequences - however, the search space is huge, and a step time of hours per candidate isn't going to cut it.
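A minimal sketch of such a search loop, assuming a hypothetical predict_structure oracle (an AlphaFold-like forward model) and a similarity_to_target score - both names are placeholders, not real APIs. It also makes obvious why an oracle that takes hours per call is the bottleneck: every mutation costs one full prediction.

    import random

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

    def design_sequence(target_structure, length, n_iters,
                        predict_structure, similarity_to_target):
        """Naive hill-climbing over sequence space using a forward predictor as an oracle.

        predict_structure(seq) -> predicted structure (hypothetical forward model)
        similarity_to_target(structure, target) -> float, higher is better
        Each iteration mutates one residue and keeps the change if the predicted
        structure gets closer to the designed target. Real design methods are far
        more sophisticated; this only shows the shape of the search.
        """
        seq = [random.choice(AMINO_ACIDS) for _ in range(length)]
        best = similarity_to_target(predict_structure("".join(seq)), target_structure)
        for _ in range(n_iters):
            pos = random.randrange(length)
            old = seq[pos]
            seq[pos] = random.choice(AMINO_ACIDS)
            score = similarity_to_target(predict_structure("".join(seq)), target_structure)
            if score >= best:
                best = score        # keep the mutation
            else:
                seq[pos] = old      # revert
        return "".join(seq), best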
I recall reading about a result a couple of years ago supposedly demonstrating that "synonymous" DNA codons were in fact not synonymous, because the ribosome took systematically different amounts of time to process them, and the difference in construction time resulted in different folding for the protein.
This would imply that the problem "sequence -> structure" is not well defined, at least if the sequence in question is the sequence of peptides making up the protein and not the sequence of codons making up the gene that codes for the protein.
Do you know anything about this? Am I just making it up?
The phenomenon you described does exist[0], and similar effects (though rarely as severe) in cotranslational folding have been known for some time now. AFAIK it's not studied that intensely, as it's hard to study experimentally, and it is not expected to have a significant effect on final protein folding. During the folding process proteins often undergo partial misfolds and unfold again (the protein is basically doing a gradient-descent search for its lowest-energy state), so a small misfold at the beginning of the translation process will rarely make it into the final fold.
Apart from cotranslational effects, there are other mechanisms that interfere with folding, like chaperone proteins, which also make it hard to phrase this as a straight "sequence -> structure" prediction problem, though they only apply to a small fraction of all proteins (in humans).
I think it's much more likely that the different codon usage leads to different rates of synthesis, not differently folded proteins. But that's a complex area. Many proteins do not fold to their native structure spontaneously, and there are other proteins that refold them "correctly".
It's not clear to me that you could truly demonstrate substantially different folding due to codon usage like that, on an experimental level, to make a general statement about all proteins.
Well, I'm not sure what you mean by "sequence -> structure".
Historically, people have used the fact that small globular proteins refold spontaneously and rapidly to their native state to support the idea that there is a single, unique structure encoded by a specific sequence. That's a helpful if ultimately limited approach (as we observe many proteins that don't fold rapidly to a single native structure).
That's a reason that evolution-based methods, which use statistics about families of related proteins to estimate distances between pairs of amino acids (in 3D space), are more effective; many times in biology we can use evolutionary relationships between proteins to infer things that would be hard to determine through experiments or rigorous, thorough simulations.
But it's important to appreciate there are a large number of proteins that don't fold to a single unique structure rapidly, and there are many ways this can be the case and many different biologically relevant behaviors that depend on these properties. The tools from CASP are much less useful for proteins that violate the assumptions of Anfinsen's dogma; although the evolutionary data is helpful there too, it can often be a lot more challenging to deconvolute the signal.
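To make the evolution-based idea above concrete, here is a bare-bones sketch of the underlying statistical signal - mutual information between columns of a multiple sequence alignment used as a crude contact predictor. Real methods (DCA and the coevolution features used in modern predictors) also correct for phylogeny and indirect couplings; this is only the core intuition.

    import numpy as np
    from collections import Counter

    def mutual_information_contacts(msa):
        """Crude coevolution signal from a multiple sequence alignment (MSA).

        msa: list of equal-length strings (aligned sequences from one protein family).
        Returns an (L, L) matrix of mutual information between alignment columns;
        strongly co-varying column pairs tend to be close together in 3D.
        """
        L = len(msa[0])
        n = len(msa)
        cols = [[seq[k] for seq in msa] for k in range(L)]
        mi = np.zeros((L, L))
        for i in range(L):
            for j in range(i + 1, L):
                pi = Counter(cols[i])
                pj = Counter(cols[j])
                pij = Counter(zip(cols[i], cols[j]))
                val = 0.0
                for (a, b), c in pij.items():
                    # p(a,b) * log( p(a,b) / (p(a) * p(b)) )
                    val += (c / n) * np.log(c * n / (pi[a] * pj[b]))
                mi[i, j] = mi[j, i] = val
        return mi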
Ultimately, "what is the right problem"? The one that makes the most money? Produces the most "useful" scientific result? Is accessible with today's technology? For now, there's plenty of value in these sorts of competitions.
Personally I think the "right problem" is: "given a collection of diseases, use experimentally derived data and clever math, to discover biological treatments that reduce the total suffering from those diseases, subject to monetary and ethical constraints". That's what pharma attempts to do, although not particularly well. Others might say simply solving interesting problems like protein folding is inherently valuable as the right problem.
It's true that different codons are translated at different speeds, but I've never heard that this leads to differences in folding. Translation speed is nevertheless used to signal a shortage of amino acids or to implement modifications.
When you're doing protein expression, it's standard procedure to do codon optimization in order to adapt the codons to your host expression system. When you're expressing a human protein in yeast or E. coli, the codons will be quite different but the folding is the same. If translation didn't show this stability, it would be difficult to use foreign expression systems at all.
It's biology - so there are exceptions and oddities everywhere.
Imagine software that evolved by trial and error rather than being designed - it's not 'clean'... there are no absolutes.
It's well known that codons affect translation rate, and translation rate can affect the kinetic pathway of folding.
When a protein folds (even away from the ribosome) it isn't folding in isolation - it's in solution, being bombarded by the Brownian motion of other molecules.
It's amazing you are alive at all really - it also suggests most proteins would need the functional state to be strongly energetically preferred.
So there are potentially lots of things beyond just the protein sequence that could be important variables.
However some proteins will reversibly fold/unfold in simple solution all day long - so it makes sense to start there.
Is the bit-by-bit construction of the protein an issue? I mean, surely the first half would have already folded into some shape by the time the second half was being added on? And then the rest of it would not necessarily fold the same way as if you considered the entire completed protein starting to fold from scratch.
I am not an ML engineer (except when I program in ML, of course ;)). But this sounds a lot like the following:
1. We had a model that worked in principle, but the search space was practically infeasible.
2. We made an observation that a different model might exist that makes the search space irrelevant.
3. We threw ML at it.
4. Now we might have a model that fulfills (2) but we cannot be sure because we used a black-box approach.
5. Somehow the results are exciting. Better results would be really exciting.
6. We hope that more data yields these better results.
Is that correct? Am I the only one to lament these black-box approaches? Should there not be a bunch of people now studying the learned models to figure out if much better results can actually exist?
Think of it as an engineering problem rather than a science problem.
In much of drug discovery/development (and disease research), being able to predict protein structure would be very valuable. Being able to quickly find candidate structures (that can then be searched for in the lab) speeds things up immensely. Reducing false positives (or just coming up with possibilities at all) is a huge win.
But you’re right that this probably doesn’t help the protein theoretician much if at all. We already “know” how it works (it’s just thermodynamics and quantum mechanics) and of course have no idea how it works (“well they wiggle around until they find a low energy state” doesn’t really tell you anything). But that doesn’t keep this from being exciting.
Protein structure prediction, at the current levels of precision (and I include AlphaFold here), is not useful for drug discovery.
It’s the sort of thing researchers say to get grants, but as a distant goal, not a practical reality.
For structure-based drug discovery (which isn’t even the majority of drug discovery), the details are what matter (e.g. “does this water molecule mediate a binding interaction, or do the sidechains shuffle a bit, and kick the water out?”), and these methods don’t even come close to predicting detailed interactions.
Metrics in this space are focused on “general correctness” of protein backbone conformation. Success is to achieve a kind of blurry view of the overall shape of the molecule, and drug design is trying to predict specific atomic interactions. They’re two wildly different problems.
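For concreteness, the kind of "general correctness" metric in question is a superposition-based comparison of backbone (C-alpha) coordinates; CASP's headline number is GDT_TS, but the simpler RMSD shows the flavour. A minimal sketch using the Kabsch algorithm:

    import numpy as np

    def backbone_rmsd(P, Q):
        """RMSD between two (N, 3) arrays of C-alpha coordinates after optimal superposition.

        Kabsch algorithm: centre both sets, find the rotation minimizing the squared
        deviation via SVD, then take the root-mean-square distance. Scores like this
        (or the coarser GDT_TS used in CASP) measure overall backbone shape, not the
        atomic-level interaction detail drug design needs.
        """
        P = P - P.mean(axis=0)
        Q = Q - Q.mean(axis=0)
        H = P.T @ Q
        U, _, Vt = np.linalg.svd(H)
        d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against a reflection
        R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
        P_rot = P @ R.T
        return np.sqrt(np.mean(np.sum((P_rot - Q) ** 2, axis=1)))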
About the best you can say is that if we had a generalizable model of physics that could predict protein structure, it might also be able to do a good job of evaluating how a small molecule binds to a protein target. But even that is a huge leap, and when you start using black-box methods like AlphaFold to specifically solve the problem of structure prediction, it’s not really clear that generalization is even possible.
There are potential practical uses in drug discovery for a method that can design a protein which takes a particular shape, but even that is pretty different from what AlphaFold actually does.
There are indeed many different applications of ML to drug discovery, but AlphaFold is a niche technique, and protein structure prediction is not practically useful, in general.
I see. My point is that all these efforts are really validating deep learning as an approach for solving previously intractable biophysics problems. Science is a series of stepping stones.
There are many areas where ML is being used to advance drug discovery and other kinds of practical, difficult biophysics problems. Casual readers would do well not to get too fixated on this particular area.
Ha... I wasn’t repeating it to make some kind of assertion of dominance, just to explain that I’m aware of the point you’re making, and that it’s related to, but different from, my own.
>>> Am I the only one to lament these black-box approaches?
Far from it. Prediction looks like one tool in the arsenal for better understanding. One still has to correlate the structure with the complex interactions in vivo. Even using AI in classification mode - segmenting a large atlas of tumor cells and identifying a dozen or so classes of cell anomalies - may lead to faster breakthroughs in immunotherapy.
What I am trying to wrap my head around is the synthesis problem. Say AlphaFold generates a promising candidate. One that does not exist naturally. You still need the DNA or mRNA transcription sequence to synthesize the protein, right? Won't some candidates simply be too complex and unstable to reliably produce using existing mammalian or baculovirus platforms?
>Won't some candidates simply be too complex and unstable to reliably produce using existing mammalian or baculovirus platforms?
You can add that to the objective function your model is optimizing, penalizing outputs that are "too complex or unstable to reliably produce".
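A minimal sketch of that idea, with structure_score and manufacturability_penalty as hypothetical placeholders for whatever predicted-fit and producibility scores you have (neither is part of AlphaFold):

    def design_objective(candidate, structure_score, manufacturability_penalty, lam=0.5):
        """Combined objective for a design search or training loop.

        structure_score(candidate)           -> higher is better (fit to the target shape)
        manufacturability_penalty(candidate) -> higher is worse (aggregation propensity,
                                                expression difficulty in the host, etc.)
        lam trades off the two terms; candidates predicted to be hard to produce get
        pushed down even if their predicted structure looks good.
        """
        return structure_score(candidate) - lam * manufacturability_penalty(candidate)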
You can use permutation feature importance to tell you which inputs are most strongly associated with the condition. Way better than the simple per-feature correlation methods most statistical analyses use.
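For readers unfamiliar with it, a small self-contained example using scikit-learn's permutation_importance on synthetic data (the data and model here are made up purely for illustration): shuffle one feature at a time on held-out data and measure how much the model's score drops.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    # Synthetic data: only features 0 and 2 actually drive the label.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5))
    y = (X[:, 0] + 0.5 * X[:, 2] + 0.1 * rng.normal(size=500) > 0).astype(int)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

    # Permute each feature on the test set; the score drop is its importance.
    result = permutation_importance(model, X_test, y_test, n_repeats=20, random_state=0)
    for i in np.argsort(result.importances_mean)[::-1]:
        print(f"feature {i}: {result.importances_mean[i]:.3f} +/- {result.importances_std[i]:.3f}")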
This is far from the only time this has happened, though...Many moons ago I was required to take a class in fluid dynamics in College. It was all observational best-fit statistical estimation that kinda modeled observed behavior.
Either it was a poorly taught class, it was aimed at civil engineering and similar where you want simple models with broad applicability and don't care about the wide error bars, or maybe you remember best the observational stuff.
But fluid mechanics has a very deep theoretical underpinning, and generally has interpretable models. I'd suggest skimming through a copy of either Lamb's or Batchelor's book to see how far you can get with just pen and paper and no statistical input at all.
It WAS for Civil Engineering and it was 30 years ago. FEA was JUST starting to be a thing and it sure didn't reach down into anything I encountered.
Looking back through Google, I may be dimly remembering the calculations dealing with turbulent flow. It was a whole different experience for a college junior who was used to the relatively simple equations from Physics, Statics, and Dynamics.
Sounds like the standard way to teach fluids for civ.eng. All the flows you'll ever encounter will be turbulent, no point in learning about all the beautiful results that are mainly for laminar flows.
The books I mentioned are classics in the field and were first published in 1895 and 1967, respectively. Both are still in print. No computers, just advanced math (vector calculus etc).
Unlike in other cases, with protein structure we really only care about the final structure; most of the time no one cares how the chain folded to get there. ML-based methods seem perfect for that.
Not an expert but I have read a lot on the subject recently.
Methods in this domain were already not truly model-based (compared to things like physical equations). It is mostly observations on existing proteins coupled with a form of gradient descent, so for this particular application things have not degraded. (I am aware that there is more to it; this is just a quick summary.)
But to be honest, it is far from a solved problem and I expect more breakthroughs from deep understanding and modeling than ML.
> I expect more breakthroughs from deep understanding and modeling than ML
Huh, I would've said exactly the opposite. The feature space of the variations of amino acid sequences is so big that I wouldn't bet on breakthroughs derived from understanding. Most recent advances seem to focus on how specific sequence motifs interact with each other, which are generally only applicable to certain kinds of proteins, but not protein folding as a whole.
My prediction would be that the next breakthroughs that push the field over the usability threshold will derive from more general machine learning advances.
I think deep learning can only get us so far with the existing, limited, data. I doubt it will be enough for precise structure reconstruction (but I would love to be wrong).
What it can be is a force multiplier for every advance in our understanding of why those sequences fold into those structures.
But I have to admit that, with years of head start, the research has not gotten there, so I might be wrong.
> Now we might have a model that fulfills (2) but we cannot be sure because we used a black-box approach.
It's not like the alternative is any different. Being sure you have the right answer is not viable, the other methods of getting there are just more transparent heuristics.
This is 90% of science these days: just keep tweaking numbers in the model until it fits what you want to see. The why and the how is not really something we have the capability or tooling to tackle.
I got bored with physics really quickly when I realized this is what everyone does.
What do you mean by “black box” model? How do you know when a model “explains” something or not? How do you know if you are missing a key feature or degree of freedom in a model? Suppose you have two models, A and B, where A gives much poorer predictions than B. Is it possible for A to “explain” things better when it is not capable of producing adequate predictions? Since nature itself is not boiled down to logical elements like math proofs, what about the inherent ambiguity of language or mental models when assessing explainability? Did we really explain anything, or just decompose a problem into units of thought that artificially feel compelling for human minds at this moment of history? What would distinguish a “real explanation” from a convenient fiction, if not purely predictive capability?
Black box model is a perfectly well understood term of art.
It means a model which has a somewhat opaque internal working. A lot of modern ML approaches treat the model as a black box in that it is not particularly clear how the features actually interact to reach the prediction.
I have worked professionally in machine learning for over ten years now, and I deeply dispute what you say; many other practitioners do as well.
> “not particularly clear how the features actually interact to reach the prediction.“
This is quite true of many models, such as linear regression in the wild. You may also have a clear but wrong picture of how features interact, e.g. looking at coefficients of interaction terms in a misspecified linear model.
See for example, “The Mythos of Model Interpretability”
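A tiny illustration of a related failure mode of misspecified linear models, on synthetic data made up for this example: the true signal is a pure interaction the model can't represent, and the fitted coefficients tell a confident but wrong "nothing matters" story.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # True relationship: y = x1 * x2 (pure interaction). Fit a linear model with
    # only main effects; its coefficients come out near zero, "clearly" showing
    # that neither feature matters - an interpretable and misleading explanation.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 2))
    y = X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=1000)

    model = LinearRegression().fit(X, y)
    print("coefficients:", model.coef_)   # both approximately 0
    print("R^2:", model.score(X, y))      # also approximately 0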
Without a coherent definition of what “not black box” or “explainable” means scientifically, the buzzword of “black box” is also meaningless and is more of a political game to be a gatekeeper over what models are allowed to be used than any kind of honest intellectual inquiry.
I’ve worked professionally on systems where misspecified linear models and text models using simple ngram boosting were vastly more inscrutable than comparable neural networks or dimensionality reduction models for the same applications.
Nobody has any scientifically cogent idea of what makes a model “explainable” other than arguing semantics.
It may be because I'm not an expert that I think I understand, but the gist of making a model explainable sounds straightforward from what I've read. You train a model that does who knows what, and then you define a very limited language and use ML to match the first model as well as possible. If the language is simple enough, humans can understand it, and according to what I read, it likely generalizes better than the original model as a bonus. I assume it's harder than it sounds though.
Quote:
"1. There’s a tiny functional language based on a small number of side-effect free combinators
2. For a given task, a program template (which the authors call a sketch), further constrains the set of programs that can be learned for the problem in hand. This also very handily constrains the search space of course, helping to make learning a suitable policy program tractable.
3. To help guide the search within the set of programs conforming to the sketch, a standard reinforcement learning algorithm is used to learn a (black box) policy.
4. The black box policy is used as an oracle (the Neural Policy Oracle), and a neurally directed program search (NDPS) tries to find the sketch-conforming program that behaves as closely to the oracle as possible."
...this seems kind of similar to the recent success in solving mathematical equations using deep learning. The language translation paradigm seems to have a lot more potential than some realized.
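A much simpler cousin of that idea - not the neurally directed program search from the quote, just plain distillation of a black-box model into a small decision tree, which is one common way to approximate the "fit a limited language to the black box" step. Data and models here are toy examples for illustration only.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Train a "black box" model, then fit a small tree to *its predictions*
    # (not the original labels). The tree is the human-readable surrogate;
    # how faithfully it tracks the black box is measured by their agreement.
    X, y = make_classification(n_samples=2000, n_features=8, n_informative=4, random_state=0)
    black_box = GradientBoostingClassifier(random_state=0).fit(X, y)

    surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
    surrogate.fit(X, black_box.predict(X))

    fidelity = (surrogate.predict(X) == black_box.predict(X)).mean()
    print(f"surrogate agrees with black box on {fidelity:.0%} of inputs")
    print(export_text(surrogate, feature_names=[f"f{i}" for i in range(8)]))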
Yes, you are oversimplifying what is required to make a model explainable. The logical mechanism of the functional form of the model has very little to do with explainability. Explainability involves various kinds of model checking, overfitting analysis, tests for confounding variables, multicollinearity, interaction effects, model misspecification, etc.
Merely decomposing a prediction into say a linear combination of predictors does not, itself, provide any type of explanation unless a variety of more complicated assumptions about statistical model checking turn out to be true.
Even worse, because the naive approach is to just treat those regression coefficients as if they do automatically give explanatory power or feature-wise attribution, people misrepresent things unwittingly and don’t carry out enough robust model checking, leading to wrong explanations that appear to have deceptive degrees of confidence associated with them.
"The logical mechanism of the functional form of the model has very little to do with explainability"
I'm not sure we are talking about the same thing, or which of the three sources in my comment you are reacting to.
My understanding of "explainability" in this context is that it means "humans can understand why the model gives a certain prediction".
It seems like you are using it to mean "humans can understand why the thing modeled does something (presumably by referring to the model)". That isn't what I thought people meant by the term, nor does it seem like a reasonable goal to me.
Do we agree these are distinct ideas? Feel free to elaborate on where I've gone wrong.
The fact that the software in question is not publicly available or runnable certainly makes it a "black box", regardless of whether the underlying models are human-interpretable. EDIT: my apologies, apparently it is available: https://github.com/deepmind/deepmind-research/tree/master/al... I must have mixed this up with a different black-box-advertised-in-scientific-journal story.
Section 4.1 of the paper he linked discusses extensively the limitations of linear model interpretability. Here's an example:
"With respect to algorithmic transparency, [the claim that linear models are more interpretable]
seems uncontroversial, but given high dimensional or heavily engineered features, linear models lose simulatability or decomposability, respectively."
Given how many interesting problems are high-dimensional or solved through heavy feature engineering, you must be lucky to live in the happy space where problems are either low-dimensional enough to not need heavy feature engineering or where the client is already familiar with the high-dimensional features you're going to limit the model to using.
I especially enjoy the segments where he addresses, head-on, "an indictment of academic science" and "an indictment of pharma". He pulls no punches in saying how embarrassing it is for pharma and academia to be literally outclassed by DeepMind.
A great quote:
"If you think I’m being overly dramatic, consider this counterfactual scenario. Take a problem proximal to tech companies’ bottom line, e.g. image recognition or speech, and imagine that no tech company was investing research money into the problem. (IBM alone has been working on speech for decades.) Then imagine that a pharmaceutical company suddenly enters ImageNet and blows the competition out of the water, leaving the academics scratching their heads at what just happened and the tech companies almost unaware it even happened."
Nobody is embarrassed here. Pharma doesn't work on protein folding prediction. Now they can take the published results and code and use them, but protein fold prediction has not been, is not, and probably never will be the rate-limiting step in novel drug discovery and development.
Really good read! They requested the author submit it as a journal letter. This quote stuck out to me:
"Keep in mind that unlike other areas of machine learning, new protein structures are not appearing at an increasing rate, and so waiting things out will not help."
"The resulting algorithm outperformed all entrants at the most recent blind assessment of methods used to predict protein structures, generating the best structure for 25 out of 43 proteins, compared with 3 out of 43 for the next-best method."
This is remarkable. Teams of researchers all over the world have taken part in the CASP competitions for decades. Many attempts using machine learning and ANNs have been made. What is it about DeepMind that allowed them to make such a breakthrough? Do they have expertise in deep learning that does not exist in academia? Incredible amounts of compute that academia cannot afford?
The techniques DM used are popular in academia right now, too. Using evolutionary data to shortcut hard problems has been key to advances in protein research for decades. DM just executed better - a combination of smart people, some good ideas, and lots of experimentation. Never underestimate the ability of a company that exists to win games to win competitions.
And never underestimate the amount of money that a big tech company can throw at a random problem. DeepMind probably blew through the equivalent of multiple R01 grants writing that paper.
If their salaries are anything like what Bay Area companies are shelling out for top AI engineers, each one of those 10 people is probably costing as much as 10 grad students in any of the other labs working on this problem. Big Biotech does not usually have the money to get into a bidding war for engineering talent with companies like Google.
"There are dozens of academic groups, with researchers likely numbering in the (low) hundreds, working on protein structure prediction. We have been working on this problem for decades, with vast expertise built up on both sides of the Atlantic and Pacific, and not insignificant computational resources when measured collectively. For DeepMind’s group of ~10 researchers, with primarily (but certainly not exclusively) ML expertise, to so thoroughly route everyone surely demonstrates the structural inefficiency of academic science."
"What is worse than academic groups getting scooped by DeepMind? The fact that the collective powers of Novartis, Pfizer, etc, with their hundreds of thousands (~million?) of employees, let an industrial lab that is a complete outsider to the field, with virtually no prior molecular sciences experience, come in and thoroughly beat them on a problem that is, quite frankly, of far greater importance to pharmaceuticals than it is to Alphabet. It is an indictment of the laughable “basic research” groups of these companies, which pay lip service to fundamental science but focus myopically on target-driven research that they managed to so badly embarrass themselves in this episode."
I completely disagree with his interpretation. It would be surprising if a group that concentrates some of the top expertise in AI weren't able to make a big impact on a well-defined optimization problem that has been studied for decades.
I think a lot of the commentary is missing two essential points:
1. Protein structure prediction is to a large extent a solved problem for small-ish, soluble targets. AlphaFold is a significant improvement on the current state of the art, but the state of the art was already far enough along that the best computational models in 2007 were good enough to bootstrap experimental structure determination (https://www.ncbi.nlm.nih.gov/pubmed/17934447). In other words, it's not like the entire academic community was stumbling around helplessly in the dark.
2. The value of these predictions to pharmaceutical companies is extremely marginal. Having a high-accuracy model is very helpful but it's rare that the researchers have so little information available that a completely de-novo prediction is necessary. And when they really don't have much information at all, it's usually because the target is sufficiently messy to defy traditional structure determination methods - which means it's almost certainly more than AlphaFold can handle too.
I had the protein folding project running on my computer for a few years. Can these deep learning models be distributed like that, or do they require tightly coupled processors? Seems like the latter, as there was a recent IEEE article on a wafer-scale array of CPUs for deep learning.
Unless I misunderstood (likely, I'm not smart), you still need to calculate the closeness field for a given protein. And that will be thousands of points, each calculated separately. Only once you have that do you give it to ML to find the thermodynamically favourable value.
Folding@home is working on a different problem. Folding@home finds the path a protein takes to arrive at its final structure, not just the final structure. One of the main goals is to understand protein misfolding or where on the path things go astray to result in disease.
After all these years, we are still at it. New methods are regularly evaluated and the simulation software is being refined all the time.
Yes, I was wondering about this too. It's been going on for a very long time – I remember it was shipped with the PS3, I used to leave mine running sometimes to contribute to the effort.