It doesn’t have much photogenic appeal or the glamour of seeing back in time, but a new development in computational biochemistry has been hailed as a discovery as important as the images of distant galaxies recently seen through the James Webb Space Telescope. Researchers at DeepMind—the AI company owned by Google’s parent company Alphabet, which developed the algorithm that can beat the best human experts at the strategic board game Go—have used an AI system they call AlphaFold to predict the shape of every protein molecule known to biological research: more than 200m of them, found in all manner of animals, plants, bacteria and other organisms.
Proteins are the molecules at the heart of life. Encoded in the DNA of every organism’s genome, they orchestrate the biochemical reactions that enable life to exist at all. AlphaFold’s predictions of their molecular structure have already been shown to be very reliable, thanks to comparisons with corresponding structures deduced previously by painstaking experiments that involve bouncing X-rays off protein crystals. Because such experiments are time-consuming and indeed not always possible (they have been performed for only around 150,000-170,000 proteins so far, including fewer than 20 per cent of all human proteins), the sudden availability of a good guess at the structure of every known protein offers an astounding shortcut for biological research.
This has led to predictions that the breakthrough will bring a tremendous boost for drug discovery and new medicines. Many drugs work by intervening in the activity of a protein, for example by binding to it and blocking its action. If you know the structure of a protein thought to be involved in some disease, say, it may be possible to figure out what kind of drug molecule will bind to it, like one jigsaw piece interlocking with another, and alter its behaviour, hopefully curing or ameliorating the disease. In this way, for example, some approved drugs against HIV block the action of certain key viral proteins that enable the virus to replicate in infected cells.
But despite some breathless headlines, we’re not about to enter a golden age of drug development. AlphaFold demonstrates the exciting potential of AI to assist and accelerate scientific research, but a sense of proportion is needed about its implications for real-world medicine. At this stage, the value of AlphaFold’s structural predictions will lie more with answering fundamental questions about the biology of proteins: How have they evolved? How are they related to one another? How does their shape dictate their biochemical function, and how might that function be altered by tinkering with the chemical structure?
The difficulty of determining protein structures has been considered one of the bottlenecks to drug development in the past. The technique of X-ray crystallography has been used to study proteins since the 1930s, although only from the 1960s did it really become possible to work out the atomic-scale structures of these fiendishly complicated molecules, and only since the advent of powerful computers have such solutions become routine. It’s a testimony to the importance of such work in crystallography that at least seven Nobel prizes have been awarded for it; it was how the double-helical structure of DNA, the genetic storehouse molecule that encodes protein structures, itself was deduced in 1953.
But to use crystallography to deduce a protein structure, you first must crystallise the pure protein—such an enormously challenging business that protein crystal growth is regarded as something of a black art. In the past few decades, vast machines have been constructed at great cost—particle accelerators that create so-called synchrotron radiation, and more recently “free-electron” lasers—to produce brighter X-ray beams that make it possible to perform crystallography on smaller samples. Over the past 30 years or so these methods have been supplemented, and now increasingly supplanted, by a technique called cryo-electron microscopy (cryo-EM), which can deduce the structures of proteins and other complex molecules by looking at the images made by firing electron beams at deep-frozen samples. This approach has been particularly helpful for the many proteins that won’t crystallise at all.
AlphaFold could seem to make much of this high-tech investment redundant. Instead of calculating the structure of proteins from patterns in the beams of X-rays or electrons that they scatter, the algorithm predicts an unknown structure by comparing information about the protein’s chemical make-up to that of others whose structure is already known.
Every protein molecule consists of a long string of chemical units called amino acids (of which there are 20 varieties in proteins), stitched together by chemical bonds like the links in a chain. Typically, this molecular chain folds up in a well-defined way to produce a compact shape, determined solely by the amino-acid sequence.
In theory, it should be possible to predict a protein’s three-dimensional shape just from knowing that sequence: which amino acid follows which. The problem is, we’ve never been able to figure out the rules governing how to get from the latter to the former. We have methods of deducing a protein’s sequence but can’t use that information to figure out the structure. This is known as the protein-folding problem, and has long been seen as one of the deepest and most pressing puzzles in molecular biology. For many years, scientists have been trying to predict the course of protein folding using computer simulations that begin with an unfolded chain. But our computers just aren’t powerful enough to give reliable results for the chains of typically several hundred amino acids in a protein.
AlphaFold doesn’t exactly solve that problem, but sidesteps it. The algorithm simply looks for correlations between a given sequence and a given structure within the database of known protein structures. With enough of that data to train it, AlphaFold can use what it has learnt to make predictions of how any given sequence will fold. It’s a little like predicting the outcome of a football match by looking at how the two teams have fared in past matches.
How good are these predictions? Crucially, for any given structural prediction, AlphaFold includes an estimate of how much confidence we can place in it. The initial version of AlphaFold was unveiled in 2020 with a demonstration that the algorithm could do far better than any rival method at predicting the structures of some benchmark proteins. Subsequently, the DeepMind team released the source code of the algorithm and reported that it could predict the structures of virtually all human proteins (our so-called proteome) and many of those in other organisms: 350,000 in all. (The algorithm itself judged that only around 58 per cent of those predictions offered a reliable picture of the protein’s general folded shape, however.) Now the team has run the algorithm for all 214m known proteins and says that 35 per cent of the predictions are deemed highly accurate, and another 45 per cent good enough to serve as a reliable guide for many research applications in biology. “Essentially you can think of it covering the entire protein universe,” said DeepMind’s CEO Demis Hassabis at a press briefing. “We’re at the beginning of a new era of digital biology.”
This is an astounding achievement. The AlphaFold researchers have collaborated with the European Molecular Biology Laboratory’s European Bioinformatics Institute (EMBL-EBI), based near Cambridge, to create a database of all their structural predictions that is free for anyone to access and use—to find with the equivalent of a Google search something that previously might have required years of scientific experiment.
“I’ve seen many of these moments come where you can sense the landscape shifting under you,” EMBL-EBI’s joint director Ewan Birney told New Scientist, “and this has been one of the fastest.”
It’s not surprising that this new development from AlphaFold has been greeted as little short of revolutionary. It certainly shows how today’s AI can tackle problems whose complexity had foxed all previous efforts.
But we shouldn’t exaggerate the practical implications at this point. For one thing, these protein structures are still only predictions, not experimental findings. Yes, many look sure to be reliable, but some may not be—and in the absence of direct experimental data on a structure, the prediction must be taken on trust. Stephen Curry, a biologist at Imperial College London who studies protein structures using crystallography, says that there is still plenty of work for those in his field to do to verify the predicted structures. But even here, the AlphaFold predictions might help by offering good starting points for calculating a structure from X-ray data that might be extremely hard to work out from scratch. “If I were just finishing up a PhD in crystallography,” Curry admits, “I’d be thinking very hard about my next career move.”
Another caveat is that the predictions (as with the output of all AI algorithms that spot patterns in training data using the approach called deep learning) are only as good as the data used to train the system. It is thought that among the protein structures not yet solved by crystallography and cryo-EM are some that look different from any seen so far. AlphaFold will not have the information it needs to make reliable predictions about these, which are likely to be among those for which the algorithm professes lower confidence.
The real limitations, however, are both more fundamental and more practical, and go to the heart of how we think about biology as an interplay of molecules. The notion widely touted in media reports of the AlphaFold results, that protein structures will unlock untold capacity to find drugs and cures, is rooted in the thinking of the 1980s and 90s and does not reflect what we know about either proteins or medicines today. Back then, “we greatly underestimated the gap between having a protein structure and having a drug,” says Ash Jogalekar, a medicinal chemist who works at the Santa Fe biotech startup OpenEye Scientific, which develops computational tools for drug discovery.
There are several reasons why this is so, some stemming from an outmoded view of protein structure itself. While it is true that many enzymes—protein molecules that act as catalysts, speeding up and controlling the outcome of biochemical reactions—do their job thanks to the exquisite shape of their compact folded form, we now know that many proteins in the human body do not have totally well-defined structures at all. They may contain sections of the chain that don’t adopt a specific folded shape but instead remain loose and floppy, like rubber bands—they are said to be intrinsically disordered. It has been estimated that 37-50 per cent of human proteins contain disordered regions. Some of these might be rather small segments of the chain, but some have a considerable amount of disorder. These will be among those for which AlphaFold can’t offer a confident structural prediction.
This lack of structure is, it seems, “by design”: the disorder has been “engineered” through natural selection, because it is essential to the job a protein does. For one thing, it can allow the molecule a great deal of flexibility: a capacity to change shape, which might be important to the way the protein interacts with other molecules (including other proteins) in the cell, or the way it functions to transmit signals through the molecular throng of a cell’s interior. Once considered a quirk of protein structure, the presence of disordered, largely unfolded parts of proteins is now seen as an important element of how they operate in cells.
This is particularly true for large animals like us. The amount of disorder in the entire set of proteins belonging to simpler organisms like bacteria is much smaller—perhaps 4 per cent or so. It’s widely thought that protein disorder is part of what has enabled greater complexity in living things because disordered proteins are less choosy about what they bind to, and can forge new molecular unions inside cells, rewiring the networks of protein interaction that influence how cells behave. Indeed, many of the proteins that act as “hubs” in these networks—rather like the people who act as hubs in social networks—have a lot of disorder, making them more apt to talk to others, so to speak. These hub proteins are some of the key molecular components of our cells, governing the activity of genes by turning them on and off and determining which other proteins get made. Protein disorder so profoundly challenges the old view, in which proteins have distinct folded shapes that determine their function, that some researchers fear that there might still be insufficient recognition of how widespread and significant it is, even within the fields of molecular and cell biology.
AlphaFold can’t predict meaningful structures for disordered proteins—because they don’t exist. This is a big problem for finding candidate drug molecules to target such proteins and tweak their activity—often an enticing goal, given how important they are in our cells. “Some of these key proteins have no small-molecule binding sites at all, which has made them spectacularly difficult to attack,” says Derek Lowe, who has worked in drug discovery at pharmaceutical giants such as Bayer, Novartis and Schering-Plough.
That’s not to say AlphaFold is powerless in such cases, however. For one thing, its structure predictions can help to pinpoint where disorder resides. What’s more, disordered proteins often acquire a structure once they bind to another molecule (such as another protein). In time, and if enough training data accumulates, AlphaFold might well be able to predict the structures of such larger unions of molecules. However, Jogalekar says that predicting how one protein binds to or interacts with another is “many orders of magnitude harder” than predicting the structures of individual molecules.
Even without this complication, a protein structure is no more than a helpful hint for a drug designer. If you’re trying to design a molecule that will bind to and block the action of a protein enzyme, say, you need to know how well the molecule will stick—its so-called binding affinity. “Knowing a protein structure does not mean you can accurately predict the [drug’s] binding affinity,” says Jogalekar. “What you could do is narrow down the list of drug-like molecules you might want to test. In that sense the structure can guide experiments. But the devil is in the details. Subtle changes in the binding site, even in the position of a single amino acid, can drastically change binding affinities.” Curry has pointed out that what AlphaFold deems an accurate structure prediction does not necessarily correspond to the accuracy needed for predicting good drug binding.
Besides, Jogalekar adds, “even if we could predict binding affinities from structure alone, that’s just the start for finding a drug.” A drug needs to have excellent pharmacological properties, he says: for example, ones that allow the body to absorb it and then later clear it from the system. “Most drugs fail in the later stages, in clinical trials. None of these protein structures allow you to predict all those downstream effects.”
Lowe agrees. “It’s helpful to know the protein structure, and it can lead to some hypotheses about what [drug candidate] compounds to make next,” he says. “But it doesn’t send you down a superhighway to the clinic, that’s for sure.”
“The biggest problems in drug discovery are the two things that lead to the high (around 85-90 per cent) clinical failure rate: picking the wrong [protein] target and finding out that your drug does something else bad that you didn’t anticipate,” Lowe adds. “Knowing protein structures does nothing for either of those. Likewise, other factors in optimising a drug such as absorption, metabolism, stability on storage, and so on, are not advanced by knowing protein structures either.”
For such reasons, Jogalekar says that AlphaFold’s work is “very valuable for basic biological research, for knowing evolutionary relationships of proteins with each other, for guessing and modelling the structures of related proteins, and so on. But it’s a long way from suddenly opening the way to cancer or Alzheimer’s drug design.” Lowe agrees: AlphaFold, he says, “is not going to suddenly turbocharge drug discovery, unfortunately.”
Both Lowe and Jogalekar compare this situation to the sequencing of the entire human genome through the Human Genome Project, which was essentially completed in 2001–2003. That was another stunning technological feat that yielded a wealth of valuable information for biological research, but it was accompanied by overzealous promises of gene-based medicines that have largely failed to materialise in the 20 years since. “That, too, was predicted to unlock a gold rush of new drug targets, as I well remember, but it did no such thing,” says Lowe.
For all the undoubted benefits AlphaFold’s database of protein structures will offer to researchers, then, the limitations of the work point to a general lesson about biology: zeroing in on the structure of individual proteins, like zeroing in on the sequences of individual genes, tells us little about how cells and organisms really work, or how to put right what goes wrong. Such granular information is necessary but not sufficient for that. To believe otherwise would be akin to the delusion (and scientists have made this mistake too) that once we know the position and the network connectivity of every single neuron in the brain, we will understand how this organ works and be able to cure its dysfunctions.
Ultimately, the risk is that of thinking that life has a code that can be cracked, whereupon all its secrets will spill out. Sadly, it’s not that simple.