About 10 years ago, Žiga Avsec was a PhD physics student who found himself taking a crash course in genomics via a university module on machine learning. He was soon working in a lab that studied rare diseases, on a project aiming to pin down the exact genetic mutation that caused an unusual mitochondrial disease.
This was, Avsec says, a “needle in a haystack” problem. There were millions of potential culprits lurking in the genetic code—DNA mutations that could wreak havoc on a person’s biology. Of particular interest were so-called missense variants: single-letter changes to genetic code that result in a different amino acid being made within a protein. Amino acids are the building blocks of proteins, and proteins are the building blocks of everything else in the body, so even small changes can have large and far-reaching effects.
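To make the idea of a missense variant concrete, here is a minimal Python sketch. It is an illustration only, not part of AlphaMissense: the function name and the three-entry codon table are assumptions made for this example. It uses the well-known sickle cell substitution, in which the DNA codon GAG (glutamic acid) becomes GTG (valine), and contrasts it with a single-letter change that leaves the amino acid unchanged.

```python
# Deliberately tiny codon table: only the codons needed for this illustration.
# A real table has 64 entries, including stop codons, which are not handled here.
CODON_TO_AMINO_ACID = {
    "GAG": "Glu",  # glutamic acid
    "GAA": "Glu",  # also glutamic acid
    "GTG": "Val",  # valine
}


def classify_substitution(codon, position, new_base):
    """Apply a single-letter change to a codon and report its effect.

    Hypothetical helper for illustration: returns the mutated codon, the
    amino acid before and after, and whether the change is 'missense'
    (different amino acid) or 'synonymous' (same amino acid).
    """
    mutated = codon[:position] + new_base + codon[position + 1:]
    before = CODON_TO_AMINO_ACID[codon]
    after = CODON_TO_AMINO_ACID[mutated]
    kind = "missense" if before != after else "synonymous"
    return mutated, before, after, kind


# Sickle cell substitution: the middle letter A becomes T.
sickle = classify_substitution("GAG", 1, "T")
# A change to the last letter that still encodes glutamic acid.
silent = classify_substitution("GAG", 2, "A")

print(sickle)
print(silent)
```

Both changes alter exactly one letter, but only the first swaps the amino acid, which is what makes it a missense variant.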
There are 71 million possible missense variants in the human genome, and the average person carries more than 9,000 of them. Most are harmless, but some have been implicated in genetic diseases such as sickle cell anemia and cystic fibrosis, as well as more complex conditions like type 2 diabetes, which may be caused by a combination of small genetic changes. Avsec started asking his colleagues: “How do we know which ones are actually dangerous?” The answer: “Well largely, we don’t.”
Of the 4 million missense variants that have been spotted in humans, only 2 percent have been categorized as either pathogenic or benign, through years of painstaking and expensive research. It can take months to study the effect of a single missense variant.
Today, Google DeepMind, where Avsec is now a staff research scientist, has released a tool that can greatly speed up that process. AlphaMissense is a machine learning model that analyzes missense variants and predicts how likely each one is to cause disease, with 90 percent accuracy, which is better than existing tools.
It’s built on AlphaFold, DeepMind’s groundbreaking model that predicts the structures of hundreds of millions of proteins from their amino acid composition, but it doesn’t work in the same way. Instead of making predictions about the structure of a protein, AlphaMissense operates more like a large language model such as OpenAI’s ChatGPT.
It has been trained on the language of human (and primate) biology, so it knows what normal sequences of amino acids in proteins should look like. When it’s presented with a sequence gone awry, it can flag it, much as a reader would notice an incongruous word in a sentence. “It’s a language model but trained on protein sequences,” says Jun Cheng, who, with Avsec, is co-lead author of a paper published today in Science that announces AlphaMissense to the world. “If we substitute a word from an English sentence, a person who is familiar with English can immediately see whether these substitutions will change the meaning of the sentence or not.”
Pushmeet Kohli, DeepMind’s vice president of research, uses the analogy of a recipe book. If AlphaFold was concerned with exactly how ingredients might bind together, AlphaMissense predicts what might happen if you use the wrong ingredient entirely.