AI trained on 9 trillion DNA letters predicts harmful mutations and designs new genomes

Trained on genomic data spanning the tree of life, Evo 2 reveals how artificial intelligence can decode the rules of DNA and even generate new functional sequences, opening the door to programmable biology.

Study: Genome modelling and design across all domains of life with Evo 2. Image Credit: ZinetroN / Shutterstock

In a recent study published in the journal Nature, researchers at the Arc Institute describe the development and advantages of “Evo 2”. Evo 2 is a novel biological foundation model trained on a massive dataset of approximately 9 trillion DNA base pairs and implemented in models containing up to 40 billion parameters.

Representing a significant advance over previous genomic artificial intelligence (AI) models, Evo 2 draws on training data spanning bacteria to humans. Its genomic context window can reach up to 1 million nucleotides, allowing it to analyze the intricate, long-range dependencies that govern gene function.

The findings show that the model performs strongly across multiple genomic prediction tasks, including predicting the functional impact of coding and non-coding variants and of splice-related changes. It can also design novel DNA sequences at genome scale, for example, mitochondrial genomes and large microbial or eukaryotic genomic segments.

Background: Limits of Reductionist Biology Approaches

Traditional biology research has been characterized as a reductionist pursuit in which complex systems (e.g., the central nervous system [CNS]) are broken down into their most basic parts (genes and proteins) to elucidate their mechanistic underpinnings.

Recent research, however, argues that this approach fails to account for life being orchestrated by complex interactions spanning vast distances within the genome. These studies demonstrate that a target gene’s activity is often controlled by "regulatory" DNA located thousands of bases away. Unfortunately, this substantial layer of complexity often exceeds classical human intuition, complicating efforts to develop a holistic model of biological functioning.

Artificial Intelligence and Genome-Scale Modeling Challenges

Unprecedented advances in computational processing power and the simultaneous advent of the artificial intelligence (AI) age have partially addressed these complications. AI models can adopt a holistic view of interactions among individual biological entities and have consequently revolutionized protein structure prediction.

Unfortunately, elucidating the roles of previously unidentified regions of the genome, particularly the non-coding regions, remains a significant hurdle even for today’s cutting-edge AI models.

Researchers have attributed these limitations to the challenge of scale: to truly understand and engineer biology, a model would have to ingest and analyze the diversity of the entire biosphere to unravel the fundamental rules that distinguish a functioning genome from random molecular noise.

Evo 2 Model Development and Training Dataset

In the present study, researchers report the development and testing of “Evo 2”, a novel generalist biological foundation model. The model was trained on a massive, scientifically curated dataset called OpenGenome2, comprising approximately 8.8 trillion nucleotides from bacteria, archaea, eukaryotes, and bacteriophages while intentionally excluding viruses that infect eukaryotic hosts for biosafety considerations.

Unlike previous implementations in the field, Evo 2’s model architecture (called “StripedHyena 2”) adopts a hybrid computational design that combines convolutional and attention mechanisms, greatly expanding its “context window”. The research team likens this to holding hundreds of pages of the genomic “novel” in memory, whereas previous models could at best analyze a few paragraphs at a time.

Evo 2’s performance was evaluated on two primary frontiers: (1) prediction, determining whether a specific DNA mutation or other genetic variant can result in disease or loss of function; and (2) generation, the guided de novo design of synthetic DNA sequences.

Predictive Performance in Genetic Variant Analysis

Study findings revealed that, in scenarios where Evo 2 was required to predict outcomes without specific training on that task (“zero-shot” tasks), the model successfully identified pathogenic (disease-causing) mutations in humans.
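One common way to obtain zero-shot variant scores from a sequence model is to compare how likely the model finds the mutated sequence relative to the reference. The sketch below illustrates only that idea. It assumes a toy Markov model as a stand-in for a genomic language model, and the names train_markov and variant_score are hypothetical; it is not Evo 2's actual interface, and the article does not spell out the study's exact scoring procedure.

```python
import math
from collections import Counter

def train_markov(sequences, k=2, alphabet="ACGT"):
    """Toy stand-in for a genomic language model (an assumption for
    illustration): an order-k Markov model with add-one smoothing."""
    context_counts = Counter()
    transition_counts = Counter()
    for seq in sequences:
        for i in range(k, len(seq)):
            ctx = seq[i - k:i]
            context_counts[ctx] += 1
            transition_counts[(ctx, seq[i])] += 1

    def log_likelihood(seq):
        # Sum of log P(base | previous k bases) over the sequence.
        total = 0.0
        for i in range(k, len(seq)):
            ctx = seq[i - k:i]
            num = transition_counts[(ctx, seq[i])] + 1
            den = context_counts[ctx] + len(alphabet)
            total += math.log(num / den)
        return total

    return log_likelihood

def variant_score(log_likelihood, reference, position, alt_base):
    """Log-likelihood of the variant minus that of the reference.
    No task-specific training is involved; more negative values mean
    the model finds the variant less plausible than the reference."""
    variant = reference[:position] + alt_base + reference[position + 1:]
    return log_likelihood(variant) - log_likelihood(reference)

# Made-up training sequences and reference, purely for demonstration.
training = ["ATGGCCATGGCCATGGCC", "ATGGCCATGGCC"]
ll = train_markov(training)
reference = "ATGGCCATGGCC"
delta = variant_score(ll, reference, 4, "T")  # substitute C with T at index 4
print(f"log-likelihood change for the variant: {delta:.3f}")
```

In this toy setting, the substitution breaks a pattern the model has seen repeatedly, so the score is negative. A real genomic model would be scored in the same spirit, but over far longer contexts.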

Encouragingly, for the breast cancer-linked BRCA1 gene, a classifier trained on the model’s internal representations outperformed the base model's zero-shot predictions (area under the receiver operating characteristic curve [AUROC] = 0.95). The model also accurately predicted the effects of non-single-nucleotide variants (complex mutations such as insertions and deletions), outperforming the other tested models on these mutation classes in benchmark evaluations.
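For readers unfamiliar with the metric, AUROC can be read as the probability that a randomly chosen pathogenic variant receives a higher score than a randomly chosen benign one, with ties counted as half. The minimal sketch below computes it by direct pairwise comparison. The labels and scores are invented for illustration and are not data from the study.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for the positive class (e.g. pathogenic), 0 for the negative.
    scores: classifier outputs, where higher means "more likely positive".
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))

# Invented example: three pathogenic and three benign variants.
example_labels = [1, 1, 1, 0, 0, 0]
example_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
print(f"AUROC = {auroc(example_labels, example_scores):.3f}")
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives a sense of scale for the 0.95 reported above.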

Generative DNA Design and Experimental Validation

Analyses of Evo 2’s generative capabilities revealed that the model could generate complete mitochondrial genomes and sequences resembling bacterial and yeast chromosomes that maintained natural biological architecture in silico, although such generated sequences do not necessarily represent replication-competent genomes.

Furthermore, when coupled with guidance from external predictive models and search algorithms, Evo 2 could design DNA sequences that folded into specific physical shapes in mouse cells and even encoded Morse code messages ("LO", "ARC", "EVO2") in the DNA's physical accessibility patterns. These designs were experimentally validated in mouse embryonic stem cells using chromatin accessibility assays (AUROC of 0.92-0.95), demonstrating that the generated DNA functioned as intended within living cells.
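To make the Morse code idea concrete, the sketch below turns a short message into a binary target track in which 1 marks a bin intended to be accessible and 0 a bin intended to be closed. The bin widths (dot = 1, dash = 3, gaps of 1 and 3) and the function name message_to_track are assumptions chosen for illustration; the article does not describe the exact encoding used in the study.

```python
# International Morse code for the characters in the three messages.
MORSE = {
    "A": ".-", "C": "-.-.", "E": ".", "L": ".-..",
    "O": "---", "R": ".-.", "V": "...-", "2": "..---",
}

def message_to_track(message, dot=1, dash=3, symbol_gap=1, letter_gap=3):
    """Return a list of 0/1 bins: 1 = accessible, 0 = closed.

    All widths are hypothetical illustration values, not parameters
    taken from the study.
    """
    track = []
    for letter_index, letter in enumerate(message):
        if letter_index > 0:
            track.extend([0] * letter_gap)       # closed gap between letters
        for symbol_index, symbol in enumerate(MORSE[letter]):
            if symbol_index > 0:
                track.extend([0] * symbol_gap)   # closed gap between symbols
            width = dot if symbol == "." else dash
            track.extend([1] * width)            # open region for dot or dash
    return track

for word in ("LO", "ARC", "EVO2"):
    print(word, "".join(str(b) for b in message_to_track(word)))
```

A target track of this kind is what a guided design procedure would try to match, with an external accessibility predictor scoring each candidate sequence against it.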

Finally, interpretability tools revealed that specific artificial neurons in Evo 2 had spontaneously learned to recognize biological features. Evo 2 also generated candidate regulatory regions that showed a statistically significant enrichment of transcription factor motifs (P = 3.6 × 10⁻⁷), indicating that the model was capturing biologically meaningful regulatory patterns rather than producing random sequences.
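An enrichment P value of this kind is often obtained from a one-sided hypergeometric test, which asks how surprising it would be to see so many motif-containing sequences in the generated set if motifs were spread at random. The sketch below shows that calculation under this assumption. The article does not state which test the study used, and the counts are invented.

```python
from math import comb

def hypergeom_tail(k, K, n, N):
    """P(X >= k) for X ~ Hypergeometric.

    N: total sequences (generated plus background)
    K: sequences in the total that contain the motif
    n: generated sequences
    k: generated sequences that contain the motif
    """
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

# Invented counts: 1,000 sequences in total, 100 with the motif,
# 50 generated sequences of which 30 contain the motif.
p_value = hypergeom_tail(30, 100, 50, 1000)
print(f"enrichment P value: {p_value:.3g}")
```

A small P value means the observed overlap would rarely arise by chance, which is the sense in which the enrichment reported above is statistically significant.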

Conclusions and Implications for Programmable Biology

Evo 2 represents a paradigm shift from analyzing isolated biological components to modeling the holistic complexity of genomes. Its extensive context window and mechanistic advancements enable it to elucidate universal patterns of evolution and generalize from single-celled organisms to humans.

To foster innovation, the researchers have made the model parameters, code, and dataset fully open source, thereby democratizing access to this cutting-edge technology. Evo 2’s development marks a significant step toward a future where biology is not just studied, but programmable.

Journal reference:
Genome modelling and design across all domains of life with Evo 2. Nature.

Written by

Hugo Francisco de Souza

Hugo Francisco de Souza is a scientific writer based in Bangalore, Karnataka, India. His academic passions lie in biogeography, evolutionary biology, and herpetology. He is currently pursuing his Ph.D. from the Centre for Ecological Sciences, Indian Institute of Science, where he studies the origins, dispersal, and speciation of wetland-associated snakes. Hugo has received, amongst others, the DST-INSPIRE fellowship for his doctoral research and the Gold Medal from Pondicherry University for academic excellence during his Masters. His research has been published in high-impact peer-reviewed journals, including PLOS Neglected Tropical Diseases and Systematic Biology. When not working or writing, Hugo can be found consuming copious amounts of anime and manga, composing and making music with his bass guitar, shredding trails on his MTB, playing video games (he prefers the term ‘gaming’), or tinkering with all things tech.

Citations

Please use one of the following formats to cite this article in your essay, paper or report:

  • APA

    Francisco de Souza, Hugo. (2026, March 05). AI trained on 9 trillion DNA letters predicts harmful mutations and designs new genomes. News-Medical. Retrieved on March 05, 2026 from https://www.news-medical.net/news/20260305/AI-trained-on-9-trillion-DNA-letters-predicts-harmful-mutations-and-designs-new-genomes.aspx.

  • MLA

    Francisco de Souza, Hugo. "AI trained on 9 trillion DNA letters predicts harmful mutations and designs new genomes". News-Medical. 05 March 2026. <https://www.news-medical.net/news/20260305/AI-trained-on-9-trillion-DNA-letters-predicts-harmful-mutations-and-designs-new-genomes.aspx>.

  • Chicago

    Francisco de Souza, Hugo. "AI trained on 9 trillion DNA letters predicts harmful mutations and designs new genomes". News-Medical. https://www.news-medical.net/news/20260305/AI-trained-on-9-trillion-DNA-letters-predicts-harmful-mutations-and-designs-new-genomes.aspx. (accessed March 05, 2026).

  • Harvard

    Francisco de Souza, Hugo. 2026. AI trained on 9 trillion DNA letters predicts harmful mutations and designs new genomes. News-Medical, viewed 05 March 2026, https://www.news-medical.net/news/20260305/AI-trained-on-9-trillion-DNA-letters-predicts-harmful-mutations-and-designs-new-genomes.aspx.
