Date: 2026-03-05

Trained on genomic data spanning the tree of life, Evo 2 reveals how artificial intelligence can decode the rules of DNA and even generate new functional sequences, opening the door to programmable biology.

Study: Genome modelling and design across all domains of life with Evo 2.

In a recent study published in the journal Nature, researchers at the Arc Institute describe the development and advantages of “Evo 2”, a novel biological foundation model trained on approximately 9 trillion DNA base pairs, with versions containing up to 40 billion parameters.

Representing a significant advance over previous artificial intelligence (AI) models, Evo 2 draws on training data spanning bacteria to humans. Its genomic context window can reach up to 1 million nucleotides, allowing it to analyze the intricate, long-range dependencies that govern gene function.

The findings show that the model performs strongly across multiple genomic prediction tasks, including predicting the functional impact of coding and non-coding variants and of splice-related changes. It can also design novel DNA sequences at genome scale, such as mitochondrial genomes and large microbial or eukaryotic genomic segments.

Background: Limits of Reductionist Biology Approaches

Traditional biology research has largely been a reductionist pursuit in which complex systems (e.g., the central nervous system [CNS]) are broken down into their most basic parts (genes and proteins) to elucidate their mechanistic underpinnings.

Recent research, however, argues that this approach fails to account for life being orchestrated by complex interactions spanning vast distances within the genome. These studies demonstrate that a target gene’s activity is often controlled by "regulatory" DNA located thousands of bases away. Unfortunately, this substantial layer of complexity often exceeds classical human intuition, complicating efforts to develop a holistic model of biological functioning.

Artificial Intelligence and Genome-Scale Modeling Challenges

Unprecedented advances in computational processing power and the simultaneous arrival of the AI era have partially addressed these complications. AI models can take a holistic view of interactions among individual biological entities and have consequently revolutionized protein structure prediction.

Unfortunately, elucidating the roles of previously unidentified regions of the genome, particularly the non-coding regions, remains a significant hurdle even for today’s cutting-edge AI models.

Researchers attribute these limitations to a challenge of scale: to truly understand and engineer biology, a model would need to ingest and analyze the diversity of the entire biosphere to uncover the fundamental rules that distinguish a functioning genome from random molecular noise.

Evo 2 Model Development and Training Dataset

In the present study, researchers report the development and testing of “Evo 2”, a novel generalist biological foundation model. The model was trained on a massive, curated dataset called OpenGenome2, comprising approximately 8.8 trillion nucleotides from bacteria, archaea, eukaryotes, and bacteriophages. Viruses that infect eukaryotic hosts were intentionally excluded for biosafety reasons.

Unlike previous models in the field, Evo 2’s architecture (called “StripedHyena 2”) adopts a hybrid computational design that combines convolutional and attention mechanisms, greatly expanding its “context window”. The research team likens this to holding hundreds of pages of the genomic “novel” in memory at once, whereas previous models could at best analyze a few paragraphs at a time.

Evo 2’s performance was evaluated on two primary frontiers: (1) prediction, which means determining whether a specific DNA mutation or other genetic variant can result in disease or loss of function, and (2) generation, which means the guided de novo design of synthetic DNA sequences.

Predictive Performance in Genetic Variant Analysis

In “zero-shot” tasks, where Evo 2 was required to predict outcomes without any training specific to the task, the model successfully identified pathogenic (disease-causing) mutations in humans.
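One common way to score a variant zero-shot is to ask how much less (or more) likely a sequence model finds the mutated sequence compared with the reference. The sketch below illustrates that idea with a tiny order-2 Markov model standing in for a genomic language model. It does not use Evo 2 or its API, and the names `train_markov`, `log_likelihood`, and `variant_score`, along with the toy training data, are assumptions made for illustration.

```python
import math
from collections import defaultdict


def train_markov(sequences, k=2):
    """Count (k-base context -> next base) transitions over the training sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(k, len(seq)):
            counts[seq[i - k:i]][seq[i]] += 1
    return counts


def log_likelihood(seq, counts, k=2, alphabet="ACGT"):
    """Sum of log P(base | previous k bases), using add-one smoothing."""
    total = 0.0
    for i in range(k, len(seq)):
        ctx = counts.get(seq[i - k:i], {})
        numerator = ctx.get(seq[i], 0) + 1
        denominator = sum(ctx.values()) + len(alphabet)
        total += math.log(numerator / denominator)
    return total


def variant_score(ref, pos, alt, counts, k=2):
    """Zero-shot score: log-likelihood of the variant minus that of the reference."""
    var = ref[:pos] + alt + ref[pos + 1:]
    return log_likelihood(var, counts, k) - log_likelihood(ref, counts, k)


# Toy stand-in for training data: a strongly repeated motif plays the role
# of conserved sequence (hypothetical, not real genomic data).
training = ["ACGTACGTACGTACGTACGT"] * 5
model = train_markov(training)

reference = "ACGTACGTACGT"
score = variant_score(reference, 5, "T", model)  # C -> T at 0-based position 5
print(f"delta log-likelihood: {score:.3f}")
```

A negative score means the stand-in model finds the variant less plausible than the reference, which is the signal a zero-shot predictor would interpret as potentially deleterious.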

When analyzing the breast cancer-linked BRCA1 gene, the model’s internal representations could be used to train a classifier that reached an area under the receiver operating characteristic curve (AUROC) of 0.95, outperforming the base model's zero-shot predictions. The model also accurately predicted the effects of non-single-nucleotide variants (complex mutations such as insertions and deletions), outperforming other tested models on these mutation classes in benchmark evaluations.
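For readers unfamiliar with the metric, AUROC can be read as the probability that a randomly chosen pathogenic variant receives a higher score than a randomly chosen benign one. The following is a minimal sketch of that pairwise definition; the labels and scores are made up for illustration and are not data from the study.

```python
def auroc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier outputs: 1 = pathogenic, 0 = benign.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.35, 0.4, 0.3, 0.2, 0.1]
example_auroc = auroc(labels, scores)
print(f"AUROC = {example_auroc:.3f}")  # 11 of the 12 pairs are ranked correctly
```

An AUROC of 0.5 corresponds to random guessing and 1.0 to perfect separation, so a value of 0.95 indicates that nearly all pathogenic/benign pairs are ordered correctly.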

Generative DNA Design and Experimental Validation

Analyses of Evo 2’s generative capabilities revealed that the model could generate complete mitochondrial genomes and sequences resembling bacterial and yeast chromosomes that maintained natural biological architecture in silico, although such generated sequences do not necessarily represent replication-competent genomes.

Furthermore, when coupled with guidance from external predictive models and search algorithms, Evo 2 could design DNA sequences with specified chromatin accessibility patterns in mouse cells, and even encoded Morse code messages ("LO", "ARC", "EVO2") in the DNA's physical accessibility patterns. These designs were experimentally validated in mouse embryonic stem cells using chromatin accessibility assays (AUROC of 0.92–0.95), demonstrating that the generated DNA functioned as intended within living cells.
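To make the Morse code idea concrete, the sketch below turns a message into a target track of open (1) and closed (0) bins, the kind of pattern a guided design procedure could then try to match. The bin widths chosen for dots, dashes, and gaps, and the function name `accessibility_target`, are illustrative assumptions and are not taken from the study.

```python
# Morse code for the characters used in the reported messages.
MORSE = {
    "A": ".-", "C": "-.-.", "E": ".", "L": ".-..",
    "O": "---", "R": ".-.", "V": "...-", "2": "..---",
}


def accessibility_target(message, dot=1, dash=3, symbol_gap=1, letter_gap=3):
    """Return a list of 0/1 bins: open runs encode dots and dashes, closed runs encode gaps."""
    track = []
    for letter_index, letter in enumerate(message.upper()):
        if letter_index > 0:
            track.extend([0] * letter_gap)  # closed region between letters
        for symbol_index, symbol in enumerate(MORSE[letter]):
            if symbol_index > 0:
                track.extend([0] * symbol_gap)  # closed region between symbols
            width = dot if symbol == "." else dash
            track.extend([1] * width)  # open region for a dot or dash
    return track


target = accessibility_target("LO")
print("".join(map(str, target)))
```

In this encoding a narrow open region reads as a dot and a wide one as a dash, so the message can in principle be read back from a measured accessibility profile.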

Finally, interpretability tools revealed that specific artificial neurons in Evo 2 had spontaneously learned to recognize biological features. Evo 2 generated candidate regulatory regions that showed a statistically significant enrichment of transcription factor motifs (P = 3.6 × 10⁻⁷), confirming the model was capturing biologically meaningful regulatory patterns rather than producing random sequences.
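The article does not say which statistical test produced this P value. As one plausible illustration of how motif enrichment can be quantified, the sketch below computes a one-sided hypergeometric tail probability: the chance of seeing at least the observed number of motif-containing sequences if the generated set were a random draw from a background pool. The function name and all counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(hits_in_set, set_size, hits_in_background, background_size):
    """P(X >= hits_in_set) when set_size sequences are drawn without replacement
    from a background in which hits_in_background sequences contain the motif."""
    total = comb(background_size, set_size)
    upper = min(set_size, hits_in_background)
    tail = sum(
        comb(hits_in_background, k)
        * comb(background_size - hits_in_background, set_size - k)
        for k in range(hits_in_set, upper + 1)
    )
    return tail / total


# Hypothetical counts: 100 of 1,000 background sequences carry the motif,
# while 15 of 50 generated sequences do (about 5 would be expected by chance).
p_value = hypergeom_enrichment_p(15, 50, 100, 1000)
print(f"enrichment P = {p_value:.2e}")
```

A very small tail probability means the motif appears far more often in the generated set than random sampling from the background would explain.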

Conclusions and Implications for Programmable Biology

Evo 2 represents a paradigm shift from analyzing isolated biological components to modeling the holistic complexity of genomes. Its extensive context window and mechanistic advancements enable it to elucidate universal patterns of evolution and generalize from single-celled organisms to humans.

To foster innovation, the researchers have made the model parameters, code, and dataset fully open source, thereby democratizing access to this cutting-edge technology. Evo 2’s development marks a significant step toward a future where biology is not just studied, but programmable.