DNA language model performance hinges on pre-training data choices

Researchers at The University of Texas MD Anderson Cancer Center have performed a comprehensive evaluation of five artificial intelligence (AI) models trained on genomic sequences, known as DNA foundation language models. These comparisons provide valuable insights into their strengths and weaknesses and offer a framework for selecting appropriate models based on specific genomic tasks.

The study, published in Nature Communications, was led by Chong Wu, Ph.D., assistant professor of Biostatistics and affiliate of the Institute for Data Science in Oncology; and Peng Wei, Ph.D., professor of Biostatistics.

"Our benchmarking study demonstrates that choices, such as pre-training data, sequence length and how we summarize model embeddings, can shift performance as much as changing the DNA language model itself. This kind of rigorous benchmarking is critical to ensure DNA language models are used in a transparent, reproducible way as they move closer to supporting clinical decision-making," Wu said.

What are DNA language models and what are they used for?

DNA language models are AI tools trained on large amounts of genomic data to identify and predict patterns in DNA sequences. In this study, the researchers focused on the models' ability to make predictions for queries they were not explicitly trained on, which can provide insights into their problem-solving abilities.

Ideally, these models can predict gene function and interactions, as well as protein folding, so that their predictions can inform personalized testing and treatment.

What did the researchers evaluate in this study?

The researchers compared how well five different DNA foundation language models performed across 57 diverse datasets. They measured the models' ability to identify important genomic components, to predict how strongly a gene will be expressed, and to determine whether genes contain harmful mutations that could lead to disease.

The researchers also examined how different pre-training variables, such as using multi-species or human-only data, can affect the results.

What did the researchers learn from their evaluation?

Each model had strengths and weaknesses that depended on the task at hand. For example, some models were better at identifying genomic components but less effective at predicting gene expression than other, more specialized models.

The study highlights that these models can read long stretches of DNA and are skilled at identifying potentially harmful mutations, even though they weren't directly trained to do so. The researchers noted that the models also performed well on multi-species data, though they performed best on the species they saw most during training.

How can these results be applied to precision medicine?

The study provides a comprehensive evaluation of the five DNA foundation models, offering valuable insights into their strengths and highlighting potential areas for improvement. These findings can guide researchers and clinicians in selecting the appropriate models for tasks that can personalize genetic testing and treatment.

Source:

University of Texas M. D. Anderson Cancer Center

Journal reference:

Wu, J., & Lin, L. (2025). Benchmarking DNA foundation models for genomic and genetic tasks. Nature Communications. DOI: 10.1038/s41467-025-65823-8. https://www.nature.com/articles/s41467-025-65823-8
