DNA language model performance hinges on pre-training data choices

Researchers at The University of Texas MD Anderson Cancer Center have performed a comprehensive evaluation of five artificial intelligence (AI) models trained on genomic sequences, known as DNA foundation language models. These comparisons provide valuable insights into their strengths and weaknesses and offer a framework for selecting appropriate models based on specific genomic tasks.

The study, published in Nature Communications, was led by Chong Wu, Ph.D., assistant professor of Biostatistics and affiliate of the Institute for Data Science in Oncology; and Peng Wei, Ph.D., professor of Biostatistics.

"Our benchmarking study demonstrates that choices, such as pre-training data, sequence length and how we summarize model embeddings, can shift performance as much as changing the DNA language model itself. This kind of rigorous benchmarking is critical to ensure DNA language models are used in a transparent, reproducible way as they move closer to supporting clinical decision-making," Wu said.

What are DNA language models and what are they used for?

DNA language models are AI tools trained on large amounts of genomic data to identify and predict patterns in DNA sequences. In this study, the researchers focused on the models' ability to make predictions for queries they were not explicitly trained on, which can provide insights into their problem-solving abilities.

Ideally, these models can predict gene function and interactions, as well as protein folding, so that their predictions can inform personalized testing and treatment.

What did the researchers evaluate in this study?

The researchers compared how well five different DNA foundation language models could perform across 57 diverse datasets. They measured the ability of these models to identify important genomic components, to predict how strongly a gene will be expressed, and to determine if genes contain harmful mutations that could lead to diseases.

The researchers also examined how different pre-training variables, such as using multi-species or human-only data, can affect the results.
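One of the choices Wu highlights, how model embeddings are summarized, can be illustrated with a toy sketch. A DNA language model typically produces one embedding vector per token of the input sequence, and a downstream task needs a single vector per sequence. The Python below shows three common ways to collapse the per-token vectors into one. The function names and the numbers are hypothetical, and whether these are the exact strategies compared in the study is an assumption, not something stated here.

```python
from typing import List

# One row per token, one column per embedding dimension.
Matrix = List[List[float]]


def mean_pool(token_embeddings: Matrix) -> List[float]:
    """Average each embedding dimension across all tokens."""
    n_tokens = len(token_embeddings)
    return [sum(column) / n_tokens for column in zip(*token_embeddings)]


def max_pool(token_embeddings: Matrix) -> List[float]:
    """Keep the largest value of each embedding dimension across tokens."""
    return [max(column) for column in zip(*token_embeddings)]


def first_token(token_embeddings: Matrix) -> List[float]:
    """Use the first token's vector as the summary of the whole sequence."""
    return list(token_embeddings[0])


# Hypothetical per-token embeddings for a three-token sequence
# (illustrative numbers only, not output from any real model).
toy = [
    [1.0, 0.0, 2.0],
    [3.0, 4.0, 0.0],
    [2.0, 2.0, 1.0],
]

print("mean pooling: ", mean_pool(toy))
print("max pooling:  ", max_pool(toy))
print("first token:  ", first_token(toy))
```

The three summaries of the same sequence differ, so a classifier trained on top of them can behave differently even though the underlying model is unchanged.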

What did the researchers learn from their evaluation?

Each model had strengths and weaknesses depending on the task. For example, some models were more efficient at identifying genomic components but less effective at predicting gene expression than other, more specialized models.

The study highlights that these models can read long stretches of DNA and are skilled at identifying potentially harmful mutations, even though they weren't directly trained to do so. The researchers noted that the models also performed well on multi-species data, though they tended to perform best on the species they saw most during training.

How can these results be applied to precision medicine?

By laying out where each of the five DNA foundation models performs well and where it falls short, the study can guide researchers and clinicians in selecting the appropriate models for tasks that can personalize genetic testing and treatment.

Journal reference:

Wu, J., & Lin, L. (2025). Benchmarking DNA foundation models for genomic and genetic tasks. Nature Communications. DOI:10.1038/s41467-025-65823-8. https://www.nature.com/articles/s41467-025-65823-8.
