Research investigates COVID-19 virus origin using artificial intelligence (AI)

The coronavirus responsible for the COVID-19 pandemic has spread across the globe with unprecedented speed and lethality, killing hundreds of thousands of people and forcing countries’ entire populations to self-quarantine.

The virus, technically termed severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), is believed to be zoonotic, but its origin is still in doubt. A new paper reports the use of artificial intelligence (AI) to solve the puzzle of the virus’s origin. The paper was published on the preprint server bioRxiv* in May 2020.

AI has been widely employed during the pandemic, with its uses ranging from rapid diagnostics to contact tracing to drug simulation. Its ability to rapidly compare, classify, and relate data has made it an invaluable tool. In fact, the researchers think AI may yet provide the key to developing a vaccine against the virus.

Using AI-aided cluster analysis to track SARS-CoV-2 origin

To find the origin of the virus, the team decided to compare its genome with those of preexisting organisms. They downloaded 334 complete genome sequences of the virus from the GenBank database, using samples taken across the world - 258 from the United States, 49 from China, and the remaining 27 from other countries. For each set, they used the first released complete mapping of the virus’s sequence from each country.

They also selected reference genomic sequences, such as those of alpha and beta coronaviruses, from GenBank and Virus-Host DB. Sequenced genomes of the Guangxi and Guangdong pangolin coronaviruses were downloaded from the GISAID database.

Study: Origin of Novel Coronavirus (COVID-19): A Computational Biology Study using Artificial Intelligence. Image Credit: 2630ben / Shutterstock

Altogether, three sets of reference genomes were selected at various taxonomic levels for use in a supervised decision tree method that has been recommended for the classification of novel pathogens. The method steps through the levels of classification from high to low, looking for the right slot for the SARS-CoV-2 genome at the genus and lower levels, along with its closest relatives.

The reference genomes at each taxonomic level were fed to the AI, along with the viral genome sequences.
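The top-down search described above can be pictured with a small sketch. It is not the study's decision tree method: it simply follows, at each level, the branch holding the reference sequence with the fewest mismatches to the query. The taxonomy, group names, and eight-letter sequences are all invented for illustration.

```python
def mismatches(a, b):
    """Count positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def leaf_sequences(node):
    """Collect every reference sequence stored beneath a taxonomy node."""
    if isinstance(node, list):
        return node
    return [s for child in node.values() for s in leaf_sequences(child)]

def place(query, node):
    """Descend level by level, following the child with the nearest reference."""
    path = []
    while isinstance(node, dict):
        best = min(node, key=lambda name: min(
            mismatches(query, s) for s in leaf_sequences(node[name])))
        path.append(best)
        node = node[best]
    return path

# Hypothetical two-level taxonomy with toy sequences (not real genomes).
taxonomy = {
    "group_1": {"subgroup_1a": ["AAAAAAAA"], "subgroup_1b": ["AAAACCCC"]},
    "group_2": {"subgroup_2a": ["GGGGGGGG"], "subgroup_2b": ["GGGGTTTT"]},
}

path = place("AAAACCCA", taxonomy)
print(path)
```

Real genomes differ in length and would need alignment or an alignment-free distance before any such comparison; the mismatch count here only works because the toy sequences are the same length.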

The AI-based analysis was then carried out by unsupervised clustering methods, using a hierarchical clustering algorithm along with density-based spatial clustering of applications with noise (DBSCAN). Two steps are involved: first, the algorithms are used to cluster the reference sequences alone; then, the same parameter values are used to cluster a mix of reference and SARS-CoV-2 genome sequences.
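A minimal sketch of the two-step idea follows, using a small pure-Python DBSCAN. How the study turned genomes into numbers is not described here, so the 2-mer frequency profile, the `eps` and `min_pts` values, and the toy sequences are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import product

def kmer_profile(seq, k=2):
    """Return normalized k-mer frequencies for a DNA sequence."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers) or 1
    return [counts[m] / total for m in kmers]

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN; returns one label per point, with -1 meaning noise."""
    n = len(points)
    labels = [None] * n

    def neighbors(i):
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nb = neighbors(i)
        if len(nb) < min_pts:
            labels[i] = -1            # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nb if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster   # noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbj = neighbors(j)
            if len(nbj) >= min_pts:   # only core points expand the cluster
                queue.extend(nbj)
    return labels

# Toy reference sequences (invented): two AT-rich, two GC-rich.
refs = ["ATATATATATATATATATAT", "ATATATATTTATATATATAT",
        "GCGCGCGCGCGCGCGCGCGC", "GCGCGCGCGGGCGCGCGCGC"]
query = "ATATATAATTATATATATAT"   # stands in for a new genome

params = dict(eps=0.2, min_pts=2)
# Step 1: cluster the references alone.
ref_labels = dbscan([kmer_profile(s) for s in refs], **params)
# Step 2: reuse the same parameters on references plus the query.
mixed_labels = dbscan([kmer_profile(s) for s in refs + [query]], **params)
print(ref_labels, mixed_labels)
```

Keeping the parameters fixed between the two runs is what makes the comparison meaningful: the reference clusters stay the same, so the label the query receives shows which existing group it joins.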

In other words, the method first shows the reference sequences with which the SARS-CoV-2 sequences group. Secondly, the settings are changed to observe corresponding changes in the groups formed. This will help pick up the nearest sequences to compare the similarities between genomes.

What Did the Study Find?

By progressively narrowing the search parameters, the team progressed from high to low levels of taxonomic classification. Beginning with the first reference set, which comprises viruses from 12 major classes at the highest level, the team found that the virus belonged to the Riboviria cluster, represented by the MERS virus (responsible for the MERS outbreak in 2012). Based on this data, they concluded that the coronavirus probably belonged to the Riboviria realm.

At the next level, they analyzed the clustering of SARS-CoV-2 against 12 virus families within the Riboviria. The results show that the viral genome groups with the Coronaviridae family. This family has four genera: Alpha-, Beta-, Gamma-, and Delta-coronavirus. SARS-CoV-2 belongs to the Beta-coronavirus genus.

Within this genus, among 37 reference sequences, SARS-CoV-2 clusters with the Sarbecovirus sub-genus. This contains mostly SARS coronaviruses and bat coronaviruses, but also five Guangxi pangolin sequences and one Guangdong pangolin sequence.

Interestingly, the study found that the amount of variation in the genetic code of the 334 samples, as compared to the reference samples, was practically constant for all the samples, which were collected across sixteen countries over a time period of three months.

With narrower cut-off parameters, SARS-CoV-2 continued to cluster with Sarbecovirus, even as this cluster itself separated into two. At a very low cut-off, SARS-CoV-2 clustered with only two viruses based on whole-genome analysis: bat CoV RaTG13 and Guangdong pangolin CoV.

On narrowing the search still further, the AI found only one virus that it could group with SARS-CoV-2: the bat CoV RaTG13. This could mean that bats are the most likely reservoir host of SARS-CoV-2.
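The effect of tightening the cut-off can be sketched with single-linkage grouping over a table of pairwise distances. The distances and the names `ref_a`, `ref_b`, and `ref_c` below are made up purely to show the mechanism; they are not values from the study, and the study's own clustering settings may differ.

```python
from itertools import combinations

def single_linkage_groups(names, dist, cutoff):
    """Group items connected by any chain of distances <= cutoff."""
    parent = {n: n for n in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        if dist[frozenset((a, b))] <= cutoff:
            parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), set()).add(n)
    return list(groups.values())

def companions(query, names, dist, cutoff):
    """Return the references that share a group with the query."""
    for group in single_linkage_groups(names, dist, cutoff):
        if query in group:
            return sorted(group - {query})
    return []

# Invented pairwise distances between a query and three references.
names = ["query", "ref_a", "ref_b", "ref_c"]
dist = {
    frozenset(("query", "ref_a")): 0.04,
    frozenset(("query", "ref_b")): 0.09,
    frozenset(("query", "ref_c")): 0.30,
    frozenset(("ref_a", "ref_b")): 0.10,
    frozenset(("ref_a", "ref_c")): 0.28,
    frozenset(("ref_b", "ref_c")): 0.27,
}

results = {c: companions("query", names, dist, c) for c in (0.35, 0.10, 0.05, 0.01)}
for cutoff, group in results.items():
    print(cutoff, group)
```

As the cut-off shrinks, the query's group loses members one at a time until the query stands alone, which mirrors the progression reported in the article.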

Greater horseshoe bat (Rhinolophus ferrumequinum). Image Credit: ATTILA Barsan / Shutterstock

However, with a still lower cut-off, the AI did not group the virus with any other organism. Does this mean that the virus could originate from neither bats nor pangolins?

The study says this is a “debatable question” because the SARS-CoV-2 and bat CoV RaTG13 (or Guangdong pangolin CoV, for that matter) genome sequences differ from each other by less than, for instance, bat coronaviruses originating from the same host differ from one another.
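The reasoning rests on comparing two gaps: the one between the new virus and its nearest relative, and the one between viruses already known to share a host. A toy version of that comparison, with invented 20-letter sequences and a simple per-position mismatch rate standing in for whatever distance the study used, might look like this:

```python
def fraction_different(a, b):
    """Share of positions at which two aligned, equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

# Invented sequences: a query, its nearest candidate relative,
# and two references assumed to come from one host species.
query      = "ACGTACGTACGTACGTACGT"
candidate  = "ACGTACGTACGAACGTACGT"
host_ref_1 = "TTGACCGTAAGGCTAGCTAA"
host_ref_2 = "TTGACCCTAAGGATAGCTTA"

query_gap = fraction_different(query, candidate)
within_host_gap = fraction_different(host_ref_1, host_ref_2)

# A query-to-candidate gap smaller than the within-host gap is consistent
# with (but does not prove) a shared host.
print(query_gap, within_host_gap, query_gap < within_host_gap)
```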

SARS-CoV-2 Probably from Bat or Pangolin CoV

They conclude, “Therefore, SARS-CoV-2 is deemed very likely originated from the same host with bat CoV RaTG13 or Guangdong pangolin CoV, which is bat or pangolin, respectively.” The study showcases the ability of AI to make sense of large volumes of data to pick out meaningful and useful patterns. It raises hopes that the same power can be harnessed to develop an effective vaccine against SARS-CoV-2.

*Important Notice

bioRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice/health-related behavior, or treated as established information.


Written by

Dr. Liji Thomas

Dr. Liji Thomas is an OB-GYN, who graduated from the Government Medical College, University of Calicut, Kerala, in 2001. Liji practiced as a full-time consultant in obstetrics/gynecology in a private hospital for a few years following her graduation. She has counseled hundreds of patients facing issues from pregnancy-related problems and infertility, and has been in charge of over 2,000 deliveries, striving always to achieve a normal delivery rather than operative.


Please use one of the following formats to cite this article in your essay, paper or report:

  • APA

    Thomas, Liji. (2020, May 15). Research investigates COVID-19 virus origin using artificial intelligence (AI). News-Medical. Retrieved on January 16, 2021 from

  • MLA

    Thomas, Liji. "Research investigates COVID-19 virus origin using artificial intelligence (AI)". News-Medical. 16 January 2021. <>.

  • Chicago

    Thomas, Liji. "Research investigates COVID-19 virus origin using artificial intelligence (AI)". News-Medical. (accessed January 16, 2021).

  • Harvard

    Thomas, Liji. 2020. Research investigates COVID-19 virus origin using artificial intelligence (AI). News-Medical, viewed 16 January 2021,


The opinions expressed here are the views of the writer and do not necessarily reflect the views and opinions of News Medical.