Identification of long COVID patients using big data

By Dr. Liji Thomas, MD
Oct 26 2021

Coronavirus disease 2019 (COVID-19) has affected millions of people, leaving many with chronic or persistent symptoms. This condition is called long covid, or post-acute sequelae of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection (PASC).

This news article was a review of a preliminary scientific report that had not undergone peer review at the time of publication. Since its initial publication, the scientific report has been peer reviewed and accepted for publication in a scientific journal. Links to the preliminary and peer-reviewed reports are available in the Sources section at the bottom of this article.

A new preprint, available on the medRxiv preprint server, uses big data to address questions about the incidence and treatment of long covid and its impact on healthcare systems.

Background

Long covid has taken a toll on human health and wellbeing through its debilitating effects on patients, their caregivers, the people they previously served, and the healthcare systems charged with their care. The high social and economic costs of the illness are obvious.

While acute illness impacts multiple body systems, including the lungs, the gut, the brain, and the heart, long-term illness is less understood. The term long covid is considered to represent "persistent or new symptoms more than four weeks after severe, mild, or asymptomatic SARS-CoV-2 infection."

The sheer variety of signs and symptoms that make up long covid, and the long time scale involved, have made it very difficult to diagnose and manage. The matter is nonetheless urgent, because the condition has left many thousands of people incapacitated for daily life to varying extents.

Several studies have brought out the wide range of patient symptoms due to long covid using instantaneous capture and longitudinal monitoring strategies. The Human Phenotype Ontology (HPO) was also used to show how multiple features characterize long covid. At the same time, the World Health Organization (WHO) guidelines offer a 12-point criterion for diagnosing this condition from both self-reported and clinical data.

The National Institutes of Health (NIH) in the USA has similarly conducted a large RECOVER study with thousands of participants drawn from all over the country to help unravel the risk factors in pregnancy, the effects on psychological and cognitive function, and the differences in outcome between patients. Machine learning (ML) applied to electronic health records (EHRs) can help researchers develop accurate algorithms that pick out high-risk patients against a defined standard.

This approach has been tested by the National COVID Cohort Collaborative (N3C), a data-gathering and analytics project that brings together and harmonizes EHR data from 65 sites and eight million patients over a period of time. The project is funded by the NIH National Center for Advancing Translational Sciences (NCATS).

It uses data from patients with confirmed SARS-CoV-2 infection, those with symptoms suggestive of this infection, and matched controls without this diagnosis. Long covid patient data from three N3C sites were collected and linked to the N3C database. These data were used to train three ML models, which were then tested before being used to achieve the desired outcome.
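
The train-then-test workflow described above can be sketched in a few lines of Python. This is a minimal illustration under loud assumptions, not the study's method: the single feature (outpatient visit count), the toy patients, and the one-cutoff classifier are all invented here as a stand-in for the three ML models, whose details this article does not give.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    outpatient_visits: int   # hypothetical feature, invented for this sketch
    long_covid_clinic: bool  # label: was the patient seen at a long covid clinic?


def predicts_long_covid(cutoff, patient):
    """Stand-in model: flag patients at or above a visit-count cutoff."""
    return patient.outpatient_visits >= cutoff


def accuracy(cutoff, patients):
    """Fraction of patients whose prediction matches their label."""
    hits = sum(predicts_long_covid(cutoff, p) == p.long_covid_clinic for p in patients)
    return hits / len(patients)


def fit_cutoff(patients):
    """'Training': choose the cutoff with the best accuracy on labelled patients."""
    candidates = sorted({p.outpatient_visits for p in patients})
    # max() keeps the first (smallest) cutoff among ties.
    return max(candidates, key=lambda c: accuracy(c, patients))


# Toy labelled data (all values invented).
train = [Patient(8, True), Patient(6, True), Patient(7, True),
         Patient(1, False), Patient(2, False), Patient(3, False)]
held_out = [Patient(9, True), Patient(5, True), Patient(0, False), Patient(4, False)]

cutoff = fit_cutoff(train)                 # step 1: train on labelled clinic data
test_score = accuracy(cutoff, held_out)    # step 2: test on patients not used in training
print(cutoff, test_score)                  # prints: 6 0.75
```

The held-out score is lower than the training score here because one held-out patient falls just under the learned cutoff, which is the kind of gap that testing before deployment is meant to expose.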

The aim was twofold: first, to define, on a nationwide basis, the group of patients at risk of long covid; and second, to define the important clinical features of this cohort, to help pick out potential new participants for research studies and to characterize its clinical manifestations in practice.

"The NIH RECOVER program has invested in EHR studies to understand PASC, to accurately identify who has PASC, and to prevent and treat PASC." In this context, the current study describes very accurate models that have been trained by EHR data from patients at a long covid clinic.

What did the study show?

The models created with these data proved to be easily reproducible and easy to interpret. They can be used in individual facilities to recruit research participants locally and to analyze the collected data. The features with the highest predictive value were outpatient attendance for long covid, age, respiratory symptoms, and other diagnostic and clinical features.

Because the data are longitudinal, they provide a large-scale initial layer on which further ML modeling can be carried out to identify long covid. With more and newer data sources, such as wearable electronics, the models can become more accurate and discriminative. By indicating the chances that a patient will develop this condition, such models may eventually make better clinical management possible.

Patients who had COVID-19 but not long covid differed demographically from long covid patients in the N3C data; in particular, most of the former were female.

These are known to put patients at risk for more severe acute COVID-19 but may also increase the likelihood of long covid.

The association of respiratory symptoms and treatments with drugs such as albuterol and inhaled steroids is not surprising, given that the virus is a respiratory pathogen. Post-viral reactive airways disease is frequent among COVID-19 patients, as expected with respiratory virus infections.

Non-respiratory symptoms are also widely reported in long Covid, including sleep disorders, anxiety and malaise, constipation, and chest pain. Similarly, these patients have higher use of lorazepam, melatonin, and polyethylene glycol 3350.

Dexamethasone use was inversely associated with long covid. Importantly, when the vaccine was received following natural infection, the probability of long covid was reduced, indicating the vaccine's ability to protect against symptomatic and severe COVID-19, as well as death and long covid following infection.

These trends are visible in EHR data, which supports its use both for selecting research cohorts and for testing hypotheses about long covid. Such hypotheses include the role of social and demographic factors, underlying conditions, and treatments, as well as how acute disease severity is related to the specific signs and symptoms of long covid.

What are the implications?

The use of EHR data should be understood in terms of the opportunity it gives to recruit a cohort of patients by computational selection, based purely on each patient's clinical features at that instant, with matched inclusion and exclusion criteria. The broad criteria used allow proxies to be utilized, thus enlarging the scope of inclusion, though they also allow some extraneous patients to be recruited.
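
Computational cohort selection of this kind can be sketched as a simple filter over patient records. The sketch below is illustrative only: the record fields, the proxy feature names, and the exclusion criterion are hypothetical and are not the study's actual criteria. The 28-day threshold reflects the "more than four weeks" definition of long covid quoted earlier.

```python
from dataclasses import dataclass, field

# Hypothetical proxy features suggestive of long covid (names invented for this sketch).
PROXY_FEATURES = frozenset({"dyspnea", "albuterol", "sleep_disorder"})

# Hypothetical exclusion criterion (invented for this sketch).
EXCLUSIONS = frozenset({"pre_existing_asthma"})


@dataclass
class Record:
    patient_id: str
    covid_confirmed: bool
    days_since_infection: int
    features: set = field(default_factory=set)


def select_cohort(records, min_days=28):
    """Return IDs of patients who meet every inclusion rule and no exclusion rule."""
    cohort = []
    for r in records:
        included = (
            r.covid_confirmed
            and r.days_since_infection > min_days  # symptoms beyond four weeks
            and bool(r.features & PROXY_FEATURES)  # at least one proxy feature
        )
        excluded = bool(r.features & EXCLUSIONS)
        if included and not excluded:
            cohort.append(r.patient_id)
    return cohort


records = [
    Record("p1", True, 60, {"dyspnea"}),
    Record("p2", True, 10, {"dyspnea"}),                         # too soon after infection
    Record("p3", False, 90, {"albuterol"}),                      # infection not confirmed
    Record("p4", True, 45, {"albuterol", "pre_existing_asthma"}),  # excluded
    Record("p5", True, 120, {"sleep_disorder"}),
    Record("p6", True, 200, set()),                              # no proxy feature
]

print(select_cohort(records))  # prints: ['p1', 'p5']
```

Because any one proxy feature is enough for inclusion, a rule like this casts a wide net, which illustrates the trade-off noted above: broader inclusion at the cost of admitting some patients who do not have long covid.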

It is also necessary to acknowledge that most of this data comes from patients who are likely to use healthcare, those who are more ill, and inpatients, leaving out large populations with limited access to or ability to pay for healthcare, and those visiting hospitals without EHR capabilities, including small or community clinics or hospitals. This signifies the need to use other methods to access such groups to achieve greater diversity and representation.

As larger cohorts of long-COVID patients are established, future research should identify sub-phenotypes of long-COVID by clustering long-COVID patients with similar EHR data "fingerprints."
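
One common way to group patients with similar "fingerprints" is k-means clustering. The sketch below is a hedged illustration, not the method proposed in the paper: the two-number fingerprint (counts of respiratory and sleep or anxiety related entries), the toy patients, and the starting centroids are all assumptions made for this example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, initial_centroids, max_iter=100):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    centroids = list(initial_centroids)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Toy fingerprints: (respiratory entries, sleep/anxiety entries). Values invented.
fingerprints = [(3, 0), (4, 1), (0, 3), (1, 4)]

# Starting centroids are chosen by hand here; real pipelines usually randomize them.
labels, centroids = k_means(fingerprints, [fingerprints[0], fingerprints[2]])
print(labels)     # prints: [0, 0, 1, 1]
print(centroids)  # prints: [(3.5, 0.5), (0.5, 3.5)]
```

In this toy run, the first two patients form a predominantly respiratory group and the last two a predominantly sleep or anxiety group. K-means is sensitive to its starting centroids, so results on real data would need to be checked across several initializations.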

The model may be refined over time as more data flow into N3C and sample sizes grow.

Journal references:

Preliminary scientific report. Girvin, A. T. et al. (2021). Who Has Long-COVID? A Big Data Approach. medRxiv preprint. doi: https://doi.org/10.1101/2021.10.18.21265168, https://www.medrxiv.org/content/10.1101/2021.10.18.21265168v1
Peer reviewed and published scientific report. Pfaff, Emily R, Andrew T Girvin, Tellen D Bennett, Abhishek Bhatia, Ian M Brooks, Rachel R Deer, Jonathan P Dekermanjian, et al. 2022. “Identifying Who Has Long COVID in the USA: A Machine Learning Approach Using N3C Data.” The Lancet Digital Health, May. https://doi.org/10.1016/s2589-7500(22)00048-6. https://www.thelancet.com/journals/landig/article/PIIS2589-7500(22)00048-6/fulltext.

Article Revisions

Apr 29 2023 - The preprint preliminary research paper that this article was based upon was accepted for publication in a peer-reviewed Scientific Journal. This article was edited accordingly to include a link to the final peer-reviewed paper, now shown in the sources section.
