A powerful AI-driven analysis uncovers hidden COVID-19 deaths across the US, exposing deep inequities in how the pandemic’s toll was recorded.

Study: Applying machine learning to identify unrecognized COVID-19 deaths recorded as other causes of death in the United States.
In a recent study published in the journal Science Advances, researchers developed a novel machine learning (ML) model to estimate the number of previously unrecognized coronavirus disease 2019 (COVID-19) deaths in the United States (US), rather than to compute a “true” total death toll for the pandemic. The analysis covered the period from March 2020 to December 2021.
Model estimates indicated that the US medical reporting system likely failed to identify 155,536 COVID-19 deaths that were instead officially attributed to other causes. Furthermore, these predicted “unrecognized” deaths occurred disproportionately among marginalized racial and ethnic groups, including Hispanic, American Indian/Alaska Native, Black, and Asian populations.
Underreporting was significantly above the national average among individuals with less education and among residents of the American South, a pattern that suggests systematic inequities in the nation’s death investigation system rather than definitive proof of systemic failure.
Limitations of Traditional COVID-19 Mortality Estimates
Accurate epidemiological public health reporting, particularly mortality data, is widely considered a bedrock of the modern medical system as it allows officials to allocate resources and craft effective policy during emergencies.
However, the recent COVID-19 pandemic is often criticized as an example of the breakdown of this system, with a growing body of evidence suggesting that reporting was often delayed or incomplete.
Traditionally, studies have predominantly used "excess mortality" statistical models to estimate the pandemic's toll by comparing observed deaths with the number expected from historical trends. While these models have proven useful for estimating the total number of deaths in a given area, they cannot accurately identify the cause of death.
Consequently, distinguishing between someone who died directly from a viral (COVID-19) infection and someone who died due to indirect pandemic-associated factors, such as a delayed heart surgery or the economic stress of a lockdown, has hitherto remained impossible using excess mortality approaches alone.
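To make this limitation concrete, here is a minimal Python sketch of an excess-mortality calculation. The monthly counts are invented for illustration, and the baseline is a plain average of earlier years, which is a simplifying assumption; published models typically also adjust for trend and seasonality.

```python
from statistics import mean

def excess_deaths(observed, baseline_years):
    """Observed deaths minus the expected count (mean of prior years)."""
    expected = mean(baseline_years)
    return observed - expected

# Hypothetical counts for one area and month (not taken from the study).
prior_years = [1000, 1040, 1020]   # same month in three earlier years
observed = 1300                    # same month during the pandemic

excess = excess_deaths(observed, prior_years)
print(excess)  # a single total, with no cause of death attached
```

The result is one number per area and period. Nothing in it indicates how many of those additional deaths were caused by infection and how many by indirect effects of the pandemic.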
Machine Learning Model and Study Design
The present study aimed to address this knowledge gap within the context of the US death investigation system. The study leveraged recent computational advances to train predictive ML models on a large national death certificate dataset, treating inpatient deaths as a high-quality (“gold standard”) reference under key assumptions.
This training set was drawn from existing US death certificate data for inpatient hospital deaths rather than from a purpose-built dataset. For these deaths, COVID-19 testing was near-universal and cause-of-death reporting was assumed to be highly accurate. The dataset covers March 2020 to December 2021, a period during which 1.88 million deaths were reported.
Sixteen different ML models were trained on this reference dataset, using the contributing causes listed on each death certificate and the decedent characteristics that may signal a COVID-19 death. The Extreme Gradient Boosting (XGBoost) model was selected for its consistently high predictive accuracy on the training dataset.
The model was then applied to 3.85 million "out-of-hospital" death certificates from adults aged 25 and older. Each record included up to 20 underlying and contributing causes of death, along with age, sex, race, education level, preexisting chronic medical conditions, median household income, and geographic location.
Importantly, the approach assumes that patterns learned from hospital deaths can be validly applied to out-of-hospital deaths, a key yet potentially limiting assumption of the model.
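The train-on-hospital, apply-elsewhere idea can be sketched in a few lines of Python. This is not the authors' pipeline: a simple lookup table of observed rates stands in for XGBoost, the two feature flags (pneumonia, respiratory failure) and all records are hypothetical, and summing predicted probabilities to get an expected count is an assumption of this sketch rather than a step the article describes.

```python
from collections import defaultdict

def fit_rates(reference):
    """Estimate P(COVID-19 death | feature pattern) from reference records."""
    counts = defaultdict(lambda: [0, 0])  # pattern -> [covid deaths, all deaths]
    for features, is_covid in reference:
        counts[features][0] += int(is_covid)
        counts[features][1] += 1
    return {pattern: covid / total for pattern, (covid, total) in counts.items()}

def expected_unrecognized(rates, target):
    """Sum predicted probabilities over deaths not recorded as COVID-19."""
    # Patterns never seen in the reference data contribute 0.0 (an assumption).
    return sum(rates.get(features, 0.0) for features in target)

# Hypothetical inpatient records: ((pneumonia, respiratory_failure), recorded as COVID-19?)
inpatient = [
    ((1, 1), True), ((1, 1), True), ((1, 1), True), ((1, 1), False),
    ((1, 0), True), ((1, 0), False),
    ((0, 0), False), ((0, 0), False),
]
# Hypothetical out-of-hospital deaths officially attributed to other causes.
out_of_hospital = [(1, 1), (1, 0), (0, 0), (0, 0)]

rates = fit_rates(inpatient)
estimate = expected_unrecognized(rates, out_of_hospital)
print(estimate)
```

The sketch also shows why the transfer assumption matters: if a given pattern of contributing causes signals COVID-19 less reliably at home than in hospital, the rates learned from inpatient records will misstate the out-of-hospital estimate.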
Estimated Underreporting and Mortality Disparities
The XGBoost model estimated a total of 995,787 COVID-19 deaths (95% uncertainty interval [UI]: 990,313 to 1,001,363) during the period under investigation. This number reveals a substantial reporting gap in the US death investigation system, as it is ~19% higher (n = 155,536) than official records (n = 840,251).
The model further indicated that these discrepancies were most severe for deaths occurring at home, where the predicted toll was 160% higher than reported (Adjusted Reporting Ratio [ARR] = 2.60; 95% UI: 2.56 to 2.65). Unexpectedly, the model also identified significant gaps for deaths in hospice care and emergency rooms.
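The arithmetic behind these figures can be checked with a short Python sketch. The death counts and the ARR of 2.60 come from the text above; the crude ratio computed here is unadjusted and is shown only for illustration, so it need not match any adjusted ratio reported in the study. The helper name percent_higher is hypothetical.

```python
predicted = 995_787   # model estimate of COVID-19 deaths
reported = 840_251    # officially recorded COVID-19 deaths

unrecognized = predicted - reported
crude_ratio = predicted / reported   # unadjusted, for illustration only

def percent_higher(ratio):
    """Convert a predicted-to-reported ratio into a percent increase."""
    return round((ratio - 1) * 100, 1)

print(unrecognized)                  # 155536
print(percent_higher(crude_ratio))   # 18.5
print(percent_higher(2.60))          # 160.0
```

A ratio of 2.60 therefore means the predicted toll is 160% higher than the reported one, and the overall gap of 155,536 deaths is about 18.5% of the official count, which is quoted above as roughly 19%.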
When the model estimated how sociodemographic and medical factors were associated with misclassification, the Southern United States had the highest rates of unrecognized deaths. Alabama (ARR 1.67), Oklahoma (ARR 1.51), and South Carolina (ARR 1.47) led the nation in underreporting.
The model also identified reporting disparities across racial and ethnic groups, with Hispanic decedents being the most likely to have their COVID-19 deaths go unrecognized (ARR 1.31; 95% UI: 1.30 to 1.32). High underreporting was also found among American Indian/Alaska Native (ARR 1.24), Asian (ARR ~1.24), and Black (ARR 1.19) populations.
Finally, individuals with less than a high school education were significantly more likely to be undercounted (ARR 1.29) compared to more educated counterparts. Similarly, counties with the lowest household incomes and the worst preexisting health metrics had the highest rates of unrecognized deaths.
Implications for Public Health and Equity
The study concluded that the US death investigation system undercounted COVID-19 deaths in a "systematically inequitable" way. The model's findings imply that the system inadvertently hid the true depth of the pandemic's impact on marginalized communities.
While the study is limited by the assumption that hospital-trained models generalize to out-of-hospital deaths, the researchers argue that this method offers a potentially more specific alternative to traditional excess-death models. The authors also emphasize that these estimates should be interpreted alongside other methodologies rather than as definitive counts.
Future studies should aim to apply similar ML frameworks to investigate other "hidden" mortality crises, such as drug overdoses or the impacts of extreme heat.
Journal reference:
- Kiang, M. V., et al. (2026). Applying machine learning to identify unrecognized COVID-19 deaths recorded as other causes of death in the United States. Science Advances, 12(12). DOI – 10.1126/sciadv.aef5697, https://www.science.org/doi/10.1126/sciadv.aef5697