In a recent study posted to the medRxiv* pre-print server, researchers identified spatial/geographical (county-level) features associated with increased coronavirus disease 2019 (COVID-19) cases and death counts in the United States (US) across different temporal phases of the COVID-19 pandemic.
The team trained and tested a machine learning framework based on structured Gaussian process (SGP) regression on a large, geographically tagged dataset of demographic, socioeconomic, and political data from all US counties.
The impact of COVID-19 has been heterogeneous across the US in terms of both severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) transmission and COVID-19 mortality.
In the US, public health interventions and resource allocation occur at the county level. Because COVID-19 spread depends on proximity, spatial analysis using geographic information systems (GIS) allowed the researchers to investigate associations between demographic and socioeconomic factors and COVID-19 pandemic dynamics at the county level.
Further, it helped them identify and target the areas at the highest risk of becoming COVID-19 hotspots, to help flatten the pandemic curve.
About the study
In the present study, researchers gathered county-level daily case counts between January 22, 2020, and March 21, 2021, from the Center for Systems Science and Engineering at Johns Hopkins University. The United States Census Bureau and the National Center for Health Statistics provided the county-specific features.
At the beginning of each week from April 6, 2020, to March 21, 2021, the team used an SGP regression algorithm to predict daily COVID-19 case and death counts for each county.
The model was trained on a randomly selected two-thirds of the counties in each state and predicted the case and death counts of the remaining one-third. The daily COVID-19 case and death counts were normalized per 100,000 residents and smoothed with a seven-day moving average.
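The split and normalization described here can be illustrated with a minimal sketch. It is not the study's code: the function names, the fixed random seed, the rounding of the two-thirds share, and the choice of a trailing (rather than centered) seven-day window are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_counties_by_state(county_to_state, train_fraction=2 / 3, seed=0):
    """Randomly assign about `train_fraction` of each state's counties to training.

    `county_to_state` maps a county identifier to its state (hypothetical input format).
    """
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility
    by_state = defaultdict(list)
    for county, state in county_to_state.items():
        by_state[state].append(county)

    train, test = [], []
    for state in sorted(by_state):
        counties = sorted(by_state[state])
        rng.shuffle(counties)
        n_train = round(len(counties) * train_fraction)
        train.extend(counties[:n_train])
        test.extend(counties[n_train:])
    return train, test


def per_100k(daily_counts, population):
    """Express daily counts as counts per 100,000 residents."""
    return [count * 100_000 / population for count in daily_counts]


def trailing_mean(values, window=7):
    """Trailing moving average; the first `window - 1` days yield no value."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Toy usage: nine made-up counties in two states, and one county's daily cases.
counties = {f"A{i}": "A" for i in range(6)} | {f"B{i}": "B" for i in range(3)}
train, test = split_counties_by_state(counties)
smoothed = trailing_mean(per_100k([7] * 7 + [14], population=100_000))
```

Splitting within each state, rather than across the whole country, keeps every state represented in both the training and the held-out set.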
The team used Pearson’s correlation coefficient (PCC) to assess how well the predictions captured the dynamics of the event counts, and the coefficient of determination (R²) to measure the proportion of total variance in the observed counts that the model explained.
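To show why the two metrics are reported side by side, here is a small sketch using the standard textbook definitions of both quantities. The article does not state the exact formulas the authors used, so the residual-based form of the coefficient of determination below is an assumption, as are the function names and the toy numbers.

```python
import math


def pearson_r(observed, predicted):
    """Pearson's correlation coefficient between two equal-length series."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_o = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_o) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# A prediction with the right shape but twice the true magnitude:
observed = [1, 2, 3, 4]
predicted = [2, 4, 6, 8]
shape_score = pearson_r(observed, predicted)  # perfect correlation
scale_score = r_squared(observed, predicted)  # heavily penalized for the scale error
```

In this toy case the correlation is perfect while the coefficient of determination is negative, which is why a high PCC alone does not show that predicted magnitudes are accurate.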
After recognizing highly predictive spatial features, the researchers used a clustering algorithm termed topic modeling (TM) to identify combinations of spatial features closely linked to the COVID-19 spread.
TM computed sets of co-occurring features that could link counties to topics. The researchers segregated discrete groups of counties with similar spatial features (topic contributions) and derived nine clusters of counties based on the relative contributions of Latent Dirichlet Allocation (LDA) topics.
Within each cluster, they showed topic contributions by plotting the average z-score normalized topic score. Likewise, within each quintile, a histogram showed clusters of counties with a higher incidence of cases and deaths per capita.
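The per-cluster summary can be sketched as follows, assuming each county already has a vector of topic scores and a cluster label. The function names, the use of the population standard deviation, and the toy values are illustrative assumptions, not details taken from the study.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscore_columns(rows):
    """Z-score each column (topic) across all rows (counties)."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]  # population SD is an assumption
    return [
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]


def mean_topic_z_by_cluster(topic_scores, cluster_labels):
    """Average the z-scored topic scores of the counties in each cluster."""
    grouped = defaultdict(list)
    for label, row in zip(cluster_labels, zscore_columns(topic_scores)):
        grouped[label].append(row)
    return {
        label: [mean(col) for col in zip(*rows)]
        for label, rows in grouped.items()
    }


# Toy usage: four made-up counties, two topics, two clusters.
topic_scores = [[0.8, 0.2], [0.6, 0.4], [0.2, 0.8], [0.4, 0.6]]
cluster_labels = [0, 0, 1, 1]
profile = mean_topic_z_by_cluster(topic_scores, cluster_labels)
```

A positive average for a topic means the cluster's counties score above the all-county mean on that topic, and a negative average means they score below it.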
Across counties, the overall and median PCC were 0.96 and 0.98, and the overall and median R² were 0.84 and 0.94, respectively. An R² greater than 0.90 in most states demonstrated that the study model, built on spatial features, could account for most of the variance in COVID-19 case and death counts.
The predicted COVID-19 cases and death counts were strongly associated with measures of age, urbanicity, and presidential voting margin. Correlation analysis revealed that the interactions between socioeconomic, health, and racial features complicated the interpretation of the relationships between the spatial features and the COVID-19 dynamics.
TM was able to associate features with topics and could group geographically remote but demographically similar counties. It also clustered many geographically proximate counties. For instance, Cluster 1 was concentrated in the Midwest, the region that witnessed the largest surge in COVID-19 cases and deaths during 2020, and comprised counties with high scores for topics 1, 3, and 9 and low scores for topic 10.
While TM showed that counties with similar demographic and socioeconomic features tended to cluster together, the unsupervised clustering based on these topics identified county groups that witnessed varying COVID-19 spread.
As clustering delineated cases from deaths and initial-phase from nationwide-phase dynamics, it highlighted plasticity in the composition of the spatial features that were strongly associated with COVID-19 risk.
Accordingly, Cluster 3, geographically restricted to the Southeast US, was associated with high COVID-19 case counts during the initial phase, and Cluster 0, restricted to Texas and the Rocky Mountain region, was associated with high COVID-19 case counts during the nationwide phase.
Intriguingly, the presidential vote margin was the most consistently selected spatial feature in all the COVID-19 prediction models. It stood independently and showed no collinearity with other spatial factors.
To summarize, the study findings showed that spatial features accounted for the majority of variance in COVID-19 cases and death counts across the US.
Predictive modeling based on combinations of spatial features could identify counties at the highest risk for COVID-19 spread and inform policymakers to prioritize these counties for aggressive mitigation strategies, especially under limited resources.
Importantly, TM provided a novel dimensionality reduction approach for examining epidemiologic data and proved to be a useful tool for analyzing datasets with collinear variables.
*medRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.