In a recent study posted to the bioRxiv* preprint server, researchers developed and validated an approach for the joint inference of measurement noise and genetic drift by analyzing time-series data of lineage frequencies.
Random genetic drift in the population-level dynamics of an infectious disease outbreak arises from the randomness of transmission between hosts and of host recovery or death. Studies have reported strong genetic drift in severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) resulting from superspreading events, which is predicted to considerably affect viral evolution and coronavirus disease 2019 (COVID-19) epidemiology. However, noise from the measurement process, including biases in data collection across locations and time, could confound estimates of genetic drift.
About the study
In the present study, researchers developed an approach to jointly infer the strength of measurement noise and genetic drift from time-series lineage frequency data. The approach allowed measurement noise to be overdispersed (rather than assuming uniform sampling) and the strength of overdispersion to vary with time (rather than being constant). They also validated the accuracy of the approach via simulations.
A hidden Markov model (HMM) with continuous hidden and observed states was used, in which the hidden states represented the true lineage frequencies and the observed states represented the observed frequencies. The transition probability between hidden states was set by genetic drift, such that the mean true frequency equaled the true frequency in the previous time step. For rare frequencies, the variance was proportional to the mean, with the proportionality set by the effective population size [Ne(t)] and the generation time.
The emission probability linking the hidden and observed states was set by measurement noise, such that the mean observed frequency equaled the true frequency. For rare frequencies, the variance of the observed frequency was proportional to the mean, with a time-dependent factor capturing deviations from uniform sampling. The modeling assumed that the number of individuals and the lineage frequencies were high enough for the central limit theorem to apply.
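The two-layer structure described above can be sketched as a small simulation. This is only an illustrative sketch, not the preprint's implementation: it assumes Gaussian transitions with variance f/ne_tilde and Gaussian emissions with variance c*f/n_seq, and the names simulate_lineage, ne_tilde, c, and n_seq are hypothetical.

```python
import math
import random


def simulate_lineage(f0, ne_tilde, c, n_seq, n_steps, seed=0):
    """Simulate the true and observed frequency of one rare lineage.

    Hypothetical parameters (not taken from the preprint):
      f0        initial true frequency, assumed rare (f0 << 1)
      ne_tilde  effective population size scaled by generation time
      c         overdispersion of measurement noise (c = 1: uniform sampling)
      n_seq     number of sequences sampled per time step
    """
    rng = random.Random(seed)
    f = f0
    true_f, obs_f = [], []
    for _ in range(n_steps):
        true_f.append(f)
        # Emission: observation centred on the true frequency,
        # with variance proportional to the mean (c * f / n_seq).
        obs = rng.gauss(f, math.sqrt(c * f / n_seq))
        obs_f.append(max(obs, 0.0))
        # Transition: genetic drift keeps the mean at f,
        # with variance proportional to the mean (f / ne_tilde).
        nxt = rng.gauss(f, math.sqrt(f / ne_tilde))
        f = min(max(nxt, 1e-12), 1.0)  # keep the frequency inside (0, 1]
    return true_f, obs_f


if __name__ == "__main__":
    true_f, obs_f = simulate_lineage(0.01, 1e4, 2.0, 5000, 10)
    for t, (ft, ot) in enumerate(zip(true_f, obs_f)):
        print(f"t={t}  true={ft:.5f}  observed={ot:.5f}")
```

Lowering ne_tilde makes the true trajectory wander more (stronger drift), whereas raising c only makes the observations scatter more around an unchanged true trajectory. That difference is what allows the two effects to be separated in principle.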
The team created “superlineages” by grouping lineages based on phylogenetic distance so that the combined abundance and frequency of each group exceeded a threshold value, yielding 486, 4,083, 6,225, and 24,867 strains of the SARS-CoV-2 pre-B.1.177, B.1.177, Alpha, and Delta variants, respectively. The team assumed that Ne(t) was constant over nine weeks. Based on the emission and transition probabilities, the likelihood function was derived, representing the probability of observing a particular lineage frequency time series given the strength of measurement noise and Ne(t) at different times.
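The grouping step can be illustrated with a simple greedy merge. This sketch makes a strong simplifying assumption: the input list is already ordered so that neighbouring entries are phylogenetically close. The preprint's actual tree-based procedure is not described here, and the function name, input format, and threshold below are hypothetical.

```python
def group_superlineages(lineages, min_count):
    """Greedily merge neighbouring lineages until each group exceeds min_count.

    `lineages` is a list of (name, count) pairs, assumed to be ordered so that
    neighbours in the list are close on the phylogenetic tree (hypothetical).
    Returns a list of (member_names, total_count) pairs.
    """
    groups, current, total = [], [], 0
    for name, count in lineages:
        current.append(name)
        total += count
        if total > min_count:
            groups.append((current, total))
            current, total = [], 0
    # Fold leftover lineages into the last group so that no data are dropped.
    if current:
        if groups:
            names, n = groups[-1]
            groups[-1] = (names + current, n + total)
        else:
            groups.append((current, total))
    return groups


if __name__ == "__main__":
    example = [("a", 3), ("b", 4), ("c", 12), ("d", 2), ("e", 1)]
    print(group_superlineages(example, min_count=5))
```

Merging keeps each group's count high enough for the Gaussian approximation behind the model to be reasonable, at the cost of losing resolution between the merged lineages.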
Subsequently, the parameters most likely to have generated the dataset were determined. The model was validated with simulations in which Ne(t) and the measurement noise strength varied with time. Novel lineages were introduced into the simulations at a low mutation rate. The model was fitted to the simulated lineage frequency data, which showed that Ne(t) and the measurement noise strength could be determined accurately in most situations, even when both quantities varied with time.
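One way to see how a likelihood of this kind can be evaluated and maximized is a forward algorithm on a discretized frequency grid. This is a stand-in technique chosen for illustration, not necessarily the numerical method used in the preprint. It assumes the same Gaussian transition and emission variances as a simple rare-frequency model (f/ne_tilde and c*f/n_seq), holds the noise parameter c fixed instead of inferring it jointly, and uses hypothetical function names.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(obs, grid, ne_tilde, c, n_seq):
    """Log-likelihood of observed frequencies via a grid-based forward pass.

    `grid` holds candidate true frequencies (all > 0). The observations are
    assumed to lie well inside the grid range; otherwise the normalising
    sum can underflow to zero.
    """
    # Row i of the transition matrix: P(next = grid[j] | current = grid[i]).
    trans = []
    for fi in grid:
        row = [normal_pdf(fj, fi, fi / ne_tilde) for fj in grid]
        s = sum(row)
        trans.append([p / s for p in row])

    n = len(grid)
    alpha = [1.0 / n] * n  # flat prior over the initial true frequency
    ll = 0.0
    for t, y in enumerate(obs):
        if t > 0:
            # Propagate the belief about the hidden state one step forward.
            alpha = [sum(alpha[i] * trans[i][j] for i in range(n)) for j in range(n)]
        # Weight each hidden state by the emission density of the observation.
        alpha = [a * normal_pdf(y, f, c * f / n_seq) for a, f in zip(alpha, grid)]
        norm = sum(alpha)
        ll += math.log(norm)
        alpha = [a / norm for a in alpha]
    return ll


def fit_ne(obs, grid, candidates, c, n_seq):
    """Return the candidate ne_tilde with the highest log-likelihood."""
    return max(candidates, key=lambda ne: log_likelihood(obs, grid, ne, c, n_seq))


if __name__ == "__main__":
    grid = [1e-4 + k * (0.05 - 1e-4) / 99 for k in range(100)]
    obs = [0.01] * 10  # a perfectly flat trajectory favours weak drift
    print(fit_ne(obs, grid, [1e2, 1e3, 1e4, 1e5], c=1.0, n_seq=1e4))
```

A joint fit would search over c as well, and over separate values for each time window. The grid search over a single parameter is kept here only to show the shape of the computation.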
The inferred Ne(t) was compared with estimates from the SIR (susceptible, infectious, and recovered) and SEIR (susceptible, exposed, infectious, and recovered) models. The approach was applied to estimate the strength of measurement noise and genetic drift in SARS-CoV-2 in England across space and time (between March 2020 and December 2021). More than 490,000 SARS-CoV-2 sequences obtained from the COVID-19 Genomics UK (COG-UK) consortium were analyzed.
The strength of genetic drift was consistently higher, by one to three orders of magnitude, than that expected from the observed number of SARS-CoV-2-positive individuals in England. This held throughout the study period, even after correcting for measurement noise. The elevated genetic drift could not be explained by superspreading but may be partially explained by deme (community) structure in the contact networks of hosts. The discrepancy also could not be explained by corrections accounting for epidemiological dynamics (SIR or SEIR modeling).
Sampling of SARS-CoV-2-infected individuals from England’s population was largely uniform for this dataset. The team found evidence of spatial structure in the transmission dynamics of the B.1.177, Alpha, and Delta variants. The estimated Ne(t) was lower than the number of SARS-CoV-2-positive community-dwelling individuals by a factor of 16 to 1,055 at different time points. Peaks in measurement noise for pre-B.1.177 were observed in October 2020, whereas measurement noise for the B.1.177 variant was low during the same period.
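As a quick arithmetic check, the reported factors of 16 and 1,055 can be converted into orders of magnitude. The helper name below is hypothetical.

```python
import math


def orders_of_magnitude(ratio):
    """Base-10 logarithm of the ratio between case count and Ne(t)."""
    return math.log10(ratio)


if __name__ == "__main__":
    for ratio in (16, 1055):
        print(f"factor {ratio}: {orders_of_magnitude(ratio):.1f} orders of magnitude")
    # prints 1.2 and 3.0, i.e. roughly one to three orders of magnitude
```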
The HMM-inferred Ne(t) was lower than that inferred from the SIR and SEIR models, indicating elevated levels of genetic drift of SARS-CoV-2 in England. The time courses of Ne(t) and of the number of SARS-CoV-2-positive community residents differed strikingly in three ways: (i) the Ne(t) of pre-B.1.177 peaked before the number of pre-B.1.177-positive individuals, (ii) the Ne(t) of the Alpha variant declined more slowly than the number of SARS-CoV-2-positive individuals after January 2021, and (iii) a shoulder in the Ne(t) of the Delta variant occurred before that in the number of positives.
Overall, the study findings showed that the strength of genetic drift in SARS-CoV-2 transmission in England was greater than expected and indicated that further modeling studies are required to better understand the mechanisms behind the high levels of genetic drift.
*bioRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.