New methods of analysis and novel markers are currently being identified to predict conditions with polygenic inheritance. These include polygenic risk scores (PRS), which are based on the presence of single nucleotide polymorphisms (SNPs) in several genes. However, their utility is limited, as PRS are largely based on data derived from European populations.
A new paper in Nature Genetics reports the results obtained from the use of a new PRS calculator called CT-SLEB on a multi-national GWAS database.
Study: A new method for multiancestry polygenic prediction improves performance across diverse populations.
SNPs are genetic variants in which one of several possible bases occurs at a given position in the DNA sequence. A variant must be detectable in 1% or more of the population to be considered an SNP.
Genome-wide association studies (GWAS) have been used to identify large numbers of SNPs that are linked to complex traits and diseases. PRS uses combinations of SNPs to provide a predicted risk for the occurrence of complex traits and disease conditions.
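As a minimal sketch of the idea, a PRS is commonly computed as a weighted sum of a person's risk-allele counts, with one weight per SNP taken from GWAS effect estimates. The SNP names, weights, and genotype below are invented for illustration and do not come from the study.

```python
# Illustrative only: a PRS as a weighted sum of risk-allele counts.
# SNP identifiers, weights, and genotypes are made-up values.

def polygenic_risk_score(allele_counts, weights):
    """Sum weight * allele count over every SNP that has a weight.

    allele_counts: dict mapping SNP id -> number of risk alleles (0, 1, or 2)
    weights:       dict mapping SNP id -> per-allele effect size
    SNPs missing from allele_counts are treated as zero risk alleles.
    """
    return sum(weights[snp] * allele_counts.get(snp, 0) for snp in weights)


weights = {"snp_a": 0.25, "snp_b": -0.5, "snp_c": 0.125}
person = {"snp_a": 2, "snp_b": 1, "snp_c": 1}

score = polygenic_risk_score(person, weights)
print(score)
```

The accuracy of such a score depends on how well the weights transfer to the person's ancestry, which is the limitation discussed in the following paragraphs.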
PRS built on SNP-trait associations have largely been derived from European cohorts, thereby limiting their generalizability. Especially in African populations, the PRS calculated based on these studies has produced inaccurate results.
As a result, current PRS cannot be used clinically without favoring European-ancestry populations. The use of GWAS from multiple populations could therefore facilitate the development of better PRS from larger training samples.
To this end, previous studies have combined GWAS information from the target population with information from larger European populations. However, an ideal PRS would require an adequately powered sample, which indicates the need for better methods, as well as larger and more diverse databases.
About the study
The current study reports on the performance of CT-SLEB, a powerful computational tool based on clumping and thresholding (CT), superlearning (SL), and empirical Bayes (EB) methods, as compared with nine other methods. While CT selects which SNPs should be considered when calculating PRS in the target population, EB is a method used to estimate the SNP coefficient. SL uses a mix of PRSs from various SNP selection criteria.
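The clumping and thresholding step can be sketched in generic, single-population form. SNPs that pass a p-value cutoff are ranked by significance, and a SNP is dropped if it is strongly correlated (in linkage disequilibrium) with one already kept. This is a simplified illustration, not the CT-SLEB implementation; the function name, cutoffs, and data are assumptions.

```python
# Generic clumping-and-thresholding sketch (not the CT-SLEB code).
# All names, cutoffs, and values are hypothetical.

def clump_and_threshold(p_values, ld_r2, p_threshold, r2_cutoff):
    """Select SNPs by p-value, then remove correlated ones greedily.

    p_values: dict mapping SNP id -> GWAS p-value
    ld_r2:    dict mapping frozenset({snp1, snp2}) -> squared correlation;
              pairs that are absent are treated as uncorrelated (0.0)
    """
    # Thresholding: keep SNPs passing the cutoff, most significant first.
    candidates = sorted(
        (snp for snp, p in p_values.items() if p <= p_threshold),
        key=lambda snp: p_values[snp],
    )
    selected = []
    for snp in candidates:
        # Clumping: skip a SNP in high LD with any SNP already selected.
        if all(
            ld_r2.get(frozenset((snp, kept)), 0.0) < r2_cutoff
            for kept in selected
        ):
            selected.append(snp)
    return selected


p_values = {"snp_a": 1e-9, "snp_b": 1e-7, "snp_c": 1e-4, "snp_d": 0.2}
ld_r2 = {
    frozenset(("snp_a", "snp_b")): 0.9,
    frozenset(("snp_b", "snp_c")): 0.05,
}
chosen = clump_and_threshold(p_values, ld_r2, p_threshold=1e-3, r2_cutoff=0.1)
print(chosen)
```

In this toy input, snp_d fails the p-value cutoff and snp_b is removed because it is highly correlated with the more significant snp_a.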
CT-SLEB requires GWAS summary statistics from both European and non-European training datasets, a tuning dataset that yields the best parameters for the target population, and a validation dataset that provides the final prediction for the target population.
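The separate roles of the tuning and validation datasets can be illustrated with a deliberately simplified sketch: several candidate PRSs are compared on the tuning set, the best one is chosen, and its accuracy is then reported on the untouched validation set. CT-SLEB itself does more with the tuning data than pick a single candidate, so this is only an assumption-laden illustration; every name and number is hypothetical.

```python
# Simplified tuning/validation workflow (hypothetical names and data).

def r_squared(predicted, observed):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(observed)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0 or var_o == 0:
        return 0.0
    return cov * cov / (var_p * var_o)


def pick_and_validate(tuning_scores, tuning_trait, validation_scores, validation_trait):
    """Choose the candidate PRS with the highest R^2 on the tuning set,
    then report its R^2 on the validation set."""
    best = max(
        tuning_scores,
        key=lambda name: r_squared(tuning_scores[name], tuning_trait),
    )
    return best, r_squared(validation_scores[best], validation_trait)


# Candidate PRSs built with two made-up p-value cutoffs.
tuning_trait = [1.0, 2.0, 3.0, 4.0]
tuning_scores = {"strict": [2.0, 4.0, 6.0, 8.0], "lenient": [1.0, 3.0, 2.0, 4.0]}
validation_trait = [1.0, 2.0, 3.0]
validation_scores = {"strict": [1.0, 2.0, 3.5], "lenient": [2.0, 1.0, 3.0]}

best_name, validation_r2 = pick_and_validate(
    tuning_scores, tuning_trait, validation_scores, validation_trait
)
print(best_name, round(validation_r2, 3))
```

Keeping the validation set out of the selection step is what makes the final accuracy estimate unbiased by the tuning choices.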
These results were obtained using GWAS simulations on large populations extending across five different ancestries. Data sources included 23andMe, Inc., the Global Lipids Genetics Consortium (GLGC), All of Us (AoU), and UK Biobank (UKBB), covering European (EUR), African (AFR; primarily African American), Latino, East Asian, and South Asian (SAS) populations.
GWAS data from over five million individuals from several different ancestral groups were included in the analysis, about 1.2 million of whom were from countries outside Europe. The data were used to predict multi-ancestry PRS from a combination of European and less abundant non-European population data.
In addition to providing comparative data on CT-SLEB and other approaches, the scientists also generated validated PRS for 13 complex traits using this multi-ancestry PRS tool.
What did the study show?
CT-SLEB improved PRS performance in the groups from non-European countries compared with simpler tools. This remained true regardless of whether the training dataset was small or large, whereas training sample size did affect the accuracy of other PRS calculators. The greatest number of comparisons was conducted between CT-SLEB and PRS-CSx, the latter being a Bayesian approach.
CT-SLEB maintained or even surpassed the predictive accuracy of other tools that rely more heavily on computational analysis. As the sample size increases, CT-SLEB becomes more accurate, irrespective of the polygenicity, whereas with smaller samples, it performs better with lower polygenicity.
PRS-CSx performed better than CT-SLEB in many settings; however, both platforms performed best when using data from all five ancestries. Using two-ancestry data, CT-SLEB generated African PRS in only 4.3 minutes, 25 times faster than PRS-CSx. With five-ancestry data, CT-SLEB took almost the same time and was over 90 times faster than PRS-CSx, which required 420 minutes.
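The timing figures quoted above can be cross-checked with simple arithmetic. The sketch below assumes that the 420 minutes refers to the PRS-CSx run on five-ancestry data; the derived values are implied by the quoted ratios rather than reported directly.

```python
# Arithmetic check of the quoted run times (derived values are implied,
# not reported; the 420-minute figure is assumed to be PRS-CSx).

ct_sleb_two_ancestry_min = 4.3   # quoted CT-SLEB run time, two ancestries
speedup_two_ancestry = 25        # quoted speedup over PRS-CSx

# Implied PRS-CSx run time on two-ancestry data.
prs_csx_two_ancestry_min = ct_sleb_two_ancestry_min * speedup_two_ancestry

prs_csx_five_ancestry_min = 420  # quoted run time, five ancestries
speedup_five_ancestry = 90       # "over 90 times faster"

# Upper bound on the CT-SLEB run time implied by a speedup above 90.
ct_sleb_five_ancestry_max_min = prs_csx_five_ancestry_min / speedup_five_ancestry

print(round(prs_csx_two_ancestry_min, 1))
print(round(ct_sleb_five_ancestry_max_min, 2))
```

An implied bound of under five minutes for the five-ancestry run is consistent with the statement that CT-SLEB took almost the same time in both settings.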
The predictive performance of PRS for minority groups generated by CT-SLEB was comparable to that for the European population if the former had sample sizes at least 45% as large as European cohorts. However, the sample size required for accurate prediction varies dramatically with the heritability of various traits.
CT-SLEB is easily scalable and can process a much larger number of SNPs. Thus, this platform is capable of improving its PRS performance in minority groups within the American population by using denser SNP panels.
For many polygenic traits, including the clinically important cardiovascular disease (CVD) trait, CT-SLEB predicted the risk much better than PRS-CSx and PolyPred-S+. Overall, these three outperformed other platforms; however, none was superior in all settings.
“Even with the best-performing method and large sample, a substantial gap remained for PRS performance in non-EUR populations compared with the EUR population.”
What are the implications?
“CT-SLEB is a new and computationally scalable method for generating powerful PRSs using data from GWASs in diverse populations.”
The study findings emphasize the need to use multiple methods to generate PRSs across multiple ancestries. CT-SLEB produced the greatest improvement in PRS performance for African-American populations, which represent African-origin populations that have little baseline polygenic data and correspondingly lower polygenic prediction accuracy.
The simulation studies showed the need to determine sample sizes appropriate for such prediction. These studies also highlight the effects of variations in SNP density when predicting the risk of a trait among people of multiple ancestries, which will affect the choice of method for PRS generation.
CT-SLEB produces predictions an order of magnitude faster than PRS-CSx and is easily scalable for handling large increases in the number of SNPs and more populations.
- Zhang, H., Zhan, J., Jin, J., et al. (2023). A new method for multiancestry polygenic prediction improves performance across diverse populations. Nature Genetics. doi:10.1038/s41588-023-01501-z.