One of the most common types of cancer affecting women worldwide is breast cancer. Multiple predictors of this disease have been identified, including inherited genetic factors, reproductive factors, and lifestyle.
Previous studies have emphasized the etiological difference between pre- and post-menopausal breast cancers. Recently, scientists combined machine learning with classical statistical models to identify predictors of post-menopausal breast cancer.
Study: Combining machine learning with Cox models to identify predictors for incident post-menopausal breast cancer in the UK Biobank.
Machine learning (ML) methods can analyze large datasets of candidate predictors and capture complex non-linear relationships. Although previous studies have used ML for breast cancer risk prediction, those studies did not use ML to identify predictors.
The United Kingdom Biobank (UKB), which comprises an extensive and detailed cohort, offers the opportunity to adopt hypothesis-free approaches to identify novel predictors for breast cancer. The recent development of polygenic risk scores (PRS) makes it possible to summarize the combined effect of hundreds to thousands of genetic variants that genome-wide association studies (GWAS) have linked to specific diseases or traits.
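As a rough illustration of how a PRS is built, the sketch below computes a score as a weighted sum of risk-allele counts. The variant names and effect sizes are invented for the example and are not taken from PRS313, PRS120k, or any real GWAS.

```python
# Hypothetical per-variant effect sizes (log odds ratios) of the kind a GWAS
# would estimate. Variant IDs and values are made up for illustration only.
EFFECT_SIZES = {"variant_a": 0.10, "variant_b": -0.05, "variant_c": 0.20}


def polygenic_risk_score(dosages, effect_sizes=EFFECT_SIZES):
    """Return the weighted sum of effect-allele dosages.

    `dosages` maps variant ID -> number of effect alleles carried (0, 1, or 2).
    Variants missing from `dosages` are treated as dosage 0 (a simplification).
    """
    return sum(beta * dosages.get(variant, 0)
               for variant, beta in effect_sizes.items())


# One hypothetical participant.
person = {"variant_a": 2, "variant_b": 1, "variant_c": 0}
score = polygenic_risk_score(person)
print(round(score, 3))
```

In practice, scores are usually standardized against a reference population before they are used as a model feature.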
In coronary artery disease, for example, PRS can be used to identify people at high risk and target them for early statin prescription. Notably, PRS improved the accuracy of existing coronary artery disease risk predictors, such as the Framingham risk score.
Previously, breast cancer PRS has been combined with risk prediction models, such as the Tyrer-Cuzick model and the Breast and Ovarian Analysis of Disease Incidence and Carrier Estimation Algorithm (BOADICEA). Although interactions between PRS and phenotypic features (gene-environment interactions) have been analyzed for breast cancer, contradictory findings have been reported.
About the study
A recent Scientific Reports study utilized machine learning (ML) methods for feature selection, followed by Cox models for risk prediction. The main aim of this study was to demonstrate the effective application of ML methods for feature selection to assist classical statistical methods.
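For readers unfamiliar with Cox models, the standard proportional hazards form is shown below. The symbols are generic textbook notation, not the study's own; here the features x_1, ..., x_p would be those retained by the ML feature selection step.

```latex
h(t \mid x) = h_0(t)\,\exp\left(\beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p\right)
```

Here h_0(t) is an unspecified baseline hazard, and exp(β_j) is the hazard ratio associated with a one-unit increase in feature x_j when the other features are held fixed.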
SHapley Additive exPlanation (SHAP) feature dependence plots were used to explore potential interactions between phenotypic features and PRS. The study used data from UKB, which contains over half a million participants from England, Wales, and Scotland. Baseline data were collected through verbal interviews with a trained nurse, questionnaires, biological samples, and physical examinations.
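To give a sense of the attribution idea behind SHAP, the sketch below computes exact Shapley values for a toy two-feature risk model with an interaction term. The model, its coefficients, and the baseline are hypothetical. Replacing absent features with a fixed baseline is a simplification and is not necessarily how the SHAP library or the study computed its values.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, instance, baseline):
    """Exact Shapley values for a single prediction.

    Features outside a coalition are set to their baseline value.
    """
    names = list(instance)
    n = len(names)
    values = {}
    for target in names:
        others = [name for name in names if name != target]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                without = {k: (instance[k] if k in coalition else baseline[k])
                           for k in names}
                with_target = dict(without)
                with_target[target] = instance[target]
                total += weight * (model(with_target) - model(without))
        values[target] = total
    return values


def toy_model(x):
    """Hypothetical risk score with a PRS-by-age interaction."""
    return 0.5 * x["prs"] + 0.1 * x["age"] + 0.2 * x["prs"] * x["age"]


instance = {"prs": 1.0, "age": 2.0}
baseline = {"prs": 0.0, "age": 0.0}
phi = shapley_values(toy_model, instance, baseline)
print(phi)
```

In this toy case, the contribution of the interaction term is split equally between the two features, so the attribution for PRS changes with age. A dependence plot makes this kind of pattern visible by plotting a feature's attribution against its value.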
Post-menopausal women between the ages of 40 and 69 at baseline were included because of the aforementioned etiological heterogeneity by menopausal status. Incident breast cancer was identified using International Classification of Diseases codes. PRS313 and PRS120k were considered as potential genetic features.
A total of 104,313 participants were included in this study, 4,010 of whom developed breast cancer over the follow-up period of 11.9 years. By combining ML with traditional statistical approaches from cancer epidemiology, the researchers identified several known and novel risk factors for incident post-menopausal breast cancer.
The identified known risk factors included age at menopause, testosterone, and age. Five novel predictors, including blood biochemistry, blood counts, and urine biomarkers, were also identified.
The newly identified predictors were strongly associated with the incidence of post-menopausal breast cancer. Further research is needed to understand whether these are potentially modifiable risk factors for breast cancer.
The XGBoost model selected a detailed body composition measure rather than body mass index (BMI), which suggests that a precise body composition measure is an important predictor of breast cancer. The basal metabolic rate was also found to be a significant predictor of breast cancer, which contradicts a previous study that found no association between basal metabolic rate and breast cancer.
Plasma urea, which is a blood biomarker related to kidney function, was also associated with breast cancer. This is the first time that an association of plasma phosphate, sodium, or creatinine in urine with breast cancer has been reported.
The two polygenic risk scores were ranked as the strongest risk factors by the agnostic ML models. Cox regressions confirmed that both PRS are significant predictors of post-menopausal breast cancer.
The current study identified five statistically significant novel associations with post-menopausal breast cancer, including urine biomarkers, blood counts, and blood biochemistry. When these five novel features were added to the baseline Cox model, discrimination performance was maintained. Furthermore, the two pre-specified PRS were the most important features according to their SHAP values.
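Discrimination for survival models is commonly summarized with a concordance index (C-index), although the summary above does not say which metric the authors used. The sketch below is a minimal, simplified version of Harrell's C on made-up follow-up data; the function name and all values are illustrative.

```python
def concordance_index(times, events, risk_scores):
    """Simplified Harrell's C: share of comparable pairs ranked correctly.

    A pair is comparable when the participant with the shorter follow-up time
    had an event. Ties in risk count as half; pairs with tied times are
    skipped (a simplification).
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Made-up data: years to event or censoring, event flag (1 = breast cancer
# diagnosis, 0 = censored), and a model's predicted risk.
times = [2.0, 5.0, 7.0, 11.9]
events = [1, 1, 0, 0]
risk = [0.9, 0.4, 0.6, 0.1]
c = concordance_index(times, events, risk)
print(c)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking. Maintained discrimination would mean that this kind of summary stays roughly the same after the new features are added.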
These findings motivate further research on the use of more precise anthropometry measures to improve breast cancer prediction. External validation of the results is the next important step ahead of implementation in clinical practice.
- Liu, X., Morelli, D., Littlejohns, T. J., et al. (2023) Combining machine learning with Cox models to identify predictors for incident post-menopausal breast cancer in the UK Biobank. Scientific Reports 13. doi:10.1038/s41598-023-36214-0