By learning hidden physiological patterns from overnight sleep studies, a new AI foundation model reveals how sleep can serve as an early warning system for disease risk years before clinical diagnosis.
Study: A multimodal sleep foundation model for disease prediction.
In a recent study published in Nature Medicine, researchers developed a multimodal sleep foundation model, SleepFM, for disease prediction.
From sleep disorders to systemic disease risk
Sleep disorders impact millions of individuals and are increasingly recognized as contributors to and indicators of various conditions. Polysomnography (PSG) is the gold standard for sleep analysis, capturing rich physiological signals. Previous machine learning studies have typically targeted individual diseases or limited sleep metrics, leaving much of the rich complexity captured by PSG underused.
SleepFM links overnight physiology to long-term disease risk
In the present study, researchers developed SleepFM, a multimodal sleep foundation model, for disease prediction. PSG data were used from four cohorts: BioSerenity, the Outcomes of Sleep Disorders in Older Men (MrOS), the Multi-Ethnic Study of Atherosclerosis (MESA), and Stanford Sleep Clinic (SSC). Together, these cohorts comprised around 65,000 participants and 585,000 hours of sleep recordings.
In addition, the Sleep Heart Health Study (SHHS) dataset was used to evaluate external transfer learning and generalization and was excluded from pretraining. The team employed a self-supervised contrastive learning objective for pretraining. After pretraining, the performance of SleepFM’s learned representations was assessed by fine-tuning on four benchmark tasks: sex classification, sleep stage classification, age estimation, and sleep apnea classification.
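To make the idea of a contrastive pretraining objective concrete, here is a minimal sketch of a generic InfoNCE-style loss in plain Python. It is an illustration only: the article does not describe SleepFM's exact loss, so the function names, the temperature value, and the toy two-dimensional embeddings for two hypothetical modalities are all assumptions. The loss is low when each recording's embedding from one modality is most similar to the embedding of the same recording from the other modality.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: anchors[i] should match positives[i], not positives[j]."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)

# Hypothetical embeddings of two recordings from two modalities (assumed values)
modality_a = [[1.0, 0.0], [0.0, 1.0]]
modality_b_aligned = [[0.9, 0.1], [0.1, 0.9]]   # same recordings, same order
modality_b_shuffled = [[0.1, 0.9], [0.9, 0.1]]  # pairs deliberately mismatched

loss_aligned = info_nce(modality_a, modality_b_aligned)
loss_shuffled = info_nce(modality_a, modality_b_shuffled)
print(loss_aligned < loss_shuffled)
```

In this toy setup, the correctly paired embeddings give a much smaller loss than the mismatched ones, which is the signal that pretraining of this kind optimizes without needing diagnostic labels.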
For age estimation, SleepFM predicted chronological age with a mean absolute error of 7.33 years. Performance varied by age group, with higher accuracy in middle-aged and pediatric groups and greater error in older adults. Sex classification achieved an area under the receiver operating characteristic curve (AUROC) of 0.86 and an area under the precision-recall curve of 0.9.
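For readers unfamiliar with the two metrics reported here, the following sketch shows how mean absolute error and AUROC can be computed from scratch. The ages, labels, and scores are made-up toy values, not data from the study, and the helper names are hypothetical. AUROC is computed as the fraction of positive-negative pairs in which the positive example receives the higher score, with ties counted as half.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def auroc(labels, scores):
    """Probability that a random positive outranks a random negative."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))

# Toy values for illustration only (assumed, not from the study)
ages_true = [30, 45, 60, 75]
ages_pred = [34, 41, 66, 65]
mae = mean_absolute_error(ages_true, ages_pred)

labels = [1, 1, 0, 0, 1]
scores = [0.9, 0.4, 0.5, 0.2, 0.7]
auc = auroc(labels, scores)
print(mae, round(auc, 3))
```

An AUROC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so the reported 0.86 means the model ranks a randomly chosen pair correctly about 86% of the time.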
SleepFM performed well in distinguishing wake, stage 2, and rapid eye movement stages but showed confusion in transitional sleep stages, such as stage 1, in line with known variability in scoring. Notably, the model achieved competitive performance compared to state-of-the-art models, including U-Sleep, Greifswald Sleep Stage Classifier (GSSC), Yet Another Spindle Algorithm (YASA), and STAGES, although specialized models occasionally outperformed SleepFM on certain external datasets.
For sleep apnea classification, SleepFM demonstrated competitive performance, with accuracies of 0.87 and 0.69 for presence and severity classification, respectively. Next, the researchers linked SSC data with electronic health records, extracting diagnostic codes and their timestamps for disease prediction. These codes were mapped to a hierarchical system of more than 1,800 disease categories designed for phenome-wide association studies (phecodes). After filtering for prevalence and temporal constraints, 1,041 phecodes were retained for evaluation, with cases defined as diagnoses occurring more than seven days after the sleep study to avoid trivial associations.
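The seven-day rule for defining cases can be sketched as a small labeling function. This is a hedged illustration, not the authors' pipeline: the function name and the three-way outcome are assumptions, and in particular the choice to exclude patients whose first diagnosis falls before or within seven days of the study is not stated in the article.

```python
from datetime import date, timedelta

EXCLUSION_WINDOW = timedelta(days=7)

def label_patient(study_date, diagnosis_dates):
    """Label one patient for one disease code relative to the sleep study date."""
    if not diagnosis_dates:
        return "control"  # never diagnosed with this code
    first_diagnosis = min(diagnosis_dates)
    if first_diagnosis - study_date > EXCLUSION_WINDOW:
        return "case"  # diagnosed more than seven days after the study
    # Assumption: earlier or near-simultaneous diagnoses are dropped,
    # since they would make the prediction trivial
    return "excluded"

study = date(2015, 1, 1)
print(label_patient(study, [date(2015, 1, 5)]))   # within the window
print(label_patient(study, [date(2016, 3, 10)]))  # well after the study
print(label_patient(study, []))                   # no diagnosis on record
```

Filtering this way keeps the task about future risk rather than about conditions that were already present or being worked up when the sleep study took place.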
SleepFM achieved robust results in various areas, including pregnancy-related complications, mental disorders, neoplasms, and circulatory conditions. The model achieved an AUROC of 0.93 for Parkinson’s disease and 0.84 for both developmental delays and disorders and mild cognitive impairment, measured over a six-year prediction window. Among circulatory conditions, SleepFM effectively predicted intracranial hemorrhage and hypertensive heart disease with six-year AUROC values of 0.82 and 0.88, respectively. The authors emphasize that these predictions reflect statistical risk stratification rather than causal relationships or imminent disease onset.
Among neoplasms, SleepFM demonstrated strong predictive performance for prostate cancer, melanomas of the skin, and breast cancer. The team then examined the model’s generalization capabilities across temporal distribution and external site validation. For temporal generalization, the model was tested on a separate cohort of Stanford patients from 2020 onwards; SleepFM maintained strong predictive performance despite the limited follow-up period.
To evaluate cross-site generalization, the transfer learning capabilities of SleepFM were assessed on the SHHS dataset. Embeddings from the pretrained model were extracted and fine-tuned on a subset of this dataset. Because outcome definitions differed across sites, evaluation was limited to six overlapping cardiovascular outcomes. SleepFM demonstrated robust transfer learning performance across these key outcomes, achieving significant predictive accuracy for congestive heart failure, stroke, and cardiovascular disease-related mortality.
Finally, the researchers compared SleepFM against two supervised baselines: a demographics model and an end-to-end PSG model. The demographics baseline was trained on structured clinical features, for example, body mass index, age, sex, and race or ethnicity. The end-to-end PSG model was trained on raw PSG data along with age and sex, but without pretraining.
The percentage difference in AUROC between the two baselines and SleepFM ranged from 5% to 17%. SleepFM consistently outperformed both baselines across most categories of diseases. Moreover, SleepFM was superior in predicting all-cause mortality, achieving an AUROC of 0.85, compared with 0.78 for both baselines. Across disease categories, the model demonstrated strong risk stratification performance, with more than 130 conditions achieving a Harrell’s C index of at least 0.75. According to the authors, these results highlight the potential of sleep as a rich, underused source of longitudinal health signals.
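Harrell's C index extends the idea behind AUROC to time-to-event data with censoring. The sketch below is a simple pairwise implementation on made-up follow-up data; the function name and the toy values are assumptions, and the study itself may have used a library implementation. A pair of patients is comparable when the one with the shorter follow-up time actually experienced the event, and it is concordant when that patient also received the higher risk score.

```python
def harrell_c_index(times, events, risks):
    """Fraction of comparable pairs in which the earlier event has the higher risk."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Comparable only if patient i had the event before patient j's time
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied risk scores count as half
    return concordant / comparable

# Toy follow-up data (assumed): years to event or censoring, event flag, risk score
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]  # 0 means censored (no event observed)
risks = [0.9, 0.3, 0.5, 0.1]

c_index = harrell_c_index(times, events, risks)
print(round(c_index, 2))
```

As with AUROC, 0.5 indicates random ordering and 1.0 perfect ordering, so a threshold of 0.75 means the model orders at least three of every four comparable patient pairs correctly.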
Sleep-based AI models could reshape early disease detection
In summary, the study developed a large-scale sleep foundation model using more than 585,000 hours of PSG data. SleepFM robustly predicted dementia, heart failure, chronic kidney disease, and death. On standard tasks such as apnea detection and sleep staging, the model performed comparably to state-of-the-art models. SleepFM also showed strong transfer learning capabilities, maintaining robust predictive power for several cardiovascular outcomes across independent datasets.
Furthermore, the model outperformed supervised baselines across diverse disease categories, predicting all-cause mortality more accurately than both baselines. However, the authors note that most data were derived from individuals referred for clinical sleep studies, which may limit generalizability to the broader population. They also acknowledge that, like many foundation models, SleepFM’s learned representations are not yet fully interpretable at the level of specific physiological mechanisms.
Overall, these findings suggest that SleepFM could complement existing risk assessment tools and help identify early disease signs. Future studies may explore how integrating sleep models with data from health records, imaging, and omics can enhance their utility.