Machine learning-based models can predict diabetes incidence across several ethnicities

In a recent study published in eClinicalMedicine, researchers developed questionnaire-based models for predicting diabetes mellitus type 2 (T2D) incidence and prevalence across differing ethnicities.

Study: Effective questionnaire-based prediction models for type 2 diabetes across several ethnicities: a model development and validation study. Image Credit: NicoElNino/Shutterstock.com


Screening and prediction tools are critical for detecting and managing T2D early, especially in non-white individuals, in whom a complex combination of factors leads to earlier onset and associated complications.

Machine learning (ML)-based technology can provide non-invasive screening, allowing for preliminary evaluations and referrals, eventually promoting population health and lowering healthcare expenditures.

About the study

In the present study, researchers developed T2D incidence and prevalence prediction models based on questionnaires using United Kingdom Biobank (UKBB) data (for training). They applied them to the Lifelines study data (for validation) for use among white and non-white individuals.

Questionnaire-based algorithms were trained on data from UKBB's white population. Their potential clinical value was assessed by comparison with two other models (which added variables such as physical measures and biological markers) and with gold-standard clinical risk-assessment tools for predicting T2D occurrence. Logistic regression was used to model T2D incidence and prevalence.
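As a rough illustration of how a logistic regression model turns questionnaire answers into a risk estimate, the sketch below applies the logistic function to a weighted sum of answers. The feature names echo variables mentioned in this article, but the coefficients and intercept are hypothetical values invented for the example; the fitted values from the study are not reported here.

```python
import math

# Hypothetical coefficients, invented for illustration only; the article
# does not report the fitted values.
EXAMPLE_COEFFICIENTS = {
    "bmi": 0.10,               # per kg/m^2
    "num_medications": 0.25,   # per medication taken
    "tv_hours_per_day": 0.08,  # per hour of television
}
EXAMPLE_INTERCEPT = -6.0


def predicted_risk(answers, coefficients=EXAMPLE_COEFFICIENTS,
                   intercept=EXAMPLE_INTERCEPT):
    """Logistic model: risk = 1 / (1 + exp(-(intercept + sum(coef * answer))))."""
    linear_score = intercept + sum(
        coefficients[name] * value for name, value in answers.items()
    )
    return 1.0 / (1.0 + math.exp(-linear_score))


# Two made-up respondents: the second has a higher BMI, takes more
# medications, and watches more television, so the model scores them higher.
lower = predicted_risk({"bmi": 22, "num_medications": 0, "tv_hours_per_day": 1})
higher = predicted_risk({"bmi": 35, "num_medications": 4, "tv_hours_per_day": 5})
```

Because the output is a probability between 0 and 1, a screening programme would still need to choose a cut-off above which a respondent is referred for testing.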

The training dataset comprised white UKBB participants (472,696 individuals, aged 37 to 73 years, with data collected from 2006 to 2010). The models were validated in five non-white ethnic groups (29,811 individuals) and externally validated using Lifelines data (168,205 individuals, aged 0 to 93 years, with data collected from 2006 to 2013).

Feature selection was performed for model development. The area under the receiver operating characteristic (ROC) curve (AUC) was used to measure predictive accuracy, and sensitivity analyses were performed to assess possible clinical value.
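For readers unfamiliar with the AUC, the sketch below computes it directly from its definition: the probability that a randomly chosen person with T2D receives a higher predicted risk than a randomly chosen person without it. The function name and the toy data are assumptions made for illustration and are not taken from the study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of cases and non-cases.

    labels: sequence of 0/1 values (1 = has T2D); scores: predicted risks.
    A tie between a case and a non-case counts as half a correct ordering.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data, made up for illustration (not from the study).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)  # 3 of 4 case/non-case pairs ordered correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking. The pairwise loop is quadratic in cohort size, so it suits small examples rather than biobank-scale data.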

Further, a reclassification analysis compared the questionnaire-only prediction models with models that included biomarkers and physical measures, and with clinical T2D risk tools.

T2D was diagnosed among training cohort participants using self-reported data, including clinician-based T2D diagnosis or hospital records with the International Classification of Diseases, ninth revision (ICD-9) diagnostic codes.

Validation cohort participants were categorized as those having incident or prevalent diabetes type 2 based on self-report.

According to the National Institute for Health and Care Excellence (NICE) recommendations, the thresholds for "potentially undiagnosed" T2D in the training and validation datasets included blood glucose levels above 7.0 mmol/L or glycated hemoglobin (HbA1c) levels greater than 48 mmol/mol. Individuals with "potentially undiagnosed" T2D were excluded from the analysis to minimize bias in the prevalence studies.
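The exclusion rule described above can be written as a simple check. This is a minimal sketch using the cut-offs quoted in this article; the function and parameter names are hypothetical, and treating missing measurements as "not flagged" is an assumption rather than something the article states.

```python
GLUCOSE_CUTOFF_MMOL_L = 7.0   # blood glucose cut-off quoted in the article
HBA1C_CUTOFF_MMOL_MOL = 48.0  # HbA1c cut-off quoted in the article


def potentially_undiagnosed(glucose_mmol_l, hba1c_mmol_mol, has_t2d_diagnosis):
    """Flag people without a T2D diagnosis whose biomarkers exceed either cut-off.

    Missing measurements (None) are treated as not exceeding the cut-off,
    which is an assumption made for this sketch.
    """
    if has_t2d_diagnosis:
        return False
    high_glucose = (glucose_mmol_l is not None
                    and glucose_mmol_l > GLUCOSE_CUTOFF_MMOL_L)
    high_hba1c = (hba1c_mmol_mol is not None
                  and hba1c_mmol_mol > HBA1C_CUTOFF_MMOL_MOL)
    return high_glucose or high_hba1c


# Made-up participants: (glucose, HbA1c, diagnosed?)
participants = [(5.2, 36.0, False), (7.4, 41.0, False), (8.1, 60.0, True)]
kept = [p for p in participants if not potentially_undiagnosed(*p)]
```

In this toy list, only the second participant is dropped: they have no diagnosis but a glucose value above the cut-off.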

In addition, the researchers excluded incident T2D cases diagnosed more than eight years after baseline, as well as individuals who did not develop T2D but did not return to the assessment center after eight years.

Further, the researchers validated the non-laboratory clinically concise Finnish Diabetes Risk Score (FINDRISC) and the clinical Australian T2D Risk Assessment Tool (AUSDRISK), which use nine and 13 features to predict incident T2D, respectively, spanning medical history, demographics, lifestyle, and anthropometrics.


In total, 67,083 and 631,748 individuals were included to assess T2D incidence and prevalence, respectively. Of note, T2D incidence and prevalence rates differed considerably between non-white and white individuals, with non-white groups showing up to roughly four-fold higher prevalence (between 12% and 23%) and 0.5 to 3.0 times the incidence (between 1.4% and 8.2%) of the white UKBB population (6.0% and 2.8%, respectively).
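The fold differences quoted above follow from dividing each non-white rate by the corresponding white UKBB rate. The short calculation below uses only the percentages given in this article; the helper name is an assumption for illustration.

```python
def fold_difference(rate_percent, reference_percent):
    """Ratio of a group's rate to the reference group's rate."""
    return rate_percent / reference_percent


WHITE_PREVALENCE = 6.0  # percent, white UKBB population
WHITE_INCIDENCE = 2.8   # percent, white UKBB population

# Lowest and highest rates reported across the non-white groups.
prevalence_folds = [fold_difference(r, WHITE_PREVALENCE) for r in (12.0, 23.0)]
incidence_folds = [fold_difference(r, WHITE_INCIDENCE) for r in (1.4, 8.2)]

print([round(f, 1) for f in prevalence_folds])  # [2.0, 3.8]
print([round(f, 1) for f in incidence_folds])   # [0.5, 2.9]
```

A ratio of 0.5 means the group with the lowest incidence had half the white UKBB rate, so "greater" applies only to ratios above 1.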

On the other hand, Lifelines exhibited a lower T2D prevalence (2%) and incidence (2%) than the white UKBB population, which might be partly explained by the age differences between the two populations.

In the white UKBB sample, the algorithms accurately predicted T2D prevalence (AUC of 0.9) and incidence over eight years (AUC of 0.9).

Both models replicated well in the Lifelines external validation, with AUC values of 0.8 and 0.9 for incidence and prevalence, respectively.

Both ML-based models consistently performed well across ethnicities, with AUC values ranging between 0.86 and 0.89 for the prevalence and between 0.82 and 0.88 for the incidence of T2D.

In general, the models outperformed the clinically validated non-laboratory tools, correctly reclassifying almost 3,000 additional cases. Adding biological markers, but not physical measures, improved model performance.

Both the prevalence and incidence models gave high importance to BMI and the number of medications used, which ranked among the top three features in each model. The incidence model also included a sedentary-behavior feature (time spent watching television).

In predicting prevalence and incidence across diverse populations, including the Lifelines cohort, the questionnaire-based ML models surpassed FINDRISC and AUSDRISK.

The questionnaire-only models achieved a good balance of sensitivity and specificity, as well as good positive predictive value (PPV) and negative predictive value (NPV), in all populations. The sensitivity-specificity balance improved in models that included biomarkers, resulting in higher PPV across groups.
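These four screening metrics all derive from the same counts of true and false positives and negatives. The sketch below shows the standard formulas; the counts are made up for illustration and do not come from the study.

```python
def screening_metrics(tp, fp, tn, fn):
    """Standard screening metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # share of true cases that are flagged
        "specificity": tn / (tn + fp),  # share of non-cases that are cleared
        "ppv": tp / (tp + fp),          # share of flagged people who are cases
        "npv": tn / (tn + fn),          # share of cleared people who are non-cases
    }


# Made-up counts for a cohort of 1,000 people with 10% prevalence.
toy = screening_metrics(tp=80, fp=180, tn=720, fn=20)
for name, value in toy.items():
    print(f"{name}: {value:.3f}")
```

In this toy cohort, sensitivity and specificity are both 0.8, yet the PPV is only about 0.31 because non-cases far outnumber cases. This is why PPV depends on how common T2D is in the screened population.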

The models correctly classified more cases than clinically validated prediction tools, with statistically significant differences in the white, Caribbean, other, and South Asian populations. Compared to the questionnaire models, adding physical measures correctly classified more incident cases in Lifelines. In almost every case, biomarker-based models outperformed the clinical tools.

Overall, the study findings showed that ML models trained on UK Biobank data successfully predicted T2D prevalence and incidence across all ethnicities, including non-white individuals.

These models outperformed existing methods, resulting in a precise, scalable, cost-effective strategy for identifying positive instances and forecasting risk.

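Journal reference:
Effective questionnaire-based prediction models for type 2 diabetes across several ethnicities: a model development and validation study. eClinicalMedicine.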

Written by

Pooja Toshniwal Paharia

Pooja Toshniwal Paharia is an oral and maxillofacial physician and radiologist based in Pune, India. Her academic background is in Oral Medicine and Radiology. She has extensive experience in research and evidence-based clinical-radiological diagnosis and management of oral lesions and conditions and associated maxillofacial disorders.


Please use one of the following formats to cite this article in your essay, paper or report:

  • APA

    Toshniwal Paharia, Pooja. (2023, September 27). Machine learning-based models can predict diabetes incidence across several ethnicities. News-Medical. Retrieved on July 13, 2024 from

  • MLA

    Toshniwal Paharia, Pooja. "Machine learning-based models can predict diabetes incidence across several ethnicities". News-Medical. 13 July 2024. <>.

  • Chicago

    Toshniwal Paharia, Pooja. "Machine learning-based models can predict diabetes incidence across several ethnicities". News-Medical. (accessed July 13, 2024).

  • Harvard

    Toshniwal Paharia, Pooja. 2023. Machine learning-based models can predict diabetes incidence across several ethnicities. News-Medical, viewed 13 July 2024,

