GPT-4 enhances clinical trial screening accuracy and cuts costs

In a recent study published in the new monthly journal NEJM AI, a group of researchers in the United States evaluated the utility of a Retrieval-Augmented Generation (RAG)-enabled Generative Pre-trained Transformer (GPT)-4 system in improving the accuracy, efficiency, and reliability of screening participants for clinical trials involving patients with symptomatic heart failure.

Study: Retrieval-Augmented Generation–Enabled GPT-4 for Clinical Trial Screening. Image Credit: Treecha / Shutterstock


Background

Screening potential participants for clinical trials is crucial to ensure eligibility based on specific criteria. Traditionally, this manual process relies on study staff and healthcare professionals, making it prone to human error, resource-intensive, and time-consuming. Natural language processing (NLP) can automate data extraction and analysis from electronic health records (EHRs) to enhance accuracy and efficiency, but traditional NLP struggles with complex, unstructured EHR data. Large language models (LLMs) such as GPT-4 have shown promise in medical applications. Further research is needed to refine the implementation of GPT-4 within RAG frameworks to ensure scalability, accuracy, and integration into diverse clinical trial settings.

About the study 

In the present study, the RAG-Enabled Clinical Trial Infrastructure for Inclusion Exclusion Review (RECTIFIER) system was evaluated in the Co-Operative Program for Implementation of Optimal Therapy in Heart Failure (COPILOT-HF) trial, which compares two remote-care strategies for heart failure patients. Traditional cohort identification involved querying the EHR, followed by manual chart reviews by non-clinically licensed staff to assess six inclusion and 17 exclusion criteria. RECTIFIER focused on one inclusion and 12 exclusion criteria derived from unstructured data, creating 14 prompts.

Using Microsoft Dynamics 365, yes/no values for criteria were captured during screening. An expert clinician provided "gold standard" answers for the 13 target criteria. The datasets were divided into development, validation, and test phases: notes from 100 patients were used for development, 282 for validation, and 1,894 for the test set.

GPT-4 Vision and GPT-3.5 Turbo were utilized, with the RAG architecture enabling effective handling of long clinical notes. Notes were split into chunks using a custom Python program and LangChain's recursive chunking strategy. Numerical vector representations (embeddings) of the chunks were generated, then indexed and searched with the Facebook AI Similarity Search (FAISS) library so that only the chunks most relevant to each question were passed to the model.
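The chunk-and-retrieve step can be illustrated with a minimal, dependency-free Python sketch. It is not the study's code: `recursive_split` is a simplified stand-in for LangChain's recursive splitter (it measures characters rather than tokens and omits chunk merging and overlap), word counts stand in for learned embeddings, and a brute-force cosine ranking stands in for a FAISS index. The sample note and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def recursive_split(text, chunk_size, separators=("\n\n", "\n", ". ", " ")):
    """Split text into pieces of at most chunk_size characters.

    The coarsest separator is tried first; finer separators are used
    only for pieces that are still too long.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: fall back to fixed-width cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    head, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(head):
        chunks.extend(recursive_split(piece, chunk_size, rest))
    return chunks


def embed(text):
    """Toy embedding: lowercase word counts (assumption, not a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# Hypothetical clinical note, for illustration only.
note = (
    "Patient seen in cardiology clinic.\n\n"
    "Reports shortness of breath and leg swelling consistent with heart failure.\n\n"
    "No history of dialysis. Lives with spouse."
)
chunks = recursive_split(note, chunk_size=80)
top = retrieve("Does the patient have symptoms of heart failure?", chunks, k=1)
print(top[0])
```

In a setup like this, only the retrieved chunks, rather than the full chart, would be placed in the prompt for each yes/no question, which keeps the input short.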

Fourteen prompts were used to generate "Yes" or "No" answers. Statistical analysis involved calculating sensitivity, specificity, and accuracy, with the Matthews correlation coefficient (MCC) as the primary evaluation metric. Cost analysis and comparison across demographic groups were also performed.
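For readers unfamiliar with these metrics, the sketch below computes them from paired yes/no answers using the standard confusion-matrix definitions. The function name and the ten example answers are made up for illustration and are not study data; positive predictive value is included because the results report it as well.

```python
import math


def screening_metrics(predicted, gold):
    """Compare predicted yes/no answers (True/False) with gold-standard answers.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    pairs = list(zip(predicted, gold))
    tp = sum(1 for p, g in pairs if p and g)
    tn = sum(1 for p, g in pairs if not p and not g)
    fp = sum(1 for p, g in pairs if p and not g)
    fn = sum(1 for p, g in pairs if not p and g)

    def ratio(numerator, denominator):
        # Undefined ratios (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "accuracy": ratio(tp + tn, len(pairs)),
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Made-up answers for ten patients on a single criterion.
gold = [True] * 4 + [False] * 6
predicted = [True, True, True, False, False, False, False, False, False, True]
metrics = screening_metrics(predicted, gold)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when "Yes" answers are rare, as is typical for exclusion criteria.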

Study results 

In the validation set, note lengths varied from 8 to 7,097 words, with 75.1% containing 500 words or fewer and 92% containing 1,500 words or fewer. In the test set, clinical notes for 26% of patients exceeded GPT-4's 128k token context window limit. A chunk size of 1,000 tokens outperformed a chunk size of 500 tokens on 10 of 13 criteria. Consistency analysis on the validation dataset showed consistency ranging from 99.16% to 100%, with the standard deviation of accuracy between 0% and 0.86%, indicating minimal variation.

In the test set, both COPILOT-HF study staff and RECTIFIER demonstrated high sensitivity and specificity across the 13 target criteria. Sensitivity for individual questions ranged from 66.7% to 100% for the study staff and 75% to 100% for RECTIFIER. Specificity ranged from 82.1% to 100% for the study staff and 92.1% to 100% for RECTIFIER. Positive predictive value ranged from 50% to 100% for the study staff and 75% to 100% for RECTIFIER. The answers of both closely aligned with expert clinicians' answers, with accuracy between 91.7% and 100% (MCC, 0.644 to 1) for the study staff and 97.9% and 100% (MCC, 0.837 to 1) for RECTIFIER. RECTIFIER performed better for the inclusion criterion of "symptomatic heart failure," with an accuracy of 97.9% versus 91.7% and an MCC of 0.924 versus 0.721.

Overall, the sensitivity and specificity for determining eligibility were 90.1% and 83.6% for the study staff and 92.3% and 93.9% for RECTIFIER. When inclusion and exclusion questions were combined into two prompts, or when GPT-3.5 was used instead of GPT-4 with the same RAG architecture, sensitivity and specificity decreased. In a subset of 35 patients, 15 of whom RECTIFIER had misclassified on the symptomatic heart failure criterion, using GPT-4 without RAG slightly improved accuracy from 57.1% to 62.9%. No statistically significant bias in performance across race, ethnicity, and gender was found.
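How the per-criterion answers roll up into a single eligibility decision is not spelled out here. The Python sketch below assumes the usual convention: a patient is eligible only if every inclusion criterion is answered "Yes" and every exclusion criterion is answered "No". Apart from "symptomatic heart failure", the criterion names are hypothetical placeholders.

```python
def is_eligible(answers, inclusion, exclusion):
    """Combine per-criterion answers into one decision.

    answers maps a criterion name to True ("Yes") or False ("No").
    Assumed rule: all inclusion criteria met, no exclusion criterion met.
    """
    meets_inclusion = all(answers[name] for name in inclusion)
    hits_exclusion = any(answers[name] for name in exclusion)
    return meets_inclusion and not hits_exclusion


INCLUSION = ["symptomatic heart failure"]
# Placeholder names; the study's 12 exclusion criteria are not listed here.
EXCLUSION = ["hypothetical exclusion A", "hypothetical exclusion B"]

patient_1 = {
    "symptomatic heart failure": True,
    "hypothetical exclusion A": False,
    "hypothetical exclusion B": False,
}
patient_2 = dict(patient_1, **{"hypothetical exclusion A": True})

print(is_eligible(patient_1, INCLUSION, EXCLUSION))
print(is_eligible(patient_2, INCLUSION, EXCLUSION))
```

Under this rule a single wrong answer on any criterion can flip the overall decision, which is one reason overall sensitivity and specificity can be lower than the per-criterion figures.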

The cost per patient with RECTIFIER was 11 cents using the individual-question approach and 2 cents using the combined-question approach. Due to the increased character inputs required, using GPT-4 and GPT-3.5 without RAG resulted in higher costs of $15.88 and $1.59 per patient, respectively.
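To put these per-patient figures in perspective, the short Python calculation below scales each one to a cohort the size of the 1,894-patient test set. It assumes the reported averages scale linearly with the number of patients, which is a simplification; amounts are kept in whole cents to avoid floating-point rounding.

```python
TEST_SET_PATIENTS = 1894

# Average cost per patient in cents, as reported above.
COST_CENTS_PER_PATIENT = {
    "RECTIFIER, individual questions": 11,
    "RECTIFIER, combined questions": 2,
    "GPT-4 without RAG": 1588,
    "GPT-3.5 without RAG": 159,
}


def total_cents(per_patient_cents, patients):
    """Total cost in cents, assuming cost scales linearly with patients."""
    return per_patient_cents * patients


totals = {
    name: total_cents(cents, TEST_SET_PATIENTS)
    for name, cents in COST_CENTS_PER_PATIENT.items()
}
for name, cents in totals.items():
    print(f"{name}: ${cents / 100:,.2f}")
```

On these assumptions, screening the whole test set would cost roughly $208 with the individual-question approach, compared with about $30,000 for GPT-4 without RAG.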


Conclusions

To summarize, RECTIFIER demonstrated high accuracy in screening patients for clinical trials, outperforming study staff on some measures at a cost of only 11 cents per patient. In contrast, traditional screening methods for a phase 3 trial can cost approximately $34.75 per patient. These findings suggest significant potential improvements in the efficiency of patient recruitment for clinical trials. However, automating the screening process raises concerns about potential hazards, such as missing nuanced patient context and introducing operational risks, so careful implementation is needed to balance benefits and risks.

Journal reference:
Retrieval-Augmented Generation–Enabled GPT-4 for Clinical Trial Screening. NEJM AI.

Written by

Vijay Kumar Malesu

Vijay holds a Ph.D. in Biotechnology and possesses a deep passion for microbiology. His academic journey has allowed him to delve deeper into understanding the intricate world of microorganisms. Through his research and studies, he has gained expertise in various aspects of microbiology, which includes microbial genetics, microbial physiology, and microbial ecology. Vijay has six years of scientific research experience at renowned research institutes such as the Indian Council for Agricultural Research and KIIT University. He has worked on diverse projects in microbiology, biopolymers, and drug delivery. His contributions to these areas have provided him with a comprehensive understanding of the subject matter and the ability to tackle complex research challenges.    


Please use the following format to cite this article in your essay, paper or report:

    Kumar Malesu, Vijay. (2024, June 18). GPT-4 enhances clinical trial screening accuracy and cuts costs. News-Medical. Retrieved on July 25, 2024.


The opinions expressed here are the views of the writer and do not necessarily reflect the views and opinions of News Medical.