Promoting Precision Medicine Using Data Science of Large Datasets

Insights from industry: Dr. Rajat Mukherjee, Statistician, Director Data Science, Cytel

An interview with Dr. Rajat Mukherjee conducted by Alina Shrourou, BSc

Please give an overview of what exactly data science is, and why it’s important to promote precision medicine.

I see data science as a marriage between statistical science and informatics: it applies statistical principles of math and logic to huge volumes of data.

You have to rely on informatics to store, read, and then apply these complex statistical algorithms to make sense of huge volumes of information.

Good use of data science can lead to major breakthroughs in medical research in areas like diagnostics, precision medicine, and real-world evidence. By taking large datasets, we can study whether different approaches have markedly different effects in different patient populations.

What are the benefits of big data analysis compared to traditional collection and analysis?

That's a good question. In traditional collection and analysis, or randomized clinical trials, the data are often collected in very controlled environments. Things like environmental factors may be very well controlled. On the other hand, factors like genomics may not be accounted for.

Nowadays, real-world data studies and even randomized clinical trials are being designed differently to accommodate the variability and heterogeneity that environmental and genetic factors can bring. Data science provides a platform for systematically studying how environmental and genetic factors interact, and how those interactions shape therapeutic effects.

Please outline how you and your research team use data to inform diagnostics. What other biomedical applications of data science are there?

We study biomedical signals and images to develop statistical classifiers that can be used for diagnostics. We also work in the area of precision medicine, researching genetics, related areas, and biomarkers that can help enrich populations for targeted therapeutics. This is becoming more and more popular and important, with applications in oncology, rare diseases, and difficult-to-study diseases like Alzheimer’s and Parkinson’s disease.

Another area where data science is useful is in monitoring the patient population for both minor and harmful side effects of therapeutics that are on the market. Many side effects only become known in the long term, which makes them hard to capture in short-term clinical trials.

How can biomedical signals be used as a source of data for diagnosis? Please describe how signal data is processed and transformed to help with data analytics.

Biomedical signals are high-dimensional data, and they need a lot of what we call pre-processing. Essentially, it is the process of filtering out the noise and extracting the valuable information from these signals.

The next step is feature extraction. We use these methods because you cannot use each and every component of the high-dimensional data. You must extract features from these signals and images that are informative of disease status.

Next follows feature selection, i.e. selecting the extracted features, or combinations of them, that have the highest association with disease status. A diagnostic classifier can then be developed and validated using an independent test or validation set. The validation set must be independent of the data set used to develop the classifier. In general, diagnostic development and validation using signals and images are done in two separate clinical trials. However, our team works on different seamless options that may lead to much more efficient but still statistically valid designs.
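As an editorial illustration, the steps described above (pre-processing, feature extraction, feature selection, then validation on an independent set) can be sketched in a few lines of Python. This is a toy sketch under stated assumptions, not the method used by Dr. Mukherjee's team: the simulated signals, the moving-average filter, the three summary features, the separation score, and the single-threshold classifier are all hypothetical choices made for the example.

```python
import math
import random
import statistics


def simulate_signal(diseased, rng, length=200):
    """Hypothetical data: 'diseased' signals have a larger amplitude."""
    amplitude = 2.0 if diseased else 1.0
    return [amplitude * math.sin(i / 5) + rng.gauss(0, 0.5) for i in range(length)]


def moving_average(signal, window=5):
    """Pre-processing: damp noise with a simple moving-average filter."""
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        smoothed.append(sum(signal[lo:hi]) / (hi - lo))
    return smoothed


def extract_features(signal):
    """Feature extraction: reduce a long signal to a few summary numbers."""
    return {
        "mean": statistics.fmean(signal),
        "std": statistics.pstdev(signal),
        "range": max(signal) - min(signal),
    }


def separation(values, labels):
    """Feature selection score: gap between class means, in units of spread."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    spread = statistics.pstdev(values) or 1.0
    return abs(statistics.fmean(pos) - statistics.fmean(neg)) / spread


def build_dataset(n, rng):
    """Simulate n labelled signals and turn each into a feature dictionary."""
    labels = [i % 2 for i in range(n)]
    feats = [extract_features(moving_average(simulate_signal(y, rng))) for y in labels]
    return feats, labels


def train_classifier(feats, labels):
    """Pick the best-separating feature and place a threshold midway between classes."""
    best = max(feats[0], key=lambda name: separation([f[name] for f in feats], labels))
    pos_mean = statistics.fmean(f[best] for f, y in zip(feats, labels) if y == 1)
    neg_mean = statistics.fmean(f[best] for f, y in zip(feats, labels) if y == 0)
    direction = 1 if pos_mean > neg_mean else -1
    return best, (pos_mean + neg_mean) / 2, direction


def predict(model, feat):
    """Classify one feature dictionary as diseased (1) or not (0)."""
    name, threshold, direction = model
    return 1 if direction * (feat[name] - threshold) > 0 else 0


def accuracy(model, feats, labels):
    """Fraction of correctly classified cases."""
    hits = sum(predict(model, f) == y for f, y in zip(feats, labels))
    return hits / len(labels)


rng = random.Random(0)
train_feats, train_labels = build_dataset(100, rng)  # development set
test_feats, test_labels = build_dataset(100, rng)    # independent validation set
model = train_classifier(train_feats, train_labels)
test_accuracy = accuracy(model, test_feats, test_labels)
print("selected feature:", model[0], "validation accuracy:", test_accuracy)
```

The point of the sketch is the separation of roles: the classifier is fitted only on the development set, and its accuracy is reported only on signals it has never seen.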

We have been involved in a few diagnostics projects where we have taken biomedical signals and images and transformed them into classifiers. In one of the biggest projects, we now have a classifier and the pivotal validation studies are ongoing. The pivotal validation trial uses an operationally seamless, threshold-optimization, group-sequential adaptive design, which has been accepted by the FDA's Center for Devices and Radiological Health (CDRH).

How can data science be used to inform biomarker identification and selection? What work is Cytel doing in this area?

Biomarkers are another interesting example, as they play a key role in providing precision medicine strategies. Biomarkers can be diagnostic, prognostic or predictive. Predictive biomarkers support enrichment strategies for therapies that may only work in a particular sub-population, which can be classified as the biomarker-positive sub-population.
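A small worked calculation shows why enrichment matters. The sketch below assumes, purely for illustration, that a therapy improves the response rate by 40 percentage points in biomarker-positive patients and not at all in biomarker-negative patients, and that 30% of patients are biomarker-positive. None of these numbers come from the interview, and the function name is hypothetical.

```python
def average_effect(prevalence, effect_positive, effect_negative):
    """Average treatment effect in a population that mixes both subgroups."""
    return prevalence * effect_positive + (1 - prevalence) * effect_negative


# Hypothetical inputs: 30% biomarker-positive, effect only in that subgroup.
all_comers = average_effect(prevalence=0.3, effect_positive=0.40, effect_negative=0.0)
# An enriched trial enrolls biomarker-positive patients only.
enriched = average_effect(prevalence=1.0, effect_positive=0.40, effect_negative=0.0)

print(f"all-comers effect: {all_comers:.2f}, enriched effect: {enriched:.2f}")
```

Under these assumptions the effect seen in an all-comers trial is diluted to roughly a third of the effect in the enriched population, which is why a therapy that works well for a sub-population can look weak when studied in everyone.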

Biomarker development relies heavily on data science techniques such as data filtering and reduction, and on machine learning techniques that classify patients as biomarker-positive or biomarker-negative.

Please can you also describe data mining work that you are conducting, and how it can inform decision making?

We have yet to use data mining on big data, but we will in the future. However, we have used data mining to inform go/no-go decisions. For example, if there have been multiple early-phase studies on a particular area or therapeutic, we can pool all of the early-phase data and mine it to reach a go/no-go decision: either to advance the pipeline or to bring it to an end.

Another new area of interest where we have used data mining techniques is pharmacovigilance, where large amounts of post-marketing data are used to generate signals for adverse events.
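The interview does not say which signal-generation method is used. As one illustration, a widely used disproportionality measure in pharmacovigilance is the proportional reporting ratio (PRR), which compares how often an event is reported for one drug with how often it is reported for all other drugs. The sketch below computes it from a 2x2 table of report counts; the counts and the screening cut-offs are hypothetical values chosen for the example.

```python
def proportional_reporting_ratio(a, b, c, d):
    """PRR from a 2x2 table of spontaneous report counts.

    a: reports of the event of interest for the drug of interest
    b: reports of all other events for the drug of interest
    c: reports of the event of interest for all other drugs
    d: reports of all other events for all other drugs
    """
    if a + b == 0 or c + d == 0 or c == 0:
        raise ValueError("PRR is undefined for these counts")
    return (a / (a + b)) / (c / (c + d))


def flag_signal(a, b, c, d, min_prr=2.0, min_cases=3):
    """Illustrative screening rule; the cut-offs are assumptions, not a standard."""
    return a >= min_cases and proportional_reporting_ratio(a, b, c, d) >= min_prr


# Hypothetical counts: the event makes up 2% of reports for the drug
# and 0.5% of reports for all other drugs.
prr = proportional_reporting_ratio(a=20, b=980, c=100, d=19900)
is_signal = flag_signal(a=20, b=980, c=100, d=19900)
print(f"PRR = {prr:.1f}, flagged = {is_signal}")
```

A flagged signal of this kind is a prompt for further review, not evidence that the drug causes the event.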

How important are Bayesian models within data science?

As a statistician, when I hear about real-world data I automatically think of Bayesian methods, which are ideally suited to accumulating data because they update the information of interest as new data arrive.
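The idea of updating on accumulating data can be shown with the simplest Bayesian model, a Beta prior on a response rate combined with binomial data. This is an editorial sketch, not a model from the interview: the uniform Beta(1, 1) prior and the three batches of results are hypothetical.

```python
def update_beta(alpha, beta, successes, failures):
    """Conjugate update: a Beta(alpha, beta) prior plus binomial data
    gives a Beta(alpha + successes, beta + failures) posterior."""
    return alpha + successes, beta + failures


# Assumed starting point: uniform Beta(1, 1) prior on the response rate.
alpha, beta = 1.0, 1.0

# Hypothetical batches of (responders, non-responders) arriving over time.
batches = [(3, 7), (6, 4), (5, 5)]

posterior_means = []
for responders, non_responders in batches:
    alpha, beta = update_beta(alpha, beta, responders, non_responders)
    posterior_means.append(alpha / (alpha + beta))  # mean of a Beta(alpha, beta)

print("posterior after all batches: Beta(%g, %g)" % (alpha, beta))
print("posterior mean after each batch:", [round(m, 3) for m in posterior_means])
```

Each batch's posterior becomes the prior for the next batch, so the estimate is refined as data accumulate without re-analysing everything from scratch.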

Bayesian methods can also be used for automatic feature extraction and selection. These areas suffer from the absence of a uniform methodology, and Bayesian methods can fill that gap.

Do you think data science will change the way we manage large volumes of data? What does the future hold for data science and the healthcare and drug discovery industry?

Managing large volumes of data is part of data science, so yes, data science has an integral role to play in the way we manage them. It will mean having specialized people to take care of big data in the right way, so that it can be applied in real time. Data science is going to change how large volumes of data are stored, accessed and applied.

I think incorporating data science into a drug-development team opens the door to effective multidisciplinary teams attacking a common problem from different directions.

Where can readers find more information?

About Rajat Mukherjee

Rajat Mukherjee has 15 years of professional experience as an industry and academic statistician, and brings a range of expert knowledge to Cytel’s customers. This includes work in pattern recognition problems for devices and biomarker discovery, Bayesian clinical trials, adaptive designs, and design and analysis of complex epidemiological studies.

His experience and expertise also include statistical computing, survival analysis, longitudinal analysis, nonparametric and semiparametric inference, as well as statistical classification and high-dimensional data. Rajat has a strong background and interest in developing statistical methodology and applying it to real-life medical problems.


The opinions expressed here are the views of the writer and do not necessarily reflect the views and opinions of News-Medical.Net.