Registry Science: where medicine and data science intersect

By Keynote Contributor Dr. Steve Labkoff, Global Head of Clinical and Healthcare Informatics at Quantori

For the past decade, healthcare and technology media have highlighted the promise of “Big Data” powered by artificial intelligence (AI) and machine learning (ML). It seems big data can solve everything from supply chain issues to curing lethal diseases like cancer, as long as there is enough data about a given subject for the computer algorithms to analyze. The informatics and data science fields are all about taking data, creating knowledge, and gaining new insights that generate value based on outcomes. However, there are challenges in the value chain for the healthcare industry, because the quality of insights depends directly on the quality and quantity of the data used to generate them. Unfortunately, getting high-quality medical data, especially from electronic health records (EHRs), is an exceptionally complex task.

Image: medical data. Image Credit: PopTika/

Although doctors are paid to care for patients rather than to perform EHR data entry, some studies report that over 50% of a physician’s daily work is spent keying information into EHRs. When clinicians “cut & paste” text into their notes, the data becomes non-specific and less granular, and skipped fields create data gaps. This incomplete prose, in turn, makes the record harder to interpret.

Computers with clever algorithms can help clean up some data, but missingness is still problematic since you can’t add back what was never there. In a recent project I spearheaded for a major Midwest health system, the team needed to identify as many patients as possible with a rare blood cancer. At the outset of the project, the assertion was that the system saw hundreds of new cases each year and could secure as many cases as necessary for the analysis. In the end, it could produce only 96 patient cases. While this may seem like many patients with a rare blood cancer, it is not enough for the sophisticated artificial intelligence and machine learning required to generate valuable insights. Simply put, the lack of complete data at a health system with a catchment area of over three million covered lives became rate-limiting for completing the research project.
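To make the missingness problem concrete, here is a minimal Python sketch of the kind of audit one might run before committing to an analysis. The records, field names, and the choice of required fields are all hypothetical, invented for illustration; they are not drawn from the project described above.

```python
# Hypothetical extracted records; None marks a value that was never captured.
records = [
    {"patient_id": "p1", "diagnosis_date": "2020-03-01", "stage": "II", "cytogenetics": None},
    {"patient_id": "p2", "diagnosis_date": None, "stage": None, "cytogenetics": None},
    {"patient_id": "p3", "diagnosis_date": "2021-07-15", "stage": "III", "cytogenetics": "t(4;14)"},
]

# Assumed list of fields the analysis cannot do without.
REQUIRED_FIELDS = ("diagnosis_date", "stage", "cytogenetics")


def missing_rate(rows, field):
    """Return the fraction of rows in which `field` is absent or None."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(field) is None)
    return missing / len(rows)


def complete_cases(rows, fields):
    """Keep only rows in which every required field has a value."""
    return [row for row in rows if all(row.get(f) is not None for f in fields)]


if __name__ == "__main__":
    for name in REQUIRED_FIELDS:
        print(f"{name}: {missing_rate(records, name):.0%} missing")
    usable = complete_cases(records, REQUIRED_FIELDS)
    print(f"{len(usable)} of {len(records)} records are complete")
```

Running an audit like this early shows how quickly a nominally large cohort can shrink once only complete cases are counted.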

Registry Science: translating research into insights to empower clinical care

A strategy does exist to combat this issue. However, it requires expertise in informatics, data analytics and visualization, and AI and ML technologies. It involves the construction of medical registries that can do what EHRs do but in a more focused and deliberate manner. Medical registries enroll patients of a particular type or class, usually determined by inclusion and exclusion criteria, and then collect the needed data over time, repeatedly and doggedly, to ensure that as much data as possible about the patient or situation can be brought to bear. This can be a demanding task; however, the upside is “cleaner” and more complete data from which to start the analysis, which is crucial for increasing insights about rare diseases with low incidence and prevalence rates.
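The two ideas in the paragraph above, selecting patients by inclusion and exclusion criteria and then collecting data on them repeatedly over time, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name, the criteria (ICD-10 code C90.0 for multiple myeloma, adults only, consent required), and the lab measurement are hypothetical choices, not a description of any particular registry.

```python
from dataclasses import dataclass, field
from datetime import date

# Assumed criteria for an illustrative multiple myeloma registry.
INCLUSION_CODES = {"C90.0"}  # ICD-10 code for multiple myeloma
MIN_AGE = 18                 # hypothetical lower age bound


@dataclass
class RegistryPatient:
    patient_id: str
    diagnosis_code: str
    age_at_diagnosis: int
    consented: bool
    visits: list = field(default_factory=list)  # (visit_date, measurements) pairs


def is_eligible(patient):
    """Apply inclusion criteria first, then exclusion criteria."""
    meets_inclusion = (
        patient.diagnosis_code in INCLUSION_CODES
        and patient.age_at_diagnosis >= MIN_AGE
    )
    excluded = not patient.consented  # no consent means no enrollment
    return meets_inclusion and not excluded


def record_visit(patient, visit_date, measurements):
    """Append one follow-up observation and keep the timeline in date order."""
    patient.visits.append((visit_date, measurements))
    patient.visits.sort(key=lambda visit: visit[0])


candidates = [
    RegistryPatient("p1", "C90.0", 64, consented=True),
    RegistryPatient("p2", "C90.0", 58, consented=False),
    RegistryPatient("p3", "I10", 70, consented=True),
]
registry = [p for p in candidates if is_eligible(p)]

# Longitudinal collection: visits may arrive out of order.
record_visit(registry[0], date(2021, 6, 1), {"m_protein_g_dl": 1.2})
record_visit(registry[0], date(2021, 1, 10), {"m_protein_g_dl": 2.4})
```

Keeping each patient's visits in date order is what lets a registry reconstruct the patient journey rather than a pile of disconnected encounters.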

This area of medical informatics and data science that intersects with medicine and epidemiology is called “Registry Science.” Although it’s been around for a long time, it’s only recently gained importance with the advent of next-generation technologies that can generate new and more complex data sets. These technological innovations include single-cell RNA sequencing, next-generation DNA sequencing, and time-of-flight mass spectrometry for single-cell analysis that can produce datasets that may be combined with electronic patient records to follow the patient’s journey in ways not previously possible. Registry Science, therefore, enables the production of highly curated data sets that help facilitate the discovery of new insights to power personalized medicine.

Registries began in the mid-20th century, using shoeboxes of index cards to track patient data. Over the last few decades, these rudimentary registries evolved into sophisticated means of generating the complex and complete data sets needed as inputs for multivariate analysis performed by machine learning and used by artificial intelligence engines. The emergence of additional data types, such as genomic sequencing and immunologic profiling, made it possible to combine multiple kinds of data and examine a disease from many points of view.

However, the old problems of the field remain. One of the most important things to study in a medical registry is the overall patient journey, and securing high-quality patient medical data is crucial to that effort. There are well over 300 electronic medical record systems in the U.S., with a handful of systems dominating the space (Epic, Cerner, Meditech, Allscripts, and eClinicalWorks). What are the odds that these EHR vendors make it easy to share data between their systems? It turns out the odds aren’t that good.

Image: electronic medical records. Image Credit: Tero Vesalainen/

Through the Health Information Technology for Economic and Clinical Health (HITECH) Act of 2009, the government catalyzed the adoption of EHR systems. However, its efforts to incentivize “meaningful use” and interoperability proved challenging at best.

In 2021, major challenges to true interoperability remain. This means that the aggregation of medical data from disparate systems for things like cancer research is substantially harder than it should be.

Fast track to data curation and analysis

Compounding the issue is the fact that the “owners” or “stewards” of the majority of these data, primarily large hospitals and universities, rarely have incentives to share their data, even when provided with signed consent from the patient. There are disincentives to sharing because these data have become the keys to unlocking scientific discovery and are treated as proprietary information. Whoever holds the stewardship rights to the data may conduct the research they want, provided their study protocols are approved by the Institutional Review Board at their institution. Sharing between institutions remains difficult, at best.

Fortunately, in April of 2021, a rule went into effect as part of the 21st Century Cures Act. One of its provisions empowered the creation and use of the United States Core Data for Interoperability (USCDI), a standardized set of health data classes and constituent data elements for nationwide, interoperable health information exchange. The USCDI, combined with an emerging standard for exchanging healthcare information electronically, Fast Healthcare Interoperability Resources (FHIR®), is a recipe for what might be considered a fast track to data curation, aggregation, and analysis. FHIR helps move the data from place to place, and USCDI helps define how it should be named and stored. This might be the “beginning of the end” of large-scale, manual curation. Another provision in the rule outlaws “data blocking,” namely any impediment to the sharing of data (provided you have the appropriate permissions).
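As a rough illustration of why a common exchange format matters for registries, the Python sketch below takes a small FHIR Bundle, expressed as JSON, and flattens its Patient and Condition resources into registry-style rows. The sample bundle and the `flatten_bundle` helper are invented for this example; the field names follow the FHIR R4 Patient, Condition, and Bundle resources, but the set of columns kept is an assumption and does not represent the full USCDI.

```python
import json

# Hypothetical sample bundle; real bundles carry many more fields.
bundle_json = """
{
  "resourceType": "Bundle",
  "type": "collection",
  "entry": [
    {"resource": {"resourceType": "Patient", "id": "p1",
                  "gender": "female", "birthDate": "1957-04-02"}},
    {"resource": {"resourceType": "Condition", "id": "c1",
                  "subject": {"reference": "Patient/p1"},
                  "code": {"coding": [{"system": "http://hl7.org/fhir/sid/icd-10",
                                       "code": "C90.0",
                                       "display": "Multiple myeloma"}]},
                  "onsetDateTime": "2021-01-10"}}
  ]
}
"""


def flatten_bundle(bundle):
    """Join each Condition to its Patient and return one flat row per condition."""
    patients = {}
    conditions = []
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        kind = resource.get("resourceType")
        if kind == "Patient":
            patients[resource["id"]] = resource
        elif kind == "Condition":
            conditions.append(resource)

    rows = []
    for condition in conditions:
        # References look like "Patient/p1"; keep the id after the slash.
        reference = condition.get("subject", {}).get("reference", "")
        patient_id = reference.split("/")[-1]
        patient = patients.get(patient_id, {})
        # Use the first coding if present; otherwise fall back to an empty dict.
        coding = (condition.get("code", {}).get("coding") or [{}])[0]
        rows.append({
            "patient_id": patient_id,
            "gender": patient.get("gender"),
            "birth_date": patient.get("birthDate"),
            "condition_code": coding.get("code"),
            "condition_display": coding.get("display"),
            "onset": condition.get("onsetDateTime"),
        })
    return rows


rows = flatten_bundle(json.loads(bundle_json))
```

Because every sending system uses the same resource shapes, one small function like this can, in principle, serve data from many institutions, which is the practical meaning of replacing manual curation.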

Registry Science holds huge potential for clinical trials, evaluating patient outcomes, epidemiological research, and spurring regulatory action. However, we still have a long way to go. For example, to study an ultra-rare disease, finding enough patients across the U.S. or the world can be onerous, as evidenced in the recent findings from many current registries. Issues with patient consent, sharing among institutions, and institutional inertia make collecting these data challenging. The promise of research that employs artificial intelligence and machine learning to empower personalized medicine by way of FHIR and USCDI is going to be tested in the next several years. We can only hope that some of the issues that make this work arduous will be solved.

Now, suppose we started paying doctors to create better documentation in their EHR systems; we might get better data to discover the origins of major diseases like cancer, heart disease, Alzheimer’s disease, and respiratory diseases, including COVID-19. If the COVID pandemic has taught us anything in healthcare, it’s that being able to rapidly aggregate data into registries can be a critically important capability for the health and wellbeing of individuals, our nation, and the world.

About Dr. Steve Labkoff

Dr. Steven Labkoff is one of the leading clinician-informaticians in the U.S. today, with nearly 30 years of experience in the life sciences and healthcare sector. Trained in medical informatics, cardiology, and internal medicine at Harvard Medical School, MIT, Rutgers School of Biomedical and Health Sciences, and the University of Pittsburgh, he has deep expertise in generating, managing, and analyzing data to accelerate drug development, personalize patient care, and improve medical outcomes. At Quantori, he serves as Global Head of Clinical and Healthcare Informatics, assisting life science clients in developing and implementing innovative informatics solutions throughout the entire drug development lifecycle. In his spare time, Labkoff is an award-winning photographer.


Disclaimer: This article has not been subjected to peer review and is presented as the personal views of a qualified expert in the subject in accordance with the general terms and conditions of use of the News-Medical.Net website.

Last Updated: Nov 22, 2021


The opinions expressed here are the views of the writer and do not necessarily reflect the views and opinions of News Medical.