7 Challenges of Mastering Clinical Data Registries

Many research projects fall short (at best) or fail altogether (at worst) because, through necessity or habit, they rely on Excel or a clinical trials management system (CTMS) for electronic data capture (EDC) and storage. And, let’s face it, the reason is typically either a lack of budget or not knowing that purpose-built research operations software even exists.

In this short blog post, we’ll cover the challenges you should consider before choosing a platform to manage and store your important research effort and output. As we like to say, some data is too important not to share (securely and legally).

Building clinical data registries (CDRs) that support acquisition, curation, and dissemination of clinical research data poses a number of unique challenges not met by clinical trials software or Excel spreadsheets. For example, a center-level CDR system needs to accumulate data across multiple studies, time points, and data types (or may need to in the future!).

At the same time, it needs to support research operations workflows that vary across projects. CDRs must operate within a complex ecology of data sources, consumers, and governance, and that ecology poses a number of informatics challenges for anyone delivering a clinical data registry.

In this article, we will examine the top 7 challenges of mastering clinical data registries. Understanding each one will help you build a CDR that can compile large amounts of data, support complex research in your field, and put your findings to full use.

Informatics Challenge #1: Understand Your Data Sources (Metadata Variety)

Meaningful data comes in many shapes and sizes. The wider the range of data you can combine, the more vibrant the insights and picture it can paint. But managing a multitude of sources and types can be problematic.

You need a system that is able to treat more varied and volatile data elements and schemas differently from those that are more homogeneous and stable. For example, data models that define output of measurement instruments (e.g., data collection forms and devices that generate measurement data files), as well as the data generated by the instruments, need to be handled differently than research operations data (e.g., studies, grants, and research staff).

The two kinds of data also differ in scale:

  • Measurement data: tens of thousands of scientific variables
  • Operational data: workflows spanning dozens of tables with hundreds of columns
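
One way to picture the split between stable operational data and volatile measurement data is sketched below. It is a minimal illustration, not a recommended production design: the table names (`study`, `assessment`), the instrument labels, and the choice to store instrument values as JSON are all assumptions made for the example.

```python
import json
import sqlite3


def build_registry(conn):
    """Create one fixed operational table and one generic measurement table."""
    conn.executescript(
        """
        CREATE TABLE study (
            id INTEGER PRIMARY KEY,
            title TEXT NOT NULL
        );
        CREATE TABLE assessment (
            id INTEGER PRIMARY KEY,
            study_id INTEGER NOT NULL REFERENCES study(id),
            instrument TEXT NOT NULL,  -- which form or device produced the data
            payload TEXT NOT NULL      -- instrument-specific values, stored as JSON
        );
        """
    )


def record_assessment(conn, study_id, instrument, values):
    """Store one instrument result without altering the table layout."""
    cursor = conn.execute(
        "INSERT INTO assessment (study_id, instrument, payload) VALUES (?, ?, ?)",
        (study_id, instrument, json.dumps(values)),
    )
    return cursor.lastrowid


def load_assessment(conn, assessment_id):
    """Return (instrument, values) for a stored assessment."""
    instrument, payload = conn.execute(
        "SELECT instrument, payload FROM assessment WHERE id = ?",
        (assessment_id,),
    ).fetchone()
    return instrument, json.loads(payload)


conn = sqlite3.connect(":memory:")
build_registry(conn)
conn.execute("INSERT INTO study (id, title) VALUES (1, 'Example study')")

# Two instruments with unrelated variables share the same measurement table.
survey_id = record_assessment(conn, 1, "mood_survey", {"q1": 2, "q2": 0})
device_id = record_assessment(conn, 1, "activity_tracker", {"steps": 8421})
```

In this sketch the operational side keeps ordinary typed columns, while a new instrument only adds rows, never columns. A real registry would likely add validation and indexing on top of this.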

“No greater barrier to effective data management will exist than the variety of incompatible data formats, non-aligned data structures, and inconsistent data semantics,” Gartner analyst Doug Laney said in 2004, and it still holds true today. Application programming interfaces (APIs) and universal standards and conversion tools, such as the Research Instrument Open Standard (RIOS), are helping to provide a path to data integration and use.

Once you fully understand how well your system scales to a multitude of variables and data structure types, you will be in a position to move forward in constructing your own clinical data registry.

Informatics Challenge #2: Understand That Data Sources Will Change and Evolve Over Time (Schema Volatility)

Healthcare data changes quickly, which raises the question of how long any given data remains relevant. Plan for volatility in both instrument and operational schemas over time: new instruments are added, old instruments are modified, and operational processes evolve.

This challenge is especially acute in multidisciplinary behavioral and mental health research, where columns can number in the tens of thousands, new data models for experimental measures can require multiple related tables, and models change in the course of a typical project. It is critical to consider how the registry system will evolve and adapt to metadata changes, including which metrics to include in an analysis and how long to store data before archiving or deleting it.
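
A common way to absorb this kind of change is to version instrument definitions, so that data collected under an old form stays valid after the form is revised. The sketch below assumes a hypothetical `InstrumentCatalog` class and a made-up `mood_survey` instrument; it illustrates the idea and is not a description of any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstrumentVersion:
    name: str
    version: int
    fields: frozenset


class InstrumentCatalog:
    """Keeps every published version of every instrument definition."""

    def __init__(self):
        self._versions = {}  # (name, version) -> InstrumentVersion

    def publish(self, name, fields):
        """Register a new definition; earlier versions are never overwritten."""
        latest = max((v for (n, v) in self._versions if n == name), default=0)
        definition = InstrumentVersion(name, latest + 1, frozenset(fields))
        self._versions[(name, definition.version)] = definition
        return definition

    def validate(self, name, version, values):
        """Check values against the version they were collected under."""
        definition = self._versions[(name, version)]
        unknown = set(values) - definition.fields
        if unknown:
            raise ValueError(f"{name} v{version} has no fields {sorted(unknown)}")


catalog = InstrumentCatalog()
v1 = catalog.publish("mood_survey", ["q1", "q2"])
v2 = catalog.publish("mood_survey", ["q1", "q2", "q3"])  # form revised mid-project

catalog.validate("mood_survey", v1.version, {"q1": 2})  # old data still checks out
catalog.validate("mood_survey", v2.version, {"q3": 1})  # new field valid under v2
```

Because each record can carry the version it was collected under, revising a form does not require rewriting historical data.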

Fully understanding the options available for dealing with change is essential to long-term success with your CDR.

Informatics Challenge #3: Assume That Study Protocol Will Evolve (Workflow Variability)

Each research project may have a different way of acquiring, curating, and distributing research data. Even within a research center, standardizing on a single research workflow across different projects is often unfeasible. The problem is multiplied when systems need to operate across centers or sites. The ability to customize workflows to meet local research needs, providing maximum flexibility and expandability, is essential.

It is important to recognize that conventional and more rudimentary methods of data management, such as spreadsheets (Excel), are a barrier to standardizing your workflows.

To attain flexibility and expandability in your CDR, research operations workflows must be configurable by research staff to handle the variability across studies, sites, and research domains. If non-technical personnel can configure studies and instruments, you will reduce the time and cost of adding new studies, forms, and research assets (such as sample, consent, and measure annotation types).

Your CDR ideally would allow data managers to configure data model additions and screens to support variations in workflows, as well as data marts and query guides to support downstream data consumers.

Informatics Challenge #4: Preserving Data Lineage (Complex and Complete Data Provenance)

Data provenance refers to records of the inputs, entities, systems, and processes that influence data of interest, providing a historical record of the data and its origins. The generated evidence supports forensic activities such as data-dependency analysis, error/compromise detection and recovery, auditing, and compliance analysis.

Knowing how data was collected, by whom, from whom, under what conditions, and what has happened to it is critical for downstream use of research data. Data lineage provides visibility, while simplifying the ability to trace errors back to the cause in the process.

Systems must capture provenance information as part of the operational workflow and store it in a way that makes data context available to enable effective data reuse.
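
To make "capture provenance as part of the operational workflow" concrete, here is a minimal sketch in which every change to a value appends an event recording who did what, when, and why. The names (`TrackedValue`, `collect`, `correct`) and the actors in the usage lines are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceEvent:
    action: str      # what happened, e.g. "collected" or "corrected"
    actor: str       # who did it
    detail: str      # source of the data or reason for the change
    timestamp: str   # when it happened (UTC, ISO 8601)
    previous: object = None  # value before the change, if any


@dataclass
class TrackedValue:
    value: object
    history: list = field(default_factory=list)


def _now():
    return datetime.now(timezone.utc).isoformat()


def collect(value, actor, source):
    """Create a value and record where it came from."""
    tracked = TrackedValue(value)
    tracked.history.append(ProvenanceEvent("collected", actor, source, _now()))
    return tracked


def correct(tracked, new_value, actor, reason):
    """Change a value while keeping the old one in the history."""
    tracked.history.append(
        ProvenanceEvent("corrected", actor, reason, _now(), previous=tracked.value)
    )
    tracked.value = new_value


weight = collect(72.5, actor="coordinator_a", source="intake form")
correct(weight, 75.2, actor="data_manager_b", reason="transcription error")
```

The point of the design is that the history is written by the same functions that change the data, so provenance cannot be skipped or added after the fact.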

Better understanding of the lineage/provenance will help you comprehend how your data was collected and its role in your CDR.

Informatics Challenge #5: Data Vulnerability—Comply with IRB / HIPAA / Common Rule Requirements (Security and Privacy)

Security is top of mind for the healthcare industry, especially as storage moves to the cloud and data starts to travel between organizations as a result of improved interoperability. Both the regulatory environment and practical privacy concerns create the need for privilege assignment at the entity, record, and attribute level.

Privileging and security must be deeply embedded in the system architecture and cannot depend on the user interface layer or on any design pattern that is vulnerable to inadvertent exposure of more data than necessary. Privileging models must be configurable for each install, with granular user-specific permissions calculated and enforced on the server side.
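
As a rough illustration of entity-, record-, and attribute-level privileges enforced on the server, consider the sketch below. The `Grant` structure, the `redact` function, and the idea of scoping records by `study_id` are assumptions made for this example; real privilege models are usually richer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    entity: str            # entity level: which kind of record
    study_ids: frozenset   # record level: which studies' records
    attributes: frozenset  # attribute level: which fields may be read


def redact(entity, record, grants):
    """Return only what the user's grants allow, or None if nothing is allowed.

    Intended to run on the server before any data reaches the client,
    so hidden fields never leave the system. Access is denied by default.
    """
    visible = set()
    for grant in grants:
        if grant.entity == entity and record.get("study_id") in grant.study_ids:
            visible |= grant.attributes
    if not visible:
        return None
    return {key: value for key, value in record.items() if key in visible}


analyst_grants = [
    Grant("participant", frozenset({1}), frozenset({"id", "age"})),
]

in_scope = {"id": 7, "study_id": 1, "age": 34, "name": "Example Person"}
out_of_scope = {"id": 9, "study_id": 2, "age": 51, "name": "Another Person"}

visible_record = redact("participant", in_scope, analyst_grants)
hidden_record = redact("participant", out_of_scope, analyst_grants)
```

Here the analyst sees only `id` and `age` for participants in study 1, and nothing at all for study 2, regardless of what the user interface asks for.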

Informatics Challenge #6: Value—Plan for Vital Research to Be Shared and Impactful in the Future (Data Reuse and Repurposing)

Research data accumulation and reuse requires the ability to import and transform data from a variety of data sources. Once acquired, clinical data registries (CDRs) must provide the capability to reorganize and transform data for different uses. On the output side, CDRs must provide both manual and programmatic methods for querying and exporting data.
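
The import, transform, and export steps can be sketched in a few lines. This example assumes a small, made-up CSV with one row per participant and reshapes it into one row per measured value; the column names and the helper functions `wide_to_long` and `export_csv` are hypothetical.

```python
import csv
import io


def wide_to_long(rows, id_column):
    """Turn one-row-per-participant data into one-row-per-value records."""
    for row in rows:
        for column, value in row.items():
            if column != id_column and value != "":  # skip missing answers
                yield {id_column: row[id_column], "variable": column, "value": value}


def export_csv(records, fieldnames):
    """Write records to CSV text for downstream consumers."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Import: a small wide-format file, inlined here so the example is self-contained.
source = "participant,q1,q2\nP01,2,\nP02,1,3\n"
rows = csv.DictReader(io.StringIO(source))

# Transform and export.
long_records = list(wide_to_long(rows, "participant"))
exported = export_csv(long_records, ["participant", "variable", "value"])
```

The same functions could be called from a script or a scheduled job, which is what programmatic export means in practice; a manual export would sit behind a download button that runs similar code.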

It is worth repeating (we never tire of it)—some data is too important not to share.

View your data as a persistent, accumulating asset. Beyond supporting your own research and publications, plan for it to be widely shared and used by collaborators. From cohort discovery to intervention analysis, all (or most) research is cumulative, so why not treat it as such?

Informatics Challenge #7: Never Lose Your Research Data (Maintainability and Longevity)

Systems that support complex research activities must be maintainable, often over decades, in several ways. They must be:

  • relatively easy to adapt to expected and unexpected changes to the environment
  • capable of being maintained by in-house or third-party staff, in case the original delivery team becomes unavailable
  • able to store data in a manner that assures the longevity of the valuable research assets, despite unknowable future technological changes.


If you spend even a fraction of the time you spent planning your research on choosing your research operations management software, you will gain efficiencies in time, data quality, persistence, and output value that far surpass the investment in planning.