Data Curation vs. Data Enrichment

Data Curation vs. Two Flavors of Data Enrichment

Data curation is about improving the quality of the data you already have.

It usually involves removing cases and measures that don’t meet a quality standard, e.g., dropping cases or observations that have missing values, fail inclusion criteria, or carry negative quality annotations.
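As a rough illustration of curation as filtering, the sketch below drops records for each of the three reasons just mentioned and keeps a note of why each one was excluded. The records, field names (`age`, `score`, `qc_flag`), and the age threshold are all hypothetical, chosen only to make the example runnable.

```python
from typing import Any, Dict, Optional

# Hypothetical records; the field names are illustrative, not a real schema.
records = [
    {"id": "P01", "age": 54, "score": 12.0, "qc_flag": None},
    {"id": "P02", "age": None, "score": 9.5, "qc_flag": None},              # missing value
    {"id": "P03", "age": 17, "score": 11.0, "qc_flag": None},               # fails inclusion
    {"id": "P04", "age": 61, "score": 14.2, "qc_flag": "sample degraded"},  # negative annotation
]

REQUIRED_FIELDS = ("age", "score")
MIN_AGE = 18  # assumed inclusion criterion, for illustration only


def exclusion_reason(rec: Dict[str, Any]) -> Optional[str]:
    """Return why a record fails the quality standard, or None if it passes."""
    if any(rec.get(field) is None for field in REQUIRED_FIELDS):
        return "missing value"
    if rec["age"] < MIN_AGE:
        return "inclusion criteria not met"
    if rec.get("qc_flag"):
        return "negative quality annotation"
    return None


# Curation keeps only passing records; nothing new is added to the asset.
curated = [r for r in records if exclusion_reason(r) is None]
excluded = {r["id"]: exclusion_reason(r) for r in records if exclusion_reason(r)}

print([r["id"] for r in curated])
print(excluded)
```

Recording the exclusion reason alongside the filter is a design choice in this sketch, not something the definition requires; it simply makes the curation step auditable.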

Data enrichment is about adding data to make the data asset more valuable.

It comes in two flavors: endogenous and exogenous.

“Endogenous data enrichment” transforms existing data into derived variables that are more informative and meaningful, relative to the questions being asked, than the original data. Often, these are the outputs of bioinformatics pipelines, and in the past, we have described this path as “derived results integration.”

A typical, if trivial, example is the calculation of a summary scale score from a vector of sub-scores. More complex examples include calculation of biomarkers from biological or phenotypic observations.
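The trivial case can be sketched in a few lines: a derived variable (a total scale score) is computed purely from data already in the asset. The participant IDs, the sub-scores, and the rule that a missing item makes the total undefined are assumptions made for this example, not a scoring rule from any particular instrument.

```python
from typing import Dict, List, Optional

# Hypothetical sub-scores per participant; None marks an unanswered item.
sub_scores: Dict[str, List[Optional[int]]] = {
    "P01": [2, 3, 1, 4],
    "P02": [0, 1, None, 2],
}


def total_score(items: List[Optional[int]]) -> Optional[int]:
    """Sum the sub-scores; return None if any item is missing (assumed rule)."""
    if any(value is None for value in items):
        return None
    return sum(items)


# The derived variable is computed only from existing data: endogenous enrichment.
derived = {pid: total_score(items) for pid, items in sub_scores.items()}

print(derived)
```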

“Exogenous data enrichment” uses data ingested from new sources to increase the value of the overall data asset, in the sense of making it more informative relative to some set of questions. Often, it is the result of a “data integration” process, and that’s how we have usually described it in the past.

A typical example is adding patient-reported outcome (PRO) or social determinants data to clinical data. More complex examples may involve combining data from multiple studies, multiple registries, or new time-points, or adding descriptive information about biological entities from public databases.
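The typical example amounts to a keyed join between the existing asset and a newly ingested source. The sketch below attaches a PRO value to each clinical record by patient identifier, in the style of a left join. All names and values (`patient_id`, `diagnosis`, `pain_score`) are hypothetical, and it assumes at most one PRO row per patient.

```python
# Existing asset: hypothetical clinical records.
clinical = [
    {"patient_id": "P01", "diagnosis": "T2D"},
    {"patient_id": "P02", "diagnosis": "HTN"},
]

# Newly ingested source: hypothetical patient-reported outcomes.
pro = [
    {"patient_id": "P01", "pain_score": 3},
    {"patient_id": "P03", "pain_score": 7},  # no matching clinical record
]

# Index the new source by the shared key (assumes one PRO row per patient).
pro_by_id = {row["patient_id"]: row for row in pro}

# Left join: keep every clinical record, add the PRO value where one exists.
enriched = []
for row in clinical:
    match = pro_by_id.get(row["patient_id"], {})
    enriched.append({**row, "pain_score": match.get("pain_score")})

print(enriched)
```

A left join is used here so that enrichment never drops existing cases; deciding whether unmatched records should stay is a curation question rather than an enrichment one.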