A researcher collects the first batch of data from her research study. She excitedly cuts and pastes the data from multiple spreadsheets to get it into one master spreadsheet. Eager to make sure the data set is free of errors, she does some quick cleaning and runs a few analyses on it. She’s satisfied with the clean data set and the results, so she saves the original data files, her clean master spreadsheet, and the master spreadsheet with error checks and calculations somewhere she’ll know to look later.
So far so good, right? Now let’s consider what happens next.
As each batch of data arrives, she performs the same manual processes. At some point, she finds an anomaly in a new batch of data that she hadn’t checked for in previous batches. She goes back and checks all of the previous data spreadsheets, saving new versions of the data sets and quality checks, only to find that all are clean. The problem, it turns out, was a cut-and-paste error in that one new batch after all!
Now she’s done cleaning and ready for the best part: analyzing and interpreting the data. Using her favorite stats packages, she generates descriptive statistics and checks the frequencies and distributions for problems. Unfortunately, there are some multivariate outliers that didn’t show up earlier. She also realizes that there is still some data trickling in that she’ll need in order to answer one of her research questions. After weeks of re-checking and re-integrating her data, she’s relieved to perform her analyses. To her surprise, some of the effects run in the opposite direction from her predictions. She decides to run some exploratory analyses to figure out what might be going on, but that involves data that still needs to be cleaned. After many weeks or months of data management drudgery, the analyses finally come together, the paper is published, and she files the data and analyses away.
You can just imagine the difficulties that arise when she thinks of another idea for a paper from this dataset that requires the data to be aggregated differently; when a colleague emails her four years later to ask for the data for use in a meta-analysis or integrative data analysis; or when the data management tasks described above are distributed across members of her research team.
Unfortunately, the above scenario isn’t a far-fetched caricature of data management. It simply describes an instance of an all-too-common ad hoc model of research data management that we call disposable data management (DDM). The scenario is so pervasive that we’ve recently presented a poster and given a presentation on the topic at academic conferences.
The central problem with DDM is that it treats data management processes as a series of one-off, sequential tasks—acquire, clean, use, and archive/forget (see figure below)—even though data management is actually an iterative set of processes. This mismatch between workflow and real-world requirements introduces many problems, including redundant, manual effort; greater risk of error; poor auditability; and limited potential for reusing or repurposing the data. It is an inefficient way to manage data that leads to lower-quality, less-usable data.
It’s easy to understand why researchers fall back on ad hoc processes like this, however. Researchers are often extraordinarily pressed for time. They’re working in labs or organizations where resources are stretched thin. Even when PIs submit well-written data management plans with their federal grant applications, the lab may lack concrete, written standard operating procedures for managing data; the lab may rely a bit too much on spreadsheet software because of its ubiquity and handiness for quick tasks; and lab members may not be adequately trained in data management practices. When deadlines loom and the right plan, technology, and training aren’t in place, the most sensible-seeming thing to do is to dump data into a spreadsheet or SPSS file for some quick cleaning to get the job done.
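To make the contrast concrete, here’s a minimal sketch of what a scripted, re-runnable cleaning step might look like instead of manual spreadsheet edits. The column names and validation rules are purely illustrative, not from any particular study; the point is that when a new check is discovered, you add it to the script and re-run it over every batch, rather than re-auditing each spreadsheet by hand.

```python
# A hypothetical, re-runnable cleaning step. Every batch — old or new —
# passes through the same checks, so adding a check later just means
# re-running the script, and flagged rows leave an audit trail.

def clean_batch(rows):
    """Apply the same validation to every batch; return (clean, flagged)."""
    clean, flagged = [], []
    for row in rows:
        problems = []
        # Illustrative check: "score" must be numeric and within 0-100.
        try:
            score = float(row["score"])
            if not 0 <= score <= 100:
                problems.append("score out of range")
        except (KeyError, ValueError):
            problems.append("score missing or non-numeric")
        target = flagged if problems else clean
        target.append({**row, "problems": "; ".join(problems)})
    return clean, flagged

# A toy batch containing the kind of cut-and-paste error from the story.
batch = [
    {"id": "1", "score": "88"},
    {"id": "2", "score": "oops"},  # pasted text where a number belongs
    {"id": "3", "score": "142"},   # out-of-range value
]
clean, flagged = clean_batch(batch)
print(len(clean), "clean row(s),", len(flagged), "flagged row(s)")
```

The same function runs unchanged on batch one or batch twenty, and the flagged rows document exactly what failed and why — which is what the manual workflow above never produces.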
But even the most time- and resource-pressed researchers can and should take steps to improve their data management practices. In the next few posts, I’ll discuss some important data management principles and outline the steps that researchers can take to improve their data management practices.