The Pitfalls of Spreadsheet Data Management
Spreadsheet software is almost as ubiquitous as the personal computer itself. Even in rigorous research contexts, programs like Microsoft Excel offer an environment for ad hoc analyses, back-of-the-envelope calculations, dashboard prototyping, and more.
But the user-friendliness of spreadsheet software tempts its over-use. The fact remains that most of these programs are designed with a broad user-base and generic use-cases in mind.
Where Spreadsheets Fall Flat
Generally, spreadsheets are poor platforms for data management and complex analyses.
- Data quality (DQ) is generally lacking: Since entering data into spreadsheets is easy, entering errors into spreadsheets is just as easy.
- Resolving DQ issues are difficult: DQ investigations are typically relegated to manual spot-checking, conditional formatting, sorting, filtering, and finding-and-replacing. The result is a well-validated pattern of data errors in spreadsheet-driven research.
- Capturing data complexity is challenging: Spreadsheets do not handle dates consistently. They do not manage text-fields, time-dependent data, and other complex data formats well. Nor do they gracefully manage data groupings when the amount of data is not known in advance. For example, in a survey of medication use, grouped questions like “what is your dosage?”, “how frequently do you take it?”, and “brand or generic?” lead to spreadsheets that are less user-friendly if one surveyee takes one medication and another takes three. Similarly, spreadsheets inelegantly handle skip logic, such as in the scenario of “answer questions 2-10 if you said ‘yes’ to question 1″.
- Versioning and collaboration are inelegant: Even with cloud-based interfaces and auto-saving, collaborating through spreadsheets frequently involves saving and re-saving versions, emailing files, and file names with dates and initials.
- Advanced analyses are not supported: Data manipulation and analysis are largely limited to logic-based functions, arithmetic operators, and summary statistics.
Avoiding the Headaches of Spreadsheets
Although spreadsheets have their place, other tools offer stress-free, scalable data management environments. Specifically, Prometheus Research’s RexRegistry offers a data pipeline with data integrity and usability in mind:
- Data models preserve data complexity: An explicit data model groups data that are directly related (e.g., demographics on research participants versus their responses to survey questions), while still preserving the nuanced connections between these data (e.g., a link between a surveyee’s answer to “Is there a history of heart disease in your family” and follow up questions). This helps de-clutter data storage.
- Structure supports DQ: Formal databases like RexRegistry can be tailored to your data dictionary. This means errors can be caught and resolved as they arise.
- Better data collection interfaces: Electronic data capture (EDC) software interfaces, like RexDB, are typically designed with a data entry specialist in mind. This can result in easier entry with fewer human errors.
- DQ is at the heart of RexDB: Custom constraint and calculation functions can be configured in RexDB. This introduces functionality to easily flag suspicious values at both the point-of-data-collection in real-time or during post-collection curation efforts.
- RexRegistry explicitly supports data governance and collaboration: Finely-tunable permissioning functions within RexRegistry facilitate role-specific data accessibility, while still enforcing robust governance policies.
- Accessing data is easier: With querying abilities, researchers can more easily investigate trends and anomalies in their data. HTSQL by Prometheus Research, for example, specifically streamlines the data exploration process, as it offers a more straightforward querying language compared to traditional SQL.
- Easily connects to other software: The endpoint in mind with simple spreadsheet software is a .xlsx file full of data, box plots, and pivot tables, for example. Alternative data management software facilitates many endpoints—such as the exporting of data for complex analysis, modeling, and visualization in R, SAS, MATLAB, or other tools.
Altogether, investigators should take care not to overly depend on spreadsheets. Instead, other data management solutions give researchers the means to easily make high-integrity inferences and data-driven decisions and to avoid spending more time managing spreadsheets than actually analyzing data.