by Frank Farach, Staff Scientist
Note: This post draws heavily from concepts described in an article, slide deck, and presentation by Hadley Wickham, Assistant Professor of Statistics at Rice University and Chief Scientist at RStudio. Dr. Wickham has authored over 30 packages for the R statistical computing environment, including 3 of the 5 most downloaded packages this year. If what I describe here interests you, please check out his excellent work.
I recently wrote about the two main bottlenecks in data analysis, data cleaning and transformation, and suggested that you can improve both relatively easily. In this post, I’m going to outline principles you can follow to make these tedious parts of data management less onerous so you can focus on interacting with your data in a more meaningful (and fun) way.
Let’s have another look at the diagram I shared in the first post:
Note: Figure adapted from a presentation by Hadley Wickham.
Data is often messy, so we need to clean it before we analyze it. Once we clean it we usually can’t jump directly to visualization and statistical modeling. We first have to manipulate the clean data — restructuring, filtering, transforming, aggregating, sorting, and/or merging it with other data — so it serves our initial visualization or model. Unfortunately, all but the simplest analyses involve many rounds of visualization and modeling, some of which depend on the results of previous iterations. So, not only do we have to frequently perform time-consuming data manipulations, we can’t easily predict the data structure we’ll need until we see the results of previous steps in the analysis. This, in a nutshell, is why preparing and analyzing data can seem so tedious and inefficient.
What we need is a way of storing data, and of using our data manipulation tools, that minimizes the effort of preparing data for visualization and statistical modeling.
According to Hadley Wickham, the solution is two-fold: (1) Clean your data so you can store it in a default but versatile format; and (2) transform it for modeling and visualization using tools that both accept and produce data in this format. For the first part of the solution, Wickham suggests the following default structure:
Each variable forms a column
Each observation forms a row
Each data set contains information on only one observational unit of analysis (e.g., families, participants, participant visits)
A data set that meets these criteria is “tidy”. A tidy data set is, by definition, about only one type of thing or event (observational unit) that has been observed a certain number of times (rows) using certain measurements and identifiers (variables). The first two criteria essentially define long-format or panel data, a structure that will be familiar to you if you’ve ever analyzed repeated-measures or longitudinal data. In that case it might be easiest to think of tidy data as long-format data that is about only one observational unit.
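To make the criteria concrete, here is a minimal sketch of a tidy data set. I'm using Python and pandas purely for illustration (the post's examples come from the R world), and the column names (participant_id, visit, weight_kg) and values are invented. The observational unit is the participant visit: each variable forms a column, and each row is one observed visit.

```python
import pandas as pd

# A tidy data set about one observational unit: participant visits.
# Each variable is a column; each observation (one visit) is a row.
visits = pd.DataFrame({
    "participant_id": [1, 1, 2, 2],
    "visit":          [1, 2, 1, 2],
    "weight_kg":      [70.2, 69.8, 55.1, 55.4],
})

print(visits)
```

Because every row is one visit and every column is one variable, this is also ordinary long-format (panel) data.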
There are many ways data can be untidy. Wickham’s top five are as follows:
Column names represent data values instead of variable names
A single column contains data on multiple variables instead of a single variable
Variables are contained in both rows and columns instead of just columns
A single table contains more than one observational unit
Data about an observational unit is spread across multiple data sets
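As an illustration of the first and most common problem on the list (column names that are really data values), here is a sketch in Python/pandas with invented data; in R, reshape2's melt function plays the analogous role:

```python
import pandas as pd

# Untidy: the column names "visit1" and "visit2" are data values
# (which visit), not variable names. (Invented example data.)
untidy = pd.DataFrame({
    "participant_id": [1, 2],
    "visit1": [70.2, 55.1],
    "visit2": [69.8, 55.4],
})

# melt() moves the visit columns into rows, one observation per row.
tidy = untidy.melt(id_vars="participant_id",
                   var_name="visit", value_name="weight_kg")
tidy["visit"] = tidy["visit"].str.replace("visit", "", regex=False).astype(int)

print(tidy)
```

The untidy table has one row per participant and one column per visit; the tidy version has one row per participant visit, with "visit" restored as an ordinary variable.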
These characteristics aren’t intrinsically bad; in fact, untidy formats can help us process and digest information better than tidy data does. Cross-tabulations and other summary tables, for example, are far better choices than tidy data for presenting summaries, relationships, and patterns. Untidy data becomes a problem for the analyst only when it is used as the source material for further data manipulation.
In contrast, the virtue of tidy data lies with its versatility. Just as glass-blowers work with molten glass because it can easily be blown into different shapes, tidy data is a useful intermediate data structure for data analysis because it can easily be changed into other useful formats. All of the primary data manipulation activities—filtering, transforming, sorting, and aggregating—as well as visualization and modeling are greatly simplified when working with tidy data. This is especially true when we process tidy data with “tidy tools”.
The benefits of tidy data are best realized by using tidy tools to process them. Tidy tools are those that accept, manipulate, and return tidy data, thus preserving the versatility of the tidy data structure and minimizing the need for additional data restructuring. To borrow an analogy from Wickham, the tools are like Lego blocks—individually simple but flexible and powerful in combination.
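To sketch the Lego-block idea, here is a hypothetical pipeline in Python/pandas (the data and column names are invented). Because every step accepts a tidy table and returns one, the steps snap together in any order without intermediate restructuring:

```python
import pandas as pd

# Tidy input: one row per participant visit. (Invented example data.)
visits = pd.DataFrame({
    "participant_id": [1, 1, 2, 2, 3, 3],
    "visit":          [1, 2, 1, 2, 1, 2],
    "weight_kg":      [70.2, 69.8, 55.1, 55.4, 80.0, 78.5],
})

# Each step takes a tidy table and returns a tidy table, so the steps
# compose like Lego blocks: filter, then transform, then aggregate.
summary = (
    visits[visits["visit"] <= 2]                           # filter rows
    .assign(weight_lb=lambda d: d["weight_kg"] * 2.20462)  # transform
    .groupby("participant_id", as_index=False)             # aggregate
    .agg(mean_weight_lb=("weight_lb", "mean"))
)

print(summary)
```

Note that the aggregated result is itself tidy, just about a new observational unit (participants rather than visits), so it can feed directly into the next round of visualization or modeling.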
Note: Figure adapted and modified from a presentation by Hadley Wickham.
What tools are tidy? If you use the free and open-source R software, check out Wickham’s popular packages for tidy data manipulation (plyr and reshape2) and visualization (ggplot2); most standard modeling functions in R are tidy, too. Wickham notes that SPSS and SAS also have several tidy tools for data manipulation and statistical analysis. There are simply too many tools to review adequately here. Fortunately, though, if you’ve read this far, you’re well-equipped to assess which of your tools are tidy by following these steps:
Create a tidy data set for testing.
Identify the functions or macros in your favorite stats package that accomplish common data manipulation tasks, such as filtering, transforming, sorting, and aggregation.
Read the documentation.
Test your tools on the data set one at a time, documenting whether each works with tidy data and produces tidy data. If it fails on either criterion, it’s not a tidy tool.
Repeat the above steps for your favorite visualization and statistical modeling tools.
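The audit might look like the following in Python/pandas, where the invented test data set stands in for step 1 and pandas’ own groupby/agg is the candidate tool; a real audit would repeat this for each function in your stats package:

```python
import pandas as pd

# Step 1: a small tidy test set, one row per participant visit.
# (Invented example data.)
tidy_input = pd.DataFrame({
    "participant_id": [1, 1, 2, 2],
    "visit":          [1, 2, 1, 2],
    "score":          [10, 12, 9, 11],
})

# Step 4: feed the tidy set to a candidate tool and inspect the result.
# Here the candidate, groupby/agg, accepts a tidy table and returns one:
# each output row is one participant, the new observational unit.
result = tidy_input.groupby("participant_id", as_index=False).agg(
    mean_score=("score", "mean")
)

# On this evidence the tool is tidy: the output is still a plain
# variables-in-columns, observations-in-rows table.
assert isinstance(result, pd.DataFrame)
print(result)
```

A tool that instead returned, say, a cross-tabulation or an opaque model object would fail the second criterion and need a conversion step before its output could re-enter the pipeline.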
This exercise will not only give you a deeper understanding of how your tools work; it may also help you discover tidier alternatives to the tools you use now.
As we’ll see next, tidy data has a lot in common with the storage of data in relational databases. Relational databases require a greater up-front investment of resources but can greatly improve your data management capabilities.