Good Data Management Practices for Data Analysis: Part 1
By: Frank Farach, Staff Scientist
As far as research experiences go, it’s hard to beat the moment when you finally get to analyze and interpret the data you worked so hard to obtain. Unfortunately, that moment is often preceded by many tedious and frustrating hours of cleaning and wrangling the data into a usable format, and then by careful exploration to provide context and reveal potential problems with the analyses you want to run.
Many researchers view these data management tasks as arduous, frustrating, and brittle, and rightly so. Entire books and articles have been written about data cleaning. Although there’s no way (yet) to completely automate the process, there are good practices you can follow to make your next analysis go more smoothly. In this post, I’ll discuss where the common bottlenecks arise in many analyses. In the next two posts, I’ll show you different ways to reduce these bottlenecks that are likely to improve both your efficiency and the quality of your data analyses.
As shown below, data cleaning and data transformation are two major bottlenecks in data analysis. Let’s look at each in turn.
Note: Figure adapted from a presentation by Hadley Wickham.
Data Cleaning. It should be no surprise that it takes longer to clean messier data. Unfortunately, there are many ways that data can be messy; but, as I’ll describe in future posts, powerful tools and practices can help you turn messy data into clean data.
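To make "messy" a little more concrete, here is a minimal sketch of one kind of cleaning pass, written with only the Python standard library. The rows, the column names, and the set of missing-value codes are all hypothetical and chosen for illustration; real data will have its own problems, and this is not the specific approach covered later in the series.

```python
# Hypothetical raw rows, as they might arrive from a spreadsheet export:
# stray whitespace, inconsistent capitalization, and several "missing" codes.
raw_rows = [
    {"id": " 001", "age": "34", "group": "Control "},
    {"id": "002", "age": "N/A", "group": "control"},
    {"id": "003", "age": "-999", "group": "TREATMENT"},
]

# Assumed sentinel values that should all be treated as missing.
MISSING_CODES = {"", "NA", "N/A", "-999"}


def clean_row(row):
    """Trim whitespace, map missing codes to None, and normalize types."""
    cleaned = {}
    for key, value in row.items():
        value = value.strip()
        cleaned[key] = None if value in MISSING_CODES else value
    if cleaned["age"] is not None:
        cleaned["age"] = int(cleaned["age"])          # store age as a number
    if cleaned["group"] is not None:
        cleaned["group"] = cleaned["group"].lower()   # one spelling per group
    return cleaned


clean_rows = [clean_row(row) for row in raw_rows]
```

Keeping the rules in a single function means the same cleaning can be rerun whenever the raw file changes, rather than being repeated by hand.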
Data Transformation. This one is more subtle. It’s often important to visualize and model the data in various ways when conducting an analysis. Models can suggest new visualizations and vice versa. I’m not talking about going on fishing expeditions, but rather about familiarizing yourself with the data, examining whether it meets the assumptions of your planned statistical analyses, and conducting any follow-up exploratory analyses. The point is that moving between these representations requires frequent data transformations, which add an underappreciated amount of friction to an analysis. Fortunately, the right approach and tools can make data transformations much easier, as I’ll illustrate in the next two posts.
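As a rough illustration of the kind of transformation meant here, the sketch below reshapes a small, hypothetical repeated-measures dataset between a "wide" layout (one row per participant) and a "long" layout (one row per participant per time point). Different plots and models tend to expect one layout or the other. The data, column names, and helper functions are assumptions made for this example, and it uses only the Python standard library rather than any particular analysis tool.

```python
# Hypothetical wide-format data: one row per participant,
# one column per time point.
wide = [
    {"participant": "p1", "score_t1": 10, "score_t2": 11},
    {"participant": "p2", "score_t1": 12, "score_t2": 15},
    {"participant": "p3", "score_t1": 9, "score_t2": 8},
]


def wide_to_long(rows, id_key, prefix):
    """Return one record per (participant, time point) pair."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key.startswith(prefix):
                long_rows.append({
                    id_key: row[id_key],
                    "timepoint": key[len(prefix):],  # "score_t1" -> "t1"
                    "score": value,
                })
    return long_rows


def long_to_wide(rows, id_key, prefix):
    """Collapse long records back to one row per participant."""
    wide_by_id = {}
    for row in rows:
        record = wide_by_id.setdefault(row[id_key], {id_key: row[id_key]})
        record[prefix + row["timepoint"]] = row["score"]
    return list(wide_by_id.values())


long_rows = wide_to_long(wide, "participant", "score_")
round_trip = long_to_wide(long_rows, "participant", "score_")
```

Even in this toy case, each change of layout is a small piece of code to write and check, which is where the friction comes from when an analysis switches representations many times.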
If data cleaning and transformation are rate-limiting steps in data analysis, then more efficient approaches to these tasks should make the overall process faster and (dare I say it?) more pleasant. In my next post, I’ll show you how to make your next analysis go more smoothly using tools that are immediately available to you. In the post after that, I’ll show you a more rigorous way to clean and transform your data using a relational database.