Guide to Best Data Management Practices for Data Analysis
In this article on data management best practices for data analysis, we’ll look at three concepts that work together to give you a well-rounded strategy for managing your organization’s most complex data.
In the first section of this series, we’ll examine how data cleaning and data transformation are two major bottlenecks in data analysis. Both can be streamlined by using more efficient methods for storing, cleaning, and processing data.
In the second section, the focus is on a relatively easy way to do this—the tidy data and tidy tools approach.
In the final section, we’ll discuss the benefits of managing tidy data in a relational database. Although relational databases require a greater upfront investment of resources, they offer benefits well beyond those offered by managing tidy data with flat files.
Section #1: Good Data Management Practices for Data Analysis—Data Cleaning & Transformation
As far as research experiences go, it’s hard to beat the moment when you finally get to analyze and interpret the data you’ve worked so hard to obtain. Unfortunately, it’s common to spend many tedious and frustrating hours cleaning and wrangling your data into a usable format, followed by careful exploration to provide context and reveal potential problems with the analyses you want to run.
Many researchers view these data management tasks as arduous, frustrating, and brittle—and rightly so. Entire books and articles have been written about data cleaning. Although there’s no way (yet) to completely automate the process, there are good practices you can follow to make your next analysis go more smoothly.
In this first section, I’m going to discuss where the common bottlenecks are in many analyses. In two subsequent sections, I’ll show you different ways to reduce these bottlenecks that are likely to improve both your efficiency and the quality of your data analyses.
Data cleaning and data transformation are two major bottlenecks in data analysis. Let’s look at each in turn.
Data cleaning means finding and eliminating errors in the data. How you approach it depends on how large the data set is, but the kinds of things you’re looking for are:
- Cases that met exclusion criteria and shouldn’t be in the study
- Duplicate cases
- Impossible or otherwise incorrect values for specific variables
- Skip-pattern or logic breakdowns
- Missing data and outliers
It should be no surprise that messier data takes longer to clean. Unfortunately, there are many ways data can be messy, but powerful tools and practices can help you turn messy data into clean data.
For example, one way to find impossible values for a variable is to print out data for cases outside a normal range.
This is where learning how to code in your statistical software of choice really helps. You’ll need to subset your data using IF statements to find those impossible values.
But if your data set is anything but small, you can also save yourself a lot of time, code, and errors by incorporating efficiencies like loops and macros, so that you can perform some of these checks on many variables at once.
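To make the idea concrete, here is a minimal sketch of that kind of check in Python. The data set, variable names, and plausible ranges are all hypothetical; the point is that a single dictionary of ranges plus one loop replaces a separate IF statement per variable.

```python
# Toy data set; the second and third rows contain impossible values.
rows = [
    {"id": 1, "age": 34, "weight_kg": 70.5},
    {"id": 2, "age": 212, "weight_kg": 65.0},   # impossible age
    {"id": 3, "age": 58, "weight_kg": -4.2},    # impossible weight
]

# One dictionary of plausible ranges lets a single loop check many
# variables at once (the ranges here are illustrative assumptions).
valid_ranges = {"age": (0, 120), "weight_kg": (1, 400)}

def find_impossible_values(rows, valid_ranges):
    """Return (row id, variable, value) for every out-of-range value."""
    problems = []
    for row in rows:
        for var, (lo, hi) in valid_ranges.items():
            value = row[var]
            if not (lo <= value <= hi):
                problems.append((row["id"], var, value))
    return problems

print(find_impossible_values(rows, valid_ranges))
# [(2, 'age', 212), (3, 'weight_kg', -4.2)]
```

Adding a new variable to the check is then one more entry in `valid_ranges`, not another block of code.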
The issues with data transformation are more subtle. It’s often important to visualize and model the data in various ways when conducting an analysis. Models can suggest new visualizations and vice versa. I’m not talking about going on fishing expeditions, but rather about familiarizing yourself with the data, examining whether it meets the assumptions of your planned statistical analyses, and conducting any follow-up exploratory analyses.
The point is that frequent data transformations are required to move between these representations, introducing an underappreciated amount of friction into the analysis. Fortunately, the right approach and tools can make data transformations much easier, as I’ll illustrate in the sections that follow.
If data cleaning and transformation are rate-limiting steps in data analysis, then more efficient approaches to these tasks should make the overall process faster and—dare we say it?—more pleasant. Make your next analysis go more smoothly using tools that are immediately available to you.
The next section takes this a step further and presents a relatively easy, systematic way to clean and transform your data—the tidy data and tidy tools approach.
Section #2: Good Data Management Practices for Data Analysis—Tidy Data and Tidy Tools
Above, we discussed two main bottlenecks in data analysis—data cleaning and transformation—and suggested that you can improve both relatively easily. Now we’ll outline principles you can follow to make these tedious parts of data management less onerous so that you can focus on interacting with your data in a more meaningful (and fun) way.
Data is often messy. With data management, we aim to clean it before we analyze it.
Once we clean it we usually can’t jump directly to visualization and statistical modeling. We first have to manipulate the clean data—restructuring, filtering, transforming, aggregating, sorting, and/or merging it with other data—so that it serves our initial visualization or model.
Unfortunately, all but the simplest analyses involve multiple rounds of visualization and modeling, some of which depend on the results of previous iterations. So, not only do we have to frequently perform time-consuming data manipulations, we can’t easily predict the data structure we’ll need until we see the results of previous steps in the analysis. This, in a nutshell, is why preparing and analyzing data can seem so tedious and inefficient.
What we need is a way of storing data, paired with data manipulation tools, that minimizes the effort of preparing data for visualization and statistical modeling.
According to Dr. Hadley Wickham, Chief Scientist at RStudio and Adjunct Professor of Statistics at University of Auckland, Stanford, and Rice University, the solution is two-fold: (1) Clean your data so you can store it in a default but versatile format; and (2) transform it for modeling and visualization using tools that both accept and produce data in this format. For the first part of the solution, Wickham suggests the following default structure:
- Each variable forms a column
- Each observation forms a row
- Each data set contains information on only one observational unit of analysis (e.g., families, participants, participant visits)
A data set that meets these criteria is “tidy.” A tidy data set is, by definition, about only one type of thing or event (observational unit) that has been observed a certain number of times (rows) using certain measurements and identifiers (variables). The first two criteria essentially define long-format or panel data, a structure that will be familiar to you if you’ve ever analyzed repeated-measures or longitudinal data. In that case, it might be easiest to think of tidy data as long-format data that is about only one observational unit.
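A small Python sketch may help. The blood-pressure data below is hypothetical, and the untidy table exhibits the first problem on Wickham’s list that follows: the column names (`visit1`, `visit2`) are really data values of a “visit” variable. Melting them into rows yields tidy, long-format data.

```python
# Untidy: one row per patient, with visits spread across columns.
untidy = [
    {"patient": "A", "visit1": 120, "visit2": 118},
    {"patient": "B", "visit1": 135, "visit2": 131},
]

def make_tidy(rows, id_var, value_name):
    """Melt wide columns into one row per (id, visit) observation."""
    tidy = []
    for row in rows:
        for col, value in row.items():
            if col != id_var:
                tidy.append({id_var: row[id_var],
                             "visit": col,
                             value_name: value})
    return tidy

tidy = make_tidy(untidy, "patient", "systolic_bp")
# Each row is now one observation: one patient at one visit.
print(tidy[0])  # {'patient': 'A', 'visit': 'visit1', 'systolic_bp': 120}
```

In R, `reshape2::melt` performs the same restructuring in one call; the loop above just makes the mechanics visible.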
There are many ways data can be untidy. Wickham’s top five are as follows:
- Column names represent data values instead of variable names
- A single column contains data on multiple variables instead of a single variable
- Variables are contained in both rows and columns instead of just columns
- A single table contains more than one observational unit
- Data about an observational unit is spread across multiple data sets
These characteristics aren’t intrinsically bad—in fact, they can help us process and digest information better than tidy data. For example, untidy formats, such as cross-tabulations and other data tables, are far better choices than tidy data for presenting summaries, relationships, and patterns. Untidy data becomes problematic for the data analyst only when it is used as source material for further data manipulation.
In contrast, the virtue of tidy data lies with its versatility. Just as glass-blowers work with molten glass because it can easily be blown into different shapes, tidy data is a useful intermediate data structure for data analysis because it can easily be changed into other useful formats. All of the primary data manipulation activities—filtering, transforming, sorting, and aggregating—as well as visualization and modeling are greatly simplified when working with tidy data. This is especially true when we process tidy data with “tidy tools.”
The benefits of tidy data are best realized by using tidy tools to process it. Tidy tools are those that accept, manipulate, and return tidy data, thus preserving the versatility of the tidy data structure and minimizing the need for additional restructuring. To borrow an analogy from Wickham, tidy tools are like Lego blocks—individually simple but flexible and powerful in combination.
What tools are tidy? If you use the free and open source R software, check out Wickham’s popular tidy data manipulation (plyr and reshape2) and visualization (ggplot2) packages; most standard modeling functions in R are tidy, too. Wickham notes SPSS and SAS also have several tidy tools for data manipulation and statistical analysis. There are simply too many tools to review them adequately here.
Fortunately, though, if you’ve read this far, you’re well-equipped to assess which of your tools are tidy by following these steps:
- Create a tidy dataset for testing.
- Identify the functions or macros in your favorite stats package that accomplish common data manipulation tasks, such as filtering, transforming, sorting, and aggregation.
- Read the documentation.
- Test your tools on the data set one at a time, documenting whether each works with tidy data and produces tidy data. If it fails on either criterion, it’s not a tidy tool.
- Repeat the above steps for your favorite visualization and statistical modeling tools.
This exercise will give you not only a deeper understanding of how your tools work—it may also help you discover tidier alternatives to the tools you use now.
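As a sketch of step 4 above, here is one way the test might look in Python. The notion of “tidy” used here is deliberately minimal (a list of records with the same variables and one atomic value per variable); it is an illustrative assumption, not a complete formalization of Wickham’s criteria.

```python
def is_tidy(rows):
    """Minimal tidiness check: every row has the same variables,
    and each variable holds exactly one atomic value."""
    if not rows:
        return True
    variables = set(rows[0])
    return all(set(row) == variables and
               not any(isinstance(v, (list, dict)) for v in row.values())
               for row in rows)

def is_tidy_tool(tool, tidy_input):
    """A tool passes if it both accepts tidy data and returns tidy data."""
    output = tool(tidy_input)
    return is_tidy(tidy_input) and is_tidy(output)

# Example: a simple row filter preserves tidiness, so it passes.
keep_adults = lambda rows: [r for r in rows if r["age"] >= 18]
data = [{"id": 1, "age": 34}, {"id": 2, "age": 12}]
print(is_tidy_tool(keep_adults, data))  # True
```

A tool that, say, returned a cross-tabulation instead of rows of observations would fail the output check, flagging it as untidy.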
Tidy data has a lot in common with the storage of data in relational databases. Relational databases require a greater upfront investment of resources but can greatly improve your data management capabilities.
NOTE: This section draws heavily from concepts described in an article, slide deck, and presentation by Dr. Hadley Wickham, Chief Scientist at RStudio. Dr. Wickham has authored over 30 packages for the R statistical computing environment, including 3 of the 5 most downloaded packages this year. If what we describe here interests you, please check out his excellent work.
In our final section, we show how managing tidy data in a relational database extends these benefits.
Section #3: Good Data Management Practices for Data Analysis—Relational Databases
As we’ve illustrated so far, data cleaning and data transformation are two major bottlenecks in data analysis. Both can be streamlined by using more efficient methods for storing, cleaning, and processing data.
Thus far, we’ve discussed managing tidy data in flat-file formats such as comma- or tab-separated values. This final section addresses improving data management for data analysis, specifically the benefits of managing tidy data in a relational database. Although relational databases require a greater upfront investment of resources, they offer benefits well beyond those offered by managing tidy data with flat files.
Recall that tidy data is a simple but efficient structure that, when processed with tidy tools that preserve that structure, can make analysis more efficient and agile. What is the origin of this versatile structure? Wickham credits relational databases.
A relational database is an efficient structure for storing, managing, and retrieving data. Data in a relational database is stored in multiple tables, usually arranged in a structure called Third Normal Form (3NF). 3NF has a mathematically rigorous definition; roughly speaking, however, a table is in 3NF if all of its data relates to one thing (entity), each observation of that thing forms a unique row, and each aspect (variable) about the thing forms a column. This should sound a lot like the definition of tidy data—because it is. Each table in a relational database is, effectively, a tidy data set.
Given the strong connection between tidy and relational data in data analysis, why would you want to store tidy data in a relational database? Because a relational database significantly extends the benefits of tidy data by helping you store, manage, and retrieve your data in a flexible manner while maintaining robust data integrity. A relational database takes the tidy data approach and puts it on data management steroids. Let’s look at a few specific ways relational databases can help with data cleaning, data integrity, and data transformation.
A relational database facilitates data cleaning
The act of getting data into 3NF is extremely useful for data cleaning. When you set up a relational database, you specify the exact form your tables will take and both the data type and range of values each column will accept, allowing you to validate data upon entry into the system. You also specify whether or not duplicate IDs are allowed (hint: don’t allow them!).
The system will reject any data that doesn’t conform to the model you specify and provide informative error messages. That way, your data is guaranteed to always conform to your specification (called a schema). This immediate feedback can help you detect and diagnose anomalies in your data. While you can certainly clean data stored in flat files, it’s harder to achieve the same level of rigor as you can with a relational database.
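Here is a minimal sketch of schema-based validation using Python’s built-in `sqlite3` (a full client–server database would behave the same way); the `patient` table and its columns are hypothetical. The `PRIMARY KEY` rejects duplicate IDs, and the `CHECK` constraint rejects impossible values at entry time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE patient (
        patient_id INTEGER PRIMARY KEY,
        age        INTEGER NOT NULL CHECK (age BETWEEN 0 AND 120)
    )
""")

conn.execute("INSERT INTO patient VALUES (1, 34)")       # accepted

try:
    conn.execute("INSERT INTO patient VALUES (2, 212)")  # impossible age
except sqlite3.IntegrityError as err:
    print("rejected:", err)

try:
    conn.execute("INSERT INTO patient VALUES (1, 50)")   # duplicate ID
except sqlite3.IntegrityError as err:
    print("rejected:", err)
```

Both bad rows are refused with an informative error, so only valid data ever reaches the table.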
A relational database helps keep your data clean
A second reason to prefer managing tidy data in a relational database instead of flat files is that a relational database continuously enforces data integrity. Tidying data commonly involves splitting a single data set into multiple data sets in such a way that they can be transformed and recombined easily. Splitting up data this way, however, creates logical dependencies between data sets that, if not managed properly, can lead to inconsistent or inaccurate data.
For example, if you add or delete data about a patient or subject in the Visit table, you probably also need to add or delete the corresponding row in the Patient or Subject table.
Further, if you notice one of the patient or subject IDs has a typo and decide to change it, you need to update that ID in every record and every data set it appears. It’s best to avoid this kind of data-management drudgery when excellent alternatives exist. Relational databases provide a clear and robust mechanism for handling these dependencies.
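The scenario above can be sketched with `sqlite3` as well; the schema and IDs are hypothetical. Foreign keys declared with `ON UPDATE` / `ON DELETE CASCADE` propagate changes automatically, so the dependent tables can never fall out of sync.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires this opt-in
conn.executescript("""
    CREATE TABLE patient (
        patient_id TEXT PRIMARY KEY
    );
    CREATE TABLE visit (
        visit_id   INTEGER PRIMARY KEY,
        patient_id TEXT NOT NULL
                   REFERENCES patient (patient_id)
                   ON UPDATE CASCADE ON DELETE CASCADE
    );
    INSERT INTO patient VALUES ('PT-01');
    INSERT INTO visit VALUES (1, 'PT-01'), (2, 'PT-01');
""")

# Fixing a typo in the patient ID updates every dependent visit row.
conn.execute("UPDATE patient SET patient_id = 'PT-001'")
after_update = conn.execute("SELECT DISTINCT patient_id FROM visit").fetchall()
print(after_update)  # [('PT-001',)]

# Deleting the patient removes the now-orphaned visit rows as well.
conn.execute("DELETE FROM patient")
remaining = conn.execute("SELECT COUNT(*) FROM visit").fetchone()[0]
print(remaining)  # 0
```

One `UPDATE` on the Patient table corrected the typo in every Visit record at once; with flat files you would have to hunt down each occurrence yourself.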
A relational database has powerful tools for accessing and transforming your data
Third, all modern relational database management systems give you access to powerful query capabilities that let you easily transform and retrieve data, reducing one of the major bottlenecks in data analysis. Because queries can both transform and retrieve data, they can handle most of the data transformation processes needed to support data analysis.
Queries have the further benefit of giving you access to the most recent, live data. And when changes are made through saved queries or scripts, you retain a record of every change to your data and, within a transaction, the ability to roll changes back. This eliminates the need to keep multiple versions of data files.
To illustrate the power of queries, let’s see them in action. Suppose we have data from a university registrar’s office that is stored in three relational tables: school, program, and student. The university is divided administratively into schools; each school can have one or more programs; and programs can have any number of students (including no students). The Registrar might be interested in visualizing and analyzing the data from multiple perspectives.
The questions could be simple, such as “Show me the names and total enrollment for the 10 top-enrolling programs.” Note that even a simple query like this executes many data manipulation processes, including filtering, aggregation, sorting, and merging data across the tables.
More complex queries are possible: “For each school with at least one program, what is the school’s total enrollment, how many programs does it have, and, on average, how many males and females are enrolled in each program?” This query merges data from all three tables, applies aggregate statistics (counts), transforms variables, and both filters and reshapes the data.
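The simpler registrar question can be sketched in plain SQL (rather than HTSQL) with Python’s built-in `sqlite3`; the tables follow the structure described above, but the concrete programs and enrollments are made up. Notice how a single query joins, aggregates, sorts, and limits in one step.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE school  (school_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE program (program_id INTEGER PRIMARY KEY, name TEXT,
                          school_id INTEGER REFERENCES school (school_id));
    CREATE TABLE student (student_id INTEGER PRIMARY KEY,
                          program_id INTEGER REFERENCES program (program_id));
    INSERT INTO school  VALUES (1, 'Arts & Sciences'), (2, 'Engineering');
    INSERT INTO program VALUES (1, 'Biology', 1), (2, 'History', 1),
                               (3, 'Mechanical', 2);
    INSERT INTO student VALUES (1, 1), (2, 1), (3, 1), (4, 2), (5, 3), (6, 3);
""")

# "Names and total enrollment for the 10 top-enrolling programs."
top_programs = conn.execute("""
    SELECT program.name, COUNT(student.student_id) AS enrollment
    FROM program
    JOIN student ON student.program_id = program.program_id
    GROUP BY program.program_id
    ORDER BY enrollment DESC
    LIMIT 10
""").fetchall()
print(top_programs)  # [('Biology', 3), ('Mechanical', 2), ('History', 1)]
```

The result is already the exact tidy data set an analyst would feed into a chart or model, with no intermediate files.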
Some query languages, such as HTSQL, make it easy for non-programmers to translate research questions into queries that produce the exact data set they need for their next statistical analysis or visualization. Indeed, most modern visualization and analysis tools will allow you to send a query to the database and dump the query result directly into your tool, which saves you the steps of manually exporting and importing data. This promotes a smoother, more agile workflow for data management and analysis.
In modern data management, storing data in a relational database improves upon the tidy data approach by:
- Supporting the data cleaning process by providing feedback on anomalies
- Enforcing data integrity continuously
- Enabling powerful queries that provide the exact data set you need in a single query
Relational databases do have some limitations relative to flat files. Setting up and maintaining a relational database for research requires a fair amount of technical expertise. For most researchers, setting one up with in-house resources usually isn’t feasible, so they either turn to an off-the-shelf product, hire a data manager, or outsource the work. Any of these approaches can work depending on your needs.
The simpler your data collection and management needs, the more it makes sense to use the tidy data approach with flat files and the tools you already know. The more complex your data management scenario, the more likely you are to benefit from a relational database and professional expertise.
If you have questions about how to set up a relational database for your organization, lab, or next research project, please feel free to contact us. We’d be happy to discuss your needs and provide guidance on managing your research data.