by Frank Farach, Staff Scientist
As I illustrated in the first post in this series, data cleaning and data transformation are two major bottlenecks in data analysis. Both can be streamlined by using more efficient methods for storing, cleaning, and processing data. My previous post focused on a relatively easy way to do this—the tidy data and tidy tools approach described by Dr. Hadley Wickham. Thus far, we’ve discussed managing tidy data in flat-file formats such as comma- or tab-separated values. To wrap up this series on improving data management for data analysis, I’ll discuss the benefits of managing tidy data in a relational database. Although relational databases require a greater up-front investment of resources, they offer benefits well beyond those offered by managing tidy data with flat files.
Recall that tidy data is a simple but efficient structure that, when processed with tidy tools that preserve that structure, can make analysis more efficient and agile. What is the origin of this versatile structure? Wickham credits relational databases. A relational database is an efficient structure for storing, managing, and retrieving data. Data in a relational database is stored in multiple tables, usually arranged in a structure called Third Normal Form (3NF). 3NF has a mathematically rigorous definition; roughly speaking, however, a table is in 3NF if all of its data relates to one thing (entity), each observation of that thing forms a unique row, and each aspect (variable) about the thing forms a column. This should sound a lot like the definition of tidy data—because it is. Each table in relational database is, effectively, a tidy data set.
Given the strong connection between tidy and relational data, why would you want to store tidy data in a relational database? Because a relational database significantly extends the benefits of tidy data by helping you store, manage, and retrieve your data in a flexible manner while maintaining robust data integrity. A relational database takes the tidy data approach and puts it on data management steroids. Let’s look at a few specific ways relational databases can help with data cleaning, data integrity, and data transformation.
A relational database facilitates data cleaning
The act of getting data into 3NF is extremely useful for data cleaning. When you set up a relational database, you specify the exact form your tables will take and both the data type and range of values each column will accept, allowing you to validate data upon entry into the system. You also specify whether or not duplicate IDs are allowed (hint: don’t allow them!). The system will reject any data that doesn’t conform to the model you specify and provide informative error messages. That way, your data is guaranteed to always conform to your specification (called a schema). This immediate feedback can help you detect and diagnose anomalies in your data. While you can certainly clean data stored in flat files, it’s harder to achieve the same level of rigor as you can with a relational database.
A relational database helps keep your data clean
A second reason to prefer managing tidy data in a relational database instead of flat files is that a relational database continuously enforces data integrity. Tidying data commonly involves splitting a single data set into multiple data sets in such a way that they can be transformed and recombined easily. Splitting up data this way, however, creates logical dependencies between data sets that, if not managed properly, can lead to inconsistent or inaccurate data. For example, if you add or delete data about measures a subject has completed in the Subject Visit data set, you probably also need to add or delete the corresponding row in the Subject table. Further, if you notice one of the subject IDs has a typo and decide to change it, you need to update that ID in every record and every data set it appears. It’s best to avoid this kind of data-management drudgery when excellent alternatives exist. Relational databases provide a clear and robust mechanism for handling these dependencies.
A relational database has powerful tools for accessing and transforming your data
Third, all modern relational database management systems give you access to powerful query capabilities that let you easily transform and retrieve data, reducing one of the major bottlenecks in data analysis. Because queries can both transform and retrieve data, they can handle most of the data transformation processes needed to support data analysis. Queries have the further benefit that they give you access to the most recent live data. And because queries are used to modify data, you always have a record of changes to your data and the ability to roll back the changes. This eliminates the need to keep multiple versions of data files.
To illustrate the power of queries, let’s see them in action. Suppose we have data from a university registrar’s office that is stored in three relational tables: school, program, and student. The university is divided administratively into schools; each school can have one or more programs; and programs can have any number of students (including no students). The Registrar might be interested in visualizing and analyzing the data from multiple perspectives. The questions could be simple, such as “Show me the names and total enrollment for the 10 top-enrolling programs.” Note that even a simple query like this executes many data manipulation processes, including filtering, aggregation, sorting, and merging data across the tables. More complex queries are possible: “For each school with at least one program, what is the school’s total enrollment, how many programs does it have, and, on average, how many males and females are enrolled in each program?” This query merges data from all three tables, applies aggregate statistics (counts), transforms variables, and both filters and reshapes the data.
Some query languages, such as the HTSQL language used in the links above, make it easy for non-programmers to translate research questions into queries that produce exactly the data set they need for their next statistical analysis or visualization. Indeed, most modern visualization and analysis tools will allow you to send a query to the database and dump the query result directly into your tool, which saves you the steps of manually exporting and importing data. This promotes a smoother, more agile workflow for data management and analysis.
In sum, storing data in a relational database improves upon the tidy data approach by:
Supporting the data cleaning process by providing feedback on anomalies
Enforcing data integrity continuously
Enabling powerful queries that exactly the data set you need in a single query
Relational databases do have some limitations relative to flat files. Setting up and maintaining a relational database for research requires a fair amount of technical expertise. For most researchers, setting one up with in-house resources usually isn’t feasible, so they either turn to an off-the-shelf product, hire a data manager, or outsource the work. Any of these approaches can work depending on your needs. The simpler your data collection and management needs, the more it makes sense to use the tidy data approach with flat files and the tools you already know. The more complex your data management scenario, the more likely you are to benefit from a relational database and professional expertise.
If you have questions about how to set up a relational database for your organization, lab, or next research project, please feel free to contact us. We’d be happy to discuss your needs and provide guidance on managing your research data.