Help! I’m Drowning in a Sea of Changing Data!

Tips for Getting Control of Your Data When It Feels Like Your Data Controls You (So You Can Finally Sleep At Night)

Not everyone will admit to having this problem, but I’ve been there myself and have many friends who remain locked in epic personal battles with their data. So how can you finally make peace?

First, here’s the scope of the problem we’re after: you’re in active data collection mode, and you have data pouring in from sources you can’t seem to put a lid on anymore (perhaps you’re using MTurk). The stream of data is in constant flux, and you don’t just need to track the survey or experiment data. You also have to pay attention to data about your data, like who, when, which version, and so on. Maybe you read our three-part summer series on Good Data Management Practices for Data Analysis, but you don’t have a relational database to use (or the time to set one up). Tidy data sounds intriguing. Maybe you’ve even gotten a foothold on tidying up your data somewhat, but the data pulled a fast one and now you’re cornered again. The spreadsheets are proliferating. Your anxiety levels are starting to rise. What do you do?

Here’s Tip #1: Learn to use a (formal) revision control system

Revision control is also called source control or version control. The core problem it addresses is familiar to researchers, data managers, and RAs alike, although the solution comes from the world of software development. Let’s say that every couple of days you get a new version of the same data file. So far you’ve had to invent your own versioning system by, say, calling the file something like “data export + download date” and saving it in your project folder somewhere. You know you want to track a single file as it changes over time, but doing it this way you still lose the why and the what of each version. Revision control picks up the slack.

It works like this: first, you create a project *repository* on your computer, shared drive, or wherever you store your data, and then you add files to it for your revision control system to track. When you make a change to a file that you want to “remember,” you *commit* the current state of the file back to the repository with a *commit message* that explains to your future self why you made that modification.
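To make that concrete, here is a minimal sketch of the create, add, and commit cycle. It assumes Git is installed and on your PATH (Mercurial has matching `hg init`, `hg add`, and `hg commit` commands). It drives Git from Python so the whole thing runs as one script. The file name `survey_export.csv`, its contents, the commit messages, the demo identity, and the helper names `git` and `track_two_exports` are all made up for illustration.

```python
import subprocess
import tempfile
from pathlib import Path


def git(repo, *args):
    """Run one git command inside `repo` and return its output as text."""
    result = subprocess.run(
        # The -c flags supply a throwaway identity so commits work even on a
        # machine where Git has never been configured (demo values only).
        ["git", "-c", "user.name=Demo", "-c", "user.email=demo@example.com", *args],
        cwd=repo, check=True, capture_output=True, text=True,
    )
    return result.stdout


def track_two_exports(repo):
    """Create a repository, commit two versions of one data file, list the messages."""
    repo = Path(repo)
    git(repo, "init", "-q")                  # create the project repository

    data = repo / "survey_export.csv"        # hypothetical data file
    data.write_text("id,score\n1,4\n")
    git(repo, "add", data.name)              # ask Git to track the file
    git(repo, "commit", "-q", "-m", "First export from the survey tool")

    data.write_text("id,score\n1,4\n2,5\n")  # a new export arrives, same file name
    git(repo, "add", data.name)
    git(repo, "commit", "-q", "-m", "Second export: adds participant 2")

    # One line per commit message, newest first.
    return git(repo, "log", "--format=%s").splitlines()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        for message in track_two_exports(folder):
            print(message)
```

Notice that the file keeps one name the whole time; the dates and reasons live in the commit history instead of the file name. In day-to-day use you would more likely type the same `git` commands at a terminal, or click through them in a GUI, than script them like this.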

Now all you need to do is open the one file that holds your data, and it is always the most recent version. If you want them, you can also resurrect any previous version of that file in a snap. So if you don’t like what you’ve done to a file, you can roll it back to the state it was in at any earlier commit. That means recovering “lost” data, noting important changes, and enjoying the freedom to try any kind of attack plan on your data without the fear of having to repeat work that you liked. It’s kind of like a supercharged lab notebook!
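Here is a rough sketch of what rolling back can look like, again assuming Git is installed and using invented names (`survey_export.csv`, `restore_first_version`, and the sample rows are placeholders, not anything from a real project). It commits two versions of a file, peeks at the older one, and then restores the working copy to it.

```python
import subprocess
import tempfile
from pathlib import Path


def git(repo, *args):
    """Run one git command inside `repo` and return its output as text."""
    result = subprocess.run(
        # Throwaway identity so the demo commits work on an unconfigured machine.
        ["git", "-c", "user.name=Demo", "-c", "user.email=demo@example.com", *args],
        cwd=repo, check=True, capture_output=True, text=True,
    )
    return result.stdout


def restore_first_version(repo):
    """Commit two versions of a file, then roll the file back to the first one."""
    repo = Path(repo)
    data = repo / "survey_export.csv"        # hypothetical data file
    git(repo, "init", "-q")

    versions = [
        ("id,score\n1,4\n", "First export"),
        ("id,score\n1,40\n", "Rescaled scores (turned out to be a mistake)"),
    ]
    for text, message in versions:
        data.write_text(text)
        git(repo, "add", data.name)
        git(repo, "commit", "-q", "-m", message)

    # rev-list prints commit ids newest first; --reverse puts the oldest first.
    first = git(repo, "rev-list", "--reverse", "HEAD", "--", data.name).splitlines()[0]

    # Look at the old version without touching the file on disk.
    old_text = git(repo, "show", f"{first}:{data.name}")

    # Overwrite the working copy with the version stored in that commit.
    git(repo, "checkout", first, "--", data.name)
    return old_text, data.read_text()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        old_text, working_text = restore_first_version(folder)
        print(working_text == old_text)
```

The second commit is not erased by the rollback. It stays in the history, so you can still return to it later if the "mistake" turns out to be right after all.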

(Bonus tip: if you’re making calculations on your data, you can record the formula you used right in the commit message. You can be as explicit and precise as you want, so you always know exactly what you’ve done and why you’ve done it.)

Here’s the best part: you can get really great revision control for free! The big ones are Git and Mercurial (we use the latter at Prometheus), and both are free to download. The folks at Atlassian have even released a free, cross-platform program called SourceTree that will let you use all the features of your favorite flavor of revision control in a GUI-based tool (all the example screenshots demonstrate programming code, but just use your imagination a bit and you’ll be well on your way).

Equipped with your new revision control system, you can now boldly reclaim your sovereignty over your data files!