AHA 2015: Managing and Maintaining Digital Data (Getting Started in Digital History Intermediate Workshop)

This page contains resources for historians who have started thinking about evidence as data but are still figuring out the ins and outs of data clean-up and storage. The live session, for historians attending the AHA’s Getting Started in Digital History workshop on Jan. 2, 2015, will use OpenRefine to walk through some of the steps involved in data normalization (breaking evidence into fields, handling typos and other data abnormalities). It will then examine some of the options for scholars moving from proprietary software (Word, Excel) to data formats that make sharing and maintaining data easier.

First, go read A Good Scholarly Article:

http://journalofdigitalhumanities.org/2-3/big-smart-clean-messy-data-in-the-humanities/

OR TWO! Edited Jan 19 to add: Thomas Padilla’s article at http://thomaspadilla.org/papers/padillahiggins_humdata_postprint.pdf is also an excellent look at evidence and data in humanities research.

Then, think about cleaning up your data.

If you’re not in the workshop, tackle OpenRefine on your own:

A basic tutorial: http://schoolofdata.org/handbook/recipes/cleaning-data-with-refine/
Using facets and clusters to clean data: http://www.propublica.org/nerds/item/using-google-refine-for-data-cleaning
David Huynh’s tutorial: http://davidhuynh.net/spaces/nicar2011/tutorial.pdf, which uses the data here: http://electionstatistics.sos.la.gov/Data/Elected_Officials/ElectedOfficials.xls
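
If you want a feel for what clustering does before opening the tutorials above, here is a minimal Python sketch of the general "key collision" idea: reduce each value to a normalized key, then group values that share a key. It is a simplified approximation written for this page, not OpenRefine's own code, and the sample names are invented for illustration.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Return a simplified key-collision fingerprint for a string."""
    # Fold accented characters to their closest ASCII equivalents.
    ascii_text = (
        unicodedata.normalize("NFKD", value)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Lowercase the text and drop punctuation.
    cleaned = ascii_text.lower().translate(
        str.maketrans("", "", string.punctuation)
    )
    # Split on whitespace, remove duplicate tokens, and sort them,
    # so word order and stray spaces no longer matter.
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group distinct spellings that share a fingerprint."""
    groups = defaultdict(list)
    for value in values:
        key = fingerprint(value)
        if value not in groups[key]:
            groups[key].append(value)
    # Only groups with more than one spelling are worth reviewing.
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical sample values, invented for illustration.
names = [
    "Smith, John",
    "John Smith",
    "john smith ",
    "Smyth, John",
    "José Martí",
    "Jose Marti",
]
clusters = cluster(names)
for group in clusters:
    print(group)
```

Note that "Smyth, John" is left out of the first group: this kind of key only catches differences in case, punctuation, accents, and word order, not spelling variants. Deciding whether Smith and Smyth are the same person is still the historian's call.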

Some beginner OpenRefine cheat sheets can be really handy as you start working with your data:

Cheat sheet: http://arcadiafalcone.net/GoogleRefineCheatSheets.pdf
Basic OpenRefine recipes: https://github.com/OpenRefine/OpenRefine/wiki/Recipes
A list of external resources: https://github.com/OpenRefine/OpenRefine/wiki/External-Resources
Sample data sets: https://github.com/OpenRefine/OpenRefine/wiki/Sample-Datasets
Cleaning up dates: http://icantiemyownshoes.wordpress.com/2014/04/24/clean-up-dates-and-openrefine/
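
Date clean-up is a good example of normalization that is easy to reason about in code as well as in OpenRefine. The Python sketch below tries a short list of candidate formats and rewrites each match as an ISO 8601 date (YYYY-MM-DD). The format list is an assumption made for illustration, month names are assumed to be in English, and slash dates are read as month/day/year; adjust all three to fit your own sources.

```python
from datetime import datetime

# Assumed candidate formats; order matters for ambiguous values,
# and "%m/%d/%Y" treats slash dates as month/day/year.
CANDIDATE_FORMATS = [
    "%Y-%m-%d",   # 2015-01-02
    "%m/%d/%Y",   # 1/2/2015
    "%B %d, %Y",  # January 2, 2015
    "%b %d, %Y",  # Jan 2, 2015
    "%d %B %Y",   # 2 January 2015
    "%d %b %Y",   # 2 Jan 2015
]


def to_iso(raw):
    """Return the date as YYYY-MM-DD, or None if no format matches."""
    # Collapse runs of whitespace and trim the ends first.
    text = " ".join(raw.split())
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    # Leave unparseable values for a human to review.
    return None


# Hypothetical sample values, invented for illustration.
samples = ["January 2, 2015", "1/2/2015", " 2 Jan  2015 ", "circa 1850"]
for sample in samples:
    print(repr(sample), "->", to_iso(sample))
```

Returning None instead of guessing keeps uncertain dates such as "circa 1850" visible, so they can be handled deliberately rather than silently rewritten.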

There are also advanced or specialized OpenRefine recipes that can help you take the next step:

Putting points on maps using GeoJSON created by OpenRefine: http://schoolofdata.org/2014/05/19/putting-points-on-maps-using-geojson-created-by-open-refine/
A sample process with SQL and OpenRefine: http://icantiemyownshoes.wordpress.com/2014/04/15/openrefine-and-messy-legacy-access-points-in-an-archivists-toolkit-database/
Example of reconciling OpenRefine data with external taxonomies (names, etc.): http://iphylo.blogspot.com/2012/02/using-google-refine-and-taxonomic.html (the OpenRefine project has an active call out for help getting its reconciliation feature working with historical newspaper database APIs)
Cleaning up bibliographies: http://acrl.ala.org/techconnect/?p=3276