OpenRefine – list-making made easy

In case you missed it: the ACRL Tech Blog has just posted a fantastic guide by Margaret Heller about using OpenRefine (formerly GoogleRefine) to spruce up messy data for re-use. If you've ever found yourself wishing that Excel could just be a little smarter, I highly recommend you take a close look at Heller's post, A Librarian's Guide to OpenRefine.

At my library, I've used GoogleRefine for a few projects in the past couple of years. But like Heller, I've found that 

“The main thing that I have used OpenRefine for in my daily work is to change a bibliography in plain text into columns in a spreadsheet that I can run against an API.”

Just last month I turned a fifty-page Word document with information about faculty publications into a neatly-structured spreadsheet that can be used to populate a database or an IR, or to create xml for use on a website. Some fuctions I used for that project were:

  • Create a new column (PubType) based on information in the first column. In my case, if the first column's length was less than 20 characters, then I wanted the value of that column to show up as PubType. This was useful because my list included separate headings for the type of citation (Book, Book contribution, Article, Book review), then listed the citations under the headings.
  • Fill down automatically, changing every time a new value is a encountered. Like a smart CTRL+D in Excel. That gave me PubType for each citation in just a couple seconds.
  • Create new column (PubDate), again based on information in the first column, using regular expressions. In my case, articles and book reviews had publication years formatted in parentheses, while books and book contriubtions didn't.
  • Update author values based on clusters. OpenRefine is especially good at letting you cluster information based on sophisticated matching algorithms instead of, say, regular expressions. So it makes it very easy to consist-ify permutations, e.g. Robert Jones; Jones, Robert; Dr. Robert Jones. 

Have you ever used GoogleRefine or OpenRefine? Heller has asked ACRL Tech Connect readers to comment and share recipes; take her up on her offer at http://acrl.ala.org/techconnect/?p=3276