Here are a few notes from a rough tutorial on extracting tabular data with OpenRefine, which I gave for Sarah Craft’s classics class at FSU.
Materials for an Ottoman Gazetteer. These feature a very complex line structure, and were hard to turn into data tables.
Late Fortresses in the Notitia
- Copy text and paste into a word processor document.
- Notice that the edges of the PDF need to be trimmed first; do this in a PDF editor.
- Notice that the last page of the PDF produced mismatched columns. Use the Convert Text to Table function in the word processor, and move the last lines of the table into new columns following the appropriate rows.
- Notice that there’s still a problem: not every row of the table corresponds to a single record; some contain half records. To fix this, use Find and Replace with regular expressions. First, change Occ. to FSUOcc. and Or. to FSUOr. so that the start of every record carries a temporary marker. Then change $ (which matches the end of a line, i.e. the line break) to a space. Finally, change FSU (the prefix we added to Occ. and Or.) to \n (which indicates a hard line break), so that each record starts on its own line.
- Now export this table as CSV (or simply copy it via the clipboard) and import it into OpenRefine.
- Split columns using spaces and commas.
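- Use GREL expressions. To concatenate cells:
  value + " " + cells["Command 1"].value
  To extract only the last word:
  smartSplit(value, " ")[-1]
  (The longer form value.partition(smartSplit(value," ")[-1]) returns a three-part array of the text before, the last word, and the text after; append [1] to it to keep just the last word.)
  More GREL recipes here and here and here.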
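For anyone who prefers to script the clean-up, the steps above can be sketched in Python. This is only a rough equivalent, not what was done in the class: the sample text is invented for illustration, the function names are my own, and Python's split() is simpler than GREL's smartSplit (it does not treat quoted strings specially). It also assumes that Occ. and Or. appear only at the start of a record and that the string FSU does not occur anywhere else in the text.

```python
import re
from typing import List

# Invented sample: two records, each broken across two lines.
SAMPLE = "Or. 1 Alpha\ncastle, Province A\nOcc. 2 Beta\nfort, Province B\n"


def merge_half_records(text: str) -> List[str]:
    """Rejoin records that were split across several lines."""
    # 1. Mark the start of every record with a temporary prefix.
    marked = re.sub(r"\b(Occ\.|Or\.)", r"FSU\1", text)
    # 2. Replace every line break (and surrounding whitespace) with a space.
    flat = re.sub(r"\s*\n\s*", " ", marked)
    # 3. Turn each temporary prefix back into a line break.
    restored = flat.replace("FSU", "\n")
    # Drop the empty first line and stray whitespace.
    return [line.strip() for line in restored.split("\n") if line.strip()]


def split_fields(record: str) -> List[str]:
    """Split a record into columns on spaces and commas."""
    return [field for field in re.split(r"[,\s]+", record) if field]


def concat_cells(left: str, right: str) -> str:
    """Rough equivalent of: value + " " + cells["Command 1"].value"""
    return left + " " + right


def last_word(value: str) -> str:
    """Rough equivalent of: smartSplit(value, " ")[-1]"""
    words = value.split()
    return words[-1] if words else ""


if __name__ == "__main__":
    for record in merge_half_records(SAMPLE):
        print(split_fields(record), "| last word:", last_word(record))
```

The temporary prefix does the same job as in the word processor: it survives the step that flattens all line breaks, so the record boundaries can be restored afterwards.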
Other, more challenging examples
A geographic dictionary like this one is interesting to parse. Unfortunately it’s in Greek and Italian, and the Greek characters were not OCRed as Greek. Similar problems arise with the Arabic letters in this Ottoman historical geography.
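One quick way to spot this kind of OCR failure before parsing is to measure how much of the extracted text is actually in the expected script. The sketch below is my own suggestion rather than part of the tutorial; the function name script_share is hypothetical, and it relies on the fact that Unicode character names for Greek and Arabic letters begin with GREEK and ARABIC respectively.

```python
import unicodedata


def script_share(text: str, script_prefix: str) -> float:
    """Fraction of the letters in text whose Unicode name starts with script_prefix."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    matching = sum(
        1 for ch in letters
        if unicodedata.name(ch, "").startswith(script_prefix)
    )
    return matching / len(letters)


if __name__ == "__main__":
    # Text that really is Greek scores high; Latin-letter OCR output scores zero.
    print(script_share("Αθήνα", "GREEK"))
    print(script_share("Athens", "GREEK"))
```

A page that should be mostly Greek (or Arabic-script Ottoman) but scores close to zero was probably OCRed with the wrong language setting and is worth re-running before any table extraction.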