Tutorial: A Beginner's Guide to Scraping Historic Table Data

This is a simple introduction to scraping tables from historic (scanned) documents. It is a broad overview aimed at researchers with minimal programming experience who are tackling smaller digitization projects, say, nothing more than 200 pages. I focus on OCRing material with ABBYY FineReader, a popular commercial OCR program. ABBYY has a relatively gentle learning curve and, importantly, straightforward table functionality. For those more comfortable with the command line and programming, or for open source advocates, I suggest free programmatic alternatives for each tutorial step. Larger, more complex digitization projects often entail more technical elbow grease and advanced use of such tools.

(From the Mad Men Mondays "Data's First Class Economy Set" Repository: Hartman Center, Rubenstein Library, Duke University.)

Try to work with scans that are at least 300 DPI, saved in TIFF format. PDFs are often unavoidable, but use less "lossy" formats when possible. The older the text, the harder OCRing will be. This is especially true for text from the early 1900s and prior.
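For readers taking the programmatic route, the 300 DPI guideline for scans can be checked with simple arithmetic: divide a scan's pixel dimensions by the physical size of the original page. The following is a minimal sketch, not part of ABBYY or any OCR library. The function names `effective_dpi` and `meets_scan_guideline` are hypothetical, and the US Letter page size in the example is an assumption; substitute the measured size of your own source material.

```python
def effective_dpi(width_px, height_px, width_in, height_in):
    """Return the lower of the horizontal and vertical pixels-per-inch values.

    width_in and height_in are the physical dimensions of the original page
    in inches (an assumption the caller must supply; they are not read from
    the image file).
    """
    if width_in <= 0 or height_in <= 0:
        raise ValueError("page dimensions must be positive")
    return min(width_px / width_in, height_px / height_in)


def meets_scan_guideline(width_px, height_px, width_in, height_in, min_dpi=300):
    """Return True if the scan reaches at least min_dpi in both directions."""
    return effective_dpi(width_px, height_px, width_in, height_in) >= min_dpi


if __name__ == "__main__":
    # Assumed example: a US Letter page (8.5 x 11 in) scanned at 2550 x 3300 px.
    print(effective_dpi(2550, 3300, 8.5, 11))         # 300.0
    print(meets_scan_guideline(2550, 3300, 8.5, 11))  # True
    # The same page at half the pixel dimensions falls short of the guideline.
    print(meets_scan_guideline(1275, 1650, 8.5, 11))  # False
```

Taking the smaller of the two ratios is a deliberately conservative choice: a scan that is sharp in one direction but stretched in the other is still flagged as falling below the guideline.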