Groklaw - Tesseract OCR How-To, by Dr Stupid; Scripts by Fred Smith

Tesseract OCR How-To, by Dr Stupid; Scripts by Fred Smith

Monday, December 11 2006 @ 08:45 AM EST

As you know, turning PDFs into text is a large part of what we do on Groklaw, in order to have a searchable and accessible database of the litigation we cover. I have a script Carlo Graziani wrote for me that easily does HTML of PDFs that are text, but when the court documents were scanned in as pictures of each page, the script doesn't work. It's a lot of work translating those documents, having to OCR them and then correct all the errors or just hand type. It's hard to find a good OCR program that works on GNU/Linux, I gather because of patents -- yet another reason why someone needs to solve this software patent problem. And using commercial programs like Omnipage is not possible for most of us, because although it works very well, it's quite expensive. And it runs only on Windows or a Mac.

Then someone noticed Tesseract OCR.

What is Tesseract OCR?:

A commercial quality OCR engine originally developed at HP between 1985 and 1995. In 1995, this engine was among the top 3 evaluated by UNLV. It was open-sourced by HP and UNLV in 2005

It was registered on Sourceforge in January. And then in August, Google announced it had fixed a few bugs and was rereleasing it so the community could work on improving it. It does OCR. And it works on GNU/Linux. But it was not what I would call user friendly. There were a few other issues with Tesseract, as Google explained:

A few things to know about Tesseract OCR: for now it only supports the English language, and does not include a page layout analysis module (yet), so it will perform poorly on multi-column material. It also doesn't do well on grayscale and color documents, and it's not nearly as accurate as some of the best commercial OCR packages out there. Yet, as far as we know, despite its shortcomings, Tesseract is far more accurate than any other Open Source OCR package out there. If you know of one that is more accurate, please do tell us!

I was thrilled when Groklaw member Fred Smith sent me some helper scripts to make it work better, and you're free to use them too. To make sure we all understand how to use them, I asked Dr Stupid to write up a how-to for Tesseract OCR, explaining how to use Fred's helper scripts. We still have quite a few exhibits to do from IBM's lengthy list supporting its various summary judgment motions, so if you want to give this a whirl, you might practice on one of those. After the how-to, I will show you the scripts themselves. I would like to thank both Fred and Dr Stupid for doing this for us.

***************************************

Tesseract OCR - HOW-TO - simple conversion of PDFs to text,
by Dr Stupid

Introduction

One activity that appears to be central to the smooth running of Groklaw is the conversion of documents from a PDF to plain text format. There are many ways of achieving this, not least the brute force approach of manual transcription - something made possible by a large pool of volunteers (to which we remain forever indebted.)

Some PDFs are readily converted to text by their very nature, being directly generated from a word processing package. In these cases the plain text is held within the file and can be extracted. Unfortunately a great many are instead produced by scanning software; what one sees within [insert your preferred PDF viewing program here] is nothing more than a series of pictures (one to a page) of the original paper document.

OCR (Optical Character Recognition) software can take much of the drudgery out of dealing with such documents, and some of the programs available on the market are very sophisticated. Those who out of necessity or choice require FOSS options, however, have had to grapple with programs of limited functionality or user-friendliness.

The situation improved in 2005 when HP, with the assistance of the University of Nevada, Las Vegas (UNLV), made their own "tesseract" OCR engine available as open source. This engine offers good performance on English documents but is a little awkward to use as it stands, since it can only work with input files in TIFF format, not PDFs.

This year, a GL reader called Fred Smith sent in some helpful scripts that make it much easier to use Tesseract to convert PDFs into plain text. The rest of this article explains how to compile tesseract on your Linux system and make use of those scripts. With luck, those overlength memorandums may never look so daunting again :)

Aside: Perhaps an enterprising GL reader can put together a Kommander file to create a simple GUI for the script...?

Check requirements

Tesseract can only work with TIFF files, so you need software to convert PDFs to TIFFs. You need to have a working Ghostscript installation with TIFF support: your distro's standard Ghostscript should do fine.

You need to have the following libraries installed: libtiff, libjpeg, and zlib. Most distros will have these installed "out of the box". You also need the corresponding header files. These are usually packaged in the corresponding "-devel" packages.

So, using your preferred package management software make sure the following (or equivalent) are installed:

ghostscript

(ghostscript may be divided into smaller sub-packages in your distro)

libtiff
libtiff-devel
libjpeg
libjpeg-devel
zlib
zlib-devel
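
If you would rather check from the shell than from your package manager, something along these lines may help. It is only a rough sketch: missing_tools is a name made up for this example, it checks only for commands (not for the -devel header packages, whose names vary between distros), and the last part assumes your Ghostscript lists its built-in output devices in the output of "gs -h", as standard builds do.

```shell
#!/bin/sh
# Print the name of every command given as an argument that is NOT on the PATH.
# (missing_tools is a hypothetical helper name, not part of tesseract.)
missing_tools() {
    for tool in "$@"
    do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Ghostscript (gs) and make are both used later in this how-to.
missing=`missing_tools gs make`
if [ -n "$missing" ]
then
    echo "Not found on PATH:" $missing
else
    echo "All required commands found."
fi

# pdf2tif uses the tiffg3 output device, so check Ghostscript was built with it.
if command -v gs >/dev/null 2>&1
then
    if gs -h | grep -q tiffg3
    then
        echo "Ghostscript has the tiffg3 device."
    else
        echo "Ghostscript lacks the tiffg3 device; TIFF output will not work."
    fi
fi
```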

Check Ghostscript installation

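Fetch a PDF (I'm using "Interesting.pdf" as an example) into your current directory. Save Fred's pdf2tif script (reproduced at the end of this article) into the same directory, give it execute permission with "chmod +x pdf2tif", and run it on the PDF: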

./pdf2tif Interesting.pdf

You should get a set of TIFF files, one per page of the PDF. Use an image viewer to check they are OK.
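
To see at a glance how many pages were produced, a small helper like the following could be used. It is a sketch only: count_pages is a name invented for this example, and it relies on the file naming that pdf2tif uses by default (the PDF's base name followed by a dash and a two-digit page number).

```shell
#!/bin/sh
# Count the TIFF pages that pdf2tif produced for a given PDF.
# pdf2tif names its default output <basename>-NN.tif in the current directory.
# (count_pages is a hypothetical helper name used only for this example.)
count_pages() {
    base=`basename "$1" .pdf`
    n=0
    for f in "$base"-*.tif
    do
        # If nothing matched, the shell leaves the pattern as-is; skip it.
        [ -f "$f" ] && n=$((n + 1))
    done
    echo "$n"
}

echo "Interesting.pdf produced `count_pages Interesting.pdf` TIFF page(s)."
```

If the count does not match the number of pages in the PDF, look at Ghostscript's output before going any further.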

Get tesseract

The home page of Tesseract is http://sourceforge.net/projects/tesseract-ocr. The current version at time of writing is 1.02. Download the tarball (I henceforth assume it is tesseract-1.02.tar.gz) and untar it somewhere convenient, creating a directory "tesseract-1.02". Now put the helper scripts in that same directory, and go into the directory:

tar zxvf tesseract-1.02.tar.gz
mv pdf2tif tesseract-1.02/
mv ocr.sh tesseract-1.02/
cd tesseract-1.02

Build tesseract

In the following, I assume that libtiff is installed into /usr/lib. If you have built your own libtiff from source, this might not be the case.

./configure --with-libtiff=/usr/lib
make
ln -s ccmain/tesseract

The last stage is necessary to be able to run tesseract directly from where it has been compiled. As of the time of writing, the authors recommend that you do not run "make install".
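
Before moving on, it may be worth confirming that the build really did leave a runnable binary behind the symlink. A minimal sketch, in which check_binary is a name made up for this example:

```shell
#!/bin/sh
# Report whether a path (here, the symlink created by "ln -s") leads to an
# executable file. [ -x ] follows symbolic links, so a dangling link fails.
# (check_binary is a hypothetical helper name used only for this example.)
check_binary() {
    if [ -x "$1" ] && [ ! -d "$1" ]
    then
        echo "ok: $1 is executable"
    else
        echo "problem: $1 is missing or not executable"
        return 1
    fi
}

check_binary ./tesseract || echo "Re-check the output of make for errors."
```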

Run tesseract

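Now you can run the program on your PDF, using the helper script. Before the first run, edit ocr.sh so that "progdir" points to the directory holding the tesseract symlink (here the current directory, so "progdir=."), and make sure pdf2tif is somewhere on your PATH, because ocr.sh calls it by name.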

./ocr.sh Interesting.pdf

You should get a set of text files, one per page of the PDF. The inevitable tidying-up process I leave to you!
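
As a first step in that tidying-up, the per-page files can be stitched back together into one document. The following is just one way of doing it: join_pages and the "==== file ====" page marker are inventions of this sketch, and it depends on the zero-padded page numbers in the file names, so the order is only right for documents of up to 99 pages.

```shell
#!/bin/sh
# Join the per-page text files for one PDF into a single file, in page order.
# Relies on the zero-padded page numbers (-01, -02, ...) that pdf2tif puts in
# the file names, so the shell's sorted glob matches page order up to 99 pages.
# (join_pages and the "==== name ====" marker are choices made for this sketch.)
join_pages() {
    base=`basename "$1" .pdf`
    out="$base.txt"
    : > "$out"
    for f in "$base"-*.txt
    do
        [ -f "$f" ] || continue
        echo "==== $f ====" >> "$out"
        cat "$f" >> "$out"
    done
}

join_pages Interesting.pdf
```

The result lands in Interesting.txt in the current directory; the page markers make it easier to find your place when proofreading against the original.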

Acknowledgements

Many thanks to the authors and maintainers of tesseract, and to Fred Smith for the original helper scripts.

Fred Smith's scripts

Here are Fred's scripts as he sent them to me, for those who don't need a how-to:

I've been hacking at some scripts and wanted to pass on to you my latest versions. I think they will work better and also be easier to use.
Instead of viewing the online PDF and printing to file, with these scripts you will need to have a local copy of the PDF, because we now use a new script named "pdf2tif" to turn the pdf directly into a tif file without any intermediate ps file. It seems to give considerably better resolution on the text (the tif file sure looks a lot better). It's not clear how much better the OCR'd result is, but I'd think it would be at least a little better (after all, these documents are pretty poor quality to start with.)
You'll want to put this in your path somewhere (I put mine in /usr/local/bin) and make sure to give it execute permission.
Here's pdf2tif:
-----------------------------
#!/bin/sh
# Derived from pdf2ps.
# $Id: pdf2tif,v 1.0 2006/11/03 Fred Smith
# Convert PDF to TIFF file.

# Collect any leading options so they can be passed through to gs.
OPTIONS=""
while true
do
    case "$1" in
    -?*) OPTIONS="$OPTIONS $1" ;;
    *) break ;;
    esac
    shift
done

if [ $# -eq 2 ]
then
    outfile=$2
elif [ $# -eq 1 ]
then
    # Default: one numbered TIFF per page, e.g. Interesting-01.tif
    outfile=`basename "$1" .pdf`-%02d.tif
else
    echo "Usage: `basename $0` [gs-options...] input.pdf [output.tif]" 1>&2
    exit 1
fi

# Doing an initial 'save' helps keep fonts from being flushed between pages.
# We have to include the options twice because -I only takes effect if it
# appears before other options.
exec gs $OPTIONS -q -dNOPAUSE -dBATCH -dSAFER -r300x300 -sDEVICE=tiffg3 \
    "-sOutputFile=$outfile" $OPTIONS -c save pop -f "$1"
--------------------------
And here's the latest version of ocr.sh. To use this, edit the value for the variable "progdir" to point to wherever your tesseract binary is located. Also, put it somewhere convenient and give it execute permission. It leaves the resulting .txt files (one per page) in your current directory.
-------------------------------
#!/bin/sh
# takes one parameter, the path to a pdf file to be processed.
# uses custom script 'pdf2tif' to generate the tif files,
# generates them at 300x300 dpi.
# drops them in our current directory
# then runs $progdir/tesseract on them, deleting the .raw
# and .map files that tesseract drops.

# quoted so that paths containing spaces survive
pdf2tif "$1"

# edit this to point to wherever you've got your tesseract binary
progdir=..

for j in *.tif
do
    x=`basename "$j" .tif`
    "${progdir}/tesseract" "${j}" "${x}"
    # -f: don't complain if tesseract left no such file behind
    rm -f "${x}.raw"
    rm -f "${x}.map"
    #un-comment next line if you want to remove the .tif files when done.
    # rm "${j}"
done
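
One thing to be aware of: ocr.sh loops over every .tif file in the current directory, so TIFFs left over from an earlier run get OCR'd again. If that bothers you, the loop could be limited to the pages belonging to one PDF, roughly as follows. This is a sketch rather than part of Fred's scripts: ocr_pages_for and the TESSERACT variable are names introduced here for illustration, and it assumes pdf2tif has already been run with its default output names.

```shell
#!/bin/sh
# OCR only the TIFF pages that belong to one PDF, skipping any other .tif files
# lying around in the current directory.
# TESSERACT is a hypothetical override variable introduced for this sketch;
# it defaults to ./tesseract.
ocr_pages_for() {
    base=`basename "$1" .pdf`
    for j in "$base"-*.tif
    do
        [ -f "$j" ] || continue
        x=`basename "$j" .tif`
        # tesseract <image> <output base> writes <output base>.txt
        "${TESSERACT:-./tesseract}" "$j" "$x"
        rm -f "$x.raw" "$x.map"
    done
}

# Typical use, after pdf2tif has produced Interesting-01.tif, Interesting-02.tif, ...
ocr_pages_for Interesting.pdf
```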


Tesseract OCR How-To, by Dr Stupid; Scripts by Fred Smith

Authored by: Anonymous on Monday, December 11 2006 @ 09:02 AM EST

Very cool.