As you know, turning PDFs into text is a large part of what we do on Groklaw, in order to have a searchable and accessible database of the litigation we cover. I have a script Carlo Graziani wrote for me that easily does HTML of PDFs that are text, but when the court documents were scanned in as pictures of each page, the script doesn't work. It's a lot of work translating those documents, having to OCR them and then correct all the errors or just hand type. It's hard to find a good OCR program that works on GNU/Linux, I gather because of patents -- yet another reason why someone needs to solve this software patent problem. And using commercial programs like Omnipage is not possible for most of us, because although it works very well, it's quite expensive. And it works only on Windows or a Mac. Then someone noticed Tesseract OCR.
What is Tesseract OCR?: A commercial quality OCR engine originally developed at HP between 1985 and 1995. In 1995, this engine was among the top 3 evaluated by UNLV. It was open-sourced by HP and UNLV in 2005. It was registered on Sourceforge in January. And then in August, Google announced it had fixed a few bugs and was rereleasing it so the community could work on improving it. It does OCR. And it works on GNU/Linux. But it was not what I would call user friendly. There were a few other issues with Tesseract, as Google explained: A few things to know about Tesseract OCR: for now it only supports the English language, and does not include a page layout analysis module (yet), so it will perform poorly on multi-column material. It also doesn't do well on grayscale and color documents, and it's not nearly as accurate as some of the best commercial OCR packages out there. Yet, as far as we know, despite its shortcomings, Tesseract is far more accurate than any other Open Source OCR package out there. If you know of one that is more accurate, please do tell us!
I was thrilled when Groklaw member Fred Smith sent me some helper scripts to make it work better, and you're free to use them too. To make sure we all understand how to use them, I asked Dr Stupid to write up a how-to for Tesseract OCR, explaining how to use Fred's helper scripts. We still have quite a few exhibits to do from IBM's lengthy list supporting its various summary judgment motions, so if you want to give this a whirl, you might practice on one of those. After the how-to, I will show you the scripts themselves. I would like to thank both Fred and Dr Stupid for doing this for us.
***************************************
Tesseract OCR - HOW-TO - simple conversion of PDFs to text,
by Dr Stupid
Introduction
One activity that appears to be central to the smooth running of Groklaw is the conversion of documents from PDF to plain text format. There are many ways of achieving this, not least the brute force approach of manual transcription, something made possible by a large pool of volunteers (to whom we remain forever indebted).
Some PDFs are readily converted to text by their very nature, being directly generated from a word processing package. In these cases the plain text is held within the file and can be extracted. Unfortunately a great many are instead produced by scanning software; what one sees within [insert your preferred PDF viewing program here] is nothing more than a series of pictures (one to a page) of the original paper document.
OCR (Optical Character Recognition) software can take much of the drudgery out of dealing with such documents, and some of the programs available on the market are very sophisticated. Those who out of necessity or choice require FOSS options, however, have had to grapple with programs of limited functionality or user-friendliness.
The situation improved in 2005 when HP made their own "tesseract" OCR engine available as open source with the assistance of the University of Nevada, Las Vegas (UNLV). This engine offers good performance on English documents but is a little awkward to use as it stands, since it can only work with input files in TIFF format, not PDFs.
This year, a GL reader called Fred Smith sent in some helpful scripts that make it much easier to use Tesseract to convert PDFs into plain text. The rest of this article explains how to compile tesseract on your Linux system and make use of those scripts. With luck, those overlength memorandums may never look so daunting again :)
Aside: Perhaps an enterprising GL reader can put together a Kommander file to create a simple GUI for the script...?
Check requirements
Tesseract can only work with TIFF files, so you need software to convert PDFs to TIFFs. You need to have a working Ghostscript installation with TIFF support: your distro's standard Ghostscript should do fine.
You need to have the following libraries installed: libtiff, libjpeg, and zlib.
Most distros will have these installed "out of the box".
You also need the corresponding header files. These are usually packaged in the corresponding "-devel" packages.
So, using your preferred package management software, make sure the following (or equivalent) are installed:
ghostscript (ghostscript may be divided into smaller sub-packages in your distro)
libtiff
libtiff-devel
libjpeg
libjpeg-devel
zlib
zlib-devel
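
One quick way to confirm the headers are present before building is a check along these lines. It is only a sketch: it assumes the headers live directly under /usr/include (some distros put them elsewhere), and the function name missing_headers is made up for this example. The header names tiffio.h, jpeglib.h and zlib.h are the ones normally shipped by libtiff, libjpeg and zlib.

```shell
#!/bin/sh
# Hypothetical helper: print each required header that is NOT found
# under the given include directory (default: /usr/include).
missing_headers() {
    incdir=${1:-/usr/include}
    for h in tiffio.h jpeglib.h zlib.h
    do
        [ -f "$incdir/$h" ] || echo "$h"
    done
}

# Ghostscript itself should be on the PATH.
command -v gs >/dev/null 2>&1 || echo "gs not found on PATH"

# List any header that still needs a -devel package (no output = all found).
missing_headers /usr/include
```

If a header is reported missing but the -devel package is installed, the include directory assumed above is probably not the one your distro uses.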
Check Ghostscript installation
Fetch a PDF (I'm using "Interesting.pdf" as an example) into your current directory,
along with the pdf2tif helper script (listed at the end of this article; give it
execute permission), and run the script on it:
./pdf2tif Interesting.pdf
You should get a set of TIFF files, one per page of the PDF. Use an image viewer
to check they are OK.
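
If no TIFF files appear, one possible cause is a Ghostscript built without the "tiffg3" output device that pdf2tif asks for. Running "gs -h" prints the list of compiled-in devices, so a check might look like the sketch below. The function name has_gs_device is invented for this example.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the device name given as $1 appears
# as a whole word in the "gs -h" help text read from standard input.
has_gs_device() {
    grep -qw -- "$1"
}

if command -v gs >/dev/null 2>&1
then
    if gs -h | has_gs_device tiffg3
    then
        echo "tiffg3 device available"
    else
        echo "tiffg3 device missing: check your Ghostscript packages"
    fi
else
    echo "gs not found on PATH"
fi
```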
Get tesseract
The home page of Tesseract is http://sourceforge.net/projects/tesseract-ocr.
The current version at time of writing is 1.02.
Download the tarball (I henceforth assume it is tesseract-1.02.tar.gz) and
untar it somewhere convenient, creating a directory "tesseract-1.02". Now
put the helper scripts in that same directory, and go into the directory:
tar zxvf tesseract-1.02.tar.gz
mv pdf2tif tesseract-1.02/
mv ocr.sh tesseract-1.02/
cd tesseract-1.02
Build tesseract
In the following, I assume that libtiff is installed into /usr/lib. If you have
built your own libtiff from source, this might not be the case.
./configure --with-libtiff=/usr/lib
make
ln -s ccmain/tesseract
The last stage is necessary to be able to run tesseract directly from where it
has been compiled. As of the time of writing, the authors recommend that you
do not run "make install".
Run tesseract
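Now you can run the program on your PDF, using the helper script. Note that ocr.sh
looks for pdf2tif on your PATH and for the tesseract binary in the directory named
by its "progdir" variable (set to ".." in the script as supplied). When running from
the build directory as here, edit progdir to "." and make sure the current directory
is on your PATH first.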
./ocr.sh Interesting.pdf
You should get a set of text files, one per page of the PDF. The inevitable tidying-up process I leave to you!
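
As a first step in that tidying-up, you may want the per-page files stitched back into one document. Here is a sketch, assuming the pages are named Interesting-01.txt, Interesting-02.txt and so on (the pattern the pdf2tif script below produces by default). The name join_pages is hypothetical. It counts pages explicitly rather than relying on a shell wildcard, since a wildcard would sort page 100 ahead of page 11.

```shell
#!/bin/sh
# Hypothetical helper: write pages BASE-01.txt, BASE-02.txt, ... to
# standard output in numeric order, stopping at the first missing page.
join_pages() {
    base=$1
    n=1
    while :
    do
        page=$(printf '%s-%02d.txt' "$base" "$n")
        [ -f "$page" ] || break
        cat "$page"
        n=$((n + 1))
    done
}
```

For example, "join_pages Interesting > Interesting-all.txt" would collect every page into a single file for proofreading.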
Acknowledgements
Many thanks to the authors and maintainers of tesseract, and to Fred Smith for the original helper scripts.
Fred Smith's scripts
Here are Fred's scripts as he sent them to me, for those who don't need a how to:
I've been hacking at some scripts and wanted to pass on to you my latest versions. I think they will work better and also be easier to use.
Instead of viewing the online PDF and printing to file, with these scripts you will need to have a local copy of the PDF, because we now use a new script named "pdf2tif" to turn the pdf directly into a tif file without any intermediate ps file. It seems to give considerably better resolution on the text (the tif file sure looks a lot better). It's not clear how much better the OCR'd result is, but I'd think it would be at least a little better (after all, these documents are pretty poor quality to start with.)
You'll want to put this in your path somewhere (I put mine in /usr/local/bin) and make sure to give it execute permission.
Here's pdf2tif:
-----------------------------
#!/bin/sh
# Derived from pdf2ps.
# $Id: pdf2tif,v 1.0 2006/11/03 Fred Smith
# Convert PDF to TIFF file.
OPTIONS=""
while true
do
case "$1" in
-?*) OPTIONS="$OPTIONS $1" ;;
*) break ;;
esac
shift
done
if [ $# -eq 2 ]
then
outfile=$2
elif [ $# -eq 1 ]
then
outfile=`basename "$1" .pdf`-%02d.tif
else
echo "Usage: `basename $0` [gs-options] input.pdf [output.tif]" 1>&2
exit 1
fi
# Doing an initial 'save' helps keep fonts from being flushed between pages.
# We have to include the options twice because -I only takes effect if it
# appears before other options.
exec gs $OPTIONS -q -dNOPAUSE -dBATCH -dSAFER -r300x300 -sDEVICE=tiffg3 \
"-sOutputFile=$outfile" $OPTIONS -c save pop -f "$1"
--------------------------
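
To see what file names to expect: given a single argument, the script builds its output pattern from the PDF's base name, and Ghostscript fills in the %02d with the page number, counting from 1. The shell's printf pads numbers the same way, so the following sketch prints the name a given page would get. The function name page_tif_name is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the TIFF name pdf2tif's default pattern
# yields for PDF file $1 and page number $2.
page_tif_name() {
    printf '%s-%02d.tif\n' "$(basename "$1" .pdf)" "$2"
}

page_tif_name Interesting.pdf 3     # prints Interesting-03.tif
page_tif_name /tmp/Exhibit.pdf 12   # prints Exhibit-12.tif
```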
And here's the latest version of ocr.sh. To use this, edit the value for the variable "progdir" to point to wherever your tesseract binary is located. Also, put it somewhere convenient and give it execute permission. It leaves the resulting .txt files (one per page) in your current directory.
-------------------------------
#!/bin/sh
# takes one parameter, the path to a pdf file to be processed.
# uses custom script 'pdf2tif' to generate the tif files,
# generates them at 300x300 dpi.
# drops them in our current directory
# then runs $progdir/tesseract on them, deleting the .raw
# and .map files that tesseract drops.
pdf2tif "$1"
# edit this to point to wherever you've got your tesseract binary
progdir=..
for j in *.tif
do
x=`basename "$j" .tif`
"${progdir}/tesseract" "${j}" "${x}"
rm -f "${x}.raw"
rm -f "${x}.map"
#un-comment next line if you want to remove the .tif files when done.
# rm "${j}"
done
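
One thing to watch: the loop in ocr.sh runs tesseract on every .tif in the current directory, including pages left over from earlier PDFs. A variant that only touches the pages belonging to one PDF might look like the following. This is a sketch rather than Fred's script. The function name ocr_pages and the TESSERACT variable are made up here, and it assumes pdf2tif's default naming (BASE-01.tif, BASE-02.tif, ...) and the same "tesseract image outputbase" calling convention that ocr.sh uses.

```shell
#!/bin/sh
# Path to the tesseract binary; an assumption, adjust to taste.
TESSERACT=${TESSERACT:-./tesseract}

# Hypothetical helper: OCR only the pages named BASE-*.tif.
ocr_pages() {
    base=$1
    for j in "$base"-*.tif
    do
        [ -f "$j" ] || continue        # no pages matched the pattern
        x=$(basename "$j" .tif)
        "$TESSERACT" "$j" "$x"         # writes $x.txt, as in ocr.sh
        rm -f "$x.raw" "$x.map"        # same clean-up as ocr.sh
    done
}
```

After "pdf2tif Interesting.pdf", calling "ocr_pages Interesting" would then leave other documents' TIFF files alone.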