GROKLAW
Using Tesseract to OCR PDFs the Groklaw Way
Saturday, November 11 2006 @ 01:22 PM EST

One of the problems we've faced on Groklaw is how difficult it is to get plain text out of PDF files that come from scanned paper legal documents. No doubt the makers of products like OmniPage Pro would love it if thousands of us bought a copy, but it is not available for GNU/Linux systems, it is a closed proprietary product, and at $499 it is very expensive for mere mortals such as Groklaw volunteers. When Google released Tesseract, that was great, except that it's a bit hard to use. So here are some instructions to make it a bit easier.

******************************
Check requirements

You need to have a working Ghostscript installation with TIFF support. You need
to have the following libraries installed: libtiff, libjpeg, and zlib.
Most distros will have these installed "out of the box".

You also need the corresponding header files. These are usually packaged in the
matching "-devel" packages. So, using your preferred package management
software, make sure the following (or their equivalents) are installed:

ghostscript

(ghostscript may be divided into smaller sub-packages in your distro)

libtiff
libtiff-devel
libjpeg
libjpeg-devel
zlib
zlib-devel
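
Before going further, it can save time to confirm that the pieces listed above are actually present. What follows is only a minimal sketch and is not one of the helper scripts: the function names are made up for illustration, and the default header location /usr/include is an assumption that differs between distros (some put headers in an architecture-specific subdirectory).

```shell
#!/bin/sh
# Sketch: report which prerequisites appear to be missing.
# have_cmd, have_header and check_requirements are illustrative names.

# Succeeds if the named command is found on the PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Succeeds if the named header exists in the include directory given as
# the second argument (/usr/include by default, which is an assumption).
have_header() {
    [ -f "${2:-/usr/include}/$1" ]
}

check_requirements() {
    missing=0
    have_cmd gs || { echo "missing: ghostscript (gs)"; missing=1; }
    # "gs -h" lists the compiled-in devices; tiffg3 is the one pdf2tif uses.
    if have_cmd gs && ! gs -h 2>/dev/null | grep -q tiffg3
    then
        echo "missing: TIFF (tiffg3) support in ghostscript"
        missing=1
    fi
    for h in tiffio.h jpeglib.h zlib.h
    do
        have_header "$h" || { echo "missing header: $h"; missing=1; }
    done
    return $missing
}
```

Calling check_requirements prints one line per missing item and returns non-zero if anything was reported. A clean result does not prove that the build will succeed, only that the obvious pieces are in place.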


Check Ghostscript installation

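Save the pdf2tif helper script (listed at the end of this article) in your
current directory and make it executable (chmod +x pdf2tif). Then fetch a PDF
(I'm using "Interesting.pdf" as an example) into the same directory and run
pdf2tif on it: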

./pdf2tif Interesting.pdf

You should get a set of TIFF files, one per page of the PDF. Use an image viewer
to check they are OK.
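
As a quick sanity check before opening a viewer, each output file can be tested for a TIFF signature. This is a rough sketch with made-up function names (is_tiff, check_pages). It only looks at the first four bytes of each file, so it cannot tell you whether the page image is readable, and it is no substitute for looking at the pages.

```shell
#!/bin/sh
# Sketch: check that each page image starts with a TIFF signature.
# is_tiff and check_pages are illustrative names.

# Succeeds if the file begins with "II*\0" (little-endian TIFF)
# or "MM\0*" (big-endian TIFF).
is_tiff() {
    magic=`od -An -tx1 -N4 "$1" 2>/dev/null | tr -d '[:space:]'`
    case "$magic" in
        49492a00|4d4d002a) return 0 ;;
        *) return 1 ;;
    esac
}

# Print one line per page file for the given base name, e.g. "Interesting".
check_pages() {
    for f in "$1"-*.tif
    do
        [ -e "$f" ] || { echo "no TIFF files found for $1"; return 1; }
        if is_tiff "$f"
        then
            echo "ok:  $f"
        else
            echo "bad: $f"
        fi
    done
}
```

For the example above, check_pages Interesting would print one "ok:" or "bad:" line for each Interesting-NN.tif file.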


Get tesseract

The home page of Tesseract is http://sourceforge.net/projects/tesseract-ocr. The
current version at time of writing is 1.02. Download the tarball (I henceforth
assume it is tesseract-1.02.tar.gz) and untar it somewhere convenient, creating
a directory "tesseract-1.02". Now put the helper scripts in that same directory,
and go into the directory.

tar zxvf tesseract-1.02.tar.gz
mv pdf2tif tesseract-1.02/
mv ocr.sh tesseract-1.02/
cd tesseract-1.02


Build tesseract

In the following, I assume that libtiff is installed into /usr/lib. If you have
built your own libtiff from source, this might not be the case.

./configure --with-libtiff=/usr/lib
make
ln -s ccmain/tesseract

The last step creates a symlink named "tesseract" in the current directory,
which is necessary to be able to run tesseract directly from where it has been
compiled. As of the time of writing, the authors recommend that you do *not*
run "make install".


Run tesseract

Now you can run the program on your PDF, using the helper script.

./ocr.sh Interesting.pdf

You should get a set of text files, one per page of the PDF. The inevitable
tidying-up process I leave to you!
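
One common first step in tidying up is to join the per-page files back into a single document. The sketch below is one possible way to do that, not part of the original helper scripts; join_pages is a hypothetical name. It relies on the shell's sorted glob order, which matches page order for the two-digit page numbers pdf2tif produces, but would misplace pages in a PDF of 100 pages or more.

```shell
#!/bin/sh
# Sketch: join per-page OCR output into one text file.
# join_pages is an illustrative name, not part of the helper scripts.

# Usage: join_pages BASE OUTFILE
# Concatenates BASE-01.txt, BASE-02.txt, ... in sorted glob order,
# with a blank line after each page.
join_pages() {
    base=$1
    out=$2
    # Replace the function's own arguments with the list of page files.
    set -- "$base"-[0-9]*.txt
    [ -e "$1" ] || { echo "no page files for $base" 1>&2; return 1; }
    : > "$out"
    for f in "$@"
    do
        cat "$f" >> "$out"
        echo >> "$out"
    done
}
```

For the running example, join_pages Interesting Interesting-all.txt would collect Interesting-01.txt, Interesting-02.txt and so on into Interesting-all.txt, leaving the per-page files in place.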


Acknowledgements

Many thanks to the authors and maintainers of tesseract, and to Fred Smith for
the original helper scripts.



Helper script: pdf2tif

#!/bin/sh
# Derived from pdf2ps.
# $Id: pdf2tif,v 1.0 2006/11/03 Fred Smith
# Convert PDF to TIFF files, one per page.

# Collect any leading options so they can be passed through to Ghostscript.
OPTIONS=""
while true
do
    case "$1" in
        -?*) OPTIONS="$OPTIONS $1" ;;
        *) break ;;
    esac
    shift
done

if [ $# -eq 2 ]
then
    outfile=$2
elif [ $# -eq 1 ]
then
    # Default: Interesting.pdf becomes Interesting-01.tif, Interesting-02.tif, ...
    outfile=`basename "$1" .pdf`-%02d.tif
else
    echo "Usage: `basename $0` [gs options] input.pdf [output-%02d.tif]" 1>&2
    exit 1
fi

# Doing an initial 'save' helps keep fonts from being flushed between pages.
# We have to include the options twice because -I only takes effect if it
# appears before other options.
exec gs $OPTIONS -q -dNOPAUSE -dBATCH -dSAFER -r300x300 -sDEVICE=tiffg3 "-sOutputFile=$outfile" $OPTIONS -c save pop -f "$1"



Helper script: ocr.sh

#!/bin/sh

# takes one parameter, the path to a pdf file to be processed.
# uses custom script 'pdf2tif' to generate the tif files,
# generates them at 300x300 dpi.
# drops them in our current directory
# then runs $progdir/tesseract on them, deleting the .raw
# and .map files that tesseract drops.

if [ $# -ne 1 ]
then
    echo "Usage: `basename $0` input.pdf" 1>&2
    exit 1
fi

./pdf2tif "$1" || exit 1

# edit this to point to wherever you've got your tesseract binary
progdir=.

# only process the pages pdf2tif just made for this PDF
base=`basename "$1" .pdf`

for j in "$base"-*.tif
do
    x=`basename "$j" .tif`
    # tesseract writes the recognized text to ${x}.txt
    "${progdir}/tesseract" "$j" "$x"
    rm -f "${x}.raw" "${x}.map"
    #un-comment next line if you want to remove the .tif files when done.
    # rm "$j"
done

Groklaw © Copyright 2003-2013 Pamela Jones.
All trademarks and copyrights on this page are owned by their respective owners.
Comments are owned by the individual posters.

PJ's articles are licensed under a Creative Commons License.