Using Tesseract to OCR PDFs the Groklaw Way
Saturday, November 11 2006 @ 01:22 PM EST
One of the problems we've faced on Groklaw is how incredibly difficult it is to get plain text from PDF files that come from scanned paper legal documents. No doubt the makers of products like OmniPage Pro would love it if thousands of us bought their product, but aside from its not being available for GNU/Linux systems and being a closed, proprietary product, at $499 it's very expensive for mere mortals such as Groklaw volunteers. When Google released Tesseract, that was great, except that it's a bit hard to use. So here are some instructions to make it a bit easier.
******************************
Check requirements
You need a working Ghostscript installation with TIFF support, and the
following libraries: libtiff, libjpeg, and zlib. Most distros will have these
installed "out of the box".
You also need the corresponding header files. These are usually packaged in the
corresponding "-devel" packages. So, using your preferred package management
software, make sure the following (or equivalent) are installed:
ghostscript
(ghostscript may be divided into smaller sub-packages in your distro)
libtiff
libtiff-devel
libjpeg
libjpeg-devel
zlib
zlib-devel
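If you would rather check than trust the package manager, here is a minimal sketch of a requirements check. It assumes the headers live directly under /usr/include, which is not true on every distro, and the function name missing_headers is just an illustration. The header names tiffio.h, jpeglib.h, and zlib.h are the ones those three libraries normally ship.

```shell
#!/bin/sh
# missing_headers DIR: print each expected header that is not found in DIR.
# Assumption: the headers sit directly in DIR (e.g. /usr/include).
missing_headers() {
    for h in tiffio.h jpeglib.h zlib.h; do
        [ -f "$1/$h" ] || echo "$h"
    done
}

# Ghostscript's command-line binary is normally called "gs".
command -v gs >/dev/null 2>&1 || echo "gs (Ghostscript) not found on PATH"

# Prints nothing if all three headers are present.
missing_headers /usr/include
```

Any header name it prints points at a "-devel" package that still needs installing, or at a distro that keeps its headers somewhere else.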
Check Ghostscript installation
Fetch a PDF (I'm using "Interesting.pdf" as an example) into your current
directory and run the pdf2tif tool on it:
./pdf2tif Interesting.pdf
You should get a set of TIFF files, one per page of the PDF. Use an image viewer
to check they are OK.
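If pdf2tif fails at this point, one likely cause is a Ghostscript build without the tiffg3 output device that the script asks for. The following is a rough sketch of a check, relying on the fact that "gs -h" prints the list of available devices. The helper name has_device is my own invention.

```shell
#!/bin/sh
# has_device NAME: read text on stdin (such as the output of "gs -h") and
# succeed if NAME appears in it as a whole whitespace-separated word.
has_device() {
    tr -s ' \t' '\n\n' | grep -qx -- "$1"
}

# Only run the check if Ghostscript is actually installed.
if command -v gs >/dev/null 2>&1; then
    if gs -h | has_device tiffg3; then
        echo "tiffg3 device available"
    else
        echo "tiffg3 device missing: this Ghostscript build lacks TIFF support"
    fi
fi
```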
Get tesseract
The home page of Tesseract is http://sourceforge.net/projects/tesseract-ocr. The
current version at time of writing is 1.02. Download the tarball (I henceforth
assume it is tesseract-1.02.tar.gz) and untar it somewhere convenient, creating
a directory "tesseract-1.02". Now put the helper scripts in that same directory,
and go into the directory.
tar zxvf tesseract-1.02.tar.gz
mv pdf2tif tesseract-1.02/
mv ocr.sh tesseract-1.02/
cd tesseract-1.02
Build tesseract
In the following, I assume that libtiff is installed into /usr/lib. If you have
built your own libtiff from source, this might not be the case.
./configure --with-libtiff=/usr/lib
make
ln -s ccmain/tesseract
The last stage is necessary to be able to run tesseract directly from where it
has been compiled. As of the time of writing, the authors recommend that you do
*not* run "make install".
Run tesseract
Now you can run the program on your PDF, using the helper script.
./ocr.sh Interesting.pdf
You should get a set of text files, one per page of the PDF. The inevitable
tidying-up process I leave to you!
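As a first step in that tidying-up, you may want the pages back in a single file. This is only a sketch: it assumes the per-page files are named like Interesting-01.txt, Interesting-02.txt, and so on (the pdf2tif default name with ".txt" in place of ".tif"), and join_pages is a made-up helper name.

```shell
#!/bin/sh
# join_pages BASE: concatenate BASE-01.txt, BASE-02.txt, ... into BASE.txt,
# writing a form feed after each page so the page boundaries survive.
# Note: shell globs sort as text, so the order is only right while the
# page numbers all have the same number of digits.
join_pages() {
    base=$1
    : > "$base.txt"                    # start with an empty output file
    for f in "$base"-*.txt; do
        [ -f "$f" ] || continue        # glob matched nothing
        cat "$f" >> "$base.txt"
        printf '\f' >> "$base.txt"
    done
}

# Example: join_pages Interesting
```

The output name Interesting.txt does not match the Interesting-*.txt pattern, so running it twice does not fold the combined file into itself.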
Acknowledgements
Many thanks to the authors and maintainers of tesseract, and to Fred Smith for
the original helper scripts.
The pdf2tif script
#!/bin/sh
# Derived from pdf2ps.
# $Id: pdf2tif,v 1.0 2006/11/03 Fred Smith
# Convert PDF to TIFF files, one per page.

OPTIONS=""
while true
do
    case "$1" in
    -?*) OPTIONS="$OPTIONS $1" ;;
    *) break ;;
    esac
    shift
done

if [ $# -eq 2 ]
then
    outfile=$2
elif [ $# -eq 1 ]
then
    # Default output: <name>-01.tif, <name>-02.tif, ... in the current directory.
    outfile=`basename "$1" \.pdf`-%02d.tif
else
    echo "Usage: `basename $0` [gs options] input.pdf [output-%02d.tif]" 1>&2
    exit 1
fi

# Doing an initial 'save' helps keep fonts from being flushed between pages.
# We have to include the options twice because -I only takes effect if it
# appears before other options.
exec gs $OPTIONS -q -dNOPAUSE -dBATCH -dSAFER -r300x300 -sDEVICE=tiffg3 "-sOutputFile=$outfile" $OPTIONS -c save pop -f "$1"
The ocr.sh script
#!/bin/sh
# Takes one parameter, the path to a PDF file to be processed.
# Uses the custom script 'pdf2tif' to generate the TIFF files
# at 300x300 dpi in the current directory,
# then runs $progdir/tesseract on them, deleting the .raw
# and .map files that tesseract drops.
# Note: every .tif in the current directory is processed, not only
# the pages of the PDF given.

./pdf2tif "$1"

# Edit this to point to wherever you've got your tesseract binary.
progdir=.

for j in *.tif
do
    x=`basename "$j" .tif`
    "${progdir}/tesseract" "${j}" "${x}"
    rm -f "${x}.raw" "${x}.map"
    # Un-comment the next line if you want to remove the .tif files when done.
    # rm "${j}"
done
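One thing to watch: the loop in ocr.sh picks up every .tif file in the current directory, including leftovers from an earlier run. If that bothers you, here is a sketch of a helper that lists only the page images belonging to one PDF. It assumes pdf2tif was run with its default output name, and tifs_for is a hypothetical name, not part of the original scripts.

```shell
#!/bin/sh
# tifs_for PDF: list the page images pdf2tif produces by default for PDF,
# i.e. files named <basename>-<number>.tif in the current directory.
tifs_for() {
    base=`basename "$1" .pdf`
    for t in "$base"-[0-9]*.tif; do
        [ -f "$t" ] || continue        # glob matched nothing
        echo "$t"
    done
}

# Possible use inside ocr.sh, in place of "for j in *.tif":
#   for j in `tifs_for "$1"`
# (this simple form breaks on file names containing spaces)
```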