Tesseract OCR How-To, by Dr Stupid; Scripts by Fred Smith
Monday, December 11 2006 @ 08:45 AM EST

As you know, turning PDFs into text is a large part of what we do on Groklaw, in order to have a searchable and accessible database of the litigation we cover. I have a script Carlo Graziani wrote for me that easily does HTML of PDFs that are text, but when the court documents were scanned in as pictures of each page, the script doesn't work. It's a lot of work translating those documents, having to OCR them and then correct all the errors or just hand type. It's hard to find a good OCR program that works on GNU/Linux, I gather because of patents -- yet another reason why someone needs to solve this software patent problem. And using commercial programs like Omnipage is not possible for most of us, because although it works very well, it's quite expensive. And it works only on Windows or a Mac.

Then someone noticed Tesseract OCR.

What is Tesseract OCR?:

A commercial quality OCR engine originally developed at HP between 1985 and 1995. In 1995, this engine was among the top 3 evaluated by UNLV. It was open-sourced by HP and UNLV in 2005

It was registered on Sourceforge in January. And then in August, Google announced it had fixed a few bugs and was rereleasing it so the community could work on improving it. It does OCR. And it works on GNU/Linux. But it was not what I would call user friendly. There were a few other issues with Tesseract, as Google explained:

A few things to know about Tesseract OCR: for now it only supports the English language, and does not include a page layout analysis module (yet), so it will perform poorly on multi-column material. It also doesn't do well on grayscale and color documents, and it's not nearly as accurate as some of the best commercial OCR packages out there. Yet, as far as we know, despite its shortcomings, Tesseract is far more accurate than any other Open Source OCR package out there. If you know of one that is more accurate, please do tell us!

I was thrilled when Groklaw member Fred Smith sent me some helper scripts to make it work better, and you're free to use them too. To make sure we all understand how to use them, I asked Dr Stupid to write up a how-to for Tesseract OCR, explaining how to use Fred's helper scripts. We still have quite a few exhibits to do from IBM's lengthy list supporting its various summary judgment motions, so if you want to give this a whirl, you might practice on one of those. After the how-to, I will show you the scripts themselves. I would like to thank both Fred and Dr Stupid for doing this for us.


Tesseract OCR - HOW-TO - simple conversion of PDFs to text,
by Dr Stupid


One activity that appears to be central to the smooth running of Groklaw is the conversion of documents from a PDF to plain text format. There are many ways of achieving this, not least the brute force approach of manual transcription - something made possible by a large pool of volunteers (to which we remain forever indebted.)

Some PDFs are readily converted to text by their very nature, being directly generated from a word processing package. In these cases the plain text is held within the file and can be extracted. Unfortunately a great many are instead produced by scanning software; what one sees within [insert your preferred PDF viewing program here] is nothing more than a series of pictures (one to a page) of the original paper document.

OCR (Optical Character Recognition) software can take much of the drudgery out of dealing with such documents, and some of the programs available on the market are very sophisticated. Those who out of necessity or choice require FOSS options, however, have had to grapple with programs of limited functionality or user-friendliness.

The situation improved in 2005 when HP made their own "tesseract" OCR engine available as open source with the assistance of the University of Nevada, Las Vegas (UNLV). This engine offers good performance on English documents but is a little awkward to use as it stands, since it can only work with input files in TIFF format, not PDFs.

This year, a GL reader called Fred Smith sent in some helpful scripts that make it much easier to use Tesseract to convert PDFs into plain text. The rest of this article explains how to compile tesseract on your Linux system and make use of those scripts. With luck, those overlength memorandums may never look so daunting again :)

Aside: Perhaps an enterprising GL reader can put together a Kommander file to create a simple GUI for the script...?

Check requirements

Tesseract can only work with TIFF files, so you need software to convert PDFs to TIFFs. You need to have a working Ghostscript installation with TIFF support: your distro's standard Ghostscript should do fine.

You need to have the following libraries installed: libtiff, libjpeg, and zlib. Most distros will have these installed "out of the box". You also need the corresponding header files. These are usually packaged in the corresponding "-devel" packages.

So, using your preferred package management software, make sure the following (or equivalent) are installed:

ghostscript
libtiff, libtiff-devel
libjpeg, libjpeg-devel
zlib, zlib-devel

(ghostscript may be divided into smaller sub-packages in your distro)
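
Before going further, it can help to confirm that the tools are actually on your PATH. What follows is only a sketch: the function names missing_tools and gs_has_tiffg3 are made up for this illustration, and it assumes that "gs -h" lists the available output devices, as Ghostscript builds normally do.

```shell
#!/bin/sh
# Sketch only: missing_tools and gs_has_tiffg3 are illustrative names,
# not part of tesseract or of the helper scripts.

# Print each named command that cannot be found on PATH, one per line.
missing_tools() {
    for tool in "$@"
    do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Succeed if this Ghostscript build lists the tiffg3 output device.
# Assumes "gs -h" prints the list of available devices.
gs_has_tiffg3() {
    gs -h 2>/dev/null | grep -q tiffg3
}

# Example calls, left commented out so the file only defines functions:
# missing_tools gs make g++
# gs_has_tiffg3 || echo "this Ghostscript has no tiffg3 device"
```

If missing_tools prints nothing, everything you named was found. It does not check for header files, so the "-devel" packages still need to be verified through your package manager.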

Check Ghostscript installation

Fetch a PDF (I'm using "Interesting.pdf" as an example) into your current directory and run the pdf2tif tool on it:
./pdf2tif Interesting.pdf
You should get a set of TIFF files, one per page of the PDF. Use an image viewer to check they are OK.
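
If the PDF is long, a quick count of the page images can save some squinting. This is a minimal sketch, assuming the default naming used by pdf2tif (the PDF's base name followed by a dash and a page number); count_pages is a hypothetical helper name, not something shipped with the scripts.

```shell
#!/bin/sh
# Sketch only: count_pages is an illustrative helper.
# It assumes page images are named <base>-<number>.tif in the current directory.

count_pages() {
    base=`basename "$1" .pdf`
    # Replace the function's arguments with the matching file names.
    set -- "$base"-[0-9]*.tif
    # If nothing matched, the shell leaves the pattern unexpanded,
    # so the first "file" will not exist.
    if [ -e "$1" ]
    then
        echo $#
    else
        echo 0
    fi
}

# Example call, left commented out:
# count_pages Interesting.pdf
```

Compare the number it prints with the page count your PDF viewer reports.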

Get tesseract

Tesseract's home page is on SourceForge. The current version at time of writing is 1.02. Download the tarball (I henceforth assume it is tesseract-1.02.tar.gz) and untar it somewhere convenient, creating a directory "tesseract-1.02". Now put the helper scripts in that same directory, and go into the directory:
tar zxvf tesseract-1.02.tar.gz
mv pdf2tif tesseract-1.02/
mv tesseract-1.02/
cd tesseract-1.02

Build tesseract

In the following, I assume that libtiff is installed into /usr/lib. If you have built your own libtiff from source, this might not be the case.
./configure --with-libtiff=/usr/lib
make
ln -s ccmain/tesseract
The last stage is necessary to be able to run tesseract directly from where it has been compiled. As of the time of writing, the authors recommend that you do not run "make install".
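
To confirm that the build and the symlink both worked, you can test whether "tesseract" in the build directory resolves to an executable file. This is just a sketch; tesseract_ready is a name invented here, and it checks nothing beyond the presence of an executable.

```shell
#!/bin/sh
# Sketch only: tesseract_ready is an illustrative helper.
# It succeeds if <dir>/tesseract exists and is executable; a symlink
# is followed, so a dangling link counts as not ready.

tesseract_ready() {
    [ -x "$1/tesseract" ]
}

# Example call, left commented out:
# tesseract_ready . && echo "tesseract is ready to run from here"
```

If the check fails right after the "ln -s" step, the most likely cause is that the compile did not finish and ccmain/tesseract was never created.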

Run tesseract

Now you can run the program on your PDF, using the helper script.
./ Interesting.pdf
You should get a set of text files, one per page of the PDF. The inevitable tidying-up process I leave to you!
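
One small piece of that tidying-up can be scripted: joining the per-page text files back into a single document. The sketch below assumes the page files are named after the PDF with a dash and a zero-padded page number, and join_pages is a hypothetical helper name. Because it relies on the shell's sorted glob order, pages stay in sequence only up to page 99.

```shell
#!/bin/sh
# Sketch only: join_pages is an illustrative helper.
# It assumes page files named <base>-<number>.txt in the current directory
# and writes the combined text to <base>.txt.

join_pages() {
    base=`basename "$1" .pdf`
    # The glob expands in sorted order, so -01, -02, ... come out in sequence.
    # The output name has no "-<number>" part, so it never matches the glob.
    cat "$base"-[0-9]*.txt > "$base".txt
}

# Example call, left commented out:
# join_pages Interesting.pdf
```

The individual page files are left in place, so you can still compare each one against its TIFF while correcting.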


Many thanks to the authors and maintainers of tesseract, and to Fred Smith for the original helper scripts.

Fred Smith's scripts

Here are Fred's scripts as he sent them to me, for those who don't need a how-to:

I've been hacking at some scripts and wanted to pass on to you my latest versions. I think they will work better and also be easier to use.

Instead of viewing the online PDF and printing to file, with these scripts you will need to have a local copy of the PDF, because we now use a new script named "pdf2tif" to turn the pdf directly into a tif file without any intermediate ps file. It seems to give considerably better resolution on the text (the tif file sure looks a lot better). It's not clear how much better the OCR'd result is, but I'd think it would be at least a little better (after all, these documents are pretty poor quality to start with.)

You'll want to put this in your path somewhere (I put mine in /usr/local/bin) and make sure to give it execute permission.

Here's pdf2tif:


#!/bin/sh
# Derived from pdf2ps.
# $Id: pdf2tif,v 1.0 2006/11/03 Fred Smith
# Convert PDF to TIFF file.

# Collect any leading options so they can be passed through to gs.
OPTIONS=""
while true
do
	case "$1" in
	-?*) OPTIONS="$OPTIONS $1" ;;
	*) break ;;
	esac
	shift
done

if [ $# -eq 2 ]
then
	outfile=$2
elif [ $# -eq 1 ]
then
	outfile=`basename "$1" .pdf`-%02d.tif
else
	echo "Usage: `basename $0` [-dASCII85EncodePages=false] [-dLanguageLevel=1|2|3] input.pdf [outfile]" 1>&2
	exit 1
fi

# Doing an initial 'save' helps keep fonts from being flushed between pages.
# We have to include the options twice because -I only takes effect if it
# appears before other options.
exec gs $OPTIONS -q -dNOPAUSE -dBATCH -dSAFER -r300x300 -sDEVICE=tiffg3 \
	"-sOutputFile=$outfile" $OPTIONS -c save pop -f "$1"

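The "-%02d" in the default output name is a template that Ghostscript fills in with the page number, which is why you get one TIFF per page. As a rough illustration of the resulting names, here is a sketch that reproduces the pattern with printf; page_name is a made-up helper, not part of pdf2tif.

```shell
#!/bin/sh
# Sketch only: page_name is an illustrative helper showing the file name
# that the default pdf2tif output template yields for a given page.

page_name() {
    # $1 is the PDF path, $2 is the page number (plain decimal, e.g. 3).
    printf '%s-%02d.tif\n' "$(basename "$1" .pdf)" "$2"
}

# Example call, left commented out:
# page_name Interesting.pdf 3
```

For Interesting.pdf, page 3 comes out as Interesting-03.tif. The directory part of the path is dropped, so the TIFFs land in your current directory wherever the PDF lives.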

And here's the latest version of the second script, which runs tesseract on the output of pdf2tif. To use this, edit the value of the variable "progdir" to point to wherever your tesseract binary is located. Also, put it somewhere convenient and give it execute permission. It leaves the resulting .txt files (one per page) in your current directory.



#!/bin/sh
# takes one parameter, the path to a pdf file to be processed.
# uses custom script 'pdf2tif' to generate the tif files,
# generates them at 300x300 dpi.
# drops them in our current directory
# then runs $progdir/tesseract on them, deleting the .raw
# and .map files that tesseract drops.

pdf2tif "$1"

# edit this to point to wherever you've got your tesseract binary
progdir=/path/to/tesseract-1.02    # placeholder: set this to your own directory

for j in *.tif
do
	x=`basename "$j" .tif`
	"${progdir}/tesseract" "${j}" "${x}"
	rm -f "${x}.raw"
	rm -f "${x}.map"

	#un-comment next line if you want to remove the .tif files when done.
	# rm "${j}"
done


Tesseract OCR How-To, by Dr Stupid; Scripts by Fred Smith
Authored by: Anonymous on Monday, December 11 2006 @ 09:02 AM EST
Very cool.


What would you want from a gui?
Authored by: Maot on Monday, December 11 2006 @ 10:06 AM EST
As a serious question - I'm pretty happy myself simply playing on the command
line :-D

Would you want to be able to simply select a PDF file and then wait to see the
result as text all stitched back together as one document? Would you like to be
able to view the original PDF pages (or more likely the converted TIFFs) side
by side with the converted text? In different windows? In your favourite
browser as separate frames?

Also, to the coders out there: what language would you use to do this?

I can see it would be easy to knock up a Java app to automate this and provide a
GUI. Or the Java app could be a simple web server and allow the user interface
to be built in script/HTML. Or for that matter it could be a C++ wrapper with
embedded web server for similar functionality (I shy away from cross-platform
C++ GUIs though). There are so many options.

Maybe I should go play with it after work this evening and see what can be done
quickly and easily considering my lack of spare time...


tesseract b0rken on amd64?
Authored by: rsmith on Monday, December 11 2006 @ 10:20 AM EST
Tesseract dumps core on my FreeBSD amd64 system.

After some fixes to make it compile on FreeBSD (essentially replacing #includes
of linux/limits.h -> limits.h and malloc.h -> stdlib.h), the compilation
finishes OK.

But there are tons of warnings concerning variable size mismatches and
signed/unsigned comparisons.

When running the program, it dumps core in reverse32 with a segmentation fault.

Best to treat this program as not 64-bit clean for now.

Intellectual Property is an oxymoron.


Tesseract OCR is in Debian Etch
Authored by: Anonymous on Monday, December 11 2006 @ 10:35 AM EST
Can't say about any earlier versions, but it is available through apt-get.


Tesseract OCR RPMs
Authored by: inode_buddha on Monday, December 11 2006 @ 10:43 AM EST
I haven't the time at the moment (job interview) but perhaps I could roll all
this up into some RPM files later this evening or tomorrow. Would anyone be
interested in this?

Copyright info in bio

"When we speak of free software,
we are referring to freedom, not price"
-- Richard M. Stallman


Tesseract availability in Gentoo
Authored by: WhiteFang on Monday, December 11 2006 @ 11:03 AM EST
Tesseract is not currently available in the Gentoo repository. It seems to be in
a very active state of flux. Gentoo developer policy is generally to _not_ make
CVS based ebuilds available. If anyone is interested in building such an
experimental ebuild, I wouldn't mind being "Joe User" and testing it.

DRM - Degrading, Repulsive, Meanspirited 'Nuff Said.
"I shouldn't have asked ... "


Adobe Acrobat
Authored by: Anonymous on Monday, December 11 2006 @ 11:13 AM EST
I like using Adobe Acrobat to OCR PDF documents. It also allows one to have
searchable PDFs which retain the original image.

Adobe Acrobat will also OCR documents which are thousands of pages long.

In regard to price, I think you get what you pay for. It is much easier to use
a commercial product than a rough-edged free product. I also don't find the
commercial products that expensive. An hour with a lawyer is much more
expensive. The commercial products also often accompany scanners for free.

The main limitation of commercial products is that they are not available on
Linux.

On an Intel Mac, one can run Mac OS X simultaneously with Windows and
Linux. Thus one has the capacity to run nearly all software and have the
choice of which is the best process.


OmniPage on Linux
Authored by: Anonymous on Monday, December 11 2006 @ 11:15 AM EST
I think it was OmniPage, but I might be misremembering. In any case, years ago I
sent an e-mail to a company that made a rather good OCR program for Macintosh
asking if there's any chance of a Linux version. I got a reply, apparently from
a technical person, saying, yes, we already have a version that runs on Linux
but I'm not sure when it will be released. About a year later I wrote again to
ask what had happened to the Linux version and I got a reply, apparently from a
non-technical person, saying that there never was and never will be a Linux
version.

Perhaps in the meantime they had made some kind of exclusive deal with somebody.


Tesseract OCR In Debian
Authored by: John Hasler on Monday, December 11 2006 @ 11:25 AM EST
For users of Debian/Unstable 'apt-get install tesseract-ocr tesseract-ocr-data'
will install Tesseract. The packages may also be available in
Debian-derivatives such as Ubuntu.

IOANAL. Licensed under the GNU General Public License


Tesseract OCR How-To, by Dr Stupid; Scripts by Fred Smith
Authored by: Steve Martin on Monday, December 11 2006 @ 11:26 AM EST
Wahoo! I and my fingers thank you.

"When I say something, I put my name next to it." -- Isaac Jaffee, "Sports


Thank you for the interesting article
Authored by: Anonymous on Monday, December 11 2006 @ 11:56 AM EST
Cool to see that Groklaw is a resource for all sorts of things in which the
Groklaw community has expertise above and beyond the other communities out
there.

Even though I don't need OCR stuff myself, it's fascinating to see the technology
that makes Groklaw tick.


Tesseract OCR How-To, by Dr Stupid; Scripts by Fred Smith
Authored by: grash on Monday, December 11 2006 @ 12:13 PM EST
Very interesting article. I'm assuming that the scripts along with the
Tesseract software create a text file with little or no need to correct
pagination. I have very good luck with Evince (GNU) or Acrobat Reader (free but
non-GNU). I just bring up the PDF with Evince and a text editor (Gedit), do a
select all and copy in Evince, then a Paste in Gedit. Works great for me.
Although you don't have to worry about spelling etc., it does take the necessary
time to paginate. I've never had a PDF that this didn't work on but I wonder if
there are some out there that you are unable to select the text (i.e. similar to
a TIF - picture only). Anyway, great article. I'll have to check it out.


Tesseract OCR How-To, by Dr Stupid; Scripts by Fred Smith
Authored by: Anonymous on Monday, December 11 2006 @ 12:13 PM EST
Hmmm, shades of Linux car versus Microsoft car....

Whilst I've run various unixes for 20 years, and am not what is termed "a newbie", the tesseract compile fails on Fedora Core 6 with, umm, something that only a programmer could love. Unfortunately, in spite of my grey hair and 12 other programming languages, I don't understand C++, so I am stuck until some kind person comes to my rescue.

I find it particularly stunning that gcc is talking about something called â, which isn't in the code and which I can't even type, let alone fix.

source='tessinit.cpp' object='tessinit.o' libtool=no
depfile='.deps/tessinit.Po' tmpdepfile='.deps/tessinit.TPo'
depmode=gcc3 /bin/sh ../config/depcomp
g++ -DHAVE_CONFIG_H -I. -I. -I.. -I../ccstruct -I../ccutil -I../cutil -I../classify -I../image -I../dict -I../viewer -DNDEBUG -O3 -Wall -c -o tessinit.o `test -f 'tessinit.cpp' || echo './'`tessinit.cpp
source='tface.cpp' object='tface.o' libtool=no
depfile='.deps/tface.Po' tmpdepfile='.deps/tface.TPo'
depmode=gcc3 /bin/sh ../config/depcomp
g++ -DHAVE_CONFIG_H -I. -I. -I.. -I../ccstruct -I../ccutil -I../cutil -I../classify -I../image -I../dict -I../viewer -DNDEBUG -O3 -Wall -c -o tface.o `test -f 'tface.cpp' || echo './'`tface.cpp
../cutil/globals.h:46: error: previous declaration of â with â linkage
../ccutil/getopt.h:23: error: conflicts with new declaration with â linkage
../cutil/globals.h:47: error: previous declaration of â with â linkage
../ccutil/getopt.h:24: error: conflicts with new declaration with â linkage
make[3]: *** [tface.o] Error 1
make[3]: Leaving directory `/home/UK/905639/Downloads/tesseract-1.02/wordrec'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/home/UK/905639/Downloads/tesseract-1.02/wordrec'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/UK/905639/Downloads/tesseract-1.02'
make: *** [all] Error 2


Authored by: Altair_IV on Monday, December 11 2006 @ 12:23 PM EST
For better OCR'ing of poorly-scanned documents, check out the application called unpaper. It can clean and straighten up the scanned pages before you OCR them, hopefully giving you better accuracy.

It's available in the main Debian repositories, and probably most of the other distros as well.


Changing the subject, personally I've been wishing for a good OCR program that can handle other languages, especially CJK. At least there are some options for OCR regarding western scripts, however poor, but there's absolutely nothing available for complex scripts right now. For Japanese I'm limited to running a program that came with an old scanner under wine. It works well enough, but there really needs to be a FOSS solution.

Monsters from the id!!
m(_ _)m


Tesseract OCR How-To, by Dr Stupid; Scripts by Fred Smith
Authored by: fredex on Monday, December 11 2006 @ 12:26 PM EST

Thanks for publishing my scripts, and especially thanks to Dr. Stupid for his
elaboration on how to use them.

I'd like to throw in an additional tidbit or two:

1. Tesseract 1.0.x does not seem to work (even on x86/i386 Linux) when compiled
with a 4.x version of g++. If I compile it on my CentOS 4.4 box it will run forever
and never produce output. As a result I'm using a static binary built on an old
RH 7.3 I still have around (which uses Red Hat's much-maligned 2.96 version),
and that binary works fine on CentOS 4.4.

And, yes, as another poster noted it spews tons of warnings at compile-time, but
I've ignored them since it does seem to function well enough for my needs.

2. Since sending you the script I've done one small tweak so that in
addition to leaving you individual text files per page, it also concatenates
them into a single file for a slight increase in convenience. I can post that
version later on after I get home.


What to transcribe?
Authored by: Anonymous on Monday, December 11 2006 @ 12:27 PM EST
In a related matter: Could you make a prioritized list of what would be
interesting to have transcribed? I for one don't quite know where to start or
what would be most helpful and what would just be a waste of time.


Tesseract OCR How-To, by Dr Stupid; Scripts by Fred Smith
Authored by: Anonymous on Monday, December 11 2006 @ 02:41 PM EST
One of the problems of using Ghostscript to do the translation from PDF to TIFF
is that Ghostscript wants to automatically rescale the image. Any method that
modifies the scanned image harms the OCR process. To get around this problem, I
wrote a routine in C++ which searches through (at least some) PDF files and
extracts (certain image types) to a TIFF. I have used this successfully with
grayscale images. The code is not in a refined state, and so I have not made it
publicly available before. Also I do not understand the standard Linux
autobuild process, and so I have a simple makefile. If anyone is interested, and
knows a project which might use this code, I can make the source available.
Please suggest where to put it.


How good is it?
Authored by: rsmith on Monday, December 11 2006 @ 06:15 PM EST

It seems clear that tesseract still needs some work before it is ready for prime time. Before we get carried away, how good is the recognition in its current state?

I've tried both ocrad and gocr in the past on a couple of the (admittedly bad quality) AT&T contracts. The results were not encouraging. It was less work to retype the documents than to correct the OCR errors.

Has anyone done a recent test with these apps, with a relatively good quality document like IBM-882?

PJ, would you be interested in a comparison of these programs, maybe in the form of an article?

Intellectual Property is an oxymoron.


OCR software patents
Authored by: Anonymous on Monday, December 11 2006 @ 06:27 PM EST
It would seem that:

(a) Certain patents that describe ocr technology advances (patents that are
software in nature) at this point in time are not 'trivial' compared to a
significant percentage of software patents.

(b) It is not certain that open source developers would reach the same level of
proficiency as commercial products given (a) and the fact that the advances
would remain trade secret without a published patent.

(c) A difficult software based breakthrough in ocr technology would fall in line
with other fields of patentable research in that significant investment yielded
a substantial result.

I am observing this after sampling a small portion of ocr patents spanning the
last 20 years.

Please do not mistake this as an attempt to argue for justification of the large
number of trivial software patents that have been granted.

Just some point/counter-point


GIMP reads PDFs
Authored by: Anonymous on Monday, December 11 2006 @ 08:10 PM EST
I don't know if this helps, but the GIMP will read a PDF and save as whatever
image format you want. Stomfi


Thanks for the suggestions and scripts!
Authored by: Ikester on Monday, December 11 2006 @ 09:26 PM EST
I've been looking for a decent way to perform OCR on GNU/Linux, and now that I
know, I can possibly help on some docs. As a test case, I was able to pull some
passable text out of some really poor scans of an economics journal. gocr, the
OCR engine that comes with Ubuntu, didn't come close.

It may look garbled, but it is easily cleaned up, since tesseract did most of
the heavy lifting. To wit:

In his celebrated article "Toward a Reconstruction of Utility and Welfare
nomiesj' Murray Rothbard vvrote that
[i]ndifferenCe Can never be demonstrated by action. Quite the Contrary.
Every action necessarily signifies a Chojcev and every Choice signifies a
definite preference. Action specifically implies the Contrary of indiffer-
ence|^~R.If a person is really indifferent between tWo alternativesv then he
cannot and Will not Choose between them. Indifference is therefore never
relevant for action and cannot be demonstrated in action. (Rothbard 19977
pp. 225-26)
This seems to be undeniable7 and any attempt to explain vvhy one ehoses
to do X rather than y With reference to indifference rather than preference
strikes one as a logical absurdity a "category mistake.?' Indeed? it seems
to be
a truth similar to the truth that no "constants' can ever be used to
explain a
"variable?' and Why any attempt to explain a variable outcome With
to some constant conditions is likewise absurd.
Nonetheless7 Rothbard and Mises have been criticized by Noziek (1977)
and Caplan (l999)7 for inconsistency in admitting the concept of indifference
into economic analysis after all? even if only indirectly These criticisms have
been ansvvered by Block (19807 1999) and Hulsmann (1999). However? their
ansvvers, although largely correct? seem to bring less than full clarity to the
matter. Setting out from Noziekss eritieism7 I hope to remedy this deficiency
As correctly noted by Block (1980, pp. 423-Z5)7 aside from some rather
confused and easily disposed of remarks? Noziek has but one challenging crit-
icism of Rothbardes and Misesss verdict on indifference. He argues that their
HANS-HERMANN HOPPE is a professor of economies at the University of Nevada at
Las Vegas.


Software Patents - In this case they are good
Authored by: Anonymous on Monday, December 11 2006 @ 10:58 PM EST
Hello PJ,

There is no doubt trivial patents are bad (eg one click).

However developing more reliable OCR algorithms is a highly complex task. Surely
the people who did all that research and testing deserve compensation?

Also patents for GIF / JPEG / MPEG etc compression and SSL encryption were
entirely justified. People developed very clever solutions to a problem and I
think they deserve the right to compensation, via patents.

I think that some software patents are good, as long as they are for an
invention which is a breakthrough and not even remotely trivial.

PJ it is a bit rude to complain that people spent a large amount of time
developing OCR algorithms and then expect them to hand it over for nothing.


Tesseract OCR How-To, by Dr Stupid; Scripts by Fred Smith
Authored by: juliac on Tuesday, December 12 2006 @ 07:47 AM EST

Tesseract is good to know about, but DocMorph is even better. Just upload the pdf (or a file in any of more than 50 different formats) and a few seconds later, download the text file. The OCR is superb, even with a marginal pdf, and it knows 17 different languages.

Have you contributed to Groklaw lately?


Compiling tesseract on Fedora Core 6 x86_64
Authored by: Anonymous on Tuesday, December 12 2006 @ 12:12 PM EST
I tried to compile tesseract on: PowerPC Linux, PowerPC OSX, x86_64 Linux, and
x86 Linux. The only success was on x86 Linux.

To compile on Fedora Core 6 x86_64, configure like this:

CC='gcc -m32' CXX='g++ -m32' ./configure

You get a 32-bit executable, but that doesn't really matter much.

I also had to comment out some duplicate declarations in ccmain/tfacep.h and

I looked at the code, there's no way tesseract is going to run 64-bit without a
lot of work.


Tesseract OCR How-To, by Dr Stupid; Scripts by Fred Smith
Authored by: Anonymous on Tuesday, December 12 2006 @ 12:20 PM EST
"ln -s ccmain/tesseract

The last stage is necessary to be able to run tesseract directly from where it has been compiled. As of the time of writing, the authors recommend that you do not run "make install".

I tried installing tesseract on SuSE10. When I try to run it, it briefly pops up a black window with some msg and then disappears. The download site is slightly lacking in documentation.


Groklaw © Copyright 2003-2013 Pamela Jones.
All trademarks and copyrights on this page are owned by their respective owners.
Comments are owned by the individual posters.

PJ's articles are licensed under a Creative Commons License.