Tesseract OCR How-To, by Dr Stupid; Scripts by Fred Smith
Monday, December 11 2006 @ 08:45 AM EST
As you know, turning PDFs into text is a large part of what we do on Groklaw, in order to have a searchable and accessible database of the litigation we cover. I have a script Carlo Graziani wrote for me that easily does HTML of PDFs that are text, but when the court documents were scanned in as pictures of each page, the script doesn't work. It's a lot of work translating those documents, having to OCR them and then correct all the errors or just hand type. It's hard to find a good OCR program that works on GNU/Linux, I gather because of patents -- yet another reason why someone needs to solve this software patent problem. And using commercial programs like Omnipage is not possible for most of us, because although it works very well, it's quite expensive. And it runs only on Windows or a Mac. Then someone noticed Tesseract OCR.
What is Tesseract OCR?: A commercial quality OCR engine originally developed at HP between 1985 and 1995. In 1995, this engine was among the top 3 evaluated by UNLV. It was open-sourced by HP and UNLV in 2005. It was registered on Sourceforge in January. And then in August, Google announced it had fixed a few bugs and was rereleasing it so the community could work on improving it. It does OCR. And it works on GNU/Linux. But it was not what I would call user friendly. There were a few other issues with Tesseract, as Google explained: A few things to know about Tesseract OCR: for now it only supports the English language, and does not include a page layout analysis module (yet), so it will perform poorly on multi-column material. It also doesn't do well on grayscale and color documents, and it's not nearly as accurate as some of the best commercial OCR packages out there. Yet, as far as we know, despite its shortcomings, Tesseract is far more accurate than any other Open Source OCR package out there. If you know of one that is more accurate, please do tell us!
I was thrilled when Groklaw member Fred Smith sent me some helper scripts to make it work better, and you're free to use them too. To make sure we all understand how to use them, I asked Dr Stupid to write up a how-to for Tesseract OCR, explaining how to use Fred's helper scripts. We still have quite a few exhibits to do from IBM's lengthy list supporting its various summary judgment motions, so if you want to give this a whirl, you might practice on one of those. After the how-to, I will show you the scripts themselves. I would like to thank both Fred and Dr Stupid for doing this for us.
***************************************
Tesseract OCR - HOW-TO - simple conversion of PDFs to text,
by Dr Stupid
Introduction
One activity that appears to be central to the smooth running of Groklaw is the conversion of documents
from a PDF to plain text format. There are many ways of achieving this, not least the brute force approach of manual transcription -
something made possible by a large pool of volunteers (to whom we remain forever indebted).
Some PDFs are readily converted to text by their very nature, being directly generated from a word processing package. In these cases the plain text is held within the file and can be extracted. Unfortunately a great many are instead produced by scanning software; what one sees within [insert your preferred PDF viewing program here] is nothing more than a series of pictures (one to a page) of the original paper document.
OCR (Optical Character Recognition) software can take much of the drudgery out of dealing with such documents, and some of the programs available on the market are very sophisticated. Those who out of necessity or choice require FOSS options, however, have had to grapple with programs of limited functionality or user-friendliness.
The situation improved in 2005 when HP made their own "tesseract" OCR engine available as open source with the assistance of the University of Nevada, Las Vegas (UNLV). This engine offers good performance on English documents but is a little awkward to use as it stands, since it can only work with input files in TIFF format, not PDFs.
This year, a GL reader called Fred Smith sent in some helpful scripts that make it much easier to use Tesseract to convert PDFs into plain text. The rest of this article explains how to compile tesseract on your Linux system and make use of those scripts. With luck, those overlength memorandums may never look so daunting again :)
Aside: Perhaps an enterprising GL reader can put together a Kommander file to create a simple GUI for the script...?
Check requirements
Tesseract can only work with TIFF files, so you need software to convert PDFs to TIFFs. You need to have a working Ghostscript installation with TIFF support: your distro's standard Ghostscript should do fine.
You need to have the following libraries installed: libtiff, libjpeg, and zlib.
Most distros will have these installed "out of the box".
You also need the corresponding header files. These are usually packaged in the
corresponding "-devel" packages.
So, using your preferred package management software
make sure the following (or equivalent) are installed:
ghostscript (ghostscript may be divided into smaller sub-packages in your distro)
libtiff
libtiff-devel
libjpeg
libjpeg-devel
zlib
zlib-devel
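If you are not sure whether the "-devel" packages are really there, a quick look for the header files can save a failed build. The following is only a sketch: the function name missing_headers is made up for this example, and the header names used in the usage line (tiffio.h, jpeglib.h, zlib.h) and the /usr/include location are the usual ones but are assumptions about your distro.

```shell
# missing_headers DIR NAME... : print each header NAME that is not found in DIR.
# (Hypothetical helper for illustration; not part of tesseract or Fred's scripts.)
missing_headers() {
    dir=$1
    shift
    for h in "$@"
    do
        # Report only the headers that do not exist as regular files.
        [ -f "$dir/$h" ] || echo "$h"
    done
}
```

Running "missing_headers /usr/include tiffio.h jpeglib.h zlib.h" should print nothing when all three headers are present; any name it does print points at a -devel package that still needs installing (or at headers living in a different directory).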
Check Ghostscript installation
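Save the pdf2tif helper script (listed in full at the end of this article) into your
current directory and give it execute permission. Then fetch a PDF (I'm using
"Interesting.pdf" as an example) into the same directory and run the pdf2tif tool on it:
chmod +x pdf2tif
./pdf2tif Interesting.pdf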
You should get a set of TIFF files, one per page of the PDF. Use an image viewer
to check they are OK.
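If no TIFF files appear, one possible cause is a Ghostscript build that lacks the fax TIFF driver pdf2tif asks for (-sDEVICE=tiffg3). Ghostscript prints its available devices when run as "gs -h", so a rough check can be sketched as below. The function name gs_lists_device is hypothetical, and the check assumes the device names in that listing are separated by spaces or newlines.

```shell
# gs_lists_device LISTING DEVICE : succeed if DEVICE appears as a whole word
# in LISTING, which is expected to be the text printed by "gs -h".
# (Hypothetical helper for illustration only.)
gs_lists_device() {
    # Put every word on its own line, then look for an exact, literal match,
    # so that "tiffg3" is not confused with "tiffg32d".
    printf '%s\n' "$1" | tr -s ' \t' '\n\n' | grep -qxF -- "$2"
}
```

Typical use would be: gs_lists_device "$(gs -h)" tiffg3 && echo "tiffg3 driver found"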
Get tesseract
The home page of Tesseract is http://sourceforge.net/projects/tesseract-ocr.
The current version at time of writing is 1.02.
Download the tarball (I henceforth assume it is tesseract-1.02.tar.gz) and
untar it somewhere convenient, creating a directory "tesseract-1.02". Now
put the helper scripts in that same directory, and go into the directory:
tar zxvf tesseract-1.02.tar.gz
mv pdf2tif tesseract-1.02/
mv ocr.sh tesseract-1.02/
cd tesseract-1.02
Build tesseract
In the following, I assume that libtiff is installed into /usr/lib. If you have
built your own libtiff from source, this might not be the case.
./configure --with-libtiff=/usr/lib
make
ln -s ccmain/tesseract
The last stage is necessary to be able to run tesseract directly from where it
has been compiled. As of the time of writing, the authors recommend that you
do not run "make install".
Run tesseract
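Now you can run the program on your PDF, using the helper script. Two details of ocr.sh matter here: it calls pdf2tif by name, so pdf2tif must be on your PATH, and it looks for the tesseract binary in the directory given by its "progdir" variable, which is set to ".." as shipped. Since the symlink created above lives in the current directory, edit ocr.sh to read "progdir=." first. Then:
chmod +x ocr.sh
PATH=$PATH:. ./ocr.sh Interesting.pdf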
You should get a set of text files, one per page of the PDF. The inevitable tidying-up process I leave to you!
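One small piece of that tidying-up can be scripted: stitching the per-page text files back into a single document. What follows is a minimal sketch, not part of Fred's scripts. It assumes the pages are named the way pdf2tif and ocr.sh name them as listed in this article (Interesting-01.txt, Interesting-02.txt, and so on, numbered from 1 with no gaps); the function name join_pages is made up for illustration.

```shell
# join_pages BASE OUT : concatenate BASE-01.txt, BASE-02.txt, ... into OUT,
# in page order, stopping at the first page number that has no file.
# (Hypothetical helper; assumes the BASE-%02d naming used by pdf2tif.)
join_pages() {
    base=$1
    out=$2
    : > "$out"                      # start with an empty output file
    n=1
    while :
    do
        page=$(printf '%s-%02d.txt' "$base" "$n")
        [ -f "$page" ] || break     # no such page: we are done
        cat "$page" >> "$out"
        n=$((n + 1))
    done
}
```

For example, "join_pages Interesting Interesting-all.txt" would collect every page of the example PDF into Interesting-all.txt. Counting pages explicitly, rather than relying on "cat Interesting-*.txt", keeps page 100 after page 99 in documents longer than 99 pages.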
Acknowledgements
Many thanks to the authors and maintainers of tesseract, and to Fred Smith for the original helper scripts.
Fred Smith's scripts
Here are Fred's scripts as he sent them to me, for those who don't need a how-to:
I've been hacking at some scripts and wanted to pass on to you my latest versions. I think they will work better and also be easier to use.
Instead of viewing the online PDF and printing to file, with these scripts you will need to have a local copy of the PDF, because we now use a new script named "pdf2tif" to turn the pdf directly into a tif file without any intermediate ps file. It seems to give considerably better resolution on the text (the tif file sure looks a lot better). It's not clear how much better the OCR'd result is, but I'd think it would be at least a little better (after all, these documents are pretty poor quality to start with.)
You'll want to put this in your path somewhere (I put mine in /usr/local/bin) and make sure to give it execute permission.
Here's pdf2tif:
-----------------------------
#!/bin/sh
# Derived from pdf2ps.
# $Id: pdf2tif,v 1.0 2006/11/03 Fred Smith
# Convert PDF to TIFF file.

OPTIONS=""
# Collect any leading options (arguments starting with "-") to pass on to gs.
while true
do
    case "$1" in
    -?*) OPTIONS="$OPTIONS $1" ;;
    *) break ;;
    esac
    shift
done

if [ $# -eq 2 ]
then
    outfile=$2
elif [ $# -eq 1 ]
then
    # Default: one TIFF per page, named <input>-01.tif, <input>-02.tif, ...
    outfile=`basename "$1" .pdf`-%02d.tif
else
    echo "Usage: `basename $0` [gs options] input.pdf [output.tif]" 1>&2
    exit 1
fi

# Doing an initial 'save' helps keep fonts from being flushed between pages.
# We have to include the options twice because -I only takes effect if it
# appears before other options.
exec gs $OPTIONS -q -dNOPAUSE -dBATCH -dSAFER -r300x300 -sDEVICE=tiffg3 \
    "-sOutputFile=$outfile" $OPTIONS -c save pop -f "$1"
--------------------------
and here's the latest version of ocr.sh. To use this, edit the value for the variable "progdir" to point to wherever your tesseract binary is located. Also, put it somewhere convenient and give it execute permission. It leaves the resulting .txt files (one per page) in your current directory.
-------------------------------
#!/bin/sh
# takes one parameter, the path to a pdf file to be processed.
# uses custom script 'pdf2tif' to generate the tif files,
# generates them at 300x300 dpi.
# drops them in our current directory
# then runs $progdir/tesseract on them, deleting the .raw
# and .map files that tesseract drops.

# pdf2tif must be on your PATH; quote the argument so paths with spaces work.
pdf2tif "$1"

# edit this to point to wherever you've got your tesseract binary
progdir=..

# note: this processes every .tif file in the current directory
for j in *.tif
do
    x=`basename "$j" .tif`
    "${progdir}/tesseract" "${j}" "${x}"
    rm -f "${x}.raw" "${x}.map"
    #un-comment next line if you want to remove the .tif files when done.
    # rm "${j}"
done
|
|
Authored by: Anonymous on Monday, December 11 2006 @ 09:02 AM EST |
Very cool.
|
|
Authored by: SpaceLifeForm on Monday, December 11 2006 @ 09:04 AM EST |
Please make any links clickable.
---
You are being MICROattacked, from various angles, in a SOFT manner.
|
- "Pentagon panel eyes software security" - Authored by: Brian S. on Monday, December 11 2006 @ 11:07 AM EST
- News pick: France plans open source center of excellence - Authored by: rcbixler on Monday, December 11 2006 @ 11:30 AM EST
- "Microsoft-Novell affair: In your face" - Authored by: Brian S. on Monday, December 11 2006 @ 01:32 PM EST
- Red Hat dismisses threat posed by Oracle and Microsoft - Authored by: SpaceLifeForm on Monday, December 11 2006 @ 02:13 PM EST
- L'Inq: Microsoft Vista to churn "$70 billion" for IT industry - Authored by: Jude on Monday, December 11 2006 @ 02:22 PM EST
- SCO Stock: It's dead cat time - Authored by: Anonymous on Monday, December 11 2006 @ 03:00 PM EST
- Newspick - Ubuntu begins its transformation - Authored by: Alan(UK) on Monday, December 11 2006 @ 06:04 PM EST
- "Novell Linux push fails to cover NetWare losses" - Authored by: Brian S. on Monday, December 11 2006 @ 07:18 PM EST
- "Sun slams Ecma’s OpenXML OK" - Authored by: Brian S. on Monday, December 11 2006 @ 07:28 PM EST
- Samsung takes patent lead - Authored by: Anonymous on Monday, December 11 2006 @ 08:32 PM EST
|
Authored by: DaveJakeman on Monday, December 11 2006 @ 09:29 AM EST |
If required.
---
I would rather stand corrected than sit confused.
---
Should one hear an accusation, try it on the accuser.
|
|
Authored by: Maot on Monday, December 11 2006 @ 10:06 AM EST |
As a serious question - I'm pretty happy myself simply playing on the command
line :-D
Would you want to be able to simply select a PDF file and then wait to see the
result as text all stitched back together as one document? Would you like to be
able to view the original PDF pages (or more likely the converted TIFFs) side
by side with the converted text? In different windows? In your favourite
browser as separate frames?
Also to the coders out there, what language would you use to do this?
I can see it would be easy to knock up a Java app to automate this and provide a
GUI. Or the Java app could be a simple web server and allow the user interface
to be built in script/HTML. Or for that matter it could be a C++ wrapper with
embedded web server for similar functionality (I shy away from cross platform
C++ GUIs though). There are so many options.
Maybe I should go play with it after work this evening and see what can be done
quickly and easily considering my lack of spare time...
|
|
Authored by: rsmith on Monday, December 11 2006 @ 10:20 AM EST |
Tesseract dumps core on my FreeBSD amd64 system.
After some fixes to make it compile on FreeBSD (essentially replacing #includes
of linux/limits.h -> limits.h and malloc.h -> stdlib.h), the compilation
finishes OK.
But there are tons of warnings concerning variable size mismatches and
signed/unsigned comparisons.
When running the program, it dumps core in reverse32 with a segmentation fault.
Best to treat this program as not 64-bit clean for now.
---
Intellectual Property is an oxymoron.
|
|
Authored by: Anonymous on Monday, December 11 2006 @ 10:35 AM EST |
Can't say about any earlier versions, but is available through apt-get.
|
|
Authored by: inode_buddha on Monday, December 11 2006 @ 10:43 AM EST |
I haven't the time at the moment (job interview) but perhaps I could roll all
this up into some RPM files later this evening or tomorrow. Would anyone be
interested in this?
---
-inode_buddha
Copyright info in bio
"When we speak of free software,
we are referring to freedom, not price"
-- Richard M. Stallman
|
|
Authored by: WhiteFang on Monday, December 11 2006 @ 11:03 AM EST |
Tesseract is not currently available in the Gentoo repository. It seems to be in
a very active state of flux. Gentoo developer policy is generally to _not_ make
CVS based ebuilds available. If anyone is interested in building such an
experimental ebuild, I wouldn't mind being "Joe User" and testing it.
---
DRM - Degrading, Repulsive, Meanspirited 'Nuff Said.
"I shouldn't have asked ... "
|
|
Authored by: Anonymous on Monday, December 11 2006 @ 11:13 AM EST |
I like using Adobe Acrobat to OCR PDF documents. It allows one to also have
searchable PDFs which retain the original image.
Adobe Acrobat also will OCR documents which are thousands of pages long.
In regard to price, I think you get what you pay for. It is much easier to use
a commercial product than a rough-edged free product. I also don't find the
commercial products that expensive. An hour with a lawyer is much more
expensive. The commercial products also often accompany scanners for free.
The main limitation of commercial products is that they are not available on
Linux.
On an Intel Mac, one can run Mac OS X simultaneously with Windows and
Linux. Thus one has the capacity to run nearly all software and have the
choice of which is the best process.
|
|
Authored by: Anonymous on Monday, December 11 2006 @ 11:15 AM EST |
I think it was OmniPage, but I might be misremembering. In any case, years ago I
sent an e-mail to a company that made a rather good OCR program for Macintosh
asking if there's any chance of a Linux version. I got a reply, apparently from
a technical person, saying, yes, we already have a version that runs on Linux
but I'm not sure when it will be released. About a year later I wrote again to
ask what had happened to the Linux version and I got a reply, apparently from a
non-technical person, saying that there never was and never will be a Linux
version.
Perhaps in the meantime they had made some kind of exclusive deal with somebody.
|
|
Authored by: John Hasler on Monday, December 11 2006 @ 11:25 AM EST |
For users of Debian/Unstable 'apt-get install tesseract-ocr tesseract-ocr-data'
will install Tesseract. The packages may also be available in
Debian-derivatives such as Ubuntu.
---
IOANAL. Licensed under the GNU General Public License
|
|
Authored by: Steve Martin on Monday, December 11 2006 @ 11:26 AM EST |
Wahoo! I and my fingers thank you.
---
"When I say something, I put my name next to it." -- Isaac Jaffee, "Sports Night"
|
|
Authored by: Anonymous on Monday, December 11 2006 @ 11:56 AM EST |
Cool to see that groklaw.net is a resource for all sorts of things in which the
Groklaw community has expertise above and beyond the other communities out
there.
Even though I don't need OCR stuff myself, it's fascinating to see the technology
that makes Groklaw tick.
|
|
Authored by: grash on Monday, December 11 2006 @ 12:13 PM EST |
Very interesting article. I'm assuming that the scripts along with the
Tesseract software create a text file with little or no need to correct
pagination. I have very good luck with Evince (GNU) or Acrobat Reader (free but
non-GNU). I just bring up the PDF with Evince and a text editor (Gedit), do a
select all and copy in Evince, then a Paste in Gedit. Works great for me.
Although you don't have to worry about spelling etc., it does take the necessary
time to paginate. I've never had a PDF that this didn't work on but I wonder if
there are some out there in which you are unable to select the text (i.e. similar to
a TIF - picture only). Anyway, great article. I'll have to check it out.
|
|
Authored by: Anonymous on Monday, December 11 2006 @ 12:13 PM EST |
Hmmm, shades of Linux car versus Microsoft car....
Whilst I've run various unixes for 20 years, and am not what is termed "a newbie", the tesseract compile fails on Fedora Core 6 with, umm, something that only a programmer could love. Unfortunately, in spite of my grey hair and 12 other programming languages, I don't understand C++, so I am stuck until some kind person comes to my rescue.
I find it particularly stunning that gcc is talking about something called â, which isn't in the code and which I can't even type, let alone fix.
source='tessinit.cpp' object='tessinit.o' libtool=no depfile='.deps/tessinit.Po' tmpdepfile='.deps/tessinit.TPo' depmode=gcc3 /bin/sh ../config/depcomp
g++ -DHAVE_CONFIG_H -I. -I. -I.. -I../ccstruct -I../ccutil -I../cutil -I../classify -I../image -I../dict -I../viewer -DNDEBUG -O3 -Wall -c -o tessinit.o `test -f 'tessinit.cpp' || echo './'`tessinit.cpp
source='tface.cpp' object='tface.o' libtool=no depfile='.deps/tface.Po' tmpdepfile='.deps/tface.TPo' depmode=gcc3 /bin/sh ../config/depcomp
g++ -DHAVE_CONFIG_H -I. -I. -I.. -I../ccstruct -I../ccutil -I../cutil -I../classify -I../image -I../dict -I../viewer -DNDEBUG -O3 -Wall -c -o tface.o `test -f 'tface.cpp' || echo './'`tface.cpp
../cutil/globals.h:46: error: previous declaration of â with â linkage
../ccutil/getopt.h:23: error: conflicts with new declaration with â linkage
../cutil/globals.h:47: error: previous declaration of â with â linkage
../ccutil/getopt.h:24: error: conflicts with new declaration with â linkage
make[3]: *** [tface.o] Error 1
make[3]: Leaving directory `/home/UK/905639/Downloads/tesseract-1.02/wordrec'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/home/UK/905639/Downloads/tesseract-1.02/wordrec'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/UK/905639/Downloads/tesseract-1.02'
make: *** [all] Error 2
|
|
Authored by: Altair_IV on Monday, December 11 2006 @ 12:23 PM EST |
For better OCR'ing of poorly-scanned documents, check out the application called unpaper. It can clean and straighten up the scanned page images before you OCR them, hopefully giving you better accuracy.
It's available in the main Debian repositories, and probably most of the other distros as well.
Changing the subject, personally I've been wishing for a good OCR program that can handle other languages, especially CJK. At least there are some options for OCR regarding western scripts, however poor, but there's absolutely nothing available for complex scripts right now. For Japanese I'm limited to running a program that came with an old scanner under wine. It works well enough, but there really needs to be a FOSS solution.
---
Monsters from the id!!
m(_ _)m
|
|
Authored by: fredex on Monday, December 11 2006 @ 12:26 PM EST |
PJ:
Thanks for publishing my scripts, and especially thanks to Dr. Stupid for his
elaboration on how to use them.
I'd like to throw in an additional tidbit or two:
1. Tesseract 1.0.x does not seem to work (even on x86/i386 Linux) when compiled
with version 4.x of g++. If I compile it on my CentOS 4.4 box it will run forever
and never produce output. As a result I'm using a static binary built on an old
RH 7.3 I still have around (which uses Red Hat's much-maligned 2.96 version),
and that binary works fine on CentOS 4.4.
And, yes, as another poster noted it spews tons of warnings at compile-time, but
I've ignored them since it does seem to function well enough for my needs.
2. Since sending you the ocr.sh script I've done one small tweak so that in
addition to leaving you individual text files per page, it also concatenates
them into a single file for a slight increase in convenience. I can post that
version later on after I get home.
|
|
Authored by: Anonymous on Monday, December 11 2006 @ 12:27 PM EST |
In a related matter: Could you make a prioritized list of what would be
interesting to have transcribed? I for one don't quite know where to start or
what would be most helpful and what would just be a waste of time.
|
|
Authored by: Anonymous on Monday, December 11 2006 @ 02:41 PM EST |
One of the problems of using Ghostscript to do the translation from PDF to TIFF
is that Ghostscript wants to automatically rescale the image. Any method that
modifies the scanned image harms the OCR process. To get around this problem, I
wrote a routine in C++ which searches through (at least some) PDF files and
extracts (certain image types) to a TIFF. I have used this successfully with
grayscale images. The code is not in a refined state, and so I have not made it
publicly available before. Also I do not understand the standard Linux
autobuild process, and so I have a simple makefile. If anyone is interested, and
knows a project which might use this code, I can make the source available.
Please suggest where to put it.
|
|
Authored by: rsmith on Monday, December 11 2006 @ 06:15 PM EST |
It seems clear that tesseract still needs some work before it is ready for prime time. Before we get carried away, how good is the recognition in its current state?
I've tried both ocrad and gocr in the past on a couple of the (admittedly bad quality) AT&T contracts. The results were not encouraging. It was less work to retype the documents than to correct the OCR errors.
Has anyone done a recent test with these apps, with a relatively good quality document like IBM-882?
PJ, would you be interested in a comparison of these programs, maybe in the form of an article?
---
Intellectual Property is an oxymoron.
|
|
Authored by: Anonymous on Monday, December 11 2006 @ 06:27 PM EST |
It would seem that:
(a) Certain patents that describe ocr technology advances (patents that are
software in nature) at this point in time are not 'trivial' compared to a
significant percentage of software patents.
(b) It is not certain that open source developers would reach the same level of
proficiency as commercial products given (a) and the fact that the advances
would remain trade secret without a published patent.
(c) A difficult software based breakthrough in ocr technology would fall in line
with other fields of patentable research in that significant investment yielded
a substantial result.
I am observing this after sampling a small portion of ocr patents spanning the
last 20 years.
Please do not mistake this as an attempt to argue for justification of the large
number of trivial software patents that have been granted.
Just some point/counter-point
|
|
Authored by: Anonymous on Monday, December 11 2006 @ 08:10 PM EST |
I don't know if this helps, but the GIMP will read a PDF and save as whatever
image format you want. Stomfi
|
|
Authored by: Ikester on Monday, December 11 2006 @ 09:26 PM EST |
I've been looking for a decent way to perform OCR on GNU/Linux, and now that I
know, I can possibly help on some docs. As a test case, I was able to pull some
passable text out of some really poor scans of an economics journal. gocr, the
OCR engine that comes with Ubuntu, didn't come close.
It may look garbled, but it is easily cleaned up, since tesseract did most of
the heavy lifting. To wit:
NOTE
A Non; ON PREFERENCE AND INDIFFERENCE
IN ECONOMIC ANALYSIS
HANSFHHRMANN HOPPH
In his celebrated article "Toward a Reconstruction of Utility and Welfare
Eco-
nomiesj' Murray Rothbard vvrote that
[i]ndifferenCe Can never be demonstrated by action. Quite the Contrary.
Every action necessarily signifies a Chojcev and every Choice signifies a
definite preference. Action specifically implies the Contrary of indiffer-
ence|^~R.If a person is really indifferent between tWo alternativesv then he
cannot and Will not Choose between them. Indifference is therefore never
relevant for action and cannot be demonstrated in action. (Rothbard 19977
pp. 225-26)
This seems to be undeniable7 and any attempt to explain vvhy one ehoses
to do X rather than y With reference to indifference rather than preference
strikes one as a logical absurdity a "category mistake.?' Indeed? it seems
to be
a truth similar to the truth that no "constants' can ever be used to
explain a
"variable?' and Why any attempt to explain a variable outcome With
reference
to some constant conditions is likewise absurd.
Nonetheless7 Rothbard and Mises have been criticized by Noziek (1977)
and Caplan (l999)7 for inconsistency in admitting the concept of indifference
into economic analysis after all? even if only indirectly These criticisms have
been ansvvered by Block (19807 1999) and Hulsmann (1999). However? their
ansvvers, although largely correct? seem to bring less than full clarity to the
matter. Setting out from Noziekss eritieism7 I hope to remedy this deficiency
here.
As correctly noted by Block (1980, pp. 423-Z5)7 aside from some rather
confused and easily disposed of remarks? Noziek has but one challenging crit-
icism of Rothbardes and Misesss verdict on indifference. He argues that their
HANS-HERMANN HOPPE is a professor of economies at the University of Nevada at
Las Vegas.
THE QUARTERLY JOURNAL or AUSTRIAN ECONOMICS vor. 8, No. 4 (WINTER zoos); 87-91
87
|
|
Authored by: Anonymous on Monday, December 11 2006 @ 10:58 PM EST |
Hello PJ,
There is no doubt trivial patents are bad (eg one click).
However developing more reliable OCR algorithms is a highly complex task. Surely
the people who did all that research and testing deserve compensation?
Also patents for GIF / JPEG / MPEG etc compression and SSL encryption were
entirely justified. People developed very clever solutions to a problem and I
think they deserve the right to compensation, via patents.
I think that some software patents are good, as long as they are for an
invention which is a breakthrough and not even remotely trivial.
PJ it is a bit rude to complain that people spent a large amount of time
developing OCR algorithms and then expect them to hand it over for nothing.
|
|
Authored by: juliac on Tuesday, December 12 2006 @ 07:47 AM EST |
Tesseract is good to know about, but DocMorph is even better. Just upload the pdf (or a file in any of more than 50 different formats) and a few seconds later, download the text file. The OCR is superb, even with a marginal pdf, and it knows 17 different languages.
---
Have you contributed to Groklaw lately?
|
|
Authored by: Anonymous on Tuesday, December 12 2006 @ 12:12 PM EST |
I tried to compile tesseract on: PowerPC Linux, PowerPC OSX, x86_64 Linux, and
x86 Linux. The only success was on x86 Linux.
To compile on Fedora Core 6 x86_64, configure like this:
CC='gcc -m32' CXX='g++ -m32' ./configure
You get a 32-bit executable, but that doesn't really matter much.
I also had to comment out some duplicate declarations in ccmain/tfacep.h and
ccutil/getopt.h
I looked at the code, there's no way tesseract is going to run 64-bit without a
lot of work.
|
|
Authored by: Anonymous on Tuesday, December 12 2006 @ 12:20 PM EST |
"ln -s ccmain/tesseract
The last stage is necessary to be able to run tesseract directly from where it has been compiled. As of the time of writing, the authors recommend that you do not run "make install"."
I tried installing tesseract on SuSE10. When I try to run it, it briefly pops up a black window with some msg and then disappears. The download site is slightly lacking in documentation.