GROKLAW
When you want to know more...
Trying to script this is frustrating | 200 comments
pdftotext is your friend
Authored by: NobodyYouKnow on Monday, May 28 2012 @ 09:44 AM EDT
Yes. That does work better.

If you pipe the output of pdftotext -layout through this awk script:

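# Lines indented by 8 or more spaces: print unchanged.
/^        / { print ; next }
# Lines with a 3-column line number plus 5 spaces: drop the first 8 characters.
/^[0-9 ][0-9 ][0-9 ]     / { print substr( $0, 9 ) ; next }
# Everything else: print unchanged.
{ print }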

you get a plain text version that needs very little massaging. The awk script strips off the line numbers of the numbered lines. Well, it works for number 397 anyhow.

There are 8 spaces in the first pattern and 5 spaces after the last ] in the second.
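
For anyone who would rather do the same thing in Python, here is a rough equivalent of the awk rules above. It is only a sketch: the function names are made up for illustration, and it assumes the same 8-column gutter (three digit-or-space columns followed by five spaces) that the awk patterns assume.

```python
import re

# Three columns of digits or spaces, then five spaces: the 8-column number gutter.
NUMBERED = re.compile(r"^[0-9 ]{3} {5}")


def strip_line_number(line: str) -> str:
    """Apply the three awk rules to one line (without its trailing newline)."""
    if line.startswith(" " * 8):
        return line        # rule 1: indented text, leave untouched
    if NUMBERED.match(line):
        return line[8:]    # rule 2: same effect as substr($0, 9)
    return line            # rule 3: everything else


def strip_all(lines):
    """Run strip_line_number over an iterable of lines."""
    return [strip_line_number(line.rstrip("\n")) for line in lines]
```

To use it as a filter, something like `print("\n".join(strip_all(sys.stdin)))` (after `import sys`) should behave like piping through the awk script.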


Trying to script this is frustrating
Authored by: bugstomper on Tuesday, May 29 2012 @ 08:35 PM EDT
I thought I had a script with some good sed recipes to convert pdftotext
-layout output into good HTML, but I'm finding that every other page has
something different in the formatting. For example, some put the page numbers
at the bottom and some at the top. Some have centered headings in the middle
introducing something like "Testimony of So-and-So". Others all of a
sudden start a series of questions and answers with lines beginning with
"Q. " and "A. ".

I've been trying to get one script that can handle all of it and then run
through all of them at once. I am getting closer, though.
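
One way to cope with pages that each do something different is to label every line first and only then decide how to mark it up. The following is a hedged sketch of that idea in Python, not the script described above: the `classify` function, its labels, and the `min_indent` threshold of 20 columns for "centered" headings are all assumptions that would need tuning against the real pdftotext -layout output.

```python
import re

PAGE_NUMBER = re.compile(r"^\s*\d+\s*$")        # a line holding only a number
QA_LINE = re.compile(r"^\s*([QA])\.\s+(.*)$")   # lines starting "Q. " or "A. "


def classify(line: str, min_indent: int = 20) -> str:
    """Return a rough label for one line of pdftotext -layout output."""
    if not line.strip():
        return "blank"
    if PAGE_NUMBER.match(line):
        return "page_number"      # works whether it sits at the top or bottom
    m = QA_LINE.match(line)
    if m:
        return "question" if m.group(1) == "Q" else "answer"
    indent = len(line) - len(line.lstrip(" "))
    if indent >= min_indent:
        return "heading"          # heavily indented, so probably centered
    return "text"
```

With labels in hand, a later pass can drop the page_number lines, wrap heading lines in a heading tag, and so on, without each rule having to know where on the page the line appeared.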


Groklaw © Copyright 2003-2013 Pamela Jones.
All trademarks and copyrights on this page are owned by their respective owners.
Comments are owned by the individual posters.

PJ's articles are licensed under a Creative Commons License. ( Details )