GROKLAW
When you want to know more...
html_scrub -- An HTML Editing Utility for Groklaw by Scott McKellar
Monday, June 07 2004 @ 03:26 AM EDT

Scott McKellar decided to take pity on me and write a command line HTML cleaning utility for me. As many of you know, Geeklog, the underlying software Groklaw uses, chokes on certain HTML. When volunteers send me documents they have turned into HTML from text, using certain automatic HTML utilities, I end up spending hours sometimes cleaning out the tags Geeklog doesn't like. It's like picking fleas out of your dog's coat. It takes a long time, it's no fun, and sometimes you miss things.

This is particularly a problem when volunteers use web authoring tools in Windows. I've struggled with the problem for some time, so Scott decided to try to do something about it. He wrote a utility for me called html_scrub that does that cleaning chore for me, and it's licensed under the GPL, so everyone can use it.

He has it up on Freshmeat today. His personal page is here. If anyone wants to write a GUI for it, I'd love it. Then volunteers could pre-clean. I wanted to let you know about it, so you can try it out if you'd like to. Don't sue me or him if your house falls down or your hair turns purple when you try it. There are always bugs in new software, so be sure to let him know if you find any. Alan Canon already found a javascript bug, but today's release fixes it, and Scott says html_scrub is ready to be taken for a spin.

Scott explains his html_scrub:

"When people contribute HTML documents to Groklaw, PJ (or one of her lieutenants) has to edit them to make sure that they don't include certain kinds of HTML that create problems for GeekLog. I wrote a command line utility called html_scrub to automate this task. Depending on what you tell it in a configuration file, html_scrub can eliminate unwanted HTML tags or certain attributes within specified HTML tags -- or, if you prefer, it can just warn you about them so that you can screen them manually. For more information, see the html_scrub web page.

"This utility is available under the GPL in the form of C source code and a simple Makefile. You should be able to compile it on any Linux or Unix-like system, possibly after a little tweaking of the Makefile. If you're on a Windows box without a compiler, you can download a Unix-like environment from Cygwin and compile from there with gcc. If you're not that ambitious, I can provide an executable for you to run within a DOS session.

"I haven't given html_scrub a thorough workout yet, so bug reports are welcome. I have tried to make the code modular enough that others may extend or reuse it. For example, it would be nice to have a GUI version so that the user doesn't have to work from a command prompt. I don't have the necessary skills to do that myself."

It feels really nice to have software written for you, I must say. I am starting to get what folks mean about writing software to scratch an itch. It must be very nice to be able to write whatever you want to do what you need done. So, thank you very much, Scott. And if you want to know what tags I can use in Geeklog, here is the list, with brackets removed, because if I leave them in, Geeklog will have a nervous breakdown trying to figure out what to do: p, blockquote, b, i, u, strike, a, em, strong, br, tt, hr, li, ol, ul, code, pre, font, div, span, table, tr, th, td, font color="" . . . /font.

If anyone wants to know what I'd really find useful, it'd be a way to hit a key and get "[p] [blockquote] [i]" and then hit another key and get "[/i] [blockquote] [p]" with the brackets in there, of course, instead of []. I used those to trick Geeklog. That one thing would make my life better. If you look at the source of any article, you'll see why I need that, if it's possible.


  


html_scrub -- An HTML Editing Utility for Groklaw by Scott McKellar | 116 comments
Errors and corrections here please
Authored by: seanlynch on Monday, June 07 2004 @ 02:36 PM EDT
Should the second [blockquote][p] in the last paragraph be [/blockquote][/p]?

Seán


block quoting
Authored by: MarkusQ on Monday, June 07 2004 @ 02:46 PM EDT

Here's an opportunity to scratch your own itch. Depending on what you use to edit the text, you might want to try one or more of the following:
  • If there is an "auto correction" feature for fixing spelling, try telling it that the "correct" spelling of "{{" is really "<p><blockquote><i>" and likewise that "}}" should really be "</i></blockquote></p>".

  • Just put in the "{{" & "}}" and (perhaps with a macro or a script in something like sed, ruby, perl, awk, etc.) globally find and replace them with the html goo.

  • If your editor supports macros, try making one to insert the html at the current location (in many editors this can be done quite easily by "recording" the macro) and then bind/assign the macros to the keystroke(s) of your choice--e.g. Alt-{ and Alt-}.
-- MarkusQ

P.S. A word of warning though, learning to scratch your own itch can be very addictive. If you find yourself wanting to rewrite emacs so it works the way you think it ought to, you've gone too far!
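
The global find-and-replace idea in the second bullet can be sketched in a few lines of Python. This is only an illustration, not anything MarkusQ or PJ posted: the function name and the constants are made up here, and the replacement strings assume the [p] [blockquote] [i] wrapper that PJ asks for in the article, closed in the reverse order.

```python
# Hypothetical names; the wrapper strings are an assumption based on
# the markup PJ describes in the article.
OPEN_HTML = "<p><blockquote><i>"
CLOSE_HTML = "</i></blockquote></p>"


def expand_quote_markers(text: str) -> str:
    """Replace every '{{' and '}}' shorthand with the HTML wrapper."""
    return text.replace("{{", OPEN_HTML).replace("}}", CLOSE_HTML)


if __name__ == "__main__":
    draft = "SCO said: {{We own everything.}} PJ disagrees."
    print(expand_quote_markers(draft))
```

The same two replacements could just as well be done with sed or an editor macro; the point is only that the markers are typed once and expanded in a single pass before the text goes to Geeklog.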


Quick Keystrokes
Authored by: Anonymous on Monday, June 07 2004 @ 02:50 PM EDT
If you're using vi for text/html editing, you can use the abbreviation function,
which lets you assign an arbitrary string (such as you've shown above) to a
single key.

I often assign the '=' or '-' key for abbreviations, since I rarely use them in
normal text. In fact, you can assign an abbreviation to another abbreviation
such that, say, if the '=' key is assigned to give "hello world\n" and
the '-' key is assigned to "=====" then you'll get five copies of
"hello world\n" entered with a single press of the '-' key. There are
other neat capabilities in vi, also.

Larry N.

html_scrub -- An HTML Editing Utility for Groklaw by Scott McKellar
Authored by: Anonymous on Monday, June 07 2004 @ 02:51 PM EDT
A million thanks to you and other open source developers. I really appreciate
all the work and effort from you and the open source developers. One of these
days, I will contribute open source code or services to pay you back.


html_scrub -- An HTML Editing Utility for Groklaw by Scott McKellar
Authored by: Anonymous on Monday, June 07 2004 @ 02:54 PM EDT

C'mon, why do you need a GUI for a simple single-purpose command-line program?


html_scrub -- An HTML Editing Utility for Groklaw by Scott McKellar
Authored by: pb on Monday, June 07 2004 @ 02:57 PM EDT
There's no shortage of HTML cleaning utilities out there. HTML Tidy takes care of most of my html parsing/scrubbing needs, and I just can't recommend it highly enough.

I've also used DeCSS (no, not the one for DVD encryption keys...) to get rid of extraneous CSS, and I've written something like an HTML parser in PHP that can also filter tags like html_scrub does.


html_scrub -- An HTML Editing Utility for Groklaw by Scott McKellar
Authored by: koa on Monday, June 07 2004 @ 02:59 PM EDT
There is also a utility that re-formats html code so that it is easier to read
as well:

http://www.digital-mines.com/htb/

---
...move along...nothing to see here...


html_scrub -- An HTML Editing Utility for Groklaw by Scott McKellar
Authored by: perpetual_newbie on Monday, June 07 2004 @ 03:00 PM EDT
Careful now.

Since anything that touches AIX/Dynix code is part of System V, and anything
that touches THAT code is part of System V, then we have a problem.

JFS touched System V from AIX, so it belongs to SCO.
The GPL has touched JFS (under the release), so it belongs to SCO.
This code was written under the GPL, so it belongs to SCO.
If you use it on your computer, your computer belongs to SCO.
Since your computer is in your house, your house now belongs to SCO.

**continue ad nauseam**


html_scrub -- An HTML Editing Utility for Groklaw by Scott McKellar
Authored by: Anonymous on Monday, June 07 2004 @ 03:06 PM EDT
I could probably knock up a GUI for KDE if people would
like one.


Online HTML Tidy tool.
Authored by: Anonymous on Monday, June 07 2004 @ 03:09 PM EDT
There is a copy of the w3c HTML Tidy script with a slightly modified python web interface here. It's not designed to be user configurable, but for those who can't or won't install HTML Tidy locally, it does the trick. rgds Franki


OTs here...
Authored by: AIB on Monday, June 07 2004 @ 03:29 PM EDT
.


HTMLArea solves many problems.
Authored by: Anonymous on Monday, June 07 2004 @ 03:29 PM EDT

Here is a nifty tool for converting a <textarea> into an open source JavaScript WYSIWYG editor: HTMLArea
It works like a champ, and it is very object oriented, so it is very easy to disable features you do not wish to include. It is also very easy to set up and configure. Lastly, it handles cut and paste very well, even from OpenOffice and M$ Word documents.

-pyguy


html_scrub -- An HTML Editing Utility for Groklaw by Scott McKellar
Authored by: Anonymous on Monday, June 07 2004 @ 03:30 PM EDT
Shouldn't [p][blockquote][i]...[/i][/blockquote][/p] just be
[blockquote]...[/blockquote] or [p class=quote]...[/p], with the layout done
by a style rule along the lines of: [style]blockquote, p.quote {margin:0 0 20px 0;
font-style:italic; }[/style]?


Editing HTML
Authored by: rsmith on Monday, June 07 2004 @ 03:30 PM EDT
PJ,

For editing HTML there are a couple of editors you can try.

For newbies something like bluefish might be most appropriate.

If you're a touch-typist and are willing to learn keyboard shortcuts, give emacs
with HTML or XML mode a try.

---
Never ascribe to malice that which is adequately explained by incompetence.


Another
Authored by: koa on Monday, June 07 2004 @ 03:30 PM EDT
And yet another tool for this purpose:

http://www.w3.org/People/Raggett/tidy/

---
...move along...nothing to see here...


Are you sure a GUI is what you want?
Authored by: Anonymous on Monday, June 07 2004 @ 03:35 PM EDT
I'm not sure if you have appropriate access, but you really ought to be able to
get html_scrub to run on anything you enter into Geeklog. (Of course, that does
require extra effort from your host.) That seems like a much more appropriate
place to put that kind of functionality.


PHP already does this.
Authored by: Anonymous on Monday, June 07 2004 @ 03:45 PM EDT
Actually, PHP already has this functionality built right in.

http://us3.php.net/manual/en/function.strip-tags.php

You could just hack that into geeklog and submit it as a patch.

Tim
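
PHP's strip_tags accepts a list of tags to keep. For readers who don't use PHP, here is a rough Python analogue of that allow-list idea, built on the standard library's html.parser. It is a sketch only: it is not Geeklog's or html_scrub's code, the class and function names are invented for illustration, and the allowed set is simply the tag list PJ gives in the article.

```python
from html.parser import HTMLParser

# The tags PJ lists as acceptable to Geeklog (taken from the article).
ALLOWED_TAGS = {
    "p", "blockquote", "b", "i", "u", "strike", "a", "em", "strong",
    "br", "tt", "hr", "li", "ol", "ul", "code", "pre", "font", "div",
    "span", "table", "tr", "th", "td",
}


class TagFilter(HTMLParser):
    """Hypothetical filter: keeps allowed tags and all text, drops other tags."""

    def __init__(self, allowed):
        # convert_charrefs=False so entities such as &lt; pass through unchanged.
        super().__init__(convert_charrefs=False)
        self.allowed = allowed
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.allowed:
            # Re-emit the tag exactly as written, attributes included.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if tag in self.allowed:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.allowed:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def strip_disallowed(html_text, allowed=ALLOWED_TAGS):
    """Return html_text with every tag outside the allow-list removed."""
    parser = TagFilter(allowed)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(strip_disallowed("<p>Hello <marquee>big</marquee> <b>world</b></p>"))
```

Note that this keeps the text inside a removed element and leaves the attributes of allowed tags untouched, so it does less than what Scott describes for html_scrub, which can also screen individual attributes.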


html_scrub -- An HTML Editing Utility for Groklaw by Scott McKellar
Authored by: rben13 on Monday, June 07 2004 @ 04:04 PM EDT
You can also extend the range of tags that can be used in stories for geeklog.
It's in the config file. I've done that so that I can use h3 and h4 tags as
well as a variety of others. I don't know if you looked at that to help some of
the problem.

I've also found that embedded styles get mangled by geeklog, especially if you
are trying to change colors in a span. That can be frustrating. What would be
really great is if someone at geeklog would include the html_scrub features in
the article submission portion of geeklog and if they'd fix the module that
trashes the embedded styles.


html_scrub -- An HTML Editing Utility for Groklaw by Scott McKellar
Authored by: Anonymous on Monday, June 07 2004 @ 04:04 PM EDT
Sounds like the Html-Tidy thing I'm using with (X)html-Kit. Great programs!
Although if I understand correctly, this will do just a little more by letting
you pick what attributes to delete and have more actions to perform?


OT: any news?
Authored by: Anonymous on Monday, June 07 2004 @ 04:07 PM EDT
Any news on SCO's motion on the depositions?

Or from DC? Or Red Hat?


The tag list with brackets in html
Authored by: Anonymous on Monday, June 07 2004 @ 04:28 PM EDT

There are two more important tags (strictly speaking, character entities). They are:

&lt; gives <
and
&gt; gives >
To get the & to display you use the entity &amp;
That is: &amp; gives &
Working out how to display &amp; is left as an exercise.

The accepted tag list becomes: <p>, <blockquote>, <b>, <i>, <u>, <strike>, <a>, <em>, <strong>, <br>, <tt>, <hr>, <li>, <ol>, <ul>, <code>, <pre>, <font>, <div>, <span>, <table>, <tr>, <th>, <td>, <font color="">

To be consistent I have left the <font> closing tag out of the list, as the closing tag has been left out of all the other options.
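
The escaping rules at the top of this comment, including the exercise, can be checked with Python's standard html.escape. The snippet is just an illustration of the rule (each &, < and > is replaced by its entity), not something from the original comment.

```python
from html import escape

# Each special character is replaced by its entity.
print(escape("<p>"))     # prints &lt;p&gt;
print(escape("a & b"))   # prints a &amp; b

# The exercise: to make the literal text "&amp;" show up on a page,
# the ampersand itself has to be escaped once more.
print(escape("&amp;"))   # prints &amp;amp;
```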


html_scrub -- An HTML Editing Utility for Groklaw by Scott McKellar
Authored by: minkwe on Monday, June 07 2004 @ 04:30 PM EDT
Have you considered HTMLArea? It's as functional as Microsoft's ActiveX controls used in Hotmail. It provides configurable HTML editing for any HTML TEXTAREA control.

I use it on all my PHP-based sites. It's cool, and easy to configure. It's just a few CSS and JavaScript files. It's quite easy to use as well. I'm sure mathfox can figure it out.

In addition, it is BSD licensed. You can get it from: http://www.interactivetools.com/products/htmlarea/license.html

An online demo is available at http://dynarch.com/htmlarea/examples/spell-checker.html

---
SCO: Your honor, they are trying to confuse us with the facts!


Speaking Of Scrubbing HTML...
Authored by: dmscvc123 on Monday, June 07 2004 @ 04:43 PM EDT
I wonder what AdTI is taking off their website now that they've got it password
protected...


html_scrub -- An HTML Editing Utility for Groklaw by Scott McKellar
Authored by: John on Monday, June 07 2004 @ 05:12 PM EDT
In emacs or xemacs, try Control-C followed by a key

for instance Control-c space = &nbsp;
Control-c p = <p>

Also you can make your own codes like $$ and ## and at the end move to the head
of the file and type alt-x replace-string, $$ (the text to replace),
<p><blockquote><i> (the replacement text), enter and all $$
have been turned into the replacement text.

Then go back to the top and type alt-x replace-string, ##,
</i></blockquote></p>, and all ## have been transformed.

BTW the <p> </p> are not needed as wrap around for
<blockquote></blockquote>.


---
JJJ


PJ started with no easy way to do what she wanted
Authored by: Thomas Frayne on Monday, June 07 2004 @ 06:30 PM EDT
Now she has too many.

Would someone who knows what several of these options are suggest which would be
best for her, and how she would use it?


html_scrub -- the author responds
Authored by: Anonymous on Monday, June 07 2004 @ 06:44 PM EDT
Thanks to everyone for the feedback. I hope html_scrub turns out to be useful.
Concerning various points:

1. To archivist: I considered posting an executable that Windows users could
use, but I wasn't sure if people would be willing to download it, due to a
concern for viruses and the like. However I guess I should go ahead and do it.
If you don't trust it, you don't have to download it.

2. Also to archivist: yes, I accidentally omitted the GPL from the zip version.

3. An email from Cory Jaeger noted that, at the bottom of my web page, I used a
web URL where I meant to put an email address. Oops.

I hope to correct the above problems later this evening.

I'm not familiar with HTML Tidy or htb, and I don't know how much they may
overlap with html_scrub, but I'm glad to hear about them. At first glance it
looks like HTML Tidy is mostly concerned with reformatting and correcting syntax
glitches. It's not obvious to me that it screens designated tags and attributes
like html_scrub, so maybe my little contribution still has a place in your
toolbox.

Finally, a minor point of attribution: the Javascript problem was reported to me
by Antti Kaihola. Alan Canon reported a different bug: html_scrub wasn't
expecting to see hyphens in tag names such as HTTP-EQUIV. That one was easier
to fix.

Scott McKellar
http://home.swbell.net/mck9/html_scrub/


JavaScript filtering...
Authored by: rakaz on Monday, June 07 2004 @ 07:12 PM EDT
Scott,

I haven't tested html_scrub yet, so I do not know exactly how it performs in the
situations I describe below. But since you do not mention these situations
specifically on your webpage, I assume you do not handle them currently:

1. JavaScript event handlers, such as onclick, onmouseover, etc. on every
possible type of tag.
2. href attributes on 'a' tags that start with 'javascript:'.

I think it would be useful to add a configuration option to drop or allow
scripting altogether. This means if this configuration option is set, it would
not only handle <script> tags, but also the two situations mentioned above.
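
To make the three cases concrete, here is a small Python sketch that only reports them, in the spirit of html_scrub's "just warn" mode. It is not html_scrub's code and says nothing about how Scott would implement the option; the class name, function name and report format are all invented for illustration.

```python
from html.parser import HTMLParser


class ScriptingFinder(HTMLParser):
    """Hypothetical checker: records script tags, on* attributes
    and javascript: links found in a document."""

    def __init__(self):
        super().__init__()
        self.findings = []  # list of (tag, attribute or None, reason)

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag == "script":
            self.findings.append((tag, None, "script tag"))
        for name, value in attrs:
            if name.startswith("on"):
                # onclick, onmouseover, ... on any kind of tag
                self.findings.append((tag, name, "event handler"))
            elif (name == "href" and value
                  and value.strip().lower().startswith("javascript:")):
                self.findings.append((tag, name, "javascript: URL"))


def find_scripting(html_text):
    """Return a list of (tag, attribute, reason) tuples for html_text."""
    finder = ScriptingFinder()
    finder.feed(html_text)
    finder.close()
    return finder.findings


if __name__ == "__main__":
    sample = '<a href="javascript:alert(1)" onclick="x()">go</a>'
    for finding in find_scripting(sample):
        print(finding)
```

Treating every attribute whose name starts with "on" as an event handler is deliberately crude; a real filter would more likely work from an explicit list of attribute names.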


What editor do you use
Authored by: Anonymous on Monday, June 07 2004 @ 07:23 PM EDT
In order to help we really need to know what editor you use. Someone will know
how to tweak it.


Groklaw (geeklog) specific .cfg file
Authored by: bbaston on Monday, June 07 2004 @ 07:34 PM EDT
Alpha of a Groklaw-specific html_scrub.cfg

---
Ben
-------------
IMBW, IANAL2, IMHO, IAVO, {;)}
imaybewrong, iamnotalawyertoo, inmyhumbleopinion, iamveryold, hairysmileyface,


PJ, Let folks know!
Authored by: Anonymous on Monday, June 07 2004 @ 08:26 PM EDT
You have zillions of devoted geek readers, and you help us all understand
interesting legal stuff. It's simply wrong for you to spend "hours"
fooling around with anything that we could fix with a few Perl scripts.

So in the future, please let us know if you find yourself doing boring
repetitive things!


Next time just ask me - a groklaw reader.
Authored by: Anonymous on Tuesday, June 08 2004 @ 03:08 AM EDT
PJ. It is *NOT* your job to spend hours on the chore of cleaning the files manually. There are thousands of hackers here eager to help you. If you had told us earlier, I would personally have hacked a set of macros for the Vim editor exclusively for you. - See, I am not a C programmer. There are other guys among the readers who could make a macro/script/hack for such a chore for *any* editor or environment of your choice.

Please let us know next time you face *ANY* repetitive task. I think you really should concentrate on research, writing and organizing. Simply because you are so very good at it.

Stano


html_scrub -- An HTML Editing Utility for Groklaw by Scott McKellar
Authored by: Anonymous on Tuesday, June 08 2004 @ 04:20 AM EDT
" font color="" . . . /font."

<wince>

'Standard' HTML being that which is 'deprecated'?

Draconis


html_scrub -- An HTML Editing Utility for Groklaw by Scott McKellar
Authored by: pyrodave on Tuesday, June 08 2004 @ 10:13 AM EDT
Good job!

"If you can't do it at least 3 different ways, it isn't (Li)nix."
-- Author Unknown (Someone at HP I think?)

I would love to do this in AWK, but not many people use regular expressions
anymore so I don't wanna give us all a headache...

P.S. - Keep up the good work PJ


XSLT, anyone?
Authored by: OscarGunther on Tuesday, June 08 2004 @ 02:06 PM EDT
As long as we're bandying solutions about (nice job, Scott), why don't we move
PJ entirely into the new millennium and talk about using XML to code submissions?
If transcribers can be persuaded to use a standard Groklaw DTD, then a few XSLT
scripts could be used to transform them into Groklaw-ready HTML. The benefit of
this approach is that, since we deal with a relatively small set of document
types--legal filings, PJ's quote-and-commentary, and maybe comparison
tables--it's conceptually easy to define the DTD and PJ potentially would do
almost no formatting at all. Output would be generated automagically.
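
As a rough picture of what that pipeline might look like, here is a Python stand-in for the XSLT step, using xml.etree from the standard library rather than a real XSLT processor. Everything specific in it is an assumption: no Groklaw DTD exists in this discussion, so the element names filing, para and quote are invented, as is the function name.

```python
import xml.etree.ElementTree as ET
from html import escape


def render(xml_text: str) -> str:
    """Turn a tiny, hypothetical submission format into Geeklog-friendly HTML."""
    root = ET.fromstring(xml_text)
    parts = []
    for child in root:
        # Re-escape the text, since the XML parser has decoded entities.
        text = escape(child.text or "")
        if child.tag == "para":        # hypothetical element name
            parts.append(f"<p>{text}</p>")
        elif child.tag == "quote":     # hypothetical element name
            parts.append(f"<blockquote><i>{text}</i></blockquote>")
    return "\n".join(parts)


if __name__ == "__main__":
    submission = (
        "<filing>"
        "<para>A &amp; B</para>"
        "<quote>said so</quote>"
        "</filing>"
    )
    print(render(submission))
```

The transcriber only marks what each piece of text is; the choice of HTML tags lives in one place, so it can be kept within whatever Geeklog accepts.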


"Don't sue me or him if your house falls down or your hair turns purple when you try it. "
Authored by: Anonymous on Tuesday, June 08 2004 @ 02:49 PM EDT
I believe the GPL covers your fear of lawsuits <-:


Groklaw © Copyright 2003-2013 Pamela Jones.
All trademarks and copyrights on this page are owned by their respective owners.
Comments are owned by the individual posters.

PJ's articles are licensed under a Creative Commons License. ( Details )