html_scrub -- An HTML Editing Utility for Groklaw by Scott McKellar
Monday, June 07 2004 @ 03:26 AM EDT
Scott McKellar decided to take pity on me and write a command line HTML cleaning utility for me. As many of you know, Geeklog, the underlying software Groklaw uses, chokes on certain HTML. When volunteers send me documents they have turned into HTML from text, using certain automatic HTML utilities, I sometimes end up spending hours cleaning out the tags Geeklog doesn't like. It's like picking fleas out of your dog's coat. It takes a long time, it's no fun, and sometimes you miss things. This is particularly a problem when volunteers use web authoring tools in Windows. I've struggled with the problem for some time, so Scott decided to try to do something about it. He wrote a utility for me called html_scrub that does that cleaning chore for me, and it's licensed under the GPL, so everyone can use it. He has it up on Freshmeat today. His personal page is here. If anyone wants to write a GUI for it, I'd love it. Then volunteers could pre-clean. I wanted to let you know about it, so you can try it out if you'd like to. Don't sue me or him if your house falls down or your hair turns purple when you try it. There are always bugs in new software, so be sure to let him know if you find any. Alan Canon already found a JavaScript bug, but today's release fixes it, and Scott says html_scrub is ready to be taken for a spin.
Scott explains his html_scrub: "When people contribute HTML documents to Groklaw, PJ (or one of her lieutenants) has to edit them to make sure that they don't include certain kinds of HTML that create problems for Geeklog. I wrote a command line utility called html_scrub to automate this task. Depending on what you tell it in a configuration file, html_scrub can eliminate unwanted HTML tags or certain attributes within specified HTML tags -- or, if you prefer, it can just warn you about them so that you can screen them manually. For more information, see the html_scrub web page.
"This utility is available under the GPL in the form of C source code and a simple Makefile. You should be able to compile it on any Linux or Unix-like system, possibly after a little tweaking of the Makefile. If you're on a Windows box without a compiler, you can download a Unix-like environment from Cygwin and compile from there with gcc. If you're not that ambitious, I can provide an executable for you to run within a DOS session.
"I haven't given html_scrub a thorough workout yet, so bug reports are welcome. I have tried to make the code modular enough that others may extend or reuse it. For example, it would be nice to have a GUI version so that the user doesn't have to work from a command prompt. I don't have the necessary skills to do that myself."
It feels really nice to have software written for you, I must say. I am starting to get what folks mean about writing software to scratch an itch. It must be very nice to be able to write whatever you want to do what you need done. So, thank you very much, Scott. And if you want to know what tags I can use in Geeklog, here is the list, with brackets removed, because if I leave them in, Geeklog will have a nervous breakdown trying to figure out what to do: p, blockquote, b, i, u, strike, a, em, strong, br, tt, hr, li, ol, ul, code, pre, font, div, span, table, tr, th, td, font color="" . . . /font. If anyone wants to know what I'd really find useful, it'd be a way to hit a key and get "[p] [blockquote] [i]" and then hit another key and get "[/i] [blockquote] [p]" with the angle brackets in there, of course, instead of []. I used those to trick Geeklog. That one thing would make my life better. If you look at the source of any article, you'll see why I need that, if it's possible.
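To make the screening idea concrete, here is a minimal Python sketch that checks tags against the list above. It is not html_scrub (which is written in C and driven by a configuration file); `ALLOWED_TAGS`, `TagFilter`, and `scrub` are names invented for this illustration, and the sketch looks only at tag names, not at attributes.

```python
from html.parser import HTMLParser

# Tags Geeklog accepts, per the list above.
ALLOWED_TAGS = {
    "p", "blockquote", "b", "i", "u", "strike", "a", "em", "strong", "br",
    "tt", "hr", "li", "ol", "ul", "code", "pre", "font", "div", "span",
    "table", "tr", "th", "td",
}


class TagFilter(HTMLParser):
    """Hypothetical filter: keep allowed tags, drop the rest, keep all text."""

    def __init__(self):
        # Leave entities such as &amp; untouched so they pass through as written.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.dropped = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            # get_starttag_text() returns the tag exactly as it was written.
            self.out.append(self.get_starttag_text())
        else:
            self.dropped.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: emit it once, with no extra end tag.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def scrub(html_text):
    """Return (cleaned_html, list_of_dropped_tag_names)."""
    parser = TagFilter()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out), parser.dropped
```

The second return value is there so a caller could choose to warn about unwanted tags rather than silently remove them, which is one of the modes Scott describes. Note that this sketch keeps the text inside a dropped tag and discards comments, so it would need more work before anyone trusted it with a script block.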
Authored by: seanlynch on Monday, June 07 2004 @ 02:36 PM EDT
Should the second [blockquote][p] in the last paragraph be [/blockquote][/p]?
Seán
Authored by: MarkusQ on Monday, June 07 2004 @ 02:46 PM EDT
Here's an opportunity to scratch your own itch. Depending on what you use
to edit the text, you might want to try one or more of the following:
- If there is an "auto correction" feature for fixing spelling, try telling it
that the "correct" spelling of "{{" is really "<p><blockquote><i>" and likewise
that "}}" should really be "</i></blockquote></p>".
- Just put in the "{{" and "}}" and (perhaps with a macro or a script in
something like sed, ruby, perl, awk, etc.) globally find and replace them with
the HTML goo.
- If your editor supports macros, try making one to insert the HTML at the
current location (in many editors this can be done quite easily by "recording"
the macro) and then bind/assign the macros to the keystroke(s) of your
choice--e.g. Alt-{ and Alt-}.
-- MarkusQ
P.S. A word of warning though, learning to scratch
your own itch can be very addictive. If you find yourself wanting to
rewrite emacs so it works the way you think it ought to, you've gone too far!
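The find-and-replace approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `{{` and `}}` placeholders and the names `REPLACEMENTS` and `expand_placeholders` are made up for the example, and the tags follow the order PJ asked for in the article, closed in reverse.

```python
# Hypothetical placeholder pairs; pick strings that never occur in real text.
REPLACEMENTS = {
    "{{": "<p><blockquote><i>",
    "}}": "</i></blockquote></p>",
}


def expand_placeholders(text, replacements=REPLACEMENTS):
    """Replace every placeholder in text with its HTML expansion."""
    for placeholder, html in replacements.items():
        text = text.replace(placeholder, html)
    return text


draft = "He said: {{It depends.}} And that was that."
print(expand_placeholders(draft))
```

Plain string replacement is enough here because the placeholders are fixed strings; a regular expression would only be needed if the markers could vary.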
Authored by: Anonymous on Monday, June 07 2004 @ 02:50 PM EDT
If you're using vi for text/html editing, you can use the abbreviation function,
which lets you assign an arbitrary string (such as you've shown above) to a
single key.
I often assign the '=' or '-' key for abbreviations, since I rarely use them in
normal text. In fact, you can assign an abbreviation to another abbreviation
such that, say, if the '=' key is assigned to give "hello world\n" and
the '-' key is assigned to "=====" then you'll get five copies of
"hello world\n" entered with a single press of the '-' key. There are
other neat capabilities in vi, also.
Larry N.
Authored by: Anonymous on Monday, June 07 2004 @ 02:51 PM EDT
A million thanks to you and other open source developers. I really appreciate
all the work and effort from you and the open source developers. One of these
days, I will contribute open source code or services to pay you back.
Authored by: Anonymous on Monday, June 07 2004 @ 02:54 PM EDT
C'mon, why do you need a GUI for a simple single-purpose command-line program?
Authored by: pb on Monday, June 07 2004 @ 02:57 PM EDT
There's no shortage of HTML cleaning utilities out there.
HTML Tidy takes care of most of my HTML parsing/scrubbing needs, and I just
can't recommend it highly enough. I've also used DeCSS (no, not the one for DVD
encryption keys...) to get rid of extraneous CSS, and I've written something
like an HTML parser in PHP that can also filter tags like html_scrub does.
Authored by: koa on Monday, June 07 2004 @ 02:59 PM EDT
There is also a utility that re-formats html code so that it is easier to read
as well:
http://www.digital-mines.com/htb/
---
...move along...nothing to see here...
Authored by: perpetual_newbie on Monday, June 07 2004 @ 03:00 PM EDT
Careful now.
Since anything that touches AIX/Dynix code is part of System V, and anything
that touches THAT code is part of System V, then we have a problem.
JFS touched System V from AIX, so it belongs to SCO.
The GPL has touched JFS (under the release), so it belongs to SCO.
This code was written under the GPL, so it belongs to SCO.
If you use it on your computer, your computer belongs to SCO.
Since your computer is in your house, your house now belongs to SCO.
**continue ad nauseam**
Authored by: Anonymous on Monday, June 07 2004 @ 03:06 PM EDT
I could probably knock up a GUI for KDE if people would
like one.
Authored by: Anonymous on Monday, June 07 2004 @ 03:09 PM EDT
There is a copy of the W3C HTML Tidy script with a slightly modified Python web
interface here.
It's not designed to be user configurable, but for those who can't or
won't install HTML Tidy locally, it does the trick.
rgds
Franki
Authored by: AIB on Monday, June 07 2004 @ 03:29 PM EDT
.
Authored by: Anonymous on Monday, June 07 2004 @ 03:29 PM EDT
Here is a nifty tool for converting a <textarea> into an open source
JavaScript WYSIWYG editor: HTMLArea. It works like a champ and it is very object
oriented, so it is very easy to disable features you do not wish to include. It
is also very easy to set up and configure. Lastly, it handles cut and paste very
well, even from OpenOffice and M$ Word documents.
-pyguy
Authored by: Anonymous on Monday, June 07 2004 @ 03:30 PM EDT
Shouldn't [p][blockquote][i]...[/i][/blockquote][/p] just be
[blockquote]...[/blockquote] or [p class=quote]...[/p] and then the layout done
with a style command a la: [style]blockquote, p.quote {margin:0 0 20px 0;
font-style:italic; }[/style]?
Authored by: rsmith on Monday, June 07 2004 @ 03:30 PM EDT
PJ,
For editing HTML there are a couple of editors you can try.
For newbies something like bluefish might be most appropriate.
If you're a touch-typist and are willing to learn keyboard shortcuts, give emacs
with HTML or XML mode a try.
---
Never ascribe to malice that which is adequately explained by incompetence.
Authored by: koa on Monday, June 07 2004 @ 03:30 PM EDT
And yet another tool for this purpose:
http://www.w3.org/People/Raggett/tidy/
---
...move along...nothing to see here...
Authored by: Anonymous on Monday, June 07 2004 @ 03:35 PM EDT
I'm not sure if you have appropriate access, but you really ought to be able to
get html_scrub to run on anything you enter into Geeklog. (Of course, that does
require extra effort from your host.) That seems like a much more appropriate
place to put that kind of functionality.
Authored by: Anonymous on Monday, June 07 2004 @ 03:45 PM EDT
Actually, PHP already has this functionality built right in.
http://us3.php.net/manual/en/function.strip-tags.php
You could just hack that into geeklog and submit it as a patch.
Tim
Authored by: rben13 on Monday, June 07 2004 @ 04:04 PM EDT
You can also extend the range of tags that can be used in stories for Geeklog.
It's in the config file. I've done that so that I can use h3 and h4 tags as
well as a variety of others. I don't know if you've looked at that as a way to
ease some of the problem.
I've also found that embedded styles get mangled by Geeklog, especially if you
are trying to change colors in a span. That can be frustrating. What would be
really great is if someone at Geeklog would include the html_scrub features in
the article submission portion of Geeklog and if they'd fix the module that
trashes the embedded styles.
Authored by: Anonymous on Monday, June 07 2004 @ 04:04 PM EDT
Sounds like the Html-Tidy thing I'm using with (X)html-Kit. Great programs!
Although if I understand correctly, this will do just a little more by letting
you pick what attributes to delete and have more actions to perform?
Authored by: Anonymous on Monday, June 07 2004 @ 04:07 PM EDT
Any news on SCO's motion on the depositions?
Or from DC? Or Red Hat?
Authored by: Anonymous on Monday, June 07 2004 @ 04:28 PM EDT
There are two more important tags. They are:
&lt; gives <
and
&gt; gives >
To get the & to display you use the tag &amp;
That is: &amp; gives &
Working out how to display &amp; is left as an exercise.
The accepted tag list becomes:
<p>, <blockquote>, <b>, <i>, <u>, <strike>, <a>, <em>, <strong>, <br>, <tt>,
<hr>, <li>, <ol>, <ul>, <code>, <pre>, <font>, <div>, <span>, <table>, <tr>,
<th>, <td>, <font color="">
To be consistent I have left the <font> closing tag out of the list, as the
closing tag has been left out of all the other options.
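For anyone who would rather not type the entities by hand, Python's standard `html` module performs the same substitutions. This is only an illustration of the escaping rules described above, not something Geeklog or html_scrub provides.

```python
from html import escape, unescape

# Escaping turns markup characters into entities so they display literally.
print(escape("<p>"))          # &lt;p&gt;
print(escape("AT&T"))         # AT&amp;T

# Escaping text that is already an entity escapes its ampersand again.
print(escape("&amp;"))        # &amp;amp;

# unescape() goes the other way.
print(unescape("&lt;b&gt;"))  # <b>
```

By default `escape` also converts double and single quotes, which matters inside attribute values; pass `quote=False` to leave them alone.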
Authored by: minkwe on Monday, June 07 2004 @ 04:30 PM EDT
Have you considered HTMLArea? It's as functional as Microsoft's ActiveX controls
used in Hotmail. It provides configurable HTML editing for any HTML TEXTAREA
control.
I use it on all my PHP-based sites. It's cool, and easy to configure. It's just
a few CSS and JavaScript files. It's quite easy to use as well. I'm sure mathfox
can figure it out.
In addition, it is BSD licensed.
You can get it from:
http://www.interactivetools.com/products/htmlarea/license.html
An online demo is available at
http://dynarch.com/htmlarea/examples/spell-checker.html
---
SCO: Your honor, they are trying to confuse us with the facts!
Authored by: dmscvc123 on Monday, June 07 2004 @ 04:43 PM EDT
I wonder what AdTI is taking off their website now that they've got it password
protected...
Authored by: John on Monday, June 07 2004 @ 05:12 PM EDT
In emacs or xemacs, try Control-c followed by a key,
for instance Control-c space = &nbsp;
Control-c p = <p>
Also you can make your own codes like $$ and ## and at the end move to the head
of the file and type alt-x replace-string, $$ (the text to replace),
<p><blockquote><i> (the replacement text), enter and all $$
have been turned into the replacement text.
Then go back to the top and type alt-x replace-string, ##,
</i></blockquote></p>, and all ## have been transformed.
BTW the <p> </p> are not needed as a wrapper around
<blockquote></blockquote>.
---
JJJ
Authored by: Thomas Frayne on Monday, June 07 2004 @ 06:30 PM EDT
Now she has too many.
Would someone who knows what several of these options are suggest which would be
best for her, and how she would use it?
Authored by: Anonymous on Monday, June 07 2004 @ 06:44 PM EDT
Thanks to everyone for the feedback. I hope html_scrub turns out to be useful.
Concerning various points:
1. To archivist: I considered posting an executable that Windows users could
use, but I wasn't sure if people would be willing to download it, due to a
concern for viruses and the like. However I guess I should go ahead and do it.
If you don't trust it, you don't have to download it.
2. Also to archivist: yes, I accidentally omitted the GPL from the zip version.
3. An email from Cory Jaeger noted that, at the bottom of my web page, I used a
web URL where I meant to put an email address. Oops.
I hope to correct the above problems later this evening.
I'm not familiar with HTML Tidy or ntb, and I don't know how much they may
overlap with html_scrub, but I'm glad to hear about them. At first glance it
looks like HTML Tidy is mostly concerned with reformatting and correcting syntax
glitches. It's not obvious to me that it screens designated tags and attributes
like html_scrub, so maybe my little contribution still has a place in your
toolbox.
Finally, a minor point of attribution: the Javascript problem was reported to me
by Antti Kaihola. Alan Canon reported a different bug: html_scrub wasn't
expecting to see hyphens in tag names such as HTTP-EQUIV. That one was easier
to fix.
Scott McKellar
http://home.swbell.net/mck9/html_scrub/
Authored by: rakaz on Monday, June 07 2004 @ 07:12 PM EDT
Scott,
I haven't tested html_scrub yet, so I do not know exactly how it performs in the
situations I describe below. But since you do not mention these situations
specifically on your web page, I assume you do not handle them currently:
1. JavaScript event handlers, such as onclick, onmouseover, etc., on every
possible type of tag.
2. href attributes on 'a' tags that start with 'javascript:'.
I think it would be useful to add a configuration option to drop or allow
scripting altogether. This means that if this configuration option is set, it
would not only handle <script> tags, but also the two situations mentioned above.
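The two situations listed above can be detected with a short Python sketch. It is only an illustration of the checks being proposed, not a description of how html_scrub works or would implement them; `ScriptFinder` and `find_scripting` are hypothetical names, and the sketch reports what it finds rather than removing anything.

```python
from html.parser import HTMLParser


class ScriptFinder(HTMLParser):
    """Hypothetical checker: records scripting hooks instead of removing them."""

    def __init__(self):
        super().__init__()
        self.findings = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling this.
        if tag == "script":
            self.findings.append((tag, "script element"))
        for name, value in attrs:
            # Event handlers are attributes such as onclick or onmouseover.
            if name.startswith("on"):
                self.findings.append((tag, name))
            # A link whose target is a javascript: URL runs code when clicked.
            elif name == "href" and value is not None:
                if value.strip().lower().startswith("javascript:"):
                    self.findings.append((tag, "javascript: href"))


def find_scripting(html_text):
    """Return a list of (tag, reason) pairs for every scripting hook found."""
    finder = ScriptFinder()
    finder.feed(html_text)
    finder.close()
    return finder.findings
```

Treating every attribute that begins with "on" as an event handler is a deliberately broad rule, and the `javascript:` test does not catch obfuscated forms such as a URL with embedded line breaks, so a real filter would need to be stricter.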
Authored by: Anonymous on Monday, June 07 2004 @ 07:23 PM EDT
In order to help we really need to know what editor you use. Someone will know
how to tweak it.
Authored by: bbaston on Monday, June 07 2004 @ 07:34 PM EDT
Alpha of a Groklaw-specific html_scrub.cfg
--- Ben
-------------
IMBW, IANAL2, IMHO, IAVO, {;)}
imaybewrong, iamnotalawyertoo, inmyhumbleopinion, iamveryold, hairysmileyface
Authored by: Anonymous on Monday, June 07 2004 @ 08:26 PM EDT
You have zillions of devoted geek readers, and you help us all understand
interesting legal stuff. It's simply wrong for you to spend "hours"
fooling around with anything that we could fix with a few Perl scripts.
So in the future, please let us know if you find yourself doing boring
repetitive things!
Authored by: Anonymous on Tuesday, June 08 2004 @ 03:08 AM EDT
PJ. It is *NOT* your job to spend hours with a chore of cleaning the files
manually. There are thousands of hackers here eager to help you. If you told us
earlier, I would personally hack a set of macros for the Vim editor exclusively
for you. - See, I am not a C programmer. There are other guys among the readers
that could make a macro/script/hack for such a chore for *any* editor or
environment of your choice.
Please let us know next time you face *ANY*
repetitive task. I think you really should concentrate on research, writing and
organizing. Simply because you are so very good at it.
Stano
Authored by: Anonymous on Tuesday, June 08 2004 @ 04:20 AM EDT
" font color="" . . . /font."
<wince>
'Standard' HTML being that which is 'deprecated'?
Draconis
Authored by: pyrodave on Tuesday, June 08 2004 @ 10:13 AM EDT
Good job!
"If you can't do it at least 3 different ways, it isn't (Li)nix."
-- Author Unknown (Someone at HP I think?)
I would love to do this in AWK, but not many people use regular expressions
anymore so I don't wanna give us all a headache...
P.S. - Keep up the good work PJ
Authored by: OscarGunther on Tuesday, June 08 2004 @ 02:06 PM EDT
As long as we're bandying solutions about (nice job, Scott), why don't we move
PJ entirely into the new millennium and talk about using XML to code submissions?
If transcribers can be persuaded to use a standard Groklaw DTD, then a few XSLT
scripts could be used to transform them into Groklaw-ready HTML. The benefit of
this approach is that, since we deal with a relatively small set of document
types--legal filings, PJ's quote-and-commentary, and maybe comparison
tables--it's conceptually easy to define the DTD and PJ potentially would do
almost no formatting at all. Output would be generated automagically.
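As a rough picture of that pipeline, the Python sketch below maps a tiny XML submission onto Geeklog-friendly HTML. Everything specific in it is an assumption: there is no Groklaw DTD, so the element names `filing`, `title`, `para`, and `quote` are invented, and Python's `xml.etree.ElementTree` stands in for the XSLT scripts the comment proposes.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical submission format; the element names are invented for this example.
SAMPLE = """<filing>
  <title>Memorandum in Opposition</title>
  <para>Plaintiff argues as follows.</para>
  <quote>The motion should be denied.</quote>
</filing>"""


def to_html(xml_text):
    """Convert each child element of the root into one line of HTML."""
    root = ET.fromstring(xml_text)
    parts = []
    for child in root:
        # Escape the text so stray < or & cannot produce unwanted markup.
        text = escape((child.text or "").strip(), quote=False)
        if child.tag == "title":
            parts.append(f"<p><b>{text}</b></p>")
        elif child.tag == "para":
            parts.append(f"<p>{text}</p>")
        elif child.tag == "quote":
            parts.append(f"<blockquote><i>{text}</i></blockquote>")
    return "\n".join(parts)


print(to_html(SAMPLE))
```

Because the output tags are chosen by the transform rather than by the transcriber, only tags on the accepted list can ever reach Geeklog, which is the benefit the comment is pointing at.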
Authored by: Anonymous on Tuesday, June 08 2004 @ 02:49 PM EDT
I believe the GPL covers your fear of lawsuits <-: