May I Please Pick Your Brains? Request for Info on Software Terms

Friday, May 26 2006 @ 03:58 AM EDT
May I please pick your brains? The Open Source as Prior Art project is looking to build a dictionary (they call it a
thesaurus) of software terms to use in creating a taxonomy for use in its
electronic source code publication system. A description of the purpose of the project and the publication
process can be found
here.
Rather than build the thesaurus from scratch, OSAPA is looking for examples
of other collections of software terms. It seems everyone thought the US Patent Office had such a list, but it turns out they don't, which might just explain why they seem to have so much trouble finding prior art.
Do you know of any such collection? If so, you can post it here as a comment and I'll collect everything and send it along, or you can post it directly
to the osapa.org wiki or send it to the osapa.org mailing list at
http://lists.osdl.org/mailman/listinfo/priorart-discuss. Thank you for any help you can provide. I noticed in my research that there was an international workshop this week on mining software repositories, and I've written to the folks who sponsored that conference, hoping someone there might know. Then I realized that some of you right here on Groklaw would know, if anyone would, where to find such collections, if they exist. So I offered to ask you.
To explain further, here's the email message posted to the osapa.org mailing list that caught my eye:
Date: Tue, 23 May 2006 12:51:10 -0700
From: "Diane Peters"
Subject: [priorart-discuss] Software Thesaurus
To: "'OSS and USPTO prior art discussions'"
Message-ID:
Content-Type: text/plain; charset="us-ascii"
Hi everyone,
As some of you may recall, coming out of our February meetings in
D.C. we were hoping to receive from the USPTO a thesaurus of types that (we
understood) was accessed by patent examiners when searching for prior
art in the software patent field. Our plan had been to post the
thesaurus and encourage developers in the community to annotate and
otherwise add to its terms. The thesaurus could then serve as a
common reference point for communications between the USPTO and the
community. It could be used by developers to describe code when
electronically published; it could also be used by the USPTO to locate
that code. Such a thesaurus might have other interesting and valuable
uses as well.
It now appears that the USPTO does not have a thesaurus for the
various software technologies, so we will need to locate another or
start building a thesaurus/library/glossary ourselves.
The USPTO has offered this paper
http://www.netlib.org/utk/papers/dig-lib/main.html as a suggested
starting point, particularly the projects discussed in the first
section following the intro, and the indexing discussed thereafter.
There's also mention of IEEE standards for library data models.
We will add this job to the wiki under the "longer jobs"
section of "Things that can be done right now." If you know of other useful
starting points, pls feel free to add to the wiki in that same
location. Once we have a sense of what's been done and what may be
useful, we can either select one as a starting point or break out
what's useful from each and build from there.
Diane M. Peters, General Counsel
Open Source Development Labs, Inc.

I took a look at the paper the USPTO suggested looking at. The Netlib is a collection of mathematical software, from all I can tell. Their search page suggests using the GAMS class hierarchy or the freeWAIS-sf query syntax. I have no idea what that is, but I'm just telling you what I'm finding. The
NHSE page indicates that the project ran out of funding in 2004, and it mentions something called Repository in a Box, a toolkit developed in 1996: From 1994 - 2004, NHSE existed as a distributed collection of software, documents, data, and information of interest to the high performance and parallel computing community. The significance of the collaborative effort is evident through the many useful reports and tools generated as well as the many repositories that have been created, and are still being created, with the Repository in a Box (RIB) toolkit developed in 1996. However, continued operation of the site without funding has become impractical. Therefore, the site has been taken down. The NHSE meta-repository, which consists of metadata describing software applications and tools from the PTLib, HPC-Netlib and BenchWeb repositories combined, is still available. However, since PTLib and HPC-Netlib are no longer maintained, the metadata from those repositories are frozen in time. Only the BenchWeb content is still maintained.
The Netlib collection of mathematical software and other tools is still maintained and we recommend you visit that repository. Links to the archived NHSE repositories mentioned above can be found on that site as well.
On behalf of all of the federal agencies and institutions that helped make NHSE possible, we would like to thank all of the contributors over the years who submitted tools, links, applications, and other useful and usable material to the collection.
Many thanks.
NHSE Technical Team
nhse AT cs.utk.edu
So that is where I am in my research, and if you have other suggestions, I'd be very interested.
Authored by: Naich on Friday, May 26 2006 @ 04:05 AM EDT
This is the non-anonymous off topic thread. Please do not use anonymous
threads.
Authored by: Naich on Friday, May 26 2006 @ 04:07 AM EDT
Thank you.
Authored by: rsmith on Friday, May 26 2006 @ 04:26 AM EDT
A start might be the jargon file (framed version). The jargon file
homepage also has other interesting parts. One of my favorites is the Story of Mel.
---
Intellectual Property is an oxymoron.
Authored by: Anonymous on Friday, May 26 2006 @ 04:47 AM EDT
How about "The Free On-line Dictionary of Computing",
http://www.foldoc.org/
Authored by: Magpie on Friday, May 26 2006 @ 04:53 AM EDT
What about this? Seems to be supported by Imperial College (London
University)
http://foldoc.org/
Authored by: Anonymous on Friday, May 26 2006 @ 05:20 AM EDT
I use Wikipedia a lot.
It also uses hyperlinks wherever possible! Which means it ignores that
obnoxiously silly idea of ontologies...
Wikipedia also uses the GFDL (GNU Free Documentation License), the literary
equivalent of the GPL. There is also a thesaurus collection within Wikipedia.
- Wikipedia - Authored by: Anonymous on Friday, May 26 2006 @ 09:27 AM EDT
- Wikipedia - Authored by: Anonymous on Saturday, May 27 2006 @ 02:53 AM EDT
Authored by: Anonymous on Friday, May 26 2006 @ 05:29 AM EDT
A taxonomy is different from a dictionary, though a dictionary might be useful
source material. A taxonomy says things like "to describe this concept, always
use this word", and "the meaning of word A is entirely contained in that of word
B".
For a good example, see UK IPSV. It doesn't have nearly enough
detail in the computing area, but it might be a good place to start hanging more
detail off.
Authored by: leopardi on Friday, May 26 2006 @ 05:55 AM EDT
See the RIB home page. RIB uses IEEE Standard 1420.1, Basic Interoperability Data Model (BIDM), and adds NHSE extensions.
Unfortunately, the data model is just that, and is neither a taxonomy nor a glossary.
The ACM Taxonomy or the ACM Computing Classification System may be more useful.
Authored by: pogson on Friday, May 26 2006 @ 06:15 AM EDT
The world built Wikipedia.org and it contains almost any term in which we might be interested concerning prior art. As a test, I searched for
- superheterodyne
- RAID
- windows
- iefbr14
- FOCAL
- quicksort
- logarithmic and
- feedback
and obtained useful hits. If you find something not in Wikipedia, create an article or modify the appropriate article. An important feature of Wikipedia is that the USPTO or anyone may clone the database and/or add prior art as it is discovered.
---
http://www.skyweb.ca/~alicia/ , my homepage, an eclectic survey of topics: berries, mushrooms, teaching in N. Canada, Linux, firearms and hunting...
- Wikipedia - Authored by: Chaosd on Friday, May 26 2006 @ 07:38 AM EDT
- Wikipedia - Authored by: Anonymous on Saturday, May 27 2006 @ 01:09 AM EDT
Authored by: Sean DALY on Friday, May 26 2006 @ 06:31 AM EDT
See http://www.w3.org/TR/DOM-Level-2-Core/glossary.html
Authored by: gbl on Friday, May 26 2006 @ 07:04 AM EDT
Eric Raymond has experience of building
dictionaries. Perhaps someone could approach him to help?
---
If you love some code, set it free.
Authored by: gvc on Friday, May 26 2006 @ 07:40 AM EDT
WAIS is the "Wide Area Information Servers" system pioneered by Brewster Kahle. It was
started before the Web, using a network of search engines for information
retrieval requests. freeWAIS is a free implementation of the system, and the
"query syntax" is the particular protocol used to communicate between WAIS
clients and servers.
Kahle went on to found the Internet Archive including the Wayback machine
that has been mentioned here.
Authored by: Chaosd on Friday, May 26 2006 @ 07:47 AM EDT
The RFC Database might hold some good
material.
There is also a good general reference site at Zytrax - not specifically a dictionary,
but quite a lot of web/PC related discussion.
---
No question is stupid || All questions are stupid
Authored by: gvc on Friday, May 26 2006 @ 07:48 AM EDT
GAMS is "Guide to Available Mathematical Software." See NIST's description.
"Class
hierarchy" is jargon from the Object Oriented Programming (C++ and friends)
bandwagon. All it means in this case is a taxonomy that you can search --
mathematical software has been organized by GAMS into groups, subgroups, etc.
much as biologists have organized life on this planet.
Like WAIS, GAMS has
a search service, and the document is referring to the particular protocol used
to navigate and search within this taxonomy.
Authored by: Chaosd on Friday, May 26 2006 @ 07:59 AM EDT
Almost forgot: the Unix man (and info) pages. They can be easily searched using the apropos and whatis commands, and include the following (Unix related) sections:
- Executable programs or shell commands
- System calls (functions provided by the kernel)
- Library calls (functions within program libraries)
- Special files (usually found in /dev)
- File formats and conventions, e.g. /etc/passwd
- Games
- Miscellaneous (including macro packages and conventions), e.g. man(7), groff(7)
- System administration commands (usually only for root)
- Kernel routines [Non standard]
Most importantly, the man pages usually contain Author, Copyright, History and 'see also' sections.
---
No question is stupid || All questions are stupid
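To illustrate the comment above: apropos and whatis are essentially keyword searches over the one-line NAME descriptions of installed manual pages. Here is a toy Python sketch of that lookup; the miniature whatis database and the function names are illustrative only, not taken from any real man-db implementation.

```python
# Sketch of what apropos/whatis do: keyword search over the one-line
# NAME descriptions of manual pages. The tiny "whatis database" below
# is illustrative; a real system reads it from the installed pages.

WHATIS_DB = {
    ("passwd", 1): "change user password",
    ("passwd", 5): "the password file",
    ("man", 7): "macros to format man pages",
    ("qsort", 3): "sort an array",
}

def apropos(keyword):
    """Like apropos(1): match the keyword against names and descriptions."""
    kw = keyword.lower()
    return [f"{name}({sect}) - {desc}"
            for (name, sect), desc in WHATIS_DB.items()
            if kw in name.lower() or kw in desc.lower()]

def whatis(name):
    """Like whatis(1): exact match on the page name only."""
    return [f"{n}({s}) - {d}" for (n, s), d in WHATIS_DB.items() if n == name]
```

On a real system the database this sketch hard-codes is generated from the installed pages (by mandb on Linux), which is why apropos results track whatever documentation is actually present.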
Authored by: gvc on Friday, May 26 2006 @ 08:10 AM EDT
Pamela,
I am an information retrieval researcher and there might be an interesting
project here. How do I contact you? (Or you can contact me if you like.) I
have a vague recollection that it is possible to send you email, but I don't see
any such link. Perhaps I'm being blind.
thanks,
gvc
Authored by: rdc3 on Friday, May 26 2006 @ 08:33 AM EDT
I would suggest that the OSDL Prior Art project contact Google with respect
to this task. Google makes its business out of "organizing the world's
information."
Classification of prior art publications using taxonomies
or thesauri is becoming increasingly irrelevant in the context of powerful
full-text information retrieval techniques with citation indexing. Google and
Google Scholar are my preferred tools. Google Scholar has the
advantage of
helping you move forward to find newer
publications that cite prior
works.
The quality of links between documents is the determining factor
in citation searching. Google, CiteSeer and other systems could be greatly
improved by standardized forms of linking.
Software systems link to
each other through APIs, which may frequently be formalized using standard data
formats and protocols. In classifying software systems to aid future search,
one of the best things that could be done is to document the specific standards
(RFCs, W3C specs, ISO documents, etc.) and APIs that are used by the
software.
Indeed, a set of standard identifiers for these items could be
considered a controlled vocabulary. However, it is a controlled vocabulary
that naturally grows over time, as new protocols, formats and technology bases
become widely used.
Beyond linking to the formal specs, it would also be useful to link to academic publications addressing any fundamental techniques used (e.g., novel data structures or algorithms).
Of course, references to patents
also make sense, where
they are known. Again, a set of standardized
identifiers
for publications makes a natural controlled vocabulary that grows over
time.
Authored by: DL on Friday, May 26 2006 @ 08:50 AM EDT
ISBN 0070314888
It's over 700 pages, so it's pretty comprehensive.
It's old-school IBM-centric, but considering the long history and number of
patents IBM has, it should be a worthwhile reference. It is useful for
decrypting IBMisms into other IBMisms.
I don't know that IBM would be willing to contribute from it for this project.
As far as I know, it hasn't been updated in a while.
The Jargon file already mentioned is comprehensive, but the tone is quite
irreverent. Fair warning: some of the entries in this one are not suitable for
posting on this site, however accurate they may be.
---
DL
Authored by: lisch on Friday, May 26 2006 @ 09:26 AM EDT
The Encyclopedia of Computer Science and Engineering (ISBN 0442276796) is a classic standard reference. The latest edition is a bit pricey for most people's day-to-day use, but this massive tome would probably help OSAPA.
Authored by: DaveJakeman on Friday, May 26 2006 @ 09:32 AM EDT
ISBN 0-7221-6595-1
First published by Oxford University Press, 1983. Later published by Sphere
Books Ltd, 30-32 Gray's Inn Road, London WC1X 8JL.
From the back page:
"The Dictionary of Computing is the essential reference for all those
professionally involved in computing, both in academic and industrial life. It
is also suitable for people who have had no previous contact with computers but
now find they need specific reliable information, or for those with personal
computers who want to find out more about the subject.
"This dictionary contains over 3,750 terms used in computing. Terms which
range in complexity from basic ideas and equipment to graduate-level computer
science. Where relevant, entries are supplemented by instructive diagrams and
tables.
"The entries have been written by practitioners in all branches of computing
under the scrutiny of distinguished scholars from both sides of the
Atlantic."
---
Champagne for my real friends, real pain for my sham friends - Francis Bacon
---
Should one hear an accusation, try it out on the accuser.
Authored by: jesse on Friday, May 26 2006 @ 09:40 AM EDT
I took a look at the paper the USPTO suggested looking at. The
Netlib is a collection of mathematical software, from all I can tell. Their
search page suggests using the GAMS class hierarchy or the freeWAIS-sf query
syntax. I have no idea what that is, but I'm just telling you what I'm
finding.
netlib is a holdover from before the web
existed. In those days it used E-mail to provide queries for data, and E-mailed
replies for the results.
The "FreeWAIS-sf" query was a modified (i.e., without license restrictions) WAIS (Wide Area Information Servers) with structured fields (the "-sf"). The query itself did not use the current key+key type syntax, but depended on each key located in the document, with adjacency computation (i.e., the number of significant words between the query key words) to determine relative scoring. The results were then sorted and E-mailed back to the requestor.
The definition of "significant words" was usually any word NOT in a dictionary of ignored words. The database was a keyed file by word, document, starting location in the document, and ending location in the document, along with an initial weight value generated during indexing. I don't know what the weighting was based on; possibly things like how frequently the word appeared, the number of words between occurrences...
This was all done before Yahoo was a gleam in somebody's eye. The major searches were done by "Veronica", which was based on gopher and the non-free WAIS engine.
Now gopher was a pre-web browser that used a basic TCP connection (similar to telnet) for the browser. The user would use the gopher client and specify the target server (a complete imitation of telnet at the command line). The server would then present the root level document, which was basically a table of contents plus a header and footer. The user then used the arrow keys to step down the page and select the content desired (I believe a tab would jump to the next key entry, basically a URL). The gopher server would supply the data just like a web server does, one data file, then disconnect. It was up to the gopher client to provide the interface for reading the file, and identifying the links included. This is why web browsers still have a "gopher://..." link. The gopher server didn't care what kind of browser you were using. So the first web server was likely a gopher server that provided files in HTML format.
Oh, one other thing: gopher servers used port 70, while web browsers and servers since have assumed port 80 as the default for "http://..." URLs.
I used this search for an early web-browser-based database of user questions (an early FAQ) supporting a DoD MSRC helpdesk. (This was before forms were available in web standards. The only input field available was a "non-standard" WAIS query field, which was a single input text line in a web page.)
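The adjacency scoring jesse describes can be sketched in a few lines. This is a toy illustration under my own assumptions about the scheme (scoring by the gap between nearest occurrences of each pair of query terms, ignoring the stored index weights); none of the names below come from freeWAIS itself.

```python
# Toy sketch of adjacency-based relevance scoring: documents are
# ranked by how close together the query's significant words appear.

STOPWORDS = {"the", "a", "of", "in", "and", "to"}  # "ignored words" dictionary

def significant(words):
    """Keep only words not in the ignored-words dictionary."""
    return [w for w in words if w not in STOPWORDS]

def adjacency_score(document, query):
    """Sum, over pairs of query terms, 1/(gap between their nearest
    occurrences), with positions counted among significant words."""
    words = significant(document.lower().split())
    positions = {}
    for i, w in enumerate(words):
        positions.setdefault(w, []).append(i)
    terms = [t for t in significant(query.lower().split()) if t in positions]
    score = 0.0
    for a_idx, a in enumerate(terms):
        for b in terms[a_idx + 1:]:
            gap = min(abs(i - j) for i in positions[a] for j in positions[b])
            score += 1.0 / max(gap, 1)
    return score

docs = [
    "prior art search in the patent office",
    "art galleries prior to the modern era of search",
]
ranked = sorted(docs, key=lambda d: adjacency_score(d, "prior art"), reverse=True)
```

The document where "prior" and "art" are adjacent scores highest, which is the behavior jesse describes: relative ranking driven by the number of significant words between the query terms.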
- netlib - Authored by: Anonymous on Friday, May 26 2006 @ 02:11 PM EDT
Authored by: DaveJakeman on Friday, May 26 2006 @ 09:46 AM EDT
Digital Press (presumably now HP, if it still exists as such)
Digital order number: EY-3433E-DP
ISBN: 0-932376-82-7
659 pages.
From the back page:
"Based on the work of DEC engineers, documentation writers and educational
specialists, the Dictionary guides users of Digital Equipment Corporation's
products through the maze of technical terms, mnemonics and acronyms used to
identify or describe them. It is an indispensable sourcebook for computer
specialists, technical writers, course developers, instructors, students and
translators."
Contains a mixture of generic and DEC-specific computing terms.
---
Champagne for my real friends, real pain for my sham friends - Francis Bacon
---
Should one hear an accusation, try it out on the accuser.
Authored by: talexb on Friday, May 26 2006 @ 10:04 AM EDT
It puzzles and astounds me as to how the US Patent Office can make what they
consider to be good decisions on awarding any kind of patent without a fair bit
of relevant technical know-how. To me, it's almost an admission that they were
guessing, and suggests that many, if not all, software patents may need to be
re-examined.
It reminds me of Alice Through the Looking Glass. How bizarre.
Authored by: epostma on Friday, May 26 2006 @ 10:15 AM EDT
I've recently discovered the Dictionary of Algorithms and Data Structures by NIST. It's quite good, but it only deals with the theoretical side of computer science.
Erik.
Authored by: DL on Friday, May 26 2006 @ 11:15 AM EDT
Where * is one of several programming languages.
These are cookbooks containing hundreds of useful algorithms--all that math
you've forgotten since school--that you can put in your code. Some of these
books have over 1,000 pages, so they cover a lot of material.
The first edition of Numerical Recipes in C came out in 1988. That predates
most, if not all, software patents. Regardless, only patents issued less than 2
years prior to its publication have not expired.
You'll want the newer editions for real work, but that first edition will be the most effective as prior art for challenging patents.
---
DL
Authored by: jturner on Friday, May 26 2006 @ 12:52 PM EDT
The proprietary NAG library is organized like this.
The "complete functional summary" PDF here
relates to another
widely-used (but far from free) data analysis system. Netlib may be the biggest
and oldest repository for free algorithms.
In terms of disciplines, scipy
categorizes free Python packages like this.
I'll see if I can
think of more. It is much easier to list (free or otherwise) packages/algorithms
than to categorize them...
Authored by: Anonymous on Friday, May 26 2006 @ 02:03 PM EDT
I doubt much of it is useful, but I suppose The Jargon File fits within the rubric of
the broad description...
Authored by: Anonymous on Friday, May 26 2006 @ 05:56 PM EDT
Interestingly, many of the core free math routines from Netlib are used in the very successful commercial product MATLAB. A large part of the documentation for the bigger MATLAB routines refers directly to Netlib functions and white papers.
Authored by: Ted Powell on Friday, May 26 2006 @ 09:41 PM EDT
The following quote is from WWW -- Wealth,
Weariness or Waste :
Controlled vocabulary and thesauri in support of online
information access by Professor David Batty:
A thesaurus, to a
layman, is a fat book prepared by somebody called Peter Mark Roget and used by
college students to enlarge their vocabulary when writing term papers -- and,
often and unfortunately, to vary the representation of the same concept from
sentence to sentence. A thesaurus to an information scientist is a controlled
set of the terms used to index information in a database, and therefore also to
search for information in that database so the same concepts are represented by
the same term. For many years in this country, thesauri were often presented as
alphabetized lists of key terms, taken from the document to be indexed with
references to and from other terms made as necessary. This traditional practice
has changed in recent years to a more structured approach based on an analytical
technique. Ironically, this means that the original misuse of the word
"thesaurus" by information scientists, to describe purely alphabetical lists of
terms (Roget organized his thesaurus by categories of knowledge, and included an
alphabetized list of terms only as an index), has been amended so that it is now
closer to a proper use of Roget's meaning to include both categorization and
alphabetical listing.
The hierarchical structure is especially important when looking for prior art overlapping a patent claim, given that the claimant has a certain incentive to avoid the use of common nomenclature. The UNESCO Thesaurus: hierarchical list is a nice example of a hierarchy that covers a lot of ground.
The site's home page
has explanations of common structural terms, such as BT, RT, NT: Broader Term,
Related Term, Narrower Term, respectively.
--- "If you don't have the
source code, you are probably going to
be screwed in the long run." --Philip Greenspun
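The BT/RT/NT structure described above is easy to model. Below is a minimal Python sketch of such a thesaurus, with hypothetical terms of my own choosing; the transitive ancestors walk shows how an indexer could widen a prior-art search from a narrow term to all of its broader terms, exactly the kind of widening useful when a claimant avoids common nomenclature.

```python
# Minimal sketch of a thesaurus with Broader/Narrower/Related term
# links (BT/NT/RT). Terms and relations are illustrative only.

from collections import defaultdict

class Thesaurus:
    def __init__(self):
        self.broader = defaultdict(set)   # term -> its broader terms (BT)
        self.related = defaultdict(set)   # term -> related terms (RT)

    def add_bt(self, term, broader_term):
        """Record broader_term as a BT of term (term becomes its NT)."""
        self.broader[term].add(broader_term)

    def add_rt(self, a, b):
        """RT links are symmetric."""
        self.related[a].add(b)
        self.related[b].add(a)

    def narrower(self, term):
        """NT is just the inverse of BT."""
        return {t for t, bts in self.broader.items() if term in bts}

    def ancestors(self, term):
        """All broader terms, transitively -- widen a search from a
        narrow term up through every category that contains it."""
        seen, stack = set(), [term]
        while stack:
            for bt in self.broader[stack.pop()]:
                if bt not in seen:
                    seen.add(bt)
                    stack.append(bt)
        return seen

th = Thesaurus()
th.add_bt("quicksort", "sorting algorithm")
th.add_bt("sorting algorithm", "algorithm")
th.add_rt("quicksort", "divide and conquer")
```

This mirrors the distinction Professor Batty draws: the alphabetical list is just an index, while the BT/NT hierarchy carries the categorization that makes controlled searching possible.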
Authored by: Anonymous on Monday, May 29 2006 @ 06:46 PM EDT
It seems that the IEEE standard might be a good one to use, since it was
created by the software profession, it's been around a long time, and it was
also recently updated.
IEEE Standard Glossary of Software Engineering
Terminology (rev 2002)
http://standards.ieee.org/reading/ieee/std/se/610.12-1990.pdf
Seems to
cost money though.
J
Authored by: Anonymous on Wednesday, May 31 2006 @ 11:17 PM EDT
You can find it here.