I was not born knowing what spectral analysis is. So when SCO said that spectral analysis is
one of the methods they used to find "infringing" code, I had no idea
what they were talking about. When Sontag compared it to finding a
needle in a pile of needles, I figured it wasn't much use. And it turns
out, on further investigation, that my intuitive conclusion may be about
right, at least when it comes to using it to mine software code
for infringing code in this case.
An alert reader noticed something
interesting. One of Canopy Group's companies is called DataCrystal.
Could that be at least one of the three groups SCO hired to try to sort
through the code of UNIX System V and the Linux kernel? DataCrystal
does "advanced pattern recognition" and "AI systems". They actually claim to do a
great deal more besides. One of the things listed on their what-we-do
page is data mining. Presumably, that's what SCO wanted to do. And a
look at their About page indicates that if
you are the RIAA, you'd probably want a company like DataCrystal hunting down pirates for you. Another reader noticed this page
about a DataCrystal, and he
wondered if it might be the same company. It isn't, because this DataCrystal
is the name of a project at USC, not a company. While I don't know if
the Canopy Group company DataCrystal was hired by SCO, or whether there is any connection between the company and the project, it did make me start to wonder about the field in general. If you really wanted to know whether two piles of code contained identical or similar code, could data mining find out? And would matches
be reliable for use in the way SCO apparently is using them? Judging by
the SCOForum demo, we might think no. And we might be right.
I asked a
Groklaw resource person, a man who worked for over a decade doing basic and exploratory
research for the US DoD and the Canadian Ministry of Defence on topics related
to secure communications and signals intelligence, including cryptology,
statistical processing of natural language, signal processing, and
computational learning, if he'd be willing to explain it in general and understandable
terms, so we can follow along. This subject is very likely going to be
a significant part of the case when it goes to trial.
Here is what
he explained to me:
Data mining is looking for
patterns or similarities in large quantities of information. Google is a
good example of data mining-on-demand where the pattern is supplied by
the user and the large quantity of information is the entire set of
webpages on the internet. But data mining in general is potentially much
broader. For example, a typical data-mining task might be to take a
sample document and look for other documents in a database that might be
similar to it. But even beyond that, data mining can be applied to other
kinds of data -- pictures, for example, or sound
recordings. Not being a rocket scientist, I wanted to be sure I'd understood, so I wrote back
and asked some follow-up questions, and got this reply:
There are lots of different ways to approach
problems like this. Beyond the most elementary, what all the techniques
have in common is that they rely on mathematical models and
transformations of the data. Part of the reason is efficiency, since
turning the problem into math usually means there's a computationally
clever way to do it. Another part of the reason is that, by transforming
the problem into math, you make it possible to find and grade a
continuum of approximate matches -- in short, to find ranges of
similarities rather than just identities. Note very well that
'similarity' here is completely dependent on the particular flavor of
math you've chosen as your technique. This is extremely important.
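To make that concrete, here is a toy sketch in Python of one common flavor of that math (my own illustration, to be clear, not anything SCO or DataCrystal has shown): each document becomes a bag of three-character substrings, and candidates are graded against a query by cosine similarity, which gives you exactly that continuum of scores instead of a yes-or-no answer.

    from collections import Counter
    from math import sqrt

    def trigrams(text):
        """Bag of overlapping 3-character substrings."""
        return Counter(text[i:i+3] for i in range(len(text) - 2))

    def cosine(a, b):
        """Grade two trigram bags on a 0.0-to-1.0 continuum."""
        dot = sum(n * b[g] for g, n in a.items())
        norm = sqrt(sum(n * n for n in a.values())) * sqrt(sum(n * n for n in b.values()))
        return dot / norm if norm else 0.0

    query = "for (i = 0; i < n; i++) sum += a[i];"
    candidates = [
        "for (j = 0; j < n; j++) total += b[j];",        # same shape, new names
        "while (p != NULL) p = p->next;",                # different C idiom
        "The quick brown fox jumps over the lazy dog.",  # not code at all
    ]

    q = trigrams(query)
    for text in sorted(candidates, key=lambda t: -cosine(q, trigrams(t))):
        print(f"{cosine(q, trigrams(text)):.2f}  {text}")

Notice that if I had picked whole words instead of three-character substrings, or a different distance measure, the grades, and even the ordering, could change. The 'similarity' really is a creature of the chosen math.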
OK, so you've taken your document or picture or
whatever, and you've mined your database for similar items. Those items
will be graded for similarity to your original, just as some search
engines will rate their returned items in terms of probable pertinence.
The most sophisticated and respectable data-mining systems will be using
grades based on probabilities. This is because the underlying math will
be using probability models. Many times the grade will reflect not
merely the strength of a match in terms of probability, but also the
likelihood that such a match would be found at random searching any old
data at all. This also is extremely important, since 'any old data at
all' can be subject to a wide range of interpretations. (This could be
pertinent in the SCO case, since, if data-mining techniques are used,
it's a reasonable question whether any contamination discovered this way
is real, or whether it's spurious, i.e., capable of being found to the same
degree in other, unrelated data.)
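His point about 'any old data at all' is easy to sketch, too. In the toy below (made-up snippets, again my own illustration), the score of a claimed match means very little until you see what the same measure says about unrelated, everyday C code:

    from difflib import SequenceMatcher

    def score(a, b):
        """Crude similarity ratio between two strings (0.0 to 1.0)."""
        return SequenceMatcher(None, a, b).ratio()

    claimed_original = "for (i = 0; i < n; i++) buf[i] = 0;"
    suspect          = "for (k = 0; k < n; k++) arr[k] = 0;"

    # Baseline: unrelated one-liners any C programmer might write.
    baseline = [
        "for (i = 0; i < len; i++) dst[i] = src[i];",
        "while (n-- > 0) *p++ = 0;",
        "if (ptr == NULL) return -1;",
    ]

    match = score(claimed_original, suspect)
    background = max(score(claimed_original, b) for b in baseline)

    print(f"claimed match: {match:.2f}")
    print(f"best unrelated baseline: {background:.2f}")
    # Only the gap between those two numbers hints at anything;
    # a high score alone may just mean "this is what C loops look like."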
Now the DataCrystal webpage
consists mostly of a laundry list of any and all of the subjects ever
associated with data mining, artificial intelligence, knowledge
discovery, or machine learning. But the .pdf white papers all focus on
using data-mining techniques for indexing and retrieving digital video
and audio. What's more, they're offering not just indexing and retrieval
services, but also housing, protecting, and distributing the data
itself. One of the white papers outlines an enhanced technique for expanding
data-mining coverage: a technique for building patterns out of
patterns and data mining on the derived metapatterns in turn.
I had a few more questions to follow up:
Q: ...the results would depend on how you programmed the software? In
other words...it can look for similarities, but it can't evaluate
them? There might be in actuality no common code at all?
ANS: You know how Google sometimes matches all the words in
your query, but not necessarily conjointly or in the same order?
In the case of computer code, especially code written in C expressing
similar or common algorithms, it would be astounding if there weren't
pattern similarities at some level. If nothing else, such things are
enforced by the design of the language and commonly-held notions about
good coding style.
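You can actually watch the language enforce that convergence. In this little experiment (mine, not his), identifier names are normalized away, the way some matching tools do, and two independently written loops collapse into the very same token skeleton:

    import re

    KEYWORDS = {"for", "while", "if", "return", "int"}

    def skeleton(c_code):
        """Replace every identifier with ID, keeping keywords and symbols."""
        tokens = re.findall(r"[A-Za-z_]\w*|[^\sA-Za-z_]", c_code)
        return " ".join("ID" if re.fullmatch(r"[A-Za-z_]\w*", t) and t not in KEYWORDS
                        else t for t in tokens)

    # Two loops written independently, both summing an array.
    alice = "int total = 0; for (i = 0; i < count; i++) total += values[i];"
    bob   = "int sum = 0; for (j = 0; j < n; j++) sum += data[j];"

    print(skeleton(alice))
    print(skeleton(bob))
    print(skeleton(alice) == skeleton(bob))  # True: identical skeletons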
Q: ...it simply would have to be the case
that some of the code is close enough that they might have a case?
ANS: Just the contrary. As with the 1st slide example,
the ancestry of that memory-management code is known to virtually
anybody who's studied C from Kernighan and Ritchie's book. A similarity
like that would stand out like Devil's Tower, but what it indicates is
exactly the opposite of what they contend: it shows that everybody knows that code.
Q: And can they program the math to increase
"matches"? Please explain that part a bit more.
ANS: Here's an
example. Suppose you came up with a hitherto-unknown page of blank
verse. The question is, was it written by Shakespeare or not?
Data mining your way through that problem, you'd get one level of
certainty if your database contained the Bible, Goethe, Racine, Pushkin,
and the New York Times. You'd get a different level of certainty if your
database were confined to Elizabethan dramatists. The scores for
putative Shakespeare against the mixed database would probably be huge
just for matching any English. The scores against Elizabethan dramatists
would probably be quite a bit weaker, but clearly more
conclusive. The mixed-database test -- the one with the Bible, Goethe,
etc. -- will probably say 'Shakespeare indeed!' but it's really only
expressing the idea that 'if it's English, it's Shakespeare.' On
the other hand, the Elizabethan dramatist test might say
yes, might say no, but the answer will be based on such
things as a small number of very subtle differences between,
say, Shakespeare's and Marlowe's vocabulary. It expresses
perhaps the idea that 'in any 1000-line chunk of Shakespeare
and any 1000-line chunk of Marlowe, Shakespeare is likely to
use the word "ope" once and Marlowe not at all. This example
doesn't use "ope" at all, therefore it's probably Marlowe.'
You can see it's still a matter of interpretation and
probability, but the second test is simply more credible on
grounds that are external to the data-mining method itself.
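His 'ope' example can be boiled down to a few lines of code, too. This is only a cartoon, with fake stand-in corpora rather than real Shakespeare or Marlowe, but it shows how an attribution falls out of nothing more than a rare-word rate:

    def per_1000(word, words):
        """Occurrences of `word` per 1000 words of a corpus."""
        return 1000 * words.count(word) / len(words)

    # Fake stand-ins for the two reference corpora.
    shakespeare = "ope the door and ope thy heart and more besides".split() * 100
    marlowe     = "open the gates and open thy soul and more besides".split() * 100
    sample      = "open the casement and speak no more".split()

    s_rate = per_1000("ope", shakespeare)
    m_rate = per_1000("ope", marlowe)
    x_rate = per_1000("ope", sample)

    # Attribute the sample to whichever author's rate it sits closer to.
    verdict = "Shakespeare" if abs(x_rate - s_rate) < abs(x_rate - m_rate) else "Marlowe"
    print(f"'ope' per 1000 words: sample={x_rate:.0f}, "
          f"Shakespeare={s_rate:.0f}, Marlowe={m_rate:.0f} -> {verdict}")

One subtle word, a small difference in rates, and out comes a verdict, credible only because the two reference corpora were well chosen in the first place.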
Here's another point of view. How does a data-mining
search for SVr4 code look if you run it against all C programs? In all
likelihood you're going to find some matches. Are the matches
against Linux actually any stronger than matches against an arbitrary
body of C code? Against other Unix-like kernels? etc.
These are interpretive issues, but there are statistical grounds for
deciding them, and speaking strictly for myself, I seriously doubt
they've been fielded satisfactorily. For my money you couldn't even
start taking the matter seriously unless exactly the same tests were run
against every other body of kernel code, like all the BSDs, and against a
chunk of the SVr4 kernel compared with the rest of that same kernel. And
even then, you've only generated the raw information to start the
business of verifying and refining the procedure.
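Those controls are straightforward to picture in code, as well. Here is a sketch with tiny stand-in strings (I obviously don't have the real code bases): score a chunk of "SVr4" against the rest of itself, then run the identical measure against the "Linux" and "BSD" stand-ins.

    from difflib import SequenceMatcher

    def best_window_score(chunk, corpus):
        """Best similarity of `chunk` against any equal-size window of `corpus`."""
        w = len(chunk)
        return max(SequenceMatcher(None, chunk, corpus[i:i+w]).ratio()
                   for i in range(max(1, len(corpus) - w + 1)))

    # Tiny made-up stand-ins for the real kernels.
    svr4  = "lock(&q); q->head = n; n->next = 0; unlock(&q); " * 4
    linux = "lock(&queue); queue->head = node; node->next = NULL; unlock(&queue);"
    bsd   = "mtx_lock(&m); insert_head(&h, elm); mtx_unlock(&m);"

    chunk, rest = svr4[:48], svr4[48:]   # self-test: a chunk vs. the rest of SVr4

    for name, corpus in [("rest of SVr4", rest), ("Linux", linux), ("BSD", bsd)]:
        print(f"{name:13s} {best_window_score(chunk, corpus):.2f}")

If the "Linux" number doesn't clearly beat both controls, the match tells you nothing.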
Q: And what is spectral analysis? Is that what this is?
ANS: In general, spectral analysis refers to breaking things down into component
frequencies -- sort of like how a prism breaks white light into colors,
and so on.
In this case it refers to using the periodicities of
the individual characters of program text as frequencies to look for a
very specific set of 'colors' associated with a particular swatch of
program code. It's not determinative either. It may also refer to a kind
of computational trick using spectral-based techniques to look for
certain kinds of approximate matches very quickly.
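Since SCO never detailed its method, what follows is only my guess at the flavor of such a technique, in toy form: treat the program text as a numeric signal, take its Fourier transform, and compare the resulting magnitude "spectra" instead of the raw characters.

    import numpy as np

    def spectrum(text, n=64):
        """Magnitude spectrum of the character codes, as a unit vector."""
        signal = np.frombuffer(text.encode("ascii", "replace"), dtype=np.uint8)
        mags = np.abs(np.fft.rfft(signal.astype(float), n=n))[1:]  # drop DC term
        return mags / np.linalg.norm(mags)

    def spectral_match(a, b):
        """Cosine similarity of two spectra: 1.0 means identical 'colors'."""
        return float(np.dot(spectrum(a), spectrum(b)))

    original = "for (i = 0; i < n; i++) sum += a[i];"
    renamed  = "for (q = 0; q < m; q++) tot += z[q];"   # identifiers changed
    prose    = "Four score and seven years ago our fathers"

    print(f"vs renamed code: {spectral_match(original, renamed):.3f}")
    print(f"vs plain prose:  {spectral_match(original, prose):.3f}")

The renamed code keeps the original's punctuation rhythm, so its spectrum should stay closer than the prose's does. And exactly as he says, that is suggestive, not determinative.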
So there you have it. At least now we know in general what they are
talking about. As the case goes forward, and more is revealed, no doubt
it'll be interesting to follow along meaningfully.
His analogy to Google made it all come clearer to me. On top of all that he
wrote, I know with Google, input affects output. And input means humans,
imperfect humans. I certainly know that I get different results from
Google if I plug in the identical search terms, but in a different
order, for example. So I totally get how results could be skewed by
what you tell the software to do. For example, I get different results
if I search for "Dave Farber" and IP than I do if I search for IP and
"Dave Farber", and it's different still if I search for just IP or just
"Dave Farber" or just Farber or just Farber and IP. And that's using
the same pile of data. Input affects output.
Obviously they would argue
that their methods are so refined, blah blah. But that human element
can't be removed, because humans write the software, no matter how
sophisticated. So how reliable are the matches? You use Google. What do
you think? Doesn't a human at some point have to interpret the value of the matches?
"A continuum of approximate matches" does not
prove infringement on its face. As he says, it's an interpretive
issue. And data mining seems to be a better match for something like
matching amino acid strings than for figuring out if someone stole
somebody's code, which requires knowing who has or doesn't have a valid
copyright, which way matching code travelled, who had the code first, etc.
If I've understood
what my friend has written, it means that if SCO swapped out Linux and
searched Windows 2000 code instead, it'd likely find instances that
looked like "infringing code" also. That's the same as saying that so far,
they are holding maybe nothing. It all reinforces in my
mind that, once again, nothing has been proven to date by their claims
of similarity and derivative or obfuscated code matches, and nothing can be
proven by data-mining techniques alone; that will have to wait until this
case goes to trial, the experts speak, and a judge decides.
If you are interested,
here is a white paper, "Text Mining -- Automated Analysis of Natural
Language Texts" that explains the process of searching just for simple
text, and while it does the explaining, it also shows just how much
human input goes into structuring your search before you begin the
search and why the results still may not be what you want. It is hard to see how such techniques
could answer the question "Is this infringing code?" At best, they could show you where to begin
to investigate. And here are the DataCrystal project's white papers.
One other thing I found out in my investigation. Guess where most of the
cutting-edge brains working on such data-mining techniques work? . . .
No, really. Guess. . .
That's right: at IBM.