You may have noticed that in the teleconference on Tuesday, SCO CEO Darl McBride made the claim that they have shown the code to IBM in discovery and that IBM knows exactly what code is in dispute. Specifically, he said this:
" . . .by the way, we have shared the code in question there with IBM under the litigation event. They know what we're talking about there."
There is room for skepticism. While it is impossible to rule out that code not yet in the public record may have been shown privately, if Darl was referring to the lists of files SCO has presented to IBM in discovery so far in the record, I think we need to look at those lists of "infringing files" more carefully.
I noticed that, as soon as their discovery list of files was released, coders everywhere were falling over laughing or snorting in contempt. I'm not a coder, so I asked some of our readers to explain why the lists strike them as so pitiful. There were many replies, including some fine comments, but they included and were based on code that went over my head and would not be accessible to some of Groklaw's readers either.
Most people in the world are not programmers, and it's a language we don't know. So my request was for a translation into English, so we too could grasp what they were noticing, exactly what SCO has on their lists, how they likely arrived at the lists, and what it indicates as to how much SCO actually provided during discovery, so we can understand why IBM filed a motion to compel discovery after receiving the lists from SCO.
The final result is mostly Frank Sorenson's work, but it incorporates helpful input from other Groklaw readers, so it represents the work of a group. I hope you enjoy looking at it from this fresh perspective. This case is, after all, about code, so the rest of us can only gain insight by trying to comprehend that part of the story.
Because I am not a programmer, I appreciated Justin Rowles' explanation about the utilities find and grep that Frank talks about in his article:
"Unix provides highly flexible tools for searching directory trees
and the files they contain. The two most common ones are called find and grep. Use of these tools is taught in 'Unix 101' type classes. For example, if I wanted to find all the files on my hard disk that started with 'apple' and ended in 'pie', I could use the find command to do so. It would find files called 'apple pie', 'apple and blackberry pie' and so on.
"grep is a similar tool for looking at the contents of files. It would be used to look at files and find, for example, which ones contained the word 'custard'. Usually it searches files in a single specified directory, but it can also be used to search a list of files generated by another command, like find.
"Both of these tools are highly flexible, and can be used together by a competent Unix person to search their disks for highly specific things. I could use find to find files that are called 'apple something pie', but not 'apple and redcurrant pie' and then check all of those files with grep to leave only those which also contain 'custard'. I can do all this in one instruction to the computer.
"In fact, in GNU/Linux, grep has been improved. GNU grep has the ability to search directory structures, so I can dispense with step one above. In SCO Unix, you can't do that, so you need to
Keep this explanation in mind, all you nonprogrammers, as we take a look now at the file lists with Frank. The other thing you need to know to understand what Frank describes is that a key Linux contributor, Christoph Hellwig, was a Caldera employee, and he wasn't the only one; the evidence strongly indicates that Caldera knew about the contributions at the time they were being made. Old SCO also contributed code to Linux. I think you will conclude, as I did, that when Darl says that they "deep dived" and looked at the code every which way, as he again claimed yesterday, he couldn't have been describing the process used to come up with the lists they have provided to IBM in the court case. They definitely didn't need spectral analysis, the missing MIT mathematicians, or physicists to come up with such lists as those they provided IBM and the court in their Supplemental Responses. Google and a couple of simple utilities are sufficient. With that introduction, here is Frank's article.
The SCO Group's List of "Infringing" Files -- How Might They Have Come Up With This List?
~by Frank Sorenson
In IBM's Reply Memorandum in Support of their (First) Motion to Compel Discovery (text here), IBM includes SCO's Supplemental Responses to IBM's First Set of Interrogatories (text here) and tells the Judge that SCO is still not answering their questions. One of the responses SCO provided was a list of files that may or may not be infringing, according to SCO. Why might IBM view the list as inadequate? To someone without a programming background, it might be hard to know.
A closer look by a computer programmer, with English translation for nonprogrammers, may give a clearer picture of why SCO's responses were neither "responsive nor identified with meaningful particularity", according to IBM. It also reveals the likely method SCO used to draw up the list, which bears on SCO's earlier claims that it had three groups of analysts, including the MIT mathematicians, analyzing the code.
SCO's response includes five lists from several categories:
- A list of "source code files identified by SCO thus far ... part of which include information (including methods) that IBM was required to maintain as confidential or proprietary...and/or which constitute trade secrets misused by IBM..." It's a list of 115 files.
- A list of "source code files identified by SCO thus far...which may...include information (including methods) that IBM was required to maintain as confidential or proprietary...and/or which constitute trade secrets misused by IBM..." It's a list of 591 files.
- A list of people at IBM that SCO claims to be aware of "in which part of the confidential or proprietary and/or trade secrets [were] known or [have] been disclosed." There are 5 lists of people whose names appear in the Linux code base, adding up to about 74 people.
- A list of IBM copyrights. This is a list of 22 names.
- A list of people who "likely have knowledge, although their names do not appear in the Linux code base." It's a list of 62 names.
First, a little background on Linux/Unix utilities and tools, then we will examine each of these lists, how they may have been created, and what (if anything) they mean. We conclude with some general comments.
There are a number of useful utilities in Linux/Unix. Because we will be using some of them in our discussion, we'll briefly mention a few before moving on:
One utility is called grep, and it is a utility designed to search inside a file (or files) for lines containing a certain pattern. In its simplest form, it is usually used like this: 'grep string filename', but it also accepts numerous flags (options) to allow it to perform various functions. When calling grep as egrep, extended pattern matches are enabled. Here, we will use grep to quickly find files containing strings that we are interested in.
Another commonly used utility is find, which is used to search a directory for files having certain properties, such as a specific name or pattern. Here, it will be used to locate files that we are interested in searching the contents of.
sort does just what it says; it sorts a list of strings. It can also be used with the -u option (unique) to remove duplicate references.
cat is used to type out the contents of files, and is very similar to type under DOS/Windows.
xargs is used to execute commands on the output of a previous command. We will be using it to reprocess the output of find commands and the output of other utilities.
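To make these utilities concrete, here is a minimal sketch that exercises each of them on a throwaway directory. The file names and contents (src/a.c, src/b.h, notes.txt) are invented for illustration and have nothing to do with the kernel source.

```shell
# Build a small scratch tree (all names and contents are hypothetical).
demo=$(mktemp -d)
mkdir -p "$demo/src"
printf 'int smp_enabled;\n' > "$demo/src/a.c"
printf '/* nothing of interest */\n' > "$demo/src/b.h"
printf 'remember SMP\n' > "$demo/notes.txt"

cd "$demo"

# grep: print the lines of one file that contain a pattern.
grep 'smp' src/a.c

# find: list files by name; here, anything ending in .c or .h.
find . -type f -name '*.[ch]'

# find + xargs + grep: search inside every file that find located.
# -print0 and -0 keep odd file names intact, -i ignores case, and -l
# prints only the names of the files that match.
matches=$(find . -type f -print0 | xargs -0 grep -il 'smp' | sort)

# sort -u: sort the lines and drop duplicates.
unique=$(printf 'b\na\nb\n' | sort -u)

# cat: type out a file.
cat notes.txt
```

After this runs, matches holds the two files that mention "smp" in any case, and unique holds the two distinct letters in sorted order.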
SCO's Lists of Files
Let's start with List 2: The list of "source code files identified by SCO thus far...which may...include information (including methods) that IBM was required to maintain as confidential or proprietary...and/or which constitute trade secrets misused by IBM..." This is a list of 591 files.
While this list contains a number of files from Linux, 591 of them, SCO fails to mention what kernel version, and only says they're from 2.4 and/or 2.5 kernels. As IBM correctly points out, "This is no small problem since there are 75 different releases of the Linux kernel 2.5 alone." SCO also says that they do not claim the entire source code found in those files, but that this information is interspersed in those 330,000 lines of code.
IBM also points out that since it is Unix code (SVRx) that SCO claims was misappropriated, pointing to the Linux source code does not really answer their question, which was: from where were the trade secrets misappropriated? SCO passes this argument off by saying that they have not completed discovery, and that since IBM hasn't given them everything they've asked for, they don't know exactly where it came from.
Because SCO is claiming that it is IBM's trade secrets that were misappropriated, they don't have the trade secrets yet themselves. In other words, they need IBM to reveal more information. The question becomes "Why does SCO believe that this list contains their trade secrets if they don't know the trade secrets and need IBM to point them out?"
In attempts to answer this, a number of discussions have occurred, here on Groklaw, on the Linux Kernel Mailing List, and elsewhere. Here on Groklaw, Lev managed to narrow the Linux kernel version down to either 2.5.68 or 2.5.69. Many people were quick to point out that most files on the list contained one or more strings that SCO likes to claim as theirs: SMP, JFS, RCU, and NUMA.
By using the appropriate utilities, it is possible to reproduce SCO's list (number 2) without any manual investigation of the contents of any of those files. A sorted (and cleaned up) copy of SCO's list number 2 is located here for reference. While this solution is certainly not the only one, and is probably not optimal, it is the one that the author managed to construct:
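# Step 1: C source and header files that mention smp, rcu, or numa as a whole word
find . -type f -name "*.[ch]" -print0 |
  xargs -0 egrep -wil 'smp|rcu|numa' |
  cut -c 3- > /tmp/output1
# Step 2: add the JFS files, except those mentioning an @sco or @caldera address
find fs/jfs -type f -path "*.[ch]" -print0 |
  xargs -0 egrep -Li "@sco|@caldera" >> /tmp/output1
# Step 3: drop some architectures, sound, and drivers; then sort and de-duplicate
egrep -v 'alpha|parisc|sparc|sound|drivers' /tmp/output1 |
  sort -u > /tmp/SCOFiles-list2.output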
This may look like quite a mess, but it can be deconstructed into manageable pieces. Each of the three steps consists of several commands strung together using |, the pipe. This means that the results of one command are used as input to the next command.
Picking apart these lines, first I found all files with a filename ending in .c or .h (C source code and header files). I searched the contents of these files for any of the strings 'smp', 'rcu', or 'numa' (without caring about upper- or lower-case). I placed these matching files into the file /tmp/output1. Next, I searched the JFS filesystem code for .c or .h filenames, removing any files that mention someone at SCO or Caldera working on them. The results were appended to /tmp/output1. Finally, I searched the /tmp/output1 file and removed all file names referring to alpha, parisc, or sparc (essentially Sun and HP). References to driver files and sound were then also removed.
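The flags do most of the work in these commands, so a small sketch may help. It assumes GNU grep, uses grep -E (the modern spelling of egrep), and runs on three invented files; the contents and the address in three.c are placeholders. In short: -w matches whole words only, -i ignores case, -l lists files that match, and -L lists files that do not.

```shell
work=$(mktemp -d)
cd "$work"
printf '#ifdef CONFIG_SMP\n' > one.h           # "SMP" only as part of CONFIG_SMP
printf '/* no SMP support here */\n' > two.h   # "SMP" as a separate word
printf '/* author: someone@sco.example */\n' > three.c

# -w: the underscore counts as a word character, so CONFIG_SMP is not a
# whole-word match for "smp"; only two.h is listed.
whole_word=$(grep -E -wil 'smp|rcu|numa' one.h two.h three.c)

# Without -w, both one.h and two.h are listed.
substring=$(grep -E -il 'smp|rcu|numa' one.h two.h three.c)

# -L inverts the listing: files that do NOT contain the pattern.
no_sco=$(grep -E -Li '@sco|@caldera' one.h two.h three.c)
```

The -L form is what drops files naming a SCO or Caldera address: three.c disappears from the result while the other two files stay.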
When applying this process to the kernel versions identified by Lev, we get 3 false positives and 3 false negatives with the 2.5.68 kernel and just one false positive with the 2.5.69 kernel. As the list is otherwise identical to SCO's, I believe that SCO used the Linux 2.5.69 kernel to generate these lists.
The false positive was include/asm-h8300/smplock.h. There may be a number of explanations for this, one of the most likely being that someone at SCO messed up, and missed a line when sending the list to the lawyers. This is, of course, presuming that the person preparing the list used a similar process, which I believe is likely.
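Counting false positives and false negatives amounts to comparing two sorted lists. Here is a minimal sketch using comm, with two short made-up lists standing in for SCO's list and the generated one. The file names sco.list and generated.list and their entries are placeholders, not the real data.

```shell
cmp_dir=$(mktemp -d)
cd "$cmp_dir"

# Placeholder lists; comm requires both inputs to be sorted the same way.
printf 'fs/a.c\nfs/b.c\nfs/c.c\n' | LC_ALL=C sort > sco.list
printf 'fs/b.c\nfs/c.c\nfs/d.c\n' | LC_ALL=C sort > generated.list

# False positives: lines found only in the generated list.
# -1 hides lines unique to the first file, -3 hides lines common to both.
false_pos=$(LC_ALL=C comm -13 sco.list generated.list)

# False negatives: lines found only in SCO's list.
false_neg=$(LC_ALL=C comm -23 sco.list generated.list)

echo "false positives: $false_pos"
echo "false negatives: $false_neg"
```

An empty result from both comm calls would mean the two lists are identical.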
What does this mean? Essentially, that SCO searched for any reference in the Linux kernel source for SMP, JFS, RCU, and NUMA, and claimed all of those files as possibly infringing. They included the entire JFS source code, but, perhaps realizing that it would look really bad to claim a file that implicated SCO or Caldera by showing the names of their employees, removed those files.
A number of people have pointed out that some of the files are so trivial that they could not contain trade secrets. For example, include/asm-arm/spinlock.h contains only 6 lines, but is included in the list because it contains the string SMP (as in "we don't do SMP"):
#error ARM architecture does not support SMP spin locks
#endif /* __ASM_SPINLOCK_H */
In providing this list to IBM, it appears that all SCO has done is to make vague claims over all of SMP, JFS, RCU, and NUMA, which is hardly news, but they have given no explanation of how they created their list of possibly infringing files. They haven't answered IBM's question at all (which relates to original SVRx code), and they look silly in the process, at least to those who understand the code and the list.
It is obvious that SCO did not spend a great deal of time or effort at answering IBM's question with valuable information. If they actually did spend time and effort to produce this list, their technical person is not extremely skilled.
List 1: A list of "source code files identified by SCO thus far ... part of which include information (including methods) that IBM was required to maintain as confidential or proprietary...and/or which constitute trade secrets misused by IBM...", the list of 115 files.
The first thing to note is that the files in this list are actually a subset of the files in List 2. For reference, a copy of SCO's list number 2 can be found here. Using our trusty Linux utilities, we can again construct a sequence of commands that produces SCO's list automatically. The following commands will produce all of SCO's files (again, 100%) with just 2 false positives:
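# Input: the List 2 file names (assumed here to be in /tmp/SCOFiles-list2.output)
cat /tmp/SCOFiles-list2.output |
  xargs egrep -l 'International Business Machines|ibm.|IBM Corp' > /tmp/output1
cat /tmp/SCOFiles-list2.output |
  xargs egrep -wl 'IBM|RCU' |
  xargs egrep -L 'sco' >> /tmp/output1
sort -u /tmp/output1 > /tmp/SCOFiles-list1.output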
These commands first search (List 2) for anything that would be easily identifiable as coming from IBM, files containing "International Business Machines", "IBM Corp", or "ibm." (as could be contained in an email address like firstname.lastname@example.org). Next, any mention whatsoever of "IBM" or "RCU" is included, as long as the file does not also contain "sco".
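One detail about the pattern 'ibm.' is worth noting: in a regular expression an unescaped dot matches any single character, so the pattern is broader than the literal text "ibm." followed by a domain. The sketch below illustrates this on three invented strings. Whether the difference changes the result on the real kernel files is not something tested here.

```shell
# Three invented sample lines: an address, a longer word, and "ibm" at end of line.
samples='someone@ibm.example
ibmtr driver
plain ibm'

# An unescaped dot matches any one character after "ibm".
loose=$(printf '%s\n' "$samples" | grep -E -c 'ibm.')

# An escaped dot matches only a literal period.
strict=$(printf '%s\n' "$samples" | grep -E -c 'ibm\.')

echo "unescaped: $loose matching lines, escaped: $strict matching lines"
```

The last sample line matches neither pattern, because nothing follows "ibm" on that line.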
Again, while we do not know for certain that this is the method that SCO used to produce this list, it is easy to demonstrate that even though our commands do not produce an identical list, SCO spent little more time to create this list than List 2.
We are unable to determine whether someone messed up and omitted the two false positives (include/linux/list.h among them), or whether our search string is not sufficiently developed to produce the same list. What we do know is that this list of "definitely infringing files" is little more than files with IBM mentioned, minus files referring to SCO. IBM is asking for specifics because SCO has given no explanation of how they built their list. Also, they've avoided the question of where in SVRx these trade secrets came from, and why SCO believes they are trade secrets.
List 3: A list of people at IBM that SCO claims to be aware of "in which part of the confidential or proprietary and/or trade secrets [were] known or [have] been disclosed." This consists of 5 lists of authors, for a total of about 74 people.
In SCO's Supplemental Response, they identify a number of people as having disclosed proprietary information and/or trade secrets. They break down these names into "US Authors" (30), "German Authors" (24), "Australian Authors" (2), "Other" (15), and "Austin Office (JFS)" (3). We won't be going into the same detail in analyzing this section because it involves the names and email addresses of people and we have redacted this information from the text version of the document. Those curious should view SCO's filing to see examples.
Suffice it to say that these lists can be regenerated by searching the kernel source for all files containing an email address at IBM. It contains actual lines from the copyright notices contained in the Linux kernel. On more than one, the line also contained references to other email addresses that the person used, and at least one just ends like this: "email@example.com or". The next line in the kernel source file contains the alternate address.
This list is fairly easy to generate, but does require a bit more manual intervention than most of the others. Since some people have contributed using multiple names (such as Pat and Patrick), someone has manually merged these names together. It was done sloppily, though, since there are other IBM-related email addresses in the source code which are not mentioned.
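A rough sketch of that kind of search follows, assuming GNU grep and run against a few invented source lines rather than the real kernel tree. The names, addresses, and the ".example" domains are placeholders, and the pattern is only one plausible way to pick out such addresses.

```shell
src=$(mktemp -d)
cd "$src"
printf ' * Copyright (C) 2002 Pat Example <pat@ibm.example>\n' > a.c
printf ' * Author: Patrick Example <pat@ibm.example>\n' > b.c
printf ' * Kim Sample <kim@de.ibm.example>\n * Someone Else <else@other.example>\n' > c.c

# -r recurses into the directory, -h omits file names, and -o prints only
# the matched text. sort -u then collapses repeated addresses.
addresses=$(grep -rhoE '[A-Za-z0-9._-]+@[A-Za-z0-9.-]*ibm\.[a-z]+' . | sort -u)

# The full contributor lines, which still need merging by hand when one
# person appears under several names (Pat and Patrick above).
grep -rhE '@[A-Za-z0-9.-]*ibm\.' . | sort -u
```

The automated part yields the addresses; deciding that two differently spelled names belong to one person is the manual step.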
Here, SCO is apparently telling IBM that they believe that every contribution from IBM is tainted, but they'll need all the source code ever written from IBM in order to prove it. I have serious doubts that everyone that ever contributed to Linux from IBM has done so under such suspicious circumstances (I actually have serious doubts that _any_ contributions are tainted in this way).
List 4: A list of IBM copyrights (a list of 22 names)
This list is as easy to generate as List 3. It is merely a list of all the various copyright notices involving IBM in the kernel source. It's actually a pretty boring list, and doesn't seem to tell anyone much, including IBM. It can be regenerated merely by searching for "Copyright" or "(C)" in the same line as "IBM Corporation". They're all just lines like:
Fred So-and-So, IBM Corporation
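That search might look something like the following sketch, again run on invented lines rather than the kernel source. The backslashes in \(C\) make grep -E treat the parentheses literally.

```shell
cdir=$(mktemp -d)
cd "$cdir"
printf ' * Copyright (C) 2001 Fred So-and-So, IBM Corporation\n' > x.c
printf ' * (C) 2002 IBM Corporation\n * Copyright 2003 Somebody Else\n' > y.c
printf ' * Maintained at IBM Corporation\n' > z.c

# First keep lines mentioning "Copyright" or "(C)", then keep only those
# that also mention "IBM Corporation" on the same line.
notices=$(grep -rhE 'Copyright|\(C\)' . | grep 'IBM Corporation' | sort -u)

# Count the surviving notice lines.
count=$(printf '%s\n' "$notices" | wc -l | tr -d ' ')
echo "$count notice lines found"
```

The line in z.c is dropped because it names IBM Corporation without any copyright marker, and the 2003 line is dropped because it has a marker but no IBM.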
List 5: A list of people who "likely have knowledge, although their names do not appear in the Linux code base." (a list of 62 names).
We've left the best for last. Here, we've left the kernel source, but where has SCO gotten this list? Ready? Okay... Here goes. They got it from a Google search.
Well, at least that is how it appears. The fact is that you can find the names on this list by searching on Google for email addresses from IBM that posted to the Linux Kernel Mailing List (LKML). Like I said, I don't actually know that this is how SCO did it, but if you're really curious, look at SCO's filing, then check out Google Groups for messages that hit the Linux Kernel Mailing List: '"ibm.com" group:fa.linux.kernel' (for example).
Without doing an extensive study, it is difficult to know exactly how much (or little) work was done to actually build the list, but it is clear that SCO believes that these individuals "likely have knowledge" because their email addresses can be found on the Linux Kernel Mailing List. To test this theory (in a highly unscientific manner), we chose 5-10 email addresses from the LKML (courtesy of Google) and all were located on SCO's list. We then tested things the other way around, and had similar results. The addresses we chose were easy to find on the LKML. One brief example: SCO's list includes the email address firstname.lastname@example.org, which is easy to find here.
So SCO produced a list that they believe holds the names of people with knowledge of Linux. They may have actually searched the Changelogs, as well. A list of names you can find on Google hardly qualifies as a response to IBM's interrogatory.
Some General Comments
In SCO's list, in the legal document, SCO has replaced all the slashes (/) in the file names with periods (.). There are several theories in the Linux community as to why. One possibility is that the lawyers may have written it up using a program that doesn't like slashes, instead of using Unix or Linux. While I used GNU utilities such as grep, the person preparing the list may have used a different platform.
Regular file/path names can be converted to the dotted format with the following command (if you so desire): 'cat /tmp/SCOFiles | sed s:/:.:g'. At any rate, they could be converted back easily enough. Interestingly, the path /arch/ppc64/kernel was also changed to .arch.ppc.64.kernel for some as yet unknown reason.
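As a quick check of that substitution on a single example path (the quotes around the sed expression are added here as a precaution), the sketch below also tries the reverse direction. The reverse step is only an assumption about how one might do it: it presumes that the only dot in the original path is the one before a final c or h extension, which will not hold for every file.

```shell
# Slashes to dots, as in SCO's filing.
dotted=$(echo 'fs/jfs/inode.c' | sed 's:/:.:g')

# Dots back to slashes, then restore the final ".c" or ".h" extension.
restored=$(echo "$dotted" | sed -e 's:\.:/:g' -e 's:/\([ch]\)$:.\1:')

echo "$dotted"
echo "$restored"
```

The colon is used as the sed delimiter so that the slash itself does not need escaping.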
Whoever prepared these lists was rather sloppy. They didn't pay attention to detail, missed obvious files and email addresses, and didn't edit very well. Obvious references to SCO or Caldera have been removed, but some of the less-obvious ones remain. For example, some contributions to JFS by Christoph Hellwig (once an employee of SCO) remain. Presumably, at least some of those contributions occurred while he was working for SCO.
Some of the files included are trivial and obviously contain no relevant information. The 6-line files that just say "we don't do SMP" come to mind.
It is easy for coders to understand IBM's contention that SCO has not been answering their questions, regardless of the amount of data that they have produced. They don't explain how anything they have reported is a trade secret. And the fact that their lists can be recreated over a weekend using simple scripts indicates to us that their answers are too broad to qualify as answers to the questions they were asked.
Maybe SCO hasn't heard the old saying: "Never tangle with a geek when source code is on the line."
Prepared by Frank Sorenson
With numerous helpful comments from other Groklaw Regulars