GROKLAW
When you want to know more...
Google Taking Blog Comments Searching Real-Time?
Tuesday, January 22 2008 @ 02:05 PM EST

A reader sent me an article about something intriguing he noticed in Google results. He was wondering if any of you have noticed it too. I contacted Google to see what they might say, and here's what they told me: "We crawl the web continuously and schedule visits to each page intelligently to maximize freshness." Of course, to maximize intelligently, one has to experiment. In fact, Google has a blog post about their various experiments:
From time to time, we run live experiments on Google — tests visible to a relatively few people -- to discover better ways to search. We do this because there’s no good substitute for understanding how real people, in real-world situations, actually operate. Theories are fine, but “improving the user experience” really happens best when we understand what people do online.

So to learn more, we sometimes randomly select a group of people to see a possible improvement to search options. Or we may select a group of people and try out a new element while they're searching. If you ever wonder why your Google site looks slightly different from that of the person sitting next to you, this is why.

He seems to have hit on one. So, take a look and see what you think.

*********************************

Google Taking Blog Comments Searching Real-Time?
~ by Bill Binko

I have come across some anecdotal evidence that something is happening at Google that I thought Groklaw members would appreciate. I have checked with other search engines (Yahoo!, MSN, and Ask plus some smaller players) and this seems to be uniquely a Googlism.

It seems that Google has massively increased its capacity to perform full-text searches for current, dynamic content. Specifically, posts and public comments are now indexed and full-text-searchable in minutes, not days, if the page is listed in the site's RSS or Atom feed. Even smaller sites, such as small-town newspapers and niche community sites, seem to be included in this new system.

Perhaps as interesting, these recent posts are often being given top billing in the search results, showing that Google may be shifting its bias away from "established" pages with a longer history, and towards "current" information. This seems in line with the FAQ on FeedBurner.com (which Google recently acquired):

Q. Why did Google acquire FeedBurner?

A. Google believes that feed-based content and advertising is a developing space where we can add value for users, advertisers and publishers. FeedBurner's technology and talented team are a great addition to Google's current solutions for advertisers and publishers.

Background

Over the past couple of weeks, I have noticed that my Google Alerts were getting more and more rapid. That is, the time between the content being made available on the net and the time Google notified me of it via an Alert was dropping fast. This was even true when the content was posted on smaller sites, such as a local (small-town) newspaper site.

I assumed that Google was focusing on RSS/Atom feeds and that they had just increased their crawl rate. However, I believe it's much more than that.

A few days later, I was reading an article on Slashdot, and one of the comments held this quote (which I really like):

"Nihilism means nothing to the dancing peasants."
It wasn't attributed to anyone (it's from Oscar Wilde), so I went to look it up on Google. Because I wanted the entire quote, I surrounded it with double quotes, and got this result.

Notice that the *first* result is the Slashdot story. What was fascinating was that at the time I searched, this story had only been posted for about an *hour*, and the comment was about a half-hour old! Also interesting is that the comment was absolutely *not* in the RSS feed itself -- it was only on the page whose URL was listed in the feed.

The other night, I started doing some investigating and found something that seems amazing to me. Google seems to now be full-text indexing not only RSS feeds but the entire contents of all of the pages listed in the feed, at a refresh rate of less than 2 hours, not just for big RSS feeds like Slashdot, but for many small ones as well.

To verify this, I started with a site that I understood. TribalPages.com is a family tree site that I have done some consulting for, and they have RSS feeds for their two forums. The traffic on those forums is very low by any standards, with 20-50 posts per day. Yet when I searched Google for distinctive phrases from those posts, such as "about the search in the forum", I found results in the main Google index less than an hour after the posts were made. Additional tests showed that it didn't really matter whether I included any meaningful keywords in the text: the entire text was already searchable on Google.

My next thought was that Google was personalizing the results or searching sites that I regularly visited (which would be interesting enough). Several friends confirmed what I'd found, and they were not regulars at TribalPages or TBO.com, another site I tested. However, just to be sure, I ssh'd into a server that was just put online in a colo site last Friday and used lynx to visit a small-town newspaper site that I'd never visited (bradenton.com). I found a unique phrase in a news story just a few hours old ("detectives also had found two burn barrels") and searched Google for it. The story was the only result returned. Again, the search phrase was not in the RSS feed.
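For anyone who would rather script the "not in the RSS feed" part of this check than eyeball it, here is a minimal Python sketch. It assumes a plain RSS 2.0 feed with `item`, `title`, `link` and `description` elements. The sample feed, the example.com URL and the function names `phrase_in_feed` and `feed_links` are all made up for illustration; this is not how I ran the tests above.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical RSS 2.0 feed, invented for illustration only.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Local News</title>
    <item>
      <title>Investigation continues</title>
      <link>http://example.com/local/story-1.html</link>
      <description>Police say the investigation is ongoing.</description>
    </item>
  </channel>
</rss>"""


def phrase_in_feed(feed_xml: str, phrase: str) -> bool:
    """Return True if the phrase appears in any item's title or description."""
    needle = phrase.lower()
    for item in ET.fromstring(feed_xml).iter("item"):
        for tag in ("title", "description"):
            text = item.findtext(tag) or ""
            if needle in text.lower():
                return True
    return False


def feed_links(feed_xml: str) -> List[str]:
    """Return the page URLs that the feed lists, one per item."""
    links = []
    for item in ET.fromstring(feed_xml).iter("item"):
        link = item.findtext("link")
        if link:
            links.append(link.strip())
    return links


# The phrase lives only on the linked page, so the feed check should fail.
print(phrase_in_feed(SAMPLE_FEED, "detectives also had found two burn barrels"))
print(feed_links(SAMPLE_FEED))
```

If `phrase_in_feed` returns `False` for a phrase that Google nevertheless finds, the match must have come from the linked page rather than from the feed text itself.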

Try it yourself

Here's a walk-through of how to test this for yourself. I'd be interested to hear whether others can confirm this or have a better explanation.

1. Pick a fairly low-volume website that has articles with comments. It must also have an RSS feed of the articles. For this example, we'll choose the American Bar Association Journal's Daily News page. (Yes, I realize that isn't small, but I don't want to hammer some small site from Groklaw.)

2. Find a story that is between 1 and 12 hours old. Here, we'll use this one.

3. Go to a section of the page that is not in the RSS feed. Most feeds contain only the first paragraph or two, so any text after that should be a good test. It's tempting to use comments, and that often works, but some sites (like Groklaw) do not allow Google to index pages with comments and others use JavaScript to display them. Pick out a unique phrase that is unlikely to be found elsewhere, but is really unrelated to the content. We'll use the phrase "pose significant issues for employers concerning".

4. Search Google using double quotes around the phrase, and you will see search results like this that point right back to your article. When I ran that search, the results said it was posted "2 hours ago".
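Steps 2 through 4 can be roughed out in Python as follows. Treat it as a sketch under stated assumptions: the feed is RSS 2.0 with `pubDate` on each item, the page has already been split into paragraphs, and the sample feed, sample paragraphs and function names are invented here. The phrase picker only avoids text already in the feed; judging whether a phrase is truly unique still takes a human eye.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
from urllib.parse import urlencode

# Hypothetical feed, invented for illustration only.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel><title>Example Daily News</title>
<item><link>http://example.com/news/a.html</link>
<pubDate>Tue, 22 Jan 2008 18:45:00 +0000</pubDate>
<description>Too fresh to test.</description></item>
<item><link>http://example.com/news/b.html</link>
<pubDate>Tue, 22 Jan 2008 16:00:00 +0000</pubDate>
<description>The council met on Monday to discuss the new rules.</description></item>
<item><link>http://example.com/news/c.html</link>
<pubDate>Mon, 21 Jan 2008 01:00:00 +0000</pubDate>
<description>Too old to test.</description></item>
</channel></rss>"""


def items_in_age_window(feed_xml, now, min_hours=1, max_hours=12):
    """Step 2: return (link, summary) pairs for items 1 to 12 hours old."""
    picked = []
    for item in ET.fromstring(feed_xml).iter("item"):
        pub, link = item.findtext("pubDate"), item.findtext("link")
        if not pub or not link:
            continue
        published = parsedate_to_datetime(pub)
        if published.tzinfo is None:
            # Assumption: dates without a usable offset are treated as UTC.
            published = published.replace(tzinfo=timezone.utc)
        age = now - published
        if timedelta(hours=min_hours) <= age <= timedelta(hours=max_hours):
            picked.append((link.strip(), item.findtext("description") or ""))
    return picked


def pick_test_phrase(paragraphs, feed_summary, n_words=6):
    """Step 3: take n_words from the middle of the first paragraph
    whose opening words do not appear in the feed summary."""
    summary = " ".join(feed_summary.split()).lower()
    for para in paragraphs:
        words = para.split()
        if len(words) < n_words:
            continue
        if " ".join(words[:n_words]).lower() in summary:
            continue  # paragraph is (at least partly) in the feed already
        mid = (len(words) - n_words) // 2
        return " ".join(words[mid:mid + n_words])
    return None


def google_exact_phrase_url(phrase):
    """Step 4: build a Google search URL with the phrase in double quotes."""
    return "https://www.google.com/search?" + urlencode({"q": '"%s"' % phrase})
```

A run might pass `datetime.now(timezone.utc)` as `now`, fetch one of the returned links by hand, feed its paragraphs to `pick_test_phrase`, and open the URL from `google_exact_phrase_url` in a browser. The sketch deliberately does not query Google itself, so it puts no automated load on anyone's site.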

Here are some sites that I've tested that now seem to have all of the pages listed in their feeds fully indexed within 2 hours of being posted.

  • Large News Sites
    • http://slashdot.org
    • http://news.com
    • http://nytimes.com

  • Legal Sites
    • http://law.com
    • http://abajournal.com
    • http://medicalfutility.blogspot.com

  • Niche Sites
    • http://www.tribalpages.com (Genealogy)
    • http://www.scrapbookinggems.com/ (Scrapbooking)
    • http://coastalsurfing.com/ (Surfing)

  • Small-town Newspapers
    • http://www.fredericksburgstandard.com/news/ (Fredericksburg, TX)
    • http://www.bradenton.com/local/ (Bradenton, FL)
    • http://www.cordeledispatch.com/local/ (Cordele, GA)

I couldn't find a single significant site that has news articles posted with a valid RSS feed that didn't seem to have all of its robot-visible content available for full-text search on Google -- even those articles posted late in the day. There are some borderline cases: for example, Google has indexed PatentlyO's news text but not its comments (even on the same page), and ESPN.com uses JavaScript to display its comments, so they are not searchable. But by and large, I've been amazed at the breadth of this change and the fact that we haven't heard anything about it.
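One common way for a site to keep crawlers away from some of its pages is a robots.txt file, and Python's standard `urllib.robotparser` module can tell you whether a given URL is "robot-visible" under such rules. The sketch below is an assumption-laden illustration: the robots.txt content and the example.com URLs are invented, and I am not claiming that any site named above uses these particular rules. It also says nothing about comments rendered by JavaScript, which is a separate reason text may not be searchable.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, invented for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /comments/
"""


def crawler_may_fetch(robots_txt, user_agent, url):
    """Return True if the robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Article pages are allowed; pages under /comments/ are not.
print(crawler_may_fetch(SAMPLE_ROBOTS, "Googlebot", "http://example.com/articles/123"))
print(crawler_may_fetch(SAMPLE_ROBOTS, "Googlebot", "http://example.com/comments/123"))
```

If a page fails this check, its absence from the search results tells you nothing about how fast Google is indexing, so it is worth ruling out before counting a site as a miss.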

Implications

If my observations are correct, the implications of this are huge. Even given Google's history, this seems to push the boundaries of what they are capable of to a new level: there is now almost no lag time between posting a comment and the world finding it.

Researchers and reporters will undoubtedly find the new, timely information invaluable. It seems this could also magnify the power of "New Media", in that there is no longer any meaningful time lag: as long as you are of a size that gets you on Google's radar, and particularly if you are first with a story, you are in the game in a new way.

Perhaps I'm wrong and misreading the tea leaves. If so, I'm sure a Groklaw member (or two) will explain it to me! One way or another, something interesting is going on.




Groklaw © Copyright 2003-2013 Pamela Jones.
All trademarks and copyrights on this page are owned by their respective owners.
Comments are owned by the individual posters.

PJ's articles are licensed under a Creative Commons License. ( Details )