GROKLAW
When you want to know more...
Google Taking Blog Comments Searching Real-Time?
Tuesday, January 22 2008 @ 02:05 PM EST

A reader sent me an article about something intriguing he noticed in Google results. He was wondering if any of you have noticed it too. I contacted Google to see what they might say, and here's what they told me: "We crawl the web continuously and schedule visits to each page intelligently to maximize freshness." Of course, to maximize intelligently, one has to experiment. In fact, Google has a blog post about their various experiments:
From time to time, we run live experiments on Google — tests visible to a relatively few people -- to discover better ways to search. We do this because there’s no good substitute for understanding how real people, in real-world situations, actually operate. Theories are fine, but “improving the user experience” really happens best when we understand what people do online.

So to learn more, we sometimes randomly select a group of people to see a possible improvement to search options. Or we may select a group of people and try out a new element while they're searching. If you ever wonder why your Google site looks slightly different from that of the person sitting next to you, this is why.

He seems to have hit on one. So, take a look and see what you think.

*********************************

Google Taking Blog Comments Searching Real-Time?
~ by Bill Binko

I have come across some anecdotal evidence that something is happening at Google that I thought Groklaw members would appreciate. I have checked with other search engines (Yahoo!, MSN, and Ask plus some smaller players) and this seems to be uniquely a Googlism.

It seems that Google has massively increased its capacity to perform full-text searches for current, dynamic content. Specifically, posts and public comments are now indexed and full-text-searchable in minutes, not days, if the page is listed in the site's RSS or Atom feed. Even smaller sites, such as small-town newspapers and niche community sites, seem to be included in this new system.

Perhaps as interesting, these recent posts are often being given top billing in the search results, showing that Google may be shifting its bias away from "established" pages with a longer history, and towards "current" information. This seems in line with the FAQ on FeedBurner.com (which Google recently acquired):

Q. Why did Google acquire FeedBurner?

A. Google believes that feed-based content and advertising is a developing space where we can add value for users, advertisers and publishers. FeedBurner's technology and talented team are a great addition to Google's current solutions for advertisers and publishers.

Background

Over the past couple of weeks, I have noticed that my Google Alerts were getting more and more rapid. That is, the time between the content being made available on the net and the time Google notified me of it via an Alert was dropping fast. This was even true when the content was posted on smaller sites, such as a local (small-town) newspaper site.

I assumed that Google was focusing on RSS/Atom feeds and that they had just increased their crawl rate. However, I believe it's much more than that.

A few days later, I was reading an article on Slashdot, and one of the comments held this quote (which I really like):

"Nihilism means nothing to the dancing peasants."
It wasn't attributed to anyone (it's from Oscar Wilde), so I went to look it up on Google. Because I wanted the entire quote, I surrounded it with double-quotes, and got this result.

Notice that the *first* result is the Slashdot story. What was fascinating was that at the time I searched, this story had only been posted for about an *hour*, and the comment was about a half-hour old! Also interesting is that the comment was absolutely *not* in the RSS feed itself -- it was only on the page whose URL was listed in the feed.

The other night, I started doing some investigating and found something that seems amazing to me. Google seems to now be full-text indexing not only RSS feeds but the entire contents of all of the pages listed in the feed at a refresh rate of less than 2 hours, not just for big RSS feeds like Slashdot, but for many small ones as well.

To verify this, I started with a site that I understood. TribalPages.com is a family tree site that I have done some consulting for, and they have RSS feeds for their two forums. The traffic on those forums is very low by any standards, with 20-50 posts per day. Yet when I searched Google for distinctive phrases from those posts, such as "about the search in the forum", I found results in the main Google index less than an hour after the posts were made. Additional tests showed that it didn't really matter whether I included any meaningful keywords in the text: the entire text was already searchable on Google.

My next thought was that Google was personalizing the results or searching sites that I regularly visited (which would be interesting enough). Several friends confirmed what I'd found, and they were not regulars to TribalPages or TBO.com, another site I tested. However, just to be sure, I ssh'd into a server that was just put online in a colo site last Friday and used lynx to visit a small-town newspaper site that I'd never visited (bradenton.com). I found a unique phrase on a news story less than a few hours old ("detectives also had found two burn barrels") and searched Google for it. The story was the only result returned. Again, the search phrase was not in the RSS feed.
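The "not in the RSS feed" check can be done mechanically rather than by eye. What follows is only a minimal sketch in Python, assuming an RSS 2.0 feed whose items carry `title` and `description` elements. The sample feed and the function name `phrase_in_feed` are made up for illustration and do not come from any of the sites tested.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed, standing in for a real site's feed.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Local News</title>
    <item>
      <title>Investigation continues</title>
      <link>http://example.com/story/1</link>
      <description>Police said the investigation is continuing.</description>
    </item>
  </channel>
</rss>"""


def phrase_in_feed(feed_xml: str, phrase: str) -> bool:
    """Return True if the phrase occurs in any item's title or description."""
    root = ET.fromstring(feed_xml)
    needle = phrase.lower()
    for item in root.iter("item"):
        for tag in ("title", "description"):
            # findtext returns None when the element is missing.
            text = item.findtext(tag) or ""
            if needle in text.lower():
                return True
    return False


if __name__ == "__main__":
    # A phrase found only on the article page should report False here.
    print(phrase_in_feed(SAMPLE_FEED, "detectives also had found two burn barrels"))
```

If this returns False for a phrase that Google nevertheless finds, the hit presumably came from the page itself and not from the feed text.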

Try it yourself

Here's a walk-through of how to test this for yourself. I'd be interested to hear whether others can confirm this or have a better explanation.

1. Pick a fairly low volume website that has articles with comments. It must also have an RSS feed of the articles. For this example, we'll choose the American Bar Association Journal's Daily News page. (Yes, I realize that isn't small, but I don't want to hammer some small site from Groklaw).

2. Find a story that is between 1 and 12 hours old. Here, we'll use this one.

3. Go to a section of the page that is not in the RSS feed. Most feeds contain only the first paragraph or two, so any text after that should be a good test. It's tempting to use comments, and that often works, but some sites (like Groklaw) do not allow Google to index pages with comments and others use JavaScript to display them. Pick out a unique phrase that is unlikely to be found elsewhere, but is really unrelated to the content. We'll use the phrase "pose significant issues for employers concerning".

4. Search Google using double quotes around the phrase, and you will see search results like this that point right back to your article. When I ran that search, the results said it was posted "2 hours ago".
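Steps 3 and 4 can be roughed out in a few lines of Python. This is a sketch under stated assumptions, not a finished tool: the helper names `pick_test_phrase` and `quoted_search_url` are hypothetical, the page and feed text are assumed to have already been extracted as plain strings, and the search URL uses the usual `q=` query form. The helper only guarantees that the phrase is absent from the feed text. Whether it is unique on the wider web is still a judgment call for the person running the test.

```python
from typing import Optional
from urllib.parse import quote_plus


def pick_test_phrase(page_text: str, feed_text: str, n_words: int = 6) -> Optional[str]:
    """Return the first run of n_words from the page that is absent from the feed."""
    words = page_text.split()
    # Normalize whitespace and case so line wrapping does not hide a match.
    feed_lower = " ".join(feed_text.split()).lower()
    for start in range(len(words) - n_words + 1):
        candidate = " ".join(words[start:start + n_words])
        if candidate.lower() not in feed_lower:
            return candidate
    return None


def quoted_search_url(phrase: str) -> str:
    """Build a Google search URL with the phrase wrapped in double quotes."""
    return "https://www.google.com/search?q=" + quote_plus('"' + phrase + '"')


if __name__ == "__main__":
    print(quoted_search_url("pose significant issues for employers concerning"))
    # prints https://www.google.com/search?q=%22pose+significant+issues+for+employers+concerning%22
```

Opening the printed URL in a browser is the equivalent of step 4. The sketch deliberately does not fetch or parse Google's results.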

Here are some sites that I've tested that now seem to have all of the pages listed in their feeds fully indexed within 2 hours of being posted.

  • Large News Sites
    • http://slashdot.org
    • http://news.com
    • http://nytimes.com

  • Legal Sites
    • http://law.com
    • http://abajournal.com
    • http://medicalfutility.blogspot.com

  • Niche Sites
    • http://www.tribalpages.com (Genealogy)
    • http://www.scrapbookinggems.com/ (Scrapbooking)
    • http://coastalsurfing.com/ (Surfing)

  • Small-town Newspapers
    • http://www.fredericksburgstandard.com/news/ (Fredericksburg, TX)
    • http://www.bradenton.com/local/ (Bradenton, FL)
    • http://www.cordeledispatch.com/local/ (Cordele, GA)

I couldn't find a single significant site that has news articles posted with a valid RSS feed that didn't seem to have all of its robot-visible content available for full-text search on Google -- even those articles posted late in the day. There are some borderline cases: for example, Google has indexed PatentlyO's news text but not its comments (even on the same page), and ESPN.com uses JavaScript to display its comments, so they are not searchable. But by and large, I've been amazed at the breadth of this change and the fact that we haven't heard anything about it.
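One way to judge whether a page is "robot-visible" before testing it is to check the site's robots.txt rules. The sketch below uses Python's standard `urllib.robotparser`. The robots.txt content, the example.com URLs, and the helper name `robot_visible` are all hypothetical, and nothing here says how any particular site (Groklaw included) actually keeps its comment pages out of the index. It also says nothing about comments rendered by JavaScript, which are a separate case.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; not copied from any real site.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /comments/
"""


def robot_visible(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt rules let the agent fetch the URL."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    print(robot_visible(SAMPLE_ROBOTS, "http://example.com/article/1"))
    print(robot_visible(SAMPLE_ROBOTS, "http://example.com/comments/123"))
```

A page that fails this check would not be expected to show up in the search test at all, however fresh it is.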

Implications

If my observations are correct, the implications of this are huge. Even given Google's history, this seems to push the boundaries of what they are capable of to a new level: there is now no lag time between posting a comment and the world finding it.

Researchers and reporters will undoubtedly find the new, timely information invaluable. It seems this could also magnify the power of "New Media", in that there is no longer any time lag: as long as you are of a size that gets you on Google's radar, and particularly if you are first with a story, you are in the game in a new way.

Perhaps I'm wrong and misreading the tea leaves. If so, I'm sure a Groklaw member (or two) will explain it to me! One way or another, something interesting is going on.


  


Google Taking Blog Comments Searching Real-Time? | 149 comments
Corrections Here
Authored by: artp on Tuesday, January 22 2008 @ 02:20 PM EST
Summary of change in the title block, please.

---
Userfriendly on WGA server outage:
When you're chained to an oar you don't think you should go down when the galley
sinks ?


[OT] Off Topic Comments Thread
Authored by: artp on Tuesday, January 22 2008 @ 02:23 PM EST
Change the Title block.
Read the Instructions below the text entry box.
Change Mode to HTML if necessary.
HTMLify Web links, please.
Review recent article on using HTML in Groklaw Comments.

---
Userfriendly on WGA server outage:
When you're chained to an oar you don't think you should go down when the galley
sinks ?


Newspicks discussion here please
Authored by: tiger99 on Tuesday, January 22 2008 @ 02:26 PM EST
Comments about items in the Groklaw newspicks can go here. Please put the title
of the Newspick item in your title, so we can see what your comment is about at
a glance.


Google Taking Blog Comments Searching Real-Time?
Authored by: Anonymous on Tuesday, January 22 2008 @ 02:28 PM EST
What is this doing to the access times, etc for regular traffic? If they are
indexing all web pages every few hours, is there a performance penalty that could
affect other Internet users in some way?


Google Taking Blog Comments Searching Real-Time?
Authored by: joef on Tuesday, January 22 2008 @ 02:32 PM EST
As of 1928Z I searched for the string "lag time between posting a comment
and the world" (past paragraph of the article) and got no hit, I'll retry
periodically and see how soon it comes up. 1928Z is some 23 minutes after the
timestamp on PJ's article.


Google ruining search results again
Authored by: cmc on Tuesday, January 22 2008 @ 03:03 PM EST
I personally wish Google would leave well enough alone. I've used a lot of
search engines through the years. My personal favorites were WebCrawler, then
MetaCrawler, and now Google. But Google used to be so much better than it is
now. Remember when you got meaningful results? Remember when the text you
searched for was actually on the pages returned in the results?

Nowadays, more often than not, the text I enter is not on the pages returned.
So why are the pages listed in the results? Because somewhere on the internet,
a page which linked to the page in the results contains the text I entered. How
on earth Google thinks that will help me is beyond comprehension. PageRank is
only useful for poisoning search results.

PageRank was the worst thing to ever happen to search results. And putting
more-current pages at the top of results is just as bad in my opinion. More
frequent does not mean more worthy.


Google Taking Blog Comments Searching Real-Time?
Authored by: Anonymous on Tuesday, January 22 2008 @ 03:04 PM EST
Does that mean that we have to wait until after 4 pm before this very thread
supplants the slashdot article?


Google Taking Blog Comments Searching Real-Time?
Authored by: Holocene Epoch on Tuesday, January 22 2008 @ 03:08 PM EST
That would explain why my personal site's photo gallery was top of the Google
search when I changed the software and had "random Photography" as the
title [since back to the original software].

I also unlocked a Domain Name so that I could change registrars, would this be
the same tech / similar tech to that which sent me an email to say that the
status of the Domain Name had changed??

Oh, Seattle could use some global warming if anyone has some extra, finally
reached freezing.


Agreed
Authored by: jeevesbond on Tuesday, January 22 2008 @ 04:03 PM EST

I've seen this also. On Slashdot a few weeks ago a user posted: 'just Google for "xyz"' someone else came back an hour later complaining that the only result for 'xyz' was the user's comment about Googling for it. :)

PJ: have you looked at Google Webmaster Tools? In there, under Tools -> Set crawl rate you can see graphs of what Google is downloading from your site per day, what's really interesting--in relation to this story--is the 'Your current speed' section. For me the 'Faster' speed setting is greyed out, but maybe for sites like Slashdot and Groklaw it isn't?

I'd be intrigued to find out whether you can access that 'faster' setting. :)


Google Taking Blog Comments Searching Real-Time?
Authored by: Stevieboy on Tuesday, January 22 2008 @ 04:12 PM EST
I personally find software that tries to predict what you might be wanting to do
or that gives you 'helpful information' telling you what you're doing wrong or
what you should do next extremely irritating and productivity reducing.

Windows is the biggest (but not the only) culprit repeatedly telling you to
update this or that over and over again - to the guys at Microsoft and any other
software producer, if I've clicked the message once it means I've taken it on
board and don't want to see the same message again!! Ever!!!!!

Give me predictable software rather than predictive. I just want software that
does what it says on the tin - no less and no more - and doesn't try to second
guess me.

Or have I misunderstood what Google is trying to do?


Another tool, useful to everyone
Authored by: Anonymous on Tuesday, January 22 2008 @ 04:27 PM EST

Think of those that would try and flood a site with comments - for some reason I can't think of the silly term at the moment. Basically, simulate a grass-roots movement.

Now they can go one better, flood Google so their responses hit in the top sections. Potentially useful to those that wish to do research but it's also useful to those who wish to prevent research.

RAS


FYI: this article is now indexed
Authored by: cwbinko on Tuesday, January 22 2008 @ 04:56 PM EST
Google results for the unique phrase "We crawl the web continuously and schedule visits" are now resolving to this page.

Seems like Groklaw is at worst being indexed at 3 hour intervals (sorry I wasn't testing earlier).

BTW: Great feedback from the community - keep it coming.

- Bill


Google Taking Blog Comments Searching Real-Time?
Authored by: ssavitzky on Tuesday, January 22 2008 @ 08:30 PM EST
Some sites, Livejournal for example, have a "live feed" of new
articles that Google undoubtedly taps into. Basically they just connect to a
socket, and drink from the firehose of new articles. Wouldn't surprise me if
Slashdot has one too. And of course a lot of blogs are on Blogger, which they
own.

---
Never anger a bard, for they are not subtle and people remember funny songs.


Small-time newspaper?
Authored by: DodgeRules on Tuesday, January 22 2008 @ 09:35 PM EST
... and used lynx to visit a small-town newspaper site that I'd never visited, bradenton.com).
<sarcasm>
Well I don't know if I should be happy to see my local paper listed in a GL posting or be insulted.
</sarcasm>


Question for author B.B. re: test system config
Authored by: Anonymous on Tuesday, January 22 2008 @ 09:49 PM EST
Bill Binko:

I'm wondering about the system you did this testing from.

1. What browser did you use? (Probably not important, I'm just curious.)

2. Is the Google toolbar installed in this browser?

3. Does this browser accept (and preserve) Google's persistent cookies?


Google Taking Blog Comments Searching Real-Time?
Authored by: SoundChsr on Wednesday, January 23 2008 @ 12:53 AM EST
I noticed this tonight myself. I re-posted a couple sections of my gOS review
on linuxtweakers.org (my new site - notice no clickie - not trying to
advertise). When I went over to Google to verify that the site was getting
indexed, the search results showed that the articles I had just posted were
indexed in under 20 minutes. Astounding!

// George


Groklaw © Copyright 2003-2013 Pamela Jones.
All trademarks and copyrights on this page are owned by their respective owners.
Comments are owned by the individual posters.

PJ's articles are licensed under a Creative Commons License. ( Details )