A reader sent me an article about something intriguing he noticed in Google results. He was wondering if any of you have noticed it too. I contacted Google to see what they might say, and here's what they told me: "We crawl the web continuously and schedule visits to each page intelligently to maximize freshness." Of course, to maximize intelligently, one has to experiment. In fact, Google has a blog post about their various experiments:
From time to time, we run live experiments on Google -- tests visible to a relatively few people -- to discover better ways to search. We do this because there’s no good substitute for understanding how real people, in real-world situations, actually operate. Theories are fine, but “improving the user experience” really happens best when we understand what people do online.
So to learn more, we sometimes randomly select a group of people to see a possible improvement to search options. Or we may select a group of people and try out a new element while they're searching. If you ever wonder why your Google site looks slightly different from that of the person sitting next to you, this is why.
He seems to have hit on one. So, take a look and see what you think.
*********************************
Google Taking Blog Comments Searching Real-Time?
~ by Bill Binko
I have come across some anecdotal evidence that something is
happening at Google that I thought Groklaw members would
appreciate. I have checked with other search engines (Yahoo!, MSN,
and Ask plus some smaller players) and this seems to be uniquely a Googlism.
It seems that Google has massively increased its capacity to perform
full-text searches for current, dynamic content. Specifically,
posts and public comments are now indexed and full-text-searchable
in minutes, not days, if the page is listed in the site's RSS or Atom
feed. Even smaller sites, such as small-town newspapers and niche
community sites, seem to be included in this new system.
Perhaps as interesting, these recent posts are often being given
top-billing in the search results, showing that Google may be
slanting its bias away from "established" pages with a longer
history, and towards "current" information. This seems in line with
the FAQ on FeedBurner.com (which Google recently acquired):
Q. Why did Google acquire FeedBurner?
A. Google believes that feed-based content and advertising is a
developing space where we can add value for users, advertisers
and publishers. FeedBurner's technology and talented team are a
great addition to Google's current solutions for advertisers and
publishers.
Background
Over the past couple of weeks, I have noticed that my Google Alerts
were getting more and more rapid. That is, the time between the
content being made available on the net and the time Google notified
me of it via an Alert was dropping fast. This was even true when
the content was posted on smaller sites, such as a local
(small-town) newspaper site.
I assumed that Google was focusing on RSS/Atom feeds and that they
had just increased their crawl rate. However, I believe it's much
more than that.
A few days later, I was reading an
article on Slashdot, and one of
the comments held this quote (which I really like):
"Nihilism means nothing to the dancing peasants."
It wasn't attributed to anyone (it's from Oscar Wilde), so I went to
look it up on Google. Because I wanted the entire quote, I
surrounded it with double-quotes, and got
this result.
Notice that the *first* result is the Slashdot story. What was
fascinating was that at the time I searched, this story had only
been posted for about an *hour*, and the comment was about a half-hour
old! Also interesting is that the comment was absolutely *not* in
the RSS feed itself -- it was only on the page whose URL was listed
in the feed.
The other night, I started doing some investigating and found something
that seems amazing to me. Google now seems to be full-text indexing not
only RSS feeds but the entire contents of all of the pages listed
in the feed, at a refresh rate of less than 2 hours, and not just for
big RSS feeds like Slashdot, but for many small ones as well.
To verify this, I started with a site that I understood. TribalPages.com is a family tree site that
I have done some consulting for, and they have RSS feeds for their
two forums. The traffic on those forums is very low by any
standards, with 20-50 posts per day. Yet when I searched Google for
distinctive phrases from those posts, such as "about the search in
the forum", I found results in the main Google index less than an
hour after the posts were made. Additional tests showed that it
didn't really matter whether I included any meaningful keywords in
the text: the entire text was already searchable on Google.
My next thought was that Google was personalizing the results or
searching sites that I regularly visited (which would be interesting
enough). Several friends confirmed what I'd found, and they were
not regulars to TribalPages or TBO.com, another site I tested. However, just to be sure, I ssh'd into a server that had just been put
online in a colo site last Friday and used lynx to visit a
small-town newspaper site that I'd never visited
(bradenton.com). I found a unique phrase on a news story
less than a few hours old ("detectives also had found two burn
barrels") and searched Google for it. The
story was the only
result returned. Again, the search phrase was not in the RSS feed.
Try it yourself
Here's a walk-through of how to test this for yourself. I'd be
interested to hear whether others can confirm this or have a better
explanation.
1. Pick a fairly low volume website that has articles with
comments. It must also have an RSS feed of the articles. For
this example, we'll choose the American Bar Association
Journal's Daily News page. (Yes,
I realize that isn't small, but I don't want to hammer some
small site from Groklaw).
2. Find a story that is between 1 and 12 hours old. Here, we'll
use this one.
3. Go to a section of the page that is not in the RSS feed.
Most feeds contain only the first paragraph or two, so any
text after that should be a good test. It's tempting to use
comments, and that often works, but some sites (like Groklaw)
do not allow Google to index pages with comments and others
use Javascript to display them. Pick out a unique phrase that
is unlikely to be found elsewhere, but is really unrelated to
the content. We'll use the phrase "pose significant issues
for employers concerning".
4. Search Google using double quotes around the phrase, and you
will see search results like
this
that point right back to your article. When I ran that
search, the results said it was posted "2 hours ago".
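For anyone who would rather script steps 3 and 4, here is a minimal Python sketch. It assumes you have already saved the feed XML and the article's plain text yourself; the function names, the six-word phrase length, and the sample feed and page are all made up for illustration. It does not contact Google. It only suggests phrases that appear on the page but not in the feed, and builds the quoted-search URL for you to open by hand.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import quote_plus

# Made-up sample data standing in for a real feed and article page.
SAMPLE_FEED = (
    '<rss version="2.0"><channel><title>Daily News</title>'
    "<item><title>New rule</title>"
    "<description>The new rule takes effect Monday.</description></item>"
    "</channel></rss>"
)
SAMPLE_PAGE = (
    "The new rule takes effect Monday. It may pose significant issues "
    "for employers concerning overtime."
)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so comparisons ignore layout."""
    return re.sub(r"\s+", " ", text).strip().lower()


def feed_text(rss_xml: str) -> str:
    """Join the text of every <title> and <description> in an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    parts = [el.text or "" for el in root.iter()
             if el.tag in ("title", "description")]
    return normalize(" ".join(parts))


def candidate_phrases(page_text: str, rss_xml: str, words: int = 6) -> list[str]:
    """Return runs of `words` consecutive page words that are absent from the feed."""
    in_feed = feed_text(rss_xml)
    tokens = normalize(page_text).split(" ")
    phrases = []
    for i in range(len(tokens) - words + 1):
        phrase = " ".join(tokens[i:i + words])
        if phrase not in in_feed:
            phrases.append(phrase)
    return phrases


def exact_phrase_url(phrase: str) -> str:
    """Build a Google search URL with the phrase wrapped in double quotes."""
    return "https://www.google.com/search?q=" + quote_plus(f'"{phrase}"')
```

You still have to choose, by eye, a candidate that is unusual and unrelated to the story's obvious keywords; the sketch only rules out phrases that are already in the feed.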
Here are some sites that I've tested that now seem to have all of
the pages listed in their feeds fully indexed within 2 hours of
being posted.
- Large News Sites
- http://slashdot.org
- http://news.com
- http://nytimes.com
- Legal Sites
- http://law.com
- http://abajournal.com
- http://medicalfutility.blogspot.com
- Niche Sites
- http://www.tribalpages.com (Genealogy)
- http://www.scrapbookinggems.com/ (Scrapbooking)
- http://coastalsurfing.com/ (Surfing)
- Small-town Newspapers
- http://www.fredericksburgstandard.com/news/ (Fredericksburg, TX)
- http://www.bradenton.com/local/ (Bradenton, FL)
- http://www.cordeledispatch.com/local/ (Cordele, GA)
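The time windows mentioned in this walk-through (a story between 1 and 12 hours old, and indexing within 2 hours of posting) can be checked against a feed item's pubDate. The following is a rough Python sketch, assuming the feed uses the usual RFC 822 date format with an explicit time zone; the function names and the sample dates are mine, not anything taken from the sites above.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def item_age(pub_date: str, now: datetime) -> timedelta:
    """Age of a feed item, given its RFC 822 <pubDate> string.

    Assumes pub_date carries a time zone such as "+0000"; `now` must then
    be timezone-aware as well, or the subtraction raises TypeError.
    """
    return now - parsedate_to_datetime(pub_date)


def in_test_window(pub_date: str, now: datetime,
                   min_hours: int = 1, max_hours: int = 12) -> bool:
    """True if the item's age falls between min_hours and max_hours, inclusive."""
    age = item_age(pub_date, now)
    return timedelta(hours=min_hours) <= age <= timedelta(hours=max_hours)


# Made-up example: an item stamped 14:00 UTC, checked at 16:00 UTC.
posted = "Tue, 03 Jul 2007 14:00:00 +0000"
checked = datetime(2007, 7, 3, 16, 0, tzinfo=timezone.utc)
print(item_age(posted, checked))        # 2:00:00
print(in_test_window(posted, checked))  # True
```

Keep in mind that pubDate is whatever the site chooses to report, so this measures the lag from the claimed posting time, not necessarily from the moment the page went live.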
I couldn't find a single significant site that has news
articles posted with a valid RSS feed that didn't
seem to have all of its robot-visible content available for
full-text search on Google -- even those articles posted late in the
day. There are some borderline cases: for example, Google has
indexed PatentlyO's news text but not its comments (even on the
same page), and ESPN.com uses Javascript to display its comments, so
they are not searchable. But by and large, I've been amazed at the
breadth of this change and the fact that we haven't heard anything about it.
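One way to tell whether a page counts as "robot-visible" is to check the site's robots.txt before concluding that Google missed it. Here is a small Python sketch using the standard library's robots.txt parser. The rules and URLs are invented for illustration, and "Googlebot" is used as the crawler name on the assumption that the site's rules are written against it or against all crawlers.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that hides comment pages from every crawler.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /comments/
"""


def crawler_may_fetch(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


print(crawler_may_fetch(SAMPLE_ROBOTS, "http://example.com/article/123"))   # True
print(crawler_may_fetch(SAMPLE_ROBOTS, "http://example.com/comments/123"))  # False
```

This only covers robots.txt exclusions. It says nothing about comments that are drawn in by Javascript, which a crawler may be allowed to fetch but still never see.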
Implications
If my observations are correct, the implications of this are huge. Even given Google's history, this seems to push the boundaries of
what they are capable of to a new level: there is now almost no lag time
between posting a comment and the world finding it.
Researchers and reporters will undoubtedly find the new, timely
information invaluable.
It seems this could also magnify the power of "New Media", in that there is no
longer any time lag: as long as you are of a size that gets you on
Google's radar, and particularly if you are first with a story, you are in the game in a new way.
Perhaps I'm wrong
and misreading the tea leaves. If so, I'm sure a Groklaw member (or
two) will explain it to me! One way or another, something
interesting is going on.