louisrosenfeld.com logotype

Apr 09, 2002: Google, the Information Explosion, and ROT

Will Google scale?

Assume that web content is doubling at some insanely fast rate—five years, five months—you pick your favorite number. Given this volume, how can Google's popularity-based ranking algorithm keep up?

Sure, Google will know about most new content through spidering. But that doesn't mean you and I will know about that new content. After all, as it's new and won't have inter-site links to it yet, Google will retrieve the new content but rank it so low that most of us will never find it. And because we don't know it's there, we'll never link to it. Vicious cycle.

Therefore it seems that existing links will increase in value over time. In other words, the "cost" of obtaining a link in 2002 will be much less than the cost of obtaining the same link in 2007. If this is the case, Google's retrievals will be biased toward ranking older content, with its more established link networks, above newer content.

Which is great if you're interested in what information was valued in 2002, but bad for just about everything else. How come? Because the flip side to information explosion is information "ROT" (which stands for Redundant, Outdated, and Trivial). Older content degrades in quality over time, so why should it have an inordinate impact on the quality of trusty Google's retrievals?

Sorry if this is starting to sound like a rant for social security reform. I'll turn the question back at you: as information volume and ROT increase concurrently, will Google's ranking algorithm continue to be useful? Or will it be dangerous for us to rely so heavily on Google?
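The vicious cycle described above can be illustrated with a small sketch. The post does not describe Google's actual formula, so what follows is only a simplified, textbook-style PageRank power iteration, offered as an assumption about how a link-popularity ranking behaves. The page names, the toy link graph, and the damping value of 0.85 are all hypothetical, and freshness is not modeled at all.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank via power iteration (an illustrative assumption,
    not Google's production algorithm).

    links maps each page to the list of pages it links to; every link
    target must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its damped rank evenly among its outlinks.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical toy web: three established pages that link to each other,
# plus one new page that links out but has no inbound links yet.
web = {
    "old_a": ["old_b", "old_c"],
    "old_b": ["old_a", "old_c"],
    "old_c": ["old_a", "old_b"],
    "new_page": ["old_a"],
}

ranks = pagerank(web)
for page, score in sorted(ranks.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{page}: {score:.3f}")
```

In this toy graph the new page never rises above the baseline share, because nothing links to it, while the established pages keep passing rank among themselves. Whether real-world ranking behaves this way at scale is exactly the open question the post raises.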

Comment: Mathew (Apr 9, 2002)

I'm not sure that it is as bad as that; perhaps if the *only* way we all found new links was through Google, then it would be.

However, if I create a wonderful new site, I might email some friends of mine, I might post to relevant lists, groups, forums, and if it is any good people will start to link to it.

Google will rank it low, but then most of the early visitors will not find it by searching but by some other relationship, through an email footer or a link from a kind and popular friend.

Are there really sites now that became very popular solely through searches? The big sites I know have a buzz around them; they get sent virus-style around the web.

Google's ranking will continue to be useful as a shortcut to finding those pages which have already become popular, just as it is now.

Comment: Jerry Kindall (Apr 9, 2002)

Google is already using "freshness" in their ranking algorithms. Evidence of this can be seen in the fact that "Google bombs" come on strong at first but eventually descend on the results page. For example, you can't find Matt Haughey's original "Critical IP sucks" page on the first page of a search for "Critical IP" anymore.

Comment: Lou (Apr 9, 2002)

If freshness is figured in, it may mitigate the "senior" bias that I was referring to earlier to at least some degree. But the ratio of links to pages would drop anyway, and I guess that the ranking algorithm would still lose much of its value over time.

Comment: Andrew (Apr 9, 2002)

Google is like history. It only credits those who have the most sticky "links" over time in the most-published contexts.

Right now, Google is better than everything else, but it will eventually become just another layer in the strata of information retrieval methods. Like cities built on the ruins of cities. (A small shovel will easily uncover other strata below... Inktomi/Hotbot ... Northern Light ... Yahoo ... Alta Vista ...) Unless of course Google changes itself, in which case it'll be a whole new creature, just still *named* Google.

I'm curious to see if we'll evolve a social class of information taggers, a global tribe or sect, that is committed to tagging and passing along truly great information, regardless of what rock it might be found under? These people would all have their own predilections and eccentric interests, they would often link to one another's links, creating a redundant web of manually nurtured, living info-tissue.

Oh....wait.... those are bloggers!

Comment: Prentiss Riddle (Apr 10, 2002)

Right: the blogging phenomenon may have come along just in time to save Google from a scaling crisis, at least for now.

But for the record, it doesn't appear that blogs are the only source Google uses to find new and important content. I just did a few searches on recent headlines from the top page of my employer (http://www.rice.edu) and the more obscure university newspaper page (http://www.rice.edu/projects/reno/rn/current.html/). Google seems to be indexing both of them quite frequently, even though the bulk of our pages only get spidered about once a month. Google seems to have identified both of those pages as being like weblogs: frequently updated pages containing links to new and important content, and therefore worthy of frequent spidering.

In fact, Google *seems* to indicate this class of frequently spidered pages by putting "freshness dates" in its results. A page of results for the search "rice university" shows that most are presented without a date, but some -- typically "newsy" top pages for major departments -- list a date between the page size and the cache link.

If the folks at Google are as smart as I think they are, they have found a way to identify these "news" pages automatically. This is more scalable than the approach of Daypop (http://www.daypop.com/), which spiders a hand-picked list of blogs and news sites on a daily basis. Now, if only Google permitted advanced searchers to specify that they'd like to see results *only* from "news" pages, that might yield an excellent "current events" search (or "current memes and silly fads" as the case may be), the niche Daypop has carved out for itself.

To get back to your original question: there may be more fundamental reasons why Google won't scale. I suspect there are some basic exponential growth factors in their fundamental lookup and ranking formulas, but I'd also bet they've got some very good computer scientists whose job it is to come up with heuristics and short cuts to beat the growth. In other words, the absolutely correct answer to a query may take n**m microseconds to calculate, but they may have found a way to get a "good enough" answer to the same query in n or sqrt(n) microseconds.

Comment: Prentiss Riddle (Apr 10, 2002)

Argh. The comment script took closing punctuation to be part of those URLs. The correct URLs for the Rice pages can be left as an exercise for the reader, but here's a correct link to Daypop, which is certainly worth a look if you haven't seen it: http://www.daypop.com

Comment: Jeff Stuit (Apr 10, 2002)

Lou, I don't understand the premise of ROT. If "older content degrades over time", what does that say about great works of literature, sacred texts written in "dead" languages, old movies, or vintage 78 records? Or, if you want a more specific example on the web - what about anything published on http://www.bartelby.com/ ?

I've always thought that some content actually gets *better* as it ages.

Comment: Lou (Apr 10, 2002)

Jeffy: of course, some content is like fine wine. But most of what we create on a daily basis as part of our work, hobbies, etc. isn't. It goes unmaintained and we lose interest in it ourselves, so it has less interest for others. In the workplace, content becomes quickly out-of-date; so when you're looking for that corporate travel policy, you'll find a bunch of old ones along with the current one.

I humbly submit that most of the content we encounter while searching--whether Web-wide or on our intranets--is *not* ROT resistant. And that Don Quixote and Grimms' Fairy Tales are another much rarer story...

Comment: vpoptom (Apr 11, 2002)

I don't know why but I get a bit of a different feeling about Google as compared to the past search facilities. I believe Google will weather the storm and only a wholly new technology will unseat them. They are actually doing what the dotCOMbomb did in the 90's in working without a clearly defined business plan. But it seems clear that they want to excel at providing search results, believing the model will show itself somewhere down the line. I agree with them, I think they will succeed.

It is an interesting thing you noted in the ultimate irrelevancy of the most widely linked-to documents out there... I was just thinking about that yesterday. But the algorithms are out there somewhere to allow for the display of what it is you seek. I have this weird feeling that Google will remain on top of their game.

Comment: Jeff Stuit (Apr 11, 2002)

Here's something I find paradoxical: many an intelligent observer of the Web, including its inventor, has complained about the scourge of "linkrot". Well, what Google is supplying here are links that have definitely not rotted - they've stayed around a long time. Ironically, what are those links being labelled now? ROT!

An old link just can't get a break these days, whether it lives or dies! ;-)

Aren't we really just talking about values here? Whether something is ROT, linkrot, or otherwise, and whether being that way is good or bad, is just in the eye of the beholder, isn't it?

Comment: Lou (Apr 11, 2002)

I don't think we're talking values as much as context. Work on any corporate site or intranet and you'll find that ROT is a huge problem, even if that outdated and inaccurate content is interesting from an anthropological or other view point.

Reminds me of the archaeologists who are doing digs in the Fresh Kills landfill in Staten Island (or is it New Jersey?). They're thrilled with what they're finding, but to most it's still just garbage...

Comment: Prentiss Riddle (Apr 12, 2002)

FYI: I crossposted a pointer to this discussion on the usenet group comp.infosystems.search. To my surprise, a couple of people have posted thoughtful replies which actually attempt to address Google's scalability problems.

Here's Google Groups' own archive of the thread:

Comment: mss@interchange.ubc.ca (Aug 4, 2002)

google and web rot
