

Jul 29, 2005: Search Log Analysis and the Long Tail

As Vilfredo Pareto, Mr. 80/20, might have predicted, a site's search queries, when plotted, invariably form a Zipf distribution. For example, Rich Wiggins, who analyzes Michigan State University's search logs, tells me that "out of 250,000 unique queries, 500 or so at the top are 40% of the total, and 1000 or so covers 50% or more."

Search log analysis (SLA) is a rational attempt to make sense of these distributions by focusing on those most popular queries. Finding patterns among the popular queries helps us determine how to best allocate resources for improving the search experience. We might, for example, decide to develop best bet search results for the 100 most common queries, and see if we need to plug content gaps for those top 500 queries that retrieve 0 results. In IA terms, a little of this type of effort can go quite a long way.
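
A quick sketch of that short-head workflow, in Python. Everything here is illustrative: the toy log, the set of zero-result queries, and the 100/500 thresholds are assumptions lifted from the numbers above, not part of any real SLA tool.

```python
from collections import Counter

def summarize_log(queries, zero_result_queries, top_n=100, gap_n=500):
    """Count query frequencies, pick best-bet candidates from the head,
    and flag popular queries that returned no results (content gaps)."""
    counts = Counter(q.strip().lower() for q in queries)
    head = counts.most_common(gap_n)
    best_bet_candidates = [q for q, _ in head[:top_n]]
    content_gaps = [q for q, _ in head if q in zero_result_queries]
    return best_bet_candidates, content_gaps

# Hypothetical log: "maps" dominates; "zipf posters" retrieves nothing.
log = ["maps"] * 5 + ["tuition"] * 3 + ["zipf posters"] * 2
bets, gaps = summarize_log(log, zero_result_queries={"zipf posters"})
```

Here `bets` leads with "maps" (the best-bet candidates), while `gaps` surfaces "zipf posters" as a popular query worth a content fix.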

Addressing the top 500 queries makes good sense because it could improve the user experience for 40% of those who search our sites. But what about all those other queries? There has been quite a lot of discussion recently about the value of the "long tail" in e-commerce and in many other contexts. I wonder if we're missing out on important opportunities by ignoring the long tail of search? In the MSU example, those 249,000 esoteric queries do account for a whopping 50% of all searches. Is that a market information architects should seek to serve? Or what about the "middle tail," the 500 queries that account for 10% of all MSU searches? Nothing to sneeze at.

I'd love to hear from anyone who's looked at their logs' middle and long tails, and whether what was learned justified the investment of effort. What did you learn, and how?

Performing SLA on random samples from the middle and long tail segments and hunting for patterns might be a good starting point. But standard log analysis just helps us figure out what's popular, and how we can make sure popular queries perform better. Should we be looking to learn something different from the more esoteric queries found in the long and middle tails? I wonder what useful secrets we may find among the unpopular queries.
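
That sampling idea might look something like this sketch: segment the ranked queries by cumulative share of search volume (the 40%/50% cut points echo the MSU figures and are placeholders, as is everything else here), then draw random samples from the middle and long tail for manual pattern-hunting.

```python
import random
from collections import Counter

def sample_tails(counts, head_cut=0.4, mid_cut=0.5, k=50, seed=1):
    """Split unique queries into head / middle / long tail by cumulative
    share of total search volume, then randomly sample both tail segments."""
    total = sum(counts.values())
    segments = {"head": [], "middle": [], "tail": []}
    cum = 0
    for query, n in counts.most_common():
        cum += n
        share = cum / total
        if share <= head_cut:
            segments["head"].append(query)
        elif share <= mid_cut:
            segments["middle"].append(query)
        else:
            segments["tail"].append(query)
    rng = random.Random(seed)
    return {name: rng.sample(queries, min(k, len(queries)))
            for name, queries in segments.items() if name != "head"}
```

Feed it a `Counter` built from a raw query log; the returned samples are what a person would then eyeball for patterns.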

Maybe analyzing search's tail could help us:

  • Uncover hidden users and, potentially, market segments (e.g., frequent occurrences of queries in a foreign language), although this is really another hunt for popularity
  • Identify ongoing, stable information needs in situations where fads often spike and show up among the most common queries
  • Learn about rising trends, assuming we track a query over time as it migrates from long tail to middle to popular (the reverse would be true as well, of course)
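
The trend-tracking idea in that last bullet is easy to sketch: keep one frequency count per period and watch a query's rank move. The period logs and query names below are hypothetical.

```python
from collections import Counter

def rank_of(counts, query):
    """1-based popularity rank of a query in one period's log (None if absent)."""
    for rank, (q, _) in enumerate(counts.most_common(), start=1):
        if q == query:
            return rank
    return None

def rank_history(period_logs, query):
    """A query's rank across successive periods; a shrinking rank number
    means it is migrating from the tail toward the popular head."""
    return [rank_of(Counter(log), query) for log in period_logs]

# Three hypothetical monthly logs: "b" climbs from the tail to the top.
months = [["a", "a", "b"], ["a", "b", "b"], ["b", "b", "b", "a"]]
```

Here `rank_history(months, "b")` comes back as `[2, 1, 1]` — a rising trend worth a closer look.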

Other potential benefits? If nothing else, I guess we'd learn just how strange users are when they're filling out that little search box on our sites.


Comment: Alex (Jul 29, 2005)

Isn't the long tail in search kind of a focus/benefit of existing folksonomy and social tagging services? del.icio.us is often cited, but now there's even Yahoo's My Web 2.0.

Comment: Peter Boersma (Jul 30, 2005)

Lou,

Maybe I misinterpreted Rich's quote, but when you say "249,000 esoteric queries do account for a whopping 50% of all searches" I thought that would have been only 10% (100-40-50). Oh wait, now that I read it again, it could also be interpreted as 50% (100-50). The "or" in "or more" did it. I guess that makes more sense :-)

Anyway: Would you say it's a viable option to combine SLA with WTA (Web Traffic Analysis) by mapping the time-of-search to the actual web traffic log and seeing where & when visitors decided to switch to search? It would also tell you what percentage of users don't even try to navigate but turn to search immediately (a sign of the dauntingness of your homepage?).

Wonder if people have tried that approach...

Comment: Jay Fienberg (Jul 31, 2005)

Re, Alex's comment above: I'm currently working on an enterprise IA project where we're working to integrate end-user and content-creator tag results in with word-search results. (So, imagine del.icio.us tag pages folded into federated search results.)

The intention is definitely for the search system to more responsively bridge differences between word content, content-creator terms, and end-user terms. And we hope that esoteric search terms will end up being used as tags that connect relevant content with those terms.

I'm currently trying to figure out whether what we're designing for tagging will itself be a sufficient mechanism for IAs + admins and/or content creators to tag items for "best bet" results, e.g., based on search log analysis, based on tag term usage, etc.

(Because of the importance of various controlled vocabularies of terms inside the enterprise, we're having to evaluate the trade-offs of having only "tags" vs. multiple types of tags for free-form and pre-set vocabularies. Best bets might end up as their own type of tag.)

In general, I think this is an interesting area to look at in terms of combining techniques that can allow end-users to improve search results for themselves and each other, and that allow IAs / admins / content creators to use analysis and domain knowledge to shape search results.

Comment: Margaret L Ruwoldt (Jul 31, 2005)

Over the last couple of years I've done a long tail analysis or three, based on a sampling method described by Martin Belam: http://www.currybet.net/articles/day_in_the_life/index.php

At the University of Melbourne (.au) our long tail of unique search terms contains a significant proportion of people looking for the same *types* of information. Examples: searching for a person's name as a way of finding their email address and phone number; or searching for a course/subject code (3 or 6 numeric digits) as a way of finding lecture notes and other online-learning material.

Doing the maths on the long tail was a way of demonstrating that a 'gut feeling' reading of daily search logs was reasonably accurate. It also confirmed what we already knew from users' unsolicited feedback about the web site and from our own usability testing.

In our case, the long tail analysis points to stuff that's harder to fix than simply reworking web pages: it's more about enterprise IA than about just 'the web site'.

The long-tail statistical evidence is a starting-point for making improvements to the University's web presence. I don't see much value in doing the analysis frequently -- perhaps once a year will be sufficient, given the time it takes to make significant changes to the enterprise IA.

Comment: Todd O'Neill (Aug 1, 2005)

I do think combining SLA and WTA can be beneficial. We have "not great" WTA, but we have developed pretty good SLA skills.

One thing that we did early on was to capture the referer URL with our queries so that we could see "where" someone searched from. We are in a very controlled environment in which users log in or are automatically logged in to our web properties.

That means we have access to other information about users like address, age, etc. There is some information that we deliberately don't (or won't) collect, but there are other bits that will be great for future SLA efforts. In one case we collect the user's computer address in the search log which is used to troubleshoot technical issues.

Comment: Lou (Aug 1, 2005)

Peter and Todd: I think the marriage of WTA and SLA would be incredibly useful as it would enable us to get a truer sense of what a "session" really is. But it'll be a Holy Grail until server and search engine vendors agree to some standards to make their respective logs interoperable.

Comment: Lou (Aug 7, 2005)

Rich Wiggins has done a lot of thinking on this issue lately, and has blogged it here:

http://wigblog.blogspot.com/2005/08/long-tail-and-short-head-of-zipf-curve.html

Comment: Rich Wiggins (Aug 8, 2005)

Here's one way to summarize how long a long tail can be. I took a sample of 31,000 searches at Michigan State from July, and analyzed how many unique search strings it takes to achieve a certain percent of all searches performed. Here's a summary:

   10 unique searches account for 10% of the total
   32 account for 20%
   85 account for 30%
  203 account for 40%
  441 account for 50%
  925 account for 60%
 2182 account for 70%
 5617 account for 80%
13770 account for 90%
31000 account for 100%

At the end of the long tail you find a huge number of searches that are performed only once. Most of the "long tail" has too few searches to make any manual Best Bets or taxonomy efforts cost effective.
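
For anyone who wants to produce the same kind of summary from their own logs, here's one way it might be computed (a sketch under my own assumptions, not necessarily how Rich did it): walk the queries in rank order and record how many unique strings it takes to reach each decile of total search volume.

```python
from collections import Counter

def coverage_points(counts, targets=(0.1, 0.2, 0.3, 0.4, 0.5,
                                     0.6, 0.7, 0.8, 0.9, 1.0)):
    """For each cumulative-share target, the number of unique query
    strings (ranked by frequency) needed to cover that share of searches."""
    total = sum(counts.values())
    remaining = list(targets)
    points = {}
    cum = 0
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        cum += n
        while remaining and cum / total >= remaining[0]:
            points[remaining.pop(0)] = rank
    return points
```

Run over a real log's `Counter`, this yields a rank-per-decile table of the same shape as the one above.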

More info and illustrative graphs appear at:

http://wigblog.blogspot.com/

/rich

PS -- I'd LOVE to compare your Zipf curve with ours. I've done this for other universities and they are remarkably similar.

Comment: Jeff Werness (Aug 15, 2005)

Based on extensive review at a previous employer, there were decidedly distinct approaches to ongoing short- and long-tail analysis:

1) Review top 100 queries; combine concepts, and create multiple methods of site promotion and 'best bets' for the resulting ~30 concepts. The top 100 search strings accounted for ~60 percent of all queries (conceptually).

2) Review the next 400 most-common queries. Deduce the concepts (this usually resulted in about 80-100 unique concepts). Feed this data back to merchants for use in promotions and forecasting over time.

3) Review *some* of the remaining search strings for unusual activity and/or provide answers to specific questions from merchants or other staff, e.g., "has anyone searched for..." or to answer questions for ourselves on naming conventions.

This activity was based on 700K daily queries. During the holidays this volume doubled to 1.4M to 1.7M daily queries. Due to the volume, we were never able to retain or review all queries (terabytes of search data eventually become unusable as language changes over time), so we stored only the top 2000 daily queries.

The long-tail data analysis is useful for some sites and not others. Smaller sites may be able to effectively deal with that volume after the top most-requested queries, but for significantly larger sites, it becomes an analysis in futility because there's so much data that nothing can be done with it in an operationally efficient manner.

Search logs are great for:
- understanding what people can't find
- understanding what people want
- making things more findable through 'best bets' and other promotional activities
- understanding how language changes over time
- understanding how naming conventions can help improve navigation

But search logs are not useful in fixing a single customer's inability to find a $2.99 replacement part for their widget. From a product development standpoint, there are other means to help customers do that, like configurators and parts matchers, and lots of other things that are specific to other industries.
