Jun 28, 2004: Search and Taxonomy--Why Separate Teams?

A friend who works in a mammoth enterprise environment recently sent me this very good question:

Do you have any opinions, quotes, pointers to quotes, etc. regarding the need for search teams to work closely with taxonomy teams? I don't see how they shouldn't, as our core competencies fall on both sides of the information fence: information organization and information retrieval.

Yep, I definitely have an opinion on this.

It's my experience that these two groups are typically separate because search is (mis)understood to be a technology. "Search? Oh, you must mean that search engine that we just purchased." Because it's viewed as a technology, search ends up in the hands of an IT or IT-affiliated group who are often qualified to support only the technical aspects of search. And of course, search is much more than a technology: there are all sorts of user, content, and business analyses that should inform how search systems are designed. The data produced by these analyses help us understand how search interfaces, query languages and query builders should be designed, and how search results should be presented. There's much to good search system design besides the algorithms and technical characteristics encoded in the piece of software known as a search engine.

And interestingly, good taxonomy development utilizes many of the same analyses. But taxonomy teams often develop separately from search teams. CMS or portal development teams, often grappling with metadata schemas, controlled vocabularies, and tagging, are commonly the owners of corporate taxonomies; in other cases, the owners are people less tied to a specific technology, like corporate librarians and knowledge managers.

It's also my experience that users search and browse, often utilizing both methods within the same finding experience. A critical goal of information architecture is to blend these tools in an elegant way, blurring irrelevant back-room distinctions between search and taxonomy teams. Why irrelevant? These teams are falsely separated because we too often define search and taxonomies by their supporting technologies. Or for purely political reasons.

To rant a bit, it really drives me nuts to hear people talk of "search and IA" (which they often understand as browsable taxonomies). This is an absolutely false distinction, and leads to poor search design, poor taxonomy design, and perhaps worst of all, missed opportunities to better integrate the two to support finding, IA's ultimate goal. For example, search often is greatly improved when it leverages metadata tags. Metadata therefore should be designed with search in mind. So why separate teams? I don't see any good reason, just a lot of bad ones.

Comment: Rich Wiggins (Jun 28, 2004)

Yes, too many places put total faith in the robotic search engine. They think they should carefully design the browsing view or the taxonomy, and leave all the search results to the robot. They don't analyze search logs. They don't do Best Bets; with Best Bets, you plan your search results as carefully as you plan your browsing view.

Recently I've been looking at university help desk sites. It's amazing how many just let Infoseek serve up whatever its algorithm chooses -- and usually it's not helpful.

Comment: Prentiss Riddle (Jun 28, 2004)

How can I get Google to hire you? My employer bought a Google Appliance, truly a lovely machine, but it pretty much offers just one way to do things. The Google way is probably the best one-size-fits-all solution out there, but unfortunately one size never really does fit all. When my users ask how to use the Appliance in some other way, I continually have to tell them they can't.

To name a specific example: Google mostly ignores metadata tags. That's because Google was developed in a global Web environment where it has to guard vigorously against spam and bait-and-switch pages, which means it tries hard to index only what a user sees, not hidden content like meta tags. But inside an organization malicious markup is less of an issue. It would be nice if there were options on the Appliance to tell it to pay attention to meta tags after all.

Comment: Lou (Jun 28, 2004)

Due to their excellent reputation in Web search, Google has managed to maneuver themselves into a wonderful business position: they get away with placing their own business model far ahead of their clients'. The clients don't seem to mind. At least for now.

It'll definitely be interesting to see if Google abandons the "black box" approach down the road. I can't see how they can continue offering search services in isolation from other ways to find content.

Comment: Mark Thristan (Jun 29, 2004)

I am in complete agreement on this. Everyone talks about "search" as if it were a plug-in concept. At a recent meeting of UK enterprise-level intranet managers, I asked what was the balance they took between precision of matches and recall of matches in information retrieval. I was somewhat surprised that most had no idea what I was talking about - they all viewed search as an "out of the box" commodity. Some of these companies had some serious taxonomies in place, but had not really considered leveraging these to add some horsepower to search...
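For readers who have not met the two terms: precision is the share of returned results that are relevant, and recall is the share of relevant documents that were returned. A minimal sketch of the arithmetic follows; the document IDs and relevance judgments are invented for illustration and do not come from any of the intranets mentioned.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: the engine returns 4 documents, 5 are truly relevant.
retrieved = ["doc1", "doc2", "doc3", "doc4"]
relevant = ["doc2", "doc4", "doc5", "doc6", "doc7"]

p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.40
```

Tuning a search system usually trades one against the other: returning more results tends to raise recall and lower precision.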

Comment: ML (Jun 29, 2004)

I think it's quite a challenge in many organizations that they see them as separate. I'm somewhat fortunate to be both in charge of how search is implemented and maintained (including search log analysis) and tying it back into the web process by sharing with our web team. Another group I work with is the library, in helping establish a taxonomy to reflect the business material/search logs/site objectives.

As for Google...it's a shame that techie folks fall for the "star" of internet searching. Enterprise search is very different from internet search. The irony here at Stanford is the business school and SLAC are using Ultraseek while the main university uses Google. SLAC/GSB had specific needs and we knew during our (GSB) requirements assessment that Google didn't fit our needs.

It's kind of fun to balance search/taxonomy/IA with many folks...I'm going to miss the mix and challenges.

Comment: John O'Gorman (Jun 29, 2004)

This general topic - the persistent gulf between the technical side of things and the pragmatic - has been something I've 'ranted' about for years. Technical solutions - like buying a search engine - almost guarantee the isolation of the technology from the problem it is trying to solve. Google's 'solution' is, as one poster put it, very robust in a narrow application, but users expect more and are frustrated when their understanding runs into these limitations.

Another example of this kind of disconnect is in the field of structured authoring. This discipline is supposed to (in the final analysis) make it easier for users to create, store, and retrieve topics. Users in this context refers to both internal authors and editors and external 'task oriented' end users.

The problem is that structure is related more to the delivery of content than to the relevance of the content. This creates two problems: The first is that the technologies and architectures supporting structured authoring deal almost exclusively with containers: Books, Sections, Chapters, Titles, Paragraphs, Lists (bulleted and otherwise) which make it easier for writers to concentrate on the content and leave formatting and publishing to the technology behind the scenes. This separation of content and format is seen as a great re-use feature, but the question is: Re-use of what? and by whom?

The second problem with this approach to structured authoring is that putting the focus on containers relegates metadata and the taxonomies to second-class status.

In my view the direct effect is that while containers are managed with great discipline, the tags and taxonomies are left to diverge: authors use terms that are not consistent, new terms create new branches in the taxonomy, etc. The net effect is that end-users cannot retrieve the information they want.

The solution is to further separate containers (delivery components) from content (information), and make management of each of these classes equally robust. To do this, and to get back to the original issue, the search and taxonomy teams must work very closely with each other and with the technologists who will eventually recommend their part of the solution.

Comment: John O'Gorman (Jun 29, 2004)

This is a topic that I have struggled to understand for quite some time. The application of technical solutions to soft issues like findability is like the 'if I'm a hammer every issue is a nail' approach.

The problem is that there are currently no robust architectures to counter this approach. In my view, as soon as IT gets hold of a business problem like content search and retrieval you can pretty much guarantee that you've lost control of the solution.

The answer is to develop an approach to managing enterprise information assets that is viable *regardless* of the technology that IT subsequently chooses to implement to support it.

In my view, the primary question that a content delivery system needs to answer is: What is my customer doing with this information? Taxonomies, controlled vocabularies, and content property tables should all be designed with this question in mind. If Google does not support this goal, move on.

Comment: Lou (Jun 29, 2004)

John, great points. I agree that we sometimes spend time on containers/structure at the expense of semantics. For example, in the context of metadata, we often focus on metadata attributes/fields (e.g., "subject," "product name") and then--oops--we forget how challenging it is to develop useful metadata values (e.g., "cellular phone," "v60i").

Anyway, I've grappled a bit with showing how structural and semantic issues need to be accounted for in metadata development in this diagram:


Comment: Rich Wiggins (Jun 30, 2004)

I've tried to come up with a mantra or slogan to convey the concerns expressed here:

"Search is too important to be entrusted entirely to a robot"


"Why do we agonize over every detail of the browsing view but hand over the search view to a robot?"

or, echoing what Lou led off with:

"Search is as much a part of your information architecture as the browsing view."

Alas, I can't get Alison Krauss to turn any of those into a country song. :-)

Prentiss, doesn't the Google Appliance include a Best Bets feature? You could do log analysis and manually goose the most popular items to the top of the hit list, simulating what Meta tags would accomplish.
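As a rough illustration of that log-analysis-plus-Best-Bets loop, here is a minimal sketch. It is not the Google Appliance's feature or API; the log lines, URLs, and function names are all hypothetical. It counts the most frequent queries in a search log, then pins hand-picked pages ahead of whatever the engine returns.

```python
from collections import Counter


def top_queries(log_lines, n=3):
    """Count normalized queries in a search log and return the n most common."""
    counts = Counter(line.strip().lower() for line in log_lines if line.strip())
    return counts.most_common(n)


# Hypothetical hand-picked Best Bets, keyed by normalized query.
BEST_BETS = {"vacation policy": ["/hr/vacation.html"]}


def with_best_bets(query, engine_results, best_bets=BEST_BETS):
    """Put any Best Bets for this query ahead of the engine's own results."""
    pinned = best_bets.get(query.strip().lower(), [])
    rest = [url for url in engine_results if url not in pinned]
    return pinned + rest


log = ["vacation policy", "Vacation Policy", "parking",
       "vacation policy ", "parking", "benefits"]
print(top_queries(log, 2))
# [('vacation policy', 3), ('parking', 2)]

print(with_best_bets("Vacation Policy", ["/news/1.html", "/hr/vacation.html"]))
# ['/hr/vacation.html', '/news/1.html']
```

The log tells you which queries deserve a Best Bet; a person still chooses which page answers each one.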

Comment: Avi Rappoport (Jul 1, 2004)

I agree entirely -- search and topical taxonomies have a lot to offer each other, and it's self-defeating to separate them. I'm strongly encouraging my clients to set up processes to provide a lot of communication. My search clients are feeding them user vocabulary as expressed in search logs, and setting up systems to search zones within the taxonomy. The taxonomy people sometimes identify areas that are not being indexed, which helps the search improve coverage.

The other aspect relates to user experience of the site (or intranet). Taxonomy people are much more interested in vocabulary and usability. Best Bets and synonyms and suggestions for additional content are continuing issues (and opportunities) that fit much better with taxonomies than the technical issues of search. The last thing we need is duplicate effort to identify these elements. I encourage my clients to integrate an ongoing process to manage Best Bets and synonyms as part of the taxonomy maintenance.

In other words: librarian full employment act!

Comment: Walter Underwood (Jul 2, 2004)

People don't tend to think of taxonomy as technology. It is much older than computers, so it must be easy and well-understood, right? Search engines are also simple, but they are mysterious and must be tended by specialists in white coats.

Here is one way to tie them together: every query makes an ad hoc category. The search results are a stab at populating that category. Some queries match common categories, like "HR: Vacation and Holidays". Carefully populating those categories will improve productivity when that matches a query.

Check out the US Geological Survey for a decent example of categories and search. There are some things I'd improve, but it is pretty good. http://search.usgs.gov/

Disclaimer: USGS uses Ultraseek CCE. I work on Ultraseek and wrote CCE.

Prentiss: I did a CCE taxonomy for Rice to exercise the product on my development box. Want it?

Also, there is often a mismatch between search and content. Part of writing for search is to think about what kind of answer your document is. What user questions does it answer? If it doesn't answer any, you can drop it. If it does, make sure that it can be found by those queries, and that the title reflects the answer. If it doesn't look like an answer in the search engine, it isn't really an answer.

Comment: John O'Gorman (Jul 3, 2004)

I have a question. First an observation. I like the USGS site from the search point of view - nice job, Walter - and I understand, given the diverse nature of the audience and the range of material that the USGS covers, that the potential vocabulary is quite large. My question, especially in light of Walter's comments about writing content for search, is this:
In an environment like a corporation, is the vocabulary sufficiently small to implement a more formal tagging method, where authors, upon completion and/or submission of a piece of corporate content, are asked to provide 6 - 10 values that would make the content easier to find?

Comment: John O'Gorman (Jul 8, 2004)

Another observation/question pair for the forum.

The reason that search has to be done at all is because the information the user is looking for is part of a much larger delivery context. It might be buried in a book, in a Help system, and/or a website. A side-bar observation is that someone has categorized the information in such a way that it is not evident to the user just where in the taxonomy or structure of the delivery context it sits.

Here's the question. Could we obviate the need for taxonomies, hierarchies, and keywords if we focused author activity into answering one question at a time: like - what is the user objective that makes this information useful? In a related field, would this make activities like content analysis equivalent but not limited to document deconstruction with the same objective in mind?

Comment: Lou (Jul 8, 2004)

Interesting suggestion; but wouldn't you still need some sort of thesaural capability to handle synonyms?

Comment: John O'Gorman (Jul 11, 2004)

Lou, I think I know what you mean - and maybe we should do this offline - but could you give me an example of how different users might use synonyms to achieve the same objective? Thanks.

Comment: Lou (Jul 12, 2004)

Sure 'nough: let's say an author described her document as being about "policies". But you'd search for "guidelines" and I'd search for "procedures". You and I are both out of luck because her document, though relevant to our information needs, wasn't retrieved by either of our queries. We all have the same objective, you, me and the author, but without some sort of synonym handling in place, none of us achieve that objective.
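The policies/guidelines/procedures case can be sketched with a simple synonym ring, where every term in a group is treated as equivalent at query time. This is only an illustration under stated assumptions: the ring contents, document IDs, and function names are hypothetical, and a real thesaurus would also carry preferred terms and broader/narrower relationships.

```python
# Hypothetical synonym ring: every term in a group is treated as equivalent.
SYNONYM_RINGS = [
    {"policies", "guidelines", "procedures"},
]


def expand_query(term, rings=SYNONYM_RINGS):
    """Return the query term plus every synonym that shares a ring with it."""
    term = term.lower()
    expanded = {term}
    for ring in rings:
        if term in ring:
            expanded |= ring
    return expanded


def search(term, documents):
    """Return IDs of documents whose description contains any expanded term."""
    terms = expand_query(term)
    return [doc_id for doc_id, description in documents.items()
            if terms & set(description.lower().split())]


# The author described her document with the single word "policies".
documents = {"hr-doc": "vacation policies", "menu": "cafeteria menu"}

print(search("guidelines", documents))  # ['hr-doc']
print(search("procedures", documents))  # ['hr-doc']
```

Without the ring, both queries return nothing, which is exactly the failure described above.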

Comment: John O'Gorman (Jul 12, 2004)

Good example. This is where I would make a pitch for orthogonal attribution and search. It is simpler and easier than attempting to maintain synonym handling, and practically guarantees a hit every time.

'Orthogonal' is a fancy way of saying that two or more dimensions intersect at right angles. It is another way of saying that values on one dimension are completely independent of values on another dimension. Grids on a map (two dimensions) work this way, and objects in three dimensional space can be located using one value from each of three orthogonal dimensions. Orthogonality guarantees that given any three values (coordinates) and a point of reference, a person can locate their objective (content).
In the example you cite, all three values: guidelines, policies, and procedures could be seen as largely equivalent on one dimension. (Although, in my model guidelines and policies are 'Why' values, while procedures are 'How', but that can wait.)
To make the search more accurate, additional orthogonal attributes are needed. In the Q6 architecture, this is accomplished by associating existing entities to the content:

An assetperson entity - say "ABC Corporation" to assume the role of Owner.

An infoperson entity - say "Mary Jane Whitford" to assume the role of Author.

An assetfunction entity - say "Human Resources" to indicate the Discipline responsible for the content.

A deliverytype entity - say "Guide" to show which templates (containers) were used in the creation of the output.

An infolocation entity - say "abccorp/intranet/hrguide.xml" which is the location of the information in the most flexible format.

An infotemporal entity - say "Current" to show the status of the content.

An assetprocedure entity - say "Review" to indicate a user objective.

Now, when the user goes to search for the document, inclusion in the search string for two or more orthogonal attributes: say "Review the Current ABC Corporation Human Resources Guide" should return proximate results.

That's not to say that you couldn't include a thesaurus as well, but with the awareness that values should be carefully screened to ensure that orthogonality is not compromised.
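One way to picture the attribute matching described above is the sketch below. It is not the Q6 architecture itself, only an illustration that borrows the entity names from the list; the second content record, the two-attribute threshold, and the crude substring matching are assumptions made for the example.

```python
# Each content item carries one value per attribute (entity names from the list above).
CONTENT = [
    {
        "assetperson": "ABC Corporation",
        "infoperson": "Mary Jane Whitford",
        "assetfunction": "Human Resources",
        "deliverytype": "Guide",
        "infolocation": "abccorp/intranet/hrguide.xml",
        "infotemporal": "Current",
        "assetprocedure": "Review",
    },
    {   # Hypothetical second item, added only so there is something to rank against.
        "assetperson": "ABC Corporation",
        "infoperson": "Sam Example",
        "assetfunction": "Finance",
        "deliverytype": "Guide",
        "infolocation": "abccorp/intranet/finguide.xml",
        "infotemporal": "Archived",
        "assetprocedure": "Review",
    },
]

# Attributes a user might plausibly mention in a search string.
SEARCHABLE = ("assetperson", "infoperson", "assetfunction",
              "deliverytype", "infotemporal", "assetprocedure")


def matched_attributes(query, item):
    """Names of attributes whose value appears in the query (substring match)."""
    q = query.lower()
    return [name for name in SEARCHABLE if item[name].lower() in q]


def search(query, content, minimum=2):
    """Locations of items matching at least `minimum` attributes, best first."""
    scored = [(len(matched_attributes(query, item)), item) for item in content]
    scored = [(score, item) for score, item in scored if score >= minimum]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item["infolocation"] for score, item in scored]


query = "Review the Current ABC Corporation Human Resources Guide"
print(search(query, CONTENT))
# ['abccorp/intranet/hrguide.xml', 'abccorp/intranet/finguide.xml']
```

The HR guide matches five attributes and the finance guide three, so the closer match ranks first.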

Comment: Lou (Jul 13, 2004)

John, sounds very similar to faceted classification. And as with facets, each of the entities you describe could contain vocabularies (or values) for which there are synonyms. "ABC Corporation" might also be listed as "ABC". "Mary Jane Whitford" could also be listed as "Whitford, MJ". "Current" might be the same as "new". Etc...

What is good about this approach is that if enough entities are employed, you do get closer to relevant content. However, that assumes that content authors/editors/taggers will invoke so many entities at the point of publication, and, even less likely, users will employ so many in their own queries. Search log analysis shows them to be much lazier than that. :-)

Now, if you're working with a corpus of data, rather than semi-structured text, you may have better results with this approach. "How many widgets were sold in the southeast sales territory during the third quarter of 2002?" is in line with this approach, and the sort of question we do expect to encounter with data retrieval. But most web sites consist chiefly of text.

Anyway, I'm probably misinterpreting your suggestion; does any of my response make sense? Thanks for posting your thoughts here; it's a fun topic.

Comment: John O'Gorman (Jul 13, 2004)


What I had in mind was a database-driven intranet where a search would create a SQL statement. The results of the search would then present the user content consistent with their objective.
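A minimal sketch of a search that builds a SQL statement, using Python's built-in sqlite3 module with an in-memory database. The table layout, column names, and sample rows are assumptions for illustration only, not a description of any real intranet schema.

```python
import sqlite3

# Hypothetical attribute columns a search form is allowed to filter on.
COLUMNS = ("owner", "discipline", "delivery_type", "status")


def build_query(filters):
    """Turn {attribute: value} pairs into a parameterized SELECT statement."""
    unknown = set(filters) - set(COLUMNS)
    if unknown:
        raise ValueError(f"unknown attributes: {sorted(unknown)}")
    clauses = [f"{name} = ?" for name in filters]  # names are whitelisted above
    sql = "SELECT location FROM content"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, list(filters.values())


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE content "
    "(location TEXT, owner TEXT, discipline TEXT, delivery_type TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO content VALUES (?, ?, ?, ?, ?)",
    [
        ("abccorp/intranet/hrguide.xml", "ABC Corporation",
         "Human Resources", "Guide", "Current"),
        ("abccorp/intranet/finguide.xml", "ABC Corporation",
         "Finance", "Guide", "Archived"),  # hypothetical row
    ],
)

sql, params = build_query({"discipline": "Human Resources", "status": "Current"})
rows = conn.execute(sql, params).fetchall()
print(sql)   # SELECT location FROM content WHERE discipline = ? AND status = ?
print(rows)  # [('abccorp/intranet/hrguide.xml',)]
```

Values are passed as parameters rather than pasted into the statement, and attribute names are checked against a fixed list, so user input never becomes part of the SQL text.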

I agree that the challenge of semi- and un-structured content is a daunting one, but convergence of data and content is the holy grail of enterprise content management applications, is it not? :~) I think the right architecture, one that incorporates the speed and depth of databases and the flexibility of web-based presentation (combined of course with orthogonal attribution), is one of the keys to the kingdom.

This orthogonality is already evident in the way users frame their objectives (your example is a good one) and if content contributors could be encouraged to employ this technique when describing their contributions, we would be able to do something like voice-driven queries.

A guy can dream, right?

Comment: Lou (Jul 13, 2004)

You bet! And I think you've really hit the nail on the head with the data/text convergence. It's happening--albeit slowly--but it's just a matter of time...

Comment: John O'Gorman (Jul 18, 2004)

Lou, I checked out the FacetMaps site and it is exactly what I was looking for. I really like the idea of multi-directional linking and being able to create views that have much more flexibility than single dimension hierarchies.

Thanks for the good discussion and input. I will let you know how the synergy goes between Q6 and FacetMaps.

Comment: Jeff Werness (Jul 18, 2004)

For more than four years I struggled with this very question at a large consumer electronics site where I led the information architecture practice. In the end, IA managed search, navigation, and taxonomies; however, we worked very closely with content management and usability engineering.

At the onset, there was no concept of taxonomies or why they're important, let alone how information architecture, content management, search, usability, and content organization and structure all related to one another. In that environment, it was easy to sweep taxonomies into the IA capability due to the lack of knowledge, and the fact that it was more of a brain powered aspect of the content -- it wasn't sexy like the internet was in 1998. :-)

As far as the underlying justification for the search team and the taxonomy teams to work closely, or even be organizationally part of the same team, my own experience says it's absolute, and as the complexity of the content, the user goals, and the business needs increase, it becomes all the more important.

Many industries struggle with a significant gap between internal classification and labelling systems and the fast-evolving language of their customers. The consumer electronics industry is notorious for this but it happens everywhere. The solution is to closely align the IA, taxonomy, and language experts if search is ever expected to be effective.

Comments are now closed for this entry.

Comment spam has forced me to close comment functionality for older entries. However, if you have something vital to add concerning this entry (or its associated comments), please email your sage insights to me (lou [at] louisrosenfeld dot com). I'll make sure your comments are added to the conversation. Sorry for the inconvenience.