Oct 12, 2006: WWYD #1

It might be nice to run a semi-regular feature in Bloug called "WWYD". No, smartypants, that doesn't stand for "What Would Yoda Do?". It's "What Would You Do?". Someone asks me a tough question, I take a stab, blog both question and answer, and seek out additional suggestions (and perhaps critiques of my answer) from Bloug readers. I've done this a few times before on Bloug; thought it might be fun to do it more regularly, so if you have a tough IA question that you'd like to get a variety of opinions on (including my own), send it my way.

OK, so let's get started. The first and official WWYD is a question about whether or not to give up on using a taxonomy in an enterprise setting. It's from an attendee at one of my recent enterprise IA seminars (who wishes to remain anonymous):

We recently bought and installed the Google Mini to enhance the search function on our website. From what I understand, it crawls the full text of every piece of data we have and brings up results that way. It also automatically indexes the information via the URL patterns.

A few years back, our organization invested quite a bit of money and resources to create an elaborate taxonomy to organize all the materials we have for the web site. It is very complex and a bit cumbersome to put into practice: each item and web page uploaded to the web site is classified according to the taxonomy. As you can imagine, this is labor intensive and can be prone to human error if the person classifying it does not know what he or she is doing. Also, we have a large amount of material that is unclassified, as people through the years have bypassed the classification system and just uploaded items to the CMS.

Our problem is thus: Should we do away with the taxonomy altogether and just let the Google Mini crawl? Is there a way to merge the complexity and richness of the taxonomy with the capabilities of the Google Mini? Where should I turn for information, or to find out whether other organizations have dealt with this problem before?

What Would Lou Do: I don't know enough about the Mini, but I'm guessing that, as it's a Google product, it won't take advantage of existing tagging. You should check their documentation, but Google typically avoids relying on metadata for a variety of reasons, and I'm not sure it can be configured to look at yours.

That said, if you've invested heavily in enterprise metadata, the problems with its adoption may have more to do with a poorly conceived content authoring/management workflow than anything else. People are notoriously bad at using a taxonomy to tag their own content. They just don't understand how their documents fit within a broader collection, and for that reason they don't often use sufficiently specific and precise metadata.

But if you have them do their own tagging, it's at least a good start, especially if you combine that effort with some sort of quality control and review process which determines the final set of metadata terms used to describe the content. In other words, an intelligent balance of decentralized and centralized efforts. Doing so can improve your site's navigation and, indirectly, the search experience, but as with everything else, it costs money. It's one thing to have a taxonomy; another (and often far more expensive) thing to derive its intended benefits.

A good place to start down this road would be to map out the content authoring and publishing workflow from end to end. Then take a close look at responsibility for each step of the workflow: who owns it, and who should own it? Which parts should remain in the author's or local unit's hands, and which parts should be centralized? You'll typically find that the stuff on the left (authorship) tends to be more locally owned, the stuff to the right (publication) should be more centrally-managed, and thinking through the tough stuff in between is where you earn your keep. Here's an over-simplified stab at mapping this process that might help you understand it better:
[Figure: content workflow diagram]

Complicating your life is the fact that there are various levels of "IA maturity" among the different publishers of content within your organization. So the line between centralization and local autonomy will vary quite a bit, meaning one content workflow process won't fit everyone's needs.

In any case, it's likely your organization will increasingly centralize some aspect of its enterprise information architecture; if that's the case, a new central team would likely be the group that performed this quality control work. So you may be heading in this direction anyway. So maybe the better question to ask is: which responsibilities should a centralized team take on in your organization (including reviewing how metadata are applied)?

As for other organizations that have dealt with this problem, well, there are many, but their solutions aren't especially generalizable, as they're so dependent on local politics, organizational culture, available technology and resources, and so on. You might find my recent posting about the BBC's path toward metadata adoption a useful start.

What Would You Do? Please share your thoughts...


Comment: Rich Wiggins (Oct 12, 2006)

Getting good metadata requires institutional commitment. Examples I've heard of where this commitment exists include, as Lou alludes, the BBC, and certain research and technical environments where subject matter experts do their own tagging. In the BBC example, reporters know their work is more likely to be repurposed (print, radio, TV, Web) if they provide good metadata. In the lab environment, the scientists know that as individuals and as a community they'll search and share information more effectively with good metadata.

If your taxonomy is kind of clunky and not broadly understood or used, it's hard to imagine how it's going to do better than Google.

I'd suggest trying the Mini out for a while. Look at your search logs after a month. For the top 200 searches, test to see how the taxonomy-based scheme would've fared.
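
To make the log review concrete, here is a minimal sketch of pulling the most frequent queries out of a search log. It assumes the log has already been reduced to one raw query string per entry; the function name `top_queries` and the sample queries are made up for illustration, and a real Mini log would need its own parsing first.

```python
from collections import Counter

def top_queries(log_lines, n=200):
    """Return the n most frequent queries as (query, count) pairs.

    Queries are normalized by lowercasing and collapsing whitespace,
    so "Job Bank" and "job  bank" count as the same search.
    """
    counts = Counter()
    for line in log_lines:
        query = " ".join(line.lower().split())
        if query:  # skip empty searches
            counts[query] += 1
    return counts.most_common(n)

# Hypothetical sample data, one query per entry
sample_log = [
    "Financial aid",
    "financial  aid",
    "job bank",
    "",
    "conference registration",
    "Job Bank",
    "financial aid",
]

for query, count in top_queries(sample_log, n=2):
    print(query, count)
```

Each query in the resulting list can then be run against both the Mini and the taxonomy-driven navigation to compare how well each one serves it.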

Comment: Dave (Oct 12, 2006)

Shouldn't it be WWLD; the "L" being "Lou"?

Comment: epc (Oct 12, 2006)

If the Google mini ranks results using pagerank, the taxonomy could be used to influence pagerank and search results if site navigation is derived from the taxonomy (through breadcrumbs in the pages, site map and sub-site maps, possibly presence of taxonomical terms in the URLs).
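
As a rough sketch of deriving navigation from the taxonomy, the snippet below builds a breadcrumb trail by walking from a term up to its top-level category. The `parent_of` mapping and the category names are hypothetical; a real taxonomy would be loaded from wherever it is stored.

```python
def breadcrumb(term, parent_of):
    """Walk from a term up to its root category and return the path root-first."""
    path = []
    seen = set()
    while term is not None:
        if term in seen:
            # Guard against a malformed taxonomy that loops back on itself
            raise ValueError(f"cycle in taxonomy at {term!r}")
        seen.add(term)
        path.append(term)
        term = parent_of.get(term)
    return list(reversed(path))

# Hypothetical taxonomy: each term maps to its parent (None marks a root)
parent_of = {
    "Scholarships": "Financial Aid",
    "Financial Aid": "Students",
    "Students": None,
}

trail = breadcrumb("Scholarships", parent_of)
print(" > ".join(trail))
```

The same trail could feed a site map or URL path, which is how the taxonomy would end up visible to a crawler that otherwise ignores metadata.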

Comment: John (Oct 12, 2006)

I've installed Google Minis and their bigger brother, the Google Search Appliance, in large enterprises and come across these issues. Generally, I advise that because strict taxonomy is so difficult to do well when most users are untrained and unmotivated, enterprise search is a better bet. We've been showing clients how to optimise site search on their search appliances using search analytics, synonym lists and judicious use of KeyMatch results.

The Google Mini can make use of existing metadata in indexed content. You can create a search scope filter that uses metadata in Serving/Frontend/Filters in the latest versions of the Mini. You can also get metadata by default using a hidden form field in the search frontend. If you had a GSA rather than a Mini, you could make great use of the OneBox feature for presenting metadata ahead of organic search results.

Finally, you can present metadata alongside search results in clever and useful ways using the XSLT that defines the frontend, for example presenting Author, date last updated etc alongside a document search result.

Comment: Lou (Oct 12, 2006)

Dave, clearly the information from the comments is superior to WWLD; people with actual Google Mini experience are responding. Very cool! :-)

John, your thinking is in line with mine regarding hesitance with taxonomies. I'm not against them, but they're not a good starting point for most enterprises. A level of organizational maturity is required to really take advantage of traditional taxonomies.

Comment: RTodd (Oct 12, 2006)

Interesting thing about taxonomies: we “professionals” want perfection with well-designed taxonomies. We measure our success by the level of complexity and even at times wear it as a badge of courage. The best taxonomy in the world gets used while the worst never does, regardless of design. I like to think of taxonomies as semantic views of a map where the goal is to get the user to the neighborhood of what they are looking for, not the exact house of content. We should strive to build the taxonomy where you have both content (artifacts labeled) as well as usage (users clicking the taxonomy structure). When you focus on this versus levels, domains, spans, etc., you are far more likely to succeed.

Comment: Faruk (Oct 13, 2006)

I believe that there is a need to manage content in an "inward-facing" taxonomy that is centrally controlled and managed by a librarian resource for the purposes of appropriate archiving and a robust ability for retrieval if necessary after some years have passed. There is, however, no reason why this inward taxonomy should be the same that the end-user of any particular day, week, period, or era should necessarily need to see.

I work for a government agency, and finding a way to meet the records and data management requirements of the organisation whilst meeting the changing needs, behaviours and cognitive/discursive predilections of our users is a key task.

My view is that as long as there is a stable, robust, standards-based taxonomy for the internal storage and management of content or data, we can map whatever end-user-facing taxonomy onto this as required - including finding a way of pulling in and incorporating user-generated tagging. I believe multifaceted classification is the path of the future for most large-scale repositories of information.

Search technologies can assist our mapping of the two or more taxonomies in play at any time; and unless there is a clear advantage in manually applied metadata, system generated metadata from a CMS that has certain basic metadata requirements for authors built in to the authoring and publishing workflow should suffice.

For any data set that needs to last in terms of its accessibility, possibly well beyond the point where it is actively used by a population, I think having a well structured internal taxonomy is clearly essential.

Comment: Noreen Whysel (Oct 13, 2006)

Building on Faruk's comment, you could build a default taxonomy setting directly into the CMS workflow, matching tags to fields (author, department/agency, group focus, product, etc.), directory structure (where is the file saved?), type of file and permanence.
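
A minimal sketch of what such default tagging might look like, assuming the CMS can hand over each item as a simple record with a few fields and a file path. The field names, the `FILE_TYPES` table and the `default_tags` function are all hypothetical, not any particular CMS's API.

```python
import posixpath

# Hypothetical mapping from file extension to a "type of file" tag
FILE_TYPES = {
    ".pdf": "PDF document",
    ".html": "Web page",
    ".doc": "Word document",
}

def default_tags(item):
    """Derive default tags from fields the CMS already knows about."""
    tags = {}
    # Copy over fields the author filled in anyway
    for field in ("author", "department"):
        if item.get(field):
            tags[field] = item[field]
    # Use the top-level directory as a coarse "section" tag
    path = item.get("path", "")
    directory = posixpath.dirname(path).strip("/")
    if directory:
        tags["section"] = directory.split("/")[0]
    # Tag the file type from the extension
    extension = posixpath.splitext(path)[1].lower()
    tags["file_type"] = FILE_TYPES.get(extension, "Other")
    return tags

# Hypothetical CMS record
item = {
    "author": "J. Smith",
    "department": "Publications",
    "path": "/press/2006/release-10.pdf",
}
print(default_tags(item))
```

Defaults like these would not replace subject tagging, but they give every item some metadata even when the author skips the classification step.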

Comment: Victor Lombardi (Oct 13, 2006)

RE>Should we do away with the taxonomy altogether and just let the Google mini crawl?

The problem statement seems to be missing mention of the end users for whom all this work is presumably designed. Why not just do qualitative user testing and see which is more successful for the users?

Comment: Prentiss Riddle (Oct 14, 2006)

What Faruk, Victor and epc said:

- The answer depends entirely on the use case. Do you absolutely need a degree of precision that free text search can't provide, say for legal or medical applications?

- You need to look not only at how hard it is for content creators to use the taxonomy but how effective it is at helping your end users.

- You may be able to leverage what you've already got by exposing the taxonomy to the Google box.

Comment: Paula Thornton (Oct 14, 2006)

The challenges with Google are more significant for intranets. Internal content doesn't run around putting in extra words just to influence their results with search engines...nor should they -- there's no economic justification for it -- it puts us smack back with shaping the people to fit the results, only it's being applied to the content.

Google apparently does or will have some classification control, but I haven't worked with it to comment.

It doesn't take an investment in usability tests to see how bad the results are internally -- some layer of control is needed for intranets...but we haven't yet been able to hire the dedicated resource we want to put in place to figure out exactly what that 'ideal' approach should be (which is likely to fall in line with some of the recommendations John Wood has already outlined -- but we're also looking at very distinct search mechanisms -- including a task-based interface).

Comment: ben123 (Oct 18, 2006)

I am the person who posed the original question to Lou. Many thanks to Lou and to everyone who responded with helpful insights and comments. I think this will put us in the right direction.

The end result we are after is for users of our website to be able to find the information and materials they need. However this is done in the back end they most likely will not care. I am intrigued by how the BBC has tackled their meta tag issues.

I am also intrigued by using search analytics and usability tests to make sure that the metadata system in place is done with the end user in mind.

We have a metadata system already -- the taxonomy. I wonder if search analytics (finding out the most common keyword searches, for example) can be used to further refine the taxonomy so that it is not so complex and hence, not user friendly for the content creator who ends up doing the manual tagging of material for the web.

Comment: Amy (Oct 20, 2006)

Search analytics can and should be used to refine your taxonomy. Simple search log analysis can offer insight into opportunities for improving the taxonomy.

I'm of the opinion that a taxonomy should be (to whatever extent possible, which varies depending on the corporation) a living entity that evolves to meet its users' needs; there's pretty much no such thing as a perfect taxonomy out of the box. Your taggers may not have the exact same vocabulary as your users, and that's not inherently a bad thing, but it's helpful for their vocabularies to at least be informed by those of the user.
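
One simple form of this analysis is to list the frequent searches that match no taxonomy term, since those are candidates for new terms or synonyms. The sketch below assumes exact matching after lowercasing, which is cruder than what a real analysis would use; the function name `vocabulary_gaps` and the sample data are invented for illustration.

```python
from collections import Counter

def vocabulary_gaps(queries, taxonomy_terms, n=10):
    """Return up to n frequent queries that match no taxonomy term exactly."""
    terms = {term.lower() for term in taxonomy_terms}
    counts = Counter(
        " ".join(query.lower().split())  # normalize case and whitespace
        for query in queries
        if query.strip()
    )
    gaps = [(query, count) for query, count in counts.most_common() if query not in terms]
    return gaps[:n]

# Hypothetical sample data
queries = ["jobs", "jobs", "Job Bank", "tuition", "jobs", "tuition"]
taxonomy_terms = ["Job Bank", "Tuition"]

print(vocabulary_gaps(queries, taxonomy_terms))
```

Here users search for "jobs" far more often than for the official label "Job Bank", which suggests adding a synonym rather than retraining users.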

Comment: Dennis D. McDonald in Alexandria, Virginia USA (Oct 25, 2006)

Wow, what an informative series of comments!

My view of taxonomies has changed a lot since I was in library school and then worked for a company that had developed its own fulltext search engines. That is, taxonomies are wonderful things but (a) they have to change constantly and (b) they have to be flexible enough to support different worldviews of authors, indexers, and users -- groups that may never speak directly with each other in real life.

At one point I believed that the solution was for intelligent behind the scenes translation to occur between the terms used by the indexer and the terms used by the user. The better the translation, the better the precision of results.

Nowadays, though, search functions are so rapid and provide results that can be navigated quickly; this means that a fast search (say, Google-based) can put the user "into the ballpark" of high-precision results just as quickly as results generated through intelligent use and translation of taxonomies. The appropriate question then becomes, which approach makes more economic sense?
- Dennis McDonald

Comment: Walter Underwood (Oct 30, 2006)

I'm very surprised to see people suggest throwing away a very expensive taxonomy in order to keep a very cheap search implementation. A Google Mini is a throwaway box at $2000. Your taxonomy cost at least 100X that and maybe 1000X.

Switch to a search product that can do something useful with your taxonomy. This is a pretty basic feature for enterprise search -- I implemented it for Ultraseek CCE eight years ago.

If the categories are in an HTML meta tag, it shouldn't be too hard to generate a CCE topics.xml file from your taxonomy.

First, try a search engine that does use the taxonomy before deciding to ditch the taxonomy.

Was there an evaluation for search or did you buy Google because of the brand?

Comment: Avi Rappoport (Nov 7, 2006)

I've been thinking about this and I'd take a couple of big steps backward and ask some fundamental questions.

1) What are the information needs that the site is designed to address?

2) What's the purpose of the taxonomy? How does it fit into the information needs? Who needs it and how will they use it? How can you test this?

3) What are the resources required to keep the taxonomy current and useful? Could an automated categorization engine help? Could the taxonomy be simplified? How much is the company willing to invest in this?

Those are big questions; installing the search just uncovers them, but they were there before.

If the taxonomy is valuable enough, then you can expose it as metadata in the Google search engine result page and set up filters for search, or maybe get a search engine that takes advantage of the taxonomy.

So hah, me a search person, and I bring it all back to IA.

Comment: ben123 (Nov 10, 2006)

Hello Avi

I am trying to remain anonymous and to not identify my organization so please bear with me trying to answer your questions:

1) The site is designed to address the needs of the higher education community and professionals. We are an association that publishes books, holds conferences, tries to influence legislation, sponsors events, has an info center and a Job Bank, and publishes copious amounts of info like press releases.

2) The taxonomy is designed to organize and categorize the information we have for the Web site. Our subject area is broad -- the knowledge base has several categories which overlap at certain points. The difficulty we are having is that the taxonomy is at 1,600 keywords -- five main categories of knowledge with eight index nodes for each category and within those eight, further sub-categories to drill down. Needless to say, tagging each piece of content using the taxonomy in its current form is a bit overwhelming.

3) We currently devote no resources or processes to making the taxonomy more user-friendly. We currently have the taxonomy on paper and in Excel, and it is untouched from the time it was first conceived.

You said:
"If the taxonomy is valuable enough, then you can expose it as metadata in the Google search engine result page and set up filters for search"

We are in the process of implementing a solution similar to the one you describe above. We are basically going to have our CMS dynamically turn the index nodes of tagged content into metadata, and then see what types of results we get in search as we search by keyword. I am also trying to get our web team to take seriously the concepts in Search Analytics, Lou Rosenfeld's upcoming book. So far they see the SA model of improving search as a stopgap method until we can resolve our taxonomy-metadata problems.
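
For readers wondering what that CMS step might amount to, here is a minimal sketch that renders a page's taxonomy terms as HTML meta tags. The meta name `taxonomy` and the function `meta_tags` are assumptions for illustration; the name a given search engine expects, and whether it reads meta tags at all, should be checked against its documentation.

```python
from html import escape

def meta_tags(terms, name="taxonomy"):
    """Render one HTML meta tag per taxonomy term, escaping special characters."""
    return "\n".join(
        f'<meta name="{escape(name)}" content="{escape(term)}">'
        for term in terms
    )

# Hypothetical index nodes assigned to one page
print(meta_tags(["Financial Aid", "Grants & Scholarships"]))
```

Escaping matters because taxonomy labels often contain ampersands or quotes that would otherwise produce invalid markup.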


Comments are now closed for this entry.

Comment spam has forced me to close comment functionality for older entries. However, if you have something vital to add concerning this entry (or its associated comments), please email your sage insights to me (lou [at] louisrosenfeld dot com). I'll make sure your comments are added to the conversation. Sorry for the inconvenience.