Jul 10, 2002: Some Batesian Inspiration
The latest First Monday features a great piece by Marcia Bates, a faculty member at UCLA's Department of Information Studies. "After the Dot-Bomb: Getting Web Information Retrieval Right this Time" provides an excellent encapsulation of how the Web world has ignored information retrieval, why it had better start paying attention now that the money has dried up, and what should happen.
Here's the abstract:
In the excitement of the "dot-com" rush of the 1990's, many Web sites were developed that provided information retrieval capabilities poorly or sub-optimally. Suggestions are made for improvements in the design of Web information retrieval in seven areas. Classifications, ontologies, indexing vocabularies, statistical properties of databases (including the Bradford Distribution), and staff indexing support systems are all discussed.
Abstracts are supposed to be descriptive, and don't delve into the opinions and emotions that drive what we write. I'm going out on a limb here, but I'm guessing that Bates is more than a little frustrated with the ignorance of information retrieval that has plagued the Web community. Like many information science people I know, she's seen many wheels reinvented, with credit taken by people who either believe they're visionaries or simply pretend to be. I like to think this is changing, but unfortunately the following statement still rings a bit too true:
...it has been almost an article of faith in the Internet culture that librarians have nothing to contribute to this new age.
Anyway, it's short and good, so get to it.
I've taken her seven conclusions and added a few of my own thoughts below:
1. Use faceted classifications, rather than hierarchical.
Faceted classification can often be extremely useful, especially for filtering (witness Epicurious' recipe database, my favorite example). But this sounds a little too much like a rule for my comfort. There are types of content for which hierarchical classification works fine. And designing and implementing faceted classification schemes requires a significant investment. Still, I wonder if the Yahoo! directory might have maintained greater relevance (and market share) if they had taken a faceted approach.
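To make the filtering idea concrete, here's a minimal sketch of faceted filtering in the spirit of a recipe database. The recipes and facet names are entirely made up for illustration; the point is that each item carries a value along several independent facets, and a query narrows the set by intersecting facet selections:

```python
# Hypothetical recipe data: each item is tagged along independent facets
# (cuisine, course, time), rather than filed in one hierarchy.
recipes = [
    {"name": "pad thai",    "cuisine": "thai",    "course": "main",    "time": "quick"},
    {"name": "green curry", "cuisine": "thai",    "course": "main",    "time": "slow"},
    {"name": "tiramisu",    "cuisine": "italian", "course": "dessert", "time": "slow"},
    {"name": "bruschetta",  "cuisine": "italian", "course": "starter", "time": "quick"},
]

def facet_filter(items, **selections):
    """Keep items matching every selected facet value; unselected facets are ignored."""
    return [item for item in items
            if all(item.get(facet) == value for facet, value in selections.items())]

# Users can combine any facets in any order -- the payoff over a fixed hierarchy.
quick_thai = facet_filter(recipes, cuisine="thai", time="quick")
```

Note that the cost Bates and I worry about isn't in this trivial filtering code; it's in designing the facets and tagging every item consistently.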
2. Develop an understanding of what distinguishes information classifications and vocabularies from the physical-world equivalents, and stop using the misleading term ontology.
Bates' railing against the term "ontology" brings back memories from the early '90s, when the University of Michigan's School of Information and Library Studies was renamed the School of Information, the result of a coup d'etat by renegade computer science faculty. These intrepid colonists saw the small and weak library school as fertile grounds for personal turf building efforts fed by unsuspecting doctoral students (yep, I was one) and research funds from the NSF and other organizations concerned with the information explosion.
Shortly after the coup, I recall participating in many arguments over this curious term "ontology". The computer scientists would become livid when us old guard types would explain that they really meant "classification systems," but they could never muster a reasonable definition of "ontology" that was any different from what many of us had been doing for eons. The computer scientists were really reacting strongly against being associated with librarianship, which they perceived as non-intellectual and vocational, even though we were really trying to introduce them to information retrieval, a field that's definitely neither non-intellectual nor vocational.
And how could I forget: "ontology" was (and still must be) a much sexier term that would garner more research dollars than old-fashioned "classification systems". I imagine Bates has experienced similar silliness at UCLA, as a little bitterness seems to seep through in her article.
3. Use the many vocabularies specifically designed for information retrieval, rather than general English language vocabularies.
"Thesaurus" is another misunderstood term. We've got to do a better job of distinguishing searching and indexing thesauri from Roget's. Some are used for expanding a term ("give me some synonyms!"), others for narrowing ("what's the right term to search?"). Maybe that's a good distinction?
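That distinction (expanding versus narrowing) is easy to sketch. A tiny, invented vocabulary follows; real searching thesauri encode these as USE/UF and synonym relationships:

```python
# "USE" relationships: map a searcher's entry term to the preferred
# indexing term (narrowing -- "what's the right term to search?").
use = {"car": "automobiles", "auto": "automobiles", "vehicle": "automobiles"}

# Synonym rings keyed by preferred term, for query expansion
# (expanding -- "give me some synonyms!").
synonyms = {"automobiles": {"car", "auto", "vehicle"}}

def preferred_term(term):
    """Narrowing: resolve a user's word to the controlled indexing term."""
    return use.get(term, term)

def expand_query(term):
    """Expanding: return the preferred term plus its synonyms for broader recall."""
    t = preferred_term(term)
    return {t} | synonyms.get(t, set())
```

Roget's helps a writer vary word choice; a retrieval thesaurus does the opposite job of funneling varied words toward one controlled term, or fanning one term out at search time.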
4. Understand and work with the underlying statistical characteristics of information in designing information retrieval. Failing to understand these factors simply leads to sub-optimal systems.
Pareto's Principle, aka the "80/20 rule," is just starting to become understood within the information architecture community as a guide for design decisions. And we really have no choice, as it's not cost-effective to design, implement and maintain 100% of all possible means of accessing our information, nor does 100% of our information merit high quality architecture in the first place. Find the 20% that addresses 80% of users' needs, and invest your design efforts into that good stuff.
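One practical way to find that 20% is to rank your search-log queries by frequency and see how few of them cover most of the demand. The counts below are invented for illustration:

```python
from collections import Counter

# Hypothetical search log: each entry is one user query.
log = (["products"] * 40 + ["contact"] * 25 + ["careers"] * 15 +
       ["press"] * 10 + ["privacy"] * 5 + ["sitemap"] * 5)

counts = Counter(log).most_common()        # queries ranked by frequency
total = sum(n for _, n in counts)

covered, top_queries = 0, []
for query, n in counts:
    top_queries.append(query)
    covered += n
    if covered / total >= 0.8:             # stop once 80% of demand is covered
        break
```

In this toy log, three of the six distinct queries account for 80% of all searches; those are the access points worth the heavy design investment.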
5. Recognize that systems of information description are extremely size-sensitive. Design for all anticipated database size ranges from the beginning.
Conversely, view all claims that a search tool vendor makes regarding their technology's performance with great suspicion. Often they're "testing" with a tiny, homogeneous body of content. Your content is probably voluminous and heterogeneous; if it's not, it probably will be soon. So the happy scenarios that vendors portray probably won't pertain to your situation.
6. Be kind to your indexers: Design a targeted indexing-support system specifically for your human staff, and you will save much staff time.
Content management system vendors take note: your bloody products might actually provide value if 1) you enabled manual indexing by integrating thesaurus management capabilities; and 2) you recognized that manual indexing is "real work" too, and started figuring out how to better integrate it within your workflow support.
7. If you develop a site with any information retrieval component at all, then hire information expertise.
And this is really the point of Bates' article: information retrieval is hard work that costs money and requires human labor as well as technology. Additionally, that human labor should come from IR experts, not specialists from other areas such as visual design or computer science. The stuff is just too complex for someone to be expert in both topic X and information retrieval.
Bates' case is bang on for information architects, interaction designers, and any area related to information retrieval. Let's hope that the tide begins to turn soon; even in lousy economic times, content creation is increasingly overwhelming, the content we already have continues to go stale, and, dammit, the world will need us more and more.
Joe Sokohl (Jul 11, 2002)
Great summary of her article, Lou! I'd read it just a few days ago, but, quite frankly, you were able to bring several points directly to us IAs. Not having a formal background in IR/LIS, I appreciate your "translations" of some of the headier points she made.
BTW, on a technical note, I think you have a missing somewhere...
joe (Jul 11, 2002)
...er, that's a missing "close-bold" marker....
Prentiss Riddle (Jul 11, 2002)
If ontogeny recapitulates phylogeny, does ontology recapitulate philology?
Scott B (Jul 11, 2002)
Article spotted 7/11/02
vanderwal (Jul 11, 2002)
Thanks Lou, your comments added depth to the Bates article. Thanks for bringing Pareto to the table; you are bringing some of my lives together.
Victor (Jul 12, 2002)
I've read this article, and I've read Soergel's article that Bates references, and neither actually looks at the way ontologies are being used. Citing the word's origins in philosophy, or saying ontologies are necessarily objective, are straw man arguments; no one using ontologies in practical ways (I'm excluding the AI and linguistics communities) is concerned with these things, just as we don't care that most people associate the word taxonomy with biology.
Conceptually, ontologies contain richer semantic relationships than classification schemes. Otherwise, they are very similar to thesauri. And at implementation time both can be boiled down to similar database models. But the computer science field that has and continues to dominate how we use electronic information has sided with ontologies as their organization scheme of choice.
There *does* seem to be a lot of bitterness in her argument. I think this just hurts LIS folks, because it distances them from comp sci rather than bringing the two together. Information scientists need computer scientists to implement their ideas. However, the reverse isn't true; computer scientists can reinvent the LIS wheel and go on to build an engine and transmission around it.
As IAs we can bridge these two fields and use the strengths of each. I hope we build bridges rather than burn them down.
robbin (Jul 27, 2002)
So the philosophy of information is just about as useful to computer scientists as ornithology is to birds?
lisa (Nov 19, 2002)
I appreciate your observations. As an ontologist from the AI-world looking into the IA world, I'm interested to see how IAs define "ontology". I believe an ontology is a faceted classification that is heavily typed. The value of an ontology for knowledge-based systems is not only in the classification, but in the semantic typing that can be used in constraints.
Both library and computer scientists build ontologies. However, they are for different users. The typical classification scheme that an IA/UE will build is for the front-end client, whereas a computer scientist will build an ontology with back-end functionality as the main driver. What is interesting is how these perspectives will merge as both parties are interested in back-end functionality. What is your perspective on this?
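lisa's point about semantic typing is worth making concrete. A minimal sketch, with invented classes and relations: an ontology's relationships are typed, and those types can enforce constraints on what facts the system will accept, which is exactly the back-end value a plain classification scheme doesn't give you.

```python
# Hypothetical instance-to-class assignments.
class_of = {"Chardonnay": "Wine", "Chateau X": "Winery", "Oak": "Tree"}

# Each relation declares the classes it may connect (domain -> range).
relations = {"producedBy": ("Wine", "Winery")}

facts = []

def assert_fact(subject, relation, obj):
    """Add a triple only if it satisfies the relation's type constraint."""
    domain, rng = relations[relation]
    if class_of.get(subject) != domain or class_of.get(obj) != rng:
        raise TypeError(f"{relation} must link a {domain} to a {rng}")
    facts.append((subject, relation, obj))

assert_fact("Chardonnay", "producedBy", "Chateau X")   # type-checks, so accepted
```

Trying to assert that a Tree is producedBy a Winery would be rejected at assertion time; the classification alone couldn't catch that.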
Walter Underwood (Jan 22, 2003)
Belated comments on the first three points.
Faceted classification is neat, but hard to create, maintain, and (sometimes) navigate. It works best with information which is dense, that is, that fills most of the intersections of the facets, and is stable, that is, new topics don't show up. Sparse information creates too many dead zones, and changing information requires redesigning the facets, which is time-consuming and obsoletes the classification work already done. So it works well for recipes, with thousands of years of convention, but not at all for, say, reusable software or company intranets. Try to use it, but be ready to go to something more robust. Also, read Bella Hass Weinberg's article about the near-universal failure of complicated classification schemes.
Reusing subheadings is far more important, by the way. Go to the Yellow Pages, and you know that each kind of business is divided into retail, wholesale, leasing, repair, etc. Users can depend on that and it really speeds access.
As for ontology and taxonomy, the terms have very different connotations. Ontology is from philosophy, and is about the essence of things. Taxonomy is from science, and is about naming and access. Essence and access are not the same thing at all. Essence is nearly religion, but access can be measured with user testing. An essence-based classification must determine whether Orwell's "1984" is science fiction or political commentary. An access-based approach asks "where will people look for it?" To make things really clear, I'd use "subject headings" instead of "taxonomy".
Existing IR vocabularies just don't cut it for commercial use. LCSH has a single heading for our company: Information retrieval--Computer programs. No detail below that. Organizations usually need their own vocabulary with the local jargon. One of our customers doesn't use the word "manufacturing". It is always "order fulfillment". Tag things with "manufacturing", and they won't find it.
Vocabularies can be reused where there is some external influence which makes companies similar. That might be employment law, franchising, or government regulation. So you see pre-built solutions for HR, car dealers, or pharmaceuticals.
©2008 Louis Rosenfeld LLC. All rights reserved.