
Jan 09, 2005: Applications to Aid in Content Inventories?

I'm interested in learning about applications that can help us perform content inventories in large, distributed enterprise environments. Has anyone used software to help answer the following questions?

1) What's out there? Are there sites, sub-sites, and other content areas (possibly non-HTML) that we just don't know about but should?

Hidden but useful content can be a big problem for those performing content inventories in an enterprise environment. Logs obviously only have limited utility (if any) in answering this question. Maybe tools that crawl and collect page links (which might go to sites that are owned by the same organization but use a different domain) might help here. Perhaps also manually searching Whois for domains owned by one's organization might also provide a good starting point (the domains can then be fed to a crawler).
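To make the crawling idea concrete, here's a minimal sketch of a link harvester that separates links to known, inventoried hosts from links to unfamiliar hosts (which may be sister sites owned by the same organization, e.g. surfaced by a Whois search). The domain names and sample page are invented for illustration:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collect absolute hrefs from a page, flagging hosts outside the known list."""
    def __init__(self, base_url, known_domains):
        super().__init__()
        self.base_url = base_url
        self.known_domains = set(known_domains)
        self.internal, self.unknown = set(), set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        url = urljoin(self.base_url, href)
        host = urlparse(url).netloc
        # Unknown hosts are candidates for "sites we just don't know about."
        (self.internal if host in self.known_domains else self.unknown).add(url)

# Hypothetical page and domains for demonstration:
page = '<a href="/about">About</a> <a href="http://intranet.example.org/hr/">HR</a>'
c = LinkCollector("http://www.example.org/", ["www.example.org"])
c.feed(page)
print(sorted(c.unknown))
```

A real run would fetch each page, feed its HTML to the collector, and re-queue internal links, seeding the queue with the Whois-derived domain list.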

2) Of what's out there, what's homogenous? In other words, are there pockets of content where documents/files/records share the same structure or are considered similar through some other form of pattern recognition?

Here the goal is to save labor by minimizing sampling for content analysis--if such patterns are detected, you only need to look at a few documents or records to get the "idea" of what that pocket of content is about.
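One crude way to detect such pockets automatically is to group records by their structural "signature" (the set of fields they share); a sketch, using invented sample records:

```python
from collections import defaultdict

def group_by_signature(records):
    """Map each distinct field-set (the structural signature) to its records."""
    pockets = defaultdict(list)
    for rec in records:
        pockets[frozenset(rec)].append(rec)
    return pockets

# Hypothetical records for illustration:
records = [
    {"title": "Q1 report", "author": "A", "date": "2004-01-01"},
    {"title": "Q2 report", "author": "B", "date": "2004-04-01"},
    {"headline": "Press release", "contact": "pr@example.org"},
]
pockets = group_by_signature(records)
for sig, recs in pockets.items():
    print(sorted(sig), "->", len(recs), "record(s); sample:", recs[0])
```

Once the pockets exist, a reviewer only needs to sample one or two records per pocket rather than reading everything.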

3) Of what's out there, what's important enough to merit extra attention?

Some pockets of content are important and we should spend more time sampling and analyzing them. We might define "importance" in a number of ways. Applications might rely on logs to determine content popularity. Humans, obviously, might rely on other quality attributes (such as authority and currency) to assess importance or value, but right now I'm more interested in learning about automated approaches.
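As a sketch of the log-based approach, here's a minimal popularity tally over common-format access log lines (the sample lines are invented; a real run would read the server's log file):

```python
from collections import Counter

def popularity(log_lines):
    """Count hits per requested path (the 7th whitespace field in common log format)."""
    hits = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) > 6:
            hits[parts[6]] += 1
    return hits

# Hypothetical log lines for demonstration:
sample = [
    '10.0.0.1 - - [09/Jan/2005:10:00:00 -0500] "GET /hr/policy.html HTTP/1.0" 200 512',
    '10.0.0.2 - - [09/Jan/2005:10:01:00 -0500] "GET /hr/policy.html HTTP/1.0" 200 512',
    '10.0.0.3 - - [09/Jan/2005:10:02:00 -0500] "GET /news/index.html HTTP/1.0" 200 1024',
]
for path, n in popularity(sample).most_common():
    print(n, path)
```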


Comment: Peter (Jan 9, 2005)

- A disk spider can uncover stuff that wasn't even linked to from anywhere anymore, or that is very hidden.
- a web spider can make a long list of everything you have.
- Server log files might indicate some level of what's important.
- A poll throughout the enterprise ("What are the 10 pages on the intranet you find contain the most important information in your daily work - please enter pagetitle and url") might give some idea of what's important.
- Directory structure often indicates homogenous-ness. In other words, everything under /hr/2004/docs/ is probably similar. It's a weak indicator though.

Good question, curious to see what else you come up with.
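Two of the ideas above (the disk spider and the web spider) can be combined: walk the docroot on disk and cross it against the set of paths the crawler found, surfacing files that exist but are no longer linked from anywhere. A minimal sketch; the directory layout and linked-path set are invented:

```python
import os
import tempfile

def orphaned_files(doc_root, linked_paths):
    """Return files under doc_root whose site-relative path is never linked."""
    orphans = []
    for dirpath, _dirs, files in os.walk(doc_root):
        for name in files:
            rel = "/" + os.path.relpath(
                os.path.join(dirpath, name), doc_root).replace(os.sep, "/")
            if rel not in linked_paths:
                orphans.append(rel)
    return sorted(orphans)

# Hypothetical docroot for demonstration:
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "hr"))
for p in ("index.html", "hr/old_policy.html"):
    open(os.path.join(root, p), "w").close()
print(orphaned_files(root, {"/index.html"}))  # → ['/hr/old_policy.html']
```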

Comment: Denham (Jan 9, 2005)

I'm thinking the lawyers may have addressed this question with protocols, practices and discovery tools ???


Comment: Denham (Jan 9, 2005)

Would the lawyers not have tools for running down content as part of their legal discovery arsenal ?

Dennis Kennedy may be of help here.


Comment: Mags (Jan 10, 2005)

I had one of the development staff write a disk scraper to identify all the pages on the disk and collect information about each page, including any metadata in the page header, the number of links, the number of images, the file path, and the number of RIAs. It was very useful: I discovered the parts of the file structure that had lots of content but very few links, and identified a number of redundant areas.
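A rough sketch of what such a scraper does for a single page: pull header metadata and count links and images. The sample HTML is invented; a real scraper would walk the docroot and feed each file's contents to the parser:

```python
from html.parser import HTMLParser

class PageScraper(HTMLParser):
    """Collect <meta> name/content pairs and tally links and images on one page."""
    def __init__(self):
        super().__init__()
        self.meta, self.links, self.images = {}, 0, 0

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and "name" in a:
            self.meta[a["name"]] = a.get("content", "")
        elif tag == "a" and "href" in a:
            self.links += 1
        elif tag == "img":
            self.images += 1

# Hypothetical page for demonstration:
page = ('<html><head><meta name="author" content="HR dept"></head>'
        '<body><a href="/hr/">HR</a><a href="/hr/forms/">Forms</a>'
        '<img src="logo.gif"></body></html>')
s = PageScraper()
s.feed(page)
print(s.meta, s.links, s.images)
```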

Comment: Martin (Jan 10, 2005)

In general it sounds to me like you're looking for some kind of indexing service that will crawl files and web content and make it all searchable.

Then, when that has been accomplished you would want to know how to promote areas of information for the users based on the patterns of interest, right?

I will be glad to elaborate on an existing range of products which covers this general area.

Comment: Lou (Jan 12, 2005)

Thanks for the responses everyone; offline, Alan Gilchrist suggested looking for ideas in patent analysis, Anders Ramsay suggested data mining, and Patrick Debois suggested looking into document clustering.

Sounds like there's no single tool or suite of tools that's being used by IAs or content managers to handle inventory and analysis, although maybe it's too soon for me to be pessimistic...

Comment: R. Todd Stephens, Ph.D. (Jan 18, 2005)

Patricia Seybold published a paper on tagging large content inventories.


Interesting Read and has a few products.

Comment: Todd O'Neill (Mar 3, 2005)

We've used our search engine crawler to gather the base for an inventory. It was able to give us the URL, the contents of the title tag, and all the meta tags, in an Excel file. (Actually a flat log file that we converted to Excel.)
We crawled by directory (how our intranet is organized) and then prioritized the list of directories to create our work plan.
We analyzed one directory of 1,000 files for "quantity" of tagging in an hour or so. Analysis of the same directory for "quality" of tagging took another few hours, and we ended up with our "work cut out" for us.
Our search tool got us a long way toward expediting the inventory process. We did our extranet (2,000 files) by hand -- 5 people, 5 weeks, lots of detail. We could have saved weeks of time if we had had our search tool in place at that time.
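The flat-log-to-Excel step described above can be as simple as writing each crawl record out as a CSV row; a sketch, where the pipe-delimited field layout of the flat file is an assumption:

```python
import csv
import io

# Hypothetical flat crawl log: URL | title | tags
crawl_log = [
    "http://intranet.example.org/hr/policy.html|Leave Policy|keywords=hr,leave",
    "http://intranet.example.org/hr/forms.html|HR Forms|",
]
out = io.StringIO()  # a real run would open a .csv file for Excel to import
writer = csv.writer(out)
writer.writerow(["url", "title", "tags"])
for line in crawl_log:
    writer.writerow(line.split("|"))
print(out.getvalue())
```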

Comments are now closed for this entry.

Comment spam has forced me to close comment functionality for older entries. However, if you have something vital to add concerning this entry (or its associated comments), please email your sage insights to me (lou [at] louisrosenfeld dot com). I'll make sure your comments are added to the conversation. Sorry for the inconvenience.