Jon Udell: Categorizing Content

Tangled in the Threads
Jon Udell, Sept 15, 1999
Categorizing Content

A couple of weeks ago, I established myself as an RSS (RDF Site Summary) channel. I've written about this process elsewhere. The drill is simple: you write a file in RSS format, post it somewhere web-accessible, and register it with one or more channel hosts. Notable hosts include Netscape's my.netscape.com and UserLand Software's my.userland.com. These sites then periodically fetch your channel file, render its XML content as HTML, and rebroadcast it to their visitors.

In an essay on UserLand.com, Dave Winer -- a driving force behind the phenomenon of RSS-based content syndication -- expresses very nicely what I think is the most compelling aspect of this technology. It's profoundly democratic. To operate one of Microsoft's (now de-emphasized) ActiveDesktop channels, you pretty much had to be a Disney or a CBS -- that is, an organization with the development resources to create a channel with high production values, and with the clout to grab a chunk of channel-bar real-estate. RSS channels are nothing like that. To get into the game, you just need to offer a useful view of what's on the Web. Technically, it's a no-brainer. Here, for example, is a piece of my current channel file:
<?xml version="1.0"?>

<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN"
      "http://my.netscape.com/publish/formats/rss-0.91.dtd"
> 

<rss version="0.91">

<channel>
<title>Jon Udell</title>
<link>http://udell.roninhouse.com/</link>
<language>en-us</language>
<description>Jon Udell's writings, discussions, and software</description>

<image>
<title>Jon</title>
<url>http://udell.roninhouse.com/jon88x31.gif</url>
<link>http://udell.roninhouse.com/</link>
<width>88</width>
<height>31</height>
<description>Articles, discussions, and more...</description>
</image>

<item>
<title>Talk | Categorizing syndicated content</title>
<link>http://www.byte.com/nntp/joncon?comment_id=2185#thread</link>
<description>An RSS category tag could enable channel publishers to collectively 
define a directory namespace, and locate their content within it. Will it work?
Why or why not?</description>
</item>

<item>
<title>Column | O'Reilly Open Source Convention</title>
<link>http://www.byte.com/column/threads/BYT19990901S0008</link>
<description>It's not just the Perl Conference any more, it's a multi-community
event that involves Linux, Apache, Python, and more. Michael Tiemann, Larry Wall,
and Ted Nelson all had interesting things to say.</description>
</item>

<item>
<title>Talk | A discussion about Zope</title>
<link>http://www.byte.com/nntp/programming?comment_id=1876#thread</link>
<description>Michel Pelletier, a developer of Zope, responds to my 8/30/1999
column. Enlightening discussion ensues.</description>
</item>

</channel>
</rss>

It's easy to play this game, and a lot of people are starting to play it. There are hundreds of these RSS channels registered at the Netscape and UserLand sites. What's more, it's easy to get into the channel hosting game. Recently a programmer named Carmen (she prefers not to use her last name) wrote to tell me about Carmen's Headline Viewer, a Windows-based RSS channel viewer (shareware, $15). One of the built-in channel providers is none other than mine. The effort required of me, to enable my channel to integrate into Carmen's Headline Viewer, was precisely none, and that's the beauty of this syndication model and of the XML standards that support it.
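
To give a feel for how little a channel host or viewer has to do, here is a minimal sketch in Python of the "render its XML content as HTML" step. It is an illustration only, not how my.netscape.com, my.userland.com, or Carmen's Headline Viewer actually work. The function name render_channel and the HTML layout are assumptions, and the sample is a trimmed, one-item copy of my channel with the DOCTYPE left out so that nothing has to be fetched from the network.

```python
import xml.etree.ElementTree as ET
from html import escape

# Trimmed, one-item copy of the channel file (DOCTYPE omitted for the sketch).
SAMPLE = """<?xml version="1.0"?>
<rss version="0.91">
<channel>
<title>Jon Udell</title>
<link>http://udell.roninhouse.com/</link>
<item>
<title>Talk | A discussion about Zope</title>
<link>http://www.byte.com/nntp/programming?comment_id=1876#thread</link>
<description>Michel Pelletier, a developer of Zope, responds to my 8/30/1999
column. Enlightening discussion ensues.</description>
</item>
</channel>
</rss>"""


def render_channel(rss_text):
    """Return a simple HTML fragment for one RSS 0.91 channel (hypothetical layout)."""
    channel = ET.fromstring(rss_text).find("channel")
    home = escape(channel.findtext("link", ""))
    name = escape(channel.findtext("title", ""))
    html = [f'<h2><a href="{home}">{name}</a></h2>', "<ul>"]
    for item in channel.findall("item"):
        title = escape(item.findtext("title", ""))
        link = escape(item.findtext("link", ""))
        # Collapse the line breaks inside the description into single spaces.
        desc = escape(" ".join(item.findtext("description", "").split()))
        html.append(f'<li><a href="{link}">{title}</a>: {desc}</li>')
    html.append("</ul>")
    return "\n".join(html)


if __name__ == "__main__":
    print(render_channel(SAMPLE))
```

A real host would first fetch the file from the URL the publisher registered (for example with urllib.request.urlopen) and would repeat the fetch on a schedule. The rendering step itself stays about this small.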

The need to categorize

RSS channels are a rich subject, which we've discussed in the newsgroups and doubtless will again. This week I want to focus on an issue that relates to channels, but also more broadly to the management of all web content: categorization.

Because I've been in the web publishing game a long time, I automatically tend to categorize stuff. So for example, my channel titles all have a two-part structure built on the pattern:
CATEGORY | TITLE
So far, the categories I've used have been: "Review" (a software review), "Talk" (a discussion-group message), "Column" (one of these columns), and "Person Name" (the name of someone whose article, or posting, I'm highlighting).
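
A viewer that wanted to exploit this convention could pull the two parts apart with a few lines of Python. This is only a sketch: the function name split_title is made up for illustration, and the convention is mine, not part of RSS, so titles without a "|" have to be tolerated.

```python
def split_title(title):
    """Split 'CATEGORY | TITLE' into (category, title); category is None if absent."""
    category, separator, rest = title.partition("|")
    if not separator:
        # No "|" in the title: treat the whole string as the title.
        return None, title.strip()
    return category.strip(), rest.strip()


if __name__ == "__main__":
    print(split_title("Talk | A discussion about Zope"))
    print(split_title("An item with no category"))
```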

There isn't room, in an RSS title, for more elaborate categorization. So Dave Winer and I have been kicking around the idea of proposing an extension to RSS that would enable channel items to categorize themselves more richly. Dave called me up a few days ago, proposed a <category> tag, we had a long talk on the subject, and I wrote it up and put it out for discussion both in my BYTE newsgroup and in Dave's discussion group at UserLand.com.

Dave's original idea was to categorize at the channel level. Given that I tend to be interested in things like groupware, Perl, and XML, that would imply I'd categorize my channel like this:
<category>perl</category>
<category>groupware</category>
<category>xml</category>
This scheme would enable a channel host to organize views of its channels according to such categories. It seemed to me, though, that item-level categorization was also needed. For example, I'd be inclined to categorize my Zope item like this:
<category>OpenSource</category>
<category>Programming/Python</category>
<category>WebApplicationServers/Zope</category>
<category>Databases/OODB</category>
This kind of more granular categorization should, if consistently applied, yield interesting and useful views of channel space. Of course that's a huge if!
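
As a rough sketch of what a channel host might do with item-level categories, the Python below builds a category-to-items view from a channel whose items carry the proposed <category> tag. Everything here is hypothetical: the tag is not part of RSS 0.91, the function name items_by_category is invented, the "OpenSource" category on the second item is my assumption for the sake of the example, and filing an item under every ancestor of its path (so that "Programming/Python" also shows up under "Programming") is just one possible design.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Two items from the channel, with hypothetical <category> tags added.
CHANNEL = """<rss version="0.91">
<channel>
<title>Jon Udell</title>
<item>
<title>Talk | A discussion about Zope</title>
<link>http://www.byte.com/nntp/programming?comment_id=1876#thread</link>
<category>OpenSource</category>
<category>Programming/Python</category>
<category>WebApplicationServers/Zope</category>
<category>Databases/OODB</category>
</item>
<item>
<title>Column | O'Reilly Open Source Convention</title>
<link>http://www.byte.com/column/threads/BYT19990901S0008</link>
<category>OpenSource</category>
</item>
</channel>
</rss>"""


def items_by_category(rss_text):
    """Map each category path, and each of its ancestors, to a list of item titles."""
    view = defaultdict(list)
    for item in ET.fromstring(rss_text).iter("item"):
        title = item.findtext("title", "")
        for cat in item.findall("category"):
            parts = [p for p in (cat.text or "").strip().split("/") if p]
            for depth in range(1, len(parts) + 1):
                path = "/".join(parts[:depth])
                if title not in view[path]:  # avoid listing an item twice
                    view[path].append(title)
    return dict(view)


if __name__ == "__main__":
    for path, titles in sorted(items_by_category(CHANNEL).items()):
        print(path, "->", titles)
```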

Where should these categories come from? One obvious candidate is the Netscape Open Directory, a collaboratively-maintained Yahoo-alike. There's already a connection between RSS channels and the Open Directory, by way of my.netscape.com. When you use its channel viewer and select "Add a channel," one of the choices leads you to a region of the Open Directory subtree that categorizes registered RSS channels.

Another candidate is the Dewey Decimal Classification system, maintained by the OCLC (Online Computer Library Center) Forest Press. It's also connected to the world of RSS channels, by way of James Carlyle's xmlTree.com site. At xmlTree.com, Carlyle collects and organizes XML content resources -- including RSS channels, which are among the most prominent such resources now available on the Web. After experimenting with a homegrown scheme for a while, Carlyle switched recently to Dewey Decimal so that, for example, my channel is listed under "005 - Computer programming, programs, data."

Embrace and extend

Dave proposed something that's downright anarchic as compared to Dewey Decimal and Open Directory. He suggested just letting people categorize their channel items in any old way they liked. How would such a scheme ever converge on anything sensible? Dave's idea was to build in a kind of Darwinian natural selection, where the definition of fitness in this environment is not reproduction, but repetition. Nodes in the category space maintained at a channel host, such as UserLand.com, would need to be reinforced by repetitive declarations. Without such reinforcement, they'd die out, like nerve impulses that fail to sum up to the activation threshold needed for propagation. Here are some examples:

Example: I propose Databases/OODB. You notice that the category exists, and request to add your item to it. The node is reinforced, and stays in the directory.

Example: I propose Databases/OODB. Nobody else cares about that category, but I continue to publish items into it, so it is reinforced, and stays in the directory.

Example: I propose Databases/OODB. Nobody else cares. Even I lose interest after a while. When I cease publishing items into the category, it goes away.
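
The three cases above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a description of anything UserLand has built: the class name CategoryDirectory, the notion of a fixed "lifetime" measured in days, and the rule that a node expires once it goes longer than that without a new item are all invented here to make the idea concrete. It does not try to model a critical-mass threshold.

```python
class CategoryDirectory:
    """Toy category space in which a node survives only while it is reinforced."""

    def __init__(self, lifetime):
        # Days a node may go without a new item before it dies (assumed rule).
        self.lifetime = lifetime
        # Category path -> day on which an item was last published into it.
        self.last_seen = {}

    def publish(self, path, now):
        """Declare (or reinforce) a category by publishing an item into it."""
        self.last_seen[path] = now

    def prune(self, now):
        """Remove nodes that have not been reinforced recently; return their paths."""
        expired = [p for p, seen in self.last_seen.items()
                   if now - seen > self.lifetime]
        for path in expired:
            del self.last_seen[path]
        return sorted(expired)

    def nodes(self):
        return sorted(self.last_seen)


if __name__ == "__main__":
    directory = CategoryDirectory(lifetime=30)
    directory.publish("Databases/OODB", now=0)
    directory.publish("Programming/Python", now=0)
    directory.publish("Databases/OODB", now=25)   # reinforced by a second item
    print(directory.prune(now=40))                # the unreinforced node dies
    print(directory.prune(now=60))                # interest lapses, so does the node
```

Note that it makes no difference here who does the reinforcing. A node kept alive by one persistent publisher looks the same as one kept alive by many.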

What about namespace conflicts? Suppose I propose XML/Syndication, and you propose Syndication/XML. How will this be handled? Our idea is that there is no central authority, which means that these two nodes could coexist. Here are some possibilities:

Both achieve critical mass. Publishers of items in this space choose to support both nodes.

One achieves critical mass, one does not. Publishers of items in the "losing" node may decide to switch to the "winning" node. They don't have to, but they might want to.

Not so fast, said Mark Wilcox in a response to my posting:

Ever heard the phrase that goes something like "Those who don't know UNIX are bound to make a poor imitation of it"? The same thing happens when people who've never had any background in knowledge management (e.g. librarianship) go about attempting to develop their own information management system.

While I think you are on the right track, you can't let people totally pick what categories their items should be in. This was the biggest problem with HTML meta tags and a big reason why search engines for the most part ignore them.

Instead you should take an existing organizational scheme in electronic form, such as Dewey or Library of Congress subject headings. Both of these systems cover just about any subject you want, are constantly updated and maintained by information professionals from around the world. These systems also have enormous amounts of cross-referencing, something that Yahoo has tried to do, but is still working at it.

Next you have people submit their items, run their text through a process that then uses your information base to present the user with a small list of possible categories.

This increases the likelihood that similar items end up together as opposed to being scattered apart.

OCLC, which is a library entity that provides much of the classification information (e.g. library catalogs) to libraries in the US and around the world has done a couple of research projects on this area. I urge you to take a look at:

http://www.oclc.org/oclc/research/publications/review96/scorpion.htm

http://www.oclc.org/oclc/research/publications/review97/shafer/eval_scorpion/eval_sc.html

Well said! I think Mark's exactly right. Ideally the channel host would do exactly this, and not just for one scheme but for several. I'd love to have a channel host kick back to me, for point-and-click approval, suggested Dewey and Open Directory and perhaps other categorizations of my stuff. I'd also like to supply my own ad-hoc categorizations, and have other people do so, and find out whether -- with the help of Dave Winer's "survival-of-the-fittest" approach -- something useful might emerge.
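
To make the "kick back suggested categories" idea concrete, here is a deliberately naive Python sketch. It is not how OCLC's Scorpion works, and it uses no real Dewey or Open Directory data: the SCHEME table, its keywords, and the function name suggest_categories are all hypothetical stand-ins for a real information base. It simply counts keyword overlap between an item's text and each category, and returns a short ranked list for the publisher to approve.

```python
import re

# Hypothetical miniature "information base": category -> keywords.
SCHEME = {
    "Programming/Python": {"python"},
    "WebApplicationServers/Zope": {"zope"},
    "Databases/OODB": {"oodb", "database", "object"},
}


def suggest_categories(text, scheme, limit=3):
    """Return up to `limit` category names, best keyword overlap first."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scored = [(len(words & keywords), name) for name, keywords in scheme.items()]
    # Drop categories with no overlap; break ties alphabetically.
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [name for _, name in ranked[:limit]]


if __name__ == "__main__":
    text = ("Zope is a web application server written in Python "
            "with an object database")
    print(suggest_categories(text, SCHEME))
```

A host could run the same function once per scheme (Dewey, Open Directory, the ad-hoc directory) and present each result as its own list.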

Authorities are valuable, but there's not going to be just one authority, I don't think. That model just won't scale to handle the volume of stuff that we add to the Internet every day. But I see no reason not to embrace multiple existing authorities, even as we seek to create new ones. RSS channel space, and ultimately the web at large, is never going to be adequately described by just one view. There ought to be multiple views that are credible, to different people for different reasons, and we ought to be able to select among those views as needed.

This is a big subject, clearly. One way or another, I predict you'll soon be wrestling with the problem of content categorization, if you haven't already. If you've got some useful experiences to share, I'd love to hear about them in the newsgroups.

Jon Udell (http://udell.roninhouse.com/) was BYTE Magazine's executive editor for new media, the architect of the original www.byte.com, and author of BYTE's Web Project column. He's now an independent Web/Internet consultant, and is the author of Practical Internet Groupware, forthcoming from O'Reilly and Associates.

This work is licensed under a Creative Commons License.