Annotation, Microformats & Syndication

Syndication feeds (RSS and Atom) provide precise information about each entry: its title, its author, when it was created, when it was modified, a unique identifier for the post, and the content of the post without any surrounding menus, graphics, advertising, etc. This metadata supports many of the features of aggregators and blog search tools. But there’s a problem: the entries in a feed don’t last. Older entries drop out as new ones are published, and without a standard way to represent the same information in HTML, it is lost to the Web. As far as I know, no such standard exists.
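To make that list of fields concrete, here is a minimal sketch that reads them out of an Atom feed using Python's standard library. The sample feed, its values, and the function name read_entries are made up for illustration; only the element names and the namespace come from the Atom format.

```python
import xml.etree.ElementTree as ET

# Namespace used by Atom 1.0 elements.
ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical sample feed; every value here is invented for illustration.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>tag:example.org,2005:post-1</id>
    <title>First post</title>
    <author><name>Ann Example</name></author>
    <published>2005-06-01T12:00:00Z</published>
    <updated>2005-06-02T08:30:00Z</updated>
    <content type="text">Body text only.</content>
  </entry>
</feed>"""


def read_entries(xml_text):
    """Return one dict per entry with the metadata a feed exposes."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall(ATOM + "entry"):
        entries.append({
            "id": entry.findtext(ATOM + "id"),
            "title": entry.findtext(ATOM + "title"),
            "author": entry.findtext(ATOM + "author/" + ATOM + "name"),
            "published": entry.findtext(ATOM + "published"),
            "updated": entry.findtext(ATOM + "updated"),
            "content": entry.findtext(ATOM + "content"),
        })
    return entries


if __name__ == "__main__":
    for item in read_entries(SAMPLE):
        print(item)
```

Every field is addressed by name, so a tool does not have to guess which part of the document is the title or the author. That is exactly what an ordinary HTML page lacks.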

Why is this a problem? Well, for a start it means that if I suddenly add a large number of entries to my blog, Technorati, Feedster, PubSub et al. will not index the older ones unless they are in the feed. Furthermore, if someone comes up with a cool new blog search technology or the like, much of the data will simply not be out there to be indexed. (This also increases the first-mover advantage of the existing services, which have already indexed the no-longer-available data.)

Blog search tools are not the only services that could use the data. This is also a problem with my ideas for integrating wikis and forums. It even turns out to be relevant within an AJAX application, like the work I’m doing on web annotation.

My annotation implementation allows users to highlight text in a forum post and add notes in the margin, as one might underline text and write notes in a paper book. Each of these annotations is stored in a database, along with information about the annotated post, such as the post’s ID, title and author. I store this information with the annotation so that annotations can be retrieved independently and treated consistently regardless of what was annotated; this makes it much easier to add annotation to other web applications. The JavaScript that manages the display and editing of annotations also needs to locate the post on the web page so that it can show the highlights and margin notes.
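A record of this kind might look roughly like the sketch below. This is not the actual schema: the field names are hypothetical, and storing a highlight as a pair of character offsets into the post body is an assumption made only to keep the example small.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    # Metadata copied from the annotated post (hypothetical field names).
    post_id: str
    post_title: str
    post_author: str
    # The annotation itself. Character offsets are an assumption made
    # for this sketch, not necessarily how a real system locates text.
    start: int
    end: int
    note: str


def highlighted_text(post_body, annotation):
    """Return the span of the post body that the annotation highlights."""
    return post_body[annotation.start:annotation.end]


def annotations_for_post(annotations, post_id):
    """Select the annotations that belong to the post with the given ID."""
    return [a for a in annotations if a.post_id == post_id]
```

Because the post's ID, title and author travel with each record, a list of annotations can be displayed or searched without consulting the application that holds the original post.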

This metadata (post title, author, etc.) is already present on the page. So I added CSS classes to the HTML to indicate what is a post and, within each post, which elements contain the title, the author, the body text, etc. This is great: adding support for annotation to a page requires only the inclusion of the JavaScript and a few minor tweaks to the HTML.
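The idea can be sketched outside the browser as well. The following Python example scans a page for elements marked with such classes and collects the metadata for each post. The class names (entry, title, author, content) are hypothetical and borrowed from Atom element names, and PostExtractor is an illustrative name; this is a sketch of the technique, not the code my annotation tool uses.

```python
from html.parser import HTMLParser

# Hypothetical class names, borrowed from Atom element names.
POST_CLASS = "entry"
FIELD_CLASSES = {"title", "author", "content"}
# Elements that never have a closing tag, so they are not tracked.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PostExtractor(HTMLParser):
    """Collect id, title, author and content for each marked-up post."""

    def __init__(self):
        super().__init__()
        self.posts = []
        self._stack = []      # role of each open element, or None
        self._current = None  # the post being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attr_map = dict(attrs)
        classes = set((attr_map.get("class") or "").split())
        role = None
        if POST_CLASS in classes and self._current is None:
            role = POST_CLASS
            self._current = {"id": attr_map.get("id") or ""}
        elif self._current is not None:
            matches = classes & FIELD_CLASSES
            if matches:
                role = sorted(matches)[0]
        self._stack.append(role)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._stack:
            return
        role = self._stack.pop()
        if role == POST_CLASS:
            cleaned = {k: v.strip() for k, v in self._current.items()}
            self.posts.append(cleaned)
            self._current = None

    def handle_data(self, data):
        if self._current is None:
            return
        # Attribute the text to the innermost enclosing field element.
        for role in reversed(self._stack):
            if role in FIELD_CLASSES:
                self._current[role] = self._current.get(role, "") + data
                break


def extract_posts(html_text):
    parser = PostExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.posts
```

The sketch assumes reasonably well-formed HTML with matching close tags. Nested markup inside a field, such as bold text in the body, is flattened to plain text.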

One question remains: what should these CSS classes be? Is there a standard out there for this sort of thing?

Well, there is Dublin Core. I could use dc:title as a class on the HTML element containing a post title. Of course HTML 4 doesn’t support namespaces, but at least for human beings who have worked with Dublin Core this is crystal clear, and unlikely to conflict with any classes in Moodle. But this is considered evil.

A more promising precedent is the use of microformats. The hReview microformat, for example, uses classes in a regular HTML document to mark up a review of a movie, book, etc. It uses simple, unqualified classes like “title”, “author”, etc. But I can’t find a microformat that solves this problem for syndication feeds. For the moment, I’m basing my classes on the names of tags in the Atom specification.