Several sections of this web site are built by an Ant task from a collection of XML documents. This includes the entire photos section; in the past, the blog was also built this way, from a collection of OpenOffice.org documents. This allows me to regenerate much of the site with a single command. To support this, I had to write a collection of Ant tasks, which you can download here:

OpenOffice.org to Blog Conversion

As part of the site generation process, I have written a filter for converting between OpenOffice.org document format and my own internal metadata format. I then use the original document and metadata to produce HTML pages. This is all hooked in with the Ant tasks. The source code for a previous version of my blog is included in the download.

What the Ant Tasks Do

The Ant tasks I've created make it possible to automatically generate index files, forward and back links, and similar navigation. The three tasks are as follows:

xmlconcat

This task runs an XSLT stylesheet on a set of XML files, then concatenates the results into one large XML document.
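
To make the idea concrete, here is a minimal Python sketch of the concatenation step. It is not the Ant task itself: the names concat_xml, keep_title and the wrapper element "documents" are made up for illustration, and a plain Python function stands in for the XSLT stylesheet, since the Python standard library has no XSLT processor.

```python
import xml.etree.ElementTree as ET


def concat_xml(documents, transform=lambda root: root, wrapper_tag="documents"):
    """Transform each XML document, then append every result under one root."""
    combined = ET.Element(wrapper_tag)
    for text in documents:
        root = ET.fromstring(text)
        combined.append(transform(root))
    return combined


def keep_title(root):
    # Stand-in for a stylesheet: keep only the <title> child of each document.
    entry = ET.Element("entry")
    title = root.find("title")
    if title is not None:
        entry.append(title)
    return entry


docs = [
    "<post><title>First</title><body>a</body></post>",
    "<post><title>Second</title><body>b</body></post>",
]
merged = concat_xml(docs, keep_title)
print(ET.tostring(merged, encoding="unicode"))
```

The result is one document with an entry per input file, which is the shape an index page or a list of forward and back links can be generated from.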

xmlsplit

This splits a single XML document into multiple documents.
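
A rough Python sketch of the reverse operation follows. It assumes one output document per matching child of the root element; the function name split_xml and the element names are hypothetical, and the real task may choose its split points differently.

```python
import xml.etree.ElementTree as ET


def split_xml(text, child_tag):
    """Return each matching child of the root as its own serialized document."""
    root = ET.fromstring(text)
    parts = []
    for child in root.findall(child_tag):
        child.tail = None  # drop whitespace that followed the element in the parent
        parts.append(ET.tostring(child, encoding="unicode"))
    return parts


combined = """<entries>
  <entry id="a">first</entry>
  <entry id="b">second</entry>
</entries>"""

for part in split_xml(combined, "entry"):
    print(part)
```

Each returned string could then be written to its own file, for example one per blog entry.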

filelister

This is used to generate a list of files in the form of an XML document, especially in cases where xmlconcat cannot be used because the files are not XML.
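
As an illustration only, a file listing of this kind might be produced like this in Python. The element and attribute names (files, file, path, size) are assumptions and not necessarily what filelister emits.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def list_files(directory):
    """Build an XML document describing every file below the directory."""
    base = Path(directory)
    root = ET.Element("files")
    for path in sorted(p for p in base.rglob("*") if p.is_file()):
        ET.SubElement(root, "file", {
            "path": path.relative_to(base).as_posix(),
            "size": str(path.stat().st_size),
        })
    return root


# Demonstration on a throwaway directory containing non-XML files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.jpg").write_bytes(b"123")
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "sub" / "b.txt").write_bytes(b"")
    listing = list_files(tmp)

print(ET.tostring(listing, encoding="unicode"))
```

Because the listing is itself XML, it can be fed to a stylesheet like any other document, which is what makes it useful for photos and other binary files.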

I recently discovered the Styler ant task, which has more powerful XML transformation support. It might be very useful for a process like this.

How the Blog Code Works

The source for the site is in the blog directory. Intermediate files are in work, with the final site under wwww. The OO.o conversion happens in multiple stages, detailed in the build file, but generally the process goes like this: 1) unzip the source OO.o documents, 2) extract metadata from the OO.o documents into a single master list file, 3) process the metadata to add new information, such as destination paths and filenames, 4) generate an index page and RSS page from the metadata, 5) generate the individual blog entries.
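
The first two stages can be sketched in Python as follows. This is a simplified stand-in, not the code in the download: OpenOffice.org documents are zip archives containing a meta.xml file, but the fake document built here carries only two Dublin Core fields, and the names extract_metadata, build_master_list, entries and entry, as well as the file names, are invented for the example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def local_name(tag):
    # ElementTree writes namespaced tags as "{uri}name"; keep only "name".
    return tag.rsplit("}", 1)[-1]


def extract_metadata(archive_bytes, member="meta.xml"):
    """Stages 1 and 2: unzip the document and collect its leaf metadata fields."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        root = ET.fromstring(archive.read(member))
    fields = {}
    for element in root.iter():
        text = (element.text or "").strip()
        if len(element) == 0 and text:
            fields[local_name(element.tag)] = text
    return fields


def build_master_list(documents):
    """Merge the metadata of all documents into one master list element."""
    master = ET.Element("entries")
    for name in sorted(documents):
        entry = ET.SubElement(master, "entry", {"source": name})
        for key, value in extract_metadata(documents[name]).items():
            ET.SubElement(entry, key).text = value
    return master


def make_fake_document(title):
    # Hypothetical, heavily simplified stand-in for a real OO.o document.
    meta = ('<meta xmlns:dc="http://purl.org/dc/elements/1.1/">'
            f"<dc:title>{title}</dc:title><dc:subject>demo</dc:subject></meta>")
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("meta.xml", meta)
    return buffer.getvalue()


master = build_master_list({
    "b.sxw": make_fake_document("Second"),
    "a.sxw": make_fake_document("First"),
})
print(ET.tostring(master, encoding="unicode"))
```

The later stages would then enrich each entry with destination paths and run the index, RSS and per-entry transforms over this single master list.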

Flaws

There are a number of weaknesses in the system. I haven't fixed Unicode support for all of the Ant tasks (though xmlconcat is fine), but they do work correctly the way I am using them. The Ant tasks are also not namespace-aware, but in the worst case this only requires a two-step transform, so I don't think this is a serious limitation.

The OpenOffice.org export filter is quite trivial. It ignores styling completely and only pays attention to certain structural elements, including paragraphs, unordered lists, headings, footnotes, block quotes, links, and embedded images. It also extracts certain pieces of metadata, such as the document creation time, subject, description, and specific user metadata fields. I haven't added support for any character-level styles.

Finally, the system is inefficient: it rebuilds the entire blog section every time. For large numbers of documents this could start to get slow, although at the moment the process is in the 20-second range. A second problem is that it doesn't automatically FTP the files, because it doesn't know which ones changed. As a result, there are cross-index pages I would like to have but don't, because they would be affected by every new article. I have added the capability to filelister to generate a SHA1 digest for each file. This makes it possible to FTP only the files that have changed.
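
The digest comparison could look roughly like this in Python. It is a sketch of the idea rather than part of the Ant tasks: the function names and file names are hypothetical, and the digests are computed from in-memory bytes to keep the example self-contained.

```python
import hashlib


def sha1_of(data):
    """Return the hexadecimal SHA1 digest of a file's contents."""
    return hashlib.sha1(data).hexdigest()


def changed_files(previous, current):
    """Return the names that are new or whose digest differs from the last run."""
    return sorted(
        name for name, digest in current.items()
        if previous.get(name) != digest
    )


# Digests recorded after the previous build and after the current one.
previous = {
    "index.html": sha1_of(b"old index"),
    "a.html": sha1_of(b"entry a"),
}
current = {
    "index.html": sha1_of(b"new index"),
    "a.html": sha1_of(b"entry a"),
    "b.html": sha1_of(b"entry b"),
}

print(changed_files(previous, current))  # ['b.html', 'index.html']
```

Only the names returned by changed_files would need to be uploaded; unchanged entries such as a.html are skipped even though they were regenerated.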
