More Annotating Exploder

Since I last wrote about annotation for Internet Explorer, I have solved many problems (including the crash). That’s because IE has many problems. For a time it seemed I had a difficult choice between breaking my application and breaking IE, but I believe I have found a reasonable compromise. My explanation will have to be technical.

Exploder in Space

IE, like other Microsoft tools (Visual Studio, FrontPage), mangles HTML. Here, for example, is a simple list I gave it (I have replaced the spaces with dots for clarity):


<ul>.<li>.one.</li>.<li>.two.</li>.</ul>

And here is what IE turns it into:


<UL>
<LI>one.
<LI>two.</LI></UL>

Pretty kinky, huh? IE did four things:

  1. It removed a close LI tag.
  2. It changed all the tags to upper case.
  3. It added several newlines.
  4. It removed several spaces.

The first two are pretty bad, but they don’t affect me. It’s the whitespace that’s the problem. I store highlights as character offsets from the start of a forum post. So if IE is adding or removing characters, highlights may be displayed in the wrong place on other browsers, just as annotations created on other browsers may look wrong on IE.
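To make the drift concrete, here is a small Python sketch. It is an illustration only, not the annotation code: it takes the two serializations shown above, strips the tags with a crude regular expression as a stand-in for reading the text out of the DOM, and reports where "two" starts in each. The names text_content, as_written and as_ie_rewrote are invented for this example.

```python
import re


def text_content(markup: str) -> str:
    """Crude stand-in for DOM text content: drop anything that looks like a tag."""
    return re.sub(r"<[^>]*>", "", markup)


# The list as authored, and as IE hands it back (dots restored to spaces).
as_written = "<ul> <li> one </li> <li> two </li> </ul>"
as_ie_rewrote = "<UL>\n<LI>one \n<LI>two </LI></UL>"

offset_written = text_content(as_written).index("two")
offset_ie = text_content(as_ie_rewrote).index("two")

print(offset_written, offset_ie)  # 8 6
```

The same word sits at offset 8 in one browser and offset 6 in the other, and the gap grows with every element that precedes the highlight.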

Can I finesse the problem?

Since I have determined IE’s behavior (see note 1), I could emulate it in other browsers by adding and deleting spaces accordingly. This would work, but it would mean that the annotation code would have to implement IE’s broken implementation forever, even if IE is fixed or retired. That’s a high price to pay.

I could just let things be: the problem is relatively rare. IE users may experience a problem caused by their choice of browser; frankly, that’s fine with me. But annotations created with IE will look wrong elsewhere, making other browsers seem broken when they’re not. That’s bad, and could encourage Exploder use.

I could fix single-character errors by declaring that highlights must begin and end on word boundaries. I’m storing the highlighted text, so I could also use that plus the position to adjust for errors. This would solve most problems, since an element can introduce at most one character of error. But I’m counting the offset from the start of the forum post, and a post can have many elements, so the error could get quite large. What if I count from the start of the containing child element instead, thereby reducing the maximum error to one character?
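The "stored text plus position" correction might look something like the following Python sketch. It is a guess at the shape of the idea, not the real implementation; relocate and its max_drift parameter are hypothetical names. It looks for the stored highlight text and accepts the occurrence nearest the stored offset, provided it lies within the allowed drift.

```python
from typing import Optional


def relocate(text: str, quote: str, offset: int, max_drift: int = 1) -> Optional[int]:
    """Return the start of the occurrence of `quote` nearest to `offset`,
    or None if no occurrence lies within `max_drift` characters of it."""
    best = None
    start = text.find(quote)
    while start != -1:
        drift = abs(start - offset)
        if drift <= max_drift and (best is None or drift < abs(best - offset)):
            best = start
        start = text.find(quote, start + 1)
    return best


ie_text = "\none \ntwo "                           # the list text as IE reports it
print(relocate(ie_text, "two", 8))                 # None: stored offset is 2 away
print(relocate(ie_text, "two", 8, max_drift=2))    # 6
```

With offsets counted from the start of the post, even this two-item list already drifts by more than one character, which is why the tolerance only works if the count restarts at each element.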

XPointer

XPointer, the W3C standard for specifying a location in an XML document, works along these lines: a position is expressed relative to an element rather than to the whole document. In fact, the W3C’s Annotea project uses XPointer to locate annotations. If I adopted XPointer, it would solve my problem and bring me closer to Annotea standards.

But there’s a reason I’m not using XPointer. The specification is very complex, so much so that adoption is limited and I don’t believe anyone has implemented the whole thing. Worse, my annotation is a special case.

For a start, I need to display annotations in order, both in the margin next to a post and in the annotation summary. XPointers aren’t ordered (actually, a subset of simple XPointers can be ordered, but IE messes that up by creating and deleting whitespace-only text nodes). Furthermore, highlights modify a document, effectively breaking any XPointers. I would have to map XPointers between an idealized highlight-free document and the actual highlighted one.

I think I made the right decision the first time. I want to be putting annotation in the hands of users, not pioneering complex standards.

A Hack that Would Work

Early on in this process I did have one other thought, but I discarded it because it’s too ugly. Since whitespace is causing all the trouble, I could perform all offset calculations without regard to whitespace. Just insist that highlights not begin or end on a space, and don’t count spaces. Then IE can do whatever perverse acrobatics it wants and it can’t hurt me anymore!
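A whitespace-blind offset is easy to sketch. The Python below is only an illustration under the rule just stated (highlights never start or end on a space); to_nonspace_offset and from_nonspace_offset are made-up names. One function converts an ordinary index into a count of the non-whitespace characters before it, and the other maps such a count back to an index in whatever text the current browser reports.

```python
def to_nonspace_offset(text: str, index: int) -> int:
    """Count the non-whitespace characters that precede `index`."""
    return sum(1 for ch in text[:index] if not ch.isspace())


def from_nonspace_offset(text: str, n: int) -> int:
    """Return the index of the character preceded by exactly `n`
    non-whitespace characters (that character is itself non-whitespace)."""
    seen = 0
    for i, ch in enumerate(text):
        if ch.isspace():
            continue
        if seen == n:
            return i
        seen += 1
    raise ValueError("offset is past the end of the text")


written = "  one   two  "    # list text as authored
ie_text = "\none \ntwo "     # list text as IE reports it

print(to_nonspace_offset(written, 8), to_nonspace_offset(ie_text, 6))      # 3 3
print(from_nonspace_offset(written, 3), from_nonspace_offset(ie_text, 3))  # 8 6
```

Both browsers agree that "two" begins after three non-whitespace characters, even though the raw indexes differ.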

I do have an objection: who ever heard of a “non-whitespace character offset”? These offsets may be what I need to make my application work, but do they really mean anything? Long experience programming has taught me how risky it is to start playing around with representations that don’t correspond to anything in the world. These annotation offsets would be part of a format, and formats can long outlive code.

Actually, there is a measure which ignores spaces but is more meaningful and not specific to any browser: word count.

The question then becomes, “what’s a word?” Unicode can represent punctuation, dingbats, diacritics, and more; figuring out a convention universal to all the countless possible languages is beyond me. The obvious solution is to define a word as a sequence of non-whitespace characters. This definition is less than perfect, but I believe it’s far superior to a raw non-whitespace character offset (it’s also easier to read and debug).
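In code, that definition is a one-liner. The following Python sketch (the function name words is invented here) splits on runs of non-whitespace and shows both the benefit and the imperfection: the two browsers' versions of the list text yield the same words, but punctuation stays glued to its neighbour.

```python
import re


def words(text: str) -> list:
    """Split text into 'words', defined as maximal runs of non-whitespace."""
    return re.findall(r"\S+", text)


print(words("  one   two  "))    # ['one', 'two']
print(words("\none \ntwo "))     # ['one', 'two']
print(words("Hello, world!"))    # ['Hello,', 'world!']
```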

Determining word breaks still isn’t straightforward. For example:


one<p>million</p>dollars

Paragraphs break words, so that’s three words. My system might refer to the word “million” as 2.0-2.7 (word 2, characters 0 through 7). But take this:


one<em>million</em> dollars

Emphasis doesn’t break words, so that’s two words; using the same scheme, “million” would be represented as 1.3-1.10 (word 1, characters 3 through 10, since “one” occupies characters 0 through 3).
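Here is a rough Python sketch of how such word.character points could be computed. It is an illustration under assumptions, not the annotation code: BLOCK_TAGS is a deliberately partial, assumed list of block-level elements, word_range is a hypothetical name, words are numbered from 1, characters from 0, and the end point is exclusive (matching "characters 0 through 7" for a seven-letter word at the start of its word). Block-level tags are treated as word breaks and inline tags are ignored.

```python
import re
from html.parser import HTMLParser

# Assumed, partial list of elements that break words.
BLOCK_TAGS = {"p", "div", "ul", "ol", "li", "blockquote"}


class _Flattener(HTMLParser):
    """Collect text, inserting a space wherever a block-level tag occurs."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def flatten(markup: str) -> str:
    parser = _Flattener()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


def word_range(markup: str, quote: str):
    """Return ((word, char), (word, char)) for the first occurrence of `quote`."""
    text = flatten(markup)
    start = text.index(quote)
    end = start + len(quote)

    def point(pos):
        # Find the word containing position `pos` and the offset within it.
        for n, m in enumerate(re.finditer(r"\S+", text), start=1):
            if m.start() <= pos < m.end():
                return n, pos - m.start()
        raise ValueError("position falls on whitespace")

    end_word, end_char = point(end - 1)   # locate the last character, then step past it
    return point(start), (end_word, end_char + 1)


print(word_range("one<p>million</p>dollars", "million"))     # ((2, 0), (2, 7))
print(word_range("one<em>million</em> dollars", "million"))  # ((1, 3), (1, 10))
```

Because the parser lowercases tag names and whitespace never counts, the authored list and IE's rewritten version of it both place "two" at word 2, characters 0 to 3.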

It’s messy. But it solves the problem — completely and in all browsers. I think it’s the lesser of many evils.

Notes

1 I believe these are the rules for spacing: 1) Ignoring tags, retain only the first of multiple spaces. 2) Remove all leading spaces at the start of block-level elements. 3) Add a newline after the close of a block-level element.

2005-07-25