Shrinking and stretching the boundaries of markup

Doing less and more than XML.

It’s easy to forget that XML started out as a simplification process, trimming SGML into a more manageable and more parseable specification. Once XML reached a broad audience, of course, new specifications piled on top of it to create an ever-growing stack.

That stack, despite solving many problems, brings two new issues: it’s bulky, and there are a lot of problems that even that bulk can’t solve.

Two proposals at last week’s Balisage markup conference examined different approaches to working outside of the stack, though both were clearly capable of working with that stack when appropriate.

Shrinking to MicroXML

John Cowan presented on MicroXML, a project that started with a blog post from James Clark and has since grown to be a W3C Community Group. Cowan has been maintaining the Editor’s Draft of MicroXML, as well as creating a parser library for it, MicroLark. Uche Ogbuji, also chair of the W3C Community Group, has written a series of articles about it.

Cowan’s talk was a broad overview of practical progress on a subject that had once been controversial. Even in the early days of XML 1.0, there were plenty of people who thought that it was still too large a subset of SGML, and the Common XML I edited was one effort to describe how practitioners typically used less than XML made available. The subset discussion has come up repeatedly over the years, with proposals from Tim Bray and others. Now that JSON has replaced many data-centric uses of XML, there is less pressure to emphasize programming needs (data types, for example), but there is still demand for simplifying to a cleaner document model.

MicroXML is both a syntax, compatible with XML 1.0, and a model, kept as simple as possible. In many ways, its designers are trying to make syntax decisions based on their impact on the model.

Ideally, Cowan said, the model would just be element nodes (with attribute maps) containing other element nodes and content. That focus on a clean model means, for example, that while you can declare XML namespaces in MicroXML, the namespace declarations are just attributes. The model doesn’t reflect namespace URIs, and applications have to do that processing. Similarly, whitespace handling is simplified, and attributes become a simple unordered map of key names to values. The syntax allows comments, but they are discarded in the model. Processing instructions remain a question, because they would complicate the model substantially, but the XML declaration and CDATA sections would go. Empty element tags are in, for now …
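To make that model concrete, here is a minimal Python sketch of the shape Cowan described: elements with an unordered attribute map, containing other elements and text. The class and function names are my own illustrations, not part of the MicroXML draft or of MicroLark, and the namespace URI is a placeholder.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Element:
    """One node in a MicroXML-style model: a name, an attribute map, children."""
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)
    children: List[Union["Element", str]] = field(default_factory=list)


def text_of(node):
    """Concatenate all character content under a node, in document order."""
    if isinstance(node, str):
        return node
    return "".join(text_of(child) for child in node.children)


# A namespace declaration is just another attribute in this model;
# the URI below is a made-up placeholder that the model does not interpret.
doc = Element(
    "poem",
    {"xmlns": "http://example.org/ns", "lang": "en"},
    [Element("line", {"n": "144"}, ["He manages to keep the upper hand"])],
)

# Attribute maps are unordered, so these two elements compare equal.
same = Element("a", {"x": "1", "y": "2"}) == Element("a", {"y": "2", "x": "1"})
```

Nothing here knows about comments, CDATA sections, or namespace URIs, which is the point: anything beyond names, attributes, and content is left to the application.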

The pieces that drew the most questions were the current proposal to limit element and attribute names to the ASCII character set and the demise of processing instructions. I mostly heard cheers for the end of draconian error handling, though memories of the bug compatibility wars of an earlier age reminded the audience that it could in fact be worse, or weirder.

There may be, as Cowan noted, “no consensus on a single conversion path” to or from JSON, but MicroXML takes some steps in that direction, suggesting that JSONML should be able to support round-tripping of the MicroXML model, and JSONx could work for JSON to XML.
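As a rough illustration of why a JSONML-style encoding can round-trip such a simple model, the sketch below writes each element as an array of name, attribute object, and children. The `Element` class and both function names are hypothetical, and always emitting the attribute object (even when empty) is my simplification; whatever mapping the MicroXML draft finally specifies may differ.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Element:
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)
    children: List[Union["Element", str]] = field(default_factory=list)


def to_jsonml(node):
    """Encode a node as [name, {attributes}, child, ...]; text stays a string."""
    if isinstance(node, str):
        return node
    # Always emit the attribute object so decoding needs no guesswork.
    return [node.name, dict(node.attributes)] + [to_jsonml(c) for c in node.children]


def from_jsonml(value):
    """Decode the array form produced by to_jsonml back into the model."""
    if isinstance(value, str):
        return value
    name, attributes, *children = value
    return Element(name, dict(attributes), [from_jsonml(c) for c in children])


doc = Element("p", {"class": "note"}, ["Hens ", Element("em", {}, ["range"]), "."])
encoded = json.dumps(to_jsonml(doc))
decoded = from_jsonml(json.loads(encoded))
```

Because the model has only names, attribute maps, and ordered children, `decoded` compares equal to `doc`; there is nothing else to lose along the way.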

While I suspect that MicroXML has a bright future ahead of it in the document space, it seems unlikely to take much territory back from JSON in the data space. MicroXML doesn’t seem to be aiming at JSON at all, however.

Overlapping information and hierarchical tools

Very few programmers want to think about overlapping data structures. In most computing cases — bad pointers? — overlapping data structures are a complex mess. However, they’re extremely common in human communications. LMNL (Layered Markup and Annotation Language), itself a decade-long conversation that has suffered badly from decaying links, has always been an outsider in the normally hierarchical markup conversation. There may be conflicts between XML trees and JSON maps, but both of those become uncomfortable when they have to represent an overlapped structure.

Wendell Piez examined the challenges of processing overlapping markup — LMNL — with tools that expect XML’s neat hierarchies. Obviously, feeding this

[excerpt
[source [date}1915{][title}The Housekeeper{]]
[author
[name}Robert Frost{]
[dates}1874-1963{]] }
[s}[l [n}144{n]}He manages to keep the upper hand{l]
[l [n}145{n]}On his own farm.{s] [s}He's boss.{s] [s}But as to hens:{l]
[l [n}146{n]}We fence our flowers in and the hens range.{l]{s]
{excerpt]

into an XML parser would just produce error messages, even if those tags were angle brackets. Piez compiles that into XML, which represents the content separately from the specified ranges. It is, of course, not close to pretty, but it is processable. At the show, he demonstrated using this to create an annotated HTML format as well as a format that handles overlap gracefully: graphics, using SVG (or here as a PNG if your browser doesn’t like SVG).
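The general idea of keeping content apart from the ranges that annotate it can be sketched in a few lines of Python. This is not Piez's actual XML format, which I am not reproducing here; the `Range` class, the labels, and the choice to join verse lines with newlines are all assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Range:
    label: str
    start: int  # offset of the first character
    end: int    # offset just past the last character


def find_range(text, label, fragment):
    """Build a range covering the first occurrence of fragment in text."""
    start = text.index(fragment)
    return Range(label, start, start + len(fragment))


def overlaps(a, b):
    """True when two ranges cross without one containing the other."""
    first, second = sorted((a, b), key=lambda r: (r.start, -r.end))
    return first.start < second.start < first.end < second.end


lines = [
    "He manages to keep the upper hand",
    "On his own farm. He's boss. But as to hens:",
    "We fence our flowers in and the hens range.",
]
text = "\n".join(lines)

ranges = [
    find_range(text, "l144", lines[0]),
    find_range(text, "l145", lines[1]),
    find_range(text, "l146", lines[2]),
    find_range(text, "s1", lines[0] + "\n" + "On his own farm."),
    find_range(text, "s2", "He's boss."),
    find_range(text, "s3", "But as to hens:\n" + lines[2]),
]

# Pairs of ranges that a single tree could not hold at the same time.
crossing = [(a.label, b.label) for a, b in combinations(ranges, 2) if overlaps(a, b)]
```

With the Frost excerpt, line 145 crosses both the first and the third sentence, which is exactly the situation a strictly hierarchical parser rejects and a flat list of ranges records without complaint.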

Should you be paying attention to LMNL? If your main information concerns fit in neat hierarchies or clean boxes, probably not. If your challenges include representing human conversations, or other data forms where overlap happens, you may find these components critical, even though making them work with existing toolsets is difficult.
