# Thursday, 24 June 2004

I started teaching a class at OIT this week on "Web Services Theory", in which I'm trying to capture not only the reality, but also the grand utopian vision that Web Services were meant to fulfill (more on that later).  That got me thinking about the way the industry as a whole has approached file formats over the last 15 years or so.

There was a great contraction of file formats in the early 90s, which I think resulted in way more problems than anyone had anticipated, followed by a re-expansion in the late 90s when everyone figured out that the whole Internet thing was here to stay and not just a fad among USENET geeks.

Once upon a time, back when I was in college, I worked as a lab monkey in a big room full of Macs as a "support technician".  What that mostly meant was answering questions about how to format Word documents, and trying to recover the odd thesis paper from the 800k floppy that was the only copy of the 200-page paper and had somehow gotten beer spilled all over it.  (This is back when I was pursuing my degree in East Asian Studies and couldn't imagine why people wanted to work with computers all day.)

Back then, Word documents were RTF.  Which meant that Word docs written on Windows 2.0 running on PS/2 model 40s were easily translatable into Word docs running under System 7 on Mac SEs.  Life was good.  And when somebody backed over a floppy in their VW bug and just had to get their thesis back, we could scrape most of the text off the disk even if it had lost the odd sector here and there.  Sure, the RTF was trashed and you had to sift out the now-useless formatting goo, but the text was recoverable in large part.  In other sectors of the industry, files were happily being saved in CSV or fixed-length text files (EDI?) and it might have been a pain to write yet another CSV parser, but with a little effort people could get data from one place to another.
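For the curious, the kind of salvage job described above can be sketched in a few lines of Python.  This is a crude illustration and not a real RTF parser: the function name `salvage_text` and the sample bytes are made up for the example, and a real document would leak font tables and other header junk into the result.  The point is only that a text-based format still gives you words back after a few bytes go missing.

```python
import re

def salvage_text(damaged: bytes) -> str:
    """Crudely pull readable text out of damaged RTF bytes (hypothetical helper)."""
    # latin-1 maps every byte to a character, so decoding never fails.
    text = damaged.decode("latin-1")
    # Drop control words such as \par, \b or \fs24, plus one trailing space.
    text = re.sub(r"\\[a-zA-Z]+-?\d* ?", "", text)
    # Drop the braces that delimit RTF groups.
    text = text.replace("{", "").replace("}", "")
    # Replace runs of non-printable bytes (the trashed sectors) with a space.
    text = re.sub(r"[^\x20-\x7e\n]+", " ", text)
    # Collapse the leftover whitespace.
    return re.sub(r"\s+", " ", text).strip()

# Made-up sample: a small RTF fragment with three garbage bytes in the middle.
sample = b"{\\rtf1\\ansi {\\b Thesis} chapter one\\par \x00\x00\xff more text}"
print(salvage_text(sample))
```

Try the same trick on a binary container format and you mostly get noise, which is where the story goes next.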

Then the industry suddenly decided that it could add lots more value to documents by making them completely inscrutable.  In our microcosm example, Word moved from RTF to OLE Structured Storage.  We support monkeys rued the day!  Sure, it made it really easy to serialize OLE embedded objects, and all kinds of neat value-added junk that most people didn't take advantage of anyway.  On the other hand, we now had to treat our floppies as holy relics, because if so much as one byte went awry, forget ever recovering anything out of your document.  Best to just consider it gone.  We all learned to be completely paranoid about backing up important documents on 3-4 disks just to make sure.  (Since the entire collection of all the papers I ever wrote in college fit on a couple of 1.4MB floppies, not a big deal, but still a hassle.)

Apple and IBM were just as guilty.  They were off inventing "OpenDoc", which was OLE Structured Storage, only invented somewhere else.  And OpenDoc failed horribly, but for lots of non-technical reasons.  The point is, the industry in general was moving file formats towards mutually incomprehensible binary formats, in part to "add value" and in part to ensure "lock in".  If you could only move to another word processing platform by losing all your formatting, it might not be worth it.

When documents were only likely to be consumed within one office or school environment, this was less of an issue, since it was relatively easy to standardize on a single platform, etc.  When the Internet entered the picture, it posed a real problem, since people now wanted to share information over a much broader range, and the fact that you couldn't possibly read a Word for Windows doc on the Mac just wasn't acceptable. 

When XML first started to be everyone's buzzword of choice in the late 90s, there were lots of detractors who said things like "aren't we just going back to delimited text files? what a lame idea!".  In some ways it was like going back to CSV text files.  Documents became human readable (and machine readable) again.  Sure, they got bigger, but compression got better too, and disks and networks became much more capable.  It was hard to shake people loose from proprietary document formats, but it's mostly happened.  Witness WordML.  OLE Structured Storage out, XML in.  Of course, WordML is functionally RTF, only way more verbose and bloated, but it's easy to parse and humans can understand it (given time).
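To make the "going back to CSV" comparison concrete, here is a small Python sketch that reads the same two records from a CSV string and from an XML string.  The element and column names (`papers`, `paper`, `title`, `pages`) are invented for the example and don't come from any real schema; the only claim is that both are plain text you can read by eye, with the XML carrying its field names along with every record.

```python
import csv
import io
import xml.etree.ElementTree as ET

# The same two made-up records, once as CSV and once as XML.
CSV_TEXT = "title,pages\nThesis,200\nTerm paper,12\n"

XML_TEXT = """<papers>
  <paper><title>Thesis</title><pages>200</pages></paper>
  <paper><title>Term paper</title><pages>12</pages></paper>
</papers>"""

def from_csv(text):
    # Field names come from the header row; every value is a string.
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]

def from_xml(text):
    # Field names come from the tags wrapped around each value.
    root = ET.fromstring(text)
    return [{child.tag: child.text for child in paper}
            for paper in root.findall("paper")]

print(from_csv(CSV_TEXT))
print(from_xml(XML_TEXT))
```

Both calls give back the same list of dictionaries.  The XML version is several times longer for the same data, which is the verbosity trade-off mentioned above.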

So from a world of all text, we contracted down to binary siloed formats, then expanded out to text files again (only with metadata this time).  It's like a Big Bang of data compatibility.  Let's hope it's a long while before we hit another contracting cycle.  Now if we could just agree on schemas...

Work | XML