This year’s upcoming debut of Microsoft Office 11 will mark the start of a long process of education and adaptation. Here we’ll explore how existing Office documents can benefit from the new features, how developers will prepare XML-aware Office templates, and how users will apply them to create and analyse XML data.
Microsoft’s Jean Paoli, the architect of Office 11’s XML support, was co-editor of the XML 1.0 specification with Tim Bray. The first thing Paoli showed Bray was that any existing .doc file can be saved as XML — specifically, as WordML, which expresses both the style and the content of the document in pure XML.
“When I showed that to Tim,” Paoli said, “he was jumping for joy.” In a separate interview, Bray, an Internet search pioneer and founder of data-visualisation provider Antarctica Systems, confirmed that reaction.
Although it’s true that Google can index Word, PDF, and other formats, .doc files are inherently opaque. WordML is a bridge from the .doc format to the world of XML and its associated technologies of transformation, indexing, and search. In Word 11, you need only Save As XML to enter that world.
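To make the bridge concrete, here is a minimal Python sketch of what ordinary XML tooling could do with such a file: pull out the text of each paragraph and search it. The namespace URI is an assumption based on the WordprocessingML format of the shipping Word 2003 and may differ in the beta. The sample document is hand-written for illustration, not real Word output, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: the WordprocessingML namespace of the shipping Word 2003 format.
W = "http://schemas.microsoft.com/office/word/2003/wordml"

# A tiny hand-written stand-in for a file produced by Save As XML.
SAMPLE = f"""<w:wordDocument xmlns:w="{W}">
  <w:body>
    <w:p><w:r><w:rPr><w:b/></w:rPr><w:t>Status report</w:t></w:r></w:p>
    <w:p><w:r><w:t>The schema work is </w:t></w:r><w:r><w:t>on track.</w:t></w:r></w:p>
  </w:body>
</w:wordDocument>"""


def paragraphs(wordml: str) -> list[str]:
    """Return the plain text of each paragraph, ignoring style markup."""
    root = ET.fromstring(wordml)
    result = []
    for p in root.iter(f"{{{W}}}p"):
        # A paragraph's text is split across runs; join the text nodes.
        runs = [t.text or "" for t in p.iter(f"{{{W}}}t")]
        result.append("".join(runs))
    return result


def contains(wordml: str, word: str) -> bool:
    """Case-insensitive search over the document's paragraph text."""
    return any(word.lower() in p.lower() for p in paragraphs(wordml))
```

Because style and content are both ordinary elements, a dozen lines of generic XML code are enough to separate them, which is the point of the bridge.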
Word 11’s Save As XML feature presents a check-box labelled “Save as data only”. What data means here is tagged elements belonging to an XML Schema. For a pre-existing .doc file (a status report, say, or a book chapter) there are no such elements. If you check “Save as data only”, Word warns that you’ll lose your document formatting. In this case, you will lose more than that: the output will be an empty file, because the document has no data in the XML sense. Let’s conjure up some.
The example that Paoli offered began with a standard .dot file — that is, an existing Word template, just like those you already use. To make that template a launchpad for a family of documents that store valid XML data, the first step is to acquire, or create, an XSD (XML Schema Definition) file. And that step is a doozy.
Few IT professionals have experience modelling data with XML Schema’s predecessor, DTD (Document Type Definition), which has been around for more than 15 years. Even fewer have XML Schema experience. After Office 11 ships, we face a classic chicken-and-egg scenario. Developers can’t really learn the art of modelling data in business documents without user feedback. But users can’t provide that feedback until they start actually working with XML-enriched documents. Office 11’s XML support isn’t a final solution. Rather, it allows for a long, difficult, and absolutely vital bootstrapping process.
Caveats aside, after the developer has an XSD file — for example, one that defines required structure and data types for a résumé — it’s straightforward to map it to a Word template. In the current beta, you use Tools/Templates and Add-Ins/XML Schema to associate your schema with the template. In the XML Structure task pane you then choose the schema’s root tag and wrap that element around the document. That exposes its contained elements for more granular mapping. Validation of structure and data types happens interactively.
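The kind of checking involved can be sketched in a few lines of Python. This is not XSD validation and not how Word does it: the rules table below is a hand-rolled stand-in for a schema, and the element names (Resume, Name, Updated, Experience, Objective) are hypothetical. It only illustrates the two things the schema constrains, structure and data types.

```python
import xml.etree.ElementTree as ET
from datetime import date


def is_date(text: str) -> bool:
    """True if text is an ISO date such as 2003-03-01."""
    try:
        date.fromisoformat(text)
        return True
    except ValueError:
        return False


def is_text(text: str) -> bool:
    """True if text contains something other than whitespace."""
    return bool(text.strip())


# Hypothetical rules standing in for an XSD: element -> (required, type check).
RULES = {
    "Name": (True, is_text),
    "Updated": (True, is_date),
    "Experience": (True, is_text),
    "Objective": (False, is_text),
}


def validate(xml_text: str) -> list[str]:
    """Return a list of problems; an empty list means the document passes."""
    root = ET.fromstring(xml_text)
    if root.tag != "Resume":
        return [f"root element is {root.tag}, expected Resume"]
    seen = {child.tag: child for child in root}
    errors = []
    for name, (required, check) in RULES.items():
        child = seen.get(name)
        if child is None:
            if required:
                errors.append(f"missing required element {name}")
        elif not check(child.text or ""):
            errors.append(f"invalid value in {name}")
    for tag in seen:
        if tag not in RULES:
            errors.append(f"unexpected element {tag}")
    return errors
```

A résumé that lacks an Experience element, or carries “last week” where a date belongs, produces an error for each problem. An interactive editor would flag the same problems as the user types.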
All of Word’s formatting power is available here. But does that formatting carry over to the saved XML? It depends.
If the XSD file defines a field merely as a string, with no internal XML structure, the formatting will be lost when you save only XML data, not WordML. You can certainly elaborate more structure within that field, but that’s the kind of trade-off developers and users will wrestle with for years to come. It’s costly for developers to define structure, and costly for users to interact with it. The solution will often be to punt on the more elaborate structure, and focus on the benefit of being able to search for words, say, in the Experience sections of a pile of résumés.
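That modest benefit is easy to picture in code. The Python sketch below assumes data-only résumé files in which Experience is a plain string; the file names, element names, and contents are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical data-only résumés, keyed by file name.
RESUMES = {
    "lee.xml": "<Resume><Name>Ann Lee</Name>"
               "<Experience>Built XSLT pipelines for a publisher.</Experience></Resume>",
    "ortiz.xml": "<Resume><Name>Raul Ortiz</Name><Objective>Learn XSLT.</Objective>"
                 "<Experience>Managed retail stores.</Experience></Resume>",
}


def matches(resumes: dict[str, str], word: str, section: str = "Experience") -> list[str]:
    """Return the names of résumés whose given section mentions the word."""
    hits = []
    for filename, xml_text in resumes.items():
        root = ET.fromstring(xml_text)
        # Only text inside the requested element counts as a match.
        text = " ".join(el.text or "" for el in root.iter(section))
        if word.lower() in text.lower():
            hits.append(filename)
    return sorted(hits)
```

Both résumés mention XSLT, but only one does so under Experience. A full-text index over .doc files could not make that distinction, and no structure inside the field was needed to get it.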
The process of schematising an Excel template — say, for an expense report — is similar. Starting with a pre-existing spreadsheet template, you create or acquire a schema, and map the schema to the template, element by element. You can then hand the XML-enhanced template to a user. Expense reports spawned from the template are now, necessarily, schema-valid.
Until Microsoft announced InfoPath (formerly XDocs), examples such as Word résumés and Excel expense reports illustrated a new vision for Office as an information-gathering toolset. Word would create documents full of text and graphics; Excel would create documents full of numbers and charts; both would allow IT to exert control over the data. When it arrives as the newest member of the Office family, InfoPath will complicate that picture. It’s clear that InfoPath, in many cases, will be the strongest tool for gathering semi-structured data. It is tuned neither for the complex documents that are Word’s forte nor the data grids that are Excel’s, but rather for gathering information that might be viewed in Word, or analysed in Excel, or injected into a business process via e-mail or SOAP calls.
But nothing else in the Office suite will have anything like Excel’s analytic prowess. Excel 11’s newfound ability to absorb arbitrary schema-governed XML data, coupled with the explosion of XML data coming from everywhere (Web services, XML-aware databases, the rest of the Office suite, and other emerging XML applications), makes it more valuable than ever.
If you start with a raw XML file — just data, no schema — Excel will read the data and make a best-effort map to the grid. In the resulting worksheet, that data is immediately available for editing, sorting, charting, pivot-table analysis, and more. Of course, when the data comes from a Web service, as it increasingly will, it is likely to be schematised. In that case, your options multiply. Once you associate a schema with the XML data, you can select elements shown in the XML Structure task pane.
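As a rough illustration of what a best-effort map to the grid might involve, the Python sketch below flattens repeating elements into rows and builds the column headings from whatever child elements it encounters. It is a guess at the general idea, not Excel’s actual algorithm, and the sample data and function names are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema-less expense data.
RAW = """<Expenses>
  <Item><Date>2003-03-01</Date><Category>Travel</Category><Amount>120.50</Amount></Item>
  <Item><Date>2003-03-02</Date><Category>Meals</Category><Amount>18.00</Amount></Item>
  <Item><Date>2003-03-04</Date><Category>Travel</Category><Amount>64.25</Amount><Note>Taxi</Note></Item>
</Expenses>"""


def to_grid(xml_text: str) -> tuple[list[str], list[list[str]]]:
    """Map each child of the root to a row; columns come from the tags seen."""
    root = ET.fromstring(xml_text)
    header: list[str] = []
    records = []
    for record in root:
        values = {}
        for field in record:
            if field.tag not in header:
                header.append(field.tag)  # a new tag becomes a new column
            values[field.tag] = (field.text or "").strip()
        records.append(values)
    # Fill gaps with empty cells so every row has the same width.
    rows = [[rec.get(col, "") for col in header] for rec in records]
    return header, rows


def total_by(header: list[str], rows: list[list[str]], key: str, amount: str) -> dict[str, float]:
    """A pivot-table-style summary: sum one column, grouped by another."""
    k, a = header.index(key), header.index(amount)
    totals: dict[str, float] = {}
    for row in rows:
        totals[row[k]] = totals.get(row[k], 0.0) + float(row[a])
    return totals
```

Without a schema, every cell arrives as text and the column set is inferred from the data, which is why the Note column appears only because one record happens to carry it. A schema removes that guesswork by declaring the elements and their types in advance.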
For developers, schematisation of business documents, such as résumés and expense reports, will be a long and gradual process. But Excel’s new ability to read in and analyse XML data (from XML-aware databases, Web services, and other sources) will be immediately useful.