Introducing OpenXML
OpenXML is the name given to the document format introduced by Microsoft for Office 2007. This is a published standards-compliant format based on two pervasive technologies — ZIP and XML — replacing the proprietary binary formats used by previous versions of Microsoft Office. Since OpenXML is based on open standards and represents content as XML, it is possible to work directly with document structures without even the need to have Office software present.
This opens up new avenues for document automation, and direct programming can simplify both the creation of documents such as spreadsheets and the import of data from them. In the past, this meant relying on mail merges or on the Office applications themselves, either running on the desktop using VBA or operating as COM automation servers, which could be unreliable.
In this article, I will be introducing the main structures of an OpenXML document. In subsequent articles, I will show you how to read and update Excel spreadsheets and Word documents using data drawn from your MultiValue applications.
Viewing the OpenXML Format
OpenXML, as the name suggests, was designed to be an open document interchange format, competing with the emerging ODF (Open Document Format) standard. OpenXML effectively spans two specifications: the XML representation of document content and a storage format known as the Open Packaging Convention (OPC). The latter uses ZIP-compatible compression both to reduce the size of the final document and to enable a single document file to archive a variety of different content. So to begin to understand how to manipulate an OpenXML document, we first need to open a Word 2007 file to see what it contains.
- Although OpenXML was introduced as the standard document format for Office 2007, you can create and consume OpenXML documents from earlier versions of Office. For that, you need to download the free Compatibility Pack from Microsoft.
- Armed with an OpenXML compatible version of Office, start Microsoft Word and create a new document with the classic text "Hello, World."
- Save it with the standard .docx extension and uncheck the box to save a thumbnail.
- Next, locate the document in Windows Explorer, copy it and change the extension on the copy to .zip.
Now if you open this file you will see that it is a regular zipped archive containing three folders and a top level document named [Content_Types].xml (fig. 1).
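If you would rather not rename the file, the same listing can be produced in code. What follows is a minimal sketch in Python using the standard zipfile module; the function names are my own for illustration and are not part of any OpenXML library.

```python
import zipfile


def list_parts(package):
    """Return the names of all entries in an OpenXML package.

    `package` may be a file path or a binary file-like object. A .docx
    does not need to be renamed to .zip before it can be read this way.
    """
    with zipfile.ZipFile(package) as archive:
        return archive.namelist()


def top_level_entries(package):
    """Return the sorted top-level folders and files in the package."""
    return sorted({name.split("/", 1)[0] for name in list_parts(package)})
```

Calling top_level_entries("hello.docx"), where hello.docx is a placeholder for the document you saved above, should list the folders and the [Content_Types].xml file shown in fig. 1.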
Fig. 1 OpenXML Document Structure
Packaging Convention
The Open Packaging Convention makes it easier to manage complex documents by separating out the document elements through a structure of folders and relationships. At a physical level, all OpenXML documents follow a similar structure, composed of a number of folders that contain in turn the various elements that make up the document. The key to the structure is found in the top level [Content_Types].xml file, which acts as an index: it records the content type of every part in the package, including the main document part from which the software begins reading the document (fig. 2).
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml" />
  <Default Extension="xml" ContentType="application/xml" />
  <Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml" />
Fig. 2
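To illustrate how software might use this index, here is a small Python sketch that looks up the main document part by its content type. The namespace and content type strings are the ones shown in fig. 2; the function name and the SAMPLE constant are my own, and the sample adds the closing tag that the trimmed figure omits.

```python
import xml.etree.ElementTree as ET

# Namespace of [Content_Types].xml, as shown in fig. 2.
CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"
MAIN_DOCUMENT = (
    "application/vnd.openxmlformats-officedocument."
    "wordprocessingml.document.main+xml"
)


def find_parts_by_type(content_types_xml, content_type):
    """Return the PartName of every Override with the given ContentType."""
    root = ET.fromstring(content_types_xml)
    return [
        override.get("PartName")
        for override in root.findall(f"{{{CT_NS}}}Override")
        if override.get("ContentType") == content_type
    ]


SAMPLE = f"""<Types xmlns="{CT_NS}">
  <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Default Extension="xml" ContentType="application/xml"/>
  <Override PartName="/word/document.xml" ContentType="{MAIN_DOCUMENT}"/>
</Types>"""

print(find_parts_by_type(SAMPLE, MAIN_DOCUMENT))  # prints ['/word/document.xml']
```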
For a regular Word document, the main content is held in the document.xml file found in the Word folder. This contains the document text (the Hello, World you typed) along with all the adorning information for layout, font handling, table specifications, and such.
Open up the document.xml and you will see the Hello, World text you typed inside a structure representing the paragraph, run, and text familiar to anyone who has programmed Word through VBA (fig. 3).
<w:body>
  <w:p>
    <w:pPr>
      <w:rPr>
        <w:lang w:val="en-GB"/>
      </w:rPr>
    </w:pPr>
    <w:r>
      <w:t>Hello, World.</w:t>
    </w:r>
  </w:p>
Fig. 3
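The paragraph, run, and text structure can be walked with any namespace-aware XML parser. The following Python sketch gathers the text of each paragraph; the function name and the SAMPLE document are mine, and the sample wraps the fragment from fig. 3 in a w:document root so that it parses on its own. Treat it as a starting point only: paragraphs nested inside tables or text boxes would need more careful handling.

```python
import xml.etree.ElementTree as ET

# The WordprocessingML main namespace, bound to the w: prefix in document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def paragraph_texts(document_xml):
    """Return the text of each paragraph (w:p) by joining its w:t elements."""
    root = ET.fromstring(document_xml)
    return [
        "".join(t.text or "" for t in p.iterfind(".//w:t", NS))
        for p in root.iterfind(".//w:p", NS)
    ]


SAMPLE = f"""<w:document xmlns:w="{W_NS}">
  <w:body>
    <w:p>
      <w:pPr><w:rPr><w:lang w:val="en-GB"/></w:rPr></w:pPr>
      <w:r><w:t>Hello, World.</w:t></w:r>
    </w:p>
  </w:body>
</w:document>"""

print(paragraph_texts(SAMPLE))  # prints ['Hello, World.']
```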
The document.xml is not the only file in the Word folder; there you will also find style information, settings, and the font table held in their own document parts. If you were to add headers and footers to your document, each one of those would similarly create a separate file in the Word folder. Each such file is known as a 'document part'.
The use of document parts provides great flexibility. A PowerPoint presentation, for example, uses this format to hold each slide as a separate part file, making it easier for developers to copy content between presentations.
Media Content
To see the real advantage of the OPC format, add an image to the original document (fig. 4) and save it once more, again renaming a copy with a zip extension. When you open up this document content, you will see that the Word folder now contains a new sub-folder named media. This holds the image that you added in an unencoded format (fig. 5).
Fig. 4 Image in Word (nice tiger!)
Fig. 5 Image in OpenXML package.
The packaging convention is specifically designed to handle media and other binary content such as embedded objects. These are poorly represented in regular XML, where they generally need to be encoded as text (Base64, for example), which swells their size and makes the document structure large and slow to parse. With OPC, they are stored unaltered in a separate media folder and referenced from the document parts.
Relationships
The constituent parts of the document are bound together using relationships. For each document part, there is a corresponding file in the _rels sub-folder to define the document part relationship. Here the image added is given a Relationship Id (fig. 6).
<Relationship Id="rId4" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="media/image1.jpeg" /> </Relationships>
Fig. 6
The Relationship Id uniquely identifies a media or style element within the document part. If you look again inside the updated document.xml, you will see this id referenced in the (rather complex) image specification (trimmed here in figure 7).
<pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
  <pic:nvPicPr>
    <pic:cNvPr id="0" name="hobbes_200_225.jpg"/>
    <pic:cNvPicPr/>
  </pic:nvPicPr>
  <pic:blipFill>
    <a:blip r:embed="rId4" cstate="print"/>
  </pic:blipFill>
</pic:pic>
Fig. 7
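Following that indirection in code takes two steps: read the relationships file into a lookup table, then resolve each r:embed attribute against it. The Python sketch below does this for the fragments in figs. 6 and 7. The function names and sample constants are mine, and the samples add the namespace declarations and root elements that the trimmed figures leave out. In a real package the relationships for document.xml live in word/_rels/document.xml.rels, and the targets are relative to the word folder.

```python
import xml.etree.ElementTree as ET

REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"


def relationship_targets(rels_xml):
    """Map each Relationship Id to its Target."""
    root = ET.fromstring(rels_xml)
    return {
        rel.get("Id"): rel.get("Target")
        for rel in root.findall(f"{{{REL_NS}}}Relationship")
    }


def embedded_image_targets(part_xml, rels_xml):
    """Return the target of every a:blip r:embed reference in a part."""
    targets = relationship_targets(rels_xml)
    root = ET.fromstring(part_xml)
    embed = f"{{{R_NS}}}embed"
    return [
        targets[blip.get(embed)]
        for blip in root.iter(f"{{{A_NS}}}blip")
        if blip.get(embed) in targets
    ]


RELS = f"""<Relationships xmlns="{REL_NS}">
  <Relationship Id="rId4" Type="{R_NS}/image" Target="media/image1.jpeg"/>
</Relationships>"""

PART = f"""<pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture"
         xmlns:a="{A_NS}" xmlns:r="{R_NS}">
  <pic:blipFill><a:blip r:embed="rId4" cstate="print"/></pic:blipFill>
</pic:pic>"""

print(embedded_image_targets(PART, RELS))  # prints ['media/image1.jpeg']
```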
This level of indirection and the complexity involved in representing certain elements, such as pictures, are two of the features that make navigating or restructuring an OpenXML document in code fairly cumbersome. It does, however, mean that some tasks, such as personalizing documents with customer-specific images, can be achieved in a very straightforward manner by simply replacing the media folder content.
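As a rough illustration of that kind of personalization, the Python sketch below copies a package while substituting the bytes of a single part, such as an image in the media folder. The function name is my own. I am assuming that the replacement image is in the same format as the original, and I make no promise that every Office version will accept the rewritten archive.

```python
import zipfile


def replace_part(source, destination, part_name, new_bytes):
    """Copy an OpenXML package, substituting the bytes of one part.

    `source` and `destination` may be paths or binary file-like objects.
    Raises KeyError, before writing anything, if `part_name` is absent.
    """
    with zipfile.ZipFile(source) as src:
        names = src.namelist()
        if part_name not in names:
            raise KeyError(f"{part_name} not found in package")
        with zipfile.ZipFile(destination, "w", zipfile.ZIP_DEFLATED) as dst:
            for name in names:
                # Every other part is copied across byte for byte.
                data = new_bytes if name == part_name else src.read(name)
                dst.writestr(name, data)
```

A call might look like replace_part("letter.docx", "letter_acme.docx", "word/media/image1.jpeg", logo_bytes), where both file names and logo_bytes are placeholders. Because the relationship still points at media/image1.jpeg, nothing in document.xml needs to change.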
Now that we have briefly outlined the packaging convention, it is time to turn our attention to the document format itself.
Changing a Document
Representing a document structure, be that a word processing document, spreadsheet, or presentation, as an XML document makes it amenable to processing using regular XML tools such as XDOM and XQuery. Once you have dealt with the difficulties of the packaging convention (in particular, adding new document parts), you can make full use of these tools within each part to locate, transform, and add content. However, even a brief look at the document.xml will convince you that real-world navigation of these documents is not entirely straightforward and requires a lot of background research; the markup reference part of the specification alone runs to over 5,000 pages!
So the easiest way to work with OpenXML is to start with an existing document and to then change the details programmatically. To go some way towards smoothing this, Microsoft recently released a new API targeting .NET developers, the OpenXML SDK 2.0, which presents a LINQ-based view of the content while taking care of some — though by no means all — of the intricacies underneath. For an open source alternative, the PHPExcel project on CodePlex provides a rich source of document management functionality and a good set of classes to plunder for your own use.
So, can you simply take an OpenXML document, unzip and change the content, and zip it up again? Not necessarily. The Office applications are very picky about the form of ZIP compression used — just try to extract and then re-compress the document you created above using the built-in Compressed Folder handler in Windows Explorer. Even if you have made no changes, Word will complain that the document is broken (fig. 8).
Fig. 8 Oops - Windows built-in Zip isn't good enough.
To compress the archive effectively, you need to be selective about the type of compression and the zip tool you use. WinZip, for example, works fine if set to maximum portable compression. Using this (you can download the evaluation version, if required) you can open the document.xml, change the text to read something else, and then zip this up again to create a new document. Give it a try. This will be the pattern for most of the development operations we will be covering.
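The same unzip, change, and re-zip pattern can be scripted. The Python sketch below replaces a piece of text in document.xml and writes a new package, reusing each entry's original ZIP metadata so that the compression method is carried over. The function name is mine, and I am assuming the part is UTF-8 encoded and that the text to replace sits inside a single run; text that Word has split across several runs will not match. Whether a particular Office version accepts the result is something you would need to check for yourself.

```python
import zipfile
from xml.sax.saxutils import escape


def replace_document_text(source, destination, old, new,
                          part_name="word/document.xml"):
    """Copy a package, replacing `old` with `new` in one XML part.

    Returns the number of replacements made. Both strings are XML-escaped
    first so that characters such as & and < stay well-formed.
    """
    old_bytes = escape(old).encode("utf-8")
    new_bytes = escape(new).encode("utf-8")
    count = 0
    with zipfile.ZipFile(source) as src, \
            zipfile.ZipFile(destination, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == part_name:
                count = data.count(old_bytes)
                data = data.replace(old_bytes, new_bytes)
            # Passing the original ZipInfo keeps that entry's compression
            # method and timestamp rather than the archive default.
            dst.writestr(info, data)
    return count
```

For example, replace_document_text("hello.docx", "goodbye.docx", "Hello, World.", "Goodbye, World.") would write a changed copy and return 1 if the text was found once; both file names are placeholders.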
For the very best results you need to use the System.IO.Packaging classes included in the .NET framework. We will be looking at those and the OpenXML SDK in the next article.