WordML Transformation

XSL Formatter V3.4 can transform a WordML document into FO and format the resulting FO without an XSL stylesheet being specified.

WordMLToFO Stylesheet is an XSLT stylesheet, so it requires an XSLT processor. We have tested it with the following XSLT processors and confirmed that it works.

XSLT Processor | Notes
Saxon 6.5.3 | Tested under the Sun Java SDK, Java 2 Platform, Standard Edition 1.4 or higher. Instant Saxon does not work with WordMLToFO Stylesheet.
MSXML3, MSXML4 | -

WordMLToFO Stylesheet is based on the XSLT 1.0 W3C Recommendation. Some templates use an extension function for handling result tree fragments. The extension functions are as follows:

WordMLToFO Stylesheet automatically selects the extension function of each XSLT processor by using function-available(). If you want to use another XSLT processor, please confirm that it provides a suitable extension function. If the XSLT processor you choose complies with the exslt.org extensions, you do not need to rewrite the stylesheet.

Transform Specification

WordML Specification

WordML is the XML file format introduced with Microsoft Office 2003. The WordML specification is available from the Microsoft web site.

XSL Formatter V3.4 treats an XML document in the namespace http://schemas.microsoft.com/office/word/2003/wordml as WordML and automatically transforms it into FO.
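The namespace test can be tried outside the formatter. The following is a minimal Python sketch, assuming that the check is made against the namespace of the root element (this document does not say which element is inspected). The name is_wordml is a hypothetical helper, not part of XSL Formatter.

```python
import xml.etree.ElementTree as ET

WORDML_NS = "http://schemas.microsoft.com/office/word/2003/wordml"


def is_wordml(xml_text: str) -> bool:
    """Return True if the root element is in the WordML namespace.

    Assumption: only the root element is inspected.
    """
    root = ET.fromstring(xml_text)
    # ElementTree writes qualified names as "{namespace}local-name".
    return root.tag.startswith("{" + WORDML_NS + "}")
```

A document whose root element is in another namespace, such as an FO file, would not be treated as WordML by this check.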

Page Format

Transformation

The page format is described in the w:sectPr element of WordML. WordMLToFO Stylesheet processes w:sectPr as follows.

  • Generate fo:layout-master-set from /w:wordDocument/w:body//w:sectPr.
  • Generate an fo:page-sequence from each /w:wordDocument/w:body/descendant::wx:sect, then process the w:tbl and w:p elements under it.

In most Word documents, each section corresponds one-to-one to a wx:sect element under /w:wordDocument/w:body. With such documents, this method causes no problems in almost all cases.

Problems

The page format transformation has the following problems.

  • Changing section without page-break

    In a Word document, the section format can change without a page break. For instance, the number of columns can change from two to three in the middle of a page. WordMLToFO Stylesheet, however, transforms each section into an fo:page-sequence, and in XSL-FO a page break always occurs between fo:page-sequence objects. Such a section change therefore starts a new page in the output.

  • Documents containing the outline-level elements

    Word has an outline function. When it is used, the document is shown as a hierarchy in the outline view. In the WordML file, the wx:sub-section element corresponds to an outline level and is nested according to that level. A section break can occur even at a deeply nested outline level. In such a WordML document, no element wraps each section, and wx:sub-section does not correspond to a section. For WordML documents that contain outline elements (wx:sub-section) and deeply nested section breaks, WordMLToFO Stylesheet therefore cannot produce an accurate page format.
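    Whether a document is affected by the outline problem can be checked before conversion. The following is a minimal Python sketch, assuming that a nested section break appears as a w:sectPr element somewhere inside a wx:sub-section element. The wx namespace URI is the one Word 2003 uses for auxiliary hints and is not stated in this document; nested_section_breaks is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from typing import List

W = "http://schemas.microsoft.com/office/word/2003/wordml"
# Assumption: namespace URI bound to the wx prefix by Word 2003.
WX = "http://schemas.microsoft.com/office/word/2003/auxHint"


def nested_section_breaks(xml_text: str) -> List[int]:
    """Return the wx:sub-section nesting depth of each nested w:sectPr."""
    depths: List[int] = []

    def walk(elem, depth):
        if elem.tag == "{%s}sub-section" % WX:
            depth += 1
        # A w:sectPr at depth 0 is an ordinary section break and is ignored.
        if elem.tag == "{%s}sectPr" % W and depth > 0:
            depths.append(depth)
        for child in elem:
            walk(child, depth)

    walk(ET.fromstring(xml_text), 0)
    return depths
```

    An empty result suggests that no section break lies inside an outline level; a non-empty result marks a document whose page format may not be transformed accurately.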

Furthermore, WordMLToFO Stylesheet does not fully support the following page format attribute.

  • Text flow

    Text flow specifies the inline progression direction and the paragraph progression direction. In a Word document, it can be specified for the page format and for the table cell format. Text flow is partially implemented in WordMLToFO Stylesheet, but the result is not correct when it is specified in both the page format and the table cell format.

Style Expansion

In Word documents, character and paragraph formats are determined by styles. A Word document has table styles, paragraph styles, character styles, and so on. Styles have the following characteristics:

  • Styles are hierarchical. A new style can be derived from a "base style": it inherits the formats of the base style and changes only the formats that differ.
  • Styles are structured. A table style contains not only the format of the table itself but also the paragraph and character formats applied to the paragraphs in the table cells. In the same manner, a paragraph style contains not only the format of the paragraph itself but also the character format applied to the characters in the paragraph.

In contrast, XSL-FO has no notion of style. An fo:inline or fo:block object must carry properties that are the final result of applying all of the corresponding styles. Consequently, WordMLToFO Stylesheet processes all of the styles that correspond to each document factor (described in the following table) and outputs the final result as properties of the XSL-FO element.

Document Factor | Condition | Corresponding Styles | Corresponding XSL-FO Element
Paragraph | Inside a table | Table style, paragraph style | fo:block
Paragraph | Outside a table | Paragraph style | fo:block
Inline (text run) | In a paragraph inside a table | Table style, paragraph style, character style | fo:inline
Inline (text run) | In a paragraph outside a table | Paragraph style, character style | fo:inline
Table row, cell | - | Table style | fo:table, fo:table-row, fo:table-cell
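The expansion of styles into final properties can be pictured with a small model. The following is a minimal Python sketch, not the stylesheet's actual algorithm. It assumes that a derived style overrides its base style and that styles are layered in the order table, paragraph, character, with later layers overriding earlier ones. The dictionary layout, the style names, and the function names are all hypothetical.

```python
from typing import Dict, List, Optional


def resolve_style(style_id: Optional[str], styles: Dict[str, dict]) -> Dict[str, str]:
    """Merge properties along the base-style chain; the derived style wins."""
    chain = []
    seen = set()
    while style_id is not None and style_id not in seen:
        seen.add(style_id)  # guard against a cyclic base-style chain
        style = styles[style_id]
        chain.append(style.get("props", {}))
        style_id = style.get("based_on")
    merged: Dict[str, str] = {}
    for props in reversed(chain):  # base style first, derived style last
        merged.update(props)
    return merged


def effective_properties(style_ids: List[str], styles: Dict[str, dict]) -> Dict[str, str]:
    """Layer resolved styles in the given order, e.g. table, paragraph, character."""
    result: Dict[str, str] = {}
    for style_id in style_ids:
        result.update(resolve_style(style_id, styles))
    return result


# Hypothetical style definitions, keyed by style name.
styles = {
    "Normal": {"props": {"font-size": "10.5pt", "font-weight": "normal"}},
    "Heading1": {"based_on": "Normal", "props": {"font-size": "16pt"}},
    "Strong": {"props": {"font-weight": "bold"}},
}

# Text run in a paragraph outside a table: paragraph style, then character style.
run_properties = effective_properties(["Heading1", "Strong"], styles)
```

In this model, run_properties holds the values that would be written as properties of the fo:inline element, since XSL-FO cannot refer to the styles themselves.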

Document Factor Mapping

Mapping Rule

WordMLToFO Stylesheet maps WordML document factors to XSL-FO objects as follows.

Document Factor | WordML Element | XSL-FO Element
Paragraph | w:p | fo:block
Inline (text run) | w:r | fo:inline
List | w:p (contains a w:pPr/w:listPr element) | fo:list-block, fo:list-item, fo:list-item-label, fo:list-item-body
Table | w:tbl, w:tr, w:tc | fo:table, fo:table-row, fo:table-cell
Image | w:pict | fo:external-graphic

Paragraph

A Word paragraph and an XSL-FO fo:block are not the same thing. A Word paragraph consists of inline content (mainly characters) and a paragraph mark, which acts like CR/LF. In contrast, an fo:block is a rectangular area that contains multiple line areas. In a Word document, attributes can be set on the paragraph mark. For instance, if the hidden attribute is applied to a whole paragraph including its paragraph mark, the whole paragraph vanishes. WordMLToFO Stylesheet cannot reproduce this result, because it transforms the paragraph into an fo:block and the hidden attribute is applicable only to fo:inline. As a result, an empty fo:block remains.

List

A list is a special case of the Word paragraph: it contains w:pPr/w:listPr as a descendant. The Word list layout follows this model.

  • The list label is located at the position specified by (left indent - hanging) from the left margin.
  • The first line of the list body is located at the position specified by (list tab) from the left margin.
  • The second and following lines of the list body are located at the position specified by (left indent) from the left margin.
  • The right end of each line is located at the position specified by (right indent) from the right margin.

In other words, the Word list model is composed of four components: left indent, hanging, right indent, and list tab.
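The arithmetic of this model can be written down directly. The following is a minimal Python sketch that reads the third rule as placing continuation lines at the left indent, which is an assumption. The function name and the dictionary keys are hypothetical, and all values use the same unit, for example points.

```python
def word_list_positions(left_indent, hanging, list_tab, right_indent):
    """Return the horizontal positions of the Word list model.

    The first three values are measured from the left margin,
    the last one from the right margin.
    """
    return {
        "label_start": left_indent - hanging,
        "first_line_body_start": list_tab,
        "following_lines_start": left_indent,  # assumption, see lead-in
        "line_end": right_indent,
    }


# Example: 36pt left indent with an 18pt hanging indent and a tab at 36pt.
positions = word_list_positions(36.0, 18.0, 36.0, 0.0)
```

With these values the label starts 18pt from the left margin, and both the first and the following body lines start at 36pt.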

In contrast, the XSL-FO list model is composed of two (or more) fo:block elements that belong to the fo:list-item-label and fo:list-item-body elements. The list label and the list body text are placed in separate blocks, and the horizontal position of each is specified with its own start-indent and end-indent properties.

WordMLToFO Stylesheet transforms lists as follows:

  • A series of list paragraphs is transformed into fo:list-block and fo:list-item objects.
  • Some old-style list paragraphs (Word 6.0/95 format) are converted into a single fo:block object.

Current WordMLToFO Stylesheet implementation does not reproduce the Word list layout completely.

Image

Because an XML file cannot contain binary data, WordML encodes image data in Base64 and stores it as text. In the following XML, the text of the w:binData element is the image data.

<w:p>
  + <w:pPr>
  - <w:r>
    + <w:rPr>
    - <w:pict>
      + <v:shapetype ... >
        <w:binData w:name="wordml://02000001.jpg">/9j/4AAQ...55O7uddCm6cOVn/9l=</w:binData>
      + <v:shape id="_x0000_i1025" type="#_x0000_t75" style="width:244.5pt;height:356.5pt">
    </w:pict>
  </w:r>
</w:p>

We have confirmed the following image formats.

  • PNG (Portable Network Graphics)
  • JPEG (JPEG File Interchange Format)
  • GIF (Graphics Interchange Format)
  • WMF (Windows Metafile)
  • EMF (Windows Enhanced Metafile)

Word sometimes changes the image format when it embeds an image in a document, so the format of the extracted image is not guaranteed to be the same as the original format.

WordMLToFO Stylesheet outputs the Base64-encoded data to the src property of fo:external-graphic. XSL Formatter V3.2 can output this as an image. However, the output FO inevitably becomes large.
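To see which format Word actually embedded, the w:binData text can be decoded and its leading bytes inspected. The following is a minimal Python sketch for diagnostic use; it is not part of WordMLToFO Stylesheet. The function name embedded_images is hypothetical, and only PNG, JPEG, and GIF signatures are recognized.

```python
import base64
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"

# Leading bytes of three common image formats.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"GIF8": "gif",
}


def embedded_images(xml_text):
    """Yield (name, format, data) for each w:binData element."""
    root = ET.fromstring(xml_text)
    for elem in root.iter("{%s}binData" % W):
        # b64decode ignores the line breaks that may appear in the text.
        data = base64.b64decode(elem.text or "")
        fmt = next(
            (name for sig, name in SIGNATURES.items() if data.startswith(sig)),
            "unknown",
        )
        yield elem.get("{%s}name" % W), fmt, data
```

Comparing the detected format with the extension in the w:name attribute shows whether Word changed the format when it embedded the image.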

Word positions images in one of two ways. An inline image is placed in a line together with characters. A floating image is placed on the page at a specified distance from an anchor position; the anchor can be a page anchor, a margin anchor, a paragraph anchor, and so on. For a floating image, the text wrapping type (the relationship between the image and the body text) can also be specified. XSL-FO supports positioned images through fo:block-container, but object positioning differs in many ways between Word and XSL-FO. WordMLToFO Stylesheet therefore transforms every image as an inline image and cannot reproduce the original position of a floating image in the Word document. This is a limitation of the image transformation.

An image in WordML can also take another form that contains no encoded image data but only a link to the image file.

<w:p>
+ <w:pPr>
- <w:r>
  + <w:rPr>
  - <w:pict>
    + <v:shapetype id="_x0000_t75" ...>
      </v:shapetype>
    - <v:shape id="_x0000_s1026" type="#_x0000_t75" ...>
        <v:imagedata src="C:\Documents and Settings\toshi\My Documents\My Pictures\nashan.jpg" />
      </v:shape>
    </w:pict>
  </w:r>
</w:p>

In such a case, WordMLToFO Stylesheet copies the src attribute of the v:imagedata element directly to the src property of the fo:external-graphic element in the FO file.

Page-Header and Page-Footer

In WordML, the position of the header/footer is specified as a distance from the edge of the page and is independent of the page margin. In addition, the size of the header/footer and of the text area changes with the number of lines. In XSL-FO, however, the size of fo:region-before/fo:region-after (the extent value) is fixed. WordMLToFO Stylesheet uses the page margin size of WordML as the extent value of fo:region-before/fo:region-after. Therefore, after conversion to XSL-FO, the text and the header/footer might overlap, or their sizes might differ from those in WordML. In that case, please adjust the page margins in the WordML document.

Multi Column

In WordML, each section can have a different number of columns. In XSL-FO, however, the number of columns cannot change within a page. WordMLToFO Stylesheet generates an fo:page-sequence for each section, so a page break occurs wherever the number of columns changes (that is, at each section change).

Other Document Elements and Limitations

  • Field

    Word supports many types of fields. WordMLToFO Stylesheet transforms a field by using its "result text". Most fields have an element that holds the result text, but there are exceptions. For instance, WordMLToFO Stylesheet cannot obtain the result text of special field types such as list boxes.

  • Tab character

    In Word, the tab character is a convenient and widely used way to position text within a line. XSL-FO has no equivalent of the tab character. WordMLToFO Stylesheet transforms a tab character (w:tab) into an fo:leader object, but the original layout cannot be reproduced exactly.

  • Auto Shape

    Auto Shape is used to draw graphics in Word document. Current WordMLToFO Stylesheet implementation does not support Auto Shape.

  • Footnote, Endnote

    Current WordMLToFO Stylesheet implementation does not support footnote/endnote.

  • Line Height

    The line height might not be set correctly in the transformed FO.

  • Hyphenation

    A word in the Word document with hyphenation setting is divided as follows in WordML:

    <w:t>Fo</w:t>
    <w:t>r</w:t>
    <w:t>matter</w:t>
    

    For that reason, the word is also divided in the transformed FO. As a result, the word cannot be hyphenated.
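    The original word can be recovered by joining the fragments. The following is a minimal Python sketch that illustrates the problem; it does not describe how WordMLToFO Stylesheet works. It assumes that the three w:t elements are siblings inside one w:r element, which the excerpt above does not show. The name run_text is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"


def run_text(run_xml: str) -> str:
    """Concatenate the text of every w:t element found in the fragment."""
    root = ET.fromstring(run_xml)
    return "".join(t.text or "" for t in root.iter("{%s}t" % W))


# Assumption: the fragments share one parent w:r element.
run = ('<w:r xmlns:w="%s">'
       '<w:t>Fo</w:t><w:t>r</w:t><w:t>matter</w:t>'
       '</w:r>' % W)
word = run_text(run)
```

    Here word holds the rejoined text, which could then be hyphenated as a single word.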


WordMLToFO Stylesheet

Antenna House provides WordMLToFO Stylesheet separately as an option. The same stylesheet is built into XSL Formatter V3.4, but its source code is not included. If you purchase WordMLToFO Stylesheet, you can customize it and have XSL Formatter V3.4 use your version. Moreover, the WordMLToFO Stylesheet sold separately may be a newer version than the one built into XSL Formatter V3.4. To find the version of a WordMLToFO Stylesheet, check the FO converted from WordML: <fo:root> has an axf:generator property.

<fo:root axf:generator="WordMLToFO V2.0" ...>
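The version check can be scripted. The following is a minimal Python sketch, assuming that the axf prefix is bound to the Antenna House extension namespace shown in the code; that URI is not stated in this document. The name generator_version is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace URI bound to the axf prefix.
AXF = "http://www.antennahouse.com/names/XSL/Extensions"


def generator_version(fo_text: str):
    """Return the axf:generator value of the root element, or None if absent."""
    root = ET.fromstring(fo_text)
    return root.get("{%s}generator" % AXF)
```

For the fo:root element shown above, with its namespace declarations present, the function would return the string "WordMLToFO V2.0".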

Add the following description to the Option Setting File in order to include the stylesheet.

<stylesheet ns="http://schemas.microsoft.com/office/word/2003/wordml" href="[WordMLToFO install directory]/WordMLToFO.xsl"/>


Copyright © 1999-2005 Antenna House, Inc. All rights reserved.
Antenna House is a trademark of Antenna House, Inc.