XSL Formatter V3.4 can transform a WordML document into FO and format the result without an XSL stylesheet being specified.
WordMLToFO Stylesheet is an XSLT stylesheet and requires an XSLT processor. We have tested it with the following XSLT processors and confirmed that it works.
XSLT Processor | Notes |
---|---|
Saxon 6.5.3 | Tested under the Sun Java SDK, Java 2 Platform, Standard Edition 1.4 or higher. Instant Saxon does not work with WordMLToFO Stylesheet. |
MSXML3, MSXML4 | |
WordMLToFO Stylesheet is based on the W3C XSLT 1.0 Recommendation. Some templates use extension functions for result tree fragments. The extension functions are as follows:
WordMLToFO Stylesheet automatically selects the extension function of each XSLT processor using function-available(). If you want to use another XSLT processor, please confirm that it provides a suitable extension function. If the XSLT processor you choose supports the exslt.org extensions, you do not need to rewrite the stylesheet.
WordML is the XML file format introduced with Microsoft Office 2003. The WordML specification is available from the Microsoft web site:
XSL Formatter V3.4 treats an XML document in the namespace http://schemas.microsoft.com/office/word/2003/wordml as WordML and automatically transforms it into FO.
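The namespace test described above can be sketched in a few lines of Python. This is only an illustration: it assumes that checking the namespace of the root element is sufficient, which this document does not state, and the function name `is_wordml` is hypothetical.

```python
import xml.etree.ElementTree as ET

WORDML_NS = "http://schemas.microsoft.com/office/word/2003/wordml"


def is_wordml(xml_text):
    """Return True if the root element is in the WordML namespace.

    Assumption: only the root element is inspected. The actual check
    performed by XSL Formatter is not documented here.
    """
    root = ET.fromstring(xml_text)
    # ElementTree stores qualified names as "{namespace-uri}local-name".
    return root.tag.startswith("{" + WORDML_NS + "}")


sample = '<w:wordDocument xmlns:w="%s"><w:body/></w:wordDocument>' % WORDML_NS
print(is_wordml(sample))     # True
print(is_wordml("<root/>"))  # False
```

A check like this can be useful before sending a file to the formatter, to confirm that it will be handled as WordML rather than as plain XML.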
The page format is described by the w:sectPr element in WordML. WordMLToFO Stylesheet processes w:sectPr as follows.
In most Word documents, each section corresponds one-to-one to a wx:sect element that is a child of the w:wordDocument element. With this method there are no problems in almost all cases.
Page format conversion has the following problems.
In a Word document, the section format can be changed without inserting a page break. For instance, the number of columns can change from two to three in the middle of a page. However, WordMLToFO Stylesheet transforms each section into an fo:page-sequence, and in XSL-FO a page break always occurs between fo:page-sequence objects.
Word has an outline feature. When it is used, the document appears as a structured document in the outline view. In WordML files, the wx:sub-section element corresponds to an outline level, and these elements are nested according to the outline level. A section can still be changed deep inside nested outline levels. In such a WordML document, no element wraps each section, and wx:sub-section does not coincide with a section. Therefore, for WordML documents that contain outline elements (wx:sub-section) and section breaks at deeply nested positions, WordMLToFO Stylesheet cannot produce an accurate page format.
Furthermore, WordMLToFO Stylesheet does not support the following page format attributes.
Text flow is the parameter that specifies the inline progression direction and the paragraph progression direction. In a Word document, it can be specified for the page format and for the table cell format. Text flow is partially implemented in WordMLToFO Stylesheet, but the result is not appropriate if it is specified in both the page format and the table cell format.
In Word documents, character and paragraph formats are determined by styles. A Word document has table styles, paragraph styles, character styles, and so on. Styles have the following characteristics:
In contrast, XSL-FO has no notion of style. When fo:inline or fo:block objects are generated, the properties specified on them must be the final result of applying the corresponding styles. Consequently, WordMLToFO Stylesheet processes all of the corresponding styles for each document factor (listed in the following table) and outputs the final result as properties of the XSL-FO element.
Document Factor | Condition | Corresponding Styles | Corresponding XSL-FO Element |
---|---|---|---|
Paragraph | Inside a table | Table style, Paragraph style | fo:block |
Paragraph | Outside a table | Paragraph style | fo:block |
Inline (text run) | In a paragraph inside a table | Table style, Paragraph style, Character style | fo:inline |
Inline (text run) | In a paragraph outside a table | Paragraph style, Character style | fo:inline |
Table row, cell | - | Table style | fo:table, fo:table-row, fo:table-cell |
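The idea of flattening several styles into one final property set can be sketched as follows. This is a minimal illustration, not the stylesheet's implementation: the override order (table style first, then paragraph style, then character style) is an assumption, and the style contents and the name `resolve_properties` are hypothetical.

```python
def resolve_properties(*styles):
    """Merge style property dicts in order; later styles override earlier ones.

    Assumption: a more specific style (character) overrides a less
    specific one (paragraph, table). The source does not state the order.
    """
    resolved = {}
    for style in styles:
        resolved.update(style)
    return resolved


# Hypothetical styles, written with XSL-FO property names for brevity.
table_style = {"font-family": "serif", "font-size": "10pt", "color": "black"}
paragraph_style = {"font-size": "12pt"}
character_style = {"color": "red", "font-weight": "bold"}

# Text run in a paragraph inside a table: all three styles apply.
inline_props = resolve_properties(table_style, paragraph_style, character_style)

# Text run in a paragraph outside a table: no table style applies.
plain_props = resolve_properties(paragraph_style, character_style)

# The resolved properties would be written directly on the fo:inline element.
attributes = " ".join('%s="%s"' % item for item in sorted(inline_props.items()))
print("<fo:inline %s>" % attributes)
```

Because XSL-FO has no style objects, every generated element carries its own fully resolved property set, which is one reason the output FO tends to be verbose.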
WordMLToFO Stylesheet maps WordML document factors to XSL-FO objects as follows.
Document Factor | WordML Element | XSL-FO Element |
---|---|---|
Paragraph | w:p | fo:block |
Inline (text run) | w:r | fo:inline |
List | w:p (Contains w:pPr/w:listPr element) | fo:list-block, fo:list-item, fo:list-item-label, fo:list-item-body |
Table | w:tbl, w:tr, w:tc | fo:table, fo:table-row, fo:table-cell |
Image | w:pict | fo:external-graphic |
There are differences between a Word paragraph and an XSL-FO fo:block. A Word paragraph is composed of inline content (mainly characters) and a paragraph mark, which serves the function of CR/LF. In contrast, an fo:block is a rectangular area that contains multiple line areas. In a Word document, you can set attributes on the paragraph mark. For instance, if you apply the hidden attribute to a whole paragraph including its paragraph mark, the whole paragraph vanishes. WordMLToFO Stylesheet cannot reproduce this result because it transforms the paragraph into an fo:block, and the hidden attribute is applicable only to fo:inline. As a result, an empty fo:block remains.
A list is a special case of the Word paragraph: it contains w:pPr/w:listPr as a descendant. The Word list is explained by the following model.
In other words, the Word list model is composed of four components: left indent, hanging indent, right indent, and list tab.
In contrast, the XSL-FO list model is composed of two (or more) fo:block elements that belong to the fo:list-item-label and fo:list-item-body elements. The list label and the list body text are placed in separate blocks, and the horizontal position of each is specified using its own start-indent and end-indent properties.
WordMLToFO Stylesheet transforms the list as follows:
The current WordMLToFO Stylesheet implementation does not reproduce the Word list layout completely.
Because an XML file cannot contain binary data, WordML encodes image data in Base64 and stores it as text. See the following XML. The text of the w:binData element is the image data.
    <w:p>
      <w:pPr>...</w:pPr>
      <w:r>
        <w:rPr>...</w:rPr>
        <w:pict>
          <v:shapetype ... >...</v:shapetype>
          <w:binData w:name="wordml://02000001.jpg">/9j/4AAQ...55O7uddCm6cOVn/9l=</w:binData>
          <v:shape id="_x0000_i1025" type="#_x0000_t75" style="width:244.5pt;height:356.5pt">...</v:shape>
        </w:pict>
      </w:r>
    </w:p>
The image formats that we have confirmed are as follows.
☞ | Word sometimes changes the image format when embedding images into Word documents, so the format of an extracted image is not guaranteed to be the same as that of the original image. |
---|
WordMLToFO Stylesheet outputs the Base64-encoded data to the src property of fo:external-graphic. XSL Formatter V3.2 can output this as an image. However, the output FO inevitably becomes large.
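As an illustration of carrying embedded image data into a src value, the sketch below reads the text of a w:binData element and builds a `data:` URI from it. The use of the `data:` URI form, the default MIME type, and the function name `bin_data_to_data_uri` are assumptions made for this example; the text above says only that the Base64 data is written to the src property.

```python
import base64
import xml.etree.ElementTree as ET

W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"


def bin_data_to_data_uri(pict_xml, mime_type="image/jpeg"):
    """Build a data: URI from the w:binData child of a w:pict element.

    Assumptions: the data: URI form and the default MIME type are
    illustrative choices, not confirmed behaviour of the stylesheet.
    """
    root = ET.fromstring(pict_xml)
    bin_data = root.find("{%s}binData" % W_NS)
    # The Base64 text may be wrapped across lines; remove all whitespace.
    encoded = "".join(bin_data.text.split())
    # Raises binascii.Error if the text is not valid Base64.
    base64.b64decode(encoded, validate=True)
    return "data:%s;base64,%s" % (mime_type, encoded)


# Tiny hypothetical payload: only the first four bytes of a JPEG file.
sample = (
    '<w:pict xmlns:w="%s">'
    '<w:binData w:name="wordml://02000001.jpg">/9j/\n4A==</w:binData>'
    '</w:pict>' % W_NS
)
uri = bin_data_to_data_uri(sample)
print(uri)  # data:image/jpeg;base64,/9j/4A==
```

Base64 expands binary data by roughly one third, which is consistent with the note above that the output FO becomes large.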
Word positions images in one of two ways. An inline image is placed in a line together with characters. A floating image is placed on the page at a specified distance from an anchor position; anchor types include page anchor, margin anchor, paragraph anchor, and so on. For floating images, a text wrapping type (the relationship between the image and the body text) can also be specified. XSL-FO supports positioned images using fo:block-container, but object positioning differs in many ways between Word and XSL-FO. WordMLToFO Stylesheet transforms every image as an inline image, so it cannot reproduce the position displayed in the original Word document. This is a limitation of image transformation.
An image in WordML can also take another form that contains no encoded image data but instead holds a link to an image file.
    <w:p>
      <w:pPr>...</w:pPr>
      <w:r>
        <w:rPr>...</w:rPr>
        <w:pict>
          <v:shapetype id="_x0000_t75" ...> </v:shapetype>
          <v:shape id="_x0000_s1026" type="#_x0000_t75" ...>
            <v:imagedata src="C:\Documents and Settings\toshi\My Documents\My Pictures\nashan.jpg" />
          </v:shape>
        </w:pict>
      </w:r>
    </w:p>
In such a case, WordMLToFO Stylesheet copies the src attribute of the v:imagedata element directly to the src property of the fo:external-graphic element in the FO file.
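Because the link is copied unchanged, it can be worth listing the linked image paths in a WordML file before conversion, to check that they are reachable from the machine that runs the formatter. The following is a minimal sketch under that motivation; the function name `linked_image_sources` and the sample path are hypothetical.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"
V_NS = "urn:schemas-microsoft-com:vml"


def linked_image_sources(xml_text):
    """Return the src value of every v:imagedata element, in document order."""
    root = ET.fromstring(xml_text)
    return [
        element.get("src")
        for element in root.iter("{%s}imagedata" % V_NS)
        if element.get("src")
    ]


# Hypothetical fragment; the path is only an example.
sample = (
    '<w:pict xmlns:w="%s" xmlns:v="%s">'
    '<v:shape id="_x0000_s1026" type="#_x0000_t75">'
    r'<v:imagedata src="C:\images\photo.jpg"/>'
    '</v:shape>'
    '</w:pict>' % (W_NS, V_NS)
)
print(linked_image_sources(sample))
```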
In WordML, the size of the header/footer is specified as a position measured from the top of the page and is independent of the page margin. In addition, the sizes of the header/footer and the text area change depending on the number of lines. In XSL-FO, however, the size (extent value) of fo:region-before/fo:region-after is fixed. WordMLToFO Stylesheet uses the WordML page margin size as the extent value of fo:region-before/fo:region-after. Therefore, after conversion to XSL-FO, the text and the header/footer might overlap or differ in size from the WordML original. In that case, please adjust the page margin in the WordML document.
In WordML, each section can have a different number of columns. In XSL-FO, however, the number of columns cannot change within a page. WordMLToFO Stylesheet generates an fo:page-sequence for each section, so a page break occurs wherever the number of columns changes (at each section change).
Word supports many types of field. WordMLToFO Stylesheet transforms a field using its "result text". Many fields have an element that holds the result text, but there are exceptions. For instance, WordMLToFO Stylesheet cannot obtain the result text from special field types such as list boxes.
In Word, the tab character is useful for positioning text within a line and is widely used when creating documents. In contrast, XSL-FO has no function that corresponds to the tab character. WordMLToFO Stylesheet transforms the tab character (w:tab) into an XSL-FO fo:leader object, but the original layout cannot be reproduced.
AutoShapes are used to draw graphics in Word documents. The current WordMLToFO Stylesheet implementation does not support AutoShapes.
The current WordMLToFO Stylesheet implementation does not support footnotes or endnotes.
The line height might not be correctly set.
In a Word document with the hyphenation setting enabled, a word is divided as follows in WordML:
<w:t>Fo</w:t> <w:t>r</w:t> <w:t>matter</w:t>
For that reason, the word is also divided in the transformed FO. As a result, the word cannot be hyphenated.
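One conceivable workaround, which WordMLToFO Stylesheet does not perform as far as this document states, is to rejoin the split text before conversion. The sketch below only recovers the plain text of a paragraph; it assumes that each w:t sits inside its own w:r, and the function name `paragraph_text` is hypothetical.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"


def paragraph_text(paragraph_xml):
    """Concatenate the text of every w:t element in a paragraph."""
    root = ET.fromstring(paragraph_xml)
    return "".join(t.text or "" for t in root.iter("{%s}t" % W_NS))


# Assumption: each fragment of the split word is in its own text run.
sample = (
    '<w:p xmlns:w="%s">'
    '<w:r><w:t>Fo</w:t></w:r>'
    '<w:r><w:t>r</w:t></w:r>'
    '<w:r><w:t>matter</w:t></w:r>'
    '</w:p>' % W_NS
)
print(paragraph_text(sample))  # Formatter
```

Joining text this way discards per-run formatting, so it would only be safe where the adjacent runs share identical run properties.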
Antenna House provides WordMLToFO Stylesheet separately as an option. Although the same stylesheet is built into XSL Formatter V3.4, its source code is not included. If you purchase WordMLToFO Stylesheet, you can customize it and include it in XSL Formatter V3.4. Moreover, the WordMLToFO Stylesheet sold separately may be a newer version than the one built into XSL Formatter V3.4. To find out the version of a WordMLToFO Stylesheet, check the FO converted from WordML: there is an axf:generator property on <fo:root>.
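Reading the axf:generator value from a converted FO file can be sketched as follows. The axf namespace URI used here is the Antenna House extension namespace as I understand it; verify it against the xmlns:axf declaration in your own FO output. The function name `generator_of` and the sample value are hypothetical.

```python
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"
# Assumption: confirm this URI against the xmlns:axf declaration in your FO.
AXF_NS = "http://www.antennahouse.com/names/XSL/Extensions"


def generator_of(fo_text):
    """Return the axf:generator value on fo:root, or None if it is absent."""
    root = ET.fromstring(fo_text)
    return root.get("{%s}generator" % AXF_NS)


# Hypothetical FO fragment; the generator string is a placeholder.
sample = (
    '<fo:root xmlns:fo="%s" xmlns:axf="%s" '
    'axf:generator="example generator string"/>' % (FO_NS, AXF_NS)
)
print(generator_of(sample))  # example generator string
```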
Add the following description to the Option Setting File in order to include the stylesheet.