Technical Notes

Formatting HTML

AH Formatter V5.0 can format HTML designed for the Web (except for HTML that uses frames). However, little HTML produces a good result without some adjustment after formatting. The reasons are as follows:

For example, if an HTML page prints from a Web browser without its right side being cut off, AH Formatter V5.0 will also produce an acceptable result. To obtain a better result, however, the HTML must be designed for both the browser and printing. In CSS, this usually means specifying the print style in detail with rules such as:

@media print { ... }
@page { ... }

Moreover, CSS implementation levels differ greatly among current Web browsers. If the HTML contains syntax errors that went unnoticed because it looked right in a specific browser, or if it uses incorrect CSS, a good result is unlikely.

Much of the (X)HTML on the Web does not specify a concrete font. (This is desirable, given the nature of the Web.) In the GUI of the Windows version of AH Formatter V5.0, the per-script font settings in the Option Setting File are always in effect, so a suitable font is chosen. Outside the Windows GUI, for example in the UNIX version, there is no such mechanism. Please set <script-font> appropriately in the option setting file and specify that file when formatting.

CAUTION: Since AH Formatter V5.0 formats documents for printing, @media screen is not applied, even when the result is displayed in the GUI.
CAUTION: HTML saved from the web browser
Many web browsers can save the (X)HTML currently being viewed. However, XHTML saved this way may not be correct XHTML. When such XHTML is formatted by AH Formatter V5.0, an error occurs and formatting fails. In such a case, please specify HTML as the formatting type. In addition, white space may be inserted into Japanese text without notice. Such text cannot be formatted well.

Cascading Order of CSS Stylesheet

The cascading order of the CSS stylesheet is defined in the CSS2 Specification as follows.

  1. user agent declarations
  2. user normal declarations
  3. author normal declarations
  4. author important declarations
  5. user important declarations
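
The ordering above can be expressed as a sort key. The following is a minimal sketch in Python, not the implementation of AH Formatter V5.0; the Declaration type and the winning_value helper are hypothetical names introduced for illustration, and selector specificity is deliberately ignored.

```python
from dataclasses import dataclass

# Precedence from lowest (1) to highest (5), following the list above.
CASCADE_RANK = {
    ("user-agent", False): 1,
    ("user", False): 2,
    ("author", False): 3,
    ("author", True): 4,
    ("user", True): 5,
}


@dataclass
class Declaration:
    origin: str       # "user-agent", "user", or "author"
    important: bool   # True for an !important declaration
    value: str


def winning_value(declarations):
    """Return the value of the declaration that ranks highest in the cascade.

    Hypothetical helper: when two declarations have the same rank,
    the one that appears later in the list wins.
    """
    ranked = [
        (CASCADE_RANK[(d.origin, d.important)], index, d)
        for index, d in enumerate(declarations)
    ]
    return max(ranked, key=lambda item: item[:2])[2].value
```

For instance, a user important declaration overrides an author important declaration, which in turn overrides any normal declaration.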

AH Formatter V5.0 supports the following.

Default CSS for HTML

The default CSS for HTML is used as the first stylesheet (user agent declarations) when formatting (X)HTML. This is html.css, which is placed in the directory indicated by the environment variable AHF50_DEFAULT_HTML_CSS. (When html.css does not exist, all elements are formatted as inline.)

This stylesheet was created based on how web browsers display HTML, the styles specified by CSS, and so on. However, some of its specifications may not display well depending on the environment, and preferences also differ. Users should optimize the default CSS for their own environment. Some examples are shown below.

Detection of Formatting Type

When formatting starts with the formatting type set to automatic detection, the formatting type is determined by the following procedure.

  1. If there is no XML declaration and the DOCTYPE is for HTML, the document is detected as HTML.
  2. If the namespace is for XHTML, it is detected as XHTML.
  3. If a MIME type is set and it is for XHTML, it is detected as XHTML.
  4. If the input file has an XHTML extension, it is detected as XHTML.
  5. If a MIME type is set and it is for HTML, it is detected as HTML.
  6. If the input file has an HTML extension, it is detected as HTML.
  7. If there is no XML declaration, no namespace exists, and the root element is <HTML> (case-insensitive), it is detected as HTML.
  8. If a stylesheet that is CSS rather than XSLT is specified (inside the document or as an external document), it is detected as XML+CSS.
  9. If the namespace is for XSL-FO, it is detected as XSL-FO.
  10. Anything else is detected as XML+CSS.

HTML formatting does not require the document to be XML, but for every other formatting type the document must be well-formed XML.
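
The detection procedure can be read as a chain of checks evaluated in order. The following Python sketch illustrates that reading and is not the product's implementation. The function name and its keyword parameters are hypothetical, and the extension sets are assumptions, since the concrete extensions are not listed here. The namespace URIs and MIME types are the standard ones for XHTML, XSL-FO, and HTML.

```python
XHTML_NS = "http://www.w3.org/1999/xhtml"
XSLFO_NS = "http://www.w3.org/1999/XSL/Format"

# Assumed extension sets; the text above does not enumerate them.
XHTML_EXTENSIONS = {".xhtml"}
HTML_EXTENSIONS = {".html", ".htm"}


def detect_formatting_type(*, xml_declaration=False, html_doctype=False,
                           namespace=None, mime=None, extension=None,
                           root_element=None, css_stylesheet=False):
    """Apply the ten detection steps in order and return the first match."""
    if not xml_declaration and html_doctype:
        return "HTML"                                   # step 1
    if namespace == XHTML_NS:
        return "XHTML"                                  # step 2
    if mime == "application/xhtml+xml":
        return "XHTML"                                  # step 3
    if extension in XHTML_EXTENSIONS:
        return "XHTML"                                  # step 4
    if mime == "text/html":
        return "HTML"                                   # step 5
    if extension in HTML_EXTENSIONS:
        return "HTML"                                   # step 6
    if (not xml_declaration and namespace is None
            and (root_element or "").lower() == "html"):
        return "HTML"                                   # step 7
    if css_stylesheet:
        return "XML+CSS"                                # step 8
    if namespace == XSLFO_NS:
        return "XSL-FO"                                 # step 9
    return "XML+CSS"                                    # step 10
```

Because step 8 comes before step 9, a document in the XSL-FO namespace that also specifies a CSS stylesheet is detected as XML+CSS in this sketch.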

Difference in Formatting with XSL Formatter V4

There are some differences in formatting between AH Formatter V5.0 and XSL Formatter V4 as listed below.

Incompatibility of XSL1.0 and XSL1.1

Some incompatible changes were made from XSL1.0 to XSL1.1.

Shorthand

Since shorthand properties in XSL take over their definitions from CSS, their values are evaluated as in CSS. That is,

margin="0pt -10pt"

is evaluated as two values rather than one expression. However, when the property is not a shorthand, the value is evaluated as one expression. For example, the following is one expression.

margin-left="0pt -10pt"

XSL Formatter V4.2 processes such ambiguous expressions in shorthands as follows.

When using an expression in a shorthand, enclose it in parentheses or the like.
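
The difference between the two evaluations can be illustrated with a small Python sketch. It is a simplification under stated assumptions: only pt lengths and the + and - operators are handled, and the function names are hypothetical. It does not reproduce the actual expression parser.

```python
import re


def eval_expr(expr: str) -> float:
    """Evaluate a sum or difference of pt lengths as ONE expression."""
    body = expr.strip()
    if body.startswith("(") and body.endswith(")"):
        body = body[1:-1]
    # Each term is an optional sign followed by a number and the pt unit.
    terms = re.findall(r"([+-]?)\s*(\d+(?:\.\d+)?)pt", body)
    return sum((-1 if sign == "-" else 1) * float(num) for sign, num in terms)


def split_shorthand(value: str) -> list[str]:
    """Split a shorthand value on white space that is outside parentheses."""
    parts, depth, current = [], 0, ""
    for ch in value:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch.isspace() and depth == 0:
            if current:
                parts.append(current)
            current = ""
        else:
            current += ch
    if current:
        parts.append(current)
    return parts


def eval_shorthand(value: str) -> list[float]:
    """Evaluate a shorthand: each white-space-separated part is its own value."""
    return [eval_expr(part) for part in split_shorthand(value)]
```

With these helpers, eval_shorthand("0pt -10pt") yields two values, eval_expr("0pt -10pt") yields the single value of the subtraction, and eval_shorthand("(0pt -10pt) 5pt") keeps the parenthesized expression together as the first of two values.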

In CSS, when calc() is written as calc(10pt-5pt), the - is evaluated as an operator. This is because the CSS3 specification does not state whether - must be separated from <length-unit> in calc(). Syntactically, <length-unit> may be followed immediately by -.

URI

<uri-specification> in the XSL specification takes, inside url(), a character string that conforms to the IRI (RFC3987) specification. For convenience, this document refers to IRI as URI. The schemes that can actually be specified in AH Formatter V5.0 are as follows:

When a bare string is specified without url() and it does not match any of the other permitted values, it is treated as a URI. For example, the following two are equivalent.

<fo:external-graphic src="url('http://localhost/image.png')"/>
<fo:external-graphic src="http://localhost/image.png"/>

Moreover, a relative URI can be specified without a scheme name.

<fo:external-graphic src="url('image.png')"/>
<fo:external-graphic src="image.png"/>

For the user's convenience, AH Formatter V5.0 allows a file name on the local file system to be specified in place of a URI. In general, however, URIs and local file names are not compatible. For example, a URI cannot contain white space, but a local file name may. Moreover, since % may be used directly in a local file name, the string foo%20bar.png points to different resources depending on whether it is evaluated as a URI or as a local file name.

AH Formatter V5.0 solves this problem as follows:

A relative URI is combined with base-uri and transformed into an absolute URI. All local file names are transformed into the file: scheme at this time. For example, in the Windows environment, when base-uri is C:\dir\, the transformation is as follows:

Specified value             Transformed URI
foobar.png                  file:///C:/dir/foobar.png
url('foobar.png')           file:///C:/dir/foobar.png
url('url(foobar.png)')      file:///C:/dir/url(foobar.png)
subdir\foobar.png           file:///C:/dir/subdir/foobar.png
url('subdir\foobar.png')    file:///C:/dir/subdir%5Cfoobar.png
url('subdir/foobar.png')    file:///C:/dir/subdir/foobar.png
foo bar.png                 file:///C:/dir/foo%20bar.png
url('foo bar.png')          file:///C:/dir/foo%20bar.png
foo%20bar.png               file:///C:/dir/foo%2520bar.png
url('foo%20bar.png')        file:///C:/dir/foo%20bar.png
foo%%20bar.png              file:///C:/dir/foo%25%2520bar.png
url('foo%%20bar.png')       file:///C:/dir/foo%25%2520bar.png
foo#bar.png                 file:///C:/dir/foo#bar.png
url('foo#bar.png')          file:///C:/dir/foo#bar.png
foo%23bar.png               file:///C:/dir/foo%2523bar.png
url('foo%23bar.png')        file:///C:/dir/foo%23bar.png
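
The rows for bare local file names (those written without url()) follow a simple pattern: backslashes become slashes, % is escaped as %25, and a space is escaped as %20. The Python sketch below reproduces only those rows. It is an illustration derived from the examples above, not the product's conversion routine; the function name is hypothetical, and it assumes a Windows-style base directory and a relative file name.

```python
def local_name_to_file_uri(name: str, base_dir: str = "C:\\dir\\") -> str:
    """Turn a bare relative local file name into a file: URI.

    Hypothetical helper covering only the characters that appear in the
    examples: backslash, percent sign, and space.
    """
    path = (base_dir + name).replace("\\", "/")
    # Escape '%' first so that the '%20' produced for spaces is not escaped again.
    path = path.replace("%", "%25").replace(" ", "%20")
    return "file:///" + path
```

Note that # is left untouched by this sketch, which matches the row for foo#bar.png.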

A local file name cannot be written directly inside url(). For example:

url('C:\My Document\foobar.png')

The string above will not work as expected. Please specify a local file name without wrapping it in url().

# is the fragment separator. In file:///C:/dir/foo#bar.png, the resource actually accessed is file:///C:/dir/foo. Please specify url('foo%23bar.png') to access a resource named foo#bar.png.

A UNC (Universal Naming Convention) path in Windows, for example \\host\My Document\foobar.png, is transformed into file://host/My%20Document/foobar.png.
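
The effect of # can be checked with a generic URI library. The short Python sketch below uses the standard urllib.parse module purely as an illustration of URI semantics; it says nothing about how AH Formatter V5.0 parses URIs internally.

```python
from urllib.parse import unquote, urldefrag

# An unescaped '#' starts the fragment, so the resource is .../foo.
resource, fragment = urldefrag("file:///C:/dir/foo#bar.png")

# With '%23' there is no fragment; the last path segment decodes to foo#bar.png.
escaped, no_fragment = urldefrag("file:///C:/dir/foo%23bar.png")
file_name = unquote(escaped.rsplit("/", 1)[-1])
```

Here resource is file:///C:/dir/foo with the fragment bar.png, while file_name is foo#bar.png.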

Please refer to Graphics for the data scheme and the jar scheme.

Table Auto Layout

A table (fo:table) accepts table-layout="fixed" or table-layout="auto". The former specifies a fixed layout with fixed column widths; the latter specifies an automatic layout in which the column widths are calculated automatically. When the property is omitted, the default is table-layout="auto". In the XSL specification, automatic layout is implementation-dependent. This document explains how AH Formatter V5.0 implements it.

Automatic layout takes considerable time to calculate the column widths. Please specify table-layout="fixed" if fast formatting is desired.

In AH Formatter V5.0, how a table is processed depends on both the table-layout setting and whether a width is specified on fo:table. When the widths of all columns are specified, the table is treated as table-layout="fixed" even if table-layout="auto" is specified. Moreover, according to the XSL specification, proportional-column-width() may be used only with table-layout="fixed". In AH Formatter V5.0, when columns with proportional-column-width() are mixed with columns that have no width specified, the latter are treated as if column-width="proportional-column-width(1)" were specified, and the table is processed as table-layout="fixed". In that case, every column has a width specification.

table-layout | Width of fo:table | Processing Method
fixed | Yes | The width is divided equally among the columns whose width is not specified. When the content exceeds the width, it overflows.
fixed | No | The table width becomes 100%. The width is divided equally among the columns whose width is not specified. When the content exceeds the width, it overflows.
auto | Yes | The content of each column is measured and a width is assigned to the columns whose width is not specified. When the table exceeds its specified width even with the minimum column widths, the table expands to that larger width.
auto | No | The content of each column is measured and a width is assigned to the columns whose width is not specified. When the table does not reach 100% even with the maximum column widths, that width becomes the table width. When the table exceeds 100% even with the minimum column widths, that width becomes the table width. Otherwise, the table width becomes 100%.
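
The rules that decide whether a table is effectively processed as fixed can be sketched as follows. This Python sketch covers only that decision, not the width calculation itself; the function name and the representation of column widths (a string for a specified width, None for an unspecified one) are assumptions made for illustration.

```python
PROPORTIONAL_PREFIX = "proportional-column-width("


def effective_table_layout(table_layout, column_widths):
    """Return (layout, widths) after applying the rules described above.

    column_widths holds one entry per column: a width string such as
    '30mm' or 'proportional-column-width(2)', or None when unspecified.
    """
    widths = list(column_widths)
    has_proportional = any(
        w is not None and w.startswith(PROPORTIONAL_PREFIX) for w in widths
    )
    has_unspecified = any(w is None for w in widths)

    # Mixed proportional and unspecified columns: fill in the default
    # proportional width, which leaves every column with a width.
    if has_proportional and has_unspecified:
        widths = [
            w if w is not None else "proportional-column-width(1)"
            for w in widths
        ]
        has_unspecified = False

    # Every column has a width: treated as fixed even if auto was given.
    if widths and not has_unspecified:
        return "fixed", widths
    return table_layout, widths
```

For example, a table declared as auto with the columns proportional-column-width(2) and an unspecified column comes out as fixed, with proportional-column-width(1) filled in for the second column.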

When table-layout="auto" is specified, the contents of the columns whose width is not specified are examined. Examining all rows would give better column widths, but it takes too long for a large table. AH Formatter V5.0 normally examines the column contents for at most 100 rows to determine the column widths. This number of rows can be changed with table-auto-layout-limit in the Option Setting File.

When table-layout="fixed" is specified, the column contents are not examined, so processing is always fast.

Line Breaking

AH Formatter V5.0 performs line breaking according to UAX#14: Line Breaking Properties. In some cases the processing differs from UAX#14.

Font Selection

Fonts in FO or CSS are specified with the font-family property. Settings vary: several candidate fonts may be listed, as in font-family="'Courier New', serif", or font-family may not be specified at all. AH Formatter V5.0 determines which font to apply to a character string as follows.

  1. The text in the region is divided into character strings of the same script, using the script information that Unicode defines for each character, the language or script specified in FO or CSS, and so on, and the script of each divided string is determined. This determination is complicated, because Unicode contains characters for which it is ambiguous whether they are full width, and because the language cannot be determined from a string that consists only of Chinese characters.

  2. When font-selection-strategy="auto" is specified, each font-family specified in FO or CSS is examined in order to see whether it supports the Unicode range or script of the first character of the string. The first supporting font found is adopted. When no font-family is specified, the generic font family set as the default font family is assumed.
    When font-selection-strategy="character-by-character" is specified, each character of the string is examined in order to see whether the font-family specified in FO or CSS has its glyph. The string is then divided again according to the supporting fonts. When no font-family is specified, the generic font family set as the standard font family is assumed.
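
The two strategies can be contrasted with a small Python sketch. It models font support as a plain set of characters per font, which is a simplification: the actual check for "auto" concerns Unicode ranges and scripts rather than individual glyphs. The function names and the coverage mapping are hypothetical, and the text is assumed to be non-empty.

```python
def select_auto(text, font_family, coverage):
    """font-selection-strategy="auto": decide by the first character only.

    Returns one (font, text) run; font is None when nothing matches.
    """
    for font in font_family:
        if text[0] in coverage.get(font, set()):
            return [(font, text)]
    return [(None, text)]


def select_character_by_character(text, font_family, coverage):
    """font-selection-strategy="character-by-character": check every
    character and split the string into runs per supporting font."""
    runs = []
    for ch in text:
        font = next(
            (f for f in font_family if ch in coverage.get(f, set())), None
        )
        if runs and runs[-1][0] == font:
            runs[-1] = (font, runs[-1][1] + ch)   # extend the current run
        else:
            runs.append((font, ch))               # start a new run
    return runs
```

With a first font that covers only Latin letters and a second font that also covers Japanese, "auto" assigns the whole string to the first font when the string starts with a Latin letter, whereas "character-by-character" splits the string at the first Japanese character.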

In XSL or CSS, the following five can be used as the generic font family.

AH Formatter V5.0 holds, for every script, the actual fonts that these generic families map to. A default generic font that does not belong to any script can also be defined. These can be specified on the Font Setting page of the Option Setting dialog in the GUI, or with <script-font> in the option setting file.

  1. When a generic font is specified for the script of the target character string, that font is examined to see whether it supports the character string.

  2. When no generic font is specified for that script, the default generic font is examined.

  3. When auto-fallback-font="true" is specified in the Option Setting File and none of the fonts specified in font-family supports the target character string, the following fallback processing is performed.

    1. The font specified as the fallback associated with the corresponding script is examined.
    2. The font specified as the fallback of the standard generic font is examined.
    3. If still no font supports the target character string, the following fonts are examined in order.
      • Windows version
        1. Lucida Sans Unicode
        2. Microsoft Sans Serif
        3. IPAPGothic
        4. Code2000
        5. MS PGothic
        6. Arial Unicode MS
      • Non-Windows version
        1. Helvetica
        2. IPAPGothic
        3. Code2000

  4. If no font that supports the target character string is found even then, it is an error.
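
The fallback order in step 3 amounts to walking one ordered candidate list. The Python sketch below illustrates this under assumptions: the function names are hypothetical, and supports stands in for whatever check decides that a font covers the target character string.

```python
# Last-resort fonts, in the order listed above.
WINDOWS_LAST_RESORT = [
    "Lucida Sans Unicode", "Microsoft Sans Serif", "IPAPGothic",
    "Code2000", "MS PGothic", "Arial Unicode MS",
]
NON_WINDOWS_LAST_RESORT = ["Helvetica", "IPAPGothic", "Code2000"]


def fallback_candidates(script_fallback, standard_fallback, windows):
    """Build the ordered candidate list: script fallback first, then the
    fallback of the standard generic font, then the last-resort fonts."""
    last_resort = WINDOWS_LAST_RESORT if windows else NON_WINDOWS_LAST_RESORT
    return list(script_fallback) + list(standard_fallback) + last_resort


def pick_fallback_font(text, candidates, supports):
    """Return the first candidate for which supports(font, text) is true."""
    for font in candidates:
        if supports(font, text):
            return font
    # Step 4: no font supports the string.
    raise LookupError("no font supports the target character string")
```

The supports callback is left to the caller so that the sketch stays independent of any particular font library.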

The settings in the Option Setting dialog are reflected in the option setting file. For example, they are written like this:

<script-font script="Hans" serif="SimSun" sans-serif="SimHei" monospace="SimSun"/>

Since cursive is not specified here, the cursive font of the default generic font is used for Hans. When <script-font script="Hans"/> itself is not specified, as is the case immediately after installation, the default group is assumed. The Windows version sets up the following default group. Scripts that are not listed here are not set up, and a font is not set up when it does not actually exist.
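
The way missing attributes fall back to the default generic font can be illustrated with the standard xml.etree.ElementTree module. This is a sketch of the described behavior only; the function name is hypothetical, and the default mapping passed in is supplied by the caller rather than read from any real option setting file.

```python
import xml.etree.ElementTree as ET

GENERIC_FAMILIES = ("serif", "sans-serif", "cursive", "fantasy", "monospace")


def resolve_script_fonts(script_font_xml, default_generic):
    """Map each generic family to a font for one <script-font> element.

    Families missing from the element are taken from default_generic,
    a dict that must contain all five generic family names.
    """
    element = ET.fromstring(script_font_xml)
    return {
        family: element.get(family, default_generic[family])
        for family in GENERIC_FAMILIES
    }
```

Applied to the element shown above, serif, sans-serif, and monospace come from the element, while cursive and fantasy come from the default mapping.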

Script | serif | sans-serif | cursive | fantasy | monospace
Standard | Times New Roman | Arial | Segoe Script or Comic Sans MS or Monotype Corsiva | Impact | Courier New
Jpan | MS Mincho or MS Gothic | MS Gothic | MS Mincho or MS Gothic | MS Mincho or MS Gothic | MS Gothic or MS Mincho
Hans | SimSun or MS Song | SimHei or MS Hei or MS Song | SimSun or MS Song | SimSun or MS Song | SimHei or MS Hei or MS Song
Hant | MingLiU | MingLiU | MingLiU | MingLiU | MingLiU
Hang | Batang or BatangChe | Gulim or BatangChe | Batang or BatangChe | Batang or BatangChe | BatangChe
Arab | Arabic Typesetting | Arabic Typesetting | Arabic Typesetting | Arabic Typesetting | Arabic Typesetting
Hebr | FrankRuehl | FrankRuehl | FrankRuehl | FrankRuehl | FrankRuehl
Thai | Angsana New | Angsana New | Angsana New | Angsana New | Angsana New
Deva | Mangal | Mangal | Mangal | Mangal | Mangal

The Macintosh version sets up the following default group.

Script | serif | sans-serif | cursive | fantasy | monospace
Standard | Times or Times New Roman | Helvetica or Arial | Monaco or Chalkboard | Monaco or Chalkboard | Courier
Jpan | HiraMinPro W3 | HiraKakuPro W3 | HiraMaruPro W3 or HiraKakuPro W3 | HiraMaruPro W3 or HiraKakuPro W3 | HiraKakuPro W3
Hans | STXihei | STSong | STXihei | STXihei | STSong
Hant | LiHeiPro | LiSongPro | LiHeiPro | LiHeiPro | LiSongPro
Hang | AppleMyungjo | AppleGothic | AppleMyungjo | AppleMyungjo | AppleGothic
Arab | Geeza Pro | Geeza Pro | Geeza Pro | Geeza Pro | Geeza Pro
Hebr | NewPeninimMT | NewPeninimMT | NewPeninimMT | NewPeninimMT | NewPeninimMT
Thai | Thonburi | Thonburi | Thonburi | Thonburi | Thonburi
Deva | DevanagariMT | DevanagariMT | DevanagariMT | DevanagariMT | DevanagariMT

The other UNIX versions set up the following default group.

Script | serif | sans-serif | cursive | fantasy | monospace
Standard | Times | Helvetica | Times | Times | Courier

Formatting Large Document

For example, when formatting a simple FO (or HTML, etc.) that contains no <fo:page-number-citation> and outputting PDF, AH Formatter V5.0 discards each page once it has been formatted and output. However large the document is, it can be processed with no more memory than one page requires (except when formatting from the GUI). However, when a page refers to a later page with <fo:page-number-citation>, the page number of the referenced page is not known until that page is actually formatted. For that reason, when a page containing an unresolved <fo:page-number-citation> appears, XSL Formatter V4.2 suspends output and holds the intermediate formatting results in memory. When the document begins with a table of contents, nothing is output until every page number that appears in the table of contents is resolved. This limits the number of pages that can be formatted, and the large memory consumption makes formatting a large document impossible.

To solve this problem, AH Formatter V5.0 can process the document with 2-pass formatting. In the first pass, formatting is performed only to resolve <fo:page-number-citation>, and all the required page number information is collected. In the second pass, formatting starts again from the first page. Since every <fo:page-number-citation> is resolved by this time, AH Formatter V5.0 can output the document while discarding pages that have already been formatted. Although the processing time increases, little memory is consumed, and large documents can be formatted.
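
The idea behind the two passes can be sketched conceptually in Python. This is an illustration of the principle only and says nothing about how to invoke 2-pass formatting in the product. The names are hypothetical: format_pages stands for a callable that produces a fresh stream of pages each time it is called, and each page is modeled as a dict listing the ids it defines and the ids it cites.

```python
def collect_page_numbers(pages):
    """Pass 1: record the page number on which each id appears.

    Only the small id -> page number table is kept; pages are not stored.
    """
    numbers = {}
    for number, page in enumerate(pages, start=1):
        for ref_id in page["ids"]:
            numbers[ref_id] = number
    return numbers


def emit_pages(pages, numbers):
    """Pass 2: yield each page with its citations resolved, one at a time,
    so that a finished page can be discarded right after output."""
    for number, page in enumerate(pages, start=1):
        resolved = {ref_id: numbers[ref_id] for ref_id in page["citations"]}
        yield number, resolved


def two_pass(format_pages):
    """Run both passes over a re-creatable stream of pages."""
    numbers = collect_page_numbers(format_pages())
    return emit_pages(format_pages(), numbers)
```

The sketch assumes that both passes paginate identically, so that the page numbers collected in the first pass remain valid in the second.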

The following shows how to perform 2-pass formatting.

CAUTION: 2-pass formatting is not available from the GUI.
CAUTION: 2-pass formatting is not available in AH Formatter V5.0 Lite.

Temporary File

AH Formatter V5.0 does not create temporary work files unless it is unavoidable. It creates temporary work files in the following cases.