AH Formatter V5.0 can format HTML designed for the Web (except for HTML that uses frames). However, little HTML will produce a good result without some adjustment after formatting. The reasons are as follows:
For example, if an HTML page can be printed from a Web browser without its right side being cut off, AH Formatter V5.0 will also format it acceptably. To obtain a better result, however, the HTML must be designed for both browsing and printing. In CSS, the print style would typically be specified in detail with rules such as:
@media print { ... } @page { ... }
Moreover, CSS implementation levels differ greatly among current Web browsers. If the HTML contains syntax errors because it was written with only its appearance in a specific browser in mind, or if it uses incorrect CSS, a good result is unlikely.
Much of the (X)HTML on the Web does not specify concrete fonts. (This is desirable given the nature of the Web.) In the GUI of the Windows version of AH Formatter V5.0, the per-script font settings in the Option Setting File are always in effect, so a suitable font is chosen. Outside the GUI of the Windows version, for example in the UNIX version, there is no such mechanism. Set <script-font> appropriately in the option setting file and specify that file when formatting.
CAUTION: | Since AH Formatter V5.0 formats documents for printing, @media screen is not applied even when the result is displayed on screen in the GUI. |
---|
CAUTION: |
HTML saved from the web browser
Many web browsers can save the (X)HTML currently being viewed. However, XHTML saved this way may not be valid XHTML. When such XHTML is formatted by AH Formatter V5.0, an error occurs and formatting fails. In that case, specify HTML as the formatting type. In addition, white space may be inserted into Japanese text without notice. Such text cannot be formatted well. |
---|
The cascading order of the CSS stylesheet is defined in the CSS2 Specification as follows.
AH Formatter V5.0 handles these as follows.
This is html.css. See also Default CSS for HTML.
This can be specified with <usercss> in the Option Setting File and with the -css or -s command-line options. (The .NET interface, the Java interface, etc. have equivalents of these command-line options.) These are applied in the following order.
In the GUI, only the option setting file is applied. Settings made on the CSS page of the Format Option Setting dialog are reflected in the Option Setting File.
This can be specified with <link> or <style> inside HTML, or with the <?xml-stylesheet .. ?> processing instruction. These are applied in the following order.
Default CSS for HTML is used as the first stylesheet (user agent declarations) when formatting (X)HTML. This is html.css, which is placed in the directory indicated by the environment variable AHF50_DEFAULT_HTML_CSS. (When html.css does not exist, all elements are formatted as inline.)
This stylesheet was created based on how web browsers display HTML, the styles specified by CSS, etc. However, some of its specifications may not display well in every environment, and preferences also differ. Users are expected to adjust the default CSS to suit their own environment. Some examples are shown below.
It is specified as follows by default CSS.
q:before { content: '\201C' } q:after { content: '\201D' }
The current AH Formatter V5.0 cannot change the quotation marks depending on the language. The following specification may be preferable.
q:before { content: '\22' } q:after { content: '\22' }
By default, the footnote number is placed in the left margin of the page. If you do not want it to overflow into the margin, specify padding-left or list-style-position:inside for @footnote. decimal is specified for numbering. Although CSS3 GCPM states that super-decimal is used, many fonts lack the glyphs for super-decimal, so the default CSS does not adopt it. To use super-decimal, modify the CSS as follows.
::footnote-call { content: counter(footnote, super-decimal); } ::footnote-marker { content: counter(footnote, super-decimal); -ah-margin-end: 0.5em; text-indent: 0; }
When formatting starts with automatic detection of the formatting type, the type is determined by the following procedure.
For HTML formatting the document does not need to be XML, but for every other formatting type it must be well-formed XML.
There are some differences in formatting between AH Formatter V5.0 and XSL Formatter V4 as listed below.
For example, V4 formats the following
<fo:block text-transform="capitalize"> HELLO world! </fo:block>
as follows.
Hello World!
AH Formatter V5.0 formats as follows.
HELLO World!
That is, V4 converts every letter except the initial letter of each word to lowercase, whereas V5 leaves those letters unchanged. To get the same result as V4, specify the following.
<fo:block text-transform="capitalize-lowercase">
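The difference between the two values can be pictured with a small Python sketch. It is only an illustration of the behavior described above, not the formatter's implementation; the function names are hypothetical, and words are assumed to be separated by white space.

```python
import re


def _map_words(text, fn):
    # Apply fn to every run of non-space characters, keeping the spacing.
    return re.sub(r"\S+", lambda m: fn(m.group(0)), text)


def capitalize(text):
    # V5-style "capitalize": upper-case the initial letter, leave the rest as is.
    return _map_words(text, lambda w: w[:1].upper() + w[1:])


def capitalize_lowercase(text):
    # V4-style result ("capitalize-lowercase" in V5): the rest is lower-cased.
    return _map_words(text, lambda w: w[:1].upper() + w[1:].lower())


print(capitalize("HELLO world!"))            # HELLO World!
print(capitalize_lowercase("HELLO world!"))  # Hello World!
```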
The implementation of <fo:bidi-override> by AH Formatter V5.0 slightly differs from XSL Formatter V4. For example, V4 formats the following specification
<fo:bidi-override direction="rtl">
This is a <fo:bidi-override unicode-bidi="embed">test</fo:bidi-override>
</fo:bidi-override>
as follows.
This is a test
AH Formatter V5.0 formats as follows.
test This is a
When you want to make it the same as V4, please specify bidi-override-mode="4" in the Option Setting File. In addition, when the following is specified,
<fo:bidi-override direction="rtl">
This is a ‫test‬
</fo:bidi-override>
there is no difference to take into account.
With AH Formatter V5.0, the initial value of otf-metrics-mode is changed from "windows" to "typographic". The baseline may change slightly depending on the font. The difference is especially noticeable with MORISAWA fonts.
AH Formatter V5.0 improves the processing of trimming a line of text. This enhancement allows finer control through axf:text-justify-trim, but the number of characters that fit on one line may differ from XSL Formatter V4. To get the same result as V4 with FO that does not use text-justify-mode="4", specify text-justify-mode="4" in the Option Setting File.
AH Formatter V5.0 improves the processing when putting fonts with different baselines like a mixture of Western and Japanese text. For example,
<fo:block>Latin漢字</fo:block> <fo:block>漢字Latin</fo:block> <fo:block>Latin</fo:block> <fo:block>漢字</fo:block>
in a case like the above, you may specify font-family="'Times New Roman', 'MS Mincho'" so that the Japanese font is not applied to Latin text. In XSL Formatter V4 the first font specified in font-family determines the baseline, so the line height may differ. AH Formatter V5.0 selects the font from font-family according to the script or the language specification, so specifying language="jpn" in the example above applies a suitable baseline. To get the same result as V4, specify baseline-mode="4" in the Option Setting File.
font-selection-strategy="character-by-character" is supported from AH Formatter V5.0. In addition, auto-fallback-font in the Option Setting File makes it possible to control the fallback. See also Font Selection.
XSL1.1 makes some changes that are incompatible with XSL1.0.
In XSL1.1, writing-mode and reference-orientation are ignored when they are specified on fo:region-*. To make these specifications effective in XSL1.1, the following must be specified on fo:page-sequence.
writing-mode="from-page-master-region()" reference-orientation="from-page-master-region()"
To have this evaluated as in XSL 1.0 without changing the FO, specify default-from-page-master-region="true" in the Option Setting File.
In XSL1.0, fo:table is supposed to generate a reference area (see 5.6 in XSL1.0). In XSL1.1, however, this was corrected as an error. The difference mainly appears when margin-* specified on fo:table is converted to start-indent and end-indent. For example:
<fo:block margin-left="10pt"> <fo:table margin-left="0pt"> ...
In the table like above, left margins may differ between XSL1.0 and XSL1.1. If start-indent etc. are used instead of margin-*, such incompatibility will not be generated.
To have this evaluated as in XSL 1.0 without changing the FO, specify table-is-reference-area="true" in the Option Setting File.
Because the shorthand properties of XSL inherit their definitions from CSS, their values are evaluated as in CSS. That is,
margin="0pt -10pt"
is evaluated as 2 values instead of one formula. However, when it's not a shorthand, this is evaluated as one formula. For example, the following is one formula.
margin-left="0pt -10pt"
XSL Formatter V4.2 processes such an ambiguous expression by the shorthand as follows.
When using a formula in the shorthand, it can be enclosed with parentheses, etc.
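The distinction between "several values" and "one formula" can be illustrated with a minimal Python sketch. It is an assumption-laden toy, not the formatter's parser: both function names are hypothetical, only pt lengths with + and - are handled, and parentheses are only used to keep a formula together as one shorthand value.

```python
import re


def split_shorthand(value):
    # Shorthand: split into values at white space that is outside parentheses.
    parts, depth, current = [], 0, ""
    for ch in value.strip():
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch.isspace() and depth == 0:
            if current:
                parts.append(current)
                current = ""
        else:
            current += ch
    if current:
        parts.append(current)
    return parts


def eval_pt_formula(formula):
    # Non-shorthand: treat the whole string as one formula of pt lengths.
    total, sign = 0.0, 1.0
    for token in re.findall(r"[+-]|\d+(?:\.\d+)?pt", formula):
        if token == "+":
            sign = 1.0
        elif token == "-":
            sign = -1.0
        else:
            total += sign * float(token[:-2])
            sign = 1.0
    return total


print(split_shorthand("0pt -10pt"))        # two values for margin
print(eval_pt_formula("0pt -10pt"))        # one formula for margin-left
print(split_shorthand("(0pt -10pt) 5pt"))  # parentheses keep the formula together
```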
With CSS, when calc() is written as calc(10pt-5pt), - is evaluated as an operator. This is because the CSS3 specification of calc() does not say whether - must be separated from <length-unit>. Syntactically, <length-unit> may be followed directly by -.
<uri-specification> in the XSL specification is supposed to specify, inside url(), a character string that conforms to the IRI (RFC3987) specification. For convenience, IRI is called URI in this document. The schemes that can actually be specified in AH Formatter V5.0 are as follows:
When a bare string is specified without url() and it does not match any of the other allowed values, it is treated as a URI. For example, the following two are equivalent.
<fo:external-graphic src="url('http://localhost/image.png')"/>
<fo:external-graphic src="http://localhost/image.png"/>
Moreover, it's possible to specify the relative URI without specifying the scheme name.
<fo:external-graphic src="url('image.png')"/> <fo:external-graphic src="image.png"/>
For the user's convenience, AH Formatter V5.0 allows a file name on the local file system to be specified instead of a URI. In general, however, URIs and local file names are not compatible. For example, white space is not allowed in a URI but may be used in a local file name. Moreover, % may be used literally in a local file name, so the character string foo%20bar.png points to a different resource depending on whether it is evaluated as a URI or as a local file name.
AH Formatter V5.0 solves this problem as follows:
The relative URI is combined with base-uri and transformed into the absolute URI. All local file names are transformed into a file: scheme at this time. For example, in the Windows environment, when base-uri is C:\dir\, it is transformed as follows:
Specification | Transformed URI |
---|---|
foobar.png | file:///C:/dir/foobar.png |
url('foobar.png') | file:///C:/dir/foobar.png |
url('url(foobar.png)') | file:///C:/dir/url(foobar.png) |
subdir\foobar.png | file:///C:/dir/subdir/foobar.png |
url('subdir\foobar.png') | file:///C:/dir/subdir%5Cfoobar.png |
url('subdir/foobar.png') | file:///C:/dir/subdir/foobar.png |
foo bar.png | file:///C:/dir/foo%20bar.png |
url('foo bar.png') | file:///C:/dir/foo%20bar.png |
foo%20bar.png | file:///C:/dir/foo%2520bar.png |
url('foo%20bar.png') | file:///C:/dir/foo%20bar.png |
foo%%20bar.png | file:///C:/dir/foo%25%2520bar.png |
url('foo%%20bar.png') | file:///C:/dir/foo%25%2520bar.png |
foo#bar.png | file:///C:/dir/foo#bar.png |
url('foo#bar.png') | file:///C:/dir/foo#bar.png |
foo%23bar.png | file:///C:/dir/foo%2523bar.png |
url('foo%23bar.png') | file:///C:/dir/foo%23bar.png |
A local file name cannot be written directly into url(). For example:
url('C:\My Document\foobar.png')
The string above will not work as expected. Specify a local file name without enclosing it in url().
# is the fragment separator. In file:///C:/dir/foo#bar.png, the resource actually accessed is file:///C:/dir/foo. Specify url('foo%23bar.png') to access a resource called foo#bar.png.
UNC (Universal Naming Convention) in Windows, for example, \\host\My Document\foobar.png is transformed into file://host/My%20Document/foobar.png.
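As a rough illustration of the two evaluations (bare local file name versus the content of url()), the following Python sketch reproduces several rows of the table above with the standard urllib.parse.quote function. It is not the formatter's algorithm: the function names and the fixed BASE value are assumptions, and malformed escapes such as %% inside url() are deliberately not handled.

```python
from urllib.parse import quote

# Assumed base-uri C:\dir\ already written as a file: URI.
BASE = "file:///C:/dir/"


def from_local_name(name):
    # Bare string: a local file name, so every '%' is a literal character
    # and gets escaped; '\' is a directory separator; '#' is left alone.
    path = name.replace("\\", "/")
    return BASE + quote(path, safe="/#()")


def from_url_function(value):
    # String inside url('...'): already a URI reference, so existing %XX
    # escapes are kept; only characters illegal in a URI are escaped.
    return BASE + quote(value, safe="/#()%")


for s in ["foo bar.png", "foo%20bar.png", "subdir\\foobar.png"]:
    print(s, "->", from_local_name(s), "|", from_url_function(s))
```

The only difference between the two functions is whether % (and the backslash) is taken literally, which is exactly where the table rows for the bare form and the url() form diverge.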
☞ | Please refer to Graphics for the data scheme and the jar scheme. |
---|
A table (fo:table) can be given table-layout="fixed" or table-layout="auto". The former specifies the fixed layout, in which column widths are fixed, and the latter specifies the automatic layout, in which column widths are calculated automatically. When the value is omitted, the default is table-layout="auto". In the XSL specification, the automatic layout is implementation-dependent. This document explains the implementation in AH Formatter V5.0.
An automatic layout takes considerable time to calculate column widths. Specify table-layout="fixed" if fast formatting is desired.
In AH Formatter V5.0, how a table is processed depends on the combination of table-layout and whether a width is specified on fo:table. When the widths of all columns are specified, the table is treated as table-layout="fixed" even if table-layout="auto" is specified. Moreover, according to the XSL specification, proportional-column-width() may be specified only with table-layout="fixed". In AH Formatter V5.0, when columns with proportional-column-width() are mixed with columns without a width specification, the columns without a width are treated as if column-width="proportional-column-width(1)" were specified, and the table is processed as if table-layout="fixed" were specified. That is, in such a case all columns have a width specification.
table-layout | Width of fo:table | Processing Method |
---|---|---|
fixed | Yes | The width is divided equally and assigned to the columns whose width is not specified. When the content exceeds the width, it overflows. |
fixed | No | The table width becomes 100%. The width is divided equally and assigned to the columns whose width is not specified. When the content exceeds the width, it overflows. |
auto | Yes | The content of each column is measured and a width is assigned to the columns whose width is not specified. When the table exceeds its specified width even with the minimum width adopted for each column, the table width expands to the exceeded width. |
auto | No | The content of each column is measured and a width is assigned to the columns whose width is not specified. When the table does not reach 100% even with the maximum width adopted for each column, that becomes the table width. When the table exceeds 100% even with the minimum width adopted for each column, that becomes the table width. Otherwise, the table width becomes 100%. |
When table-layout="auto" is specified, the content of the columns whose width is not specified is examined. A better column width could be determined by examining all rows, but that takes too much time for a large table. AH Formatter V5.0 normally examines the content of at most 100 rows to determine the column widths. This number of rows can be changed with table-auto-layout-limit in the Option Setting File.
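The effect of such a row limit can be sketched in Python. This is a simplified illustration, not the formatter's algorithm: the function name is hypothetical, cells are plain strings, and the "width" of a cell is just its length in characters.

```python
def auto_column_widths(rows, row_limit=100):
    # Only the first row_limit rows are examined, mirroring the idea behind
    # table-auto-layout-limit; later rows do not influence the widths.
    sample = rows[:row_limit]
    if not sample:
        return []
    widths = [0] * max(len(row) for row in sample)
    for row in sample:
        for i, cell in enumerate(row):
            widths[i] = max(widths[i], len(cell))
    return widths


rows = [["a", "bb"]] * 100 + [["cccccc", "d"]]
print(auto_column_widths(rows))                 # the 101st row is ignored
print(auto_column_widths(rows, row_limit=101))  # now the wide cell counts
```

A wide cell that first appears after the limit is not taken into account, which is the trade-off between speed and layout quality described above.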
When table-layout="fixed" is specified, the content of the columns is not examined, so processing is always fast.
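For the fixed layout, the equal division of the remaining width among columns without a width can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function name is hypothetical, widths are plain numbers in one unit, and None marks a column whose width is not specified.

```python
def fixed_layout_widths(table_width, column_widths):
    # Columns with a width keep it; the rest share what remains equally.
    missing = sum(1 for w in column_widths if w is None)
    if missing == 0:
        return list(column_widths)
    specified = sum(w for w in column_widths if w is not None)
    share = max(table_width - specified, 0) / missing
    return [share if w is None else w for w in column_widths]


print(fixed_layout_widths(400, [100, None, None]))
```

No cell content is inspected anywhere in this calculation, which is why the fixed layout is fast and why overlong content simply overflows.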
AH Formatter V5.0 processes line breaking according to UAX#14: Line Breaking Properties. In the following cases the processing differs from UAX#14.
Nonstarter Japanese characters defined in JIS X 4051:2004 can be controlled by axf:line-break.
The rule LB30 was deleted in UAX#14 Revision 2.2. This change was made with Japanese etc. in mind, but it causes the problem that a line can break before (s) in a case like person(s). AH Formatter V5.0 permits line breaking at full-width parentheses. The target punctuation marks are the closing parentheses, opening parentheses and full-width punctuation marks indicated in axf:punctuation-trim. For half-width parentheses in European languages, line breaking is processed by interpreting LB30 as follows.
(AL | NU) x OP
CL x (AL | NU)
The line breaking class AI in a CJK script is processed as ID. However, U+2015 (HORIZONTAL BAR) is processed as IN since it is non-breaking character in JIS X 4051:2004.
The line breaking class of half-width kana is AL. As with alphabetic text, no line break occurs unless there is a space between words. AH Formatter V5.0 treats half-width kana as full-width kana when processing line breaking.
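The LB30 interpretation for half-width parentheses can be written as a small predicate. The Python sketch below is only an illustration: the classifier is a hypothetical stand-in that covers ASCII letters, digits and parentheses, whereas real line breaking classes come from the Unicode data files.

```python
ALNUM = {"AL", "NU"}


def classify(ch):
    # Hypothetical, ASCII-only stand-in for the UAX#14 class lookup.
    if ch == "(":
        return "OP"
    if ch == ")":
        return "CL"
    if ch.isascii() and ch.isalpha():
        return "AL"
    if ch.isascii() and ch.isdigit():
        return "NU"
    return "OTHER"


def lb30_prohibits_break(before, after):
    # (AL | NU) x OP  and  CL x (AL | NU): no break between the two classes.
    return (before in ALNUM and after == "OP") or (before == "CL" and after in ALNUM)


# "person(s)": no break is allowed between "n" and "(".
print(lb30_prohibits_break(classify("n"), classify("(")))
# A full-width parenthesis is not covered by this rule.
print(lb30_prohibits_break(classify("n"), classify("（")))
```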
Fonts in FO or CSS are specified by the font-family property. The settings vary: the candidate fonts may be enumerated, as in font-family="'Courier New', serif", or font-family may not be specified at all. AH Formatter V5.0 determines which font to apply to a character string as follows.
The character strings in the region are divided into strings of the same script, based on the script information that Unicode defines for each character, the language or script specified in FO or CSS, etc., and the script of each divided string is determined. This determination is complicated because Unicode contains characters for which it is ambiguous whether they are full-width, and because the language cannot be determined from a string that consists only of Chinese characters.
When font-selection-strategy="auto" is specified, the fonts in the font-family specified in FO or CSS are checked in order for whether they support the Unicode range or script of the first character of the string. The first supporting font found is adopted. When no font-family is specified, the generic font family specified as the default font family is assumed.
When font-selection-strategy="character-by-character" is specified, each character of the string is checked in order for whether the fonts in the font-family specified in FO or CSS have its glyph. The string is then re-divided per supporting font. When no font-family is specified, the generic font family specified as the standard font family is assumed.
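The two strategies can be contrasted with a short Python sketch. Everything here is illustrative: the coverage table is invented for the example (real coverage comes from the font files), the function names are hypothetical, and the input to the "auto" case is assumed to be a string that has already been divided by script.

```python
from itertools import groupby

# Invented glyph coverage for the example.
COVERAGE = {
    "Times New Roman": set("Latin"),
    "MS Mincho": set("Latin漢字"),
}


def first_supporting(ch, families):
    # Return the first font in the list that has a glyph for ch.
    for name in families:
        if ch in COVERAGE.get(name, set()):
            return name
    return None


def select_auto(text, families):
    # "auto": only the first character decides the font for the whole string.
    return first_supporting(text[0], families)


def select_character_by_character(text, families):
    # "character-by-character": check every character, then regroup the
    # string into runs that share the same font.
    keyed = groupby(text, key=lambda ch: first_supporting(ch, families))
    return [("".join(chars), font) for font, chars in keyed]


families = ["Times New Roman", "MS Mincho"]
print(select_auto("漢字", families))
print(select_character_by_character("Latin漢字", families))
```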
In XSL or CSS, the following five generic font families can be used: serif, sans-serif, cursive, fantasy, and monospace.
AH Formatter V5.0 holds information on which actual font corresponds to each of these for every script. Moreover, a default generic font that does not belong to any script can also be defined. These can be specified on the Font Setting page of the Option Setting dialog in the GUI, or with <script-font> in the option setting file.
When a per-script generic font is specified for the script of the target character string, that font is checked for whether it supports the string.
When no per-script generic font is specified for that script, the default generic font is checked.
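This lookup order (per-script generic font first, then the default generic font) can be sketched in Python. The dictionaries and the function name are hypothetical illustrations of <script-font> settings, and the sketch omits the check of whether the chosen font actually supports the string.

```python
# Hypothetical settings, as a <script-font> entry for Hans might define them.
SCRIPT_FONTS = {
    "Hans": {"serif": "SimSun", "sans-serif": "SimHei", "monospace": "SimSun"},
}
# Hypothetical default generic fonts (not tied to any script).
DEFAULT_FONTS = {
    "serif": "Times New Roman",
    "sans-serif": "Arial",
    "cursive": "Comic Sans MS",
    "fantasy": "Impact",
    "monospace": "Courier New",
}


def resolve_generic(script, generic):
    # Prefer the per-script setting; fall back to the default generic font.
    per_script = SCRIPT_FONTS.get(script, {})
    return per_script.get(generic) or DEFAULT_FONTS.get(generic)


print(resolve_generic("Hans", "serif"))    # per-script setting
print(resolve_generic("Hans", "cursive"))  # falls back to the default
```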
When auto-fallback-font="true" is specified in the Option Setting File and any fonts specified in the font-family don't support the target character string, the following fallback processing will be performed.
If a font that supports the target character string is still not found, it is an error.
The settings in the Option Setting dialog are reflected in the option setting file. For example, they are written as follows.
<script-font script="Hans" serif="SimSun" sans-serif="SimHei" monospace="SimSun"/>
Since cursive is not specified here, the cursive font of the default generic font is used for Hans. When <script-font script="Hans"/> itself is not specified, as is the case immediately after installation, the default group is assumed. The following default group is set up with the Windows version. Scripts that are not listed here are not set up. Moreover, an entry is not set up when the font does not actually exist.
Script | serif | sans-serif | cursive | fantasy | monospace |
---|---|---|---|---|---|
Standard | Times New Roman | Arial | Segoe Script or Comic Sans MS or Monotype Corsiva | Impact | Courier New |
Jpan | MS Mincho or MS Gothic | MS Gothic | MS Mincho or MS Gothic | MS Mincho or MS Gothic | MS Gothic or MS Mincho |
Hans | SimSun or MS Song | SimHei or MS Hei or MS Song | SimSun or MS Song | SimSun or MS Song | SimHei or MS Hei or MS Song |
Hant | MingLiU | MingLiU | MingLiU | MingLiU | MingLiU |
Hang | Batang or BatangChe | Gulim or BatangChe | Batang or BatangChe | Batang or BatangChe | BatangChe |
Arab | Arabic Typesetting | Arabic Typesetting | Arabic Typesetting | Arabic Typesetting | Arabic Typesetting |
Hebr | FrankRuehl | FrankRuehl | FrankRuehl | FrankRuehl | FrankRuehl |
Thai | Angsana New | Angsana New | Angsana New | Angsana New | Angsana New |
Deva | Mangal | Mangal | Mangal | Mangal | Mangal |
The following default group is set up with the Macintosh version.
Script | serif | sans-serif | cursive | fantasy | monospace |
---|---|---|---|---|---|
Standard | Times or Times New Roman | Helvetica or Arial | Monaco or Chalkboard | Monaco or Chalkboard | Courier |
Jpan | HiraMinPro W3 | HiraKakuPro W3 | HiraMaruPro W3 or HiraKakuPro W3 | HiraMaruPro W3 or HiraKakuPro W3 | HiraKakuPro W3 |
Hans | STXihei | STSong | STXihei | STXihei | STSong |
Hant | LiHeiPro | LiSongPro | LiHeiPro | LiHeiPro | LiSongPro |
Hang | AppleMyungjo | AppleGothic | AppleMyungjo | AppleMyungjo | AppleGothic |
Arab | Geeza Pro | Geeza Pro | Geeza Pro | Geeza Pro | Geeza Pro |
Hebr | NewPeninimMT | NewPeninimMT | NewPeninimMT | NewPeninimMT | NewPeninimMT |
Thai | Thonburi | Thonburi | Thonburi | Thonburi | Thonburi |
Deva | DevanagariMT | DevanagariMT | DevanagariMT | DevanagariMT | DevanagariMT |
The following default group is set up with the other UNIX version.
Script | serif | sans-serif | cursive | fantasy | monospace |
---|---|---|---|---|---|
Standard | Times | Helvetica | Times | Times | Courier |
For example, when formatting simple FO (or HTML etc.) that contains no <fo:page-number-citation> and outputting PDF, AH Formatter V5.0 outputs each page and discards it once it has been formatted. However huge the document is, AH Formatter V5.0 can therefore process it without consuming more memory than one page requires (except when formatting from the GUI). However, if a page refers to a later page with <fo:page-number-citation>, the page number of the referenced page is not known until that page is actually formatted. For that reason, when a page containing an unresolved <fo:page-number-citation> appears, XSL Formatter V4.2 suspends the output and keeps the intermediate formatting result in memory. When the document has a table of contents at the start, nothing is output until every page number that appears in the table of contents is resolved. This limits the number of pages that can be formatted, and it means that a large document cannot be formatted because of the large amount of memory consumed.
To solve this problem, AH Formatter V5.0 makes it possible to process a document with 2-pass formatting. In the first pass, formatting is performed only to resolve <fo:page-number-citation>, and all the required page number information is collected. In the second pass, formatting starts again from the first page. Since every <fo:page-number-citation> is resolved by this time, AH Formatter V5.0 can output the document while discarding the pages that have already been formatted. Although processing time increases, little memory is consumed and large documents can be formatted.
The following shows how to perform 2-pass formatting.
CAUTION: | 2-pass formatting is not available from the GUI. |
---|
CAUTION: | 2-pass formatting is not available with AH Formatter V5.0 Lite. |
---|
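The idea behind the two passes can be pictured with a toy Python model. It is a conceptual sketch only, not how AH Formatter V5.0 is implemented or invoked: the page data structure and both function names are hypothetical, and a real first pass has to lay out the pages to learn their numbers.

```python
def collect_page_numbers(pages):
    # Pass 1: walk every page only to record where each id ends up.
    numbers = {}
    for page_no, items in enumerate(pages, start=1):
        for kind, value in items:
            if kind == "id":
                numbers[value] = page_no
    return numbers


def emit_pages(pages, numbers):
    # Pass 2: every citation is already resolved, so each page can be
    # emitted (and then discarded) as soon as it has been produced.
    for items in pages:
        out = []
        for kind, value in items:
            if kind == "cite":
                out.append(str(numbers[value]))
            elif kind == "text":
                out.append(value)
        yield " ".join(out)


pages = [
    [("text", "Contents: Chapter 1 on page"), ("cite", "ch1")],
    [("text", "Preface")],
    [("id", "ch1"), ("text", "Chapter 1")],
]
for line in emit_pages(pages, collect_page_numbers(pages)):
    print(line)
```

Because the second pass is a generator, only one page is held at a time, at the cost of walking the document twice.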
AH Formatter V5.0 does not create temporary work files unless it is unavoidable. AH Formatter V5.0 creates temporary work files in the following cases.
With the COM interface, PDF of a formatted result is saved to a temporary file when outputting PDF to a Web browser directly.
When outputting a file while printing, a temporary file is generated.
When a file interface is required in the XSLT transformation using external XSLT, a temporary file is generated.
When the transformation from XML+XSL is required in the render method of a Java interface, the result FO is generated as a temporary file.
In Windows version, when embedding the image that is not embeddable in PDF, a temporary file is generated in the conversion process.
In Windows version, a temporary file is generated when processing MathML using MathPlayer.
A temporary file is generated when converting EPS to PDF using Distiller or Ghostscript.
When processing EPS using Distiller, if joboptions is not specified, default joboption will be generated as a temporary file.
A temporary file is generated when outputting to an XPS file.
In the GUI of the Windows version, temporary files are generated as needed by the Windows system.