Hyphenation Dictionary

XSL Formatter V3.4 conforms to the Hyphenation properties of the XSL specification.

XSL Formatter V3.4 includes a hyphenation dictionary for English only (en.xml). To hyphenate any other language, users must prepare and add a dictionary for that language themselves. The Apache website is a good source of hyphenation dictionaries and of links to further dictionaries.

To further enhance multilingual hyphenation, Antenna House offers XSL Formatter Hyphenation Option, which is based on an algorithm from Computer Hyphenation Ltd. This option improves the quality of hyphenation and hyphenates more than 40 languages without requiring the user to prepare dictionaries. For more details, please visit the Antenna House website and the Computer Hyphenation Ltd. website.

Dictionary Name and Location

Hyphenation dictionaries are stored in the "hyphenation" folder under the XSL Formatter V3.4 installation folder. A dictionary file is named after its language code, optionally followed by an underscore and a country code, with the extension .xml.

For example: de.xml, en_GB.xml

Contents of Hyphenation Dictionary

The contents of a hyphenation dictionary are defined by hyphenation.dtd, which is included in the FOP distribution. XSL Formatter V3.4 installs it in the hyphenation folder under its installation folder. Below is a brief explanation of the DTD. Refer to hyphenation.dtd itself for more details.

Element | Location | Description
<hyphenation-info> | root element |
<hyphen-char> | child of <hyphenation-info> | Specifies the character that marks hyphenation points in the exception dictionary data. The character is given by the value attribute; the initial value is "-" (U+002D). The hyphenation character that appears in the formatted result is determined by the hyphenation-character property of the XSL specification, not by this element.
<hyphen-min> | child of <hyphenation-info> | Gives, through the before and after attributes, the minimum number of characters that must remain before and after the hyphenation character when a word is broken. The before attribute corresponds to the XSL hyphenation-remain-character-count property, and the after attribute to hyphenation-push-character-count. XSL Formatter V3.4 uses these properties and ignores the hyphen-min element in the dictionary.
<classes> | child of <hyphenation-info> | Defines character equivalence classes. The text of the classes element is a white space-separated list of character groups; all characters in a group are treated as equivalent. In practice, each group consists of a lowercase character and its uppercase counterpart. The following sample is from the English dictionary (en.xml):
aA bB cC dD eE fF gG hH iI jJ kK lL mM nN oO pP qQ rR sS tT uU vV wW xX yY zZ
<pattern> | child of <hyphenation-info> | The hyphenation patterns, separated by spaces. A pattern consists of characters and digits. Each character is the first character of a group in classes (normally the lowercase one). A digit between characters indicates the strength of the hyphenation potential at that position (the hyphenation value).
<exceptions> | child of <hyphenation-info> | The hyphenation exception dictionary. The text of the exceptions element is a space-separated list of hyphenated words. A hyphenation point is indicated by the hyphen element, or by the character defined in the hyphen-char element. Use exceptions when the hyphenation points determined by the patterns are not appropriate, or when you want to define special hyphenation of your own.
<hyphen> | child of <exceptions> | A fully functional hyphen, equivalent to TeX's \discretionary. The hyphen element has the pre, post and no attributes. The pre attribute gives the string inserted before the hyphenation character when a hyphenation break occurs, the post attribute gives the string inserted after the hyphenation character when a break occurs, and the no attribute gives the string that appears when no break occurs. Use the hyphen element when the spelling of a word changes at a hyphenation break.
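
The way pattern digits become hyphenation points can be sketched in Python. This is a minimal illustration of the TeX-style (Liang) pattern method, in which the highest value found at each position wins and an odd value permits a break; it is not the code used by XSL Formatter. The function names and the short pattern list are chosen for illustration only and are not taken from en.xml. Lowercasing the word stands in for the classes element, and the remain and push parameters play the role of hyphenation-remain-character-count and hyphenation-push-character-count (both assumed to be 2 here).

```python
import re

# Illustrative TeX-style patterns; a real dictionary holds thousands of them.
PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"]

def parse_pattern(pattern):
    """Split a pattern such as 'hy3ph' into its letters and inter-letter values."""
    letters = re.sub(r"\d", "", pattern)
    values = [0] * (len(letters) + 1)
    position = 0
    for ch in pattern:
        if ch.isdigit():
            values[position] = int(ch)   # value sits before letters[position]
        else:
            position += 1
    return letters, values

def hyphenation_points(word, patterns, remain=2, push=2):
    """Return the indexes in word before which a break is permitted."""
    text = "." + word.lower() + "."      # '.' marks the word boundaries
    scores = [0] * (len(text) + 1)       # scores[i] is the value before text[i]
    for pattern in patterns:
        letters, values = parse_pattern(pattern)
        start = text.find(letters)
        while start != -1:
            for offset, value in enumerate(values):
                scores[start + offset] = max(scores[start + offset], value)
            start = text.find(letters, start + 1)
    # word[k] is text[k + 1]; an odd value permits a break, an even one forbids it.
    return [k for k in range(remain, len(word) - push + 1) if scores[k + 1] % 2 == 1]

def hyphenate(word, patterns, remain=2, push=2):
    """Show every permitted break in word with a hyphen."""
    points = hyphenation_points(word, patterns, remain, push)
    pieces = [word[i:j] for i, j in zip([0] + points, points + [len(word)])]
    return "-".join(pieces)

print(hyphenate("hyphenation", PATTERNS))   # hy-phen-ation
```

Raising remain to 3 in this sketch suppresses the first break, because fewer than three characters would remain before the hyphen.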

How to Hyphenate

To hyphenate words, you must explicitly specify both the hyphenate property and the language. In addition, a dictionary for that language must exist. In the following example, the hyphenation dictionary en.xml must be present in the hyphenation folder. If the dictionary does not exist, XSL Formatter V3.4 does not hyphenate the words.

<fo:block hyphenate="true" xml:lang="en">
XML format is being adopted by corporations at an increasing rate as the preferred format for data, including order data, exchanged within an organization, as well as between corporations. While XML is appropriate for computers to exchange data, for people to see and use the data, it must be presented in a clear and understandable format. In this area, many dedicated form-printing tools have been available for years.
</fo:block>

[Figure: formatted result of the above text]

When a country code is also specified in the language setting, as below, XSL Formatter first looks for the hyphenation dictionary en_GB.xml. If that file is not found, it looks for en.xml, and in that case the country code is ignored.

<fo:block hyphenate="true" xml:lang="en-GB">
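
The lookup order described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the language value is assumed to have the form language or language-COUNTRY, and no case normalization is attempted.

```python
from pathlib import Path

def candidate_dictionary_names(lang):
    """List dictionary file names for an xml:lang value, most specific first."""
    language, _, country = lang.partition("-")
    names = []
    if country:
        names.append(f"{language}_{country}.xml")
    names.append(f"{language}.xml")
    return names

def find_dictionary(lang, folder):
    """Return the first existing dictionary file in folder, or None."""
    for name in candidate_dictionary_names(lang):
        path = Path(folder) / name
        if path.is_file():
            return path
    return None   # no dictionary: the text would not be hyphenated

print(candidate_dictionary_names("en-GB"))   # ['en_GB.xml', 'en.xml']
```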

Hyphenation Option

As standard, XSL Formatter V3.4 hyphenates English only. For other languages, such as German and French, the user has to prepare a dictionary as described in the On-line Manual provided with XSL Formatter.

XSL Formatter Hyphenation Option makes it possible to hyphenate 40 or more languages. It also supports hyphenation in which the spelling of a word changes at the break.

Languages

XSL Formatter Hyphenation Option supports the following languages.

Code | Language | Characters
af | Afrikaans | Latin characters and Apostrophe
bg | Bulgarian | Cyrillic characters
ca | Catalan | Latin characters and Apostrophe and Decimal point (Full stop or Middle dot)
cs | Czech | Latin characters
cy | Welsh | Latin characters and Apostrophe
da | Danish | Latin characters and Apostrophe
de | German / Swiss German | Latin characters and Apostrophe
el | Greek | Greek characters
en | English | Latin characters and Apostrophe
en-US | American | Latin characters and Apostrophe
eo | Esperanto | Latin characters
es | Spanish | Latin characters
et | Estonian | Latin characters
eu | Basque | Latin characters
fi | Finnish | Latin characters
fr | French / Canadian French | Latin characters and Apostrophe
ga | Irish (Erse or Gaelic) | Latin characters and Apostrophe
hr | Croatian | Cyrillic characters or Latin characters
hu | Hungarian | Latin characters
id | Indonesian | Latin characters and Apostrophe and Digit 2
is | Icelandic | Latin characters
it | Italian | Latin characters and Apostrophe
la | Latin | Latin characters
lt | Lithuanian | Latin characters
lv | Latvian | Latin characters
ms | Bahasa Malay | Latin characters and Apostrophe and Digit 2
mt | Maltese | Latin characters and Apostrophe
nl | Dutch / Flemish | Latin characters and Apostrophe
no | Norwegian | Latin characters and Apostrophe
pl | Polish | Latin characters
pt | Portuguese / Brazilian | Latin characters
ro | Romanian / Moldavian | Latin characters and Apostrophe
ru | Russian | Cyrillic characters
sk | Slovak | Latin characters and Apostrophe
sl | Slovenian | Latin characters and Apostrophe
sr | Serbian | Cyrillic characters or Latin characters
sv | Swedish | Latin characters and Apostrophe
sw | Swahili | Latin characters and Apostrophe
tr | Turkish | Latin characters
uk | Ukrainian | Cyrillic characters

Example

To hyphenate Czech text, write the following in the FO file:

<fo:block hyphenate="true" language="cs">
Všichni lidé rodí se svobodní a sobě rovní co do důstojnosti a práv. Jsou nadáni rozumem a svědomím a mají spolu jednat v duchu bratrství.
</fo:block>

When a country code is specified, as in "nl-BE", it is ignored. The only exception is "en-US", which is treated as a separate language.

Exception Dictionary

With XSL Formatter Hyphenation Option, you do not need to prepare dictionaries. However, you may want to override the hyphenation of particular words when the result is not what you expect. In that case, you can register those words in an exception dictionary.

The exception dictionary is stored in the hyphenation folder under the XSL Formatter V3.4 installation folder, or in the folder indicated by the AXF3_HYPDIC_PATH environment variable. The dictionary file is named in the same way as a TeX dictionary: the language code, optionally followed by an underscore and a country code, with the extension .xml.

For example: de.xml, en_GB.xml

The contents of the exception dictionary are as follows.

Element | Location | Description
<hyphenation-info> | root element |
<hyphen-char> | child of <hyphenation-info> | Specifies the character that can be used in place of <hyphen/> in the exceptions element. The character is given by the value attribute; the initial value is "-" (U+002D).
<exceptions> | child of <hyphenation-info> | The exception dictionary data. The text of the exceptions element is a white space-separated list of hyphenated words. A hyphenation point is indicated by the hyphen element, or by the character specified in the hyphen-char element.
<hyphen> | child of <exceptions> | A fully functional hyphen, equivalent to TeX's \discretionary. The hyphen element has the pre, post and no attributes. The pre attribute gives the string inserted before the hyphenation character when a hyphenation break occurs, the post attribute gives the string inserted after the hyphenation character when a break occurs, and the no attribute gives the string that appears when no break occurs. Use the hyphen element when the spelling of a word changes at a hyphenation break.
<non-eol-words> | child of <hyphenation-info> | Specifies, as a white space-separated list, words that should not be placed at the end of a line. Line breaks are adjusted to keep these words away from the end of a line, although in some cases this cannot be avoided. This processing is always in effect, regardless of the hyphenate property in the FO.

The DTD of the exception dictionary is simple:

<!ELEMENT hyphenation-info (hyphen-char?, exceptions?, non-eol-words?) >

<!ELEMENT hyphen-char EMPTY >
<!ATTLIST hyphen-char value CDATA #REQUIRED >

<!ELEMENT exceptions (#PCDATA|hyphen)* >

<!ELEMENT hyphen EMPTY >
<!ATTLIST hyphen pre  CDATA #IMPLIED
                 no   CDATA #IMPLIED
                 post CDATA #IMPLIED >

<!ELEMENT non-eol-words (#PCDATA) >
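
Python's standard xml.etree.ElementTree module does not validate against a DTD, so the sketch below checks the same structure by hand before a dictionary is installed. It is only an approximation of the DTD above, not a replacement for a validating parser. The function name and the sample words in non-eol-words are hypothetical.

```python
import xml.etree.ElementTree as ET

CHILD_ORDER = ["hyphen-char", "exceptions", "non-eol-words"]
HYPHEN_ATTRIBUTES = {"pre", "no", "post"}

SAMPLE = """<hyphenation-info>
<hyphen-char value="-"/>
<exceptions>
ta-ble
ba<hyphen pre="k" no="c"/>ken
</exceptions>
<non-eol-words>Mr. Dr.</non-eol-words>
</hyphenation-info>"""

def check_exception_dictionary(xml_text):
    """Return a list of problems; an empty list means the structure looks valid."""
    root = ET.fromstring(xml_text)
    if root.tag != "hyphenation-info":
        return [f"root element is <{root.tag}>, expected <hyphenation-info>"]

    problems = []
    tags = [child.tag for child in root]
    # Each child may appear at most once, in the order given by the content model.
    if tags != [tag for tag in CHILD_ORDER if tag in tags]:
        problems.append(f"unexpected child elements or order: {tags}")

    for child in root:
        if child.tag == "hyphen-char" and "value" not in child.attrib:
            problems.append("<hyphen-char> lacks the required value attribute")
        if child.tag == "exceptions":
            for sub in child:
                if sub.tag != "hyphen":
                    problems.append(f"<{sub.tag}> is not allowed inside <exceptions>")
                elif set(sub.attrib) - HYPHEN_ATTRIBUTES:
                    problems.append("<hyphen> has an attribute other than pre, no, post")
    return problems

print(check_exception_dictionary(SAMPLE))   # []
```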

Suppose the following exception dictionary is prepared.

<hyphenation-info>
<exceptions>
ta-ble
present
ba<hyphen pre="k" no="c"/>ken
</exceptions>
</hyphenation-info>

The word table can be hyphenated only as ta-ble, and the word present is never hyphenated. The word backen is hyphenated as bak-ken. In this example, ta<hyphen/>ble is equivalent to ta-ble.

The hyphen element can also specify hyphenation that changes the spelling of the word, as the following table shows.

Setting in exception dictionary | Word | Hyphenation
ab<hyphen/>def | abdef | ab-def
ab<hyphen no="c"/>def | abcdef | ab-def
ab<hyphen pre="x"/>def | abdef | abx-def
ab<hyphen pre="x" no="c"/>def | abcdef | abx-def
ab<hyphen post="z"/>def | abdef | ab-zdef
ab<hyphen no="c" post="z"/>def | abcdef | ab-zdef
ab<hyphen pre="x" post="z"/>def | abdef | abx-zdef
ab<hyphen pre="x" no="c" post="z"/>def | abcdef | abx-zdef
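
The effect of the pre, no and post attributes can be reproduced with a short Python sketch that reads an exception dictionary and reports, for each entry, the unbroken word and its spelling with every permitted break shown. This is an illustration of the rules described above, not the product's implementation; the function name is hypothetical, and the sample is the exception dictionary shown earlier with one additional entry.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<hyphenation-info>
<exceptions>
ta-ble
present
ba<hyphen pre="k" no="c"/>ken
ab<hyphen pre="x" no="c" post="z"/>def
</exceptions>
</hyphenation-info>"""

def read_exceptions(xml_text, hyphen_char="-"):
    """Map each unbroken word to its spelling with every permitted break shown."""
    exceptions = ET.fromstring(xml_text).find("exceptions")
    # Flatten the mixed content into text chunks and <hyphen> elements, in order.
    stream = [exceptions.text or ""]
    for hyphen in exceptions:
        stream += [hyphen, hyphen.tail or ""]

    entries = {}
    unbroken = broken = ""

    def flush():
        """Store the word collected so far and start a new one."""
        nonlocal unbroken, broken
        if unbroken:
            entries[unbroken] = broken
        unbroken = broken = ""

    for item in stream:
        if isinstance(item, str):
            for ch in item:
                if ch.isspace():
                    flush()                  # white space separates entries
                elif ch == hyphen_char:
                    broken += "-"            # same as an empty <hyphen/>
                else:
                    unbroken += ch
                    broken += ch
        else:
            unbroken += item.get("no", "")   # spelling when no break occurs
            broken += item.get("pre", "") + "-" + item.get("post", "")
    flush()
    return entries

for word, hyphenated in read_exceptions(SAMPLE).items():
    print(word, hyphenated)
```

An entry without any hyphenation point, such as present, maps to itself, which corresponds to a word that is never hyphenated.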

Restrictions

In Portuguese, when a word is broken at an existing hard hyphen, the hyphen is repeated at the end of the line and at the start of the next line. For example:

terça-feira

This word is hyphenated as follows:

terça-
-feira

However, this version of XSL Formatter Hyphenation Option does not support this hyphenation.



Copyright © 1999-2005 Antenna House, Inc. All rights reserved.
Antenna House is a trademark of Antenna House, Inc.