| Contents |
| 1. Introduction |
You should have 3MB of disk space for the program and the sample files. The RAM requirements follow your OS in most cases (but, for example, if you want to create a dictionary of millions of words with clues, you should have 2GB of RAM).
There is a supporting program, CPT Dictionary, for browsing the files created by CPT Word Lists. CPT Dictionary is free for non-commercial use and you can distribute it together with your dictionaries.
Fans of Unicode details will need the book "The Unicode Standard" and the technical reports - see www.unicode.org.
The program supports the code points assigned in Unicode 5.1.0 (without supplemental characters); as a reference you can use the book or UnicodeData.txt from the official version.
bidirectional - languages such as Arabic and Hebrew whose general flow of text proceeds horizontally from right to left, but in which numbers, English, and other left-to-right text are written from left to right. "Bidi" is a short form of the term. RTL is used for right-to-left and LTR for left-to-right.
BMP - Basic Multilingual Plane (i.e. Unicode without supplemental characters, the subset supported properly by this program.)
bpt - bit-per-tag, a type of tag.
btc - byte to character converter.
CLASSPATH - this can denote an OS environment variable or an option used when running the JVM, showing the path(s) where the JVM will search for class/jar/zip/... files containing compiled Java programs or other data.
CPT - crossword power tools, our collection of programs.
ctb - character to byte converter.
CTree - binary dictionary file or internal sort engine.
fixed font - this font is used to mark menu items, check boxes, dialogs, etc.
grapheme - a glyph or "conceptual character" that can be represented by one or more Unicode characters.
JVM - Java virtual machine, the interpreter of Java class files, part of RTE (the OS command to start it could be 'java', 'javaw', 'jre' or 'jview').
natural sorting - the order of words in printed dictionaries.
npt - number-per-tag, a type of tag.
NSM - non-spacing mark, combining diacritical mark.
RTE - run time environment, your Java support, used to run this program.
OS - the operating system you are running, Windows is used for Microsoft Windows 95/98/ME/NT/2K/XP/Vista/7, Linux is used for popular boxes like Red Hat/Fedora, SUSE, Ubuntu, etc.
ttf - true type font.
Unicode - the standard for a coded character set that aims to be 'Universal Character Set', synchronized with ISO/IEC 10646.
\uxxxx - Hexadecimal notation for representing Unicode characters, where "\u" is the escape sequence and "xxxx" are the hex digits.
You could look at the glossary at unicode.org as well.
| 2. Overview of Data Processing |

There are operations performed only on Base words - dump, search, etc. These operations are selected from the Base Words menu.
Usually Arrays are read/written as text files called 'text word lists'. When JTree is used as the sorting engine, the target is a text word list or a picture dictionary. The CTrees are read/written as binary files called 'dictionaries' and can be dumped as text files as well. The dictionaries contain additional information about the encoding, sorting type, locale, etc. They can contain tags, clues, and pictures as well. In Unicode mode any character takes 2 bytes in the file; in byte mode a character can be represented by 1 byte, or by 5 bits or more, depending on the number of letters used. The CTree dictionaries are the most developed format and are the standard crossword dictionary format for all CPT programs.
Note that the CTrees will not sort and handle properly graphemes represented by more than one Unicode character. You should compose these characters (see Unicode Normalization) or use JTree or Array in Locale mode as the sorting engine.
With CPT Word Lists many of the operations can be done in a single run. For example, take as input a Russian HTML file encoded in UTF-8 or Cp1251, extract the words, sort them using the Russian locale, compare the words to a dictionary encoded in KOI8_R, and write the difference to a text file encoded in ISO 8859-5. This is a sample of simple spell checking that hardly any other program can do for you.
Yet another example: we used CPT Word Lists to separate our Bulgarian word list into parts containing only nouns, verbs, and adjectives. Initially we did not have any dictionary with tags to do the tagging, so we did it another way. Via a user-defined affixes file (suffixes and tags in this case) and using 'Loose Match Mode' we created a munched list (after many iterations). The munched list was edited manually to correct the loose matching errors, and this way we expanded the initial word list to all grammatical word forms. The rest was easy: in seconds we created a tagged list, and finally, by filtering the tagged list, we got the different parts. Linguists might not like this 'morphology analysis' done by suffixes only, but it works for this and other languages. The human time spent to create our affixes file is comparable to the time needed to create the lexicon and rules files for PC-Kimmo or Xerox's finite state tools, but the computer time for the analysis is considerably shorter. Of course, the goals of our program and of the other tools mentioned are different.
| 3. Encoding, Byte and Unicode Modes |
The processing of differently encoded texts in Java is supported via a pair of converters (in recent Sun Java there is also a variant as a combined single object). The input converter is byte-to-character (btc), which translates bytes from the source encoding to Unicode characters. The output converter is character-to-byte (ctb) and its task is to translate Unicode characters to the target encoding. In CPT programs most of the names of the encoding converters from Sun's Java international RTE are built in. There is a mechanism, 'User Defined Encoding', to use any available converter when it is not in the built-in list. The same mechanism is used to select non-Sun converters developed by Netscape or by other programmers. In our dialog boxes, where the built-in list is shown, the encodings are ordered as follows:

The input and output files can be in any encoding, but internally our programs work only in 'one-byte' and 'two-byte Unicode' modes. This means that if the Source and Target encodings selected by the user are from the one-byte group and the sorting type is Byte, then the main processing is in one-byte mode. If the Source or Target encoding is from the two-and-more-bytes group, or the sorting type is Unicode or Locale, then the Source is converted to Unicode, the processing is done in Unicode, and finally the text is converted from Unicode to the Target encoding. This is the general scheme for Arrays and JTrees. For CTrees, additional space and time optimizations can be done via 'Strict Alphabets' and/or UTF-8.
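For readers who want to see the btc/ctb pair in plain Java, here is a minimal sketch using the standard java.io and java.nio.charset classes (not the program's internal converter objects; the file names are hypothetical). A btc converter corresponds to an InputStreamReader, a ctb converter to an OutputStreamWriter, and everything in between happens in Unicode:

    import java.io.*;
    import java.nio.charset.Charset;

    public class Recode {
        public static void main(String[] args) throws IOException {
            try ( // btc: decode Cp1251 bytes into Unicode characters
                 Reader in = new InputStreamReader(new FileInputStream("in.txt"),
                                                   Charset.forName("windows-1251"));
                 // ctb: encode Unicode characters as KOI8-R bytes
                 Writer out = new OutputStreamWriter(new FileOutputStream("out.txt"),
                                                     Charset.forName("KOI8-R"))) {
                for (int c; (c = in.read()) != -1; )
                    out.write(c);   // any processing here works on Unicode chars
            }
        }
    }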
Java strings are represented internally in Unicode with the 'surrogate pairs' extension, called UTF-16 or 'Unicode 2.0', while the pure 16-bit form is called UCS-2 or 'original Unicode 1.1'. Java characters are 16-bit 'code elements', not 'code points', and obviously one Java character can not hold a 'surrogate pair'. A single Java character can not hold letters in decomposed form either; in this case the text should be composed - see Unicode Normalization.
There are several Unicode schemes for handling text in files (www.unicode.org, www.szybora.com, ...). For reading/writing Unicode texts from/to files Sun developed converters which do not change the 'Unicode encoding' but just the representation of the characters as bytes:
| 4. File Menu |
Exit will stop the program without saving the current settings.
Via Save Settings as Default you can save the settings you have made so that they are restored on the next start of the program. The passwords and the Source/Base/Target file names are never saved, but the names of the auxiliary files from the CTree Options dialog are saved.
Save Settings in Projects... is similar, but saves to a separate file in the 'Projects' directory. There are many options you have to set for a particular task and it is a good idea to keep them in a file. After the installation, the 'Projects' directory will contain template settings for most groups of operations.
Via Read Settings from Projects... you can restore the saved settings.
Set User Encoding item is described below.
Via Default Directory you can set the default directory for the Open/Save file dialogs. If the default directory is not set, it will be the program installation directory.
Set Control Background Color is used to change the look and feel of the program. System, Custom A, Custom B are predefined colors. Custom... will start the Java color chooser dialog. Gradient, when checked, will force the background of most controls to be painted with a gradient.
Splash on Startup, when checked, will show a picture before the main window appears. You can use it to replace the initial About dialog of the demo version of the program.

In the text fields User 8-bit Converter and Converter Display Name you have to enter the program name and the display name of the converter. The display name is free text, but the program name should be known (see below). By clicking the button on the right you can check if that converter is available. The software tests the pair of ctb and btc converters, so you can receive two error messages when the converter is not found. The same applies to the user-defined 16-bit converter.
The combo box under the label Encodings and Display Names shows the built-in list of converter names, and below it are the display names. When the user converters are not defined, they will be shown with "User8" and "User16" as converter program names and "Reserved for User Defined ..." as display names. Again, you can check whether any converter from the list is supported by your Java RTE by clicking the button on the right.
The last combo box and the supporting "Run Scan" button can be used to scan the Java CLASSPATH and find all available ctb converters. The list will be sorted by name and will include the converters built into the current CPT program (like CyrMIK) as well. Via the clipboard you can copy any converter name from the combo box and paste it into the User ... Converter text field. This ensures the correct name without "blind" checking. To test whether the paired btc converter is available as well, click on the "Check Available" button. The answers are: "Yes", "btc only", "ctb only", and "No".
After checking the user-defined converter program name and entering the display name, you can click on the OK button and the setting is done.
If you want to remove the setting, delete the converter program name from the text field and click on OK. Note that if the removed converter has been used as the default converter for the Source/Target file encoding, you have to change these encodings as well; otherwise, later you will receive an error message that the User8/User16 converter is not found.
CPT Converters
In addition to the Sun JRE converters, here are the converters built into this program:
Notes:
CyrBG converts the Cyrillic letters 'I' with grave and 'i' with grave (0xAA and 0xBA) to \u040D and \u045D (Unicode Cyrillic letters with grave). CyrMK converts the small Cyrillic letters 'i' with acute and 'ie' with acute (0x26 and 0x23) to \u045D and \u0450 (Unicode Cyrillic letters with grave). The characters mentioned are not part of any other Cyrillic 8-bit encoding, and if the text includes any of them, you can not convert it to Cp1251 or ISO8859_5; the only targets are Unicode and UTF-8.
The converters defined/reported as 'btc only' can be used just for encoding from text to text. When you create a CTree it is not recommended to use a user-defined converter as the target encoding, because every time you open this CTree the same user-defined converter must be set. Since not all CPT programs have the User Encoding dialog, such dictionaries might be unreadable.
The last 5 converters are actually transliterators, which are not based on any official standard. They can be used as User16 converters (just a convention; for some operations like 'encode only' they can be used as User8 as well). All of them are included in the Transliteration Bar and you can easily check their behavior. CCMH_Cyrl and CCMH_Glag are explained in detail in the file 'cu_Transliteration.txt' in the 'alphabet' directory, and here is a sample picture:

The selected text from the text area is transliterated 'Glag to CCMH 7-bit' and then 'CCMH 7-bit to Cyrl', and the result is shown in the Transliteration Bar.
| 5. Source Data Menu |
Text Word List is the traditional format of a word per line.
Munched Word List is a word per line with an optional suffix mark. It is similar to the Unix ispell munched format but not the same. The affix marks should be defined in a CPT Affixes file.
Tagged Word List is a word per line with optional tags, which should be defined in a CPT Tags file or CPT Affixes file.
CTree Dictionary is a binary file created by this program. For more details see the appendix.
CPT Affixes is a text file defining affixes. For details see the appendix.
CPT Tags is a text file defining tags. For details see the appendix.
Plain Text is a text (or any binary) file that will be scanned to extract words or to be transformed.
HTML could be in pure ASCII or in one of the supported encodings like ISO8859_1, ..., UTF-8. Note that the encoding defined in the file (charset tag) will be ignored; the encoding used will be the one defined in the Source File Encoding dialog. The HTML tags are not interpreted (e.g. the "dir" attribute, css files, frames, etc.); the only interpretation is of the character entity references (after '&', HTML 4.01 list) and of decimal or hexadecimal codes (after '#'). The operations over these files are extracting words and "dehtml" - stripping the HTML tags.
Text Delimited is a text "record per line" format. Many database export utilities and CPT Word Lists produce such files. The text fields below the radio button are used to define the "delimiter character" of the fields in the record and the number of the field to be taken. To use TAB as the delimiter character, enter the string "\t". If the field number is set to 0, this means 'take all fields' and the file is expected to be in 'text dictionary' format (see the appendix).
Word-Clue Format is similar to Text Delimited, but can contain a word and one or more clues, and no tags. You should select one of the supported variants from the list (see the appendix). Comments (--) means that there are lines starting with "--", which should be ignored. For the first variant any commented line should have "--"; for all other variants the comment is a block starting and ending with a "--" line. Note that the delimiter character (for the first variant) and the field number are valid here as well. Field number 1 is the word, 2 is the clue, and 0 is 'text dictionary'.

In Unicode some "conceptual" characters or graphemes can be represented in two or more ways. The composition forms are "shrinking" the letter representation and the decomposition forms are "expanding" it. There are canonical and compatibility modes. Let's explain with an example: angstrom-sign (\u212b), A-ring (\u00c5), and A + ring (A + \u030a) are three variants of one letter glyph. The normalization of this letter is done as follows: If the input is angstrom-sign, A-ring, or A + ring: for canonical decomposition, the normalized form will be A + ring; for canonical composition, the normalized form will be A-ring.
The next example is with compatibility forms. The Latin 'ffi' ligature has been used in many texts and there is a single Unicode character, the ffi-ligature (\ufb03). In new texts people use the 'ffi' string, and here the compatibility mode comes in. If the input is the 'ffi' string: for canonical/compatible decomposition/composition, the normalized form will be the 'ffi' string. If the input is the ffi-ligature: for canonical decomposition/composition, the normalized form will be the ffi-ligature; for compatible decomposition/composition, the normalized form will be the 'ffi' string. When the input text contains ligatures or characters in decomposed form (like A + ring), it should be normalized for the proper working of the word filters and the CTree style of sorting. In most cases the preferred form is one of the compositions.
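The same examples can be reproduced with the standard java.text.Normalizer class (Java 6 and above); this is just an illustration - the program itself uses its own Unicode 5.1.0 tables:

    import java.text.Normalizer;

    public class NormalizeDemo {
        public static void main(String[] args) {
            // canonical composition (NFC): all three inputs become A-ring (\u00c5)
            for (String s : new String[] { "\u212b", "\u00c5", "A\u030a" })
                System.out.println(Normalizer.normalize(s, Normalizer.Form.NFC));

            // canonical decomposition (NFD): A-ring becomes A + combining ring
            System.out.println(Normalizer.normalize("\u00c5", Normalizer.Form.NFD));

            String ffi = "\ufb03";                         // ffi-ligature
            // the canonical forms keep the ligature unchanged...
            System.out.println(Normalizer.normalize(ffi, Normalizer.Form.NFC));
            // ...while the compatibility forms expand it to the 'ffi' string
            System.out.println(Normalizer.normalize(ffi, Normalizer.Form.NFKC));
        }
    }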
The list in Unicode Normalization combo box contains the standard forms from Unicode Technical Report #15:
The check box Plus Excluded means: "compose all characters" (according to the specifications there is a so-called "Exclusion List", and the characters from that list should not be composed). This option can help you to obtain proper composition of some texts, e.g. in Yiddish, despite the formal norms.
The check box Ignore Jamo/Hangul means: "do not decompose Hangul syllables into the Jamo alphabet and vice versa". This is an option because the Unicode standard specifies that Hangul to Jamo is a canonical decomposition. If you want to keep these characters unchanged, set the check box.
If Arabic Composition is selected, via the combo box Arabic Ligature Set you can choose a subset of ligatures:
During composition of ligatures the forms (isolated, initial, medial, final) are defined according to the shaping rules. Note that the source text should be in logical order.
The check box Shape Letters As Well can be used to obtain full Arabic shaping of all characters in the text. If it is set, there is no need to set the Shaping flag in Display Options dialog.
If Thai Composition is selected, via the combo box Thai Set you can choose a set of the custom codes:
The Thai Composition includes preprocessing with reordering of marks - the tone marks are moved after the other marks, and when the set is Single Cells, the Sara Am is decomposed. Note that only the Syllables set will support proper sorting and word breaking. Do not expect any program outside the CPT kit to understand Thai composed text; use Thai Decomposition to export it.
Notes:
The program uses tables built from UnicodeData.txt version 5.1.0 (BMP only); the results of normalizers using other versions might differ from ours.
The normalization follows the Unicode encoding of the source text and always precedes all other conversions and filters.
For more details, see Appendix B.
The Unicode Normalization options are ignored when the source file is CPT Tags or CPT Affixes.
The group of radio buttons Default Line Direction offers the following options:
The radio button To Logical Order, when set, will force conversion of the input text from visual order to logical order.
The radio button To Visual Order, when set, will force conversion of the input text from logical order to visual order using one of the options:
Notes:
The CPT bidi processing claims to conform to Unicode Technical Report #9 (rev. 15). There are many similar implementations and all are compatible, partially...
For <Unicode NL> we consider one or more of: '\r', '\n', '\u0085', '\u000b', '\u000c', and '\u2028'. The <hyphen> is defined by the radio buttons. Soft means any of '\u00ad', '\u2010', and '\u058a'. Hard means '-'.
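For comparison, here is a minimal logical-to-visual sketch with the standard java.text.Bidi class (Java 1.4 and above), which implements the same UAX #9 algorithm; it reorders the directional runs only and ignores character mirroring - an illustration, not the program's bidi code:

    import java.text.Bidi;

    public class LogicalToVisual {
        public static void main(String[] args) {
            String logical = "abc \u05d0\u05d1\u05d2 def";  // Latin + Hebrew, logical order
            Bidi bidi = new Bidi(logical, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);

            int n = bidi.getRunCount();
            byte[] levels = new byte[n];
            String[] runs = new String[n];
            for (int i = 0; i < n; i++) {
                levels[i] = (byte) bidi.getRunLevel(i);
                String run = logical.substring(bidi.getRunStart(i), bidi.getRunLimit(i));
                if ((levels[i] & 1) == 1)                   // odd embedding level = RTL run
                    run = new StringBuilder(run).reverse().toString();
                runs[i] = run;
            }
            Bidi.reorderVisually(levels, 0, runs, 0, n);    // permute runs into visual order
            StringBuilder visual = new StringBuilder();
            for (String run : runs) visual.append(run);
            System.out.println(visual);
        }
    }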

Usually you have to set these options just once; save them via File | Save Settings As Default and you need not bother with them any more.
The Encoding combo box contains the built-in display names of the Java converters. Here you have to select the proper encoding of the input file. At the top of the list there is a special item, "Java Default (...)", which reflects the font.properties/fontconfig.bfc file of your Java RTE. It is the default if the encoding is not set explicitly.
RTL in logical order, if set, will switch on the default bidi processing. This check box is enabled for locales that could use RTL scripts.
Country and Language combo boxes are used to select the Locale (shown under the combo boxes) of the file. In some cases the locale of the file does not matter. It is used in the Search in Source dialog and when the input data type is Munched or Tagged Word List (for tag filters). In these cases the alphabet subdirectory is scanned for the affixes file (<Locale>.aff).

The combo boxes at the top define the font used for the display. Font Family will show the list of fonts supported by the Java RTE. For Java 2 versions (1.2 and above) and jview these are the native font names installed in your OS; for Java 1.1 these are only the logical font names defined in the "font.properties" file.
Via the Style and Size combo boxes you can select font properties such as italic and size 12, for example.
The Encoding combo box shows the default encoding. For the older Java 1.1 versions it can be used to change the encoding of the logical font without modifying the "font.properties" file. For example, if the current encoding is ANSI (Cp1252), you can change it to Thai (Cp874 or MS874). Note that this will work only if the Thai locale is installed on your Windows. Since the font-encoding problem is solved in Java 2 and jview, in these cases the combo box will be disabled and you just have to select a native font which supports the desired script (Angsana New for the Thai sample).
Under Linux this combo box is enabled for all Java versions, because it is used to define the keyboard/clipboard converter (when one of the flags is set). We do not provide keyboard drivers; the keyboard converter is used to encode the 8-bit text from your Linux keyboard driver to Unicode. The Clipboard converter flag will be shown only if Java is 1.3.1 or above. If you exchange data with applications that properly support UTF8_STRING or COMPOUND_TEXT (all Java and KDE3 applications, Mozilla, etc.), you should switch off this flag. If switched on, the behavior is the same as for the keyboard converter.
When the View Source check box is set, the initial part of the input file (when opened) will be shown in the CPT Word Lists text area. In the Lines to View text field you can enter how many lines to view. Showing just the initial part of the file is very convenient when the source is a huge file and you would otherwise have to wait a long time or could have problems with the virtual memory.
The check box CTree Header, if set when the source is a CTree, will enable the dictionary data (number of words, encoding, title data, word lengths, letter frequencies, etc.) to be displayed before the word list.
The check box Right Alignment, if set, will force right alignment of the text area and of the text fields of the search dialogs. This option is provided for RTL scripts, but it is completely independent of the bidi processing.
The check box Shaping should be set when Arabic shaping is desired, when the text is in 'Thai Composed' form, or when the font used does not properly support the combining marks.
There are two main modes of searching: strings and words.

Match String will enable standard searching. In this mode you can use the \uxxxx notation (no regular expressions) and the options:
Match Word will switch to target word searching. This mode emulates extracting the words from the source. The word filters from Target Data | Word Properties (Target Words dialog) will be used while searching words. There are some restrictions. If Whole Line as a Word is checked, it will be ignored and Letters mode will be used instead. If one of the Locale ... boxes is checked, the locale of the source will be used. In the case of Locale Dictionary File not all automatic conversions are supported (you will see a "... incompatible" message). To check your word filter settings just click sequentially on the "Find Next" button (there is no need to enter a search pattern). If you want to find words containing specific patterns, you can enter a search word pattern as well. In this case the \uxxxx notation and the regular expressions described in the Base Words | Search Words dialog can be used.
Not in Base Words will turn on simple spell checking. It is available if some Base words are opened. On clicking the "Find Next" button, the next word from the Source will be searched for in the Base words. If it is not found there, a "Found at ..." message will appear. Yes, this sounds strange, but this is how spell checkers work. You can not enter a search pattern in this mode.
Using the Down and Up radio buttons you can change the direction of the search. By clicking with the mouse in the text area you can fix the current position (which is not shown when there is no selected text).
| 6. Base Words Menu |
The item Open CTree Dictionary As... has three sub items:
Open Text File (target encoding, sort) is used to open a text word list as the base word list using the target settings - text files do not contain any information about the encoding, locale, etc. Note that the Target Sort type will define the 'one-byte' or 'two-byte Unicode' mode.
Close menu item will free the memory used and will clear the additional status bar for the Base words opened.
Search Words will start the Search Base Words dialog explained below.
All Embedded Data applies to the next three items below. When checked, all data of the dictionary will be dumped, not just the text file. If the Base dictionary is a CTree with a strict alphabet and tags, the alphabet and tags files will be dumped as well. If the Base is a picture dictionary, all pictures will be dumped as well.
Dump in Text File will start the Save file dialog to select the file, and the Base words will be written there as a text file using the current CTree encoding.
Extract in Text File is similar, but it is valid for CTrees with tags/clues and only if a tags selection has been made via the Global Tags dialog. The words having the selected tags will be extracted into the file in text delimited format.
Extract Source Words in Text File is like the previous, but a Source text word list should be opened; those words define what is to be extracted from the Base dictionary.
'5-bit per character' CTree: this packing of suffixes is different from the packing described in Target Action, Affixes tab, but the final format of the CTree is the same. When you have created a CTree with packed suffixes and it is opened in mixed mode, the core word letters will be expanded as a tree and the packed suffixes will remain packed. If opened in tree mode, all word letters will be expanded as a tree, and if add/delete operations are done, the dictionary will be saved without packed suffixes. If opened in packed mode, no change operations are allowed and the file will keep the packed suffixes.
'long words' CTree: this packing of suffixes is just to reduce the file size; a new file is created only if its size is at least 10% smaller. When this CTree is opened, the suffixes are always unpacked.
Notes:
In CPT Word Lists the Base words are opened and read into RAM. If the CTree word list contains 1M or more words, the virtual memory might easily be exhausted; this is the case where the mixed and packed modes should be used. The change (tree) mode is still available, but possibly with swapping - a 1MB packed file takes many MBs of RAM in tree mode. The text word list files are opened in 'Java Array' mode and the memory used should be a 'linear' function of the file size. 'Linear' is in quotes because different JVMs have their own opinions about the heap used and the garbage collection.
Here are some figures (with the 160MHz CPU used when the first version of the program was developed):
A 2MB text word list with 200K words takes 300KB as a CTree (5-bit per character) file. Packing the suffixes will take 60 seconds and the CTree file will be about 270KB. When this file is opened in tree mode it will take 3MB of RAM, opened in mixed mode it will take 1.5MB, and opened in packed mode - 300KB.
A 12MB text word list with 1M words takes 900KB as a CTree (5-bit per character) file. Packing the suffixes will take 5 minutes and the CTree file will be about 700KB. When this file is opened in tree mode it will take 13MB of RAM, opened in mixed mode it will take 4MB, and opened in packed mode - 1.5MB.
For fast compare/search operations: when the Base is a CTree with clues, select mixed mode; when the Base is a CTree without clues, select tree mode.

In the text area of the dialog window, short information about the Base words will be shown. If the letter case of the word list is all upper or all lower, the search will be done in ignore-case mode; otherwise, the mode will be case sensitive. If the Base words are a CTree with tags, the 'display codes' of the tags will be appended to the found word. In the case of a CTree with clues, each tags-clue couple will be shown on 2 separate indented lines.
The search pattern you can enter is restricted to a subset of regular expression syntax using the following special symbols:
| * | matches 0 or more characters |
| ? | matches exactly one character |
| | | starts alternative pattern to search |
| \char | is the 'escape' way to match the char, where char is any Unicode character |
| [set] | matches any one character from the set |
| [^set] | matches any one character not in the set, where set can have the following forms: |
| charString | includes the characters from the string |
| char1-char2 | includes the characters from char1 to char2 in the ascending Unicode order |
Since the idea is to match words, not strings, the meaning of '*' and '?' is the same as used in operating systems for file name expansion, but different from the classic regular expressions used in egrep and friends. Here are two examples:
ase[^ex]*
will match all words having 4 or more letters, starting with 'ase' but not having 'e' or 'x' as a fourth character;
ba??a|ga????
will match all 5-letter words starting with 'ba' and ending with 'a', and all 6-letter words starting with 'ga'.
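A hypothetical illustration of these rules (not the program's implementation): the word patterns map naturally onto java.util.regex, anchored so that the whole word must match, the way file name expansion works. The sketch assumes a well-formed pattern:

    import java.util.regex.Pattern;

    public class WordPattern {
        static Pattern compile(String p) {
            StringBuilder re = new StringBuilder();
            for (int i = 0; i < p.length(); i++) {
                char c = p.charAt(i);
                switch (c) {
                    case '*': re.append(".*"); break;       // 0 or more characters
                    case '?': re.append('.'); break;        // exactly one character
                    case '|': re.append('|'); break;        // alternative pattern
                    case '\\':                              // \char matches char literally
                        if (++i < p.length())
                            re.append(Pattern.quote(String.valueOf(p.charAt(i))));
                        break;
                    case '[': {                             // copy [set] / [^set] as-is
                        int j = p.indexOf(']', i + 1);
                        re.append(p, i, j + 1);
                        i = j;
                        break;
                    }
                    default:
                        re.append(Pattern.quote(String.valueOf(c)));
                }
            }
            return Pattern.compile("^(" + re + ")$");       // the whole word must match
        }

        public static void main(String[] args) {
            System.out.println(compile("ba??a|ga????").matcher("garden").matches());  // true
            System.out.println(compile("ase[^ex]*").matcher("asexual").matches());    // false
        }
    }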
The syntax of the search pattern is checked, but you could receive an error message even for a well-formed expression due to restrictions in the implementation (e.g. protected CTrees).
If you want to stop a search process that is producing a big list, just click on the Stop button, which appears after the search starts. If you want to save the search result in a file: after entering the search pattern, instead of pressing <ENTER>, click on the left button and choose the file name; the result will be written there (if the CTree is not protected). If the dictionary contains pictures, they will be written as separate files as well.
When pictures are shown in the text area, you can use the right mouse button to click on a picture, with two possible popup menu items:
| 7. Target Data Menu |
Picture Dictionary is used to create a separate picture dictionary (not a CTree with pictures).
Alphabet File is used to create an alphabet file based on the source words. You may need to edit this file, because the current Java locale collator is used for the sorting. For details see the appendix.
The important note is that not all combinations of input data type and output data type are valid. The details are in Data Types and Operations below. If you have selected an invalid combination or operation, you will see an error message.
The difference is that the locale setting is used in more cases. If the word filters (Target Words dialog) include Locale Letters/Alphabet File, or the CTree is defined to use a Strict Alphabet, the alphabet subdirectory will later be scanned for an alphabet file named <Locale>. Similarly, for Locale Dictionary File a <Locale>.dic file will be needed.
UTF-8 and CTree
If the clues contain a relatively small quantity of real Unicode characters (not basic Latin, or not in Cp1251), you can select the UTF-8 encoding in order to reduce the file size. The dictionary will appear externally as Unicode, but the clues will be encoded in UTF-8. If the target locale is a Cyrillic one (be, bg, mk, ru, sr, uk), the clues will be encoded in a custom UTF-8C encoding, which is Cp1251 plus UTF-8.
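Why UTF-8 reduces the size - a minimal comparison using the standard charsets (the custom UTF-8C encoding itself is specific to the CPT programs):

    import java.nio.charset.StandardCharsets;

    public class ClueSize {
        public static void main(String[] args) {
            String clue = "a word used in crosswords";     // mostly basic Latin
            // 1 byte per ASCII character in UTF-8...
            System.out.println(clue.getBytes(StandardCharsets.UTF_8).length);     // 25
            // ...versus a fixed 2 bytes per character in UTF-16
            System.out.println(clue.getBytes(StandardCharsets.UTF_16BE).length);  // 50
        }
    }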
Generally, a word is a sequence of 'word-characters'. In the input text the words are bounded by 'non-word-characters'. As a minimum, the 'non-word-characters' include line breaks. The idea of the filters is to define what to include in the 'word-characters' class; the rest of the Unicode characters go to the other class.
From Filter tab you have to select the main filter.

Whole Line as a Word means that actually no filters will be applied (just the word length) and the 'word' will be the input text line, or the field from the record in the case of input Text Delimited/Word-Clue.
If Allow One Space is set, the program will cut the input line before the second space - this usually means that the first two words will be taken. Allow Two Spaces is similar - the first three words. Strip Spaces will remove the spaces before the first letter and after the last letter of the word. Stop Characters - you can enter one or more characters in this field; the word will be cut at the first position where one of these characters is met.
Letters will switch on various filters - the 'word-characters' class will include all Unicode characters having a letter category (Lu, Ll, Lt, Lm, and Lo). Note that the check boxes in the Letters tab can expand or shrink this class.
Locale Alphabet File, if set, will define the 'word-characters' class to include only the characters from the locale alphabet file. For the format of the alphabet files see the appendix.
Locale Dictionary File, if set, will define the 'word-characters' class to include only the characters from the dictionary file, which should be in the 'alphabet' subdirectory and should be named <Locale>.dic. In this case the dictionary and a recursive program are used to break the input character sequences into words. The current version contains only one dictionary, "th.dic" for the Thai language, but you can create a CTree word list for any locale and use it. For the options, see the Dictionary tab below. This flag is valid when the Source is Plain Text or HTML.
In Length tab you can use Minimal/Maximal Word Length text fields to enter the limits of the word length counted in characters. The range is from 1 to 1000.
In the Letters tab there are many check boxes to refine the 'word-characters' class (a sketch of the category test follows after this list).
Since a lot of Unicode characters are letters, you can use Locale Letters as an excluding filter. When it is set, the program will reject all words having letters not in the alphabet defined by the current locale. If the alphabet file is not found, the program will use internal tables (see the appendix) or will ignore this flag.
Include Marks will include all Unicode characters having a mark category (Mn, Mc, Me) in 'word-characters'.
Include Apostrophe/Hyphen/Dot will include the corresponding ASCII character in the 'word-characters' class.
Include Numbers will include all Unicode characters having a number category (Nd, Nl, No) in 'word-characters'.
No All-Uppercase, if set, will exclude words having all letters in upper case.
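Here is the promised sketch of the category test behind these check boxes (an illustration only, not the program's filter code), using the standard Character.getType categories:

    public class WordChars {
        // true if c belongs to the 'word-characters' class under the given check boxes
        static boolean isWordChar(char c, boolean marks, boolean numbers,
                                  boolean apostrophe, boolean hyphen, boolean dot) {
            switch (Character.getType(c)) {
                case Character.UPPERCASE_LETTER:        // Lu
                case Character.LOWERCASE_LETTER:        // Ll
                case Character.TITLECASE_LETTER:        // Lt
                case Character.MODIFIER_LETTER:         // Lm
                case Character.OTHER_LETTER:            // Lo
                    return true;
                case Character.NON_SPACING_MARK:        // Mn
                case Character.COMBINING_SPACING_MARK:  // Mc
                case Character.ENCLOSING_MARK:          // Me
                    return marks;
                case Character.DECIMAL_DIGIT_NUMBER:    // Nd
                case Character.LETTER_NUMBER:           // Nl
                case Character.OTHER_NUMBER:            // No
                    return numbers;
                default:                                // the optional ASCII extras
                    return (apostrophe && c == '\'') || (hyphen && c == '-')
                            || (dot && c == '.');
            }
        }
    }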
The Dictionary tab offers the following options:


Once the options are set during dictionary file creation, they can not be changed later. If you need to change something, create a new dictionary using the old one as the Source. As a rule, when recreating a CTree, the word filters should be switched off.
Some of the optional data and tables can be seen only when the Show CTree Header flag is set in the Source | Display Font Encoding dialog.
Alphabet Tab
When Strict Alphabet is set, the ordered list of locale characters will be included in the dictionary and will be used as a word filter and as the sorting order. If the number of characters used is less than 33, 5 bits per character will be used internally. This is the case for most alphabets when only upper or lower character case is used. Some of the specific CTree operations are implemented only for this 5-bit per character format.
The radio button From File means that the alphabet characters should be taken from the locale alphabet file, which by default is expected to be in the alphabet directory. If you want to use another file, click on the 'Set Alphabet File' button. If you need to revert to the default, click on the button and select 'Cancel' in the Open file dialog.
Built in means that the program will use internal tables (3 alphabets in this version: Bulgarian-Cp1251, English-ASCII, and Russian-Cp1251).
The check box Lower Case < Upper Case means that a lower case letter should come before the corresponding upper case letter in the sort order. This is valid when both upper and lower case characters are used and the sorting type is Locale. Note that there is no similar option for Array and JTree - the standard Java collators use fixed tables.
Letter Frequency Counters means to maintain a table containing the frequency of every unique letter in the words of the dictionary.
Crossword Words will switch on the 'crossword transliteration'. All non-letter symbols will be ignored. For the English locale the accents will be removed and some non-English letters will be replaced. Additionally you have to select Change Letter Case and Lower in the 'Target Action' dialog, 'Options' tab.
Word Length Counters means to maintain a table containing a word counter for every word length in the dictionary.
Data Tab
The check boxes select additional data to be attached to every word in the dictionary. When tags/clues are added, the dictionary is referred to as a 'CTree with tags/clues' in this document.
The tags should be defined in a CPT Tags file (see the appendix), which by default is expected to be in the alphabet directory. If you want to use another file, click on the 'Set Tags File' button.
The 'Set IPA8 File' button is used to switch on the 'IPA8' feature and to select the file. 'IPA8' means that you have clues having 'ipa8' tags and encoded in a custom 8-bit encoding. For details see the appendix and the 'samples' directory.
Morphology Tags means to add 'morpho' tags.
User Tags means to add 'user' tags.
Topic Tags means to add 'topic' tags.
Definitions/Clues means to add 'clue' tags and definitions.
Strict Alphabet restrictions/conversions are never applied to the text of the definitions. If you have selected UTF-8 as the target file encoding, the dictionary will be maintained externally as Unicode, but inside the file the definitions will be in UTF-8. If the locale is a Cyrillic one, the definitions will be written in the custom UTF-8C encoding (based on Cp1251 plus UTF-8).
Has Pictures should be checked when you want to include pictures. Actually, the program will create an additional picture dictionary file having the same name as the CTree but with the extension 'pdc'.
Inverted Index means to create an index of clue usage in the main word list (i.e. for every clue, a list of the words that use/refer to this clue). The idea is to be able to browse the file in both directions. This inverted index is used in the other CPT programs like CPT Dictionary. For bigger dictionaries it is an expensive operation - 150K words with clues will take 4 minutes and 40% more of the file size. Note also that CPT Dictionary can create its own inverted index in a separate file. If the dictionary is intended to be stored on read-only media, or if you want to save the time of all users, you can include the inverted index via this option.
When you add 'user', 'topic', or 'clue' tags, the Source data type should be Text Delimited (text dictionary format) or equivalent CTree.
Title Tab
You can use the text fields in this tab to enter optional user data, which will be written in the dictionary file. The text is not restricted by the alphabet or by the encoding of the dictionary (it is saved in Unicode).
File Tab
Pack Suffixes is valid for the 'long words' and '5-bit per character' CTree types (without clues). The program will try to reduce the CTree size by packing the common suffixes. If the operation fails, the original CTree will be saved (see Pack Suffixes above).
Long Words or Phrases will force internal structures, which are more efficient for packing long words or phrases.
Packed Format means to compress the data.
Locked means 'no more changes'; it will disable dumping, editing, and opening as Source. Copying to the clipboard will be restricted to a single line in all CPT programs.
Protect with Password (if a password is entered) will force the same protection scheme as above, but the password will be required to open the file. If you have forgotten the password, the file is lost.
The third level of dictionary protection is when both flags are set: additionally, copying to the clipboard will be disabled totally.
Generally, the following image formats are supported: JP2, JPC, PGM, PGX, PNM, PPM, JPG, GIF, BMP. The program will check whether Sun "JAI Image I/O Tools" or the "Java Advanced Imaging API" is installed and will use them. So, depending on the JVM, you can also work with: ICO, CUR, WMF, EMF, TIF, PNG, WBMP, PCX, FPX.
As thumbnails, only JPG, JPC, and GIF are supported.
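For orientation, here is how the formats available to your JVM can be listed with the standard javax.imageio API (the optional JAI plug-ins simply extend this list when installed; the file name is hypothetical):

    import javax.imageio.ImageIO;
    import java.awt.image.BufferedImage;
    import java.io.File;
    import java.io.IOException;

    public class ListFormats {
        public static void main(String[] args) throws IOException {
            // formats the current JVM can read (grows when JAI plug-ins are present)
            for (String name : ImageIO.getReaderFormatNames())
                System.out.println(name);

            BufferedImage img = ImageIO.read(new File("ar_flag.gif"));
            System.out.println(img.getWidth() + "x" + img.getHeight());
        }
    }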
To include a picture in a CTree you have to use a clue tag starting with 'pic' and the picture file name as a clue:
Argentina|0|0|0|pic ar_flag.gif
To include a picture in a Picture Dictionary just use a line:
Argentina|ar_flag.gif
This is probably the place to note that for Picture Dictionary the sorting engine is JTree, and only one picture is maintained per key.
This is the dialog where you can set how the pictures will be included. The settings are global for all pictures processed for a dictionary file.
Convert Tab

Include Unsupported means that the program will not reject any input picture.
Convert MS pictures to BMP will be shown only with MS JVM. Before any other processing the ICO, CUR, WMF, EMF will be converted to BMP.
Convert Pictures to, when checked, will force the conversion of the input pictures to one of the JPEG 2000 formats, JP2 or JPC. JPC gives a slightly smaller file size, because no custom color tables are included.
Rate % defines the compression as a ratio in percent between the picture size in memory and the output file size; 100% is lossless. The important difference between JPEG and JPEG 2000 is that for good quality of JPG files a value of 75% is OK, while for JP2 files the same quality can be achieved with a value of 20%, i.e. much better compression. You can enter -1 to let the program choose the percentage for you.
Gray is to convert images to gray scale.
Resize - check it to change the size of the picture according to the values entered in Width and Height, in pixels (a sketch of the proportional rule follows after this list).
Area defines that the input height and width will be changed proportionally to meet the area defined by the given values.
Down is to scale down. The picture will be resized only if the input height and/or width are bigger than the given values.
Up is to scale up. The picture will be resized only if the input height and/or width are smaller than the given values.
Fixed - the output width and height will be exactly the given values; this is the only non-proportional resize.
Speed over Quality selects a faster resizing algorithm.
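Here is the promised sketch of the proportional 'Down' rule - a hypothetical illustration of how such a rule can be computed, not the program's code (the 'Up' rule is symmetric and 'Fixed' skips the proportionality):

    public class Resize {
        // scale down proportionally, only when the input exceeds the given bounds
        static int[] scaleDown(int w, int h, int maxW, int maxH) {
            double scale = Math.min((double) maxW / w, (double) maxH / h);
            if (scale >= 1.0)
                return new int[] { w, h };               // already small enough
            return new int[] { (int) Math.round(w * scale),
                               (int) Math.round(h * scale) };
        }

        public static void main(String[] args) {
            int[] size = scaleDown(800, 600, 200, 200);
            System.out.println(size[0] + "x" + size[1]); // 200x150
        }
    }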
Thumbnails Tab

Add Thumbnails - check it, if you want every picture to be accompanied by a thumbnail in the dictionary.
Include - the program will look for a thumbnail file in the same place as the picture file, but with a name having 't' appended. If the picture is pic100.bmp, the thumbnail is expected to be pic100t.jpg, pic100t.jpc, or pic100t.gif.
Make - the program will create the thumbnail.
JPG and JPC define the image format. With MS JVM you will see GIF instead of JPG.
Max Width and Max Height define the image size.
The idea of this menu item is to check all the settings from the previous dialog. You can use it to convert individual files as well.
First you have to select the picture file. The program will process it and the following windows will be shown:
- a message dialog showing the lengths in bytes of the input file (not the size in memory), the output file, and the thumbnail;
- a window with the converted picture (via a right mouse click you can save the file);
- a window with the thumbnail (via a right mouse click you can save the file).
When Create Picture Dictionary Source File from Directory... is selected you will see 'Open Directory' dialog and then 'Save' dialog.
The program will take the names of all recognised picture files from the selected directory, will create a source file for a picture dictionary, and will save it under the name you selected in the second dialog. For example, if you have Picture0005.jpg and Picture0028.jpg in the directory, the saved file will contain:
Picture0005|Picture0005.jpg
Picture0028|Picture0028.jpg
There are two dialogs for the selection of tags for operations like filtering, extraction, or adding tags. The first (Global Tags) has access to all possible tags groups, while the second can operate on a single group only. The dialogs are connected to different operations and support different selection schemes.
Use this dialog to select subsets of tags from the available tags groups for the operations 'add tags' and 'extract base words using tags selection and word length filters'. The menu item Target Data | Global Tags has three sub-items:
To clear the global tags setting select Set new from file | Other file... and then 'Cancel' in Open dialog.

Use the combo box to select one of the groups. The tags from this group will be shown in the list box below. Every tag is presented by its code in angle brackets and its display name. For details see the appendix.
For the extraction operation you can select any subset of tags from any group. This is not true for the 'add tags' operation: if the group contains npt (number per tag) assignments, you can select only one of them; if there are bpt (bit per tag) assignments, you can select all of them or a subset; if there are '/subgroups/', you can select tags from one subgroup only. If the selection is not consistent, you will see an error message later, when you start the operation.
The 'Select All' button can be used to select all items in the current list.
Ignore Unselected npt/bpt is used for the extract operation when the list contains both npt type and bpt type tags. If you need to select tags from, say, the bpt group only, you can check the box; this means "ignore the unselected tags from the other group". If not checked, the filter will skip words having an unselected tag from the npt group even if they have the selected bpt tags.
Length Tab
This tab provides additional filters for the extract operation. Only the words having the selected word lengths will be extracted. Select From List will enable the list box and you can select exactly the lengths you wish. Remember that you can make multiple selections with the mouse while the Ctrl key is pressed. All means no length filters. The rest of the radio buttons, from <7 to >=33, can be used to select a range of lengths.
If the target operation is Filter Tagged Word List, you can use this dialog to set how the tags will be used as filters. The menu item Target Data | Tags Filters has three sub-items: Tags File (morpho)... and Tags File (topic)... are for tags from the 'morpho' or 'topic' group (see the appendix); Affixes File... is for tags from the 'suffixes' section (see the appendix). When you select one of the menu sub-items, the program will take the current Source locale and will read the corresponding tags or affixes file from the alphabet directory. If the file is OK, you will see the dialog.

The tags from the file are sorted and shown in the list. Every tag is preceded by a mark (0, 1, 2) which defines the role of the tag. At the top of the list a special entry is included - use it to define what to do with words that have no tags.
You use the keyboard to change the mark of the currently selected item in the list. Each press of the Space or Enter key, or a double click with the mouse, changes the mark cyclically through 0, 1, 2. The keys Insert and 1 set the mark to 1, the keys Delete and 2 set the mark to 2, and the key 0 sets it to 0. If the Ctrl key is pressed as well, all marks in the list will be set.
Mark 0 means that this tag will be ignored (will not be considered as a filter).
Mark 1 means that if the word has this tag, it can be included in the target.
Mark 2 means that if the word has this tag, it will not be included in the target.
Since any word can have many tags, the value of the tag's mark is used as a weight (precedence) of the tag. If the word has at least one tag marked 2, it will be excluded. If the word has tags marked 1 but no tags marked 2, it will be included. If the word has only tags marked 0, it will not be included. If the word has no tags, the filter will be the mark of the special element at the top of the list.
The tags filters used will be from the last started dialog.
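The precedence logic in a minimal sketch (an illustration of the rules above, not the program's code):

    public class TagsFilter {
        // marks: the 0/1/2 marks of the word's tags; noTagsMark: the special top entry
        static boolean includeWord(int[] marks, int noTagsMark) {
            if (marks.length == 0)
                return noTagsMark == 1;     // words without tags: the special entry decides
            boolean hasInclude = false;
            for (int mark : marks) {
                if (mark == 2)
                    return false;           // any excluding tag wins
                if (mark == 1)
                    hasInclude = true;
            }
            return hasInclude;              // only 0-marked tags: not included
        }
    }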

For the Source-Base-Target group the input files are the Source and the opened Base words. The output files are the modified Base words (same file name) and an optional Log File, which is a text word list. If a Save As dialog appears, it is for the Log File (with one exception for the munched lists).
Via the radio buttons Add Source to Base Words, Delete Source from Base Words, and Compare Source to Base Words select the desired operation. The check box Replace Mode is valid when adding data to a CTree with tags/clues: the word and all its data are deleted first, and then the addition(s) will be done. Note that if you want to change only one data element of a word, the other data elements for this word should be supplied in the Source as well.
Via the combo box Add/Delete/Compare Log File define the contents of the Log File. The options are:
No Source Word Sorting means that the operation of reducing all Source words to a list of unique words will be omitted. This way the working memory needed is reduced, but the number of add/compare/delete operations is increased. If you have to process a huge Source file, set this flag and switch off the Log File. If you perform a tagging operation and want to preserve the original word sequence from the Source file, set this flag as well.
For all other target operations, which do not require Base words, select Create New Target.

Change Letter Case and Upper/Lower are used to transform the characters to all upper or all lower case. This operation is common for word lists and is essential for CTrees - it will halve the number of characters used. Note that you should use this operation for CTree creation if you want to enable the single case mode, even when all source words are already converted to lower/upper case. If you set the Special Casing check box, the changing of the letter case will follow the Unicode addendum SpecialCasing.txt. The special casing maps one lower case character to 2 or 3 upper case characters. These special lower case characters include the German es-zed, ligatures, many Greek characters, and several accented Latin letters. For Lithuanian there is special case handling as well. The special Turkish mappings are handled by the standard casing and there is no need to set the flag for this locale.
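A minimal illustration with the standard java.lang.String casing, which already applies the one-to-many SpecialCasing.txt mappings and the Turkish rules:

    import java.util.Locale;

    public class CaseDemo {
        public static void main(String[] args) {
            // German es-zed expands to two upper case characters
            System.out.println("stra\u00dfe".toUpperCase(Locale.GERMAN));     // STRASSE
            // Turkish: the upper case 'I' lowers to the dotless '\u0131'
            System.out.println("KIRK".toLowerCase(new Locale("tr", "TR")));   // k\u0131rk
        }
    }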
Via Sort Word List Using you have to select the byte mode, type of sorting and the sort engine.
Byte will select one-byte mode and binary sorting. Unicode will select two-byte mode and binary Unicode sorting. Locale will select two-byte mode and locale sorting. In this mode the Array and JTree engines will use the standard Java collators to obtain the natural sorting (a collator sketch follows below). Since the locale sorting of CTrees is not exactly 'natural' when you have upper and lower case characters, you should use 'Strict Alphabet' to be able later, in CPT Dictionary, to obtain the proper natural order of the display list.
Via Array, JTree, and CTree select the corresponding sorting engine.
Reverse Word Letters can be used for sorting 'backwards' - the words are compared starting from the last character (e.g. Hebrew script stored in visual order). Actually, the words are handled internally by the engine in reversed form, but when they have to be displayed or exported to text file, they are reversed again.
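Here is the collator sketch mentioned above - a minimal illustration of 'Locale' sorting with the standard Java collators (the French words are just an example):

    import java.text.Collator;
    import java.util.Arrays;
    import java.util.Locale;

    public class LocaleSort {
        public static void main(String[] args) {
            String[] words = { "cote", "c\u00f4te", "cot\u00e9", "c\u00f4t\u00e9" };
            // binary Unicode sorting would order these by raw code points;
            // the locale collator keeps accented letters next to their base letters
            Arrays.sort(words, Collator.getInstance(Locale.FRENCH));
            System.out.println(Arrays.toString(words));
        }
    }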

Encode Data should be set when you want to explicitly change the encoding of the Source file, or for operations that do not include sorting, like normalization or the 'dehtml' operation.
Extract Word List is used when the target data type is text word list or CTree dictionary and we want to extract the unique words from the Source.
Frequency Counters is used when we want to count the unique words or letters from the Source. The target data type should be Text Delimited. For word frequency select Words, for letter frequency select Letters. If you want a list sorted in descending order of the frequency counters, set Sort by Counters (a counting sketch follows after this list).
Sort by Word Length is valid only when the target data type is 'Text Word List' and the sorting engine is not CTree. If checked, the word length will be used as the primary key for the sorting.
Add Tags can be set when the input file has type Word List, Text Delimited/Word-Clue (text dictionary), or CTree. This operation will add tags to all words from the Source according to the selection in the Global Tags Setting dialog. If the source already contains tags, the new tags are added via a simple bitwise-or operation without any consistency checks.
Filter Tagged Word List can be set when the input file has type Tagged Word List. This operation will select words from the Source according to the filters set by Tags Filters dialog.
Expand Munched Word List should be set when the input file is a munched word list. This operation will "expand" the words from the Source according to the suffixes defined in the locale affixes file. The supported Target types are Text Word List (expanding only) or Tagged Word List (expanding plus tagging).
If Output Words with Affixes Only is set, the words from the input file without affixes will be ignored (used when Expand Munched Word List is set).
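Here is the counting sketch mentioned above for the Words variant - a minimal illustration with the standard collections, not the program's code:

    import java.util.*;

    public class WordFreq {
        public static void main(String[] args) {
            String[] words = { "a", "rose", "is", "a", "rose" };
            Map<String, Integer> freq = new TreeMap<>();    // unique words, sorted
            for (String w : words)
                freq.merge(w, 1, Integer::sum);

            // 'Sort by Counters': re-sort the entries by descending frequency
            List<Map.Entry<String, Integer>> list = new ArrayList<>(freq.entrySet());
            list.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
            for (Map.Entry<String, Integer> e : list)
                System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }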
There is no separate operation flag for creating munched lists. When the Source is an Affixes file and the Base is a CTree in 5-bit format, the operation will be creating a munched list, and the supported Target types are:
The encoding of the target file will be the same as that of the CTree opened as Base words (the target encoding will be ignored).
Loose Affixes Match will switch on the "loose match" algorithm when creating a munched list. The idea is to assign a suffix to a root word even when not all word forms from the suffix definition are available in the CTree. If this flag is not set, the program works in "strict match" mode - the suffix is assigned only when all word forms from the suffix definition are found in the tree below the root word.
Exclude Words Mode is an addendum to the "loose match" algorithm. The program will remove some words in the hope of better matching results.
The program will check the consistency of the selected options; this is the moment when most of the warnings and errors will be shown. If the options are accepted, the processing will start and a new window with messages will appear. This window is not closed by the program; you have to close it after checking the messages. For most operations you can stop the processing by clicking on the Stop button in the Messages window. If you close the Messages window before the job is done, a notification will be shown when the operation is finished, but you will not see the possible errors or warnings.
| 8. Data Types and Operations |
Source: Plain Text, Target: Plain Text, Alphabet File.
Source: HTML, Target: Plain Text, Alphabet File - strip HTML tags (no special flag).
Source: CPT Tags, Target: Plain Text - dump codes of tags only (no special flag), no optional conversions.
Source: Text Word List, Target: Alphabet File, Text Word List, or Text Delimited. The Target Text Delimited can mean two different operations. If Add Tags is set, it is exactly the adding tags operation. If it is not set and Extract Word List is set, it is the special operation "Parts of Phrases": for every word (of length >= Minimal Word Length) in the phrase, the program will create a record with that word and, as a clue, the phrase with the word replaced by "...".
Source: Tagged Word List, Target: Tagged Word List - translate tags (no special flag), no bidi conversion. The default tags are 'morpho'. If you want to use 'topic' tags for the translation, remove 'map-morpho' and include only 'map-topic' in the Tags file.
Source: Text Delimited/Word-Clue Format (text dictionary), Target: Text Delimited - adding tags (Add Tags should be set), translate tags (no special flag, not for Word-Clue Format), convert Word-Clue Format to Text Delimited (no special flag).
Source: Text Delimited/Word-Clue Format (not text dictionary, usually field number 1 selected), Target: Alphabet File, no changing case.
Source: Text Word List, Target: Text Word List or CTree Dictionary (with or without Add Tags) or Text Delimited (frequency counters).
Source: Plain Text, Target (Extract Words set): Text Word List or CTree Dictionary or Text Delimited (frequency counters).
Source: CTree Dictionary, Target: Text Word List or CTree Dictionary or Text Delimited (letter frequency counters), the CTree can has tags or clues as well.
Source: CTree Dictionary with tags or clues, Target: CTree Dictionary with tags or clues. Add Tags is supported. This is the first case when the tags/clues from the Source CTree are used. The second case is a Base with tags/clues, described below. In the other cases the Source CTree is treated as a Text Word List.
Source: Text Delimited/Word-Clue Format (not text dictionary), Target: Text Word List or CTree Dictionary or Text Delimited (frequency counters).
Source: Text Delimited/Word-Clue Format (text dictionary), Target: CTree Dictionary with tags/clues (with or without Add Tags). This is the standard mode for creating dictionaries. If you want to add pictures, only Text Delimited can be used.
Source: Word-Clue Format - '<word><delimiter><clue>NL' variant, Target: Picture Dictionary. This is the only variant that creates a separate picture dictionary.
Source: HTML, Target: Text Word List or CTree Dictionary or Text Delimited (frequency counters).
Source: Tagged Word List (creating CTree with 'morpho' tags), Target: CTree Dictionary with tags.
Source: Tagged Word List (tags filters set), Target: Text Word List or CTree Dictionary or Text Delimited (frequency counters).
Source: Munched Word List (expand word list), Target: Text Word List or Tagged Word List. The logical/visual conversion is not available and the source should be in logical order.
Base words: CTree, output: Text Word List (dump), CTree (pack suffixes).
Base words: CTree with tags/clues, output: Tagged Word List (dump) or Text Delimited (extract by filters or by Source word list).
Base words: CTree (with or w/o tags/clues) or Text Word List - search and extract words (using the dialog), output: Text Word List or Tagged Word List (using 'display codes' of tags).
Base words: Picture Dictionary, output Word-Clue Format - dump or search and extract words (using the dialog).
The delete operation is not available when the Source has tags/clues. If you need to delete words from a CTree with tags/clues, use a Text Word List as Source. In this case any word from the Source list and its linked clues (if not used elsewhere) will be deleted. If you need to change a word's data, use 'add with replace'.
When the Base is a Picture Dictionary, the Source for the add operation can only be Word-Clue Format. For the delete or extract operation the Source is a Text Word List.
| Appendix A: File Formats |
The locale alphabets usually reside in 'alphabet' directory. The names of the files are the same as ISO-639 language codes, optionally followed by "_" + ISO-3166 country code. For example, "el" and "el_GR" are locale names for Greek. The names could be four-letter script codes from ISO 15924:2004 as well. Note that you can override the defaults by setting explicitly the file in CTree Options dialog.
The files could be written in any of the supported encodings, but the name of the converter should be given in the first line. Any line should contain an upper case letter, optionally followed by the lower case letter. When the lower case letter is not given, the program will add it using the Unicode tables. The order of letters in the file defines the sorting order used by CTrees. The position of the upper case before the lower case implies the default rule that upper case letters are less than (before) the lower case ones. To change this rule, set the flag labeled 'Lower Case < Upper Case' in the CTree Options dialog, Alphabet tab.
Generally, you can include almost any Unicode character in this file (line breaks cannot be part of the alphabet). Be careful not to put unnecessary spaces, because the space is a valid entry as well. Keep in mind that CTrees may not handle properly graphemes presented by more than one character (surrogates and decomposed characters), and when more than 32 characters are used, some of the special CTree operations (5-bit compression and affixes) will not be available.
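A minimal sketch of such a file (hypothetical content) - the first line names the converter, then one letter pair per line in the desired sorting order:
UTF-8
Aa
Bb
Cc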
There are many alphabet files in the directory, ready for you to check and change. For any of the alphabets we have included short notes in the Readme.txt file.
CJK Locale Letters
Since it is difficult to prepare CJK locale alphabet files, here are the tables used by the program:
Japanese
Radicals: from \u2E80 to \u2FD5;
Symbols, Numerals: from \u3005 to \u303B;
Hiragana, Katakana: from \u3041 to \u31FF;
Ideographs: from \u3400 to \u9FC3;
CJK Compatibility: from \uF900 to \uFAD6;
Halfwidth Katakana: from \uFF66 to \uFF9F.
Chinese
The ranges are the same but without Hiragana and
Katakana.
Korean
Jamo: from \u1100 to \u11F9;
Compatibility Jamo: from \u3131 to \u327D;
Hangul Syllables: from \uAC00 to \uD7A3;
Halfwidth Hangul: from \uFFA0 to \uFFDC.
We do not pretend that these selections are the best choice, so any suggestions and comments are welcome.
There are internal tables for other languages as well, but we recommend using alphabet files to be sure that you get what you want.
Note: these internal tables are not 'built in strict alphabets'; they are used just for the locale letters checking in word filters.
Affixes File
Set variables are defined with 'set' statements, for example:
set c STLKPMNVC
set v AEIOU
A set variable name is just one character and, when used in the text, should be prefixed with a dollar sign.
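For example, $vBLE in a suffix line below stands for the five suffixes ABLE, EBLE, IBLE, OBLE, UBLE.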
The suffixes section should start with the statement:
suffixes
The section can contain one or more suffix definitions. The format of any suffix definition is like this:
<sufdef +TAG1+TAG2
ED
ING +TAG3
$vBLE
>
where "<" is the start and ">" is the end mark, "sufdef" is any string of maximum 7 characters giving the suffix definition name. This name will appear in the munched list. "+TAG1+TAG2" is a list of global tags valid for all suffixes in the definition. The tags should have the plus sign as first (and only once used) character. They can be from the 'morpho' group in the CPT Tags file, but this is not mandatory. "ED" is a suffix from the definition, it will match any word ending with this string. "ING" is another suffix having private tag "+TAG3". "$vBLE" is a suffix with set variable - it will match all words ending with one of the letters from the set, followed by "BLE". The set variables are just syntax sugar (macros). You can use only one variable in the definition body. This variable will generate n separate suffix definitions, where n is the number letters in the set and the letter from the set will be appended to the name. In this case the names are: sufdefA, sufdefE, sufdefI, sufdefO, sufdefU. In the process of making munched list all suffixes from the definition should match in order to assign that definition. For example, the words from the input:
ASK
ASKABLE
ASKED
ASKING
will give (in strict match mode):
ASK
ASK/sufdefA
in the munched list. And when the munched word is expanded (with tags included), the result will be this:
ASKABLE +TAG1+TAG2
ASKED +TAG1+TAG2
ASKING +TAG1+TAG2+TAG3
There is no ignore-case mode, so you may need to use the Change Letter Case flag to match the input against the suffix definitions.
In the body of a definition you can use other suffix definitions (like inline subroutines), with these restrictions: only one per suffix line; the used definition must already be specified; the reference is prefixed with a slash and is the last part of the line:
<asklike +VerbForms
K/sufdefA
K +Inf
>
This definition will give "AS/asklike" in the munched list for the sample above.
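Assuming the tags of the outer definition and of the referenced one accumulate, expanding AS/asklike would give:
ASK +VerbForms+Inf
ASKED +VerbForms+TAG1+TAG2
ASKING +VerbForms+TAG1+TAG2+TAG3
ASKABLE +VerbForms+TAG1+TAG2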
You can hide some of the used definitions by explicitly making them 'local definitions' (macros). The 'local definitions' are local to the affixes file (like set variables) and are not used in the munched list creation. The slash should be the first character of the name:
</def2 +T9
ITION
>
When referring to 'local definitions', do not add an additional slash.
The prefixes are not supported in this version and the prefixes section will be ignored.
The characters "<", ">", "/", "+", and "#" are reserved and should be used according to the context specified. Well, "#" is not specified yet - the comment part of a line should have it in the first position:
# end of Affixes.
Tags File
The encoding used should be given in the first line. Any statement should take only one line. The '#' character is the comment mark (till end of line). Below, italic font is used for non-terminals, "[...]" denotes an optional item, and 'number_bits' denotes the number of bits available in the CTree for a particular tag group.
One file can contain one or more tag groups. Any group has a fixed number of bits in the CTree file. The maximum number of binary tag codes in a group is 2**number_bits. The tag group definition is as follows:
<group_id
[macros]
tag_list
>
where group_id could be one of the following reserved names: 'morpho', 'user', 'topic', 'clue', or a translation group such as 'map-morpho' or 'map-topic'.
A macro definition has the form:
</macro_name
mtag_list
>
mtag_list is a list of one or more tag code definitions having the form:
number_per_tag_code [display_code]
or
^bit_per_tag_code [display_code]
or
/macro_name
where number_per_tag_code is a string containing one or more tags, like Noun or Noun+PL. This code will take a number (one of all the tag binary codes) from the group. The optional display_code is a describing string, like noun or noun,plural (if the group is map-..., it is not optional, but it is again a number_per_tag_code, giving the translation of the tag). bit_per_tag_code is a single tag taking one bit from the group. Note again that for any group there are 2**number_bits binary codes but only number_bits bits. macro_name is a defined macro, which will be expanded.
tag_list is a list of one or more tag code definitions having the form:
/macro_name
or
number_per_tag_code [display_code]
or
^bit_per_tag_code [display_code]
or
subtag_group
where subtag_group is a sub-group definition, having the form:
<number_per_tag_code [display_code]
mtag_list
>
The maximum number of sub-groups is 2**(number_bits/2). For example, the morpho group has 8 bits, so this number is 16. Inside a sub-group there are at most 2**(number_bits/2) binary codes; number_per_tag_codes and bit_per_tag_codes are assigned locally for the sub-group. As shown in the syntax definition, sub-groups cannot be nested.
The reserved characters <>^+/# should not be used in identifiers.
The semantic restrictions are as follows:
Here are some samples with a clue group (4 bits) and explanations:
Cp1252
# The tags file can contain more than one valid
# 'clue' groups, but the last one will be used.
# Below, 'matches' means the combination of
# tags in the input tagged text that can be matched
# by the tag group.
<clue            # all codes are 'number_per_tag'
0                # unused
t1 tag1          # code binary 0001
t2 tag2          # 0010
...
t15 tag15        # 1111
>                # the matches above are: +t1 or +t2 or ... +t15
                 # (only one tag code can be used per word)
<clue            # all codes are 'bit_per_tag'
^t1 tag1         # xxx1 (x means don't care value)
^t2 tag2         # xx1x
^t3 tag3         # x1xx
^t4 tag4         # 1xxx
>                # the matches above are:
                 # +t1 or +t1+t2 or +t2 or ... +t1+t2+t3+t4
                 # (all tag combinations can be used per word)
                 # note that 0 code is not included, because
                 # it is taken from the first bit_per_tag
<clue            # mixed (number_per_tag have different semantics)
0                # unused
t1 tag1          # xx01
t2 tag2          # xx10
t3 tag3          # xx11
^t4 tag4         # x1xx
^t5 tag5         # 1xxx
>                # the matches above are: +t1 or +t2 or +t3 or +t1+t4
                 # or +t1+t5 or ... +t3+t4+t5
<clue            # number_per_tag and subtags
0                # unused
t1 tag1          # 0001
t2 tag2          # 0010
t3 tag3          # 0011
<st1 subtag1     # 01xx (subtag mask)
st11 subtag11    # 0100
st12 subtag12    # 0101
st13 subtag13    # 0110
st14 subtag14    # 0111
>
<st2 subtag2     # 10xx (subtag mask)
st21 subtag21    # 10x0
st22 subtag22    # 10x1
^st23 subtag23   # 101x
>
<st3 subtag3     # 11xx (subtag mask)
^st31 subtag31   # 11x1
^st32 subtag32   # 111x
>
>                # the matches above are: +t1 or +t2 or +t3 or +st1+st11
                 # or ... +st2+st21+st23 or ... +st3+st32
<clue            # with macros
</mac0 0 t1 t2 t3 >
</mac t_1 t_2 t_3 t_4 >
<st0 /mac0 >
<st1 /mac >
<st2 /mac >
<st3 /mac >
>
# end of sample tags file
Tagged Word List:
word1 +tag1+tag4
word2
word3+tag3
...
wordN+tag7+tag1+tag4
Text Delimited:
field11 | field12 | field13
field21
...
fieldN1|fieldN2
Text dictionary fields:
word | morpho-tags | user-tags | topic-tags | clue-tags | clue
Word-Clue Format
<word><delimiter><clue>NL
This is a single line per word-clue.
The delimiter is defined by the user. Here is a sample
with ',' as delimiter:
word1, clue1
word2
word3, clue31
word3, clue32
...
wordN, clueN
<word>NL, <clue1>NL, <clue2>NL, ...2NL
The word-clue is presented by several lines. Each line defines a separate clue. Here is a sample with a commented word-clue (the flag Comments (--) should be set):
word1
clue1

--start of comment
word2
clue21
clue22
--end of comment

...

wordN
clueN1
clueN2
clueN3
<word>NL, <clue1>NL..., TAB<clue2>NL..., ...2NL
In this variant one clue can be presented by several lines.
To start a new clue, use TAB (actually, the first clue can also
start with TAB, but it is not required and will be ignored).
Here is a sample:
word1
clue11
clue11
	clue12
clue12

...

wordN
clueN1
%h <word>NL, %dNL, <clue1>NL, <clue2>NL, ...2NL
This is the popular "%h%d" text dictionary format used in Linux, but each clue can use only one line.
%h word1
%d
clue11
clue12

...

%h wordN
%d
clueN1
clueN2
%h <word>NL, %dNL, <clue1>NL..., TAB<clue2>NL..., ...2NL
This again is the "%h%d" format but with several lines per clue and
a TAB to start a new clue:
%h word1
%d
clue11
clue11
	clue12
clue12
clue12

...

%h wordN
%d
clueN1
DSL
DSL is the source text format for ABBYY Lingvo dictionaries. All control statements are ignored and only the plain data is processed (i.e. the encoding should be set as usual in the Source File Encoding dialog).
word1
	clue11
	clue12
word2
word3
	clue21
	clue22
wordN
	clueN1
Note that word2 and word3 have the same clues in this format.
The 'IPA8' file has almost the same format as the Alphabet file, but it defines a custom 8-bit IPA encoding and is itself encoded in UTF-8 or UnicodeASCII. Any line should contain 2 characters. The first character is the 8-bit code and the second character is the corresponding Unicode character (usually from the IPA block).
Since IPA uses the base Latin characters as well, you are not required to define these characters, but you can overwrite them. For example, if 'k' is not redefined, it will not be recoded; or you can define 'I' to be 'Latin letter small capital I' (\u026a).
Note that if you redefine characters outside the 7-bit ASCII range,
they have to be included with the corresponding Unicode code,
for example, if you want the Cyrillic small letter sha (\u0448)
to be the IPA letter esh (\u0283), the file should include the line:
\u0448\u0283
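Putting it together, a minimal sketch of a whole IPA8 file (assuming the first line names the converter, as in Alphabet files):
UTF-8
I\u026a
\u0448\u0283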
Restrictions:
- the clue should have only 'number_per_tag_code' starting with "ipa8";
- the 'IPA8' encoded text in the clue should be enclosed in
square brackets ('[' and ']'), and the codes of the brackets should
not be redefined;
- the IPA8 file should not redefine characters which are not part of the source file encoding (this is a global CPT requirement - such characters will be shown as '?').
For a complete template set of files see the 'samples' directory.
The following table compares sample word lists (the 'ratio' row is the 'text' size divided by the 'CTree' size):
| | Chinese | Dutch | en_US | Russian | Yiddish | 5000K |
| words | 44364 | 542477 | 124220 | 942251 | 11723 | 5000000 |
| letters | 3951 | 39 | 26 | 32 | 40 | 26 |
| text | 449526 | 13251886 | 1333413 | 11068646 | 207842 | 34505844 |
| text+ | 189338 | 1816293 | 365504 | 2374712 | 41309 | 11315688 |
| sq | - | 3798365 | 700073 | 4372484 | - | 20000026 |
| sq+ | - | 727120 | 257373 | 987849 | - | 142101 |
| CTree | 154444 | 613110 | 269274 | 885092 | 31632 | 21322 |
| CTree+ | 145154 | 565828 | 240112 | 565828 | 27147 | 773 |
| ratio | 2.91 | 21.61 | 4.95 | 12.5 | 6.57 | 1618.32 |
| Appendix B: Language Support |
Here we describe the support for specific language features. This mostly concerns the languages called "complex" because they differ in some aspects from the European ones.
The Armenian and modern European languages using the Cyrillic, Greek and Latin scripts have upper and lower case letters in their alphabets. All other alphabets are considered to have caseless letters, and the conversion instructions are ignored for them. The standard case conversion is supported via built-in tables based on the BMP of Unicode 5.1.0, plus the Turkish dotless 'i'. The title case letters are converted to lower/upper case using a double lookup in the tables. The special casing, which maps one lower case character to 2 or 3 upper case characters, is supported as well (see Options Tab).
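For illustration only - the program uses its own built-in tables, not the JDK - the same mappings can be observed with the standard Java case conversion:

import java.util.Locale;

public class CaseDemo {
    public static void main(String[] args) {
        // Special casing: one lower case character expands to two upper case ones.
        System.out.println("stra\u00dfe".toUpperCase(Locale.GERMAN)); // prints STRASSE
        // Turkish dotless 'i': upper case 'I' lowers to \u0131.
        System.out.println("INDEX".toLowerCase(new Locale("tr")));    // prints \u0131ndex
    }
}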
The default standard support is for the left-to-right direction. The vertical direction is not supported. Here, the special support for Hebrew/Yiddish and the Arabic-script languages, which use the right-to-left direction, will be described.
We have to start with two terms: 'logical order' and 'visual order'. The logical order is the order in which characters and words are read and written. The visual order is the order in which the characters appear on display or on paper. RTL text can be stored in memory (RAM or file) in visual or in logical order. The Unicode standard defines the logical order as the default. On the other hand, some Hebrew web pages use the visual order as a 'de-facto standard'.
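To illustrate the two orders, here is a minimal sketch (not this program's own bidi module) that converts a logical-order paragraph to visual order with the JDK's java.text.Bidi, ignoring mirroring and combining marks:

import java.text.Bidi;

public class LogicalToVisual {
    static String toVisual(String logical) {
        Bidi bidi = new Bidi(logical, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        int n = bidi.getRunCount();
        byte[] levels = new byte[n];
        String[] runs = new String[n];
        for (int i = 0; i < n; i++) {
            levels[i] = (byte) bidi.getRunLevel(i);
            String run = logical.substring(bidi.getRunStart(i), bidi.getRunLimit(i));
            // Characters inside an RTL (odd level) run are displayed reversed.
            runs[i] = (levels[i] & 1) != 0 ? new StringBuilder(run).reverse().toString() : run;
        }
        // Reorder the runs themselves into their visual sequence.
        Bidi.reorderVisually(levels, 0, runs, 0, n);
        StringBuilder visual = new StringBuilder();
        for (String r : runs) visual.append(r);
        return visual.toString();
    }

    public static void main(String[] args) {
        // 'abc' and 'def' stay LTR; the Hebrew letters come out reversed.
        System.out.println(toVisual("abc \u05d0\u05d1\u05d2 def"));
    }
}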
When the RTL text is stored in visual order, it is processed as LTR text. In this case you can set the check box Right Alignment in Display Options and the text will appear naturally for the script. If you want to create a CTree in visual order, you have to set the check box Reverse Word Letters in the Target Tab for the proper sorting.
When the RTL text is stored in logical order, it is supported by the special bidi processing, specified in Unicode Technical Report #9. To switch on the bidi, you have to set the check box RTL in logical order (see File Encoding, Locale) and/or choose the options described in Custom RTL Conversion.
The main goal of the bidi processing in this program is to supply transparent conversion to/from visual/logical order for the other text processing modules. Our text area display control is actually not bidi enabled. If the RTL text is in logical order, it is automatically converted and stored there in visual order. This way it appears naturally, but the multi-line selection and the searching direction are internally LTR. This implies that in some cases you should check Ignore Non-spacing Marks for the searching. On the other hand, you can select smoothly exactly what you see on the screen, without the bidi jumping. Our text field display control has experimental bidi support without jumping selection, but when you edit/enter text using the '\uxxxx' notation, the cursor position might not be handled properly. The entering of some forms of regular expressions might also be difficult because of the bidi processing. Some of the problems can be solved by changing the alignment: the right alignment implies default RTL line direction, and the left alignment - LTR. You can use the Ctrl+B keys to switch the bidi display off/on (but not the bidi processing of the final text). The communication of both display controls with the clipboard is always in logical order and in Unicode encoding.
When the program has to process Text Delimited, it forces the user delimiter character to be treated as a paragraph separator by the bidi processing. The other 'deviations' from the standard bidi conversion are: the mirroring (step L4) is done before the reordering (step L2), and the reordering of combining marks (step L3) is not performed (otherwise the process would not be reversible). We have to note also that the 'reverse' bidi (visual to logical order) is not standardised.
If you create a multilingual CTree dictionary, you should set the language of the RTL script as the language of the dictionary.
The special shaping support is for the Arabic script, according to the rules specified in Unicode. The shaping of displayed text (without ligatures) is controlled via the check box Shaping, described in Display Options. The shaping of processed text is controlled via 'Arabic Composition' and Shape Letters..., described in Unicode Normalization; it includes ligaturization as well. The shapes of the Syriac script are not defined in Unicode and are not supported yet. Some Arabic shapes are still missing in Unicode as well.
The program has a simple algorithm for handling decomposed characters having non-spacing marks. It is on when Shaping is checked. You should use it only when the font does not handle the combining marks properly.
All standard Unicode compositions/decompositions are supported as described in Unicode Normalization. Here we will give more information about the non-standard ones.
The 'Arabic Composition' processing uses tables stored in "ar_lig.dic", which is a CTree with tags and 'clues' (you can dump, edit and recreate it). The standard Unicode composition is not able to handle Arabic ligatures and their shapes, and this was the reason to create special modules and a dictionary. The 'words' in the dictionary are the decomposition codes of the ligatures and the 'clues' are the codes of the ligature shapes as specified in Unicode. The 'clue tags' define the shape type: 'ini' for initial, 'iso' for isolated, 'med' for medial and 'fin' for final. The 'morpho tags' define the desired subset of the ligatures. You can find out which ligature glyphs are supported by a particular font just by opening the dictionary in the source text area (the ttf "Arial Unicode MS" contains most of the used glyphs).
For example, to see the composition at work, set the Source locale to 'fa' (Persian, Farsi), set Unicode Normalization to 'Arabic Composition', switch off 'Shaping' in the Display Options dialog, and then open 'ar_lig.dic' in the Source text area. All lines having the tag 'iso' should have the same glyphs on the left side and on the right side (except when there is the tag 'nfa' - not Farsi). And if you set the locale to 'ar', the lines having the tag 'nsa' (not standard Arabic) will not be composed, but all others having the 'iso' tag should be composed properly. To see the source characters from the file, switch off the composition and reopen.
The decomposition is handled properly via the
standard Unicode compatible decomposition.
You can create a text word list or CTree in Arabic composed form plus letter shaping by using Unicode as the target encoding; this way it will not need additional processing every time it is browsed. To create such a CTree with 'Strict Alphabet' you have to use the whole "ar_ALL" from the alphabet directory, or, if you don't care about the file size and the sorting, just set no 'Strict Alphabet'.
The 'Thai Composition/Decomposition' processing uses tables stored in "th_lig.dic", which is a CTree with tags and 'clues'. The 'words' are 4187 sequences of Thai characters aimed to cover most syllables. The 'clues' are codes from the Unicode private use area. The 'morpho tags' are the Unicode character classes such as 'Lo', 'Nd', 'Po', 'Mn', plus one additional: 'Lns', which stands for 'letter non-starter' (see the Dictionary tab in Word Properties). The 'clue tags' define the subset: 'co' for all composed or 'Syllables', and 'sc' for 'Single cells'. The goal of introducing this custom Thai composition is to make it easier to handle the native Thai dictionary sorting and the breaking of spaceless sentences into words.
You can process Thai texts without composition, but the sorting of leading vowels will not be handled. You can also replace the locale dictionary "th.dic" with another one, not in composed form, but the breaking of words will be a hundred times slower. Changing "th_lig.dic" is not recommended, because you will lose all composed data (including "th.dic") and you will need the program used to create the codes. If you really need to change it, you have to decompose everything with the old CTree before using the new one.
To create a Thai composed CTree without much effort, specify Unicode as the target encoding and no 'Strict Alphabet'. If a Locale Alphabet is used together with Thai composed text, you have to create a new alphabet file with the codes from "th_lig.dic". To browse a Thai composed CTree without explicit decomposition, you have to set the check box Shaping, described in Display Options.
There is one more special composition - 'Yiddish Composition'. It is obtained via the standard compatible composition with the check box Plus Excluded set. To create word lists or CTrees in Yiddish composed form, use Unicode as the target encoding and the supplied "yi" alphabet.
The decomposition is handled properly via the
standard Unicode compatible decomposition.
The CJK (Chinese, Japanese, Korean) scripts use thousands of characters. Since there is no restriction on the number of letters, these languages are not considered "complex" and there is no special additional support. If you use codes from the Unicode Surrogate Block, they will be treated as unrecognized characters, without any support.