Generating Unicode (Devnagari) Pages

... if you are interested in knowing how I created the dEvnAgrI pages read on ...

The Hows and The Whys

Short of using language-specific editors, most people who, like me, engage in newsgroup conversations much prefer the standard English keyboard. We would rather "transliterate" our indic tongues phonetically, and possibly metrically, using roman letters than adopt new software and learn a new keyboard map. Further, having pages of pre-existing transliterated text, I was loath to spend the effort of writing all of them anew in a dEvnAgrI editor. The obvious solution, of course, is to automatically map such text to the appropriate "code-points" in the UCS chart (for instance, the dEvnAgrI map extends from U0900 to U097F) and then leave the rest to the (default) Unicode standard and rendering rules¹. Here I must defer to Sandeep Sibal's Jtrans scheme, which does almost exactly the same, except that it targets the xdvng font developed by Arun Gupta, which in turn was derived from the ligature set of Avinash Chopde's Itrans package. Thus, this is neither a very novel concept nor an entirely unobvious one. My effort lay in making a parser that would:

Scan my existing transliterated (html) documents,
Locate/Identify the appropriate phonemes (which is quite possible because my transliteration scheme for urdU and marATHI is unambiguous phonetically and metrically, unlike the Itrans scheme), and
Replace each phoneme with the numerical representation of the code-point of the corresponding "character". In addition, context-sensitive rules can be employed at this stage to control the actual "appearance" of the "character" if one doesn't like the standard conjunct-ligatures for consonant clusters etc.

Note that the replacement of each phoneme with a character is possible only if each phoneme in a given transliteration mechanism is unambiguously mapped to a character. In case no corresponding character exists, an approximation has to be made using existing characters. A typical case is the glottal stop, which is absent in dEvnAgrI but commonly used in urdU.
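The three steps above (scan, identify phonemes, replace with code points) can be sketched in a few lines of Python. This is not the lex source of my parser, only an illustration: the table is a small hypothetical subset of the scheme, the function names are made up for this sketch, and the "&#x....;" output form is an assumed reading of "numerical representation". It also ignores the inherent a vowel and the virAm entirely.

```python
# Hypothetical subset of the phoneme table; the full scheme has many more entries.
PHONEME_TO_CODEPOINT = {
    "kH": 0x0916,
    "k": 0x0915,
    "gH": 0x0918,
    "g": 0x0917,
    "m": 0x092E,
    "A": 0x093E,  # dependent form of the long A
    "I": 0x0940,  # dependent form of the long I
}


def tokenize(text, table=PHONEME_TO_CODEPOINT):
    """Greedy longest-match scan; characters that don't scan pass through."""
    longest = max(len(key) for key in table)
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first so that "kH" wins over "k".
        for size in range(min(longest, len(text) - pos), 0, -1):
            chunk = text[pos:pos + size]
            if chunk in table:
                tokens.append(chunk)
                pos += size
                break
        else:
            tokens.append(text[pos])
            pos += 1
    return tokens


def to_numeric_refs(text, table=PHONEME_TO_CODEPOINT):
    """Replace each recognised phoneme with an HTML numeric character reference."""
    out = []
    for token in tokenize(text, table):
        if token in table:
            out.append("&#x%04X;" % table[token])
        else:
            out.append(token)
    return "".join(out)


print(tokenize("kHAkI"))
print(to_numeric_refs("kAm"))
```

Trying the longest candidate first is what keeps the digraphs (kH, gH and so on) from being split into a plain consonant followed by a stray H.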

Transliteration Scheme and Mapping to UCS Code Points

The following transliteration scheme is driven primarily by concerns of prosody (as employed in urdU poetry) and a need for unambiguous phonetic representation. The phonetic values are derived from sa.Nsk.rt (Sanskrit), marATHI (Marathi), and urdU (Urdu).

Vowels and Associated Diacritics

A general rule of thumb to follow is that the lowercase versions of the vowels denote the short vowels, while the corresponding uppercase versions denote the corresponding long vowels. This is true for all vowels except the a and A.

In the "UCS Code Points" column below, the values in the parentheses denote the values used for independent vowels (i.e. those vowels that are not bound to a consonant).

Table of Vowels
Vowel	UCS Code Points	Examples and Special Considerations
a	- (U0905)	The short vowel as in par (feather). To represent the long version of this vowel, repeat the character. Thus aa represents a long a. Unfortunately, there seems to be no way to represent this as a character in dEvnAgrI except as a series of independent a characters. An alternative would be to use the avagraha, but that would conflict with its use as a glottal stop (see the consonant table below). Note: The a vowel is inherent in every nominal dEvnAgrI consonant character. Hence, no special code point exists to denote the dependent vowel.
A	U093E (U0906)	The long vowel as in kAm (work). To represent the "short" version of this vowel use the symbol that denotes vowel reduction (see below). Thus A.s denotes the short A.
i	U093F (U0907)	The short vowel as in din (day).
I	U0940 (U0908)	The long vowel as in ChIn (China).
u	U0941 (U0909)	The short vowel as in sun (hear).
U	U0942 (U090A)	The long vowel as in dUr (far away).
.r	U0943 (U090B)	The short vowel as in .rShI (sage). It is the vowel corresponding to the liquid consonant r. Used only in words from sa.Nsk.rt (Sanskrit).
.R	U0944 (U0960)	The long version of the .r vowel. Used only in words from sa.Nsk.rt (Sanskrit).
.l	U0962 (U090C)	The short vowel as in k.lp (age). It is the vowel corresponding to the liquid consonant l. Used only in words from sa.Nsk.rt (Sanskrit).
.L	U0963 (U0961)	The long version of the .l vowel. Used only in words from sa.Nsk.rt (Sanskrit).
e	U0947 U0952 == E.s (U090F U0952)	The short vowel corresponding to the long E (see below). No character in dEvnAgrI. Represented as a reduced E vowel by my UCS parser. Note: UCS provides U0946 (U090E) for this vowel, but I represent it in my UCS parser as a reduced vowel for consistency.
E	U0947 (U090F)	The long vowel as in nEk (honest).
ae	U0948 U0952 == aE.s (U0910 U0952)	The short vowel as in maehael (palace) corresponding to the long aE (see below). No character in dEvnAgrI. Represented as a reduced aE vowel by my UCS parser.
aE	U0948 (U0910)	The long vowel as in paEr (leg).
o	U094B U0952 == O.s (U0913 U0952)	The short vowel corresponding to the long O (see below). No character in dEvnAgrI. Represented as a reduced O vowel by my UCS parser. Note: UCS provides U094A (U0912) for this vowel, but I represent it in my UCS parser as a reduced vowel for consistency.
O	U094B (U0913)	The long vowel as in ChOr (thief).
ao	U094C U0952 == aO.s (U0914 U0952)	The short vowel corresponding to the long aO (see below). No character in dEvnAgrI. Represented as a reduced aO vowel by my UCS parser.
aO	U094C (U0914)	The long vowel as in kaOn (who).

'	-	Vowel Demarcator (hamzA?). Used to denote vowel separation when an independent vowel follows another vowel. e.g.: e following a is written a'e and not ae. The latter is an altogether different vowel! Note: My UCS parser internally uses the ' to mark independent vowels before converting them to their code points.

.s	U0952	Used to denote vowel grade reduction. Especially useful for prosodic purposes to indicate an originally long vowel that is reduced due to metrical concerns. An auxiliary use is within my UCS parser to internally represent the short forms of the E, aE, O, and aO vowels since the corresponding characters do not exist really in dEvnAgrI.
.h	U094D	The virAm. Used internally by my UCS parser to denote absence of the implicit a vowel. There is never a need to use this in the actual transliterated text except within the <noviram> </noviram> tags (see the specifics of the UCS parser implementation below).

:	U0903	The visarga. Not a vowel per se. Phonologically, in sa.Nsk.rt (Sanskrit), it stands for an unvoiced h sound and follows only vowels. Also used to represent the "imperceptible" h sound in urdU words (e.g. ki:, AShna: etc.).
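The choice between the dependent and the independent (parenthesised) code points in the vowel table, and the reduced forms used for the short e, ae, o and ao, can be sketched as below. The code points are copied from the table; the function name and the `after_consonant` flag are hypothetical and only serve this illustration.

```python
# (dependent sign, independent letter) pairs copied from the vowel table.
VOWELS = {
    "A": (0x093E, 0x0906),
    "i": (0x093F, 0x0907),
    "I": (0x0940, 0x0908),
    "u": (0x0941, 0x0909),
    "U": (0x0942, 0x090A),
    "E": (0x0947, 0x090F),
    "aE": (0x0948, 0x0910),
    "O": (0x094B, 0x0913),
    "aO": (0x094C, 0x0914),
}
REDUCTION = 0x0952  # the .s mark
INDEPENDENT_A = 0x0905
# Short vowels that are written as their long counterpart plus .s
SHORT_TO_LONG = {"e": "E", "ae": "aE", "o": "O", "ao": "aO"}


def vowel_codepoints(vowel, after_consonant):
    """Return the list of code points for one vowel token."""
    if vowel == "a":
        # The a is inherent in the consonant, so nothing is emitted after one.
        return [] if after_consonant else [INDEPENDENT_A]
    reduced = vowel in SHORT_TO_LONG
    dependent, independent = VOWELS[SHORT_TO_LONG.get(vowel, vowel)]
    points = [dependent if after_consonant else independent]
    if reduced:
        points.append(REDUCTION)
    return points


print([hex(cp) for cp in vowel_codepoints("I", after_consonant=True)])
print([hex(cp) for cp in vowel_codepoints("e", after_consonant=False)])
```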

Corollary

Since the .s serves to reduce the vowel grade, the combination a.s, by definition, is the void consonant. In this form it is functionally similar to the .h sequence. Thus ka.s == ka.h == k. While this suggests that the notions of .s and .h can be unified, I nevertheless need the two representations since the virAm character in UCS cannot be rendered following a vowel.
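One possible reading of this corollary, sketched in Python: applying .s to a syllable that ends in the inherent a produces the virAm, while applying it after an explicit vowel sign produces the reduction mark. The helper names are hypothetical, and whether my lex parser takes exactly this branch is not something this sketch claims.

```python
VIRAM = 0x094D      # .h
REDUCTION = 0x0952  # .s


def is_consonant(cp):
    # U0915 (k) through U0939 (h) are the basic consonants in the dEvnAgrI chart.
    return 0x0915 <= cp <= 0x0939


def reduce_last_vowel(points):
    """Apply .s to a syllable given as a list of code points."""
    if points and is_consonant(points[-1]):
        # The syllable ends in the inherent a: reducing it leaves a bare
        # consonant, which is what the virAm expresses (ka.s == ka.h == k).
        return points + [VIRAM]
    # After an explicit vowel sign the virAm cannot be rendered, so .s is kept.
    return points + [REDUCTION]


print([hex(cp) for cp in reduce_last_vowel([0x0915])])          # ka.s
print([hex(cp) for cp in reduce_last_vowel([0x0915, 0x093E])])  # kA.s
```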

Consonants, Nasals, etc.

A general rule of thumb to follow is that any H following a consonant acts to aspirate the consonant. Thus, the symbol H stands for both voiced and unvoiced aspiration. In case an aspirated character is unavailable in dEvnAgrI, the phoneme is approximated as a conjunct ligature of the corresponding unaspirated consonant and the consonant h. For example, the aspirated mH has to be rendered as a half m followed by an h.
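The aspiration rule can be written down as a small lookup with a fallback. The sketch below is in Python rather than lex, the `aspirate` helper is hypothetical, and the code points are the standard ones for the dEvnAgrI stops, the virAm and the h.

```python
VIRAM = 0x094D
HA = 0x0939

# Unaspirated stop -> dedicated aspirated character.
ASPIRATED = {
    0x0915: 0x0916,  # k -> kH
    0x0917: 0x0918,  # g -> gH
    0x091A: 0x091B,  # Ch -> ChH
    0x091C: 0x091D,  # J -> JH
    0x091F: 0x0920,  # T -> TH
    0x0921: 0x0922,  # D -> DH
    0x0924: 0x0925,  # t -> tH
    0x0926: 0x0927,  # d -> dH
    0x092A: 0x092B,  # p -> pH
    0x092C: 0x092D,  # b -> bH
}


def aspirate(consonant):
    """Return the code points for `consonant` followed by H."""
    if consonant in ASPIRATED:
        return [ASPIRATED[consonant]]
    # No aspirated character exists: half consonant (via virAm) plus h.
    return [consonant, VIRAM, HA]


print([hex(cp) for cp in aspirate(0x0915)])  # kH
print([hex(cp) for cp in aspirate(0x092E)])  # mH
```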

In the table below, the consonants in the parentheses represent the corresponding aspirated forms (if any).

Table of Consonants etc.
Consonant	UCS Code Points	Examples and Special Considerations

k (kH)	U0915 (U0916)	The Unvoiced Velar Stop. k as in kab (when). kH as in kHEl (play).
g (gH)	U0917 (U0918)	The Voiced Velar Stop. g as in gIt (song). gH as in gHar (house).
.N^k (.N^kH)	U0919 (U0919 U094D U0939 == .N^k.hh)	The Velar Nasal. .N^k as the .N in ja.Ng (war).

Ch (ChH)	U091A (U091B)	The Unvoiced Palatal Stop. Ch as in ChIz (thing). ChH as in ChHat (roof).
J (JH)	U091C (U091D)	The Voiced Palatal Stop. J as in JEb (pocket). JH as in JHIl (rivulet).
.N^j (.N^jH)	U091E (U091E U094D U0939 == .N^j.hh)	The Palatal Nasal. .N^j as the .N in pa.NjAb (Punjab).

t (tH)	U0924 (U0925)	The Unvoiced Dental Stop. t as in tErA (your). tH as in tHA (was).
d (dH)	U0926 (U0927)	The Voiced Dental Stop. d as in dEkH (see). dH as in dHIraJ (courage).
n (nH)	U0928 (U0928 U094D U0939 == n.hh)	The Dental Nasal. n as in nAm (name), n as the .N in Cha.NdA == ChandA (donation). nH as in unHE (to them).

T (TH)	U091F (U0920)	The Unvoiced Retroflex Stop. T as in TA.ng (leg). TH as in THEkA (stake).
D (DH)	U0921 (U0922)	The Voiced Retroflex Stop. D as in DamrU (drum). DH as in DHag (cloud in marATHI).
N (NH)	U0923 (U0923 U094D U0939 == N.hh)	The Retroflex Nasal. N as in pANI (water in marATHI). N as the .N in pa.NDit == paNDit (priest).

p (pH)	U092A (U092B)	The Unvoiced Bilabial Stop. p as in pAnI (water). pH as in pHirsE (again).
b (bH)	U092C (U092D)	The Voiced Bilabial Stop. b as in bandH (strike). bH as in bHArat (India).
m (mH)	U092E (U092E U094D U0939 == m.hh)	The Bilabial Nasal. m as in mazhab (religion), m as the .N in ba.NbU == bambU (bamboo). mH as in tumHArA (yours).

y	U092F	The Palatal Liquid Approximant. y as in yAr (friend).

r	U0930	The Alveolar Liquid Approximant (Trill)? r as in rAJA (king).
??	TBD	The Retroflex Liquid Approximant (Trill)?

l (lH)	U0932 (U0932 U094D U0939 == l.hh)	The Lateral Liquid Approximant? l as in lambA (long). lH as in dulHan (bride).
L (LH)	U0933 (U0933 U094D U0939 == L.hh)	The Retroflex Liquid Approximant? L as in pALNa (crib in marATHI).

f	U095E == U092B U093C == pH.d	The Unvoiced LabioDental Fricative. f as in farEb (betrayal).
v	U0935	The Voiced LabioDental Fricative. v as in virAsat (inheritance). Note: The v was originally a w; however, contemporary usage employs v throughout all indic tongues.

w	U0935 U093C == v.d	The Voiced Bilabial Fricative? w as in KhwudI (self). Note: The w never occurs in most Indian tongues and has been replaced by v all over. Even in words borrowed from Arabic, it has been replaced by the v. Nevertheless, it is useful in certain fArsI words (as above) wherein it is assimilated into the preceding consonant.

s	U0938	The Unvoiced Alveolar Fricative. s as in sab (everything).
z	U095B == U091C U093C == J.d	The Voiced Alveolar Fricative. z as in mazhab (religion). Note: The contemporary choice of using this symbol for the z is rather unfortunate. The z is really a voiced s and is unrelated to the J phonologically. As a result, there is no symbol to use for the j sound (see below).
Sh	U0936	The Unvoiced PalatoAlveolar Fricative. Sh as in ShEr (tiger).
Zh	U091D U093C == JH.d	The Voiced PalatoAlveolar Fricative. Zh as in ZhAlA (sleet/ice). It is equivalent to the sound of s in the English word measure. Note: As for the z, similar comments apply to the choice of symbol here.
Xh	U0937	The Unvoiced Retroflex Fricative. Xh as in bHUXhan (Bhushan - a proper noun). Used only in words derived from sa.Nsk.rt (Sanskrit).
??	TBD	The Voiced Retroflex Fricative.

h	U0939	The (Un)Voiced Pharyngeal? Fricative. h as in hAtHI (elephant). Note: Most indic tongues do not seem to distinguish between the voiced and unvoiced versions of the h. Classically, this was probably a voiced consonant. However, in modern usage, it seems to be exclusively unvoiced.

q	U0958 == U0915 U093C == k.d	The Unvoiced Uvular Stop. q as in qismat (fate/destiny).
?	U095A == U0917 U093C == g.d	The Voiced Uvular Stop.
Kh	U0959 == U0916 U093C == kH.d	The Unvoiced Uvular Fricative. Kh as in KhwudI (self).
Gh	U0918 U093C == gH.d	The Voiced Uvular Fricative. Gh as in GhaEr (stranger).

c (cH)	U091A U093C == Ch.d (U091A U093C U094D U0939 == Ch.d.hh)	The Unvoiced Palatal Fricative. c as in cAk (wheel in marATHI). Used extensively only in marATHI.
j (jH)	TBD (TBD)	The Voiced Palatal Fricative. j as in jAg (awake! in marATHI). jH as in mAjHE (mine in marATHI). Used extensively only in marATHI. However, modern usage tends to the z sound, probably due to the influence of English.

R (RH)	U095C == U0921 U093C == D.d (U095C U094D U0939 == D.d.hh)	The Voiced Retroflex Flap? R as in baRA (big). RH as in dARHI (beard).

.n	U0901	The anunAsik or pure/assimilated nasal. The assimilated nasal is represented by a Chandra-bindU (crescent-dot) above the vowel marks (if present). It never contributes to the metrical weight of the syllable.
.N	U0902	The anusvar or the unassimilated nasal. The unassimilated nasal, usually represented in orthography by a bindU (dot), is technically equivalent to an implicit nasal consonant whose exact nature is determined by the following consonant. Thus, sa.Ngat == s + a + .N^k + g + a + t; sa.NChAr == s + a + .N^j + Ch + a + r; sa.NtUr == s + a + n + t + U + r; sa.Npat == s + a + m + p + a + t; and so on. Obviously, it contributes to the metrical weight of the syllable. Where an equivalent character exists in dEvnAgrI (as in the examples above), my UCS parser substitutes it in place of the anusvar. In other cases, it is left as a bindU.

`	U093D	The Glottal Stop. ` as in aSh`Ar (couplets). There is no character in dEvnAgrI to represent this. Technically, I guess, I could have used the void consonant and rendered it as a.s (but that would still be an issue if the ` is followed by vowels). In the current parser implementation I decided to go with the character for the avagraha, since there is no possibility of confusion and it eliminates ambiguity when the ` is followed by vowels.

.d	U093C	The "dot" diacritic (nuktA) used below consonants.

0	U0966	The Number 0.
1	U0967	The Number 1.
2	U0968	The Number 2.
3	U0969	The Number 3.
4	U096A	The Number 4.
5	U096B	The Number 5.
6	U096C	The Number 6.
7	U096D	The Number 7.
8	U096E	The Number 8.
9	U096F	The Number 9.

.	U0970	The Abbreviation symbol.

\|	U0964	The dEvnAgrI phrase separator.
\|\|	U0965	The dEvnAgrI sentence terminator.
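The anusvar (.N) substitution described in the table can be sketched as a lookup on the consonant that follows. The sketch below is Python, not my lex code; the `implicit_nasal` helper is hypothetical, and the ranges simply group the stops of each class as they appear in the table. It returns only the nasal's code point and leaves out whatever is needed to join it to the following stop.

```python
ANUSVAR = 0x0902

# (stops of one class, nasal of that class), following the consonant table.
NASAL_FOR_CLASS = [
    (range(0x0915, 0x0919), 0x0919),  # k kH g gH    -> .N^k
    (range(0x091A, 0x091E), 0x091E),  # Ch ChH J JH  -> .N^j
    (range(0x091F, 0x0923), 0x0923),  # T TH D DH    -> N
    (range(0x0924, 0x0928), 0x0928),  # t tH d dH    -> n
    (range(0x092A, 0x092E), 0x092E),  # p pH b bH    -> m
]


def implicit_nasal(next_consonant):
    """Code point that stands for .N before the given consonant."""
    for stops, nasal in NASAL_FOR_CLASS:
        if next_consonant in stops:
            return nasal
    # No class nasal applies (e.g. before s or h): leave it as a bindU.
    return ANUSVAR


print(hex(implicit_nasal(0x0917)))  # .N before g, as in sa.Ngat
print(hex(implicit_nasal(0x092A)))  # .N before p, as in sa.Npat
print(hex(implicit_nasal(0x0938)))  # .N before s
```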

The one deviation in my UCS parser implementation is that it uses the j symbol for the J. As a result, certain marATHI words cannot be represented for now ...

Still to do: Represent Dental Fricatives (the english th sounds).
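The "==" equivalences in the consonant table (a single code point on one side, a consonant plus the nuktA U093C on the other) can be checked against the Unicode character database. A minimal Python sketch using the standard `unicodedata` module; the dictionary and the `decomposed` helper are hypothetical names for this illustration.

```python
import unicodedata

# Single code points listed in the consonant table for the dotted letters.
PRECOMPOSED = {"q": 0x0958, "Kh": 0x0959, "z": 0x095B, "R": 0x095C, "f": 0x095E}


def decomposed(cp):
    """Canonical decomposition of one code point, as a list of code points."""
    return [ord(ch) for ch in unicodedata.normalize("NFD", chr(cp))]


for name, cp in PRECOMPOSED.items():
    parts = " ".join("U%04X" % part for part in decomposed(cp))
    print("%-2s U%04X == %s" % (name, cp, parts))
```

Each of these decomposes into a base consonant followed by U093C, so a parser can emit either form and a Unicode-aware renderer should treat them alike.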

My UCS Parser

My Lex Parser Source Code dev.yy (GPL). This is primarily written to scan HTML pages. It converts only the text enclosed by <dn> </dn> tags into dEvnAgrI and discards the tags themselves. Everything outside those tags, along with any letters inside them that don't scan, is echoed as-is to the output.
To compile this into a working executable do:

$ lex dev.yy
$ gcc -DYY_MAIN lex.yy.c
$ mv a.out 2dev
Finally, cat the transliterated text to stdin and pipe it to the executable. The UCS representation is spit to stdout. An equivalent alternative is:

$ ./2dev < myText.txt > myText.ucs
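The <dn> handling described above (convert what is inside the tags, drop the tags, echo everything else) can be imitated in a few lines of Python. This is only a sketch of the behaviour, not the lex implementation; `convert_page` and the stand-in converter are hypothetical, and nested or unclosed <dn> tags are not handled.

```python
import re

# Non-greedy match so that each <dn>...</dn> pair is handled separately.
DN_BLOCK = re.compile(r"<dn>(.*?)</dn>", re.DOTALL)


def convert_page(html, convert):
    """Apply `convert` to the text inside <dn> tags and drop the tags.

    Everything outside the tags is returned unchanged.
    """
    return DN_BLOCK.sub(lambda match: convert(match.group(1)), html)


def stand_in_converter(text):
    # Placeholder for the real transliteration step; it only marks the text.
    return "[" + text + "]"


page = "<p>The word <dn>kAm</dn> means work.</p>"
print(convert_page(page, stand_in_converter))
```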

Parser Implementation Notes

Future Work?

My next immediate plan is to remap my current parser to use the fArsI script. Unfortunately, my transliteration scheme, which has a phonetic basis, does not readily lend itself to such translation simply because the fArsI script does not have a one-to-one mapping from a phoneme to character ...

Another goal would be to provide a Web-based interface to my 2dev tool. This will take some effort and time because I'll have to indulge in writing an entire tokenizer in JavaScript. Boy is that fun!!