... if you are interested in knowing how I created the
dEvnAgrI pages read on ...
The Hows and The Whys
Short of using language-specific editors, most people like me who
engage in newsgroup conversations much prefer to use the standard
English keyboard. We would rather "transliterate" our Indic tongues
phonetically, and possibly metrically, using roman letters than
wrestle with new software and learn a new keyboard map. Further,
having pages of pre-existing transliterated text, I was loath to
spend the effort of writing all of them anew in a dEvnAgrI editor.
The obvious solution, of course, is to automatically map such text to
the appropriate "code-point" in the UCS chart (for instance, the
dEvnAgrI map extends from U0900 to U097F) and then leave the rest to
the (default) Unicode standard and rendering rules [1].
Here I must defer to Sandeep Sibal's Jtrans scheme, which does almost
exactly the same, but for the xdvng font developed by Arun Gupta,
which in turn was derived from the ligature set of Avinash Chopde's
Itrans package. Thus, this is neither a very novel concept nor
entirely unobvious. My effort lay in making a parser that would:
- Scan my existing transliterated (html) documents,
- Locate/identify the appropriate phonemes (which is quite possible
because my transliteration scheme for urdU and marATHI is unambiguous
phonetically and metrically, unlike the Itrans scheme), and
- Replace each phoneme with the numerical representation of the
code-point of the corresponding "character". In addition,
context-sensitive rules can be employed at this stage to control the
actual "appearance" of the "character" if one doesn't like the
standard conjunct-ligatures for consonant clusters etc.
Note that the replacement of each phoneme with a character is possible
only if each phoneme in a given transliteration mechanism is
unambiguously mapped to a character. In case no corresponding
character exists, an approximation has to be made using existing
characters. A typical case is the glottal stop, which is absent in
dEvnAgrI but commonly used in urdU.
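The replacement step described above can be sketched as a longest-match scan over the transliterated text. This is only an illustration, not the actual parser: the names `PHONEMES` and `to_char_refs` are hypothetical, the table is heavily abridged, and the `&#x....;` output format is my assumption about what "numerical representation" means for an HTML page. The sketch also ignores the virAm and independent-vowel handling discussed later.

```python
# Hypothetical, abridged phoneme table (transliteration -> UCS code point).
PHONEMES = {
    "kH": 0x0916,  # aspirated k; must win over plain "k"
    "k": 0x0915,
    "g": 0x0917,
    "t": 0x0924,
    "A": 0x093E,   # dependent form of the long A
    "I": 0x0940,   # dependent form of the long I
}
MAX_LEN = max(len(symbol) for symbol in PHONEMES)


def to_char_refs(text):
    """Replace each known phoneme with an HTML numeric character reference.

    At every position the longest matching symbol is tried first, so that
    "kH" is read as one phoneme rather than "k" followed by "H".
    Anything that does not match is echoed unchanged.
    """
    out = []
    i = 0
    while i < len(text):
        for size in range(min(MAX_LEN, len(text) - i), 0, -1):
            code = PHONEMES.get(text[i:i + size])
            if code is not None:
                out.append("&#x%04X;" % code)
                i += size
                break
        else:
            # No phoneme starts here: copy the character as-is.
            out.append(text[i])
            i += 1
    return "".join(out)


print(to_char_refs("gIt"))
```

Trying the longest symbol first is what makes an unambiguous scheme easy to scan: no backtracking is needed as long as no phoneme sequence can be read in two ways.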
Transliteration Scheme and Mapping to UCS Code Points
The following transliteration scheme is driven primarily by concerns of
prosody (as employed in urdU poetry) and a need for unambiguous
phonetic representation. The phonetic values are derived from
sa.Nsk.rt (Sanskrit), marATHI (Marathi), and urdU (Urdu).
Vowels and Associated Diacritics
A general rule of thumb is that the lowercase versions of the vowels
denote the short vowels, while the corresponding uppercase versions
denote the corresponding long vowels. This is true for all vowels
except the a and A.
In the "UCS Code Points" column below, the values in parentheses
denote the values used for independent vowels (i.e. those vowels that
are not bound to a consonant).
Table of Vowels
Vowel | UCS Code Points | Examples and Special Considerations
a | - (U0905) | The short vowel as in par (feather). To represent the long version of this vowel repeat the character. Thus aa represents a long a. Unfortunately, there seems to be no way to represent this as a character in dEvnAgrI except as a series of independent a characters. An alternative would be to use the avagraha - but that would conflict with its use as a glottal stop (see the consonant table below). Note: The a vowel is inherent in every nominal dEvnAgrI consonant character. Hence, no special code point exists to denote the dependent vowel.
A | U093E (U0906) | The long vowel as in kAm (work). To represent the "short" version of this vowel use the symbol that denotes vowel reduction (see below). Thus A.s denotes the short A.
i | U093F (U0907) | The short vowel as in din (day).
I | U0940 (U0908) | The long vowel as in ChIn (China).
u | U0941 (U0909) | The short vowel as in sun (hear).
U | U0942 (U090A) | The long vowel as in dUr (far away).
.r | U0943 (U090B) | The short vowel as in .rShI (sage). It is the vowel corresponding to the liquid consonant r. Used only in words from sa.Nsk.rt (Sanskrit).
.R | U0944 (U0960) | The long version of the .r vowel. Used only in words from sa.Nsk.rt (Sanskrit).
.l | U0962 (U090C) | The short vowel as in k.lp (age). It is the vowel corresponding to the liquid consonant l. Used only in words from sa.Nsk.rt (Sanskrit).
.L | U0963 (U0961) | The long version of the .l vowel. Used only in words from sa.Nsk.rt (Sanskrit).
e | U0947 U0952 == E.s (U090F U0952) | The short vowel corresponding to the long E (see below). No character in dEvnAgrI. Represented as a reduced E vowel by my UCS parser. Note: UCS provides U0946 (U090E) for this vowel, but I represent it in my UCS parser as a reduced vowel for consistency.
E | U0947 (U090F) | The long vowel as in nEk (honest).
ae | U0948 U0952 == aE.s (U0910 U0952) | The short vowel as in maehael (palace) corresponding to the long aE (see below). No character in dEvnAgrI. Represented as a reduced aE vowel by my UCS parser.
aE | U0948 (U0910) | The long vowel as in paEr (leg).
o | U094B U0952 == O.s (U0913 U0952) | The short vowel corresponding to the long O (see below). No character in dEvnAgrI. Represented as a reduced O vowel by my UCS parser. Note: UCS provides U094A (U0912) for this vowel, but I represent it in my UCS parser as a reduced vowel for consistency.
O | U094B (U0913) | The long vowel as in ChOr (thief).
ao | U094C U0952 == aO.s (U0914 U0952) | The short vowel corresponding to the long aO (see below). No character in dEvnAgrI. Represented as a reduced aO vowel by my UCS parser.
aO | U094C (U0914) | The long vowel as in kaOn (who).

' | - | Vowel Demarcator (hamzA?). Used to denote vowel separation when an independent vowel follows another vowel. E.g., e following a is written a'e and not ae. The latter is an altogether different vowel! Note: My UCS parser internally uses the ' to mark independent vowels before converting them to their code points.

.s | U0952 | Used to denote vowel grade reduction. Especially useful for prosodic purposes to indicate an originally long vowel that is reduced due to metrical concerns. An auxiliary use is within my UCS parser to internally represent the short forms of the E, aE, O, and aO vowels, since the corresponding characters do not really exist in dEvnAgrI.
.h | U094D | The virAm. Used internally by my UCS parser to denote absence of the implicit a vowel. There is never a need to use this in the actual transliterated text except within the <noviram> </noviram> tags (see the specifics of the UCS parser implementation below).

: | U0903 | The visarga. Not a vowel per se. Phonologically, in sa.Nsk.rt (Sanskrit), it stands for an unvoiced h sound and follows only vowels. Also used to represent the "imperceptible" h sound in urdU words (e.g. ki:, AShna: etc.).
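The two code-point columns of the vowel table can be read as a lookup with two forms per vowel: the dependent sign used after a consonant and the independent letter used elsewhere. The sketch below is a minimal illustration of that reading, not the parser itself. The names `VOWELS`, `SHORT_OF` and `vowel_code_points` are hypothetical, and the vocalic .r/.l rows are left out for brevity.

```python
# (dependent sign or None, independent letter), taken from the table above.
VOWELS = {
    "a": (None, 0x0905),   # inherent in the consonant: no dependent sign
    "A": (0x093E, 0x0906),
    "i": (0x093F, 0x0907),
    "I": (0x0940, 0x0908),
    "u": (0x0941, 0x0909),
    "U": (0x0942, 0x090A),
    "E": (0x0947, 0x090F),
    "aE": (0x0948, 0x0910),
    "O": (0x094B, 0x0913),
    "aO": (0x094C, 0x0914),
}
REDUCTION = 0x0952  # the .s vowel-reduction mark

# Short vowels written as the long vowel followed by the reduction mark.
SHORT_OF = {"e": "E", "ae": "aE", "o": "O", "ao": "aO"}


def vowel_code_points(vowel, after_consonant):
    """Return the list of code points for one transliterated vowel."""
    reduced = vowel in SHORT_OF
    dependent, independent = VOWELS[SHORT_OF.get(vowel, vowel)]
    if after_consonant:
        points = [] if dependent is None else [dependent]
    else:
        points = [independent]
    if reduced:
        points.append(REDUCTION)
    return points


print([hex(p) for p in vowel_code_points("e", after_consonant=True)])
```

Treating e, ae, o and ao as "long vowel plus .s" keeps the table small and mirrors the consistency argument given in the notes for e and o.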
Corollary
Since the .s serves to reduce the vowel grade, the combination a.s, by
definition, is the void consonant. In this form it is functionally
similar to the .h sequence. Thus ka.s == ka.h == k. While this
suggests that the notions of .s and .h can be unified, I nevertheless
need the two representations since the virAm character in UCS cannot
be rendered following a vowel.
Consonants, Nasals, etc.
A general rule of thumb is that any H following a consonant acts to
aspirate the consonant. Thus, the symbol H stands for both voiced and
unvoiced aspiration. In case an aspirated character is unavailable in
dEvnAgrI, the phoneme is approximated as a conjunct ligature of the
corresponding unaspirated consonant and the consonant h. For example,
the aspirated mH has to be rendered as a half m followed by an h.
In the table below, the consonants in parentheses represent the
corresponding aspirated forms (if any).
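The aspiration rule lends itself to a small lookup with a fallback. The following is a sketch under the assumption that the fallback is always "consonant, virAm, h", as the parenthesised entries in the table suggest. The names `ASPIRATED` and `aspirate` are hypothetical.

```python
VIRAM = 0x094D
HA = 0x0939

# Unaspirated stop -> dedicated aspirated character, where dEvnAgrI has one.
ASPIRATED = {
    0x0915: 0x0916,  # k -> kH
    0x0917: 0x0918,  # g -> gH
    0x091A: 0x091B,  # Ch -> ChH
    0x091C: 0x091D,  # J -> JH
    0x091F: 0x0920,  # T -> TH
    0x0921: 0x0922,  # D -> DH
    0x0924: 0x0925,  # t -> tH
    0x0926: 0x0927,  # d -> dH
    0x092A: 0x092B,  # p -> pH
    0x092C: 0x092D,  # b -> bH
}


def aspirate(consonant):
    """Return the code points for the aspirated form of a consonant."""
    if consonant in ASPIRATED:
        return [ASPIRATED[consonant]]
    # No dedicated character: half consonant (via virAm) followed by h.
    return [consonant, VIRAM, HA]


# mH has no character of its own, so it becomes m + virAm + h.
print("".join(chr(p) for p in aspirate(0x092E)))
```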
Table of Consonants etc.
Consonant | UCS Code Points | Examples and Special Considerations

k (kH) | U0915 (U0916) | The Unvoiced Velar Stop. k as in kab (when). kH as in kHEl (play).
g (gH) | U0917 (U0918) | The Voiced Velar Stop. g as in gIt (song). gH as in gHar (house).
.N^k (.N^kH) | U0919 (U0919 U094D U0939 == .N^k.hh) | The Velar Nasal. .N^k as the .N in ja.Ng (war).

Ch (ChH) | U091A (U091B) | The Unvoiced Palatal Stop. Ch as in ChIz (thing). ChH as in ChHat (roof).
J (JH) | U091C (U091D) | The Voiced Palatal Stop. J as in JEb (pocket). JH as in JHIl (rivulet).
.N^j (.N^jH) | U091E (U091E U094D U0939 == .N^j.hh) | The Palatal Nasal. .N^j as the .N in pa.NjAb (Punjab).

t (tH) | U0924 (U0925) | The Unvoiced Dental Stop. t as in tErA (your). tH as in tHA (was).
d (dH) | U0926 (U0927) | The Voiced Dental Stop. d as in dEkH (see). dH as in dHIraJ (courage).
n (nH) | U0928 (U0928 U094D U0939 == n.hh) | The Dental Nasal. n as in nAm (name), n as the .N in Cha.NdA == ChandA (donation). nH as in unHE (to them).

T (TH) | U091F (U0920) | The Unvoiced Retroflex Stop. T as in TA.ng (leg). TH as in THEkA (stake).
D (DH) | U0921 (U0922) | The Voiced Retroflex Stop. D as in DamrU (drum). DH as in DHag (cloud in marATHI).
N (NH) | U0923 (U0923 U094D U0939 == N.hh) | The Retroflex Nasal. N as in pANI (water in marATHI). N as the .N in pa.NDit == paNDit (priest).

p (pH) | U092A (U092B) | The Unvoiced Bilabial Stop. p as in pAnI (water). pH as in pHirsE (again).
b (bH) | U092C (U092D) | The Voiced Bilabial Stop. b as in bandH (strike). bH as in bHArat (India).
m (mH) | U092E (U092E U094D U0939 == m.hh) | The Bilabial Nasal. m as in mazhab (religion), m as the .N in ba.NbU == bambU (bamboo). mH as in tumHArA (yours).

y | U092F | The Palatal Liquid Approximant. y as in yAr (friend).

r | U0930 | The Alveolar Liquid Approximant (Trill)? r as in rAJA (king).
?? | TBD | The Retroflex Liquid Approximant (Trill)?

l (lH) | U0932 (U0932 U094D U0939 == l.hh) | The Lateral Liquid Approximant? l as in lambA (long). lH as in dulHan (bride).
L (LH) | U0933 (U0933 U094D U0939 == L.hh) | The Retroflex Liquid Approximant? L as in pALNa (crib in marATHI).

f | U095E == U092B U093C == pH.d | The Unvoiced LabioDental Fricative. f as in farEb (betrayal).
v | U0935 | The Voiced LabioDental Fricative. v as in virAsat (inheritance). Note: The v was originally a w; however, contemporary usage employs v throughout all Indic tongues.

w | U0935 U093C == v.d | The Voiced Bilabial Fricative? w as in KhwudI (self). Note: The w never occurs in most Indian tongues and has been replaced by v all over. Even in words borrowed from Arabic, it has been replaced by the v. Nevertheless, it is useful in certain fArsI words (as above) wherein it is assimilated into the preceding consonant.

s | U0938 | The Unvoiced Alveolar Fricative. s as in sab (everything).
z | U095B == U091C U093C == J.d | The Voiced Alveolar Fricative. z as in mazhab (religion). Note: The contemporary choice of using this symbol for the z is rather unfortunate. The z is really a voiced s and is unrelated to the J phonologically. As a result, there is no symbol to use for the j sound (see below).
Sh | U0936 | The Unvoiced PalatoAlveolar Fricative. Sh as in ShEr (tiger).
Zh | U091D U093C == JH.d | The Voiced PalatoAlveolar Fricative. Zh as in ZhAlA (sleet/ice). It is equivalent to the sound of s in the English word measure. Note: As for the z, similar comments apply here.
Xh | U0937 | The Unvoiced Retroflex Fricative. Xh as in bHUXhan (Bhushan - a proper noun). Used only in words derived from sa.Nsk.rt (Sanskrit).
?? | TBD | The Voiced Retroflex Fricative.

h | U0939 | The (Un)Voiced Pharyngeal? Fricative. h as in hAtHI (elephant). Note: Most Indic tongues do not seem to distinguish between the voiced and unvoiced versions of the h. Classically, this was probably a voiced consonant. However, in modern usage, it seems to be exclusively unvoiced.

q | U0958 == U0915 U093C == k.d | The Unvoiced Uvular Stop. q as in qismat (fate/destiny).
? | U095A == U0917 U093C == g.d | The Voiced Uvular Stop.
Kh | U0959 == U0916 U093C == kH.d | The Unvoiced Uvular Fricative. Kh as in KhwudI (self).
Gh | U0918 U093C == gH.d | The Voiced Uvular Fricative. Gh as in GhaEr (stranger).

c (cH) | U091A U093C == Ch.d (U091A U093C U094D U0939 == Ch.d.hh) | The Unvoiced Palatal Fricative. c as in cAk (wheel in marATHI). Used extensively only in marATHI.
j (jH) | TBD (TBD) | The Voiced Palatal Fricative. j as in jAg (awake! in marATHI). jH as in mAjHE (mine in marATHI). Used extensively only in marATHI. However, modern usage tends to the z sound, probably due to the influence of English.

R (RH) | U095C == U0921 U093C == D.d (U095C U094D U0939 == D.d.hh) | The Voiced Retroflex Flap? R as in baRA (big). RH as in dARHI (beard).

.n | U0901 | The anunAsik or pure/assimilated nasal. The assimilated nasal is represented by a Chandra-bindU (crescent-dot) above the vowel marks (if present). It never contributes to the metrical weight of the syllable.
.N | U0902 | The anusvar or the unassimilated nasal. The unassimilated nasal, usually represented in orthography by a bindU (dot), is technically equivalent to an implicit nasal consonant, the exact nature of which is determined by the nature of the following consonant. Thus, sa.Ngat == s + a + .N^k + g + a + t; sa.NChAr == s + a + .N^j + Ch + a + r; sa.NtUr == s + a + n + t + U + r; sa.Npat == s + a + m + p + a + t; and so on. Obviously, it contributes to the metrical weight of the syllable. In cases where an equivalent character exists in dEvnAgrI (as in the examples above), my UCS parser substitutes it in place of the anusvar. In other cases, it is left as a bindU.

` | U093D | The Glottal Stop. ` as in aSh`Ar (couplets). There is no character in dEvnAgrI to represent this. Technically, I guess, I could have used the void consonant and rendered it by a.s (but that would still be an issue if the ` is followed by vowels). In the current parser implementation I decided to go with the character for the avagraha, since there is no possibility of confusion and it eliminates ambiguity when the ` is followed by vowels.

.d | U093C | The "dot" diacritic (nuktA) used below consonants.

0 | U0966 | The Number 0.
1 | U0967 | The Number 1.
2 | U0968 | The Number 2.
3 | U0969 | The Number 3.
4 | U096A | The Number 4.
5 | U096B | The Number 5.
6 | U096C | The Number 6.
7 | U096D | The Number 7.
8 | U096E | The Number 8.
9 | U096F | The Number 9.

. | U0970 | The Abbreviation symbol.

"|" | U0964 | The dEvnAgrI phrase separator.
"||" | U0965 | The dEvnAgrI sentence terminator.
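The anusvar rule in the table (replace .N by the nasal of the following consonant's class, otherwise keep the bindU) can be sketched as a lookup keyed on the next consonant. This is an illustration only: `CLASS_NASAL` and `resolve_anusvar` are hypothetical names. How the substituted nasal is then joined to the following consonant is not spelled out in the text, so the sketch returns just the chosen code point.

```python
ANUSVAR = 0x0902

# Build "following consonant -> class nasal" from the five stop classes.
CLASS_NASAL = {}
for nasal, members in (
    (0x0919, (0x0915, 0x0916, 0x0917, 0x0918)),  # velar: k kH g gH
    (0x091E, (0x091A, 0x091B, 0x091C, 0x091D)),  # palatal: Ch ChH J JH
    (0x0923, (0x091F, 0x0920, 0x0921, 0x0922)),  # retroflex: T TH D DH
    (0x0928, (0x0924, 0x0925, 0x0926, 0x0927)),  # dental: t tH d dH
    (0x092E, (0x092A, 0x092B, 0x092C, 0x092D)),  # bilabial: p pH b bH
):
    for member in members:
        CLASS_NASAL[member] = nasal


def resolve_anusvar(next_consonant):
    """Pick the code point that stands in for .N before next_consonant."""
    return CLASS_NASAL.get(next_consonant, ANUSVAR)


# .N before g (as in sa.Ngat) resolves to the velar nasal;
# before s there is no class nasal, so the bindU is kept.
print(hex(resolve_anusvar(0x0917)), hex(resolve_anusvar(0x0938)))
```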
The one deviation in my UCS parser implementation is that it uses the
j symbol for the J. As a result, certain marATHI words cannot be
represented for now ...
Still to do: Represent Dental Fricatives (the English th sounds).
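Equivalences of the form "U0958 == U0915 U093C" in the consonant table (a precomposed character versus base consonant plus nuktA) can be checked against the Unicode character database. The sketch below uses Python's standard `unicodedata` module for that. The helper name `nukta_decomposition` is hypothetical and the check is my addition, not part of the parser.

```python
import unicodedata


def nukta_decomposition(code_point):
    """Return the canonical decomposition of a code point as a list of ints.

    An empty list means the character has no decomposition.
    """
    fields = unicodedata.decomposition(chr(code_point)).split()
    return [int(field, 16) for field in fields]


# q (U0958) decomposes into k (U0915) followed by the nuktA (U093C).
print(["U%04X" % p for p in nukta_decomposition(0x0958)])
```

Running the same check over a whole mapping table is a cheap way to catch transcription slips in the code points.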
My UCS Parser
- My Lex Parser Source Code dev.yy (GPL). This is primarily written to
scan HTML pages. It skips over all text except that enclosed by
<dn> </dn> tags, which it converts into dEvnAgrI. It discards the
<dn> </dn> tags themselves. Letters that don't scan are echoed as-is
to the output.
To compile this into a working executable do:
$ lex dev.yy
$ gcc -DYY_MAIN lex.yy.c
$ mv a.out 2dev
- Finally, cat the transliterated text and pipe it to the executable's
stdin. The UCS representation is written to stdout.
An equivalent alternative is:
$ 2dev < myText.txt > myText.ucs
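The scanning behaviour described above (convert only what sits between <dn> and </dn>, drop the tags, echo everything else) can be sketched with a regular expression. This is a rough stand-in for the Lex scanner, not a port of it: `convert_page` is a hypothetical name, the `convert` argument stands for whatever transliteration step is plugged in, and the tags are assumed to be lowercase and not nested.

```python
import re

# Non-greedy match so that each <dn>...</dn> pair is handled on its own.
DN_BLOCK = re.compile(r"<dn>(.*?)</dn>", re.DOTALL)


def convert_page(html, convert):
    """Apply convert() to text inside <dn>...</dn> and drop the tags.

    Everything outside the tags is passed through unchanged.
    """
    return DN_BLOCK.sub(lambda match: convert(match.group(1)), html)


# str.upper is only a placeholder for the real transliteration step.
print(convert_page("<p>x <dn>kab</dn> y</p>", str.upper))
```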
Parser Implementation Notes
<UNDER CONSTRUCTION>
Future Work?
My next immediate plan is to remap my current parser to use the fArsI
script. Unfortunately, my transliteration scheme, which has a phonetic
basis, does not readily lend itself to such a translation, simply
because the fArsI script does not have a one-to-one mapping from
phoneme to character ...
Another goal would be to provide a Web-based interface to my 2dev tool.
This will take some effort and time because I'll have to indulge in
writing an entire tokenizer in javascript. Boy is that fun!!