A Standard For Tamil Computing

The following is a DRAFT of a proposal for comments by the internet tamil community. Early in 1998 (January ?), this draft will be finalised and the proposal presented to the Tamilnadu Computer Standardisation Committee (TNC) for possible adoption as a Standard.

A Proposal for
A Tamil Standard Code For Information Interchange
(TSCII)

Tamil

Tamil is one of the two classical languages of India. It is the only language in that country which has continued to exist for over two thousand years. It is spoken today by approximately 65 million people living mainly in southern India, Sri Lanka, Singapore, Malaysia, Africa, Fiji, the West Indies, Mauritius and Reunion Islands, United Kingdom, United States and Canada. Tamil is the pre-eminent member of the Dravidian Language family and has one of the longest unbroken literary traditions of any living language in the world. [1]

Information Processing in Tamil

Dravidian languages such as Tamil are written in non-roman scripts. Hence typing text in these Indic languages on a computer requires specific font-faces, word-processing software, or both. In spite of this limitation, word-processing of tamil text materials on computers has been taking place for well over a decade. Many different fonts and packages have been developed. With the availability of free tamil fonts on the internet during the last two years, there has been a phenomenal growth in the number of web sites dealing with matters of interest to Tamils at large. There are already a number of tamil language newspapers and many popular and literary magazines available "On-line" in tamil script. There are web sites devoted to collections of electronic texts of tamil literary classics, language learning etc. in tamil script. A comprehensive listing of over 350 web sites of interest to Tamils is also available on the internet [2]. Currently nearly all tamil computing is at the word-processing level. We do not yet have dedicated software for other applications such as databases and multimedia.

In the absence of any organised effort to co-ordinate and promote tamil information processing at the national and international level, many different fonts and desktop publishing packages have been developed in different parts of the world. Hardly any standard protocols were observed in the development of these key tools for tamil information processing. This in turn has led to the present (rather unfortunate) situation in which one needs to download and install several tamil fonts or packages to be able to access most of the materials of interest to tamils available on the internet. An International Conference devoted to Tamil Information Processing, called TamilNet'97 (the first of its kind), was organised early this year in Singapore by the Internet Resources and Development Unit (IRDU) of the National University of Singapore to discuss the situation and propose possible standards. Some of the key papers presented at this conference [3] (including a broad overview of the features of different fonts and DTP packages currently in use [4]) are available on the internet.

Recent Efforts Towards Standardisation

Recently there have been three series of efforts, all directed towards standardisation of the tamil information processing on the internet. Firstly there have been a couple of national and international conferences on this topic, including the TamilNet'97 mentioned earlier. Secondly, the Tamil Nadu Government recently set up an expert committee /task force ("The Tamilnadu Computer Standardisation Committee") to examine the situation and make recommendations. This committee has made its first recommendations for tamil keyboard layouts [5].

Tamil has been fortunate to have two major email discussion lists on the internet: one operated by Asia-Pacific Internet Company (APIC) called tamilnet [6] (tamil@tamil.net) and one operated by the IRDU of the National University of Singapore called TamilWeb (tamilnet@tamilnews.org.sg) [7]. For over a year, tamil lovers from different parts of the world have been discussing on these lists the urgent need for an international standard for tamil computing. Participants in these discussions come from different walks of life (software developers, university academics and ordinary end-users). Recently the mailing lists were merged into a single one. The proposal for a new international standard for tamil information interchange discussed in these web pages is the outcome of these deliberations (an exchange of several hundred emails amongst several hundred participants over several months).

Current Standards for Tamil

Before we elaborate on the proposed standard, it is pertinent to review the current standards, if any. In the early eighties, the Dept. of Electronics (DOE) of the Govt. of India set up an expert committee to define standards for information processing of indic languages. The Indian Standard Code for Information Interchange (ISCII), first launched in 1984, is the outcome of this exercise. ISCII is an 8-bit umbrella standard, defined in such a way that all indian languages can be treated using one single character encoding scheme. ISCII is a bilingual scheme that encodes characters (not glyphs!). Roman characters and punctuation marks as defined in standard lower ASCII take up the first half of the character set (the first 128 slots). Characters for indic languages are allocated to the upper slots (128-255). ISCII-84 was subsequently revised in 1991 (ISCII-91). It is currently undergoing revision, possibly leading to ISCII-97. Along with the character encoding scheme (ISCII), the Govt. of India also defined a keyboard layout for input called INSCRIPT. The research and development wing of the DOE (the Center for Development of Advanced Computing, CDAC, based in Pune, India) has developed software packages based on these indian standards. Its multilingual and multimedia products are based on Graphics and Intelligence-based Script Technology (GIST) (Email: gist@cdac.ernet.in). Commercial DTP packages based on ISCII are also available.

UNICODE [8] is a rapidly evolving international standard for multi-lingual word-processing. Unicode is a more ambitious 16-bit character encoding scheme that defines over 65000 slots for 50+ world languages. Along with other indic languages, Tamil has been assigned the slots U+0B80 -> U+0BFF (which, in decimal, is 2944 -> 3071; 128 locations) in this multi-lingual standard [9]. For obvious reasons, the choice of characters in UNICODE for indic languages is based on the indian standard code ISCII. Microsoft has already implemented Unicode in its Windows 95/NT OS and even distributes a unicode font free for multi-lingual word-processing [10]. These fonts do NOT yet include any glyphs for the indic language segments. Apple has recently released a multi-lingual package for the indian market based on ISCII [11], but this package does NOT yet include the glyphs corresponding to Tamil.
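The hexadecimal-to-decimal arithmetic quoted for the Tamil block can be checked in a few lines. This is only a minimal Python sketch; the constant and function names are illustrative and not part of any standard.

```python
# Tamil block boundaries as quoted in the text (Unicode code points).
TAMIL_BLOCK_START = 0x0B80
TAMIL_BLOCK_END = 0x0BFF

# Number of slots in the block, counting both end points.
block_size = TAMIL_BLOCK_END - TAMIL_BLOCK_START + 1
print(TAMIL_BLOCK_START, TAMIL_BLOCK_END, block_size)  # 2944 3071 128


def in_tamil_block(ch: str) -> bool:
    """Return True if a single character lies inside the Tamil block."""
    return TAMIL_BLOCK_START <= ord(ch) <= TAMIL_BLOCK_END
```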

If ISCII and UNICODE standards already exist for information interchange of indic languages (including tamil), a natural question is why propose another standard for tamil. Listed below are some key arguments advanced in this context:

i) Both ISCII and UNICODE emphasise character encoding and leave the screen rendering of these characters to software developers. Dravidian languages are notorious for their complex glyph structures. Practically all of the current implementations of the ISCII and Unicode standards invoke modern font-handling techniques (such as glyph substitution) that are available only on state-of-the-art computers running the latest version of the OS (on both of the most widely used PC platforms, Windows and Macintosh). Consequently these DTP packages are very expensive. A layman / simple tamil user is precluded from doing even simple word-processing of tamil texts on earlier generation computers. The need for advanced font handling techniques such as glyph substitution puts us at a further disadvantage: we will have to wait for applications (DTP, Word Processing etc.) to be developed from scratch for Tamil, and we may not enjoy the luxury of using off-the-shelf applications developed for English *as-is* in Tamil.

ii) Using Devanagiri script as the reference, ISCII defines a certain encoding scheme for all indic languages including dravidian languages such as tamil, telugu and malayalam. Many scholars of the dravidian languages are highly critical of this approach. The phonology and the script usage of dravidian languages are very different. There are many characters in Tamil and Malayalam for which there are no equivalent devanagiri ones. Compromises are made by allocating extra slots to introduce these additional characters. By treating all indian scripts under one scheme, the ISCII philosophy does not take advantage of the fact that Tamil *can* be encoded in a simple form that seamlessly integrates with existing computing platforms without requiring specialised rendering technologies.

iii) ISCII and UNICODE are NOT the only avenues open for tamil information interchange. It is worth pointing out that these are evolving standards. Before their emergence, information processing and exchange in the major languages of the world went on for several decades using simple, self-standing 7- and 8-bit fonts. As mentioned earlier, even in Tamil, several 7- and 8-bit tamil fonts have been used successfully in the last decade for the major area of tamil computing, viz. word-processing. The only problem with these tamil fonts is that no standard encoding scheme has been used. So exchange of tamil text files is not simple. One needs to use converters to go from one scheme to another. Web (read World-Wide-Web) based information exchange is fast growing as the rapid, cost-effective means of data exchange across the world. A standard encoding scheme for these tamil fonts can simplify the exchange enormously. European languages, for example, have been fortunate to have several character-encoding standards defined and universally implemented.

There are several advantages in developing a tamil standard for information interchange that is based on simple, self-standing fonts:

i) Once installed in the system, such fonts can be used directly in practically all applications without any extra software/hardware intervention.
ii) The development of fonts corresponding to one encoding scheme but for use on different computer platforms (particularly Windows, Macintosh and Unix) is rather straightforward. The task is so routine and simple that a growing number of fonts are being made available FREE on the internet, even by amateurs.
iii) World-wide, FREE distribution of a self-standing tamil font can lead to very rapid standardisation of information interchange, as has been the case with most of the european languages, Russian and Japanese. Until recently (when free tamil fonts appeared on the internet), tamil word-processing required the purchase of a tamil font for at least US$50 (much more for DTP packages). No language can flourish in the emerging computer era unless the basic fonts required for routine tasks either come as part of the computer system software or are freely available on the internet.

Proposed Standard Code for Tamil

The proposed tamil standard is an 8-bit bilingual scheme with the standard roman characters and punctuation marks in the first 128 slots (as in the lower ASCII chart). The tamil glyphs, along with a handful of grantha characters and special characters, are placed in the upper-ASCII part (slot positions 128-255). Table 1 (see annex-1) presents a complete listing of the glyph choices with their code assignments. The same information is also presented in the form of a compact .gif file (Figure 1). The motivation for the specific glyph choices and slot allocations is elaborated in the next section (design goals for the proposed standard).
Glyph choices for Slot positions 0-127 /rows 0-7:

Roman characters and punctuation marks - glyph choices identical to those in standard lower ASCII code / 8859-1 (Latin-1) schemes

Glyph choices for Slot positions 128-255 /rows 8-15:

i) entire set of vowels (uyirs) (18) and consonants (meis) (14)

ii) entire set of akaram-eRRiya meis (18), ikara varisai (18) and ukara varisai (18) and uukara varisai (18)

iii) consonant-modifiers for aakara, ikara, iikara, ekara and Ekara varisai (5); consonant-modifiers for the ukara and uukara varisai for the grantha characters (2)

iv) grantha characters ja, sha, sa, ha, ksha ( 5 vowel form and also the corresponding akara varisai (5))

v) special characters: copyright sign (#169), registered sign (#174) and bullet (#183) at their respective ANSI code positions shown within parenthesis. (Most of the punctuation marks required for presentation of newspapers and magazines on WWW are available in the standard lower ASCII set. Two missing ones are the copyright and registered mark signs. We included them here at their respective ANSI code positions to avoid the need to invoke additional font face tags in the HTML files.)
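The split between slots 0-127 (rows 0-7) and slots 128-255 (rows 8-15) can be illustrated with a short sketch. It assumes the usual 16 x 16 code chart, where the row is the slot number divided by 16; the function names are hypothetical and the sketch says nothing about which glyph sits in which slot.

```python
from typing import Tuple


def describe_slot(slot: int) -> Tuple[int, int, str]:
    """Return (row, column, half) for a slot in a 16 x 16 code chart."""
    if not 0 <= slot <= 255:
        raise ValueError("an 8-bit scheme has slots 0-255 only")
    row, column = divmod(slot, 16)
    half = "roman" if slot < 128 else "tamil"
    return row, column, half


def split_bilingual(data: bytes) -> Tuple[bytes, bytes]:
    """Separate the lower-half (roman) and upper-half (tamil) bytes."""
    roman = bytes(b for b in data if b < 128)
    upper = bytes(b for b in data if b >= 128)
    return roman, upper
```

For example, `describe_slot(169)` places the copyright sign in row 10, column 9 of the upper half.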

Design Goals of the Proposed Standard

1. Establish a consistent INTERNATIONAL TAMIL CHARACTER ENCODING STANDARD that will in turn lead to a SELF-STANDING TAMIL FONT USABLE on ALL COMPUTER PLATFORMS, particularly on EARLIER MODELS/OPERATING SYSTEMS (covering at least those that appeared within the last decade).

A tamil font defined very much like the roman font such as Times or Helvetica, once installed in the system, can be used on all software packages supported by the respective OS without the need for additional software/hardware intervention. It is likely that over 90% of tamil computing is in the form of simple word-processing of plain text. The encoding standard must be such as to be readily implemented in most of the widely used computer platforms (UNIX, Windows and Mac). The input of tamil materials will be in all these three platforms. On the internet, the information exchange may involve any of the three OS (sender could use a Windows PC, the recipient a Mac and the intermediate mail server can be Unix-based).

Fortunately, in the last three years, procedures have been developed for producing fonts with identical encoding schemes that work on these different platforms. Information exchange via email and WWW has also been perfected to the point that no serious problems are anticipated in the rapid implementation of the proposed scheme on all three OS. The Tamilnadu Govt. is willing to undertake the task of producing one such tamil font and distributing it free on the internet. Free distribution of a handful of such fonts will not deprive the software market. There will always be a need for specially designed fonts for professional usage (in publishing houses), in much the same way as a font market still exists for roman fonts (Adobe and others continue to make millions marketing roman fonts!).

2. The encoding is at the 8-BIT BILINGUAL LEVEL, USING A UNIQUE SET OF GLYPHS and the usual lower ASCII set (roman letters with standard punctuation marks) occupying the first 128 slots.
Why an 8-bit bilingual scheme?
i) Almost all of the European languages (representing a population of several hundred million!) currently employ such 8-bit bilingual schemes, commonly known as the ISO 8859-X schemes. Such 8-bit schemes are proven standards widely implemented on all major computer platforms. So, in terms of identification and implementation, the scheme is rather straightforward even for non-tamil speakers.
ii) An 8-bit scheme with the lower ASCII part in the first 128 slots can enormously facilitate the smooth flow of information across the internet in all of the commonly used protocols (SMTP, FTP, HTTP, NNTP, POP, IMAP,..). All non-tamil speaking personnel entrusted with communication flow (postmasters, system administrators,..) can easily follow the content, its originator, destination etc.
iii) Tamilnadu, as a constituent state of India, works under a bilingual scenario with both English and Tamil as the languages for official communications. With a single font it will be possible to correspond in either or both of the languages. The ISCII standard of the Govt. of India is also defined in a similar way.
What does it mean by a unique set of glyphs?
Tamil has far too many alphabets for each to be accommodated as a single glyph in the 128 slots left. So, depending on the complexity of the character (and its rendering), the scheme may use one, two or three bytes to define a single alphabet. But the choices of glyphs are such that each of the 250+ tamil alphabets (uyir, mei and uyirmeis) is represented in one and only one way.
In the past, Tamil used alternative glyphs for some of the tamil alphabets (e.g. a forward kombu/kokki to write lai/Nai/nai, Raa, Naa and naa, referred to as ORNL). A unique definition scheme implies that there is no place for these old style characters in the encoding scheme.
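The idea of one, two or three bytes per alphabet, each spelled in exactly one way, can be sketched as follows. The slot numbers below are HYPOTHETICAL and chosen only for illustration (the real assignments are those of Table 1 in the annex), and the sketch assumes that a left-side modifier is stored before its base glyph.

```python
from typing import List

# HYPOTHETICAL slot numbers, for illustration only (see Table 1 for real ones).
KA, AA_MOD, E_MOD = 0xB8, 0xA1, 0xA6

# Each character has exactly one spelling of one, two or three bytes.
SPELLINGS = {
    "ka": bytes([KA]),                 # full glyph
    "kaa": bytes([KA, AA_MOD]),        # base + right-side modifier
    "ko": bytes([E_MOD, KA, AA_MOD]),  # left modifier + base + right modifier
}


def decode(data: bytes) -> List[str]:
    """Split a byte string into character names by longest match."""
    by_seq = {seq: name for name, seq in SPELLINGS.items()}
    out, i = [], 0
    while i < len(data):
        for length in (3, 2, 1):
            chunk = data[i:i + length]
            if len(chunk) == length and chunk in by_seq:
                out.append(by_seq[chunk])
                i += length
                break
        else:
            raise ValueError("no spelling starts at byte offset %d" % i)
    return out
```

Because no two entries in `SPELLINGS` share a byte sequence, a given byte string decodes to only one sequence of characters, which is the property the "unique set of glyphs" is meant to guarantee.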
What about character encoding as in ISCII and UNICODE?
If the glyph encoding scheme is UNAMBIGUOUS in defining the resulting character set, then it does not really matter whether one chooses to encode glyphs or characters. Defining a unique set of glyphs that leads to a unique definition of all of the 250+ tamil characters makes the glyph encoding scheme unambiguous. Defining glyphs also defines the rendering of the characters. The fact that several tamil fonts already function successfully in the market is clear proof of the validity and implementability of the approach. As mentioned under (1), the glyph encoding scheme allows the design of simple, self-standing fonts. Defining characters alone and leaving the rendering to the software (as in UNICODE and ISCII) requires dedicated, expensive software. Most of the rendering methods currently employed in these schemes require modern font handling techniques that are available only on state-of-the-art computers. Unicode fonts and the Apple multi-lingual package (with Devanagiri) can be used only on the latest generation of computers with Power PC chips and current OS software!

3. The encoding scheme should be UNIVERSAL IN SCOPE. That is, the standard must include all characters that are likely to be used in everyday Tamil text interchange.
For centuries the tamil language has grown with several grantha characters added on. The usage of these grantha characters along with pure tamil ones is deep-rooted in the day-to-day usage of tamil by the common man. Hence the inclusion of these grantha characters becomes essential under the above criterion. Both ISCII and UNICODE recognise this situation and have provided specific slots for a number of grantha characters.
Unlike many of the tamil fonts and software packages that leave out rarely used tamil alphabets (such as ngu, ngU, nyu, nyuu), the present scheme ensures their presence. This has been done so that multimedia and software for teaching tamil can display all of the tamil alphabets without exception.

4. The encoding standard must be UNICODE AND ISCII COMPATIBLE.
Why unicode compatibility?
The glyph choices are to be such that a one-to-one correspondence table between the alphabet/character definitions under the present scheme and UNICODE / ISCII can be established. A draft of one such mapping table is presented in the annex section.
There are major advantages by ensuring this requirement:
i) The end-user can have a choice in the storage format to be either the present 8-bit scheme or the unicode/iscii format. So files can be exchanged readily between users of these different standards without loss of integrity of the file;
ii) Secondly, the present glyph encoding scheme can happily co-exist with the more sophisticated Unicode/ISCII schemes and can even make way for a smooth transition to unicode at a future date. Indian language packages for Unicode and ISCII are very expensive and have started appearing in the market only very recently. Fool-proof implementation of these schemes is still a largely under-explored domain.
What does unicode compatibility mean in terms of glyph choices?
Both the Unicode and ISCII schemes include a number of tamil numerals. So the present scheme needs to include these tamil numerals. Otherwise there cannot be a one-to-one correspondence with these forthcoming standards.
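A one-to-one correspondence table of the kind described under this goal can be sketched as a dictionary that is checked for invertibility. The Unicode code points on the right are from the Tamil block; the 8-bit slot numbers on the left are HYPOTHETICAL placeholders, not the values of the draft mapping table in the annex. The sketch covers standalone letters only and ignores modifiers.

```python
# Left: HYPOTHETICAL 8-bit slot numbers. Right: Unicode Tamil code points.
SLOT_TO_UNICODE = {
    0xAB: "\u0B85",  # TAMIL LETTER A
    0xAC: "\u0B86",  # TAMIL LETTER AA
    0xB8: "\u0B95",  # TAMIL LETTER KA
    0xC1: "\u0BAE",  # TAMIL LETTER MA
}


def is_one_to_one(table: dict) -> bool:
    """True if no two slots map to the same Unicode character."""
    return len(set(table.values())) == len(table)


# The inverse table exists only because the mapping is one-to-one.
UNICODE_TO_SLOT = {u: s for s, u in SLOT_TO_UNICODE.items()}


def to_unicode(data: bytes) -> str:
    """Convert 8-bit bilingual text to Unicode; ASCII passes through."""
    return "".join(chr(b) if b < 128 else SLOT_TO_UNICODE[b] for b in data)


def from_unicode(text: str) -> bytes:
    """Convert Unicode text back to the 8-bit scheme."""
    return bytes(ord(c) if ord(c) < 128 else UNICODE_TO_SLOT[c] for c in text)
```

A round trip such as `from_unicode(to_unicode(data)) == data` is what "without loss of integrity of the file" amounts to in practice.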

5. The standard may INCLUDE A PRIVATE USE AREA, which may be used to assign codes to characters not included.
The encoding scheme should leave at least 4-5 slots free for special use by software developers. None of the standard software written specifically for tamil will use the characters placed in these slots in "search" or "sort" type routines. However, use of these *special* characters in archives and other digital libraries is not encouraged, so as to prevent mis-interpretation of their 'values' or 'meanings'.
What are possible uses for such private use area?
Several possible usages can be envisaged for these private slots:
a) replacement of straight quotes by the corresponding curly quotes, as is the current default case in most of the Microsoft, Claris softwares for word-processing, graphics,...( vacant slots #145-148 strongly recommended for this purpose)
b) diacritical markers for writing transliterated/romanized form of tamil
c) old style tamil characters such as lai/Nai/nai or Raa/Naa/naa.
d) "escape slots" through which software developers can bring many special characters - such as those required for recording/processing of old classical tamil texts still in palm-leaf manuscript level.
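For reference, slots #145-148 are where the Windows ANSI code page (cp1252) places the curly quotes. That this is the reason for the recommendation in (a) is an inference, not something stated in the proposal. A minimal sketch using Python's built-in `cp1252` codec, with a hypothetical `straighten` helper for systems that lack the curly forms:

```python
# Decode slots 145-148 with the Windows ANSI (cp1252) code page.
curly = bytes([145, 146, 147, 148]).decode("cp1252")
print(curly)  # ‘’“”

# Translation table mapping the four curly-quote slots to straight quotes.
STRAIGHTEN = bytes.maketrans(bytes([145, 146, 147, 148]), b"''\"\"")


def straighten(data: bytes) -> bytes:
    """Replace curly quotes in slots 145-148 by their ASCII equivalents."""
    return data.translate(STRAIGHTEN)
```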

6. The standard must be USABLE AND CO-EXIST WITH other existing software until Unicode compliant software becomes available.
A one-to-one correspondence table between the character definitions of the proposed standard and those of the popular tamil fonts/DTP packages will ensure a smooth transition and the recovery of all the archived tamil text materials produced to date. Conversion software that allows inter-conversion of tamil text files prepared using different font encoding schemes already exists. One such conversion program based on the proposed tamil standard can be made available to promote a rapid and smooth transition to the new standard.

7. The tamil standard MUST BE IN PUBLIC DOMAIN.
The character encoding standard will have no restrictions on its use. It can be used freely for both commercial and private purposes.
Practically all of the tamil fonts and software currently in use world-wide are the recent work of individual authors and hence are subject to copyright protection for the authors. The copyright protection for authors is very clear with DTP packages. But when it comes to fonts, the scope (what can be subject to copyright and what cannot) is very hazy, and protection varies from country to country. So it is desirable to develop a true international standard that is clearly labelled as being in the public domain.

The proposed "encoding" is in the public domain, i.e. no one needs to seek permission or state credit to implement a font based on the encoding. But the "implemented font" may or may not be copyrighted by the developer; this is entirely at the developer's discretion.

8. The tamil encoding standard should be such as to allow rapid IMPLEMENTATION of many of the routine tasks required in large DATABASES (such as search or sort).
It is very likely that, with the widespread growth of a true international standard for tamil, large databases (library catalogues, electronic telephone directories, land/property registries, inventories of materials in departmental stores etc.) will be built based on tamil script. Routine usage of these databases often requires search or sort routines. The encoding scheme should be such as to allow development of software for these without unnecessary demand for huge computer memory or processing capacity.
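One way to see why the slot order matters for sorting is the sketch below. It uses HYPOTHETICAL slot numbers standing for three letters in alphabetical order and a hypothetical `sort_by_rank` helper. When the slot order already follows the alphabet, a plain byte comparison gives the same result as an explicit rank table, so no extra memory or processing is needed.

```python
from typing import Dict, List


def sort_by_rank(words: List[bytes], rank: Dict[int, int]) -> List[bytes]:
    """Sort byte strings using an explicit per-slot alphabetical rank."""
    return sorted(words, key=lambda w: [rank[b] for b in w])


# HYPOTHETICAL slots 0xAB < 0xB8 < 0xC1 for three letters in alphabetical order.
rank = {0xAB: 0, 0xB8: 1, 0xC1: 2}
words = [bytes([0xC1, 0xAB]), bytes([0xAB, 0xB8]), bytes([0xB8])]

# The rank-based order and the plain byte order coincide for this layout.
same_order = sort_by_rank(words, rank) == sorted(words)
```

In a glyph encoding, characters written with a left-side modifier would need reordering before comparison; that step is outside this sketch.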

9. The encoding standard should be such as to meet SPECIAL REQUIREMENTS of various types of applications.
What about the special needs of publishing houses that require high quality output? Can a glyph encoding guarantee this?
As mentioned earlier, with 128 available slots it is not possible to keep all of the 250+ tamil characters as single unique glyphs. If we use the frequency of occurrence of various tamil alphabets in a typical tamil text as a guide, with the proposed glyph choices, nearly 90% of the text will consist of tamil characters appearing as full glyphs without invoking of kerning or other font handling techniques!
Kerning is a routine font handling technique now available on most of the common computer platforms/OS. The present scheme envisages using kerning techniques to generate only two series of uyirmeis (ikara and iikara varisai). With a right-end modifier, the ikara and iikara varisai uyirmeis can be rendered fairly precisely on all platforms. So it is likely that over 98% of the tamil characters can be rendered easily on screen and in print without loss of quality. Techniques such as pair-wise kerning can handle even the residuals adequately.
Also, sophisticated software packages for professional publishing houses (that use high end computers ) can always invoke font/glyph substitution and bring in a single glyph for the required character in question rather than using a composite. Provisions have been made with escape slots for this purpose.
What about the display in Point-of-Sale (POS) terminals?
Excessive use of kerning can pose problems for character display on POS terminals. As stated above, kerning is invoked in only two series (ikara, iikara varisai). Even if, in these cases, the kokki falls apart on primitive POS systems, the tamil text should still be readable. Since the screen is constantly re-written, there should not be any problem in displaying all characters. On character-only terminals, however, certain characters cannot be rendered legibly (Na, ha, ksha, sa may not fit in the usual 8 x 12 cell).

10. The output of the tamil standard (tamil text) should be independent of the input mode.
There are several popular methods of input for tamil and these are considered under different keyboard layouts: classical tamil typewriter, romanized and phonetic or transliterated. Several Keyboard editors that allow input according to these different methods have already been developed and these can be readily adapted to include the proposed encoding scheme as the reference chart for the font in question.

11. As with the Unicode standard, "the proposed standard does not encode idiosyncratic, personal, novel, rarely exchanged, or private-use characters, nor does it encode logos or graphics. Artificial entities, whose sole function is to serve transiently in the input of text, are excluded. Graphologies unrelated to text, such as musical and dance notations, are outside the scope of the .. standard."
One possibility would be to agree on a supplementary ding-bat type font for exclusive usage amongst the tamil community, one that contains symbols such as OM, religious symbols, arrows, greek symbols etc. If all tamil web pages use these two (one official tamil font and a second de-facto standard dingbat style font), we can easily add some color and liveliness to the world of tamil computing.

Notes on Implementation of 8-bit encoding scheme

What are the procedures to be followed in design of fonts so that there is no loss of integrity of files when they are transported across different computer platforms?
Amongst the character sets defined for world languages, the most extensively used standards are those of the ISO 8859-X family, X=1,2,3,4,5 (8859-1 to 8859-4 are also known as Latin-1 to Latin-4; 8859-5 covers Cyrillic). Due to fears of bit-stripping, the 8859-X schemes do not use rows 8 and 9. Till the early nineties, bit-stripping was a common problem. (When bit-stripping occurs, the eighth bit is lost, so a character at slot #163 gets replaced by the one at #35 (=163 - 128), and a character in rows 8 or 9 turns into a control code in rows 0 or 1.) Now most of the communication protocols and software (particularly email (SMTP) and the Web (HTML on Netscape or Internet Explorer)) fully implement 8-bit transfers.
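A minimal sketch of bit-stripping, assuming it means clearing the most significant (eighth) bit of every byte, as happens on a 7-bit mail channel. The function names are illustrative only.

```python
def strip_high_bit(data: bytes) -> bytes:
    """Simulate a 7-bit channel by clearing bit 7 of every byte."""
    return bytes(b & 0x7F for b in data)


def survives_7bit(data: bytes) -> bool:
    """True if the text passes through a 7-bit channel unchanged."""
    return strip_high_bit(data) == data


def becomes_control_code(slot: int) -> bool:
    """True if stripping the high bit turns this slot into a control code (0-31)."""
    return slot >= 128 and (slot & 0x7F) < 32
```

Under this assumption, every slot in rows 8 and 9 (128-159) lands on a control code after stripping, while slots in rows 10-15 land on printable ASCII positions. Pure ASCII text is unaffected.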
However, it is essential that the proper precaution is taken in the definition of the encoding of the character set. There are many 8-bit encoding standards implemented today on computers (OS). Mention can be made of the following: i) ANSI standard having characters in all of the 16 rows; ii) Macintosh (MacRoman) encoding, having characters in all of the 16 rows; iii) Windows encoding ...... and iv) Adobe Standard Encoding. It is strongly recommended that the character names chosen during the packing correspond to full ANSI standard (or that of MacRoman ??).
For smooth handling of text files created using the proposed encoding scheme, it is desirable to register the proposed standard code-set TSCII as an international ISO standard. The commonly used web browsers such as Netscape/Internet Explorer recognize and handle many types of character sets and they can be persuaded to include the TSCII code set as well. This would also avoid the necessity to choose "personal encoding" to read the tamil text files on these browsers.

Continued in Part II, inclusive of Annexes (carrying the technical details on the proposed standard and guidelines on implementation).

This file was last revised on 2 Dec. 1997.
Please send your comments to Dr. K. Kalyanasundaram