中文電腦大綱

中文電腦大綱 (本頁作者保留版權, 如需全篇(中文或英文)引用請聯絡作者)

文字

表意: 字形與字義有關聯。如: 漢字。
表音: 使用數十個字母(音素)拼出各種音節。如: 羅馬字母。
其他: 表意與表音混合等。如: 日文。

中文(漢字)

漢字
繁體字: 有對應簡化字的漢字。台灣、香港、海外華人地區使用。
簡化字: 有對應繁體字的漢字。中國大陸、星加坡、海外華人地區使用。
傳承字: 傳統繼承的漢字, 即繁簡同形、沒有被簡化的漢字。
異體字: 字音字義相同, 有兩種或以上寫法的漢字。
國際漢字: 其他國家使用的漢字。如: 日語漢字、韓語漢字。

現代漢字字表
中國大陸: 常用字表, 通用字表, 簡化字表, 異體字表等。
台灣: 常用字表, 次常用字表, 罕用字表, 異體字表等。
日本: 當用漢字表等。

漢字特徵
形: 筆畫, 筆形(橫、豎、撇、點、捺、折), 劃數; 筆順(筆序); 部首, 部件; 漢字結構。
音: 今音 (普通話, 粵語等方言讀音); 古音; 文白二讀(讀書音及口語音);
多音字; 俗讀, 舊讀, 統讀。
義: 漢字的意義及其用法場合等; 褒貶等感情色彩。
用: 一般用字、專科用字、方言用字、人名地名用字;
被動字彙(認識的字)、主動字彙(會用的字);
常用字、非常用字等。

字量字頻
常用字(三四千)、次常用字、罕用字、古籍用字等。
一般字典收七千多到萬餘。現時估計漢字總量超過十萬。

字序
傳統無字序觀念。
傳統字書以部首及筆劃定字的出現先後次序。如: 說文解字(漢)、字彙(明)、康熙字典(清)。
傳統韻書用音序(先韻後聲)。如: 廣韻(宋)、集韻(宋)。
現代字典兩者皆有, 也有多種檢字法的字典。

電腦

二進制: 只由0及1組成的數字, 別於使用10個數字(0至9)的十進制。
十六進制: 使用10個數字(0至9)及ABCDEF的數字系統。
人機介面: 操作系統。如Windows XP。
程序: 系統程序如Norton Antivirus, 應用程序如Office XP。
資料存儲: 檔案。如: 純文字格式(txt)、Word文件(doc)等。
運行: 輸入 > 處理 > 輸出/存儲
電子書: 文字、圖畫、聲音、錄像

編碼及文字編碼

可以被編碼的對象: 文字、圖畫、聲音、錄像等。
文字會轉成二進制, 交由電腦處理, 一般程序員都會把二進制轉成十六進制, 以便於閱讀。

字集(ref:中文:漢字): 字表, 按一定原則收錄的文字集合。如字典。

英文: 52個大小楷羅馬(拉丁)字母、阿拉伯數字、標點符號等。
中文: 繁體字集、簡化字集等。

編碼字集

有數字及文字兩欄的字表,
一個文字只對應一個數字,
一個數字只代表一個文字。
此數字稱為內碼。
字集大小固定。
碼位: 一個數字及文字的編碼空間。
碼區: 由連續的碼位組成。
編碼字集: 由不同碼區組成。

文字以內碼存儲在電腦檔案中,
也以內碼作信息交換用。

英文(羅馬字母)
ASCII。如A用41(十六進制)代表。有128個碼位。

繁體字: Big5。
如 "一" 用A440(十六進制)代表。
主要是台灣及香港使用,
有13053字及標點等符號,
共二萬多碼位。

其他: CNS11643、CCCII。

簡化字:

GB
如 "一" 用D2BB(十六進制)代表。
主要是中國大陸使用。
有6763字及標點等符號。

GBK
有二萬多字。

GB18030。
有二萬七千多字。

繁簡異日韓字:

Unicode 2.0(多國文字編碼, 漢字區段名為CJK)
二萬多字。

GBK, GB18030

日語漢字:

JIS, 六千多字。還有其他編碼。

編碼標準/區段

國際標準: ISO10646, 相當於Unicode。
碼區: 系統區(常用字區、次常用字區等), 外字區(用戶造字區)。
外字編碼: 新增字的處理。字量、碼位編排(賦予內碼值)、輸入碼(形音劃數部首等)、排序等。
編碼外字集: 香港字集(附於Big5或Unicode上)、Big5P。

編碼與字形、字體(ref:輸出:字體)

一種編碼多種字體: Big5: 細明體、標楷體 (外字未必有多種字體)

漢字輸入及輸入碼

鍵盤(ref:漢字特徵)
形: 倉頡、速成(簡易)、輕鬆、快速倉頡、大易、縱橫、快碼、九方、五筆字型等。
音: 注音、忘形、拼音、雙拼、廣東、粵拼(標準粵語拼音)、粵拼詞匯等。
混合形音: 觀音、自然等。
翻譯: 英翻中。
碼表: 內碼、電報碼等。

手寫
聯機: 即時認字。
離機: 寫完才認。

掃描
文字辨識。

語音
普通話、粵語、其他方言。

其他
翻譯。

原理/設計
字數: 有多少字要編輸入碼。如Big5的13053字可以只編常用字部份五千多字。
字根: 有限個數, 與鍵盤的對應表。如: 倉頡使用25個字根, 字根日的對應鍵位是A。
輸入碼碼長: 一個字最多使用多少個字根。如: 倉頡碼長為5。
編碼規則: 不同字根如何組合起來代表一個字。例: 倉頡用日月(AB)代表明字。
內碼/編碼字集
輸入碼表: 若干字及其輸入碼, 有關輸入法的信息或設定。
輸入程序。
選項設定: 鍵盤快速鍵、字數控制、內碼設定、增減輸入法等。

漢字輸出

字形
標準字形(繁體字或簡化字), 舊字形。

字體
印刷字體, 字形的表現形式。

宋體(又叫明體)
Windows本身的
細明體/新細明體不合標準字形。
標楷體是台灣標準字形。
SimSun是中國大陸標準字形。

其他字體: 黑體、圓體等。

字型
顯示字體的技術。

點陣、向量、TrueType、OpenType、ClearType等。

選項設定: 字體增減、外字設定、繁簡設定等。

ref: 編碼與字形、字體

中文環境(I18N, L10N, G11N)
I18N: internationalization
L10N: localization
G11N: globalization

漢字處理: 字碼存儲、內碼轉換、文本搜尋等。

軟件漢化: 應用程序介面中文化、支援漢字處理等。

其他軟件硬件: Linux, Mac, PDA, 手提電話、其他電子產品等。

應用軟件

輸入法套裝
文本編輯(排版)器
繁簡轉換軟件
外掛中文系統
電子字典
漢語學習軟件
其他

有關科目或疇範

應用語言學
語料庫語言學
漢語自動分詞
機器翻譯
自然語言理解及信息撮要
全文檢索及文字搜索引擎
古籍整理及電子化
電子書

Outline of Chinese Computing

(All rights reserved. To quote the full text, in Chinese or English, please contact the author.)
(Translated from Chinese. Where the two versions differ, refer to the Chinese.)
(* These terms were translated by the author, may not be standard, and need to be confirmed.)
(Please email the author about any mistakes.)

Writing/Script

Ideographic: the glyph is related to the meaning. eg. Chinese characters.
Phonetic: a few dozen letters (phonemes) spell out syllables. eg. Roman alphabet.
Others: mixtures of ideographic and phonetic writing. eg. Japanese writing.

Chinese Characters

Characters
traditional: characters that have a corresponding simplified form.
             used in Taiwan, Hong Kong, overseas Chinese communities.

simplified: characters that have a corresponding traditional form.
             used in mainland China, Singapore, overseas Chinese communities.

inherited(*): characters handed down by tradition, i.e. those whose traditional
             and simplified glyphs are identical because they were never simplified.

variant: characters with the same pronunciation
             and meaning but two or more different glyphs.

international(*): Chinese characters used in other countries/regions.
             eg. Japanese Kanji, Korean Hanja.

modern Chinese character lists
Mainland China: lists of frequently used, commonly used, simplified and variant characters, etc.(*)
Taiwan: lists of frequently used, less frequently used, rarely used and variant characters, etc.(*)
Japan: Toyo Kanji list(當用漢字表), etc.(*)

characteristics of characters
glyph:

stroke:

stroke count;
stroke shape:
horizontal, vertical, left-falling, dot, right-falling, turning(*);
stroke order;

radical, component;
structure.

pronunciation:

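modern (Putonghua, Cantonese and other dialect readings).
ancient.
literary and colloquial readings(文白二讀).
characters with multiple pronunciations(多音字).
popular, old and standardized readings(俗讀, 舊讀, 統讀)(*).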

meaning

meaning and contexts of use; connotation (commendatory or derogatory), etc.

usage

general use, specialized subjects, dialects, personal names, place names, etc.
passive vocabulary (characters recognized), active vocabulary (characters one can use);
frequently used, not frequently used; etc.

quantity and frequency
frequently used (3000 to 4000), less frequently used, rare, found only in ancient books, etc.
general dictionaries contain 7000 to over 10,000 characters.
the total number of Chinese characters is now estimated at over 100,000.

character order
no concept of a single character order in tradition.
traditional character dictionaries
        ordered by radical and stroke count.
        eg. Shuowen Jiezi(說文解字, Han), Zihui(字彙, Ming), Kangxi Dictionary(康熙字典, Qing).

rhyme dictionaries
        ordered by sound (rhyme 韻 first, then initial 聲).
        eg. Guangyun(廣韻, Song), Jiyun(集韻, Song).

modern dictionaries
        use both orderings; some offer several lookup methods.
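
The radical-and-stroke ordering described above can be sketched as a sort key. This is a minimal Python sketch, not how any particular dictionary is implemented: the table `RADICAL_STROKE` is a hand-made illustration (Kangxi radical number, strokes in addition to the radical), and the function names are assumptions made for this example.

```python
# Hypothetical lookup table: character -> (Kangxi radical number,
# number of strokes in addition to the radical). Illustrative only.
RADICAL_STROKE = {
    "一": (1, 0),
    "丁": (1, 1),
    "三": (1, 2),
    "日": (72, 0),
    "明": (72, 4),
    "月": (74, 0),
}


def radical_stroke_key(ch):
    """Sort key: radical first, then the residual stroke count."""
    return RADICAL_STROKE[ch]


def sort_by_radical_stroke(chars):
    """Return the characters in radical-and-stroke order."""
    return sorted(chars, key=radical_stroke_key)


print(sort_by_radical_stroke(["明", "月", "三", "日", "一", "丁"]))
# ['一', '丁', '三', '日', '明', '月']
```

A rhyme dictionary ordering would use the same pattern with a (rhyme, initial) pair as the key.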

Computer

binary: numbers consisting only of 0 and 1.
hexadecimal: a number system using the digits 0 to 9 and the letters ABCDEF.
both differ from decimal, which uses the 10 digits 0 to 9.
interface between human and machine: operating system. eg. Windows XP.
programs: system programs (eg. Norton AntiVirus), application programs (eg. Office XP).
data storage: files. eg. plain text format(txt), Word document(doc), etc.
run/execution: input > processing > output/saving.
ebooks: text, picture, sound/voice, video.

Encoding and Script Encoding

objects that can be encoded: text, pictures, sound, video, etc.
text is converted to binary for computer processing.
programmers usually write the binary as hexadecimal to make it easier to read.
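
The step from text to binary to hexadecimal can be illustrated in a few lines. This is a minimal Python sketch; the helper name `to_binary_and_hex` is made up for this example, and ASCII is assumed as the default encoding.

```python
def to_binary_and_hex(text, encoding="ascii"):
    """Encode text, then show each byte in binary and in hexadecimal."""
    data = text.encode(encoding)
    binary = " ".join(f"{byte:08b}" for byte in data)
    hexa = " ".join(f"{byte:02X}" for byte in data)
    return binary, hexa


binary, hexa = to_binary_and_hex("A")
print(binary)  # 01000001
print(hexa)    # 41
```

The eight binary digits and the two hexadecimal digits describe the same byte; the hexadecimal form is simply shorter to read.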

character set(ref: Chinese characters)
a character list: a collection of characters gathered according to certain principles.
eg. a dictionary.

English: 52 upper-case and lower-case Roman (Latin) letters,
Arabic numerals, punctuation marks, etc.

Chinese: traditional character sets, simplified character sets, etc.

coded character set
a character list with a number column and a character column.
each character maps to exactly one number,
and each number represents exactly one character.
the number is called the internal code.

the size of the character set is fixed.
code point: the coding slot for one number and its character.
block: a run of consecutive code points.
coded character set: made up of different blocks.

characters (text) are stored in files as internal codes.
internal codes are also used for information exchange.
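
The one-to-one rule above can be sketched as a pair of dictionaries. This is a minimal Python sketch under stated assumptions: the class `CodedCharacterSet` and the three-entry table are invented for illustration and do not correspond to any real standard beyond the ASCII values used.

```python
class CodedCharacterSet:
    """A toy coded character set: each number maps to exactly one
    character and each character to exactly one number."""

    def __init__(self, table):
        self.code_to_char = {}
        self.char_to_code = {}
        for code, ch in table:
            # Reject anything that would break the one-to-one mapping.
            if code in self.code_to_char or ch in self.char_to_code:
                raise ValueError(f"duplicate entry: {code:#x} {ch!r}")
            self.code_to_char[code] = ch
            self.char_to_code[ch] = code

    def encode(self, text):
        """Text -> list of internal codes."""
        return [self.char_to_code[ch] for ch in text]

    def decode(self, codes):
        """List of internal codes -> text."""
        return "".join(self.code_to_char[code] for code in codes)


toy = CodedCharacterSet([(0x41, "A"), (0x42, "B"), (0x43, "C")])
codes = toy.encode("CAB")
print([f"{code:02X}" for code in codes])  # ['43', '41', '42']
print(toy.decode(codes))                  # CAB
```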

English(Roman alphabet): ASCII.
eg. hexadecimal 41 represents "A".
128 code points in total.

Traditional Chinese: Big5

eg. hexadecimal A440 represents "一"
used mainly in Taiwan, Hong Kong.
13053 characters, plus punctuation marks and other symbols.
over 20,000 code points in total.

other: CNS11643, CCCII.

Simplified Chinese:

GB (GB2312)
eg. hexadecimal D2BB represents "一".
used mainly in mainland China.
6763 characters, plus punctuation marks and other symbols.

GBK, over 20,000 characters.

GB18030, over 27,000 characters.
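
The sample internal codes quoted above (41 for "A", A440 and D2BB for "一") can be checked with the codecs that ship with Python. This is only a sketch: the helper name `internal_code` is made up here, and the codec names `ascii`, `big5`, `gb2312`, `gbk` and `gb18030` are those of the Python standard library.

```python
def internal_code(ch, encoding):
    """Return the internal code of one character as hexadecimal digits."""
    return ch.encode(encoding).hex().upper()


samples = [
    ("A", "ascii"),
    ("一", "big5"),
    ("一", "gb2312"),
    ("一", "gbk"),
    ("一", "gb18030"),
]

for ch, encoding in samples:
    print(f"{ch} in {encoding}: {internal_code(ch, encoding)}")
```

The same character gets a different internal code in each coded character set, while GBK and GB18030 keep the GB2312 value for characters that GB2312 already contains.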

TSVJK (traditional, simplified, variant, Japanese Kanji, Korean Hanja)(*):

Unicode 2.0
multilingual encoding.
the Chinese character block is named CJK.
over 20,000 characters.

GBK
GB18030
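
As a rough illustration of the CJK block, the sketch below looks up the Unicode code points of a traditional and a simplified character. It assumes the block "CJK Unified Ideographs" at U+4E00 to U+9FFF as defined in current Unicode versions; the helper name `in_cjk_unified_block` is invented for this example.

```python
import unicodedata

CJK_START, CJK_END = 0x4E00, 0x9FFF  # CJK Unified Ideographs block


def in_cjk_unified_block(ch):
    """True if the character lies in the CJK Unified Ideographs block."""
    return CJK_START <= ord(ch) <= CJK_END


for ch in ["一", "漢", "汉", "A"]:
    print(f"{ch} U+{ord(ch):04X} in CJK block: {in_cjk_unified_block(ch)}")

# Characters in the block are named after their code point.
print(unicodedata.name("一"))  # CJK UNIFIED IDEOGRAPH-4E00
```

Traditional 漢 and simplified 汉 sit in the same block, which is what lets one encoding cover the traditional, simplified, Japanese and Korean forms together.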

Japanese

JIS, over 6000 characters. other encodings also exist.

Encoding standard/Block
international standard: ISO10646, equivalent to Unicode.

block: system area (frequently used, less frequently used, etc), UDA (user defined area).

UDC encoding: handling of newly added characters (UDC, user defined characters).
       quantity, code point assignment (giving each an internal code value),
       input code (glyph, pronunciation, stroke count, radical, etc),
       sorting, etc.

coded UDC set: HKSCS (Hong Kong character set, attached to Big5 or Unicode), Big5P.
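
In Unicode, the user defined area corresponds to the Private Use Areas, whose code points carry the general category "Co". A minimal Python sketch of telling a user defined character from a system one follows; the helper name `is_private_use` is an assumption made for this example.

```python
import unicodedata


def is_private_use(ch):
    """True if the code point belongs to a Unicode Private Use Area."""
    return unicodedata.category(ch) == "Co"


print(is_private_use("\uE000"))  # True: first code point of the BMP private use area
print(is_private_use("一"))      # False: a standard CJK ideograph
```

Text containing private use code points is only readable where the same user defined character set and fonts are installed, which is why such characters need separate handling.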

Encoding and glyph, font (ref: Output: font)
one encoding, many fonts: eg. Big5 text can be shown in Ming(細明體) or Kai(標楷體).
       (UDC may not be available in several fonts)

Character input and input code

keyboard (ref: characteristics of characters)
glyph: Cangjie(倉頡), Quick/Simplified(速成/簡易), Qingsong(輕鬆), Quick Cangjie(快速倉頡), Dayi(大易),
       Zongheng(縱橫), QCode(快碼), Q9(九方), Wubizixing(五筆字型), etc.(*)

pronunciation: Zhuyin(注音), Wangxing(忘形), Pinyin(拼音), Shuangpin(雙拼), Cantonese(廣東),
       Yuepin(粵拼, standard Cantonese pinyin), Yuepin Cihui(粵拼詞匯), etc.(*)

mix of glyph and pronunciation: Guanyin(觀音), Ziran(自然), etc.(*)

translation: English to Chinese.

code table: internal code, telegraph code, etc.

handwriting
online: characters are recognized as they are written.
offline: characters are recognized after writing is finished.

scan
character recognition.

speech
Putonghua(普通話), Cantonese(粵語), other dialects.

others
translation.

principle/design
character count: how many characters need an input code.
        eg. Big5 has 13053 characters,
        but an input method may cover only the 5000-odd frequently used ones.

radical(of input code): a limited number of radicals, each mapped to a key.
        eg. Cangjie uses 25 radicals, and 日 maps to the A key.

encoding rules:
        how different radicals combine to represent one character.
        eg. in Cangjie, 日月(AB) represents 明.

input code length: the maximum number of radicals used for one character.
        eg. 5 for Cangjie.

internal code / coded character set

input code table: a number of characters and their input codes,
        plus information or settings for the input method.

input method program

options and settings: keyboard shortcuts, character count control,
        internal code setting, adding or removing input methods, etc.
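
The design parts listed above (radical-to-key mapping, input code table, code length limit, internal code) can be put together in a toy lookup. This is a minimal Python sketch, not a real input method: the three-entry table only holds the Cangjie examples given above (日 = A, 月 = B, 明 = AB), the names `lookup` and `internal_code` are invented here, and Big5 is assumed as the internal code.

```python
MAX_CODE_LENGTH = 5  # Cangjie uses at most 5 radicals per character

# Radical-to-key mapping (only the two radicals used in this example).
KEY_TO_RADICAL = {"A": "日", "B": "月"}

# Toy input code table: input code -> candidate characters.
INPUT_CODE_TABLE = {
    "A": ["日"],
    "B": ["月"],
    "AB": ["明"],
}


def lookup(keys):
    """Return the candidate characters for a typed input code."""
    keys = keys.upper()
    if not 1 <= len(keys) <= MAX_CODE_LENGTH:
        raise ValueError("input code length out of range")
    return INPUT_CODE_TABLE.get(keys, [])


def internal_code(ch, encoding="big5"):
    """Internal code that the input program hands to the application."""
    return ch.encode(encoding).hex().upper()


for candidate in lookup("ab"):
    radicals = "".join(KEY_TO_RADICAL[key] for key in "AB")
    print(radicals, "->", candidate, internal_code(candidate))
```

A real input code table may list several candidates for one input code, which is why `lookup` returns a list rather than a single character.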

Output

glyph: standard glyph (traditional or simplified), old glyph.

font: printing typeface, the form in which glyphs are presented.
      Song(宋體, also called Ming 明體):
      Ming(細明體/新細明體) shipped with Windows does not follow the standard glyphs.
      Kai(標楷體) follows the Taiwan standard glyphs.
      SimSun follows the mainland China standard glyphs.
      others: Hei(黑體), Yuan(圓體), etc.

font technology: techniques for displaying fonts.
      bitmap, vector, TrueType, OpenType, ClearType, etc.

options and settings: adding or removing fonts, UDC settings,
      traditional-simplified settings, etc.

Ref: Encoding and glyph, font

Chinese environment (I18N, L10N, G11N)

I18N: internationalization
L10N: localization
G11N: globalization

character processing:
storing internal codes, converting between internal codes, text search, etc.

software Chinese localization(*):
giving application software a Chinese user interface,
adding support for Chinese character processing, etc.

other software and hardware:
Linux, Mac, PDA, mobile phones, other electronic products.
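
Converting between internal codes, one of the character processing tasks listed above, usually goes through Unicode as a pivot. A minimal Python sketch, assuming the standard library codecs `big5`, `gb2312` and `utf-8`; the helper name `transcode` is made up for this example.

```python
def transcode(data, source, target):
    """Convert bytes from one internal code to another via Unicode."""
    return data.decode(source).encode(target)


big5_bytes = bytes.fromhex("A440")  # "一" in Big5
print(transcode(big5_bytes, "big5", "utf-8").hex().upper())   # E4B880
print(transcode(big5_bytes, "big5", "gb2312").hex().upper())  # D2BB

# Text search is simplest on decoded text rather than on raw bytes.
text = "中文電腦".encode("big5").decode("big5")
print(text.find("電腦"))  # 2
```

This only changes the internal code. It does not convert traditional characters to simplified ones, so a traditional character with no counterpart in the target set raises an encoding error.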

Application Software

input method packages
text editors / typesetting software
simplified-traditional conversion software
add-on Chinese systems(*)
electronic dictionaries
Chinese learning software
others

Related Topics or Fields

applied linguistics
corpus linguistics
automatic Chinese word segmentation
machine translation
natural language understanding and information summarization
full-text retrieval and text search engines
collation and digitization of ancient books
ebooks

2002-10-22 first draft
2002-11-04 second draft, under revision
2002-11-22 final version