GB2125197A - Encoding Chinese characters

Info

Publication number: GB2125197A
Application number: GB08221271A
Authority: GB
Inventors: Pingyi Zhi
Original assignee: Wang Shu Mei; Wong Kam Fu
Current assignee: Wang Shu Mei; Wong Kam Fu
Priority date: 1982-07-22
Filing date: 1982-07-22
Publication date: 1984-02-29
Also published as: GB2125197B

Abstract

A method of encoding a Chinese character comprises disassembling the character into four constituent radicals, each of which represents a pronounceable sound. As many radicals as possible, subject to the formation of a total of four radicals, are combined into the first radical and, again subject to the formation of four radicals, a succeeding radical is incorporated into a preceding radical. In the event that there are fewer than four resulting constituent parts, the final stroke of the character as normally written is used as a further constituent part after them, and in the event that there are still not enough constituent parts to complete the four required, the whole character is repeated as a final constituent part. The four constituent parts are then represented by the initial Roman alphabet letters of the transliterations of each constituent part, if the part is a character, or of the link character which the part closely resembles, so that the code for the Chinese character is the resulting four-letter Roman code.

Description

SPECIFICATION

Improvements in the encoding of Chinese characters

This invention relates to the encoding of Chinese and other ideographic characters into a form in which they can readily be entered into a computer, typewriter or the like via a simple keyboard.

Chinese characters are not made up in the same way as English words from a relatively small number of Roman letters, and since there is a very large number of characters, admittedly not all in widespread use, it is virtually impossible to design a simple keyboard where a single key is provided for each character. It is therefore desirable to work out some method of encoding each individual character into a unique code which can be entered through a keyboard of reasonable size, and preferably a keyboard which is of conventional size and construction for use with Roman letters and numbers.

For ease of use, any such encoding method must have relatively simple and straightforward rules which are easy to learn and understand and which are preferably based on common knowledge and everyday usage by Chinese people. In addition, the rules should be relatively simple to remember and should provide an encoding method by means of which characters can be entered in a relatively easy and straightforward fashion through a simple keyboard such as a western language keyboard.

The invention has therefore been made with these points in mind and aims to provide a method of encoding Chinese characters which meets these requirements.

According to the invention, there is provided a method of encoding Chinese characters into a form in which they can be entered into a computer or the like, in which the character is disassembled into four constituent radicals each of which represents a pronounceable sound. As many radicals as possible, subject to the formation of a total of four radicals, are combined into the first radical and, again subject to the formation of four radicals, a succeeding radical is incorporated into a preceding radical. In the event that there are fewer than four resulting constituent parts, the final stroke of the character as normally written is used as a further constituent part after them, and in the event that there are still not enough constituent parts to complete the four required, the whole character is repeated as a final constituent part. Thereafter the four constituent parts are represented by the initial Roman alphabet letters of the transliterations of each constituent part, if the part is a character, or of the link character which the part closely resembles, the code for the Chinese character therefore being the resulting four-letter Roman code.

Such a system is relatively straightforward to learn, understand and operate. In particular, the final code will be in the form of four Roman letters, and so it is possible to use a western style keyboard for the entry of the code corresponding to the Chinese character into a computer, typewriter or the like. Provided the computer, typewriter or the like has been suitably programmed to recognise the code of each particular Chinese character, the appropriate character will be input or printed.
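The fallback rules for reaching four constituent parts can be sketched in Python as follows. This is a minimal illustration only: the function name `pad_to_four` and the placeholder strings are hypothetical, and the loop assumes that the whole character is repeated as often as needed to reach four parts, which the text does not state explicitly.

```python
from typing import List


def pad_to_four(parts: List[str], final_stroke: str, whole: str) -> List[str]:
    """Return exactly four constituent parts for one character.

    parts        -- constituent parts from the disassembly, in writing order
    final_stroke -- the last stroke of the character as normally written
    whole        -- the whole character itself
    """
    if not 1 <= len(parts) <= 4:
        raise ValueError("disassembly must yield between one and four parts")
    result = list(parts)
    if len(result) < 4:
        # First fallback: the final stroke is used as a further part.
        result.append(final_stroke)
    # Second fallback: the whole character is repeated as a final part.
    # Assumption: it is repeated until four parts are reached.
    while len(result) < 4:
        result.append(whole)
    return result


# Placeholder names stand in for real parts, strokes and characters.
print(pad_to_four(["part1", "part2"], "last_stroke", "whole_char"))
# ['part1', 'part2', 'last_stroke', 'whole_char']
```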

In addition, the chance of ambiguity as between two different Chinese characters having the same code is largely eliminated. For each of the four Roman letters there are of course 26 possibilities, and practice has shown that the chance of two different Chinese characters ending up with the same alphabet code is extremely low indeed. A very large number of codes is also possible, since each of the four Roman letters can be any of 26 different possibilities, and so, in theory, it is possible to encode of the order of 450,000 different characters, which is well in excess of the number of Chinese characters in use.
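The figure of about 450,000 follows directly from four positions of 26 letters each. A short check in Python (the variable names are illustrative only):

```python
ALPHABET_SIZE = 26  # Roman letters available for each position
CODE_LENGTH = 4     # letters per code

capacity = ALPHABET_SIZE ** CODE_LENGTH
print(capacity)  # 456976
```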

Amongst the advantages of the invention are that the rules for encoding characters are simple, clear and easy to learn, and that the breaking up of characters into radicals is based upon the common knowledge of people who understand Chinese writing. Regular reference to code books is therefore not required. Further, codes can be entered whilst touch typing, and the keyboard can be programmed to give immediate warning of a mis-typing when a code is entered which does not correspond to a code recognised by the keyboard.
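The mis-typing warning amounts to a table lookup that reports a miss. A minimal sketch in Python, assuming the recognised codes are held in a dictionary; the function `enter_code`, the table `TABLE` and its single entry are hypothetical placeholders and are not taken from the specification:

```python
from typing import Dict, Optional, Tuple

# Hypothetical table; the code and the character are placeholders.
TABLE: Dict[str, str] = {"ABCD": "<character 1>"}


def enter_code(code: str, table: Dict[str, str]) -> Tuple[Optional[str], Optional[str]]:
    """Return (character, warning); exactly one of the two is None."""
    key = code.upper()
    if key in table:
        return table[key], None
    # The code is not recognised, so warn the typist immediately.
    return None, f"unrecognised code: {code!r}"


print(enter_code("abcd", TABLE))  # ('<character 1>', None)
print(enter_code("zzzz", TABLE))  # (None, "unrecognised code: 'zzzz'")
```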

For convenience, the transliteration used for the constituent parts is Pinyin spelling which is the modern form and so the first letter of the Pinyin spelling of each part is preferably used to obtain the code fed into the keyboard.
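Taking the first letter of each Pinyin spelling can be sketched as below. The function name `four_letter_code` is hypothetical, the use of upper case is an assumption, and the four syllables in the usage line are arbitrary Pinyin syllables chosen for illustration; they are not a decomposition taken from the specification or its figures.

```python
from typing import Sequence


def four_letter_code(transliterations: Sequence[str]) -> str:
    """Build a four-letter code from the Pinyin spellings of four parts."""
    if len(transliterations) != 4:
        raise ValueError("exactly four constituent parts are required")
    letters = []
    for spelling in transliterations:
        spelling = spelling.strip()
        if not spelling or not spelling[0].isalpha():
            raise ValueError(f"not a transliteration: {spelling!r}")
        # Only the first letter of each spelling contributes to the code.
        letters.append(spelling[0].upper())
    return "".join(letters)


print(four_letter_code(["mu", "kou", "ri", "yue"]))  # MKRY
```

Because only the first letter is used, a typist who is unsure of the rest of a spelling can still produce the correct code.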

In the event that any of the constituent parts is not itself a pronounceable character, the popular custom of using a link character which resembles the part is followed. Thus, for example, the link character for

If this cannot be done, the "external configuration" of the character is examined to find a real Chinese character which resembles the part. For example, the three parts

and

have a single link character which is

This is the general principle from which all further examples are derived.

Thus, the following rules can be adopted:

1. The relative character for an element as a stroke is the name of the stroke. For example, the relative character of

2. The selection is based on customary designations, such as the radical

The others are on the analogy of these.

3. The elements in similar forms are grouped into one class and the relative character of the one most in use is used as that of the whole class. For example,

4. For a few elements for which we are actually unable to select relative characters, we use "0" as their symbol, such as the element

However, "0" can only be used where the element does not allow further disassembly; it cannot be used for an element which can be further disassembled.
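Rule 4 and its restriction can be sketched as a small function. This is an illustration under stated assumptions: `element_symbol` and its parameters are hypothetical names, the Pinyin of the relative character is assumed to be supplied by the caller, and the spelling "shu" in the usage line is an arbitrary example.

```python
from typing import Optional


def element_symbol(link_pinyin: Optional[str], can_disassemble: bool) -> str:
    """Return the one-symbol code for an element.

    link_pinyin     -- Pinyin of the element's relative character,
                       or None if no relative character can be selected
    can_disassemble -- True if the element can still be split further
    """
    if link_pinyin:
        return link_pinyin[0].upper()
    if can_disassemble:
        # "0" is not allowed here; the element must be split instead.
        raise ValueError("element can be disassembled further; '0' is not permitted")
    return "0"


print(element_symbol("shu", False))  # S
print(element_symbol(None, False))   # 0
```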

To give some examples, the character

and so the four parts chosen for the code, since these two characters cannot be further split would be:

Therefore the character

can be coded as follows:

this could result in encoding the four parts required.

Some further examples are shown in Figure 1 of the accompanying drawings.

The general principles which emerge from Table 1 are:

1. Parts of four strokes or less generally speaking are not dismantled further. For example, see the character

then it must be taken apart.

2. Complicated characters must still be reduced to four parts. In order to limit the ambiguities of dismantling, as much as possible is included in the first part of the character and the rest is dismantled.

For example, in the character

is placed entirely into one part whereas the second one is dismantled into several.

Hence, whenever a part can, by the addition of the succeeding stroke, be transformed into yet another character, that transformation must be adopted, until such time as either four strokes or no new additional parts can be formed by adding the incremental stroke.

The method of the invention is easy to learn and remember, and the steps of dismantling characters derive their ideas from popular modes of oral communication which are well known to the masses. This also lessens the burden of study. As for the codes themselves, though they are taken from the reading pronunciation, they need only be accurate in their first letter.

By limiting the codes to four letters it is easier to transform the codes into machine codes on entry into the computer and for the codes to be processed internally. This regularity and shortness contributes to raising the rapidity of data entry.

The selection of link characters does not depend on the radical-phonetic distinction, but rather proceeds directly from the form of the characters themselves. This raises the speed of distinguishing the sounds and it expands the breadth of a single constituent part's characteristics.

In order to increase the speed of typing a text, reducing the number of key strokes is necessary.

Therefore a "High Speed Code" can be used for the most commonly used Chinese characters and character-compounds. The basic concept of this special code is one Chinese character one key stroke.

This is, however, limited to 26 Roman letters. By way of example, the frequently used characters shown in Figure 2 of the accompanying drawings can be represented by the single letters shown in Figure 2. Of course for these characters the keyboard can still be programmed to recognise the four letter code as well.

As can be seen from Figure 2, the relative frequencies of these characters are given and they are matched roughly to the relative frequencies of the Roman letters on a keyboard, which is itself arranged so that the most frequently used keys are the easiest and quickest to operate.

To further simplify and increase the coding, Chinese numbers can simply be given the number of the corresponding Arabic numbers as shown in Figure 3 to create 10 more codes.
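A keyboard program that accepts one-letter codes, numeral codes and four-letter codes side by side could resolve them roughly as follows. This is a sketch under assumptions: the tables hold hypothetical placeholder entries (the real assignments are those of Figures 2 and 3, which are not reproduced here), and each code is assumed to arrive already separated from its neighbours, since the text does not say how codes are delimited.

```python
from typing import Dict, Optional

# Hypothetical placeholder tables, not the assignments of Figures 2 and 3.
SINGLE_LETTER: Dict[str, str] = {"A": "<frequent character for A>"}
NUMERALS: Dict[str, str] = {"1": "<Chinese numeral one>"}
FOUR_LETTER: Dict[str, str] = {
    "ABCD": "<some character>",
    # A frequent character keeps its four-letter code as well.
    "WXYZ": "<frequent character for A>",
}


def resolve(code: str) -> Optional[str]:
    """Map one delimited code to its character, or None if unrecognised."""
    key = code.upper()
    if len(key) == 1 and key.isdigit():
        return NUMERALS.get(key)
    if len(key) == 1:
        return SINGLE_LETTER.get(key)
    if len(key) == 4:
        return FOUR_LETTER.get(key)
    return None


print(resolve("a") == resolve("wxyz"))  # True
```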

The method of the invention can also be expanded to include character expression codes. Such a character-compound can be thought of as a Chinese character in the broad sense, with the characters composing the compound treated as the constituent parts of these characters broadly conceived. In this way, all the above principles of code construction can be applied to longer expressions. Examples are shown in Figure 4.

There is a very large number of Chinese character-compounds in the modern Chinese language and, of course, codes can be created for all of them in this way. Normally, however, only the most commonly used character-compounds will be treated this way, e.g. about 140, but it is possible to set up, say, an additional 200 such character-compounds for use in specialist applications.

Other high speed codes for character-compounds can be created by combining Arabic numeral codes with the Roman alphabet code as shown in Figure 5 for times and dates. These correspond with the Chinese STC, the Standard Chinese Telegraph Code.

With these "High Speed Codes" we can achieve a fourfold increase of the speed of typing compared with the Normal Code. If we can flexibly combine Chinese character codes with high speed Chinese character and character-compound codes, we can obtain in practice a typing speed of 100 to 120 characters per minute (that is 2.5 key strokes on the average per one Chinese character).
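The figures quoted above can be related by simple arithmetic. The sketch below uses a deliberately simplified model that is an assumption, not part of the specification: every character is typed either with a one-letter high speed code or with a normal four-letter code, and compound codes and any delimiter key strokes are ignored. On this reading, the fourfold figure is the ratio of four key strokes to one.

```python
NORMAL_KEYS = 4      # key strokes for a normal four-letter code
HIGH_SPEED_KEYS = 1  # key strokes for a one-letter high speed code


def average_keystrokes(high_speed_fraction: float) -> float:
    """Average key strokes per character for a given share of high speed codes."""
    return (HIGH_SPEED_KEYS * high_speed_fraction
            + NORMAL_KEYS * (1 - high_speed_fraction))


def keystrokes_per_minute(chars_per_minute: float, keys_per_char: float) -> float:
    """Key strokes per minute needed to sustain a given character rate."""
    return chars_per_minute * keys_per_char


print(NORMAL_KEYS / HIGH_SPEED_KEYS)    # 4.0
print(average_keystrokes(0.5))          # 2.5
print(keystrokes_per_minute(100, 2.5))  # 250.0
print(keystrokes_per_minute(120, 2.5))  # 300.0
```

Under this model, an average of 2.5 key strokes per character corresponds to half of the characters being entered with one-letter codes, and 100 to 120 characters per minute corresponds to 250 to 300 key strokes per minute.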

When it becomes possible for Chinese characters to be entered directly as machine codes into a computer, then the possibility arises of Chinese becoming a working language of computers in much the same way as English is presently used as a working language. These codes are manipulated and processed within the machine, but at output time are restored to their original form and printed, e.g. with an ink jet printer.

For this to be done in Chinese, it will require a character generator for Chinese which has stored an image of the Chinese character, as well as a display device which can display characters (soft copy) or which is capable of printing characters (hard copy). To combine this type of equipment with the keyboard that can handle on-sight encoding is to create an intelligent terminal for Chinese characters. As the typist enters his code he will see displayed the Chinese character on the screen. If necessary, he can have contents transmitted to a main computer connected to the terminal for more complicated processing from which a hard copy may be produced. Or he can use the terminal as a word processor and typewriter.

Examples of the uses of the invention include:

1. Automated typesetting and editing. Today, advanced countries have all introduced computerized phototypesetting in order to eliminate the hand labour of the traditional typesetting art and to reduce the workload of the workers. For Chinese, the difficult point has always been data entry. Our method is a solution.

An intelligent Chinese terminal can also greatly reduce the workload of editing of newspapers and magazines. Editorial workers can sit at the terminal, make changes through the keyboard, do proofreading, and even set pages of type using the main computer. They can even exchange texts with other places through computer networks. A printer can be asked at any point to type a clear copy.

2. Mechanized translation, preparation of news indexes. Mechanical translation using the computer, whether from Chinese into foreign languages or vice versa will depend on input and output of Chinese characters. English can be translated mechanically into Chinese. At present Chinese into English requires the use of Pinyin spelling and symbols for the tones. Human beings have to be called upon to write out the characters one by one. If the invention is used to give an internal code then it need only be linked to the addresses of the shapes of Chinese characters in the machine.

Systems of indexing first compress their data into compact form, whether it be manuscripts, archives or other material. After establishing coding categories, the character-compound codes can be used through the computer terminal directly to read material or to ask direct questions using Chinese with the answers to be displayed on a screen.

3. A Chinese language computer, a truly Chinese computer, would be able to have Chinese as its working language with the capability of using Chinese in its systems programs. For example, you could use Chinese to write a mathematical programming language.

4. Management of enterprises and projects. With a nation-wide network of computers, production statistics and other important economic indexes can be reported back to the leading offices. This too would require a Chinese language terminal.

5. Applications in the social sciences. The social sciences everywhere now use computers, with particular success in linguistic research. For example, someone has done very thorough and penetrating research on Shakespeare's works, their style and contents, using the computer, and has produced some proof in the solution of the mystery of who Shakespeare was. Thus the computers of today's advanced science are not completely out of touch with literary masterpieces of hundreds of years ago. China's enormous literary tradition, the corpus of commentaries on ancient books, investigations of historical reality, dictionaries and editions, the verification of authors and so on, all these are areas in which the computer can be of help.

Claims

(Filed on 22 July 83)

1. A method of encoding Chinese characters into a form in which they can be entered into a computer or the like in which the character is disassembled into four constituent radicals each of which represents a pronounceable sound, as many radicals as possible, subject to the formation of a total of four radicals, are combined into the first radical and, again subject to the formation of four radicals, a succeeding radical is incorporated into a preceding radical, and in the event that there are fewer than four resulting constituent parts, after the constituent parts, the final stroke of the character as normally written is used as one of the constituent parts and in the event that there are still not enough total constituent parts to complete the four required, the whole character is repeated as a final constituent part, and thereafter the four constituent parts are represented by the initial four Roman alphabet letters of the transliterations of each constituent part if the part is a character or its link character which the part closely resembles, the code for the Chinese character having therefore the resulting four Roman letter code.

2. A method as claimed in Claim 1 in which the transliteration used for the constituent parts is Pinyin spelling.

3. A method as claimed in Claim 1 or Claim 2 in which the "external configuration" of the character is examined to find a real Chinese character which resembles the part in the event that a link character resembling the part cannot be found.

4. A method as claimed in any preceding claim in which, in order to increase the speed of typing a text, a special code of one letter for one Chinese character is used for frequently used characters.

5. A method as claimed in Claim 4 in which the frequently used characters shown in Figure 2 of the accompanying drawings are represented by the single letters shown in Figure 2.

6. A method as claimed in any preceding claim in which to simplify and increase the coding, Chinese numbers are given the number of the corresponding Arabic numbers as shown in Figure 3 of the accompanying drawings.

7. A method as claimed in any preceding claim which is applied to character expression codes which can be thought of as a Chinese character, with the characters composing the compound treated as the constituent parts of these characters.

8. A computer which has been programmed to accept the entry of a Chinese character coded by a method as claimed in any preceding claim, in which the codes are manipulated and processed within the machine, but at output time are restored to their original form and printed.

9. A typewriter which has been programmed to accept the entry of a Chinese character coded by a method as claimed in any preceding claim in which the character is input on the keyboard in the coded form and output as the character.