CN100390711C

CN100390711C - Computer processing and keyboard inputting method for Chinese word

Info

Publication number: CN100390711C
Application number: CNB2005101354752A
Authority: CN
Inventors: 贾惠波; 焦慧; 刘迁; 熊剑平; 马骋
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2005-12-31
Filing date: 2005-12-31
Publication date: 2008-05-28
Anticipated expiration: 2025-12-31
Also published as: CN1790238A

Abstract

The invention belongs to the technical field of Chinese information processing. It provides a method for computer processing and keyboard input of Chinese words, comprising the following steps in order: the vocabulary in the "Contemporary Chinese Common Word List for Information Processing" is categorized by part of speech to form dictionary words; each dictionary word is encoded in four bytes to form a dictionary code; the dictionary codes are mapped to the internal machine codes that represent the Chinese characters, forming a dictionary code table; pinyin codes are entered from the keyboard in phonetic mode, with the word as the basic unit; the dictionary codes are obtained from a lookup table of pinyin codes and dictionary codes; and the dictionary codes are mapped to internal machine codes through the dictionary code table to complete the input and to produce a file in dictionary code format whose basic unit is the word. The invention avoids the difficulty that word segmentation causes for Chinese information processing and classifies parts of speech automatically through the first byte of the dictionary code. File representation with the word as the basic unit is achieved without changing any internal structure or configuration of existing computers.

Description

A method for computer processing and keyboard input of Chinese words

Technical field

The invention belongs to the field of Chinese information processing. It relates in particular to how Chinese is represented and stored in a computer.

Background technology

With the rapid development of computer technology and artificial intelligence theory, people began to study natural language with formal methods, which gave rise to a branch of artificial intelligence: natural language processing (NLP). Nearly all natural language processing systems treat the word as the main information carrier and basic unit of operation. In Chinese linguistics, however, there is still no settled definition of "word"; here, a word means the smallest meaningful unit of Chinese that can be used independently. Western languages such as English separate words with spaces in writing, whereas written Chinese is a continuous string of characters. Chinese text files in a computer are likewise encoded character by character, and the internal code is a continuous sequence of code words with no explicit marker between words. The first task in understanding Chinese is therefore to divide the continuous character string into a sequence of words, which is known as word segmentation.

To make Chinese easier to use on computers, researchers began to study automatic word segmentation, that is, word segmentation performed by computer. As research on Chinese information processing has deepened, automatic word segmentation has become increasingly important. It underlies all Chinese information processing: applications such as Chinese analysis and understanding, machine translation between Chinese and foreign languages, automatic indexing and full-text retrieval of Chinese documents, Chinese character recognition, Chinese speech recognition and synthesis, automatic conversion between simplified and traditional characters, and automatic proofreading of Chinese manuscripts all require the text to be segmented into words first.

Automatic word segmentation has been studied at home and abroad for about ten years. Considerable effort has been invested and many results obtained (more than 20 segmentation systems have been built in mainland China, Taiwan, Hong Kong and Singapore combined), yet no truly mature practical system has emerged so far, and word segmentation has become one of the bottlenecks that seriously restrict the development of Chinese information processing.

Summary of the invention

The need for word segmentation arises from representing Chinese with a character-based internal code. It is peculiar to Chinese and has become a bottleneck in Chinese information processing. The object of the invention is to solve, at its root, the automatic word segmentation problem that has long held back Chinese information processing. The invention proposes a method of encoding Chinese in a computer that yields a new, word-based Chinese document format: the document is built on words rather than on characters, as commonly used Chinese documents are today. Because Chinese linguistics has no settled definition of "word", the term here means the smallest meaningful unit of a Chinese sentence that can be used independently, covering what are usually called words, phrases and idioms. The invention encodes every word in the "Contemporary Chinese Common Word List for Information Processing" (hereinafter the "Common Word List"; the list fully conforms to GB13715, the "Contemporary Chinese Word Segmentation Standard for Information Processing", and is the vocabulary in general use at present). A text stored in this word-based encoding makes the word the smallest information carrier in computer processing of Chinese, so no Chinese word segmentation is needed. Computer processing of Chinese then starts from the same level as that of Western languages, and research results on Western language processing can be applied to Chinese.

The invention proposes a word-based method of encoding Chinese in a computer. It comprises a new Chinese document encoding format, a dictionary code table (a database) that maps the new code of each word in the Common Word List to the corresponding internal code, and a Chinese keyboard input method that produces documents in the new format. The Chinese document encoding format comprises the ASCII control codes of the international standard, the standard ASCII character set for Western letters and symbols, and four-byte codes for Chinese entries and Chinese punctuation marks. The dictionary code table is organized by part of speech, and within each class the words are arranged in pinyin alphabetical order; each word is assigned a four-byte code and is matched with the internal code of the Chinese characters that make up the word, forming the dictionary code table database. The keyboard input method for the new document format builds on existing pinyin input methods: the key codes produced by pinyin input are decoded and looked up in a pinyin code to dictionary code lookup table to find the corresponding dictionary code, and the dictionary code table, that is, the correspondence between dictionary codes and internal codes, is then used to find the internal code for display and input.

The invention is characterized in that the method is a method for computer processing and keyboard input of Chinese words that operates in pinyin mode, takes the word as the basic input unit, and maps in turn from pinyin code to dictionary code and from dictionary code to internal code. Here "word" means the smallest meaningful unit of a Chinese sentence that a user can use independently, including words, phrases and idioms. The method comprises the following steps in order:

Step 1: The Chinese vocabulary in the Common Word List is divided by most frequent part of speech into nouns, verbs (including verb phrases), adjectives, adverbs, pronouns, numerals, measure words, onomatopoeia, interjections, prepositions, conjunctions, particles and modal particles, plus idioms, to form dictionary words. Each dictionary word consists of 1 to 7 Chinese characters;

Step 2: Each dictionary word from step 1 is encoded as follows to form a dictionary code. Each dictionary code consists of 4 bytes, written in hexadecimal as:

[AxH xxH xxH xxH] (H denotes hexadecimal; all dictionary codes below are written in hexadecimal). The high four bits of the first byte must be AH, 1010 in binary. The low four bits x of the first byte range from 1H to FH and give the part of speech of the word: 1 to 9 denote, in order, nouns, adjectives, adverbs, pronouns, numerals, measure words, idioms, prepositions and conjunctions, and minor words (onomatopoeia, interjections, particles and modal particles); A denotes punctuation marks; B to F denote the verbs and verb phrases in the Common Word List. Each class is arranged in order within its table. The high four bits of the second byte are reserved, and the low four bits of the second byte give the number of Chinese characters in the word. The third and fourth bytes form a sequence code ranging from 1H to FFFFH (65535), used to order and number the vocabulary by pinyin alphabetical order. Encoded in this way, the scheme can hold at least 14 × 65536 entries;

Step 3: A pinyin code to dictionary code lookup table is built from the results of steps 1 and 2 and entered into the computer;

Step 4: A lookup table is established between the dictionary code of each word in the Common Word List and the internal code of that word in the computer; it is called the dictionary code table. The dictionary code table comprises 12 tables distinguished by part of speech: the non-dictionary word table, common noun table, adjective table, verb table, adverb table, pronoun table, numeral table, measure word table, preposition/conjunction table, onomatopoeia/interjection/particle/modal particle table, idiom table and punctuation mark table. Each table records the dictionary code of each word and the internal code corresponding to it. Non-dictionary words are proper nouns such as personal names, place names and trade names. Non-dictionary words and punctuation marks are likewise encoded with the four-byte method;

Step 5: The dictionary code table obtained in step 4 is entered into the computer; during Chinese input, a file in dictionary code format and a file in internal code format are each produced;

Step 6: Chinese words are entered into the computer in pinyin mode, with the word as the basic unit.

When the four-byte method encodes a punctuation mark, the first byte is always AAH and the second byte is always 00H; the third and fourth bytes are the internal code of the punctuation mark.

When the four-byte method encodes a non-dictionary word, the first byte is always A0H; the high four bits of the second byte are reserved, and the low four bits of the second byte give the number of Chinese characters in the non-dictionary word; the third and fourth bytes give the sequence number of the non-dictionary word in the non-dictionary word table of the dictionary code table.
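The four-byte layout described in the steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name make_dictionary_code and its parameter names are hypothetical, and big-endian order for the two sequence bytes is assumed from the way the codes are written in hexadecimal.

```python
def make_dictionary_code(category: int, char_count: int, sequence: int) -> bytes:
    """Pack a code as [A|category] [reserved|char_count] [sequence, 2 bytes].

    category   -- low four bits of the first byte (0 = non-dictionary word,
                  1-9 and B-F = parts of speech)
    char_count -- number of Chinese characters, stored in the low four bits
                  of the second byte
    sequence   -- sequence number within the class, 0001H to FFFFH
    Punctuation (category A) uses a different layout and is not handled here.
    """
    if not 0x0 <= category <= 0xF or category == 0xA:
        raise ValueError("category must be 0-9 or B-F")
    if not 1 <= char_count <= 0xF:
        raise ValueError("char_count must fit in four bits and be at least 1")
    if not 1 <= sequence <= 0xFFFF:
        raise ValueError("sequence must be in 0001H..FFFFH")
    first = 0xA0 | category       # high four bits are always A (1010)
    second = char_count & 0x0F    # reserved high four bits left as zero (assumption)
    # Big-endian byte order is an assumption of this sketch.
    return bytes([first, second]) + sequence.to_bytes(2, "big")


if __name__ == "__main__":
    # A two-character noun with sequence number 2.
    print(make_dictionary_code(0x1, 2, 2).hex().upper())
    # A three-character non-dictionary word with sequence number 1.
    print(make_dictionary_code(0x0, 3, 1).hex().upper())
```

Keeping the part of speech in the first byte means a reader of the file can classify a word without consulting any table, which is the property the method relies on.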

Functions and features of the invention:

(1) A document format is realized in which the word is the smallest information carrier. This completely avoids the obstacle that word segmentation poses to Chinese information processing and puts Chinese processing on the same footing as Western language processing;

(2) Automatic part-of-speech classification is realized: the dictionary code states the part of speech of the word explicitly, so no other tagging method is needed;

(3) No internal structure or configuration of existing computers is changed. The existing Chinese character internal code standard is still used; a system is merely built on top of it to construct the word-based document format.

Description of drawings

Fig. 1 is a schematic diagram of the overall structure.

Fig. 2 is a schematic diagram of the processing procedure.

Embodiment

The encoding principle of the invention is as follows: the entire document file consists of a series of code words. Control codes are represented by international standard ASCII codes, Western characters and symbols are represented by the international standard ASCII codes for them, and a separate coding system is established for the Chinese vocabulary in the Common Word List.

The encoding method proposed by the invention classifies the Chinese vocabulary in the Common Word List by most frequent part of speech into nouns, verbs, adjectives, adverbs, pronouns, numerals, measure words, onomatopoeia, interjections, prepositions, conjunctions, particles and modal particles. Chinese also contains a large number of idioms, which are treated as one further class. Words classified in this way are called dictionary words. An entry in the Common Word List contains at least 1 and at most 7 Chinese characters, so every dictionary word of the invention consists of 1 to 7 Chinese characters. Each dictionary word is assigned a code, called the dictionary code. Every dictionary code consists of 4 bytes, written in hexadecimal as:

[AxH xxH xxH xxH] (H denotes hexadecimal here and below; all dictionary codes below are written in hexadecimal)

The high four bits of the first byte must be AH (1010 in binary). The low four bits x of the first byte range from 1H to FH and indicate the part of speech of the word. Major parts of speech such as nouns and adjectives each form their own group, while minor parts of speech such as particles, interjections and modal particles are combined into one group. Verbs and verb phrases are more complicated because of features such as tense and number: B denotes single-character notional verbs, C denotes multi-character notional verbs, D denotes verb phrases, and E and F are reserved for future extension. The correspondence is as follows:

Low four bits	Part of speech
1	Noun
2	Adjective
3	Adverb
4	Pronoun
5	Numeral
6	Measure word
7	Idiom
8	Preposition, conjunction
9	Minor word (particle, modal particle, onomatopoeia, interjection)
A	Punctuation symbol
B-F	Verb and verb phrase

The high four bits of the second byte are reserved, and the low four bits of the second byte give the number of characters (1-7) in the word. The remaining third and fourth bytes form a sequence code ranging from 1 to FFFFH (65535), used to order the vocabulary by pinyin. Encoded in this way, the scheme can hold at least 14 × 65535 = 917490 entries. The Common Word List contains 39016 entries in total across first-level common words, second-level common words and monosyllabic words, plus supplementary tables of some proper nouns, so the dictionary code space is sufficient.
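The capacity figure can be checked with a few lines of Python. The sketch assumes that the 14 classes are the first-byte values 1 to 9 and B to F, that is, every value except 0 (non-dictionary words) and A (punctuation); the text gives the factor 14 without listing the values, so that reading is an assumption, and the variable names are illustrative.

```python
# First-byte low-nibble values assumed to carry dictionary words:
# 1-9 and B-F, excluding 0 (non-dictionary words) and A (punctuation).
category_values = [n for n in range(0x1, 0x10) if n != 0xA]

# Sequence codes run from 0001H to FFFFH, so 0000H is not used.
sequences_per_category = 0xFFFF

capacity = len(category_values) * sequences_per_category
common_word_count = 39016  # size of the Common Word List as stated in the text

print(len(category_values), capacity, capacity - common_word_count)
```

On these assumptions the stated word list occupies only a small fraction of the available code space.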

For example, the Common Word List contains the word "阿弟" (adi, "younger brother"), for which the dictionary code is:

[A1020002]

Here A1 indicates that the word is a common noun, 02 indicates that the word contains two Chinese characters, and 0002 is the sequence number of the word in the noun table of the dictionary code table.
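Reading the fields back out of a code such as [A1020002] can be sketched as follows. The names parse_dictionary_code, DictionaryCode and CATEGORY_NAMES are hypothetical, the category labels are taken from the correspondence described above, and big-endian order for the last two bytes is an assumption.

```python
from typing import NamedTuple

# Low four bits of the first byte -> class label, per the correspondence above.
CATEGORY_NAMES = {
    0x0: "non-dictionary word", 0x1: "noun", 0x2: "adjective", 0x3: "adverb",
    0x4: "pronoun", 0x5: "numeral", 0x6: "measure word", 0x7: "idiom",
    0x8: "preposition or conjunction", 0x9: "minor word", 0xA: "punctuation",
    0xB: "single-character verb", 0xC: "multi-character verb",
    0xD: "verb phrase", 0xE: "reserved", 0xF: "reserved",
}


class DictionaryCode(NamedTuple):
    category: str
    char_count: int
    sequence: int  # for punctuation this field holds the internal code instead


def parse_dictionary_code(hex_code: str) -> DictionaryCode:
    """Split an 8-digit hexadecimal dictionary code into its three fields."""
    raw = bytes.fromhex(hex_code)
    if len(raw) != 4 or raw[0] >> 4 != 0xA:
        raise ValueError("a dictionary code is 4 bytes and starts with nibble A")
    return DictionaryCode(
        category=CATEGORY_NAMES[raw[0] & 0x0F],
        char_count=raw[1] & 0x0F,               # high four bits are reserved
        sequence=int.from_bytes(raw[2:], "big"),  # byte order is an assumption
    )


if __name__ == "__main__":
    print(parse_dictionary_code("A1020002"))
```

Because the part of speech and the character count sit in fixed bit positions, both can be read with a mask and no table lookup.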

The invention calls one further class of words non-dictionary words, namely proper nouns such as personal names, place names and trade names. These words are also encoded with the four-byte method:

[AxH xxH xxH xxH]

The only difference is that x in the first byte AxH is always zero, that is, the first byte of a non-dictionary word is always A0H; the rest is the same as the encoding method above. For example, the proper noun "阿拉伯" (Arab) has the dictionary code [A0030001], where A0 indicates a non-dictionary word, 03 indicates that the word contains three Chinese characters, and 0001 is the sequence number of the word in the non-dictionary word table of the dictionary code table.

The invention likewise uses a four-byte code for symbols such as Chinese punctuation marks (full-width characters):

[AAH xxH xxH xxH], where the first byte is always AAH, the second byte is always 00H, and the last two bytes are the internal code of the punctuation mark. The codes of common punctuation marks are as follows:

Symbol	Dictionary code	Symbol	Dictionary code
，	AA00A3AC	。	AA00A1A3
：	AA00A3BA	；	AA00A3BB
？	AA00A3BF	、	AA00A1A2
‘	AA00A1AE	’	AA00A1AF
《	AA00A1B6	》	AA00A1B7
“	AA00A1B0	”	AA00A1B1
（	AA00A3A8	）	AA00A3A9
！	AA00A3A1	…	AA00A1AD
—	AA00A1AA	－	AA00A3AD
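The punctuation rule (AAH, then 00H, then the two-byte internal code) can be sketched in Python. The sketch assumes the internal code is GB2312, which the listed values are consistent with but which the text does not name; the function name punctuation_code is hypothetical.

```python
def punctuation_code(symbol: str, encoding: str = "gb2312") -> str:
    """Return the 8-digit dictionary code for one full-width punctuation mark.

    The GB2312 internal code is an assumption of this sketch; the text only
    says that the last two bytes are the symbol's internal code.
    """
    internal = symbol.encode(encoding)
    if len(symbol) != 1 or len(internal) != 2:
        raise ValueError("expected one symbol with a two-byte internal code")
    # First byte AAH, second byte 00H, then the internal code.
    return (bytes([0xAA, 0x00]) + internal).hex().upper()


if __name__ == "__main__":
    for mark in "，。？":
        print(mark, punctuation_code(mark))
```

Codec tables can differ for a few symbols between libraries, so a real system would more likely store the punctuation table explicitly than derive it from a codec.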

So that the proposed encoding can be implemented on a computer, the dictionary codes of the invention must be linked to the internal codes that currently represent Chinese characters in computers. The invention establishes the correspondence between dictionary codes and internal codes through a database, called the dictionary code table. The dictionary code table comprises 12 tables distinguished by part of speech: the non-dictionary word table, common noun table, adjective table, verb table, adverb table, pronoun table, numeral table, measure word table, preposition/conjunction table, onomatopoeia/interjection/particle/modal particle table, idiom table and punctuation mark table. Each table records each dictionary code and the internal code string corresponding to it. The structure of the database is illustrated below (ISN denotes the internal code):

Table 0: Non-dictionary words

Number	Dictionary code	ISN
1	A0030001	b0a2c0adb2ae
2	A0040002	b0c2c1d6c6a5bfcb
......	......	......

Table 1: Common nouns

Number	Dictionary code	ISN
1	A1020001	b0a2b0d6
2	A1020002	b0a2b5dc
3	A1020003	b0a2b8e7
4	A1020004	b0a2c2e8
5	A1020005	b0a2c3c3
6	A1020006	b0a2c6c5
......	......	......
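The lookup from dictionary code to internal code can be sketched with an in-memory mapping built from the rows of Table 0 and Table 1. This is an illustration only: the text describes a database, whereas the sketch uses a plain Python dict, the names DICTIONARY_CODE_TABLE, internal_code_for and render are hypothetical, and decoding the internal code as GB2312 is an assumption.

```python
from typing import Iterable

# Dictionary code -> internal code string, copied from Table 0 and Table 1.
DICTIONARY_CODE_TABLE = {
    "A0030001": "b0a2c0adb2ae",
    "A0040002": "b0c2c1d6c6a5bfcb",
    "A1020001": "b0a2b0d6",
    "A1020002": "b0a2b5dc",
    "A1020003": "b0a2b8e7",
    "A1020004": "b0a2c2e8",
    "A1020005": "b0a2c3c3",
    "A1020006": "b0a2c6c5",
}


def internal_code_for(dictionary_code: str) -> bytes:
    """Return the internal code bytes recorded for one dictionary code."""
    return bytes.fromhex(DICTIONARY_CODE_TABLE[dictionary_code.upper()])


def render(dictionary_codes: Iterable[str], encoding: str = "gb2312") -> str:
    """Turn a sequence of dictionary codes into displayable text.

    Treating the internal code as GB2312 is an assumption of this sketch.
    """
    return "".join(internal_code_for(code).decode(encoding)
                   for code in dictionary_codes)


if __name__ == "__main__":
    print(render(["A0030001", "A1020002"]))
```

In every row, the character count in the second byte of the dictionary code equals half the length of the internal code in bytes, which gives a cheap consistency check when loading such a table.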

The Chinese keyboard input method proposed by the invention uses pinyin as the main input means and the word as the basic input unit. A lookup table of pinyin codes and dictionary codes is established. Given the pinyin of the word being entered, the matching dictionary code is looked up in this table, and the dictionary code table is then used to find the corresponding internal code for display and input. Two files are finally produced: one holds the input content in dictionary code format, and the other is a file in the ordinary internal code format. Display and printing still use the corresponding internal codes from the dictionary code table. The pinyin code to dictionary code lookup table pairs the pinyin of each word with its dictionary code, following the order of the Common Word List. Although homophonous characters are widespread in modern Chinese, homophonous words are far fewer, so in ordinary language use word-level input achieves an essentially one-to-one mapping. Duplicate codes can still occur in some cases, where one pinyin code corresponds to several dictionary codes; the system then finds all words that match the pinyin code and lets the user choose among them to confirm the word to be entered. The structure of the pinyin code to dictionary code lookup table is as follows:

Number	Pinyin	Dictionary code
1	a	A9010001
2	aha	A9020002
3	aba	A1020001
4	adi	A1020002
......	......	......
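The pinyin lookup, including the case where one pinyin code matches several dictionary codes, can be sketched as below. The rows are the four shown in the table; the names PINYIN_ROWS, build_pinyin_index and candidates are hypothetical, and normalizing the typed string to lowercase is an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (pinyin, dictionary code) rows copied from the lookup table above.
PINYIN_ROWS = [
    ("a", "A9010001"),
    ("aha", "A9020002"),
    ("aba", "A1020001"),
    ("adi", "A1020002"),
]


def build_pinyin_index(rows: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group dictionary codes by pinyin, keeping table order within a group."""
    index: Dict[str, List[str]] = defaultdict(list)
    for pinyin, code in rows:
        index[pinyin].append(code)
    return dict(index)


def candidates(index: Dict[str, List[str]], typed: str) -> List[str]:
    """Return every dictionary code whose pinyin equals the typed string.

    One result can be accepted directly; several results are homophones
    that the user has to choose between.
    """
    return index.get(typed.strip().lower(), [])


if __name__ == "__main__":
    index = build_pinyin_index(PINYIN_ROWS)
    print(candidates(index, "adi"))
    print(candidates(index, "xyz"))
```

Returning a list rather than a single code keeps the homophone case and the one-to-one case on the same code path.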

For example, to enter the word "春天" (chuntian, "spring"), the eight letters of the pinyin key code "chuntian" are typed on the keyboard. The translation program looks up the dictionary code corresponding to "chuntian" in the lookup table, uses the dictionary code to find the internal code of the word, and displays the word on the screen for the user to confirm. After confirmation, the dictionary code and the internal code of the word are both saved, each in its own file. Once every word has been entered in this way, a complete article is formed whose internal document format is expressed in dictionary codes. Information processing such as text classification and automatic summarization can then be carried out directly on this dictionary code platform: the processing reads four bytes at a time, each four bytes being one word that can be handled directly, which bypasses the problem of word segmentation.
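Reading such a document word by word can be sketched as a small tokenizer. The text does not spell out how ASCII bytes and dictionary codes are told apart in one file; the sketch assumes that any byte below 80H is a single ASCII character and that any byte whose high four bits are A starts a four-byte dictionary code. The function name iter_tokens is hypothetical.

```python
from typing import Iterator, Tuple


def iter_tokens(data: bytes) -> Iterator[Tuple[str, bytes]]:
    """Yield ("ascii", 1 byte) or ("word", 4 bytes) tokens from a document.

    The rule for separating ASCII from dictionary codes is an assumption
    of this sketch, not something the source text states.
    """
    pos = 0
    while pos < len(data):
        lead = data[pos]
        if lead < 0x80:
            # Control code, Western letter or symbol: one ASCII byte.
            yield ("ascii", data[pos:pos + 1])
            pos += 1
        elif lead >> 4 == 0xA:
            # Dictionary code: always four bytes, no segmentation needed.
            if pos + 4 > len(data):
                raise ValueError(f"truncated dictionary code at offset {pos}")
            yield ("word", data[pos:pos + 4])
            pos += 4
        else:
            raise ValueError(f"unexpected byte {lead:#04x} at offset {pos}")


if __name__ == "__main__":
    sample = bytes.fromhex("A1020002") + b"!\n" + bytes.fromhex("AA00A1A3")
    for kind, raw in iter_tokens(sample):
        print(kind, raw.hex().upper())
```

Each "word" token already carries its part of speech in the first byte, so a downstream step such as text classification can count or filter words without a segmenter or a tagger.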

Claims

A method for computer processing and keyboard input of Chinese words, characterized in that the method operates in pinyin mode, takes the word as the basic input unit, and maps in turn from pinyin code to dictionary code and from dictionary code to internal code, wherein "word" means the smallest meaningful unit of a Chinese sentence that a user can use independently, including words, phrases and idioms, the method comprising the following steps in order:

Step 1: The Chinese vocabulary in the "Contemporary Chinese Common Word List for Information Processing" is divided by most frequent part of speech into nouns, verbs, verb phrases, adjectives, adverbs, pronouns, numerals, measure words, onomatopoeia, interjections, prepositions, conjunctions, particles and modal particles, plus idioms, to form dictionary words, each dictionary word consisting of 1 to 7 Chinese characters;

Step 2: Each dictionary word from step 1 is encoded as follows to form a dictionary code, each dictionary code consisting of 4 bytes, written in hexadecimal as:

[AxH xxH xxH xxH], where H denotes hexadecimal and all dictionary codes below are written in hexadecimal; the high four bits of the first byte must be AH, 1010 in binary; the low four bits x of the first byte range from 1H to FH and give the part of speech of the word, where 1 to 9 denote, in order, nouns, adjectives, adverbs, pronouns, numerals, measure words, idioms, prepositions and conjunctions, and minor words comprising onomatopoeia, interjections, particles and modal particles, A denotes punctuation marks, and B to F denote the verbs and verb phrases in the "Contemporary Chinese Common Word List for Information Processing"; each class is arranged in order within its table; the high four bits of the second byte are reserved, and the low four bits of the second byte give the number of Chinese characters in the word; the third and fourth bytes form a sequence code ranging from 1H to FFFFH (65535), used to order and number the vocabulary by pinyin alphabetical order; encoded in this way, the scheme can hold at least 14 × 65536 entries;

Step 3: A pinyin code to dictionary code lookup table is built from the results of steps 1 and 2 and stored in the computer;

Step 4: A lookup table is established between the dictionary code of each word in the "Contemporary Chinese Common Word List for Information Processing" and the internal code of that word in the computer, called the dictionary code table; the dictionary code table comprises 12 tables distinguished by part of speech, namely the non-dictionary word table, common noun table, adjective table, verb table, adverb table, pronoun table, numeral table, measure word table, preposition/conjunction table, onomatopoeia/interjection/particle/modal particle table, idiom table and punctuation mark table; each table records the dictionary code of each word and the internal code corresponding to it; non-dictionary words are proper nouns including personal names, place names and trade names; non-dictionary words and punctuation marks are likewise encoded with the four-byte method;

Step 5: The dictionary code table obtained in step 4 is stored in the computer, and during Chinese input a file in dictionary code format and a file in internal code format are each produced;

Step 6: Chinese words are entered into the computer in pinyin mode, with the word as the basic unit;

When the four-byte method encodes a punctuation mark, the first byte is always AAH and the second byte is always 00H; the third and fourth bytes are the internal code of the punctuation mark;

When the four-byte method encodes a non-dictionary word, the first byte is always A0H; the high four bits of the second byte are reserved, and the low four bits of the second byte give the number of Chinese characters in the non-dictionary word; the third and fourth bytes give the sequence number of the non-dictionary word in the non-dictionary word table of the dictionary code table.