CN1115887A

CN1115887A - Sentence input method in Chinese character input system of computer

Info

Publication number: CN1115887A
Application number: CN 94108208
Authority: CN
Inventors: 王希曾
Original assignee: Individual
Current assignee: Individual
Priority date: 1994-07-23
Filing date: 1994-07-23
Publication date: 1996-01-31

Abstract

The Chinese character input system uses function words as the basis for word segmentation and word selection within a sentence, and sets up a corresponding code book together with a duplicate-code selection scheme, so as to realize sentence input. The method inputs Chinese characters with the sentence as the unit, each character being represented by one to several codes. It is also compatible with character and word input, speech input and character recognition input.

Description

General sentence input method for computer Chinese character input systems

The invention is a sentence input method for Chinese characters on a computer, in particular one based on various coding techniques and on the delimiting function of function words. It relates to technical fields such as keyboard input, speech input, character recognition, machine translation and information retrieval.

To date there are hundreds of keyboard coding and input methods for Chinese characters on computers. They fall broadly into several classes: shape codes, phonetic codes, phonetic-shape codes and serial-number codes. After ten years of development, Chinese character coding has moved from the character input stage to the word input stage (the so-called second generation, word input), and input speed, learnability and accuracy have all improved markedly. However, the codes for single characters are still too long and too complicated, and word input often fails because the dictionary does not contain the word, forcing the operator to fall back on character input. The operator therefore has to memorize both the character codes and the dictionary contents, and has to pay attention to word boundaries. The average number of keystrokes per character is also generally high, which limits input speed. More importantly, the mental burden and labor intensity remain heavy: the human brain is still "catering to the computer", instead of the computer "understanding" and helping the human brain.

The existing phonetic Chinese character sentence input system belongs to the third generation, sentence input. It mainly takes sentences typed in full Hanyu Pinyin; the computer understands the natural language through grammatical and semantic analysis and produces the input result. Its learnability, input speed and level of intelligence are much improved. It has several drawbacks, however. First, it is applicable only to pinyin input (it is compatible with double pinyin), which is a major obstacle for users unfamiliar with pinyin. Second, as is well known, the pinyin of a Chinese character consists of 2 to 6 letters, with an average length of 3.01 letters; adding the tone mark and the space bar used in word-linked writing, the average number of keystrokes per character can be estimated at over 4.5. Even if word input is used at the same time to reduce keystrokes, the average is still above 3.5, which is on the long side and uncompetitive. This is a shortcoming that should not exist at the sentence input stage. Third, it cannot be applied to the other existing coding methods such as shape codes, phonetic-shape codes and serial-number codes, so it can hardly meet the urgent need to raise the overall level of China's information technology. Fourth, grammatical or semantic analysis is technically very demanding, the groundwork is vast and complicated, and the analysis consumes a great deal of computing time. Sentence-pattern analysis is moreover beyond ordinary personnel and consumes the energy of large numbers of highly knowledgeable linguists. Fifth, Chinese character input, as the name suggests, only requires the computer to reproduce faithfully the characters the user wants to enter; it does not really need to analyze their grammar and semantics (machine translation is of course another matter). The logic of sentence input should therefore be simple, which lowers cost and raises analysis speed.
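The estimate of more than 4.5 keystrokes per character can be checked with simple arithmetic. The sketch below assumes one tone key per character and one space per word; the average word length of 2 characters is an illustrative assumption, not a figure from the text.

```python
# Rough check of the pinyin keystroke estimate quoted above.
AVG_PINYIN_LETTERS = 3.01  # from the text: average letters per character
TONE_KEYS = 1.0            # assumption: one tone key per character
AVG_WORD_LENGTH = 2.0      # assumption: characters per word (not from the text)


def keystrokes_per_character(letters, tone_keys, word_length):
    """Average keystrokes per character, with one space typed per word."""
    space_share = 1.0 / word_length
    return letters + tone_keys + space_share


estimate = keystrokes_per_character(AVG_PINYIN_LETTERS, TONE_KEYS, AVG_WORD_LENGTH)
print(f"{estimate:.2f} keystrokes per character")
```

Under these assumptions the result lands just above 4.5; a longer assumed word length lowers the space share and the estimate with it.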

The object of the invention is to propose a sentence input method for Chinese characters that is easy to learn and use, needs few keystrokes on average, requires little investment, is simple to implement, and is generally applicable to the various existing Chinese character input systems and coding methods.

The invention was achieved as follows.

First, the characteristics of the various coding methods were studied. It was found that, provided the way codes are taken and the number of codes are changed appropriately (for a shape code, for example, changing the original "first, second, third, last" code selection to "first, second, last"), most methods can take only 3 codes per Chinese character and still have a duplicate-code rate of about 30% or even lower. At the sentence input stage, telling duplicate codes apart is handled intelligently by the computer. This not only greatly simplifies code taking and markedly reduces labor intensity, it also creates the conditions for sentence input.

Second, the dynamic frequency of Chinese function words was studied. Using existing research results, it was found that 186 of the 253 commonly used monosyllabic function words are among the most common characters, with a frequency rank within the top 1000. A few dozen of them (rendered here as "get", "cross", "be", "with", "in", "to" and others) rank within the top 100, so these characters are usually chosen as first-level short-code characters. In addition there are about 650 function words of two or more syllables, most of which are also common words of very high dynamic frequency. Monosyllabic and polysyllabic function words together number about 1000. They account for 20-30% of ordinary text, so that a function word appears about every 3 to 4 words. Using function words as the boundary markers for automatic word segmentation of a sentence is therefore the simplest, most direct and most effective approach.

Third, the parts of speech of Chinese were studied from the standpoint of information processing. It is known that many function words restrict the part of speech of the words that precede or follow them, mainly the notional words. For example: an adverb can be followed by a verb or an adjective but not by a noun or pronoun; a preposition can be followed by a noun or pronoun; a conjunction has no significant restriction. Some function words can stand alone and some cannot; some usually occupy the beginning of a sentence and some the end; and so on. The study found that this attraction or avoidance, tolerance or strictness between function words and notional words is a simple, clear yes-or-no logical relation, which suits the characteristics of a computer exactly. It allows the entries adjacent to a function word to be delimited and judged quickly, so that word selection is made intelligent and duplicate-code entries are rejected simply, conveniently and rapidly.

Fourth, the classification of parts of speech was studied. The function word is the delimiting marker for segmenting the sentence; it is the core and the main object of analysis. Parts of speech must be set up according to the needs of the delimiting function of function words; this is the basic principle of the classification. According to what function words require of the words preceding and following them, and according to the usage type of each word, the invention differentiates and merges the dozen or so traditional categories such as notional words, function words, morphemes and idioms. For example, nouns, pronouns, measure words and time words are merged into the noun-pronoun class; verbs are divided into 2 classes, general verbs and psychological verbs; adverbs are divided into 2 classes, general adverbs and degree adverbs; there are also a sentence-initial class, a sentence-final class, and so on.

Fifth, the tagging method was determined. Part-of-speech tags must be consistent with the tag sets of existing computational linguistics, which helps the standardization of information technology. The noun-pronoun class set up above is therefore tagged N, numerals are tagged M, and so on.

Sixth, compatibility was studied, mainly the compatibility of sentence input with word-and-character input. ("Word-and-character input" here means mixing multi-character word codes with single-character codes, that is, the second-generation word input method; the same below.) It was decided to decompose the codes of four-character words (and of words and phrases longer than four characters). For example, if the dictionary already contains "economic benefit" but not "economy" and "benefit", these 2 two-character words should be added; this is called "decomposition" here. Idioms or specific terms that should not be decomposed, such as "unexpectedly" or "Mount Everest", are left whole.

Seventh, the conditions that a dictionary must meet for sentence input were studied. Namely: it should contain all function words; words of 4 or more characters should be decomposed into 2-character or 3-character codes, stored alongside the codes of the original entries of 4 or more characters; the number of entries should be about 45,000; preferably a standard dictionary is used.

Eighth, auxiliary means of telling duplicate codes apart were studied. After a series of steps such as dictionary lookup and function-word delimitation, some duplicate-code words in the sentence inevitably remain to be selected. For this case the invention creates the duplicate-code group word-order tagging method. The duplicate-code group word order is defined as the selection order of each word within one group of duplicate-code words. It can be marked with Arabic numerals or with other marks. In addition, the characters in a duplicate-code group should be entered in the code book together with as many of the words they form as possible, which also reduces the chance of duplicate codes. Incidentally, Chinese draws no sharp line between characters and words. In this invention a character is a word: a single Chinese character is ordinarily called a character, but it is in fact a word.

Ninth, the logic of the computer's processing during sentence input was studied. It was determined that non-Chinese symbols (punctuation marks, foreign-language symbols, graphics, Arabic numerals and so on) serve as the basis for sentence breaks; function words serve as the boundary markers for segmentation; words without duplicate codes are chosen first; then, with the function word as the reference, the duplicate-code words preceding or following it are identified and selected; if duplicate codes still remain, the appropriate word is selected according to the word order.
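The selection logic described above (unique entries first, then filtering by the neighbouring function word, then the word order tag) can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the code book, its codes and the English glosses standing in for Chinese entries are all hypothetical, only a preceding function word is consulted, and the clause is assumed to be already split at non-Chinese symbols and cut into per-word codes.

```python
# Hypothetical miniature code book: code -> candidates (word, tag, word order).
# English glosses stand in for Chinese entries; none of this data is from the patent.
CODE_BOOK = {
    "aaa": [("very", "D", 1)],                    # adverb: a function word
    "ccc": [("in", "P", 1)],                      # preposition: a function word
    "bbb": [("table", "N", 1), ("run", "V", 2)],  # duplicate-code group
    "ddd": [("walk", "V", 1), ("city", "N", 2)],  # duplicate-code group
}

# Tags that may follow a function word, per the adverb and preposition
# examples in the text (adverb: verb or adjective; preposition: noun-pronoun).
MAY_FOLLOW = {"D": {"V", "A"}, "P": {"N"}}


def select_words(codes):
    """Resolve one clause, given as a list of codes, to a list of words."""
    groups = [CODE_BOOK[code] for code in codes]
    # Step 1: entries without duplicate codes are chosen first.
    chosen = [group[0] if len(group) == 1 else None for group in groups]
    for i, group in enumerate(groups):
        if chosen[i] is not None:
            continue
        candidates = group
        # Step 2: a preceding function word filters the duplicate-code group.
        previous = chosen[i - 1] if i > 0 else None
        if previous is not None and previous[1] in MAY_FOLLOW:
            fitting = [c for c in group if c[1] in MAY_FOLLOW[previous[1]]]
            if fitting:
                candidates = fitting
        # Step 3: any remaining tie is settled by the word order tag.
        chosen[i] = min(candidates, key=lambda c: c[2])
    return [word for word, _tag, _order in chosen]


print(select_words(["aaa", "bbb"]))  # ['very', 'run']
print(select_words(["ccc", "ddd"]))  # ['in', 'city']
print(select_words(["bbb"]))         # ['table']
```

In the first two calls the function word overrides the default word order; in the third there is no function word nearby, so the word order tag alone decides.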

On the basis of the above research and design, the invention proposes the following general sentence input method for computer Chinese character input systems: the usage types of Chinese function words and their restrictions on notional words are exploited, function words serve as the basis for segmenting the sentence and selecting words, and a corresponding code book is set up which, together with the duplicate-code group word-order scheme, supports and completes the sentence input process. The method can adopt any type of coding method. On the corresponding Chinese character input system, with the sentence as the unit, each Chinese character in the sentence is entered into the computer with one to several codes (including the space bar and the phrase marker key), in sequence, at equal length and continuously. Word-and-character input can be used alongside. The input can also be speech input or character recognition input.

Embodiments of the invention are as follows:

First, when a coding method is used with the invention, it may be left unchanged or modified as appropriate. Modification serves to reduce keystrokes, for example by changing 4 codes per Chinese character to 3. During modification the static duplicate-code rate must be controlled; it should generally not exceed 35%. If it does, individual adjustments should also be made to the way codes are taken and to the code elements in order to lower the rate, for example changing "first, second, third, last" to "first, second, last", or reassigning the code elements of individual codes.
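A small sketch can make the check against the 35% limit concrete. It assumes that the static duplicate-code rate means the share of entries whose code is shared with at least one other entry, which the text does not define; the entries and code elements are hypothetical.

```python
from collections import Counter


def take_codes(elements, positions=(0, 1, -1)):
    """Pick code elements by position, by default first, second and last."""
    if len(elements) <= len(positions):
        return elements  # short sequences are kept whole
    return "".join(elements[p] for p in positions)


def static_duplicate_rate(codes):
    """Share of entries whose code is also used by another entry (assumed definition)."""
    counts = Counter(codes.values())
    duplicated = sum(1 for code in codes.values() if counts[code] > 1)
    return duplicated / len(codes)


# Hypothetical entries with their full code-element sequences.
FULL = {"w1": "abcd", "w2": "abed", "w3": "axyd", "w4": "qrst", "w5": "qr"}

four_codes = {w: take_codes(e, (0, 1, 2, -1)) for w, e in FULL.items()}
three_codes = {w: take_codes(e) for w, e in FULL.items()}

print(static_duplicate_rate(four_codes))   # 0.0
print(static_duplicate_rate(three_codes))  # 0.4
```

In this toy data, dropping the third element makes "w1" and "w2" collide, which pushes the rate to 0.4. That is above the 35% guideline, so the code taking or the code elements of one of the two entries would need adjusting.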

Second, supplement the dictionary entries. All function-word entries, about 1000 in total, must be filled in. Words made up of bound morphemes (morphemes not used alone) should be filled in as far as possible, for example: grape, embarrassment, and so on. Non-word characters must be added to the dictionary, for example: Fu, Jie, and so on. Entries of 4 or more characters must have their codes decomposed into 2-character or 3-character words; if decomposition would leave strings of bound morphemes or non-words, as with "baffled" or "it is wonderful", decomposition is not needed. As many as possible of the words formed from the single characters in duplicate-code groups must also be entered in the code book.
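The decomposition rule can be sketched as below. The sketch assumes a reference word list is available to decide whether a part is a usable word; that list, the function names and the Chinese strings are illustrative choices for this example (the text gives its examples only in translation).

```python
# Hypothetical reference list of usable 2- and 3-character words.
KNOWN_WORDS = {"经济", "效益"}


def decompose(entry, known_words):
    """Split an entry of 4+ characters into two known parts, or return []."""
    if len(entry) < 4:
        return []
    for cut in (2, 3):
        head, tail = entry[:cut], entry[cut:]
        if len(tail) in (2, 3) and head in known_words and tail in known_words:
            return [head, tail]
    return []  # parts would be bound morphemes or non-words: leave whole


def supplement(dictionary, known_words):
    """Return the dictionary plus any missing parts of decomposable entries."""
    added = set()
    for entry in dictionary:
        for part in decompose(entry, known_words):
            if part not in dictionary:
                added.add(part)
    return dictionary | added


result = supplement({"经济效益", "莫名其妙"}, KNOWN_WORDS)
print(sorted(result))
```

Here the first entry gains its two halves as separate entries, while the second stays whole because its halves are not in the reference list. The original long entries are kept alongside the new parts.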

Third, compile the code book. The original code book can remain in use and is called for word-and-character input. A further code book should be compiled and used in parallel with the original one; it is based on 2-character, 3-character and 4-character words. The code book for the 6763 Chinese characters of the GB standard should also be suitably revised.

Fourth, part-of-speech tagging. (The following tags are one embodiment.)

Noun-pronoun class, tag N. Includes nouns, pronouns, time words, place words, locality words and measure words.

Numeral, tag M.

General adjective, tag A.

Qualifying adjective, tag AA.

General verb, tag V.

Psychological verb, tag VV.

General adverb, tag D.

Degree adverb, tag DD.

General auxiliary, tag U.

Tense auxiliary, tag UU.

Preposition, tag P.

Conjunction, tag C.

Mood class, tag O. Includes modal particles, onomatopoeia and interjections.

Bound morpheme (not used alone), tag G.

Non-word character, tag X.

In addition there are the stand-alone class, the sentence-initial class, the sentence-final class, and so on.

Fifth, tagging of the duplicate-code group word order. The tagging is based on a comparison of frequency and of level within the same group. (Level refers to level 1 or level 2 of the 6763 Chinese characters in the GB character set.) Next, rarely used characters, bound morphemes and non-word characters should each be compared separately. The word order is then tagged in sorted sequence. When 3-code input is used, most duplicate-code groups contain 2 to 3 words, so ordering them is fairly easy.
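One possible way to turn these comparisons into word order tags is sketched below. The exact precedence is an assumption: the text names frequency, GB level and the separate handling of bound morphemes and non-word characters, but not how they combine. Here entries tagged G or X are placed last, then higher frequency wins, then GB level 1 precedes level 2. All entries and counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    word: str
    frequency: int  # hypothetical usage count
    gb_level: int   # 1 or 2, per the GB character set
    tag: str        # part-of-speech tag, e.g. N, V, G, X


LOW_PRIORITY_TAGS = {"G", "X"}  # assumption: bound morphemes and non-words go last


def assign_word_order(group):
    """Return {word: order} for one duplicate-code group, 1 being selected first."""
    ranked = sorted(
        group,
        key=lambda e: (e.tag in LOW_PRIORITY_TAGS, -e.frequency, e.gb_level),
    )
    return {entry.word: order for order, entry in enumerate(ranked, start=1)}


group = [
    Entry("w_a", 50, 1, "N"),
    Entry("w_b", 900, 2, "V"),
    Entry("w_c", 5000, 1, "G"),
]
print(assign_word_order(group))  # {'w_b': 1, 'w_a': 2, 'w_c': 3}
```

With groups of only 2 to 3 words, as the text expects for 3-code input, the resulting tags are short Arabic numerals that can be stored directly in the code book.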

Sixth, partial adaptation rules for function words, for example (columns: part-of-speech tag; preceding word, adapted; preceding word, not adapted; following word, adapted; following word, not adapted; the cells are listed in source order):

General adverb D; general verb V; noun-pronoun class N

General adjective A; degree adverb DD; psychological verb VV; noun-pronoun class N

Qualifying adjective AA; general auxiliary U; noun-pronoun class N; tense auxiliary UU; general verb V; noun-pronoun class N

Psychological verb VV; preposition P; noun-pronoun class N; conjunction C; numeral M; noun-pronoun class N
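A rule table with these four columns could be held in a structure like the one sketched below. Only two rows are filled in, using the adverb and preposition rules stated in prose earlier in the description (an adverb may be followed by a verb or adjective but not by a noun or pronoun; a preposition may be followed by a noun or pronoun). The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One row: tags allowed or barred before and after a function word."""
    before_ok: frozenset = frozenset()
    before_no: frozenset = frozenset()
    after_ok: frozenset = frozenset()
    after_no: frozenset = frozenset()


RULES = {
    "D": Rule(after_ok=frozenset({"V", "A"}), after_no=frozenset({"N"})),
    "P": Rule(after_ok=frozenset({"N"})),
}


def fits(function_tag, neighbour_tag, side):
    """Return True (adapted), False (not adapted) or None (no rule applies)."""
    rule = RULES.get(function_tag)
    if rule is None:
        return None
    if side == "before":
        allowed, barred = rule.before_ok, rule.before_no
    else:
        allowed, barred = rule.after_ok, rule.after_no
    if neighbour_tag in barred:
        return False
    if neighbour_tag in allowed:
        return True
    return None


print(fits("D", "V", "after"))  # True
print(fits("D", "N", "after"))  # False
print(fits("C", "N", "after"))  # None
```

The three-valued result keeps "not adapted" distinct from "no rule", so a conjunction, which the description says has no significant restriction, simply leaves the decision to the word order tag.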

Seventh, when word-and-character input is used alongside, the end of a phrase code should be given a suitable marker key so that the computer can recognize it. When character codes differ in length, the missing code positions should be filled with a suitable marker key (such as the space bar).
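The marker-key rule for short codes can be sketched as an encoder. The reading assumed here is that a code shorter than the full length is closed with a single space rather than padded position by position; this is an interpretation, chosen because it agrees with the keystroke counts given in the application examples. The function name and the sample codes are hypothetical.

```python
FULL_LENGTH = 3  # codes per character, as in the application examples
END_KEY = " "    # marker key that closes a short code (assumed: the space bar)


def key_stream(codes):
    """Join per-character codes into one continuous stream of keystrokes."""
    parts = []
    for code in codes:
        if not 1 <= len(code) <= FULL_LENGTH:
            raise ValueError(f"code {code!r} must have 1 to {FULL_LENGTH} keys")
        # A full-length code needs no marker; a short one is closed with END_KEY.
        parts.append(code if len(code) == FULL_LENGTH else code + END_KEY)
    return "".join(parts)


stream = key_stream(["jyv", "rm", "u"])
print(repr(stream), len(stream))  # 'jyvrm u ' 8
```

Because every code either has the full length or ends in the marker key, the stream can be cut back into per-character codes without any other separator.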

Several application examples are given below to describe the invention further. They use the coding method of the Standard Code (a shape code) and its code book, with each character entered with 3 codes. Word-and-character input is shown alongside for comparison.

Example 1

(Sentence input) JYVRM U RXX A BFR VNJAE YA WKJ J BQEZZAWAL

No duplicate codes. Selected: "We will make a shining cog which never rusts."

Keystrokes: 39; average keystrokes per character: 2.78 (including the space bar; the same below).

(Word-and-character input) JYRM U RXXA BFRVNJAE YAWK J BZWA

Keystrokes: 32; average keystrokes per character: 2.29.

Example 2: "He sucks the wine in the bottle through a thin tube."

(Sentence input) RNNCW A XL ZQXMCCOSJVXQVUVBT J EUA

(Word selection) noun-pronoun class and numeral, not adapted; verb and adverb, adapted; bound morpheme; bound morpheme.

Selected: "propping up"; "wearing"; "bottle"; "wine".

Keystrokes: 34; average keystrokes per character: 2.83.
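The sentence-input keystroke figures in the two examples can be reproduced under one reading of the code strings, assumed here: each character takes up to 3 code keys, a shorter code is closed with one space, and a space printed directly after a full 3-key code is layout rather than a keystroke. The character counts (14 and 12) are not stated in the text; they follow from this reading.

```python
FULL_LENGTH = 3  # codes per character in the examples


def split_codes(stream):
    """Cut a key stream into per-character codes.

    A code ends after FULL_LENGTH keys or at a space, whichever comes first.
    Spaces that directly follow a full-length code are treated as layout.
    """
    codes, current = [], ""
    for key in stream:
        if key == " ":
            if current:
                codes.append(current)
                current = ""
        else:
            current += key
            if len(current) == FULL_LENGTH:
                codes.append(current)
                current = ""
    if current:
        codes.append(current)
    return codes


def keystrokes(codes):
    """Letters typed, plus one space for every code shorter than FULL_LENGTH."""
    return sum(len(c) + (1 if len(c) < FULL_LENGTH else 0) for c in codes)


EXAMPLE_1 = "JYVRM U RXX A BFR VNJAE YA WKJ J BQEZZAWAL"
EXAMPLE_2 = "RNNCW A XL ZQXMCCOSJVXQVUVBT J EUA"

for name, stream in (("Example 1", EXAMPLE_1), ("Example 2", EXAMPLE_2)):
    codes = split_codes(stream)
    total = keystrokes(codes)
    print(name, len(codes), total, f"{total / len(codes):.3f}")
# Example 1 14 39 2.786
# Example 2 12 34 2.833
```

Both totals match the stated 39 and 34 keystrokes. The Example 1 average of 2.786 is quoted in the text as 2.78, and the word-and-character figure of 2.29 likewise equals 32 keystrokes over the same 14 characters.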

As the above application examples show, the invention has the following advantages:

1. Easy to learn and use, and simple to operate. It demands very little of the operator, the input process is lighter than with existing character and word coding input methods, and it truly realizes an intelligent processing method of "teaching the computer the rules".

2. Good compatibility. It supports the word-and-character input of the original input method as well as sentence input. It suits shape-code input and also other Chinese character input modes such as phonetic-shape codes and phonetic codes, as well as speech input and character recognition.

3. Fast. The theoretical average is no more than 3 keystrokes per character, and the actual average (with word-and-character input used alongside) is about 2.5. Because the whole sentence is entered in sequence, at equal length and continuously, the operator need not keep watching the screen, select words or duplicate-code characters, or use "association" functions. Code taking is also greatly simplified and lightened, so input speed can be raised.

4. Novelty. Sentence input and computer understanding of natural language have so far all relied on grammatical, semantic and contextual analysis, that is, on traditional analysis of grammar or meaning with word-sense classification tags, which requires the computer to master countless rules and example sentences. The invention starts from the traditional classification of parts of speech and reclassifies according to the characteristics of computer recognition.

5. Inventiveness. The invention proposes a new, effective and simple part-of-speech classification scheme and a duplicate-code group word-order scheme. The classes are few and their characteristics clear, and they fully suit the way a computer segments and selects words during sentence input.

6. The software is very simple to implement. Unlike other intelligent software, it does not need a corpus of hundreds of thousands or even millions of words, which costs a great deal of manpower and many years. The invention seizes on function words as the key, combined with the revised code book, duplicate-code group word-order tags and a few part-of-speech tags; it turns the complex into the simple and cuts through the complexity at a stroke, creating very favorable conditions for the software implementation. It thereby offers the various Chinese character input systems a fast, low-cost upgrade path to the sentence input stage.

In summary, the invention is easy to learn, use and popularize; it is fast, efficient and compatible; its software is simple to implement and low in cost, and it helps upgrade the various existing Chinese character input systems. It is a highly general sentence input method for Chinese characters on computers.

Claims

1. A general sentence input method for a computer Chinese character input system, characterized in that it uses coding techniques and the delimiting function of function words, that is, the usage types of function words and their restrictions on notional words; function words serve as the basis for segmenting the sentence and selecting words; a corresponding code book is set up at the same time and, together with a duplicate-code group word-order scheme, supports and completes the sentence input process.

2. The general sentence input method for a computer Chinese character input system according to claim 1, characterized in that: the various function words and notional words, as well as bound morphemes, non-word characters and the like, are differentiated or combined into several classes according to their usage types and their adaptation relations with preceding or following words, and are tagged in the code book with suitable codes, for use by the computer in segmenting the sentence and selecting words during recognition.

3. The general sentence input method for a computer Chinese character input system according to claim 1, characterized in that: the duplicate-code group word-order scheme is a word-order tag assigned in the code book to each word of each group of duplicate-code words, the tag being used to distinguish word frequency and to select words.

4. The general sentence input method for a computer Chinese character input system according to claim 1, characterized in that: the sentence input process uses any of the various Chinese character input systems and coding methods, the coding method being left unchanged or suitably modified; during keyboard input of Chinese character codes the sentence is the unit, each Chinese character being entered into the computer with 1 to several codes, in sequence, at equal length and continuously; word-and-character input can be used alongside to reduce keystrokes.

5. The general sentence input method for a computer Chinese character input system according to claim 1, characterized in that: the mode of sentence input can be speech input or Chinese character recognition input.