CN1445640A

CN1445640A - Method for inputting Chinese holophrase into computers by using partial stroke

Info

Publication number: CN1445640A
Application number: CN 02104443
Authority: CN
Inventors: 郑方; 莫树联
Original assignee: Individual
Current assignee: Individual
Priority date: 2002-03-18
Filing date: 2002-03-18
Publication date: 2003-10-01
Anticipated expiration: 2022-03-18
Also published as: CN1187677C

Abstract

A method for inputting complete Chinese sentences into a computer using partial strokes. 23 or 40 feature strokes serve as the basic code elements and are assigned to keys on the keyboard; each Chinese character is coded by at most 4 feature strokes; and a Chinese language model, together with context information, automatically converts the input into a complete sentence. The method is easy to learn, fast, and accurate.

Description

Method for inputting Chinese holophrase into computers by using partial stroke

Technical field

The invention belongs to the technical field of Chinese character input methods for computers (including desktop computers, notebook computers, palmtop PCs, personal digital assistants, etc.), and relates in particular to Chinese character input methods for wireless communication devices such as mobile phones.

Background technology

Wireless handheld communication devices such as mobile phones and personal digital assistants (PDAs) are among the most popular technology products today, and their sales multiply every year. Chinese character input is essential when using these products. However, today's computing environments (desktop computers, notebook computers, palmtop PCs, personal digital assistants, wireless communication devices, etc.) are mostly designed around English, which makes Chinese input a comparatively complex and difficult problem. Chinese input on these products currently relies on handwriting or a keypad. Handwriting input requires writing out each whole character, which is slow and time-consuming. The most popular keyboard methods are based on radicals or on pinyin. Radical-based methods, such as Cangjie and the five-stroke (Wubi) input method, can be fairly fast, but becoming proficient takes considerable practice. The drawback of pinyin input is that the user must select among candidates, because there are too many homophones. In addition, the keyboards on these communication devices are quite small, which makes them inconvenient to use.

Among input methods for Chinese characters based on strokes and a keyboard, the five-stroke input method is popular. Its biggest problem, however, is that its way of decomposing characters does not match the stroke order people use when writing, and mastering the decomposition rules requires long professional training.

With input methods based on a handwriting pad, the user must write out all the strokes of a whole character before it can be entered into the computer or device. Errors are common when the character has many strokes, or when the user does not remember exactly how it is written.

Existing pinyin-based input methods need a Chinese language model to resolve the duplicate-code problem of one pronunciation corresponding to many characters.

A Chinese language model (CLM) exploits the collocation information between adjacent words in context. When a continuous, unspaced string of pinyin, strokes, or digits representing letters or strokes is converted into a Chinese character string (a sentence), the model computes the sentence with the maximum probability. The conversion to Chinese characters is thus automatic and needs no manual selection by the user, which avoids the duplicate-code problem of many characters sharing the same pinyin (or stroke string, or digit string).

The most commonly used CLM is the tri-gram (triple) language model, which gives the collocation probability P(c|a, b) for any three Chinese words a, b and c. Given a massive Chinese text corpus, the number of co-occurrences of any three words can be obtained by simple counting, and the collocation probability estimated from it. During the conversion from a pinyin string, stroke string, or digit string to Chinese characters, these probabilities are used to pick the best candidate among many according to the maximum-likelihood principle: when the feature representation string of the characters (such as a pinyin string, digit string, or stroke string) is mapped to a Chinese character string, the string with the maximum probability is chosen. In the tri-gram language model, the probability of occurrence of a Chinese character sequence is given by the following formula:

P(w_{1}, w_{2}, \cdots, w_{N}) = P(w_{1}) \cdot \prod_{n=2}^{N} P(w_{n} | w_{1}, \cdots, w_{n-1}) \qquad (1)

\approx P(w_{1}) \cdot P(w_{2} | w_{1}) \cdot \prod_{n=3}^{N} P(w_{n} | w_{n-2}, w_{n-1})

where the probability of occurrence of the triple (w_{n-2}, w_{n-1}, w_{n}), that is, P(w_{n} | w_{n-2}, w_{n-1}), is learned from a massive Chinese text corpus (called the training text).
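Formula (1) can be illustrated with a minimal Python sketch. The function name `sentence_probability`, the count dictionaries `uni`, `bi` and `tri`, and the toy counts in the usage note are illustrative assumptions, not part of the patent; probabilities are plain relative frequencies with no smoothing.

```python
def sentence_probability(words, uni, bi, tri):
    """Approximate P(w_1 ... w_N) by formula (1) using relative frequencies.

    uni, bi and tri map words, word pairs and word triples to their
    (hypothetical) occurrence counts in the training text.
    """
    total = sum(uni.values())
    if not words or total == 0:
        return 0.0
    # P(w_1)
    p = uni.get(words[0], 0) / total
    # P(w_2 | w_1)
    if len(words) >= 2:
        history = uni.get(words[0], 0)
        p *= bi.get((words[0], words[1]), 0) / history if history else 0.0
    # P(w_n | w_{n-2}, w_{n-1}) for every n >= 3
    for n in range(2, len(words)):
        history = bi.get((words[n - 2], words[n - 1]), 0)
        triple = (words[n - 2], words[n - 1], words[n])
        p *= tri.get(triple, 0) / history if history else 0.0
    return p
```

With the toy counts `uni = {"我": 2, "爱": 2, "北京": 1, "中国": 1}`, `bi = {("我", "爱"): 2, ("爱", "北京"): 1, ("爱", "中国"): 1}` and `tri = {("我", "爱", "北京"): 1, ("我", "爱", "中国"): 1}`, the sentence 我 爱 北京 scores (2/6) · (2/2) · (1/2) = 1/6, while any sentence containing a triple absent from the counts scores 0.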

Building an existing tri-gram language model involves several steps: (i) estimating probabilities for triples that never occur in the training text; (ii) reducing the model's storage size; (iii) decoding or searching, that is, using formula (1) to select the correct sentence quickly and accurately from a large number of duplicate-code candidates.

(i) Probability estimation for triples that do not occur

There are roughly 30,000 or more commonly used words in Chinese. The number of triples that can be formed from any three words is therefore on the order of 30,000^3; some of these triples can never occur, and some occur only rarely. Consequently, however large the training corpus is, there will always be some triples (a, b, c) that do not appear in it. If the probabilities of these triples are not treated specially, their estimated probability P(c|a, b) is 0, which makes the probability of the whole sentence 0. But triples that are not found in the corpus do not truly have probability 0; their probability of occurrence is merely relatively small. These triples should therefore be given small but reasonable probabilities, and different triples should be given different probabilities according to their circumstances. The traditional solution to the zero-probability problem is to estimate from the lower-order two-tuple probability P(c|b); this is the back-off algorithm. Back-off can be applied recursively: if P(c|b) is also 0, the estimate backs off further to P(c). To guarantee that the probabilities sum to 1, some probability mass must be discounted from the triples with non-zero probability and redistributed among the triples that do not occur. The shortcoming of this traditional back-off method is that it backs off to lower orders in only one direction, which makes the probability estimates for these unseen units insufficiently accurate.
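The recursive back-off described above can be sketched as follows. This is a deliberately simplified variant: it multiplies by a fixed penalty at each back-off step instead of discounting probability mass, so the resulting values do not sum to 1. The function name, the penalty value 0.4, and the count dictionaries are assumptions made for illustration only.

```python
def backoff_probability(a, b, c, uni, bi, tri, penalty=0.4):
    """Estimate P(c | a, b), backing off in one direction to lower orders.

    uni, bi and tri hold (hypothetical) occurrence counts of words,
    word pairs and word triples. The fixed penalty is an assumption;
    the traditional method uses discounting so that probabilities sum to 1.
    """
    total = sum(uni.values())
    if tri.get((a, b, c), 0) > 0:
        # The triple was seen: use its relative frequency.
        return tri[(a, b, c)] / bi[(a, b)]
    if bi.get((b, c), 0) > 0:
        # Back off to the two-tuple probability P(c | b).
        return penalty * bi[(b, c)] / uni[b]
    # Back off once more to the one-tuple probability P(c).
    return penalty * penalty * uni.get(c, 0) / total
```

Note that only the history is shortened here: the estimate for an unseen (a, b, c) never looks at how well a and b go together, which is the one-directional limitation the text points out.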

(ii) Reducing the model size

As noted in (i), a tri-gram language model is very large to store: even though most triples never occur, storing those that do still requires a great deal of space. In general, a Chinese language model with a vocabulary of 50K words needs 300 MB to 1 GB of storage. On devices with plenty of storage, such as PCs, the model need not be reduced; but it is impractical on devices with only tens of megabytes of storage or less. There are two reasons for this: one, obviously, is the storage space itself; the other is that the huge model makes the search process very time-consuming.

(iii) Decoding or searching

The purpose of the search is to map the feature representation string of the Chinese characters (such as a pinyin string, digit string, or stroke string) onto Chinese character sequences and, according to the maximum-likelihood criterion, find the best-matching sequence as the final result. Many matching "sentences" can arise in the process of mapping the feature representation string to single characters, then to words, and then to a sentence, because (1) one stroke sequence, pinyin sequence, or digit sequence is shared by several Chinese characters; (2) there are no explicit word boundaries between the words of a Chinese sentence; and (3) one sentence can be segmented into different word sequences depending on where the boundaries are placed (only one of which is optimal). In this situation it is impossible to enumerate all possible Chinese character sequences and compare their probabilities, so an efficient and accurate search algorithm is very important. The existing decoding method is the traditional dynamic programming algorithm. Its shortcomings are that it is not tailored to the characteristics of the specific application and uses only a single-level structure, so it is inflexible to port, its decoding efficiency is low, and its decoding results are not satisfactory.

Summary of the invention

The object of the invention is to overcome the shortcomings of the prior art by proposing a method for inputting whole Chinese sentences using partial strokes, together with its Chinese language model method. For each Chinese character the user only needs to type or handwrite the partial feature strokes in the order given by the character's feature stroke coding, and does not have to pick each character from the duplicate-code candidates during input; the whole sentence is converted automatically. The input method is easy to learn, conversion is fast, and conversion accuracy is very high, and it can be used in all kinds of computers and mobile communication devices.

The invention proposes a partial-stroke Chinese sentence input method for desktop computers, handheld electronic devices, mobile communication devices, and the like, applied to keyboard input devices, characterized in that it comprises the following steps:

1) adopt 23 basic code elements that the feature stroke is encode Chinese characters for computer, said code element comprises: ,-, Shu ,/, , mouthful, Qian, Contraband, ,/,,---,-Shu, Shu Shu ,/,＜, *, //, ,

2) Map the above 23 code elements onto corresponding keys of the keyboard of the device;

3) Divide the Chinese character into two parts (left and right, top and bottom, or outer and inner) and take at most two feature strokes from each part for coding, so that each character has at most 4 code elements; if the character cannot be split into two parts, take at most 4 code elements directly in order;

4) For the particular input device, set the number of code elements used per Chinese character, which may be 1, 2, 3 or 4;

5) carry out input to Chinese character by the mode of whole sentence, the feature stroke of Hanzi features order of strokes input setting number press in each Chinese character, utilizes Chinese language model related information based on context that whole sentence is changed out.When using big keyboard, be to improve performance, also can comprise 17 additional feature strokes :-, Shu, ,-,-/,-, Shu, Shu-, Shu/, Shu ,/-,/Shu ,/, , -, Shu, /.Thereby totally 40 feature strokes.

The invention also proposes a partial-stroke Chinese sentence input method for desktop computers, handheld electronic devices, mobile communication devices, and the like, applied to handwriting pad input devices, characterized in that it comprises the following steps:

1) Adopt 23 feature strokes as the basic code elements for encoding Chinese characters, the code elements comprising:

2) Divide the Chinese character into two parts (left and right, top and bottom, or outer and inner) and take at most two feature strokes from each part for coding, so that each character has at most 4 code elements; if the character cannot be split into two parts, take at most 4 code elements directly in order;

3) For the particular input device, set the number of code elements used per Chinese character, which may be 1, 2, 3 or 4;

4) Input the Chinese characters a whole sentence at a time: for each character, enter the set number of feature strokes in the character's feature stroke order, and use the Chinese language model to convert the whole sentence according to contextual information. To improve performance, 17 additional feature strokes may also be included:

for a total of 40 feature strokes.

The Chinese language model in the above methods comprises the following steps:

1) training the Chinese language model;

2) estimating probabilities for n-tuples that do not occur;

3) compressing the model's storage space, in the following steps. Step 1: examine the number of occurrences in the training text of all triples (tri-grams), two-tuples (bi-grams) and one-tuples (uni-grams) in the language model (collectively called n-grams, or n-tuples); keep those n-grams that occur more often and matter for model performance, and force the occurrence counts of the other n-grams to 0. Step 2: in the training text, the more often n-grams occur, the fewer such n-grams there are; the counts of rarely occurring n-grams therefore need to be stored with higher precision, whereas for frequently occurring n-grams very precise counts are unnecessary. The invention compresses the occurrence counts of uni-grams with a logarithmic warping curve, so that the model can be stored with a smaller bit width while losing essentially no information. Step 3: for the bi-grams kept in the model, whose occurrence counts are necessarily non-zero, do not record the exact counts; instead, sort the bi-grams that share the same history (that is, the same preceding word) from high to low by occurrence count. Over all bi-grams, compute the average probability of the n-grams ranked in the m-th position and build a code table for use during search. Step 4: reduce the indexing overhead by building a three-level index. A word number is represented with two bytes, 16 bits in total, and is divided into three parts; for example, the highest 10 bits form the first-level index, the middle 4 bits form the second-level index, and the last two bits form the third-level index. In this way the number of first-level index entries is effectively reduced from tens of thousands to several hundred, which reduces storage;

4) searching on the input feature stroke sequence to obtain the Chinese character string.

The step of estimating probabilities for n-tuples that do not occur in the Chinese language model may be as follows: when the training probability of the triple (a, b, c), formed by three consecutive words a, b and c, is 0, that is, when the triple does not occur in the training corpus, a two-way lower-order degeneration algorithm is used for the estimate; that is, the probability of the triple (a, b, c) is estimated by referring simultaneously to the training probabilities of the two-tuples (a, b) and (b, c). The process is recursive: if the training probability of a two-tuple (x, y) is 0, the two-way degeneration algorithm is applied again, estimating from the training probabilities of word x and word y simultaneously.

The method of compressing storage space in the Chinese language model may comprise the following steps:

1) Examine the number of occurrences in the training text of all n-grams in the language model, where n may be 3 (tri-gram, triple), 2 (bi-gram, two-tuple) or 1 (uni-gram, one-tuple); keep those n-grams that occur more often and matter for model performance, and force the occurrence counts of the other n-grams to 0;

2) In the training text, the more often n-grams occur, the fewer such n-grams there are; the counts of rarely occurring n-grams therefore need to be stored with higher precision, whereas for frequently occurring n-grams very precise counts are unnecessary. The invention compresses the occurrence counts of uni-grams with a logarithmic warping curve, so that the model can be stored with a smaller bit width while losing essentially no information;

3) For the bi-grams kept in the model, whose occurrence counts are necessarily non-zero, do not record the exact counts; instead, sort the bi-grams that share the same history (that is, the same preceding word) from high to low by occurrence count. Over all bi-grams, compute the average probability of the n-grams ranked in the m-th position and build a code table for use during search;

4) Reduce the indexing overhead by building a three-level index. A word number is represented with two bytes, 16 bits in total, and is divided into three parts that form the three index levels; for example, the highest 10 bits form the first-level index and the middle 4 bits form the second-level index. In this way the number of first-level index entries is effectively reduced from tens of thousands to several hundred, which reduces storage.

The search algorithm in the Chinese language model may comprise the following steps:

1) The search begins and the candidate search paths are cleared;

2) Obtain the code of one input feature stroke (from a handwriting pad, keyboard, or soft keyboard);

3) Use the input stroke to drive a state transition in the stroke-to-character tree;

4) Determine whether the set number of feature stroke codes for one Chinese character has been obtained (1, 2, 3 or 4, depending on the application); if not, go to step 2; if so, continue with step 5;

5) Obtain all candidates for this single character and drive the search state transitions forward in the lexical tree (the tree formed by organizing all words by character);

6) Determine whether a word boundary has been reached; if not, go to step 2; if so, continue with step 7;

7) Obtain all word candidates, append the different candidate words to the existing paths, and score each path according to formula (1);

8) Sort all paths from high to low by probability score;

9) Determine whether the input has ended; if not, go to step 2; if so, continue with step 10;

10) Obtain the whole-sentence candidate with the highest score;

11) The search for one whole sentence is finished.
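The eleven steps above can be sketched in Python as a small beam search. Everything concrete here is an assumption for illustration: the function name `decode`, the dictionary `code_to_chars` standing in for the stroke-to-character tree, the set `lexicon` standing in for the lexical tree, the beam width, the floor score for unseen word pairs, and the use of bi-gram log scores in place of the tri-gram scores of formula (1).

```python
import math

def decode(codes, codes_per_char, code_to_chars, lexicon, bigram_logp, beam=20):
    """Convert a flat sequence of feature-stroke codes into the best word sequence."""
    # All prefixes of lexicon words: a flat stand-in for the lexical tree.
    prefixes = {w[:i] for w in lexicon for i in range(1, len(w) + 1)}
    # Step 1: one empty path (log score, finished words, unfinished word).
    paths = [(0.0, (), "")]
    buffer = []
    for code in codes:                      # step 2: read one stroke code
        buffer.append(code)                 # step 3: advance within the character
        if len(buffer) < codes_per_char:    # step 4: character not complete yet
            continue
        candidates = code_to_chars.get(tuple(buffer), [])   # step 5
        buffer = []
        extended = []
        for score, words, pending in paths:
            for ch in candidates:
                grown = pending + ch
                if grown not in prefixes:   # leaves the lexical tree: drop it
                    continue
                extended.append((score, words, grown))
                if grown in lexicon:        # step 6: a word boundary is possible
                    prev = words[-1] if words else "<s>"
                    step = bigram_logp.get((prev, grown), math.log(1e-6))
                    extended.append((score + step, words + (grown,), ""))  # step 7
        extended.sort(key=lambda p: p[0], reverse=True)     # step 8
        paths = extended[:beam]
    # Steps 9 to 11: input has ended; keep only paths that end on a word boundary.
    finished = [p for p in paths if p[2] == ""]
    return max(finished, key=lambda p: p[0])[1] if finished else ()
```

In this sketch a path whose current word is still unfinished carries no score for that word yet, so a real implementation would probably need a look-ahead score before pruning; with a generous beam and a toy lexicon the simplification does not matter.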

The invention has the following features:

1) The stroke coding scheme for Chinese characters is systematic and concise: each character needs at most 4 "feature strokes" to be expressed, although some duplicate codes naturally exist. The "feature stroke" is defined by this invention and differs from the traditional stroke.

2) The Chinese language model handles zero-probability estimation well, reduces the model size, and improves search speed and accuracy, while its structure remains flexible.

3) When Chinese characters are entered using feature strokes, input can proceed a whole sentence at a time. For each character the user only needs to type or handwrite the set number of feature strokes in the order of the character's feature stroke coding (depending on the particular application, the method can be designed so that each character uses only its first 1, 2, 3 or 4 feature strokes, not necessarily all of them). The user does not have to pick each character from the duplicate-code candidates during input; the Chinese language model converts the whole sentence automatically.

4) The partial feature strokes are taken in order, matching people's stroke-order habits, so the method is easy to learn.

5) The Chinese language model is small: all of its data occupies less than 1 MB, so the technique can be used for Chinese character input on most small handheld devices.

6) Conversion accuracy is very high: the first choice is correct more than 97% of the time, so the user seldom needs to select character by character among many candidates.

7) Conversion is fast: in offline tests more than 300 Chinese characters can be converted per second.

8) The multi-level data structure design makes it easy to combine the Chinese language model with pinyin, digits, or other features to implement a Chinese character input method.

Beneficial effects of the invention:

According to the national Chinese character standard, the two-level character set contains more than 6,700 Chinese characters. If four feature strokes are used to represent a character, on average every 1.2 characters share the same feature stroke sequence. If each character is represented only by its first two feature strokes, on average 12 characters share the same feature stroke sequence. Ordinary stroke-based methods therefore often require the user to pick the desired character from a candidate list; the invention frees users from this tedious selection process. When the user needs to enter a Chinese word or sentence, he or she only needs to enter the partial feature strokes of each character (for example 1, 2, 3 or 4 of them) in turn. During input, the system exploits the Chinese language model and the mutual information of the character context to pick the most suitable output automatically, checking the entered feature stroke sequence against its linguistic knowledge. Once all strokes have been entered, the best candidate for the whole word or sentence is produced automatically. In summary, the invention obtains the correct candidate for a whole word or sentence from the partial feature stroke sequences of the input characters, with a very small model and very high accuracy.

Description of drawings

Fig. 1 is a hierarchical block diagram illustrating the invention in various applications.

Fig. 2 is a flowchart of the search algorithm for stroke input.

Fig. 3 is an application example of the stroke input method.

Embodiment

The content and principle of the partial-stroke whole-sentence Chinese input method proposed by the invention are described in detail below with reference to the accompanying drawings and embodiments:

(1) Definition of the feature strokes

A total of 40 feature strokes are defined (see Table 1). They are:

(1) Single strokes: there are 5, namely "horizontal (-)", "vertical (Shu)", "left-falling (/)", "dot ()" and "turning ()". A single stroke is either a first stroke that cannot be joined with the next stroke, as in "in vain", "dashing forward" and "boat", or a last stroke that is left over; for example, the character "I" is "left-falling-horizontal, vertical-rising, turning-left-falling", so its final "dot" is a single stroke.

"Left-falling" and "rising" are the same stroke: as long as the direction is the same, the stroke is classified as (/).

"Right-falling" and "dot" are the same stroke: as long as the direction is the same, the stroke is classified as ().

"Turning" covers turns in every direction; strokes that are not straight lines, as in "sentence", "five", "fast" and "bow (bottom)", are all classified as ().

"Vertical" includes "vertical hook (亅)" and "vertical rising ( )": as long as the main body is vertical, the stroke is classified as vertical (Shu). For example, characters containing a plain vertical include "OK" (left side); characters containing "vertical with left hook" include "hand", "I (left side)", "row (right side)" and "what (right side)"; and characters containing "vertical with right rising" include "very" and "people". Note: the right side of "I" is not a "vertical" but a "hook", because its main body is not vertical.

(2) combination stroke: totally 23, they are " " "-" " Shu " " " "-" "---" "-Shu " "-/" "- " " Shu " " Shu-" " Shu Shu " " Shu/" " Shu " "/-" "/Shu " " // " "/ " " " all single Chinese characters of " -" " Shu " " /" " " are all according to stroke order lined up stroke, per two strokes be one group form a feature stroke (unless the first stroke of a Chinese character can't make up with next record, or the tail pen does not have other strokes and can make up).Be " casting aside horizontal ", " perpendicular carrying ", " folding is cast aside ", " point " as " I " word; " speech " word is " point is horizontal ", " horizontal ", " mouth ".

(3) Shape strokes: there are 12; these strokes are based on topological "shape". They are:

(a) "mouth": all characters containing a complete square, such as "state", "in", "four" and "field", are treated as "mouth"; anything irregular or incomplete is not included, such as "ear", "order", "being total to" and "mother".

(b) "

": for example " moon ", " treasured ", " hat ", " just (left side) ", " rain " etc.

(c) "

": for example " corpse ", " family ", " huge (inside) ", " bow (first stroke of a Chinese character) " etc.Annotate: the receipts pen of " bow " be " folding " rather than " ".

(d) " Contraband ": for example " district ", " Europe ", " huge (outside) " etc.

(e) " Qian ": for example " mountain ", " village ", " twenty " etc.

(f) " * ": " father ", " literary composition ", " from " etc. for obviously being all the classifying as of " * " type " * ".

(g) "/": follow the example of into by the limit, the word of stroke or shape, as " head ", " fire ", " rice ", " adopting " (" adopting " is " apostrophe＜", " point cast aside/" then get other stroke again) etc.

(h) "〉": follow the example of into by the limit, the word of stroke or shape, as " ice ", " water ", " cold " etc.

(i) "/": follow the example of into by the limit, the word of stroke or shape, as " people ", " going into ", " fire ", " little ", " sky " etc.

(j) "＜": follow the example of into by the limit, the word of stroke or shape, as " water ", " asking ", " holding " etc.

(k) "

": follow the example of into by the limit, the word of stroke or shape, as " red ", " profound " etc.

(l) "

": follow the example of into by the limit, the word of stroke or shape, as " mistake ", " court of a feudal ruler " etc.

The shape strokes are also called priority strokes, because whenever such a shape is present it is combined first. For example, the character "little" would originally be taken as "vertical-dot, left-falling", but because a priority stroke is present it is taken as "vertical, dot-left-falling".

Table 1. Feature strokes of the invention, their codes, handwritten forms, and example characters containing each feature stroke

Little at keyboard, as mobile phone, or do not have keyboard, as PDA, application in, then only with wherein 23 feature strokes, promptly block letter is: ,-, Shu ,/, , mouthful,

Qian, Contraband, ,/,,---,-Shu, Shu Shu ,/,＜, *, //, , Handwritten form is:

In this case the accuracy of the input method is slightly lower. In applications where the keyboard is large (as on a PC) or a handwriting pad can be used, the input method uses all 40 feature strokes: the 23 feature strokes above plus the following 17. In printed form they are:

-, Shu, ,-,-/,-, Shu, Shu-, Shu/,

Shu ,/-,/Shu ,/, , -, Shu, /; Handwritten form is:

If the printed-form feature strokes need to be mapped onto an existing keyboard, the mapping can be set freely as required; for example, when an existing standard keyboard is used as the input device, the 26 letter keys and the Shift key for upper and lower case can be used. The handwritten-form feature strokes are simply written on a handwriting pad as required.

(2) Local coding method for Chinese characters

(1) If the Chinese character is a single whole that cannot be split into a left-right, top-bottom, or outer-inner structure, take the first 4 feature strokes in stroke order as the character's local code sequence; if there are fewer than 4, use as many as there are. For example, the character "I" is "left-falling-horizontal", "vertical-rising", "turning-left-falling", "dot"; the character "speech" is "dot-horizontal", "horizontal", "mouth". The feature stroke code sequence of the character can then be obtained from Table 1.

(2) If the Chinese character can be split into a left-right, top-bottom, or outer-inner structure, split it into two parts and, in stroke order, take the first two feature strokes of the first part and the first two of the second part as the character's local code sequence. If the first part has only one feature stroke, the remaining three feature strokes are taken in turn from the second part. If there are fewer than 4 feature strokes in total, use as many as there are.

Examples:

A) left and right sides structure: from left to right,, get earlier, just need get for not enough two yards by a left side with ways of writing is identical at ordinary times

The part on the right.As: rich :-Shu;-,

---, Shu afterwards ,-Shu, just need not get again.Enough :/, mouthful; / ,, afterwards/, just need not get again.

B) up-down structure:, get top part earlier from top to bottom, as: hang: mouthful;

Shu bears: /; / ,

C) outer inner structure: from outside to inside, get the part of outside earlier, as: state: mouthful;---, Shu-, the moon:

---
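The selection rules (1) and (2) of this section can be sketched in Python as follows. The function name `local_code`, the list-of-lists input format, and the `codes_per_char` parameter (the per-device setting of 1 to 4 codes) are assumptions for illustration; the feature strokes themselves are treated as opaque labels.

```python
def local_code(parts, codes_per_char=4):
    """Return the local code sequence of one character.

    parts: a list holding one list of feature strokes (a character that
           cannot be split) or two lists (first part, second part),
           each in writing order.
    codes_per_char: the number of codes configured for the device (1 to 4).
    """
    if not 1 <= codes_per_char <= 4:
        raise ValueError("codes_per_char must be 1, 2, 3 or 4")
    if len(parts) == 1:
        # Rule (1): take the first four feature strokes, or fewer if that is all.
        full = parts[0][:4]
    else:
        first, second = parts
        # Rule (2): at most two from the first part ...
        head = first[:2]
        # ... then fill up to four from the second part (three if head has one).
        full = head + second[:4 - len(head)]
    return full[:codes_per_char]
```

For instance, with placeholder labels, `local_code([["A"], ["D", "E", "F", "G"]])` takes one stroke from the first part and three from the second, giving `["A", "D", "E", "F"]`.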

(3) Training steps of the Chinese language model

Step 1: select a vocabulary of suitable size and process it appropriately according to the stroke codes of the individual characters;

Step 2: segment the massive text data into a word sequence according to the vocabulary;

Step 3: analyze the word sequence statistically to obtain the triples (a, b, c) that occur and their occurrence counts;

Step 4: smooth the model, that is, estimate probabilities for n-grams whose probability is 0.
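Steps 2 and 3 can be sketched in Python as below. The patent does not say which segmentation technique is used; greedy forward maximum matching is swapped in here purely as an assumption, and the names `segment`, `count_trigrams` and the `max_len` parameter are hypothetical.

```python
from collections import Counter

def segment(text, vocabulary, max_len=4):
    """Greedy forward maximum matching (an assumed segmentation technique).

    Characters that start no vocabulary word become one-character words.
    """
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                words.append(piece)
                i += size
                break
    return words

def count_trigrams(sentences, vocabulary):
    """Segment each sentence and count the word triples that occur."""
    counts = Counter()
    for text in sentences:
        words = segment(text, vocabulary)
        for i in range(len(words) - 2):
            counts[tuple(words[i:i + 3])] += 1
    return counts
```

Triples that never appear simply have no entry in the returned counter, which is the zero-probability case that step 4 has to smooth.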

(4) Two-way degeneration method for estimating zero probabilities in the Chinese language model

In view of the shortcomings of the traditional solution to the zero-probability problem, when P(c|a, b) is 0 the invention considers not only P(c|b) but also P(b|a); likewise, when a two-tuple probability P(x|y) is 0, it considers not only P(x) but also P(y). The two-way degeneration algorithm thus re-estimates zero probabilities more accurately.
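The text states which lower-order probabilities are consulted but gives no formula for combining them. The Python sketch below is therefore only one hypothetical reading: it combines the two lower-order estimates by a geometric mean scaled by a fixed penalty. The geometric mean, the penalty value 0.4, the function names, and the count dictionaries are all assumptions, and the result is not normalized to sum to 1.

```python
import math

def bigram_estimate(y, x, uni, bi, total, penalty=0.4):
    """Estimate P(x | y); if the pair (y, x) is unseen, use both P(x) and P(y)."""
    if bi.get((y, x), 0) > 0:
        return bi[(y, x)] / uni[y]
    p_x = uni.get(x, 0) / total
    p_y = uni.get(y, 0) / total
    # Assumed combination rule: penalized geometric mean of the two unigrams.
    return penalty * math.sqrt(p_x * p_y)

def trigram_estimate(a, b, c, uni, bi, tri, total, penalty=0.4):
    """Estimate P(c | a, b); if the triple is unseen, use both P(c | b) and P(b | a)."""
    if tri.get((a, b, c), 0) > 0:
        return tri[(a, b, c)] / bi[(a, b)]
    left = bigram_estimate(a, b, uni, bi, total, penalty)    # P(b | a)
    right = bigram_estimate(b, c, uni, bi, total, penalty)   # P(c | b)
    # Assumed combination rule again; recursion happens inside bigram_estimate.
    return penalty * math.sqrt(left * right)
```

Unlike one-directional back-off, an unseen triple whose first two words rarely occur together receives a lower estimate here than one whose first two words are a common pair, which is the extra information the two-way scheme is meant to use.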

(5) Algorithm for reducing the model size in the Chinese language model

The algorithm comprises the following steps:

The 1st step: check the occurrence counts in the training text of all triples (tri-grams), two-tuples (bi-grams) and single words (uni-grams) in the language model (collectively, n-grams); n-grams that occur more often and play an important role in model performance are retained, and the occurrence counts of all other n-grams are forced to 0;

The 2nd step: the higher an occurrence count in the training text, the fewer n-grams there are with that count. Counts of rarely occurring n-grams therefore need to be stored with higher precision, whereas very precise counts need not be stored for frequently occurring n-grams. The present invention compresses uni-gram occurrence counts along a logarithmic curve, so that the model can be stored with a lower bit width with essentially no loss of information;

The 3rd step: for the bi-grams retained in the model, whose occurrence counts are necessarily non-zero, the concrete count is not recorded; instead, bi-grams sharing the same history (that is, the same preceding word) are sorted from high to low by occurrence count. Over all bi-grams, the average probability of the n-grams ranked in the m-th position is computed, and a code table is built for use during search;

The 4th step: reduce the index overhead by building a three-level index. A word number is represented with two bytes and divided into three parts. For example, the highest 10 bits form the first-level index, the middle 4 bits form the second-level index, and the last two bits form the third-level index. In this way the number of first-level index entries is effectively reduced from tens of thousands to several hundred, which reduces storage.
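The 2nd and 4th steps can be sketched as follows. The 10/4/2 bit split follows the example in the text; the 8-bit code width, the `max_count` ceiling and all function names are assumptions chosen for illustration, not values given in the patent.

```python
import math

def compress_count(count, bits=8, max_count=1_000_000):
    """Map an occurrence count onto a `bits`-wide code along a logarithmic curve."""
    if count <= 0:
        return 0
    levels = (1 << bits) - 1
    scaled = math.log1p(min(count, max_count)) / math.log1p(max_count)
    # Non-zero counts never collapse to code 0.
    return max(1, round(scaled * levels))

def expand_count(code, bits=8, max_count=1_000_000):
    """Approximate inverse of compress_count."""
    if code == 0:
        return 0
    levels = (1 << bits) - 1
    return round(math.expm1(code / levels * math.log1p(max_count)))

def split_word_id(word_id):
    """Split a 16-bit word number into (level-1, level-2, level-3) index parts."""
    if not 0 <= word_id <= 0xFFFF:
        raise ValueError("word number must fit in two bytes")
    return word_id >> 6, (word_id >> 2) & 0xF, word_id & 0x3

def join_word_id(level1, level2, level3):
    """Rebuild the 16-bit word number from its three index parts."""
    return (level1 << 6) | (level2 << 2) | level3

parts = split_word_id(12345)
```

Small counts land on distinct codes while large counts share codes, which matches the stated aim of keeping more precision where counts are low.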

(6) Efficient and accurate search algorithm for the Chinese language model

The present invention proposes a multi-layer tree structure and a character-synchronous lattice search algorithm to solve the search problem of the Chinese language model. The structure has three layers (see Fig. 1): the top layer is the word-sequence layer, constrained by the Chinese language model; the second layer is the word layer, constrained by the lexical tree; the bottom layer is the Chinese character layer, constrained by the "Chinese character to character feature (phonetic, stroke or digit)" tree. The lower two layers can be regarded as independent or treated as a whole. With this structure, the search from stroke sequence to sentence proceeds synchronously with the Chinese characters, and the sentence probability accumulates word by word as words appear. The search algorithm reaches a search speed of 300 words per second; accuracy exceeds 97% when each Chinese character uses two stroke codes, and exceeds 99% when each Chinese character uses four codes.

As can be seen from Fig. 1, the advantage of this multi-layer structure is its good extensibility. By mapping from letters, digits or strokes to phonetic spellings or Chinese characters, the input system can easily be ported to various systems that take letter, digit or stroke input.

The search algorithm is shown in Fig. 2 and comprises the following steps:

The 1st step: the search begins, and the candidate search paths are cleared;

The 2nd step: obtain the code of one input feature stroke (from a handwriting pad, keyboard or soft keyboard);

The 3rd step: use the input stroke to advance the state transition in the stroke-Chinese character tree;

The 4th step: judge whether the set number of feature stroke codes for one Chinese character has been obtained (depending on the application, the number can be set to 1, 2, 3 or 4); if not, go to the 2nd step; if so, continue to the 5th step;

The 5th step: obtain all candidates for this single character, and advance the search state transition in the lexical tree (the tree formed by organizing all words by character is called the lexical tree);

The 6th step: judge whether a word boundary has been reached; if not, go to the 2nd step; if so, continue to the 7th step;

The 7th step: obtain all word candidates, append the different candidate words to the existing paths, and score each path according to formula (1);

The 8th step: sort all paths from high to low by probability score;

The 9th step: judge whether the input has ended; if not, go to the 2nd step; if so, continue to the 10th step;

The 10th step: obtain the highest-scoring whole-sentence candidate;

The 11th step: the search for one whole sentence ends.
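The steps above can be sketched as a simple beam search. This is a simplification, not the patented implementation: the stroke-Chinese character tree is replaced by a dictionary lookup, the lexical tree by a prefix scan over a word set, and each element of `codes` is assumed to already hold the complete stroke code of one character. The names `decode`, `toy_logprob`, the keys `k1`/`k2` and the placeholder characters are all hypothetical.

```python
import math

def decode(codes, char_table, lexicon, logprob, beam=50):
    """Return the best-scoring word sequence for a list of per-character codes.

    char_table: code -> candidate characters sharing that code.
    lexicon:    set of known words (strings of characters).
    logprob:    logprob(history, word) -> log P(word | up to two previous words).
    """
    # A path is (accumulated log probability, finished words, word in progress).
    paths = [(0.0, (), "")]
    for code in codes:
        extended = []
        for score, words, partial in paths:
            for ch in char_table.get(code, ()):
                prefix = partial + ch
                # Keep building while some longer word starts with this prefix.
                if any(len(w) > len(prefix) and w.startswith(prefix) for w in lexicon):
                    extended.append((score, words, prefix))
                # Word boundary reached: score the path with the new word appended.
                if prefix in lexicon:
                    new_score = score + logprob(words[-2:], prefix)
                    extended.append((new_score, words + (prefix,), ""))
        # Sort paths from high to low and keep the best ones.
        extended.sort(key=lambda p: p[0], reverse=True)
        paths = extended[:beam]
    complete = [p for p in paths if p[2] == ""]
    if not complete:
        return []
    return list(max(complete, key=lambda p: p[0])[1])

# Toy data: Latin letters stand in for Chinese characters (illustrative only).
UNIGRAM = {"AB": 0.5, "X": 0.2, "Y": 0.2, "A": 0.1}

def toy_logprob(history, word):
    """Toy model that ignores the history; a real model would use formula (1)."""
    return math.log(UNIGRAM[word])

char_table = {"k1": ["A", "X"], "k2": ["B", "Y"]}
best = decode(["k1", "k2"], char_table, set(UNIGRAM), toy_logprob)
```

Because unfinished words carry no score for the pending word, they rank above finished paths of similar quality; the generous default beam is a crude way to keep this sketch from pruning good candidates too early.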

Application example of the present invention:

Take as an example the case where each Chinese character has only two feature strokes.

First, the language model is obtained by following the training steps for the Chinese language model.

Suppose the sentence to be input is "the individual master worker in Shanghai overcomes difficulties". Under the coding of the present invention, the feature stroke sequences corresponding to these Chinese characters are shown in Table 2.

Table 2: Feature stroke sequences and same-code Chinese characters for each character of "the individual master worker in Shanghai overcomes difficulties"

Thus, every Chinese character would require four input strokes, followed by a selection among the characters sharing the same code. For example, to input "", the four strokes "/, mouthful ,/, " must be entered, and a choice then made between the candidates "chimney" and "". If only the first two strokes "/, mouthful" are entered, there are more candidates, such as "the white white soul of clear and bright emperor highland spring ...".

The situation with the input method of the present invention is examined below; see Fig. 3. In the figure, the first row of each box is the candidate character or word, the middle row is the word number, and the bottom row is the accumulated log probability. Each row is sorted from left to right in descending order of log probability.

(i) First input the first two strokes "Shu-,-" of "on", obtaining the four Chinese characters that share these first two strokes: "upward, Ji, mentally disturbed, Bei".

(ii) Without making a selection, continue by inputting the first two strokes ",/" of "sea". Although many characters share the same first two strokes as "sea", taking into account the collocation between "Shu-,-" and ",/" yields several possible words, "Shanghai, rise, upstream, show, on flow, trace back, on send, oil", and two possible single characters corresponding to these two strokes, "river, husky".

(iii) Continue by inputting the first two strokes "/, mouthful" of "". Because "sea" does not form a word with it, several single characters corresponding to these two strokes are given, together with the possible word "river mouth" for the case where the preceding character is "river" and this character is "mouth".

(iv) Continuing in this way, as more strokes are input, the algorithm selects the required sentence "the individual master worker in Shanghai overcomes difficulties" (see the Chinese characters marked with thick black frames in Fig. 3); this sentence has the maximum probability.

Claims

1. A method for inputting whole Chinese sentences using partial strokes on a desktop computer, hand-held electronic device, mobile communication device or the like, applied to keyboard input devices, characterized in that it comprises the following steps: 1) 23 feature strokes are adopted as the basic code elements for computer encoding of Chinese characters, said code elements comprising:,-, Shu ,/, , mouthful

Qian, Contraband, ,/,,---,-Shu, Shu Shu ,/,＜, *, //, ,

5) Chinese characters are input in whole-sentence mode: for each Chinese character, the set number of feature strokes is input in the character's feature stroke order, and the Chinese language model uses contextual information to convert the whole sentence.

2. The method for inputting whole Chinese sentences using partial strokes as claimed in claim 1, characterized in that it further comprises 17 additional feature strokes, for a total of 40 feature strokes :-, Shu, ,-,-/,-, Shu, Shu-, Shu/, Shu ,/-, / Shu ,/, , -, Shu, /.

3. A method for inputting whole Chinese sentences using partial strokes on a desktop computer, hand-held electronic device, mobile communication device or the like, applied to handwriting pad input devices, characterized in that it comprises the following steps:

2) each Chinese character is divided into two parts (left and right, upper and lower, or outer and inner); at most two feature strokes are taken from each part for encoding, so that each Chinese character has at most 4 code symbols; if a Chinese character cannot be split into two parts, at most 4 code symbols are taken directly in order;

4) Chinese characters are input in whole-sentence mode: for each Chinese character, the set number of feature strokes is input in the character's feature stroke order, and the Chinese language model uses contextual information to convert the whole sentence.

4. The method for inputting whole Chinese sentences using partial strokes as claimed in claim 3, characterized in that it further comprises additional feature strokes:

5. The method for inputting whole Chinese sentences using partial strokes as claimed in claim 1, 2, 3 or 4, characterized in that said Chinese language model comprises the following steps:

1) training Chinese language model;

2) adopt the two-way back-off estimation algorithm to estimate probabilities for n-tuples that did not occur;

3) compress the model storage space, comprising: the 1st step: check the occurrence counts in the training text of all n-tuples in the language model, where n is 3, 2 and 1; n-tuples that occur more often and play an important role in model performance are retained, and the occurrence counts of the other n-tuples are forced to 0; the 2nd step: compress the occurrence counts of one-tuples along a logarithmic curve, so that the model is stored with a lower bit width; the 3rd step: for the two-tuples retained in the model, whose occurrence counts are necessarily non-zero, the concrete count is not recorded; instead, two-tuples sharing the same preceding word are sorted from high to low by occurrence count, the average probability of those ranked in the m-th position is computed over all two-tuples, and a code table is built for use during search; the 4th step: reduce the index overhead, the word number being represented with two bytes, 16 bits in total, and divided into three parts that form a three-level index;

6. The method for inputting whole Chinese sentences using partial strokes as claimed in claim 5, characterized in that said step of estimating probabilities for n-tuples that did not occur is: when the training probability of the triple (a, b, c), in which the three words a, b, c occur consecutively, is 0, that is, when the triple does not occur in the training corpus, the two-way low-order back-off algorithm is used for estimation, that is, the probability of the triple (a, b, c) is estimated with reference to the training probabilities of both two-tuples (a, b) and (b, c); this process is recursive, that is, when the training probability of a two-tuple (x, y) is 0, the two-way back-off algorithm is applied and the estimate is made from the training probabilities of both word x and word y.

7. The method for inputting whole Chinese sentences using partial strokes as claimed in claim 5, characterized in that said search algorithm comprises the following steps:

1) the search begins, and the candidate search paths are cleared;

2) obtain the code of one feature stroke input from a handwriting pad, keyboard or soft keyboard;

4) judge whether the set number of feature stroke codes for one Chinese character has been obtained; if not, go to step 2; if so, continue to step 5;

5) obtain all candidates for this single character, and advance the search state transition in the lexical tree;

7) obtain all word candidates, append the different candidate words to the existing paths, and score each path by the following formula

P(w_{1}, w_{2}, \cdots, w_{N}) = P(w_{1}) \cdot P(w_{2} | w_{1}) \cdot \prod_{n=3}^{N} P(w_{n} | w_{n-2}, w_{n-1});

8) sort all paths from high to low by probability score;

10) obtain the highest-scoring whole-sentence candidate;

11) the search for one whole sentence ends.