CN1445640A - Method for inputting Chinese holophrase into computers by using partial stroke - Google Patents


Publication number
CN1445640A
Authority
CN
China
Prior art keywords
chinese
stroke
chinese character
shu
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 02104443
Other languages
Chinese (zh)
Other versions
CN1187677C (en)
Inventor
郑方
莫树联
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CNB021044430A priority Critical patent/CN1187677C/en
Publication of CN1445640A publication Critical patent/CN1445640A/en
Application granted granted Critical
Publication of CN1187677C publication Critical patent/CN1187677C/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current

Landscapes

  • Document Processing Apparatus (AREA)

Abstract

A method for inputting complete Chinese sentences into a computer using partial strokes, in which 23 or 40 feature strokes serve as the basic code elements and are assigned to keys on the keyboard, each Chinese character is encoded with 4 or fewer feature strokes, and a Chinese language model together with context information is used to convert a complete sentence automatically. Its advantages are that it is easy to master, fast, and accurate.

Description

Method for inputting Chinese holophrase into computers by using partial stroke
Technical field
The invention belongs to the technical field of Chinese character input methods for computers (including desktop computers, notebook computers, palmtop computers, personal digital assistants, etc.), and in particular to Chinese character input methods for wireless communication devices (such as mobile phones).
Background art
Wireless and handheld communication devices such as mobile phones and personal digital assistants (PDAs) are among the most popular products today, and their sales multiply every year. When these products are used, Chinese character input is essential. However, today's computing environments (including desktop computers, notebook computers, palmtop computers, personal digital assistants, and wireless communication devices) are mostly based on English, which makes Chinese input a comparatively complex and difficult problem. Chinese input on these products currently relies on handwriting or a keypad. Handwriting is limited to writing out whole characters, which is time-consuming and slow. For keyboard input, the currently popular methods are radical-based and pinyin-based. Radical-based input, such as Cangjie and the five-stroke (Wubi) input method, can be fairly fast, but becoming proficient takes considerable practice. The defect of pinyin input is that the user must select characters, because there are too many homophones. In addition, the keyboards on these communication devices are quite small, which makes them inconvenient to use.
Among Chinese character input methods based on strokes and a keyboard, the five-stroke input method is popular. Its greatest problem, however, is that its way of decomposing characters does not match the stroke order people follow when writing, and the decomposition skill needed during input takes long professional training to master.
In Chinese character input methods based on a handwriting pad, the user must write out all the strokes of a whole character before that character can be entered into the computer or device. When the character has many strokes, or when the user does not remember exactly how it is written, errors are very common.
In existing pinyin-based input methods, a Chinese language model is needed to resolve the duplicate-code problem of one sound corresponding to many characters.
A Chinese language model (CLM) exploits the collocation information between adjacent words in the context. When a continuous, unspaced string of pinyin, strokes, or digits representing letters or strokes has to be converted into a Chinese character string (i.e., a sentence), the model can compute the sentence with maximum probability. Conversion to Chinese characters thus becomes automatic, the user need not select manually, and the duplicate-code problem of many Chinese characters sharing the same pinyin (or stroke string, or digit string) is avoided.
The most commonly used CLM is the trigram (Tri-gram) language model, which gives the collocation probability P(c|a, b) among any three Chinese words a, b and c. Given a massive amount of Chinese text, simple counting yields the number of times any three words occur together, from which the collocation probability is estimated. During the conversion of a pinyin string, stroke string or digit string into Chinese characters, this can be used to pick the best candidate from numerous candidates according to the maximum-likelihood principle. When mapping from a feature representation string of Chinese characters (such as a pinyin string, digit string or stroke string) to a suitable Chinese character string, the maximum-likelihood criterion means choosing the maximum probability. In the trigram language model, the probability of a Chinese word sequence is expressed by the following formula:

$$P(w_1, w_2, \cdots, w_N) = P(w_1) \cdot \prod_{n=2}^{N} P(w_n \mid w_1, \cdots, w_{n-1}) \approx P(w_1) \cdot P(w_2 \mid w_1) \cdot \prod_{n=3}^{N} P(w_n \mid w_{n-2}, w_{n-1}) \tag{1}$$

where the probability of occurrence of the triple (w_{n-2}, w_{n-1}, w_n), namely P(w_n | w_{n-2}, w_{n-1}), is learned from massive Chinese text (called the training text).
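As a minimal sketch of formula (1), assuming a pre-segmented training corpus and plain maximum-likelihood estimates from counts (no smoothing), the sentence probability could be computed roughly as follows. The function names `train_counts` and `sentence_log_prob` are illustrative and do not come from the patent.

```python
import math
from collections import Counter

def train_counts(sentences):
    """Count unigrams, bigrams and trigrams in pre-segmented sentences."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for words in sentences:
        for i, w in enumerate(words):
            uni[w] += 1
            if i >= 1:
                bi[(words[i - 1], w)] += 1
            if i >= 2:
                tri[(words[i - 2], words[i - 1], w)] += 1
    return uni, bi, tri

def sentence_log_prob(words, uni, bi, tri):
    """log P(w_1..w_N) following formula (1); -inf if any factor is unseen."""
    total = sum(uni.values())
    logp = 0.0
    for i, w in enumerate(words):
        if i == 0:      # P(w_1)
            num, den = uni[w], total
        elif i == 1:    # P(w_2 | w_1)
            num, den = bi[(words[0], w)], uni[words[0]]
        else:           # P(w_n | w_{n-2}, w_{n-1})
            history = (words[i - 2], words[i - 1])
            num, den = tri[history + (w,)], bi[history]
        if num == 0 or den == 0:
            return float("-inf")  # unseen n-gram: zero probability without smoothing
        logp += math.log(num / den)
    return logp
```

Working in log space avoids numeric underflow on long sentences. The `-inf` result for unseen n-grams is exactly the zero-probability problem that the text goes on to discuss.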
An existing trigram language model involves the following steps: (i) estimating probabilities for trigrams that did not occur in the training text; (ii) reducing the storage size of the model; (iii) decoding or searching, i.e., using formula (1) to select the correct sentence quickly and accurately from a large number of duplicate-code candidates.
(i) Probability estimation for trigrams that did not occur
Chinese has about 30,000 or more commonly used words. The number of trigrams that any three words can form is therefore on the order of 30,000^3, of which some can never occur and some occur only rarely. Consequently, no matter how large the training corpus is, there are always some trigrams (a, b, c) that do not appear in it. If the probabilities of these trigrams are not treated specially, their estimated probability P(c|a, b) is 0, which in turn makes the probability of the whole sentence 0. Yet the trigrams not found in the corpus do not all truly have probability 0; their probability of occurrence is merely comparatively small. These trigrams should therefore be given comparatively small but reasonable probabilities, and different trigrams should be given different probabilities according to the circumstances. The traditional way of resolving zero probabilities is to estimate them from the lower-order bigram probability P(c|b); this is the backoff algorithm. The backoff algorithm can be recursive: if P(c|b) is also 0, it backs off further to P(c). To guarantee that the probabilities sum to 1, some probability mass must be discounted from the trigrams with nonzero probability and redistributed among the trigrams that did not occur. The shortcoming of this traditional backoff method is that it backs off to the lower order in one direction only, which makes the probability estimates for unseen units insufficiently accurate.
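The one-directional backoff described above can be sketched as follows. This is a simplified illustration only: it uses a fixed backoff weight `alpha` (an assumption, in the style of so-called "stupid backoff") instead of the discounting and redistribution the text describes, so the returned scores are not normalized probabilities. The function name `backoff_score` is hypothetical.

```python
def backoff_score(a, b, c, uni, bi, tri, alpha=0.4):
    """Score of word c after history (a, b), backing off to lower orders.

    uni, bi, tri map words / word tuples to counts from the training text.
    alpha is an assumed fixed penalty applied at each backoff level.
    """
    # Trigram seen: use its relative frequency P(c | a, b).
    if tri.get((a, b, c), 0) > 0 and bi.get((a, b), 0) > 0:
        return tri[(a, b, c)] / bi[(a, b)]
    # Otherwise back off to the bigram P(c | b).
    if bi.get((b, c), 0) > 0:
        return alpha * bi[(b, c)] / uni[b]
    # Finally back off to the unigram P(c).
    total = sum(uni.values())
    return alpha * alpha * uni.get(c, 0) / total
```

Note that only the history is shortened, (a, b) to (b) to nothing, which is the single direction of backoff that the text criticizes.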
(ii) Reducing the model size
As stated in (i), a trigram language model is very large to store: even though most trigrams never occur, storing those that do occur still requires a very large space. In general, a Chinese language model with a vocabulary of 50K words needs 300 MB to 1 GB of storage. On devices with a large amount of storage, such as PCs, the model size need not be reduced; but on devices with only tens of megabytes of storage or even less, using such a model is impractical. There are two reasons for this: one is obviously the storage space itself, and the other is that the huge storage makes the search process very time-consuming.
(iii) Decoding or searching
The purpose of searching is to map the feature representation string of Chinese characters (such as a pinyin string, digit string or stroke string) onto Chinese character sequences and, according to the maximum-likelihood criterion, to find the best-matching sequence as the final result. Because (1) a single stroke sequence, pinyin sequence or digit sequence is shared by several Chinese characters; (2) there are no explicit word boundaries between the words of a Chinese sentence; and (3) a sentence can be split into different word sequences depending on where it is cut (only one of which is optimal), mapping from the feature representation string to single characters and then to a sentence produces many matching "sentences". In this situation it is impossible to list all possible Chinese character sequences and compare their probabilities, so an efficient and accurate search algorithm is very important. The existing decoding method is a traditional dynamic programming algorithm. Its deficiencies are that it is not tailored to the characteristics of a specific application and uses only a single-level structure, so it is inflexible to port, its decoding efficiency is low, and its decoding results are not ideal.
Summary of the invention
The object of the invention is to overcome the shortcomings of the prior art by proposing a method for inputting whole Chinese sentences into a computer using partial strokes, together with its Chinese language model method. For each Chinese character, only some local feature strokes need to be keyed in or handwritten, in the order of the character's feature stroke encoding. The user does not need to pick characters one by one from the duplicate-code candidates during input, because the whole sentence is converted automatically. The input method is easy to master, its conversion speed is fast, its conversion accuracy is very high, and it can be used in various computers and mobile communication devices.
The invention proposes a partial-stroke Chinese sentence input method for desktop computers, handheld electronic devices, mobile communication devices and the like, applied to keyboard input devices, characterized in that it comprises the following steps:
1) 23 feature strokes are adopted as the basic code elements for encoding Chinese characters, the code elements comprising: ,-, 丨 ,/, , 口, 凵, 匚, ,/,,---,-丨, 丨丨 ,/,<, *, //, ,
Figure A0210444300073
2) the above 23 code symbols are mapped onto corresponding keys of the keyboard of the device;
3) a Chinese character is divided into two parts (left and right, top and bottom, or outer and inner), at most two feature strokes are taken from each part for encoding, and each Chinese character has at most 4 code symbols; if the character cannot be split into two parts, at most 4 code symbols are taken directly in order;
4) for different input devices, the number of code symbols used per Chinese character is set to one of 1, 2, 3 or 4;
5) Chinese characters are input a whole sentence at a time: for each character, the set number of feature strokes is entered in the character's feature stroke order, and the Chinese language model uses contextual information to convert the whole sentence.
When a large keyboard is used, 17 additional feature strokes can also be included to improve performance: -, 丨, ,-,-/,-, 丨, 丨-, 丨/, 丨 ,/-,/丨 ,/, , -, 丨, /. This gives 40 feature strokes in total.
The invention also proposes a partial-stroke Chinese sentence input method for desktop computers, handheld electronic devices, mobile communication devices and the like, applied to handwriting-pad input devices, characterized in that it comprises the following steps:
1) 23 feature strokes are adopted as the basic code elements for encoding Chinese characters, the code elements comprising:
[Figures A0210444300078 to A02104443000726: images of the 23 feature strokes]
2) a Chinese character is divided into two parts (left and right, top and bottom, or outer and inner), at most two feature strokes are taken from each part for encoding, and each Chinese character has at most 4 code symbols; if the character cannot be split into two parts, at most 4 code symbols are taken directly in order;
3) for different input devices, the number of code symbols used per Chinese character is set to one of 1, 2, 3 or 4;
4) Chinese characters are input a whole sentence at a time: for each character, the set number of feature strokes is entered in the character's feature stroke order, and the Chinese language model uses contextual information to convert the whole sentence.
To improve performance, 17 additional feature strokes can also be included:
[Figures A0210444300081 to A02104443000815: images of the 17 additional feature strokes]
This gives 40 feature strokes in total.
The Chinese language model in the above method comprises the following steps:
1) training the Chinese language model;
2) estimating probabilities for n-grams that did not occur;
3) compressing the model storage space, in the following steps. Step 1: check the number of occurrences in the training text of all trigrams (tri-gram), bigrams (bi-gram) and unigrams (uni-gram) in the language model (collectively called n-grams); the n-grams that occur more often and play an important role in model performance are retained, and the occurrence counts of the other n-grams are forced to 0. Step 2: the more often n-grams occur in the training text, the fewer such n-grams there are; therefore the counts of n-grams that occur rarely need higher precision when stored, whereas for n-grams that occur often it is unnecessary to store very precise counts. The invention compresses the occurrence counts of uni-grams with a logarithmic curve, so that the model is stored with a lower bit width while essentially no model information is lost. Step 3: for the bi-grams retained in the model, whose occurrence counts are certainly nonzero, the concrete counts are not recorded; instead, the bi-grams sharing the same history (i.e., the same preceding word) are sorted from high to low by occurrence count. Over all bi-grams, the average probability of the n-grams ranked in the m-th position is computed and a code table is built for use during search. Step 4: the indexing overhead is reduced by building a three-level index. A word number is represented with two bytes, 16 bits in total, and is divided into three parts. For example, the highest 10 bits form the first-level index, the middle 4 bits form the second-level index, and the last two bits form the third-level index. In this way the number of first-level index entries is effectively reduced from tens of thousands to hundreds, which reduces the storage required;
4) searching on the input feature stroke sequence to obtain the Chinese character string.
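Two parts of compression step 3) above lend themselves to a short sketch: the logarithmic compression of occurrence counts and the 10/4/2-bit split of a 16-bit word number. The sketch below is illustrative only. The 8-bit code width, the `max_count` ceiling and all function names are assumptions, and the patent does not specify the exact logarithmic curve.

```python
import math

def compress_count(count, bits=8, max_count=1 << 24):
    """Map an occurrence count to a small code on a logarithmic scale."""
    if count <= 0:
        return 0                      # code 0 is reserved for "did not occur"
    levels = (1 << bits) - 1
    code = round(levels * math.log1p(count) / math.log1p(max_count))
    return max(1, min(levels, code))

def expand_count(code, bits=8, max_count=1 << 24):
    """Approximate inverse of compress_count."""
    if code == 0:
        return 0
    levels = (1 << bits) - 1
    return round(math.expm1(code * math.log1p(max_count) / levels))

def split_word_id(word_id):
    """Split a 16-bit word number into (level 1, level 2, level 3) index parts."""
    if not 0 <= word_id < (1 << 16):
        raise ValueError("word number must fit in 16 bits")
    return word_id >> 6, (word_id >> 2) & 0xF, word_id & 0x3

def join_word_id(level1, level2, level3):
    """Rebuild the 16-bit word number from its three index parts."""
    return (level1 << 6) | (level2 << 2) | level3
```

On the logarithmic scale, small counts keep fine absolute resolution while large counts are stored only approximately, which matches the precision argument in step 2. With a 10-bit first level there are at most 1024 first-level entries.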
The step of estimating probabilities for unseen n-grams in the Chinese language model can be as follows: when the training probability of the trigram (a, b, c), in which the three words a, b and c occur together, is 0, i.e., the trigram does not occur in the corpus, a two-way low-order backoff algorithm is used for the estimate; that is, the probability of the trigram (a, b, c) is estimated by referring simultaneously to the training probabilities of the bigrams (a, b) and (b, c). This process is recursive: if the training probability of a bigram (x, y) is 0, the two-way backoff algorithm is applied again, and the estimate refers simultaneously to the training probabilities of word x and word y.
The method of compressing the storage space in the Chinese language model can comprise the following steps:
1) check the number of occurrences in the training text of all n-grams in the language model, where n can be 3 (tri-gram), 2 (bi-gram) or 1 (uni-gram); the n-grams that occur more often and play an important role in model performance are retained, and the occurrence counts of the other n-grams are forced to 0;
2) the more often n-grams occur in the training text, the fewer such n-grams there are; therefore the counts of n-grams that occur rarely need higher precision when stored, whereas for n-grams that occur often it is unnecessary to store very precise counts. The invention compresses the occurrence counts of uni-grams with a logarithmic curve, so that the model is stored with a lower bit width while essentially no model information is lost;
3) for the bi-grams retained in the model, whose occurrence counts are certainly nonzero, the concrete counts are not recorded; instead, the bi-grams sharing the same history (i.e., the same preceding word) are sorted from high to low by occurrence count. Over all bi-grams, the average probability of the n-grams ranked in the m-th position is computed and a code table is built for use during search;
4) the indexing overhead is reduced by building a three-level index. A word number is represented with two bytes, 16 bits in total, and is divided into three parts that form the three index levels; for example, the highest 10 bits form the first-level index and the middle 4 bits form the second-level index. In this way the number of first-level index entries is effectively reduced from tens of thousands to hundreds, which reduces the storage required.
The search algorithm in the Chinese language model can comprise the following steps:
1) the search begins and the candidate search paths are cleared;
2) the code of one input feature stroke is obtained (from a handwriting pad, keyboard or soft keyboard);
3) the input stroke drives a state transition in the stroke-to-character tree;
4) it is determined whether the set number of feature stroke codes for one Chinese character has been obtained (the number can be set to 1, 2, 3 or 4 depending on the application); if not, go to step 2; if so, continue with step 5;
5) all candidates for this single character are obtained, and they drive the search state transition forward in the lexical tree (the tree formed by organizing all words by their characters is called the lexical tree);
6) it is determined whether a word boundary has been reached; if not, go to step 2; if so, continue with step 7;
7) all word candidates are obtained, and each existing path is extended with the different candidate words and scored by formula (1);
8) all paths are sorted from high to low by probability score;
9) it is determined whether the input has ended; if not, go to step 2; if so, continue with step 10;
10) the whole-sentence candidate with the highest score is obtained;
11) the search for one whole sentence is finished.
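The search steps above can be approximated by a small beam search. The sketch below is deliberately simplified and rests on several assumptions: it works on single characters rather than on a lexical tree of words, it receives one complete stroke code per character, and it takes the scoring function as a parameter. The names `decode`, `code_to_chars`, `log_prob` and `beam` are hypothetical.

```python
def decode(codes, code_to_chars, log_prob, beam=5):
    """Return the best character string for a sequence of per-character codes.

    codes         -- list of stroke codes, one complete code per character
    code_to_chars -- dict mapping a code to its candidate characters
    log_prob      -- function(history_tuple, char) -> log probability,
                     where history_tuple holds up to two previous characters
    beam          -- number of partial paths kept after each character
    """
    paths = [((), 0.0)]                      # step 1: start from one empty path
    for code in codes:                       # steps 2 to 4: one character code at a time
        candidates = code_to_chars.get(code, [])
        if not candidates:
            raise KeyError(f"no character for code {code!r}")
        # step 7: extend every path with every candidate and score it
        extended = [
            (hist + (ch,), score + log_prob(hist[-2:], ch))
            for hist, score in paths
            for ch in candidates
        ]
        # step 8: sort paths from high to low score and keep the best ones
        extended.sort(key=lambda path: path[1], reverse=True)
        paths = extended[:beam]
    best, _ = paths[0]                       # step 10: highest-scoring sentence
    return "".join(best)
```

Keeping several paths instead of only the best one lets a later character overturn an earlier choice, which a purely greedy selection cannot do.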
The invention has the following features:
1) The stroke encoding scheme for Chinese characters is systematic and concise: each Chinese character needs at most 4 "feature strokes" to be expressed, although some duplicate codes naturally exist. The "feature stroke" is defined by the invention and differs from the traditional stroke.
2) The Chinese language model handles zero-probability estimation well, reduces the model size, improves search speed and accuracy, and has a flexible structure.
3) When Chinese characters are input with feature strokes, input can proceed a whole sentence at a time. For each character, only the set number of feature strokes needs to be keyed in or handwritten in the order of the character's feature stroke encoding (depending on the specific application, the design may use only the first 1, 2, 3 or 4 feature strokes of each character, and not necessarily all of them). The user does not need to pick characters one by one from the duplicate-code candidates during input, because the Chinese language model converts the whole sentence automatically.
4) The local feature strokes are taken in order, in keeping with people's stroke-order habits, so the method is easy to master.
5) The Chinese language model is small, with all data under 1 MB, so the technique can be applied to Chinese character input on most small handheld devices.
6) The conversion accuracy is very high, with first-choice accuracy above 97%, so the user seldom needs to select character by character among numerous candidates.
7) The conversion speed is fast: more than 300 Chinese characters per second can be converted in offline tests.
8) The multi-level data structure design makes it easy to combine the Chinese language model with pinyin, digits or other features to implement Chinese character input methods.
Beneficial effects of the invention:
According to the Chinese character standard promulgated by the state, the second-level character set has more than 6,700 Chinese characters. If four feature strokes are used to represent a character, on average every 1.2 characters share the same feature stroke sequence. If only the first two feature strokes are used to represent each character, on average 12 characters share the same feature stroke sequence. Ordinary stroke-based methods therefore often require the user to pick the desired character from a candidate list, whereas the invention frees the user from this tedious selection process. When the user needs to input a Chinese word or sentence, he only needs to enter the local feature strokes of each character (for example 1, 2, 3 or 4 of them) in succession. During input, the system exploits the Chinese language model and the mutual context information of the characters, checks the feature stroke sequence entered so far against its linguistic knowledge, and automatically picks the most suitable output. After all strokes have been entered, the best candidate for the whole word or sentence is given automatically. In summary, the invention can obtain the correct candidate for a whole word or sentence from the local feature stroke sequences of the characters, and the model of the system is very small while its accuracy is very high.
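The averages quoted above (about 1.2 characters per four-stroke code and about 12 per two-stroke code) are ratios of the number of characters to the number of distinct codes. A minimal sketch of that calculation follows. The function name and the toy data in the usage note are hypothetical, and the patent's actual code table is not reproduced here.

```python
from collections import defaultdict

def average_collisions(char_codes, prefix_len):
    """Average number of characters sharing one code of the given length.

    char_codes -- dict mapping each character to its tuple of feature strokes
    prefix_len -- how many leading feature strokes are used as the code
    """
    groups = defaultdict(set)
    for ch, code in char_codes.items():
        groups[tuple(code[:prefix_len])].add(ch)
    return len(char_codes) / len(groups)
```

For example, with four placeholder characters whose codes are `("a","b","c")`, `("a","b","d")`, `("a","e")` and `("f",)`, the average is 1.0 for four-stroke codes and 2.0 when only the first stroke is used. Shorter codes mean more collisions for the language model to resolve.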
Description of the drawings
Fig. 1 is a hierarchical framework diagram illustrating the invention in various applications.
Fig. 2 is a flow chart of the search algorithm for stroke input.
Fig. 3 is an application example of the stroke input method.
Embodiments
The content and principle of the method for inputting whole Chinese sentences into a computer using partial strokes proposed by the invention are described in detail below with reference to the drawings and embodiments:
(1) Definition of the feature strokes
A total of 40 feature strokes are defined (see Table 1). They are:
(1) Single strokes: there are 5, namely "horizontal (-)", "vertical (丨)", "left-falling (/)", "dot ()" and "turn ()". A single stroke arises, for example, when a starting stroke cannot be joined to the next stroke, as in "in vain", "dashing forward" and "boat"; or when one stroke is left over at the end: the character "I" is "left-falling plus horizontal", "vertical plus rising", "turn plus left-falling", and the final "dot" stands as a single stroke.
"Left-falling" and "rising" are the same stroke; as long as they run in the same direction they are classified as (/).
"Right-falling" and "dot" are the same stroke; as long as they run in the same direction they are classified as ().
"Turn" includes turns in all directions; as in "sentence", "five", "fast" and "bow (bottom)", anything that is not a straight line is classified as ().
"Vertical" includes "vertical hook (亅)" and "vertical rising ( )"; as long as the main body is vertical, the stroke is classified as vertical (丨). For example, a character containing a plain vertical is "OK" (left side); characters containing "vertical with left hook" are "hand", "I (left side)", "row (right side)", "what (right side)", etc.; and characters containing "vertical with right rise" are "very", "people", etc. Note: the right side of "I" is not "vertical" but "hook", because its main body is not vertical.
(2) Combination strokes: 23 in total. They are " " "-" "丨" " " "-" "---" "-丨" "-/" "-" "丨" "丨-" "丨丨" "丨/" "丨" "/-" "/丨" "//" "/" " " "-" "丨" "/" " ". The strokes of every single Chinese character are lined up in stroke order, and every two strokes are grouped to form one feature stroke (unless the first stroke cannot be combined with the next stroke, or the last stroke has no other stroke to combine with). For example, the character "I" is "left-falling plus horizontal", "vertical plus rising", "turn plus left-falling", "dot"; the character "speech" is "dot plus horizontal", "horizontal plus horizontal", "mouth".
(3) Shape strokes: 12 in total. These strokes are based on topological "shape". They are:
(a) "口": all characters containing a complete square, such as "state", "in", "four" and "field", are treated as "口"; anything irregular or incomplete is not included, such as "ear", "order", "being total to", "mother", etc.
(b) "[Figure A0210444300112]": for example "moon", "treasured", "hat", "just (left side)", "rain", etc.
(c) "[Figure A0210444300113]": for example "corpse", "family", "huge (inside)", "bow (first stroke)", etc. Note: the final stroke of "bow" is a "turn" rather than " ".
(d) "匚": for example "district", "Europe", "huge (outside)", etc.
(e) "凵": for example "mountain", "village", "twenty", etc.
(f) "*": anything that is clearly of the "*" type, as in "father", "literary composition", "from", etc., is classified as "*".
(g) "/": characters in which this stroke or shape is taken by its sides, such as "head", "fire", "rice", "adopting" ("adopting" is "left-falling plus dot <", "dot plus left-falling /", and then the other strokes are taken), etc.
(h) "〉": characters in which this stroke or shape is taken by its sides, such as "ice", "water", "cold", etc.
(i) "/": characters in which this stroke or shape is taken by its sides, such as "people", "going into", "fire", "little", "sky", etc.
(j) "<": characters in which this stroke or shape is taken by its sides, such as "water", "asking", "holding", etc.
(k) "[Figure A0210444300115]": characters in which this stroke or shape is taken by its sides, such as "red", "profound", etc.
(l) "[Figure A0210444300116]": characters in which this stroke or shape is taken by its sides, such as "mistake", "court of a feudal ruler", etc.
Shape strokes are also called priority strokes, because wherever such a shape is present it is combined first. Take the character "little": it would originally be taken as "vertical plus dot", "left-falling", but because a priority stroke is present it is taken as "vertical", "dot plus left-falling".
Table 1: Feature strokes of the invention, their codes, their handwritten strokes, and example characters containing each feature stroke
In applications where the keyboard is small, as on a mobile phone, or where there is no keyboard, as on a PDA, only 23 of the feature strokes are used. In printed form they are: ,-, 丨 ,/, , 口, [Figure A0210444300123], 凵, 匚, ,/,,---,-丨, 丨丨 ,/,<, *, //, , In handwritten form they are:
[Figures A0210444300128 to A02104443001313: handwritten forms of the 23 feature strokes]
In this case the accuracy of the input method is slightly lower. In applications where the keyboard is large, as on a PC, or where a handwriting pad can be used, the input method uses all 40 feature strokes: in addition to the 23 feature strokes above, it includes the following 17 feature strokes. In printed form they are:
-, 丨, ,-,-/,-, 丨, 丨-, 丨/, 丨 ,/-,/丨 ,/, , -, 丨, /. In handwritten form they are:
[Figures A02104443001314 to A02104443001330: handwritten forms of the 17 additional feature strokes]
If the printed-form feature strokes need to be mapped onto an existing keyboard, the mapping can be set freely as required; for example, when an existing standard keyboard is used as the input device, the 26 letter keys together with the Shift key for upper and lower case can be used. The handwritten-form feature strokes are simply written directly on a handwriting pad as required.
(2) the local code method of Chinese character
(1) if Chinese character is an integral body, about can't splitting into, up and down or outer inner structure, order of strokes is got the local code sequence of preceding 4 feature strokes as this Chinese character so; If not enough 4, have what just with what as the local code sequence.Be " casting aside horizontal ", " perpendicular carrying ", " folding is cast aside ", " point " as " I " word; " speech " word is " point is horizontal ", " horizontal ", " mouth ".Press table 1 then, can obtain the feature stroke encoding sequence of Chinese character.
(2) if about Chinese character can split into, up and down or outer inner structure, split into two parts so, according to stroke order get first and the local code sequence of preceding two feature strokes second portion respectively as Chinese character.If first has only a feature stroke, its excess-three feature stroke takes out in turn from second portion so.If not enough 4 of the number of feature stroke, have what just with what as the local code sequence.
Examples:
A) Left-right structure: proceed from left to right, as in ordinary writing. The left part is taken first; when it supplies fewer than two codes, the shortfall is taken from the right part. For example, "rich": -Shu; -, [Figure A02104443001331], ---; the Shu and -Shu that follow need not be taken. "Enough": /, mouth; /, ,; the / that follows need not be taken.
B) Up-down structure: proceed from top to bottom, taking the upper part first. For example, "hang": mouth; [Figure A02104443001332], Shu. "Bears": /; / ,
C) Outer-inner structure: proceed from outside to inside, taking the outer part first. For example, "state": mouth; ---, Shu-. "The moon": [Figure A02104443001333], ---
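The two coding rules above can be sketched as a small function. This is only an illustration: the stroke names (`pie`, `kou`, `dian`, and so on) are hypothetical labels, the decomposition of a character into parts is assumed to be given, and the case where the second part is the short one is not covered by the text, so the sketch simply returns fewer codes there.

```python
from typing import List, Optional, Sequence


def local_code(first: Sequence[str],
               second: Optional[Sequence[str]] = None,
               max_len: int = 4) -> List[str]:
    """Return the local code sequence of one character.

    first  -- feature strokes of the whole character (unsplittable case)
              or of its first part (left, upper or outer part)
    second -- feature strokes of the second part, or None if the
              character cannot be split
    """
    if not second:
        # Rule (1): take the first 4 feature strokes, or fewer if that is all.
        return list(first[:max_len])
    # Rule (2): at most two strokes from the first part ...
    head = list(first[:2])
    # ... and the rest from the second part (two, or three if head has one).
    tail = list(second[:max_len - len(head)])
    return head + tail


# Hypothetical stroke labels, loosely following the "enough" example.
print(local_code(["pie", "kou"], ["pie", "dian", "pie"]))
```

Truncating the result to the first 1, 2, 3 or 4 entries would give the shorter codes that the method allows for different input devices.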
(3) Training steps of the Chinese language model
Step 1: Select a vocabulary of suitable size, and process the vocabulary appropriately according to the stroke codes of the individual Chinese characters;
Step 2: Segment a massive text corpus into words according to the vocabulary, forming word sequences;
Step 3: Perform statistical analysis on the word sequences to obtain the triples (a, b, c) that occur and their occurrence counts;
Step 4: Smooth the model, that is, estimate probabilities for the n-grams whose probability is 0.
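Steps 2 and 3 can be sketched as follows. The text does not specify the segmentation algorithm, so forward maximum matching is used here purely as an assumed stand-in; the function names and the toy vocabulary are likewise illustrative.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def segment(text: str, vocab: Set[str], max_word_len: int = 4) -> List[str]:
    """Forward maximum matching (an assumption, not the patent's own segmenter)."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


def count_ngrams(sentences: Iterable[str], vocab: Set[str]) -> Tuple[Counter, Counter, Counter]:
    """Count uni-grams, bi-grams and tri-grams over the segmented sentences."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for sentence in sentences:
        w = segment(sentence, vocab)
        uni.update(w)
        bi.update(zip(w, w[1:]))
        tri.update(zip(w, w[1:], w[2:]))
    return uni, bi, tri


uni, bi, tri = count_ngrams(["上海的工人"], {"上海", "工人"})
print(tri)
```

The resulting counters hold the occurrence counts of the triples (a, b, c) and of the lower-order n-grams that the smoothing in Step 4 relies on.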
(4) Two-way back-off estimation of zero probabilities in the Chinese language model
To overcome the shortcomings of the traditional ways of handling zero probabilities, when P(c|a, b) is 0 the present invention considers not only P(c|b) but also P(b|a); likewise, when a bigram probability P(x|y) is 0, it considers not only P(x) but also P(y). The two-way back-off algorithm thereby re-estimates zero probabilities more accurately.
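The idea can be sketched as below. The text says only that both lower-order probabilities are consulted; it does not say how they are combined. The discounted geometric mean, the `discount` and `floor` constants and the class name are therefore assumptions made for illustration, and the resulting values are not renormalised into a proper distribution.

```python
import math
from typing import Dict, Tuple


class TwoWayBackoff:
    """Illustrative two-way back-off estimator over raw n-gram counts."""

    def __init__(self, uni: Dict[str, int], bi: Dict[Tuple[str, str], int],
                 tri: Dict[Tuple[str, str, str], int],
                 discount: float = 0.4, floor: float = 1e-8):
        self.uni, self.bi, self.tri = uni, bi, tri
        self.total = sum(uni.values())
        self.discount, self.floor = discount, floor  # assumed constants

    def p1(self, w: str) -> float:
        """Unigram probability, floored so unseen words stay above zero."""
        return max(self.uni.get(w, 0) / self.total, self.floor)

    def p2(self, a: str, b: str) -> float:
        """Estimate P(b | a); if unseen, consult both P(a) and P(b)."""
        n = self.bi.get((a, b), 0)
        if n > 0:
            return n / self.uni[a]
        return self.discount * math.sqrt(self.p1(a) * self.p1(b))

    def p3(self, a: str, b: str, c: str) -> float:
        """Estimate P(c | a, b); if unseen, consult both P(b | a) and P(c | b)."""
        n = self.tri.get((a, b, c), 0)
        if n > 0:
            return n / self.bi[(a, b)]
        return self.discount * math.sqrt(self.p2(a, b) * self.p2(b, c))


model = TwoWayBackoff({"a": 2, "b": 2, "c": 1},
                      {("a", "b"): 2, ("b", "c"): 1}, {})
print(model.p3("a", "b", "c"))
```

The recursion mirrors the description: an unseen trigram falls back on two bigram estimates, and an unseen bigram in turn falls back on two unigram estimates.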
(5) Algorithm for reducing the model size of the Chinese language model
It comprises the following steps:
Step 1: Check the number of occurrences in the training text of all trigrams (tri-gram), bigrams (bi-gram) and single words (uni-gram) in the language model, collectively called n-grams. The n-grams that occur more often and play an important role in model performance are retained, and the occurrence counts of the other n-grams are forced to 0;
Step 2: The more often n-grams occur in the training text, the fewer such n-grams there are. The counts of rarely occurring n-grams therefore need to be stored with higher precision, while for frequently occurring n-grams it is unnecessary to store very precise counts. The present invention compresses the uni-gram occurrence counts with a logarithmic companding curve, so that the model can be stored with a lower bit width while losing essentially no information;
Step 3: For the bi-grams retained in the model, the occurrence count is necessarily non-zero. The specific count is not recorded; instead, the bi-grams sharing the same history (that is, the same preceding word) are sorted by occurrence count from high to low. Over all bi-grams, the average probability of the n-grams ranked in the m-th position is computed and a code table is built, for use during search;
Step 4: Reduce the index overhead by building a three-level index. A word number is represented with two bytes and is divided into three parts. For example, the highest 10 bits form the first-level index, the middle 4 bits form the second-level index, and the last 2 bits form the third-level index. In this way, the number of first-level index entries is effectively reduced from several tens of thousands to several hundred, thereby reducing the storage required.
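Steps 2 and 4 lend themselves to a short sketch. The 10/4/2 bit split is taken from the example in Step 4; the exact shape of the logarithmic curve is not given in the text, so the `log1p`-based quantiser, the 8-bit default and the function names are assumptions.

```python
import math
from typing import Tuple


def compress_count(count: int, max_count: int, bits: int = 8) -> int:
    """Quantise a count on a logarithmic curve: fine steps for small counts."""
    if count < 0 or max_count <= 0:
        raise ValueError("count must be >= 0 and max_count > 0")
    levels = (1 << bits) - 1
    return round(levels * math.log1p(min(count, max_count)) / math.log1p(max_count))


def expand_count(code: int, max_count: int, bits: int = 8) -> float:
    """Approximate inverse of compress_count."""
    levels = (1 << bits) - 1
    return math.expm1(code / levels * math.log1p(max_count))


def split_word_id(word_id: int) -> Tuple[int, int, int]:
    """Split a 16-bit word number into (10-bit, 4-bit, 2-bit) index parts."""
    if not 0 <= word_id < (1 << 16):
        raise ValueError("word number must fit in two bytes")
    return word_id >> 6, (word_id >> 2) & 0xF, word_id & 0x3


print(compress_count(1000, 10**6), split_word_id(75))
```

Under these assumptions a count of up to one million fits in one byte with a relative error of a few percent, and the first-level index has at most 1024 entries.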
(6) Efficient and accurate search algorithm for the Chinese language model
The present invention proposes a multi-layer tree structure and a character-synchronous lattice search algorithm to solve the search problem of the Chinese language model. The structure has three layers (see Fig. 1): the top layer is the word-sequence layer, constrained by the Chinese language model; the second layer is the word layer, constrained by the lexical tree; the bottom layer is the Chinese character layer, constrained by the "Chinese character / character feature (pinyin, stroke or digit)" tree. The last two layers can be regarded as independent or treated as a whole. With this structure, the search from the stroke sequence to the sentence is synchronous with the Chinese characters, and the sentence probability is accumulated word by word as words appear. The search algorithm can reach a speed of 300 words per second; with two stroke codes per Chinese character the accuracy exceeds 97%, and with four codes per character it exceeds 99%.
As can be seen from Fig. 1, the advantage of this multi-layer structure is its good extensibility. By mapping letters, digits or strokes to pinyin or Chinese characters, the input system can easily be ported to a variety of systems that take letter, digit or stroke input.
The search algorithm is shown in Fig. 2 and comprises the following steps:
Step 1: The search begins; the candidate search paths are cleared;
Step 2: Obtain the code of one feature stroke input (from a handwriting pad, keyboard or soft keyboard);
Step 3: Use the input stroke to advance the state of the stroke-to-character tree;
Step 4: Determine whether the set number of feature stroke codes for one Chinese character has been obtained (depending on the application, this number can be set to 1, 2, 3 or 4); if not, go to Step 2; if so, continue to Step 5;
Step 5: Obtain all candidates for this single character, and advance the search state of the lexical tree (the tree formed by organizing all words by their characters is called the lexical tree);
Step 6: Determine whether a word boundary has been reached; if not, go to Step 2; if so, continue to Step 7;
Step 7: Obtain all word candidates, append the different candidate words to the existing paths, and score each resulting path according to formula (1);
Step 8: Sort all paths by probability score from high to low;
Step 9: Determine whether the input has ended; if not, go to Step 2; if so, continue to Step 10;
Step 10: Obtain the highest-scoring whole-sentence candidate;
Step 11: One whole-sentence search is complete.
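The flow of the search can be sketched in a much simplified form. The sketch below is not the tree-based implementation described above: it scans a flat word list instead of walking the stroke-to-character tree and the lexical tree, it takes one already assembled code string per character, and it backs off to lower-order n-grams with an assumed fixed penalty. The names `decode`, `step_logp`, `PENALTY` and the toy codes `A` and `B` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

NGram = Tuple[str, ...]
PENALTY = -2.0  # assumed log-domain penalty for each n-gram order backed off


def step_logp(history: Sequence[str], word: str, lm: Dict[NGram, float]) -> float:
    """Log probability of `word` given up to two preceding words."""
    best_order = min(3, len(history) + 1)
    for order in range(best_order, 0, -1):  # trigram, then bigram, then unigram
        key = tuple(history[len(history) - (order - 1):]) + (word,)
        if key in lm:
            return lm[key] + PENALTY * (best_order - order)
    raise KeyError(word)


def decode(codes: Sequence[str], code_to_chars: Dict[str, str],
           lm: Dict[NGram, float], beam: int = 5) -> str:
    """Return the best-scoring sentence for a sequence of per-character codes."""
    lexicon = [k[0] for k in lm if len(k) == 1]
    max_len = max(len(w) for w in lexicon)
    # paths[i] holds the best partial sentences covering the first i characters.
    paths: List[List[Tuple[float, List[str]]]] = [[] for _ in range(len(codes) + 1)]
    paths[0] = [(0.0, [])]
    for end in range(1, len(codes) + 1):          # one step per input character
        candidates = []
        for start in range(max(0, end - max_len), end):
            span = codes[start:end]
            for word in lexicon:                  # words ending at this boundary
                if len(word) != len(span):
                    continue
                if not all(ch in code_to_chars.get(c, "") for ch, c in zip(word, span)):
                    continue
                for score, words in paths[start]:
                    candidates.append((score + step_logp(words, word, lm), words + [word]))
        candidates.sort(key=lambda p: p[0], reverse=True)  # rank paths by score
        paths[end] = candidates[:beam]
    return "".join(paths[-1][0][1]) if paths[-1] else ""


toy_lm = {("上",): math.log(0.05), ("尚",): math.log(0.01), ("海",): math.log(0.02),
          ("河",): math.log(0.05), ("上海",): math.log(0.1)}
print(decode(["A", "B"], {"A": "上尚", "B": "海河"}, toy_lm))
```

Scores are accumulated as log probabilities, so adding a word to a path corresponds to multiplying in one more factor of the sentence probability; the beam keeps only the best few paths at each character boundary.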
Application example of the present invention:
Take the case in which each Chinese character has only two feature strokes.
First, a language model is obtained by following the training steps of the Chinese language model.
Suppose the sentence to be input is "the individual master worker in Shanghai overcomes difficulties". Under the coding of the present invention, the feature stroke sequences corresponding to these Chinese characters are listed in Table 2.
Table 2: Feature stroke sequences and same-code Chinese characters for each character of "the individual master worker in Shanghai overcomes difficulties"
Figure A0210444300151
In that case, each Chinese character requires four strokes to be entered, followed by a selection among the characters that share the same code. For example, to input " ", the four strokes "/, mouth, /, ," must be entered, and the character is then selected from the candidates "chimney" and " ". If only the first two strokes "/, mouth" are entered, there are many more candidates, such as "the white white soul of clear and bright emperor highland spring ...". The behaviour of the input method of the present invention is shown in Fig. 3. In each box of the figure, the first row is the candidate word, the middle row is the word number, and the bottom row is the accumulated log probability. Each row is sorted from left to right in descending order of log probability.
(i) First enter the first two strokes of "on" ("Shu-, -"). This yields four Chinese characters that share these first two strokes: "upward, Ji, mentally disturbed, Bei".
(ii) Without making a selection, continue by entering the first two strokes of "sea" (", /"). Although many characters share the first two strokes of "sea", taking into account the collocation of "Shu-, -" with ", /" yields several possible words, "Shanghai, rise, upstream, show, on flow, trace back, on send, oil", together with two possible single characters for these two strokes, "river, husky".
(iii) Continue by entering the first two strokes of " " ("/, mouth"). Because "sea" does not form a word with them, some single characters corresponding to these two strokes are given; in addition, for the preceding character "river", the possible word "river mouth" is given for the case where the new character is "mouth".
(iv) Continuing in this way, as more strokes are entered, the algorithm selects the desired sentence "the individual master worker in Shanghai overcomes difficulties" (see the Chinese characters marked with thick black boxes in Fig. 3); this sentence has the maximum probability.

Claims (7)

1. A whole-sentence Chinese input method using partial strokes, for desktop computers, handheld electronic devices, mobile communication devices and the like, applied to keyboard input devices, characterized in that it comprises the following steps: 1) adopting 23 feature strokes as the basic code elements for encoding Chinese characters, said code elements comprising: ,-, Shu ,/, , mouthful
Figure A0210444300021
Qian, Contraband, ,/,,---,-Shu, Shu Shu ,/,<, *, //, ,
Figure A0210444300023
2) mapping the above 23 code symbols onto corresponding key positions of the keyboard of said device;
3) dividing a Chinese character into two parts (left and right, top and bottom, or outer and inner) and coding each part with at most two feature strokes, so that each Chinese character has at most 4 code symbols; if the character cannot be split into two parts, directly taking at most 4 code symbols in order;
4) setting, for different input devices, the number of code symbols used per Chinese character, which may be one of 1, 2, 3 or 4;
5) inputting Chinese characters sentence by sentence, entering for each character the set number of feature strokes in the character's feature stroke order, and using the Chinese language model to convert the whole sentence according to contextual information.
2. The whole-sentence Chinese input method using partial strokes as claimed in claim 1, characterized in that it further comprises 17 additional feature strokes, for a total of 40 feature strokes: -, Shu, ,-,-/,-, Shu, Shu-, Shu/, Shu ,/-, / Shu ,/, , -, Shu, /.
3. A whole-sentence Chinese input method using partial strokes, for desktop computers, handheld electronic devices, mobile communication devices and the like, applied to handwriting-pad input devices, characterized in that it comprises the following steps:
1) adopting 23 feature strokes as the basic code elements for encoding Chinese characters, said code elements comprising:
Figure A0210444300027
Figure A0210444300028
Figure A02104443000211
Figure A02104443000212
Figure A02104443000213
Figure A02104443000215
Figure A02104443000216
Figure A02104443000217
Figure A02104443000218
Figure A02104443000220
Figure A02104443000221
Figure A02104443000224
Figure A02104443000225
2) dividing a Chinese character into two parts (left and right, top and bottom, or outer and inner) and coding each part with at most two feature strokes, so that each Chinese character has at most 4 code symbols; if the character cannot be split into two parts, directly taking at most 4 code symbols in order;
3) setting, for different input devices, the number of code symbols used per Chinese character, which may be one of 1, 2, 3 or 4;
4) inputting Chinese characters sentence by sentence, entering for each character the set number of feature strokes in the character's feature stroke order, and using the Chinese language model to convert the whole sentence according to contextual information.
4. The whole-sentence Chinese input method using partial strokes as claimed in claim 3, characterized in that it further comprises additional feature strokes:
Figure A0210444300031
Figure A0210444300032
Figure A0210444300033
Figure A0210444300034
Figure A0210444300035
Figure A0210444300039
Figure A02104443000310
Figure A02104443000311
Figure A02104443000312
Figure A02104443000313
Figure A02104443000315
Figure A02104443000316
Figure A02104443000317
5. The whole-sentence Chinese input method using partial strokes as claimed in claim 1, 2, 3 or 4, characterized in that said Chinese language model involves the following steps:
1) training the Chinese language model;
2) using the two-way back-off estimation algorithm to estimate probabilities for n-tuples that do not occur;
3) compressing the model storage space, with the following steps: Step 1: checking the number of occurrences in the training text of all n-tuples in the language model, where n is 3, 2 and 1, retaining the n-tuples that occur more often and play an important role in model performance, and forcing the occurrence counts of the other n-tuples to 0; Step 2: compressing the occurrence counts of the 1-tuples with a logarithmic companding curve, so that the model is stored with a lower bit width; Step 3: for the 2-tuples retained in the model, whose occurrence counts are necessarily non-zero, not recording the specific counts but sorting the 2-tuples that share the same preceding word by occurrence count from high to low, computing over all 2-tuples the average probability of those ranked in the m-th position, and building a code table for use during search; Step 4: reducing the index overhead, the word number being represented with two bytes, 16 bits in total, and divided into three parts that form a three-level index;
4) searching on the input feature stroke sequence to obtain the Chinese character string.
6. The whole-sentence Chinese input method using partial strokes as claimed in claim 5, characterized in that said step of estimating probabilities for n-tuples that do not occur is: when the training probability of the triple (a, b, c) formed by three consecutive words a, b, c is 0, that is, when the triple does not occur in the training corpus, it is estimated with the two-way low-order back-off algorithm, that is, the probability of the triple (a, b, c) is estimated by referring simultaneously to the training probabilities of the 2-tuples (a, b) and (b, c); this process is recursive, that is, if the training probability of a 2-tuple (x, y) is 0, the two-way back-off algorithm estimates it from the training probabilities of word x and word y simultaneously.
7. The whole-sentence Chinese input method using partial strokes as claimed in claim 5, characterized in that said search algorithm comprises the following steps:
1) the search begins, and the candidate search paths are cleared;
2) obtaining the code of one feature stroke input from a handwriting pad, keyboard or soft keyboard;
3) using the input stroke to advance the state of the stroke-to-character tree;
4) determining whether the set number of feature stroke codes for one Chinese character has been obtained; if not, going to step 2); if so, continuing to step 5);
5) obtaining all candidates for this single character, and advancing the search state of the lexical tree;
6) determining whether a word boundary has been reached; if not, going to step 2); if so, continuing to step 7);
7) obtaining all word candidates, appending the different candidate words to the existing paths, and scoring each resulting path according to the following formula: $P(w_1, w_2, \cdots, w_N) = P(w_1) \cdot P(w_2 \mid w_1) \cdot \prod_{n=3}^{N} P(w_n \mid w_{n-2}, w_{n-1})$;
8) sorting all paths by probability score from high to low;
9) determining whether the input has ended; if not, going to step 2); if so, continuing to step 10);
10) obtaining the highest-scoring whole-sentence candidate;
11) one whole-sentence search is complete.
CNB021044430A 2002-03-18 2002-03-18 Method for inputting Chinese holophrase into computers by using partial stroke Expired - Fee Related CN1187677C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB021044430A CN1187677C (en) 2002-03-18 2002-03-18 Method for inputting Chinese holophrase into computers by using partial stroke

Publications (2)

Publication Number Publication Date
CN1445640A true CN1445640A (en) 2003-10-01
CN1187677C CN1187677C (en) 2005-02-02

Family

ID=27810882

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB021044430A Expired - Fee Related CN1187677C (en) 2002-03-18 2002-03-18 Method for inputting Chinese holophrase into computers by using partial stroke

Country Status (1)

Country Link
CN (1) CN1187677C (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104503597A (en) * 2014-12-19 2015-04-08 北京奇虎科技有限公司 Stroke input method, stroke input device and stroke input system
CN110110292A (en) * 2018-01-29 2019-08-09 北京搜狗科技发展有限公司 A kind of data processing method, device and the device for data processing
TWI685761B (en) * 2017-01-22 2020-02-21 香港商阿里巴巴集團服務有限公司 Word vector processing method and device

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104503597A (en) * 2014-12-19 2015-04-08 北京奇虎科技有限公司 Stroke input method, stroke input device and stroke input system
CN104503597B (en) * 2014-12-19 2017-12-12 北京奇虎科技有限公司 stroke input method, device and system
TWI685761B (en) * 2017-01-22 2020-02-21 香港商阿里巴巴集團服務有限公司 Word vector processing method and device
US10878199B2 (en) 2017-01-22 2020-12-29 Advanced New Technologies Co., Ltd. Word vector processing for foreign languages
CN110110292A (en) * 2018-01-29 2019-08-09 北京搜狗科技发展有限公司 A kind of data processing method, device and the device for data processing
CN110110292B (en) * 2018-01-29 2023-11-14 北京搜狗科技发展有限公司 Data processing method and device for data processing

Also Published As

Publication number Publication date
CN1187677C (en) 2005-02-02

Legal Events

Date Code Title Description
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C19 Lapse of patent right due to non-payment of the annual fee
CF01 Termination of patent right due to non-payment of annual fee