CN1378130A - Initial four-stroke Chinese sentence input method for computer - Google Patents


Info

Publication number
CN1378130A
Authority
CN
China
Prior art keywords
stroke
chinese
chinese character
probability
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 02117934
Other languages
Chinese (zh)
Other versions
CN1203389C (en)
Inventor
郑方
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN 02117934
Publication of CN1378130A
Application granted
Publication of CN1203389C
Anticipated expiration
Expired - Fee Related

Landscapes

  • Document Processing Apparatus (AREA)

Abstract

The present invention relates to Chinese character input technology for computers. The partial-stroke Chinese character input method comprises: taking the five standard strokes as the basic code elements for encoding Chinese characters; mapping the code elements to corresponding keys on a keyboard, or writing them by hand; encoding each Chinese character as a sequence of at most four strokes; entering Chinese characters sentence by sentence, with at most four strokes per character; and using a Chinese language model to exploit the context so that the whole sentence is converted at once, without selecting characters one by one. The method converts quickly and with high accuracy, and can be used in computers and mobile communication equipment.

Description

Initial four-stroke Chinese sentence input method for computer
Technical field
The invention belongs to the technical field of Chinese character input methods for computers (including desktop computers, notebook computers, palmtop computers, personal digital assistants and the like), and relates in particular to Chinese character input methods for wireless communication devices (such as mobile phones).
Background art
Wireless handheld communication devices such as mobile phones and personal digital assistants (PDAs) are among the most popular technology products today, and their sales multiply every year. Chinese character input is essential when using these products. However, today's computing environments (desktop computers, notebook computers, palmtop computers, personal digital assistants, wireless communication devices and the like) are mostly based on English, which makes Chinese input a comparatively complex and difficult problem. Chinese input on these products currently relies on handwriting and keypads. Handwriting input is limited to writing out whole characters, which is time-consuming and slow. The more popular keyboard methods are based on memorized radicals or on pinyin. Radical-based methods such as Cangjie and the five-stroke input method can be fairly fast, but becoming proficient in them takes considerable practice. The drawback of pinyin input is that the user must select characters, because there are too many homophones. In addition, the keyboards on these communication devices are quite small, which makes them inconvenient to use.
Among Chinese character input methods based on strokes and a keyboard, the five-stroke input method is popular. Its biggest problem, however, is that its way of decomposing characters does not match the stroke order people follow when writing, and the decomposition skill needed during input takes long professional training to master.
In Chinese character input methods based on a handwriting pad, the user must write out every stroke of a character before that character can be entered into the computer or device. When the character has many strokes, or when the user does not remember exactly how it is written, errors are very common.
In existing pinyin-based input methods, a Chinese language model is needed to resolve the code collisions that arise because one sound corresponds to many characters.
A Chinese language model (CLM) uses the collocation information between adjacent words in the context. When a continuous, unspaced string of pinyin or strokes, or of digits representing letters or strokes, is to be converted into a Chinese character string (that is, a sentence), the model can compute the sentence with the maximum probability. The conversion to Chinese characters is thereby automatic and needs no manual selection by the user, which avoids the collision problem of many Chinese characters corresponding to the same pinyin (or stroke string, or digit string).
The most commonly used CLM is the tri-gram (triple) language model, which gives the collocation probability P(c|a, b) for any three Chinese words a, b and c. Given a massive amount of Chinese text, the number of times any three words occur together can be obtained by simple counting, and the collocation probability estimated from it. During conversion from a pinyin string, stroke string or digit string to Chinese characters, this probability is used to pick the best candidate from the many candidates according to the maximum-likelihood principle. When a feature representation string of Chinese characters (such as a pinyin string, digit string or stroke string) is mapped to a Chinese character string, the maximum-likelihood criterion means choosing the string with the maximum probability. In the tri-gram language model, the probability of a Chinese word sequence is given by:

$$P(w_1, w_2, \ldots, w_N) = P(w_1) \cdot \prod_{n=2}^{N} P(w_n \mid w_1, \ldots, w_{n-1}) \approx P(w_1) \cdot P(w_2 \mid w_1) \cdot \prod_{n=3}^{N} P(w_n \mid w_{n-2}, w_{n-1}) \qquad (1)$$

where the probability of the triple $(w_{n-2}, w_{n-1}, w_n)$, that is $P(w_n \mid w_{n-2}, w_{n-1})$, is learned from massive Chinese text (called the training text).
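As an illustration of formula (1), the following minimal Python sketch scores a word sequence with the tri-gram approximation. The function name `sentence_log_prob` and the dictionary-based probability tables are assumptions made for this example; the source does not describe a data layout. Log probabilities are summed instead of multiplying raw probabilities, which is a common way to avoid numerical underflow.

```python
import math

def sentence_log_prob(words, uni, bi, tri):
    """Log-probability of a word sequence under the tri-gram form of formula (1).

    Hypothetical tables (not from the source):
      uni[w]          = P(w)
      bi[(a, b)]      = P(b | a)
      tri[(a, b, c)]  = P(c | a, b)
    Unseen entries raise KeyError; smoothing is deliberately left out here.
    """
    if not words:
        return 0.0
    logp = math.log(uni[words[0]])                  # P(w1)
    if len(words) > 1:
        logp += math.log(bi[(words[0], words[1])])  # P(w2 | w1)
    for n in range(2, len(words)):                  # product over n = 3..N
        logp += math.log(tri[(words[n - 2], words[n - 1], words[n])])
    return logp
```

Among several candidate sentences that share one stroke string, the candidate with the largest returned value would be the maximum-likelihood choice.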
Existing tri-gram language models involve the following issues: (i) estimating the probability of triples that did not occur in the training text; (ii) reducing the model storage size; (iii) decoding or searching, i.e. using formula (1) to select the correct sentence quickly and accurately from a large number of colliding candidates.
(i) Estimating the probability of triples that did not occur
There are 30,000 or more commonly used words in Chinese. The number of triples that can be formed from any three words is therefore on the order of 30,000^3; some of these triples can never occur and some occur only rarely. However large the training corpus is, there will always be triples (a, b, c) that do not appear in it. If the probabilities of these triples are not treated specially, their estimated probability P(c|a, b) is 0, which makes the probability of the whole sentence 0. But a triple that is not found in the corpus does not necessarily have probability 0; its probability of occurrence is merely relatively small. Such triples should therefore be given reasonable, relatively small probabilities, and different triples should be given different probabilities according to their circumstances. The traditional solution to the zero-probability problem is to estimate from the lower-order bi-gram probability P(c|b); this is the back-off algorithm. Back-off can be recursive: if P(c|b) is also 0, it backs off further to P(c). To guarantee that the probabilities sum to 1, some probability mass must be discounted from the triples with non-zero probability and redistributed among the triples that did not occur. The shortcoming of this traditional back-off method is that it backs off to the lower order in only one direction, which makes the probability estimates for unseen units insufficiently accurate.
(ii) Compressing the model storage space
As noted in (i), a tri-gram language model is very large to store: even though most triples never occur, storing those that do occur still takes a great deal of space. A Chinese language model with a vocabulary of 50K words generally needs 300 MB to 1 GB of storage. On devices with plenty of storage, such as PCs, the model storage need not be compressed; but on devices with only tens of megabytes of memory or less, using the uncompressed model is impractical. There are two reasons for this: one is obviously the storage space itself, and the other is that the huge model makes the search process very time-consuming.
(iii) Decoding or searching
The purpose of the search is to map the feature representation string of Chinese characters (such as a pinyin string, digit string or stroke string) onto a Chinese character sequence and, according to the maximum-likelihood criterion, find the best-matching sequence as the final result. Because (1) one stroke sequence, pinyin sequence or digit sequence is shared by multiple Chinese characters, (2) there are no explicit word boundaries between the words of a Chinese sentence, and (3) one sentence can be split into different word sequences depending on where it is segmented (only one of which is optimal), the mapping from the feature representation string to single characters and then to a sentence produces a great many matching "sentences". In this situation it is not possible to list all candidate Chinese character sequences and compare their probabilities, so an efficient and accurate search algorithm is very important. The existing decoding method is the traditional dynamic programming algorithm. Its shortcomings are that it is not tailored to the characteristics of the specific application and uses only a single-level structure; as a result it is inflexible to port, its decoding efficiency is low and its decoding results are not good enough.
Summary of the invention
The objective of the invention is to overcome the shortcomings of the prior art by proposing an initial four-stroke Chinese sentence input method for computers, together with its Chinese language model method. For each Chinese character the user only needs to key in or handwrite part of its strokes, in order, according to the standard stroke encoding of Chinese characters; the user does not have to pick characters one by one from the colliding candidates during input, because the system converts the whole sentence automatically. The input method is easy to learn, converts quickly and with very high accuracy, and can be used in all kinds of computers, handheld electronic devices and mobile communication equipment.
The present invention proposes an initial four-stroke Chinese sentence input method for computers, characterized in that it comprises the following steps:
1) the 5 standard strokes are adopted as the basic code elements for encoding Chinese characters, namely: "\" (dot or press-down), "—" (horizontal), "Shu" (vertical), "/" (left-falling or rising) and the fold stroke;
2) the above 5 code elements are mapped onto corresponding keys of the keyboard of the input device;
3) the first 4 strokes of a Chinese character, in writing order, are taken as the code sequence of that character; if the character has fewer than 4 strokes, all of its strokes form the code sequence;
4) Chinese characters are entered sentence by sentence through said keyboard; for each character at most 4 of said standard strokes are entered in writing order, and a Chinese language model uses the contextual association information to convert the whole sentence.
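Step 2) can be pictured with a small Python sketch. The source does not say which keys carry which stroke, so the digit assignment in `STROKE_KEYS` and the stroke names are purely hypothetical.

```python
# Hypothetical key assignment; the source does not specify key positions.
STROKE_KEYS = {
    "1": "dot",         # "\" class: dot and press-down strokes
    "2": "horizontal",  # horizontal strokes
    "3": "vertical",    # "Shu" class: strokes whose main body is vertical
    "4": "left",        # "/" class: left-falling and rising strokes
    "5": "fold",        # every other bent stroke
}

def keys_to_strokes(keys):
    """Translate a string of key presses into a list of stroke names."""
    return [STROKE_KEYS[k] for k in keys]
```

With this assumed layout, the key presses `"323"` would denote the stroke sequence vertical, horizontal, vertical.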
The present invention also proposes an initial four-stroke handwritten Chinese sentence input method for computers, characterized in that it comprises the following steps:
1) 5 handwritten strokes are adopted as the basic code elements for encoding Chinese characters; the code elements appear in the original only as images, with alternative written forms joined by "or" (Figures A0211793400063, A0211793400064, A0211793400066 and A0211793400067);
2) the first 4 handwritten strokes of a Chinese character, in writing order, are taken as the code sequence of that character; if the character has fewer than 4 strokes, all of its strokes form the code sequence;
3) Chinese characters are entered sentence by sentence through a handwriting-pad input device; for each character at most 4 of said handwritten strokes are entered in writing order, and a Chinese language model uses the contextual association information to convert the whole sentence.
The Chinese language model in the above methods may use an existing technical scheme, or it may use the following method, whose steps comprise:
1) training the Chinese language model;
2) estimating the probabilities of n-grams that did not occur, using a bidirectional back-off estimation algorithm;
3) compressing the model storage space, in the following steps:
Step 1: check the number of occurrences in the training text of all triples (tri-grams), pairs (bi-grams) and single words (uni-grams) in the language model (collectively called n-grams); keep the n-grams that occur often and matter to model performance, and force the occurrence counts of the other n-grams to 0;
Step 2: the more often n-grams occur in the training text, the fewer such n-grams there are. The counts of rarely occurring n-grams therefore need to be stored with higher precision, whereas for frequently occurring n-grams there is no need to store very precise counts. The present invention compresses the occurrence counts of uni-grams with a logarithmic curve, so that the model can be stored with a smaller bit width while losing essentially none of its information;
Step 3: a bi-gram retained in the model necessarily has a non-zero count. Its exact count is not recorded; instead, the bi-grams that share the same history (that is, the same preceding word) are sorted by count from high to low. Over all bi-grams, the average probability of the n-gram ranked m-th is computed and a code table is built for use during search;
Step 4: reduce the index overhead by building a three-level index. A word number is represented with two bytes, 16 bits in total, and is divided into three parts. For example, the highest 10 bits form the first-level index, the middle 4 bits form the second-level index and the last two bits form the third-level index. In this way the number of first-level index entries drops from tens of thousands to a few hundred, which reduces the storage required;
4) searching on the input stroke sequence to obtain the Chinese character string.
Steps 1), 2) and 4) of the above method may all use conventional methods.
Step 2), estimating the probabilities of n-grams that do not occur in the Chinese language model, may alternatively use the following procedure: when the training probability of the triple (a, b, c) formed by three consecutive words a, b and c is 0, that is, when the triple does not occur in the training corpus, a bidirectional lower-order back-off algorithm is used, in which the probability of the triple (a, b, c) is estimated by referring simultaneously to the training probabilities of the pairs (a, b) and (b, c). The process is recursive: if the training probability of a pair (x, y) is 0, the bidirectional back-off algorithm is applied again, referring simultaneously to the training probabilities of word x and word y.
The search algorithm in step 4) of the above method may also comprise the following steps:
1) the search begins and the candidate search paths are cleared;
2) the code of one input stroke is obtained (from a handwriting pad, keyboard or soft keyboard);
3) the input stroke drives a state transition in the "stroke-Chinese character" tree;
4) determine whether the 4 stroke codes of a Chinese character have been obtained (or, for a character with fewer than 4 strokes, whether the symbol marking the end of that character's strokes has been encountered); if not, go to step 2; if so, continue to step 5;
5) obtain all candidates for this single character and drive the search-state transition forward in the lexical tree (the tree in which all words are organized by their characters is called the lexical tree);
6) determine whether a word boundary has been reached; if not, go to step 2; if so, continue to step 7;
7) obtain all word candidates, extend the existing paths with each candidate word, and score each resulting path with formula (1);
8) sort all paths by probability score from high to low;
9) determine whether the input has ended; if not, go to step 2; if so, continue to step 10;
10) obtain the highest-scoring whole-sentence candidate;
11) the search for one whole sentence is finished.
The present invention has the following features:
1) The stroke encoding scheme for Chinese characters is systematic and concise: each Chinese character needs at most 4 "strokes" to be expressed, although some code collisions naturally exist. A "stroke" here is the traditional stroke, conforming to the national standard.
2) The Chinese language model handles zero-probability estimation well, reduces the model size, improves search speed and accuracy, and has a flexible structure.
3) When Chinese characters are entered by stroke, input can proceed sentence by sentence. For each character the user only needs to key in or handwrite 4 strokes, in order, according to the standard stroke encoding; the user does not have to pick characters one by one from the colliding candidates during input, because the Chinese language model converts the whole sentence automatically.
4) The partial strokes are chosen entirely according to the standard stroke order, which matches the way people write, so the method is easy to learn.
5) The Chinese language model is small: all of its data occupies about 1 MB, so the technique can be used for Chinese character input on most small handheld devices.
6) The conversion accuracy is very high: the first choice is correct more than 94% of the time, so the user seldom needs to select character by character among many candidates.
7) Conversion is fast: in offline tests more than 300 Chinese characters can be converted per second.
8) The multi-level data structure design makes it easy to combine the Chinese language model with pinyin, digits or other features to implement a Chinese character input method.
Beneficial effects of the present invention:
According to the Chinese character standard promulgated by the state, the level-2 character set contains more than 6,700 Chinese characters. If four strokes are used to represent a Chinese character, on average every 1.2 Chinese characters share the same stroke sequence. If each Chinese character is represented by only its first two strokes, on average 12 Chinese characters share the same stroke sequence. Ordinary stroke-based methods therefore often require the user to choose the desired character from a candidate list, whereas the present invention frees the user from this tedious selection. To enter a Chinese word or sentence, the user only needs to enter the first few strokes of each character in turn. During input, the system exploits the Chinese language model and the mutual information between characters in context to pick the most suitable output automatically, by checking the stroke sequence entered so far against its linguistic knowledge. Once all strokes have been entered, the best candidate for the whole word or sentence is given automatically. In summary, the present invention obtains the correct candidate for a whole word or sentence from partial stroke sequences of the input characters, and the model of the system is very small while its accuracy is very high.
Description of the drawings
Fig. 1 is a hierarchical block diagram illustrating the present invention in various applications.
Fig. 2 is a flowchart of the search algorithm for stroke input.
Fig. 3 is an application example of the stroke input method.
Embodiment
The content, principle and embodiments of the partial-stroke Chinese whole-sentence input method for computers proposed by the present invention are described in detail below with reference to the accompanying drawings:
(1) Explanation of the strokes
5 strokes conforming to the national standard are used in total:
(1) "\". Any single stroke running from upper left to lower right, including the dot (dian) and the press-down stroke. Examples: the second stroke of 'Yin' and the third stroke of 'Chuo'.
(2) "—". Any single horizontal stroke. Example: the first stroke of 'grass'.
(3) "Shu". Any single stroke whose main body is vertical, including the vertical (shu), the vertical with a left hook (the second stroke of 'Dao' or 'Rolling', the third stroke of 'I') and the vertical with a rise to the right (the left part of 'people', the fifth stroke of 'Jin', the first stroke of 'ratio'). Further examples: the second and third strokes of 'Lv'.
(4) "/". Any single stroke running from lower left to upper right, or from upper right to lower left, including the left-falling stroke (pie) and the rising stroke (the second stroke of 'Bing' or the third stroke of 'Rui'). Further examples: the first stroke of 'Xiangxi', the first stroke of 'Mi' and the second stroke of 'Http'.
(5) Fold. Any other single stroke that is not a straight stroke of one of the four kinds above. Examples: the third stroke of 'five', the second stroke of 'Cannibals', the second stroke of 'Fan', the second stroke of 'Bao', the second stroke of 'water', the third stroke of 'forever', the first stroke of 'Jie', the first stroke of 'Fu', the second stroke of 'Yan', the first stroke of 'flying', the second stroke of 'Chuo', the first stroke of 'Yin'; the third stroke of 'bow', the fourth stroke of 'beggar'; the first stroke of 'Si', the first and second strokes of 'the one', the first and second strokes of 'Si', the three strokes of 'Chuan'; the fifth stroke of 'I'; the second stroke of 'Quan'; and so on.
Note: when a character is used as a radical, its final stroke may change. For example, the final stroke of characters such as 'native', 'Wang', 'Yu', 'car' and 'army' changes from "horizontal" to "rising" (for 'car' and 'army' the stroke order changes as well), and the final stroke of 'north' changes from "fold" to "vertical" (vertical with a rise to the right). Likewise, the first stroke of 'moon' is sometimes a left-falling stroke (pie) and sometimes a vertical stroke (shu), as in 'having'.
(2) Partial encoding of Chinese characters
The first 4 strokes of a Chinese character, in stroke order, are taken as the code sequence of that character; if the character has fewer than 4 strokes, all of its strokes form the code sequence.
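The encoding rule is simple enough to state as a short Python sketch. The function name `encode_character` and the stroke names are assumptions for illustration; the full stroke list of a character would have to come from a stroke-order table that the source does not reproduce.

```python
MAX_STROKES = 4  # a code never uses more than the first four strokes

def encode_character(strokes):
    """Return the code of one character.

    `strokes` is the character's complete stroke list in writing order.
    The code is its first four strokes, or all of them if there are fewer.
    """
    return tuple(strokes[:MAX_STROKES])
```

A six-stroke character is thus cut to its first four strokes, while a two-stroke character keeps both strokes as its complete code.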
(3) Training steps for the Chinese language model
Step 1: select a vocabulary of suitable size and process it appropriately according to the stroke codes of the individual Chinese characters;
Step 2: segment the massive text data into words according to the vocabulary, forming word sequences;
Step 3: analyse the word sequences statistically to obtain the triples (a, b, c) that occur and their occurrence counts;
Step 4: smooth the model, that is, estimate probabilities for the n-grams whose probability is 0.
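The counting in step 3 might look like the following Python sketch, which assumes the text has already been segmented into word lists (step 2). The function names are hypothetical, and the probability shown is the plain maximum-likelihood ratio, before any smoothing from step 4.

```python
from collections import Counter

def count_ngrams(sentences):
    """Count uni-, bi- and tri-grams over segmented sentences (lists of words)."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for words in sentences:
        uni.update(words)
        bi.update(zip(words, words[1:]))
        tri.update(zip(words, words[1:], words[2:]))
    return uni, bi, tri

def trigram_probability(a, b, c, bi, tri):
    """Unsmoothed estimate P(c | a, b) = count(a, b, c) / count(a, b)."""
    history = bi[(a, b)]          # a Counter returns 0 for unseen keys
    return tri[(a, b, c)] / history if history else 0.0
```

Triples that never appear in the training text come out with probability 0 here, which is exactly the case the smoothing step has to repair.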
(4) Bidirectional back-off estimation of zero probabilities in the Chinese language model
To address the shortcoming of the traditional solution to zero probabilities, when P(c|a, b) is 0 the present invention considers not only P(c|b) but also P(b|a). Likewise, when the bi-gram probability P(x|y) is 0, it considers not only P(x) but also P(y). The bidirectional back-off algorithm can thus re-estimate zero probabilities more accurately.
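The source states which lower-order probabilities are consulted but not how they are combined. The Python sketch below therefore makes several assumptions that are not from the source: the two lower-order probabilities are combined by a geometric mean, scaled by a fixed factor `alpha`, and unknown words receive a small `floor` probability. The result is not renormalized, so it illustrates only the recursion, not a complete smoothing scheme.

```python
import math

def p_uni(w, uni, floor=1e-9):
    """P(w), with an assumed small floor for unknown words."""
    return uni.get(w, floor)

def p_bi(x, y, uni, bi, alpha=0.4):
    """P(x | y), stored as bi[(y, x)]; if it is 0, consult both P(x) and P(y)."""
    p = bi.get((y, x), 0.0)
    if p > 0.0:
        return p
    # Assumed combination rule: scaled geometric mean of the two uni-grams.
    return alpha * math.sqrt(p_uni(x, uni) * p_uni(y, uni))

def p_tri(c, a, b, uni, bi, tri, alpha=0.4):
    """P(c | a, b), stored as tri[(a, b, c)]; if 0, consult P(c | b) and P(b | a)."""
    p = tri.get((a, b, c), 0.0)
    if p > 0.0:
        return p
    # The recursion happens inside p_bi when a bi-gram is itself unseen.
    return alpha * math.sqrt(p_bi(c, b, uni, bi, alpha) * p_bi(b, a, uni, bi, alpha))
```

A seen triple returns its trained probability unchanged; only unseen triples trigger the two-sided fallback.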
(5) Compressing the model storage space in the Chinese language model
This comprises the following steps:
Step 1: check the number of occurrences in the training text of all triples (tri-grams), pairs (bi-grams) and single words (uni-grams) in the language model (collectively called n-grams); keep the n-grams that occur often and matter to model performance, and force the occurrence counts of the other n-grams to 0;
Step 2: the more often n-grams occur in the training text, the fewer such n-grams there are. The counts of rarely occurring n-grams therefore need to be stored with higher precision, whereas for frequently occurring n-grams there is no need to store very precise counts. The present invention compresses the occurrence counts of uni-grams with a logarithmic curve, so that the model can be stored with a smaller bit width while losing essentially none of its information;
Step 3: a bi-gram retained in the model necessarily has a non-zero count. Its exact count is not recorded; instead, the bi-grams that share the same history (that is, the same preceding word) are sorted by count from high to low. Over all bi-grams, the average probability of the n-gram ranked m-th is computed and a code table is built for use during search;
Step 4: reduce the index overhead by building a three-level index. A word number is represented with two bytes and is divided into three parts. For example, the highest 10 bits form the first-level index, the middle 4 bits form the second-level index and the last two bits form the third-level index. In this way the number of first-level index entries drops from tens of thousands to a few hundred, which reduces the storage required.
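Steps 2 and 4 lend themselves to a brief Python sketch. The 10/4/2 bit split follows the example in the text; the 8-bit code width, the `max_count` ceiling and all function names are assumptions chosen for illustration, since the source does not give the parameters of its logarithmic curve.

```python
import math

def split_word_id(word_id):
    """Split a 16-bit word number into (high 10 bits, middle 4 bits, low 2 bits)."""
    if not 0 <= word_id < (1 << 16):
        raise ValueError("word number must fit in 16 bits")
    return word_id >> 6, (word_id >> 2) & 0xF, word_id & 0x3

def join_word_id(level1, level2, level3):
    """Inverse of split_word_id."""
    return (level1 << 6) | (level2 << 2) | level3

def compress_count(count, bits=8, max_count=1 << 24):
    """Quantize a count on a logarithmic scale into a `bits`-wide code (assumed parameters)."""
    levels = (1 << bits) - 1
    count = min(max(count, 1), max_count)
    return round(levels * math.log(count) / math.log(max_count))

def expand_count(code, bits=8, max_count=1 << 24):
    """Approximate inverse of compress_count."""
    levels = (1 << bits) - 1
    return round(math.exp(code * math.log(max_count) / levels))
```

Because the scale is logarithmic, small counts are separated by several code values while large counts share codes, which matches the observation that frequent n-grams do not need exact counts.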
(6) Embodiment of the efficient and accurate search algorithm for the Chinese language model
This embodiment proposes a multi-layer tree structure and a character-synchronous lattice search algorithm to solve the search problem of the Chinese language model.
The structure has three layers, as shown in Fig. 1. The top layer is the word-sequence layer, which is constrained by the Chinese language model; this layer is the input mode. The second layer is the word layer, constrained by the lexical tree; this layer is the decoding mode. The bottom layer is the Chinese character layer, constrained by the "Chinese character - character feature (pinyin, stroke or digit)" tree; this layer is the output mode. The last two layers can be regarded as independent or treated as a whole. With this structure, the search from the stroke sequence to the sentence is character-synchronous, and the probability of the sentence accumulates word by word as the words appear. The search algorithm reaches a search speed of 300 characters per second. With four stroke codes per Chinese character it reaches an accuracy above 94%, and with eight stroke codes per character an accuracy above 99%.
As Fig. 1 shows, the advantage of this multi-layer structure is its good extensibility. Through a mapping from letters, digits or strokes to pinyin or Chinese characters, the input system can be ported very easily to various systems that take letters, digits or strokes as input.
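The bottom-layer tree that maps strokes to characters can be sketched as an ordinary prefix tree. The class below is a minimal illustration with hypothetical names, not the data structure of the source; the labels used in the usage note are the English glosses that appear in Table 1.

```python
class StrokeTrie:
    """Prefix tree from stroke codes to the characters that share them."""

    def __init__(self):
        self.children = {}     # stroke -> child node
        self.characters = []   # characters whose code ends at this node

    def insert(self, strokes, character):
        """Register `character` under its stroke code."""
        node = self
        for stroke in strokes:
            node = node.children.setdefault(stroke, StrokeTrie())
        node.characters.append(character)

    def step(self, stroke):
        """State transition on one input stroke; None if no code has this prefix."""
        return self.children.get(stroke)
```

After inserting 'worker', 'soil' and 'scholar' under the same code (horizontal, vertical, horizontal), walking those three strokes from the root reaches a node that holds all three candidates.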
The search algorithm is shown in Fig. 2 and comprises the following steps:
Step 1: the search begins and the candidate search paths are cleared;
Step 2: the code of one input stroke is obtained (from a handwriting pad, keyboard or soft keyboard);
Step 3: the input stroke drives a state transition in the "stroke-Chinese character" tree;
Step 4: determine whether the 4 stroke codes of a Chinese character have been obtained; if not, go to step 2; if so, continue to step 5;
Step 5: obtain all candidates for this single character and drive the search-state transition forward in the lexical tree (the tree in which all words are organized by their characters is called the lexical tree);
Step 6: determine whether a word boundary has been reached; if not, go to step 2; if so, continue to step 7;
Step 7: obtain all word candidates, extend the existing paths with each candidate word, and score each resulting path with formula (1);
Step 8: sort all paths by probability score from high to low;
Step 9: determine whether the input has ended; if not, go to step 2; if so, continue to step 10;
Step 10: obtain the highest-scoring whole-sentence candidate;
Step 11: the search for one whole sentence is finished.
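The overall flow of the steps above can be sketched in Python as a small beam search. This is a deliberate simplification and not the algorithm of the source: it works on characters instead of words, scores with a bi-gram instead of the tri-gram of formula (1), and omits the lexical tree and the word-boundary test of step 6. The names `decode`, `candidates`, `bigram_logp` and `beam_width` are hypothetical.

```python
def decode(codes, candidates, bigram_logp, beam_width=8):
    """Return the best-scoring character sequence for a list of stroke codes.

    candidates[code]       -> characters sharing that stroke code
    bigram_logp(prev, cur) -> log P(cur | prev); prev is None at sentence start
    Returns None when some code has no candidate at all.
    """
    paths = [(0.0, [])]                      # step 1: start from an empty path
    for code in codes:                       # steps 2-4: one complete code per character
        extended = []
        for score, chars in paths:
            prev = chars[-1] if chars else None
            for ch in candidates.get(code, []):          # step 5: all candidates
                extended.append((score + bigram_logp(prev, ch), chars + [ch]))  # step 7
        if not extended:
            return None
        extended.sort(key=lambda path: path[0], reverse=True)   # step 8: rank paths
        paths = extended[:beam_width]        # keep only the best few paths
    return paths[0][1]                       # step 10: best whole-sentence candidate
```

Keeping a ranked list of paths, and not just the single best one, is also what would allow a list of alternative sentences to be offered to the user.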
An application example of the present invention is described in detail as follows:
First, the language model is obtained by following the training steps of the Chinese language model.
Suppose the sentence to be entered is "the individual master worker in Shanghai overcomes difficulties". The stroke sequence of each character of this sentence and the characters that collide with it are shown in Table 1.
Table 1
Chinese character | Stroke sequence | Characters with the same full code | Number of characters with the same full code
On | Shu, —, — | | 1
The sea | \, \, /, / | Do not have, send, live | 78
(garbled) | /, Shu, (fold), — | inferior, soap | 12
The worker | —, Shu, — | Soil, scholar | 3
The people | /, \ | Eight, go into | 3
The teacher | Shu, /, —, Shu | | 1
Fu | /, Shu, —, Shu | The pari, scholar, two | 4
Gram | —, Shu, Shu, (fold) | Former, fearful, Gu | 4
Clothes | /, (fold), —, — | Intestines, get rid of, tire | 7
Be stranded | Shu, (fold), —, Shu | Group, boundary | 19
Difficult | (fold), \, /, Shu | | 1
Thus, each Chinese character all will be imported four strokes (being less than blank stroke of four strokes of benefits gets final product), and goes choosing from the Chinese character of repeated code.As when input when " ", need four strokes of input "/, Shu, ,-", and from 12 candidates ", inferior, soap ... " in select.
(i) First enter the three strokes of "上"; one candidate is obtained.
(ii) Continue by entering the four strokes of "海". Although many characters share the same first four strokes as "海", taking collocation into account yields several possible words ("上海" and one glossed as "go up group"), as well as the possible single characters corresponding to these four strokes (glossed in Table 1 as "sea", "not have", "send", and so on).
(iii) Continue by entering the four strokes of "的". Because "海的" does not form a word, the method offers single characters corresponding to these four strokes.
(iv) The process continues in this way. As more strokes are entered, the algorithm of the present invention selects the desired sentence "上海的工人师傅克服困难" (see the characters marked with thick black boxes in Fig. 3); this sentence has the maximum probability.
The present invention also provides a list of candidate sentences, all of which share the same stroke sequence, and a list of candidates for each character, all of which likewise share the same stroke sequence.
Sometimes the actual result differs from the intended input; the intended sentence can then be chosen from the candidate sentence list. If the intended sentence is not in the list either, select the closest sentence, find the first wrong character, and choose the intended character from its candidate list. If there are too many candidates, continue entering the subsequent strokes of that character: because the number of repeat-code characters shrinks as more strokes are entered, the intended character can be determined quickly. The remaining wrong characters are corrected one by one in the same way.
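The narrowing effect of extra strokes can be sketched as a prefix filter over full stroke sequences. The digit notation (1 horizontal, 2 vertical, 3 left-falling, 4 dot, 5 turning) and the three-character table are assumptions made for illustration; a real system would use its complete stroke dictionary.

```python
# Illustrative full stroke sequences for three characters that share the
# first four strokes 4, 4, 1, 3 (hand-entered for this sketch).
FULL_STROKES = {
    "没": "4413554",
    "活": "441312251",
    "海": "4413155414",
}


def narrow_candidates(stroke_table, typed):
    """Return the characters whose stroke sequence starts with `typed`."""
    return [char for char, strokes in stroke_table.items()
            if strokes.startswith(typed)]


print(narrow_candidates(FULL_STROKES, "4413"))    # ['没', '活', '海']
print(narrow_candidates(FULL_STROKES, "44131"))   # ['活', '海']
print(narrow_candidates(FULL_STROKES, "441315"))  # ['海']
```

Each additional stroke can only keep or shrink the candidate set, so a user who keeps typing strokes of the wrong character converges on the intended one.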

Claims (5)

1. An initial four-stroke Chinese sentence input method for computer, characterized in that it comprises the following steps:
1) adopting 5 standard strokes as the basic code elements for encoding Chinese characters in a computer, said code elements comprising: 一, 丨, 丿, 丶, 乛;
2) mapping the above 5 code elements onto corresponding key positions of the keyboard of an input device;
3) taking the first 4 strokes of a Chinese character, in writing stroke order, as the coded sequence of that character; a character with fewer than 4 strokes uses all of its strokes as its coded sequence;
4) inputting Chinese characters through said keyboard sentence by sentence, entering at most said 4 standard strokes for each character in its writing stroke order, and converting the whole sentence by using the Chinese language model together with contextual information.
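As an illustration of steps 2) to 4), the sketch below splits a stream of key presses into per-character stroke codes. The key layout (digit keys 1 to 5 for the strokes, 0 for the blank stroke) is hypothetical. Using the blank stroke to close a character with fewer than four strokes follows the application example in the description, not this claim.

```python
# Hypothetical key layout: digit keys 1-5 for the five strokes,
# "0" for the blank stroke that closes a short character.
KEY_TO_STROKE = {"1": "一", "2": "丨", "3": "丿", "4": "丶", "5": "乛"}
BLANK_KEY = "0"
MAX_STROKES = 4


def split_keystream(keys):
    """Split key presses into per-character stroke codes.

    A code is closed after four strokes or when the blank key is pressed.
    """
    codes = []
    current = []
    for key in keys:
        if key == BLANK_KEY:
            if current:
                codes.append("".join(current))
                current = []
            continue
        if key not in KEY_TO_STROKE:
            raise ValueError(f"unmapped key: {key!r}")
        current.append(KEY_TO_STROKE[key])
        if len(current) == MAX_STROKES:
            codes.append("".join(current))
            current = []
    if current:  # trailing character that has not been closed yet
        codes.append("".join(current))
    return codes


# "上" (three strokes, closed by the blank key) followed by "海".
print(split_keystream("21104413"))  # ['丨一一', '丶丶一丿']
```

Each resulting code is then looked up to obtain its repeat-code characters, and the language model chooses among them at the sentence level.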
2. An initial four-stroke Chinese sentence input method for computer, characterized in that it comprises the following steps:
1) adopting 5 handwritten strokes as the basic code elements for encoding Chinese characters in a computer, said code elements comprising: [Figure A0211793400021] or [Figure A0211793400022], [Figure A0211793400023], [Figure A0211793400024] or [Figure A0211793400027];
2) taking the first 4 handwritten strokes of a Chinese character, in writing stroke order, as the coded sequence of that character; a character with fewer than 4 strokes uses all of its strokes as its coded sequence;
3) inputting Chinese characters sentence by sentence, entering at most said 4 handwritten strokes for each character in its stroke order, and converting the whole sentence by using the Chinese language model together with contextual information.
3. The initial four-stroke Chinese sentence input method as claimed in claim 1 or 2, characterized in that said Chinese language model involves the following steps:
1) training the Chinese language model;
2) using a two-way degeneration (back-off) estimation algorithm to estimate the probability of n-tuples that do not occur;
3) compressing the model storage space, in the following steps:
The 1st step: examine the occurrence counts in the training text of all n-tuples in the language model, where n is 3, 2 and 1; keep those n-tuples that occur frequently and play an important role in model performance, and force the occurrence counts of all other n-tuples to 0;
The 2nd step: for 1-tuples, compress the occurrence counts with a logarithmic curve so that the model can be stored with a lower bit width;
The 3rd step: for the 2-tuples retained in the model, whose occurrence counts are necessarily nonzero, do not record the specific counts; instead, sort the 2-tuples that share the same preceding word from high to low by occurrence count, compute over all 2-tuples the average probability of those ranked in the m-th position, and build a code table for use during search;
The 4th step: reduce the indexing overhead by representing each word number with two bytes (16 bits in total), divided into three parts that form a three-level index;
4) searching with the entered stroke sequence to obtain the Chinese character string.
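The 2nd and 4th compression steps can be sketched as follows. The claim does not give the exact logarithmic curve, the bit width, or how the 16 bits are divided, so the 8-bit code and the 6/5/5 split below are assumptions chosen only for illustration.

```python
import math

COUNT_BITS = 8          # assumed bit width for compressed 1-tuple counts
LEVEL_BITS = (6, 5, 5)  # assumed split of the 16-bit word number


def compress_count(count, max_count, bits=COUNT_BITS):
    """Quantize a count onto a logarithmic scale of 2**bits - 1 levels."""
    levels = (1 << bits) - 1
    return round(levels * math.log1p(count) / math.log1p(max_count))


def expand_count(code, max_count, bits=COUNT_BITS):
    """Approximate inverse of compress_count."""
    levels = (1 << bits) - 1
    return round(math.expm1(code / levels * math.log1p(max_count)))


def split_word_id(word_id):
    """Split a 16-bit word number into (high, mid, low) index fields."""
    if not 0 <= word_id < (1 << 16):
        raise ValueError("word number must fit in 16 bits")
    _, mid_bits, low_bits = LEVEL_BITS
    low = word_id & ((1 << low_bits) - 1)
    mid = (word_id >> low_bits) & ((1 << mid_bits) - 1)
    high = word_id >> (low_bits + mid_bits)
    return high, mid, low


def join_word_id(high, mid, low):
    """Rebuild the 16-bit word number from its three index fields."""
    _, mid_bits, low_bits = LEVEL_BITS
    return (high << (mid_bits + low_bits)) | (mid << low_bits) | low


code = compress_count(500, max_count=1_000_000)
print(code, expand_count(code, max_count=1_000_000))
print(split_word_id(0xFFFF))  # (63, 31, 31)
```

The logarithmic code trades a few percent of relative error in each count for a fixed one-byte storage cost, and the three index fields let a lookup walk three small tables instead of one table with 65,536 entries.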
4. The initial four-stroke Chinese sentence input method as claimed in claim 3, characterized in that said step of estimating the probability of n-tuples that do not occur is as follows: when the training probability of the triple (a, b, c), formed by three consecutive words a, b and c, is 0, that is, when the triple does not occur in the training corpus, a two-way low-order degeneration algorithm is used for the estimate, namely the probability of the triple (a, b, c) is estimated with reference to the training probabilities of both 2-tuples (a, b) and (b, c); this process is recursive, that is, if the training probability of a 2-tuple (x, y) is 0, the two-way degeneration algorithm is applied again and the estimate is made from the training probabilities of both word x and word y.
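A minimal sketch of the recursion described in this claim. The claim states only that both lower-order tuples are consulted. The specific combination rule used here, P(a, b, c) ≈ P(a, b) · P(b, c) / P(b) and P(x, y) ≈ P(x) · P(y), is an assumption made for illustration, and the probabilities are toy values.

```python
def estimate_pair(x, y, unigram_p, bigram_p):
    """Joint probability of (x, y); fall back to both single words."""
    p = bigram_p.get((x, y), 0.0)
    if p > 0.0:
        return p
    # Assumed fallback rule: treat x and y as independent.
    return unigram_p.get(x, 0.0) * unigram_p.get(y, 0.0)


def estimate_triple(a, b, c, unigram_p, bigram_p, trigram_p):
    """Joint probability of (a, b, c); fall back to (a, b) and (b, c)."""
    p = trigram_p.get((a, b, c), 0.0)
    if p > 0.0:
        return p
    p_b = unigram_p.get(b, 0.0)
    if p_b == 0.0:
        return 0.0
    # Assumed combination rule: a and c are independent given b.
    left = estimate_pair(a, b, unigram_p, bigram_p)
    right = estimate_pair(b, c, unigram_p, bigram_p)
    return left * right / p_b


# Toy probabilities, invented for this example.
UNIGRAM = {"a": 0.2, "b": 0.5, "c": 0.1}
BIGRAM = {("a", "b"): 0.1}   # ("b", "c") was never seen
TRIGRAM = {}                 # ("a", "b", "c") was never seen

print(estimate_triple("a", "b", "c", UNIGRAM, BIGRAM, TRIGRAM))
```

Here the unseen triple falls back to the two 2-tuples, and the unseen 2-tuple (b, c) in turn falls back to the two single words, mirroring the recursion in the claim.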
5. The initial four-stroke Chinese sentence input method as claimed in claim 3, characterized in that said search comprises the following steps:
1) the search begins and the candidate search paths are cleared;
2) obtain the code of one entered stroke from a handwriting pad, keyboard or soft keyboard;
3) use the entered stroke to drive a state transition in the stroke-to-character tree;
4) judge whether the 4 stroke codes of one Chinese character have been obtained; if not, go to the 2nd step; if so, continue to the 5th step;
5) obtain all candidates for this single character and advance the search state of the lexical tree;
6) judge whether a word boundary has been reached; if not, go to the 2nd step; if so, continue to the 7th step;
7) obtain all word candidates, append the different candidate words to the existing paths, and score each path by the following formula: $P(w_1, w_2, \ldots, w_N) = P(w_1) \cdot P(w_2 \mid w_1) \cdot \prod_{n=3}^{N} P(w_n \mid w_{n-2}, w_{n-1})$;
8) sort all paths from high to low by probability score;
9) judge whether the input has ended; if not, go to the 2nd step; if so, continue to the 10th step;
10) obtain the highest-scoring whole-sentence candidate;
11) the search for this whole sentence ends.
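Steps 7) to 10) can be sketched as below, assuming the word candidates for each position have already been produced by the stroke and lexical trees. The model dictionary, its toy probabilities, the competing candidates, the floor value for unseen tuples, and the optional `beam_width` pruning are all assumptions of this sketch. Scores are summed in log space, which is equivalent to multiplying the probabilities in the formula.

```python
import math

UNSEEN = 1e-8  # assumed floor standing in for the estimate of unseen tuples


def step_log_prob(model, history, word):
    """log P(word | up to two previous words), looked up by tuple key."""
    context = tuple(history[-2:])
    return math.log(model.get(context + (word,), UNSEEN))


def extend_paths(paths, candidates, model, beam_width=None):
    """Append every candidate word to every path, then sort by score."""
    extended = [
        (words + [w], score + step_log_prob(model, words, w))
        for words, score in paths
        for w in candidates
    ]
    extended.sort(key=lambda path: path[1], reverse=True)
    return extended[:beam_width] if beam_width else extended


def search(candidate_lists, model, beam_width=None):
    """Score whole-sentence paths over successive word candidate lists."""
    paths = [([], 0.0)]
    for candidates in candidate_lists:
        paths = extend_paths(paths, candidates, model, beam_width)
    return paths


# Toy model: (w,) -> P(w), (u, w) -> P(w | u), (t, u, w) -> P(w | t, u).
MODEL = {
    ("上海",): 0.01,
    ("上海", "的"): 0.2,
    ("上海", "的", "工人"): 0.05,
}
# Made-up competing candidates that share stroke codes with the intended words.
CANDIDATES = [["上海"], ["的", "卑"], ["工人", "土人"]]

paths = search(CANDIDATES, MODEL)
best_words, best_score = paths[0]
print("".join(best_words))  # 上海的工人
```

The first element of the sorted list corresponds to the highest-scoring whole-sentence candidate of step 10), and the remaining elements form a candidate sentence list.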
CN 02117934 2002-05-24 2002-05-24 Initial four-stroke Chinese sentence input method for computer Expired - Fee Related CN1203389C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 02117934 CN1203389C (en) 2002-05-24 2002-05-24 Initial four-stroke Chinese sentence input method for computer

Publications (2)

Publication Number Publication Date
CN1378130A true CN1378130A (en) 2002-11-06
CN1203389C CN1203389C (en) 2005-05-25

Family

ID=4744579

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 02117934 Expired - Fee Related CN1203389C (en) 2002-05-24 2002-05-24 Initial four-stroke Chinese sentence input method for computer

Country Status (1)

Country Link
CN (1) CN1203389C (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105573520A (en) * 2015-12-15 2016-05-11 上海嵩恒网络科技有限公司 Method and system for consecutive-typing input of long sentences through Wubi
CN110096693A (en) * 2018-01-29 2019-08-06 北京搜狗科技发展有限公司 A kind of data processing method, device and the device for data processing
CN110110292A (en) * 2018-01-29 2019-08-09 北京搜狗科技发展有限公司 A kind of data processing method, device and the device for data processing

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108345580B (en) * 2017-01-22 2020-05-15 创新先进技术有限公司 Word vector processing method and device

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105573520A (en) * 2015-12-15 2016-05-11 上海嵩恒网络科技有限公司 Method and system for consecutive-typing input of long sentences through Wubi
CN105573520B (en) * 2015-12-15 2018-03-30 上海嵩恒网络科技有限公司 The long sentence of a kind of five even beats input method and its system
CN110096693A (en) * 2018-01-29 2019-08-06 北京搜狗科技发展有限公司 A kind of data processing method, device and the device for data processing
CN110110292A (en) * 2018-01-29 2019-08-09 北京搜狗科技发展有限公司 A kind of data processing method, device and the device for data processing
CN110110292B (en) * 2018-01-29 2023-11-14 北京搜狗科技发展有限公司 Data processing method and device for data processing
CN110096693B (en) * 2018-01-29 2024-05-28 北京搜狗科技发展有限公司 Data processing method and device for data processing

Also Published As

Publication number Publication date
CN1203389C (en) 2005-05-25

Similar Documents

Publication Publication Date Title
CN1159661C (en) System for Chinese tokenization and named entity recognition
CN1207664C (en) Error correcting method for voice identification result and voice identification system
CN1815467A (en) Dictionary learning method, and devcie for using same, input method and user terminal device for using same
CN1667699A (en) Generating large units of graphonemes with mutual information criterion for letter to sound conversion
CN1256650C (en) Chinese whole sentence input method
CN1203389C (en) Initial four-stroke Chinese sentence input method for computer
CN1737739A (en) Tibetan input method based on English keyboard
CN1271550C (en) Sentence boundary identification method in spoken language dialogue
CN1136496C (en) Simplified spelling-touching screen mouse chinese character input method
CN1187677C (en) Method for inputting Chinese holophrase into computers by using partial stroke
CN1034245C (en) Burmese characters four-code intelligent coding method and keyboard thereof
CN1435749A (en) Chinese character stroke and phonetic code input method and keyboard thereof
CN1292329C (en) Pictographic code keyboard and multiple letter input method
CN1147780C (en) Three-stroke digital code Chinese character input method and keyboard
CN1118085A (en) Chinese character input system capable of inputing by digital keyboard and its keyboard
CN1234061C (en) General Chinese character input method suitable for letter keyboard and digital keyboard in computer and its keyboard
CN1201220C (en) Efficient key code input method in computer
CN1598743A (en) Input method for inputing Chinese according to standard stroke and its keyboard
CN1272693C (en) Artificial phonetic digital input method
CN1419179A (en) Chinese characters input method according to stroke sequence and keyboard thereof
CN1195257C (en) Chinese-character structure code input method
CN1162766C (en) Chinese-character 'pronunciation-shape code' input method and its keyboard profile
CN1115619C (en) Chinese character phonetic and configuration assembling input method for computer
CN1504863A (en) Concise Korean characters input method for numerals
CN1554994A (en) Hand phone chinese character input method and its keyboard relating to digital symbol and pictographs

Legal Events

Date Code Title Description
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C06 Publication
PB01 Publication
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20050525

PC01 Cancellation of the registration of the contract for pledge of patent right

Date of cancellation: 20130307

Granted publication date: 20050525

Pledgee: Zhongguancun Beijing technology financing Company limited by guarantee

Pledgor: Zheng Fang

Registration number: 200501226

PLDC Enforcement, change and cancellation of contracts on pledge of patent right or utility model
PM01 Change of the registration of the contract for pledge of patent right

Change date: 20130307

Registration number: 200501226

Pledgee after: Zhongguancun Beijing technology financing Company limited by guarantee

Pledgee before: Zhongguancun Beijing science and technology Company limited by guarantee