CN102103416B - Chinese character input method and device - Google Patents

Publication number
CN102103416B
Authority
CN
China
Prior art keywords
entry
candidate
weight
probability
pinyin string
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN200910261064A
Other languages
Chinese (zh)
Other versions
CN102103416A (en
Inventor
蔡衡
董恭谨
李洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sina Technology China Co Ltd
Original Assignee
Sina Technology China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sina Technology China Co Ltd filed Critical Sina Technology China Co Ltd
Priority to CN200910261064A priority Critical patent/CN102103416B/en
Publication of CN102103416A publication Critical patent/CN102103416A/en
Application granted granted Critical
Publication of CN102103416B publication Critical patent/CN102103416B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Document Processing Apparatus (AREA)

Abstract

The embodiment of the invention provides a Chinese character input method and a Chinese character input device for solving the problem that Chinese character input is slow in the prior art. The method comprises the following steps: acquiring a pinyin string; splitting the pinyin string according to a dictionary to obtain the pinyin substrings of the pinyin string; acquiring from the dictionary the candidate entries corresponding to the pinyin substrings, together with the occurrence probability of each candidate entry, the probability that the candidate entry occurs given that another entry occurs, and the characteristics of the candidate entry; calculating the weight of each candidate entry from left to right according to the acquired probabilities and characteristics; and determining the input result according to the weights of the candidate entries. Because the characteristics of words are considered and these characteristics constrain one another, the correctness of the Chinese characters produced for the pinyin string is improved and input becomes faster.

Description

Chinese character input method and device
Technical field
The present invention relates to Chinese character input technology, and in particular to a Chinese character input method and device.
Background art
When typing, users rely on an input method system to record the information they want to express. Much of that information consists of long sentences, so users enter the pinyin of a whole sentence at once and expect the intended sentence in return. This requires an important function of the input method system: intelligent sentence composition. The same pinyin string can correspond to multiple words, entries, or sentences, and the input method system should offer the user the information most likely to be expressed by that pinyin. At present, input method systems mainly rely on the occurrence probabilities of entries and offer the entry, phrase, or sentence with the highest occurrence probability as the candidate.
The candidates offered by an input method system are generally the characters, entries, and English words that occur most frequently in daily use, sorted in descending order of occurrence probability. When a long sentence is entered, an intelligent matching algorithm combines entries into the sentence with the highest co-occurrence probability and offers it as the candidate. For example, after the pinyin string xian'cheng is entered, the entries corresponding to this pinyin string are sorted by their frequency (or probability) of occurrence: "county town" ranks ahead of "ready-made" and "thread", whereas an entry such as "taking advantage of earlier" occurs so rarely that it is not included in the dictionary of the input method system.
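The frequency-based ordering described above can be sketched as follows. The counts are hypothetical placeholders chosen only to reproduce the stated order; they are not taken from any real dictionary, and the function name `rank_candidates` is an assumption.

```python
# Hypothetical occurrence counts for entries that share the pinyin "xian'cheng".
# The numbers are illustrative assumptions, not values from a real dictionary.
frequency = {"ready-made": 95, "county town": 120, "thread": 60}


def rank_candidates(freq):
    """Return the entries sorted by descending occurrence count."""
    return sorted(freq, key=freq.get, reverse=True)


ranked = rank_candidates(frequency)
print(ranked)
```

An entry that is absent from the dictionary, such as the rare one mentioned above, simply never appears in the ranked list.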
When a sentence is entered, the input method system segments the entered pinyin and then uses word frequencies to find the sentence with the highest co-occurrence probability, as shown in Fig. 1.
As shown in Fig. 1, the input pinyin string "bushoufanshiqinrao" is segmented into pinyin substrings that each correspond to a single Chinese character, giving "bu'shou'fan'shi'qin'rao". These pinyin substrings can correspond to single characters such as "do not receive meal be parent around" or "portion receives downer Qin Rao", and the single characters are then combined into words; each word is marked in Fig. 1. The pinyin substrings corresponding to words assembled from characters are "bushou", "fanshi", and "qinrao": "bushou" can correspond to words such as "not receiving", "fanshi" to words such as "every" and "everything", and "qinrao" to words such as "invasion". The present method uses the probability $P(A_i \mid A_{i-1})$ that two adjacent words occur in succession and the probability $P(A_i)$ that the current entry occurs, combined with a hidden Markov model, to find the maximum probability of the whole sentence. The general formula is:
[Formula image GSB00000821671800021 is not reproduced in this text.]
According to this formula, the values Weight(S1), Weight(S2), and so on can be calculated, and the whole sentence S with the maximum probability P(S) is selected as the output of intelligent sentence composition.
Although the present technology meets the needs of intelligent sentence composition to a certain extent, problems remain. The present method considers only word frequencies and the co-occurrence probability of two words, and ignores other relations such as the attributes of entries. Because the number of entries is huge, the number of word pairs grows quadratically, and to fit this mass of relations into limited storage, current input method systems can only discard relations deemed unimportant. This reduces the accuracy of intelligent sentence composition to some extent. Moreover, conditional probabilities between entries and occurrence frequencies alone cannot solve every problem. As shown in Fig. 1, the input method system naturally renders the intended sentence "not invaded and not harassed by everything" as "not receiving every invasion". The user therefore has to correct the input result during input, which slows input.
Summary of the invention
Embodiments of the present invention provide a Chinese character input method and device that address the slow Chinese character input speed of the prior art.
An embodiment of the present invention provides a Chinese character input method, comprising: obtaining a pinyin string; segmenting the pinyin string according to a dictionary to obtain the pinyin substrings of the pinyin string, the dictionary comprising entries, the pinyin corresponding to each entry, the occurrence probability of each entry, the probability that an entry occurs given that another entry occurs, the part of speech of each entry, and the conditional probabilities between parts of speech; obtaining from the dictionary the candidate entries corresponding to the pinyin substrings, together with the occurrence probability of each candidate entry, the probability that the candidate entry occurs given that another entry occurs, and the part of speech of the candidate entry; calculating the weight of each candidate entry from left to right according to the occurrence probability of the candidate entry, the probability that the candidate entry occurs given that another entry occurs, and the part of speech of the candidate entry; from the weights of all candidate entries corresponding to pinyin substrings that include the last pinyin substring of the pinyin string, finding the candidate entry with the maximum weight, determining each candidate entry corresponding to the pinyin string according to this maximum-weight candidate entry, and taking the combination of these candidate entries as the input result. The weight is calculated as follows:
$$\mathrm{Weight}(A_i) = \max\bigl(\mathrm{Weight}(A_{i-1}) + a \log P(A_i \mid A_{i-1}) + b \log P(A_i) + c \log P(\mathrm{Prop}(A_i) \mid \mathrm{Prop}(A_{i-1}))\bigr)$$
where $i = 1, \dots, M$, and $M$ is the number of pinyin substrings, each corresponding to a single Chinese character, into which the pinyin string is segmented; $A_i$ denotes the entry at the $i$-th position; $\mathrm{Weight}(A_i)$ denotes the weight of entry $A_i$; $a$, $b$, and $c$ are constants; $P(A_i \mid A_{i-1})$ is the probability that $A_i$ occurs given entry $A_{i-1}$; $P(A_i)$ is the probability that entry $A_i$ occurs; $\mathrm{Prop}(A)$ is the part of speech of entry $A$; and $P(\mathrm{Prop}(A_i) \mid \mathrm{Prop}(A_{i-1}))$ is the probability that the part of speech $\mathrm{Prop}(A_i)$ of $A_i$ occurs given the part of speech $\mathrm{Prop}(A_{i-1})$ of $A_{i-1}$.
The step of determining each candidate entry corresponding to the pinyin string according to the maximum-weight candidate entry comprises: removing the pinyin substring of the maximum-weight candidate entry from the pinyin string; taking the remaining pinyin string as the current pinyin string; from the weights of all candidate entries corresponding to pinyin substrings that include the last pinyin substring of the current pinyin string, finding the candidate entry with the maximum weight; and repeating until the pinyin substring that begins the pinyin string has been covered. The candidate entries obtained in this way are the candidate entries corresponding to the pinyin string.
An embodiment of the present invention also provides a Chinese character input device, comprising: a dictionary, which comprises entries, the pinyin corresponding to each entry, the occurrence probability of each entry, the probability that an entry occurs given that another entry occurs, the part of speech of each entry, and the conditional probabilities between parts of speech; a first acquiring unit for obtaining a pinyin string; a segmentation unit for segmenting the pinyin string according to the dictionary to obtain the pinyin substrings of the pinyin string; a second acquiring unit for obtaining from the dictionary the candidate entries corresponding to the pinyin substrings, together with the occurrence probability of each candidate entry, the probability that the candidate entry occurs given that another entry occurs, the part of speech of the candidate entry, and the conditional probabilities between parts of speech; and a computing unit for calculating the weight of each candidate entry from left to right according to the occurrence probability of the candidate entry, the probability that the candidate entry occurs given that another entry occurs, and the part of speech of the candidate entry, the weight being calculated as follows:
$$\mathrm{Weight}(A_i) = \max\bigl(\mathrm{Weight}(A_{i-1}) + a \log P(A_i \mid A_{i-1}) + b \log P(A_i) + c \log P(\mathrm{Prop}(A_i) \mid \mathrm{Prop}(A_{i-1}))\bigr)$$
where $i = 1, \dots, M$, and $M$ is the number of pinyin substrings, each corresponding to a single Chinese character, into which the pinyin string is segmented; $A_i$ denotes the entry at the $i$-th position; $\mathrm{Weight}(A_i)$ denotes the weight of entry $A_i$; $a$, $b$, and $c$ are constants; $P(A_i \mid A_{i-1})$ is the probability that $A_i$ occurs given entry $A_{i-1}$; $P(A_i)$ is the probability that entry $A_i$ occurs; $\mathrm{Prop}(A)$ is the part of speech of entry $A$; and $P(\mathrm{Prop}(A_i) \mid \mathrm{Prop}(A_{i-1}))$ is the probability that the part of speech $\mathrm{Prop}(A_i)$ of $A_i$ occurs given the part of speech $\mathrm{Prop}(A_{i-1})$ of $A_{i-1}$. The device further comprises a determining unit for: finding the candidate entry with the maximum weight from the weights of all candidate entries corresponding to pinyin substrings that include the last pinyin substring of the pinyin string; removing the pinyin substring of this maximum-weight candidate entry from the pinyin string; taking the remaining pinyin string as the current pinyin string; finding the candidate entry with the maximum weight from the weights of all candidate entries corresponding to pinyin substrings that include the last pinyin substring of the current pinyin string; and repeating until the pinyin substring that begins the pinyin string has been covered. The candidate entries obtained in this way are the candidate entries corresponding to the pinyin string, and their combination is taken as the input result.
Because embodiments of the present invention take the part of speech of words into account, and parts of speech constrain one another, this constraint improves the correctness of the Chinese characters produced for an input pinyin string and thereby increases input speed.
Brief description of the drawings
Fig. 1 shows a prior-art Chinese character segmentation method;
Fig. 2 shows the Chinese character input method of an embodiment of the present invention;
Fig. 3 shows the Chinese character segmentation method in an embodiment of the present invention;
Fig. 4 shows the Chinese character input device of an embodiment of the present invention.
Detailed description
To help those of ordinary skill in the art understand and implement the present invention, embodiments of the invention are now described with reference to the accompanying drawings.
Embodiment one
As shown in Fig. 2, this embodiment provides a Chinese character input method comprising the following steps:
Step 21: obtain a pinyin string.
Step 22: segment the pinyin string according to a dictionary to obtain the pinyin substrings of the pinyin string. The dictionary comprises entries, the pinyin corresponding to each entry, the occurrence probability of each entry, the probability that an entry occurs given that another entry occurs, the part of speech of each entry, the conditional probabilities between parts of speech, and so on. A pinyin substring may be the pinyin of a single Chinese character or the pinyin of a word.
Step 23: obtain from the dictionary the candidate entries or candidate words corresponding to the pinyin substrings, together with the occurrence probability of each candidate, the probability that the candidate occurs given that another entry occurs, and the part of speech of the candidate. For convenience of description, candidate words and candidate entries are collectively referred to as candidate entries; "word" and "entry" denote the same concept.
Step 24: calculate the weight of each candidate entry from left to right according to the occurrence probability of the candidate entry, the probability that the candidate entry occurs given that another entry occurs, and the part of speech of the candidate entry. The weight is calculated as follows:
$$\mathrm{Weight}(A_i) = \max\bigl(\mathrm{Weight}(A_{i-1}) + a \log P(A_i \mid A_{i-1}) + b \log P(A_i) + c \log P(\mathrm{Prop}(A_i) \mid \mathrm{Prop}(A_{i-1}))\bigr)$$
where $i = 1, \dots, M$, and $M$ is the total number of pinyin substrings, each corresponding to a single Chinese character, into which the pinyin string is segmented; $A_i$ denotes the entry at the $i$-th position; $\mathrm{Weight}(A_i)$ denotes the weight of entry $A_i$; $a$, $b$, and $c$ are constants; $P(A_i \mid A_{i-1})$ is the probability that $A_i$ occurs given entry $A_{i-1}$; $P(A_i)$ is the probability that entry $A_i$ occurs; $\mathrm{Prop}(A)$ is the part of speech of word $A$; and $P(\mathrm{Prop}(A_i) \mid \mathrm{Prop}(A_{i-1}))$ is the probability that the part of speech $\mathrm{Prop}(A_i)$ of $A_i$ occurs given the part of speech $\mathrm{Prop}(A_{i-1})$ of $A_{i-1}$.
Because the above formula takes part of speech into account, the accuracy of sentence composition is greatly improved.
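One step of the weight recurrence in step 24 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the default constants a = b = c = 1, and the probability values are all assumptions, and every probability must be strictly positive because `math.log` rejects zero.

```python
import math


def step_score(p_word_given_prev, p_word, p_pos_given_prev_pos, a=1.0, b=1.0, c=1.0):
    """a*log P(A_i|A_{i-1}) + b*log P(A_i) + c*log P(Prop(A_i)|Prop(A_{i-1}))."""
    return (a * math.log(p_word_given_prev)
            + b * math.log(p_word)
            + c * math.log(p_pos_given_prev_pos))


def weight(predecessors, p_word, a=1.0, b=1.0, c=1.0):
    """Weight(A_i): the best predecessor weight plus the score of this step.

    predecessors is an iterable of tuples
    (Weight(A_{i-1}), P(A_i|A_{i-1}), P(Prop(A_i)|Prop(A_{i-1}))).
    """
    return max(w + step_score(p_bigram, p_word, p_pos, a, b, c)
               for w, p_bigram, p_pos in predecessors)


# Two hypothetical predecessors competing for the same entry A_i.
w = weight([(-2.0, 0.10, 0.50), (-1.5, 0.02, 0.40)], p_word=0.01)
print(w)
```

Setting c to 0 removes the part-of-speech term and leaves only the word-level terms.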
Preferably, to calculate the weights of the candidate entries, the candidate entries are arranged in the order in which their corresponding pinyin substrings were entered. A pinyin string can be segmented in several ways, and each segmentation splits the pinyin string into a different combination of pinyin substrings. The pinyin substrings produced by all segmentations can be arranged in a two-dimensional matrix of size N × M, where N is the maximum number of pinyin substrings that can begin at any one Chinese character position in the pinyin string, and M is the maximum number of pinyin substrings into which the pinyin string can be segmented, that is, the total number of pinyin substrings that each correspond to a single Chinese character. Each cell of the matrix is called a node, and the column of a node is the position of the first Chinese character of the word corresponding to its pinyin substring. All nodes in the first row of the matrix therefore exist, while some nodes in other rows may not. Each non-empty node represents a pinyin substring, which according to the dictionary can correspond to one or more words. Linking each pinyin substring to the pinyin substrings that immediately follow it turns the matrix into a graph.
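The graph of adjacent pinyin substrings can be sketched as follows, using the pinyin string and substrings from Fig. 1. This is a minimal sketch under stated assumptions: the function name `build_lattice` is hypothetical, the set of known substrings stands in for a dictionary lookup, and positions are letter offsets in the string rather than Chinese character positions.

```python
def build_lattice(pinyin, known_substrings):
    """Return the nodes as (start, end) spans and the edges between adjacent spans."""
    nodes = [(s, e)
             for s in range(len(pinyin))
             for e in range(s + 1, len(pinyin) + 1)
             if pinyin[s:e] in known_substrings]
    # A node is linked to every node that starts exactly where it ends.
    edges = {n: [m for m in nodes if m[0] == n[1]] for n in nodes}
    return nodes, edges


known = {"bu", "shou", "fan", "shi", "qin", "rao", "bushou", "fanshi", "qinrao"}
text = "bushoufanshiqinrao"
nodes, edges = build_lattice(text, known)
for start, end in nodes:
    print(text[start:end], "->", [text[s:e] for s, e in edges[(start, end)]])
```

Nodes that end at the last letter of the string have no outgoing edges; they are the nodes from which the final selection starts.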
Step 25: from the weights of all candidate entries corresponding to pinyin substrings that include the last pinyin substring of the pinyin string, find the candidate entry with the maximum weight, and determine each candidate entry corresponding to the pinyin string according to this maximum-weight candidate entry.
The specific steps for determining each candidate entry corresponding to the pinyin string according to the maximum-weight candidate entry are as follows: remove the pinyin substring of the maximum-weight candidate entry from the pinyin string; take the remaining pinyin string as the current pinyin string; from the weights of all candidate entries corresponding to pinyin substrings that include the last pinyin substring of the current pinyin string, find the candidate entry with the maximum weight; and repeat until the pinyin substring that begins the pinyin string has been covered. The candidate entries obtained in this way are the candidate entries corresponding to the pinyin string, and their combination is taken as the input result.
The present invention is further described below with a concrete example, using the pinyin string "shifengongli". Suppose the dictionary contains Table 1, Table 2, and Table 3. Table 1 lists the entries, the pinyin corresponding to each entry, and the occurrence probability and part of speech of each entry; Table 2 lists the probability that an entry occurs given that another entry occurs; Table 3 lists the conditional probabilities between parts of speech.
Table 1
| Entry | Pinyin | Occurrence probability | Part of speech |
|---|---|---|---|
| Be | shi | 15 | Judgment verb |
| Thing | shi | 13 | Noun |
| Divide | fen | 12 | Noun |
| Part | fen | 10 | Measure word |
| Worker | gong | 11 | Noun |
| Attack | gong | 10 | Verb |
| Altogether | gong | 10 | Adverb |
| Lee | li | 12 | Name |
| Power | li | 10 | Noun |
| Very | shifen | 31 | Adverb |
| Time-division | shifen | 30 | Noun |
| Kilometer | gongli | 28 | Noun |
| Material gain | gongli | 26 | Adjective |
Table 2
| Conditional probability of a word | Value |
|---|---|
| P(very \| be) | 12 |
| P(very \| very) | 5 |
| P(be \| very) | 3 |
| P(material gain \| be) | 2 |
Table 3
| Conditional probability of a part of speech | Value |
|---|---|
| P(adjective \| adverb) | 5 |
| P(verb \| adverb) | 5 |
| P(noun \| name) | 2 |
| P(adjective \| noun) | 3 |
| P(noun \| noun) | 2 |
| P(verb \| noun) | 1 |
| P(noun \| verb) | 2 |
| P(verb \| verb) | 2 |
| P(adjective \| verb) | 4 |
| P(noun \| adjective) | 5 |
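The three tables can be held in memory in a form like the following. This is only one possible layout, sketched under assumptions: the class name `Entry`, the dictionary names, and the function `candidates` are hypothetical, entries are identified by the English glosses used in Table 1, and the conditional tables are keyed as (current, previous).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    text: str    # English gloss of the entry, as in Table 1
    pinyin: str
    score: int   # "occurrence probability" column of Table 1
    pos: str     # part of speech


# Table 1.
ENTRIES = [
    Entry("be", "shi", 15, "judgment verb"),
    Entry("thing", "shi", 13, "noun"),
    Entry("divide", "fen", 12, "noun"),
    Entry("part", "fen", 10, "measure word"),
    Entry("worker", "gong", 11, "noun"),
    Entry("attack", "gong", 10, "verb"),
    Entry("altogether", "gong", 10, "adverb"),
    Entry("Lee", "li", 12, "name"),
    Entry("power", "li", 10, "noun"),
    Entry("very", "shifen", 31, "adverb"),
    Entry("time-division", "shifen", 30, "noun"),
    Entry("kilometer", "gongli", 28, "noun"),
    Entry("material gain", "gongli", 26, "adjective"),
]

# Table 2, keyed as (current word, previous word).
WORD_GIVEN_WORD = {
    ("very", "be"): 12, ("very", "very"): 5,
    ("be", "very"): 3, ("material gain", "be"): 2,
}

# Table 3, keyed as (current part of speech, previous part of speech).
POS_GIVEN_POS = {
    ("adjective", "adverb"): 5, ("verb", "adverb"): 5, ("noun", "name"): 2,
    ("adjective", "noun"): 3, ("noun", "noun"): 2, ("verb", "noun"): 1,
    ("noun", "verb"): 2, ("verb", "verb"): 2, ("adjective", "verb"): 4,
    ("noun", "adjective"): 5,
}


def candidates(pinyin):
    """Return the Table 1 entries whose pinyin equals the given substring."""
    return [entry for entry in ENTRIES if entry.pinyin == pinyin]


print([entry.text for entry in candidates("gongli")])
```

Storing only the listed pairs keeps the conditional tables sparse; a lookup with a default value can stand in for pairs that are not listed.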
First, the pinyin string "shifengongli" is segmented into the pinyin substrings "shi", "fen", "gong", and "li", which correspond to single Chinese characters, and the pinyin substrings "shifen" and "gongli", which correspond to words. Each pinyin substring forms a node, and the candidate entries corresponding to each pinyin substring are then looked up in the dictionary; that is, each node can correspond to one or more candidate entries, as follows:
"shi" can correspond to the following candidate entries: be, thing, ...
"fen" can correspond to the following candidate entries: divide, part, ...
"gong" can correspond to the following candidate entries: worker, attack, altogether, ...
"li" can correspond to the following candidate entries: Lee, power, ...
"shifen" can correspond to the following candidate entries: very, time-division, ...
"gongli" can correspond to the following candidate entries: kilometer, material gain, ...
The nodes are connected into a network structure according to their order in the pinyin string, as shown in Fig. 3, which clearly reflects the context of each node. Then, according to the formula in step 24 and Tables 1, 2, and 3, the weight of each word in each node is calculated as follows:
Weight(be) = 15
Weight(thing) = 13
Weight(very) = 31
Weight(time-division) = 30
Weight(divide) = max(Weight(be) + P(divide) + P(noun | verb), Weight(thing) + P(divide) + P(noun | noun)) = 29
Weight(part) = max(Weight(be) + P(part), Weight(thing) + P(part)) = 25
Weight(worker) = max(Weight(divide) + P(worker) + P(noun | noun), Weight(part) + P(worker), Weight(very) + P(worker) + P(noun | adjective), Weight(time-division) + P(worker) + P(noun | noun)) = 47
Weight(attack) = max(Weight(divide) + P(attack) + P(verb | noun), Weight(part) + P(attack), Weight(very) + P(attack), Weight(time-division) + P(attack) + P(verb | noun)) = 41
Weight(altogether) = max(Weight(divide) + P(altogether), Weight(part) + P(altogether), Weight(very) + P(altogether), Weight(time-division) + P(altogether)) = 41
Weight(kilometer) = max(Weight(very) + P(kilometer), Weight(divide) + P(kilometer) + P(noun | noun), Weight(time-division) + P(kilometer) + P(noun | noun), Weight(part) + P(kilometer)) = 61
Weight(material gain) = max(Weight(very) + P(material gain) + P(adjective | adverb), Weight(divide) + P(material gain), Weight(time-division) + P(material gain), Weight(part) + P(material gain) + P(adjective | noun)) = 62
Weight(Lee) = max(Weight(worker) + P(Lee), Weight(altogether) + P(Lee), Weight(attack) + P(Lee)) = 59
Weight(power) = max(Weight(worker) + P(power) + P(noun | noun), Weight(altogether) + P(power), Weight(attack) + P(power)) = 59
The symbol max(a, b, c) denotes the largest of the elements a, b, and c.
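The arithmetic behind a few of the weights listed above can be checked with a short sketch. It assumes, as the listed expressions do, that the integer values from Tables 1 and 3 are simply added; the helper name `weight` is hypothetical.

```python
def weight(options):
    """Each option holds the terms for one predecessor; return the best sum."""
    return max(sum(terms) for terms in options)


# Weight(divide): predecessors "be" (15) and "thing" (13), P(divide) = 12,
# P(noun | verb) = 2 and P(noun | noun) = 2.
w_divide = weight([(15, 12, 2), (13, 12, 2)])

# Weight(material gain): predecessors "very" (31), "divide" (29),
# "time-division" (30) and "part" (25), P(material gain) = 26,
# P(adjective | adverb) = 5 and P(adjective | noun) = 3.
w_material_gain = weight([(31, 26, 5), (29, 26), (30, 26), (25, 26, 3)])

# Weight(Lee): predecessors "worker" (47), "altogether" (41) and "attack" (41),
# P(Lee) = 12.
w_lee = weight([(47, 12), (41, 12), (41, 12)])

print(w_divide, w_material_gain, w_lee)
```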
After the weight of each word in each node has been calculated, the word with the maximum weight is sought in the nodes that include the end of the string, namely "gongli" and "li". The set of all words in these nodes is {kilometer, material gain, Lee, power}, and within this set the entry "material gain" has the maximum weight. From the unique predecessor of each entry, which is determined while the entry weights are calculated, the candidate entries that compose the sentence in this example are: very, material gain.
Here the number in parentheses after each entry denotes the weight of that entry; the weights have been specially processed so that each is an integer greater than zero.
To determine the sentence corresponding to the pinyin string, the candidate entry with the maximum weight, namely "material gain", is found among the weights of all candidate entries corresponding to pinyin substrings that include the last pinyin substring of the pinyin string (that is, "li" and "gongli"). The pinyin substring of this candidate entry is removed from the pinyin string, and the remaining pinyin string becomes the current pinyin string. The maximum-weight candidate entry is then found in the same way, and the process repeats until the pinyin substring that begins the pinyin string has been covered. This determines each candidate entry corresponding to the pinyin string, and the combination of these candidate entries is taken as the input result.
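The right-to-left selection described above can be sketched as follows, using the weights listed in this example. The function name `trace_back` and the key layout are assumptions; positions count single-character pinyin substrings (shi, fen, gong, li), so the whole string spans positions 0 to 4. The sketch assumes that at least one candidate ends at every position it visits.

```python
# Weights from the worked example, keyed as (entry, start, end).
WEIGHTS = {
    ("be", 0, 1): 15, ("thing", 0, 1): 13,
    ("very", 0, 2): 31, ("time-division", 0, 2): 30,
    ("divide", 1, 2): 29, ("part", 1, 2): 25,
    ("worker", 2, 3): 47, ("attack", 2, 3): 41, ("altogether", 2, 3): 41,
    ("kilometer", 2, 4): 61, ("material gain", 2, 4): 62,
    ("Lee", 3, 4): 59, ("power", 3, 4): 59,
}


def trace_back(weights, length):
    """Repeatedly pick the heaviest entry ending at the current end position."""
    result, end = [], length
    while end > 0:
        entry, start, _ = max((k for k in weights if k[2] == end), key=weights.get)
        result.append(entry)
        end = start  # drop the chosen entry's substring from the current string
    return result[::-1]


sentence = trace_back(WEIGHTS, 4)
print(sentence)
```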
To improve efficiency, this embodiment may use the Viterbi algorithm when searching for the whole sentence with the maximum weight, which reduces the problem from exponential to polynomial complexity and greatly improves the efficiency of the program.
When the weight of each word in each node is calculated without part-of-speech information, the word frequency of "kilometer" is much greater than that of "material gain", so the probability P(S) of the sentence "very kilometer" is higher, and the conventional method produces the sentence "very kilometer" from the pinyin string "shifengongli". In the embodiment of the present invention, the part of speech of each word is taken into account. For example, an adverb is not followed by a noun, so P(Prop(kilometer) | Prop(very)) equals 0, whereas P(Prop(material gain) | Prop(very)) is greater than 0. The probability of "very material gain" is therefore greater than that of "very kilometer", and the pinyin string "shifengongli" finally yields the sentence "very material gain" as the candidate. This improves the correctness of the input and thereby the input speed.
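The effect of the part-of-speech term can be illustrated with a small Viterbi-style sketch over the "shifengongli" example. It rests on several assumptions: the integer values of Tables 1 and 3 are added directly, part-of-speech pairs not listed in Table 3 contribute 0, the judgment verb "be" is treated as a verb, and the Table 2 term is omitted because none of its word pairs are adjacent in this lattice. The names `LATTICE`, `POS_SCORE`, and `best_sentence` are hypothetical. Intermediate weights depend on how unlisted pairs are scored, so they need not match every value listed earlier; the final choice is the same.

```python
# (start, end) -> [(entry, Table 1 value, part of speech)].
# Positions count single-character pinyin substrings: shi | fen | gong | li.
LATTICE = {
    (0, 1): [("be", 15, "verb"), ("thing", 13, "noun")],
    (0, 2): [("very", 31, "adverb"), ("time-division", 30, "noun")],
    (1, 2): [("divide", 12, "noun"), ("part", 10, "measure word")],
    (2, 3): [("worker", 11, "noun"), ("attack", 10, "verb"),
             ("altogether", 10, "adverb")],
    (2, 4): [("kilometer", 28, "noun"), ("material gain", 26, "adjective")],
    (3, 4): [("Lee", 12, "name"), ("power", 10, "noun")],
}

# Table 3, keyed as (current part of speech, previous part of speech).
POS_SCORE = {
    ("adjective", "adverb"): 5, ("verb", "adverb"): 5, ("noun", "name"): 2,
    ("adjective", "noun"): 3, ("noun", "noun"): 2, ("verb", "noun"): 1,
    ("noun", "verb"): 2, ("verb", "verb"): 2, ("adjective", "verb"): 4,
    ("noun", "adjective"): 5,
}


def best_sentence(lattice, length, c=1):
    """Left-to-right pass with back-pointers; c scales the part-of-speech term."""
    best = {}  # (entry, start, end) -> (weight, part of speech, previous key)
    for (start, end), entries in sorted(lattice.items()):
        for entry, score, pos in entries:
            if start == 0:
                best[(entry, start, end)] = (score, pos, None)
                continue
            weight, prev = max(
                (w + score + c * POS_SCORE.get((pos, prev_pos), 0), key)
                for key, (w, prev_pos, _) in best.items()
                if key[2] == start
            )
            best[(entry, start, end)] = (weight, pos, prev)
    # Pick the heaviest entry that ends the string, then follow back-pointers.
    key = max((k for k in best if k[2] == length), key=lambda k: best[k][0])
    total, words = best[key][0], []
    while key is not None:
        words.append(key[0])
        key = best[key][2]
    return words[::-1], total


with_pos = best_sentence(LATTICE, 4, c=1)
without_pos = best_sentence(LATTICE, 4, c=0)
print(with_pos, without_pos)
```

Each entry is scored once against its immediate predecessors, which is what keeps the search polynomial instead of enumerating every segmentation.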
Embodiment two
As shown in Fig. 4, this embodiment provides a Chinese character input device, comprising: a dictionary, which comprises entries, the pinyin corresponding to each entry, the occurrence probability of each entry, the probability that an entry occurs given that another entry occurs, the part of speech of each entry, and the conditional probabilities between parts of speech; a first acquiring unit for obtaining a pinyin string; a segmentation unit for segmenting the pinyin string according to the dictionary to obtain the pinyin substrings of the pinyin string; a second acquiring unit for obtaining from the dictionary the candidate entries corresponding to the pinyin substrings, together with the occurrence probability of each candidate entry, the probability that the candidate entry occurs given that another entry occurs, and the part of speech of the candidate entry; a computing unit for calculating the weight of each candidate entry from left to right according to the occurrence probability of the candidate entry, the probability that the candidate entry occurs given that another entry occurs, and the part of speech of the candidate entry; and a determining unit for finding the candidate entry with the maximum weight from the weights of all candidate entries corresponding to pinyin substrings that include the last pinyin substring of the pinyin string, determining each candidate entry corresponding to the pinyin string according to this maximum-weight candidate entry, and taking the combination of these candidate entries as the input result.
The computing unit uses the following formula:
$$\mathrm{Weight}(A_i) = \max\bigl(\mathrm{Weight}(A_{i-1}) + a \log P(A_i \mid A_{i-1}) + b \log P(A_i) + c \log P(\mathrm{Prop}(A_i) \mid \mathrm{Prop}(A_{i-1}))\bigr)$$
where $i = 1, \dots, M$, and $M$ is the number of pinyin substrings, each corresponding to a single Chinese character, into which the pinyin string is segmented; $A_i$ denotes the entry at the $i$-th position; $\mathrm{Weight}(A_i)$ denotes the weight of entry $A_i$; $a$, $b$, and $c$ are constants; $P(A_i \mid A_{i-1})$ is the probability that $A_i$ occurs given entry $A_{i-1}$; $P(A_i)$ is the probability that entry $A_i$ occurs; $\mathrm{Prop}(A)$ is the part of speech of word $A$; and $P(\mathrm{Prop}(A_i) \mid \mathrm{Prop}(A_{i-1}))$ is the probability that the part of speech $\mathrm{Prop}(A_i)$ of $A_i$ occurs given the part of speech $\mathrm{Prop}(A_{i-1})$ of $A_{i-1}$.
For the working principle of each unit of this embodiment, refer to the description of Embodiment one.
Although the present invention has been described through embodiments, those of ordinary skill in the art will appreciate that many variations and modifications can be made without departing from the spirit and substance of the invention, and that the scope of the invention is defined by the appended claims.

Claims (2)

1. a Chinese character input method is characterized in that, comprising:
Obtain pinyin string;
According to dictionary said pinyin string is carried out cutting obtaining the phonetic substring of pinyin string, said dictionary comprises the conditional probability between the probability of occurrence, part of speech, part of speech of this entry under the probability of occurrence, other entry Conditions of entry, phonetic that entry is corresponding, entry;
From dictionary, obtain the candidate entry corresponding with the phonetic substring, and the probability of occurrence, the part of speech of this candidate's entry of this candidate's entry under the corresponding probability of occurrence of this candidate's entry, other entry Conditions;
Calculate the weight of each candidate's entry from left to right according to the probability of occurrence of this candidate's entry under the probability of occurrence of candidate's entry, other entry Conditions, the part of speech of said candidate's entry, the computing formula of this weight is following:
Weight(A i)=max(Weight(A i-1)+(a×log(P(A i|A i-1))+b×log(P(A i))+c×log(P(Prop(A i)|Prop(A i-1)))))
Wherein, i=1 is to M, and M is that pinyin string is the number of the pairing phonetic substring of single Chinese character by cutting; A iRepresent the entry of i position, Weight (A i) expression entry A iWeight, a, b, c are constants; P (A i| A I-1) be meant at entry A I-1Condition under A iThe probability that occurs; P (A i) be entry A iThe probability that occurs, Prop (A) is the part of speech of entry A; P (Prop (A i) | Prop (A I-1)) be at A I-1Part of speech Prop (A I-1) A under the condition that occurs iPart of speech Prop (A i) probability that occurs;
From the weight of all corresponding candidate's entries of the last phonetic substring that comprises pinyin string; Find out the maximum candidate's entry of weight; Remove the phonetic substring of the maximum candidate's entry of this weight from pinyin string, with this phonetic substring as current pinyin string, from the weight of all corresponding candidate's entries of the last phonetic substring that comprises current pinyin string; Find out the maximum candidate's entry of weight; Comprise till the phonetic substring that begins most that up to current pinyin string resulting each candidate's entry is each corresponding candidate's entry of pinyin string, with the combination of these candidate's entries as input results.
2. a Chinese input unit is characterized in that, specifically comprises:
Dictionary, it comprises the conditional probability between the probability of occurrence, part of speech, part of speech of this entry under the probability of occurrence, other entry Conditions of the corresponding phonetic of entry, entry, entry;
First acquiring unit is used to obtain pinyin string;
The cutting unit is used for according to dictionary said pinyin string being carried out cutting to obtain the phonetic substring of pinyin string;
Second acquisition unit; Be used for obtaining the candidate entry corresponding with the phonetic substring from dictionary, and the probability of occurrence, the part of speech of this candidate's entry, the conditional probability between the part of speech of this candidate's entry under the corresponding probability of occurrence of this candidate's entry, other entry Conditions;
a computing unit, configured to calculate the weight of each candidate entry from left to right according to the probability of occurrence of the candidate entry, the probability of occurrence of the candidate entry given the occurrence of other entries, and the part of speech of said candidate entry, the weight being calculated by the following formula:
$$\mathrm{Weight}(A_i) = \max\Big(\mathrm{Weight}(A_{i-1}) + \big(a \times \log P(A_i \mid A_{i-1}) + b \times \log P(A_i) + c \times \log P(\mathrm{Prop}(A_i) \mid \mathrm{Prop}(A_{i-1}))\big)\Big)$$
wherein $i = 1$ to $M$, and $M$ is the number of pinyin substrings, each corresponding to a single Chinese character, into which the pinyin string is segmented; $A_i$ denotes the entry at the $i$-th position; $\mathrm{Weight}(A_i)$ denotes the weight of entry $A_i$; $a$, $b$ and $c$ are constants; $P(A_i \mid A_{i-1})$ is the probability that $A_i$ occurs given entry $A_{i-1}$; $P(A_i)$ is the probability that entry $A_i$ occurs; $\mathrm{Prop}(A)$ is the part of speech of entry $A$; and $P(\mathrm{Prop}(A_i) \mid \mathrm{Prop}(A_{i-1}))$ is the probability that the part of speech $\mathrm{Prop}(A_i)$ of $A_i$ occurs given that the part of speech $\mathrm{Prop}(A_{i-1})$ of $A_{i-1}$ has occurred;
a determining unit, configured to find, from the weights of all candidate entries corresponding to pinyin substrings that contain the last pinyin substring of the pinyin string, the candidate entry with the maximum weight; to remove the pinyin substring of this maximum-weight candidate entry from the pinyin string and take the remaining pinyin string as the current pinyin string; to find, from the weights of all candidate entries corresponding to pinyin substrings that contain the last pinyin substring of the current pinyin string, the candidate entry with the maximum weight; and to repeat this until the first pinyin substring of the pinyin string has been reached, whereby the candidate entries so obtained are the candidate entries corresponding to the pinyin string, and the combination of these candidate entries is taken as the input result.
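The left-to-right weight calculation performed by the computing unit can be illustrated with a short Python sketch. Everything concrete here is an assumption made for illustration: the toy `DICTIONARY`, `BIGRAM` and `POS_BIGRAM` tables and their probabilities are invented, the constants a, b and c are set to 1, unseen pairs fall back to a small `FLOOR` probability, and a candidate at the start of the pinyin string (which has no predecessor) is scored with the b × log P(A_i) term only. The claim itself does not specify these details.

```python
import math

# Hypothetical dictionary: pinyin syllables -> [(entry, part of speech, P(entry))].
DICTIONARY = {
    ("zhong",): [("中", "n", 0.02), ("种", "v", 0.01)],
    ("guo",): [("国", "n", 0.01)],
    ("ren",): [("人", "n", 0.03)],
    ("zhong", "guo"): [("中国", "n", 0.02)],
    ("guo", "ren"): [("国人", "n", 0.001)],
}
# Hypothetical P(entry | previous entry), keyed by (previous entry, entry).
BIGRAM = {("中国", "人"): 0.2, ("中", "国"): 0.05,
          ("中", "国人"): 0.01, ("国", "人"): 0.05}
# Hypothetical P(part of speech | previous part of speech).
POS_BIGRAM = {("n", "n"): 0.3, ("v", "n"): 0.5}
FLOOR = 1e-6          # assumed fallback probability for unseen pairs
A, B, C = 1.0, 1.0, 1.0  # the constants a, b, c (values assumed)


def compute_weights(syllables):
    """Return {end: [(entry, pos, length, weight), ...]}, filled left to right.

    Each candidate ending at `end` gets the maximum, over all candidates
    ending where it starts, of the predecessor weight plus the three
    weighted log-probability terms of the formula.
    """
    table = {0: []}
    for end in range(1, len(syllables) + 1):
        table[end] = []
        for start in range(end):
            key = tuple(syllables[start:end])
            for entry, pos, prob in DICTIONARY.get(key, []):
                unigram = B * math.log(prob)
                if start == 0:
                    weight = unigram  # assumption: no predecessor terms
                else:
                    predecessors = table[start]
                    if not predecessors:
                        continue  # nothing reaches this start position
                    weight = max(
                        w_prev
                        + A * math.log(BIGRAM.get((prev, entry), FLOOR))
                        + unigram
                        + C * math.log(POS_BIGRAM.get((prev_pos, pos), FLOOR))
                        for prev, prev_pos, _, w_prev in predecessors
                    )
                table[end].append((entry, pos, end - start, weight))
    return table


table = compute_weights(["zhong", "guo", "ren"])
best_last = max(table[3], key=lambda c: c[3])  # heaviest candidate covering "ren"
```

With these toy numbers, the heaviest candidate covering the last syllable is 人 preceded by 中国, which is the kind of per-candidate weight the determining unit then compares when it selects entries from right to left.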
CN200910261064A 2009-12-17 2009-12-17 Chinese character input method and device Active CN102103416B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200910261064A CN102103416B (en) 2009-12-17 2009-12-17 Chinese character input method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200910261064A CN102103416B (en) 2009-12-17 2009-12-17 Chinese character input method and device

Publications (2)

Publication Number Publication Date
CN102103416A (en) 2011-06-22
CN102103416B (en) 2012-10-10

Family

ID=44156251

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200910261064A Active CN102103416B (en) 2009-12-17 2009-12-17 Chinese character input method and device

Country Status (1)

Country Link
CN (1) CN102103416B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102929864B (en) * 2011-08-05 2016-08-17 北京百度网讯科技有限公司 Tone-character conversion method and device
CN102955770B (en) * 2011-08-17 2017-07-11 深圳市世纪光速信息技术有限公司 Automatic pinyin recognition method and system
CN103870001B (en) * 2012-12-11 2018-07-10 百度国际科技(深圳)有限公司 Method and electronic device for generating input method candidates
CN104731364A (en) * 2015-03-30 2015-06-24 天脉聚源(北京)教育科技有限公司 Input method and input method system
CN105069028B (en) * 2015-07-16 2018-05-25 广东小天才科技有限公司 Pinyin-based Chinese character pushing method and device
CN105653061B (en) * 2015-12-29 2020-03-31 北京京东尚科信息技术有限公司 Entry retrieval and wrong word detection method and system for pinyin input method
CN105718070A (en) * 2016-01-16 2016-06-29 上海高欣计算机系统有限公司 Pinyin long sentence continuous type-in input method and Pinyin long sentence continuous type-in input system
CN110245331A (en) * 2018-03-09 2019-09-17 中兴通讯股份有限公司 Sentence conversion method, device, server and computer storage medium
CN109085932B (en) * 2018-08-17 2023-07-25 科大讯飞股份有限公司 Candidate entry adjustment method, device, equipment and readable storage medium
CN112987941B (en) * 2019-12-17 2024-02-13 北京搜狗科技发展有限公司 Method and device for generating candidate words

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1165336A (en) * 1996-05-13 1997-11-19 上海欧姆龙计算机有限公司 Chinese character phonetic input system
CN1322984A (en) * 2000-05-10 2001-11-21 微软公司 Chinese characters inputting method and its apparatus

Also Published As

Publication number Publication date
CN102103416A (en) 2011-06-22

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230419

Address after: Room 501-502, 5/F, Sina Headquarters Scientific Research Building, Block N-1 and N-2, Zhongguancun Software Park, Dongbei Wangxi Road, Haidian District, Beijing, 100193

Patentee after: Sina Technology (China) Co.,Ltd.

Address before: 100080, International Building, No. 58 West Fourth Ring Road, Haidian District, Beijing, 1510 floor

Patentee before: Sina.com Technology (China) Co.,Ltd.