CN102103416B - Chinese character input method and device

Info

Publication number: CN102103416B
Application number: CN200910261064A
Authority: CN
Inventors: 蔡衡; 董恭谨; 李洋
Original assignee: Sina Technology China Co Ltd
Current assignee: Sina Technology China Co Ltd
Priority date: 2009-12-17
Filing date: 2009-12-17
Publication date: 2012-10-10
Anticipated expiration: 2029-12-17
Also published as: CN102103416A

Abstract

The embodiment of the invention provides a Chinese character input method and device for solving the problem of slow Chinese character input in the prior art. The method comprises the following steps: acquiring a Pinyin string; splitting the Pinyin string according to a dictionary to obtain the Pinyin sub-strings of the Pinyin string; acquiring, according to the dictionary, the candidate entries corresponding to the Pinyin sub-strings, together with the appearance probability of each candidate entry, the appearance probability of each candidate entry given that another entry appears, and the characteristics of the candidate entries; calculating the weight of each candidate entry from left to right according to these probabilities and characteristics; and determining the input result according to the weights of the candidate entries. Because the embodiment takes the characteristics of words into account, and these characteristics constrain one another, the correctness of the Chinese characters produced for a Pinyin string is improved and input is faster.

Description

Chinese character input method and device

Technical field

The present invention relates to Chinese character input technology, and in particular to a Chinese character input method and device.

Background technology

When typing, we use an input method system to record the information we want to express. A large part of this information consists of long sentences, and obtaining the desired sentence by entering its whole pinyin at once requires an important function of the input method system: intelligent sentence composition. The same pinyin string can correspond to several words, entries, or sentences, and the input method system should offer the user the ones most likely to be what the pinyin is meant to express. At present, input method systems rely mainly on the probability of occurrence of entries, and offer the entry, phrase, or sentence with the highest probability of occurrence as the candidate.

When Chinese characters are entered, the candidates offered by the input method system are generally the entries and English words that occur most often in everyday use, sorted in descending order. When a long sentence is entered, an intelligent matching algorithm combines entries into the sentence with the highest co-occurrence probability and offers it as the candidate. For example, after the pinyin string xian'cheng is entered, the entries corresponding to it are sorted by their frequency (or probability) of occurrence: "county town" is ranked ahead of "ready-made" and "thread", while an entry such as "taking advantage of earlier" occurs so rarely that it is not included in the dictionary of the input method system.

When a sentence is entered, the input method system segments the input pinyin and then uses word frequencies to find the sentence with the highest co-occurrence probability, as illustrated in Fig. 1.

As shown in Fig. 1, the input pinyin string "bushoufanshiqinrao" is first segmented into pinyin substrings that each correspond to a single Chinese character, giving "bu'shou'fan'shi'qin'rao". These pinyin substrings can correspond to single characters such as "do not receive meal be parent around" or "portion receives downer Qin Rao", and so on. The single characters are then combined into words, and each word is marked in the figure. As shown in Fig. 1, the pinyin substrings of the words formed from single characters include "bushou", "fanshi", and "qinrao": "bushou" can correspond to words such as "not receiving", "fanshi" to words such as "every" and "every", and "qinrao" to words such as "invasion". The existing method uses the probability P(A_i | A_{i-1}) that a word A_i occurs after the preceding word A_{i-1} and the probability P(A_i) that the current entry occurs, combined with a hidden Markov model, to find the maximum probability of the whole sentence. The general formula is not reproduced in this text.

According to that formula, the values Weight(S1), Weight(S2), and so on can be calculated, and the whole sentence S with the maximum probability P(S) is selected as the output of intelligent sentence composition.

Although the existing technology meets the needs of intelligent sentence composition to a certain extent, it still has problems. The existing method considers only the frequency of words and the co-occurrence probability of two words, and ignores other relations such as the attributes of entries. Because the number of entries is huge, the number of word pairs grows quadratically with it; to store this mass of relations in limited space, current input method systems have to discard the less important ones, which reduces the accuracy of intelligent sentence composition to some extent. Moreover, conditional probabilities between entries and frequencies of occurrence alone cannot solve every problem. As shown in Fig. 1, the input method system quite naturally converts the pinyin intended as "not invaded and harassed by everything" into "not receiving every invasion". The user therefore has to correct the input result during input, which makes input slow.

Summary of the invention

Embodiments of the present invention provide a Chinese character input method and device that can solve the problem of slow Chinese character input in the prior art.

An embodiment of the present invention provides a Chinese character input method, comprising: obtaining a pinyin string; segmenting the pinyin string according to a dictionary to obtain the pinyin substrings of the pinyin string, the dictionary comprising entries, the pinyin corresponding to each entry, the probability of occurrence of each entry, the probability of occurrence of an entry given that another entry occurs, the part of speech of each entry, and the conditional probabilities between parts of speech; obtaining from the dictionary the candidate entries corresponding to the pinyin substrings, together with the probability of occurrence of each candidate entry, the probability of occurrence of the candidate entry given that another entry occurs, and the part of speech of the candidate entry; calculating the weight of each candidate entry from left to right according to these probabilities and the part of speech of the candidate entry; and, among the weights of all candidate entries whose pinyin substring includes the end of the pinyin string, finding the candidate entry with the largest weight, determining from it the candidate entries corresponding to the pinyin string, and taking the combination of these candidate entries as the input result. The weight is calculated by the following formula:

$$\mathrm{Weight}(A_i) = \max\Bigl(\mathrm{Weight}(A_{i-1}) + a \times \log P(A_i \mid A_{i-1}) + b \times \log P(A_i) + c \times \log P\bigl(\mathrm{Prop}(A_i) \mid \mathrm{Prop}(A_{i-1})\bigr)\Bigr)$$

where i = 1 to M, and M is the number of pinyin substrings obtained when the pinyin string is segmented into substrings that each correspond to a single Chinese character; $A_i$ denotes the entry at position i; $\mathrm{Weight}(A_i)$ denotes the weight of entry $A_i$; a, b, and c are constants; $P(A_i \mid A_{i-1})$ is the probability that $A_i$ occurs given entry $A_{i-1}$; $P(A_i)$ is the probability that entry $A_i$ occurs; $\mathrm{Prop}(A)$ is the part of speech of entry A; and $P(\mathrm{Prop}(A_i) \mid \mathrm{Prop}(A_{i-1}))$ is the probability that the part of speech $\mathrm{Prop}(A_i)$ of $A_i$ occurs given the part of speech $\mathrm{Prop}(A_{i-1})$ of $A_{i-1}$.

The step of determining the candidate entries corresponding to the pinyin string from the candidate entry with the largest weight comprises: removing the pinyin substring of that candidate entry from the pinyin string; taking the remaining pinyin string as the current pinyin string; among the weights of all candidate entries whose pinyin substring includes the end of the current pinyin string, finding the candidate entry with the largest weight; and repeating until the pinyin substring at the very beginning of the pinyin string has been included. The candidate entries obtained in this way are the candidate entries corresponding to the pinyin string.

An embodiment of the present invention also provides a Chinese character input device, comprising: a dictionary, which comprises entries, the pinyin corresponding to each entry, the probability of occurrence of each entry, the probability of occurrence of an entry given that another entry occurs, the part of speech of each entry, and the conditional probabilities between parts of speech; a first acquiring unit, configured to obtain a pinyin string; a segmentation unit, configured to segment the pinyin string according to the dictionary to obtain the pinyin substrings of the pinyin string; a second acquiring unit, configured to obtain from the dictionary the candidate entries corresponding to the pinyin substrings, together with the probability of occurrence of each candidate entry, the probability of occurrence of the candidate entry given that another entry occurs, the part of speech of the candidate entry, and the conditional probabilities between parts of speech; and a computing unit, configured to calculate the weight of each candidate entry from left to right according to these probabilities and the part of speech of the candidate entry, the weight being calculated by the following formula:

$$\mathrm{Weight}(A_i) = \max\Bigl(\mathrm{Weight}(A_{i-1}) + a \times \log P(A_i \mid A_{i-1}) + b \times \log P(A_i) + c \times \log P\bigl(\mathrm{Prop}(A_i) \mid \mathrm{Prop}(A_{i-1})\bigr)\Bigr)$$

where i = 1 to M, and M is the number of pinyin substrings obtained when the pinyin string is segmented into substrings that each correspond to a single Chinese character; $A_i$ denotes the entry at position i; $\mathrm{Weight}(A_i)$ denotes the weight of entry $A_i$; a, b, and c are constants; $P(A_i \mid A_{i-1})$ is the probability that $A_i$ occurs given entry $A_{i-1}$; $P(A_i)$ is the probability that entry $A_i$ occurs; $\mathrm{Prop}(A)$ is the part of speech of entry A; and $P(\mathrm{Prop}(A_i) \mid \mathrm{Prop}(A_{i-1}))$ is the probability that the part of speech of $A_i$ occurs given the part of speech of $A_{i-1}$.

The device further comprises a determining unit, configured to: find the candidate entry with the largest weight among the weights of all candidate entries whose pinyin substring includes the end of the pinyin string; remove the pinyin substring of that candidate entry from the pinyin string and take the remaining pinyin string as the current pinyin string; find the candidate entry with the largest weight among the weights of all candidate entries whose pinyin substring includes the end of the current pinyin string; and repeat until the pinyin substring at the very beginning of the pinyin string has been included. The candidate entries obtained in this way are the candidate entries corresponding to the pinyin string, and their combination is taken as the input result.

Because the embodiments of the present invention take the part of speech of words into account, and parts of speech constrain one another to a certain degree, this constraint improves the correctness of the Chinese characters produced for an input pinyin string and thereby increases input speed.

Description of drawings

Fig. 1 shows the Chinese character segmenting method of prior art;

Fig. 2 shows the Chinese character input method of the embodiment of the invention;

Fig. 3 shows the Chinese character segmenting method in the embodiment of the invention;

Fig. 4 shows the Chinese character input device of the embodiment of the invention.

Detailed description of the embodiments

To help persons of ordinary skill in the art understand and implement the present invention, embodiments of the invention are now described with reference to the accompanying drawings.

Embodiment one

As shown in Fig. 2, this embodiment provides a Chinese character input method comprising the following steps:

Step 21: obtain a pinyin string.

Step 22: segment the pinyin string according to a dictionary to obtain the pinyin substrings of the pinyin string. The dictionary comprises entries, the pinyin corresponding to each entry, the probability of occurrence of each entry, the probability of occurrence of an entry given that another entry occurs, the part of speech of each entry, the conditional probabilities between parts of speech, and so on. A pinyin substring can be the pinyin of a single Chinese character or the pinyin of a word.
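The segmentation in step 22 can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the function name `segment`, the use of character offsets, and the filtering of substrings that cannot be part of any complete segmentation are all assumptions made for the example. The set of keys is taken from the pinyin column of the worked example given later in the description.

```python
def segment(pinyin, keys):
    """Return (start, end, substring) for every dictionary pinyin that can
    take part in at least one complete segmentation of the input string."""
    n = len(pinyin)
    # Every substring of the input that appears in the dictionary.
    spans = [(s, e) for s in range(n) for e in range(s + 1, n + 1)
             if pinyin[s:e] in keys]
    # Offsets reachable from the beginning of the string.
    forward = {0}
    for s, e in sorted(spans):
        if s in forward:
            forward.add(e)
    # Offsets from which the end of the string can be reached.
    backward = {n}
    for s, e in sorted(spans, reverse=True):
        if e in backward:
            backward.add(s)
    return [(s, e, pinyin[s:e]) for s, e in spans
            if s in forward and e in backward]


# Hypothetical key set: the pinyin strings of the worked example.
keys = {"shi", "fen", "gong", "li", "shifen", "gongli"}
for span in segment("shifengongli", keys):
    print(span)
```

Each returned tuple corresponds to one pinyin substring, either of a single character ("shi") or of a word ("shifen").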

Step 23: obtain from the dictionary the candidate entries or candidate words corresponding to the pinyin substrings, together with the probability of occurrence of each candidate entry or candidate word, its probability of occurrence given that another entry occurs, and its part of speech. For convenience of description, candidate words and candidate entries are referred to collectively as candidate entries; "word" and "entry" denote the same concept.

Step 24: calculate the weight of each candidate entry from left to right according to the probability of occurrence of the candidate entry, its probability of occurrence given that another entry occurs, and its part of speech. The weight is calculated by the following formula:

$$\mathrm{Weight}(A_i) = \max\Bigl(\mathrm{Weight}(A_{i-1}) + a \times \log P(A_i \mid A_{i-1}) + b \times \log P(A_i) + c \times \log P\bigl(\mathrm{Prop}(A_i) \mid \mathrm{Prop}(A_{i-1})\bigr)\Bigr)$$

where i = 1 to M, and M is the total number of pinyin substrings obtained when the pinyin string is segmented into substrings that each correspond to a single Chinese character; $A_i$ denotes the entry at position i; $\mathrm{Weight}(A_i)$ denotes the weight of entry $A_i$; a, b, and c are constants; $P(A_i \mid A_{i-1})$ is the probability that $A_i$ occurs given entry $A_{i-1}$; $P(A_i)$ is the probability that entry $A_i$ occurs; $\mathrm{Prop}(A)$ is the part of speech of word A; and $P(\mathrm{Prop}(A_i) \mid \mathrm{Prop}(A_{i-1}))$ is the probability that the part of speech $\mathrm{Prop}(A_i)$ of $A_i$ occurs given the part of speech $\mathrm{Prop}(A_{i-1})$ of $A_{i-1}$.
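A single application of the weight formula can be sketched as below. The function name `step_weight`, its argument layout, and the default values a = b = c = 1 are assumptions for illustration; the description only states that a, b, and c are constants. The sketch also assumes that every probability is strictly positive, since the logarithm of zero is undefined.

```python
import math


def step_weight(prev_weights, p_cond, p_word, p_pos, a=1.0, b=1.0, c=1.0):
    """Weight of one candidate entry A_i.

    prev_weights[k] is Weight(A_{i-1}) for the k-th possible predecessor,
    p_cond[k] is P(A_i | A_{i-1}) for that predecessor,
    p_pos[k] is P(Prop(A_i) | Prop(A_{i-1})) for that predecessor,
    p_word is P(A_i). All probabilities are assumed to be > 0.
    """
    return max(
        w + a * math.log(pc) + b * math.log(p_word) + c * math.log(pp)
        for w, pc, pp in zip(prev_weights, p_cond, p_pos)
    )


# Two hypothetical predecessors; the first has the larger weight.
print(step_weight([0.0, -1.0], [0.5, 0.5], 0.5, [0.5, 0.5]))
```

The maximum is taken over all predecessors of the candidate entry, so the function returns the weight contributed by the best preceding entry.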

Because the above formula makes use of part of speech, the accuracy of sentence composition is greatly increased.

Preferably, to calculate the weights of the candidate entries, the pinyin substrings corresponding to these candidate entries are arranged in the order in which they were input. A pinyin string can be segmented in several ways, and each way of segmenting it yields a different combination of pinyin substrings. The pinyin substrings produced by all these segmentations can be arranged in a two-dimensional matrix, which can be regarded as an N × M matrix. Here N is the maximum number of possible pinyin substrings that start at any one Chinese character position in the pinyin string, and M is the maximum number of pinyin substrings into which the pinyin string can be segmented, that is, the total number of substrings when every substring corresponds to a single Chinese character. Each cell of the matrix is called a node, and the column of a node is the position of the first Chinese character of the word corresponding to its pinyin substring. All nodes in the first row of the matrix therefore exist, while some nodes in the other rows may not. Each non-empty node represents one pinyin substring, which according to the dictionary can correspond to one or more words. Connecting each pinyin substring to the pinyin substrings immediately following it turns the matrix into a graph.
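The conversion of nodes into a graph can be sketched as follows. This is an assumption-laden illustration: nodes are represented as hypothetical `(start, end, substring)` tuples using character offsets in the pinyin string rather than the Chinese character columns of the N × M matrix, and `build_graph` is a name chosen for the example.

```python
from collections import defaultdict


def build_graph(spans):
    """Map each node to the nodes whose substring starts where it ends."""
    by_start = defaultdict(list)
    for span in spans:
        by_start[span[0]].append(span)
    return {span: list(by_start[span[1]]) for span in spans}


# Nodes for the pinyin string "shifengongli" (offsets are illustrative).
spans = [(0, 3, "shi"), (0, 6, "shifen"), (3, 6, "fen"),
         (6, 10, "gong"), (6, 12, "gongli"), (10, 12, "li")]
graph = build_graph(spans)
for node, successors in graph.items():
    print(node[2], "->", [s[2] for s in successors])
```

Nodes with an empty successor list are those whose pinyin substring reaches the end of the pinyin string.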

Step 25: among the weights of all candidate entries whose pinyin substring includes the end of the pinyin string, find the candidate entry with the largest weight, and determine the candidate entries corresponding to the pinyin string from this candidate entry.

The candidate entries corresponding to the pinyin string are determined from the candidate entry with the largest weight as follows: remove the pinyin substring of that candidate entry from the pinyin string and take the remaining pinyin string as the current pinyin string; among the weights of all candidate entries whose pinyin substring includes the end of the current pinyin string, find the candidate entry with the largest weight; repeat until the pinyin substring at the very beginning of the pinyin string has been included. The candidate entries obtained in this way are the candidate entries corresponding to the pinyin string, and their combination is taken as the input result.
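The backward selection in step 25 can be sketched as a literal reading of the text: repeatedly take the heaviest candidate entry that ends where the current pinyin string ends, then shorten the string. The function name `trace_back`, the tuple layout, and the character offsets are assumptions; the weights are the values listed in the worked example later in the description. An implementation based on the Viterbi algorithm would more typically follow the best predecessor recorded for each entry, which may give a different path in other cases.

```python
def trace_back(length, candidates):
    """candidates: list of (entry, start, end, weight) tuples.
    Returns the chosen entries from left to right."""
    end, chosen = length, []
    while end > 0:
        # Candidate entries whose pinyin substring ends the current string.
        ending = [c for c in candidates if c[2] == end]
        entry, start, _, _ = max(ending, key=lambda c: c[3])
        chosen.append(entry)
        end = start  # remove the chosen substring from the string
    return chosen[::-1]


# Weights as listed in the worked example for "shifengongli".
candidates = [
    ("be", 0, 3, 15), ("thing", 0, 3, 13),
    ("very", 0, 6, 31), ("time-division", 0, 6, 30),
    ("divide", 3, 6, 29), ("part", 3, 6, 25),
    ("worker", 6, 10, 47), ("attack", 6, 10, 41), ("altogether", 6, 10, 41),
    ("kilometer", 6, 12, 61), ("material gain", 6, 12, 62),
    ("Lee", 10, 12, 59), ("power", 10, 12, 59),
]
print(trace_back(12, candidates))
```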

The invention is further described below with a concrete example, taking the pinyin string "shifengongli". Suppose the dictionary comprises Table 1, Table 2, and Table 3. Table 1 contains the entries, the pinyin corresponding to each entry, the probability of occurrence of each entry, the part of speech of each entry, and so on; Table 2 contains the probability of occurrence of an entry given that another entry occurs; Table 3 contains the conditional probabilities between parts of speech.

Table 1

Entry	Pinyin	Probability of occurrence	Part of speech
Be	shi	15	Judge verb
Thing	shi	13	Noun
Divide	fen	12	Noun
Part	fen	10	Measure word
Worker	gong	11	Noun
Attack	gong	10	Verb
Altogether	gong	10	Adverb
Lee	li	12	Name
Power	li	10	Noun
Very	shifen	31	Adverb
Time-division	shifen	30	Noun
Kilometer	gongli	28	Noun
Material gain	gongli	26	Adjective

Table 2

Conditional probability of words	Probability value
P(very | be)	12
P(very | very)	5
P(be | very)	3
P(material gain | be)	2

Table 3

Conditional probability of parts of speech	Probability value
P(adjective | adverb)	5
P(verb | adverb)	5
P(noun | name)	2
P(adjective | noun)	3
P(noun | noun)	2
P(verb | noun)	1
P(noun | verb)	2
P(verb | verb)	2
P(adjective | verb)	4
P(noun | adjective)	5
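One possible in-memory layout for the three tables is sketched below. The names `TABLE_1`, `TABLE_2`, `TABLE_3`, and `candidates_for` are hypothetical, and the choice of plain Python dictionaries keyed by (item, preceding item) is an assumption made for illustration; the description does not prescribe a storage format. The values are the ones listed in Tables 1 to 3.

```python
# Table 1: entry -> (pinyin, probability of occurrence, part of speech)
TABLE_1 = {
    "be": ("shi", 15, "judge verb"),
    "thing": ("shi", 13, "noun"),
    "divide": ("fen", 12, "noun"),
    "part": ("fen", 10, "measure word"),
    "worker": ("gong", 11, "noun"),
    "attack": ("gong", 10, "verb"),
    "altogether": ("gong", 10, "adverb"),
    "Lee": ("li", 12, "name"),
    "power": ("li", 10, "noun"),
    "very": ("shifen", 31, "adverb"),
    "time-division": ("shifen", 30, "noun"),
    "kilometer": ("gongli", 28, "noun"),
    "material gain": ("gongli", 26, "adjective"),
}

# Table 2: (entry, preceding entry) -> value of P(entry | preceding entry)
TABLE_2 = {("very", "be"): 12, ("very", "very"): 5,
           ("be", "very"): 3, ("material gain", "be"): 2}

# Table 3: (part of speech, preceding part of speech) -> value
TABLE_3 = {
    ("adjective", "adverb"): 5, ("verb", "adverb"): 5,
    ("noun", "name"): 2, ("adjective", "noun"): 3,
    ("noun", "noun"): 2, ("verb", "noun"): 1,
    ("noun", "verb"): 2, ("verb", "verb"): 2,
    ("adjective", "verb"): 4, ("noun", "adjective"): 5,
}


def candidates_for(substring):
    """Candidate entries whose pinyin equals the given pinyin substring."""
    return [entry for entry, (pinyin, _, _) in TABLE_1.items()
            if pinyin == substring]


print(candidates_for("gong"))
```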

First, the pinyin string "shifengongli" is segmented into the pinyin substrings "shi", "fen", "gong", and "li", which correspond to single Chinese characters, and the pinyin substrings "shifen" and "gongli", which correspond to words. Each pinyin substring then forms a node, and the candidate entries corresponding to each pinyin substring are looked up in the dictionary. Each node can thus correspond to one or more candidate entries, as follows:

"shi" can correspond to the following candidate entries: be, thing, ...

"fen" can correspond to the following candidate entries: divide, part, ...

"gong" can correspond to the following candidate entries: worker, attack, altogether, ...

"li" can correspond to the following candidate entries: Lee, power, ...

"shifen" can correspond to the following candidate entries: very, time-division, ...

"gongli" can correspond to the following candidate entries: kilometer, material gain, ...

The nodes are connected into a network according to their order in the pinyin string, as shown in Fig. 3, which makes the preceding and following nodes of each node clear. Then, using the formula in step 24 together with Table 1, Table 2, and Table 3, the weight of each word in each node can be calculated as follows:

Weight(be) = 15

Weight(thing) = 13

Weight(very) = 31

Weight(time-division) = 30

Weight(divide) = max(Weight(be) + P(divide) + P(noun | verb), Weight(thing) + P(divide) + P(noun | noun)) = 29

Weight(part) = max(Weight(be) + P(part), Weight(thing) + P(part)) = 25

Weight(worker) = max(Weight(divide) + P(worker) + P(noun | noun), Weight(part) + P(worker), Weight(very) + P(worker) + P(noun | adjective), Weight(time-division) + P(worker) + P(noun | noun)) = 47

Weight(attack) = max(Weight(divide) + P(attack) + P(verb | noun), Weight(part) + P(attack), Weight(very) + P(attack), Weight(time-division) + P(attack) + P(verb | noun)) = 41

Weight(altogether) = max(Weight(divide) + P(altogether), Weight(part) + P(altogether), Weight(very) + P(altogether), Weight(time-division) + P(altogether)) = 41

Weight(kilometer) = max(Weight(very) + P(kilometer), Weight(divide) + P(kilometer) + P(noun | noun), Weight(time-division) + P(kilometer) + P(noun | noun), Weight(part) + P(kilometer)) = 61

Weight(material gain) = max(Weight(very) + P(material gain) + P(adjective | adverb), Weight(divide) + P(material gain), Weight(time-division) + P(material gain), Weight(part) + P(material gain) + P(adjective | noun)) = 62

Weight(Lee) = max(Weight(worker) + P(Lee), Weight(altogether) + P(Lee), Weight(attack) + P(Lee)) = 59

Weight(power) = max(Weight(worker) + P(power) + P(noun | noun), Weight(altogether) + P(power), Weight(attack) + P(power)) = 59

The notation max(a, b, c) denotes the largest of a, b, and c.
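The first few of these weights can be re-derived with a short script. The sketch below simply evaluates the listed expressions term by term, adding the table values directly as the expressions do; the dictionary names `P`, `POS`, and `w` are hypothetical, and only the terms that appear in the expressions for "divide", "part", and "worker" are included.

```python
# Probability of occurrence values from Table 1 (only the entries used here).
P = {"be": 15, "thing": 13, "very": 31, "time-division": 30,
     "divide": 12, "part": 10, "worker": 11}

# Part-of-speech values from Table 3, keyed as (pos, preceding pos).
POS = {("noun", "verb"): 2, ("noun", "noun"): 2, ("noun", "adjective"): 5}

# Entries that start the string take their probability of occurrence.
w = {name: P[name] for name in ("be", "thing", "very", "time-division")}

w["divide"] = max(w["be"] + P["divide"] + POS[("noun", "verb")],
                  w["thing"] + P["divide"] + POS[("noun", "noun")])

w["part"] = max(w["be"] + P["part"], w["thing"] + P["part"])

w["worker"] = max(w["divide"] + P["worker"] + POS[("noun", "noun")],
                  w["part"] + P["worker"],
                  w["very"] + P["worker"] + POS[("noun", "adjective")],
                  w["time-division"] + P["worker"] + POS[("noun", "noun")])

print(w["divide"], w["part"], w["worker"])  # 29 25 47
```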

After the weight of every word in every node has been calculated, the word with the largest weight is sought in the nodes that contain the end of the string, namely "gongli" and "li". The set of all words in the nodes that contain the end of the string is {kilometer, material gain, Lee, power}. In this set, the entry "material gain" has the largest weight. From the unique relations between entries determined while the entry weights were calculated, it follows that the candidate sentence composed in this example is: very material gain.

Here the number in brackets after each entry denotes the weight of that entry; the weights have been specially processed so that each is an integer greater than zero.

To determine the sentence corresponding to the pinyin string, the candidate entry with the largest weight, namely "material gain", is found among the weights of all candidate entries whose pinyin substring includes the end of the pinyin string (that is, "li" and "gongli"). The pinyin substring of this candidate entry is removed from the pinyin string and the remaining pinyin string is taken as the current pinyin string; the candidate entry with the largest weight is then found in the same way, until the pinyin substring at the very beginning of the pinyin string has been included. The candidate entries corresponding to the pinyin string are thereby determined, and their combination is taken as the input result.

To improve efficiency, this embodiment can use the Viterbi algorithm when searching for the whole sentence with the largest weight. This reduces the problem from exponential complexity to polynomial complexity and greatly improves the efficiency of the program.
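A Viterbi-style search over all segmentations might look like the sketch below. It is an illustration under several assumptions that the description does not specify: the function name `viterbi`, the lexicon layout, the treatment of the first entry (only the b × log P term is applied), the `floor` value substituted for any conditional probability missing from the tables, and the default a = b = c = 1. The probabilities in the example are made-up illustrative values, not data from the patent.

```python
import math


def viterbi(pinyin, lexicon, p_cond, p_pos, a=1.0, b=1.0, c=1.0, floor=1e-6):
    """lexicon: {pinyin substring: [(entry, probability, part of speech)]}.
    p_cond: {(entry, previous entry): probability}.
    p_pos: {(pos, previous pos): probability}.
    Missing conditional probabilities fall back to `floor` (an assumption)."""
    n = len(pinyin)
    # best[end][entry] = (weight, start, previous entry, part of speech)
    best = [{} for _ in range(n + 1)]
    for end in range(1, n + 1):
        for start in range(end):
            for entry, p, pos in lexicon.get(pinyin[start:end], []):
                base = b * math.log(p)
                if start == 0:
                    options = [(base, None)]
                else:
                    options = [
                        (w + base
                         + a * math.log(p_cond.get((entry, prev), floor))
                         + c * math.log(p_pos.get((pos, prev_pos), floor)),
                         prev)
                        for prev, (w, _, _, prev_pos) in best[start].items()
                    ]
                if not options:
                    continue  # no segmentation reaches this start offset
                weight, prev = max(options, key=lambda o: o[0])
                if entry not in best[end] or weight > best[end][entry][0]:
                    best[end][entry] = (weight, start, prev, pos)
    if not best[n]:
        return []
    # Follow the recorded best predecessors back to the start.
    entry, end, path = max(best[n], key=lambda e: best[n][e][0]), n, []
    while entry is not None:
        _, start, prev, _ = best[end][entry]
        path.append(entry)
        entry, end = prev, start
    return path[::-1]


# Hypothetical probabilities, chosen only to exercise the code.
lexicon = {
    "shifen": [("very", 0.03, "adverb")],
    "gongli": [("kilometer", 0.02, "noun"),
               ("material gain", 0.01, "adjective")],
}
p_pos = {("adjective", "adverb"): 0.5, ("noun", "adverb"): 0.01}
print(viterbi("shifengongli", lexicon, {}, p_pos))
```

Each (start, end) pair is visited once and each visit looks only at the entries ending at `start`, so the work grows polynomially with the length of the pinyin string instead of enumerating every segmentation.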

If part-of-speech information were removed when calculating the weight of each word in each node, then, because the word frequency of "kilometer" is much higher than that of "material gain", the probability P(S) of the sentence "very kilometer" would be the higher one, and the conventional method would obtain the sentence "very kilometer" from the pinyin string "shifengongli". In the embodiment of the present invention, however, the part of speech of words is taken into account: an adverb is not followed by a noun, so P(Prop(kilometer) | Prop(very)) equals 0, whereas P(Prop(material gain) | Prop(very)) is greater than 0. The probability of "very material gain" is therefore greater than that of "very kilometer", and the sentence "very material gain" is finally obtained from the pinyin string "shifengongli" as the candidate. This improves the correctness of input and thereby increases input speed.
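The effect of the part-of-speech term on these two sentences can be illustrated with the integer values of the worked example. The sketch below is a simplification, assuming that the scores are added directly as in the listed weight expressions; the names `P_WORD`, `POS`, `P_POS`, and `score` are hypothetical. The value 5 for P(adjective | adverb) comes from Table 3 and the value 0 for P(noun | adverb) from the paragraph above.

```python
# Probability of occurrence values from Table 1.
P_WORD = {"very": 31, "kilometer": 28, "material gain": 26}

# Parts of speech from Table 1.
POS = {"very": "adverb", "kilometer": "noun", "material gain": "adjective"}

# (pos, preceding pos) -> value; 0 for noun after adverb as stated in the text.
P_POS = {("adjective", "adverb"): 5, ("noun", "adverb"): 0}


def score(first, second, use_pos):
    """Additive score of a two-entry sentence, with or without the
    part-of-speech term."""
    total = P_WORD[first] + P_WORD[second]
    if use_pos:
        total += P_POS[(POS[second], POS[first])]
    return total


for use_pos in (False, True):
    print(use_pos,
          score("very", "kilometer", use_pos),
          score("very", "material gain", use_pos))
```

Without the part-of-speech term "very kilometer" scores higher; with it, "very material gain" does.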

Embodiment two

As shown in Fig. 4, this embodiment provides a Chinese character input device, comprising: a dictionary, which comprises entries, the pinyin corresponding to each entry, the probability of occurrence of each entry, the probability of occurrence of an entry given that another entry occurs, the part of speech of each entry, and the conditional probabilities between parts of speech; a first acquiring unit, configured to obtain a pinyin string; a segmentation unit, configured to segment the pinyin string according to the dictionary to obtain the pinyin substrings of the pinyin string; a second acquiring unit, configured to obtain from the dictionary the candidate entries corresponding to the pinyin substrings, together with the probability of occurrence of each candidate entry, the probability of occurrence of the candidate entry given that another entry occurs, and the part of speech of the candidate entry; a computing unit, configured to calculate the weight of each candidate entry from left to right according to these probabilities and the part of speech of the candidate entry; and a determining unit, configured to find the candidate entry with the largest weight among the weights of all candidate entries whose pinyin substring includes the end of the pinyin string, to determine from it the candidate entries corresponding to the pinyin string, and to take the combination of these candidate entries as the input result.

The computing unit uses the following formula:

$$\mathrm{Weight}(A_i) = \max\Bigl(\mathrm{Weight}(A_{i-1}) + a \times \log P(A_i \mid A_{i-1}) + b \times \log P(A_i) + c \times \log P\bigl(\mathrm{Prop}(A_i) \mid \mathrm{Prop}(A_{i-1})\bigr)\Bigr)$$

where i = 1 to M, and M is the number of pinyin substrings obtained when the pinyin string is segmented into substrings that each correspond to a single Chinese character; $A_i$ denotes the entry at position i; $\mathrm{Weight}(A_i)$ denotes the weight of entry $A_i$; a, b, and c are constants; $P(A_i \mid A_{i-1})$ is the probability that $A_i$ occurs given entry $A_{i-1}$; $P(A_i)$ is the probability that entry $A_i$ occurs; $\mathrm{Prop}(A)$ is the part of speech of word A; and $P(\mathrm{Prop}(A_i) \mid \mathrm{Prop}(A_{i-1}))$ is the probability that the part of speech $\mathrm{Prop}(A_i)$ of $A_i$ occurs given the part of speech $\mathrm{Prop}(A_{i-1})$ of $A_{i-1}$.

For the working principle of each unit in this embodiment, refer to the description of Embodiment one.

Although the present invention has been described by way of embodiments, those of ordinary skill in the art will appreciate that many variations and modifications can be made to the invention without departing from its spirit and essence; the scope of the invention is defined by the appended claims.

Claims

1. A Chinese character input method, characterized by comprising:

obtaining a pinyin string;

segmenting the pinyin string according to a dictionary to obtain the pinyin substrings of the pinyin string, the dictionary comprising entries, the pinyin corresponding to each entry, the probability of occurrence of each entry, the probability of occurrence of an entry given that another entry occurs, the part of speech of each entry, and the conditional probabilities between parts of speech;

obtaining from the dictionary the candidate entries corresponding to the pinyin substrings, together with the probability of occurrence of each candidate entry, the probability of occurrence of the candidate entry given that another entry occurs, and the part of speech of the candidate entry;

calculating the weight of each candidate entry from left to right according to the probability of occurrence of the candidate entry, the probability of occurrence of the candidate entry given that another entry occurs, and the part of speech of the candidate entry, the weight being calculated by the following formula:

$$\mathrm{Weight}(A_i) = \max\Bigl(\mathrm{Weight}(A_{i-1}) + a \times \log P(A_i \mid A_{i-1}) + b \times \log P(A_i) + c \times \log P\bigl(\mathrm{Prop}(A_i) \mid \mathrm{Prop}(A_{i-1})\bigr)\Bigr)$$

finding the candidate entry with the largest weight among the weights of all candidate entries whose pinyin substring includes the end of the pinyin string; removing the pinyin substring of that candidate entry from the pinyin string and taking the remaining pinyin string as the current pinyin string; finding the candidate entry with the largest weight among the weights of all candidate entries whose pinyin substring includes the end of the current pinyin string; and repeating until the pinyin substring at the very beginning of the pinyin string has been included, the candidate entries obtained in this way being the candidate entries corresponding to the pinyin string, and the combination of these candidate entries being taken as the input result.

2. A Chinese character input device, characterized by comprising:

a dictionary, which comprises entries, the pinyin corresponding to each entry, the probability of occurrence of each entry, the probability of occurrence of an entry given that another entry occurs, the part of speech of each entry, and the conditional probabilities between parts of speech;

a first acquiring unit, configured to obtain a pinyin string;

a segmentation unit, configured to segment the pinyin string according to the dictionary to obtain the pinyin substrings of the pinyin string;

a second acquiring unit, configured to obtain from the dictionary the candidate entries corresponding to the pinyin substrings, together with the probability of occurrence of each candidate entry, the probability of occurrence of the candidate entry given that another entry occurs, the part of speech of the candidate entry, and the conditional probabilities between parts of speech;

a computing unit, configured to calculate the weight of each candidate entry from left to right according to the probability of occurrence of the candidate entry, the probability of occurrence of the candidate entry given that another entry occurs, and the part of speech of the candidate entry, the weight being calculated by the following formula:

$$\mathrm{Weight}(A_i) = \max\Bigl(\mathrm{Weight}(A_{i-1}) + a \times \log P(A_i \mid A_{i-1}) + b \times \log P(A_i) + c \times \log P\bigl(\mathrm{Prop}(A_i) \mid \mathrm{Prop}(A_{i-1})\bigr)\Bigr)$$

a determining unit, configured to: find the candidate entry with the largest weight among the weights of all candidate entries whose pinyin substring includes the end of the pinyin string; remove the pinyin substring of that candidate entry from the pinyin string and take the remaining pinyin string as the current pinyin string; find the candidate entry with the largest weight among the weights of all candidate entries whose pinyin substring includes the end of the current pinyin string; and repeat until the pinyin substring at the very beginning of the pinyin string has been included, the candidate entries obtained in this way being the candidate entries corresponding to the pinyin string, and the combination of these candidate entries being taken as the input result.