Background Art
At present, most Chinese input method software supports whole sentence generation. For example, a user who wants to enter "People's Republic of China" only needs to type the pinyin string "zhonghuarenmingongheguo" continuously in the input method software to obtain the correct whole sentence generation result; see Fig. 1. Fig. 2 is a flowchart of the Chinese whole sentence generating method provided by the prior art, which comprises:
Step 201: Perform syllabification on the pinyin string;
Step 202: According to the syllabification result, look up in the phonetic dictionary all candidate words that occur in the pinyin string, and build a candidate word digraph (directed graph). Each arc of the digraph corresponds to one or more candidate words, and each arc carries the word frequency of its highest-frequency candidate word;
Wherein the phonetic (pinyin) dictionary records the mapping from pinyin to candidate words, and the word frequency is the number of times a candidate word occurs.
Step 203: Obtain the probability of each arc from the word frequency carried by the digraph;
Wherein obtaining the probability of each arc specifically comprises: dividing the word frequency carried by each arc of the digraph by the sum of the word frequencies of all words in the phonetic dictionary.
Step 204: Use a shortest-path algorithm (such as Dijkstra's algorithm or the Viterbi algorithm) to obtain the path with the maximum probability (a candidate word combination) as the whole sentence generation result;
Step 205: Display the whole sentence generation result in the first position of the candidate word window, and display the candidate words corresponding to the initial arcs of the digraph in the candidate word window in descending order of word frequency.
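As a rough illustration of steps 201 to 203, the following Python sketch builds a candidate word digraph from a pinyin string that has already been syllabified. The dictionary contents, the word frequencies, the English glosses used as candidate words, and the name `build_candidate_digraph` are all hypothetical and serve only as an example; they are not taken from the method described above.

```python
# Hypothetical phonetic dictionary: pinyin key -> {candidate word: word frequency}.
# English glosses stand in for the Chinese candidate words.
DICTIONARY = {
    "wo": {"I": 500, "hold": 50},
    "shi": {"am": 800},
    "wo'shi": {"bedroom": 80},
    "hen": {"very": 400},
    "da": {"big": 300},
    "hen'da": {"very big": 120},
}
TOTAL_FREQ = 10000  # assumed sum of the word frequencies of all dictionary words


def build_candidate_digraph(syllables, dictionary, total_freq):
    """Return the arcs of the candidate word digraph.

    Nodes are the syllable boundaries 0..len(syllables). An arc from node i
    to node j covers syllables[i:j]; it lists the matching candidate words in
    descending order of word frequency and carries a probability computed
    from the highest-frequency candidate word (step 203).
    """
    arcs = []
    n = len(syllables)
    for i in range(n):
        for j in range(i + 1, n + 1):
            key = "'".join(syllables[i:j])
            words = dictionary.get(key)
            if not words:
                continue  # no dictionary entry spans these syllables
            ranked = sorted(words.items(), key=lambda kv: kv[1], reverse=True)
            arcs.append({
                "start": i,
                "end": j,
                "words": [word for word, _ in ranked],
                "prob": ranked[0][1] / total_freq,
            })
    return arcs


arcs = build_candidate_digraph(["wo", "shi", "hen", "da"], DICTIONARY, TOTAL_FREQ)
```

Syllabification itself (step 201) is assumed to have been done already; the sketch takes the list of syllables as input.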
Taking the Viterbi algorithm as an example, the specific implementation of step 204 is briefly described below.
Starting from the start node of the digraph, the accumulated probability (a product of probabilities) of each node is calculated, with the accumulated probability of the start node initialized to 1. For each node, the largest of its candidate accumulated probabilities is taken as the accumulated probability of that node, and the sequence number of the corresponding predecessor node is recorded; this continues until the accumulated probability and predecessor node sequence number of the last node of the digraph are obtained. Then, starting from the last node of the digraph, the recorded predecessor node sequence numbers are traced back until the start node is reached, which yields the path with the maximum probability. The candidate words corresponding to the arcs on this path are concatenated in order to obtain the whole sentence generation result. The accumulated probability is computed as: accumulated probability of the current node = accumulated probability of its predecessor node × probability of the arc from that predecessor node.
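The forward pass and backtracking described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: each arc is reduced to a single word with a single probability, node 0 is the start node, every arc runs from a lower-numbered node to a higher-numbered one, and the arc probabilities in the example are invented for illustration.

```python
def best_path(arcs, last_node):
    """Find the maximum-probability path from node 0 to last_node.

    arcs: iterable of (start, end, word, prob) tuples with start < end.
    Returns (words, probability), or (None, 0.0) if last_node is unreachable.
    """
    best = {0: 1.0}   # accumulated probability of each reached node
    back = {}         # node -> (predecessor node, word on the connecting arc)

    # Sorting by end node guarantees that a node's accumulated probability
    # is final before any arc leaving that node is processed.
    for start, end, word, prob in sorted(arcs, key=lambda a: (a[1], a[0])):
        if start not in best:
            continue
        score = best[start] * prob
        if score > best.get(end, 0.0):
            best[end] = score
            back[end] = (start, word)

    if last_node not in back:
        return None, 0.0

    # Trace the recorded predecessors back from the last node to the start node.
    words = []
    node = last_node
    while node != 0:
        node, word = back[node]
        words.append(word)
    words.reverse()
    return words, best[last_node]


# Hypothetical arc probabilities for the pinyin string "woshihenda".
example_arcs = [
    (0, 1, "I", 0.05),
    (1, 2, "am", 0.08),
    (0, 2, "bedroom", 0.002),
    (2, 3, "very", 0.04),
    (3, 4, "big", 0.03),
    (2, 4, "very big", 0.012),
]
words, probability = best_path(example_arcs, last_node=4)
```

With these invented numbers, which depend only on individual word frequencies, the path through "I" and "am" beats the path through "bedroom", which mirrors the prior-art behaviour criticised later in this section.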
The implementation of the existing Chinese whole sentence generating method is illustrated below.
For example, the user enters the pinyin string "womendoushipingfanren", and the result after syllabification is "wo'men'dou'shi'ping'fan'ren". According to this syllabification result, all candidate words occurring in the pinyin string are looked up in the phonetic dictionary, and the candidate word digraph shown in Fig. 3(a) is built. Each arc of this digraph corresponds to one or more candidate words (listed from top to bottom in descending order of word frequency), and each arc carries the word frequency (not marked in the figure) of its highest-frequency candidate word (the topmost candidate word in the figure). The Viterbi algorithm yields the whole sentence generation result "we are all ordinary people", which is displayed in the first position of the candidate word window, as shown in Fig. 3(b). Starting from the second position of the candidate word window, the candidate words corresponding to the initial arcs of the digraph, "we", "I" and "hold", are displayed in descending order of word frequency.
However, users are generally not accustomed to entering one very long pinyin string continuously, and instead tend to enter pinyin word by word. For example, suppose the user wants to enter "this bedroom is very big" and does so in two inputs. The user first enters "zhejian", and the generated words are shown in Fig. 4(a); the user selects "this". The user then enters "woshihenda", and the generated words are shown in Fig. 4(b). The candidate in the first position is "I am very big", which is not what the user wants. The user must first select candidate 2 to obtain the result shown in Fig. 4(c), and then select candidate 1 to obtain the correct whole sentence generation result "bedroom is very big".
The defect of the prior art is that only the highest-frequency candidate word is considered during whole sentence generation. As a result, when the user enters a whole sentence in several inputs, the whole sentence generation result displayed in the first position of the candidate word window has low accuracy, and the user must perform several selection operations to obtain the correct result, which slows down input.
Summary of the invention
The technical problem to be solved by the embodiments of the present invention is to provide a Chinese whole sentence generating method and device that can obtain an accurate whole sentence generation result.
To solve the above technical problem, an embodiment of the present invention provides a Chinese whole sentence generating method, comprising:
Obtaining the previously generated candidate word;
Obtaining the candidate words that occur in a pinyin string, and building a candidate word digraph;
Selecting, from the candidate words corresponding to the initial arcs of the digraph, the candidate word with the maximum conditional probability given the previously generated candidate word;
Obtaining the whole sentence generation result of the pinyin string based on the candidate word with the maximum conditional probability.
Preferably, selecting the candidate word with the maximum conditional probability given the previously generated candidate word specifically comprises:
Combining each candidate word corresponding to the initial arcs of the digraph with the previously generated candidate word;
Looking up the word frequency of each candidate word combination, the word frequency of each candidate word corresponding to the initial arcs, and the word frequency of the previously generated candidate word;
Calculating the conditional probability of each candidate word corresponding to the initial arcs from the word frequency of the candidate word combination, the word frequency of the candidate word corresponding to the initial arc, and the word frequency of the previously generated candidate word, and selecting the candidate word with the maximum conditional probability.
Preferably, calculating the conditional probability of a candidate word corresponding to the initial arc specifically comprises:
Calculating the co-occurrence probability of the candidate word combination from the word frequency of the candidate word combination and the word frequency of the previously generated candidate word;
Calculating the independent probability of the candidate word from the word frequency of the candidate word corresponding to the initial arc;
Adding the co-occurrence probability and the independent probability to obtain the conditional probability of the candidate word corresponding to the initial arc.
Preferably, calculating the co-occurrence probability of the candidate word combination specifically comprises:
Dividing the word frequency of the candidate word combination by the word frequency of the previously generated candidate word, and multiplying the quotient by a first parameter, to obtain the co-occurrence probability of the candidate word combination;
Calculating the independent probability of the candidate word specifically comprises:
Dividing the word frequency of the candidate word corresponding to the initial arc by the sum of the word frequencies of all words in the phonetic dictionary, and multiplying the quotient by a second parameter, to obtain the independent probability of the candidate word;
Wherein the first parameter and the second parameter are each greater than 0 and less than 1, and the sum of the first parameter and the second parameter is less than 1.
Preferably, obtaining the whole sentence generation result of the pinyin string based on the selected candidate word specifically comprises:
Taking the conditional probability of the candidate word with the maximum conditional probability as the probability of the initial arc of the candidate word digraph;
Calculating the probabilities of the arcs of the candidate word digraph other than the initial arc;
Using a shortest-path algorithm to obtain the path with the maximum probability as the whole sentence generation result of the pinyin string.
Preferably, the above method further comprises:
Displaying the whole sentence generation result in the first position of the candidate word window.
Preferably, the above method further comprises:
Storing the previously generated candidate word in a buffer;
After the whole sentence generation result is obtained, replacing the candidate word stored in the buffer with the whole sentence generation result.
An embodiment of the present invention also provides a Chinese whole sentence generating device, comprising:
A digraph construction unit, configured to obtain the candidate words that occur in a pinyin string and build a candidate word digraph;
A previous candidate word acquiring unit, configured to obtain the previously generated candidate word;
A candidate word selection unit, configured to select, from the candidate words corresponding to the initial arcs of the digraph, the candidate word with the maximum conditional probability given the previously generated candidate word;
A whole sentence generation unit, configured to obtain the whole sentence generation result based on the candidate word with the maximum conditional probability.
Preferably, the candidate word selection unit specifically comprises a candidate word combination unit, a word frequency lookup unit, and a selection unit;
The candidate word combination unit is configured to combine each candidate word corresponding to the initial arcs of the digraph with the previously generated candidate word;
The word frequency lookup unit is configured to look up the word frequency of each candidate word combination, the word frequency of each candidate word corresponding to the initial arcs, and the word frequency of the previously generated candidate word;
The selection unit is configured to calculate the conditional probability of each candidate word corresponding to the initial arcs from the word frequency of the candidate word combination, the word frequency of the candidate word corresponding to the initial arc, and the word frequency of the previously generated candidate word, and to select the candidate word with the maximum conditional probability.
Preferably, the selection unit specifically comprises a co-occurrence probability calculation unit, an independent probability calculation unit, a conditional probability calculation unit, and a word selection unit;
The co-occurrence probability calculation unit is configured to divide the word frequency of the candidate word combination by the word frequency of the previously generated candidate word and multiply the quotient by a first parameter, to obtain the co-occurrence probability of the candidate word combination;
The independent probability calculation unit is configured to divide the word frequency of the candidate word corresponding to the initial arc by the sum of the word frequencies of all words in the phonetic dictionary and multiply the quotient by a second parameter, to obtain the independent probability of the candidate word;
Wherein the first parameter and the second parameter are each greater than 0 and less than 1, and the sum of the first parameter and the second parameter is less than 1;
The conditional probability calculation unit is configured to add the co-occurrence probability and the independent probability to obtain the conditional probability of the candidate word corresponding to the initial arc;
The word selection unit is configured to select the candidate word with the maximum conditional probability.
As can be seen from the above technical solutions, the embodiments of the present invention have the following advantages:
The embodiments of the present invention use the previously generated candidate word to select, from the candidate words corresponding to the initial arcs of the digraph, the candidate word with the maximum conditional probability given the previously generated candidate word, and obtain the whole sentence generation result of the pinyin string based on that candidate word. Because the conditional probability of a candidate word corresponding to the initial arc of the candidate word digraph is calculated using the word frequency of the candidate word combination and the word frequency of the previously generated candidate word, that is, because contextual information is used in whole sentence generation, the accuracy of both whole sentence generation and candidate word generation is improved.
Embodiment
The embodiments of the present invention provide a Chinese whole sentence generating method and device. To make the purpose, technical solutions and advantages of the embodiments of the present invention clearer, the embodiments are described in detail below with reference to the accompanying drawings.
In the embodiments of the present invention, a whole sentence refers to a word or a combination of words.
The Chinese whole sentence generating method provided by the embodiments of the present invention comprises: obtaining the previously generated candidate word; obtaining the candidate words that occur in a pinyin string, and building a candidate word digraph; selecting, from the candidate words corresponding to the initial arcs of the digraph, the candidate word with the maximum conditional probability given the previously generated candidate word; and obtaining the whole sentence generation result of the pinyin string based on the candidate word with the maximum conditional probability.
Fig. 5 is a flowchart of the Chinese whole sentence generating method provided by an embodiment of the present invention, which comprises:
Step 501: Perform syllabification on the pinyin string;
Step 502: According to the syllabification result, look up in the phonetic dictionary all candidate words that occur in the pinyin string, and build a candidate word digraph;
Step 503: Obtain the previously generated candidate word;
Wherein the previously generated candidate word is the word or whole sentence that the user entered before the current input operation, and it is stored in a buffer. Each time the user performs an input operation, the word or whole sentence held in the buffer is replaced with the new word or whole sentence; if what the user enters next is a punctuation mark, the buffer is cleared. For example, the user is currently entering "woshihenda", and before that had entered "zhejian" and selected "this", so "this" is stored in the buffer. After entering "woshihenda" the user selects "bedroom is very big", and the word "this" in the buffer is then replaced with the whole sentence "bedroom is very big".
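The buffer behaviour described above can be sketched as a small Python class. The class name `PreviousCandidateBuffer`, its method names, and the particular set of punctuation marks are assumptions made for this illustration only.

```python
import string

# Assumed punctuation set: ASCII punctuation plus some common Chinese marks.
PUNCTUATION = set(string.punctuation) | set("，。！？；：、")


class PreviousCandidateBuffer:
    """Holds the word or whole sentence committed by the previous input."""

    def __init__(self):
        self._text = None

    def commit(self, text):
        """Record what the user just committed.

        A punctuation mark (or empty input) clears the buffer; any other
        text replaces whatever the buffer held before.
        """
        if not text or all(ch in PUNCTUATION for ch in text):
            self._text = None
        else:
            self._text = text

    def get(self):
        """Return the previously generated candidate, or None if cleared."""
        return self._text


buffer = PreviousCandidateBuffer()
buffer.commit("this")                  # after the user selects "this"
first = buffer.get()
buffer.commit("bedroom is very big")   # replaces "this"
second = buffer.get()
buffer.commit("。")                    # punctuation clears the buffer
third = buffer.get()
```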
Step 504: Combine each candidate word corresponding to the initial arcs of the digraph with the previously generated candidate word;
Wherein an initial arc of the candidate word digraph is an arc whose starting point is the start node of the digraph.
Step 505: Look up the word frequency of each candidate word combination, the word frequency of each candidate word corresponding to the initial arcs, and the word frequency of the previously generated candidate word;
In the embodiments of the present invention, the phonetic dictionary is used in advance to segment an original text into a set of segmented words. The set is then scanned to count the number of times each word and each word combination in the phonetic dictionary occurs in it, that is, to obtain the word frequencies of the words and word combinations in the phonetic dictionary, together with the sum of the word frequencies of all words in the phonetic dictionary. This word frequency information is saved in a word frequency information file. Note that if a word or phrase in the phonetic dictionary does not occur in the set of segmented words, its word frequency is recorded as zero.
Wherein, in step 505, the candidate words and candidate word combinations are matched against the words and word combinations saved in the word frequency information file to look up their corresponding word frequencies.
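The counting described above might look like the following Python sketch. It assumes the original text has already been segmented into lists of words, and it treats a word combination as a pair of adjacent dictionary words; both are simplifying assumptions, and the function name and sample data are hypothetical.

```python
from collections import Counter


def count_word_frequencies(segmented_sentences, dictionary_words):
    """Count word and word-pair frequencies over segmented text.

    Returns (word_freq, pair_freq, total_freq). Dictionary words that never
    occur keep a frequency of zero, and pairs that never occur read as zero.
    """
    vocab = set(dictionary_words)
    word_freq = Counter({word: 0 for word in vocab})
    pair_freq = Counter()
    for sentence in segmented_sentences:
        # Words outside the dictionary are masked so they break adjacency.
        known = [w if w in vocab else None for w in sentence]
        for word in known:
            if word is not None:
                word_freq[word] += 1
        for prev, cur in zip(known, known[1:]):
            if prev is not None and cur is not None:
                pair_freq[(prev, cur)] += 1
    total_freq = sum(word_freq.values())
    return word_freq, pair_freq, total_freq


# Hypothetical segmented text, with English glosses standing in for words.
sentences = [
    ["this", "bedroom", "very big"],
    ["I", "am", "very big"],
    ["this", "bedroom"],
]
vocabulary = ["this", "bedroom", "I", "am", "very big", "hold"]
word_freq, pair_freq, total_freq = count_word_frequencies(sentences, vocabulary)
```

In practice the three results would be written to the word frequency information file; the storage format is not specified here.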
Step 506: Calculate the conditional probability of each candidate word corresponding to the initial arcs from the word frequency of the candidate word combination, the word frequency of the candidate word corresponding to the initial arc, and the word frequency of the previously generated candidate word, and select the candidate word with the maximum conditional probability;
Step 507: Based on the selected candidate word of the initial arc, obtain the whole sentence generation result of the pinyin string;
Step 508: Display the whole sentence generation result in the first position of the candidate word window.
The implementation of step 507 in the embodiments of the present invention is described in detail below, and comprises:
Taking the conditional probability of the candidate word with the maximum conditional probability as the probability of the initial arc;
Calculating the probabilities of the arcs of the candidate word digraph other than the initial arc, where the probability of each such arc equals the word frequency of the highest-frequency candidate word carried by that arc divided by the sum of the word frequencies of all words in the phonetic dictionary;
Using a shortest-path algorithm (such as Dijkstra's algorithm or the Viterbi algorithm) to obtain the path with the maximum probability (a candidate word combination) as the whole sentence generation result.
Taking the Viterbi algorithm as an example, the process of obtaining the whole sentence generation result with a shortest-path algorithm is described in detail below.
Starting from the start node of the digraph, the accumulated probability (a product of probabilities) of each node is calculated, with the accumulated probability of the start node initialized to 1. For each node, the largest of its candidate accumulated probabilities is taken as the accumulated probability of that node, and the sequence number of the corresponding predecessor node is recorded; this continues until the accumulated probability and predecessor node sequence number of the last node of the digraph are obtained. Then, starting from the last node of the digraph, the recorded predecessor node sequence numbers are traced back until the start node is reached, which yields the path with the maximum probability. The candidate words corresponding to the arcs on this path are concatenated in order to obtain the whole sentence generation result. The accumulated probability is computed as: accumulated probability of the current node = accumulated probability of its predecessor node × probability of the arc from that predecessor node.
As the above process shows, the embodiments of the present invention differ from the prior art as follows: in the embodiments of the present invention, the probability of the initial arc is the conditional probability of the candidate word with the maximum conditional probability, whereas in the prior art the probability of the initial arc is calculated from the word frequency of the highest-frequency candidate word.
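The difference in how arc probabilities are assigned can be sketched in Python as follows. The sketch takes the conditional probabilities as a ready-made function, and it adopts one possible reading of the text, namely that every arc leaving the start node is scored by the largest conditional probability among its own candidate words. That reading, the function name, and all numbers are assumptions for illustration.

```python
def assign_arc_probabilities(arcs, word_freq, total_freq, conditional_prob):
    """Pick one word and one probability for every arc.

    arcs: list of dicts with 'start', 'end' and a non-empty 'words' list.
    conditional_prob: function mapping a candidate word to its conditional
        probability given the previously generated candidate word.
    Returns a list of (start, end, word, prob) tuples.
    """
    scored = []
    for arc in arcs:
        if arc["start"] == 0:
            # Initial arc: use the candidate with the largest conditional
            # probability instead of the highest-frequency candidate.
            word = max(arc["words"], key=conditional_prob)
            prob = conditional_prob(word)
        else:
            # Other arcs: highest word frequency divided by the total.
            word = max(arc["words"], key=lambda w: word_freq.get(w, 0))
            prob = word_freq.get(word, 0) / total_freq
        scored.append((arc["start"], arc["end"], word, prob))
    return scored


# Hypothetical inputs.
sample_arcs = [
    {"start": 0, "end": 1, "words": ["I", "hold"]},
    {"start": 0, "end": 2, "words": ["bedroom"]},
    {"start": 2, "end": 4, "words": ["very big"]},
]
sample_freq = {"I": 500, "hold": 50, "bedroom": 80, "very big": 120}
sample_cond = {"I": 0.01, "hold": 0.001, "bedroom": 0.4}
scored_arcs = assign_arc_probabilities(
    sample_arcs, sample_freq, 10000, lambda w: sample_cond[w]
)
```

The resulting tuples have the shape expected by a maximum-probability path search such as the Viterbi procedure described above.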
The above is the Chinese whole sentence generating method provided by the embodiments of the present invention. In other embodiments of the present invention, the conditional probabilities of the candidate words corresponding to the initial arcs of the digraph may be calculated either while the candidate word digraph is being built or after it has been built; neither choice affects the implementation of the embodiments of the present invention.
When the above method is implemented, the conditional probability of a candidate word corresponding to the initial arc of the candidate word digraph may be calculated as follows:
Calculating the co-occurrence probability of the candidate word combination from the word frequency of the candidate word combination and the word frequency of the previously generated candidate word;
Calculating the independent probability of the candidate word from the word frequency of the candidate word corresponding to the initial arc;
Adding the co-occurrence probability and the independent probability to obtain the conditional probability of the candidate word corresponding to the initial arc.
In the embodiments of the present invention, the co-occurrence probability, independent probability and conditional probability may specifically be calculated with the following formulas:
Co-occurrence probability = (word frequency of the candidate word combination / word frequency of the previously generated candidate word) × first parameter;
Independent probability = (word frequency of the candidate word corresponding to the initial arc / sum of the word frequencies of all words in the phonetic dictionary) × second parameter;
Conditional probability = co-occurrence probability + independent probability + offset δ;
Wherein the first parameter and the second parameter are each greater than 0 and less than 1, and the sum of the first parameter and the second parameter is less than 1; offset δ = (1 − first parameter − second parameter) / total number of words in the phonetic dictionary, and the offset δ may be approximated as 0.
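The three formulas above can be written as a short Python function. The parameter values 0.7 and 0.2, the word frequencies, and the choice to treat the co-occurrence probability as zero when the previously generated candidate word has a frequency of zero are all assumptions of this sketch; the text above does not specify them.

```python
def conditional_probability(prev, word, word_freq, pair_freq, total_freq,
                            vocab_size, first_param=0.7, second_param=0.2):
    """Conditional probability of `word` given the previous candidate `prev`.

    Implements: co-occurrence probability + independent probability + offset.
    """
    if not (0 < first_param < 1 and 0 < second_param < 1
            and first_param + second_param < 1):
        raise ValueError("parameters must lie in (0, 1) and sum to less than 1")

    prev_freq = word_freq.get(prev, 0)
    # Assumption: avoid dividing by zero when `prev` was never observed.
    if prev_freq:
        co_occurrence = first_param * pair_freq.get((prev, word), 0) / prev_freq
    else:
        co_occurrence = 0.0
    independent = second_param * word_freq.get(word, 0) / total_freq
    offset = (1 - first_param - second_param) / vocab_size
    return co_occurrence + independent + offset


# Hypothetical frequencies for the running example.
freq = {"this": 200, "I": 500, "hold": 50, "bedroom": 80}
pairs = {("this", "bedroom"): 60}
candidates = ["I", "hold", "bedroom"]
scores = {
    w: conditional_probability("this", w, freq, pairs, 10000, 1000)
    for w in candidates
}
best_candidate = max(candidates, key=scores.get)
```

With these invented counts, "bedroom" scores highest after "this" even though "I" has the larger individual word frequency, because only "this bedroom" has a non-zero combination frequency.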
In other embodiments of the present invention, other formulas may also be used to calculate the above three probabilities; none of them affects the implementation of the embodiments of the present invention.
The specific implementation of the whole sentence generating method provided by the embodiments of the present invention is illustrated below. Suppose the user wants to enter "this bedroom is very big" and does so in two inputs. The user first enters "zhejian" and selects "this"; at this point the buffer holds "this". The user then enters "woshihenda". Syllabification of "woshihenda" gives "wo'shi'hen'da". All candidate words in the pinyin string are looked up in the phonetic dictionary, and the candidate word digraph shown in Fig. 6 is built. This digraph has 5 nodes in total; the start node is numbered 0 and the last node is numbered 4. The candidate words corresponding to the initial arcs of the digraph are each combined with "this", giving candidate word combinations such as "this I", "this hold", "this bedroom" and "this I am". The word frequencies of these combinations are looked up in the word frequency information file: the word frequency of "this bedroom" is an integer greater than zero, while the word frequencies of the other combinations are zero. Therefore the conditional probability of "bedroom" is greater than that of the other candidate words corresponding to the initial arcs, and the conditional probability of "bedroom" is taken as the probability of the initial arc. The probabilities of the other arcs are then calculated from the word frequency of the highest-frequency candidate word carried by each of them. The accumulated probability of node 0, the start node, is initialized to 1. Starting from node 0, the accumulated probability and predecessor node sequence number of each node are calculated. Finally, starting from node 4, the recorded predecessor node sequence numbers are traced back until node 0 is reached, which gives the path with the maximum probability.
In this example, tracing back from node 4 gives predecessor node 2, and tracing back from node 2 gives predecessor node 0, at which point the trace ends. The nodes on the maximum-probability path are therefore 0-2-4, and concatenating the candidate words along this path gives "bedroom is very big". In the embodiments of the present invention, because the largest of the initial arc probabilities is the conditional probability of "bedroom", node 2 records node 0 as its predecessor node. Node 4 records node 2 rather than node 3 as its predecessor node because the probability of "very big" multiplied by the accumulated probability of node 2 is greater than the probability of "big" multiplied by the accumulated probability of node 3. The whole sentence generation result is therefore "bedroom is very big", rather than the "I am very big" produced by the prior art.
An embodiment of the present invention also provides a Chinese whole sentence generating device; see Fig. 7(a). The device comprises:
A digraph construction unit 701, configured to obtain the candidate words that occur in a pinyin string and build a candidate word digraph;
A previous candidate word acquiring unit 702, configured to obtain the previously generated candidate word;
A candidate word selection unit 703, configured to select, from the candidate words corresponding to the initial arcs of the digraph, the candidate word with the maximum conditional probability given the previously generated candidate word;
A whole sentence generation unit 704, configured to obtain the whole sentence generation result based on the candidate word with the maximum conditional probability.
In a specific implementation, the digraph construction unit 701 may consist of the following three units; see Fig. 7(b):
A syllabification unit 7011, configured to perform syllabification on the pinyin string;
A candidate word lookup unit 7012, configured to look up in the phonetic dictionary, according to the syllabification result, the candidate words that occur in the pinyin string;
A digraph generation unit 7013, configured to build the candidate word digraph from the candidate words obtained by the candidate word lookup unit.
In a specific implementation, the candidate word selection unit 703 may consist of the following three units; see Fig. 7(c):
A candidate word combination unit 7031, configured to combine each candidate word corresponding to the initial arcs of the digraph with the previously generated candidate word;
A word frequency lookup unit 7032, configured to look up the word frequency of each candidate word combination, the word frequency of each candidate word corresponding to the initial arcs, and the word frequency of the previously generated candidate word;
A selection unit 7033, configured to calculate the conditional probability of each candidate word corresponding to the initial arcs from the word frequency of the candidate word combination, the word frequency of the candidate word corresponding to the initial arc, and the word frequency of the previously generated candidate word, and to select the candidate word with the maximum conditional probability.
In a specific implementation, the selection unit 7033 may consist of the following four units; see Fig. 7(d):
A co-occurrence probability calculation unit 70331, configured to calculate the co-occurrence probability of the candidate word combination from the word frequency of the candidate word combination and the word frequency of the previously generated candidate word;
An independent probability calculation unit 70332, configured to calculate the independent probability of the candidate word from the word frequency of the candidate word corresponding to the initial arc;
A conditional probability calculation unit 70333, configured to add the co-occurrence probability and the independent probability to obtain the conditional probability of the candidate word corresponding to the initial arc;
A word selection unit 70334, configured to select the candidate word with the maximum conditional probability.
Wherein the co-occurrence probability calculation unit 70331 and the independent probability calculation unit 70332 may use the formulas for the co-occurrence probability and the independent probability given earlier in this description; the details are not repeated here.
In a specific implementation, the whole sentence generation unit 704 may consist of the following units; see Fig. 7(e):
An initial arc probability acquiring unit 7041, configured to take the conditional probability of the candidate word with the maximum conditional probability as the probability of the initial arc of the candidate word digraph;
An other-arc probability calculation unit 7042, configured to calculate the probabilities of the arcs of the candidate word digraph other than the initial arc;
A path selection unit 7043, configured to use a shortest-path algorithm to obtain the path with the maximum probability as the whole sentence generation result of the pinyin string.
To display the whole sentence generation result, the above device may further comprise:
A whole sentence display unit, configured to display the whole sentence generation result in the first position of the candidate word window.
In addition, in the embodiments of the present invention, a user may split one word across two inputs. For example, the user enters "motorcycle" (pinyin "motuoche") in two inputs, entering the first character (pinyin "mo") the first time and the remaining characters (pinyin "tuoche") the second time. In this case, the pinyin of the first character held in the buffer can be combined with the pinyin "tuoche" of the second input to obtain "motuoche". The word corresponding to "motuoche" is then looked up in the phonetic dictionary, and the part of "motorcycle" that corresponds to the second input is displayed in the first position of the candidate word window as the generation result.
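The split-word case above can be sketched in Python as follows. The function name, the dictionary layout (pinyin string mapped to a list of words), and the use of the Chinese characters 摩 and 摩托车 for the buffered character and the full word are assumptions made to keep the example concrete.

```python
def complete_split_word(buffer_word, buffer_pinyin, current_pinyin, dictionary):
    """Suggest completions when one word is typed across two inputs.

    Joins the pinyin of the buffered text with the current pinyin, looks the
    result up in the dictionary, and returns the part of each matching word
    that follows the buffered text.
    """
    combined = buffer_pinyin + current_pinyin
    completions = []
    for word in dictionary.get(combined, []):
        # Keep only words that extend the text already committed.
        if word.startswith(buffer_word) and len(word) > len(buffer_word):
            completions.append(word[len(buffer_word):])
    return completions


# Hypothetical dictionary entry for the "motorcycle" example.
split_dictionary = {"motuoche": ["摩托车"]}
suggestions = complete_split_word("摩", "mo", "tuoche", split_dictionary)
```

A suggestion found this way would be placed in the first position of the candidate word window, ahead of candidates generated from "tuoche" alone.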
The Chinese whole sentence generating method and device provided by the present invention have been described in detail above. A person of ordinary skill in the art may, following the ideas of the embodiments of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this description should not be construed as limiting the present invention.