CN101122901A - Chinese integral sentence generation method and device - Google Patents

Publication number
CN101122901A
Authority
CN
China
Prior art keywords
candidate word
word
candidate
probability
frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA200710151332XA
Other languages
Chinese (zh)
Other versions
CN101122901B (en)
Inventor
张会鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Shiji Guangsu Information Technology Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN200710151332XA priority Critical patent/CN101122901B/en
Publication of CN101122901A publication Critical patent/CN101122901A/en
Application granted granted Critical
Publication of CN101122901B publication Critical patent/CN101122901B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Document Processing Apparatus (AREA)

Abstract

The present invention discloses a method and device for generating complete Chinese sentences. The method includes: obtaining the previously generated candidate word; obtaining the candidate words that occur in a pinyin string and constructing a candidate word directed graph; selecting, from the candidate words corresponding to the initial arc of the directed graph, the candidate word with the largest conditional probability given the previously generated candidate word; and obtaining the complete-sentence result for the pinyin string based on that candidate word. An embodiment of the invention also provides a corresponding device. When computing the probability of the candidate words corresponding to the initial arc of the candidate word directed graph, the embodiment uses the word frequency of the candidate word combination and the word frequency of the previously generated candidate word; that is, it uses context information to generate the complete sentence, which improves the accuracy of both complete-sentence generation and candidate word generation.

Description

Chinese complete sentence generating method and device
Technical field
The present invention relates to Chinese input technology, and in particular to a method and device for generating complete Chinese sentences.
Background art
At present, most Chinese input software supports complete sentence generation. For example, a user who wants to enter "People's Republic of China" only needs to type the pinyin string "zhonghuarenmingongheguo" continuously in the input method software to obtain the correct complete-sentence result, as shown in Fig. 1. Fig. 2 is a flowchart of the Chinese complete sentence generating method provided by the prior art, which comprises:
Step 201: divide the pinyin string into syllables;
Step 202: according to the syllable division result, look up in the pinyin dictionary all candidate words that occur in the pinyin string, and construct a candidate word directed graph, in which each arc corresponds to one or more candidate words and carries the word frequency of its highest-frequency candidate word;
Here, the pinyin dictionary records the mapping from pinyin to candidate words, and the word frequency is the number of times a candidate word occurs.
Step 203: obtain the probability of each arc from the word frequency carried by the directed graph;
Specifically, the probability of each arc is the word frequency carried by that arc divided by the sum of the word frequencies of all words in the pinyin dictionary.
Step 204: use a shortest-path algorithm (such as Dijkstra's algorithm or the Viterbi algorithm) to find the path with the largest probability (a combination of candidate words) as the complete-sentence result;
Step 205: display the complete-sentence result in the first position of the candidate word window, and display the candidate words corresponding to the initial arc of the directed graph after it in descending order of word frequency.
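Steps 201 and 202 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `build_word_graph`, the dictionary layout (syllables joined by apostrophes mapping to candidate words and word frequencies), the English glosses used as candidate words, and all frequency values are hypothetical and do not come from the patent.

```python
# Hypothetical pinyin dictionary: syllable string -> {candidate word: word frequency}.
# Candidate words are English glosses; the frequencies are invented for illustration.
PINYIN_DICT = {
    "wo": {"hold": 30, "I": 900},
    "wo'shi": {"bedroom": 50},
    "shi": {"am": 800},
    "hen": {"very": 400},
    "da": {"big": 500},
    "hen'da": {"very big": 60},
}


def build_word_graph(syllables, pinyin_dict):
    """Return arcs (start, end, candidates) for every syllable span in the dictionary.

    Node i sits before syllable i, so n syllables give nodes 0..n.
    Candidates on each arc are sorted by descending word frequency, so the
    first candidate is the one whose frequency the arc "carries".
    """
    arcs = []
    n = len(syllables)
    for start in range(n):
        for end in range(start + 1, n + 1):
            key = "'".join(syllables[start:end])
            candidates = pinyin_dict.get(key)
            if candidates:
                ranked = sorted(candidates.items(),
                                key=lambda item: item[1], reverse=True)
                arcs.append((start, end, ranked))
    return arcs


arcs = build_word_graph(["wo", "shi", "hen", "da"], PINYIN_DICT)
initial_arcs = [arc for arc in arcs if arc[0] == 0]  # arcs leaving the start node
```

With four syllables the graph has five nodes, and `initial_arcs` holds the arcs whose candidates are shown in the candidate word window in step 205.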
Taking the Viterbi algorithm as an example, step 204 is implemented as follows.
Starting from the start node of the directed graph, the accumulated probability (a product of probabilities) of each node is calculated, with the accumulated probability of the start node initialized to 1. For each node, the largest of its candidate accumulated probabilities is chosen as the node's accumulated probability and the corresponding predecessor node number is recorded, until the accumulated probability and predecessor node number of the last node have been obtained. Then, starting from the last node, the recorded predecessor node numbers are traced back to the start node, which yields the path with the largest probability; combining, in order, the candidate words corresponding to each arc on this path gives the complete-sentence result. The accumulated probability is computed as: accumulated probability of the current node = accumulated probability of its predecessor node × probability of the arc from that predecessor.
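The forward pass and traceback described above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function name `best_path`, the arc tuple layout, the English glosses, and the arc probabilities are all assumptions made for the example.

```python
def best_path(num_nodes, arcs):
    """Find the maximum-probability path from node 0 to the last node.

    arcs: iterable of (start, end, word, probability) with start < end.
    Returns (words along the best path, accumulated probability of the last node).
    """
    best = [0.0] * num_nodes      # accumulated probability of each node
    back = [None] * num_nodes     # (predecessor node, word on the arc)
    best[0] = 1.0                 # the start node is initialized to 1
    # Sorting by end node guarantees a predecessor is final before it is used.
    for start, end, word, prob in sorted(arcs, key=lambda arc: arc[1]):
        score = best[start] * prob
        if score > best[end]:
            best[end] = score
            back[end] = (start, word)
    # Trace back from the last node to the start node.
    words = []
    node = num_nodes - 1
    while node != 0:
        if back[node] is None:
            raise ValueError("last node is not reachable from the start node")
        prev, word = back[node]
        words.append(word)
        node = prev
    words.reverse()
    return words, best[-1]


# Hypothetical arc probabilities for the syllables wo'shi'hen'da.
example_arcs = [
    (0, 1, "I", 0.05), (1, 2, "am", 0.04), (0, 2, "bedroom", 0.001),
    (2, 3, "very", 0.02), (3, 4, "big", 0.02), (2, 4, "very big", 0.001),
]
words, probability = best_path(5, example_arcs)
```

With these invented numbers the best path is "I", "am", "very big", because frequency alone favors the common single-syllable words on the initial arc.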
The following example illustrates the existing Chinese complete sentence generating method.
Suppose the user enters the pinyin string "womendoushipingfanren". Syllable division gives "wo'men'dou'shi'ping'fan'ren". According to this result, all candidate words that occur in the pinyin string are looked up in the pinyin dictionary, and the candidate word directed graph shown in Fig. 3(a) is constructed. Each arc of this graph corresponds to one or more candidate words (listed from top to bottom in descending order of word frequency), and each arc carries the word frequency (not marked in the figure) of its highest-frequency candidate word, that is, the topmost candidate word in the figure. The Viterbi algorithm yields the complete-sentence result "we are all ordinary people", which is displayed in the first position of the candidate word window, as shown in Fig. 3(b). Starting from the second position of the candidate word window, the candidate words "we", "I" and "hold" corresponding to the initial arc of the directed graph are displayed in descending order of word frequency.
However, users are generally not accustomed to typing one very long pinyin string continuously; they tend to type one word at a time. For example, suppose a user wants to enter "this bedroom is very big" and types it in two parts. The first input is "zhejian"; the generated candidates are shown in Fig. 4(a), and the user selects "this". The user then types "woshihenda"; the generated candidates are shown in Fig. 4(b). The first-ranked candidate is "I am very big", which is not what the user wants. The user must first select candidate 2, giving the result shown in Fig. 4(c), and then select candidate 1 to obtain the correct complete-sentence result "bedroom is very big".
The defect of the prior art is that complete sentence generation considers only the candidate word with the highest word frequency. When the user enters a sentence in several parts, the result shown in the first position of the candidate word window is therefore often inaccurate, and the user needs several selection operations to reach the correct result, which slows down input.
Summary of the invention
The technical problem to be solved by the embodiments of the invention is to provide a Chinese complete sentence generating method and device that produce accurate complete-sentence results.
To solve this technical problem, an embodiment of the invention provides a Chinese complete sentence generating method, comprising:
Obtaining the previously generated candidate word;
Obtaining the candidate words that occur in a pinyin string, and constructing a candidate word directed graph;
Selecting, from the candidate words corresponding to the initial arc of the directed graph, the candidate word with the largest conditional probability given the previously generated candidate word;
Obtaining the complete-sentence result for the pinyin string based on the candidate word with the largest conditional probability.
Preferably, selecting the candidate word with the largest conditional probability given the previously generated candidate word specifically comprises:
Combining each candidate word corresponding to the initial arc of the directed graph with the previously generated candidate word;
Querying the word frequency of each candidate word combination, the word frequency of each candidate word corresponding to the initial arc, and the word frequency of the previously generated candidate word;
Calculating, from the word frequency of the candidate word combination, the word frequency of the candidate word corresponding to the initial arc, and the word frequency of the previously generated candidate word, the conditional probability of each candidate word corresponding to the initial arc, and selecting the candidate word with the largest conditional probability.
Preferably, calculating the conditional probability of a candidate word corresponding to the initial arc specifically comprises:
Calculating the co-occurrence probability of the candidate word combination from the word frequency of the candidate word combination and the word frequency of the previously generated candidate word;
Calculating the independent probability of the candidate word from the word frequency of the candidate word corresponding to the initial arc;
Adding the co-occurrence probability and the independent probability to obtain the conditional probability of the candidate word corresponding to the initial arc.
Preferably, calculating the co-occurrence probability of the candidate word combination specifically comprises:
Dividing the word frequency of the candidate word combination by the word frequency of the previously generated candidate word and multiplying the quotient by a first parameter, to obtain the co-occurrence probability of the candidate word combination;
Calculating the independent probability of the candidate word specifically comprises:
Dividing the word frequency of the candidate word corresponding to the initial arc by the sum of the word frequencies of all words in the pinyin dictionary and multiplying the quotient by a second parameter, to obtain the independent probability of the candidate word;
Here, the first parameter and the second parameter are positive numbers greater than 0 and less than 1, and the sum of the first parameter and the second parameter is less than 1.
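In symbols, the calculation just described can be written as below. The notation is introduced here for illustration only and is not used in the patent: $w_{\mathrm{prev}}$ is the previously generated candidate word, $w$ is a candidate word on the initial arc, $c(\cdot)$ is a word frequency, $N$ is the sum of the word frequencies of all words in the pinyin dictionary, and $\lambda_1$, $\lambda_2$ are the first and second parameters.

```latex
P(w \mid w_{\mathrm{prev}})
  = \underbrace{\lambda_1 \,\frac{c(w_{\mathrm{prev}}\, w)}{c(w_{\mathrm{prev}})}}_{\text{co-occurrence probability}}
  + \underbrace{\lambda_2 \,\frac{c(w)}{N}}_{\text{independent probability}},
\qquad 0 < \lambda_1 < 1,\quad 0 < \lambda_2 < 1,\quad \lambda_1 + \lambda_2 < 1
```

Read this way, the calculation has the form of a linear interpolation between a context-dependent estimate and a context-free one, which keeps the probability non-zero for a candidate word that never followed the previous word in the counted text.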
Preferably, obtaining the complete-sentence result for the pinyin string based on the selected candidate word specifically comprises:
Taking the conditional probability of the candidate word with the largest conditional probability as the probability of the initial arc of the candidate word directed graph;
Calculating the probabilities of the arcs of the candidate word directed graph other than the initial arc;
Using a shortest-path algorithm to obtain the path with the largest probability as the complete-sentence result for the pinyin string.
Preferably, the method further comprises:
Displaying the complete-sentence result in the first position of the candidate word window.
Preferably, the method further comprises:
Storing the previously generated candidate word in a buffer;
After the complete-sentence result is obtained, replacing the candidate word stored in the buffer with the complete-sentence result.
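The buffer behaviour can be sketched in Python as follows. The class name `PreviousCandidateBuffer`, its method names, and the punctuation set are hypothetical; the rule that punctuation clears the buffer is taken from the detailed description of step 503 later in this document.

```python
class PreviousCandidateBuffer:
    """Holds the word or complete sentence the user entered most recently."""

    # Assumed punctuation set; the patent does not enumerate one.
    PUNCTUATION = frozenset("，。！？；：、,.!?;:")

    def __init__(self):
        self._text = None

    def commit(self, text):
        """Replace the stored entry with the new input; punctuation clears it."""
        if text and all(ch in self.PUNCTUATION for ch in text):
            self._text = None
        else:
            self._text = text

    def get(self):
        """Return the previously generated candidate word, or None if empty."""
        return self._text


buffer = PreviousCandidateBuffer()
buffer.commit("this")                  # user selected "this" after typing zhejian
first = buffer.get()
buffer.commit("bedroom is very big")   # result chosen after typing woshihenda
second = buffer.get()
buffer.commit("。")                    # punctuation empties the buffer
third = buffer.get()
```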
An embodiment of the invention also provides a Chinese complete sentence generating device, comprising:
A directed graph construction unit, configured to obtain the candidate words that occur in a pinyin string and construct a candidate word directed graph;
A previous candidate word acquiring unit, configured to obtain the previously generated candidate word;
A candidate word selection unit, configured to select, from the candidate words corresponding to the initial arc of the directed graph, the candidate word with the largest conditional probability given the previously generated candidate word;
A complete sentence generation unit, configured to obtain the complete-sentence result based on the candidate word with the largest conditional probability.
Preferably, the candidate word selection unit specifically comprises a candidate word combination unit, a word frequency query unit, and a selection unit;
The candidate word combination unit is configured to combine each candidate word corresponding to the initial arc of the directed graph with the previously generated candidate word;
The word frequency query unit is configured to query the word frequency of each candidate word combination, the word frequency of each candidate word corresponding to the initial arc, and the word frequency of the previously generated candidate word;
The selection unit is configured to calculate, from the word frequency of the candidate word combination, the word frequency of the candidate word corresponding to the initial arc, and the word frequency of the previously generated candidate word, the conditional probability of each candidate word corresponding to the initial arc, and to select the candidate word with the largest conditional probability.
Preferably, the selection unit specifically comprises a co-occurrence probability calculation unit, an independent probability calculation unit, a conditional probability calculation unit, and a word selecting unit;
The co-occurrence probability calculation unit is configured to divide the word frequency of the candidate word combination by the word frequency of the previously generated candidate word and multiply the quotient by a first parameter, to obtain the co-occurrence probability of the candidate word combination;
The independent probability calculation unit is configured to divide the word frequency of the candidate word corresponding to the initial arc by the sum of the word frequencies of all words in the pinyin dictionary and multiply the quotient by a second parameter, to obtain the independent probability of the candidate word;
Here, the first parameter and the second parameter are positive numbers greater than 0 and less than 1, and the sum of the first parameter and the second parameter is less than 1;
The conditional probability calculation unit is configured to add the co-occurrence probability and the independent probability to obtain the conditional probability of the candidate word corresponding to the initial arc;
The word selecting unit is configured to select the candidate word with the largest conditional probability.
As can be seen from the above technical solutions, the embodiments of the invention have the following advantages:
The embodiments use the previously generated candidate word to select, from the candidate words corresponding to the initial arc of the directed graph, the candidate word with the largest conditional probability given the previously generated candidate word, and obtain the complete-sentence result for the pinyin string based on that candidate word. Because the conditional probability of the candidate words corresponding to the initial arc of the candidate word directed graph is calculated using the word frequency of the candidate word combination and the word frequency of the previously generated candidate word, that is, using context information, the accuracy of both complete-sentence generation and candidate word generation is improved.
Description of drawings
Fig. 1 is a first example of a Chinese complete-sentence generation result provided by the prior art;
Fig. 2 is a flowchart of the Chinese complete sentence generating method provided by the prior art;
Fig. 3(a) is a directed graph for Chinese complete-sentence generation provided by the prior art;
Fig. 3(b) is a second example of a Chinese complete-sentence generation result provided by the prior art;
Fig. 4(a) is a third example of a Chinese complete-sentence generation result provided by the prior art;
Fig. 4(b) is a third example of a Chinese complete-sentence generation result provided by the prior art;
Fig. 4(c) is a third example of a Chinese complete-sentence generation result provided by the prior art;
Fig. 5 is a flowchart of the Chinese complete sentence generating method provided by an embodiment of the invention;
Fig. 6 is a directed graph for Chinese complete-sentence generation provided by an embodiment of the invention;
Fig. 7(a) is a schematic diagram of the composition of the Chinese complete sentence generating device provided by an embodiment of the invention;
Fig. 7(b) is a schematic diagram of the composition of the directed graph construction unit provided by an embodiment of the invention;
Fig. 7(c) is a schematic diagram of the composition of the candidate word selection unit provided by an embodiment of the invention;
Fig. 7(d) is a schematic diagram of the composition of the selection unit provided by an embodiment of the invention;
Fig. 7(e) is a schematic diagram of the composition of the complete sentence generation unit provided by an embodiment of the invention.
Embodiment
The embodiments of the invention provide a Chinese complete sentence generating method and device. To make the objectives, technical solutions, and advantages of the embodiments clearer, the embodiments are described in detail below with reference to the accompanying drawings.
In the embodiments of the invention, a complete sentence means a word or a combination of words.
The Chinese complete sentence generating method provided by the embodiment of the invention comprises: obtaining the previously generated candidate word;
Obtaining the candidate words that occur in a pinyin string and constructing a candidate word directed graph; selecting, from the candidate words corresponding to the initial arc of the directed graph, the candidate word with the largest conditional probability given the previously generated candidate word; and obtaining the complete-sentence result for the pinyin string based on the candidate word with the largest conditional probability.
Fig. 5 is a flowchart of the Chinese complete sentence generating method provided by the embodiment of the invention, which comprises:
Step 501: divide the pinyin string into syllables;
Step 502: according to the syllable division result, look up in the pinyin dictionary all candidate words that occur in the pinyin string, and construct a candidate word directed graph;
Step 503: obtain the previously generated candidate word;
Here, the previously generated candidate word is the word or complete sentence that the user entered before the current input operation. It is stored in a buffer. Each time the user performs an input operation, the word or complete sentence stored in the buffer is replaced with the new word or complete sentence; if the user's next input is a punctuation mark, the buffer is cleared. For example, suppose the user is currently typing "woshihenda", and before that typed "zhejian" and selected "this"; "this" is then stored in the buffer. After typing "woshihenda" the user selects "bedroom is very big", and the word "this" stored in the buffer is replaced with the complete sentence "bedroom is very big".
Step 504: combine each candidate word corresponding to the initial arc of the directed graph with the previously generated candidate word;
Here, an initial arc of the candidate word directed graph is an arc whose starting point is the start node of the directed graph.
Step 505: query the word frequency of each candidate word combination, the word frequency of each candidate word corresponding to the initial arc, and the word frequency of the previously generated candidate word;
In the embodiment of the invention, the pinyin dictionary is used in advance to segment raw text into a set of segmented words. This set is scanned to count the number of times each word in the pinyin dictionary, and each combination of words, occurs in it; that is, the word frequencies of the words and word combinations in the pinyin dictionary are counted, together with the sum of the word frequencies of all words in the pinyin dictionary. This word frequency information is saved in a word frequency information file. Note that if a word or phrase in the pinyin dictionary does not occur in the segmented word set, its word frequency is counted as zero.
In step 505, the candidate words and candidate word combinations are matched against the words and word combinations stored in the word frequency information file to look up their word frequencies.
Step 506: calculate, from the word frequency of the candidate word combination, the word frequency of the candidate word corresponding to the initial arc, and the word frequency of the previously generated candidate word, the conditional probability of each candidate word corresponding to the initial arc, and select the candidate word with the largest conditional probability;
Step 507: obtain the complete-sentence result for the pinyin string based on the candidate word selected for the initial arc;
Step 508: display the complete-sentence result in the first position of the candidate word window.
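The word frequency statistics that step 505 relies on can be sketched in Python as follows. This is an illustration under stated assumptions: the text is taken to be already segmented into words, a "word combination" is taken to mean two adjacent words, and the function name `count_frequencies` and the sample sentences are hypothetical. `collections.Counter` returns 0 for a key it has never seen, which matches the rule that an unseen word or combination has word frequency zero.

```python
from collections import Counter


def count_frequencies(segmented_sentences):
    """Count word frequencies, adjacent-pair frequencies, and the total.

    segmented_sentences: iterable of lists of words (already segmented text).
    Returns (word_freq, pair_freq, total_freq).
    """
    word_freq = Counter()
    pair_freq = Counter()
    for words in segmented_sentences:
        word_freq.update(words)
        # Assumption: a word combination is a pair of adjacent words.
        pair_freq.update(zip(words, words[1:]))
    total_freq = sum(word_freq.values())
    return word_freq, pair_freq, total_freq


# Tiny invented corpus, using English glosses as words.
corpus = [
    ["this", "bedroom", "very big"],
    ["I", "am", "very", "big"],
    ["this", "bedroom"],
]
word_freq, pair_freq, total_freq = count_frequencies(corpus)
```

In a real input method these counts would be computed offline over a large corpus and stored, as the description says, in a word frequency information file.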
Step 507 in the embodiment of the invention is implemented as follows:
Take the conditional probability of the candidate word with the largest conditional probability as the probability of the initial arc;
Calculate the probabilities of the arcs of the candidate word directed graph other than the initial arc; the probability of each such arc equals the word frequency of the highest-frequency candidate word carried by that arc divided by the sum of the word frequencies of all words in the pinyin dictionary;
Use a shortest-path algorithm (such as Dijkstra's algorithm or the Viterbi algorithm) to find the path with the largest probability (a combination of candidate words) as the complete-sentence result.
The following takes the Viterbi algorithm as an example of using a shortest-path algorithm to obtain the complete-sentence result.
Starting from the start node of the directed graph, the accumulated probability (a product of probabilities) of each node is calculated, with the accumulated probability of the start node initialized to 1. For each node, the largest of its candidate accumulated probabilities is chosen as the node's accumulated probability and the corresponding predecessor node number is recorded, until the accumulated probability and predecessor node number of the last node have been obtained. Then, starting from the last node, the recorded predecessor node numbers are traced back to the start node, which yields the path with the largest probability; combining, in order, the candidate words corresponding to each arc on this path gives the complete-sentence result. The accumulated probability is computed as: accumulated probability of the current node = accumulated probability of its predecessor node × probability of the arc from that predecessor.
As this process shows, the embodiment of the invention differs from the prior art as follows: in the embodiment, the probability of the initial arc is the conditional probability of the candidate word with the largest conditional probability, whereas in the prior art the probability of the initial arc is calculated from the word frequency of the candidate word with the highest word frequency.
The above is the Chinese complete sentence generating method provided by the embodiment of the invention. In other embodiments of the invention, the conditional probability of the candidate words corresponding to the initial arc may be calculated while the candidate word directed graph is being constructed, or after it has been constructed; neither choice affects the implementation of the embodiment.
In a specific implementation of the above method, the conditional probability of a candidate word corresponding to the initial arc of the candidate word directed graph may be calculated as follows:
Calculate the co-occurrence probability of the candidate word combination from the word frequency of the candidate word combination and the word frequency of the previously generated candidate word;
Calculate the independent probability of the candidate word from the word frequency of the candidate word corresponding to the initial arc;
Add the co-occurrence probability and the independent probability to obtain the conditional probability of the candidate word corresponding to the initial arc.
In the embodiment of the invention, the co-occurrence probability, the independent probability, and the conditional probability may specifically be calculated with the following formulas:
Co-occurrence probability = (word frequency of the candidate word combination / word frequency of the previously generated candidate word) × first parameter;
Independent probability = (word frequency of the candidate word corresponding to the initial arc / sum of the word frequencies of all words in the pinyin dictionary) × second parameter;
Conditional probability = co-occurrence probability + independent probability + offset δ
Here, the first parameter and the second parameter are positive numbers greater than 0 and less than 1, and the sum of the first parameter and the second parameter is less than 1. The offset δ = (1 - first parameter - second parameter) / total number of words in the pinyin dictionary; the offset δ may be taken as approximately 0.
In other embodiments of the invention, other formulas may be used to calculate these three probabilities without affecting the implementation of the embodiment.
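The three formulas above translate directly into a small Python function. This is a sketch under stated assumptions: the function and argument names, the parameter values 0.6 and 0.3, and all frequencies are invented for illustration, and returning a zero co-occurrence term when the previous word has zero frequency is an assumption, since the patent does not discuss that case.

```python
def conditional_probability(pair_freq, prev_freq, word_freq, total_freq,
                            vocab_size, first_param=0.6, second_param=0.3):
    """Conditional probability of an initial-arc candidate word.

    pair_freq:  word frequency of (previous word + candidate word)
    prev_freq:  word frequency of the previously generated candidate word
    word_freq:  word frequency of the candidate word itself
    total_freq: sum of the word frequencies of all words in the dictionary
    vocab_size: total number of words in the dictionary
    """
    if not (0 < first_param < 1 and 0 < second_param < 1
            and first_param + second_param < 1):
        raise ValueError("parameters must lie in (0, 1) and sum to less than 1")
    # Assumption: no co-occurrence evidence if the previous word was never seen.
    co_occurrence = first_param * pair_freq / prev_freq if prev_freq else 0.0
    independent = second_param * word_freq / total_freq
    offset = (1 - first_param - second_param) / vocab_size
    return co_occurrence + independent + offset


# Invented counts: "this" occurs 100 times, "this bedroom" 40 times.
p_bedroom = conditional_probability(40, 100, 50, 10000, 1000)
# "I" is far more frequent on its own but never follows "this".
p_i = conditional_probability(0, 100, 900, 10000, 1000)
```

Under these invented counts the context term dominates, so "bedroom" outranks the much more frequent "I" once the previous word is known to be "this".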
The following example illustrates the complete sentence generating method provided by the embodiment of the invention. Suppose the user wants to enter "this bedroom is very big" and types it in two parts. The first input is "zhejian", and the user selects "this"; the buffer now holds "this". The user then types "woshihenda". Syllable division of "woshihenda" gives "wo'shi'hen'da". All candidate words in the pinyin string are looked up in the pinyin dictionary, and the candidate word directed graph shown in Fig. 6 is constructed. This graph has 5 nodes in total; the start node is numbered 0 and the last node is numbered 4.
The candidate words corresponding to the initial arc of the graph are each combined with "this", giving candidate word combinations such as "this I", "this hold", "this bedroom", and "this I am". The word frequencies of these combinations are looked up in the word frequency information file: the word frequency of "this bedroom" is an integer greater than zero, and the word frequencies of the other combinations are zero. The conditional probability of "bedroom" is therefore greater than that of the other candidate words corresponding to the initial arc, and it is taken as the probability of the initial arc. The probabilities of the other arcs are then calculated from the word frequency of the highest-frequency candidate word carried by each arc.
The accumulated probability of node 0, the start node, is initialized to 1. Starting from node 0, the accumulated probability and predecessor node number of each node are calculated. Finally, starting from node 4, the recorded predecessor node numbers are traced back to node 0 to obtain the path with the largest probability. In this example, tracing back from node 4 gives predecessor node 2, and tracing back from node 2 gives predecessor node 0, where the traceback ends. The nodes of the maximum-probability path are 0-2-4, and combining in order the candidate words along this path gives "bedroom is very big".
In the embodiment of the invention, the largest of the initial-arc probabilities is the conditional probability of "bedroom", so node 2 records node 0 as its predecessor. Node 4 records node 2 rather than node 3 as its predecessor because the probability of "very big" multiplied by the accumulated probability of node 2 is greater than the probability of "big" multiplied by the accumulated probability of node 3. The complete-sentence result is therefore "bedroom is very big" rather than the "I am very big" produced by the prior art.
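The worked example can be reproduced end to end with a short Python sketch. Everything concrete here is an assumption made for illustration: the function names, the parameter values, the English glosses, the arcs, and every frequency are invented, and the sketch reads the description as giving each arc that leaves node 0 the largest conditional probability among its own candidate words.

```python
def conditional_probability(pair, prev, word, total, vocab, p1=0.6, p2=0.3):
    """Co-occurrence term + independent term + offset (hypothetical p1, p2)."""
    co_occurrence = p1 * pair / prev if prev else 0.0
    return co_occurrence + p2 * word / total + (1 - p1 - p2) / vocab


def generate_sentence(num_nodes, arcs, word_freq, pair_freq, total, vocab,
                      prev_word=None):
    """arcs: (start, end, candidates in descending word-frequency order)."""
    best = [0.0] * num_nodes
    back = [None] * num_nodes
    best[0] = 1.0
    for start, end, candidates in sorted(arcs, key=lambda arc: arc[1]):
        if start == 0 and prev_word is not None:
            # Initial arc: use the best conditional probability given context.
            prob, word = max(
                (conditional_probability(pair_freq.get((prev_word, w), 0),
                                         word_freq.get(prev_word, 0),
                                         word_freq.get(w, 0), total, vocab), w)
                for w in candidates)
        else:
            # Other arcs (or no context): highest-frequency candidate only.
            word = candidates[0]
            prob = word_freq.get(word, 0) / total
        score = best[start] * prob
        if score > best[end]:
            best[end], back[end] = score, (start, word)
    words, node = [], num_nodes - 1
    while node != 0:                  # trace back to the start node
        node, word = back[node]
        words.append(word)
    return words[::-1]


WORD_FREQ = {"this": 100, "I": 900, "hold": 30, "bedroom": 50,
             "am": 800, "very": 400, "big": 500, "very big": 60}
PAIR_FREQ = {("this", "bedroom"): 40}
ARCS = [(0, 1, ["I", "hold"]), (0, 2, ["bedroom"]), (1, 2, ["am"]),
        (2, 3, ["very"]), (2, 4, ["very big"]), (3, 4, ["big"])]

with_context = generate_sentence(5, ARCS, WORD_FREQ, PAIR_FREQ, 10000, 1000,
                                 prev_word="this")
without_context = generate_sentence(5, ARCS, WORD_FREQ, PAIR_FREQ, 10000, 1000)
```

With the buffer holding "this", the path is 0-2-4 and the words are "bedroom", "very big"; with no context the same graph yields "I", "am", "very big", mirroring the contrast drawn between the embodiment and the prior art.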
The embodiment of the invention also provides a Chinese whole-sentence generating device. Referring to Fig. 7(a), the device comprises:
Digraph construction unit 701, configured to obtain the candidate words occurring in the pinyin string and construct the candidate word digraph;
Previous candidate word acquiring unit 702, configured to obtain the previously generated candidate word;
Candidate word selection unit 703, configured to select, from the candidate words corresponding to the initial arc of the digraph, the candidate word with the maximum conditional probability with respect to the previously generated candidate word;
Whole-sentence generation unit 704, configured to obtain the whole-sentence generation result based on the candidate word with the maximum conditional probability.
In a specific implementation, the digraph construction unit 701 may consist of the following three units, as shown in Fig. 7(b):
Syllable segmentation unit 7011, configured to segment the pinyin string into syllables;
Candidate word lookup unit 7012, configured to look up in the pinyin dictionary, according to the syllable segmentation result, the candidate words occurring in the pinyin string;
Digraph generation unit 7013, configured to construct the candidate word digraph from the candidate words obtained by the candidate word lookup unit.
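The lookup and digraph generation steps performed by these units can be sketched as below. This is a minimal sketch under stated assumptions: the syllable segmentation result is taken as given, `PINYIN_DICT` is a toy dictionary whose entries are English glosses invented for illustration, and `build_digraph` is a hypothetical name.

```python
from typing import Dict, List, Tuple

# Hypothetical pinyin dictionary: pinyin key -> candidate words (English glosses).
PINYIN_DICT: Dict[str, List[str]] = {
    "wo": ["I"], "shi": ["am"], "hen": ["very"], "da": ["big"],
    "woshi": ["bedroom", "I am"], "henda": ["very big"],
}

def build_digraph(syllables: List[str],
                  dictionary: Dict[str, List[str]]
                  ) -> List[Tuple[int, int, List[str]]]:
    """Return arcs (start_node, end_node, candidate_words).

    Node i is the boundary before syllable i, so n syllables give n + 1 nodes.
    An arc from i to j exists when syllables[i:j] joined is a dictionary key.
    """
    arcs = []
    n = len(syllables)
    for start in range(n):
        for end in range(start + 1, n + 1):
            key = "".join(syllables[start:end])
            if key in dictionary:
                arcs.append((start, end, dictionary[key]))
    return arcs

# Segmentation result of "woshihenda" from the worked example.
arcs = build_digraph("wo'shi'hen'da".split("'"), PINYIN_DICT)
```

For the four syllables of the example this yields nodes 0 to 4, matching the 5-node digraph described for Fig. 6; the arcs leaving node 0 are the initial arcs.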
In a specific implementation, the candidate word selection unit 703 may consist of the following three units, as shown in Fig. 7(c):
Candidate word combination unit 7031, configured to combine each candidate word corresponding to the initial arc of the digraph with the previously generated candidate word;
Word frequency query unit 7032, configured to query the word frequency of each candidate word combination, the word frequency of each candidate word corresponding to the initial arc, and the word frequency of the previously generated candidate word;
Selection unit 7033, configured to calculate the conditional probability of each candidate word corresponding to the initial arc from the word frequency of the candidate word combination, the word frequency of the candidate word corresponding to the initial arc, and the word frequency of the previously generated candidate word, and to select the candidate word with the maximum conditional probability.
In a specific implementation, the selection unit 7033 may consist of the following four units, as shown in Fig. 7(d):
Co-occurrence probability calculation unit 70331, configured to calculate the co-occurrence probability of the candidate word combination from the word frequency of the candidate word combination and the word frequency of the previously generated candidate word;
Independent probability calculation unit 70332, configured to calculate the independent probability of the candidate word from the word frequency of the candidate word corresponding to the initial arc;
Conditional probability calculation unit 70333, configured to add the co-occurrence probability and the independent probability to obtain the conditional probability of the candidate word corresponding to the initial arc;
Word selection unit 70334, configured to select the candidate word with the maximum conditional probability.
The co-occurrence probability calculation unit 70331 and the independent probability calculation unit 70332 may use the formulas for the co-occurrence probability and the independent probability given earlier in this description; the details are not repeated here.
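The calculation carried out by these four units can be sketched as follows, using the formulas stated in the claims: the co-occurrence probability is the combination frequency divided by the previous word's frequency, times a first parameter; the independent probability is the candidate's frequency divided by the total frequency of all words in the pinyin dictionary, times a second parameter; the two are added. The parameter values `alpha` and `beta`, the frequency counts, and all function names are assumptions made for illustration.

```python
from typing import Dict, Tuple

def conditional_probability(pair_freq: int, prev_freq: int,
                            word_freq: int, total_freq: int,
                            alpha: float = 0.6, beta: float = 0.3) -> float:
    """Co-occurrence probability plus independent probability.

    alpha and beta stand for the first and second parameters; each must lie
    in (0, 1) and their sum must be less than 1. The defaults are invented.
    """
    if not (0 < alpha < 1 and 0 < beta < 1 and alpha + beta < 1):
        raise ValueError("parameters must be in (0, 1) and sum to less than 1")
    co_occurrence = alpha * pair_freq / prev_freq if prev_freq > 0 else 0.0
    independent = beta * word_freq / total_freq
    return co_occurrence + independent

def select_candidate(prev_word: str,
                     candidates: Dict[str, int],
                     pair_freqs: Dict[Tuple[str, str], int],
                     prev_freq: int, total_freq: int) -> str:
    """Return the initial-arc candidate with the largest conditional probability."""
    return max(
        candidates,
        key=lambda w: conditional_probability(
            pair_freqs.get((prev_word, w), 0), prev_freq,
            candidates[w], total_freq),
    )

# Invented frequencies: only the combination "this bedroom" has a nonzero count.
CANDIDATES = {"I": 5000, "I am": 3000, "bedroom": 200}
PAIR_FREQS = {("this", "bedroom"): 50}
chosen = select_candidate("this", CANDIDATES, PAIR_FREQS,
                          prev_freq=1000, total_freq=1_000_000)
```

Even though "bedroom" has the lowest stand-alone frequency in this toy data, its nonzero combination frequency with "this" dominates the sum, which is the effect the embodiment relies on.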
In a specific implementation, the whole-sentence generation unit 704 may consist of the following units, as shown in Fig. 7(e):
Initial arc probability acquiring unit 7041, configured to take the conditional probability of the candidate word with the maximum conditional probability as the probability of the initial arc of the candidate word digraph;
Other arc probability calculation unit 7042, configured to calculate the probabilities of the arcs of the candidate word digraph other than the initial arc;
Path selection unit 7043, configured to use a shortest-path algorithm to obtain the path of maximum probability as the whole-sentence generation result of the pinyin string.
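One common way to obtain a maximum-probability path with a shortest-path algorithm is to weight each arc by the negative logarithm of its probability, so that the shortest path corresponds to the largest probability product. The patent does not say which shortest-path algorithm or transformation is used; the sketch below assumes Dijkstra's algorithm with this transformation, and the arc probabilities and the name `max_probability_path` are illustrative.

```python
import heapq
import math
from typing import Dict, List, Tuple

def max_probability_path(arcs: List[Tuple[int, int, float]],
                         source: int, target: int) -> Tuple[List[int], float]:
    """Dijkstra over -log(probability) weights; returns (node path, probability)."""
    graph: Dict[int, List[Tuple[int, float]]] = {}
    for start, end, prob in arcs:
        if prob > 0:  # zero-probability arcs can never be on the best path
            graph.setdefault(start, []).append((end, -math.log(prob)))
    dist: Dict[int, float] = {source: 0.0}
    prev: Dict[int, int] = {}
    heap: List[Tuple[float, int]] = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, math.inf):
            continue  # stale heap entry
        if node == target:
            break
        for nxt, cost in graph.get(node, []):
            nd = d + cost
            if nd < dist.get(nxt, math.inf):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    if target not in dist:
        raise ValueError("target node is not reachable")
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1], math.exp(-dist[target])

# Invented arc probabilities for a 5-node digraph (nodes 0 to 4).
EXAMPLE_ARCS = [(0, 1, 0.01), (0, 2, 0.30), (1, 2, 0.20),
                (2, 3, 0.30), (3, 4, 0.30), (2, 4, 0.20)]
best_nodes, best_prob = max_probability_path(EXAMPLE_ARCS, 0, 4)
```

Because every probability is at most 1, all transformed weights are non-negative, which is the condition Dijkstra's algorithm requires.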
To display the whole-sentence generation result, the above device may further comprise:
Whole-sentence display unit, configured to display the whole-sentence generation result in the first position of the candidate word window.
In addition, in the embodiment of the invention, the user may split a single word across two inputs. For example, the user enters "motorcycle" in two parts, inputting "rubbing" (the first character of the word) the first time and "holder car" (the remaining characters, pinyin "tuoche") the second time. In this case, the pinyin of "rubbing" saved in the buffer and the pinyin "tuoche" of the second input can be combined to obtain "motuoche". The word corresponding to "motuoche" is then looked up in the pinyin dictionary, and the "holder car" part of "motorcycle" is displayed in the first position of the candidate word window as the generation result.
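The split-word handling described above can be sketched as follows. The sketch assumes that the translated word "motorcycle" stands for 摩托车 (pinyin "motuoche"), with 摩 already in the buffer; the dictionary contents and the name `complete_split_word` are hypothetical.

```python
from typing import Dict, List, Optional

def complete_split_word(buffer_word: str, buffer_pinyin: str,
                        new_pinyin: str,
                        dictionary: Dict[str, List[str]]) -> Optional[str]:
    """Join the buffered pinyin with the new pinyin and look the result up.

    Returns the part of the matched word that follows the buffered text,
    or None when no dictionary word continues the buffer.
    """
    combined = buffer_pinyin + new_pinyin
    for word in dictionary.get(combined, []):
        if word.startswith(buffer_word) and len(word) > len(buffer_word):
            return word[len(buffer_word):]
    return None

# Toy dictionary; the entry is an assumption made for this example.
SPLIT_DICT = {"motuoche": ["摩托车"]}
remainder = complete_split_word("摩", "mo", "tuoche", SPLIT_DICT)
```

Only the remainder is returned because the buffered part has already been entered; that remainder is what would be shown first in the candidate word window.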
The Chinese complete sentence generating method and device provided by the present invention have been described in detail above. For those of ordinary skill in the art, changes may be made to the specific embodiments and to the scope of application in accordance with the idea of the embodiment of the invention. In summary, this description should not be construed as limiting the present invention.

Claims (10)

1. A Chinese complete sentence generating method, characterized by comprising:
obtaining the previously generated candidate word;
obtaining the candidate words occurring in a pinyin string, and constructing a candidate word digraph;
selecting, from the candidate words corresponding to the initial arc of the digraph, the candidate word with the maximum conditional probability with respect to the previously generated candidate word;
obtaining the whole-sentence generation result of the pinyin string based on the candidate word with the maximum conditional probability.
2. The method according to claim 1, characterized in that selecting the candidate word with the maximum conditional probability with respect to the previously generated candidate word specifically comprises:
combining each candidate word corresponding to the initial arc of the digraph with the previously generated candidate word;
querying the word frequency of each candidate word combination, the word frequency of each candidate word corresponding to the initial arc, and the word frequency of the previously generated candidate word;
calculating the conditional probability of each candidate word corresponding to the initial arc from the word frequency of the candidate word combination, the word frequency of the candidate word corresponding to the initial arc, and the word frequency of the previously generated candidate word, and selecting the candidate word with the maximum conditional probability.
3. The method according to claim 2, characterized in that calculating the conditional probability of the candidate word corresponding to the initial arc specifically comprises:
calculating the co-occurrence probability of the candidate word combination from the word frequency of the candidate word combination and the word frequency of the previously generated candidate word;
calculating the independent probability of the candidate word from the word frequency of the candidate word corresponding to the initial arc;
adding the co-occurrence probability and the independent probability to obtain the conditional probability of the candidate word corresponding to the initial arc.
4. The method according to claim 3, characterized in that calculating the co-occurrence probability of the candidate word combination specifically comprises:
dividing the word frequency of the candidate word combination by the word frequency of the previously generated candidate word and multiplying by a first parameter, to obtain the co-occurrence probability of the candidate word combination;
and calculating the independent probability of the candidate word specifically comprises:
dividing the word frequency of the candidate word corresponding to the initial arc by the sum of the word frequencies of all words in the pinyin dictionary and multiplying by a second parameter, to obtain the independent probability of the candidate word;
wherein the first parameter and the second parameter are each greater than 0 and less than 1, and the sum of the first parameter and the second parameter is less than 1.
5. The method according to any one of claims 1 to 4, characterized in that obtaining the whole-sentence generation result of the pinyin string based on the selected candidate word specifically comprises:
taking the conditional probability of the candidate word with the maximum conditional probability as the probability of the initial arc of the candidate word digraph;
calculating the probabilities of the arcs of the candidate word digraph other than the initial arc;
using a shortest-path algorithm to obtain the path of maximum probability as the whole-sentence generation result of the pinyin string.
6. The method according to any one of claims 1 to 4, characterized by further comprising:
displaying the whole-sentence generation result in the first position of the candidate word window.
7. The method according to any one of claims 1 to 4, characterized by further comprising:
saving the previously generated candidate word in a buffer;
after the whole-sentence generation result is obtained, replacing the candidate word saved in the buffer with the whole-sentence generation result.
8. A Chinese whole-sentence generating device, characterized by comprising:
a digraph construction unit, configured to obtain the candidate words occurring in a pinyin string and construct a candidate word digraph;
a previous candidate word acquiring unit, configured to obtain the previously generated candidate word;
a candidate word selection unit, configured to select, from the candidate words corresponding to the initial arc of the digraph, the candidate word with the maximum conditional probability with respect to the previously generated candidate word;
a whole-sentence generation unit, configured to obtain the whole-sentence generation result based on the candidate word with the maximum conditional probability.
9. The device according to claim 8, characterized in that the candidate word selection unit specifically comprises a candidate word combination unit, a word frequency query unit, and a selection unit;
the candidate word combination unit is configured to combine each candidate word corresponding to the initial arc of the digraph with the previously generated candidate word;
the word frequency query unit is configured to query the word frequency of each candidate word combination, the word frequency of each candidate word corresponding to the initial arc, and the word frequency of the previously generated candidate word;
the selection unit is configured to calculate the conditional probability of each candidate word corresponding to the initial arc from the word frequency of the candidate word combination, the word frequency of the candidate word corresponding to the initial arc, and the word frequency of the previously generated candidate word, and to select the candidate word with the maximum conditional probability.
10. The device according to claim 9, characterized in that the selection unit specifically comprises a co-occurrence probability calculation unit, an independent probability calculation unit, a conditional probability calculation unit, and a word selection unit;
the co-occurrence probability calculation unit is configured to divide the word frequency of the candidate word combination by the word frequency of the previously generated candidate word and multiply by a first parameter, to obtain the co-occurrence probability of the candidate word combination;
the independent probability calculation unit is configured to divide the word frequency of the candidate word corresponding to the initial arc by the sum of the word frequencies of all words in the pinyin dictionary and multiply by a second parameter, to obtain the independent probability of the candidate word;
wherein the first parameter and the second parameter are each greater than 0 and less than 1, and the sum of the first parameter and the second parameter is less than 1;
the conditional probability calculation unit is configured to add the co-occurrence probability and the independent probability to obtain the conditional probability of the candidate word corresponding to the initial arc;
the word selection unit is configured to select the candidate word with the maximum conditional probability.
CN200710151332XA 2007-09-25 2007-09-25 Chinese integral sentence generation method and device Active CN101122901B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200710151332XA CN101122901B (en) 2007-09-25 2007-09-25 Chinese integral sentence generation method and device

Publications (2)

Publication Number Publication Date
CN101122901A true CN101122901A (en) 2008-02-13
CN101122901B CN101122901B (en) 2011-11-09

Family

ID=39085238

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200710151332XA Active CN101122901B (en) 2007-09-25 2007-09-25 Chinese integral sentence generation method and device

Country Status (1)

Country Link
CN (1) CN101122901B (en)

Also Published As

Publication number Publication date
CN101122901B (en) 2011-11-09

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: SHENZHEN SHIJI LIGHT SPEED INFORMATION TECHNOLOGY

Free format text: FORMER OWNER: TENGXUN SCI-TECH (SHENZHEN) CO., LTD.

Effective date: 20131021

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 518044 SHENZHEN, GUANGDONG PROVINCE TO: 518057 SHENZHEN, GUANGDONG PROVINCE

TR01 Transfer of patent right

Effective date of registration: 20131021

Address after: 518057 Tencent Building, 16, Nanshan District hi tech park, Guangdong, Shenzhen

Patentee after: Shenzhen Shiji Guangsu Information Technology Co., Ltd.

Address before: 2, 518044, East 410 room, SEG science and Technology Park, Zhenxing Road, Shenzhen, Guangdong, Futian District

Patentee before: Tencent Technology (Shenzhen) Co., Ltd.