CN105718070A - Pinyin long sentence continuous type-in input method and Pinyin long sentence continuous type-in input system - Google Patents

Pinyin long sentence continuous type-in input method and Pinyin long sentence continuous type-in input system Download PDF

Info

Publication number
CN105718070A
Authority
CN
China
Prior art keywords
chinese character
long sentence
bhmm
model
phonetic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610029530.8A
Other languages
Chinese (zh)
Inventor
周诚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Gaoxin Computer Systems Co Ltd
Original Assignee
Shanghai Gaoxin Computer Systems Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Gaoxin Computer Systems Co Ltd filed Critical Shanghai Gaoxin Computer Systems Co Ltd
Priority to CN201610029530.8A priority Critical patent/CN105718070A/en
Publication of CN105718070A publication Critical patent/CN105718070A/en
Pending legal-status Critical Current

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01: Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/02: Input arrangements using manually operated switches, e.g. using keyboards or dials
    • G06F3/023: Arrangements for converting discrete items of information into a coded form, e.g. arrangements for interpreting keyboard-generated codes as alphanumeric codes, operand codes or instruction codes
    • G06F3/0233: Character input methods

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention relates to the field of input methods, and discloses a continuous Pinyin long-sentence input method and a corresponding input system. The input method comprises the following steps: pre-building a bidirectional hidden Markov model (BHMM); continuously receiving the Pinyin codes input by a user; obtaining, from the BHMM and the continuously received Pinyin codes, the long sentence formed by the characters with the highest occurrence probability in the BHMM; and outputting that long sentence. The input system comprises a client and a cloud server, wherein the cloud server comprises a model-building module, a matching module and a return module, and the client comprises a receiving module, a sending module and an output module. By building the BHMM, the method and system improve the accuracy of whole-sentence and long-sentence Pinyin-to-Chinese-character conversion during continuous Pinyin input.

Description

A continuous Pinyin long-sentence input method and system
Technical field
The present invention relates to the field of input methods, and in particular to Pinyin input.
Background technology
With the development and progress of computer technology, Pinyin input methods have also improved, in particular continuous long-sentence input. In the prior art, finding the optimal sentence for a given Pinyin sequence is treated as a context-based dynamic-programming problem: a shortest-path search. Its core technique is the hidden Markov model (HMM), which uses statistical natural-language processing to compute the optimal sentence. The weakness of the Markov approach lies in its independence assumption, which ignores the dependence between the event at the current moment and all earlier events; yet it is precisely this assumption that keeps the algorithm simple and clear. The HMM solves the conditional-probability problem with a generative joint distribution, an approach that is ill-suited to observation sequences described by many features. The Markov assumption also means that raising the order of the model brings no corresponding statistical gain; however high the order, the model cannot cover all linguistic phenomena. Together, these defects cause low output accuracy when whole sentences or long sentences are typed continuously in Pinyin.
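The prior-art decoding described above, HMM conversion as a dynamic-programming shortest-path search, can be illustrated with a minimal first-order Viterbi sketch. All names, candidate tables and probability values below are illustrative toy data, not from the patent:

```python
def viterbi(pinyins, cands, emit_p, trans_p):
    """Find the most probable character sequence for a pinyin sequence.

    cands[py]        -> list of candidate characters for syllable py
    emit_p[(py, w)]  -> P(character w | pinyin py)
    trans_p[(w1,w2)] -> P(w2 | w1), first-order character transitions
    """
    # best[w] = (probability of the best path ending in w, that path)
    best = {w: (emit_p.get((pinyins[0], w), 0.0), [w]) for w in cands[pinyins[0]]}
    for py in pinyins[1:]:
        nxt = {}
        for w in cands[py]:
            e = emit_p.get((py, w), 0.0)
            # extend the best predecessor path (dynamic programming step)
            p, path = max(
                ((pp * trans_p.get((pw, w), 0.0) * e, pth + [w])
                 for pw, (pp, pth) in best.items()),
                key=lambda x: x[0],
            )
            nxt[w] = (p, path)
        best = nxt
    return max(best.values(), key=lambda x: x[0])[1]
```

A path score here is the product of emission and transition probabilities, which is exactly the quantity a shortest-path formulation minimizes in negative log space.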
Summary of the invention
The object of the present invention is to provide a continuous Pinyin long-sentence input method and system that, by building a bidirectional hidden Markov model (BHMM), improve the accuracy of continuous long-sentence Pinyin input.
To solve the above technical problem, embodiments of the present invention provide a continuous Pinyin long-sentence input method comprising the following steps:
pre-building a bidirectional hidden Markov model (BHMM), in which the occurrence probability of each Chinese character in a long sentence is determined by the number of times the character and its preceding N characters occur together in the database during forward propagation, and the number of times the character and its following N characters occur together in the database during backward propagation, N being a natural number greater than 1;
continuously receiving the Pinyin codes input by the user;
obtaining, from the BHMM and the continuously received Pinyin codes, the long sentence formed by the characters with the highest occurrence probability in the BHMM; and
outputting the obtained long sentence.
Correspondingly, a further object of the present invention is to provide a continuous Pinyin long-sentence input system comprising a client and a cloud server.
The cloud server comprises:
a model-building module for pre-building the bidirectional hidden Markov model (BHMM), in which the occurrence probability of each Chinese character in a long sentence is determined by the number of times the character and its preceding N characters occur together in the database during forward propagation, and the number of times the character and its following N characters occur together in the database during backward propagation, N being a natural number greater than 1;
a matching module for obtaining, from the BHMM and the Pinyin codes continuously received from the client, the long sentence formed by the characters with the highest occurrence probability in the BHMM; and
a return module for returning the long sentence obtained by the matching module to the client.
The client comprises:
a receiving module for continuously receiving the Pinyin codes input by the user;
a sending module for sending the continuously received Pinyin codes to the cloud server; and
an output module for outputting the long sentence returned by the cloud server.
Compared with the prior art, the embodiments of the present invention provide a long-sentence input method and system that, by building a bidirectional hidden Markov model (BHMM), improve the accuracy of whole-sentence and long-sentence Pinyin-to-Chinese-character conversion.
In addition, the BHMM is built in the cloud server; the client sends the continuously received Pinyin codes to the cloud server in real time, and the cloud server obtains the long sentence in real time from the BHMM and the received codes, which optimizes the efficiency of Pinyin-to-character conversion.
In addition, in the step of obtaining the long sentence formed by the characters with the highest occurrence probability in the BHMM, the long sentence is obtained according to the following formula:
$$w_1, w_2, \ldots, w_L = \operatorname{ArgMax} \prod_{i=1}^{L} \bigl( P(w_i \mid w_{i-1}, w_{i-2}, \ldots, w_{i-N}) + P(w'_i \mid w'_{i+1}, w'_{i+2}, \ldots, w'_{i+N}) \bigr)$$
where $w_1, w_2, \ldots, w_L$ are the characters of the obtained long sentence and $L$ is a natural number greater than 1; $P(w_i \mid w_{i-1}, w_{i-2}, \ldots, w_{i-N})$ is the probability, during forward propagation, that character $w_i$ occurs given its preceding $N$ characters $w_{i-1}, w_{i-2}, \ldots, w_{i-N}$; and $P(w'_i \mid w'_{i+1}, w'_{i+2}, \ldots, w'_{i+N})$ is the probability, during backward propagation, that character $w'_i$ occurs given its following $N$ characters $w'_{i+1}, w'_{i+2}, \ldots, w'_{i+N}$.
In addition, the bidirectional BHMM may set N to 3, giving a third-order bidirectional HMM (also called a quaternary bidirectional HMM). In this case the current state in the propagation process is affected by the three states before it and the three states after it, and the output is more accurate, and obtained faster, than with a first- or second-order model.
In addition, after continuously receiving the Pinyin codes input by the user, the method may also obtain, from the BHMM and the continuously received Pinyin codes, the long sentence formed by the characters with the second-highest occurrence probability in the BHMM, and output that long sentence so that the user can choose between candidates.
Brief description of the drawings
Fig. 1 is a flowchart of the continuous Pinyin long-sentence input method according to the first embodiment of the present invention;
Fig. 2 is a schematic diagram of the basic principle of a first-order bidirectional HMM in the first embodiment of the present invention;
Figs. 3a-3d are schematic comparisons of the output of two input methods in the first embodiment of the present invention;
Fig. 4 is a structural diagram of the continuous Pinyin long-sentence input system according to the second embodiment of the present invention.
Detailed description of the invention
To make the objects, technical solutions and advantages of the present invention clearer, the embodiments of the present invention are explained in detail below with reference to the accompanying drawings. Those skilled in the art will understand that many technical details are given in each embodiment to help the reader better understand the application; however, the technical solutions claimed in the present application can be realized even without these details, and with many variations and modifications of the following embodiments.
The first embodiment of the present invention relates to a continuous Pinyin long-sentence input method, whose flow is shown in Fig. 1 and detailed as follows:
In step S101, the cloud server pre-builds the bidirectional hidden Markov model (BHMM). The BHMM specifies that the occurrence probability of each character in a long sentence is determined by the number of times the character and its preceding N characters occur together in the database during forward propagation, and the number of times the character and its following N characters occur together in the database during backward propagation. It should be noted that the database of this embodiment stores common long sentences together with their corresponding Pinyin codes.
It should also be noted that a sentence consists of L characters (or words). The Pinyin corresponds to the states of the BHMM; the candidate characters correspond to its output symbols; moving from one character (or word) to the next corresponds to a state transition; and typing Pinyin from the 26 letters corresponds to the observation symbols emitted from a state. The principle of the bidirectional HMM is illustrated next for the first-order case (N = 1); the basic principle is shown in Fig. 2.
Here $py_1, py_2, \ldots, py_L$ denote the $L$ Pinyin syllables, and $w_1, w_2, \ldots, w_L$ the characters output for each syllable, i.e. the candidate words. In the forward pass (solid lines in Fig. 2), at moment $t$ the Pinyin is $py_t$ and the corresponding character is $w_t$; given the previous syllable $py_{t-1}$, the probability of the input Pinyin $py_t$ is $P(py_t \mid py_{t-1})$; the probability that $py_t$ converts to character $w_t$ is $P(w_t \mid py_t)$; and the probability of finally outputting the correct character (or word), the one composing the optimal sentence, is $P(w_t \mid w_{t-1})$. In the backward pass (dashed lines in Fig. 2), likewise at moment $t$, given the following syllable $py'_{t+1}$, the probability of the input Pinyin $py'_t$ is $P(py'_t \mid py'_{t+1})$; the probability that $py'_t$ converts to character $w'_t$ is $P(w'_t \mid py'_t)$; and the probability of finally outputting the correct character (or word) is $P(w'_t \mid w'_{t+1})$. In the forward pass, $\#(py_{t-1}, py_t)$ denotes the number of transitions from $py_{t-1}$ to $py_t$; $\#(py_t, w_t)$ the number of times $py_t$ converts to $w_t$; $\#(w_{t-1}, w_t)$ the number of times $w_{t-1}$ and $w_t$ occur together; and $\#(py_t)$, $\#(py_{t-1})$ and $\#(w_{t-1})$ the numbers of occurrences of $py_t$, $py_{t-1}$ and the candidate $w_{t-1}$, respectively. In the backward pass, $\#(py'_{t+1}, py'_t)$ denotes the number of transitions from $py'_{t+1}$ to $py'_t$; $\#(py'_t, w'_t)$ the number of times $py'_t$ converts to $w'_t$; $\#(w'_{t+1}, w'_t)$ the number of times $w'_{t+1}$ and $w'_t$ occur together; and $\#(py'_t)$, $\#(py'_{t+1})$, $\#(w'_t)$ and $\#(w'_{t+1})$ the corresponding occurrence counts.
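The count notation above reduces every probability in the model to a ratio of co-occurrence counts. A minimal sketch of this estimation for the first-order forward case (N = 1), using a hypothetical two-sentence database; the corpus, counter names and example syllables are all illustrative assumptions:

```python
from collections import Counter

# Hypothetical toy database: (pinyin, character) sequences standing in for
# the patent's database of common long sentences and their pinyin codes.
corpus = [
    [("wo", "我"), ("ai", "爱"), ("ni", "你")],
    [("wo", "我"), ("ai", "爱"), ("ta", "他")],
]

py_bigram = Counter()   # #(py_{t-1}, py_t): pinyin transition counts
py_prev = Counter()     # #(py_{t-1}): counted at positions that have a successor
emit = Counter()        # #(py_t, w_t): pinyin-to-character conversion counts
py_total = Counter()    # #(py_t): total occurrences of each syllable
w_bigram = Counter()    # #(w_{t-1}, w_t): character co-occurrence counts
w_prev = Counter()      # #(w_{t-1}): counted at positions that have a successor

for sent in corpus:
    for t, (py, w) in enumerate(sent):
        emit[(py, w)] += 1
        py_total[py] += 1
        if t > 0:
            prev_py, prev_w = sent[t - 1]
            py_bigram[(prev_py, py)] += 1
            py_prev[prev_py] += 1
            w_bigram[(prev_w, w)] += 1
            w_prev[prev_w] += 1

def p_py_trans(prev_py, py):
    """P(py_t | py_{t-1}) ≈ #(py_{t-1}, py_t) / #(py_{t-1})."""
    return py_bigram[(prev_py, py)] / py_prev[prev_py] if py_prev[prev_py] else 0.0

def p_emit(py, w):
    """P(w_t | py_t) ≈ #(py_t, w_t) / #(py_t)."""
    return emit[(py, w)] / py_total[py] if py_total[py] else 0.0

def p_w_trans(prev_w, w):
    """P(w_t | w_{t-1}) ≈ #(w_{t-1}, w_t) / #(w_{t-1})."""
    return w_bigram[(prev_w, w)] / w_prev[prev_w] if w_prev[prev_w] else 0.0
```

With this toy corpus, "wo" is always followed by "ai", so `p_py_trans("wo", "ai")` is 1.0, while "爱" is followed by "你" in only one of its two occurrences, so `p_w_trans("爱", "你")` is 0.5.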
In step S102, the client continuously receives the Pinyin codes input by the user on the 26 keys a to z according to the rules of the Pinyin input method; the continuously received Pinyin codes are the complete Pinyin codes of the individual characters.
In step S103, the client sends the continuously received Pinyin codes to the cloud server.
In step S104, the cloud server obtains in real time, from the BHMM and the continuously received Pinyin codes, the long sentences formed by the characters with the highest and second-highest occurrence probabilities in the BHMM.
The concrete BHMM algorithm is as follows:
In the model, the occurrence probability of each character in a long sentence is determined by the number of times the character occurs together in the database with its preceding N characters (forward propagation) and with its following N characters (backward propagation). In this embodiment N is 3; that is, the model is a third-order bidirectional HMM (also called a quaternary bidirectional HMM). A third-order bidirectional HMM means that, in the forward pass, the event at moment t depends on the events at moments t-1, t-2 and t-3, while in the backward pass it depends on the events at moments t+1, t+2 and t+3. The concrete algorithm steps are as follows:
In the forward pass, compute the Pinyin transition probability at moment t:

$$P(py_t \mid py_{t-1}, py_{t-2}, py_{t-3}) \approx \frac{\#(py_{t-1}\,py_{t-2}\,py_{t-3},\; py_t)}{\#(py_{t-1}\,py_{t-2}\,py_{t-3})}$$

Compute the probability that the Pinyin at moment t converts to a character, from which the candidate characters can be calculated:

$$P(w_t \mid py_t, py_{t-1}, py_{t-2}) = \frac{\#(py_t\,py_{t-1}\,py_{t-2},\; w_t)}{\#(py_t\,py_{t-1}\,py_{t-2})}$$

Compute the probability of outputting the correct character (or word):

$$P(w_t \mid w_{t-1}, w_{t-2}, w_{t-3}) = \frac{\#(w_{t-1}\,w_{t-2}\,w_{t-3},\; w_t)}{\#(w_{t-1}\,w_{t-2}\,w_{t-3})}$$

In the backward pass, compute the probability of the Pinyin input at moment t:

$$P(py'_t \mid py'_{t+1}, py'_{t+2}, py'_{t+3}) \approx \frac{\#(py'_{t+1}\,py'_{t+2}\,py'_{t+3},\; py'_t)}{\#(py'_{t+1}\,py'_{t+2}\,py'_{t+3})}$$

Compute the probability that the Pinyin at moment t converts to a character, from which the candidate characters can be obtained:

$$P(w'_t \mid py'_t, py'_{t+1}, py'_{t+2}) \approx \frac{\#(py'_t\,py'_{t+1}\,py'_{t+2},\; w'_t)}{\#(py'_t\,py'_{t+1}\,py'_{t+2})}$$

Compute the probability of outputting the correct character (or word):

$$P(w'_t \mid w'_{t+1}, w'_{t+2}, w'_{t+3}) \approx \frac{\#(w'_{t+1}\,w'_{t+2}\,w'_{t+3},\; w'_t)}{\#(w'_{t+1}\,w'_{t+2}\,w'_{t+3})}$$

From these, the final probability of the event at moment t, i.e. of producing the candidate character (or word), is:

$$w_t = \operatorname*{Max}_{w \in W} \bigl( P(w_t \mid w_{t-1}, w_{t-2}, w_{t-3}) + P(w'_t \mid w'_{t+1}, w'_{t+2}, w'_{t+3}) \bigr)$$

Compute the final optimal sentence output:

$$w_1, w_2, \ldots, w_L = \operatorname*{ArgMax}_{w \in W} \bigl( P(py_1, \ldots, py_N \mid w_1, \ldots, w_N) \cdot P(w_1, \ldots, w_N) \bigr) \approx \operatorname{ArgMax} \prod_{i=1}^{L} \bigl( P(w_i \mid w_{i-1}, w_{i-2}, w_{i-3}) + P(w'_i \mid w'_{i+1}, w'_{i+2}, w'_{i+3}) \bigr)$$

where, in the forward pass, $\#(py_{t-1}py_{t-2}py_{t-3}, py_t)$ is the number of Pinyin transitions; $\#(py_{t-1}py_{t-2}py_{t-3})$ the number of joint occurrences; $\#(py_tpy_{t-1}py_{t-2}, w_t)$ the number of times the Pinyin converts to character $w_t$; $\#(py_tpy_{t-1}py_{t-2})$ the number of times the three Pinyin syllables occur together; $\#(w_{t-1}w_{t-2}w_{t-3}, w_t)$ the number of transitions from $w_{t-1}w_{t-2}w_{t-3}$ to $w_t$; and $\#(w_{t-1}w_{t-2}w_{t-3})$ the number of times the three characters (or words) occur together. In the backward pass, $\#(py'_{t+1}py'_{t+2}py'_{t+3}, py'_t)$ is the number of Pinyin transitions; $\#(py'_{t+1}py'_{t+2}py'_{t+3})$ the number of times the Pinyin syllables occur together; $\#(py'_tpy'_{t+1}py'_{t+2}, w'_t)$ the number of times the Pinyin converts to character $w'_t$; $\#(py'_tpy'_{t+1}py'_{t+2})$ the number of times the three Pinyin syllables occur together; $\#(w'_{t+1}w'_{t+2}w'_{t+3}, w'_t)$ the number of transitions from $w'_{t+1}w'_{t+2}w'_{t+3}$ to $w'_t$; and $\#(w'_{t+1}w'_{t+2}w'_{t+3})$ the number of times the three characters (or words) occur together.
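The character-level part of the algorithm above can be sketched end to end: count third-order forward and backward contexts from a toy database, then score each candidate sentence by the product over positions of the summed forward and backward conditional probabilities. This is an illustrative sketch; the database, the candidate lists and the brute-force search over candidate combinations (the patent does not specify a search procedure) are assumptions:

```python
from collections import Counter
from itertools import product

N = 3  # third-order ("quaternary") bidirectional model, as in the embodiment

def train(sentences):
    """Collect #(context, w) counts in both directions from a toy database
    of known sentences, each given as a list of characters."""
    fwd, fwd_ctx = Counter(), Counter()
    bwd, bwd_ctx = Counter(), Counter()
    for s in sentences:
        for t, w in enumerate(s):
            cf = tuple(s[max(0, t - N):t])   # up to N preceding characters
            cb = tuple(s[t + 1:t + 1 + N])   # up to N following characters
            fwd[(cf, w)] += 1; fwd_ctx[cf] += 1
            bwd[(cb, w)] += 1; bwd_ctx[cb] += 1
    return fwd, fwd_ctx, bwd, bwd_ctx

def decode(cands, model):
    """Brute-force the per-position candidate characters, scoring each full
    sentence by the product over i of P(w_i | preceding) + P(w'_i | following),
    mirroring the final ArgMax formula."""
    fwd, fwd_ctx, bwd, bwd_ctx = model

    def p(num, den, ctx, w):
        return num[(ctx, w)] / den[ctx] if den[ctx] else 0.0

    best, best_score = None, -1.0
    for sent in product(*cands):
        score = 1.0
        for t, w in enumerate(sent):
            cf = tuple(sent[max(0, t - N):t])
            cb = tuple(sent[t + 1:t + 1 + N])
            score *= p(fwd, fwd_ctx, cf, w) + p(bwd, bwd_ctx, cb, w)
        if score > best_score:
            best, best_score = sent, score
    return best
```

Because the forward and backward terms are added before multiplying across positions, a character supported in only one direction still contributes a nonzero factor, which is the benefit the bidirectional model claims over a purely forward one.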
In step S105, the client outputs the obtained long sentences formed by the characters with the highest and second-highest occurrence probabilities in the BHMM.
Figs. 3a-3d compare the output of an input method without bidirectional-HMM processing against the output of an input method with it. The figures contain four groups of examples; in each group, the upper result is produced without the bidirectional HMM and the lower result with it.
It can be seen that, by pre-building the bidirectional hidden Markov model (BHMM), this embodiment remedies the poor accuracy of prior-art Pinyin long-sentence input and improves the accuracy of whole-sentence and long-sentence Pinyin-to-character conversion.
The second embodiment of the present invention relates to a continuous Pinyin long-sentence input system comprising a client and a cloud server; Fig. 4 shows the structure of this embodiment.
The cloud server comprises a model-building module, a matching module and a return module. The model-building module pre-builds the bidirectional hidden Markov model (BHMM); the matching module obtains, from the BHMM and the Pinyin codes continuously received from the client, the long sentence formed by the characters with the highest occurrence probability in the BHMM; and the return module returns the long sentence obtained by the matching module to the client.
The client comprises a receiving module, a sending module and an output module. The receiving module continuously receives the Pinyin codes input by the user; the sending module sends the continuously received Pinyin codes to the cloud server; and the output module outputs the long sentence returned by the cloud server.
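The module decomposition above can be sketched as two cooperating objects. The class and method names, and the dummy conversion table, are illustrative assumptions; in particular, the cloud-side BHMM conversion is stubbed out with a simple lookup:

```python
class CloudServer:
    """Holds the model side: 'convert' stands in for the model-building and
    matching modules (here a stub instead of a real BHMM)."""
    def __init__(self, convert):
        self.convert = convert

    def handle(self, pinyin_codes):
        # Return module: send the converted long sentence back to the client.
        return self.convert(pinyin_codes)


class Client:
    def __init__(self, server):
        self.server = server
        self.buffer = []          # continuously received pinyin codes

    def receive_key(self, code):  # receiving module
        self.buffer.append(code)

    def flush(self):              # sending module: forward codes to the server
        return self.server.handle(list(self.buffer))

    def output(self):             # output module: return the server's sentence
        return self.flush()


# Stub conversion table standing in for the cloud-side BHMM lookup.
table = {"ni": "你", "hao": "好"}
client = Client(CloudServer(lambda codes: "".join(table[c] for c in codes)))
```

Keeping the model entirely on the server side matches the patent's design choice: the client only buffers and forwards Pinyin codes, so the heavy statistics stay in the cloud.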
It can be seen that this embodiment is the system embodiment corresponding to the first embodiment, and the two can be implemented together. The technical details given in the first embodiment remain valid in this embodiment and, to reduce repetition, are not repeated here; conversely, the technical details given in this embodiment also apply to the first embodiment.
Those skilled in the art will understand that the above embodiments are specific implementations of the present invention and that, in practical application, various changes in form and detail may be made to them without departing from the spirit and scope of the invention.

Claims (8)

1. A continuous Pinyin long-sentence input method, characterized by comprising the steps of:
pre-building a bidirectional hidden Markov model (BHMM), in which the occurrence probability of each Chinese character in a long sentence is determined by the number of times the character and its preceding N characters occur together in the database during forward propagation, and the number of times the character and its following N characters occur together in the database during backward propagation, N being a natural number greater than 1;
continuously receiving the Pinyin codes input by a user;
obtaining, from the BHMM and the continuously received Pinyin codes, the long sentence formed by the characters with the highest occurrence probability in the BHMM; and
outputting the obtained long sentence.
2. The continuous Pinyin long-sentence input method of claim 1, characterized in that
the BHMM is built in a cloud server; and
the step of obtaining the long sentence formed by the characters with the highest occurrence probability in the BHMM comprises the sub-step of:
the client sending the continuously received Pinyin codes to the cloud server in real time, and the cloud server obtaining the long sentence in real time from the BHMM and the continuously received Pinyin codes.
3. The continuous Pinyin long-sentence input method of claim 1, characterized in that, in the step of obtaining the long sentence formed by the characters with the highest occurrence probability in the BHMM, the long sentence is obtained according to the following formula:
$$w_1, w_2, \ldots, w_L = \operatorname{ArgMax} \prod_{i=1}^{L} \bigl( P(w_i \mid w_{i-1}, w_{i-2}, \ldots, w_{i-N}) + P(w'_i \mid w'_{i+1}, w'_{i+2}, \ldots, w'_{i+N}) \bigr)$$
where $w_1, w_2, \ldots, w_L$ are the characters of the obtained long sentence and $L$ is a natural number greater than 1; $P(w_i \mid w_{i-1}, w_{i-2}, \ldots, w_{i-N})$ is the probability, during forward propagation, that character $w_i$ occurs given its preceding $N$ characters $w_{i-1}, w_{i-2}, \ldots, w_{i-N}$; and $P(w'_i \mid w'_{i+1}, w'_{i+2}, \ldots, w'_{i+N})$ is the probability, during backward propagation, that character $w'_i$ occurs given its following $N$ characters $w'_{i+1}, w'_{i+2}, \ldots, w'_{i+N}$.
4. The continuous Pinyin long-sentence input method of claim 3, characterized in that N is 3.
5. The continuous Pinyin long-sentence input method of claim 1, characterized by further comprising the steps of:
after continuously receiving the Pinyin codes input by the user, obtaining, from the BHMM and the continuously received Pinyin codes, the long sentence formed by the characters with the second-highest occurrence probability in the BHMM; and
outputting that long sentence for the user to select.
6. A continuous Pinyin long-sentence input system, characterized by comprising a client and a cloud server;
the cloud server comprising:
a model-building module for pre-building a bidirectional hidden Markov model (BHMM), in which the occurrence probability of each Chinese character in a long sentence is determined by the number of times the character and its preceding N characters occur together in the database during forward propagation, and the number of times the character and its following N characters occur together in the database during backward propagation, N being a natural number greater than 1;
a matching module for obtaining, from the BHMM and the Pinyin codes continuously received from the client, the long sentence formed by the characters with the highest occurrence probability in the BHMM; and
a return module for returning the long sentence obtained by the matching module to the client;
the client comprising:
a receiving module for continuously receiving the Pinyin codes input by the user;
a sending module for sending the continuously received Pinyin codes to the cloud server; and
an output module for outputting the long sentence returned by the cloud server.
7. The continuous Pinyin long-sentence input system of claim 6, characterized in that the long sentence formed by the characters with the highest occurrence probability in the BHMM is obtained according to the following formula:
$$w_1, w_2, \ldots, w_L = \operatorname{ArgMax} \prod_{i=1}^{L} \bigl( P(w_i \mid w_{i-1}, w_{i-2}, \ldots, w_{i-N}) + P(w'_i \mid w'_{i+1}, w'_{i+2}, \ldots, w'_{i+N}) \bigr)$$
where $w_1, w_2, \ldots, w_L$ are the characters of the obtained long sentence and $L$ is a natural number greater than 1; $P(w_i \mid w_{i-1}, w_{i-2}, \ldots, w_{i-N})$ is the probability, during forward propagation, that character $w_i$ occurs given its preceding $N$ characters $w_{i-1}, w_{i-2}, \ldots, w_{i-N}$; and $P(w'_i \mid w'_{i+1}, w'_{i+2}, \ldots, w'_{i+N})$ is the probability, during backward propagation, that character $w'_i$ occurs given its following $N$ characters $w'_{i+1}, w'_{i+2}, \ldots, w'_{i+N}$.
8. The continuous Pinyin long-sentence input system of claim 7, characterized in that N is 3.
CN201610029530.8A 2016-01-16 2016-01-16 Pinyin long sentence continuous type-in input method and Pinyin long sentence continuous type-in input system Pending CN105718070A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610029530.8A CN105718070A (en) 2016-01-16 2016-01-16 Pinyin long sentence continuous type-in input method and Pinyin long sentence continuous type-in input system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610029530.8A CN105718070A (en) 2016-01-16 2016-01-16 Pinyin long sentence continuous type-in input method and Pinyin long sentence continuous type-in input system

Publications (1)

Publication Number Publication Date
CN105718070A true CN105718070A (en) 2016-06-29

Family

ID=56147890

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610029530.8A Pending CN105718070A (en) 2016-01-16 2016-01-16 Pinyin long sentence continuous type-in input method and Pinyin long sentence continuous type-in input system

Country Status (1)

Country Link
CN (1) CN105718070A (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1110802A (en) * 1994-04-18 1995-10-25 北京超凡电子科技有限公司 Intelligence phonetic letter input method
WO1997044758A1 (en) * 1996-05-23 1997-11-27 Apple Computer, Inc. Methods and apparatuses for handwriting recognition
CN1182906A (en) * 1997-03-27 1998-05-27 罗仁 Intelligent phoneme-shape code input method and application thereof
CN102103416A (en) * 2009-12-17 2011-06-22 新浪网技术(中国)有限公司 Chinese character input method and device
CN102508829A (en) * 2011-11-24 2012-06-20 百度在线网络技术(北京)有限公司 Pinyin-input cloud result prediction methods, system and device
CN102915122A (en) * 2012-07-19 2013-02-06 上海交通大学 Intelligent mobile platform Pinyin (phonetic transcriptions of Chinese characters) input method based on language models
CN103870001A (en) * 2012-12-11 2014-06-18 百度国际科技(深圳)有限公司 Input method candidate item generating method and electronic device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
吴鹏 (Wu Peng): "基于Markov链的整句输入算法研究与实现" [Research and Implementation of a Whole-Sentence Input Algorithm Based on Markov Chains], 《中国优秀硕士学位论文全文数据库》 (China Master's Theses Full-text Database) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110673748A (en) * 2019-09-27 2020-01-10 北京百度网讯科技有限公司 Method and device for providing candidate long sentences in input method
CN110850998A (en) * 2019-11-04 2020-02-28 北京华宇信息技术有限公司 Intelligent word-forming calculation optimization method and device for Chinese input method
CN111144096A (en) * 2019-12-11 2020-05-12 心医国际数字医疗系统(大连)有限公司 HMM-based pinyin completion training method, completion model, completion method and completion input method
CN111144096B (en) * 2019-12-11 2023-09-29 心医国际数字医疗系统(大连)有限公司 Pinyin completion training method, completion model, completion method and completion input method based on HMM
CN113589946A (en) * 2020-04-30 2021-11-02 北京搜狗科技发展有限公司 Data processing method and device and electronic equipment
CN112527127A (en) * 2020-12-23 2021-03-19 北京百度网讯科技有限公司 Training method and device for input method long sentence prediction model, electronic equipment and medium
CN112527127B (en) * 2020-12-23 2022-01-28 北京百度网讯科技有限公司 Training method and device for input method long sentence prediction model, electronic equipment and medium

Similar Documents

Publication Publication Date Title
CN105718070A (en) Pinyin long sentence continuous type-in input method and Pinyin long sentence continuous type-in input system
CN116127045B (en) Training method for generating large language model and man-machine voice interaction method based on model
CN108647214B (en) Decoding method based on deep neural network translation model
CN1667699B (en) Generating large units of graphonemes with mutual information criterion for letter to sound conversion
WO2020182122A1 (en) Text matching model generation method and device
CN107506823B (en) Construction method of hybrid neural network model for dialog generation
CN100535890C (en) Representation of a deleted interpolation n-gram language model in ARPA standard format
CN108052512A A kind of image description generation method based on depth attention mechanism
CN110275939B (en) Method and device for determining conversation generation model, storage medium and electronic equipment
CN112487173B (en) Man-machine conversation method, device and storage medium
CN107346340A (en) A kind of user view recognition methods and system
CN116244416A (en) Training method for generating large language model and man-machine voice interaction method based on model
CN110457661B (en) Natural language generation method, device, equipment and storage medium
KR20210154705A (en) Method, apparatus, device and storage medium for matching semantics
CN116932723B (en) Man-machine interaction system and method based on natural language processing
CN116127020A (en) Method for training generated large language model and searching method based on model
CN113901907A (en) Image-text matching model training method, image-text matching method and device
CN116127046A (en) Training method for generating large language model and man-machine voice interaction method based on model
CN111709244A (en) Deep learning method for identifying causal relationship of contradictory dispute events
KR20220034070A (en) Model training method and apparatus, font library establishment method and apparatus, and storage medium
CN113326367B (en) Task type dialogue method and system based on end-to-end text generation
CN112270181B (en) Sequence labeling method, system, computer readable storage medium and computer device
KR20220034080A (en) Training method for circulary generating network model, method and apparatus for establishing word library, electronic device, recording medium and computer program
CN109992785B (en) Content calculation method, device and equipment based on machine learning
CN110297895B (en) Dialogue method and system based on free text knowledge

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160629
