CN103366742B

CN103366742B - Pronunciation inputting method and system

Info

Publication number: CN103366742B
Application number: CN201210101302.9A
Authority: CN
Inventors: 李曜; 许东星
Original assignee: SHANGHAI GUOKE ELECTRONIC CO Ltd
Current assignee: SHANGHAI GEAK ELECTRONICS Co.,Ltd.
Priority date: 2012-03-31
Filing date: 2012-03-31
Publication date: 2018-07-31
Anticipated expiration: 2032-03-31
Also published as: CN103366742A

Abstract

The present invention relates to a voice input method and system. The method includes: while recording, continuously segmenting the input speech into speech segments and generating the text of each speech segment; and displaying the text of each speech segment in turn and correcting the text of each speech segment in turn according to the user's selection. The invention can automatically segment the speech recognition result and return it segment by segment for secondary confirmation by the user, so that the user can modify and confirm the returned text while recording.

Description

Voice input method and system

Technical field

The invention belongs to the field of speech recognition, and in particular relates to a voice input method and system.

Background art

With advances in speech recognition technology and the rise of cloud computing, it has become a trend on mobile terminals to take speech as input, have a cloud server transcribe the speech into text, and return the text to the mobile terminal. Because of the size constraints of mobile terminals, entering text directly through a physical or virtual keyboard has never been fully satisfactory, and voice input can be expected to replace key input in more and more situations.

However, the fact that speech recognition accuracy rarely reaches 100% hinders voice input from fully replacing key input. In practice, because real-world pronunciation varies widely under different conditions, the accuracy of speech recognition can never reach 100%; in noisy environments in particular, the recognition result will inevitably contain various errors. In other words, a speech recognition result always requires a secondary confirmation step. The existing voice input scheme works as follows. After the record button is pressed, the mobile terminal pops up an interface indicating that recording is in progress, as shown in Fig. 1, and the user starts speaking. After the user finishes, the recognized text is displayed in a text input box 21 on the interface shown in Fig. 2. If the text in the text input box 21 contains recognition errors, the user calls up the keyboard 22 to modify the text and then confirms and saves it. In this scheme, however, the user cannot edit the recognition result during recording. Only after the whole utterance has been input in one pass can the user correct the errors in the returned text one by one and confirm and save it; the confirmed text is then used in subsequent applications such as sending text messages, sending e-mail, or keeping notes. This confirmation process is therefore rather tedious and not user-friendly.

Summary of the invention

The purpose of the present invention is to provide a voice input method and system that can automatically segment and recognize the input speech, so that the user can correct the text recognized segment by segment while recording.

To solve the above problem, the present invention provides a voice input method, including:

While recording, continuously segmenting the input speech into speech segments and generating the text of each speech segment; and

Displaying the text of each speech segment in turn, and correcting the text of each speech segment in turn according to the user's selection.

Further, in the above method, a cloud server continuously segments the input speech into speech segments and generates the text of each speech segment.

Further, in the above method, the input speech is continuously segmented into speech segments by a voice activity detection algorithm.
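As an illustration only, the following minimal Python sketch shows one simple way such segmentation could work, using a short-time energy threshold to find the start and end points of speech. The patent does not specify the detection algorithm; the function name `segment_by_energy`, the frame length, the energy threshold and the minimum pause length are all assumptions made for this example.

```python
from typing import List, Tuple


def segment_by_energy(
    samples: List[float],
    frame_len: int = 4,              # hypothetical frame size, in samples
    energy_threshold: float = 0.01,  # hypothetical speech/non-speech boundary
    min_silence_frames: int = 2,     # hypothetical pause length that ends a segment
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of speech segments (end exclusive)."""
    segments: List[Tuple[int, int]] = []
    start = None          # first sample of the segment being built, if any
    last_voiced_end = 0   # index just past the most recent voiced frame
    silent_run = 0        # consecutive silent frames seen inside a segment

    for pos in range(0, len(samples), frame_len):
        frame = samples[pos:pos + frame_len]
        energy = sum(x * x for x in frame) / len(frame)
        if energy >= energy_threshold:
            if start is None:
                start = pos            # start point of speech
            last_voiced_end = pos + len(frame)
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silence_frames:
                # The pause is long enough: close the segment at the end point.
                segments.append((start, last_voiced_end))
                start = None
                silent_run = 0

    if start is not None:              # speech ran until the end of the input
        segments.append((start, last_voiced_end))
    return segments
```

In a streaming setting, each closed segment could be sent for recognition as soon as its end point is found, which is what allows text to be returned while recording continues.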

Further, in the above method, the step of correcting the text of each speech segment in turn according to the user's selection includes:

The user selecting the content to be corrected in the text of each speech segment;

Generating candidate words corresponding to each word in the content, the syllable of each word in the content, and candidate syllables corresponding to each word in the content;

Correcting the text in the speech segment according to the candidate word, the syllable, or the candidate syllable selected by the user.

Further, in the above method, the step of correcting the text in the speech segment according to the candidate word, the syllable, or the candidate syllable selected by the user includes:

When the user selects a candidate word, replacing the corresponding word in the content with the selected candidate word;

When the user selects the syllable, generating candidate words corresponding to the syllable, and replacing the corresponding word in the content with the correct candidate word selected from the candidate words of the syllable;

When the user selects a candidate syllable, generating candidate words corresponding to the candidate syllable, and replacing the corresponding word in the content with the correct candidate word selected from the candidate words of the candidate syllable;

When neither the generated candidate words nor the candidate syllables contain a correct result, an input method may be invoked to modify the text.

Further, in the above method, before the step of continuously segmenting the input speech into speech segments and generating the text of each speech segment while recording, the method further includes: monitoring noise in the recording environment during recording to obtain a signal-to-noise ratio.

Further, in the above method, the step of generating candidate words corresponding to each word in the content, the syllable of each word in the content, and candidate syllables corresponding to each word in the content includes:

When the signal-to-noise ratio is greater than a preset threshold, reducing the number of candidate words and candidate syllables;

When the signal-to-noise ratio is less than the preset threshold, increasing the number of candidate words and candidate syllables.
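The threshold rule above can be sketched as a small function. This is only an illustration: the patent gives no concrete numbers, so the function name `candidate_count`, the 15 dB threshold, the base count of 5 and the step of 2 are hypothetical values chosen for the example.

```python
def candidate_count(
    snr_db: float,
    threshold_db: float = 15.0,  # hypothetical preset threshold
    base: int = 5,               # hypothetical default number of candidates
    step: int = 2,               # hypothetical adjustment size
) -> int:
    """Return how many candidate words/syllables to offer for a given SNR."""
    if snr_db > threshold_db:
        # Clean speech: recognition is likely accurate, so offer fewer candidates.
        return max(1, base - step)
    if snr_db < threshold_db:
        # Noisy speech: errors are more likely, so offer more candidates.
        return base + step
    return base
```

For example, under these assumed values a 25 dB recording would be offered 3 candidates per word and a 5 dB recording 7.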

According to another aspect of the present invention, a voice input system is provided, including:

A segmentation module, configured to continuously segment the input speech into speech segments and generate the text of each speech segment while recording; and

A correction module, configured to display the text of each speech segment in turn and to correct the text of each speech segment in turn according to the user's selection.

Further, in the above system, the segmentation module is located on a cloud server.

Further, in the above system, the segmentation module continuously segments the input speech into speech segments by a voice activity detection algorithm.

Further, in the above system, the correction module includes:

A selection unit, configured to obtain the content that the user selects for correction in the text of each speech segment;

A candidate unit, configured to generate candidate words corresponding to each word in the content, the syllable of each word in the content, and candidate syllables corresponding to each word in the content;

A correction unit, configured to correct the text in the speech segment according to the candidate word, the syllable, or the candidate syllable selected by the user.

Further, in the above system, the correction unit is configured to: when the user selects a candidate word, replace the corresponding word in the content with the selected candidate word; when the user selects the syllable, generate candidate words corresponding to the syllable and replace the corresponding word in the content with the correct candidate word selected from the candidate words of the syllable; when the user selects a candidate syllable, generate candidate words corresponding to the candidate syllable and replace the corresponding word in the content with the correct candidate word selected from the candidate words of the candidate syllable; and when neither the generated candidate words nor the candidate syllables contain a correct result, invoke an input method to modify the text.

Further, the above system also includes a noise monitoring unit, configured to monitor noise in the recording environment during recording to obtain a signal-to-noise ratio.

Further, in the above system, the candidate unit is configured to reduce the number of candidate words and candidate syllables when the signal-to-noise ratio is greater than a preset threshold, and to increase the number of candidate words and candidate syllables when the signal-to-noise ratio is less than the preset threshold.

Compared with the prior art, the present invention continuously segments the input speech into speech segments and generates the text of each speech segment while recording, displays the text of each speech segment in turn, and corrects the text of each speech segment in turn according to the user's selection. It can therefore automatically segment the speech recognition result and return it segment by segment for secondary confirmation by the user, and the user can modify and confirm the returned text while recording.

In addition, the user selects the content to be corrected in the text of each speech segment; candidate words corresponding to each word in the content, the syllable of each word in the content, and candidate syllables corresponding to each word in the content are then generated; and the text in the speech segment is corrected according to the candidate word, the syllable, or the candidate syllable selected by the user. This makes it easy for the user to quickly select the correct word to correct the content of the text.

In addition, a signal-to-noise ratio is obtained by monitoring noise in the recording environment during recording. When the signal-to-noise ratio is greater than a preset threshold, the number of candidate words and candidate syllables is reduced; when the signal-to-noise ratio is less than the preset threshold, the number of candidate words and candidate syllables is increased. The number of candidate results can thus be adjusted according to different signal-to-noise ratios.

Description of the drawings

Fig. 1 is a schematic diagram of the recording interface of the existing voice input scheme;

Fig. 2 is a schematic diagram of the recognized-text display and modification interface of the existing voice input scheme;

Fig. 3 is a flowchart of the voice input method of Embodiment one of the present invention;

Fig. 4 is a schematic diagram of the recording, recognized-text display and modification interface of Embodiment one of the present invention;

Fig. 5 is a schematic diagram of the interface of Embodiment one of the present invention on which recognized text is displayed and modified in turn;

Fig. 6 is a flowchart of the voice input method of Embodiment two of the present invention;

Fig. 7 is a schematic diagram of the noise monitoring interface of Embodiment two of the present invention;

Fig. 8 is a functional block diagram of the voice input system of Embodiment three of the present invention.

Detailed description

To make the above objects, features and advantages of the present invention clearer and easier to understand, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

Embodiment one

As shown in Figs. 3 to 5, the present invention provides a voice input method, including:

Step S11: while recording, continuously segment the input speech into speech segments and generate the text of each speech segment. Specifically, the present invention can automatically segment the speech recognition result and return it segment by segment for secondary confirmation by the user. A cloud server may continuously segment the input speech into speech segments and generate the text of each speech segment, and the segmentation may be performed by a voice activity detection (endpoint detection) algorithm. Endpoint detection accurately determines the start point and end point of speech in a signal that contains speech, thereby distinguishing speech from non-speech signals; it is an important aspect of speech processing technology. For example, when the user inputs speech continuously, the cloud server may use an endpoint detection algorithm to cut the valid speech into segments according to the rhythm of the user's pauses, convert the segments into text one after another, and return the text to the display interface of the mobile terminal shown in Fig. 4, which integrates the recording interface and the recognition-result display interface into a single interface;

Step S12: display the text of each speech segment in turn;

Step S13: correct the text of each speech segment in turn according to the user's selection. Specifically, in the present invention the user can modify and confirm the returned text while recording. It should be noted that in the interaction scheme of the present invention, not all text recognition results are displayed at once; only the text recognition result of the current segment is displayed on the interface shown in Fig. 5. After the user has modified and confirmed recognition result 1 of speech segment 1, the next recognition result 2 is displayed. The advantage of this display scheme is that a limited number of results are displayed in turn on a limited screen, so that the user can concentrate on the current recognition result, which improves the efficiency of modifying the text. This step may specifically include:

Step S131: the user selects the content to be corrected in the text of each speech segment. Specifically, when the user needs to modify some of the words in the text recognition result, the user can tap the specific word in the text recognition result;

Step S132: generate candidate words corresponding to each word in the content, the syllable of each word in the content, and candidate syllables corresponding to each word in the content. Specifically, when the user taps a specific word to be modified in the recognition result, several candidate words corresponding to that word may be popped up, together with the syllable of that word and several candidate syllables. This effectively combines the speech recognition result with an input method and offers the user multiple candidates to choose from. It also lets the recognition result fall back from words to syllables, which widens the range of hits, so that the user does not need to type a string of letters but can find the required word among the candidates;

Step S133: correct the text in the speech segment according to the candidate word, the syllable, or the candidate syllable selected by the user. Specifically, when the user modifies and confirms the returned recognition result, two commands, "Cancel" and "Confirm", may be provided as shown in Fig. 5, for quickly deleting and saving the current text recognition result respectively. This step may further include:

Step S1331: when the user selects a candidate word, replace the corresponding word in the content with the selected candidate word. Specifically, if the correct word is among the candidate words, the user can tap that candidate word directly to replace the misrecognized word;

Step S1332: when the user selects the syllable, generate candidate words corresponding to the syllable, and replace the corresponding word in the content with the correct candidate word selected from the candidate words of the syllable. Specifically, if the correct word is not among the candidate words, the user can tap the correct syllable and then select the intended word from the candidate words offered for that syllable;

Step S1333: when the user selects a candidate syllable, generate candidate words corresponding to the candidate syllable, and replace the corresponding word in the content with the correct candidate word selected from the candidate words of the candidate syllable. Specifically, if the correct word is not among the candidate words of the recognized syllable, the user can tap a candidate syllable and then select the intended word from the candidate words offered for that candidate syllable;

Step S1334: when neither the generated candidate words nor the candidate syllables contain a correct result, an input method may be invoked to modify the text.
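The correction steps S1331 to S1334 can be sketched as follows. This is a hedged illustration, not the patented implementation: the toy `LEXICON`, the function names and the `kind` labels are hypothetical, and a real system would take its candidate words and syllables from the recognizer and an input-method dictionary.

```python
from typing import Dict, List, Optional

# Hypothetical toy lexicon mapping a pinyin syllable to candidate characters.
LEXICON: Dict[str, List[str]] = {
    "shi": ["是", "事", "市"],
    "si": ["四", "思"],
}


def candidates_for_syllable(syllable: str) -> List[str]:
    """Candidate words offered once the user taps a syllable."""
    return LEXICON.get(syllable, [])


def replace_word(text: str, index: int, new_word: str) -> str:
    """Replace the single character at `index` with `new_word`."""
    return text[:index] + new_word + text[index + 1:]


def apply_correction(text: str, index: int, kind: str, value: str,
                     picked: Optional[str] = None) -> str:
    """Apply one user choice to the word at `index`.

    kind == "word":               value is a candidate word (S1331)
    kind == "syllable":           value is the recognized syllable (S1332)
    kind == "candidate_syllable": value is a candidate syllable (S1333)
    kind == "input_method":       value is text typed by the user (S1334)
    For the two syllable kinds, `picked` is the word the user then chooses.
    """
    if kind == "word":
        return replace_word(text, index, value)
    if kind in ("syllable", "candidate_syllable"):
        if picked in candidates_for_syllable(value):
            return replace_word(text, index, picked)
        return text  # nothing suitable offered; caller may fall back to S1334
    if kind == "input_method":
        return replace_word(text, index, value)
    raise ValueError(f"unknown correction kind: {kind}")
```

For instance, if "我是学生" was misrecognized as "我四学生", tapping the candidate syllable "shi" and then picking "是" restores the intended text.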

The present invention can display the recording interface and the returned-result interface on the interface of the mobile terminal at the same time, so that the user can see the returned text results while recording and can modify them in real time. That is, the user can record a continuous stretch of speech, modify and confirm the returned text results without stopping the recording, and then continue recording; the user can also record another person's speech while correcting and confirming the returned recognition results.
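The segment-by-segment display can be pictured as a queue: recognition results arrive while recording continues, but only the oldest unconfirmed result is shown. The sketch below is an assumption-laden illustration; the class name `SegmentReview` and its method names are hypothetical, and only the "Cancel" and "Confirm" commands come from the description.

```python
from collections import deque
from typing import Deque, List, Optional


class SegmentReview:
    """Holds recognized segment texts and exposes them one at a time."""

    def __init__(self) -> None:
        self._pending: Deque[str] = deque()  # results not yet reviewed
        self.confirmed: List[str] = []       # texts the user has saved

    def on_recognized(self, text: str) -> None:
        """Called whenever a new segment result is returned during recording."""
        self._pending.append(text)

    def current(self) -> Optional[str]:
        """The only result shown on screen, or None if nothing is waiting."""
        return self._pending[0] if self._pending else None

    def confirm(self, corrected: Optional[str] = None) -> None:
        """'Confirm': save the current result, optionally in corrected form."""
        original = self._pending.popleft()
        self.confirmed.append(original if corrected is None else corrected)

    def cancel(self) -> None:
        """'Cancel': delete the current result without saving it."""
        self._pending.popleft()
```

Because new results are appended independently of the user's review, recording does not need to stop while a segment is being corrected.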

Embodiment two

As shown in Figs. 6 and 7, the present invention provides another voice input method. The present embodiment differs from Embodiment one in that a step of monitoring noise in the recording environment during recording to obtain a signal-to-noise ratio is added, so that the number of candidate results can be adjusted according to different signal-to-noise ratios and the user can be prompted when the noise is too strong for voice input to be suitable. This embodiment may specifically include:

Step S21: during recording, monitor noise in the recording environment to obtain a signal-to-noise ratio. Specifically, this step can automatically detect the signal-to-noise ratio of the input speech and feed it back on the interactive interface, prompt the user when the noise is too strong for voice input, and allow the number of candidate results to be adjusted according to the signal-to-noise ratio in the subsequent step S242. Noise strongly affects speech recognition: when the noise in the recording environment is strong, the accuracy of speech recognition drops sharply and the number of words the user needs to modify increases greatly. A noise monitoring function may therefore be added in this embodiment. Based on the endpoint detection result, the speech-segment energy and the silence-segment energy corresponding to each recognition result are calculated separately (the silence-segment energy is equivalent to the noise energy) to estimate the signal-to-noise ratio of that stretch of speech, and the degree of environmental noise pollution during recording is displayed on an interface with a recording volume bar 71 and a noise ratio bar 72, as shown in Fig. 7. When the environmental noise exceeds a certain threshold, the user may be prompted: "The current noise is too high; keyboard input is recommended";

Step S22: while recording, continuously segment the input speech into speech segments and generate the text of each speech segment. Specifically, a cloud server continuously segments the input speech into speech segments and generates the text of each speech segment, and the segmentation is performed by a voice activity detection algorithm;

Step S23: display the text of each speech segment in turn;

Step S24: correct the text of each speech segment in turn according to the user's selection. This step may specifically include:

Step S241: the user selects the content to be corrected in the text of each speech segment;

Step S242: generate candidate words corresponding to each word in the content, the syllable of each word in the content, and candidate syllables corresponding to each word in the content, which makes it easy for the user to quickly select the correct word to correct the content of the text. This step may further include:

Step S2421: when the signal-to-noise ratio is greater than a preset threshold, reduce the number of candidate words and candidate syllables. Specifically, a high signal-to-noise ratio indicates that the speech is only slightly polluted by noise and that the recognition result is highly accurate, so the number of candidate results can be reduced appropriately;

Step S2422: when the signal-to-noise ratio is less than the preset threshold, increase the number of candidate words and candidate syllables. Specifically, a low signal-to-noise ratio indicates that the speech is heavily polluted by noise and that errors in the recognition result are much more likely, so the number of candidate results needs to be increased so that the user can select the correct word from them;

Step S243: correct the text in the speech segment according to the candidate word, the syllable, or the candidate syllable selected by the user. This step may further include:

Step S2431: when the user selects a candidate word, replace the corresponding word in the content with the selected candidate word;

Step S2432: when the user selects the syllable, generate candidate words corresponding to the syllable, and replace the corresponding word in the content with the correct candidate word selected from the candidate words of the syllable;

Step S2433: when the user selects a candidate syllable, generate candidate words corresponding to the candidate syllable, and replace the corresponding word in the content with the correct candidate word selected from the candidate words of the candidate syllable;

Step S2434: when neither the generated candidate words nor the candidate syllables contain a correct result, an input method may be invoked to modify the text.

In the present embodiment, multiple speech technologies such as noise monitoring, endpoint detection and continuous speech recognition are integrated into the framework of a single interactive process, so that the user can fully experience the convenience of voice input, which improves the user experience when voice input and key input are used together.
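The signal-to-noise estimate of step S21 can be sketched from the two energies it mentions. The description does not give a formula, so the sketch assumes the common decibel ratio of mean speech-segment energy to mean silence-segment energy; the function names and the 5 dB prompt threshold are hypothetical.

```python
import math
from typing import Optional, Sequence


def mean_energy(samples: Sequence[float]) -> float:
    """Average squared amplitude of a stretch of audio."""
    return sum(x * x for x in samples) / len(samples)


def estimate_snr_db(speech: Sequence[float], silence: Sequence[float]) -> float:
    """Estimate SNR in dB, treating silence-segment energy as noise energy."""
    noise = mean_energy(silence)
    signal = mean_energy(speech)
    if noise == 0.0:
        return math.inf    # no measurable noise
    if signal == 0.0:
        return -math.inf   # no measurable speech
    return 10.0 * math.log10(signal / noise)


def noise_prompt(snr_db: float, min_snr_db: float = 5.0) -> Optional[str]:
    """Return the keyboard suggestion when the SNR falls below an assumed limit."""
    if snr_db < min_snr_db:
        return "The current noise is too high; keyboard input is recommended"
    return None
```

The speech and silence samples would come from the start and end points found by endpoint detection, so no extra pass over the audio is needed.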

Embodiment three

As shown in Fig. 8, the present invention further provides a voice input system, including a segmentation module 41, a correction module 42 and a noise monitoring unit 43.

The segmentation module 41 is configured to continuously segment the input speech into speech segments and generate the text of each speech segment while recording. Specifically, the segmentation module 41 is located on a cloud server and continuously segments the input speech into speech segments by a voice activity detection (endpoint detection) algorithm. This module can automatically segment the speech recognition result and return it segment by segment for secondary confirmation by the user.

The correction module 42 is configured to display the text of each speech segment in turn and to correct the text of each speech segment in turn according to the user's selection. Specifically, this module enables the user to modify and confirm the returned text while recording. It should be noted that in the interaction scheme of the present invention, not all text recognition results are displayed at once; only the text recognition result of the current segment is displayed on the interface. After the user has modified and confirmed the text recognition result of that speech segment, the next recognition result is displayed. The advantage of this display scheme is that a limited number of results are displayed in turn on a limited screen, so that the user can concentrate on the current recognition result, which improves the efficiency of modifying the text. The correction module 42 may further include a selection unit 421, a candidate unit 422 and a correction unit 423.

The selection unit 421 is configured to obtain the content that the user selects for correction in the text of each speech segment.

The candidate unit 422 is configured to generate candidate words corresponding to each word in the content, the syllable of each word in the content, and candidate syllables corresponding to each word in the content. Specifically, when the user taps a specific word to be modified in the recognition result, several candidate words corresponding to that word may be popped up, together with the syllable of that word and several candidate syllables. This effectively combines the speech recognition result with an input method and offers the user multiple candidates to choose from. It also lets the recognition result fall back from words to syllables, which widens the range of hits, so that the user does not need to type a string of letters but can find the required word among the candidates. In addition, the candidate unit 422 may be further configured to reduce the number of candidate words and candidate syllables when the signal-to-noise ratio is greater than a preset threshold: a high signal-to-noise ratio indicates that the speech is only slightly polluted by noise and that the recognition result is highly accurate, so the number of candidate results can be reduced appropriately. When the signal-to-noise ratio is less than the preset threshold, the candidate unit 422 increases the number of candidate words and candidate syllables: a low signal-to-noise ratio indicates that the speech is heavily polluted by noise and that errors in the recognition result are much more likely, so the number of candidate results needs to be increased so that the user can select the correct word from them.

The correction unit 423 is configured to correct the text in the speech segment according to the candidate word, the syllable, or the candidate syllable selected by the user. Specifically, the correction unit 423 is configured to: when the user selects a candidate word, replace the corresponding word in the content with the selected candidate word; when the user selects the syllable, generate candidate words corresponding to the syllable and replace the corresponding word in the content with the correct candidate word selected from the candidate words of the syllable; when the user selects a candidate syllable, generate candidate words corresponding to the candidate syllable and replace the corresponding word in the content with the correct candidate word selected from the candidate words of the candidate syllable; and when neither the generated candidate words nor the candidate syllables contain a correct result, invoke an input method to modify the text.

The noise monitoring unit 43 is configured to monitor noise in the recording environment during recording to obtain a signal-to-noise ratio, so that the number of candidate results can be adjusted according to different signal-to-noise ratios and the user can be prompted when the noise is too strong for voice input to be suitable.
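How the three parts of Fig. 8 might be wired together is sketched below. This shows structure only and rests on assumptions: the class and method names are hypothetical, the actual segmentation, recognition and SNR estimation are passed in as plain callables, and the threshold and counts are illustrative values. The class `SegmentationModule` stands in for the segmentation (cutting) module 41.

```python
from typing import Callable, List, Sequence

Audio = Sequence[float]


class SegmentationModule:
    """Stands in for module 41: splits audio and produces one text per segment."""

    def __init__(self, split: Callable[[Audio], List[Audio]],
                 transcribe: Callable[[Audio], str]) -> None:
        self._split = split            # e.g. an endpoint detector
        self._transcribe = transcribe  # e.g. a cloud recognizer

    def texts(self, audio: Audio) -> List[str]:
        return [self._transcribe(segment) for segment in self._split(audio)]


class NoiseMonitoringUnit:
    """Stands in for unit 43: reports the SNR of the recording environment."""

    def __init__(self, estimate_snr_db: Callable[[Audio], float]) -> None:
        self._estimate = estimate_snr_db

    def snr_db(self, audio: Audio) -> float:
        return self._estimate(audio)


class CorrectionModule:
    """Stands in for module 42: reviews segment texts one after another."""

    def __init__(self, threshold_db: float = 15.0, base: int = 5,
                 step: int = 2) -> None:
        self.threshold_db = threshold_db  # hypothetical preset threshold
        self.base = base                  # hypothetical default candidate count
        self.step = step                  # hypothetical adjustment size

    def candidate_limit(self, snr_db: float) -> int:
        if snr_db > self.threshold_db:
            return max(1, self.base - self.step)
        if snr_db < self.threshold_db:
            return self.base + self.step
        return self.base

    def review(self, texts: List[str], correct_one: Callable[[str, int], str],
               snr_db: float) -> List[str]:
        """Pass each text, in order, to the user-driven correction callback."""
        limit = self.candidate_limit(snr_db)
        return [correct_one(text, limit) for text in texts]
```

Passing the callables in keeps the sketch independent of any particular recognizer or user interface.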

The present invention continuously segments the input speech into speech segments and generates the text of each speech segment while recording, displays the text of each speech segment in turn, and corrects the text of each speech segment in turn according to the user's selection. It can therefore automatically segment the speech recognition result and return it segment by segment for secondary confirmation by the user, and the user can modify and confirm the returned text while recording.

The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another. Because the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.

Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms of their functions. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementations should not be considered to go beyond the scope of the present invention.

Obviously, those skilled in the art can make various modifications and variations to the invention without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.

Claims

1. A voice input method, characterized by including:

While recording, continuously segmenting the input speech into speech segments and generating the text of each speech segment;

And displaying the text of each speech segment in turn, and correcting the text of each speech segment in turn according to the user's selection, including:

The user selecting the content to be corrected in the text of each speech segment;

Generating candidate words corresponding to each word in the content, the syllable of each word in the content, and candidate syllables corresponding to each word in the content;

Correcting the text in the speech segment according to the candidate word, the syllable, or the candidate syllable selected by the user.

2. The voice input method according to claim 1, characterized in that a cloud server continuously segments the input speech into speech segments and generates the text of each speech segment.

3. The voice input method according to claim 1, characterized in that the input speech is continuously segmented into speech segments by a voice activity detection algorithm.

4. The voice input method according to claim 1, characterized in that the step of correcting the text in the speech segment according to the candidate word, the syllable, or the candidate syllable selected by the user includes:

When the user selects the syllable, generating candidate words corresponding to the syllable, and replacing the corresponding word in the content with the correct candidate word selected from the candidate words of the syllable;

When the user selects a candidate syllable, generating candidate words corresponding to the candidate syllable, and replacing the corresponding word in the content with the correct candidate word selected from the candidate words of the candidate syllable;

When neither the generated candidate words nor the candidate syllables contain a correct result, an input method may be invoked to modify the text.

5. The voice input method according to claim 4, characterized in that, before the step of continuously segmenting the input speech into speech segments and generating the text of each speech segment while recording, the method further includes:

Monitoring noise in the recording environment during recording to obtain a signal-to-noise ratio.

6. The voice input method according to claim 5, characterized in that the step of generating candidate words corresponding to each word in the content, the syllable of each word in the content, and candidate syllables corresponding to each word in the content includes:

When the signal-to-noise ratio is greater than a preset threshold, reducing the number of candidate words and candidate syllables;

When the signal-to-noise ratio is less than the preset threshold, increasing the number of candidate words and candidate syllables.

7. A voice input system, characterized by including:

A segmentation module, configured to continuously segment the input speech into speech segments and generate the text of each speech segment while recording;

And a correction module, configured to display the text of each speech segment in turn and to correct the text of each speech segment in turn according to the user's selection, wherein

The correction module includes:

A candidate unit, configured to generate candidate words corresponding to each word in the content, the syllable of each word in the content, and candidate syllables corresponding to each word in the content;

A correction unit, configured to correct the text in the speech segment according to the candidate word, the syllable, or the candidate syllable selected by the user.

8. The voice input system according to claim 7, characterized in that the segmentation module is located on a cloud server.

9. The voice input system according to claim 7, characterized in that the segmentation module continuously segments the input speech into speech segments by a voice activity detection algorithm.

10. The voice input system according to claim 7, characterized in that

The correction unit is configured to, when the user selects a candidate word, replace the corresponding word in the content with the selected candidate word;

11. The voice input system according to claim 10, characterized by further including a noise monitoring unit, configured to monitor noise in the recording environment during recording to obtain a signal-to-noise ratio.

12. The voice input system according to claim 11, characterized in that

The candidate unit is configured to, when the signal-to-noise ratio is greater than a preset threshold, reduce the number of candidate words and candidate syllables;