Voice input method and system
Technical field
The invention belongs to the field of speech recognition, and more particularly relates to a voice input method and system.
Background art
With advances in speech recognition technology and the rise of cloud computing, it has become a trend to input speech on a mobile terminal, have a cloud server transcribe the speech into text, and return the text to the mobile terminal. Because of the size limits of mobile terminals, entering text directly through a physical or virtual keyboard has never been fully satisfactory, so voice input can be expected to replace key input in more and more situations.
However, the fact that speech recognition accuracy can hardly reach 100% has hindered voice input from fully replacing key input. In fact, given the complexity of real pronunciation under the varied conditions of everyday life, the accuracy of speech recognition may never reach 100%. In noisy environments in particular, recognition results are bound to contain various errors, so the result of speech recognition necessarily requires a secondary confirmation step. An existing voice input scheme works as follows: after the record button is pressed, an interface indicating that recording is in progress pops up on the mobile terminal, as shown in Fig. 1, and the user starts to speak. When the user has finished, the recognized text is displayed in a text input box 21 on the interface shown in Fig. 2. If the text in the text input box 21 contains recognition errors, the user calls up the keyboard 22 to correct them and then confirms and saves the text. In this scheme, however, the user cannot edit the recognition result during recording. Only after all of the speech to be input at one time has been spoken can the user correct the errors in the returned text one by one and confirm and save it, after which the confirmed text is used by subsequent applications such as sending text messages, sending e-mail or keeping records. This confirmation process is therefore generally cumbersome and not user-friendly.
Summary of the invention
The object of the present invention is to provide a voice input method and system that can automatically segment and recognize input speech, so that the user can correct the text recognized for each segment while recording.
To solve the above problem, the present invention provides a voice input method, including:
while recording, continuously segmenting the input speech into speech segments and generating the text of each speech segment; and
displaying the text of each speech segment in turn, and correcting the text of each speech segment in turn according to the user's selection.
Further, in the above method, the input speech is continuously segmented into speech segments and the text of each speech segment is generated by a cloud server.
Further, in the above method, the input speech is continuously segmented into speech segments by a voice endpoint detection algorithm.
Further, in the above method, the step of correcting the text of each speech segment in turn according to the user's selection includes:
the user selecting the content to be corrected in the text of each speech segment;
generating candidate words corresponding to each word in the content, the syllable of each word in the content, and candidate syllables corresponding to each word in the content; and
correcting the text in the speech segment according to the candidate word, the syllable or the candidate syllable selected by the user.
Further, in the above method, the step of correcting the text in the speech segment according to the candidate word, the syllable or the candidate syllable selected by the user includes:
when the user selects a candidate word, replacing the corresponding word in the content with the selected candidate word;
when the user selects the syllable, generating candidate words corresponding to the syllable, and replacing the corresponding word in the content with the correct candidate word selected from the candidate words of the syllable;
when the user selects a candidate syllable, generating candidate words corresponding to the candidate syllable, and replacing the corresponding word in the content with the correct candidate word selected from the candidate words of the candidate syllable; and
when none of the generated candidate words and candidate syllables is correct, calling an input method to modify the text.
Further, in the above method, before the step of continuously segmenting the input speech into speech segments and generating the text of each speech segment while recording, the method further includes: performing noise monitoring on the recording environment during recording to obtain a signal-to-noise ratio.
Further, in the above method, the step of generating candidate words corresponding to each word in the content, the syllable of each word in the content, and candidate syllables corresponding to each word in the content includes:
when the signal-to-noise ratio is greater than a preset threshold, reducing the number of candidate words and candidate syllables; and
when the signal-to-noise ratio is less than the preset threshold, increasing the number of candidate words and candidate syllables.
According to another aspect of the present invention, a voice input system is provided, including:
a segmentation module, configured to continuously segment the input speech into speech segments and generate the text of each speech segment while recording; and
a correction module, configured to display the text of each speech segment in turn and to correct the text of each speech segment in turn according to the user's selection.
Further, in the above system, the segmentation module is located on a cloud server.
Further, in the above system, the segmentation module continuously segments the input speech into speech segments by a voice endpoint detection algorithm.
Further, in the above system, the correction module includes:
a selection unit, configured to obtain the content that the user selects for correction in the text of each speech segment;
a candidate unit, configured to generate candidate words corresponding to each word in the content, the syllable of each word in the content, and candidate syllables corresponding to each word in the content; and
an amending unit, configured to correct the text in the speech segment according to the candidate word, the syllable or the candidate syllable selected by the user.
Further, in the above system, the amending unit is configured to: when the user selects a candidate word, replace the corresponding word in the content with the selected candidate word; when the user selects the syllable, generate candidate words corresponding to the syllable, and replace the corresponding word in the content with the correct candidate word selected from the candidate words of the syllable; when the user selects a candidate syllable, generate candidate words corresponding to the candidate syllable, and replace the corresponding word in the content with the correct candidate word selected from the candidate words of the candidate syllable; and when none of the generated candidate words and candidate syllables is correct, call an input method to modify the text.
Further, the above system also includes a noise monitoring unit, configured to perform noise monitoring on the recording environment during recording to obtain a signal-to-noise ratio.
Further, in the above system, the candidate unit is configured to reduce the number of candidate words and candidate syllables when the signal-to-noise ratio is greater than a preset threshold, and to increase the number of candidate words and candidate syllables when the signal-to-noise ratio is less than the preset threshold.
Compared with the prior art, the present invention continuously segments the input speech into speech segments and generates the text of each speech segment while recording, displays the text of each speech segment in turn, and corrects the text of each speech segment in turn according to the user's selection. The speech recognition result can thus be segmented automatically and returned segment by segment for secondary confirmation, and the user can modify and confirm the returned text while recording.
In addition, the user selects the content to be corrected in the text of each speech segment; candidate words corresponding to each word in the content, the syllable of each word in the content, and candidate syllables corresponding to each word in the content are then generated; and the text in the speech segment is corrected according to the candidate word, the syllable or the candidate syllable selected by the user. This makes it easy for the user to quickly select the correct word to correct the content of the text.
In addition, noise monitoring is performed on the recording environment during recording to obtain a signal-to-noise ratio. When the signal-to-noise ratio is greater than a preset threshold, the number of candidate words and candidate syllables is reduced; when the signal-to-noise ratio is less than the preset threshold, the number of candidate words and candidate syllables is increased. The number of candidate results can thus be adjusted according to the signal-to-noise ratio.
Brief description of the drawings
Fig. 1 is a schematic diagram of the recording interface of an existing voice input scheme;
Fig. 2 is a schematic diagram of the recognized-text display and modification interface of the existing voice input scheme;
Fig. 3 is a flow chart of the voice input method of Embodiment 1 of the present invention;
Fig. 4 is a schematic diagram of the recording, recognized-text display and modification interface of Embodiment 1 of the present invention;
Fig. 5 is a schematic diagram of the interface of Embodiment 1 of the present invention for displaying and modifying the recognized text in turn;
Fig. 6 is a flow chart of the voice input method of Embodiment 2 of the present invention;
Fig. 7 is a schematic diagram of the noise monitoring interface of Embodiment 2 of the present invention;
Fig. 8 is a functional block diagram of the voice input system of Embodiment 3 of the present invention.
Detailed description of the embodiments
To make the above objects, features and advantages of the present invention clearer and easier to understand, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
Embodiment 1
As shown in Figs. 3 to 5, the present invention provides a voice input method, including:
Step S11: while recording, continuously segment the input speech into speech segments and generate the text of each speech segment. Specifically, the present invention can automatically segment the speech recognition result and return it segment by segment for secondary confirmation by the user. A cloud server can continuously segment the input speech into speech segments and generate the text of each speech segment, with the segmentation performed by a voice endpoint detection algorithm. Voice endpoint detection accurately determines the start point and end point of speech in a signal that contains speech, thereby distinguishing speech from non-speech signals; it is an important part of speech processing technology. For example, when the user inputs speech continuously, the cloud server can use an endpoint detection algorithm to cut the valid speech into segments according to the rhythm of the user's pauses, convert the segments into text in turn, and return the text to the display interface of the mobile terminal shown in Fig. 4, which integrates the recording interface and the recognition result display interface into a single interface;
Step S12: display the text of each speech segment in turn;
Step S13: correct the text of each speech segment in turn according to the user's selection. Specifically, in the present invention the user can modify and confirm the returned text while recording. It should be noted that, in the interaction scheme of the present invention, the text recognition results are not all displayed at once; only the text recognition result of the current segment is displayed on the interface shown in Fig. 5. After the user has corrected and confirmed recognition result 1 of speech segment 1, the next recognition result 2 is displayed. The advantage of this display scheme is that a limited number of results are shown in turn on a limited screen, so that the user can concentrate on the current recognition result, which improves the efficiency of modifying the text. This step may specifically include:
Step S131: the user selects the content to be corrected in the text of each speech segment. Specifically, when the user needs to modify some of the words in a text recognition result, the user can tap the specific word in the text recognition result;
Step S132: generate candidate words corresponding to each word in the content, the syllable of each word in the content, and candidate syllables corresponding to each word in the content. Specifically, when the user taps a specific word in the recognition result that needs to be modified, several candidate words corresponding to that word can be popped up, together with the syllable of that word and several candidate syllables. This effectively combines the speech recognition result with an input method: multiple candidates are offered for the user to choose from, and the recognition result is relaxed from the word level to the syllable level, which widens the range of hits, so that the user does not need to type a string of letters but can find the required word among the candidates;
Step S133: correct the text in the speech segment according to the candidate word, the syllable or the candidate syllable selected by the user. Specifically, when the user corrects and confirms a returned recognition result, two commands, "Cancel" and "Confirm", can be provided as shown in Fig. 5, for quickly deleting and saving the current text recognition result respectively. This step may further include:
Step S1331: when the user selects a candidate word, replace the corresponding word in the content with the selected candidate word. Specifically, if the correct word is among the candidate words, the user can tap that candidate word directly to replace the misrecognized word;
Step S1332: when the user selects the syllable, generate candidate words corresponding to the syllable, and replace the corresponding word in the content with the correct candidate word selected from the candidate words of the syllable. Specifically, if the correct word is not among the candidate words, the user can tap the correct syllable and then select the intended word from the candidate words provided for that syllable;
Step S1333: when the user selects a candidate syllable, generate candidate words corresponding to the candidate syllable, and replace the corresponding word in the content with the correct candidate word selected from the candidate words of the candidate syllable. Specifically, if the correct word is not among the candidate words corresponding to the syllable, the user can tap a candidate syllable and then select the intended word from the candidate words provided for that candidate syllable;
Step S1334: when none of the generated candidate words and candidate syllables is correct, an input method can be called to modify the text.
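The fallback order of steps S1331 to S1334 can be sketched as follows. This is only an illustrative sketch, not the implementation of the invention: the names WordCandidates, words_by_syllable and correct_word are hypothetical, the user's choice is simulated by passing in the intended word, and the sample characters and syllables are invented test data.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class WordCandidates:
    """Hypothetical candidate data generated for one recognized word."""
    word: str                        # the word as recognized
    syllable: str                    # syllable of the recognized word
    candidate_words: list[str]       # alternative words offered directly
    candidate_syllables: list[str]   # alternative syllables offered
    # Hypothetical lookup table: syllable -> words that share it.
    words_by_syllable: dict[str, list[str]] = field(default_factory=dict)


def correct_word(c: WordCandidates, intended: str) -> tuple[str, str]:
    """Return (replacement word, correction path used).

    The user's taps are simulated by searching for `intended` in the
    same order as steps S1331 to S1334.
    """
    # S1331: the correct word is among the candidate words.
    if intended in c.candidate_words:
        return intended, "candidate word"
    # S1332: the correct word is among the words of the recognized syllable.
    if intended in c.words_by_syllable.get(c.syllable, []):
        return intended, "syllable"
    # S1333: the correct word is among the words of a candidate syllable.
    for syllable in c.candidate_syllables:
        if intended in c.words_by_syllable.get(syllable, []):
            return intended, "candidate syllable"
    # S1334: nothing matched, so the user types the word with an input method.
    return intended, "input method"


example = WordCandidates(
    word="是",
    syllable="shi",
    candidate_words=["事", "市"],
    candidate_syllables=["si", "chi"],
    words_by_syllable={"shi": ["是", "事", "市", "试"], "si": ["四", "思"]},
)
print(correct_word(example, "试"))
```

Ordering the checks this way means the user only falls back to typing when every candidate list has been exhausted, which matches the stated aim of avoiding the input of a string of letters.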
The present invention can display the recording interface and the returned-result interface on the interface of the mobile terminal at the same time, so that the user can see the returned text results while recording and correct them in real time. That is, the user can speak a passage continuously, correct and confirm the returned text results without stopping the recording, and then continue recording. The user can also record another person's speech while correcting and confirming the returned recognition results.
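The one-segment-at-a-time display of steps S12 and S13, with the "Cancel" and "Confirm" commands, can be sketched as a simple queue. This is a minimal sketch under assumptions: the class name SegmentQueue and its methods are hypothetical, and the recognized texts are assumed to arrive from the recognizer while recording continues.

```python
from collections import deque


class SegmentQueue:
    """Hypothetical holder for recognized segment texts awaiting confirmation."""

    def __init__(self):
        self._pending = deque()   # texts returned but not yet handled
        self.confirmed = []       # texts the user has confirmed

    def push(self, text):
        """Called whenever the recognizer returns the text of a new segment."""
        self._pending.append(text)

    def current(self):
        """Only the oldest unhandled segment is shown on the screen."""
        return self._pending[0] if self._pending else None

    def confirm(self, corrected=None):
        """'Confirm': save the current text, optionally after correction."""
        text = self._pending.popleft()
        self.confirmed.append(text if corrected is None else corrected)

    def cancel(self):
        """'Cancel': delete the current text without saving it."""
        self._pending.popleft()


queue = SegmentQueue()
queue.push("recognition result 1")
queue.push("recognition result 2")   # arrives while result 1 is still shown
queue.confirm("corrected result 1")
print(queue.current())
```

Because new results are only appended to the queue, recording and recognition can continue in the background while the user works on the segment currently shown.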
Embodiment 2
As shown in Figs. 6 and 7, the present invention provides another voice input method. This embodiment differs from Embodiment 1 in that a step of performing noise monitoring on the recording environment during recording to obtain a signal-to-noise ratio is added, so that the number of candidate results can be adjusted according to the signal-to-noise ratio and the user can be prompted when the noise is too strong for voice input to be suitable. This embodiment may specifically include:
Step S21: perform noise monitoring on the recording environment during recording to obtain a signal-to-noise ratio. Specifically, this step can automatically detect the signal-to-noise ratio of the input speech and feed it back on the interactive interface, so that the user can be prompted when the noise is too strong for voice input, and the number of candidate results can be adjusted according to the signal-to-noise ratio in the subsequent step S242. Noise has a great influence on speech recognition: when the noise in the recording environment is strong, recognition accuracy drops sharply and the number of words the user needs to modify increases greatly. A noise monitoring function is therefore added in this embodiment. Based on the endpoint detection result, the speech-segment energy and the silent-segment energy corresponding to each recognition result are calculated separately (the silent-segment energy is taken as the noise energy) to estimate the signal-to-noise ratio of that segment of speech, and the degree of environmental noise pollution during recording is displayed on an interface with a recording volume bar 71 and a noise ratio bar 72, as shown in Fig. 7. When the environmental noise exceeds a certain threshold, the user can be prompted: "The current noise is too high; keyboard input is recommended";
Step S22: while recording, continuously segment the input speech into speech segments and generate the text of each speech segment. Specifically, a cloud server continuously segments the input speech into speech segments and generates the text of each speech segment, with the segmentation performed by a voice endpoint detection algorithm;
Step S23: display the text of each speech segment in turn;
Step S24: correct the text of each speech segment in turn according to the user's selection. This step may specifically include:
Step S241: the user selects the content to be corrected in the text of each speech segment;
Step S242: generate candidate words corresponding to each word in the content, the syllable of each word in the content, and candidate syllables corresponding to each word in the content, which makes it easy for the user to quickly select the correct word to correct the content of the text. This step may further include:
Step S2421: when the signal-to-noise ratio is greater than a preset threshold, reduce the number of candidate words and candidate syllables. Specifically, a high signal-to-noise ratio indicates that the speech is only slightly polluted by noise and the recognition result is highly accurate, so the number of candidate results can be reduced appropriately;
Step S2422: when the signal-to-noise ratio is less than the preset threshold, increase the number of candidate words and candidate syllables. Specifically, a low signal-to-noise ratio indicates that the speech is heavily polluted by noise, so errors in the recognition result are much more likely and the number of candidate results needs to be increased so that the user can select the correct word from among them;
Step S243: correct the text in the speech segment according to the candidate word, the syllable or the candidate syllable selected by the user. This step may further include:
Step S2431: when the user selects a candidate word, replace the corresponding word in the content with the selected candidate word;
Step S2432: when the user selects the syllable, generate candidate words corresponding to the syllable, and replace the corresponding word in the content with the correct candidate word selected from the candidate words of the syllable;
Step S2433: when the user selects a candidate syllable, generate candidate words corresponding to the candidate syllable, and replace the corresponding word in the content with the correct candidate word selected from the candidate words of the candidate syllable;
Step S2434: when none of the generated candidate words and candidate syllables is correct, an input method can be called to modify the text.
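The signal-to-noise estimate of step S21 can be sketched as follows, taking the silent-segment energy as the noise energy as described above. The text does not give a formula, so this sketch assumes the common decibel form 10·log10(P_speech / P_noise) computed from mean power per sample. The function names and the 5 dB prompt threshold are hypothetical, and prompting on a low signal-to-noise ratio is used here as a stand-in for "environmental noise exceeds a certain threshold".

```python
import math


def mean_power(samples):
    """Mean energy per sample of a list of amplitude values."""
    return sum(s * s for s in samples) / len(samples)


def estimate_snr_db(speech_samples, silence_samples):
    """Estimate the SNR in dB from a speech segment and a silent segment.

    The silent segment is treated as pure noise. A small floor avoids
    taking the logarithm of zero or dividing by zero.
    """
    floor = 1e-12
    p_speech = max(mean_power(speech_samples), floor)
    p_noise = max(mean_power(silence_samples), floor)
    return 10.0 * math.log10(p_speech / p_noise)


def noise_prompt(snr_db, min_snr_db=5.0):
    """Return a prompt when the SNR is below a hypothetical minimum."""
    if snr_db < min_snr_db:
        return "The current noise is too high; keyboard input is recommended"
    return None


# Speech amplitude ten times the noise amplitude gives roughly 20 dB.
snr = estimate_snr_db([0.1] * 100, [0.01] * 100)
print(round(snr, 1), noise_prompt(snr))
```

The speech-segment energy measured this way still contains the noise, so the result is an estimate; a more careful implementation might subtract the noise power first.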
This embodiment integrates multiple speech technologies, such as noise monitoring, endpoint detection and continuous speech recognition, into a single interaction flow, so that the user can fully experience the convenience of voice input, and the user experience is improved when voice input and key input are used together.
Embodiment 3
As shown in Fig. 8, the present invention also provides a voice input system, including a segmentation module 41, a correction module 42 and a noise monitoring unit 43.
The segmentation module 41 is configured to continuously segment the input speech into speech segments and generate the text of each speech segment while recording. Specifically, the segmentation module 41 is located on a cloud server and continuously segments the input speech into speech segments by a voice endpoint detection algorithm. This module can automatically segment the speech recognition result and return it segment by segment for secondary confirmation by the user.
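The text does not specify which endpoint detection algorithm the segmentation module 41 uses. As one illustration of cutting speech at the user's pauses, the sketch below uses a simple energy threshold over per-frame energies; the function name split_on_pauses and the parameters threshold and min_pause_frames are hypothetical, and a practical detector would be considerably more robust.

```python
def split_on_pauses(frame_energy, threshold, min_pause_frames):
    """Split a sequence of per-frame energies into speech segments.

    A frame counts as speech when its energy is at least `threshold`.
    A segment ends once `min_pause_frames` consecutive non-speech frames
    follow it. Returns (start, end) frame indices with `end` exclusive.
    """
    segments = []
    start = None      # first frame of the segment being built, if any
    silent_run = 0    # consecutive non-speech frames inside that segment
    for i, energy in enumerate(frame_energy):
        if energy >= threshold:
            if start is None:
                start = i            # start point of a new segment
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_pause_frames:
                # End point is the frame after the last speech frame.
                segments.append((start, i - silent_run + 1))
                start = None
                silent_run = 0
    if start is not None:
        # The input ended while a segment was still open.
        segments.append((start, len(frame_energy) - silent_run))
    return segments


# Two bursts of speech separated by a two-frame pause.
print(split_on_pauses([0, 5, 6, 0, 0, 7, 8, 0], threshold=1, min_pause_frames=2))
```

Requiring several consecutive quiet frames before closing a segment keeps short gaps inside a word from splitting it, so segments follow the rhythm of the user's pauses.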
The correction module 42 is configured to display the text of each speech segment in turn and to correct the text of each speech segment in turn according to the user's selection. Specifically, this module allows the user to modify and confirm the returned text while recording. It should be noted that, in the interaction scheme of the present invention, the text recognition results are not all displayed at once; only the text recognition result of the current segment is displayed on the interface. After the user has corrected and confirmed the text recognition result of that speech segment, the next recognition result is displayed. The advantage of this display scheme is that a limited number of results are shown in turn on a limited screen, so that the user can concentrate on the current recognition result, which improves the efficiency of modifying the text. The correction module 42 may further include a selection unit 421, a candidate unit 422 and an amending unit 423.
The selection unit 421 is configured to obtain the content that the user selects for correction in the text of each speech segment.
The candidate unit 422 is configured to generate candidate words corresponding to each word in the content, the syllable of each word in the content, and candidate syllables corresponding to each word in the content. Specifically, when the user taps a specific word in the recognition result that needs to be modified, several candidate words corresponding to that word can be popped up, together with the syllable of that word and several candidate syllables. This effectively combines the speech recognition result with an input method: multiple candidates are offered for the user to choose from, and the recognition result is relaxed from the word level to the syllable level, which widens the range of hits, so that the user does not need to type a string of letters but can find the required word among the candidates. In addition, the candidate unit 422 can also be configured to reduce the number of candidate words and candidate syllables when the signal-to-noise ratio is greater than a preset threshold: a high signal-to-noise ratio indicates that the speech is only slightly polluted by noise and the recognition result is highly accurate, so the number of candidate results can be reduced appropriately. When the signal-to-noise ratio is less than the preset threshold, the candidate unit 422 increases the number of candidate words and candidate syllables: a low signal-to-noise ratio indicates that the speech is heavily polluted by noise, so errors in the recognition result are much more likely and the number of candidate results needs to be increased so that the user can select the correct word from among them.
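The threshold rule applied by the candidate unit 422 can be sketched as follows. The text only states that the number of candidates is reduced above the preset threshold and increased below it; the concrete threshold of 15 dB, the base count of 5 and the step of 2 are hypothetical values chosen for illustration, as are the function names.

```python
def candidate_count(snr_db, threshold_db=15.0, base=5, step=2):
    """Number of candidates to offer for a given signal-to-noise ratio."""
    if snr_db > threshold_db:
        # Clean speech: recognition is likely accurate, so offer fewer.
        return max(1, base - step)
    if snr_db < threshold_db:
        # Noisy speech: errors are more likely, so offer more.
        return base + step
    return base


def top_candidates(ranked_candidates, snr_db):
    """Trim a ranked candidate list (best first) to the adjusted count."""
    return ranked_candidates[:candidate_count(snr_db)]


ranked = ["w1", "w2", "w3", "w4", "w5", "w6", "w7", "w8"]
print(top_candidates(ranked, 25.0))   # clean recording: shorter list
print(top_candidates(ranked, 5.0))    # noisy recording: longer list
```

The same rule could be applied separately to candidate words and candidate syllables, since the text adjusts both.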
The amending unit 423 is configured to correct the text in the speech segment according to the candidate word, the syllable or the candidate syllable selected by the user. Specifically, the amending unit 423 is configured to: when the user selects a candidate word, replace the corresponding word in the content with the selected candidate word; when the user selects the syllable, generate candidate words corresponding to the syllable, and replace the corresponding word in the content with the correct candidate word selected from the candidate words of the syllable; when the user selects a candidate syllable, generate candidate words corresponding to the candidate syllable, and replace the corresponding word in the content with the correct candidate word selected from the candidate words of the candidate syllable; and when none of the generated candidate words and candidate syllables is correct, call an input method to modify the text.
The noise monitoring unit 43 is configured to perform noise monitoring on the recording environment during recording to obtain a signal-to-noise ratio, so that the number of candidate results can be adjusted according to the signal-to-noise ratio and the user can be prompted when the noise is too strong for voice input to be suitable.
The present invention continuously segments the input speech into speech segments and generates the text of each speech segment while recording, displays the text of each speech segment in turn, and corrects the text of each speech segment in turn according to the user's selection. The speech recognition result can thus be segmented automatically and returned segment by segment for secondary confirmation, and the user can modify and confirm the returned text while recording.
In addition, the user selects the content to be corrected in the text of each speech segment; candidate words corresponding to each word in the content, the syllable of each word in the content, and candidate syllables corresponding to each word in the content are then generated; and the text in the speech segment is corrected according to the candidate word, the syllable or the candidate syllable selected by the user. This makes it easy for the user to quickly select the correct word to correct the content of the text.
In addition, noise monitoring is performed on the recording environment during recording to obtain a signal-to-noise ratio. When the signal-to-noise ratio is greater than a preset threshold, the number of candidate words and candidate syllables is reduced; when the signal-to-noise ratio is less than the preset threshold, the number of candidate words and candidate syllables is increased. The number of candidate results can thus be adjusted according to the signal-to-noise ratio.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for parts that are the same or similar the embodiments may be referred to one another. Because the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant details can be found in the description of the method.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms according to function. Whether these functions are implemented in hardware or software depends on the specific application and the design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each specific application, but such implementations should not be considered to go beyond the scope of the present invention.
Obviously, those skilled in the art can make various modifications and variations to the invention without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.