A speech endpoint detection method based on real-time decoding
Technical field
The present invention relates to a speech endpoint detection method based on decoding results, and in particular to a method that can feed back the speech end point in a timely manner.
Background technology
Speech endpoint detection determines the starting point and end point of speech, excluding silent segments from the speech signal. The correctness of endpoint detection has a great impact on the performance of speech recognition. In a speech evaluation system, the content of the user's recording is determined by the test text; returning the speech end point promptly once the user has finished reading the text, and stopping the computation at that moment, helps improve system performance and evaluation results. In external application systems, the quality of endpoint detection directly affects the user experience.
For example, in language-learning software, endpoint detection runs while the user's recording is evaluated; when the end of speech is detected, recording stops automatically, sparing the user the trouble of pressing a stop button, which greatly improves the experience for frequent users. In a voice control system such as a smart home, the user may control a lamp with commands such as "turn on the light" and "turn off the light". If endpoint detection is not sufficiently real-time, these commands respond too slowly and the experience is poor; if the lamp turns on just as the user finishes the final word of "turn on the light", the experience is excellent.
Existing endpoint detection methods fall into two classes: threshold methods and pattern recognition methods.
(1) Threshold methods
These methods extract certain features of speech, such as short-time energy, short-time average magnitude, or zero-crossing rate, compute their values, set a threshold according to actual conditions and experience, and decide by some strategy whether a frame is a speech start frame or end frame. Typical algorithms use short-time energy together with the short-time zero-crossing rate, or use cepstral features.
(2) Pattern recognition methods
These methods treat endpoint detection as classifying every frame of the signal: a detection criterion is established, each frame is classified, and it is judged whether the frame belongs to background noise or to speech. Methods in this class include endpoint detection based on the auto-correlation similarity distance and endpoint detection based on HMM models.
All of the above methods are independent of the text the user reads aloud.
When the text of the speech to be recognized is known, for example in an English or Chinese learning system, the text the user reads aloud is fixed, and the external application only cares about the portion of speech relevant to that text. Ideally, the moment the user finishes the last word of the specified text or command, the endpoint detection module should report the speech end position immediately.
Even when the user reads the specified text normally, existing endpoint detection techniques, which do not know or exploit the text being read, must wait for a subsequent stretch of non-speech data before making a decision, so the response time is long.
If the user finishes the specified text and then continues reading content unrelated to it, existing endpoint detection cannot distinguish this speech, which the system does not care about, and cannot give an appropriate speech end point.
In some application scenarios, the speech end point should be given and recording stopped only after the user has read the complete command or sentence. If the user reads half the text and then pauses for a while, existing endpoint detection may treat the pause as silence and give a premature end point, failing to satisfy this requirement.
Summary of the invention
The problem addressed by the present invention: to overcome the deficiencies of the prior art by providing a speech endpoint detection method based on real-time decoding, solving the problems that, when the recognition text is known, existing endpoint detection techniques lack real-time responsiveness and cannot perform targeted detection of the speech the user cares about.
The technical solution of the present invention: a speech endpoint detection method based on real-time decoding is an endpoint detection method that incorporates the text content. The implementation steps are as follows:
Step 1: input the text related to the speech to be recognized, and parse the text;
Step 2: build a decoding network from the text parsing result;
Step 3: input the speech, extract acoustic features from it, and decode the acoustic features against the decoding network built in step 2 to obtain a decoded linguistic-unit sequence; each unit in the sequence corresponds to one frame. The acoustic features here are a set of values describing the essential short-time characteristics of speech, normally a feature vector of fixed dimension (such as a 39-dimensional MFCC feature vector).
Step 4: perform endpoint judgement on the decoded unit sequence to determine whether an endpoint has been reached; endpoints are divided into the speech starting point and the speech end point. If the judgement finds the speech end point, feed the endpoint information back to the external application system; otherwise continue with step 3. Starting-point judgement in step 4 is optional: if the external application system does not care about the starting point, it need not be judged.
The speech starting point in step 4 is judged as follows:
(1.1) Get the optimal path in the decoder. The decoder is one of the cores of a speech recognition system; its task is, given the input acoustic features, to search for the linguistic-unit sequence that outputs the signal with maximum probability, according to the acoustic model and the decoding network. The decoding network, also called the grammar network, is one of the decoder's inputs; without it the decoder cannot work, and it defines the range of unit sequences the decoder can output.
(1.2) Starting-point early warning: judge from the optimal path in the decoder whether the current speech may have reached the starting point of the text; if so, go to step (1.3), otherwise exit.
(1.3) Early-warning confirmation: judge whether the speech contains a phoneme of the text or valid garbage speech, thereby confirming whether the starting point has really been reached. If so, the starting point is obtained; otherwise exit directly.
The speech end point in step 4 is judged as follows:
(2.1) Get the current optimal path in the decoder;
(2.2) End-point early warning: judge from the optimal path whether the last phoneme of the text may have been spoken; if so, go to step (2.3), otherwise exit;
(2.3) Early-warning confirmation: judge whether the last phoneme of the text was really spoken, deciding by indicators such as frame length and average per-frame likelihood. If it is judged to have really been spoken, the speech end point is obtained and the process ends; otherwise the process ends directly.
In some application scenarios the user may not read the whole text, yet the speech end point must still be returned. This requires combining the detection method of the present invention with a traditional endpoint detection method. The combined processing steps are as follows:
(1) Input the text related to the speech to be recognized, and parse it;
(2) Build a decoding network from the parsing result of step (1);
(3) Input the speech; on the one hand extract its acoustic features, and on the other hand pass the speech to the traditional endpoint detection module;
(4) Run the endpoint detection method of the present invention and the traditional endpoint detection simultaneously, each detecting endpoints on its own;
(5) Combine the endpoints given by the two methods to decide whether an endpoint has occurred; the strategy may be that a point detected by either method counts as an endpoint, or that both methods must detect it;
(6) Feed the endpoint back to the external application system.
The steps for building the decoding network in step 2 are as follows:
(1) Obtain the minimum modeling units from the text parsing of step 1; these may be phonemes, syllables, or words;
(2) From the number of minimum modeling units, compute the number of virtual nodes and the total node count in the network, allocate memory for the nodes, and associate the minimum modeling units with the network nodes;
(3) Compute the number of arcs in the network according to the permitted reading rules, and allocate memory for the arcs; the permitted reading rules include re-reading and skipping;
(4) Connect the nodes with arcs according to the reading rules;
(5) Output the decoding network.
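The construction steps above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `Node` class, penalty values, and the exact skip/re-read topology are all assumptions.

```python
# Sketch of steps (1)-(5): a linear network over the minimum modeling units,
# with forward arcs, skip arcs, and optional re-read (backward) arcs.
class Node:
    def __init__(self, unit):
        self.unit = unit    # minimum modeling unit, or None for virtual start/end nodes
        self.arcs = []      # outgoing arcs: (target Node, arc penalty)

def build_decoding_network(units, max_skip=1, allow_reread=True, penalty=-1.0):
    """units: minimum modeling units parsed from the text (e.g. phonemes)."""
    start, end = Node(None), Node(None)              # step (2): virtual nodes
    nodes = [Node(u) for u in units]                 # step (2): one node per unit
    chain = [start] + nodes + [end]
    for i, node in enumerate(chain[:-1]):
        node.arcs.append((chain[i + 1], 0.0))        # normal forward arc
        for s in range(2, max_skip + 2):             # step (3): skip arcs
            if i + s < len(chain):
                node.arcs.append((chain[i + s], penalty))
    if allow_reread:                                  # step (3): re-read arcs
        for i in range(2, len(chain) - 1):
            chain[i].arcs.append((chain[i - 1], penalty))
    return start, end, nodes                          # step (5): the network
```

The arc penalty discourages skips and re-reads so that a normal reading remains the most probable path.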
The steps for getting the optimal path in the decoder in steps (1.1) and (2.1) are as follows:
(1) Traverse all paths currently in the decoder, parsing each path to obtain its unit sequence and probability;
(2) Sort the paths by probability;
(3) Take the path with the largest probability after sorting as the optimal path.
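A minimal sketch of these three steps, under the simplifying assumption that the decoder's active paths are available as plain (unit sequence, log-probability) pairs; real decoders expose richer structures:

```python
def get_optimal_path(active_paths):
    """active_paths: iterable of (unit_sequence, log_prob) tuples."""
    parsed = [(list(units), lp) for units, lp in active_paths]  # step (1): parse paths
    parsed.sort(key=lambda p: p[1], reverse=True)               # step (2): sort by probability
    return parsed[0] if parsed else (None, float("-inf"))       # step (3): keep the best
```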
The acoustic features in step 3 are Mel-frequency cepstral coefficients (MFCC), cepstral coefficients (CEP), linear prediction coefficients (LPC), or perceptual linear prediction coefficients (PLP).
The unit sequence in step 3 is a phoneme sequence, a syllable sequence, or a word sequence.
The decoding in step 3 is Viterbi decoding, or decoding based on dynamic time warping (DTW).
Compared with the prior art, the advantages of the present invention are:
(1) When the user reads the specified text normally, the present invention can give the speech end point promptly as the user finishes the last word; the response time is shorter than that of existing endpoint detection techniques, and real-time performance is high.
(2) When the user continues reading other, unrelated content after finishing the specified text, this scheme can intelligently distinguish the garbage speech the system does not care about, making the external application system more effective.
(3) The present invention can be used in recording situations that require completeness of the user's reading: no end point is given until the user has read the entire specified content, which existing endpoint detection techniques cannot achieve.
Description of drawings
Fig. 1 is the implementation flowchart of the present invention;
Fig. 2 is the flowchart of speech starting-point judgement in the present invention;
Fig. 3 is the flowchart of speech end-point judgement in the present invention;
Fig. 4 is the implementation flowchart of combining the present invention with existing endpoint detection techniques;
Fig. 5 is an example decoding network with Chinese initials and finals as the minimum units;
Fig. 6 is the traditional MFCC feature extraction flow;
Fig. 7 is the traditional endpoint detection flow.
Embodiment
The present invention is a new, text-dependent endpoint detection method. Viterbi decoding is taken as the example decoding process (the present invention is not limited to Viterbi decoding). The flowchart of the present invention is shown in Fig. 1:
Step 1: input the text related to the speech to be recognized, and parse it:
The input text is the content the user is scheduled to read aloud, and is also one basis for building the decoding network. This step accomplishes two tasks. First, the encoding of the text is converted to a unified format, for example UTF-8; the benefit is that only one set of text-parsing code need be implemented. Second, the text is parsed at the granularity of the modeling units of the acoustic model (such as word, syllable, or phoneme; phonemes work best as modeling units, and the description below uses phonemes as the example), generating a parse tree. This structure contains complete information at six levels: chapter, sentence, word, character, syllable, and phoneme; the first four levels can be parsed with a text front-end word segmentation algorithm, and the last two according to a pronunciation dictionary.
Step 2: build the decoding network from the text parsing result, specifically as follows:
(2.1) Obtain the minimum modeling units (phonemes, syllables, or words) from the text parsing of step 1;
(2.2) From the number of minimum modeling units, compute the number of virtual nodes and the total node count in the network, allocate node memory, and associate the minimum modeling units with the network nodes;
(2.3) Compute the number of arcs in the network according to the permitted reading rules, such as re-reading and skipping, and allocate memory for the arcs;
(2.4) Connect the nodes with arcs according to the reading rules;
(2.5) Output the decoding network.
Suppose the recognition text is the two-character word "China" ("zhong guo") and the minimum modeling unit of the network is the phoneme (initials and finals). The constructed decoding network is shown in Fig. 5. As can be seen from the figure, this network allows skipping 1-2 characters and re-reading 1-2 characters, so that readings with skips or repeats, such as the user reading only "guo" or reading "zhong guo guo", can all be decoded correctly.
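To make the skip tolerance of this example concrete, the sketch below checks whether a pronounced phoneme sequence is reachable in a skip-tolerant network for "zhong guo" (zh ong g uo). The greedy alignment and the skip budget are illustrative assumptions, not the patent's decoder:

```python
def accepted(reference, spoken, max_skip=2):
    """True if `spoken` matches `reference` in order with at most
    `max_skip` reference units skipped in total (illustrative rule)."""
    i, skipped = 0, 0
    for unit in spoken:
        while i < len(reference) and reference[i] != unit:
            i += 1              # skip an unread reference unit
            skipped += 1
        if i == len(reference):
            return False        # spoken unit not found in order
        i += 1
    skipped += len(reference) - i   # trailing units never pronounced
    return skipped <= max_skip
```

With `max_skip=2`, reading only "g uo" (skipping "zh ong") is accepted, while reading only "uo" is not.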
Step 3: input the speech, extract acoustic features, and decode against the decoding network:
Acoustic features are values describing the characteristics of speech, normally a sequence of feature vectors of fixed dimension, such as 39-dimensional MFCC (Mel-frequency cepstral coefficient) features, where 39 floating-point values represent one frame of speech. The MFCC extraction flow is shown in Fig. 6; the concrete steps are as follows:
(3.1) A/D conversion: convert the analog signal into a digital signal;
(3.2) Pre-emphasis: pass the signal through a first-order finite impulse response high-pass filter to flatten its spectrum and make it less susceptible to finite word-length effects;
(3.3) Framing: exploiting the short-time stationarity of speech, process the speech frame by frame, typically taking 25 milliseconds (ms) as one frame;
(3.4) Windowing: apply a Hamming window to each frame to reduce the effect of the Gibbs phenomenon;
(3.5) Fast Fourier transform (FFT): convert the time-domain signal into the signal's power spectrum;
(3.6) Triangular filtering: filter the power spectrum with a bank of triangular filters (24 in total) spaced linearly on the Mel scale; the range covered by each filter approximates a critical band of the human ear, thereby simulating its masking effect;
(3.7) Logarithm: take the logarithm of the triangular filter bank outputs, obtaining a result close to a homomorphic transformation;
(3.8) Discrete cosine transform (DCT): remove the correlation between dimensions and map the signal to a lower-dimensional space;
(3.9) Spectral weighting: since the low-order cepstral parameters are affected by speaker and channel characteristics, while the resolution of the high-order parameters is low, the cepstrum is weighted to suppress its low-order and high-order parameters;
(3.10) Cepstral mean subtraction (CMS): CMS effectively reduces the influence of the input channel on the feature parameters;
(3.11) Differential parameters: extensive experiments show that adding differential parameters characterizing the dynamics of speech improves recognition performance. The first-order and second-order differences of the MFCC parameters are also used here.
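The core of this pipeline, steps (3.2)-(3.8), can be sketched compactly with NumPy. This is a condensed illustration under assumed parameters (16 kHz audio, 512-point FFT, simplified filterbank edges); spectral weighting, CMS, and delta features, steps (3.9)-(3.11), are omitted:

```python
import numpy as np

def mfcc_sketch(signal, sr=16000, frame_ms=25, hop_ms=10, n_filters=24, n_ceps=13):
    # (3.2) pre-emphasis: first-order FIR high-pass filter
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    flen, hop = sr * frame_ms // 1000, sr * hop_ms // 1000
    n_frames = max(1, 1 + (len(sig) - flen) // hop)
    frames = np.stack([sig[i*hop : i*hop + flen] for i in range(n_frames)])  # (3.3) framing
    frames = frames * np.hamming(flen)                 # (3.4) Hamming window
    nfft = 512
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2     # (3.5) power spectrum
    # (3.6) triangular filters spaced linearly on the Mel scale
    mel = lambda f: 2595 * np.log10(1 + f / 700.0)
    imel = lambda m: 700 * (10 ** (m / 2595.0) - 1)
    pts = imel(np.linspace(mel(0), mel(sr / 2), n_filters + 2))
    bins = np.floor((nfft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for j in range(1, n_filters + 1):
        l, c, r = bins[j - 1], bins[j], bins[j + 1]
        fbank[j - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[j - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)           # (3.7) logarithm
    # (3.8) DCT-II: decorrelate and keep the low-order cepstra
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return logmel @ dct.T                              # one row per frame
```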
After the acoustic features are extracted, real-time decoding is performed. Decoding is an important step in the present invention (Viterbi decoding is taken as the example). The decoding process is: for each input frame of acoustic features, compute the output probability and the internal state transition probability of the node corresponding to each currently viable path in the decoding network, and update the accumulated probability of each path. The output probability can be computed from the hidden Markov model (HMM) of the phoneme corresponding to the node and the acoustic features; the internal state transition probability is read directly from the model. When decoding reaches the last state within a node, the current decoding path is expanded by following the decoding network: when the node connects to multiple nodes, multiple paths must be expanded to continue decoding, and if an arc of the decoding network carries a path penalty, the penalty is added to the accumulated probability of the path. The decoding process thus generates multiple paths in real time.
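The per-frame update just described can be sketched as a token-passing step. All names are illustrative, and the node-internal HMM states are collapsed into a single emission score per node for brevity:

```python
def viterbi_step(paths, frame_score, network):
    """One frame of Viterbi-style decoding (simplified sketch).
    paths: {node_id: (accumulated_log_prob, unit_sequence)}
    frame_score(node_id) -> emission log-prob of this frame at that node
    network: {node_id: [(next_node_id, arc_penalty), ...]}"""
    updated = {}
    for node, (lp, seq) in paths.items():
        lp = lp + frame_score(node)                  # stay in the current node
        if node not in updated or lp > updated[node][0]:
            updated[node] = (lp, seq)
        for nxt, pen in network.get(node, []):       # expand along network arcs
            cand = lp + pen                          # arc penalty added to the path
            if nxt not in updated or cand > updated[nxt][0]:
                updated[nxt] = (cand, seq + [nxt])
    return updated
```

Only the best-scoring path is kept per node, which is the Viterbi maximization over competing paths.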
Step 4: judge whether an endpoint has been reached. Endpoints are divided into the speech starting point and the speech end point. If the speech end point is found, it is fed back to the external application system; otherwise continue with step 3. Starting-point judgement is optional in step 4: if the external system does not care about starting-point information, the starting point need not be judged; the present invention covers both cases.
The endpoint judgement may comprise either or both of the speech starting-point judgement and the speech end-point judgement.
Speech starting-point judgement: the flow is shown in Fig. 2.
The judgement of the speech starting point is roughly divided into the following steps:
Step 1: get the decoder's optimal path, in detail as follows:
A path is a route the decoder travels to reach an end node. Because it is not initially known which route is best, the decoder explores many routes, each corresponding to a unit sequence; the optimal path is the path whose unit sequence has maximum probability for the current speech.
(1.1) Traverse all paths currently in the decoder, parsing each path to obtain its unit sequence and probability;
(1.2) Sort the paths by probability;
(1.3) Take the path with the largest probability after sorting as the optimal path.
Step 2: starting-point early warning, i.e. judge from the optimal path whether the starting point may currently have been reached; if so, go to step 3, otherwise exit the flow. Taking decoding with the network of Fig. 5 as the example, with recognition text "China" ("zhong guo"), "zh" is the text starting point of the network, so the judgement here is whether decoding has reached the "zh" node of the network.
Step 3: early-warning confirmation: confirm whether the starting point has really been reached by judging whether the speech contains a phoneme of the text or valid garbage speech. If so, the starting point is obtained; otherwise exit the flow directly. Taking decoding with the network of Fig. 5 as the example, the judgement is whether the probability of the decoding path reaching "zh" is large enough.
Speech end-point judgement: the flow is shown in Fig. 3.
Step 1: get the current optimal path in the decoder, by the same method as in the starting-point judgement above.
Step 2: end-point early warning, i.e. judge from the optimal path whether the last phoneme of the text may have been spoken; if so, go to step 3, otherwise exit the flow. Taking decoding with the network of Fig. 5 as the example, the judgement here is whether the decoder's optimal path has decoded to the "uo" node.
Step 3: early-warning confirmation: judge whether the last phoneme of the text was really spoken, deciding by indicators such as frame length and average per-frame likelihood. If it is judged to have really been spoken, the end point is obtained and the flow ends; otherwise the flow ends directly. Taking decoding with the network of Fig. 5 as the example, this weighs whether the frame length and average likelihood of the optimal path reaching "uo" are reasonable.
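A possible form of this confirmation rule is sketched below. The thresholds are assumptions for illustration; the text does not specify concrete values:

```python
def confirm_end_point(last_phone_frames, last_phone_loglik,
                      min_frames=3, min_avg_loglik=-8.0):
    """Confirm the end point only if the final phoneme ("uo" in the Fig. 5
    example) lasted long enough and matched the acoustics well enough.
    last_phone_frames: frames spent in the final phoneme on the best path;
    last_phone_loglik: total acoustic log-likelihood of those frames."""
    if last_phone_frames < min_frames:
        return False                      # too short to be a real phoneme
    avg = last_phone_loglik / last_phone_frames
    return avg >= min_avg_loglik          # per-frame likelihood is plausible
```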
Combination with traditional endpoint detection (extended scheme):
In some application scenarios the user may not read the whole text, yet the endpoint detection system must still return the speech end point. This requires combination with traditional endpoint detection techniques; the flowchart is shown in Fig. 4. Suppose the traditional endpoint detection flow in Fig. 4 is energy-based; as shown in Fig. 7, it is mainly divided into the following steps:
Step 1: input the speech;
Step 2: segment the speech into frames and extract the short-time energy;
Step 3: comprehensive judgement: decide speech segments and non-speech segments from the current short-time energy, using either of two methods, four-threshold or double-threshold;
Step 4: endpoint feedback: feed back the detection result according to the decision of step 3.
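A minimal double-threshold energy detector matching these steps can be sketched as follows; the threshold and hangover values are illustrative, not taken from the text:

```python
def energy_endpoints(frames, high=0.1, low=0.02, hang=5):
    """frames: per-frame short-time energies. Speech starts at the first
    frame above `high`; it ends after `hang` consecutive frames below `low`.
    Returns (start, end) frame indices, or None if no speech is found."""
    start, end, below = None, None, 0
    for i, e in enumerate(frames):
        if start is None:
            if e >= high:
                start = i                 # speech onset
        else:
            if e < low:
                below += 1
                if below >= hang:
                    end = i - hang + 1    # first frame of the trailing silence
                    break
            else:
                below = 0                 # energy recovered; reset hangover
    if start is None:
        return None
    return (start, end if end is not None else len(frames))
```

The hangover counter is what makes the decision late: the detector must observe a run of silent frames before it commits, which is exactly the latency the decoding-based method avoids.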
The processing steps of combining the present invention with a traditional endpoint detection method are as follows:
Step 1: input the text related to the speech to be recognized, and parse it;
Step 2: build a decoding network from the text parsing result of step 1;
Step 3: input the speech; on the one hand extract its acoustic features, and on the other hand pass the speech to the traditional endpoint detection module;
Step 4: run the endpoint detection method of the present invention and the traditional endpoint detection simultaneously, each detecting endpoints on its own;
Step 5: combine the endpoints given by the two methods to decide whether an endpoint has occurred; the strategy may be that a point detected by either method counts as an endpoint, or that both methods must detect it;
Step 6: feed the endpoint back to the external application system.
The "starting point detected?" decision in Fig. 4 covers both the case where a starting point is detected and the case where it is not.
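The fusion in step 5 can be sketched as follows, with each detector's result simplified to an end-frame index or None; the policy names are illustrative:

```python
def fuse_end_points(decoded_end, energy_end, policy="any"):
    """Combine the decoding-based and traditional detectors.
    policy="any": a point detected by either method counts (earliest wins);
    policy="all": both methods must detect an end point (latest wins)."""
    ends = [e for e in (decoded_end, energy_end) if e is not None]
    if policy == "any":
        return min(ends) if ends else None
    if policy == "all":
        return max(ends) if len(ends) == 2 else None
    raise ValueError("policy must be 'any' or 'all'")
```

The "any" policy favors responsiveness; the "all" policy favors certainty that the utterance is truly over.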
In summary, the present invention solves the problems that, when the recognition text is known, traditional endpoint detection techniques lack real-time responsiveness and cannot perform targeted detection of the speech the user cares about.
Parts of the present invention not elaborated here belong to techniques well known to those skilled in the art.