CN102982811B - Voice endpoint detection method based on real-time decoding - Google Patents


Info

Publication number: CN102982811B
Application number: CN201210483046.4A
Authority: CN (China)
Prior art keywords: voice, text, starting point, decoding, point
Legal status: Active (the legal status is an assumption, not a legal conclusion)
Other languages: Chinese (zh)
Other versions: CN102982811A (en)
Inventors: 吴玲, 王兵, 赵乾, 潘颂声, 何春江, 朱群
Current assignee: Anhui Hear Technology Co Ltd
Original assignee: iFlytek Co Ltd
Application filed by iFlytek Co Ltd; application granted; published as CN102982811A, then granted as CN102982811B


Landscapes

  • Telephonic Communication Services (AREA)

Abstract

A voice endpoint detection method based on real-time decoding includes the following steps: input the text associated with the speech to be recognized and parse it; build a decoding network from the parsing result; input speech, extract its acoustic features, and decode the features against the constructed decoding network to obtain a decoded speech-unit sequence; perform endpoint judgment on the decoded sequence to decide whether a voice endpoint has been reached, endpoints being divided into voice starting points and voice end points; if an endpoint is found, feed the endpoint information back to the external application system, otherwise continue decoding. The starting-point judgment in the endpoint-judgment step is optional: if the external application system does not care about the voice starting point, it is not judged. The method solves the problem that, when the recognition text is known in advance, conventional endpoint detection technology responds too slowly and cannot selectively detect the speech the user cares about.

Description

A voice endpoint detection method based on real-time decoding
Technical field
The present invention relates to a voice endpoint detection method based on decoding results, and in particular to a method for feeding back the voice end point in a timely manner.
Background art
Voice endpoint detection determines the starting point and the end point of speech, excluding the silent segments from the speech signal. The correctness of endpoint detection has a great impact on speech recognition performance. In a speech evaluation system, the content of the user's recording is determined by a prompt text; providing the voice end point promptly once the user has finished reading the prompt, so that computation can stop, helps improve system performance and evaluation quality. In external application systems, the quality of endpoint detection directly affects the user experience.
For example, in language-learning software, endpoint detection runs while the user records for evaluation: when the end of speech is detected, recording stops automatically, sparing the user the trouble of pressing a stop button and greatly improving the experience over repeated use. In a speech control system such as a smart home, the user can control a lamp with commands such as "turn on the light" and "turn off the light"; when endpoint detection is not responsive enough, these commands respond too slowly and the experience is poor, whereas if the lamp turns on just as the user finishes saying "light", the experience is excellent.
Existing endpoint detection methods fall into two classes: threshold methods and pattern recognition methods.
(1) Threshold methods
A feature of the speech signal is extracted, such as short-time energy, short-time average magnitude, or zero-crossing rate; its value is computed, a threshold is chosen based on practical conditions and experience, and a decision strategy determines whether a frame is a speech start frame or end frame. Typical algorithms use short-time energy together with the short-time zero-crossing rate, or apply cepstral features.
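As a concrete illustration of the threshold method, the following sketch (an illustrative assumption, not part of the patent) computes short-time energy and zero-crossing rate per frame and applies a single energy threshold; the frame length and threshold values are chosen for demonstration only:

```python
import numpy as np

def short_time_features(signal, frame_len=200):
    """Return per-frame (short-time energy, zero-crossing count)."""
    n_frames = len(signal) // frame_len
    energy = np.empty(n_frames)
    zcr = np.empty(n_frames, dtype=int)
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        energy[i] = np.sum(frame ** 2)                          # short-time energy
        zcr[i] = np.sum(np.abs(np.diff(np.sign(frame))) > 0)    # zero crossings
    return energy, zcr

def is_speech(energy, threshold=1.0):
    """Single-threshold decision: a frame is speech if its energy exceeds the threshold."""
    return energy > threshold

# 1 s of low-level noise followed by 1 s of a 440 Hz tone, sampled at 8 kHz.
rng = np.random.default_rng(0)
sr = 8000
t = np.arange(sr) / sr
signal = np.concatenate([0.001 * rng.standard_normal(sr),
                         0.5 * np.sin(2 * np.pi * 440 * t)])
energy, zcr = short_time_features(signal)
speech = is_speech(energy)   # first 40 frames silent, last 40 frames speech
```

In a real system the threshold would be tuned to the recording conditions, which is precisely the fragility the patent's text-based method aims to avoid.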
(2) Pattern recognition methods
These methods treat endpoint detection as classifying every frame of the signal: a detection criterion is established, each frame is classified, and the frame is judged to be background noise or speech. This class includes endpoint detection based on autocorrelation similarity distance and endpoint detection based on HMM models.
All of the above methods are unrelated to the text the user reads aloud.
When the text content of the speech to be recognized is known, as in English- or Chinese-learning systems, the text the user reads is fixed and the external application only cares about the speech corresponding to that text; it is desirable that the endpoint detection module report the voice end position the moment the user finishes the last word of the specified text or command.
When the user reads the specified text normally, existing endpoint detection techniques, not knowing or not using the text content, must wait for a subsequent stretch of non-speech data before making a decision, so the response time is long.
If the user continues speaking content unrelated to the specified text after finishing it, existing endpoint detection cannot distinguish this speech the system does not care about, and cannot report a suitable voice end point.
In some application scenarios, the voice end point should be reported, and recording stopped, only after the user has read the complete command word or sentence; if the user pauses for a long time after reading half of the text, existing endpoint detection may detect that silence and report the end point prematurely, failing to meet this requirement.
Summary of the invention
Problem addressed by the invention: to overcome the deficiencies of the prior art by providing a voice endpoint detection method based on real-time decoding, solving the problem that, when the recognition text is known, existing endpoint detection techniques respond too slowly and cannot selectively detect the speech the user cares about.
Technical solution: a voice endpoint detection method based on real-time decoding, that is, an endpoint detection method that incorporates the text content, implemented as follows:
Step 1: input the text associated with the speech to be recognized, and parse the text;
Step 2: build a decoding network from the text parsing result;
Step 3: input speech and extract its acoustic features; decode the features against the decoding network built in step 2 to obtain a decoded speech-unit sequence, in which each unit is called a frame. An acoustic feature here is a set of values describing the essential short-time characteristics of speech, normally a feature vector of fixed dimensionality (e.g. a 39-dimensional MFCC feature vector).
Step 4: perform voice endpoint judgment on the decoded speech-unit sequence to decide whether a voice endpoint has been reached; endpoints are divided into the voice starting point and the voice end point. If the judgment yields a voice end point, the endpoint information is fed back to the external application system; otherwise step 3 continues. The starting-point judgment in step 4 is optional: if the external application system does not care about the voice starting point, it is not judged.
The voice starting-point judgment in step 4 proceeds as follows:
(1.1) take the optimal path in the decoder. The decoder is one of the cores of a speech recognition system; its task is, given the input acoustic features, to find, according to the acoustic model and the decoding network, the speech-unit sequence that outputs the signal with maximum probability. The decoding network, also called the grammar network, is one of the decoder's inputs: the decoder cannot work without it, and it defines the range of speech-unit sequences the decoder can output;
(1.2) starting-point warning: based on the optimal path in the decoder, judge whether the current speech may have reached the voice starting point of the text; if so, go to step (1.3), otherwise exit;
(1.3) warning confirmation: judge whether the speech contains in-text phonemes or valid garbage speech, thereby confirming whether the voice starting point has really been reached; if so, the starting point is obtained, otherwise exit directly.
The voice end-point judgment in step 4 proceeds as follows:
(2.1) take the current optimal path in the decoder;
(2.2) end-point warning: based on the optimal path in the decoder, judge whether the last phoneme of the text may have been spoken; if so, go to step (2.3), otherwise exit;
(2.3) warning confirmation: judge whether the last phoneme of the text was really spoken, using the frame count and the average per-frame likelihood as indicators; if it is judged to have really been spoken, the voice end point is obtained and the process ends, otherwise the process ends directly.
In some application scenarios the user may not finish reading the text content, yet the end point of the speech must still be returned; this requires combining the present detection method with a traditional endpoint detection method, as follows:
(1) input the text associated with the speech to be recognized, and parse the text;
(2) build the decoding network from the text parsing result of step (1);
(3) input speech; on one hand extract its acoustic features, and on the other hand pass the speech to the traditional endpoint detection module;
(4) run the endpoint detection method of the present invention and the traditional endpoint detection simultaneously, each detecting voice endpoints independently;
(5) decide whether a voice endpoint has been reached by combining the decisions of the two methods, adopting either the strategy that an endpoint detected by either method counts, or the strategy that both methods must detect it;
(6) feed the voice endpoint back to the external application system.
The decoding network in step 2 is built as follows:
(1) obtain the minimum modeling units from the text parsed in step 1; they may be phonemes, syllables, or words;
(2) compute the number of virtual nodes and the total number of nodes in the network from the number of minimum modeling units, allocate memory for the nodes, and associate the minimum modeling units with the network nodes;
(3) compute the number of arcs in the network according to the allowed reading rules, and allocate memory for the arcs; the allowed reading rules include re-reading and skipping;
(4) connect the nodes with arcs according to the reading rules;
(5) output the decoding network.
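The network-construction steps above can be sketched as follows; the adjacency-list representation and the penalty values are illustrative assumptions, not the patent's actual data structures:

```python
def build_decoding_network(units, allow_skip=True, allow_reread=True,
                           skip_penalty=-2.0, reread_penalty=-1.0):
    """Build a decoding network over minimum modeling units (phonemes here).

    Nodes are unit indices; arcs is {node: [(next_node, log-penalty)]}.
    Forward arcs follow the text order; skip and re-read arcs implement
    the allowed reading rules.
    """
    n = len(units)
    arcs = {i: [] for i in range(n)}
    for i in range(n - 1):
        arcs[i].append((i + 1, 0.0))                 # normal reading order
        if allow_skip and i + 2 < n:
            arcs[i].append((i + 2, skip_penalty))    # skip the next unit
        if allow_reread and i > 0:
            arcs[i].append((i - 1, reread_penalty))  # re-read the previous unit
    return {"units": units, "arcs": arcs}

# "中国" (China) decomposed into initials and finals, as in the Fig. 5 example.
net = build_decoding_network(["zh", "ong", "g", "uo"])
```

Penalized skip and re-read arcs let imperfect readings still be decoded, while the penalties keep the normal reading order the preferred path.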
The optimal path in the decoder is taken in steps (1.1) and (2.1) as follows:
(1) traverse all paths in the current decoder and parse each path to obtain its speech-unit sequence and probability;
(2) sort the paths by probability;
(3) take the path with the maximum probability after sorting as the optimal path.
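The three sub-steps above amount to taking the maximum-probability path; a minimal sketch, with the path representation (unit sequence, log probability) assumed for illustration:

```python
def optimal_path(paths):
    """Sort decoder paths by accumulated log probability and take the best."""
    ranked = sorted(paths, key=lambda p: p[1])   # step (2): sort by probability
    return ranked[-1]                            # step (3): maximum after sorting

# Each candidate pairs a decoded speech-unit sequence with its log probability.
paths = [
    (["zh", "ong"], -12.5),
    (["zh", "ong", "g"], -9.8),
    (["ong"], -15.1),
]
best = optimal_path(paths)
```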
In step 3, the acoustic features are Mel-frequency cepstral coefficients (MFCC), cepstral coefficients (CEP), linear prediction coefficients (LPC), or perceptual linear prediction coefficients (PLP).
In step 3, the speech-unit sequence is a phoneme sequence, a syllable sequence, or a word sequence.
The decoding in step 3 is Viterbi decoding, or decoding based on dynamic time warping (DTW).
Compared with the prior art, the advantages of the present invention are:
(1) When the user reads the specified text normally, the invention reports the voice end point promptly as the user finishes the last word; the response time is shorter than that of existing endpoint detection techniques and the real-time performance is high.
(2) When the user continues speaking unrelated content after finishing the specified text, the scheme intelligently distinguishes this garbage speech the system does not care about, so the external application system behaves better.
(3) The invention can be used in recording scenarios that require the reading to be complete: no voice end point is reported until the user has read the given content, which existing endpoint detection techniques cannot achieve.
Brief description of the drawings
Fig. 1 is the implementation flowchart of the present invention;
Fig. 2 is the voice starting-point decision flowchart of the present invention;
Fig. 3 is the voice end-point decision flowchart of the present invention;
Fig. 4 is the flowchart of combining the present invention with an existing endpoint detection technique;
Fig. 5 is an example decoding network using Chinese initials and finals as the minimum units;
Fig. 6 is the traditional MFCC feature extraction flow;
Fig. 7 is the traditional endpoint detection flow.
Detailed description
The present invention is a new text-dependent endpoint detection method; Viterbi decoding is taken as the example decoding process (the invention is not limited to Viterbi decoding). The flowchart of the invention is shown in Fig. 1:
Step 1: input the text associated with the speech to be recognized, and parse the text.
The input text is the content the user is expected to read aloud, and is one of the inputs for building the decoding network. This step completes two tasks. First, the text encoding is normalized, e.g. converted uniformly to UTF-8, so that only one set of text parsing code is needed. Second, the text is parsed at the granularity of the modeling units of the acoustic model (word, syllable, or phoneme; phonemes generally work best as modeling units, and the description below uses phonemes throughout), producing a parse tree that contains complete information at six levels: chapter, sentence, word, character, syllable, and phoneme. The first four levels can be parsed with a text front-end segmentation algorithm, and the last two with a pronunciation dictionary.
Step 2: build the decoding network from the text parsing result, specifically:
(2.1) obtain the minimum modeling units (phonemes, syllables, or words) from the text parsed in step 1;
(2.2) compute the number of virtual nodes and the total number of nodes in the network from the number of minimum modeling units, allocate node memory, and associate the minimum modeling units with the network nodes;
(2.3) compute the number of arcs in the network according to the allowed reading rules, such as re-reading and skipping, and allocate arc memory;
(2.4) connect the nodes with arcs according to the reading rules;
(2.5) output the decoding network.
Suppose the recognition text is the two characters "中国" (China) and the minimum modeling units of the network are phonemes (initials and finals). The constructed decoding network is shown in Fig. 5. As can be seen from the figure, the network allows skipping 1-2 characters and re-reading 1-2 characters, so readings with extra or skipped characters, such as the user reading only "国", can still be decoded correctly.
Step 3: input speech, extract the acoustic features, and decode against the decoding network.
Acoustic features are values describing the characteristics of speech, normally a sequence of feature vectors of fixed dimensionality, e.g. 39-dimensional MFCC features, where 39 floating-point values represent one frame of speech. The MFCC extraction flow is shown in Fig. 6; the concrete steps are as follows:
(3.1) A/D conversion: convert the analog signal into a digital signal;
(3.2) pre-emphasis: apply a first-order finite-impulse-response high-pass filter to flatten the spectrum of the signal and make it less susceptible to finite word-length effects;
(3.3) framing: by the short-time stationarity of speech, the signal can be processed in units of frames, typically 25 milliseconds (ms) per frame;
(3.4) windowing: apply a Hamming window to each frame to reduce the effect of the Gibbs phenomenon;
(3.5) fast Fourier transform (FFT): convert the time-domain signal into its power spectrum;
(3.6) triangular-filter filtering: filter the power spectrum with a bank of triangular filters spaced linearly on the Mel scale (24 triangular filters in total); the range covered by each filter approximates one critical bandwidth of the human ear, simulating the ear's masking effect;
(3.7) logarithm: take the logarithm of the outputs of the triangular filter bank, yielding a result close to a homomorphic transform;
(3.8) discrete cosine transform (DCT): remove the correlation between the dimensions of the signal and map it to a lower-dimensional space;
(3.9) spectral weighting: because the low-order cepstral parameters are affected by speaker and channel characteristics, and the high-order parameters have low discriminability, the cepstrum is weighted to suppress its low- and high-order parameters;
(3.10) cepstral mean subtraction (CMS): CMS effectively reduces the influence of the input channel on the feature parameters;
(3.11) delta parameters: extensive experiments show that adding delta parameters, which characterize the dynamic properties of speech, to the features improves recognition performance; the first-order and second-order deltas of the MFCC parameters are also used here.
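Steps (3.2)-(3.8) can be sketched compactly with NumPy. The 25 ms frames and 24-filter Mel bank follow the text; the sample rate, FFT size, and coefficient count are typical values assumed for illustration, and the spectral weighting, CMS, and delta parameters of steps (3.9)-(3.11) are omitted for brevity:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=400, n_fft=512, n_filters=24, n_ceps=13):
    # (3.2) pre-emphasis: first-order FIR high-pass filter
    x = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # (3.3) framing: 25 ms frames at 16 kHz (no overlap, for brevity)
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    # (3.4) Hamming window
    frames = frames * np.hamming(frame_len)
    # (3.5) FFT -> power spectrum
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # (3.6) 24 triangular filters spaced linearly on the Mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, ctr, hi = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, lo:ctr] = (np.arange(lo, ctr) - lo) / max(ctr - lo, 1)
        fbank[i - 1, ctr:hi] = (hi - np.arange(ctr, hi)) / max(hi - ctr, 1)
    # (3.7) log of the filter-bank outputs
    log_fb = np.log(power @ fbank.T + 1e-10)
    # (3.8) DCT: decorrelate and keep the first n_ceps coefficients
    k = np.arange(n_ceps)[:, None]
    n = np.arange(n_filters)[None, :]
    dct = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))
    return log_fb @ dct.T

# One second of a synthetic tone at 16 kHz -> 40 frames of 13 coefficients.
sr = 16000
t = np.arange(sr) / sr
feats = mfcc(np.sin(2 * np.pi * 300 * t))
```

Appending the first- and second-order deltas of step (3.11) to 13 static coefficients yields the 39-dimensional vectors mentioned in the text.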
After feature extraction, real-time decoding begins. Decoding is an important step of the present invention (Viterbi decoding is taken as the example). The decoding process is: for every input frame of acoustic features, compute the output probability and the intra-node state transition probability of the corresponding node on each currently feasible path in the decoding network, and update the accumulated probability of the current path. The output probability is computed from the hidden Markov model corresponding to the node's phoneme and the acoustic features; the intra-node state transition probability is read directly from the model. When decoding reaches the last state inside a node, the current decoding path is expanded by following the decoding network: when the node connects to multiple nodes, multiple paths must be expanded and decoding proceeds on each, and if an arc of the decoding network carries a path penalty, the penalty is added to the accumulated probability of the path. The decoding process thus generates multiple paths in real time.
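A simplified sketch of this per-frame update follows. The emission scoring is a stand-in for the HMM output probabilities, node-internal states are collapsed into a single state, and the two-node network is an illustrative assumption:

```python
def viterbi_step(tokens, arcs, emit, frame):
    """Advance all active decoding paths by one frame, keeping the best token per node.

    tokens: list of (node, accumulated_log_prob);
    arcs: {node: [(next_node, arc_penalty)]};
    emit(node, frame): log output probability of the frame at that node.
    """
    best = {}
    for node, logp in tokens:
        scored = logp + emit(node, frame)            # output probability of this frame
        # Stay in the node (self-loop) or expand along the network's arcs;
        # an arc penalty is added to the path's accumulated probability.
        for nxt, penalty in [(node, 0.0)] + arcs.get(node, []):
            cand = scored + penalty
            if nxt not in best or cand > best[nxt]:
                best[nxt] = cand
    return list(best.items())

# Two-node network "a" -> "b"; the emission strongly favors the matching label.
arcs = {"a": [("b", 0.0)]}
emit = lambda node, frame: 0.0 if node == frame else -5.0
tokens = [("a", 0.0)]
for frame in ["a", "a", "b"]:
    tokens = viterbi_step(tokens, arcs, emit, frame)
best_node = max(tokens, key=lambda t: t[1])[0]   # best path ends in "b"
```

Keeping only the best token per node is the standard Viterbi pruning that keeps the number of live paths bounded during real-time decoding.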
Step 4: judge whether a voice endpoint has been reached. Endpoints are divided into the voice starting point and the voice end point. If a voice end point is found, it is fed back to the external application system; otherwise step 3 continues. The starting-point judgment is optional: if the external system does not care about the starting point, it need not be judged; the present invention covers both cases.
Voice endpoint judgment comprises one or both of the voice starting-point judgment and the voice end-point judgment.
Judgment of the voice starting point: the flow is shown in Fig. 2. It divides roughly into the following steps:
Step 1: take the decoder's optimal path. A path is the route the decoder travels to reach a node; because it is not known at the start which route is best, the decoder explores many routes, each corresponding to a speech-unit sequence, and the optimal path is the one whose speech-unit sequence currently has the maximum probability. In detail:
(1.1) traverse all paths in the current decoder and parse each path to obtain its speech-unit sequence and probability;
(1.2) sort the paths by probability;
(1.3) take the path with the maximum probability after sorting as the optimal path.
Step 2: starting-point warning: judge from the optimal path whether the starting point may currently have been reached; if so, go to step 3, otherwise exit the flow. Taking decoding with the network of Fig. 5 as an example, with recognition text "中国", "zh" is the starting phoneme of the text, so this step judges whether decoding has reached the "zh" node of the network.
Step 3: warning confirmation: confirm whether the voice starting point has really been reached by judging whether the speech contains in-text phonemes or valid garbage speech. If so, the starting point is obtained; otherwise exit the flow directly. With the network of Fig. 5, this amounts to judging whether the probability of the decoding path reaching "zh" is large enough.
Judgment of the voice end point: the flow is shown in Fig. 3.
Step 1: take the current optimal path in the decoder, by the same method as in the starting-point judgment above.
Step 2: end-point warning: judge from the optimal path whether the last phoneme of the text may have been spoken; if so, go to step 3, otherwise exit the flow. With the network of Fig. 5, this means judging whether the decoder's optimal path has decoded to the "uo" node.
Step 3: warning confirmation: judge whether the last phoneme of the text was really spoken; the decision can be made with indicators such as the frame count and the average per-frame likelihood. If it is judged to have really been spoken, the voice end point is obtained and the flow ends; otherwise the flow ends directly. With the network of Fig. 5, this means checking whether the frame count and average likelihood of the optimal path reaching "uo" are reasonable.
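The two-stage end-point decision above can be sketched as follows; the frame-count and likelihood thresholds are illustrative assumptions, to be tuned in a real system:

```python
def endpoint_warning(best_path_units, last_phoneme):
    """Stage 1 (warning): has the decoder's best path reached the text's final phoneme?"""
    return len(best_path_units) > 0 and best_path_units[-1] == last_phoneme

def endpoint_confirm(frames_in_last, avg_likelihood,
                     min_frames=5, min_avg_likelihood=-8.0):
    """Stage 2 (confirmation): was the final phoneme really spoken?

    Decided by the number of frames spent in the phoneme and their
    average log likelihood, per the indicators named in the text.
    """
    return frames_in_last >= min_frames and avg_likelihood >= min_avg_likelihood

def detect_endpoint(best_path_units, last_phoneme, frames_in_last, avg_likelihood):
    if not endpoint_warning(best_path_units, last_phoneme):
        return False                      # warning not raised: keep decoding
    return endpoint_confirm(frames_in_last, avg_likelihood)

# "中国": the best path has reached "uo" and stayed there for 8 frames with a
# plausible average likelihood, so the end point is confirmed.
confirmed = detect_endpoint(["zh", "ong", "g", "uo"], "uo", 8, -3.2)
```

The warning stage keeps the confirmation cheap: the frame-count and likelihood checks only run once the best path has plausibly reached the end of the text.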
Combination with traditional endpoint detection (extended scheme):
In some application scenarios the user may not finish reading the text content, yet the endpoint detection system must still return the end point of the speech; this requires combination with traditional endpoint detection technology, with the flowchart shown in Fig. 4. Suppose the traditional endpoint detection flow in Fig. 4 is energy-based endpoint detection, whose flowchart is shown in Fig. 7; it divides into the following steps:
Step 1: input speech;
Step 2: segment the speech into frames and extract the short-time energy;
Step 3: make a comprehensive decision, classifying speech and non-speech segments according to the current short-time energy; a single-threshold or a double-threshold method can be used;
Step 4: feed back the endpoint according to the decision result of step 3.
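The traditional energy-based flow above can be sketched with the double-threshold strategy: a high energy threshold finds definite speech, and a low threshold extends that region to its boundaries. The thresholds and frame size are illustrative assumptions:

```python
import numpy as np

def energy_endpoints(signal, frame_len=200, low=0.5, high=5.0):
    """Return (start_frame, end_frame) of the detected speech region, or None."""
    n_frames = len(signal) // frame_len
    energy = np.array([np.sum(signal[i * frame_len:(i + 1) * frame_len] ** 2)
                       for i in range(n_frames)])
    definite = np.flatnonzero(energy > high)       # frames above the high threshold
    if definite.size == 0:
        return None
    start, end = int(definite[0]), int(definite[-1])
    while start > 0 and energy[start - 1] > low:   # extend left with the low threshold
        start -= 1
    while end < n_frames - 1 and energy[end + 1] > low:   # extend right
        end += 1
    return start, end

# Noise, then 1 s of a 440 Hz tone, then noise again, at 8 kHz (25 ms frames).
rng = np.random.default_rng(1)
sr = 8000
t = np.arange(sr) / sr
signal = np.concatenate([0.001 * rng.standard_normal(sr),
                         0.5 * np.sin(2 * np.pi * 440 * t),
                         0.001 * rng.standard_normal(sr)])
endpoints = energy_endpoints(signal)   # speech occupies frames 40-79
```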
The processing steps for combining the present invention with a traditional endpoint detection method are as follows:
Step 1: input the text associated with the speech to be recognized, and parse the text;
Step 2: build the decoding network from the text parsing result of step 1;
Step 3: input speech; on one hand extract its acoustic features, and on the other hand pass the speech to the traditional endpoint detection module;
Step 4: run the endpoint detection method of the present invention and the traditional endpoint detection simultaneously, each detecting voice endpoints independently;
Step 5: decide whether a voice endpoint has been reached by combining the decisions of the two methods, adopting either the strategy that an endpoint detected by either method counts, or the strategy that both methods must detect it;
Step 6: feed the voice endpoint back to the external application system.
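The combination decision of step 5 reduces to a boolean policy over the two detectors' outputs, sketched here with assumed policy names:

```python
def combine_endpoint_decisions(decoded_hit, traditional_hit, policy="either"):
    """Combine the text-decoding-based and traditional endpoint decisions.

    policy "either": report an endpoint as soon as either detector fires
    (faster response); policy "both": require both detectors (stricter).
    """
    if policy == "either":
        return decoded_hit or traditional_hit
    if policy == "both":
        return decoded_hit and traditional_hit
    raise ValueError("policy must be 'either' or 'both'")

# Example: the decoder has confirmed the last phoneme, but the energy-based
# detector still treats trailing noise as speech.
decision = combine_endpoint_decisions(True, False, policy="either")
```

"Either" favors responsiveness (the decoder can fire the moment the text is finished), while "both" favors robustness against false endpoints.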
" whether starting point detected " in upper Fig. 4 and contain the process detecting that starting point does not then detect.
In a word, the invention solves when speech recognition text is determined, the real-time that legacy endpoint detection technique shows is not high, cannot carry out aimed detection problem to the voice that user is concerned about.
Non-elaborated part of the present invention belongs to the known technology of those skilled in the art.

Claims (7)

1. A voice endpoint detection method based on real-time decoding, characterized in that the implementation steps are as follows:
Step 1: input the text associated with the speech to be recognized, and parse the text;
Step 2: build a decoding network from the text parsing result;
Step 3: input speech in real time and extract its acoustic features; decode the acoustic features against the decoding network built in step 2 to obtain a decoded speech-unit sequence, in which each unit is called a frame;
Step 4: perform voice endpoint judgment on the decoded speech-unit sequence to decide whether a voice endpoint has been reached, the endpoints being divided into the voice starting point and the voice end point; if the judgment yields a voice end point, feed the endpoint information back to the external application system, otherwise continue with step 3; the starting-point judgment in step 4 is optional, and if the external application system does not care about the voice starting point, the starting point is not judged;
the voice starting-point judgment in step 4 is as follows:
(1.1) take the optimal path in the decoder;
(1.2) starting-point warning: based on the optimal path in the decoder, judge whether the current speech may have reached the voice starting point of the text; if so, go to step (1.3), otherwise end the judgment;
(1.3) warning confirmation: judge whether the speech contains in-text phonemes or valid garbage speech, thereby confirming whether the voice starting point has really been reached; if so, the starting point is obtained, otherwise exit directly;
the voice end-point judgment in step 4 is as follows:
(2.1) take the current optimal path in the decoder;
(2.2) end-point warning: based on the optimal path in the decoder, judge whether the last phoneme of the text may have been spoken; if so, go to step (2.3), otherwise end the judgment;
(2.3) warning confirmation: judge whether the last phoneme of the text was really spoken, using the frame count and the average per-frame likelihood as indicators; if it is judged to have really been spoken, the voice end point is obtained and the process ends, otherwise the process ends directly.
2. The voice endpoint detection method based on real-time decoding according to claim 1, characterized in that in some application scenarios the user may not finish reading the text content yet the end point of the speech must still be returned, which requires combining the method with a traditional endpoint detection method, the combined processing steps being as follows:
(1) input the text associated with the speech to be recognized, and parse the text;
(2) build the decoding network from the text parsing result of step (1);
(3) input speech; on one hand extract its acoustic features, and on the other hand pass the speech to the traditional endpoint detection module;
(4) run the voice endpoint detection method and the traditional endpoint detection simultaneously, each detecting voice endpoints independently;
(5) decide whether a voice endpoint has been reached by combining the decisions of the two methods, adopting either the strategy that an endpoint detected by either method counts as an endpoint, or the strategy that both methods must detect it;
(6) feed the voice endpoint back to the external application system.
3. The voice endpoint detection method based on real-time decoding according to claim 1 or 2, characterized in that the decoding network in step 2 is built as follows:
(1) obtain the minimum modeling units from the text parsed in step 1, the minimum modeling units being phonemes, syllables, or words;
(2) compute the number of virtual nodes and the total number of nodes in the network from the number of minimum modeling units, allocate memory for the nodes, and associate the minimum modeling units with the network nodes;
(3) compute the number of arcs in the network according to the allowed reading rules, and allocate memory for the arcs, the allowed reading rules including re-reading and skipping;
(4) connect the nodes with arcs according to the reading rules;
(5) output the decoding network.
4. The voice endpoint detection method based on real-time decoding according to claim 1 or 2, characterized in that the optimal path in the decoder is taken in steps (1.1) and (2.1) as follows:
(1) traverse all paths in the current decoder and parse each path to obtain its speech-unit sequence and probability;
(2) sort the paths by probability;
(3) take the path with the maximum probability after sorting as the optimal path.
5. The voice endpoint detection method based on real-time decoding according to claim 1 or 2, characterized in that the acoustic features in step 3 are Mel-frequency cepstral coefficients (MFCC), cepstral coefficients (CEP), linear prediction coefficients (LPC), or perceptual linear prediction coefficients (PLP).
6. The voice endpoint detection method based on real-time decoding according to claim 1 or 2, characterized in that the speech-unit sequence in step 3 is a phoneme sequence, a syllable sequence, or a word sequence.
7. The voice endpoint detection method based on real-time decoding according to claim 1 or 2, characterized in that the decoding in step 3 is Viterbi decoding, or decoding based on dynamic time warping (DTW).
CN201210483046.4A 2012-11-24 2012-11-24 Voice endpoint detection method based on real-time decoding Active CN102982811B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210483046.4A CN102982811B (en) 2012-11-24 2012-11-24 Voice endpoint detection method based on real-time decoding

Publications (2)

Publication Number Publication Date
CN102982811A CN102982811A (en) 2013-03-20
CN102982811B true CN102982811B (en) 2015-01-14

Family

ID=47856719

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210483046.4A Active CN102982811B (en) 2012-11-24 2012-11-24 Voice endpoint detection method based on real-time decoding

Country Status (1)

Country Link
CN (1) CN102982811B (en)

Families Citing this family (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103593048B (en) * 2013-10-28 2017-01-11 浙江大学 Voice navigation system and method of animal robot system
CN105981099A (en) * 2014-02-06 2016-09-28 三菱电机株式会社 Speech search device and speech search method
CN105374352B (en) * 2014-08-22 2019-06-18 中国科学院声学研究所 A kind of voice activated method and system
CN106205607B (en) * 2015-05-05 2019-10-29 联想(北京)有限公司 Voice information processing method and speech information processing apparatus
CN105118502B (en) * 2015-07-14 2017-05-10 百度在线网络技术(北京)有限公司 End point detection method and system of voice identification system
US20170069309A1 (en) * 2015-09-03 2017-03-09 Google Inc. Enhanced speech endpointing
CN105261357B (en) * 2015-09-15 2016-11-23 百度在线网络技术(北京)有限公司 Sound end detecting method based on statistical model and device
CN106653022B (en) * 2016-12-29 2020-06-23 百度在线网络技术(北京)有限公司 Voice awakening method and device based on artificial intelligence
CN107146633A (en) * 2017-05-09 2017-09-08 广东工业大学 A kind of complete speech data preparation method and device
CN110520925B (en) * 2017-06-06 2020-12-15 谷歌有限责任公司 End of query detection
CN107423275A (en) * 2017-06-27 2017-12-01 北京小度信息科技有限公司 Sequence information generation method and device
CN107799126B (en) * 2017-10-16 2020-10-16 苏州狗尾草智能科技有限公司 Voice endpoint detection method and device based on supervised machine learning
CN108257616A (en) * 2017-12-05 2018-07-06 苏州车萝卜汽车电子科技有限公司 Interactive detection method and device
CN110364171B (en) * 2018-01-09 2023-01-06 深圳市腾讯计算机系统有限公司 Voice recognition method, voice recognition system and storage medium
CN108538310B (en) * 2018-03-28 2021-06-25 天津大学 Voice endpoint detection method based on long-time signal power spectrum change
CN110827795A (en) * 2018-08-07 2020-02-21 阿里巴巴集团控股有限公司 Voice input end judgment method, device, equipment, system and storage medium
CN109087645B (en) * 2018-10-24 2021-04-30 科大讯飞股份有限公司 Decoding network generation method, device, equipment and readable storage medium
CN111583910B (en) * 2019-01-30 2023-09-26 北京猎户星空科技有限公司 Model updating method and device, electronic equipment and storage medium
CN109859773A (en) * 2019-02-14 2019-06-07 北京儒博科技有限公司 A kind of method for recording of sound, device, storage medium and electronic equipment
CN110070885B (en) * 2019-02-28 2021-12-24 北京字节跳动网络技术有限公司 Audio starting point detection method and device
CN112151073B (en) * 2019-06-28 2024-07-09 北京声智科技有限公司 Voice processing method, system, equipment and medium
CN113160854B (en) * 2020-01-22 2024-10-18 阿里巴巴集团控股有限公司 Voice interaction system, related method, device and equipment
CN111754979A (en) * 2020-07-21 2020-10-09 南京智金科技创新服务中心 Intelligent voice recognition method and device
CN112511698B (en) * 2020-12-03 2022-04-01 普强时代(珠海横琴)信息技术有限公司 Real-time call analysis method based on universal boundary detection
CN112614514B (en) * 2020-12-15 2024-02-13 中国科学技术大学 Effective voice fragment detection method, related equipment and readable storage medium
CN112669880B (en) * 2020-12-16 2023-05-02 北京读我网络技术有限公司 Method and system for adaptively detecting voice ending
CN112652306B (en) * 2020-12-29 2023-10-03 珠海市杰理科技股份有限公司 Voice wakeup method, voice wakeup device, computer equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1763843A (en) * 2005-11-18 2006-04-26 清华大学 Pronunciation quality evaluating method for language learning machine
CN101030369A (en) * 2007-03-30 2007-09-05 清华大学 Built-in speech discriminating method based on sub-word hidden Markov model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8762150B2 (en) * 2010-09-16 2014-06-24 Nuance Communications, Inc. Using codec parameters for endpoint detection in speech recognition

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1763843A (en) * 2005-11-18 2006-04-26 清华大学 Pronunciation quality evaluating method for language learning machine
CN101030369A (en) * 2007-03-30 2007-09-05 清华大学 Built-in speech discriminating method based on sub-word hidden Markov model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Speech signal endpoint detection method based on HMM models in noisy environments; Zhu Jie et al.; Journal of Shanghai Jiao Tong University; 1998-10-31; Vol. 32, No. 10; pp. 14-16 *


Similar Documents

Publication Publication Date Title
CN102982811B (en) Voice endpoint detection method based on real-time decoding
CN110136749B (en) Method and device for detecting end-to-end voice endpoint related to speaker
KR100755677B1 (en) Apparatus and method for dialogue speech recognition using topic detection
CN101930735B (en) Speech emotion recognition equipment and speech emotion recognition method
CN110706690A (en) Speech recognition method and device
CN111640456B (en) Method, device and equipment for detecting overlapping sound
CN110827795A (en) Voice input end judgment method, device, equipment, system and storage medium
CN102945673A (en) Continuous speech recognition method with speech command range changed dynamically
CN101887725A (en) Phoneme confusion network-based phoneme posterior probability calculation method
WO2018192186A1 (en) Speech recognition method and apparatus
US11495234B2 (en) Data mining apparatus, method and system for speech recognition using the same
CN112102850A (en) Processing method, device and medium for emotion recognition and electronic equipment
CN106558306A (en) Method for voice recognition, device and equipment
CN103035244B (en) Voice tracking method capable of feeding back loud-reading progress of user in real time
Mistry et al. Overview: Speech recognition technology, mel-frequency cepstral coefficients (mfcc), artificial neural network (ann)
Kadyan et al. Developing children’s speech recognition system for low resource Punjabi language
CN118471201B (en) Efficient self-adaptive hotword error correction method and system for speech recognition engine
Stanek et al. Algorithms for vowel recognition in fluent speech based on formant positions
CN110853669A (en) Audio identification method, device and equipment
CN117765932A (en) Speech recognition method, device, electronic equipment and storage medium
CN113823265A (en) Voice recognition method and device and computer equipment
CN113327596A (en) Training method of voice recognition model, voice recognition method and device
CN113053358A (en) Voice recognition customer service system for regional dialects
CN112185357A (en) Device and method for simultaneously recognizing human voice and non-human voice
CN116072146A (en) Pumped storage station detection method and system based on voiceprint recognition

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C56 Change in the name or address of the patentee
CP01 Change in the name or title of a patent holder

Address after: No. 666 Wangjiang Road, High-tech Development Zone, Hefei, Anhui 230088

Patentee after: iFlytek Co., Ltd.

Address before: No. 666 Wangjiang Road, High-tech Development Zone, Hefei, Anhui 230088

Patentee before: Anhui USTC iFLYTEK Co., Ltd.

TR01 Transfer of patent right

Effective date of registration: 20170823

Address after: 230088, Hefei province high tech Zone, 2800 innovation Avenue, 248 innovation industry park, H2 building, room two, Anhui

Patentee after: Anhui Hear Technology Co., Ltd.

Address before: No. 666 Wangjiang Road, High-tech Development Zone, Hefei, Anhui 230088

Patentee before: iFlytek Co., Ltd.

TR01 Transfer of patent right