CN106098076B

CN106098076B - Time-frequency domain adaptive voice detection method based on dynamic noise estimation

Info

Publication number: CN106098076B
Application number: CN201610393406.XA
Authority: CN
Inventors: 何云鹏
Original assignee: Chengdu Leader Technology Co Ltd
Current assignee: Chengdu Leader Technology Co Ltd
Priority date: 2016-06-06
Filing date: 2016-06-06
Publication date: 2019-05-21
Anticipated expiration: 2036-06-06
Also published as: CN106098076A

Abstract

The present invention relates to the fields of information processing and sensor signal processing, and in particular to an adaptive automatic voice detection method in the time-frequency domain based on dynamic noise estimation. The method detects voice separately from changes in the time-domain short-time energy of the sound and from changes in the frequency-domain short-time energy within a certain frequency range, and finally selects the better of the two results according to the dynamically estimated background noise energy. This greatly improves the accuracy of speech recognition and its adaptability to environmental change.

Description

A time-frequency domain adaptive voice detection method based on dynamic noise estimation

Technical field

The present invention relates to the fields of information processing and sensor signal processing, and in particular to a time-frequency domain adaptive voice detection method based on dynamic noise estimation.

Background technique

Speech recognition is a hot spot in artificial intelligence applications and is now widely used in many fields. Voice detection is an important part of any real-time speech recognition system; its purpose is to distinguish voice segments from non-voice segments in complex real-world environments. The literature indicates that low recognition rates in practical applications are largely caused by voice not being handled correctly: a large amount of non-voice information seriously degrades the accuracy of a speech recognition system, especially when the application environment is noisy. Correct voice detection can effectively reduce the system's computational load, shorten processing time, reduce the transmission power of mobile terminals, save channel resources, and improve recognition accuracy. Under complex background noise in particular, the performance of a speech recognition system depends to a large extent on the quality of its voice detection. A voice detection technique that is accurate, real-time, adaptive, and robust is therefore necessary for every speech recognition system.

When speech recognition is used on mobile terminals, especially mobile phones or voice remote controls, the start and end of voice are currently determined mainly by pressing a key. This approach is very inconvenient for most far-field applications. For far-field devices, and for smart devices and robots that support speech recognition but are not held in the hand, an automatic voice detection system is an essential component.

Current mainstream automatic voice detection relies on three measures: short-time energy in the time domain, zero-crossing rate, and the mean square deviation of band energy in the frequency domain. In each case the short-time energy, zero-crossing rate, or band-energy mean square deviation is computed and then compared with an empirical threshold. Experiments show that comparing short-time energy or zero-crossing rate alone adapts poorly to noisy environments, especially when the application environment changes, since the background noise changes accordingly. The band-energy mean square deviation method, in turn, adapts poorly to quiet environments.

To solve these problems, a method is needed that detects voice separately from changes in the average voice energy in the time domain and in the frequency domain, and then selects the better result according to the dynamically estimated background noise level, thereby greatly improving speech recognition accuracy and adaptability to environmental change.

Summary of the invention

The object of the invention is to overcome the above deficiencies of the prior art by providing a voice detection method that can greatly improve speech recognition accuracy and adaptability to environmental change.

To achieve the above object, the present invention provides the following technical solution.

A time-frequency domain adaptive voice detection method based on dynamic noise estimation, comprising the following steps:

Step 1: Load the current frame data; the current frame data is time-domain voice data;

Step 2: Calculate the total energy of each frame of the time-domain voice data as the time-domain short-time energy, and transform each frame of the time-domain voice data into frequency-domain data by FFT;

Step 3: Select the sub-band data of the frequency-domain data within a certain frequency range, calculate the energy of that sub-band data, and accumulate it as the frequency-domain short-time energy;

Step 4: The background noise estimation unit calculates the background noise energy, and the frequency-domain background energy calculation unit calculates the frequency-domain background energy;

Step 5: Compare the time-domain short-time energy with the background noise energy; if it is greater than the background noise energy, the result is voice, and if it is less than or equal to the background noise energy, the result is non-voice;

Step 6: Compare the frequency-domain short-time energy with the frequency-domain background energy; if it is greater than the frequency-domain background energy, the result is voice, and if it is less than or equal to the frequency-domain background energy, the result is non-voice;

Step 7: Compare the background noise energy with a system-set threshold one; if it is greater than threshold one, select the comparison result of step 6 as the voice detection result, and if it is less than or equal to threshold one, select the comparison result of step 5 as the voice detection result;

Step 8: If the current frame is detected as non-voice, send the time-domain short-time energy of the current frame to the background noise estimation unit for accumulation; after a first number of frames has been accumulated, divide the accumulated value by the first number of frames to obtain the new background noise energy as output. At the same time, send the frequency-domain short-time energy of the current frame to the frequency-domain background energy calculation unit for accumulation; after a second number of frames has been accumulated, divide the accumulated value by the second number of frames to obtain the new frequency-domain background energy as output.
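
The decision logic of steps 5 to 7 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function and parameter names are hypothetical, and the plain greater-than comparisons simply follow the wording of steps 5 and 6.

```python
def select_result(time_is_voice, freq_is_voice, noise_energy, threshold_one):
    """Step 7: use the frequency-domain result when the background is loud,
    otherwise use the time-domain result."""
    if noise_energy > threshold_one:
        return freq_is_voice
    return time_is_voice


def detect_frame(time_energy, freq_energy, noise_energy, freq_background,
                 threshold_one):
    """Return True if the frame is judged to be voice (steps 5 to 7)."""
    time_is_voice = time_energy > noise_energy      # step 5
    freq_is_voice = freq_energy > freq_background   # step 6
    return select_result(time_is_voice, freq_is_voice, noise_energy,
                         threshold_one)
```

With a quiet background the time-domain comparison decides the outcome; once the estimated noise energy exceeds threshold one, the frequency-domain comparison takes over.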

Voice energy is generally stable only over short periods, whereas background noise energy is stable over long periods. Comparing the time-domain short-time energy with the background noise energy therefore indicates how likely the current moment is to be voice in the time domain. Non-voice periods are usually much longer than voice periods, so the time-domain short-time energy can be regarded as sound energy that may contain both voice and background noise, while the time-domain long-time energy consists mainly of background noise energy. The more the time-domain short-time energy exceeds the time-domain long-time energy, the more likely the frame is voice. Because the long-time energy is calculated dynamically, the method adapts well to changes in background noise. Comparing the time-domain short-time energy with the background noise energy is best suited to quiet environments. To improve detection accuracy, the invention therefore combines this comparison with the comparison of the frequency-domain short-time energy against the frequency-domain background energy.

As a preferred embodiment of the invention, the comparison of the time-domain short-time energy with the background noise energy in step 5 is performed by subtracting the background noise energy from the time-domain short-time energy and comparing the difference with a system-set threshold two; if the difference is greater than threshold two the result is voice, and if it is less than or equal to threshold two the result is non-voice;

The comparison of the frequency-domain short-time energy with the frequency-domain background energy in step 6 is performed by subtracting the frequency-domain background energy from the frequency-domain short-time energy and comparing the difference with a system-set threshold three; if the difference is greater than threshold three the result is voice, and if it is less than or equal to threshold three the result is non-voice.
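
The difference-based comparison can be written as a single helper that serves both step 5 (with threshold two) and step 6 (with threshold three). The function name and the numeric thresholds below are illustrative assumptions; the patent leaves the threshold values to system configuration.

```python
def is_voice_by_difference(short_time_energy, background_energy, threshold):
    """Voice if the energy exceeds the background by more than `threshold`."""
    return short_time_energy - background_energy > threshold


# Hypothetical thresholds, in the same units as the energies.
THRESHOLD_TWO = 10.0     # time-domain difference threshold
THRESHOLD_THREE = 100.0  # frequency-domain difference threshold

time_is_voice = is_voice_by_difference(85.0, 5.0, THRESHOLD_TWO)
freq_is_voice = is_voice_by_difference(150.0, 90.0, THRESHOLD_THREE)
```

A difference exactly equal to the threshold counts as non-voice, matching the "less than or equal to" wording.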

As a preferred embodiment of the invention, the comparison of the time-domain short-time energy with the background noise energy in step 5 is performed by comparing the ratio of the time-domain short-time energy to the background noise energy with a system-set threshold four; if the ratio is greater than threshold four the result is voice, and if it is less than or equal to threshold four the result is non-voice;

The comparison of the frequency-domain short-time energy with the frequency-domain background energy in step 6 is performed by comparing the ratio of the frequency-domain short-time energy to the frequency-domain background energy with a system-set threshold five; if the ratio is greater than threshold five the result is voice, and if it is less than or equal to threshold five the result is non-voice.
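
A ratio-based comparison might look like the sketch below. The helper name and the `floor` parameter are assumptions added for this example: the patent does not say how a zero background energy is handled, so a small floor is used here to avoid division by zero.

```python
def is_voice_by_ratio(short_time_energy, background_energy, threshold,
                      floor=1e-12):
    """Voice if energy / background exceeds `threshold`.

    `floor` is an assumption of this sketch (not from the patent); it keeps
    the division defined when the background estimate is zero.
    """
    return short_time_energy / max(background_energy, floor) > threshold
```

One property of the ratio form is that the decision does not change when both energies are scaled by the same gain: energies of 6 and 2 give the same ratio as 600 and 200, whereas their differences are 4 and 400.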

As a preferred embodiment of the invention, the frequency range is the range in which human voice energy is mainly distributed. The spectrum of the human voice is fairly wide, and the voice band can be set with two parameters: an upper frequency threshold and a lower frequency threshold. Sound outside this range is usually background noise or other non-voice sound, so within this band the environmental noise energy is strongly suppressed. In general, voice energy is concentrated mainly between 300 Hz and 4000 Hz, while background noise energy is distributed mainly below 300 Hz. Because the comparison uses only the energy in the band where voice is mainly distributed, the frequency-domain short-time energy increases markedly in this band when voice is present. As with the time-domain short-time energy comparison, when the frequency-domain short-time energy compared with the frequency-domain background energy exceeds the system-set threshold three or threshold five, the period is most likely voice.
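
The band-limited energy of step 3 can be illustrated as follows. For brevity this sketch evaluates the discrete Fourier transform directly, one bin at a time, instead of using an FFT as the patent specifies; the bins are the same, only the cost differs. The function names, the 8000 Hz sample rate, and the 160-sample frame are assumptions of the example.

```python
import cmath
import math


def band_energy(frame, sample_rate, low_hz=300.0, high_hz=4000.0):
    """Sum |X[k]|^2 over the DFT bins whose frequency is in [low_hz, high_hz]."""
    n = len(frame)
    energy = 0.0
    for k in range(n // 2 + 1):          # non-negative frequencies only
        if low_hz <= k * sample_rate / n <= high_hz:
            coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                        for t, x in enumerate(frame))
            energy += abs(coeff) ** 2
    return energy


def tone(freq_hz, sample_rate=8000, length=160):
    """One frame of a unit-amplitude sine wave, used as a test signal."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate)
            for t in range(length)]


in_band = band_energy(tone(1000), 8000)    # tone inside 300-4000 Hz
out_of_band = band_energy(tone(100), 8000) # tone below the lower threshold
```

The 100 Hz tone, standing in for low-frequency background noise, contributes almost nothing to the band energy, while the 1000 Hz tone is counted in full.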

As a preferred embodiment of the invention, the frame duration is between 10 milliseconds and 50 milliseconds, and the first number of frames and the second number of frames are configured by the system.
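
As a sketch of how framing and the time-domain short-time energy of step 2 might be implemented, the helpers below split a sample sequence into frames of a configurable duration. Non-overlapping frames, the dropped trailing partial frame, and the function names are assumptions of this example; the patent only constrains the frame duration.

```python
def split_frames(samples, sample_rate, frame_ms=20):
    """Split samples into consecutive non-overlapping frames of frame_ms."""
    if not 10 <= frame_ms <= 50:
        raise ValueError("frame duration must be between 10 ms and 50 ms")
    frame_len = sample_rate * frame_ms // 1000
    last_start = len(samples) - frame_len
    # A trailing partial frame is dropped (an assumption of this sketch).
    return [samples[i:i + frame_len]
            for i in range(0, last_start + 1, frame_len)]


def time_energy(frame):
    """Time-domain short-time energy: the sum of squared samples."""
    return sum(x * x for x in frame)


frames = split_frames([1.0] * 400, sample_rate=8000, frame_ms=20)
energies = [time_energy(f) for f in frames]
```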

As a preferred embodiment of the invention, the background noise energy is the average obtained by accumulating the time-domain short-time energy of periods judged to be non-voice.

As a preferred embodiment of the invention, the frequency-domain background energy is the average obtained by accumulating the frequency-domain short-time energy of periods judged to be non-voice.
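
Both background estimates follow the same accumulate-then-average pattern, so one small class can serve as either unit. The class name, the initial estimate, and the choice to reset the accumulator after each output are assumptions of this sketch; the patent states only that the accumulated value is divided by the configured number of frames.

```python
class BackgroundEstimator:
    """Averages the energies of non-voice frames in blocks of frame_count."""

    def __init__(self, frame_count, initial_energy=0.0):
        self.frame_count = frame_count
        self.energy = initial_energy   # current background estimate
        self._accumulated = 0.0
        self._frames_seen = 0

    def update(self, non_voice_energy):
        """Add one non-voice frame; return the current estimate."""
        self._accumulated += non_voice_energy
        self._frames_seen += 1
        if self._frames_seen == self.frame_count:
            self.energy = self._accumulated / self.frame_count
            # Start a fresh block (assumed; not stated in the patent).
            self._accumulated = 0.0
            self._frames_seen = 0
        return self.energy


noise_unit = BackgroundEstimator(frame_count=4, initial_energy=1.0)
estimates = [noise_unit.update(e) for e in (2.0, 4.0, 6.0, 8.0)]
```

The estimate stays at its previous value until the fourth non-voice frame arrives, at which point it becomes the block average.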

Compared with prior art, beneficial effects of the present invention:

The invention detects voice separately from changes in the average voice energy in the time domain and in the frequency domain, and finally selects the better result according to the dynamically estimated background noise level, thereby greatly improving the accuracy of speech recognition and its adaptability to environmental change.

Brief description of the drawings

Fig. 1 is a flow chart of the present invention;

Fig. 2 is an operation block diagram of the present invention.

Specific embodiment

The invention is described in further detail below with reference to embodiments and specific implementations. This should not be understood as limiting the scope of the above subject matter of the invention to the following embodiments; all techniques realized on the basis of the content of the invention fall within its scope.

As shown in Fig. 1, a time-frequency domain adaptive voice detection method based on dynamic noise estimation comprises the following steps:

Step 1: Load the current frame data; the current frame data is time-domain voice data;

Step 2: Calculate the total energy of each frame of the time-domain voice data as the time-domain short-time energy, and transform each frame of the time-domain voice data into frequency-domain data by FFT;

Step 3: Select the sub-band data of the frequency-domain data within a certain frequency range, calculate the energy of that sub-band data, and accumulate it as the frequency-domain short-time energy;

Step 4: The background noise estimation unit calculates the background noise energy, and the frequency-domain background energy calculation unit calculates the frequency-domain background energy;

Step 5: Compare the time-domain short-time energy with the background noise energy; if it is greater than the background noise energy, the result is voice, and if it is less than or equal to the background noise energy, the result is non-voice;

Step 6: Compare the frequency-domain short-time energy with the frequency-domain background energy; if it is greater than the frequency-domain background energy, the result is voice, and if it is less than or equal to the frequency-domain background energy, the result is non-voice;

Step 7: Compare the background noise energy with a system-set threshold one; if it is greater than threshold one, select the comparison result of step 6 as the voice detection result, and if it is less than or equal to threshold one, select the comparison result of step 5 as the voice detection result;

Step 8: If the current frame is detected as non-voice, send the time-domain short-time energy of the current frame to the background noise estimation unit for accumulation; after a first number of frames has been accumulated, divide the accumulated value by the first number of frames to obtain the new background noise energy as output. At the same time, send the frequency-domain short-time energy of the current frame to the frequency-domain background energy calculation unit for accumulation; after a second number of frames has been accumulated, divide the accumulated value by the second number of frames to obtain the new frequency-domain background energy as output.

As shown in Fig. 1 and Fig. 2, the current frame data, which is time-domain voice data, is loaded first. After loading, the time-domain short-time energy is calculated; at the same time, the time-domain voice data is transformed into frequency-domain data by FFT, and the frequency-domain short-time energy is then calculated. The background noise estimation unit calculates the background noise energy, and the frequency-domain background energy calculation unit calculates the frequency-domain background energy. The time-domain short-time energy is compared with the background noise energy, and the frequency-domain short-time energy is compared with the frequency-domain background energy. This embodiment uses the difference method for both comparisons. The difference obtained by subtracting the background noise energy from the time-domain short-time energy is compared with the system-set threshold two: if the difference is greater than threshold two the result is voice, and if it is less than or equal to threshold two the result is non-voice. The difference obtained by subtracting the frequency-domain background energy from the frequency-domain short-time energy is compared with the system-set threshold three: if the difference is greater than threshold three the result is voice, and if it is less than or equal to threshold three the result is non-voice. Both comparison results are output. The background noise energy is then compared with the system-set threshold one: if it is greater than threshold one, the comparison result of step 6 is selected as the voice detection result; if it is less than or equal to threshold one, the comparison result of step 5 is selected. Results judged to be non-voice in step 5 and step 6 are delivered to the background noise energy estimation unit and the frequency-domain background energy calculation unit, respectively, to calculate the new background noise energy and the new frequency-domain background energy. In this embodiment, the frequency range in which human voice energy is mainly distributed is taken as 300 Hz to 4000 Hz, and the frame duration is between 10 milliseconds and 50 milliseconds.
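
The embodiment can be exercised end to end on a synthetic signal, as in the sketch below. Everything numeric here is an assumption made for illustration: the 8000 Hz sample rate, the 20 ms frame, the three thresholds, the use of 4 as both the first and second number of frames, and the zero initial background estimates. The code follows step 8 in updating both estimates whenever the final decision is non-voice, and it evaluates the DFT directly rather than with an FFT to stay dependency-free.

```python
import cmath
import math

SAMPLE_RATE = 8000        # Hz (assumed)
FRAME_LEN = 160           # 20 ms at 8 kHz
LOW_HZ, HIGH_HZ = 300.0, 4000.0
THRESHOLD_ONE = 50.0      # noise level above which step 6 decides
THRESHOLD_TWO = 10.0      # time-domain difference threshold
THRESHOLD_THREE = 100.0   # frequency-domain difference threshold
AVERAGING_FRAMES = 4      # used as both the first and second frame number


def time_energy(frame):
    return sum(x * x for x in frame)


def band_energy(frame):
    n = len(frame)
    total = 0.0
    for k in range(n // 2 + 1):
        if LOW_HZ <= k * SAMPLE_RATE / n <= HIGH_HZ:
            coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                        for t, x in enumerate(frame))
            total += abs(coeff) ** 2
    return total


def run_detector(frames):
    noise = freq_background = 0.0   # assumed initial estimates
    acc_time = acc_freq = 0.0
    count = 0
    decisions = []
    for frame in frames:
        e_time, e_freq = time_energy(frame), band_energy(frame)
        time_is_voice = e_time - noise > THRESHOLD_TWO              # step 5
        freq_is_voice = e_freq - freq_background > THRESHOLD_THREE  # step 6
        is_voice = freq_is_voice if noise > THRESHOLD_ONE else time_is_voice
        decisions.append(is_voice)
        if not is_voice:                                            # step 8
            acc_time += e_time
            acc_freq += e_freq
            count += 1
            if count == AVERAGING_FRAMES:
                noise, freq_background = acc_time / count, acc_freq / count
                acc_time = acc_freq = 0.0
                count = 0
    return decisions


def make_frame(with_tone):
    """Weak 100 Hz hum, plus a 1000 Hz tone standing in for voice."""
    frame = []
    for t in range(FRAME_LEN):
        sample = 0.1 * math.sin(2 * math.pi * 100 * t / SAMPLE_RATE)
        if with_tone:
            sample += math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE)
        frame.append(sample)
    return frame


frames = [make_frame(i in (4, 5)) for i in range(8)]
decisions = run_detector(frames)
```

Frames 4 and 5 carry the in-band tone; the remaining frames contain only the low-frequency hum that feeds the background estimates.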

Another embodiment uses the ratio method for both comparisons. The ratio of the time-domain short-time energy to the background noise energy is compared with the system-set threshold four: if the ratio is greater than threshold four the result is voice, and if it is less than or equal to threshold four the result is non-voice. The ratio of the frequency-domain short-time energy to the frequency-domain background energy is compared with the system-set threshold five: if the ratio is greater than threshold five the result is voice, and if it is less than or equal to threshold five the result is non-voice. The rest of the calculation is the same as in the previous embodiment and is not repeated here.

Other embodiments may also mix the two methods, for example comparing the difference between the time-domain short-time energy and the background noise energy with a system-set threshold six, and comparing the ratio of the frequency-domain short-time energy to the frequency-domain background energy with a system-set threshold seven.
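
One way to support such mixed variants is to pass the comparison rule for each domain as a parameter. The sketch below is an illustration only: the factory functions, the `floor` guard against division by zero, and the numeric values for thresholds one, six, and seven are all assumptions.

```python
def by_difference(threshold):
    """Rule: voice if energy - background > threshold."""
    return lambda energy, background: energy - background > threshold


def by_ratio(threshold, floor=1e-12):
    """Rule: voice if energy / background > threshold (floor is assumed)."""
    return lambda energy, background: energy / max(background, floor) > threshold


def detect_frame(time_energy, freq_energy, noise_energy, freq_background,
                 threshold_one, time_rule, freq_rule):
    time_is_voice = time_rule(time_energy, noise_energy)
    freq_is_voice = freq_rule(freq_energy, freq_background)
    return freq_is_voice if noise_energy > threshold_one else time_is_voice


time_rule = by_difference(10.0)  # hypothetical threshold six
freq_rule = by_ratio(3.0)        # hypothetical threshold seven

noisy_voice = detect_frame(80.0, 500.0, 60.0, 100.0, 50.0, time_rule, freq_rule)
noisy_quiet = detect_frame(80.0, 200.0, 60.0, 100.0, 50.0, time_rule, freq_rule)
calm_voice = detect_frame(80.0, 200.0, 20.0, 100.0, 50.0, time_rule, freq_rule)
```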

Claims

1. A time-frequency domain adaptive voice detection method based on dynamic noise estimation, comprising the following steps:

Step 1: Load the current frame data; the current frame data is time-domain voice data;

Step 2: Calculate the total energy of each frame of the time-domain voice data as the time-domain short-time energy, and transform each frame of the time-domain voice data into frequency-domain data by FFT;

Step 3: Select the sub-band data of the frequency-domain data within a certain frequency range, calculate the energy of that sub-band data, and accumulate it as the frequency-domain short-time energy;

Step 4: The background noise estimation unit calculates the background noise energy, and the frequency-domain background energy calculation unit calculates the frequency-domain background energy;

Step 5: Compare the time-domain short-time energy with the background noise energy; if it is greater than the background noise energy, the result is voice, and if it is less than or equal to the background noise energy, the result is non-voice;

Step 6: Compare the frequency-domain short-time energy with the frequency-domain background energy; if it is greater than the frequency-domain background energy, the result is voice, and if it is less than or equal to the frequency-domain background energy, the result is non-voice;

Step 7: Compare the background noise energy with a system-set threshold one; if it is greater than threshold one, select the comparison result of step 6 as the voice detection result, and if it is less than or equal to threshold one, select the comparison result of step 5 as the voice detection result;

Step 8: If the current frame is detected as non-voice, send the time-domain short-time energy of the current frame to the background noise estimation unit for accumulation; after a first number of frames has been accumulated, divide the accumulated value by the first number of frames to obtain the new background noise energy as output. At the same time, send the frequency-domain short-time energy of the current frame to the frequency-domain background energy calculation unit for accumulation; after a second number of frames has been accumulated, divide the accumulated value by the second number of frames to obtain the new frequency-domain background energy as output.

2. The time-frequency domain adaptive voice detection method based on dynamic noise estimation according to claim 1, characterized in that:

The comparison of the time-domain short-time energy with the background noise energy in step 5 is performed by subtracting the background noise energy from the time-domain short-time energy and comparing the difference with a system-set threshold two; if the difference is greater than threshold two the result is voice, and if it is less than or equal to threshold two the result is non-voice;

The comparison of the frequency-domain short-time energy with the frequency-domain background energy in step 6 is performed by subtracting the frequency-domain background energy from the frequency-domain short-time energy and comparing the difference with a system-set threshold three; if the difference is greater than threshold three the result is voice, and if it is less than or equal to threshold three the result is non-voice.

3. The time-frequency domain adaptive voice detection method based on dynamic noise estimation according to claim 1, characterized in that:

The comparison of the time-domain short-time energy with the background noise energy in step 5 is performed by comparing the ratio of the time-domain short-time energy to the background noise energy with a system-set threshold four; if the ratio is greater than threshold four the result is voice, and if it is less than or equal to threshold four the result is non-voice;

The comparison of the frequency-domain short-time energy with the frequency-domain background energy in step 6 is performed by comparing the ratio of the frequency-domain short-time energy to the frequency-domain background energy with a system-set threshold five; if the ratio is greater than threshold five the result is voice, and if it is less than or equal to threshold five the result is non-voice.

4. The time-frequency domain adaptive voice detection method based on dynamic noise estimation according to claim 1, characterized in that: the frequency range is the range in which human voice energy is mainly distributed, and the frequency range is determined by an upper frequency threshold and a lower frequency threshold.

5. The time-frequency domain adaptive voice detection method based on dynamic noise estimation according to claim 1, characterized in that: the frame duration is between 10 milliseconds and 50 milliseconds, and the first number of frames and the second number of frames are configured by the system.

6. The time-frequency domain adaptive voice detection method based on dynamic noise estimation according to claim 1, characterized in that: the background noise energy is the average obtained by accumulating the time-domain short-time energy of periods judged to be non-voice.

7. The time-frequency domain adaptive voice detection method based on dynamic noise estimation according to claim 1, characterized in that: the frequency-domain background energy is the average obtained by accumulating the frequency-domain short-time energy of periods judged to be non-voice.