CN106098076A - A time-frequency-domain adaptive voice detection method based on dynamic noise estimation - Google Patents
- Publication number
- CN106098076A CN106098076A CN201610393406.XA CN201610393406A CN106098076A CN 106098076 A CN106098076 A CN 106098076A CN 201610393406 A CN201610393406 A CN 201610393406A CN 106098076 A CN106098076 A CN 106098076A
- Authority
- CN
- China
- Prior art keywords
- energy
- frequency domain
- time
- voice
- threshold value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0224—Processing in the time domain
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0232—Processing in the frequency domain
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/60—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/75—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 for modelling vocal tract parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L2025/783—Detection of presence or absence of voice signals based on threshold decision
- G10L2025/786—Adaptive threshold
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Quality & Reliability (AREA)
- Telephonic Communication Services (AREA)
Abstract
The present invention relates to the fields of information processing and transduced-signal processing, and in particular to a time-frequency-domain adaptive automatic voice detection method based on dynamic noise estimation. The method detects voice separately from the time-domain short-time energy of the sound and from the change of the short-time energy within a certain frequency range, and finally selects the better of the two results according to the magnitude of the dynamically estimated background noise energy, thereby greatly improving the accuracy of speech recognition and its adaptability to environmental change.
Description
Technical field
The present invention relates to the fields of information processing and transduced-signal processing, and in particular to a time-frequency-domain adaptive voice detection method based on dynamic noise estimation.
Background art
Speech recognition is a focus of the artificial-intelligence field and is now widely applied in many areas. Voice detection is an essential part of any real-time speech recognition system; its purpose is to distinguish speech segments from non-speech segments in a complex real environment. Studies have shown that in practical applications a large part of the recognition errors stems from incorrect handling of the voice signal: large amounts of non-speech content severely degrade the accuracy of the speech recognition system, especially in application environments with much noise. Correct voice detection effectively reduces the amount of system computation, shortens processing time, lowers the transmit power of mobile terminals, saves channel resources, and improves recognition accuracy. Under complex background noise in particular, the performance of a speech recognition system depends largely on the quality of its voice detection, so a robust, accurate, real-time, and highly adaptive voice detection technique is indispensable to every speech recognition system.
When speech recognition is currently applied on mobile terminals, in particular mobile phones or voice remote controls, the start and end of speech are determined mainly by pressing a button. This is very inconvenient for far-field use, and for far-field or hands-free smart devices and robots that support speech recognition, an automatic voice detection system is an indispensable component.
The mainstream approaches to automatic voice detection rely on three quantities: the short-time energy in the time domain, the zero-crossing rate, and the mean square deviation of the frequency-band energy in the frequency domain. In each case the quantity is computed per frame and compared with an empirical threshold. Experiments show that comparing the short-time energy or the zero-crossing rate alone adapts poorly to noisy environments, particularly because the background noise of a given environment also changes when the application environment changes, while the frequency-band energy mean-square-deviation method adapts poorly to quiet environments.
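The three mainstream per-frame quantities named above can be sketched in Python. This is an illustrative sketch only: the band-splitting scheme in `band_energy_std` is one common variant rather than a definition from this patent, and the sampling rate and band count are assumed values.

```python
import numpy as np

def short_time_energy(frame):
    """Sum of squared samples of one frame."""
    return np.sum(np.asarray(frame, dtype=float) ** 2)

def zero_crossing_rate(frame):
    """Fraction of sample-to-sample transitions that cross zero."""
    x = np.asarray(frame, dtype=float)
    return np.mean(np.abs(np.diff(np.sign(x)))) / 2.0

def band_energy_std(frame, fs=16000, n_bands=8):
    """Spread of per-band spectral energies (one common variant of the
    frequency-band energy mean-square-deviation measure)."""
    spec = np.abs(np.fft.rfft(frame)) ** 2
    bands = np.array_split(spec, n_bands)
    return np.std([b.sum() for b in bands])
```

Each quantity would then be compared against an empirical threshold, which is exactly the fragility the patent targets: a fixed threshold cannot track a changing noise floor.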
To solve the above problems, a method is needed that detects voice separately from the changes of the average voice energy in the time domain and in the frequency domain, and finally selects the better result according to the dynamically estimated background noise level, thereby greatly improving the accuracy of speech recognition and its adaptability to environmental change.
Summary of the invention
The object of the present invention is to overcome the above shortcomings of the prior art and to provide a voice detection method that greatly improves the accuracy of speech recognition and adapts to environmental change.
In order to achieve the above object, the invention provides the following technical scheme.
A time-frequency-domain adaptive voice detection method based on dynamic noise estimation, comprising the following steps:
Step 1: load the current frame data; the current frame data is time-domain speech data.
Step 2: compute the total energy of each frame of the time-domain speech data as the time-domain short-time energy, and transform each frame of the time-domain speech data into frequency-domain data by FFT.
Step 3: select the sub-band data of a certain frequency range from the frequency-domain data, and accumulate the energy of that sub-band data as the frequency-domain short-time energy.
Step 4: compute the background noise energy in a background noise estimation unit, and compute the frequency-domain background energy in a frequency-domain background energy computation unit.
Step 5: compare the time-domain short-time energy with the background noise energy; if it is greater than the background noise energy the frame is voice, and if it is less than or equal to the background noise energy the frame is non-voice.
Step 6: compare the frequency-domain short-time energy with the frequency-domain background energy; if it is greater than the frequency-domain background energy the frame is voice, and if it is less than or equal to the frequency-domain background energy the frame is non-voice.
Step 7: compare the background noise energy with a preset threshold one; if it is greater than threshold one, select the result of step 6, and if it is less than or equal to threshold one, select the result of step 5.
Step 8: if the current frame is detected as non-voice, send its time-domain short-time energy to the background noise estimation unit for accumulation; after a first frame number has been accumulated, divide the accumulated value by the first frame number to obtain the new background noise energy as output. At the same time, send the frequency-domain short-time energy of the current frame to the frequency-domain background energy computation unit for accumulation; after a second frame number has been accumulated, divide the accumulated value by the second frame number to obtain the new frequency-domain background energy as output.
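The eight steps above can be sketched end-to-end in Python. This is a minimal illustrative sketch, not the patented implementation: the sampling rate, frame length, initial background estimates, the value of threshold one, and the first and second frame numbers are all assumed values.

```python
import numpy as np

FS = 16000            # sampling rate (assumed)
FRAME = 320           # 20 ms frames, within the 10-50 ms range in the description
BAND = (300, 4000)    # speech band from the description, in Hz
THRESH_ONE = 1e-2     # threshold one: switches between the two detectors (assumed)
N1 = N2 = 50          # first and second frame numbers (assumed)

def subband_energy(frame):
    """Steps 2-3: FFT the frame and accumulate energy inside the speech band."""
    spec = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / FS)
    mask = (freqs >= BAND[0]) & (freqs <= BAND[1])
    return spec[mask].sum()

def detect(frames):
    noise_e, bg_f = 1e-3, 1e-3          # initial background estimates (assumed)
    acc_t = acc_f = 0.0
    n_t = n_f = 0
    decisions = []
    for frame in frames:                 # step 1: load each frame
        e_t = np.sum(frame ** 2)         # step 2: time-domain short-time energy
        e_f = subband_energy(frame)      # step 3: frequency-domain short-time energy
        is_voice_t = e_t > noise_e       # step 5
        is_voice_f = e_f > bg_f          # step 6
        # step 7: in loud environments trust the frequency-domain detector
        is_voice = is_voice_f if noise_e > THRESH_ONE else is_voice_t
        decisions.append(is_voice)
        if not is_voice:                 # step 8: update the background estimates
            acc_t += e_t; n_t += 1
            acc_f += e_f; n_f += 1
            if n_t == N1:
                noise_e, acc_t, n_t = acc_t / N1, 0.0, 0
            if n_f == N2:
                bg_f, acc_f, n_f = acc_f / N2, 0.0, 0
    return decisions
```

A short run on synthetic data shows the expected behaviour: silent frames are classified as non-voice, while a loud tone frame is classified as voice.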
Ordinary speech energy is stationary only over short periods, whereas the background noise energy is stationary over long periods. The time-domain short-time energy is compared with the background noise energy, and the comparison result serves as the probability that the current moment is voice. Non-voice periods are generally much longer than voice periods, because the time-domain short-time energy can be regarded as sound energy that may contain both voice and background noise, while the long-term time-domain energy consists mainly of background noise energy. The more the time-domain short-time energy exceeds the long-term energy, the greater the probability of voice; and because the long-term energy is computed dynamically, the method adapts well to changes in environmental noise. Comparing the time-domain short-time energy with the background noise energy is better suited to quiet environments, so in order to improve detection accuracy, voice detection is performed by combining this comparison with the comparison of the frequency-domain short-time energy against the frequency-domain background energy, which improves the accuracy of voice detection.
As a preferred version of the present invention, the time-domain short-time energy is compared with the background noise energy in step 5 by subtracting the background noise energy from the time-domain short-time energy and comparing the difference with a preset threshold two: if the difference exceeds threshold two the result is voice, and if it is less than or equal to threshold two the result is non-voice.
The frequency-domain short-time energy is compared with the frequency-domain background energy in step 6 by subtracting the frequency-domain background energy from the frequency-domain short-time energy and comparing the difference with a preset threshold three: if the difference exceeds threshold three the result is voice, and if it is less than or equal to threshold three the result is non-voice.
As a preferred version of the present invention, the time-domain short-time energy is compared with the background noise energy in step 5 by comparing the ratio of the time-domain short-time energy to the background noise energy with a preset threshold four: if the ratio exceeds threshold four the result is voice, and if it is less than or equal to threshold four the result is non-voice.
The frequency-domain short-time energy is compared with the frequency-domain background energy in step 6 by comparing the ratio of the frequency-domain short-time energy to the frequency-domain background energy with a preset threshold five: if the ratio exceeds threshold five the result is voice, and if it is less than or equal to threshold five the result is non-voice.
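Both preferred comparison rules reduce to one-line predicates. A sketch, with the threshold values chosen arbitrarily for illustration:

```python
THRESH_TWO = 0.5    # difference threshold (assumed value)
THRESH_FOUR = 3.0   # ratio threshold (assumed value)

def is_voice_by_difference(short_energy, background_energy, thresh=THRESH_TWO):
    """Preferred version 1: short-time energy minus background vs. a preset threshold."""
    return (short_energy - background_energy) > thresh

def is_voice_by_ratio(short_energy, background_energy, thresh=THRESH_FOUR):
    """Preferred version 2: short-time energy over background vs. a preset threshold."""
    return short_energy / background_energy > thresh
```

The same two predicates apply unchanged to the frequency-domain pair (with thresholds three and five in place of two and four).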
As a preferred version of the present invention, the frequency range is the band over which human speech energy is mainly distributed. The human voice spectrum is relatively wide, and the band can be set by two parameters: an upper frequency threshold and an lower frequency threshold. Sound outside this frequency range is usually environmental noise or other non-voice, so within this band the environmental noise energy is strongly suppressed. In general, human speech energy is concentrated mainly between 300 Hz and 4000 Hz, while background noise energy is distributed mainly below 300 Hz, so the energy of the band over which voice is mainly distributed is used for the comparison. Within this band the frequency-domain short-time energy rises markedly when voice is present; therefore, analogously to the time-domain comparison, the frequency-domain short-time energy is compared with the frequency-domain background energy, and if it exceeds the system-set threshold three or threshold five, the period is very likely voice.
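Restricting the frequency-domain short-time energy to the band between the lower and upper frequency thresholds amounts to summing the FFT bins that fall inside it. A sketch, assuming a 16 kHz sampling rate:

```python
import numpy as np

def band_energy(frame, fs=16000, f_lo=300.0, f_hi=4000.0):
    """Accumulate spectral energy between the lower and upper frequency thresholds."""
    spec = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    return spec[(freqs >= f_lo) & (freqs <= f_hi)].sum()
```

A 1 kHz tone (inside the band) yields far more band energy than a 100 Hz tone (below the lower threshold), which is exactly the suppression of low-frequency background noise that the description relies on.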
As a preferred version of the present invention, the duration of each frame is between 10 and 50 milliseconds, and the first frame number and the second frame number are configured by the system.
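Splitting the input into frames of this duration is straightforward. A sketch using non-overlapping 20 ms frames at an assumed 16 kHz sampling rate:

```python
import numpy as np

def split_frames(signal, fs=16000, frame_ms=20):
    """Split a 1-D signal into non-overlapping frames of frame_ms milliseconds."""
    n = int(fs * frame_ms / 1000)            # samples per frame
    usable = (len(signal) // n) * n          # drop the incomplete tail frame
    return np.asarray(signal[:usable]).reshape(-1, n)
```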
As a preferred version of the present invention, the background noise energy is the average of the accumulated time-domain short-time energies of frames judged to be non-voice.
As a preferred version of the present invention, the frequency-domain background energy is the average of the accumulated frequency-domain short-time energies of frames judged to be non-voice.
Compared with the prior art, the beneficial effects of the present invention are:
The present invention detects voice separately from the changes of the average voice energy in the time domain and in the frequency domain, and finally selects the better result according to the dynamically estimated background noise level, thereby greatly improving the accuracy of speech recognition and its adaptability to environmental change.
Brief description of the drawings
Fig. 1 is a flow chart of the present invention;
Fig. 2 is an operating block diagram of the present invention.
Detailed description of the invention
The present invention is described in further detail below with reference to the embodiments and specific implementations, but this should not be understood as limiting the scope of the above subject matter of the present invention to the examples below; all techniques realized on the basis of the present invention fall within the scope of the invention.
As shown in Fig. 1, a time-frequency-domain adaptive voice detection method based on dynamic noise estimation comprises the following steps:
Step 1: load the current frame data; the current frame data is time-domain speech data.
Step 2: compute the total energy of each frame of the time-domain speech data as the time-domain short-time energy, and transform each frame of the time-domain speech data into frequency-domain data by FFT.
Step 3: select the sub-band data of a certain frequency range from the frequency-domain data, and accumulate the energy of that sub-band data as the frequency-domain short-time energy.
Step 4: compute the background noise energy in a background noise estimation unit, and compute the frequency-domain background energy in a frequency-domain background energy computation unit.
Step 5: compare the time-domain short-time energy with the background noise energy; if it is greater than the background noise energy the frame is voice, and if it is less than or equal to the background noise energy the frame is non-voice.
Step 6: compare the frequency-domain short-time energy with the frequency-domain background energy; if it is greater than the frequency-domain background energy the frame is voice, and if it is less than or equal to the frequency-domain background energy the frame is non-voice.
Step 7: compare the background noise energy with a preset threshold one; if it is greater than threshold one, select the result of step 6, and if it is less than or equal to threshold one, select the result of step 5.
Step 8: if the current frame is detected as non-voice, send its time-domain short-time energy to the background noise estimation unit for accumulation; after a first frame number has been accumulated, divide the accumulated value by the first frame number to obtain the new background noise energy as output. At the same time, send the frequency-domain short-time energy of the current frame to the frequency-domain background energy computation unit for accumulation; after a second frame number has been accumulated, divide the accumulated value by the second frame number to obtain the new frequency-domain background energy as output.
As shown in Fig. 1 and Fig. 2, the current frame of time-domain speech data is loaded first. The time-domain short-time energy is then computed; while it is being computed, the time-domain speech data is transformed into frequency-domain data by FFT and the frequency-domain short-time energy is computed. The background noise estimation unit computes the background noise energy and the frequency-domain background energy computation unit computes the frequency-domain background energy. The time-domain short-time energy is compared with the background noise energy, and the frequency-domain short-time energy with the frequency-domain background energy. In this embodiment the difference method is used for both: the difference of the time-domain short-time energy minus the background noise energy is compared with the preset threshold two (voice if greater than threshold two, non-voice if less than or equal), and the difference of the frequency-domain short-time energy minus the frequency-domain background energy is compared with the preset threshold three (voice if greater than threshold three, non-voice if less than or equal). Both comparison results are output, and the background noise energy is compared with the threshold one set by the system: if it exceeds threshold one, the result of step 6 is selected, and if it is less than or equal to threshold one, the result of step 5 is selected. Whenever the comparison result of step 5 or step 6 is non-voice, the corresponding energy is sent to the background noise estimation unit or the frequency-domain background energy computation unit to compute the new background noise energy and the new frequency-domain background energy. In this embodiment the band over which human speech energy is mainly distributed is taken as 300 Hz to 4000 Hz, and the duration of each frame is between 10 and 50 milliseconds.
In another embodiment the ratio method is used: the ratio of the time-domain short-time energy to the background noise energy is compared with the preset threshold four (voice if greater than threshold four, non-voice if less than or equal), and the ratio of the frequency-domain short-time energy to the frequency-domain background energy is compared with the preset threshold five (voice if greater than threshold five, non-voice if less than or equal). The remaining computations are identical to those of the previous embodiment and are not repeated here.
In still other embodiments, mixed methods may be used, for example comparing the difference of the time-domain short-time energy and the background noise energy with a system-set threshold six while comparing the ratio of the frequency-domain short-time energy to the frequency-domain background energy with a preset threshold seven.
Claims (7)
1. A time-frequency-domain adaptive voice detection method based on dynamic noise estimation, comprising the following steps:
Step 1: loading current frame data, the current frame data being time-domain speech data;
Step 2: computing the total energy of each frame of the time-domain speech data as a time-domain short-time energy, and transforming each frame of the time-domain speech data into frequency-domain data by FFT;
Step 3: selecting sub-band data of a certain frequency range from the frequency-domain data, and accumulating the energy of the sub-band data as a frequency-domain short-time energy;
Step 4: computing a background noise energy in a background noise energy estimation unit, and computing a frequency-domain background energy in a frequency-domain background energy computation unit;
Step 5: comparing the time-domain short-time energy with the background noise energy, the result being voice if it is greater than the background noise energy and non-voice if it is less than or equal to the background noise energy;
Step 6: comparing the frequency-domain short-time energy with the frequency-domain background energy, the result being voice if it is greater than the frequency-domain background energy and non-voice if it is less than or equal to the frequency-domain background energy;
Step 7: comparing the background noise energy with a preset threshold one, selecting the result of step 6 if it is greater than threshold one, and selecting the result of step 5 if it is less than or equal to threshold one;
Step 8: if the current frame is detected as non-voice, sending the time-domain short-time energy of the current frame to the background noise estimation unit for accumulation and, after a first frame number has been accumulated, dividing the accumulated value by the first frame number to obtain a new background noise energy as output; simultaneously sending the frequency-domain short-time energy of the current frame to the frequency-domain background energy computation unit for accumulation and, after a second frame number has been accumulated, dividing the accumulated value by the second frame number to obtain a new frequency-domain background energy as output.
2. The time-frequency-domain adaptive voice detection method based on dynamic noise estimation according to claim 1, characterised in that:
the time-domain short-time energy is compared with the background noise energy in step 5 by subtracting the background noise energy from the time-domain short-time energy and comparing the difference with a preset threshold two, the result being voice if greater than threshold two and non-voice if less than or equal to threshold two;
the frequency-domain short-time energy is compared with the frequency-domain background energy in step 6 by subtracting the frequency-domain background energy from the frequency-domain short-time energy and comparing the difference with a preset threshold three, the result being voice if greater than threshold three and non-voice if less than or equal to threshold three.
3. The time-frequency-domain adaptive voice detection method based on dynamic noise estimation according to claim 1, characterised in that:
the time-domain short-time energy is compared with the background noise energy in step 5 by comparing the ratio of the time-domain short-time energy to the background noise energy with a preset threshold four, the result being voice if greater than threshold four and non-voice if less than or equal to threshold four;
the frequency-domain short-time energy is compared with the frequency-domain background energy in step 6 by comparing the ratio of the frequency-domain short-time energy to the frequency-domain background energy with a preset threshold five, the result being voice if greater than threshold five and non-voice if less than or equal to threshold five.
4. The time-frequency-domain adaptive voice detection method based on dynamic noise estimation according to claim 1, characterised in that:
the frequency range is the band over which human speech energy is mainly distributed, and is determined by an upper frequency threshold and a lower frequency threshold.
5. The time-frequency-domain adaptive voice detection method based on dynamic noise estimation according to claim 1, characterised in that:
the duration of each frame is between 10 and 50 milliseconds, and the first frame number and the second frame number are configured by the system.
6. The time-frequency-domain adaptive voice detection method based on dynamic noise estimation according to claim 1, characterised in that:
the background noise energy is the average of the accumulated time-domain short-time energies of frames judged to be non-voice.
7. The time-frequency-domain adaptive voice detection method based on dynamic noise estimation according to claim 1, characterised in that:
the frequency-domain background energy is the average of the accumulated frequency-domain short-time energies of frames judged to be non-voice.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610393406.XA CN106098076B (en) | 2016-06-06 | 2016-06-06 | A time-frequency-domain adaptive voice detection method based on dynamic noise estimation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610393406.XA CN106098076B (en) | 2016-06-06 | 2016-06-06 | A time-frequency-domain adaptive voice detection method based on dynamic noise estimation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106098076A true CN106098076A (en) | 2016-11-09 |
CN106098076B CN106098076B (en) | 2019-05-21 |
Family
ID=57447624
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610393406.XA Active CN106098076B (en) | 2016-06-06 | A time-frequency-domain adaptive voice detection method based on dynamic noise estimation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106098076B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101635865A (en) * | 2008-07-22 | 2010-01-27 | 中兴通讯股份有限公司 | System and method for preventing error detection of dual-tone multi-frequency signals |
CN101968957A (en) * | 2010-10-28 | 2011-02-09 | 哈尔滨工程大学 | Voice detection method under noise condition |
WO2015085532A1 (en) * | 2013-12-12 | 2015-06-18 | Spreadtrum Communications (Shanghai) Co., Ltd. | Signal noise reduction |
CN105575405A (en) * | 2014-10-08 | 2016-05-11 | 展讯通信(上海)有限公司 | Double-microphone voice active detection method and voice acquisition device |
CN105118502A (en) * | 2015-07-14 | 2015-12-02 | 百度在线网络技术(北京)有限公司 | End point detection method and system of voice identification system |
Non-Patent Citations (1)
Title |
---|
李灵光 (Li Lingguang): "一种时频结合的抗噪性端点检测算法" (A noise-robust endpoint detection algorithm combining time and frequency domains), 《计算机与现代化》 (Computer and Modernization) *
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106531180B (en) * | 2016-12-10 | 2019-09-20 | 广州酷狗计算机科技有限公司 | Noise detecting method and device |
CN106531180A (en) * | 2016-12-10 | 2017-03-22 | 广州酷狗计算机科技有限公司 | Noise detection method and device |
CN106448696A (en) * | 2016-12-20 | 2017-02-22 | 成都启英泰伦科技有限公司 | Adaptive high-pass filtering speech noise reduction method based on background noise estimation |
CN107833579A (en) * | 2017-10-30 | 2018-03-23 | 广州酷狗计算机科技有限公司 | Noise cancellation method, device and computer-readable recording medium |
CN107833579B (en) * | 2017-10-30 | 2021-06-11 | 广州酷狗计算机科技有限公司 | Noise elimination method, device and computer readable storage medium |
CN108986830A (en) * | 2018-08-28 | 2018-12-11 | 安徽淘云科技有限公司 | A kind of audio corpus screening technique and device |
CN108986830B (en) * | 2018-08-28 | 2021-02-09 | 安徽淘云科技有限公司 | Audio corpus screening method and device |
CN111261143A (en) * | 2018-12-03 | 2020-06-09 | 杭州嘉楠耘智信息科技有限公司 | Voice wake-up method and device and computer readable storage medium |
CN111261143B (en) * | 2018-12-03 | 2024-03-22 | 嘉楠明芯(北京)科技有限公司 | Voice wakeup method and device and computer readable storage medium |
CN110021305A (en) * | 2019-01-16 | 2019-07-16 | 上海惠芽信息技术有限公司 | A kind of audio filtering method, audio filter and wearable device |
CN110021305B (en) * | 2019-01-16 | 2021-08-20 | 上海惠芽信息技术有限公司 | Audio filtering method, audio filtering device and wearable equipment |
CN109616098A (en) * | 2019-02-15 | 2019-04-12 | 北京嘉楠捷思信息技术有限公司 | Voice endpoint detection method and device based on frequency domain energy |
WO2020252782A1 (en) * | 2019-06-21 | 2020-12-24 | 深圳市汇顶科技股份有限公司 | Voice detection method, voice detection device, voice processing chip and electronic apparatus |
US11322174B2 (en) | 2019-06-21 | 2022-05-03 | Shenzhen GOODIX Technology Co., Ltd. | Voice detection from sub-band time-domain signals |
WO2021135547A1 (en) * | 2020-07-24 | 2021-07-08 | 平安科技(深圳)有限公司 | Human voice detection method, apparatus, device, and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN106098076B (en) | 2019-05-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106098076A (en) | A kind of based on dynamic noise estimation time-frequency domain adaptive voice detection method | |
CN106373587B (en) | Automatic acoustic feedback detection and removing method in a kind of real-time communication system | |
Moattar et al. | A simple but efficient real-time voice activity detection algorithm | |
CN105118502B (en) | End point detection method and system of voice identification system | |
CN103366739B (en) | Towards self-adaptation end-point detecting method and the system thereof of alone word voice identification | |
CN105810201B (en) | Voice activity detection method and its system | |
CN103325386A (en) | Method and system for signal transmission control | |
CN102314884B (en) | Voice-activation detecting method and device | |
JP6635440B2 (en) | Acquisition method of voice section correction frame number, voice section detection method and apparatus | |
CN103886871A (en) | Detection method of speech endpoint and device thereof | |
CN110047470A (en) | A kind of sound end detecting method | |
US20160077792A1 (en) | Methods and apparatus for unsupervised wakeup | |
WO2004075167A2 (en) | Log-likelihood ratio method for detecting voice activity and apparatus | |
CN106303878A (en) | One is uttered long and high-pitched sounds and is detected and suppressing method | |
CN104867499A (en) | Frequency-band-divided wiener filtering and de-noising method used for hearing aid and system thereof | |
CN105810214B (en) | Voice-activation detecting method and device | |
CN106504760B (en) | Broadband ambient noise and speech Separation detection system and method | |
CN110265058A (en) | Estimate the ambient noise in audio signal | |
US11341988B1 (en) | Hybrid learning-based and statistical processing techniques for voice activity detection | |
Hirszhorn et al. | Transient interference suppression in speech signals based on the OM-LSA algorithm | |
CN108847218B (en) | Self-adaptive threshold setting voice endpoint detection method, equipment and readable storage medium | |
CN106486133B (en) | One kind is uttered long and high-pitched sounds scene recognition method and equipment | |
EP3195314B1 (en) | Methods and apparatus for unsupervised wakeup | |
CN111755028A (en) | Near-field remote controller voice endpoint detection method and system based on fundamental tone characteristics | |
CN111128244A (en) | Short wave communication voice activation detection method based on zero crossing rate detection |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||