CN106504760B

CN106504760B - Broadband background noise and speech separation detection system and method

Info

Publication number: CN106504760B
Application number: CN201610947596.5A
Authority: CN
Inventors: 何云鹏
Original assignee: Chengdu Leader Technology Co Ltd
Current assignee: Chengdu Leader Technology Co Ltd
Priority date: 2016-10-26
Filing date: 2016-10-26
Publication date: 2019-04-26
Anticipated expiration: 2036-10-26
Also published as: CN106504760A

Abstract

The present invention relates to the fields of information processing and sensor signal processing, and in particular to a broadband background noise and speech separation detection system. The system comprises a current-frame time/frequency-domain energy calculation circuit, a background noise calculation circuit, a time-domain speech detection long/short-time average energy comparison circuit, a frequency-domain speech detection long/short-time frequency-domain energy comparison circuit, a background noise comparison circuit, a sub-band energy distribution uniformity speech detection circuit, and a speech frame count statistics circuit. The invention also discloses a broadband background noise and speech separation detection method. The invention uses three-level speech detection, which detects both high-frequency and low-frequency background noise well, also detects occasional, discontinuous noise very well, and greatly improves the accuracy of speech detection in complex noise environments.

Description

Broadband background noise and speech separation detection system and method

Technical field

The present invention relates to the fields of information processing and sensor signal processing, and in particular to a broadband background noise and speech separation detection system and method.

Background art

Speech recognition is a hot topic in artificial intelligence applications and is beginning to be widely applied in many fields. Speech detection is an important part of implementing a speech recognition system in real time; its purpose is to distinguish speech segments from non-speech segments in complex real-world environments. The literature indicates that, in practical applications, a large share of low recognition rates is caused by speech not being handled correctly: large amounts of non-speech noise seriously degrade the accuracy of a speech recognition system, especially when the application environment is noisy. Correct speech detection can effectively reduce the system's computational load, shorten processing time, reduce mobile terminal transmit power, save channel resources, and improve recognition accuracy. Under complex background noise in particular, the performance of a speech recognition system depends to a large extent on the quality of its speech detection. A speech detection technique that is robust, accurate, real-time, and highly adaptive is therefore necessary for every speech recognition system.

The mainstream approaches to automatic speech endpoint detection currently rely on three measures: short-time energy in the time domain, zero-crossing rate in the time domain, and the mean square deviation of band energy in the frequency domain. In each case the measure is computed and then compared with an empirical threshold. Experiments show that comparing short-time energy or zero-crossing rate alone adapts poorly to noisy environments, especially because the application environment can change and the background noise within the same environment can also change over time, while the band-energy mean square deviation method adapts poorly to quiet environments.

Speech can also be detected separately from the variation of the average speech energy in the time domain and in the spectral domain, with the better result finally selected according to the dynamically estimated background noise level. This greatly improves recognition accuracy and adaptability to environmental change. Because the energy of most stationary background noise is concentrated in the low-frequency range, this method is very effective against most low-frequency noise. However, sounds emitted by objects or animals, such as birdsong, car horns, and pianos and other musical instruments being played, have a wider frequency distribution that falls within the same band as human speech, so the above method easily misjudges such noise as speech. Distinguishing this type of noise is both very important and one of the difficulties for speech detection, speech noise reduction, and speech recognition.

To solve the above problems, a broadband background noise and speech separation detection system and method is needed, one proposed on the basis of extensive experimental analysis and theoretical study of the frequency-domain and time-domain characteristics of broadband non-speech noise.

Summary of the invention

The object of the invention is to overcome the above deficiencies of the prior art and to provide a broadband background noise and speech separation detection system and method that greatly improves adaptability to all kinds of background noise and the accuracy of automatic speech detection.

To achieve the above object, the present invention provides the following technical solutions.

A broadband background noise and speech separation detection system comprises: a current-frame time/frequency-domain energy calculation circuit; a background noise calculation circuit, a time-domain speech detection long/short-time average energy comparison circuit, and a frequency-domain speech detection long/short-time frequency-domain energy comparison circuit, each connected to the current-frame time/frequency-domain energy calculation circuit; a background noise comparison circuit connected to the background noise calculation circuit, the time-domain speech detection long/short-time average energy comparison circuit, and the frequency-domain speech detection long/short-time frequency-domain energy comparison circuit; a sub-band energy distribution uniformity speech detection circuit connected to both the time-domain speech detection long/short-time average energy comparison circuit and the frequency-domain speech detection long/short-time frequency-domain energy comparison circuit; and a speech frame count statistics circuit connected to the sub-band energy distribution uniformity speech detection circuit. The background noise calculation circuit is also connected to each of the sub-band energy distribution uniformity speech detection circuit, the speech frame count statistics circuit, the time-domain speech detection long/short-time average energy comparison circuit, and the frequency-domain speech detection long/short-time frequency-domain energy comparison circuit.

As a preferred solution of the present invention, the speech frame count statistics circuit consists of time-width filters, which count the number of speech frames; the number of time-width filters is greater than or equal to 1.

The invention also discloses a broadband background noise and speech separation detection method, comprising the following steps:

Step 1: Load the speech data and process it frame by frame. The speech data is time-domain speech data, and the frame duration is configurable, typically between 10 ms and 50 ms;

Step 2: Calculate the time-domain short-time energy and the time-domain long-term average energy. The time-domain short-time energy is the energy sum of the current frame of the time-domain speech data; the time-domain long-term average energy is obtained by accumulating the time-domain short-time energy of multiple frames and dividing by the number of frames accumulated;

Step 3: Apply an FFT (fast Fourier transform) to the current frame of the time-domain speech data, transforming it into frequency-domain sub-band speech data;

Step 4: Calculate the frequency-domain short-time energy and the frequency-domain long-term average energy. The frequency-domain short-time energy is obtained by summing the sub-band energies of the current frame over the frequency range in which speech energy is mainly distributed; the frequency-domain long-term average energy is obtained by accumulating the frequency-domain short-time energy of multiple frames and dividing by the number of frames accumulated;

Step 5: Accumulate the background noise. The time-domain short-time energy of each non-speech frame is fed to the background noise estimation unit for accumulation, and a new background noise value is output each time a set number of frames has been accumulated;

Step 6: Compare the background noise with a preset threshold 1. If it is greater than threshold 1, carry out step 7; if it is less than threshold 1, carry out step 8;

Step 7: Carry out frequency-domain speech detection. If the frame is speech, go to step 9; if it is not speech, carry out step 5 and step 11;

Step 8: Carry out time-domain speech detection. If the frame is speech, go to step 9; if it is not speech, carry out step 5 and step 11;

Step 9: Carry out frequency-domain sub-band energy distribution uniformity detection. If the frame is speech, go to step 10; if it is not speech, carry out step 5 and step 11;

Step 10: The time-width filter counts the speech frames produced by step 9 and compares the count with a preset threshold 2. If the frame count is greater than threshold 2, go directly to step 11; if the frame count is less than threshold 2, go to step 5 and step 11;

Step 11: Output the detection result; detection ends.
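Steps 1 and 2 can be illustrated with a minimal Python sketch. It is not the patented circuit: the function and class names are hypothetical, frames are assumed to be non-overlapping, and the long-term average is assumed to be taken over a sliding window of the most recent frames (the text only says that the short-time energy of multiple frames is accumulated and divided by the frame count).

```python
from collections import deque


def split_frames(samples, frame_len):
    """Step 1: cut time-domain samples into consecutive, non-overlapping frames.

    Trailing samples that do not fill a whole frame are dropped (an assumption).
    """
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def short_time_energy(frame):
    """Step 2: time-domain short-time energy, the energy sum of one frame."""
    return sum(x * x for x in frame)


class LongTermAverage:
    """Step 2: average of the short-time energies of the last `history` frames."""

    def __init__(self, history):
        # `history` is a hypothetical window length, not given in the source.
        self.energies = deque(maxlen=history)

    def update(self, energy):
        self.energies.append(energy)
        return sum(self.energies) / len(self.energies)


frames = split_frames([1, 1, 2, 2, 3, 3, 4], frame_len=2)
energies = [short_time_energy(f) for f in frames]
tracker = LongTermAverage(history=2)
averages = [tracker.update(e) for e in energies]
```

The same `LongTermAverage` helper could track the frequency-domain long-term average energy of step 4 by feeding it frequency-domain short-time energies instead.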

As a preferred solution of the present invention, the frequency-domain speech detection compares the frequency-domain short-time energy with the frequency-domain long-term average energy. If the frequency-domain short-time energy exceeds the frequency-domain long-term average energy by a certain margin, the frame is speech; otherwise it is non-speech. When the frame is judged to be non-speech, the result is output and detection ends.

As a preferred solution of the present invention, the time-domain speech detection compares the time-domain short-time energy with the time-domain long-term average energy. If the time-domain short-time energy exceeds the time-domain long-term average energy by a certain margin, the frame is speech; otherwise it is non-speech. When the frame is judged to be non-speech, the result is output and detection ends.
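The short-time versus long-term comparison used by both first-level detectors can be sketched as follows. The text does not quantify "a certain margin", so the multiplicative `margin_ratio` parameter and its default value are assumptions made purely for illustration, as is the function name.

```python
def is_speech_by_energy(short_energy, long_avg_energy, margin_ratio=2.0):
    """Return True when the short-time energy exceeds the long-term average
    energy by the assumed multiplicative margin.

    margin_ratio is hypothetical; the source only says "a certain margin".
    """
    return short_energy > margin_ratio * long_avg_energy


# The same comparison serves either domain; only the inputs differ.
time_domain_hit = is_speech_by_energy(short_energy=9.0, long_avg_energy=3.0)
freq_domain_hit = is_speech_by_energy(short_energy=4.0, long_avg_energy=3.0)
```

An additive margin or a decibel difference would fit the wording equally well; the ratio form is used here only because it is independent of the signal scale.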

As a preferred solution of the present invention, when step 8 is carried out, the frame is speech if the detected uniformity is high and non-speech if the detected uniformity is low. When the frame is judged to be non-speech, the result is output and detection ends.

As a preferred solution of the present invention, the time-width filter counts the number of consecutive frames of the speech data that are speech. If the frame count is greater than threshold 2, the data is speech; if the frame count is less than threshold 2, the data is judged to be non-speech. When the data is judged to be non-speech, the result is output and detection ends.

As a preferred solution of the present invention, when any of steps 7 to 9 determines that the frame is non-speech, step 5 is run on the non-speech data to generate a new background noise value.
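The background noise update of step 5 might be sketched as below. The class name, the `frames_per_update` parameter, and the choice to output the mean of the accumulated energies are all assumptions; the text states only that non-speech frame energies are accumulated and that a new background noise value is output after a set number of frames.

```python
class BackgroundNoiseEstimator:
    """Accumulates the time-domain short-time energy of non-speech frames and
    publishes a new noise value every `frames_per_update` frames."""

    def __init__(self, frames_per_update, initial_noise=0.0):
        self.frames_per_update = frames_per_update  # hypothetical parameter
        self.noise = initial_noise                  # last published estimate
        self._accumulated = 0.0
        self._count = 0

    def add_non_speech_frame(self, energy):
        """Feed one non-speech frame energy; return the current estimate."""
        self._accumulated += energy
        self._count += 1
        if self._count == self.frames_per_update:
            # Assumption: the published value is the mean of the batch.
            self.noise = self._accumulated / self._count
            self._accumulated = 0.0
            self._count = 0
        return self.noise


estimator = BackgroundNoiseEstimator(frames_per_update=2)
history = [estimator.add_non_speech_frame(e) for e in (4.0, 6.0, 10.0)]
```

Between updates the estimator keeps returning the previous value, so the comparison with threshold 1 in step 6 always sees a stable estimate.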

While detecting time-domain speech data, the present invention uses three-level speech detection: first, the time-domain speech detection or the frequency-domain speech detection; second, the frequency-domain sub-band energy distribution uniformity detection; and finally, the time-width filter, which counts the speech frames produced by step 8 and compares the count with the preset threshold 2. Filtering level by level in this way ultimately screens out the genuinely valid speech data.

Compared with the prior art, the beneficial effects of the present invention are as follows:

The present invention uses three-level speech detection, which detects both high-frequency and low-frequency background noise well, also detects occasional, discontinuous noise very well, and greatly improves the accuracy of speech detection in complex noise environments.

Brief description of the drawings

Fig. 1 is a circuit block diagram of the present invention;

Fig. 2 is a flowchart of the present invention.

Detailed description of the embodiments

The present invention is described in further detail below with reference to embodiments and specific implementations. This should not be understood as limiting the scope of the above subject matter of the invention to the following embodiments; all techniques realized on the basis of the content of the present invention fall within its scope.

As shown in Fig. 1, a broadband background noise and speech separation detection system comprises: a current-frame time/frequency-domain energy calculation circuit; a background noise calculation circuit, a time-domain speech detection long/short-time average energy comparison circuit, and a frequency-domain speech detection long/short-time frequency-domain energy comparison circuit, each connected to the current-frame time/frequency-domain energy calculation circuit; a background noise comparison circuit connected to the background noise calculation circuit, the time-domain speech detection long/short-time average energy comparison circuit, and the frequency-domain speech detection long/short-time frequency-domain energy comparison circuit; a sub-band energy distribution uniformity speech detection circuit connected to both the time-domain speech detection long/short-time average energy comparison circuit and the frequency-domain speech detection long/short-time frequency-domain energy comparison circuit; and a speech frame count statistics circuit connected to the sub-band energy distribution uniformity speech detection circuit. The background noise calculation circuit is also connected to each of the sub-band energy distribution uniformity speech detection circuit, the speech frame count statistics circuit, the time-domain speech detection long/short-time average energy comparison circuit, and the frequency-domain speech detection long/short-time frequency-domain energy comparison circuit. The speech frame count statistics circuit consists of time-width filters, which count the number of speech frames. In the present embodiment there is one time-width filter, and it is a speech frame counter.

As shown in Fig. 2, a broadband background noise and speech separation detection method comprises the following 11 steps:

Step 7: Carry out frequency-domain speech detection, which compares the frequency-domain short-time energy with the frequency-domain long-term average energy. If the frequency-domain short-time energy exceeds the frequency-domain long-term average energy by a certain margin, the frame is speech; otherwise it is non-speech. If the frame is speech, go to step 9; if it is not speech, carry out step 5 and step 11;

Step 8: Carry out time-domain speech detection, which compares the time-domain short-time energy with the time-domain long-term average energy. If the time-domain short-time energy exceeds the time-domain long-term average energy by a certain margin, the frame is speech; otherwise it is non-speech. If the frame is speech, go to step 9; if it is not speech, carry out step 5 and step 11;

Step 9: Carry out frequency-domain sub-band energy distribution uniformity detection. The frame is speech if the detected uniformity is high and non-speech if the detected uniformity is low. If the frame is speech, go to step 10; if it is not speech, carry out step 5 and step 11;

Step 10: The time-width filter counts the speech frames produced by step 9, that is, the number of consecutive frames of the speech data that are speech, and compares the count with a preset threshold 2. If the frame count is greater than threshold 2, the data is speech and the method goes directly to step 11; if the frame count is less than threshold 2, the data is non-speech and the method goes to step 5 and step 11;

Step 11: Output the detection result; detection ends.

When any of steps 7 to 9 determines that the frame is non-speech, step 5 is run on the non-speech data to generate a new background noise value.
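The per-frame control flow of steps 6 to 9 can be summarized in a short sketch. It assumes that each detector is supplied as a zero-argument callable returning True for speech, and that a background noise value exactly equal to threshold 1 takes the time-domain branch (the text covers only "greater than" and "less than"). All names are hypothetical.

```python
def classify_frame(noise, threshold_1, frequency_test, time_test, uniformity_test):
    """Run the first two detection levels for one frame.

    Returns (candidate_speech, update_noise): candidate_speech frames go on
    to the time-width filter (step 10); update_noise frames are fed back to
    the background noise accumulation (step 5).
    """
    # Step 6: a noisy environment selects the frequency-domain detector.
    first_level = frequency_test if noise > threshold_1 else time_test
    if not first_level():        # step 7 or step 8
        return False, True
    if not uniformity_test():    # step 9
        return False, True
    return True, False


calls = []


def freq_ok():
    calls.append("frequency")
    return True


def time_ok():
    calls.append("time")
    return True


def uniformity_fail():
    calls.append("uniformity")
    return False


noisy_result = classify_frame(5.0, 1.0, freq_ok, time_ok, uniformity_fail)
```

Passing callables rather than precomputed booleans means only the detector selected in step 6 is evaluated, which mirrors the branching in the flowchart.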

In the present embodiment, the calculation in step 3 is as follows:

Assume the number of frequency-domain sub-bands is N. The average sub-band energy is then

$$E_{avg} = \frac{E_{total}}{N} = \frac{1}{N}\sum_{i=1}^{N} E_i$$

where $E_{avg}$ is the average sub-band energy, $E_{total}$ is the sum of all sub-band energies, and $E_i$ is the energy of the i-th sub-band, i = 1, 2, ..., N. In the frequency domain, the energy of a sub-band equals the sum of the square of its real part and the square of its imaginary part.
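A minimal sketch of this calculation in pure Python follows. For brevity it uses a direct DFT from the standard library instead of an FFT (the two give the same spectrum), keeps only the first half of the spectrum of a real-valued frame, and groups adjacent frequency bins evenly into sub-bands. The grouping and all function names are assumptions; the text specifies only that sub-band energy is the squared real part plus the squared imaginary part.

```python
import cmath


def dft(frame):
    """Direct discrete Fourier transform of one frame (stand-in for an FFT)."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]


def subband_energies(frame, num_subbands):
    """Energy E_i of each sub-band: real part squared plus imaginary part
    squared, summed over the bins assigned to that sub-band."""
    half = dft(frame)[:len(frame) // 2]   # non-redundant half for real input
    bins_per_band = len(half) // num_subbands
    if bins_per_band == 0:
        raise ValueError("frame too short for the requested sub-band count")
    return [sum(c.real ** 2 + c.imag ** 2
                for c in half[b * bins_per_band:(b + 1) * bins_per_band])
            for b in range(num_subbands)]


def average_subband_energy(energies):
    """E_avg = E_total / N."""
    return sum(energies) / len(energies)


# A pure tone that falls exactly on bin 2 of an 8-point transform.
tone = [1, 0, -1, 0, 1, 0, -1, 0]
tone_energies = subband_energies(tone, num_subbands=4)
tone_average = average_subband_energy(tone_energies)
```

For this tone, essentially all of the energy lands in the third sub-band, which is the kind of concentrated distribution the uniformity test in step 9 is meant to separate from speech.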

In the present embodiment, the calculation in step 9 is as follows:

Non-uniformity is obtained by the mean square deviation method. Let the energy of each sub-band be $E_i$; the non-uniformity obtained from the mean square deviation is then

$$nU = \frac{1}{N}\sum_{i=1}^{N}\left(E_i - E_{avg}\right)^2$$

where $nU$ is the non-uniformity. Let Th_nu be the non-uniformity threshold: when nU < Th_nu, the frame can provisionally be judged to be speech; otherwise it is non-speech.
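The mean-square-deviation test can be sketched in a few lines. The sketch assumes that "mean square deviation" means the average squared deviation of the sub-band energies from their mean, with no further normalization; the function names and the example threshold are hypothetical.

```python
def non_uniformity_msd(energies):
    """nU as the average squared deviation of sub-band energies from E_avg
    (assumed form of the mean square deviation)."""
    avg = sum(energies) / len(energies)
    return sum((e - avg) ** 2 for e in energies) / len(energies)


def is_speech_by_uniformity(energies, th_nu):
    """Provisionally speech when the non-uniformity is below the threshold."""
    return non_uniformity_msd(energies) < th_nu


flat = [1.0, 1.0, 1.0, 1.0]       # evenly spread energy
peaked = [0.0, 0.0, 16.0, 0.0]    # energy concentrated in one sub-band
flat_nu = non_uniformity_msd(flat)
peaked_nu = non_uniformity_msd(peaked)
```

Because this measure scales with the square of the signal energy, a practical threshold Th_nu would probably need to track the signal level or be applied to normalized energies; the text does not say which.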

In other embodiments, the following two methods can be used instead:

Method 1: Take the absolute value of each difference and average them:

$$nU = \frac{1}{N}\sum_{i=1}^{N}\left|E_i - E_{avg}\right|$$

where $nU$ is the non-uniformity. Let Th_nu be the non-uniformity threshold: when nU < Th_nu, the frame can provisionally be judged to be speech; otherwise it is non-speech;

Method 2: Count the sub-bands whose energy is close to the average sub-band energy. If many sub-band energies lie near the average energy, the frame is speech; otherwise it is non-speech. Specifically, for each sub-band, if $|E_i - E_{avg}| < k \cdot E_{avg}$, then U = U + 1. Here k is a configurable parameter between 0 and 1 with a typical value of 0.5, and U characterizes the uniformity. Let Th_u be the threshold: if U > Th_u, the frame is judged to be speech; otherwise it is non-speech.
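The two alternative measures might look like this in Python. The function names are hypothetical; `k` defaults to the typical value of 0.5 given in the text, and U is assumed to start from 0 for each frame.

```python
def non_uniformity_abs(energies):
    """Method 1: mean absolute deviation of sub-band energies from E_avg."""
    avg = sum(energies) / len(energies)
    return sum(abs(e - avg) for e in energies) / len(energies)


def uniformity_count(energies, k=0.5):
    """Method 2: number U of sub-bands with |E_i - E_avg| < k * E_avg."""
    avg = sum(energies) / len(energies)
    return sum(1 for e in energies if abs(e - avg) < k * avg)


sample = [2.0, 4.0, 6.0, 8.0]           # E_avg = 5
sample_nu = non_uniformity_abs(sample)  # compared with Th_nu: speech if smaller
sample_u = uniformity_count(sample)     # compared with Th_u: speech if larger
```

Note that the two measures point in opposite directions: a small nU indicates speech in method 1, while a large U indicates speech in method 2.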

In the present embodiment, the detailed calculation in step 10 is as follows:

Provide a speech frame counter. The counter is initialized to 0, is reset to 0 whenever a non-speech frame is encountered, and is incremented by 1 whenever a speech frame is encountered. When the signal changes from a non-speech frame to a speech frame, the index of the first speech frame is recorded as the speech start address. When the counter value exceeds threshold 2, all consecutive speech frames from the first speech frame onward are taken as speech frames until a non-speech frame appears. If the counter value is less than the threshold when the signal changes from a speech frame to a non-speech frame, the preceding speech frames are also judged to be non-speech frames.
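The counter logic can be sketched as an offline pass over per-frame decisions. The function name and the list-of-booleans interface are assumptions. A run whose length exactly equals threshold 2 is treated here as non-speech, because the text confirms speech only when the counter is greater than the threshold and leaves the equal case unspecified.

```python
def time_width_filter(frame_flags, threshold_2):
    """Keep only runs of consecutive speech frames longer than threshold_2.

    frame_flags: per-frame booleans from the earlier detection levels.
    Returns a list of the same length with short runs relabelled non-speech.
    """
    result = [False] * len(frame_flags)
    counter = 0   # speech frame counter, reset on every non-speech frame
    start = 0     # index of the first speech frame of the current run
    for i, is_speech in enumerate(frame_flags):
        if not is_speech:
            counter = 0
            continue
        if counter == 0:
            start = i                   # non-speech to speech transition
        counter += 1
        if counter == threshold_2 + 1:
            # Run just became long enough: confirm it from its first frame.
            for j in range(start, i + 1):
                result[j] = True
        elif counter > threshold_2 + 1:
            result[i] = True            # run already confirmed
    return result


flags = [True, True, False, True, True, True, True, False]
filtered = time_width_filter(flags, threshold_2=2)
```

Frames in a run that never exceeds the threshold simply stay False, which corresponds to relabelling the preceding speech frames as non-speech at the speech to non-speech transition.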

Claims

1. A broadband background noise and speech separation detection system, comprising: a current-frame time/frequency-domain energy calculation circuit; a background noise calculation circuit, a time-domain speech detection long/short-time average energy comparison circuit, and a frequency-domain speech detection long/short-time frequency-domain energy comparison circuit, each connected to the current-frame time/frequency-domain energy calculation circuit; a background noise comparison circuit connected to the background noise calculation circuit, the time-domain speech detection long/short-time average energy comparison circuit, and the frequency-domain speech detection long/short-time frequency-domain energy comparison circuit; a sub-band energy distribution uniformity speech detection circuit connected to both the time-domain speech detection long/short-time average energy comparison circuit and the frequency-domain speech detection long/short-time frequency-domain energy comparison circuit; and a speech frame count statistics circuit connected to the sub-band energy distribution uniformity speech detection circuit; wherein the background noise calculation circuit is also connected to each of the sub-band energy distribution uniformity speech detection circuit, the speech frame count statistics circuit, the time-domain speech detection long/short-time average energy comparison circuit, and the frequency-domain speech detection long/short-time frequency-domain energy comparison circuit.

2. The broadband background noise and speech separation detection system according to claim 1, characterized in that the speech frame count statistics circuit consists of time-width filters, the time-width filters are used to count the number of speech frames, and the number of time-width filters is greater than or equal to 1.

3. A broadband background noise and speech separation detection method, comprising the following steps:

Step 1: Load the speech data and process it frame by frame, the speech data being time-domain speech data;

Step 2: Calculate the time-domain short-time energy and the time-domain long-term average energy, the time-domain short-time energy being the energy sum of the current frame of the time-domain speech data, and the time-domain long-term average energy being obtained by accumulating the time-domain short-time energy of multiple frames and dividing by the number of frames accumulated;

Step 3: Apply an FFT (fast Fourier transform) to the current frame of the time-domain speech data, transforming it into frequency-domain sub-band speech data;

Step 4: Calculate the frequency-domain short-time energy and the frequency-domain long-term average energy, the frequency-domain short-time energy being obtained by summing the sub-band energies of the current frame over the frequency range in which speech energy is mainly distributed, and the frequency-domain long-term average energy being obtained by accumulating the frequency-domain short-time energy of multiple frames and dividing by the number of frames accumulated;

Step 5: Accumulate the background noise;

Step 6: Compare the background noise with a preset threshold 1; if it is greater than threshold 1, carry out step 7; if it is less than threshold 1, carry out step 8;

Step 7: Carry out frequency-domain speech detection; if the frame is speech, go to step 9; if it is not speech, carry out step 5 and step 11;

Step 8: Carry out time-domain speech detection; if the frame is speech, go to step 9; if it is not speech, carry out step 5 and step 11;

Step 9: Carry out frequency-domain sub-band energy distribution uniformity detection; if the frame is speech, go to step 10; if it is not speech, carry out step 5 and step 11;

Step 10: The time-width filter counts the speech frames produced by step 9 and compares the count with a preset threshold 2; if the frame count is greater than threshold 2, go directly to step 11; if the frame count is less than threshold 2, go to step 5 and step 11;

Step 11: Output the detection result; detection ends.

4. The broadband background noise and speech separation detection method according to claim 3, characterized in that the frequency-domain speech detection compares the frequency-domain short-time energy with the frequency-domain long-term average energy; if the frequency-domain short-time energy exceeds the frequency-domain long-term average energy by a certain margin, the frame is speech, and otherwise it is non-speech; when the frame is judged to be non-speech, the result is output and detection ends.

5. The broadband background noise and speech separation detection method according to claim 3, characterized in that the time-domain speech detection compares the time-domain short-time energy with the time-domain long-term average energy; if the time-domain short-time energy exceeds the time-domain long-term average energy by a certain margin, the frame is speech, and otherwise it is non-speech; when the frame is judged to be non-speech, the result is output and detection ends.

6. The broadband background noise and speech separation detection method according to claim 3, characterized in that, when step 8 is carried out, the frame is speech if the detected uniformity is high and non-speech if the detected uniformity is low; when the frame is judged to be non-speech, the result is output and detection ends.

7. The broadband background noise and speech separation detection method according to claim 3, characterized in that the time-width filter counts the number of consecutive frames of the speech data that are speech; if the frame count is greater than threshold 2, the data is speech; if the frame count is less than threshold 2, the data is judged to be non-speech; when the data is judged to be non-speech, the result is output and detection ends.

8. The broadband background noise and speech separation detection method according to claim 3, characterized in that, when any of steps 7 to 9 determines that the frame is non-speech, step 5 is run on the non-speech data to generate a new background noise value.