CN105023572A

CN105023572A - Robust endpoint detection method for noisy speech

Info

Publication number: CN105023572A
Application number: CN201410152461.0A
Authority: CN
Inventors: 王景芳
Original assignee: Individual
Current assignee: Individual
Priority date: 2014-04-16
Filing date: 2014-04-16
Publication date: 2015-11-04

Abstract

The invention discloses a robust endpoint detection method for noisy speech. The method constructs an estimate of the noise power spectrum of each frame of the acoustic signal during filtering and provides a time-varying update mechanism for the noise spectrum. The spectrum of each speech frame is first filtered with an iterative Wiener filter, then divided into several subbands, and the spectral entropy of each subband is calculated. The subband spectral entropies of several successive frames are passed through a bank of median filters to obtain the spectral entropy of each frame, and the input speech is classified according to the spectral entropy values. The algorithm effectively distinguishes speech from noise and speech-present from speech-absent states, and is robust under different noise conditions. It has low computational cost, is simple and easy to implement, and is suitable for real-time speech signal processing in any system that needs speech endpoint detection. The method is a real-time speech endpoint detection algorithm that adapts to complex environments, and it completes endpoint detection and speech enhancement filtering together in a single pass.

Description

A robust endpoint detection method for noisy speech

Technical field

The invention belongs to the field of speech processing technology, and in particular relates to a robust endpoint detection method for noisy speech.

Background art

Speech endpoint detection (also called voice activity detection) is an important step in digital speech processing. Its purpose is to separate speech segments from noise segments in the sampled digital signal. Accurately detecting the start and end points of speech against background noise is useful in many ways. In speech recognition, noise segments that contain no speech can be removed and feature parameters extracted only from the speech segments, which both improves recognition accuracy and shortens recognition time. In speech coding, the bit rate of noise segments can be reduced without affecting the quality of the received speech, which improves coding efficiency. For embedded speech recognition systems the benefits are: 1. non-speech frames are discarded rather than sent to the back-end recognizer, which reduces the recognizer's computation; in embedded speech recognition systems such as mobile phones and PDAs (personal digital assistants), this shortens the response time and improves real-time performance; 2. in distributed speech recognition systems, transmitting only speech frames significantly reduces the amount of transmitted data; 3. insertion errors caused by feeding the features of large numbers of non-speech frames to the back-end recognizer are reduced.

Short-time energy is the most commonly used feature in voice activity detection algorithms. It separates speech from noise effectively in high-SNR environments, but extensive experimental results show that the performance of methods based on short-time energy degrades markedly at low SNR and in non-stationary noise. Some algorithms do remain stable at low SNR, but their computational complexity is too high for real-time speech recognition systems. Information entropy was proposed early on for speech/noise classification, because the difference between human speech and noise shows up in their spectral entropy. Experimental results show that algorithms based on speech spectral entropy outperform energy-based methods at low SNR.

The invention proposes a subband spectral entropy voice activity detection algorithm based on iterative Wiener filtering. The spectrum of each speech frame is first filtered with an iterative Wiener filter, then divided into several subbands whose spectral entropies are calculated; the subband spectral entropies of several successive frames are then passed through a bank of median filters to obtain the spectral entropy of each frame. In non-stationary noise the spectral entropy contour fluctuates strongly, which makes threshold selection difficult, so the subband spectral entropy is smoothed by a bank of median filters. The smoothing uses the subband spectral entropies of the preceding and following frames, which greatly improves the detection accuracy and effectiveness of the algorithm. The algorithm maintains good performance across a variety of environments and SNR conditions and can meet the real-time and low-power requirements of embedded systems.

Summary of the invention

(1) Technical problem to be solved

In view of this, the main purpose of the invention is to propose a robust endpoint detection method for noisy speech that separates speech segments from noise segments in the sampled digital signal. It accurately detects the start and end points of speech against background noise and provides a highly robust voice activity detection method for speech signals contaminated by additive noise. It covers the subband spectral entropy of speech in non-stationary noise, and the double-threshold decision thresholds and decision method for noisy speech and noise.

(2) Technical solution

To achieve the above purpose, the invention provides a robust endpoint detection method for noisy speech, the method comprising:

1) Wiener filtering

Wiener filtering gives the optimal estimate of the time-domain waveform in the minimum mean square error sense under stationary conditions. Its main procedure is as follows. For the noisy speech signal:

$y(n) = s(n) + v(n)$ (1)

where y(n), s(n) and v(n) denote the noisy speech, the clean speech and the noise, respectively. Wiener filtering estimates s(n) under the minimum mean square error criterion, that is, it chooses an estimate $\hat{s}(n)$ of s(n). Assuming the filter has impulse response h(n), we have

$\hat{s}(n) = y(n) * h(n)$ (2)

such that $\varepsilon = E[(s(n) - \hat{s}(n))^2]$ is minimized. Substituting $\hat{s}(n)$ into the error expression ε and applying the orthogonality principle, the Fourier transform, and the fact that s(n) and v(n) are uncorrelated yields the Wiener filtering equations;

(3)

(4)

H(k) is the frequency-domain transfer function, k is the frequency index, P_s(k) is the power spectral density of the speech signal s(n), and P_v(k) is the power spectral density of the noise v(n). In practice, determining P_s(k) and P_v(k) is the crux of the problem; P_snr(k) is obtained iteratively;
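For reference, the textbook frequency-domain Wiener filter for uncorrelated speech and noise can be written as below. This is the standard result, offered as a reading aid rather than a transcription of formulas (3) and (4); treating P_snr(k) as the ratio P_s(k)/P_v(k) is an assumption about how that symbol is used here.

```latex
H(k) = \frac{P_s(k)}{P_s(k) + P_v(k)} = \frac{P_{snr}(k)}{1 + P_{snr}(k)},
\qquad P_{snr}(k) = \frac{P_s(k)}{P_v(k)}
```

In this form the gain approaches 1 where speech dominates and 0 where noise dominates, which is why only the ratio P_snr(k) needs to be tracked from frame to frame.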

2) Iterative update of the noise power spectrum

The signal of frame l is transformed by the fast Fourier transform (FFT) to obtain N_FFT spectral points YF(i, l) (0 ≤ i ≤ N_FFT). The first N frames are taken to be noise, and the initial value of the noise spectrum at each point is:

（5）

The initial value of the noise power spectrum at each point is:

（6）

Update of the noise spectrum and noise power spectrum: if the current frame is frame l, compute the current signal-to-noise ratio

（7）

where MN(i, l-1) is the noise of the previous frame (0 ≤ i ≤ N_FFT);

The signal-to-noise ratio SNR of this frame is defined as the mean of those values SNR(i) that satisfy SNR(i) > 0 (0 ≤ i ≤ N_FFT);

If SNR < 3 dB, the count of consecutive noise frames is incremented, NoiseCounter = NoiseCounter + 1. The noise spectrum and noise power spectrum are updated only when NoiseCounter > HangOver (for example HangOver = 8); otherwise they are left unchanged. The update formulas (for example α = 0.9) are:

（8）

（9）
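The hangover-gated update can be sketched as follows. This is a minimal illustration, not the patent's formulas (8) and (9): the recursive-averaging form, the reset of the counter on a non-noise frame, and the function and argument names are all assumptions made for the example; only the values 3 dB, HangOver = 8 and α = 0.9 come from the text.

```python
import numpy as np

ALPHA = 0.9        # smoothing factor, example value from the text
HANGOVER = 8       # example value from the text
SNR_DB_MIN = 3.0   # frame-level SNR threshold in dB from the text


def update_noise(noise_mag, noise_pow, frame_mag, frame_snr_db, counter):
    """Return (noise_mag, noise_pow, counter) after seeing one frame.

    ASSUMPTION: recursive averaging and the counter reset are illustrative
    choices; the original update formulas are not reproduced here.
    """
    if frame_snr_db < SNR_DB_MIN:
        counter += 1          # one more consecutive noise-like frame
    else:
        counter = 0           # assumption: a speech-like frame ends the run
    if counter > HANGOVER:
        # Only a sufficiently long noise run is allowed to move the estimate.
        noise_mag = ALPHA * noise_mag + (1.0 - ALPHA) * frame_mag
        noise_pow = ALPHA * noise_pow + (1.0 - ALPHA) * frame_mag ** 2
    return noise_mag, noise_pow, counter
```

Gating the update on a run of low-SNR frames keeps short pauses inside an utterance from leaking speech energy into the noise estimate.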

3) Implementation of iterative Wiener filtering

Set the initial value of the signal-to-noise ratio iteration, SNRpre(i, 0) = 1 (0 ≤ i ≤ N_FFT);

If the current frame is frame l, compute the signal-to-noise ratio gain

（10）

Update the information quantity, where max denotes taking the maximum (for example β = 0.95)

（11）

Compute the filter coefficients (frequency-domain transfer function)

（12）

Compute the filtered spectrum

（13）

Update the signal-to-noise ratio SNRpre(i, l) for use in filtering the next frame

（14）
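One common way to realize a frame-by-frame Wiener filter of this kind is the decision-directed recursion sketched below. It is offered as an assumption about the general shape of steps (10) to (14), not as the patent's exact formulas; the function name, the variable names and the small constant `eps` are hypothetical, and only β = 0.95 and the initial value SNRpre = 1 are taken from the text.

```python
import numpy as np

BETA = 0.95  # example smoothing value from the text


def wiener_step(frame_spec, noise_pow, snr_pre):
    """Filter one frame; return (filtered_spec, next_snr_pre).

    ASSUMPTION: decision-directed form, used here only for illustration.
    """
    eps = 1e-12                                        # avoids division by zero
    snr_post = np.abs(frame_spec) ** 2 / (noise_pow + eps)
    # Blend the carried-over SNR with the current excess over the noise floor.
    snr_prio = BETA * snr_pre + (1.0 - BETA) * np.maximum(snr_post - 1.0, 0.0)
    gain = snr_prio / (1.0 + snr_prio)                 # Wiener gain, between 0 and 1
    filtered = gain * frame_spec
    next_snr_pre = gain ** 2 * snr_post                # kept for the next frame
    return filtered, next_snr_pre
```

A caller would start with `snr_pre = np.ones(n_points)` and feed the second return value back in on the next frame.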

4) Subband spectral entropy calculation

The speech signal of each frame is transformed by the fast Fourier transform (FFT) to obtain its N_FFT spectral points YF_i (0 ≤ i ≤ N_FFT). The iterative Wiener filtering described above yields N_FFT points YN_i (0 ≤ i ≤ N_FFT), from which the power spectrum Y_i = YN_i * YN_i (0 ≤ i ≤ N_FFT) is computed. Because the spectrum of clean speech lies within [250 Hz, 3500 Hz], the corresponding point interval [Nd, Ng] (0 ≤ Nd < Ng ≤ N_FFT) is located, and the points in the frequency interval [Nd, Ng] are divided into a number of non-overlapping frequency bands, called subbands. Because the noise in some environments is concentrated in particular subbands, the subband approach improves the accuracy of the algorithm in narrow-band noise. The probability of each point in the spectrum of frame l is computed according to formula (15)

(15)

where Y_i is a point in the i-th subband and Q is a positive number. Q is added so that the spectral entropies of different noise signals are close to one another at the same SNR, which makes speech and noise easier to distinguish. In the experiments, Q is a linear function of STD, the mean of the per-frame time-domain standard deviations of the first N frames, i.e. Q = a*STD + b (for example a = 500, b = 1). To cancel the effect of power that is too concentrated in a single point, if .9, then ;

According to the definition of information entropy, the spectral entropy of the k-th subband of frame l is

( ) (16)
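A subband spectral entropy of this kind can be sketched as follows. The probability form p_i = (Y_i + Q) / Σ(Y_j + Q) within each subband and the entropy E = -Σ p_i log p_i are assumptions standing in for formulas (15) and (16); the function name and the bin-mapping arithmetic are hypothetical, and the single-point clipping step is left out because its condition is not recoverable from the text.

```python
import numpy as np


def subband_entropy(power, fs, n_fft, n_sub=8, q=1.0, f_lo=250.0, f_hi=3500.0):
    """Spectral entropy of each subband for one frame's power spectrum.

    ASSUMPTION: p_i = (Y_i + Q) / sum_j (Y_j + Q) inside each subband and
    E = -sum p_i log p_i; the original formulas are not reproduced here.
    """
    nd = int(np.ceil(f_lo * n_fft / fs))     # first bin at or above f_lo
    ng = int(np.floor(f_hi * n_fft / fs))    # last bin at or below f_hi
    bands = np.array_split(power[nd:ng + 1], n_sub)
    ent = np.empty(n_sub)
    for k, band in enumerate(bands):
        p = (band + q) / np.sum(band + q)    # q > 0 keeps every p strictly positive
        ent[k] = -np.sum(p * np.log(p))
    return ent
```

A flat subband gives the maximum entropy log(number of bins), while a subband dominated by one strong harmonic gives a lower value, which is the contrast the detector relies on.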

5) Median filtering and double-threshold voice activity detection

According to the principle of information entropy, when the noise signal in some environments is relatively regular, the accuracy of the classifier suffers. Therefore, when the spectral entropy of the current frame is calculated, the filter uses information from the L frames around it. The algorithm smooths the spectral entropy of each subband with a bank of median filters;

Median filtering is applied to a group of L = 2N₁ + 1 subband entropies Es[l-N₁, k], ..., Es[l, k], ..., Es[l+N₁, k], where l is the frame currently being analysed;

The spectral negentropy of frame l can then be calculated according to the following formula:

After subband spectral entropy estimation and median filtering, each frame yields one spectral negentropy H_l;
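The per-subband median smoothing over L = 2N₁ + 1 frames can be sketched with NumPy as below. The edge handling (repeating the first and last frame) and the function name are assumptions, since the text does not say how the ends of the signal are treated; the step that combines the smoothed subband values into H_l is not shown because its formula is not available here.

```python
import numpy as np


def median_smooth(es, n1=4):
    """Median-filter each subband's entropy track over 2*n1 + 1 frames.

    es has shape (frames, subbands). ASSUMPTION: edges are padded by
    repeating the first and last frame.
    """
    padded = np.pad(es, ((n1, n1), (0, 0)), mode="edge")
    # One window of 2*n1 + 1 frames per original frame, appended as the last axis.
    windows = np.lib.stride_tricks.sliding_window_view(padded, 2 * n1 + 1, axis=0)
    return np.median(windows, axis=-1)
```

With the default n1 = 4 the window spans 9 frames. A median, unlike a moving average, removes isolated spikes while keeping the sharp entropy step at a speech onset.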

VAD (voice activity detection) uses a double-threshold algorithm, whose principle (see Fig. 2) is as follows:

A) Calculate the subband spectral entropy E_Sil of the background noise (for example by averaging the first 3 frames), and calculate the low and high thresholds E_Low and E_High;

B) Find the start point of speech. A position is the speech start point when it satisfies the following conditions: within the l frames after the current position there are m consecutive frames above E_Low, and within the M frames after the current position there are N consecutive frames above E_High;

C) Find the end point of speech. A position is the speech end point when it satisfies the following conditions: a point below E_Low is found first, and within the B frames after that point there are no C consecutive frames above E_High.
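Rules B) and C) can be sketched as a simple scan over the per-frame values H_l. This is an illustrative reading of the rules, not the patent's implementation: the function names, the default run and window lengths, the restriction to the first speech segment, and the fallback to the last frame when no end point is found are all assumptions.

```python
import numpy as np


def has_run(mask, length):
    """True if mask contains `length` consecutive True values."""
    run = 0
    for v in mask:
        run = run + 1 if v else 0
        if run >= length:
            return True
    return False


def find_endpoints(h, e_low, e_high, low_run=3, low_win=5,
                   high_run=2, high_win=10, end_run=2, end_win=10):
    """Return (start, end) frame indices, or None if no speech is found."""
    h = np.asarray(h, dtype=float)
    start = None
    for t in range(len(h)):
        # Rule B: enough frames above both thresholds shortly after t.
        if (has_run(h[t:t + low_win] > e_low, low_run)
                and has_run(h[t:t + high_win] > e_high, high_run)):
            start = t
            break
    if start is None:
        return None
    for t in range(start + 1, len(h)):
        # Rule C: a frame below E_Low that is not followed by a new high run.
        if h[t] < e_low and not has_run(h[t + 1:t + 1 + end_win] > e_high, end_run):
            return start, t
    return start, len(h) - 1   # assumption: speech runs to the end of the data
```

Because rule C looks ahead for a renewed run above E_High, a short dip inside an utterance does not terminate the segment.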

Preferably, the parameters are initialized as follows: the noisy speech signal is divided into frames with frame length N = [0.25fs] points, where fs is the signal sampling frequency, and frame shift N/2; the initial noise spectrum is determined from the first few frames, which contain no speech.

Preferably, in the Wiener filtering described above: taken as a whole, the characteristics of a speech signal and the parameters that describe its essential features all vary with time, so it is a typical non-stationary process. Over a short period (10 to 30 ms), however, its characteristics remain relatively stable, so it can be regarded as a quasi-stationary process; this is the short-time stationarity of speech. Most current speech processing techniques work on this short-time basis: the speech signal is divided into many segments whose characteristic parameters are analysed one by one. Each segment is called a frame, and the segmentation is called framing; it is implemented by applying a window function to the speech signal, with a frame length of typically 10 to 30 ms. Frames may be contiguous, but usually a sliding window is used to produce overlapping segments, so that the transition between frames is smooth and the continuity of the signal is maintained. As the window function, in order to obtain high frequency resolution and overcome the Gibbs phenomenon, an overlapping Hamming window is chosen.
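Overlapping Hamming-window framing as described above can be sketched in a few lines. The function name and the 25 ms default are illustrative assumptions (any value in the 10 to 30 ms range fits the description); the half-frame shift follows the frame shift of N/2 stated in the text.

```python
import numpy as np


def frame_signal(x, fs, frame_ms=25.0):
    """Split x into Hamming-windowed frames with 50% overlap."""
    n = int(round(frame_ms * fs / 1000.0))   # frame length in samples
    hop = n // 2                             # frame shift of half a frame
    count = 1 + (len(x) - n) // hop          # whole frames only; the tail is dropped
    idx = np.arange(n)[None, :] + hop * np.arange(count)[:, None]
    return x[idx] * np.hamming(n)
```

Each row of the result is one windowed frame, ready for `np.fft.rfft(frames, axis=1)`.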

Preferably, the implementation process of the invention is shown in Fig. 1, and the double-threshold voice activity detection process is shown in Fig. 2.

Preferably, the noisy speech signal is processed frame by frame in real time.

(3) Beneficial effects

1. The robust endpoint detection method for noisy speech provided by the invention proposes a real-time speech endpoint detection method based on subband power spectral entropy under iterative Wiener filtering. The algorithm is simple to compute, highly real-time, strongly noise-resistant and robust. It is general and adapts to a wide range of environments: even at very low SNR, speech frames still contain subbands with relatively high SNR, so the method is suitable for embedded real-time speech recognition systems;

2. Advantages and characteristics of the robust endpoint detection method for noisy speech provided by the invention:

1) iterative Wiener filtering is applied to the spectrum of every frame;

2) the low and high frequencies outside the speech band are pruned in the subband power spectrum calculation;

3) a noise-suppressing constant Q is introduced in the calculation of the subband spectral entropy, which attenuates the interference of noise with the spectral entropy;

4) the subband spectral entropy is smoothed by a bank of median filters, which facilitates threshold selection;

3. The robust endpoint detection method for noisy speech provided by the invention targets non-stationary environmental noise. It separates speech segments from noise segments in the sampled digital signal, accurately detects the start and end points of speech against background noise, and provides a highly robust voice activity detection method for speech signals contaminated by additive noise.

Brief description of the drawings

Fig. 1 is a flowchart of the robust endpoint detection method for noisy speech provided by the invention;

Fig. 2 is a schematic diagram of the double-threshold voice activity detection provided by the invention;

Fig. 3 compares endpoint detection and spectral entropy curves for the original speech mixed with different noises, as provided by the invention;

Fig. 3(a): endpoint detection of the original speech; (a₁) spectral entropy of the original speech with the proposed algorithm; (a₂) conventional spectral entropy of the original speech;

Fig. 3(b): endpoint detection of speech mixed with white noise (white) (SNR = 16.5 dB); (b₁) spectral entropy with the proposed algorithm; (b₂) conventional spectral entropy;

Fig. 3(c): endpoint detection of speech mixed with pink noise (pink) (SNR = 13.4 dB); (c₁) spectral entropy with the proposed algorithm; (c₂) conventional spectral entropy;

Fig. 3(d): endpoint detection of speech mixed with fighter cockpit noise (f16_cockpit) (SNR = 8.8 dB); (d₁) spectral entropy with the proposed algorithm; (d₂) conventional spectral entropy;

Fig. 3(e): endpoint detection of speech mixed with babble noise (babble) (SNR = 7.76 dB); (e₁) spectral entropy with the proposed algorithm; (e₂) conventional spectral entropy;

Fig. 4 compares endpoint detection and spectral entropy curves for speech mixed with babble noise (babble) (SNR = 5 dB, 0 dB, -5 dB), as provided by the invention.

Detailed description of the embodiments

To make the purpose, technical solution and advantages of the invention clearer, the invention is described in more detail below with reference to a specific embodiment and the accompanying drawings.

The core of the invention is as follows: a time-varying update principle for the noise spectrum is proposed, iterative Wiener filtering is implemented, the subband spectral entropy is calculated, and median filtering and double-threshold voice activity detection are designed, achieving the goal of speech endpoint detection.

As shown in Fig. 1, which is a flowchart of the robust endpoint detection method for noisy speech provided by the invention, the method comprises the following steps:

Step 101: parameter initialization: the noisy speech signal is divided into frames with frame length N = [0.25fs] points, where fs is the signal sampling frequency, and frame shift N/2; the noise spectrum is initialized;

Step 102: framing: the noisy speech of frame m is Fourier transformed;

Step 103: time-varying update of the noise power spectrum for frame m;

Step 104: SNR-based iterative Wiener filtering of frame m;

Step 105: calculation of the subband spectral entropy of frame m, median filtering, and double-threshold endpoint detection;

Step 106: real-time processing proceeds to the next frame; go to step 102.

The time-varying update of the noise power spectrum in step 103 comprises:

The initial value of the noise power spectrum at each point:

MN(i, l-1) is the noise of the previous frame (0 ≤ i ≤ N_FFT).

The SNR-based iterative Wiener filtering of step 104 comprises:

If the current frame is frame l, compute the signal-to-noise ratio gain

Update the information quantity, where max denotes taking the maximum (for example β = 0.95)

Compute the filter coefficients (frequency-domain transfer function)

Compute the filtered spectrum

The subband spectral entropy calculation, median filtering and double-threshold endpoint detection of step 105 comprise:

The probability of each point in the spectrum of frame l

If .9, then .

( )

Median filtering is applied to a group of L = 2N₁ + 1 subband entropies Es[l-N₁, k], ..., Es[l, k], ..., Es[l+N₁, k], where l is the frame currently being analysed.

(17)

The principle of the double-threshold algorithm is as follows (see Fig. 2):

Building on the flowchart of the robust endpoint detection method for noisy speech shown in Fig. 1, Fig. 2 further illustrates the double-threshold speech endpoint detection process.

The robust endpoint detection method for noisy speech provided by the invention is further described below with reference to a specific embodiment. The experimental background noise is taken from the Noisex-92 database, with sampling frequency fs = 19.98 kHz. At the same sampling frequency fs, the utterance "language, sound, end, point" was recorded in an environment with computer noise and room noise; it is shown in Fig. 3(a), where the gate-shaped polyline is the endpoint detection result of the proposed method. In the framing process each frame is 25 ms long, i.e. the frame length is M = [0.025fs] points, frame shift , the first N = 20 frames are taken as noise frames, the speech-band spectrum is divided into K = 8 subbands, and the median filter uses L = 9 adjacent frames;

The original speech, and the original speech mixed with white noise (white), pink noise (pink), babble noise (babble) and fighter cockpit noise (f16_cockpit) from the Noisex-92 database, are processed with both the iterative Wiener filtering spectral entropy method proposed here and the conventional spectral entropy method; the results are shown in Fig. 3. In the left column of Fig. 3 the horizontal axis is time (seconds) and the vertical axis is amplitude; in the middle and right columns the horizontal axis is the frame number and the vertical axis is negentropy. The left column shows the speech, the speech mixed with the different noises, and their endpoint detection; the middle column shows the spectral entropy and the endpoint dividing lines of the proposed algorithm; the right column shows the corresponding conventional spectral entropy. With the proposed algorithm the spectral entropy curve changes little across the various noise mixtures, the speech endpoints are segmented accurately, and adaptivity is good. The conventional spectral entropy method performs reasonably well with white noise but poorly in the noise-free case and with the other noises, and worst with babble noise (babble);

Further, when the most time-varying noise source, babble noise (babble), is mixed in, the proposed method still detects speech endpoints well at SNRs of 5 dB, 0 dB and -5 dB, whereas the conventional spectral entropy method fails;

The test results are measured by three indices:

,

where N₁ and N₀ are the total numbers of manually labelled speech frames and noise frames in the test speech, N_1,0 is the number of manually labelled speech frames misidentified as noise frames, and N_0,1 is the number of manually labelled noise frames misidentified as speech frames. P(A/S) is then the speech-frame detection accuracy, P(A/N) the noise-frame detection accuracy, and P(A) the overall detection accuracy;
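Given these definitions, the three indices can plausibly be computed as correctly classified frames divided by labelled frames. The sketch below makes that assumption explicit; the function and argument names are hypothetical.

```python
def detection_accuracy(n1, n0, n10, n01):
    """Return (P(A/S), P(A/N), P(A)) from frame counts.

    n1, n0   : manually labelled speech and noise frames
    n10, n01 : speech frames misread as noise, noise frames misread as speech
    ASSUMPTION: each accuracy is correct frames over labelled frames.
    """
    p_speech = (n1 - n10) / n1
    p_noise = (n0 - n01) / n0
    p_total = (n1 + n0 - n10 - n01) / (n1 + n0)
    return p_speech, p_noise, p_total
```

For example, 100 speech frames with 10 misses and 50 noise frames with 5 false alarms give 0.9 for all three indices.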

Table 1 summarizes the experimental results at different signal-to-noise ratios (SNR) in a babble noise (babble) environment.

The specific embodiments described above further explain the purpose, technical solution and beneficial effects of the invention. It should be understood that the foregoing are only specific embodiments of the invention and do not limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. A robust endpoint detection method for noisy speech, characterized in that the method comprises:

designing a method for estimating the noise power spectrum of the acoustic signal during filtering, and providing a time-varying update mechanism for the noise spectrum; applying an iterative Wiener filtering algorithm to the speech spectrum, dividing the spectrum into several subbands and calculating the spectral entropy of each subband, passing the subband spectral entropies of several successive frames through a bank of median filters to obtain the spectral entropy of each frame, and classifying the input speech according to the spectral entropy values; the algorithm effectively distinguishes speech from noise and speech-present from speech-absent states.

2. The robust endpoint detection method for noisy speech according to claim 1, characterized in that the time-varying update mechanism for the noise spectrum is:

1) the signal of frame l is transformed by the fast Fourier transform (FFT) to obtain N_FFT spectral points YF(i, l) (0 ≤ i ≤ N_FFT); the first N frames are taken to be noise, and the initial value of the noise spectrum at each point is:

（1）

the initial value of the noise power spectrum at each point is:

（2）

2) update of the noise spectrum and noise power spectrum: if the current frame is frame l, compute the current signal-to-noise ratio

（3）

where MN(i, l-1) is the noise of the previous frame (0 ≤ i ≤ N_FFT);

（4）

（5）。

3. The robust endpoint detection method for noisy speech according to claim 1, characterized in that the iterative Wiener filtering algorithm is:

if the current frame is frame l, compute the signal-to-noise ratio gain

（6）

update the information quantity, where max denotes taking the maximum (for example β = 0.95)

（7）

compute the filter coefficients (frequency-domain transfer function)

（8）

compute the filtered spectrum

（9）

（10）。

4. The robust endpoint detection method for noisy speech according to claim 1, characterized in that the subband spectral entropy calculation is:

the speech signal of each frame is transformed by the fast Fourier transform (FFT) to obtain its N_FFT spectral points YF_i (0 ≤ i ≤ N_FFT); the iterative Wiener filtering described above yields N_FFT points YN_i (0 ≤ i ≤ N_FFT), from which the power spectrum Y_i = YN_i * YN_i (0 ≤ i ≤ N_FFT) is computed; because the spectrum of clean speech lies within [250 Hz, 3500 Hz], the corresponding point interval [Nd, Ng] (0 ≤ Nd < Ng ≤ N_FFT) is located, and the points in the frequency interval [Nd, Ng] are divided into a number of non-overlapping frequency bands, called subbands;

because the noise in some environments is concentrated in particular subbands, the subband approach improves the accuracy of the algorithm in narrow-band noise; the probability of each point in the spectrum of frame l is computed according to formula (11)

(11)

( ) (12)

According to the principle of information entropy, when the noise signal in some environments is relatively regular, the accuracy of the classifier suffers; therefore, when the spectral entropy of the current frame is calculated, the filter uses information from the L frames around it; the algorithm smooths the spectral entropy of each subband with a bank of median filters.

5. The robust endpoint detection method for noisy speech according to claim 1, characterized in that the median filtering and endpoint segmentation threshold are:

(13)

(14)

When the value of H_l is greater than a preset threshold, frame l is judged to be a speech frame; otherwise it is judged to be a non-speech frame. The first N frames of the speech are assumed to be pure noise and are used to estimate the noise parameters and initialize the threshold. The threshold T is defined as follows:

(15)

T=avg+c (16)

Here the operator in formula (15) takes the median, and avg is the noise estimate obtained from the first N frames of the input signal. Based on experimental results, c is chosen as c = 0.01 (max(H_l) - min(H_l)) or as a constant of about 0.004.
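The single-threshold decision of this claim can be sketched as follows. Taking avg as the median of H_l over the first N frames is an assumption about formula (15), and the function name and default N = 20 are illustrative; the rule T = avg + c and the choice c = 0.01 (max(H_l) - min(H_l)) come from the text.

```python
import numpy as np


def classify_frames(h, n_init=20, c=None):
    """Label frames as speech (True) when H_l exceeds T = avg + c.

    ASSUMPTION: avg is the median of H_l over the first n_init frames.
    """
    h = np.asarray(h, dtype=float)
    avg = np.median(h[:n_init])              # noise-floor estimate
    if c is None:
        c = 0.01 * (h.max() - h.min())       # range-based margin from the text
    return h > avg + c
```

Note that the range-based margin needs the whole recording; a streaming implementation would use the fixed constant instead.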