CN105575405A

CN105575405A - Double-microphone voice active detection method and voice acquisition device

Info

Publication number: CN105575405A
Application number: CN201410524677.5A
Authority: CN
Inventors: 吴晟; 蒋斌; 林福辉
Original assignee: Spreadtrum Communications Shanghai Co Ltd
Current assignee: Spreadtrum Communications Shanghai Co Ltd
Priority date: 2014-10-08
Filing date: 2014-10-08
Publication date: 2016-05-11

Abstract

The invention provides a double-microphone voice active detection method and a voice acquisition device. The method comprises the following steps: performing frequency-domain transform on a noise-containing voice signal and a noise signal to get frequency-domain amplitude spectrums; using a pre-filter to pre-filter the frequency-domain amplitude spectrums to get pre-filtered amplitude spectrums; performing short-term envelope shaping on the pre-filtered amplitude spectrums by use of voice signal short-term envelope; accumulating and comparing the shaped amplitude spectrums to get the energy ratio of the noise-containing voice signal to the noise signal; and making voice activation judgment according to the energy ratio. By implementing the technical scheme of the invention, the accuracy of voice activation judgment at low signal-to-noise ratio is improved significantly.

Description

Dual-microphone voice activity detection method and voice capture device

Technical field

The present invention relates to the field of communication technology, and in particular to a dual-microphone voice activity detection method and a voice capture device.

Background

With advances in communication technology, the capacity of communication networks keeps growing, the processing power of communication terminals keeps increasing, and users' expectations for the quality of voice communication keep rising. Besides widening the frequency bandwidth of voice communication to improve fidelity, the noise robustness of the communication terminal is a major factor in voice communication quality. After a stage in which single-microphone systems reduced noise and improved voice quality through single-channel speech enhancement, more and more communication terminals are being equipped with a dual-microphone system with a primary and a secondary microphone. Such a system usually places one microphone (the main microphone) at the lower end of the voice capture device, near the mouth, to receive the noisy speech signal, and places the other microphone (the secondary microphone) on the back or top of the upper end of the device, near the ear, to receive a reference signal consisting mainly of noise.

Dual-channel speech enhancement analyzes and processes these two signals, the noisy speech signal and the reference signal, to obtain clean speech. Dual-channel speech enhancement methods fall mainly into two classes, beamforming and energy-difference filtering, and most schemes combine the two. Whichever method is used, it must work together with voice activity detection (VAD). Voice activity detection decides whether the signal at the current moment is speech or non-speech; the decision is passed to the subsequent speech enhancement module and has a decisive influence on the performance of the speech enhancement scheme. If voice activity detection frequently misses speech segments, speech is lost in the enhanced output; if it frequently misjudges segments as speech, a large amount of residual noise remains. Beyond speech enhancement, voice activity detection is also widely used in speech coding, speech recognition and other fields. In speech coding, for example, segments containing speech can be encoded as speech while segments without speech are encoded as silence or comfort noise, which improves coding efficiency. In speech enhancement and denoising, voice activity detection makes it possible to estimate the noise in speech gaps and the signal-to-noise ratio of speech segments. In speech recognition, good voice activity detection can greatly improve recognition accuracy.

Existing implementations of voice activity detection include methods based on energy or signal-to-noise ratio (SNR) thresholds and methods based on frequency-domain features. Threshold-based algorithms include time-domain short-time energy/SNR discrimination and subband-domain short-time energy/SNR discrimination; they make the activity decision by setting a single or double threshold on energy or SNR. Algorithms based on frequency-domain features detect the non-flatness of the spectrum; typical examples are signal entropy detection and pattern classification using Mel cepstral coefficients. All of these algorithms use only the noisy speech signal of a single channel, are not robust in noisy environments, and cannot guarantee the accuracy of the voice activity decision.

Summary of the invention

To address the above problems of existing voice activity detection techniques, a dual-microphone voice activity detection method and a voice capture device are provided, intended to improve the accuracy of the voice activity decision under low signal-to-noise ratio conditions.

The specific technical solution is as follows:

A dual-microphone voice activity detection method, comprising the following steps:

Step 1: obtain a noisy speech signal and a noise signal corresponding to the noisy speech signal;

Step 2: apply a frequency-domain transform to the noisy speech signal to obtain a noisy speech signal amplitude spectrum, and apply a frequency-domain transform to the noise signal to obtain a noise signal amplitude spectrum;

Step 3: pre-filter the noisy speech signal amplitude spectrum and the noise signal amplitude spectrum separately;

Step 4: obtain the short-time envelope of the speech signal;

Step 5: use the short-time envelope of the speech signal to shape the pre-filtered noisy speech signal amplitude spectrum and the pre-filtered noise signal amplitude spectrum;

Step 6: accumulate and compare the shaped noisy speech signal amplitude spectrum and the shaped noise signal amplitude spectrum to obtain an energy ratio;

Step 7: make the voice activity decision according to the energy ratio.

Preferably, in step 2:

the noisy speech signal is transformed to the frequency domain by a discrete Fourier transform (DFT), a discrete cosine transform (DCT), or a modified discrete cosine transform (MDCT) to obtain the noisy speech signal amplitude spectrum; and/or

the noise signal is transformed to the frequency domain by a discrete Fourier transform, a discrete cosine transform, or a modified discrete cosine transform to obtain the noise signal amplitude spectrum.

Preferably, when the discrete Fourier transform is used, the noisy speech signal amplitude spectrum is calculated by the following formula:

S_{a1}[k]_t = \left| \sum_{n=1}^{N} w(n) \, s_1(t - N + n) \, e^{-\frac{2 \pi j (n-1)(k-1)}{N}} \right|

where S_{a1} is the noisy speech signal amplitude spectrum, s_1(t) is the noisy speech signal, e is the base of the natural logarithm, j is the imaginary unit, j = (-1)^{0.5}, k is the discrete spectrum index, k = 1, 2, 3, ..., N, the subscript t is the discrete time index, and w(n) is an N-point window function; and/or

when the discrete Fourier transform is used, the noise signal amplitude spectrum is calculated by the following formula:

S_{a2}[k]_t = \left| \sum_{n=1}^{N} w(n) \, s_2(t - N + n) \, e^{-\frac{2 \pi j (n-1)(k-1)}{N}} \right|

where S_{a2} is the noise signal amplitude spectrum, s_2(t) is the noise signal, e is the base of the natural logarithm, j is the imaginary unit, j = (-1)^{0.5}, k is the discrete spectrum index, k = 1, 2, 3, ..., N, the subscript t is the discrete time index, and w(n) is an N-point window function.
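The windowed DFT magnitude above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `amplitude_spectrum` and the 8-sample test frame are hypothetical, and a direct O(N²) sum is used for clarity where a practical implementation would normally use an FFT.

```python
import cmath
import math


def amplitude_spectrum(frame, window):
    """Return |sum_n w(n) * x(n) * exp(-2*pi*j*(n-1)*(k-1)/N)| for k = 1..N.

    `frame` holds the N most recent samples s(t-N+1) .. s(t). Python lists are
    0-based, so index n here plays the role of (n - 1) in the formula.
    """
    n_points = len(frame)
    if len(window) != n_points:
        raise ValueError("window and frame must have the same length")
    spectrum = []
    for k in range(n_points):
        acc = 0j
        for n in range(n_points):
            acc += window[n] * frame[n] * cmath.exp(-2j * math.pi * n * k / n_points)
        spectrum.append(abs(acc))
    return spectrum


# Hypothetical 8-sample frame: one cycle of a cosine, rectangular window.
frame = [math.cos(2 * math.pi * n / 8) for n in range(8)]
s_a1 = amplitude_spectrum(frame, [1.0] * 8)
```

The same function would be applied to the frame of the second microphone to obtain the noise signal amplitude spectrum.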

Preferably, N is in the range f_s/100/2 < N < 0.2 f_s, where f_s is the sampling frequency; or N = 512 when the sampling frequency f_s = 8000 Hz.

Preferably, the window function is a rectangular window, a sine window, a Hann (Hanning) window, a Hamming window, or a Tukey window.
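As a rough illustration of these window choices, the sketch below generates four of the listed windows in plain Python. The function name `make_window` is hypothetical, and the symmetric definitions (argument scaled by N - 1) are an assumption, since the text does not say which variant is meant; the Tukey window is left out for brevity.

```python
import math


def make_window(kind, n_points):
    """Return an N-point window as a list of floats (symmetric definitions assumed)."""
    if n_points < 2:
        raise ValueError("n_points must be at least 2")
    last = n_points - 1
    if kind == "rectangular":
        return [1.0] * n_points
    if kind == "sine":
        return [math.sin(math.pi * n / last) for n in range(n_points)]
    if kind == "hann":
        return [0.5 - 0.5 * math.cos(2 * math.pi * n / last) for n in range(n_points)]
    if kind == "hamming":
        return [0.54 - 0.46 * math.cos(2 * math.pi * n / last) for n in range(n_points)]
    raise ValueError(f"unsupported window kind: {kind}")


hann = make_window("hann", 9)
hamming = make_window("hamming", 9)
```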

Preferably, in step 3:

the pre-filtering of the noisy speech signal amplitude spectrum is calculated by the following formula:

S_{pa1}[k]_t = S_{a1}[k]_t \, G_1[k]_t, \quad k = 1, 2, 3, \ldots, N

where S_{pa1} is the pre-filtered noisy speech signal amplitude spectrum, S_{a1} is the noisy speech signal amplitude spectrum, and G_1 is the pre-filter transfer function, a vector of length N whose elements lie between 0 and 1; and/or

the pre-filtering of the noise signal amplitude spectrum is calculated by the following formula:

S_{pa2}[k]_t = S_{a2}[k]_t \, G_2[k]_t, \quad k = 1, 2, 3, \ldots, N

where S_{pa2} is the pre-filtered noise signal amplitude spectrum, S_{a2} is the noise signal amplitude spectrum, and G_2 is the pre-filter transfer function, a vector of length N whose elements lie between 0 and 1.

Preferably, a frequency-domain Wiener filter is used to pre-filter the noisy speech signal amplitude spectrum, and the frequency-domain Wiener filter applied to the noisy speech signal amplitude spectrum is calculated by the following formula:

G_1[k]_t = \sqrt{\frac{\max(P_{s1}[k]_t - P_{n1}[k]_t, \, 0)}{P_{s1}[k]_t}}

where P_{s1} is the auto-power spectrum of the noisy speech signal and P_{n1} is the auto-power spectrum of the noise in the noisy speech signal; and/or

a frequency-domain Wiener filter is used to pre-filter the noise signal amplitude spectrum, and the frequency-domain Wiener filter applied to the noise signal amplitude spectrum is calculated by the following formula:

G_2[k]_t = \sqrt{\frac{\max(P_{s2}[k]_t - P_{n2}[k]_t, \, 0)}{P_{s2}[k]_t}}

where P_{s2} is the auto-power spectrum of the noise signal and P_{n2} is the auto-power spectrum of the noise in the noise signal.
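A minimal sketch of this gain and of its application as a pre-filter might look as follows. The names `wiener_gain` and `prefilter` and the three-bin example values are hypothetical, and returning a zero gain for a bin whose power is zero is an assumption added to avoid division by zero.

```python
import math


def wiener_gain(p_s, p_n):
    """Per-bin gain sqrt(max(P_s - P_n, 0) / P_s)."""
    gains = []
    for ps, pn in zip(p_s, p_n):
        if ps <= 0.0:
            gains.append(0.0)  # assumption: an empty bin gets zero gain
        else:
            gains.append(math.sqrt(max(ps - pn, 0.0) / ps))
    return gains


def prefilter(amplitude, gains):
    """Element-wise product S_pa[k] = S_a[k] * G[k]."""
    return [a * g for a, g in zip(amplitude, gains)]


s_a1 = [2.0, 1.0, 0.5]          # hypothetical amplitude spectrum
p_s1 = [a * a for a in s_a1]    # auto-power spectrum taken as the squared amplitude
p_n1 = [1.0, 1.0, 1.0]          # hypothetical noise power estimate
g_1 = wiener_gain(p_s1, p_n1)
s_pa1 = prefilter(s_a1, g_1)
```

Because of the `max(..., 0)` clamp, bins whose power does not exceed the noise estimate are zeroed, and every gain stays between 0 and 1.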

G_1[k]_t = \frac{\mathrm{SNR}_1[k]_t}{\mathrm{SNR}_1[k]_t + 1}

\mathrm{SNR}_1[k]_t = \alpha_1 \, G_1[k]_{t-1}^2 \, \mathrm{SNR}_{P1}[k]_{t-1} + (1 - \alpha_1) \max(\mathrm{SNR}_{P1}[k]_t - 1, \, 0)

\mathrm{SNR}_{P1}[k]_t = \frac{P_{s1}[k]_t}{P_{n1}[k]_t}

where SNR_1 is the signal-to-noise ratio of the noisy speech signal, SNR_{P1} is the posterior signal-to-noise ratio of the noisy speech signal, P_{s1} is the auto-power spectrum of the noisy speech signal, P_{n1} is the auto-power spectrum of the noise in the noisy speech signal, and α_1 and α_2 take values in the range 0 < α_1, α_2 < 1; and/or

G_2[k]_t = \frac{\mathrm{SNR}_2[k]_t}{\mathrm{SNR}_2[k]_t + 1}

\mathrm{SNR}_2[k]_t = \alpha_2 \, G_2[k]_{t-1}^2 \, \mathrm{SNR}_{P2}[k]_{t-1} + (1 - \alpha_2) \max(\mathrm{SNR}_{P2}[k]_t - 1, \, 0)

\mathrm{SNR}_{P2}[k]_t = \frac{P_{s2}[k]_t}{P_{n2}[k]_t}

where SNR_2 is the signal-to-noise ratio of the noise signal, SNR_{P2} is the posterior signal-to-noise ratio of the noise signal, P_{s2} is the auto-power spectrum of the noise signal, P_{n2} is the auto-power spectrum of the noise in the noise signal, and α_1 and α_2 take values in the range 0 < α_1, α_2 < 1.
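One frame of this recursive gain could be sketched as below. The sketch rests on an assumption: the max term is read as the posterior SNR of the current frame minus one, clamped at zero, as in the common decision-directed approach. The function name `decision_directed_gain`, the zero fallback for an empty noise estimate, and the example values (including alpha = 0.5) are hypothetical.

```python
def decision_directed_gain(p_s, p_n, prev_gain, prev_post_snr, alpha):
    """Return (gain, posterior_snr) per bin for one frame.

    prev_gain and prev_post_snr are the gain and posterior SNR of the
    previous frame; alpha must satisfy 0 < alpha < 1.
    """
    gains, post_snrs = [], []
    for ps, pn, g_prev, snr_prev in zip(p_s, p_n, prev_gain, prev_post_snr):
        # Posterior SNR of the current frame; 0.0 for an empty noise estimate
        # is an assumption made only to avoid division by zero.
        post = ps / pn if pn > 0.0 else 0.0
        snr = alpha * g_prev ** 2 * snr_prev + (1.0 - alpha) * max(post - 1.0, 0.0)
        gains.append(snr / (snr + 1.0))  # G = SNR / (SNR + 1)
        post_snrs.append(post)
    return gains, post_snrs


gains, post_snr = decision_directed_gain(
    p_s=[4.0, 1.0], p_n=[1.0, 1.0],
    prev_gain=[0.5, 0.5], prev_post_snr=[2.0, 1.0],
    alpha=0.5,
)
```

The returned posterior SNR and gain would be fed back in as `prev_post_snr` and `prev_gain` for the next frame.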

Preferably, the auto-power spectrum P_{s1} of the noisy speech signal is calculated by the following formula:

P_{s1} = S_{a1}^2,

where S_{a1} is the noisy speech signal amplitude spectrum obtained from the noisy speech signal by the frequency-domain transform; and/or

the auto-power spectrum P_{s2} of the noise signal is calculated by the following formula:

P_{s2} = S_{a2}^2,

where S_{a2} is the noise signal amplitude spectrum obtained from the noise signal by the frequency-domain transform.

Preferably, the auto-power spectrum P_{n1} of the noise in the noisy speech signal is estimated by the following formula:

P_{n1}[k]_t = \begin{cases} \eta_1 P_{n1}[k]_{t-1} + (1 - \eta_1) P_{s1}[k]_t, & P_{n1}[k]_{t-1} > P_{s1}[k]_t \\ \max\left( P_{n1}[k]_{t-1}, \; \eta_2 P_{n1}[k]_{t-1} + (1 - \eta_2) \frac{P_{s1}[k]_t - \eta_3 P_{s1}[k]_{t-1}}{1 - \eta_3} \right), & P_{n1}[k]_{t-1} \leq P_{s1}[k]_t \end{cases}

where the subscript t is the discrete time index, and η_1, η_2, η_3 are smoothing factors in the range 0 < η_1, η_2, η_3 < 1; and/or

the auto-power spectrum P_{n2} of the noise in the noise signal is estimated by the following formula:

P_{n2}[k]_t = \begin{cases} \eta_1 P_{n2}[k]_{t-1} + (1 - \eta_1) P_{s2}[k]_t, & P_{n2}[k]_{t-1} > P_{s2}[k]_t \\ \max\left( P_{n2}[k]_{t-1}, \; \eta_2 P_{n2}[k]_{t-1} + (1 - \eta_2) \frac{P_{s2}[k]_t - \eta_3 P_{s2}[k]_{t-1}}{1 - \eta_3} \right), & P_{n2}[k]_{t-1} \leq P_{s2}[k]_t \end{cases}

where the subscript t is the discrete time index, and η_1, η_2, η_3 are smoothing factors in the range 0 < η_1, η_2, η_3 < 1.

Preferably, in step 4, the short-time envelope of the speech signal is calculated by the following formula:

G_L[k] = \frac{S_a[k]}{\max_{1 \leq k \leq N} S_a[k]}, \quad k = 1, 2, 3, \ldots, N

where G_L is the short-time envelope of the speech signal and S_a is the short-time speech amplitude spectrum.
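The normalization by the largest bin can be sketched as follows. The function name `short_time_envelope` is hypothetical, and mapping an all-zero frame to an all-zero envelope is an assumption, since the formula itself is undefined in that case.

```python
def short_time_envelope(s_a):
    """Normalize a short-time amplitude spectrum by its largest bin."""
    peak = max(s_a)
    if peak <= 0.0:
        # Assumption: a silent frame maps to an all-zero envelope.
        return [0.0] * len(s_a)
    return [value / peak for value in s_a]


g_l = short_time_envelope([0.5, 2.0, 1.0, 0.0])
```

For a non-negative spectrum, every element of the envelope lies between 0 and 1 and the strongest bin is mapped to exactly 1.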

Preferably, the short-time speech amplitude spectrum S_a is replaced by the short-time average amplitude spectrum of the enhanced signal output after speech enhancement of the noisy speech signal; or

the short-time speech amplitude spectrum S_a is replaced by the short-time average of the pre-filtered noisy speech signal amplitude spectrum, calculated by the following formula:

S_a[k]_t = \begin{cases} \alpha_{sa} S_a[k]_{t-1} + (1 - \alpha_{sa}) S_{pa1}[k]_t, & S_{pa1}[k]_t > S_a[k]_{t-1} \\ \beta_{sa} S_a[k]_{t-1} + (1 - \beta_{sa}) S_{pa1}[k]_t, & S_{pa1}[k]_t \leq S_a[k]_{t-1} \end{cases}, \quad k = 1, 2, 3, \ldots, N

where S_{pa1} is the pre-filtered noisy speech signal amplitude spectrum, α_{sa} and β_{sa} are smoothing factors, and 0 ≤ α_{sa} ≤ β_{sa} < 1.

Preferably, the smoothing factor α_{sa} = 1/2 and the smoothing factor β_{sa} = 31/32.
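The two-rate recursive average can be sketched as below, using the preferred factors 1/2 and 31/32 as defaults. The function name `smooth_amplitude` and the two-bin example are hypothetical.

```python
def smooth_amplitude(prev_s_a, s_pa1, alpha_sa=0.5, beta_sa=31 / 32):
    """One frame of the asymmetric recursive average, bin by bin.

    A bin that rises above its running average uses the small factor
    alpha_sa (fast attack); a bin at or below it uses beta_sa (slow decay).
    """
    smoothed = []
    for prev, cur in zip(prev_s_a, s_pa1):
        factor = alpha_sa if cur > prev else beta_sa
        smoothed.append(factor * prev + (1.0 - factor) * cur)
    return smoothed


# Bin 0 rises from 1.0 to 3.0; bin 1 falls from 1.0 to 0.0.
s_a = smooth_amplitude([1.0, 1.0], [3.0, 0.0])
```

With these factors the average follows a rising spectrum within a frame or two but decays slowly, so the envelope keeps its shape across short pauses.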

G_L[k] = \frac{S_e[k]}{\max_{1 \leq k \leq N} S_e[k]}, \quad k = 1, 2, 3, \ldots, N

where G_L is the short-time envelope of the speech signal and S_e is the short-time speech energy spectrum; or

G_L[k] = \sqrt{\frac{S_e[k]}{\max_{1 \leq k \leq N} S_e[k]}}, \quad k = 1, 2, 3, \ldots, N

where G_L is the short-time envelope of the speech signal and S_e is the short-time speech energy spectrum.

Preferably, the short-time speech energy spectrum S_e is replaced by the short-time average energy spectrum of the enhanced signal output after speech enhancement of the noisy speech signal; or

the short-time speech energy spectrum S_e is replaced by the short-time mean of the square of the pre-filtered noisy speech signal amplitude spectrum S_{pa1}, calculated by the following formula:

S_e[k]_t = \begin{cases} \alpha_{se} S_e[k]_{t-1} + (1 - \alpha_{se}) S_{pa1}[k]_t^2, & S_{pa1}[k]_t^2 > S_e[k]_{t-1} \\ \beta_{se} S_e[k]_{t-1} + (1 - \beta_{se}) S_{pa1}[k]_t^2, & S_{pa1}[k]_t^2 \leq S_e[k]_{t-1} \end{cases}, \quad k = 1, 2, 3, \ldots, N

where S_{pa1} is the pre-filtered noisy speech signal amplitude spectrum, α_{se} and β_{se} are smoothing factors, and 0 ≤ α_{se} ≤ β_{se} < 1.

Preferably, the smoothing factor α_{se} = 1/2 and the smoothing factor β_{se} = 31/32.

Preferably, in step 4, the short-time envelope of the speech signal is obtained by short-time smoothing of the pre-filter transfer function G_1 of the noisy speech signal amplitude spectrum or the pre-filter transfer function G_2 of the noise signal amplitude spectrum:

when the short-time envelope of the speech signal is obtained from the pre-filter transfer function G_1 of the noisy speech signal amplitude spectrum, it is calculated by the following formula:

G_L[k]_t = \begin{cases} \alpha_G G_L[k]_{t-1} + (1 - \alpha_G) G_1[k]_t, & G_1[k]_t > G_L[k]_{t-1} \\ \beta_G G_L[k]_{t-1} + (1 - \beta_G) G_1[k]_t, & G_1[k]_t \leq G_L[k]_{t-1} \end{cases}, \quad k = 1, 2, 3, \ldots, N

where G_L is the short-time envelope of the speech signal, α_G and β_G are smoothing factors, and 0 ≤ α_G ≤ β_G < 1.

Preferably, the smoothing factor α_G = 1/2 and the smoothing factor β_G = 31/32.

Preferably, in step 5:

the shaping of the pre-filtered noisy speech signal amplitude spectrum by the short-time envelope of the speech signal is calculated by the following formula:

S_{sa1}[k]_t = S_{pa1}[k]_t \, G_L[k]_t, \quad k = 1, 2, 3, \ldots, N

where S_{sa1} is the shaped noisy speech signal amplitude spectrum, S_{pa1} is the pre-filtered noisy speech signal amplitude spectrum, and G_L is the short-time envelope of the speech signal, a vector of length N whose elements lie between 0 and 1; and/or

the shaping of the pre-filtered noise signal amplitude spectrum by the short-time envelope of the speech signal is calculated by the following formula:

S_{sa2}[k]_t = S_{pa2}[k]_t \, G_L[k]_t, \quad k = 1, 2, 3, \ldots, N

where S_{sa2} is the shaped noise signal amplitude spectrum, S_{pa2} is the pre-filtered noise signal amplitude spectrum, and G_L is the short-time envelope of the speech signal, a vector of length N whose elements lie between 0 and 1.
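Shaping is an element-wise product, applied with the same envelope to both channels. A minimal sketch, with a hypothetical function name `shape_spectrum` and made-up four-bin spectra:

```python
def shape_spectrum(s_pa, g_l):
    """Weight each pre-filtered bin by the speech envelope: S_sa[k] = S_pa[k] * G_L[k]."""
    if len(s_pa) != len(g_l):
        raise ValueError("spectrum and envelope must have the same length")
    return [amplitude * weight for amplitude, weight in zip(s_pa, g_l)]


g_l = [0.25, 1.0, 0.5, 0.0]                          # hypothetical speech envelope
s_sa1 = shape_spectrum([4.0, 2.0, 2.0, 1.0], g_l)    # main-microphone channel
s_sa2 = shape_spectrum([4.0, 0.5, 1.0, 3.0], g_l)    # secondary-microphone channel
```

Bins where the envelope is small contribute little to either channel afterwards, so energy far from the speech envelope (here the last bin of the secondary channel) is suppressed before the two spectra are compared.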

Preferably, in step 6, the energy ratio obtained is a full-band energy ratio, calculated by the following formula:

r_{t1} = \frac{\sum_{k=1}^{N} S_{sa1}[k]_t}{\epsilon + \sum_{k=1}^{N} S_{sa2}[k]_t}

where S_{sa1} is the shaped noisy speech signal amplitude spectrum, S_{sa2} is the shaped noise signal amplitude spectrum, r_{t1} is the full-band energy ratio, and ε is a small positive number that prevents division by zero.

Preferably, in step 6, the energy ratio obtained is a sub-band energy ratio, calculated by the following formula:

r_{t2} = \frac{\sum_{k=K_s}^{K_e} S_{sa1}[k]_t}{\epsilon + \sum_{k=K_s}^{K_e} S_{sa2}[k]_t}

where S_{sa1} is the shaped noisy speech signal amplitude spectrum, S_{sa2} is the shaped noise signal amplitude spectrum, r_{t2} is the sub-band energy ratio, ε is a small positive number that prevents division by zero, K_s is the starting index of the sub-band, and K_e is the ending index of the sub-band.
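Both ratios differ only in the range of bins that is summed, so one helper can cover them. In the sketch below the function name `energy_ratio`, the default `eps` of 1e-12, and the example spectra are hypothetical; bin indices are 1-based and inclusive to match the formulas.

```python
def energy_ratio(s_sa1, s_sa2, k_start=1, k_end=None, eps=1e-12):
    """Ratio of the accumulated shaped spectra over bins k_start..k_end.

    With the defaults the sum runs over all N bins (full-band ratio);
    passing k_start and k_end gives the sub-band ratio.
    """
    if k_end is None:
        k_end = len(s_sa1)
    numerator = sum(s_sa1[k_start - 1:k_end])
    denominator = eps + sum(s_sa2[k_start - 1:k_end])
    return numerator / denominator


s_sa1 = [1.0, 2.0, 1.0, 0.0]   # hypothetical shaped main-channel spectrum
s_sa2 = [1.0, 0.5, 0.5, 0.0]   # hypothetical shaped secondary-channel spectrum
r_full = energy_ratio(s_sa1, s_sa2)
r_sub = energy_ratio(s_sa1, s_sa2, k_start=2, k_end=3)
```

Here the full-band ratio is close to 2 and the ratio over bins 2 to 3 is close to 3; the tiny `eps` only matters when the denominator sum is zero.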

Preferably, in step 7, the energy ratio is compared with a preset threshold;

when the energy ratio is greater than the preset threshold, the signal in the frequency band and at the moment corresponding to the energy ratio is judged to be speech;

when the energy ratio is less than the preset threshold, the signal in the frequency band and at the moment corresponding to the energy ratio is judged to be noise;

the preset threshold is a positive real number between 0 and 1.

Preferably, in step 7, before the decision is made, the energy ratio is smoothed by the following formula:

r_{s,t} = \begin{cases} \alpha_r r_{s,t-1} + (1 - \alpha_r) r_t, & r_t > r_{s,t-1} \\ \beta_r r_{s,t-1} + (1 - \beta_r) r_t, & r_t \leq r_{s,t-1} \end{cases}

where r_{s,t} is the smoothed energy ratio at time t, r_t is the energy ratio before smoothing, α_r and β_r are smoothing factors, and 0 ≤ α_r ≤ β_r < 1;

the smoothed energy ratio is compared with a preset threshold;

when the smoothed energy ratio is greater than the preset threshold, the signal in the frequency band and at the moment corresponding to the energy ratio is judged to be speech;

when the smoothed energy ratio is less than the preset threshold, the signal in the frequency band and at the moment corresponding to the energy ratio is judged to be noise.

Preferably, the smoothing factor α_r = 1/16 and the smoothing factor β_r = 3/4.

Preferably, the preset threshold is 0.25.
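The smoothing and threshold comparison can be sketched as a small per-frame loop. Two assumptions are made: the quantity being smoothed is read as the unsmoothed ratio of the current frame, and a ratio exactly equal to the threshold is treated as noise, a case the text does not specify. The function names, the starting value 0.0, and the ratio sequence are hypothetical; the defaults are the preferred values 1/16, 3/4 and 0.25.

```python
def smooth_ratio(prev_smoothed, ratio, alpha_r=1 / 16, beta_r=3 / 4):
    """One step of the asymmetric smoothing of the energy ratio."""
    factor = alpha_r if ratio > prev_smoothed else beta_r
    return factor * prev_smoothed + (1.0 - factor) * ratio


def is_speech(smoothed_ratio, threshold=0.25):
    """Voice activity decision; equality is treated as noise (assumption)."""
    return smoothed_ratio > threshold


ratios = [0.1, 0.9, 0.1, 0.1]   # hypothetical per-frame energy ratios
smoothed = 0.0                  # assumed initial state
decisions = []
for r in ratios:
    smoothed = smooth_ratio(smoothed, r)
    decisions.append(is_speech(smoothed))
```

In this example the single high-ratio frame switches the decision to speech immediately, and the slower decay keeps it there for the following two low-ratio frames, which acts as a short hangover.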

Also provided is a voice capture device that uses the dual-microphone voice activity detection method described above.

The beneficial effect of the above technical solution is as follows:

The accuracy of the voice activity decision under low signal-to-noise ratio conditions can be significantly improved, so that a more satisfactory speech signal can be output.

Brief description of the drawings

The invention and its features, aspects and advantages will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the drawings. Identical reference signs denote identical parts throughout the drawings. The drawings are not deliberately drawn to scale; the emphasis is on illustrating the gist of the invention.

Fig. 1 shows a prior-art energy-ratio voice activity decision flow;

Fig. 2 is a flow chart of the steps of the dual-microphone noise-reduction method provided by the invention;

Fig. 3 is a flow chart of one embodiment of the dual-microphone noise-reduction method provided by the invention;

Fig. 4 compares the voice activity detection results of an embodiment of the technical solution of the invention and of the prior art on a segment of noisy speech.

Detailed description of the embodiments

In the following description, numerous specific details are given to provide a more thorough understanding of the invention. However, it will be obvious to those skilled in the art that the invention can be practiced without one or more of these details. In other instances, some technical features well known in the art are not described, in order to avoid obscuring the invention.

It should be understood that the invention can be implemented in different forms and should not be construed as limited to the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be thorough and complete and will fully convey the scope of the invention to those skilled in the art.

The terminology used herein is for the purpose of describing specific embodiments only and is not intended to limit the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, integers, steps, operations, elements and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. As used herein, the term "and/or" includes any and all combinations of the associated listed items.

For a thorough understanding of the invention, detailed steps and detailed structures are set out in the following description to explain the technical solution of the invention. Preferred embodiments of the invention are described in detail below; beyond these detailed descriptions, the invention may also have other embodiments.

In a dual-microphone system, the main microphone is usually placed close to the user's mouth, so the speech signal it receives is strong; the secondary microphone can be placed away from the user's mouth, so the speech signal it receives is weak. A further characteristic of a dual-microphone system is that the noise signals received by the two microphones are of comparable magnitude. The prior art therefore includes a technical solution that makes the voice activity decision from the energy difference between the primary and secondary microphones: when the energy of the main microphone exceeds the energy of the secondary microphone by a certain amount, the signal at the current moment is considered to be speech. The flow of this method is shown in Fig. 1.

Thanks to the physical arrangement of the primary and secondary microphones, the technical solution shown in Fig. 1 achieves, at the same signal-to-noise ratio, a decision accuracy significantly higher than that of voice activity detection methods using only a single microphone signal. However, as the signal-to-noise ratio decreases further, the noise energy grows and the noise becomes less stationary, so the energy difference between the two channels is no longer pronounced, and the accuracy of a scheme that relies on the energy difference alone drops significantly.

Based on the above finding, a dual-microphone voice activity detection method is provided, as shown in Fig. 2.

The method comprises the following steps:

Step 1: obtain a noisy speech signal and a noise signal corresponding to the noisy speech signal;

Step 2: apply a frequency-domain transform to the noisy speech signal to obtain a noisy speech signal amplitude spectrum, and apply a frequency-domain transform to the noise signal to obtain a noise signal amplitude spectrum;

Step 3: pre-filter the noisy speech signal amplitude spectrum and the noise signal amplitude spectrum separately;

Step 4: obtain the short-time envelope of the speech signal;

Step 5: use the short-time envelope of the speech signal to shape the pre-filtered noisy speech signal amplitude spectrum and the pre-filtered noise signal amplitude spectrum;

Step 6: accumulate and compare the shaped noisy speech signal amplitude spectrum and the shaped noise signal amplitude spectrum to obtain an energy ratio;

Step 7: make the voice activity decision according to the energy ratio.

Fig. 3 illustrates a specific embodiment of the above technical solution, in which the noisy speech signal and the corresponding noise signal are captured by an existing dual-microphone structure: the noisy speech signal s_1(t) is captured by a main microphone, and the noise signal s_2(t) corresponding to the noisy speech signal is captured by a secondary microphone.

In a preferred embodiment, in step 2, the noisy speech signal is transformed to the frequency domain by a discrete Fourier transform (DFT), a discrete cosine transform (DCT), or a modified discrete cosine transform (MDCT) to obtain the noisy speech signal amplitude spectrum. On this basis, when the discrete Fourier transform is used, the noisy speech signal amplitude spectrum can be calculated by the following formula:

S_{a1}[k]_t = \left| \sum_{n=1}^{N} w(n) \, s_1(t - N + n) \, e^{-\frac{2 \pi j (n-1)(k-1)}{N}} \right|

where S_{a1} is the noisy speech signal amplitude spectrum, s_1(t) is the noisy speech signal, e is the base of the natural logarithm, j is the imaginary unit, j = (-1)^{0.5}, k is the discrete spectrum index, k = 1, 2, 3, ..., N, the subscript t is the discrete time index, and w(n) is an N-point window function.

Since the discrete cosine transform and the modified discrete cosine transform are methods well known in the art, they are not described further.

In a further embodiment, the noise signal is transformed to the frequency domain by a discrete Fourier transform, a discrete cosine transform, or a modified discrete cosine transform to obtain the noise signal amplitude spectrum.

On this basis, when the discrete Fourier transform is used, the noise signal amplitude spectrum can be calculated by the following formula:

S_{a2}[k]_t = \left| \sum_{n=1}^{N} w(n) \, s_2(t - N + n) \, e^{-\frac{2 \pi j (n-1)(k-1)}{N}} \right|

where S_{a2} is the noise signal amplitude spectrum, s_2(t) is the noise signal, e is the base of the natural logarithm, j is the imaginary unit, j = (-1)^{0.5}, k is the discrete spectrum index, k = 1, 2, 3, ..., N, the subscript t is the discrete time index, and w(n) is an N-point window function.

Based on the above embodiment, the value of N determines the resolution of the frequency-domain analysis. Given the requirements that the frequency-domain resolution be greater than 100 Hz and that the time window be shorter than 0.2 seconds, N can be in the range f_s/100/2 < N < 0.2 f_s, where f_s is the sampling frequency; preferably, N = 512 when the sampling frequency f_s = 8000 Hz.
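As a quick check of this range, the sketch below evaluates the two bounds for a given sampling frequency and tests whether a candidate N lies strictly between them. The function names are hypothetical.

```python
def frame_length_bounds(sample_rate_hz):
    """Return (lower, upper) of the open interval f_s/100/2 < N < 0.2 * f_s."""
    return sample_rate_hz / 100 / 2, 0.2 * sample_rate_hz


def is_valid_frame_length(n_points, sample_rate_hz):
    """True when n_points lies strictly inside the allowed interval."""
    lower, upper = frame_length_bounds(sample_rate_hz)
    return lower < n_points < upper


bounds = frame_length_bounds(8000)
ok_512 = is_valid_frame_length(512, 8000)
```

At f_s = 8000 Hz the interval is roughly 40 < N < 1600, so the preferred N = 512 (a 64 ms window) falls well inside it.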

In a further technical solution, the window function w(n) can be a rectangular window, a sine window, a Hann (Hanning) window, a Hamming window, or a Tukey window.

Since the window functions listed above are well known in the art, they are not described further.

In a preferred embodiment, in step 3, the pre-filtering of the noisy speech signal amplitude spectrum is calculated by the following formula:

S_{pa1}[k]_t = S_{a1}[k]_t \, G_1[k]_t, \quad k = 1, 2, 3, \ldots, N

where S_{pa1} is the pre-filtered noisy speech signal amplitude spectrum, S_{a1} is the noisy speech signal amplitude spectrum, and G_1 is the pre-filter transfer function, a vector of length N whose elements lie between 0 and 1.

In a further embodiment, in step 3, the pre-filtering of the noise signal amplitude spectrum is calculated by the following formula:

S_{pa2}[k]_t = S_{a2}[k]_t \, G_2[k]_t, \quad k = 1, 2, 3, \ldots, N

where S_{pa2} is the pre-filtered noise signal amplitude spectrum, S_{a2} is the noise signal amplitude spectrum, and G_2 is the pre-filter transfer function, a vector of length N whose elements lie between 0 and 1.

In a preferred embodiment, the noisy speech signal amplitude spectrum S_{a1} and the noise signal amplitude spectrum S_{a2} are pre-filtered by a frequency-domain Wiener filter.

According to the Wiener filter concept well known in the art, the theoretical form of the frequency-domain Wiener filters that pre-filter S_{a1} and S_{a2} is:

G_1[k]_t = \frac{P_{s,s1}[k]_t}{P_{s1}[k]_t}, \quad G_2[k]_t = \frac{P_{s,s2}[k]_t}{P_{s2}[k]_t}

where P_{s,s1} is the cross-power spectrum of the speech signal and the noisy speech signal s_1(t), P_{s,s2} is the cross-power spectrum of the speech signal and the noise signal s_2(t), P_{s1} is the auto-power spectrum of the noisy speech signal s_1(t), and P_{s2} is the auto-power spectrum of the noise signal s_2(t).

The auto-power spectrum P_{s1} of the noisy speech signal s_1(t) is obtained by the following formula:

P_{s1} = S_{a1}^2,

where S_{a1} is the noisy speech signal amplitude spectrum.

The auto-power spectrum P_{s2} of the noise signal s_2(t) is obtained by the following formula:

P_{s2} = S_{a2}^2,

where S_{a2} is the noise signal amplitude spectrum.

Because the speech signal is unknown at the pre-filtering stage, the cross-power spectra P_{s,s1} and P_{s,s2} cannot be obtained directly. In a preferred embodiment of the invention, they are obtained by estimating the noise auto-power spectrum. Noise estimation methods include tracking the short-time minimum of the signal spectrum and time-recursive averaging; these are well known in the art and are not described further. Below, the G. Doblinger noise estimation method (see Gerhard Doblinger, "Computationally efficient speech enhancement by spectral minima tracking in subbands," Proc. EUROSPEECH '95, Madrid, pp. 1513-1516) is used as an example to illustrate the feasibility of the technical solution of the invention; this method combines short-time spectral minimum tracking with time-recursive averaging.

Based on the G. Doblinger noise estimation method, the auto-power spectrum P_{n1} of the noise in the noisy speech signal s_1(t) is estimated by the following formula:

P_{n1}[k]_{t} = \begin{cases} η_{1} P_{n1}[k]_{t-1} + (1 - η_{1}) P_{s1}[k]_{t}, & P_{n1}[k]_{t-1} > P_{s1}[k]_{t} \\ \max\left(P_{n1}[k]_{t-1},\ η_{2} P_{n1}[k]_{t-1} + (1 - η_{2}) \frac{P_{s1}[k]_{t} - η_{3} P_{s1}[k]_{t-1}}{1 - η_{3}}\right), & P_{n1}[k]_{t-1} \leq P_{s1}[k]_{t} \end{cases}

On the basis of the above technical solution, the auto-power spectrum P_{n2} of the noise in the noise signal s_2(t) is further estimated by the following formula:

P_{n2}[k]_{t} = \begin{cases} η_{1} P_{n2}[k]_{t-1} + (1 - η_{1}) P_{s2}[k]_{t}, & P_{n2}[k]_{t-1} > P_{s2}[k]_{t} \\ \max\left(P_{n2}[k]_{t-1},\ η_{2} P_{n2}[k]_{t-1} + (1 - η_{2}) \frac{P_{s2}[k]_{t} - η_{3} P_{s2}[k]_{t-1}}{1 - η_{3}}\right), & P_{n2}[k]_{t-1} \leq P_{s2}[k]_{t} \end{cases}

where the subscript t is the discrete time index, and η_{1}, η_{2}, η_{3} are smoothing factors with 0 < η_{1}, η_{2}, η_{3} < 1.

In a preferred embodiment, the smoothing factors are η_{1} = 0.99, η_{2} = 0.99, η_{3} = 0.8.
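As a minimal sketch of the noise recursion above, the following Python function updates the noise auto-power spectrum for one frame. The function name `update_noise_psd`, the per-bin list representation, and the passing of the previous frame's spectra as arguments are illustrative assumptions rather than part of the disclosure; the default smoothing factors are the preferred values given above.

```python
def update_noise_psd(p_n_prev, p_s, p_s_prev, eta1=0.99, eta2=0.99, eta3=0.8):
    """Update the noise auto-power spectrum for one frame (hypothetical helper).

    p_n_prev: noise power estimate of the previous frame, one value per bin
    p_s:      signal auto-power spectrum of the current frame
    p_s_prev: signal auto-power spectrum of the previous frame
    """
    p_n = []
    for n_prev, s, s_prev in zip(p_n_prev, p_s, p_s_prev):
        if n_prev > s:
            # Signal power fell below the noise estimate: follow it downwards.
            p_n.append(eta1 * n_prev + (1 - eta1) * s)
        else:
            # Otherwise let the estimate rise slowly, never below its old value.
            candidate = eta2 * n_prev + (1 - eta2) * (s - eta3 * s_prev) / (1 - eta3)
            p_n.append(max(n_prev, candidate))
    return p_n
```

The same function can be called once per frame for each microphone channel, with the returned list fed back as `p_n_prev` for the next frame.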

Once the auto-power spectrum P_{s1} of the noisy speech signal s_1(t) and the auto-power spectrum P_{n1} of the noise in s_1(t) have been obtained, in a preferred embodiment the frequency-domain Wiener filter G_{1} that pre-filters the noisy speech signal amplitude spectrum is calculated by the following formula:

G_{1}[k]_{t} = \sqrt{\frac{\max(P_{s1}[k]_{t} - P_{n1}[k]_{t}, 0)}{P_{s1}[k]_{t}}}

where P_{s1} is the auto-power spectrum of the noisy speech signal and P_{n1} is the auto-power spectrum of the noise in the noisy speech signal.

In a further embodiment, once the auto-power spectrum P_{s2} of the noise signal s_2(t) and the auto-power spectrum P_{n2} of the noise in s_2(t) have been obtained, the frequency-domain Wiener filter G_{2} that pre-filters the noise signal amplitude spectrum is calculated by the following formula:

G_{2}[k]_{t} = \sqrt{\frac{\max(P_{s2}[k]_{t} - P_{n2}[k]_{t}, 0)}{P_{s2}[k]_{t}}}
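The square-root gain above, and its application to an amplitude spectrum, can be sketched in Python as follows. The names `wiener_gain` and `prefilter` are hypothetical, and the `floor` guard against a zero denominator is an added assumption that the source formula does not contain.

```python
import math


def wiener_gain(p_s, p_n, floor=1e-12):
    """Per-bin gain sqrt(max(P_s - P_n, 0) / P_s).

    floor is an assumed safeguard against division by zero.
    """
    return [math.sqrt(max(s - n, 0.0) / max(s, floor)) for s, n in zip(p_s, p_n)]


def prefilter(s_a, gain):
    """Pre-filtered amplitude spectrum: each bin of S_a multiplied by its gain."""
    return [a * g for a, g in zip(s_a, gain)]
```

For example, a bin with signal power 4.0 and estimated noise power 3.0 receives a gain of 0.5, while a bin whose noise estimate exceeds the signal power is set to zero.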

As an optional embodiment, in order to retain more of the dominant energy, the frequency-domain Wiener filter G_{1} that pre-filters the noisy speech signal amplitude spectrum is calculated by the following formulas:

G_{1}[k]_{t} = \frac{SNR_{1}[k]_{t}}{SNR_{1}[k]_{t} + 1}

SNR_{P1}[k]_{t} = \frac{P_{s1}[k]_{t}}{P_{n1}[k]_{t}}

where SNR_{1} is the signal-to-noise ratio of the noisy speech signal, SNR_{P1} is the a posteriori signal-to-noise ratio of the noisy speech signal, P_{s1} is the auto-power spectrum of the noisy speech signal, P_{n1} is the auto-power spectrum of the noise in the noisy speech signal, and α_{1} and α_{2} take values in the range 0 < α_{1}, α_{2} < 1.

In a further optional embodiment, in order to retain more of the dominant energy, the frequency-domain Wiener filter G_{2} that pre-filters the noise signal amplitude spectrum is calculated by the following formulas:

G_{2}[k]_{t} = \frac{SNR_{2}[k]_{t}}{SNR_{2}[k]_{t} + 1}

SNR_{P2}[k]_{t} = \frac{P_{s2}[k]_{t}}{P_{n2}[k]_{t}}

where SNR_{2} is the signal-to-noise ratio of the noise signal, SNR_{P2} is the a posteriori signal-to-noise ratio of the noise signal, P_{s2} is the auto-power spectrum of the noise signal, P_{n2} is the auto-power spectrum of the noise in the noise signal, and α_{1} and α_{2} take values in the range 0 < α_{1}, α_{2} < 1.

In a preferred embodiment, in step 4, the short-time envelope of the speech signal is obtained by normalizing the short-time speech amplitude spectrum, and is calculated by the following formula:

G_{L}[k] = \frac{S_{a}[k]}{\max [S_{a}[k]]_{k=1}^{N}}, k = 1, 2, 3, ..., N

where G_{L} is the short-time envelope of the speech signal and S_{a} is the short-time speech amplitude spectrum.

Since the short-time speech amplitude spectrum S_{a} cannot be obtained directly, in a preferred embodiment it may be replaced by the short-time average amplitude spectrum of the enhanced signal output after speech enhancement of the noisy speech signal; alternatively, S_{a} may be replaced by the short-time average of the pre-filtered noisy speech signal amplitude spectrum S_{pa1}.

When the short-time speech amplitude spectrum S_{a} is replaced by the short-time average of the pre-filtered noisy speech signal amplitude spectrum S_{pa1}, it is calculated by the following formula:

S_{a}[k]_{t} = \begin{cases} α_{sa} S_{a}[k]_{t-1} + (1 - α_{sa}) S_{pa1}[k]_{t}, & S_{pa1}[k]_{t} > S_{a}[k]_{t-1} \\ β_{sa} S_{a}[k]_{t-1} + (1 - β_{sa}) S_{pa1}[k]_{t}, & S_{pa1}[k]_{t} \leq S_{a}[k]_{t-1} \end{cases}, k = 1, 2, 3, ..., N

In a preferred embodiment, the smoothing factors are α_{sa} = 1/2 and β_{sa} = 31/32.
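The two operations described above (asymmetric short-time averaging of the pre-filtered amplitude spectrum, followed by normalization to the spectral peak) might be sketched in Python as below. The function names and the `floor` guard for an all-zero spectrum are assumptions made for illustration; the default smoothing factors are the preferred values given above.

```python
def smooth_asymmetric(prev, cur, alpha=0.5, beta=31 / 32):
    """One frame of short-time averaging with separate rise and fall factors.

    alpha applies where the new value exceeds the running average (fast rise),
    beta applies otherwise (slow decay).
    """
    out = []
    for p, c in zip(prev, cur):
        coef = alpha if c > p else beta
        out.append(coef * p + (1 - coef) * c)
    return out


def normalize_envelope(s_a, floor=1e-12):
    """Divide every bin by the largest bin so that the envelope peaks at 1.

    floor is an assumed safeguard for an all-zero spectrum.
    """
    peak = max(max(s_a), floor)
    return [x / peak for x in s_a]
```

With the preferred factors, a bin that rises from 1.0 to 3.0 moves to 2.0 in one frame, whereas a bin that drops from 4.0 to 0.0 only decays to 3.875, so the averaged spectrum follows speech onsets quickly and releases slowly.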

As an alternative embodiment, in step 4, the short-time envelope of the speech signal is obtained by normalizing the short-time speech energy spectrum, and is calculated by the following formula:

G_{L}[k] = \frac{S_{e}[k]}{\max [S_{e}[k]]_{k=1}^{N}}, k = 1, 2, 3, ..., N

or

G_{L}[k] = \sqrt{\frac{S_{e}[k]}{\max [S_{e}[k]]_{k=1}^{N}}}, k = 1, 2, 3, ..., N

where G_{L} is the short-time envelope of the speech signal and S_{e} is the short-time speech energy spectrum.

Since the short-time speech energy spectrum S_{e} cannot be obtained directly, in a preferred embodiment it may be replaced by the short-time average energy spectrum of the enhanced signal output after speech enhancement of the noisy speech signal; alternatively, S_{e} may be replaced by the short-time mean of the square of the pre-filtered noisy speech signal amplitude spectrum S_{pa1}.

When the short-time speech energy spectrum S_{e} is replaced by the short-time mean of the square of the pre-filtered noisy speech signal amplitude spectrum S_{pa1}, it is calculated by the following formula:

S_{e}[k]_{t} = \begin{cases} α_{se} S_{e}[k]_{t-1} + (1 - α_{se}) S_{pa1}[k]_{t}^{2}, & S_{pa1}[k]_{t}^{2} > S_{e}[k]_{t-1} \\ β_{se} S_{e}[k]_{t-1} + (1 - β_{se}) S_{pa1}[k]_{t}^{2}, & S_{pa1}[k]_{t}^{2} \leq S_{e}[k]_{t-1} \end{cases}, k = 1, 2, 3, ..., N

In a preferred embodiment, the smoothing factors are α_{se} = 1/2 and β_{se} = 31/32.

In a preferred embodiment, in step 4, the short-time envelope of the speech signal is obtained by short-time smoothing of the pre-filtering transfer function G_{1} of the noisy speech signal amplitude spectrum or the pre-filtering transfer function G_{2} of the noise signal amplitude spectrum.

When the short-time envelope of the speech signal is obtained from the pre-filtering transfer function G_{1} of the noisy speech signal amplitude spectrum, it is calculated by the following formula:

G_{L}[k]_{t} = \begin{cases} α_{G} G_{L}[k]_{t-1} + (1 - α_{G}) G_{1}[k]_{t}, & G_{1}[k]_{t} > G_{L}[k]_{t-1} \\ β_{G} G_{L}[k]_{t-1} + (1 - β_{G}) G_{1}[k]_{t}, & G_{1}[k]_{t} \leq G_{L}[k]_{t-1} \end{cases}, k = 1, 2, 3, ..., N

where G_{L} is the short-time envelope of the speech signal, and α_{G} and β_{G} are smoothing factors with 0 ≤ α_{G} ≤ β_{G} < 1.

In a further embodiment, the smoothing factors are α_{G} = 1/2 and β_{G} = 31/32.
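A stateful sketch of this gain-based envelope, kept across frames, could look like the following. The class name `GainEnvelopeTracker` and the all-zero initial envelope are assumptions, since the source does not state how G_L is initialized; the sketch also assumes that the right-hand side of the recursion uses the envelope of the previous frame.

```python
class GainEnvelopeTracker:
    """Tracks the short-time envelope G_L per bin from the pre-filter gain G_1."""

    def __init__(self, num_bins, alpha_g=0.5, beta_g=31 / 32):
        # Assumed initial state: the source does not specify a starting value.
        self.g_l = [0.0] * num_bins
        self.alpha_g = alpha_g
        self.beta_g = beta_g

    def update(self, g1):
        """Feed the gain of the current frame and return the new envelope."""
        new = []
        for prev, g in zip(self.g_l, g1):
            # Fast rise when the gain exceeds the envelope, slow decay otherwise.
            coef = self.alpha_g if g > prev else self.beta_g
            new.append(coef * prev + (1 - coef) * g)
        self.g_l = new
        return new
```

Because the gains lie between 0 and 1 and every update is a convex combination, the tracked envelope also stays between 0 and 1 without a separate normalization step.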

In a preferred embodiment, in step 5:

The shaping of the pre-filtered noisy speech signal amplitude spectrum by the short-time envelope of the speech signal is calculated by the following formula:

S_{sa1}[k]_{t} = S_{pa1}[k]_{t} G_{L}[k]_{t}, k = 1, 2, 3, ..., N

where S_{sa1} is the shaped noisy speech signal amplitude spectrum, S_{pa1} is the pre-filtered noisy speech signal amplitude spectrum, and G_{L} is the short-time envelope of the speech signal; G_{L} is a vector of length N whose elements lie between 0 and 1.

In a further embodiment, the shaping of the pre-filtered noise signal amplitude spectrum by the short-time envelope of the speech signal is calculated by the following formula:

S_{sa2}[k]_{t} = S_{pa2}[k]_{t} G_{L}[k]_{t}, k = 1, 2, 3, ..., N

where S_{sa2} is the shaped noise signal amplitude spectrum, S_{pa2} is the pre-filtered noise signal amplitude spectrum, and G_{L} is the short-time envelope of the speech signal; G_{L} is a vector of length N whose elements lie between 0 and 1.

In a preferred embodiment, in step 6, the energy ratio obtained is the full-band energy ratio, calculated by the following formula:

r_{t1} = \frac{Σ_{k=1}^{N} S_{sa1}[k]_{t}}{ϵ + Σ_{k=1}^{N} S_{sa2}[k]_{t}}

As an optional embodiment, in step 6, the energy ratio obtained is a sub-band energy ratio, calculated by the following formula:

r_{t2} = \frac{Σ_{k=K_{s}}^{K_{e}} S_{sa1}[k]_{t}}{ϵ + Σ_{k=K_{s}}^{K_{e}} S_{sa2}[k]_{t}}
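Envelope shaping and the accumulation into an energy ratio can be combined in one short Python sketch. The function name `shaped_energy_ratio`, the 0-based half-open bin range `[k_start, k_end)` (the formulas above use 1-based inclusive limits), and the default value of `eps` are assumptions chosen for illustration.

```python
def shaped_energy_ratio(s_pa1, s_pa2, g_l, k_start=0, k_end=None, eps=1e-9):
    """Ratio of the envelope-shaped main spectrum to the shaped reference spectrum.

    s_pa1, s_pa2: pre-filtered amplitude spectra of the two channels
    g_l:          short-time envelope, one weight per bin
    k_start/k_end: 0-based half-open bin range; the default covers the full band
    eps:          small constant in the denominator (assumed value)
    """
    if k_end is None:
        k_end = len(g_l)
    num = sum(s_pa1[k] * g_l[k] for k in range(k_start, k_end))
    den = sum(s_pa2[k] * g_l[k] for k in range(k_start, k_end))
    return num / (eps + den)
```

Calling the function with the default range gives the full-band ratio; passing a narrower `k_start` and `k_end` gives a sub-band ratio over the chosen bins.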

In a preferred embodiment, in step 7, the energy ratio obtained in step 6 is compared with a predetermined threshold.

When the energy ratio is greater than the predetermined threshold, the signal in the corresponding frequency band at the time of that energy ratio is judged to be speech.

When the energy ratio is less than the predetermined threshold, the signal in the corresponding frequency band at the time of that energy ratio is judged to be noise.

The predetermined threshold is a positive real number between 0 and 1.

As an optional embodiment, in step 7, before the judgment is made, the energy ratio obtained in step 6 may first be smoothed by the following formula:

r_{s,t} = \begin{cases} α_{r} r_{s,t-1} + (1 - α_{r}) r_{t}, & r_{t} > r_{s,t-1} \\ β_{r} r_{s,t-1} + (1 - β_{r}) r_{t}, & r_{t} \leq r_{s,t-1} \end{cases}

where r_{t} is the energy ratio obtained in step 6 and r_{s,t} is the smoothed energy ratio.

Preferably, the smoothing factors are α_{r} = 1/16 and β_{r} = 3/4.

As a further embodiment, the smoothed energy ratio may then be compared with a predetermined threshold.

When the smoothed energy ratio is greater than the predetermined threshold, the signal in the corresponding frequency band at the time of that energy ratio is judged to be speech.

When the smoothed energy ratio is less than the predetermined threshold, the signal in the corresponding frequency band at the time of that energy ratio is judged to be noise.

The predetermined threshold is a positive real number between 0 and 1.

In a preferred embodiment, the predetermined threshold is 0.25.
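The smoothing and threshold decision of step 7 can be illustrated with a short Python sketch. The function names, the zero initial state, and the treatment of a ratio exactly equal to the threshold as noise are assumptions; the sketch also assumes that the new input to the smoother is the unsmoothed energy ratio from step 6.

```python
def smooth_ratio(r_s_prev, r, alpha_r=1 / 16, beta_r=3 / 4):
    """Asymmetric smoothing of the energy ratio: fast rise, slower fall."""
    coef = alpha_r if r > r_s_prev else beta_r
    return coef * r_s_prev + (1 - coef) * r


def is_speech(ratio, threshold=0.25):
    """Judge speech when the ratio exceeds the threshold (equality counts as noise here)."""
    return ratio > threshold


def detect(ratios, threshold=0.25):
    """Run the smoother over a sequence of per-frame ratios and return the decisions."""
    r_s = 0.0  # assumed initial smoothed ratio
    decisions = []
    for r in ratios:
        r_s = smooth_ratio(r_s, r)
        decisions.append(is_speech(r_s, threshold))
    return decisions
```

With the preferred factors, `detect([0.1, 0.9, 0.9, 0.05])` marks the last frame as speech even though its raw ratio is below the threshold, because the smoothed ratio decays slowly after a run of high values.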

The technical solution of the present invention also includes a voice capture device that employs the above dual-microphone voice activation detection method.

Fig. 4 shows an example of voice activation detection on a segment of noisy speech. The segment lasts 35 seconds; the noise is aircraft engine noise that varies continuously from low to high and back to low, and the speech is almost completely masked in the middle period. The top of Fig. 4 shows the spectrogram of the main channel of this noisy speech. The middle of Fig. 4 shows the energy ratio curve of the prior art, and the bottom shows the energy ratio curve obtained by an embodiment of the technical solution of the present invention. For ease of observation, the energy ratio r is converted here into the logarithm of its reciprocal, i.e. -log_2(r). It is easy to see that, as the noise energy increases, the contrast of the prior-art energy ratio curve becomes smaller and smaller, and a threshold decision made in the very noisy interval from 10 s to 25 s would lose some of the lower-energy syllables. The energy ratio curve obtained by the embodiment of the technical solution of the present invention, by contrast, still retains a large contrast, so it can provide a more accurate voice activation decision.

In summary, the embodiments of the present invention provide a dual-microphone voice activation detection method and a voice capture device. The method first transforms the noisy speech signal and the noise signal into the frequency domain to obtain their frequency-domain amplitude spectra, then pre-filters each of them with a pre-filter to obtain the pre-filtered amplitude spectra, then performs short-time envelope shaping on the pre-filtered amplitude spectra using the short-time envelope of the speech signal, then accumulates and compares the shaped amplitude spectra to obtain the energy ratio of the noisy speech signal to the noise signal, and finally uses the energy ratio to make the voice activation decision. The embodiments of the technical solution of the present invention can significantly improve the accuracy of voice activation decisions under low signal-to-noise-ratio conditions.

The preferred embodiments of the present invention have been described above. It should be understood that the present invention is not limited to the particular embodiments described; equipment and structures not described in detail should be understood as being implemented in the manner common in the art. Any person of ordinary skill in the art may, without departing from the scope of the technical solution of the present invention, use the methods and technical content disclosed above to make many possible variations and modifications to the technical solution of the present invention, or to modify it into equivalent embodiments, and this does not affect the substance of the present invention. Therefore, any simple modification, equivalent variation or modification made to the above embodiments according to the technical essence of the present invention, without departing from the content of the technical solution of the present invention, still falls within the scope of protection of the technical solution of the present invention.

Claims

1. A dual-microphone voice activation detection method, characterized by comprising the following steps:

Step 4: obtaining the short-time envelope of the speech signal;

2. The dual-microphone voice activation detection method as claimed in claim 1, characterized in that, in said step 2:

3. The dual-microphone voice activation detection method as claimed in claim 2, characterized in that said noisy speech signal amplitude spectrum is obtained by the discrete Fourier transform (DFT) and is calculated by the following formula:

S_{a1}[k]_{t} = \left| Σ_{n=1}^{N} w(n) s_{1}(t - N + n) e^{\frac{-2πj(n-1)(k-1)}{N}} \right|

S_{a2}[k]_{t} = \left| Σ_{n=1}^{N} w(n) s_{2}(t - N + n) e^{\frac{-2πj(n-1)(k-1)}{N}} \right|

4. The dual-microphone voice activation detection method as claimed in claim 3, characterized in that the value range of said N is f_{s}/100/2 < N < 0.2 f_{s}, where f_{s} is the sampling frequency; or N = 512 when the sampling frequency f_{s} = 8000 Hz.

5. The dual-microphone voice activation detection method as claimed in claim 3, characterized in that said window function is a rectangular window, a sine window, a Hanning window, a Hamming window or a Tukey window.

6. The dual-microphone voice activation detection method as claimed in claim 1, characterized in that, in said step 3:

S_{pa1}[k]_{t} = S_{a1}[k]_{t} G_{1}[k]_{t}, k = 1, 2, 3, ..., N

S_{pa2}[k]_{t} = S_{a2}[k]_{t} G_{2}[k]_{t}, k = 1, 2, 3, ..., N

7. The dual-microphone voice activation detection method as claimed in claim 6, characterized in that a frequency-domain Wiener filter is used to pre-filter said noisy speech signal amplitude spectrum, and the frequency-domain Wiener filter that filters said noisy speech signal amplitude spectrum is calculated by the following formula:

G_{1}[k]_{t} = \sqrt{\frac{\max(P_{s1}[k]_{t} - P_{n1}[k]_{t}, 0)}{P_{s1}[k]_{t}}}

G_{2}[k]_{t} = \sqrt{\frac{\max(P_{s2}[k]_{t} - P_{n2}[k]_{t}, 0)}{P_{s2}[k]_{t}}}

8. The dual-microphone voice activation detection method as claimed in claim 6, characterized in that a frequency-domain Wiener filter is used to pre-filter said noisy speech signal amplitude spectrum, and the frequency-domain Wiener filter that filters said noisy speech signal amplitude spectrum is calculated by the following formula:

G_{1}[k]_{t} = \frac{SNR_{1}[k]_{t}}{SNR_{1}[k]_{t} + 1}

SNR_{P1}[k]_{t} = \frac{P_{s1}[k]_{t}}{P_{n1}[k]_{t}}

G_{2}[k]_{t} = \frac{SNR_{2}[k]_{t}}{SNR_{2}[k]_{t} + 1}

SNR_{P2}[k]_{t} = \frac{P_{s2}[k]_{t}}{P_{n2}[k]_{t}}

9. The dual-microphone voice activation detection method as claimed in claim 7 or 8, characterized in that the auto-power spectrum P_{s1} of said noisy speech signal is calculated by the following formula:

P_{s1} = S_{a1}^{2},

P_{s2} = S_{a2}^{2},

10. The dual-microphone voice activation detection method as claimed in claim 7 or 8, characterized in that the auto-power spectrum P_{n1} of the noise in said noisy speech signal is estimated by the following formula:

P_{n1}[k]_{t} = \begin{cases} η_{1} P_{n1}[k]_{t-1} + (1 - η_{1}) P_{s1}[k]_{t}, & P_{n1}[k]_{t-1} > P_{s1}[k]_{t} \\ \max\left(P_{n1}[k]_{t-1},\ η_{2} P_{n1}[k]_{t-1} + (1 - η_{2}) \frac{P_{s1}[k]_{t} - η_{3} P_{s1}[k]_{t-1}}{1 - η_{3}}\right), & P_{n1}[k]_{t-1} \leq P_{s1}[k]_{t} \end{cases}

P_{n2}[k]_{t} = \begin{cases} η_{1} P_{n2}[k]_{t-1} + (1 - η_{1}) P_{s2}[k]_{t}, & P_{n2}[k]_{t-1} > P_{s2}[k]_{t} \\ \max\left(P_{n2}[k]_{t-1},\ η_{2} P_{n2}[k]_{t-1} + (1 - η_{2}) \frac{P_{s2}[k]_{t} - η_{3} P_{s2}[k]_{t-1}}{1 - η_{3}}\right), & P_{n2}[k]_{t-1} \leq P_{s2}[k]_{t} \end{cases}

11. The dual-microphone voice activation detection method as claimed in claim 1, characterized in that, in said step 4, the short-time envelope of said speech signal is calculated by the following formula:

G_{L}[k] = \frac{S_{a}[k]}{\max [S_{a}[k]]_{k=1}^{N}}, k = 1, 2, 3, ..., N

12. The dual-microphone voice activation detection method as claimed in claim 11, characterized in that said short-time speech amplitude spectrum S_{a} is replaced by the short-time average amplitude spectrum of the enhanced signal output after speech enhancement of said noisy speech signal; or

S_{a}[k]_{t} = \begin{cases} α_{sa} S_{a}[k]_{t-1} + (1 - α_{sa}) S_{pa1}[k]_{t}, & S_{pa1}[k]_{t} > S_{a}[k]_{t-1} \\ β_{sa} S_{a}[k]_{t-1} + (1 - β_{sa}) S_{pa1}[k]_{t}, & S_{pa1}[k]_{t} \leq S_{a}[k]_{t-1} \end{cases}, k = 1, 2, 3, ..., N

13. The dual-microphone voice activation detection method as claimed in claim 12, characterized in that said smoothing factor α_{sa} = 1/2 and said smoothing factor β_{sa} = 31/32.

14. The dual-microphone voice activation detection method as claimed in claim 1, characterized in that, in said step 4, the short-time envelope of said speech signal is calculated by the following formula:

G_{L}[k] = \frac{S_{e}[k]}{\max [S_{e}[k]]_{k=1}^{N}}, k = 1, 2, 3, ..., N

G_{L}[k] = \sqrt{\frac{S_{e}[k]}{\max [S_{e}[k]]_{k=1}^{N}}}, k = 1, 2, 3, ..., N

15. The dual-microphone voice activation detection method as claimed in claim 14, characterized in that said short-time speech energy spectrum S_{e} is replaced by the short-time average energy spectrum of the enhanced signal output after speech enhancement of said noisy speech signal; or

S_{e}[k]_{t} = \begin{cases} α_{se} S_{e}[k]_{t-1} + (1 - α_{se}) S_{pa1}[k]_{t}^{2}, & S_{pa1}[k]_{t}^{2} > S_{e}[k]_{t-1} \\ β_{se} S_{e}[k]_{t-1} + (1 - β_{se}) S_{pa1}[k]_{t}^{2}, & S_{pa1}[k]_{t}^{2} \leq S_{e}[k]_{t-1} \end{cases}, k = 1, 2, 3, ..., N

16. The dual-microphone voice activation detection method as claimed in claim 15, characterized in that said smoothing factor α_{se} = 1/2 and said smoothing factor β_{se} = 31/32.

17. The dual-microphone voice activation detection method as claimed in claim 2, characterized in that, in said step 4, the short-time envelope of said speech signal is obtained by short-time smoothing of the pre-filtering transfer function G_{1} of the noisy speech signal amplitude spectrum or the pre-filtering transfer function G_{2} of the noise signal amplitude spectrum:

G_{L}[k]_{t} = \begin{cases} α_{G} G_{L}[k]_{t-1} + (1 - α_{G}) G_{1}[k]_{t}, & G_{1}[k]_{t} > G_{L}[k]_{t-1} \\ β_{G} G_{L}[k]_{t-1} + (1 - β_{G}) G_{1}[k]_{t}, & G_{1}[k]_{t} \leq G_{L}[k]_{t-1} \end{cases}, k = 1, 2, 3, ..., N

18. The dual-microphone voice activation detection method as claimed in claim 17, characterized in that said smoothing factor α_{G} = 1/2 and said smoothing factor β_{G} = 31/32.

19. The dual-microphone voice activation detection method as claimed in claim 1, characterized in that, in said step 5:

S_{sa1}[k]_{t} = S_{pa1}[k]_{t} G_{L}[k]_{t}, k = 1, 2, 3, ..., N

S_{sa2}[k]_{t} = S_{pa2}[k]_{t} G_{L}[k]_{t}, k = 1, 2, 3, ..., N

20. The dual-microphone voice activation detection method as claimed in claim 1, characterized in that, in said step 6, said energy ratio obtained is the full-band energy ratio and is calculated by the following formula:

r_{t1} = \frac{Σ_{k=1}^{N} S_{sa1}[k]_{t}}{ϵ + Σ_{k=1}^{N} S_{sa2}[k]_{t}}

21. The dual-microphone voice activation detection method as claimed in claim 1, characterized in that, in said step 6, said energy ratio obtained is a sub-band energy ratio and is calculated by the following formula:

r_{t2} = \frac{Σ_{k=K_{s}}^{K_{e}} S_{sa1}[k]_{t}}{ϵ + Σ_{k=K_{s}}^{K_{e}} S_{sa2}[k]_{t}}

22. The dual-microphone voice activation detection method as claimed in claim 1, characterized in that, in said step 7, said energy ratio is compared with a predetermined threshold;

23. The dual-microphone voice activation detection method as claimed in claim 1, characterized in that, in said step 7, before the judgment is made, said energy ratio is smoothed by the following formula:

r_{s,t} = \begin{cases} α_{r} r_{s,t-1} + (1 - α_{r}) r_{t}, & r_{t} > r_{s,t-1} \\ β_{r} r_{s,t-1} + (1 - β_{r}) r_{t}, & r_{t} \leq r_{s,t-1} \end{cases}

24. The dual-microphone voice activation detection method as claimed in claim 23, characterized in that said smoothing factor α_{r} = 1/16 and said smoothing factor β_{r} = 3/4.

25. The dual-microphone voice activation detection method as claimed in claim 22 or 23, characterized in that said predetermined threshold is 0.25.

26. A voice capture device, employing the dual-microphone voice activation detection method as claimed in any one of claims 1-25.