CN105244036A

CN105244036A - Microphone speech enhancement method and microphone speech enhancement device

Info

Publication number: CN105244036A
Application number: CN201410305776.4A
Authority: CN
Inventors: 范泛; 付中华; 黎家力
Original assignee: ZTE Corp
Current assignee: ZTE Corp
Priority date: 2014-06-27
Filing date: 2014-06-27
Publication date: 2016-01-13
Also published as: WO2015196729A1

Abstract

The invention provides a microphone speech enhancement method and a corresponding device. The method comprises: acquiring first array speech signals collected and input by multi-channel digital speech acquisition equipment; calculating, from the first array speech signals and according to a minimum variance adaptive beam optimization model of the first array speech signals, the optimal beam output signal synthesized from the first array speech signals; and carrying out single-channel speech enhancement using a power spectrum estimate of the optimal beam output signal. The minimum variance adaptive beam optimization model of the first array speech signals includes a spatial steering vector from a target sound source to the multi-channel digital speech acquisition equipment. The method and device can process the raw speech of a speech acquisition array with many array elements and large element spacing.

Description

Microphone speech enhancement method and device

Technical field

The present invention relates to speech processing, and in particular to a microphone speech enhancement method and device.

Background art

With the development of hands-free calling, conference systems, smart homes and smart appliances, high-quality long-distance speech pickup has become one of the key factors affecting the performance of speech acquisition and processing systems. A single microphone can no longer cope with complex acoustic environments, so microphone arrays built from multi-channel speech acquisition devices are increasingly becoming mainstream; the techniques most commonly used with them are beamforming and speech enhancement. Speech enhancement aims to extract target speech that is as clean as possible from the raw speech signal captured by the acquisition devices. Beamforming adjusts parameters to raise the sensitivity of the microphone array to sound from a given direction, which improves the enhancement result. However, most prior-art speech enhancement techniques can only process raw speech captured by arrays with few elements and small element spacing, so the performance of conventional array speech enhancement is often very limited.

Summary of the invention

In view of this, the invention provides a microphone speech enhancement method and device that can process the raw speech of a speech acquisition device array with more array elements and larger element spacing.

To this end, the microphone speech enhancement method provided by the invention comprises the following steps:

Acquiring a first array speech signal collected and input by multi-channel digital speech acquisition devices;

Calculating, from the first array speech signal and according to a minimum variance adaptive beam optimization model of the first array speech signal, the optimal beam output signal synthesized from the first array speech signal;

Performing single-channel speech enhancement using a power spectrum estimate of the optimal beam output signal;

Wherein the minimum variance adaptive beam optimization model of the first array speech signal includes a spatial steering vector from the target sound source to the multi-channel digital speech acquisition devices.

Optionally, before acquiring the first array speech signal collected and input by the multi-channel digital speech acquisition devices, the method further comprises:

Collecting raw speech array signals $y_1(n), \ldots, y_N(n)$ with the multi-channel digital speech acquisition devices;

Applying a short-time Fourier transform to the raw speech array signals to obtain their time-frequency representation $y_1(k, λ), \ldots, y_N(k, λ)$;

Applying frequency-domain optimal superdirective beam processing to the time-frequency representation with the optimal superdirective beam coefficients $A(k) = [a_1(k), \ldots, a_N(k)]^T$ to obtain the first array speech signal $\bar{y}_i(k, λ)$, $i = 1, \ldots, N$;

Here $n$ is the discrete-time variable, $N$ is the number of array elements, $k$ is the frequency bin index, and $λ$ is the short-time frame index.

Optionally, the optimal superdirective beam coefficients are set according to the arrangement of the multi-channel digital speech acquisition devices.

Optionally, the optimal beam output signal synthesized from the first array speech signal is calculated, according to the minimum variance adaptive beam optimization model of the first array speech signal, with the following formula:

\bar{y}(k, λ) = \sum_{i=1}^{N} w_{i}^{*} a_{i}^{*} y_{i}(k, λ);

where $\bar{y}(k, λ)$ is the optimal beam output signal; $w_i^{*}$ is the adaptive filter parameter calculated from the noise signal column vector, the optimal superdirective beam coefficients, and the spatial steering vector from the target sound source to each digital speech acquisition device; $a_i^{*}$ is the complex conjugate of element $a_i$ of the optimal superdirective beam coefficients $A(k) = [a_1(k), \ldots, a_N(k)]^T$; and $y_i(k, λ)$ is the time-frequency representation of the $i$-th channel, so that $a_i^{*} y_i(k, λ)$ is the first array speech signal.

Optionally, the minimum variance adaptive beam optimization model of the first array speech signal is:

w (k) = \underset{w (k)}{\arg \min} w^{H} (k) R_{\tilde{v}} (k) w (k),

subject to

w^{H} (k) \tilde{d} (k) = 1;

where the elements of $w(k)$ and $w_i^{*}$ are complex conjugates of each other; $w^{H}(k)$ is the conjugate transpose of $w(k)$; $R_{\tilde{v}}(k)$ is the noise coherence matrix estimated from the first array speech signal; and $\tilde{d}(k)$ is the spatial steering vector from the target sound source to the digital speech acquisition devices.

Optionally, the spatial steering vector from the target sound source to the digital speech acquisition devices is calculated with the following formula:

\tilde{d} (k) = [a_{1}^{*} \exp (jk \frac{d_{1} \cos (θ)}{c} f_{s}), \ldots, a_{N}^{*} \exp (jk \frac{d_{N} \cos (θ)}{c} f_{s})]^{T};

where $d_1, \ldots, d_N$ are the distances from the 1st to the $N$-th digital speech acquisition device to the center of the device array; $c$ is the speed of sound; $f_s$ is the sampling frequency; $θ$ is the azimuth at which the target sound source arrives at the digital speech acquisition devices; and $a_i^{*}$ is the complex conjugate of element $a_i$ of the optimal superdirective beam coefficients $A(k) = [a_1(k), \ldots, a_N(k)]^T$.

Optionally, the method further comprises:

Performing voice activity detection (VAD) on the noise signal array in the multi-channel array speech input signal;

Estimating the noise power spectrum of the noise signal array according to the VAD result;

Enhancing the optimal beam output signal a second time according to the power spectrum estimate of the optimal beam output signal and the noise power spectrum estimate.

Optionally, estimating the noise power spectrum of the noise signal array according to the VAD result comprises:

Calculating the noise power spectrum in the speech-present state, the speech-absent state, the speech-onset state, and the speech-end state;

Combining the noise power spectrum of the speech-present state with that of the speech-absent state as a compromise to obtain the noise power spectrum estimate.

Optionally, calculating the noise power spectrum in the speech-present, speech-absent, speech-onset, and speech-end states specifically comprises:

In the speech-absent state, estimating the power spectrum of the noise signal array with the following formula:

φ_{\bar{v}}(k, λ) = a_{1} φ_{\bar{v}}(k, λ - 1) + (1 - a_{1}) φ_{\bar{y}}(k, λ);

In the speech-onset state and the speech-present state, estimating the power spectrum of the noise signal array with the following formula:

φ_{\bar{v}}(k, λ) = \min(\hat{φ}_{\bar{v}1}(k, λ), 2 θ_{\bar{v}}(k, λ));

In the speech-end state, applying two-pole recursive smoothing to the power spectrum of the noise signal array with the following formula:

φ_{\bar{v}}(k, λ) = a_{0} φ_{\bar{v}2}(k, λ - 1) + (1 - a_{0}) \max(\hat{φ}_{\bar{v}}(k, λ), θ_{\bar{v}}(k, λ));

In the above formulas,

θ_{\bar{v}}(k, λ) = \frac{1}{2 L_{1} + 1} \sum_{m = k - L_{1}}^{k + L_{1}} φ_{\bar{v}}(m, λ);

\hat{φ}_{\bar{v}1}(k, λ) = \begin{cases} a_{a} \hat{φ}_{\bar{v}1}(k, λ - 1) + (1 - a_{a}) φ_{\bar{y}}(k, λ), & \text{if } φ_{\bar{y}}(k, λ) \geq \hat{φ}_{\bar{v}1}(k, λ) \\ a_{d} \hat{φ}_{\bar{v}1}(k, λ - 1) + (1 - a_{d}) φ_{\bar{y}}(k, λ), & \text{if } φ_{\bar{y}}(k, λ) < \hat{φ}_{\bar{v}1}(k, λ) \end{cases};

\hat{φ}_{\bar{v}2}(k, λ) = \begin{cases} a_{a} \hat{φ}_{\bar{v}2}(k, λ - 1) + (1 - a_{a}) |\bar{y}(k, λ)|^{2}, & \text{if } φ_{\bar{y}}(k, λ) \geq \hat{φ}_{\bar{v}2}(k, λ) \\ a_{d} \hat{φ}_{\bar{v}2}(k, λ - 1) + (1 - a_{d}) |\bar{y}(k, λ)|^{2}, & \text{if } φ_{\bar{y}}(k, λ) < \hat{φ}_{\bar{v}2}(k, λ) \end{cases};

where $a_1$ is the noise spectrum update parameter, and $a_a$ and $a_d$ are smoothing factors.

Optionally, the power spectrum estimate of the optimal beam output signal is calculated with the following formula:

φ_{\bar{y}}(k, λ) = a_{0} φ_{\bar{y}}(k, λ - 1) + (1 - a_{0}) |\bar{y}(k, λ)|^{2};

where $φ_{\bar{y}}(k, λ)$ is the power spectrum estimate of the optimal beam output signal; $\bar{y}(k, λ)$ is the optimal beam output signal; and $a_0$ is the noise spectrum update parameter.

Further, the invention provides a microphone speech enhancement device, comprising:

A first acquisition module, configured to acquire a first array speech signal collected and input by multi-channel digital speech acquisition devices;

An optimal beam output signal calculation module, configured to calculate, from the first array speech signal and according to a minimum variance adaptive beam optimization model of the first array speech signal, the optimal beam output signal synthesized from the first array speech signal;

A first enhancement module, configured to perform single-channel speech enhancement using a power spectrum estimate of the optimal beam output signal.

Optionally, the device further comprises:

A raw signal acquisition module, configured to collect raw speech array signals $y_1(n), \ldots, y_N(n)$ with the multi-channel digital speech acquisition devices;

A raw signal transform module, configured to apply a short-time Fourier transform to the raw speech array signals to obtain their time-frequency representation $y_1(k, λ), \ldots, y_N(k, λ)$;

An optimal superdirective beam processing module, configured to apply frequency-domain optimal superdirective beam processing to the time-frequency representation with the optimal superdirective beam coefficients $A(k) = [a_1(k), \ldots, a_N(k)]^T$ to obtain the first array speech signal $\bar{y}_i(k, λ)$, $i = 1, \ldots, N$.

Optionally, the optimal beam output signal calculation module calculates the optimal beam output signal synthesized from the first array speech signal, according to the minimum variance adaptive beam optimization model of the first array speech signal, with the following formula:

\bar{y}(k, λ) = \sum_{i=1}^{N} w_{i}^{*} a_{i}^{*} y_{i}(k, λ);

Optionally, the minimum variance adaptive beam optimization model of the first array speech signal is:

w (k) = \underset{w (k)}{\arg \min} w^{H} (k) R_{\tilde{v}} (k) w (k),

subject to

w^{H} (k) \tilde{d} (k) = 1;

Optionally, when the optimal beam output signal calculation module calculates the optimal beam output signal synthesized from the first array speech signal, the spatial steering vector from the target sound source to the digital speech acquisition devices is calculated with the following formula:

\tilde{d} (k) = [a_{1}^{*} \exp (jk \frac{d_{1} \cos (θ)}{c} f_{s}), \ldots, a_{N}^{*} \exp (jk \frac{d_{N} \cos (θ)}{c} f_{s})]^{T};

Optionally, the device further comprises:

A VAD module, configured to perform voice activity detection (VAD) on the noise signal array in the multi-channel array speech input signal;

A noise power spectrum estimation module, configured to estimate the noise power spectrum of the noise signal array according to the VAD result;

A second enhancement module, configured to enhance the optimal beam output signal a second time according to the power spectrum estimate of the optimal beam output signal and the noise power spectrum estimate.

Optionally, the noise power spectrum estimation module comprises:

A first noise power spectrum calculation unit, configured to calculate the noise power spectrum in the speech-present state, the speech-absent state, the speech-onset state, and the speech-end state;

A second noise power spectrum calculation unit, configured to combine the noise power spectrum of the speech-present state with that of the speech-absent state as a compromise to obtain the noise power spectrum estimate.

Optionally, the first noise power spectrum calculation unit specifically comprises:

A speech-absent state calculation subunit, configured to estimate, in the speech-absent state, the power spectrum of the noise signal array with the following formula:

φ_{\bar{v}}(k, λ) = a_{1} φ_{\bar{v}}(k, λ - 1) + (1 - a_{1}) φ_{\bar{y}}(k, λ);

A speech-onset and speech-present state calculation subunit, configured to estimate, in the speech-onset state and the speech-present state, the power spectrum of the noise signal array with the following formula:

φ_{\bar{v}}(k, λ) = \min(\hat{φ}_{\bar{v}1}(k, λ), 2 θ_{\bar{v}}(k, λ));

A speech-end state calculation subunit, configured to apply, in the speech-end state, two-pole recursive smoothing to the power spectrum of the noise signal array with the following formula:

φ_{\bar{v}}(k, λ) = a_{0} φ_{\bar{v}2}(k, λ - 1) + (1 - a_{0}) \max(\hat{φ}_{\bar{v}}(k, λ), θ_{\bar{v}}(k, λ));

In the above formulas,

θ_{\bar{v}}(k, λ) = \frac{1}{2 L_{1} + 1} \sum_{m = k - L_{1}}^{k + L_{1}} φ_{\bar{v}}(m, λ);

\hat{φ}_{\bar{v}1}(k, λ) = \begin{cases} a_{a} \hat{φ}_{\bar{v}1}(k, λ - 1) + (1 - a_{a}) φ_{\bar{y}}(k, λ), & \text{if } φ_{\bar{y}}(k, λ) \geq \hat{φ}_{\bar{v}1}(k, λ) \\ a_{d} \hat{φ}_{\bar{v}1}(k, λ - 1) + (1 - a_{d}) φ_{\bar{y}}(k, λ), & \text{if } φ_{\bar{y}}(k, λ) < \hat{φ}_{\bar{v}1}(k, λ) \end{cases};

\hat{φ}_{\bar{v}2}(k, λ) = \begin{cases} a_{a} \hat{φ}_{\bar{v}2}(k, λ - 1) + (1 - a_{a}) |\bar{y}(k, λ)|^{2}, & \text{if } φ_{\bar{y}}(k, λ) \geq \hat{φ}_{\bar{v}2}(k, λ) \\ a_{d} \hat{φ}_{\bar{v}2}(k, λ - 1) + (1 - a_{d}) |\bar{y}(k, λ)|^{2}, & \text{if } φ_{\bar{y}}(k, λ) < \hat{φ}_{\bar{v}2}(k, λ) \end{cases};

φ_{\bar{y}}(k, λ) = a_{0} φ_{\bar{y}}(k, λ - 1) + (1 - a_{0}) |\bar{y}(k, λ)|^{2};

As can be seen from the above, the microphone speech enhancement method and device provided by the invention and its embodiments apply a minimum variance adaptive beam optimization model to the first array speech signal collected and input by the multi-channel digital speech acquisition devices, and the model includes the spatial steering vector from the target sound source to those devices. Speech enhancement can therefore be applied to microphone arrays with larger element spacing, and high-quality pickup can be achieved. In addition, the method and device estimate the power spectrum of the noise signal array separately for the different phases of speech according to the voice activity detection result, which gives a more accurate noise estimate and thus further improves the speech enhancement result.

Brief description of the drawings

Fig. 1 is a schematic flowchart of the microphone speech enhancement method of an embodiment of the invention;

Fig. 2 is a schematic flowchart of the raw speech acquisition process of one embodiment of the invention;

Fig. 3 is a schematic flowchart of the noise power spectrum estimation of one embodiment of the invention;

Fig. 4 is a more detailed schematic flowchart of the noise power spectrum estimation of one embodiment of the invention;

Fig. 5 is a schematic structural diagram of the microphone speech enhancement device of an embodiment of the invention;

Fig. 6 is a schematic diagram of the speech signal processing of an embodiment of the invention.

Detailed description of the embodiments

To provide an effective implementation, the invention provides the following embodiments, which are described below with reference to the accompanying drawings.

First, the beamforming techniques relevant to the invention fall into two classes: fixed beams and adaptive beams.

A fixed beam is one in which the parameters of the array signal processing system do not change with the picked-up signal but are determined by the array topology and a preset noise field model; this class includes time-domain and frequency-domain fixed beams. The directivity of a fixed beam degrades at low and middle frequencies, and because speech is a wideband signal, improving the low- and mid-frequency directivity reduces the robustness of the array. Fixed beams are therefore seldom used on their own in practical small microphone array applications.

An adaptive beam automatically estimates the sound field conditions and the transfer function from the sound source to the microphones, and dynamically generates the optimal beam parameters according to an optimization criterion. In practice the transfer function from the source to each microphone is hard to estimate, so adaptive beams are usually combined with multi-channel noise suppression, or followed by post-filtering after beam processing. Both approaches require an accurate estimate of the noise statistics and an optimal balance between target signal distortion and the degree of noise suppression.

The invention provides a microphone speech enhancement method comprising the steps shown in Fig. 1:

Step 101: acquire a first array speech signal collected and input by multi-channel digital speech acquisition devices.

Step 102: calculate, from the first array speech signal and according to the minimum variance adaptive beam optimization model of the first array speech signal, the optimal beam output signal synthesized from the first array speech signal.

Step 103: perform single-channel speech enhancement using the power spectrum estimate of the optimal beam output signal.

As can be seen from the above, the microphone speech enhancement method provided by the invention applies a minimum variance adaptive beam optimization model to the first array speech signal collected and input by the multi-channel digital speech acquisition devices, and the model includes the spatial steering vector from the target sound source to those devices. Speech enhancement can therefore be applied to microphone arrays with larger element spacing, and high-quality pickup can be achieved.

In some embodiments of the invention, in the step of performing single-channel speech enhancement using the power spectrum estimate of the optimal beam output signal, the logMMSE method is applied to the optimal beam output signal.

In some embodiments of the invention, before the first array speech signal collected and input by the multi-channel digital speech acquisition devices is acquired, the method further comprises the steps shown in Fig. 2:

Step 201: collect raw speech array signals $y_1(n), \ldots, y_N(n)$ with the multi-channel digital speech acquisition devices.

Step 202: apply a short-time Fourier transform to the raw speech array signals to obtain their time-frequency representation $y_1(k, λ), \ldots, y_N(k, λ)$.

Step 203: apply frequency-domain optimal superdirective beam processing to the time-frequency representation with the optimal superdirective beam coefficients $A(k) = [a_1(k), \ldots, a_N(k)]^T$ to obtain the first array speech signal $\bar{y}_i(k, λ)$, $i = 1, \ldots, N$.

Specifically, the raw speech array signals collected by the multi-channel digital speech acquisition devices are $y_1(n), \ldots, y_N(n)$. Each channel is windowed and truncated with a window length $L_{wnd}$ and an overlap $L_{ovlp}$ between adjacent windows; the window is a Hanning window and the overlap is 3/4 of the window length. A short-time Fourier transform is then applied to the windowed signal of each channel to obtain the time-frequency representation of the raw speech array signals: $y_1(k, λ), \ldots, y_N(k, λ)$. In theory, the time-frequency representation $y_i(k, λ)$, the noise signal $v_i(k, λ)$, and the target speech signal $x_i(k, λ)$ from the target sound source satisfy the following relation:

y_{i}(k, λ) = v_{i}(k, λ) + x_{i}(k, λ).

Frequency-domain optimal superdirective beam processing is applied to the time-frequency representation with the optimal superdirective beam coefficients $A(k) = [a_1(k), \ldots, a_N(k)]^T$ to obtain the first array speech signal $\bar{y}_i(k, λ)$. Specifically,

\begin{aligned} \bar{y}_{i}(k, λ) &= a_{i}^{*}(k) y_{i}(k, λ) = a_{i}^{*}(k) (v_{i}(k, λ) + x_{i}(k, λ)) \\ &= a_{i}^{*}(k) x_{i}(k, λ) + a_{i}^{*}(k) v_{i}(k, λ) = \bar{x}_{i}(k, λ) + \bar{v}_{i}(k, λ) \end{aligned};

where $a_i^{*}(k)$ is the complex conjugate of $a_i(k)$; $\bar{x}_i(k, λ)$ is the target speech signal after frequency-domain weighting; $\bar{v}_i(k, λ)$ is the noise signal after frequency-domain weighting; and $i = 1, \ldots, N$.
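The windowing, short-time Fourier transform, and per-channel weighting described above can be sketched in Python with NumPy as follows. This is only an illustrative sketch: the function names `stft_frames` and `apply_superdirective` are hypothetical, the default window and overlap are the 32 ms / 24 ms reference values given for a 16 kHz system, and the coefficients $a_i(k)$ are assumed to be supplied by the caller, since the text does not give their design.

```python
import numpy as np

def stft_frames(x, fs=16000, win_ms=32, ovlp_ms=24):
    """Hann-window a 1-D signal frame by frame and FFT each frame.

    Assumes len(x) is at least one window long. Returns an array of
    shape (bins k, frames lambda).
    """
    win_len = int(fs * win_ms / 1000)           # 512 samples at 16 kHz
    hop = win_len - int(fs * ovlp_ms / 1000)    # 128 samples, i.e. 3/4 overlap
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[m * hop:m * hop + win_len] * window
                       for m in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T

def apply_superdirective(Y, A):
    """Weight each channel by conj(a_i(k)).

    Y: (N, K, L) channel spectra y_i(k, lambda); A: (N, K) coefficients a_i(k).
    """
    return np.conj(A)[:, :, None] * Y
```

With one second of audio at 16 kHz, each channel yields 257 frequency bins and 122 frames under these settings.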

In some embodiments, the optimal superdirective beam coefficients are determined from the array topology of the multi-channel digital speech acquisition devices together with the sound source direction.

Because optimal superdirective beam processing is used, the element spacing of the multi-channel digital acquisition devices can be larger than that of prior-art multi-channel speech acquisition devices.

In some embodiments, the optimal beam output signal synthesized from the first array speech signal is calculated, according to the minimum variance adaptive beam optimization model of the first array speech signal, with the following formula:

\bar{y}(k, λ) = \sum_{i=1}^{N} w_{i}^{*} a_{i}^{*} y_{i}(k, λ);

where $\bar{y}(k, λ)$ is the optimal beam output signal; $w_i^{*}$ is the adaptive filter parameter calculated from the noise signal column vector, the optimal superdirective beam coefficients, and the spatial steering vector from the target sound source to each digital speech acquisition device; $a_i^{*}$ is the complex conjugate of element $a_i$ of the optimal superdirective beam coefficients $A(k) = [a_1(k), \ldots, a_N(k)]^T$; and $y_i(k, λ)$ is the time-frequency representation of the $i$-th channel, so that $a_i^{*} y_i(k, λ)$ is the first array speech signal.
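As a minimal sketch of the synthesis formula above for a single frequency bin, the weighted sum over channels can be written with NumPy as below. The function name `beam_output` and the array layout are assumptions made for illustration only.

```python
import numpy as np

def beam_output(w, a, y):
    """Sum over channels of conj(w_i) * conj(a_i) * y_i for one frequency bin.

    w, a: length-N complex vectors for this bin; y: (N, L) channel values
    over L frames. Returns a length-L vector of beam outputs.
    """
    weights = np.conj(np.asarray(w) * np.asarray(a))
    return (weights[:, None] * np.asarray(y)).sum(axis=0)
```

Applying the function bin by bin, with the weights of each bin, gives the full time-frequency beam output.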

Specifically, in some embodiments, the minimum variance adaptive beam optimization model of the first array speech signal is:

w (k) = \underset{w (k)}{\arg \min} w^{H} (k) R_{\tilde{v}} (k) w (k),

subject to

w^{H} (k) \tilde{d} (k) = 1;

According to this model, the complex conjugate of the adaptive filter parameter calculated from the noise signal column vector, the optimal superdirective beam coefficients, and the spatial steering vector from the target sound source to each digital speech acquisition device is:

w_{opt} (k) = \frac{R_{\tilde{v}}^{- 1} (k) \tilde{d} (k)}{{\tilde{d}}^{H} (k) R_{\tilde{v}}^{- 1} (k) \tilde{d} (k)};

where $\tilde{d}^{H}(k)$ is the conjugate transpose of $\tilde{d}(k)$.
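The closed-form solution above can be evaluated per frequency bin as in the following sketch. The function name `mvdr_weights` is hypothetical, and the noise coherence matrix is assumed to be Hermitian and invertible; any regularization needed for an ill-conditioned matrix is not described in the text and is left out.

```python
import numpy as np

def mvdr_weights(R, d):
    """Return R^{-1} d / (d^H R^{-1} d) for one frequency bin.

    R: (N, N) noise coherence matrix (assumed Hermitian, invertible);
    d: length-N steering vector.
    """
    Rinv_d = np.linalg.solve(R, d)            # avoids forming R^{-1} explicitly
    return Rinv_d / (np.conj(d) @ Rinv_d)
```

By construction the result satisfies the constraint $w^{H}(k)\tilde{d}(k) = 1$, which gives a quick sanity check on any implementation.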

More specifically, the spatial steering vector from the target sound source to the digital speech acquisition devices is calculated with the following formula:

d (k) = [\exp (jk \frac{d_{1} \cos (θ)}{c} f_{s}), \ldots, \exp (jk \frac{d_{N} \cos (θ)}{c} f_{s})]^{T};

where $d_1, \ldots, d_N$ are the distances from the 1st to the $N$-th digital speech acquisition device to the center of the device array; $c$ is the speed of sound; $f_s$ is the sampling frequency; and $θ$ is the azimuth at which the target sound source arrives at the digital speech acquisition devices. Because the signal first passes through frequency-domain superdirective beam processing, the spatial steering vector from the target sound source to the digital speech acquisition devices after that processing becomes:

\tilde{d} (k) = [a_{1}^{*} \exp (jk \frac{d_{1} \cos (θ)}{c} f_{s}), \ldots, a_{N}^{*} \exp (jk \frac{d_{N} \cos (θ)}{c} f_{s})]^{T} .
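The steering vector can be transcribed literally into code as sketched below. The function name `steering_vector` is hypothetical, and the argument `k` is used exactly as it appears in the formula; how the bin index relates to angular frequency (for example any normalization by the FFT length) is not stated in the text, so that scaling is left to the caller as an assumption.

```python
import numpy as np

def steering_vector(k, dists, theta, a=None, c=340.0, fs=16000.0):
    """Elements exp(j * k * d_i * cos(theta) / c * fs), optionally times conj(a_i).

    dists: distances d_i from each device to the array center (meters);
    theta: source azimuth in radians; a: optional superdirective
    coefficients a_i(k) for this bin.
    """
    dists = np.asarray(dists, dtype=float)
    d = np.exp(1j * k * dists * np.cos(theta) / c * fs)
    if a is not None:
        d = np.conj(np.asarray(a)) * d   # weighted form after superdirective processing
    return d
```

Without the coefficients every element has unit magnitude, and for a source at 90 degrees the phases vanish, so all elements are 1.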

More specifically, the noise signal estimated from the first array speech signal is the column vector $\tilde{v}(k, λ)$; accordingly, the noise coherence matrix estimated from the first array speech signal is $R_{\tilde{v}}(k) = E\{\tilde{v}(k, λ) \tilde{v}^{H}(k, λ)\}$, where $E$ denotes expectation and $\tilde{v}^{H}(k, λ)$ is the conjugate transpose of $\tilde{v}(k, λ)$.

More specifically, $w(k) = [w_1(k), \ldots, w_N(k)]^T$.

In some embodiments of the invention, the method further comprises the steps shown in Fig. 3:

Step 301: perform voice activity detection (VAD) on the noise signal array in the multi-channel array speech input signal.

Step 302: estimate the noise power spectrum of the noise signal array according to the VAD result.

Step 303: enhance the optimal beam output signal a second time according to the power spectrum estimate of the optimal beam output signal and the noise power spectrum estimate.

The above embodiment can perform dynamic, time-varying estimation of the noise signal in the first array speech signal, in preparation for the second enhancement of the speech.

In general, when no speech is present, the noise can be estimated with the following formula:

\hat{R}_{\tilde{v}}(k, λ) = a_{R} \hat{R}_{\tilde{v}}(k, λ - 1) + (1 - a_{R}) \tilde{v}(k, λ) \tilde{v}^{H}(k, λ);

When speech is present, the noise can be estimated with the following formula:

\hat{R}_{\tilde{v}}(k, λ) = \hat{R}_{\tilde{v}}(k, λ - 1);

where $a_R$ is a smoothing factor.
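The VAD-gated update of the noise coherence matrix can be sketched for one frequency bin as follows. The function name `update_noise_coherence` is hypothetical; the sketch assumes the noise vector is a column vector, so the new term is its outer product with its own conjugate, and it assumes the caller passes the current first array signal as the noise vector in frames the VAD marks as speech-free.

```python
import numpy as np

def update_noise_coherence(R_prev, v, speech_present, a_R=0.95):
    """Recursive noise coherence estimate for one frequency bin.

    R_prev: (N, N) previous estimate; v: length-N noise vector for this
    frame; speech_present: VAD decision for this frame.
    """
    if speech_present:
        return R_prev                      # hold the estimate while speech is active
    return a_R * R_prev + (1 - a_R) * np.outer(v, np.conj(v))
```

The default `a_R=0.95` is the reference value listed for a 16 kHz system; each update keeps the estimate Hermitian.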

Specifically, estimating the noise power spectrum of the noise signal array according to the VAD result comprises the process shown in Fig. 4:

Step 401: calculate the noise power spectrum in the speech-present state, the speech-absent state, the speech-onset state, and the speech-end state.

Step 402: combine the noise power spectrum of the speech-present state with that of the speech-absent state as a compromise to obtain the noise power spectrum estimate.

In some embodiments, estimating the power spectrum of the noise signal array according to the VAD result specifically comprises:

In the speech-absent state, applying fast smoothing to the power spectrum of the noise signal array with the following formula:

φ_{\bar{v}}(k, λ) = a_{1} φ_{\bar{v}}(k, λ - 1) + (1 - a_{1}) φ_{\bar{y}}(k, λ);

The noise power spectrum threshold is then calculated with the following formula:

θ_{\bar{v}}(k, λ) = \frac{1}{2 L_{1} + 1} \sum_{m = k - L_{1}}^{k + L_{1}} φ_{\bar{v}}(m, λ);

where $L_1$ is the number of frequency bins on each side of bin $k$ included in the average.

In the speech-onset state, first applying two-pole recursive smoothing to the power spectrum of the noise signal array with the following formula:

\hat{φ}_{\bar{v}1}(k, λ) = \begin{cases} a_{a} \hat{φ}_{\bar{v}1}(k, λ - 1) + (1 - a_{a}) φ_{\bar{y}}(k, λ), & \text{if } φ_{\bar{y}}(k, λ) \geq \hat{φ}_{\bar{v}1}(k, λ) \\ a_{d} \hat{φ}_{\bar{v}1}(k, λ - 1) + (1 - a_{d}) φ_{\bar{y}}(k, λ), & \text{if } φ_{\bar{y}}(k, λ) < \hat{φ}_{\bar{v}1}(k, λ) \end{cases};

The noise peak is then calculated from the noise signal array power spectrum estimate of the speech-onset state:

φ_{\bar{v}}(k, λ) = \min(\hat{φ}_{\bar{v}1}(k, λ), 2 θ_{\bar{v}}(k, λ));

In the speech-end state, applying two-pole recursive smoothing to the power spectrum of the noise signal array with the following formula:

\hat{φ}_{\bar{v}2}(k, λ) = \begin{cases} a_{a} \hat{φ}_{\bar{v}2}(k, λ - 1) + (1 - a_{a}) |\bar{y}(k, λ)|^{2}, & \text{if } φ_{\bar{y}}(k, λ) \geq \hat{φ}_{\bar{v}2}(k, λ) \\ a_{d} \hat{φ}_{\bar{v}2}(k, λ - 1) + (1 - a_{d}) |\bar{y}(k, λ)|^{2}, & \text{if } φ_{\bar{y}}(k, λ) < \hat{φ}_{\bar{v}2}(k, λ) \end{cases};

The compromise noise power spectrum estimate is then calculated:

φ_{\bar{v}}(k, λ) = a_{0} φ_{\bar{v}2}(k, λ - 1) + (1 - a_{0}) \max(\hat{φ}_{\bar{v}}(k, λ), θ_{\bar{v}}(k, λ)) .

where $a_1$ is the noise spectrum update parameter; $a_a$ and $a_d$ are smoothing factors; $a_0$ is the noise spectrum update parameter; $φ_{\bar{v}}(k, λ)$ is the fast smoothed estimate of the noise signal array power spectrum; $\hat{φ}_{\bar{v}1}(k, λ)$ and $\hat{φ}_{\bar{v}2}(k, λ)$ are the two-pole recursive smoothed estimates of the noise signal array power spectrum; $φ_{\bar{y}}(k, λ)$ is the power spectrum estimate of the optimal beam output signal that undergoes the single-channel enhancement; and $θ_{\bar{v}}(k, λ)$ is the noise power threshold estimate of the noise signal array.
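The state-dependent noise power spectrum update can be sketched as below. Several points are assumptions made to obtain runnable code and are not confirmed by the text: the state names and dictionary keys are hypothetical; the two-pole smoother compares the input with the previous estimate; the threshold average repeats the edge bins at the spectrum boundaries; and, because the subscripts in the speech-end formula are ambiguous, the sketch reads it as mixing the previous two-pole estimate with the larger of the current two-pole estimate and the threshold.

```python
import numpy as np

# Reference values listed in the text for a 16 kHz system.
A0, A1, A_A, A_D, L1 = 0.8, 0.85, 0.995, 0.85, 7

def smooth_over_bins(phi, half_width=L1):
    """Average each bin with its +/- half_width neighbours (edge bins repeated)."""
    padded = np.pad(phi, half_width, mode="edge")
    kernel = np.ones(2 * half_width + 1) / (2 * half_width + 1)
    return np.convolve(padded, kernel, mode="valid")

def two_pole(prev, x):
    """Use a_a where the input is at or above the previous estimate, else a_d."""
    coef = np.where(x >= prev, A_A, A_D)
    return coef * prev + (1 - coef) * x

def update_noise_psd(state, s, phi_y, y_abs2):
    """One frame of the noise PSD update.

    s: dict of per-bin arrays 'phi_v', 'theta', 'hat1', 'hat2';
    phi_y: smoothed beam-output power spectrum; y_abs2: |beam output|^2.
    Returns a new dict and leaves the input dict untouched.
    """
    s = dict(s)
    if state == "no_speech":
        s["phi_v"] = A1 * s["phi_v"] + (1 - A1) * phi_y
        s["theta"] = smooth_over_bins(s["phi_v"])
    elif state in ("speech_start", "speech"):
        s["hat1"] = two_pole(s["hat1"], phi_y)
        s["phi_v"] = np.minimum(s["hat1"], 2 * s["theta"])
    elif state == "speech_end":
        prev_hat2 = s["hat2"]
        s["hat2"] = two_pole(prev_hat2, y_abs2)
        # Assumed reading of the ambiguous speech-end formula.
        s["phi_v"] = A0 * prev_hat2 + (1 - A0) * np.maximum(s["hat2"], s["theta"])
    else:
        raise ValueError(f"unknown VAD state: {state}")
    return s
```

The cap at twice the threshold keeps a loud speech onset from inflating the noise estimate, while the slow upward coefficient `A_A` and fast downward coefficient `A_D` let the estimate drop quickly once speech ends.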

In some embodiments, the power spectrum estimate of the optimal beam output signal is calculated with the following formula:

φ_{\bar{y}}(k, λ) = a_{0} φ_{\bar{y}}(k, λ - 1) + (1 - a_{0}) |\bar{y}(k, λ)|^{2};

Further, the compromise noise power spectrum estimate $φ_{\bar{v}}(k, λ)$ and the power spectrum estimate $φ_{\bar{y}}(k, λ)$ of the optimal beam output signal are fed to a post-filter for processing; the speech signal processing of this embodiment is shown schematically in Fig. 6. An inverse FFT is applied to the post-filtered signal, and the enhanced time-domain signal stream is then reconstructed by overlap-add.
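The final reconstruction step, an inverse FFT per frame followed by overlap-add, can be sketched as below. The function name `overlap_add` is hypothetical, and the defaults of 512 samples per frame and a hop of 128 samples correspond to the 32 ms window and 24 ms overlap at 16 kHz. Gain normalization is not described in the text and is omitted; with a Hann analysis window at 3/4 overlap the overlapped windows add up to a nearly constant gain close to 2, which a full implementation would divide out.

```python
import numpy as np

def overlap_add(S, win_len=512, hop=128):
    """Rebuild a time-domain signal from one-sided spectra S of shape (bins, frames)."""
    frames = np.fft.irfft(S, n=win_len, axis=0)       # (win_len, n_frames)
    n_frames = frames.shape[1]
    out = np.zeros(win_len + hop * (n_frames - 1))
    for m in range(n_frames):
        out[m * hop:m * hop + win_len] += frames[:, m]  # add each frame at its offset
    return out
```

A single frame passes through unchanged, which makes the function easy to check in isolation.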

Specifically, for a speech sampling system with a sampling frequency of 16 kHz, the parameters in the embodiments of the present invention may take the following values:

N = 6; L_{wnd} = 32 ms; L_{ovlp} = 24 ms; c = 340 m/s; f_{s} = 16000 Hz; a_{0} = 0.8; a_{R} = 0.95; a_{1} = 0.85; a_{a} = 0.995; a_{d} = 0.85; L_{1} = 7.
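As a worked example, the window and overlap durations translate into sample counts as follows. The variable names are illustrative, and reading the window and overlap lengths as analysis-frame parameters is an assumption.

```python
FS_HZ = 16000      # sampling frequency f_s
L_WND_MS = 32      # window length L_wnd
L_OVLP_MS = 24     # overlap length L_ovlp

frame_len = FS_HZ * L_WND_MS // 1000     # samples per window
overlap_len = FS_HZ * L_OVLP_MS // 1000  # samples shared by adjacent windows
hop_len = frame_len - overlap_len        # new samples per frame

print(frame_len, overlap_len, hop_len)   # 512 384 128
```

A 512-sample window is a power of two, which suits an FFT-based implementation, and the 128-sample hop corresponds to an 8 ms frame advance.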

In the embodiments of the present invention, a frequency-domain optimal superdirective beam is first designed according to the array topology and the sound source direction. A short-time Fourier transform is then applied to the original speech array signal, and the noise coherence matrix is estimated from the original speech array signal. The transformed array signal is combined with the optimal superdirective beam parameters to obtain the enhanced speech signal, while the noise correlation matrix is estimated dynamically in order to update the optimal adaptive filter parameters. Finally, a postfilter further improves the signal quality. The invention needs only a small number of microphones to achieve high-quality remote speech pickup, noticeably suppresses complex noise outside the beam, and introduces almost no audible speech distortion.

As can be seen from the above, the microphone speech enhancement method provided by the embodiments of the present invention can accurately calculate the noise signal in the original speech signal acquired and input by the speech acquisition devices, so that the noise signal can be effectively suppressed during speech enhancement.

Further, the present invention provides a microphone speech enhancement device, whose structure is shown in Fig. 5, comprising:

As can be seen from the above, the microphone speech enhancement device provided by the embodiments of the present invention uses the optimal beam output signal calculation module to process the first array speech signal acquired and input by the multi-channel digital speech acquisition devices, and applies the minimum variance adaptive beam optimization model to calculate the optimal beam output signal of the first array speech signal. It can therefore handle microphone arrays with more array elements and larger spacing.

Still referring to Fig. 5, in some embodiments the device further comprises:

In some embodiments, the optimal superdirective beam coefficients are set according to the arrangement of the multi-channel digital speech acquisition devices.

In some embodiments, when the optimal beam output signal calculation module uses the first array speech signal to calculate the optimal beam output signal synthesized from it, according to the minimum variance adaptive beam optimization model of the first array speech signal, the following formula is used:

\bar{y}(k, λ) = \sum_{i = 1}^{N} w_{i}^{*} a_{i}^{*} y_{i}(k, λ);
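For a single frequency bin and frame, the beam output is a conjugate-weighted sum over the N channels. A minimal Python sketch follows; the function name and the list-based interface are assumptions made for illustration.

```python
def beam_output(w, a, y):
    """Return sum_i conj(w_i) * conj(a_i) * y_i for one bin and one frame.

    w -- adaptive filter parameters, one per channel
    a -- superdirective beam coefficients, one per channel
    y -- channel spectra y_i(k, lambda), one per channel
    """
    if not (len(w) == len(a) == len(y)):
        raise ValueError("w, a and y must have one entry per channel")
    return sum(wi.conjugate() * ai.conjugate() * yi
               for wi, ai, yi in zip(w, a, y))
```

In a full system this sum would be evaluated for every bin k of every frame λ.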

In some embodiments, the minimum variance adaptive beam optimization model of the first array speech signal is:

w(k) = \underset{w(k)}{\arg\min} \; w^{H}(k) R_{\tilde{v}}(k) w(k),

subject to

w^{H}(k) \tilde{d}(k) = 1;

where the w_{i}^{*} are the complex conjugates of the elements of w(k); w^{H}(k) is the conjugate transpose of w(k); R_{\tilde{v}}(k) is the noise coherence matrix estimated from the first array speech signal; and \tilde{d}(k) is the spatial steering vector from the target sound source to the digital speech acquisition devices.
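For reference, a constrained minimization of this form has a standard closed-form solution, obtained with a Lagrange multiplier. It is stated here as a well-known result rather than as the procedure of this embodiment, and it assumes that R_{\tilde{v}}(k) is Hermitian and invertible:

```latex
w(k) = \frac{R_{\tilde{v}}^{-1}(k)\,\tilde{d}(k)}
            {\tilde{d}^{H}(k)\,R_{\tilde{v}}^{-1}(k)\,\tilde{d}(k)}
```

When R_{\tilde{v}}(k) is positive definite the denominator is real and positive, so the constraint w^{H}(k) \tilde{d}(k) = 1 holds.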

In some embodiments, when the optimal beam output signal calculation module calculates the optimal beam output signal synthesized from the first array speech signal, the spatial steering vector from the target sound source to the digital speech acquisition devices is calculated according to the following formula:

\tilde{d}(k) = [a_{1}^{*} \exp(jk \frac{d_{1} \cos(θ)}{c} f_{s}), \ldots, a_{N}^{*} \exp(jk \frac{d_{N} \cos(θ)}{c} f_{s})]^{T};

Here d_{1} to d_{N} are the distances from the 1st to the N-th digital speech acquisition device to the center of the device array; c is the speed of sound; f_{s} is the sampling frequency; θ is the azimuth angle at which the target sound source arrives at the digital speech acquisition devices; and a_{i}^{*} is the complex conjugate of element a_{i} of the optimal superdirective beam coefficients A(k) = [a_{1}(k), \ldots, a_{N}(k)]^{T}.
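The steering vector can be evaluated element by element, as in the following Python sketch. The function name is hypothetical. The text does not state the units of k, so the sketch evaluates the expression exactly as written and leaves the scaling of k to the caller.

```python
import cmath
import math


def steering_vector(a, d, theta, k, c=340.0, f_s=16000.0):
    """Evaluate conj(a_i) * exp(j * k * d_i * cos(theta) / c * f_s) per element.

    a     -- superdirective beam coefficients a_i for this bin
    d     -- distances d_i from each device to the array center, in metres
    theta -- azimuth of the target source, in radians
    k     -- frequency variable of the formula (units assumed by the caller)
    """
    if len(a) != len(d):
        raise ValueError("a and d must have one entry per device")
    return [
        ai.conjugate() * cmath.exp(1j * k * di * math.cos(theta) / c * f_s)
        for ai, di in zip(a, d)
    ]
```

The exponential has unit magnitude, so each element keeps the magnitude of the corresponding coefficient a_i and only its phase changes with the device position and the source direction.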

Still referring to Fig. 5, in some embodiments the noise power spectrum estimation module comprises:

In some embodiments, the first noise power spectrum calculation unit specifically comprises:

φ_{\bar{v}}(k, λ) = a_{1} φ_{\bar{v}}(k, λ - 1) + (1 - a_{1}) φ_{\bar{y}}(k, λ);

φ_{\bar{v}}(k, λ) = \min(\hat{φ}_{\bar{v}1}(k, λ), 2 θ_{\bar{v}}(k, λ));

φ_{\bar{v}}(k, λ) = a_{0} φ_{\bar{v}2}(k, λ - 1) + (1 - a_{0}) \max(\hat{φ}_{\bar{v}}(k, λ), θ_{\bar{v}}(k, λ));

In the above formulas,

θ_{\bar{v}}(k, λ) = \frac{1}{2 L_{1} + 1} \sum_{m = k - L_{1}}^{k + L_{1}} φ_{\bar{v}}(k, λ);

\begin{cases} \hat{φ}_{\bar{v}1}(k, λ) = a_{a} \hat{φ}_{\bar{v}1}(k, λ - 1) + (1 - a_{a}) φ_{\bar{y}}(k, λ), & \text{if } φ_{\bar{y}}(k, λ) \geq \hat{φ}_{\bar{v}1}(k, λ) \\ \hat{φ}_{\bar{v}1}(k, λ) = a_{d} \hat{φ}_{\bar{v}1}(k, λ - 1) + (1 - a_{d}) φ_{\bar{y}}(k, λ), & \text{if } φ_{\bar{y}}(k, λ) < \hat{φ}_{\bar{v}1}(k, λ) \end{cases};

\begin{cases} \hat{φ}_{\bar{v}2}(k, λ) = a_{a} \hat{φ}_{\bar{v}2}(k, λ - 1) + (1 - a_{a}) |\bar{y}(k, λ)|^{2}, & \text{if } φ_{\bar{y}}(k, λ) \geq \hat{φ}_{\bar{v}2}(k, λ) \\ \hat{φ}_{\bar{v}2}(k, λ) = a_{d} \hat{φ}_{\bar{v}2}(k, λ - 1) + (1 - a_{d}) |\bar{y}(k, λ)|^{2}, & \text{if } φ_{\bar{y}}(k, λ) < \hat{φ}_{\bar{v}2}(k, λ) \end{cases};

φ_{\bar{y}}(k, λ) = a_{0} φ_{\bar{y}}(k, λ - 1) + (1 - a_{0}) |\bar{y}(k, λ)|^{2};
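The threshold θ is an average of the noise power spectrum over 2 L_{1} + 1 neighbouring frequency bins. The Python sketch below is illustrative only. The formula sums over m but writes the summand with index k; the sketch assumes the summand is meant to be indexed by the neighbouring bin m. Clamping indices at the spectrum edges is a further assumption, since the text does not say how edge bins are handled.

```python
def noise_threshold(phi_v, k, l_1=7):
    """Average phi_v over bins k - l_1 .. k + l_1 for one frame.

    phi_v -- noise power spectrum values for all bins of the frame
    k     -- index of the bin whose threshold is wanted
    l_1   -- half-width L_1 of the averaging window
    """
    last = len(phi_v) - 1
    total = 0.0
    for m in range(k - l_1, k + l_1 + 1):
        # Assumption: out-of-range indices are clamped to the nearest valid bin.
        total += phi_v[min(max(m, 0), last)]
    return total / (2 * l_1 + 1)
```

With L_{1} = 7 the average spans 15 bins, which smooths narrow spectral peaks out of the threshold.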

Further, the noise power spectrum estimate after compromise and the power spectrum estimate of the optimal beam output signal are input to the postfilter for processing. An inverse FFT is applied to the postfiltered signal, and the enhanced time-domain signal stream is then reconstructed by the overlap-add method.

As can be seen from the above, the microphone speech enhancement device provided by the embodiments of the present invention can effectively estimate and process the noise signal in the first array speech signal acquired and input by the multi-channel digital speech acquisition devices, which helps filter out the noise signal during subsequent speech enhancement and improves the speech enhancement effect.

It should be understood that the embodiments described in this specification are intended only to illustrate and explain the present invention, not to limit it. Where no conflict arises, the embodiments in this application and the features in the embodiments may be combined with one another.

Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. Thus, if these changes and modifications fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.

Claims

1. A microphone speech enhancement method, characterized in that it comprises the following steps:

2. The method according to claim 1, characterized in that, before acquiring the first array speech signal acquired and input by the multi-channel digital speech acquisition devices, the method further comprises:

3. The method according to claim 2, characterized in that the optimal superdirective beam coefficients are set according to the arrangement of the multi-channel digital speech acquisition devices.

4. The method according to claim 1, characterized in that, when the first array speech signal is used to calculate the optimal beam output signal synthesized from it, according to the minimum variance adaptive beam optimization model of the first array speech signal, the following formula is used:

\bar{y}(k, λ) = \sum_{i = 1}^{N} w_{i}^{*} a_{i}^{*} y_{i}(k, λ);

where w_{i}^{*} is the adaptive filter parameter calculated from the noise signal column vector, the optimal superdirective beam coefficients, and the spatial steering vector from the target sound source to each digital speech acquisition device; a_{i}^{*} is the complex conjugate of element a_{i} of the optimal superdirective beam coefficients A(k) = [a_{1}(k), \ldots, a_{N}(k)]^{T}; and y_{i}(k, λ) is the first array speech signal.

5. The method according to claim 3, characterized in that the minimum variance adaptive beam optimization model of the first array speech signal is:

w(k) = \underset{w(k)}{\arg\min} \; w^{H}(k) R_{\tilde{v}}(k) w(k),

subject to

w^{H}(k) \tilde{d}(k) = 1;

6. The method according to claim 5, characterized in that the spatial steering vector from the target sound source to the digital speech acquisition devices is calculated according to the following formula:

\tilde{d}(k) = [a_{1}^{*} \exp(jk \frac{d_{1} \cos(θ)}{c} f_{s}), \ldots, a_{N}^{*} \exp(jk \frac{d_{N} \cos(θ)}{c} f_{s})]^{T};

7. The method according to claim 1, characterized in that the method further comprises:

8. The method according to claim 7, characterized in that the step of performing noise power spectrum estimation on the noise signal array according to the result of the voice activity detection (VAD) comprises:

9. The method according to claim 8, characterized in that the step of calculating the noise power spectrum in the speech-present state, the speech-absent state, the speech-start state, and the speech-end state specifically comprises:

φ_{\bar{v}}(k, λ) = a_{1} φ_{\bar{v}}(k, λ - 1) + (1 - a_{1}) φ_{\bar{y}}(k, λ);

φ_{\bar{v}}(k, λ) = \min(\hat{φ}_{\bar{v}1}(k, λ), 2 θ_{\bar{v}}(k, λ));

φ_{\bar{v}}(k, λ) = a_{0} φ_{\bar{v}2}(k, λ - 1) + (1 - a_{0}) \max(\hat{φ}_{\bar{v}}(k, λ), θ_{\bar{v}}(k, λ));

In the above formulas,

θ_{\bar{v}}(k, λ) = \frac{1}{2 L_{1} + 1} \sum_{m = k - L_{1}}^{k + L_{1}} φ_{\bar{v}}(k, λ);

\begin{cases} \hat{φ}_{\bar{v}1}(k, λ) = a_{a} \hat{φ}_{\bar{v}1}(k, λ - 1) + (1 - a_{a}) φ_{\bar{y}}(k, λ), & \text{if } φ_{\bar{y}}(k, λ) \geq \hat{φ}_{\bar{v}1}(k, λ) \\ \hat{φ}_{\bar{v}1}(k, λ) = a_{d} \hat{φ}_{\bar{v}1}(k, λ - 1) + (1 - a_{d}) φ_{\bar{y}}(k, λ), & \text{if } φ_{\bar{y}}(k, λ) < \hat{φ}_{\bar{v}1}(k, λ) \end{cases};

10. The method according to claim 1, characterized in that the power spectrum estimate of the optimal beam output signal is calculated with the following formula:

φ_{\bar{y}}(k, λ) = a_{0} φ_{\bar{y}}(k, λ - 1) + (1 - a_{0}) |\bar{y}(k, λ)|^{2};

11. A microphone speech enhancement device, characterized in that it comprises:

12. The device according to claim 11, characterized in that the device further comprises:

13. The device according to claim 12, characterized in that the optimal superdirective beam coefficients are set according to the arrangement of the multi-channel digital speech acquisition devices.

14. The device according to claim 11, characterized in that, when the optimal beam output signal calculation module uses the first array speech signal to calculate the optimal beam output signal synthesized from it, according to the minimum variance adaptive beam optimization model of the first array speech signal, the following formula is used:

\bar{y}(k, λ) = \sum_{i = 1}^{N} w_{i}^{*} a_{i}^{*} y_{i}(k, λ);

15. The device according to claim 13, characterized in that the minimum variance adaptive beam optimization model of the first array speech signal is:

w(k) = \underset{w(k)}{\arg\min} \; w^{H}(k) R_{\tilde{v}}(k) w(k),

subject to

w^{H}(k) \tilde{d}(k) = 1;

16. The device according to claim 15, characterized in that, when the optimal beam output signal calculation module calculates the optimal beam output signal synthesized from the first array speech signal, the spatial steering vector from the target sound source to the digital speech acquisition devices is calculated according to the following formula:

\tilde{d}(k) = [a_{1}^{*} \exp(jk \frac{d_{1} \cos(θ)}{c} f_{s}), \ldots, a_{N}^{*} \exp(jk \frac{d_{N} \cos(θ)}{c} f_{s})]^{T};

where d_{1} to d_{N} are the distances from the 1st to the N-th digital speech acquisition device to the center of the device array; c is the speed of sound; f_{s} is the sampling frequency; θ is the azimuth angle at which the target sound source arrives at the digital speech acquisition devices; and a_{i}^{*} is the complex conjugate of element a_{i} of the optimal superdirective beam coefficients A(k) = [a_{1}(k), \ldots, a_{N}(k)]^{T}.

17. The device according to claim 11, characterized in that it further comprises:

18. The device according to claim 17, characterized in that the noise power spectrum estimation module comprises:

19. The device according to claim 18, characterized in that the first noise power spectrum calculation unit specifically comprises:

φ_{\bar{v}}(k, λ) = a_{1} φ_{\bar{v}}(k, λ - 1) + (1 - a_{1}) φ_{\bar{y}}(k, λ);

φ_{\bar{v}}(k, λ) = a_{0} φ_{\bar{v}2}(k, λ - 1) + (1 - a_{0}) \max(\hat{φ}_{\bar{v}}(k, λ), θ_{\bar{v}}(k, λ));

In the above formulas,

θ_{\bar{v}}(k, λ) = \frac{1}{2 L_{1} + 1} \sum_{m = k - L_{1}}^{k + L_{1}} φ_{\bar{v}}(k, λ);

\begin{cases} \hat{φ}_{\bar{v}1}(k, λ) = a_{a} \hat{φ}_{\bar{v}1}(k, λ - 1) + (1 - a_{a}) φ_{\bar{y}}(k, λ), & \text{if } φ_{\bar{y}}(k, λ) \geq \hat{φ}_{\bar{v}1}(k, λ) \\ \hat{φ}_{\bar{v}1}(k, λ) = a_{d} \hat{φ}_{\bar{v}1}(k, λ - 1) + (1 - a_{d}) φ_{\bar{y}}(k, λ), & \text{if } φ_{\bar{y}}(k, λ) < \hat{φ}_{\bar{v}1}(k, λ) \end{cases};

\begin{cases} \hat{φ}_{\bar{v}2}(k, λ) = a_{a} \hat{φ}_{\bar{v}2}(k, λ - 1) + (1 - a_{a}) |\bar{y}(k, λ)|^{2}, & \text{if } φ_{\bar{y}}(k, λ) \geq \hat{φ}_{\bar{v}2}(k, λ) \\ \hat{φ}_{\bar{v}2}(k, λ) = a_{d} \hat{φ}_{\bar{v}2}(k, λ - 1) + (1 - a_{d}) |\bar{y}(k, λ)|^{2}, & \text{if } φ_{\bar{y}}(k, λ) < \hat{φ}_{\bar{v}2}(k, λ) \end{cases};

20. The device according to claim 11, characterized in that the power spectrum estimate of the optimal beam output signal is calculated with the following formula:

φ_{\bar{y}}(k, λ) = a_{0} φ_{\bar{y}}(k, λ - 1) + (1 - a_{0}) |\bar{y}(k, λ)|^{2};