WO2015196729A1

WO2015196729A1 - Microphone array speech enhancement method and device

Info

Publication number: WO2015196729A1
Application number: PCT/CN2014/092217
Authority: WO
Inventors: 范泛; 付中华; 黎家力
Original assignee: 中兴通讯股份有限公司
Priority date: 2014-06-27
Filing date: 2014-11-25
Publication date: 2015-12-30
Also published as: CN105244036A

Abstract

A microphone array speech enhancement method and a corresponding device, the method comprising the following steps: first array speech signals collected and inputted via a multi-path digital speech collection device are obtained (101); according to a minimum variance adaptive beam optimisation model of the first array speech signals, the first array speech signals are used to calculate an optimal beam output signal synthesised by the first array speech signals (102); a power spectrum estimation value of the optimal beam output signal is used to perform single-channel speech enhancement processing (103). The minimum variance adaptive beam optimisation model of the first array speech signals comprises a spatial steering vector of a target sound source to the multi-path digital speech collection device.

Description

Microphone array speech enhancement method and device

Technical field

The present invention relates to voice processing, and in particular, to a microphone array voice enhancement method and apparatus.

Background art

With the development of hands-free calling, conference systems, smart homes and smart home appliances, high-quality long-distance voice pickup has become one of the key factors affecting the performance of voice acquisition and processing systems. Single-microphone techniques struggle to cope with complex acoustic environments, so microphone arrays built from multi-channel voice acquisition devices are becoming mainstream. The most commonly used array techniques are beamforming and speech enhancement. Speech enhancement aims to extract a target voice that is as clean as possible from the original speech signal collected by the voice acquisition device. Beamforming increases the sensitivity of the microphone array to sound from a particular direction by adjusting its parameters, which in turn improves speech enhancement. However, most speech enhancement techniques in the related art can only process original speech collected by arrays with few elements and small element spacing, so the performance of traditional array speech enhancement is often very limited.

Summary of the invention

Embodiments of the present invention provide a microphone array voice enhancement method and apparatus. The method and apparatus are capable of processing original speech of an array of speech acquisition devices having more array elements and larger spacing.

A microphone array speech enhancement method includes the following steps:

Acquiring the first array of voice signals collected by the multi-channel digital voice collection device;

Calculating an optimal beam output signal synthesized by the first array voice signal by using the first array voice signal according to the minimum variance adaptive beam optimization model of the first array voice signal;

Performing single channel speech enhancement processing using the power spectrum estimation value of the optimal beam output signal;

The minimum variance adaptive beam optimization model of the first array of speech signals includes a spatial steering vector of the target sound source to the multi-channel digital speech acquisition device.

Optionally, before acquiring the first array voice signal input by the multi-channel digital voice collection device, the method further includes:

Acquiring original speech array signals y_1(n), ..., y_N(n) through a multi-channel digital voice acquisition device;

Performing a short-time Fourier transform on the original speech array signals to obtain their time-frequency representation signals y_1(k, λ), ..., y_N(k, λ);

Performing frequency-domain optimal super-directional beam processing on the time-frequency representation signals using the optimal super-directional beam coefficients A(k) = [a_1(k), ..., a_N(k)]^T to obtain the first array speech signals;

Here n is the discrete time variable; N is the number of array elements; k is the frequency bin index; and λ is the short-time frame index.

Optionally, the optimal super-directional beam coefficients are set according to the arrangement of the multi-channel digital voice collection device.

Optionally, according to the minimum variance adaptive beam optimization model of the first array voice signal, when the first array voice signal is used to calculate the optimal beam output signal synthesized by the first array voice signal, the following formula is adopted:

where the terms denote, respectively: the optimal beam output signal;

the adaptive filter parameters, calculated from the noise signal column vector, the optimal super-directional beam coefficients, and the spatial steering vector from the target sound source to the digital speech acquisition devices;

the complex conjugate of the element a_i(k) of the optimal super-directional beam coefficients A(k) = [a_1(k), ..., a_N(k)]^T; and y_i(k, λ), the first array speech signal.

Optionally, the minimum variance adaptive beam optimization model of the first array of speech signals is:

subject to a constraint,

where each element of w(k) and the corresponding adaptive filter parameter

are complex conjugates of each other; w^H(k) is the conjugate transpose of w(k);

and the remaining terms are the noise coherence matrix estimated from the first array speech signals

and the spatial steering vector from the target sound source to the digital speech acquisition devices.

Optionally, the spatial steering vector of the target sound source to the digital voice collection device is calculated according to the following formula:

where d_1, ..., d_N are the distances from the first through N-th digital speech collection devices to the center of the digital speech collection device array; c is the speed of sound; f_s is the sampling frequency; θ is the direction angle of the target sound source relative to the digital speech acquisition devices; and

the remaining coefficient is the complex conjugate of the element a_i(k) of the optimal super-directional beam coefficients A(k) = [a_1(k), ..., a_N(k)]^T.

Optionally, the method further includes:

Performing a voice activity detection VAD on the noise signal array in the array voice input signals of the plurality of channels;

Performing noise power spectrum estimation on the noise signal array according to the result of the voice activity detection VAD;

And performing, according to the optimal power spectrum estimation value of the optimal beam output signal and the noise power spectrum estimation value, the second enhancement of the optimal beam output signal.

Optionally, the step of performing noise power spectrum estimation on the noise signal array according to the result of the voice activity detection VAD includes:

Calculating a noise power spectrum in each of the speech state, the no-speech state, the speech start state, and the speech end state;

Performing compromise (trade-off) processing on the noise power spectrum in the speech state and the noise power spectrum in the no-speech state to obtain a noise power spectrum estimation value.

Optionally, the step of calculating a noise power spectrum when there is a voice state, a voiceless state, a voice start state, and a voice end state includes:

When in the no-speech state, the power spectrum of the noise signal array is estimated using the following formula:

When in the voice start state and the voice state, the power spectrum of the noise signal array is estimated by the following formula:

When in the speech end state, the power spectrum of the noise signal array is estimated by two-pole regression smoothing using the following formula:

In the above formulas, a_1 is the noise spectrum update parameter, and a_a and a_d are smoothing coefficients.

Optionally, the power spectrum estimation value of the optimal beam output signal is calculated by using the following formula:

where the terms denote the power spectrum estimate of the optimal beam output signal

and the optimal beam output signal itself, and a_0 is a noise spectrum update parameter.

A microphone array voice enhancement device includes:

a first acquiring module, configured to: acquire a first array voice signal that is input through a multi-channel digital voice collecting device;

An optimal beam output signal calculation module is configured to calculate an optimal beam output signal synthesized by the first array voice signal by using the first array voice signal according to the minimum variance adaptive beam optimization model of the first array voice signal;

a first enhancement module, configured to: perform a single channel speech enhancement process by using a power spectrum estimation value of the optimal beam output signal;

Optionally, the device further includes:

an original signal acquisition module, configured to: collect the original speech array signals y_1(n), ..., y_N(n) through the multi-channel digital voice collection device;

an original signal transformation module, configured to: perform a short-time Fourier transform on the original speech array signals to obtain their time-frequency representation signals y_1(k, λ), ..., y_N(k, λ);

an optimal super-directional beam processing module, configured to: perform frequency-domain optimal super-directional beam processing on the time-frequency representation signals using the optimal super-directional beam coefficients A(k) = [a_1(k), ..., a_N(k)]^T to obtain the first array speech signals.

Optionally, the optimal beam output signal calculation module is configured to calculate, according to the minimum variance adaptive beam optimization model of the first array speech signals, the optimal beam output signal synthesized from the first array speech signals using the following formula:

where the result is the optimal beam output signal,

subject to a constraint on w(k),

whose array elements are the complex conjugates of the adaptive filter parameters.

Optionally, when the optimal beam output signal calculation module calculates the optimal beam output signal of the first array voice signal, the spatial steering vector of the target sound source to the digital voice acquisition device is calculated according to the following formula:

where the coefficient applied to each channel is the complex conjugate of the element a_i(k) of the optimal super-directional beam coefficients A(k) = [a_1(k), ..., a_N(k)]^T.

Optionally, it also includes:

a VAD module, configured to: perform voice activity detection VAD on an array of noise signals in the array voice input signals of the plurality of channels;

a noise power spectrum estimation module, configured to: perform noise power spectrum estimation on the noise signal array according to the result of the voice activity detection VAD;

a second enhancement module, configured to: perform a second enhancement of the optimal beam output signal according to the power spectrum estimate of the optimal beam output signal and the noise power spectrum estimate.

Optionally, the noise power spectrum estimation module includes:

a first noise power spectrum calculation unit, configured to: calculate a noise power spectrum in each of the speech state, the no-speech state, the speech start state, and the speech end state;

a second noise power spectrum calculation unit, configured to: perform compromise (trade-off) processing on the noise power spectrum in the speech state and the noise power spectrum in the no-speech state to obtain a noise power spectrum estimation value.

Optionally, the first noise power spectrum calculation unit includes:

a no-speech state calculation subunit, configured to: when in the no-speech state, estimate the power spectrum of the noise signal array using the following formula:

a speech start and speech state calculation subunit, configured to: when in the speech start state or the speech state, estimate the power spectrum of the noise signal array using the following formula:

a speech end state calculation subunit, configured to: when in the speech end state, estimate the power spectrum of the noise signal array by two-pole regression smoothing using the following formula:

In the above formulas, the terms denote the power spectrum estimate of the optimal beam output signal and the optimal beam output signal itself, and a_0 is the noise spectrum update parameter.

Embodiments of the present invention also provide a computer program comprising program instructions that, when executed by a computer, cause the computer to perform the above method.

Embodiments of the present invention also provide a computer readable storage medium carrying the computer program.

As can be seen from the above, the microphone array voice enhancement method and apparatus provided by the embodiments of the present invention apply the minimum variance adaptive beam optimization model to the first array voice signals collected and input by the multi-channel digital voice acquisition device. Because the model includes the spatial steering vector from the target sound source to the multi-channel digital speech acquisition device, speech enhancement can be performed on microphone arrays with larger element spacing, and high-quality pickup can be achieved. In addition, the method and apparatus estimate the power spectrum of the noise signal array at different stages of speech according to the result of the voice activity detection, which yields higher noise estimation accuracy and thereby improves the speech enhancement effect.

Brief description of the drawings

FIG. 1 is a schematic flowchart of a microphone array voice enhancement method according to an embodiment of the present invention;

FIG. 2 is a schematic flowchart of an original voice collection processing process according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a noise power spectrum estimation process according to an embodiment of the present invention;

FIG. 4 is a schematic flowchart showing a detailed calculation of the noise power spectrum according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a microphone array voice enhancement apparatus according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of voice signal processing according to an embodiment of the present invention.

Preferred embodiment of the invention

Embodiments of the present invention are described below with reference to the accompanying drawings. Where no conflict arises, the embodiments and the features of the embodiments may be combined with each other arbitrarily.

First, beamforming techniques related to embodiments of the present invention include both fixed beam and adaptive beam.

A fixed beam means that the parameters of the array signal processing system do not change with the picked-up signal but are determined by the array topology and a preset noise field model; fixed beams include time-domain fixed beams and frequency-domain fixed beams. The directivity of a fixed beam degrades at low and medium frequencies, while speech is a wideband signal. Improving the low- and mid-frequency directivity degrades the robustness of the array, so fixed beams are seldom used alone in practical small microphone array applications.

An adaptive beam dynamically generates the optimal beam parameters under given optimization conditions by automatically estimating the sound field and the transfer function from the sound source to each microphone. In practice, because the transfer function from the sound source to each microphone is difficult to estimate, adaptive beamforming is often combined with multi-channel noise suppression or with post-filtering after beam processing. This requires accurate estimation of the statistical characteristics of the noise and a good balance between target signal distortion and noise suppression.

Embodiments of the present invention provide a microphone array voice enhancement method, including the steps shown in FIG. 1:

Step 101: Acquire a first array voice signal that is input by using a multi-channel digital voice collection device.

Step 102: Calculate an optimal beam output signal synthesized by the first array voice signal by using the first array voice signal according to the minimum variance adaptive beam optimization model of the first array voice signal.

Step 103: Perform single channel speech enhancement processing by using a power spectrum estimation value of the optimal beam output signal.

As can be seen from the above, the microphone array voice enhancement method provided by the embodiment of the present invention applies the minimum variance adaptive beam optimization model to the first array voice signals collected and input by the multi-channel digital voice acquisition device. Because the model includes the spatial steering vector from the target sound source to the multi-channel digital speech acquisition device, speech enhancement can be performed on microphone arrays with larger element spacing, and high-quality pickup can be achieved.

In some embodiments of the present invention, in the step of performing single channel speech enhancement processing using the power spectrum estimation value of the optimal beam output signal, the log MMSE method is applied to process the optimal beam output signal.

In some embodiments of the present invention, before acquiring the first array of voice signals input by the multi-channel digital voice collection device, the steps shown in FIG. 2 are also included:

Step 201: Acquire original speech array signals y_1(n), ..., y_N(n) through a multi-channel digital voice collecting device;

Step 202: Perform a short-time Fourier transform on the original speech array signals to obtain their time-frequency representation signals y_1(k, λ), ..., y_N(k, λ);

Step 203: Perform frequency-domain optimal super-directional beam processing on the time-frequency representation signals using the optimal super-directional beam coefficients A(k) = [a_1(k), ..., a_N(k)]^T to obtain the first array speech signals,

i = 1, ..., N.

The original speech array signals collected by the multi-channel digital voice collecting device are y_1(n), ..., y_N(n). The signal of each channel is windowed and truncated with a time window of length L_wnd, with adjacent windows overlapping by L_ovlp. A Hanning window is used, with an overlap of 3/4 of the window length. The windowed signal of each channel is then subjected to a short-time Fourier transform to obtain the time-frequency representation signals of the original speech array signals: y_1(k, λ), ..., y_N(k, λ). Theoretically, the time-frequency representation signal y_i(k, λ), the noise signal v_i(k, λ) of the original speech array signal, and the target speech signal x(k, λ) emitted by the target sound source satisfy the following relationship:

y_i(k, λ) = v_i(k, λ) + x(k, λ),

where the accompanying terms denote, respectively,

the conjugate of A(k);

the frequency-domain weighted target speech signal; and

the frequency-domain weighted noise signal; i = 1, ..., N.
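The windowing and short-time Fourier transform of Step 202 can be sketched as follows. This is a minimal, dependency-free illustration and not the implementation of the embodiment: the function names `hann` and `stft`, the periodic form of the Hanning window, and the direct DFT (used here in place of an FFT) are all assumptions made for the example.

```python
import cmath
import math


def hann(length):
    """Periodic Hann (Hanning) window; the periodic form is an assumption."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length) for n in range(length)]


def stft(signal, win_len, overlap):
    """Return frames[lam][k]: the DFT of each Hann-windowed frame.

    Only bins 0..win_len//2 are kept because the input is real-valued.
    A direct O(L^2) DFT keeps the sketch self-contained; a practical
    system would use an FFT.
    """
    hop = win_len - overlap
    window = hann(win_len)
    frames = []
    for start in range(0, len(signal) - win_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(win_len)]
        spectrum = [
            sum(seg[n] * cmath.exp(-2j * math.pi * k * n / win_len)
                for n in range(win_len))
            for k in range(win_len // 2 + 1)
        ]
        frames.append(spectrum)
    return frames


# Toy usage: a 16-sample window with 3/4 overlap (hop of 4 samples),
# applied to a cosine that sits exactly on frequency bin 2.
tone = [math.cos(2.0 * math.pi * 2 * n / 16) for n in range(32)]
spec = stft(tone, win_len=16, overlap=12)
peak_bin = max(range(len(spec[0])), key=lambda k: abs(spec[0][k]))
```

In this toy case the energy of every frame concentrates around bin 2, as expected for a tone at two cycles per window.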

In some embodiments, the optimal super-directional beam coefficients are determined in accordance with an array topology of the multi-channel digital speech acquisition device in conjunction with a sound source direction.

Because optimal super-directional beam processing is used, the element spacing of the multi-channel digital acquisition devices is allowed to be larger than that of multi-channel speech collection devices in the related art.

In some embodiments, according to the minimum variance adaptive beam optimization model of the first array of speech signals, when the first array of speech signals is used to calculate an optimal beam output signal synthesized by the first array of speech signals, the following formula is used:

where the terms denote, respectively: the optimal beam output signal;

the adaptive filter parameters, calculated from the noise signal column vector, the optimal super-directional beam coefficients, and the spatial steering vector from the target sound source to the digital speech acquisition devices;

the complex conjugates of the optimal super-directional beam coefficients A(k) = [a_1(k), ..., a_N(k)]^T; and y_i(k, λ), the first array speech signal.
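The formula itself is not reproduced in this text. Read from the terms listed above, one plausible form is a sum over channels of the conjugated adaptive filter parameter, the conjugated super-directional coefficient, and the channel signal. The sketch below illustrates that reading for a single frequency bin and frame; the function name `beam_output` and the exact placement of the conjugates are assumptions, not the formula of the embodiment.

```python
def beam_output(w, a, y):
    """Assumed form: Z = sum_i conj(w_i) * conj(a_i) * y_i.

    w: adaptive filter weights, a: super-directional beam coefficients,
    y: channel spectra, all for one frequency bin k and one frame lambda.
    """
    return sum(wi.conjugate() * ai.conjugate() * yi
               for wi, ai, yi in zip(w, a, y))


# Two-channel toy example with hypothetical values.
w = [0.5 + 0j, 0.5 + 0j]
a = [1 + 0j, 1j]
y = [2 + 0j, 2j]
z = beam_output(w, a, y)
```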

In some embodiments, the minimum variance adaptive beam optimization model of the first array of speech signals is:

subject to a constraint,

where the array elements of w(k) and the adaptive filter parameters are complex conjugates of each other.

According to the above model, the complex conjugates of the adaptive filter parameters, calculated from the noise signal column vector, the optimal super-directional beam coefficients, and the spatial steering vector from the target sound source to the digital speech acquisition devices, are given by:

where

a superscript H

denotes the conjugate transpose.
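For orientation, the textbook minimum variance distortionless response (MVDR) problem and its closed-form solution are shown below. The symbols \Phi_{vv}(k) for the noise coherence matrix and d(k) for the spatial steering vector are chosen here for illustration only; whether the model of the embodiment has exactly this form, and which symbols it uses, cannot be confirmed from this text.

```latex
\min_{w(k)} \; w^{H}(k)\,\Phi_{vv}(k)\,w(k)
\quad \text{subject to} \quad w^{H}(k)\,d(k) = 1,
\qquad
w_{\mathrm{opt}}(k) = \frac{\Phi_{vv}^{-1}(k)\,d(k)}{d^{H}(k)\,\Phi_{vv}^{-1}(k)\,d(k)} .
```

The constraint keeps the response toward the target direction fixed at unity, while the objective minimizes the output noise power.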

The spatial steering vector from the target sound source to the digital voice collecting devices is calculated by the following formula:

where d_1, ..., d_N are the distances from the first through N-th digital voice collecting devices to the center of the digital voice collecting device array; c is the speed of sound; f_s is the sampling frequency; and θ is the direction angle of the target sound source relative to the digital voice collecting devices. Since the signals are first processed by the frequency-domain super-directional beam, the spatial steering vector from the target sound source to the digital speech acquisition devices after that processing becomes:

where the noise signal estimated from the first array speech signals is

Correspondingly, the noise coherence matrix estimated from the first array speech signals is:

where E denotes the expectation operator;

a superscript H

denotes the conjugate transpose;

and w(k) = [w_1(k), ..., w_N(k)]^T.
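A steering vector built from the quantities d_1, ..., d_N, c, f_s and θ can be sketched as follows, using the common far-field model for a linear array. This is an assumption-laden illustration: the FFT length `n_fft`, the sign of the exponent, the use of cos θ, and the treatment of each d_i as a signed position along the array axis are choices made for the example, and the function names are hypothetical.

```python
import cmath
import math


def steering_vector(k, n_fft, positions, theta, fs=16000.0, c=340.0):
    """Far-field steering vector for frequency bin k (assumed model).

    positions: signed offsets d_i in metres of each microphone from the
    array centre along the array axis; theta: source angle in radians
    measured from that axis.
    """
    omega = 2.0 * math.pi * k * fs / n_fft   # angular frequency of bin k
    return [cmath.exp(-1j * omega * d * math.cos(theta) / c) for d in positions]


def weighted_steering_vector(steer, a):
    """Steering vector seen after channel i is weighted by conj(a_i)."""
    return [ai.conjugate() * si for ai, si in zip(a, steer)]


# Toy usage: six microphones 5 cm apart, source at broadside (90 degrees).
mics = [-0.125, -0.075, -0.025, 0.025, 0.075, 0.125]
broadside = steering_vector(k=64, n_fft=512, positions=mics, theta=math.pi / 2)
oblique = steering_vector(k=64, n_fft=512, positions=mics, theta=math.pi / 3)
```

At broadside all channels receive the wavefront at the same time, so every entry is (numerically) 1; at other angles the entries keep unit magnitude and differ only in phase.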

In some embodiments of the invention, the method further comprises the steps shown in Figure 3:

Step 301: Perform a voice activity detection (VAD) on the noise signal array in the array voice input signals of the multiple channels;

Step 302: Perform noise power spectrum estimation on the noise signal array according to the result of the voice activity detection VAD.

Step 303: Perform a second enhancement on the optimal beam output signal according to the optimal power spectrum estimation value of the optimal beam output signal and the noise power spectrum estimation value.

The above embodiment can perform dynamic time-varying estimation of the noise signal in the first array speech signal to prepare for secondary enhancement of the sound.

In general, noise can be estimated using the following formula when there is no speech:

When there is speech, the noise can be estimated by the following formula:

where a_R is the smoothing factor.
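The two formulas are not reproduced in this text. A common VAD-gated scheme, shown here only as an assumed illustration, updates the noise power spectrum by recursive averaging with a_R while no speech is present and holds the previous estimate while speech is present. The function name `update_noise_psd` and the assignment of a_R to the no-speech branch are assumptions.

```python
def update_noise_psd(prev_psd, frame_power, speech_present, a_r=0.95):
    """Per-bin noise power update gated by a VAD decision (assumed scheme).

    prev_psd: previous noise power estimate per frequency bin.
    frame_power: |Y(k, lambda)|^2 of the current frame per bin.
    """
    if speech_present:
        # Hold the estimate so that speech energy does not leak into it.
        return list(prev_psd)
    return [a_r * p + (1.0 - a_r) * x for p, x in zip(prev_psd, frame_power)]


held = update_noise_psd([1.0, 2.0], [3.0, 4.0], speech_present=True)
updated = update_noise_psd([1.0, 2.0], [3.0, 4.0], speech_present=False)
```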

The step of performing noise power spectrum estimation on the noise signal array according to the result of the voice activity detection VAD may include the process shown in FIG. 4:

Step 401: Calculate a noise power spectrum in each of the speech state, the no-speech state, the speech start state, and the speech end state;

Step 402: Perform a compromise process on the noise power spectrum in the voice state and the noise power spectrum in the voiceless state to obtain a noise power spectrum estimation value.

In some embodiments, the step of performing noise power spectrum estimation on the noise signal array according to the result of the voice activity detection (VAD) comprises:

When in the no-speech state, a fast smoothed estimate of the power spectrum of the noise signal array is computed using the following formula:

The noise power spectrum threshold is then calculated according to the following formula:

where L_1 is the number of frequency points.

When in the speech start state, the power spectrum of the noise signal array is first estimated by two-pole regression smoothing using the following formula:

The noise peak is then calculated from the power spectrum estimate of the noise signal array at the start of speech:

When in the speech end state, the power spectrum of the noise signal array is estimated by two-pole regression smoothing using the following formula:

The compromise estimate of the noise power spectrum is then calculated:

where a_1 is a noise spectrum update parameter; a_a and a_d are smoothing coefficients; a_0 is a noise spectrum update parameter; and the remaining terms denote, respectively:

the fast smoothed estimate of the power spectrum of the noise signal array;

the two-pole regression smoothed estimate of the power spectrum of the noise signal array;

the power spectrum estimate of the optimal beam output signal used in the single-channel enhancement processing; and

the estimate of the noise power threshold of the noise signal array.
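The state-dependent formulas are not reproduced in this text. As a loose illustration only, one plausible reading of smoothing with two coefficients a_a and a_d is asymmetric recursive averaging, in which a rising power is tracked slowly and a falling power is tracked quickly. The function name `two_rate_smooth` and the rise/fall rule are assumptions and should not be taken as the formula of the embodiment.

```python
def two_rate_smooth(prev, current, a_a=0.995, a_d=0.85):
    """Asymmetric recursive smoothing per frequency bin (assumed reading).

    a_a is applied when the new power exceeds the previous estimate
    (slow rise); a_d is applied otherwise (faster fall).
    """
    out = []
    for p, x in zip(prev, current):
        alpha = a_a if x > p else a_d
        out.append(alpha * p + (1.0 - alpha) * x)
    return out


rising = two_rate_smooth([1.0], [2.0])    # estimate creeps up slowly
falling = two_rate_smooth([2.0], [1.0])   # estimate drops more quickly
```

With the reference values a_a = 0.995 and a_d = 0.85, a sudden rise in power (for example at a speech onset) moves the estimate very little, while a drop is followed much more closely.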

In some embodiments, the power spectrum estimate of the optimal beam output signal is calculated using the following formula:

where the estimated quantity

is the power spectrum estimate of the optimal beam output signal.
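The formula is not reproduced in this text. Given that it involves the optimal beam output signal, its power spectrum estimate and the update parameter a_0, a first-order recursive average of the squared magnitude is a natural candidate, and is sketched below as an assumption. The function name `update_beam_psd` is hypothetical.

```python
def update_beam_psd(prev_psd, beam_bins, a_0=0.8):
    """Assumed form: P(k, lam) = a_0 * P(k, lam - 1) + (1 - a_0) * |Z(k, lam)|^2."""
    return [a_0 * p + (1.0 - a_0) * abs(z) ** 2
            for p, z in zip(prev_psd, beam_bins)]


psd = update_beam_psd([1.0], [3 + 4j])   # |3 + 4j|^2 = 25
```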

Optionally, the compromise estimate of the noise power spectrum

and the power spectrum estimate of the optimal beam output signal

are input to the post filter for processing. A schematic diagram of the speech signal processing in this embodiment is shown in FIG. 6. An inverse FFT is applied to the signal processed by the post filter, and the enhanced time-domain signal stream is then reconstructed by the overlap-add method.
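The overlap-add reconstruction can be sketched as follows. The example skips the post filter and the inverse FFT and only shows why a Hanning window with 3/4 overlap permits reconstruction: the shifted windows sum to a constant. The function name `overlap_add` and the periodic window form are assumptions for the example.

```python
import math


def overlap_add(frames, hop):
    """Sum time-domain frames placed at multiples of hop samples."""
    win_len = len(frames[0])
    out = [0.0] * (hop * (len(frames) - 1) + win_len)
    for index, frame in enumerate(frames):
        start = index * hop
        for n, value in enumerate(frame):
            out[start + n] += value
    return out


# Hann-windowed frames of a constant signal equal to 1, with 3/4 overlap.
win_len, hop = 16, 4
window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / win_len) for n in range(win_len)]
frames = [list(window) for _ in range(8)]
rebuilt = overlap_add(frames, hop)
# Region where four frames overlap fully (edges excluded).
steady = rebuilt[win_len - hop: len(rebuilt) - (win_len - hop)]
```

In the fully overlapped region the sum equals 2 for this window and hop, so dividing the overlap-added output by 2 restores the original amplitude.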

For a sound signal sampling system with a sampling frequency of 16 kHz, various parameters in the embodiment of the present invention can be referred to the following values:

N = 6; L_wnd = 32 ms; L_ovlp = 24 ms; c = 340 m/s; f_s = 16000 Hz; a_0 = 0.8; a_R = 0.95; a_1 = 0.85; a_a = 0.995; a_d = 0.85; L_1 = 7.
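Converted to samples at 16 kHz, the window and overlap lengths work out as shown below. The number of frequency bins assumes that the FFT length equals the window length, which the text does not state.

```python
fs = 16000                      # sampling frequency f_s in Hz
win_len = fs * 32 // 1000       # L_wnd = 32 ms
overlap = fs * 24 // 1000       # L_ovlp = 24 ms
hop = win_len - overlap         # frame advance in samples
n_bins = win_len // 2 + 1       # one-sided bins, assuming FFT length = win_len
print(win_len, overlap, hop, n_bins)
```

The overlap of 384 samples is 3/4 of the 512-sample window, which matches the 3/4 overlap stated for the Hanning window.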

In an embodiment of the present invention, the frequency-domain optimal super-directional beam is designed according to the array topology and the sound source direction. The original speech array signals are subjected to a short-time Fourier transform, the noise coherence matrix is estimated from them, and the transformed signals are processed with the optimal super-directional beam parameters so that the speech signal is enhanced. The noise coherence matrix is estimated dynamically in order to update the optimal adaptive filter parameters. Finally, the post filter is used to further improve the signal quality. With this embodiment, high-quality long-distance voice pickup can be achieved with only a small number of microphones; complex noise outside the beam is clearly suppressed, and speech distortion is barely audible.

As can be seen from the above, the microphone array voice enhancement method provided by the embodiment of the present invention can accurately calculate the noise signal in the original voice signal input by the voice collection device, so that the noise signal can be effectively suppressed when the voice is enhanced.

The embodiment of the invention further provides a microphone array voice enhancement device, which has the structure shown in FIG. 5 and includes:

It can be seen from the above that the microphone array voice enhancement device provided by the embodiment of the present invention uses the optimal beam output signal calculation module to process the first array voice signals collected by the multi-channel digital voice collection device, applying the minimum variance adaptive beam optimization model to calculate the optimal beam output signal of the first array speech signals. The device can therefore work with microphone arrays that have more array elements and larger element spacing.

Still referring to FIG. 5, in some embodiments, the apparatus further includes:

an optimal super-directional beam processing module, configured to: perform frequency-domain optimal super-directional beam processing on the time-frequency representation signals using the optimal super-directional beam coefficients A(k) = [a_1(k), ..., a_N(k)]^T to obtain the first array speech signals,

i = 1, ..., N;

In some embodiments, the optimal super-directional beam coefficients are set according to the arrangement of the multi-channel digital voice collection device.

In some embodiments, when the optimal beam output signal calculation module calculates the optimal beam output signal synthesized from the first array speech signals according to the minimum variance adaptive beam optimization model of the first array speech signals, the following formula is used:

where the result is the optimal beam output signal,

subject to a constraint on w(k),

whose array elements are the complex conjugates of the adaptive filter parameters.

In some embodiments, when the optimal beam output signal calculation module calculates the optimal beam output signal of the first array of speech signals, the spatial steering vector from the target sound source to the digital speech acquisition devices is calculated according to the following formula:

Still referring to FIG. 5, in some embodiments, the apparatus further includes:

And a second enhancement module, configured to: perform the second enhancement on the optimal beam output signal according to the optimal power spectrum estimation value of the optimal beam output signal and the noise power spectrum estimation value.

Still referring to FIG. 5, in some embodiments, the noise power spectrum estimation module includes:

In some embodiments, the first noise power spectrum calculation unit includes:

In the above formula,

where the estimated quantity

is the power spectrum estimate of the optimal beam output signal.

Optionally, the compromise estimate of the noise power spectrum

and the power spectrum estimate of the optimal beam output signal

are input to the post filter for processing. An inverse FFT is applied to the signal processed by the post filter, and the enhanced time-domain signal stream is then reconstructed by the overlap-add method.

As can be seen from the above, the microphone array voice enhancement device provided by the embodiment of the present invention can effectively estimate and process the noise signal in the first array voice signals collected by the multi-channel digital voice collection device. This helps filter out the noise signal effectively during subsequent speech enhancement and improves the speech enhancement effect.

It is to be understood that the various embodiments of the present invention are intended to illustrate and explain the invention. And in the case of no conflict, the embodiments in the present application and the features in the embodiments can be combined with each other.

One of ordinary skill in the art will appreciate that all or a portion of the steps of the above-described embodiments can be implemented as a computer program flow. The computer program can be stored in a computer readable storage medium and executed on a corresponding hardware platform (e.g., a system, equipment, apparatus, or device); when executed, it includes one or a combination of the steps of the method embodiments.

Alternatively, all or part of the steps of the above embodiments may also be implemented using integrated circuits. These steps may be fabricated separately as individual integrated circuit modules, or multiple modules or steps may be fabricated as a single integrated circuit module. Thus, the invention is not limited to any specific combination of hardware and software.

Each device/functional module/functional unit in the above embodiments may be implemented using a general-purpose computing device, and may be centralized on a single computing device or distributed across a network of multiple computing devices.

When each device/function module/functional unit in the above embodiment is implemented in the form of a software function module and sold or used as a stand-alone product, it can be stored in a computer readable storage medium. The above mentioned computer readable storage medium may be a read only memory, a magnetic disk or an optical disk or the like.

Variations or substitutions that those skilled in the art can readily conceive within the scope of the present invention fall within the scope of the present invention. Therefore, the scope of the invention should be determined by the scope of the claims.

Industrial applicability

The embodiments of the invention can perform speech enhancement processing on microphone arrays with larger spacing between array elements, and can achieve high-quality sound pickup.

Claims

A microphone array speech enhancement method, comprising:

acquiring a first array speech signal collected and input by a multi-channel digital voice collection device;

calculating, according to a minimum variance adaptive beam optimization model of the first array speech signal, an optimal beam output signal synthesized from the first array speech signal; and

performing single-channel speech enhancement processing using a power spectrum estimate of the optimal beam output signal;

wherein the minimum variance adaptive beam optimization model of the first array speech signal includes a spatial steering vector from a target sound source to the multi-channel digital voice collection device.
The method of claim 1, wherein, before acquiring the first array speech signal input by the multi-channel digital voice collection device, the method further comprises:

acquiring original speech array signals $y_1(n), \ldots, y_N(n)$ through the multi-channel digital voice collection device;

performing a short-time Fourier transform on the original speech array signals to obtain time-frequency representation signals $y_1(k, \lambda), \ldots, y_N(k, \lambda)$ of the original speech array signals; and

performing frequency-domain optimal super-directional beam processing on the time-frequency representation signals using optimal super-directional beam coefficients $A(k) = [a_1(k), \ldots, a_N(k)]^T$ to obtain the first array speech signal
$i = 1, \ldots, N$;

wherein $n$ is the discrete time variable; $N$ is the number of array elements; $k$ is the frequency bin index; and $\lambda$ is the short-time frame index.
The method of claim 2, wherein the optimal super-directional beam coefficients are set according to the manner in which the multi-channel digital voice collection device is arranged.
The method of claim 1, wherein, in the step of calculating the optimal beam output signal synthesized from the first array speech signal according to the minimum variance adaptive beam optimization model of the first array speech signal, the optimal beam output signal is calculated using the following formula:

is the optimal beam output signal;
is the adaptive filter parameter, calculated from the noise signal column vector, the optimal super-directional beam coefficients, and the spatial steering vector from the target sound source to the digital voice collection device;
is the complex conjugate of the element $a_i$ of the optimal super-directional beam coefficients $A(k) = [a_1(k), \ldots, a_N(k)]^T$; and $y_i(k, \lambda)$ is the first array speech signal.
The method of claim 3, wherein the minimum variance adaptive beam optimization model of the first array speech signal is:

and satisfies

where the array elements in $w(k)$ and
are complex conjugates of each other; $w^H(k)$ is the conjugate transpose of $w(k)$;
is the noise coherence matrix estimated from the first array speech signal; and
is the spatial steering vector from the target sound source to the digital voice collection device.
The method of claim 5, wherein the spatial steering vector from the target sound source to the digital voice collection device is calculated according to the following formula:

where $d_1, \ldots, d_N$ are the distances from the first to the $N$-th digital voice collection devices to the center of the digital voice collection device array; $c$ is the speed of sound; $f_s$ is the sampling frequency; $\theta$ is the azimuth angle of the target sound source relative to the digital voice collection device; and
is the complex conjugate of the element $a_i$ of the optimal super-directional beam coefficients $A(k) = [a_1(k), \ldots, a_N(k)]^T$.
The method of claim 1, wherein the method further comprises:

performing voice activity detection (VAD) on the noise signal array in the multi-channel array speech input signals;

performing noise power spectrum estimation on the noise signal array according to the result of the VAD; and

performing a second enhancement of the optimal beam output signal according to the power spectrum estimate of the optimal beam output signal and the noise power spectrum estimate.
The method of claim 7, wherein the step of performing noise power spectrum estimation on the noise signal array according to the result of the VAD comprises:

calculating the noise power spectrum in the speech-present state, the speech-absent state, the speech start state, and the speech end state; and

performing compromise processing on the noise power spectrum in the speech-present state and the noise power spectrum in the speech-absent state to obtain a noise power spectrum estimate.
The method of claim 8, wherein calculating the noise power spectrum in the speech-present state, the speech-absent state, the speech start state, and the speech end state comprises:

in the speech-absent state, estimating the power spectrum of the noise signal array using the following formula:

in the speech start state and the speech-present state, estimating the power spectrum of the noise signal array using the following formula:

in the speech end state, estimating the power spectrum of the noise signal array by two-pole recursive smoothing using the following formula:

In the above formula,

where $a_1$ is the noise spectrum update parameter, and $a_a$ and $a_d$ are the smoothing coefficients.
The method of claim 1, wherein the power spectrum estimate of the optimal beam output signal is calculated using the following formula:

where
is the power spectrum estimate of the optimal beam output signal;
is the optimal beam output signal; and $a_0$ is a noise spectrum update parameter.
A microphone array speech enhancement device, comprising:

a first acquiring module, configured to acquire a first array speech signal input through a multi-channel digital voice collection device;

an optimal beam output signal calculation module, configured to calculate, according to a minimum variance adaptive beam optimization model of the first array speech signal, an optimal beam output signal synthesized from the first array speech signal; and

a first enhancement module, configured to perform single-channel speech enhancement processing using a power spectrum estimate of the optimal beam output signal;

wherein the minimum variance adaptive beam optimization model of the first array speech signal includes a spatial steering vector from a target sound source to the multi-channel digital voice collection device.
The apparatus of claim 11, further comprising:

an original signal acquisition module, configured to collect original speech array signals $y_1(n), \ldots, y_N(n)$ through the multi-channel digital voice collection device;

an original signal transformation module, configured to perform a short-time Fourier transform on the original speech array signals to obtain time-frequency representation signals $y_1(k, \lambda), \ldots, y_N(k, \lambda)$ of the original speech array signals; and

an optimal super-directional beam processing module, configured to perform frequency-domain optimal super-directional beam processing on the time-frequency representation signals using optimal super-directional beam coefficients $A(k) = [a_1(k), \ldots, a_N(k)]^T$ to obtain the first array speech signal
$i = 1, \ldots, N$;

wherein $n$ is the discrete time variable; $N$ is the number of array elements; $k$ is the frequency bin index; and $\lambda$ is the short-time frame index.
The apparatus of claim 12, wherein the optimal super-directional beam coefficients are set according to the manner in which the multi-channel digital voice collection device is arranged.
The apparatus of claim 11, wherein the optimal beam output signal calculation module is configured to calculate the optimal beam output signal synthesized from the first array speech signal using the following formula:

is the optimal beam output signal;
is the adaptive filter parameter, calculated from the noise signal column vector, the optimal super-directional beam coefficients, and the spatial steering vector from the target sound source to the digital voice collection device;
is the complex conjugate of the element $a_i$ of the optimal super-directional beam coefficients $A(k) = [a_1(k), \ldots, a_N(k)]^T$; and $y_i(k, \lambda)$ is the first array speech signal.
The apparatus of claim 13, wherein the minimum variance adaptive beam optimization model of the first array speech signal is:

and satisfies

where the array elements in $w(k)$ and
are complex conjugates of each other; $w^H(k)$ is the conjugate transpose of $w(k)$;
is the noise coherence matrix estimated from the first array speech signal; and
is the spatial steering vector from the target sound source to the digital voice collection device.
The apparatus of claim 15, wherein the optimal beam output signal calculation module is configured to calculate the spatial steering vector from the target sound source to the digital voice collection device according to the following formula:

where $d_1, \ldots, d_N$ are the distances from the first to the $N$-th digital voice collection devices to the center of the digital voice collection device array; $c$ is the speed of sound; $f_s$ is the sampling frequency; $\theta$ is the azimuth angle of the target sound source relative to the digital voice collection device; and
is the complex conjugate of the element $a_i$ of the optimal super-directional beam coefficients $A(k) = [a_1(k), \ldots, a_N(k)]^T$.
The apparatus of claim 11, further comprising:

a VAD module, configured to perform voice activity detection (VAD) on the noise signal array in the multi-channel array speech input signals;

a noise power spectrum estimation module, configured to perform noise power spectrum estimation on the noise signal array according to the result of the VAD; and

a second enhancement module, configured to perform a second enhancement of the optimal beam output signal according to the power spectrum estimate of the optimal beam output signal and the noise power spectrum estimate.
The apparatus of claim 17, wherein the noise power spectrum estimation module comprises:

a first noise power spectrum calculation unit, configured to calculate the noise power spectrum in the speech-present state, the speech-absent state, the speech start state, and the speech end state; and

a second noise power spectrum calculation unit, configured to perform compromise processing on the noise power spectrum in the speech-present state and the noise power spectrum in the speech-absent state to obtain a noise power spectrum estimate.
The apparatus of claim 18, wherein the first noise power spectrum calculation unit comprises:

a no-speech state calculation subunit, configured to estimate, in the speech-absent state, the power spectrum of the noise signal array using the following formula:

a speech start and speech state calculation subunit, configured to estimate, in the speech start state and the speech-present state, the power spectrum of the noise signal array using the following formula:

a no-speech state calculation subunit, configured to estimate, in the speech end state, the power spectrum of the noise signal array by two-pole recursive smoothing using the following formula:

In the above formula,

where $a_1$ is the noise spectrum update parameter, and $a_a$ and $a_d$ are the smoothing coefficients.
The apparatus of claim 11, wherein the power spectrum estimate of the optimal beam output signal is calculated using the following formula:

where
is the power spectrum estimate of the optimal beam output signal;
is the optimal beam output signal; and $a_0$ is a noise spectrum update parameter.
A computer program comprising program instructions which, when executed by a computer, cause the computer to perform the method of any of claims 1-11.
A computer readable storage medium carrying the computer program of claim 21.