CN114882898A - Multi-channel speech signal enhancement method and apparatus, computer device and storage medium - Google Patents

Multi-channel speech signal enhancement method and apparatus, computer device and storage medium

Info

Publication number
CN114882898A
CN114882898A (application CN202210384863.8A)
Authority
CN
China
Prior art keywords
time
covariance matrix
estimated
noise
frequency domain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210384863.8A
Other languages
Chinese (zh)
Inventor
王劲夫
杨飞然
孙国华
杨军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Original Assignee
Institute of Acoustics CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS filed Critical Institute of Acoustics CAS
Priority to CN202210384863.8A priority Critical patent/CN114882898A/en
Publication of CN114882898A publication Critical patent/CN114882898A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 Speech or audio signals analysis-synthesis techniques for redundancy reduction, using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/0212 Speech or audio signals analysis-synthesis techniques for redundancy reduction, using spectral analysis, using orthogonal transformation
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, using predictive techniques
    • G10L19/26 Pre-filtering or post-filtering
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 Processing in the time domain
    • G10L21/0232 Processing in the frequency domain
    • G10L21/0264 Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention provides a multi-channel speech signal enhancement method and system. The method comprises: performing a short-time Fourier transform on the time-domain signals of multiple channels collected by a microphone array to obtain the corresponding time-frequency domain signals; estimating the prior speech presence probability and calculating a noise covariance matrix; constructing an adaptive beamformer from the calculated noise covariance matrix and spatially filtering the collected time-frequency domain multi-channel signals to obtain an estimated time-frequency domain speech signal; and performing an inverse short-time Fourier transform on the estimated time-frequency domain speech signal to obtain the estimated time-domain speech signal. The invention effectively avoids the tailing effect in the prior probability estimation, estimates the noise covariance matrix more quickly and accurately, and improves the noise reduction performance.

Description

Multi-channel speech signal enhancement method and apparatus, computer device and storage medium
Technical Field
The present invention relates to the field of speech enhancement, and in particular, to a method and apparatus for enhancing a multi-channel speech signal, a computer device, and a storage medium.
Background
Multi-channel speech enhancement refers to extracting a desired speech signal from the multi-channel noisy signals acquired by a microphone array. Compared with single-channel speech enhancement, multi-channel speech enhancement can exploit temporal, spectral, and spatial information simultaneously and can, in theory, extract the desired speech without distortion. It plays an important role in conference systems, hearing aids, and human-machine interaction systems.
Beamforming is a commonly used implementation of multi-channel speech enhancement. Beamformers can be divided into fixed and adaptive beamformers according to whether their coefficients are adaptively adjusted to the acquired data. A fixed beamformer generally assumes that the noise field follows some particular spatial distribution and designs an optimal beamformer for that noise field. It works well when the actual noise field satisfies the assumed spatial distribution, but when it does not, which is often the case in practice, the noise reduction of fixed beamforming deteriorates. Compared with a fixed beamformer, an adaptive beamformer automatically adjusts its coefficients as the noise field in the environment changes and can, in theory, achieve better noise reduction. Many adaptive beamformer designs require an accurate estimate of the noise covariance matrix, and the quality of that estimate directly determines the amount of residual noise in the output signal and the degree of distortion of the desired speech.
At present, the noise covariance matrix is mainly estimated by probability-weighted recursive smoothing: the smoothing factor of the noise covariance estimate is adjusted in real time according to the speech presence probability, which in turn allows the noise covariance matrix to be updated in real time. There are many ways to compute the speech presence probability, for example estimating it directly by threshold mapping of the inter-channel level difference (ILD) or inter-channel phase difference (IPD), or converting the noise covariance matrix estimation into a single-channel noise covariance estimation problem by exploiting properties of the noise field (for example, assuming the spatial characteristics of a diffuse noise field). Many studies also compute the speech presence probability under a binary hypothesis model. This type of method assumes that at a given time there are only two possibilities: the mixed signal contains only noise, or it contains both noise and speech. By assuming that the collected noise and speech signals obey specific probability distributions, a closed-form expression for the corresponding posterior speech presence probability can be obtained. However, this probabilistic model requires an estimate of the prior speech presence probability. Existing methods compute the prior speech presence probability directly with a smoothed estimator, and the result suffers from an estimation "tailing effect": the computed prior speech presence probability cannot decay quickly to a small value for some time after the speech has ended. This slows the update of the noise covariance matrix and thereby degrades the noise reduction performance of the beamformer.
Disclosure of Invention
The invention aims to solve the problem that existing speech signal enhancement methods compute the prior speech presence probability with a smoothed estimator, so that the computed value cannot decay quickly to a small value for some time after speech ends. This estimation tailing effect prevents the optimal noise reduction effect from being achieved. The invention therefore provides a multi-channel speech signal enhancement method and apparatus, a computer device, and a computer-readable storage medium.
To achieve the above object, the present invention provides a method for enhancing a multi-channel speech signal, comprising:
step 1) performing a short-time Fourier transform on the time-domain signals of multiple channels collected by a microphone array to obtain the corresponding time-frequency domain signals;
step 2) estimating the prior speech presence probability and calculating a noise covariance matrix;
step 3) constructing an adaptive beamformer from the calculated noise covariance matrix and spatially filtering the collected time-frequency domain multi-channel signals to obtain an estimated time-frequency domain speech signal;
step 4) performing an inverse short-time Fourier transform on the estimated time-frequency domain speech signal to obtain the estimated time-domain speech signal.
Further, step 2) specifically comprises: estimating the prior speech presence probability using an instantaneous estimator and its frequency-domain smoothed values, calculating the noise covariance matrix, and using probability weighting to obtain the estimate $\hat{\mathbf{\Phi}}_{v}(l,k)$ of the noise covariance matrix at each time-frequency point.
Step 2) specifically comprises the following steps:
step 201) calculating an estimate of the instantaneous signal-to-noise ratio γ(l,k);
step 202) smoothing the estimated instantaneous signal-to-noise ratio γ(l,k) over frequency;
step 203) estimating the prior speech presence probability;
step 204) calculating the posterior speech presence probability from the estimated prior speech presence probability and estimating the noise covariance matrix;
step 205) iterating the estimation to obtain a refined noise covariance matrix estimate.
Further, the estimate of the instantaneous signal-to-noise ratio γ(l,k) in step 201) is calculated as

$$\gamma(l,k)=\frac{\hat{\phi}_{x}(l,k)}{\hat{\phi}_{v}(l,k)},$$

where $\hat{\phi}_{x}(l,k)$ and $\hat{\phi}_{v}(l,k)$ denote the estimated instantaneous speech energy and noise power spectral density, respectively, calculated as

$$\hat{\phi}_{x}(l,k)=\left|\mathbf{h}^{H}(l,k)\,\mathbf{Y}(l,k)\right|^{2},\qquad
\hat{\phi}_{v}(l,k)=\mathbf{h}^{H}(l,k)\,\hat{\mathbf{\Phi}}_{v}(l-1,k)\,\mathbf{h}(l,k),$$

where l is the frame index of the time-frequency domain, k is the frequency index of the time-frequency domain, $\mathbf{Y}(l,k)=[Y_{1}(l,k),\ldots,Y_{M}(l,k)]^{T}$ is the vector of time-frequency domain microphone signals, $\mathbf{h}(l,k)=[h_{1}(l,k),\ldots,h_{M}(l,k)]^{T}$ is the beamformer used at time l, $\hat{\phi}_{v}(l,k)$ is the smoothed noise power spectral density at time l, and $\hat{\mathbf{\Phi}}_{v}(l,k)$ is the estimate of the noise covariance matrix at time l.
Step 202) smooths the estimated instantaneous signal-to-noise ratio γ(l,k) over three frequency-axis ranges, giving a smoothing over a few neighbouring frequency bins, a smoothing over many neighbouring frequency bins, and a smoothing over all frequencies:

$$\gamma_{loc}(l,k)=\sum_{i=-K_{loc}}^{K_{loc}}W(i)\,\gamma(l,k-i),\qquad
\gamma_{glo}(l,k)=\sum_{i=-K_{glo}}^{K_{glo}}W(i)\,\gamma(l,k-i),\qquad
\gamma_{fra}(l,k)=\frac{1}{\bar{K}}\sum_{k'=1}^{\bar{K}}\gamma(l,k'),$$

where W(·) is a smoothing window, $K_{loc}$ and $K_{glo}$ are half the window lengths of the local and wide smoothing windows, respectively, and $\bar{K}=\lfloor K/2\rfloor+1$ is the number of retained frequency bins.
Step 203) estimates the prior speech presence probability. Threshold mapping of the three smoothed signal-to-noise ratios yields three prior speech presence probabilities $p_{loc}(l,k)$, $p_{glo}(l,k)$ and $p_{fra}(l,k)$. The same mapping is used for $\gamma_{loc}(l,k)$ and $\gamma_{glo}(l,k)$, with the parameter a taking the value 316 and the parameter b taking the value 2.5. The threshold mapping for $\gamma_{fra}(l,k)$ combines, with a logical AND (&), comparisons of the smoothed signal-to-noise ratio over a low-frequency band delimited by $K_{1},K_{2}$ and a mid-to-high-frequency band delimited by $K_{3},K_{4}$, both of which are set manually, against the corresponding thresholds. The prior speech presence probability p(l,k) is then computed by combining $p_{loc}(l,k)$, $p_{glo}(l,k)$ and $p_{fra}(l,k)$.
the step 204) calculates the posterior speech existence probability by using the estimated prior speech existence probability, and estimates the noise covariance matrix, wherein the specific calculation method comprises the following steps:
the posterior voice existence probability calculation formula:
Figure BDA0003594520860000043
wherein Y (l, k) ═ Y 1 (l,k),...,Y M (l,k)] T ,
Figure BDA0003594520860000044
Noise covariance matrix
Figure BDA0003594520860000045
Obtained from the iterative smoothing estimation described below:
Figure BDA0003594520860000046
Figure BDA0003594520860000047
wherein,
Figure BDA0003594520860000048
being a time-varying smoothing factor, alpha v For a fixed smoothing factor, the speech covariance matrix
Figure BDA0003594520860000049
Obtained by the following calculation:
Figure BDA00035945208600000410
Figure BDA00035945208600000411
wherein,
Figure BDA00035945208600000412
is a covariance matrix of the noisy signal, alpha y For which a corresponding fixed smoothing factor is estimated.
Step 205) iterates the estimation to obtain a refined noise covariance matrix estimate. Steps 201) to 204) are repeated, with the noise covariance matrix estimate used in each formula uniformly replaced by the noise covariance matrix obtained in the previous iteration. The beamformer h(0,k) at the initial time may be set according to the direction information of the desired speech signal, for example as a classical delay-and-sum beamformer. The noise power spectral density at the initial time can be estimated directly from the initial silent segment (i.e., the part without speech) of the collected data.
Step 3) specifically comprises: constructing an adaptive beamformer from the estimated noise covariance matrix. The adaptive beamformer is expressed as

$$\mathbf{h}(l,k)=\frac{\bigl(\hat{\mathbf{\Phi}}_{v}^{-1}(l,k)\,\hat{\mathbf{\Phi}}_{y}(l,k)-\mathbf{I}_{M}\bigr)\,\mathbf{u}}{\alpha+\operatorname{tr}\bigl\{\hat{\mathbf{\Phi}}_{v}^{-1}(l,k)\,\hat{\mathbf{\Phi}}_{y}(l,k)-\mathbf{I}_{M}\bigr\}},$$

where $\mathbf{I}_{M}$ is the M × M identity matrix, $\mathbf{u}$ is the first column of $\mathbf{I}_{M}$, and α is a parameter that adjusts the amount of noise reduction of the beamformer, with a value range of 0 to 1.
The time-frequency domain estimate of the speech signal is

$$\hat{X}(l,k)=\mathbf{h}^{H}(l,k)\,\mathbf{Y}(l,k).$$
the present invention also provides a multi-channel speech signal enhancement apparatus, comprising:
the short-time Fourier transform module is used for carrying out short-time Fourier transform on the time domain signals of the channels collected by the microphone array to obtain corresponding time-frequency domain signals;
the noise covariance matrix estimation module is used for estimating the prior speech existence probability and calculating a noise covariance matrix;
the adaptive beam forming module is used for constructing an adaptive beam former by utilizing the noise covariance matrix obtained by calculation, and carrying out spatial filtering on the collected time-frequency domain multi-channel signals to obtain estimated time-frequency domain voice signals;
and the short-time Fourier inverse transformation module is used for carrying out short-time Fourier inverse transformation on the estimated time-frequency domain voice signal to obtain an estimated time-domain voice signal.
The invention also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any one of claims 1 to 9 when executing the computer program.
The invention also provides a computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to carry out the method according to any one of claims 1 to 9.
The multi-channel speech signal enhancement method and apparatus, computer device, and computer-readable storage medium provided by the invention have the following advantages:
1. The method estimates the noise covariance matrix using an improved calculation of the prior speech presence probability that utilizes both the instantaneous estimator and the smoothed estimator, which effectively avoids the tailing effect in the prior probability estimation.
2. The noise covariance matrix estimation based on the improved prior speech presence probability calculation estimates the noise covariance matrix more quickly and accurately and improves the noise reduction performance.
Drawings
FIG. 1 is a schematic diagram illustrating an audio signal collection using a microphone array in an actual environment;
FIG. 2 is a flow chart of a method for multi-channel speech signal enhancement;
FIG. 3(a) is a diagram showing the prior speech presence probability calculated by the prior-art method;
FIG. 3(b) is a diagram showing the posterior speech presence probability estimated by the prior-art method;
FIG. 4(a) is a diagram showing the prior speech presence probability calculated by the method of the present invention;
FIG. 4(b) is a diagram showing the posterior speech presence probability estimated by the method of the present invention;
FIG. 5 is a block diagram of a multi-channel speech signal enhancement system.
Detailed Description
The technical solution provided by the invention is further illustrated by the following embodiments.
The invention provides a multi-channel speech signal enhancement method and system. The method comprises:
performing a short-time Fourier transform on the time-domain signals of multiple channels collected by a microphone array to obtain the corresponding time-frequency domain signals; estimating the prior speech presence probability and calculating a noise covariance matrix; constructing an adaptive beamformer from the calculated noise covariance matrix and spatially filtering the collected time-frequency domain multi-channel signals to obtain an estimated time-frequency domain speech signal; and performing an inverse short-time Fourier transform on the estimated time-frequency domain speech signal to obtain the estimated time-domain speech signal.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, when an audio signal is collected by a microphone array in a real environment, reverberation of the speaker's voice and noise signals are inevitably captured in addition to the desired speaker's signal. The adaptive beamformer system extracts the desired speech signal by linearly filtering the acquired multi-channel signals. Designing an adaptive beamformer requires an accurate estimate of the noise covariance matrix. Existing estimation methods produce a tailing effect when estimating the prior speech probability, which leaves more residual noise in the beamformer output and degrades the quality of the enhanced speech. The main reason for this phenomenon is that existing methods rely directly on smoothed estimators to compute the prior speech presence probability.
The multi-channel speech signal enhancement method provided by the invention, as shown in fig. 2, comprises:
101: step 1) short-time Fourier transform: perform a short-time Fourier transform on the time-domain signals of the channels collected by the microphone array to obtain the corresponding time-frequency domain signals.
102: step 2) noise covariance matrix estimation: estimate the prior speech presence probability, calculate the noise covariance matrix, and use probability weighting to estimate the noise covariance matrix at each time-frequency point.
103: step 3) adaptive beamforming: construct an adaptive beamformer from the estimated noise covariance matrix and spatially filter the acquired time-frequency domain multi-channel signals to obtain the estimated time-frequency domain speech signal.
104: step 4) inverse short-time Fourier transform: perform an inverse short-time Fourier transform on the estimated time-frequency domain speech signal to obtain the estimated time-domain speech signal.
101, the specific method of the short-time Fourier transform in step 1) is as follows:
perform a short-time Fourier transform on the time-domain signals of the M channels collected by the microphones to obtain the corresponding M-channel time-frequency domain signals. Let the signal collected by the m-th channel at time n be $y_{m}(n)$; the corresponding time-frequency domain signal is $Y_{m}(l,k)$, where l is the frame index of the time-frequency domain and k is the frequency index, with 1 ≤ k ≤ K and 1 ≤ l ≤ L. K corresponds to the number of points of the short-time Fourier transform and L to the number of frames after the transform. Assuming a sampling rate of 16000 Hz, a 512-point Fourier transform, a collected signal length of 1 s, and an inter-frame overlap of 75%, then K = 512 and L = (16000 Hz × 1 s − 512)/(512 × (1 − 0.75)) + 1 = 122.
Because $y_{m}(n)$ is a real-valued signal, the time-frequency domain representation obtained by the short-time Fourier transform is redundant along the frequency axis, and only half of the frequency indices are processed, i.e., $1\le k\le\lfloor K/2\rfloor+1$, where $\lfloor\cdot\rfloor$ denotes rounding down. When performing the short-time Fourier transform, the frame length must also be chosen; as a rule of thumb, it is typically between 32 ms and 64 ms.
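As a concrete check of the framing arithmetic above, the following sketch is illustrative only and not part of the patent; the function name, the Hann analysis window, and the use of numpy are assumptions. It frames each channel with 75% overlap, applies the window, and keeps the non-redundant half of the spectrum:

```python
import numpy as np

def stft_multichannel(y, n_fft=512, overlap=0.75):
    """Short-time Fourier transform of an (M, N) array of M channel signals."""
    hop = int(n_fft * (1.0 - overlap))             # 128 samples for 75% overlap
    M, N = y.shape
    L = (N - n_fft) // hop + 1                     # number of frames, cf. L = 122 in the text
    win = np.hanning(n_fft)
    frames = np.stack([y[:, l * hop: l * hop + n_fft] * win for l in range(L)], axis=1)
    return np.fft.rfft(frames, axis=-1)            # keep the K/2 + 1 non-redundant bins

# Example: M = 4 channels, 1 s at 16 kHz -> Y.shape == (4, 122, 257)
# Y = stft_multichannel(np.random.randn(4, 16000))
```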
102, step 2) the specific method for estimating the noise covariance matrix comprises the following steps:
Step 2) estimates the prior speech presence probability, calculates the noise covariance matrix, and uses probability weighting to estimate the noise covariance matrix at each time-frequency point.
Step 201) calculates an estimate of the instantaneous signal-to-noise ratio γ(l,k), i.e.

$$\gamma(l,k)=\frac{\hat{\phi}_{x}(l,k)}{\hat{\phi}_{v}(l,k)},$$

where $\hat{\phi}_{x}(l,k)$ and $\hat{\phi}_{v}(l,k)$ denote the estimated instantaneous speech energy and noise power spectral density, respectively, calculated as

$$\hat{\phi}_{x}(l,k)=\left|\mathbf{h}^{H}(l,k)\,\mathbf{Y}(l,k)\right|^{2},\qquad
\hat{\phi}_{v}(l,k)=\mathbf{h}^{H}(l,k)\,\hat{\mathbf{\Phi}}_{v}(l-1,k)\,\mathbf{h}(l,k),$$

where $\mathbf{Y}(l,k)=[Y_{1}(l,k),\ldots,Y_{M}(l,k)]^{T}$ is the vector of time-frequency domain microphone signals, $\mathbf{h}(l,k)=[h_{1}(l,k),\ldots,h_{M}(l,k)]^{T}$ is the beamformer used at time l, $\hat{\phi}_{v}(l,k)$ is the smoothed noise power spectral density at time l, and $\hat{\mathbf{\Phi}}_{v}(l,k)$ is the estimate of the noise covariance matrix at time l; how the beamformer and the noise quantities are obtained and initialized is described in steps 204) and 205) and in step 3) below.
Step 202) computes the prior speech presence probability using the estimated instantaneous signal-to-noise ratio γ(l,k). Specifically, the instantaneous signal-to-noise ratio is first smoothed over three frequency-axis ranges, giving a smoothing over a few neighbouring frequency bins, a smoothing over many neighbouring frequency bins, and a smoothing over all frequencies. The purpose of the smoothing is to exploit the correlation of the time-frequency domain signal across different frequency ranges to obtain a more accurate estimate of the prior speech presence probability. The three smoothed quantities are computed as

$$\gamma_{loc}(l,k)=\sum_{i=-K_{loc}}^{K_{loc}}W(i)\,\gamma(l,k-i),\qquad
\gamma_{glo}(l,k)=\sum_{i=-K_{glo}}^{K_{glo}}W(i)\,\gamma(l,k-i),\qquad
\gamma_{fra}(l,k)=\frac{1}{\bar{K}}\sum_{k'=1}^{\bar{K}}\gamma(l,k'),$$

where W(·) is a smoothing window, for which a normalized Hamming or Kaiser window may be chosen, $K_{loc}$ and $K_{glo}$ are half the window lengths of the local and wide smoothing windows, respectively, and $\bar{K}=\lfloor K/2\rfloor+1$ is the number of retained frequency bins. $K_{loc}$ is generally taken as 1 and $K_{glo}$ as a constant greater than 3.
Step 203) estimates the prior speech presence probability. Threshold mapping of the three smoothed signal-to-noise ratios yields three prior speech presence probabilities $p_{loc}(l,k)$, $p_{glo}(l,k)$ and $p_{fra}(l,k)$. The same mapping is used for $\gamma_{loc}(l,k)$ and $\gamma_{glo}(l,k)$; in practice its parameters may be taken as a = 316 and b = 2.5.
The threshold mapping for $\gamma_{fra}(l,k)$ combines, with a logical AND (&), comparisons of the smoothed signal-to-noise ratio over a low-frequency band delimited by $K_{1},K_{2}$ and a mid-to-high-frequency band delimited by $K_{3},K_{4}$, both set manually, against the corresponding thresholds $Th_{1}$ and $Th_{2}$. When the sampling rate is 16000 Hz, the low-frequency band may be set to 500 Hz to 2000 Hz and the mid-to-high-frequency band to 4000 Hz to 8000 Hz, with the thresholds $Th_{1}$ and $Th_{2}$ for these two bands set to 2 and 4, respectively. The prior speech presence probability p(l,k) is then computed by combining $p_{loc}(l,k)$, $p_{glo}(l,k)$ and $p_{fra}(l,k)$.
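The exact threshold-mapping formulas appear only as images in the original filing, so the sketch below is illustrative rather than a reproduction of the patent's mapping: it clips a scaled version of each smoothed SNR to [0, 1] (an assumption), applies the band test with a logical AND as described, and combines the three probabilities multiplicatively (also an assumption). The parameter names `a`, `b`, `th1`, and `th2` follow the text.

```python
import numpy as np

def map_snr_to_prob(g, a=316.0, b=2.5):
    """Illustrative threshold mapping of a smoothed SNR to a probability in [0, 1].
    The patent's exact mapping (with parameters a and b) is not reproduced here."""
    return np.clip((g - b) / a, 0.0, 1.0)

def frame_probability(gamma, fs=16000, n_fft=512, th1=2.0, th2=4.0):
    """Frame-wide test: the mean SNR in the low band AND in the mid-high band must
    exceed their thresholds (band edges follow the example values in the text)."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    low = (freqs >= 500) & (freqs <= 2000)
    high = (freqs >= 4000) & (freqs <= 8000)
    ok = (gamma[:, low].mean(axis=1) > th1) & (gamma[:, high].mean(axis=1) > th2)
    return ok.astype(float)[:, None] * np.ones((1, gamma.shape[1]))

def prior_speech_probability(g_loc, g_glo, gamma):
    """Combine the three probabilities; the multiplicative combination is an assumption."""
    p_loc = map_snr_to_prob(g_loc)
    p_glo = map_snr_to_prob(g_glo)
    p_fra = frame_probability(gamma)
    return p_loc * p_glo * p_fra
```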
Step 204) computes the posterior speech presence probability from the estimated prior speech presence probability and estimates the noise covariance matrix. It is generally assumed that the collected multi-channel speech and noise signals obey independent multivariate Gaussian distributions, under which the posterior speech presence probability can be computed effectively in closed form. The corresponding posterior speech presence probability $p_{x}(l,k)$ is expressed in terms of the prior probability p(l,k), the observation vector $\mathbf{Y}(l,k)=[Y_{1}(l,k),\ldots,Y_{M}(l,k)]^{T}$, and the estimated speech and noise covariance matrices.
After the posterior speech presence probability $p_{x}(l,k)$ at a time-frequency point (l,k) has been obtained, the noise covariance matrix $\hat{\mathbf{\Phi}}_{v}(l,k)$ is estimated by the iterative smoothing

$$\tilde{\alpha}_{v}(l,k)=\alpha_{v}+(1-\alpha_{v})\,p_{x}(l,k),\qquad
\hat{\mathbf{\Phi}}_{v}(l,k)=\tilde{\alpha}_{v}(l,k)\,\hat{\mathbf{\Phi}}_{v}(l-1,k)+\bigl(1-\tilde{\alpha}_{v}(l,k)\bigr)\,\mathbf{Y}(l,k)\mathbf{Y}^{H}(l,k),$$

where $\tilde{\alpha}_{v}(l,k)$ is a time-varying smoothing factor and $\alpha_{v}$ is a fixed smoothing factor that determines the update rate of the noise covariance matrix in the absence of the desired speech; $\alpha_{v}$ typically ranges from 0.9 to 1.
The covariance matrix of the noisy signal, $\hat{\mathbf{\Phi}}_{y}(l,k)$, is calculated as

$$\hat{\mathbf{\Phi}}_{y}(l,k)=\alpha_{y}\,\hat{\mathbf{\Phi}}_{y}(l-1,k)+(1-\alpha_{y})\,\mathbf{Y}(l,k)\mathbf{Y}^{H}(l,k),$$

where $\alpha_{y}$ is a fixed smoothing factor, also typically in the range 0.9 to 1. The covariance matrix of the corresponding speech signal is then expressed as

$$\hat{\mathbf{\Phi}}_{x}(l,k)=\hat{\mathbf{\Phi}}_{y}(l,k)-\hat{\mathbf{\Phi}}_{v}(l,k).$$
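As a concrete illustration of this step, the sketch below is not taken from the patent; the numpy usage and the helper names are assumptions. It uses the standard closed-form posterior speech presence probability for the multivariate Gaussian binary-hypothesis model described above and then applies the probability-weighted smoothing of the covariance matrices:

```python
import numpy as np

def posterior_spp(Y_lk, p_prior, Phi_v, Phi_x):
    """Posterior speech presence probability under the multichannel Gaussian
    binary-hypothesis model (standard closed form; the patent's own formula is
    given only as an image in the original filing)."""
    Phi_v_inv = np.linalg.inv(Phi_v)
    xi = np.trace(Phi_v_inv @ Phi_x).real
    beta = (Y_lk.conj() @ Phi_v_inv @ Phi_x @ Phi_v_inv @ Y_lk).real
    q = 1.0 - p_prior                                    # prior speech absence probability
    ratio = q / max(p_prior, 1e-12)
    return 1.0 / (1.0 + ratio * (1.0 + xi) * np.exp(-beta / (1.0 + xi)))

def update_covariances(Y_lk, p_post, Phi_v, Phi_y, alpha_v=0.95, alpha_y=0.95):
    """Probability-weighted recursive update of the noise, noisy, and speech
    covariance matrices at one time-frequency point."""
    outer = np.outer(Y_lk, Y_lk.conj())                  # instantaneous Y Y^H
    alpha_tilde = alpha_v + (1.0 - alpha_v) * p_post     # time-varying smoothing factor
    Phi_v = alpha_tilde * Phi_v + (1.0 - alpha_tilde) * outer
    Phi_y = alpha_y * Phi_y + (1.0 - alpha_y) * outer
    Phi_x = Phi_y - Phi_v                                # speech covariance by subtraction
    return Phi_v, Phi_y, Phi_x
```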
Step 205) iterates the estimation to obtain a refined noise covariance matrix estimate. Steps 201) to 204) are repeated, but now the noise covariance matrix estimate used in each formula is uniformly replaced by the noise covariance matrix obtained in the previous iteration. The beamformer h(0,k) at the initial time may be set according to the direction information of the desired speech signal, for example as a classical delay-and-sum beamformer. The noise power spectral density at the initial time can be estimated directly from the initial silent segment (i.e., the part without speech) of the collected data. In theory the iterative calculation can be repeated several times to improve the accuracy of the noise covariance matrix estimate; in practice, one iteration already yields an accurate estimate.
103, the specific method of the adaptive beamforming in step 3) is as follows:
an adaptive beamformer is first constructed from the estimated noise covariance matrix. Common adaptive beamformers include the multichannel Wiener filter (MWF) and the minimum variance distortionless response (MVDR) beamformer; they can be expressed in the unified parametric form

$$\mathbf{h}(l,k)=\frac{\bigl(\hat{\mathbf{\Phi}}_{v}^{-1}(l,k)\,\hat{\mathbf{\Phi}}_{y}(l,k)-\mathbf{I}_{M}\bigr)\,\mathbf{u}}{\alpha+\operatorname{tr}\bigl\{\hat{\mathbf{\Phi}}_{v}^{-1}(l,k)\,\hat{\mathbf{\Phi}}_{y}(l,k)-\mathbf{I}_{M}\bigr\}},$$

where $\mathbf{I}_{M}$ is the M × M identity matrix, $\mathbf{u}$ is the first column of $\mathbf{I}_{M}$, and α is a weighting factor that determines the noise reduction of the beamformer. When α = 0 the filter corresponds to the MVDR beamformer, when α = 1 to the standard MWF, and when α > 1 to an MWF with stronger noise reduction. The value of α can be chosen according to the actual trade-off between noise reduction and speech distortion: if low speech distortion is preferred, set α = 0; if a larger amount of noise reduction is preferred, choose α greater than 1.
With the beamformer obtained from the above formula, the time-frequency domain estimate of the speech signal is

$$\hat{X}(l,k)=\mathbf{h}^{H}(l,k)\,\mathbf{Y}(l,k).$$
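A sketch of the beamformer computation at one time-frequency point follows; it is illustrative only, and the small diagonal loading added for numerical stability and the helper name are not from the patent. With alpha = 0 it behaves like the MVDR member of the family, with alpha = 1 like the standard MWF:

```python
import numpy as np

def pmwf_beamformer(Phi_x, Phi_v, alpha=0.0, ref_mic=0):
    """Parametric multichannel Wiener filter built from the estimated speech and
    noise covariance matrices; u selects the reference microphone channel."""
    M = Phi_v.shape[0]
    load = 1e-6 * np.trace(Phi_v).real / M * np.eye(M)   # diagonal loading for stability
    A = np.linalg.solve(Phi_v + load, Phi_x)             # Phi_v^{-1} Phi_x
    u = np.eye(M)[:, ref_mic]
    return (A @ u) / (alpha + np.trace(A).real)

# X_hat = h.conj() @ Y_lk   # time-frequency estimate of the desired speech
```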
104, the specific method of the inverse short-time Fourier transform in step 4) is as follows:
the time-frequency domain speech signal $\hat{X}(l,k)$ obtained in step 3) is transformed by an inverse short-time Fourier transform to obtain the time-domain signal of the desired speech.
Considering the conjugate symmetry of the short-time Fourier transform of real signals, the time-frequency domain speech signal is first restored over the full frequency range using $\hat{X}(l,K-k+2)=\hat{X}^{*}(l,k)$ for $2\le k\le\lfloor K/2\rfloor+1$; an inverse Fourier transform and windowed overlap-add synthesis then yield the estimate $\hat{x}(n)$ of the corresponding time-domain speech signal.
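A matching synthesis sketch follows; it is again illustrative and assumes the same Hann analysis window as the STFT sketch earlier and squared-window overlap-add normalization. It transforms the enhanced half-spectrum back to the time domain:

```python
import numpy as np

def istft_singlechannel(X_hat, n_fft=512, overlap=0.75):
    """Inverse STFT by conjugate-symmetric reconstruction and windowed
    overlap-add. X_hat: (L, n_fft//2 + 1) enhanced half-spectrum."""
    hop = int(n_fft * (1.0 - overlap))
    L = X_hat.shape[0]
    win = np.hanning(n_fft)
    x = np.zeros((L - 1) * hop + n_fft)
    norm = np.zeros_like(x)
    for l in range(L):
        frame = np.fft.irfft(X_hat[l], n=n_fft)       # full spectrum restored implicitly
        x[l * hop: l * hop + n_fft] += frame * win    # windowed overlap-add synthesis
        norm[l * hop: l * hop + n_fft] += win ** 2
    return x / np.maximum(norm, 1e-12)
```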
Fig. 3(a) and fig. 3(b) show the prior and posterior speech presence probabilities estimated with the existing method, respectively; the "tailing effect" of that method is clearly visible. Compared with the existing prior speech probability calculation, the calculation proposed by the invention uses an estimate of the instantaneous signal-to-noise ratio smoothed in the frequency domain, which avoids the adverse effect that relying only on a smoothed estimator has on the update speed of the noise covariance matrix. Fig. 4(a) and fig. 4(b) show the prior and posterior speech presence probabilities computed with the proposed method; the disclosed method clearly mitigates the "tailing effect".
Finally, we further explain why the multi-channel speech enhancement method based on the improved prior speech presence probability calculation achieves a better enhancement effect. Existing noise covariance matrix estimation methods rely only on smoothed statistics when computing the prior probability, so the estimated prior speech presence probability cannot decay quickly to a small value after the speech ends, which slows the update of the noise covariance matrix. To address this, the invention estimates the prior speech presence probability with an instantaneous estimator and its frequency-domain smoothed values. After the speech ends, the estimated instantaneous signal-to-noise ratio is generally low, so the proposed estimation effectively eliminates the tailing effect of the probability estimation in the conventional method and preserves the update rate of the noise covariance matrix.
As shown in fig. 5, the present invention also provides a multi-channel speech signal enhancement system, comprising:
a short-time Fourier transform module 301, configured to transform the acquired multi-channel time-domain signals to the time-frequency domain, including framing, windowing, and Fourier transform;
a noise covariance matrix estimation module 302, configured to estimate the noise covariance matrix using the improved prior speech presence probability;
an adaptive beamforming module 303, configured to construct an adaptive beamformer from the estimated noise covariance matrix and filter the acquired time-frequency domain signals to obtain the estimated time-frequency domain speech signal;
an inverse short-time Fourier transform module 304, configured to transform the estimated time-frequency domain speech signal back to the time domain, including inverse Fourier transform, windowing, and synthesis.
The present invention also provides a computer device, comprising: at least one processor, memory, at least one network interface, and a user interface. The various components in the device are coupled together by a bus system. It will be appreciated that a bus system is used to enable communications among the components. The bus system includes a power bus, a control bus, and a status signal bus in addition to a data bus.
The user interface may include, among other things, a display, a keyboard, or a pointing device (e.g., a mouse, track ball, touch pad, or touch screen, etc.).
It will be appreciated that the memory in the embodiments disclosed herein can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. The non-volatile Memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash Memory. Volatile Memory can be Random Access Memory (RAM), which acts as external cache Memory. By way of example, and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The memory described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
In some embodiments, the memory stores elements, executable modules or data structures, or a subset thereof, or an expanded set thereof as follows: an operating system and an application program.
The operating system includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, and is used for implementing various basic services and processing hardware-based tasks. The application programs, including various application programs such as a Media Player (Media Player), a Browser (Browser), etc., are used to implement various application services. The program for implementing the method of the embodiment of the present disclosure may be included in an application program.
In the above embodiments, the processor may further be configured to, by calling a program or an instruction stored in the memory, specifically, a program or an instruction stored in the application program:
the steps of the multi-channel speech signal enhancement method are performed.
The multi-channel speech signal enhancement method may be applied in or implemented by a processor. The processor may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or instructions in the form of software. The Processor may be a general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, or discrete hardware components. The various methods, steps and logic blocks disclosed in this disclosure may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the present invention may be embodied directly in a hardware decoding processor, or in a combination of hardware and software modules within the decoding processor. The software module may be located in ram, flash memory, rom, prom, or eprom, registers, etc. storage media as is well known in the art. The storage medium is located in a memory, and a processor reads information in the memory and completes the steps of the method in combination with hardware of the processor.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. For a hardware implementation, the Processing units may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, micro-controllers, microprocessors, other electronic units configured to perform the functions described herein, or a combination thereof.
For a software implementation, the techniques of the present invention may be implemented by executing the functional blocks (e.g., procedures, functions, and so on) of the present invention. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
The present invention also provides a non-volatile storage medium for storing a computer program. The computer program may realize the respective steps of the above method when executed by a processor.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention and are not limiting. Although the present invention has been described in detail with reference to the embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (12)

1. A multi-channel speech signal enhancement method, comprising:
step 1) performing a short-time Fourier transform on the time-domain signals of multiple channels collected by a microphone array to obtain the corresponding time-frequency domain signals;
step 2) estimating the prior speech presence probability and calculating a noise covariance matrix;
step 3) constructing an adaptive beamformer from the calculated noise covariance matrix and spatially filtering the collected time-frequency domain multi-channel signals to obtain an estimated time-frequency domain speech signal;
step 4) performing an inverse short-time Fourier transform on the estimated time-frequency domain speech signal to obtain the estimated time-domain speech signal.
2. The multi-channel speech signal enhancement method of claim 1, characterized by:
step 2) estimates the prior speech presence probability using an instantaneous estimator and its frequency-domain smoothed values, calculates the noise covariance matrix, and uses probability weighting to obtain the estimate $\hat{\mathbf{\Phi}}_{v}(l,k)$ of the noise covariance matrix at each time-frequency point.
3. The multi-channel speech signal enhancement method of claim 1 or 2, characterized by:
step 2) specifically comprises the following steps:
step 201) calculating an estimate of the instantaneous signal-to-noise ratio γ(l,k);
step 202) smoothing the estimated instantaneous signal-to-noise ratio γ(l,k) over frequency;
step 203) estimating the prior speech presence probability;
step 204) calculating the posterior speech presence probability from the estimated prior speech presence probability and estimating the noise covariance matrix;
step 205) repeating steps 201) to 204) and iteratively estimating a refined noise covariance matrix estimate.
4. A multi-channel speech signal enhancement method according to claim 3, characterized by:
the estimate of the instantaneous signal-to-noise ratio γ(l,k) in step 201) is calculated as

$$\gamma(l,k)=\frac{\hat{\phi}_{x}(l,k)}{\hat{\phi}_{v}(l,k)},$$

where $\hat{\phi}_{x}(l,k)$ and $\hat{\phi}_{v}(l,k)$ denote the estimated instantaneous speech energy and noise power spectral density, respectively, calculated as

$$\hat{\phi}_{x}(l,k)=\left|\mathbf{h}^{H}(l,k)\,\mathbf{Y}(l,k)\right|^{2},\qquad
\hat{\phi}_{v}(l,k)=\mathbf{h}^{H}(l,k)\,\hat{\mathbf{\Phi}}_{v}(l-1,k)\,\mathbf{h}(l,k),$$

where the superscript H denotes the conjugate transpose of a vector (each complex element is replaced by its conjugate and the row vector is then transposed into a column vector); l is the frame index of the time-frequency domain; k is the frequency index of the time-frequency domain; $\mathbf{h}(l,k)=[h_{1}(l,k),\ldots,h_{M}(l,k)]^{T}$ is the beamformer used at time l; the superscript T denotes the vector transpose, i.e., converting a row vector into a column vector; $\hat{\phi}_{v}(l,k)$ is the smoothed noise power spectral density at time l; and $\hat{\mathbf{\Phi}}_{v}(l,k)$ is the estimate of the noise covariance matrix at time l.
5. A multi-channel speech signal enhancement method according to claim 3, characterized by:
step 202) estimates the prior speech presence probability using the estimated instantaneous signal-to-noise ratio γ(l,k) as follows:
γ(l,k) is smoothed over three frequency-axis ranges, giving a smoothing over a few neighbouring frequency bins, a smoothing over many neighbouring frequency bins, and a smoothing over all frequencies:

$$\gamma_{loc}(l,k)=\sum_{i=-K_{loc}}^{K_{loc}}W(i)\,\gamma(l,k-i),\qquad
\gamma_{glo}(l,k)=\sum_{i=-K_{glo}}^{K_{glo}}W(i)\,\gamma(l,k-i),\qquad
\gamma_{fra}(l,k)=\frac{1}{\bar{K}}\sum_{k'=1}^{\bar{K}}\gamma(l,k'),$$

where W(·) is a smoothing window, $K_{loc}$ and $K_{glo}$ are half the window lengths of the local and wide smoothing windows, respectively, and $\bar{K}=\lfloor K/2\rfloor+1$ is the number of retained frequency bins.
6. A multi-channel speech signal enhancement method according to claim 3, characterized by:
step 203) estimates the prior speech presence probability as follows:
first, threshold mapping of the three smoothed signal-to-noise ratios yields three prior speech presence probabilities $p_{loc}(l,k)$, $p_{glo}(l,k)$ and $p_{fra}(l,k)$; the same mapping is used for $\gamma_{loc}(l,k)$ and $\gamma_{glo}(l,k)$, with the parameter a taking the value 316 and the parameter b taking the value 2.5;
the threshold mapping for $\gamma_{fra}(l,k)$ combines, with a logical AND (&), comparisons of the smoothed signal-to-noise ratio over a low-frequency band delimited by $K_{1},K_{2}$ and a mid-to-high-frequency band delimited by $K_{3},K_{4}$, both of which are set manually, against the corresponding thresholds;
the prior speech presence probability p(l,k) is then computed by combining $p_{loc}(l,k)$, $p_{glo}(l,k)$ and $p_{fra}(l,k)$.
7. a multi-channel speech signal enhancement method according to claim 3, characterized in that:
step 204) calculates the posterior speech presence probability from the estimated prior speech presence probability and estimates the noise covariance matrix, as follows:
the posterior speech presence probability $p_{x}(l,k)$ is computed from the prior probability p(l,k), the observation vector $\mathbf{Y}(l,k)=[Y_{1}(l,k),\ldots,Y_{M}(l,k)]^{T}$, and the estimated speech and noise covariance matrices;
the noise covariance matrix $\hat{\mathbf{\Phi}}_{v}(l,k)$ is obtained by the iterative smoothing

$$\tilde{\alpha}_{v}(l,k)=\alpha_{v}+(1-\alpha_{v})\,p_{x}(l,k),\qquad
\hat{\mathbf{\Phi}}_{v}(l,k)=\tilde{\alpha}_{v}(l,k)\,\hat{\mathbf{\Phi}}_{v}(l-1,k)+\bigl(1-\tilde{\alpha}_{v}(l,k)\bigr)\,\mathbf{Y}(l,k)\mathbf{Y}^{H}(l,k),$$

where $\tilde{\alpha}_{v}(l,k)$ is a time-varying smoothing factor and $\alpha_{v}$ is a fixed smoothing factor; the speech covariance matrix $\hat{\mathbf{\Phi}}_{x}(l,k)$ is obtained as

$$\hat{\mathbf{\Phi}}_{y}(l,k)=\alpha_{y}\,\hat{\mathbf{\Phi}}_{y}(l-1,k)+(1-\alpha_{y})\,\mathbf{Y}(l,k)\mathbf{Y}^{H}(l,k),\qquad
\hat{\mathbf{\Phi}}_{x}(l,k)=\hat{\mathbf{\Phi}}_{y}(l,k)-\hat{\mathbf{\Phi}}_{v}(l,k),$$

where $\hat{\mathbf{\Phi}}_{y}(l,k)$ is the covariance matrix of the noisy signal and $\alpha_{y}$ is the corresponding fixed smoothing factor.
8. A multi-channel speech signal enhancement method according to claim 3, characterized by:
step 205) obtains a refined noise covariance matrix estimate by iterative estimation, as follows:
steps 201) to 204) are repeated, with the noise covariance matrix estimate used in each formula uniformly replaced by the noise covariance matrix obtained in the previous iteration;
the beamformer h(0,k) at the initial time is set according to the direction information of the desired speech signal; the noise power spectral density at the initial time is estimated directly from the initial silent segment of the collected data.
9. The multi-channel speech signal enhancement method of claim 2, characterized by:
step 3) specifically comprises: constructing an adaptive beamformer from the estimated noise covariance matrix, the adaptive beamformer being expressed as

$$\mathbf{h}(l,k)=\frac{\bigl(\hat{\mathbf{\Phi}}_{v}^{-1}(l,k)\,\hat{\mathbf{\Phi}}_{y}(l,k)-\mathbf{I}_{M}\bigr)\,\mathbf{u}}{\alpha+\operatorname{tr}\bigl\{\hat{\mathbf{\Phi}}_{v}^{-1}(l,k)\,\hat{\mathbf{\Phi}}_{y}(l,k)-\mathbf{I}_{M}\bigr\}},$$

where $\mathbf{I}_{M}$ is the M × M identity matrix, $\mathbf{u}$ is the first column of $\mathbf{I}_{M}$, and α is a parameter that adjusts the amount of noise reduction of the beamformer, with a value range of 0 to 1;
the time-frequency domain estimate of the speech signal is

$$\hat{X}(l,k)=\mathbf{h}^{H}(l,k)\,\mathbf{Y}(l,k).$$
10. a multi-channel speech signal enhancement apparatus comprising:
the short-time Fourier transform module is used for carrying out short-time Fourier transform on the time domain signals of the channels collected by the microphone array to obtain corresponding time-frequency domain signals;
the noise covariance matrix estimation module is used for estimating the prior speech existence probability and calculating a noise covariance matrix;
the adaptive beam forming module is used for constructing an adaptive beam former by utilizing the noise covariance matrix obtained by calculation, and carrying out spatial filtering on the collected time-frequency domain multi-channel signals to obtain estimated time-frequency domain voice signals;
and the short-time Fourier inverse transformation module is used for carrying out short-time Fourier inverse transformation on the estimated time-frequency domain voice signal to obtain an estimated time-domain voice signal.
11. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 9 when executing the computer program.
12. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to carry out the method according to any one of claims 1 to 9.
CN202210384863.8A 2022-04-13 2022-04-13 Multi-channel speech signal enhancement method and apparatus, computer device and storage medium Pending CN114882898A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210384863.8A CN114882898A (en) 2022-04-13 2022-04-13 Multi-channel speech signal enhancement method and apparatus, computer device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210384863.8A CN114882898A (en) 2022-04-13 2022-04-13 Multi-channel speech signal enhancement method and apparatus, computer device and storage medium

Publications (1)

Publication Number Publication Date
CN114882898A true CN114882898A (en) 2022-08-09

Family

ID=82669784

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210384863.8A Pending CN114882898A (en) 2022-04-13 2022-04-13 Multi-channel speech signal enhancement method and apparatus, computer device and storage medium

Country Status (1)

Country Link
CN (1) CN114882898A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115942194A (en) * 2022-12-08 2023-04-07 中国科学院声学研究所 Directional processing method and system for hearing rehabilitation treatment device processor

Similar Documents

Publication Publication Date Title
CN110085249B (en) Single-channel speech enhancement method of recurrent neural network based on attention gating
Yoshioka et al. Generalization of multi-channel linear prediction methods for blind MIMO impulse response shortening
KR100304666B1 (en) Speech enhancement method
US7313518B2 (en) Noise reduction method and device using two pass filtering
Mertins et al. Room impulse response shortening/reshaping with infinity-and $ p $-norm optimization
WO2020107269A1 (en) Self-adaptive speech enhancement method, and electronic device
US20120245927A1 (en) System and method for monaural audio processing based preserving speech information
US8737641B2 (en) Noise suppressor
CN111081267B (en) Multi-channel far-field speech enhancement method
CN111445919B (en) Speech enhancement method, system, electronic device, and medium incorporating AI model
CN102938254A (en) Voice signal enhancement system and method
RU2768514C2 (en) Signal processor and method for providing processed noise-suppressed audio signal with suppressed reverberation
JP5834088B2 (en) Dynamic microphone signal mixer
CN103871421A (en) Self-adaptive denoising method and system based on sub-band noise analysis
US10839820B2 (en) Voice processing method, apparatus, device and storage medium
US11373667B2 (en) Real-time single-channel speech enhancement in noisy and time-varying environments
CN109961799A (en) A kind of hearing aid multicenter voice enhancing algorithm based on Iterative Wiener Filtering
Cord-Landwehr et al. Monaural source separation: From anechoic to reverberant environments
US20200286501A1 (en) Apparatus and a method for signal enhancement
JP2011203414A (en) Noise and reverberation suppressing device and method therefor
CN114882898A (en) Multi-channel speech signal enhancement method and apparatus, computer device and storage medium
CN117219102A (en) Low-complexity voice enhancement method based on auditory perception
CN113689870A (en) Multi-channel voice enhancement method and device, terminal and readable storage medium
CN114242104A (en) Method, device and equipment for voice noise reduction and storage medium
Gui et al. Adaptive subband Wiener filtering for speech enhancement using critical-band gammatone filterbank

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination