CN114882898A - Multi-channel speech signal enhancement method and apparatus, computer device and storage medium - Google Patents

Multi-channel speech signal enhancement method and apparatus, computer device and storage medium

Info

Publication number
CN114882898A
CN114882898A (application CN202210384863.8A)
Authority
CN
China
Prior art keywords
time
covariance matrix
estimated
noise
frequency domain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210384863.8A
Other languages
Chinese (zh)
Inventor
王劲夫
杨飞然
孙国华
杨军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Original Assignee
Institute of Acoustics CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS filed Critical Institute of Acoustics CAS
Priority to CN202210384863.8A priority Critical patent/CN114882898A/en
Publication of CN114882898A publication Critical patent/CN114882898A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 Speech or audio signals analysis-synthesis techniques for redundancy reduction, using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/0212 Speech or audio signals analysis-synthesis techniques for redundancy reduction, using spectral analysis, using orthogonal transformation
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, using predictive techniques
    • G10L19/26 Pre-filtering or post-filtering
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 Processing in the time domain
    • G10L21/0232 Processing in the frequency domain
    • G10L21/0264 Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention provides a multi-channel speech signal enhancement method and system. The method comprises: performing a short-time Fourier transform on the time-domain signals of multiple channels collected by a microphone array to obtain the corresponding time-frequency domain signals; estimating the prior speech presence probability and calculating a noise covariance matrix; constructing an adaptive beamformer from the calculated noise covariance matrix and spatially filtering the collected time-frequency domain multi-channel signals to obtain an estimated time-frequency domain speech signal; and performing an inverse short-time Fourier transform on the estimated time-frequency domain speech signal to obtain the estimated time-domain speech signal. The invention effectively avoids the tailing effect in the prior probability estimation, estimates the noise covariance matrix more quickly and accurately, and improves the noise reduction performance.

Description

Multi-channel speech signal enhancement method and apparatus, computer device and storage medium
Technical Field
The present invention relates to the field of speech enhancement, and in particular, to a method and apparatus for enhancing a multi-channel speech signal, a computer device, and a storage medium.
Background
Multi-channel speech enhancement refers to extracting a desired speech signal from the multi-channel noisy signals acquired by a microphone array. Compared with single-channel speech enhancement, multi-channel speech enhancement can exploit temporal, spectral, and spatial information simultaneously and can, in theory, extract the desired speech without distortion. It plays an important role in conference systems, hearing aids, and human-machine interaction systems.
Beamforming is a commonly used implementation of multi-channel speech enhancement. Beamformers can be divided into fixed and adaptive beamformers according to whether their coefficients are adaptively adjusted to the acquired data. A fixed beamformer generally assumes that the noise field follows some particular spatial distribution and designs an optimal beamformer for that noise field. It works well when the actual noise field satisfies the assumed spatial distribution, but when it does not, which is often the case in practice, the noise reduction of fixed beamforming deteriorates. Compared with a fixed beamformer, an adaptive beamformer automatically adjusts its coefficients as the noise field in the environment changes and can, in theory, achieve better noise reduction. Many adaptive beamformer designs require an accurate estimate of the noise covariance matrix, and the quality of that estimate directly determines the amount of residual noise in the output signal and the degree of distortion of the desired speech.
At present, the noise covariance matrix is mainly estimated by probability-weighted recursive smoothing: the smoothing factor of the noise covariance estimate is adjusted in real time according to the speech presence probability, which in turn allows the noise covariance matrix to be updated in real time. There are many ways to compute the speech presence probability, for example estimating it directly by threshold mapping of the inter-channel level difference (ILD) or inter-channel phase difference (IPD), or converting the noise covariance matrix estimation into a single-channel noise covariance estimation problem by exploiting properties of the noise field (for example, assuming the spatial characteristics of a diffuse noise field). Many studies also compute the speech presence probability under a binary hypothesis model. This type of method assumes that at a given time there are only two possibilities: the mixed signal contains only noise, or it contains both noise and speech. By assuming that the collected noise and speech signals obey specific probability distributions, a closed-form expression for the corresponding posterior speech presence probability can be obtained. However, this probabilistic model requires an estimate of the prior speech presence probability. Existing methods compute the prior speech presence probability directly with a smoothed estimator, and the result suffers from an estimation "tailing effect": the computed prior speech presence probability cannot decay quickly to a small value for some time after the speech has ended. This slows the update of the noise covariance matrix and thereby degrades the noise reduction performance of the beamformer.
Disclosure of Invention
The invention aims to solve the problem that existing speech signal enhancement methods compute the prior speech presence probability with a smoothed estimator, so that the computed value cannot decay quickly to a small value for some time after speech ends. This estimation tailing effect prevents the optimal noise reduction effect from being achieved. The invention therefore provides a multi-channel speech signal enhancement method and apparatus, a computer device, and a computer-readable storage medium.
To achieve the above object, the present invention provides a method for enhancing a multi-channel speech signal, comprising:
step 1) performing a short-time Fourier transform on the time-domain signals of multiple channels collected by a microphone array to obtain the corresponding time-frequency domain signals;
step 2) estimating the prior speech presence probability and calculating a noise covariance matrix;
step 3) constructing an adaptive beamformer from the calculated noise covariance matrix and spatially filtering the collected time-frequency domain multi-channel signals to obtain an estimated time-frequency domain speech signal;
step 4) performing an inverse short-time Fourier transform on the estimated time-frequency domain speech signal to obtain the estimated time-domain speech signal.
Further, step 2) specifically comprises: estimating the prior speech presence probability using an instantaneous estimator and its frequency-domain smoothed values, calculating the noise covariance matrix, and using probability weighting to obtain the estimate $\hat{\mathbf{\Phi}}_{v}(l,k)$ of the noise covariance matrix at each time-frequency point.
Step 2) specifically comprises the following steps:
step 201) calculating an estimate of the instantaneous signal-to-noise ratio γ(l,k);
step 202) smoothing the estimated instantaneous signal-to-noise ratio γ(l,k) over frequency;
step 203) estimating the prior speech presence probability;
step 204) calculating the posterior speech presence probability from the estimated prior speech presence probability and estimating the noise covariance matrix;
step 205) iterating the estimation to obtain a refined noise covariance matrix estimate.
Further, the estimate of the instantaneous signal-to-noise ratio γ(l,k) in step 201) is calculated as

$$\gamma(l,k)=\frac{\hat{\phi}_{x}(l,k)}{\hat{\phi}_{v}(l,k)},$$

where $\hat{\phi}_{x}(l,k)$ and $\hat{\phi}_{v}(l,k)$ denote the estimated instantaneous speech energy and noise power spectral density, respectively, calculated as

$$\hat{\phi}_{x}(l,k)=\left|\mathbf{h}^{H}(l,k)\,\mathbf{Y}(l,k)\right|^{2},\qquad
\hat{\phi}_{v}(l,k)=\mathbf{h}^{H}(l,k)\,\hat{\mathbf{\Phi}}_{v}(l-1,k)\,\mathbf{h}(l,k),$$

where l is the frame index of the time-frequency domain, k is the frequency index of the time-frequency domain, $\mathbf{Y}(l,k)=[Y_{1}(l,k),\ldots,Y_{M}(l,k)]^{T}$ is the vector of time-frequency domain microphone signals, $\mathbf{h}(l,k)=[h_{1}(l,k),\ldots,h_{M}(l,k)]^{T}$ is the beamformer used at time l, $\hat{\phi}_{v}(l,k)$ is the smoothed noise power spectral density at time l, and $\hat{\mathbf{\Phi}}_{v}(l,k)$ is the estimate of the noise covariance matrix at time l.
Step 202) smooths the estimated instantaneous signal-to-noise ratio γ(l,k) over three frequency-axis ranges, giving a smoothing over a few neighbouring frequency bins, a smoothing over many neighbouring frequency bins, and a smoothing over all frequencies:

$$\gamma_{loc}(l,k)=\sum_{i=-K_{loc}}^{K_{loc}}W(i)\,\gamma(l,k-i),\qquad
\gamma_{glo}(l,k)=\sum_{i=-K_{glo}}^{K_{glo}}W(i)\,\gamma(l,k-i),\qquad
\gamma_{fra}(l,k)=\frac{1}{\bar{K}}\sum_{k'=1}^{\bar{K}}\gamma(l,k'),$$

where W(·) is a smoothing window, $K_{loc}$ and $K_{glo}$ are half the window lengths of the local and wide smoothing windows, respectively, and $\bar{K}=\lfloor K/2\rfloor+1$ is the number of retained frequency bins.
Step 203) estimates the prior speech presence probability. Threshold mapping of the three smoothed signal-to-noise ratios yields three prior speech presence probabilities $p_{loc}(l,k)$, $p_{glo}(l,k)$ and $p_{fra}(l,k)$. The same mapping is used for $\gamma_{loc}(l,k)$ and $\gamma_{glo}(l,k)$, with the parameter a taking the value 316 and the parameter b taking the value 2.5. The threshold mapping for $\gamma_{fra}(l,k)$ combines, with a logical AND (&), comparisons of the smoothed signal-to-noise ratio over a low-frequency band delimited by $K_{1},K_{2}$ and a mid-to-high-frequency band delimited by $K_{3},K_{4}$, both of which are set manually, against the corresponding thresholds. The prior speech presence probability p(l,k) is then computed by combining $p_{loc}(l,k)$, $p_{glo}(l,k)$ and $p_{fra}(l,k)$.
the step 204) calculates the posterior speech existence probability by using the estimated prior speech existence probability, and estimates the noise covariance matrix, wherein the specific calculation method comprises the following steps:
the posterior voice existence probability calculation formula:
Figure BDA0003594520860000043
wherein Y (l, k) ═ Y 1 (l,k),...,Y M (l,k)] T ,
Figure BDA0003594520860000044
Noise covariance matrix
Figure BDA0003594520860000045
Obtained from the iterative smoothing estimation described below:
Figure BDA0003594520860000046
Figure BDA0003594520860000047
wherein,
Figure BDA0003594520860000048
being a time-varying smoothing factor, alpha v For a fixed smoothing factor, the speech covariance matrix
Figure BDA0003594520860000049
Obtained by the following calculation:
Figure BDA00035945208600000410
Figure BDA00035945208600000411
wherein,
Figure BDA00035945208600000412
is a covariance matrix of the noisy signal, alpha y For which a corresponding fixed smoothing factor is estimated.
Step 205) iterates the estimation to obtain a refined noise covariance matrix estimate. Steps 201) to 204) are repeated, with the noise covariance matrix estimate used in each formula uniformly replaced by the noise covariance matrix obtained in the previous iteration. The beamformer h(0,k) at the initial time may be set according to the direction information of the desired speech signal, for example as a classical delay-and-sum beamformer. The noise power spectral density at the initial time can be estimated directly from the initial silent segment (i.e., the part without speech) of the collected data.
Step 3) specifically comprises: constructing an adaptive beamformer from the estimated noise covariance matrix. The adaptive beamformer is expressed as

$$\mathbf{h}(l,k)=\frac{\bigl(\hat{\mathbf{\Phi}}_{v}^{-1}(l,k)\,\hat{\mathbf{\Phi}}_{y}(l,k)-\mathbf{I}_{M}\bigr)\,\mathbf{u}}{\alpha+\operatorname{tr}\bigl\{\hat{\mathbf{\Phi}}_{v}^{-1}(l,k)\,\hat{\mathbf{\Phi}}_{y}(l,k)-\mathbf{I}_{M}\bigr\}},$$

where $\mathbf{I}_{M}$ is the M × M identity matrix, $\mathbf{u}$ is the first column of $\mathbf{I}_{M}$, and α is a parameter that adjusts the amount of noise reduction of the beamformer, with a value range of 0 to 1.
The time-frequency domain estimate of the speech signal is

$$\hat{X}(l,k)=\mathbf{h}^{H}(l,k)\,\mathbf{Y}(l,k).$$
the present invention also provides a multi-channel speech signal enhancement apparatus, comprising:
the short-time Fourier transform module is used for carrying out short-time Fourier transform on the time domain signals of the channels collected by the microphone array to obtain corresponding time-frequency domain signals;
the noise covariance matrix estimation module is used for estimating the prior speech existence probability and calculating a noise covariance matrix;
the adaptive beam forming module is used for constructing an adaptive beam former by utilizing the noise covariance matrix obtained by calculation, and carrying out spatial filtering on the collected time-frequency domain multi-channel signals to obtain estimated time-frequency domain voice signals;
and the short-time Fourier inverse transformation module is used for carrying out short-time Fourier inverse transformation on the estimated time-frequency domain voice signal to obtain an estimated time-domain voice signal.
The invention also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any one of claims 1 to 9 when executing the computer program.
The invention also provides a computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to carry out the method according to any one of claims 1 to 9.
The multi-channel speech signal enhancement method and apparatus, computer device, and computer-readable storage medium provided by the invention have the following advantages:
1. The method estimates the noise covariance matrix using an improved calculation of the prior speech presence probability that utilizes both the instantaneous estimator and the smoothed estimator, which effectively avoids the tailing effect in the prior probability estimation.
2. The noise covariance matrix estimation based on the improved prior speech presence probability calculation estimates the noise covariance matrix more quickly and accurately and improves the noise reduction performance.
Drawings
FIG. 1 is a schematic diagram illustrating an audio signal collection using a microphone array in an actual environment;
FIG. 2 is a flow chart of a method for multi-channel speech signal enhancement;
FIG. 3(a) is a diagram showing the prior speech presence probability calculated by the prior-art method;
FIG. 3(b) is a diagram showing the posterior speech presence probability estimated by the prior-art method;
FIG. 4(a) is a diagram showing the prior speech presence probability calculated by the method of the present invention;
FIG. 4(b) is a diagram showing the posterior speech presence probability estimated by the method of the present invention;
FIG. 5 is a block diagram of a multi-channel speech signal enhancement system.
Detailed Description
The technical solution provided by the invention is further illustrated by the following embodiments.
The invention provides a multi-channel speech signal enhancement method and system. The method comprises:
performing a short-time Fourier transform on the time-domain signals of multiple channels collected by a microphone array to obtain the corresponding time-frequency domain signals; estimating the prior speech presence probability and calculating a noise covariance matrix; constructing an adaptive beamformer from the calculated noise covariance matrix and spatially filtering the collected time-frequency domain multi-channel signals to obtain an estimated time-frequency domain speech signal; and performing an inverse short-time Fourier transform on the estimated time-frequency domain speech signal to obtain the estimated time-domain speech signal.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, when an audio signal is collected by a microphone array in a real environment, reverberation of the speaker's voice and noise signals are inevitably captured in addition to the desired speaker's signal. The adaptive beamformer system extracts the desired speech signal by linearly filtering the acquired multi-channel signals. Designing an adaptive beamformer requires an accurate estimate of the noise covariance matrix. Existing estimation methods produce a tailing effect when estimating the prior speech probability, which leaves more residual noise in the beamformer output and degrades the quality of the enhanced speech. The main reason for this phenomenon is that existing methods rely directly on smoothed estimators to compute the prior speech presence probability.
The multi-channel speech signal enhancement method provided by the invention, as shown in fig. 2, comprises:
101: step 1) short-time Fourier transform: perform a short-time Fourier transform on the time-domain signals of the channels collected by the microphone array to obtain the corresponding time-frequency domain signals.
102: step 2) noise covariance matrix estimation: estimate the prior speech presence probability, calculate the noise covariance matrix, and use probability weighting to estimate the noise covariance matrix at each time-frequency point.
103: step 3) adaptive beamforming: construct an adaptive beamformer from the estimated noise covariance matrix and spatially filter the acquired time-frequency domain multi-channel signals to obtain the estimated time-frequency domain speech signal.
104: step 4) inverse short-time Fourier transform: perform an inverse short-time Fourier transform on the estimated time-frequency domain speech signal to obtain the estimated time-domain speech signal.
101, the specific method of the short-time Fourier transform in step 1) is as follows:
perform a short-time Fourier transform on the time-domain signals of the M channels collected by the microphones to obtain the corresponding M-channel time-frequency domain signals. Let the signal collected by the m-th channel at time n be $y_{m}(n)$; the corresponding time-frequency domain signal is $Y_{m}(l,k)$, where l is the frame index of the time-frequency domain and k is the frequency index, with 1 ≤ k ≤ K and 1 ≤ l ≤ L. K corresponds to the number of points of the short-time Fourier transform and L to the number of frames after the transform. Assuming a sampling rate of 16000 Hz, a 512-point Fourier transform, a collected signal length of 1 s, and an inter-frame overlap of 75%, then K = 512 and L = (16000 Hz × 1 s − 512)/(512 × (1 − 0.75)) + 1 = 122.
Because $y_{m}(n)$ is a real-valued signal, the time-frequency domain representation obtained by the short-time Fourier transform is redundant along the frequency axis, and only half of the frequency indices are processed, i.e., $1\le k\le\lfloor K/2\rfloor+1$, where $\lfloor\cdot\rfloor$ denotes rounding down. When performing the short-time Fourier transform, the frame length must also be chosen; as a rule of thumb, it is typically between 32 ms and 64 ms.
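As a concrete check of the framing arithmetic above, the following sketch is illustrative only and not part of the patent; the function name, the Hann analysis window, and the use of numpy are assumptions. It frames each channel with 75% overlap, applies the window, and keeps the non-redundant half of the spectrum:

```python
import numpy as np

def stft_multichannel(y, n_fft=512, overlap=0.75):
    """Short-time Fourier transform of an (M, N) array of M channel signals."""
    hop = int(n_fft * (1.0 - overlap))             # 128 samples for 75% overlap
    M, N = y.shape
    L = (N - n_fft) // hop + 1                     # number of frames, cf. L = 122 in the text
    win = np.hanning(n_fft)
    frames = np.stack([y[:, l * hop: l * hop + n_fft] * win for l in range(L)], axis=1)
    return np.fft.rfft(frames, axis=-1)            # keep the K/2 + 1 non-redundant bins

# Example: M = 4 channels, 1 s at 16 kHz -> Y.shape == (4, 122, 257)
# Y = stft_multichannel(np.random.randn(4, 16000))
```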
102, step 2) the specific method for estimating the noise covariance matrix comprises the following steps:
Step 2) estimates the prior speech presence probability, calculates the noise covariance matrix, and uses probability weighting to estimate the noise covariance matrix at each time-frequency point.
Step 201) calculates an estimate of the instantaneous signal-to-noise ratio γ(l,k), i.e.

$$\gamma(l,k)=\frac{\hat{\phi}_{x}(l,k)}{\hat{\phi}_{v}(l,k)},$$

where $\hat{\phi}_{x}(l,k)$ and $\hat{\phi}_{v}(l,k)$ denote the estimated instantaneous speech energy and noise power spectral density, respectively, calculated as

$$\hat{\phi}_{x}(l,k)=\left|\mathbf{h}^{H}(l,k)\,\mathbf{Y}(l,k)\right|^{2},\qquad
\hat{\phi}_{v}(l,k)=\mathbf{h}^{H}(l,k)\,\hat{\mathbf{\Phi}}_{v}(l-1,k)\,\mathbf{h}(l,k),$$

where $\mathbf{Y}(l,k)=[Y_{1}(l,k),\ldots,Y_{M}(l,k)]^{T}$ is the vector of time-frequency domain microphone signals, $\mathbf{h}(l,k)=[h_{1}(l,k),\ldots,h_{M}(l,k)]^{T}$ is the beamformer used at time l, $\hat{\phi}_{v}(l,k)$ is the smoothed noise power spectral density at time l, and $\hat{\mathbf{\Phi}}_{v}(l,k)$ is the estimate of the noise covariance matrix at time l; how the beamformer and the noise quantities are obtained and initialized is described in steps 204) and 205) and in step 3) below.
Step 202) computes the prior speech presence probability using the estimated instantaneous signal-to-noise ratio γ(l,k). Specifically, the instantaneous signal-to-noise ratio is first smoothed over three frequency-axis ranges, giving a smoothing over a few neighbouring frequency bins, a smoothing over many neighbouring frequency bins, and a smoothing over all frequencies. The purpose of the smoothing is to exploit the correlation of the time-frequency domain signal across different frequency ranges to obtain a more accurate estimate of the prior speech presence probability. The three smoothed quantities are computed as

$$\gamma_{loc}(l,k)=\sum_{i=-K_{loc}}^{K_{loc}}W(i)\,\gamma(l,k-i),\qquad
\gamma_{glo}(l,k)=\sum_{i=-K_{glo}}^{K_{glo}}W(i)\,\gamma(l,k-i),\qquad
\gamma_{fra}(l,k)=\frac{1}{\bar{K}}\sum_{k'=1}^{\bar{K}}\gamma(l,k'),$$

where W(·) is a smoothing window, for which a normalized Hamming or Kaiser window may be chosen, $K_{loc}$ and $K_{glo}$ are half the window lengths of the local and wide smoothing windows, respectively, and $\bar{K}=\lfloor K/2\rfloor+1$ is the number of retained frequency bins. $K_{loc}$ is generally taken as 1 and $K_{glo}$ as a constant greater than 3.
Step 203) estimates the prior speech presence probability. Threshold mapping of the three smoothed signal-to-noise ratios yields three prior speech presence probabilities $p_{loc}(l,k)$, $p_{glo}(l,k)$ and $p_{fra}(l,k)$. The same mapping is used for $\gamma_{loc}(l,k)$ and $\gamma_{glo}(l,k)$; in practice its parameters may be taken as a = 316 and b = 2.5.
The threshold mapping for $\gamma_{fra}(l,k)$ combines, with a logical AND (&), comparisons of the smoothed signal-to-noise ratio over a low-frequency band delimited by $K_{1},K_{2}$ and a mid-to-high-frequency band delimited by $K_{3},K_{4}$, both set manually, against the corresponding thresholds $Th_{1}$ and $Th_{2}$. When the sampling rate is 16000 Hz, the low-frequency band may be set to 500 Hz to 2000 Hz and the mid-to-high-frequency band to 4000 Hz to 8000 Hz, with the thresholds $Th_{1}$ and $Th_{2}$ for these two bands set to 2 and 4, respectively. The prior speech presence probability p(l,k) is then computed by combining $p_{loc}(l,k)$, $p_{glo}(l,k)$ and $p_{fra}(l,k)$.
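The exact threshold-mapping formulas appear only as images in the original filing, so the sketch below is illustrative rather than a reproduction of the patent's mapping: it clips a scaled version of each smoothed SNR to [0, 1] (an assumption), applies the band test with a logical AND as described, and combines the three probabilities multiplicatively (also an assumption). The parameter names `a`, `b`, `th1`, and `th2` follow the text.

```python
import numpy as np

def map_snr_to_prob(g, a=316.0, b=2.5):
    """Illustrative threshold mapping of a smoothed SNR to a probability in [0, 1].
    The patent's exact mapping (with parameters a and b) is not reproduced here."""
    return np.clip((g - b) / a, 0.0, 1.0)

def frame_probability(gamma, fs=16000, n_fft=512, th1=2.0, th2=4.0):
    """Frame-wide test: the mean SNR in the low band AND in the mid-high band must
    exceed their thresholds (band edges follow the example values in the text)."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    low = (freqs >= 500) & (freqs <= 2000)
    high = (freqs >= 4000) & (freqs <= 8000)
    ok = (gamma[:, low].mean(axis=1) > th1) & (gamma[:, high].mean(axis=1) > th2)
    return ok.astype(float)[:, None] * np.ones((1, gamma.shape[1]))

def prior_speech_probability(g_loc, g_glo, gamma):
    """Combine the three probabilities; the multiplicative combination is an assumption."""
    p_loc = map_snr_to_prob(g_loc)
    p_glo = map_snr_to_prob(g_glo)
    p_fra = frame_probability(gamma)
    return p_loc * p_glo * p_fra
```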
Step 204) computes the posterior speech presence probability from the estimated prior speech presence probability and estimates the noise covariance matrix. It is generally assumed that the collected multi-channel speech and noise signals obey independent multivariate Gaussian distributions, under which the posterior speech presence probability can be computed effectively in closed form. The corresponding posterior speech presence probability $p_{x}(l,k)$ is expressed in terms of the prior probability p(l,k), the observation vector $\mathbf{Y}(l,k)=[Y_{1}(l,k),\ldots,Y_{M}(l,k)]^{T}$, and the estimated speech and noise covariance matrices.
After the posterior speech presence probability $p_{x}(l,k)$ at a time-frequency point (l,k) has been obtained, the noise covariance matrix $\hat{\mathbf{\Phi}}_{v}(l,k)$ is estimated by the iterative smoothing

$$\tilde{\alpha}_{v}(l,k)=\alpha_{v}+(1-\alpha_{v})\,p_{x}(l,k),\qquad
\hat{\mathbf{\Phi}}_{v}(l,k)=\tilde{\alpha}_{v}(l,k)\,\hat{\mathbf{\Phi}}_{v}(l-1,k)+\bigl(1-\tilde{\alpha}_{v}(l,k)\bigr)\,\mathbf{Y}(l,k)\mathbf{Y}^{H}(l,k),$$

where $\tilde{\alpha}_{v}(l,k)$ is a time-varying smoothing factor and $\alpha_{v}$ is a fixed smoothing factor that determines the update rate of the noise covariance matrix in the absence of the desired speech; $\alpha_{v}$ typically ranges from 0.9 to 1.
The covariance matrix of the noisy signal, $\hat{\mathbf{\Phi}}_{y}(l,k)$, is calculated as

$$\hat{\mathbf{\Phi}}_{y}(l,k)=\alpha_{y}\,\hat{\mathbf{\Phi}}_{y}(l-1,k)+(1-\alpha_{y})\,\mathbf{Y}(l,k)\mathbf{Y}^{H}(l,k),$$

where $\alpha_{y}$ is a fixed smoothing factor, also typically in the range 0.9 to 1. The covariance matrix of the corresponding speech signal is then expressed as

$$\hat{\mathbf{\Phi}}_{x}(l,k)=\hat{\mathbf{\Phi}}_{y}(l,k)-\hat{\mathbf{\Phi}}_{v}(l,k).$$
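As a concrete illustration of this step, the sketch below is not taken from the patent; the numpy usage and the helper names are assumptions. It uses the standard closed-form posterior speech presence probability for the multivariate Gaussian binary-hypothesis model described above and then applies the probability-weighted smoothing of the covariance matrices:

```python
import numpy as np

def posterior_spp(Y_lk, p_prior, Phi_v, Phi_x):
    """Posterior speech presence probability under the multichannel Gaussian
    binary-hypothesis model (standard closed form; the patent's own formula is
    given only as an image in the original filing)."""
    Phi_v_inv = np.linalg.inv(Phi_v)
    xi = np.trace(Phi_v_inv @ Phi_x).real
    beta = (Y_lk.conj() @ Phi_v_inv @ Phi_x @ Phi_v_inv @ Y_lk).real
    q = 1.0 - p_prior                                    # prior speech absence probability
    ratio = q / max(p_prior, 1e-12)
    return 1.0 / (1.0 + ratio * (1.0 + xi) * np.exp(-beta / (1.0 + xi)))

def update_covariances(Y_lk, p_post, Phi_v, Phi_y, alpha_v=0.95, alpha_y=0.95):
    """Probability-weighted recursive update of the noise, noisy, and speech
    covariance matrices at one time-frequency point."""
    outer = np.outer(Y_lk, Y_lk.conj())                  # instantaneous Y Y^H
    alpha_tilde = alpha_v + (1.0 - alpha_v) * p_post     # time-varying smoothing factor
    Phi_v = alpha_tilde * Phi_v + (1.0 - alpha_tilde) * outer
    Phi_y = alpha_y * Phi_y + (1.0 - alpha_y) * outer
    Phi_x = Phi_y - Phi_v                                # speech covariance by subtraction
    return Phi_v, Phi_y, Phi_x
```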
Step 205) iterates the estimation to obtain a refined noise covariance matrix estimate. Steps 201) to 204) are repeated, but now the noise covariance matrix estimate used in each formula is uniformly replaced by the noise covariance matrix obtained in the previous iteration. The beamformer h(0,k) at the initial time may be set according to the direction information of the desired speech signal, for example as a classical delay-and-sum beamformer. The noise power spectral density at the initial time can be estimated directly from the initial silent segment (i.e., the part without speech) of the collected data. In theory the iterative calculation can be repeated several times to improve the accuracy of the noise covariance matrix estimate; in practice, one iteration already yields an accurate estimate.
103, the specific method of the adaptive beamforming in step 3) is as follows:
an adaptive beamformer is first constructed from the estimated noise covariance matrix. Common adaptive beamformers include the multichannel Wiener filter (MWF) and the minimum variance distortionless response (MVDR) beamformer; they can be expressed in the unified parametric form

$$\mathbf{h}(l,k)=\frac{\bigl(\hat{\mathbf{\Phi}}_{v}^{-1}(l,k)\,\hat{\mathbf{\Phi}}_{y}(l,k)-\mathbf{I}_{M}\bigr)\,\mathbf{u}}{\alpha+\operatorname{tr}\bigl\{\hat{\mathbf{\Phi}}_{v}^{-1}(l,k)\,\hat{\mathbf{\Phi}}_{y}(l,k)-\mathbf{I}_{M}\bigr\}},$$

where $\mathbf{I}_{M}$ is the M × M identity matrix, $\mathbf{u}$ is the first column of $\mathbf{I}_{M}$, and α is a weighting factor that determines the noise reduction of the beamformer. When α = 0 the filter corresponds to the MVDR beamformer, when α = 1 to the standard MWF, and when α > 1 to an MWF with stronger noise reduction. The value of α can be chosen according to the actual trade-off between noise reduction and speech distortion: if low speech distortion is preferred, set α = 0; if a larger amount of noise reduction is preferred, choose α greater than 1.
With the beamformer obtained from the above formula, the time-frequency domain estimate of the speech signal is

$$\hat{X}(l,k)=\mathbf{h}^{H}(l,k)\,\mathbf{Y}(l,k).$$
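A sketch of the beamformer computation at one time-frequency point follows; it is illustrative only, and the small diagonal loading added for numerical stability and the helper name are not from the patent. With alpha = 0 it behaves like the MVDR member of the family, with alpha = 1 like the standard MWF:

```python
import numpy as np

def pmwf_beamformer(Phi_x, Phi_v, alpha=0.0, ref_mic=0):
    """Parametric multichannel Wiener filter built from the estimated speech and
    noise covariance matrices; u selects the reference microphone channel."""
    M = Phi_v.shape[0]
    load = 1e-6 * np.trace(Phi_v).real / M * np.eye(M)   # diagonal loading for stability
    A = np.linalg.solve(Phi_v + load, Phi_x)             # Phi_v^{-1} Phi_x
    u = np.eye(M)[:, ref_mic]
    return (A @ u) / (alpha + np.trace(A).real)

# X_hat = h.conj() @ Y_lk   # time-frequency estimate of the desired speech
```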
104, the specific method of the inverse short-time Fourier transform in step 4) is as follows:
the time-frequency domain speech signal $\hat{X}(l,k)$ obtained in step 3) is transformed by an inverse short-time Fourier transform to obtain the time-domain signal of the desired speech.
Considering the conjugate symmetry of the short-time Fourier transform of real signals, the time-frequency domain speech signal is first restored over the full frequency range using $\hat{X}(l,K-k+2)=\hat{X}^{*}(l,k)$ for $2\le k\le\lfloor K/2\rfloor+1$; an inverse Fourier transform and windowed overlap-add synthesis then yield the estimate $\hat{x}(n)$ of the corresponding time-domain speech signal.
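A matching synthesis sketch follows; it is again illustrative and assumes the same Hann analysis window as the STFT sketch earlier and squared-window overlap-add normalization. It transforms the enhanced half-spectrum back to the time domain:

```python
import numpy as np

def istft_singlechannel(X_hat, n_fft=512, overlap=0.75):
    """Inverse STFT by conjugate-symmetric reconstruction and windowed
    overlap-add. X_hat: (L, n_fft//2 + 1) enhanced half-spectrum."""
    hop = int(n_fft * (1.0 - overlap))
    L = X_hat.shape[0]
    win = np.hanning(n_fft)
    x = np.zeros((L - 1) * hop + n_fft)
    norm = np.zeros_like(x)
    for l in range(L):
        frame = np.fft.irfft(X_hat[l], n=n_fft)       # full spectrum restored implicitly
        x[l * hop: l * hop + n_fft] += frame * win    # windowed overlap-add synthesis
        norm[l * hop: l * hop + n_fft] += win ** 2
    return x / np.maximum(norm, 1e-12)
```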
Fig. 3(a) and fig. 3(b) show the prior and posterior speech presence probabilities estimated with the existing method, respectively; the "tailing effect" of that method is clearly visible. Compared with the existing prior speech probability calculation, the calculation proposed by the invention uses an estimate of the instantaneous signal-to-noise ratio smoothed in the frequency domain, which avoids the adverse effect that relying only on a smoothed estimator has on the update speed of the noise covariance matrix. Fig. 4(a) and fig. 4(b) show the prior and posterior speech presence probabilities computed with the proposed method; the disclosed method clearly mitigates the "tailing effect".
Finally, we further explain why the multi-channel speech enhancement method based on the improved prior speech presence probability calculation achieves a better enhancement effect. Existing noise covariance matrix estimation methods rely only on smoothed statistics when computing the prior probability, so the estimated prior speech presence probability cannot decay quickly to a small value after the speech ends, which slows the update of the noise covariance matrix. To address this, the invention estimates the prior speech presence probability with an instantaneous estimator and its frequency-domain smoothed values. After the speech ends, the estimated instantaneous signal-to-noise ratio is generally low, so the proposed estimation effectively eliminates the tailing effect of the probability estimation in the conventional method and preserves the update rate of the noise covariance matrix.
As shown in fig. 5, the present invention also provides a multi-channel speech signal enhancement system, comprising:
a short-time Fourier transform module 301, configured to transform the acquired multi-channel time-domain signals to the time-frequency domain, including framing, windowing, and Fourier transform;
a noise covariance matrix estimation module 302, configured to estimate the noise covariance matrix using the improved prior speech presence probability;
an adaptive beamforming module 303, configured to construct an adaptive beamformer from the estimated noise covariance matrix and filter the acquired time-frequency domain signals to obtain the estimated time-frequency domain speech signal;
an inverse short-time Fourier transform module 304, configured to transform the estimated time-frequency domain speech signal back to the time domain, including inverse Fourier transform, windowing, and synthesis.
The present invention also provides a computer device, comprising: at least one processor, memory, at least one network interface, and a user interface. The various components in the device are coupled together by a bus system. It will be appreciated that a bus system is used to enable communications among the components. The bus system includes a power bus, a control bus, and a status signal bus in addition to a data bus.
The user interface may include, among other things, a display, a keyboard, or a pointing device (e.g., a mouse, track ball, touch pad, or touch screen, etc.).
It will be appreciated that the memory in the embodiments disclosed herein can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. The non-volatile Memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash Memory. Volatile Memory can be Random Access Memory (RAM), which acts as external cache Memory. By way of example, and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The memory described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
In some embodiments, the memory stores elements, executable modules or data structures, or a subset thereof, or an expanded set thereof as follows: an operating system and an application program.
The operating system includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, and is used for implementing various basic services and processing hardware-based tasks. The application programs, including various application programs such as a Media Player (Media Player), a Browser (Browser), etc., are used to implement various application services. The program for implementing the method of the embodiment of the present disclosure may be included in an application program.
In the above embodiments, the processor may further be configured to, by calling a program or an instruction stored in the memory, specifically, a program or an instruction stored in the application program:
the steps of the multi-channel speech signal enhancement method are performed.
The multi-channel speech signal enhancement method may be applied in or implemented by a processor. The processor may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or instructions in the form of software. The Processor may be a general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, or discrete hardware components. The various methods, steps and logic blocks disclosed in this disclosure may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the present invention may be embodied directly in a hardware decoding processor, or in a combination of hardware and software modules within the decoding processor. The software module may be located in ram, flash memory, rom, prom, or eprom, registers, etc. storage media as is well known in the art. The storage medium is located in a memory, and a processor reads information in the memory and completes the steps of the method in combination with hardware of the processor.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. For a hardware implementation, the Processing units may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, micro-controllers, microprocessors, other electronic units configured to perform the functions described herein, or a combination thereof.
For a software implementation, the techniques of the present invention may be implemented by executing the functional blocks (e.g., procedures, functions, and so on) of the present invention. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
The present invention also provides a non-volatile storage medium for storing a computer program. The computer program may realize the respective steps of the above method when executed by a processor.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention and are not limiting. Although the present invention has been described in detail with reference to the embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (12)

1. A multi-channel speech signal enhancement method, comprising:
step 1) performing a short-time Fourier transform on the time-domain signals of multiple channels collected by a microphone array to obtain the corresponding time-frequency domain signals;
step 2) estimating the prior speech presence probability and calculating a noise covariance matrix;
step 3) constructing an adaptive beamformer from the calculated noise covariance matrix and spatially filtering the collected time-frequency domain multi-channel signals to obtain an estimated time-frequency domain speech signal;
step 4) performing an inverse short-time Fourier transform on the estimated time-frequency domain speech signal to obtain the estimated time-domain speech signal.
2. The multi-channel speech signal enhancement method of claim 1, characterized by:
step 2) estimates the prior speech presence probability using an instantaneous estimator and its frequency-domain smoothed values, calculates the noise covariance matrix, and uses probability weighting to obtain the estimate $\hat{\mathbf{\Phi}}_{v}(l,k)$ of the noise covariance matrix at each time-frequency point.
3. The multi-channel speech signal enhancement method of claim 1 or 2, characterized by:
step 2) specifically comprises the following steps:
step 201) calculating an estimate of the instantaneous signal-to-noise ratio γ(l,k);
step 202) smoothing the estimated instantaneous signal-to-noise ratio γ(l,k) over frequency;
step 203) estimating the prior speech presence probability;
step 204) calculating the posterior speech presence probability from the estimated prior speech presence probability and estimating the noise covariance matrix;
step 205) repeating steps 201) to 204) and iteratively estimating a refined noise covariance matrix estimate.
4. A multi-channel speech signal enhancement method according to claim 3, characterized by:
the estimate of the instantaneous signal-to-noise ratio γ(l,k) in step 201) is calculated as

$$\gamma(l,k)=\frac{\hat{\phi}_{x}(l,k)}{\hat{\phi}_{v}(l,k)},$$

where $\hat{\phi}_{x}(l,k)$ and $\hat{\phi}_{v}(l,k)$ denote the estimated instantaneous speech energy and noise power spectral density, respectively, calculated as

$$\hat{\phi}_{x}(l,k)=\left|\mathbf{h}^{H}(l,k)\,\mathbf{Y}(l,k)\right|^{2},\qquad
\hat{\phi}_{v}(l,k)=\mathbf{h}^{H}(l,k)\,\hat{\mathbf{\Phi}}_{v}(l-1,k)\,\mathbf{h}(l,k),$$

where the superscript H denotes the conjugate transpose of a vector (each complex element is replaced by its conjugate and the row vector is then transposed into a column vector); l is the frame index of the time-frequency domain; k is the frequency index of the time-frequency domain; $\mathbf{h}(l,k)=[h_{1}(l,k),\ldots,h_{M}(l,k)]^{T}$ is the beamformer used at time l; the superscript T denotes the vector transpose, i.e., converting a row vector into a column vector; $\hat{\phi}_{v}(l,k)$ is the smoothed noise power spectral density at time l; and $\hat{\mathbf{\Phi}}_{v}(l,k)$ is the estimate of the noise covariance matrix at time l.
5. A multi-channel speech signal enhancement method according to claim 3, characterized by:
step 202) estimates the prior speech presence probability using the estimated instantaneous signal-to-noise ratio γ(l,k) as follows:
γ(l,k) is smoothed over three frequency-axis ranges, giving a smoothing over a few neighbouring frequency bins, a smoothing over many neighbouring frequency bins, and a smoothing over all frequencies:

$$\gamma_{loc}(l,k)=\sum_{i=-K_{loc}}^{K_{loc}}W(i)\,\gamma(l,k-i),\qquad
\gamma_{glo}(l,k)=\sum_{i=-K_{glo}}^{K_{glo}}W(i)\,\gamma(l,k-i),\qquad
\gamma_{fra}(l,k)=\frac{1}{\bar{K}}\sum_{k'=1}^{\bar{K}}\gamma(l,k'),$$

where W(·) is a smoothing window, $K_{loc}$ and $K_{glo}$ are half the window lengths of the local and wide smoothing windows, respectively, and $\bar{K}=\lfloor K/2\rfloor+1$ is the number of retained frequency bins.
6. A multi-channel speech signal enhancement method according to claim 3, characterized by:
step 203) estimates the prior speech presence probability as follows:
first, threshold mapping of the three smoothed signal-to-noise ratios yields three prior speech presence probabilities $p_{loc}(l,k)$, $p_{glo}(l,k)$ and $p_{fra}(l,k)$; the same mapping is used for $\gamma_{loc}(l,k)$ and $\gamma_{glo}(l,k)$, with the parameter a taking the value 316 and the parameter b taking the value 2.5;
the threshold mapping for $\gamma_{fra}(l,k)$ combines, with a logical AND (&), comparisons of the smoothed signal-to-noise ratio over a low-frequency band delimited by $K_{1},K_{2}$ and a mid-to-high-frequency band delimited by $K_{3},K_{4}$, both of which are set manually, against the corresponding thresholds;
the prior speech presence probability p(l,k) is then computed by combining $p_{loc}(l,k)$, $p_{glo}(l,k)$ and $p_{fra}(l,k)$.
7. a multi-channel speech signal enhancement method according to claim 3, characterized in that:
step 204) calculates the posterior speech presence probability from the estimated prior speech presence probability and estimates the noise covariance matrix, as follows:
the posterior speech presence probability $p_{x}(l,k)$ is computed from the prior probability p(l,k), the observation vector $\mathbf{Y}(l,k)=[Y_{1}(l,k),\ldots,Y_{M}(l,k)]^{T}$, and the estimated speech and noise covariance matrices;
the noise covariance matrix $\hat{\mathbf{\Phi}}_{v}(l,k)$ is obtained by the iterative smoothing

$$\tilde{\alpha}_{v}(l,k)=\alpha_{v}+(1-\alpha_{v})\,p_{x}(l,k),\qquad
\hat{\mathbf{\Phi}}_{v}(l,k)=\tilde{\alpha}_{v}(l,k)\,\hat{\mathbf{\Phi}}_{v}(l-1,k)+\bigl(1-\tilde{\alpha}_{v}(l,k)\bigr)\,\mathbf{Y}(l,k)\mathbf{Y}^{H}(l,k),$$

where $\tilde{\alpha}_{v}(l,k)$ is a time-varying smoothing factor and $\alpha_{v}$ is a fixed smoothing factor; the speech covariance matrix $\hat{\mathbf{\Phi}}_{x}(l,k)$ is obtained as

$$\hat{\mathbf{\Phi}}_{y}(l,k)=\alpha_{y}\,\hat{\mathbf{\Phi}}_{y}(l-1,k)+(1-\alpha_{y})\,\mathbf{Y}(l,k)\mathbf{Y}^{H}(l,k),\qquad
\hat{\mathbf{\Phi}}_{x}(l,k)=\hat{\mathbf{\Phi}}_{y}(l,k)-\hat{\mathbf{\Phi}}_{v}(l,k),$$

where $\hat{\mathbf{\Phi}}_{y}(l,k)$ is the covariance matrix of the noisy signal and $\alpha_{y}$ is the corresponding fixed smoothing factor.
8. A multi-channel speech signal enhancement method according to claim 3, characterized by:
step 205) obtains a refined noise covariance matrix estimate by iterative estimation, as follows:
steps 201) to 204) are repeated, with the noise covariance matrix estimate used in each formula uniformly replaced by the noise covariance matrix obtained in the previous iteration;
the beamformer h(0,k) at the initial time is set according to the direction information of the desired speech signal; the noise power spectral density at the initial time is estimated directly from the initial silent segment of the collected data.
9. The multi-channel speech signal enhancement method of claim 2, characterized by:
step 3) specifically comprises: constructing an adaptive beamformer from the estimated noise covariance matrix, the adaptive beamformer being expressed as

$$\mathbf{h}(l,k)=\frac{\bigl(\hat{\mathbf{\Phi}}_{v}^{-1}(l,k)\,\hat{\mathbf{\Phi}}_{y}(l,k)-\mathbf{I}_{M}\bigr)\,\mathbf{u}}{\alpha+\operatorname{tr}\bigl\{\hat{\mathbf{\Phi}}_{v}^{-1}(l,k)\,\hat{\mathbf{\Phi}}_{y}(l,k)-\mathbf{I}_{M}\bigr\}},$$

where $\mathbf{I}_{M}$ is the M × M identity matrix, $\mathbf{u}$ is the first column of $\mathbf{I}_{M}$, and α is a parameter that adjusts the amount of noise reduction of the beamformer, with a value range of 0 to 1;
the time-frequency domain estimate of the speech signal is

$$\hat{X}(l,k)=\mathbf{h}^{H}(l,k)\,\mathbf{Y}(l,k).$$
10. a multi-channel speech signal enhancement apparatus comprising:
the short-time Fourier transform module is used for carrying out short-time Fourier transform on the time domain signals of the channels collected by the microphone array to obtain corresponding time-frequency domain signals;
the noise covariance matrix estimation module is used for estimating the prior speech existence probability and calculating a noise covariance matrix;
the adaptive beam forming module is used for constructing an adaptive beam former by utilizing the noise covariance matrix obtained by calculation, and carrying out spatial filtering on the collected time-frequency domain multi-channel signals to obtain estimated time-frequency domain voice signals;
and the short-time Fourier inverse transformation module is used for carrying out short-time Fourier inverse transformation on the estimated time-frequency domain voice signal to obtain an estimated time-domain voice signal.
11. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 9 when executing the computer program.
12. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to carry out the method according to any one of claims 1 to 9.
CN202210384863.8A 2022-04-13 2022-04-13 Multi-channel speech signal enhancement method and apparatus, computer device and storage medium Pending CN114882898A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210384863.8A CN114882898A (en) 2022-04-13 2022-04-13 Multi-channel speech signal enhancement method and apparatus, computer device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210384863.8A CN114882898A (en) 2022-04-13 2022-04-13 Multi-channel speech signal enhancement method and apparatus, computer device and storage medium

Publications (1)

Publication Number Publication Date
CN114882898A true CN114882898A (en) 2022-08-09

Family

ID=82669784

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210384863.8A Pending CN114882898A (en) 2022-04-13 2022-04-13 Multi-channel speech signal enhancement method and apparatus, computer device and storage medium

Country Status (1)

Country Link
CN (1) CN114882898A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115942194A (en) * 2022-12-08 2023-04-07 中国科学院声学研究所 Directional processing method and system for hearing rehabilitation treatment device processor

Similar Documents

Publication Publication Date Title
CN110085249B (en) Single-channel speech enhancement method of recurrent neural network based on attention gating
Yoshioka et al. Generalization of multi-channel linear prediction methods for blind MIMO impulse response shortening
KR100304666B1 (en) Speech enhancement method
US7313518B2 (en) Noise reduction method and device using two pass filtering
Mertins et al. Room impulse response shortening/reshaping with infinity-and $ p $-norm optimization
WO2020107269A1 (en) Self-adaptive speech enhancement method, and electronic device
US20120245927A1 (en) System and method for monaural audio processing based preserving speech information
US8737641B2 (en) Noise suppressor
CN111081267B (en) Multi-channel far-field speech enhancement method
CN111445919B (en) Speech enhancement method, system, electronic device, and medium incorporating AI model
CN102938254A (en) Voice signal enhancement system and method
RU2768514C2 (en) Signal processor and method for providing processed noise-suppressed audio signal with suppressed reverberation
JP5834088B2 (en) Dynamic microphone signal mixer
CN103871421A (en) Self-adaptive denoising method and system based on sub-band noise analysis
US10839820B2 (en) Voice processing method, apparatus, device and storage medium
US11373667B2 (en) Real-time single-channel speech enhancement in noisy and time-varying environments
CN109961799A (en) A kind of hearing aid multicenter voice enhancing algorithm based on Iterative Wiener Filtering
Cord-Landwehr et al. Monaural source separation: From anechoic to reverberant environments
US20200286501A1 (en) Apparatus and a method for signal enhancement
JP2011203414A (en) Noise and reverberation suppressing device and method therefor
CN114882898A (en) Multi-channel speech signal enhancement method and apparatus, computer device and storage medium
CN117219102A (en) Low-complexity voice enhancement method based on auditory perception
CN113689870A (en) Multi-channel voice enhancement method and device, terminal and readable storage medium
CN114242104A (en) Method, device and equipment for voice noise reduction and storage medium
Gui et al. Adaptive subband Wiener filtering for speech enhancement using critical-band gammatone filterbank

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination