CN117636894A - Voice dereverberation method based on multi-channel blind identification and multi-channel equalization - Google Patents

Voice dereverberation method based on multi-channel blind identification and multi-channel equalization

Info

Publication number
CN117636894A
Authority
CN
China
Prior art keywords: vector, matrix, channel, calculating, filter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311670403.2A
Other languages
Chinese (zh)
Inventor
何宏森
邱志民
陈景东
喻翌
李小霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Southwest University of Science and Technology
Original Assignee
Northwestern Polytechnical University
Southwest University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University and Southwest University of Science and Technology
Priority to CN202311670403.2A
Publication of CN117636894A
Legal status: Pending


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00: Reducing energy consumption in communication networks
    • Y02D 30/70: Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Filters That Use Time-Delay Elements (AREA)

Abstract

The invention discloses a voice dereverberation method based on multi-channel blind identification and multi-channel equalization. To solve the multi-channel blind identification problem, a variable regularization function is designed on the basis of the normalized multi-channel frequency-domain least-mean-square (NMCFLMS) algorithm, and the signal-to-noise ratio, the output signal energy and the filter length are incorporated into it so that the algorithm is robust to additive noise and to the non-stationarity of speech. In addition, to give the proposed method better tracking performance under time-varying conditions, a mechanism that refreshes the regularization parameter according to the mean square error is proposed. In this way, a faster convergence speed and tracking speed are obtained in a noisy environment, and the channel-equalization-based speech dereverberation achieves a better dereverberation effect; in particular, the dereverberation performance is markedly improved during the transient state of the adaptive filter in a low signal-to-noise-ratio environment.

Description

Voice dereverberation method based on multi-channel blind identification and multi-channel equalization
Technical Field
The invention belongs to the technical field of voice dereverberation, and particularly relates to a voice dereverberation method based on multi-channel blind identification and multi-channel equalization.
Background
Blind identification is a method of estimating the impulse response of a system using only the system output signals, and it plays an important role in speech processing technologies such as speech noise reduction, beamforming, speech dereverberation and sound source localization. In recent years, scholars have conducted extensive research on batch and adaptive algorithms for this problem. Among these algorithms, the normalized multi-channel frequency-domain least-mean-square (NMCFLMS) algorithm is implemented in the frequency domain using the fast Fourier transform (FFT) and is computationally efficient, which makes it particularly attractive for real-time processing systems. Meanwhile, to accelerate the convergence of the adaptive filter and to reduce the gradient-noise amplification caused by large channel output amplitudes, the algorithm uses a Newton iteration. Newton's iteration is a well-known optimization method in which regularization of the Hessian matrix is crucial. However, the NMCFLMS algorithm constructs its regularization factor using only the first block of the system output signal, so the regularization factors required under different speech segments and different signal-to-noise environments differ greatly, and it is difficult to obtain a suitable regularization parameter.
Scholars have proposed a number of solutions to the regularization problem of the Hessian matrix; the most classical approach is to bias the search direction of Newton's method toward the steepest-descent direction. This strategy can be implemented by adding a suitably scaled identity matrix to the Hessian matrix. In adaptive filtering algorithms, introducing a regularization parameter both ensures numerical stability and enhances the convergence performance of the filter. In recent years a large number of regularization methods have been proposed. Among them, constant regularization methods do not update the regularization parameter during the iteration of the filter; the optimal regularization parameter is determined from information about the excitation signal and the signal-to-noise ratio, which gives the adaptive filter better robustness in a noisy environment, but in practice a constant regularization method trades off convergence speed against the steady-state error of the filter. The other class is variable regularization methods, in which the regularization parameter exploits relevant data in real time during the filter iterations, including the error signal, the estimated noise, the system input and so on; because of this real-time nature, variable regularization converges faster than constant regularization. However, for blind system identification, the existing methods cannot be applied directly, because the input signal is unavailable and the system is time-varying.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art and to provide a voice dereverberation method based on multi-channel blind identification and multi-channel equalization. A variable regularization function is designed, into which the signal-to-noise ratio, the output signal energy and the filter length are incorporated, so that the method is robust to additive noise and to the non-stationarity of speech and has better tracking performance under time-varying conditions. In this way, a faster convergence speed and tracking speed are obtained in a noisy environment, and the channel-equalization-based speech dereverberation achieves a better dereverberation effect; in particular, the dereverberation performance is markedly improved during the transient state of the adaptive filter in a low signal-to-noise-ratio environment.
To achieve the above object, the present invention provides a speech dereverberation method based on multi-channel blind recognition and multi-channel equalization, which is characterized by comprising the steps of:
(1) Initializing
Initialize the length-L filter vector of the k-th channel at time 0 as:
where T denotes transposition, M is the number of channels, and L is the length of the acoustic channel impulse response vector;
initialize the power spectrum matrix of the k-th channel at time 0 as:
P_k(0) = 0_{L×L}
initialize the reverberant signal variance as:
(2) Collect the sound signals of the M microphones. The samples of the i-th channel are denoted x_i(n), i = 1, 2, …, M, n = 0, 1, 2, …, where n is the sample time index. Construct the signal vector x_i(m), i = 1, 2, …, M, corresponding to block time index m, i.e. the output signal vector of the i-th microphone:
x_i(m) = [x_i(mL−L)  x_i(mL−L+1)  …  x_i(mL+L−1)]^T
(3) Starting from the block time index m = 1, obtain the dereverberated speech vector:
3.1) Obtain the variable regularization function δ(m)
3.1.1) Calculate the microphone output signal variance:
where ‖·‖_2 denotes the 2-norm of a vector and the remaining coefficient is a smoothing factor;
3.1.2) Calculate the reverberant signal variance:
where λ_1 is a forgetting factor and SNR is the output signal-to-noise ratio;
3.1.3) Calculate the bias factor b(m) of the regularization function:
where v denotes a coefficient factor;
3.1.4) Calculate the variable regularization function δ(m):
where α is a parameter controlling the overall range of the function curve, κ is a parameter controlling the steepness of the curve (the larger κ, the steeper the transition), and ξ is a parameter controlling the position of the rising point of the curve, with initial value ξ_0;
3.2) Obtain the filter vector
3.2.1) Calculate the frequency-domain extended filter vector:
where F_{2L×2L} is a Fourier matrix of size 2L×2L and 0 is a zero column vector of length L;
3.2.2) Calculate the spectral matrix of the input signal:
where diag[·] denotes expanding a vector into a diagonal matrix;
3.2.3) Calculate the power spectrum matrix P_k(m):
where λ_2 is a forgetting factor and the superscript * denotes the complex conjugate;
3.2.4) Calculate the inverse of the power spectrum matrix:
where I_{2L×2L} is an identity matrix of size 2L×2L;
3.2.5) Calculate the frequency-domain error vector e_ij(m) of the i-th and j-th channels:
where F_{L×L} is a Fourier matrix of size L×L, 0_{L×L} is a zero matrix of size L×L, I_{L×L} is an identity matrix of size L×L, and F^{-1}_{2L×2L} is an inverse Fourier matrix of size 2L×2L;
3.2.6) Calculate the frequency-domain extended filter vector:
where ρ is the step-size factor of the adaptive filter, the error vector is extended in the frequency domain, F_{2L×2L} is a Fourier matrix of size 2L×2L, and F^{-1}_{L×L} is an inverse Fourier matrix of size L×L;
3.2.7) Calculate the filter vector:
where the subscript 1:L denotes taking the first L values;
3.3) Obtain the dereverberated speech vector
3.3.1) Construct the impulse response matrix:
where the impulse response matrix is of size L_c × L_g, with L_c = L + L_g − 1 and L_g the length of the equalization filter vector;
3.3.2) Construct the multi-channel impulse response matrix;
3.3.3) Calculate the equalization filter matrix g(m):
where g_k(m) is the equalization filter vector of the k-th channel and d is the desired equalized impulse response vector;
3.3.4) Calculate the dereverberated speech vector:
where conv(·) denotes the convolution function;
3.3.5) Detect the mean square error MSE(m)
First, calculate the mean square error MSE(m):
where the superscript H denotes the conjugate transpose; then make the update decision:
if MSE(m) is greater than the upper threshold γ, refresh the variable regularization function δ(m) by setting the parameter ξ = m + ξ_0, where ξ_0 is the initial value; if MSE(m) is not greater than the upper threshold γ, keep the parameter ξ unchanged;
set m = m + 1 and return to step 3.1.1 to compute the dereverberated speech vector for the next block time.
The object of the invention is achieved as follows:
The invention provides a voice dereverberation method based on multi-channel blind identification and multi-channel equalization. To solve the multi-channel blind identification problem, a variable regularization function is designed on the basis of the normalized multi-channel frequency-domain least-mean-square (NMCFLMS) algorithm, into which the signal-to-noise ratio, the output signal energy and the filter length are incorporated so that the algorithm is robust to additive noise and to the non-stationarity of speech. In addition, to give the proposed method better tracking performance under time-varying conditions, a mechanism that refreshes the regularization parameter according to the mean square error is proposed. In this way, a faster convergence speed and tracking speed are obtained in a noisy environment, and the channel-equalization-based speech dereverberation achieves a better dereverberation effect; in particular, the dereverberation performance is markedly improved during the transient state of the adaptive filter in a low signal-to-noise-ratio environment.
Drawings
FIG. 1 is a schematic diagram of speech dereverberation based on multi-channel equalization;
FIG. 2 is a schematic diagram of a variable regularization function;
FIG. 3 is a flow chart of a speech dereverberation method based on multi-channel blind recognition and multi-channel equalization in accordance with the present invention;
FIG. 4 is a graph of two sets of acoustic impulse responses measured in a real room, where (a) is the sound source location corresponding to the first set of impulse responses and (b) is the sound source location corresponding to the second set of impulse responses;
FIG. 5 is a comparison of the convergence performance of NMCFLMS and the proposed VR-NMCFLMS algorithm for blind identification of six time-invariant acoustic channels with a white Gaussian sequence as excitation in a white Gaussian noise environment;
FIG. 6 is a graph comparing convergence performance of six acoustic channels for a blind recognition of NMCFLMS and the proposed VR-NMCFLMS algorithm when speech is used as an excitation signal in a white Gaussian noise environment;
FIG. 7 is a comparison of NPM values after 10000 iterations of NMCFLMS and the VR-NMCFLMS algorithm provided by the present invention for blind identification of six acoustic channels at different SNRs, with white Gaussian additive noise and speech as the excitation;
FIG. 8 is a graph comparing NPM values after 2000 iterations of NMCFLMS and VR-NMCFLMS algorithm of the present invention for blind identification of six acoustic channels at different SNR with white Gaussian additive noise and speech as excitation conditions;
FIG. 9 is a plot of MSE comparisons for six acoustic channels that vary with time for NMCFLMS and the VR-NMCFLMS algorithm of the present invention blindly identified with white Gaussian additive noise and speech as excitation conditions;
FIG. 10 is a graph comparing convergence performance for six acoustic channels blindly identified by NMCFLMS and the VR-NMCFLMS algorithm of the present invention under white Gaussian additive noise and speech as excitation conditions;
FIG. 11 is another comparison of convergence performance for a six acoustic channel blind identified by NMCFLMS and the VR-NMCFLMS algorithm of the present invention with white Gaussian additive noise and speech as excitation;
FIG. 12 shows the dynamic performance metrics ΔCD and ΔSTOI when NMCFLMS and the proposed VR-NMCFLMS algorithm are used for speech dereverberation under time-invariant conditions with SNR = 25 dB, where (a) is ΔCD and (b) is ΔSTOI;
FIG. 13 shows the dynamic performance metrics ΔFWSNR and ΔLLR when NMCFLMS and the proposed VR-NMCFLMS algorithm are used for speech dereverberation under time-invariant conditions with SNR = 25 dB, where (a) is ΔFWSNR and (b) is ΔLLR;
FIG. 14 shows the dynamic performance metrics ΔCD and ΔSTOI when NMCFLMS and the proposed VR-NMCFLMS algorithm are used for speech dereverberation under time-invariant conditions with SNR = 15 dB, where (a) is ΔCD and (b) is ΔSTOI;
FIG. 15 shows the dynamic performance metrics ΔFWSNR and ΔLLR when NMCFLMS and the proposed VR-NMCFLMS algorithm are used for speech dereverberation under time-invariant conditions with SNR = 15 dB, where (a) is ΔFWSNR and (b) is ΔLLR;
FIG. 16 shows the dynamic performance metrics ΔCD and ΔSTOI when NMCFLMS and the proposed VR-NMCFLMS algorithm are used for speech dereverberation under time-varying conditions with SNR = 25 dB, where (a) is ΔCD and (b) is ΔSTOI;
FIG. 17 shows the dynamic performance metrics ΔFWSNR and ΔLLR when NMCFLMS and the proposed VR-NMCFLMS algorithm are used for speech dereverberation under time-varying conditions with SNR = 25 dB, where (a) is ΔFWSNR and (b) is ΔLLR;
FIG. 18 shows the dynamic performance metrics ΔCD and ΔSTOI when NMCFLMS and the proposed VR-NMCFLMS algorithm are used for speech dereverberation under time-varying conditions with SNR = 15 dB, where (a) is ΔCD and (b) is ΔSTOI;
FIG. 19 shows the dynamic performance metrics ΔFWSNR and ΔLLR when NMCFLMS and the proposed VR-NMCFLMS algorithm are used for speech dereverberation under time-varying conditions with SNR = 15 dB, where (a) is ΔFWSNR and (b) is ΔLLR.
Detailed Description
The following description of the embodiments of the invention is presented in conjunction with the accompanying drawings so that those skilled in the art can better understand the invention. It should be expressly noted that, in the following description, detailed descriptions of known functions and designs are omitted where they might obscure the present invention.
1. Speech dereverberation based on multichannel equalization
Assuming that a single-input multiple-output (SIMO) acoustic system consists of one sound source and M microphones, as shown in fig. 1, the signal of the i-th (i = 1, 2, …, M) microphone can be expressed as:
x_i(n) = s(n) * h_i + υ_i(n) = y_i(n) + υ_i(n) (1)
where s(n) is the excitation speech signal, * denotes linear convolution, h_i is the impulse response between the sound source and the i-th microphone, which is typically modeled by a finite impulse response (FIR) filter, y_i(n) is the reverberant speech, and υ_i(n) is the additive noise picked up by the i-th microphone.
Assuming that there is no additive noise in any acoustic channel, the output signal of the multi-channel equalization system can be expressed as:
where ŝ(n) is the dereverberated speech, g_i is the equalization filter of the i-th channel, and c is the equalized impulse response between the sound source and the system output.
To write (2) in vector/matrix form, define the length-L_g equalization filter vector g_i of channel i and the L_c × L_g impulse response matrix H_i:
where L is the length of the acoustic channel impulse response vector. The output signal of the multi-channel equalizer can then be expressed as:
where:
s(n) = [s(n)  s(n−1)  …  s(n−L_c+1)]^T, (5)
H = [H_1  H_2  …  H_M], (7)
L_c = L + L_g − 1, (9)
and:
c = Hg. (10)
As can be seen from (4), if the equalized impulse response vector c is a unit impulse, the dereverberated speech recovers the source speech. To obtain the dereverberated speech, the equalization filter vector g needs to be estimated. According to the multiple-input/output inverse theorem (MINT), the equalization filter g satisfies the equation:
where Ĥ is an estimate of the multi-channel impulse response matrix H and d is the desired equalized impulse response vector, usually defined as:
with τ a delay parameter. If the additive noise is not equal to zero or the multi-channel impulse response matrix is not estimated exactly, the following least-squares solution can be obtained from (11):
where Ĥ^+ is the pseudo-inverse of Ĥ; if Ĥ has full row rank, Ĥ^+ = Ĥ^T(ĤĤ^T)^(−1).
As can be seen from fig. 1, once the equalization filter g is estimated, the dereverberated speech is obtained from the MINT principle as:
where conv(·) denotes the convolution function and x_i(n) = [x_i(n)  x_i(n−1)  …  x_i(n−L+1)]^T is the output signal vector of the i-th microphone.
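As a purely illustrative sketch (not part of the patent text), the MINT equalizer of equations (3), (13) and (14) could be prototyped with NumPy as follows; the function names, the use of numpy.linalg.pinv as the pseudo-inverse, and the stacking order of the per-channel filters are assumptions of this sketch:

```python
import numpy as np

def convolution_matrix(h, Lg):
    """Build the Lc x Lg impulse response matrix H_i of eq. (3),
    with Lc = L + Lg - 1, so that H @ g == np.convolve(h, g)."""
    L = len(h)
    Lc = L + Lg - 1
    H = np.zeros((Lc, Lg))
    for j in range(Lg):
        H[j:j + L, j] = h
    return H

def mint_equalizer(h_list, Lg, tau):
    """Least-squares MINT equalizer of eq. (13): g = pinv(H) @ d,
    where d is a unit impulse delayed by tau samples (eq. (12))."""
    H = np.hstack([convolution_matrix(h, Lg) for h in h_list])  # Lc x (M*Lg)
    d = np.zeros(H.shape[0])
    d[tau] = 1.0                       # desired equalized impulse response
    g = np.linalg.pinv(H) @ d          # stacked filters [g_1; g_2; ...; g_M]
    return g.reshape(len(h_list), Lg)

def dereverberate(x_list, g):
    """Eq. (14): sum over channels of the microphone signals convolved with g_i."""
    return sum(np.convolve(x, gi) for x, gi in zip(x_list, g))
```

For instance, with M = 6 estimated impulse responses of length L = 1024 and L_g chosen so that M·L_g ≥ L + L_g − 1, mint_equalizer would return one length-L_g equalization filter per channel.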
2. NMCFLMS algorithm-based multi-channel acoustic system blind identification
According to the multi-channel equalizer principle, the impulse response of each acoustic channel needs to be estimated in order to achieve the purpose of speech dereverberation.
According to the signal model (1), there is the following relationship without taking noise into account:
x_i(n) * h_j = s(n) * h_i * h_j = x_j(n) * h_i,  i, j = 1, 2, …, M,  i ≠ j. (15)
This can be written in vector form:
where h_i = [h_{i,0}  h_{i,1}  …  h_{i,L−1}]^T, i = 1, 2, …, M, is the length-L impulse response vector of the i-th channel.
When noise is present or the estimated impulse response deviates from the true impulse response, (16) will no longer equal zero, so the a priori error signal between channels i and j can be defined as:
where ĥ_i(n) is the estimate of h_i at time n.
According to the NMCFLMS algorithm, its cost function is defined as the sum of squares of instantaneous errors between different channels, namely:
where m is the block index.
e_ij(m) = [e_ij(mL)  e_ij(mL+1)  …  e_ij(mL+L−1)]^T, (20)
x_i(m) = [x_i(mL−L)  x_i(mL−L+1)  …  x_i(mL+L−1)]^T, (23)
0_{L×L} is a zero matrix of size L×L, I_{L×L} is an identity matrix of size L×L, and diag[·] denotes expanding a vector into a diagonal matrix. According to the Newton iterative method, the filter update equation of the NMCFLMS algorithm can be derived as:
where ρ is a step size factor.
0_{1×L} is a zero row vector of length L.
In order to estimate a more stable power spectrum matrix in practical applications, it can be expressed in a recursive form as follows:
where λ is a forgetting factor, typically λ = [1 − 1/(3L)]^L. To prevent numerical instability caused by a non-invertible power spectrum matrix, a regularization factor is usually added when the power spectrum matrix is inverted, so that the filter update equation of the NMCFLMS algorithm can be modified as:
where δ is a regularization factor, which is typically set to one fifth of the sum of the power of the first block signal of all channels, i.e.:
as can be seen from (33), the regularization method of the NMCFLMS algorithm constructs the regularization factor only by using the power information of the first block signal output by the system, and this way will cause the regularization factors of the algorithm to be very different in different speech segments and different signal-to-noise environments, so it is difficult to set a suitable regularization parameter. In order to solve the problem, the invention designs a variable regularization function, which integrates the information such as signal-to-noise ratio, output signal energy, filter length and the like, so that the algorithm has robustness to additive noise and non-stationarity of voice. In order to make the proposed method have better tracking performance under time-varying conditions, we propose a mechanism to refresh regularized parameters according to mean square error.
3. Proposed variable regularization method
As can be seen from section 2, the optimization method used by the original NMCFLMS algorithm is newton's iteration method. In general, the filter vector of newton's iterative method satisfies a system of linear equations:
G(m) p_Newton(m) = −θ(m), (34)
where G(m) is the Hessian matrix, θ(m) is the gradient vector, and p_Newton(m) = w(m+1) − w(m) is the filter search direction vector corresponding to Newton's method, obtained by solving this linear system. To address the reduced convergence performance and numerical instability of the filter caused by a non-positive-definite or ill-conditioned Hessian matrix G(m) (very large eigenvalue spread), a suitably scaled identity matrix can be added to G(m); the corresponding linear system is:
[G(m)+δI]p(m)=-θ(m), (35)
where δ is a regularization factor and p(m) is the regularized filter search direction vector. It can be seen that by introducing the regularization factor δ, the filter search direction vector p_Newton(m) is corrected to p(m), whose direction lies between the Newton search direction and the negative gradient direction −θ(m) of the steepest-descent method. We further find that:
1) When δ is small, the regularized filter search direction vector p(m) is biased toward the Newton direction p_Newton(m); the filter then converges quickly, but the risk of divergence is greater under low signal-to-noise conditions.
2) When δ is large, the regularized filter search direction vector p(m) is biased toward the negative gradient direction −θ(m) of the steepest-descent method; the filter then converges slowly but is more stable.
According to this principle, for the NMCFLMS algorithm, provided the filter converges, when the instantaneous estimation error is large or the filter vector ĥ(m) differs strongly from the true impulse response vector h, the regularization factor δ should be set to a small value so that the fast convergence of Newton's method can be exploited; conversely, when the instantaneous estimation error is small or the filter vector ĥ(m) is close to the true impulse response vector h, δ should be set to a large value so that the stability of the gradient method can be exploited. To this end, we design the following variable regularization function:
where w is a weight coefficient that controls the overall range of the function curve, κ controls the steepness of the curve (the larger its value, the sharper the transition), ξ controls the position of the rising point of the curve, and b is a bias factor that affects the initial convergence speed of the filter. The curve for w = 90, κ = 0.01, ξ = 1500 and b = 10 is shown in fig. 2.
As can be seen from fig. 2, if the rising and falling points of the function curve can be reasonably controlled, a smaller δ (m) can be obtained at the initial stage of the filter update, thereby resulting in a faster convergence speed, and the δ (m) becomes larger when the filter is in steady state, and the filter can stably operate. However, the regularization function does not contain information such as signal-to-noise ratio, output signal energy, and filter length, i.e., the function cannot accommodate variations in these parameters. In order to solve the problem, a robust regularization method is provided, so that the adaptive filter can effectively and blindly identify the impulse response of the multichannel acoustic system under different signal-to-noise ratios, and the aim of suppressing the voice reverberation is fulfilled.
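A minimal numerical check of the curve in fig. 2, assuming a rising logistic form with the stated parameters w = 90, κ = 0.01, ξ = 1500 and b = 10 (the exact expression appears only as a figure in the original, so the sign convention is an assumption of this sketch):

```python
import numpy as np

def delta_curve(m, w=90.0, kappa=0.01, xi=1500.0, b=10.0):
    """Logistic-shaped regularization curve: close to b for small m,
    rising to about w + b once m passes xi (assumed form of fig. 2)."""
    return w / (1.0 + np.exp(-kappa * (m - xi))) + b

m = np.arange(0, 4000)
d = delta_curve(m)
print(d[0], d[1500], d[-1])   # ~10 at the start, ~55 at the rise point, ~100 in steady state
```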
For the adaptive system identification problem, Benesty designed an effective regularization parameter, defined as follows:
where the four quantities are the variance of the input signal, the output signal-to-noise ratio SNR, the variance of the reverberant signal, and the variance of the additive noise. In practice, the signal-to-noise ratio can be obtained by estimation; however, since the problem studied by the invention is multi-channel blind identification, the input signal is unavailable and its variance cannot be computed. The invention therefore approximates the input-signal variance by the reverberant-signal variance, which can be obtained as follows:
To obtain a more stable estimate, the following recursion can be used:
where η is a forgetting factor, ‖·‖_2 denotes the 2-norm of a vector, and a smoothing factor is applied. The bias factor of the regularization function can then be defined as follows:
The bias factor mainly affects the initial convergence speed of the algorithm: the smaller its value, the faster the convergence. The coefficient factor v is introduced to match the frequency-domain adaptive algorithm. We further set the weight coefficient w of the regularization function to:
w(m)=αb(m), (42)
where α is a constant. Therefore, the regularization function is designed as follows:
according to the regularization function designed by the invention, when the adaptive filter tends to be stable, the regularization parameter of the adaptive filter is large, and the filter searching direction is biased to the negative gradient direction. If the acoustic system transfer function changes at this time, the larger regularization parameter slows down the convergence speed of the adaptive filter, and the time-varying system cannot be effectively tracked. Therefore, correction of regularization parameters is needed when the system is suddenly changed, so that the tracking capability of the algorithm on the time-varying system is improved. To this end, it may be determined whether to adjust the regularization factor by detecting an instantaneous value of a Mean Square Error (MSE). If the impulse response of the acoustic channel changes, the MSE is detected to be suddenly changed, the regularization function is refreshed, otherwise, the regularization function is unchanged, and the specific operation is as follows:
where γ is the upper threshold of the MSE, which may be set as a parameter related to the microphone signal power. Only when a refresh is triggered do we set ξ = m + ξ_0, where ξ_0 is the set initial value.
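The following sketch only illustrates the mechanism described in this section: a recursive output-power estimate, an SNR-dependent reverberant-signal variance, a bias factor that scales with signal power, filter length and SNR, a logistic δ(m), and the MSE-triggered refresh of ξ. The concrete expressions are assumptions, since the patented formulas appear only as figures in the original text:

```python
import numpy as np

class VariableRegularizer:
    """Illustrative sketch of the described mechanism (not the patented formulas):
    a logistic delta(m) whose floor b(m) tracks signal power, SNR and filter
    length, and whose rise point xi is refreshed when the MSE jumps."""

    def __init__(self, L, snr, alpha=90.0, kappa=0.01, xi0=1500, lam1=0.99,
                 mu_f=0.05, v=1.0, gamma=1.0):
        self.L, self.snr = L, snr
        self.alpha, self.kappa, self.xi0, self.xi = alpha, kappa, xi0, xi0
        self.lam1, self.mu_f, self.v, self.gamma = lam1, mu_f, v, gamma
        self.sigma_x2 = 0.0   # microphone output power (recursive estimate)
        self.sigma_y2 = 0.0   # reverberant-signal variance estimate

    def update(self, x_blocks, m, mse):
        # 3.1.1: smoothed microphone output power over all channels (assumed form)
        inst = np.mean([np.sum(np.abs(x) ** 2) / len(x) for x in x_blocks])
        self.sigma_x2 = (1 - self.mu_f) * self.sigma_x2 + self.mu_f * inst
        # 3.1.2: reverberant-signal variance via the output SNR (assumed form)
        self.sigma_y2 = (self.lam1 * self.sigma_y2
                         + (1 - self.lam1) * self.sigma_x2 * self.snr / (1 + self.snr))
        # 3.1.3: bias factor grows with power and filter length, shrinks with SNR
        b = self.v * self.L * self.sigma_y2 / self.snr
        # 3.3.5: refresh the rise point when the MSE exceeds the threshold gamma
        if mse > self.gamma:
            self.xi = m + self.xi0
        # 3.1.4: logistic variable regularization function with weight w = alpha * b
        w = self.alpha * b
        return w / (1.0 + np.exp(-self.kappa * (m - self.xi))) + b
```

Pushing ξ forward by ξ_0 blocks after a detected MSE jump drives δ(m) back toward its small initial value, which is what restores the Newton-like fast convergence needed to track a changed acoustic system.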
4. Voice dereverberation method based on multi-channel blind identification and multi-channel equalization
In this embodiment, as shown in fig. 3, the speech dereverberation method based on multi-channel blind recognition and multi-channel equalization of the present invention comprises the following steps:
step S1: initialization of
Initialize the length-L filter vector of the k-th channel at time 0 as:
where T denotes transposition, M is the number of channels, and L is the length of the acoustic channel impulse response vector.
Initialize the power spectrum matrix of the k-th channel at time 0 as:
P_k(0) = 0_{L×L}
Initialize the reverberant signal variance.
Step S2: Construct the output signal vector x_i(m) of the microphones
Collect the sound signals of the M microphones. The samples of the i-th channel are denoted x_i(n), i = 1, 2, …, M, n = 0, 1, 2, …, where n is the sample time index. Construct the signal vector x_i(m), i = 1, 2, …, M, corresponding to block time index m, i.e. the output signal vector of the i-th microphone:
x_i(m) = [x_i(mL−L)  x_i(mL−L+1)  …  x_i(mL+L−1)]^T
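A small sketch of the block construction in step S2, assuming the channel signal is available as a NumPy array (the helper name is hypothetical):

```python
import numpy as np

def block_vector(x, m, L):
    """x_i(m) = [x_i(mL-L), x_i(mL-L+1), ..., x_i(mL+L-1)]^T:
    the 2L samples of channel i ending at sample mL+L-1 (valid for m >= 1)."""
    start = m * L - L
    return x[start:start + 2 * L]
```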
step S3: starting from the block time index m=1, a dereverberated speech vector is obtained
Step S3.1: obtaining a variable regularization function delta (m)
Step S3.1.1: calculating microphone output signal variance
Wherein I 2 Representing the 2-norm of the vector,is a smoothing factor.
Step S3.1.2: calculating reverberant signal variance/>
Wherein lambda is 1 As a forgetting factor, SNR is the output signal-to-noise ratio.
Step S3.1.3: calculating a bias factor b (m) of the regularization function:
where v denotes a coefficient factor.
Step S3.1.4: calculating a variable regularization function delta (m):
wherein alpha is a parameter of the overall range of the control function curveK is a parameter for controlling the steepness of a curve, the steeper the curve abrupt change is, and xi is a parameter for controlling the position of the lifting point of the curve, and the initial value is xi 0
Step S3.2: obtaining a filter vector h k (m)
Step S3.2.1: calculating a frequency domain expansion filter vector
Wherein F is 2L×2L Is a fourier matrix of size 2L x 2L, 0 is a zero column vector of length L.
Step S3.2.2: calculating a spectral matrix of an input signal
Wherein diag [ ] represents the expansion of vectors into a diagonal array.
Step S3.2.3: calculating a power spectrum matrix P k (m):
Wherein lambda is 2 As forgetting factor, superscript denotes conjugate matrix.
Step S3.2.4: calculating the inverse matrix of the power spectrum
Wherein I is 2L×2L Is an identity matrix of size 2L by 2L.
Step S3.2.5: calculating frequency domain error vector of ith and j channels
Wherein,F L×L is a Fourier matrix with the size of L multiplied by L, 0 L×L Is a zero matrix of size L×L, I L×L Is an identity matrix of size L x L, ">Is an inverse fourier matrix of size 2l×2l.
Step S3.2.6: calculating a frequency domain expansion filter vector/>
Wherein ρ is the step factor of the adaptive filter, and the error vector is expanded in the frequency domain F 2L×2L Is a Fourier matrix of size 2L×2L,/A>Is an inverse fourier matrix of size l×l.
Step S3.2.7: calculating a filter vector
Where the subscript 1:L indicates that the first L values are taken.
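Steps S3.2.1 through S3.2.7 can be summarized by the following NumPy sketch of one block update; the diagonal storage of the power spectra, the (1 − λ_2) weighting, and the overlap-save handling of the cross-relation error are assumptions of this sketch rather than the exact patented update:

```python
import numpy as np

def vr_nmcflms_block(h_hat, x_blocks, P, delta, rho=0.05, lam2=0.99):
    """One block of the frequency-domain multichannel filter update (step S3.2).
    h_hat: (M, L) current filter estimates; x_blocks: (M, 2L) time-domain blocks
    x_i(m); P: (M, 2L) power spectra stored as diagonals; delta: delta(m)."""
    M, L = h_hat.shape
    # S3.2.1: frequency-domain extended filters (zero-padded to 2L)
    H = np.fft.fft(np.hstack([h_hat, np.zeros((M, L))]), axis=1)
    # S3.2.2: spectra of the input blocks (kept as 2L-point diagonals)
    D = np.fft.fft(x_blocks, axis=1)
    # S3.2.3: recursive power spectrum of each channel, excluding the channel itself
    for k in range(M):
        cross = sum(np.abs(D[i]) ** 2 for i in range(M) if i != k)
        P[k] = lam2 * P[k] + (1.0 - lam2) * cross
    # S3.2.5: cross-relation errors, last L samples of x_i * h_j - x_j * h_i (overlap-save)
    E = np.zeros((M, M, 2 * L), dtype=complex)
    for i in range(M):
        for j in range(M):
            if i != j:
                e_t = np.fft.ifft(D[i] * H[j] - D[j] * H[i])[L:]
                E[i, j] = np.fft.fft(np.concatenate([np.zeros(L), e_t]))
    # S3.2.4 and S3.2.6: regularized Newton-like update in the frequency domain
    for k in range(M):
        grad = sum(np.conj(D[i]) * E[i, k] for i in range(M) if i != k)
        H[k] = H[k] - rho * grad / (P[k] + delta)
    # S3.2.7: back to the time domain, keep the first L coefficients
    return np.real(np.fft.ifft(H, axis=1))[:, :L], P
```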
Step S3.3: obtaining dereverberated speech vectors
Step S3.3.1: construction of impulse response matrix
Wherein:impulse response matrix->Is of the size L c ×L g Matrix of L c =L+L g -1,L g Is the equalization filter vector length.
Step S3.3.2: constructing a multi-channel impulse response matrix
Step S3.3.3: computing an equalization filter matrix g (m)
Wherein the method comprises the steps ofEqualization filter vector for the kth channel, d is the desired equalization impulse responseVector.
Step S3.3.4: computing a dereverberated speech vector
Where conv (·) represents the convolution function.
Step S3.3.5: detection of mean square error MSE (m)
First, the mean square error MSE (m) is calculated:
wherein, the superscript H indicates a conjugate transpose, and then updating judgment:
if MSE (m) is greater than threshold upper limit gamma, then starting to refresh variable regularization function delta (m) to make parameter xi=m+xi 0 Wherein, xi 0 Is an initial value, if MSE (m) is not greater than threshold upper limit gamma, parameter xi is maintained unchanged;
m=m+1, returning to step S3.1.1: reverberant speech vector for next block timeIs calculated by the computer.
4. Experiment
4.1, experimental Environment
To verify the effectiveness of the proposed algorithm, we use two sets of impulse responses acquired in a real room acoustic environment, as shown in fig. 4. The room size was 6.7 m × 6.1 m × 2.9 m and the reverberation time was 0.28 s. The acoustic signals were picked up by a linear array of 6 omnidirectional microphones located at (2.537, 0.5, 1.4), (2.737, 0.5, 1.4), (2.937, 0.5, 1.4), (3.137, 0.5, 1.4), (3.337, 0.5, 1.4) and (3.537, 0.5, 1.4), respectively. The first set of impulse responses corresponds to the sound source position (0.337, 3.938, 1.6) and the second set to the sound source position (1.337, 3.938, 1.6). The original impulse responses, sampled at 48 kHz, were downsampled to 8 kHz and truncated to 1024 samples. The first set of impulse responses is used to simulate a time-invariant multi-channel acoustic system, and the second set is added to simulate a time-varying multi-channel acoustic system. A segment of female speech sampled at 8 kHz from the LibriSpeech dataset was used as the excitation signal.
4.2 experimental results
4.2.1 Blind identification experiment results of acoustic channel
In order to evaluate the performance of the acoustic channel blind recognition algorithm, a Normalized Projection Misalignment (NPM) was used as an evaluation index for the algorithm performance, which was defined as follows:
the smaller the NPM value, the closer the modeling filter is to the real impulse response.
FIG. 5 compares the convergence performance of NMCFLMS and the proposed VR-NMCFLMS algorithm for blind identification of six acoustic channels with a white Gaussian sequence as the excitation signal in a white Gaussian noise environment, where L = 1024, μ_f = 0.05, and SNR = 25 dB.
In the proposed VR-NMCFLMS algorithm, the regularization parameters are set to:
wherein:
whereas the regularization parameters of the original NMCFLMS algorithm are:
The rest of the parameters are set identically, i.e. μ_f = 0.05, λ = [1 − 1/(3L)]^L.
It can be seen that the VR-NMCFLMS algorithm has a faster convergence speed, because the regularization parameters are smaller when the filter just begins to work, and the regularized filter searching direction is closer to the filter searching direction corresponding to the newton method, so that the faster convergence speed is generated; when regularization parameters become large, the search direction of the regularized filter gradually approaches to the negative gradient direction, the convergence speed of the regularized filter becomes slow, and a stable state is easy to achieve.
If the regularization parameters of the original NMCFLMS algorithm are set to smaller values, the NMCFLMS algorithm will have a faster convergence speed, however, the NMCFLMS algorithm is prone to diverge due to the presence of additive noise and sensitivity of the NMCFLMS to the noise. For example, we set the regularization parameters of the VR-NMCFLMS algorithm to:
in the middle of
Whereas the regularization parameters of the NMCFLMS algorithm are set to:
the remaining parameters remain unchanged. As can be seen from fig. 5, the initial convergence speed of the NMCFLMS algorithm is increased, similar to the convergence speed of the VR-NMCFLMS algorithm. When the NPM value is reduced to about-9 dB, the convergence speed of the two algorithms is reduced, when the filter is iterated to about 5800 times, the NPM value of the two algorithms is reduced to about 22dB, the NMCFLMS algorithm starts to diverge, the VR-NMCFLMS algorithm is increased due to regularization parameters, the filter searching direction of the algorithm is deviated to the negative gradient direction, and the convergence speed of the filter is reduced, so that the filter stably works.
FIG. 6 compares the convergence performance of NMCFLMS and the proposed VR-NMCFLMS algorithm for blind identification of six acoustic channels when speech is used as the excitation signal in a white Gaussian noise environment, where L = 1024, μ_f = 0.05, and SNR = 25 dB. In order for the NMCFLMS algorithm to converge in a noisy environment, we set its regularization parameter to:
the regularization parameter of the VR-NMCFLMS algorithm is still set to δ 1 (m). It can be seen that the proposed VR-NMCFLMS algorithm has a faster convergence speed. Therefore, the provided regularization function can enable the VR-NMCFLMS algorithm to show better convergence performance under both white sequence excitation and voice excitation.
To verify the robustness of the proposed algorithm to noise, we evaluated the convergence performance of the NMCFLMS and VR-NMCFLMS algorithms under different signal-to-noise ratios. The excitation signal is a speech signal and the additive noise is white gaussian noise. The regularization parameters of the VR-NMCFLMS algorithm are set to:
wherein:
whereas the regularization parameters of the NMCFLMS algorithm are set to:
the remaining parameters remain unchanged.
Fig. 7 compares the NPM values after 10000 iterations of NMCFLMS and the proposed VR-NMCFLMS algorithm for blind identification of six acoustic channels at different SNRs, with white Gaussian additive noise and speech as the excitation, where L = 1024 and μ_f = 0.05. As can be seen from fig. 7, the VR-NMCFLMS algorithm provided by the present invention has better robustness to noise. Although the NPM values of the two algorithms are close when SNR = 15 dB, the NMCFLMS algorithm achieves the lower NPM value at the expense of convergence speed. Fig. 8 compares the NPM values after 2000 iterations of NMCFLMS and the proposed VR-NMCFLMS algorithm for blind identification of six acoustic channels at different SNRs under the same conditions. The results show that the VR-NMCFLMS algorithm has a higher convergence speed at all the tested signal-to-noise ratios.
Fig. 9 compares the mean square error of NMCFLMS and the proposed VR-NMCFLMS algorithm for blind identification of six time-varying acoustic channels with white Gaussian additive noise and speech as the excitation, where SNR = 25 dB; the regularization parameter of the proposed VR-NMCFLMS algorithm is still set to δ_3(m), and the regularization parameter of the NMCFLMS algorithm is:
Other parameters remain unchanged. It can be seen that when the acoustic channel transfer function changes, the MSE of the adaptive filtering algorithm jumps abruptly (black box in the figure). Since the proposed VR-NMCFLMS algorithm refreshes the regularization parameter, the corresponding MSE decreases rapidly. The corresponding adaptive filter convergence performance is shown in fig. 10, where L = 1024, μ_f = 0.05, and SNR = 25 dB. As can be seen from fig. 10, the VR-NMCFLMS algorithm of the present invention exhibits better performance in terms of both convergence speed and tracking speed.
Fig. 11 shows a performance comparison of the NMCFLMS and the proposed VR-NMCFLMS algorithm with snr=15 dB and experimental environment and algorithm parameters remaining unchanged. It can be seen that the regularization scheme provided by the invention enables the VR-NMCFLMS algorithm to be better adapted to different signal-to-noise ratio environments.
4.2.2 Speech dereverberation experiment results based on Acoustic channel Blind identification
Once the impulse responses of the acoustic channels have been blindly identified with the VR-NMCFLMS method, an equalization filter can be estimated with the MINT method, and speech dereverberation is achieved by deconvolution. This section experimentally evaluates the effectiveness of the proposed method in adaptive speech dereverberation. In the experimental verification, we employ four widely used speech dereverberation performance metrics: cepstral distance (CD), short-time objective intelligibility (STOI), frequency-weighted segmental signal-to-noise ratio (FWSNR), and log-likelihood ratio (LLR). For consistency of presentation, we define CD, STOI, FWSNR and LLR as relative performance indicators with respect to the original reverberant speech:
ΔCD = CD_original − CD, (52)
ΔSTOI = STOI − STOI_original, (53)
ΔFWSNR = FWSNR − FWSNR_original, (54)
ΔLLR = LLR_original − LLR, (55)
where the subscript "original" represents the corresponding indicator of the dereverberated front reverberant speech. It can be seen that the greater the values of these four relative performance indicators, the better the speech dereverberation performance.
Fig. 12, 13 show dynamic performance metrics for NMCFLMS and the proposed VR-NMCFLMS algorithm for speech dereverberation under time-invariant and snr=25 dB conditions, the convergence performance of both algorithms corresponding to fig. 6. The results show that the proposed VR-NMCFLMS algorithm has a better dereverberation effect during filter transients due to its fast convergence.
Fig. 14, 15 show dynamic performance metrics for the use of the NMCFLMS and the proposed VR-NMCFLMS algorithm for speech dereverberation under time-invariant and snr=15 dB conditions. Under the condition of low signal to noise ratio, in order to ensure that the NMCFLMS algorithm converges, the regularization parameters are larger, so that steady-state errors are smaller, but at the cost of sacrificing the convergence speed of the filter, the corresponding NMCFLMS algorithm does not converge in a shorter time, and the proposed VR-NMCFLMS algorithm can converge rapidly, so that better dereverberation performance is obtained.
Fig. 16 and 17 show dynamic performance metrics for NMCFLMS and the proposed VR-NMCFLMS algorithm for speech dereverberation under time-varying and snr=25 dB conditions, the convergence performance of both algorithms corresponding to fig. 10. It can be seen that at 90 seconds, both algorithms degrade in performance index when used for speech dereverberation due to the variation in the multi-channel system transfer function. At this time, compared with the NMCFLMS algorithm, the proposed VR-NMCFLMS algorithm does not achieve a larger performance improvement (black box in the figure) of the speech dereverberation algorithm, because after the multi-channel system is changed, the filter vector still has a certain similarity with the impulse response vector after the system is changed, which is also the reason why the NPM value is not reduced to 0dB after the system is changed in fig. 10, so that the speech dereverberation effect caused by the two algorithms is similar. Fig. 18 and 19 show dynamic performance indicators for NMCFLMS and VR-NMCFLMS algorithms proposed by the present invention for speech dereverberation under time-varying conditions with snr=15 dB, the convergence performance of these two algorithms corresponding to fig. 11. It can be seen that under the condition of low signal-to-noise ratio, the corresponding dereverberation algorithm obtains significant performance improvement due to the faster convergence speed and tracking speed of the VR-NMCFLMS algorithm provided by the invention.
5. Summary
The invention provides a voice dereverberation method based on multi-channel blind identification and multi-channel equalization. Based on the prior art, a variable regularization parameter is designed, and information such as signal-to-noise ratio, output signal energy, filter length and the like is integrated into the variable regularization parameter, so that the algorithm has robustness on additive noise, voice non-stationarity and the like. In order to enable the method to have better tracking performance under time-varying conditions, the invention provides a mechanism for refreshing regularized parameters according to mean square error. Experimental results show that the VR-NMCFLMS algorithm provided by the invention can obtain faster convergence speed and tracking speed in a noise environment. The method can enable the voice dereverberation algorithm based on channel equalization to obtain better dereverberation effect, and particularly the dereverberation performance is remarkably improved when the adaptive filter is in a transient state under a low signal-to-noise ratio environment.
While the foregoing describes illustrative embodiments of the present invention to facilitate the understanding of the present invention by those skilled in the art, it should be understood that the present invention is not limited to the scope of these embodiments; various changes that remain within the spirit and scope of the present invention as defined and determined by the appended claims are to be regarded as protected.

Claims (1)

1. A speech dereverberation method based on multi-channel blind recognition and multi-channel equalization, comprising the steps of:
(1) Initializing
Initialize the length-L filter vector of the k-th channel at time 0 as:
where T denotes transposition, M is the number of channels, and L is the length of the acoustic channel impulse response vector;
initialize the power spectrum matrix of the k-th channel at time 0 as:
P_k(0) = 0_{L×L}
initialize the reverberant signal variance as:
(2) Collect the sound signals of the M microphones. The samples of the i-th channel are denoted x_i(n), i = 1, 2, …, M, n = 0, 1, 2, …, where n is the sample time index. Construct the signal vector x_i(m), i = 1, 2, …, M, corresponding to block time index m, i.e. the output signal vector of the i-th microphone:
x_i(m) = [x_i(mL−L)  x_i(mL−L+1)  …  x_i(mL+L−1)]^T
(3) Starting from the block time index m = 1, obtain the dereverberated speech vector:
3.1) Obtain the variable regularization function δ(m)
3.1.1) Calculate the microphone output signal variance:
where ‖·‖_2 denotes the 2-norm of a vector and the remaining coefficient is a smoothing factor;
3.1.2) Calculate the reverberant signal variance:
where λ_1 is a forgetting factor and SNR is the output signal-to-noise ratio;
3.1.3) Calculate the bias factor b(m) of the regularization function:
where v denotes a coefficient factor;
3.1.4) Calculate the variable regularization function δ(m):
where α is a parameter controlling the overall range of the function curve, κ is a parameter controlling the steepness of the curve (the larger κ, the steeper the transition), and ξ is a parameter controlling the position of the rising point of the curve, with initial value ξ_0;
3.2) Obtain the filter vector
3.2.1) Calculate the frequency-domain extended filter vector:
where F_{2L×2L} is a Fourier matrix of size 2L×2L and 0 is a zero column vector of length L;
3.2.2) Calculate the spectral matrix of the input signal:
where diag[·] denotes expanding a vector into a diagonal matrix;
3.2.3) Calculate the power spectrum matrix P_k(m):
where λ_2 is a forgetting factor and the superscript * denotes the complex conjugate;
3.2.4) Calculate the inverse of the power spectrum matrix:
where I_{2L×2L} is an identity matrix of size 2L×2L;
3.2.5) Calculate the frequency-domain error vector e_ij(m) of the i-th and j-th channels:
where F_{L×L} is a Fourier matrix of size L×L, 0_{L×L} is a zero matrix of size L×L, I_{L×L} is an identity matrix of size L×L, and F^{-1}_{2L×2L} is an inverse Fourier matrix of size 2L×2L;
3.2.6) Calculate the frequency-domain extended filter vector:
where ρ is the step-size factor of the adaptive filter, the error vector is extended in the frequency domain, F_{2L×2L} is a Fourier matrix of size 2L×2L, and F^{-1}_{L×L} is an inverse Fourier matrix of size L×L;
3.2.7) Calculate the filter vector:
where the subscript 1:L denotes taking the first L values;
3.3) Obtain the dereverberated speech vector
3.3.1) Construct the impulse response matrix:
where the impulse response matrix is of size L_c × L_g, with L_c = L + L_g − 1 and L_g the length of the equalization filter vector;
3.3.2) Construct the multi-channel impulse response matrix;
3.3.3) Calculate the equalization filter matrix g(m):
where g_k(m) is the equalization filter vector of the k-th channel and d is the desired equalized impulse response vector;
3.3.4) Calculate the dereverberated speech vector:
where conv(·) denotes the convolution function;
3.3.5) Detect the mean square error MSE(m)
First, calculate the mean square error MSE(m):
where the superscript H denotes the conjugate transpose; then make the update decision:
if MSE(m) is greater than the upper threshold γ, refresh the variable regularization function δ(m) by setting the parameter ξ = m + ξ_0, where ξ_0 is the initial value; if MSE(m) is not greater than the upper threshold γ, keep the parameter ξ unchanged;
set m = m + 1 and return to step 3.1.1 to compute the dereverberated speech vector for the next block time.
CN202311670403.2A 2023-12-05 2023-12-05 Voice dereverberation method based on multi-channel blind identification and multi-channel equalization Pending CN117636894A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311670403.2A CN117636894A (en) 2023-12-05 2023-12-05 Voice dereverberation method based on multi-channel blind identification and multi-channel equalization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311670403.2A CN117636894A (en) 2023-12-05 2023-12-05 Voice dereverberation method based on multi-channel blind identification and multi-channel equalization

Publications (1)

Publication Number Publication Date
CN117636894A 2024-03-01

Family

ID=90033732

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311670403.2A Pending CN117636894A (en) 2023-12-05 2023-12-05 Voice dereverberation method based on multi-channel blind identification and multi-channel equalization

Country Status (1)

Country Link
CN (1) CN117636894A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117896467A (en) * 2024-03-14 2024-04-16 苏州大学 Echo cancellation method and system for stereo telephone communication
CN117896467B (en) * 2024-03-14 2024-05-31 苏州大学 Echo cancellation method and system for stereo telephone communication


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination