CN116758928A

CN116758928A - MCLP language dereverberation method and system based on time-varying forgetting factor

Info

Publication number: CN116758928A
Application number: CN202310271405.8A
Authority: CN
Inventors: 冯涛; 吴礼福
Original assignee: Nanjing University of Information Science and Technology
Current assignee: Nanjing University of Information Science and Technology
Priority date: 2023-03-20
Filing date: 2023-03-20
Publication date: 2023-09-15

Abstract

The application discloses an MCLP language dereverberation method and system based on a time-varying forgetting factor in the technical field of voice acquisition, and the method comprises the following steps: collecting signals, and converting the signals into a time-frequency domain; introducing the acquired time-frequency domain signals into an MCLP model to obtain a prediction coefficient matrix of a linear prediction filter; calculating the power spectral density of the acquired signals to obtain a weighting coefficient of a reverberation prediction filter, introducing matrix QR decomposition and adding a time-varying forgetting factor on the basis of a least square method when solving the prediction coefficient, improving the dereverberation capacity and stability, and calculating the late reverberation according to the prediction coefficient matrix; and (3) repeating the step (3) for each frame of signal, substituting the late reverberation into the expression in the step (2) and performing inverse short-time Fourier transform on the result to obtain a desired signal, namely a dereverberated speech signal. The method has the advantages of rapid convergence speed, improved system stability, better voice reverberation elimination performance and better audio quality.

Description

MCLP language dereverberation method and system based on time-varying forgetting factor

Technical Field

The application belongs to the technical field of voice acquisition, and particularly relates to an MCLP language dereverberation method and system based on a time-varying forgetting factor.

Background

The voice recording is carried out in the closed space (such as a recording studio), the communication can be reflected by walls, floors and ceilings, the collected voice signals inevitably contain reverberation, the reverberation can reduce the recognition degree and quality of the voice signals, the hearing aid system is affected, the performance of the voice automatic recognition system is reduced, and inconvenience is brought to listeners. Therefore, research into the dereverberation algorithm has become very active in recent years.

The algorithm for speech dereverberation broadly comprises three classes: inverse filtering, spectral enhancement, and probabilistic model-based methods. Inverse filtering is a frequently studied dereverberation scheme that relies on room acoustic properties with certain limitations in applicability. In addition, microphone array techniques are also well known in multichannel speech dereverberation, which can suppress reverberation to some extent by spatially distinguishing sounds in different directions. The principle of multi-channel linear prediction is to design a linear predictor to estimate the reverberant part of speech, and subtracting the estimated part from the reverberant speech can estimate the desired speech signal. Among other things, jukic et al propose a MCLP and WPE based speech dereverberation algorithm with sparse priors that includes two multi-channel schemes: WPE with Complex Generalized Gaussian (CGG) a priori and WPE using iterative re-weighted least squares (IRLS) methods. Both schemes have significant performance in reverberant environments.

However, the RLS algorithm has the problems of potential instability caused by an increased condition number in the matrix inverse transformation process, slow convergence when the system suddenly changes due to the use of a constant forgetting factor, and the like. The former can be solved with QR decomposition, while the latter is usually solved with adaptive forgetting factors. However, the VFF control design of current algorithms relies primarily on estimation errors.

Disclosure of Invention

Aiming at the defects of the prior art, the application aims to provide an MCLP language dereverberation method and system based on a time-varying forgetting factor so as to solve the problems in the prior art.

The aim of the application can be achieved by the following technical scheme:

an MCLP language dereverberation method based on a time-varying forgetting factor, comprising the steps of:

step 1, firstly, carrying out signal acquisition by using a microphone array in a single sound source environment, and converting the signals into a time-frequency domain through short-time Fourier transform;

step 2, introducing the time-frequency domain signals acquired in the step 1 into an MCLP model to obtain expressions of a prediction coefficient matrix of a linear prediction filter related to the expected signals, the late reverberation and the original acquired signals;

step 3, calculating the power spectral density of the acquired signal to obtain a weighting coefficient of a reverberation prediction filter, introducing matrix QR decomposition and adding a time-varying forgetting factor on the basis of a least square method when solving the prediction coefficient, improving the dereverberation capacity and stability, and calculating the late reverberation according to the prediction coefficient matrix;

and 4, repeating the step 3 for each frame of signal, substituting the late reverberation into the expression of the step 2, and performing the inverse short-time Fourier transform on the result to obtain a desired signal, namely a dereverberated voice signal.

Preferably, the signals captured by the microphone capturing array in the step 1 are as follows:

y(n)＝x(n)+v(n)

where y (n) represents the signal captured by the microphone, x (n) represents the speech signal, and v (n) is additive noise;

the signals captured by the mth microphone are expressed as follows, using a short-time fourier transform of the time signals:

x _m (k，n)＝d _m (k，n)+u _m (k，n)

wherein x is _m (k, n) represents the signal of the mth microphone at the time frame of n and the frequency of k, d _m (k, n) signals representing early reflections of speech and direct sound, which need to be preserved, as desired signals, u _m (k, n) represents late reverberation.

Preferably, the required estimated signal obtained by the MCLP model in step 2 is as follows:

wherein:

where "∈" represents an estimated value, H represents a complex conjugate,is a matrix of prediction coefficients, τ is the prediction delay, the direct speech part and the earliest reflected part remain as the desired speech components, and the direct speech signal is obtained by subtracting u (n) from the mix.

Preferably, in the step 3, the prediction filter is obtained by maximizing the sparsity of the desired speech signal in the time-frequency domain, as follows:

wherein w (n) is used to represent a weighting coefficient, and gamma is between (0, 1) and is expressed as a forgetting factor;

the weighting coefficients are expressed as follows:

where ε is an infinitesimal number, w (n) is guaranteed to be a non-negative number, p represents the shape parameter, and d (n) has the following power spectral density:

wherein:

in the method, in the process of the application,representing the power spectral density of the signal, alpha representing the attenuation coefficient, td representing the duration of the earliest reflected speech portion, T ₆₀ Represents the reverberation time, n _τ Representing the corresponding delay per frame, β is a smoothing factor.

Preferably, w (n) represents as follows:

the estimate of late reverberation is as follows:

recursive solution using least squaresExpressed as:

preferably, the time-varying forgetting factor control is based on an approximate derivative of the filter coefficients as follows:

wherein w is _i (n) denotes the tap of the i-th filter,is its approximate inverse in time, η is the calculated smoothed tap weight +.>Forgetting factor of|·|| ₁ L representing a vector ₁ Norms, by->Mapping the convergence state of the adaptive filter to the expected variance of the time-varying forgetting factor gamma (n), calculating +.>The absolute value of the approximate derivative of (2) is G _c (n)：

And calculates the average thereof over a time window of time length TGet->Average value reuse->To indicate, will->And->Normalizing to obtain->With gamma _L And gamma _H To represent the upper and lower bounds, and γ (n) at each iteration update is as follows:

an MCLP language dereverberation system based on a time-varying forgetting factor, comprising:

the voice acquisition module is used for simulating human ears through a microphone and carrying out framing processing on the received signals;

the late reverberation estimation module is used for calculating to obtain a prediction coefficient matrix;

and the expected signal calculation module is used for calculating an expected signal.

Preferably, the late reverberation estimation module includes a power spectral density calculation module and a filter coefficient prediction module.

The application has the beneficial effects that:

the MCLP language dereverberation method based on the time-varying forgetting factor uses the least square method to calculate the predictive linear matrix, has high convergence rate, adds matrix QR decomposition on the basis of the least square method in the voice signal processing process, improves the stability of the system, uses the time-varying forgetting factor, ensures that the time-varying forgetting factor has better voice dereverberation performance, and the processed voice signal has higher PESQ score and better audio quality.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described, and it will be obvious to those skilled in the art that other drawings can be obtained according to these drawings without inventive effort.

FIG. 1 is a schematic diagram of an MCLP dereverberation system based on a time-varying forgetting factor in an embodiment of the present application;

FIG. 2 is a graph of the effect of algorithmic dereverberation in an embodiment of the present application;

FIG. 3 is a ΔMFCC distance improvement graph for three algorithms in an embodiment of the present application;

FIG. 4 is a graph of the average peSQ score of speech signals before and after dereverberation by different methods in an embodiment of the present application.

Detailed Description

The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

The embodiment provides an MCLP language dereverberation method based on a time-varying forgetting factor, which comprises the following steps:

step 1, carrying out signal acquisition by using a microphone array in a single sound source environment, and converting the signals into a time-frequency domain through short-time Fourier transform (STFT);

the voice signal generated by this sound source is captured by M microphones, and the signals captured by the microphones are expressed as follows:

y(n)＝x(n)+v(n)(1)

where y (n) represents the signal captured by the microphone, x (n) represents the speech signal, and v (n) is additive noise. Let v (n) =0.

x _m (k，n)＝d _m (k，n)+u _m (k，n) (2)

wherein x is _m (k, n) represents the signal of the mth microphone at the time frame of n and the frequency of k, d _m (k, n) signals representing the early reflections of speech and the direct sound are signals that need to be preserved, called desired signals. u (u) _m (k, n) represents late reverberation.

Step 2, introducing the signals acquired in the step 1 into an MCLP model to obtain expressions of a prediction coefficient matrix of a linear prediction filter related to the expected signals, the early reflection signals and the original acquired signals;

in the MCLP model, u (k, n) represents the following formula:

wherein L is _g Representing the length of the MCLP filter, τ is taken as the prediction delay in the time domain, gm is the prediction coefficient of the linear prediction filter, k is omitted after each frequency point is calculated, and the formula is as follows:

x(n)＝d(n)+u(n) (4)

the estimation signal required by the formula (3) and the formula (4) is:

wherein:

where "∈" represents an estimated value and H represents a complex conjugate.Is a matrix of prediction coefficients. i is the prediction delay and the direct speech part and the earliest reflected part remain as the desired speech components. The direct speech signal may be obtained by subtracting u (n) from the mix.

using the dereverberation model represented by equations (5) (6), a prediction coefficient matrix needs to be solvedObtaining a prediction filter by maximizing sparsity of a desired speech signal in a time-frequency domain, as follows

Where w (n) is used to represent the weighting factor; the value of gamma is between (0, 1), which is expressed as forgetting factor; the weighting coefficients can in turn be expressed as follows:

where ε is an infinitesimal number, for ensuring that w (n) is a non-negative number and p represents the shape parameter, and assuming that the late reverberation obeys the exponential distribution, the power spectral density of d (n) can be expressed as follows:

wherein:

in the method, in the process of the application,represents the power spectral density of the signal, alpha represents the attenuation coefficient, T _d For representing the duration of the earliest reflected speech part, T ₆₀ Represents the reverberation time, n _τ Representing the corresponding delay per frame, β is a smoothing factor.

Due toSo w (n) can be represented as follows:

from the above resultsCan be measured by taking in (8)Calculate->The estimate of late reverberation can be expressed as follows:

recursive solution using least squaresCan be expressed as:

the matrix QR decomposition combined with least squares (QR-RLS) algorithm matrix is shown in table 1:

TABLE 1

The proposed time-varying forgetting factor control scheme is based on the approximate derivative of the filter coefficients as follows:

wherein w is _i (n) denotes the tap of the i-th filter,is its approximate inverse in time. Eta is the calculated smoothed tap weight +.>Forgetting factor of (c). I.I ₁ L representing a vector ₁ Norms. When the algorithm converges to its steady state,gradually decreasing from its initial value to a very small value, but being quite unstable in tracking the impulse response of the time-varying channel. Therefore, go through +.>The convergence state of the adaptive filter is mapped to the desired variance of the time-varying forgetting factor y (n). Calculate->The absolute value of the approximate derivative of (2) is G _c (n)：

And calculates the average thereof over a time window of time length TGet->Average value reuse->To represent. Will->And->Normalized, we obtained +.>This is a more stable measure of convergence of the adaptive filter, using gamma _L And gamma _H To represent the upper and lower bounds, and γ (n) at each iteration update is as follows:

substituting formula (32) for beta in formula (8) results in a MCLP dereverberation algorithm based on a time-varying forgetting factor.

Wherein, the system parameters are shown in table 2:

TABLE 2

Parameter name	(symbol)	Value of
			Sampling rate	f _s	16kHz
Window length	wlen	512(32ms)
			Frame shifting	wlen/4	128(8ms)
Filter order	Lg	30

An MCLP language dereverberation system based on a time-varying forgetting factor, the structure of which is shown in fig. 1, comprises:

1. the voice acquisition module: the human ear is simulated by two microphones with a spacing of 15cm, and the received signals are subjected to framing processing, wherein the signals can be expressed as x (n) =d (n) +u (n), x (n) represents signals captured by the microphones, d (n) represents voice signals, and u (n) represents post reverberation. Introducing MCLP model to obtain reverberation signal asThe desired signal thus required can be expressed asWherein->Is a prediction coefficient matrix;

2. and a late reverberation estimation module: the module comprises a power spectral density calculation module and a filterAnd a coefficient prediction module. The power spectral density calculation module calculates the power spectral densities of d (n), x (n) and u (n) The filter coefficient prediction module obtains->Substituted into-> Obtaining a prediction coefficient matrix->

3. The expected signal calculation module: calculating reverberation signalSubtracting the reverberation signal u (n) from the signal x (n) captured by the microphone to obtain a desired signal +.>

And (3) performing dereverberation effect comparison:

the reverberations signal was dereverberated using RLS algorithm and VFF-QR-RLS algorithm, with a forgetting factor γ of 0.99, and the result is shown in fig. 2.

It can be found that the signals processed by RLS and VFF-QR-RLS remove partial reverberation below 4kHz compared with the original signals, and the signals processed by VFF-QR-RLS have more obvious dereverberation effect compared with the signals processed by RLS.

The performance of the algorithm was evaluated with Mel frequency cepstral coefficient distance improvement (Δmfcc), which is a good spectral envelope parameter. Because the MFCC is at a certain levelThe auditory characteristics of human ears are simulated to a certain extent, so that the distortion measure based on the Mel cepstrum coefficient can accurately represent the distortion of the reverberation voice. Using pure voice as reference signal, calculating MFCC distortion distance between reference signal and reverberant signal and dereverberated signal, and recording as MFCC _in And MFCC _out . The difference then yields a Mel-frequency cepstrum distance improvement (Δmfcc), which, when larger, indicates a better dereverberation effect.

The forgetting factor gamma of the RLS and QR-RLS algorithm is 0.96, and the forgetting factor gamma of the VFF-QR-RLS is _L ＝0.96，γ _H =0.99. The simulation results are shown in fig. 3.

As can be seen from fig. 3 (upper) and fig. 3 (middle), QR-RLS has the same effect as RLS, and the condition number is reduced by the QR decomposition method, and also can exhibit better numerical properties. Comparing the circled portions of the graph, the VFF-QR-RLS algorithm can be found to be faster and more stable, and has better numerical stability.

To further evaluate the performance and dereverberation effect of the algorithm, the dereverberation speech in the experiment was also evaluated using a speech quality perception evaluation, the final score being the average of the experimental results of 10 sets of different simulated reverberation samples, the dereverberation signal scores of the different algorithms being shown in fig. 4 (reverberation time T ₆₀ =300 ms,600ms,900 ms), it can be seen from the data in the figure that the score of the VFF-QR-RLS algorithm is highest among the different degrees of reverberation, which also verifies the effectiveness of the algorithm.

In the description of the present specification, the descriptions of the terms "one embodiment," "example," "specific example," and the like, mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Finally, it should be noted that: the above embodiments are only for illustrating the technical aspects of the present application and not for limiting the same, and although the present application has been described in detail with reference to the above embodiments, it should be understood by those of ordinary skill in the art that: modifications and equivalents may be made to the specific embodiments of the application without departing from the spirit and scope of the application, which is intended to be covered by the claims.

Claims

1. An MCLP language dereverberation method based on a time-varying forgetting factor, comprising the steps of:

2. The MCLP language dereverberation method based on a time-varying forgetting factor of claim 1, wherein the signals captured by the microphone array in step 1 are as follows:

y(n)＝x(n)+v(n)

x _m (k,n)＝d _m (k,n)+u _m (k,n)

wherein x is _m (k, m) represents the signal of the mth microphone at the time frame of n and the frequency of k, d _m (k, n) signals representing early reflections of speech and direct sound, which need to be preserved, as desired signals, u _m (k, n) represents late reverberation.

3. The MCLP language dereverberation method based on the time-varying forgetting factor of claim 2, wherein the required estimated signal obtained by the MCLP model in step 2 is as follows:

wherein:

4. A MCLP language dereverberation method based on a time-varying forgetting factor as claimed in claim 3, wherein the prediction filter is obtained in step 3 by maximizing the sparsity of the desired speech signal in the time-frequency domain by:

the weighting coefficients are expressed as follows:

wherein:

5. A time-varying forgetting factor based MCLP language dereverberation method as in claim 4, wherein w (n) is expressed as follows:

the estimate of late reverberation is as follows:

recursive solution using least squaresExpressed as:

6. a MCLP language dereverberation method based on a time-varying forgetting factor as claimed in claim 5, wherein the time-varying forgetting factor control is based on an approximate derivative of a filter coefficient as follows:

And calculates the average thereof over a time window of time length TGet->Average reuse of (2)To indicate, will->And->Normalizing to obtain->With gamma _L And gamma _H To represent the upper and lower bounds, and γ (n) at each iteration update is as follows:

7. an MCLP language dereverberation system based on a time-varying forgetting factor, comprising:

8. An MCLP speech dereverberation system based on a time-varying forgetting factor as in claim 7, wherein the late reverberation estimation module comprises a power spectral density calculation module and a filter coefficient prediction module.

9. A time-varying forgetting factor based MCLP language dereverberation system as in claim 8The system is characterized in that the signal is expressed as x (n) =d (n) +u (n), and the reverberant signal is obtained by introducing an MCLP modelThe desired signal is represented asIs a prediction coefficient matrix;

the power spectral density calculation module calculates the power spectral densities of d (n), x (n) and u (n) The filter coefficient prediction module obtains->Substituted into-> Obtaining a prediction coefficient matrix->

The expected signal calculation module calculates a reverberation signalSubtracting the reverberation signal u (n) from the signal x (n) captured by the microphone to obtain a desired signal +.>

10. A time-varying forgetting factor-based MCLP language dereverberation controller storing a time-varying forgetting factor-based MCLP language dereverberation system program to run the time-varying forgetting factor-based MCLP language dereverberation system of claims 7-9.