CN113160842A - Voice dereverberation method and system based on MCLP - Google Patents
- Publication number: CN113160842A (application CN202110247855.4A)
- Authority
- CN
- China
- Prior art keywords
- reverberation
- voice
- signal
- energy ratio
- spectral density
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications (all under G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation)
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0232—Processing in the frequency domain
- G10L21/0264—Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
- G10L2021/02082—Noise filtering the noise being echo, reverberation of the speech
Abstract
The invention relates to the field of speech signal processing, and in particular to an MCLP-based speech dereverberation method and system. The method comprises the following steps: performing frame-by-frame processing on reverberant speech collected in a reverberant environment to obtain the desired signal of the current frame; acquiring a speech-to-reverberation energy ratio and a signal-to-noise estimate of the desired signal and substituting them into a geometric spectral subtraction formula to perform spectral subtraction on the reverberant speech, obtaining a first power spectral density of the desired signal, where the speech-to-reverberation energy ratio is positively correlated with a first energy ratio (the energy ratio of the reverberant speech to the reverberation component) and the signal-to-noise estimate is positively correlated with a second energy ratio (the energy ratio of the desired speech to the reverberation component); acquiring the dereverberated speech signal from the first power spectral density; and storing the first power spectral density of the current frame as the historical first power spectral density of the next frame, updating it frame by frame until the complete dereverberated speech signal is obtained. Embodiments of the invention can obtain higher-quality dereverberated speech.
Description
Technical Field
The invention relates to the technical field of voice signal processing, in particular to a voice dereverberation method and system based on MCLP.
Background
In daily life, indoor recording scenarios are increasingly common: indoor meetings, auditorium speeches, live webcasts, intelligent voice assistants, and so on. In these scenarios, the speech signal collected by a microphone is often mixed with a serious reverberation component. Reverberation is an acoustic phenomenon arising in enclosed spaces: because sound propagates along multiple paths and reflects off walls and object surfaces, the collected speech signal is smeared by delay differences, which seriously degrades the clarity of the speech spectrum. Studies have shown that early reverberation within 50 milliseconds helps to improve speech intelligibility and fullness, but excessive late reverberation severely degrades speech signal quality.
In practice, the inventors found that the above prior art has the following disadvantages:
for the Multi-Channel Linear Prediction (MCLP) algorithm in speech dereverberation, the clean speech signal is modeled as a time-varying Gaussian process, so the algorithm's performance depends heavily on the accuracy of the estimated power spectral density (PSD) of the clean speech. The original online MCLP algorithm estimates this PSD directly from the observed reverberant signal rather than from clean speech, so the estimate is inaccurate and the dereverberation effect suffers. Some improved variants of the algorithm first estimate the PSD of the late reverberation component and then subtract it from the observed PSD by spectral subtraction to obtain an estimate of the clean-speech PSD. However, because the reverberation PSD estimate is itself inaccurate, direct spectral subtraction can over-subtract when the estimate is too large, forcing too many spectral bins to zero and causing spectral distortion and musical noise.
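The over-subtraction problem described above can be seen in a few lines of code. This is a minimal, self-contained sketch (not from the patent) of plain power spectral subtraction with an overestimated reverberation PSD: bins where the estimate exceeds the observed power are clamped to zero, producing the spectral holes that give rise to musical noise.

```python
import numpy as np

# Observed reverberant-speech power and an (over)estimated reverberation
# power for four frequency bins; the numbers are made up for illustration.
obs_psd = np.array([1.0, 0.8, 0.5, 1.2])
reverb_psd_est = np.array([0.6, 1.0, 0.9, 0.7])

# Direct spectral subtraction: any bin where the reverberation estimate
# exceeds the observation is forced to exactly zero (over-subtraction).
clean_psd_est = np.maximum(obs_psd - reverb_psd_est, 0.0)
num_zeroed = int(np.sum(clean_psd_est == 0.0))
print(clean_psd_est, num_zeroed)  # two of four bins are zeroed out
```

The zeroed bins are what produce audible musical noise after resynthesis, which is the motivation for the geometric spectral subtraction used below.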
Disclosure of Invention
In order to solve the above technical problems, an object of the present invention is to provide a method and a system for dereverberating a speech based on MCLP, wherein the adopted technical solution is as follows:
in a first aspect, an embodiment of the present invention provides a method for voice dereverberation based on MCLP, including the following steps:
performing framing processing on reverberant speech collected in a reverberant environment to obtain a desired signal of the current frame;
acquiring a speech-to-reverberation energy ratio and a signal-to-noise estimate of the desired signal, and substituting them into a geometric spectral subtraction formula to perform spectral subtraction on the reverberant speech, obtaining a first power spectral density of the desired signal; the speech-to-reverberation energy ratio is positively correlated with the first energy ratio, and the signal-to-noise estimate is positively correlated with the second energy ratio; the first energy ratio is the energy ratio of the reverberant speech to the reverberation component; the second energy ratio is the energy ratio of the desired speech to the reverberation component;
obtaining a dereverberated speech signal according to the first power spectral density;
and storing the first power spectral density of the current frame as the historical first power spectral density of the next frame, and updating the first power spectral density of the next frame until all the dereverberated voice signals are obtained.
Preferably, the step of acquiring the desired signal includes:
calculating a prediction coefficient through the mathematical representation of the reverberation signal in a time-frequency domain;
and obtaining a first prediction coefficient matrix according to the prediction coefficient, and calculating the expected signal by using the first prediction coefficient matrix and the reverberation voice subjected to framing processing.
Preferably, the method for calculating the speech reverberation energy ratio includes:
and obtaining the voice reverberation energy ratio of the current frame by performing smooth calculation on the first energy ratio and the historical voice reverberation energy ratio.
Preferably, the method for calculating the signal-to-noise estimation value comprises:
$$R_{d/r}(t,l)=\beta_2\,\frac{|d'_{t-1,l}|^{2}}{\hat\sigma_r^{2}(t,l)}+(1-\beta_2)\,\max\!\big(R_{x/r}(t,l)-1,\,0\big)$$

wherein $R_{d/r}$ represents the signal-to-noise estimate; $|d'_{t,l}|^{2}/\hat\sigma_r^{2}(t,l)$ represents the second energy ratio, with $d'_{t,l}$ the estimated desired-signal bin amplitude and $|d'_{t,l}|^{2}$ the energy of the desired signal; $\hat\sigma_r^{2}(t,l)$ represents the second power spectral density of the reverberation component; $\beta_2$ represents the second smoothing factor; and $R_{x/r}$ represents the speech-to-reverberation energy ratio.
Preferably, the step of obtaining the dereverberated speech signal includes:
obtaining an expected signal frequency point at each channel of the current frame by using a weighted recursive least square formula according to the first power spectral density;
and carrying out short-time Fourier inverse transformation on the expected signal frequency point to obtain the voice signal after the reverberation is removed.
In a second aspect, another embodiment of the present invention provides an MCLP-based speech dereverberation system, which includes the following modules:
the reverberation voice preprocessing module is used for performing framing data processing on the collected reverberation voice of the reverberation environment to obtain an expected signal of the current frame;
the first power spectral density acquisition module, configured to acquire the speech-to-reverberation energy ratio and the signal-to-noise estimate of the desired signal and substitute them into the geometric spectral subtraction formula to perform spectral subtraction on the reverberant speech, obtaining the first power spectral density of the desired signal; the speech-to-reverberation energy ratio is positively correlated with the first energy ratio, and the signal-to-noise estimate is positively correlated with the second energy ratio; the first energy ratio is the energy ratio of the reverberant speech to the reverberation component; the second energy ratio is the energy ratio of the desired speech to the reverberation component;
the voice dereverberation module is used for acquiring a dereverberated voice signal according to the first power spectral density;
and the first power spectral density updating module is used for storing the first power spectral density of the current frame, taking the first power spectral density as the historical first power spectral density of the next frame, and updating the first power spectral density of the next frame until all the dereverberated voice signals are obtained.
Preferably, the reverberation voice preprocessing module comprises:
the prediction coefficient calculation module is used for calculating a prediction coefficient through the mathematical representation of the reverberation signal in a time-frequency domain;
and the expected signal calculation module is used for obtaining a first prediction coefficient matrix according to the prediction coefficient and calculating the expected signal by using the first prediction coefficient matrix and the reverberation voice subjected to framing processing.
Preferably, the first power spectral density acquisition module includes:
and the voice reverberation energy ratio acquisition module is used for performing smooth calculation on the first energy ratio and the historical voice reverberation energy ratio to obtain the voice reverberation energy ratio of the current frame.
Preferably, the first power spectral density acquisition module includes:
a signal-to-noise estimate calculation module configured to calculate the signal-to-noise estimate:
$$R_{d/r}(t,l)=\beta_2\,\frac{|d'_{t-1,l}|^{2}}{\hat\sigma_r^{2}(t,l)}+(1-\beta_2)\,\max\!\big(R_{x/r}(t,l)-1,\,0\big)$$

wherein $R_{d/r}$ represents the signal-to-noise estimate; $|d'_{t,l}|^{2}/\hat\sigma_r^{2}(t,l)$ represents the second energy ratio, with $d'_{t,l}$ the estimated desired-signal bin amplitude and $|d'_{t,l}|^{2}$ the energy of the desired signal; $\hat\sigma_r^{2}(t,l)$ represents the second power spectral density of the reverberation component; $\beta_2$ represents the second smoothing factor; and $R_{x/r}$ represents the speech-to-reverberation energy ratio.
Preferably, the voice dereverberation module includes:
the expected signal frequency point acquisition module is used for acquiring an expected signal frequency point at each channel of the current frame by using a weighted recursive least square formula according to the first power spectral density;
and the dereverberation voice signal calculation module is used for carrying out short-time Fourier inverse transformation on the expected signal frequency point to obtain the voice signal after dereverberation.
The embodiment of the invention has the following beneficial effects:
by combining geometric spectral subtraction and MCLP algorithm, the problem of spectral over-subtraction caused by spectral subtraction is solved, the dereverberation performance of the MCLP algorithm is improved, and high-quality dereverberation voice can be obtained.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present invention; other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is a flowchart of an MCLP-based speech dereverberation method according to an embodiment of the present invention;
FIG. 2 is a diagram of a speech time domain waveform of an original speech when the reverberation time is 0.8s and the number of channels is 4 according to an embodiment of the present invention;
FIG. 3 is a diagram of a speech time domain waveform of speech processed by the MCLP algorithm according to an embodiment of the present invention when the reverberation time is 0.8s and the number of channels is 4;
FIG. 4 is a time domain waveform diagram of speech processed by the MCLP-based speech dereverberation method according to an embodiment of the present invention when the reverberation time is 0.8s and the number of channels is 4;
FIG. 5 is a diagram of a speech spectrum of an original speech with reverberation time of 0.8s and channel number of 4 according to an embodiment of the present invention;
fig. 6 is a diagram of a speech spectrum of a speech processed by the MCLP algorithm according to an embodiment of the present invention when the reverberation time is 0.8s and the number of channels is 4;
fig. 7 is a diagram of a voice spectrum of a voice processed by the MCLP-based voice dereverberation method according to an embodiment of the present invention when a reverberation time is 0.8s and a channel number is 4;
FIG. 8 is a line graph of quality assessment by subjective speech quality assessment, at different reverberation times, of the original reverberant speech, the speech processed by the MCLP algorithm, and the speech processed by the MCLP-based speech dereverberation method, according to an embodiment of the present invention;
FIG. 9 is a line graph of quality assessment by the speech-to-reverberation modulation energy ratio, at different reverberation times, of the same three signals, according to an embodiment of the present invention;
FIG. 10 is a line graph of quality assessment by the weighted segmental direct-to-reverberant energy ratio, at different reverberation times, of the same three signals, according to an embodiment of the present invention;
FIG. 11 is a line graph of quality assessment by cepstral distance, at different reverberation times, of the same three signals, according to an embodiment of the present invention;
FIG. 12 is a line graph of quality assessment by subjective speech quality assessment, under different numbers of channels, of the original reverberant speech, the speech processed by the MCLP algorithm, and the speech processed by the MCLP-based speech dereverberation method, according to an embodiment of the present invention;
FIG. 13 is a line graph of quality assessment by the speech-to-reverberation modulation energy ratio, under different numbers of channels, of the same three signals, according to an embodiment of the present invention;
FIG. 14 is a line graph of quality assessment by the weighted segmental direct-to-reverberant energy ratio, under different numbers of channels, of the same three signals, according to an embodiment of the present invention;
FIG. 15 is a line graph of quality assessment by cepstral distance, under different numbers of channels, of the same three signals, according to an embodiment of the present invention;
fig. 16 is a block diagram illustrating a structure of an MCLP-based speech dereverberation system according to an embodiment of the present invention.
Detailed Description
To further illustrate the technical means and effects adopted by the present invention to achieve its intended purpose, the following describes in detail, with reference to the accompanying drawings and preferred embodiments, the specific implementation, structure, features, and effects of the MCLP-based speech dereverberation method and system proposed by the present invention. In the following description, different references to "one embodiment" or "another embodiment" do not necessarily refer to the same embodiment. Furthermore, the particular features, structures, or characteristics described may be combined in any suitable manner in one or more embodiments.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
The following describes a specific scheme of a voice dereverberation method and system based on MCLP in detail with reference to the accompanying drawings.
Referring to fig. 1, a flowchart of an MCLP-based speech dereverberation method according to an embodiment of the present invention is shown, where the method includes the following steps:
and S001, performing framing data processing on the collected reverberation voice of the reverberation environment to obtain an expected signal of the current frame.
The method comprises the following specific steps:
1) calculating prediction coefficients from a mathematical representation of a reverberant signal in the time-frequency domain
In a closed acoustic space, consider a single speech source and a microphone array composed of M omnidirectional microphones; no particular array geometry is required. The multi-channel speech signals received by the array are windowed frame by frame and transformed by a short-time Fourier transform (STFT) with frame length L samples and an L-point transform. Since reverberant speech is the convolution of the reverberant room impulse response with the speech in the time domain, which becomes multiplication in the frequency domain, the reverberant signal received by the m-th microphone can be represented in the time-frequency domain as:

$$x^{(m)}_{t,l}=s_{t,l}+\sum_{n=1}^{M}\sum_{k=1}^{K}\mu^{(n,m)*}_{k,l}\,x^{(n)}_{t-\tau-k+1,l} \tag{1}$$

wherein t denotes the time-domain index of the speech frame; l denotes the frequency-bin index within each frame, l ∈ {1, 2, …, L}; τ denotes the linear prediction delay; $x^{(m)}_{t,l}$ denotes the bin component of the reverberant speech at bin l of frame t at the m-th microphone; $s_{t,l}$ denotes the bin component of clean speech at bin l of frame t; $\mu^{(n,m)}_{k,l}$ denotes the prediction coefficient from the n-th microphone's received signal to the m-th microphone (it may also be viewed as the reverberant room impulse response from the source to the m-th microphone), and the prediction-coefficient length of each channel is set to a constant K; k denotes the prediction-coefficient index, k ∈ {1, 2, …, K}.

It should be noted that the prediction delay τ is usually a non-negative integer from 0 to 3, and the prediction-coefficient length K is usually a positive integer between 5 and 20; x, s, and μ are complex-valued.
2) And obtaining a first prediction coefficient matrix according to the prediction coefficient, and calculating the expected signal by using the first prediction coefficient matrix and the reverberation voice subjected to framing processing.
Equation (1) above is rewritten in matrix form as:

$$x^{(m)}_{t,l}=s_{t,l}+g^{(m)H}_{l}\,\bar{x}_{t-\tau,l}$$

wherein $g^{(m)}_{l}$ denotes the prediction-coefficient vector of the m-th microphone and $\bar{x}_{t-\tau,l}$ denotes the stacked sequence of delayed signal observations needed to predict the late reverberation at the current frame. In the embodiment of the present invention, the desired signal $s_{t,l}$ is assumed to follow a zero-mean time-varying Gaussian model and to be independent of the late reverberation component. After the prediction coefficients are estimated with the MCLP algorithm, the desired signal of the current frame is obtained as:

$$d_{t,l}=x_{t,l}-\hat{G}^{H}_{l}\,\bar{x}_{t-\tau,l}$$
it should be noted that, in the embodiment of the present invention, the method of the present invention is subjected to an on-machine experiment simulation, specifically:
the simulation environment is that a uniform linear array consisting of eight omnidirectional microphones is placed in a closed room with the size of 7.0 multiplied by 3.5 multiplied by 2.4(M), namely M is 8, the microphone intervals are all 10cm, and the microphone coordinates are [6.0, 1.35-2.05, 1.0%]The source coordinate is [1.0,1.7,1.0 ]]. Generating multi-channel reverberation voice under different reverberation times by using a mirror image source model method, wherein the time length is 8s, and the sampling frequency fs16000 Hz. When windowing and framing, the frame length is set to 512 samples, the window function is a hamming window with the length of 512, the prediction coefficient length K is 10, and the prediction delay τ is 3.
S002, acquiring the speech-to-reverberation energy ratio and the signal-to-noise estimate of the desired signal, and substituting them into the geometric spectral subtraction formula to perform spectral subtraction on the reverberant speech, obtaining the first power spectral density of the desired signal; the speech-to-reverberation energy ratio is positively correlated with the first energy ratio, and the signal-to-noise estimate is positively correlated with the second energy ratio; the first energy ratio is the energy ratio of the reverberant speech to the reverberation component; the second energy ratio is the energy ratio of the desired speech to the reverberation component.
The method comprises the following specific steps:
1) a second power spectral density of the late reverberation component is estimated.
The late reverberation is modeled as an exponential decay governed by the reverberation time and is estimated frame by frame in a smoothed manner. Denoting the second power spectral density of the late reverberation by $\hat\sigma_r^{2}(t,l)$:

$$\hat\sigma_r^{2}(t,l)=\max\!\big(\alpha(t,l)\,\hat\sigma_x^{2}(t-\tau,l),\,e\big)$$

wherein R denotes the discrete frame-shift length of the speech frame in the time domain, usually set to one half or one quarter of the frame length L (in the embodiment of the present invention the frame shift is R = 128 samples); e is a constant denoting the minimum of the estimated second power spectral density, typically taken as 0.0001; $\hat\sigma_x^{2}(t-\tau,l)$ denotes the third power spectral density, i.e. the power spectral density of the reverberant speech signal at frame t − τ, which in the embodiment of the present invention is obtained by averaging over δ frames around frame t − τ across all channels of the microphone signals:

$$\hat\sigma_x^{2}(t-\tau,l)=\frac{1}{M\delta}\sum_{m=1}^{M}\;\sum_{i=t-\tau-\delta/2}^{t-\tau+\delta/2-1}\big|x^{(m)}_{i,l}\big|^{2}$$

wherein τ denotes the number of prediction-delay frames (the τ frames before frame t do not participate in the prediction); δ denotes the number of frames around frame t − τ involved in the calculation, a constant from 6 to 10, and δ is generally required to satisfy δ ≥ 2τ.
As an example, in an embodiment of the present invention, δ is taken to be 10.
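The two estimates above (the δ-frame, all-channel average and the exponential-decay scaling) can be sketched as follows. The exponent in `decay_alpha` assumes the standard 60 dB exponential energy-decay model evaluated over a delay of τ frames; the exact constants, as well as all function and variable names, are assumptions for illustration.

```python
import numpy as np

def decay_alpha(rt60, tau=3, hop=128, fs=16000):
    # Energy decays by 60 dB over rt60 seconds; over tau*hop samples the
    # decay factor is exp(-6*ln(10) * tau*hop / (fs*rt60)).
    return np.exp(-6.0 * np.log(10.0) * tau * hop / (fs * rt60))

def late_reverb_psd(X, t, l, alpha, tau=3, delta=10, floor=1e-4):
    """X: (T, L, M) complex STFT. Scalar late-reverberation PSD at (t, l)."""
    lo = max(t - tau - delta // 2, 0)           # delta-frame neighbourhood
    hi = min(t - tau + delta // 2 + 1, X.shape[0])
    sigma_x = np.mean(np.abs(X[lo:hi, l, :]) ** 2)  # average frames & channels
    return max(alpha * sigma_x, floor)           # floor at the constant e

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 257, 8)) + 1j * rng.standard_normal((50, 257, 8))
psd = late_reverb_psd(X, t=30, l=100, alpha=decay_alpha(0.8))
print(psd > 1e-4)
```

The floor `e = 1e-4` guarantees a strictly positive PSD even when the observed neighbourhood is silent, which keeps the later energy-ratio divisions well defined.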
α(t, l) is defined as a variable related to the reverberation time:

$$\alpha(t,l)=\exp\!\left(-\frac{6\ln 10\cdot\tau R}{f_s\cdot RT_{60}(t,l)}\right)$$

wherein $f_s$ denotes the speech sampling rate in Hz; $RT_{60}(t,l)$ denotes the reverberation time estimated at the current speech frame and frequency bin, in seconds, which can be obtained by various reverberation-time estimation algorithms.
As an example, in the embodiment of the present invention, the reverberation time $RT_{60}$ is calculated by the maximum likelihood estimation method. The sound decay is modeled as

$$y(i)=d(i)\,v(i),\qquad d(i)=A_r\,a^{i},\qquad a=\exp\!\left(-\frac{\rho}{f_s}\right),\qquad i\in\{0,\dots,N-1\}$$

wherein the constant ρ denotes the attenuation rate of the sound wave; $A_r$ denotes the original amplitude of the current speech signal; v(i) is the value at the i-th sample of a normal distribution with mean 0 and variance 1; and $r_t$ denotes the preset reverberation-time search sequence, $r_t=[0.1,0.2,\dots,1.2]$ s. For each candidate in the search sequence, the likelihood function of the observed segment under this decay model is evaluated, and the candidate maximizing the likelihood is taken as $RT_{60}$.
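The grid-search flavor of this maximum-likelihood estimate can be sketched as below. For simplicity the amplitude is profiled out analytically instead of being estimated jointly, which is a simplification of the likelihood described above; the function name, the synthetic test signal, and the per-sample decay constant are all illustrative assumptions.

```python
import numpy as np

def rt60_ml(y, fs=16000, candidates=np.arange(0.1, 1.21, 0.1)):
    """Grid-search ML estimate of RT60 from a decaying segment y."""
    n = np.arange(len(y))
    best_rt, best_ll = None, -np.inf
    for rt in candidates:
        a = np.exp(-3.0 * np.log(10.0) / (fs * rt))  # per-sample decay rate
        # Profile out the amplitude: MLE of A^2 given the decay rate a.
        A2 = np.mean(y ** 2 * a ** (-2.0 * n))
        # Profiled log-likelihood (additive constants dropped).
        ll = -0.5 * len(y) * np.log(A2) - np.log(a) * n.sum()
        if ll > best_ll:
            best_rt, best_ll = rt, ll
    return best_rt

fs = 16000
n = np.arange(8000)
a_true = np.exp(-3.0 * np.log(10.0) / (fs * 0.5))  # true RT60 = 0.5 s
y = a_true ** n * np.random.default_rng(3).standard_normal(8000)
rt_est = rt60_ml(y)
print(rt_est)
```

On a half-second synthetic decay like the one above, the profiled likelihood peaks at or next to the true candidate of 0.5 s.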
2) A first power spectral density of the desired signal is estimated using geometric spectral subtraction.
The method comprises the following specific steps:
a) and calculating the voice reverberation energy ratio.
And obtaining the voice reverberation energy ratio of the current frame by performing smooth calculation on the first energy ratio and the historical voice reverberation energy ratio.
The specific calculation formula is as follows:
$$R_{x/r}(t,l)=\beta_1\,R_{x/r}(t-1,l)+(1-\beta_1)\,\frac{|x_{t,l}|^{2}}{\hat\sigma_r^{2}(t,l)}$$

wherein $R_{x/r}$ denotes the speech-to-reverberation energy ratio; $\beta_1$ denotes the first smoothing factor, $0<\beta_1<1$; and $|x_{t,l}|^{2}/\hat\sigma_r^{2}(t,l)$ denotes the first energy ratio.
As an example, in the embodiment of the present invention, β1 is taken as 0.9.
b) A signal-to-noise estimate is calculated.
The specific calculation formula is as follows:
$$R_{d/r}(t,l)=\beta_2\,\frac{|d'_{t-1,l}|^{2}}{\hat\sigma_r^{2}(t,l)}+(1-\beta_2)\,\max\!\big(R_{x/r}(t,l)-1,\,0\big) \tag{2}$$

wherein $R_{d/r}$ denotes the signal-to-noise estimate; $|d'_{t,l}|^{2}/\hat\sigma_r^{2}(t,l)$ denotes the second energy ratio, with $d'_{t,l}$ the estimated desired-signal bin amplitude and $|d'_{t,l}|^{2}$ the energy of the desired signal; $\beta_2$ denotes the second smoothing factor, $0<\beta_2<1$.

After $d'_{t,l}$ is obtained, it is substituted into equation (2) to calculate $R_{d/r}$ for the next frame. When calculating the first frame, $|x_{t,l}|$ is adopted in place of $d'_{t,l}$, and $R_{x/r}$ is initialized to 1.0.
As an example, in the embodiment of the present invention, β2 is taken as 0.9.
c) Obtaining the first power spectral density of the desired signal from the desired-signal bin amplitude.

The desired-signal bin amplitude $|d'_{t,l}|$ is obtained by applying the geometric spectral-subtraction gain, computed from $R_{x/r}$ and $R_{d/r}$, to the observed amplitude $|x_{t,l}|$; the first power spectral density is then updated by recursive smoothing:

$$\hat\sigma_s^{2}(t,l)=\beta_3\,\hat\sigma_s^{2}(t-1,l)+(1-\beta_3)\,|d'_{t,l}|^{2} \tag{3}$$

wherein $d'_{t,l}$ is the estimated desired-signal bin amplitude and $\beta_3$ is the third smoothing factor, $0<\beta_3<1$; when processing the first frame, $|x_{t,l}|^{2}$ is used in place of $|d'_{t,l}|^{2}$ in the calculation.
As an example, in the embodiment of the present invention, β3 is taken as 0.9.
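The three recursive smoothings in steps a) to c) all share the form new = β·old + (1 − β)·instantaneous. A minimal sketch with made-up per-bin numbers, using the initializations stated in the embodiment (R_{x/r} initialized to 1.0, and |x|² standing in for |d'|² at the first frame); the helper name `smooth` is illustrative.

```python
# One frequency bin, two consecutive frames; all energies are made up.
def smooth(prev, instant, beta=0.9):
    # Recursive smoothing: beta weights the history, (1-beta) the new value.
    return beta * prev + (1.0 - beta) * instant

obs_energy, desired_energy, reverb_psd = 1.0, 0.6, 0.25

R_xr = 1.0          # speech-to-reverberation ratio, initialized to 1.0
R_dr = 1.0          # signal-to-noise estimate
psd_s = obs_energy  # first frame: |x|^2 stands in for |d'|^2

R_xr = smooth(R_xr, obs_energy / reverb_psd)      # first energy ratio
R_dr = smooth(R_dr, desired_energy / reverb_psd)  # second energy ratio
psd_s = smooth(psd_s, desired_energy)             # first power spectral density
print(round(R_xr, 4), round(R_dr, 4), round(psd_s, 4))  # prints: 1.3 1.14 0.96
```

With β = 0.9, a single frame moves each estimate only 10% of the way toward its instantaneous value, which is what keeps the spectral-subtraction inputs stable from frame to frame.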
And step S003, acquiring the voice signal after dereverberation according to the first power spectral density.
The method comprises the following specific steps:
1) and obtaining the expected signal frequency points at each channel of the current frame by using a weighted recursive least square formula according to the first power spectral density.
$$d_{t,l}=x_{t,l}-G_{l}(t-1)^{H}\,\bar{x}_{t-\tau,l}$$

with

$$k_{l}(t)=\frac{\Phi_{l}(t-1)\,\bar{x}_{t-\tau,l}}{\alpha\,\hat\sigma_s^{2}(t,l)+\bar{x}_{t-\tau,l}^{H}\,\Phi_{l}(t-1)\,\bar{x}_{t-\tau,l}}$$

$$G_{l}(t)=G_{l}(t-1)+k_{l}(t)\,d_{t,l}^{H}$$

$$\Phi_{l}(t)=\frac{1}{\alpha}\Big(\Phi_{l}(t-1)-k_{l}(t)\,\bar{x}_{t-\tau,l}^{H}\,\Phi_{l}(t-1)\Big)$$

wherein $d_{t,l}$ denotes the desired-signal bins at all channels of the current frame; $G_{l}(t)$ denotes the second prediction-coefficient matrix; $k_{l}(t)$ denotes the gain vector used to update the prediction coefficients, of size (MK × 1); $\Phi_{l}(t)$ stores the inverse of the spatial correlation matrix, of size (MK × MK); and α is a constant denoting the fourth smoothing (forgetting) factor.
As an example, in the embodiment of the present invention, α is 0.9999.
It should be noted that before calculating the first frame, G is usedl(t) initialization to an all-zero matrix, Φl(t) is initialized to the unity diagonal matrix.
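The weighted recursive least-squares update of step 1) can be sketched per frequency bin l as follows. The gain-vector and inverse-correlation-matrix recursions shown are the standard RLS form and are an assumption, since the patent's own update equations are images; the stacked delayed observation vector x_{t−τ,l} covers M channels and K taps (size MK), as implied by the stated matrix sizes.

```python
import numpy as np

def wrls_step(x_cur, x_delayed, G, Phi, sigma2, alpha=0.9999):
    """One weighted-RLS update for a single frequency bin.

    x_cur     : (M,) current-frame observations across channels
    x_delayed : (MK,) stacked delayed observations x_{t-tau,l}
    G         : (MK, M) prediction coefficient matrix G_l(t-1)
    Phi       : (MK, MK) stored inverse of the spatial correlation matrix
    sigma2    : scalar first power spectral density (the WRLS weight)

    Returns the desired-signal bins d_{t,l} and the updated G, Phi.
    The exact placement of sigma2 and alpha in the gain denominator
    follows a common adaptive-WPE convention and is assumed here.
    """
    # Prediction error, as in the patent's equation:
    # d_{t,l} = x_{t,l} - G_l(t-1)^H x_{t-tau,l}
    d = x_cur - G.conj().T @ x_delayed
    # Gain vector k_l(t), size (MK x 1) per the text
    Phi_x = Phi @ x_delayed
    k = Phi_x / (alpha * sigma2 + x_delayed.conj() @ Phi_x)
    # Update the prediction coefficients and the stored inverse matrix
    G = G + np.outer(k, d.conj())
    Phi = (Phi - np.outer(k, x_delayed.conj() @ Phi)) / alpha
    return d, G, Phi

# Before the first frame: G all-zero, Phi the identity, as stated.
M, K = 2, 3
G = np.zeros((M * K, M), dtype=complex)
Phi = np.eye(M * K, dtype=complex)
d, G, Phi = wrls_step(np.ones(M, dtype=complex),
                      np.ones(M * K, dtype=complex), G, Phi, sigma2=1.0)
```

On the very first step G is zero, so d equals the raw observation x; the coefficients only begin to subtract predicted reverberation from the second frame onward.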
2) Perform the inverse short-time Fourier transform on the desired-signal frequency bins to obtain the dereverberated speech signal.
After the inverse short-time Fourier transform is applied to d_{t,l}, the algorithm outputs one frame of the dereverberated speech signal.
In step S004, the first power spectral density of the current frame is stored as the historical first power spectral density of the next frame, and the first power spectral density of the next frame is updated, until all dereverberated speech signals are obtained.
The method comprises the following specific steps:
The desired signal is modeled as a time-varying Gaussian with zero mean, so the first power spectral density serves as its variance. The first power spectral density of the currently obtained speech frame is therefore stored and substituted, as the historical variance, into calculation formula (3) of the next frame, so that the estimate of the first power spectral density is updated recursively.
It is then judged whether all speech frames have been processed; if frames remain, the dereverberation calculation is performed on the next frame of data, until all speech frames are processed.
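Steps S003 and S004 together form a per-frame loop in which the first power spectral density estimated from the current frame is carried forward as the history for the next frame. A minimal control-flow sketch, with the per-frame processing stubbed out behind a hypothetical `process_frame` callable:

```python
def dereverberate_stream(frames, process_frame, init_psd):
    """Run dereverberation frame by frame, threading the first power
    spectral density through as each frame's history (step S004).

    process_frame(frame, psd_hist) is a hypothetical stand-in for steps
    S002-S003; it must return (output_frame, new_psd).
    """
    psd = init_psd
    outputs = []
    for frame in frames:
        out, psd = process_frame(frame, psd)  # store PSD for next frame
        outputs.append(out)
    return outputs

# Toy stand-in: the "PSD" is just a running count of processed frames,
# purely to show the history being threaded through the loop.
outs = dereverberate_stream([10, 20, 30],
                            lambda f, p: (f + p, p + 1), init_psd=0)
```

The loop terminates exactly when the input frames are exhausted, matching the "until all speech frames are processed" condition above.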
In summary, in the embodiments of the present invention, frame data processing is performed on the collected reverberation voice of the reverberation environment to obtain an expected signal of the current frame; acquiring a voice reverberation energy ratio and a signal-to-noise estimation value of an expected signal, substituting a geometric spectrum subtraction formula to perform spectrum subtraction on reverberation voice to obtain a first power spectral density of the expected signal; the voice reverberation energy ratio and a first energy ratio of the reverberation voice and the reverberation component are in positive correlation, and the signal-to-noise estimation value and a second energy ratio of the expected voice and the reverberation component are in positive correlation; acquiring a voice signal after dereverberation according to the first power spectral density; and storing the first power spectral density of the current frame as the historical first power spectral density of the next frame, and updating the first power spectral density of the next frame until all the dereverberated voice signals are obtained.
The performance of the MCLP-based speech dereverberation method was evaluated through computer-aided simulation in the embodiment of the present invention, as shown in figs. 2-15, where the improved MCLP algorithm in the figures is the MCLP-based speech dereverberation method provided by the embodiment of the present invention. Observing the time-domain waveforms in figs. 2-4 and the spectrograms in figs. 5-7, the speech processed by the embodiment of the present invention is clearer and cleaner than that processed by the MCLP algorithm, both in the time-domain envelope and in the spectrogram ripples, with reduced smearing and blurring. In the opening section of the speech in particular, the clarity of the time-domain and frequency-domain waveforms is markedly improved over the MCLP algorithm and no longer appears swollen and blurred, indicating that reverberation components are removed more thoroughly and that the overall stability of the algorithm is higher.
Among the four speech quality evaluation criteria, higher scores for Perceptual Evaluation of Speech Quality (PESQ), speech-to-reverberation modulation energy ratio (SRMR), and frequency-weighted segmental signal-to-noise ratio (FWsegSNR) indicate better speech quality, while a lower cepstral distance (CD) indicates better quality. Observing the line graphs of figs. 8-11, the scores of the four evaluation indexes are clearly superior to those of the MCLP algorithm at reverberation times from 0.2 s to 1.2 s, and the performance gain is stable, which demonstrates the superiority of the embodiment of the present invention. Observing the line graphs in figs. 12-15, with 2, 4, 6, and 8 speech channels the four evaluation indexes of the embodiment of the present invention are likewise significantly improved over the MCLP algorithm, and the more speech channels there are, the larger the performance gain.
The comparison shows that the quality of speech processed by the MCLP-based speech dereverberation method is clearly superior to that of the original MCLP algorithm, and that the embodiment of the present invention can further improve dereverberation performance to a certain extent.
Based on the same inventive concept as the above method, another embodiment of the present invention provides an MCLP-based speech dereverberation system, referring to fig. 16, which includes the following modules:
a reverberant speech pre-processing module 1001, a first power spectral density acquisition module 1002, a speech dereverberation module 1003 and a first power spectral density update module 1004.
The reverberation voice preprocessing module 1001 is configured to perform framing data processing on the collected reverberation voice of the reverberation environment to obtain an expected signal of the current frame; the first power spectral density obtaining module 1002 is configured to obtain a speech-to-reverberation energy ratio and a signal-to-noise estimation value of the desired signal, and substitute a geometric spectral subtraction formula to perform spectral subtraction on the reverberated speech to obtain a first power spectral density of the desired signal; the voice reverberation energy ratio and the first energy ratio are in positive correlation, and the signal-to-noise estimation value and the second energy ratio are in positive correlation; the first energy ratio is the energy ratio of the reverberation voice and the reverberation component; the second energy ratio is the energy ratio of the desired speech and reverberation components; the voice dereverberation module 1003 is configured to obtain a dereverberated voice signal according to the first power spectral density; the first power spectral density updating module 1004 is configured to store the first power spectral density of the current frame as a historical first power spectral density of the next frame, and update the first power spectral density of the next frame until all dereverberated speech signals are obtained.
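The four-module decomposition of fig. 16 can be sketched as a thin pipeline class. The class and method names below are illustrative only (not from the patent), and each stage is a stub that would delegate to the corresponding processing step described above:

```python
class MclpDereverberationSystem:
    """Skeleton mirroring modules 1001-1004 of fig. 16; bodies are stubs."""

    def preprocess(self, reverberant_speech):          # module 1001
        """Frame the reverberant speech and predict the desired signal."""
        raise NotImplementedError

    def estimate_first_psd(self, desired_signal):      # module 1002
        """Energy ratios + geometric spectral subtraction -> first PSD."""
        raise NotImplementedError

    def dereverberate(self, first_psd):                # module 1003
        """WRLS desired-signal bins, then inverse STFT."""
        raise NotImplementedError

    def update_psd_history(self, first_psd):           # module 1004
        """Store the current frame's PSD as the next frame's history."""
        raise NotImplementedError
```

Each module's output feeds the next in the order 1001 → 1002 → 1003 → 1004, with module 1004 closing the loop back to 1002 for the following frame.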
Preferably, the reverberation voice preprocessing module comprises:
the prediction coefficient calculation module is used for calculating a prediction coefficient through the mathematical representation of the reverberation signal in a time-frequency domain;
and the expected signal calculation module is used for obtaining a first prediction coefficient matrix according to the prediction coefficient and calculating an expected signal by using the first prediction coefficient matrix and the reverberation voice subjected to framing processing.
Preferably, the first power spectral density acquisition module comprises:
and the voice reverberation energy ratio acquisition module is used for performing smooth calculation on the first energy ratio and the historical voice reverberation energy ratio to obtain the voice reverberation energy ratio of the current frame.
Preferably, the first power spectral density acquisition module comprises:
a signal-to-noise estimation value calculation module, configured to calculate a signal-to-noise estimation value:
wherein R_{d/r} represents the signal-to-noise estimate; the second term represents the second energy ratio; d′_{t,l} represents the estimated desired-signal frequency-bin amplitude, |d′_{t,l}|² represents the energy of the desired signal, and the remaining term represents the second power spectral density of the reverberation component; β₂ represents the second smoothing factor; and R_{x/r} represents the speech-to-reverberation energy ratio.
Preferably, the voice dereverberation module includes:
the expected signal frequency point acquisition module is used for acquiring an expected signal frequency point at each channel of the current frame by using a weighted recursive least square formula according to the first power spectral density;
and the dereverberation voice signal calculation module is used for carrying out short-time Fourier inverse transformation on the frequency point of the expected signal to obtain the voice signal after dereverberation.
In summary, in the embodiment of the present invention, the reverberation voice preprocessing module 1001 performs framing data processing on the collected reverberation voice of the reverberation environment to obtain an expected signal of the current frame; acquiring a voice reverberation energy ratio and a signal-to-noise estimation value of a desired signal through a first power spectral density acquisition module 1002, substituting a geometric spectrum subtraction formula to perform spectral subtraction on reverberant voice to obtain a first power spectral density of the desired signal; obtaining a dereverberated voice signal according to the first power spectral density through a voice dereverberation module 1003; the first power spectral density of the current frame is stored by the first power spectral density update module 1004 and is used as the historical first power spectral density of the next frame, and the first power spectral density of the next frame is updated until all dereverberated speech signals are obtained. The embodiment of the invention can further improve the dereverberation performance of the MCLP algorithm to a certain extent, and obtain the dereverberation voice with higher quality.
It should be noted that: the precedence order of the above embodiments of the present invention is only for description, and does not represent the merits of the embodiments. And specific embodiments thereof have been described above. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
Claims (10)
1. An MCLP-based speech dereverberation method, comprising the steps of:
the method comprises the steps of performing frame data processing on collected reverberation voice of a reverberation environment to obtain an expected signal of a current frame;
acquiring a voice reverberation energy ratio and a signal-to-noise estimation value of the expected signal, substituting a geometric spectrum subtraction formula to perform spectrum subtraction on the reverberation voice to obtain a first power spectral density of the expected signal; the voice reverberation energy ratio and the first energy ratio are in positive correlation, and the signal-to-noise estimation value and the second energy ratio are in positive correlation; the first energy ratio is the energy ratio of the reverberant speech and the reverberant component; the second energy ratio is the energy ratio of the desired speech and the reverberation component;
obtaining a dereverberated speech signal according to the first power spectral density;
and storing the first power spectral density of the current frame as the historical first power spectral density of the next frame, and updating the first power spectral density of the next frame until all the dereverberated voice signals are obtained.
2. The method of claim 1, wherein the step of obtaining the desired signal comprises:
calculating a prediction coefficient through the mathematical representation of the reverberation signal in a time-frequency domain;
and obtaining a first prediction coefficient matrix according to the prediction coefficient, and calculating the expected signal by using the first prediction coefficient matrix and the reverberation voice subjected to framing processing.
3. The method of claim 1, wherein the speech to reverberation energy ratio is calculated by:
and obtaining the voice reverberation energy ratio of the current frame by performing smooth calculation on the first energy ratio and the historical voice reverberation energy ratio.
4. The method of claim 1, wherein the signal-to-noise estimate is calculated by:
wherein R_{d/r} represents the signal-to-noise estimate; the second term represents the second energy ratio; d′_{t,l} represents the estimated desired-signal frequency-bin amplitude, |d′_{t,l}|² represents the energy of the desired signal, and the remaining term represents the second power spectral density of the reverberation component; β₂ represents the second smoothing factor; and R_{x/r} represents the speech-to-reverberation energy ratio.
5. The method of claim 1, wherein the step of obtaining the dereverberated speech signal comprises:
obtaining an expected signal frequency point at each channel of the current frame by using a weighted recursive least square formula according to the first power spectral density;
and carrying out short-time Fourier inverse transformation on the expected signal frequency point to obtain the voice signal after the reverberation is removed.
6. An MCLP-based speech dereverberation system, comprising the following modules:
the reverberation voice preprocessing module is used for performing framing data processing on the collected reverberation voice of the reverberation environment to obtain an expected signal of the current frame;
a first power spectral density obtaining module, configured to obtain a speech-to-reverberation energy ratio and a signal-to-noise estimation value of the desired signal, and substitute a geometric spectrum subtraction formula to perform spectrum subtraction on the reverberant speech to obtain a first power spectral density of the desired signal; the voice reverberation energy ratio and the first energy ratio are in positive correlation, and the signal-to-noise estimation value and the second energy ratio are in positive correlation; the first energy ratio is the energy ratio of the reverberant speech and the reverberant component; the second energy ratio is the energy ratio of the desired speech and the reverberation component;
the voice dereverberation module is used for acquiring a dereverberated voice signal according to the first power spectral density;
and the first power spectral density updating module is used for storing the first power spectral density of the current frame, taking the first power spectral density as the historical first power spectral density of the next frame, and updating the first power spectral density of the next frame until all the dereverberated voice signals are obtained.
7. The system of claim 6, wherein the reverberation speech pre-processing module comprises:
the prediction coefficient calculation module is used for calculating a prediction coefficient through the mathematical representation of the reverberation signal in a time-frequency domain;
and the expected signal calculation module is used for obtaining a first prediction coefficient matrix according to the prediction coefficient and calculating the expected signal by using the first prediction coefficient matrix and the reverberation voice subjected to framing processing.
8. The system of claim 6, wherein the first power spectral density acquisition module comprises:
and the voice reverberation energy ratio acquisition module is used for performing smooth calculation on the first energy ratio and the historical voice reverberation energy ratio to obtain the voice reverberation energy ratio of the current frame.
9. The system of claim 6, wherein the first power spectral density acquisition module comprises:
a signal-to-noise estimate calculation module configured to calculate the signal-to-noise estimate:
wherein R_{d/r} represents the signal-to-noise estimate; the second term represents the second energy ratio; d′_{t,l} represents the estimated desired-signal frequency-bin amplitude, |d′_{t,l}|² represents the energy of the desired signal, and the remaining term represents the second power spectral density of the reverberation component; β₂ represents the second smoothing factor; and R_{x/r} represents the speech-to-reverberation energy ratio.
10. The system of claim 6, wherein the speech dereverberation module comprises:
the expected signal frequency point acquisition module is used for acquiring an expected signal frequency point at each channel of the current frame by using a weighted recursive least square formula according to the first power spectral density;
and the dereverberation voice signal calculation module is used for carrying out short-time Fourier inverse transformation on the expected signal frequency point to obtain the voice signal after dereverberation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110247855.4A CN113160842B (en) | 2021-03-06 | 2021-03-06 | MCLP-based voice dereverberation method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113160842A true CN113160842A (en) | 2021-07-23 |
CN113160842B CN113160842B (en) | 2024-04-09 |
Family
ID=76884366
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110247855.4A Active CN113160842B (en) | 2021-03-06 | 2021-03-06 | MCLP-based voice dereverberation method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113160842B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023005409A1 (en) * | 2021-07-26 | 2023-02-02 | 青岛海尔科技有限公司 | Device determination method and device determination system |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101436407A (en) * | 2008-12-22 | 2009-05-20 | 西安电子科技大学 | Method for encoding and decoding audio |
US20130151244A1 (en) * | 2011-12-09 | 2013-06-13 | Microsoft Corporation | Harmonicity-based single-channel speech quality estimation |
CN103413547A (en) * | 2013-07-23 | 2013-11-27 | 大连理工大学 | Method for eliminating indoor reverberations |
WO2013189199A1 (en) * | 2012-06-18 | 2013-12-27 | 歌尔声学股份有限公司 | Method and device for dereverberation of single-channel speech |
CN106340302A (en) * | 2015-07-10 | 2017-01-18 | 深圳市潮流网络技术有限公司 | De-reverberation method and device for speech data |
CN108154885A (en) * | 2017-12-15 | 2018-06-12 | 重庆邮电大学 | It is a kind of to use QR-RLS algorithms to multicenter voice signal dereverberation method |
US20180182410A1 (en) * | 2016-12-23 | 2018-06-28 | Synaptics Incorporated | Online dereverberation algorithm based on weighted prediction error for noisy time-varying environments |
CN109637554A (en) * | 2019-01-16 | 2019-04-16 | 辽宁工业大学 | MCLP speech dereverberation method based on CDR |
CN110111804A (en) * | 2018-02-01 | 2019-08-09 | 南京大学 | Adaptive dereverberation method based on RLS algorithm |
US20190267018A1 (en) * | 2018-02-23 | 2019-08-29 | Cirrus Logic International Semiconductor Ltd. | Signal processing for speech dereverberation |
CN111128220A (en) * | 2019-12-31 | 2020-05-08 | 深圳市友杰智新科技有限公司 | Dereverberation method, apparatus, device and storage medium |
CN111161751A (en) * | 2019-12-25 | 2020-05-15 | 声耕智能科技(西安)研究院有限公司 | Distributed microphone pickup system and method under complex scene |
US20200219524A1 (en) * | 2017-09-21 | 2020-07-09 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Signal processor and method for providing a processed audio signal reducing noise and reverberation |
Also Published As
Publication number | Publication date |
---|---|
CN113160842B (en) | 2024-04-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP5124014B2 (en) | Signal enhancement apparatus, method, program and recording medium | |
EP1993320B1 (en) | Reverberation removal device, reverberation removal method, reverberation removal program, and recording medium | |
CN108172231B (en) | Dereverberation method and system based on Kalman filtering | |
JP5550456B2 (en) | Reverberation suppression apparatus and reverberation suppression method | |
CN109979476B (en) | Method and device for removing reverberation of voice | |
Xiao et al. | The NTU-ADSC systems for reverberation challenge 2014 | |
EP3685378B1 (en) | Signal processor and method for providing a processed audio signal reducing noise and reverberation | |
CN111312269B (en) | Rapid echo cancellation method in intelligent loudspeaker box | |
CN110111802B (en) | Kalman filtering-based adaptive dereverberation method | |
Mack et al. | Single-Channel Dereverberation Using Direct MMSE Optimization and Bidirectional LSTM Networks. | |
Wisdom et al. | Enhancement and recognition of reverberant and noisy speech by extending its coherence | |
Huang et al. | Multi-microphone adaptive noise cancellation for robust hotword detection | |
CN113160842B (en) | MCLP-based voice dereverberation method and system | |
JP4348393B2 (en) | Signal distortion removing apparatus, method, program, and recording medium recording the program | |
CN109243476B (en) | Self-adaptive estimation method and device for post-reverberation power spectrum in reverberation voice signal | |
Lefkimmiatis et al. | An optimum microphone array post-filter for speech applications. | |
CN116052702A (en) | Kalman filtering-based low-complexity multichannel dereverberation noise reduction method | |
Huang et al. | Dereverberation | |
Sehr et al. | Towards robust distant-talking automatic speech recognition in reverberant environments | |
US20230306980A1 (en) | Method and System for Audio Signal Enhancement with Reduced Latency | |
CN116758928A (en) | MCLP language dereverberation method and system based on time-varying forgetting factor | |
JP5172797B2 (en) | Reverberation suppression apparatus and method, program, and recording medium | |
Schwartz et al. | LPC-based speech dereverberation using Kalman-EM algorithm | |
Meng et al. | Frame-wise speech extraction with recursive expectation maximization for partially deformable microphone arrays | |
Bartolewska et al. | Frame-based Maximum a Posteriori Estimation of Second-Order Statistics for Multichannel Speech Enhancement in Presence of Noise |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||