CN113160842A - Voice dereverberation method and system based on MCLP - Google Patents
- Publication number: CN113160842A (application CN202110247855.4A)
- Authority
- CN
- China
- Prior art keywords
- reverberation
- voice
- signal
- energy ratio
- spectral density
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications (all under G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation)
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0232—Processing in the frequency domain
- G10L21/0264—Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
- G10L2021/02082—Noise filtering the noise being echo, reverberation of the speech
Abstract
The invention relates to the field of speech signal processing, and in particular to an MCLP-based speech dereverberation method and system. The method comprises the following steps: performing frame-by-frame processing on reverberant speech collected in a reverberant environment to obtain the desired signal of the current frame; acquiring a speech-to-reverberation energy ratio and a signal-to-noise estimate of the desired signal and substituting them into a geometric spectral subtraction formula to perform spectral subtraction on the reverberant speech, obtaining a first power spectral density of the desired signal, where the speech-to-reverberation energy ratio is positively correlated with a first energy ratio (the energy ratio of the reverberant speech to the reverberation component) and the signal-to-noise estimate is positively correlated with a second energy ratio (the energy ratio of the desired speech to the reverberation component); acquiring the dereverberated speech signal from the first power spectral density; and storing the first power spectral density of the current frame as the historical first power spectral density of the next frame, updating it frame by frame until the complete dereverberated speech signal is obtained. Embodiments of the invention can obtain higher-quality dereverberated speech.
Description
Technical Field
The invention relates to the technical field of voice signal processing, in particular to a voice dereverberation method and system based on MCLP.
Background
In daily life, indoor recording scenarios are increasingly common: indoor meetings, auditorium speeches, live webcasts, intelligent voice assistants, and so on. In these scenarios, the speech signal collected by a microphone is often mixed with a serious reverberation component. Reverberation is an acoustic phenomenon arising in enclosed spaces: because sound propagates along multiple paths and reflects off walls and object surfaces, the collected speech signal is smeared by delay differences, which seriously degrades the clarity of the speech spectrum. Studies have shown that early reverberation within 50 milliseconds helps to improve speech intelligibility and fullness, but excessive late reverberation severely degrades speech signal quality.
In practice, the inventors found that the above prior art has the following disadvantages:
for the Multi-Channel Linear Prediction (MCLP) algorithm in speech dereverberation, the clean speech signal is modeled as a time-varying Gaussian process, so the algorithm's performance depends heavily on the accuracy of the estimated power spectral density (PSD) of the clean speech. The original online MCLP algorithm estimates this PSD directly from the observed reverberant signal rather than from clean speech, so the estimate is inaccurate and the dereverberation effect suffers. Some improved variants of the algorithm first estimate the PSD of the late reverberation component and then subtract it from the observed PSD by spectral subtraction to obtain an estimate of the clean-speech PSD. However, because the reverberation PSD estimate is itself inaccurate, direct spectral subtraction can over-subtract when the estimate is too large, forcing too many spectral bins to zero and causing spectral distortion and musical noise.
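The over-subtraction problem described above can be seen in a few lines of code. This is a minimal, self-contained sketch (not from the patent) of plain power spectral subtraction with an overestimated reverberation PSD: bins where the estimate exceeds the observed power are clamped to zero, producing the spectral holes that give rise to musical noise.

```python
import numpy as np

# Observed reverberant-speech power and an (over)estimated reverberation
# power for four frequency bins; the numbers are made up for illustration.
obs_psd = np.array([1.0, 0.8, 0.5, 1.2])
reverb_psd_est = np.array([0.6, 1.0, 0.9, 0.7])

# Direct spectral subtraction: any bin where the reverberation estimate
# exceeds the observation is forced to exactly zero (over-subtraction).
clean_psd_est = np.maximum(obs_psd - reverb_psd_est, 0.0)
num_zeroed = int(np.sum(clean_psd_est == 0.0))
print(clean_psd_est, num_zeroed)  # two of four bins are zeroed out
```

The zeroed bins are what produce audible musical noise after resynthesis, which is the motivation for the geometric spectral subtraction used below.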
Disclosure of Invention
In order to solve the above technical problems, an object of the present invention is to provide a method and a system for dereverberating a speech based on MCLP, wherein the adopted technical solution is as follows:
in a first aspect, an embodiment of the present invention provides a method for voice dereverberation based on MCLP, including the following steps:
performing framing processing on reverberant speech collected in a reverberant environment to obtain a desired signal of the current frame;
acquiring a speech-to-reverberation energy ratio and a signal-to-noise estimate of the desired signal, and substituting them into a geometric spectral subtraction formula to perform spectral subtraction on the reverberant speech, obtaining a first power spectral density of the desired signal; the speech-to-reverberation energy ratio is positively correlated with the first energy ratio, and the signal-to-noise estimate is positively correlated with the second energy ratio; the first energy ratio is the energy ratio of the reverberant speech to the reverberation component; the second energy ratio is the energy ratio of the desired speech to the reverberation component;
obtaining a dereverberated speech signal according to the first power spectral density;
and storing the first power spectral density of the current frame as the historical first power spectral density of the next frame, and updating the first power spectral density of the next frame until all the dereverberated voice signals are obtained.
Preferably, the step of acquiring the desired signal includes:
calculating a prediction coefficient through the mathematical representation of the reverberation signal in a time-frequency domain;
and obtaining a first prediction coefficient matrix according to the prediction coefficient, and calculating the expected signal by using the first prediction coefficient matrix and the reverberation voice subjected to framing processing.
Preferably, the method for calculating the speech reverberation energy ratio includes:
and obtaining the voice reverberation energy ratio of the current frame by performing smooth calculation on the first energy ratio and the historical voice reverberation energy ratio.
Preferably, the method for calculating the signal-to-noise estimation value comprises:
$$R_{d/r}(t,l)=\beta_2\,\frac{|d'_{t-1,l}|^{2}}{\hat\sigma_r^{2}(t,l)}+(1-\beta_2)\,\max\!\big(R_{x/r}(t,l)-1,\,0\big)$$

wherein $R_{d/r}$ represents the signal-to-noise estimate; $|d'_{t,l}|^{2}/\hat\sigma_r^{2}(t,l)$ represents the second energy ratio, with $d'_{t,l}$ the estimated desired-signal bin amplitude and $|d'_{t,l}|^{2}$ the energy of the desired signal; $\hat\sigma_r^{2}(t,l)$ represents the second power spectral density of the reverberation component; $\beta_2$ represents the second smoothing factor; and $R_{x/r}$ represents the speech-to-reverberation energy ratio.
Preferably, the step of obtaining the dereverberated speech signal includes:
obtaining an expected signal frequency point at each channel of the current frame by using a weighted recursive least square formula according to the first power spectral density;
and carrying out short-time Fourier inverse transformation on the expected signal frequency point to obtain the voice signal after the reverberation is removed.
In a second aspect, another embodiment of the present invention provides an MCLP-based speech dereverberation system, which includes the following modules:
the reverberation voice preprocessing module is used for performing framing data processing on the collected reverberation voice of the reverberation environment to obtain an expected signal of the current frame;
the first power spectral density acquisition module, configured to acquire the speech-to-reverberation energy ratio and the signal-to-noise estimate of the desired signal and substitute them into the geometric spectral subtraction formula to perform spectral subtraction on the reverberant speech, obtaining the first power spectral density of the desired signal; the speech-to-reverberation energy ratio is positively correlated with the first energy ratio, and the signal-to-noise estimate is positively correlated with the second energy ratio; the first energy ratio is the energy ratio of the reverberant speech to the reverberation component; the second energy ratio is the energy ratio of the desired speech to the reverberation component;
the voice dereverberation module is used for acquiring a dereverberated voice signal according to the first power spectral density;
and the first power spectral density updating module is used for storing the first power spectral density of the current frame, taking the first power spectral density as the historical first power spectral density of the next frame, and updating the first power spectral density of the next frame until all the dereverberated voice signals are obtained.
Preferably, the reverberation voice preprocessing module comprises:
the prediction coefficient calculation module is used for calculating a prediction coefficient through the mathematical representation of the reverberation signal in a time-frequency domain;
and the expected signal calculation module is used for obtaining a first prediction coefficient matrix according to the prediction coefficient and calculating the expected signal by using the first prediction coefficient matrix and the reverberation voice subjected to framing processing.
Preferably, the first power spectral density acquisition module includes:
and the voice reverberation energy ratio acquisition module is used for performing smooth calculation on the first energy ratio and the historical voice reverberation energy ratio to obtain the voice reverberation energy ratio of the current frame.
Preferably, the first power spectral density acquisition module includes:
a signal-to-noise estimate calculation module configured to calculate the signal-to-noise estimate:
$$R_{d/r}(t,l)=\beta_2\,\frac{|d'_{t-1,l}|^{2}}{\hat\sigma_r^{2}(t,l)}+(1-\beta_2)\,\max\!\big(R_{x/r}(t,l)-1,\,0\big)$$

wherein $R_{d/r}$ represents the signal-to-noise estimate; $|d'_{t,l}|^{2}/\hat\sigma_r^{2}(t,l)$ represents the second energy ratio, with $d'_{t,l}$ the estimated desired-signal bin amplitude and $|d'_{t,l}|^{2}$ the energy of the desired signal; $\hat\sigma_r^{2}(t,l)$ represents the second power spectral density of the reverberation component; $\beta_2$ represents the second smoothing factor; and $R_{x/r}$ represents the speech-to-reverberation energy ratio.
Preferably, the voice dereverberation module includes:
the expected signal frequency point acquisition module is used for acquiring an expected signal frequency point at each channel of the current frame by using a weighted recursive least square formula according to the first power spectral density;
and the dereverberation voice signal calculation module is used for carrying out short-time Fourier inverse transformation on the expected signal frequency point to obtain the voice signal after dereverberation.
The embodiment of the invention has the following beneficial effects:
by combining geometric spectral subtraction and MCLP algorithm, the problem of spectral over-subtraction caused by spectral subtraction is solved, the dereverberation performance of the MCLP algorithm is improved, and high-quality dereverberation voice can be obtained.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present invention; other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is a flowchart of an MCLP-based speech dereverberation method according to an embodiment of the present invention;
FIG. 2 is a diagram of a speech time domain waveform of an original speech when the reverberation time is 0.8s and the number of channels is 4 according to an embodiment of the present invention;
FIG. 3 is a diagram of a speech time domain waveform of speech processed by the MCLP algorithm according to an embodiment of the present invention when the reverberation time is 0.8s and the number of channels is 4;
FIG. 4 is a time domain waveform diagram of speech processed by the MCLP-based speech dereverberation method according to an embodiment of the present invention when the reverberation time is 0.8s and the number of channels is 4;
FIG. 5 is a diagram of a speech spectrum of an original speech with reverberation time of 0.8s and channel number of 4 according to an embodiment of the present invention;
fig. 6 is a diagram of a speech spectrum of a speech processed by the MCLP algorithm according to an embodiment of the present invention when the reverberation time is 0.8s and the number of channels is 4;
fig. 7 is a diagram of a voice spectrum of a voice processed by the MCLP-based voice dereverberation method according to an embodiment of the present invention when a reverberation time is 0.8s and a channel number is 4;
FIG. 8 is a line graph of quality assessment by subjective speech quality assessment, at different reverberation times, of the original reverberant speech, the speech processed by the MCLP algorithm, and the speech processed by the MCLP-based speech dereverberation method, according to an embodiment of the present invention;
FIG. 9 is a line graph of quality assessment by the speech-to-reverberation modulation energy ratio, at different reverberation times, of the same three signals, according to an embodiment of the present invention;
FIG. 10 is a line graph of quality assessment by the weighted segmental direct-to-reverberant energy ratio, at different reverberation times, of the same three signals, according to an embodiment of the present invention;
FIG. 11 is a line graph of quality assessment by cepstral distance, at different reverberation times, of the same three signals, according to an embodiment of the present invention;
FIG. 12 is a line graph of quality assessment by subjective speech quality assessment, under different numbers of channels, of the original reverberant speech, the speech processed by the MCLP algorithm, and the speech processed by the MCLP-based speech dereverberation method, according to an embodiment of the present invention;
FIG. 13 is a line graph of quality assessment by the speech-to-reverberation modulation energy ratio, under different numbers of channels, of the same three signals, according to an embodiment of the present invention;
FIG. 14 is a line graph of quality assessment by the weighted segmental direct-to-reverberant energy ratio, under different numbers of channels, of the same three signals, according to an embodiment of the present invention;
FIG. 15 is a line graph of quality assessment by cepstral distance, under different numbers of channels, of the same three signals, according to an embodiment of the present invention;
fig. 16 is a block diagram illustrating a structure of an MCLP-based speech dereverberation system according to an embodiment of the present invention.
Detailed Description
To further illustrate the technical means and effects adopted by the present invention to achieve its intended purpose, the following describes in detail, with reference to the accompanying drawings and preferred embodiments, the specific implementation, structure, features, and effects of the MCLP-based speech dereverberation method and system proposed by the present invention. In the following description, different references to "one embodiment" or "another embodiment" do not necessarily refer to the same embodiment. Furthermore, the particular features, structures, or characteristics described may be combined in any suitable manner in one or more embodiments.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
The following describes a specific scheme of a voice dereverberation method and system based on MCLP in detail with reference to the accompanying drawings.
Referring to fig. 1, a flowchart of an MCLP-based speech dereverberation method according to an embodiment of the present invention is shown, where the method includes the following steps:
and S001, performing framing data processing on the collected reverberation voice of the reverberation environment to obtain an expected signal of the current frame.
The method comprises the following specific steps:
1) calculating prediction coefficients from a mathematical representation of a reverberant signal in the time-frequency domain
In a closed acoustic space, consider a single speech source and a microphone array composed of M omnidirectional microphones; no particular array geometry is required. The multi-channel speech signals received by the array are windowed frame by frame and transformed by a short-time Fourier transform (STFT) with frame length L samples and an L-point transform. Since reverberant speech is the convolution of the reverberant room impulse response with the speech in the time domain, which becomes multiplication in the frequency domain, the reverberant signal received by the m-th microphone can be represented in the time-frequency domain as:

$$x^{(m)}_{t,l}=s_{t,l}+\sum_{n=1}^{M}\sum_{k=1}^{K}\mu^{(n,m)*}_{k,l}\,x^{(n)}_{t-\tau-k+1,l} \tag{1}$$

wherein t denotes the time-domain index of the speech frame; l denotes the frequency-bin index within each frame, l ∈ {1, 2, …, L}; τ denotes the linear prediction delay; $x^{(m)}_{t,l}$ denotes the bin component of the reverberant speech at bin l of frame t at the m-th microphone; $s_{t,l}$ denotes the bin component of clean speech at bin l of frame t; $\mu^{(n,m)}_{k,l}$ denotes the prediction coefficient from the n-th microphone's received signal to the m-th microphone (it may also be viewed as the reverberant room impulse response from the source to the m-th microphone), and the prediction-coefficient length of each channel is set to a constant K; k denotes the prediction-coefficient index, k ∈ {1, 2, …, K}.

It should be noted that the prediction delay τ is usually a non-negative integer from 0 to 3, and the prediction-coefficient length K is usually a positive integer between 5 and 20; x, s, and μ are complex-valued.
2) And obtaining a first prediction coefficient matrix according to the prediction coefficient, and calculating the expected signal by using the first prediction coefficient matrix and the reverberation voice subjected to framing processing.
Equation (1) above is rewritten in matrix form as:

$$x^{(m)}_{t,l}=s_{t,l}+g^{(m)H}_{l}\,\bar{x}_{t-\tau,l}$$

wherein $g^{(m)}_{l}$ denotes the prediction-coefficient vector of the m-th microphone and $\bar{x}_{t-\tau,l}$ denotes the stacked sequence of delayed signal observations needed to predict the late reverberation at the current frame. In the embodiment of the present invention, the desired signal $s_{t,l}$ is assumed to follow a zero-mean time-varying Gaussian model and to be independent of the late reverberation component. After the prediction coefficients are estimated with the MCLP algorithm, the desired signal of the current frame is obtained as:

$$d_{t,l}=x_{t,l}-\hat{G}^{H}_{l}\,\bar{x}_{t-\tau,l}$$
it should be noted that, in the embodiment of the present invention, the method of the present invention is subjected to an on-machine experiment simulation, specifically:
the simulation environment is that a uniform linear array consisting of eight omnidirectional microphones is placed in a closed room with the size of 7.0 multiplied by 3.5 multiplied by 2.4(M), namely M is 8, the microphone intervals are all 10cm, and the microphone coordinates are [6.0, 1.35-2.05, 1.0%]The source coordinate is [1.0,1.7,1.0 ]]. Generating multi-channel reverberation voice under different reverberation times by using a mirror image source model method, wherein the time length is 8s, and the sampling frequency fs16000 Hz. When windowing and framing, the frame length is set to 512 samples, the window function is a hamming window with the length of 512, the prediction coefficient length K is 10, and the prediction delay τ is 3.
S002, acquiring the speech-to-reverberation energy ratio and the signal-to-noise estimate of the desired signal, and substituting them into the geometric spectral subtraction formula to perform spectral subtraction on the reverberant speech, obtaining the first power spectral density of the desired signal; the speech-to-reverberation energy ratio is positively correlated with the first energy ratio, and the signal-to-noise estimate is positively correlated with the second energy ratio; the first energy ratio is the energy ratio of the reverberant speech to the reverberation component; the second energy ratio is the energy ratio of the desired speech to the reverberation component.
The method comprises the following specific steps:
1) a second power spectral density of the late reverberation component is estimated.
The late reverberation is modeled as an exponential decay governed by the reverberation time and is estimated frame by frame in a smoothed manner. Denoting the second power spectral density of the late reverberation by $\hat\sigma_r^{2}(t,l)$:

$$\hat\sigma_r^{2}(t,l)=\max\!\big(\alpha(t,l)\,\hat\sigma_x^{2}(t-\tau,l),\,e\big)$$

wherein R denotes the discrete frame-shift length of the speech frame in the time domain, usually set to one half or one quarter of the frame length L (in the embodiment of the present invention the frame shift is R = 128 samples); e is a constant denoting the minimum of the estimated second power spectral density, typically taken as 0.0001; $\hat\sigma_x^{2}(t-\tau,l)$ denotes the third power spectral density, i.e. the power spectral density of the reverberant speech signal at frame t − τ, which in the embodiment of the present invention is obtained by averaging over δ frames around frame t − τ across all channels of the microphone signals:

$$\hat\sigma_x^{2}(t-\tau,l)=\frac{1}{M\delta}\sum_{m=1}^{M}\;\sum_{i=t-\tau-\delta/2}^{t-\tau+\delta/2-1}\big|x^{(m)}_{i,l}\big|^{2}$$

wherein τ denotes the number of prediction-delay frames (the τ frames before frame t do not participate in the prediction); δ denotes the number of frames around frame t − τ involved in the calculation, a constant from 6 to 10, and δ is generally required to satisfy δ ≥ 2τ.
As an example, in an embodiment of the present invention, δ is taken to be 10.
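The two estimates above (the δ-frame, all-channel average and the exponential-decay scaling) can be sketched as follows. The exponent in `decay_alpha` assumes the standard 60 dB exponential energy-decay model evaluated over a delay of τ frames; the exact constants, as well as all function and variable names, are assumptions for illustration.

```python
import numpy as np

def decay_alpha(rt60, tau=3, hop=128, fs=16000):
    # Energy decays by 60 dB over rt60 seconds; over tau*hop samples the
    # decay factor is exp(-6*ln(10) * tau*hop / (fs*rt60)).
    return np.exp(-6.0 * np.log(10.0) * tau * hop / (fs * rt60))

def late_reverb_psd(X, t, l, alpha, tau=3, delta=10, floor=1e-4):
    """X: (T, L, M) complex STFT. Scalar late-reverberation PSD at (t, l)."""
    lo = max(t - tau - delta // 2, 0)           # delta-frame neighbourhood
    hi = min(t - tau + delta // 2 + 1, X.shape[0])
    sigma_x = np.mean(np.abs(X[lo:hi, l, :]) ** 2)  # average frames & channels
    return max(alpha * sigma_x, floor)           # floor at the constant e

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 257, 8)) + 1j * rng.standard_normal((50, 257, 8))
psd = late_reverb_psd(X, t=30, l=100, alpha=decay_alpha(0.8))
print(psd > 1e-4)
```

The floor `e = 1e-4` guarantees a strictly positive PSD even when the observed neighbourhood is silent, which keeps the later energy-ratio divisions well defined.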
α(t, l) is defined as a variable related to the reverberation time:

$$\alpha(t,l)=\exp\!\left(-\frac{6\ln 10\cdot\tau R}{f_s\cdot RT_{60}(t,l)}\right)$$

wherein $f_s$ denotes the speech sampling rate in Hz; $RT_{60}(t,l)$ denotes the reverberation time estimated at the current speech frame and frequency bin, in seconds, which can be obtained by various reverberation-time estimation algorithms.
As an example, in the embodiment of the present invention, the reverberation time $RT_{60}$ is calculated by the maximum likelihood estimation method. The sound decay is modeled as

$$y(i)=d(i)\,v(i),\qquad d(i)=A_r\,a^{i},\qquad a=\exp\!\left(-\frac{\rho}{f_s}\right),\qquad i\in\{0,\dots,N-1\}$$

wherein the constant ρ denotes the attenuation rate of the sound wave; $A_r$ denotes the original amplitude of the current speech signal; v(i) is the value at the i-th sample of a normal distribution with mean 0 and variance 1; and $r_t$ denotes the preset reverberation-time search sequence, $r_t=[0.1,0.2,\dots,1.2]$ s. For each candidate in the search sequence, the likelihood function of the observed segment under this decay model is evaluated, and the candidate maximizing the likelihood is taken as $RT_{60}$.
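The grid-search flavor of this maximum-likelihood estimate can be sketched as below. For simplicity the amplitude is profiled out analytically instead of being estimated jointly, which is a simplification of the likelihood described above; the function name, the synthetic test signal, and the per-sample decay constant are all illustrative assumptions.

```python
import numpy as np

def rt60_ml(y, fs=16000, candidates=np.arange(0.1, 1.21, 0.1)):
    """Grid-search ML estimate of RT60 from a decaying segment y."""
    n = np.arange(len(y))
    best_rt, best_ll = None, -np.inf
    for rt in candidates:
        a = np.exp(-3.0 * np.log(10.0) / (fs * rt))  # per-sample decay rate
        # Profile out the amplitude: MLE of A^2 given the decay rate a.
        A2 = np.mean(y ** 2 * a ** (-2.0 * n))
        # Profiled log-likelihood (additive constants dropped).
        ll = -0.5 * len(y) * np.log(A2) - np.log(a) * n.sum()
        if ll > best_ll:
            best_rt, best_ll = rt, ll
    return best_rt

fs = 16000
n = np.arange(8000)
a_true = np.exp(-3.0 * np.log(10.0) / (fs * 0.5))  # true RT60 = 0.5 s
y = a_true ** n * np.random.default_rng(3).standard_normal(8000)
rt_est = rt60_ml(y)
print(rt_est)
```

On a half-second synthetic decay like the one above, the profiled likelihood peaks at or next to the true candidate of 0.5 s.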
2) A first power spectral density of the desired signal is estimated using geometric spectral subtraction.
The method comprises the following specific steps:
a) and calculating the voice reverberation energy ratio.
And obtaining the voice reverberation energy ratio of the current frame by performing smooth calculation on the first energy ratio and the historical voice reverberation energy ratio.
The specific calculation formula is as follows:
$$R_{x/r}(t,l)=\beta_1\,R_{x/r}(t-1,l)+(1-\beta_1)\,\frac{|x_{t,l}|^{2}}{\hat\sigma_r^{2}(t,l)}$$

wherein $R_{x/r}$ denotes the speech-to-reverberation energy ratio; $\beta_1$ denotes the first smoothing factor, $0<\beta_1<1$; and $|x_{t,l}|^{2}/\hat\sigma_r^{2}(t,l)$ denotes the first energy ratio.
As an example, in the embodiment of the present invention, β1 is taken as 0.9.
b) A signal-to-noise estimate is calculated.
The specific calculation formula is as follows:
$$R_{d/r}(t,l)=\beta_2\,\frac{|d'_{t-1,l}|^{2}}{\hat\sigma_r^{2}(t,l)}+(1-\beta_2)\,\max\!\big(R_{x/r}(t,l)-1,\,0\big) \tag{2}$$

wherein $R_{d/r}$ denotes the signal-to-noise estimate; $|d'_{t,l}|^{2}/\hat\sigma_r^{2}(t,l)$ denotes the second energy ratio, with $d'_{t,l}$ the estimated desired-signal bin amplitude and $|d'_{t,l}|^{2}$ the energy of the desired signal; $\beta_2$ denotes the second smoothing factor, $0<\beta_2<1$.

After $d'_{t,l}$ is obtained, it is substituted into equation (2) to calculate $R_{d/r}$ for the next frame. When calculating the first frame, $|x_{t,l}|$ is adopted in place of $d'_{t,l}$, and $R_{x/r}$ is initialized to 1.0.
As an example, in the embodiment of the present invention, β2 is taken as 0.9.
c) Obtaining the first power spectral density of the desired signal from the desired-signal bin amplitude.

The desired-signal bin amplitude $|d'_{t,l}|$ is obtained by applying the geometric spectral-subtraction gain, computed from $R_{x/r}$ and $R_{d/r}$, to the observed amplitude $|x_{t,l}|$; the first power spectral density is then updated by recursive smoothing:

$$\hat\sigma_s^{2}(t,l)=\beta_3\,\hat\sigma_s^{2}(t-1,l)+(1-\beta_3)\,|d'_{t,l}|^{2} \tag{3}$$

wherein $d'_{t,l}$ is the estimated desired-signal bin amplitude and $\beta_3$ is the third smoothing factor, $0<\beta_3<1$; when processing the first frame, $|x_{t,l}|^{2}$ is used in place of $|d'_{t,l}|^{2}$ in the calculation.
As an example, in the embodiment of the present invention, β3 is taken as 0.9.
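The three recursive smoothings in steps a) to c) all share the form new = β·old + (1 − β)·instantaneous. A minimal sketch with made-up per-bin numbers, using the initializations stated in the embodiment (R_{x/r} initialized to 1.0, and |x|² standing in for |d'|² at the first frame); the helper name `smooth` is illustrative.

```python
# One frequency bin, two consecutive frames; all energies are made up.
def smooth(prev, instant, beta=0.9):
    # Recursive smoothing: beta weights the history, (1-beta) the new value.
    return beta * prev + (1.0 - beta) * instant

obs_energy, desired_energy, reverb_psd = 1.0, 0.6, 0.25

R_xr = 1.0          # speech-to-reverberation ratio, initialized to 1.0
R_dr = 1.0          # signal-to-noise estimate
psd_s = obs_energy  # first frame: |x|^2 stands in for |d'|^2

R_xr = smooth(R_xr, obs_energy / reverb_psd)      # first energy ratio
R_dr = smooth(R_dr, desired_energy / reverb_psd)  # second energy ratio
psd_s = smooth(psd_s, desired_energy)             # first power spectral density
print(round(R_xr, 4), round(R_dr, 4), round(psd_s, 4))  # prints: 1.3 1.14 0.96
```

With β = 0.9, a single frame moves each estimate only 10% of the way toward its instantaneous value, which is what keeps the spectral-subtraction inputs stable from frame to frame.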
And step S003, acquiring the voice signal after dereverberation according to the first power spectral density.
The method comprises the following specific steps:
1) and obtaining the expected signal frequency points at each channel of the current frame by using a weighted recursive least square formula according to the first power spectral density.
$$d_{t,l}=x_{t,l}-G_{l}(t-1)^{H}\,\bar{x}_{t-\tau,l}$$

with

$$k_{l}(t)=\frac{\Phi_{l}(t-1)\,\bar{x}_{t-\tau,l}}{\alpha\,\hat\sigma_s^{2}(t,l)+\bar{x}_{t-\tau,l}^{H}\,\Phi_{l}(t-1)\,\bar{x}_{t-\tau,l}}$$

$$G_{l}(t)=G_{l}(t-1)+k_{l}(t)\,d_{t,l}^{H}$$

$$\Phi_{l}(t)=\frac{1}{\alpha}\Big(\Phi_{l}(t-1)-k_{l}(t)\,\bar{x}_{t-\tau,l}^{H}\,\Phi_{l}(t-1)\Big)$$

wherein $d_{t,l}$ denotes the desired-signal bins at all channels of the current frame; $G_{l}(t)$ denotes the second prediction-coefficient matrix; $k_{l}(t)$ denotes the gain vector used to update the prediction coefficients, of size (MK × 1); $\Phi_{l}(t)$ stores the inverse of the spatial correlation matrix, of size (MK × MK); and α is a constant denoting the fourth smoothing (forgetting) factor.
As an example, in the embodiment of the present invention, α is 0.9999.
It should be noted that before calculating the first frame, G is usedl(t) initialization to an all-zero matrix, Φl(t) is initialized to the unity diagonal matrix.
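The weighted recursive least-squares update of step 1) can be sketched per frequency bin l as follows. The gain-vector and inverse-correlation-matrix recursions shown are the standard RLS form and are an assumption, since the patent's own update equations are images; the stacked delayed observation vector x_{t−τ,l} covers M channels and K taps (size MK), as implied by the stated matrix sizes.

```python
import numpy as np

def wrls_step(x_cur, x_delayed, G, Phi, sigma2, alpha=0.9999):
    """One weighted-RLS update for a single frequency bin.

    x_cur     : (M,) current-frame observations across channels
    x_delayed : (MK,) stacked delayed observations x_{t-tau,l}
    G         : (MK, M) prediction coefficient matrix G_l(t-1)
    Phi       : (MK, MK) stored inverse of the spatial correlation matrix
    sigma2    : scalar first power spectral density (the WRLS weight)

    Returns the desired-signal bins d_{t,l} and the updated G, Phi.
    The exact placement of sigma2 and alpha in the gain denominator
    follows a common adaptive-WPE convention and is assumed here.
    """
    # Prediction error, as in the patent's equation:
    # d_{t,l} = x_{t,l} - G_l(t-1)^H x_{t-tau,l}
    d = x_cur - G.conj().T @ x_delayed
    # Gain vector k_l(t), size (MK x 1) per the text
    Phi_x = Phi @ x_delayed
    k = Phi_x / (alpha * sigma2 + x_delayed.conj() @ Phi_x)
    # Update the prediction coefficients and the stored inverse matrix
    G = G + np.outer(k, d.conj())
    Phi = (Phi - np.outer(k, x_delayed.conj() @ Phi)) / alpha
    return d, G, Phi

# Before the first frame: G all-zero, Phi the identity, as stated.
M, K = 2, 3
G = np.zeros((M * K, M), dtype=complex)
Phi = np.eye(M * K, dtype=complex)
d, G, Phi = wrls_step(np.ones(M, dtype=complex),
                      np.ones(M * K, dtype=complex), G, Phi, sigma2=1.0)
```

On the very first step G is zero, so d equals the raw observation x; the coefficients only begin to subtract predicted reverberation from the second frame onward.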
2) Perform the inverse short-time Fourier transform on the desired-signal frequency bins to obtain the dereverberated speech signal.
After the inverse short-time Fourier transform is applied to d_{t,l}, the algorithm outputs one frame of the dereverberated speech signal.
In step S004, the first power spectral density of the current frame is stored as the historical first power spectral density of the next frame, and the first power spectral density of the next frame is updated, until all dereverberated speech signals are obtained.
The method comprises the following specific steps:
The desired signal is modeled as a time-varying Gaussian with zero mean, so the first power spectral density serves as its variance. The first power spectral density of the currently obtained speech frame is therefore stored and substituted, as the historical variance, into calculation formula (3) of the next frame, so that the estimate of the first power spectral density is updated recursively.
It is then judged whether all speech frames have been processed; if frames remain, the dereverberation calculation is performed on the next frame of data, until all speech frames are processed.
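Steps S003 and S004 together form a per-frame loop in which the first power spectral density estimated from the current frame is carried forward as the history for the next frame. A minimal control-flow sketch, with the per-frame processing stubbed out behind a hypothetical `process_frame` callable:

```python
def dereverberate_stream(frames, process_frame, init_psd):
    """Run dereverberation frame by frame, threading the first power
    spectral density through as each frame's history (step S004).

    process_frame(frame, psd_hist) is a hypothetical stand-in for steps
    S002-S003; it must return (output_frame, new_psd).
    """
    psd = init_psd
    outputs = []
    for frame in frames:
        out, psd = process_frame(frame, psd)  # store PSD for next frame
        outputs.append(out)
    return outputs

# Toy stand-in: the "PSD" is just a running count of processed frames,
# purely to show the history being threaded through the loop.
outs = dereverberate_stream([10, 20, 30],
                            lambda f, p: (f + p, p + 1), init_psd=0)
```

The loop terminates exactly when the input frames are exhausted, matching the "until all speech frames are processed" condition above.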
In summary, in the embodiments of the present invention, frame data processing is performed on the collected reverberation voice of the reverberation environment to obtain an expected signal of the current frame; acquiring a voice reverberation energy ratio and a signal-to-noise estimation value of an expected signal, substituting a geometric spectrum subtraction formula to perform spectrum subtraction on reverberation voice to obtain a first power spectral density of the expected signal; the voice reverberation energy ratio and a first energy ratio of the reverberation voice and the reverberation component are in positive correlation, and the signal-to-noise estimation value and a second energy ratio of the expected voice and the reverberation component are in positive correlation; acquiring a voice signal after dereverberation according to the first power spectral density; and storing the first power spectral density of the current frame as the historical first power spectral density of the next frame, and updating the first power spectral density of the next frame until all the dereverberated voice signals are obtained.
The performance of the MCLP-based speech dereverberation method was evaluated through computer-aided simulation in the embodiment of the present invention, as shown in figs. 2-15, where the improved MCLP algorithm in the figures is the MCLP-based speech dereverberation method provided by the embodiment of the present invention. Observing the time-domain waveforms in figs. 2-4 and the spectrograms in figs. 5-7, the speech processed by the embodiment of the present invention is clearer and cleaner than that processed by the MCLP algorithm, both in the time-domain envelope and in the spectrogram ripples, with reduced smearing and blurring. In the opening section of the speech in particular, the clarity of the time-domain and frequency-domain waveforms is markedly improved over the MCLP algorithm and no longer appears swollen and blurred, indicating that reverberation components are removed more thoroughly and that the overall stability of the algorithm is higher.
Among the four speech quality evaluation criteria, higher scores for Perceptual Evaluation of Speech Quality (PESQ), speech-to-reverberation modulation energy ratio (SRMR), and frequency-weighted segmental signal-to-noise ratio (FWsegSNR) indicate better speech quality, while a lower cepstral distance (CD) indicates better quality. Observing the line graphs of figs. 8-11, the scores of the four evaluation indexes are clearly superior to those of the MCLP algorithm at reverberation times from 0.2 s to 1.2 s, and the performance gain is stable, which demonstrates the superiority of the embodiment of the present invention. Observing the line graphs in figs. 12-15, with 2, 4, 6, and 8 speech channels the four evaluation indexes of the embodiment of the present invention are likewise significantly improved over the MCLP algorithm, and the more speech channels there are, the larger the performance gain.
The comparison shows that the quality of speech processed by the MCLP-based speech dereverberation method is clearly superior to that of the original MCLP algorithm, and that the embodiment of the present invention can further improve dereverberation performance to a certain extent.
Based on the same inventive concept as the above method, another embodiment of the present invention provides an MCLP-based speech dereverberation system, referring to fig. 16, which includes the following modules:
a reverberant speech pre-processing module 1001, a first power spectral density acquisition module 1002, a speech dereverberation module 1003 and a first power spectral density update module 1004.
The reverberation voice preprocessing module 1001 is configured to perform framing data processing on the collected reverberation voice of the reverberation environment to obtain an expected signal of the current frame; the first power spectral density obtaining module 1002 is configured to obtain a speech-to-reverberation energy ratio and a signal-to-noise estimation value of the desired signal, and substitute a geometric spectral subtraction formula to perform spectral subtraction on the reverberated speech to obtain a first power spectral density of the desired signal; the voice reverberation energy ratio and the first energy ratio are in positive correlation, and the signal-to-noise estimation value and the second energy ratio are in positive correlation; the first energy ratio is the energy ratio of the reverberation voice and the reverberation component; the second energy ratio is the energy ratio of the desired speech and reverberation components; the voice dereverberation module 1003 is configured to obtain a dereverberated voice signal according to the first power spectral density; the first power spectral density updating module 1004 is configured to store the first power spectral density of the current frame as a historical first power spectral density of the next frame, and update the first power spectral density of the next frame until all dereverberated speech signals are obtained.
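The four-module decomposition of fig. 16 can be sketched as a thin pipeline class. The class and method names below are illustrative only (not from the patent), and each stage is a stub that would delegate to the corresponding processing step described above:

```python
class MclpDereverberationSystem:
    """Skeleton mirroring modules 1001-1004 of fig. 16; bodies are stubs."""

    def preprocess(self, reverberant_speech):          # module 1001
        """Frame the reverberant speech and predict the desired signal."""
        raise NotImplementedError

    def estimate_first_psd(self, desired_signal):      # module 1002
        """Energy ratios + geometric spectral subtraction -> first PSD."""
        raise NotImplementedError

    def dereverberate(self, first_psd):                # module 1003
        """WRLS desired-signal bins, then inverse STFT."""
        raise NotImplementedError

    def update_psd_history(self, first_psd):           # module 1004
        """Store the current frame's PSD as the next frame's history."""
        raise NotImplementedError
```

Each module's output feeds the next in the order 1001 → 1002 → 1003 → 1004, with module 1004 closing the loop back to 1002 for the following frame.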
Preferably, the reverberation voice preprocessing module comprises:
the prediction coefficient calculation module is used for calculating a prediction coefficient through the mathematical representation of the reverberation signal in a time-frequency domain;
and the expected signal calculation module is used for obtaining a first prediction coefficient matrix according to the prediction coefficient and calculating an expected signal by using the first prediction coefficient matrix and the reverberation voice subjected to framing processing.
Preferably, the first power spectral density acquisition module comprises:
and the voice reverberation energy ratio acquisition module is used for performing smooth calculation on the first energy ratio and the historical voice reverberation energy ratio to obtain the voice reverberation energy ratio of the current frame.
Preferably, the first power spectral density acquisition module comprises:
a signal-to-noise estimation value calculation module, configured to calculate a signal-to-noise estimation value:
wherein R_{d/r} represents the signal-to-noise estimate; the second term represents the second energy ratio; d′_{t,l} represents the estimated desired-signal frequency-bin amplitude, |d′_{t,l}|² represents the energy of the desired signal, and the remaining term represents the second power spectral density of the reverberation component; β₂ represents the second smoothing factor; and R_{x/r} represents the speech-to-reverberation energy ratio.
Preferably, the voice dereverberation module includes:
the expected signal frequency point acquisition module is used for acquiring an expected signal frequency point at each channel of the current frame by using a weighted recursive least square formula according to the first power spectral density;
and the dereverberation voice signal calculation module is used for carrying out short-time Fourier inverse transformation on the frequency point of the expected signal to obtain the voice signal after dereverberation.
In summary, in the embodiment of the present invention, the reverberation voice preprocessing module 1001 performs framing data processing on the collected reverberation voice of the reverberation environment to obtain an expected signal of the current frame; acquiring a voice reverberation energy ratio and a signal-to-noise estimation value of a desired signal through a first power spectral density acquisition module 1002, substituting a geometric spectrum subtraction formula to perform spectral subtraction on reverberant voice to obtain a first power spectral density of the desired signal; obtaining a dereverberated voice signal according to the first power spectral density through a voice dereverberation module 1003; the first power spectral density of the current frame is stored by the first power spectral density update module 1004 and is used as the historical first power spectral density of the next frame, and the first power spectral density of the next frame is updated until all dereverberated speech signals are obtained. The embodiment of the invention can further improve the dereverberation performance of the MCLP algorithm to a certain extent, and obtain the dereverberation voice with higher quality.
It should be noted that: the precedence order of the above embodiments of the present invention is only for description, and does not represent the merits of the embodiments. And specific embodiments thereof have been described above. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
Claims (10)
1. An MCLP-based speech dereverberation method, comprising the steps of:
the method comprises the steps of performing frame data processing on collected reverberation voice of a reverberation environment to obtain an expected signal of a current frame;
acquiring a voice reverberation energy ratio and a signal-to-noise estimation value of the expected signal, substituting a geometric spectrum subtraction formula to perform spectrum subtraction on the reverberation voice to obtain a first power spectral density of the expected signal; the voice reverberation energy ratio and the first energy ratio are in positive correlation, and the signal-to-noise estimation value and the second energy ratio are in positive correlation; the first energy ratio is the energy ratio of the reverberant speech and the reverberant component; the second energy ratio is the energy ratio of the desired speech and the reverberation component;
obtaining a dereverberated speech signal according to the first power spectral density;
and storing the first power spectral density of the current frame as the historical first power spectral density of the next frame, and updating the first power spectral density of the next frame until all the dereverberated voice signals are obtained.
2. The method of claim 1, wherein the step of obtaining the desired signal comprises:
calculating a prediction coefficient through the mathematical representation of the reverberation signal in a time-frequency domain;
and obtaining a first prediction coefficient matrix according to the prediction coefficient, and calculating the expected signal by using the first prediction coefficient matrix and the reverberation voice subjected to framing processing.
3. The method of claim 1, wherein the speech to reverberation energy ratio is calculated by:
and obtaining the voice reverberation energy ratio of the current frame by performing smooth calculation on the first energy ratio and the historical voice reverberation energy ratio.
4. The method of claim 1, wherein the signal-to-noise estimate is calculated by:
wherein R_{d/r} represents the signal-to-noise estimate; the second term represents the second energy ratio; d′_{t,l} represents the estimated desired-signal frequency-bin amplitude, |d′_{t,l}|² represents the energy of the desired signal, and the remaining term represents the second power spectral density of the reverberation component; β₂ represents the second smoothing factor; and R_{x/r} represents the speech-to-reverberation energy ratio.
5. The method of claim 1, wherein the step of obtaining the dereverberated speech signal comprises:
obtaining an expected signal frequency point at each channel of the current frame by using a weighted recursive least square formula according to the first power spectral density;
and carrying out short-time Fourier inverse transformation on the expected signal frequency point to obtain the voice signal after the reverberation is removed.
6. An MCLP-based speech dereverberation system, comprising the following modules:
the reverberation voice preprocessing module is used for performing framing data processing on the collected reverberation voice of the reverberation environment to obtain an expected signal of the current frame;
a first power spectral density obtaining module, configured to obtain a speech-to-reverberation energy ratio and a signal-to-noise estimation value of the desired signal, and substitute a geometric spectrum subtraction formula to perform spectrum subtraction on the reverberant speech to obtain a first power spectral density of the desired signal; the voice reverberation energy ratio and the first energy ratio are in positive correlation, and the signal-to-noise estimation value and the second energy ratio are in positive correlation; the first energy ratio is the energy ratio of the reverberant speech and the reverberant component; the second energy ratio is the energy ratio of the desired speech and the reverberation component;
the voice dereverberation module is used for acquiring a dereverberated voice signal according to the first power spectral density;
and the first power spectral density updating module is used for storing the first power spectral density of the current frame, taking the first power spectral density as the historical first power spectral density of the next frame, and updating the first power spectral density of the next frame until all the dereverberated voice signals are obtained.
7. The system of claim 6, wherein the reverberation speech pre-processing module comprises:
the prediction coefficient calculation module is used for calculating a prediction coefficient through the mathematical representation of the reverberation signal in a time-frequency domain;
and the expected signal calculation module is used for obtaining a first prediction coefficient matrix according to the prediction coefficient and calculating the expected signal by using the first prediction coefficient matrix and the reverberation voice subjected to framing processing.
8. The system of claim 6, wherein the first power spectral density acquisition module comprises:
and the voice reverberation energy ratio acquisition module is used for performing smooth calculation on the first energy ratio and the historical voice reverberation energy ratio to obtain the voice reverberation energy ratio of the current frame.
9. The system of claim 6, wherein the first power spectral density acquisition module comprises:
a signal-to-noise estimate calculation module configured to calculate the signal-to-noise estimate:
wherein R_{d/r} represents the signal-to-noise estimate; the second term represents the second energy ratio; d′_{t,l} represents the estimated desired-signal frequency-bin amplitude, |d′_{t,l}|² represents the energy of the desired signal, and the remaining term represents the second power spectral density of the reverberation component; β₂ represents the second smoothing factor; and R_{x/r} represents the speech-to-reverberation energy ratio.
10. The system of claim 6, wherein the speech dereverberation module comprises:
the expected signal frequency point acquisition module is used for acquiring an expected signal frequency point at each channel of the current frame by using a weighted recursive least square formula according to the first power spectral density;
and the dereverberation voice signal calculation module is used for carrying out short-time Fourier inverse transformation on the expected signal frequency point to obtain the voice signal after dereverberation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110247855.4A CN113160842B (en) | 2021-03-06 | 2021-03-06 | MCLP-based voice dereverberation method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113160842A true CN113160842A (en) | 2021-07-23 |
CN113160842B CN113160842B (en) | 2024-04-09 |
Family
ID=76884366
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110247855.4A Active CN113160842B (en) | 2021-03-06 | 2021-03-06 | MCLP-based voice dereverberation method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113160842B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023005409A1 (en) * | 2021-07-26 | 2023-02-02 | 青岛海尔科技有限公司 | Device determination method and device determination system |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101436407A (en) * | 2008-12-22 | 2009-05-20 | 西安电子科技大学 | Method for encoding and decoding audio |
US20130151244A1 (en) * | 2011-12-09 | 2013-06-13 | Microsoft Corporation | Harmonicity-based single-channel speech quality estimation |
CN103413547A (en) * | 2013-07-23 | 2013-11-27 | 大连理工大学 | Method for eliminating indoor reverberations |
WO2013189199A1 (en) * | 2012-06-18 | 2013-12-27 | 歌尔声学股份有限公司 | Method and device for dereverberation of single-channel speech |
CN106340302A (en) * | 2015-07-10 | 2017-01-18 | 深圳市潮流网络技术有限公司 | De-reverberation method and device for speech data |
CN108154885A (en) * | 2017-12-15 | 2018-06-12 | 重庆邮电大学 | It is a kind of to use QR-RLS algorithms to multicenter voice signal dereverberation method |
US20180182410A1 (en) * | 2016-12-23 | 2018-06-28 | Synaptics Incorporated | Online dereverberation algorithm based on weighted prediction error for noisy time-varying environments |
CN109637554A (en) * | 2019-01-16 | 2019-04-16 | 辽宁工业大学 | MCLP speech dereverberation method based on CDR |
CN110111804A (en) * | 2018-02-01 | 2019-08-09 | 南京大学 | Adaptive dereverberation method based on RLS algorithm |
US20190267018A1 (en) * | 2018-02-23 | 2019-08-29 | Cirrus Logic International Semiconductor Ltd. | Signal processing for speech dereverberation |
CN111128220A (en) * | 2019-12-31 | 2020-05-08 | 深圳市友杰智新科技有限公司 | Dereverberation method, apparatus, device and storage medium |
CN111161751A (en) * | 2019-12-25 | 2020-05-15 | 声耕智能科技(西安)研究院有限公司 | Distributed microphone pickup system and method under complex scene |
US20200219524A1 (en) * | 2017-09-21 | 2020-07-09 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Signal processor and method for providing a processed audio signal reducing noise and reverberation |
Also Published As
Publication number | Publication date |
---|---|
CN113160842B (en) | 2024-04-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP5124014B2 (en) | Signal enhancement apparatus, method, program and recording medium | |
EP1993320B1 (en) | Reverberation removal device, reverberation removal method, reverberation removal program, and recording medium | |
CN108172231B (en) | Dereverberation method and system based on Kalman filtering | |
JP5550456B2 (en) | Reverberation suppression apparatus and reverberation suppression method | |
CN109979476B (en) | Method and device for removing reverberation of voice | |
Xiao et al. | The NTU-ADSC systems for reverberation challenge 2014 | |
EP3685378B1 (en) | Signal processor and method for providing a processed audio signal reducing noise and reverberation | |
CN111312269B (en) | Rapid echo cancellation method in intelligent loudspeaker box | |
CN110111802B (en) | Kalman filtering-based adaptive dereverberation method | |
Mack et al. | Single-Channel Dereverberation Using Direct MMSE Optimization and Bidirectional LSTM Networks. | |
Wisdom et al. | Enhancement and recognition of reverberant and noisy speech by extending its coherence | |
Huang et al. | Multi-microphone adaptive noise cancellation for robust hotword detection | |
CN113160842B (en) | MCLP-based voice dereverberation method and system | |
JP4348393B2 (en) | Signal distortion removing apparatus, method, program, and recording medium recording the program | |
CN109243476B (en) | Self-adaptive estimation method and device for post-reverberation power spectrum in reverberation voice signal | |
Lefkimmiatis et al. | An optimum microphone array post-filter for speech applications. | |
CN116052702A (en) | Kalman filtering-based low-complexity multichannel dereverberation noise reduction method | |
Huang et al. | Dereverberation | |
Sehr et al. | Towards robust distant-talking automatic speech recognition in reverberant environments | |
US20230306980A1 (en) | Method and System for Audio Signal Enhancement with Reduced Latency | |
CN116758928A (en) | MCLP language dereverberation method and system based on time-varying forgetting factor | |
JP5172797B2 (en) | Reverberation suppression apparatus and method, program, and recording medium | |
Schwartz et al. | LPC-based speech dereverberation using Kalman-EM algorithm | |
Meng et al. | Frame-wise speech extraction with recursive expectation maximization for partially deformable microphone arrays | |
Bartolewska et al. | Frame-based Maximum a Posteriori Estimation of Second-Order Statistics for Multichannel Speech Enhancement in Presence of Noise |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||