CN113160842A - Voice dereverberation method and system based on MCLP - Google Patents


Info

Publication number
CN113160842A
Authority
CN
China
Prior art keywords: reverberation, voice, signal, energy ratio, spectral density
Legal status: Granted
Application number: CN202110247855.4A
Other languages: Chinese (zh)
Other versions: CN113160842B (en)
Inventor
冯子成
马鸿飞
Current Assignee
Xidian University
Original Assignee
Xidian University
Priority date
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202110247855.4A priority Critical patent/CN113160842B/en
Publication of CN113160842A publication Critical patent/CN113160842A/en
Application granted granted Critical
Publication of CN113160842B publication Critical patent/CN113160842B/en
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • G10L21/0264 Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention relates to the technical field of speech signal processing, and in particular to an MCLP-based speech dereverberation method and system. The method comprises the following steps: performing frame-by-frame data processing on reverberant speech collected in a reverberant environment to obtain a desired signal for the current frame; obtaining a speech-to-reverberation energy ratio and a signal-to-noise estimation value of the desired signal, and substituting them into a geometric spectral subtraction formula to perform spectral subtraction on the reverberant speech, yielding a first power spectral density of the desired signal, wherein the speech-to-reverberation energy ratio is positively correlated with a first energy ratio of the reverberant speech to the reverberation component, and the signal-to-noise estimation value is positively correlated with a second energy ratio of the desired speech to the reverberation component; obtaining the dereverberated speech signal from the first power spectral density; and storing the first power spectral density of the current frame as the historical first power spectral density for the next frame, updating it frame by frame until the entire dereverberated speech signal is obtained. Embodiments of the invention can obtain better dereverberated speech.

Description

Voice dereverberation method and system based on MCLP
Technical Field
The invention relates to the technical field of voice signal processing, in particular to a voice dereverberation method and system based on MCLP.
Background
In daily life, indoor recording scenarios are increasingly common, for example indoor meetings, auditorium speeches, webcasts, and intelligent voice assistants. In these scenarios, the speech signal collected by a microphone is often mixed with a severe reverberation component. Reverberation is an acoustic phenomenon that arises in an enclosed space: because of the multipath propagation of sound, reflections off walls and object surfaces smear the collected speech signal through delay differences and severely pollute the clarity of the speech spectrum. Studies have shown that early reverberation within 50 milliseconds helps improve speech intelligibility and fullness, but excessive late reverberation severely degrades speech signal quality.
In practice, the inventors found that the above prior art has the following disadvantages:
In the Multi-Channel Linear Prediction (MCLP) algorithm for speech dereverberation, the clean speech signal is modeled as a time-varying Gaussian process, so the performance of the algorithm depends heavily on how accurately the power spectral density (PSD) of the clean speech is estimated. The original online MCLP algorithm estimates this PSD directly from the observed reverberant signal instead of the clean speech, so the estimate is poor and the dereverberation effect suffers. Some improved versions of the algorithm first estimate the PSD of the late reverberation component and then remove it by spectral subtraction to obtain an estimated clean-speech PSD. However, because the reverberation PSD estimate is inaccurate, direct spectral subtraction can over-subtract when the estimated value is large, leaving too many zeros in the spectrum and causing spectral distortion and musical noise.
Disclosure of Invention
In order to solve the above technical problems, an object of the present invention is to provide an MCLP-based speech dereverberation method and system. The adopted technical solution is as follows:
in a first aspect, an embodiment of the present invention provides a method for voice dereverberation based on MCLP, including the following steps:
performing frame-by-frame data processing on reverberant speech collected in a reverberant environment to obtain a desired signal for the current frame;
obtaining a speech-to-reverberation energy ratio and a signal-to-noise estimation value of the desired signal, and substituting them into a geometric spectral subtraction formula to perform spectral subtraction on the reverberant speech to obtain a first power spectral density of the desired signal; the speech-to-reverberation energy ratio is positively correlated with the first energy ratio, and the signal-to-noise estimation value is positively correlated with the second energy ratio; the first energy ratio is the energy ratio of the reverberant speech to the reverberation component; the second energy ratio is the energy ratio of the desired speech to the reverberation component;
obtaining a dereverberated speech signal according to the first power spectral density;
and storing the first power spectral density of the current frame as the historical first power spectral density for the next frame, and updating the first power spectral density of the next frame until the entire dereverberated speech signal is obtained.
Preferably, the step of acquiring the desired signal includes:
calculating a prediction coefficient through the mathematical representation of the reverberation signal in a time-frequency domain;
and obtaining a first prediction coefficient matrix according to the prediction coefficient, and calculating the expected signal by using the first prediction coefficient matrix and the reverberation voice subjected to framing processing.
Preferably, the method for calculating the speech reverberation energy ratio includes:
and obtaining the voice reverberation energy ratio of the current frame by performing smooth calculation on the first energy ratio and the historical voice reverberation energy ratio.
Preferably, the method for calculating the signal-to-noise estimation value comprises:
R_{d/r}(t,l) = β₂ · |d'_{t-1,l}|² / σ_r²(t,l) + (1 - β₂) · max( R_{x/r}(t,l) - 1, 0 )
wherein R_{d/r} represents the signal-to-noise estimation value; |d'_{t-1,l}|²/σ_r²(t,l) represents the second energy ratio, d'_{t-1,l} representing the estimated desired-signal frequency-point amplitude and |d'_{t-1,l}|² the energy of the desired signal; σ_r²(t,l) represents the second power spectral density of the reverberation component; β₂ represents the second smoothing factor; and R_{x/r} represents the speech-to-reverberation energy ratio.
Preferably, the step of obtaining the dereverberated speech signal includes:
obtaining an expected signal frequency point at each channel of the current frame by using a weighted recursive least square formula according to the first power spectral density;
and performing an inverse short-time Fourier transform on the desired-signal frequency points to obtain the dereverberated speech signal.
In a second aspect, another embodiment of the present invention provides an MCLP-based speech dereverberation system, which includes the following modules:
the reverberation voice preprocessing module is used for performing framing data processing on the collected reverberation voice of the reverberation environment to obtain an expected signal of the current frame;
the first power spectral density acquisition module is used for acquiring the voice reverberation energy ratio and the signal-to-noise estimation value of the expected signal, substituting a geometric spectrum subtraction formula to perform spectrum subtraction on the reverberation voice to obtain the first power spectral density of the expected signal; the voice reverberation energy ratio and the first energy ratio are in positive correlation, and the signal-to-noise estimation value and the second energy ratio are in positive correlation; the first energy ratio is the energy ratio of the reverberation voice and the reverberation component; the second energy ratio is the energy ratio of the desired speech and the reverberation component;
the voice dereverberation module is used for acquiring a dereverberated voice signal according to the first power spectral density;
and the first power spectral density updating module is used for storing the first power spectral density of the current frame, taking the first power spectral density as the historical first power spectral density of the next frame, and updating the first power spectral density of the next frame until all the dereverberated voice signals are obtained.
Preferably, the reverberation voice preprocessing module comprises:
the prediction coefficient calculation module is used for calculating a prediction coefficient through the mathematical representation of the reverberation signal in a time-frequency domain;
and the expected signal calculation module is used for obtaining a first prediction coefficient matrix according to the prediction coefficient and calculating the expected signal by using the first prediction coefficient matrix and the reverberation voice subjected to framing processing.
Preferably, the first power spectral density acquisition module includes:
and the voice reverberation energy ratio acquisition module is used for performing smooth calculation on the first energy ratio and the historical voice reverberation energy ratio to obtain the voice reverberation energy ratio of the current frame.
Preferably, the first power spectral density acquisition module includes:
a signal-to-noise estimate calculation module configured to calculate the signal-to-noise estimate:
R_{d/r}(t,l) = β₂ · |d'_{t-1,l}|² / σ_r²(t,l) + (1 - β₂) · max( R_{x/r}(t,l) - 1, 0 )
wherein R_{d/r} represents the signal-to-noise estimation value; |d'_{t-1,l}|²/σ_r²(t,l) represents the second energy ratio, d'_{t-1,l} representing the estimated desired-signal frequency-point amplitude and |d'_{t-1,l}|² the energy of the desired signal; σ_r²(t,l) represents the second power spectral density of the reverberation component; β₂ represents the second smoothing factor; and R_{x/r} represents the speech-to-reverberation energy ratio.
Preferably, the voice dereverberation module includes:
the expected signal frequency point acquisition module is used for acquiring an expected signal frequency point at each channel of the current frame by using a weighted recursive least square formula according to the first power spectral density;
and the dereverberated speech signal calculation module is used for performing an inverse short-time Fourier transform on the desired-signal frequency points to obtain the dereverberated speech signal.
The embodiment of the invention has the following beneficial effects:
by combining geometric spectral subtraction and MCLP algorithm, the problem of spectral over-subtraction caused by spectral subtraction is solved, the dereverberation performance of the MCLP algorithm is improved, and high-quality dereverberation voice can be obtained.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention; other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is a flowchart of an MCLP-based speech dereverberation method according to an embodiment of the present invention;
FIG. 2 is a diagram of a speech time domain waveform of an original speech when the reverberation time is 0.8s and the number of channels is 4 according to an embodiment of the present invention;
FIG. 3 is a diagram of a speech time domain waveform of speech processed by the MCLP algorithm according to an embodiment of the present invention when the reverberation time is 0.8s and the number of channels is 4;
FIG. 4 is a time domain waveform diagram of speech processed by the MCLP-based speech dereverberation method according to an embodiment of the present invention when the reverberation time is 0.8s and the number of channels is 4;
FIG. 5 is a diagram of a speech spectrum of an original speech with reverberation time of 0.8s and channel number of 4 according to an embodiment of the present invention;
fig. 6 is a diagram of a speech spectrum of a speech processed by the MCLP algorithm according to an embodiment of the present invention when the reverberation time is 0.8s and the number of channels is 4;
fig. 7 is a diagram of a voice spectrum of a voice processed by the MCLP-based voice dereverberation method according to an embodiment of the present invention when a reverberation time is 0.8s and a channel number is 4;
FIG. 8 is a line graph of the quality of the original reverberant speech, the speech processed by the MCLP algorithm, and the speech processed by the MCLP-based speech dereverberation method, evaluated with subjective speech quality assessment at different reverberation times according to an embodiment of the present invention;
FIG. 9 is a line graph of the quality of the original reverberant speech, the speech processed by the MCLP algorithm, and the speech processed by the MCLP-based speech dereverberation method, evaluated with the speech-to-reverberation modulation energy ratio at different reverberation times according to an embodiment of the present invention;
FIG. 10 is a line graph of the quality of the original reverberant speech, the speech processed by the MCLP algorithm, and the speech processed by the MCLP-based speech dereverberation method, evaluated with the weighted segmental direct-to-reverberant energy ratio at different reverberation times according to an embodiment of the present invention;
FIG. 11 is a line graph of the quality of the original reverberant speech, the speech processed by the MCLP algorithm, and the speech processed by the MCLP-based speech dereverberation method, evaluated with the cepstral distance at different reverberation times according to an embodiment of the present invention;
FIG. 12 is a line graph of the quality of the original reverberant speech, the speech processed by the MCLP algorithm, and the speech processed by the MCLP-based speech dereverberation method, evaluated with subjective speech quality assessment for different numbers of speech channels according to an embodiment of the present invention;
FIG. 13 is a line graph of the quality of the original reverberant speech, the speech processed by the MCLP algorithm, and the speech processed by the MCLP-based speech dereverberation method, evaluated with the speech-to-reverberation modulation energy ratio for different numbers of speech channels according to an embodiment of the present invention;
FIG. 14 is a line graph of the quality of the original reverberant speech, the speech processed by the MCLP algorithm, and the speech processed by the MCLP-based speech dereverberation method, evaluated with the weighted segmental direct-to-reverberant energy ratio for different numbers of speech channels according to an embodiment of the present invention;
FIG. 15 is a line graph of the quality of the original reverberant speech, the speech processed by the MCLP algorithm, and the speech processed by the MCLP-based speech dereverberation method, evaluated with the cepstral distance for different numbers of speech channels according to an embodiment of the present invention;
fig. 16 is a block diagram illustrating a structure of an MCLP-based speech dereverberation system according to an embodiment of the present invention.
Detailed Description
To further illustrate the technical means and effects adopted by the present invention to achieve its intended objects, the following describes in detail the specific embodiments, structures, features, and effects of the MCLP-based speech dereverberation method and system according to the present invention, with reference to the accompanying drawings and preferred embodiments. In the following description, different references to "one embodiment" or "another embodiment" do not necessarily refer to the same embodiment. Furthermore, particular features, structures, or characteristics in one or more embodiments may be combined in any suitable manner.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
The following describes a specific scheme of a voice dereverberation method and system based on MCLP in detail with reference to the accompanying drawings.
Referring to fig. 1, a flowchart of an MCLP-based speech dereverberation method according to an embodiment of the present invention is shown, where the method includes the following steps:
and S001, performing framing data processing on the collected reverberation voice of the reverberation environment to obtain an expected signal of the current frame.
The method comprises the following specific steps:
1) calculating prediction coefficients from a mathematical representation of a reverberant signal in the time-frequency domain
In an enclosed acoustic space with a single speech source and a microphone array composed of M omnidirectional microphones (no particular array geometry is required), the multichannel speech signals received by the array are windowed frame by frame and transformed with an L-point short-time Fourier transform (STFT). Since reverberant speech is the convolution of the reverberant room impulse response with the speech in the time domain, i.e. their product in the frequency domain, the reverberant signal received by the m-th microphone can be represented in the time-frequency domain as:
x^{(m)}_{t,l} = s_{t,l} + Σ_{n=1}^{M} Σ_{k=1}^{K} ( μ^{(n,m)}_{k,l} )* · x^{(n)}_{t-τ-k+1,l}        (1)
wherein t represents the time-domain index of the speech frame; l represents the frequency-point index within each frame, l ∈ {1, 2, …, L}; τ represents the linear prediction delay; x^{(m)}_{t,l} represents the frequency-point component of the reverberant speech at the l-th frequency point of the t-th frame; s_{t,l} represents the frequency-point component of the clean speech at the l-th frequency point of the t-th frame; μ^{(n,m)}_{k,l} represents the prediction coefficient from the signal received by the n-th microphone to the m-th microphone (it can also be regarded as part of the reverberant room impulse response from the source to the m-th microphone), the prediction-coefficient length of each channel being set to a constant K; and k denotes the prediction-coefficient index, k ∈ {1, 2, …, K}.
It should be noted that the prediction delay τ is usually a non-negative integer from 0 to 3, the prediction-coefficient length K is usually a positive integer between 5 and 20, and x, s, and μ are complex-valued.
2) Obtaining a first prediction coefficient matrix from the prediction coefficients, and calculating the desired signal using the first prediction coefficient matrix and the framed reverberant speech.
Equation (1) above can be rewritten compactly in matrix form as:
x_{t,l} = s_{t,l} + G_l^H · x̃_{t-τ,l}
with:
x_{t,l} = [ x^{(1)}_{t,l}, …, x^{(M)}_{t,l} ]^T
x̃_{t-τ,l} = [ x^{(1)}_{t-τ,l}, …, x^{(1)}_{t-τ-K+1,l}, …, x^{(M)}_{t-τ,l}, …, x^{(M)}_{t-τ-K+1,l} ]^T
G_l = [ g^{(1)}_l, …, g^{(M)}_l ]
wherein g^{(m)}_l denotes the stacked prediction-coefficient vector of the m-th microphone, so that the first prediction coefficient matrix G_l has size (MK × M), and x̃_{t-τ,l} represents the sequence of signal observations needed to predict the late reverberation at the current frame. In the embodiment of the invention, the desired signal s_{t,l} is assumed to follow a zero-mean time-varying Gaussian model and to be independent of the late-reverberation component G_l^H x̃_{t-τ,l}. After the prediction coefficients Ĝ_l are estimated with the MCLP algorithm, the desired signal of the current frame is obtained as:
d_{t,l} = x_{t,l} - Ĝ_l^H · x̃_{t-τ,l}
It should be noted that, in the embodiment of the present invention, the method of the present invention was verified by computer simulation, specifically:
The simulation environment is a closed room of size 7.0 × 3.5 × 2.4 m containing a uniform linear array of eight omnidirectional microphones, i.e. M = 8, with a microphone spacing of 10 cm; the microphone coordinates are [6.0, 1.35–2.05, 1.0] and the source coordinates are [1.0, 1.7, 1.0]. Multichannel reverberant speech under different reverberation times is generated with the image-source model method, with a duration of 8 s and a sampling frequency f_s = 16000 Hz. For windowing and framing, the frame length is set to 512 samples, the window function is a Hamming window of length 512, the prediction-coefficient length is K = 10, and the prediction delay is τ = 3.
Step S002, obtaining the speech-to-reverberation energy ratio and the signal-to-noise estimation value of the desired signal, and substituting them into a geometric spectral subtraction formula to perform spectral subtraction on the reverberant speech, obtaining the first power spectral density of the desired signal; the speech-to-reverberation energy ratio is positively correlated with the first energy ratio, and the signal-to-noise estimation value is positively correlated with the second energy ratio; the first energy ratio is the energy ratio of the reverberant speech to the reverberation component; the second energy ratio is the energy ratio of the desired speech to the reverberation component.
The method comprises the following specific steps:
1) Estimating the second power spectral density of the late reverberation component.
The late reverberation is modeled as an exponential decay based on the reverberation time and is estimated frame by frame in a smoothed manner. Denoting the second power spectral density of the late reverberation by σ_r²(t,l):
σ_r²(t,l) = max( e^{-2α(t,l)·τR} · σ_x²(t-τ,l), ε )
wherein R represents the discrete frame shift of a speech frame in the time domain and is usually set to one half or one quarter of the frame length L; in the embodiment of the present invention, the frame shift is R = 128 samples. ε is a constant representing the minimum of the estimated second power spectral density, typically taken as 0.0001. σ_x²(t-τ,l) represents the third power spectral density, i.e. that of the reverberant speech signal at frame t-τ; in the embodiment of the present invention it is obtained by averaging over all channels of the microphone signals and the preceding δ frames:
σ_x²(t-τ,l) = (1/(Mδ)) · Σ_{m=1}^{M} Σ_{i=t-τ-δ+1}^{t-τ} | x^{(m)}_{i,l} |²
wherein τ represents the number of prediction-delay frames (the τ frames immediately before the t-th frame do not participate in the prediction), and δ represents the number of frames around frame t-τ included in the calculation; δ is a constant from 6 to 10 and is generally required to satisfy δ ≥ 2τ.
As an example, in an embodiment of the present invention, δ is taken to be 10.
α(t,l) is defined as a variable related to the reverberation time:
α(t,l) = 3 ln 10 / ( RT₆₀(t,l) · f_s )
wherein f_s represents the speech sampling rate in Hz, and RT₆₀(t,l) represents the reverberation time (in seconds) estimated at the current speech frame and frequency point, which can be obtained by any of various reverberation-time estimation algorithms.
As an example, in the embodiment of the present invention the reverberation time RT₆₀ is calculated by the maximum-likelihood estimation method. The freely decaying segment of the signal is modeled as
d(i) = A_r · a^i · v(i),   a = e^{-ρ},   i ∈ {0, …, N-1}
where the constant ρ represents the attenuation rate of the sound wave, A_r represents the original amplitude of the current speech signal, v(i) is the value at the i-th sample point of a discrete normal distribution with mean 0 and variance 1, and N is the number of samples in the analysis frame. The decay rate is solved by the maximum-likelihood rule from the log-likelihood function
ln L(d; a, A_r) = -(N/2) ln(2π A_r²) - (N(N-1)/2) ln a - (1/(2A_r²)) · Σ_{i=0}^{N-1} a^{-2i} d(i)²
the candidate decay rates being generated from a preset reverberation-time search sequence rt(i), rt = [0.1, 0.2, …, 1.2] (in seconds); the RT₆₀ that maximizes the likelihood is selected.
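The exponential-decay estimate of the late-reverberation power spectral density described above can be sketched as follows. This is a minimal NumPy version; the function name and default arguments are illustrative, with α = 3·ln10/(RT₆₀·f_s) as in the decay model:

```python
import numpy as np

def late_reverb_psd(phi_x_prev, rt60, fs=16000, tau=3, frame_shift=128, eps=1e-4):
    """Exponential-decay estimate of the late-reverberation PSD.

    phi_x_prev : PSD of the reverberant speech at frame t - tau (third PSD),
                 one value per frequency bin.
    rt60       : estimated reverberation time in seconds.
    Implements sigma_r^2(t,l) = max(exp(-2*alpha*tau*R) * sigma_x^2(t-tau,l), eps)
    with alpha = 3*ln(10) / (rt60 * fs).
    """
    alpha = 3.0 * np.log(10.0) / (rt60 * fs)
    decay = np.exp(-2.0 * alpha * tau * frame_shift)   # decay over tau*R samples
    return np.maximum(decay * phi_x_prev, eps)

# Two example bins with RT60 = 0.8 s (one of the simulated conditions).
phi_r = late_reverb_psd(phi_x_prev=np.array([1.0, 0.5]), rt60=0.8)
```

The floor `eps` mirrors the constant ε above: bins whose decayed estimate would fall below it are clamped so the subsequent ratios stay finite.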
2) A first power spectral density of the desired signal is estimated using geometric spectral subtraction.
The method comprises the following specific steps:
a) Calculating the speech-to-reverberation energy ratio.
The speech-to-reverberation energy ratio of the current frame is obtained by smoothing the first energy ratio with the historical speech-to-reverberation energy ratio.
The specific calculation formula is as follows:
R_{x/r}(t,l) = β₁ · R_{x/r}(t-1,l) + (1 - β₁) · |x_{t,l}|² / σ_r²(t,l)
wherein R_{x/r} represents the speech-to-reverberation energy ratio; β₁ represents the first smoothing factor, 0 < β₁ < 1; and |x_{t,l}|²/σ_r²(t,l) is the instantaneous first energy ratio of the current frame, σ_r²(t,l) being the second power spectral density of the late reverberation.
As an example, in the embodiment of the present invention β₁ is taken as 0.9.
b) Calculating the signal-to-noise estimation value.
The specific calculation formula is:
R_{d/r}(t,l) = β₂ · |d'_{t-1,l}|² / σ_r²(t,l) + (1 - β₂) · max( R_{x/r}(t,l) - 1, 0 )      (2)
wherein R_{d/r} represents the signal-to-noise estimation value; |d'_{t-1,l}|²/σ_r²(t,l) represents the second energy ratio, d'_{t-1,l} being the estimated desired-signal frequency-point amplitude of the previous frame and |d'_{t-1,l}|² the energy of the desired signal; β₂ represents the second smoothing factor, 0 < β₂ < 1.
The desired-signal amplitude of the current frame is then obtained from the two ratios by the geometric spectral subtraction formula:
|d'_{t,l}| = |x_{t,l}| · sqrt( [ 1 - (R_{x/r} + 1 - R_{d/r})² / (4R_{x/r}) ] / [ 1 - (R_{x/r} - R_{d/r} - 1)² / (4R_{d/r}) ] )
After d'_{t,l} is obtained, it is substituted into equation (2) to calculate R_{d/r} for the next frame; when calculating the first frame, |x_{t,l}| is used instead of d'_{t,l}, and R_{x/r} is initialized to 1.0.
As an example, in the embodiment of the present invention β₂ is taken as 0.9.
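The two recursions above, the smoothed speech-to-reverberation energy ratio and the decision-directed style signal-to-noise estimate, can be sketched per frequency bin as follows. This is illustrative NumPy code under the assumptions stated in the text (β₁ = β₂ = 0.9, R_{x/r} initialized to 1.0); the function and argument names are not from the patent:

```python
import numpy as np

def update_ratios(x_mag2, d_prev_mag2, phi_r, Rxr_prev, beta1=0.9, beta2=0.9):
    """One frame of the two energy-ratio recursions for one frequency bin.

    x_mag2      : |x_{t,l}|^2, energy of the reverberant observation
    d_prev_mag2 : |d'_{t-1,l}|^2, energy of the previous desired estimate
    phi_r       : late-reverberation PSD sigma_r^2(t,l)
    Rxr_prev    : smoothed speech-to-reverberation ratio of the previous frame
    """
    # First ratio: smoothed speech-to-reverberation energy ratio.
    Rxr = beta1 * Rxr_prev + (1.0 - beta1) * x_mag2 / phi_r
    # Second ratio: signal-to-noise estimate, decision-directed form.
    Rdr = beta2 * d_prev_mag2 / phi_r + (1.0 - beta2) * np.maximum(Rxr - 1.0, 0.0)
    return Rxr, Rdr

# First-frame style initialization: R_x/r starts at 1.0 and |x| stands in for d'.
Rxr, Rdr = update_ratios(x_mag2=2.0, d_prev_mag2=1.0, phi_r=1.0, Rxr_prev=1.0)
```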
c) And obtaining a first power spectral density of the expected signal according to the frequency point amplitude of the expected signal.
σ_d²(t,l) = β₃ · σ_d²(t-1,l) + (1 - β₃) · |d'_{t,l}|²      (3)
wherein d'_{t,l} is the estimated desired-signal frequency-point amplitude, σ_d²(t,l) denotes the first power spectral density, and β₃ is the third smoothing factor, 0 < β₃ < 1; when processing the first frame, |x_{t,l}|² is used instead of σ_d²(t-1,l) in the calculation.
As an example, in the embodiment of the present invention β₃ is taken as 0.9.
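The geometric spectral subtraction step can be sketched as a gain applied to the observed magnitude, computed from the two ratios, with R_{x/r} in the role of the a posteriori SNR and R_{d/r} the a priori SNR of Lu and Loizou's geometric approach. The clipping of the squared-cosine terms is an implementation safeguard added here, not taken from the patent:

```python
import numpy as np

def geometric_gain(Rxr, Rdr):
    """Geometric-approach spectral-subtraction gain from the two ratios.

    Returns H such that |d'| = H * |x|. The squared-cosine terms are
    clipped to [0, 1] so the gain stays real-valued for extreme ratios.
    """
    num = 1.0 - np.clip((Rxr + 1.0 - Rdr) ** 2 / (4.0 * Rxr), 0.0, 1.0)
    den = 1.0 - np.clip((Rxr - Rdr - 1.0) ** 2 / (4.0 * Rdr), 0.0, 1.0)
    return np.sqrt(num / np.maximum(den, 1e-12))

# When the desired and reverberant components are orthogonal (Rxr = Rdr + 1)
# the gain reduces to sqrt(Rdr / Rxr), e.g. sqrt(0.5) for Rdr = 1, Rxr = 2.
g = geometric_gain(2.0, 1.0)
```

Unlike plain power spectral subtraction, this gain never produces negative power and so cannot zero out bins, which is exactly the over-subtraction artifact the method aims to avoid.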
And step S003, acquiring the voice signal after dereverberation according to the first power spectral density.
The method comprises the following specific steps:
1) and obtaining the expected signal frequency points at each channel of the current frame by using a weighted recursive least square formula according to the first power spectral density.
d_{t,l} = x_{t,l} - G_l(t-1)^H·x̄_{t-τ,l}

k_l(t) = Φ_l(t-1)·x̄_{t-τ,l} / (α·λ_d(t,l) + x̄_{t-τ,l}^H·Φ_l(t-1)·x̄_{t-τ,l})

Φ_l(t) = (1/α)·(Φ_l(t-1) - k_l(t)·x̄_{t-τ,l}^H·Φ_l(t-1))

G_l(t) = G_l(t-1) + k_l(t)·d_{t,l}^H

among them:

x_{t,l} = [x_{t,l}(1), x_{t,l}(2), …, x_{t,l}(M)]^T is the multichannel observation vector, and x̄_{t-τ,l} = [x_{t-τ,l}^T, x_{t-τ-1,l}^T, …, x_{t-τ-K+1,l}^T]^T is the MK×1 stacked vector of delayed observations over the K prediction taps.
wherein d_{t,l} represents the desired-signal frequency bins at each channel of the current frame; G_l(t) denotes the second prediction-coefficient matrix; k_l(t) denotes the gain vector used to update the prediction coefficients, of size MK×1; Φ_l(t) stores the inverse of the spatial correlation matrix, of size MK×MK; α is a constant representing the fourth smoothing factor.
As an example, in the embodiment of the present invention, α is set to 0.9999.
It should be noted that before the first frame is computed, G_l(t) is initialized to an all-zero matrix and Φ_l(t) to the identity matrix.
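The weighted recursive least-squares update of step 1), with the all-zero and identity initializations noted above, can be sketched with NumPy for a single frequency bin. This is an illustrative sketch of a standard adaptive-MCLP/RLS recursion, not a verbatim transcription of the patent's image formulas:

```python
import numpy as np

def wpe_rls_step(x_cur, x_bar, G, Phi, lam_d, alpha=0.9999):
    """One frame of the weighted RLS update for one frequency bin.

    x_cur : (M,)    current multichannel observation x_{t,l}
    x_bar : (MK,)   stacked delayed observations
    G     : (MK, M) prediction-coefficient matrix
    Phi   : (MK, MK) stored inverse of the spatial correlation matrix
    lam_d : scalar first power spectral density of the desired signal
    """
    d = x_cur - G.conj().T @ x_bar                      # desired-signal bins
    phi_x = Phi @ x_bar
    k = phi_x / (alpha * lam_d + x_bar.conj() @ phi_x)  # (MK,) gain vector
    Phi = (Phi - np.outer(k, x_bar.conj() @ Phi)) / alpha
    G = G + np.outer(k, d.conj())                       # coefficient update
    return d, G, Phi

M, K = 2, 3
rng = np.random.default_rng(0)
x_cur = rng.standard_normal(M) + 1j * rng.standard_normal(M)
x_bar = rng.standard_normal(M * K) + 1j * rng.standard_normal(M * K)
G = np.zeros((M * K, M), dtype=complex)   # all-zero initialization
Phi = np.eye(M * K, dtype=complex)        # identity initialization
d, G, Phi = wpe_rls_step(x_cur, x_bar, G, Phi, lam_d=1.0)
```

With G zero-initialized, the first frame's desired signal equals the observation itself, matching the initialization described above.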
2) Apply the inverse short-time Fourier transform to the desired-signal frequency bins to obtain the dereverberated speech signal.
After the inverse short-time Fourier transform of d_{t,l}, the algorithm outputs one frame of dereverberated speech.
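Step 2) is a standard inverse STFT; with SciPy it can be sketched as follows (illustrative only — the patent does not specify the window or frame parameters, so nperseg=512 is an assumption):

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440.0 * t)   # stand-in for one channel of speech

# The forward STFT yields the frequency bins; after they are replaced by the
# dereverberated bins d_{t,l}, the inverse STFT maps them back to a waveform.
_, _, X = stft(x, fs=fs, nperseg=512)
_, x_rec = istft(X, fs=fs, nperseg=512)
```

Because the default Hann window with 50% overlap satisfies the COLA condition, the forward/inverse pair reconstructs the frame sequence without distortion.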
Step S004: store the first power spectral density of the current frame as the historical first power spectral density for the next frame, and update the next frame's first power spectral density, until all dereverberated speech signals are obtained.
The method comprises the following specific steps:
Since the desired signal is modeled as a time-varying Gaussian with zero mean, the first power spectral density serves as its variance. The first power spectral density λ_d(t,l) of the current speech frame is therefore stored and substituted into equation (3) for the next frame, so that the estimation of the first power spectral density becomes:

λ_d(t+1,l) = β3·λ_d(t,l) + (1-β3)·|d'_{t+1,l}|²

It is then checked whether all speech frames have been processed; if frames remain, the dereverberation computation continues with the next frame of data until every speech frame has been processed.
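The carry-over of the first power spectral density between frames in step S004 can be sketched as a simple loop (illustrative Python with assumed names, not the patent's implementation):

```python
def track_first_psd(frame_mags, beta3=0.9):
    """Carry lambda_d across frames: the value stored for frame t seeds the
    recursion of frame t+1; frame 0 is seeded with |x|^2 as in the text."""
    lam_d = frame_mags[0] ** 2            # first-frame substitute for the history term
    history = []
    for mag in frame_mags:                # one |d'| magnitude per frame
        lam_d = beta3 * lam_d + (1 - beta3) * mag ** 2
        history.append(lam_d)             # stored for the next frame's update
    return history

psds = track_first_psd([2.0, 1.0])        # [4.0, 3.7]
```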
In summary, in the embodiments of the present invention, the collected reverberant speech of a reverberant environment is processed frame by frame to obtain the desired signal of the current frame. The speech-to-reverberation energy ratio and the signal-to-noise estimate of the desired signal are obtained and substituted into a geometric spectral-subtraction formula to perform spectral subtraction on the reverberant speech, yielding the first power spectral density of the desired signal; the speech-to-reverberation energy ratio is positively correlated with the first energy ratio (reverberant speech to reverberation component), and the signal-to-noise estimate is positively correlated with the second energy ratio (desired speech to reverberation component). The dereverberated speech signal is then obtained according to the first power spectral density, and the first power spectral density of the current frame is stored as the historical first power spectral density for the next frame, updating the next frame's first power spectral density until all dereverberated speech signals are obtained.
Through computer-aided experimental simulation, the performance of the MCLP-based speech dereverberation method was evaluated, as shown in figs. 2-15; the "improved MCLP algorithm" in the figures is the MCLP-based speech dereverberation method provided by the embodiment of the invention. Observing the time-domain waveforms in figs. 2-4 and the spectrograms in figs. 5-7, the speech processed by the embodiment of the invention shows cleaner time-domain envelopes and spectrogram ripples than the speech processed by the MCLP algorithm, with reduced smearing. Especially in the opening section of the speech, the clarity of the time-domain and frequency-domain waveforms is markedly improved over the MCLP algorithm and is no longer swollen and blurred, indicating that reverberation components are removed more thoroughly and that the overall stability of the algorithm is higher.
Among the four speech-quality evaluation criteria, higher scores on the Perceptual Evaluation of Speech Quality (PESQ), the speech-to-reverberation modulation energy ratio (SRMR), and the frequency-weighted segmental signal-to-noise ratio (FWsegSNR) indicate better speech quality, while a lower cepstral distance (CD) indicates better quality. The line graphs of figs. 8-11 show that, for reverberation times from 0.2 s to 1.2 s, the scores on all four evaluation indexes are clearly superior to those of the MCLP algorithm and the performance gain is stable, demonstrating the superiority of the embodiment of the invention. The line graphs of figs. 12-15 show that with 2, 4, 6, and 8 speech channels the four evaluation indexes are likewise significantly improved over the MCLP algorithm, and the performance gain grows with the number of channels.
The comparison shows that the speech quality produced by the MCLP-based speech dereverberation method is clearly superior to that of the original MCLP algorithm, and that the embodiment of the invention can further improve dereverberation performance to a certain extent.
Based on the same inventive concept as the above method, another embodiment of the present invention provides an MCLP-based speech dereverberation system, referring to fig. 16, which includes the following modules:
a reverberant speech pre-processing module 1001, a first power spectral density acquisition module 1002, a speech dereverberation module 1003 and a first power spectral density update module 1004.
The reverberation voice preprocessing module 1001 is configured to perform framing data processing on the collected reverberation voice of the reverberation environment to obtain an expected signal of the current frame; the first power spectral density obtaining module 1002 is configured to obtain a speech-to-reverberation energy ratio and a signal-to-noise estimation value of the desired signal, and substitute a geometric spectral subtraction formula to perform spectral subtraction on the reverberated speech to obtain a first power spectral density of the desired signal; the voice reverberation energy ratio and the first energy ratio are in positive correlation, and the signal-to-noise estimation value and the second energy ratio are in positive correlation; the first energy ratio is the energy ratio of the reverberation voice and the reverberation component; the second energy ratio is the energy ratio of the desired speech and reverberation components; the voice dereverberation module 1003 is configured to obtain a dereverberated voice signal according to the first power spectral density; the first power spectral density updating module 1004 is configured to store the first power spectral density of the current frame as a historical first power spectral density of the next frame, and update the first power spectral density of the next frame until all dereverberated speech signals are obtained.
Preferably, the reverberation voice preprocessing module comprises:
the prediction coefficient calculation module is used for calculating a prediction coefficient through the mathematical representation of the reverberation signal in a time-frequency domain;
and the expected signal calculation module is used for obtaining a first prediction coefficient matrix according to the prediction coefficient and calculating an expected signal by using the first prediction coefficient matrix and the reverberation voice subjected to framing processing.
Preferably, the first power spectral density acquisition module comprises:
and the voice reverberation energy ratio acquisition module is used for performing smooth calculation on the first energy ratio and the historical voice reverberation energy ratio to obtain the voice reverberation energy ratio of the current frame.
Preferably, the first power spectral density acquisition module comprises:
a signal-to-noise estimation value calculation module, configured to calculate a signal-to-noise estimation value:
R_{d/r}(t,l) = β2·R_{d/r}(t-1,l) + (1-β2)·|d'_{t,l}|²/λ_r(t,l)

wherein R_{d/r} represents the signal-to-noise estimate; |d'_{t,l}|²/λ_r(t,l) represents the second energy ratio, in which d'_{t,l} represents the estimated desired-signal frequency-bin amplitude, |d'_{t,l}|² represents the energy of the desired signal, and λ_r(t,l) represents the second power spectral density of the reverberation component; β2 represents the second smoothing factor; R_{x/r} represents the speech-to-reverberation energy ratio.
Preferably, the voice dereverberation module includes:
the expected signal frequency point acquisition module is used for acquiring an expected signal frequency point at each channel of the current frame by using a weighted recursive least square formula according to the first power spectral density;
and the dereverberation voice signal calculation module is used for carrying out short-time Fourier inverse transformation on the frequency point of the expected signal to obtain the voice signal after dereverberation.
In summary, in the embodiment of the present invention, the reverberation voice preprocessing module 1001 performs framing data processing on the collected reverberation voice of the reverberation environment to obtain an expected signal of the current frame; acquiring a voice reverberation energy ratio and a signal-to-noise estimation value of a desired signal through a first power spectral density acquisition module 1002, substituting a geometric spectrum subtraction formula to perform spectral subtraction on reverberant voice to obtain a first power spectral density of the desired signal; obtaining a dereverberated voice signal according to the first power spectral density through a voice dereverberation module 1003; the first power spectral density of the current frame is stored by the first power spectral density update module 1004 and is used as the historical first power spectral density of the next frame, and the first power spectral density of the next frame is updated until all dereverberated speech signals are obtained. The embodiment of the invention can further improve the dereverberation performance of the MCLP algorithm to a certain extent, and obtain the dereverberation voice with higher quality.
It should be noted that: the precedence order of the above embodiments of the present invention is only for description, and does not represent the merits of the embodiments. And specific embodiments thereof have been described above. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (10)

1. An MCLP-based speech dereverberation method, comprising the steps of:
the method comprises the steps of performing frame data processing on collected reverberation voice of a reverberation environment to obtain an expected signal of a current frame;
acquiring a voice reverberation energy ratio and a signal-to-noise estimation value of the expected signal, substituting a geometric spectrum subtraction formula to perform spectrum subtraction on the reverberation voice to obtain a first power spectral density of the expected signal; the voice reverberation energy ratio and the first energy ratio are in positive correlation, and the signal-to-noise estimation value and the second energy ratio are in positive correlation; the first energy ratio is the energy ratio of the reverberant speech and the reverberant component; the second energy ratio is the energy ratio of the desired speech and the reverberation component;
obtaining a dereverberated speech signal according to the first power spectral density;
and storing the first power spectral density of the current frame as the historical first power spectral density of the next frame, and updating the first power spectral density of the next frame until all the dereverberated voice signals are obtained.
2. The method of claim 1, wherein the step of obtaining the desired signal comprises:
calculating a prediction coefficient through the mathematical representation of the reverberation signal in a time-frequency domain;
and obtaining a first prediction coefficient matrix according to the prediction coefficient, and calculating the expected signal by using the first prediction coefficient matrix and the reverberation voice subjected to framing processing.
3. The method of claim 1, wherein the speech to reverberation energy ratio is calculated by:
and obtaining the voice reverberation energy ratio of the current frame by performing smooth calculation on the first energy ratio and the historical voice reverberation energy ratio.
4. The method of claim 1, wherein the signal-to-noise estimate is calculated by:
R_{d/r}(t,l) = β2·R_{d/r}(t-1,l) + (1-β2)·|d'_{t,l}|²/λ_r(t,l)

wherein R_{d/r} represents the signal-to-noise estimate; |d'_{t,l}|²/λ_r(t,l) represents the second energy ratio, in which d'_{t,l} represents the estimated desired-signal frequency-bin amplitude, |d'_{t,l}|² represents the energy of the desired signal, and λ_r(t,l) represents the second power spectral density of the reverberation component; β2 represents the second smoothing factor; R_{x/r} represents the speech-to-reverberation energy ratio.
5. The method of claim 1, wherein the step of obtaining the dereverberated speech signal comprises:
obtaining an expected signal frequency point at each channel of the current frame by using a weighted recursive least square formula according to the first power spectral density;
and carrying out short-time Fourier inverse transformation on the expected signal frequency point to obtain the voice signal after the reverberation is removed.
6. An MCLP-based speech dereverberation system, comprising the following modules:
the reverberation voice preprocessing module is used for performing framing data processing on the collected reverberation voice of the reverberation environment to obtain an expected signal of the current frame;
a first power spectral density obtaining module, configured to obtain a speech-to-reverberation energy ratio and a signal-to-noise estimation value of the desired signal, and substitute a geometric spectrum subtraction formula to perform spectrum subtraction on the reverberant speech to obtain a first power spectral density of the desired signal; the voice reverberation energy ratio and the first energy ratio are in positive correlation, and the signal-to-noise estimation value and the second energy ratio are in positive correlation; the first energy ratio is the energy ratio of the reverberant speech and the reverberant component; the second energy ratio is the energy ratio of the desired speech and the reverberation component;
the voice dereverberation module is used for acquiring a dereverberated voice signal according to the first power spectral density;
and the first power spectral density updating module is used for storing the first power spectral density of the current frame, taking the first power spectral density as the historical first power spectral density of the next frame, and updating the first power spectral density of the next frame until all the dereverberated voice signals are obtained.
7. The system of claim 6, wherein the reverberation speech pre-processing module comprises:
the prediction coefficient calculation module is used for calculating a prediction coefficient through the mathematical representation of the reverberation signal in a time-frequency domain;
and the expected signal calculation module is used for obtaining a first prediction coefficient matrix according to the prediction coefficient and calculating the expected signal by using the first prediction coefficient matrix and the reverberation voice subjected to framing processing.
8. The system of claim 6, wherein the first power spectral density acquisition module comprises:
and the voice reverberation energy ratio acquisition module is used for performing smooth calculation on the first energy ratio and the historical voice reverberation energy ratio to obtain the voice reverberation energy ratio of the current frame.
9. The system of claim 6, wherein the first power spectral density acquisition module comprises:
a signal-to-noise estimate calculation module configured to calculate the signal-to-noise estimate:
R_{d/r}(t,l) = β2·R_{d/r}(t-1,l) + (1-β2)·|d'_{t,l}|²/λ_r(t,l)

wherein R_{d/r} represents the signal-to-noise estimate; |d'_{t,l}|²/λ_r(t,l) represents the second energy ratio, in which d'_{t,l} represents the estimated desired-signal frequency-bin amplitude, |d'_{t,l}|² represents the energy of the desired signal, and λ_r(t,l) represents the second power spectral density of the reverberation component; β2 represents the second smoothing factor; R_{x/r} represents the speech-to-reverberation energy ratio.
10. The system of claim 6, wherein the speech dereverberation module comprises:
the expected signal frequency point acquisition module is used for acquiring an expected signal frequency point at each channel of the current frame by using a weighted recursive least square formula according to the first power spectral density;
and the dereverberation voice signal calculation module is used for carrying out short-time Fourier inverse transformation on the expected signal frequency point to obtain the voice signal after dereverberation.
CN202110247855.4A 2021-03-06 2021-03-06 MCLP-based voice dereverberation method and system Active CN113160842B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110247855.4A CN113160842B (en) 2021-03-06 2021-03-06 MCLP-based voice dereverberation method and system


Publications (2)

Publication Number Publication Date
CN113160842A true CN113160842A (en) 2021-07-23
CN113160842B CN113160842B (en) 2024-04-09

Family

ID=76884366

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110247855.4A Active CN113160842B (en) 2021-03-06 2021-03-06 MCLP-based voice dereverberation method and system

Country Status (1)

Country Link
CN (1) CN113160842B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023005409A1 (en) * 2021-07-26 2023-02-02 青岛海尔科技有限公司 Device determination method and device determination system

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101436407A (en) * 2008-12-22 2009-05-20 西安电子科技大学 Method for encoding and decoding audio
US20130151244A1 (en) * 2011-12-09 2013-06-13 Microsoft Corporation Harmonicity-based single-channel speech quality estimation
CN103413547A (en) * 2013-07-23 2013-11-27 大连理工大学 Method for eliminating indoor reverberations
WO2013189199A1 (en) * 2012-06-18 2013-12-27 歌尔声学股份有限公司 Method and device for dereverberation of single-channel speech
CN106340302A (en) * 2015-07-10 2017-01-18 深圳市潮流网络技术有限公司 De-reverberation method and device for speech data
CN108154885A (en) * 2017-12-15 2018-06-12 重庆邮电大学 It is a kind of to use QR-RLS algorithms to multicenter voice signal dereverberation method
US20180182410A1 (en) * 2016-12-23 2018-06-28 Synaptics Incorporated Online dereverberation algorithm based on weighted prediction error for noisy time-varying environments
CN109637554A (en) * 2019-01-16 2019-04-16 辽宁工业大学 MCLP speech dereverberation method based on CDR
CN110111804A (en) * 2018-02-01 2019-08-09 南京大学 Adaptive dereverberation method based on RLS algorithm
US20190267018A1 (en) * 2018-02-23 2019-08-29 Cirrus Logic International Semiconductor Ltd. Signal processing for speech dereverberation
CN111128220A (en) * 2019-12-31 2020-05-08 深圳市友杰智新科技有限公司 Dereverberation method, apparatus, device and storage medium
CN111161751A (en) * 2019-12-25 2020-05-15 声耕智能科技(西安)研究院有限公司 Distributed microphone pickup system and method under complex scene
US20200219524A1 (en) * 2017-09-21 2020-07-09 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Signal processor and method for providing a processed audio signal reducing noise and reverberation

Also Published As

Publication number Publication date
CN113160842B (en) 2024-04-09

Similar Documents

Publication Publication Date Title
JP5124014B2 (en) Signal enhancement apparatus, method, program and recording medium
EP1993320B1 (en) Reverberation removal device, reverberation removal method, reverberation removal program, and recording medium
CN108172231B (en) Dereverberation method and system based on Kalman filtering
JP5550456B2 (en) Reverberation suppression apparatus and reverberation suppression method
CN109979476B (en) Method and device for removing reverberation of voice
Xiao et al. The NTU-ADSC systems for reverberation challenge 2014
EP3685378B1 (en) Signal processor and method for providing a processed audio signal reducing noise and reverberation
CN111312269B (en) Rapid echo cancellation method in intelligent loudspeaker box
CN110111802B (en) Kalman filtering-based adaptive dereverberation method
Mack et al. Single-Channel Dereverberation Using Direct MMSE Optimization and Bidirectional LSTM Networks.
Wisdom et al. Enhancement and recognition of reverberant and noisy speech by extending its coherence
Huang et al. Multi-microphone adaptive noise cancellation for robust hotword detection
CN113160842B (en) MCLP-based voice dereverberation method and system
JP4348393B2 (en) Signal distortion removing apparatus, method, program, and recording medium recording the program
CN109243476B (en) Self-adaptive estimation method and device for post-reverberation power spectrum in reverberation voice signal
Lefkimmiatis et al. An optimum microphone array post-filter for speech applications.
CN116052702A (en) Kalman filtering-based low-complexity multichannel dereverberation noise reduction method
Huang et al. Dereverberation
Sehr et al. Towards robust distant-talking automatic speech recognition in reverberant environments
US20230306980A1 (en) Method and System for Audio Signal Enhancement with Reduced Latency
CN116758928A (en) MCLP language dereverberation method and system based on time-varying forgetting factor
JP5172797B2 (en) Reverberation suppression apparatus and method, program, and recording medium
Schwartz et al. LPC-based speech dereverberation using Kalman-EM algorithm
Meng et al. Frame-wise speech extraction with recursive expectation maximization for partially deformable microphone arrays
Bartolewska et al. Frame-based Maximum a Posteriori Estimation of Second-Order Statistics for Multichannel Speech Enhancement in Presence of Noise

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant