CN108986832B - Binaural voice dereverberation method and device based on voice occurrence probability and consistency - Google Patents
- Publication number: CN108986832B (application CN201810765266.3A)
- Authority: CN (China)
- Prior art keywords: power spectrum, reverberation, speech, signal, voice
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G — PHYSICS
- G10 — MUSICAL INSTRUMENTS; ACOUSTICS
- G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00 — Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02 — Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208 — Noise filtering
- G10L21/0216 — Noise filtering characterised by the method used for estimating noise
- G10L21/0232 — Processing in the frequency domain
- G10L2021/02082 — Noise filtering the noise being echo, reverberation of the speech
- G10L2021/02161 — Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02165 — Two microphones, one receiving mainly the noise signal and the other one mainly the speech signal
Abstract
The invention discloses a binaural voice dereverberation method and device based on voice occurrence probability and consistency. The method comprises the following steps: 1) carrying out time delay compensation on the voice signals received by the two microphones to obtain voice signals aligned in time; 2) performing windowing and framing processing, and transforming the voice signals from the time domain to the frequency domain by Fourier transform; 3) estimating the reverberation power spectrum of the low-frequency band based on the voice occurrence probability; 4) calculating the consistency of the different signal components of the voice signal; 5) estimating the reverberation power spectrum of the high-frequency band based on the consistency; 6) combining the high- and low-frequency reverberation power spectra according to the division threshold between the high and low frequency bands; 7) obtaining the final reverberation power spectrum with a recursive smoothing algorithm; 8) obtaining the dereverberated frequency-domain signal through a gain function; 9) obtaining the dereverberated time-domain signal by the inverse short-time Fourier transform. The invention can effectively remove reverberation over the whole frequency band and improve the perceived voice quality.
Description
Technical Field
The invention belongs to the technical field of audio signal processing and computer hearing, and particularly relates to a method and a device for removing reverberation of double-microphone voice in a reverberation environment.
Background
Binaural audio naturally has many advantages for communication and multimedia experiences. In daily human-to-human interaction, auditory perception is one of the most effective and direct modes of communication. In a real environment, however, speech, an important information carrier for human-to-human and human-to-machine communication, is inevitably corrupted by reverberation, environmental noise and other interference, which greatly reduces its clarity, intelligibility and comfort and seriously degrades both human auditory perception and the performance of downstream speech processing systems. In general, a microphone receives not only the direct-path component of a sound source but also reflected signals produced by multipath propagation (e.g., reflections from the floor, walls, ceiling and furnishings of a room). A reflected wave arriving about 50 ms or more after the direct sound is called an echo, and the combined effect of all reflections other than the direct sound is called reverberation; both impair the reception of the desired speech signal. To counteract the degradation in sound quality caused by reverberation, researchers have proposed dereverberation (or reverberation cancellation) techniques, which aim to improve the quality and intelligibility of the degraded speech.
Speech dereverberation techniques have wide application. With the development of modern signal processing and intelligent systems, robots are becoming increasingly capable, yet in practical use they often operate in complex acoustic environments where various kinds of noise interfere with speech acquisition; in a reverberant environment the speech recognition rate drops rapidly, hampering subsequent operations and functions and sometimes making practical deployment impossible. Reducing reverberation with binaural speech dereverberation technology is therefore of great significance for robots in practical applications. Binaural speech dereverberation can also serve as a pre-processing stage for many speech signal processing techniques, such as binaural sound source localization and speech recognition. Furthermore, people with hearing impairment often rely on a hearing aid or cochlear implant to communicate, and in a reverberant environment the benefit of a hearing aid is greatly reduced. In this case the non-clean speech signal should be preprocessed by a speech dereverberation algorithm before amplification; removing the reverberant signal to a certain extent helps hearing-impaired people communicate better.
Speech dereverberation techniques can generally be divided into single-channel and multi-channel approaches. A single-channel dereverberation algorithm enhances speech with a single microphone; thanks to its simple model and low cost, this approach has seen widespread application and mature development. However, a single-channel algorithm can only exploit the statistical properties of one speech signal to suppress reverberation. A multi-channel dereverberation system collects sound with several microphones, i.e., a microphone array, obtaining multi-channel signals. With more input channels, the processing algorithm can exploit the correlation between the channel signals for speech enhancement. Compared with the single-channel case, which can only exploit differences between speech and reverberation in the time-frequency domain, the microphone array compensates for the shortcomings of single-channel dereverberation. In general, increasing the number of microphones improves the dereverberation effect: an array can exploit not only the time-frequency information of the signals but also their spatial information, and has therefore attracted wide attention. Its disadvantages are its large physical size, complex system and heavy computational load. Weighing equipment cost, the real-time performance of the speech enhancement algorithm and its effectiveness, two-channel speech dereverberation, i.e., dereverberation using two microphones, is a good compromise.
Algorithms for two-microphone speech dereverberation mainly include methods based on a consistency (coherence) model and methods based on two-channel Wiener filtering. A consistency-based dereverberation algorithm designs the filter mainly according to the difference in consistency between clean speech and reverberant speech. The method assumes that the clean speech part and the reverberation part are uncorrelated, estimates the reverberation power in the received speech from the consistency of the clean speech, the reverberant speech and the speech received by the microphones, and computes the filter gain from the estimated reverberation power to obtain the dereverberated speech. A consistency-based two-channel speech dereverberation method mainly comprises the following steps:
1. Voice input, pre-filtering and analog-to-digital conversion. The input analog sound signal is first pre-filtered: high-pass filtering suppresses the 50 Hz power-line noise, and low-pass filtering removes frequency components above half the sampling frequency to prevent aliasing. The analog sound signal is then sampled and quantized to obtain a digital signal.
2. Pre-emphasis. The signal is passed through a high-frequency emphasis filter to compensate for the high-frequency attenuation caused by lip radiation.
3. Framing and windowing. Because a speech signal is slowly time-varying, it is non-stationary as a whole but stationary locally; it is generally considered stationary within 10-30 ms, so it can be framed with a length of 20 ms. The framing function is:

x_k(n) = w(n)·s(Nk + n), n = 0, 1, ..., N-1; k = 0, 1, ..., L-1, (1)

where N is the frame length, L the number of frames, and s the speech signal. w(n) is a window function whose choice (shape and length) strongly influences the short-time analysis parameters; commonly used windows include the rectangular, Hanning and Hamming windows. The Hamming window is generally chosen, as it reflects the changing characteristics of the speech signal well; its expression is:

w(n) = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1. (2)
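As a concrete illustration, the framing and Hamming windowing of equation (1) can be sketched as follows (a minimal sketch; equation (1) frames without overlap, whereas a 50% frame shift is shown here as an assumed choice, and the 16 kHz sampling rate and 20 ms frame length are assumed example values):

```python
import numpy as np

def frame_signal(s, frame_len, hop):
    """Split signal s into frames x_k(n) = w(n) * s(k*hop + n) and apply a
    Hamming window w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))."""
    w = np.hamming(frame_len)
    n_frames = 1 + (len(s) - frame_len) // hop
    frames = np.stack([s[k*hop : k*hop + frame_len] for k in range(n_frames)])
    return frames * w              # shape: (n_frames, frame_len)

# 16 kHz sampling, 20 ms frames (320 samples), 50% frame shift
fs = 16000
s = np.random.randn(fs)            # 1 s of test signal
frames = frame_signal(s, 320, 160)
print(frames.shape)                # (99, 320)
```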
4. Reverberation power spectrum estimation. The consistency of the clean speech and of the reverberant speech is estimated using forms derived in prior work, and the consistency of the speech received by the microphones is calculated from the defining formula of consistency.
5. The filter gain is calculated and the two-channel signal is filtered.
6. The filtered speech is converted to a time-domain output using the inverse Fourier transform.
Disclosure of Invention
The invention provides a new binaural speech dereverberation method and device for improving the dereverberation effect of the consistency-based two-microphone dereverberation algorithm in the low-frequency band.
The traditional consistency-based two-microphone dereverberation algorithm assumes that reverberation forms a diffuse sound field with low consistency while clean speech has high consistency, so reverberation can be removed according to consistency. In the low-frequency band, however, the consistency of reverberant speech is also high, so little low-frequency reverberation is removed. In addition, the conventional method computes the consistency of each sound component under a free-field assumption; with a binaural microphone pair, the consistency of each component is affected by head occlusion due to the "head shadow effect", and the free-field form is not applicable. Aiming at these two problems, the invention provides a binaural speech dereverberation method based on speech occurrence probability and consistency.
The technical scheme adopted by the invention is as follows:
a binaural voice dereverberation method based on voice occurrence probability and consistency mainly comprises the following steps:
1) carrying out time delay compensation on voice signals received by the two microphones to obtain voice signals aligned in time;
2) performing windowing and framing processing on the voice signals aligned in time, and transforming the voice signals from a time domain to a frequency domain through Fourier transform;
3) estimating a reverberation power spectrum of a low frequency band part of the speech signal based on the speech occurrence probability;
4) calculating the consistency of different signal components of the speech signal;
5) estimating a reverberation power spectrum of a highband portion of a speech signal based on the coherence;
6) estimating the reverberation power spectrum combined with high and low frequencies according to the reverberation power spectrum of the low frequency band part and the reverberation power spectrum of the high frequency band part and the division threshold of the high and low frequency bands;
7) calculating by using a recursive smoothing algorithm according to the combined high-low frequency reverberation power spectrum to obtain a final reverberation power spectrum;
8) calculating a gain function according to the final reverberation power spectrum, and obtaining a frequency domain signal after reverberation is removed through the gain function;
9) according to the dereverberated frequency-domain signal, obtaining the dereverberated time-domain signal by the inverse short-time Fourier transform.
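Steps 2) and 9) — the forward and inverse short-time Fourier transforms that bracket the pipeline — can be sketched as follows (a minimal sketch; the square-root Hann window and 50% overlap are assumed choices that give exact overlap-add reconstruction, not parameters mandated by the invention):

```python
import numpy as np

def stft(x, N=320, hop=160):
    """Step 2: frame, window and Fourier-transform the signal."""
    w = np.sqrt(0.5 - 0.5 * np.cos(2 * np.pi * np.arange(N) / N))  # periodic sqrt-Hann
    K = 1 + (len(x) - N) // hop
    return np.stack([np.fft.rfft(w * x[k*hop : k*hop + N]) for k in range(K)])

def istft(X, N=320, hop=160):
    """Step 9: inverse-transform each frame and overlap-add."""
    w = np.sqrt(0.5 - 0.5 * np.cos(2 * np.pi * np.arange(N) / N))
    y = np.zeros((X.shape[0] - 1) * hop + N)
    for k in range(X.shape[0]):
        y[k*hop : k*hop + N] += w * np.fft.irfft(X[k], n=N)
    return y

x = np.random.randn(16000)      # 1 s at 16 kHz
y = istft(stft(x))
# interior samples are reconstructed exactly; the edges lack full window overlap
err = np.max(np.abs(y[320:-320] - x[320:15680]))
print(err < 1e-10)
```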
The above steps are specifically described as follows:
1) Time delay compensation is performed on the voice signals received by the two microphones to obtain speech aligned in time. Since the speech signal reaches the two microphones with a time difference, the signals must be aligned before processing. The GCC-PHAT-ργ method based on generalized cross-correlation is adopted for time delay estimation; the binaural time difference is determined mainly by searching for the spectral-peak position of the cross-correlation function. The method can overcome the influence of interfering factors in the environment, such as correlated noise and reverberation, on the position of the cross-correlation spectral peak, and is relatively robust.
In the time domain, the two-channel speech model can be described as:
x_i(n) = s_i(n) + v_i(n), (3)
where x_i(n) denotes the speech signal received by a microphone, s_i(n) the clean speech signal and v_i(n) the noise signal; the index i ∈ {l, r} distinguishes the first (left) and second (right) microphone signals.
With a short-time fourier transform, the two-channel speech model can be represented in the frequency domain as:
X_i(λ, μ) = S_i(λ, μ) + V_i(λ, μ), (4)
where λ and μ denote the frame index and the frequency bin, respectively. The generalized cross-correlation function of the two received signals can then be expressed as:

R(Δτ) = ∫ W(ω)·G(ω)·e^{jωΔτ} dω, (5)

where Δτ is the time difference, * denotes the complex conjugate, and ω denotes the angular frequency. W(ω) is a frequency-domain weighting function used to sharpen the spectral peak of the cross-correlation function; the parameter ρ is a reverberation factor determined by the signal-to-noise ratio, γ(ω) is the coherence function of the speech received by the microphones (described in detail in step 4), and both are adaptively adjusted according to the environment. G(ω) denotes the cross-power spectrum, G(ω) = X_l(ω)·X_r*(ω). Thus, the time delay is obtained by maximizing the generalized cross-correlation function:

Δτ̂ = argmax_{Δτ} R(Δτ). (6)
2) Windowing and framing preprocessing are performed on the two aligned voices, and a Fourier transform converts the signals from the time domain to the frequency domain.
3) The reverberation power spectrum of the low-frequency band is estimated based on the voice occurrence probability. This step estimates the low-band reverberation power spectrum separately to ensure that low-band reverberation can be removed. For each channel, the speech power and the reverberation power are denoted φ_ss(λ, μ) and φ_vv(λ, μ), respectively. Because the presence of speech is uncertain, the reverberation power spectrum E(|V|²|X) is obtained with the minimum mean square error method:

E(|V|²|X) = P(H0|X)·E(|V|²|X, H0) + P(H1|X)·E(|V|²|X, H1), (7)

where X and V denote the discrete Fourier transforms of the signal received by the microphone and of the reverberant signal, respectively; H1 denotes the speech-present hypothesis and H0 the speech-absent hypothesis; P(H0|X) is the probability that speech is absent, E(|V|²|X, H0) the reverberation power spectrum when speech is absent, P(H1|X) the probability that speech is present, and E(|V|²|X, H1) the reverberation power spectrum when speech is present.
The a posteriori signal-to-reverberation ratio is defined as:

ξ = φ_ss / φ_vv. (8)
the probability of speech occurrence can be calculated by equation (9):
wherein ξoptIndicating the best a posteriori signal-to-mixture ratio. Research shows that when the true posterior signal-to-noise ratio is between-infinity and 20dB, 10 logs are taken10(ξopt) The speech occurrence probability calculation error is minimal at 15 dB. Calculating the probability of occurrence P (H)1| X), the probability P (H) that speech does not occur0| X) can be calculated using the following formula:
P(H0|X)=1-P(H1|X) (10)
When speech is absent, the speech received by the microphone can be regarded as pure reverberation noise, so the reverberation power spectrum is obtained from:

E(|V|²|X, H0) = E(|V|²|V) = |V|² = |X|². (11)
When speech is present, the reverberation power spectrum is taken from the reverberation estimate of the previous frame:

E(|V|²|X, H1) = φ̂_vv(λ-1, μ), (12)

where φ̂_vv is the self-power spectrum of the estimated reverberation. Thus the reverberation power spectrum E(|V|²|X) may be rewritten as:

E(|V|²|X) = P(H0|X)·|X|² + P(H1|X)·φ̂_vv(λ-1, μ). (13)
interframe smoothing of the reverberant power spectrum:
where alpha is a smoothing factor.
The reverberation power spectrum is updated only when the larger of the speech occurrence probabilities of the two channels (i.e., the two microphones) is below a threshold; otherwise it is left unchanged:

1) if max(P(H1|Xl), P(H1|Xr)) < p0 and P(H1|Xl) < P(H1|Xr), the smoothed update above is computed from the first microphone signal X_l;

2) if max(P(H1|Xl), P(H1|Xr)) < p0 and P(H1|Xl) > P(H1|Xr), the smoothed update above is computed from the second microphone signal X_r;

where P(H1|Xl) denotes the speech occurrence probability of the first microphone signal, P(H1|Xr) that of the second microphone signal, and p0 the threshold.
The low-frequency part of the reverberant voice signal undergoes reverberation power spectrum estimation based on this method; the result is recorded as the low-band estimate φ̂_vv,low(λ, μ).
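The low-band estimator of step 3 can be sketched per frequency bin as follows (a minimal sketch: the exact speech occurrence probability formula is not reproduced above, so a Gerkmann/Hendriks-style expression consistent with the stated ξ_opt = 15 dB is assumed, and the smoothing factor α = 0.9 and threshold p0 = 0.5 are assumed example values):

```python
import numpy as np

XI_OPT = 10 ** (15 / 10)   # optimal ratio: 10*log10(xi_opt) = 15 dB (from the text)
ALPHA = 0.9                # smoothing factor alpha (assumed example value)
P0 = 0.5                   # speech-presence threshold p0 (assumed example value)

def speech_presence_prob(x_pow, phi_vv):
    """Speech occurrence probability P(H1|X) for one channel (assumed
    Gerkmann/Hendriks-style form consistent with the stated xi_opt)."""
    xi = x_pow / np.maximum(phi_vv, 1e-12)
    return 1.0 / (1.0 + (1.0 + XI_OPT) * np.exp(-xi * XI_OPT / (1.0 + XI_OPT)))

def update_reverb_psd(phi_vv, xl_pow, xr_pow):
    """One low-band update of the reverberation power spectrum: refresh only
    when the larger of the two channels' speech probabilities is below p0,
    using the channel with the smaller probability as the observation."""
    pl = speech_presence_prob(xl_pow, phi_vv)
    pr = speech_presence_prob(xr_pow, phi_vv)
    if max(pl, pr) < P0:
        obs, p = (xl_pow, pl) if pl < pr else (xr_pow, pr)
        e_vv = (1 - p) * obs + p * phi_vv            # mix of |X|^2 and previous estimate
        return ALPHA * phi_vv + (1 - ALPHA) * e_vv   # recursive interframe smoothing
    return phi_vv                                    # speech active: keep previous estimate

phi = update_reverb_psd(1.0, 0.8, 0.9)       # low-energy frame -> estimate is updated
print(phi)
phi2 = update_reverb_psd(1.0, 100.0, 100.0)  # strong speech -> estimate is frozen
print(phi2)
```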
4) The consistency (coherence) of the different signal components is calculated. In the high-frequency band the reverberation signal is clearly distinguished from the speech signal by its consistency, so the high-band reverberation is estimated from consistency. First the consistency between the different components of the speech must be computed. The consistency of the speech received by the microphones can be calculated directly from the definition of consistency; the consistency between two signals x1 and x2 is defined as:

Γ_{x1x2}(λ, μ) = φ_{x1x2}(λ, μ) / sqrt(φ_{x1x1}(λ, μ)·φ_{x2x2}(λ, μ)),
where φ_{x1x1} and φ_{x2x2} denote the self-power spectra of the signals x1 and x2, and φ_{x1x2} their cross-power spectrum, computed with a recursive averaging method:

φ_{x1x2}(λ, μ) = α_PSD·φ_{x1x2}(λ-1, μ) + (1 - α_PSD)·X1(λ, μ)·X2*(λ, μ),

where α_PSD is a smoothing factor and * denotes the complex conjugate.
Reverberant speech is generally assumed to form a diffuse (scattering) sound field, i.e., one produced by innumerable uncorrelated signals propagating simultaneously in all directions with equal energy. In the traditional method, the ideal consistency of a diffuse sound field is calculated as:

Γ_diffuse(f) = sin(2πf·d_mic / c) / (2πf·d_mic / c),

where f denotes frequency, d_mic the distance between the two microphones, and c the speed of sound. However, when the two microphones are located at the left and right ears of a human head, the consistency of the diffuse field becomes more complicated owing to the shadowing of the head. The curve-fitted model proposed by Jeub et al. is therefore used to approximate it, with constants a_p, b_p and c_p taking the values 2.38×10⁻³, 1371 and 151.5 respectively, and model order P = 3.
For clean speech the consistency is high. Assuming the source reaches both microphones from the angle θ, the consistency between the clean speech components can be expressed as:

Γ_s(f) = exp(j·2πf·d_mic·cos θ / c),

where f denotes the frequency, c the speed of sound propagation in air, and d_mic the distance between the two microphones.
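The recursive consistency estimator and the free-field diffuse model above can be sketched as follows (a minimal sketch; the microphone spacing of 0.17 m and the smoothing factor α_PSD = 0.8 are assumed example values, and the head-shadow-corrected Jeub model is not reproduced):

```python
import numpy as np

C = 343.0      # speed of sound in air, m/s
D_MIC = 0.17   # microphone spacing in m (assumed, roughly the width of a head)

def diffuse_coherence(f):
    """Ideal free-field diffuse-field consistency sinc(2*pi*f*d/c); note
    np.sinc(x) = sin(pi*x)/(pi*x), hence the argument 2*f*d/c."""
    return np.sinc(2.0 * f * D_MIC / C)

class CoherenceEstimator:
    """Recursively averaged self-/cross-power spectra and the resulting
    complex consistency (coherence) of two STFT signals."""
    def __init__(self, n_bins, alpha_psd=0.8):   # alpha_PSD is an assumed value
        self.a = alpha_psd
        self.p11 = np.ones(n_bins)
        self.p22 = np.ones(n_bins)
        self.p12 = np.zeros(n_bins, dtype=complex)

    def update(self, X1, X2):
        a = self.a
        self.p11 = a * self.p11 + (1 - a) * np.abs(X1) ** 2
        self.p22 = a * self.p22 + (1 - a) * np.abs(X2) ** 2
        self.p12 = a * self.p12 + (1 - a) * X1 * np.conj(X2)
        return self.p12 / np.sqrt(self.p11 * self.p22)

# Identical (fully coherent) inputs drive |Gamma| toward 1
est = CoherenceEstimator(4)
rng = np.random.default_rng(1)
for _ in range(200):
    X = rng.standard_normal(4) + 1j * rng.standard_normal(4)
    g = est.update(X, X)
print(np.abs(g))
```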
5) The reverberation power spectrum of the high-frequency band is estimated from the signal consistency. Since the reverberant sound field is assumed to be diffuse, the noise signals received by the two microphones have the same power spectrum φ_vv. Considering the head shadow effect, the difference between the power spectra of the clean speech signals received by the binaural microphones cannot simply be ignored, and the power spectrum of the clean speech signal can be expressed as:
where H_l and H_r respectively denote the transfer functions of the left and right ears, S the sound source signal, S_l the sound signal received by the left microphone, and S_r the sound signal received by the right microphone. Combining the binaural signal coherence function γ yields:
Thus the relationships among the self-power and cross-power spectra of the left- and right-ear clean speech signals s_l and s_r, the reverberation signals v_l and v_r, and the speech signals x_l and x_r received by the microphones can be expressed as:
Since the reverberation is assumed uncorrelated with the speech, combining equations (23), (25) and (26) yields:
combining the definition of binaural coherence with equation (28), one obtains:
solving the equation (29) to obtain the estimation result phi of the reverberation power spectrumvv. Rewrite equation (29) to:
whose solution gives:
Theoretically, the consistency of the speech signal is strong while that of the reverberation signal is weak, and the consistency of the received signal does not exceed that of the clean speech signal, so equation (31) can be considered to have a real solution. To guarantee that the reverberation power spectrum φ_vv is positive, the reverberation power spectrum is calculated with equation (32):
where the self-power spectra and the cross-power spectrum are likewise calculated with the recursive averaging method.
The high-frequency part of the reverberant speech undergoes reverberation power spectrum estimation based on this method; the estimation result is recorded as the high-band estimate φ̂_vv,high(λ, μ).
Because a certain difference exists between the theoretical and the actual signal consistency, the result of the reverberation power spectrum estimation is affected. To further improve the estimate, the consistency values are updated online.
When the larger of the two speech occurrence probabilities is below a threshold, the consistency of the reverberation signal is updated from the consistency of the speech signals received by the microphones:

if max(P(H1|Xl), P(H1|Xr)) < p0: Γ̂_vlvr(λ, μ) = α_γ·Γ̂_vlvr(λ-1, μ) + (1 - α_γ)·γ_xlxr(λ, μ).

When the smaller of the two speech occurrence probabilities is above a threshold, the consistency of the clean speech signal, obtained through equation (29), is updated from the consistency of the received speech:

if min(P(H1|Xl), P(H1|Xr)) > p1: Γ̂_slsr(λ, μ) = α_γ·Γ̂_slsr(λ-1, μ) + (1 - α_γ)·(φ_xlxr - Γ̂_vlvr·φ_vv) / sqrt((φ_xlxl - φ_vv)·(φ_xrxr - φ_vv)),

where p0 and p1 denote thresholds, Γ̂_vlvr the consistency of the reverberation signal, α_γ the smoothing coefficient, γ_xlxr the consistency between the two voices received by the microphones, Γ̂_slsr the consistency of the clean speech signal, φ_xlxr the cross-power spectrum of the speech received by the two microphones, φ_xlxl the self-power spectrum of the speech received by the left microphone, φ_xrxr the self-power spectrum of the speech received by the right microphone, and φ_vv the self-power spectrum of the reverberation signal. Since the consistency-based reverberation power spectrum estimation uses only the square of the clean speech consistency, only the second update needs to be applied.
6) The high- and low-frequency reverberation power spectrum estimates are combined. When the frequency μ is below a set value μ_s (the frequency dividing the low and high bands), the reverberation power spectrum is the low-band estimate of step 3); when the frequency is above μ_s, it is the high-band estimate of step 5). That is:

φ̂_vv(λ, μ) = φ̂_vv,low(λ, μ) for μ < μ_s, and φ̂_vv(λ, μ) = φ̂_vv,high(λ, μ) for μ ≥ μ_s.
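Per frequency bin, the combination of step 6) is a simple piecewise selection, sketched below (the 800 Hz split frequency and the example spectra are illustrative assumptions, not values specified by the invention):

```python
import numpy as np

def combine_bands(phi_low, phi_high, freqs, f_split):
    """Step 6: below the split frequency use the probability-based low-band
    estimate, at or above it the consistency-based high-band estimate."""
    return np.where(freqs < f_split, phi_low, phi_high)

# Illustrative values: low-band estimate 2.0, high-band estimate 0.5,
# split frequency 800 Hz (an assumed example value for mu_s)
freqs = np.array([100.0, 500.0, 1000.0, 3000.0])
phi = combine_bands(np.full(4, 2.0), np.full(4, 0.5), freqs, 800.0)
print(phi)
```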
7) The final reverberation power spectrum is calculated from the combined high- and low-frequency reverberation power spectrum of step 6) using a conventional recursive smoothing algorithm.
8) A gain function is calculated. After the power spectrum of the reverberation signal is obtained through calculation, a gain function can be designed by using the reverberation power spectrum, and the signal received by the microphone is multiplied by the gain function to obtain the signal after reverberation is removed. Speech dereverberation based on an estimate of the reverberation power spectrum is often filtered by spectral subtraction. It is based on a simple principle: assuming that the reverberation is echo noise, a clean speech signal spectrum can be obtained by subtracting an estimate of the reverberation spectrum from the reverberant speech spectrum received by the microphone. The gain function is as follows:
where Φ̂vv denotes the estimated reverberation power spectrum, Φxx denotes the computed power spectrum of the speech signal received by the microphone, and ξ2(λ) denotes the square of the a posteriori signal-to-noise ratio. To avoid over-subtraction, a lower bound Gmin is set. The dereverberated speech signal is represented in the frequency domain as:
9) Finally, the dereverberated time-domain signal is obtained by applying the inverse short-time Fourier transform.
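Steps 8) and 9) can be sketched as a spectral-subtraction gain with a floor. The patent's exact gain involves the squared a posteriori signal-to-noise ratio, so the magnitude-domain form below is only an illustrative stand-in; it uses the subtraction factor (0.85) and spectral lower bound (-10 dB) from Table 1, and all names are hypothetical:

```python
import numpy as np

def spectral_subtraction_gain(phi_x, phi_v_hat, beta=0.85, g_min_db=-10.0):
    """Spectral-subtraction gain with a lower bound.

    phi_x     -- power spectrum of the reverberant microphone signal
    phi_v_hat -- estimated reverberation power spectrum
    beta      -- subtraction factor (0.85 in Table 1)
    g_min_db  -- gain floor (-10 dB in Table 1) to avoid over-subtraction
    """
    phi_x = np.asarray(phi_x, dtype=float)
    phi_v_hat = np.asarray(phi_v_hat, dtype=float)
    g_min = 10.0 ** (g_min_db / 20.0)
    # Subtract the scaled reverberation estimate, clamp at zero before the
    # square root, then apply the lower bound G_min.
    gain = np.sqrt(np.maximum(1.0 - beta * phi_v_hat / np.maximum(phi_x, 1e-12), 0.0))
    return np.maximum(gain, g_min)
```

Multiplying the microphone spectrum by this gain and applying the inverse short-time Fourier transform (e.g. by overlap-add) then yields the dereverberated time-domain signal of step 9).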
Correspondingly, the present invention also provides a binaural speech dereverberation apparatus based on speech occurrence probability and consistency, comprising:
the preprocessing unit is responsible for carrying out time delay compensation on the voice signals received by the two microphones to obtain voice signals aligned in time; performing windowing and framing processing on the voice signals aligned in time, and transforming the voice signals from a time domain to a frequency domain through Fourier transform;
the low-frequency-band reverberation power spectrum estimation unit is responsible for estimating the reverberation power spectrum of the low-frequency-band part of the voice signal based on the voice occurrence probability;
the high-frequency-band reverberation power spectrum estimation unit is responsible for calculating the consistency of different signal components of the voice signal; estimating a reverberation power spectrum of a highband portion of a speech signal based on the coherence;
the reverberation power spectrum estimation unit combined with high and low frequencies is responsible for estimating the reverberation power spectrum combined with high and low frequencies according to the reverberation power spectrum of the low frequency band part and the reverberation power spectrum of the high frequency band part and the division threshold value of high and low frequency bands;
the dereverberation unit is responsible for calculating to obtain a final reverberation power spectrum by using a recursive smoothing algorithm according to the reverberation power spectrum combined with the high frequency and the low frequency; calculating a gain function according to the final reverberation power spectrum, and obtaining a frequency domain signal after reverberation is removed through the gain function; and according to the frequency domain signal after dereverberation, obtaining a time domain signal after dereverberation by utilizing short-time Fourier inverse transformation.
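For concreteness, the windowing, framing and Fourier transform performed by the preprocessing unit could be sketched as below, using the frame length (320) and frame shift (160) from the embodiment. The Hann window and an FFT size equal to the frame length are assumptions; the patent specifies only the frame length and shift:

```python
import numpy as np

def stft_frames(x, frame_len=320, frame_shift=160):
    """Windowed framing + FFT (frame length 320, shift 160 at 16 kHz).

    Returns a (num_frames, frame_len) complex array of per-frame spectra.
    Assumes len(x) >= frame_len.
    """
    x = np.asarray(x, dtype=float)
    window = np.hanning(frame_len)  # assumed window; not specified in the patent
    num_frames = 1 + max(0, (x.size - frame_len) // frame_shift)
    frames = np.stack([x[i * frame_shift: i * frame_shift + frame_len] * window
                       for i in range(num_frames)])
    return np.fft.fft(frames, axis=1)
```

Each later unit (reverberation power spectrum estimation, gain computation, inverse transform) then operates on these per-frame spectra.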
The invention has the beneficial effects that:
the method adopts different reverberation power spectrum estimation for high and low frequencies by utilizing the difference of consistency between the reverberation and pure voice received by the two microphones, removes the reverberation of the low frequency part by utilizing a model for calculating the reverberation power spectrum based on the voice occurrence probability, and removes the reverberation of the high frequency part by utilizing the voice consistency model, so that the reverberation on the whole frequency band can be effectively removed, and the voice perception quality is improved.
Drawings
Fig. 1 is a flow diagram of a binaural speech dereverberation method based on speech occurrence probability and consistency according to the present invention.
Fig. 2 compares the true reverberation power spectrum with the reverberation power spectra estimated by the consistency-based dereverberation method before and after the improvement, in an embodiment of the present invention.
Fig. 3(a)-3(c) are, respectively, the spectrogram of a speech signal contaminated by reverberation, the spectrogram of the speech after dereverberation by the consistency-based method before the improvement, and the spectrogram of the speech after dereverberation using the speech occurrence probability and consistency after the improvement.
Detailed Description
The invention is described more fully below with reference to the following examples and accompanying drawings.
The database used in this embodiment is among the most authoritative and widely used in the international speech enhancement field. Clean speech was taken from the TSP database, 80 utterances in total, for testing. The signals received by the microphones were obtained by convolving the clean speech with room impulse responses provided by the AIR (Aachen Impulse Response) database. The AIR database was recorded by the Institute of Communication Systems of RWTH Aachen University in Germany using an HMS II artificial head; it covers scenes of different types, such as offices, meeting rooms and lecture halls, and is intended for studying signal processing algorithms in reverberant environments. The two microphones are located at the left and right ears of the artificial head, about 0.17 m apart.
The present embodiment adopts a binaural speech dereverberation method based on speech occurrence probability and consistency as shown in fig. 1 to perform speech dereverberation algorithm evaluation under different reverberation scenes. The specific settings for the parameters in the algorithm are shown in table 1.
TABLE 1 Algorithm parameter set
Parameter | Value
Sampling rate fs | 16 kHz
Frame length L | 320
Frame shift M | 160
Spectral smoothing | 50%
Subtraction factor β | 0.85
Spectral lower bound Gmin | -10 dB
Table 2 shows the perceptual evaluation of speech quality (PESQ) scores and the improvement in the speech-to-reverberation modulation energy ratio (ΔSRMR) obtained by the method before the improvement, which uses only the consistency for reverberation estimation and removal, and by the improved method, which uses the speech occurrence probability and the consistency. Comparing ΔSRMR before and after the improvement shows that the dereverberation method based on speech occurrence probability and consistency removes noticeably more reverberation and therefore achieves higher PESQ scores.
TABLE 2 PESQ and ΔSRMR of the algorithm before and after the improvement
Reverberation scenario | Office | Speech room | Corridor | Auditorium
Reverberation time | 0.45s | 0.85s | 0.83s | 5.16s
Initial PESQ value | 1.89 | 1.62 | 1.74 | 1.44
PESQ - before improvement | 2.19 | 1.78 | 1.92 | 1.61
PESQ - after improvement | 2.42 | 2.00 | 2.07 | 1.78
ΔSRMR - before improvement | 1.05 | 1.11 | 1.19 | 0.90
ΔSRMR - after improvement | 1.32 | 1.37 | 1.41 | 1.18
Fig. 2 shows the power spectrum of the true reverberant signal in the office scenario of this embodiment, together with the reverberation power spectra estimated by the consistency-based method before and after the improvement. It is apparent from Fig. 2 that the power spectrum estimated by the improved method is closer to the true reverberation power spectrum.
The dereverberation effect can be observed more clearly from the spectrogram of the dereverberated speech signal; examples are given in Fig. 3(a)-3(c), which show, respectively, the spectrogram of a speech signal contaminated by reverberation, the spectrogram after dereverberation by the consistency-based method before the improvement, and the spectrogram after dereverberation using the speech occurrence probability and consistency after the improvement. The spectrograms show that the method of the invention removes more reverberation, especially in the low-frequency part.
Another embodiment of the present invention provides a binaural speech dereverberation apparatus based on speech occurrence probability and consistency, including:
the preprocessing unit is responsible for carrying out time delay compensation on the voice signals received by the two microphones to obtain voice signals aligned in time; performing windowing and framing processing on the voice signals aligned in time, and transforming the voice signals from a time domain to a frequency domain through Fourier transform;
the low-frequency-band reverberation power spectrum estimation unit is responsible for estimating the reverberation power spectrum of the low-frequency-band part of the voice signal based on the voice occurrence probability;
the high-frequency-band reverberation power spectrum estimation unit is responsible for calculating the consistency of different signal components of the voice signal; estimating a reverberation power spectrum of a highband portion of a speech signal based on the coherence;
the reverberation power spectrum estimation unit combined with high and low frequencies is responsible for estimating the reverberation power spectrum combined with high and low frequencies according to the reverberation power spectrum of the low frequency band part and the reverberation power spectrum of the high frequency band part and the division threshold value of high and low frequency bands;
the dereverberation unit is responsible for calculating to obtain a final reverberation power spectrum by using a recursive smoothing algorithm according to the reverberation power spectrum combined with the high frequency and the low frequency; calculating a gain function according to the final reverberation power spectrum, and obtaining a frequency domain signal after reverberation is removed through the gain function; and according to the frequency domain signal after dereverberation, obtaining a time domain signal after dereverberation by utilizing short-time Fourier inverse transformation.
The above examples are merely illustrative of the present invention, and although examples of the present invention are disclosed for illustrative purposes, those skilled in the art will appreciate that: various substitutions, changes and modifications are possible without departing from the spirit and scope of the present invention and the appended claims. Therefore, the present invention should not be limited to the content of the examples, and the scope of the present invention should be defined by the claims.
Claims (8)
1. A binaural speech dereverberation method based on speech occurrence probability and consistency comprises the following steps:
1) carrying out time delay compensation on voice signals received by the two microphones to obtain voice signals aligned in time;
2) performing windowing and framing processing on the voice signals aligned in time, and transforming the voice signals from a time domain to a frequency domain through Fourier transform;
3) estimating a reverberation power spectrum of a low frequency band part of the speech signal based on the speech occurrence probability; separately estimating the reverberation power spectrum of the low frequency band to ensure that the reverberation of the low frequency band can be removed; when the larger value of the voice occurrence probabilities of the two channels is lower than a certain threshold value, updating the reverberation power spectrum, otherwise, not updating; the method for updating the reverberation power spectrum comprises the following steps:
a) if max(P(H1|Xl), P(H1|Xr)) < p0 and P(H1|Xl) < P(H1|Xr),
b) if max(P(H1|Xl), P(H1|Xr)) < p0 and P(H1|Xl) > P(H1|Xr),
where P(H1|Xl) denotes the speech occurrence probability of the first microphone signal Xl, P(H1|Xr) denotes the speech occurrence probability of the second microphone signal Xr, p0 denotes the threshold, λ and μ denote the frame index and the frequency respectively, H1 denotes speech presence, H0 denotes speech absence, and Φ̂vv is the estimated self-power spectrum of the reverberation;
4) calculating the consistency of different signal components of the speech signal;
5) estimating a reverberation power spectrum of a highband portion of a speech signal based on the coherence;
6) estimating the reverberation power spectrum combined with high and low frequencies according to the reverberation power spectrum of the low frequency band part and the reverberation power spectrum of the high frequency band part and the division threshold of the high and low frequency bands;
7) calculating by using a recursive smoothing algorithm according to the combined high-low frequency reverberation power spectrum to obtain a final reverberation power spectrum;
8) calculating a gain function according to the final reverberation power spectrum, and obtaining a frequency domain signal after reverberation is removed through the gain function;
9) and according to the frequency domain signal after dereverberation, obtaining a time domain signal after dereverberation by utilizing short-time Fourier inverse transformation.
2. The method as claimed in claim 1, wherein the two speech signals in step 1) are time-delay compensated using the GCC-PHAT-ργ method, so as to overcome the influence of interfering factors in the environment on the peak position of the cross-correlation function.
3. The method of claim 1, wherein step 4) assumes reverberation as a diffuse sound field and computes the consistency using a reverberation consistency model with head occlusion.
4. The method of claim 1, wherein step 5) comprises the sub-steps of:
5-1) updating the consistency of the signals according to the voice occurrence probability at all frequencies;
5-2) considering the influence of the head shielding effect, and estimating the reverberation power spectrum by combining a consistency function under the condition that the power spectrums of pure voice signals received by the two microphones are different.
5. The method of claim 4, wherein the self-power spectrum and cross-power spectrum of the clean speech received by the two microphones in step 5) are represented as:
6. The method of claim 5, wherein step 5-1) comprises:
a) updating the consistency of the reverberant speech: when the larger of the two speech occurrence probabilities is lower than a certain threshold, the consistency of the reverberant signal is updated using the consistency of the speech signals received by the microphones as follows:
if max(P(H1|Xl), P(H1|Xr)) < p0,
where Γvv denotes the consistency of the reverberant signal, αγ denotes the smoothing coefficient, Γxlxr denotes the consistency between the two speech signals received by the microphones, and p0 denotes the threshold in "when the larger of the two speech occurrence probabilities is lower than a certain threshold";
b) updating the consistency of the clean speech: when the smaller of the two speech occurrence probabilities is higher than a certain threshold, the consistency of the speech signals received by the microphones is used to update the consistency of the clean speech signal as follows:
if min(P(H1|Xl), P(H1|Xr)) > p1,
where Γslsr denotes the consistency of the clean speech signal, Φxlxr denotes the cross-power spectrum of the speech received by the two microphones, Φxlxl denotes the self-power spectrum of the speech received by the left microphone, Φxrxr denotes the self-power spectrum of the speech received by the right microphone, Φvv denotes the self-power spectrum of the reverberant signal, and p1 denotes the threshold in "when the smaller of the two speech occurrence probabilities is higher than a certain threshold";
step 5-2) the estimation of the reverberation power spectrum is as follows:
7. the method of claim 6, wherein the reverberation power spectrum of the combined high and low frequencies estimated in step 6) is:
where μ denotes a frequency and μs denotes the frequency value that separates the low and high bands; the former estimate is the reverberation power spectrum of the low-band part estimated from the speech occurrence probability, and the latter is the reverberation power spectrum of the high-band part estimated from the consistency.
8. A binaural speech dereverberation device based on speech occurrence probability and consistency using the method of any of claims 1-7, comprising:
the preprocessing unit is responsible for carrying out time delay compensation on the voice signals received by the two microphones to obtain voice signals aligned in time; performing windowing and framing processing on the voice signals aligned in time, and transforming the voice signals from a time domain to a frequency domain through Fourier transform;
the low-frequency-band reverberation power spectrum estimation unit is responsible for estimating the reverberation power spectrum of the low-frequency-band part of the voice signal based on the voice occurrence probability;
the high-frequency-band reverberation power spectrum estimation unit is responsible for calculating the consistency of different signal components of the voice signal; estimating a reverberation power spectrum of a highband portion of a speech signal based on the coherence;
the reverberation power spectrum estimation unit combined with high and low frequencies is responsible for estimating the reverberation power spectrum combined with high and low frequencies according to the reverberation power spectrum of the low frequency band part and the reverberation power spectrum of the high frequency band part and the division threshold value of high and low frequency bands;
the dereverberation unit is responsible for calculating to obtain a final reverberation power spectrum by using a recursive smoothing algorithm according to the reverberation power spectrum combined with the high frequency and the low frequency; calculating a gain function according to the final reverberation power spectrum, and obtaining a frequency domain signal after reverberation is removed through the gain function; and according to the frequency domain signal after dereverberation, obtaining a time domain signal after dereverberation by utilizing short-time Fourier inverse transformation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810765266.3A CN108986832B (en) | 2018-07-12 | 2018-07-12 | Binaural voice dereverberation method and device based on voice occurrence probability and consistency |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810765266.3A CN108986832B (en) | 2018-07-12 | 2018-07-12 | Binaural voice dereverberation method and device based on voice occurrence probability and consistency |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108986832A CN108986832A (en) | 2018-12-11 |
CN108986832B true CN108986832B (en) | 2020-12-15 |
Family
ID=64537944
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810765266.3A Active CN108986832B (en) | 2018-07-12 | 2018-07-12 | Binaural voice dereverberation method and device based on voice occurrence probability and consistency |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108986832B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110095755B (en) * | 2019-04-01 | 2021-03-12 | 云知声智能科技股份有限公司 | Sound source positioning method |
CN110012331B (en) * | 2019-04-11 | 2021-05-25 | 杭州微纳科技股份有限公司 | Infrared-triggered far-field double-microphone far-field speech recognition method |
CN110718230B (en) * | 2019-08-29 | 2021-12-17 | 云知声智能科技股份有限公司 | Method and system for eliminating reverberation |
CN110691296B (en) * | 2019-11-27 | 2021-01-22 | 深圳市悦尔声学有限公司 | Channel mapping method for built-in earphone of microphone |
CN111128213B (en) * | 2019-12-10 | 2022-09-27 | 展讯通信(上海)有限公司 | Noise suppression method and system for processing in different frequency bands |
CN113613112B (en) * | 2021-09-23 | 2024-03-29 | 三星半导体(中国)研究开发有限公司 | Method for suppressing wind noise of microphone and electronic device |
CN115831145B (en) * | 2023-02-16 | 2023-06-27 | 之江实验室 | Dual-microphone voice enhancement method and system |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2006243290A (en) * | 2005-03-02 | 2006-09-14 | Advanced Telecommunication Research Institute International | Disturbance component suppressing device, computer program, and speech recognition system |
WO2009151062A1 (en) * | 2008-06-10 | 2009-12-17 | ヤマハ株式会社 | Acoustic echo canceller and acoustic echo cancel method |
JP2011065128A (en) * | 2009-08-20 | 2011-03-31 | Mitsubishi Electric Corp | Reverberation removing device |
CN102347028A (en) * | 2011-07-14 | 2012-02-08 | 瑞声声学科技(深圳)有限公司 | Double-microphone speech enhancer and speech enhancement method thereof |
CN102800322A (en) * | 2011-05-27 | 2012-11-28 | 中国科学院声学研究所 | Method for estimating noise power spectrum and voice activity |
JP2013044908A (en) * | 2011-08-24 | 2013-03-04 | Nippon Telegr & Teleph Corp <Ntt> | Background sound suppressor, background sound suppression method and program |
CN106297817A (en) * | 2015-06-09 | 2017-01-04 | 中国科学院声学研究所 | A kind of sound enhancement method based on binaural information |
CN106971740A (en) * | 2017-03-28 | 2017-07-21 | 吉林大学 | Probability and the sound enhancement method of phase estimation are had based on voice |
-
2018
- 2018-07-12 CN CN201810765266.3A patent/CN108986832B/en active Active
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2006243290A (en) * | 2005-03-02 | 2006-09-14 | Advanced Telecommunication Research Institute International | Disturbance component suppressing device, computer program, and speech recognition system |
WO2009151062A1 (en) * | 2008-06-10 | 2009-12-17 | ヤマハ株式会社 | Acoustic echo canceller and acoustic echo cancel method |
JP2011065128A (en) * | 2009-08-20 | 2011-03-31 | Mitsubishi Electric Corp | Reverberation removing device |
CN102800322A (en) * | 2011-05-27 | 2012-11-28 | 中国科学院声学研究所 | Method for estimating noise power spectrum and voice activity |
CN102347028A (en) * | 2011-07-14 | 2012-02-08 | 瑞声声学科技(深圳)有限公司 | Double-microphone speech enhancer and speech enhancement method thereof |
JP2013044908A (en) * | 2011-08-24 | 2013-03-04 | Nippon Telegr & Teleph Corp <Ntt> | Background sound suppressor, background sound suppression method and program |
CN106297817A (en) * | 2015-06-09 | 2017-01-04 | 中国科学院声学研究所 | A kind of sound enhancement method based on binaural information |
CN106971740A (en) * | 2017-03-28 | 2017-07-21 | 吉林大学 | Probability and the sound enhancement method of phase estimation are had based on voice |
Non-Patent Citations (2)
Title |
---|
Supervised single-channel speech dereverberation and denoising using a two-stage model based sparse representation;Zhang Long et al.;《Speech Communication》;20171227;第1-8页 * |
基于麦克风阵列的混响消减处理;陈建荣等;《电声技术》;20180315;第42卷(第3期);第49-51、54页 * |
Also Published As
Publication number | Publication date |
---|---|
CN108986832A (en) | 2018-12-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108986832B (en) | Binaural voice dereverberation method and device based on voice occurrence probability and consistency | |
Yousefian et al. | A dual-microphone speech enhancement algorithm based on the coherence function | |
CN105869651B (en) | Binary channels Wave beam forming sound enhancement method based on noise mixing coherence | |
Wu et al. | A two-stage algorithm for one-microphone reverberant speech enhancement | |
CN111161751A (en) | Distributed microphone pickup system and method under complex scene | |
JP2013527493A (en) | Robust noise suppression with multiple microphones | |
US9532149B2 (en) | Method of signal processing in a hearing aid system and a hearing aid system | |
CN105679330B (en) | Based on the digital deaf-aid noise-reduction method for improving subband signal-to-noise ratio (SNR) estimation | |
Aroudi et al. | Cognitive-driven binaural LCMV beamformer using EEG-based auditory attention decoding | |
Jangjit et al. | A new wavelet denoising method for noise threshold | |
Yousefian et al. | A coherence-based noise reduction algorithm for binaural hearing aids | |
Itoh et al. | Environmental noise reduction based on speech/non-speech identification for hearing aids | |
CN110827847A (en) | Microphone array voice denoising and enhancing method with low signal-to-noise ratio and remarkable growth | |
CN106328160B (en) | Noise reduction method based on double microphones | |
CN114023352B (en) | Voice enhancement method and device based on energy spectrum depth modulation | |
Miyazaki et al. | Theoretical analysis of parametric blind spatial subtraction array and its application to speech recognition performance prediction | |
CN114189781A (en) | Noise reduction method and system for double-microphone neural network noise reduction earphone | |
Gonzalez-Rodriguez et al. | Speech dereverberation and noise reduction with a combined microphone array approach | |
Wu et al. | A two-stage algorithm for enhancement of reverberant speech | |
Madhu et al. | Localisation-based, situation-adaptive mask generation for source separation | |
Hussain et al. | Speech enhancement using degenerate unmixing estimation technique and adaptive noise cancellation technique as a post signal processing | |
Akagi et al. | Noise reduction using a small-scale microphone array in multi noise source environment | |
Brutti et al. | A Phase-Based Time-Frequency Masking for Multi-Channel Speech Enhancement in Domestic Environments. | |
Unoki et al. | Unified denoising and dereverberation method used in restoration of MTF-based power envelope | |
Zhang et al. | Post-secondary filtering improvement of GSC beamforming algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||