CN108986832B - Binaural voice dereverberation method and device based on voice occurrence probability and consistency - Google Patents

Info

Publication number: CN108986832B (application CN201810765266.3A; earlier publication CN108986832A)
Authority: CN (China)
Inventors: Liu Hong (刘宏), Wang Xiuling (王秀玲)
Assignee (original and current): Peking University Shenzhen Graduate School
Filing date: 2018-07-12
Publication dates: 2018-12-11 (CN108986832A), 2020-12-15 (CN108986832B, grant)
Original language: Chinese (zh)
Legal status: Active (granted)

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L21/0232: Processing in the frequency domain
    • G10L2021/02082: Noise filtering the noise being echo, reverberation of the speech
    • G10L2021/02161: Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02165: Two microphones, one receiving mainly the noise signal and the other one mainly the speech signal

Abstract

The invention discloses a binaural speech dereverberation method and device based on speech occurrence probability and consistency. The method comprises the following steps: 1) performing time delay compensation on the speech signals received by the two microphones to obtain time-aligned speech signals; 2) performing windowing and framing, and transforming the speech signals from the time domain to the frequency domain by Fourier transform; 3) estimating the reverberation power spectrum of the low-frequency band based on the speech occurrence probability; 4) calculating the consistency of the different signal components of the speech signal; 5) estimating the reverberation power spectrum of the high-frequency band based on the consistency; 6) combining the two estimates into a full-band reverberation power spectrum according to the threshold dividing the high and low frequency bands; 7) obtaining the final reverberation power spectrum with a recursive smoothing algorithm; 8) obtaining the dereverberated frequency-domain signal through a gain function; 9) obtaining the dereverberated time-domain signal by inverse short-time Fourier transform. The invention can effectively remove reverberation over the whole frequency band and improve speech perceptual quality.

Description

Binaural voice dereverberation method and device based on voice occurrence probability and consistency
Technical Field
The invention belongs to the technical field of audio signal processing and computer audition, and particularly relates to a method and a device for dereverberating dual-microphone speech in a reverberant environment.
Background
Binaural audio naturally has many advantages for communication and multimedia experiences. In daily human-to-human interaction, auditory perception is one of the most effective and direct modes of communication. In real environments, however, speech, an important information carrier for human-to-human and human-to-machine communication, is inevitably corrupted by reverberation, environmental noise and the like, which greatly reduces its clarity, intelligibility and comfort and seriously degrades both human auditory perception and the performance of downstream speech processing systems. In general, a microphone receives not only the direct-path component of a sound source but also reflected signals arising from multipath propagation (e.g., reflections from the floor, walls, ceiling and furnishings of a room). A reflected wave delayed by roughly 50 ms or more is called an echo, and the effect of the reflected waves other than the direct sound is called reverberation; both degrade the reception of the desired speech signal. To counteract the degradation in sound quality caused by reverberation, researchers have proposed dereverberation (or reverberation cancellation) techniques, which aim to improve the quality and intelligibility of the degraded speech.
Speech dereverberation techniques have wide application. With the development of modern signal processing and intelligent systems, robots are becoming increasingly intelligent, yet in practice they often operate in complex acoustic environments: various kinds of noise interfere with speech acquisition, and the speech recognition rate drops rapidly in reverberant environments, impairing subsequent operations and functions and sometimes ruling out practical use altogether. Reducing reverberation with binaural speech dereverberation is therefore of great significance for robots in real applications. Binaural speech dereverberation can also serve as pre-processing for many speech signal processing techniques, such as binaural sound source localization and speech recognition. In addition, people with hearing impairment often depend on a hearing aid or cochlear implant to communicate, and in a reverberant environment the benefit of a hearing aid is greatly reduced. In this case the degraded speech signal should be pre-processed by a dereverberation algorithm before amplification; removing the reverberant component to some extent helps hearing-impaired people communicate better.
Speech dereverberation techniques can generally be divided into single-channel and multi-channel approaches. Single-channel dereverberation uses one microphone; thanks to its simple model and low cost it has been widely applied and is well developed, but it can only exploit the statistical properties of a single speech signal to suppress reverberation. Multi-channel dereverberation systems collect sound with several microphones, i.e., a microphone array, yielding multi-channel signals. Because of the additional input channels, the processing algorithm can exploit the correlation between channel signals for speech enhancement. Compared with the single-channel case, which can only rely on differences between speech and reverberation in the time-frequency domain, a microphone array can exploit both time-frequency and spatial information, which is why it has attracted wide attention; increasing the number of microphones generally improves dereverberation. Its drawbacks are bulky hardware, a complex system and a large computational load. Weighing equipment cost, the real-time behaviour of the enhancement algorithm and its performance, two-channel dereverberation, i.e., dereverberation with two microphones, is a good compromise.
Dual-microphone speech dereverberation algorithms mainly include methods based on a consistency (coherence) model and methods based on two-channel Wiener filtering. Consistency-based dereverberation designs the filter according to the difference in consistency between clean speech and reverberant speech: assuming that the clean speech and the reverberation are uncorrelated, the consistency of the clean speech, of the reverberant speech and of the speech received by the microphones is used to estimate the reverberation power in the received speech, from which the filter gain is computed to obtain the dereverberated speech. A consistency-based two-channel speech dereverberation method mainly comprises the following steps:
1. Speech input, pre-filtering and analog-to-digital conversion. The input analog sound signal is first pre-filtered: high-pass filtering suppresses the 50 Hz mains interference, and low-pass filtering removes frequency components above half the sampling frequency to prevent aliasing. The analog sound signal is then sampled and quantized to obtain a digital signal.
2. Pre-emphasis. The signal is passed through a high-frequency emphasis filter to compensate for the high-frequency attenuation caused by lip radiation.
3. Framing and windowing. Because a speech signal varies slowly in time, it is non-stationary globally but stationary locally; it is generally considered stationary over 10-30 ms, so it can be divided into frames of, e.g., 20 ms. The framing operation is:

x_k(n) = w(n) s(Nk + n), n = 0, 1, ..., N-1; k = 0, 1, ..., L-1   (1)

where N is the frame length, L is the number of frames, and s denotes the speech signal. w(n) is a window function whose choice (shape and length) strongly affects the behaviour of the short-time analysis parameters; commonly used windows include the rectangular, Hanning and Hamming windows. The Hamming window is usually chosen, as it reflects the characteristic variation of the speech signal well:

w(n) = 0.54 - 0.46 cos(2πn/(N-1)), 0 ≤ n ≤ N-1   (2)

(A code sketch of this framing operation is given after this step list.)
4. Reverberation power spectrum estimation. The consistency of the clean speech and of the reverberant speech is taken from models established in prior work, and the consistency of the speech received by the microphones is computed from the defining formula of consistency.
5. The filter gain is calculated and the dual channel signal is filtered.
6. The filtered speech is converted to a time domain output using an inverse fourier transform.
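As a concrete illustration of step 3 above, the following minimal Python sketch implements the framing of equation (1) with the Hamming window of equation (2). The 320-sample frames with a 160-sample shift follow Table 1 of the embodiment (20 ms frames at 16 kHz); the overlapping shift generalizes the non-overlapping form s(Nk + n).

    import numpy as np

    def frame_signal(s, frame_len=320, hop=160):
        # Split the speech signal s into frames and apply a Hamming window.
        # Eq. (1) with a frame shift of hop samples (hop = frame_len reproduces
        # the non-overlapping form s(Nk + n)); the window is eq. (2).
        n = np.arange(frame_len)
        hamming = 0.54 - 0.46 * np.cos(2 * np.pi * n / (frame_len - 1))  # eq. (2)
        num_frames = 1 + (len(s) - frame_len) // hop
        return np.stack([hamming * s[k * hop: k * hop + frame_len]
                         for k in range(num_frames)])  # shape (L, N)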
Disclosure of Invention
The invention provides a new binaural speech dereverberation method and device that improve the dereverberation performance of the consistency-based dual-microphone algorithm in the low-frequency band.
The traditional consistency-based dual-microphone dereverberation algorithm assumes that the reverberation forms a diffuse sound field with low consistency while clean speech has high consistency, so reverberation can be removed according to consistency. In the low-frequency band, however, the consistency of the reverberant speech is also high, so little low-frequency reverberation is removed. In addition, the conventional method computes the consistency of each sound-field component under a free-field assumption; with a binaural microphone pair the consistency is affected by head occlusion owing to the "head shadow effect", so the free-field form is not appropriate. Aiming at these two problems, the invention provides a binaural speech dereverberation method based on speech occurrence probability and consistency.
The technical scheme adopted by the invention is as follows:
a binaural voice dereverberation method based on voice occurrence probability and consistency mainly comprises the following steps:
1) carrying out time delay compensation on voice signals received by the two microphones to obtain voice signals aligned in time;
2) performing windowing and framing processing on the voice signals aligned in time, and transforming the voice signals from a time domain to a frequency domain through Fourier transform;
3) estimating a reverberation power spectrum of a low frequency band part of the speech signal based on the speech occurrence probability;
4) calculating the consistency of different signal components of the speech signal;
5) estimating a reverberation power spectrum of a highband portion of a speech signal based on the coherence;
6) estimating the reverberation power spectrum combined with high and low frequencies according to the reverberation power spectrum of the low frequency band part and the reverberation power spectrum of the high frequency band part and the division threshold of the high and low frequency bands;
7) calculating by using a recursive smoothing algorithm according to the combined high-low frequency reverberation power spectrum to obtain a final reverberation power spectrum;
8) calculating a gain function according to the final reverberation power spectrum, and obtaining a frequency domain signal after reverberation is removed through the gain function;
9) and according to the frequency domain signal after dereverberation, obtaining a time domain signal after dereverberation by utilizing short-time Fourier inverse transformation.
The above steps are specifically described as follows:
1) Time delay compensation is performed on the speech signals received by the two microphones to obtain time-aligned speech. Since the speech signal reaches the two microphones with a time difference, the signals must be aligned before further processing. The GCC-PHAT-ργ method based on the generalized cross-correlation is adopted for time delay estimation; the binaural time difference is determined by locating the spectral peak of the cross-correlation function. The method is relatively robust, as it can overcome the influence of interference factors in the environment, such as correlated noise and reverberation, on the position of the cross-correlation peak.
In the time domain, the two-channel speech model can be described as:
x_i(n) = s_i(n) + v_i(n),   (3)

where x_i(n) denotes the speech signal received by a microphone, s_i(n) denotes the clean speech signal, and v_i(n) denotes the noise (reverberation) signal; the index i ∈ {l, r} distinguishes the first (left) and second (right) microphone signals.
Using the short-time Fourier transform, the two-channel speech model can be represented in the frequency domain as:

X_i(λ, μ) = S_i(λ, μ) + V_i(λ, μ),   (4)
where λ and μ denote the frame number and frequency, respectively. The cross-correlation function of two received voices can then be expressed as:
R(Δτ) = (1/2π) ∫ W(ω) G(ω) e^{jωΔτ} dω   (5)

where Δτ is the time difference, ω denotes the angular frequency, and W(ω) is a frequency-domain weighting function,

W(ω) = γ(ω) / |G(ω)|^ρ   (6)

which sharpens the spectral peak of the cross-correlation function. The parameter ρ is a reverberation factor determined by the signal-to-noise ratio, and γ(ω) is the coherence function of the speech received by the microphones (described in detail in step 4); both are adapted to the acoustic environment. G(ω) denotes the cross-power spectrum, G(ω) = X_l(ω) X_r^*(ω), where * denotes the complex conjugate. The time delay is then obtained by maximizing the generalized cross-correlation function:

Δτ̂ = argmax_Δτ R(Δτ)
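As a concrete illustration of the delay estimation above, the following Python sketch peak-picks the generalized cross-correlation of equation (5). The adaptive rules for ρ and γ(ω) are not spelled out in the text, so the sketch exposes them as plain parameters and falls back to ordinary GCC-PHAT when no coherence weighting is supplied; this fallback is an assumption of the sketch.

    import numpy as np

    def gcc_phat_delay(xl, xr, fs, rho=1.0, gamma=None, max_tau=1e-3):
        # Estimate the binaural time difference by peak-picking the generalized
        # cross-correlation of eq. (5) with the weighting of eq. (6).
        # gamma is the coherence weighting of the GCC-PHAT-rho-gamma method;
        # gamma=None falls back to plain GCC-PHAT (an assumption of this sketch).
        n = len(xl) + len(xr)
        Xl = np.fft.rfft(xl, n=n)
        Xr = np.fft.rfft(xr, n=n)
        G = Xl * np.conj(Xr)                              # cross-power spectrum
        W = 1.0 / np.maximum(np.abs(G), 1e-12) ** rho     # PHAT-rho weighting
        if gamma is not None:
            W = W * gamma                                 # coherence weighting
        r = np.fft.irfft(W * G, n=n)
        max_shift = int(fs * max_tau)
        r = np.concatenate((r[-max_shift:], r[:max_shift + 1]))
        return (np.argmax(np.abs(r)) - max_shift) / fs    # delay in seconds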
2) The two aligned speech signals are windowed and framed, and a Fourier transform converts the signals from the time domain to the frequency domain.
3) The reverberation power spectrum of the low-frequency band is estimated based on the speech occurrence probability. This step estimates the low-band reverberation power spectrum separately so that the low-frequency reverberation can also be removed. For each channel, the speech power and the reverberation power are denoted φ_ss(λ, μ) and φ_vv(λ, μ), respectively. Since the presence of speech is uncertain, the reverberation power spectrum E(|V|²|X) is obtained by the minimum mean square error method:

E(|V|²|X) = P(H0|X) E(|V|²|X, H0) + P(H1|X) E(|V|²|X, H1)   (7)

where X and V denote the discrete Fourier transforms of the signal received by the microphone and of the reverberant signal, respectively; H1 denotes speech presence and H0 speech absence; P(H0|X) is the probability that speech is absent, E(|V|²|X, H0) the reverberation power spectrum when speech is absent, P(H1|X) the probability that speech is present, and E(|V|²|X, H1) the reverberation power spectrum when speech is present.
The a posteriori signal-to-reverberation ratio is defined as:

ξ = φ_ss / φ_vv   (8)
The speech occurrence probability can then be calculated by equation (9):

P(H1|X) = 1 / (1 + (1 + ξ_opt) exp(−(|X|² / φ̂_vv) ξ_opt / (1 + ξ_opt)))   (9)

where ξ_opt denotes the optimal a posteriori signal-to-reverberation ratio. Research shows that when the true ratio lies between −∞ and 20 dB, the error of the speech occurrence probability is smallest for 10 log10(ξ_opt) = 15 dB. Having calculated the speech occurrence probability P(H1|X), the probability P(H0|X) that speech does not occur follows from:

P(H0|X) = 1 − P(H1|X)   (10)
When speech does not occur, the speech received by the microphone can be regarded as pure reverberation noise, so the reverberation power spectrum is obtained from:

E(|V|²|X, H0) = E(|V|²|V) = |V|² = |X|²   (11)
When speech occurs, the reverberation power spectrum is taken from the reverberation estimate of the previous frame:

E(|V|²|X, H1) = φ̂_vv(λ−1, μ)   (12)

where φ̂_vv denotes the self-power spectrum of the estimated reverberation. Thus the reverberation power spectrum E(|V|²|X) can be rewritten as:

E(|V|²|X) = P(H0|X) |X|² + P(H1|X) φ̂_vv(λ−1, μ)   (13)

The reverberation power spectrum is smoothed across frames:

φ̂_vv(λ, μ) = α φ̂_vv(λ−1, μ) + (1 − α) E(|V|²|X)   (14)

where α is a smoothing factor.
The reverberation power spectrum is updated only when the larger of the speech occurrence probabilities of the two channels (i.e. the two microphones) is below a threshold, and the channel with the lower speech occurrence probability drives the update:

1) if max(P(H1|X_l), P(H1|X_r)) < p0 and P(H1|X_l) < P(H1|X_r), then

φ̂_vv(λ, μ) = α φ̂_vv(λ−1, μ) + (1 − α) |X_l(λ, μ)|²   (15)

2) if max(P(H1|X_l), P(H1|X_r)) < p0 and P(H1|X_l) > P(H1|X_r), then

φ̂_vv(λ, μ) = α φ̂_vv(λ−1, μ) + (1 − α) |X_r(λ, μ)|²   (16)

3) otherwise,

φ̂_vv(λ, μ) = φ̂_vv(λ−1, μ)   (17)

where P(H1|X_l) denotes the speech occurrence probability of the first microphone signal, P(H1|X_r) that of the second microphone signal, and p0 denotes the threshold.
The reverberation power spectrum of the low-frequency band of the reverberant speech signal is estimated by the above method, and the result is denoted φ̂_vv,low(λ, μ).
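The low-band estimator of equations (9)-(17) can be condensed into a per-frame update. The following Python sketch is illustrative only: the speech presence probability uses the fixed-prior MMSE form from which equation (9) was reconstructed, the threshold p0 = 0.99 is an assumed value (the text does not fix it), and α = 0.5 follows the 50% smoothing parameter of Table 1.

    import numpy as np

    def update_lowband_reverb_psd(phi_vv_prev, Xl, Xr, alpha=0.5, p0=0.99,
                                  xi_opt_db=15.0):
        # One-frame update of the low-band reverberation power spectrum,
        # eq. (9)-(17). phi_vv_prev is the previous-frame estimate per bin;
        # Xl, Xr are the current-frame STFT coefficients of the two channels.
        xi_opt = 10 ** (xi_opt_db / 10)                   # 15 dB, see eq. (9)

        def spp(X):
            # Speech presence probability with fixed optimal prior SNR, eq. (9).
            post = np.abs(X) ** 2 / np.maximum(phi_vv_prev, 1e-12)
            return 1.0 / (1.0 + (1.0 + xi_opt)
                          * np.exp(-post * xi_opt / (1.0 + xi_opt)))

        p_l, p_r = spp(Xl), spp(Xr)
        phi_vv = phi_vv_prev.copy()                       # case 3): hold, eq. (17)
        upd = np.maximum(p_l, p_r) < p0                   # update gate
        use_l = upd & (p_l < p_r)                         # case 1), eq. (15)
        use_r = upd & ~(p_l < p_r)                        # case 2), eq. (16)
        phi_vv[use_l] = (alpha * phi_vv_prev[use_l]
                         + (1 - alpha) * np.abs(Xl[use_l]) ** 2)
        phi_vv[use_r] = (alpha * phi_vv_prev[use_r]
                         + (1 - alpha) * np.abs(Xr[use_r]) ** 2)
        return phi_vv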
4) The consistency of the different signal components is calculated. In the high-frequency band the reverberant signal is clearly distinguished from the speech signal by its consistency, so the reverberation of the high band is estimated from the consistency. To this end, the consistency between the different components of the speech must first be computed. The consistency of the speech received by the microphones follows directly from the definition of consistency; the consistency between two signals x1 and x2 is defined as:

Γ_x1x2(λ, μ) = φ_x1x2(λ, μ) / sqrt(φ_x1x1(λ, μ) φ_x2x2(λ, μ))   (18)

where φ_x1x1 and φ_x2x2 denote the self-power spectra of x1 and x2, and φ_x1x2 denotes their cross-power spectrum, computed by the recursive averaging method:

φ_x1x2(λ, μ) = α_PSD φ_x1x2(λ−1, μ) + (1 − α_PSD) X1(λ, μ) X2*(λ, μ)   (19)

where α_PSD is a smoothing factor and * denotes the complex conjugate.
Reverberant speech is usually assumed to form a diffuse sound field, i.e., a field produced by innumerable uncorrelated signals propagating simultaneously in all directions with equal energy. The ideal consistency of a diffuse sound field used in the traditional method is:

Γ_diffuse(f) = sin(2π f d_mic / c) / (2π f d_mic / c)   (20)

where f denotes the frequency, d_mic the distance between the two microphones, and c the speed of sound. When the two microphones are located at the left and right ears of a human head, however, the consistency of the diffuse sound field becomes more complicated because of the shielding of the head. Jeub et al. therefore proposed approximating it with a curve-fitted model of order P with constants a_p, b_p and c_p, which take the values 2.38×10⁻³, 1371 and 151.5 respectively, with model order P = 3.

For clean speech the consistency is high; assuming a plane wave that reaches both microphones from the angle θ, the consistency between the clean speech components can be expressed as:

Γ_slsr(f) = exp(−j 2π f d_mic cos θ / c)   (21)

where f denotes the frequency, c the speed of sound propagation in air, and d_mic the distance between the two microphones.
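The consistency quantities of this step can be computed directly from equations (18)-(20). The Python sketch below assumes a smoothing value α_PSD = 0.72 and the 0.17 m microphone spacing of the embodiment; it omits the head-shadow model of Jeub et al., whose exact fitted form is not reproduced here.

    import numpy as np

    def recursive_psd(phi_prev, X1, X2, alpha_psd=0.72):
        # Recursive (cross-)power spectrum estimate, eq. (19).
        # alpha_psd = 0.72 is an assumed value, not taken from the text.
        return alpha_psd * phi_prev + (1 - alpha_psd) * X1 * np.conj(X2)

    def coherence(phi_12, phi_11, phi_22):
        # Consistency (complex coherence) between two channels, eq. (18).
        return phi_12 / np.sqrt(np.maximum(phi_11.real * phi_22.real, 1e-12))

    def diffuse_coherence(f, d_mic=0.17, c=343.0):
        # Ideal free-field diffuse-field consistency, eq. (20); note that
        # np.sinc(x) = sin(pi x)/(pi x), hence the argument 2 f d / c.
        return np.sinc(2.0 * f * d_mic / c)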
5) The reverberation power spectrum of the high-frequency band is estimated based on the signal consistency. Since the reverberant sound field is assumed to be diffuse, the reverberation received by the two microphones has the same power spectrum:

φ_vlvl(λ, μ) = φ_vrvr(λ, μ) = φ_vv(λ, μ)   (22)

Because of the head shadow effect, the difference between the power spectra of the clean speech signals received by the binaural microphones cannot simply be ignored; the clean speech power spectra can be expressed as:

φ_slsl(λ, μ) = |H_l|² φ_ss(λ, μ)   (23)

φ_srsr(λ, μ) = |H_r|² φ_ss(λ, μ)   (24)

where H_l and H_r denote the transfer functions of the left and right ears, S denotes the sound source signal, S_l = H_l S the sound signal received by the left microphone, and S_r = H_r S the sound signal received by the right microphone. Combining the binaural consistency functions gives the cross-power spectra:

φ_slsr(λ, μ) = γ_s sqrt(φ_slsl(λ, μ) φ_srsr(λ, μ))   (25)

φ_vlvr(λ, μ) = γ_v φ_vv(λ, μ)   (26)

where γ_s is the clean-speech consistency of equation (21) and γ_v is the binaural diffuse-field consistency of the head-shadow model above. The self- and cross-power spectra of the clean speech s_l and s_r, the reverberation v_l and v_r, and the signals x_l and x_r received by the microphones are then related by:

φ_xlxl(λ, μ) = φ_slsl(λ, μ) + φ_vv(λ, μ)   (27)

φ_xrxr(λ, μ) = φ_srsr(λ, μ) + φ_vv(λ, μ)   (28)

φ_xlxr(λ, μ) = φ_slsr(λ, μ) + φ_vlvr(λ, μ)   (29)
Since the reverberation is assumed to be uncorrelated with the speech, combining equations (25)-(29) yields:

φ_xlxr − γ_v φ_vv = γ_s sqrt((φ_xlxl − φ_vv)(φ_xrxr − φ_vv))   (30)

Combining this with the definition of binaural consistency, squaring both sides and collecting terms in φ_vv gives the quadratic equation:

(γ_v² − γ_s²) φ_vv² + [γ_s² (φ_xlxl + φ_xrxr) − 2 γ_v Re(φ_xlxr)] φ_vv + (|φ_xlxr|² − γ_s² φ_xlxl φ_xrxr) = 0   (31)

Solving equation (31) gives the estimate φ_vv of the reverberation power spectrum. Writing (31) as a φ_vv² + b φ_vv + c = 0, the solution is:

φ_vv = (−b ± sqrt(b² − 4ac)) / (2a)   (32)

Theoretically the consistency of the speech signal is strong and that of the reverberation weak, and the consistency of the received signal does not exceed that of the clean speech signal, so that |Γ_xlxr|² ≤ γ_s². Equation (32) can therefore be considered to have a real solution. To guarantee that the reverberation power spectrum φ_vv is positive, the root

φ_vv = (−b + sqrt(b² − 4ac)) / (2a)   (33)

is adopted.
where the self-power spectra φ_xlxl and φ_xrxr and the cross-power spectrum φ_xlxr are likewise calculated by the recursive averaging method of equation (19).

The high-frequency band of the reverberant speech is processed by the above method, and the resulting reverberation power spectrum estimate is denoted φ̂_vv,high(λ, μ).
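Given the consistency values, the high-band reverberation power spectrum follows from the quadratic of equations (31)-(33). The Python sketch below treats γ_v as real-valued and clamps the discriminant and the result to keep the power spectrum valid; the root selection mirrors equation (33) as reconstructed above and should be read as an assumption.

    import numpy as np

    def highband_reverb_psd(phi_ll, phi_rr, phi_lr, gamma_s2, gamma_v):
        # Solve the quadratic (31) for the reverberation power spectrum,
        # eq. (32)-(33). gamma_s2 is the squared clean-speech consistency,
        # gamma_v the (real-valued) binaural diffuse-field consistency.
        a = gamma_v ** 2 - gamma_s2
        a = np.where(np.abs(a) < 1e-12, -1e-12, a)        # guard against a ~ 0
        b = gamma_s2 * (phi_ll + phi_rr) - 2.0 * gamma_v * phi_lr.real
        c = np.abs(phi_lr) ** 2 - gamma_s2 * phi_ll * phi_rr
        disc = np.maximum(b ** 2 - 4.0 * a * c, 0.0)      # clamp numerical negatives
        phi_vv = (-b + np.sqrt(disc)) / (2.0 * a)         # positive root, eq. (33)
        return np.maximum(phi_vv, 0.0)                    # enforce a valid PSD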
Because the theoretical signal consistency differs to some extent from the actual one, the reverberation power spectrum estimate is affected. To further improve the estimation, the consistency values are updated as follows.
When the larger of the two speech occurrence probabilities is below a threshold, the consistency of the reverberant signal is updated with the consistency of the speech signals received by the microphones:

if max(P(H1|X_l), P(H1|X_r)) < p0, then

Γ_v(λ, μ) = α_γ Γ_v(λ−1, μ) + (1 − α_γ) Γ_xlxr(λ, μ)   (34)

When the smaller of the two speech occurrence probabilities is above a threshold, the consistency of the clean speech signal is updated from the consistency of the speech received by the microphones, which follows from equation (30):

if min(P(H1|X_l), P(H1|X_r)) > p1, then

γ_s²(λ, μ) = |φ_xlxr(λ, μ)|² / ((φ_xlxl(λ, μ) − φ_vv(λ, μ))(φ_xrxr(λ, μ) − φ_vv(λ, μ)))   (35)

where p0 and p1 denote thresholds, Γ_v the consistency of the reverberant signal, α_γ the smoothing coefficient, Γ_xlxr the consistency between the two speech signals received by the microphones, γ_s the consistency of the clean speech signal, φ_xlxr the cross-power spectrum of the speech received by the two microphones, φ_xlxl and φ_xrxr the self-power spectra of the speech received by the left and right microphones, and φ_vv the self-power spectrum of the reverberant signal. Since the consistency-based estimation of the reverberation power spectrum uses only the square of the clean-speech consistency, it suffices to apply the update of formula (35).
6) The reverberation power spectrum estimates of the high and low frequency bands are combined. When the frequency μ is below a set value μ_s (the frequency value dividing the low and high bands), the reverberation power spectrum is φ̂_vv,low(λ, μ); when the frequency is above μ_s, it is φ̂_vv,high(λ, μ), namely:

φ̂_vv(λ, μ) = φ̂_vv,low(λ, μ), μ < μ_s;  φ̂_vv,high(λ, μ), otherwise   (36)
7) The final reverberation power spectrum is obtained by applying a conventional recursive smoothing algorithm to the combined high-low frequency reverberation power spectrum estimated in step 6).
8) The gain function is calculated. Once the power spectrum of the reverberant signal is available, a gain function can be designed from it, and multiplying the signal received by the microphone by this gain yields the dereverberated signal. Speech dereverberation based on an estimate of the reverberation power spectrum is often realised by spectral subtraction, which rests on a simple principle: treating the reverberation as additive echo noise, the clean speech spectrum is obtained by subtracting the estimated reverberation spectrum from the reverberant speech spectrum received by the microphone. The gain function is:

G(λ, μ) = max{1 − β / ξ²(λ, μ), G_min}   (37)

where ξ²(λ, μ) = φ_xx(λ, μ) / φ̂_vv(λ, μ) is the square of the posterior signal-to-reverberation ratio, φ̂_vv(λ, μ) is the estimated reverberation power spectrum, φ_xx(λ, μ) is the computed power spectrum of the speech signal received by the microphone, and β is the subtraction factor. To avoid over-subtraction, the lower bound G_min is imposed. The dereverberated speech signal is represented in the frequency domain as:

Ŝ_i(λ, μ) = G(λ, μ) X_i(λ, μ), i ∈ {l, r}   (38)
9) Finally, the dereverberated time-domain signal ŝ_i(n) is obtained by the inverse short-time Fourier transform.
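A compact Python sketch of the band combination and gain application of steps 6), 8) and 9) follows. It is illustrative only: the split frequency μ_s = 800 Hz, the use of the channel-averaged power spectrum for φ_xx, and the gain of equation (37) as reconstructed above are assumptions, not values fixed by the text; β and G_min follow Table 1.

    import numpy as np

    def dereverb_frame(Xl, Xr, phi_vv_low, phi_vv_high, freqs, mu_s=800.0,
                       beta=0.85, g_min_db=-10.0):
        # Combine the band-wise estimates, eq. (36), and apply the
        # spectral-subtraction gain, eq. (37)-(38).
        phi_vv = np.where(freqs < mu_s, phi_vv_low, phi_vv_high)   # eq. (36)
        phi_xx = 0.5 * (np.abs(Xl) ** 2 + np.abs(Xr) ** 2)         # assumed average
        xi2 = phi_xx / np.maximum(phi_vv, 1e-12)   # posterior ratio, eq. (37)
        g_min = 10.0 ** (g_min_db / 20.0)
        gain = np.maximum(1.0 - beta / xi2, g_min)                 # eq. (37)
        return gain * Xl, gain * Xr                                # eq. (38)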
Correspondingly, the present invention also provides a binaural speech dereverberation apparatus based on speech occurrence probability and consistency, comprising:
the preprocessing unit, which performs time delay compensation on the speech signals received by the two microphones to obtain time-aligned speech signals, windows and frames the aligned signals, and transforms them from the time domain to the frequency domain by Fourier transform;
the low-band reverberation power spectrum estimation unit, which estimates the reverberation power spectrum of the low-frequency band of the speech signal based on the speech occurrence probability;
the high-band reverberation power spectrum estimation unit, which calculates the consistency of the different signal components of the speech signal and estimates the reverberation power spectrum of the high-frequency band based on the consistency;
the band-combining reverberation power spectrum estimation unit, which estimates the combined high-low frequency reverberation power spectrum from the reverberation power spectra of the low and high frequency bands according to the threshold dividing the bands;
the dereverberation unit, which obtains the final reverberation power spectrum from the combined high-low frequency estimate with a recursive smoothing algorithm, computes a gain function from the final reverberation power spectrum, obtains the dereverberated frequency-domain signal through the gain function, and obtains the dereverberated time-domain signal by the inverse short-time Fourier transform.
The invention has the following beneficial effects:
Exploiting the difference in consistency between the reverberation and the clean speech received by the two microphones, the method applies different reverberation power spectrum estimates to the high and low frequency bands: the low-frequency reverberation is removed with a model that computes the reverberation power spectrum from the speech occurrence probability, and the high-frequency reverberation is removed with the speech consistency model. Reverberation can thus be removed effectively over the whole frequency band, and the perceptual quality of the speech is improved.
Drawings
Fig. 1 is a flow diagram of a binaural speech dereverberation method based on speech occurrence probability and consistency according to the present invention.
Fig. 2 compares the true reverberation power spectrum with the reverberation power spectra estimated before and after the improvement of the consistency-based dereverberation method in an embodiment of the invention.
Figs. 3(a)-3(c) show, respectively, the spectrogram of a speech signal contaminated by reverberation, the spectrogram of the speech after dereverberation with the consistency-based method before the improvement, and the spectrogram of the speech after dereverberation using the speech occurrence probability and consistency after the improvement.
Detailed Description
The invention is described more fully below with reference to the following examples and accompanying drawings.
The databases used in this embodiment are authoritative in the international speech enhancement field and among the most widely used. Clean speech was taken from the TSP database, 80 utterances in total, for testing. The signals received by the microphones are obtained by convolving the clean speech with room impulse responses from the AIR (Aachen Impulse Response) database. The AIR database was recorded by the Institute of Communication Systems of RWTH Aachen University, Germany, with an HMS II artificial head; it covers scenes such as offices, meeting rooms and lecture halls and is intended for studying signal processing algorithms in reverberant environments. The two microphones are located at the left and right ears of the artificial head, about 0.17 m apart.
The present embodiment applies the binaural speech dereverberation method based on speech occurrence probability and consistency shown in Fig. 1 and evaluates the algorithm in different reverberation scenes. The specific parameter settings of the algorithm are given in Table 1.
TABLE 1 Algorithm parameter settings

    Parameter                            Value
    Sampling rate f_s                    16 kHz
    Frame length L                       320
    Frame shift M                        160
    Spectral smoothing parameter α       50%
    Subtraction factor β                 0.85
    Spectral lower bound G_min           -10 dB
Table 2 lists the perceptual evaluation of speech quality (PESQ) scores and the improvement of the speech-to-reverberation modulation energy ratio (ΔSRMR) obtained before the improvement (reverberation estimated and removed using consistency alone) and after the improvement (reverberation estimated and removed using both speech occurrence probability and consistency). The comparison of ΔSRMR shows that the dereverberation method based on speech occurrence probability and consistency removes noticeably more reverberation and accordingly achieves higher PESQ values.
TABLE 2 PESQ and ΔSRMR before and after the improvement in different reverberation scenes

    Reverberant scene            Office    Speech room    Corridor    Auditorium
    Reverberation time           0.45 s    0.85 s         0.83 s      5.16 s
    Initial PESQ                 1.89      1.62           1.74        1.44
    PESQ, before improvement     2.19      1.78           1.92        1.61
    PESQ, after improvement      2.42      2.00           2.07        1.78
    ΔSRMR, before improvement    1.05      1.11           1.19        0.90
    ΔSRMR, after improvement     1.32      1.37           1.41        1.18
Fig. 2 shows, for the office scene of this embodiment, the power spectrum of the true reverberant signal together with the reverberation power spectra estimated by the consistency-based method before and after the improvement. It is apparent from Fig. 2 that the power spectrum estimated by the improved method is closer to the true reverberation power spectrum.
The dereverberation effect can also be observed in the spectrogram of the dereverberated speech signal. An example is given in Figs. 3(a)-3(c): the spectrogram of the speech signal contaminated by reverberation, the spectrogram after dereverberation with the consistency-based method before the improvement, and the spectrogram after dereverberation using the speech occurrence probability and consistency. The spectrograms show that the proposed method removes more reverberation, especially in the low-frequency band.
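For orientation, the PESQ comparison of Table 2 can be reproduced along the following lines. This Python sketch is not part of the patent: it assumes the third-party packages soundfile and pesq (an ITU-T P.862 implementation), and the file names and office-scene impulse responses are hypothetical placeholders.

    import soundfile as sf                       # assumed I/O package
    from scipy.signal import fftconvolve
    from pesq import pesq                        # assumed ITU-T P.862 package

    fs = 16000
    clean, _ = sf.read("tsp_utterance.wav")      # hypothetical file names
    rir_l, _ = sf.read("air_office_left.wav")
    rir_r, _ = sf.read("air_office_right.wav")

    # Simulate the binaural reverberant microphone signals with AIR-style RIRs.
    xl = fftconvolve(clean, rir_l)[:len(clean)]
    xr = fftconvolve(clean, rir_r)[:len(clean)]

    # ... run the dereverberation front-end on (xl, xr) to obtain s_hat ...
    s_hat = xl                                   # placeholder for the output

    print("initial PESQ:       ", pesq(fs, clean, xl, "wb"))
    print("dereverberated PESQ:", pesq(fs, clean, s_hat, "wb"))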
Another embodiment of the present invention provides a binaural speech dereverberation apparatus based on speech occurrence probability and consistency, including:
the preprocessing unit, which performs time delay compensation on the speech signals received by the two microphones to obtain time-aligned speech signals, windows and frames the aligned signals, and transforms them from the time domain to the frequency domain by Fourier transform;
the low-band reverberation power spectrum estimation unit, which estimates the reverberation power spectrum of the low-frequency band of the speech signal based on the speech occurrence probability;
the high-band reverberation power spectrum estimation unit, which calculates the consistency of the different signal components of the speech signal and estimates the reverberation power spectrum of the high-frequency band based on the consistency;
the band-combining reverberation power spectrum estimation unit, which estimates the combined high-low frequency reverberation power spectrum from the reverberation power spectra of the low and high frequency bands according to the threshold dividing the bands;
the dereverberation unit, which obtains the final reverberation power spectrum from the combined high-low frequency estimate with a recursive smoothing algorithm, computes a gain function from the final reverberation power spectrum, obtains the dereverberated frequency-domain signal through the gain function, and obtains the dereverberated time-domain signal by the inverse short-time Fourier transform.
The above embodiments are merely illustrative of the invention. Although embodiments are disclosed for the purpose of illustration, those skilled in the art will appreciate that various substitutions, changes and modifications are possible without departing from the spirit and scope of the invention and the appended claims. Therefore, the invention should not be limited to the content of the embodiments, and its scope should be defined by the claims.

Claims (8)

1. A binaural speech dereverberation method based on speech occurrence probability and consistency comprises the following steps:
1) carrying out time delay compensation on voice signals received by the two microphones to obtain voice signals aligned in time;
2) performing windowing and framing processing on the voice signals aligned in time, and transforming the voice signals from a time domain to a frequency domain through Fourier transform;
3) estimating a reverberation power spectrum of a low-frequency band part of the speech signal based on the speech occurrence probability, the reverberation power spectrum of the low-frequency band being estimated separately to ensure that the low-band reverberation can be removed; the reverberation power spectrum is updated when the larger of the speech occurrence probabilities of the two channels is below a certain threshold, and otherwise is not updated; the update of the reverberation power spectrum is:

a) if max(P(H1|X_l), P(H1|X_r)) < p0 and P(H1|X_l) < P(H1|X_r), then

φ̂_vv(λ, μ) = α φ̂_vv(λ−1, μ) + (1 − α) |X_l(λ, μ)|²

b) if max(P(H1|X_l), P(H1|X_r)) < p0 and P(H1|X_l) > P(H1|X_r), then

φ̂_vv(λ, μ) = α φ̂_vv(λ−1, μ) + (1 − α) |X_r(λ, μ)|²

c) otherwise,

φ̂_vv(λ, μ) = φ̂_vv(λ−1, μ)

wherein P(H1|X_l) represents the speech occurrence probability of the first microphone signal X_l, P(H1|X_r) represents the speech occurrence probability of the second microphone signal X_r, p0 represents the threshold, λ and μ represent the frame index and the frequency respectively, H1 denotes speech presence, H0 denotes speech absence, α is a smoothing factor, and φ̂_vv(λ−1, μ) is the estimated self-power spectrum of the reverberation in the previous frame;
4) calculating the consistency of different signal components of the speech signal;
5) estimating a reverberation power spectrum of a highband portion of a speech signal based on the coherence;
6) estimating the reverberation power spectrum combined with high and low frequencies according to the reverberation power spectrum of the low frequency band part and the reverberation power spectrum of the high frequency band part and the division threshold of the high and low frequency bands;
7) calculating by using a recursive smoothing algorithm according to the combined high-low frequency reverberation power spectrum to obtain a final reverberation power spectrum;
8) calculating a gain function according to the final reverberation power spectrum, and obtaining a frequency domain signal after reverberation is removed through the gain function;
9) and according to the frequency domain signal after dereverberation, obtaining a time domain signal after dereverberation by utilizing short-time Fourier inverse transformation.
2. The method as claimed in claim 1, wherein the time delay compensation of the two speech signals in step 1) is performed with the GCC-PHAT-ργ method, so as to overcome the influence of interference factors in the environment on the peak position of the cross-correlation function.
3. The method of claim 1, wherein step 4) assumes reverberation as a diffuse sound field and computes the consistency using a reverberation consistency model with head occlusion.
4. The method of claim 1, wherein step 5) comprises the sub-steps of:
5-1) updating the consistency of the signals according to the voice occurrence probability at all frequencies;
5-2) taking the influence of the head shadow effect into account, estimating the reverberation power spectrum by combining the consistency function under the condition that the power spectra of the clean speech signals received by the two microphones differ.
5. The method of claim 4, wherein the self-power spectra and cross-power spectrum of the clean speech received by the two microphones in step 5) are expressed as:

φ_slsl = |H_l|² φ_ss

φ_srsr = |H_r|² φ_ss

φ_slsr = γ_s sqrt(φ_slsl φ_srsr)

wherein H_l and H_r represent the transfer functions of the left and right ears respectively, S represents the sound source signal, γ_s represents the consistency function of the binaural clean speech signals, S_l = H_l S represents the sound signal received by the left microphone, and S_r = H_r S represents the sound signal received by the right microphone.
6. The method of claim 5, wherein step 5-1) comprises:

a) updating the consistency of the reverberant speech: when the larger of the two speech occurrence probabilities is below a certain threshold, the consistency of the reverberant signal is updated with the consistency of the speech signals received by the microphones as follows:

if max(P(H1|X_l), P(H1|X_r)) < p0,

then Γ_v(λ, μ) = α_γ Γ_v(λ−1, μ) + (1 − α_γ) Γ_xlxr(λ, μ)

wherein Γ_v denotes the consistency of the reverberant signal, α_γ denotes the smoothing coefficient, Γ_xlxr denotes the consistency between the two speech signals received by the microphones, and p0 represents the threshold in "when the larger of the two speech occurrence probabilities is below a certain threshold";

b) updating the consistency of the clean speech: when the smaller of the two speech occurrence probabilities is above a certain threshold, the consistency of the clean speech signal is updated from the consistency of the speech signals received by the microphones as follows:

if min(P(H1|X_l), P(H1|X_r)) > p1,

then γ_s²(λ, μ) = |φ_xlxr(λ, μ)|² / ((φ_xlxl(λ, μ) − φ_vv(λ, μ))(φ_xrxr(λ, μ) − φ_vv(λ, μ)))

wherein γ_s denotes the consistency of the clean speech signal, φ_xlxr denotes the cross-power spectrum of the speech received by the two microphones, φ_xlxl denotes the self-power spectrum of the speech received by the left microphone, φ_xrxr denotes the self-power spectrum of the speech received by the right microphone, φ_vv denotes the self-power spectrum of the reverberant signal, and p1 represents the threshold in "when the smaller of the two speech occurrence probabilities is above a certain threshold";

and step 5-2) estimates the reverberation power spectrum as:

φ_vv = (−b + sqrt(b² − 4ac)) / (2a)

with a = γ_v² − γ_s², b = γ_s² (φ_xlxl + φ_xrxr) − 2 γ_v Re(φ_xlxr), and c = |φ_xlxr|² − γ_s² φ_xlxl φ_xrxr, where γ_v denotes the consistency of the reverberant sound field.
7. The method of claim 6, wherein the combined high-low frequency reverberation power spectrum estimated in step 6) is:

φ̂_vv(λ, μ) = φ̂_vv,low(λ, μ), μ < μ_s;  φ̂_vv,high(λ, μ), otherwise

wherein μ denotes the frequency, μ_s denotes the frequency value dividing the high and low bands, φ̂_vv,low(λ, μ) denotes the reverberation power spectrum of the low-frequency band estimated based on the speech occurrence probability, and φ̂_vv,high(λ, μ) denotes the reverberation power spectrum of the high-frequency band estimated based on the consistency.
8. A binaural speech dereverberation device based on speech occurrence probability and consistency using the method of any of claims 1-7, comprising:
the preprocessing unit, which performs time delay compensation on the speech signals received by the two microphones to obtain time-aligned speech signals, windows and frames the aligned signals, and transforms them from the time domain to the frequency domain by Fourier transform;
the low-band reverberation power spectrum estimation unit, which estimates the reverberation power spectrum of the low-frequency band of the speech signal based on the speech occurrence probability;
the high-band reverberation power spectrum estimation unit, which calculates the consistency of the different signal components of the speech signal and estimates the reverberation power spectrum of the high-frequency band based on the consistency;
the band-combining reverberation power spectrum estimation unit, which estimates the combined high-low frequency reverberation power spectrum from the reverberation power spectra of the low and high frequency bands according to the threshold dividing the bands;
the dereverberation unit, which obtains the final reverberation power spectrum from the combined high-low frequency estimate with a recursive smoothing algorithm, computes a gain function from the final reverberation power spectrum, obtains the dereverberated frequency-domain signal through the gain function, and obtains the dereverberated time-domain signal by the inverse short-time Fourier transform.

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110095755B (en) * 2019-04-01 2021-03-12 云知声智能科技股份有限公司 Sound source positioning method
CN110012331B (en) * 2019-04-11 2021-05-25 杭州微纳科技股份有限公司 Infrared-triggered far-field double-microphone far-field speech recognition method
CN110718230B (en) * 2019-08-29 2021-12-17 云知声智能科技股份有限公司 Method and system for eliminating reverberation
CN110691296B (en) * 2019-11-27 2021-01-22 深圳市悦尔声学有限公司 Channel mapping method for built-in earphone of microphone
CN111128213B (en) * 2019-12-10 2022-09-27 展讯通信(上海)有限公司 Noise suppression method and system for processing in different frequency bands
CN113613112B (en) * 2021-09-23 2024-03-29 三星半导体(中国)研究开发有限公司 Method for suppressing wind noise of microphone and electronic device
CN115831145B (en) * 2023-02-16 2023-06-27 之江实验室 Dual-microphone voice enhancement method and system

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006243290A (en) * 2005-03-02 2006-09-14 Advanced Telecommunication Research Institute International Disturbance component suppressing device, computer program, and speech recognition system
WO2009151062A1 (en) * 2008-06-10 2009-12-17 ヤマハ株式会社 Acoustic echo canceller and acoustic echo cancel method
JP2011065128A (en) * 2009-08-20 2011-03-31 Mitsubishi Electric Corp Reverberation removing device
CN102800322A (en) * 2011-05-27 2012-11-28 中国科学院声学研究所 Method for estimating noise power spectrum and voice activity
CN102347028A (en) * 2011-07-14 2012-02-08 瑞声声学科技(深圳)有限公司 Double-microphone speech enhancer and speech enhancement method thereof
JP2013044908A (en) * 2011-08-24 2013-03-04 Nippon Telegr & Teleph Corp <Ntt> Background sound suppressor, background sound suppression method and program
CN106297817A (en) * 2015-06-09 2017-01-04 中国科学院声学研究所 A kind of sound enhancement method based on binaural information
CN106971740A (en) * 2017-03-28 2017-07-21 吉林大学 Probability and the sound enhancement method of phase estimation are had based on voice

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zhang Long et al.; "Supervised single-channel speech dereverberation and denoising using a two-stage model based sparse representation"; Speech Communication; 27 December 2017; pp. 1-8. *
Chen Jianrong (陈建荣) et al.; "基于麦克风阵列的混响消减处理" [Reverberation reduction processing based on a microphone array]; 电声技术 (Audio Engineering); vol. 42, no. 3; 15 March 2018; pp. 49-51, 54. *

Legal Events

Code    Title
PB01    Publication
SE01    Entry into force of request for substantive examination
GR01    Patent grant