CN108986832B - Binaural voice dereverberation method and device based on voice occurrence probability and consistency - Google Patents
- Publication number: CN108986832B (application CN201810765266.3A)
- Authority: CN (China)
- Prior art keywords: power spectrum, reverberation, speech, signal, voice
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G — PHYSICS
- G10 — MUSICAL INSTRUMENTS; ACOUSTICS
- G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00 — Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02 — Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208 — Noise filtering
- G10L21/0216 — Noise filtering characterised by the method used for estimating noise
- G10L21/0232 — Processing in the frequency domain
- G10L2021/02082 — Noise filtering the noise being echo, reverberation of the speech
- G10L2021/02161 — Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02165 — Two microphones, one receiving mainly the noise signal and the other one mainly the speech signal
Abstract
The invention discloses a binaural voice dereverberation method and device based on voice occurrence probability and consistency. The method comprises the following steps: 1) carrying out time delay compensation on the voice signals received by the two microphones to obtain voice signals aligned in time; 2) performing windowing and framing processing, and transforming the voice signals from the time domain to the frequency domain by Fourier transform; 3) estimating the reverberation power spectrum of the low-frequency band based on the voice occurrence probability; 4) calculating the consistency of the different signal components of the voice signal; 5) estimating the reverberation power spectrum of the high-frequency band based on the consistency; 6) combining the high- and low-frequency reverberation power spectra according to the division threshold between the high and low frequency bands; 7) obtaining the final reverberation power spectrum with a recursive smoothing algorithm; 8) obtaining the dereverberated frequency-domain signal through a gain function; 9) obtaining the dereverberated time-domain signal by the inverse short-time Fourier transform. The invention can effectively remove reverberation over the whole frequency band and improve the perceived voice quality.
Description
Technical Field
The invention belongs to the technical field of audio signal processing and computer hearing, and particularly relates to a method and a device for removing reverberation of double-microphone voice in a reverberation environment.
Background
Binaural audio naturally has many advantages for communication and multimedia experiences. In daily human-to-human interaction, auditory perception is one of the most effective and direct modes of communication. In a real environment, however, speech, an important information carrier for human-to-human and human-to-machine communication, is inevitably corrupted by reverberation, environmental noise and other interference, which greatly reduces its clarity, intelligibility and comfort and seriously degrades both human auditory perception and the performance of downstream speech processing systems. In general, a microphone receives not only the direct-path component of a sound source but also reflected signals produced by multipath propagation (e.g., reflections from the floor, walls, ceiling and furnishings of a room). A reflected wave arriving about 50 ms or more after the direct sound is called an echo, and the combined effect of all reflections other than the direct sound is called reverberation; both impair the reception of the desired speech signal. To counteract the degradation in sound quality caused by reverberation, researchers have proposed dereverberation (or reverberation cancellation) techniques, which aim to improve the quality and intelligibility of the degraded speech.
Speech dereverberation techniques have wide application. With the development of modern signal processing and intelligent systems, robots are becoming increasingly capable, yet in practical use they often operate in complex acoustic environments where various kinds of noise interfere with speech acquisition; in a reverberant environment the speech recognition rate drops rapidly, hampering subsequent operations and functions and sometimes making practical deployment impossible. Reducing reverberation with binaural speech dereverberation technology is therefore of great significance for robots in practical applications. Binaural speech dereverberation can also serve as a pre-processing stage for many speech signal processing techniques, such as binaural sound source localization and speech recognition. Furthermore, people with hearing impairment often rely on a hearing aid or cochlear implant to communicate, and in a reverberant environment the benefit of a hearing aid is greatly reduced. In this case the non-clean speech signal should be preprocessed by a speech dereverberation algorithm before amplification; removing the reverberant signal to a certain extent helps hearing-impaired people communicate better.
Speech dereverberation techniques can generally be divided into single-channel and multi-channel approaches. A single-channel dereverberation algorithm enhances speech with a single microphone; thanks to its simple model and low cost, this approach has seen widespread application and mature development. However, a single-channel algorithm can only exploit the statistical properties of one speech signal to suppress reverberation. A multi-channel dereverberation system collects sound with several microphones, i.e., a microphone array, obtaining multi-channel signals. With more input channels, the processing algorithm can exploit the correlation between the channel signals for speech enhancement. Compared with the single-channel case, which can only exploit differences between speech and reverberation in the time-frequency domain, the microphone array compensates for the shortcomings of single-channel dereverberation. In general, increasing the number of microphones improves the dereverberation effect: an array can exploit not only the time-frequency information of the signals but also their spatial information, and has therefore attracted wide attention. Its disadvantages are its large physical size, complex system and heavy computational load. Weighing equipment cost, the real-time performance of the speech enhancement algorithm and its effectiveness, two-channel speech dereverberation, i.e., dereverberation using two microphones, is a good compromise.
Algorithms for two-microphone speech dereverberation mainly include methods based on a consistency (coherence) model and methods based on two-channel Wiener filtering. A consistency-based dereverberation algorithm designs the filter mainly according to the difference in consistency between clean speech and reverberant speech. The method assumes that the clean speech part and the reverberation part are uncorrelated, estimates the reverberation power in the received speech from the consistency of the clean speech, the reverberant speech and the speech received by the microphones, and computes the filter gain from the estimated reverberation power to obtain the dereverberated speech. A consistency-based two-channel speech dereverberation method mainly comprises the following steps:
1. Voice input, pre-filtering and analog-to-digital conversion. The input analog sound signal is first pre-filtered: high-pass filtering suppresses the 50 Hz power-line noise, and low-pass filtering removes frequency components above half the sampling frequency to prevent aliasing. The analog sound signal is then sampled and quantized to obtain a digital signal.
2. Pre-emphasis. The signal is passed through a high-frequency emphasis filter to compensate for the high-frequency attenuation caused by lip radiation.
3. Framing and windowing. Because a speech signal is slowly time-varying, it is non-stationary as a whole but stationary locally; it is generally considered stationary within 10-30 ms, so it can be framed with a length of 20 ms. The framing function is:

x_k(n) = w(n)·s(Nk + n), n = 0, 1, ..., N-1; k = 0, 1, ..., L-1, (1)

where N is the frame length, L the number of frames, and s the speech signal. w(n) is a window function whose choice (shape and length) strongly influences the short-time analysis parameters; commonly used windows include the rectangular, Hanning and Hamming windows. The Hamming window is generally chosen, as it reflects the changing characteristics of the speech signal well; its expression is:

w(n) = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1. (2)
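As a concrete illustration, the framing and Hamming windowing of equation (1) can be sketched as follows (a minimal sketch; equation (1) frames without overlap, whereas a 50% frame shift is shown here as an assumed choice, and the 16 kHz sampling rate and 20 ms frame length are assumed example values):

```python
import numpy as np

def frame_signal(s, frame_len, hop):
    """Split signal s into frames x_k(n) = w(n) * s(k*hop + n) and apply a
    Hamming window w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))."""
    w = np.hamming(frame_len)
    n_frames = 1 + (len(s) - frame_len) // hop
    frames = np.stack([s[k*hop : k*hop + frame_len] for k in range(n_frames)])
    return frames * w              # shape: (n_frames, frame_len)

# 16 kHz sampling, 20 ms frames (320 samples), 50% frame shift
fs = 16000
s = np.random.randn(fs)            # 1 s of test signal
frames = frame_signal(s, 320, 160)
print(frames.shape)                # (99, 320)
```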
4. Reverberation power spectrum estimation. The consistency of the clean speech and of the reverberant speech is estimated using forms derived in prior work, and the consistency of the speech received by the microphones is calculated from the defining formula of consistency.
5. The filter gain is calculated and the two-channel signal is filtered.
6. The filtered speech is converted to a time-domain output using the inverse Fourier transform.
Disclosure of Invention
The invention provides a new binaural speech dereverberation method and device for improving the dereverberation effect of the consistency-based two-microphone dereverberation algorithm in the low-frequency band.
The traditional consistency-based two-microphone dereverberation algorithm assumes that reverberation forms a diffuse sound field with low consistency while clean speech has high consistency, so reverberation can be removed according to consistency. In the low-frequency band, however, the consistency of reverberant speech is also high, so little low-frequency reverberation is removed. In addition, the conventional method computes the consistency of each sound component under a free-field assumption; with a binaural microphone pair, the consistency of each component is affected by head occlusion due to the "head shadow effect", and the free-field form is not applicable. Aiming at these two problems, the invention provides a binaural speech dereverberation method based on speech occurrence probability and consistency.
The technical scheme adopted by the invention is as follows:
a binaural voice dereverberation method based on voice occurrence probability and consistency mainly comprises the following steps:
1) carrying out time delay compensation on voice signals received by the two microphones to obtain voice signals aligned in time;
2) performing windowing and framing processing on the voice signals aligned in time, and transforming the voice signals from a time domain to a frequency domain through Fourier transform;
3) estimating a reverberation power spectrum of a low frequency band part of the speech signal based on the speech occurrence probability;
4) calculating the consistency of different signal components of the speech signal;
5) estimating a reverberation power spectrum of a highband portion of a speech signal based on the coherence;
6) estimating the reverberation power spectrum combined with high and low frequencies according to the reverberation power spectrum of the low frequency band part and the reverberation power spectrum of the high frequency band part and the division threshold of the high and low frequency bands;
7) calculating by using a recursive smoothing algorithm according to the combined high-low frequency reverberation power spectrum to obtain a final reverberation power spectrum;
8) calculating a gain function according to the final reverberation power spectrum, and obtaining a frequency domain signal after reverberation is removed through the gain function;
9) according to the dereverberated frequency-domain signal, obtaining the dereverberated time-domain signal by the inverse short-time Fourier transform.
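Steps 2) and 9) — the forward and inverse short-time Fourier transforms that bracket the pipeline — can be sketched as follows (a minimal sketch; the square-root Hann window and 50% overlap are assumed choices that give exact overlap-add reconstruction, not parameters mandated by the invention):

```python
import numpy as np

def stft(x, N=320, hop=160):
    """Step 2: frame, window and Fourier-transform the signal."""
    w = np.sqrt(0.5 - 0.5 * np.cos(2 * np.pi * np.arange(N) / N))  # periodic sqrt-Hann
    K = 1 + (len(x) - N) // hop
    return np.stack([np.fft.rfft(w * x[k*hop : k*hop + N]) for k in range(K)])

def istft(X, N=320, hop=160):
    """Step 9: inverse-transform each frame and overlap-add."""
    w = np.sqrt(0.5 - 0.5 * np.cos(2 * np.pi * np.arange(N) / N))
    y = np.zeros((X.shape[0] - 1) * hop + N)
    for k in range(X.shape[0]):
        y[k*hop : k*hop + N] += w * np.fft.irfft(X[k], n=N)
    return y

x = np.random.randn(16000)      # 1 s at 16 kHz
y = istft(stft(x))
# interior samples are reconstructed exactly; the edges lack full window overlap
err = np.max(np.abs(y[320:-320] - x[320:15680]))
print(err < 1e-10)
```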
The above steps are specifically described as follows:
1) Time delay compensation is performed on the voice signals received by the two microphones to obtain speech aligned in time. Since the speech signal reaches the two microphones with a time difference, the signals must be aligned before processing. The GCC-PHAT-ργ method based on generalized cross-correlation is adopted for time delay estimation; the binaural time difference is determined mainly by searching for the spectral-peak position of the cross-correlation function. The method can overcome the influence of interfering factors in the environment, such as correlated noise and reverberation, on the position of the cross-correlation spectral peak, and is relatively robust.
In the time domain, the two-channel speech model can be described as:
x_i(n) = s_i(n) + v_i(n), (3)
where x_i(n) denotes the speech signal received by a microphone, s_i(n) the clean speech signal and v_i(n) the noise signal; the index i ∈ {l, r} distinguishes the first (left) and second (right) microphone signals.
With a short-time fourier transform, the two-channel speech model can be represented in the frequency domain as:
X_i(λ, μ) = S_i(λ, μ) + V_i(λ, μ), (4)
where λ and μ denote the frame index and the frequency bin, respectively. The generalized cross-correlation function of the two received signals can then be expressed as:

R(Δτ) = ∫ W(ω)·G(ω)·e^{jωΔτ} dω, (5)

where Δτ is the time difference, * denotes the complex conjugate, and ω denotes the angular frequency. W(ω) is a frequency-domain weighting function used to sharpen the spectral peak of the cross-correlation function; the parameter ρ is a reverberation factor determined by the signal-to-noise ratio, γ(ω) is the coherence function of the speech received by the microphones (described in detail in step 4), and both are adaptively adjusted according to the environment. G(ω) denotes the cross-power spectrum, G(ω) = X_l(ω)·X_r*(ω). Thus, the time delay is obtained by maximizing the generalized cross-correlation function:

Δτ̂ = argmax_{Δτ} R(Δτ). (6)
2) Windowing and framing preprocessing are performed on the two aligned voices, and a Fourier transform converts the signals from the time domain to the frequency domain.
3) The reverberation power spectrum of the low-frequency band is estimated based on the voice occurrence probability. This step estimates the low-band reverberation power spectrum separately to ensure that low-band reverberation can be removed. For each channel, the speech power and the reverberation power are denoted φ_ss(λ, μ) and φ_vv(λ, μ), respectively. Because the presence of speech is uncertain, the reverberation power spectrum E(|V|²|X) is obtained with the minimum mean square error method:

E(|V|²|X) = P(H0|X)·E(|V|²|X, H0) + P(H1|X)·E(|V|²|X, H1), (7)

where X and V denote the discrete Fourier transforms of the signal received by the microphone and of the reverberant signal, respectively; H1 denotes the speech-present hypothesis and H0 the speech-absent hypothesis; P(H0|X) is the probability that speech is absent, E(|V|²|X, H0) the reverberation power spectrum when speech is absent, P(H1|X) the probability that speech is present, and E(|V|²|X, H1) the reverberation power spectrum when speech is present.
The a posteriori signal-to-reverberation ratio is defined as:

ξ = φ_ss / φ_vv. (8)
the probability of speech occurrence can be calculated by equation (9):
wherein ξoptIndicating the best a posteriori signal-to-mixture ratio. Research shows that when the true posterior signal-to-noise ratio is between-infinity and 20dB, 10 logs are taken10(ξopt) The speech occurrence probability calculation error is minimal at 15 dB. Calculating the probability of occurrence P (H)1| X), the probability P (H) that speech does not occur0| X) can be calculated using the following formula:
P(H0|X)=1-P(H1|X) (10)
When speech is absent, the speech received by the microphone can be regarded as pure reverberation noise, so the reverberation power spectrum is obtained from:

E(|V|²|X, H0) = E(|V|²|V) = |V|² = |X|². (11)
When speech is present, the reverberation power spectrum is taken from the reverberation estimate of the previous frame:

E(|V|²|X, H1) = φ̂_vv(λ-1, μ), (12)

where φ̂_vv is the self-power spectrum of the estimated reverberation. Thus the reverberation power spectrum E(|V|²|X) may be rewritten as:

E(|V|²|X) = P(H0|X)·|X|² + P(H1|X)·φ̂_vv(λ-1, μ). (13)
interframe smoothing of the reverberant power spectrum:
where alpha is a smoothing factor.
The reverberation power spectrum is updated only when the larger of the speech occurrence probabilities of the two channels (i.e., the two microphones) is below a threshold; otherwise it is left unchanged:

1) if max(P(H1|Xl), P(H1|Xr)) < p0 and P(H1|Xl) < P(H1|Xr), the smoothed update above is computed from the first microphone signal X_l;

2) if max(P(H1|Xl), P(H1|Xr)) < p0 and P(H1|Xl) > P(H1|Xr), the smoothed update above is computed from the second microphone signal X_r;

where P(H1|Xl) denotes the speech occurrence probability of the first microphone signal, P(H1|Xr) that of the second microphone signal, and p0 the threshold.
The low-frequency part of the reverberant voice signal undergoes reverberation power spectrum estimation based on this method; the result is recorded as the low-band estimate φ̂_vv,low(λ, μ).
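The low-band estimator of step 3 can be sketched per frequency bin as follows (a minimal sketch: the exact speech occurrence probability formula is not reproduced above, so a Gerkmann/Hendriks-style expression consistent with the stated ξ_opt = 15 dB is assumed, and the smoothing factor α = 0.9 and threshold p0 = 0.5 are assumed example values):

```python
import numpy as np

XI_OPT = 10 ** (15 / 10)   # optimal ratio: 10*log10(xi_opt) = 15 dB (from the text)
ALPHA = 0.9                # smoothing factor alpha (assumed example value)
P0 = 0.5                   # speech-presence threshold p0 (assumed example value)

def speech_presence_prob(x_pow, phi_vv):
    """Speech occurrence probability P(H1|X) for one channel (assumed
    Gerkmann/Hendriks-style form consistent with the stated xi_opt)."""
    xi = x_pow / np.maximum(phi_vv, 1e-12)
    return 1.0 / (1.0 + (1.0 + XI_OPT) * np.exp(-xi * XI_OPT / (1.0 + XI_OPT)))

def update_reverb_psd(phi_vv, xl_pow, xr_pow):
    """One low-band update of the reverberation power spectrum: refresh only
    when the larger of the two channels' speech probabilities is below p0,
    using the channel with the smaller probability as the observation."""
    pl = speech_presence_prob(xl_pow, phi_vv)
    pr = speech_presence_prob(xr_pow, phi_vv)
    if max(pl, pr) < P0:
        obs, p = (xl_pow, pl) if pl < pr else (xr_pow, pr)
        e_vv = (1 - p) * obs + p * phi_vv            # mix of |X|^2 and previous estimate
        return ALPHA * phi_vv + (1 - ALPHA) * e_vv   # recursive interframe smoothing
    return phi_vv                                    # speech active: keep previous estimate

phi = update_reverb_psd(1.0, 0.8, 0.9)       # low-energy frame -> estimate is updated
print(phi)
phi2 = update_reverb_psd(1.0, 100.0, 100.0)  # strong speech -> estimate is frozen
print(phi2)
```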
4) The consistency (coherence) of the different signal components is calculated. In the high-frequency band the reverberation signal is clearly distinguished from the speech signal by its consistency, so the high-band reverberation is estimated from consistency. First the consistency between the different components of the speech must be computed. The consistency of the speech received by the microphones can be calculated directly from the definition of consistency; the consistency between two signals x1 and x2 is defined as:

Γ_{x1x2}(λ, μ) = φ_{x1x2}(λ, μ) / sqrt(φ_{x1x1}(λ, μ)·φ_{x2x2}(λ, μ)),
where φ_{x1x1} and φ_{x2x2} denote the self-power spectra of the signals x1 and x2, and φ_{x1x2} their cross-power spectrum, computed with a recursive averaging method:

φ_{x1x2}(λ, μ) = α_PSD·φ_{x1x2}(λ-1, μ) + (1 - α_PSD)·X1(λ, μ)·X2*(λ, μ),

where α_PSD is a smoothing factor and * denotes the complex conjugate.
Reverberant speech is generally assumed to form a diffuse (scattering) sound field, i.e., one produced by innumerable uncorrelated signals propagating simultaneously in all directions with equal energy. In the traditional method, the ideal consistency of a diffuse sound field is calculated as:

Γ_diffuse(f) = sin(2πf·d_mic / c) / (2πf·d_mic / c),

where f denotes frequency, d_mic the distance between the two microphones, and c the speed of sound. However, when the two microphones are located at the left and right ears of a human head, the consistency of the diffuse field becomes more complicated owing to the shadowing of the head. The curve-fitted model proposed by Jeub et al. is therefore used to approximate it, with constants a_p, b_p and c_p taking the values 2.38×10⁻³, 1371 and 151.5 respectively, and model order P = 3.
For clean speech the consistency is high. Assuming the source reaches both microphones from the angle θ, the consistency between the clean speech components can be expressed as:

Γ_s(f) = exp(j·2πf·d_mic·cos θ / c),

where f denotes the frequency, c the speed of sound propagation in air, and d_mic the distance between the two microphones.
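The recursive consistency estimator and the free-field diffuse model above can be sketched as follows (a minimal sketch; the microphone spacing of 0.17 m and the smoothing factor α_PSD = 0.8 are assumed example values, and the head-shadow-corrected Jeub model is not reproduced):

```python
import numpy as np

C = 343.0      # speed of sound in air, m/s
D_MIC = 0.17   # microphone spacing in m (assumed, roughly the width of a head)

def diffuse_coherence(f):
    """Ideal free-field diffuse-field consistency sinc(2*pi*f*d/c); note
    np.sinc(x) = sin(pi*x)/(pi*x), hence the argument 2*f*d/c."""
    return np.sinc(2.0 * f * D_MIC / C)

class CoherenceEstimator:
    """Recursively averaged self-/cross-power spectra and the resulting
    complex consistency (coherence) of two STFT signals."""
    def __init__(self, n_bins, alpha_psd=0.8):   # alpha_PSD is an assumed value
        self.a = alpha_psd
        self.p11 = np.ones(n_bins)
        self.p22 = np.ones(n_bins)
        self.p12 = np.zeros(n_bins, dtype=complex)

    def update(self, X1, X2):
        a = self.a
        self.p11 = a * self.p11 + (1 - a) * np.abs(X1) ** 2
        self.p22 = a * self.p22 + (1 - a) * np.abs(X2) ** 2
        self.p12 = a * self.p12 + (1 - a) * X1 * np.conj(X2)
        return self.p12 / np.sqrt(self.p11 * self.p22)

# Identical (fully coherent) inputs drive |Gamma| toward 1
est = CoherenceEstimator(4)
rng = np.random.default_rng(1)
for _ in range(200):
    X = rng.standard_normal(4) + 1j * rng.standard_normal(4)
    g = est.update(X, X)
print(np.abs(g))
```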
5) The reverberation power spectrum of the high-frequency band is estimated from the signal consistency. Since the reverberant sound field is assumed to be diffuse, the noise signals received by the two microphones have the same power spectrum φ_vv. Considering the head shadow effect, the difference between the power spectra of the clean speech signals received by the binaural microphones cannot simply be ignored, and the power spectrum of the clean speech signal can be expressed as:
where H_l and H_r respectively denote the transfer functions of the left and right ears, S the sound source signal, S_l the sound signal received by the left microphone, and S_r the sound signal received by the right microphone. Combining the binaural signal coherence function γ yields:
Thus the relationships among the self-power and cross-power spectra of the left- and right-ear clean speech signals s_l and s_r, the reverberation signals v_l and v_r, and the speech signals x_l and x_r received by the microphones can be expressed as:
Since the reverberation is assumed uncorrelated with the speech, combining equations (23), (25) and (26) yields:
combining the definition of binaural coherence with equation (28), one obtains:
solving the equation (29) to obtain the estimation result phi of the reverberation power spectrumvv. Rewrite equation (29) to:
whose solution gives:
Theoretically, the consistency of the speech signal is strong while that of the reverberation signal is weak, and the consistency of the received signal does not exceed that of the clean speech signal, so equation (31) can be considered to have a real solution. To guarantee that the reverberation power spectrum φ_vv is positive, the reverberation power spectrum is calculated with equation (32):
where the self-power spectra and the cross-power spectrum are likewise calculated with the recursive averaging method.
The high-frequency part of the reverberant speech undergoes reverberation power spectrum estimation based on this method; the estimation result is recorded as the high-band estimate φ̂_vv,high(λ, μ).
Because a certain difference exists between the theoretical and the actual signal consistency, the result of the reverberation power spectrum estimation is affected. To further improve the estimate, the consistency values are updated online.
When the larger of the two speech occurrence probabilities is below a threshold, the consistency of the reverberation signal is updated from the consistency of the speech signals received by the microphones:

if max(P(H1|Xl), P(H1|Xr)) < p0: Γ̂_vlvr(λ, μ) = α_γ·Γ̂_vlvr(λ-1, μ) + (1 - α_γ)·γ_xlxr(λ, μ).

When the smaller of the two speech occurrence probabilities is above a threshold, the consistency of the clean speech signal, obtained through equation (29), is updated from the consistency of the received speech:

if min(P(H1|Xl), P(H1|Xr)) > p1: Γ̂_slsr(λ, μ) = α_γ·Γ̂_slsr(λ-1, μ) + (1 - α_γ)·(φ_xlxr - Γ̂_vlvr·φ_vv) / sqrt((φ_xlxl - φ_vv)·(φ_xrxr - φ_vv)),

where p0 and p1 denote thresholds, Γ̂_vlvr the consistency of the reverberation signal, α_γ the smoothing coefficient, γ_xlxr the consistency between the two voices received by the microphones, Γ̂_slsr the consistency of the clean speech signal, φ_xlxr the cross-power spectrum of the speech received by the two microphones, φ_xlxl the self-power spectrum of the speech received by the left microphone, φ_xrxr the self-power spectrum of the speech received by the right microphone, and φ_vv the self-power spectrum of the reverberation signal. Since the consistency-based reverberation power spectrum estimation uses only the square of the clean speech consistency, only the second update needs to be applied.
6) The high- and low-frequency reverberation power spectrum estimates are combined. When the frequency μ is below a set value μ_s (the frequency dividing the low and high bands), the reverberation power spectrum is the low-band estimate of step 3); when the frequency is above μ_s, it is the high-band estimate of step 5). That is:

φ̂_vv(λ, μ) = φ̂_vv,low(λ, μ) for μ < μ_s, and φ̂_vv(λ, μ) = φ̂_vv,high(λ, μ) for μ ≥ μ_s.
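Per frequency bin, the combination of step 6) is a simple piecewise selection, sketched below (the 800 Hz split frequency and the example spectra are illustrative assumptions, not values specified by the invention):

```python
import numpy as np

def combine_bands(phi_low, phi_high, freqs, f_split):
    """Step 6: below the split frequency use the probability-based low-band
    estimate, at or above it the consistency-based high-band estimate."""
    return np.where(freqs < f_split, phi_low, phi_high)

# Illustrative values: low-band estimate 2.0, high-band estimate 0.5,
# split frequency 800 Hz (an assumed example value for mu_s)
freqs = np.array([100.0, 500.0, 1000.0, 3000.0])
phi = combine_bands(np.full(4, 2.0), np.full(4, 0.5), freqs, 800.0)
print(phi)
```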
7) The final reverberation power spectrum is calculated from the combined high- and low-frequency reverberation power spectrum of step 6) using a conventional recursive smoothing algorithm.
8) A gain function is calculated. After the power spectrum of the reverberation signal is obtained through calculation, a gain function can be designed by using the reverberation power spectrum, and the signal received by the microphone is multiplied by the gain function to obtain the signal after reverberation is removed. Speech dereverberation based on an estimate of the reverberation power spectrum is often filtered by spectral subtraction. It is based on a simple principle: assuming that the reverberation is echo noise, a clean speech signal spectrum can be obtained by subtracting an estimate of the reverberation spectrum from the reverberant speech spectrum received by the microphone. The gain function is as follows:
where Φ̂vv denotes the estimated reverberation power spectrum, Φxx denotes the computed power spectrum of the speech signal received by the microphone, and ξ2(λ) denotes the square of the a posteriori signal-to-noise ratio. To avoid over-subtraction, a lower bound Gmin is set. The dereverberated speech signal is represented in the frequency domain as:
9) Finally, the dereverberated time-domain signal is obtained by applying the inverse short-time Fourier transform.
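Steps 8) and 9) can be sketched as a spectral-subtraction gain with a floor. The patent's exact gain involves the squared a posteriori signal-to-noise ratio, so the magnitude-domain form below is only an illustrative stand-in; it uses the subtraction factor (0.85) and spectral lower bound (-10 dB) from Table 1, and all names are hypothetical:

```python
import numpy as np

def spectral_subtraction_gain(phi_x, phi_v_hat, beta=0.85, g_min_db=-10.0):
    """Spectral-subtraction gain with a lower bound.

    phi_x     -- power spectrum of the reverberant microphone signal
    phi_v_hat -- estimated reverberation power spectrum
    beta      -- subtraction factor (0.85 in Table 1)
    g_min_db  -- gain floor (-10 dB in Table 1) to avoid over-subtraction
    """
    phi_x = np.asarray(phi_x, dtype=float)
    phi_v_hat = np.asarray(phi_v_hat, dtype=float)
    g_min = 10.0 ** (g_min_db / 20.0)
    # Subtract the scaled reverberation estimate, clamp at zero before the
    # square root, then apply the lower bound G_min.
    gain = np.sqrt(np.maximum(1.0 - beta * phi_v_hat / np.maximum(phi_x, 1e-12), 0.0))
    return np.maximum(gain, g_min)
```

Multiplying the microphone spectrum by this gain and applying the inverse short-time Fourier transform (e.g. by overlap-add) then yields the dereverberated time-domain signal of step 9).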
Correspondingly, the present invention also provides a binaural speech dereverberation apparatus based on speech occurrence probability and consistency, comprising:
the preprocessing unit is responsible for carrying out time delay compensation on the voice signals received by the two microphones to obtain voice signals aligned in time; performing windowing and framing processing on the voice signals aligned in time, and transforming the voice signals from a time domain to a frequency domain through Fourier transform;
the low-frequency-band reverberation power spectrum estimation unit is responsible for estimating the reverberation power spectrum of the low-frequency-band part of the voice signal based on the voice occurrence probability;
the high-frequency-band reverberation power spectrum estimation unit is responsible for calculating the consistency of different signal components of the voice signal; estimating a reverberation power spectrum of a highband portion of a speech signal based on the coherence;
the reverberation power spectrum estimation unit combined with high and low frequencies is responsible for estimating the reverberation power spectrum combined with high and low frequencies according to the reverberation power spectrum of the low frequency band part and the reverberation power spectrum of the high frequency band part and the division threshold value of high and low frequency bands;
the dereverberation unit is responsible for calculating to obtain a final reverberation power spectrum by using a recursive smoothing algorithm according to the reverberation power spectrum combined with the high frequency and the low frequency; calculating a gain function according to the final reverberation power spectrum, and obtaining a frequency domain signal after reverberation is removed through the gain function; and according to the frequency domain signal after dereverberation, obtaining a time domain signal after dereverberation by utilizing short-time Fourier inverse transformation.
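For concreteness, the windowing, framing and Fourier transform performed by the preprocessing unit could be sketched as below, using the frame length (320) and frame shift (160) from the embodiment. The Hann window and an FFT size equal to the frame length are assumptions; the patent specifies only the frame length and shift:

```python
import numpy as np

def stft_frames(x, frame_len=320, frame_shift=160):
    """Windowed framing + FFT (frame length 320, shift 160 at 16 kHz).

    Returns a (num_frames, frame_len) complex array of per-frame spectra.
    Assumes len(x) >= frame_len.
    """
    x = np.asarray(x, dtype=float)
    window = np.hanning(frame_len)  # assumed window; not specified in the patent
    num_frames = 1 + max(0, (x.size - frame_len) // frame_shift)
    frames = np.stack([x[i * frame_shift: i * frame_shift + frame_len] * window
                       for i in range(num_frames)])
    return np.fft.fft(frames, axis=1)
```

Each later unit (reverberation power spectrum estimation, gain computation, inverse transform) then operates on these per-frame spectra.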
The invention has the beneficial effects that:
the method adopts different reverberation power spectrum estimation for high and low frequencies by utilizing the difference of consistency between the reverberation and pure voice received by the two microphones, removes the reverberation of the low frequency part by utilizing a model for calculating the reverberation power spectrum based on the voice occurrence probability, and removes the reverberation of the high frequency part by utilizing the voice consistency model, so that the reverberation on the whole frequency band can be effectively removed, and the voice perception quality is improved.
Drawings
Fig. 1 is a flow diagram of a binaural speech dereverberation method based on speech occurrence probability and consistency according to the present invention.
Fig. 2 compares the true reverberation power spectrum with the reverberation power spectra estimated by the consistency-based dereverberation method before and after the improvement, in an embodiment of the present invention.
Fig. 3(a)-3(c) are, respectively, the spectrogram of a speech signal contaminated by reverberation, the spectrogram of the speech after dereverberation by the consistency-based method before the improvement, and the spectrogram of the speech after dereverberation using the speech occurrence probability and consistency after the improvement.
Detailed Description
The invention is described more fully below with reference to the following examples and accompanying drawings.
The database used in this embodiment is among the most authoritative and widely used in the international speech enhancement field. Clean speech was taken from the TSP database, 80 utterances in total, for testing. The signals received by the microphones were obtained by convolving the clean speech with room impulse responses provided by the AIR (Aachen Impulse Response) database. The AIR database was recorded by the Institute of Communication Systems of RWTH Aachen University in Germany using an HMS II artificial head; it covers scenes of different types, such as offices, meeting rooms and lecture halls, and is intended for studying signal processing algorithms in reverberant environments. The two microphones are located at the left and right ears of the artificial head, about 0.17 m apart.
The present embodiment adopts a binaural speech dereverberation method based on speech occurrence probability and consistency as shown in fig. 1 to perform speech dereverberation algorithm evaluation under different reverberation scenes. The specific settings for the parameters in the algorithm are shown in table 1.
TABLE 1 Algorithm parameter set
Parameter | Value
Sampling rate fs | 16 kHz
Frame length L | 320
Frame shift M | 160
Spectral smoothing | 50%
Subtraction factor β | 0.85
Spectral lower bound Gmin | -10 dB
Table 2 shows the perceptual evaluation of speech quality (PESQ) scores and the improvement in the speech-to-reverberation modulation energy ratio (ΔSRMR) obtained by the method before the improvement, which uses only the consistency for reverberation estimation and removal, and by the improved method, which uses the speech occurrence probability and the consistency. Comparing ΔSRMR before and after the improvement shows that the dereverberation method based on speech occurrence probability and consistency removes noticeably more reverberation and therefore achieves higher PESQ scores.
TABLE 2 PESQ and ΔSRMR of the algorithm before and after the improvement
Reverberation scenario | Office | Speech room | Corridor | Auditorium
Reverberation time | 0.45s | 0.85s | 0.83s | 5.16s
Initial PESQ value | 1.89 | 1.62 | 1.74 | 1.44
PESQ - before improvement | 2.19 | 1.78 | 1.92 | 1.61
PESQ - after improvement | 2.42 | 2.00 | 2.07 | 1.78
ΔSRMR - before improvement | 1.05 | 1.11 | 1.19 | 0.90
ΔSRMR - after improvement | 1.32 | 1.37 | 1.41 | 1.18
Fig. 2 shows the power spectrum of the true reverberant signal in the office scenario of this embodiment, together with the reverberation power spectra estimated by the consistency-based method before and after the improvement. It is apparent from Fig. 2 that the power spectrum estimated by the improved method is closer to the true reverberation power spectrum.
The dereverberation effect can be observed more clearly from the spectrogram of the dereverberated speech signal; examples are given in Fig. 3(a)-3(c), which show, respectively, the spectrogram of a speech signal contaminated by reverberation, the spectrogram after dereverberation by the consistency-based method before the improvement, and the spectrogram after dereverberation using the speech occurrence probability and consistency after the improvement. The spectrograms show that the method of the invention removes more reverberation, especially in the low-frequency part.
Another embodiment of the present invention provides a binaural speech dereverberation apparatus based on speech occurrence probability and consistency, including:
the preprocessing unit is responsible for carrying out time delay compensation on the voice signals received by the two microphones to obtain voice signals aligned in time; performing windowing and framing processing on the voice signals aligned in time, and transforming the voice signals from a time domain to a frequency domain through Fourier transform;
the low-frequency-band reverberation power spectrum estimation unit is responsible for estimating the reverberation power spectrum of the low-frequency-band part of the voice signal based on the voice occurrence probability;
the high-frequency-band reverberation power spectrum estimation unit is responsible for calculating the consistency of different signal components of the voice signal; estimating a reverberation power spectrum of a highband portion of a speech signal based on the coherence;
the reverberation power spectrum estimation unit combined with high and low frequencies is responsible for estimating the reverberation power spectrum combined with high and low frequencies according to the reverberation power spectrum of the low frequency band part and the reverberation power spectrum of the high frequency band part and the division threshold value of high and low frequency bands;
the dereverberation unit is responsible for calculating to obtain a final reverberation power spectrum by using a recursive smoothing algorithm according to the reverberation power spectrum combined with the high frequency and the low frequency; calculating a gain function according to the final reverberation power spectrum, and obtaining a frequency domain signal after reverberation is removed through the gain function; and according to the frequency domain signal after dereverberation, obtaining a time domain signal after dereverberation by utilizing short-time Fourier inverse transformation.
The above examples are merely illustrative of the present invention, and although examples of the present invention are disclosed for illustrative purposes, those skilled in the art will appreciate that: various substitutions, changes and modifications are possible without departing from the spirit and scope of the present invention and the appended claims. Therefore, the present invention should not be limited to the content of the examples, and the scope of the present invention should be defined by the claims.
Claims (8)
1. A binaural speech dereverberation method based on speech occurrence probability and consistency comprises the following steps:
1) carrying out time delay compensation on voice signals received by the two microphones to obtain voice signals aligned in time;
2) performing windowing and framing processing on the voice signals aligned in time, and transforming the voice signals from a time domain to a frequency domain through Fourier transform;
3) estimating a reverberation power spectrum of a low frequency band part of the speech signal based on the speech occurrence probability; separately estimating the reverberation power spectrum of the low frequency band to ensure that the reverberation of the low frequency band can be removed; when the larger value of the voice occurrence probabilities of the two channels is lower than a certain threshold value, updating the reverberation power spectrum, otherwise, not updating; the method for updating the reverberation power spectrum comprises the following steps:
a) if max(P(H1|Xl), P(H1|Xr)) < p0 and P(H1|Xl) < P(H1|Xr),
b) if max(P(H1|Xl), P(H1|Xr)) < p0 and P(H1|Xl) > P(H1|Xr),
where P(H1|Xl) denotes the speech occurrence probability of the first microphone signal Xl, P(H1|Xr) denotes the speech occurrence probability of the second microphone signal Xr, p0 denotes the threshold, λ and μ denote the frame index and the frequency respectively, H1 denotes speech presence, H0 denotes speech absence, and Φ̂vv is the estimated self-power spectrum of the reverberation;
4) calculating the consistency of different signal components of the speech signal;
5) estimating a reverberation power spectrum of a highband portion of a speech signal based on the coherence;
6) estimating the reverberation power spectrum combined with high and low frequencies according to the reverberation power spectrum of the low frequency band part and the reverberation power spectrum of the high frequency band part and the division threshold of the high and low frequency bands;
7) calculating by using a recursive smoothing algorithm according to the combined high-low frequency reverberation power spectrum to obtain a final reverberation power spectrum;
8) calculating a gain function according to the final reverberation power spectrum, and obtaining a frequency domain signal after reverberation is removed through the gain function;
9) and according to the frequency domain signal after dereverberation, obtaining a time domain signal after dereverberation by utilizing short-time Fourier inverse transformation.
2. The method as claimed in claim 1, wherein the two speech signals in step 1) are time-delay compensated using the GCC-PHAT-ργ method, so as to overcome the influence of interfering factors in the environment on the peak position of the cross-correlation function.
3. The method of claim 1, wherein step 4) assumes reverberation as a diffuse sound field and computes the consistency using a reverberation consistency model with head occlusion.
4. The method of claim 1, wherein step 5) comprises the sub-steps of:
5-1) updating the consistency of the signals according to the voice occurrence probability at all frequencies;
5-2) considering the influence of the head shielding effect, and estimating the reverberation power spectrum by combining a consistency function under the condition that the power spectrums of pure voice signals received by the two microphones are different.
5. The method of claim 4, wherein the self-power spectrum and cross-power spectrum of the clean speech received by the two microphones in step 5) are represented as:
6. The method of claim 5, wherein step 5-1) comprises:
a) updating the consistency of the reverberant speech: when the larger of the two speech occurrence probabilities is lower than a certain threshold, the consistency of the reverberant signal is updated using the consistency of the speech signals received by the microphones as follows:
if max(P(H1|Xl), P(H1|Xr)) < p0,
where Γvv denotes the consistency of the reverberant signal, αγ denotes the smoothing coefficient, Γxlxr denotes the consistency between the two speech signals received by the microphones, and p0 denotes the threshold in "when the larger of the two speech occurrence probabilities is lower than a certain threshold";
b) updating the consistency of the clean speech: when the smaller of the two speech occurrence probabilities is higher than a certain threshold, the consistency of the speech signals received by the microphones is used to update the consistency of the clean speech signal as follows:
if min(P(H1|Xl), P(H1|Xr)) > p1,
where Γslsr denotes the consistency of the clean speech signal, Φxlxr denotes the cross-power spectrum of the speech received by the two microphones, Φxlxl denotes the self-power spectrum of the speech received by the left microphone, Φxrxr denotes the self-power spectrum of the speech received by the right microphone, Φvv denotes the self-power spectrum of the reverberant signal, and p1 denotes the threshold in "when the smaller of the two speech occurrence probabilities is higher than a certain threshold";
step 5-2) the estimation of the reverberation power spectrum is as follows:
7. the method of claim 6, wherein the reverberation power spectrum of the combined high and low frequencies estimated in step 6) is:
where μ denotes a frequency and μs denotes the frequency value that separates the low and high bands; the former estimate is the reverberation power spectrum of the low-band part estimated from the speech occurrence probability, and the latter is the reverberation power spectrum of the high-band part estimated from the consistency.
8. A binaural speech dereverberation device based on speech occurrence probability and consistency using the method of any of claims 1-7, comprising:
the preprocessing unit is responsible for carrying out time delay compensation on the voice signals received by the two microphones to obtain voice signals aligned in time; performing windowing and framing processing on the voice signals aligned in time, and transforming the voice signals from a time domain to a frequency domain through Fourier transform;
the low-frequency-band reverberation power spectrum estimation unit is responsible for estimating the reverberation power spectrum of the low-frequency-band part of the voice signal based on the voice occurrence probability;
the high-frequency-band reverberation power spectrum estimation unit is responsible for calculating the consistency of different signal components of the voice signal; estimating a reverberation power spectrum of a highband portion of a speech signal based on the coherence;
the reverberation power spectrum estimation unit combined with high and low frequencies is responsible for estimating the reverberation power spectrum combined with high and low frequencies according to the reverberation power spectrum of the low frequency band part and the reverberation power spectrum of the high frequency band part and the division threshold value of high and low frequency bands;
the dereverberation unit is responsible for calculating to obtain a final reverberation power spectrum by using a recursive smoothing algorithm according to the reverberation power spectrum combined with the high frequency and the low frequency; calculating a gain function according to the final reverberation power spectrum, and obtaining a frequency domain signal after reverberation is removed through the gain function; and according to the frequency domain signal after dereverberation, obtaining a time domain signal after dereverberation by utilizing short-time Fourier inverse transformation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810765266.3A CN108986832B (en) | 2018-07-12 | 2018-07-12 | Binaural voice dereverberation method and device based on voice occurrence probability and consistency |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810765266.3A CN108986832B (en) | 2018-07-12 | 2018-07-12 | Binaural voice dereverberation method and device based on voice occurrence probability and consistency |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108986832A CN108986832A (en) | 2018-12-11 |
CN108986832B true CN108986832B (en) | 2020-12-15 |
Family
ID=64537944
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810765266.3A Active CN108986832B (en) | 2018-07-12 | 2018-07-12 | Binaural voice dereverberation method and device based on voice occurrence probability and consistency |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108986832B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110095755B (en) * | 2019-04-01 | 2021-03-12 | 云知声智能科技股份有限公司 | Sound source positioning method |
CN110012331B (en) * | 2019-04-11 | 2021-05-25 | 杭州微纳科技股份有限公司 | Infrared-triggered far-field double-microphone far-field speech recognition method |
CN110718230B (en) * | 2019-08-29 | 2021-12-17 | 云知声智能科技股份有限公司 | Method and system for eliminating reverberation |
CN110691296B (en) * | 2019-11-27 | 2021-01-22 | 深圳市悦尔声学有限公司 | Channel mapping method for built-in earphone of microphone |
CN111128213B (en) * | 2019-12-10 | 2022-09-27 | 展讯通信(上海)有限公司 | Noise suppression method and system for processing in different frequency bands |
CN113613112B (en) * | 2021-09-23 | 2024-03-29 | 三星半导体(中国)研究开发有限公司 | Method for suppressing wind noise of microphone and electronic device |
CN115831145B (en) * | 2023-02-16 | 2023-06-27 | 之江实验室 | Dual-microphone voice enhancement method and system |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2006243290A (en) * | 2005-03-02 | 2006-09-14 | Advanced Telecommunication Research Institute International | Disturbance component suppressing device, computer program, and speech recognition system |
WO2009151062A1 (en) * | 2008-06-10 | 2009-12-17 | ヤマハ株式会社 | Acoustic echo canceller and acoustic echo cancel method |
JP2011065128A (en) * | 2009-08-20 | 2011-03-31 | Mitsubishi Electric Corp | Reverberation removing device |
CN102347028A (en) * | 2011-07-14 | 2012-02-08 | 瑞声声学科技(深圳)有限公司 | Double-microphone speech enhancer and speech enhancement method thereof |
CN102800322A (en) * | 2011-05-27 | 2012-11-28 | 中国科学院声学研究所 | Method for estimating noise power spectrum and voice activity |
JP2013044908A (en) * | 2011-08-24 | 2013-03-04 | Nippon Telegr & Teleph Corp <Ntt> | Background sound suppressor, background sound suppression method and program |
CN106297817A (en) * | 2015-06-09 | 2017-01-04 | 中国科学院声学研究所 | A kind of sound enhancement method based on binaural information |
CN106971740A (en) * | 2017-03-28 | 2017-07-21 | 吉林大学 | Probability and the sound enhancement method of phase estimation are had based on voice |
-
2018
- 2018-07-12 CN CN201810765266.3A patent/CN108986832B/en active Active
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2006243290A (en) * | 2005-03-02 | 2006-09-14 | Advanced Telecommunication Research Institute International | Disturbance component suppressing device, computer program, and speech recognition system |
WO2009151062A1 (en) * | 2008-06-10 | 2009-12-17 | ヤマハ株式会社 | Acoustic echo canceller and acoustic echo cancel method |
JP2011065128A (en) * | 2009-08-20 | 2011-03-31 | Mitsubishi Electric Corp | Reverberation removing device |
CN102800322A (en) * | 2011-05-27 | 2012-11-28 | 中国科学院声学研究所 | Method for estimating noise power spectrum and voice activity |
CN102347028A (en) * | 2011-07-14 | 2012-02-08 | 瑞声声学科技(深圳)有限公司 | Double-microphone speech enhancer and speech enhancement method thereof |
JP2013044908A (en) * | 2011-08-24 | 2013-03-04 | Nippon Telegr & Teleph Corp <Ntt> | Background sound suppressor, background sound suppression method and program |
CN106297817A (en) * | 2015-06-09 | 2017-01-04 | 中国科学院声学研究所 | A kind of sound enhancement method based on binaural information |
CN106971740A (en) * | 2017-03-28 | 2017-07-21 | 吉林大学 | Probability and the sound enhancement method of phase estimation are had based on voice |
Non-Patent Citations (2)
Title |
---|
Supervised single-channel speech dereverberation and denoising using a two-stage model based sparse representation;Zhang Long et al.;《Speech Communication》;20171227;第1-8页 * |
基于麦克风阵列的混响消减处理;陈建荣等;《电声技术》;20180315;第42卷(第3期);第49-51、54页 * |
Also Published As
Publication number | Publication date |
---|---|
CN108986832A (en) | 2018-12-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108986832B (en) | Binaural voice dereverberation method and device based on voice occurrence probability and consistency | |
Yousefian et al. | A dual-microphone speech enhancement algorithm based on the coherence function | |
CN105869651B (en) | Binary channels Wave beam forming sound enhancement method based on noise mixing coherence | |
Wu et al. | A two-stage algorithm for one-microphone reverberant speech enhancement | |
CN111161751A (en) | Distributed microphone pickup system and method under complex scene | |
JP2013527493A (en) | Robust noise suppression with multiple microphones | |
US9532149B2 (en) | Method of signal processing in a hearing aid system and a hearing aid system | |
CN105679330B (en) | Based on the digital deaf-aid noise-reduction method for improving subband signal-to-noise ratio (SNR) estimation | |
Aroudi et al. | Cognitive-driven binaural LCMV beamformer using EEG-based auditory attention decoding | |
Jangjit et al. | A new wavelet denoising method for noise threshold | |
Yousefian et al. | A coherence-based noise reduction algorithm for binaural hearing aids | |
Itoh et al. | Environmental noise reduction based on speech/non-speech identification for hearing aids | |
CN110827847A (en) | Microphone array voice denoising and enhancing method with low signal-to-noise ratio and remarkable growth | |
CN106328160B (en) | Noise reduction method based on double microphones | |
CN114023352B (en) | Voice enhancement method and device based on energy spectrum depth modulation | |
Miyazaki et al. | Theoretical analysis of parametric blind spatial subtraction array and its application to speech recognition performance prediction | |
CN114189781A (en) | Noise reduction method and system for double-microphone neural network noise reduction earphone | |
Gonzalez-Rodriguez et al. | Speech dereverberation and noise reduction with a combined microphone array approach | |
Wu et al. | A two-stage algorithm for enhancement of reverberant speech | |
Madhu et al. | Localisation-based, situation-adaptive mask generation for source separation | |
Hussain et al. | Speech enhancement using degenerate unmixing estimation technique and adaptive noise cancellation technique as a post signal processing | |
Akagi et al. | Noise reduction using a small-scale microphone array in multi noise source environment | |
Brutti et al. | A Phase-Based Time-Frequency Masking for Multi-Channel Speech Enhancement in Domestic Environments. | |
Unoki et al. | Unified denoising and dereverberation method used in restoration of MTF-based power envelope | |
Zhang et al. | Post-secondary filtering improvement of GSC beamforming algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||