CN117238306A

CN117238306A - Voice activity detection and ambient noise elimination method based on double microphones

Info

Publication number: CN117238306A
Application number: CN202311282052.8A
Authority: CN
Inventors: 刘建兵; 冯波; 李鸿鹏; 高峰; 商易; 刘永辉; 朱海波; 姜瑞
Original assignee: Shenzhen Zhilian Technology Co ltd
Current assignee: Shenzhen Zhilian Technology Co ltd
Priority date: 2023-09-28
Filing date: 2023-09-28
Publication date: 2023-12-15

Abstract

The invention discloses a voice activity detection and environmental noise elimination method based on double microphones, belonging to the field of voice signal processing for VoIP terminals. The method specifically comprises the following steps: for a VoIP phone, two omnidirectional microphones are arranged at the front and rear of the phone respectively; when a user uses the phone, the two signal paths are collected, windowed, and fast Fourier transformed, and their respective power spectra are calculated; the logarithms of the two power spectra are then subtracted, and the result is compared with an empirical threshold ε. If the result is larger than ε, voice activity is judged to exist, and, taking the auxiliary signal as the reference signal, an adaptive filter performs noise elimination on the main signal to obtain an enhanced signal; otherwise, no voice activity exists, and the coefficients of the adaptive filter are updated and noise elimination is performed again. Finally, the enhanced signal is encoded and sent via the RTSP protocol, and the user can adjust the audio settings through real-time feedback to achieve the best voice communication effect. The invention greatly reduces hardware cost while still meeting a given level of performance.

Description

Voice activity detection and ambient noise elimination method based on double microphones

Technical Field

The invention belongs to the field of voice signal processing for VoIP (Voice over Internet Protocol) terminals, and particularly relates to a voice activity detection and environmental noise elimination method based on double microphones.

Background

When a VoIP phone is used for hands-free or video conference calls, real-time voice communication quality is degraded by environmental noise. To improve voice quality, environmental noise must be detected and eliminated effectively.

The prior art uses a single microphone. Although this arrangement is simpler, both voice activity detection accuracy and noise reduction performance drop sharply when non-stationary noise occurs [1]. In theory, using multiple microphones to exploit the spatial characteristics of the sound field can improve the noise reduction capability of the system.

Beamforming [2] is the simplest effective method for enhancing speech with an array of multiple microphones. The beamforming noise reduction algorithm assumes that the noise components picked up by the microphones are mutually uncorrelated; in practical applications, however, this assumption often does not hold, so the noise suppression achieved by beamforming is limited. Post-filtering algorithms are often used to further enhance the speech, but they have significant drawbacks of their own: they handle non-stationary noise poorly and degrade voice communication quality when transient disturbances occur. The number of microphones also affects the performance of the beamforming noise reduction algorithm, and too many microphones greatly increase system complexity.

Another relatively common dual-microphone noise reduction method is the energy-difference-based PLD (Power Level Difference) algorithm [3]. Although the energy-difference approach has many advantages, such as less dependence on the accuracy of delay estimation between the two microphones and better handling of non-stationary noise, in practice we find that a Wiener filter estimated from the energy difference often introduces musical noise, and the impact on speech quality can be unacceptable.

In recent years, with the rise of deep learning, neural-network-based noise reduction algorithms have been increasingly applied in practical systems. However, neural network algorithms are data driven, and at low signal-to-noise ratios in complex environments they often damage the speech itself. Moreover, neural networks are costly to train and computationally heavy, so deployment on terminal equipment often requires an NPU, which greatly increases hardware cost.

References

[1] Schnitta B. Speech Enhancement: Theory and Practice, Second Edition [J]. Noise-News International, 2015 (23-1).

[2] Brandstein M S, Ward D B. Microphone Arrays: Signal Processing Techniques and Applications [M]. 2001.

[3] Yousefian N, Rahmani M, Akbari A. Power level difference as a criterion for speech enhancement [C]// IEEE International Conference on Acoustics. IEEE, 2009: 4653-4656. DOI: 10.1109/ICASSP.2009.4960668.

Disclosure of Invention

To solve the above problems, the invention provides a voice activity detection and environmental noise elimination method based on double microphones, in which a main microphone and an environmental noise acquisition microphone are suitably arranged, voice activity detection is performed using the energy ratio between them, and adaptive filtering is then controlled accordingly to eliminate environmental noise.

The voice activity detection and environmental noise elimination method based on double microphones comprises the following specific steps:

Step one, arranging two omnidirectional microphones at the front end and the rear end of a VoIP phone respectively, and collecting the signals of the two microphones when a user uses the phone;

the main microphone is arranged at the front end of the phone, the auxiliary microphone is arranged at the rear end of the phone, and the distance between the two microphones is 5 cm;

the acquired signals are represented as follows:

$$y_i(m) = s_i(m) + n_i(m), \quad i = 1, 2$$

where $y_1(m)$ represents the signal acquired by the main microphone and $y_2(m)$ represents the signal acquired by the auxiliary microphone;

$s_i(m)$ represents the voice signal acquired by the $i$-th microphone when the user uses the phone, and $n_i(m)$ represents the ambient noise collected by the $i$-th microphone;

Step two, windowing the two microphone signals respectively, performing a fast Fourier transform, and calculating their respective power spectra;

the power spectral density of the microphone signal is calculated as follows:

$$P_{Y_i}(n,k) = \lambda P_{Y_i}(n-1,k) + (1-\lambda)\,|Y_i(n,k)|^2, \quad i = 1, 2$$

where $\lambda$ is the forgetting factor, $Y_i(n,k)$ is the frequency-domain value of the microphone signal, $P$ denotes power spectral density, $P_{Y_i}(n,k)$ is the power spectral density of the current frame, and $P_{Y_i}(n-1,k)$ is the power spectral density of the previous frame.

$Y_i(n,k)$ is the frequency-domain value obtained by the short-time Fourier transform of the microphone signal, expressed as:

$$Y_i(n,k) = S_i(n,k) + N_i(n,k), \quad i = 1, 2$$

where $n$ is the frame index, $k$ is the frequency index, and $S_i(n,k)$ and $N_i(n,k)$ are the frequency-domain values obtained by Fourier transforming $s_i(m)$ and $n_i(m)$ respectively;
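As an illustration of step two, the following Python sketch applies the recursive smoothing $P_{Y_i}(n,k) = \lambda P_{Y_i}(n-1,k) + (1-\lambda)|Y_i(n,k)|^2$ to successive frames of one microphone signal. The Hann window, the frame length of 64, the forgetting factor of 0.9, the 8 kHz sampling rate, and the function names are assumptions chosen for the example; the method does not specify them. A naive DFT stands in for the fast Fourier transform so that the sketch needs only the standard library.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n (window type is an assumption)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def dft(frame):
    """Naive DFT returning the non-negative-frequency bins.

    A real system would use an FFT; the result is the same.
    """
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        for k in range(n // 2 + 1)
    ]


def update_psd(prev_psd, frame, lam=0.9):
    """P(n,k) = lam * P(n-1,k) + (1 - lam) * |Y(n,k)|^2 for one frame."""
    window = hann(len(frame))
    spectrum = dft([w * x for w, x in zip(window, frame)])
    return [lam * p + (1.0 - lam) * abs(y) ** 2 for p, y in zip(prev_psd, spectrum)]


# Usage: track the PSD of a 1 kHz tone sampled at 8 kHz (assumed values).
frame_len, fs = 64, 8000
tone = [math.sin(2 * math.pi * 1000 * i / fs) for i in range(frame_len * 20)]
psd = [0.0] * (frame_len // 2 + 1)
for start in range(0, len(tone), frame_len):
    psd = update_psd(psd, tone[start:start + frame_len])
peak_bin = max(range(len(psd)), key=psd.__getitem__)
print(peak_bin)
```

With these assumed values the tone falls exactly on bin 1000 × 64 / 8000 = 8, so the smoothed spectrum peaks there. A forgetting factor closer to 1 gives a steadier estimate that reacts more slowly to changes.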

Step three, taking the logarithm of the power spectra of the two microphones and subtracting them, and judging whether the result is larger than an empirical threshold ε; if yes, voice activity is judged to exist, and the method proceeds to step four; otherwise, no voice activity is judged to exist, and the method proceeds to step five.

The expression is as follows:
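A minimal Python sketch of one reading of the step-three decision follows: take the logarithm of each microphone's power spectrum, subtract the auxiliary from the main, and compare the result with ε. Averaging the difference over frequency bins, expressing it in decibels, the 3 dB default for ε, the small floor that guards the logarithm, and the name `detect_voice` are all assumptions made for illustration; the method itself states only that the log spectra are subtracted and compared with an empirical threshold.

```python
import math


def detect_voice(psd_main, psd_aux, epsilon_db=3.0, floor=1e-12):
    """Return True when the mean log-power difference (main minus aux) exceeds epsilon_db.

    psd_main, psd_aux -- power spectra of the same frame, one value per bin.
    """
    diffs = [
        10.0 * math.log10(max(p1, floor)) - 10.0 * math.log10(max(p2, floor))
        for p1, p2 in zip(psd_main, psd_aux)
    ]
    return sum(diffs) / len(diffs) > epsilon_db


# Noise only: both microphones see nearly the same power in every bin.
noise_only = detect_voice([1.0, 2.0, 1.5], [1.1, 1.9, 1.5])
# Speech: the main microphone is several times stronger in every bin.
speech_present = detect_voice([8.0, 20.0, 6.0], [1.1, 1.9, 1.5])
print(noise_only, speech_present)
```

In practice ε would be tuned on recordings from the actual phone housing, since the level difference depends on the microphone spacing and how strongly the phone body shields the auxiliary microphone.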

Step four, taking the signal acquired by the auxiliary microphone as the reference signal, performing noise elimination on the signal acquired by the main microphone with an adaptive filter to obtain an enhanced signal, and proceeding to step six;

the formula is as follows:

$$s_E = y_1(m) - h(m) * y_2(m)$$

where $h(m)$ represents the adaptive filter, $*$ denotes convolution, and $s_E$ represents the enhanced signal.
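The filtering in step four can be sketched in Python as a direct-form convolution of the auxiliary signal with the current filter coefficients, subtracted from the main signal. The toy signals, the two-tap filter, and the name `enhance` are assumptions for illustration; samples before the start of the recording are taken as zero.

```python
def enhance(y_main, y_aux, h):
    """s_E(m) = y1(m) - sum_j h[j] * y2(m - j), treating y2 as zero before m = 0."""
    out = []
    for m, y1 in enumerate(y_main):
        noise_estimate = sum(h[j] * y_aux[m - j] for j in range(len(h)) if m - j >= 0)
        out.append(y1 - noise_estimate)
    return out


# Toy case: the noise reaching the main microphone is the auxiliary noise
# delayed by one sample and scaled by 0.5 (assumed acoustic path).
y_aux = [1.0, -2.0, 3.0, 0.5, -1.0]
speech = [0.0, 0.0, 0.2, 0.4, 0.0]
y_main = [s + 0.5 * (y_aux[m - 1] if m >= 1 else 0.0) for m, s in enumerate(speech)]

s_e = enhance(y_main, y_aux, h=[0.0, 0.5])
print(s_e)
```

When the filter matches the path between the two microphones, as it does by construction here, the output recovers the speech component; the adaptive update exists to find such coefficients without knowing the path in advance.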

Step five, updating the coefficients of the adaptive filter, and returning to step four;

the update formula is as follows:

where $\mu$ represents the adaptive filter update step size, and $e(m) = y_1(m) - y_2(m)$.
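For orientation, the sketch below shows one step of the textbook normalized LMS (NLMS) update, which the embodiment names as its filter type. It is not a transcription of the method's own update formula. In particular, the text defines $e(m) = y_1(m) - y_2(m)$, whereas this sketch uses the conventional NLMS error, the main sample minus the filtered reference. That substitution, the regularization constant `delta`, the step size of 0.5, and the function name are assumptions.

```python
import random


def nlms_update(h, x_recent, y1, mu=0.5, delta=1e-8):
    """One textbook NLMS step.

    h        -- current coefficients; h[j] multiplies x_recent[j]
    x_recent -- reference samples, newest first: [y2(m), y2(m-1), ...]
    y1       -- current main-microphone sample
    Returns (new_h, e), where e is the error before the update.
    """
    estimate = sum(c * x for c, x in zip(h, x_recent))
    e = y1 - estimate
    norm = sum(x * x for x in x_recent) + delta  # input power, regularized
    new_h = [c + mu * e * x / norm for c, x in zip(h, x_recent)]
    return new_h, e


# Usage: identify an assumed two-tap noise path from a random reference signal.
rng = random.Random(0)
ref = [rng.uniform(-1.0, 1.0) for _ in range(500)]
true_h = [0.5, -0.3]
h = [0.0, 0.0]
for m in range(1, len(ref)):
    x_recent = [ref[m], ref[m - 1]]
    y1 = true_h[0] * ref[m] + true_h[1] * ref[m - 1]
    h, e = nlms_update(h, x_recent, y1)
print(h)
```

Dividing by the input power makes the effective step size independent of the reference signal level, which is the usual reason for choosing NLMS over plain LMS when the noise level varies.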

Step six, encoding the enhanced signal and sending it via the RTSP protocol; the user can adjust the audio settings through real-time feedback to achieve the best voice communication effect.

The invention has the following advantages:

1) By using a dual-microphone configuration, the method effectively improves the performance of voice activity detection and noise elimination.

2) Because signal processing is performed by adaptive filtering, the method is simpler to implement, lower in complexity, and better in real-time performance than deep-learning-based methods, and it can run on a wider range of cheaper chips, so its range of application is broader.

3) The method is very effective at eliminating annoying environmental noise during the parts of a voice call when no one is speaking.

4) When a talker speaks in a stationary noise environment, the method also suppresses environmental noise to a certain extent.

Drawings

FIG. 1 is a schematic diagram of a dual microphone based voice activity detection and ambient noise cancellation method of the present invention;

FIG. 2 is a flow chart of a dual microphone based voice activity detection and ambient noise cancellation method of the present invention;

Detailed Description

The present invention is described in further detail below with reference to the accompanying drawings, so that those of ordinary skill in the art can understand and practice it.

The invention relates to a voice activity detection and environmental noise elimination method based on double microphones, in which voice activity detection is performed using the energy ratio of the two microphones, and the adaptive filtering strategy is then controlled according to the detection result to eliminate environmental noise. The principle is shown in FIG. 1: two omnidirectional microphones are arranged at the front and rear of the phone respectively; when a user uses the phone, the two microphone signals are each windowed and fast Fourier transformed, and the logarithms of their power spectra are calculated and subtracted to judge whether voice activity exists. If voice activity exists, updating of the adaptive filter coefficients is stopped, and the current adaptive filter coefficients and the noise power spectrum are used to eliminate environmental noise and obtain an enhanced signal; when there is no voice activity, the adaptive filter coefficients are updated normally and environmental noise is eliminated.

The voice activity detection and environmental noise elimination method based on double microphones is shown in FIG. 2, and specifically comprises the following steps:

the acquired signals are represented as follows:

$$y_i(m) = s_i(m) + n_i(m), \quad i = 1, 2$$

Step two, windowing the two microphone signals respectively, performing a fast Fourier transform, and calculating their respective power spectra.

That is, a short-time Fourier transform is applied to the time-domain signal, and the microphone signal is represented in the frequency domain as:

$$Y_i(n,k) = S_i(n,k) + N_i(n,k), \quad i = 1, 2$$

Assuming that the speech signal and the noise signal are uncorrelated with each other, the power spectral density of the microphone signal can be calculated as follows:

$$P_{Y_i}(n,k) = \lambda P_{Y_i}(n-1,k) + (1-\lambda)\,|Y_i(n,k)|^2, \quad i = 1, 2$$

where $\lambda$ is the forgetting factor, $Y_i(n,k)$ is the frequency-domain value of the microphone signal, $P$ denotes power spectral density, $P_{Y_i}(n,k)$ is the power spectral density of the current frame, and $P_{Y_i}(n-1,k)$ is the power spectral density of the previous frame.

The formula is as follows:

Because the speaker is closer to the main microphone and the auxiliary microphone is shielded by the phone body, the energy difference between the two microphones is typically 3 to 10 dB when the speaker talks.
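To put the 3 to 10 dB figure in linear terms, the short Python sketch below converts between decibels and power ratios, assuming the figure refers to a power ratio, that is $10\log_{10}(P_1/P_2)$. The function names are illustrative.

```python
import math


def db_to_power_ratio(db):
    """Convert a level difference in dB to a linear power ratio."""
    return 10.0 ** (db / 10.0)


def power_ratio_to_db(ratio):
    """Convert a linear power ratio to a level difference in dB."""
    return 10.0 * math.log10(ratio)


low = db_to_power_ratio(3.0)    # roughly a doubling of power
high = db_to_power_ratio(10.0)  # a tenfold increase in power
print(round(low, 3), round(high, 3))
```

Under this assumption, the main microphone receives roughly 2 to 10 times the power of the auxiliary microphone while the speaker talks, which gives a sense of where the empirical threshold ε has to sit relative to the near-zero difference expected for noise alone.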

During adaptive filtering, to avoid divergence when voice activity is present, the previous coefficients are used to filter directly.

The formula is as follows:

$$s_E = y_1(m) - h(m) * y_2(m)$$

Updating the coefficients of the adaptive filter, and returning to step four;

the formula is as follows:

Examples:

First, the dual-microphone configuration is set up: two omnidirectional microphones spaced 5 cm apart are arranged at the front end and the rear end of the SIP phone respectively. Assuming that the environmental noise is additive and uncorrelated with the voice signal of the speaker using the phone, the signals collected by the main microphone at the front end of the phone and the auxiliary microphone at the rear end of the phone can be expressed as:

$$y_i(m) = s_i(m) + n_i(m), \quad i = 1, 2 \tag{1}$$

The purpose of ambient noise cancellation is to remove the ambient noise component $n_1$ from $y_1$. Because the ambient noise is isotropic, i.e.

$$n_1 \approx n_2 \tag{2}$$

when the speaker is not speaking,

$$y_1 \approx y_2 \tag{3}$$

When the speaker talks, the main microphone is closer to the speaker, while the auxiliary microphone is both farther away and shielded by the phone body, so that:

$$P_1 > P_2 \tag{4}$$

where $P_1$ represents the power spectral density of the signal acquired by the main microphone and $P_2$ represents the power spectral density of the signal acquired by the auxiliary microphone. Based on this principle, the following voice activity detection method is designed:

First, the two microphone signals are each windowed and fast Fourier transformed, and their respective power spectra are calculated. The logarithms of the two power spectra are then subtracted; if the result is larger than ε (an empirical threshold), voice activity is judged to exist; otherwise, no voice activity is judged to exist.

Then, dual-microphone environmental noise cancellation is performed: the signal from the environmental noise acquisition microphone is taken as the reference signal, and adaptive filtering is applied to the main microphone signal to eliminate noise. In particular, updating of the adaptive filter is controlled by the voice activity detection result: when voice activity is detected, updating is stopped to prevent the filter from diverging and damaging the voice signal. The formula is as follows:

$$s_E = y_1(n) - h(n) * y_2(n) \tag{5}$$

Finally, the adaptive filter design: for the adaptive filter, this embodiment implements an NLMS filter and uses a block acceleration method. The adaptive filter update formula is as follows:

The signal after noise elimination processing is then output, and the user can adjust the audio settings through real-time feedback to achieve the best voice communication effect.

In the voice activity detection and environmental noise elimination method based on double microphones, environmental noise is eliminated on a VoIP phone by suitably arranging the two microphones. The user, on a VoIP phone that supports RTSP, enables the dual-microphone environmental noise elimination function through a key setting; after the user dials and the call is set up, the voice RTP stream is established after the DSP module has removed the noise.

The VoIP phone comprises a user input module, a call control module, an RTSP protocol control module, a DSP module and a UI module.

The user performs a key operation through the user input module of the VoIP phone to enable the dual-microphone environmental noise suppression function, and a VoIP call request is constructed according to the setting information. The audio acquisition module acquires data from the two microphones and sends it to the DSP module, which performs the following signal processing operations on the data: windowing, fast Fourier transform, power spectrum calculation, logarithm subtraction, and selection of the adaptive filtering strategy according to the voice activity result. If no voice activity is detected, the adaptive filter coefficients are updated; if voice activity is detected, updating of the adaptive filter coefficients is stopped and the previous coefficients are used for filtering. The noise-reduced audio data is encoded according to the settings and then transmitted via the RTSP protocol. With the dual-microphone environmental noise elimination function enabled, the denoised voice is transmitted through the RTSP protocol control module.
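The control flow of the DSP module described above can be sketched in Python as a frame loop: decide voice activity for each frame, always filter, and adapt the coefficients only in frames without voice. This is a simplified sketch under stated assumptions, not the embodiment's implementation. It uses time-domain frame energy in place of the windowed FFT power spectrum, a sample-by-sample textbook NLMS update rather than the block acceleration method, and hypothetical parameter values (frame length 64, 4 taps, ε of 3 dB, step size 0.5). The test signals are synthetic.

```python
import math
import random


def process(y_main, y_aux, frame_len=64, taps=4, eps_db=3.0, mu=0.5, delta=1e-8):
    """Frame-wise voice activity decision followed by VAD-gated NLMS filtering.

    Returns (enhanced_samples, per_frame_voice_flags).
    """
    h = [0.0] * taps
    history = [0.0] * taps  # reference samples, newest first
    enhanced, flags = [], []
    for start in range(0, len(y_main) - frame_len + 1, frame_len):
        f1 = y_main[start:start + frame_len]
        f2 = y_aux[start:start + frame_len]
        e1 = sum(x * x for x in f1) + 1e-12
        e2 = sum(x * x for x in f2) + 1e-12
        voice = 10.0 * math.log10(e1 / e2) > eps_db
        flags.append(voice)
        for y1, y2 in zip(f1, f2):
            history = [y2] + history[:-1]
            err = y1 - sum(c * x for c, x in zip(h, history))
            enhanced.append(err)
            if not voice:  # adapt only while no speech is detected
                norm = sum(x * x for x in history) + delta
                h = [c + mu * err * x / norm for c, x in zip(h, history)]
    return enhanced, flags


# Synthetic test: noise only in the first half, noise plus "speech" in the second.
rng = random.Random(1)
n_frames, frame_len = 40, 64
total = n_frames * frame_len
noise = [rng.uniform(-1.0, 1.0) for _ in range(total)]
speech = [2.0 * math.sin(2 * math.pi * 0.05 * m) if m >= total // 2 else 0.0
          for m in range(total)]
y_aux = noise
y_main = [speech[m] + 0.8 * (noise[m - 1] if m else 0.0) for m in range(total)]

enhanced, flags = process(y_main, y_aux, frame_len=frame_len)
residual = max(abs(enhanced[m] - speech[m]) for m in range(total // 2, total))
print(flags.count(True), residual)
```

In this synthetic case the filter converges during the noise-only frames and is then frozen, so the speech passes through while the noise is subtracted. Freezing the update matters because the speech would otherwise act as a large error signal and pull the coefficients away from the noise path.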

Claims

1. A voice activity detection and environmental noise elimination method based on double microphones is characterized by comprising the following specific steps:

the acquired signals are represented as follows:

$$y_i(m) = s_i(m) + n_i(m), \quad i = 1, 2$$

the power spectral density of the microphone signal is calculated as follows:

$$P_{Y_i}(n,k) = \lambda P_{Y_i}(n-1,k) + (1-\lambda)\,|Y_i(n,k)|^2, \quad i = 1, 2$$

where $\lambda$ is the forgetting factor, $Y_i(n,k)$ is the frequency-domain value of the microphone signal, $P$ denotes power spectral density, $P_{Y_i}(n,k)$ represents the power spectral density of the current frame, and $P_{Y_i}(n-1,k)$ represents the power spectral density of the previous frame;

step three, taking the logarithm of the power spectra of the two microphones and subtracting them, and judging whether the result is larger than an empirical threshold ε; if yes, voice activity is judged to exist, and the method proceeds to step four; otherwise, no voice activity is judged to exist, and the method proceeds to step five;

the expression is as follows:

the formula is as follows:

$$s_E = y_1(m) - h(m) * y_2(m)$$

where $h(m)$ represents the adaptive filter, $*$ denotes convolution, and $s_E$ represents the enhanced signal;

the update formula is as follows:

where $\mu$ represents the adaptive filter update step size, and $e(m) = y_1(m) - y_2(m)$;

2. The voice activity detection and environmental noise elimination method based on double microphones according to claim 1, characterized in that, in step one, the main microphone is arranged at the front end of the phone, the auxiliary microphone is arranged at the rear end of the phone, and the distance between the two microphones is 5 cm.

3. The voice activity detection and environmental noise elimination method based on double microphones according to claim 1, characterized in that, in step two, the frequency-domain value $Y_i(n,k)$ is calculated as:

$$Y_i(n,k) = S_i(n,k) + N_i(n,k), \quad i = 1, 2$$

where $n$ is the frame index, $k$ is the frequency index, and $S_i(n,k)$ and $N_i(n,k)$ are the frequency-domain values obtained by Fourier transforming $s_i(m)$ and $n_i(m)$ respectively.