CN117238306A - Voice activity detection and ambient noise elimination method based on double microphones - Google Patents
- Publication number
- CN117238306A (application CN202311282052.8A)
- Authority
- CN
- China
- Prior art keywords
- microphone
- microphones
- signal
- voice activity
- phone
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Landscapes
- Circuit For Audible Band Transducer (AREA)
Abstract
The invention discloses a voice activity detection and environmental noise elimination method based on double microphones, belonging to the field of voice signal processing for VoIP terminals. The method specifically comprises the following steps: for a VoIP phone, two omnidirectional microphones are arranged at the front and rear of the phone respectively; when a user uses the phone, the two signal paths are collected, windowed and transformed by fast Fourier transform, and their respective power spectra are calculated; the logarithms of the two power spectra are then subtracted, and the result is compared with an empirical threshold ε. If the result is greater than ε, voice activity is judged to exist, the auxiliary signal is taken as the reference signal, and an adaptive filter performs noise elimination on the main signal to obtain an enhanced signal; otherwise, no voice activity exists, the coefficients of the adaptive filter are updated, and noise elimination is performed again. Finally, the enhanced signal is encoded and sent through the RTSP protocol, and the user adjusts the audio settings through real-time feedback to achieve the best voice communication effect. The invention greatly reduces hardware cost while still meeting a given level of performance.
Description
Technical Field
The invention belongs to the field of voice signal processing for VoIP (Voice over Internet Protocol) terminals, and particularly relates to a voice activity detection and environmental noise elimination method based on double microphones.
Background
When a VoIP phone is used for hands-free or video conference calls, real-time voice communication quality is degraded by ambient noise. To improve voice quality, the ambient noise must be detected and eliminated effectively.
The prior art uses a single microphone. Although such an arrangement is simpler, both voice activity detection accuracy and noise reduction performance drop sharply when non-stationary noise occurs [1]. In theory, using multiple microphones to exploit the spatial characteristics of the sound field can improve the noise reduction capability of the system.
Beamforming [2] is the simplest effective method for enhancing speech with an array of multiple microphones. Beamforming noise reduction algorithms assume that the noise components picked up by the microphones are mutually uncorrelated; in practical applications this assumption often does not hold, so the noise suppression achieved by beamforming is limited. Post-filtering algorithms are often used to further enhance speech, but they have significant drawbacks of their own: they handle non-stationary noise poorly, and voice communication quality drops when transient disturbances occur. The number of microphones also affects the performance of beamforming noise reduction, and too many microphones greatly increase system complexity.
Another relatively common dual-microphone noise reduction method is the energy-difference-based PLD (Power Level Difference) algorithm [3]. The energy-difference approach has many advantages, such as less dependence on the accuracy of delay estimation between the two microphones and better handling of non-stationary noise. In practice, however, we find that noise reduction with a Wiener filter estimated from the energy difference often introduces musical noise, and the impact on speech quality can be unacceptable.
In recent years, with the rise of deep learning, noise reduction algorithms based on neural networks have been increasingly applied in practical systems. However, neural network algorithms are data driven, and at low signal-to-noise ratios in complex environments they often damage the speech itself. Moreover, neural networks are costly to train and computationally heavy, and deployment on terminal equipment often requires an NPU, which greatly increases hardware cost.
References
[1] Schnitta B. Speech Enhancement: Theory and Practice, Second Edition [J]. Noise-News International, 2015 (23-1).
[2] Brandstein M S, Ward D B. Microphone Arrays: Signal Processing Techniques and Applications [M]. 2001.
[3] Yousefian N, Rahmani M, Akbari A. Power level difference as a criterion for speech enhancement [C] // IEEE International Conference on Acoustics. IEEE, 2009: 4653-4656. DOI: 10.1109/ICASSP.2009.4960668.
Disclosure of Invention
To solve the above problems, the invention provides a voice activity detection and environmental noise elimination method based on double microphones, in which a main microphone and an ambient-noise acquisition microphone are suitably arranged, voice activity detection is performed using the energy ratio, and adaptive filtering is then controlled accordingly to eliminate ambient noise.
The voice activity detection and environmental noise elimination method based on the double microphones comprises the following specific steps:
step one, respectively arranging two omnidirectional microphones at the front end and the rear of a VoIP phone, and collecting signals of the two microphones when a user uses the phone;
the main microphone is arranged at the front end of the telephone, the auxiliary microphone is arranged at the rear end of the telephone, and the distance between the two microphones is 5cm;
the acquired signals are represented as follows:
$$y_i(m) = s_i(m) + n_i(m), \quad i = 1, 2$$
where $y_1(m)$ denotes the signal acquired by the main microphone and $y_2(m)$ denotes the signal acquired by the auxiliary microphone;
$s_i(m)$ denotes the speech signal acquired by the $i$-th microphone when the user uses the phone, and $n_i(m)$ denotes the ambient noise collected by the $i$-th microphone;
step two, windowing two paths of microphone signals respectively, performing fast Fourier transform, and calculating respective power spectrums;
the power spectral density of each microphone signal is calculated as follows:
$$P_{Y_i}(n,k) = \lambda P_{Y_i}(n-1,k) + (1-\lambda)\,|Y_i(n,k)|^2, \quad i = 1, 2$$
where $\lambda$ is the forgetting factor, $Y_i(n,k)$ is the frequency-domain value of the microphone signal, $P$ denotes the power spectral density, $P_{Y_i}(n,k)$ is the power spectral density of the current frame, and $P_{Y_i}(n-1,k)$ is the power spectral density of the previous frame.
$Y_i(n,k)$ is the frequency-domain value obtained by the short-time Fourier transform of the microphone signal, expressed as:
$$Y_i(n,k) = S_i(n,k) + N_i(n,k), \quad i = 1, 2$$
where $n$ is the frame index, $k$ is the frequency index, and $S_i(n,k)$ and $N_i(n,k)$ are the frequency-domain values obtained by Fourier transforming $s_i(m)$ and $n_i(m)$, respectively;
step three, taking the logarithm of the power spectra of the two microphones and subtracting them, and judging whether the result is greater than an empirical threshold ε; if so, voice activity is judged to exist and the method proceeds to step four; otherwise, no voice activity is judged to exist and the method proceeds to step five.
The expression is as follows:
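A form of the decision rule consistent with the description above is given here as an assumption; the base-10 logarithm, the dB scaling and the per-bin comparison are illustrative choices rather than the patent's exact expression:

$$10\log_{10} P_{Y_1}(n,k) - 10\log_{10} P_{Y_2}(n,k) > \varepsilon$$

Equivalently, the ratio $P_{Y_1}(n,k) / P_{Y_2}(n,k)$ is compared with $10^{\varepsilon/10}$, which is why the detection is also described as using the energy ratio.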
step four, taking the signal acquired by the auxiliary microphone as the reference signal, performing noise elimination on the signal acquired by the main microphone by using an adaptive filter to obtain an enhanced signal, and entering step six;
the formula is as follows:
$$s_E = y_1(m) - h(m) * y_2(m)$$
where $h(m)$ denotes the adaptive filter, $*$ denotes convolution, and $s_E$ denotes the enhanced signal.
step five, updating the coefficients of the adaptive filter, and returning to step four;
the update formula is as follows:
where $\mu$ denotes the adaptive filter update step size and $e(m) = y_1(m) - y_2(m)$.
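A conventional normalized LMS (NLMS) update, matching the NLMS filter named in the embodiment, is sketched below as an assumption rather than as the patent's exact formula. The tap vector $\mathbf{y}_2(m)$, the filter length $L$ and the regularization constant $\delta$ are notation introduced here for illustration:

$$\mathbf{h}(m+1) = \mathbf{h}(m) + \mu\,\frac{e(m)\,\mathbf{y}_2(m)}{\mathbf{y}_2^{T}(m)\,\mathbf{y}_2(m) + \delta}, \qquad \mathbf{y}_2(m) = [\,y_2(m),\, y_2(m-1),\, \dots,\, y_2(m-L+1)\,]^{T}$$

In the textbook form of NLMS the error is computed with the filtered reference, $e(m) = y_1(m) - \mathbf{h}^{T}(m)\,\mathbf{y}_2(m)$, so that it equals the enhanced sample produced in step four.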
step six, encoding the enhanced signal and sending it through the RTSP protocol; the user can adjust the audio settings through real-time feedback to achieve the best voice communication effect.
The invention has the advantages that:
1) The method effectively improves the performance of voice activity detection and noise elimination by using a dual-microphone configuration.
2) Because the signal processing is performed by adaptive filtering, the method is simpler to implement, lower in complexity and better in real-time performance than deep-learning-based methods, and it can run on more and cheaper chips, so its range of application is wider.
3) The method is very effective at eliminating annoying ambient noise when no one is speaking during a voice call.
4) When a talker speaks in a stationary noise environment, the method also suppresses ambient noise to a certain extent.
Drawings
FIG. 1 is a schematic diagram of a dual microphone based voice activity detection and ambient noise cancellation method of the present invention;
FIG. 2 is a flow chart of a dual microphone based voice activity detection and ambient noise cancellation method of the present invention;
Detailed Description
The present invention is described in further detail below with reference to the accompanying drawings, so that those of ordinary skill in the art can understand and practice it.
The invention relates to a voice activity detection and environmental noise elimination method based on double microphones, in which voice activity detection is performed using the energy ratio of the two microphones, and the adaptive filtering strategy is then controlled according to the detection result to eliminate ambient noise. The principle is shown in FIG. 1: two omnidirectional microphones are arranged at the front and rear of the phone respectively; when a user uses the phone, the two microphone signals are each windowed and transformed by fast Fourier transform, and the logarithms of their power spectra are calculated and subtracted to judge whether voice activity exists. If voice activity exists, updating of the adaptive filter coefficients is stopped, and the current adaptive filter coefficients and the noise power spectrum are used to eliminate ambient noise and obtain an enhanced signal; when there is no voice activity, the adaptive filter coefficients are updated normally and ambient noise is eliminated.
The voice activity detection and environmental noise elimination method based on the double microphones is shown in fig. 2, and specifically comprises the following steps:
step one, respectively arranging two omnidirectional microphones at the front end and the rear of a VoIP phone, and collecting signals of the two microphones when a user uses the phone;
the main microphone is arranged at the front end of the telephone, the auxiliary microphone is arranged at the rear end of the telephone, and the distance between the two microphones is 5cm;
the acquired signals are represented as follows:
$$y_i(m) = s_i(m) + n_i(m), \quad i = 1, 2$$
where $y_1(m)$ denotes the signal acquired by the main microphone and $y_2(m)$ denotes the signal acquired by the auxiliary microphone;
$s_i(m)$ denotes the speech signal acquired by the $i$-th microphone when the user uses the phone, and $n_i(m)$ denotes the ambient noise collected by the $i$-th microphone;
step two, windowing the two microphone signals respectively, performing fast Fourier transform, and calculating their respective power spectra;
that is, after a short-time Fourier transform of the time-domain signal, the microphone signal is represented in the frequency domain as:
$$Y_i(n,k) = S_i(n,k) + N_i(n,k), \quad i = 1, 2$$
where $n$ is the frame index, $k$ is the frequency index, and $S_i(n,k)$ and $N_i(n,k)$ are the frequency-domain values obtained by Fourier transforming $s_i(m)$ and $n_i(m)$, respectively;
assuming that the speech signal and the noise signal are uncorrelated with each other, the power spectral density of the microphone signal can be calculated as follows:
$$P_{Y_i}(n,k) = \lambda P_{Y_i}(n-1,k) + (1-\lambda)\,|Y_i(n,k)|^2, \quad i = 1, 2$$
where $\lambda$ is the forgetting factor, $Y_i(n,k)$ is the frequency-domain value of the microphone signal, $P$ denotes the power spectral density, $P_{Y_i}(n,k)$ is the power spectral density of the current frame, and $P_{Y_i}(n-1,k)$ is the power spectral density of the previous frame.
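Step two can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patent's implementation: the Hann window, the 64-sample frame, the value λ = 0.9 and the helper names `hann`, `dft` and `update_psd` are all hypothetical, and a naive DFT stands in for the FFT so that the example needs only the standard library.

```python
import cmath
import math


def hann(n_len):
    """Periodic Hann window of length n_len (window choice is an assumption)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n_len) for i in range(n_len)]


def dft(frame):
    """Naive O(N^2) DFT returning bins 0..N/2; a stand-in for an FFT routine."""
    n_len = len(frame)
    return [
        sum(frame[m] * cmath.exp(-2j * math.pi * k * m / n_len) for m in range(n_len))
        for k in range(n_len // 2 + 1)
    ]


def update_psd(prev_psd, frame, lam=0.9):
    """One recursion step: P(n,k) = lam * P(n-1,k) + (1 - lam) * |Y(n,k)|^2."""
    window = hann(len(frame))
    spectrum = dft([w * x for w, x in zip(window, frame)])
    return [lam * p + (1 - lam) * abs(y) ** 2 for p, y in zip(prev_psd, spectrum)]


# Usage: a pure tone centred on bin 8 of a 64-sample frame.
frame = [math.sin(2 * math.pi * 8 * i / 64) for i in range(64)]
psd = update_psd([0.0] * 33, frame)
peak_bin = max(range(len(psd)), key=lambda k: psd[k])  # peak_bin == 8 for this tone
```

The same `update_psd` call would be made once per frame for each of the two microphones, each with its own running estimate.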
Step three, taking the logarithm of the power spectra of the two microphones and subtracting them, and judging whether the result is greater than an empirical threshold ε; if so, voice activity is judged to exist and the method proceeds to step four; otherwise, no voice activity is judged to exist and the method proceeds to step five.
The formula is as follows:
Because the speaker is closer to the main microphone and the phone body shields the auxiliary microphone, the energy difference between the two microphones is typically 3 to 10 dB when the speaker is talking.
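The decision in step three can be sketched as a small Python function. Several details here are assumptions rather than the patent's stated formula: the logarithm is taken as 10·log10 (dB), the per-bin differences are averaged into one value per frame, the default threshold of 3 dB is simply the low end of the 3 to 10 dB range quoted above, and the name `voice_active` and the `floor` guard are hypothetical.

```python
import math


def voice_active(psd_main, psd_aux, eps_db=3.0, floor=1e-12):
    """Return True when the mean log-power difference (in dB) exceeds eps_db.

    psd_main and psd_aux are per-bin power spectral densities of the main
    and auxiliary microphones for the current frame. Values are clamped to
    `floor` so that the logarithm is always defined.
    """
    diffs = [
        10.0 * math.log10(max(p1, floor)) - 10.0 * math.log10(max(p2, floor))
        for p1, p2 in zip(psd_main, psd_aux)
    ]
    return sum(diffs) / len(diffs) > eps_db


# Usage: main microphone 4x stronger (about 6 dB) versus equal levels.
speaking = voice_active([4.0] * 8, [1.0] * 8)   # difference of about 6 dB
silent = voice_active([1.0] * 8, [1.0] * 8)     # difference of 0 dB
```

In a real device ε would be tuned empirically for the phone's geometry, as the text indicates.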
Step four, taking the signal acquired by the auxiliary microphone as the reference signal, performing noise elimination on the signal acquired by the main microphone with the adaptive filter to obtain an enhanced signal, and proceeding to step six;
to keep the adaptive filter from diverging when voice activity is present, the previous coefficients are used to filter directly.
The formula is as follows:
$$s_E = y_1(m) - h(m) * y_2(m)$$
where $h(m)$ denotes the adaptive filter, $*$ denotes convolution, and $s_E$ denotes the enhanced signal.
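The filtering in step four, with the coefficients held fixed, amounts to subtracting a convolution of the reference signal from the main signal. The sketch below assumes a causal FIR filter stored as a plain list `h` and treats samples before the start of the buffer as zero; the function name `cancel_noise` is hypothetical.

```python
def cancel_noise(y_main, y_aux, h):
    """Compute s_E(m) = y1(m) - sum_j h[j] * y2(m - j) with frozen coefficients.

    Reference samples with a negative index are treated as zero.
    """
    enhanced = []
    for m in range(len(y_main)):
        estimate = sum(h[j] * y_aux[m - j] for j in range(len(h)) if m - j >= 0)
        enhanced.append(y_main[m] - estimate)
    return enhanced


# Usage: the main signal is exactly the filtered reference, so nothing remains.
y_aux = [1.0, 2.0, 3.0, 4.0]
h = [0.5, 0.25]
y_main = [0.5, 1.25, 2.0, 2.75]
residual = cancel_noise(y_main, y_aux, h)
```

When the main signal also contains speech that the reference does not, that speech is what survives in the output.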
Step five, updating the coefficients of the adaptive filter, and returning to step four;
the formula is as follows:
where $\mu$ denotes the adaptive filter update step size and $e(m) = y_1(m) - y_2(m)$.
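A single coefficient update for step five might look like the following. It is a sketch of a textbook NLMS iteration, chosen because the embodiment names an NLMS filter; it is not claimed to be the patent's exact update. The error is computed against the filtered reference, which is the conventional NLMS definition, and the function name `nlms_step`, the default step size and the regularization constant `delta` are assumptions.

```python
def nlms_step(h, ref_taps, desired, mu=0.5, delta=1e-8):
    """One NLMS iteration.

    h        : current filter coefficients
    ref_taps : ref_taps[j] holds y2(m - j), newest sample first
    desired  : y1(m), the main-microphone sample
    Returns the updated coefficients and the a priori error.
    """
    estimate = sum(hj * xj for hj, xj in zip(h, ref_taps))
    error = desired - estimate
    norm = sum(x * x for x in ref_taps) + delta  # input energy plus regularization
    new_h = [hj + mu * error * xj / norm for hj, xj in zip(h, ref_taps)]
    return new_h, error


# Usage: one update from a zero filter moves the first tap toward the target.
h_next, err = nlms_step([0.0, 0.0], [1.0, 0.0], desired=1.0)
```

Normalizing by the reference energy keeps the effective step size independent of the noise level, which is the usual reason for preferring NLMS over plain LMS.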
Step six, encoding the enhanced signal and sending it through the RTSP protocol; the user can adjust the audio settings through real-time feedback to achieve the best voice communication effect.
Examples:
First, the two microphones are configured: two omnidirectional microphones spaced 5 cm apart are arranged at the front end and the rear end of the SIP phone respectively. Assuming that the ambient noise is additive and uncorrelated with the speech of the person using the phone, the signals collected by the main microphone at the front end of the phone and the auxiliary microphone at the rear end can be expressed as:
$$y_i(m) = s_i(m) + n_i(m), \quad i = 1, 2 \tag{1}$$
The purpose of ambient noise cancellation is to remove the ambient noise component $n_1$ from $y_1$. Because the ambient noise is isotropic, that is,
$$n_1 \approx n_2 \tag{2}$$
when the speaker is not speaking,
$$y_1 \approx y_2 \tag{3}$$
When the speaker talks, the main microphone is closer to the speaker, while the auxiliary microphone is both farther away and shielded by the phone body, so that:
$$P_1 > P_2 \tag{4}$$
where $P_1$ denotes the power spectral density of the signal acquired by the main microphone and $P_2$ denotes the power spectral density of the signal acquired by the auxiliary microphone. Based on this principle, the following voice activity detection method is designed:
first, the two microphone signals are each windowed and transformed by fast Fourier transform, their respective power spectra are calculated, and the logarithms of the two power spectra are subtracted; if the result is greater than ε (an empirical threshold), voice activity is judged to exist; otherwise, no voice activity is judged to exist.
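The premise behind equations (2) to (4) can be checked with a small simulation. Everything numeric below is an assumption made for illustration, not measured data: the noise is modelled as identical at both microphones, the speech is a 200 Hz tone sampled at 8 kHz, and the rear microphone is given a hypothetical attenuation factor `aux_gain` of 0.4. The level difference is computed on time-domain frame power for brevity.

```python
import math
import random


def level_difference_db(y_main, y_aux, floor=1e-12):
    """Mean-power difference between the two channels in dB."""
    p1 = sum(v * v for v in y_main) / len(y_main)
    p2 = sum(v * v for v in y_aux) / len(y_aux)
    return 10.0 * math.log10(max(p1, floor) / max(p2, floor))


rng = random.Random(0)
num = 1600
noise = [rng.gauss(0.0, 0.1) for _ in range(num)]        # n1 = n2 (idealized)
speech = [math.sin(2 * math.pi * 200 * m / 8000) for m in range(num)]
aux_gain = 0.4                                           # hypothetical rear-mic attenuation

# Noise only: both channels carry the same signal, so the difference is 0 dB.
silent_diff = level_difference_db(noise, noise)

# Speech present: the main microphone receives the stronger speech component.
talk_diff = level_difference_db(
    [s + v for s, v in zip(speech, noise)],
    [aux_gain * s + v for s, v in zip(speech, noise)],
)
```

With these made-up parameters `talk_diff` lands inside the 3 to 10 dB range mentioned in the description, while `silent_diff` stays at zero, which is the separation the threshold ε relies on.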
Then, dual-microphone ambient noise cancellation is performed: the signal from the ambient-noise acquisition microphone is taken as the reference signal, and adaptive-filter noise elimination is applied to the main microphone signal. In particular, updating of the adaptive filter is controlled by the voice activity detection result: when voice activity is detected, updating is stopped to prevent the filter from diverging and damaging the speech signal. The formula is as follows:
$$s_E = y_1(m) - h(m) * y_2(m) \tag{5}$$
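The control logic described above, adapting during noise-only frames and freezing the filter during speech, can be sketched per frame as follows. This is an illustrative sketch under assumptions: the voice activity flag is supplied by the caller, the update is a textbook NLMS step, and the names `nlms_step` and `process_frame`, the step size and the tap count are hypothetical.

```python
def nlms_step(h, taps, desired, mu, delta=1e-8):
    """Textbook NLMS update; returns new coefficients and the a priori error."""
    error = desired - sum(hj * xj for hj, xj in zip(h, taps))
    norm = sum(x * x for x in taps) + delta
    return [hj + mu * error * xj / norm for hj, xj in zip(h, taps)], error


def process_frame(h, taps, y_main, y_aux, speech_present, mu=0.5):
    """Filter one frame, adapting h only when no speech is present.

    taps carries the most recent reference samples (newest first) across
    frames. Returns the coefficients, the updated taps and the enhanced frame.
    """
    enhanced = []
    for desired, ref in zip(y_main, y_aux):
        taps = [ref] + taps[:-1]  # shift in the newest reference sample
        if speech_present:
            # Frozen filter: only subtract the noise estimate.
            error = desired - sum(hj * xj for hj, xj in zip(h, taps))
        else:
            # Noise only: subtract and adapt.
            h, error = nlms_step(h, taps, desired, mu)
        enhanced.append(error)
    return h, taps, enhanced


# Usage: start from a zero 4-tap filter and process one noise-only frame.
h, taps = [0.0] * 4, [0.0] * 4
h, taps, out = process_frame(h, taps, [0.08, -0.04], [0.1, -0.05], speech_present=False)
```

Freezing the coefficients while speech is present prevents the filter from treating the talker's voice as noise to be cancelled, which is the divergence the text warns about.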
Finally, the adaptive filter is designed: this embodiment implements an NLMS filter and uses a block acceleration method. The adaptive filter update formula is as follows:
The noise-cancelled signal is then output, and the user can adjust the audio settings through real-time feedback to achieve the best voice communication effect.
In the voice activity detection and environmental noise elimination method based on double microphones, ambient noise elimination is performed on a VoIP phone by suitably arranging the two microphones. The user of a VoIP phone supporting RTSP turns on the dual-microphone ambient noise elimination function through a key setting; after the user dials a call, the voice RTP stream is established once the DSP module has removed the noise;
the VoIP phone comprises a user input module, a call control module, an RTSP protocol control module, a DSP module and a UI module.
The user performs a key operation through the user input module of the VoIP phone to turn on the dual-microphone ambient noise suppression function, and a VoIP call request is constructed according to the setting information. The audio acquisition module acquires data from the two microphones and sends it to the DSP module, which performs the signal processing operations: windowing, fast Fourier transform, power spectrum calculation, logarithm subtraction, and selection of the adaptive filtering strategy according to the voice activity result. If no voice activity is judged to exist, the adaptive filter coefficients are updated; if voice activity exists, updating of the adaptive filter coefficients is stopped and the previous coefficients are used for filtering. The noise-reduced audio data is encoded according to the settings and then transmitted by the RTSP protocol control module.
Claims (3)
1. A voice activity detection and environmental noise elimination method based on double microphones is characterized by comprising the following specific steps:
step one, respectively arranging two omnidirectional microphones at the front end and the rear of a VoIP phone, and collecting signals of the two microphones when a user uses the phone;
the acquired signals are represented as follows:
$$y_i(m) = s_i(m) + n_i(m), \quad i = 1, 2$$
where $y_1(m)$ denotes the signal acquired by the main microphone and $y_2(m)$ denotes the signal acquired by the auxiliary microphone;
$s_i(m)$ denotes the speech signal acquired by the $i$-th microphone when the user uses the phone, and $n_i(m)$ denotes the ambient noise collected by the $i$-th microphone;
step two, windowing two paths of microphone signals respectively, performing fast Fourier transform, and calculating respective power spectrums;
the power spectral density of the microphone signal is calculated as follows:
$$P_{Y_i}(n,k) = \lambda P_{Y_i}(n-1,k) + (1-\lambda)\,|Y_i(n,k)|^2, \quad i = 1, 2$$
where $\lambda$ is the forgetting factor, $Y_i(n,k)$ is the frequency-domain value of the microphone signal, $P$ denotes the power spectral density, $P_{Y_i}(n,k)$ is the power spectral density of the current frame, and $P_{Y_i}(n-1,k)$ is the power spectral density of the previous frame;
step three, taking the logarithm of the power spectra of the two microphones and subtracting them, and judging whether the result is greater than an empirical threshold ε; if so, voice activity is judged to exist and the method proceeds to step four; otherwise, no voice activity is judged to exist and the method proceeds to step five;
the expression is as follows:
step four, taking the signal acquired by the auxiliary microphone as the reference signal, performing noise elimination on the signal acquired by the main microphone by using an adaptive filter to obtain an enhanced signal, and entering step six;
the formula is as follows:
$$s_E = y_1(m) - h(m) * y_2(m)$$
where $h(m)$ denotes the adaptive filter, $*$ denotes convolution, and $s_E$ denotes the enhanced signal;
step five, updating the coefficients of the adaptive filter, and returning to step four;
the update formula is as follows:
where $\mu$ denotes the adaptive filter update step size and $e(m) = y_1(m) - y_2(m)$;
step six, encoding the enhanced signal and sending it through the RTSP protocol, the user adjusting the audio settings through real-time feedback so as to achieve the best voice communication effect.
2. The method for detecting voice activity and eliminating environmental noise based on two microphones as defined in claim 1, wherein in the first step, the main microphone is disposed at the front end of the phone, the auxiliary microphone is disposed at the rear end of the phone, and the distance between the two microphones is 5cm.
3. The voice activity detection and environmental noise elimination method based on double microphones as recited in claim 1, characterized in that in step two, the frequency-domain value $Y_i(n,k)$ is calculated as:
$$Y_i(n,k) = S_i(n,k) + N_i(n,k), \quad i = 1, 2$$
where $n$ is the frame index, $k$ is the frequency index, and $S_i(n,k)$ and $N_i(n,k)$ are the frequency-domain values obtained by Fourier transforming $s_i(m)$ and $n_i(m)$, respectively.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311282052.8A CN117238306A (en) | 2023-09-28 | 2023-09-28 | Voice activity detection and ambient noise elimination method based on double microphones |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311282052.8A CN117238306A (en) | 2023-09-28 | 2023-09-28 | Voice activity detection and ambient noise elimination method based on double microphones |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117238306A (en) | 2023-12-15 |
Family
ID=89098170
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311282052.8A Pending CN117238306A (en) | 2023-09-28 | 2023-09-28 | Voice activity detection and ambient noise elimination method based on double microphones |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117238306A (en) |
-
2023
- 2023-09-28 CN CN202311282052.8A patent/CN117238306A/en active Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110838300B (en) | Echo cancellation processing method and processing system | |
US10403299B2 (en) | Multi-channel speech signal enhancement for robust voice trigger detection and automatic speech recognition | |
US9443532B2 (en) | Noise reduction using direction-of-arrival information | |
US9589556B2 (en) | Energy adjustment of acoustic echo replica signal for speech enhancement | |
US8175871B2 (en) | Apparatus and method of noise and echo reduction in multiple microphone audio systems | |
CN108376548B (en) | Echo cancellation method and system based on microphone array | |
CN105825864B (en) | Both-end based on zero-crossing rate index is spoken detection and echo cancel method | |
KR101469739B1 (en) | A device for and a method of processing audio signals | |
US9699554B1 (en) | Adaptive signal equalization | |
CN106713570B (en) | Echo cancellation method and device | |
WO2009130513A1 (en) | Two microphone noise reduction system | |
CN108447496B (en) | Speech enhancement method and device based on microphone array | |
CN109273019B (en) | Method for double-talk detection for echo suppression and echo suppression | |
CN104883462B (en) | A kind of sef-adapting filter and filtering method for eliminating acoustic echo | |
US9313573B2 (en) | Method and device for microphone selection | |
CN106448691B (en) | Voice enhancement method for public address communication system | |
KR20130108063A (en) | Multi-microphone robust noise suppression | |
WO2011129725A1 (en) | Method and arrangement for noise cancellation in a speech encoder | |
US9508359B2 (en) | Acoustic echo preprocessing for speech enhancement | |
US20180308503A1 (en) | Real-time single-channel speech enhancement in noisy and time-varying environments | |
US9020144B1 (en) | Cross-domain processing for noise and echo suppression | |
CN110956975A (en) | Echo cancellation method and device | |
US10129410B2 (en) | Echo canceller device and echo cancel method | |
TWI465121B (en) | System and method for utilizing omni-directional microphones for speech enhancement | |
CN112929506B (en) | Audio signal processing method and device, computer storage medium and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||