CN117238306A - Voice activity detection and ambient noise elimination method based on double microphones - Google Patents

Info

Publication number
CN117238306A
Authority
CN
China
Prior art keywords
microphone
microphones
signal
voice activity
phone
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311282052.8A
Other languages
Chinese (zh)
Inventor
刘建兵
冯波
李鸿鹏
高峰
商易
刘永辉
朱海波
姜瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Zhilian Technology Co ltd
Original Assignee
Shenzhen Zhilian Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Zhilian Technology Co ltd
Priority to CN202311282052.8A
Publication of CN117238306A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/70: Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses a dual-microphone-based voice activity detection and environmental noise elimination method, belonging to the field of voice signal processing for VoIP terminals. Specifically, two omnidirectional microphones are arranged at the front and rear of a VoIP phone respectively. While a user uses the phone, the two signals are collected, windowed, and fast Fourier transformed, and their respective power spectra are calculated. The logarithms of the two power spectra are then subtracted, and the result is compared with an empirical threshold ε. If the result is greater than ε, voice activity is judged to exist; the auxiliary signal is taken as the reference signal, and an adaptive filter performs noise elimination on the main signal to obtain an enhanced signal. Otherwise, no voice activity exists; the coefficients of the adaptive filter are updated and noise elimination is performed again. Finally, the enhanced signal is encoded and sent through the RTSP protocol, and the user can adjust audio settings through real-time feedback to achieve the best voice communication effect. The invention greatly reduces hardware cost while still meeting a given level of performance.

Description

Voice activity detection and ambient noise elimination method based on double microphones
Technical Field
The invention belongs to the field of voice signal processing for VoIP (Voice over Internet Protocol) terminals, and particularly relates to a dual-microphone-based voice activity detection and environmental noise elimination method.
Background
When a VoIP phone is used for hands-free calls or video conferences, real-time voice communication quality is degraded by environmental noise. To improve voice quality, environmental noise must be detected and eliminated effectively.
The prior art uses a single microphone. Although this arrangement is easier to deploy, both voice activity detection accuracy and noise reduction performance drop sharply when non-stationary noise occurs [1]. In theory, using multiple microphones to exploit the spatial characteristics of the sound field can improve the noise reduction capability of the system.
Beamforming [2], in which several microphones form an array, is the simplest effective multi-microphone method for speech enhancement. Beamforming noise reduction algorithms assume that the noise components picked up by the individual microphones are uncorrelated with each other; in practical applications this assumption often does not hold, so the noise suppression achieved by beamforming is limited. Post-filtering algorithms are often used to enhance the speech further, but they have significant drawbacks of their own: they handle non-stationary noise poorly, and speech quality drops when transient disturbances occur. The number of microphones also affects the performance of beamforming noise reduction, and too many microphones greatly increase system complexity.
Another relatively common dual-microphone noise reduction method is the energy-difference-based PLD (Power Level Difference) algorithm [3]. Energy-difference methods have several advantages, such as less dependence on accurate estimation of the delay between the two microphones and better handling of non-stationary noise. In practice, however, we find that a Wiener filter estimated from the energy difference often introduces musical noise, and the resulting impact on speech quality can be unacceptable.
In recent years, with the rise of deep learning, neural-network-based noise reduction algorithms have been increasingly applied in practical systems. However, neural network algorithms are data driven, and in complex environments with a low signal-to-noise ratio they often damage the speech itself. Moreover, neural networks are expensive to train and computationally heavy, and deployment on terminal equipment often requires an NPU, which greatly increases hardware cost.
References
[1] Schnitta B. Speech Enhancement: Theory and Practice, Second Edition [J]. Noise-News International, 2015 (23-1).
[2] Brandstein M S, Ward D B. Microphone Arrays: Signal Processing Techniques and Applications [M]. 2001.
[3] Yousefian N, Rahmani M, Akbari A. Power level difference as a criterion for speech enhancement [C] // IEEE International Conference on Acoustics. IEEE, 2009: 4653-4656. DOI: 10.1109/ICASSP.2009.4960668.
Disclosure of Invention
In order to solve the problems, the invention provides a voice activity detection and environmental noise elimination method based on double microphones, which is characterized in that a main microphone and an environmental noise acquisition microphone are reasonably arranged, voice activity detection is performed by using an energy ratio, and then self-adaptive filtering is controlled to eliminate environmental noise.
The dual-microphone-based voice activity detection and environmental noise elimination method comprises the following specific steps:
Step one, arranging two omnidirectional microphones at the front end and the rear end of a VoIP phone respectively, and collecting the signals of the two microphones while a user uses the phone;
the main microphone is arranged at the front end of the phone, the auxiliary microphone is arranged at the rear end of the phone, and the distance between the two microphones is 5 cm;
the acquired signals are represented as follows:
$y_i(m) = s_i(m) + n_i(m), \quad i = 1, 2$
where $y_1(m)$ represents the signal acquired by the main microphone and $y_2(m)$ represents the signal acquired by the auxiliary microphone;
$s_i(m)$ represents the sound signal acquired by the $i$-th microphone while the user uses the phone, and $n_i(m)$ represents the environmental noise collected by the $i$-th microphone;
Step two, windowing the two microphone signals respectively, performing a fast Fourier transform, and calculating their respective power spectra;
the power spectral density of each microphone signal is calculated as follows:
$P_{Y_i}(n,k) = \lambda P_{Y_i}(n-1,k) + (1-\lambda)\,|Y_i(n,k)|^2, \quad i = 1, 2$
where $\lambda$ is the forgetting factor, $Y_i(n,k)$ is the frequency-domain value of the microphone signal, $P$ denotes power spectral density, $P_{Y_i}(n,k)$ is the power spectral density of the current frame, and $P_{Y_i}(n-1,k)$ is the power spectral density of the previous frame.
$Y_i(n,k)$ is the frequency-domain value obtained by a short-time Fourier transform of the microphone signal, expressed as:
$Y_i(n,k) = S_i(n,k) + N_i(n,k), \quad i = 1, 2$
where $n$ is the frame index, $k$ is the frequency index, and $S_i(n,k)$ and $N_i(n,k)$ are the frequency-domain values obtained by Fourier transforming $s_i(m)$ and $n_i(m)$ respectively;
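The recursive power spectrum estimate of step two can be sketched as follows. This is a minimal illustration, not the patented implementation: the Hann window, the naive DFT (standing in for an FFT), the value of the forgetting factor, and the function names `hann`, `dft`, and `update_psd` are all assumptions, since the source does not specify them.

```python
import cmath
import math


def hann(n_samples):
    """Periodic Hann window of length n_samples (window type is an assumption)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n_samples) for i in range(n_samples)]


def dft(frame):
    """Naive O(N^2) DFT; a real implementation would call an FFT routine."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def update_psd(prev_psd, frame, lam=0.9):
    """One recursive update: P(n,k) = lam * P(n-1,k) + (1 - lam) * |Y(n,k)|^2."""
    window = hann(len(frame))
    spectrum = dft([w * x for w, x in zip(window, frame)])
    return [lam * p + (1.0 - lam) * abs(y) ** 2 for p, y in zip(prev_psd, spectrum)]
```

The function would be called once per frame and per microphone, feeding the previous result back in as `prev_psd`. A larger forgetting factor gives a smoother but slower-reacting estimate.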
Step three, taking the logarithm of the power spectra of the two microphones and subtracting them, and judging whether the result is greater than an empirical threshold ε; if so, voice activity is judged to exist and the method proceeds to step four; otherwise, no voice activity is judged to exist and the method proceeds to step five.
The expression is as follows:
$\log P_{Y_1}(n,k) - \log P_{Y_2}(n,k) > \varepsilon$
Step four, taking the signal acquired by the auxiliary microphone as the reference signal, performing noise elimination on the signal acquired by the main microphone using an adaptive filter to obtain an enhanced signal, and proceeding to step six;
the formula is as follows:
$s_E = y_1(m) - h(m) * y_2(m)$
where $h(m)$ represents the adaptive filter, $*$ represents convolution, and $s_E$ represents the enhanced signal.
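The filtering operation of step four can be written out directly. The sketch below assumes that the adaptive filter is a causal FIR filter whose taps are held in a list `h`, and that samples before the start of the signal are zero; the function name `cancel_noise` is hypothetical.

```python
def cancel_noise(y1, y2, h):
    """Return s_E(m) = y1(m) - sum_j h[j] * y2(m - j) for every sample m.

    y1 : main-microphone samples
    y2 : auxiliary (reference) microphone samples
    h  : FIR filter taps (assumed structure of the adaptive filter)
    """
    enhanced = []
    for m in range(len(y1)):
        estimate = 0.0
        for j, tap in enumerate(h):
            if m - j >= 0:  # samples before m = 0 are treated as zero
                estimate += tap * y2[m - j]
        enhanced.append(y1[m] - estimate)
    return enhanced
```

If the main signal were exactly the reference signal filtered by `h`, the output would be zero everywhere. In practice the residual is the speech plus whatever noise the filter fails to model.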
Step five, updating the coefficients of the adaptive filter, and returning to step four;
the update formula is as follows:
where $\mu$ represents the adaptive filter update step size and $e(m) = y_1(m) - y_2(m)$.
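The update formula itself is not reproduced in the text above. As an illustration only, the following sketch shows one step of a conventional normalized LMS (NLMS) update, the filter family named later in the embodiment. It uses the usual a priori error, the main sample minus the filtered reference, which is an assumption and coincides with the $e(m)$ given above only for a single unit tap. The name `nlms_step` and the regularization constant `eps` are also assumptions.

```python
def nlms_step(h, x_recent, desired, mu=0.5, eps=1e-8):
    """One conventional NLMS coefficient update (illustrative assumption).

    h        : current filter taps
    x_recent : most recent reference samples, newest first, same length as h
    desired  : current main-microphone sample
    mu       : update step size
    eps      : small constant that avoids division by zero
    Returns (new_taps, error).
    """
    estimate = sum(tap * x for tap, x in zip(h, x_recent))
    error = desired - estimate
    norm = sum(x * x for x in x_recent) + eps
    new_h = [tap + mu * error * x / norm for tap, x in zip(h, x_recent)]
    return new_h, error
```

Dividing by the reference energy makes the effective step size independent of the input level, which is the usual reason for preferring NLMS over plain LMS on speech and noise signals.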
Step six, encoding the enhanced signal and sending it through the RTSP protocol; the user can adjust audio settings through real-time feedback to achieve the best voice communication effect.
The advantages of the invention are:
1) The dual-microphone configuration effectively improves the performance of voice activity detection and noise elimination.
2) Because signal processing is performed by adaptive filtering, the method is simpler to implement, lower in complexity, and better in real-time performance than deep-learning-based methods, and it can be implemented on a wider range of cheaper chips, which broadens its range of application.
3) The method is very effective at eliminating annoying environmental noise when no one is speaking during a voice call.
4) The method also suppresses environmental noise to a certain extent when a talker speaks in a stationary noise environment.
Drawings
FIG. 1 is a schematic diagram of a dual microphone based voice activity detection and ambient noise cancellation method of the present invention;
FIG. 2 is a flow chart of a dual microphone based voice activity detection and ambient noise cancellation method of the present invention;
Detailed Description
The present invention is described in further detail below with reference to the accompanying drawings, so that those of ordinary skill in the art can understand and practice it.
The invention relates to a dual-microphone-based voice activity detection and environmental noise elimination method, in which voice activity detection is performed using the energy ratio of the two microphones, and the adaptive filtering strategy is then controlled according to the voice activity detection result to eliminate environmental noise. The principle is shown in FIG. 1: two omnidirectional microphones are arranged at the front and rear of a phone respectively; while a user uses the phone, the two microphone signals are windowed and fast Fourier transformed, and the logarithms of their power spectra are calculated and subtracted to judge whether voice activity exists. If voice activity exists, updating of the adaptive filter coefficients is stopped, and the current adaptive filter coefficients and the noise power spectrum are used to eliminate environmental noise and obtain an enhanced signal; when there is no voice activity, the adaptive filter coefficients are updated normally and environmental noise is eliminated.
The dual-microphone-based voice activity detection and environmental noise elimination method is shown in FIG. 2 and specifically comprises the following steps:
Step one, arranging two omnidirectional microphones at the front end and the rear end of a VoIP phone respectively, and collecting the signals of the two microphones while a user uses the phone;
the main microphone is arranged at the front end of the phone, the auxiliary microphone is arranged at the rear end of the phone, and the distance between the two microphones is 5 cm;
the acquired signals are represented as follows:
$y_i(m) = s_i(m) + n_i(m), \quad i = 1, 2$
where $y_1(m)$ represents the signal acquired by the main microphone and $y_2(m)$ represents the signal acquired by the auxiliary microphone;
$s_i(m)$ represents the sound signal acquired by the $i$-th microphone while the user uses the phone, and $n_i(m)$ represents the environmental noise collected by the $i$-th microphone;
Step two, windowing the two microphone signals respectively, performing a fast Fourier transform, and calculating their respective power spectra;
that is, a short-time Fourier transform is applied to the time-domain signal, and the microphone signal is represented in the frequency domain as:
$Y_i(n,k) = S_i(n,k) + N_i(n,k), \quad i = 1, 2$
where $n$ is the frame index, $k$ is the frequency index, and $S_i(n,k)$ and $N_i(n,k)$ are the frequency-domain values obtained by Fourier transforming $s_i(m)$ and $n_i(m)$ respectively;
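The frame index $n$ implies that each microphone signal is cut into short, usually overlapping, frames before the transform. A minimal sketch of that framing follows; the frame length of 256 samples, the hop of 128 samples, and the name `split_frames` are assumptions, since the source does not state these parameters.

```python
def split_frames(signal, frame_len=256, hop=128):
    """Cut a signal into overlapping frames of frame_len samples, hop samples apart.

    Trailing samples that do not fill a complete frame are dropped.
    """
    return [
        signal[start:start + frame_len]
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]
```

Each returned frame would then be windowed and transformed, so that frame $n$ yields the values $Y_i(n,k)$ for all frequency indices $k$.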
assuming that the speech signal and the noise signal are uncorrelated with each other, the power spectral density of the microphone signal can be calculated as follows:
$P_{Y_i}(n,k) = \lambda P_{Y_i}(n-1,k) + (1-\lambda)\,|Y_i(n,k)|^2, \quad i = 1, 2$
where $\lambda$ is the forgetting factor, $Y_i(n,k)$ is the frequency-domain value of the microphone signal, $P$ denotes power spectral density, $P_{Y_i}(n,k)$ is the power spectral density of the current frame, and $P_{Y_i}(n-1,k)$ is the power spectral density of the previous frame.
Step three, taking the logarithm of the power spectra of the two microphones and subtracting them, and judging whether the result is greater than an empirical threshold ε; if so, voice activity is judged to exist and the method proceeds to step four; otherwise, no voice activity is judged to exist and the method proceeds to step five.
The formula is as follows:
$\log P_{Y_1}(n,k) - \log P_{Y_2}(n,k) > \varepsilon$
Because the speaker is closer to the main microphone and the auxiliary microphone is shielded, the energy difference between the two microphones is typically 3 to 10 dB when the speaker speaks.
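The decision of step three can be sketched in decibels, which makes the 3 to 10 dB range directly usable as a threshold. Several choices here are assumptions rather than statements from the source: averaging the per-bin difference over the whole frame, the default threshold of 3 dB (the lower end of the quoted range), the numerical floor, and the function names.

```python
import math


def log_power_difference_db(psd_main, psd_aux, floor=1e-12):
    """Frame-averaged 10*log10(P1) - 10*log10(P2), in dB.

    The floor guards against log10(0); averaging over bins is an assumption.
    """
    diffs = [
        10.0 * math.log10(max(p1, floor)) - 10.0 * math.log10(max(p2, floor))
        for p1, p2 in zip(psd_main, psd_aux)
    ]
    return sum(diffs) / len(diffs)


def voice_active(psd_main, psd_aux, threshold_db=3.0):
    """Return True when the main microphone is louder than the auxiliary one by more than the threshold."""
    return log_power_difference_db(psd_main, psd_aux) > threshold_db
```

For example, a main-microphone power four times that of the auxiliary microphone is a difference of about 6 dB and would be classified as speech under this default, while equal powers give 0 dB and would not.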
Step four, taking the signal acquired by the auxiliary microphone as the reference signal, performing noise elimination on the signal acquired by the main microphone using an adaptive filter to obtain an enhanced signal, and proceeding to step six;
to avoid divergence while voice activity is present, the adaptive filter is not updated and filters directly with the previous coefficients.
The formula is as follows:
$s_E = y_1(m) - h(m) * y_2(m)$
where $h(m)$ represents the adaptive filter, $*$ represents convolution, and $s_E$ represents the enhanced signal.
Step five, updating the coefficients of the adaptive filter, and returning to step four;
the formula is as follows:
where $\mu$ represents the adaptive filter update step size and $e(m) = y_1(m) - y_2(m)$.
Step six, encoding the enhanced signal and sending it through the RTSP protocol; the user can adjust audio settings through real-time feedback to achieve the best voice communication effect.
Examples:
First, the two microphones are configured: two omnidirectional microphones spaced 5 cm apart are arranged at the front end and the rear end of the SIP phone respectively. Assuming that the environmental noise is additive and uncorrelated with the speech signal of the speaker using the phone, the signals collected by the main microphone at the front end of the phone and the auxiliary microphone at the rear end of the phone can be expressed as:
$y_i(m) = s_i(m) + n_i(m), \quad i = 1, 2 \qquad (1)$
The purpose of environmental noise cancellation is to remove the environmental noise component $n_1$ from $y_1$. Because the environmental noise is isotropic, i.e.
$n_1 \approx n_2 \qquad (2)$
when the speaker is not speaking,
$y_1 \approx y_2 \qquad (3)$
When the speaker speaks, the main microphone is closer to the speaker, while the auxiliary microphone is both farther from the speaker and shielded by the phone body, so that:
$P_1 > P_2 \qquad (4)$
where $P_1$ represents the power spectral density of the signal acquired by the main microphone and $P_2$ represents the power spectral density of the signal acquired by the auxiliary microphone. According to the above principle, the following voice activity detection method is designed:
First, the two microphone signals are windowed and fast Fourier transformed, their respective power spectra are calculated, and the logarithms of the two power spectra are subtracted. If the result is greater than ε (an empirical threshold), voice activity is judged to exist; otherwise, no voice activity is judged to exist.
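The reasoning behind this detector can be made explicit. The following is a sketch under the assumptions already stated in this example (additive, roughly isotropic noise that is uncorrelated with the speech). The symbols $P_{S_i}$ for the speech power at microphone $i$ and $P_N$ for the common noise power are introduced here for illustration and do not appear in the original text.

```latex
\begin{aligned}
\text{no speech:}\quad & P_1 \approx P_N,\quad P_2 \approx P_N
  &&\Rightarrow\quad \log P_1 - \log P_2 \approx 0 \\
\text{speech:}\quad & P_1 \approx P_{S_1} + P_N,\quad P_2 \approx P_{S_2} + P_N,\quad P_{S_1} > P_{S_2}
  &&\Rightarrow\quad \log P_1 - \log P_2 > 0
\end{aligned}
```

A threshold ε placed between these two cases separates them. The source describes ε as empirical, so its value would have to be tuned for the actual device and environment.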
Then, dual-microphone environmental noise cancellation is performed: the signal of the environmental noise acquisition microphone is taken as the reference signal, and adaptive filtering noise elimination is performed on the main microphone signal. In particular, updating of the adaptive filter is controlled by the voice activity detection result: when voice activity is detected, updating of the adaptive filter is stopped to prevent the filter from diverging and damaging the speech signal. The formula is as follows:
$s_E = y_1(m) - h(m) * y_2(m) \qquad (5)$
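The control logic described here, filtering always but adapting only when no speech is detected, can be sketched sample by sample. This is an illustration under assumptions: the per-sample speech flags are taken as given, the update is a normalized LMS step, and the name `gated_canceller`, the tap count, and the step size are hypothetical. Block acceleration is not shown.

```python
def gated_canceller(y1, y2, speech_flags, taps=4, mu=0.5, eps=1e-8):
    """Cancel noise in y1 using y2 as reference, freezing adaptation during speech.

    y1, y2       : main and auxiliary microphone samples
    speech_flags : one boolean per sample, True when voice activity is detected
    Returns (enhanced_samples, final_taps).
    """
    h = [0.0] * taps
    history = [0.0] * taps  # most recent reference samples, newest first
    enhanced = []
    for main, ref, speech in zip(y1, y2, speech_flags):
        history = [ref] + history[:-1]
        error = main - sum(t * x for t, x in zip(h, history))
        enhanced.append(error)  # the filter output is produced in both cases
        if not speech:
            # Adapt only on noise-only samples so that speech cannot pull
            # the coefficients away from the noise path.
            norm = sum(x * x for x in history) + eps
            h = [t + mu * error * x / norm for t, x in zip(h, history)]
    return enhanced, h
```

With all flags set to True the taps never move and the output equals the main signal. With all flags False and identical noise on both microphones, the filter converges toward a unit tap and the residual shrinks.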
Finally, the adaptive filter is designed: the present embodiment implements an NLMS filter and uses a block acceleration method. The adaptive filter update formula is as follows:
The signal is output after noise elimination processing, and the user can adjust audio settings through real-time feedback to achieve the best voice communication effect.
In the dual-microphone-based voice activity detection and environmental noise elimination method, environmental noise elimination is performed on a VoIP phone by reasonably arranging the two microphones. A user of a VoIP phone that supports RTSP enables the dual-microphone environmental noise elimination function through a key setting; after the user dials a call, the voice RTP stream is established after the DSP module has removed the noise;
the VoIP phone comprises a user input module, a call control module, an RTSP protocol control module, a DSP module and a UI module.
The user performs a key operation through the user input module of the VoIP phone to enable the dual-microphone environmental noise suppression function, and a VoIP call request is constructed according to the setting information. The audio acquisition module acquires data from the two microphones and sends it to the DSP module, which performs the following signal processing on the data: windowing, fast Fourier transform, power spectrum calculation, and logarithm subtraction. The adaptive filtering strategy is then chosen according to the voice activity result: if no voice activity is judged, the adaptive filter coefficients are updated; if voice activity is present, updating of the adaptive filter coefficients is stopped and the previous coefficients are used for filtering. The noise-reduced audio data is encoded according to the settings and then transmitted through the RTSP protocol control module.

Claims (3)

1. A dual-microphone-based voice activity detection and environmental noise elimination method, characterized by comprising the following specific steps:
step one, arranging two omnidirectional microphones at the front end and the rear end of a VoIP phone respectively, and collecting the signals of the two microphones while a user uses the phone;
the acquired signals are represented as follows:
$y_i(m) = s_i(m) + n_i(m), \quad i = 1, 2$
where $y_1(m)$ represents the signal acquired by the main microphone and $y_2(m)$ represents the signal acquired by the auxiliary microphone;
$s_i(m)$ represents the sound signal acquired by the $i$-th microphone while the user uses the phone, and $n_i(m)$ represents the environmental noise collected by the $i$-th microphone;
step two, windowing the two microphone signals respectively, performing a fast Fourier transform, and calculating their respective power spectra;
the power spectral density of the microphone signal is calculated as follows:
$P_{Y_i}(n,k) = \lambda P_{Y_i}(n-1,k) + (1-\lambda)\,|Y_i(n,k)|^2, \quad i = 1, 2$
where $\lambda$ is the forgetting factor, $Y_i(n,k)$ is the frequency-domain value of the microphone signal, $P$ denotes power spectral density, $P_{Y_i}(n,k)$ is the power spectral density of the current frame, and $P_{Y_i}(n-1,k)$ is the power spectral density of the previous frame;
step three, taking the logarithm of the power spectra of the two microphones and subtracting them, and judging whether the result is greater than an empirical threshold ε; if so, voice activity is judged to exist and the method proceeds to step four; otherwise, no voice activity is judged to exist and the method proceeds to step five;
the expression is as follows:
$\log P_{Y_1}(n,k) - \log P_{Y_2}(n,k) > \varepsilon$
step four, taking the signal acquired by the auxiliary microphone as the reference signal, performing noise elimination on the signal acquired by the main microphone using an adaptive filter to obtain an enhanced signal, and proceeding to step six;
the formula is as follows:
$s_E = y_1(m) - h(m) * y_2(m)$
where $h(m)$ represents the adaptive filter, $*$ represents convolution, and $s_E$ represents the enhanced signal;
step five, updating the coefficients of the adaptive filter, and returning to step four;
the update formula is as follows:
where $\mu$ represents the adaptive filter update step size and $e(m) = y_1(m) - y_2(m)$;
step six, encoding the enhanced signal and sending it through the RTSP protocol; the user can adjust audio settings through real-time feedback to achieve the best voice communication effect.
2. The method for detecting voice activity and eliminating environmental noise based on two microphones as defined in claim 1, wherein in the first step, the main microphone is disposed at the front end of the phone, the auxiliary microphone is disposed at the rear end of the phone, and the distance between the two microphones is 5cm.
3. The dual-microphone-based voice activity detection and environmental noise elimination method as recited in claim 1, characterized in that, in step two, the frequency-domain value $Y_i(n,k)$ is calculated as:
$Y_i(n,k) = S_i(n,k) + N_i(n,k), \quad i = 1, 2$
where $n$ is the frame index, $k$ is the frequency index, and $S_i(n,k)$ and $N_i(n,k)$ are the frequency-domain values obtained by Fourier transforming $s_i(m)$ and $n_i(m)$ respectively.
CN202311282052.8A 2023-09-28 2023-09-28 Voice activity detection and ambient noise elimination method based on double microphones Pending CN117238306A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311282052.8A CN117238306A (en) 2023-09-28 2023-09-28 Voice activity detection and ambient noise elimination method based on double microphones

Publications (1)

Publication Number Publication Date
CN117238306A true CN117238306A (en) 2023-12-15

Family

ID=89098170

Country Status (1)

Country Link
CN (1) CN117238306A (en)

Similar Documents

Publication Publication Date Title
CN110838300B (en) Echo cancellation processing method and processing system
US10403299B2 (en) Multi-channel speech signal enhancement for robust voice trigger detection and automatic speech recognition
US9443532B2 (en) Noise reduction using direction-of-arrival information
US9589556B2 (en) Energy adjustment of acoustic echo replica signal for speech enhancement
US8175871B2 (en) Apparatus and method of noise and echo reduction in multiple microphone audio systems
CN108376548B (en) Echo cancellation method and system based on microphone array
CN105825864B (en) Both-end based on zero-crossing rate index is spoken detection and echo cancel method
KR101469739B1 (en) A device for and a method of processing audio signals
US9699554B1 (en) Adaptive signal equalization
CN106713570B (en) Echo cancellation method and device
WO2009130513A1 (en) Two microphone noise reduction system
CN108447496B (en) Speech enhancement method and device based on microphone array
CN109273019B (en) Method for double-talk detection for echo suppression and echo suppression
CN104883462B (en) A kind of sef-adapting filter and filtering method for eliminating acoustic echo
US9313573B2 (en) Method and device for microphone selection
CN106448691B (en) Voice enhancement method for public address communication system
KR20130108063A (en) Multi-microphone robust noise suppression
WO2011129725A1 (en) Method and arrangement for noise cancellation in a speech encoder
US9508359B2 (en) Acoustic echo preprocessing for speech enhancement
US20180308503A1 (en) Real-time single-channel speech enhancement in noisy and time-varying environments
US9020144B1 (en) Cross-domain processing for noise and echo suppression
CN110956975A (en) Echo cancellation method and device
US10129410B2 (en) Echo canceller device and echo cancel method
TWI465121B (en) System and method for utilizing omni-directional microphones for speech enhancement
CN112929506B (en) Audio signal processing method and device, computer storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination