CN107393553A - Auditory feature extraction method for voice activity detection - Google Patents

Auditory feature extraction method for voice activity detection

Info

Publication number
CN107393553A
Authority
CN
China
Prior art keywords
signal
noise
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710578645.7A
Other languages
Chinese (zh)
Other versions
CN107393553B (en)
Inventor
蔡钢林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Yajin Smart Technology Co ltd
Original Assignee
Yongshun Shenzhen Wisdom Mdt Infotech Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yongshun Shenzhen Wisdom Mdt Infotech Ltd filed Critical Yongshun Shenzhen Wisdom Mdt Infotech Ltd
Priority to CN201710578645.7A priority Critical patent/CN107393553B/en
Publication of CN107393553A publication Critical patent/CN107393553A/en
Application granted granted Critical
Publication of CN107393553B publication Critical patent/CN107393553B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 Processing in the time domain
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02163 Only one microphone
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L2025/783 Detection of presence or absence of voice signals based on threshold decision

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The present invention proposes an auditory feature extraction method for voice activity detection, comprising the following steps: acquiring the time-domain signal of a sound signal; using the time-domain signal, calculating the prior signal-to-noise ratio γ(k) and the posterior signal-to-noise ratio ε(k) of the sound signal, where k is the frequency coordinate; and calculating the auditory feature of the current frame from the time-domain signal, the prior signal-to-noise ratio γ(k) and the posterior signal-to-noise ratio ε(k), the auditory feature comprising a first dimension parameter, a second dimension parameter and a third dimension parameter. The first dimension parameter is related to the prior signal-to-noise ratio γ(k), the second dimension parameter is related to the posterior signal-to-noise ratio ε(k), and the third dimension parameter is related to the time-domain signal. The invention characterizes the auditory feature by combining the prior and posterior signal-to-noise ratios with the time-domain signal, and the extracted auditory feature can be compared with an auditory threshold to detect voice activity in real time.

Description

Auditory feature extraction method for voice activity detection
Technical Field
The invention relates to the field of voice recognition, in particular to an auditory feature extraction method for voice activity detection.
Background
With the rapid development of Internet technology and intelligent hardware in recent years, voice-based intelligent interaction technologies such as speech recognition, voiceprint recognition and sound source detection have begun to move from the laboratory to end users. Speech recognition is the core technology of voice-based human-computer interaction systems. Under constrained conditions, the recognition rate has reached usable accuracy; "constrained conditions" generally means that the user is close to the microphone and the noise level is low. The requirement that voice commands be issued at close range limits the convenience of voice interaction.
Under far-talking conditions, the recognition rate drops rapidly because the speech energy attenuates quickly while the energy of the interfering noise remains essentially unchanged. Another factor affecting recognition accuracy is reverberation: a voice command reaches the microphone after multiple reflections off the walls of the room, which creates a mismatch between the actual application and the speech recognition training data and degrades the recognition rate.
There are two main sources of noise: (1) channel noise of the microphone signal acquisition system, which differs with microphone sensitivity; in general, the higher the sensitivity of the microphone, the higher the channel noise; (2) non-negligible ambient noise interference, such as television or air-conditioning noise. Because the conditions under which it arises are more complex, reverberation is harder to suppress than noise. Moreover, noise and reverberation usually coexist, which makes reverberation suppression even more difficult.
CN103559893A discloses a gammachirp cepstrum coefficient auditory feature extraction method for underwater targets, which forms cepstrum coefficients from the outputs of a gammachirp auditory filter bank and produces an auditory feature vector of an underwater target; it can improve the robustness of underwater target signal feature extraction under complex marine environment interference and thereby improve the accuracy of underwater target identification.
To solve the problem of speech recognition under far-talking conditions, the auditory features under such conditions must be extracted accurately. However, the auditory features extracted by the method of CN103559893A are limited to the underwater environment and are not suitable for speech recognition under far-talking conditions.
There is also super-directional beamforming technology, which uses a circular or linear microphone array to directionally enhance the signal of the target sound source through a set of spatial filters. Super-directional beamforming improves the quality of the sound signal at the acquisition stage. However, it requires a large number of microphones and places high demands on microphone consistency and on the accuracy of the microphone geometry, which increases the difficulty and cost of the hardware, makes it hard to integrate into most mid- and low-end products, and greatly limits its range of application.
Disclosure of Invention
The invention mainly aims to provide an auditory feature extraction method for voice activity detection, which can effectively extract auditory features under far speaking conditions in a single-microphone system and improve voice recognition rate.
The invention provides an auditory feature extraction method for voice activity detection, which comprises the following steps:
acquiring a time domain signal of a sound signal;
calculating a prior signal-to-noise ratio γ(k) and a posterior signal-to-noise ratio ε(k) of the sound signal by using the time domain signal, wherein k is the frequency coordinate;
calculating the auditory feature of the current frame according to the time domain signal, the prior signal-to-noise ratio γ(k) and the posterior signal-to-noise ratio ε(k), wherein the auditory feature comprises a first dimension parameter, a second dimension parameter and a third dimension parameter; the first dimension parameter is related to the prior signal-to-noise ratio γ(k), the second dimension parameter is related to the posterior signal-to-noise ratio ε(k), and the third dimension parameter is related to the time domain signal.
Preferably, the first dimension parameter is denoted V(1) and can be obtained by the following formula:

$$V(1)=\sum_{k=1}^{K}\gamma(k)$$

the second dimension parameter is denoted V(2) and can be obtained by the following formula:

$$V(2)=\sum_{k=1}^{K}\varepsilon(k)$$

the third dimension parameter is denoted V(3) and can be obtained by the following formula:

$$V(3)=\sum_{j=L_T}^{L_W-L_T}\gamma\left(y^{2}(j)-y(j+L_T)\,y(j-L_T)\right)$$

where K is the number of frequency bins in the whole band, L_W is the window length, L_T is the starting sample point, y is the time-domain mixed speech data, and j is the time index.
Preferably, the prior signal-to-noise ratio γ(k) can be obtained by the following formula:

$$\gamma(k)=\frac{|Y(l,k)|^{2}}{\Phi_{V}(k)}$$

where l is the time frame coordinate, Y(l,k) is the mixed speech spectrum, and Φ_V(k) is the power spectral density of the noise signal.
Preferably, the posterior signal-to-noise ratio ε(k) can be obtained by the following formula:

$$\varepsilon(k)=\beta\,\frac{|\hat{X}(k)|^{2}}{\Phi_{V}(k)}+(1-\beta)\,\mathrm{Max}\bigl(\gamma(k)-1,\,0\bigr)$$

where β is a smoothing factor with a value range of 0.6 to 0.9, X̂(k) is the estimated speech spectrum, and the Max function selects the larger of its two arguments.
Preferably, β is 0.75.
Preferably, the time domain signal is denoted y(t) and can be expressed by the following formula:

$$y(t)=x(t)+v(t)=\sum_{\tau=0}^{T}h(\tau)\,s(t-\tau)+v(t)$$

where x(t) is the reverberant speech signal, v(t) is the background noise, h(τ) is the reverberation impulse response, and s(t-τ) is the non-reverberant speech signal.
Preferably, before calculating the prior signal-to-noise ratio γ(k) and the posterior signal-to-noise ratio ε(k) of the sound signal using the time domain signal, the method further comprises
initializing the speech parameters, including the noise power spectral density Φ_V(k), the observed-signal power spectral density Φ_Y(k), the estimated speech spectrum X̂(k), the prior signal-to-noise ratio γ(k) and the posterior signal-to-noise ratio ε(k); the initialization procedure is as follows:
assuming that the first L_I time frames contain no voice activity, then

$$\Phi_{V}(k)=\frac{1}{L_I}\sum_{l=1}^{L_I}|Y(l,k)|^{2}$$

$$\Phi_{Y}(k)=\frac{1}{L_I}\sum_{l=1}^{L_I}|Y(l,k)|^{2}$$

$$\hat{X}(k)=\kappa\,\frac{1}{L_I}\sum_{l=1}^{L_I}Y(l,k)$$

γ(k) = 1, ε(k) = κ, k = 1, 2, ..., K

where K is the number of frequency bins in the whole band, l is the time frame coordinate, Y(l,k) is the mixed speech spectrum, κ is the attenuation factor, Φ_V(k) is the power spectral density of the noise signal, Φ_Y(k) is the power spectral density of the observed signal, and X̂(k) is the estimated speech spectrum.
Preferably, after the initialization of the speech parameters, the method further comprises
obtaining, by smoothing from the observed-signal power spectral density of the previous frame, the observed-signal power spectral density estimate of the next frame, which can be obtained by the following formula:

$$\Phi'_{Y}(k)=\alpha\,\Phi_{Y}(k)+(1-\alpha)\,|Y(l,k)|^{2}$$

where α is a smoothing factor with a value range of 0.95 to 0.995.
Preferably, after obtaining the observed-signal power spectral density estimate of the next frame by smoothing from the observed-signal power spectral density of the previous frame, the method further comprises
calculating the adaptive update step size of the noise power spectrum, which can be obtained by the following formula:

$$\alpha_{V}(k)=\alpha+\left(1-\exp(-\varepsilon(k))\,\frac{\gamma(k)}{\gamma(k)+1}\right)$$

where the smoothing factor α serves as the fixed step size.
Preferably, after calculating the adaptive update step size of the noise power spectrum, the method further comprises
updating the noise power spectrum according to the adaptive update step size, where the noise power spectrum can be obtained by the following formula:

$$\Phi_{V}(k)=\alpha_{V}(k)\,\Phi'_{V}(k)+(1-\alpha_{V}(k))\,|Y(l,k)|^{2}$$
the invention provides an auditory feature extraction method for voice activity detection, which adopts a priori signal-to-noise ratio and a posteriori signal-to-noise ratio to combine with a time domain signal to represent auditory features, and the extracted auditory features can be used for comparing with an auditory threshold value to detect real-time voice activity.
Drawings
FIG. 1 is a flow chart illustrating an embodiment of an auditory feature extraction method for voice activity detection according to the present invention;
Fig. 2 is a schematic diagram of a Hanning window.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The sound signal referred to in the present invention is digital audio data, that is, digital audio data obtained by converting a sound wave into an analog audio signal by a sound wave conversion circuit and then converting the analog audio signal by an analog-to-digital converter.
Referring to fig. 1, the present invention provides an auditory feature extraction method for voice activity detection, comprising the following steps:
s10, acquiring a time domain signal of the sound signal;
s20, calculating the prior signal-to-noise ratio gamma (k) and the posterior signal-to-noise ratio (k) of the sound signal by using the time domain signal, wherein k is a frequency coordinate;
s30, calculating auditory characteristics of the current frame according to the time domain signal, the prior signal-to-noise ratio gamma (k) and the posterior signal-to-noise ratio (k), wherein the auditory characteristics comprise a first dimension parameter, a second dimension parameter and a third dimension parameter; the first dimension parameter is related to the a priori signal-to-noise ratio γ (k), the second dimension parameter is related to the a posteriori signal-to-noise ratio (k), and the third dimension parameter is related to the time domain signal.
In step S10, the sound signal refers to the mixed speech data acquired by the sound collection system, which is usually stored in a buffer. Denote the mixed speech data y(t); it can be regarded as the superposition of the reverberant speech signal x(t) and the background noise v(t), and it is the time-domain signal of the sound signal. The reverberant speech signal x(t) can in turn be regarded as the convolution of the reverberation impulse response h(τ) with the non-reverberant speech s(t). This can be expressed mathematically as:

$$y(t)=x(t)+v(t)=\sum_{\tau=0}^{T}h(\tau)\,s(t-\tau)+v(t)$$
the above is only one way of acquiring the time domain signal of the sound signal, and the time domain signal of the sound signal may be acquired in other forms.
In step S20, the prior signal-to-noise ratio can be obtained by the following formula:

$$\gamma(k)=\frac{|Y(l,k)|^{2}}{\Phi_{V}(k)}$$

where l is the time frame coordinate, Y(l,k) is the mixed speech spectrum, and Φ_V(k) is the power spectral density of the noise signal.
Y(l,k) is obtained by applying a windowed FFT to the mixed speech data y(t), where w(t) is a Hanning window of length 512. The waveform of the Hanning window is shown in Fig. 2.
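A possible sketch of this framing-and-FFT step is given below; the 50% frame overlap (hop of 256 samples) is an assumption, since the text only fixes the Hanning window length at 512:

```python
import numpy as np

def stft_frames(y: np.ndarray, win_len: int = 512, hop: int = 256) -> np.ndarray:
    """Return the complex spectrum Y(l, k) of each Hanning-windowed frame."""
    w = np.hanning(win_len)
    n_frames = 1 + (len(y) - win_len) // hop
    return np.array([np.fft.rfft(w * y[l * hop : l * hop + win_len])
                     for l in range(n_frames)])
```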
The posterior signal-to-noise ratio can be obtained by the following formula:

$$\varepsilon(k)=\beta\,\frac{|\hat{X}(k)|^{2}}{\Phi_{V}(k)}+(1-\beta)\,\mathrm{Max}\bigl(\gamma(k)-1,\,0\bigr)$$

where β is a smoothing factor with a value range of 0.6 to 0.9, X̂(k) is the estimated speech spectrum, and the Max function selects the larger of its two arguments. In this embodiment β is preferably 0.75.
The above is only a preferred way of calculating the prior and posterior signal-to-noise ratios; any method that applies a suitable transformation or decomposition and then solves along the lines described above also falls within the protection scope of the present invention.
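For illustration, a sketch of the preferred SNR computation for one frame follows (the function and variable names are assumptions; phi_v is the noise power spectral density Φ_V(k) and X_hat is the estimated speech spectrum X̂(k)):

```python
import numpy as np

def snr_estimates(Y_frame: np.ndarray, X_hat: np.ndarray,
                  phi_v: np.ndarray, beta: float = 0.75):
    """Return (prior SNR gamma, posterior SNR eps) for one frame,
    following the formulas above."""
    gamma = np.abs(Y_frame) ** 2 / phi_v
    eps = beta * np.abs(X_hat) ** 2 / phi_v \
          + (1.0 - beta) * np.maximum(gamma - 1.0, 0.0)
    return gamma, eps
```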
In step S30, the auditory characteristics of the current frame include a first dimension parameter V (1), a second dimension parameter V (2), and a third dimension parameter V (3), and are stored in the buffer. Auditory features are computed in the form of three-dimensional column vectors. The auditory characteristics of the current frame may be represented in the following manner:
V(1) can be obtained by the following equation:

$$V(1)=\sum_{k=1}^{K}\gamma(k)$$

V(2) can be obtained by the following equation:

$$V(2)=\sum_{k=1}^{K}\varepsilon(k)$$

V(3) can be obtained by the following equation:

$$V(3)=\sum_{j=L_T}^{L_W-L_T}\gamma\left(y^{2}(j)-y(j+L_T)\,y(j-L_T)\right)$$

where K is the number of frequency bins in the whole band, L_W is the window length, L_T is the starting sample point, y is the time-domain mixed speech data, and j is the time index.
The above is only a preferred way of calculating the first dimension parameter V(1), the second dimension parameter V(2) and the third dimension parameter V(3); any method that applies a suitable transformation or decomposition and then solves along the lines described above also falls within the protection scope of the present invention.
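A sketch of the three-dimensional auditory feature is given below. The index range follows the V(3) formula above; the γ factor shown in that formula is treated as 1 here because its binding is ambiguous in the source, so the sketch is an assumption rather than a literal transcription:

```python
import numpy as np

def auditory_feature(gamma: np.ndarray, eps: np.ndarray,
                     y_frame: np.ndarray, L_T: int = 10) -> np.ndarray:
    """Return the column vector [V(1), V(2), V(3)] for the current frame."""
    L_W = len(y_frame)
    j = np.arange(L_T, L_W - L_T)                      # valid time indices
    v3 = np.sum(y_frame[j] ** 2 - y_frame[j + L_T] * y_frame[j - L_T])
    return np.array([gamma.sum(), eps.sum(), v3])
```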
The following is the specific calculation process of the auditory feature.
The first step is the estimation of the background noise; the accuracy of the noise energy estimate directly affects the quality of the subsequent voice detection. The embodiment of the invention combines a fixed noise estimate with adaptive noise updating to ensure that the noise estimate is both stable and accurate. The initialization and calculation flow is as follows:
(1) Take the data from the buffer, apply a window and perform an FFT (fast Fourier transform), transforming the time-domain signal into the spectral domain.
Suppose the mixed speech data is y(t), where x(t) is the reverberant speech signal, v(t) is the background noise, h(τ) is the reverberation impulse response, and s(t-τ) is the non-reverberant speech signal. The FFT is applied to each windowed frame,
where w(t) is a Hanning window of length 512, l is the time frame coordinate, and k is the frequency coordinate.
(2) Assume that the first L_I time frames contain no voice activity and initialize as follows:

$$\Phi_{V}(k)=\Phi_{Y}(k)=\frac{1}{L_I}\sum_{l=1}^{L_I}|Y(l,k)|^{2},\qquad \hat{X}(k)=\kappa\,\frac{1}{L_I}\sum_{l=1}^{L_I}Y(l,k)$$

γ(k) = 1, ε(k) = κ, k = 1, 2, ..., K

where K is the number of frequency bins in the whole band, Φ_V(k) is the power spectral density of the noise signal, Φ_Y(k) is the power spectral density of the observed signal, γ(k) is the prior signal-to-noise ratio, ε(k) is the posterior signal-to-noise ratio, and X̂(k) is the estimated speech spectrum; it is initialized to the mean of the mixed spectrum multiplied by the attenuation factor κ, which takes the value 0.1.
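A sketch of this initialization over the first L_I frames (assumed speech-free) might look as follows; Y is the complex spectrogram with one row per frame, and all names are assumptions:

```python
import numpy as np

def init_noise_params(Y: np.ndarray, L_I: int = 25, kappa: float = 0.1):
    """Return initial (Phi_V, Phi_Y, X_hat, gamma, eps) from the first L_I frames."""
    phi_v = np.mean(np.abs(Y[:L_I]) ** 2, axis=0)   # noise PSD Phi_V(k)
    phi_y = phi_v.copy()                            # observed-signal PSD Phi_Y(k)
    x_hat = kappa * np.mean(Y[:L_I], axis=0)        # attenuated mean spectrum
    K = Y.shape[1]
    return phi_v, phi_y, x_hat, np.ones(K), np.full(K, kappa)
```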
(3) Starting from the (L_T+1)-th time frame, the iterative calculation proceeds as follows:
(3.1) Update the observed-signal power spectral density estimate, i.e. obtain the result for the current frame by smoothing from the result of the previous frame:

$$\Phi'_{Y}(k)=\alpha\,\Phi_{Y}(k)+(1-\alpha)\,|Y(l,k)|^{2}$$

where α is a smoothing factor with a recommended value range of 0.95 to 0.995; in this embodiment α = 0.98 is preferred.
(3.2) Calculate the prior and posterior signal-to-noise ratios:

$$\gamma(k)=\frac{|Y(l,k)|^{2}}{\Phi_{V}(k)},\qquad \varepsilon(k)=\beta\,\frac{|\hat{X}(k)|^{2}}{\Phi_{V}(k)}+(1-\beta)\,\mathrm{Max}\bigl(\gamma(k)-1,\,0\bigr)$$

where β is a smoothing factor with a value range of 0.6 to 0.9; in this embodiment β = 0.75 is preferred. The Max function selects the larger of its two arguments.
(3.3) Calculate the adaptive update step size of the noise power spectrum from the prior and posterior signal-to-noise ratios:

$$\alpha_{V}(k)=\alpha+\left(1-\exp(-\varepsilon(k))\,\frac{\gamma(k)}{\gamma(k)+1}\right)$$

that is, the update combines a fixed step size with an adaptive step size.
(3.4) Update the noise power spectrum according to this step size. The basic principle is that when little speech is present, a larger update step is used so that the noise estimate remains accurate; otherwise a smaller step is used so that the speech signal does not leak into the iterative update of the noise power spectrum:

$$\Phi_{V}(k)=\alpha_{V}(k)\,\Phi'_{V}(k)+(1-\alpha_{V}(k))\,|Y(l,k)|^{2}$$

The output of this equation is the updated noise power spectrum, which is used for the noise update of the next frame and participates as a parameter in the voice detection process.
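Putting steps (3.1) to (3.4) together, one iteration of the noise update could be sketched as below (α = 0.98 and β = 0.75 as in the text; the step-size line transcribes the formula of step (3.3) as written, so values above 1 are possible and an implementation may wish to clip it):

```python
import numpy as np

def update_noise(Y_frame, phi_y, phi_v, x_hat, alpha=0.98, beta=0.75):
    """One frame of the noise-tracking iteration; returns updated values."""
    power = np.abs(Y_frame) ** 2
    phi_y = alpha * phi_y + (1.0 - alpha) * power                    # (3.1)
    gamma = power / phi_v                                            # (3.2) prior SNR
    eps = beta * np.abs(x_hat) ** 2 / phi_v \
          + (1.0 - beta) * np.maximum(gamma - 1.0, 0.0)              # (3.2) posterior SNR
    alpha_v = alpha + (1.0 - np.exp(-eps) * gamma / (gamma + 1.0))   # (3.3) step size
    phi_v = alpha_v * phi_v + (1.0 - alpha_v) * power                # (3.4) noise PSD
    return phi_y, phi_v, gamma, eps
```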
After the background noise parameters are accurately estimated, auditory features can be constructed based on the background noise parameters. After the auditory characteristics are obtained, the auditory characteristics of the current frame are compared with a set auditory threshold value, and whether the current frame has voice activity or not can be judged.
The voice activity detection is mainly used for detecting a voice activity area, stopping the optimization processing of voice in a non-voice activity area and reducing power consumption; in the voice activity area, noise interference can be reduced, and the voice optimization effect is improved.
Before the auditory feature of the current frame is extracted, there is an initialization process, as follows:
Initialize the feature buffer matrix, the feature thresholds and the voice detection result buffer. The feature buffer matrix consists of L_I three-dimensional column vectors and is initialized as follows:

Q(1:L_I) = 0
θ_T(1) = F_B(1,1)
θ_T(2) = F_B(2,1)
θ_T(3) = F_B(3,1)

where F_B is the auditory feature buffer, Q is the voice activity detection result buffer, and θ_T is the auditory feature threshold buffer, i.e. the prior signal-to-noise ratio, the posterior signal-to-noise ratio and the time-domain signal are each used for the final voice activity detection. In the auditory feature calculation, L_W is the window length and L_T is the starting sample point, whose value range is usually between 5 and 20 and which is set to 10 in this embodiment.
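A minimal sketch of this buffer initialization is shown below (array shapes are assumptions; F_B holds one three-dimensional feature column per frame):

```python
import numpy as np

L_I = 25                        # number of initialization frames
F_B = np.zeros((3, L_I))        # auditory feature buffer (one column per frame)
Q = np.zeros(L_I)               # voice detection result buffer
theta_T = F_B[:, 0].copy()      # feature thresholds, one per dimension
```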
Starting from the (L_T+1)-th time frame, the auditory feature of the current frame is computed using the formulas for V(1), V(2) and V(3) given above.
After the auditory feature of the current frame is obtained, the feature buffer and the feature thresholds are updated, the current auditory feature is compared with the auditory thresholds, and the voice detection result is determined from the comparison. The specific steps are as follows:
Update the feature buffer and the feature thresholds with the auditory feature of the current frame, i.e. remove the oldest data from the buffer and insert the data of the current frame.
Calculate the auditory threshold corresponding to each dimension parameter.
Compare the current auditory feature with the auditory thresholds and determine the voice detection result from the comparison,
where Q(i) is the score of the i-th dimension of the auditory feature and Q_Frame is the per-frame voice decision: 1 means the current frame contains speech, 0 means it does not.
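As a rough illustration of the comparison step, the sketch below scores each feature dimension against its threshold and takes a majority vote; the exact threshold-update rule and voting rule are not reproduced in the text, so this decision logic is an assumption:

```python
import numpy as np

def vad_decision(V: np.ndarray, theta: np.ndarray) -> int:
    """Return Q_Frame: 1 if the current frame is declared speech, else 0."""
    q = (V > theta).astype(int)   # per-dimension score Q(i)
    return int(q.sum() >= 2)      # majority vote over the 3 dimensions
```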
Update the voice detection result buffer: remove the oldest entry from the buffer, append the decision of the current frame, and compute the average voice detection result in the buffer:

Q = [Q′(:, 2:L_B); Q_Frame]
Then calculate the statistics of the detection results in the voice detection result buffer; the sum of the detection results is computed and denoted Q_M.
Since speech is usually continuous, Q_M is compared with a fixed threshold L_I. If Q_M is smaller than the threshold, the speech frames in the current buffer are regarded as false detections and the buffer is considered to contain no speech; the feature thresholds are then updated and the speech spectrum estimate is set to a minimum value.
At the same time, the estimated speech spectrum X̂(k) is updated using a coefficient whose value range is 0.1 to 0.3; in this invention it is set to 0.15. If there is no false detection, the current buffer is considered to contain speech, and the optimization of the sound signal continues.
The Kalman adaptive enhancement is assumed to use a forward prediction filter of length L_G to predict the clean speech spectrum, usually with L_G < L_I. In the present invention these two parameters are set to L_G = 15 and L_I = 25. Since the speech signal can be well represented by an autoregressive model, the prediction error can be interpreted as the reverberation component. Based on the minimum mean square error criterion, the adaptive filter update proceeds as follows:
For the first L_I frames, the prediction error vector, the prediction vector variance matrix and the prediction error are initialized as follows:

E(k) = 0

where the prediction vector variance matrix P_k is an L_G × L_G zero matrix, the prediction vector G_k is an L_G × 1 zero vector, and E(k) is the prediction error obtained with the current prediction vector.
Starting from the (L_I+1)-th frame, if the voice detection result indicates voice activity, the following adaptive update procedure is performed:
(1.1) Update the prediction error, including the prediction error vector and the predicted spectral error; the update involves the L_G × L_G identity matrix.
(1.2) Smooth the predicted spectral error to make the error estimate smoother:

E(k) = η|E_Pre|² - (1-η)|E_Pre,o|²

where η is a smoothing coefficient with a value range of 0.6 to 0.9; in this method it is set to 0.75.
(1.3) Calculate the Kalman gain and update the prediction vector:

G_k = G′_k + K_G E_Pre
(1.4) Update the reverberation power spectral density. The reverberation power spectral density uses the same smoothing coefficient α as the observed-signal power spectral density, and Φ′_R(k) denotes the reverberation power spectral density of the previous frame. The reverberation power spectral density is initialized to 0.
(1.5) constructing an attenuation factor according to the wiener filtering, and outputting an estimated voice spectrum, wherein the calculation is as follows:
the spectral estimation is used both to recover the time domain signal in the next step and to participate in the computation of the a posteriori signal-to-noise ratio in the first step.
(1.6) Repeat steps (1.1) to (1.5) until all frequency bins have been updated, then recover the time-domain signal with an inverse Fourier transform.
After the time-domain signal is recovered, it is sent to the downstream application, such as a communication device or a speech recognition engine, achieving joint suppression of noise and reverberation.
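A rough sketch of this final enhancement step is given below. The exact Wiener attenuation rule used by the patent is not reproduced in the text, so the spectral-subtraction-style gain here is an assumption, and overlap-add of successive frames would follow in a full implementation:

```python
import numpy as np

def enhance_frame(Y_frame, phi_v, phi_r, win_len=512):
    """Apply an assumed Wiener-type gain built from the noise PSD phi_v and
    reverberation PSD phi_r, then return the spectrum and time-domain frame."""
    power = np.abs(Y_frame) ** 2
    gain = np.maximum(power - phi_v - phi_r, 0.0) / np.maximum(power, 1e-12)
    X_hat = gain * Y_frame                     # estimated speech spectrum
    x_time = np.fft.irfft(X_hat, n=win_len)    # back to the time domain
    return X_hat, x_time
```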
The method can be used to assist voice command recognition in a home environment. In a home environment the user is typically 1 to 3 meters from the microphone and is affected by household noise and wall reverberation, so the recognition rate drops rapidly. The auditory feature extraction method provided by the invention can effectively extract the auditory features of the acquired voice signal and monitor voice activity, and, combined with a corresponding noise removal method, reduces the amount of misrecognition. Experiments show that at a distance of about 2 meters from the microphone, the recognition rate can be improved from 30% to 65% at an input signal-to-noise ratio of about 10 dB, and from 10% to about 50% when the noise increases to 20 dB.
The invention provides an auditory feature extraction method for voice activity detection.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by the present specification, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. An auditory feature extraction method for voice activity detection, comprising the steps of:
acquiring a time domain signal of a sound signal;
calculating a prior signal-to-noise ratio γ(k) and a posterior signal-to-noise ratio ε(k) of the sound signal by using the time domain signal, wherein k is the frequency coordinate;
calculating the auditory feature of the current frame according to the time domain signal, the prior signal-to-noise ratio γ(k) and the posterior signal-to-noise ratio ε(k), wherein the auditory feature comprises a first dimension parameter, a second dimension parameter and a third dimension parameter; the first dimension parameter is related to the prior signal-to-noise ratio γ(k), the second dimension parameter is related to the posterior signal-to-noise ratio ε(k), and the third dimension parameter is related to the time domain signal.
2. An auditory feature extraction method for voice activity detection as claimed in claim 1, characterized in that the first dimension parameter is denoted V(1), which is obtained by the following formula:

$$V(1)=\sum_{k=1}^{K}\gamma(k)$$

the second dimension parameter is denoted V(2), which is obtained by the following formula:

$$V(2)=\sum_{k=1}^{K}\varepsilon(k)$$

the third dimension parameter is denoted V(3), which is obtained by the following formula:

$$V(3)=\sum_{j=L_T}^{L_W-L_T}\gamma\left(y^{2}(j)-y(j+L_T)\,y(j-L_T)\right)$$

where K is the number of frequency bins in the whole band, L_W is the window length, L_T is the starting sample point, y is the time-domain mixed speech data, and j is the time index.
3. An auditory feature extraction method for voice activity detection as claimed in claim 1, characterized in that the prior signal-to-noise ratio γ(k) is obtained by the following formula:

$$\gamma(k)=\frac{|Y(l,k)|^{2}}{\Phi_{V}(k)}$$

where l is the time frame coordinate, Y(l,k) is the mixed speech spectrum, and Φ_V(k) is the power spectral density of the noise signal.
4. An auditory feature extraction method for voice activity detection according to claim 3, characterized in that the posterior signal-to-noise ratio ε(k) is obtained by the following formula:

$$\varepsilon(k)=\beta\,\frac{|\hat{X}(k)|^{2}}{\Phi_{V}(k)}+(1-\beta)\,\mathrm{Max}\bigl(\gamma(k)-1,\,0\bigr)$$

where β is a smoothing factor with a value range of 0.6 to 0.9, X̂(k) is the estimated speech spectrum, and the Max function selects the larger of its two arguments.
5. An auditory feature extraction method for voice activity detection as claimed in claim 4, characterized in that β is 0.75.
6. An auditory feature extraction method for voice activity detection as claimed in claim 1, characterized in that the time domain signal is denoted y(t), which is expressed by the following formula:

$$y(t)=x(t)+v(t)=\sum_{\tau=0}^{T}h(\tau)\,s(t-\tau)+v(t)$$

where x(t) is the reverberant speech signal, v(t) is the background noise, h(τ) is the reverberation impulse response, and s(t-τ) is the non-reverberant speech signal.
7. The auditory feature extraction method for voice activity detection according to claim 1, characterized in that, before calculating the prior signal-to-noise ratio γ(k) and the posterior signal-to-noise ratio ε(k) of the sound signal using the time domain signal, the method further comprises
initializing the speech parameters, including the noise power spectral density Φ_V(k), the observed-signal power spectral density Φ_Y(k), the estimated speech spectrum X̂(k), the prior signal-to-noise ratio γ(k) and the posterior signal-to-noise ratio ε(k), the initialization procedure being as follows:
assuming that the first L_I time frames contain no voice activity, then

$$\Phi_{V}(k)=\frac{1}{L_I}\sum_{l=1}^{L_I}|Y(l,k)|^{2}$$

$$\Phi_{Y}(k)=\frac{1}{L_I}\sum_{l=1}^{L_I}|Y(l,k)|^{2}$$

$$\hat{X}(k)=\kappa\,\frac{1}{L_I}\sum_{l=1}^{L_I}Y(l,k)$$

γ(k) = 1, ε(k) = κ, k = 1, 2, ..., K

where K is the number of frequency bins in the whole band, l is the time frame coordinate, Y(l,k) is the mixed speech spectrum, κ is the attenuation factor, Φ_V(k) is the power spectral density of the noise signal, Φ_Y(k) is the power spectral density of the observed signal, and X̂(k) is the estimated speech spectrum.
8. An auditory feature extraction method for voice activity detection as claimed in claim 7, characterized in that, after the initialization of the speech parameters, the method further comprises
obtaining, by smoothing from the observed-signal power spectral density of the previous frame, the observed-signal power spectral density estimate of the next frame, which is obtained by the following formula:

$$\Phi'_{Y}(k)=\alpha\,\Phi_{Y}(k)+(1-\alpha)\,|Y(l,k)|^{2}$$

where α is a smoothing factor with a value range of 0.95 to 0.995.
9. An auditory feature extraction method for voice activity detection as claimed in claim 8, characterized in that, after obtaining the observed-signal power spectral density estimate of the next frame by smoothing from the observed-signal power spectral density of the previous frame, the method further comprises
calculating the adaptive update step size of the noise power spectrum, which is obtained by the following formula:

$$\alpha_{V}(k)=\alpha+\left(1-\exp(-\varepsilon(k))\,\frac{\gamma(k)}{\gamma(k)+1}\right)$$

where the smoothing factor α serves as the fixed step size.
10. An auditory feature extraction method for voice activity detection as claimed in claim 9, characterized in that, after calculating the adaptive update step size of the noise power spectrum, the method further comprises
updating the noise power spectrum according to the adaptive update step size, where the noise power spectrum is obtained by the following formula:

$$\Phi_{V}(k)=\alpha_{V}(k)\,\Phi'_{V}(k)+(1-\alpha_{V}(k))\,|Y(l,k)|^{2}$$
CN201710578645.7A 2017-07-14 2017-07-14 Auditory feature extraction method for voice activity detection Active CN107393553B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710578645.7A CN107393553B (en) 2017-07-14 2017-07-14 Auditory feature extraction method for voice activity detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710578645.7A CN107393553B (en) 2017-07-14 2017-07-14 Auditory feature extraction method for voice activity detection

Publications (2)

Publication Number Publication Date
CN107393553A true CN107393553A (en) 2017-11-24
CN107393553B CN107393553B (en) 2020-12-22

Family

ID=60340769

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710578645.7A Active CN107393553B (en) 2017-07-14 2017-07-14 Auditory feature extraction method for voice activity detection

Country Status (1)

Country Link
CN (1) CN107393553B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101599274A (en) * 2009-06-26 2009-12-09 瑞声声学科技(深圳)有限公司 Speech enhancement method
CN101976565A (en) * 2010-07-09 2011-02-16 瑞声声学科技(深圳)有限公司 Dual-microphone-based speech enhancement device and method
CN101976566A (en) * 2010-07-09 2011-02-16 瑞声声学科技(深圳)有限公司 Voice enhancement method and device using same
CN103559893A (en) * 2013-10-17 2014-02-05 西北工业大学 Gammachirp cepstrum coefficient auditory feature extraction method of underwater targets
CN104916292A (en) * 2014-03-12 2015-09-16 华为技术有限公司 Method and apparatus for detecting audio signals
CN105810214A (en) * 2014-12-31 2016-07-27 展讯通信(上海)有限公司 Voice activation detection method and device
CN105810201A (en) * 2014-12-31 2016-07-27 展讯通信(上海)有限公司 Voice activity detection method and system
US20170004840A1 (en) * 2015-06-30 2017-01-05 Zte Corporation Voice Activity Detection Method and Method Used for Voice Activity Detection and Apparatus Thereof
CN106328155A (en) * 2016-09-13 2017-01-11 广东顺德中山大学卡内基梅隆大学国际联合研究院 Speech enhancement method of correcting priori signal-to-noise ratio overestimation
CN106558315A (en) * 2016-12-02 2017-04-05 深圳撒哈拉数据科技有限公司 Heterogeneous microphone automatic gain calibration method and system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
A.C. SURENDRAN: "Logistic discriminative speech detectors using posterior SNR", ICASSP *
ROBERTO GEMELLO: "A modified Ephraim-Malah noise suppression rule for automatic speech recognition", ICASSP *
李嘉安娜: "Research on speech endpoint detection methods in noisy environments", China Masters' Theses Full-text Database, Information Science and Technology Series *
邱一良: "Research on speech detection methods in noisy environments", China Masters' Theses Full-text Database, Information Science and Technology Series *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109643554A (en) * 2018-11-28 2019-04-16 深圳市汇顶科技股份有限公司 Adaptive voice Enhancement Method and electronic equipment
CN114724576A (en) * 2022-06-09 2022-07-08 广州市保伦电子有限公司 Method, device and system for updating threshold in howling detection in real time

Also Published As

Publication number Publication date
CN107393553B (en) 2020-12-22

Similar Documents

Publication Publication Date Title
CN107393550B (en) Voice processing method and device
CN108831495B (en) Speech enhancement method applied to speech recognition in noise environment
CN110867181B (en) Multi-target speech enhancement method based on SCNN and TCNN joint estimation
CN107993670B (en) Microphone array speech enhancement method based on statistical model
US10679617B2 (en) Voice enhancement in audio signals through modified generalized eigenvalue beamformer
CN111445919B (en) Speech enhancement method, system, electronic device, and medium incorporating AI model
WO2009110574A1 (en) Signal emphasis device, method thereof, program, and recording medium
CN107360497B (en) Calculation method and device for estimating reverberation component
CN106384588B (en) The hybrid compensation method of additive noise and reverberation in short-term based on vector Taylor series
CN112185408B (en) Audio noise reduction method and device, electronic equipment and storage medium
Niwa et al. Post-filter design for speech enhancement in various noisy environments
Schwartz et al. Joint estimation of late reverberant and speech power spectral densities in noisy environments using Frobenius norm
CN108538306B (en) Method and device for improving DOA estimation of voice equipment
CN112201273B (en) Noise power spectral density calculation method, system, equipment and medium
JP6748304B2 (en) Signal processing device using neural network, signal processing method using neural network, and signal processing program
CN112712818A (en) Voice enhancement method, device and equipment
CN112530451A (en) Speech enhancement method based on denoising autoencoder
CN107393553B (en) Auditory feature extraction method for voice activity detection
CN114242095B (en) Neural network noise reduction system and method based on OMLSA framework adopting harmonic structure
CN107346658B (en) Reverberation suppression method and device
JP7383122B2 (en) Method and apparatus for normalizing features extracted from audio data for signal recognition or modification
WO2020078210A1 (en) Adaptive estimation method and device for post-reverberation power spectrum in reverberation speech signal
CN107393558B (en) Voice activity detection method and device
CN107393559B (en) Method and device for checking voice detection result
CN113160842B (en) MCLP-based voice dereverberation method and system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20221130

Address after: 2C1, Plant 2, Baimenqian Industrial Zone, No. 215, Busha Road, Nanlong Community, Nanwan Street, Longgang District, Shenzhen, Guangdong 518000

Patentee after: Shenzhen Yajin Smart Technology Co.,Ltd.

Address before: 518000 Jinhua building, Longfeng 3rd road, Dalang street, Longhua New District, Shenzhen City, Guangdong Province

Patentee before: SHENZHEN YONSZ INFORMATION TECHNOLOGY CO.,LTD.