CN107393558B - Voice activity detection method and device - Google Patents

Voice activity detection method and device

Info

Publication number: CN107393558B
Application number: CN201710578644.2A
Authority: CN (China)
Prior art keywords: signal, dimension parameter, voice, auditory, noise ratio
Priority/filing date: 2017-07-14
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN107393558A (en)
Inventor: 蔡钢林
Current Assignee: Shenzhen Yajin Smart Technology Co., Ltd.
Original Assignee: Shenzhen Yonsz Information Technology Co., Ltd.
Application filed 2017-07-14 by Shenzhen Yonsz Information Technology Co., Ltd.; CN107393558A published 2017-11-24; application granted and CN107393558B published 2020-09-11.

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L2025/783: Detection of presence or absence of voice signals based on threshold decision

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides a voice activity detection method and device. The method comprises: calculating auditory features of a sound signal, the auditory features comprising a first dimension parameter related to the prior signal-to-noise ratio, a second dimension parameter related to the posterior signal-to-noise ratio, and a third dimension parameter related to the time-domain signal; and comparing the first, second, and third dimension parameters with their respective corresponding auditory thresholds to obtain a detection result. The invention represents auditory features by combining the prior and posterior signal-to-noise ratios with the time-domain signal, and compares the extracted features with the auditory thresholds to detect voice activity in real time. With a single-microphone system, the invention can effectively extract auditory features under distant-talking conditions and detect the presence of voice in the sound signal.

Description

Voice activity detection method and device
Technical Field
The present invention relates to the field of speech recognition, and in particular, to a method and an apparatus for detecting speech activity.
Background
With the rapid development of Internet technology and intelligent hardware in recent years, intelligent voice interaction technologies such as speech recognition, voiceprint recognition, and sound source detection have begun to move from the laboratory to users. Speech recognition is the core technology of voice-based human-machine interaction systems. Under restricted conditions, its recognition rate has reached usable accuracy; "restricted conditions" generally means that the user is close to the microphone and the environment is quiet. The requirement that voice commands be issued at close range limits the convenience of voice interaction.
Under distant-talking conditions, the recognition rate drops rapidly, because the speech energy decays quickly while the noise interference energy remains essentially unchanged. Another factor affecting recognition accuracy is reverberation: a voice command that reaches the walls of the room and is reflected many times mismatches the training data set of the speech recognizer in actual use and degrades the recognition rate.
There are two main sources of noise: (1) channel noise of the microphone signal acquisition system, which varies with microphone sensitivity (generally, the higher the sensitivity, the higher the channel noise); and (2) non-negligible ambient noise interference, such as television or air-conditioning noise. Because it arises under more complex conditions, reverberation is harder to suppress than noise. Moreover, noise and reverberation generally coexist, which makes reverberation suppression more difficult still.
Chinese patent application 201510119374.X discloses a voice detection method and device. The method comprises: framing the collected sound signal with overlap to obtain a plurality of sound frames; windowing the obtained frames; converting the windowed frames to the frequency domain to obtain the spectrum of each frame; converting each spectrum to the cepstral domain to obtain the corresponding cepstrum; calculating the cepstral distance between adjacent frames; and performing voice detection on the collected sound signal when the calculated cepstral distance is larger than a preset distance threshold. This scheme can save voice detection time.
However, that method compares the calculated cepstral distance with a preset threshold. Although thresholds for different distances are preset, the complexity of real environments means that a preset threshold cannot fit every specific scene, so the accuracy of speech recognition is reduced.
There is also the super-directive beamforming technique, which uses a circular or linear microphone array and a set of spatial filters to directionally enhance the signal from the target sound source, improving sound quality at the acquisition stage. However, super-directive beamforming requires a large number of microphones and places high demands on microphone consistency and on the accuracy of their geometric placement, which increases the difficulty and cost of the hardware. It is hard to integrate in most mid- and low-end products, so its range of application is very limited.
Disclosure of Invention
The main objective of the present invention is to provide a voice activity detection method and device that can, in a single-microphone system, effectively extract auditory features under distant-talking conditions and detect the presence of voice in the sound signal.
The invention provides a voice activity detection method, which comprises the following steps:
calculating auditory characteristics of the sound signal, wherein the auditory characteristics comprise a first dimension parameter related to a prior signal-to-noise ratio, a second dimension parameter related to a posterior signal-to-noise ratio and a third dimension parameter related to a time domain signal;
and comparing the first dimension parameter, the second dimension parameter, and the third dimension parameter with their respective corresponding hearing thresholds to obtain a detection result. If any one of the three parameters is larger than its corresponding hearing threshold, it is determined that voice activity exists in the sound signal; if none of them is larger than its corresponding hearing threshold, it is determined that the sound signal contains no voice activity.
Preferably, the first dimension parameter is represented by V(1), which is obtained by the following formula:

[equation image: formula for V(1) in terms of the prior signal-to-noise ratio γ(k) over the K frequency bands]

where γ(k) is the prior signal-to-noise ratio, k is the frequency index, and K is the number of frequency bands;
the second dimension parameter is represented by V(2), which is obtained by the following formula:

[equation image: formula for V(2) in terms of the posterior signal-to-noise ratio ξ(k)]

where ξ(k) is the posterior signal-to-noise ratio;
the third dimension parameter is represented by V(3), which is obtained by the following formula:

[equation image: formula for V(3) in terms of the time-domain data y(j)]

where L_W denotes the window length, L_T denotes the starting sample point, the function y is the time-domain mixed speech data, and j is the time variable.
Preferably, the prior signal-to-noise ratio γ(k) is obtained by the following formula:

γ(k) = |Y(l,k)|² / Φ_V(k)

where l is the time frame coordinate, Y(l,k) is the mixed speech spectrum, and Φ_V(k) is the power spectral density of the noise signal.
Preferably, the posterior signal-to-noise ratio ξ(k) is obtained by the following formula:

ξ(k) = β·|X̂(l−1,k)|² / Φ_V(k) + (1 − β)·Max(γ(k) − 1, 0)

where β is a smoothing factor with value range 0.6-0.9, X̂(l−1,k) is the estimated speech spectrum of the previous frame, and the Max function selects the larger of its two arguments.
Preferably, β is 0.75.
Preferably, the time-domain signal is represented by y(t), which is obtained by the following formula:

y(t) = Σ_τ h(τ)·s(t − τ) + v(t)

where x(t) = Σ_τ h(τ)·s(t − τ) is the speech signal with reverberation, v(t) is the background noise, h(τ) is the reverberation impulse response signal, and s(t − τ) is the non-reverberant speech signal.
Preferably, the calculating of the auditory features of the sound signal includes:
calculating a time-domain signal of the sound signal;
calculating the prior signal-to-noise ratio γ(k) and the posterior signal-to-noise ratio ξ(k) of the sound signal using the time-domain signal, where k is the frequency coordinate;
and calculating the auditory features of the current frame from the time-domain signal, the prior signal-to-noise ratio γ(k), and the posterior signal-to-noise ratio ξ(k).
Preferably, before calculating the prior signal-to-noise ratio γ(k) and the posterior signal-to-noise ratio ξ(k) of the sound signal using the time-domain signal, the method further comprises initializing the voice parameters, including the noise power spectral density Φ_V(k), the observed-signal power spectral density Φ_Y(k), the estimated speech spectrum X̂(k), the prior signal-to-noise ratio γ(k), and the posterior signal-to-noise ratio ξ(k). The initialization procedure is as follows: assume that the first L_I time frames contain no voice activity; then

Φ_V(k) = (1/L_I)·Σ_{l=1..L_I} |Y(l,k)|²
Φ_Y(k) = (1/L_I)·Σ_{l=1..L_I} |Y(l,k)|²
X̂(k) = κ·(1/L_I)·Σ_{l=1..L_I} |Y(l,k)|
γ(k) = 1, ξ(k) = κ, k = 1, 2, ..., K

where K is the number of frequency bands, l is the time frame coordinate, Y(l,k) is the mixed speech spectrum, κ is the attenuation factor, Φ_V(k) is the power spectral density of the noise signal, Φ_Y(k) is the power spectral density of the observed signal, and X̂(k) is the estimated speech spectrum.
Preferably, after initializing the voice parameters, the method further comprises obtaining the observed-signal power spectral density estimate of the current frame by smoothing from the power spectral density of the previous frame:

Φ_Y′(k) = α·Φ_Y(k) + (1 − α)·|Y(l,k)|²

where α is a smoothing factor with value range 0.95-0.995.
Preferably, after obtaining the smoothed observed-signal power spectral density estimate, the method further comprises calculating the adaptive update step of the noise power spectrum:

[equation image: adaptive step α_V(k), built from the fixed step α plus an SNR-dependent adaptive term]

where the smoothing factor α serves as the fixed step.
Preferably, after calculating the adaptive update step of the noise power spectrum, the method further comprises updating the noise power spectrum according to that step:

Φ_V(k) = α_V(k)·Φ_V′(k) + (1 − α_V(k))·|Y(l,k)|²
Preferably, the hearing threshold is represented by θ_T(i), i = 1, 2, 3, where θ_T(1) corresponds to the first dimension parameter, θ_T(2) to the second dimension parameter, and θ_T(3) to the third dimension parameter; θ_T(i) is obtained by the following formula:

[equation image: update of θ_T(i) from the previous threshold θ_T′(i) and the feature buffer matrix F_B]

where θ_T′(i) is the hearing threshold of the previous frame and F_B is a feature buffer matrix consisting of L_I auditory features, namely those of the previous L_I − 1 frames and of the current frame; i is the row index and j the column index of the feature buffer matrix.
Preferably, F_B is obtained by the following formula:

F_B = [F_B′(:, 2:L_I), V]

where F_B′ is the feature buffer matrix of the previous frame and V = [V(1); V(2); V(3)] collects the first dimension parameter V(1), the second dimension parameter V(2), and the third dimension parameter V(3).
Preferably, in the step of comparing the first dimension parameter, the second dimension parameter, and the third dimension parameter with their respective corresponding hearing thresholds to obtain the detection result, the detection result is obtained by the following formulas:

Q(i) = 1 if V(i) > θ_T(i), and Q(i) = 0 otherwise, i = 1, 2, 3
Q_Frame = 1 if Q(1) + Q(2) + Q(3) ≥ 1, and Q_Frame = 0 otherwise

where Q(i) is the score of the i-th dimension of the auditory feature and Q_Frame is the voice detection decision: 1 means the current frame contains voice, 0 means it does not.
The invention also provides a voice activity detection device, comprising:
the auditory feature calculation module is used for calculating auditory features of the sound signals, and the auditory features comprise first dimension parameters related to the prior signal-to-noise ratio, second dimension parameters related to the posterior signal-to-noise ratio and third dimension parameters related to the time domain signals;
and the voice detection module is used for comparing the first dimension parameter, the second dimension parameter and the third dimension parameter with respective corresponding hearing threshold values to obtain a detection result.
Preferably, the auditory feature calculation module includes:
a first dimension parameter calculation unit for calculating the first dimension parameter, represented by V(1), which is obtained by the following formula:

[equation image: formula for V(1) in terms of the prior signal-to-noise ratio γ(k)]

where γ(k) is the prior signal-to-noise ratio, k is the frequency index, and K is the number of frequency bands;

a second dimension parameter calculation unit for calculating the second dimension parameter, represented by V(2), which is obtained by the following formula:

[equation image: formula for V(2) in terms of the posterior signal-to-noise ratio ξ(k)]

where ξ(k) is the posterior signal-to-noise ratio;

a third dimension parameter calculation unit for calculating the third dimension parameter, represented by V(3), which is obtained by the following formula:

[equation image: formula for V(3) in terms of the time-domain data y(j)]

where L_W denotes the window length, L_T denotes the starting sample point, the function y is the time-domain mixed speech data, and j is the time variable.
Preferably, the auditory feature calculation module includes:
a prior signal-to-noise ratio calculation unit for calculating the prior signal-to-noise ratio γ(k), obtained by the following formula:

γ(k) = |Y(l,k)|² / Φ_V(k)

where l is the time frame coordinate, Y(l,k) is the mixed speech spectrum, and Φ_V(k) is the power spectral density of the noise signal.
Preferably, the auditory feature calculation module includes:
a posterior signal-to-noise ratio calculation unit for calculating the posterior signal-to-noise ratio ξ(k), obtained by the following formula:

ξ(k) = β·|X̂(l−1,k)|² / Φ_V(k) + (1 − β)·Max(γ(k) − 1, 0)

where β is a smoothing factor with value range 0.6-0.9, X̂(l−1,k) is the estimated speech spectrum of the previous frame, and the Max function selects the larger of its two arguments.
Preferably, β is 0.75.
Preferably, the auditory feature calculation module includes:
a time-domain signal calculation unit for calculating the time-domain signal y(t), obtained by the following formula:

y(t) = Σ_τ h(τ)·s(t − τ) + v(t)

where x(t) = Σ_τ h(τ)·s(t − τ) is the speech signal with reverberation, v(t) is the background noise, h(τ) is the reverberation impulse response signal, and s(t − τ) is the non-reverberant speech signal.
Preferably, the voice detection module includes:
an auditory threshold calculation unit for calculating the auditory threshold θ_T(i), i = 1, 2, 3, where θ_T(1) corresponds to the first dimension parameter, θ_T(2) to the second dimension parameter, and θ_T(3) to the third dimension parameter; θ_T(i) is obtained by the following formula:

[equation image: update of θ_T(i) from the previous threshold θ_T′(i) and the feature buffer matrix F_B]

where θ_T′(i) is the auditory threshold of the previous frame and F_B is a feature buffer matrix consisting of L_I auditory features, namely those of the previous L_I − 1 frames and of the current frame; i is the row index and j the column index of the feature buffer matrix.
Preferably, the voice detection module includes:
a feature buffer matrix calculation unit for calculating the feature buffer matrix F_B, obtained by the following formula:

F_B = [F_B′(:, 2:L_I), V]

where F_B′ is the feature buffer matrix of the previous frame and V = [V(1); V(2); V(3)] collects the first, second, and third dimension parameters of the current frame.
Preferably, the voice detection module includes:
a detection unit for comparing the first dimension parameter, the second dimension parameter, and the third dimension parameter with their respective corresponding hearing thresholds to obtain the detection result, which is obtained by the following formulas:

Q(i) = 1 if V(i) > θ_T(i), and Q(i) = 0 otherwise, i = 1, 2, 3
Q_Frame = 1 if Q(1) + Q(2) + Q(3) ≥ 1, and Q_Frame = 0 otherwise

where Q(i) is the score of the i-th dimension of the auditory feature and Q_Frame is the voice detection decision: 1 means the current frame contains voice, 0 means it does not.
According to the voice activity detection method and device, the prior signal-to-noise ratio and the posterior signal-to-noise ratio are combined with the time domain signal to represent the auditory characteristics, and the extracted auditory characteristics can be used for being compared with an auditory threshold value to detect real-time voice activity. The invention can effectively extract the auditory characteristics under the far-distance speaking condition and detect the existence of the voice in the voice signal under a single microphone system.
Drawings
FIG. 1 is a flowchart illustrating a voice activity detection method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a voice activity detection apparatus according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The sound signal referred to in the present invention is digital audio data, that is, data obtained by converting sound waves into an analog audio signal through an acoustic transducer circuit and then converting the analog audio signal with an analog-to-digital converter.
Referring to fig. 1, the present invention provides a voice activity detection method, including the following steps:
s10, calculating auditory characteristics of the sound signal, wherein the auditory characteristics comprise a first dimension parameter related to the prior signal-to-noise ratio, a second dimension parameter related to the posterior signal-to-noise ratio and a third dimension parameter related to the time domain signal;
and S20, comparing the first dimension parameter, the second dimension parameter and the third dimension parameter with respective corresponding hearing threshold values to obtain detection results.
In step S10, the auditory features of the current frame are obtained. The auditory features comprise three parameters: a first dimension parameter V(1) related to the prior signal-to-noise ratio, a second dimension parameter V(2) related to the posterior signal-to-noise ratio, and a third dimension parameter V(3) related to the time-domain signal. The auditory features of the current frame may be represented as the vector

V = [V(1); V(2); V(3)]
V(1) can be obtained by the following formula:

[equation image: formula for V(1) in terms of the prior signal-to-noise ratio γ(k)]

V(2) can be obtained by the following formula:

[equation image: formula for V(2) in terms of the posterior signal-to-noise ratio ξ(k)]

V(3) can be obtained by the following formula:

[equation image: formula for V(3) in terms of the time-domain data y(j)]

where K is the number of frequency bands, L_W denotes the window length, L_T denotes the starting sample point, the function y is the time-domain mixed speech data, j is the time variable, γ(k) is the prior signal-to-noise ratio, and ξ(k) is the posterior signal-to-noise ratio. The time-domain mixed speech data is one kind of time-domain signal.
The above is only a preferred way of calculating the first dimension parameter V(1), the second dimension parameter V(2), and the third dimension parameter V(3); any method that solves them after suitable transformation or decomposition also falls within the protection scope of the present invention.
The sound signal refers to the mixed voice data acquired by a sound collection system, typically stored in a buffer. Let the mixed speech data be y(t); it can be regarded as the superposition of the reverberated speech signal x(t) and the background noise v(t), where x(t) is in turn the convolution of the reverberation impulse response h(τ) with the non-reverberant speech signal s(t − τ). In mathematical form:

y(t) = Σ_τ h(τ)·s(t − τ) + v(t)
the above is only one way of acquiring the time domain signal of the sound signal, and the time domain signal of the sound signal may be acquired in other forms.
In step S20, the first dimension parameter, the second dimension parameter, and the third dimension parameter are compared with their respective corresponding hearing thresholds to obtain the detection result. For example, if any one of the three parameters is larger than its corresponding hearing threshold, it is determined that voice activity exists in the sound signal; if none of them is larger than its corresponding hearing threshold, it is determined that the sound signal contains no voice activity.
The above process can be written as:

Q(i) = 1 if V(i) > θ_T(i), and Q(i) = 0 otherwise, i = 1, 2, 3
Q_Frame = 1 if Q(1) + Q(2) + Q(3) ≥ 1, and Q_Frame = 0 otherwise

where Q(i) is the score of the i-th dimension of the auditory feature and Q_Frame is the voice detection decision: 1 means the current frame contains voice, 0 means it does not.
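A minimal Python sketch of this decision rule, assuming the reconstructed scoring above (a frame is declared voiced when any of the three dimensions exceeds its threshold); the feature and threshold values are invented:

    import numpy as np

    def vad_frame(V, theta_T):
        """Q(i) = 1 if V(i) > theta_T(i), else 0; the frame decision
        Q_Frame is 1 (voice) if any of the three dimensions scores 1."""
        Q = (np.asarray(V) > np.asarray(theta_T)).astype(int)
        return int(Q.sum() >= 1)

    print(vad_frame([1.8, 2.4, -3.2], [1.0, 2.0, -4.0]))  # -> 1 (voice present)
    print(vad_frame([0.2, 0.9, -6.0], [1.0, 2.0, -4.0]))  # -> 0 (no voice)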
The following is a specific calculation procedure for noise estimation.
First, the estimation of the background noise: the accuracy of the noise energy estimate directly influences the subsequent voice detection. The embodiment of the invention combines a fixed noise estimate with adaptive noise updating to ensure the stability and accuracy of the noise estimation. The initialization and calculation flow are as follows:
Take the data from the buffer, window it, and apply the FFT (fast Fourier transform) to transform the time-domain signal to the spectral domain. Let the mixed speech data be y(t), where x(t) is the speech signal with reverberation, v(t) the background noise, h(τ) the reverberation impulse response, and s(t − τ) the non-reverberant speech signal. The transform is:

Y(l,k) = Σ_{t=0..511} w(t)·y(t + l·Δ)·e^(−j2πkt/512)

where w(t) is a Hanning window of length 512, l is the time frame coordinate, k is the frequency coordinate, and Δ is the frame shift.
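A sketch of this transform with NumPy. The 512-point Hanning window and the frame coordinate l and frequency coordinate k follow the text; the 50% frame shift is an assumption the patent does not state:

    import numpy as np

    def stft(y, frame_len=512, hop=256):
        """Y(l, k): window each frame with a length-512 Hanning window and
        take its FFT; rows index time frames l, columns frequency bins k."""
        w = np.hanning(frame_len)
        n_frames = 1 + (len(y) - frame_len) // hop
        frames = np.stack([y[l * hop: l * hop + frame_len] * w
                           for l in range(n_frames)])
        return np.fft.rfft(frames, axis=1)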
Assume that the first L_I time frames contain no voice activity and initialize as follows:

Φ_V(k) = (1/L_I)·Σ_{l=1..L_I} |Y(l,k)|²
Φ_Y(k) = (1/L_I)·Σ_{l=1..L_I} |Y(l,k)|²
X̂(k) = κ·(1/L_I)·Σ_{l=1..L_I} |Y(l,k)|
γ(k) = 1, ξ(k) = κ, k = 1, 2, ..., K

where K is the number of frequency bands, Φ_V(k) is the power spectral density of the noise signal, Φ_Y(k) is the power spectral density of the observed signal, γ(k) is the prior signal-to-noise ratio, and ξ(k) is the posterior signal-to-noise ratio. The estimated speech spectrum X̂(k) is initialized as the mean of the mixed spectrum multiplied by the attenuation factor κ, which takes the value 0.1.
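A sketch of this initialization under the reconstruction above (averages over the first L_I frames, assumed speech-free, with κ = 0.1); L_I = 25 is the value stated later in the Kalman section:

    import numpy as np

    def init_stats(Y, L_I=25, kappa=0.1):
        """Initialize the noise/observed PSDs, the speech-spectrum estimate,
        and both SNRs from the first L_I (assumed speech-free) frames."""
        head = np.abs(Y[:L_I])                     # |Y(l, k)| for l = 1..L_I
        phi_V = np.mean(head ** 2, axis=0)         # noise PSD Phi_V(k)
        phi_Y = phi_V.copy()                       # observed-signal PSD Phi_Y(k)
        X_hat = kappa * np.mean(head, axis=0)      # attenuated mean mixed spectrum
        gamma = np.ones_like(phi_V)                # prior SNR gamma(k) = 1
        xi = np.full_like(phi_V, kappa)            # posterior SNR xi(k) = kappa
        return phi_V, phi_Y, X_hat, gamma, xi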
Starting from time frame L_T + 1, the iterative calculation proceeds as follows:
Update the power spectral density estimate of the observed signal, i.e., obtain the current frame's estimate by smoothing from the previous frame's result:

Φ_Y′(k) = α·Φ_Y(k) + (1 − α)·|Y(l,k)|²

where α is a smoothing factor with recommended value range 0.95-0.995; 0.98 is preferred in this embodiment.
Calculate the prior and posterior signal-to-noise ratios:

γ(k) = |Y(l,k)|² / Φ_V(k)
ξ(k) = β·|X̂(l−1,k)|² / Φ_V(k) + (1 − β)·Max(γ(k) − 1, 0)

where β is a smoothing factor with value range 0.6-0.9; 0.75 is preferred in this embodiment. The Max function selects the larger of its two arguments.
The above is only a preferred way of calculating the prior and posterior signal-to-noise ratios; any method that solves them after suitable transformation or decomposition also falls within the protection scope of the present invention.
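A sketch of one frame of this recursion, following the decision-directed-style reconstruction above; X_hat_prev stands for the previous frame's estimated speech spectrum X̂(l−1,k):

    import numpy as np

    def update_snrs(Y_l, X_hat_prev, phi_V, beta=0.75):
        """gamma(k) = |Y(l,k)|^2 / Phi_V(k); xi(k) blends the previous speech
        estimate with Max(gamma(k) - 1, 0) through the smoothing factor beta
        (range 0.6-0.9, 0.75 preferred in the embodiment)."""
        gamma = np.abs(Y_l) ** 2 / phi_V
        xi = (beta * np.abs(X_hat_prev) ** 2 / phi_V
              + (1.0 - beta) * np.maximum(gamma - 1.0, 0.0))
        return gamma, xi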
Calculating the self-adaptive updating step length of the noise power spectrum according to the prior posterior signal-to-noise ratio:
Figure BDA0001351137440000115
namely, a mode of adding a fixed step length and a self-adaptive step length is adopted to realize the whole updating.
Updating the noise power spectrum according to the step length, wherein the basic principle is that if the voice is less, the step length of updating the noise power spectrum is larger, and the accuracy of noise estimation is ensured; otherwise, a slower step size is used to avoid the speech signal from participating in the iterative update of the noise power spectrum:
ΦV(k)=αV(k)Φ′V(k)+(1-αV(k))|Y(l,k)|2
the output of the above equation is the noise power spectrum update result, which is used for the noise update of the next frame and participating in the voice detection process as a parameter.
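A sketch of the two power-spectral-density recursions of this step. Because the exact formula for the adaptive step α_V(k) survives only as an image, it is passed in here as a given array; the text describes it as a fixed step plus an adaptive step:

    import numpy as np

    def update_psds(phi_Y, phi_V, Y_l, alpha_V, alpha=0.98):
        """Phi_Y'(k) = alpha*Phi_Y(k) + (1-alpha)*|Y(l,k)|^2, followed by
        Phi_V(k) = alpha_V(k)*Phi_V'(k) + (1-alpha_V(k))*|Y(l,k)|^2.
        alpha_V near 1 freezes the noise estimate when speech is likely;
        smaller alpha_V tracks changing noise faster."""
        power = np.abs(Y_l) ** 2
        phi_Y_new = alpha * phi_Y + (1.0 - alpha) * power
        phi_V_new = alpha_V * phi_V + (1.0 - alpha_V) * power
        return phi_Y_new, phi_V_new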
The following is a specific process of voice detection.
After the background noise parameters are accurately estimated, auditory features can be constructed based on the background noise parameters. After the auditory characteristics are obtained, the auditory characteristics of the current frame are compared with a set auditory threshold value, and whether the current frame has voice activity or not can be judged.
The voice activity detection is mainly used for detecting a voice activity area, stopping the optimization processing of voice in a non-voice activity area and reducing power consumption; in the voice activity area, noise interference can be reduced, and the voice optimization effect is improved.
Before extracting the auditory features of the current frame, there is an initialization process, which is as follows:
Initialize the feature buffer matrix, the feature thresholds, and the voice detection result buffer. The feature buffer matrix consists of L_I three-dimensional column vectors:

[equation images: initialization of the three rows of F_B from the V(1), V(2), and V(3) values of the first L_I frames]

Q(1:L_I) = 0
θ_T(1) = F_B(1,1)
θ_T(2) = F_B(2,1)
θ_T(3) = F_B(3,1)

where F_B is the auditory feature buffer, Q is the voice activity detection result buffer, and θ_T is the auditory feature threshold buffer; that is, the prior signal-to-noise ratio, the posterior signal-to-noise ratio, and the time-domain signal each contribute to the final voice activity detection. In the time-domain signal calculation, L_W denotes the window length and L_T the starting sample point, whose value usually ranges between 5 and 20 and is set to 10 in this embodiment.
Starting from time frame L_T + 1, the auditory features of the current frame are computed as follows:

[equation images: the formulas for V(1), V(2), and V(3) of the current frame]
According to the current frame's auditory feature results, update the feature buffer and the feature thresholds, i.e., kick the oldest data out of the buffer and put the current frame's data in:

F_B = [F_B′(:, 2:L_I), V]

where F_B′ is the feature buffer of the previous frame and V = [V(1); V(2); V(3)] is the current frame's feature vector.
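A sketch of this first-in-first-out update: the oldest column of the 3 × L_I buffer is dropped and the current frame's [V(1), V(2), V(3)] appended; the buffer contents here are invented:

    import numpy as np

    def push_feature(F_B, V):
        """Kick the oldest column out of the feature buffer (3 rows,
        L_I columns) and append the current frame's feature vector."""
        return np.column_stack([F_B[:, 1:], np.asarray(V, dtype=float)])

    F_B = np.zeros((3, 25))                        # toy buffer with L_I = 25
    F_B = push_feature(F_B, [1.8, 2.4, -3.2])      # invented feature values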
Then calculate the auditory threshold corresponding to each dimension parameter:

[equation image: update of θ_T(i) from the previous threshold θ_T′(i) and the feature buffer matrix F_B]
Compare the current auditory features with the auditory thresholds and determine the voice detection result from the comparison:

Q(i) = 1 if V(i) > θ_T(i), and Q(i) = 0 otherwise, i = 1, 2, 3
Q_Frame = 1 if Q(1) + Q(2) + Q(3) ≥ 1, and Q_Frame = 0 otherwise

where Q(i) is the score of the i-th dimension of the auditory feature and Q_Frame is the voice detection decision: 1 means the current frame contains voice, 0 means it does not.
Updating a voice detection result buffer area, kicking out the data with the longest time in the buffer area from the buffer area, adding a current frame judgment result, and calculating an average voice detection result in the buffer area:
Q=[Q′(:,2:LB);QFrame]
Then calculate the statistic of the detection results in the voice detection result buffer, namely their sum:

Q_M = Σ_{j=1..L_B} Q(j)
Since speech is usually continuous, Q_M is compared with the fixed threshold L_I. If Q_M is smaller than the threshold, the frames detected as speech in the current buffer are taken to be false detections: the current buffer contains no speech, the feature thresholds are updated, and the speech spectrum estimate is set to a minimum value. The threshold update is:

[equation image: feature threshold update applied when a false detection is declared]
At the same time, the estimated speech spectrum X̂ is updated as follows:

[equation image: X̂(l,k) set to a small fraction of the observed spectrum]

The fraction takes values in 0.1-0.3; the invention uses 0.15. If there is no false detection, the current buffer does contain speech, and optimization of the sound signal can continue.
The Kalman adaptive enhancement uses a forward prediction filter of length L_G to predict the clean speech spectrum, usually with L_G < L_I. In the present invention these two parameters are set to L_G = 15 and L_I = 25. Since the speech signal can be well represented by an autoregressive model, the prediction error can be understood as the reverberation component. Based on the minimum mean-square-error criterion, the adaptive filter update proceeds as follows:
For the first L_I frames, the prediction error vector, the prediction-vector variance matrix, and the prediction error are initialized as follows:

P_k = 0 (an L_G × L_G zero matrix)
G_k = 0 (an L_G × 1 zero vector)
E(k) = 0

where P_k is the prediction-vector variance matrix, G_k is the prediction error vector, and E(k) is the prediction error obtained with the current prediction vector.
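A sketch of this initialization with the stated dimensions (L_G = 15); real code would keep one copy of these quantities per frequency band, which is omitted here:

    import numpy as np

    L_G = 15                            # forward-prediction filter length
    P_k = np.zeros((L_G, L_G))          # prediction-vector variance matrix
    G_k = np.zeros((L_G, 1))            # prediction error vector
    E_k = 0.0                           # prediction error E(k)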
Starting from frame L_I + 1, if the voice detection result indicates the presence of voice activity, the adaptive update proceeds as follows:
(1.1) Update the prediction error, including the prediction error vector and the prediction spectral error:

[equation images: updates of the prediction error vector and of the prediction spectral error E_Pre]

where I is the L_G × L_G identity matrix.
(1.2) Smooth the prediction spectral error so that the error estimate varies more smoothly:

[equation image: recursive smoothing that yields the previous smoothed error E_Pre,o]

E(k) = η·|E_Pre|² − (1 − η)·|E_Pre,o|²

where η is a smoothing coefficient with value range 0.6-0.9; the invention uses 0.75.
(1.3) Calculate the Kalman gain and update the prediction vector:

[equation images: Kalman gain K_G and prediction-variance matrix P_k updates]

G_k = G_k′ + K_G·E_Pre
(1.4) Update the reverberation power spectral density:

Φ_R(k) = α·Φ_R′(k) + (1 − α)·|E_Pre|²

The reverberation power spectral density uses the same smoothing coefficient α as the observed-signal power spectral density; Φ_R′(k) is the reverberation power spectral density of the previous frame, and its initial value is 0.
(1.5) Construct an attenuation factor by Wiener filtering and output the estimated speech spectrum:

[equation images: Wiener attenuation factor and the estimated speech spectrum X̂(l,k)]
the spectral estimation is used both to recover the time domain signal in the next step and to participate in the computation of the a posteriori signal-to-noise ratio in the first step.
(1.6) Execute steps 1.1 to 1.5 in a loop until all frequency bands have been updated, then recover the time-domain signal by inverse Fourier transform:

[equation image: inverse FFT with overlap-add that reconstructs the time-domain signal]
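A sketch of this reconstruction: inverse FFT per frame followed by overlap-add. The frame shift mirrors the STFT sketch earlier and is likewise an assumption:

    import numpy as np

    def istft(X, frame_len=512, hop=256):
        """Recover the time-domain signal from the estimated spectra X(l, k)
        by inverse FFT of each frame followed by overlap-add."""
        n_frames = X.shape[0]
        out = np.zeros(frame_len + hop * (n_frames - 1))
        for l in range(n_frames):
            out[l * hop: l * hop + frame_len] += np.fft.irfft(X[l], n=frame_len)
        return out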
and after the time domain signal is recovered, sending the time domain signal to a subsequent application terminal, such as a communication device or a voice recognition engine, so as to realize the combined suppression of noise and reverberation.
Referring to fig. 2, the present invention further provides a voice activity detecting apparatus, including:
the auditory feature calculation module 10 is configured to calculate an auditory feature of the sound signal, where the auditory feature includes a first dimension parameter related to a prior signal-to-noise ratio, a second dimension parameter related to a posterior signal-to-noise ratio, and a third dimension parameter related to a time domain signal;
and the voice detection module 20, configured to compare the first dimension parameter, the second dimension parameter, and the third dimension parameter with their respective corresponding hearing thresholds to obtain a detection result.
Optionally, the auditory feature calculation module includes:
a first dimension parameter calculation unit for calculating the first dimension parameter, represented by V(1), which is obtained by the following formula:

[equation image: formula for V(1) in terms of the prior signal-to-noise ratio γ(k)]

where γ(k) is the prior signal-to-noise ratio, k is the frequency index, and K is the number of frequency bands;

a second dimension parameter calculation unit for calculating the second dimension parameter, represented by V(2), which is obtained by the following formula:

[equation image: formula for V(2) in terms of the posterior signal-to-noise ratio ξ(k)]

where ξ(k) is the posterior signal-to-noise ratio;

a third dimension parameter calculation unit for calculating the third dimension parameter, represented by V(3), which is obtained by the following formula:

[equation image: formula for V(3) in terms of the time-domain data y(j)]

where L_W denotes the window length, L_T denotes the starting sample point, the function y is the time-domain mixed speech data, and j is the time variable.
Optionally, the auditory feature calculation module includes:
a prior signal-to-noise ratio calculation unit for calculating the prior signal-to-noise ratio γ(k), obtained by the following formula:

γ(k) = |Y(l,k)|² / Φ_V(k)

where l is the time frame coordinate, Y(l,k) is the mixed speech spectrum, and Φ_V(k) is the power spectral density of the noise signal.
Optionally, the auditory feature calculation module includes:
a posterior signal-to-noise ratio calculation unit for calculating the posterior signal-to-noise ratio ξ(k), obtained by the following formula:

ξ(k) = β·|X̂(l−1,k)|² / Φ_V(k) + (1 − β)·Max(γ(k) − 1, 0)

where β is a smoothing factor with value range 0.6-0.9, X̂(l−1,k) is the estimated speech spectrum of the previous frame, and the Max function selects the larger of its two arguments.
Alternatively, β is 0.75.
Optionally, the auditory feature calculation module includes:
a time-domain signal calculation unit for calculating the time-domain signal y(t), obtained by the following formula:

y(t) = Σ_τ h(τ)·s(t − τ) + v(t)

where x(t) = Σ_τ h(τ)·s(t − τ) is the speech signal with reverberation, v(t) is the background noise, h(τ) is the reverberation impulse response signal, and s(t − τ) is the non-reverberant speech signal.
Optionally, the voice detection module includes:
an auditory threshold calculation unit for calculating the auditory threshold θ_T(i), i = 1, 2, 3, where θ_T(1) corresponds to the first dimension parameter, θ_T(2) to the second dimension parameter, and θ_T(3) to the third dimension parameter; θ_T(i) is obtained by the following formula:

[equation image: update of θ_T(i) from the previous threshold θ_T′(i) and the feature buffer matrix F_B]

where θ_T′(i) is the auditory threshold of the previous frame and F_B is a feature buffer matrix consisting of L_I auditory features, namely those of the previous L_I − 1 frames and of the current frame; i is the row index and j the column index of the feature buffer matrix.
Optionally, the voice detection module includes:
a feature buffer matrix calculation unit for calculating the feature buffer matrix F_B, obtained by the following formula:

F_B = [F_B′(:, 2:L_I), V]

where F_B′ is the feature buffer matrix of the previous frame and V = [V(1); V(2); V(3)] collects the first, second, and third dimension parameters of the current frame.
Optionally, the voice detection module includes:
a detection unit for comparing the first dimension parameter, the second dimension parameter, and the third dimension parameter with their respective corresponding hearing thresholds to obtain the detection result, which is obtained by the following formulas:

Q(i) = 1 if V(i) > θ_T(i), and Q(i) = 0 otherwise, i = 1, 2, 3
Q_Frame = 1 if Q(1) + Q(2) + Q(3) ≥ 1, and Q_Frame = 0 otherwise

where Q(i) is the score of the i-th dimension of the auditory feature and Q_Frame is the voice detection decision: 1 means the current frame contains voice, 0 means it does not.
The method can assist voice command recognition in a home environment, where the user is typically 1 to 3 meters from the microphone and the recognition rate drops rapidly under household noise and wall reverberation. The voice activity detection method of the invention effectively extracts auditory features from the acquired sound signal, monitors voice activity, and reduces false recognitions. Experiments show that at about 2 meters from the microphone, with an input signal-to-noise ratio of about 10 dB, the recognition rate improves from 30% to 65%; when the noise increases to 20 dB, the recognition rate improves from 10% to about 50%.
The invention provides a voice activity detection method and device, which adopt a priori signal-to-noise ratio and a posteriori signal-to-noise ratio to combine with a time domain signal to represent auditory characteristics, and the extracted auditory characteristics are used for being compared with an auditory threshold value to detect real-time voice activity.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by the present specification, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (9)

1. A voice activity detection method, comprising the steps of:
calculating auditory characteristics of the sound signal, wherein the auditory characteristics comprise a first dimension parameter related to a prior signal-to-noise ratio, a second dimension parameter related to a posterior signal-to-noise ratio and a third dimension parameter related to a time domain signal; the sound signal is digital audio data;
comparing the first dimension parameter, the second dimension parameter and the third dimension parameter with respective corresponding hearing threshold values to obtain detection results;
the first dimension parameter is represented by V(1), which is obtained by the following formula:

[equation image: formula for V(1) in terms of the prior signal-to-noise ratio γ(k)]

where γ(k) is the prior signal-to-noise ratio, k is the frequency index, and K is the number of frequency bands;
the second dimension parameter is represented by V(2), which is obtained by the following formula:

[equation image: formula for V(2) in terms of the posterior signal-to-noise ratio ξ(k)]

where ξ(k) is the posterior signal-to-noise ratio;
the third dimension parameter is represented by V(3), which is obtained by the following formula:

[equation image: formula for V(3) in terms of the time-domain data y(j)]

where L_W denotes the window length, L_T denotes the starting sample point, the function y is the time-domain mixed speech data, and j is a time variable;
the obtaining of the detection result comprises: obtaining the current auditory feature calculation result from the calculation formulas of V(1), V(2), and V(3), comparing it with the auditory thresholds, and determining the detection result from the comparison so as to judge whether voice is present;
the judging whether voice is present comprises: calculating a detection result statistic as the sum of the detection results, judging from it whether a false detection has occurred, and performing optimization processing on the sound signal when no false detection has occurred.
2. The voice activity detection method according to claim 1, wherein the a priori signal-to-noise ratio γ(k) is obtained by the following formula:

γ(k) = |Y(l,k)|² / Φ_V(k)

where l is the time frame coordinate, Y(l,k) is the mixed speech spectrum, and Φ_V(k) is the power spectral density of the noise signal.
3. The voice activity detection method according to claim 2, wherein the a posteriori signal-to-noise ratio ξ(k) is obtained by the following formula:

ξ(k) = β·|X̂(l−1,k)|² / Φ_V(k) + (1 − β)·Max(γ(k) − 1, 0)

where β is a smoothing factor with value range 0.6-0.9, X̂(l−1,k) is the estimated speech spectrum of the previous frame, and the Max function selects the larger of its two arguments.
4. The voice activity detection method according to claim 3, wherein β is 0.75.
5. The voice activity detection method according to claim 1, wherein the time-domain signal is represented by y(t), which is obtained by the following formula:

y(t) = Σ_τ h(τ)·s(t − τ) + v(t)

where x(t) = Σ_τ h(τ)·s(t − τ) is the speech signal with reverberation, v(t) is the background noise, h(τ) is the reverberation impulse response signal, and s(t − τ) is the non-reverberant speech signal.
6. The voice activity detection method according to claim 1, wherein the hearing threshold is represented by θ_T(i), i = 1, 2, 3, where θ_T(1) corresponds to the first dimension parameter, θ_T(2) to the second dimension parameter, and θ_T(3) to the third dimension parameter; θ_T(i) is obtained by the following formula:

[equation image: update of θ_T(i) from the previous threshold θ_T′(i) and the feature buffer matrix F_B]

where θ_T′(i) is the hearing threshold of the previous frame and F_B is a feature buffer matrix consisting of L_I auditory features, namely those of the previous L_I − 1 frames and of the current frame; i is the row index and j the column index of the feature buffer matrix.
7. The voice activity detection method according to claim 6, wherein F_B is obtained by the following formula:

F_B = [F_B′(:, 2:L_I), V]

where F_B′ is the feature buffer matrix of the previous frame and V = [V(1); V(2); V(3)] collects the first, second, and third dimension parameters of the current frame.
8. The method according to claim 7, wherein in the step of comparing the first dimension parameter, the second dimension parameter, and the third dimension parameter with their respective corresponding hearing thresholds, the detection result is obtained by the following formulas:

Q(i) = 1 if V(i) > θ_T(i), and Q(i) = 0 otherwise, i = 1, 2, 3
Q_Frame = 1 if Q(1) + Q(2) + Q(3) ≥ 1, and Q_Frame = 0 otherwise

where Q(i) is the score of the i-th dimension of the auditory feature and Q_Frame is the voice detection decision: 1 means the current frame contains voice, 0 means it does not.
9. A voice activity detection apparatus, comprising:

an auditory feature calculation module for calculating auditory features of the sound signal, the auditory features comprising a first dimension parameter related to the a priori signal-to-noise ratio [equation image: formula for V(1)], a second dimension parameter related to the a posteriori signal-to-noise ratio [equation image: formula for V(2)], and a third dimension parameter related to the time-domain signal [equation image: formula for V(3)], where γ(k) is the a priori signal-to-noise ratio, k is the frequency index, K is the number of frequency bands, ξ(k) is the a posteriori signal-to-noise ratio, L_W denotes the window length, L_T denotes the starting sample point, the function y is the time-domain mixed speech data, and j is a time variable; the sound signal is digital audio data;

a voice detection module for comparing the first dimension parameter, the second dimension parameter, and the third dimension parameter with their respective corresponding hearing thresholds to obtain a detection result, wherein the obtaining of the detection result comprises: judging whether a false detection has occurred, and performing optimization processing on the sound signal when no false detection has occurred.
CN201710578644.2A (filed 2017-07-14, priority date 2017-07-14): Voice activity detection method and device. Granted as CN107393558B; status: Active.

Priority Applications (1)

Application Number: CN201710578644.2A; Priority Date: 2017-07-14; Filing Date: 2017-07-14; Title: Voice activity detection method and device

Publications (2)

CN107393558A (en): published 2017-11-24
CN107393558B (en): published 2020-09-11

Family

ID=60340739

Family Applications (1)

Application Number: CN201710578644.2A; Priority Date: 2017-07-14; Filing Date: 2017-07-14; Title: Voice activity detection method and device

Country Status (1)

CN (1) CN107393558B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2828854B1 (en) * 2012-03-23 2016-03-16 Dolby Laboratories Licensing Corporation Hierarchical active voice detection

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101625857A (en) * 2008-07-10 2010-01-13 新奥特(北京)视频技术有限公司 Self-adaptive voice endpoint detection method
CN102959625A (en) * 2010-12-24 2013-03-06 华为技术有限公司 Method and apparatus for adaptively detecting voice activity in input audio signal
CN103903634A (en) * 2012-12-25 2014-07-02 中兴通讯股份有限公司 Voice activation detection (VAD), and method and apparatus for the VAD
KR20140117885A (en) * 2013-03-27 2014-10-08 주식회사 시그테크 Method for voice activity detection and communication device implementing the same
CN103489454A (en) * 2013-09-22 2014-01-01 浙江大学 Voice endpoint detection method based on waveform morphological characteristic clustering
CN103854662A (en) * 2014-03-04 2014-06-11 中国人民解放军总参谋部第六十三研究所 Self-adaptation voice detection method based on multi-domain joint estimation
CN104916292A (en) * 2014-03-12 2015-09-16 华为技术有限公司 Method and apparatus for detecting audio signals
CN105261375A (en) * 2014-07-18 2016-01-20 中兴通讯股份有限公司 Voice activity detection method and apparatus
CN105706167A (en) * 2015-11-19 2016-06-22 瑞典爱立信有限公司 Method and apparatus for voiced speech detection

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on endpoint detection methods based on multiple speech feature parameters; 汪石农 et al.; Computer Engineering and Design; 2012-02-28; Vol. 33, No. 2; pp. 684-687 and 694 *

Also Published As

Publication number Publication date
CN107393558A (en) 2017-11-24

Similar Documents

Publication Publication Date Title
CN107393550B (en) Voice processing method and device
CN108831495B (en) Speech enhancement method applied to speech recognition in noise environment
CN107452389B (en) Universal single-track real-time noise reduction method
CN109273021B (en) RNN-based real-time conference noise reduction method and device
KR20180115984A (en) Method and apparatus for integrating and removing acoustic echo and background noise based on deepening neural network
CN112735456B (en) Speech enhancement method based on DNN-CLSTM network
US10679617B2 (en) Voice enhancement in audio signals through modified generalized eigenvalue beamformer
Naqvi et al. Multimodal (audio–visual) source separation exploiting multi-speaker tracking, robust beamforming and time–frequency masking
CN108538306B (en) Method and device for improving DOA estimation of voice equipment
CN112185408B (en) Audio noise reduction method and device, electronic equipment and storage medium
CN111445919A (en) Speech enhancement method, system, electronic device, and medium incorporating AI model
CN113314135B (en) Voice signal identification method and device
JP6748304B2 (en) Signal processing device using neural network, signal processing method using neural network, and signal processing program
CN107360497B (en) Calculation method and device for estimating reverberation component
CN116013344A (en) Speech enhancement method under multiple noise environments
CN107393553B (en) Auditory feature extraction method for voice activity detection
CN107346658B (en) Reverberation suppression method and device
EP2774147B1 (en) Audio signal noise attenuation
CN115424627A (en) Voice enhancement hybrid processing method based on convolution cycle network and WPE algorithm
WO2020078210A1 (en) Adaptive estimation method and device for post-reverberation power spectrum in reverberation speech signal
Nie et al. Deep Noise Tracking Network: A Hybrid Signal Processing/Deep Learning Approach to Speech Enhancement.
CN107393558B (en) Voice activity detection method and device
CN107393559B (en) Method and device for checking voice detection result
JP7383122B2 (en) Method and apparatus for normalizing features extracted from audio data for signal recognition or modification
Miyazaki et al. Theoretical analysis of parametric blind spatial subtraction array and its application to speech recognition performance prediction

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant
TR01: Transfer of patent right (effective date of registration: 2022-11-29)

Address after: 2C1, Plant 2, Baimenqian Industrial Zone, No. 215, Busha Road, Nanlong Community, Nanwan Street, Longgang District, Shenzhen, Guangdong 518000

Patentee after: Shenzhen Yajin Smart Technology Co., Ltd.

Address before: 518000 Jinhua building, Longfeng 3rd road, Dalang street, Longhua New District, Shenzhen City, Guangdong Province

Patentee before: SHENZHEN YONSZ INFORMATION TECHNOLOGY CO., LTD.