CN111292758B

CN111292758B - Voice activity detection method and device and readable storage medium

Info

Publication number: CN111292758B
Application number: CN201910184966.8A
Authority: CN
Inventors: 孟建华; 董斐; 張維城; 戚萌; 林福辉
Original assignee: Spreadtrum Communications Shanghai Co Ltd
Current assignee: Spreadtrum Communications Shanghai Co Ltd
Priority date: 2019-03-12
Filing date: 2019-03-12
Publication date: 2022-10-25
Anticipated expiration: 2039-03-12
Also published as: CN111292758A

Abstract

A voice activity detection method and device and a readable storage medium are provided. The voice activity detection method comprises the following steps: acquiring a collected voice signal; judging whether voice activity exists in the voice signal by using a voice noise reduction algorithm and a harmonic detection algorithm respectively; and determining that voice activity is detected from the voice signal when both the voice noise reduction algorithm and the harmonic detection algorithm judge that voice activity exists in the voice signal. With this scheme, voice activity can be detected accurately.

Description

Voice activity detection method and device and readable storage medium

Technical Field

The invention belongs to the technical field of voice, and particularly relates to a voice activity detection method and device and a readable storage medium.

Background

Traditional voice activity detection methods, such as those based on volume or microphone level, often produce false detections in noisy environments. False detections in turn cause false activation of voice equipment, which wastes power and can disturb other people.

Most existing voice activity detection methods are based on sound energy judgment and have two main defects: first, voice cannot be accurately distinguished in a noisy environment, for example in noisy public places or with the wind noise that often occurs outdoors; second, in a quiet environment, sudden non-voice sounds such as a telephone ring or a door closing are easily misdetected as voice.

Disclosure of Invention

The embodiment of the invention addresses the problem of how to detect voice activity accurately.

To solve the foregoing technical problem, an embodiment of the present invention provides a voice activity detection method, which includes: acquiring a collected voice signal; judging whether voice activity exists in the voice signal by using a voice noise reduction algorithm and a harmonic detection algorithm respectively; and when both the voice noise reduction algorithm and the harmonic detection algorithm judge that voice activity exists in the voice signal, determining that voice activity is detected from the voice signal.

Optionally, the determining whether the voice signal has voice activity by respectively adopting a voice noise reduction algorithm and a harmonic detection algorithm includes: performing voice noise reduction calculation on the voice signal to obtain a noise-reduced voice signal; calculating the energy corresponding to the voice signal and the energy corresponding to the voice signal after noise reduction to obtain the energy ratio of the voice signal before and after noise reduction; and when the energy ratio is smaller than a preset first energy ratio threshold, judging that voice activity exists in the voice signal.

Optionally, the determining whether the voice signal has voice activity by respectively adopting a voice noise reduction algorithm and a harmonic detection algorithm includes: performing fast Fourier transform on the voice signal to obtain a voice signal frequency domain amplitude spectrum; performing voice noise reduction calculation on the voice signal by adopting a wiener filtering noise reduction algorithm to obtain a noise-reduced voice signal frequency domain amplitude spectrum; calculating the energy ratio of the voice signal frequency domain amplitude spectrum before and after noise reduction according to the wiener filter function, the voice signal frequency domain amplitude spectrum after noise reduction and the voice signal frequency domain amplitude spectrum; the wiener filter function is obtained by calculation according to the wiener filtering noise reduction algorithm and the noise estimation value of the voice signal; the noise estimation value is obtained by adopting a noise estimation algorithm; and when the energy ratio is smaller than a preset second energy ratio threshold, judging that voice activity exists in the voice signal.

Optionally, the energy ratio of the voice signal frequency domain amplitude spectrum before and after noise reduction is calculated by using the following formula:

where $E_w$ is the energy ratio of the voice signal frequency domain amplitude spectrum before and after noise reduction, $k_s$ is a preset starting frequency point, $k_e$ is a preset ending frequency point, $Y(k)$ is the voice signal frequency domain amplitude spectrum, and $S'(k)$ is the voice signal frequency domain amplitude spectrum after noise reduction.
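As an illustration of the Wiener-filter check described above, the following Python sketch computes a per-bin gain, applies it to the amplitude spectrum, and compares the band energy before and after noise reduction. The formula for $E_w$ is not reproduced in this text, so the plain ratio of summed squared magnitudes used here is an assumption, as are the gain form `max(1 - D^2/Y^2, floor)`, the function names, and the `floor` value.

```python
def wiener_gain(y_mag, noise_mag, floor=0.05):
    """Per-bin Wiener-type gain (assumed form): H[k] = max(1 - D[k]^2 / Y[k]^2, floor)."""
    gains = []
    for y, d in zip(y_mag, noise_mag):
        if y <= 0.0:
            gains.append(floor)
        else:
            gains.append(max(1.0 - (d * d) / (y * y), floor))
    return gains


def energy_ratio(y_mag, noise_mag, ks, ke, floor=0.05):
    """Energy before noise reduction divided by energy after, over bins ks..ke inclusive.

    Assumption: a plain ratio of summed squared magnitudes stands in for E_w.
    """
    gains = wiener_gain(y_mag, noise_mag, floor)
    s_mag = [h * y for h, y in zip(gains, y_mag)]  # S'(k) = H(k) * Y(k)
    e_before = sum(y * y for y in y_mag[ks:ke + 1])
    e_after = sum(s * s for s in s_mag[ks:ke + 1])
    return e_before / e_after if e_after > 0.0 else float("inf")
```

Under these assumptions, a frame that stands well above the noise estimate keeps most of its energy after filtering, so the ratio stays close to 1. A frame at the noise level is strongly attenuated, so the ratio becomes large. This matches the rule that a ratio below the threshold indicates voice activity.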

Optionally, the preset second energy ratio threshold is positively correlated with the ratio of the energy corresponding to the voice signal to the energy corresponding to the noise estimation value.

Optionally, judging whether voice activity exists in the voice signal by using a voice noise reduction algorithm and a harmonic detection algorithm respectively includes: when the voice signal lies within a preset voice fundamental frequency range and contains harmonic features, judging that voice activity exists in the voice signal.

Optionally, judging whether voice activity exists in the voice signal by using a voice noise reduction algorithm and a harmonic detection algorithm respectively includes: performing fast Fourier transform on the voice signal to obtain a voice signal frequency domain amplitude spectrum; determining the number of peak values of the voice signal frequency domain amplitude spectrum, where a peak value is determined as follows: when the amplitude corresponding to the ith frequency point in the voice signal frequency domain amplitude spectrum is larger than the maximum of the amplitude corresponding to the (i+1)th frequency point, the amplitude corresponding to the (i-1)th frequency point, and a preset amplitude threshold corresponding to the ith frequency point, the amplitude corresponding to the ith frequency point is determined to be a peak value of the voice signal frequency domain amplitude spectrum; and when the number of peak values exceeds a preset peak number threshold, judging that voice activity exists in the voice signal.

Optionally, the preset amplitude threshold corresponding to the ith frequency point is obtained as follows: performing fast Fourier transform on the voice signal to obtain a voice signal frequency domain amplitude spectrum; and selecting, as the preset amplitude threshold corresponding to the ith frequency point, the maximum of the following three values: the noise estimation value of the voice signal, the mean value of the voice signal frequency domain amplitude spectrum, and the minimum of the voice signal frequency domain amplitude spectrum from the (i-1)th frequency point to the (i+1)th frequency point; where the noise estimation value of the voice signal is calculated by using a noise estimation algorithm.

Optionally, after determining the number of peak values of the voice signal frequency domain amplitude spectrum, the method further includes: taking the frequency index value corresponding to each peak value in turn as a fundamental frequency, and calculating the frequency multiplication deviation between the fundamental frequency and the frequency index value of each peak value that follows the peak value corresponding to the fundamental frequency; excluding a peak value when its frequency multiplication deviation is larger than a preset deviation threshold; calculating a weighted value of all remaining peak values according to the frequency multiplication deviations and the remaining peak values; comparing the weighted values obtained for the respective fundamental frequencies and selecting the maximum weighted value; and when the maximum weighted value is greater than a preset weighting threshold, judging that voice activity exists in the voice signal.

Optionally, the weighted value of all remaining peak values is calculated by using the following formula: $E_h = \sum_n \alpha_n Y[p_n]$; where $E_h$ is the weighted value of all remaining peak values, $p_n$ is the frequency index value corresponding to the nth remaining peak value, $Y[p_n]$ is the frequency domain amplitude spectrum corresponding to the nth remaining peak value, and $\alpha_n$ is a preset weight coefficient, $\alpha_n \in (0, 1]$.

Optionally, the frequency multiplication deviation is calculated by using the following formula:

where $\Delta f$ is the frequency multiplication deviation, $p_n$ is the frequency index value corresponding to the nth peak value, and $p_{bb}$ is the frequency index value taken as the fundamental frequency.
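The candidate-fundamental search described above can be sketched in Python as follows. The deviation formula is not reproduced in this text, so the sketch assumes the deviation is the distance of `p_n / p_b` from the nearest integer. The weight `1 - deviation`, the inclusion of the fundamental itself in the sum, the function name `harmonic_score`, and the default `max_dev` are likewise illustrative assumptions.

```python
def harmonic_score(peak_bins, y_mag, max_dev=0.1):
    """Return the largest weighted peak sum over all candidate fundamentals.

    peak_bins: ascending frequency indices (> 0) of detected peaks.
    y_mag:     amplitude spectrum indexed by frequency bin.
    Assumed deviation: |p_n / p_b - round(p_n / p_b)|.
    Assumed weight:    alpha_n = 1 - deviation, which stays inside (0, 1].
    """
    best = 0.0
    for i, p_b in enumerate(peak_bins):
        score = 0.0
        # Only the candidate fundamental and the peaks after it are considered.
        for p_n in peak_bins[i:]:
            ratio = p_n / p_b
            dev = abs(ratio - round(ratio))
            if dev > max_dev:
                continue  # not close enough to an integer multiple: exclude
            score += (1.0 - dev) * y_mag[p_n]
        best = max(best, score)
    return best
```

The returned value would then be compared with the preset weighting threshold. A stray peak that is not near a multiple of the candidate fundamental contributes nothing to that candidate's score.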

Optionally, the preset weighting threshold is positively correlated with the ratio of the energy corresponding to the voice signal to the energy corresponding to the noise estimation value, where the noise estimation value of the voice signal is calculated by using a noise estimation algorithm.

Optionally, the preset amplitude threshold corresponding to the ith frequency point is calculated by using the following formula: $Y_{thr} = 2\max(\mathrm{mean}(Y),\ D[k],\ \min(Y[k-2], \ldots, Y[k+2]))$; where $Y_{thr}$ is the preset amplitude threshold corresponding to the ith frequency point, $Y$ is the voice signal frequency domain amplitude spectrum, $D[k]$ is the noise estimation value, $k$ is the frequency point, and $Y[k-2], \ldots, Y[k+2]$ are the values of the voice signal frequency domain amplitude spectrum at frequency points k-2 to k+2.
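A minimal Python sketch of the threshold formula and the peak test follows. The function names are hypothetical, and clamping the five-point window at the ends of the spectrum is an assumption, since the text does not say how edge bins are handled.

```python
def amplitude_threshold(y_mag, noise_mag, k):
    """Y_thr = 2 * max(mean(Y), D[k], min(Y[k-2..k+2]))."""
    lo, hi = max(0, k - 2), min(len(y_mag), k + 3)  # clamp at edges (assumption)
    local_min = min(y_mag[lo:hi])
    mean_y = sum(y_mag) / len(y_mag)
    return 2.0 * max(mean_y, noise_mag[k], local_min)


def find_peaks(y_mag, noise_mag):
    """Bins whose amplitude exceeds both neighbours and the per-bin threshold."""
    peaks = []
    for i in range(1, len(y_mag) - 1):
        thr = amplitude_threshold(y_mag, noise_mag, i)
        if y_mag[i] > max(y_mag[i - 1], y_mag[i + 1], thr):
            peaks.append(i)
    return peaks
```

The length of the returned list is the peak count that is compared with the preset peak number threshold.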

Optionally, the determining that voice activity is detected from the voice signal includes: calculating the energy corresponding to the voice signal; and when the energy corresponding to the voice signal is larger than a preset energy threshold value, judging that voice activity is detected from the voice signal.

Optionally, the energy corresponding to the voice signal is calculated by using the following formula: $E_{abs} = \sum_k (Y[k])^2$; where $E_{abs}$ is the energy corresponding to the voice signal, $Y[k]$ is the voice signal frequency domain amplitude spectrum, and $k$ ranges over a preset frequency range; the voice signal frequency domain amplitude spectrum is obtained by performing fast Fourier transform on the voice signal.

Optionally, the calculating the energy corresponding to the voice signal includes: carrying out noise estimation on the voice signal to obtain a noise estimation value; calculating the ratio of the energy corresponding to the voice signal to the energy corresponding to the noise estimation value; and when the ratio of the energy corresponding to the voice signal to the energy corresponding to the noise estimation value is greater than a preset third energy ratio threshold value, judging that voice activity is detected from the voice signal.

Optionally, the ratio of the energy corresponding to the voice signal to the energy corresponding to the noise estimation value is calculated by using the following formula: $E_{vs} = \log(E_{abs}) - \log(E_n)$; where $E_{vs}$ is the ratio of the energy corresponding to the voice signal to the energy corresponding to the noise estimation value, $E_{abs}$ is the energy corresponding to the voice signal, $E_n$ is the energy corresponding to the noise estimation value, $E_n = \sum_k (D[k])^2$, and $D[k]$ is the noise estimation value.
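The two energy formulas can be sketched in Python as follows. The text does not state the base of the logarithm, so the natural logarithm is assumed. The small `eps` guard against `log(0)` and the function names are also additions for illustration.

```python
import math


def band_energy(mag, k_lo, k_hi):
    """Sum of squared magnitudes over bins k_lo..k_hi inclusive."""
    return sum(m * m for m in mag[k_lo:k_hi + 1])


def log_energy_ratio(y_mag, noise_mag, k_lo, k_hi, eps=1e-12):
    """E_vs = log(E_abs) - log(E_n), natural log assumed; eps avoids log(0)."""
    e_abs = band_energy(y_mag, k_lo, k_hi)
    e_n = band_energy(noise_mag, k_lo, k_hi)
    return math.log(e_abs + eps) - math.log(e_n + eps)
```

A frame would pass this check when `log_energy_ratio(...)` exceeds the preset third energy ratio threshold.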

Optionally, after determining that voice activity is detected from the voice signal, the method further includes: when the detected voice activity appears after continuous non-voice activity and the number of the continuous non-voice activity frames exceeds a preset first frame number threshold, caching the voice activity, and when the number of the voice frames of the voice activity exceeds a preset second frame number threshold, outputting a voice signal corresponding to the voice activity; when the detected non-voice activity appears after continuous voice activity, and the frame number of the continuous voice activity exceeds a preset third frame number threshold value, continuing voice activity detection, and when the frame number of the non-voice activity exceeds a preset fourth frame number threshold value, stopping outputting a voice signal corresponding to the voice activity.
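The frame-count logic above behaves like an onset and hangover smoother. The Python sketch below is a simplified version: it keeps only an onset count (playing the role of the second frame number threshold) and a hangover count (playing the role of the fourth), and it omits the first and third thresholds and the caching of onset frames. The function name and default values are hypothetical.

```python
def smooth_decisions(raw, onset_frames=3, hangover_frames=5):
    """Turn per-frame voice decisions into a smoothed output-active flag.

    Output starts once more than onset_frames consecutive voice frames are seen,
    and stops once more than hangover_frames consecutive non-voice frames are seen.
    """
    out = []
    active = False
    voiced_run = 0
    silent_run = 0
    for is_voice in raw:
        if is_voice:
            voiced_run += 1
            silent_run = 0
        else:
            silent_run += 1
            voiced_run = 0
        if not active and voiced_run > onset_frames:
            active = True
        elif active and silent_run > hangover_frames:
            active = False
        out.append(active)
    return out
```

In this simplified form, an isolated voice frame never starts output, and a short pause inside speech never stops it.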

Optionally, after determining that voice activity is detected from the voice signal, the method further includes: within a preset mixed frame number threshold range, when the detected voice activity and non-voice activity occur alternately, calculating the proportion of the voice activity frame number to the sum of the voice activity frame number and the non-voice activity frame number; and when the proportion is larger than a preset proportion threshold value, outputting a voice signal corresponding to the voice activity.
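For the alternating case, the proportion test can be sketched in Python as below. The window is assumed to be a list of per-frame booleans no longer than the mixed frame number threshold, and the function names and the default threshold of 0.6 are illustrative only.

```python
def voice_proportion(decisions):
    """Voice frames divided by (voice frames + non-voice frames) in the window."""
    total = len(decisions)
    return sum(1 for d in decisions if d) / total if total else 0.0


def should_output(decisions, proportion_threshold=0.6):
    """True when the proportion of voice frames exceeds the threshold."""
    return voice_proportion(decisions) > proportion_threshold
```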

Optionally, the speech noise reduction algorithm is at least one of the following algorithms: LMS, NLMS, spectral subtraction, and wiener filtering algorithms.

Optionally, the harmonic detection algorithm is at least one of the following algorithms: autocorrelation function methods, cepstrum methods, linear prediction methods, and wavelet methods.

To solve the above technical problem, an embodiment of the present invention further discloses a voice activity detection apparatus, including: an acquisition unit, configured to acquire a collected voice signal; a first judging unit, configured to judge whether voice activity exists in the voice signal by using a voice noise reduction algorithm and a harmonic detection algorithm respectively; and a second judging unit, configured to determine that voice activity is detected from the voice signal when both the voice noise reduction algorithm and the harmonic detection algorithm judge that voice activity exists in the voice signal.

Optionally, the first judging unit is configured to: performing voice noise reduction calculation on the voice signal to obtain a noise-reduced voice signal; calculating the energy corresponding to the voice signal and the energy corresponding to the voice signal after noise reduction to obtain the energy ratio of the voice signal before and after noise reduction; and when the energy ratio is smaller than a preset first energy ratio threshold, judging that voice activity exists in the voice signal.

Optionally, the first judging unit is configured to: performing fast Fourier transform on the voice signal to obtain a voice signal frequency domain amplitude spectrum; performing voice noise reduction calculation on the voice signal by adopting a wiener filtering noise reduction algorithm to obtain a noise-reduced voice signal frequency domain amplitude spectrum; calculating the energy ratio of the voice signal frequency domain amplitude spectrum before and after noise reduction according to the wiener filter function, the voice signal frequency domain amplitude spectrum after noise reduction and the voice signal frequency domain amplitude spectrum; the wiener filter function is obtained by calculation according to the wiener filter noise reduction algorithm and the noise estimation value of the voice signal; the noise estimation value is obtained by adopting a noise estimation algorithm; and when the energy ratio is smaller than a preset second energy ratio threshold, judging that voice activity exists in the voice signal.

Optionally, the energy ratio of the frequency domain amplitude spectrum of the speech signal before and after noise reduction is calculated by using the following formula:

Optionally, the first judging unit is configured to: when the voice signal lies within a preset voice fundamental frequency range and contains harmonic features, judge that voice activity exists in the voice signal.

Optionally, the first judging unit is configured to: perform fast Fourier transform on the voice signal to obtain a voice signal frequency domain amplitude spectrum; determine the number of peak values of the voice signal frequency domain amplitude spectrum, where a peak value is determined as follows: when the amplitude corresponding to the ith frequency point in the voice signal frequency domain amplitude spectrum is larger than the maximum of the amplitude corresponding to the (i+1)th frequency point, the amplitude corresponding to the (i-1)th frequency point, and a preset amplitude threshold corresponding to the ith frequency point, the amplitude corresponding to the ith frequency point is determined to be a peak value of the voice signal frequency domain amplitude spectrum; and when the number of peak values exceeds a preset peak number threshold, judge that voice activity exists in the voice signal.

Optionally, the preset amplitude threshold corresponding to the ith frequency point is obtained as follows: performing fast Fourier transform on the voice signal to obtain a voice signal frequency domain amplitude spectrum; and selecting, as the preset amplitude threshold corresponding to the ith frequency point, the maximum of the following three values: the noise estimation value of the voice signal, the mean value of the voice signal frequency domain amplitude spectrum, and the minimum of the voice signal frequency domain amplitude spectrum from the (i-1)th frequency point to the (i+1)th frequency point; where the noise estimation value of the voice signal is calculated by using a noise estimation algorithm.

Optionally, the first judging unit is further configured to: take the frequency index value corresponding to each peak value in turn as a fundamental frequency, and calculate the frequency multiplication deviation between the fundamental frequency and the frequency index value of each peak value that follows the peak value corresponding to the fundamental frequency; exclude a peak value when its frequency multiplication deviation is larger than a preset deviation threshold; calculate a weighted value of all remaining peak values according to the frequency multiplication deviations and the remaining peak values; compare the weighted values obtained for the respective fundamental frequencies and select the maximum weighted value; and when the maximum weighted value is larger than a preset weighting threshold, judge that voice activity exists in the voice signal.

Optionally, the weighted value of all remaining peak values is calculated by using the following formula: $E_h = \sum_n \alpha_n Y[p_n]$; where $E_h$ is the weighted value of all remaining peak values, $p_n$ is the frequency index value corresponding to the nth remaining peak value, $Y[p_n]$ is the frequency domain amplitude spectrum corresponding to the nth remaining peak value, and $\alpha_n$ is a preset weight coefficient, $\alpha_n \in (0, 1]$.

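Optionally, the frequency multiplication deviation is calculated by using the following formula:

where $\Delta f$ is the frequency multiplication deviation, $p_n$ is the frequency index value corresponding to the nth peak value, and $p_{bb}$ is the frequency index value taken as the fundamental frequency.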

Optionally, the preset amplitude threshold corresponding to the ith frequency point is calculated by using the following formula: $Y_{thr} = 2\max(\mathrm{mean}(Y),\ D[k],\ \min(Y[k-2], \ldots, Y[k+2]))$; where $Y_{thr}$ is the preset amplitude threshold corresponding to the ith frequency point, $Y$ is the voice signal frequency domain amplitude spectrum, $D[k]$ is the noise estimation value, $k$ is the frequency point, and $Y[k-2], \ldots, Y[k+2]$ are the values of the voice signal frequency domain amplitude spectrum at frequency points k-2 to k+2.

Optionally, the second judging unit is configured to: calculating the energy corresponding to the voice signal; and when the energy corresponding to the voice signal is larger than a preset energy threshold value, judging that voice activity is detected from the voice signal.

Optionally, the energy corresponding to the voice signal is calculated by using the following formula: $E_{abs} = \sum_k (Y[k])^2$; where $E_{abs}$ is the energy corresponding to the voice signal, $Y[k]$ is the voice signal frequency domain amplitude spectrum, and $k$ ranges over a preset frequency range; the voice signal frequency domain amplitude spectrum is obtained by performing fast Fourier transform on the voice signal.

Optionally, the second judging unit is configured to: perform noise estimation on the voice signal to obtain a noise estimation value; calculate the ratio of the energy corresponding to the voice signal to the energy corresponding to the noise estimation value; and when the ratio of the energy corresponding to the voice signal to the energy corresponding to the noise estimation value is larger than a preset third energy ratio threshold, judge that voice activity is detected from the voice signal.

Optionally, the ratio of the energy corresponding to the voice signal to the energy corresponding to the noise estimation value is calculated by using the following formula: $E_{vs} = \log(E_{abs}) - \log(E_n)$; where $E_{vs}$ is the ratio of the energy corresponding to the voice signal to the energy corresponding to the noise estimation value, $E_{abs}$ is the energy corresponding to the voice signal, $E_n$ is the energy corresponding to the noise estimation value, $E_n = \sum_k (D[k])^2$, and $D[k]$ is the noise estimation value.

Optionally, the second judging unit is further configured to: when the detected voice activity appears after continuous non-voice activity and the number of continuous non-voice activity frames exceeds a preset first frame number threshold, cache the voice activity, and when the number of voice frames of the voice activity exceeds a preset second frame number threshold, output a voice signal corresponding to the voice activity; when the detected non-voice activity appears after continuous voice activity and the number of continuous voice activity frames exceeds a preset third frame number threshold, continue voice activity detection, and when the number of non-voice activity frames exceeds a preset fourth frame number threshold, stop outputting the voice signal corresponding to the voice activity.

Optionally, the second judging unit is further configured to: within a preset mixed frame number threshold range, when the detected voice activity and non-voice activity occur alternately, calculate the proportion of the number of voice activity frames to the sum of the number of voice activity frames and the number of non-voice activity frames; and output a voice signal corresponding to the voice activity when the proportion is larger than a preset proportion threshold.

The embodiment of the invention also discloses a computer-readable storage medium, which is a nonvolatile storage medium or a non-transient storage medium, and is stored with computer instructions, and the computer instructions execute the steps of any one of the voice activity detection methods when running.

The embodiment of the present invention further provides a voice activity detection apparatus, which includes a memory and a processor, where the memory stores computer instructions that can be executed on the processor, and the processor performs the steps of any one of the voice activity detection methods described above when executing the computer instructions.

Compared with the prior art, the technical scheme of the embodiment of the invention has the following beneficial effects:

A voice noise reduction algorithm and a harmonic detection algorithm are each used to judge whether voice activity exists in the voice signal, and voice activity is determined to be detected from the voice signal only when both algorithms judge that voice activity exists. The voice noise reduction algorithm filters out the interference of steady-state noise with voice activity detection, and the harmonic detection algorithm filters out the interference of non-steady-state noise. Requiring both judgments to agree therefore greatly improves the accuracy of voice activity detection.

Further, when the energy corresponding to the voice signal is larger than a preset energy threshold value, the voice activity is judged to be detected from the voice signal. The accuracy of voice activity detection can be improved by energy determination of the voice signal.

Further, when the detected voice activity occurs after continuous non-voice activity and the number of the continuous non-voice activity frames exceeds a preset first frame number threshold, caching the voice activity, and when the number of the voice frames of the voice activity exceeds a preset second frame number threshold, outputting a voice signal corresponding to the voice activity; when the detected non-voice activity occurs after continuous voice activity and the frame number of the continuous voice activity exceeds a preset third frame number threshold, voice activity detection is continued, and when the frame number of the non-voice activity exceeds a preset fourth frame number threshold, voice signals corresponding to the voice activity are stopped being output. The false detection of individual voice frames can be eliminated, and the interference of some transient voices to the voice activity detection is also filtered.

Drawings

FIG. 1 is a flow chart of a method of voice activity detection in an embodiment of the present invention;

FIG. 2 is a waveform diagram of a noisy speech signal in an embodiment of the present invention;

FIG. 3 is a diagram illustrating a voice activity detection result of noisy speech using a speech noise reduction algorithm in an embodiment of the present invention;

FIG. 4 is a schematic diagram of a voice activity detection result of noisy speech using a harmonic detection algorithm in an embodiment of the present invention;

FIG. 5 is a waveform diagram of the blowing sound and the wind sound in the embodiment of the present invention;

FIG. 6 is a diagram illustrating the detection results of the voice activity of the blowing sound and the wind sound by using the wiener filtering algorithm in the embodiment of the present invention;

FIG. 7 is a schematic diagram of the detection result of the voice activity of the blowing sound and the wind sound by the peak detection algorithm in the embodiment of the present invention;

FIG. 8 is a waveform of an alarm sound in an embodiment of the present invention;

FIG. 9 is a schematic diagram illustrating the detection of an alarm voice activity using a wiener filtering algorithm in an embodiment of the present invention;

FIG. 10 is a schematic representation of the detection of voice activity of an alarm sound using a peak detection algorithm in an embodiment of the present invention;

FIG. 11 is a schematic structural diagram of a voice activity detection apparatus in an embodiment of the present invention.

Detailed Description

In the prior art, most voice activity detection methods are based on sound energy judgment and have two main disadvantages: first, voice cannot be accurately distinguished in a noisy environment, for example in noisy public places or with the wind noise that often occurs outdoors; second, in a quiet environment, sudden non-voice sounds such as a telephone ring or a door closing are easily misdetected as voice.

In the embodiment of the invention, a voice noise reduction algorithm and a harmonic detection algorithm are each used to judge whether voice activity exists in the voice signal, and voice activity is determined to be detected from the voice signal only when both algorithms judge that voice activity exists. The voice noise reduction algorithm filters out the interference of steady-state noise with voice activity detection, and the harmonic detection algorithm filters out the interference of non-steady-state noise. Requiring both judgments to agree therefore greatly improves the accuracy of voice activity detection.

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.

An embodiment of the present invention provides a voice activity detection method, which is described in detail below with reference to fig. 1 through specific steps.

Step S101: acquire the collected voice signal.

In implementations, the speech signal may be captured by a device having audio capture functionality. In practical application, the collected voice signals usually carry noise signals, so that voice activity detection processing can be performed on the collected voice signals to judge whether corresponding voice activities exist in the voice signals.

In practical applications, the collected voice signal may have a certain dc drift, which may affect the subsequent processing. Therefore, the collected voice signals can be subjected to direct current removing processing to obtain the voice signals without direct current drift interference.

For example, a first-order filtering algorithm may be used to perform DC removal on the collected voice signal. The processing is as follows: a DC component estimate is obtained from the collected voice signal and a filter coefficient, and the DC component estimate is then subtracted from the collected voice signal to obtain a voice signal free of DC drift interference.

In the embodiment of the present invention, the dc component estimated value may be calculated by using the following formula (1):

$$\mathrm{dc}(n) = \alpha\, y(n) + (1 - \alpha)\, \mathrm{dc}(n-1) \tag{1}$$

where $\mathrm{dc}(n)$ is the DC component estimate for the nth frame, with initial value 0, i.e. $\mathrm{dc}(1) = 0$; $\mathrm{dc}(n-1)$ is the DC component estimate for the (n-1)th frame; $y(n)$ is the collected voice signal; and $\alpha$ is the filter coefficient, $\alpha \in (0, 1)$. The smaller the filter coefficient $\alpha$, the more stable the DC component estimate, but the lower the sensitivity; the larger the filter coefficient $\alpha$, the higher the sensitivity, but the less stable the DC component estimate. A user can therefore select a suitable filter coefficient $\alpha$ according to the sensitivity and stability that are acceptable in practical use, which is not limited herein.
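Formula (1) can be sketched in Python as a running estimate that is subtracted from each input value. The function name `remove_dc` and the default `alpha` are hypothetical. The sketch applies the recursion to a plain sequence of values and keeps the first estimate at 0, following the stated initial value.

```python
def remove_dc(samples, alpha=0.01):
    """Subtract a first-order DC estimate dc(n) = alpha*y(n) + (1-alpha)*dc(n-1)."""
    out = []
    dc = 0.0  # dc(1) = 0, as stated for formula (1)
    for n, y in enumerate(samples):
        if n > 0:
            dc = alpha * y + (1.0 - alpha) * dc
        out.append(y - dc)
    return out
```

With a constant input, the estimate converges to that constant and the output decays toward zero. A smaller `alpha` makes this convergence slower but the estimate steadier, which is the trade-off described above.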

It is understood that the dc removal process described above is described using a first order filtering algorithm. In a specific application, a user may select different methods to perform dc removal processing on the collected voice signal according to the own requirements, which is not described herein.

Step S102, a voice noise reduction algorithm and a harmonic detection algorithm are respectively adopted to judge whether voice activity exists in the voice signal.

In specific implementation, windowing and framing processing can be performed on a voice signal to perform fast fourier transform, so as to obtain a short-time frequency domain magnitude spectrum of the voice signal; and judging whether voice activity exists in the voice signal or not by respectively adopting a voice noise reduction algorithm and a harmonic detection algorithm according to the voice signal and the short-time frequency domain amplitude spectrum of the voice signal.
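As a rough sketch of the framing, windowing and transform step, the following pure-Python code splits a signal into overlapping frames, applies a Hann window, and computes the magnitude spectrum. The window type, frame length, hop size and all function names are assumptions for illustration. A naive DFT is used here in place of a fast Fourier transform only to keep the example dependency-free; it produces the same magnitudes.

```python
import cmath
import math


def frames(signal, frame_len, hop):
    """Split a signal into overlapping frames; an incomplete tail is dropped."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]


def hann(n_points):
    """Periodic Hann window (assumed window type)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_points)
            for n in range(n_points)]


def magnitude_spectrum(frame):
    """Magnitudes of bins 0..N/2 of a windowed frame (naive O(N^2) DFT)."""
    n_points = len(frame)
    x = [s * w for s, w in zip(frame, hann(n_points))]
    spectrum = []
    for k in range(n_points // 2 + 1):
        acc = sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_points)
                  for n in range(n_points))
        spectrum.append(abs(acc))
    return spectrum


# A 1000 Hz tone sampled at 8000 Hz falls on bin 8 of a 64-point frame.
tone = [math.sin(2.0 * math.pi * 1000.0 * n / 8000.0) for n in range(128)]
spectra = [magnitude_spectrum(f) for f in frames(tone, 64, 32)]
```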

Because the collected voice signal usually contains various types of noise, effective voice activity needs to be detected from the voice signal, so that the voice equipment can perform subsequent operations such as extraction or output on the voice activity. In the embodiment of the present invention, the voice activity may refer to: speech output by a user using the speech device.

In specific implementation, a speech noise reduction algorithm can be adopted to calculate the ratio of the energy corresponding to the speech signal before and after noise reduction, and then whether the speech signal has speech activity is judged; meanwhile, a harmonic detection algorithm can be adopted to calculate whether the harmonic energy of the voice signal conforms to the characteristics of voice activity, and then whether the voice activity exists in the voice signal is judged.

In a specific implementation, noise reduction may first be performed on the collected voice signal. In the embodiment of the present invention, at least one of the following voice noise reduction algorithms may be used: LMS, NLMS, spectral subtraction, and Wiener filtering. The user can select one algorithm or a combination of algorithms according to the actual situation. After voice activity is detected in the voice signal, the frequency-domain magnitude spectrum of the noise-reduced voice signal is converted back to time-domain data and the noise-reduced voice signal is output, which improves the clarity of the voice.

In a specific implementation, the harmonic detection algorithm may include at least one of the following algorithms: autocorrelation function methods, cepstrum methods, linear prediction methods, and wavelet methods. The user can select one or more algorithms to be combined for harmonic detection according to the actual situation of the user, which is not described herein.

In a specific implementation, the determining whether voice activity exists in the voice signal by using a voice noise reduction algorithm may include the following processes: firstly, carrying out voice noise reduction calculation on a voice signal to obtain a noise-reduced voice signal; then calculating the energy corresponding to the voice signal and the energy corresponding to the voice signal after noise reduction to obtain the energy ratio of the voice signal before and after noise reduction; and when the energy ratio is smaller than a preset first energy ratio threshold, judging that voice activity exists in the voice signal.
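The decision rule described above can be sketched as follows, assuming that the energy is the sum of squared spectral magnitudes and that the ratio is energy before noise reduction divided by energy after. The function names and the threshold value 4.0 are hypothetical; the source does not give a concrete first energy ratio threshold.

```python
def energy(spectrum):
    """Sum of squared magnitudes of one frame."""
    return sum(v * v for v in spectrum)


def voice_by_energy_ratio(noisy_mag, denoised_mag, ratio_threshold=4.0):
    """Return True when E_before / E_after is below the threshold.

    If noise reduction removes little energy, the frame was mostly speech.
    """
    e_after = energy(denoised_mag)
    if e_after <= 0.0:
        return False  # everything was removed as noise
    return energy(noisy_mag) / e_after < ratio_threshold


speech_like = voice_by_energy_ratio([1.0, 4.0, 1.0], [0.5, 3.9, 0.5])
noise_like = voice_by_energy_ratio([1.0, 1.0, 1.0], [0.1, 0.1, 0.1])
```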

In a specific implementation, a wiener filtering algorithm may be used to determine whether voice activity is present in the voice signal.

The Wiener filtering algorithm designs a digital filter h(n) such that, when the noisy voice signal y(n) is input, the mean square error E[{y(n)*h(n) - s(n)}²] between the filter output y(n)*h(n) and the noise-free voice signal s(n) is minimized according to the minimum mean square error criterion, where * denotes convolution.

The frequency-domain magnitude spectrum estimator H(k) of the Wiener filter can be represented by the following formula (2):

H(k)＝P_s[k]/(P_s[k]+P_d[k])； (2)

wherein P_s[k] is the noise-free voice power spectrum and P_d[k] is the noise power spectrum. The estimator H(k) is obtained by performing a fast Fourier transform on the digital filter h(n), and each power spectrum is obtained by performing an autocorrelation operation on the corresponding signal followed by a fast Fourier transform, which is not described in detail below.

Defining the a priori signal-to-noise ratio SNR_prior[k] as the ratio of the noise-free voice power spectrum P_s[k] to the noise power spectrum P_d[k], formula (2) can be converted into the following formula (3):

H(k)＝SNR_prior[k]/(SNR_prior[k]+1)； (3)

Defining the a posteriori signal-to-noise ratio SNR_post[k] as the ratio of the noisy voice power spectrum P_y[k] to the noise power spectrum P_d[k], SNR_post[k] is expressed by the following formula (4):

SNR_post[k]＝P_y[k]/P_d[k]； (4)

Since the voice signal consists of a noise-free voice signal and a noise signal, the power spectrum P_y[k] of the voice signal is expressed by the following formula (5):

P_y[k]＝P_s[k]+P_d[k]； (5)

Therefore, equations (6) and (7) are obtained using the following derivation procedure:

wherein SNR_prior[k]＝SNR_post[k]-1; SNR_prior[k]_t is the a priori signal-to-noise ratio at time t, SNR_post[k]_t is the a posteriori signal-to-noise ratio at time t, SNR_prior[k]_(t-1) is the a priori signal-to-noise ratio at time t-1, Y[k] is the frequency-domain magnitude spectrum of the voice signal, D(k) is the noise estimate of the voice signal, t is time, and α is a smoothing parameter, α ∈ (0, 1). The larger α is, the stronger the noise suppression.
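The relations among the a posteriori SNR, the a priori SNR and the filter gain can be sketched per frequency bin as below. The sketch assumes the textbook Wiener gain H = SNR_prior / (SNR_prior + 1) for formula (3) and approximates the power spectra by squared magnitudes. The recursive smoothing with α shown here is an assumed form chosen to match the listed symbols, not the patent's own formula (7), which is not reproduced in this text. The function name, the `floor` guard and the default `alpha=0.9` are hypothetical.

```python
def wiener_gain(noisy_mag, noise_mag, prev_prior=None, alpha=0.9, floor=1e-12):
    """Per-bin Wiener gain and a priori SNR for one frame.

    Returns (gain, prior); pass `prior` back as `prev_prior` for the next frame.
    """
    gain, prior = [], []
    for k, (y, d) in enumerate(zip(noisy_mag, noise_mag)):
        snr_post = (y * y) / max(d * d, floor)  # P_y[k] / P_d[k]
        inst = max(snr_post - 1.0, 0.0)         # SNR_prior = SNR_post - 1, clamped at 0
        if prev_prior is None:
            snr_prior = inst
        else:
            # ASSUMED smoothing form; larger alpha leans on the previous frame.
            snr_prior = alpha * prev_prior[k] + (1.0 - alpha) * inst
        prior.append(snr_prior)
        gain.append(snr_prior / (snr_prior + 1.0))  # assumed form of formula (3)
    return gain, prior


g1, p1 = wiener_gain([2.0, 1.0], [1.0, 1.0])
g2, p2 = wiener_gain([1.0, 1.0], [1.0, 1.0], prev_prior=p1)
```

In the first call, the bin with twice the noise magnitude gets a gain of 0.75, while the bin at the noise level is fully suppressed.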

The voice signal frequency domain amplitude spectrum Y [ k ] can be obtained by collecting voice signals through voice collecting equipment, and the noise estimation value of the voice signals can be obtained through calculation of a noise estimation algorithm.

In practical applications, the noise estimation algorithm may include: a continuous spectrum minimum tracking method, recursive average noise estimation, histogram noise estimation, etc., and the invention is not limited to the noise estimation algorithm for noise estimation.

Taking a continuous spectrum minimum tracking method as an example, the method combines a spectrum short-time minimum algorithm and a time recursive average algorithm, and calculates a frequency domain amplitude spectrum D (k) of a noise estimation value by adopting the following formula (8):

wherein D[k]_t and D[k]_(t-1) respectively denote the frequency-domain magnitude spectrum of the noise estimate at the k-th frequency point for the current frame and the previous frame; Y[k]_t and Y[k]_(t-1) respectively denote the frequency-domain magnitude spectrum of the noisy voice signal at the k-th frequency point for the current frame and the previous frame, the initial frame being treated as pure noise by default, i.e. D[k]_0＝Y[k]_0; S[k]_t denotes the noise-free voice signal at the k-th frequency point of the current frame; and SM_1, SM_2 and SM_3 are smoothing factors, each greater than 0 and less than 1.

After the frequency-domain magnitude spectrum D(k) of the noise estimate is obtained, the filter function H(k)_t of the current frame can be obtained recursively from the filter function of the previous frame by combining formulas (3), (4) and (6). The frequency-domain magnitude spectrum of the noise-reduced voice signal can then be calculated using the following formula (9):

S'(k)＝Y(k)_t H(k)_t； (9)

In a specific implementation, Wiener filtering is used as the speech noise reduction algorithm to determine whether voice activity exists in the speech signal. First, a fast Fourier transform is performed on the speech signal to obtain its frequency-domain magnitude spectrum. Next, the Wiener filtering noise reduction algorithm is applied to the speech signal to obtain the frequency-domain magnitude spectrum of the noise-reduced speech signal. The energy ratio of the frequency-domain magnitude spectrum before and after noise reduction is then calculated from the Wiener filter function, the noise-reduced magnitude spectrum and the original magnitude spectrum. The Wiener filter function is calculated from the Wiener filtering noise reduction algorithm and the noise estimate of the speech signal, and the noise estimate is obtained using a noise estimation algorithm. When the energy ratio is smaller than a preset second energy ratio threshold, it is determined that voice activity exists in the speech signal.

In a specific implementation, the energy ratio of the frequency domain amplitude spectrum of the speech signal before and after the noise reduction is calculated by using the following formula (10):

In practical application, the difference value of the frequency domain amplitude spectrum of the voice signal before and after noise reduction in the logarithmic domain can be directly calculated to serve as an energy ratio value, so that the calculated amount is reduced; the square error of the frequency domain amplitude spectrum of the voice signal before and after noise reduction can be calculated to serve as an energy ratio, and then when the square error is smaller than a preset noise threshold value, the voice signal is judged to have voice activity, namely, the purpose of calculating the energy ratio corresponding to the voice signal before and after noise reduction can be achieved, and a specific calculation mode is not limited.

In practical application, the energy ratio of the frequency domain amplitude spectrum of the voice signal before and after noise reduction is calculated, and a preset second energy ratio threshold value can be stabilized within a certain range through the function characteristic of logarithm, so that the aim of accurately judging whether voice activity exists in the voice signal is fulfilled. It can be understood that the user may also set the difference or ratio of the energy of the amplitude spectrum of the frequency domain of the speech signal before and after the noise reduction in different function domains according to the self condition, which is not limited herein.
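The log-domain variant mentioned above can be sketched as follows. Formula (10) is not reproduced in this text, so the exact expression below, the log of the energy before noise reduction minus the log of the energy after, is an assumption, as are the function name and the small constant `eps` that guards against log(0). The noise-reduced spectrum is formed as S'(k) = Y(k)·H(k), following formula (9).

```python
import math


def log_energy_ratio(noisy_mag, gain, eps=1e-12):
    """log(E_before) - log(E_after) for one frame, with S'[k] = Y[k] * H[k]."""
    denoised = [y * h for y, h in zip(noisy_mag, gain)]
    e_before = sum(y * y for y in noisy_mag)
    e_after = sum(s * s for s in denoised)
    return math.log(e_before + eps) - math.log(e_after + eps)


# Halving every magnitude removes three quarters of the energy: log(4).
e_w = log_energy_ratio([2.0, 2.0], [0.5, 0.5])
```

Working in the log domain keeps the value in a moderate range regardless of the absolute signal level, which is why a fixed threshold is easier to choose.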

In a specific implementation, the preset second energy ratio threshold may be set as a fixed value according to a previous voice activity detection result, may also be set as a dynamic threshold according to a specific situation of a voice signal, and may also be positively correlated with the following value: and the ratio of the energy corresponding to the voice signal to the energy corresponding to the noise estimation value. The user may select one or more preset setting manners of the second energy ratio threshold according to different requirements of the user, which is not described herein again.

In a specific implementation, since the speech signal is composed of a fundamental frequency signal and a series of harmonic signals, when the speech signal is within a preset speech fundamental frequency range and contains harmonic features, it can be determined that speech activity exists in the speech signal. The harmonic features may be a complete harmonic series including a fundamental frequency, a first harmonic, a second harmonic, etc., or may be a continuous harmonic segment containing a second harmonic and a third harmonic. The voice activity is judged through the harmonic wave characteristics, the requirement on the quality of the voice signal is low, and the voice activity detection method has the capability of resisting various noise interferences.

Formants are important features that reflect the resonance characteristics of vocal tract, and appear as periodic maxima in the speech spectral envelope. Because the frequency of the harmonic signal is approximate to the integral multiple of the fundamental frequency signal, the maximum value of the frequency domain amplitude spectrum of a frame of voice and the corresponding frequency thereof can be detected by searching the mode with periodic harmonic energy, namely peak detection.

In practical applications, the low-frequency band of the speech signal generally contains more noise, while extending the peak detection search above 3000 Hz adds a large amount of computation and easily causes false detections because the amplitudes there are small. Considering noise interference and computational complexity, a larger search range is therefore not necessarily better, and the search should be confined to the main frequency range of the human voice.

In one embodiment of the present invention, the range of 80 Hz to 3000 Hz is selected for peak detection. The corresponding index range is 80N/f_s to 3000N/f_s, where N is the FFT length and f_s is the sampling rate. The user can adjust the search range as required, and the invention is not limited herein. After the search range is determined, detection proceeds sequentially through the index range. If the amplitude at the current index is greater than the amplitudes at the points immediately before and after it, and is also greater than the preset amplitude threshold for the current index, the current index is determined to be a peak and the total peak count is incremented by one. After the index range has been traversed, if fewer than two peaks have been detected, it is determined that voice activity is more likely to be absent from the voice signal.
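The mapping from a frequency band to FFT indices, index = f·N/f_s, can be illustrated with a short helper. The function name, the rounding direction (up at the low edge, down at the high edge) and the clamp to N/2 are assumptions for this sketch.

```python
import math


def search_index_range(fft_len, sample_rate, f_low=80.0, f_high=3000.0):
    """Map a frequency band in Hz to inclusive FFT bin indices."""
    lo = int(math.ceil(f_low * fft_len / sample_rate))
    hi = int(math.floor(f_high * fft_len / sample_rate))
    return lo, min(hi, fft_len // 2)  # never search beyond the Nyquist bin


# With N = 512 and f_s = 8000 Hz, 80 Hz..3000 Hz covers bins 6..192.
lo, hi = search_index_range(512, 8000)
```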

In specific implementation, the voice signal is firstly subjected to fast fourier transform to obtain a voice signal frequency domain amplitude spectrum, and then the number of peak values of the voice signal frequency domain amplitude spectrum is determined. And when the number of the peak values exceeds a preset threshold value of the number of the peak values, judging that voice activity exists in the voice signal.

A peak of the frequency-domain magnitude spectrum of the speech signal is determined as follows: when the magnitude at the i-th frequency point is greater than the largest of the magnitude at the (i+1)-th frequency point, the magnitude at the (i-1)-th frequency point, and the preset amplitude threshold corresponding to the i-th frequency point, the magnitude at the i-th frequency point is determined to be a peak of the frequency-domain magnitude spectrum of the speech signal.

In a specific implementation, the preset amplitude threshold corresponding to the i-th frequency point may be obtained as follows: a fast Fourier transform is performed on the voice signal to obtain its frequency-domain magnitude spectrum; the largest of the noise estimate of the voice signal, the mean of the frequency-domain magnitude spectrum, and the minimum magnitude among the (i-1)-th to (i+1)-th frequency points is then selected as the preset amplitude threshold corresponding to the i-th frequency point. The noise estimate of the voice signal is calculated using a noise estimation algorithm.

In a specific implementation, the following formula (11) may be adopted to calculate the preset amplitude threshold corresponding to the ith frequency point:

Y_thr＝2max(mean(Y), D[k], min(Y[k-2]……Y[k+2]))； (11)

wherein Y_thr is the preset amplitude threshold corresponding to the i-th frequency point, Y is the frequency-domain magnitude spectrum of the voice signal, D[k] is the noise estimate, k is the frequency point, and Y[k-2]……Y[k+2] are the frequency-domain magnitudes of the voice signal at the (k-2)-th to (k+2)-th frequency points.
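Peak detection with the adaptive threshold can be sketched as below. The sketch follows formula (11) as printed, reading "2max" as a factor of 2 applied to the maximum and using the neighborhood k-2 to k+2 (the preceding paragraph describes i-1 to i+1 instead, so `half_width` is left as a parameter). The function names, the parameter names and the edge handling are assumptions.

```python
def amplitude_threshold(mag, noise, k, scale=2.0, half_width=2):
    """scale * max(mean(Y), D[k], min(Y[k-half_width .. k+half_width]))."""
    lo = max(k - half_width, 0)
    hi = min(k + half_width, len(mag) - 1)
    local_min = min(mag[lo:hi + 1])
    return scale * max(sum(mag) / len(mag), noise[k], local_min)


def find_peaks(mag, noise, lo, hi):
    """Indices in [lo, hi] that exceed both neighbours and the threshold."""
    peaks = []
    for k in range(max(lo, 1), min(hi, len(mag) - 2) + 1):
        if (mag[k] > mag[k - 1] and mag[k] > mag[k + 1]
                and mag[k] > amplitude_threshold(mag, noise, k)):
            peaks.append(k)
    return peaks


# Two strong peaks and one small bump that stays below the threshold.
mag = [0.1] * 20
mag[5], mag[10], mag[15] = 5.0, 4.0, 0.3
peaks = find_peaks(mag, [0.05] * 20, 1, 18)
```

In this example the bump at index 15 is a local maximum but is rejected, because the threshold is dominated by the mean of the spectrum.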

In a specific implementation, after determining the number of peaks of the frequency domain magnitude spectrum of the speech signal, the method further includes: sequentially taking the frequency index value corresponding to each peak value as a fundamental frequency, and calculating the frequency doubling deviation between the frequency index value corresponding to each peak value after the peak value corresponding to the fundamental frequency and the fundamental frequency; when the frequency multiplication deviation is larger than a preset deviation threshold value, the peak value is excluded; sequentially calculating weighted values of all residual peak values according to the frequency multiplication deviation and the residual peak values; comparing the weighted value corresponding to each fundamental frequency, and selecting the maximum weighted value; and when the maximum weighted value is greater than a preset weighted threshold value, judging that voice activity exists in the voice signal.

In a specific implementation, the frequency doubling deviation is calculated using the following equation (12):

wherein Δf is the frequency multiplication deviation, p_n is the frequency index value corresponding to the n-th peak, and p_bb is the frequency index value of the fundamental frequency.

In a specific implementation, the weighted values of all remaining peaks are calculated using the following equation (13):

E_h＝∑α_n Y[p_n]； (13)

wherein E_h is the weighted value of all remaining peaks, p_n is the frequency index value corresponding to the n-th remaining peak, Y[p_n] is the frequency-domain magnitude corresponding to the n-th remaining peak, and α_n is a preset weight coefficient, α_n ∈ (0, 1]. α_n depends on the frequency multiple and the frequency multiplication deviation: the larger the frequency multiple, the smaller α_n; and the larger the frequency multiplication deviation, the smaller α_n.

In a specific implementation, the preset weighting threshold may be set as a fixed value according to a previous voice activity detection result, or may be set as a dynamic threshold according to a specific situation of a voice signal, and may be positively correlated with the following values: a ratio of energy corresponding to the speech signal to energy corresponding to the noise estimate; and calculating the noise estimation value of the voice signal by adopting a noise estimation algorithm.

In one embodiment of the invention, after all peaks within the index range have been detected, they are labeled Y[p_1]……Y[p_cnt], and the frequency index values corresponding to the peaks are taken in turn as candidate fundamental or harmonic frequencies. Since the fundamental frequency of speech typically lies between 80 Hz and 500 Hz, p_1 is first assumed to be the fundamental frequency, and a frequency multiplication decision is made for the candidate harmonic frequencies p_2……p_cnt above it, i.e. it is determined whether each of p_2 to p_cnt is close to an integer multiple of the fundamental frequency index. If the frequency multiplication deviation is greater than a preset deviation threshold, the peak at that index is excluded. The amplitudes of the remaining harmonics are then weighted and summed to obtain the weighted value with p_1 as the fundamental frequency. Next, each of p_2……p_(cnt-1) whose frequency index does not exceed 500 Hz is assumed in turn to be the fundamental frequency and the above steps are repeated, and the maximum weighted value is found. When the maximum weighted value is greater than a preset weighting threshold, it is determined that voice activity exists in the voice signal.
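The candidate-fundamental search can be sketched as follows. Formula (12) and the exact weights are not reproduced in this text, so the sketch substitutes assumed stand-ins: the deviation is the distance of p_n / p_b from the nearest integer, and the weight is (1 - deviation) / multiple, which stays in (0, 1] and shrinks as either the multiple or the deviation grows. Including the fundamental itself with weight 1, the function name and the default deviation threshold are also assumptions.

```python
def harmonic_score(mag, peaks, max_f0_index, dev_threshold=0.1):
    """Largest weighted harmonic sum over all candidate fundamentals.

    peaks must be sorted bin indices; max_f0_index is the bin for 500 Hz.
    """
    best = 0.0
    for i, p_b in enumerate(peaks):
        if p_b > max_f0_index:
            break  # later peaks lie above the fundamental frequency range
        score = mag[p_b]  # assumed: the fundamental counts with weight 1
        for p_n in peaks[i + 1:]:
            multiple = round(p_n / p_b)
            deviation = abs(p_n / p_b - multiple)  # ASSUMED deviation measure
            if multiple < 2 or deviation > dev_threshold:
                continue  # not close to an integer multiple: excluded
            weight = (1.0 - deviation) / multiple  # ASSUMED weight alpha_n
            score += weight * mag[p_n]
        best = max(best, score)
    return best


# Peaks at bins 10, 20, 30 form a harmonic series; bin 37 does not fit.
spectrum = [0.0] * 50
spectrum[10], spectrum[20], spectrum[30], spectrum[37] = 4.0, 2.0, 1.0, 5.0
e_h = harmonic_score(spectrum, [10, 20, 30, 37], max_f0_index=25)
```

The returned value would then be compared with the preset weighting threshold to decide whether voice activity is present.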

The voice activity detection has higher robustness by adopting the voice noise reduction algorithm and the harmonic detection algorithm, and because various algorithms are calculated on the basis of the frequency domain amplitude spectrum, the data reuse rate in the calculation process is high, and the calculation amount and the calculation cost are reduced.

Step S103, when the voice noise reduction algorithm and the harmonic detection algorithm judge that voice activity exists in the voice signal, judging that the voice activity is detected from the voice signal.

In the process of voice activity detection, the voice of the non-user is often mixed in the voice signal, and then the false detection is caused. Therefore, the energy corresponding to the voice signal can be calculated and compared with the preset energy threshold value to judge whether voice activity exists in the voice signal.

In a specific implementation, determining that voice activity is detected from the voice signal may include: calculating the energy corresponding to the voice signal; and when the energy corresponding to the voice signal is larger than a preset energy threshold value, judging that the voice activity is detected from the voice signal, and further improving the accuracy of voice activity detection on the basis of a voice noise reduction algorithm and a harmonic detection algorithm.

In a specific implementation, the energy corresponding to the speech signal is calculated by using the following formula (14):

E_abs＝∑(Y[k])²； (14)

wherein E_abs is the energy corresponding to the speech signal, Y[k] is the frequency-domain magnitude spectrum of the speech signal, and k ranges over a preset frequency range; the frequency-domain magnitude spectrum is obtained by performing a fast Fourier transform on the speech signal.

In an embodiment of the present invention, for a frequency domain amplitude spectrum of a frame of speech signal, energy corresponding to the speech signal is a sum of squares of each frequency point. Meanwhile, the main frequency range of the voice is considered, possible noise interference is eliminated, and energy corresponding to the voice signal with the frequency point of 80Hz to 3000Hz (namely, partial frequency band) is calculated. It can be understood that the user can adjust the selected frequency band range according to different requirements of the user, and the present invention is not described herein.

In specific implementation, energy corresponding to the voice signal is calculated, and noise estimation needs to be performed on the voice signal to obtain a noise estimation value; then calculating the ratio of the energy corresponding to the voice signal to the energy corresponding to the noise estimation value; and when the ratio of the energy corresponding to the voice signal to the energy corresponding to the noise estimation value is greater than a preset third energy ratio threshold value, judging that voice activity is detected from the voice signal.

In practical application, the ratio of the energy corresponding to the voice signal to the energy corresponding to the noise estimation value is calculated, and a preset third energy ratio threshold value can be stabilized within a certain range through the function characteristic of logarithm, so that the purpose of accurately judging whether voice activity exists in the voice signal is achieved. It is understood that the user may also set a ratio of the energy corresponding to the speech signal to the energy corresponding to the noise estimation value in different function domains according to the user's own situation, and the invention is not limited herein.

In a specific implementation, the following formula (15) is used to calculate a ratio of the energy corresponding to the speech signal to the energy corresponding to the noise estimation value:

E_vs＝log(E_abs)-log(E_n)； (15)

wherein E_vs is the ratio of the energy corresponding to the speech signal to the energy corresponding to the noise estimate, E_abs is the energy corresponding to the speech signal, E_n is the energy corresponding to the noise estimate, E_n＝∑(D[k])², and D[k] is the noise estimate.
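Formulas (14) and (15) can be sketched together as follows. The function names, the inclusive index bounds and the small constant `eps` that avoids log(0) are assumptions; the band limits would typically come from the 80 Hz to 3000 Hz range mentioned above.

```python
import math


def band_energy(mag, lo, hi):
    """Formula (14): sum of squared magnitudes over bins lo..hi inclusive."""
    return sum(v * v for v in mag[lo:hi + 1])


def log_energy_to_noise(mag, noise, lo, hi, eps=1e-12):
    """Formula (15): E_vs = log(E_abs) - log(E_n)."""
    return (math.log(band_energy(mag, lo, hi) + eps)
            - math.log(band_energy(noise, lo, hi) + eps))


# E_abs = 8 and E_n = 2 inside the band, so E_vs = log(4).
e_vs = log_energy_to_noise([0.0, 2.0, 2.0, 0.0], [1.0, 1.0, 1.0, 1.0], 1, 2)
```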

In practical applications, some users, such as engineering maintenance personnel, often cannot conveniently free their hands to operate the Push To Talk (PTT) key of an intercom because of their work requirements. Some intercoms therefore provide a Voice Operated Exchange (VOX) function, in which the signal collected by the microphone is used to determine whether voice activity exists, and this serves as the basis for waking the intercom or putting it to sleep. When the intercom needs to be activated, the user can start it directly by voice without pressing the PTT key. This effectively frees the user's hands and makes communication more convenient.

The voice activity detection method provided by the invention can be used for the voice intercom equipment, not only can have reliable activation rate and lower false activation rate in a noise environment, but also can prevent other people around from activating the intercom of a user when speaking by detecting the energy corresponding to the voice signal, and is convenient for the user to adjust the volume according to the sound.

In a specific implementation, after determining that voice activity is detected from the voice signal, the method further includes: when the detected voice activity appears after continuous non-voice activity and the number of the continuous non-voice activity frames exceeds a preset first frame number threshold, caching the voice activity, and when the number of the voice frames of the voice activity exceeds a preset second frame number threshold, outputting a voice signal corresponding to the voice activity; when the detected non-voice activity occurs after continuous voice activity and the frame number of the continuous voice activity exceeds a preset third frame number threshold, voice activity detection is continued, and when the frame number of the non-voice activity exceeds a preset fourth frame number threshold, voice signals corresponding to the voice activity are stopped being output. The false detection of individual voice frames can be eliminated, and the interference of some transient voices to the voice activity detection is also filtered.

In a specific implementation, when voice activity is detected from the voice signal, within a preset threshold range of a mixed frame number, when the detected voice activity and non-voice activity alternately occur, calculating the proportion of the voice activity frame number to the sum of the voice activity frame number and the non-voice activity frame number; and when the proportion is larger than a preset proportion threshold value, outputting a voice signal corresponding to the voice activity so as to resist the interference of transient sudden sound on the voice activity detection.
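The proportion test for a mixed window can be sketched as below. The function name and the default proportion threshold of 0.5 are hypothetical; `flags` holds the per-frame decisions inside the preset mixed-frame window.

```python
def mixed_segment_is_voice(flags, ratio_threshold=0.5):
    """Classify a window of alternating decisions by its share of voice frames.

    flags: per-frame decisions, True for voice activity, False otherwise.
    """
    if not flags:
        return False
    voice = sum(1 for f in flags if f)
    # voice frames / (voice frames + non-voice frames)
    return voice / len(flags) > ratio_threshold


mostly_voice = mixed_segment_is_voice([True, False, True, True])
mostly_noise = mixed_segment_is_voice([True, False, False, False])
```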

When the voice activity detection method provided by the invention is used on the voice equipment, a certain amount of cache can be set in the voice equipment. Every time the microphone collects a frame of signal, the signal is stored in the buffer. The speech device is not activated immediately when a speech frame is detected in consecutive non-speech frames. And after a plurality of frames of voice are continuously accumulated and judged, starting to activate, and starting to transmit the cached data. When a non-speech frame is detected in continuous speech frames, the transmission is not interrupted immediately, but is interrupted after a plurality of frames of non-speech frames are continuously accumulated and determined.

In the embodiment of the present invention, the thresholds voice_confirm, mute_confirm and mix_confirm may be set. voice_confirm indicates the number of frames required to confirm the start of voice, which is also the number of frames that must be buffered. Increasing voice_confirm eliminates very brief sounds and reduces false activations, but increases the transmission delay. Decreasing voice_confirm reduces the delay and allows shorter signals to be transmitted. mute_confirm indicates the number of frames required to confirm the end of voice. mix_confirm indicates the number of frames required to determine a voice candidate segment when voice frames and non-voice frames are mixed; setting it to 2 or 3 can improve detection accuracy. It should be understood that the three thresholds can be set to different values according to the actual requirements of the user, and the invention is not limited thereto.

Next, the voice frame count voice_cnt is initialized to zero, and the non-voice frame count mute_cnt is initialized to zero. Each time a voice frame occurs, voice_cnt is incremented by 1; otherwise mute_cnt is incremented by 1. When voice_cnt equals mix_confirm, mute_cnt is set to zero. When voice_cnt exceeds voice_confirm, mute_cnt is set to zero, voice activity is determined to be present, and the voice is sent starting from the buffered voice frames.

When voice frames and non-voice frames alternate over several frames, the proportion of voice frames in the mixed segment is calculated; if the proportion is high, the segment is classified as voice frames, and otherwise as non-voice frames.

When mute_cnt exceeds mute_confirm, voice_cnt is set to zero, the voice signal is determined to contain no voice activity, and nothing is sent.
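The counter logic can be sketched as a small state machine. This is one possible reading of the description: the counter resets tied to voice_cnt are applied only on voice frames, since otherwise mute_cnt could never accumulate once transmission has started. The function name, the default threshold values and the returned list of per-frame states are assumptions, and the buffering of frames is omitted.

```python
def smooth_decisions(frame_flags, voice_confirm=3, mute_confirm=5, mix_confirm=2):
    """Turn raw per-frame decisions into a stable transmit state per frame."""
    voice_cnt = 0
    mute_cnt = 0
    active = False
    states = []
    for is_voice in frame_flags:
        if is_voice:
            voice_cnt += 1
            if voice_cnt == mix_confirm:
                mute_cnt = 0   # enough voice frames: forget earlier silence
            if voice_cnt > voice_confirm:
                mute_cnt = 0
                active = True  # start confirmed; buffered frames would be sent
        else:
            mute_cnt += 1
            if mute_cnt > mute_confirm:
                voice_cnt = 0
                active = False  # end confirmed; stop sending
        states.append(active)
    return states


T, F = True, False
states = smooth_decisions([T, F, T, T, T, F, F, F, T],
                          voice_confirm=2, mute_confirm=2, mix_confirm=2)
```

In this run, transmission starts at the fourth frame, survives two non-voice frames, and stops at the third consecutive non-voice frame.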

In summary, a voice noise reduction algorithm and a harmonic detection algorithm are respectively adopted to determine whether voice activity exists in the voice signal. Determining that voice activity is detected from the speech signal when both the speech noise reduction algorithm and the harmonic detection algorithm determine that voice activity is present in the speech signal. And filtering the interference of steady-state noise to voice activity detection by adopting a voice noise reduction algorithm, and filtering the interference of unsteady-state noise to voice activity detection by adopting a harmonic detection algorithm. When the voice noise reduction algorithm and the harmonic detection algorithm judge that voice activity exists in the voice signal, the voice activity is judged to be detected from the voice signal, and the accuracy of voice activity detection is greatly improved.

Referring to fig. 2, a waveform diagram of a noisy speech signal in an embodiment of the present invention is shown. Fig. 3 is a schematic diagram showing a voice activity detection result of noisy speech using a speech noise reduction algorithm in an embodiment of the present invention. Fig. 4 is a schematic diagram showing a voice activity detection result of noisy speech using a harmonic detection algorithm in the embodiment of the present invention. Here, E_w is the energy ratio of the frequency-domain magnitude spectrum of the speech signal before and after noise reduction, and E_h is the maximum weighted value of all remaining peaks.

In fig. 2, the abscissa unit is the sampling point and the ordinate unit is the normalized amplitude value. In fig. 3 and 4, the abscissa unit is a sampling point, and the ordinate unit is a set value.

Because the background noise contains low-volume speech, harmonic detection occasionally exceeds the threshold. However, since the Wiener filtering noise reduction result classifies those frames as noise, they are not judged as voice activity. Combining harmonic detection with the Wiener filtering noise reduction algorithm therefore gives voice activity detection better accuracy.

Referring to fig. 5, a waveform diagram of blowing sound and wind sound in the embodiment of the present invention is shown, fig. 6 is a schematic diagram of a voice activity detection result of blowing sound and wind sound using a Wiener filtering algorithm in the embodiment of the present invention, and fig. 7 is a schematic diagram of a voice activity detection result of blowing sound and wind sound using a peak detection algorithm in the embodiment of the present invention. E_w is the energy ratio of the frequency-domain magnitude spectrum of the speech signal before and after noise reduction, and E_h is the maximum weighted value of all remaining peaks.

In fig. 5, the abscissa unit is the sample point and the ordinate unit is the normalized amplitude value. In fig. 6 and 7, the abscissa unit is a sampling point, and the ordinate unit is a set value.

As can be seen from figs. 5 to 7, when the noise first appears, the Wiener filtering branch briefly misjudges because its noise estimate has not yet been updated. The harmonic detection branch, however, detects no continuous speech and misjudges only a few frames. Therefore, by combining the two algorithms, the voice activity detection result still contains no misjudgment, and the detection accuracy is better.

Referring to fig. 8, a waveform diagram of an alarm sound in the embodiment of the present invention is shown. Fig. 9 is a schematic diagram of a voice activity detection result of an alarm sound using a Wiener filtering algorithm in the embodiment of the present invention. Fig. 10 is a schematic diagram of a voice activity detection result of an alarm sound using a peak detection algorithm in the embodiment of the present invention. Here, $E_w$ is the energy ratio of the frequency domain amplitude spectrum of the speech signal before and after noise reduction, and $E_h$ is the maximum weighted value of all remaining peaks.

In fig. 8, the abscissa unit is the sampling point and the ordinate unit is the normalized amplitude value. In fig. 9 and 10, the abscissa unit is a sampling point, and the ordinate unit is a set value.

As can be seen from figs. 8 to 10, because neither a voice fundamental frequency signal nor a periodic harmonic signal is detected during voice activity detection, the voice activity detection result contains no misjudgment, which shows good robustness.

Referring to fig. 11, an embodiment of the present invention further provides a voice activity detection apparatus 100, including: an acquiring unit 1001, a first determining unit 1002, and a second determining unit 1003;

the acquiring unit 1001 is configured to acquire a collected voice signal;

the first determining unit 1002 is configured to judge whether voice activity exists in the voice signal by using a voice noise reduction algorithm and a harmonic detection algorithm, respectively;

the second determining unit 1003 is configured to determine that voice activity is detected from the voice signal when both the voice noise reduction algorithm and the harmonic detection algorithm determine that voice activity exists in the voice signal.

In a specific implementation, the first determining unit 1002 may be configured to: performing voice noise reduction calculation on the voice signal to obtain a noise-reduced voice signal; calculating the energy corresponding to the voice signal and the energy corresponding to the voice signal after noise reduction to obtain the energy ratio of the voice signal before and after noise reduction; and when the energy ratio is smaller than a preset first energy ratio threshold, judging that voice activity exists in the voice signal.

In a specific implementation, the first determining unit 1002 may be configured to: performing fast Fourier transform on the voice signal to obtain a voice signal frequency domain amplitude spectrum; performing voice noise reduction calculation on the voice signal by adopting a wiener filtering noise reduction algorithm to obtain a noise-reduced voice signal frequency domain amplitude spectrum; calculating the energy ratio of the voice signal frequency domain amplitude spectrum before and after noise reduction according to the wiener filter function, the voice signal frequency domain amplitude spectrum after noise reduction and the voice signal frequency domain amplitude spectrum; the wiener filter function is obtained by calculation according to the wiener filter noise reduction algorithm and the noise estimation value of the voice signal; the noise estimation value is obtained by adopting a noise estimation algorithm; and when the energy ratio is smaller than a preset second energy ratio threshold, judging that voice activity exists in the voice signal.

In a specific implementation, the following formula may be adopted to calculate an energy ratio of the frequency domain amplitude spectrum of the speech signal before and after noise reduction:

where $E_w$ is the energy ratio of the voice signal frequency domain amplitude spectrum before and after noise reduction, ks is a preset frequency point starting point, ke is a preset frequency point end point, $Y(k)$ is the voice signal frequency domain amplitude spectrum, and $S'(k)$ is the voice signal frequency domain amplitude spectrum after noise reduction.
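The noise reduction branch can be sketched as follows. The exact formula for $E_w$ is not reproduced in this text, so the sketch assumes it is the summed squared amplitude before noise reduction divided by the summed squared amplitude after noise reduction, over frequency points ks to ke inclusive. The Wiener gain used here, H(k) = max(1 - D(k)²/Y(k)², 0), is one common textbook form and is likewise an assumption; all function names are hypothetical.

```python
def wiener_denoise(Y, D):
    """Apply an assumed spectral Wiener gain to the amplitude spectrum Y,
    given the noise amplitude estimate D (same length as Y)."""
    S = []
    for y, d in zip(Y, D):
        gain = max(1.0 - (d * d) / (y * y), 0.0) if y > 0 else 0.0
        S.append(gain * y)
    return S


def energy_ratio(Y, S, ks, ke):
    """Assumed form of E_w: energy before noise reduction divided by
    energy after noise reduction, over frequency points ks..ke inclusive."""
    before = sum(Y[k] ** 2 for k in range(ks, ke + 1))
    after = sum(S[k] ** 2 for k in range(ks, ke + 1))
    return before / after if after > 0 else float("inf")


def noise_reduction_says_speech(Y, D, ks, ke, ratio_threshold):
    """Judge voice activity when the energy ratio is below the threshold."""
    return energy_ratio(Y, wiener_denoise(Y, D), ks, ke) < ratio_threshold


# A frame well above the noise estimate keeps most of its energy (small ratio);
# a frame close to the noise estimate is largely suppressed (large ratio).
speech_like = noise_reduction_says_speech([10.0, 10.0], [1.0, 1.0], 0, 1, 2.0)
noise_like = noise_reduction_says_speech([1.1, 1.1], [1.0, 1.0], 0, 1, 2.0)
```

Under these assumptions, a small ratio means that noise reduction removed little energy, which is consistent with the rule that voice activity is judged to exist when the energy ratio is smaller than the threshold.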

In a specific implementation, the preset second energy ratio threshold may be positively correlated with the ratio of the energy corresponding to the voice signal to the energy corresponding to the noise estimation value.

In a specific implementation, the first determining unit 1002 may be configured to: judge that voice activity exists in the voice signal when the voice signal is in a preset voice fundamental frequency range and contains harmonic features.

In a specific implementation, the first determining unit 1002 may be configured to: perform fast Fourier transform on the voice signal to obtain a voice signal frequency domain amplitude spectrum; determine the number of peak values of the voice signal frequency domain amplitude spectrum; and judge that voice activity exists in the voice signal when the number of the peak values exceeds a preset threshold value of the number of the peak values. A peak value of the voice signal frequency domain amplitude spectrum is determined as follows: when the frequency domain amplitude spectrum corresponding to the ith frequency point is larger than the maximum of the frequency domain amplitude spectrum corresponding to the (i + 1)th frequency point, the frequency domain amplitude spectrum corresponding to the (i-1)th frequency point, and a preset amplitude threshold corresponding to the ith frequency point, the frequency domain amplitude spectrum corresponding to the ith frequency point is determined as a peak value of the voice signal frequency domain amplitude spectrum.

In a specific implementation, the preset amplitude threshold corresponding to the ith frequency point may be obtained by: performing fast Fourier transform on the voice signal to obtain a voice signal frequency domain amplitude spectrum; and selecting, as the preset amplitude threshold corresponding to the ith frequency point, the maximum of the following three values: the noise estimation value of the voice signal, the mean value of the voice signal frequency domain amplitude spectrum, and the minimum of the voice signal frequency domain amplitude spectrum from the (i-1)th frequency point to the (i + 1)th frequency point. The noise estimation value of the voice signal is calculated by adopting a noise estimation algorithm.

In a specific implementation, the first determining unit 1002 may be further configured to: sequentially taking the frequency index value corresponding to each peak value as a fundamental frequency, and calculating the frequency doubling deviation between the frequency index value corresponding to each peak value after the peak value corresponding to the fundamental frequency and the fundamental frequency; when the frequency multiplication deviation is larger than a preset deviation threshold value, the peak value is excluded; sequentially calculating weighted values of all residual peak values according to the frequency multiplication deviation and the residual peak values; comparing the weighted value corresponding to each fundamental frequency, and selecting the maximum weighted value; and when the maximum weighted value is larger than a preset weighted threshold value, judging that voice activity exists in the voice signal.

In a specific implementation, the weighted value of all the remaining peaks can be calculated using the following formula: $E_h = \sum_n \alpha_n Y[p_n]$; where $E_h$ is the weighted value of all the remaining peaks, $p_n$ is the frequency index value corresponding to the nth remaining peak, $Y[p_n]$ is the frequency domain amplitude spectrum corresponding to the nth remaining peak, and $\alpha_n$ is a preset weight coefficient, $\alpha_n \in (0, 1]$.

In a specific implementation, the frequency doubling deviation can be calculated using the following formula:

In a specific implementation, the preset weighted threshold may be positively correlated with the ratio of the energy corresponding to the voice signal to the energy corresponding to the noise estimation value, where the noise estimation value of the voice signal is calculated by adopting a noise estimation algorithm.
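The harmonic scoring steps described above (trying each peak as a fundamental, excluding peaks that deviate from integer multiples, and weighting the remaining peaks) can be sketched as follows. The deviation formula is not reproduced in this text, so the sketch assumes the deviation is the distance of p/f0 from the nearest integer. It also assumes a weight coefficient of 1 minus the deviation, and that the candidate fundamental peak itself is included in the sum with weight 1. These choices and the function name are hypothetical.

```python
def max_harmonic_weight(peaks, deviation_threshold):
    """peaks: list of (frequency_index, amplitude) sorted by frequency index.
    Returns the maximum weighted value E_h over all candidate fundamentals."""
    best = 0.0
    for i, (f0, a0) in enumerate(peaks):
        total = a0  # assumption: the fundamental itself counts with weight 1
        for p, amp in peaks[i + 1:]:
            multiple = p / f0
            deviation = abs(multiple - round(multiple))  # assumed deviation form
            if deviation > deviation_threshold:
                continue  # exclude peaks that are not near an integer multiple
            alpha = 1.0 - deviation  # assumed weight in (0, 1]
            total += alpha * amp
        best = max(best, total)
    return best


# Peaks at indices 10, 20 and 31 are near multiples of 10; index 45 is not.
peaks = [(10, 5.0), (20, 4.0), (31, 3.0), (45, 2.0)]
e_h = max_harmonic_weight(peaks, deviation_threshold=0.2)
```

Voice activity would then be judged to exist when `e_h` exceeds the preset weighted threshold.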

In a specific implementation, the following formula may be adopted to calculate the preset amplitude threshold corresponding to the ith frequency point: $Y_{thr} = 2\max(\mathrm{mean}(Y),\ D[k],\ \min(Y[k-2], \ldots, Y[k+2]))$; where $Y_{thr}$ is the preset amplitude threshold corresponding to the ith frequency point, $Y$ is the voice signal frequency domain amplitude spectrum, $D[k]$ is the noise estimation value, $k$ is the frequency point, and $Y[k-2], \ldots, Y[k+2]$ is the voice signal frequency domain amplitude spectrum corresponding to the (k-2)th to (k+2)th frequency points.
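The threshold formula and the peak criterion can be sketched together as follows. The sketch works on an amplitude spectrum that has already been computed. Clipping the k-2 to k+2 window at the ends of the spectrum is an assumption, since boundary handling is not described, and the function names are hypothetical.

```python
def amplitude_threshold(Y, D, k):
    """Y_thr = 2 * max(mean(Y), D[k], min(Y[k-2..k+2])).
    The window is clipped at the spectrum boundaries (assumption)."""
    lo = max(k - 2, 0)
    hi = min(k + 2, len(Y) - 1)
    local_min = min(Y[lo:hi + 1])
    mean_y = sum(Y) / len(Y)
    return 2.0 * max(mean_y, D[k], local_min)


def find_peaks(Y, D):
    """Return indices i where Y[i] exceeds both neighbours and its threshold."""
    peaks = []
    for i in range(1, len(Y) - 1):
        if Y[i] > max(Y[i - 1], Y[i + 1], amplitude_threshold(Y, D, i)):
            peaks.append(i)
    return peaks


Y = [1.0, 1.0, 8.0, 1.0, 1.0, 1.0, 9.0, 1.0, 1.0]
D = [0.5] * len(Y)
peak_indices = find_peaks(Y, D)
```

The number of peaks, `len(peak_indices)`, would then be compared with the preset threshold value of the number of peak values.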

In a specific implementation, the second determining unit 1003 may be configured to: calculating the energy corresponding to the voice signal; and when the energy corresponding to the voice signal is larger than a preset energy threshold value, judging that voice activity is detected from the voice signal.

In a specific implementation, the following formula can be used to calculate the energy corresponding to the voice signal: $E_{abs} = \sum_k (Y[k])^2$; where $E_{abs}$ is the energy corresponding to the voice signal, $Y[k]$ is the voice signal frequency domain amplitude spectrum, and $k$ ranges over a preset frequency range. The voice signal frequency domain amplitude spectrum is obtained by performing fast Fourier transform on the voice signal.

In a specific implementation, the second determining unit 1003 may be configured to: carrying out noise estimation on the voice signal to obtain a noise estimation value; calculating the ratio of the energy corresponding to the voice signal to the energy corresponding to the noise estimation value; and when the ratio of the energy corresponding to the voice signal to the energy corresponding to the noise estimation value is larger than a preset third energy ratio threshold value, judging that voice activity is detected from the voice signal.

In a specific implementation, the following formula may be adopted to calculate the ratio of the energy corresponding to the voice signal to the energy corresponding to the noise estimation value: $E_{vs} = \log(E_{abs}) - \log(E_n)$; where $E_{vs}$ is the ratio of the energy corresponding to the voice signal to the energy corresponding to the noise estimation value, $E_{abs}$ is the energy corresponding to the voice signal, $E_n$ is the energy corresponding to the noise estimation value, $E_n = \sum_k (D[k])^2$, and $D[k]$ is the noise estimation value.
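The two energy quantities above can be computed as in the following sketch. The base of the logarithm is not stated in this text, so the natural logarithm is assumed. The function names are hypothetical, and both energies are assumed to be nonzero.

```python
import math


def frame_energy(spectrum):
    """Sum of squared amplitudes over the supplied frequency points."""
    return sum(v * v for v in spectrum)


def log_energy_ratio(Y, D):
    """E_vs = log(E_abs) - log(E_n), using the natural logarithm (assumption).
    Both energies must be greater than zero."""
    return math.log(frame_energy(Y)) - math.log(frame_energy(D))


# E_abs = 8 and E_n = 2, so E_vs = log(8) - log(2) = log(4).
e_vs = log_energy_ratio([2.0, 2.0], [1.0, 1.0])
```

Voice activity would then be judged to be detected when `e_vs` is larger than the preset third energy ratio threshold.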

In a specific implementation, the second determining unit 1003 may be further configured to: when the detected voice activity appears after continuous non-voice activity and the number of the continuous non-voice activity frames exceeds a preset first frame number threshold, caching the voice activity, and when the number of the voice frames of the voice activity exceeds a preset second frame number threshold, outputting a voice signal corresponding to the voice activity; when the detected non-voice activity appears after continuous voice activity, and the frame number of the continuous voice activity exceeds a preset third frame number threshold value, continuing voice activity detection, and when the frame number of the non-voice activity exceeds a preset fourth frame number threshold value, stopping outputting a voice signal corresponding to the voice activity.
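The onset and release behaviour described above can be approximated by a small state machine. The following is a simplified sketch, not the embodiment itself: it models only the second frame number threshold (`onset_frames`) and the fourth frame number threshold (`release_frames`), omits the first and third thresholds, and does not model the caching of frames. All names are hypothetical.

```python
def smooth_decisions(flags, onset_frames=2, release_frames=3):
    """flags: per-frame voice activity decisions (truthy means speech).
    Output turns on once more than `onset_frames` consecutive speech frames
    are seen, and turns off once more than `release_frames` consecutive
    non-speech frames are seen."""
    output = []
    active = False
    speech_run = 0
    silence_run = 0
    for flag in flags:
        if flag:
            speech_run += 1
            silence_run = 0
        else:
            silence_run += 1
            speech_run = 0
        if not active and speech_run > onset_frames:
            active = True
        elif active and silence_run > release_frames:
            active = False
        output.append(active)
    return output


raw = [0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1]
smoothed = smooth_decisions(raw)
```

In this sketch, short gaps inside an utterance do not interrupt the output, and isolated speech-like frames do not start it.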

In a specific implementation, the second determining unit 1003 may be further configured to: within a preset threshold range of mixed frame number, when the detected voice activity and non-voice activity alternately occur, calculating the proportion of the voice activity frame number to the sum of the voice activity frame number and the non-voice activity frame number; and outputting a voice signal corresponding to the voice activity when the proportion is larger than a preset proportion threshold value.
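The proportion rule for alternating frames can be sketched as follows. The window is assumed to be a list of per-frame decisions no longer than the preset mixed frame number threshold, and the function names are hypothetical.

```python
def speech_proportion(window_flags):
    """Proportion of voice activity frames among all frames in the window."""
    if not window_flags:
        return 0.0
    speech_frames = sum(1 for flag in window_flags if flag)
    return speech_frames / len(window_flags)


def should_output(window_flags, proportion_threshold):
    """Output the voice signal when the proportion exceeds the threshold."""
    return speech_proportion(window_flags) > proportion_threshold


window = [1, 0, 1, 1, 0, 1, 0, 1]  # 5 speech frames out of 8
proportion = speech_proportion(window)
```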

In a specific implementation, the speech noise reduction algorithm may be at least one of the following algorithms: LMS, NLMS, spectral subtraction, and wiener filtering algorithms.

In a specific implementation, the harmonic detection algorithm may be at least one of the following algorithms: autocorrelation function methods, cepstrum methods, linear prediction methods, and wavelet methods.

An embodiment of the present invention further provides a computer-readable storage medium, which is a non-volatile storage medium or a non-transitory storage medium, and has stored thereon computer instructions, where the computer instructions, when executed, perform the steps of any one of the voice activity detection methods provided in the foregoing embodiments of the present invention.

An embodiment of the present invention further provides a voice activity detection apparatus, which includes a memory and a processor, where the memory stores computer instructions that are executable on the processor, and when the processor executes the computer instructions, the steps of any one of the voice activity detection methods provided in the foregoing embodiments of the present invention are executed.

Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in any computer readable storage medium, and the storage medium may include: ROM, RAM, magnetic or optical disks, and the like.

Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be effected by one skilled in the art without departing from the spirit and scope of the invention, as defined in the appended claims.

Claims

1. A method for voice activity detection, comprising:

acquiring a collected voice signal;

respectively adopting a voice noise reduction algorithm and a harmonic detection algorithm to judge whether voice activity exists in the voice signal;

when the speech noise reduction algorithm and the harmonic detection algorithm both determine that speech activity is present in the speech signal, determining that speech activity is detected from the speech signal;

the judging whether the voice signal has voice activity by respectively adopting the voice noise reduction algorithm and the harmonic detection algorithm comprises the following steps: performing fast Fourier transform on the voice signal to obtain a voice signal frequency domain amplitude spectrum; performing voice noise reduction calculation on the voice signal by adopting a wiener filtering noise reduction algorithm to obtain a noise-reduced voice signal frequency domain amplitude spectrum; calculating the energy ratio of the voice signal frequency domain amplitude spectrum before and after noise reduction according to the wiener filter function, the voice signal frequency domain amplitude spectrum after noise reduction and the voice signal frequency domain amplitude spectrum; the wiener filter function is obtained by calculation according to the wiener filter noise reduction algorithm and the noise estimation value of the voice signal; the noise estimation value is obtained by adopting a noise estimation algorithm; when the energy ratio is smaller than a preset second energy ratio threshold, judging that voice activity exists in the voice signal;

or, the judging whether the voice signal has voice activity by respectively adopting the voice noise reduction algorithm and the harmonic detection algorithm includes: performing fast Fourier transform on the voice signal to obtain a voice signal frequency domain amplitude spectrum; determining the number of peak values of the voice signal frequency domain amplitude spectrum; the peak value of the voice signal frequency domain amplitude spectrum is determined by adopting the following method: when the frequency domain amplitude spectrum corresponding to the ith frequency point in the voice signal frequency domain amplitude spectrum is larger than the maximum of the frequency domain amplitude spectrum corresponding to the (i + 1)th frequency point, the frequency domain amplitude spectrum corresponding to the (i-1)th frequency point and a preset amplitude threshold corresponding to the ith frequency point, determining the frequency domain amplitude spectrum corresponding to the ith frequency point as a peak value of the voice signal frequency domain amplitude spectrum; and when the number of the peak values exceeds a preset threshold value of the number of the peak values, judging that voice activity exists in the voice signal.

2. The method of claim 1, wherein the determining whether voice activity is present in the voice signal using a voice noise reduction algorithm and a harmonic detection algorithm, respectively, comprises:

performing voice noise reduction calculation on the voice signal to obtain a noise-reduced voice signal;

calculating the energy corresponding to the voice signal and the energy corresponding to the voice signal after noise reduction to obtain the energy ratio of the voice signal before and after noise reduction;

and when the energy ratio is smaller than a preset first energy ratio threshold, judging that voice activity exists in the voice signal.

3. The voice activity detection method according to claim 1, wherein the energy ratio of the voice signal frequency domain magnitude spectrum before and after the noise reduction is calculated by using the following formula:

wherein $E_w$ is the energy ratio of the voice signal frequency domain amplitude spectrum before and after noise reduction, ks is a preset frequency point starting point, ke is a preset frequency point end point, $Y(k)$ is the voice signal frequency domain amplitude spectrum, and $S'(k)$ is the voice signal frequency domain amplitude spectrum after noise reduction.

4. The voice activity detection method according to claim 1, wherein the preset second energy ratio threshold is positively correlated with the ratio of the energy corresponding to the voice signal to the energy corresponding to the noise estimation value.

5. The method of claim 1, wherein the determining whether voice activity is present in the voice signal using a voice noise reduction algorithm and a harmonic detection algorithm, respectively, comprises:

when the voice signal is in a preset voice fundamental frequency range and contains harmonic features, judging that voice activity exists in the voice signal.

6. The voice activity detection method according to claim 1, wherein the preset amplitude threshold corresponding to the ith frequency point is obtained by:

performing fast Fourier transform on the voice signal to obtain a voice signal frequency domain amplitude spectrum;

selecting, as the preset amplitude threshold corresponding to the ith frequency point, the maximum of the following three values: the noise estimation value of the voice signal, the mean value of the voice signal frequency domain amplitude spectrum, and the minimum of the voice signal frequency domain amplitude spectrum from the (i-1)th frequency point to the (i + 1)th frequency point; wherein the noise estimation value of the voice signal is calculated by adopting a noise estimation algorithm.

7. The voice activity detection method of claim 1, after determining the number of peaks of the frequency domain magnitude spectrum of the voice signal, further comprising:

sequentially taking the frequency index value corresponding to each peak value as a fundamental frequency, and calculating the frequency doubling deviation between the frequency index value corresponding to each peak value after the peak value corresponding to the fundamental frequency and the fundamental frequency;

when the frequency multiplication deviation is larger than a preset deviation threshold value, the peak value is excluded;

sequentially calculating weighted values of all residual peak values according to the frequency multiplication deviation and the residual peak values;

comparing the corresponding weighted value under each fundamental frequency, and selecting the maximum weighted value;

and when the maximum weighted value is greater than a preset weighted threshold value, judging that voice activity exists in the voice signal.

8. The voice activity detection method of claim 7 wherein the weighting values for all remaining peaks are calculated using the formula:

$E_h = \sum_n \alpha_n Y[p_n]$;

wherein $E_h$ is the weighted value of all the remaining peaks, $p_n$ is the frequency index value corresponding to the nth remaining peak, $Y[p_n]$ is the frequency domain amplitude spectrum corresponding to the nth remaining peak, and $\alpha_n$ is a preset weight coefficient, $\alpha_n \in (0, 1]$.

9. The voice activity detection method of claim 7, wherein the frequency doubling deviation is calculated using the following formula:

10. The voice activity detection method of claim 7, wherein the preset weighted threshold is positively correlated with the ratio of the energy corresponding to the voice signal to the energy corresponding to the noise estimation value, and the noise estimation value of the voice signal is calculated by adopting a noise estimation algorithm.

11. The voice activity detection method according to claim 6, wherein the preset amplitude threshold corresponding to the ith frequency point is calculated by using the following formula:

$Y_{thr} = 2\max(\mathrm{mean}(Y),\ D[k],\ \min(Y[k-2], \ldots, Y[k+2]))$;

wherein $Y_{thr}$ is the preset amplitude threshold corresponding to the ith frequency point, $Y$ is the voice signal frequency domain amplitude spectrum, $D[k]$ is the noise estimation value, $k$ is the frequency point, and $Y[k-2], \ldots, Y[k+2]$ is the voice signal frequency domain amplitude spectrum corresponding to the (k-2)th to (k+2)th frequency points.

12. The voice activity detection method of claim 1, wherein the determining that voice activity is detected from the voice signal comprises:

calculating the energy corresponding to the voice signal;

and when the energy corresponding to the voice signal is larger than a preset energy threshold value, judging that voice activity is detected from the voice signal.

13. The voice activity detection method of claim 12 wherein the corresponding energy of the voice signal is calculated using the formula:

$E_{abs} = \sum_k (Y[k])^2$;

14. The voice activity detection method of claim 12, wherein the calculating the corresponding energy of the voice signal comprises:

carrying out noise estimation on the voice signal to obtain a noise estimation value;

calculating the ratio of the energy corresponding to the voice signal to the energy corresponding to the noise estimation value;

and when the ratio of the energy corresponding to the voice signal to the energy corresponding to the noise estimation value is larger than a preset third energy ratio threshold value, judging that voice activity is detected from the voice signal.

15. The voice activity detection method of claim 14 wherein the ratio of the energy corresponding to the voice signal to the energy corresponding to the noise estimate is calculated using the formula:

$E_{vs} = \log(E_{abs}) - \log(E_n)$;

wherein $E_{vs}$ is the ratio of the energy corresponding to the voice signal to the energy corresponding to the noise estimation value, $E_{abs}$ is the energy corresponding to the voice signal, $E_n$ is the energy corresponding to the noise estimation value, $E_n = \sum_k (D[k])^2$, and $D[k]$ is the noise estimation value.

16. The voice activity detection method of claim 1, when it is determined that voice activity is detected from the voice signal, further comprising:

when the detected voice activity occurs after continuous non-voice activity and the frame number of the continuous non-voice activity exceeds a preset first frame number threshold, caching the voice activity, and when the voice frame number of the voice activity exceeds a preset second frame number threshold, outputting a voice signal corresponding to the voice activity;

when the detected non-voice activity appears after continuous voice activity, and the frame number of the continuous voice activity exceeds a preset third frame number threshold value, continuing voice activity detection, and when the frame number of the non-voice activity exceeds a preset fourth frame number threshold value, stopping outputting a voice signal corresponding to the voice activity.

17. The voice activity detection method of claim 16, when it is determined that voice activity is detected from the voice signal, further comprising:

within a preset mixed frame number threshold range, when the detected voice activity and non-voice activity occur alternately, calculating the proportion of the voice activity frame number to the sum of the voice activity frame number and the non-voice activity frame number;

and when the proportion is larger than a preset proportion threshold value, outputting a voice signal corresponding to the voice activity.

18. The voice activity detection method of claim 1, wherein the voice noise reduction algorithm is at least one of: LMS, NLMS, spectral subtraction, and wiener filtering algorithms.

19. The voice activity detection method of claim 1, wherein the harmonic detection algorithm is at least one of: autocorrelation function methods, cepstrum methods, linear prediction methods, and wavelet methods.

20. A voice activity detection apparatus, comprising:

an acquiring unit, configured to acquire a collected voice signal;

a first determining unit, configured to judge whether voice activity exists in the voice signal by respectively adopting a voice noise reduction algorithm and a harmonic detection algorithm;

a second determining unit, configured to judge that voice activity is detected from the voice signal when the voice noise reduction algorithm and the harmonic detection algorithm both judge that voice activity exists in the voice signal;

the first determining unit is configured to perform fast Fourier transform on the voice signal to obtain a voice signal frequency domain amplitude spectrum; perform voice noise reduction calculation on the voice signal by adopting a Wiener filtering noise reduction algorithm to obtain a noise-reduced voice signal frequency domain amplitude spectrum; calculate the energy ratio of the voice signal frequency domain amplitude spectrum before and after noise reduction according to the Wiener filter function, the voice signal frequency domain amplitude spectrum after noise reduction and the voice signal frequency domain amplitude spectrum; the Wiener filter function is obtained by calculation according to the Wiener filtering noise reduction algorithm and the noise estimation value of the voice signal; the noise estimation value is obtained by adopting a noise estimation algorithm; and, when the energy ratio is smaller than a preset second energy ratio threshold, judge that voice activity exists in the voice signal;

or, the first determining unit is configured to perform fast Fourier transform on the voice signal to obtain a voice signal frequency domain amplitude spectrum; determine the number of peak values of the voice signal frequency domain amplitude spectrum; the peak value of the voice signal frequency domain amplitude spectrum is determined by adopting the following method: when the frequency domain amplitude spectrum corresponding to the ith frequency point in the voice signal frequency domain amplitude spectrum is larger than the maximum of the frequency domain amplitude spectrum corresponding to the (i + 1)th frequency point, the frequency domain amplitude spectrum corresponding to the (i-1)th frequency point and a preset amplitude threshold corresponding to the ith frequency point, the frequency domain amplitude spectrum corresponding to the ith frequency point is determined as a peak value of the voice signal frequency domain amplitude spectrum; and, when the number of the peak values exceeds a preset threshold value of the number of the peak values, judge that voice activity exists in the voice signal.

21. The voice activity detection apparatus of claim 20, wherein the first determining unit is configured to: perform voice noise reduction calculation on the voice signal to obtain a noise-reduced voice signal; calculate the energy corresponding to the voice signal and the energy corresponding to the voice signal after noise reduction to obtain the energy ratio of the voice signal before and after noise reduction; and, when the energy ratio is smaller than a preset first energy ratio threshold, judge that voice activity exists in the voice signal.

22. The voice activity detection apparatus according to claim 20, wherein the energy ratio of the amplitude spectrum of the voice signal before and after the noise reduction is calculated by using the following formula:

wherein $E_w$ is the energy ratio of the voice signal frequency domain amplitude spectrum before and after noise reduction, ks is a preset frequency point starting point, ke is a preset frequency point end point, $Y(k)$ is the voice signal frequency domain amplitude spectrum, and $S'(k)$ is the voice signal frequency domain amplitude spectrum after noise reduction.

23. The voice activity detection device of claim 20, wherein the preset second energy ratio threshold is positively correlated with the ratio of the energy corresponding to the voice signal to the energy corresponding to the noise estimation value.

24. The voice activity detection apparatus of claim 20, wherein the first determining unit is configured to: judge that voice activity exists in the voice signal when the voice signal is in a preset voice fundamental frequency range and contains harmonic features.

25. The voice activity detection device according to claim 20, wherein the preset amplitude threshold corresponding to the ith frequency point is obtained by: performing fast Fourier transform on the voice signal to obtain a voice signal frequency domain amplitude spectrum; and selecting, as the preset amplitude threshold corresponding to the ith frequency point, the maximum of the following three values: the noise estimation value of the voice signal, the mean value of the voice signal frequency domain amplitude spectrum, and the minimum of the voice signal frequency domain amplitude spectrum from the (i-1)th frequency point to the (i + 1)th frequency point; wherein the noise estimation value of the voice signal is calculated by adopting a noise estimation algorithm.

26. The voice activity detection apparatus of claim 20, wherein the first determining unit is further configured to: sequentially taking the frequency index value corresponding to each peak value as a fundamental frequency, and calculating the frequency doubling deviation between the frequency index value corresponding to each peak value after the peak value corresponding to the fundamental frequency and the fundamental frequency; when the frequency multiplication deviation is larger than a preset deviation threshold value, the peak value is excluded; calculating weighted values of all residual peak values in sequence according to the frequency multiplication deviation and the residual peak values; comparing the corresponding weighted value under each fundamental frequency, and selecting the maximum weighted value; and when the maximum weighted value is greater than a preset weighted threshold value, judging that voice activity exists in the voice signal.

27. The voice activity detection device of claim 26, wherein the weighted value of all remaining peaks is calculated using the following formula:

$$E_h = \sum \alpha_n Y[p_n];$$
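The peak-weighting procedure of claims 26 and 27 can be sketched roughly as follows. This is only an illustrative reading, not the claimed implementation: the deviation measure (distance of `p / f0` from the nearest integer), the linear weight `alpha`, the default `max_dev`, and the function name `harmonic_score` are all assumptions, since this section gives neither the deviation formula nor a definition of the weights.

```python
def harmonic_score(peaks, Y, max_dev=0.1):
    """Return the largest weighted harmonic sum over all candidate fundamentals.

    peaks   : ascending list of frequency-bin indices where Y has a peak
    Y       : frequency-domain amplitude spectrum (sequence of floats)
    max_dev : hypothetical deviation threshold (assumed, not from the source)
    """
    best = 0.0
    for i, f0 in enumerate(peaks):
        if f0 <= 0:
            continue  # a zero bin cannot serve as a fundamental
        score = 0.0
        for p in peaks[i + 1:]:
            ratio = p / f0
            # Assumed deviation: distance from the nearest integer multiple.
            dev = abs(ratio - round(ratio))
            if dev > max_dev:
                continue  # exclude peaks that are not near a multiple of f0
            # Assumed weight: 1 for a perfect multiple, 0 at the threshold.
            alpha = 1.0 - dev / max_dev
            score += alpha * Y[p]
        best = max(best, score)
    return best


def harmonic_voice_detected(peaks, Y, weighted_threshold):
    """Declare voice activity when the best weighted sum exceeds the threshold."""
    return harmonic_score(peaks, Y) > weighted_threshold
```

In this sketch only the peaks after the candidate fundamental contribute to the sum; whether the fundamental's own amplitude should be included is not settled by the claim text.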

28. The voice activity detection apparatus of claim 26, wherein the frequency multiplication deviation is calculated using the following equation:

29. The voice activity detection device of claim 26, wherein the preset weighted threshold is positively correlated with the ratio of the energy corresponding to the voice signal to the energy corresponding to the noise estimation value; wherein the noise estimation value of the voice signal is calculated using a noise estimation algorithm.

30. The voice activity detection device according to claim 25, wherein the preset amplitude threshold corresponding to the ith frequency point is calculated using the following formula:

$$Y_{thr} = 2\max\big(\mathrm{mean}(Y),\ D[k],\ \min(Y[k-2], \ldots, Y[k+2])\big);$$

wherein $Y_{thr}$ is the preset amplitude threshold corresponding to the ith frequency point, $Y$ is the frequency-domain amplitude spectrum of the voice signal, $D[k]$ is the noise estimate, $k$ is the frequency point, and $Y[k-2], \ldots, Y[k+2]$ are the frequency-domain amplitude spectra of the voice signal at the (k-2)th to (k+2)th frequency points.
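As a minimal sketch, the threshold formula of claim 30 could be evaluated per frequency point as shown here. The function name `amplitude_threshold` and the clipping of the k-2 to k+2 window at the spectrum edges are assumptions made for illustration; the claim does not say how edge bins are handled.

```python
def amplitude_threshold(Y, D, k, half_width=2):
    """Compute Y_thr = 2 * max(mean(Y), D[k], min(Y[k-2] ... Y[k+2])).

    Y : frequency-domain amplitude spectrum of one frame
    D : per-bin noise estimate, same length as Y
    k : index of the frequency point
    """
    # Assumed edge handling: shrink the window where it leaves the spectrum.
    lo = max(0, k - half_width)
    hi = min(len(Y), k + half_width + 1)
    local_min = min(Y[lo:hi])
    mean_y = sum(Y) / len(Y)
    return 2.0 * max(mean_y, D[k], local_min)
```

For example, with `Y = [1.0, 2.0, 3.0, 4.0, 5.0]`, a flat noise estimate of 0.5, and `k = 2`, the mean (3.0) dominates and the threshold is 6.0.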

31. The voice activity detection apparatus according to claim 20, wherein the second determining unit is configured to: calculating the energy corresponding to the voice signal; and when the energy corresponding to the voice signal is larger than a preset energy threshold value, judging that voice activity is detected from the voice signal.

32. The voice activity detection device of claim 31, wherein the energy corresponding to the voice signal is calculated using the following formula:

$$E_{abs} = \sum (Y[k])^2;$$

33. The voice activity detection apparatus of claim 31, wherein the second determination unit is configured to: carrying out noise estimation on the voice signal to obtain a noise estimation value; calculating the ratio of the energy corresponding to the voice signal to the energy corresponding to the noise estimation value; and when the ratio of the energy corresponding to the voice signal to the energy corresponding to the noise estimation value is larger than a preset third energy ratio threshold value, judging that voice activity is detected from the voice signal.

34. The voice activity detection device of claim 33 wherein the ratio of the energy corresponding to the voice signal to the energy corresponding to the noise estimate is calculated using the formula:

$$E_{vs} = \log(E_{abs}) - \log(E_n);$$
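The energy and log-ratio formulas of claims 32 and 34, together with the threshold test of claim 33, might be computed as in the sketch below. Several details are assumptions: $E_n$ is taken to be the sum of squared noise-estimate bins, the logarithm is taken as natural, and the small `eps` guard against `log(0)` is added for robustness. All function names are hypothetical.

```python
import math


def frame_energy(spectrum):
    """E_abs = sum over k of (Y[k])^2."""
    return sum(y * y for y in spectrum)


def log_energy_ratio(Y, D, eps=1e-12):
    """E_vs = log(E_abs) - log(E_n).

    Assumes E_n is the energy of the noise estimate D, computed the same
    way as E_abs; eps avoids log(0) on silent frames (an added safeguard).
    """
    e_abs = frame_energy(Y)
    e_n = frame_energy(D)
    return math.log(e_abs + eps) - math.log(e_n + eps)


def energy_voice_detected(Y, D, ratio_threshold):
    """Declare voice activity when E_vs exceeds the (third) ratio threshold."""
    return log_energy_ratio(Y, D) > ratio_threshold
```

Working with the difference of logarithms rather than the raw quotient keeps the comparison stable when the noise energy is very small.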

35. The voice activity detection apparatus of claim 20, wherein the second determination unit is further configured to: when detected voice activity appears after continuous non-voice activity and the number of continuous non-voice activity frames exceeds a preset first frame number threshold, cache the voice activity, and when the number of voice frames of the voice activity exceeds a preset second frame number threshold, output a voice signal corresponding to the voice activity; and when detected non-voice activity appears after continuous voice activity and the number of continuous voice activity frames exceeds a preset third frame number threshold, continue voice activity detection, and when the number of non-voice activity frames exceeds a preset fourth frame number threshold, stop outputting the voice signal corresponding to the voice activity.
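The frame-counting behaviour of claim 35 resembles an onset-confirmation and hangover scheme. The sketch below is a deliberately simplified illustration: it models only the second threshold (`n_on`, voice frames needed before output starts) and the fourth threshold (`n_off`, non-voice frames needed before output stops), and omits the first and third thresholds and the caching of frames. The function name and default values are hypothetical.

```python
def smooth_decisions(raw, n_on=3, n_off=5):
    """Turn per-frame VAD flags into a smoothed output-on/off sequence.

    raw   : iterable of truthy/falsy per-frame voice decisions
    n_on  : output starts once more than n_on consecutive voice frames occur
    n_off : output stops once more than n_off consecutive non-voice frames occur
    """
    out = []
    active = False
    voice_run = 0
    silence_run = 0
    for is_voice in raw:
        if is_voice:
            voice_run += 1
            silence_run = 0
            if not active and voice_run > n_on:
                active = True  # onset confirmed
        else:
            silence_run += 1
            voice_run = 0
            if active and silence_run > n_off:
                active = False  # hangover expired
        out.append(active)
    return out
```

In a fuller implementation, the frames seen while an onset is still unconfirmed would be cached and released once `active` becomes true, so that the beginning of the utterance is not lost.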

36. The voice activity detection apparatus of claim 35, wherein the second determining unit is further configured to: within a preset mixed frame number threshold range, when the detected voice activity and non-voice activity occur alternately, calculating the proportion of the voice activity frame number to the sum of the voice activity frame number and the non-voice activity frame number; and outputting a voice signal corresponding to the voice activity when the proportion is larger than a preset proportion threshold value.
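The proportion test of claim 36 can be illustrated with a short sketch. The window is represented here as a list of per-frame boolean decisions; the function names and the default `ratio_threshold` of 0.5 are assumptions for illustration only.

```python
def voice_ratio(flags):
    """Proportion of voice frames among all frames in the window."""
    if not flags:
        return 0.0
    voiced = sum(1 for f in flags if f)
    return voiced / len(flags)


def should_output(flags, ratio_threshold=0.5):
    """Output the voice signal when the proportion exceeds the threshold."""
    return voice_ratio(flags) > ratio_threshold
```

For instance, a window of `[True, False, True, True]` gives a proportion of 0.75, which exceeds the assumed threshold of 0.5.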

37. The voice activity detection device of claim 20 wherein the voice noise reduction algorithm is at least one of: LMS, NLMS, spectral subtraction, and wiener filtering algorithms.

38. The voice activity detection apparatus of claim 20, wherein the harmonic detection algorithm is at least one of: autocorrelation function methods, cepstrum methods, linear prediction methods, and wavelet methods.

39. A computer-readable storage medium, being a non-volatile storage medium or a non-transitory storage medium, having stored thereon computer instructions, which when executed by a processor, perform the voice activity detection method of any one of claims 1 to 19.

40. A voice activity detection device comprising a memory and a processor, the memory having stored thereon computer instructions executable on the processor, wherein the processor executes the computer instructions to perform the voice activity detection method of any one of claims 1 to 19.