CN105810201A - Voice activity detection method and system

Info

Publication number: CN105810201A
Application number: CN201410853931.6A
Authority: CN
Inventors: 孙廷玮; 林福辉
Original assignee: Spreadtrum Communications Shanghai Co Ltd
Current assignee: Spreadtrum Communications Shanghai Co Ltd
Priority date: 2014-12-31
Filing date: 2014-12-31
Publication date: 2016-07-27
Anticipated expiration: 2034-12-31
Also published as: CN105810201B

Abstract

The invention provides a voice activity detection method and system. The method comprises: calculating the spectral density of the current frame of an audio signal; estimating the expected value of the noise spectral density; calculating the signal-to-noise ratio of the current frame based on the spectral density of the current frame and the expected value of the noise spectral density; and generating a voice activity detection result based on the signal-to-noise ratio of the current frame and a preset threshold. The detection result is thereby tied to the statistical distribution of the noise, overcoming the influence of noise on the result. In addition, the preset threshold is a dynamic threshold that follows changes in the noise, so the detection result adapts to the noise environment of the current frame.

Description

Voice activity detection method and system thereof

Technical field

The present invention relates to speech recognition technology, and in particular to a voice activity detection method and system.

Background technology

Voice activity detection (VAD), also called speech detection, detects the presence or absence of speech during speech processing, so that the speech segments and non-speech segments of a signal can be separated. VAD can be used in echo cancellation, noise suppression, speaker identification, speech recognition, and so on.

Traditional VAD algorithms usually make the decision from features such as the short-time energy, spectral energy, and zero-crossing rate of the audio signal. They therefore perform well in clean-speech environments and at high signal-to-noise ratios. Under low signal-to-noise ratios or non-stationary background noise, however, the detection result is affected by the noise features, and performance degrades.

As speech recognition technology develops, the requirements placed on voice activity detection keep rising. A VAD method is therefore needed that maintains good detection performance in harsh environments, for example under non-stationary noise or low signal-to-noise ratio.

Summary of the invention

The problem addressed by the present invention is to keep the performance of voice activity detection from degrading under non-stationary noise or low signal-to-noise ratio.

To solve the above problem, the present invention provides a voice activity detection method, including: calculating the spectral density of the current frame of an audio signal; calculating the expected value of the noise spectral density; calculating the signal-to-noise ratio of the current frame based on the spectral density of the current frame and the expected value of the noise spectral density; and generating a voice activity detection result based on the signal-to-noise ratio of the current frame and a preset threshold.

Optionally, the expected value of the noise spectral density is calculated based on the statistical distribution of the noise.

Optionally, the signal-to-noise ratio is calculated according to the formula:

where SNR denotes the signal-to-noise ratio.

Optionally, the preset threshold is a dynamic threshold that changes as the signal-to-noise ratio changes.

Optionally, the dynamic threshold is calculated according to the formula:

$$\gamma = \sqrt{2D}\,\operatorname{erfc}^{-1}(2P_{FA})$$

where γ denotes the dynamic threshold, D denotes the variance of the signal-to-noise ratio, and $P_{FA}$ denotes the false-alarm probability.

Optionally, when the signal-to-noise ratio is greater than the preset threshold, the generated voice activity detection result indicates that the current frame of the audio signal is a speech segment; when the signal-to-noise ratio is less than the preset threshold, the result indicates that the current frame of the audio signal is a non-speech segment.

The present invention also provides a voice activity detection system, including: a receiving unit for receiving an audio signal; a processing unit for calculating the signal-to-noise ratio of the current frame, where the signal-to-noise ratio of the current frame is calculated based on the spectral density of the current frame of the audio signal and the expected value of the noise spectral density; and a judging unit configured to generate a voice activity detection result based on the signal-to-noise ratio of the current frame and a preset threshold.

Optionally, the processing unit includes: a first computing unit for calculating the spectral density of the current frame of the audio signal; a second computing unit for calculating the expected value of the noise spectral density; and a third computing unit for calculating the signal-to-noise ratio of the current frame.

Optionally, the expected value of the noise spectral density is calculated based on the statistical distribution of the noise.

Here, SNR denotes the signal-to-noise ratio.

Optionally, the processing unit further includes a fourth computing unit for calculating the dynamic threshold, which may be calculated according to the formula:

$$\gamma = \sqrt{2D}\,\operatorname{erfc}^{-1}(2P_{FA})$$

Optionally, when the signal-to-noise ratio is greater than the preset threshold, the generated voice activity detection result indicates that the current frame of the audio signal is a speech segment; when the signal-to-noise ratio is less than the preset threshold, the result indicates that the current frame of the audio signal is a non-speech segment.

Compared with the prior art, the technical solution of the present invention has the following advantages:

First, the VAD decision of the present invention is based on the statistical distribution of the noise rather than on the statistical distribution of the speech signal. Specifically, the voice activity detection method provided by the invention gathers the probability distribution of the noise, estimates the expected value of the noise spectral density from it, and then generates the decision. Because noise in practice is a long-term stationary signal, as long as the probability distribution of the noise is estimated appropriately, the VAD decision performs well whether the current frame of the audio signal under test lies in a stationary or a non-stationary noise environment.

Further, the VAD decision uses a dynamic threshold that is tied to changes in the noise, so the threshold changes as the noise signal changes and adapts well to the current background environment.

Brief description of the drawings

Fig. 1 is a flowchart of a voice activity detection method according to an embodiment of the present invention;

Fig. 2 is a structural diagram of a voice activity detection system according to an embodiment of the present invention; and

Fig. 3 is a structural diagram of the processing unit in the voice activity detection system according to an embodiment of the present invention.

Detailed description of the invention

To make the above objects, features, and advantages of the present invention clearer, specific embodiments of the invention are described in detail below with reference to the accompanying drawings.

Fig. 1 shows a voice activity detection method 100 according to an embodiment of the present invention. The method includes the following steps.

S101: calculate the spectral density of the current frame of the audio signal.

In method 100, voice activity detection is performed frame by frame. Specifically, the audio signal is divided into multiple frames, and each frame is then examined separately to determine the speech segments and non-speech segments of the audio signal. The length of each frame may be set between 10 ms and 30 ms. The current frame of the audio signal is thus the frame on which voice activity detection currently needs to be performed.
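As an illustration only, the framing step might be sketched as follows. The function name `split_into_frames`, the default of 20 ms, and the choice of non-overlapping frames are assumptions made for this sketch; the text only states that frames are 10 ms to 30 ms long.

```python
def split_into_frames(samples, sample_rate, frame_ms=20):
    """Split a sequence of samples into consecutive, non-overlapping frames.

    Non-overlapping frames are an assumption of this sketch; the description
    only fixes the frame length range (10 ms to 30 ms).
    """
    if not 10 <= frame_ms <= 30:
        raise ValueError("frame length is expected to lie between 10 ms and 30 ms")
    frame_len = int(sample_rate * frame_ms / 1000)
    # Drop a trailing partial frame so that every frame has the same length.
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


# 1000 samples at 8 kHz with 20 ms frames: 160 samples per frame, 6 full frames.
frames = split_into_frames(list(range(1000)), sample_rate=8000, frame_ms=20)
```

Each element of `frames` would then be treated in turn as the current frame.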

In some embodiments, the spectral density of the current frame is its power spectral density (PSD), that is, the power of the current frame's audio signal per unit frequency, which serves as the feature measure of the current frame.

In some embodiments, the spectral density of the current frame may be calculated with Bartlett's spectral estimation method (the Bartlett algorithm). In some embodiments, it may also be calculated with the periodogram method. The present invention does not limit the algorithm used for the spectral density of the current frame.
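The two estimators named above can be sketched in a few lines. This is a minimal illustration using a direct DFT from the standard library (a real implementation would normally use an FFT); the function names, the number of segments, and the 1/N scaling are assumptions of the sketch, since the text does not fix a normalization.

```python
import cmath
import math


def periodogram(frame):
    """Periodogram estimate: |DFT|^2 / N for frequency bins 0 .. N//2."""
    n = len(frame)
    psd = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        psd.append(abs(acc) ** 2 / n)
    return psd


def bartlett_psd(frame, n_segments=4):
    """Bartlett estimate: average the periodograms of non-overlapping segments."""
    seg_len = len(frame) // n_segments
    segments = [frame[i * seg_len:(i + 1) * seg_len] for i in range(n_segments)]
    pgrams = [periodogram(seg) for seg in segments]
    # Average bin by bin across the segment periodograms.
    return [sum(bins) / n_segments for bins in zip(*pgrams)]
```

Averaging over several segments lowers the variance of the estimate compared with a single periodogram, at the cost of coarser frequency resolution, because each segment is shorter than the full frame.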

S103: calculate the expected value of the noise spectral density.

The expected value of the noise spectral density is calculated from the statistical distribution of the noise signal, where the noise statistics are gathered in a noise-only environment with no speech signal.

In some embodiments, the noise spectral density may be estimated with an algorithm that has a small variance, such as the Bartlett algorithm mentioned above, to improve VAD detection performance. This is because method 100 involves the variance of the noise when it makes the VAD decision on the current frame, that is, when it decides whether the current frame is a speech segment or a non-speech segment; this is elaborated in the steps below.
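One simple way to obtain such an expected value, shown here only as a sketch, is to average PSD estimates bin by bin over frames known to contain noise only. The function name `expected_noise_psd` and the use of a plain sample mean are assumptions; the text does not specify the estimator.

```python
def expected_noise_psd(noise_psds):
    """Bin-by-bin sample mean of PSD estimates taken from noise-only frames.

    `noise_psds` is a list of PSD vectors of equal length, one per noise-only
    frame. A plain sample mean is an assumption of this sketch.
    """
    if not noise_psds:
        raise ValueError("need at least one noise-only frame")
    n = len(noise_psds)
    return [sum(bins) / n for bins in zip(*noise_psds)]


# Two noise-only frames with two frequency bins each.
noise_mean = expected_noise_psd([[1.0, 2.0], [3.0, 4.0]])
```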

It should be noted that the voice activity detection method 100 provided by the invention rests on several assumptions. First, the speech signal and the noise signal are assumed to be independent of each other, with no correlation between them. Second, the noise signal is assumed to be stationary over the long term, while the speech signal is stationary only over the short term. These assumptions are consistent with practical conditions, so methods and systems based on them have practical value.

S105: calculate the signal-to-noise ratio (SNR) based on the spectral density of the current frame and the expected value of the noise spectral density.

The signal-to-noise ratio may be calculated according to the formula:

Here, the expected value of the noise spectral density is calculated from the statistical distribution of the noise, so the signal-to-noise ratio of the current frame is likewise obtained from the statistical distribution of the noise.
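The SNR formula itself is not reproduced in this text, so the following is only a hypothetical stand-in showing how the two inputs of step S105 could be combined. It assumes a ratio of total frame power to total expected noise power, expressed in decibels; the function name `frame_snr_db`, the decibel scale, and the small constant `eps` are all assumptions and should not be read as the patented formula.

```python
import math


def frame_snr_db(frame_psd, noise_psd_expected, eps=1e-12):
    """Hypothetical SNR: total frame power over total expected noise power, in dB.

    This is an assumed form for illustration; the source formula is not shown.
    `eps` guards against division by zero and log of zero.
    """
    frame_power = sum(frame_psd)
    noise_power = sum(noise_psd_expected)
    return 10.0 * math.log10((frame_power + eps) / (noise_power + eps))
```

With this assumed form, a frame whose PSD matches the expected noise PSD gives a value near 0 dB, and a frame with ten times the noise power gives a value near 10 dB.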

S107: calculate the dynamic threshold γ based on the variance of the SNR.

The dynamic threshold γ may be calculated according to the formula:

$$\gamma = \sqrt{2D}\,\operatorname{erfc}^{-1}(2P_{FA})$$

where D denotes the variance of the SNR and $P_{FA}$ denotes the false-alarm probability. The false-alarm probability is the probability that noise is mistaken for speech, that is, the probability that the SNR is judged to exceed γ when no speech signal is present.
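The threshold formula can be evaluated with the Python standard library alone. The sketch below relies on the identity erfc⁻¹(y) = Φ⁻¹(1 − y/2) / √2, where Φ⁻¹ is the standard normal quantile function, because the standard library offers no inverse complementary error function directly. The function names are assumptions.

```python
import math
from statistics import NormalDist


def erfc_inv(y):
    """Inverse complementary error function for 0 < y < 2, via the normal quantile."""
    return NormalDist().inv_cdf(1.0 - y / 2.0) / math.sqrt(2.0)


def dynamic_threshold(snr_variance, p_fa):
    """gamma = sqrt(2 * D) * erfc^{-1}(2 * P_FA), with D the variance of the SNR."""
    if not 0.0 < p_fa < 1.0:
        raise ValueError("false-alarm probability must lie strictly between 0 and 1")
    return math.sqrt(2.0 * snr_variance) * erfc_inv(2.0 * p_fa)
```

For example, with D = 1 and a false-alarm probability of 0.05, `dynamic_threshold(1.0, 0.05)` evaluates to about 1.645, the familiar one-sided 5% point of the standard normal distribution.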

The dynamic threshold therefore depends on the variance of the SNR, that is, on how the SNR changes. When the noise is non-stationary, the variance of the SNR changes, so the value of the dynamic threshold changes along with the noise. The voice activity detection method 100 provided by the invention can thus adapt to changes in the noise, and detection performance does not degrade under non-stationary noise or low signal-to-noise ratio.
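The text does not say how the variance D is estimated. One possibility, given purely as an assumption, is to keep a sliding window of recent SNR values and take their population variance, so that D (and hence the threshold) follows the noise as it changes. The class name `SnrVarianceTracker` and the window length are hypothetical.

```python
from collections import deque
from statistics import pvariance


class SnrVarianceTracker:
    """Tracks the variance of the most recent SNR values (assumed estimator)."""

    def __init__(self, window=50):
        # Old values fall out automatically once the window is full.
        self.history = deque(maxlen=window)

    def update(self, snr):
        """Add one SNR value and return the variance over the current window."""
        self.history.append(snr)
        if len(self.history) < 2:
            return 0.0
        return pvariance(self.history)
```

A steadier SNR sequence yields a smaller D and a tighter threshold, while a fluctuating one widens it.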

In addition, the dynamic threshold also depends on the false-alarm probability $P_{FA}$. In practical applications, the performance of voice activity detection method 100 can be improved by controlling the false-alarm probability.
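One way to read the link between γ and the false-alarm probability, offered as an interpretation rather than something the text states: if the SNR of a noise-only frame is modeled as a zero-mean Gaussian variable with variance D, then fixing the false-alarm probability determines the threshold.

$$
P_{FA} = P(\mathrm{SNR} > \gamma \mid \text{noise only}) = \frac{1}{2}\operatorname{erfc}\!\left(\frac{\gamma}{\sqrt{2D}}\right)
\quad\Longrightarrow\quad
\gamma = \sqrt{2D}\,\operatorname{erfc}^{-1}(2P_{FA})
$$

Under this reading, choosing a smaller false-alarm probability raises γ, and a larger variance D raises γ for any false-alarm probability below 0.5.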

S109: generate the VAD decision for the current frame based on the SNR and γ.

When the SNR is greater than γ, the generated VAD decision is that the current frame of the audio signal is a speech segment; when the SNR is less than γ, the generated VAD decision is that the current frame of the audio signal is a non-speech segment.
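The decision rule of step S109 reduces to a single comparison, sketched below. The function name and labels are assumptions, and the text does not say what happens when the SNR equals γ exactly; this sketch treats that case as non-speech.

```python
def vad_decision(snr, gamma):
    """Return "speech" if the SNR exceeds the threshold, otherwise "non-speech".

    The tie case (snr == gamma) is not covered by the description; treating it
    as non-speech is an assumption of this sketch.
    """
    return "speech" if snr > gamma else "non-speech"
```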

In some embodiments, the VAD decision may instead be generated from the SNR of the current frame and a fixed threshold. Specifically, when the SNR is greater than the fixed threshold, the current frame is judged to be a speech segment; when the SNR is less than the fixed threshold, the current frame is judged to be a non-speech segment.

The VAD method 100 provided by the invention is therefore based on the statistical distribution of the noise rather than on the statistical distribution of speech. At the same time, the VAD decision depends on the real-time signal-to-noise ratio (the SNR of the current frame) and on the change in the noise (the variance of the signal-to-noise ratio). The influence of non-stationary noise or low signal-to-noise ratio on the VAD decision can thereby be overcome.

In addition, generating the VAD decision only requires the SNR of the current frame, without considering a priori and a posteriori SNRs. The voice activity detection method 100 provided by the invention is therefore simpler, which can improve detection efficiency.

Fig. 2 shows a voice activity detection system 200 according to an embodiment of the present invention. The system includes: a receiving unit 201 for receiving an audio signal; a processing unit 203 for calculating the signal-to-noise ratio of the current frame of the audio signal, where this signal-to-noise ratio is calculated from the spectral density of the current frame and the expected value of the noise spectral density; and a judging unit 205 for generating the VAD decision, where the decision is generated from the signal-to-noise ratio calculated by processing unit 203 and a preset threshold.

Referring to Fig. 3, processing unit 203 includes a first computing unit 2031 for calculating the spectral density of the current frame, a second computing unit 2033 for calculating the expected value of the noise spectral density, and a third computing unit 2035 for calculating the signal-to-noise ratio of the current frame.
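The composition of processing unit 203 can be mirrored in software by treating each computing unit as a pluggable function. The class below is a structural sketch only: the class and field names are hypothetical, and the three lambdas are toy placeholders rather than real PSD or SNR computations.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class ProcessingUnit:
    """Structural sketch of processing unit 203 (names are hypothetical)."""
    frame_psd: Callable[[Sequence[float]], List[float]]   # first computing unit 2031
    expected_noise_psd: Callable[[], List[float]]         # second computing unit 2033
    snr: Callable[[List[float], List[float]], float]      # third computing unit 2035

    def frame_snr(self, frame: Sequence[float]) -> float:
        """Chain the three units: frame PSD, expected noise PSD, then SNR."""
        return self.snr(self.frame_psd(frame), self.expected_noise_psd())


# Toy placeholders, chosen only so the wiring can be exercised.
unit = ProcessingUnit(
    frame_psd=lambda frame: [x * x for x in frame],
    expected_noise_psd=lambda: [1.0, 1.0],
    snr=lambda psd, noise: sum(psd) / sum(noise),
)
toy_snr = unit.frame_snr([3.0, 3.0])
```

Keeping the units separate makes it straightforward to swap one spectral estimator for another without touching the rest of the chain.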

The signal-to-noise ratio (SNR) calculated by the third computing unit 2035 is based on the formula:

Here, the expected value of the noise spectral density is calculated from the statistical distribution of the noise, so the signal-to-noise ratio of the current frame is likewise obtained from the statistical distribution of the noise. In addition, the noise statistics are gathered in a noise-only condition with no speech activity.

In some embodiments, the spectral density of the current frame of the audio signal and the noise spectral density may be calculated with Bartlett's spectral estimation method (the Bartlett algorithm), the periodogram method, and so on. The present invention is not limited in this respect. It should be noted, however, that the noise spectral density is preferably estimated with an algorithm that has a small variance, such as the Bartlett algorithm mentioned above, to improve the detection performance of the VAD method, because the judging unit of system 200 generates the VAD decision based on the variance of the noise.

The spectral density may be the power spectral density (PSD), that is, the power of the current frame's audio signal per unit frequency, which serves as the measure of the current frame.

In some embodiments, the preset threshold is a dynamic threshold calculated by processing unit 203. Specifically, processing unit 203 further includes a fourth computing unit 2037 for calculating the dynamic threshold, and the dynamic threshold γ may be calculated according to the formula:

$$\gamma = \sqrt{2D}\,\operatorname{erfc}^{-1}(2P_{FA})$$

The dynamic threshold therefore depends on the variance of the SNR, that is, on how the SNR changes. When the noise is non-stationary, the variance of the SNR changes, so the value of the dynamic threshold changes along with the noise, and the VAD decision can adapt to changes in the noise.

In addition, the dynamic threshold also depends on the false-alarm probability $P_{FA}$. In practical applications, the accuracy of the VAD decision can be improved by reducing the false-alarm probability.

When making the VAD decision, judging unit 205 judges the current frame to be a speech segment if its SNR is greater than the dynamic threshold γ, and a non-speech segment if its SNR is less than the dynamic threshold γ.

In the system 200 provided by the invention, the VAD decision depends on the real-time signal-to-noise ratio (the SNR of the current frame) and on the change in the noise (the variance of the signal-to-noise ratio), which overcomes the influence of non-stationary noise or low signal-to-noise ratio on the VAD decision. In addition, the VAD decision for each frame only requires the current SNR, without considering a priori and a posteriori SNRs, so the decision method is simpler.

System 200 can make VAD decisions on an audio signal frame by frame to determine the speech segments and non-speech segments of the audio signal.
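Turning per-frame decisions into segments amounts to merging runs of identical decisions. The following sketch assumes the decisions arrive as a list of booleans (True for speech); the function name and the tuple layout are hypothetical.

```python
from itertools import groupby


def frames_to_segments(decisions):
    """Merge consecutive identical frame decisions into segments.

    Returns (start_frame, end_frame_exclusive, is_speech) tuples.
    """
    segments = []
    start = 0
    for is_speech, run in groupby(decisions):
        length = len(list(run))
        segments.append((start, start + length, is_speech))
        start += length
    return segments


segments = frames_to_segments([False, True, True, False])
```

Here `segments` holds three runs: frame 0 as non-speech, frames 1 to 2 as speech, and frame 3 as non-speech.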

System 200 may further include an execution unit 207 configured to perform different operations, for example recognition or decoding, on different frames of the audio signal (speech segments and non-speech segments) based on the VAD decision of judging unit 205.
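The role of execution unit 207 can be pictured as a dispatcher that routes each frame to one of two handlers according to its VAD decision. The function name `execute` and both handlers are placeholders invented for this sketch; the text names recognition and decoding only as examples of possible operations.

```python
def execute(frames, decisions, on_speech, on_non_speech):
    """Apply a different operation to each frame depending on its VAD decision."""
    return [
        on_speech(frame) if is_speech else on_non_speech(frame)
        for frame, is_speech in zip(frames, decisions)
    ]


# Placeholder handlers: tag speech frames for recognition, skip the others.
results = execute(
    frames=["f0", "f1", "f2"],
    decisions=[False, True, False],
    on_speech=lambda frame: ("recognize", frame),
    on_non_speech=lambda frame: ("skip", frame),
)
```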

Although the present invention is disclosed as above, it is not limited thereto. Any person skilled in the art may make various changes or modifications without departing from the spirit and scope of the present invention, and the scope of protection of the present invention is therefore defined by the claims.

Claims

1. A voice activity detection method, characterized by including:

calculating the spectral density of the current frame of an audio signal;

calculating the expected value of the noise spectral density;

calculating the signal-to-noise ratio of the current frame based on the spectral density of the current frame and the expected value of the noise spectral density; and

generating a voice activity detection result based on the signal-to-noise ratio of the current frame and a preset threshold.

2. The method according to claim 1, characterized in that the expected value of the noise spectral density is calculated based on the statistical distribution of the noise.

3. The method according to claim 1, characterized in that the signal-to-noise ratio is calculated according to the formula:

where SNR denotes the signal-to-noise ratio.

4. The method according to claim 1, characterized in that the preset threshold is a dynamic threshold that changes as the signal-to-noise ratio changes.

5. The method according to claim 4, characterized in that the dynamic threshold is calculated according to the formula:

$$\gamma = \sqrt{2D}\,\operatorname{erfc}^{-1}(2P_{FA})$$

6. The method according to claim 1, characterized in that when the signal-to-noise ratio is greater than the preset threshold, the generated voice activity detection result indicates that the current frame of the audio signal is a speech segment; and when the signal-to-noise ratio is less than the preset threshold, the generated voice activity detection result indicates that the current frame of the audio signal is a non-speech segment.

7. A voice activity detection system, characterized by including:

a receiving unit for receiving an audio signal;

a processing unit for calculating the signal-to-noise ratio of the current frame, where the signal-to-noise ratio of the current frame is calculated based on the spectral density of the current frame of the audio signal and the expected value of the noise spectral density; and

a judging unit configured to generate a voice activity detection result based on the signal-to-noise ratio of the current frame and a preset threshold.

8. The system according to claim 7, characterized in that the processing unit includes: a first computing unit for calculating the spectral density of the current frame of the audio signal; a second computing unit for calculating the expected value of the noise spectral density; and a third computing unit for calculating the signal-to-noise ratio of the current frame.

9. The system according to claim 7, characterized in that the expected value of the noise spectral density is calculated based on the statistical distribution of the noise.

10. The system according to claim 7, characterized in that the signal-to-noise ratio is calculated according to the formula:

where SNR denotes the signal-to-noise ratio.

11. The system according to claim 7, characterized in that the preset threshold is a dynamic threshold that changes as the signal-to-noise ratio changes.

12. The system according to claim 11, characterized in that the processing unit further includes a fourth computing unit for calculating the dynamic threshold.

13. The system according to claim 12, characterized in that the dynamic threshold is calculated according to the formula:

$$\gamma = \sqrt{2D}\,\operatorname{erfc}^{-1}(2P_{FA})$$

14. The system according to claim 7, characterized in that when the signal-to-noise ratio is greater than the preset threshold, the generated voice activity detection result indicates that the current frame of the audio signal is a speech segment; and when the signal-to-noise ratio is less than the preset threshold, the generated voice activity detection result indicates that the current frame of the audio signal is a non-speech segment.