CN104036777A

CN104036777A - Method and device for voice activity detection

Info

Publication number: CN104036777A
Application number: CN201410217411.6A
Authority: CN
Inventors: 何勇军; 孙广路; 谢怡宁; 郑云龙
Original assignee: Harbin University of Science and Technology
Current assignee: Harbin University of Science and Technology
Priority date: 2014-05-22
Filing date: 2014-05-22
Publication date: 2014-09-10

Abstract

The invention relates to a method and a device for voice activity detection. The method comprises: extracting the signal features of a clean speech signal and of a noisy speech signal; performing dictionary training on the signal features of the clean speech signal to obtain a speech dictionary; dynamically updating preset noise training data according to the signal features of the noisy speech signal, extracting the signal features of the updated noise training data, and performing online dictionary training to obtain a noise dictionary; sparsely representing the signal frames of an input noisy speech signal over the speech dictionary and the noise dictionary; and extracting the sparse coefficients of the sparse representation and detecting the signal frames of the input noisy speech signal according to those coefficients. The method and the device can accurately distinguish the speech and non-speech parts of a speech signal in a noisy environment, improving voice activity detection performance under varying noise.

Description

Voice activity detection method and device

Technical field

The present invention relates to the field of speech processing technology, and in particular to a voice activity detection method and device.

Background

The first problem that speech analysis and processing must solve is distinguishing speech from non-speech in a speech signal. This task is called voice activity detection (VAD). The technique plays an important role in speech processing and largely determines the performance of other applications, typically robust speech recognition, speaker recognition, speech coding and transmission, and joint noise reduction and echo cancellation.

Basic traditional VAD methods include the one in the G.729 standard. The G.729 standard computes line spectral frequencies, full-band energy, low-band energy (below 1 kHz), and the zero-crossing rate. It then applies thresholds to classify each signal frame, and uses smoothing and adaptive correction to improve classification accuracy.
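To make the threshold-based approach concrete, here is a minimal sketch of two of the features named above (frame energy and zero-crossing rate) and a simple two-threshold decision. It is not the G.729 algorithm; the function names, the decision rule, and the threshold parameters are assumptions chosen for illustration.

```python
def frame_energy(frame):
    """Sum of squared samples in one frame."""
    return sum(s * s for s in frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def simple_threshold_vad(frame, energy_threshold, zcr_threshold):
    """Hypothetical rule: call a frame speech if energy is high and ZCR is low."""
    return (frame_energy(frame) > energy_threshold
            and zero_crossing_rate(frame) < zcr_threshold)
```

Fixed thresholds of this kind are what degrade under noise, which motivates the statistical and sparse-representation methods discussed next.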

Although the above method performs satisfactorily in noise-free environments, its performance drops sharply in noise. To address this, some researchers proposed voice activity detection algorithms based on statistical models. Typically, the spectral coefficients of the noise and speech signals are assumed to be modeled by complex Gaussian random variables, which leads to voice activity detection algorithms based on the likelihood ratio test. Later, many researchers tried to improve statistical-model-based voice activity detection by assuming different statistical models for the speech signal, for example the Gaussian model, the Laplacian model, signal-to-noise ratio measures, the multiple-observation likelihood ratio test, the generalized gamma distribution model, and Markov models.

These methods perform well under stationary noise, but under changing noise their performance still falls short of practical requirements. To address this, researchers have further proposed acoustic event detection (AED), the switching Kalman filter (SKF), and clustering algorithms such as spectral clustering.

In recent years, as the theory of sparse decomposition and reconstruction has matured, sparse coding has shown great potential in many areas of signal processing. Under a sparsity criterion, this technique expresses a signal as a linear combination of a set of elementary signals, yielding the sparse representation of the signal. Each elementary signal is called an atom, and the set of all atoms is called the atom dictionary. Many real-world signals, such as speech and images, are sparse or approximately sparse.

Morphological Component Analysis (MCA) is a separation method based on sparse signal representation. The method assumes that each signal source in a mixture has a corresponding dictionary that can sparsely represent that source, while the other sources cannot be sparsely represented with it. MCA is an effective sparse representation method when noise is present. K-singular value decomposition (K-SVD) is an overcomplete dictionary training method that generalizes the K-means algorithm. It uses sparse representation and, compared with traditional algorithms, requires less computation and performs better.

Summary of the invention

(1) Technical problem to be solved

The object of the present invention is to provide a voice activity detection method and device that solve the poor robustness of prior-art voice activity detection under changing noise conditions.

(2) Technical solution

To achieve the above object, the present invention proposes a voice activity detection method comprising the following steps:

Extracting the signal features of a clean speech signal and the signal features of a noisy speech signal;

Performing dictionary training on the signal features of the clean speech signal to obtain a speech dictionary;

Dynamically updating preset noise training data according to the signal features of the noisy speech signal, then extracting the signal features of the updated noise training data and performing online dictionary training to obtain a noise dictionary;

Sparsely representing the signal frames of the input noisy speech signal over the speech dictionary and the noise dictionary;

Extracting the sparse coefficients of the sparse representation, and detecting the signal frames of the input noisy speech signal according to the sparse coefficients.

Preferably, extracting the signal features of the clean speech signal and of the noisy speech signal specifically comprises:

Preprocessing the discrete-time signal of the clean speech;

Applying the discrete Fourier transform (DFT) to the signal frames of the preprocessed clean speech signal to obtain its magnitude spectrum, and using that magnitude spectrum as the signal feature of the clean speech signal;

Preprocessing the discrete-time signal of the noisy speech;

Applying the discrete Fourier transform (DFT) to the signal frames of the preprocessed noisy speech signal to obtain its magnitude spectrum, and using that magnitude spectrum as the signal feature of the noisy speech signal.

Preferably, the preprocessing specifically comprises: dividing the discrete-time signal into frames, and windowing each resulting frame.

Preferably, performing dictionary training on the signal features of the clean speech signal to obtain the speech dictionary specifically comprises:

Using the K-SVD algorithm to perform dictionary training on the signal features of the clean speech signal to obtain the speech dictionary Φ^s, according to the following formula:

\min_{Φ^{s}, X} \| Y^{s} - Φ^{s} X \|_{2}^{2}

\text{subject to} \quad \| x_{i} \|_{0} \leq T_{0}

where Y^s is the training signal feature formed from M clean speech signal frames, X = [x_1, x_2, ..., x_M] is the set of sparse vectors corresponding to Y^s, and T_0 is the preset sparsity limit for training the speech dictionary.

Preferably, dynamically updating the preset noise training data according to the signal features of the noisy speech signal, and extracting the signal features of the updated noise training data for online dictionary training to obtain the noise dictionary, specifically comprises:

Performing dictionary training on the signal features of the noisy speech signal to obtain a noisy-speech dictionary;

Extracting the signal features of the preset noise training data and performing dictionary training to obtain an initial noise dictionary;

Sparsely representing the noisy speech signal over the speech dictionary and the initial noise dictionary, and extracting new noise data from the noisy speech signal to dynamically update the preset noise training data;

Extracting the signal features of the updated noise training data and performing dictionary training to update the initial noise dictionary, thereby obtaining the noise dictionary.

Preferably, sparsely representing the signal frames of the input noisy speech signal over the speech dictionary and the noise dictionary specifically comprises:

Concatenating the speech dictionary and the noise dictionary to generate an atom dictionary;

Using the MCA algorithm with the atom dictionary to compute the sparse coefficients of the signal frames of the input noisy speech signal, the sparse coefficients comprising the sparse coefficients on the speech dictionary and the sparse coefficients on the noise dictionary;

Sparsely representing the signal frames of the input noisy speech signal according to the sparse coefficients.

Preferably, extracting the sparse coefficients of the sparse representation and detecting the signal frames of the input noisy speech signal according to the sparse coefficients specifically comprises:

Extracting the sparse coefficients on the speech dictionary;

Comparing the sparse coefficients on the speech dictionary with a preset threshold: when the number of nonzero elements among the speech-dictionary sparse coefficients is greater than the preset threshold, the signal frame of the input noisy speech signal is a speech signal; otherwise the signal frame is a non-speech signal.

In addition, the present invention also proposes a voice activity detection device comprising:

A feature extraction module, for extracting the signal features of a clean speech signal and the signal features of a noisy speech signal;

A speech dictionary training module, for performing dictionary training on the signal features of the clean speech signal to obtain a speech dictionary;

A noise dictionary training module, for dynamically updating preset noise training data according to the signal features of the noisy speech signal, and for extracting the signal features of the updated noise training data and performing online dictionary training to obtain a noise dictionary;

A sparse decomposition module, for sparsely representing the signal frames of the input noisy speech signal over the speech dictionary and the noise dictionary;

A detection module, for extracting the sparse coefficients of the sparse representation and detecting the signal frames of the input noisy speech signal according to the sparse coefficients.

Preferably, the sparse decomposition module comprises:

A dictionary concatenation unit, for concatenating the speech dictionary and the noise dictionary to generate an atom dictionary;

A sparse coefficient calculation unit, for using the MCA algorithm with the atom dictionary to compute the sparse coefficients of the signal frames of the input noisy speech signal, the sparse coefficients comprising the sparse coefficients on the speech dictionary and the sparse coefficients on the noise dictionary;

A sparse representation unit, for sparsely representing the signal frames of the input noisy speech signal according to the sparse coefficients.

Preferably, the detection module comprises:

An extraction unit, for extracting the sparse coefficients on the speech dictionary from the sparse coefficients;

A detection unit, for comparing the sparse coefficients on the speech dictionary with a preset threshold: when the number of nonzero elements among the speech-dictionary sparse coefficients is greater than the preset threshold, the signal frame of the input noisy speech signal is a speech signal; otherwise the signal frame is a non-speech signal.

(3) Beneficial effects

The voice activity detection method and device proposed by the present invention use the sparse-representation separation algorithm MCA and the dictionary training algorithm K-SVD to perform voice activity detection. They can accurately distinguish the speech and non-speech parts of a speech signal in a noisy environment, improve voice activity detection performance under varying noise, and are more robust than traditional methods.

Brief description of the drawings

Fig. 1 is a flowchart of the voice activity detection method of the present invention;

Fig. 2 is a module diagram of the voice activity detection device of the present invention.

Detailed description of the embodiments

Specific embodiments of the present invention are described in further detail below with reference to the drawings and examples. The following examples illustrate the present invention but do not limit its scope.

The present invention proposes a voice activity detection method which, as shown in Fig. 1, comprises the following steps:

S101: extract the signal features of the clean speech signal and of the noisy speech signal. This specifically comprises: preprocessing the discrete-time signal of the clean speech; applying the discrete Fourier transform (DFT) to the signal frames of the preprocessed clean speech signal to obtain its magnitude spectrum, and using that magnitude spectrum as the signal feature of the clean speech signal; preprocessing the discrete-time signal of the noisy speech; applying the discrete Fourier transform (DFT) to the signal frames of the preprocessed noisy speech signal to obtain its magnitude spectrum, and using that magnitude spectrum as the signal feature of the noisy speech signal.

The preprocessing specifically comprises: dividing the discrete-time signal into frames, and windowing each resulting frame.

S102: perform dictionary training on the signal features of the clean speech signal to obtain the speech dictionary. This specifically comprises: using the K-SVD algorithm to perform dictionary training on the signal features of the clean speech signal to obtain the speech dictionary Φ^s, according to the following formula:

\min_{Φ^{s}, X} \| Y^{s} - Φ^{s} X \|_{2}^{2}

\text{subject to} \quad \| x_{i} \|_{0} \leq T_{0}

where Y^s is the training signal feature formed from the magnitude spectra of M clean speech signal frames, X = [x_1, x_2, ..., x_M] is the set of sparse vectors corresponding to Y^s, and T_0 is the preset sparsity limit for training the speech dictionary.

S103: dynamically update the preset noise training data according to the signal features of the noisy speech signal, then extract the signal features of the updated noise training data and perform online dictionary training to obtain the noise dictionary. This specifically comprises: performing dictionary training on the signal features of the noisy speech signal to obtain a noisy-speech dictionary; extracting the signal features of the preset noise training data and performing dictionary training to obtain an initial noise dictionary; sparsely representing the noisy speech signal over the speech dictionary and the initial noise dictionary, and extracting new noise data from the noisy speech signal to dynamically update the preset noise training data; extracting the signal features of the updated noise training data and performing dictionary training to update the initial noise dictionary, thereby obtaining the noise dictionary.

S104: sparsely represent the signal frames of the input noisy speech signal over the speech dictionary and the noise dictionary. This specifically comprises: concatenating the speech dictionary and the noise dictionary to generate an atom dictionary; using the MCA algorithm with the atom dictionary to compute the sparse coefficients of the signal frames of the input noisy speech signal, the sparse coefficients comprising the sparse coefficients on the speech dictionary and the sparse coefficients on the noise dictionary; and sparsely representing the signal frames of the input noisy speech signal according to the sparse coefficients.

S105: extract the sparse coefficients of the sparse representation and detect the signal frames of the input noisy speech signal according to the sparse coefficients. This specifically comprises: extracting the sparse coefficients on the speech dictionary; and comparing them with a preset threshold: when the number of nonzero elements among the speech-dictionary sparse coefficients is greater than the preset threshold, the signal frame of the input noisy speech signal is a speech signal; otherwise the signal frame is a non-speech signal.

The voice activity detection method disclosed by the invention detects the speech and non-speech parts of a noisy speech signal based on Morphological Component Analysis (MCA). The inputs in this embodiment are the discrete-time signals of the clean speech and of the noisy speech. First, the signal features of the clean speech signal and of the noisy speech signal are extracted, as follows. The discrete-time signals of the clean speech and of the noisy speech are each preprocessed by framing and windowing. Framing divides the time signal into mutually overlapping speech segments, that is, frames. Each frame is typically about 30 ms long, with a frame shift of 10 ms. Next, each speech frame is windowed. The window functions in wide use today are the Hamming window and the Hanning window; this embodiment uses the Hamming window.

In the window function, n is the time index and L is the window length.
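For reference, the Hamming window in its common textbook form, written with the symbols n and L defined above, is given below. This is an assumption about the intended formula: some references use L rather than L - 1 in the denominator.

```latex
w(n) = 0.54 - 0.46 \cos\!\left( \frac{2 \pi n}{L - 1} \right), \qquad 0 \leq n \leq L - 1
```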

The discrete Fourier transform (DFT) is applied to the signal frames of the preprocessed clean speech signal to obtain its magnitude spectrum, which is used as the signal feature of the clean speech signal. Likewise, the DFT is applied to the signal frames of the preprocessed noisy speech signal to obtain its magnitude spectrum, which is used as the signal feature of the noisy speech signal. The DFT is given by:

X_{a}(k) = Σ_{n = 0}^{N - 1} x(n) e^{- j 2 π k n / N}, \quad 0 \leq k \leq N - 1

In this formula, x(n) is the windowed signal frame and N is the number of DFT points. Applying the DFT to the preprocessed clean speech signal y^s(n) gives the magnitude spectrum Y^s of the clean speech signal; applying the DFT to the preprocessed noisy speech signal y(n) gives the magnitude spectrum Y of the noisy speech signal.
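The feature extraction described so far (framing, Hamming windowing, DFT magnitude) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the patent's implementation: the function names are hypothetical, the window uses the common L - 1 denominator convention, frames that do not fit at the end of the signal are dropped, and the DFT is evaluated directly rather than with an FFT.

```python
import cmath
import math


def split_into_frames(signal, frame_len, hop_len):
    """Split a sample sequence into overlapping frames (no padding)."""
    return [
        signal[start:start + frame_len]
        for start in range(0, len(signal) - frame_len + 1, hop_len)
    ]


def hamming_window(length):
    """Hamming window coefficients (assumed L - 1 denominator convention)."""
    if length == 1:
        return [1.0]
    return [
        0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
        for n in range(length)
    ]


def magnitude_spectrum(frame):
    """Direct O(N^2) DFT magnitude |X(k)| for k = 0 .. N-1."""
    n_points = len(frame)
    return [
        abs(sum(
            x * cmath.exp(-2j * cmath.pi * k * n / n_points)
            for n, x in enumerate(frame)
        ))
        for k in range(n_points)
    ]


def extract_features(signal, sample_rate, frame_ms=30, hop_ms=10):
    """Return one magnitude spectrum per windowed frame."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    window = hamming_window(frame_len)
    return [
        magnitude_spectrum([s * w for s, w in zip(frame, window)])
        for frame in split_into_frames(signal, frame_len, hop_len)
    ]
```

The defaults of 30 ms frames and a 10 ms shift follow the values given in this embodiment; in practice an FFT routine would replace the direct DFT for speed.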

The clean speech dictionary is trained with the K-SVD algorithm. K-singular value decomposition (K-SVD) is an overcomplete dictionary training method that generalizes the K-means algorithm. It uses sparse representation and, compared with traditional algorithms, requires less computation and performs better.

To train the speech dictionary Φ^s with the K-SVD algorithm, the dictionary training problem is stated as:

\min_{Φ^{s}, X} \| Y^{s} - Φ^{s} X \|_{2}^{2}

\text{subject to} \quad \| x_{i} \|_{0} \leq T_{0}

where Y^s is the training signal feature formed from the magnitude spectra of M clean speech signal frames, X = [x_1, x_2, ..., x_M] is the set of sparse vectors corresponding to Y^s, and T_0 is the preset sparsity limit for training the speech dictionary. Solving this problem performs the dictionary learning and yields the speech dictionary Φ^s.
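The text does not spell out the K-SVD iteration itself. As commonly described in the literature, K-SVD alternates a sparse coding stage, in which a pursuit algorithm computes X under the T_0 constraint with the dictionary fixed, with an atom-by-atom dictionary update. The update for the k-th atom can be written as below, where \phi_{k} (the k-th column of Φ^s), x_{T}^{k} (the k-th row of X), \omega_{k}, and E_{k}^{R} are notation introduced here for illustration:

```latex
\begin{aligned}
E_{k} &= Y^{s} - \sum_{j \neq k} \phi_{j}\, x_{T}^{j}
  && \text{(error without atom } k\text{)} \\
\omega_{k} &= \{\, i : x_{T}^{k}(i) \neq 0 \,\}
  && \text{(frames that use atom } k\text{)} \\
E_{k}^{R} &= U \Delta V^{T}
  && \text{(SVD of } E_{k} \text{ restricted to the columns in } \omega_{k}\text{)} \\
\phi_{k} &\leftarrow u_{1}, \qquad x_{R}^{k} \leftarrow \Delta(1,1)\, v_{1}
  && \text{(rank-one update of atom and coefficients)}
\end{aligned}
```

Here u_1 and v_1 are the first columns of U and V, and x_{R}^{k} is x_{T}^{k} restricted to the indices in \omega_{k}. Restricting the update to those indices keeps the sparsity pattern of X unchanged.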

For the noise atoms, a dynamic update strategy is adopted: to track the influence of time-varying noise, the noise dictionary is trained and updated online.

The preset noise training data is dynamically updated according to the signal features of the noisy speech signal, and the signal features of the noise training data are then extracted to update the noise dictionary online, that is, in real time. The specific steps are described below.

In this embodiment, Γ stores the noise training data, and y_s denotes the sparse coefficients of Y_i over the clean speech dictionary Φ^s. Ψ denotes the large dictionary formed by concatenating the clean speech dictionary Φ^s with the initial noise dictionary Φ^v. Y = [Y_1, Y_2, ..., Y_P] is the noisy speech under test, where P is the number of data frames. The output is the updated noise dictionary Φ^v. The specific steps are as follows:

Step 1: initialize the noise store Γ, which holds the preset noise training data, to empty. The threshold δ = 2 used in this algorithm is the optimal value obtained from extensive experiments.

Step 2: while the threshold condition holds, repeat the following loop:

1. Initialize the noise data set Γ to empty;

2. Sparsely represent each data frame Y_i over the dictionary Ψ to obtain its sparse representation y_i;

3. Compute the 1-norm of each y_i and accumulate the results into a running sum;

4. Reconstruct the signal from the sparse representation, compute the residual, and save the residual data in Γ to update the preset noise training data;

5. With the updated noise training data in Γ as input, train the noise dictionary with the K-SVD algorithm and store it in Φ^v (updating the noise dictionary);

6. Compute the quantity tested by the loop condition.

Exit the loop when the condition no longer holds.
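Steps 3 and 4 of the loop can be sketched as follows. This is a minimal illustration under stated assumptions: a dictionary is represented as a list of atoms (each a list of floats), the sparse codes y_i are taken as already computed, and all function and variable names are hypothetical. The K-SVD retraining of step 5 and the loop's stopping test are not shown.

```python
def reconstruct(atoms, coeffs):
    """Return sum_k coeffs[k] * atoms[k]; each atom is a list of floats."""
    dim = len(atoms[0])
    return [
        sum(c * atom[d] for c, atom in zip(coeffs, atoms))
        for d in range(dim)
    ]


def collect_residuals(frames, atoms, codes):
    """Accumulate the 1-norms of the codes and gather per-frame residuals.

    frames: the feature frames Y_i
    atoms:  the atoms of the concatenated dictionary (Psi in the text)
    codes:  the sparse representations y_i, one per frame
    """
    l1_total = 0.0
    noise_store = []  # plays the role of the noise store (Gamma in the text)
    for frame, code in zip(frames, codes):
        l1_total += sum(abs(c) for c in code)        # step 3
        approx = reconstruct(atoms, code)            # step 4: reconstruct
        noise_store.append([f - a for f, a in zip(frame, approx)])
    return l1_total, noise_store
```

The residuals returned here are the part of each frame that the current dictionary fails to explain, which is why they serve as fresh noise training data.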

The speech dictionary Φ^s and the noise dictionary Φ^v obtained above are concatenated into a new dictionary Ψ = [Φ^s Φ^v], the atom dictionary. The MCA algorithm is then used to sparsely represent the input noisy speech frames after feature extraction. Sparse decomposition of a noisy speech frame finds the sparse representation of the noisy speech over the concatenated dictionary. Intuitively, the speech component is represented on the speech atoms and the noise component on the noise atoms. At reconstruction, the coefficients on all noise atoms are set to 0 and only the nonzero coefficients on the speech atoms are retained.

Suppose the speech dictionary Φ^s and the noise dictionary Φ^v form the atom dictionary Φ = [Φ^s Φ^v]. The noisy speech signal is y = s + v, where s is the clean speech and v is the noise. Decomposing the noisy speech over the redundant dictionary gives the coefficient vector x, so that

y = Φx = [\begin{matrix} Φ^{s} & Φ^{v} \end{matrix}] [\begin{matrix} x^{s} \\ x^{v} \end{matrix}] = Φ^{s} x^{s} + Φ^{v} x^{v}

where x^s is the coefficient vector of the noisy speech on the speech dictionary, that is, the sparse coefficients on the speech dictionary, and x^v is the coefficient vector of y on the noise dictionary, that is, the sparse coefficients on the noise dictionary.

The Morphological Component Analysis (MCA) algorithm is used to sparsely represent the input noisy speech signal. The problem is stated as:

\min \| x \|_{0} \quad \text{subject to} \quad \| Y - Ψx \|_{2} < α

where Y is the noisy speech signal, x is the sparse coefficient vector of the noisy speech signal, and Ψ = [Φ^s Φ^v] is the atom dictionary obtained by concatenating the two dictionaries, Φ^s being the speech dictionary and Φ^v the noise dictionary. This formula yields the x with the fewest nonzero elements such that ||Y - Ψx||_2 is less than α. Each frame of the noisy speech signal Y can then be represented by the sparse coefficients x, where x^s are the sparse coefficients on the speech dictionary and x^v are the sparse coefficients on the noise dictionary.
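As a rough illustration of sparse coding over a concatenated dictionary, the sketch below uses plain greedy matching pursuit, a simpler technique swapped in for MCA, with the residual bound α as its stopping rule. It assumes unit-norm atoms stored as lists of floats; every function name and the max_iter parameter are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def matching_pursuit(y, atoms, alpha, max_iter=50):
    """Greedy sparse coding of y over unit-norm atoms.

    Stops once the residual 2-norm falls below alpha or max_iter is reached.
    """
    residual = list(y)
    coeffs = [0.0] * len(atoms)
    for _ in range(max_iter):
        if math.sqrt(dot(residual, residual)) < alpha:
            break
        # Pick the atom most correlated with the current residual.
        scores = [dot(residual, atom) for atom in atoms]
        best = max(range(len(atoms)), key=lambda k: abs(scores[k]))
        coeffs[best] += scores[best]
        residual = [r - scores[best] * a
                    for r, a in zip(residual, atoms[best])]
    return coeffs


def decompose(y, speech_atoms, noise_atoms, alpha):
    """Code y over the concatenated dictionary and split x into (x_s, x_v)."""
    x = matching_pursuit(y, speech_atoms + noise_atoms, alpha)
    return x[:len(speech_atoms)], x[len(speech_atoms):]


def speech_estimate(speech_atoms, x_s):
    """Reconstruct from speech atoms only (noise coefficients dropped)."""
    dim = len(speech_atoms[0])
    return [
        sum(c * atom[d] for c, atom in zip(x_s, speech_atoms))
        for d in range(dim)
    ]
```

Splitting x by position works because the speech atoms occupy the first columns of the concatenated dictionary, so the leading coefficients are x^s and the trailing ones are x^v.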

In this embodiment, the sparse coefficients of the sparse representation are extracted to detect the signal frames of the input noisy speech signal, that is, to decide whether each input noisy frame is a speech signal or a non-speech signal. This specifically comprises the following steps: extract the speech-dictionary sparse coefficients x^s from the sparse coefficients x; compare x^s with the preset threshold ξ; when the number of nonzero elements in x^s is greater than ξ, the signal frame of the input noisy speech signal is a speech signal, otherwise the signal frame is a non-speech signal. Specifically:

||x^s||_0 > ξ: speech signal;

||x^s||_0 ≤ ξ: non-speech signal;

where ||x^s||_0 is the number of nonzero elements of x^s and ξ is the threshold. When the number of nonzero elements of the speech-dictionary sparse coefficients x^s is greater than ξ, the speech dictionary Φ^s has been used many times, which indicates that the frame is a speech signal; otherwise the frame is a non-speech signal. In this embodiment the threshold is the optimal value obtained by experiment, namely 2.5.
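The decision rule can be sketched in a few lines. The default threshold of 2.5 comes from this embodiment; the function names and the numerical tolerance used to decide whether a coefficient counts as nonzero are assumptions.

```python
def count_nonzero(coeffs, tol=1e-8):
    """Number of coefficients whose magnitude exceeds tol (the l0 'norm')."""
    return sum(1 for c in coeffs if abs(c) > tol)


def is_speech_frame(x_s, xi=2.5):
    """Speech if the speech-dictionary coefficients have more than xi nonzeros."""
    return count_nonzero(x_s) > xi
```

With ξ = 2.5, a frame is labeled speech when at least three speech atoms are active, for example `is_speech_frame([0.5, 0.0, 1.2, -0.3])` evaluates to `True`.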

In addition, this embodiment also proposes a voice activity detection device which, as shown in Fig. 2, comprises:

A feature extraction module 1, for extracting the signal features of a clean speech signal and the signal features of a noisy speech signal;

A speech dictionary training module 2, for performing dictionary training on the signal features of the clean speech signal to obtain a speech dictionary;

A noise dictionary training module 3, for dynamically updating preset noise training data according to the signal features of the noisy speech signal, and for extracting the signal features of the updated noise training data and performing online dictionary training to obtain a noise dictionary;

A sparse decomposition module 4, for sparsely representing the signal frames of the input noisy speech signal over the speech dictionary and the noise dictionary;

A detection module 5, for extracting the sparse coefficients of the sparse representation and detecting the signal frames of the input noisy speech signal according to the sparse coefficients.

The sparse decomposition module 4 comprises a dictionary concatenation unit, a sparse coefficient calculation unit, and a sparse representation unit;

The detection module 5 comprises an extraction unit and a detection unit;

The detection unit compares the sparse coefficients on the speech dictionary with a preset threshold: when the number of nonzero elements among the speech-dictionary sparse coefficients is greater than the preset threshold, the signal frame of the input noisy speech signal is a speech signal; otherwise the signal frame is a non-speech signal.

The above embodiments only illustrate the present invention and do not limit it. Those of ordinary skill in the relevant technical field can make various changes and modifications without departing from the spirit and scope of the present invention; all equivalent technical solutions therefore also fall within the scope of the present invention, and the scope of patent protection of the present invention is defined by the claims.

Claims

1. A voice activity detection method, characterized by comprising:

2. The method of claim 1, characterized in that extracting the signal features of the clean speech signal and the signal features of the noisy speech signal specifically comprises:

Preprocessing the discrete-time signal of the clean speech;

3. The method of claim 2, characterized in that the preprocessing specifically comprises: dividing the discrete-time signal into frames, and windowing each resulting frame.

4. The method of claim 1, characterized in that performing dictionary training on the signal features of the clean speech signal to obtain the speech dictionary specifically comprises:

\min_{Φ^{s}, X} \| Y^{s} - Φ^{s} X \|_{2}^{2}

\text{subject to} \quad \| x_{i} \|_{0} \leq T_{0}

5. The method of claim 1, characterized in that dynamically updating the preset noise training data according to the signal features of the noisy speech signal, and extracting the signal features of the updated noise training data for online dictionary training to obtain the noise dictionary, specifically comprises:

6. The method of claim 1, characterized in that sparsely representing the signal frames of the input noisy speech signal over the speech dictionary and the noise dictionary specifically comprises:

Using the Morphological Component Analysis (MCA) algorithm with the atom dictionary to compute the sparse coefficients of the signal frames of the input noisy speech signal, the sparse coefficients comprising the sparse coefficients on the speech dictionary and the sparse coefficients on the noise dictionary;

7. The method of claim 6, characterized in that extracting the sparse coefficients of the sparse representation and detecting the signal frames of the input noisy speech signal according to the sparse coefficients specifically comprises:

Extracting the sparse coefficients on the speech dictionary;

8. A voice activity detection device, characterized in that the device comprises:

9. The device of claim 8, characterized in that the sparse decomposition module comprises:

10. The device of claim 8, characterized in that the detection module comprises: