CN104036777A - Method and device for voice activity detection - Google Patents

Publication number
CN104036777A
Authority
CN
China
Prior art keywords
dictionary
signal
noise
voice
speech signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410217411.6A
Other languages
Chinese (zh)
Inventor
何勇军
孙广路
谢怡宁
郑云龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin University of Science and Technology
Original Assignee
Harbin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin University of Science and Technology
Priority to CN201410217411.6A
Publication of CN104036777A

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention relates to a method and a device for voice activity detection. The method comprises the steps of extracting the signal characteristics of clean voice signals and the signal characteristics of noise mixed voice signals, carrying out dictionary training according to the signal characteristics of the clean voice signals to obtain a voice dictionary, dynamically updating predetermined noise training data according to the signal characteristics of the noise mixed voice signals, extracting the signal characteristics of the updated noise training data and carrying out online dictionary training to obtain a noise dictionary; performing sparse representation on the signal frames of a noise mixed voice signal input according to the voice dictionary and the noise dictionary, extracting a sparse coefficient in the sparse representation, and detecting the signal frames of the input noise mixed voice signal according to the sparse coefficient. The method and the device are capable of accurately recognizing the voice part and the non-voice part of a voice signal in a noise environment, and the performance of the voice activity detection in the varying noise environment is improved.

Description

Voice activity detection method and device
Technical field
The present invention relates to the field of speech processing, and in particular to a voice activity detection method and device.
Background
The first problem to solve when analyzing and processing speech is detecting the speech and non-speech segments of a speech signal. This task is called voice activity detection (VAD). VAD plays an important role in speech processing and largely determines the performance of other applications, typically robust speech recognition, speaker recognition, speech coding and transmission, and joint noise reduction and echo cancellation.
Traditional VAD methods include the G.729 standard, which computes line spectral frequencies, full-band energy, low-band energy (<1 kHz), and zero-crossing rate. Thresholds are then set to classify each frame, and smoothing and adaptive correction are applied to improve classification accuracy.
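As an illustration of this kind of threshold-based classification, the following minimal Python sketch computes three of the features named above (full-band energy, low-band energy below 1 kHz, zero-crossing rate) and applies a fixed rule. It is not the G.729 algorithm: line spectral frequencies, smoothing, and adaptive correction are omitted, and the function names, the 8 kHz sample rate, and both threshold values are assumptions chosen for the example.

```python
import numpy as np

def frame_features(frame, sample_rate=8000, low_cutoff_hz=1000.0):
    """Return (full-band energy, low-band spectral energy, zero-crossing rate)."""
    frame = np.asarray(frame, dtype=float)
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    full_energy = float(np.sum(frame ** 2))          # time-domain energy
    low_energy = float(np.sum(power[freqs < low_cutoff_hz]))  # spectral energy below cutoff
    # Fraction of adjacent sample pairs whose signs differ.
    signs = np.sign(frame)
    zcr = float(np.mean(signs[:-1] != signs[1:]))
    return full_energy, low_energy, zcr

def simple_threshold_vad(frame, energy_threshold=1.0, zcr_threshold=0.25):
    """Hypothetical rule: speech if the frame is energetic and has a low ZCR."""
    full_energy, _, zcr = frame_features(frame)
    return full_energy > energy_threshold and zcr < zcr_threshold
```

Fixed thresholds of this kind are what make such detectors fragile once the noise level changes, which motivates the methods discussed next.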
Although said method can be obtained satisfied performance under without the environment of making an uproar, under noise circumstance, its performance will sharply reduce.For addressing this problem, some researchers have proposed the voice activity detection algorithms based on statistical model.Typically the spectral coefficient of hypothesis noise and voice signal can be carried out to modeling with complicated Gaussian random variable, thereby develop the voice activity detection algorithms based on likelihood ratio test.Afterwards, there were again many researchers to want by supposing that for voice signal different statistical models improves the performance of the voice activity detection algorithms based on statistical model.Such as having Gauss model, laplace model, snr measurement, a plurality of observation likelihood ratio test, broad sense gamma distributed model, Markov model etc.
These methods perform well under stationary noise, but under time-varying noise their performance still falls short of practical requirements. To address this, researchers further proposed acoustic event detection (AED), the switching Kalman filter (SKF), and clustering algorithms (such as spectral clustering).
In recent years, as sparse decomposition and reconstruction theory has matured, sparse coding has shown great potential across signal processing. Under a sparsity criterion, the technique expresses a signal as a linear combination of a set of elementary signals, yielding the signal's sparse representation. Each elementary signal is called an atom, and the set of all atoms is called an atom dictionary. Many real-world signals, such as speech and images, are sparse or approximately sparse.
Morphological Component Analysis (MCA) is a separation method based on sparse signal representation. It assumes that for each source in a mixture there exists a corresponding dictionary that can sparsely represent that source, while the other sources cannot be sparsely represented by that dictionary. MCA is an effective sparse representation method in the presence of noise. K-SVD, which is based on singular value decomposition, is an overcomplete dictionary training method that generalizes the K-means algorithm. Because it uses sparse representation, it requires less computation and performs better than traditional algorithms.
Summary of the invention
(1) Technical problem to be solved
The object of the present invention is to provide a voice activity detection method and device that solve the poor robustness of prior-art voice activity detection under time-varying noise.
(2) Technical solution
To achieve the above object, the present invention proposes a voice activity detection method comprising the following steps:
Extracting the signal features of a clean speech signal and the signal features of a noisy speech signal;
Performing dictionary training on the signal features of the clean speech signal to obtain a voice dictionary;
Dynamically updating preset noise training data according to the signal features of the noisy speech signal, then extracting the signal features of the updated noise training data and performing online dictionary training to obtain a noise dictionary;
Sparsely representing the signal frames of the input noisy speech signal over the voice dictionary and the noise dictionary;
Extracting the sparse coefficients of the sparse representation, and classifying the signal frames of the input noisy speech signal according to the sparse coefficients.
Preferably, extracting the signal features of the clean speech signal and of the noisy speech signal specifically comprises:
Preprocessing the discrete-time signal of the clean speech;
Applying the discrete Fourier transform (DFT) to the signal frames of the preprocessed clean speech signal to obtain its magnitude spectrum, and using that magnitude spectrum as the signal features of the clean speech signal;
Preprocessing the discrete-time signal of the noisy speech;
Applying the DFT to the signal frames of the preprocessed noisy speech signal to obtain its magnitude spectrum, and using that magnitude spectrum as the signal features of the noisy speech signal.
Preferably, the preprocessing specifically comprises: dividing the discrete-time signal into frames, and windowing each resulting frame.
Preferably, performing dictionary training on the signal features of the clean speech signal to obtain the voice dictionary specifically comprises:
Using the K-SVD algorithm to train a dictionary on the signal features of the clean speech signal, obtaining the voice dictionary $\Phi_s$, according to:
$$\min_{\Phi_s, X} \| Y_s - \Phi_s X \|_2^2 \quad \text{subject to} \quad \| x_i \|_0 \le T_0, \quad i = 1, \dots, M$$
where $Y_s$ is the training feature set formed from $M$ clean speech signal frames, $X = [x_1, x_2, \dots, x_M]$ is the set of sparse vectors corresponding to $Y_s$, and $T_0$ is the preset sparsity threshold used when training the voice dictionary.
Preferably, dynamically updating the preset noise training data according to the signal features of the noisy speech signal, and extracting the signal features of the updated noise training data to perform online dictionary training and obtain the noise dictionary, specifically comprises:
Performing dictionary training on the signal features of the noisy speech signal to obtain a noisy-speech dictionary;
Extracting the signal features of the preset noise training data and performing dictionary training to obtain an initial noise dictionary;
Sparsely representing the noisy speech signal over the voice dictionary and the initial noise dictionary, extracting new noise data from the noisy speech signal, and dynamically updating the preset noise training data;
Extracting the signal features of the updated noise training data and performing dictionary training to update the initial noise dictionary, obtaining the noise dictionary.
Preferably, sparsely representing the signal frames of the input noisy speech signal over the voice dictionary and the noise dictionary specifically comprises:
Splicing the voice dictionary and the noise dictionary to generate an atom dictionary;
Using the MCA algorithm with the atom dictionary to compute the sparse coefficients of the signal frames of the input noisy speech signal, the sparse coefficients comprising the sparse coefficients of the voice dictionary and the sparse coefficients of the noise dictionary;
Sparsely representing the signal frames of the input noisy speech signal according to the sparse coefficients.
Preferably, extracting the sparse coefficients of the sparse representation and classifying the signal frames of the input noisy speech signal according to the sparse coefficients specifically comprises:
Extracting the sparse coefficients of the voice dictionary;
Comparing the sparse coefficients of the voice dictionary with a preset threshold: when the number of nonzero elements in the sparse coefficients of the voice dictionary is greater than the preset threshold, the signal frame of the input noisy speech signal is a speech signal; otherwise the signal frame is a non-speech signal.
In addition, the present invention proposes a voice activity detection device comprising:
A feature extraction module, for extracting the signal features of the clean speech signal and the signal features of the noisy speech signal;
A voice dictionary training module, for performing dictionary training on the signal features of the clean speech signal to obtain the voice dictionary;
A noise dictionary training module, for dynamically updating the preset noise training data according to the signal features of the noisy speech signal, and for extracting the signal features of the updated noise training data and performing online dictionary training to obtain the noise dictionary;
A sparse decomposition module, for sparsely representing the signal frames of the input noisy speech signal over the voice dictionary and the noise dictionary;
A detection module, for extracting the sparse coefficients of the sparse representation and classifying the signal frames of the input noisy speech signal according to the sparse coefficients.
Preferably, the sparse decomposition module comprises:
A dictionary splicing unit, for splicing the voice dictionary and the noise dictionary to generate the atom dictionary;
A sparse coefficient calculation unit, for using the MCA algorithm with the atom dictionary to compute the sparse coefficients of the signal frames of the input noisy speech signal, the sparse coefficients comprising the sparse coefficients of the voice dictionary and the sparse coefficients of the noise dictionary;
A sparse representation unit, for sparsely representing the signal frames of the input noisy speech signal according to the sparse coefficients.
Preferably, the detection module comprises:
An extraction unit, for extracting the sparse coefficients of the voice dictionary from the sparse coefficients;
A detection unit, for comparing the sparse coefficients of the voice dictionary with the preset threshold: when the number of nonzero elements in the sparse coefficients of the voice dictionary is greater than the preset threshold, the signal frame of the input noisy speech signal is a speech signal; otherwise the signal frame is a non-speech signal.
(3) Beneficial effects
The voice activity detection method and device proposed by the present invention use the sparse-representation separation algorithm MCA and the dictionary training algorithm K-SVD to perform voice activity detection. They can accurately distinguish the speech and non-speech parts of a speech signal in a noisy environment, improve voice activity detection performance under time-varying noise, and offer stronger detection robustness than traditional methods.
Brief description of the drawings
Fig. 1 is a flowchart of the voice activity detection method of the present invention;
Fig. 2 is a block diagram of the voice activity detection device of the present invention.
Detailed description of the embodiments
Specific embodiments of the present invention are described in further detail below with reference to the drawings and examples. The following embodiments illustrate the present invention and do not limit its scope.
The present invention proposes a voice activity detection method which, as shown in Fig. 1, comprises the following steps:
S101: extract the signal features of the clean speech signal and the signal features of the noisy speech signal. Specifically: preprocess the discrete-time signal of the clean speech; apply the discrete Fourier transform (DFT) to the signal frames of the preprocessed clean speech signal to obtain its magnitude spectrum, and use that magnitude spectrum as the signal features of the clean speech signal; preprocess the discrete-time signal of the noisy speech; apply the DFT to the signal frames of the preprocessed noisy speech signal to obtain its magnitude spectrum, and use that magnitude spectrum as the signal features of the noisy speech signal.
Here the preprocessing specifically comprises: dividing the discrete-time signal into frames, and windowing each resulting frame.
S102: perform dictionary training on the signal features of the clean speech signal to obtain the voice dictionary. Specifically: use the K-SVD algorithm to train a dictionary on the signal features of the clean speech signal, obtaining the voice dictionary $\Phi_s$, according to:
$$\min_{\Phi_s, X} \| Y_s - \Phi_s X \|_2^2 \quad \text{subject to} \quad \| x_i \|_0 \le T_0, \quad i = 1, \dots, M$$
where $Y_s$ is the training feature set formed from the magnitude spectra of $M$ clean speech signal frames, $X = [x_1, x_2, \dots, x_M]$ is the set of sparse vectors corresponding to $Y_s$, and $T_0$ is the preset sparsity threshold used when training the voice dictionary.
S103: dynamically update the preset noise training data according to the signal features of the noisy speech signal, then extract the signal features of the updated noise training data and perform online dictionary training to obtain the noise dictionary. Specifically: perform dictionary training on the signal features of the noisy speech signal to obtain a noisy-speech dictionary; extract the signal features of the preset noise training data and perform dictionary training to obtain an initial noise dictionary; sparsely represent the noisy speech signal over the voice dictionary and the initial noise dictionary, extract new noise data from the noisy speech signal, and dynamically update the preset noise training data; extract the signal features of the updated noise training data and perform dictionary training to update the initial noise dictionary, obtaining the noise dictionary.
S104: sparsely represent the signal frames of the input noisy speech signal over the voice dictionary and the noise dictionary. Specifically: splice the voice dictionary and the noise dictionary to generate the atom dictionary; use the MCA algorithm with the atom dictionary to compute the sparse coefficients of the signal frames of the input noisy speech signal, the sparse coefficients comprising the sparse coefficients of the voice dictionary and the sparse coefficients of the noise dictionary; sparsely represent the signal frames of the input noisy speech signal according to the sparse coefficients.
S105: extract the sparse coefficients of the sparse representation and classify the signal frames of the input noisy speech signal according to them. Specifically: extract the sparse coefficients of the voice dictionary; compare them with a preset threshold; when the number of nonzero elements in the sparse coefficients of the voice dictionary is greater than the preset threshold, the signal frame of the input noisy speech signal is a speech signal; otherwise the signal frame is a non-speech signal.
The voice activity detection method disclosed by the present invention detects the speech and non-speech parts of a noisy speech signal based on Morphological Component Analysis (MCA). The inputs in this embodiment are the discrete-time signals of the clean speech and of the noisy speech. First, the signal features of both signals are extracted as follows. Each discrete-time signal is preprocessed by framing and windowing. Framing divides the time signal into mutually overlapping speech segments, i.e. frames. Each frame is typically about 30 ms long, with a frame shift of 10 ms. Each frame is then windowed. The most widely used window functions are the Hamming window and the Hanning window; this embodiment uses the Hamming window, whose standard form is:
$$w(n) = 0.54 - 0.46 \cos\left( \frac{2 \pi n}{L - 1} \right), \quad 0 \le n \le L - 1$$
where $n$ is the time index and $L$ is the window length.
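The framing and windowing step can be sketched in Python as follows. The 30 ms frame length and 10 ms frame shift come from the text above; the function names, the handling of trailing samples (incomplete frames are dropped), and the use of the standard Hamming definition with denominator L − 1 are assumptions of this sketch.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=30.0, shift_ms=10.0):
    """Split a 1-D signal into overlapping frames; trailing samples are dropped."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(signal) - frame_len) // shift
    # Row i holds samples [i*shift, i*shift + frame_len).
    idx = np.arange(frame_len)[None, :] + shift * np.arange(n_frames)[:, None]
    return signal[idx]

def hamming_window(L):
    """Standard Hamming window of length L (same definition as numpy.hamming)."""
    n = np.arange(L)
    return 0.54 - 0.46 * np.cos(2.0 * np.pi * n / (L - 1))

def preprocess(signal, sample_rate):
    """Frame the signal and apply the Hamming window to every frame."""
    frames = frame_signal(signal, sample_rate)
    return frames * hamming_window(frames.shape[1])
```

At an 8 kHz sample rate this gives 240-sample frames advanced by 80 samples, so consecutive frames overlap by two thirds.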
The DFT is applied to the signal frames of the preprocessed clean speech signal to obtain its magnitude spectrum, which serves as the signal features of the clean speech signal. Likewise, the DFT is applied to the signal frames of the preprocessed noisy speech signal to obtain its magnitude spectrum, which serves as the signal features of the noisy speech signal. The DFT is given by:
$$X_a(k) = \sum_{n=0}^{N-1} x(n) \, e^{-j 2 \pi k n / N}, \quad 0 \le k \le N - 1$$
where $x(n)$ is the windowed signal frame and $N$ is the number of DFT points. Applying the DFT to the preprocessed clean speech signal $y_s(n)$ gives the magnitude spectrum $Y_s$ of the clean speech signal; applying it to the preprocessed noisy speech signal $y(n)$ gives the magnitude spectrum $Y$ of the noisy speech signal.
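A minimal sketch of this feature extraction, assuming the windowed frames are stored one per row. The function name and the choice to return one column per frame (so that the result can be used directly as a training matrix) are assumptions; `numpy.fft.fft` evaluates the same sum as the DFT formula above.

```python
import numpy as np

def magnitude_spectrum(windowed_frames, n_fft=None):
    """Return |DFT| of each frame as a matrix with one column per frame."""
    frames = np.atleast_2d(np.asarray(windowed_frames, dtype=float))
    if n_fft is None:
        n_fft = frames.shape[1]          # N defaults to the frame length
    spectrum = np.fft.fft(frames, n=n_fft, axis=1)
    return np.abs(spectrum).T            # shape (n_fft, n_frames)
```

For real-valued frames the magnitude spectrum is symmetric about N/2, so an implementation could keep only the first N/2 + 1 bins; the sketch keeps all N to match the formula.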
The clean voice dictionary is trained with the K-SVD algorithm. K-SVD, which is based on singular value decomposition, is an overcomplete dictionary training method that generalizes the K-means algorithm. Because it uses sparse representation, it requires less computation and performs better than traditional algorithms.
To train the voice dictionary $\Phi_s$ with K-SVD, the dictionary training problem is stated as:
$$\min_{\Phi_s, X} \| Y_s - \Phi_s X \|_2^2 \quad \text{subject to} \quad \| x_i \|_0 \le T_0, \quad i = 1, \dots, M$$
where $Y_s$ is the training feature set formed from the magnitude spectra of $M$ clean speech signal frames, $X = [x_1, x_2, \dots, x_M]$ is the set of sparse vectors corresponding to $Y_s$, and $T_0$ is the preset sparsity threshold used when training the voice dictionary. Solving this dictionary learning problem yields the voice dictionary $\Phi_s$.
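A compact sketch of K-SVD training is given below. It alternates a sparse coding stage (here orthogonal matching pursuit limited to T0 atoms, a common choice that the text does not specify) with a per-atom dictionary update by rank-1 SVD. The function names, the random initialization, and the fixed iteration count are assumptions, and no nonnegativity is enforced even though magnitude spectra are nonnegative.

```python
import numpy as np

def omp(D, y, T0):
    """Greedy sparse coding of y over D with at most T0 nonzero coefficients."""
    x = np.zeros(D.shape[1])
    support, coef = [], np.zeros(0)
    residual = y.astype(float).copy()
    for _ in range(T0):
        if np.linalg.norm(residual) < 1e-10:
            break
        k = int(np.argmax(np.abs(D.T @ residual)))   # most correlated atom
        if k in support:
            break
        support.append(k)
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    if support:
        x[support] = coef
    return x

def ksvd(Y, n_atoms, T0, n_iter=10, seed=0):
    """Train a dictionary D (unit-norm columns) so that Y is approximately D @ X."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((Y.shape[0], n_atoms))
    D /= np.linalg.norm(D, axis=0, keepdims=True)
    X = np.zeros((n_atoms, Y.shape[1]))
    for _ in range(n_iter):
        # Sparse coding stage: code every training column with the current D.
        X = np.column_stack([omp(D, Y[:, i], T0) for i in range(Y.shape[1])])
        # Dictionary update stage: refit each atom on the columns that use it.
        for k in range(n_atoms):
            users = np.nonzero(X[k, :])[0]
            if users.size == 0:
                continue
            X[k, users] = 0.0
            E = Y[:, users] - D @ X[:, users]        # error without atom k
            U, s, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, k] = U[:, 0]
            X[k, users] = s[0] * Vt[0, :]
    return D, X
```

In the setting described here, `Y` would be the matrix of clean-speech magnitude spectra (one frame per column) and the returned `D` would play the role of the voice dictionary.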
For the noise atoms, a dynamic update strategy is adopted: to track the effect of time-varying noise, the noise dictionary is trained and updated online.
The preset noise training data are dynamically updated according to the signal features of the noisy speech signal, and the signal features of the noise training data are then extracted to update the noise dictionary online, i.e. in real time. The specific steps are described below.
In this embodiment, suppose that $\Gamma$ stores the noise training data and that $y_s$ denotes the sparse coefficients of $Y_i$ on the clean voice dictionary $\Phi_s$. $\Psi$ denotes the large dictionary formed by splicing the clean voice dictionary $\Phi_s$ and the initial noise dictionary $\Phi_v$, $Y = [Y_1, Y_2, \dots, Y_P]$ is the noisy test speech, and $P$ is the number of data frames. The output is the updated noise dictionary $\Phi_v$. The steps are as follows:
Step 1: initialize the noise store $\Gamma$, which holds the preset noise training data, to empty. The threshold in this algorithm is $\delta = 2$, the optimal value obtained from extensive experiments.
Step 2: while the threshold condition holds, repeat the following loop:
1. Initialize the noise data set $\Gamma$ to empty;
2. Sparsely represent each data frame $Y_i$ on the dictionary $\Psi$ to obtain its sparse representation $y_i$;
3. Compute the 1-norm of each $y_i$ and accumulate the results into a running sum;
4. Reconstruct the signal from the sparse representation, compute the residual, and save the residual data into $\Gamma$ to update the preset noise training data;
5. With the updated noise training data in $\Gamma$ as input, train the noise dictionary with the K-SVD algorithm and store it in $\Phi_v$ (updating the noise dictionary);
6. Compute the loop-termination quantity.
When the termination condition is met, exit the loop.
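One pass of loop steps 2 to 4 can be sketched as follows. Several points are assumptions rather than details given in the text: the sparse coding uses orthogonal matching pursuit with at most T0 atoms, the reconstructed signal is taken to be the speech part only (voice-dictionary atoms), so that the residual serves as the new noise data, and the names `omp` and `refresh_noise_data` are hypothetical. The loop's termination test is not reproduced in the source, so the sketch only returns the accumulated 1-norm and leaves the test to the caller.

```python
import numpy as np

def omp(D, y, T0):
    """Greedy sparse coding of y over D with at most T0 nonzero coefficients."""
    x = np.zeros(D.shape[1])
    support, coef = [], np.zeros(0)
    residual = y.astype(float).copy()
    for _ in range(T0):
        if np.linalg.norm(residual) < 1e-10:
            break
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k in support:
            break
        support.append(k)
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    if support:
        x[support] = coef
    return x

def refresh_noise_data(Phi_s, Phi_v, Y, T0=3):
    """Return (Gamma, l1_total): residual noise frames and the accumulated 1-norm."""
    Psi = np.hstack([Phi_s, Phi_v])            # spliced dictionary
    n_speech = Phi_s.shape[1]
    Gamma, l1_total = [], 0.0
    for i in range(Y.shape[1]):
        y_i = omp(Psi, Y[:, i], T0)            # step 2: sparse representation
        l1_total += float(np.sum(np.abs(y_i)))  # step 3: accumulate 1-norms
        speech_estimate = Phi_s @ y_i[:n_speech]
        Gamma.append(Y[:, i] - speech_estimate)  # step 4: residual as noise data
    return np.column_stack(Gamma), l1_total
```

The returned `Gamma` would then be passed to a dictionary trainer such as K-SVD to produce the updated noise dictionary (step 5).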
The trained voice dictionary $\Phi_s$ and noise dictionary $\Phi_v$ are spliced into a new dictionary $\Psi = [\Phi_s \; \Phi_v]$, the atom dictionary. The MCA algorithm is then used to sparsely represent the feature-extracted noisy speech frames: each noisy speech frame is sparsely decomposed to find its sparse representation on the spliced dictionary. Intuitively, the speech component is represented on the voice atoms and the noise component on the noise atoms. During reconstruction, the coefficients on all noise atoms are set to 0 and only the nonzero coefficients on the voice atoms are retained.
Suppose the voice dictionary $\Phi_s$ and the noise dictionary $\Phi_v$ form the atom dictionary $\Phi = [\Phi_s \; \Phi_v]$. The noisy speech signal is $y = s + v$, where $s$ is the clean speech and $v$ is the noise. Decomposing the noisy speech over the redundant dictionary with coefficient vector $x$ gives
$$y = \Phi x = [\Phi_s \;\; \Phi_v] \begin{bmatrix} x_s \\ x_v \end{bmatrix} = \Phi_s x_s + \Phi_v x_v$$
where $x_s$ is the coefficient vector of the noisy speech on the voice dictionary, i.e. the sparse coefficients of the voice dictionary, and $x_v$ is the coefficient vector of $y$ on the noise dictionary, i.e. the sparse coefficients of the noise dictionary.
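The block decomposition can be checked numerically. In the sketch below the dictionaries are random stand-ins and the coefficient values are arbitrary; only the splicing and splitting operations mirror the text.

```python
import numpy as np

rng = np.random.default_rng(0)
Phi_s = rng.standard_normal((6, 4))      # stand-in voice dictionary, 4 atoms
Phi_v = rng.standard_normal((6, 3))      # stand-in noise dictionary, 3 atoms
x_s = np.array([1.5, 0.0, 0.0, -2.0])    # sparse coefficients on voice atoms
x_v = np.array([0.0, 0.7, 0.0])          # sparse coefficients on noise atoms

Phi = np.hstack([Phi_s, Phi_v])          # spliced atom dictionary [Phi_s Phi_v]
x = np.concatenate([x_s, x_v])           # stacked coefficient vector
y = Phi @ x                              # noisy frame synthesized from both parts

speech_part = Phi_s @ x_s                # reconstruction with noise coefficients zeroed
noise_part = Phi_v @ x_v
```

Because the voice atoms occupy the first columns of the spliced dictionary, the voice coefficients are recovered from a stacked vector as `x[:Phi_s.shape[1]]`.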
The input noisy speech signal is sparsely represented with the Morphological Component Analysis (MCA) algorithm. The problem is stated as:
$$\min_{x} \| x \|_0 \quad \text{subject to} \quad \| Y - \Psi x \|_2 < \alpha$$
where $Y$ is the noisy speech signal, $x$ is its sparse coefficient vector, and $\Psi = [\Phi_s \; \Phi_v]$ is the atom dictionary obtained by splicing the voice dictionary $\Phi_s$ and the noise dictionary $\Phi_v$. Solving this problem yields the $x$ with the fewest nonzero elements such that $\| Y - \Psi x \|_2$ is less than $\alpha$. Each frame of the noisy speech signal $Y$ can then be represented by the sparse coefficients $x$, in which $x_s$ are the sparse coefficients of the voice dictionary and $x_v$ are the sparse coefficients of the noise dictionary.
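Minimizing the number of nonzero coefficients exactly is combinatorial, so practical solvers approximate it. The sketch below uses orthogonal matching pursuit with an error-based stopping rule as a greedy stand-in; it is not the MCA solver of the patent, whose details are not given in the text. The function names and the `max_atoms` cap are assumptions, and atoms are assumed to have unit norm so that correlations are comparable.

```python
import numpy as np

def sparse_code_with_tolerance(Psi, y, alpha, max_atoms=None):
    """Greedily add atoms until the residual norm drops below alpha."""
    y = np.asarray(y, dtype=float)
    if max_atoms is None:
        max_atoms = min(Psi.shape)
    x = np.zeros(Psi.shape[1])
    support = []
    residual = y.copy()
    while np.linalg.norm(residual) >= alpha and len(support) < max_atoms:
        k = int(np.argmax(np.abs(Psi.T @ residual)))   # best-matching atom
        if k in support:
            break
        support.append(k)
        # Refit all selected coefficients by least squares.
        coef, *_ = np.linalg.lstsq(Psi[:, support], y, rcond=None)
        residual = y - Psi[:, support] @ coef
        x[:] = 0.0
        x[support] = coef
    return x

def split_coefficients(x, n_speech_atoms):
    """Split x into (x_s, x_v) for a spliced dictionary [Phi_s Phi_v]."""
    return x[:n_speech_atoms], x[n_speech_atoms:]
```

A larger tolerance stops the search earlier and gives a sparser but less accurate representation.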
In this embodiment, the sparse coefficients of the sparse representation are extracted to classify the signal frames of the input noisy speech signal, i.e. to decide whether each input frame is a speech signal or a non-speech signal. Specifically: the sparse coefficients $x_s$ of the voice dictionary are extracted from the sparse coefficients $x$ and compared with a preset threshold $\xi$. When the number of nonzero elements in $x_s$ is greater than $\xi$, the signal frame of the input noisy speech signal is a speech signal; otherwise the signal frame is a non-speech signal:
$\| x_s \|_0 > \xi$: speech signal;
$\| x_s \|_0 \le \xi$: non-speech signal;
where $\| x_s \|_0$ is the number of nonzero elements of $x_s$ and $\xi$ is the threshold. When the number of nonzero elements of $x_s$ exceeds $\xi$, the voice dictionary $\Phi_s$ has been used many times, which indicates that the frame is a speech signal; otherwise the frame is a non-speech signal. In this embodiment the threshold is the optimal value obtained experimentally, $\xi = 2.5$.
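The decision rule amounts to counting nonzero voice-dictionary coefficients. In the sketch below, the threshold 2.5 is the value stated in the text; the function names and the numerical tolerance `tol`, used to decide when a floating-point coefficient counts as nonzero, are assumptions.

```python
import numpy as np

def is_speech_frame(x_s, xi=2.5, tol=1e-8):
    """Speech if the number of nonzero voice-dictionary coefficients exceeds xi."""
    n_nonzero = int(np.count_nonzero(np.abs(np.asarray(x_s, dtype=float)) > tol))
    return n_nonzero > xi

def detect(frames_x_s, xi=2.5):
    """Apply the rule to a sequence of per-frame voice coefficient vectors."""
    return [is_speech_frame(x_s, xi) for x_s in frames_x_s]
```

With ξ = 2.5, a frame is labeled speech when at least three voice atoms carry nonzero coefficients.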
In addition, this embodiment of the present invention proposes a voice activity detection device which, as shown in Fig. 2, comprises:
Feature extraction module 1, for extracting the signal features of the clean speech signal and the signal features of the noisy speech signal;
Voice dictionary training module 2, for performing dictionary training on the signal features of the clean speech signal to obtain the voice dictionary;
Noise dictionary training module 3, for dynamically updating the preset noise training data according to the signal features of the noisy speech signal, and for extracting the signal features of the updated noise training data and performing online dictionary training to obtain the noise dictionary;
Sparse decomposition module 4, for sparsely representing the signal frames of the input noisy speech signal over the voice dictionary and the noise dictionary;
Detection module 5, for extracting the sparse coefficients of the sparse representation and classifying the signal frames of the input noisy speech signal according to the sparse coefficients.
Sparse decomposition module 4 comprises a dictionary splicing unit, a sparse coefficient calculation unit, and a sparse representation unit:
The dictionary splicing unit, for splicing the voice dictionary and the noise dictionary to generate the atom dictionary;
The sparse coefficient calculation unit, for using the MCA algorithm with the atom dictionary to compute the sparse coefficients of the signal frames of the input noisy speech signal, the sparse coefficients comprising the sparse coefficients of the voice dictionary and the sparse coefficients of the noise dictionary;
The sparse representation unit, for sparsely representing the signal frames of the input noisy speech signal according to the sparse coefficients.
Detection module 5 comprises an extraction unit and a detection unit:
The extraction unit, for extracting the sparse coefficients of the voice dictionary from the sparse coefficients;
The detection unit, for comparing the sparse coefficients of the voice dictionary with the preset threshold: when the number of nonzero elements in the sparse coefficients of the voice dictionary is greater than the preset threshold, the signal frame of the input noisy speech signal is a speech signal; otherwise the signal frame is a non-speech signal.
The voice activity detection method and device proposed by the present invention use the sparse-representation separation algorithm MCA and the dictionary training algorithm K-SVD to perform voice activity detection. They can accurately distinguish the speech and non-speech parts of a speech signal in a noisy environment, improve voice activity detection performance under time-varying noise, and offer stronger detection robustness than traditional methods.
The above embodiments only illustrate the present invention and do not limit it. A person of ordinary skill in the relevant technical field may make various changes and modifications without departing from the spirit and scope of the present invention; all equivalent technical solutions therefore also fall within the scope of the present invention, and the scope of patent protection of the present invention shall be defined by the claims.

Claims (10)

1. A voice activity detection method, characterized by comprising:
Extracting signal features of a clean speech signal and signal features of a noisy speech signal;
Performing dictionary training according to the signal features of the clean speech signal to obtain a voice dictionary;
Dynamically updating preset noise training data according to the signal features of the noisy speech signal, and extracting signal features of the updated noise training data for online dictionary training to obtain a noise dictionary;
Sparsely representing a signal frame of the input noisy speech signal according to the voice dictionary and the noise dictionary;
Extracting the sparse coefficients of the sparse representation, and detecting the signal frame of the input noisy speech signal according to the sparse coefficients.
2. The method of claim 1, characterized in that extracting the signal features of the clean speech signal and the signal features of the noisy speech signal specifically comprises:
Preprocessing the discrete-time signal of the clean speech;
Applying a discrete Fourier transform (DFT) to the signal frames of the preprocessed clean speech signal to obtain the amplitude spectrum of the clean speech signal, and using that amplitude spectrum as the signal features of the clean speech signal;
Preprocessing the discrete-time signal of the noisy speech;
Applying a discrete Fourier transform (DFT) to the signal frames of the preprocessed noisy speech signal to obtain the amplitude spectrum of the noisy speech signal, and using that amplitude spectrum as the signal features of the noisy speech signal.
3. The method of claim 2, characterized in that the preprocessing specifically comprises: dividing the discrete-time signal into frames, and applying a window to each frame obtained by the framing.
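The feature extraction of claims 2 and 3 (framing, windowing, DFT amplitude spectrum) can be illustrated with a minimal sketch. The frame length, hop size, Hamming window, and the function name `magnitude_spectrum_features` are assumptions made for illustration; the claims do not fix any of them.

```python
import numpy as np

def magnitude_spectrum_features(signal, frame_len=256, hop=128):
    """Frame a 1-D signal, window each frame, and return per-frame DFT magnitudes."""
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Framing: row i holds samples [i*hop, i*hop + frame_len).
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx]
    # Windowing: the Hamming window is an assumed choice, not taken from the claims.
    frames = frames * np.hamming(frame_len)
    # The magnitude of the one-sided DFT serves as the feature vector of each frame.
    return np.abs(np.fft.rfft(frames, axis=1))

# Example input: a 440 Hz tone sampled at 8 kHz for one second.
t = np.arange(8000) / 8000.0
features = magnitude_spectrum_features(np.sin(2 * np.pi * 440 * t))
print(features.shape)
```

Each row of `features` is one frame; stacking the rows as columns would give a training matrix of the kind used for dictionary learning.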
4. The method of claim 1, characterized in that performing dictionary training according to the signal features of the clean speech signal to obtain the voice dictionary specifically comprises:
Using the K-SVD algorithm to perform dictionary training on the signal features of the clean speech signal to obtain the voice dictionary $\Phi_s$, computed as follows:
$$\min_{\Phi_s,\,X}\ \|Y_s - \Phi_s X\|_2^2 \quad \text{subject to} \quad \|x_i\|_0 \le T_0$$
where $Y_s$ is the set of training signal features formed from $M$ frames of the clean speech signal, $X = [x_1, x_2, \ldots, x_M]$ is the set of sparse vectors corresponding to $Y_s$, and $T_0$ is the preset sparsity limit used when training the voice dictionary.
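To make the optimization of claim 4 concrete, the following is a compact K-SVD sketch that alternates a sparse coding stage (here orthogonal matching pursuit, an assumed choice of pursuit algorithm) with a rank-1 SVD update of each atom. The function names `omp` and `ksvd`, the iteration count, the random initialization, and the synthetic training matrix are all illustrative assumptions, not details taken from the patent.

```python
import numpy as np

def omp(D, y, t0):
    """Orthogonal matching pursuit: approximate y with at most t0 atoms of D."""
    x = np.zeros(D.shape[1])
    residual, support, coef = y.astype(float), [], None
    for _ in range(t0):
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k in support or np.linalg.norm(residual) < 1e-10:
            break
        support.append(k)
        # Refit all selected atoms jointly by least squares.
        coef = np.linalg.lstsq(D[:, support], y, rcond=None)[0]
        residual = y - D[:, support] @ coef
    if support:
        x[support] = coef
    return x

def ksvd(Y, n_atoms, t0, n_iter=10, seed=0):
    """Learn a dictionary Phi (features x n_atoms) and sparse codes X for Y (features x M)."""
    rng = np.random.default_rng(seed)
    # Initialise the atoms with randomly chosen, normalised training columns.
    Phi = Y[:, rng.choice(Y.shape[1], n_atoms, replace=False)].astype(float)
    Phi /= np.linalg.norm(Phi, axis=0, keepdims=True) + 1e-12
    for _ in range(n_iter):
        # Sparse coding stage: enforce ||x_i||_0 <= t0 for every column.
        X = np.column_stack([omp(Phi, Y[:, i], t0) for i in range(Y.shape[1])])
        # Dictionary update stage: refit one atom at a time.
        for k in range(n_atoms):
            users = np.flatnonzero(X[k])
            if users.size == 0:
                continue
            X[k, users] = 0.0
            # Residual of the frames that use atom k, with atom k removed.
            E = Y[:, users] - Phi @ X[:, users]
            U, s, Vt = np.linalg.svd(E, full_matrices=False)
            Phi[:, k] = U[:, 0]
            X[k, users] = s[0] * Vt[0]
    return Phi, X

# Synthetic non-negative data standing in for clean-speech amplitude spectra.
rng = np.random.default_rng(1)
Y = np.abs(rng.standard_normal((20, 200)))
Phi_s, X = ksvd(Y, n_atoms=30, t0=3)
print(Phi_s.shape, int(np.count_nonzero(X, axis=0).max()))
```

The atom update only touches coefficients that were already nonzero, so the sparsity limit $T_0$ set in the coding stage is preserved after every iteration.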
5. The method of claim 1, characterized in that dynamically updating the preset noise training data according to the signal features of the noisy speech signal, and extracting the signal features of the updated noise training data for online dictionary training to obtain the noise dictionary, specifically comprises:
Performing dictionary training according to the signal features of the noisy speech signal to obtain a noisy-speech dictionary;
Extracting the signal features of the preset noise training data and performing dictionary training to obtain an initial noise dictionary;
Sparsely representing the noisy speech signal according to the voice dictionary and the initial noise dictionary, extracting new noise data from the noisy speech signal, and dynamically updating the preset noise training data with it;
Extracting the signal features of the updated noise training data and performing dictionary training to update the initial noise dictionary, thereby obtaining the noise dictionary.
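The data flow of claim 5 can be sketched as follows: each noisy frame is coded over the voice dictionary and the initial noise dictionary, the noise-dictionary part of the reconstruction is taken as new noise data, and a bounded pool of noise training frames is refreshed with it. The class `NoiseTrainingBuffer`, the helper `extract_noise_estimates`, the pool size, the use of OMP as the sparse coder, and the random dictionaries are hypothetical stand-ins chosen for illustration; the claim does not prescribe them.

```python
import numpy as np
from collections import deque

class NoiseTrainingBuffer:
    """Fixed-size pool of noise feature frames that is refreshed online."""
    def __init__(self, initial_frames, max_frames):
        self.frames = deque((np.asarray(f, float) for f in initial_frames),
                            maxlen=max_frames)

    def update(self, new_noise_frames):
        # Appending to a bounded deque discards the oldest frames first.
        for f in new_noise_frames:
            self.frames.append(np.asarray(f, float))

    def as_matrix(self):
        # One column per frame, ready to be passed to a dictionary trainer.
        return np.column_stack(list(self.frames))

def omp(D, y, t0=3):
    """Minimal orthogonal matching pursuit, used here as a stand-in sparse coder."""
    x, support, residual = np.zeros(D.shape[1]), [], y.astype(float)
    for _ in range(t0):
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k in support or np.linalg.norm(residual) < 1e-10:
            break
        support.append(k)
        coef = np.linalg.lstsq(D[:, support], y, rcond=None)[0]
        residual = y - D[:, support] @ coef
        x[support] = coef
    return x

def extract_noise_estimates(Y_noisy, Phi_s, Phi_n, sparse_code):
    """Code each noisy frame over [Phi_s, Phi_n] and return its noise-dictionary part."""
    D = np.hstack([Phi_s, Phi_n])
    n_s = Phi_s.shape[1]
    return [Phi_n @ sparse_code(D, y)[n_s:] for y in Y_noisy.T]

def unit_columns(A):
    return A / np.linalg.norm(A, axis=0, keepdims=True)

rng = np.random.default_rng(0)
Phi_s = unit_columns(rng.standard_normal((16, 8)))   # pretend voice dictionary
Phi_n = unit_columns(rng.standard_normal((16, 8)))   # pretend initial noise dictionary
buffer = NoiseTrainingBuffer(rng.standard_normal((5, 16)), max_frames=10)
Y_noisy = rng.standard_normal((16, 8))               # 8 incoming noisy frames
buffer.update(extract_noise_estimates(Y_noisy, Phi_s, Phi_n, omp))
# The refreshed matrix would then be fed to the dictionary trainer to update Phi_n.
print(buffer.as_matrix().shape)
```

A bounded pool is one simple way to let the noise dictionary follow a changing noise environment, since old frames drop out as new estimates arrive.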
6. The method of claim 1, characterized in that sparsely representing the signal frame of the input noisy speech signal according to the voice dictionary and the noise dictionary specifically comprises:
Concatenating the voice dictionary and the noise dictionary to generate an atom dictionary;
Using the morphological component analysis (MCA) algorithm to calculate, according to the atom dictionary, the sparse coefficients of the signal frame of the input noisy speech signal, the sparse coefficients comprising the sparse coefficients of the voice dictionary and the sparse coefficients of the noise dictionary;
Sparsely representing the signal frame of the input noisy speech signal according to the sparse coefficients.
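The concatenation and coefficient split of claim 6 can be sketched as below. The claim names the MCA algorithm; this sketch does not reproduce MCA and instead substitutes plain iterative soft-thresholding (ISTA) on the concatenated dictionary as a simpler sparse solver. The function names, the penalty `lam`, the iteration count, and the toy orthonormal dictionaries are assumptions for illustration only.

```python
import numpy as np

def ista(D, y, lam=0.1, n_iter=200):
    """Iterative soft-thresholding for min_x 0.5*||y - D x||_2^2 + lam*||x||_1."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = x + D.T @ (y - D @ x) / L      # gradient step on the data term
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return x

def decompose_frame(y, Phi_s, Phi_n, lam=0.1):
    """Code one frame over the atom dictionary [Phi_s, Phi_n] and split the result."""
    D = np.hstack([Phi_s, Phi_n])          # dictionary concatenation
    x = ista(D, y, lam)
    n_s = Phi_s.shape[1]
    x_s, x_n = x[:n_s], x[n_s:]            # voice and noise coefficients
    return x_s, x_n, Phi_s @ x_s, Phi_n @ x_n

# Toy dictionaries: the first four and last four axes of an 8-dimensional space.
Phi_s, Phi_n = np.eye(8)[:, :4], np.eye(8)[:, 4:]
y = np.array([2.0, 0, 0, 0, 0, 0, 0, 1.5])
x_s, x_n, speech_part, noise_part = decompose_frame(y, Phi_s, Phi_n)
print(x_s, x_n)
```

With these toy dictionaries the frame splits into one active voice coefficient and one active noise coefficient, each shrunk by `lam`.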
7. The method of claim 6, characterized in that extracting the sparse coefficients of the sparse representation and detecting the signal frame of the input noisy speech signal according to the sparse coefficients specifically comprises:
Extracting the sparse coefficients of the voice dictionary;
Comparing the sparse coefficients of the voice dictionary with a preset threshold: when the number of nonzero elements in the sparse coefficients of the voice dictionary is greater than the preset threshold, the signal frame of the input noisy speech signal is a speech signal; otherwise the signal frame is a non-speech signal.
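The decision rule of claim 7 amounts to counting the nonzero voice-dictionary coefficients of a frame and comparing the count with the threshold. In the sketch below, the function names and the small tolerance `tol` used to decide what counts as nonzero in floating point are assumptions; the claim itself only speaks of nonzero elements.

```python
import numpy as np

def is_speech_frame(x_s, threshold, tol=1e-8):
    """True when the number of non-negligible voice coefficients exceeds the threshold."""
    n_active = int(np.count_nonzero(np.abs(x_s) > tol))
    return n_active > threshold

def detect(speech_coeffs, threshold):
    """Label every frame: 1 for speech, 0 for non-speech."""
    return [int(is_speech_frame(x, threshold)) for x in speech_coeffs]

frames = [np.array([0.0, 0.0, 0.0, 0.0]),
          np.array([0.9, 0.0, 0.4, 0.2]),
          np.array([0.0, 0.3, 0.0, 0.0])]
labels = detect(frames, threshold=1)
print(labels)  # [0, 1, 0]
```

Only the second frame activates more than one voice atom, so it is the only frame labelled as speech.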
8. A device for voice activity detection, characterized in that the device comprises:
Feature extraction module, configured to extract signal features of a clean speech signal and signal features of a noisy speech signal;
Voice dictionary training module, configured to perform dictionary training according to the signal features of the clean speech signal to obtain a voice dictionary;
Noise dictionary training module, configured to dynamically update preset noise training data according to the signal features of the noisy speech signal, and to extract signal features of the updated noise training data for online dictionary training to obtain a noise dictionary;
Sparse decomposition module, configured to sparsely represent a signal frame of the input noisy speech signal according to the voice dictionary and the noise dictionary;
Detection module, configured to extract the sparse coefficients of the sparse representation and to detect the signal frame of the input noisy speech signal according to the sparse coefficients.
9. The device of claim 8, characterized in that the sparse decomposition module comprises:
Dictionary concatenation unit, configured to concatenate the voice dictionary and the noise dictionary to generate an atom dictionary;
Sparse coefficient calculation unit, configured to use the MCA algorithm to calculate, according to the atom dictionary, the sparse coefficients of the signal frame of the input noisy speech signal, the sparse coefficients comprising the sparse coefficients of the voice dictionary and the sparse coefficients of the noise dictionary;
Sparse representation unit, configured to sparsely represent the signal frame of the input noisy speech signal according to the sparse coefficients.
10. The device of claim 8, characterized in that the detection module comprises:
Extraction unit, configured to extract the sparse coefficients of the voice dictionary from the sparse coefficients;
Detecting unit, configured to compare the sparse coefficients of the voice dictionary with a preset threshold: when the number of nonzero elements in the sparse coefficients of the voice dictionary is greater than the preset threshold, the signal frame of the input noisy speech signal is a speech signal; otherwise the signal frame is a non-speech signal.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410217411.6A CN104036777A (en) 2014-05-22 2014-05-22 Method and device for voice activity detection

Publications (1)

Publication Number Publication Date
CN104036777A true CN104036777A (en) 2014-09-10

Family

ID=51467524

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410217411.6A Pending CN104036777A (en) 2014-05-22 2014-05-22 Method and device for voice activity detection

Country Status (1)

Country Link
CN (1) CN104036777A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1141548A (en) * 1995-02-17 1997-01-29 索尼公司 Method and apparatus for reducing noise in speech signal
JP2003308093A (en) * 2002-04-15 2003-10-31 Denso Corp Method and device for extracting signal component
EP1469471A2 (en) * 2003-04-14 2004-10-20 Sony Corporation Information processing apparatus for detecting inter-track boundaries
JP2005195955A (en) * 2004-01-08 2005-07-21 Toshiba Corp Device and method for noise suppression
US20120265526A1 (en) * 2011-04-13 2012-10-18 Continental Automotive Systems, Inc. Apparatus and method for voice activity detection
CN102959625A (en) * 2010-12-24 2013-03-06 华为技术有限公司 Method and apparatus for adaptively detecting voice activity in input audio signal
CN103020654A (en) * 2012-12-12 2013-04-03 北京航空航天大学 Synthetic aperture radar (SAR) image bionic recognition method based on sample generation and nuclear local feature fusion
CN103345923A (en) * 2013-07-26 2013-10-09 电子科技大学 Sparse representation based short-voice speaker recognition method

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
M. AHARON ET AL.: "K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation", 《IEEE TRANSACTIONS ON SIGNAL PROCESSING》 *
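何勇军 et al.: "Robust speaker recognition based on sparse coding" (基于稀疏编码的鲁棒说话人识别), 《数据采集与处理》 (Journal of Data Acquisition and Processing) *
谢怡宁 et al.: "Feature extraction method for speech control of intelligent robots in noisy environments" (噪声环境下智能机器人语音控制特征提取方法), 《北京邮电大学学报》 (Journal of Beijing University of Posts and Telecommunications) *
韩卫丽 et al.: "A new speech denoising method based on sparse signal representation" (一种基于信号稀疏表示的语音去噪新方法), 《北方工业大学学报》 (Journal of North China University of Technology) *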

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106952643A (en) * 2017-02-24 2017-07-14 华南理工大学 Recording device clustering method based on Gaussian mean supervector and spectral clustering
CN108962275A (en) * 2018-08-01 2018-12-07 电信科学技术研究院有限公司 Music noise suppression method and device
CN113470621A (en) * 2021-08-23 2021-10-01 杭州网易智企科技有限公司 Voice detection method, device, medium and electronic equipment
CN113470621B (en) * 2021-08-23 2023-10-24 杭州网易智企科技有限公司 Voice detection method, device, medium and electronic equipment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20140910