CN104992708A - Short-time specific audio detection model generating method and short-time specific audio detection method


Info

Publication number
CN104992708A
CN104992708A
Authority
CN
China
Prior art keywords
model
Gaussian
specific audio
universal background
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510236568.8A
Other languages
Chinese (zh)
Other versions
CN104992708B (en)
Inventor
云晓春
颜永红
袁庆升
黄宇飞
任彦
周若华
黄文廷
邹学强
包秀国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
National Computer Network and Information Security Management Center
Original Assignee
Institute of Acoustics CAS
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS, National Computer Network and Information Security Management Center filed Critical Institute of Acoustics CAS
Priority to CN201510236568.8A
Publication of CN104992708A
Application granted
Publication of CN104992708B
Legal status: Expired - Fee Related

Landscapes

  • Electrically Operated Instructional Devices (AREA)
  • Complex Calculations (AREA)

Abstract

The invention relates to a short-time specific audio detection model generating method comprising: extracting features from training speech data, wherein the training speech data comprise non-specific audio data and specific audio data; training a universal background model with the features of the training speech data; adaptively deriving a model for one class of specific audio data from the universal background model and the features of that class in the training speech data; and repeating this operation until the models of all classes of specific audio data in the training speech data are obtained. The invention also provides a short-time specific audio detection method which detects specific audio data by model scoring. The method not only alleviates the problem of insufficient training data for specific audio models, but also suppresses the background noise of the input data to a certain extent.

Description

Short-time specific audio detection model generation and detection method
Technical field
The present invention relates to methods for short-time specific audio detection, and more particularly to the detection of short-time specific audio using Gaussian mixture models.
Background art
Short-time specific audio plays an important role in many fields, especially in security. In certain scenarios, a given class of short-time specific audio must be detected so that urgent events can be handled in time. For example, in public places, public safety supervision requires detecting accidents such as sudden screams, sudden explosions or gunshots; these short-time specific audio events must be detected promptly so that such incidents can be dealt with in time. In addition, in relatively important locations, short-time specific audio detection can also be used for abnormal-sound detection and can serve as an effective early warning.
Existing short-time specific audio detection methods still face several problems. First, because short-time specific audio events happen quickly and last only briefly, exploiting the information in the short audio segment is crucial. Second, specific audio events occur infrequently, so the training data are often insufficient. Third, since the deployment scenes often contain complex background noise, suppressing background noise well is also an important problem for short-time specific audio detection.
Summary of the invention
The object of the invention is to overcome the defects of existing short-time specific audio detection methods, namely insufficient training data and the inability to suppress background noise, by providing a short-time specific audio model generation and detection method based on Gaussian mixture models.
The present invention provides a short-time specific audio detection model generation method, comprising:
Step 101, extracting features from training speech data; wherein the training speech data comprise non-specific audio data and specific audio data;
Step 102, training a universal background model with the features of the training speech data obtained in step 101; wherein the universal background model is a Gaussian mixture model whose expression is:
p(x \mid \lambda) = \sum_{i=1}^{M} w_i \, p_i(x);
where w_i denotes the weight of the i-th Gaussian, with values in [0, 1] satisfying the normalization condition \sum_{i=1}^{M} w_i = 1; x denotes a frame feature of a training speech segment; \lambda denotes the set of all parameters of the Gaussian mixture model; M denotes the number of Gaussian components; and p_i(x) denotes the probability density function of the i-th single Gaussian, whose expression is:
p_i(x) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\left\{ -\tfrac{1}{2} (x - \mu_i)' \Sigma_i^{-1} (x - \mu_i) \right\};
where D is the dimension of the frame feature of the training speech segment; \Sigma_i is the covariance matrix of this Gaussian; and \mu_i is its mean vector;
Step 103, using the features of a certain class of specific audio data in the training speech data, adaptively deriving the model of that class from the universal background model obtained in step 102; repeating this operation until the models of all classes of specific audio data in the training speech data are obtained.
In the above technical scheme, in step 101, the features extracted from the training speech data are Mel-frequency cepstral coefficients.
In the above technical scheme, in step 102, training the universal background model comprises estimating its parameters with the expectation-maximization method; the estimated parameters comprise three classes: the Gaussian weights w, the Gaussian variances \delta and the Gaussian means \mu, where w is the set of the individual Gaussian weights w_i, \delta the set of variances \delta_i, and \mu the set of means \mu_i, with i indexing the single Gaussian components; specifically comprising:
Step 102-1, updating the k-th Gaussian weight w_k:
The update of w_k is given by
w_k = \frac{1}{T} \sum_{t=1}^{T} p(k \mid x_t, \lambda)
where x_t denotes the t-th frame feature vector of the input training speech x, a known vector computed during feature extraction; \lambda is the collective name for all parameters of the Gaussian mixture model, all of which are given initial values at the start of training and are therefore known; T denotes the total number of frames of all input training speech, a known quantity; k denotes the index of the k-th single Gaussian in the mixture; and p(k \mid x_t, \lambda) denotes the posterior probability of the input training speech frame x_t on the k-th Gaussian of the universal background model, computed from x_t and the mixture parameters \lambda;
Step 102-2, updating the k-th Gaussian mean \mu_k:
The update of \mu_k is given by
\mu_k = \frac{\sum_{t=1}^{T} p(k \mid x_t, \lambda)\, x_t}{\sum_{t=1}^{T} p(k \mid x_t, \lambda)}
where T, x_t and \lambda are known variables, and p(k \mid x_t, \lambda) is computed from the input frame x_t and the mixture parameters \lambda;
Step 102-3, updating the k-th Gaussian variance \delta_k^2:
The update of \delta_k^2 is given by
\delta_k^2 = \frac{\sum_{t=1}^{T} p(k \mid x_t, \lambda)\, x_t^2}{\sum_{t=1}^{T} p(k \mid x_t, \lambda)} - \mu_k^2
where T, x_t, \lambda and \mu_k are all known variables, and p(k \mid x_t, \lambda) is computed from the input frame x_t and the mixture parameters \lambda.
In the above technical scheme, in step 103, adaptively deriving the model of a class of specific audio data from the universal background model obtained in step 102 comprises:
Step 103-1, first computing, from the feature vectors of the specific audio training data, the posterior probability n_i, the first-order statistic E_i(x) and the second-order statistic E_i(x^2) of each speech frame on the universal background model, as follows:
n_i = \sum_{t=1}^{T} \Pr(i \mid x_t)
E_i(x) = \frac{1}{n_i} \sum_{t=1}^{T} \Pr(i \mid x_t)\, x_t
E_i(x^2) = \frac{1}{n_i} \sum_{t=1}^{T} \Pr(i \mid x_t)\, x_t^2
where \Pr(i \mid x_t) denotes the posterior probability of the t-th frame of the input audio x on the i-th Gaussian of the universal background model; x_t denotes the feature of the t-th frame of the input audio x; T denotes the total number of frames of the input audio; and i indexes the i-th single Gaussian in the universal background model;
Step 103-2, using the posterior probability, first-order statistic and second-order statistic computed in step 103-1 to adaptively adjust the parameters of the universal background model, obtaining the weight \hat{w}_i, mean \hat{\mu}_i and covariance \hat{\delta}_i^2 of the specific audio model; the adaptation formulas are:
\hat{w}_i = \left[ \alpha_i^w n_i / T + (1 - \alpha_i^w) w_i \right] \gamma
\hat{\mu}_i = \alpha_i^m E_i(x) + (1 - \alpha_i^m)\, \mu_i
\hat{\delta}_i^2 = \alpha_i^v E_i(x^2) + (1 - \alpha_i^v)(\sigma_i^2 + \mu_i^2) - \hat{\mu}_i^2
where \alpha_i^v, \alpha_i^m and \alpha_i^w are the adaptation coefficients of the variance, mean and weight, respectively; T denotes the total number of frames of the training data of this class of specific audio; \gamma is a normalization factor ensuring \sum_i \hat{w}_i = 1; w_i denotes the weight of the i-th Gaussian of the universal background model; \mu_i its mean; \sigma_i^2 its covariance; and \hat{\mu}_i denotes the mean of the i-th Gaussian of the adapted specific audio model.
The present invention further provides a short-time specific audio detection method, comprising:
Step 201, extracting features from the input test speech;
Step 202, inputting the test speech features extracted in step 201 into the universal background model obtained by the above short-time specific audio detection model generation method, and computing the score of the test speech on the universal background model;
Step 203, inputting the test speech features extracted in step 201 into the Gaussian mixture models of all classes of specific audio obtained by the above short-time specific audio detection model generation method, and computing the score of the test speech on the Gaussian mixture model of each class of specific audio;
Step 204, computing the difference between the score of the test speech on the universal background model obtained in step 202 and its score on the Gaussian mixture model of each class of specific audio obtained in step 203, and comparing each difference with a threshold to decide which class of specific audio the test audio belongs to; if several model scores fall within the threshold range, the decision takes the maximum: the specific audio class represented by the highest-scoring model is selected as the final decision for the test speech.
In the above technical scheme, in step 202, computing the score of the test speech on the universal background model comprises: selecting the N Gaussians with the highest posterior probabilities in the universal background model, computing the sum of these N probabilities, and recording the indices of these N Gaussians.
In the above technical scheme, in step 203, computing the score of the test speech on the Gaussian mixture model of each class of specific audio comprises: using the N Gaussian indices of the universal background model recorded in step 202, computing the sum of the posterior probabilities of these N Gaussians in the specific audio mixture model, and taking this value as the score of the test speech on the mixture model of that class of specific audio.
In the above technical scheme, in step 201, the features extracted from the test speech are Mel-frequency cepstral coefficients.
The invention has the advantages that:
The method of the present invention not only alleviates the problem of insufficient training data for short-time specific audio models, but also suppresses background noise in the input data to a certain extent.
Brief description of the drawings
Fig. 1 is a block diagram of the basic principle of training the universal background model in the short-time specific audio detection model generation method;
Fig. 2 is a block diagram of the basic principle of training a specific audio model in the short-time specific audio detection model generation method;
Fig. 3 is a flow chart of the short-time specific audio detection method.
Detailed description of the embodiments
The specific embodiments of the present invention are now described in further detail with reference to Fig. 1 and Fig. 2.
The short-time specific audio detection method of the present invention comprises two stages: the first stage trains the models on the training speech data, and the second stage uses the trained models to detect on the test speech.
1. Model training stage
Step 101, extracting features from the training speech data; the extracted features are Mel-frequency cepstral coefficients (MFCC features), including the energy value and the first- and second-order differences;
In one embodiment, the MFCCs are extracted with a frame length of 20 ms and a frame shift of 10 ms, and include the energy value together with the first- and second-order differences, for a total feature dimension of 60;
The training speech data should comprise a large amount of non-specific audio data and a certain amount of specific audio data.
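By way of illustration only, a minimal Python sketch of this feature extraction step is given below, assuming the librosa library; the function name extract_mfcc_features is hypothetical, and the split into 20 static coefficients (with the energy-carrying 0th coefficient) plus first- and second-order differences follows the 60-dimensional embodiment above, but the patent does not prescribe any particular library or implementation.

import numpy as np
import librosa

def extract_mfcc_features(wav_path, sr=16000):
    # Load the audio at the target sampling rate (assumed 16 kHz).
    y, sr = librosa.load(wav_path, sr=sr)
    # 20 static MFCCs, 20 ms frame length, 10 ms frame shift.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20,
                                n_fft=int(0.020 * sr),
                                hop_length=int(0.010 * sr))
    delta1 = librosa.feature.delta(mfcc, order=1)  # first-order difference
    delta2 = librosa.feature.delta(mfcc, order=2)  # second-order difference
    feats = np.vstack([mfcc, delta1, delta2])      # shape (60, T)
    return feats.T                                 # T frames x 60 dimensions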
Step 102, using the features of the training speech data obtained in step 101, i.e. the Mel-frequency cepstral coefficients, to train the universal background model (UBM);
Referring to the training schematic of the universal background model given in Fig. 1, the universal background model is expressed by formula (1):
p(x \mid \lambda) = \sum_{i=1}^{M} w_i \, p_i(x)    (1)
In formula (1), w_i denotes the weight of the i-th Gaussian, with values in [0, 1] satisfying the normalization condition \sum_{i=1}^{M} w_i = 1; x denotes a frame feature of a training speech segment; \lambda denotes the set of all parameters of the Gaussian mixture model; M denotes the number of Gaussian components in the mixture.
In formula (1), p_i(x) denotes the probability density function of the i-th single Gaussian, expressed as formula (2):
p_i(x) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\left\{ -\tfrac{1}{2} (x - \mu_i)' \Sigma_i^{-1} (x - \mu_i) \right\}    (2)
Here p_i(x) is characterized by the following parameters: D, the dimension of the frame feature of a training speech segment, determined by the feature dimension chosen during feature extraction; \Sigma_i, the covariance matrix of this Gaussian; and \mu_i, its mean vector.
The above is the concrete expression of the universal background model: the Gaussian mixture model fits, through a linear weighted sum of multiple single Gaussians, the probability distribution of generic speech features, i.e. their probability density function. A well-trained universal background model can therefore characterize the distribution of speakers' utterances and their pronunciation characteristics well.
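To make formulas (1) and (2) concrete, the following is a minimal sketch of the per-frame likelihood computation, assuming diagonal covariance matrices (a common choice with MFCC features, though the patent does not restrict \Sigma_i to be diagonal); gmm_frame_log_likelihood is a hypothetical helper name reused in the later sketches.

import numpy as np

def gmm_frame_log_likelihood(x, w, mu, var):
    # x: one frame (D,); w: weights (M,); mu, var: (M, D) diagonal Gaussians.
    D = x.shape[0]
    # log p_i(x) of formula (2) for every component, diagonal Sigma_i.
    log_pi = (-0.5 * D * np.log(2.0 * np.pi)
              - 0.5 * np.sum(np.log(var), axis=1)
              - 0.5 * np.sum((x - mu) ** 2 / var, axis=1))
    log_wp = np.log(w) + log_pi                    # log(w_i * p_i(x))
    m = np.max(log_wp)
    # log-sum-exp over components gives log p(x | lambda) of formula (1).
    return m + np.log(np.sum(np.exp(log_wp - m))), log_wp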
On the basis of the above universal background model, training it with the features of the training speech data means estimating its parameters with the expectation-maximization method.
After parameter estimation the universal background model is obtained. This model is in essence a Gaussian mixture model, and its parameters comprise exactly three classes: the Gaussian weights w, the Gaussian variances \delta and the Gaussian means \mu, where w is the set of the individual Gaussian weights w_i, \delta the set of variances \delta_i, and \mu the set of means \mu_i, with i indexing the single Gaussian components. After training on the training data, these three parameters are uniquely determined.
The concrete parameter estimation procedure is as follows:
Step 102-1, updating the k-th Gaussian weight w_k:
The update of w_k is given by formula (3):
w_k = \frac{1}{T} \sum_{t=1}^{T} p(k \mid x_t, \lambda)    (3)
where x_t denotes the t-th frame feature vector of the input training speech x, a known vector computed during feature extraction; \lambda, as in formula (1), is the collective name for all parameters of the Gaussian mixture model, all of which are given initial values at the start of training and are therefore known; T denotes the total number of frames of all input training speech, a known quantity; k denotes the index of the k-th single Gaussian in the mixture; and p(k \mid x_t, \lambda) denotes the posterior probability of the input training speech frame x_t on the k-th Gaussian of the universal background model, computed from x_t and the mixture parameters \lambda.
Step 102-2, updating the k-th Gaussian mean \mu_k:
The update of \mu_k is given by formula (4):
\mu_k = \frac{\sum_{t=1}^{T} p(k \mid x_t, \lambda)\, x_t}{\sum_{t=1}^{T} p(k \mid x_t, \lambda)}    (4)
where the parameters have the same meanings as in formula (3); T, x_t and \lambda are known variables, and p(k \mid x_t, \lambda) is computed from the input frame x_t and the mixture parameters \lambda.
Step 102-3, updating the k-th Gaussian variance \delta_k^2:
The update of \delta_k^2 is given by formula (5):
\delta_k^2 = \frac{\sum_{t=1}^{T} p(k \mid x_t, \lambda)\, x_t^2}{\sum_{t=1}^{T} p(k \mid x_t, \lambda)} - \mu_k^2    (5)
where the parameters have the same meanings as in formulas (3) and (4); T, x_t, \lambda and \mu_k are all known variables, and p(k \mid x_t, \lambda) is computed from the input frame x_t and the mixture parameters \lambda.
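A minimal sketch of one expectation-maximization pass implementing formulas (3), (4) and (5), under the same diagonal-covariance assumption and reusing the hypothetical gmm_frame_log_likelihood helper above; in practice the updates are iterated until the likelihood converges.

import numpy as np

def em_update(X, w, mu, var, eps=1e-8):
    # X: (T, D) frame features; w: (M,); mu, var: (M, D).
    T = X.shape[0]
    post = np.zeros((T, w.shape[0]))
    for t in range(T):
        _, log_wp = gmm_frame_log_likelihood(X[t], w, mu, var)
        p = np.exp(log_wp - np.max(log_wp))
        post[t] = p / p.sum()                      # p(k | x_t, lambda)
    Nk = post.sum(axis=0)                          # sum_t p(k | x_t, lambda)
    w_new = Nk / T                                 # formula (3)
    mu_new = (post.T @ X) / (Nk[:, None] + eps)    # formula (4)
    var_new = (post.T @ X ** 2) / (Nk[:, None] + eps) - mu_new ** 2  # formula (5)
    return w_new, mu_new, np.maximum(var_new, eps)  # floor variances for stability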
Step 103: to obtain the model of each class of specific audio, some speech of that class must first be obtained as model training speech. If data of that class are difficult to acquire, the specific audio data of that class used when training the universal background model can be reused; if new data of that class can be obtained, the new audio data are used as the training data. However much training data there is, each class of specific audio yields one corresponding specific audio model.
In this step, as shown in Fig. 2, a small amount of training data of a certain class of specific audio and Bayesian adaptation are used to derive the model of that class from the universal background model. The concrete adaptation process is as follows:
Step 103-1, first computing, from the feature vectors of the specific audio training data, the posterior probability, first-order statistic and second-order statistic of each speech frame on the universal background model; the concrete computation is given by formulas (6), (7) and (8):
n_i = \sum_{t=1}^{T} \Pr(i \mid x_t)    (6)
E_i(x) = \frac{1}{n_i} \sum_{t=1}^{T} \Pr(i \mid x_t)\, x_t    (7)
E_i(x^2) = \frac{1}{n_i} \sum_{t=1}^{T} \Pr(i \mid x_t)\, x_t^2    (8)
where \Pr(i \mid x_t) denotes the posterior probability of the t-th frame of the input audio x on the i-th Gaussian of the universal background model; x_t denotes the feature of the t-th frame of the input audio x; T denotes the total number of frames of the input audio; and i indexes the i-th single Gaussian in the universal background model.
Because the data used for adaptation differ for each class of specific audio, the posterior probabilities and the first- and second-order statistics computed for training each specific audio model also differ.
Step 103-2, using the posterior probability, first-order statistic and second-order statistic computed in step 103-1 to adaptively adjust the parameters of the universal background model, obtaining the weight \hat{w}_i, mean \hat{\mu}_i and covariance \hat{\delta}_i^2 of the specific audio model. Because the specific audio model is in essence also a Gaussian mixture model, once \hat{w}_i, \hat{\mu}_i and \hat{\delta}_i^2 are obtained, the mixture model of this specific audio is fully characterized.
The concrete adaptation formulas are (9), (10) and (11):
\hat{w}_i = \left[ \alpha_i^w n_i / T + (1 - \alpha_i^w) w_i \right] \gamma    (9)
\hat{\mu}_i = \alpha_i^m E_i(x) + (1 - \alpha_i^m)\, \mu_i    (10)
\hat{\delta}_i^2 = \alpha_i^v E_i(x^2) + (1 - \alpha_i^v)(\sigma_i^2 + \mu_i^2) - \hat{\mu}_i^2    (11)
where \alpha_i^v, \alpha_i^m and \alpha_i^w are the adaptation coefficients of the variance, mean and weight, respectively; n_i, E_i(x) and E_i(x^2) are the posterior probability, first-order statistic and second-order statistic of the training data of this specific audio class computed by formulas (6), (7) and (8); in formula (9), T denotes the total number of frames of the training data of this specific audio class, and \gamma is a normalization factor ensuring \sum_i \hat{w}_i = 1, while w_i denotes the weight of the i-th Gaussian of the universal background model; in formula (10), \mu_i denotes the mean of the i-th Gaussian of the universal background model; in formula (11), \sigma_i^2 denotes the covariance of the i-th Gaussian of the universal background model, \mu_i its mean, and \hat{\mu}_i the mean of the i-th Gaussian of the adapted specific audio model.
After the above computation, the model of this class of specific audio is obtained.
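A minimal sketch of this adaptation step, implementing the statistics (6)-(8) and the updates (9)-(11); the relevance-factor form alpha_i = n_i / (n_i + r) of the adaptation coefficients is an assumption borrowed from common GMM-UBM practice, since the patent does not specify how \alpha_i^w, \alpha_i^m and \alpha_i^v are chosen.

import numpy as np

def map_adapt(X, w, mu, var, r=16.0, eps=1e-8):
    # Adapt a UBM (w, mu, var) to specific-audio data X of shape (T, D).
    T = X.shape[0]
    post = np.zeros((T, w.shape[0]))
    for t in range(T):
        _, log_wp = gmm_frame_log_likelihood(X[t], w, mu, var)
        p = np.exp(log_wp - np.max(log_wp))
        post[t] = p / p.sum()                         # Pr(i | x_t)
    n = post.sum(axis=0)                              # formula (6)
    Ex = (post.T @ X) / (n[:, None] + eps)            # formula (7)
    Ex2 = (post.T @ X ** 2) / (n[:, None] + eps)      # formula (8)
    alpha = n / (n + r)          # assumed relevance-factor adaptation coefficient
    w_hat = alpha * n / T + (1.0 - alpha) * w
    w_hat = w_hat / w_hat.sum()  # gamma normalization of formula (9)
    a = alpha[:, None]           # same coefficient used for mean and variance here
    mu_hat = a * Ex + (1.0 - a) * mu                              # formula (10)
    var_hat = a * Ex2 + (1.0 - a) * (var + mu ** 2) - mu_hat ** 2  # formula (11)
    return w_hat, mu_hat, np.maximum(var_hat, eps)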
As can be seen from step 103-1, because the adaptation data of each specific audio model differ, the computed posterior probabilities and first- and second-order statistics differ as well, so the specific audio models finally obtained after step 103-2 also differ from one another.
2. Test stage
Referring to Fig. 3, the test stage comprises the following steps:
Step 201, extracting features from the input test speech;
The features extracted in this step are of the same type as those extracted in step 101, e.g. Mel-frequency cepstral coefficients;
Step 202, inputting the test speech features extracted in step 201 into the universal background model trained in step 102, and computing the score of the test speech on the universal background model.
As explained above, the universal background model is in essence a Gaussian mixture model, and the score of the test speech on it is the sum of its posterior probabilities over the Gaussians. As a preferred implementation, to speed up scoring, the actual computation does not evaluate the posterior probabilities of all Gaussians but selects the N Gaussians with the highest posterior probabilities, computes the sum of these N probabilities, and records the indices of these N Gaussians, as sketched below.
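A minimal sketch of this fast scoring step, reusing the hypothetical helper above; taking the top N posteriors per frame and averaging the frame scores over the utterance is an assumed reading, since the patent does not spell out the frame-level bookkeeping.

import numpy as np

def ubm_topn_score(X, w, mu, var, N=5):
    # X: (T, D) test features; the UBM is (w, mu, var) as before.
    T = X.shape[0]
    frame_scores = np.zeros(T)
    top_idx = np.zeros((T, N), dtype=int)
    for t in range(T):
        _, log_wp = gmm_frame_log_likelihood(X[t], w, mu, var)
        p = np.exp(log_wp - np.max(log_wp))
        p = p / p.sum()                    # posterior of each Gaussian
        idx = np.argsort(p)[-N:]           # indices of the N largest posteriors
        top_idx[t] = idx
        frame_scores[t] = p[idx].sum()     # sum of the N probabilities
    return frame_scores.mean(), top_idx    # utterance-level score + indices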
Step 203, inputting the test speech features extracted in step 201 into the mixture models of the respective specific audio classes obtained in step 103, and computing the score of the test speech on each specific audio mixture model; if there are M specific audio models, M scores are obtained in total.
The concrete method for computing the score of the test speech on each specific audio mixture model is still to sum the posterior probabilities of the test speech over the Gaussians of the specific audio model. As a preferred implementation, to improve computation speed, the N Gaussian indices of the universal background model recorded in step 202 are reused: the sum of the posterior probabilities of these N Gaussians in the specific audio mixture model is computed, and this value is taken as the score of the test speech on that model.
Step 204, computing the difference between the score of the test speech on the universal background model obtained in step 202 and its score on each specific audio mixture model obtained in step 203, and comparing each difference with a threshold to decide which specific audio class the test audio belongs to; if several model scores fall within the threshold range, the decision takes the maximum: among those scores within the threshold range, the specific audio class represented by the highest-scoring model is selected as the final decision for the test speech; a sketch of this decision follows.
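A minimal sketch of steps 203 and 204 together, under the same assumptions as the earlier helpers; the variable names, the sign convention of the threshold test, and the utterance-level averaging are all assumptions rather than details fixed by the patent.

import numpy as np

def detect(X, ubm, class_models, N=5, threshold=0.0):
    # ubm and each entry of class_models are (w, mu, var) tuples;
    # class_models maps class name -> adapted specific audio model.
    ubm_score, top_idx = ubm_topn_score(X, *ubm, N=N)
    diffs = {}
    for name, (w, mu, var) in class_models.items():
        s = 0.0
        for t in range(X.shape[0]):
            _, log_wp = gmm_frame_log_likelihood(X[t], w, mu, var)
            p = np.exp(log_wp - np.max(log_wp))
            p = p / p.sum()
            s += p[top_idx[t]].sum()       # same N Gaussians recorded in step 202
        diffs[name] = s / X.shape[0] - ubm_score   # score difference of step 204
    passing = {k: v for k, v in diffs.items() if v > threshold}
    # If several models pass the threshold, take the maximum; None if none pass.
    return max(passing, key=passing.get) if passing else None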
Finally, it should be noted that the above embodiments merely illustrate, and do not limit, the technical scheme of the present invention. Although the invention has been described in detail with reference to embodiments, those of ordinary skill in the art should understand that modifications or equivalent replacements of the technical scheme of the present invention that do not depart from its spirit and scope shall all be encompassed within the claims of the present invention.

Claims (8)

1. A short-time specific audio detection model generation method, comprising:
Step 101, extracting features from training speech data; wherein the training speech data comprise non-specific audio data and specific audio data;
Step 102, training a universal background model with the features of the training speech data obtained in step 101; wherein the universal background model is a Gaussian mixture model whose expression is:
p(x \mid \lambda) = \sum_{i=1}^{M} w_i \, p_i(x);
where w_i denotes the weight of the i-th Gaussian, with values in [0, 1] satisfying the normalization condition \sum_{i=1}^{M} w_i = 1; x denotes a frame feature of a training speech segment; \lambda denotes the set of all parameters of the Gaussian mixture model; M denotes the number of Gaussian components; and p_i(x) denotes the probability density function of the i-th single Gaussian, whose expression is:
p_i(x) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\left\{ -\tfrac{1}{2} (x - \mu_i)' \Sigma_i^{-1} (x - \mu_i) \right\};
where D is the dimension of the frame feature of the training speech segment; \Sigma_i is the covariance matrix of this Gaussian; and \mu_i is its mean vector;
Step 103, using the features of a certain class of specific audio data in the training speech data, adaptively deriving the model of that class from the universal background model obtained in step 102; repeating this operation until the models of all classes of specific audio data in the training speech data are obtained.
2. The short-time specific audio detection model generation method according to claim 1, characterized in that, in step 101, the features extracted from the training speech data are Mel-frequency cepstral coefficients.
3. The short-time specific audio detection model generation method according to claim 1, characterized in that, in step 102, training the universal background model comprises estimating its parameters with the expectation-maximization method; the estimated parameters comprise three classes: the Gaussian weights w, the Gaussian variances \delta and the Gaussian means \mu, where w is the set of the individual Gaussian weights w_i, \delta the set of variances \delta_i, and \mu the set of means \mu_i, with i indexing the single Gaussian components; specifically comprising:
Step 102-1, updating the k-th Gaussian weight w_k:
The update of w_k is given by
w_k = \frac{1}{T} \sum_{t=1}^{T} p(k \mid x_t, \lambda)
where x_t denotes the t-th frame feature vector of the input training speech x, a known vector computed during feature extraction; \lambda is the collective name for all parameters of the Gaussian mixture model, all of which are given initial values at the start of training and are therefore known; T denotes the total number of frames of all input training speech, a known quantity; k denotes the index of the k-th single Gaussian in the mixture; and p(k \mid x_t, \lambda) denotes the posterior probability of the input training speech frame x_t on the k-th Gaussian of the universal background model, computed from x_t and the mixture parameters \lambda;
Step 102-2, updating the k-th Gaussian mean \mu_k:
The update of \mu_k is given by
\mu_k = \frac{\sum_{t=1}^{T} p(k \mid x_t, \lambda)\, x_t}{\sum_{t=1}^{T} p(k \mid x_t, \lambda)}
where T, x_t and \lambda are known variables, and p(k \mid x_t, \lambda) is computed from the input frame x_t and the mixture parameters \lambda;
Step 102-3, updating the k-th Gaussian variance \delta_k^2:
The update of \delta_k^2 is given by
\delta_k^2 = \frac{\sum_{t=1}^{T} p(k \mid x_t, \lambda)\, x_t^2}{\sum_{t=1}^{T} p(k \mid x_t, \lambda)} - \mu_k^2
where T, x_t, \lambda and \mu_k are all known variables, and p(k \mid x_t, \lambda) is computed from the input frame x_t and the mixture parameters \lambda.
4. The short-time specific audio detection model generation method according to claim 1, characterized in that, in step 103, adaptively deriving the model of a class of specific audio data from the universal background model obtained in step 102 comprises:
Step 103-1, first computing, from the feature vectors of the specific audio training data, the posterior probability n_i, the first-order statistic E_i(x) and the second-order statistic E_i(x^2) of each speech frame on the universal background model, as follows:
n_i = \sum_{t=1}^{T} \Pr(i \mid x_t)
E_i(x) = \frac{1}{n_i} \sum_{t=1}^{T} \Pr(i \mid x_t)\, x_t
E_i(x^2) = \frac{1}{n_i} \sum_{t=1}^{T} \Pr(i \mid x_t)\, x_t^2
where \Pr(i \mid x_t) denotes the posterior probability of the t-th frame of the input audio x on the i-th Gaussian of the universal background model; x_t denotes the feature of the t-th frame of the input audio x; T denotes the total number of frames of the input audio; and i indexes the i-th single Gaussian in the universal background model;
Step 103-2, using the posterior probability, first-order statistic and second-order statistic computed in step 103-1 to adaptively adjust the parameters of the universal background model, obtaining the weight \hat{w}_i, mean \hat{\mu}_i and covariance \hat{\delta}_i^2 of the specific audio model; the adaptation formulas are:
\hat{w}_i = \left[ \alpha_i^w n_i / T + (1 - \alpha_i^w) w_i \right] \gamma
\hat{\mu}_i = \alpha_i^m E_i(x) + (1 - \alpha_i^m)\, \mu_i
\hat{\delta}_i^2 = \alpha_i^v E_i(x^2) + (1 - \alpha_i^v)(\sigma_i^2 + \mu_i^2) - \hat{\mu}_i^2
where \alpha_i^v, \alpha_i^m and \alpha_i^w are the adaptation coefficients of the variance, mean and weight, respectively; T denotes the total number of frames of the training data of this class of specific audio; \gamma is a normalization factor ensuring \sum_i \hat{w}_i = 1; w_i denotes the weight of the i-th Gaussian of the universal background model; \mu_i its mean; \sigma_i^2 its covariance; and \hat{\mu}_i denotes the mean of the i-th Gaussian of the adapted specific audio model.
5. A short-time specific audio detection method, comprising:
Step 201, extracting features from the input test speech;
Step 202, inputting the test speech features extracted in step 201 into the universal background model obtained by the short-time specific audio detection model generation method of any one of claims 1-4, and computing the score of the test speech on the universal background model;
Step 203, inputting the test speech features extracted in step 201 into the Gaussian mixture models of all classes of specific audio obtained by the short-time specific audio detection model generation method of any one of claims 1-4, and computing the score of the test speech on the Gaussian mixture model of each class of specific audio;
Step 204, computing the difference between the score of the test speech on the universal background model obtained in step 202 and its score on the Gaussian mixture model of each class of specific audio obtained in step 203, and comparing each difference with a threshold to decide which class of specific audio the test audio belongs to; if several model scores fall within the threshold range, the decision takes the maximum: the specific audio class represented by the highest-scoring model is selected as the final decision for the test speech.
6. The short-time specific audio detection method according to claim 5, characterized in that, in step 202, computing the score of the test speech on the universal background model comprises: selecting the N Gaussians with the highest posterior probabilities in the universal background model, computing the sum of these N probabilities, and recording the indices of these N Gaussians.
7. The short-time specific audio detection method according to claim 6, characterized in that, in step 203, computing the score of the test speech on the Gaussian mixture model of each class of specific audio comprises: using the N Gaussian indices of the universal background model recorded in step 202, computing the sum of the posterior probabilities of these N Gaussians in the specific audio mixture model, and taking this value as the score of the test speech on the mixture model of that class of specific audio.
8. The short-time specific audio detection method according to claim 5, characterized in that, in step 201, the features extracted from the test speech are Mel-frequency cepstral coefficients.
CN201510236568.8A 2015-05-11 2015-05-11 Short-time specific audio detection model generation and detection method Expired - Fee Related CN104992708B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510236568.8A CN104992708B (en) 2015-05-11 2015-05-11 Short-time specific audio detection model generation and detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510236568.8A CN104992708B (en) 2015-05-11 2015-05-11 Short-time specific audio detection model generation and detection method

Publications (2)

Publication Number Publication Date
CN104992708A 2015-10-21
CN104992708B CN104992708B (en) 2018-07-24

Family

ID=54304511

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510236568.8A Expired - Fee Related CN104992708B (en) Short-time specific audio detection model generation and detection method

Country Status (1)

Country Link
CN (1) CN104992708B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106251861A (en) * 2016-08-05 2016-12-21 重庆大学 A kind of abnormal sound in public places detection method based on scene modeling
CN107517207A (en) * 2017-03-13 2017-12-26 平安科技(深圳)有限公司 Server, auth method and computer-readable recording medium
CN108305616A (en) * 2018-01-16 2018-07-20 国家计算机网络与信息安全管理中心 A kind of audio scene recognition method and device based on long feature extraction in short-term
CN110135492A (en) * 2019-05-13 2019-08-16 山东大学 Equipment fault diagnosis and method for detecting abnormality and system based on more Gauss models
CN113888777A (en) * 2021-09-08 2022-01-04 南京金盾公共安全技术研究院有限公司 Voiceprint unlocking method and device based on cloud machine learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102509546A (en) * 2011-11-11 2012-06-20 北京声迅电子股份有限公司 Noise reduction and abnormal sound detection method applied to rail transit
CN102623009A (en) * 2012-03-02 2012-08-01 安徽科大讯飞信息技术股份有限公司 Abnormal emotion automatic detection and extraction method and system on basis of short-time analysis
CN103198605A (en) * 2013-03-11 2013-07-10 成都百威讯科技有限责任公司 Indoor emergent abnormal event alarm system
CN103226951A (en) * 2013-04-19 2013-07-31 清华大学 Speaker verification system creation method based on model sequence adaptive technique
CN103366738A (en) * 2012-04-01 2013-10-23 佳能株式会社 Methods and devices for generating sound classifier and detecting abnormal sound, and monitoring system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102509546A (en) * 2011-11-11 2012-06-20 北京声迅电子股份有限公司 Noise reduction and abnormal sound detection method applied to rail transit
CN102623009A (en) * 2012-03-02 2012-08-01 安徽科大讯飞信息技术股份有限公司 Abnormal emotion automatic detection and extraction method and system on basis of short-time analysis
CN103366738A (en) * 2012-04-01 2013-10-23 佳能株式会社 Methods and devices for generating sound classifier and detecting abnormal sound, and monitoring system
CN103198605A (en) * 2013-03-11 2013-07-10 成都百威讯科技有限责任公司 Indoor emergent abnormal event alarm system
CN103226951A (en) * 2013-04-19 2013-07-31 清华大学 Speaker verification system creation method based on model sequence adaptive technique

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Luo Senlin; Wang Kun; Xie Erman; Pan Limin; Li Jinyu: "High-precision recognition method for specific audio events based on fusion of GMM and SVM", Transactions of Beijing Institute of Technology *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106251861A (en) * 2016-08-05 2016-12-21 重庆大学 A kind of abnormal sound in public places detection method based on scene modeling
CN106251861B (en) * 2016-08-05 2019-04-23 重庆大学 A kind of abnormal sound in public places detection method based on scene modeling
CN107517207A (en) * 2017-03-13 2017-12-26 平安科技(深圳)有限公司 Server, auth method and computer-readable recording medium
WO2018166187A1 (en) * 2017-03-13 2018-09-20 平安科技(深圳)有限公司 Server, identity verification method and system, and a computer-readable storage medium
CN108305616A (en) * 2018-01-16 2018-07-20 国家计算机网络与信息安全管理中心 A kind of audio scene recognition method and device based on long feature extraction in short-term
CN110135492A (en) * 2019-05-13 2019-08-16 山东大学 Equipment fault diagnosis and method for detecting abnormality and system based on more Gauss models
CN113888777A (en) * 2021-09-08 2022-01-04 南京金盾公共安全技术研究院有限公司 Voiceprint unlocking method and device based on cloud machine learning
CN113888777B (en) * 2021-09-08 2023-08-18 南京金盾公共安全技术研究院有限公司 Voiceprint unlocking method and device based on cloud machine learning

Also Published As

Publication number Publication date
CN104992708B (en) 2018-07-24

Similar Documents

Publication Publication Date Title
CN104992708A (en) Short-time specific audio detection model generating method and short-time specific audio detection method
Xu et al. Deep sparse rectifier neural networks for speech denoising
Chai et al. A cross-entropy-guided measure (CEGM) for assessing speech recognition performance and optimizing DNN-based speech enhancement
De Leon et al. Detection of synthetic speech for the problem of imposture
CN110308485B (en) Microseismic signal classification method and device based on deep learning and storage medium
CN110349597B (en) Voice detection method and device
CN104900235A (en) Voiceprint recognition method based on pitch period mixed characteristic parameters
CN104835498A (en) Voiceprint identification method based on multi-type combination characteristic parameters
Rao et al. Target speaker extraction for overlapped multi-talker speaker verification
CN103456302B (en) A kind of emotional speaker recognition method based on the synthesis of emotion GMM Model Weight
CN104485108A (en) Noise and speaker combined compensation method based on multi-speaker model
Ghalehjegh et al. Deep bottleneck features for i-vector based text-independent speaker verification
CN108320732A (en) The method and apparatus for generating target speaker's speech recognition computation model
Chazan et al. A phoneme-based pre-training approach for deep neural network with application to speech enhancement
Allen et al. Language identification using warping and the shifted delta cepstrum
CN106297769A (en) A kind of distinctive feature extracting method being applied to languages identification
CN106251861A (en) A kind of abnormal sound in public places detection method based on scene modeling
Yamamoto et al. Denoising autoencoder-based speaker feature restoration for utterances of short duration.
Pohjalainen et al. Automatic detection of anger in telephone speech with robust autoregressive modulation filtering
Hong et al. Modified-prior PLDA and score calibration for duration mismatch compensation in speaker recognition system.
Dong et al. Long-term SNR estimation using noise residuals and a two-stage deep-learning framework
Wang et al. F0 estimation in noisy speech based on long-term harmonic feature analysis combined with neural network classification
Soni et al. Effectiveness of ideal ratio mask for non-intrusive quality assessment of noise suppressed speech
Garg et al. Deep convolutional neural network-based speech signal enhancement using extensive speech features
Mansour et al. A comparative study in emotional speaker recognition in noisy environment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20180724