CN104992708A - Short-time specific audio detection model generating method and short-time specific audio detection method


Info

Publication number
CN104992708A
CN104992708A
Authority
CN
China
Prior art keywords
model
Gaussian
specific audio
universal background
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510236568.8A
Other languages
Chinese (zh)
Other versions
CN104992708B (en)
Inventor
云晓春
颜永红
袁庆升
黄宇飞
任彦
周若华
黄文廷
邹学强
包秀国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
National Computer Network and Information Security Management Center
Original Assignee
Institute of Acoustics CAS
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS, National Computer Network and Information Security Management Center filed Critical Institute of Acoustics CAS
Priority to CN201510236568.8A
Publication of CN104992708A
Application granted
Publication of CN104992708B
Legal status: Expired - Fee Related

Landscapes

  • Electrically Operated Instructional Devices (AREA)
  • Complex Calculations (AREA)

Abstract

The invention relates to a short-time specific audio detection model generating method comprising: extracting features from training speech data, wherein the training speech data comprise non-specific audio data and specific audio data; training a universal background model with the features of the training speech data; adaptively deriving a model for one class of specific audio data from the universal background model and the features of that class in the training speech data; and repeating this operation until the models of all classes of specific audio data in the training speech data are obtained. The invention also provides a short-time specific audio detection method which detects specific audio data by model scoring. The method not only alleviates the problem of insufficient training data for specific audio models, but also suppresses the background noise of the input data to a certain extent.

Description

Short-time specific audio detection model generation and detection method
Technical field
The present invention relates to methods for short-time specific audio detection, and more particularly to the detection of short-time specific audio using Gaussian mixture models.
Background art
Short-time specific audio plays an important role in many fields, especially in security. In certain scenarios, a given class of short-time specific audio must be detected so that urgent events can be handled in time. For example, in public places, public safety supervision requires detecting accidents such as sudden screams, sudden explosions or gunshots; these short-time specific audio events must be detected promptly so that such incidents can be dealt with in time. In addition, in relatively important locations, short-time specific audio detection can also be used for abnormal-sound detection and can serve as an effective early warning.
Existing short-time specific audio detection methods still face several problems. First, because short-time specific audio events happen quickly and last only briefly, exploiting the information in the short audio segment is crucial. Second, specific audio events occur infrequently, so the training data are often insufficient. Third, since the deployment scenes often contain complex background noise, suppressing background noise well is also an important problem for short-time specific audio detection.
Summary of the invention
The object of the invention is to overcome the defects of existing short-time specific audio detection methods, namely insufficient training data and the inability to suppress background noise, by providing a short-time specific audio model generation and detection method based on Gaussian mixture models.
The present invention provides a short-time specific audio detection model generation method, comprising:
Step 101, extracting features from training speech data; wherein the training speech data comprise non-specific audio data and specific audio data;
Step 102, training a universal background model with the features of the training speech data obtained in step 101; wherein the universal background model is a Gaussian mixture model whose expression is:
p(x \mid \lambda) = \sum_{i=1}^{M} w_i \, p_i(x);
where w_i denotes the weight of the i-th Gaussian, with values in [0, 1] satisfying the normalization condition \sum_{i=1}^{M} w_i = 1; x denotes a frame feature of a training speech segment; \lambda denotes the set of all parameters of the Gaussian mixture model; M denotes the number of Gaussian components; and p_i(x) denotes the probability density function of the i-th single Gaussian, whose expression is:
p_i(x) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\left\{ -\tfrac{1}{2} (x - \mu_i)' \Sigma_i^{-1} (x - \mu_i) \right\};
where D is the dimension of the frame feature of the training speech segment; \Sigma_i is the covariance matrix of this Gaussian; and \mu_i is its mean vector;
Step 103, using the features of a certain class of specific audio data in the training speech data, adaptively deriving the model of that class from the universal background model obtained in step 102; repeating this operation until the models of all classes of specific audio data in the training speech data are obtained.
In the above technical scheme, in step 101, the features extracted from the training speech data are Mel-frequency cepstral coefficients.
In the above technical scheme, in step 102, training the universal background model comprises estimating its parameters with the expectation-maximization method; the estimated parameters comprise three classes: the Gaussian weights w, the Gaussian variances \delta and the Gaussian means \mu, where w is the set of the individual Gaussian weights w_i, \delta the set of variances \delta_i, and \mu the set of means \mu_i, with i indexing the single Gaussian components; specifically comprising:
Step 102-1, updating the k-th Gaussian weight w_k:
The update of w_k is given by
w_k = \frac{1}{T} \sum_{t=1}^{T} p(k \mid x_t, \lambda)
where x_t denotes the t-th frame feature vector of the input training speech x, a known vector computed during feature extraction; \lambda is the collective name for all parameters of the Gaussian mixture model, all of which are given initial values at the start of training and are therefore known; T denotes the total number of frames of all input training speech, a known quantity; k denotes the index of the k-th single Gaussian in the mixture; and p(k \mid x_t, \lambda) denotes the posterior probability of the input training speech frame x_t on the k-th Gaussian of the universal background model, computed from x_t and the mixture parameters \lambda;
Step 102-2, updating the k-th Gaussian mean \mu_k:
The update of \mu_k is given by
\mu_k = \frac{\sum_{t=1}^{T} p(k \mid x_t, \lambda)\, x_t}{\sum_{t=1}^{T} p(k \mid x_t, \lambda)}
where T, x_t and \lambda are known variables, and p(k \mid x_t, \lambda) is computed from the input frame x_t and the mixture parameters \lambda;
Step 102-3, updating the k-th Gaussian variance \delta_k^2:
The update of \delta_k^2 is given by
\delta_k^2 = \frac{\sum_{t=1}^{T} p(k \mid x_t, \lambda)\, x_t^2}{\sum_{t=1}^{T} p(k \mid x_t, \lambda)} - \mu_k^2
where T, x_t, \lambda and \mu_k are all known variables, and p(k \mid x_t, \lambda) is computed from the input frame x_t and the mixture parameters \lambda.
In the above technical scheme, in step 103, adaptively deriving the model of a class of specific audio data from the universal background model obtained in step 102 comprises:
Step 103-1, first computing, from the feature vectors of the specific audio training data, the posterior probability n_i, the first-order statistic E_i(x) and the second-order statistic E_i(x^2) of each speech frame on the universal background model, as follows:
n_i = \sum_{t=1}^{T} \Pr(i \mid x_t)
E_i(x) = \frac{1}{n_i} \sum_{t=1}^{T} \Pr(i \mid x_t)\, x_t
E_i(x^2) = \frac{1}{n_i} \sum_{t=1}^{T} \Pr(i \mid x_t)\, x_t^2
where \Pr(i \mid x_t) denotes the posterior probability of the t-th frame of the input audio x on the i-th Gaussian of the universal background model; x_t denotes the feature of the t-th frame of the input audio x; T denotes the total number of frames of the input audio; and i indexes the i-th single Gaussian in the universal background model;
Step 103-2, using the posterior probability, first-order statistic and second-order statistic computed in step 103-1 to adaptively adjust the parameters of the universal background model, obtaining the weight \hat{w}_i, mean \hat{\mu}_i and covariance \hat{\delta}_i^2 of the specific audio model; the adaptation formulas are:
\hat{w}_i = \left[ \alpha_i^w n_i / T + (1 - \alpha_i^w) w_i \right] \gamma
\hat{\mu}_i = \alpha_i^m E_i(x) + (1 - \alpha_i^m)\, \mu_i
\hat{\delta}_i^2 = \alpha_i^v E_i(x^2) + (1 - \alpha_i^v)(\sigma_i^2 + \mu_i^2) - \hat{\mu}_i^2
where \alpha_i^v, \alpha_i^m and \alpha_i^w are the adaptation coefficients of the variance, mean and weight, respectively; T denotes the total number of frames of the training data of this class of specific audio; \gamma is a normalization factor ensuring \sum_i \hat{w}_i = 1; w_i denotes the weight of the i-th Gaussian of the universal background model; \mu_i its mean; \sigma_i^2 its covariance; and \hat{\mu}_i denotes the mean of the i-th Gaussian of the adapted specific audio model.
The present invention further provides a short-time specific audio detection method, comprising:
Step 201, extracting features from the input test speech;
Step 202, inputting the test speech features extracted in step 201 into the universal background model obtained by the above short-time specific audio detection model generation method, and computing the score of the test speech on the universal background model;
Step 203, inputting the test speech features extracted in step 201 into the Gaussian mixture models of all classes of specific audio obtained by the above short-time specific audio detection model generation method, and computing the score of the test speech on the Gaussian mixture model of each class of specific audio;
Step 204, computing the difference between the score of the test speech on the universal background model obtained in step 202 and its score on the Gaussian mixture model of each class of specific audio obtained in step 203, and comparing each difference with a threshold to decide which class of specific audio the test audio belongs to; if several model scores fall within the threshold range, the decision takes the maximum: the specific audio class represented by the highest-scoring model is selected as the final decision for the test speech.
In the above technical scheme, in step 202, computing the score of the test speech on the universal background model comprises: selecting the N Gaussians with the highest posterior probabilities in the universal background model, computing the sum of these N probabilities, and recording the indices of these N Gaussians.
In the above technical scheme, in step 203, computing the score of the test speech on the Gaussian mixture model of each class of specific audio comprises: using the N Gaussian indices of the universal background model recorded in step 202, computing the sum of the posterior probabilities of these N Gaussians in the specific audio mixture model, and taking this value as the score of the test speech on the mixture model of that class of specific audio.
In the above technical scheme, in step 201, the features extracted from the test speech are Mel-frequency cepstral coefficients.
The invention has the advantages that:
The method of the present invention not only alleviates the problem of insufficient training data for short-time specific audio models, but also suppresses background noise in the input data to a certain extent.
Brief description of the drawings
Fig. 1 is a block diagram of the basic principle of training the universal background model in the short-time specific audio detection model generation method;
Fig. 2 is a block diagram of the basic principle of training a specific audio model in the short-time specific audio detection model generation method;
Fig. 3 is a flow chart of the short-time specific audio detection method.
Detailed description of the embodiments
The specific embodiments of the present invention are now described in further detail with reference to Fig. 1 and Fig. 2.
The short-time specific audio detection method of the present invention comprises two stages: the first stage trains the models on the training speech data, and the second stage uses the trained models to detect on the test speech.
1. Model training stage
Step 101, extracting features from the training speech data; the extracted features are Mel-frequency cepstral coefficients (MFCC features), including the energy value and the first- and second-order differences;
In one embodiment, the MFCCs are extracted with a frame length of 20 ms and a frame shift of 10 ms, and include the energy value together with the first- and second-order differences, for a total feature dimension of 60;
The training speech data should comprise a large amount of non-specific audio data and a certain amount of specific audio data.
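By way of illustration only, a minimal Python sketch of this feature extraction step is given below, assuming the librosa library; the function name extract_mfcc_features is hypothetical, and the split into 20 static coefficients (with the energy-carrying 0th coefficient) plus first- and second-order differences follows the 60-dimensional embodiment above, but the patent does not prescribe any particular library or implementation.

import numpy as np
import librosa

def extract_mfcc_features(wav_path, sr=16000):
    # Load the audio at the target sampling rate (assumed 16 kHz).
    y, sr = librosa.load(wav_path, sr=sr)
    # 20 static MFCCs, 20 ms frame length, 10 ms frame shift.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20,
                                n_fft=int(0.020 * sr),
                                hop_length=int(0.010 * sr))
    delta1 = librosa.feature.delta(mfcc, order=1)  # first-order difference
    delta2 = librosa.feature.delta(mfcc, order=2)  # second-order difference
    feats = np.vstack([mfcc, delta1, delta2])      # shape (60, T)
    return feats.T                                 # T frames x 60 dimensions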
Step 102, using the features of the training speech data obtained in step 101, i.e. the Mel-frequency cepstral coefficients, to train the universal background model (UBM);
Referring to the training schematic of the universal background model given in Fig. 1, the universal background model is expressed by formula (1):
p(x \mid \lambda) = \sum_{i=1}^{M} w_i \, p_i(x)    (1)
In formula (1), w_i denotes the weight of the i-th Gaussian, with values in [0, 1] satisfying the normalization condition \sum_{i=1}^{M} w_i = 1; x denotes a frame feature of a training speech segment; \lambda denotes the set of all parameters of the Gaussian mixture model; M denotes the number of Gaussian components in the mixture.
In formula (1), p_i(x) denotes the probability density function of the i-th single Gaussian, expressed as formula (2):
p_i(x) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\left\{ -\tfrac{1}{2} (x - \mu_i)' \Sigma_i^{-1} (x - \mu_i) \right\}    (2)
Here p_i(x) is characterized by the following parameters: D, the dimension of the frame feature of a training speech segment, determined by the feature dimension chosen during feature extraction; \Sigma_i, the covariance matrix of this Gaussian; and \mu_i, its mean vector.
The above is the concrete expression of the universal background model: the Gaussian mixture model fits, through a linear weighted sum of multiple single Gaussians, the probability distribution of generic speech features, i.e. their probability density function. A well-trained universal background model can therefore characterize the distribution of speakers' utterances and their pronunciation characteristics well.
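To make formulas (1) and (2) concrete, the following is a minimal sketch of the per-frame likelihood computation, assuming diagonal covariance matrices (a common choice with MFCC features, though the patent does not restrict \Sigma_i to be diagonal); gmm_frame_log_likelihood is a hypothetical helper name reused in the later sketches.

import numpy as np

def gmm_frame_log_likelihood(x, w, mu, var):
    # x: one frame (D,); w: weights (M,); mu, var: (M, D) diagonal Gaussians.
    D = x.shape[0]
    # log p_i(x) of formula (2) for every component, diagonal Sigma_i.
    log_pi = (-0.5 * D * np.log(2.0 * np.pi)
              - 0.5 * np.sum(np.log(var), axis=1)
              - 0.5 * np.sum((x - mu) ** 2 / var, axis=1))
    log_wp = np.log(w) + log_pi                    # log(w_i * p_i(x))
    m = np.max(log_wp)
    # log-sum-exp over components gives log p(x | lambda) of formula (1).
    return m + np.log(np.sum(np.exp(log_wp - m))), log_wp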
On the basis of the above universal background model, training it with the features of the training speech data means estimating its parameters with the expectation-maximization method.
After parameter estimation the universal background model is obtained. This model is in essence a Gaussian mixture model, and its parameters comprise exactly three classes: the Gaussian weights w, the Gaussian variances \delta and the Gaussian means \mu, where w is the set of the individual Gaussian weights w_i, \delta the set of variances \delta_i, and \mu the set of means \mu_i, with i indexing the single Gaussian components. After training on the training data, these three parameters are uniquely determined.
The concrete parameter estimation procedure is as follows:
Step 102-1, updating the k-th Gaussian weight w_k:
The update of w_k is given by formula (3):
w_k = \frac{1}{T} \sum_{t=1}^{T} p(k \mid x_t, \lambda)    (3)
where x_t denotes the t-th frame feature vector of the input training speech x, a known vector computed during feature extraction; \lambda, as in formula (1), is the collective name for all parameters of the Gaussian mixture model, all of which are given initial values at the start of training and are therefore known; T denotes the total number of frames of all input training speech, a known quantity; k denotes the index of the k-th single Gaussian in the mixture; and p(k \mid x_t, \lambda) denotes the posterior probability of the input training speech frame x_t on the k-th Gaussian of the universal background model, computed from x_t and the mixture parameters \lambda.
Step 102-2, updating the k-th Gaussian mean \mu_k:
The update of \mu_k is given by formula (4):
\mu_k = \frac{\sum_{t=1}^{T} p(k \mid x_t, \lambda)\, x_t}{\sum_{t=1}^{T} p(k \mid x_t, \lambda)}    (4)
where the parameters have the same meanings as in formula (3); T, x_t and \lambda are known variables, and p(k \mid x_t, \lambda) is computed from the input frame x_t and the mixture parameters \lambda.
Step 102-3, updating the k-th Gaussian variance \delta_k^2:
The update of \delta_k^2 is given by formula (5):
\delta_k^2 = \frac{\sum_{t=1}^{T} p(k \mid x_t, \lambda)\, x_t^2}{\sum_{t=1}^{T} p(k \mid x_t, \lambda)} - \mu_k^2    (5)
where the parameters have the same meanings as in formulas (3) and (4); T, x_t, \lambda and \mu_k are all known variables, and p(k \mid x_t, \lambda) is computed from the input frame x_t and the mixture parameters \lambda.
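A minimal sketch of one expectation-maximization pass implementing formulas (3), (4) and (5), under the same diagonal-covariance assumption and reusing the hypothetical gmm_frame_log_likelihood helper above; in practice the updates are iterated until the likelihood converges.

import numpy as np

def em_update(X, w, mu, var, eps=1e-8):
    # X: (T, D) frame features; w: (M,); mu, var: (M, D).
    T = X.shape[0]
    post = np.zeros((T, w.shape[0]))
    for t in range(T):
        _, log_wp = gmm_frame_log_likelihood(X[t], w, mu, var)
        p = np.exp(log_wp - np.max(log_wp))
        post[t] = p / p.sum()                      # p(k | x_t, lambda)
    Nk = post.sum(axis=0)                          # sum_t p(k | x_t, lambda)
    w_new = Nk / T                                 # formula (3)
    mu_new = (post.T @ X) / (Nk[:, None] + eps)    # formula (4)
    var_new = (post.T @ X ** 2) / (Nk[:, None] + eps) - mu_new ** 2  # formula (5)
    return w_new, mu_new, np.maximum(var_new, eps)  # floor variances for stability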
Step 103: to obtain the model of each class of specific audio, some speech of that class must first be obtained as model training speech. If data of that class are difficult to acquire, the specific audio data of that class used when training the universal background model can be reused; if new data of that class can be obtained, the new audio data are used as the training data. However much training data there is, each class of specific audio yields one corresponding specific audio model.
In this step, as shown in Fig. 2, a small amount of training data of a certain class of specific audio and Bayesian adaptation are used to derive the model of that class from the universal background model. The concrete adaptation process is as follows:
Step 103-1, first computing, from the feature vectors of the specific audio training data, the posterior probability, first-order statistic and second-order statistic of each speech frame on the universal background model; the concrete computation is given by formulas (6), (7) and (8):
n_i = \sum_{t=1}^{T} \Pr(i \mid x_t)    (6)
E_i(x) = \frac{1}{n_i} \sum_{t=1}^{T} \Pr(i \mid x_t)\, x_t    (7)
E_i(x^2) = \frac{1}{n_i} \sum_{t=1}^{T} \Pr(i \mid x_t)\, x_t^2    (8)
where \Pr(i \mid x_t) denotes the posterior probability of the t-th frame of the input audio x on the i-th Gaussian of the universal background model; x_t denotes the feature of the t-th frame of the input audio x; T denotes the total number of frames of the input audio; and i indexes the i-th single Gaussian in the universal background model.
Because the data used for adaptation differ for each class of specific audio, the posterior probabilities and the first- and second-order statistics computed for training each specific audio model also differ.
Step 103-2, using the posterior probability, first-order statistic and second-order statistic computed in step 103-1 to adaptively adjust the parameters of the universal background model, obtaining the weight \hat{w}_i, mean \hat{\mu}_i and covariance \hat{\delta}_i^2 of the specific audio model. Because the specific audio model is in essence also a Gaussian mixture model, once \hat{w}_i, \hat{\mu}_i and \hat{\delta}_i^2 are obtained, the mixture model of this specific audio is fully characterized.
The concrete adaptation formulas are (9), (10) and (11):
\hat{w}_i = \left[ \alpha_i^w n_i / T + (1 - \alpha_i^w) w_i \right] \gamma    (9)
\hat{\mu}_i = \alpha_i^m E_i(x) + (1 - \alpha_i^m)\, \mu_i    (10)
\hat{\delta}_i^2 = \alpha_i^v E_i(x^2) + (1 - \alpha_i^v)(\sigma_i^2 + \mu_i^2) - \hat{\mu}_i^2    (11)
where \alpha_i^v, \alpha_i^m and \alpha_i^w are the adaptation coefficients of the variance, mean and weight, respectively; n_i, E_i(x) and E_i(x^2) are the posterior probability, first-order statistic and second-order statistic of the training data of this specific audio class computed by formulas (6), (7) and (8); in formula (9), T denotes the total number of frames of the training data of this specific audio class, and \gamma is a normalization factor ensuring \sum_i \hat{w}_i = 1, while w_i denotes the weight of the i-th Gaussian of the universal background model; in formula (10), \mu_i denotes the mean of the i-th Gaussian of the universal background model; in formula (11), \sigma_i^2 denotes the covariance of the i-th Gaussian of the universal background model, \mu_i its mean, and \hat{\mu}_i the mean of the i-th Gaussian of the adapted specific audio model.
After the above computation, the model of this class of specific audio is obtained.
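A minimal sketch of this adaptation step, implementing the statistics (6)-(8) and the updates (9)-(11); the relevance-factor form alpha_i = n_i / (n_i + r) of the adaptation coefficients is an assumption borrowed from common GMM-UBM practice, since the patent does not specify how \alpha_i^w, \alpha_i^m and \alpha_i^v are chosen.

import numpy as np

def map_adapt(X, w, mu, var, r=16.0, eps=1e-8):
    # Adapt a UBM (w, mu, var) to specific-audio data X of shape (T, D).
    T = X.shape[0]
    post = np.zeros((T, w.shape[0]))
    for t in range(T):
        _, log_wp = gmm_frame_log_likelihood(X[t], w, mu, var)
        p = np.exp(log_wp - np.max(log_wp))
        post[t] = p / p.sum()                         # Pr(i | x_t)
    n = post.sum(axis=0)                              # formula (6)
    Ex = (post.T @ X) / (n[:, None] + eps)            # formula (7)
    Ex2 = (post.T @ X ** 2) / (n[:, None] + eps)      # formula (8)
    alpha = n / (n + r)          # assumed relevance-factor adaptation coefficient
    w_hat = alpha * n / T + (1.0 - alpha) * w
    w_hat = w_hat / w_hat.sum()  # gamma normalization of formula (9)
    a = alpha[:, None]           # same coefficient used for mean and variance here
    mu_hat = a * Ex + (1.0 - a) * mu                              # formula (10)
    var_hat = a * Ex2 + (1.0 - a) * (var + mu ** 2) - mu_hat ** 2  # formula (11)
    return w_hat, mu_hat, np.maximum(var_hat, eps)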
As can be seen from step 103-1, because the adaptation data of each specific audio model differ, the computed posterior probabilities and first- and second-order statistics differ as well, so the specific audio models finally obtained after step 103-2 also differ from one another.
2. Test stage
Referring to Fig. 3, the test stage comprises the following steps:
Step 201, extracting features from the input test speech;
The features extracted in this step are of the same type as those extracted in step 101, e.g. Mel-frequency cepstral coefficients;
Step 202, inputting the test speech features extracted in step 201 into the universal background model trained in step 102, and computing the score of the test speech on the universal background model.
As explained above, the universal background model is in essence a Gaussian mixture model, and the score of the test speech on it is the sum of its posterior probabilities over the Gaussians. As a preferred implementation, to speed up scoring, the actual computation does not evaluate the posterior probabilities of all Gaussians but selects the N Gaussians with the highest posterior probabilities, computes the sum of these N probabilities, and records the indices of these N Gaussians, as sketched below.
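A minimal sketch of this fast scoring step, reusing the hypothetical helper above; taking the top N posteriors per frame and averaging the frame scores over the utterance is an assumed reading, since the patent does not spell out the frame-level bookkeeping.

import numpy as np

def ubm_topn_score(X, w, mu, var, N=5):
    # X: (T, D) test features; the UBM is (w, mu, var) as before.
    T = X.shape[0]
    frame_scores = np.zeros(T)
    top_idx = np.zeros((T, N), dtype=int)
    for t in range(T):
        _, log_wp = gmm_frame_log_likelihood(X[t], w, mu, var)
        p = np.exp(log_wp - np.max(log_wp))
        p = p / p.sum()                    # posterior of each Gaussian
        idx = np.argsort(p)[-N:]           # indices of the N largest posteriors
        top_idx[t] = idx
        frame_scores[t] = p[idx].sum()     # sum of the N probabilities
    return frame_scores.mean(), top_idx    # utterance-level score + indices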
Step 203, inputting the test speech features extracted in step 201 into the mixture models of the respective specific audio classes obtained in step 103, and computing the score of the test speech on each specific audio mixture model; if there are M specific audio models, M scores are obtained in total.
The concrete method for computing the score of the test speech on each specific audio mixture model is still to sum the posterior probabilities of the test speech over the Gaussians of the specific audio model. As a preferred implementation, to improve computation speed, the N Gaussian indices of the universal background model recorded in step 202 are reused: the sum of the posterior probabilities of these N Gaussians in the specific audio mixture model is computed, and this value is taken as the score of the test speech on that model.
Step 204, computing the difference between the score of the test speech on the universal background model obtained in step 202 and its score on each specific audio mixture model obtained in step 203, and comparing each difference with a threshold to decide which specific audio class the test audio belongs to; if several model scores fall within the threshold range, the decision takes the maximum: among those scores within the threshold range, the specific audio class represented by the highest-scoring model is selected as the final decision for the test speech; a sketch of this decision follows.
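A minimal sketch of steps 203 and 204 together, under the same assumptions as the earlier helpers; the variable names, the sign convention of the threshold test, and the utterance-level averaging are all assumptions rather than details fixed by the patent.

import numpy as np

def detect(X, ubm, class_models, N=5, threshold=0.0):
    # ubm and each entry of class_models are (w, mu, var) tuples;
    # class_models maps class name -> adapted specific audio model.
    ubm_score, top_idx = ubm_topn_score(X, *ubm, N=N)
    diffs = {}
    for name, (w, mu, var) in class_models.items():
        s = 0.0
        for t in range(X.shape[0]):
            _, log_wp = gmm_frame_log_likelihood(X[t], w, mu, var)
            p = np.exp(log_wp - np.max(log_wp))
            p = p / p.sum()
            s += p[top_idx[t]].sum()       # same N Gaussians recorded in step 202
        diffs[name] = s / X.shape[0] - ubm_score   # score difference of step 204
    passing = {k: v for k, v in diffs.items() if v > threshold}
    # If several models pass the threshold, take the maximum; None if none pass.
    return max(passing, key=passing.get) if passing else None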
Finally, it should be noted that the above embodiments merely illustrate, and do not limit, the technical scheme of the present invention. Although the invention has been described in detail with reference to embodiments, those of ordinary skill in the art should understand that modifications or equivalent replacements of the technical scheme of the present invention that do not depart from its spirit and scope shall all be encompassed within the claims of the present invention.

Claims (8)

1. A short-time specific audio detection model generation method, comprising:
Step 101, extracting features from training speech data; wherein the training speech data comprise non-specific audio data and specific audio data;
Step 102, training a universal background model with the features of the training speech data obtained in step 101; wherein the universal background model is a Gaussian mixture model whose expression is:
p(x \mid \lambda) = \sum_{i=1}^{M} w_i \, p_i(x);
where w_i denotes the weight of the i-th Gaussian, with values in [0, 1] satisfying the normalization condition \sum_{i=1}^{M} w_i = 1; x denotes a frame feature of a training speech segment; \lambda denotes the set of all parameters of the Gaussian mixture model; M denotes the number of Gaussian components; and p_i(x) denotes the probability density function of the i-th single Gaussian, whose expression is:
p_i(x) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\left\{ -\tfrac{1}{2} (x - \mu_i)' \Sigma_i^{-1} (x - \mu_i) \right\};
where D is the dimension of the frame feature of the training speech segment; \Sigma_i is the covariance matrix of this Gaussian; and \mu_i is its mean vector;
Step 103, using the features of a certain class of specific audio data in the training speech data, adaptively deriving the model of that class from the universal background model obtained in step 102; repeating this operation until the models of all classes of specific audio data in the training speech data are obtained.
2. The short-time specific audio detection model generation method according to claim 1, characterized in that, in step 101, the features extracted from the training speech data are Mel-frequency cepstral coefficients.
3. The short-time specific audio detection model generation method according to claim 1, characterized in that, in step 102, training the universal background model comprises estimating its parameters with the expectation-maximization method; the estimated parameters comprise three classes: the Gaussian weights w, the Gaussian variances \delta and the Gaussian means \mu, where w is the set of the individual Gaussian weights w_i, \delta the set of variances \delta_i, and \mu the set of means \mu_i, with i indexing the single Gaussian components; specifically comprising:
Step 102-1, updating the k-th Gaussian weight w_k:
The update of w_k is given by
w_k = \frac{1}{T} \sum_{t=1}^{T} p(k \mid x_t, \lambda)
where x_t denotes the t-th frame feature vector of the input training speech x, a known vector computed during feature extraction; \lambda is the collective name for all parameters of the Gaussian mixture model, all of which are given initial values at the start of training and are therefore known; T denotes the total number of frames of all input training speech, a known quantity; k denotes the index of the k-th single Gaussian in the mixture; and p(k \mid x_t, \lambda) denotes the posterior probability of the input training speech frame x_t on the k-th Gaussian of the universal background model, computed from x_t and the mixture parameters \lambda;
Step 102-2, updating the k-th Gaussian mean \mu_k:
The update of \mu_k is given by
\mu_k = \frac{\sum_{t=1}^{T} p(k \mid x_t, \lambda)\, x_t}{\sum_{t=1}^{T} p(k \mid x_t, \lambda)}
where T, x_t and \lambda are known variables, and p(k \mid x_t, \lambda) is computed from the input frame x_t and the mixture parameters \lambda;
Step 102-3, updating the k-th Gaussian variance \delta_k^2:
The update of \delta_k^2 is given by
\delta_k^2 = \frac{\sum_{t=1}^{T} p(k \mid x_t, \lambda)\, x_t^2}{\sum_{t=1}^{T} p(k \mid x_t, \lambda)} - \mu_k^2
where T, x_t, \lambda and \mu_k are all known variables, and p(k \mid x_t, \lambda) is computed from the input frame x_t and the mixture parameters \lambda.
4. The short-time specific audio detection model generation method according to claim 1, characterized in that, in step 103, adaptively deriving the model of a class of specific audio data from the universal background model obtained in step 102 comprises:
Step 103-1, first computing, from the feature vectors of the specific audio training data, the posterior probability n_i, the first-order statistic E_i(x) and the second-order statistic E_i(x^2) of each speech frame on the universal background model, as follows:
n_i = \sum_{t=1}^{T} \Pr(i \mid x_t)
E_i(x) = \frac{1}{n_i} \sum_{t=1}^{T} \Pr(i \mid x_t)\, x_t
E_i(x^2) = \frac{1}{n_i} \sum_{t=1}^{T} \Pr(i \mid x_t)\, x_t^2
where \Pr(i \mid x_t) denotes the posterior probability of the t-th frame of the input audio x on the i-th Gaussian of the universal background model; x_t denotes the feature of the t-th frame of the input audio x; T denotes the total number of frames of the input audio; and i indexes the i-th single Gaussian in the universal background model;
Step 103-2, using the posterior probability, first-order statistic and second-order statistic computed in step 103-1 to adaptively adjust the parameters of the universal background model, obtaining the weight \hat{w}_i, mean \hat{\mu}_i and covariance \hat{\delta}_i^2 of the specific audio model; the adaptation formulas are:
\hat{w}_i = \left[ \alpha_i^w n_i / T + (1 - \alpha_i^w) w_i \right] \gamma
\hat{\mu}_i = \alpha_i^m E_i(x) + (1 - \alpha_i^m)\, \mu_i
\hat{\delta}_i^2 = \alpha_i^v E_i(x^2) + (1 - \alpha_i^v)(\sigma_i^2 + \mu_i^2) - \hat{\mu}_i^2
where \alpha_i^v, \alpha_i^m and \alpha_i^w are the adaptation coefficients of the variance, mean and weight, respectively; T denotes the total number of frames of the training data of this class of specific audio; \gamma is a normalization factor ensuring \sum_i \hat{w}_i = 1; w_i denotes the weight of the i-th Gaussian of the universal background model; \mu_i its mean; \sigma_i^2 its covariance; and \hat{\mu}_i denotes the mean of the i-th Gaussian of the adapted specific audio model.
5. A short-time specific audio detection method, comprising:
Step 201, extracting features from the input test speech;
Step 202, inputting the test speech features extracted in step 201 into the universal background model obtained by the short-time specific audio detection model generation method of any one of claims 1-4, and computing the score of the test speech on the universal background model;
Step 203, inputting the test speech features extracted in step 201 into the Gaussian mixture models of all classes of specific audio obtained by the short-time specific audio detection model generation method of any one of claims 1-4, and computing the score of the test speech on the Gaussian mixture model of each class of specific audio;
Step 204, computing the difference between the score of the test speech on the universal background model obtained in step 202 and its score on the Gaussian mixture model of each class of specific audio obtained in step 203, and comparing each difference with a threshold to decide which class of specific audio the test audio belongs to; if several model scores fall within the threshold range, the decision takes the maximum: the specific audio class represented by the highest-scoring model is selected as the final decision for the test speech.
6. The short-time specific audio detection method according to claim 5, characterized in that, in step 202, computing the score of the test speech on the universal background model comprises: selecting the N Gaussians with the highest posterior probabilities in the universal background model, computing the sum of these N probabilities, and recording the indices of these N Gaussians.
7. The short-time specific audio detection method according to claim 6, characterized in that, in step 203, computing the score of the test speech on the Gaussian mixture model of each class of specific audio comprises: using the N Gaussian indices of the universal background model recorded in step 202, computing the sum of the posterior probabilities of these N Gaussians in the specific audio mixture model, and taking this value as the score of the test speech on the mixture model of that class of specific audio.
8. The short-time specific audio detection method according to claim 5, characterized in that, in step 201, the features extracted from the test speech are Mel-frequency cepstral coefficients.
CN201510236568.8A 2015-05-11 2015-05-11 Short-time specific audio detection model generation and detection method Expired - Fee Related CN104992708B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510236568.8A CN104992708B (en) 2015-05-11 2015-05-11 Short-time specific audio detection model generation and detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510236568.8A CN104992708B (en) 2015-05-11 2015-05-11 Short-time specific audio detection model generation and detection method

Publications (2)

Publication Number Publication Date
CN104992708A 2015-10-21
CN104992708B CN104992708B (en) 2018-07-24

Family

ID=54304511

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510236568.8A Expired - Fee Related CN104992708B (en) Short-time specific audio detection model generation and detection method

Country Status (1)

Country Link
CN (1) CN104992708B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106251861A (en) * 2016-08-05 2016-12-21 重庆大学 A kind of abnormal sound in public places detection method based on scene modeling
CN107517207A (en) * 2017-03-13 2017-12-26 平安科技(深圳)有限公司 Server, auth method and computer-readable recording medium
CN108305616A (en) * 2018-01-16 2018-07-20 国家计算机网络与信息安全管理中心 A kind of audio scene recognition method and device based on long feature extraction in short-term
CN110135492A (en) * 2019-05-13 2019-08-16 山东大学 Equipment fault diagnosis and method for detecting abnormality and system based on more Gauss models
CN113888777A (en) * 2021-09-08 2022-01-04 南京金盾公共安全技术研究院有限公司 Voiceprint unlocking method and device based on cloud machine learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102509546A (en) * 2011-11-11 2012-06-20 北京声迅电子股份有限公司 Noise reduction and abnormal sound detection method applied to rail transit
CN102623009A (en) * 2012-03-02 2012-08-01 安徽科大讯飞信息技术股份有限公司 Abnormal emotion automatic detection and extraction method and system on basis of short-time analysis
CN103198605A (en) * 2013-03-11 2013-07-10 成都百威讯科技有限责任公司 Indoor emergent abnormal event alarm system
CN103226951A (en) * 2013-04-19 2013-07-31 清华大学 Speaker verification system creation method based on model sequence adaptive technique
CN103366738A (en) * 2012-04-01 2013-10-23 佳能株式会社 Methods and devices for generating sound classifier and detecting abnormal sound, and monitoring system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102509546A (en) * 2011-11-11 2012-06-20 北京声迅电子股份有限公司 Noise reduction and abnormal sound detection method applied to rail transit
CN102623009A (en) * 2012-03-02 2012-08-01 安徽科大讯飞信息技术股份有限公司 Abnormal emotion automatic detection and extraction method and system on basis of short-time analysis
CN103366738A (en) * 2012-04-01 2013-10-23 佳能株式会社 Methods and devices for generating sound classifier and detecting abnormal sound, and monitoring system
CN103198605A (en) * 2013-03-11 2013-07-10 成都百威讯科技有限责任公司 Indoor emergent abnormal event alarm system
CN103226951A (en) * 2013-04-19 2013-07-31 清华大学 Speaker verification system creation method based on model sequence adaptive technique

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Luo Senlin; Wang Kun; Xie Erman; Pan Limin; Li Jinyu: "High-precision recognition method for specific audio events based on fusion of GMM and SVM", Transactions of Beijing Institute of Technology *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106251861A (en) * 2016-08-05 2016-12-21 重庆大学 A kind of abnormal sound in public places detection method based on scene modeling
CN106251861B (en) * 2016-08-05 2019-04-23 重庆大学 A kind of abnormal sound in public places detection method based on scene modeling
CN107517207A (en) * 2017-03-13 2017-12-26 平安科技(深圳)有限公司 Server, auth method and computer-readable recording medium
WO2018166187A1 (en) * 2017-03-13 2018-09-20 平安科技(深圳)有限公司 Server, identity verification method and system, and a computer-readable storage medium
CN108305616A (en) * 2018-01-16 2018-07-20 国家计算机网络与信息安全管理中心 A kind of audio scene recognition method and device based on long feature extraction in short-term
CN110135492A (en) * 2019-05-13 2019-08-16 山东大学 Equipment fault diagnosis and method for detecting abnormality and system based on more Gauss models
CN113888777A (en) * 2021-09-08 2022-01-04 南京金盾公共安全技术研究院有限公司 Voiceprint unlocking method and device based on cloud machine learning
CN113888777B (en) * 2021-09-08 2023-08-18 南京金盾公共安全技术研究院有限公司 Voiceprint unlocking method and device based on cloud machine learning

Also Published As

Publication number Publication date
CN104992708B (en) 2018-07-24

Similar Documents

Publication Publication Date Title
CN104992708A (en) Short-time specific audio detection model generating method and short-time specific audio detection method
Xu et al. Deep sparse rectifier neural networks for speech denoising
Chai et al. A cross-entropy-guided measure (CEGM) for assessing speech recognition performance and optimizing DNN-based speech enhancement
De Leon et al. Detection of synthetic speech for the problem of imposture
CN110308485B (en) Microseismic signal classification method and device based on deep learning and storage medium
CN110349597B (en) Voice detection method and device
CN104900235A (en) Voiceprint recognition method based on pitch period mixed characteristic parameters
CN104835498A (en) Voiceprint identification method based on multi-type combination characteristic parameters
Rao et al. Target speaker extraction for overlapped multi-talker speaker verification
CN103456302B (en) A kind of emotional speaker recognition method based on the synthesis of emotion GMM Model Weight
CN104485108A (en) Noise and speaker combined compensation method based on multi-speaker model
Ghalehjegh et al. Deep bottleneck features for i-vector based text-independent speaker verification
CN108320732A (en) The method and apparatus for generating target speaker's speech recognition computation model
Chazan et al. A phoneme-based pre-training approach for deep neural network with application to speech enhancement
Allen et al. Language identification using warping and the shifted delta cepstrum
CN106297769A (en) A kind of distinctive feature extracting method being applied to languages identification
CN106251861A (en) A kind of abnormal sound in public places detection method based on scene modeling
Yamamoto et al. Denoising autoencoder-based speaker feature restoration for utterances of short duration.
Pohjalainen et al. Automatic detection of anger in telephone speech with robust autoregressive modulation filtering
Hong et al. Modified-prior PLDA and score calibration for duration mismatch compensation in speaker recognition system.
Dong et al. Long-term SNR estimation using noise residuals and a two-stage deep-learning framework
Wang et al. F0 estimation in noisy speech based on long-term harmonic feature analysis combined with neural network classification
Soni et al. Effectiveness of ideal ratio mask for non-intrusive quality assessment of noise suppressed speech
Garg et al. Deep convolutional neural network-based speech signal enhancement using extensive speech features
Mansour et al. A comparative study in emotional speaker recognition in noisy environment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20180724