CN104992708B - Specific audio detection model generation in short-term and detection method - Google Patents
Specific audio detection model generation in short-term and detection method
- Publication number: CN104992708B (application CN201510236568.8A)
- Authority
- CN
- China
- Prior art keywords
- model
- gauss
- specific audio
- training
- universal background
- Prior art date
- Legal status: Expired - Fee Related
Abstract
The present invention relates to a short-term specific audio detection model generation method, including: performing feature extraction on training speech data, wherein the training speech data include non-specific audio data and specific audio data; training a universal background model with the features of the training speech data; for a certain class of specific audio data in the training speech data, adaptively deriving the model of that class from the universal background model; and repeating this operation until the models of all classes of specific audio data in the training speech data are obtained. The present invention also provides a short-term specific audio detection method that detects specific audio by scoring against the models. The method not only solves the problem of insufficient training data for specific audio models, but also suppresses background noise in the input data to a certain degree.
Description
Technical field
The present invention relates to methods for short-term specific audio detection, and more particularly to short-term specific audio detection using Gaussian mixture models.
Background technology
Short-term specific audio plays an important role in many fields, especially security. In certain situations we need to detect a particular class of short-term specific audio so that urgent events can be handled in time. For example, in public places we need to monitor public safety and detect accidents, such as a sudden scream, a sudden explosion, or a gunshot; these short-term specific audio events must be detected promptly so that the accidents can be dealt with in time. In addition, in some important places, short-term specific audio detection can also be used for abnormal sound detection and thus serve as an early warning.
Current short-term specific audio detection methods face several problems. First, because short-term specific audio occurs quickly and the events are very brief, making good use of the information in the short audio segment is critical. Second, short-term specific audio occurs infrequently, so the problem of insufficient training data must be faced. Third, usage scenarios often contain complex background noise, so suppressing background noise well is also an important problem for short-term specific audio detection.
Invention content
The object of the present invention is to overcome the defects of existing short-term specific audio detection methods, namely insufficient training data and the inability to suppress background noise, by providing a short-term specific audio model generation and detection method based on Gaussian mixture models.
The present invention provides a short-term specific audio detection model generation method, including:
Step 101: performing feature extraction on training speech data, wherein the training speech data include non-specific audio data and specific audio data;
Step 102: training a universal background model with the features of the training speech data obtained in step 101, wherein the universal background model is a Gaussian mixture model with the expression:
p(x|λ) = Σ_{i=1}^{M} w_i p_i(x)
where w_i is the weight of the i-th Gaussian, each weight lies in 0~1, and the weights satisfy the normalization condition Σ_{i=1}^{M} w_i = 1; x denotes a frame feature of a training speech segment; λ denotes the set of all parameters of the Gaussian mixture model; M is the number of Gaussians; and p_i(x) is the probability density function of the i-th single Gaussian model:
p_i(x) = (2π)^{-D/2} |Σ_i|^{-1/2} exp(-(1/2)(x − μ_i)^T Σ_i^{-1} (x − μ_i))
where D is the dimension of the frame feature of the training speech segment, Σ_i is the covariance matrix of the Gaussian function, and μ_i is its mean vector;
Step 103: using the features of a certain class of specific audio data in the training speech data, adaptively deriving the model of that class from the universal background model obtained in step 102; this operation is repeated until the models of all classes of specific audio data in the training speech data are obtained.
In the above technical solution, in step 101, the features extracted from the training speech data are mel-frequency cepstral coefficients.
In the above technical solution, in step 102, training the universal background model includes performing parameter estimation for the universal background model with the expectation-maximization (EM) method. The parameters to be estimated are of three kinds: the Gaussian weights w, the Gaussian variances δ and the Gaussian means μ, where w is the set of the individual Gaussian weights w_i, δ is the set of the individual Gaussian variances δ_i, μ is the set of the individual Gaussian means μ_i, and i is the index of each single Gaussian model. Specifically:
Step 102-1: updating the k-th Gaussian weight w_k:
w_k = (1/T) Σ_{t=1}^{T} p(k|x_t, λ)
where x_t is the t-th frame feature vector of the input training speech x, a known vector computed during feature extraction; λ is the collective name for all parameters of the Gaussian mixture model, which are given initial values during initialization at the start of training and are therefore known; T is the total number of frames of all input training speech, a known computable value; k is the index of the k-th single Gaussian model in the mixture; and p(k|x_t, λ) is the posterior probability of the input training speech frame x_t on the k-th Gaussian of the universal background model, computed from the input frame x_t and the mixture model parameters λ;
Step 102-2: updating the k-th Gaussian mean μ_k:
μ_k = Σ_{t=1}^{T} p(k|x_t, λ) x_t / Σ_{t=1}^{T} p(k|x_t, λ)
where T, x_t and λ are known variables, and p(k|x_t, λ) is computed from the input frame x_t and the mixture model parameters λ;
Step 102-3: updating the k-th Gaussian variance δ_k²:
δ_k² = Σ_{t=1}^{T} p(k|x_t, λ)(x_t − μ_k)² / Σ_{t=1}^{T} p(k|x_t, λ)
where T, x_t, λ and μ_k are known variables, and p(k|x_t, λ) is computed from the input frame x_t and the mixture model parameters λ.
In the above technical solution, in step 103, adaptively deriving the model of one class of specific audio data from the universal background model obtained in step 102 includes:
Step 103-1: first computing, from the feature vectors of the specific audio training data, the posterior probability n_i, first-order statistic E_i(x) and second-order statistic E_i(x²) of each speech frame on the universal background model, as follows:
n_i = Σ_{t=1}^{T} Pr(i|x_t)
E_i(x) = (1/n_i) Σ_{t=1}^{T} Pr(i|x_t) x_t
E_i(x²) = (1/n_i) Σ_{t=1}^{T} Pr(i|x_t) x_t²
where Pr(i|x_t) is the posterior probability of the t-th frame of the input audio x on the i-th Gaussian of the universal background model; x_t is the feature of the t-th frame of the input audio x; T is the total number of frames of the input audio; and i is the index of the i-th single Gaussian in the universal background model;
Step 103-2: using the posterior probabilities, first-order statistics and second-order statistics computed in step 103-1 to adaptively adjust the parameters of the universal background model, obtaining the weights ŵ_i, means μ̂_i and covariances σ̂²_i of the specific audio model. The adaptation formulas are:
ŵ_i = [α_i^w n_i / T + (1 − α_i^w) w_i] γ
μ̂_i = α_i^m E_i(x) + (1 − α_i^m) μ_i
σ̂²_i = α_i^v E_i(x²) + (1 − α_i^v)(σ²_i + μ_i²) − μ̂_i²
where α_i^v, α_i^m and α_i^w are the variance, mean and weight adjustment coefficients respectively; T is the total number of frames of the training data of this class of specific audio; γ is a normalization parameter ensuring Σ_i ŵ_i = 1; w_i is the weight of the i-th Gaussian model in the universal background model; μ_i is the mean of the i-th Gaussian model in the universal background model; σ²_i is the covariance of the i-th Gaussian in the universal background model; and μ̂_i is the mean of the i-th Gaussian of the adaptively obtained specific audio model.
The present invention further provides a short-term specific audio detection method, including:
Step 201: performing feature extraction on the input test speech;
Step 202: inputting the test speech features extracted in step 201 into the universal background model obtained by the short-term specific audio detection model generation method, and computing the score of the test speech on the universal background model;
Step 203: inputting the test speech features extracted in step 201 into the Gaussian mixture models of the various classes of specific audio obtained by the short-term specific audio detection model generation method, and computing the score of the test speech on the Gaussian mixture model of each class of specific audio;
Step 204: taking the difference between the score of the test speech on the universal background model obtained in step 202 and the score of the test speech on the Gaussian mixture model of each class of specific audio obtained in step 203, and comparing each difference with a threshold to decide which class of specific audio the test audio belongs to; if several model scores fall within the threshold range, the decision is made by taking the maximum, i.e. the specific audio characterized by the model with the highest score is selected as the final decision result for the test speech.
In the above technical solution, in step 202, computing the score of the test speech on the universal background model includes: selecting the N Gaussians of the universal background model with the largest posterior probabilities, computing the sum of these N probabilities, and recording the sequence numbers of these N Gaussians.
In the above technical solution, in step 203, computing the score of the test speech on the Gaussian mixture model of each class of specific audio includes: using the N Gaussian sequence numbers of the universal background model recorded in step 202, computing the sum of the posterior probabilities of the corresponding N Gaussians in the mixture model of the specific audio, and taking this value as the score of the test speech on the Gaussian mixture model of each class of specific audio.
In the above technical solution, in step 201, the features extracted from the test speech are mel-frequency cepstral coefficients.
The advantage of the invention is that:
The method of the present invention not only overcomes the problem of insufficient training data for short-term specific audio models, but also suppresses background noise well to a certain extent.
Description of the drawings
Fig. 1 is a block diagram of the basic principle of universal background model training in the short-term specific audio detection model generation method;
Fig. 2 is a block diagram of the basic principle of specific audio model training in the short-term specific audio detection model generation method;
Fig. 3 is a flow chart of the short-term specific audio detection method.
Specific implementation mode
The specific implementation mode of the present invention is described in further detail in conjunction with Fig. 1 and Fig. 2.
The short-term specific audio detection method of the present invention includes two stages: in the first stage, a model is trained using training speech data; in the second stage, the test speech is detected using the trained model.
One, model training stage
Step 101: feature extraction is performed on the training speech data; the extracted features are mel cepstral coefficients (MFCC features), which include the energy value and the first- and second-order differences.
In one embodiment, the extracted mel cepstral coefficients use a frame length of 20 ms and a frame shift of 10 ms, include the energy value and the first- and second-order differences, and have a total feature dimension of 60.
The training speech data should include a large amount of non-specific audio data and a certain amount of specific audio data.
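As a sketch of the front end of step 101, the 20 ms frame / 10 ms shift MFCC extraction can be written with plain NumPy. The choice of 40 mel bands and 20 static coefficients (so that static plus first- and second-order differences give the stated 60 dimensions) is an assumption; the embodiment does not fix these numbers.

```python
import numpy as np

def mfcc(y, sr=16000, n_mfcc=20, n_mels=40):
    """Mel cepstral coefficients with 20 ms frames and a 10 ms shift."""
    n_fft, hop = int(0.020 * sr), int(0.010 * sr)
    n_frames = 1 + (len(y) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = y[idx] * np.hamming(n_fft)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2           # power spectrum
    # triangular mel filterbank
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(power @ fb.T + 1e-10)
    # DCT-II of the log mel energies gives the cepstral coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), 2 * n + 1) / (2 * n_mels))
    return logmel @ dct.T                                      # (frames, n_mfcc)

def delta(f, w=2):
    """Regression-based difference of a (frames, dims) feature matrix."""
    pad = np.pad(f, ((w, w), (0, 0)), mode="edge")
    num = sum(k * (pad[w + k:w + k + len(f)] - pad[w - k:w - k + len(f)])
              for k in range(1, w + 1))
    return num / (2 * sum(k * k for k in range(1, w + 1)))

y = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)         # 1 s test tone
static = mfcc(y)
feats = np.hstack([static, delta(static), delta(delta(static))])
print(feats.shape)   # (99, 60): 99 frames x 60 dims
```

A production system would more likely use a library such as librosa for this step; the hand-rolled version above only illustrates the frame/filterbank/DCT pipeline behind the 60-dimensional feature.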
Step 102: using the features of the training speech data obtained in step 101, i.e. the mel cepstral coefficients, the universal background model (UBM) is trained.
Referring to the training diagram of the universal background model given in Fig. 1, the universal background model is as in formula (1):
p(x|λ) = Σ_{i=1}^{M} w_i p_i(x)   (1)
In formula (1), w_i is the weight of the i-th Gaussian; each weight lies in 0~1 and the weights satisfy the normalization condition Σ_{i=1}^{M} w_i = 1; x denotes a frame feature of a training speech segment; λ denotes the set of all parameters of the Gaussian mixture model; and M is the number of Gaussians in the mixture.
p_i(x) in formula (1) is the probability density function of the i-th single Gaussian model, expressed as formula (2):
p_i(x) = (2π)^{-D/2} |Σ_i|^{-1/2} exp(-(1/2)(x − μ_i)^T Σ_i^{-1} (x − μ_i))   (2)
where p_i(x) is characterized by the following parameters: D is the dimension of the frame feature of the training speech segment, determined by the feature dimension in the feature extraction process; Σ_i is the covariance matrix of the Gaussian function; and μ_i is its mean vector.
The above is the concrete expression of the universal background model. A Gaussian mixture model fits the probability distribution function of the speech features of a general speaker, i.e. the distribution probability density function, as a linearly weighted sum of several single Gaussians. The universal Gaussian mixture model can therefore characterize the distribution of general speech and its pronunciation characteristics well.
On the basis of the above universal background model, training the universal background model with the features of the training speech data means performing parameter estimation with the expectation-maximization method.
After parameter estimation, the universal background model is obtained. The model is a Gaussian mixture model whose parameters comprise three kinds: the Gaussian weights w, the Gaussian variances δ and the Gaussian means μ, where w is the set of the individual Gaussian weights w_i, δ is the set of the individual Gaussian variances δ_i, μ is the set of the individual Gaussian means μ_i, and i is the index of each single Gaussian model. After training on the training data, these three parameters are uniquely determined.
The specific parameter estimation process is as follows:
Step 102-1: updating the k-th Gaussian weight w_k, as in formula (3):
w_k = (1/T) Σ_{t=1}^{T} p(k|x_t, λ)   (3)
where x_t is the t-th frame feature vector of the input training speech x, a known vector computed during feature extraction; λ, as in formula (1), is the collective name for all parameters of the Gaussian mixture model, which are given initial values in the initialization at the start of training and are therefore known; T is the total number of frames of all input training speech, a known computable value; k is the index of the k-th single Gaussian model in the mixture; and p(k|x_t, λ) is the posterior probability of the input training speech frame x_t on the k-th Gaussian of the universal background model, computed from the input frame x_t and the mixture model parameters λ.
Step 102-2: updating the k-th Gaussian mean μ_k, as in formula (4):
μ_k = Σ_{t=1}^{T} p(k|x_t, λ) x_t / Σ_{t=1}^{T} p(k|x_t, λ)   (4)
where each parameter has the same meaning as in formula (3): T, x_t and λ are known variables, and p(k|x_t, λ) is computed from the input frame x_t and the mixture model parameters λ.
Step 102-3: updating the k-th Gaussian variance δ_k², as in formula (5):
δ_k² = Σ_{t=1}^{T} p(k|x_t, λ)(x_t − μ_k)² / Σ_{t=1}^{T} p(k|x_t, λ)   (5)
where each parameter has the same meaning as in formulas (3) and (4): T, x_t, λ and μ_k are known variables, and p(k|x_t, λ) is computed from the input frame x_t and the mixture model parameters λ.
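The re-estimation of steps 102-1 to 102-3 can be sketched for a diagonal-covariance mixture as follows. The variance update uses the algebraically equivalent form E[x²] − μ², and the number of Gaussians, dimensions and initialization are illustrative assumptions, not values from the patent.

```python
import numpy as np

def posteriors(X, w, mu, var):
    """p(k | x_t, lambda) for a diagonal-covariance GMM.
    X: (T, D) frames; w: (M,) weights; mu, var: (M, D)."""
    lp = (np.log(w)
          - 0.5 * (np.log(2 * np.pi * var).sum(1)
                   + (((X[:, None, :] - mu) ** 2) / var).sum(2)))
    lp -= lp.max(1, keepdims=True)            # stabilise before exp
    p = np.exp(lp)
    return p / p.sum(1, keepdims=True)        # (T, M)

def em_step(X, w, mu, var):
    """One EM re-estimation pass implementing formulas (3)-(5)."""
    T = X.shape[0]
    post = posteriors(X, w, mu, var)
    nk = post.sum(0)                          # soft frame counts per Gaussian
    w_new = nk / T                            # (3) weight update
    mu_new = (post.T @ X) / nk[:, None]       # (4) mean update
    # (5) variance update, written as E[x^2] - mu^2 (same value as the
    # weighted sum of (x_t - mu_k)^2 because mu_new is the weighted mean)
    var_new = (post.T @ X**2) / nk[:, None] - mu_new**2
    return w_new, mu_new, var_new

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                 # toy training features
w, mu, var = np.full(3, 1 / 3), rng.normal(size=(3, 4)), np.ones((3, 4))
for _ in range(5):                            # a few EM iterations
    w, mu, var = em_step(X, w, mu, var)
print(round(w.sum(), 6))   # 1.0 (weights stay normalised)
```

In practice the iteration runs until the likelihood stops improving; the fixed five passes above are only for the demonstration.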
Step 103: to obtain the model of each class of specific audio, some speech of that class must first be obtained as model training speech. If specific audio data of the class are difficult to acquire, the specific audio data of that class already used when training the universal background model can be reused; if new specific audio data of the class can be obtained, the new audio data are used as training data. However much training data there is, one class of specific audio yields one corresponding specific audio model.
In this step, as shown in Fig. 2, a small amount of training data of one class of specific audio and the Bayesian adaptation algorithm are used to adaptively derive the model of that class from the universal background model. The adaptation process is as follows:
Step 103-1: first, from the feature vectors of the specific audio training data, the posterior probability, first-order statistic and second-order statistic of each speech frame on the universal background model are computed as in formulas (6), (7) and (8):
n_i = Σ_{t=1}^{T} Pr(i|x_t)   (6)
E_i(x) = (1/n_i) Σ_{t=1}^{T} Pr(i|x_t) x_t   (7)
E_i(x²) = (1/n_i) Σ_{t=1}^{T} Pr(i|x_t) x_t²   (8)
where Pr(i|x_t) is the posterior probability of the t-th frame of the input audio x on the i-th Gaussian of the universal background model; x_t is the feature of the t-th frame of the input audio x; T is the total number of frames of the input audio; and i is the index of the i-th single Gaussian in the universal background model.
Because the adaptation data differ for each specific audio model, the posterior probabilities and first- and second-order statistics computed for training each specific audio model also differ.
Step 103-2: using the posterior probabilities, first-order statistics and second-order statistics computed in step 103-1, the parameters of the universal background model are adaptively adjusted to obtain the weights ŵ_i, means μ̂_i and covariances σ̂²_i of the specific audio model. Because the specific audio model is itself essentially a Gaussian mixture model, once ŵ_i, μ̂_i and σ̂²_i are obtained, the Gaussian mixture model characterizing the specific audio is determined.
The adaptation formulas are (9), (10) and (11):
ŵ_i = [α_i^w n_i / T + (1 − α_i^w) w_i] γ   (9)
μ̂_i = α_i^m E_i(x) + (1 − α_i^m) μ_i   (10)
σ̂²_i = α_i^v E_i(x²) + (1 − α_i^v)(σ²_i + μ_i²) − μ̂_i²   (11)
where α_i^v, α_i^m and α_i^w are the variance, mean and weight adjustment coefficients respectively; n_i, E_i(x) and E_i(x²) are the posterior probability, first-order statistic and second-order statistic of the specific audio training data computed by formulas (6), (7) and (8). In formula (9), T is the total number of frames of the training data of this class of specific audio, γ is a normalization parameter ensuring Σ_i ŵ_i = 1, and w_i is the weight of the i-th Gaussian model in the universal background model. In formula (10), μ_i is the mean of the i-th Gaussian model in the universal background model. In formula (11), σ²_i is the covariance of the i-th Gaussian in the universal background model, μ_i is the mean of the i-th Gaussian in the universal background model, and μ̂_i is the mean of the i-th Gaussian of the adaptively obtained specific audio model.
After the above computation, the model of this class of specific audio is obtained.
From step 103-1 it can be seen that since the adaptation data of each specific audio model are different, the computed posterior probabilities and first- and second-order statistics are different, and so the specific audio models finally obtained through step 103-2 are also different.
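The adaptation of steps 103-1 and 103-2 can be sketched as follows. The concrete form of the adjustment coefficients, α_i = n_i/(n_i + r) with a relevance factor r, is an assumption borrowed from standard GMM-UBM practice; the patent only names them as variance, mean and weight adjustment coefficients.

```python
import numpy as np

def posteriors(X, w, mu, var):
    """Pr(i | x_t) on a diagonal-covariance UBM; X: (T, D)."""
    lp = (np.log(w)
          - 0.5 * (np.log(2 * np.pi * var).sum(1)
                   + (((X[:, None, :] - mu) ** 2) / var).sum(2)))
    lp -= lp.max(1, keepdims=True)
    p = np.exp(lp)
    return p / p.sum(1, keepdims=True)

def map_adapt(X, w, mu, var, r=16.0):
    """Bayesian (MAP) adaptation of the UBM to one specific-audio
    class, following formulas (6)-(11)."""
    T = X.shape[0]
    post = posteriors(X, w, mu, var)
    n = post.sum(0) + 1e-10                        # (6) occupancy n_i
    Ex = (post.T @ X) / n[:, None]                 # (7) first-order statistic
    Ex2 = (post.T @ X**2) / n[:, None]             # (8) second-order statistic
    a = (n / (n + r))[:, None]                     # assumed adjustment coeffs
    w_hat = a[:, 0] * n / T + (1 - a[:, 0]) * w    # (9)
    w_hat /= w_hat.sum()                           # gamma keeps sum(w_hat) = 1
    mu_hat = a * Ex + (1 - a) * mu                 # (10)
    var_hat = a * Ex2 + (1 - a) * (var + mu**2) - mu_hat**2  # (11)
    return w_hat, mu_hat, var_hat

rng = np.random.default_rng(1)
ubm_w = np.full(4, 0.25)
ubm_mu, ubm_var = rng.normal(size=(4, 6)), np.ones((4, 6))
X = rng.normal(loc=0.5, size=(200, 6))             # toy "specific audio" frames
w_hat, mu_hat, var_hat = map_adapt(X, ubm_w, ubm_mu, ubm_var)
print(round(w_hat.sum(), 6))   # 1.0
```

Running `map_adapt` once per class, each with its own frames `X`, yields the one-model-per-class setup of step 103.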
Two, test phase
With reference to Fig. 3, the test phase includes the following steps:
Step 201: feature extraction is performed on the input test speech.
The features extracted in this step are of the same type as those extracted in step 101, e.g. mel cepstral coefficients.
Step 202: the test speech features extracted in step 201 are input into the universal background model trained in step 102, and the score of the test speech on the universal background model is computed.
From the earlier explanation, the universal background model is essentially a Gaussian mixture model, and the score of the test speech on the universal background model is the sum of the posterior probabilities of the Gaussians. As a preferred implementation, to speed up score computation, the actual computation does not evaluate the posterior probabilities of all Gaussians; instead, the N Gaussians with the largest posterior probabilities are selected, the sum of these N probabilities is computed, and the sequence numbers of these N Gaussians are recorded.
Step 203: the test speech features extracted in step 201 are input into the Gaussian mixture models of the respective specific audios obtained in step 103, and the score of the test speech on the mixture model of each specific audio is computed; if there are M specific audio models, M scores are finally obtained.
The concrete method of computing the score of the test speech on the mixture model of each specific audio is again to compute the sum of the posterior probabilities of the test speech on the Gaussians of the specific audio model. As a preferred implementation, to increase computation speed, the N Gaussian sequence numbers of the universal background model recorded in step 202 are used to compute the sum of the posterior probabilities of the corresponding N Gaussians in the mixture model of the specific audio, and this value is taken as the score of the test speech on the mixture model of the respective specific audio.
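The preferred top-N scoring of steps 202 and 203 can be sketched as follows; summing in the log domain and the value N = 5 are implementation assumptions made here for numerical safety, not details fixed by the patent.

```python
import numpy as np

def weighted_logdens(X, w, mu, var):
    """Per-frame weighted log densities log(w_i p_i(x_t)) -> (T, M)."""
    return (np.log(w)
            - 0.5 * (np.log(2 * np.pi * var).sum(1)
                     + (((X[:, None, :] - mu) ** 2) / var).sum(2)))

def topn_scores(X, ubm, models, N=5):
    """Fast scoring: per frame, keep the N best UBM Gaussians, record
    their indices, and evaluate only those indices in each
    specific-audio GMM.  ubm and models are (w, mu, var) tuples."""
    lp = weighted_logdens(X, *ubm)                 # (T, M) on the UBM
    top = np.argsort(lp, axis=1)[:, -N:]           # N best indices per frame
    rows = np.arange(X.shape[0])[:, None]
    ubm_score = np.logaddexp.reduce(lp[rows, top], 1).mean()
    cls = [np.logaddexp.reduce(weighted_logdens(X, *m)[rows, top], 1).mean()
           for m in models]                        # same N indices per model
    return ubm_score, np.array(cls)

rng = np.random.default_rng(2)
ubm = (np.full(8, 0.125), rng.normal(size=(8, 6)), np.ones((8, 6)))
model = (ubm[0], ubm[1] + 0.3, ubm[2])             # one toy adapted class model
X = rng.normal(size=(50, 6))
s_ubm, s_cls = topn_scores(X, ubm, [model])
print(s_cls.shape)   # (1,): one score per specific-audio model
```

Because the adapted models share the UBM's Gaussian ordering, reusing the recorded top-N indices reduces the per-model cost from M Gaussians per frame to N.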
Step 204: the difference between the score of the test speech on the universal background model obtained in step 202 and the score of the test speech on the mixture model of each specific audio obtained in step 203 is computed, and each difference is compared with a threshold to decide which specific audio the test audio belongs to. If several model scores fall within the threshold range, the decision is made by taking the maximum, i.e. the model scores within the threshold range are compared and the specific audio characterized by the model with the highest score is taken as the final decision result for the test speech.
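The decision rule of step 204 can be sketched as follows. Returning -1 when no class passes the threshold, and the direction of the comparison (difference greater than the threshold counts as "within range"), are assumptions: the patent does not state the no-detection case explicitly.

```python
import numpy as np

def decide(ubm_score, class_scores, threshold):
    """Step 204: each class score minus the UBM score is compared with a
    threshold; if several classes pass, the largest difference wins.
    Returns the winning class index, or -1 when no class passes."""
    diff = np.asarray(class_scores) - ubm_score    # likelihood-ratio style
    passing = np.flatnonzero(diff > threshold)
    if passing.size == 0:
        return -1                                  # assumed no-detection case
    return int(passing[np.argmax(diff[passing])])  # maximum-score decision

print(decide(-10.0, [-9.5, -8.0, -9.9], threshold=0.3))   # 1
```

With the sample scores above, classes 0 and 1 both pass the threshold (differences 0.5 and 2.0), and the maximum rule selects class 1.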
Finally, it should be noted that the above embodiments are only intended to illustrate, not to limit, the technical solution of the present invention. Although the invention has been described in detail with reference to embodiments, those of ordinary skill in the art will understand that modifications or equivalent replacements of the technical solution of the present invention, without departing from its spirit and scope, shall all be covered by the scope of the claims of the present invention.
Claims (7)
1. A short-term specific audio detection model generation method, including:
step 101: performing feature extraction on training speech data, wherein the training speech data include non-specific audio data and specific audio data;
step 102: training a universal background model with the features of the training speech data obtained in step 101, wherein the universal background model is a Gaussian mixture model with the expression:
p(x|λ) = Σ_{i=1}^{M} w_i p_i(x)
where w_i is the weight of the i-th Gaussian, each weight lies in 0~1, and the weights satisfy the normalization condition Σ_{i=1}^{M} w_i = 1; x denotes a frame feature of a training speech segment; λ denotes the set of all parameters of the Gaussian mixture model; p_i(x) is the probability density function of the i-th single Gaussian model:
p_i(x) = (2π)^{-D/2} |Σ_i|^{-1/2} exp(-(1/2)(x − μ_i)^T Σ_i^{-1} (x − μ_i))
where D is the dimension of the frame feature of the training speech segment, Σ_i is the covariance matrix of the Gaussian function, and μ_i is its mean vector;
step 103: using the features of a certain class of specific audio data in the training speech data, adaptively deriving the model of that class of specific audio data from the universal background model obtained in step 102; repeating this operation until the models of all classes of specific audio data in the training speech data are obtained;
wherein in step 103, adaptively deriving the model of one class of specific audio data from the universal background model obtained in step 102 includes:
step 103-1: first computing, from the feature vectors of the specific audio training data, the posterior probability n_i, first-order statistic E_i(x) and second-order statistic E_i(x²) of each speech frame on the universal background model, as follows:
n_i = Σ_{t=1}^{T} Pr(i|x_t)
E_i(x) = (1/n_i) Σ_{t=1}^{T} Pr(i|x_t) x_t
E_i(x²) = (1/n_i) Σ_{t=1}^{T} Pr(i|x_t) x_t²
where Pr(i|x_t) is the posterior probability of the t-th frame of the input audio x on the i-th Gaussian of the universal background model; x_t is the feature of the t-th frame of the input audio x; T is the total number of frames of the input audio; and i is the index of the i-th single Gaussian in the universal background model;
step 103-2: using the posterior probabilities, first-order statistics and second-order statistics computed in step 103-1 to adaptively adjust the parameters of the universal background model, obtaining the weights ŵ_i, means μ̂_i and covariances σ̂²_i of the specific audio model, with the adaptation formulas:
ŵ_i = [α_i^w n_i / T + (1 − α_i^w) w_i] γ
μ̂_i = α_i^m E_i(x) + (1 − α_i^m) μ_i
σ̂²_i = α_i^v E_i(x²) + (1 − α_i^v)(σ²_i + μ_i²) − μ̂_i²
where α_i^v, α_i^m and α_i^w are the variance, mean and weight adjustment coefficients respectively; T is the total number of frames of the training data of this class of specific audio; γ is a normalization parameter ensuring Σ_i ŵ_i = 1; w_i is the weight of the i-th Gaussian model in the universal background model; σ²_i is the covariance of the i-th Gaussian in the universal background model; and μ̂_i is the mean of the i-th Gaussian of the adaptively obtained specific audio model.
2. The short-term specific audio detection model generation method according to claim 1, characterized in that in step 101, the features extracted from the training speech data are mel-frequency cepstral coefficients.
3. The short-term specific audio detection model generation method according to claim 1, characterized in that in step 102, training the universal background model includes performing parameter estimation for the universal background model with the expectation-maximization method, the parameters to be estimated being of three kinds: the Gaussian weights w, the Gaussian variances δ and the Gaussian means μ, where w is the set of the individual Gaussian weights w_i, δ is the set of the individual Gaussian variances δ_i, μ is the set of the individual Gaussian means μ_i, and i is the index of each single Gaussian model; specifically including:
step 102-1: updating the k-th Gaussian weight w_k:
w_k = (1/T) Σ_{t=1}^{T} p(k|x_t, λ)
where x_t is the t-th frame feature vector of the input training speech x, a known vector computed during feature extraction; λ is the collective name for all parameters of the Gaussian mixture model, which are given initial values in the initialization at the start of training and are therefore known; T is the total number of frames of all input training speech, a known computable value; k is the index of the k-th single Gaussian model in the mixture; and p(k|x_t, λ) is the posterior probability of the input training speech frame x_t on the k-th Gaussian of the universal background model, computed from the input frame x_t and the mixture model parameters λ;
step 102-2: updating the k-th Gaussian mean μ_k:
μ_k = Σ_{t=1}^{T} p(k|x_t, λ) x_t / Σ_{t=1}^{T} p(k|x_t, λ)
where T, x_t and λ are known variables, and p(k|x_t, λ) is computed from the input frame x_t and the mixture model parameters λ;
step 102-3: updating the k-th Gaussian variance δ_k²:
δ_k² = Σ_{t=1}^{T} p(k|x_t, λ)(x_t − μ_k)² / Σ_{t=1}^{T} p(k|x_t, λ)
where T, x_t, λ and μ_k are known variables, and p(k|x_t, λ) is computed from the input frame x_t and the mixture model parameters λ.
4. A short-term specific audio detection method, comprising:
Step 201, performing feature extraction on the input test speech;
Step 202, inputting the test-speech features extracted in step 201 into the universal background model obtained by the short-term specific audio detection model generation method of any one of claims 1-3, and calculating the score of the test speech on the universal background model;
Step 203, inputting the test-speech features extracted in step 201 into the Gaussian mixture models of the various classes of specific audio obtained by the short-term specific audio detection model generation method of any one of claims 1-3, and calculating the score of the test speech on the Gaussian mixture model of each class of specific audio;
Step 204, taking the difference between the score of the test speech on the universal background model obtained in step 202 and its score on the Gaussian mixture model of each class of specific audio obtained in step 203, and comparing each difference with a threshold, so as to judge which class of specific audio the test audio belongs to; if the scores of several models all fall within the threshold range, the maximum-value rule is applied, and the specific audio characterized by the model with the highest score is selected as the final judgment result for the test speech.
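The decision rule of step 204 can be sketched as follows. The function name `classify`, the dict-based inputs, and the threshold semantics (a class is a candidate when its score exceeds the universal-background-model score by more than the threshold) are assumptions for illustration; the patent does not fix the sign convention of the comparison:

```python
def classify(ubm_score, class_scores, threshold):
    """Step-204 decision sketch.

    ubm_score: score of the test speech on the universal background model.
    class_scores: dict mapping class name -> score on that class's GMM.
    Returns the detected class, or None if no difference passes the threshold.
    """
    # score differences against the universal background model
    diffs = {c: s - ubm_score for c, s in class_scores.items()}
    candidates = {c: d for c, d in diffs.items() if d > threshold}
    if not candidates:
        return None
    # several models pass the threshold: take the maximum, as the claim prescribes
    return max(candidates, key=candidates.get)
```

For example, with a UBM score of -10.0 and class scores {"scream": -7.0, "gunshot": -9.5} at threshold 1.0, only "scream" clears the threshold and is returned.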
5. The short-term specific audio detection method according to claim 4, characterized in that in step 202, calculating the score of the test speech on the universal background model comprises: selecting the N Gaussians of the universal background model with the highest posterior probabilities, computing the sum of these N probabilities, and recording the indices of these N Gaussians.
6. The short-term specific audio detection method according to claim 5, characterized in that in step 203, calculating the score of the test speech on the Gaussian mixture model of each class of specific audio comprises: using the N Gaussian indices of the universal background model recorded in step 202, computing the sum of the posterior probabilities of the corresponding N Gaussians in the Gaussian mixture model of the specific audio, and taking this value as the score of the test speech on the Gaussian mixture model of each class of specific audio.
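Claims 5 and 6 together describe a top-N Gaussian scoring shortcut: the N best Gaussians are picked once on the universal background model and the same indices are reused on each class model. A minimal per-frame sketch, with hypothetical inputs (posterior probability vectors of one frame over the K Gaussians of the UBM and of one class model):

```python
import numpy as np

def topn_scores(post_ubm, post_class, n=5):
    """Top-N scoring sketch for claims 5 and 6.

    post_ubm / post_class: (K,) posterior probabilities of one frame on the
    K Gaussians of the universal background model and of one specific-audio
    GMM respectively. `n` is assumed; the patent leaves N unspecified.
    """
    idx = np.argsort(post_ubm)[-n:]      # indices of the N highest-posterior UBM Gaussians
    ubm_score = post_ubm[idx].sum()      # claim 5: sum of these N probabilities
    class_score = post_class[idx].sum()  # claim 6: sum over the *same* N indices
    return ubm_score, class_score, idx
```

Because only N of the K Gaussians are evaluated on each class model, the per-frame cost of step 203 drops from O(K) to O(N) likelihood sums per class.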
7. The short-term specific audio detection method according to claim 4, characterized in that in step 201, the features extracted from the test speech are Mel-frequency cepstral coefficients.
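Claim 7 only names Mel-frequency cepstral coefficients as the features of step 201. The following is a self-contained numpy sketch of a conventional MFCC pipeline (pre-emphasis, windowed framing, power spectrum, mel filterbank, log, DCT-II); every parameter value (sample rate, FFT size, hop, filter and coefficient counts) is an assumption, not taken from the patent:

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    """Minimal MFCC extraction sketch. Returns (n_frames, n_ceps)."""
    # pre-emphasis, then frame with a Hamming window
    emph = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    n_frames = 1 + (len(emph) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emph[idx] * np.hamming(n_fft)
    # power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # triangular mel filterbank between 0 Hz and sr/2
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)
    # DCT-II to decorrelate; keep the first n_ceps cepstral coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2.0 * n_mels)))
    return logmel @ dct.T
```

One second of 16 kHz audio yields 97 frames of 13 coefficients with these settings; the resulting feature matrix is what steps 101/201 would feed to the models.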
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510236568.8A CN104992708B (en) | 2015-05-11 | 2015-05-11 | Specific audio detection model generation in short-term and detection method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104992708A CN104992708A (en) | 2015-10-21 |
CN104992708B true CN104992708B (en) | 2018-07-24 |
Family
ID=54304511
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510236568.8A Expired - Fee Related CN104992708B (en) | 2015-05-11 | 2015-05-11 | Specific audio detection model generation in short-term and detection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104992708B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106251861B (en) * | 2016-08-05 | 2019-04-23 | 重庆大学 | A kind of abnormal sound in public places detection method based on scene modeling |
CN107068154A (en) * | 2017-03-13 | 2017-08-18 | 平安科技(深圳)有限公司 | The method and system of authentication based on Application on Voiceprint Recognition |
CN108305616B (en) * | 2018-01-16 | 2021-03-16 | 国家计算机网络与信息安全管理中心 | Audio scene recognition method and device based on long-time and short-time feature extraction |
CN110135492B (en) * | 2019-05-13 | 2020-12-22 | 山东大学 | Equipment fault diagnosis and abnormality detection method and system based on multiple Gaussian models |
CN113888777B (en) * | 2021-09-08 | 2023-08-18 | 南京金盾公共安全技术研究院有限公司 | Voiceprint unlocking method and device based on cloud machine learning |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102509546A (en) * | 2011-11-11 | 2012-06-20 | 北京声迅电子股份有限公司 | Noise reduction and abnormal sound detection method applied to rail transit |
CN102623009A (en) * | 2012-03-02 | 2012-08-01 | 安徽科大讯飞信息技术股份有限公司 | Abnormal emotion automatic detection and extraction method and system on basis of short-time analysis |
CN103198605A (en) * | 2013-03-11 | 2013-07-10 | 成都百威讯科技有限责任公司 | Indoor emergent abnormal event alarm system |
CN103226951A (en) * | 2013-04-19 | 2013-07-31 | 清华大学 | Speaker verification system creation method based on model sequence adaptive technique |
CN103366738A (en) * | 2012-04-01 | 2013-10-23 | 佳能株式会社 | Methods and devices for generating sound classifier and detecting abnormal sound, and monitoring system |
Non-Patent Citations (1)
Title |
---|
High-precision recognition method for specific audio events fusing GMM and SVM; Luo Senlin; Wang Kun; Xie Erman; Pan Limin; Li Jinyu; Transactions of Beijing Institute of Technology; 20140731; Sections 1-2 *
Also Published As
Publication number | Publication date |
---|---|
CN104992708A (en) | 2015-10-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104992708B (en) | Specific audio detection model generation in short-term and detection method | |
CN105632501B (en) | A kind of automatic accent classification method and device based on depth learning technology | |
CN105938716B (en) | A kind of sample copying voice automatic testing method based on the fitting of more precision | |
Xu et al. | Dynamic noise aware training for speech enhancement based on deep neural networks. | |
CN104732978B (en) | The relevant method for distinguishing speek person of text based on combined depth study | |
CN105023573B (en) | It is detected using speech syllable/vowel/phone boundary of auditory attention clue | |
CN108122552A (en) | Voice mood recognition methods and device | |
CN106683666B (en) | A kind of domain-adaptive method based on deep neural network | |
CN105654944B (en) | It is a kind of merged in short-term with it is long when feature modeling ambient sound recognition methods and device | |
CN106611604A (en) | An automatic voice summation tone detection method based on a deep neural network | |
CN103810996A (en) | Processing method, device and system for voice to be tested | |
CN106023986B (en) | A kind of audio recognition method based on sound effect mode detection | |
Poorjam et al. | Multitask speaker profiling for estimating age, height, weight and smoking habits from spontaneous telephone speech signals | |
CN109408660A (en) | A method of the music based on audio frequency characteristics is classified automatically | |
CN110085216A (en) | A kind of vagitus detection method and device | |
Tsenov et al. | Speech recognition using neural networks | |
CN103578480B (en) | The speech-emotion recognition method based on context correction during negative emotions detects | |
CN110738986B (en) | Long voice labeling device and method | |
Allen et al. | Language identification using warping and the shifted delta cepstrum | |
CN111133508A (en) | Method and device for selecting comparison phonemes | |
Rabiee et al. | Persian accents identification using an adaptive neural network | |
CN106251861A (en) | A kind of abnormal sound in public places detection method based on scene modeling | |
Wiśniewski et al. | Automatic detection of prolonged fricative phonemes with the hidden Markov models approach | |
Galgali et al. | Speaker profiling by extracting paralinguistic parameters using mel frequency cepstral coefficients | |
Khanum et al. | Speech based gender identification using feed forward neural networks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20180724 |
|