CN109036390B - Broadcast keyword recognition method based on ensemble gradient boosting machine - Google Patents

Broadcast keyword recognition method based on ensemble gradient boosting machine

Info

Publication number
CN109036390B
CN109036390B (application CN201810929482.7A)
Authority
CN
China
Prior art keywords
training
keyword
broadcast
data
classifier
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810929482.7A
Other languages
Chinese (zh)
Other versions
CN109036390A (en)
Inventor
雒瑞森
龚晓峰
王琛
费绍敏
余勤
王建
冯谦
杨晓梅
任小梅
曾晓东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN201810929482.7A priority Critical patent/CN109036390B/en
Publication of CN109036390A publication Critical patent/CN109036390A/en
Application granted granted Critical
Publication of CN109036390B publication Critical patent/CN109036390B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Signal Processing (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a broadcast keyword recognition method based on an ensemble gradient boosting machine (GBM). For a single keyword, the method raises the keyword recall rate (one of the most important evaluation indices for keyword recognition, and usually the one most in need of improvement) to above 80% while maintaining an overall accuracy of about 70%. On the F1 score, the standard index for recognition on imbalanced samples, the method improves reliability on the test samples from about 0.04 for a single gradient boosting machine to about 0.31.

Description

Broadcast keyword recognition method based on ensemble gradient boosting machine
Technical Field
The invention relates to the technical field of information acquisition, and in particular to a broadcast keyword recognition method based on an ensemble gradient boosting machine.
Background
Broadcast keyword recognition is mainly applied to broadcast content analysis and has wide applications in information acquisition, efficient data mining, and radio spectrum regulation. Its working principle is to automatically find the segments containing a specific keyword in a broadcast recording and to analyze the broadcast content from those segments. Traditionally, broadcast content analysis has mostly been done manually, which is costly, time-consuming, and error-prone. Automatic broadcast keyword recognition, realized by a reliable algorithm on a computer or an embedded system, reduces cost, improves efficiency, and avoids the errors that manual work may introduce.
The core of broadcast keyword recognition is the algorithm that finds keywords in broadcast segments. Intuitively, one could design a rule-based algorithm that determines keywords from the characteristics of the broadcast passages. However, a broadcast signal is a speech signal carrying a large amount of information with a complex data structure, so simple rule-based methods rarely achieve the expected effect. Besides rule-based methods, and since broadcast is a kind of speech, some conventional designs process it with a general speech recognition system. But broadcast differs from ordinary speech: it contains interference such as special noise and background music, and in radio spectrum management the recognition system must often run offline for confidentiality, so general speech recognition systems struggle to reach an ideal result on broadcast keywords. In addition, broadcast keyword recognition faces heavily imbalanced samples (keywords account for only a small fraction of the whole broadcast), so generic algorithms tend to miss keywords or to misclassify non-keywords as keywords, causing recognition errors.
Disclosure of Invention
Aiming at the above deficiencies in the prior art, the broadcast keyword recognition method based on an ensemble gradient boosting machine solves the problem that broadcast keyword recognition is error-prone.
To achieve the purpose of the invention, the following technical scheme is adopted: a broadcast keyword recognition method based on an ensemble gradient boosting machine, comprising the following steps:
S1, dividing the training broadcast into training broadcast segments of 3-5 s, and performing feature transformation on the segments to obtain the training-data MFCC features;
S2, extracting training samples from the training data according to the MFCC features, and undersampling several groups of random non-keyword samples by random sampling with replacement to obtain several balanced training subsets;
S3, performing Tomek Link denoising on each balanced training subset to obtain denoised balanced training subsets;
S4, training an independent gradient boosting machine model on the denoised balanced training subset of a single keyword via the GBM algorithm to obtain a gradient boosting classifier;
S5, combining the gradient boosting classifiers via the bagging algorithm into an ensemble gradient boosting classifier, and tuning the probability threshold of the ensemble gradient boosting classifier on the training data;
S6, dividing the broadcast under test into 3-5 s segments, and performing feature transformation on the segments to obtain the MFCC (Mel-frequency cepstral coefficient) features of the test data;
S7, feeding the test-data MFCC features into the ensemble gradient boosting classifier for keyword recognition to obtain the recognition result.
Further: the Tomek Link denoising in step S3 is specifically as follows. For every data point x_k in the balanced training subset X other than x_i and x_j, i.e. x_k ∈ X \ {x_i, x_j}, if the distance between x_i and x_j is smaller than both the distance from x_i to x_k and the distance from x_j to x_k, i.e. dist(x_i, x_j) < dist(x_i, x_k) and dist(x_i, x_j) < dist(x_j, x_k), then (x_i, x_j) is a Tomek link; if x_i and x_j in the Tomek link belong to different classes, x_i or x_j is deleted.
Further: the GBM algorithm in step S4 comprises the following specific steps:
S41, let the model F_K(x) be:
F_K(x) = Σ_{k=1}^{K} α_k f_k(x; θ_k)
where f_k(x; θ_k) is the sub-model of step k, α_k is the weight of f_k(x; θ_k), K is the current step number, i.e. the current total number of steps, x is a recording sample, i.e. a training broadcast segment, and θ_k is the parameter set of f_k(x; θ_k);
S42, compute the residual r_{K+1} between the model prediction and the true value at step K+1:
r_{K+1} = -∂L(y, F_K(x)) / ∂F_K(x)
where L(y, F_K(x)) is the loss function, F_K(x) is the model prediction, and y is the true value;
S43, fit the model parameters θ_{K+1} of step K+1:
θ_{K+1} = argmin_θ Σ_i [r_{K+1,i} - f_{K+1}(x_i; θ)]^2
where θ denotes the model parameters and f_{K+1}(x; θ) is the sub-model of step K+1;
S44, compute the weight α_{K+1} of step K+1:
α_{K+1} = argmin_α Σ_i L(y_i, F_K(x_i) + α f_{K+1}(x_i; θ_{K+1}))
where α is the weight coefficient;
S45, iterate the model F_K(x) to obtain the updated model F_{K+1}(x), i.e. the gradient boosting classifier:
F_{K+1}(x) = F_K(x) + α_{K+1} f_{K+1}(x; θ_{K+1}).
further: the specific adjusting method of the integrated gradient boost classifier probability threshold in step S5 is as follows:
S51, compute the probability that a predicted instance belongs to the keyword class:
H(x_i) = Σ_{t=1}^{T} α_t H_t(x_i)
where T is the number of balanced training subsets, H_t(x_i) is the prediction of a single classifier, and α_t is the weight, taken as α_t = 1/T;
S52, judge whether the sample contains a keyword:
label(x_i) = keyword if H(x_i) ≥ δ, otherwise non-keyword
where δ is the adjustable probability threshold;
S53, output the keyword/non-keyword decision by adjusting the probability threshold on the keyword-class probabilities of the instances.
The beneficial effects of the invention are as follows: the method raises the keyword recall rate (one of the most important evaluation indices for keyword recognition, and usually the one most in need of improvement) to above 80% while maintaining an overall accuracy of about 70%. On the F1 score, the standard index for recognition on imbalanced samples, the method improves reliability on the test samples from about 0.04 for a single gradient boosting machine to about 0.31.
Drawings
FIG. 1 is a flow chart of the invention;
FIG. 2 shows the performance of a single reference gradient boosting machine trained on the complete data set;
FIG. 3 shows the performance of a single gradient boosting machine with the undersampling scheme;
FIG. 4 shows the performance of an ensemble of 5 gradient boosting machines;
FIG. 5 shows the performance of an ensemble of 10 gradient boosting machines.
Detailed Description
The following description of embodiments is provided to help those skilled in the art understand the invention, but the invention is not limited to the scope of these embodiments. Various changes apparent to those skilled in the art that do not depart from the spirit and scope of the invention as defined in the appended claims, and everything produced using the inventive concept, are protected.
As shown in fig. 1, a broadcast keyword recognition method based on an ensemble gradient boosting machine comprises the following steps.
S1, divide the training broadcast into training broadcast segments of 3-5 s, and perform feature transformation on the segments to obtain the training-data MFCC features.
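The 3-5 s windowing of step S1 can be sketched as follows. This is an illustrative sketch with hypothetical names, not the patent's implementation; the MFCC transformation itself is omitted (it would typically be computed with a library routine such as librosa.feature.mfcc).

```python
import numpy as np

def split_into_segments(audio, sr, seg_seconds=5):
    """Split a 1-D audio signal into fixed-length segments; a trailing
    partial segment is dropped."""
    seg_len = int(sr * seg_seconds)
    n_segs = len(audio) // seg_len
    return [audio[i * seg_len:(i + 1) * seg_len] for i in range(n_segs)]

sr = 16000                          # assumed sample rate
audio = np.zeros(sr * 12)           # 12 s of (silent) audio as a stand-in
segments = split_into_segments(audio, sr, seg_seconds=5)
print(len(segments))                # 12 s of audio in 5 s windows -> 2 segments
```

Each returned segment would then be passed through the MFCC feature transformation before training.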
S2, extract training samples from the training data according to the MFCC features, and undersample several groups of random non-keyword samples by random sampling with replacement to obtain several balanced training subsets. Because keyword data typically accounts for a small percentage of the data set, its amount is limited; undersampling the keyword data further would leave too few keywords. Conversely, since non-keyword data is plentiful, randomly sampling the non-keywords with replacement keeps the retained non-keyword sample size comparable to the keyword sample size, which reduces the imbalance problem without damaging the manifold of the keyword data distribution. Ideally, if the number of keyword instances is m_k, then to obtain balanced samples the number p of sampled non-keyword instances would be set to m_k. However, because the subsequent Tomek Link denoising removes some non-keyword instances overlapping the keyword class, somewhat more non-keyword instances are sampled at the start, i.e. p is set slightly above m_k.
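The balanced-subset construction of step S2 can be sketched as follows. The subset count and the oversampling margin left for the later Tomek Link cleaning are illustrative assumptions; the patent specifies the exact factor in an equation not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def balanced_subsets(X, y, n_subsets=5, oversample_factor=1.2):
    """Build several balanced subsets: all keyword samples (y == 1) plus
    non-keyword samples (y == 0) drawn at random with replacement.
    oversample_factor > 1 is a hypothetical margin so that Tomek Link
    cleaning can later remove overlapping non-keyword points."""
    kw_idx = np.flatnonzero(y == 1)
    non_idx = np.flatnonzero(y == 0)
    p = int(len(kw_idx) * oversample_factor)    # slightly more than m_k
    subsets = []
    for _ in range(n_subsets):
        drawn = rng.choice(non_idx, size=p, replace=True)
        idx = np.concatenate([kw_idx, drawn])
        subsets.append((X[idx], y[idx]))
    return subsets

# toy data: 10 keyword samples among 100
X = np.arange(100, dtype=float).reshape(100, 1)
y = np.array([1] * 10 + [0] * 90)
subs = balanced_subsets(X, y, n_subsets=3)
print(len(subs), int(subs[0][1].sum()))   # 3 subsets, each keeping all 10 keywords
```

Each subset then goes through Tomek Link cleaning (step S3) before a classifier is trained on it.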
S3, perform Tomek Link denoising on each balanced training subset to obtain denoised balanced training subsets. The Tomek Link denoising is specifically as follows: for every data point x_k in the balanced training subset X other than x_i and x_j, i.e. x_k ∈ X \ {x_i, x_j}, if the distance between x_i and x_j is smaller than both the distance from x_i to x_k and the distance from x_j to x_k, i.e. dist(x_i, x_j) < dist(x_i, x_k) and dist(x_i, x_j) < dist(x_j, x_k), then (x_i, x_j) is a Tomek link; if x_i and x_j in the Tomek link belong to different classes, x_i or x_j is deleted.
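A minimal sketch of Tomek Link cleaning, assuming Euclidean distance and removal of the majority-class (non-keyword) member of each link; the function names are hypothetical, not from the patent.

```python
import numpy as np

def tomek_link_clean(X, y, majority_label=0):
    """Remove the majority-class member of every Tomek link.
    A pair (i, j) is a Tomek link when each point is the other's
    nearest neighbour and their labels differ."""
    # pairwise Euclidean distances, self-distance masked out
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)
    drop = set()
    for i, j in enumerate(nn):
        if nn[j] == i and y[i] != y[j]:          # mutual NN, different classes
            drop.add(i if y[i] == majority_label else j)
    keep = np.array([k for k in range(len(y)) if k not in drop])
    return X[keep], y[keep]

X = np.array([[0.0], [0.1], [5.0], [5.1], [10.0]])
y = np.array([1,     0,     1,     1,     0])
Xc, yc = tomek_link_clean(X, y)
print(len(yc))   # the (0.0, 0.1) pair is a Tomek link; the non-keyword 0.1 is removed
```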
S4, train an independent gradient boosting machine model on the denoised balanced training subset of a single keyword via the GBM algorithm to obtain a gradient boosting classifier. The GBM algorithm comprises the following specific steps:
S41, let the model F_K(x) be:
F_K(x) = Σ_{k=1}^{K} α_k f_k(x; θ_k)
where f_k(x; θ_k) is the sub-model of step k, α_k is the weight of f_k(x; θ_k), K is the current step number, i.e. the current total number of steps, x is a recording sample, i.e. a training broadcast segment, and θ_k is the parameter set of f_k(x; θ_k);
S42, compute the residual r_{K+1} between the model prediction and the true value at step K+1:
r_{K+1} = -∂L(y, F_K(x)) / ∂F_K(x)
where L(y, F_K(x)) is the loss function, F_K(x) is the model prediction, and y is the true value;
S43, fit the model parameters θ_{K+1} of step K+1:
θ_{K+1} = argmin_θ Σ_i [r_{K+1,i} - f_{K+1}(x_i; θ)]^2
where θ denotes the model parameters and f_{K+1}(x; θ) is the sub-model of step K+1;
S44, compute the weight α_{K+1} of step K+1:
α_{K+1} = argmin_α Σ_i L(y_i, F_K(x_i) + α f_{K+1}(x_i; θ_{K+1}))
where α is the weight coefficient;
S45, iterate the model F_K(x) to obtain the updated model F_{K+1}(x), i.e. the gradient boosting classifier:
F_{K+1}(x) = F_K(x) + α_{K+1} f_{K+1}(x; θ_{K+1}).
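The S41-S45 loop can be illustrated with a toy gradient boosting machine using squared loss and one-split decision stumps; under squared loss the negative gradient r_{K+1} is simply the residual y - F_K(x). This is a didactic sketch with a fixed learning rate standing in for α, not the xgboost implementation used in the experiments.

```python
import numpy as np

def fit_stump(x, r):
    """Best single-split stump minimising squared error against residuals r."""
    best = (np.inf, None, r.mean(), r.mean())
    for s in np.unique(x):
        left, right = r[x <= s], r[x > s]
        if len(left) == 0 or len(right) == 0:
            continue
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if err < best[0]:
            best = (err, s, left.mean(), right.mean())
    return best[1:]                      # (split, left_value, right_value)

def stump_predict(stump, x):
    s, lv, rv = stump
    return np.where(x <= s, lv, rv)

def gbm_fit(x, y, n_steps=20, lr=0.5):
    """Each step fits a stump to the negative gradient of the squared loss,
    i.e. the residual y - F_K(x), then updates F (steps S42-S45)."""
    F = np.zeros_like(y, dtype=float)
    stumps = []
    for _ in range(n_steps):
        r = y - F                        # negative gradient of 1/2 (y - F)^2
        stump = fit_stump(x, r)
        F += lr * stump_predict(stump, x)
        stumps.append(stump)
    return stumps

def gbm_predict(stumps, x, lr=0.5):
    return sum(lr * stump_predict(st, x) for st in stumps)

x = np.array([0., 1., 2., 3., 4., 5.])
y = np.array([0., 0., 0., 1., 1., 1.])   # toy 0/1 targets
model = gbm_fit(x, y)
print(np.round(gbm_predict(model, x), 2))   # converges to [0. 0. 0. 1. 1. 1.]
```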
S5, combine the gradient boosting classifiers via the bagging method into an ensemble gradient boosting classifier, and tune its probability threshold on the training data. The specific tuning method is as follows:
S51, compute the probability that a predicted instance belongs to the keyword class:
H(x_i) = Σ_{t=1}^{T} α_t H_t(x_i)
where T is the number of balanced training subsets, H_t(x_i) is the prediction of a single classifier, and α_t is the weight, taken as α_t = 1/T;
S52, judge whether the sample contains a keyword:
label(x_i) = keyword if H(x_i) ≥ δ, otherwise non-keyword
where δ is the adjustable probability threshold;
S53, output the keyword/non-keyword decision by adjusting the probability threshold on the keyword-class probabilities of the instances; δ is generally taken as 0.5.
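The equal-weight ensemble vote of S51 and the threshold rule of S52 can be sketched as follows; the classifier outputs below are hypothetical numbers for illustration.

```python
import numpy as np

def ensemble_keyword_prob(probs_per_clf):
    """S51 with alpha_t = 1/T: equal-weight average of the T classifiers'
    keyword probabilities."""
    return np.mean(probs_per_clf, axis=0)

def predict_keyword(probs_per_clf, delta=0.5):
    """S52: label a segment as keyword (1) when the averaged probability
    reaches the threshold delta."""
    return (ensemble_keyword_prob(probs_per_clf) >= delta).astype(int)

# hypothetical outputs of T = 3 classifiers on 4 segments
probs = np.array([[0.9, 0.2, 0.6, 0.1],
                  [0.8, 0.3, 0.4, 0.2],
                  [0.7, 0.1, 0.5, 0.3]])
print(predict_keyword(probs, delta=0.5))   # [1 0 1 0]
```

Sweeping delta over a grid (the experiments use the open interval (0, 1) in steps of 0.05) and scoring each value on a validation set is how the threshold of S53 is chosen.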
S6, divide the broadcast under test into 3-5 s segments, and perform feature transformation on the segments to obtain the MFCC features of the test data.
S7, feed the test-data MFCC features into the ensemble gradient boosting classifier for keyword recognition to obtain the recognition result.
In one embodiment of the invention, to demonstrate its effectiveness, we assembled a dataset of 133 radio broadcast recordings. A small fraction of these contain the keyword "Beijing time" (Mandarin), and the goal is to identify this keyword in the broadcast segments. All broadcast audio was split into 5-second segments, yielding 6906 recordings in total, of which 197 contain the keyword.
Since the labels in broadcast keyword recognition are highly skewed (the keyword/non-keyword sample sizes are unbalanced), a classifier can reach high accuracy simply by predicting every instance as non-keyword, so plain accuracy cannot adequately represent the quality of an algorithm. In imbalanced-label classification, precision and recall are usually adopted instead. Denoting by TP, FP, TN, FN the counts of true-positive, false-positive, true-negative, and false-negative classifications, the precision and recall of the positive class are:
precision = TP / (TP + FP),    recall = TP / (TP + FN)
the same method can be applied to the negative category. For a specific positive or negative class, we can calculate the F1 score for this algorithm:
Figure BDA0001766194550000081
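The precision, recall, and F1 definitions above can be checked numerically; the TP/FP/FN counts below are arbitrary illustrative values.

```python
def precision_recall_f1(tp, fp, fn):
    """precision = TP/(TP+FP), recall = TP/(TP+FN),
    F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(round(p, 3), round(r, 3), round(f1, 3))   # 0.75 0.6 0.667
```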
in the experiment, we focused on four evaluation indices: overall classification accuracy, recall of non-keywords, recall of keywords, and F1 score for keyword classes. We take the F1 score of the keyword class as the overall evaluation index because our task is to identify keywords. The output of our model is the probability of the example being a few classes (key, label 1), whose delta value we can adjust to obtain the best prediction result. In our experiments, the tested delta values ranged from 0 to 1 (open interval) with a precision length of 0.05. As shown in fig. 2, the variation of classification accuracy, majority (non-keyword) recall, and minority (keyword) recall is demonstrated. 4 different models were tested in the experiment based on a gradient elevator classifier and the parameters were optimally adjusted by validation. The reference model is a single gradient elevator (xgboost implementation) classifier. In the figure, the x-axis represents the delta value and the y-axis represents the precision/recall. As can be seen from fig. 2, the keyword recall rate and the precision rate are the highest in the second graph (training set), and the keyword recall rate is greatly reduced in both the verification set and the test set, indicating that the model has the problem of overfitting.
The second model tested is a single gradient boosting classifier with undersampling. It can be interpreted as a "single-model ensemble": it benefits from undersampling and Tomek Link denoising, but not from ensemble-based classification. As fig. 3 shows, the overfitting problem is alleviated and the classifier no longer predicts most instances as non-keywords. Although the recall of non-keyword data decreases, the overall performance improves.
Finally, ensembles of 5 and 10 gradient boosting classifiers based on the bagging algorithm, following the technique proposed above, were tested on the same dataset; the results are shown in figs. 4 and 5. Two improvements are visible in these figures. First, the combined keyword/non-keyword recall increases greatly. Second, as more classifiers are added to the ensemble, the impact of different δ values on performance becomes significant. δ can be chosen by validation, giving the best output and a prediction confidence for each instance.
Table 1 shows the best F1 score on the minority-class (keyword) test data and the precision/recall at that score. An additional "balanced F1 score" index is included, meaning the F1 score computed assuming equal numbers of keyword and non-keyword instances; it further emphasizes successful identification of keyword-class data and its retrieval recall.
TABLE 1
(Table 1 is rendered as an image in the original document.)

Claims (4)

1. A broadcast keyword recognition method based on an ensemble gradient boosting machine, characterized by comprising the following steps:
S1, dividing the training broadcast into training broadcast segments of 3-5 s, and performing feature transformation on the segments to obtain the training-data MFCC features;
S2, extracting training samples from the training data according to the MFCC features, and undersampling several groups of random non-keyword samples by random sampling with replacement to obtain several balanced training subsets;
S3, performing Tomek Link denoising on each balanced training subset to obtain denoised balanced training subsets;
S4, training an independent gradient boosting machine model on the denoised balanced training subset of a single keyword via the GBM algorithm to obtain a gradient boosting classifier;
S5, combining the gradient boosting classifiers via the bagging algorithm into an ensemble gradient boosting classifier, and tuning the probability threshold of the ensemble gradient boosting classifier on the training data;
S6, dividing the broadcast under test into 3-5 s segments, and performing feature transformation on the segments to obtain the MFCC (Mel-frequency cepstral coefficient) features of the test data;
S7, feeding the test-data MFCC features into the ensemble gradient boosting classifier for keyword recognition to obtain the recognition result.
2. The ensemble gradient boosting machine-based broadcast keyword recognition method of claim 1, wherein the Tomek Link denoising in step S3 is specifically: for every data point x_k in the balanced training subset X other than x_i and x_j, i.e. x_k ∈ X \ {x_i, x_j}, if the distance between x_i and x_j is smaller than both the distance from x_i to x_k and the distance from x_j to x_k, i.e. dist(x_i, x_j) < dist(x_i, x_k) and dist(x_i, x_j) < dist(x_j, x_k), then (x_i, x_j) is a Tomek link; if x_i and x_j in the Tomek link belong to different classes, x_i or x_j is deleted.
3. The ensemble gradient boosting machine-based broadcast keyword recognition method of claim 1, wherein the GBM algorithm in step S4 comprises the following specific steps:
S41, let the model F_K(x) be:
F_K(x) = Σ_{k=1}^{K} α_k f_k(x; θ_k)
where f_k(x; θ_k) is the sub-model of step k, α_k is the weight of f_k(x; θ_k), K is the current step number, i.e. the current total number of steps, x is a recording sample, i.e. a training broadcast segment, and θ_k is the parameter set of f_k(x; θ_k);
S42, compute the residual r_{K+1} between the model prediction and the true value at step K+1:
r_{K+1} = -∂L(y, F_K(x)) / ∂F_K(x)
where L(y, F_K(x)) is the loss function, F_K(x) is the model prediction, and y is the true value;
S43, fit the model parameters θ_{K+1} of step K+1:
θ_{K+1} = argmin_θ Σ_i [r_{K+1,i} - f_{K+1}(x_i; θ)]^2
where θ denotes the model parameters and f_{K+1}(x; θ) is the sub-model of step K+1;
S44, compute the weight α_{K+1} of step K+1:
α_{K+1} = argmin_α Σ_i L(y_i, F_K(x_i) + α f_{K+1}(x_i; θ_{K+1}))
where α is the weight coefficient;
S45, iterate the model F_K(x) to obtain the updated model F_{K+1}(x), i.e. the gradient boosting classifier:
F_{K+1}(x) = F_K(x) + α_{K+1} f_{K+1}(x; θ_{K+1}).
4. the integrated gradient elevator-based broadcast keyword recognition method as claimed in claim 1, wherein the specific adjustment method of the integrated gradient elevator classifier probability threshold in step S5 is as follows:
s51, calculating the probability of the prediction examples in the keyword class, wherein the calculation formula is as follows:
Figure FDA0001766194540000025
in the above formula, T is the number of the balance training subsets, Ht(xi) As a result of prediction of a single classifier, alphatTaking alpha as the weight employedt=1/T;
S52, judging whether the sample contains keywords or not, wherein the judgment formula is as follows:
Figure FDA0001766194540000031
in the above formula, δ is an adjustable probability threshold;
and S53, outputting the probability threshold value of the determined keywords/non-keywords through probability adjustment of the example in the keyword class.
CN201810929482.7A 2018-08-15 2018-08-15 Broadcast keyword recognition method based on ensemble gradient boosting machine Active CN109036390B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810929482.7A CN109036390B (en) 2018-08-15 2018-08-15 Broadcast keyword recognition method based on ensemble gradient boosting machine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810929482.7A CN109036390B (en) 2018-08-15 2018-08-15 Broadcast keyword recognition method based on ensemble gradient boosting machine

Publications (2)

Publication Number Publication Date
CN109036390A CN109036390A (en) 2018-12-18
CN109036390B (en) 2022-07-08

Family

ID=64631548

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810929482.7A Active CN109036390B (en) 2018-08-15 2018-08-15 Broadcast keyword identification method based on integrated gradient elevator

Country Status (1)

Country Link
CN (1) CN109036390B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110299133B (en) * 2019-07-03 2021-05-28 四川大学 Method for judging illegal broadcast based on keyword

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106682642A (en) * 2017-01-06 2017-05-17 竹间智能科技(上海)有限公司 Multi-language-oriented behavior identification method and multi-language-oriented behavior identification system
CN108010527A (en) * 2017-12-19 2018-05-08 深圳市欧瑞博科技有限公司 Audio recognition method, device, computer equipment and storage medium
CN108257593A (en) * 2017-12-29 2018-07-06 深圳和而泰数据资源与云技术有限公司 A kind of audio recognition method, device, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9031829B2 (en) * 2013-02-08 2015-05-12 Machine Zone, Inc. Systems and methods for multi-user multi-lingual communications

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106682642A (en) * 2017-01-06 2017-05-17 竹间智能科技(上海)有限公司 Multi-language-oriented behavior identification method and multi-language-oriented behavior identification system
CN108010527A (en) * 2017-12-19 2018-05-08 深圳市欧瑞博科技有限公司 Audio recognition method, device, computer equipment and storage medium
CN108257593A (en) * 2017-12-29 2018-07-06 深圳和而泰数据资源与云技术有限公司 A kind of audio recognition method, device, electronic equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Data Mining on Imbalanced Data Sets";Qiong Gu等;《2008 International Conference on Advanced Computer Theory and Engineering》;20081231;全文 *
"Acoustic scene classification by ensembling gradient boosting machine and convolutional neural networks";E Fonseca等;《Detection and Classification of Acoustic Scenes and Events 2017》;20171016;全文 *
"集成学习中有关算法的研究";张春霞;《中国博士学位论文全文数据库》;20101231;全文 *

Also Published As

Publication number Publication date
CN109036390A (en) 2018-12-18

Similar Documents

Publication Publication Date Title
CN102799899B (en) Special audio event layered and generalized identification method based on SVM (Support Vector Machine) and GMM (Gaussian Mixture Model)
CN111243602B (en) Voiceprint recognition method based on gender, nationality and emotion information
CN110188047B (en) Double-channel convolutional neural network-based repeated defect report detection method
CN102982804A (en) Method and system of voice frequency classification
CN106376002B (en) Management method and device and spam monitoring system
Muscariello et al. Audio keyword extraction by unsupervised word discovery
JPH08512148A (en) Topic discriminator
CN102915729B (en) Speech keyword spotting system and system and method of creating dictionary for the speech keyword spotting system
CN102915728B (en) Sound segmentation device and method and speaker recognition system
US11481707B2 (en) Risk prediction system and operation method thereof
CN110428845A (en) Composite tone detection method, system, mobile terminal and storage medium
CN106910495A (en) A kind of audio classification system and method for being applied to abnormal sound detection
CN112417132B (en) New meaning identification method for screening negative samples by using guest information
CN103077720A (en) Speaker identification method and system
CN108335699A (en) A kind of method for recognizing sound-groove based on dynamic time warping and voice activity detection
CN115457966B (en) Pig cough sound identification method based on improved DS evidence theory multi-classifier fusion
CN104240719A (en) Feature extraction method and classification method for audios and related devices
CN106650446A (en) Identification method and system of malicious program behavior, based on system call
CN112256849A (en) Model training method, text detection method, device, equipment and storage medium
CN109036390B (en) Broadcast keyword recognition method based on ensemble gradient boosting machine
US9697825B2 (en) Audio recording triage system
CN104239372B (en) A kind of audio data classification method and device
CN109979482B (en) Audio evaluation method and device
CN113792141A (en) Feature selection method based on covariance measurement factor
CN112967712A (en) Synthetic speech detection method based on autoregressive model coefficient

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant