CN114550697A - Voice sample equalization method combining mixed sampling and random forest


Info

Publication number
CN114550697A
Authority
CN
China
Prior art keywords
sample
data set
voice data
rate
samples
Prior art date
Legal status
Granted
Application number
CN202210083571.0A
Other languages
Chinese (zh)
Other versions
CN114550697B (en)
Inventor
张晓俊
周长伟
朱欣程
陶智
赵鹤鸣
Current Assignee
Suzhou University
Original Assignee
Suzhou University
Priority date
Filing date
Publication date
Application filed by Suzhou University filed Critical Suzhou University
Priority claimed from CN202210083571.0A
Publication of CN114550697A
Application granted
Publication of CN114550697B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/243 - Classification techniques relating to the number of classes
    • G06F18/24323 - Tree-organised classifiers


Abstract

The invention relates to a voice sample equalization method combining mixed sampling and random forest. First, feature extraction is performed on an initial voice data set; the extracted voice data feature set is then equalized with SMOTE-ENN mixed sampling to obtain a current balanced voice data set. Next, the current balanced voice data set is input into a dual-factor random forest model, which outputs a classification evaluation index and an out-of-bag error classification rate. Finally, whether the classification evaluation index has converged is judged: if it has converged, the current balanced voice data set is output; otherwise, the mixed sampling rate of the SMOTE-ENN mixed sampling is updated according to the out-of-bag error classification rate and the extracted voice data set is equalized again, until the classification evaluation index converges and the current balanced voice data set is output. By combining SMOTE-ENN mixed sampling with the dual-factor random forest model to balance the data set, the method retains sample data with high information value to the greatest extent.

Description

Voice sample equalization method combining mixed sampling and random forest
Technical Field
The invention relates to the technical field of data processing, and in particular to a voice sample equalization method, apparatus, device and computer-readable storage medium combining mixed sampling and random forest.
Background
In recent years, artificial intelligence technology has made breakthrough progress in speech recognition. However, data imbalance remains a challenging problem in machine learning: when the class distribution is uneven, the recognition ability of a classifier is strongly biased toward the majority classes, and satisfactory classification performance cannot be achieved for the minority classes.
At present, traditional imbalanced-learning techniques for the imbalanced-data classification problem fall into two categories: internal methods and external methods. Internal methods modify existing classification algorithms to reduce their sensitivity to class imbalance. External methods preprocess the training data to balance it. Among external methods, the sampling approaches for balancing an unbalanced data set include SMOTE oversampling and ENN undersampling.
The basic idea of SMOTE oversampling is to analyze the minority class samples and artificially synthesize new samples from them to add to the data set; however, the distribution of nearby majority class samples is not considered when the new samples are generated, and the k-nearest-neighbor selection is blind, so considerable noise is introduced and synthetic samples intrude into the majority class sample space. ENN undersampling obtains the desired class distribution by eliminating majority class samples, but this loses classification information in the data set. A voice sample equalization method combining mixed sampling and random forest therefore needs to be designed.
Disclosure of Invention
Therefore, the technical problem to be solved by the invention is to overcome the defects of the prior art that SMOTE oversampling does not consider the distribution of nearby majority class samples when generating new samples, so that much noise intrudes into the majority class sample space, and that ENN undersampling causes the loss of classification information in the data set.
In order to solve the technical problem, the invention provides a voice sample equalization method combining mixed sampling and random forest, which comprises the following steps:
s101: acquiring an initial voice data set, and performing feature extraction on the initial voice data set to obtain an extracted voice data feature set;
s102: analyzing minority class samples of the voice data feature set by using oversampling SMOTE and generating new target minority class samples from them, analyzing the nearest neighbor samples of the target minority class samples and of the majority class samples in the voice data feature set by using undersampling ENN, and deleting target minority class samples and majority class samples according to their nearest neighbor samples, so as to obtain a current balanced voice data set;
s103: calculating the information gain rate and the Gini coefficient of the current balanced voice data set, and linearly combining them with dual factors to construct a dual-factor random forest model;
s104: inputting the current balanced voice data set into the dual-factor random forest model, and outputting a classification evaluation index and an out-of-bag error classification rate of the current balanced voice data set under a preset dual-factor condition;
s105: judging whether the classification evaluation index converges; if it converges, outputting the current balanced voice data set; if it diverges, updating the oversampling rate of the oversampling SMOTE and the undersampling rate of the undersampling ENN according to the out-of-bag error classification rate, returning to step S102 until the classification evaluation index converges, and outputting the current balanced voice data set, as sketched below.
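To make the overall loop concrete, the following minimal Python sketch mirrors steps S101 to S105, using the open-source imbalanced-learn and scikit-learn packages as stand-ins for the patent's custom SMOTE-ENN sampler and dual-factor random forest; the initial sampling rate, the feedback step size and the binary (for example normal/pathological voice) label assumption are illustrative choices, not part of the claimed method.

```python
import numpy as np
from imblearn.combine import SMOTEENN
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def equalize_speech_features(X, y, max_rounds=10):
    """Iteratively re-balance (X, y); stop once F1-macro stops improving (step S105)."""
    sampling_rate = 0.5                 # illustrative initial mixed-sampling rate (binary labels assumed)
    history, best = [], (X, y)
    for _ in range(max_rounds):
        # step S102: SMOTE-ENN mixed sampling of the extracted feature set
        X_bal, y_bal = SMOTEENN(sampling_strategy=sampling_rate).fit_resample(X, y)

        # steps S103/S104: evaluate the balanced set with a random forest (plain RF as a stand-in)
        X_tr, X_va, y_tr, y_va = train_test_split(X_bal, y_bal, test_size=0.3, stratify=y_bal)
        forest = RandomForestClassifier(n_estimators=100, oob_score=True).fit(X_tr, y_tr)
        f1_macro = f1_score(y_va, forest.predict(X_va), average="macro")
        oob_mis_rate = 1.0 - forest.oob_score_          # stand-in for OOBMmis_rate

        # step S105: stop when F1-macro drops or stays flat, else feed the OOB rate back
        if history and f1_macro <= history[-1]:
            break
        history.append(f1_macro)
        best = (X_bal, y_bal)
        # illustrative feedback rule: raise the sampling rate when the OOB error is still high
        sampling_rate = float(np.clip(sampling_rate + 0.5 * (oob_mis_rate - 0.05), 0.1, 1.0))
    return best
```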
In an embodiment of the present invention, the step of analyzing minority class samples of the voice data feature set by using oversampling SMOTE and generating new target minority class samples from them, analyzing the nearest neighbor samples of the target minority class samples and of the majority class samples in the voice data feature set by using undersampling ENN, and deleting target minority class samples and majority class samples according to their nearest neighbor samples to obtain a current balanced voice data set includes:
S201: analyzing the minority class samples Smin by using the oversampling SMOTE, generating samples from Smin, and storing the generated samples in the minority class sample space Kmin[]; the number of generated samples is Tgen = count(Kmin);
S202: judging whether Tgen is less than the number Mup of samples that the oversampling SMOTE needs to generate; if Tgen < Mup, returning to step S201, otherwise executing step S203; where Mup = (number of minority class samples Smin) × oversampling rate N1;
S203: analyzing, with the undersampling ENN, the nearest neighbor samples of the generated samples Tgen and of the majority class samples Smaj in the voice data feature set; if k or more of the nearest neighbor samples of a generated sample Tgen belong to a class different from that of Tgen, deleting the corresponding Tgen from Kmin[]; if k or more of the nearest neighbor samples of a majority class sample Smaj belong to a class different from that of Smaj, deleting Smaj; the number of samples deleted by the undersampling ENN is Tdel = Tgen + Smaj;
S204: judging whether the number Tdel of samples deleted by the undersampling ENN is less than the number Mdown of samples that the undersampling ENN needs to delete; if Tdel < Mdown, returning to step S203, otherwise outputting the current balanced voice data set; where Mdown = (number of majority class samples Smaj) × undersampling rate N2.
In one embodiment of the present invention, analyzing the minority class samples Smin by using the oversampling SMOTE and generating samples Tgen from Smin comprises the following steps (an implementation sketch follows):
searching, among the minority class samples Smin, the k nearest neighbor samples Smin_i;
assuming that the number of samples to be generated by the oversampling SMOTE is Mup, randomly selecting Mup samples from the neighbors Smin_i, the Mup samples being denoted Smin_1, Smin_2, ..., Smin_j;
combining Smin_i and Smin_j through a random interpolation operation to generate the samples Tgen = Smin_i + rand(0,1)(Smin_j - Smin_i), where rand(0,1) denotes a random number in the interval (0,1), i = 1, 2, ..., k, and j = 1, 2, ..., Mup.
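A minimal NumPy sketch of this interpolation step is given below; the function name, the use of scikit-learn's NearestNeighbors and the random-seed handling are illustrative assumptions rather than part of the claimed method.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_generate(S_min, m_up, k=5, rng=None):
    """Generate m_up synthetic minority samples: Tgen = Smin_i + rand(0,1)*(Smin_j - Smin_i)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    S_min = np.asarray(S_min)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(S_min)            # +1: the sample itself comes back first
    neighbors = nn.kneighbors(S_min, return_distance=False)[:, 1:]  # k nearest neighbours of each Smin_i
    generated = []
    for _ in range(m_up):
        i = rng.integers(len(S_min))                # pick a seed minority sample Smin_i
        j = rng.choice(neighbors[i])                # pick one of its k nearest neighbours Smin_j
        generated.append(S_min[i] + rng.random() * (S_min[j] - S_min[i]))
    return np.asarray(generated)
```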
In an embodiment of the present invention, if the classification evaluation index diverges, updating the oversampling rate of the oversampling SMOTE and the undersampling rate of the undersampling ENN according to the out-of-bag error classification rate, returning to step S102 until the classification evaluation index converges, and outputting the current balanced voice data set includes:
if the classification evaluation index diverges, updating the oversampling rate of the oversampling SMOTE and the undersampling rate of the undersampling ENN according to the out-of-bag error classification rate, initializing the out-of-bag error classification rate, Tgen, Mup, Mdown and Tdel, returning to step S102 until the classification evaluation index converges, and outputting the current balanced voice data set.
In an embodiment of the present invention, calculating the information gain rate and the Gini coefficient of the current balanced voice data set and linearly combining them with dual factors to construct the dual-factor random forest model includes:
calculating the information gain rate and the Gini coefficient of the current balanced voice data set, linearly combining them with dual factors, and adaptively splitting the decision tree nodes of the dual-factor random forest model;
constructing the dual-factor random forest model according to the adaptive splitting of the decision tree nodes;
judging whether the out-of-bag error value of the dual-factor random forest model reaches the preset out-of-bag error value; if so, outputting the dual-factor random forest model under the preset dual-factor condition; otherwise, updating the dual factors used for the adaptive splitting of the decision tree nodes and reconstructing the dual-factor random forest model.
In an embodiment of the present invention, calculating the information gain rate and the Gini coefficient of the current balanced voice data set, linearly combining them with dual factors, and adaptively splitting the decision tree nodes of the dual-factor random forest model includes:
dividing the current balanced voice data set D into subsets D1,...,Dk and calculating the information gain of the current balanced voice data set
Gain(D; D1,...,Dk) = Ent(D) - Σ_{j=1..k} (|Dj|/|D|)Ent(Dj),
where the entropy of the current balanced voice data set D is
Ent(D) = -Σ_i pi·log2(pi), pi being the proportion of samples of the i-th class in D;
normalizing the information gain of the current balanced voice data set by the number of values of the feature to obtain the information gain rate of the current balanced voice data set
Gain_ratio(D; D1,...,Dk) = Gain(D; D1,...,Dk) / IV(D; D1,...,Dk),
IV(D; D1,...,Dk) = -Σ_{j=1..k} (|Dj|/|D|)log2(|Dj|/|D|);
computing the Gini coefficient of the current balanced voice data set
Gini(D; D1,...,Dk) = Σ_{j=1..k} (|Dj|/|D|)Gini(Dj),
where
Gini(Dj) = 1 - Σ_i p_{ij}², p_{ij} being the proportion of class-i samples in the subset Dj;
linearly combining the information gain rate and the Gini coefficient of the current balanced voice data set by the dual factor ψ(D; D1,...,Dk) = α[β1Gini(D; D1,...,Dk) - β2Gain_ratio(D; D1,...,Dk)], and adaptively splitting the decision tree nodes of the dual-factor random forest model; where α is a randomness factor and βi are balance factors of the node splitting indexes (an implementation sketch follows).
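As a concrete illustration, the short Python sketch below evaluates the dual-factor splitting criterion ψ for one candidate partition; the helper names, the assumption of non-negative integer class labels and the non-empty partitions are choices made for the example only, and candidate splits would be compared by selecting the one with the smallest ψ.

```python
import numpy as np

def entropy(y):
    """Ent(D) = -sum_i p_i * log2(p_i) for integer class labels y."""
    p = np.bincount(y) / len(y)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(y):
    """Gini(D) = 1 - sum_i p_i^2 for integer class labels y."""
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def two_factor_criterion(y_parent, partitions, alpha, beta1, beta2):
    """psi(D; D1..Dk) = alpha * [beta1*Gini(D;D1..Dk) - beta2*Gain_ratio(D;D1..Dk)]."""
    weights = np.array([len(p) for p in partitions]) / len(y_parent)   # |Dj| / |D| (non-empty Dj assumed)
    gini_split = sum(w * gini(p) for w, p in zip(weights, partitions))
    gain = entropy(y_parent) - sum(w * entropy(p) for w, p in zip(weights, partitions))
    iv = -np.sum(weights * np.log2(weights))                           # intrinsic value of the split
    gain_ratio = gain / iv if iv > 0 else 0.0
    return alpha * (beta1 * gini_split - beta2 * gain_ratio)

# candidate splits are compared and the one with the smallest psi is chosen for the node
```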
In an embodiment of the present invention, if the classification evaluation index diverges, updating the oversampling rate of the oversampling SMOTE and the undersampling rate of the undersampling ENN according to the out-of-bag error classification rate, returning to step S102 until the classification evaluation index converges, and outputting the current balanced voice data set includes:
if the classification evaluation index diverges, updating the oversampling rate of the oversampling SMOTE and the undersampling rate of the undersampling ENN as a function of the out-of-bag error classification rate OOBMmis_rate, returning to step S102 until the classification evaluation index converges, and outputting the current balanced voice data set;
where OOBMmis_rate is the out-of-bag error classification rate of the dual-factor random forest model, Nmis_maj is the number of misclassified majority class samples, Nmis_min_i is the number of misclassified samples of the i-th minority class, and minclass is the number of minority classes.
The invention further provides a voice sample equalization apparatus combining mixed sampling and random forest, comprising:
an acquisition module, which is used for acquiring an initial voice data set and performing feature extraction on the initial voice data set to obtain an extracted voice data feature set;
an analysis module, which is used for analyzing minority class samples of the voice data feature set by using oversampling SMOTE and generating new target minority class samples from them, analyzing the nearest neighbor samples of the target minority class samples and of the majority class samples in the voice data feature set by using undersampling ENN, and deleting target minority class samples and majority class samples according to their nearest neighbor samples to obtain a current balanced voice data set;
a construction module, which is used for calculating the information gain rate and the Gini coefficient of the current balanced voice data set and linearly combining them with dual factors to construct a dual-factor random forest model;
the input module is used for inputting the current balanced voice data set into the double-factor random forest model and outputting a classification evaluation index and an out-of-bag error classification rate of the current balanced voice data set under a preset double-factor condition;
the judging module is used for judging whether the classification evaluation index is converged or not, and if the classification evaluation index is converged, outputting the current balanced voice data set; and if the classification evaluation index diverges, updating the oversampling rate of the oversampling SMOTE and the undersampling rate of the undersampling ENN according to the out-of-bag error classification rate, returning to execute the step S102 until the classification evaluation index converges, and outputting the current balanced voice data set.
The invention further provides a voice sample equalization device combining mixed sampling and random forest, comprising:
a memory for storing a computer program;
and the processor is used for realizing the steps of the voice sample equalization method combining the mixed sampling and the random forest when the computer program is executed.
The invention provides a computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of a method for speech sample equalization combining mixed sampling and random forest as described above.
Compared with the prior art, the technical scheme of the invention has the following advantages:
the invention relates to a voice sample equalization method for joint mixed sampling and random forests, which comprises the steps of firstly, collecting an initial voice data set, and carrying out feature extraction on the initial voice data set to obtain an extracted voice data feature set; then, analyzing a minority class sample of the voice data feature set by utilizing oversampling SMOTE, generating a new target minority class sample according to the minority class sample, analyzing a nearest neighbor sample of the target minority class sample and a nearest neighbor sample of a plurality of classes of samples in the voice data feature set by utilizing undersampling ENN, deleting the target minority class sample and the majority class sample according to the nearest neighbor sample of the target minority class sample and the nearest neighbor sample of the majority class sample, and obtaining a current balanced voice data set; secondly, calculating the information gain rate and the kini coefficient of the current balanced voice data set, and linearly combining the information gain rate and the kini coefficient of the current balanced voice data set by using double factors so as to construct a double-factor random forest model; inputting the current balanced voice data set into a double-factor random forest model, and outputting a classification evaluation index and an out-of-bag error classification rate of the double-factor random forest model under a preset double-factor condition; finally, judging whether the classification evaluation index is converged, and if the classification evaluation index is converged, outputting the current balanced voice data set; and if the classification evaluation index diverges, updating the oversampling rate of the oversampling SMOTE and the undersampling rate of the undersampled ENN according to the out-of-bag error classification rate, returning to perform equalization processing on the extracted voice data feature set again until the classification evaluation index converges, and outputting the current equalized voice data set. The method increases the consideration of the inherent characteristics of the sample by extracting the characteristics of the voice data set; by applying the under-sampled ENN to the target few samples generated by the over-sampled SMOTE to remove the samples, the problem that the distribution condition of the nearby most samples is not considered when the SMOTE over-sampled generates a new sample is solved, and the generation of noise samples is reduced; meanwhile, self-adaptive double-factor parameters are introduced into the random forest to adjust the bias of the double-factor random forest model, iterative analysis is carried out on the data characteristics of the double-factor random forest input in each turn, and the data characteristics are fed back to the mixed sampling stage according to the classification evaluation indexes, so that the mixed sampling technology can be assisted to obtain more reliable data results, sample data with high information value is reserved to the maximum extent, and the loss of the classification information of the data set is reduced.
Drawings
In order that the present disclosure may be more readily and clearly understood, reference is now made to the following detailed description of the present disclosure taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flow chart of a first embodiment of a method for equalizing a speech sample of a combined mixed sampling and random forest according to the present invention;
FIG. 2 is a flow chart of a second embodiment of a method for equalizing a speech sample of a combined mixed sampling and random forest according to the present invention;
FIG. 3 is a schematic diagram of a method for equalizing a speech sample of a combined mixed sampling and random forest according to the present invention;
FIG. 4 is a schematic diagram of feature extraction for a speech data set in accordance with the present invention;
FIG. 5 is a flow chart of a two-factor random forest of the present invention;
fig. 6 is a block diagram of a voice sample equalization apparatus combining mixed sampling and random forest according to an embodiment of the present invention.
Detailed Description
The present invention is further described below in conjunction with the following figures and specific examples so that those skilled in the art may better understand the present invention and practice it, but the examples are not intended to limit the present invention.
Referring to fig. 1, fig. 1 is a flowchart of a first embodiment of a method for equalizing a speech sample of a combined mixed sampling and random forest according to the present invention; the specific operation steps are as follows:
step S101: acquiring an initial voice data set, and performing feature extraction on the initial voice data set to obtain an extracted voice data feature set;
step S102: analyzing minority class samples of the voice data feature set by using oversampling SMOTE and generating new target minority class samples from them, analyzing the nearest neighbor samples of the target minority class samples and of the majority class samples in the voice data feature set by using undersampling ENN, and deleting target minority class samples and majority class samples according to their nearest neighbor samples to obtain a current balanced voice data set;
step S103: calculating the information gain rate and the Gini coefficient of the current balanced voice data set, and linearly combining them with dual factors so as to construct a dual-factor random forest model;
step S104: inputting the current balanced voice data set into the double-factor random forest model, and outputting a classification evaluation index and an out-of-bag error classification rate of the current balanced voice data set under a preset double-factor condition;
step S105: judging whether the classification evaluation index is converged, and if the classification evaluation index is converged, outputting the current balanced voice data set; and if the classification evaluation index diverges, updating the oversampling rate of the oversampling SMOTE and the undersampling rate of the undersampling ENN according to the out-of-bag error classification rate, returning to execute the step S102 until the classification evaluation index converges, and outputting the current balanced voice data set.
In the method provided by this embodiment, the extracted voice data set is equalized by a hybrid sampling technique, and the undersampling algorithm is applied to the new samples generated by the oversampling algorithm to remove noisy samples. Meanwhile, the data characteristics are analyzed by means of the random forest and fed back to the mixed sampling stage, so that the hybrid sampling technique obtains a more reliable data result. Adaptive dual-factor parameters are introduced into the random forest to adjust the model bias, the data features input to the dual-factor random forest in each round are iteratively analyzed, and the out-of-bag error classification rate obtained after each iteration is fed back to the mixed sampling stage as guidance, so that sample data with high information value is retained to the greatest extent.
Based on the above embodiments, the present embodiment further describes the speech sample equalization method, and with reference to fig. 2 and fig. 3, the specific operation steps are as follows:
step S201: acquiring an initial voice data set, and performing feature extraction on the initial voice data set to obtain an extracted voice data feature set;
As shown in fig. 4, in order to analyze the nonlinear phenomena caused by vortices at the glottis during phonation, the speech signal is first filtered by a Bark wavelet sub-band filter bank; features are then extracted with a discrete cosine transform in the low frequency bands, while the correlation dimension and the maximum Lyapunov exponent are extracted in the high frequency bands, so that the characteristics of the voice are captured in detail in each frequency band. The fluid-solid coupling feature extraction based on the glottal flow field distribution proceeds as follows: the voice spectrum is first divided into 24 frequency bands according to the Bark filter bank; for the low bands, the logarithmic energy is calculated after Fourier transform following the MFCC extraction method and a discrete cosine transform is applied; for the high bands, nonlinear dynamics analysis is performed to extract the correlation dimension and the maximum Lyapunov exponent; the multiple features are then fused.
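The sketch below illustrates the low-band part of this pipeline (Bark sub-band log energies followed by a DCT) with NumPy and SciPy; the Bark conversion formula used, the frame-based interface and all function names are assumptions made for illustration, and the high-band correlation dimension and maximum Lyapunov exponent would require a separate nonlinear-dynamics routine (for example from a package such as nolds), which is not shown.

```python
import numpy as np
from scipy.fft import dct

def hz_to_bark(f):
    # Zwicker & Terhardt approximation of the Bark scale
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def bark_band_log_energies(frame, sr, n_bands=24):
    """Split one speech frame into 24 Bark sub-bands and return the log energy of each band."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    bands = np.minimum(hz_to_bark(freqs).astype(int), n_bands - 1)   # map each FFT bin to a Bark band
    energies = np.array([spectrum[bands == b].sum() + 1e-12 for b in range(n_bands)])
    return np.log(energies)

def low_band_cepstral_features(frame, sr, n_low=12, n_coeffs=8):
    """MFCC-style features for the low Bark bands: log energy followed by a DCT."""
    log_e = bark_band_log_energies(frame, sr)[:n_low]
    return dct(log_e, type=2, norm="ortho")[:n_coeffs]
```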
Analyzing the speech signal further from the vocal cord vibration perspective, the vocal cord mass block model is described by a set of coupled equations of motion in which α = 1, 2 denote the left and right vocal folds respectively; x and υ are the motion displacement and the velocity of each mass block; m, k, kc and r represent the mass of the mass block, the spring elastic coefficient, the coupling elastic coefficient and the damping constant; l and d are the vocal cord length and the thickness of the lower mass block; and the Bernoulli pressure and the impact force generated at collision also enter the equations.
The model mass, elastic coefficient, coupling coefficient, damping constant and subglottal pressure are set as optimizable parameters, expressed as the vector Φ = [m, k, kc, r, Ps], and a suitable Φ is searched for with a variation particle swarm quasi-Newton method so that the glottal fluid-solid coupling model accurately reproduces the glottal waveform. To avoid the gradient method converging to a local minimum in the non-convex search space, the variation particle swarm method is first used to obtain a candidate solution, and the quasi-Newton method then performs local optimization on that solution to find the global optimum.
The selection and crossover process adopts a roulette-wheel selection rule to select M individuals, and the particle swarm algorithm terminates when the best fitness exceeds a preset threshold or a preset number of iterations is reached. The time-domain error between the target voice source Uge and the waveform Ugs simulated with the parameter vector Φ, accumulated over the N sample points of Uge and Ugs, is defined as the objective function F. When F reaches its global minimum, the simulated glottal airflow Ugs of the vocal cord mass block model is consistent with the target glottal airflow Uge, meaning that the vocal cord mass block model accurately reflects the actual vocal cord structure of the target voice source.
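A hedged Python sketch of this fitting stage is given below; since the variation particle swarm quasi-Newton optimizer is not reproduced here, SciPy's differential evolution is used as a stand-in for the global stage and BFGS for the quasi-Newton stage, and the mean-squared time-domain error, the simulate callback and the function name are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import differential_evolution, minimize

def fit_vocal_fold_parameters(u_ge, simulate, bounds):
    """Fit the parameter vector Phi so the simulated glottal flow matches the target u_ge.

    simulate(phi) is a user-supplied function returning the model waveform U_gs for
    parameters phi; bounds is a list of (low, high) pairs for [m, k, kc, r, Ps].
    """
    def objective(phi):
        # time-domain error between target U_ge and simulated U_gs (mean-squared form assumed)
        u_gs = simulate(phi)
        return float(np.mean((np.asarray(u_ge) - np.asarray(u_gs)) ** 2))

    coarse = differential_evolution(objective, bounds, maxiter=50, seed=0)   # global stage (stand-in for PSO)
    refined = minimize(objective, coarse.x, method="BFGS")                   # local quasi-Newton refinement
    return refined.x
```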
The fluid-solid features of the voice signal are thus extracted through sub-band nonlinear analysis and the vocal cord mass block model: the sub-band nonlinear analysis reflects the nonlinear characteristics caused by the airflow vortex in the generation of the voice signal, and the vocal cord mass block model simulates the actual vocal cord structure of the target voice signal. The extracted fluid-solid features of the voice signal are then applied to subsequent voice recognition.
Step S202: analyzing the minority class samples Smin of the voice data feature set by using oversampling SMOTE, generating samples from Smin, and storing the generated samples in the minority class sample space Kmin[]; the number of generated samples is Tgen = count(Kmin);
Step S203: judging whether Tgen is less than the number Mup of samples that the oversampling SMOTE needs to generate; if Tgen < Mup, returning to step S202, otherwise executing step S204; where Mup = (number of minority class samples Smin) × oversampling rate N1;
Step S204: analyzing, with the undersampling ENN, the nearest neighbor samples of the generated samples Tgen and of the majority class samples Smaj in the voice data feature set; if k or more of the nearest neighbor samples of a generated sample Tgen belong to a class different from that of Tgen, deleting the corresponding Tgen from Kmin[]; if k or more of the nearest neighbor samples of a majority class sample Smaj belong to a class different from that of Smaj, deleting Smaj; the number of samples deleted by the undersampling ENN is Tdel = Tgen + Smaj;
Step S205: judging whether the number Tdel of samples deleted by the undersampling ENN is less than the number Mdown of samples that the undersampling ENN needs to delete; if Tdel < Mdown, returning to step S204, otherwise outputting the current balanced voice data set; where Mdown = (number of majority class samples Smaj) × undersampling rate N2.
The SMOTE oversampling algorithm, based on the k-nearest-neighbor idea, searches the k nearest neighbor samples Smin_i within the minority class. Assuming the number of samples to be generated for the data set is Mup, Mup samples are randomly selected from the neighbors Smin_i and denoted Smin_1, Smin_2, ..., Smin_j. The data samples Smin_i and Smin_j are combined through the corresponding random interpolation operation to obtain the synthetic sample Snew:
Snew = Smin_i + rand(0,1)(Smin_j - Smin_i);
where rand(0,1) denotes a random number in the interval (0,1), i = 1, 2, ..., k, j = 1, 2, ..., Mup, and Mup is the number of samples to generate, which is determined by the oversampling rate.
The ENN (Edited Nearest Neighbour) undersampling algorithm also removes majority class and minority class samples based on a k-nearest-neighbor selection strategy. Its basic idea is as follows: given an unbalanced data set D in which Smaj denotes the majority class samples, each sample Smaj_i in Smaj is traversed and its three nearest neighbor samples are found; if two or more of the three nearest neighbors belong to a class different from that of Smaj_i, the sample Smaj_i is deleted. By combining SMOTE and ENN, the extracted voice data set is equalized through SMOTE-ENN mixed sampling, and the undersampling algorithm is applied to the new samples generated by the oversampling algorithm to eliminate noisy samples, so that the noise problem is addressed without losing data set information.
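For illustration, the following short sketch applies the ENN rule with NumPy and scikit-learn's NearestNeighbors; the function name, the default k = 3 and the majority-vote threshold are assumptions chosen to match the description above.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def enn_clean(X, y, k=3):
    """Edited Nearest Neighbour cleaning: drop a sample when most of its k neighbours disagree."""
    X, y = np.asarray(X), np.asarray(y)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    idx = nn.kneighbors(X, return_distance=False)[:, 1:]      # exclude the sample itself
    disagree = (y[idx] != y[:, None]).sum(axis=1)             # neighbours with a different label
    keep = disagree < (k / 2 + 0.5)                           # keep unless a majority of neighbours differ
    return X[keep], y[keep]
```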
Step S206: calculating the information gain rate and the Gini coefficient of the current balanced voice data set, linearly combining them with dual factors, and adaptively splitting the decision tree nodes of the dual-factor random forest model;
step S207: constructing the dual-factor random forest model according to the self-adaptive splitting of the decision tree nodes;
the classification performance of random forests is reduced when non-uniform data sets are processed. The main reasons are two, firstly, in the random forest construction process, the training set is selected by bootstrap self-sampling. Because few samples of the original data set are fewer and the probability of sampling of the few samples is lower, the number of the few samples in the sub-training set is smaller than that of the original data set, and the non-equilibrium of the data set is aggravated. Secondly, because the number of the samples of the minority class in the original data set is low, the decision tree based on the sub-training set lacks the generalization capability and cannot embody the characteristics of the minority class.
A random forest is an ensemble classifier R = {h(x, θk), k = 1, 2, ..., K} composed of a set of decision trees, where {θk} are independent and identically distributed random vectors, K is the number of decision trees in the random forest, and the training set of each classifier is obtained by random sampling from the data set D = <X, Y>. The margin function of the random forest is
mr(X, Y) = avk I(h(X, θk) = Y) - max_{j≠Y} avk I(h(X, θk) = j),
where avk denotes averaging over the K trees and I(·) is the indicator function. The classification performance (strength) of the base classifiers {h(x, θ)} is defined as
s = E_{X,Y} mr(X, Y).
Assuming s ≥ 0, i.e., the base classifier is a weak classifier, the upper bound of the random forest generalization error PE* is
PE* ≤ ρ̄(1 - s²)/s²,
where the subscripts X, Y indicate that the probability P covers the X, Y space and ρ̄ is the average correlation coefficient between the base classifiers. This shows that the generalization error of the random forest is related to the classification performance of the base classifiers and to the correlation coefficient between them. Therefore, a dual-factor decision tree splitting algorithm is proposed to reduce the correlation coefficient between the base classifiers, improve the classification performance of the base classifiers, and thus reduce the generalization error of the random forest.
The node splitting algorithms of decision trees mainly include ID3, C4.5 [23] and CART [24]. The ID3 algorithm selects the information gain as the splitting criterion: the "feature-value" combination with the maximum information gain is chosen for splitting. Its disadvantage is that the information gain criterion favors features with many possible values while ignoring their relevance to classification, so the classification result does not generalize. The C4.5 and CART algorithms use the "information gain ratio" and the "Gini coefficient", respectively, as splitting criteria. The ID3 algorithm, which uses the information gain as the node splitting standard, can only process discrete features, whereas C4.5 and CART, which use the information gain ratio and the Gini coefficient as indexes, can also handle numerical features. The difference between the two is that the information gain ratio computes the entropy difference before and after splitting (entropy being the class probability multiplied by the logarithm of the class probability), which favors smaller distributions with fewer feature values, while the Gini coefficient, obtained by subtracting the sum of the squared class probabilities from one, favors larger data distributions. Both are algorithms based on information theory and their node-splitting rationales are similar; therefore a combination of the two is established, and a random factor and balance factors are introduced to realize adaptive node splitting.
Given the current equalized speech data set D, the entropy of this data set is defined as
Ent(D) = -Σ_i pi·log2(pi),
where pi is the proportion of samples of the i-th class in D. When the current equalized speech data set D is divided into subsets D1,...,Dk, the corresponding reduction in entropy gives the information gain
Gain(D; D1,...,Dk) = Ent(D) - Σ_{j=1..k} (|Dj|/|D|)Ent(Dj).
The information gain rate normalizes the information gain by the number of values of the feature, i.e.
Gain_ratio(D; D1,...,Dk) = Gain(D; D1,...,Dk) / IV(D; D1,...,Dk),
IV(D; D1,...,Dk) = -Σ_{j=1..k} (|Dj|/|D|)log2(|Dj|/|D|).
The Gini coefficient of the current equalized speech data set D is then defined as
Gini(D; D1,...,Dk) = Σ_{j=1..k} (|Dj|/|D|)Gini(Dj),
where
Gini(Dj) = 1 - Σ_i p_{ij}², with p_{ij} the proportion of class-i samples in the subset Dj.
Considering the linear combination of the information gain rate and the Gini coefficient, the two-factor node splitting algorithm is as follows:
ψ(D;D1,...,Dk)=α[β1Gini(D;D1,...,Dk)-β2Gain_ratio(D;D1,...,Dk)]
where α is a randomness factor (0 ≤ α ≤ 1) controlling the randomness of node splitting: when α = 1 the generated decision tree is the same as the deterministic decision tree, and when α = 0 the generated decision tree is a completely random tree. βi (i = 1, 2) are balance factors of the node splitting indexes, with 0 ≤ βi ≤ 1; they may not both be 0 or both be 1 at the same time, and on the boundary only the combinations (1, 0) or (0, 1) are allowed. As shown in FIG. 5, a dual-factor random forest is constructed with this dual-factor node splitting algorithm: when a decision tree node is split, CART seeks the smallest Gini coefficient while the C4.5 algorithm seeks the largest information gain rate; if both indexes are to be optimal, ψ(D; D1,...,Dk) takes its minimum value, which is used as the optimal rule to split the node. After the random forest is generated, the out-of-bag error is estimated; if the out-of-bag error reaches the minimum, the random forest under the optimal factor condition is output, otherwise the dual factors are updated and the random forest is reconstructed.
Step S208: inputting the current balanced voice data set into a double-factor random forest model, and outputting a classification evaluation index and an out-of-bag error classification rate of the double-factor random forest model under a preset double-factor condition;
The out-of-bag error classification rate OOBMmis_rate is computed from the out-of-bag predictions of the dual-factor random forest model, where Nmis_maj is the number of misclassified majority class samples, Nmis_min_i is the number of misclassified samples of the i-th minority class, and minclass is the number of minority classes.
Step S209: judging whether the classification evaluation index converges; if it converges, outputting the current balanced voice data set; if it diverges, updating the oversampling rate of the oversampling SMOTE and the undersampling rate of the undersampling ENN according to the out-of-bag error classification rate, initializing the out-of-bag error classification rate, Tgen, Mup, Mdown and Tdel, returning to step S202 until the classification evaluation index converges, and outputting the current balanced voice data set.
In the invention, oversampling and undersampling can equalize the data distribution as far as possible, but existing sampling algorithms do not pay enough attention to class overlap and noise, and the spatial distribution of the data is distorted after sampling. Therefore, a mixed sampling algorithm combined with the dual-factor random forest is proposed: new samples are synthesized for the minority classes according to the sample distribution rule, and redundant information is removed, without changing the spatial structure of the majority classes, according to the feedback of the dual-factor random forest. The data set is pre-equalized with a mixed sampling algorithm combining SMOTE and ENN, the pre-equalized data set is then evaluated with the dual-factor random forest, and the classification evaluation index and the error classification rate are calculated respectively. The mixed sampling rate is corrected according to the error classification rate; in the iterative mixed sampling process the sampling rate changes dynamically with the out-of-bag error classification rate of the random forest rather than with the imbalance degree of the data set. Convergence is judged with the classification evaluation index F1-macro as the iteration stop criterion: if it converges, that is, when F1-macro decreases twice in succession or remains unchanged, mixed sampling ends, the iteration stops, and the output data set is the optimal balanced data set conforming to the original data distribution; if the classification evaluation index diverges, the mixed sampling rate of the SMOTE-ENN mixed sampling is updated according to the out-of-bag error classification rate, the out-of-bag error classification rate, Tgen, Mup, Mdown and Tdel are initialized, and the extracted voice data set is equalized again until the classification evaluation index converges and the current balanced voice data set is output. The stopping test is sketched below.
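A minimal sketch of the F1-macro stopping test described above, with the function name and the list-based bookkeeping as illustrative assumptions:

```python
def f1_macro_converged(history, current):
    """Iteration-stop test: stop when F1-macro stays unchanged or falls twice in a row."""
    if history and current == history[-1]:
        return True                                    # unchanged between rounds
    if len(history) >= 2 and current < history[-1] < history[-2]:
        return True                                    # decreased in two consecutive rounds
    return False
```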
The specific flow steps of the OOBM-SMOTE-ENN combined double-factor random forest mixed sampling algorithm are as follows:
Input: data set D, majority class samples Smaj, minority class samples Smin, number of nearest neighbor samples k, initial oversampling rate N1, initial undersampling rate N2.
Output: the pre-equalized data set D'.
1. Initialize OOBMmis_rate and the number of samples Mup that oversampling needs to generate (Mup = Smin × N1);
2. Correct N1 and N2 according to the dual-factor random forest feedback, set Tgen = 0, traverse each minority class sample Smin_i, and store the minority class samples generated by the SMOTE algorithm in the space Kmin[];
3. Tgen = Tgen + count(Kmin); if Tgen < Mup, return to step 2, otherwise go to step 4;
4. Initialize the number of samples Mdown that undersampling needs to delete (Mdown = Smaj × N2);
5. Set Tdel = 0 and traverse each majority class sample Smaj_i, comparing Smaj_i with the labels of the samples in Kmin[]; if k or more of the nearest neighbor samples of a generated sample Tgen_i in Kmin[] belong to a class different from that of Tgen_i, delete the corresponding minority class generated sample Tgen_i from Kmin[]; meanwhile, the neighborhood samples are compared according to the ENN definition, and if k or more of the nearest neighbor samples of Smaj_i belong to a class different from that of Smaj_i, the sample Smaj_i is deleted;
6. Tdel = Tdel + (Tgen_i + Smaj_i); if Tdel < Mdown, return to step 5, otherwise output the once-equalized sample set D'.
The current balanced voice data set is divided into a training set and a test set, the random forest recognition model is trained with the feature parameters of the training-set voices, and the trained random forest model is used to predict and classify the feature parameters of the test set. The comparative experimental results of the OOBMSE algorithm and classical sampling algorithms are shown in Table 1 below.
Table 1  Comparative experimental results of the OOBMSE algorithm and classical sampling algorithms

Index                  Raw data   SMOTE    ADASYN   BSM      CNN      OOBMSE
Recognition rate/%     97.05      99.03    99.04    99.03    91.43    100.00
Recall/%               92.31      99.16    98.86    99.16    91.34    100.00
Kappa coefficient/%    89.89      98.03    98.02    98.03    82.82    100.00
F1 score/%             94.94      99.02    99.01    99.02    91.40    100.00
In the table, SMOTE stands for Synthetic Minority Oversampling Technique, an improved scheme based on the random oversampling algorithm: since random oversampling adds minority class samples by simple replication, the basic idea of the SMOTE algorithm is instead to analyze the minority class samples and artificially synthesize new samples from them to add to the data set.
ADASYN is an adaptive synthetic sampling method for imbalanced learning. It is based on the idea of adaptively generating minority class samples according to their distribution: minority class samples that are harder to learn generate more synthetic data than those that are easier to learn. The ADASYN method can reduce the learning bias brought by the original imbalanced data distribution and can adaptively shift the decision boundary toward the samples that are difficult to learn.
Borderline-SMOTE (BSM) is an improved oversampling algorithm based on SMOTE, which synthesizes new samples only from the minority class samples on the class boundary, thereby improving the class distribution of the samples.
Condensed Nearest Neighbour, CNN for short, is an undersampling technique used to find a subset of the sample set (referred to as the minimum consistent set) that incurs no loss in model performance.
As can be seen from the table, the OOBMSE mixed sampling algorithm provided by the invention is superior to the traditional SMOTE, ADASYN, BSM and CNN algorithms. The recognition rate of OOBMSE with the random forest classifier reaches 100%, and the other evaluation indexes also reach their optimal values, so the method outperforms the traditional methods. The equalization algorithm provided by the invention therefore improves the recognition rate and the reliability of the system.
The method provided by this embodiment is widely applicable in voice recognition and even in intelligent medical diagnosis. The mixed sampling algorithm combined with the dual-factor random forest provided by the invention is based on the dual-factor random forest and combines SMOTE and ENN to solve the problem of imbalanced data classification in voice recognition. In view of the shortcomings of conventional oversampling algorithms, during the hybrid sampling process the oversampling rate is changed dynamically according to the out-of-bag error classification rate of the dual-factor random forest rather than the imbalance rate of the data set, and at the same time the noise in the samples generated by oversampling is removed by ENN. The dual-factor random forest and the mixed sampling are combined through the out-of-bag error classification rate, the mixed sampling rate is corrected dynamically, the number of minority class samples is increased, and noise and redundant information in the samples are removed to balance the data.
Referring to fig. 6, fig. 6 is a block diagram of a voice sample equalization apparatus combining mixed sampling and random forest according to an embodiment of the present invention; the specific apparatus may include:
an acquisition module 100, which is used for acquiring an initial voice data set and performing feature extraction on the initial voice data set to obtain an extracted voice data feature set;
an analysis module 200, which is used for analyzing minority class samples of the voice data feature set by using oversampling SMOTE and generating new target minority class samples from them, analyzing the nearest neighbor samples of the target minority class samples and of the majority class samples in the voice data feature set by using undersampling ENN, and deleting target minority class samples and majority class samples according to their nearest neighbor samples to obtain a current balanced voice data set;
a construction module 300, which is used for calculating the information gain rate and the Gini coefficient of the current balanced voice data set and linearly combining them with dual factors to construct a dual-factor random forest model;
an input module 400, configured to input the current balanced voice data set into the two-factor random forest model, and output a classification evaluation index and an out-of-bag error classification rate of the current balanced voice data set under a preset two-factor condition;
a determining module 500, configured to determine whether the classification evaluation indicator converges, and if the classification evaluation indicator converges, output the current balanced voice data set; and if the classification evaluation index diverges, updating the oversampling rate of the oversampling SMOTE and the undersampling rate of the undersampling ENN according to the out-of-bag error classification rate, returning to execute the step S102 until the classification evaluation index converges, and outputting the current balanced voice data set.
The voice sample equalization apparatus combining mixed sampling and random forest of this embodiment is used to implement the foregoing voice sample equalization method combining mixed sampling and random forest, so the specific implementations in the apparatus can be found in the method embodiments above; for example, the modules 100, 200, 300, 400 and 500 are respectively used to implement steps S101, S102, S103, S104 and S105 of the method, and their specific implementations can therefore refer to the descriptions of the corresponding embodiments, which are not repeated here.
The embodiment of the invention also provides voice sample equalization equipment combining mixed sampling and random forest, which comprises: a memory for storing a computer program; and the processor is used for realizing the steps of the voice sample equalization method combining the mixed sampling and the random forest when the computer program is executed.
A specific embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements the steps of the method for equalizing the voice samples of the combined mixed sampling and random forest.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be understood that the above examples are only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. And obvious variations or modifications therefrom are within the scope of the invention.

Claims (10)

1. A voice sample equalization method combining mixed sampling and random forest is characterized by comprising the following steps:
s101: acquiring an initial voice data set, and performing feature extraction on the initial voice data set to obtain an extracted voice data feature set;
s102: analyzing minority class samples of the voice data feature set by using oversampling SMOTE and generating new target minority class samples from them, analyzing the nearest neighbor samples of the target minority class samples and of the majority class samples in the voice data feature set by using undersampling ENN, and deleting target minority class samples and majority class samples according to their nearest neighbor samples, so as to obtain a current balanced voice data set;
s103: calculating the information gain rate and the Gini coefficient of the current balanced voice data set, and linearly combining them with dual factors to construct a dual-factor random forest model;
s104: inputting the current balanced voice data set into the double-factor random forest model, and outputting a classification evaluation index and an out-of-bag error classification rate of the current balanced voice data set under a preset double-factor condition;
s105: judging whether the classification evaluation index is converged, and if the classification evaluation index is converged, outputting the current balanced voice data set; and if the classification evaluation index diverges, updating the oversampling rate of the oversampling SMOTE and the undersampling rate of the undersampling ENN according to the out-of-bag error classification rate, returning to execute the step S102 until the classification evaluation index converges, and outputting the current balanced voice data set.
2. The method of claim 1, wherein analyzing the minority class samples of the voice data feature set using oversampling SMOTE and generating new target minority class samples from the minority class samples, analyzing the nearest neighbor samples of the target minority class samples and of the majority class samples in the voice data feature set using undersampling ENN, and deleting target minority class samples and majority class samples according to their nearest neighbor samples to obtain the current balanced voice data set comprises:
s201: analyzing the minority sample S using the oversampled SMOTEminAnd according to the minority sample SminGenerating a sample TgenThe sample T isgenStore to a small numberClass sample space Kmin[]Performing the following steps; wherein the sample Tgen=count(Kmin);
S202: judging the sample TgenWhether less than the number M of samples that the oversampled SMOTE needs to generateupIf T isgen<MupReturning to execute the step S201, otherwise executing the step S203; wherein M isupSample S of minority classminX oversampling ratio N1
S203: analyzing the sample T with the undersampled ENNgenAnd a plurality of classes of samples S in the speech data feature setmajIf said sample T is a nearest neighbor samplegenThe nearest neighbor samples of (A) are k and k or more and the samples TgenIf the samples are of different types, deleting Kmin[]Of the corresponding sample TgenIf said plurality of samples SmajThe nearest neighbor samples of (A) are k and more than k and the plurality of types of samples SmajIf the samples with different categories are not the same, deleting the majority of samples Smaj(ii) a Wherein the undersampled ENN deleted samples Tdel=Tgen+Smaj
S204: determining the samples T deleted by the undersampled ENNdelWhether less than the number M of samples that the undersampled ENN needs to deletedownIf T isdel<MdownReturning to execute the step S203, otherwise, outputting the current balanced voice data set; wherein M isdownMajority class sample SmajX undersampling rate N2
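A minimal sketch of the deletion test in step S203, assuming the claim's rule that a sample is removed when k or more of its k nearest neighbors carry a different class label; the function name and parameters are illustrative.

```python
# Sketch of the ENN-style deletion test in S203; names and threshold are illustrative.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def enn_keep_mask(X, y, k=3):
    """Return a boolean mask of samples to keep; X is a 2-D feature array and
    y a 1-D integer label array. A sample is marked for deletion when k or more
    of its k nearest neighbours carry a different class label."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1: each point is its own neighbour
    _, idx = nn.kneighbors(X)
    neighbour_labels = y[idx[:, 1:]]                  # drop the self-neighbour column
    disagreements = (neighbour_labels != y[:, None]).sum(axis=1)
    return disagreements < k                          # keep samples with fewer than k disagreements
```

In the claimed flow this mask would be applied to the generated samples in K_min[] and to the majority class samples S_maj, and deletion stops once T_del reaches M_down (step S204).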
3. The method of claim 2, wherein analyzing the minority class samples S_min using the oversampling SMOTE and generating the samples T_gen from the minority class samples S_min comprises:
searching for the k nearest neighbor samples S_min_i among the minority class samples S_min;
assuming that the number of samples to be generated by the oversampling SMOTE is M_up, randomly selecting M_up samples from S_min_i, and denoting the M_up samples as S_min_1, S_min_2, ..., S_min_j;
performing a random interpolation operation on S_min_i and S_min_j to generate the samples T_gen = S_min_i + rand(0,1)·(S_min_j − S_min_i); wherein rand(0,1) denotes a random number in the interval (0,1), i = 1, 2, ..., k, and j = 1, 2, ..., M_up.
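A minimal numpy sketch of the interpolation in claim 3, assuming S_min is a 2-D array of minority-class feature vectors; the function name and parameters are illustrative.

```python
# Minimal sketch of the SMOTE interpolation T_gen = S_min_i + rand(0,1)*(S_min_j - S_min_i).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_generate(S_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by interpolating between each chosen
    minority sample and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(S_min)
    _, idx = nn.kneighbors(S_min)                    # idx[:, 0] is the sample itself
    generated = []
    for _ in range(n_new):
        i = rng.integers(len(S_min))                 # pick a minority sample S_min_i
        j = idx[i, rng.integers(1, k + 1)]           # pick one of its k neighbours S_min_j
        gap = rng.random()                           # rand(0, 1)
        generated.append(S_min[i] + gap * (S_min[j] - S_min[i]))
    return np.vstack(generated)
```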
4. The method of claim 2, wherein if the classification evaluation index diverges, updating the oversampling rate of the oversampling SMOTE and the undersampling rate of the undersampling ENN according to the out-of-bag error classification rate, returning to perform step S102 until the classification evaluation index converges, and outputting the current balanced voice data set comprises:
if the classification evaluation index diverges, updating the oversampling rate of the oversampling SMOTE and the undersampling rate of the undersampling ENN according to the out-of-bag error classification rate, initializing the out-of-bag error classification rate and T_gen, M_up, M_down and T_del, returning to execute step S102 until the classification evaluation index converges, and outputting the current balanced voice data set.
5. The method of claim 1, wherein calculating the information gain rate and the Gini coefficient of the current balanced voice data set, and linearly combining the information gain rate and the Gini coefficient of the current balanced voice data set with double factors to construct the double-factor random forest model comprises:
calculating the information gain rate and the Gini coefficient of the current balanced voice data set, linearly combining them with the double factors, and adaptively splitting the decision tree nodes of the double-factor random forest model;
constructing the double-factor random forest model according to the adaptive splitting of the decision tree nodes;
judging whether the out-of-bag error value of the double-factor random forest model reaches a preset out-of-bag error value; if so, outputting the double-factor random forest model under the preset double-factor condition; otherwise, updating the double factors used for the adaptive splitting of the decision tree nodes and reconstructing the double-factor random forest model.
6. The method of claim 5, wherein calculating the information gain rate and the Gini coefficient of the current balanced voice data set, linearly combining the information gain rate and the Gini coefficient of the current balanced voice data set with the double factors, and adaptively splitting the decision tree nodes of the double-factor random forest model comprises:
dividing the current balanced voice data set D into subsets D1, ..., Dk, and calculating the information gain of the current balanced voice data set
Gain(D; D1, ..., Dk) = Ent(D) − Σ_{i=1}^{k} (|Di|/|D|) · Ent(Di);
wherein the entropy of the current balanced voice data set D is Ent(D) = −Σ_{y∈Y} P(y|D) · log P(y|D);
normalizing the information gain of the current balanced voice data set by the number of values taken by the feature to obtain the information gain rate of the current balanced voice data set
Gain_ratio(D; D1, ..., Dk) = Gain(D; D1, ..., Dk) / IV(D1, ..., Dk),
IV(D1, ..., Dk) = −Σ_{i=1}^{k} (|Di|/|D|) · log(|Di|/|D|);
calculating the Gini coefficient of the current balanced voice data set
Gini(D; D1, ..., Dk) = Σ_{i=1}^{k} (|Di|/|D|) · I(Di),
wherein I(D) = 1 − Σ_{y∈Y} P(y|D)²;
linearly combining the information gain rate and the Gini coefficient of the current balanced voice data set by the double factors
ψ(D; D1, ..., Dk) = α[β1 · Gini(D; D1, ..., Dk) − β2 · Gain_ratio(D; D1, ..., Dk)],
and adaptively splitting the decision tree nodes of the double-factor random forest model according to ψ; wherein α is a random factor and βi is the balance factor of the node splitting index.
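A minimal sketch of the quantities in claim 6 and of the combined criterion ψ for a single candidate split, assuming non-negative integer class labels; the α and β values are inputs left to the caller and the helper names are illustrative.

```python
# Sketch of the double-factor split criterion psi from claim 6; alpha/beta are assumed inputs.
import numpy as np

def entropy(labels):
    p = np.bincount(labels) / len(labels)            # assumes non-negative integer labels
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini_impurity(labels):
    p = np.bincount(labels) / len(labels)
    return 1.0 - np.sum(p ** 2)

def two_factor_criterion(labels, subsets, alpha, beta1, beta2):
    """psi(D; D1..Dk) = alpha * (beta1 * Gini - beta2 * Gain_ratio) for one split of D,
    where `labels` holds the class labels of D and `subsets` is the list of label
    arrays of the non-empty subsets D1, ..., Dk produced by the candidate split."""
    weights = np.array([len(s) for s in subsets]) / len(labels)
    gain = entropy(labels) - sum(w * entropy(s) for w, s in zip(weights, subsets))
    iv = -np.sum(weights * np.log2(weights))         # intrinsic value of the split
    gain_ratio = gain / iv if iv > 0 else 0.0
    gini = sum(w * gini_impurity(s) for w, s in zip(weights, subsets))
    return alpha * (beta1 * gini - beta2 * gain_ratio)
```

A decision tree node of the double-factor random forest would then be split by optimizing ψ over the candidate splits, with α acting as the random factor and β1, β2 as the balance factors.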
7. The method of claim 1, wherein if the classification evaluation index diverges, updating the oversampling rate of the oversampling SMOTE and the undersampling rate of the undersampling ENN according to the out-of-bag error classification rate, returning to perform step S102 until the classification evaluation index converges, and outputting the current balanced voice data set comprises:
if the classification evaluation index diverges, then according to
(formula FDA0003474249250000041)
updating the oversampling rate of the oversampling SMOTE and the undersampling rate of the undersampling ENN, returning to execute step S102 until the classification evaluation index converges, and outputting the current balanced voice data set;
wherein OOB_mis_rate is the out-of-bag error classification rate of the double-factor random forest model, N_mis_maj is the number of misclassified majority class samples, N_mis_min_i is the number of misclassified samples of the i-th minority class, and minclass is the number of minority classes.
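A minimal sketch of how the per-class out-of-bag error counts named in claim 7 could be gathered from a scikit-learn forest's oob_decision_function_; how N_mis_maj and N_mis_min_i then enter the update of the oversampling and undersampling rates follows the patent's own formula, which is not reproduced here.

```python
# Sketch: gathering the per-class out-of-bag misclassification counts of claim 7.
# The actual update of the oversampling/undersampling rates follows the patent's formula.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def oob_misclassification_counts(X, y, majority_class):
    rf = RandomForestClassifier(n_estimators=300, oob_score=True,
                                random_state=0).fit(X, y)
    # Rows never left out of bag can be NaN for very small forests; 300 trees makes this rare.
    oob_pred = rf.classes_[np.argmax(rf.oob_decision_function_, axis=1)]
    wrong = oob_pred != y
    n_mis_maj = int(np.sum(wrong & (y == majority_class)))
    n_mis_min = {c: int(np.sum(wrong & (y == c)))
                 for c in rf.classes_ if c != majority_class}
    oob_mis_rate = 1.0 - rf.oob_score_
    return oob_mis_rate, n_mis_maj, n_mis_min
```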
8. A voice sample equalization apparatus combining mixed sampling and random forest, comprising:
an acquisition module, which is used for acquiring an initial voice data set and performing feature extraction on the initial voice data set to obtain an extracted voice data feature set;
the analysis module is used for analyzing the minority class samples of the voice data feature set using oversampling SMOTE, generating new target minority class samples from the minority class samples, analyzing the nearest neighbor samples of the target minority class samples and of the majority class samples in the voice data feature set using undersampling ENN, and deleting target minority class samples and majority class samples according to their nearest neighbor samples, to obtain a current balanced voice data set;
the construction module is used for calculating the information gain rate and the Gini coefficient of the current balanced voice data set, and linearly combining the information gain rate and the Gini coefficient of the current balanced voice data set with double factors to construct a double-factor random forest model;
the input module is used for inputting the current balanced voice data set into the double-factor random forest model and outputting a classification evaluation index and an out-of-bag error classification rate of the current balanced voice data set under a preset double-factor condition;
the judging module is used for judging whether the classification evaluation index is converged or not, and if the classification evaluation index is converged, outputting the current balanced voice data set; and if the classification evaluation index diverges, updating the oversampling rate of the oversampling SMOTE and the undersampling rate of the undersampling ENN according to the out-of-bag error classification rate, returning to execute the step S102 until the classification evaluation index converges, and outputting the current balanced voice data set.
9. A voice sample equalization apparatus combining mixed sampling and random forest, comprising:
a memory for storing a computer program;
a processor for implementing the steps of a method of speech sample equalization combining mixed sampling and random forest as claimed in any one of claims 1 to 7 when executing said computer program.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the steps of the voice sample equalization method combining mixed sampling and random forest according to any one of claims 1 to 7.
CN202210083571.0A 2022-01-17 2022-01-17 Voice sample equalization method combining mixed sampling and random forest Active CN114550697B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210083571.0A CN114550697B (en) 2022-01-17 2022-01-17 Voice sample equalization method combining mixed sampling and random forest

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210083571.0A CN114550697B (en) 2022-01-17 2022-01-17 Voice sample equalization method combining mixed sampling and random forest

Publications (2)

Publication Number Publication Date
CN114550697A true CN114550697A (en) 2022-05-27
CN114550697B CN114550697B (en) 2022-11-18

Family

ID=81671633

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210083571.0A Active CN114550697B (en) 2022-01-17 2022-01-17 Voice sample equalization method combining mixed sampling and random forest

Country Status (1)

Country Link
CN (1) CN114550697B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273909A (en) * 2016-04-08 2017-10-20 上海市玻森数据科技有限公司 The sorting algorithm of high dimensional data
CN111202526A (en) * 2020-01-20 2020-05-29 华东医院 Method for simplifying and optimizing multi-dimensional elderly auditory function evaluation system
US20210287136A1 (en) * 2020-03-11 2021-09-16 Synchrony Bank Systems and methods for generating models for classifying imbalanced data
US20210097449A1 (en) * 2020-12-11 2021-04-01 Intel Corporation Memory-efficient system for decision tree machine learning

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
MOHD ADIL: "Solving the Problem of Class Imbalance in the Prediction of Hotel Cancelations: A Hybridized Machine Learning Approach", Processes *
X ZHANG, et al.: "Class-imbalanced voice pathology classification: Combining hybrid sampling with optimal two-factor random forests", Applied Acoustics *
ZHAOZHAO XU, et al.: "A hybrid sampling algorithm combining M-SMOTE and ENN based on Random forest for medical imbalanced data", Journal of Biomedical Informatics *
SHANG ZIWEI: "Research on ECG-assisted diagnosis and treatment applications based on SMOTE+ENN and random forest", China Master's Theses Full-text Database *
ZHAO PINHUI: "Research on pathological voice recognition combining multi-band nonlinear methods", Informatization Research *

Also Published As

Publication number Publication date
CN114550697B (en) 2022-11-18

Similar Documents

Publication Publication Date Title
CN107564513B (en) Voice recognition method and device
US11462210B2 (en) Data collecting method and system
US20210224647A1 (en) Model training apparatus and method
JP2014026455A (en) Media data analysis device, method and program
Zhang et al. Noise robust speaker recognition based on adaptive frame weighting in GMM for i-vector extraction
CN108154186B (en) Pattern recognition method and device
JP6979203B2 (en) Learning method
CN110956277A (en) Interactive iterative modeling system and method
EP3956885A1 (en) Condition-invariant feature extraction network for speaker recognition
CN104077598A (en) Emotion recognition method based on speech fuzzy clustering
CN106384587B (en) A kind of audio recognition method and system
KR102406512B1 (en) Method and apparatus for voice recognition
Chen et al. SEC4SR: A security analysis platform for speaker recognition
Fan et al. Modeling voice pathology detection using imbalanced learning
CN114550697B (en) Voice sample equalization method combining mixed sampling and random forest
Lin et al. Domestic activities clustering from audio recordings using convolutional capsule autoencoder network
Saeidi et al. Particle swarm optimization for sorted adapted gaussian mixture models
JP2014215385A (en) Model estimation system, sound source separation system, model estimation method, sound source separation method, and program
Choi et al. Adversarial speaker-consistency learning using untranscribed speech data for zero-shot multi-speaker text-to-speech
CN115437960A (en) Regression test case sequencing method, device, equipment and storage medium
Farsi et al. Implementation and optimization of a speech recognition system based on hidden Markov model using genetic algorithm
CN111755012A (en) Robust speaker recognition method based on depth layer feature fusion
CN106373576A (en) Speaker confirmation method based on VQ and SVM algorithms, and system thereof
KR20140077774A (en) Apparatus and method for adapting language model based on document clustering
Toman et al. Content-based audio retrieval by using elitism GA-KNN approach

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant