CN114550697B - Voice sample equalization method combining mixed sampling and random forest

Info

Publication number: CN114550697B
Application number: CN202210083571.0A
Authority: CN (China)
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN114550697A
Inventors: 张晓俊, 周长伟, 朱欣程, 陶智, 赵鹤鸣
Assignee (current and original): Suzhou University
Application filed by Suzhou University; application granted; publication of CN114550697B.


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/24 - Classification techniques
    • G06F 18/243 - Classification techniques relating to the number of classes
    • G06F 18/24323 - Tree-organised classifiers


Abstract

The invention relates to a voice sample equalization method combining mixed sampling and random forest. First, feature extraction is performed on an initial voice data set. The extracted voice data feature set is then equalized using SMOTE-ENN mixed sampling to obtain a current equalized voice data set. Next, the current equalized voice data set is input into a double-factor random forest model, which outputs a classification evaluation index and an out-of-bag misclassification rate. Finally, whether the classification evaluation index has converged is judged: if it has converged, the current equalized voice data set is output; otherwise, the mixed sampling rate of the SMOTE-ENN mixed sampling is updated according to the out-of-bag misclassification rate, and equalization of the extracted voice data set is performed again until the classification evaluation index converges and the current equalized voice data set is output. The invention retains sample data of high information value to the maximum extent.

Description

Voice sample equalization method combining mixed sampling and random forest
Technical Field
The invention relates to the technical field of data processing, and in particular to a voice sample equalization method, apparatus and device combining mixed sampling and random forest, and to a computer-readable storage medium.
Background
In recent years, artificial intelligence technology has made breakthrough progress in speech recognition. However, data imbalance remains a challenging problem in machine learning. Data with unevenly distributed classes biases the recognition ability of a classifier markedly toward the majority classes, so satisfactory classification performance on the minority classes cannot be achieved.
At present, traditional imbalanced-learning techniques for the imbalanced-data classification problem can be divided into two categories: internal methods and external methods. Internal methods improve existing classification algorithms to reduce their sensitivity to class imbalance. External methods preprocess the training data to balance it. Among external methods, the sampling methods for balancing imbalanced data sets include SMOTE oversampling and ENN undersampling.
The basic idea of SMOTE oversampling is to analyze the minority-class samples and artificially synthesize new samples from them to add to the data set; however, the distribution of nearby majority-class samples is not considered when new samples are generated, the k-nearest-neighbor selection is blind, much noise is introduced, and synthetic samples invade the majority-class sample space. ENN undersampling eliminates majority-class samples to obtain an ideal class distribution rate, but causes the loss of classification information in the data set. A voice sample equalization method combining mixed sampling and random forest therefore needs to be designed.
Disclosure of Invention
Therefore, the technical problem to be solved by the invention is to overcome the defects in the prior art that SMOTE oversampling does not consider the distribution of nearby majority-class samples when generating new samples, so that much noise invades the majority-class sample space, and that ENN undersampling causes the loss of classification information in the data set.
In order to solve the technical problem, the invention provides a voice sample equalization method combining mixed sampling and random forest, which comprises the following steps:
S101: acquiring an initial voice data set, and performing feature extraction on the initial voice data set to obtain an extracted voice data feature set;
S102: analyzing the minority-class samples of the voice data feature set using SMOTE oversampling and generating new target minority-class samples from them; analyzing the nearest neighbor samples of the target minority-class samples and of the majority-class samples in the voice data feature set using ENN undersampling, and deleting target minority-class samples and majority-class samples according to those nearest neighbor samples, to obtain a current equalized voice data set;
S103: calculating the information gain ratio and Gini index of the current equalized voice data set, and linearly combining them with two factors to construct a double-factor random forest model;
S104: inputting the current equalized voice data set into the double-factor random forest model, and outputting a classification evaluation index and an out-of-bag misclassification rate of the current equalized voice data set under a preset double-factor condition;
S105: judging whether the classification evaluation index converges; if it converges, outputting the current equalized voice data set; if it diverges, updating the oversampling rate of SMOTE and the undersampling rate of ENN according to the out-of-bag misclassification rate, and returning to step S102 until the classification evaluation index converges, then outputting the current equalized voice data set.
In an embodiment of the present invention, analyzing the minority-class samples of the voice data feature set using SMOTE oversampling and generating new target minority-class samples from them, analyzing the nearest neighbor samples of the target minority-class samples and of the majority-class samples in the voice data feature set using ENN undersampling, and deleting target minority-class samples and majority-class samples according to those nearest neighbor samples to obtain the current equalized voice data set includes:
S201: analyzing the minority-class samples S_min using SMOTE oversampling, generating samples T_gen from S_min, and storing T_gen into the minority-class sample space K_min[]; the generated-sample count is C_gen = count(K_min);
S202: judging whether C_gen is less than the number of samples M_up that SMOTE oversampling needs to generate; if C_gen < M_up, returning to step S201, otherwise executing step S203; where M_up = (number of minority-class samples S_min) × oversampling rate N_1;
S203: analyzing with ENN undersampling the nearest neighbor samples of the samples T_gen and of the majority-class samples S_maj in the voice data feature set; if k or more of the nearest neighbors of a sample T_gen belong to a class different from that of T_gen, deleting the corresponding T_gen from K_min[]; if k or more of the nearest neighbors of a majority-class sample S_maj belong to a class different from that of S_maj, deleting S_maj; the count of samples deleted by ENN undersampling is T_del = T_gen + S_maj (deleted generated samples plus deleted majority-class samples);
S204: judging whether the count T_del of samples deleted by ENN undersampling is less than the number of samples M_down that ENN needs to delete; if T_del < M_down, returning to step S203, otherwise outputting the current equalized voice data set; where M_down = (number of majority-class samples S_maj) × undersampling rate N_2.
In one embodiment of the present invention, analyzing the minority-class samples S_min using SMOTE oversampling and generating the samples T_gen from S_min includes:
searching for the k nearest neighbor samples S_min_i within the minority-class samples S_min;
assuming that the number of samples to be generated by SMOTE oversampling is M_up, randomly selecting M_up samples from these neighbors, marked S_min_1, S_min_2, ..., S_min_j;
combining S_min_i and S_min_j by a random interpolation operation to generate the samples T_gen = S_min_i + rand(0,1) · (S_min_j − S_min_i), where rand(0,1) denotes a random number in the interval (0,1), i = 1,2,...,k and j = 1,2,...,M_up.
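This interpolation rule can be written down directly. The following NumPy sketch generates M_up synthetic samples by the formula above; the function name smote_generate and its defaults are illustrative, not from the patent:

import numpy as np

def smote_generate(S_min, m_up, k=5, seed=0):
    # T_gen = S_min_i + rand(0,1) * (S_min_j - S_min_i)
    rng = np.random.default_rng(seed)
    K_min = []
    for _ in range(m_up):
        i = rng.integers(len(S_min))                  # pick a minority sample S_min_i
        d = np.linalg.norm(S_min - S_min[i], axis=1)  # distances within the minority class
        nbrs = np.argsort(d)[1:k + 1]                 # its k nearest minority neighbors
        j = rng.choice(nbrs)                          # randomly chosen neighbor S_min_j
        gap = rng.random()                            # rand(0,1)
        K_min.append(S_min[i] + gap * (S_min[j] - S_min[i]))
    return np.asarray(K_min)                          # the space K_min[] of generated samples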
In an embodiment of the present invention, if the classification evaluation index diverges, updating the oversampling rate of SMOTE and the undersampling rate of ENN according to the out-of-bag misclassification rate and returning to step S102 until the classification evaluation index converges, then outputting the current equalized voice data set, includes:
if the classification evaluation index diverges, updating the oversampling rate of SMOTE and the undersampling rate of ENN according to the out-of-bag misclassification rate, initializing the out-of-bag misclassification rate together with T_gen, M_up, M_down and T_del, and returning to step S102 until the classification evaluation index converges, then outputting the current equalized voice data set.
In an embodiment of the present invention, calculating the information gain ratio and Gini index of the current equalized voice data set and linearly combining them with two factors to construct the double-factor random forest model includes:
calculating the information gain ratio and Gini index of the current equalized voice data set, linearly combining them with the two factors, and adaptively splitting the decision tree nodes of the double-factor random forest model;
constructing the double-factor random forest model according to the adaptive splitting of the decision tree nodes;
judging whether the out-of-bag error of the double-factor random forest model reaches the preset out-of-bag error value; if so, outputting the double-factor random forest model under the preset double-factor condition; otherwise, updating the two factors used for adaptive splitting of the decision tree nodes and reconstructing the double-factor random forest model.
In an embodiment of the present invention, calculating the information gain ratio and Gini index of the current equalized voice data set, linearly combining them with the two factors, and adaptively splitting the decision tree nodes of the double-factor random forest model includes:
dividing the current equalized voice data set D into subsets D_1, ..., D_k and calculating the information gain of the current equalized voice data set
Gain(D; D_1, ..., D_k) = Entropy(D) − Σ_{i=1..k} (|D_i|/|D|) · Entropy(D_i),
where the entropy of the current equalized voice data set D is
Entropy(D) = −Σ_c p_c · log2(p_c), with p_c the proportion of class-c samples in D;
normalizing the information gain by the number of values taken by the feature to obtain the information gain ratio of the current equalized voice data set
Gain_ratio(D; D_1, ..., D_k) = Gain(D; D_1, ..., D_k) / (−Σ_{i=1..k} (|D_i|/|D|) · log2(|D_i|/|D|));
computing the Gini index of the current equalized voice data set
Gini(D; D_1, ..., D_k) = Σ_{i=1..k} (|D_i|/|D|) · Gini(D_i),
where Gini(D_i) = 1 − Σ_c p_c²;
using the two factors to form the linear combination of the information gain ratio and Gini index of the current equalized voice data set
ψ(D; D_1, ..., D_k) = α[β_1 · Gini(D; D_1, ..., D_k) − β_2 · Gain_ratio(D; D_1, ..., D_k)]
and adaptively splitting the decision tree nodes of the double-factor random forest model accordingly; where α is the random factor and β_i are the balance factors of the node splitting index.
In an embodiment of the present invention, if the classification evaluation index diverges, updating the oversampling rate of SMOTE and the undersampling rate of ENN according to the out-of-bag misclassification rate and returning to step S102 until the classification evaluation index converges, then outputting the current equalized voice data set, includes:
if the classification evaluation index diverges, updating the oversampling rate of SMOTE and the undersampling rate of ENN according to the out-of-bag misclassification rate OOBM_mis_rate (the update formula is rendered only as an image in the original document), returning to step S102 until the classification evaluation index converges, and outputting the current equalized voice data set;
where OOBM_mis_rate is the out-of-bag misclassification rate of the double-factor random forest model, N_mis_maj is the number of misclassified majority-class samples, N_mis_min_i is the number of misclassified samples of the i-th minority class, and minclass is the number of minority classes.
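The aggregation formula itself survives only as an image, but the quantities it names can be computed from out-of-bag predictions. The sketch below counts N_mis_maj and N_mis_min_i and uses the plain overall out-of-bag error as a stand-in aggregate, which is an assumption since the original formula is not recoverable:

import numpy as np

def oob_misclassification(y_true, y_oob_pred, majority_label):
    # Counts the quantities named in the text; the final OOBM_mis_rate
    # aggregate used here (overall error rate) is an assumption.
    y_true, y_oob_pred = np.asarray(y_true), np.asarray(y_oob_pred)
    wrong = y_true != y_oob_pred
    n_mis_maj = int(np.sum(wrong & (y_true == majority_label)))
    minority = [c for c in np.unique(y_true) if c != majority_label]
    n_mis_min = {c: int(np.sum(wrong & (y_true == c))) for c in minority}
    minclass = len(minority)            # number of minority classes
    oobm_mis_rate = wrong.mean()        # stand-in aggregate
    return n_mis_maj, n_mis_min, minclass, oobm_mis_rate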
The invention provides a voice sample equalization device combining mixed sampling and random forest, which comprises:
an acquisition module, configured to acquire an initial voice data set and perform feature extraction on it to obtain an extracted voice data feature set;
an analysis module, configured to analyze the minority-class samples of the voice data feature set using SMOTE oversampling and generate new target minority-class samples from them, analyze the nearest neighbor samples of the target minority-class samples and of the majority-class samples in the voice data feature set using ENN undersampling, and delete target minority-class samples and majority-class samples according to those nearest neighbor samples, to obtain a current equalized voice data set;
a construction module, configured to calculate the information gain ratio and Gini index of the current equalized voice data set and linearly combine them with two factors to construct a double-factor random forest model;
an input module, configured to input the current equalized voice data set into the double-factor random forest model and output a classification evaluation index and an out-of-bag misclassification rate of the current equalized voice data set under a preset double-factor condition;
a judging module, configured to judge whether the classification evaluation index converges and, if it converges, output the current equalized voice data set; if it diverges, update the oversampling rate of SMOTE and the undersampling rate of ENN according to the out-of-bag misclassification rate and return to the analysis module until the classification evaluation index converges, then output the current equalized voice data set.
The invention provides a voice sample equalization device combining mixed sampling and random forest, comprising:
a memory for storing a computer program;
and a processor for implementing the steps of the above voice sample equalization method combining mixed sampling and random forest when executing the computer program.
The invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of a method of speech sample equalization combining mixed sampling and random forest as described above.
Compared with the prior art, the technical scheme of the invention has the following advantages:
the invention relates to a voice sample equalization method for joint mixed sampling and random forests, which comprises the steps of firstly, collecting an initial voice data set, and carrying out feature extraction on the initial voice data set to obtain an extracted voice data feature set; then, analyzing a minority class sample of the voice data feature set by utilizing oversampling SMOTE, generating a new target minority class sample according to the minority class sample, analyzing a nearest neighbor sample of the target minority class sample and a nearest neighbor sample of a plurality of classes of samples in the voice data feature set by utilizing undersampling ENN, deleting the target minority class sample and the majority class sample according to the nearest neighbor sample of the target minority class sample and the nearest neighbor sample of the majority class sample, and obtaining a current balanced voice data set; secondly, calculating the information gain rate and the kini coefficient of the current balanced voice data set, and linearly combining the information gain rate and the kini coefficient of the current balanced voice data set by using double factors so as to construct a double-factor random forest model; inputting the current balanced voice data set into a double-factor random forest model, and outputting a classification evaluation index and an out-of-bag error classification rate of the double-factor random forest model under a preset double-factor condition; finally, judging whether the classification evaluation index is converged, and if the classification evaluation index is converged, outputting the current balanced voice data set; and if the classification evaluation index diverges, updating the oversampling rate of the oversampling SMOTE and the undersampling rate of the undersampled ENN according to the out-of-bag error classification rate, returning to perform equalization processing on the extracted voice data feature set again until the classification evaluation index converges, and outputting the current equalized voice data set. According to the method, the characteristic extraction is carried out on the voice data set, so that the consideration on the inherent characteristic of the sample is increased; by applying the under-sampled ENN to the target few samples generated by the over-sampled SMOTE to remove the samples, the problem that the distribution condition of the nearby most samples is not considered when the SMOTE over-sampled generates a new sample is solved, and the generation of noise samples is reduced; meanwhile, self-adaptive double-factor parameters are introduced into the random forest to adjust the bias of the double-factor random forest model, iterative analysis is carried out on the data characteristics of the double-factor random forest input in each turn, and the data characteristics are fed back to the mixed sampling stage according to the classification evaluation indexes, so that the mixed sampling technology can be assisted to obtain more reliable data results, sample data with high information value is reserved to the maximum extent, and the loss of the classification information of the data set is reduced.
Drawings
In order that the present disclosure may be more readily and clearly understood, reference is now made to the following detailed description of the present disclosure taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flow chart of a first embodiment of the voice sample equalization method combining mixed sampling and random forest according to the present invention;
FIG. 2 is a flow chart of a second embodiment of the voice sample equalization method combining mixed sampling and random forest according to the present invention;
FIG. 3 is a schematic diagram of the voice sample equalization method combining mixed sampling and random forest according to the present invention;
FIG. 4 is a schematic diagram of feature extraction for a voice data set in accordance with the present invention;
FIG. 5 is a flow chart of the double-factor random forest of the present invention;
FIG. 6 is a structural block diagram of a voice sample equalization apparatus combining mixed sampling and random forest according to an embodiment of the present invention.
Detailed Description
The present invention is further described below in conjunction with the drawings and specific embodiments so that those skilled in the art can better understand and implement it; the embodiments, however, are not to be construed as limiting the invention.
Referring to fig. 1, fig. 1 is a flowchart of a first embodiment of the voice sample equalization method combining mixed sampling and random forest according to the present invention; the specific operation steps are as follows:
Step S101: acquiring an initial voice data set, and performing feature extraction on the initial voice data set to obtain an extracted voice data feature set;
Step S102: analyzing the minority-class samples of the voice data feature set using SMOTE oversampling and generating new target minority-class samples from them; analyzing the nearest neighbor samples of the target minority-class samples and of the majority-class samples in the voice data feature set using ENN undersampling, and deleting target minority-class samples and majority-class samples according to those nearest neighbor samples, to obtain a current equalized voice data set;
Step S103: calculating the information gain ratio and Gini index of the current equalized voice data set, and linearly combining them with two factors to construct a double-factor random forest model;
Step S104: inputting the current equalized voice data set into the double-factor random forest model, and outputting a classification evaluation index and an out-of-bag misclassification rate of the current equalized voice data set under a preset double-factor condition;
Step S105: judging whether the classification evaluation index converges; if it converges, outputting the current equalized voice data set; if it diverges, updating the oversampling rate of SMOTE and the undersampling rate of ENN according to the out-of-bag misclassification rate, and returning to step S102 until the classification evaluation index converges, then outputting the current equalized voice data set.
In the method provided by this embodiment, the extracted voice data set is equalized by a mixed sampling technique, and the undersampling algorithm is applied to the new samples generated by the oversampling algorithm to eliminate samples. Meanwhile, the data features are analyzed by means of the random forest and fed back to the mixed sampling stage, assisting the mixed sampling technique in obtaining more reliable data results. By introducing adaptive double-factor parameters into the random forest to adjust the model bias, the data features input to the double-factor random forest are analyzed iteratively in each round, and the out-of-bag misclassification rate obtained after each iteration is fed back to the mixed sampling stage as guidance, so that sample data of high information value is retained to the maximum extent.
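This feedback loop can be sketched with off-the-shelf components. In the following Python sketch, the standard SMOTE, ENN and random forest from imbalanced-learn and scikit-learn stand in for the patented double-factor random forest, and the sampling-rate update rule, the function name equalize and the starting rate n1 are illustrative assumptions (binary labels assumed):

import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours
from imblearn.combine import SMOTEENN
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

def equalize(X, y, n1=0.5, k=3, max_iter=10):
    # n1: target minority/majority ratio for SMOTE; it must exceed the
    # current minority/majority ratio of the data (binary labels only)
    f1_history, best = [], (X, y)
    for _ in range(max_iter):
        sampler = SMOTEENN(smote=SMOTE(sampling_strategy=n1, k_neighbors=k),
                           enn=EditedNearestNeighbours(n_neighbors=k))
        Xb, yb = sampler.fit_resample(X, y)        # current equalized data set
        rf = RandomForestClassifier(oob_score=True).fit(Xb, yb)
        oob_mis = 1.0 - rf.oob_score_              # out-of-bag misclassification rate
        oob_pred = rf.classes_[np.argmax(rf.oob_decision_function_, axis=1)]
        f1_history.append(f1_score(yb, oob_pred, average="macro"))
        # stop once F1-macro has dropped twice in a row or stayed flat
        if len(f1_history) >= 3 and f1_history[-1] <= f1_history[-2] <= f1_history[-3]:
            break
        best = (Xb, yb)
        n1 = min(1.0, n1 * (1.0 + oob_mis))        # assumed rate-update rule
    return best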
Based on the above embodiments, the present embodiment further describes the speech sample equalization method, and with reference to fig. 2 and fig. 3, the specific operation steps are as follows:
step S201: acquiring an initial voice data set, and performing feature extraction on the initial voice data set to obtain an extracted voice data feature set;
as shown in fig. 4, in order to analyze the nonlinear phenomenon caused by the eddy current at the glottis during the sounding process, a Bark wavelet sub-band filter bank is firstly adopted to filter the voice signal, then a discrete cosine transform method is adopted to extract the characteristics at the low frequency band, and the correlation and the maximum lyapunov characteristics are extracted at the high frequency band, so that the characteristics of the voice can be embodied in detail at each frequency band. The fluid-solid coupling feature extraction thought based on glottis flow field distribution to be extracted is as follows: firstly, dividing voice frequency bands into 24 frequency bands according to a Bark filter bank, then calculating logarithmic energy after carrying out Fourier transform on a low frequency band according to an MFCC extraction method, then carrying out discrete cosine transform, carrying out nonlinear dynamics analysis on a high frequency band, extracting correlation dimension and a maximum Lyapunov exponent, and then fusing multiple features.
Analyzing the speech signal further from the vocal cord vibration perspective, the vocal cord model is described by a system of equations that is rendered only as an image in the original document. In those equations, α = 1,2 denotes the left and right side portions respectively; x and υ are the motion displacement and velocity of the mass blocks; the mass of each mass block, the spring elastic coefficient, the coupling elastic coefficient and the damping constant appear as parameters; l and d are the vocal cord length and the thickness of the lower-layer mass block; and the remaining terms denote the Bernoulli pressure and the impact force generated at the time of collision.
Setting the model mass, elastic coefficient, coupling coefficient, damping constant and subglottal pressure as optimizable parameters, expressed as the vector Φ := [m, k, k_c, r, P_S], a suitable Φ is searched with a variation particle swarm quasi-Newton method so that the glottal fluid-solid coupling model accurately reproduces the glottal waveform. To avoid the gradient method directly returning a local minimum in the non-convex search space, an optimized solution is first obtained with the variation particle swarm algorithm, and the quasi-Newton method then performs local optimization on that solution to find the global optimum.
The selection and crossover process adopts a roulette-wheel selection rule to select M individuals, and the particle swarm algorithm terminates when the highest fitness obtained exceeds a preset threshold or the preset number of iterations is reached. The time-domain error between the target voice source U_ge and the waveform U_gs simulated with the parameter vector Φ is defined as the objective function F (the formula of F is rendered only as an image in the original document).
In the formula, N denotes the number of sample points of U_ge and U_gs. When the value of the objective function F reaches the global minimum, the simulated glottal airflow U_gs of the vocal cord mass model matches the target glottal airflow U_ge, and the vocal cord mass model accurately simulates the actual vocal cord structure of the target voice source.
The fluid-solid characteristics of the voice signal are extracted through sub-band nonlinear analysis and the vocal cord mass model: the sub-band nonlinear analysis reflects the nonlinear characteristics caused by airflow vortices during voice production, and the vocal cord mass model simulates the actual vocal cord structure of the target voice signal. The extracted fluid-solid characteristics of the voice signal are then applied to subsequent voice recognition.
Step S202: analyzing the minority-class samples S_min of the voice data feature set using SMOTE oversampling, generating samples T_gen from S_min, and storing T_gen into the minority-class sample space K_min[]; the generated-sample count is C_gen = count(K_min);
Step S203: judging whether C_gen is less than the number of samples M_up that SMOTE oversampling needs to generate; if C_gen < M_up, returning to step S202, otherwise executing step S204; where M_up = (number of minority-class samples S_min) × oversampling rate N_1;
Step S204: analyzing with ENN undersampling the nearest neighbor samples of the samples T_gen and of the majority-class samples S_maj in the voice data feature set; if k or more of the nearest neighbors of a sample T_gen belong to a class different from that of T_gen, deleting the corresponding T_gen from K_min[]; if k or more of the nearest neighbors of a majority-class sample S_maj belong to a class different from that of S_maj, deleting S_maj; the count of samples deleted by ENN undersampling is T_del = T_gen + S_maj;
Step S205: judging whether the count T_del of samples deleted by ENN undersampling is less than the number of samples M_down that ENN needs to delete; if T_del < M_down, returning to step S204, otherwise outputting the current equalized voice data set; where M_down = (number of majority-class samples S_maj) × undersampling rate N_2.
The SMOTE oversampling algorithm searches for the k nearest neighbor samples S_min_i within the minority class based on the k-nearest-neighbor idea. Assuming the number of samples to be generated for the data set is M_up, M_up samples are randomly selected from these neighbors and marked S_min_1, S_min_2, ..., S_min_j. The data samples S_min_i and S_min_j are combined by the corresponding random interpolation operation to obtain the synthetic sample
S_new = S_min_i + rand(0,1) · (S_min_j − S_min_i),
where rand(0,1) denotes a random number in the interval (0,1), i = 1,2,...,k, j = 1,2,...,M_up, and the number of generated samples M_up is determined by the oversampling rate.
The ENN (Edited Nearest Neighbour) undersampling algorithm is likewise based on a k-nearest-neighbor selection strategy and reduces the majority-class and minority-class samples. Its basic idea is as follows: given an unbalanced data set D whose majority-class samples are S_maj, traverse each sample S_maj_i in S_maj and find its three nearest neighbor samples; if two or more of the three differ in class from S_maj_i, delete the sample S_maj_i. By combining SMOTE and ENN, the extracted voice data set is equalized through SMOTE-ENN mixed sampling, and the undersampling algorithm is applied to the new samples generated by the oversampling algorithm to eliminate samples, which alleviates the noise-sample problem without losing data set information.
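The three-neighbor rule translates into a few lines; the names are illustrative and scikit-learn's NearestNeighbors supplies the neighbor search:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def enn_filter(X, y, k=3):
    # Delete any sample whose k nearest neighbors mostly disagree with its label
    X, y = np.asarray(X), np.asarray(y)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1: the query point itself
    _, idx = nn.kneighbors(X)
    disagree = (y[idx[:, 1:]] != y[:, None]).sum(axis=1)
    keep = disagree < (k + 1) / 2                     # e.g. k=3: delete when >= 2 disagree
    return X[keep], y[keep]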
Step S206: calculating the information gain ratio and Gini index of the current equalized voice data set, linearly combining them with the two factors, and adaptively splitting the decision tree nodes of the double-factor random forest model;
Step S207: constructing the double-factor random forest model according to the adaptive splitting of the decision tree nodes;
the classification performance of random forests is reduced when non-uniform data sets are processed. The main reasons for this are two, one is that in the process of random forest construction, the training set is selected by bootstrap self-sampling. Because few samples of the original data set are fewer and the probability of sampling of the few samples is lower, the number of the few samples in the sub-training set is smaller than that of the original data set, and the non-equilibrium of the data set is aggravated. Secondly, because the number of the samples of the minority class in the original data set is low, the decision tree based on the sub-training set lacks the generalization capability and cannot embody the characteristics of the minority class.
A random forest is an ensemble classifier R = {h(x, θ_k), k = 1,2,...,K} composed of a set of decision trees, where {θ_k} are independent identically distributed random vectors, K is the number of decision trees in the random forest, and the training set of each classifier is obtained by random sampling from the data set D = <X, Y>. The margin function of the random forest is
mr(X, Y) = av_k I(h(X, θ_k) = Y) − max_{j≠Y} av_k I(h(X, θ_k) = j),
where av_k denotes the average over the K trees and I(·) is the indicator function. The classification performance (strength) of the base classifiers {h(x, θ)} is defined as
s = E_{X,Y} mr(X, Y).
Assume s ≥ 0, i.e., the base classifiers are weak classifiers. The upper bound of the random forest generalization error PE* is
PE* ≤ ρ̄ (1 − s²) / s²,
where the subscripts X, Y denote the probability P over the X, Y space and ρ̄ is the mean correlation coefficient among all the classifiers. This shows that the generalization error of the random forest is related to the classification performance of the base classifiers and to the correlation coefficients among the base classifiers. Therefore, a double-factor decision tree splitting algorithm is proposed to reduce the correlation coefficient among the base classifiers, improve the classification performance of the base classifiers, and thereby reduce the generalization error of the random forest.
The node splitting algorithms for decision trees mainly include ID3, C4.5 [23], CART [24], etc. The ID3 algorithm selects the information gain as the splitting criterion: the "feature-value" combination with the maximum information gain is selected for splitting. Its disadvantage is that the information gain criterion favors features with many possible values while ignoring their relevance to the classification, so the classification result generalizes poorly. The C4.5 and CART algorithms use the "information gain ratio" and the "Gini index", respectively, as splitting criteria. ID3, using information gain as the node splitting criterion, can only process discrete features, while C4.5 and CART, using the information gain ratio and Gini index, can also process numerical features. The difference between the two indexes is that the information gain ratio computes the entropy difference before and after splitting from class probabilities multiplied by their logarithms, which favors smaller distributions with fewer feature values, whereas the Gini index is obtained by subtracting the sum of squared class probabilities from one, which favors larger data distributions. Both are information-theory-based algorithms, and their rationales for node splitting are approximately similar. Therefore, a combination of the two is established, and a random factor and balance factors are introduced to realize adaptive node splitting.
Given the current equalized voice data set D, the entropy of the data set is defined as
Entropy(D) = −Σ_c p_c · log2(p_c),
where p_c is the proportion of class-c samples in D. When the current equalized voice data set D is divided into subsets D_1, ..., D_k, the corresponding reduction in entropy yields the "information gain"
Gain(D; D_1, ..., D_k) = Entropy(D) − Σ_{i=1..k} (|D_i|/|D|) · Entropy(D_i).
The information gain ratio normalizes the information gain by the number of values taken by the feature, that is,
Gain_ratio(D; D_1, ..., D_k) = Gain(D; D_1, ..., D_k) / (−Σ_{i=1..k} (|D_i|/|D|) · log2(|D_i|/|D|)).
The Gini index of the current equalized voice data set D is then defined as
Gini(D; D_1, ..., D_k) = Σ_{i=1..k} (|D_i|/|D|) · Gini(D_i),
where Gini(D_i) = 1 − Σ_c p_c².
Considering the linear combination of the information gain ratio and the Gini index, the double-factor node splitting algorithm is
ψ(D; D_1, ..., D_k) = α[β_1 · Gini(D; D_1, ..., D_k) − β_2 · Gain_ratio(D; D_1, ..., D_k)],
where α is a random factor (0 ≤ α ≤ 1) that controls the randomness of node splitting: when α = 1, the generated decision tree is identical to the deterministic decision tree, and when α = 0 it is a completely random tree. β_i (i = 1,2) are the balance factors of the node splitting index, with 0 ≤ β_i ≤ 1; they may not both be 0 or 1 at the same time, and on the boundary only the two combinations (1,0) and (0,1) exist. As shown in FIG. 5, the double-factor random forest is constructed by this double-factor node splitting algorithm. When a decision tree node is split, CART seeks the smallest Gini index while the C4.5 algorithm seeks the largest information gain ratio; when both indexes are optimal, ψ(D; D_1, ..., D_k) takes its minimum value, which serves as the optimal rule for splitting the node. After the random forest is generated, the out-of-bag error is estimated: if the out-of-bag error reaches its minimum, the random forest under the optimal factor condition is output; otherwise, the two factors are updated and the random forest is reconstructed.
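These definitions translate directly into NumPy helpers. A split is represented as the list of label subsets D_1, ..., D_k, and the alpha/beta defaults in psi() are illustrative:

import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def info_gain(y, subsets):
    w = np.array([len(s) for s in subsets]) / len(y)      # |D_i| / |D|
    return entropy(y) - sum(wi * entropy(s) for wi, s in zip(w, subsets))

def gain_ratio(y, subsets):
    w = np.array([len(s) for s in subsets]) / len(y)
    split_info = -np.sum(w * np.log2(w))                  # intrinsic value; > 0 for real splits
    return info_gain(y, subsets) / split_info

def gini_index(y, subsets):
    w = np.array([len(s) for s in subsets]) / len(y)
    return sum(wi * gini(s) for wi, s in zip(w, subsets))

def psi(y, subsets, alpha=0.8, beta1=0.5, beta2=0.5):
    # Double-factor score; the candidate split minimizing psi is chosen,
    # consistent with CART minimizing Gini and C4.5 maximizing gain ratio.
    return alpha * (beta1 * gini_index(y, subsets) - beta2 * gain_ratio(y, subsets))

def best_split(y, candidate_splits, **factors):
    return min(candidate_splits, key=lambda s: psi(y, s, **factors))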
Step S208: inputting the current balanced voice data set into a double-factor random forest model, and outputting a classification evaluation index and an out-of-bag error classification rate of the double-factor random forest model under a preset double-factor condition;
the out-of-bag misclassification rate is:
OOBM_mis_rate, computed from the out-of-bag prediction errors (the formula itself is rendered only as an image in the original document);
wherein N_mis_maj is the number of misclassified majority-class samples, N_mis_min_i is the number of misclassified samples of the i-th minority class, and minclass is the number of minority classes.
Step S209: judging whether the classification evaluation index converges; if it converges, outputting the current equalized voice data set; if it diverges, updating the oversampling rate of SMOTE and the undersampling rate of ENN according to the out-of-bag misclassification rate, initializing the out-of-bag misclassification rate together with T_gen, M_up, M_down and T_del, and returning to step S202 until the classification evaluation index converges, then outputting the current equalized voice data set.
In the invention, oversampling and undersampling can equalize the data distribution as far as possible, but existing sampling algorithms pay insufficient attention to class overlap and noise, and the spatial distribution of the data is distorted after sampling. Therefore a mixed sampling algorithm combined with the double-factor random forest is proposed: new samples are synthesized for the minority classes according to the sample distribution law, and redundant information is removed, without changing the spatial structure of the majority classes, according to the feedback of the double-factor random forest. The data set is pre-equalized by the mixed sampling algorithm combining SMOTE and ENN, the pre-equalized data set is then evaluated with the double-factor random forest, and the classification evaluation index and the misclassification rate are calculated respectively. The mixed sampling rate is corrected according to the misclassification rate; during the mixed-sampling iterations the sampling rate changes dynamically with the out-of-bag misclassification rate of the random forest rather than with the imbalance degree of the data set. The classification evaluation index F1-macro serves as the iteration stop criterion: if it converges, that is, when F1-macro decreases twice in succession or remains unchanged, the mixed sampling ends, the iteration stops, and the output data set is the optimal equalized data set conforming to the original data distribution; if the classification evaluation index diverges, the mixed sampling rate of SMOTE-ENN mixed sampling is updated according to the out-of-bag misclassification rate, the out-of-bag misclassification rate and T_gen, M_up, M_down and T_del are initialized, and equalization of the extracted voice data set is performed again until the classification evaluation index converges and the current equalized voice data set is output.
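The stop criterion itself is a few lines, with f1_history holding the F1-macro of each mixed-sampling round:

def converged(f1_history):
    # Stop when F1-macro has decreased twice in succession or stayed unchanged
    if len(f1_history) < 3:
        return False
    a, b, c = f1_history[-3], f1_history[-2], f1_history[-1]
    return (c < b < a) or (c == b)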
The specific flow of the OOBM-SMOTE-ENN mixed sampling algorithm combined with the double-factor random forest is as follows:
Input: data set D, majority-class samples S_maj, minority-class samples S_min, number of nearest neighbor samples k, initial oversampling rate N_1, initial undersampling rate N_2.
Output: pre-equalized data set D'.
1. Initialize OOBM_mis_rate = 1 and the number of samples M_up that oversampling needs to generate (M_up = S_min × N_1); set the generation counter to 0;
2. Correct N_1 and N_2 according to the double-factor random forest feedback; set C_gen = 0; traverse each minority-class sample S_min_i and store the minority samples generated by the SMOTE algorithm into the space K_min[];
3. C_gen = C_gen + count(K_min); if C_gen < M_up, return to step 2, otherwise go to step 4;
4. Initialize the number of samples M_down that undersampling needs to delete (M_down = S_maj × N_2); set the deletion counter to 0;
5. Set T_del = 0, traverse each majority-class sample S_maj_i, and compare the labels of S_maj_i and of the samples in K_min[]; if a generated sample T_gen_i in K_min[] has k or more nearest neighbors whose class differs from that of T_gen_i, delete the corresponding minority-class generated sample T_gen_i from K_min[]. Meanwhile, compare neighborhood samples according to the ENN definition: if S_maj_i has k or more nearest neighbors whose class differs from that of S_maj_i, delete the sample S_maj_i;
6. T_del = T_del + (T_gen_i + S_maj_i); if T_del < M_down, return to step 5, otherwise output the once-equalized sample set D'.
The current equalized voice data set is divided into a training set and a testing set; the random forest recognition model is trained with the feature parameters of the training-set voices, and the trained random forest model performs prediction and classification on the feature parameters of the testing set. The comparative experiment results of the OOBMSE algorithm and classical sampling algorithms are shown in Table 1 below.
Table 1. Comparative experiment results of the OOBMSE algorithm and classical sampling algorithms

Metric               Raw data   SMOTE    ADASYN   BSM      CNN      OOBMSE
Recognition rate/%   97.05      99.03    99.04    99.03    91.43    100.00
Recall/%             92.31      99.16    98.86    99.16    91.34    100.00
Kappa/%              89.89      98.03    98.02    98.03    82.82    100.00
F1 score/%           94.94      99.02    99.01    99.02    91.40    100.00
In the table, SMOTE stands for Synthetic Minority Oversampling Technique. It is an improved scheme based on the random oversampling algorithm: since random oversampling adds minority samples by a simple sample-replication strategy, the basic idea of the SMOTE algorithm is instead to analyze the minority samples and artificially synthesize new samples from them to add to the data set.
ADASYN is an adaptive synthetic sampling method for imbalanced learning. Its idea is to generate minority-class data samples adaptively according to their distribution: minority samples that are harder to learn generate more synthetic data than those that are easier to learn. The ADASYN method can reduce the learning bias brought by the original imbalanced data distribution and can adaptively shift the decision boundary toward samples that are difficult to learn.
Borderline-SMOTE (BSM) is an improved oversampling algorithm based on SMOTE, which uses only the minority-class samples on the borderline to synthesize new samples, thereby improving the class distribution of the samples.
Condensed Nearest Neighbor (CNN) is an undersampling technique used to find a subset of the sample set that incurs no loss in model performance (called the minimum consistent set).
As can be seen from the table, the OOBMSE mixed sampling algorithm provided by the invention is superior to the traditional SMOTE, ADASYN, BSM and CNN algorithms. The accuracy of OOBMSE in the random forest classifier reaches 100%, and the other evaluation indexes also reach their optimal values, so the method outperforms the traditional methods. The equalization algorithm provided by the invention therefore improves the recognition rate and reliability of the system.
The scenario addressed by this embodiment is common in voice recognition and even intelligent medical diagnosis. The mixed sampling algorithm combined with the double-factor random forest provided by the invention builds on the double-factor random forest and combines SMOTE and ENN to solve the imbalanced-data classification problem in voice recognition. In view of the shortcomings of conventional oversampling algorithms, during mixed sampling the oversampling rate changes dynamically according to the out-of-bag misclassification rate of the double-factor random forest rather than the imbalance rate of the data set, while noise in the samples generated by oversampling is removed through ENN. The double-factor random forest and mixed sampling are combined via the out-of-bag misclassification rate, the mixed sampling rate is dynamically corrected, the number of minority-class samples is increased, and noise and redundant information in the samples are removed to balance the data.
Referring to fig. 6, fig. 6 is a structural block diagram of a voice sample equalization apparatus combining mixed sampling and random forest according to an embodiment of the present invention; the specific apparatus may include:
an acquisition module 100, configured to acquire an initial voice data set and perform feature extraction on it to obtain an extracted voice data feature set;
an analysis module 200, configured to analyze the minority-class samples of the voice data feature set using SMOTE oversampling and generate new target minority-class samples from them, analyze the nearest neighbor samples of the target minority-class samples and of the majority-class samples in the voice data feature set using ENN undersampling, and delete target minority-class samples and majority-class samples according to those nearest neighbor samples, to obtain a current equalized voice data set;
a construction module 300, configured to calculate the information gain ratio and Gini index of the current equalized voice data set and linearly combine them with two factors to construct a double-factor random forest model;
an input module 400, configured to input the current equalized voice data set into the double-factor random forest model and output a classification evaluation index and an out-of-bag misclassification rate of the current equalized voice data set under a preset double-factor condition;
a judging module 500, configured to judge whether the classification evaluation index converges and, if it converges, output the current equalized voice data set; if it diverges, update the oversampling rate of SMOTE and the undersampling rate of ENN according to the out-of-bag misclassification rate and return to the analysis module until the classification evaluation index converges, then output the current equalized voice data set.
The voice sample equalization apparatus combining mixed sampling and random forest of this embodiment is used to implement the aforementioned voice sample equalization method combining mixed sampling and random forest, so specific implementations of the apparatus can be found in the foregoing method embodiments: for example, modules 100, 200, 300, 400 and 500 are respectively used to implement steps S101, S102, S103, S104 and S105 of the method, so their specific implementations can refer to the descriptions of the corresponding embodiments and are not repeated here.
An embodiment of the invention also provides voice sample equalization equipment combining mixed sampling and random forest, including: a memory for storing a computer program; and a processor for implementing the steps of the above voice sample equalization method combining mixed sampling and random forest when executing the computer program.
The specific embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps of the above method for equalizing voice samples by combining mixed sampling and random forest.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be understood that the above examples are only for clarity of illustration and are not intended to limit the embodiments. Various other modifications and alterations will occur to those skilled in the art upon reading the foregoing description; it is neither necessary nor possible to exhaust all embodiments here, and obvious variations or modifications derived therefrom remain within the protection scope of the invention.

Claims (8)

1. A voice sample equalization method combining mixed sampling and random forest is characterized by comprising the following steps:
s101: acquiring an initial voice data set, and performing feature extraction on the initial voice data set to obtain an extracted voice data feature set;
s102: analyzing a few class samples of the voice data feature set by using oversampling SMOTE, generating a new target few class sample according to the few class samples, analyzing a nearest neighbor sample of the target few class sample and a nearest neighbor sample of a plurality of class samples in the voice data feature set by using undersampling ENN, deleting the target few class sample and the majority class sample according to the nearest neighbor sample of the target few class sample and the nearest neighbor sample of the majority class sample, and obtaining a current balanced voice data set;
s103: calculating the information gain rate and the kini coefficient of the current balanced voice data set, and linearly combining the information gain rate and the kini coefficient of the current balanced voice data set by using double factors to construct a double-factor random forest model, wherein the method comprises the following steps:
calculating the information gain rate and the Gini coefficient of the current equalized voice data set, linearly combining them with the two factors, and adaptively splitting the decision tree nodes of the double-factor random forest model, which comprises:
dividing the current equalized voice data set D into subsets D_1, ..., D_k, and calculating the information gain of the current equalized voice data set
Gain(D; D_1, ..., D_k) = Ent(D) − Σ_{i=1}^{k} (|D_i|/|D|) · Ent(D_i)
wherein the entropy of the current equalized voice data set D is
Ent(D) = −Σ_{c=1}^{C} p_c · log₂ p_c
with p_c the proportion of class-c samples in D, and Ent(D_i) defined in the same way on each subset D_i;
normalizing the information gain of the current equalized voice data set by the number of values taken by the splitting feature to obtain the information gain rate of the current equalized voice data set
Gain_ratio(D; D_1, ..., D_k) = Gain(D; D_1, ..., D_k) / IV(D), where IV(D) = −Σ_{i=1}^{k} (|D_i|/|D|) · log₂(|D_i|/|D|);
computing the Gini coefficient of the current equalized voice data set
Gini(D; D_1, ..., D_k) = Σ_{i=1}^{k} (|D_i|/|D|) · Gini(D_i)
wherein
Gini(D_i) = 1 − Σ_{c=1}^{C} p_c²;
linearly combining the information gain rate and the Gini coefficient of the current equalized voice data set with the two factors,
ψ(D; D_1, ..., D_k) = α[β₁ · Gini(D; D_1, ..., D_k) − β₂ · Gain_ratio(D; D_1, ..., D_k)],
and adaptively splitting the decision tree nodes of the double-factor random forest model accordingly; wherein α is a random factor and β₁, β₂ are the balance factors of the node splitting indexes;
constructing the double-factor random forest model according to the adaptive splitting of the decision tree nodes;
judging whether the out-of-bag error of the double-factor random forest model reaches a preset out-of-bag error value; if so, outputting the double-factor random forest model under a preset double-factor condition; otherwise, updating the two factors used for the adaptive splitting of the decision tree nodes and reconstructing the double-factor random forest model;
s104: inputting the current balanced voice data set into the double-factor random forest model, and outputting a classification evaluation index and an out-of-bag error classification rate of the current balanced voice data set under a preset double-factor condition;
s105: judging whether the classification evaluation index is converged, and if the classification evaluation index is converged, outputting the current balanced voice data set; and if the classification evaluation index diverges, updating the oversampling rate of the oversampling SMOTE and the undersampling rate of the undersampling ENN according to the out-of-bag error classification rate, returning to execute the step S102 until the classification evaluation index converges, and outputting the current balanced voice data set.
2. The method of claim 1, wherein analyzing the minority class samples of the voice data feature set using the oversampling SMOTE and generating new target minority class samples from them, analyzing the nearest neighbor samples of the target minority class samples and of the majority class samples in the voice data feature set using the undersampling ENN, and deleting target minority class samples and majority class samples according to those nearest neighbor samples to obtain a current equalized voice data set comprises:
s201: analyzing the minority sample S using the oversampled SMOTE min And according to the minority class samples S min Generating a sample T gen The sample T is gen Store to minority sample space K min []Performing the following steps; wherein, sample C gen =count(K min );
S202: judging the sample C gen Whether less than the number M of samples that the oversampled SMOTE needs to generate up If C is gen <M up If not, returning to execute the step S201, otherwise, executing the step S203; wherein M is up = few class samples S min X oversampling ratio N 1
S203: analyzing the sample T with the undersampled ENN gen And a plurality of classes of samples S in the speech data feature set maj If said sample T is a nearest neighbor sample gen The nearest neighbor samples of (A) are k and k or more and the samples T gen If the samples are of different types, deleting K min []Of the corresponding sample T gen If said plurality of samples S maj The nearest neighbor samples of (A) are k and more than k and the plurality of types of samples S maj If the samples with different categories are selected, deleting the samples S with most categories maj (ii) a Wherein the undersampled ENN deleted samples T del =T gen +S maj
S204: determining the samples T deleted by the undersampled ENN del Whether less than the number M of samples that the undersampled ENN needs to delete down If T is del <M down Returning to execute the step S203, otherwise, outputting the current balanced voice data set; wherein, M down = majority class sample S maj X undersampling rate N 2
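To make steps S201 to S204 concrete, here is a rough Python sketch under stated assumptions: scikit-learn's NearestNeighbors performs the neighbour searches, the "k or more of the k nearest neighbours belong to a different class" deletion test follows the claim wording literally, and the ENN step is a single capped pass where the claim loops back to S203.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_generate(X_min, m_up, k=5, rng=None):
    """S201-S202: synthesise minority samples until M_up have been generated."""
    rng = np.random.default_rng(rng)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)            # column 0 is the sample itself
    generated = []
    while len(generated) < m_up:             # C_gen < M_up: keep generating
        i = rng.integers(len(X_min))
        j = rng.choice(idx[i, 1:])           # one of the k nearest neighbours
        t = X_min[i] + rng.random() * (X_min[j] - X_min[i])
        generated.append(t)
    return np.asarray(generated)

def enn_filter(X, y, m_down, k=5):
    """S203-S204: delete samples whose k nearest neighbours disagree with
    their own label, stopping once M_down deletions have been made."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    keep = np.ones(len(X), dtype=bool)
    deleted = 0
    for i in range(len(X)):
        if deleted >= m_down:                # T_del has reached M_down
            break
        # literal reading of S203: k or more of the k neighbours differ
        if np.sum(y[idx[i, 1:]] != y[i]) >= k:
            keep[i] = False
            deleted += 1
    return X[keep], y[keep]
```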
3. The method of claim 2, wherein analyzing the minority class samples S_min using the oversampling SMOTE and generating a sample T_gen from the minority class samples S_min comprises:
searching among the minority class samples S_min for the k nearest neighbor samples S_min_i;
assuming the number of samples that the oversampling SMOTE generates is M_up, randomly selecting M_up samples from the S_min_i, the M_up samples being denoted S_min_1, S_min_2, ..., S_min_j;
performing a random interpolation operation on S_min_i and S_min_j to generate the sample T_gen = S_min_i + rand(0,1)·(S_min_j − S_min_i); wherein rand(0,1) denotes a random number in the interval (0,1), and i = 1, 2, ..., M_up.
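A worked instance of the interpolation above, with purely illustrative values:

```python
import numpy as np
s_i = np.array([1.0, 2.0])       # a minority class sample S_min_i
s_j = np.array([3.0, 6.0])       # a selected neighbour S_min_j
r = 0.25                         # stands in for rand(0,1)
t_gen = s_i + r * (s_j - s_i)    # -> array([1.5, 3.0])
```

The synthetic sample always lies on the line segment between S_min_i and S_min_j, which is what keeps SMOTE's new points inside the minority class region.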
4. The method of claim 2, wherein if the classification evaluation index diverges, updating the oversampling rate of the oversampling SMOTE and the undersampling rate of the undersampling ENN according to the out-of-bag misclassification rate, returning to execute step S102 until the classification evaluation index converges, and outputting the current equalized voice data set comprises:
if the classification evaluation index diverges, updating the oversampling rate of the oversampling SMOTE and the undersampling rate of the undersampling ENN according to the out-of-bag misclassification rate, initializing the out-of-bag misclassification rate and T_gen, M_up, M_down and T_del, returning to execute step S102 until the classification evaluation index converges, and outputting the current equalized voice data set.
5. The method of claim 1, wherein if the classification evaluation index diverges, updating the oversampling rate of the oversampling SMOTE and the undersampling rate of the undersampling ENN according to the out-of-bag misclassification rate, returning to execute step S102 until the classification evaluation index converges, and outputting the current equalized voice data set comprises:
if the classification evaluation index diverges, then according to
[formula image FDA0003874061400000041: the update rule for the oversampling rate N_1 and the undersampling rate N_2, expressed in terms of OOB_mis_rate, N_mis_maj, N_mis_min_i and minclass; the formula itself is not recoverable from this extraction]
updating the oversampling rate of the oversampling SMOTE and the undersampling rate of the undersampling ENN, returning to execute step S102 until the classification evaluation index converges, and outputting the current equalized voice data set;
wherein OOB_mis_rate is the out-of-bag misclassification rate of the double-factor random forest model, N_mis_maj is the number of misclassified majority class samples, N_mis_min_i is the number of misclassified samples of the i-th minority class, and minclass is the number of minority classes.
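The update rule itself lives in the formula image above and cannot be reproduced here, but the bookkeeping the claim names is straightforward. A minimal sketch, assuming the overall rate is the fraction of out-of-bag predictions that are wrong (one plausible reading, not confirmed by the source):

```python
import numpy as np

def oob_error_counts(y_true, y_oob_pred, maj_label):
    """Per-class out-of-bag misclassification counts used by claim 5:
    N_mis_maj for the majority class, N_mis_min_i for each minority class."""
    wrong = y_true != y_oob_pred
    n_mis_maj = int(np.sum(wrong & (y_true == maj_label)))
    minority_classes = [c for c in np.unique(y_true) if c != maj_label]
    n_mis_min = {c: int(np.sum(wrong & (y_true == c))) for c in minority_classes}
    # Assumption: OOB_mis_rate as the overall fraction of wrong OOB predictions.
    oob_mis_rate = float(wrong.mean())
    return oob_mis_rate, n_mis_maj, n_mis_min
```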
6. A voice sample equalization device combining mixed sampling and random forest, comprising:
an acquisition module, configured to acquire an initial voice data set and perform feature extraction on the initial voice data set to obtain an extracted voice data feature set;
an analysis module, configured to analyze the minority class samples of the voice data feature set using the oversampling SMOTE and generate new target minority class samples from them, analyze the nearest neighbor samples of the target minority class samples and of the majority class samples in the voice data feature set using the undersampling ENN, and delete target minority class samples and majority class samples according to those nearest neighbor samples to obtain a current equalized voice data set;
a construction module, configured to calculate the information gain rate and the Gini coefficient of the current equalized voice data set and linearly combine them with two factors to construct a double-factor random forest model, which comprises:
calculating the information gain rate and the Gini coefficient of the current equalized voice data set, linearly combining them with the two factors, and adaptively splitting the decision tree nodes of the double-factor random forest model, which comprises:
dividing the current equalized voice data set D into subsets D_1, ..., D_k, and calculating the information gain of the current equalized voice data set
Gain(D; D_1, ..., D_k) = Ent(D) − Σ_{i=1}^{k} (|D_i|/|D|) · Ent(D_i)
wherein the entropy of the current equalized voice data set D is
Ent(D) = −Σ_{c=1}^{C} p_c · log₂ p_c
with p_c the proportion of class-c samples in D, and Ent(D_i) defined in the same way on each subset D_i;
normalizing the information gain of the current equalized voice data set by the number of values taken by the splitting feature to obtain the information gain rate of the current equalized voice data set
Gain_ratio(D; D_1, ..., D_k) = Gain(D; D_1, ..., D_k) / IV(D), where IV(D) = −Σ_{i=1}^{k} (|D_i|/|D|) · log₂(|D_i|/|D|);
computing the Gini coefficient of the current equalized voice data set
Gini(D; D_1, ..., D_k) = Σ_{i=1}^{k} (|D_i|/|D|) · Gini(D_i)
wherein
Gini(D_i) = 1 − Σ_{c=1}^{C} p_c²;
linearly combining the information gain rate and the Gini coefficient of the current equalized voice data set with the two factors,
ψ(D; D_1, ..., D_k) = α[β₁ · Gini(D; D_1, ..., D_k) − β₂ · Gain_ratio(D; D_1, ..., D_k)],
and adaptively splitting the decision tree nodes of the double-factor random forest model accordingly; wherein α is a random factor and β₁, β₂ are the balance factors of the node splitting indexes;
constructing the double-factor random forest model according to the adaptive splitting of the decision tree nodes;
judging whether the out-of-bag error of the double-factor random forest model reaches the preset out-of-bag error value; if so, outputting the double-factor random forest model under the preset double-factor condition; otherwise, updating the two factors used for the adaptive splitting of the decision tree nodes and reconstructing the double-factor random forest model;
an input module, configured to input the current equalized voice data set into the double-factor random forest model and output a classification evaluation index and an out-of-bag misclassification rate of the current equalized voice data set under the preset double-factor condition;
a judging module, configured to judge whether the classification evaluation index converges; if it converges, output the current equalized voice data set; if it diverges, update the oversampling rate of the oversampling SMOTE and the undersampling rate of the undersampling ENN according to the out-of-bag misclassification rate and return to the analysis module until the classification evaluation index converges, then output the current equalized voice data set.
7. A voice sample equalization apparatus that combines hybrid sampling and random forest, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the voice sample equalization method combining mixed sampling and random forest according to any one of claims 1 to 5 when executing the computer program.
8. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the voice sample equalization method combining mixed sampling and random forest according to any one of claims 1 to 5.
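Stepping back from the individual claims, the sketch below shows one way the whole S101 to S105 loop could be wired together. It is a hypothetical driver, not the patented implementation: smote_generate and enn_filter refer to the sketch after claim 2, scikit-learn's RandomForestClassifier stands in for the double-factor forest (an off-the-shelf library cannot express the custom ψ split criterion), and the rate update at the bottom is a placeholder for the claim-5 formula image.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def equalize(X_min, X_maj, n1=1.0, n2=0.5, k=5, max_iter=10, tol=1e-3):
    prev = None
    for _ in range(max_iter):
        # S102: hybrid sampling (sketch functions from claim 2 slot in here)
        X_syn = smote_generate(X_min, m_up=int(len(X_min) * n1), k=k)
        X = np.vstack([X_min, X_syn, X_maj])
        y = np.array([1] * (len(X_min) + len(X_syn)) + [0] * len(X_maj))
        X, y = enn_filter(X, y, m_down=int(len(X_maj) * n2), k=k)
        # S103-S104: fit a forest and read its out-of-bag score
        rf = RandomForestClassifier(n_estimators=100, oob_score=True,
                                    random_state=0).fit(X, y)
        score = rf.oob_score_
        # S105: stop once the evaluation index has converged
        if prev is not None and abs(score - prev) < tol:
            break
        prev = score
        # Placeholder update -- NOT the patent's rule: raise the oversampling
        # rate and lower the undersampling rate with the OOB error.
        n1 *= 1.0 + (1.0 - score)
        n2 *= max(0.1, 1.0 - (1.0 - score))
    return X, y
```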
CN202210083571.0A 2022-01-17 2022-01-17 Voice sample equalization method combining mixed sampling and random forest Active CN114550697B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210083571.0A CN114550697B (en) 2022-01-17 2022-01-17 Voice sample equalization method combining mixed sampling and random forest


Publications (2)

Publication Number Publication Date
CN114550697A CN114550697A (en) 2022-05-27
CN114550697B (en) 2022-11-18

Family

ID=81671633

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210083571.0A Active CN114550697B (en) 2022-01-17 2022-01-17 Voice sample equalization method combining mixed sampling and random forest

Country Status (1)

Country Link
CN (1) CN114550697B (en)


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11501304B2 (en) * 2020-03-11 2022-11-15 Synchrony Bank Systems and methods for classifying imbalanced data
US20210097449A1 (en) * 2020-12-11 2021-04-01 Intel Corporation Memory-efficient system for decision tree machine learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273909A (en) * 2016-04-08 2017-10-20 上海市玻森数据科技有限公司 The sorting algorithm of high dimensional data
CN111202526A (en) * 2020-01-20 2020-05-29 华东医院 Method for simplifying and optimizing multi-dimensional elderly auditory function evaluation system

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
A hybrid sampling algorithm combining M-SMOTE and ENN based on Random forest for medical imbalanced data; Zhaozhao Xu et al.; Journal of Biomedical Informatics; 2020-07-05; full text *
Class-imbalanced voice pathology classification: Combining hybrid sampling with optimal two-factor random forests; X Zhang et al.; Applied Acoustics; 2022-01-20; full text *
Solving the Problem of Class Imbalance in the Prediction of Hotel Cancelations: A Hybridized Machine Learning Approach; Mohd Adil; Processes; 2021-09-21; full text *
Research on ECG-assisted diagnosis and treatment applications based on SMOTE+ENN and random forest; Shang Ziwei; China Master's Theses Full-text Database; 2020-03-15; full text *
Research on pathological voice recognition combining multi-band nonlinear methods; Zhao Pinhui; Informatization Research; 2019-06-20; full text *

Also Published As

Publication number Publication date
CN114550697A (en) 2022-05-27

Similar Documents

Publication Publication Date Title
CN107564513B (en) Voice recognition method and device
CN106952644A (en) A kind of complex audio segmentation clustering method based on bottleneck characteristic
JP7024515B2 (en) Learning programs, learning methods and learning devices
US11462210B2 (en) Data collecting method and system
US20210224647A1 (en) Model training apparatus and method
CN108877947B (en) Depth sample learning method based on iterative mean clustering
CN106971180B (en) A kind of micro- expression recognition method based on the sparse transfer learning of voice dictionary
JP6979203B2 (en) Learning method
JP2014026455A (en) Media data analysis device, method and program
CN112348068B (en) Time sequence data clustering method based on noise reduction encoder and attention mechanism
CN108154186B (en) Pattern recognition method and device
CN109344751B (en) Reconstruction method of noise signal in vehicle
WO2020214253A1 (en) Condition-invariant feature extraction network for speaker recognition
CN104077598A (en) Emotion recognition method based on speech fuzzy clustering
CN106384587B (en) A kind of audio recognition method and system
Fan et al. Modeling voice pathology detection using imbalanced learning
US20180061395A1 (en) Apparatus and method for training a neural network auxiliary model, speech recognition apparatus and method
CN114550697B (en) Voice sample equalization method combining mixed sampling and random forest
CN117527495A (en) Modulation mode identification method and device for wireless communication signals
JP2014215385A (en) Model estimation system, sound source separation system, model estimation method, sound source separation method, and program
Saeidi et al. Particle swarm optimization for sorted adapted gaussian mixture models
Choi et al. Adversarial speaker-consistency learning using untranscribed speech data for zero-shot multi-speaker text-to-speech
CN114048770B (en) Automatic detection method and system for digital audio deletion and insertion tampering operation
CN115472179A (en) Automatic detection method and system for digital audio deletion and insertion tampering operation
Farsi et al. Implementation and optimization of a speech recognition system based on hidden Markov model using genetic algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant