CN114818947A - Training method and device of classifier and classification device - Google Patents
- Publication number
- CN114818947A (application number CN202210482911.7A)
- Authority
- CN
- China
- Prior art keywords
- samples
- sample set
- sample
- unbalanced
- classifier
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G06F18/24323—Tree-organised classifiers
- G06N3/047—Probabilistic or stochastic networks
- G06N3/08—Learning methods (neural networks)
Abstract
The disclosure provides a training method and apparatus for a classifier and a classification apparatus, and relates to the field of computer technology, in particular to the field of deep learning. The specific implementation scheme is as follows: acquiring an original unbalanced sample set; generating a first sample set from the original unbalanced sample set and random noise, wherein the first sample set includes samples labeled as positive samples and samples labeled as negative samples; removing samples that do not meet a preset condition from the first sample set to obtain a target sample set; and training a target classifier by taking the union of the original unbalanced sample set and the target sample set as a training data set, wherein the target classifier is used for completing the classification of an unbalanced sample set to be classified.
Description
Technical Field
The present disclosure relates to the field of computer technology, and more particularly, to the field of deep learning.
Background
The problem of class imbalance in data, also called data skew, is often encountered in machine learning. Data skew occurs in many practical application scenarios, such as disease detection, credit card fraud detection, network intrusion detection, and so on. Data skew severely degrades the results of machine learning algorithms: the algorithm becomes biased toward the labels with a large amount of data and performs poorly on the labels with a small amount of data. For the problem of data skew, the related art often adopts random resampling to balance the numbers of samples in the different classes; specifically, an undersampling method randomly removes samples from the majority class during training to reduce the number of majority-class samples, and an oversampling method randomly selects minority-class samples from the original data set and copies them to increase the number of minority-class samples. Alternatively, the importance of low-confidence classes can be increased by decreasing the importance of high-confidence classes. In addition, data enhancement may also be used to address data skew; for example, the Synthetic Minority Oversampling Technique (SMOTE) uses linear interpolation to generate new data for the minority class.
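For illustration only (the following is not part of the disclosed method), a minimal sketch of the two resampling baselines described above, random under/oversampling and SMOTE-style linear interpolation, assuming a feature matrix X and binary labels y held in NumPy arrays; the function names and parameter choices are hypothetical.

```python
import numpy as np

def random_resample(X, y, rng=np.random.default_rng(0)):
    """Balance two classes by undersampling the majority and oversampling the minority."""
    maj, mino = (0, 1) if (y == 0).sum() > (y == 1).sum() else (1, 0)
    maj_idx, min_idx = np.flatnonzero(y == maj), np.flatnonzero(y == mino)
    target = (len(maj_idx) + len(min_idx)) // 2
    keep_maj = rng.choice(maj_idx, size=target, replace=False)  # undersample the majority class
    keep_min = rng.choice(min_idx, size=target, replace=True)   # oversample (copy) the minority class
    idx = np.concatenate([keep_maj, keep_min])
    return X[idx], y[idx]

def smote_like(X_min, n_new, k=5, rng=np.random.default_rng(0)):
    """Generate new minority samples by linear interpolation between nearest neighbours."""
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        dists = np.linalg.norm(X_min - X_min[i], axis=1)
        j = rng.choice(np.argsort(dists)[1:k + 1])  # one of the k nearest neighbours
        lam = rng.random()                          # interpolation coefficient in [0, 1]
        new.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.asarray(new)
```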
Disclosure of Invention
The present disclosure provides a method, apparatus, device, and storage medium for classifier training.
According to an aspect of the present disclosure, there is provided a training method of a classifier, including: acquiring an original unbalanced sample set; generating a first sample set from the original unbalanced sample set and the random noise, wherein the first sample set comprises samples marked as positive samples and samples marked as negative samples; removing samples which do not accord with preset conditions from the first sample set to obtain a target sample set; and training a target classifier by taking a collection of the original unbalanced sample set and the target sample set as a training data set, wherein the target classifier is used for finishing classification of the unbalanced sample set to be classified.
Optionally, removing the samples that do not meet the preset condition from the first sample set to obtain a target sample set, including: randomly eliminating negative samples from the first sample set by using a first discriminator to obtain a second sample set, wherein the eliminating probability of all samples marked as negative samples in the first sample set is a set probability; and removing the samples identified as negative samples by the second identifier from the second sample set by using the second identifier to obtain a target sample set.
Optionally, the removing, by the second discriminator, the samples discriminated as negative samples by the second discriminator from the second sample set to obtain a target sample set, includes: obtaining labels of a plurality of predetermined first classifiers in a second discriminator for samples in a first sample set; determining samples in the second sample set marked as negative samples by the target number of first classifiers as samples identified as negative samples by the second discriminator; and eliminating the samples identified as negative samples by the second identifier to obtain a target sample set.
Optionally, the method further includes: adjusting the loss weights of the first type of samples and the second type of samples in the unbalanced sample set to be classified to enable the loss weight of the first type of samples to be smaller than that of the second type of samples; and classifying the to-be-classified unbalanced sample set by utilizing the adjusted loss weight.
Optionally, generating a first set of samples from the original set of unbalanced samples and the random noise comprises: generating a candidate sample set according to the original unbalanced sample set and the random noise; marking samples in the candidate sample set to obtain a first sample set, wherein the marking comprises: marked as positive and marked as negative.
Optionally, generating a candidate sample set from the original unbalanced sample set and the random noise comprises: carrying out normalization processing on an original sample set by using a pre-constructed Gaussian mixture model; and combining the original unbalanced sample set subjected to normalization processing with the noise randomly generated by the Gaussian mixture model to generate a candidate sample set.
Optionally, labeling the samples in the candidate sample set to obtain a first sample set, including: and labeling the candidate sample set according to the binary cross entropy loss and the sample spacing of the candidate sample set and the original unbalanced sample set, and determining the labeled candidate sample set as a first sample set.
According to another aspect of the present disclosure, there is provided a classification apparatus for an unbalanced sample set, including: a data generator, a quality controller and a classifier; a data generator for generating a first set of samples; the quality controller is used for screening the first sample set to generate a target sample set, and the target sample set and a pre-acquired collection of an original unbalanced sample set are used for training the classifier; the classifier is used for classifying the unbalanced sample set.
According to still another aspect of the present disclosure, there is provided a training apparatus of a classifier, including: the device comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring an original unbalanced sample set; a generating module, configured to generate a first sample set according to the original unbalanced sample set and the random noise, where the first sample set includes samples marked as positive samples and samples marked as negative samples; the second acquisition module is used for removing the samples which do not accord with the preset conditions from the first sample set to obtain a target sample set; and the training module is used for training the classifier by taking the collection of the original unbalanced sample set and the target sample set as a training data set, wherein the classifier is used for finishing the classification of the unbalanced sample set to be classified.
According to still another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method described above.
According to yet another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the above-described method.
According to yet another aspect of the disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method described above.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow chart of a method of training a classifier according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of an apparatus for classifying unbalanced sample sets according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a classifier training network framework according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a first stage of a classifier training method according to an embodiment of the present disclosure;
FIG. 5 is a diagram illustrating a second stage of a classifier training method according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of an apparatus for implementing a classifier training method according to an embodiment of the present disclosure;
FIG. 7a is a flow chart of a sample validation according to an embodiment of the present disclosure;
FIG. 7b is a flow chart of another sample validation according to an embodiment of the present disclosure;
FIG. 8 is a schematic block diagram of an example electronic device for implementing embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the data balancing methods adopted in the related art, an undersampling method randomly removes samples from the majority class during training to reduce the number of majority-class samples, and an oversampling method randomly selects minority-class samples from the original data set and copies them to increase the number of minority-class samples. Alternatively, the importance of low-confidence classes is increased by decreasing the importance of high-confidence classes. In addition, the SMOTE algorithm, based on linear interpolation, can be adopted to generate new data for the minority class. However, deleting majority-class samples can discard key information, which reduces the classification accuracy of the trained model, and the oversampling approach may lead to overfitting of the model because the generated extended samples are mere copies of existing ones. The reason is that these methods operate directly on the data itself and do not consider the probability distribution of the data. The same is true of methods that modify the loss function and of linear interpolation, which likewise do not use information about the data distribution. Therefore, the methods provided by the related art all lead to low classification accuracy of the trained classification model on unbalanced data.
While the embodiments of the disclosure provide a method embodiment of a training method for a classifier, it should be noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system, such as by a set of computer-executable instructions, and that, although a logical order is illustrated in the flowchart, in some cases the steps shown or described may be performed in an order different from that presented herein.
Fig. 1 is a flowchart of a training method of a classifier according to an embodiment of the present disclosure, as shown in fig. 1, the method includes the following steps:
step S102, obtaining an original unbalanced sample set;
In the technical solution provided by the above step S102 of the present disclosure, original unbalanced data sets widely exist in application scenarios of deep learning, for example: image processing, relational data processing, and the like. In a practical application scenario, part of the samples may also be extracted from the unbalanced samples to be classified and used as the original unbalanced sample set. The original unbalanced sample set is a sample set in which the numbers of samples of different classes in a classification task differ greatly, usually a sample set in which the ratio of majority-class samples to minority-class samples is significantly greater than 1:1. Unbalanced samples are frequently encountered in daily life, for example: fraudulent transactions account for a very small fraction of the total transaction volume, and in some tasks attention is in fact paid to the minority-class samples. For example, in fraud identification, the ratio of minority-class samples to majority-class samples may be 1:1000; if a sample set with this ratio is used directly for learning, a classification model that predicts all samples as the majority class can easily be learned. The influence of sample imbalance is therefore that the model learns the prior information of the sample proportion in the training data set and thus emphasizes the majority-class samples when actually predicting. There are currently many ways to acquire a sample set, for example: questionnaires, web crawling, or statistics over existing data in a database.
Step S104, generating a first sample set according to the original unbalanced sample set and the random noise, wherein the first sample set comprises samples marked as positive samples and samples marked as negative samples;
In the technical solution provided in the above step S104 of the present disclosure, there are various ways to generate the first sample set from the original unbalanced sample set and random noise; a generative adversarial network may generally be adopted to generate the first sample set. Samples labeled as positive samples and samples labeled as negative samples can also be regarded as samples labeled as true and samples labeled as false by the discriminator in the generative adversarial network. It should be noted that the first sample set includes both kinds of samples.
Step S106, samples which do not accord with preset conditions are removed from the first sample set to obtain a target sample set;
In the technical solution provided in the above step S106 of the present disclosure, samples that do not meet the preset condition are removed from the first sample set, and the sample generation rate and the sample quality are controlled by setting the preset condition. For example, the generation rate of samples can be increased by relaxing the preset condition so as to quickly complete the sample generation stage, and the generation rate can be reduced, while the quality of the generated samples is controlled, by tightening the preset condition. The preset condition includes, but is not limited to: a preset probability with which samples in the first sample set are rejected, or the discrimination results of a target number of pre-trained classifiers.
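As an illustration of the probability-based preset condition (a sketch only, since the disclosure does not prescribe an implementation), the following assumes that the generated samples are held as (features, is_positive) pairs and that each negative sample is rejected with a set probability, such as the 0.8 used in the example below; the function name is hypothetical.

```python
import random

def randomly_reject_negatives(samples, reject_prob=0.8, seed=0):
    """Traverse generated samples and drop each negative one with probability reject_prob.

    samples: list of (features, is_positive) tuples produced by the generator;
    the negatives that survive the draw are kept and, per the description above,
    are subsequently treated as positive samples by the first discriminator.
    """
    rng = random.Random(seed)
    kept = []
    for features, is_positive in samples:
        if not is_positive and rng.random() < reject_prob:
            continue  # this negative sample is rejected
        kept.append((features, is_positive))
    return kept
```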
In this embodiment, taking as an example a preset condition in which each sample labeled as a negative sample by the generative adversarial network is rejected with probability 0.8, each negative sample in the first sample set is traversed and randomly rejected until the last sample is reached. Alternatively, negative samples in the first sample set may be randomly drawn for rejection, each drawn sample being given a flag, until the flag added to the last negative sample indicates that all negative samples in the first sample set have been drawn.

Step S108: train a target classifier by taking the union of the original unbalanced sample set and the target sample set as a training data set, wherein the target classifier is used for completing the classification of the unbalanced sample set to be classified.
The classifier can be a classification model, or a computer device carrying the classification model. There are various kinds of classifiers, for example: decision tree classifiers, selection tree classifiers, and evidence classifiers. The main factors influencing the accuracy of a classifier are the number of samples and whether the sample distribution is balanced.
In the technical solution provided in the above step S108 of the present disclosure, the target sample set is added to the original unbalanced sample set to balance the distribution of the samples, so as to obtain a training data set with balanced sample distribution to train the classifier.
In the related art, generative adversarial networks (GANs) are widely applied in fields such as style migration, image synthesis, image super-resolution, image restoration, and relational data, and can learn a data set whose distribution is similar to that of an original data set. However, in the case of data skew, the quality of the data generated by a GAN in the related art is still not high. Conventional GANs can be used to address the data skew problem, but relying directly on randomly generated samples from a GAN may produce samples of uncontrolled quality, thereby degrading the performance of the unbalanced classification model. When verifying the quality of a generated sample, the quality of the generated features can be measured by their similarity to the features of the original samples, and the quality of the generated label can be measured by the probability that the label corresponds to the generated features under the original data distribution.
Through the above steps S102 to S108, an original unbalanced sample set is obtained; a first sample set is generated from the original unbalanced sample set and random noise, wherein the first sample set includes samples labeled as positive samples and samples labeled as negative samples; samples that do not meet the preset condition are removed from the first sample set to obtain a target sample set; and a target classifier is trained by taking the union of the original unbalanced sample set and the target sample set as a training data set, wherein the target classifier is used for completing the classification of the unbalanced sample set to be classified. That is, the generation speed of the samples and the quality of the generated samples can be controlled by adjusting the preset condition, so that high-quality target samples are obtained stably; the target samples are added to the original unbalanced sample set, achieving the purpose of balancing the original unbalanced sample set, solving the technical problem of low classification accuracy of the classifier, and achieving the technical effect of improving the classification accuracy of the classifier.
The above-described method of this embodiment is further described below.
As an alternative implementation of step S104, the first sample set is generated from the original unbalanced sample set and random noise: a candidate sample set may be generated by inputting the original unbalanced sample set and the random noise into a generative adversarial network model, and the samples in the candidate sample set are then labeled to obtain the first sample set, where a sample may be labeled as a positive sample or as a negative sample.
The candidate sample set can be generated by first normalizing the original unbalanced sample set with a pre-constructed Gaussian mixture model, and then combining the normalized original unbalanced sample set with noise randomly generated by the Gaussian mixture model.
A Gaussian mixture model decomposes an object into several components based on Gaussian probability density functions (normal distribution curves) and thereby quantizes the object accurately. Normalizing the original unbalanced sample set with the Gaussian mixture model maps the feature values of each sample into the same interval, eliminates the influence of dimension on the final result, makes different features comparable, gives features whose original distributions differ greatly the same weight in the model, and improves the convergence rate of the model.
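As a sketch only, since the patent does not specify an implementation, the following shows one way to map the features of the original samples into a common interval, fit a Gaussian mixture model, and draw random noise from the fitted mixture; the library choices (NumPy, scikit-learn) and the parameter values are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_mixture_and_noise(X_orig, n_components=4, n_noise=256, seed=0):
    """Normalize features to [0, 1] and sample random noise from a fitted Gaussian mixture."""
    # Map every feature of the original unbalanced samples into the same interval.
    x_min, x_max = X_orig.min(axis=0), X_orig.max(axis=0)
    X_norm = (X_orig - x_min) / np.where(x_max > x_min, x_max - x_min, 1.0)

    # Fit a Gaussian mixture model to the normalized samples and draw random noise from it.
    gmm = GaussianMixture(n_components=n_components, random_state=seed).fit(X_norm)
    noise, _ = gmm.sample(n_noise)
    return X_norm, noise
```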
It will be appreciated that inputting random noise into the generator converts the random noise into a distribution that conforms to the samples in the original unbalanced sample set, yielding candidate samples from which the candidate sample set is composed. Preliminarily labeling the generated samples with the generative adversarial network gives a preliminary indication of their quality and improves the accuracy of labeling in the subsequent steps. The random noise may come from a normal distribution, a uniform distribution, or any other distribution, which is not described further here. It should be noted that, in the process of labeling the candidate sample set, the candidate sample set may be labeled according to the binary cross entropy loss and the sample distance between the candidate sample set and the original unbalanced sample set, and the labeled candidate sample set is determined as the first sample set; for ease of expression, the sample distance in this embodiment of the disclosure is the Wasserstein distance.
In this embodiment, the original unbalanced sample set is normalized using a Gaussian mixture model, and the randomly generated seed is sent to the generator of the generative adversarial network together with the original unbalanced sample set to generate new samples. In an alternative approach, the neural network of the generator may be updated using Formula 1.
L_G = L_rec(Z, X) − L_D(Z) + L_C(Z) = ∑_{z∈Z, x∈X} ‖z − x‖² − L_D(Z) + L_C(Z)    (Formula 1)

In the formula, X denotes the normalized original unbalanced sample set, Z denotes the candidate sample set, L_rec(Z, X) denotes the reconstruction loss, L_D(Z) denotes the discrimination loss for the candidate samples, L_C(Z) denotes the multi-class loss function for the candidate samples and the original unbalanced samples, x denotes a sample in X, and z denotes a sample in Z.
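For illustration, a sketch of how Formula 1 could be assembled in PyTorch, assuming paired tensors gen_samples (z) and real_samples (x) and two externally computed loss terms; the names, the pairing of z with x, and the use of PyTorch are assumptions rather than part of the disclosure.

```python
import torch

def generator_loss(gen_samples, real_samples, disc_loss, multiclass_loss):
    """Formula 1: reconstruction loss minus discrimination loss plus multi-class loss.

    gen_samples, real_samples: tensors of shape (batch, features), treated as paired z and x;
    disc_loss:       L_D(Z), computed by the feature discriminator;
    multiclass_loss: L_C(Z), computed over candidate and original samples.
    """
    rec_loss = torch.sum((gen_samples - real_samples) ** 2)  # L_rec(Z, X) = sum of ||z - x||^2
    return rec_loss - disc_loss + multiclass_loss
```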
The positive and negative samples are labeled with binary cross entropy loss and sample spacing, as shown in equation 2.
L_D(Z) = L_BCE(Z) + L_WGAN-GP(Z)    (Formula 2)

In the formula, L_BCE(Z) denotes the binary cross entropy loss, and L_WGAN-GP(Z) denotes the Wasserstein distance with a gradient penalty.
L_BCE(Z) can be obtained by the following Formula 3.

L_BCE(Z) = ∑_{x∈X∪Z} −[1|_{x∈X} log D(x) + 1|_{x∈Z} log(1 − D(x))]    (Formula 3)

In the formula, D(x) denotes the probability that a sample is a positive sample, and 1|_{x∈X} and 1|_{x∈Z} are indicator functions.
The Wasserstein distance of the samples can be obtained by Formula 4.

L_WGAN-GP(Z) = E_{x∼X}[D(x)] − E_{x∼Z}[D(G(x))] + λ·E_{x̂}[(‖∇_{x̂} D(x̂)‖₂ − 1)²]    (Formula 4)

In the formula, E_{x∼X}[D(x)] − E_{x∼Z}[D(G(x))] denotes the original Wasserstein loss between the two sets, the last term denotes the gradient penalty loss computed on interpolated samples x̂, and λ denotes a tuning parameter.
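A sketch of the gradient-penalty term in Formula 4 as it is commonly computed in PyTorch; the interpolation scheme and the default lambda = 10 follow standard WGAN-GP practice and are assumptions, not taken from the disclosure.

```python
import torch

def gradient_penalty(discriminator, real, fake, lam=10.0):
    """Penalize the discriminator's gradient norm on points interpolated between real and fake samples."""
    alpha = torch.rand(real.size(0), 1, device=real.device)
    interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    d_interp = discriminator(interp)
    grads = torch.autograd.grad(
        outputs=d_interp, inputs=interp,
        grad_outputs=torch.ones_like(d_interp),
        create_graph=True, retain_graph=True,
    )[0]
    return lam * ((grads.norm(2, dim=1) - 1) ** 2).mean()
```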
As an optional implementation of step S106, the target sample set may be obtained by first randomly removing, using a first discriminator, samples marked as negative samples from the first sample set to obtain a second sample set, where each sample marked as a negative sample in the first sample set is removed with a set probability; and then removing, using a second discriminator, the samples discriminated as negative samples by the second discriminator from the second sample set to obtain the target sample set.
It should be noted that the present disclosure uses two discriminators in the above manner to adjust the preset condition and thereby control the quality of the generated samples.
The main function of the first discriminator is to improve the sample generation rate through random elimination: instead of directly eliminating every sample marked as a negative sample in the first sample set, it eliminates only part of them at random, retains the remaining negative samples, and uniformly treats the negative samples that were not eliminated as samples marked as positive samples, i.e., as positive samples, thereby improving the sample generation rate.
The second sample set, which the first discriminator regards as consisting of positive samples, is input into the second discriminator for a second round of discrimination to control the quality of the generated samples; that is, the second discriminator is used to eliminate the negative samples in the second sample set to obtain the target sample set.
Specifically, the target sample set may be obtained as follows: obtain the labels given to the samples in the first sample set by a plurality of predetermined first classifiers in the second discriminator; determine the samples in the second sample set marked as negative samples by a target number of first classifiers as the samples discriminated as negative samples by the second discriminator; and eliminate the samples discriminated as negative samples by the second discriminator to obtain the target sample set.
It should be noted that the second discriminator includes a plurality of pre-trained first classifiers, which can be obtained by using various training methods in the related art, and details are not repeated herein. Since the first classifier is trained in a conventional manner in the related art, the discrimination performance is low, and the discrimination accuracy can be increased by voting with a plurality of first classifiers.
The multiple first classifiers verify the same sample and then participate in voting, and the truth of the sample is determined by means of voting, so that the accuracy of sample verification can be further improved, and the sample with high quality can be obtained.
Samples in the second sample set marked as negative samples by at least the target number of first classifiers are rejected as negative samples. In the actual implementation process, the quality of sample identification and the sample generation rate can be adjusted by adjusting the size of the target number: the target number can be reduced at the initial stage of training the target classifier to improve the sample generation rate, and increased at the later stage of training the target classifier to improve the quality of the generated samples. Through the dual regulation of the first discriminator and the second discriminator, the generation rate and the quality of the generated samples can be adjusted flexibly, which further improves the training efficiency of the classification model.
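The voting step of the second discriminator described above can be sketched as follows, assuming each pre-trained first classifier exposes a predict method returning 0 for a negative sample and 1 for a positive sample; the names and the example threshold are illustrative assumptions.

```python
def second_discriminator_filter(second_sample_set, first_classifiers, vote_threshold=3):
    """Reject samples that at least vote_threshold pre-trained classifiers label as negative.

    second_sample_set: feature vectors that survived the first discriminator's random rejection;
    first_classifiers: pre-trained classifiers, each exposing predict(features) -> 0 or 1.
    """
    target_set = []
    for features in second_sample_set:
        negative_votes = sum(1 for clf in first_classifiers if clf.predict(features) == 0)
        if negative_votes < vote_threshold:  # not enough negative votes, so keep the sample
            target_set.append(features)
    return target_set
```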
Optionally, in the process of classifying the unbalanced sample set to be classified with the target classifier trained by the method provided by the present disclosure, the loss weights of the first type of samples and the second type of samples in the unbalanced sample set to be classified can be adjusted so that the loss weight of the first type of samples is smaller than that of the second type of samples, and the unbalanced sample set to be classified is classified using the adjusted loss weights.
The first type of samples may be the majority-class samples in the unbalanced sample set, and the second type of samples may be the minority-class samples. Increasing the loss weight of the second type of samples improves the recognition accuracy for minority-class samples.
In this embodiment, the total loss function of the target classifier can be represented by Formula 5, in which L_C(X′, Y) denotes the loss over the original unbalanced sample set together with the generated sample set, MIC denotes the set of second-type (minority) samples, MAC denotes the set of first-type (majority) samples, and L denotes the cross entropy of a single sample; the losses over MAC and MIC are weighted so that the weight of the first-type samples is smaller than that of the second-type samples.
Formula 6 gives the definition of the cross entropy.

L(x) = −∑_{c∈C} y_c log P(x, c)    (Formula 6)

In the formula, C denotes the set of all classes, P(x, c) denotes the probability, computed by the target classifier, that sample x belongs to class c, and y_c equals 1 if c is the true class of x and 0 otherwise.
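A sketch of a class-weighted cross-entropy loss of the kind described above, giving second-type (minority) samples a larger weight; the weight values, the binary setting, and the use of PyTorch are assumptions.

```python
import torch
import torch.nn.functional as F

def weighted_cross_entropy(logits, labels, minority_class=1, w_minority=5.0, w_majority=1.0):
    """Per-sample cross entropy in which minority-class samples are weighted more heavily."""
    per_sample = F.cross_entropy(logits, labels, reduction="none")
    weights = torch.where(labels == minority_class,
                          torch.full_like(per_sample, w_minority),
                          torch.full_like(per_sample, w_majority))
    return (weights * per_sample).sum() / weights.sum()
```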
Before the original unbalanced sample set and the random noise are used to generate samples, the original unbalanced sample set is normalized, so that the data are proportionally scaled into the same interval; this speeds up finding the optimal solution by gradient descent and can improve the accuracy of model training.
Distinguishing true and false samples by the binary cross entropy loss and the sample distance between the candidate sample set and the original unbalanced sample set improves the accuracy of sample labeling and thus the quality of the generated samples. Training the classifier with the high-quality generated samples and the original unbalanced samples as the training data set improves the training accuracy of the classifier and, ultimately, the classification performance of the target classifier.
The embodiment of the present disclosure further provides a classification apparatus for an unbalanced sample set, as shown in fig. 2, including: a data generator 20, a quality controller 22 and a classifier 24; the data generator 20 is for generating a first set of samples; the quality controller 22 is configured to screen the first sample set to generate a target sample set, and the target sample set and a pre-acquired collection of original unbalanced sample sets are used to train the classifier 24; the classifier 24 is used to classify the set of unbalanced samples.
Wherein the data generator 20 includes: the system comprises a sample generator 201 and a feature discriminator 202, wherein the sample generator 201 is used for generating a candidate sample set, and the feature discriminator 202 is used for labeling the candidate sample set to obtain a first sample set. The quality controller 22 includes: the first discriminator 221 and the second discriminator 222, the first discriminator 221 is used for randomly rejecting part of negative samples in the first sample set to obtain a second sample set; the second discriminator 222 is configured to perform secondary discrimination on the samples in the second sample set to eliminate the samples discriminated as negative samples to obtain the target sample set.
Fig. 3 shows a classifier training network framework applied in the present disclosure. As shown in fig. 3, random noise and the original unbalanced sample set output by the Gaussian mixture model are input into a GAN (generative adversarial network) to generate samples; the samples are labeled by a feature discriminator and then input into the quality controller, where screening by a semantic discriminator and a label discriminator yields a high-quality target sample set that is used together with the original unbalanced sample set to train the classifier.
The sample generator, feature discriminator, semantic discriminator, label discriminator, and classifier shown in fig. 3 are parts of the classifier training network that may be applied in the entity modules of the classification apparatus shown in fig. 2; for example, the first discriminator 221 in fig. 2 may be implemented by the semantic discriminator in fig. 3, and the second discriminator 222 by the label discriminator in fig. 3.
It can be understood that the training network framework provided in this embodiment of the present disclosure adjusts the sample generation rate and the sample quality by adding a quality controller on top of the generative adversarial network of the related art. The sample generator and the feature discriminator correspond, respectively, to the generator and the discriminator in the generative adversarial network of the related art.
Fig. 4 and fig. 5 respectively show the two stages of the classifier training method provided by the present disclosure. Fig. 4 shows the first stage, namely the training stage of the generative adversarial network, in which the generative adversarial network is trained and the sample generator and the feature discriminator are updated interactively. At this stage, the output of the feature discriminator is the input to the semantic discriminator, which randomly deletes some of the generated distinguishable samples. The output of the semantic discriminator then passes through the label discriminator to remove or adjust poor-quality labels. Finally, the generated samples that reach the desired quality are used together with the original samples to train the classifier. Fig. 5 shows the second stage: after the generative adversarial network model has been trained, the sample generator and the feature discriminator are no longer updated, i.e., they are fixed, and the feature discriminator also helps the semantic discriminator remove samples with lower semantic quality. In this stage, the samples generated by the data generator are filtered by the semantic discriminator and then adjusted or filtered by the label discriminator before being sent to the classifier. After the predefined condition is met, the classifier is finally trained using the original data set and the generated high-quality samples.
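A high-level sketch of the two stages as described around fig. 4 and fig. 5, assuming the components expose the hypothetical methods shown; it illustrates the control flow only and is not the disclosed implementation.

```python
def train_pipeline(generator, feature_disc, semantic_disc, label_disc,
                   classifier, original_set, noise_source, n_gan_epochs=100):
    # Stage 1: adversarially train the generator and the feature discriminator.
    for _ in range(n_gan_epochs):
        candidates = generator.generate(original_set, noise_source())
        labeled = feature_disc.label(candidates)      # mark samples as positive / negative
        kept = semantic_disc.random_reject(labeled)   # randomly delete some distinguishable samples
        kept = label_disc.filter(kept)                # remove or adjust poor-quality labels
        generator.update(original_set, labeled)
        feature_disc.update(original_set, labeled)

    # Stage 2: generator and feature discriminator are fixed; only the classifier is trained.
    generated = label_disc.filter(
        semantic_disc.random_reject(
            feature_disc.label(generator.generate(original_set, noise_source()))))
    classifier.fit(original_set + generated)
    return classifier
```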
The present disclosure also provides a training apparatus for a classifier, as shown in fig. 6, including: a first obtaining module 60, configured to obtain an original unbalanced sample set; a generating module 62, configured to generate a first sample set according to the original unbalanced sample set and the random noise, where the first sample set includes samples marked as positive samples and samples marked as negative samples; a second obtaining module 64, configured to remove samples that do not meet the preset condition from the first sample set to obtain a target sample set; and a training module 66, configured to train a classifier by using the collection of the original unbalanced sample set and the target sample set as a training data set, where the classifier is configured to complete classification of the unbalanced sample set to be classified.
Optionally, the generating module 62 includes a first generation submodule and a second generation submodule. The first generation submodule is configured to normalize the original sample set with a pre-constructed Gaussian mixture model and to combine the normalized original unbalanced sample set with noise randomly generated by the Gaussian mixture model to generate a candidate sample set; the second generation submodule is configured to generate the candidate sample set from the original unbalanced sample set and the random noise and to label the samples in the candidate sample set to obtain the first sample set, where the labels include: labeled as a positive sample and labeled as a negative sample.
The first generation submodule includes: a generation unit and a marking unit; the generating unit is used for carrying out normalization processing on the original sample set by utilizing a pre-constructed Gaussian mixture model; combining the original unbalanced sample set subjected to normalization processing with noise randomly generated by a Gaussian mixture model to generate a candidate sample set;
the marking unit is used for marking the candidate sample set according to the binary cross entropy loss and the sample spacing of the candidate sample set and the original unbalanced sample set, and determining the marked candidate sample set as a first sample set.
The second obtaining module 64 includes a target submodule, configured to randomly eliminate negative samples from the first sample set using the first discriminator to obtain a second sample set, where each sample marked as a negative sample in the first sample set is eliminated with a set probability, and to remove, using the second discriminator, the samples discriminated as negative samples by the second discriminator from the second sample set to obtain the target sample set.
The target submodule further comprises a target unit, configured to obtain the labels given to the samples in the first sample set by a plurality of predetermined first classifiers in the second discriminator; determine the samples in the second sample set marked as negative samples by a target number of first classifiers as the samples discriminated as negative samples by the second discriminator; and eliminate the samples discriminated as negative samples by the second discriminator to obtain the target sample set.
Optionally, the training device further includes a classification module, where the classification module is configured to adjust loss weights of the first type of sample and the second type of sample in the unbalanced sample set to be classified, so that the loss weight of the first type of sample is smaller than the loss weight of the second type of sample; and classifying the to-be-classified unbalanced sample set by utilizing the adjusted loss weight.
Fig. 7a shows a sample verification process, in which a sample generated by the generative adversarial network is discriminated by a plurality of first classifiers to obtain a plurality of classification results (positive sample or negative sample); if at least a target number of the first classifiers determine the result to be a negative sample, the input sample is determined to be a negative sample, where the target number is greater than a set threshold T_l.
Fig. 7b shows another sample verification process. The rate and quality of sample verification can be controlled by adjusting the size of the set threshold: the set threshold may be reduced at the early stage of sample generation to speed up sample verification, and increased at the later stage of sample generation to control the quality of the generated samples by ensuring that a generated sample is verified by more first classifiers. A sample in the input first sample set is determined to be a positive sample if at least a target number of first classifiers verify it as true, i.e., as a positive sample.
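As an illustration of the adjustable threshold T_l described above, the sketch below raises the required number of agreeing first classifiers as training progresses, so that early epochs favor generation speed and later epochs favor sample quality; the linear schedule is an assumption.

```python
def vote_threshold(epoch, total_epochs, n_classifiers, t_min=2):
    """Linearly grow the required number of agreeing classifiers from t_min to n_classifiers."""
    frac = epoch / max(total_epochs - 1, 1)
    return round(t_min + frac * (n_classifiers - t_min))

# Example: with 7 first classifiers over 10 epochs, the threshold grows from 2 to 7.
thresholds = [vote_threshold(e, 10, 7) for e in range(10)]
```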
It should be noted that, in the generative adversarial networks of the related art, a trained generator can manipulate specific feature attributes to fool the discriminator, and the discriminator can be used to distinguish real samples from false samples synthesized by the generator. The quality of the generated samples improves as the generator and the discriminator compete with each other. However, because of the lack of samples and the imbalance, there are not enough samples to distinguish different sample features. The training method of the classifier provided by the present disclosure adjusts the acquisition rate of samples and the sample quality by adjusting the setting condition of the quality controller. A large number of high-quality generated samples can therefore be obtained to balance the original unbalanced sample set, the performance (sample-distinguishing capability) of the classifier trained with enough samples as the training data set is improved, and the classification accuracy is further improved.
It should be further noted that, in practical application scenarios such as fault detection, even a small fault may trigger a chain reaction. As a means of predictive maintenance, state monitoring and fault diagnosis technologies are currently used in many fault diagnosis methods, such as expert system models, physical models, data-driven models, and so on. With a data-driven model, an anomaly inspection model can be obtained quickly as long as enough related monitoring data and maintenance data are available, and dependence on prior knowledge of the equipment can be avoided. However, this approach needs massive data to obtain a high-precision model, and the problem of insufficient data volume is difficult to overcome, although it can be mitigated by optimization algorithms, simulation data, and reinforcement learning. Most data-driven fault diagnosis methods assume that the data sets are evenly distributed, i.e., that the numbers of samples of different classes are close. However, data in practical applications are often unbalanced; for normally operating equipment, fault samples are inevitably much rarer than normal samples. When these data-driven classification algorithms are used directly for fault diagnosis, it is difficult to obtain satisfactory results: the prediction results tend to be biased toward the majority classes, so the accuracy of fault diagnosis is very low, even though in practical applications the fault-class data are significantly more important. Therefore, in the face of unbalanced data, the bias it causes must be overcome. The Synthetic Minority Oversampling Technique (SMOTE) is a common method that improves classification performance by adjusting the data distribution through adding synthesized minority samples. Generative adversarial networks are also commonly used to generate synthetic samples because of their efficiency and flexibility. Unlike SMOTE and its variants, which rely primarily on expert knowledge to design the generation rules for synthesizing minority samples, the GAN approach can automatically learn the intrinsic distribution of the minority class and produce minority samples that are similar to real samples. A GAN includes two networks, a generator and a discriminator, denoted G and D respectively, which are trained adversarially against each other. The samples generated by the generator G are judged and evaluated by the discriminator D, and the generator G is then optimized according to the evaluation result, which can greatly improve the efficiency and quality of the sample generation process. At present, GANs and their variants have been successfully applied in many fields such as image restoration, scene synthesis, and face recognition, but relying directly on randomly generated samples may yield samples of uncontrolled quality, thereby reducing the performance of the unbalanced classification model. There are also related-art techniques that introduce a new neural-network-based classifier and a corresponding loss term to measure the difference between the label of a generated record and the label predicted by the classifier, but they likewise do not take the data distribution into account. The method provided by the present disclosure directly improves the quality of the generated samples and thus the sample quality of the training data set used to train the classifier, so that the classification accuracy of the classifier is improved and the operability and practicability are more reliable.
In the technical scheme of the disclosure, the acquisition, storage, application and the like of the personal information of the related user all accord with the regulations of related laws and regulations, and do not violate the good customs of the public order.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 8 illustrates a schematic block diagram of an example electronic device 800 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the apparatus 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM)802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The calculation unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.
A number of components in the device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved; no limitation is imposed herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.
Claims (12)
1. A method of training a classifier, comprising:
acquiring an original unbalanced sample set;
generating a first set of samples from the original set of unbalanced samples and random noise, wherein the first set of samples includes samples labeled as positive samples and samples labeled as negative samples;
removing samples which do not meet preset conditions from the first sample set to obtain a target sample set;
and training a target classifier by taking the union of the original unbalanced sample set and the target sample set as a training data set, wherein the target classifier is used for classifying an unbalanced sample set to be classified.
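As a reading aid for claim 1, the following is a minimal sketch in Python. The helper callables `generate_first_sample_set` and `filter_samples`, and the use of `LogisticRegression` as the target classifier, are hypothetical placeholders, not details taken from the source.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def train_target_classifier(X_orig, y_orig, generate_first_sample_set, filter_samples):
    """Sketch of claim 1: synthesize, filter, then train on the union of sets."""
    # Generate a first sample set from the original unbalanced data plus random noise
    noise = np.random.normal(size=X_orig.shape)
    X_first, y_first = generate_first_sample_set(X_orig, y_orig, noise)

    # Remove samples that do not meet the preset condition to obtain the target set
    X_target, y_target = filter_samples(X_first, y_first)

    # Train the target classifier on the union of the original and target sample sets
    X_train = np.vstack([X_orig, X_target])
    y_train = np.concatenate([y_orig, y_target])
    return LogisticRegression(max_iter=1000).fit(X_train, y_train)
```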
2. The method of claim 1, wherein the removing samples from the first sample set that do not meet a preset condition to obtain a target sample set comprises:
randomly removing negative samples from the first sample set by using a first discriminator to obtain a second sample set, wherein the removal probability of each sample marked as a negative sample in the first sample set is a set probability;
and removing, from the second sample set by using a second discriminator, the samples which are identified as negative samples by the second discriminator, to obtain the target sample set.
3. The method of claim 2, wherein the removing, from the second sample set by using the second discriminator, the samples which are identified as negative samples by the second discriminator to obtain the target sample set comprises:
obtaining labels assigned to the samples in the first sample set by a plurality of predetermined first classifiers in the second discriminator;
determining the samples in the second sample set that are marked as negative samples by a target number of the first classifiers as the samples identified as negative samples by the second discriminator;
and removing the samples identified as negative samples by the second discriminator to obtain the target sample set.
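A sketch of the two-stage culling of claims 2 and 3. The rejection probability, the target vote count, and the ensemble of pre-trained "first classifiers" are assumed inputs; their values and training are not specified by the claims.

```python
import numpy as np


def cull_first_sample_set(X_first, y_first, first_classifiers,
                          reject_prob=0.5, target_votes=2, seed=0):
    """Stage 1: randomly drop negatives with a set probability (claim 2).
    Stage 2: drop samples a target number of classifiers mark negative (claim 3)."""
    rng = np.random.default_rng(seed)

    # Stage 1: keep every positive; drop each negative with probability reject_prob
    keep = (y_first == 1) | (rng.random(len(y_first)) >= reject_prob)
    X_second, y_second = X_first[keep], y_first[keep]

    # Stage 2: count "negative" votes from the predetermined first classifiers
    neg_votes = sum((clf.predict(X_second) == 0).astype(int) for clf in first_classifiers)

    # Remove samples identified as negative by the ensemble (votes reach the target number)
    keep2 = neg_votes < target_votes
    return X_second[keep2], y_second[keep2]
```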
4. The method of claim 1, further comprising:
adjusting loss weights of first-type samples and second-type samples in the unbalanced sample set to be classified, so that the loss weight of the first-type samples is smaller than that of the second-type samples;
and classifying the unbalanced sample set to be classified by using the adjusted loss weights.
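One conventional way to realize the loss-weight adjustment of claim 4 is a class-weighted cross-entropy; the sketch below assumes index 0 is the first-type (majority) class and index 1 the second-type (minority) class, and the concrete weight values are illustrative only.

```python
import torch
import torch.nn as nn

# Assumed mapping: class 0 = first-type (majority), class 1 = second-type (minority).
# The smaller weight on class 0 makes its loss weight less than that of class 1.
class_weights = torch.tensor([0.2, 0.8])
criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(16, 2)            # classifier scores for a batch of 16 samples
labels = torch.randint(0, 2, (16,))    # ground-truth class indices
loss = criterion(logits, labels)       # weighted loss used when classifying the set
```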
5. The method of claim 1, wherein the generating a first set of samples from the original set of unbalanced samples and random noise comprises:
generating a candidate sample set according to the original unbalanced sample set and the random noise;
marking samples in the candidate sample set to obtain the first sample set, wherein the marking comprises: marking as a positive sample and marking as a negative sample.
6. The method of claim 5, wherein the generating a set of candidate samples from the original set of unbalanced samples and the random noise comprises:
carrying out normalization processing on the original unbalanced sample set by using a pre-constructed Gaussian mixture model;
and combining the original unbalanced sample set subjected to normalization processing with the noise randomly generated by the Gaussian mixture model to generate the candidate sample set.
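A sketch of claim 6 under two interpretive assumptions not confirmed by the source: "normalization processing" is read as per-feature standardization, and the random noise is read as samples drawn from the fitted Gaussian mixture; the mixing scheme and its coefficient are likewise illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def generate_candidate_set(X_orig, n_candidates=1000, n_components=3, seed=0):
    """Fit a GMM on the original unbalanced samples, normalize them, and combine
    them with noise drawn from the mixture to form the candidate sample set."""
    gmm = GaussianMixture(n_components=n_components, random_state=seed).fit(X_orig)

    # "Normalization processing" read here as per-feature standardization (assumption)
    X_norm = (X_orig - X_orig.mean(axis=0)) / (X_orig.std(axis=0) + 1e-8)

    # Noise randomly generated by the Gaussian mixture model
    noise, _ = gmm.sample(n_candidates)

    # Combine normalized originals with the sampled noise (mixing scheme assumed)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X_norm), size=n_candidates)
    return X_norm[idx] + 0.1 * noise
```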
7. The method of claim 5, wherein said labeling samples in said candidate set of samples, resulting in said first set of samples, comprises:
and labeling the candidate sample set according to a binary cross-entropy loss and a sample spacing between the candidate sample set and the original unbalanced sample set, and determining the labeled candidate sample set as the first sample set.
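Claim 7 does not fix how the binary cross-entropy loss and the sample spacing are combined; the sketch below assumes a simple rule in which a candidate is labeled positive only if a reference scorer's BCE loss (against a positive target) and its nearest-neighbor distance to the original set both fall below thresholds. The scorer `predict_proba` and both thresholds are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import cdist


def label_candidate_set(X_cand, X_orig, predict_proba,
                        bce_threshold=0.5, dist_threshold=1.0):
    """Label candidates via an assumed BCE-plus-spacing rule (sketch of claim 7)."""
    eps = 1e-8
    p = np.clip(predict_proba(X_cand), eps, 1 - eps)

    # Binary cross-entropy of each candidate against a positive (label 1) target
    bce = -np.log(p)

    # Sample spacing: distance from each candidate to its nearest original sample
    spacing = cdist(X_cand, X_orig).min(axis=1)

    # Positive when both the loss and the spacing fall below their thresholds
    return ((bce < bce_threshold) & (spacing < dist_threshold)).astype(int)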
8. An apparatus for classifying an unbalanced sample set, comprising:
a data generator, a quality controller and a classifier;
the data generator is used for generating a first sample set;
the quality controller is used for screening the first sample set to generate a target sample set, wherein the union of the target sample set and a pre-acquired original unbalanced sample set is used for training the classifier;
the classifier is configured to classify the unbalanced sample set.
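The three components of claim 8 can be pictured as cooperating objects; the structural sketch below uses hypothetical class and method names and a generic scikit-learn-style base classifier.

```python
import numpy as np


class DataGenerator:
    """Produces the first sample set from the original data plus random noise."""
    def generate(self, X_orig, y_orig):
        raise NotImplementedError


class QualityController:
    """Screens the first sample set down to the target sample set."""
    def screen(self, X_first, y_first):
        raise NotImplementedError


class UnbalancedSetClassifier:
    """Wires the generator, controller, and a base classifier together."""
    def __init__(self, generator, controller, base_classifier):
        self.generator = generator
        self.controller = controller
        self.base_classifier = base_classifier

    def fit(self, X_orig, y_orig):
        X_first, y_first = self.generator.generate(X_orig, y_orig)
        X_target, y_target = self.controller.screen(X_first, y_first)
        X = np.vstack([X_orig, X_target])
        y = np.concatenate([y_orig, y_target])
        self.base_classifier.fit(X, y)
        return self

    def predict(self, X):
        return self.base_classifier.predict(X)
```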
9. A training apparatus of a classifier, comprising:
a first acquisition module for acquiring an original unbalanced sample set;
a generating module for generating a first sample set from the original unbalanced sample set and random noise, wherein the first sample set comprises samples marked as positive samples and samples marked as negative samples;
a second acquisition module for removing samples which do not meet preset conditions from the first sample set to obtain a target sample set;
and a training module for training a classifier by taking the union of the original unbalanced sample set and the target sample set as a training data set, wherein the classifier is used for classifying an unbalanced sample set to be classified.
10. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
11. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-7.
12. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210482911.7A CN114818947A (en) | 2022-05-05 | 2022-05-05 | Training method and device of classifier and classification device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114818947A true CN114818947A (en) | 2022-07-29 |
Family
ID=82512222
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210482911.7A Pending CN114818947A (en) | 2022-05-05 | 2022-05-05 | Training method and device of classifier and classification device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114818947A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TWI844284B (en) * | 2023-02-24 | 2024-06-01 | 國立中山大學 | Method and electrical device for training cross-domain classifier |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110310114B (en) | Object classification method, device, server and storage medium | |
JP2007503034A (en) | Method and apparatus for automatically online detecting and classifying anomalous objects in a data stream | |
WO2016177069A1 (en) | Management method, device, spam short message monitoring system and computer storage medium | |
CN113011889B (en) | Account anomaly identification method, system, device, equipment and medium | |
CN112989035A (en) | Method, device and storage medium for recognizing user intention based on text classification | |
CN114186626A (en) | Abnormity detection method and device, electronic equipment and computer readable medium | |
US20210201270A1 (en) | Machine learning-based change control systems | |
CN115622806B (en) | Network intrusion detection method based on BERT-CGAN | |
CN115238815A (en) | Abnormal transaction data acquisition method, device, equipment, medium and program product | |
CN115801374A (en) | Network intrusion data classification method and device, electronic equipment and storage medium | |
CN111444362A (en) | Malicious picture intercepting method, device, equipment and storage medium | |
WO2020165610A1 (en) | Systems and methods for conducting a security recognition task | |
CN114818947A (en) | Training method and device of classifier and classification device | |
CN112634022B (en) | Credit risk assessment method and system based on unbalanced data processing | |
CN111245815B (en) | Data processing method and device, storage medium and electronic equipment | |
CN115277205B (en) | Model training method and device and port risk identification method | |
CN117575595A (en) | Payment risk identification method, device, computer equipment and storage medium | |
CN115842645A (en) | UMAP-RF-based network attack traffic detection method and device and readable storage medium | |
CN115170838A (en) | Data screening method and device | |
CN114493858A (en) | Illegal fund transfer suspicious transaction monitoring method and related components | |
CN116861226A (en) | Data processing method and related device | |
CN110895564A (en) | Potential customer data processing method and device | |
Alshawi | Comparison of SVM kernels in Credit Card Fraud Detection using GANs. | |
Kumar et al. | Tax Management in the Digital Age: A TAB Algorithm-based Approach to Accurate Tax Prediction and Planning | |
CN118262181B (en) | Automatic data processing system based on big data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||