CN114494772A - Unbalanced sample classification method and device - Google Patents

Unbalanced sample classification method and device

Info

Publication number
CN114494772A
CN114494772A (application number CN202210048383.4A)
Authority
CN
China
Prior art keywords
sample
classification
positive
samples
unknown
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210048383.4A
Other languages
Chinese (zh)
Other versions
CN114494772B (en)
Inventor
赵家志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Changjiang Computing Technology Co ltd
Fiberhome Telecommunication Technologies Co Ltd
Original Assignee
Fiberhome Telecommunication Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fiberhome Telecommunication Technologies Co Ltd filed Critical Fiberhome Telecommunication Technologies Co Ltd
Priority to CN202210048383.4A priority Critical patent/CN114494772B/en
Publication of CN114494772A publication Critical patent/CN114494772A/en
Application granted granted Critical
Publication of CN114494772B publication Critical patent/CN114494772B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to an unbalanced sample classification method and device. The method mainly comprises the following steps: model construction and training are carried out: constructing and training a variational self-encoder network, calculating the maximum reconstruction error of a negative sample, and carrying out classification model training on a positive sample; predicting the classification of the unknown sample: and setting the upper and lower limits of the critical region and the classification threshold of the positive sample to be compared with the reconstruction error of the unknown sample and the classification probability distribution of the positive sample so as to predict the classification of the unknown sample. The invention can realize the selection of the prediction preference by setting the parameters of the critical zone, namely setting the upper limit and the lower limit of the critical zone and the classification threshold of the positive sample during prediction, and can dynamically adjust the prediction preference.

Description

Unbalanced sample classification method and device
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method and a device for classifying unbalanced samples.
Background
In the field of machine learning and data processing, imbalance between positive and negative samples is very common. A negative sample refers to sample data acquired while the system is in a normal condition, for example the physiological indexes of medically healthy people, for whom the indexes representing diseases are negative; a large amount of negative sample data can therefore be obtained easily. A positive sample refers to sample data acquired while the system is abnormal, for example the physiological indexes of people suffering from certain diseases, for whom some indexes representing the diseases are positive; positive sample data is difficult to obtain, and only a small amount of labelled positive sample data is available. For such a sample distribution, in addition to the positive/negative classification, the positive samples themselves must be classified; in medicine, for example, besides judging whether a person is healthy, the type of disease the patient suffers from must also be judged. The communication industry shows a similar imbalance between positive and negative samples: in spam identification there are generally many negative samples (normal mails) and few positive samples (spam), yet the spam falls into categories such as sales promotion, advertisement and insurance promotion, i.e. the few positive samples belong to many different categories. Likewise, in network failure analysis or performance analysis the network is generally in a normal operating state, so the negative samples are many (the network is normal and performs normally) while the positive samples are few (the network fails or performs incorrectly), and the failures and performance errors arise from various causes, i.e. the positive samples fall into multiple classes.
The existing methods for solving the unbalanced sample problem include over-sampling, under-sampling, weighted loss functions, positive sample synthesis and ensemble methods. However, the prediction preference of these methods is not adjustable at the prediction stage, and a large amount of retraining has to be carried out when a new class is added.
In view of this, how to overcome the defects existing in the prior art and solve the above technical problems is a difficult problem to be solved in the technical field.
Disclosure of Invention
Aiming at the defects or improvement requirements in the prior art, the invention provides an unbalanced sample classification method and device, which combine the advantages of deep learning and traditional machine learning algorithms, divide the classification of samples into two steps of model training and prediction, introduce reconstruction errors, use the deep learning algorithm model of a variational self-encoder to separate positive and negative samples, use the traditional machine learning algorithm model to classify positive samples with small data volume, introduce a critical zone and a positive sample classification threshold, and flexibly adjust the classification preference in the prediction stage.
The embodiment of the invention adopts the following technical scheme:
in a first aspect, the present invention provides an unbalanced sample classification method, including:
model construction and training are carried out: constructing and training a variational self-encoder network, calculating the maximum reconstruction error of a negative sample, and carrying out classification model training on a positive sample;
predicting the classification of the unknown sample: and setting the upper and lower limits of the critical region and the classification threshold of the positive sample to be compared with the reconstruction error of the unknown sample and the classification probability distribution of the positive sample so as to predict the classification of the unknown sample.
Further, the model building and training specifically includes:
constructing a variational self-encoder network and setting a loss function and a reconstruction error;
training a variational self-encoder network and storing the network and parameters of the variational self-encoder network;
inputting all negative samples into a variational self-encoder network for reconstruction and calculating a reconstruction error, and taking the maximum value of the calculated reconstruction error as the maximum reconstruction error of the negative samples;
and training classification models of all positive samples and storing the classification models and parameters.
Further, the variational self-encoder network comprises an encoder network and a decoder network, and when the variational self-encoder network is saved, the networks and parameters of the encoder network and of the decoder network are saved separately.
Further, the setting of the loss function and the reconstruction error specifically includes:
reconstructing a feature vector x of a sample into
x̂ = D(S(E(x))), wherein E(·) represents the encoder function, S(·) represents the Gaussian-distribution sampling function, and D(·) represents the decoder function;
setting the reconstruction error to Rx = R(x, x̂);
the loss function for a single sample is L = R(x, x̂) + KL(N(μ, σ) ‖ N(0, I)), where μ is the mean of the Gaussian distribution and σ is the covariance of the Gaussian distribution.
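For illustration only, the reconstruction error and single-sample loss above can be written as the following minimal Python sketch; the encoder, sampler and decoder callables are placeholders (not names used in this application), and the KL term assumes the usual diagonal-Gaussian form with σ interpreted as the per-dimension standard deviation:

```python
import numpy as np

def reconstruction_error(x, x_hat):
    # R(x, x_hat): squared reconstruction error between a sample and its reconstruction
    return float(np.sum((x - x_hat) ** 2))

def single_sample_loss(x, encoder, sampler, decoder):
    # Placeholder callables standing in for E(.), S(.) and D(.)
    mu, sigma = encoder(x)        # parameters of the latent Gaussian
    z = sampler(mu, sigma)        # e.g. z = mu + eps * sigma
    x_hat = decoder(z)            # reconstruction x_hat = D(S(E(x)))
    # KL divergence between N(mu, diag(sigma^2)) and N(0, I)
    kl = 0.5 * float(np.sum(mu ** 2 + sigma ** 2 - np.log(sigma ** 2) - 1.0))
    return reconstruction_error(x, x_hat) + kl
```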
Further, the training of the variational self-encoder network specifically includes:
dividing the negative samples into a plurality of batches for training in sequence;
calculating the reconstruction loss of each batch and the gradient of the reconstruction loss to the network parameters;
optimizing network parameters through a gradient descent optimization strategy;
and judging whether all batches are trained or not and whether the maximum training times are reached or not, and if so, finishing the training.
Further, the predicting the class to which the unknown sample belongs specifically includes:
setting a lower limit coefficient alpha and an upper limit coefficient beta of a critical zone to obtain a critical interval [ alpha multiplied by M, beta multiplied by M ], wherein alpha is less than beta, and M is the maximum reconstruction error of a negative sample;
setting a positive sample classification threshold P, wherein 1/K is more than P and less than 1, and K is the positive sample classification number;
inputting the unknown sample into the encoder network and the decoder network and calculating the reconstruction error Rx of the unknown sample;
judging whether the reconstruction error Rx of the unknown sample is less than the critical zone lower limit alpha × M, so as to output the prediction result as a negative sample or to calculate the positive sample classification probability distribution of the unknown sample;
judging whether the unknown sample can be classified into a certain class of positive samples with the probability greater than the positive sample classification threshold P and the reconstruction error R of the unknown samplexWhether the output prediction result is within the critical interval or not is determined to be a positive sample or a negative sample.
Further, the judging whether the reconstruction error Rx of the unknown sample is less than the critical zone lower limit alpha × M, so as to output the prediction result as a negative sample or to calculate the positive sample classification probability distribution of the unknown sample, specifically includes:
if the reconstruction error Rx of the unknown sample is less than the critical zone lower limit alpha × M, outputting the prediction result as a negative sample; otherwise, calculating the positive sample classification probability distribution of the unknown sample through the stored classification model.
Further, the judging whether the unknown sample can be classified into a certain class of positive samples with a probability greater than the positive sample classification threshold P and whether the reconstruction error Rx of the unknown sample is within the critical interval, so as to determine whether the output prediction result is a positive sample or a negative sample, specifically includes:
if the unknown sample can be classified into a certain class of positive samples with a probability greater than the positive sample classification threshold P, outputting the prediction result as that positive sample classification;
if the unknown sample cannot be classified into a certain class of positive samples with a probability greater than the positive sample classification threshold P, and the reconstruction error Rx of the unknown sample is within the critical interval [alpha × M, beta × M], classifying the unknown sample as a negative sample and outputting the prediction result as a negative sample;
if the unknown sample cannot be classified into a certain class of positive samples with a probability greater than the positive sample classification threshold P, and the reconstruction error Rx of the unknown sample is greater than the critical zone upper limit, considering the unknown sample to be a sample of a newly added class, adding the sample of the newly added class into the positive sample set, and outputting the prediction result as a positive sample classification.
Further, when the unknown sample of the new class is added into the positive sample set, the label of the unknown sample of the new class is set to be K +1, the class number K of the positive sample is modified to be K +1, and then the classification models of all the positive samples are retrained for the next prediction.
On the other hand, the invention provides an unbalanced sample classification device, which specifically comprises: the method comprises at least one processor and a memory, wherein the at least one processor and the memory are connected through a data bus, and the memory stores instructions capable of being executed by the at least one processor, and the instructions are used for completing the unbalanced sample classification method in the first aspect after being executed by the processor.
Compared with the prior art, the invention has the beneficial effects that: during prediction, the selection of the prediction preference is realized by setting the critical zone parameters, namely the upper and lower limits of the critical zone and the positive sample classification threshold, and the prediction preference can be dynamically adjusted. For a newly added positive sample class, only the classification model of the small number of positive samples needs to be retrained, instead of retraining on all samples, so the model can be quickly updated and iterated. In addition, unlike existing methods, no feature information of the samples is newly added or discarded, so the under-fitting and over-fitting risks are reduced compared with existing algorithms.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below. It is obvious that the drawings described below are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
Fig. 1 is a flowchart of an unbalanced sample classification method according to embodiment 1 of the present invention;
FIG. 2 is an expanded flowchart of step 100 provided in embodiment 1 of the present invention;
fig. 3 is an expanded flowchart of step 102 provided in embodiment 1 of the present invention;
FIG. 4 is an expanded flowchart of step 200 provided in embodiment 1 of the present invention;
fig. 5 is a schematic flowchart of a process for completing the construction and training of a training model according to embodiment 2 of the present invention;
fig. 6 is a schematic diagram of a network structure of a variational self-encoder according to embodiment 2 of the present invention;
fig. 7 is a schematic diagram of a training process of a variational self-encoder network according to embodiment 2 of the present invention;
FIG. 8 is a diagram of an example of input and output of a Softmax classifier provided in embodiment 2 of the present invention;
FIG. 9 is a schematic view of a process for classifying unknown samples according to the embodiment 2 of the present invention;
fig. 10 is a schematic structural diagram of an unbalanced sample classification apparatus according to embodiment 3 of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The present invention is a system structure of a specific function system, so the functional logic relationship of each structural module is mainly explained in the specific embodiment, and the specific software and hardware implementation is not limited.
In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other. The invention will be described in detail below with reference to the figures and examples.
Example 1:
as shown in fig. 1, an embodiment of the present invention provides an unbalanced sample classification method, which provides a method combining deep learning and conventional machine learning for multi-classification problem of unbalanced positive and negative samples. The method comprises the following two steps.
Step 100 (model construction and training): and constructing and training a variational self-encoder network, calculating the maximum reconstruction error of the negative sample, and carrying out classification model training on the positive sample. The method is mainly used for establishing a variational self-coder network model and a classification model of a positive sample, and training the two models for the next prediction.
Step 200 (predict class to which unknown sample belongs): and setting the upper and lower limits of the critical region and the classification threshold of the positive sample to be compared with the reconstruction error of the unknown sample and the classification probability distribution of the positive sample so as to predict the classification of the unknown sample. The step is mainly used for predicting the classification of the unknown sample, and the prediction is based on the variational self-coder network model trained in the step 100 and the classification model of the positive sample.
By the method, the upper and lower limits of the critical zone and the classification threshold of the positive sample are set during prediction to realize selection of the prediction preference, so that the problem that the preference of the prediction stage cannot be flexibly adjusted can be solved. In addition, when the model is constructed and trained, the classification model of the positive sample is constructed and trained independently, so that only a small number of classification models of the positive sample need to be retrained for newly added positive sample classification, all samples do not need to be retrained, and the model can be updated and iterated quickly.
The following describes step 100 (performing model construction and training) and step 200 (predicting the classification to which the unknown sample belongs) in this embodiment in detail.
As shown in fig. 2, in the preferred embodiment, the step 100 (performing model building and training) may specifically include the following steps:
step 101: and constructing a variational self-encoder network and setting a loss function and a reconstruction error. In this step of the preferred embodiment, the constructed variational self-encoder network includes an encoder network and a decoder network. For a sample, there may be a loss or reconstruction error after reconstructing (i.e. encoding and decoding), for example, reconstructing the eigenvector x of a sample into
x̂; the preferred embodiment sets x̂ = D(S(E(x))), wherein E(·) represents the encoder function, S(·) represents the Gaussian-distribution sampling function, and D(·) represents the decoder function; the reconstruction error is Rx = R(x, x̂); and the loss function for a single sample is L = R(x, x̂) + KL(N(μ, σ) ‖ N(0, I)), where μ is the mean of the Gaussian distribution and σ is the covariance of the Gaussian distribution.
Step 102: training the variational self-encoder network and storing the network and parameters. In this step of the preferred embodiment, based on step 101 above, this step trains the encoder network and decoder network of the variational self-encoder using existing negative sample data to obtain E (-) and D (-) respectively. When storing, E (-) and D (-) are two sub-networks of the variational autoencoder, and are stored as two files, respectively.
Step 103: and inputting all negative samples into a variational self-encoder network for reconstruction, calculating a reconstruction error, and taking the maximum value of the calculated reconstruction error as the maximum reconstruction error of the negative samples. In this step of the preferred embodiment, all negative samples are input into the trained variational self-coder network for reconstruction, and then the transformed variational self-coder network is used
to reconstruct each of them. Denoting the i-th negative sample feature vector by x_i^- (similarly, the i-th positive sample feature vector may be denoted by x_i^+), x_i^- is reconstructed into x̂_i^- = D(S(E(x_i^-))), with reconstruction error R(x_i^-, x̂_i^-) and single-sample loss L = R(x_i^-, x̂_i^-) + KL(N(μ, σ) ‖ N(0, I)). Finally, the maximum of the calculated reconstruction errors is taken as the maximum reconstruction error M of the negative samples: M = max_i R(x_i^-, x̂_i^-).
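As a small illustrative sketch (reusing the placeholder helpers from the Python sketch above), the maximum negative-sample reconstruction error M could be computed as follows:

```python
def max_negative_reconstruction_error(negative_samples, encoder, sampler, decoder):
    # M = max_i R(x_i, x_hat_i) over all negative samples
    errors = []
    for x in negative_samples:
        mu, sigma = encoder(x)
        x_hat = decoder(sampler(mu, sigma))
        errors.append(reconstruction_error(x, x_hat))
    return max(errors)
```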
step 104: and training Softmax classifiers of all positive samples and saving Softmax models and parameters. I.e. training the classification models of all positive samples and storing the classification models and parameters. The present embodiment preferably uses a Softmax classifier for the classification model training of positive samples. In this step of the preferred embodiment, in the case that the positive samples are few relative to the negative samples, the Softmax classifier is trained on the few positive samples to obtain a Softmax model: softmaxK(. K), here, is the positive sample classification number.
Through the steps 101 to 104, the construction and training of the variational self-coder network model and the positive sample classification model (namely, the Softmax model) can be completed. The steps 101 to 103 are to use a large number of negative samples (relative to the positive samples, the negative samples are large) to construct and train the self-encoder network model, and the step 104 is to use a small number of positive samples (relative to the negative samples, the positive samples are small) to construct and train the Softmax model, and the construction and training of the two models can be parallel, without clear precedence, that is, the steps 101 to 103 and the step 104 can be executed in parallel.
As shown in fig. 3, in the preferred embodiment, the training of the variational self-encoder network in step 102 can be further specifically divided into the following steps:
step 102-1: the negative samples are divided into several batches for training in sequence.
Step 102-2: the reconstruction loss for each batch and the gradient of the reconstruction loss to the network parameters are calculated.
Step 102-3: and optimizing network parameters through a gradient descent optimization strategy.
Step 102-4: and judging whether all batches are trained or not and whether the maximum training times are reached or not, and if so, finishing the training. If not all batches are trained, or the maximum number of training times is not reached, then continuing from step 102-1, training the variational self-encoder network using the next batch of negative samples in sequence.
In combination with the above steps, the preferred embodiment may also use input and output modes to describe the training of the above two models (the variational self-coder network model and the Softmax model), and the specific input and output processes are as follows:
inputting: negative sample feature set
{x_i^-}, i = 1, …, m1; the positive sample set {(x_i^+, y_i)}, i = 1, …, m2, with m1 ≫ m2 and K > 1; and the reconstruction error loss computation function R(·, ·). Here y_i denotes the label of the i-th positive sample and takes a value in 1 to K, where K denotes the number of positive sample classes; (x_i^+, y_i) denotes the i-th positive sample feature vector together with its label; both positive and negative sample feature vectors are n-dimensional real vectors (x ∈ R^n); and m1 ≫ m2 indicates that m1 (the number of negative samples) is much larger than m2 (the number of positive samples).
Outputting: the encoder model E(·) of the negative sample variational self-encoder; the Gaussian-distribution sampling function S(·), where the mean μ and the covariance matrix σ of the Gaussian distribution are the output of the encoder; the decoder model D(·); the maximum negative sample reconstruction error M; and the Softmax classifier model SoftmaxK(·) of the K classes of positive samples.
The above is a detailed description of step 100 (performing model construction and training) in the present preferred embodiment, and step 200 (predicting the classification to which the unknown sample belongs) in the present preferred embodiment is described in detail below.
As shown in fig. 4, in the preferred embodiment, the step 200 (predicting the class to which the unknown sample belongs) may specifically include the following steps:
step 201: setting a lower limit coefficient alpha and an upper limit coefficient beta of a critical area to obtain a critical interval [ alpha multiplied by M, beta multiplied by M ], and simultaneously setting a positive sample classification threshold P, wherein alpha is less than beta, 1/K is less than P and less than 1, M is the maximum reconstruction error of a negative sample, and K is the classification number of the positive sample. The step of the preferred embodiment is a pre-preparation step, which is mainly used to set various parameters.
Step 202: inputting the unknown sample into the encoder network and the decoder network and calculating the reconstruction error Rx of the unknown sample. This step of the preferred embodiment starts the prediction of an unknown sample x (i.e. a sample to be predicted): the unknown sample x is first input into the previously trained encoder network and decoder network to obtain its reconstructed feature vector
x̂ = D(S(E(x))) and its reconstruction error Rx = R(x, x̂).
Step 203: judging whether the reconstruction error Rx of the unknown sample is less than the critical zone lower limit α × M, so as to output the prediction result as a negative sample or to calculate the positive sample classification probability distribution of the unknown sample. In this step of the preferred embodiment, if the reconstruction error Rx of the unknown sample is less than the critical zone lower limit α × M, the output classification is 0 (i.e. the prediction result is output as a negative sample) and the prediction ends; otherwise, the positive sample classification probability distribution of the unknown sample is calculated through the stored Softmax model for the next prediction judgment. The coefficient α in this step is equivalent to a confidence threshold for classifying a sample as a negative sample: samples below this threshold are confidently classified as negative samples, and the smaller the value, the higher the probability of entering the re-filtering process of the following steps. The setting of the α parameter in this step embodies the beneficial effect that this embodiment can realize selection of the prediction preference and can dynamically adjust the prediction preference.
In the preferred embodiment, the positive sample classification probability distribution is calculated as follows: the previously trained positive sample classifier SoftmaxK(·) (i.e. the trained Softmax model) is used to predict the positive sample classification probability distribution of the unknown sample x: (p_1, p_2, …, p_K) = SoftmaxK(x).
step 204: judging whether the unknown sample can be classified into a certain class of positive samples with the probability greater than the positive sample classification threshold P and the reconstruction error R of the unknown samplexWhether the output prediction result is within the critical interval or not is determined to be a positive sample or a negative sample.
In step 204 of the preferred embodiment, if the unknown sample can be classified into a certain class of positive samples with a probability greater than the positive sample classification threshold P (i.e. among the calculated probabilities p_1, p_2, …, p_K there exists one greater than the positive sample classification threshold P), the prediction result is output as the positive sample class corresponding to that probability, and the prediction ends. Namely: if max_i p_i > P, output argmax_i p_i (i.e. the positive sample classification) and end the prediction; otherwise go to the next step. The threshold P in this step defines the confident classification of positive samples, and a smaller P represents a higher probability of classifying the unknown sample x into a certain existing positive sample class. The setting of the P parameter in this step also embodies the beneficial effect that this embodiment can realize selection of the prediction preference and can dynamically adjust the prediction preference.
In step 204 of the preferred embodiment, if the unknown sample cannot be classified into a certain class of positive samples with a probability greater than the positive sample classification threshold P (i.e. none of the calculated probabilities p_1, p_2, …, p_K is greater than the positive sample classification threshold P) and the reconstruction error Rx of the unknown sample is within the critical interval [α × M, β × M], the unknown sample is re-classified as a negative sample, the prediction result is output as a negative sample, and the prediction ends. That is, on the basis of the previous judgment: if Rx ≤ β × M, the output classification is 0 (i.e. a negative sample) and the prediction ends; otherwise go to the next step. This step achieves a re-filtering of negative samples, and the coefficient β is equivalent to a preference for defining new positive sample classes: a smaller value indicates a greater bias towards adding a new positive sample class. Therefore, the setting of the β parameter in this step also embodies the beneficial effect that this embodiment can realize selection of the prediction preference and can dynamically adjust the prediction preference.
In step 204 of the preferred embodiment, if the unknown sample cannot be classified into a certain class of positive samples with a probability greater than the positive sample classification threshold P and the reconstruction error Rx of the unknown sample is greater than the critical zone upper limit, the unknown sample is considered to be a sample of a newly added class; the sample is added into the positive sample set, the prediction result is output as a positive sample classification, and the prediction ends. In this step, when the unknown sample x of the new class is added to the positive sample set as a new positive sample class, the label of the unknown sample x of the new class is set to K + 1, the positive sample class number K is modified to K + 1, and then the process goes to step 104 to retrain the Softmax classifiers of all positive samples for the next prediction.
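The decision logic of steps 201 to 204 can be summarized in the following illustrative Python sketch; the helper names (reconstruct_error, softmax_model, retrain_positive_classifier) and the scikit-learn-style predict_proba call are assumptions made for illustration, not names defined in this application:

```python
def predict_class(x, M, alpha, beta, P, softmax_model,
                  positive_X, positive_y, reconstruct_error, retrain_positive_classifier):
    # Step 202: reconstruction error of the unknown sample
    r_x = reconstruct_error(x)

    # Step 203: confidently negative if below the critical-zone lower limit
    if r_x < alpha * M:
        return 0  # negative sample

    # Positive sample classification probability distribution (p_1, ..., p_K)
    probs = softmax_model.predict_proba(x.reshape(1, -1))[0]

    # Step 204: confident positive classification
    if probs.max() > P:
        return int(probs.argmax()) + 1  # label of an existing positive class

    # Re-filtering: still within the critical interval -> negative sample
    if r_x <= beta * M:
        return 0

    # Above the critical-zone upper limit -> treat as a newly added positive class K + 1
    new_label = len(probs) + 1
    positive_X.append(x.reshape(-1))
    positive_y.append(new_label)
    retrain_positive_classifier(positive_X, positive_y)
    return new_label
```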
In combination with the above steps, the preferred embodiment may further use input and output modes to describe the process of predicting the classification to which the unknown sample belongs, where the specific input and output processes are as follows:
inputting: sample to be predicted
x ∈ R^n; the critical zone lower and upper limits are α and β (α < β, e.g. 0.9 and 1.1), and the positive sample classification threshold is P (1/K < P < 1, e.g. P = 0.9).
Outputting: the classification of the sample to be predicted.
In summary, it can be seen from the preferred embodiment that, in the prediction, the selection of the prediction preference is realized by setting the critical section parameter, that is, setting the upper and lower limits of the critical section and the positive sample classification threshold, and the prediction preference can be dynamically adjusted. For newly added positive sample classification, only a few Softmax classifiers of the positive samples need to be retrained, and all samples do not need to be retrained, so that the rapid updating iteration of the model can be realized. In addition, the method is different from the existing method, and the operation of newly adding or discarding the characteristic information of the sample does not exist, so that the under-fitting and over-fitting risks are reduced compared with the existing algorithm. Although the method of the preferred embodiment mainly aims at classifying unbalanced samples, the method can also classify relatively balanced samples, and is not limited to be used only for unbalanced samples.
Example 2:
based on the unbalanced sample classification method provided in embodiment 1, embodiment 2 provides a concrete implementation example based on embodiment 1. In the following, an implementation is given by taking the MNIST dataset as an example according to the algorithm implementation steps. The MNIST data set is a classical data set in the field of machine learning and consists of 60000 training samples and 10000 testing samples, and each sample is a 28 x 28 pixel grayscale handwritten digital picture. In the model training of this embodiment 2, 5000 "0" pictures are selected as negative samples, and 5 "1", 10 "2", and 5 "3" pictures are selected as positive samples.
Fig. 5 is a schematic flow chart of the present embodiment for completing the construction and training of the training model. Wherein step S101 and step S105 may be performed in parallel. The specific flow of this embodiment is described as follows:
s101, constructing a variational self-encoder network. Specifically, a variational self-encoder network is constructed using tensrflow according to the variational self-encoder network structure shown in fig. 6. TensorFlow is a symbolic mathematical system based on data flow programming, and is widely applied to programming realization of various machine learning algorithms.
In this embodiment, the network structure of the encoder E (-) is: the 28 × 28 × 1 grayscale image is subjected to 32 3 × 3 convolution kernel convolutions (downsampling) to obtain a 28 × 28 × 32 feature image, then subjected to 16 3 × 3 convolution kernel convolutions (downsampling) to obtain a 28 × 28 × 16 feature image, the feature image is unfolded, a full-connection network is used for transforming the feature image into a 16-dimensional vector, and finally two full-connection networks are used for fitting the mean value and the variance of the hidden space Gaussian distribution. The hidden space is assumed to be an independent multidimensional gaussian distribution with a dimension of 10. The mean is a 10-dimensional vector and the covariance is a 10 x 10 diagonal matrix.
Mean value: μ = (μ_0, μ_1, …, μ_9)
Covariance: a 10 × 10 diagonal matrix σ = diag(σ_0, σ_1, …, σ_9)
the sampling operation S (-) is: and sampling the hidden space with independent multidimensional Gaussian distribution to obtain Z. The specific operation is as follows: sampling independent multidimensional standard gaussian distributions to obtain e, and Z ═ mu + e × sigma.
The network structure of decoder D (-) is: the 10-dimensional Z vector is subjected to full-connection network transformation into a 1568-dimensional vector, then transformed into a 7 × 7 × 32 feature image, deconvoluted (upsampled) using 32 3 × 3 deconvolution kernels to obtain a 14 × 14 × 32 feature image, deconvoluted (upsampled) using 16 3 × 3 deconvolution kernels to obtain a 28 × 28 × 16 feature image, and finally deconvoluted (upsampled) using 1 3 × 3 deconvolution kernel to obtain a 28 × 28 × 1 generated image.
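For illustration, the encoder and decoder described above could be expressed with tf.keras layers roughly as follows; padding, stride and activation choices are assumptions made so that the feature-map sizes match the description, and a log-variance output is used as one common way of fitting the variance:

```python
import tensorflow as tf
from tensorflow.keras import layers

latent_dim = 10  # dimension of the hidden-space Gaussian

# Encoder E(.): 28x28x1 -> 28x28x32 -> 28x28x16 -> 16-d vector -> (mu, log_var)
encoder_inputs = tf.keras.Input(shape=(28, 28, 1))
h = layers.Conv2D(32, 3, padding="same", activation="relu")(encoder_inputs)
h = layers.Conv2D(16, 3, padding="same", activation="relu")(h)
h = layers.Flatten()(h)
h = layers.Dense(16, activation="relu")(h)
mu = layers.Dense(latent_dim)(h)        # mean of the latent Gaussian
log_var = layers.Dense(latent_dim)(h)   # log-variance of the latent Gaussian
encoder = tf.keras.Model(encoder_inputs, [mu, log_var], name="E")

# Sampling S(.): z = mu + eps * sigma (reparameterization trick)
def sample(mu, log_var):
    eps = tf.random.normal(tf.shape(mu))
    return mu + eps * tf.exp(0.5 * log_var)

# Decoder D(.): 10-d z -> 1568-d -> 7x7x32 -> 14x14x32 -> 28x28x16 -> 28x28x1
decoder_inputs = tf.keras.Input(shape=(latent_dim,))
g = layers.Dense(7 * 7 * 32, activation="relu")(decoder_inputs)   # 1568-dimensional
g = layers.Reshape((7, 7, 32))(g)
g = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(g)
g = layers.Conv2DTranspose(16, 3, strides=2, padding="same", activation="relu")(g)
decoder_outputs = layers.Conv2DTranspose(1, 3, padding="same", activation="sigmoid")(g)
decoder = tf.keras.Model(decoder_inputs, decoder_outputs, name="D")
```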
And S102, defining a loss function and training a variational self-encoder network.
Where the reconstruction loss (using the same function as the reconstruction error) employs the mean square error loss (cross entropy loss may also be used):
R(x, x̂) = MSE(x, x̂), i.e. the mean square error between the input sample x and its reconstruction x̂; the loss of a single sample is defined as follows: L = R(x, x̂) + KL(N(μ, σ) ‖ N(0, I)).
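A sketch of this single-sample loss in TensorFlow, assuming the mu/log_var parameterization from the network sketch above:

```python
import tensorflow as tf

def vae_loss(x, x_hat, mu, log_var):
    # Reconstruction loss: mean squared error between input and reconstruction
    rec = tf.reduce_mean(tf.square(x - x_hat))
    # KL divergence between N(mu, sigma^2) and N(0, I), with sigma^2 = exp(log_var)
    kl = 0.5 * tf.reduce_sum(tf.square(mu) + tf.exp(log_var) - log_var - 1.0)
    return rec + kl
```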
in addition, the training process for training the variational self-coder network is shown in fig. 7. The concrete description is as follows:
and S201, inputting and reconstructing a batch of samples. The present embodiment divides 5000 "0" samples into 50 batches, and optimizes the network parameters once for each 100 samples.
S202, calculating the reconstruction loss. The reconstruction loss of the 100 samples in each batch is calculated by TensorFlow using the mean squared error (MSE): L = (1/100) × Σ_i R(x_i, x̂_i), summed over the 100 samples of the batch.
And S203, calculating the gradient of the reconstruction loss to the network parameters. This step invokes the TensorFlow interface to compute the gradient of L to all parameters of the network.
And S204, optimizing network parameters by gradient descent. In the step, a TensorFlow Adam gradient descent optimization strategy is selected to optimize network parameters.
And S205, judging whether training of all batches is completed. This step of this example requires a determination of whether a total of 50 batches are complete. If the sample is not completed, the process goes back to S201 to input and reconstruct the next batch of samples.
And S206, judging whether the maximum iteration number is reached. In this embodiment, the maximum training frequency is defined as 50, and it is determined whether the maximum training frequency is reached, if so, the training of the variational self-encoder network is ended and the subsequent step S103 is performed, otherwise, the step returns to step S201 to perform the next training. Generally, multiple rounds of training are required. In this embodiment, the training samples are divided into 50 batches for training, and after 50 batches of training are completed, one round of training is completed, and after 50 rounds of training (that is, the defined maximum training times 50 are reached), all the training is completed.
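The batch training of S201 to S206 could be sketched as follows, reusing encoder, decoder, sample and vae_loss from the sketches above; the data loading and batching details are illustrative assumptions consistent with the text (5000 "0" images, 50 batches of 100 samples, 50 training rounds):

```python
import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
negatives = x_train[y_train == 0][:5000].astype("float32") / 255.0
negatives = negatives.reshape(-1, 28, 28, 1)

optimizer = tf.keras.optimizers.Adam()

for epoch in range(50):                       # maximum number of training rounds
    for step in range(50):                    # 50 batches of 100 samples each
        batch = negatives[step * 100:(step + 1) * 100]
        with tf.GradientTape() as tape:
            mu, log_var = encoder(batch)
            x_hat = decoder(sample(mu, log_var))
            loss = vae_loss(batch, x_hat, mu, log_var)
        variables = encoder.trainable_variables + decoder.trainable_variables
        grads = tape.gradient(loss, variables)
        optimizer.apply_gradients(zip(grads, variables))
```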
S103, storing the variational self-encoder network and the parameters. This step saves the networks and parameters trained in steps S201-S206 above. E(·) and D(·) are two sub-networks of the variational self-encoder and are stored as two files, respectively; the networks and parameters can be stored as h5 files using the TensorFlow interface.
And S104, calculating the maximum reconstruction error of all negative samples. In this step, the 5000 "0" samples are input into the trained variational self-encoder network (the network shown in fig. 6) for reconstruction, and the reconstruction errors are calculated to obtain the maximum reconstruction error M = max_i R(x_i, x̂_i).
And S105, training the Softmax classifier for all positive samples. In this step, all positive samples are trained using the LogisticRegression of scikit-learn (SKLearn is short for scikit-learn, a Python library and module dedicated to machine learning). The Softmax classifier is the generalization of two-class logistic regression to multiple classes; both are implemented uniformly by the LogisticRegression of scikit-learn, whereas traditional logistic regression can only perform two-class classification. In this embodiment, 5 "1", 10 "2" and 5 "3" pictures are used as positive samples, so the positive sample class number K is 3, and Softmax outputs a probability distribution of the sample. For example, as shown in fig. 8, the sample "1" is input to the Softmax classifier, which outputs the sample probability distribution (p_1, p_2, …, p_i, …, p_K), a 3-dimensional vector corresponding to the one-hot coded label, where p_i represents the probability that the sample belongs to class i. It should be noted that other conventional machine learning algorithms, such as a simple two-layer neural network, can also be used to train the classification model of the positive samples; the method is not limited to the Softmax classifier. The Softmax classifier is preferred because of its convenient probability distribution output, the small amount of training data it requires and its simpler requirements.
And S106, saving the Softmax model and the parameters. This step uses Python's pickle module to save the trained LogisticRegression model.
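S105 and S106 could be sketched with scikit-learn and pickle as follows; flattening the images into 784-dimensional feature vectors, reusing the MNIST arrays loaded in the training sketch above, and the output file name are illustrative assumptions not fixed by this application:

```python
import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression

# 5 "1", 10 "2" and 5 "3" images as positive samples, flattened to feature vectors
ones = x_train[y_train == 1][:5]
twos = x_train[y_train == 2][:10]
threes = x_train[y_train == 3][:5]
X_pos = np.concatenate([ones, twos, threes]).reshape(20, -1).astype("float32") / 255.0
y_pos = np.array([1] * 5 + [2] * 10 + [3] * 5)

# Multinomial logistic regression acts as the Softmax classifier for K = 3 classes
clf = LogisticRegression(max_iter=1000)
clf.fit(X_pos, y_pos)

probs = clf.predict_proba(X_pos[:1])  # (p_1, p_2, p_3) for one sample

with open("softmax_model.pkl", "wb") as f:
    pickle.dump(clf, f)
```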
After model training is completed, the classification of an unknown sample x is predicted according to the following steps, and the flow is shown in fig. 9 and specifically described as follows:
s301, inputting prediction parameters. This step defines the critical section upper and lower limits α ═ 0.9 and β ═ 1.1, and the positive sample classification threshold P ═ 0.9. The critical interval [ α × M, β × M ] defines a fuzzy interval of positive and negative samples.
And S302, calculating the reconstruction error. In this step, the reconstructed feature vector x̂ of the sample x to be predicted is obtained through the trained variational self-encoder network, and its reconstruction error Rx = R(x, x̂) is calculated.
S303, judging whether the reconstruction error is smaller than the lower limit of the critical area.
And S304, if so, outputting the classification as a negative sample. In this step, if Rx < α × M, the output is classified as 0 (i.e. as a negative sample). Here α defines the confidence threshold for classifying a sample as a negative sample: samples below this threshold are classified as negative samples, and a smaller value indicates a greater probability of entering the following re-filtering process.
And S305, if not, calculating the positive sample probability distribution. If the sample cannot be confidently classified as a negative sample, the positive sample classifier SoftmaxK(·) saved in S106 is used to predict the positive sample classification probability distribution of x: (p_1, p_2, …, p_K) = SoftmaxK(x).
s306, judging whether the sample can be classified into a certain type of positive sample with the probability greater than P.
And S307, if yes, outputting the positive sample classification. In this step, if x can be classified into a certain class of positive samples with a probability greater than P, the positive sample classification label is output and the prediction ends. Namely: if max_i p_i > P, output argmax_i p_i (i.e. the positive sample classification) and end the prediction. Here P defines the threshold for the confident classification of positive samples, and a smaller P indicates a higher probability of classifying x into a certain existing positive sample class.
And S308, if not, judging whether the reconstruction error is smaller than the upper limit of the critical area.
And S309, if so, outputting the classification as a negative sample. In this step, if x cannot be classified as a positive sample with a probability greater than P and its reconstruction error lies in the critical interval, x is classified as a negative sample again. Namely: if Rx ≤ β × M, the output is classified as 0 (i.e. as a negative sample) and the prediction ends. Here, a re-filtering of negative samples is achieved.
And S310, if not, newly adding an abnormal classification, and retraining the Softmax classifier.
And S311, outputting the positive sample classification.
S310 and S311 correspond to the case where x can neither be classified into a certain class of positive samples with a probability greater than P, nor does its reconstruction error fall below the critical zone upper limit (i.e. the reconstruction error is above the upper limit). In this case, x is considered to be a sample of a newly added class: the sample is added into the positive sample set, x is labelled as K + 1, K is modified to K + 1, and then the flows of S105 and S106 are invoked to retrain the Softmax classifier, which is used for the next prediction. The coefficient β in S309 defines the preference for a new positive sample class, and a smaller value indicates a higher preference for adding new positive sample classes.
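The new-class handling of S310 and S311 could be sketched as follows, continuing the illustrative names from the scikit-learn sketch above:

```python
def add_new_class_and_retrain(x, X_pos, y_pos, K):
    # Label the unknown sample as a new positive class K + 1 and append it
    X_pos = np.vstack([X_pos, x.reshape(1, -1)])
    y_pos = np.append(y_pos, K + 1)
    # Retrain only the small positive-sample classifier; the autoencoder is untouched
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_pos, y_pos)
    return clf, X_pos, y_pos, K + 1
```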
In summary, in this embodiment 2, during prediction, the selection of the prediction preference is realized by setting the parameters of the critical section, that is, setting the upper and lower limits of the critical section and the classification threshold of the positive sample, and the prediction preference can be dynamically adjusted. For newly added positive sample classification, only a few Softmax classifiers of the positive samples need to be retrained, and all samples do not need to be retrained, so that the rapid updating iteration of the model can be realized. In addition, the method is different from the existing method, and the operation of newly adding or discarding the characteristic information of the sample does not exist, so that the under-fitting and over-fitting risks are reduced compared with the existing algorithm.
Example 3:
on the basis of the unbalanced sample classification methods provided in embodiments 1 to 2, the present invention further provides an unbalanced sample classification apparatus for implementing the methods and systems, as shown in fig. 10, which is a schematic diagram of an apparatus architecture according to an embodiment of the present invention. The unbalanced sample classification apparatus of the present embodiment includes one or more processors 21 and a memory 22. In fig. 10, one processor 21 is taken as an example.
The processor 21 and the memory 22 may be connected by a bus or other means, and fig. 10 illustrates the connection by a bus as an example.
The memory 22, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the unbalanced sample classification method in embodiments 1 to 2. The processor 21 executes various functional applications and data processing of the unbalanced sample classification device by running the nonvolatile software programs, instructions, and modules stored in the memory 22, that is, implements the unbalanced sample classification methods of embodiments 1 to 2.
The memory 22 may include high speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the memory 22 may optionally include memory located remotely from the processor 21, which may be connected to the processor 21 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Program instructions/modules are stored in the memory 22 and, when executed by the one or more processors 21, perform the unbalanced sample classification method of embodiments 1 to 2 described above, for example, perform the respective steps shown in fig. 1 to 9 described above.
Those of ordinary skill in the art will appreciate that all or part of the steps of the various methods of the embodiments may be implemented by associated hardware as instructed by a program, which may be stored on a computer-readable storage medium, which may include: a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and the like.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (10)

1. A method for classifying unbalanced samples, comprising:
model construction and training are carried out: constructing and training a variational self-encoder network, calculating the maximum reconstruction error of a negative sample, and carrying out classification model training on a positive sample;
predicting the classification of the unknown sample: and setting the upper and lower limits of the critical region and the classification threshold of the positive sample to be compared with the reconstruction error of the unknown sample and the classification probability distribution of the positive sample so as to predict the classification of the unknown sample.
2. The method for classifying unbalanced samples according to claim 1, wherein the performing model construction and training specifically comprises:
constructing a variational self-encoder network and setting a loss function and a reconstruction error;
training a variational self-encoder network and storing the network and parameters of the variational self-encoder network;
inputting all negative samples into a variational self-encoder network for reconstruction and calculating a reconstruction error, and taking the maximum value of the calculated reconstruction error as the maximum reconstruction error of the negative samples;
and training classification models of all positive samples and storing the classification models and parameters.
3. The method for classifying unbalanced samples according to claim 2, wherein the variational self-encoder network comprises an encoder network and a decoder network, and when saving the variational self-encoder network, the network and the parameters of the encoder network and the parameters of the decoder network are saved respectively.
4. The method of claim 3, wherein the setting the loss function and the reconstruction error specifically comprises:
reconstructing a feature vector x of a sample into
x̂ = D(S(E(x))), wherein E(·) represents the encoder function, S(·) represents the Gaussian-distribution sampling function, and D(·) represents the decoder function;
setting the reconstruction error to Rx = R(x, x̂);
the loss function for a single sample is L = R(x, x̂) + KL(N(μ, σ) ‖ N(0, I)), where μ is the mean of the Gaussian distribution and σ is the covariance of the Gaussian distribution.
5. The method of claim 4, wherein the training the variational self-encoder network specifically comprises:
dividing the negative samples into a plurality of batches for training in sequence;
calculating the reconstruction loss of each batch and the gradient of the reconstruction loss to the network parameters;
optimizing network parameters through a gradient descent optimization strategy;
and judging whether all batches are trained or not and whether the maximum training times are reached or not, and if so, finishing the training.
6. The method as claimed in claim 5, wherein the predicting the class to which the unknown sample belongs comprises:
setting a lower limit coefficient alpha and an upper limit coefficient beta of a critical zone to obtain a critical interval [ alpha multiplied by M, beta multiplied by M ], wherein alpha is less than beta, and M is the maximum reconstruction error of a negative sample;
setting a positive sample classification threshold P, wherein 1/K is more than P and less than 1, and K is the positive sample classification number;
inputting the unknown sample into the encoder network and the decoder network and calculating the reconstruction error Rx of the unknown sample;
judging whether the reconstruction error Rx of the unknown sample is less than the critical zone lower limit alpha × M, so as to output the prediction result as a negative sample or to calculate the positive sample classification probability distribution of the unknown sample;
determining whether the unknown sample can be classified into a certain class of positive samples with a probability greater than the positive sample classification threshold P and whether the reconstruction error Rx of the unknown sample is within the critical interval, so as to determine whether the output prediction result is a positive sample or a negative sample.
7. The method for classifying unbalanced samples according to claim 6, wherein the judging whether the reconstruction error Rx of the unknown sample is less than the critical zone lower limit alpha × M, so as to output the prediction result as a negative sample or to calculate the positive sample classification probability distribution of the unknown sample, specifically comprises:
if the reconstruction error Rx of the unknown sample is less than the critical zone lower limit alpha × M, outputting the prediction result as a negative sample; otherwise, calculating the positive sample classification probability distribution of the unknown sample through the stored classification model.
8. The method of claim 7, wherein determining whether the unknown sample can be classified into a certain class of positive samples with a probability greater than the positive sample classification threshold P, and whether the reconstruction error Rx of the unknown sample lies within the critical interval, so as to output the prediction result as a positive sample or a negative sample, specifically comprises:
if the unknown sample can be classified into a certain class of positive samples with a probability greater than the positive sample classification threshold P, outputting the prediction result as that positive sample classification;
if the unknown sample cannot be classified into a certain class of positive samples with a probability greater than the positive sample classification threshold P, and the reconstruction error Rx of the unknown sample lies within the critical interval [α×M, β×M], classifying the unknown sample as a negative sample and outputting the prediction result as a negative sample;
if the unknown sample cannot be classified into a certain class of positive samples with a probability greater than the positive sample classification threshold P, and the reconstruction error Rx of the unknown sample is greater than the critical zone upper limit β×M, regarding the unknown sample as a sample of a newly added classification, adding it to the positive sample set, and outputting the prediction result as that positive sample classification (the combined decision logic is sketched after this claim).
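Putting claims 7 and 8 together, a minimal sketch of the three-way decision on an unknown sample is given below; classifier_proba is a hypothetical function returning the K-way positive sample probability distribution, and reconstruction_error is the helper from the previous sketch.

def predict(x, vae, classifier_proba, alpha, beta, M, P):
    """Return 'negative', a known positive class index, or 'new_class' for an unknown sample."""
    r_x = reconstruction_error(vae, x)
    if r_x < alpha * M:                        # claim 7: below the critical zone lower limit
        return "negative"
    probs = classifier_proba(x)                # positive sample classification probability distribution
    best_class, best_prob = max(enumerate(probs), key=lambda kv: kv[1])
    if best_prob > P:                          # claim 8, first branch: confidently a known positive class
        return best_class
    if r_x <= beta * M:                        # claim 8, second branch: inside the critical interval
        return "negative"
    return "new_class"                         # claim 8, third branch: above the upper limit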
9. The unbalanced sample classification method as claimed in claim 8, wherein when an unknown sample of the newly added classification is added to the positive sample set, the label of that sample is set to K+1, the positive sample classification number K is updated to K+1, and the classification model over all positive samples is then retrained for use in the next prediction.
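Claim 9 then grows the positive sample set and retrains; a minimal sketch under the same assumptions, where train_classifier stands in for whatever classification model the implementation uses (a hypothetical name, not defined by the patent).

def add_new_class(x_new, positive_samples, positive_labels, K, train_classifier):
    """Add a newly discovered class to the positive sample set and retrain the classifier."""
    positive_samples.append(x_new)
    positive_labels.append(K + 1)              # label of the new-class sample is set to K+1
    K = K + 1                                  # positive sample classification number becomes K+1
    classifier = train_classifier(positive_samples, positive_labels, num_classes=K)
    return classifier, K                       # retrained model used for the next prediction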
10. An unbalanced sample classification device, characterized in that:
comprising at least one processor and a memory connected by a data bus, the memory storing instructions executable by the at least one processor, wherein the instructions, when executed by the processor, perform the unbalanced sample classification method according to any one of claims 1 to 9.
CN202210048383.4A 2022-01-17 2022-01-17 Unbalanced sample classification method and device Active CN114494772B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210048383.4A CN114494772B (en) 2022-01-17 2022-01-17 Unbalanced sample classification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210048383.4A CN114494772B (en) 2022-01-17 2022-01-17 Unbalanced sample classification method and device

Publications (2)

Publication Number Publication Date
CN114494772A true CN114494772A (en) 2022-05-13
CN114494772B CN114494772B (en) 2024-05-14

Family

ID=81511265

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210048383.4A Active CN114494772B (en) 2022-01-17 2022-01-17 Unbalanced sample classification method and device

Country Status (1)

Country Link
CN (1) CN114494772B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070189600A1 (en) * 2006-01-13 2007-08-16 Yun-Qing Shi Method for identifying marked content
US20130070997A1 (en) * 2011-09-16 2013-03-21 Arizona Board of Regents, a body Corporate of the State of Arizona, Acting for and on Behalf of Ariz Systems, methods, and media for on-line boosting of a classifier
CN106569840A (en) * 2015-10-08 2017-04-19 上海智瞳通科技有限公司 Method for machine vision driving assistance system to automatically obtain sample to improve recognition accuracy
CN110147321A (en) * 2019-04-19 2019-08-20 北京航空航天大学 A kind of recognition methods of the defect high risk module based on software network
CN111585997A (en) * 2020-04-27 2020-08-25 国家计算机网络与信息安全管理中心 Network flow abnormity detection method based on small amount of labeled data
CN111695594A (en) * 2020-04-29 2020-09-22 平安科技(深圳)有限公司 Image category identification method and device, computer equipment and medium
WO2021139236A1 (en) * 2020-06-30 2021-07-15 平安科技(深圳)有限公司 Autoencoder-based anomaly detection method, apparatus and device, and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
朱江; 明月; 王森: "Security Situation Element Acquisition Mechanism Based on Deep Auto-encoder Network" (基于深度自编码网络的安全态势要素获取机制), Computer Applications (计算机应用), no. 03, 10 March 2017 (2017-03-10) *

Also Published As

Publication number Publication date
CN114494772B (en) 2024-05-14

Similar Documents

Publication Publication Date Title
CN108231201B (en) Construction method, system and application method of disease data analysis processing model
US20210089922A1 (en) Joint pruning and quantization scheme for deep neural networks
Wu et al. Cascaded fully convolutional networks for automatic prenatal ultrasound image segmentation
CN106951825B (en) Face image quality evaluation system and implementation method
US8473430B2 (en) Deep-structured conditional random fields for sequential labeling and classification
CN113657561B (en) Semi-supervised night image classification method based on multi-task decoupling learning
CN110309868A (en) In conjunction with the hyperspectral image classification method of unsupervised learning
Rahimi et al. A parallel fuzzy c-mean algorithm for image segmentation
WO2021051987A1 (en) Method and apparatus for training neural network model
WO2023217290A1 (en) Genophenotypic prediction based on graph neural network
Murugan Implementation of deep convolutional neural network in multi-class categorical image classification
CN112861752B (en) DCGAN and RDN-based crop disease identification method and system
CN114463605A (en) Continuous learning image classification method and device based on deep learning
CN115290326A (en) Rolling bearing fault intelligent diagnosis method
CN114743037A (en) Deep medical image clustering method based on multi-scale structure learning
CN111144296B (en) Retina fundus picture classification method based on improved CNN model
CN112786160A (en) Multi-image input multi-label gastroscope image classification method based on graph neural network
KR20220079726A (en) Method for predicting disease based on medical image
CN117422942A (en) Model training method, image classification device, and storage medium
CN110288002B (en) Image classification method based on sparse orthogonal neural network
CN117079017A (en) Credible small sample image identification and classification method
CN114494772A (en) Unbalanced sample classification method and device
CN108304546B (en) Medical image retrieval method based on content similarity and Softmax classifier
US20240135708A1 (en) Permutation invariant convolution (pic) for recognizing long-range activities
CN115346084A (en) Sample processing method, sample processing apparatus, electronic device, storage medium, and program product

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240627

Address after: 430000 No. 6, High-tech Fourth Road, Donghu High-tech Development Zone, Wuhan City, Hubei Province

Patentee after: FIBERHOME TELECOMMUNICATION TECHNOLOGIES Co.,Ltd.

Country or region after: China

Patentee after: Wuhan Changjiang Computing Technology Co.,Ltd.

Address before: 430000 No. 6, High-tech Fourth Road, Donghu High-tech Development Zone, Wuhan City, Hubei Province

Patentee before: FIBERHOME TELECOMMUNICATION TECHNOLOGIES Co.,Ltd.

Country or region before: China

TR01 Transfer of patent right