Disclosure of Invention
Aiming at the defects or improvement requirements of the prior art, the invention provides an unbalanced sample classification method and device that combine the advantages of deep learning and traditional machine learning algorithms. The classification of samples is divided into the two steps of model training and prediction; a reconstruction error is introduced, the deep learning model of a variational self-encoder is used to separate positive and negative samples, a traditional machine learning model is used to classify the positive samples with their small data volume, and a critical zone and a positive sample classification threshold are introduced so that the classification preference can be flexibly adjusted in the prediction stage.
The embodiment of the invention adopts the following technical scheme:
In a first aspect, the present invention provides an unbalanced sample classification method, including:
model construction and training: constructing and training a variational self-encoder network, calculating the maximum reconstruction error of the negative samples, and carrying out classification model training on the positive samples;
predicting the classification to which an unknown sample belongs: setting the upper and lower limits of the critical zone and the positive sample classification threshold, and comparing them with the reconstruction error of the unknown sample and its positive sample classification probability distribution, so as to predict the classification of the unknown sample.
Further, the model building and training specifically includes:
constructing a variational self-encoder network and setting a loss function and a reconstruction error;
training a variational self-encoder network and storing the network and parameters of the variational self-encoder network;
inputting all negative samples into a variational self-encoder network for reconstruction and calculating a reconstruction error, and taking the maximum value of the calculated reconstruction error as the maximum reconstruction error of the negative samples;
and training classification models of all positive samples and storing the classification models and parameters.
Further, the variational self-encoder network comprises an encoder network and a decoder network, and when the variational self-encoder network is stored, the networks and parameters of the encoder network and of the decoder network are stored separately.
Further, the setting of the loss function and the reconstruction error specifically includes:
reconstructing a feature vector x of a sample into x̂ = D(S(E(x))),
wherein E(·) represents the encoder function, S(·) represents a Gaussian distribution sampling function, and D(·) represents the decoder function;
setting the reconstruction error as R_x = R(x, x̂), where R(·,·) is a reconstruction error computation function such as the mean squared error;
the loss function for a single sample is L = R(x, x̂) + D_KL(N(μ, σ²) ‖ N(0, I)),
where μ is the mean of the Gaussian distribution, σ is the covariance of the Gaussian distribution, and D_KL denotes the KL divergence of the hidden-space Gaussian distribution from the standard normal distribution.
Further, the training of the variational self-encoder network specifically includes:
dividing the negative samples into a plurality of batches for training in sequence;
calculating the reconstruction loss of each batch and the gradient of the reconstruction loss to the network parameters;
optimizing network parameters through a gradient descent optimization strategy;
and judging whether all batches have been trained and whether the maximum number of training rounds has been reached, and if so, finishing the training.
Further, the predicting the class to which the unknown sample belongs specifically includes:
setting a lower limit coefficient α and an upper limit coefficient β of the critical zone to obtain a critical interval [α×M, β×M], where α < β and M is the maximum reconstruction error of the negative samples;
setting a positive sample classification threshold P, where 1/K < P < 1 and K is the number of positive sample classes;
inputting the unknown sample into the encoder network and the decoder network and calculating the reconstruction error R_x of the unknown sample;
judging whether the reconstruction error R_x of the unknown sample is less than the critical zone lower limit α×M, so as to output the prediction result as a negative sample or to calculate the positive sample classification probability distribution of the unknown sample;
judging whether the unknown sample can be classified into a certain class of positive samples with probability greater than the positive sample classification threshold P, and whether the reconstruction error R_x of the unknown sample falls within the critical interval, so as to determine whether the output prediction result is a positive sample or a negative sample.
Further, the judging whether the reconstruction error R_x of the unknown sample is less than the critical zone lower limit α×M, so as to output the prediction result as a negative sample or to calculate the positive sample classification probability distribution of the unknown sample, specifically includes:
if the reconstruction error R_x of the unknown sample is less than the critical zone lower limit α×M, outputting the prediction result as a negative sample; otherwise, calculating the positive sample classification probability distribution of the unknown sample through the stored classification model.
Further, the judging whether the unknown sample can be classified into a certain class of positive samples with probability greater than the positive sample classification threshold P, and whether the reconstruction error R_x of the unknown sample falls within the critical interval, so as to determine whether the output prediction result is a positive sample or a negative sample, specifically includes:
if the unknown sample can be classified into a certain class of positive samples with probability greater than the positive sample classification threshold P, outputting the prediction result as that positive sample classification;
if the unknown sample cannot be classified into a certain class of positive samples with probability greater than the positive sample classification threshold P, and the reconstruction error R_x of the unknown sample is within the critical interval [α×M, β×M], classifying the unknown sample as a negative sample and outputting the prediction result as a negative sample;
if the unknown sample cannot be classified into a certain class of positive samples with probability greater than the positive sample classification threshold P, and the reconstruction error R_x of the unknown sample is greater than the critical zone upper limit, regarding the unknown sample as a sample of a newly added classification, adding it to the positive sample set, and outputting the prediction result as that positive sample classification.
Further, when the unknown sample of the new class is added to the positive sample set, the label of the unknown sample of the new class is set to K+1, the positive sample class number K is updated to K+1, and the classification models of all positive samples are then retrained for the next prediction.
In another aspect, the present invention provides an unbalanced sample classification device, which specifically comprises at least one processor and a memory, wherein the at least one processor and the memory are connected through a data bus, and the memory stores instructions executable by the at least one processor, the instructions, after being executed by the processor, being used for completing the unbalanced sample classification method of the first aspect.
Compared with the prior art, the invention has the following beneficial effects: during prediction, the selection of the prediction preference is realized by setting the critical zone parameters, that is, the upper and lower limits of the critical zone and the positive sample classification threshold, and the prediction preference can be dynamically adjusted. For a newly added positive sample classification, only the classification model of the small number of positive samples needs to be retrained rather than all samples, so rapid update iteration of the model can be achieved. In addition, unlike existing methods, no feature information of the samples is added or discarded, so the risks of under-fitting and over-fitting are reduced compared with existing algorithms.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The present invention relates to a system structure with specific functions, so the specific embodiments mainly explain the functional and logical relationships of the structural modules, and the specific software and hardware implementations are not limited.
In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other. The invention will be described in detail below with reference to the figures and examples.
Example 1:
As shown in fig. 1, an embodiment of the present invention provides an unbalanced sample classification method, which combines deep learning and conventional machine learning for the multi-classification problem of unbalanced positive and negative samples. The method comprises the following two steps.
Step 100 (model construction and training): construct and train a variational self-encoder network, calculate the maximum reconstruction error of the negative samples, and carry out classification model training on the positive samples. This step is mainly used for establishing the variational self-encoder network model and the classification model of the positive samples, and training the two models for the subsequent prediction.
Step 200 (predict the class to which an unknown sample belongs): set the upper and lower limits of the critical zone and the positive sample classification threshold, and compare them with the reconstruction error of the unknown sample and its positive sample classification probability distribution, so as to predict the classification of the unknown sample. This step is mainly used for predicting the classification of the unknown sample, based on the variational self-encoder network model and the positive sample classification model trained in step 100.
By this method, the upper and lower limits of the critical zone and the positive sample classification threshold are set during prediction to realize the selection of the prediction preference, which solves the problem that the preference of the prediction stage cannot be flexibly adjusted. In addition, since the classification model of the positive samples is constructed and trained independently, for a newly added positive sample classification only the classification model of the small number of positive samples needs to be retrained rather than all samples, so the model can be updated and iterated quickly.
The following describes step 100 (performing model construction and training) and step 200 (predicting the classification to which the unknown sample belongs) in this embodiment in detail.
As shown in fig. 2, in the preferred embodiment, the step 100 (performing model building and training) may specifically include the following steps:
Step 101: construct a variational self-encoder network and set the loss function and the reconstruction error. In this step of the preferred embodiment, the constructed variational self-encoder network includes an encoder network and a decoder network. For a sample, there may be a loss or reconstruction error after reconstruction (i.e. encoding and decoding); for example, the feature vector x of a sample is reconstructed into
x̂. The preferred embodiment then sets x̂ = D(S(E(x))), wherein E(·) represents the encoder function, S(·) represents a Gaussian distribution sampling function, and D(·) represents the decoder function, and sets the reconstruction error to R_x = R(x, x̂), where R(·,·) is the reconstruction error computation function. The loss function for a single sample is L = R(x, x̂) + D_KL(N(μ, σ²) ‖ N(0, I)), where μ is the mean of the Gaussian distribution, σ is the covariance of the Gaussian distribution, and D_KL denotes the KL divergence of the hidden-space Gaussian distribution from the standard normal distribution.
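Purely for illustration (and not as part of the claimed method), the single-sample loss described above can be written as the following Python/NumPy sketch; the use of the mean squared error for R(·,·) and the closed-form KL term of a diagonal Gaussian against the standard normal are assumptions made for the example:

import numpy as np

def reconstruction_error(x, x_hat):
    # R(x, x_hat): mean squared error between the sample and its reconstruction (assumed)
    return np.mean((x - x_hat) ** 2)

def single_sample_loss(x, x_hat, mu, sigma):
    # Assumed closed form of KL(N(mu, diag(sigma^2)) || N(0, I)) for a diagonal Gaussian
    kl = 0.5 * np.sum(mu ** 2 + sigma ** 2 - np.log(sigma ** 2) - 1.0)
    return reconstruction_error(x, x_hat) + kl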
Step 102: training the variational self-encoder network and storing the network and parameters. In this step of the preferred embodiment, based on step 101 above, this step trains the encoder network and decoder network of the variational self-encoder using existing negative sample data to obtain E (-) and D (-) respectively. When storing, E (-) and D (-) are two sub-networks of the variational autoencoder, and are stored as two files, respectively.
Step 103: input all negative samples into the variational self-encoder network for reconstruction, calculate the reconstruction errors, and take the maximum of the calculated reconstruction errors as the maximum reconstruction error of the negative samples. In this step of the preferred embodiment, all negative samples are input into the trained variational self-encoder network for reconstruction. Let xᵢ⁻ denote the i-th negative sample feature vector (similarly, xᵢ⁺ may denote the i-th positive sample feature vector); xᵢ⁻ is reconstructed into x̂ᵢ⁻ = D(S(E(xᵢ⁻))), with reconstruction error R(xᵢ⁻, x̂ᵢ⁻). Finally, the maximum value of the calculated reconstruction errors is taken as the maximum reconstruction error M of the negative samples, M = maxᵢ R(xᵢ⁻, x̂ᵢ⁻).
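A minimal sketch of this step, assuming the encoder E, the sampling function S and the decoder D are available as Python callables and the mean squared error is used as the reconstruction error (the names are illustrative, not mandated by the method):

import numpy as np

def max_negative_reconstruction_error(negative_samples, E, S, D):
    # Reconstruct every negative sample through the variational self-encoder and
    # keep the largest reconstruction error as M (mean squared error assumed for R)
    errors = []
    for x in negative_samples:
        x_hat = D(S(E(x)))
        errors.append(np.mean((x - x_hat) ** 2))
    return max(errors)

# M = max_negative_reconstruction_error(negative_samples, E, S, D)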
step 104: and training Softmax classifiers of all positive samples and saving Softmax models and parameters. I.e. training the classification models of all positive samples and storing the classification models and parameters. The present embodiment preferably uses a Softmax classifier for the classification model training of positive samples. In this step of the preferred embodiment, in the case that the positive samples are few relative to the negative samples, the Softmax classifier is trained on the few positive samples to obtain a Softmax model: softmaxK(. K), here, is the positive sample classification number.
Through the steps 101 to 104, the construction and training of the variational self-coder network model and the positive sample classification model (namely, the Softmax model) can be completed. The steps 101 to 103 are to use a large number of negative samples (relative to the positive samples, the negative samples are large) to construct and train the self-encoder network model, and the step 104 is to use a small number of positive samples (relative to the negative samples, the positive samples are small) to construct and train the Softmax model, and the construction and training of the two models can be parallel, without clear precedence, that is, the steps 101 to 103 and the step 104 can be executed in parallel.
As shown in fig. 3, in the preferred embodiment, the training of the variational self-encoder network in step 102 can be further specifically divided into the following steps:
step 102-1: the negative samples are divided into several batches for training in sequence.
Step 102-2: the reconstruction loss for each batch and the gradient of the reconstruction loss to the network parameters are calculated.
Step 102-3: and optimizing network parameters through a gradient descent optimization strategy.
Step 102-4: and judging whether all batches are trained or not and whether the maximum training times are reached or not, and if so, finishing the training. If not all batches are trained, or the maximum number of training times is not reached, then continuing from step 102-1, training the variational self-encoder network using the next batch of negative samples in sequence.
In combination with the above steps, the preferred embodiment may also use input and output modes to describe the training of the above two models (the variational self-coder network model and the Softmax model), and the specific input and output processes are as follows:
Input: a negative sample feature set {xᵢ⁻}, i = 1, …, m1, and a positive sample set {(xᵢ⁺, yᵢ)}, i = 1, …, m2, together with the reconstruction error loss computation function R(·,·); here yᵢ, with values in 1 to K, represents the label of the i-th positive sample, K > 1 represents the number of positive sample classes, (xᵢ⁺, yᵢ) represents the i-th positive sample feature vector together with its label, both positive and negative samples are n-dimensional real vectors, and m1 ≫ m2, i.e. m1 (the number of negative samples) is much larger than m2 (the number of positive samples).
Output: the encoder model E(·) of the negative sample variational self-encoder; the Gaussian distribution sampling function S(·), where the mean μ and covariance matrix σ of the Gaussian distribution are the output of the encoder; the decoder model D(·); the maximum negative sample reconstruction error M; and the Softmax classifier model Softmax_K(·) for the K positive sample classes.
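Purely as an illustration of steps 102-1 to 102-4 and the inputs and outputs above, a training-loop sketch (for instance with TensorFlow) follows; the vae model object returning the reconstruction together with μ and the log-variance, the loss_fn callable, and the batch/epoch values are assumptions rather than requirements of the method:

import tensorflow as tf

def train_vae(vae, negative_samples, loss_fn, epochs=50, batch_size=100):
    # vae: assumed tf.keras.Model mapping x -> (x_hat, mu, log_var); loss_fn: batch loss
    optimizer = tf.keras.optimizers.Adam()
    dataset = tf.data.Dataset.from_tensor_slices(negative_samples).batch(batch_size)
    for _ in range(epochs):                                # stop after the maximum number of rounds (102-4)
        for batch in dataset:                              # train batch by batch (102-1)
            with tf.GradientTape() as tape:
                x_hat, mu, log_var = vae(batch, training=True)
                loss = loss_fn(batch, x_hat, mu, log_var)  # reconstruction + KL loss of the batch (102-2)
            grads = tape.gradient(loss, vae.trainable_variables)
            optimizer.apply_gradients(zip(grads, vae.trainable_variables))  # gradient descent (102-3)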
The above is a detailed description of step 100 (performing model construction and training) in the present preferred embodiment, and step 200 (predicting the classification to which the unknown sample belongs) in the present preferred embodiment is described in detail below.
As shown in fig. 4, in the preferred embodiment, the step 200 (predicting the class to which the unknown sample belongs) may specifically include the following steps:
step 201: setting a lower limit coefficient alpha and an upper limit coefficient beta of a critical area to obtain a critical interval [ alpha multiplied by M, beta multiplied by M ], and simultaneously setting a positive sample classification threshold P, wherein alpha is less than beta, 1/K is less than P and less than 1, M is the maximum reconstruction error of a negative sample, and K is the classification number of the positive sample. The step of the preferred embodiment is a pre-preparation step, which is mainly used to set various parameters.
Step 202: input the unknown sample into the encoder network and the decoder network and calculate the reconstruction error R_x of the unknown sample. This step of the preferred embodiment starts the prediction of an unknown sample x (i.e. a sample to be predicted): the unknown sample x is first input into the previously trained encoder network and decoder network to obtain its reconstructed feature vector x̂ = D(S(E(x))) and reconstruction error R_x = R(x, x̂).
Step 203: judge whether the reconstruction error R_x of the unknown sample is less than the critical zone lower limit α×M, so as to output the prediction result as a negative sample or to calculate the positive sample classification probability distribution of the unknown sample. In this step of the preferred embodiment, if the reconstruction error R_x of the unknown sample is less than the critical zone lower limit α×M, the output classification is 0 (i.e. the prediction result is output as a negative sample) and the prediction ends; otherwise, the positive sample classification probability distribution of the unknown sample is calculated through the stored Softmax model for the next prediction judgment. The coefficient α of this step is equivalent to a confidence threshold for classifying a sample as a negative sample: samples below this threshold are confidently classified as negative samples, and a smaller value means a higher probability of entering the re-filtering process of the following steps. The α parameter setting in this step embodies the beneficial effect that the embodiment can realize the selection of the prediction preference and can dynamically adjust it.
In the preferred embodiment, the positive sample classification probability distribution is calculated as follows: the previously trained positive sample classifier Softmax_K(·) (i.e. the trained Softmax model) is used to predict the positive sample classification probability distribution of the unknown sample x: (p_1, p_2, …, p_K) = Softmax_K(x).
step 204: judging whether the unknown sample can be classified into a certain class of positive samples with the probability greater than the positive sample classification threshold P and the reconstruction error R of the unknown samplexWhether the output prediction result is within the critical interval or not is determined to be a positive sample or a negative sample.
In step 204 of the preferred embodiment, if the unknown sample can be classified as a certain type of positive sample with a probability greater than the positive sample classification threshold P (i.e. the calculated probability distribution P)1,p2,...,pKA certain probability distribution greater than the positive sample classification threshold P) exists, the prediction result is output as the positive sample classification corresponding to the probability distribution, and the prediction is ended. Namely: if maxipiIf > P, argmax is outputip (i.e. classification of positive samples), end prediction, otherwise go to the next step. P of the step is equivalent to a threshold for defining credible classification of the positive samples, and the smaller P represents the higher probability of classifying the unknown sample x into a certain class of existing positive samples. The P parameter setting in this step also embodies the beneficial effects that the present embodiment can realize the selection of the prediction preference, and can dynamically adjust the prediction preference.
In step 204 of the preferred embodiment, the unknown sample cannot be classified as a positive sample of a certain class with a probability greater than the positive sample classification threshold P (i.e. the calculated probability distribution P)1,p2,...,pKThere is no probability distribution greater than the positive sample classification threshold P) and the reconstruction error R of the unknown samplexIn a critical interval [ alpha × M, beta × M]And if so, re-classifying the unknown sample into a negative sample, outputting a prediction result as the negative sample, and finishing the prediction. That is, on the basis of the last judgment: if R isxAnd if the output is less than or equal to beta multiplied by M, the classification is 0 (namely, the classification is a negative sample), the prediction is finished, and otherwise, the next step is carried out. This step achieves a re-filtering of negative samples and beta in this step corresponds to a preference defining a new class of positive samples, this stepSmaller values indicate a greater bias towards new positive sample classifications. Therefore, the beta parameter setting of the step also represents the beneficial effect that the embodiment can realize the selection of the prediction preference and can dynamically adjust the prediction preference.
In step 204 of the preferred embodiment, if the unknown sample can not be classified as a certain type of positive sample with a probability greater than the positive sample classification threshold P, the reconstruction error R of the unknown sample isxAnd if the prediction result is larger than the upper limit of the critical zone, the unknown sample is considered to be a newly-added classified sample, the newly-added classified sample is added into the positive sample set, the prediction result is output to be the positive sample classification, and then the prediction is finished. In this step, when the unknown sample x of the new class is added to the positive sample set as the new positive sample class, the label of the unknown sample x of the new class is set to K +1, and the positive sample class number K is modified to K +1, and then the process goes to step 104 to retrain all the Softmax classifiers of the positive samples for the next prediction.
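Combining steps 201 to 204, a compact sketch of the whole prediction decision is given below; it assumes the trained encoder/decoder as Python callables, a classifier object exposing per-class probabilities (e.g. a predict_proba method), and the mean squared error as R(·,·); the helper name and parameter values are illustrative:

import numpy as np

def predict_class(x, E, S, D, clf, M, alpha=0.9, beta=1.1, P=0.9):
    # Step 202: reconstruct the unknown sample and compute its reconstruction error R_x
    x_hat = D(S(E(x)))
    r_x = np.mean((x - x_hat) ** 2)           # mean squared error assumed for R(.,.)
    # Step 203: confidently negative if the error is below the critical zone lower limit
    if r_x < alpha * M:
        return 0                              # negative sample
    # Positive sample classification probability distribution (p_1, ..., p_K)
    probs = clf.predict_proba(x.reshape(1, -1))[0]
    # Step 204: trusted positive classification if some probability exceeds P
    if probs.max() > P:
        return int(clf.classes_[probs.argmax()])
    # Re-filtering: error still inside the critical interval -> negative sample
    if r_x <= beta * M:
        return 0
    # Error above the critical zone upper limit -> treat x as a new positive class K+1;
    # the caller would add x to the positive sample set and retrain the classifier
    return len(clf.classes_) + 1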
In combination with the above steps, the preferred embodiment may further use input and output modes to describe the process of predicting the classification to which the unknown sample belongs, where the specific input and output processes are as follows:
inputting: sample to be predicted
The upper and lower limits of the critical section are alpha and beta (alpha is less than beta, such as 0.9 and 1.1), and the abnormal sample classification threshold is P (1/K is less than P1, such as P is 0.9).
Output: the classification of the sample to be predicted.
In summary, it can be seen from the preferred embodiment that, during prediction, the selection of the prediction preference is realized by setting the critical zone parameters, that is, the upper and lower limits of the critical zone and the positive sample classification threshold, and the prediction preference can be dynamically adjusted. For a newly added positive sample classification, only the Softmax classifier of the small number of positive samples needs to be retrained rather than all samples, so rapid update iteration of the model can be achieved. In addition, unlike existing methods, no feature information of the samples is added or discarded, so the risks of under-fitting and over-fitting are reduced compared with existing algorithms. Although the method of the preferred embodiment is mainly aimed at classifying unbalanced samples, it can also classify relatively balanced samples and is not limited to unbalanced samples only.
Example 2:
based on the unbalanced sample classification method provided in embodiment 1, embodiment 2 provides a concrete implementation example based on embodiment 1. In the following, an implementation is given by taking the MNIST dataset as an example according to the algorithm implementation steps. The MNIST data set is a classical data set in the field of machine learning and consists of 60000 training samples and 10000 testing samples, and each sample is a 28 x 28 pixel grayscale handwritten digital picture. In the model training of this embodiment 2, 5000 "0" pictures are selected as negative samples, and 5 "1", 10 "2", and 5 "3" pictures are selected as positive samples.
Fig. 5 is a schematic flow chart of the present embodiment for completing the construction and training of the training model. Wherein step S101 and step S105 may be performed in parallel. The specific flow of this embodiment is described as follows:
S101, construct a variational self-encoder network. Specifically, a variational self-encoder network is constructed using TensorFlow according to the variational self-encoder network structure shown in fig. 6. TensorFlow is a symbolic mathematical system based on dataflow programming and is widely used in the programming implementation of various machine learning algorithms.
In this embodiment, the network structure of the encoder E (-) is: the 28 × 28 × 1 grayscale image is subjected to 32 3 × 3 convolution kernel convolutions (downsampling) to obtain a 28 × 28 × 32 feature image, then subjected to 16 3 × 3 convolution kernel convolutions (downsampling) to obtain a 28 × 28 × 16 feature image, the feature image is unfolded, a full-connection network is used for transforming the feature image into a 16-dimensional vector, and finally two full-connection networks are used for fitting the mean value and the variance of the hidden space Gaussian distribution. The hidden space is assumed to be an independent multidimensional gaussian distribution with a dimension of 10. The mean is a 10-dimensional vector and the covariance is a 10 x 10 diagonal matrix.
Mean: μ = (μ_0, μ_1, …, μ_9).
The sampling operation S(·) is: sample the independent multidimensional Gaussian distribution of the hidden space to obtain Z. The specific operation is: sample an independent multidimensional standard Gaussian distribution to obtain ε, and compute Z = μ + ε × σ.
The network structure of decoder D (-) is: the 10-dimensional Z vector is subjected to full-connection network transformation into a 1568-dimensional vector, then transformed into a 7 × 7 × 32 feature image, deconvoluted (upsampled) using 32 3 × 3 deconvolution kernels to obtain a 14 × 14 × 32 feature image, deconvoluted (upsampled) using 16 3 × 3 deconvolution kernels to obtain a 28 × 28 × 16 feature image, and finally deconvoluted (upsampled) using 1 3 × 3 deconvolution kernel to obtain a 28 × 28 × 1 generated image.
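The following Keras sketch approximates the encoder, sampling operation and decoder described above; the padding, strides, activations, the log-variance output and the layer names are assumptions made for illustration and are not fixed by this embodiment:

import tensorflow as tf
from tensorflow.keras import layers

latent_dim = 10

# Encoder E(.): 28x28x1 -> 28x28x32 -> 28x28x16 -> flatten -> 16 -> (mu, log_var)
enc_in = tf.keras.Input(shape=(28, 28, 1))
h = layers.Conv2D(32, 3, padding="same", activation="relu")(enc_in)
h = layers.Conv2D(16, 3, padding="same", activation="relu")(h)
h = layers.Flatten()(h)
h = layers.Dense(16, activation="relu")(h)
mu = layers.Dense(latent_dim)(h)
log_var = layers.Dense(latent_dim)(h)          # log-variance of the hidden Gaussian (an assumption)
encoder = tf.keras.Model(enc_in, [mu, log_var], name="E")

# Sampling S(.): Z = mu + eps * sigma (reparameterisation)
def sample_z(mu, log_var):
    eps = tf.random.normal(tf.shape(mu))
    return mu + eps * tf.exp(0.5 * log_var)

# Decoder D(.): 10 -> 1568 -> 7x7x32 -> 14x14x32 -> 28x28x16 -> 28x28x1
dec_in = tf.keras.Input(shape=(latent_dim,))
h = layers.Dense(7 * 7 * 32, activation="relu")(dec_in)
h = layers.Reshape((7, 7, 32))(h)
h = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(h)   # 14x14x32
h = layers.Conv2DTranspose(16, 3, strides=2, padding="same", activation="relu")(h)   # 28x28x16
dec_out = layers.Conv2DTranspose(1, 3, padding="same", activation="sigmoid")(h)      # 28x28x1
decoder = tf.keras.Model(dec_in, dec_out, name="D")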
S102, define the loss function and train the variational self-encoder network.
Here the reconstruction loss (using the same function as the reconstruction error) employs the mean squared error loss (a cross-entropy loss may also be used): R(x, x̂) = (1/n)·Σⱼ(xⱼ − x̂ⱼ)², where n is the dimension of the sample feature vector.
The loss of a single sample is defined as L = R(x, x̂) + D_KL(N(μ, σ²) ‖ N(0, I)), i.e. the reconstruction loss plus the KL divergence of the hidden-space Gaussian distribution from the standard normal distribution.
in addition, the training process for training the variational self-coder network is shown in fig. 7. The concrete description is as follows:
and S201, inputting and reconstructing a batch of samples. The present embodiment divides 5000 "0" samples into 50 batches, and optimizes the network parameters once for each 100 samples.
S202, calculating the reconstruction loss. The reconstruction loss of 100 samples per batch was calculated by TensorFlow using Mean Squared Error (MSE) as follows.
S203, calculate the gradient of the reconstruction loss with respect to the network parameters. This step invokes the TensorFlow interface to compute the gradient of L with respect to all parameters of the network.
S204, optimize the network parameters by gradient descent. In this step, the TensorFlow Adam gradient descent optimization strategy is selected to optimize the network parameters.
S205, judge whether training of all batches is completed. In this example, this step determines whether all 50 batches are complete; if not, the process returns to S201 to input and reconstruct the next batch of samples.
S206, judge whether the maximum number of iterations is reached. In this embodiment, the maximum number of training rounds is defined as 50; it is determined whether this maximum has been reached, and if so, the training of the variational self-encoder network ends and the subsequent step S103 is performed, otherwise the process returns to step S201 for the next round of training. Generally, multiple rounds of training are required. In this embodiment, the training samples are divided into 50 batches; after the 50 batches have been trained, one round of training is completed, and after 50 rounds of training (that is, when the defined maximum of 50 training rounds is reached), all training is completed.
S103, store the variational self-encoder network and its parameters. This step is used to save the networks and parameters trained in steps S201-S206 above. Here E(·) and D(·) are the two sub-networks of the variational self-encoder and are stored as two separate files; the networks and parameters can be stored as h5 files using the TensorFlow interface.
S104, calculate the maximum reconstruction error of all negative samples. In this step, the 5000 "0" samples are input into the trained variational self-encoder network (the network shown in fig. 6) for reconstruction, and the reconstruction errors are calculated to obtain the maximum reconstruction error M.
S105, train the Softmax classifier on all positive samples. In this step, all positive samples are trained using the Logistic Regression of SciKit-Learn (SKLearn, short for SciKit-Learn, is a Python library and module specially used for machine learning). The Softmax classifier is the generalization of two-class logistic regression to multi-class classification; whereas traditional logistic regression can only perform two-class classification, both the two-class and the multi-class cases are uniformly implemented by the Logistic Regression of SciKit-Learn. In the present embodiment, 5 "1", 10 "2" and 5 "3" pictures are used as positive samples, so the number of positive sample classes is K = 3, and Softmax outputs a probability distribution over the classes of a sample. For example, as shown in FIG. 8, sample "1" (whose label is a 3-dimensional one-hot coded vector) is input to the Softmax classifier, which outputs the sample probability distribution (p_1, p_2, …, p_i, …, p_K), where p_i represents the probability that the sample belongs to class i. It should be noted that other conventional machine learning algorithms, such as a simple two-layer neural network, can also be used to train the classification model of the positive samples; the method is not limited to the Softmax classifier. The Softmax classifier is preferred because of its convenient probability distribution output, the small amount of training data required, and its simplicity.
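As a minimal illustration of this step (not the only possible implementation), the scikit-learn Logistic Regression could be used roughly as follows; flattening the 28×28 images into vectors and the solver defaults are assumptions:

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_positive_classifier(positive_images, positive_labels):
    # positive_images: array of shape (m2, 28, 28); positive_labels: integer class labels 1..K
    X = np.asarray(positive_images).reshape(len(positive_images), -1)  # flatten to feature vectors
    y = np.asarray(positive_labels)
    # LogisticRegression also handles the multi-class (softmax-like) case
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, y)
    return clf

# probs = clf.predict_proba(x.reshape(1, -1))[0]   # (p_1, ..., p_K) for an unknown sample x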
S106, save the Softmax model and its parameters. This step uses Python's pickle module to save the trained Logistic Regression model.
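A short sketch of how the two trained models might be persisted, assuming the classifier and the Keras encoder/decoder objects from the previous sketches; the file names are illustrative:

import pickle

# Save the scikit-learn Softmax (logistic regression) model with pickle (S106)
with open("softmax_model.pkl", "wb") as f:
    pickle.dump(clf, f)

# Save the two sub-networks of the variational self-encoder as HDF5 files (S103)
encoder.save("encoder.h5")
decoder.save("decoder.h5")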
After model training is completed, the classification of an unknown sample x is predicted according to the following steps, and the flow is shown in fig. 9 and specifically described as follows:
S301, input the prediction parameters. This step defines the critical zone lower and upper limit coefficients α = 0.9 and β = 1.1, and the positive sample classification threshold P = 0.9. The critical interval [α×M, β×M] defines the fuzzy interval between positive and negative samples.
S302, calculate the reconstruction error. In this step, the sample x to be predicted is passed through the trained variational self-encoder network to obtain its reconstruction x̂, and its reconstruction error R_x = R(x, x̂) is calculated.
S303, judging whether the reconstruction error is smaller than the lower limit of the critical area.
S304, if so, output the classification as a negative sample. In this step, if R_x < α×M, the output classification is 0 (i.e. a negative sample). Here α defines the confidence threshold for classifying a sample as a negative sample: samples below this threshold are classified as negative samples, and a smaller value indicates a greater probability of entering the following re-filtering process.
S305, if not, calculate the positive sample probability distribution. If the sample cannot be confidently classified as a negative sample, the positive sample classifier Softmax_K(·) saved in S106 is used to predict the positive sample classification probability distribution of x: (p_1, p_2, …, p_K) = Softmax_K(x).
S306, judge whether the sample can be classified into a certain class of positive samples with probability greater than P.
S307, if yes, output the positive sample classification. In this step, if x can be classified into a certain class of positive samples with probability greater than P, the positive sample classification label is output and the prediction ends. That is: if max_i p_i > P, argmax_i p_i (i.e. the positive sample classification) is output and the prediction ends. Here P defines the confidence threshold for trusting a positive sample classification; a smaller P indicates a higher probability of classifying x into one of the existing positive sample classes.
S308, if not, judge whether the reconstruction error is smaller than the upper limit of the critical zone.
S309, if so, output the classification as a negative sample. In this step, if x cannot be classified as a positive sample with probability greater than P and the sample reconstruction error is within the critical interval, x is re-classified as a negative sample. That is: if R_x ≤ β×M, the output classification is 0 (i.e. a negative sample) and the prediction ends. Here a re-filtering of negative samples is achieved.
S310, if not, add a new (abnormal) classification and retrain the Softmax classifier.
S311, output the positive sample classification.
S310 and S311 indicate that x cannot be classified into a certain class of positive samples with probability greater than P and that its reconstruction error is above the critical zone upper limit; in this case x is considered a sample of a newly added classification. The sample is added to the positive sample set, x is labeled as K+1, K is updated to K+1, and the procedures of S105 and S106 are then invoked to retrain the Softmax classifier, which is used for the next prediction. The coefficient β in S309 defines the preference for a new positive sample class: a smaller value indicates a stronger preference for a new positive sample classification.
In summary, in this embodiment 2, during prediction, the selection of the prediction preference is realized by setting the critical zone parameters, that is, the upper and lower limits of the critical zone and the positive sample classification threshold, and the prediction preference can be dynamically adjusted. For a newly added positive sample classification, only the Softmax classifier of the small number of positive samples needs to be retrained rather than all samples, so rapid update iteration of the model can be achieved. In addition, unlike existing methods, no feature information of the samples is added or discarded, so the risks of under-fitting and over-fitting are reduced compared with existing algorithms.
Example 3:
On the basis of the unbalanced sample classification methods provided in embodiments 1 to 2, the present invention further provides an unbalanced sample classification apparatus for implementing the above methods, as shown in fig. 10, which is a schematic diagram of the apparatus architecture according to an embodiment of the present invention. The unbalanced sample classification apparatus of the present embodiment includes one or more processors 21 and a memory 22. In fig. 10, one processor 21 is taken as an example.
The processor 21 and the memory 22 may be connected by a bus or other means, and fig. 10 illustrates the connection by a bus as an example.
The memory 22, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the unbalanced sample classification method in embodiments 1 to 2. The processor 21 executes various functional applications and data processing of the unbalanced sample classification device by running the nonvolatile software programs, instructions, and modules stored in the memory 22, that is, implements the unbalanced sample classification methods of embodiments 1 to 2.
The memory 22 may include high speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the memory 22 may optionally include memory located remotely from the processor 21, which may be connected to the processor 21 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Program instructions/modules are stored in the memory 22 and, when executed by the one or more processors 21, perform the unbalanced sample classification method of embodiments 1 to 2 described above, for example, perform the respective steps shown in fig. 1 to 9 described above.
Those of ordinary skill in the art will appreciate that all or part of the steps of the various methods of the embodiments may be implemented by associated hardware as instructed by a program, which may be stored on a computer-readable storage medium, which may include: a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and the like.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.