
CN113378963A - Hybrid framework-based imbalance classification method, system, equipment and storage medium

Info

Publication number: CN113378963A
Application number: CN202110708211.0A
Authority: CN
Inventors: 郭得科; 陈锐; 罗来龙; 陈颖文
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2021-06-24
Filing date: 2021-06-24
Publication date: 2021-09-10
Anticipated expiration: 2041-06-24
Also published as: CN113378963B

Abstract

The application relates to a hybrid-framework-based imbalance classification method, system, device and storage medium. An imbalanced network anomaly detection data set is used to validate the hybrid resampling integrated model. A combination of resampling methods reduces the number of majority-class samples, which increases the processing speed. The imbalanced data set is processed at the data level and converted to a more balanced distribution by a resampling technique. The integrated model contains 12 different classifiers, which provides more options than the 5 classifiers in previous work. The rebalanced data obtained from this processing is then classified by the integrated model. By combining undersampling and oversampling in this way, the imbalance between data classes is reduced and processing is accelerated with less memory overhead.

Description

Hybrid framework-based imbalance classification method, system, equipment and storage medium

Technical Field

The present application relates to the field of data processing, and in particular, to a method, a system, a device, and a storage medium for classifying imbalances based on a hybrid framework.

Background

In the current big data era, data mining and analysis have gained increasing importance in effective decision making. Among the various data mining techniques, classification analysis is one of the most widely used, and it applies to many business and engineering problems, such as cancer prediction, runoff prediction, fraud detection, and face detection. Classification analysis is a supervised learning problem in which a classifier predicts a variable that takes one of a finite number of classes. Typically, classifier learning methods are intended for reasonably balanced data sets. However, in many practical cases, the data sets tend to be imbalanced.

Currently, there are two main approaches to the imbalanced classification problem: oversampling, which randomly generates multiple copies of existing items to expand the minority classes, and undersampling, which randomly selects a subset of existing items to reduce the size of the majority classes. However, we believe that using only an oversampling or only an undersampling strategy may not be sufficient to mitigate the imbalance of a data set. First, if only oversampling is used to increase the size of the minority classes, expanding them to the same amount of data as the majority classes is impractical in terms of time consumption and training cost. Second, if only undersampling is used to scale down the majority classes, the large reduction of the data set may lead to inadequate training, and the resulting model may not be able to distinguish between the classes in the test data set. Finally, some existing work does mention a mixed sampling method, but the method is not explicitly described. Therefore, there is a need for a hybrid sampling method that combines oversampling and undersampling strategies.

Data classification is a common data processing method in the field of networks and distributed systems, and has attracted much attention in recent years. However, existing classification algorithms are primarily designed for relatively balanced data sets, whereas real-world data are often imbalanced.

Disclosure of Invention

In view of the above, it is necessary to provide a method, a system, an apparatus and a storage medium for classifying imbalances based on a hybrid framework.

In a first aspect, an embodiment of the present invention provides a hybrid-framework-based imbalance classification method, including the following steps:

obtaining a given initial data set D comprising a majority-class training data set D_{majority} and a minority-class training data set D_{minority};

eliminating data samples of the majority class in the initial data set D by a random undersampling method to generate a new majority-class data set, the reduced subset being denoted D_{majority_reduced};

increasing data samples of the minority class in the initial data set D by a random oversampling method to generate a new minority-class data set, the augmented subset being denoted D_{minority_increased};

combining the D_{majority_reduced} data set and the D_{minority_increased} data set to generate a new mixed data set D', and training an integrated model of 12 classifiers on the mixed data set D' to obtain the classification result of the initial data set.
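The resampling steps above can be illustrated with a minimal Python sketch that uses only the standard library. The function name `hybrid_resample`, the target sizes `n_majority` and `n_minority`, and the toy data are assumptions made for illustration; the source does not prescribe how the target sizes are chosen, and training of the integrated model is not shown.

```python
import random


def hybrid_resample(d_majority, d_minority, n_majority, n_minority, seed=0):
    """Build a mixed data set D' by random under- and oversampling.

    n_majority and n_minority are hypothetical target sizes chosen by the
    caller; they are not specified in the source text.
    """
    rng = random.Random(seed)
    # Random undersampling: keep a subset of the majority class (no replacement).
    d_majority_reduced = rng.sample(d_majority, min(n_majority, len(d_majority)))
    # Random oversampling: keep every minority sample and add random copies.
    n_copies = max(0, n_minority - len(d_minority))
    d_minority_increased = list(d_minority) + rng.choices(d_minority, k=n_copies)
    # Combine both subsets into the mixed data set D' and shuffle it.
    d_mixed = d_majority_reduced + d_minority_increased
    rng.shuffle(d_mixed)
    return d_mixed


# Toy data: (feature, label) pairs with 1000 "normal" and 20 "anomaly" samples.
d_majority = [(i, "normal") for i in range(1000)]
d_minority = [(i, "anomaly") for i in range(20)]
d_mixed = hybrid_resample(d_majority, d_minority, n_majority=200, n_minority=100)
```

In this sketch every original minority sample is retained and only the additional samples are random copies, which is one possible reading of "random copying".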

Further, eliminating the data samples of the majority class in the initial data set D by the random undersampling method to generate the new majority-class data set, with the reduced subset denoted D_{majority_reduced}, includes:

selecting samples from the majority-class data set through random undersampling, the proportion of samples selected per class being determined by a preset class distribution threshold;

using the reduced number of majority-class samples to achieve relatively fast data classification with less memory;

analyzing, through random undersampling at different ratios, the influence of the imbalanced network anomaly detection data set on the classification performance of the integrated model.

Further, increasing the data samples of the minority class in the initial data set D by the random oversampling method to generate the new minority-class data set, with the augmented subset denoted D_{minority_increased}, includes:

randomly copying data samples of the minority class by the random oversampling method, thereby increasing the number of minority-class samples;

balancing the quantitative differences between the data samples of the minority class by randomly controlling the sampling ratio.

In another aspect, the present invention provides a hybrid-framework-based imbalance classification system, comprising:

an initial data set giving module for obtaining a given initial data set D comprising a majority-class training data set D_{majority} and a minority-class training data set D_{minority};

an undersampling module for eliminating data samples of the majority class in the initial data set D by a random undersampling method to generate a new majority-class data set, the reduced subset being denoted D_{majority_reduced};

an oversampling module for increasing data samples of the minority class in the initial data set D by a random oversampling method to generate a new minority-class data set, the augmented subset being denoted D_{minority_increased};

a model blending module for combining the D_{majority_reduced} data set and the D_{minority_increased} data set to generate a new mixed data set D', and training an integrated model of 12 classifiers on the mixed data set D' to obtain the classification result of the initial data set.

Further, the undersampling module comprises a sample reduction unit configured to:

Further, the oversampling module includes a sample addition unit configured to:

by randomly controlling the sampling ratio, quantitative differences between the data samples of the minority class are balanced.

In another aspect, the present invention further provides a system for classifying imbalances based on a hybrid framework, comprising:

The proposed model is used to solve a practical problem in network anomaly detection. An imbalanced network anomaly detection data set is used to validate our HRE model. Furthermore, we propose a combination of resampling methods that reduces the number of majority-class samples, thereby speeding up processing. We process the imbalanced data set at the data level and convert it to a more balanced distribution using a resampling technique.

Further, details of the HRE model framework for imbalance classification are included. Because extending the minority classes with an oversampling strategy increases training costs, we first specify an integration framework that uses only an undersampling strategy. We then propose a hybrid integration framework that combines oversampling and undersampling to balance the classes in the data set.

Further, in this model, we use random undersampling to reduce the number of majority-class samples, which also allows faster processing with less memory overhead. Furthermore, multiple classifiers have proven to be more accurate than a single classifier. Therefore, we choose 12 classifiers for the integration method, which provides more choices than the 5 classifiers in previous work.
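The text does not state how the outputs of the 12 classifiers are combined. As a hedged illustration only, the sketch below assumes simple majority voting, which is one common way to build an integrated (ensemble) model. The function `majority_vote` and the three threshold rules standing in for trained classifiers are hypothetical and are not taken from the source.

```python
from collections import Counter


def majority_vote(classifiers, sample):
    """Return the label predicted by the largest number of classifiers."""
    votes = Counter(clf(sample) for clf in classifiers)
    # most_common(1) returns the (label, count) pair with the highest count.
    return votes.most_common(1)[0][0]


# Three hypothetical threshold rules standing in for trained classifiers.
# An odd number of voters avoids ties in a two-class problem.
classifiers = [
    lambda x: "anomaly" if x > 0.5 else "normal",
    lambda x: "anomaly" if x > 0.7 else "normal",
    lambda x: "anomaly" if x > 0.9 else "normal",
]

prediction = majority_vote(classifiers, 0.8)  # two of three vote "anomaly"
```

With real models, each entry of `classifiers` would wrap the prediction call of one trained classifier; other combination rules, such as weighted voting, would fit the same interface.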

The embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and capable of running on the processor, and when the processor executes the computer program, the following steps are implemented:

eliminating data samples of the majority class in the initial data set D by a random undersampling method to generate a new majority-class data set, the reduced subset being denoted D_{majority_reduced};

An embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the following steps:

The beneficial effects of this application are as follows. The embodiments of the invention disclose a hybrid-framework-based imbalance classification method, system, device and storage medium. The hybrid sampling model is used to solve a practical problem in network anomaly detection, and an imbalanced network anomaly detection data set is used to validate the hybrid resampling integrated model. In addition, a combination of resampling methods reduces the number of majority-class samples, which increases the processing speed. The imbalanced data set is processed at the data level and converted to a more balanced distribution using a resampling technique. Furthermore, the integrated model contains 12 different classifiers, which provides more options than the 5 classifiers in previous work. The rebalanced data obtained from this processing is then classified by the integrated model. The combination of undersampling and oversampling thus reduces the imbalance between data classes and speeds up processing with less memory overhead.

Drawings

FIG. 1 is a flow diagram illustrating a hybrid framework based imbalance classification method according to one embodiment;

FIG. 2 is a flow diagram that illustrates reduction of majority class datasets by an undersampling method, under an embodiment;

FIG. 3 is a flow diagram illustrating the addition of a minority class data set by an over-sampling method in one embodiment;

FIG. 4 is a block diagram of a hybrid framework based imbalance classification system in one embodiment;

FIG. 5 is a diagram illustrating an internal structure of a computer device according to an embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

In one embodiment, as shown in fig. 1, a hybrid-framework-based imbalance classification method is provided, comprising the following steps:

step 101, obtaining a given initial data set D comprising a majority-class training data set D_{majority} and a minority-class training data set D_{minority};

step 102, eliminating data samples of the majority class in the initial data set D by a random undersampling method to generate a new majority-class data set, the reduced subset being denoted D_{majority_reduced};

step 103, increasing data samples of the minority class in the initial data set D by a random oversampling method to generate a new minority-class data set, the augmented subset being denoted D_{minority_increased};

step 104, combining the D_{majority_reduced} data set and the D_{minority_increased} data set to generate a new mixed data set D', and training an integrated model of 12 classifiers on the mixed data set D' to obtain the classification result of the initial data set.

Specifically, in this embodiment, we consider the problem of class imbalance in practical applications. We propose a new hybrid-resampling-based integration framework that takes full advantage of both the undersampling and the oversampling strategy. The hybrid sampling model is used to solve a practical problem in network anomaly detection, and an imbalanced network anomaly detection data set is used to validate the hybrid resampling integrated model. In addition, a combination of resampling methods reduces the number of majority-class samples, which increases the processing speed. The imbalanced data set is processed at the data level and converted to a more balanced distribution using a resampling technique. Furthermore, the integrated model contains 12 different classifiers, which provides more options than the 5 classifiers in previous work. The rebalanced data obtained from this processing is then classified by the integrated model. The combination of undersampling and oversampling thus reduces the imbalance between data classes and speeds up processing with less memory overhead. Experimental results of the embodiments show that the hybrid resampling integrated model can significantly improve classification accuracy while reducing computational overhead.

In one embodiment, as shown in fig. 2, the process of reducing the majority class data set by the under-sampling method comprises:

step 201, connecting sentences in the evidence set into a sequence evidence text, and then connecting the sequence evidence text with the statement to form an input sequence;

step 202, selecting samples from the majority-class data set through random undersampling, the proportion of samples selected per class being determined by a preset class distribution threshold;

step 203, using the reduced number of majority-class samples to achieve relatively fast data classification with less memory;

step 204, analyzing, through random undersampling at different ratios, the influence of the imbalanced network anomaly detection data set on the classification performance of the integrated model.

Specifically, by analyzing the influence of the sampling rate (imbalance rate) on classification performance in this embodiment, we obtain the following. First, with the number of minority-class samples held constant, the performance of the integrated model first increases and then decreases as the number of majority-class samples increases. These experimental results also show that an increase in the data imbalance ratio affects the performance of the classifier. Further, when the imbalance ratio is small, its influence on classification performance is not significant. When the undersampling ratio is 1:4, the average accuracy is relatively high. If the imbalance ratio is large, the performance of the classifier is negatively affected.
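Undersampling at different ratios, as analyzed above, might be sketched as follows. The sketch assumes that a ratio such as 1:4 means one minority-class sample for every four retained majority-class samples; the source does not define the ratio explicitly. The function `undersample_to_ratio`, its parameter `majority_per_minority`, and the toy data are hypothetical.

```python
import random


def undersample_to_ratio(d_majority, d_minority, majority_per_minority, seed=0):
    """Randomly keep about majority_per_minority majority samples per minority sample."""
    rng = random.Random(seed)
    # Target size of the reduced majority class, capped at the available samples.
    target = min(len(d_majority), int(len(d_minority) * majority_per_minority))
    # Sampling without replacement, so no majority sample is duplicated.
    return rng.sample(d_majority, target)


# Toy data: 1000 majority-class and 50 minority-class sample identifiers.
d_majority = list(range(1000))
d_minority = list(range(1000, 1050))

# Sweep several ratios (minority:majority = 1:1, 1:2, 1:4, 1:8).
for ratio in (1, 2, 4, 8):
    reduced = undersample_to_ratio(d_majority, d_minority, ratio)
    print(f"1:{ratio} keeps {len(reduced)} majority samples")
```

In an experiment of the kind described, the integrated model would be trained once per ratio and the resulting accuracies compared.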

In one embodiment, as shown in fig. 3, the process of adding the minority class data set by the oversampling method includes:

step 301, performing random copy on the data samples of the minority class by using a random oversampling method, thereby increasing the number of the samples of the minority class;

step 302, balancing the quantitative differences between the data samples of the minority class by randomly controlling the sampling ratio.

Specifically, mixed resampling first reduces the number of majority-class samples to a certain number and then, through oversampling, increases the number of minority-class samples so that it matches the majority class. We gradually reduce the imbalance rate by mixed resampling. The results show that as the number of minority-class samples gradually increases, the overall accuracy of the classifier improves from 0.8393 to 0.8981, which indicates that the performance of the model gradually improves. The experimental results show that, when the number of minority-class samples is small, increasing the number of minority-class samples and decreasing the number of majority-class samples can improve accuracy. This also means that combining the oversampling and undersampling methods is effective.
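The gradual reduction of the imbalance rate through oversampling can be sketched as below. This is an illustration under assumptions: the function `random_oversample`, the growth `factor`, the fixed majority size of 200, and the sample names are hypothetical, and the accuracy figures quoted above are not reproduced by this code.

```python
import random


def random_oversample(d_minority, factor, seed=0):
    """Grow the minority class to about factor times its size by random copying."""
    rng = random.Random(seed)
    target = int(len(d_minority) * factor)
    n_copies = max(0, target - len(d_minority))
    # Keep all original samples and append randomly chosen duplicates.
    return list(d_minority) + rng.choices(d_minority, k=n_copies)


n_majority = 200  # assumed size of the already reduced majority class
d_minority = [f"anomaly_{i}" for i in range(20)]

# Increasing the oversampling factor lowers the majority:minority imbalance ratio.
for factor in (1, 2, 5, 10):
    increased = random_oversample(d_minority, factor)
    ratio = n_majority / len(increased)
    print(f"factor {factor}: {len(increased)} minority samples, imbalance ratio {ratio:.1f}")
```

Because the added samples are exact copies, oversampling changes the class proportions without introducing new feature values.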

It should be understood that, although the steps in the above-described flowcharts are shown in the order indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and described, and may be performed in other orders. Moreover, at least a portion of the steps in the above-described flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times. These sub-steps or stages are not necessarily performed sequentially, and may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.

In one embodiment, as shown in fig. 4, there is provided a hybrid frame-based imbalance classification system, comprising:

the initial data set giving module 401, which obtains a given initial data set D comprising a majority-class training data set D_{majority} and a minority-class training data set D_{minority};

the undersampling module 402, which eliminates data samples of the majority class in the initial data set D by a random undersampling method to generate a new majority-class data set, the reduced subset being denoted D_{majority_reduced};

the oversampling module 403, which increases data samples of the minority class in the initial data set D by a random oversampling method to generate a new minority-class data set, the augmented subset being denoted D_{minority_increased};

the model blending module 404, which combines the D_{majority_reduced} data set and the D_{minority_increased} data set to generate a new mixed data set D', and trains an integrated model of 12 classifiers on the mixed data set D' to obtain the classification result of the initial data set.

In one embodiment, the undersampling module 402 includes a sample reduction unit to:

In one embodiment, the oversampling module 403 includes a sample addition unit to:

For specific limitations of the hybrid frame-based imbalance classification system, reference may be made to the above limitations of the hybrid frame-based imbalance classification method, which are not described herein again. The various modules in the hybrid framework-based imbalance classification system described above may be implemented in whole or in part by software, hardware, and combinations thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.

FIG. 5 is a diagram illustrating an internal structure of a computer device in one embodiment. The computer device comprises a processor, a memory, a network interface, an input device and a display screen, which are connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program that, when executed by the processor, causes the processor to implement the hybrid-framework-based imbalance classification method. The internal memory may also store a computer program that, when executed by the processor, causes the processor to perform the hybrid-framework-based imbalance classification method. The display screen of the computer device can be a liquid crystal display screen or an electronic ink display screen. The input device of the computer device can be a touch layer covering the display screen, a key, a trackball or a touch pad arranged on the housing of the computer device, or an external keyboard, touch pad or mouse.

Those skilled in the art will appreciate that the architecture shown in fig. 5 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.

In one embodiment, as shown in fig. 5, there is provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:

In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:

The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above-mentioned embodiments express only several implementations of the present application. Their description is relatively specific and detailed, but should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A method for classifying imbalances based on a hybrid framework is characterized by comprising the following steps:

2. The hybrid-framework-based imbalance classification method according to claim 1, wherein eliminating the data samples of the majority class in the initial data set D by the random undersampling method to generate the new majority-class data set, with the reduced subset denoted D_{majority_reduced}, comprises:

3. The hybrid-framework-based imbalance classification method according to claim 1, wherein increasing the data samples of the minority class in the initial data set D by the random oversampling method to generate the new minority-class data set, with the augmented subset denoted D_{minority_increased}, comprises:

4. A hybrid frame-based imbalance classification system, comprising:

5. The hybrid frame-based imbalance classification system of claim 4, wherein the undersampling module comprises a sample reduction unit to:

6. The hybrid frame-based imbalance classification system of claim 4, wherein the oversampling module includes a sample addition unit to:

7. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 3 are implemented when the computer program is executed by the processor.

8. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 3.