CN113361591A

CN113361591A - Category imbalance processing method based on category combination and sample sampling

Info

Publication number: CN113361591A
Application number: CN202110620136.2A
Authority: CN
Inventors: 叶方全; 陈逸龙
Original assignee: Chongqing Pengkang Big Data Co ltd; Guangzhou Tianpeng Computer Technology Co ltd; Chongqing Nanpeng Artificial Intelligence Technology Research Institute Co ltd
Current assignee: Chongqing Pengkang Big Data Co ltd; Guangzhou Tianpeng Computer Technology Co ltd; Chongqing Nanpeng Artificial Intelligence Technology Research Institute Co ltd
Priority date: 2021-06-03
Filing date: 2021-06-03
Publication date: 2021-09-07

Abstract

The invention discloses a category imbalance processing method based on category combination and sample sampling, which comprises the following steps: s1: constructing an original data set; s2: a training process; s3: and (6) testing. The training mode of the method is to train a plurality of two classifiers and then combine the results of each two classifiers to obtain the final prediction result. Compared with a multi-classification method, the method is more flexible, and because each two classifiers are independent, different neural network models can be used, and even established rules can be considered.

Description

Category imbalance processing method based on category combination and sample sampling

Technical Field

The invention belongs to the technical field of computers, and particularly relates to a category imbalance processing method based on category combination and sample sampling.

Background

The problem of category imbalance is a very common problem in processing real data sets. The problem of category imbalance means that the number of samples in each category in a data set is different and the difference is large. Without any treatment for this problem, the prediction results of the trained model may be biased towards the classes with large data volumes, which is not desirable.

There are many studies in this area, and one of the research directions is based on a sampling method, which can be divided into an oversampling method and an undersampling method. The oversampling method is to increase samples of the class with fewer samples to balance the number of samples of each class. However, such oversampling methods easily cause model overfitting, and reduce the generalization performance of the model. The undersampling method is to reduce samples of a class with more samples to balance the number of samples of each class. The method is characterized in that a random undersampling method randomly samples partial samples from categories with more samples, the problem of the method is that residual samples cannot be utilized, the easy Ensemble algorithm can process the problem, samples with the number similar to that of a few categories are sampled from the categories by an integration method each time, a model is trained, and the model is repeated for multiple times to obtain multiple models, so that all samples of the categories can be completely utilized. However, in real data sets, especially disease classification data sets, the frequency difference of occurrence of each disease is very large, and the frequency difference may reach hundreds of times, which is not enough only by the sample down-sampling method, because if the most classes are down-sampled to the number of samples close to the few classes, the samples used for training the model are very few, resulting in under-fitting of the model.

Disclosure of Invention

The present invention is directed to a class imbalance processing method based on class combination and sample sampling, so as to solve the problems in the background art.

In order to achieve the purpose, the invention provides the following technical scheme: a category imbalance processing method based on category combination and sample sampling is structurally characterized in that: the method comprises the following steps:

s1: constructing an original data set: the raw data set formula is as follows:

wherein x is_iIs a word sequence, C is the number of categories, and N is the number of samples;

s2: training process: training M models, wherein the training process of each model is independent, the M models are trained in parallel, each model is trained, a new binary data set needs to be constructed from an original data set, then the new data set is used for training the model, the model is repeated for M times to obtain M different models, and M secondary classifiers P are obtained_m；

S3: the testing process comprises the following steps: given a test sample x, its class y e { 1.,. C }, P } is predicted_m(meta (C) x) represents the probability that the category of x is meta (C), wherein meta (C) belongs to {0,1}, C belongs to { 1.., C }, and meta (C) represents the C element class.

Preferably, the binary data set formula in step S2 is as follows:

wherein n is_sampleThe number of samples per category for the new data set.

Preferably, any one of the categories c in step S3 is not included in the two classifiers P_mIn the category of (1), then

P (cx) represents the probability that x is classified as c, and the calculation formula is as follows:

prediction class y is

y＝argmax_cp(c|x)。

Compared with the prior art, the method has the training mode that a plurality of two classifiers are trained, and then the result of each two classifier is combined to obtain the final prediction result. Compared with a multi-classification method, the method is more flexible, and because each two classifiers are independent, different neural network models can be used, and even established rules can be considered.

Drawings

FIG. 1 is an algorithmic schematic of the raw data set construction method of the present invention;

FIG. 2 is an algorithmic schematic of the training process of the present invention;

FIG. 3 is an algorithm diagram of the testing process of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings of the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.

Referring to fig. 1-3, the present invention provides a technical solution, a category imbalance processing method based on category combination and sample sampling, comprising the following steps:

s1: constructing an original data set: the raw data set formula is as follows:

s2: training process: training M models, wherein the training process of each model is independent, the M models are trained in parallel, each model is trained, a new binary data set needs to be constructed from an original data set, and then the new binary data set is usedTraining the model with the new data set, repeating for M times to obtain M different models, and obtaining M two classifiers P_mIt should be noted that the category in the new data set includes some categories in the original data set, that is, the category in the new data set is a super-class of some categories in the original data set, and is also referred to as a meta-class herein;

if the number of samples | X | has not yet reached the predetermined value n_sampleThen select the data D of a certain category c of the original data set D_cIf | x | + | D_c|＞n_sampleThen, the slave D is random_cIn which n is selected_sample- | x | samples are added to x, otherwise, D is added_cAll samples are added to X, and the process is repeated until the number of samples reaches n_sample；

In this embodiment, the formula of the binary data set in step S2 is as follows:

wherein n is_sampleThe number of samples per category for the new data set.

In this embodiment, any category c in step S3 is not included in the classifier P_mIn the category of (1), then

prediction class y is

y＝argmax_cp(c|x)。

Easylansymble can alleviate the problem of class imbalance to a certain extent, however, in a real data set, especially a disease classification data set, the frequency difference of the occurrence frequency of each disease is very large, and the frequency difference can reach hundreds of times, so that the situation is not enough only through a sample down-sampling method, because if most classes are down-sampled to the number of samples similar to a few classes, the samples used for training the model are very few, and the model is under-fitted. According to the method, through two technologies of category combination and sample sampling, the category combination is carried out firstly, then the down sampling is carried out, and the problem caused by large difference of category sample data can be relieved well.

The training mode of the method is to train a plurality of two classifiers and then combine the results of each two classifiers to obtain the final prediction result. Compared with a multi-classification method, the method is more flexible, and because each two classifiers are independent, different neural network models can be used, and even established rules can be considered.

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein, and any reference signs in the claims are not intended to be construed as limiting the claim concerned.

Furthermore, it should be understood that although the present description refers to embodiments, not every embodiment may contain only a single embodiment, and such description is for clarity only, and those skilled in the art should integrate the description, and the embodiments may be combined as appropriate to form other embodiments understood by those skilled in the art.

Claims

1. A class imbalance processing method based on class combination and sample sampling is characterized in that: the method comprises the following steps:

s1: constructing an original data set: the raw data set formula is as follows:

S3: the testing process comprises the following steps: given a test sample x, its class y e { 1.,. C }, P } is predicted_m(meta (C) | x) represents the probability that the category of x is meta (C), wherein meta (C) is e {0,1}, C is e { 1.., C }, and meta (C) represents the C-element class.

2. The method of claim 1, wherein the method comprises the following steps: the binary data set formula in step S2 is as follows:

wherein n is_sampleThe number of samples per category for the new data set.

3. The method of claim 1, wherein the method comprises the following steps: in step S3, any one of the categories c is not included in the classifier P_mIn the category of (1), then

P (c | x) represents the probability that x is classified as c, and the calculation formula is as follows:

prediction class y is

y＝argmax_cp(c|x)。