CN113361591A - Category imbalance processing method based on category combination and sample sampling - Google Patents
- Publication number
- CN113361591A (application number CN202110620136.2A)
- Authority
- CN
- China
- Prior art keywords
- category
- data set
- class
- model
- samples
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/24765—Rule-based classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/285—Selection of pattern recognition techniques, e.g. of classifiers in a multi-classifier system
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a category imbalance processing method based on category combination and sample sampling, comprising the following steps: S1: constructing an original data set; S2: a training process; S3: a testing process. The method trains a plurality of binary classifiers and then combines the results of the binary classifiers to obtain the final prediction. Compared with a multi-class classification method, this approach is more flexible: because the binary classifiers are independent of one another, each can use a different neural network model, and even predefined rules can be considered.
Description
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a category imbalance processing method based on category combination and sample sampling.
Background
The problem of category imbalance is a very common problem in processing real data sets. The problem of category imbalance means that the number of samples in each category in a data set is different and the difference is large. Without any treatment for this problem, the prediction results of the trained model may be biased towards the classes with large data volumes, which is not desirable.
There are many studies in this area. One research direction is sampling-based methods, which can be divided into oversampling and undersampling. Oversampling adds samples to the classes with fewer samples in order to balance the number of samples per class; however, it easily causes overfitting and reduces the generalization performance of the model. Undersampling removes samples from the classes with more samples in order to balance the number of samples per class. Random undersampling, for example, randomly draws a subset of samples from the classes with more samples; its drawback is that the remaining samples go unused. The EasyEnsemble algorithm addresses this with an ensemble approach: each time, it samples from the majority classes a number of samples similar to that of the minority classes and trains a model, and it repeats this multiple times to obtain multiple models, so that the samples of the majority classes can be fully utilized. However, in real data sets, especially disease classification data sets, the frequencies of different diseases differ greatly, by as much as several hundred times. In that situation undersampling alone is not enough: if the majority classes are downsampled to about the number of samples in the minority classes, very few samples remain for training and the model underfits.
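The ensemble undersampling idea described above can be sketched as follows. This is a minimal illustration in the spirit of EasyEnsemble, not the published algorithm or any library's API: the function name `easy_ensemble_subsets`, the string samples, and the class sizes are all assumptions made for the example.

```python
import random
from typing import List, Sequence, Tuple


def easy_ensemble_subsets(
    majority: Sequence[str],
    minority: Sequence[str],
    n_subsets: int,
    seed: int = 0,
) -> List[Tuple[List[str], List[str]]]:
    """Return n_subsets balanced (majority_subset, minority) pairs.

    Each majority subset is drawn independently, without replacement,
    and has the same size as the minority class. One model would be
    trained per pair (training itself is not shown here).
    """
    rng = random.Random(seed)
    k = min(len(minority), len(majority))
    return [(rng.sample(list(majority), k), list(minority)) for _ in range(n_subsets)]


# Hypothetical data: a 100x imbalance between two classes.
majority = [f"maj_{i}" for i in range(1000)]
minority = [f"min_{i}" for i in range(10)]

subsets = easy_ensemble_subsets(majority, minority, n_subsets=5)
covered = {s for maj_subset, _ in subsets for s in maj_subset}
print(len(subsets[0][0]), len(covered))
```

With these assumed sizes, each model sees only 20 samples (10 per class), and five subsets touch at most 50 of the 1000 majority samples. That is the small-training-set concern the paragraph raises for heavily imbalanced data.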
Disclosure of Invention
The present invention is directed to a class imbalance processing method based on class combination and sample sampling, so as to solve the problems in the background art.
In order to achieve this purpose, the invention provides the following technical solution: a category imbalance processing method based on category combination and sample sampling, characterized in that the method comprises the following steps:
S1: Constructing an original data set: the original data set is defined as follows:
$$D = \{(x_i, y_i)\}_{i=1}^{N}, \quad y_i \in \{1, \dots, C\}$$
where $x_i$ is a word sequence, C is the number of categories, and N is the number of samples;
S2: Training process: M models are trained. The training of each model is independent, so the M models are trained in parallel. To train each model, a new binary data set is first constructed from the original data set and then used to train that model. This is repeated M times to obtain M different models, i.e., M binary classifiers $P_m$;
S3: Testing process: given a test sample x, its class $y \in \{1, \dots, C\}$ is predicted. $P_m(\mathrm{meta}(c) \mid x)$ denotes the probability that the category of x is $\mathrm{meta}(c)$, where $\mathrm{meta}(c) \in \{0, 1\}$, $c \in \{1, \dots, C\}$, and $\mathrm{meta}(c)$ denotes the meta-class of category c.
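Preferably, the binary data set in step S2 is defined as follows:
$$D_m = \{(x_j, \mathrm{meta}(y_j))\}_{j=1}^{2 n_{sample}}, \quad \mathrm{meta}(y_j) \in \{0, 1\}$$
where $n_{sample}$ is the number of samples per category in the new data set.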
Preferably, in step S3, if a category c is not included in the categories of the binary classifier $P_m$, then […]. $p(c \mid x)$ denotes the probability that x is classified as c, and the calculation formula is as follows:
[…]
The predicted class y is
$$y = \arg\max_c p(c \mid x)$$
Compared with the prior art, the method trains a plurality of binary classifiers and then combines the result of each binary classifier to obtain the final prediction. Compared with a multi-class classification method, this approach is more flexible: because the binary classifiers are independent of one another, each can use a different neural network model, and even predefined rules can be considered.
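Because the training runs do not depend on one another, they can be dispatched concurrently. The sketch below shows only that dispatch pattern. `train_binary_model` is a hypothetical placeholder that merely chooses which original categories form each meta-class; splitting all categories into two halves is an assumption for illustration, not something the source specifies.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def train_binary_model(m: int, categories: List[int]) -> Dict[str, object]:
    """Placeholder for one independent training run (hypothetical).

    It only decides which original categories form meta-class 0 and
    meta-class 1 for model m. A real run would also build the binary
    data set and fit a classifier on it.
    """
    rng = random.Random(m)  # per-model seed: runs stay independent and repeatable
    shuffled = rng.sample(categories, len(categories))
    half = len(shuffled) // 2
    return {
        "model": m,
        "meta0": sorted(shuffled[:half]),
        "meta1": sorted(shuffled[half:]),
    }


def train_all(n_models: int, categories: List[int]) -> List[Dict[str, object]]:
    # No run reads another run's state, so the runs can execute concurrently.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(lambda m: train_binary_model(m, categories), range(n_models)))


models = train_all(n_models=3, categories=[1, 2, 3, 4, 5, 6])
print([m["model"] for m in models])
```

`ThreadPoolExecutor.map` returns results in submission order, so the list index of each result matches the model index m regardless of which run finishes first.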
Drawings
FIG. 1 is an algorithmic schematic of the raw data set construction method of the present invention;
FIG. 2 is an algorithmic schematic of the training process of the present invention;
FIG. 3 is an algorithm diagram of the testing process of the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently some, but not all, embodiments of the present invention. Thus, the following detailed description of the embodiments, as presented in the figures, is not intended to limit the scope of the invention as claimed, but is merely representative of selected embodiments. All other embodiments that a person skilled in the art can obtain from the embodiments of the present invention without inventive effort fall within the scope of the present invention.
Referring to FIGS. 1-3, the present invention provides a technical solution: a category imbalance processing method based on category combination and sample sampling, comprising the following steps:
S1: Constructing an original data set: the original data set is defined as follows:
$$D = \{(x_i, y_i)\}_{i=1}^{N}, \quad y_i \in \{1, \dots, C\}$$
where $x_i$ is a word sequence, C is the number of categories, and N is the number of samples;
S2: Training process: M models are trained. The training of each model is independent, so the M models are trained in parallel. To train each model, a new binary data set is first constructed from the original data set and then used to train that model. This is repeated M times to obtain M different models, i.e., M binary classifiers $P_m$. Note that each category in the new data set contains several categories of the original data set; that is, a category in the new data set is a super-class of some categories of the original data set, referred to herein as a meta-class;
If the number of samples $|X|$ in the meta-class sample set X has not yet reached the preset value $n_{sample}$, the data $D_c$ of some category c of the original data set D is selected. If $|X| + |D_c| > n_{sample}$, then $n_{sample} - |X|$ samples are randomly selected from $D_c$ and added to X; otherwise, all samples of $D_c$ are added to X. This process is repeated until the number of samples reaches $n_{sample}$;
S3: Testing process: given a test sample x, its class $y \in \{1, \dots, C\}$ is predicted. $P_m(\mathrm{meta}(c) \mid x)$ denotes the probability that the category of x is $\mathrm{meta}(c)$, where $\mathrm{meta}(c) \in \{0, 1\}$, $c \in \{1, \dots, C\}$, and $\mathrm{meta}(c)$ denotes the meta-class of category c.
In this embodiment, the binary data set in step S2 is defined as follows:
$$D_m = \{(x_j, \mathrm{meta}(y_j))\}_{j=1}^{2 n_{sample}}, \quad \mathrm{meta}(y_j) \in \{0, 1\}$$
where $n_{sample}$ is the number of samples per category in the new data set.
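The procedure in step S2 for filling one meta-class up to the preset size can be sketched as below. The function name `build_meta_class`, the explicit `category_order` argument, and the toy data are assumptions. The source only says that "a certain category" is selected at each step and does not say how that choice is made.

```python
import random
from typing import Dict, List, Sequence


def build_meta_class(
    data_by_category: Dict[int, List[str]],
    category_order: Sequence[int],
    n_sample: int,
    rng: random.Random,
) -> List[str]:
    """Collect up to n_sample samples by merging whole categories in order."""
    X: List[str] = []
    for c in category_order:
        if len(X) >= n_sample:
            break  # the meta-class is already full
        D_c = data_by_category[c]
        if len(X) + len(D_c) > n_sample:
            # Taking all of D_c would overshoot: sample only what is still needed.
            X.extend(rng.sample(D_c, n_sample - len(X)))
        else:
            # D_c fits entirely, so every one of its samples is kept.
            X.extend(D_c)
    return X


# Hypothetical original data set with one frequent and two rare categories.
data = {
    1: [f"a{i}" for i in range(500)],
    2: [f"b{i}" for i in range(30)],
    3: [f"c{i}" for i in range(8)],
}
meta_class = build_meta_class(data, category_order=[3, 2, 1], n_sample=100, rng=random.Random(0))
print(len(meta_class))  # 100
```

In this run the two rare categories are included whole (8 and 30 samples) and only the frequent category is downsampled, to the 62 samples still needed. If the listed categories hold fewer than `n_sample` samples in total, the sketch simply returns everything it found.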
In this embodiment, in step S3, if a category c is not included in the categories of the binary classifier $P_m$, then […]. $p(c \mid x)$ denotes the probability that x is classified as c, and the calculation formula is as follows:
[…]
The predicted class y is
$$y = \arg\max_c p(c \mid x)$$
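The combination formula itself is not reproduced in this text, so the following sketch rests on an explicit assumption: $p(c \mid x)$ is taken as the product of $P_m(\mathrm{meta}(c) \mid x)$ over the classifiers whose categories include c, and classifiers that do not cover c are skipped. The class `BinaryModel` and the fixed probabilities are hypothetical. Only the final argmax step is taken from the source.

```python
from typing import Callable, Dict, List, Optional


class BinaryModel:
    """One binary classifier P_m plus its category-to-meta-class map."""

    def __init__(self, prob_meta1: Callable[[str], float], meta: Dict[int, int]):
        self.prob_meta1 = prob_meta1  # x -> P_m(meta-class 1 | x)
        self.meta = meta              # original category -> 0 or 1

    def prob_of_category(self, c: int, x: str) -> Optional[float]:
        """Return P_m(meta(c) | x), or None if this model does not cover c."""
        if c not in self.meta:
            return None
        p1 = self.prob_meta1(x)
        return p1 if self.meta[c] == 1 else 1.0 - p1


def predict(models: List[BinaryModel], x: str, n_categories: int) -> int:
    scores: Dict[int, float] = {}
    for c in range(1, n_categories + 1):
        score = 1.0
        for model in models:
            p = model.prob_of_category(c, x)
            if p is not None:  # ASSUMPTION: uncovered categories contribute no factor
                score *= p
        scores[c] = score
    # y = argmax_c p(c | x)
    return max(scores, key=lambda c: scores[c])


# Two hypothetical classifiers with fixed outputs; the second does not cover category 1.
model_a = BinaryModel(lambda x: 0.8, {1: 0, 2: 1, 3: 1})
model_b = BinaryModel(lambda x: 0.7, {2: 0, 3: 1})
print(predict([model_a, model_b], "some word sequence", n_categories=3))  # 3
```

Under this assumed rule the scores are 0.2 for category 1, 0.8 × 0.3 = 0.24 for category 2, and 0.8 × 0.7 = 0.56 for category 3, so category 3 is predicted. A different treatment of uncovered categories would change these scores.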
EasyEnsemble can alleviate the class imbalance problem to a certain extent. However, in real data sets, especially disease classification data sets, the frequencies of different diseases differ greatly, by as much as several hundred times. In that situation undersampling alone is not enough: if the majority classes are downsampled to about the number of samples in the minority classes, very few samples remain for training and the model underfits. The present method combines two techniques, category combination and sample sampling: categories are combined first and downsampling is performed afterwards, which effectively alleviates the problems caused by large differences in the number of samples per category.
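A small worked calculation makes the training-set-size argument concrete. The category counts below are invented for illustration, and merging the three rare categories into one meta-class is just one possible combination.

```python
# Hypothetical per-category sample counts with a 200x imbalance.
counts = {"A": 20000, "B": 300, "C": 150, "D": 100}

# Plain undersampling: every category is cut down to the rarest one.
per_class = min(counts.values())
plain_total = per_class * len(counts)

# Category combination first: merge the three rare categories into one
# meta-class, then undersample the frequent category to the same size.
rare = counts["B"] + counts["C"] + counts["D"]
n_sample = min(rare, counts["A"])
combined_total = 2 * n_sample

print(plain_total, combined_total)  # 400 1100
```

Under these assumed counts, plain undersampling leaves 100 samples per class, while combining first leaves 550 per meta-class for a single binary classifier. The two totals describe different tasks (one four-class model versus one of M binary models), so the comparison shows only how much data each model gets to see.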
The method trains a plurality of binary classifiers and then combines the result of each binary classifier to obtain the final prediction. Compared with a multi-class classification method, this approach is more flexible: because the binary classifiers are independent of one another, each can use a different neural network model, and even predefined rules can be considered.
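The flexibility claimed here comes down to every binary classifier sharing one interface, whatever sits behind it. The sketch below mixes a hand-written rule with a stand-in for a learned model. Both functions, the trigger word, and the word weights are hypothetical and chosen only to show the pattern.

```python
import math
from typing import Callable, Dict, List

# Shared interface: word sequence in, probability of meta-class 1 out.
BinaryClassifier = Callable[[List[str]], float]


def keyword_rule(words: List[str]) -> float:
    """Hand-written rule (hypothetical): near-certain if a trigger word appears."""
    return 0.95 if "fever" in words else 0.05


def make_bag_of_words_scorer(weights: Dict[str, float], bias: float = 0.0) -> BinaryClassifier:
    """Stand-in for a learned model: logistic score over per-word weights."""

    def score(words: List[str]) -> float:
        z = bias + sum(weights.get(w, 0.0) for w in words)
        return 1.0 / (1.0 + math.exp(-z))

    return score


classifiers: List[BinaryClassifier] = [
    keyword_rule,
    make_bag_of_words_scorer({"cough": 1.5, "rash": -2.0}),
]

x = ["persistent", "cough", "fever"]
probs = [clf(x) for clf in classifiers]
print(probs)
```

Since the combination step only consumes probabilities, a rule and a model can sit side by side in the same list, and either could be replaced by a neural network without touching the rest.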
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein, and any reference signs in the claims are not intended to be construed as limiting the claim concerned.
Furthermore, it should be understood that although this description is organized by embodiments, not every embodiment contains only a single independent technical solution. The description is presented this way for clarity only; those skilled in the art should treat the description as a whole, and the technical solutions of the embodiments may be combined as appropriate to form other embodiments understandable to those skilled in the art.
Claims (3)
1. A class imbalance processing method based on class combination and sample sampling is characterized in that: the method comprises the following steps:
S1: Constructing an original data set: the original data set is defined as follows:
$$D = \{(x_i, y_i)\}_{i=1}^{N}, \quad y_i \in \{1, \dots, C\}$$
where $x_i$ is a word sequence, C is the number of categories, and N is the number of samples;
S2: Training process: M models are trained. The training of each model is independent, so the M models are trained in parallel. To train each model, a new binary data set is first constructed from the original data set and then used to train that model. This is repeated M times to obtain M different models, i.e., M binary classifiers $P_m$;
S3: Testing process: given a test sample x, its class $y \in \{1, \dots, C\}$ is predicted. $P_m(\mathrm{meta}(c) \mid x)$ denotes the probability that the category of x is $\mathrm{meta}(c)$, where $\mathrm{meta}(c) \in \{0, 1\}$, $c \in \{1, \dots, C\}$, and $\mathrm{meta}(c)$ denotes the meta-class of category c.
3. The method of claim 1, wherein, in step S3, if a category c is not included in the categories of the binary classifier $P_m$, then […]. $p(c \mid x)$ denotes the probability that x is classified as c, and the calculation formula is as follows:
[…]
The predicted class y is
$$y = \arg\max_c p(c \mid x)$$
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110620136.2A CN113361591A (en) | 2021-06-03 | 2021-06-03 | Category imbalance processing method based on category combination and sample sampling |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110620136.2A CN113361591A (en) | 2021-06-03 | 2021-06-03 | Category imbalance processing method based on category combination and sample sampling |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113361591A true CN113361591A (en) | 2021-09-07 |
Family
ID=77531772
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110620136.2A Pending CN113361591A (en) | 2021-06-03 | 2021-06-03 | Category imbalance processing method based on category combination and sample sampling |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113361591A (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100094800A1 (en) * | 2008-10-09 | 2010-04-15 | Microsoft Corporation | Evaluating Decision Trees on a GPU |
US20110075920A1 (en) * | 2009-09-14 | 2011-03-31 | Siemens Medical Solutions Usa, Inc. | Multi-Level Contextual Learning of Data |
CN106022376A (en) * | 2016-05-18 | 2016-10-12 | 安徽大学 | Structured SVM-based unbalanced evaluation criterion direct optimization algorithm |
CN106599913A (en) * | 2016-12-07 | 2017-04-26 | 重庆邮电大学 | Cluster-based multi-label imbalance biomedical data classification method |
CN108171280A (en) * | 2018-01-31 | 2018-06-15 | 国信优易数据有限公司 | A kind of grader construction method and the method for prediction classification |
CN109086412A (en) * | 2018-08-03 | 2018-12-25 | 北京邮电大学 | A kind of unbalanced data classification method based on adaptive weighted Bagging-GBDT |
CN109359704A (en) * | 2018-12-26 | 2019-02-19 | 北京邮电大学 | A kind of more classification methods integrated based on adaptive equalization with dynamic layered decision |
CN109447118A (en) * | 2018-09-26 | 2019-03-08 | 中南大学 | A kind of uneven learning method based on Adaboost and lack sampling |
KR102194962B1 (en) * | 2020-05-20 | 2020-12-24 | 주식회사 네이처모빌리티 | System for providing bigdata based artificial intelligence automatic allocation matching service using assignmanet problem and simulated annealing |
Non-Patent Citations (3)
Title |
---|
IN WOONG HAN等: "Risk prediction platform for pancreatic fistula after pancreatoduodenectomy using artificial intelligence", 《WORLD JOURNAL OF GASTROENTEROLOGY》 * |
吴萌等: "基于多类不平衡分类的改进AdaBoost算法研究", 《北京信息科技大学学报(自然科学版)》 * |
张全贵等: "融合元数据及隐式反馈信息的多层次联合学习推荐方法", 《计算机应用研究》 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Han et al. | Unsupervised generative modeling using matrix product states | |
JP2019028839A (en) | Classifier, method for learning of classifier, and method for classification by classifier | |
CN109543727B (en) | Semi-supervised anomaly detection method based on competitive reconstruction learning | |
Ji et al. | Unsupervised few-shot feature learning via self-supervised training | |
Wang et al. | imDC: an ensemble learning method for imbalanced classification with miRNA data | |
Tomita et al. | Sparse projection oblique randomer forests | |
Gallo et al. | Image and text fusion for upmc food-101 using bert and cnns | |
Guo et al. | Towards the classification of cancer subtypes by using cascade deep forest model in gene expression data | |
CN107357895B (en) | Text representation processing method based on bag-of-words model | |
CN113241115A (en) | Depth matrix decomposition-based circular RNA disease correlation prediction method | |
CN112232395B (en) | Semi-supervised image classification method for generating countermeasure network based on joint training | |
Zhang | Deep generative model for multi-class imbalanced learning | |
Rosli et al. | Development of CNN transfer learning for dyslexia handwriting recognition | |
Liong et al. | Automatic traditional Chinese painting classification: A benchmarking analysis | |
Iqbal et al. | A dynamic weighted tabular method for convolutional neural networks | |
Watson et al. | Adversarial random forests for density estimation and generative modeling | |
Poelmans et al. | Text mining with emergent self organizing maps and multi-dimensional scaling: A comparative study on domestic violence | |
Matsuda et al. | Single-layered complex-valued neural network with SMOTE for imbalanced data classification | |
CN109934281B (en) | Unsupervised training method of two-class network | |
CN113361591A (en) | Category imbalance processing method based on category combination and sample sampling | |
Kundu et al. | Optimal Machine Learning Based Automated Malaria Parasite Detection and Classification Model Using Blood Smear Images. | |
Freed et al. | Application of support vector machines to the classification of galaxy morphologies | |
CN115497564A (en) | Antigen identification model establishing method and antigen identification method | |
Versteeg et al. | Boosting local causal discovery in high-dimensional expression data | |
Liu et al. | Learning from small data: A pairwise approach for ordinal regression |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20210907 |