CN113361591A - Category imbalance processing method based on category combination and sample sampling - Google Patents

Category imbalance processing method based on category combination and sample sampling

Info

Publication number
CN113361591A
Authority
CN
China
Prior art keywords
category
data set
class
model
samples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110620136.2A
Other languages
Chinese (zh)
Inventor
叶方全
陈逸龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Pengkang Big Data Co ltd
Guangzhou Tianpeng Computer Technology Co ltd
Chongqing Nanpeng Artificial Intelligence Technology Research Institute Co ltd
Original Assignee
Chongqing Pengkang Big Data Co ltd
Guangzhou Tianpeng Computer Technology Co ltd
Chongqing Nanpeng Artificial Intelligence Technology Research Institute Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Pengkang Big Data Co ltd, Guangzhou Tianpeng Computer Technology Co ltd, Chongqing Nanpeng Artificial Intelligence Technology Research Institute Co ltd filed Critical Chongqing Pengkang Big Data Co ltd
Priority to CN202110620136.2A priority Critical patent/CN113361591A/en
Publication of CN113361591A publication Critical patent/CN113361591A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/24765Rule-based classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/285Selection of pattern recognition techniques, e.g. of classifiers in a multi-classifier system

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a category imbalance processing method based on category combination and sample sampling, which comprises the following steps: S1: constructing an original data set; S2: training process; S3: testing process. The method trains a plurality of binary classifiers and then combines the results of the binary classifiers to obtain the final prediction result. Compared with a multi-class classification method, this approach is more flexible: because each binary classifier is independent, a different neural network model can be used for each classifier, and even predefined rules can be incorporated.

Description

Category imbalance processing method based on category combination and sample sampling
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a category imbalance processing method based on category combination and sample sampling.
Background
The class imbalance problem is very common when processing real-world data sets. Class imbalance means that the numbers of samples in the different classes of a data set differ, and differ greatly. If the problem is not addressed, the predictions of the trained model tend to be biased toward the classes with large amounts of data, which is undesirable.
There has been much research in this area. One research direction is based on sampling, and sampling methods can be divided into oversampling and undersampling. Oversampling adds samples to the classes with fewer samples so that the class sizes become balanced; however, oversampling easily causes the model to overfit and reduces its generalization performance. Undersampling removes samples from the classes with more samples to balance the class sizes. Random undersampling draws a subset of samples from the majority classes; its drawback is that the remaining samples are never used. The EasyEnsemble algorithm addresses this problem: in each round it samples from the majority classes a subset whose size is close to that of the minority classes, trains a model on it, and repeats this multiple times to obtain multiple models, so that all samples of the majority classes can be used. However, in real data sets, especially disease classification data sets, the frequencies with which the different diseases occur differ greatly, sometimes by a factor of several hundred. A sample down-sampling method alone is then insufficient, because if the majority classes are down-sampled to roughly the number of samples of the minority classes, very few samples remain for training, and the model underfits.
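For illustration only, the following Python sketch shows EasyEnsemble-style undersampling as described above (prior art, not the method of the invention). The function name, the two-class simplification, and the data representation are assumptions made for this example.

```python
import random
from collections import defaultdict

def easyensemble_style_subsets(samples, labels, minority_label, n_rounds, seed=0):
    """Sketch of EasyEnsemble-style undersampling (assumed details): each round
    keeps all minority samples plus a random majority subset of the same size,
    yielding one balanced training set per round; one model is trained per set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    minority = by_class[minority_label]
    majority = [x for y, xs in by_class.items() if y != minority_label for x in xs]
    for _ in range(n_rounds):
        subset = rng.sample(majority, k=min(len(minority), len(majority)))
        balanced = [(x, 1) for x in minority] + [(x, 0) for x in subset]
        rng.shuffle(balanced)
        yield balanced  # train one model on each balanced set, then combine the models
```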
Disclosure of Invention
The present invention is directed to a class imbalance processing method based on class combination and sample sampling, so as to solve the problems in the background art.
In order to achieve the above purpose, the invention provides the following technical solution: a category imbalance processing method based on category combination and sample sampling, characterized in that it comprises the following steps:
S1: constructing an original data set: the original data set is defined as follows:
D = {(x_i, y_i) | i = 1, …, N}, y_i ∈ {1, …, C}
where x_i is a word sequence, y_i is its class label, C is the number of classes, and N is the number of samples;
S2: training process: train M models; the training of each model is independent, so the M models can be trained in parallel. To train each model, a new binary data set is constructed from the original data set, and the model is then trained on this new data set. Repeating this M times yields M different models, i.e. M binary classifiers P_m;
S3: testing process: given a test sample x, predict its class y ∈ {1, …, C}. P_m(meta(c) | x) denotes the probability that the class of x is meta(c), where meta(c) ∈ {0, 1}, c ∈ {1, …, C}, and meta(c) denotes the meta-class to which class c belongs.
Preferably, the binary data set in step S2 is defined as follows:
D_m = {(x_j, meta(y_j)) | j = 1, …, 2·n_sample}, meta(y_j) ∈ {0, 1}
where n_sample is the number of samples per category in the new data set.
Preferably, if in step S3 a category c is not included among the categories covered by the binary classifier P_m, then
P_m(meta(c) | x) = 0.
p(c | x) represents the probability that x is classified as c, and is calculated as follows:
p(c | x) = (1/M) Σ_{m=1…M} P_m(meta(c) | x)
The predicted class y is
y = argmax_c p(c | x).
Compared with the prior art, the method trains a plurality of binary classifiers and then combines the result of each binary classifier to obtain the final prediction result. Compared with a multi-class classification method, this is more flexible: because each binary classifier is independent, a different neural network model can be used for each classifier, and even predefined rules can be incorporated.
Drawings
FIG. 1 is an algorithmic schematic of the raw data set construction method of the present invention;
FIG. 2 is an algorithmic schematic of the training process of the present invention;
FIG. 3 is an algorithm diagram of the testing process of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings; it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention as claimed, but is merely representative of selected embodiments of the invention.
Referring to fig. 1-3, the present invention provides a technical solution, a category imbalance processing method based on category combination and sample sampling, comprising the following steps:
S1: constructing an original data set: the original data set is defined as follows:
D = {(x_i, y_i) | i = 1, …, N}, y_i ∈ {1, …, C}
where x_i is a word sequence, y_i is its class label, C is the number of classes, and N is the number of samples;
S2: training process: train M models; the training of each model is independent, so the M models can be trained in parallel. To train each model, a new binary data set is constructed from the original data set, and the model is then trained on this new data set. Repeating this M times yields M different models, i.e. M binary classifiers P_m. It should be noted that each category in the new data set contains several categories of the original data set; that is, a category in the new data set is a super-class of some categories of the original data set, also referred to herein as a meta-class;
if the number of samples | X | has not yet reached the predetermined value nsampleThen select the data D of a certain category c of the original data set DcIf | x | + | Dc|>nsampleThen, the slave D is randomcIn which n is selectedsample- | x | samples are added to x, otherwise, D is addedcAll samples are added to X, and the process is repeated until the number of samples reaches nsample
S3: testing process: given a test sample x, predict its class y ∈ {1, …, C}. P_m(meta(c) | x) denotes the probability that the class of x is meta(c), where meta(c) ∈ {0, 1}, c ∈ {1, …, C}, and meta(c) denotes the meta-class to which class c belongs.
In this embodiment, the binary data set in step S2 is defined as follows:
D_m = {(x_j, meta(y_j)) | j = 1, …, 2·n_sample}, meta(y_j) ∈ {0, 1}
where n_sample is the number of samples per category in the new data set.
In this embodiment, if a category c is not included among the categories covered by the binary classifier P_m in step S3, then
P_m(meta(c) | x) = 0.
p(c | x) represents the probability that x is classified as c, and is calculated as follows:
p(c | x) = (1/M) Σ_{m=1…M} P_m(meta(c) | x)
The predicted class y is
y = argmax_c p(c | x).
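For illustration, a minimal sketch of this test-time combination follows. The averaging form of p(c | x) and the classifier interface (a callable returning the two meta-class probabilities) are assumptions made for this example; the meta_m mapping records, for classifier P_m, which meta-class (0 or 1) each original category belongs to, or None when the category is not covered.

```python
def predict_class(x, classifiers, meta_maps, num_classes):
    """Sketch (assumed combination rule): sum each binary classifier's probability
    for the meta-class containing category c, treating classifiers that do not
    cover c as contributing 0, average over the M classifiers, and take the argmax."""
    M = len(classifiers)
    scores = [0.0] * num_classes
    for P_m, meta_m in zip(classifiers, meta_maps):
        probs = P_m(x)                       # assumed: returns [P_m(meta=0 | x), P_m(meta=1 | x)]
        for c in range(num_classes):
            if meta_m.get(c) is not None:    # c is covered by this classifier's meta-classes
                scores[c] += probs[meta_m[c]]
    p = [s / M for s in scores]              # p(c | x) as an average over the M classifiers
    return max(range(num_classes), key=lambda c: p[c])   # y = argmax_c p(c | x)
```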
EasyEnsemble can alleviate the class imbalance problem to a certain extent. However, in real data sets, especially disease classification data sets, the frequencies with which the different diseases occur differ greatly, sometimes by a factor of several hundred, so a sample down-sampling method alone is not sufficient: if the majority classes are down-sampled to roughly the number of samples of the minority classes, very few samples remain for training the model, and the model underfits. The present method uses the two techniques of category combination and sample sampling: categories are first combined and then down-sampled, which better alleviates the problems caused by large differences in the numbers of samples per category.
The method trains a plurality of binary classifiers and then combines the result of each binary classifier to obtain the final prediction result. Compared with a multi-class classification method, this is more flexible: because each binary classifier is independent, a different neural network model can be used for each classifier, and even predefined rules can be incorporated.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein, and any reference signs in the claims are not intended to be construed as limiting the claim concerned.
Furthermore, it should be understood that although this description is organized by embodiments, each embodiment does not necessarily contain only a single independent technical solution; the description is presented in this way merely for clarity. Those skilled in the art should take the description as a whole, and the technical solutions in the embodiments may be combined as appropriate to form other embodiments understandable to those skilled in the art.

Claims (3)

1. A class imbalance processing method based on class combination and sample sampling, characterized in that the method comprises the following steps:
S1: constructing an original data set: the original data set is defined as follows:
D = {(x_i, y_i) | i = 1, …, N}, y_i ∈ {1, …, C}
where x_i is a word sequence, y_i is its class label, C is the number of classes, and N is the number of samples;
S2: training process: train M models; the training of each model is independent, so the M models can be trained in parallel. To train each model, a new binary data set is constructed from the original data set, and the model is then trained on this new data set. Repeating this M times yields M different models, i.e. M binary classifiers P_m;
S3: testing process: given a test sample x, predict its class y ∈ {1, …, C}. P_m(meta(c) | x) denotes the probability that the class of x is meta(c), where meta(c) ∈ {0, 1}, c ∈ {1, …, C}, and meta(c) denotes the meta-class to which class c belongs.
2. The method of claim 1, characterized in that the binary data set in step S2 is defined as follows:
D_m = {(x_j, meta(y_j)) | j = 1, …, 2·n_sample}, meta(y_j) ∈ {0, 1}
where n_sample is the number of samples per category in the new data set.
3. The method of claim 1, characterized in that if, in step S3, a category c is not included among the categories covered by the binary classifier P_m, then
P_m(meta(c) | x) = 0.
p(c | x) represents the probability that x is classified as c, and is calculated as follows:
p(c | x) = (1/M) Σ_{m=1…M} P_m(meta(c) | x)
The predicted class y is
y = argmax_c p(c | x).
CN202110620136.2A 2021-06-03 2021-06-03 Category imbalance processing method based on category combination and sample sampling Pending CN113361591A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110620136.2A CN113361591A (en) 2021-06-03 2021-06-03 Category imbalance processing method based on category combination and sample sampling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110620136.2A CN113361591A (en) 2021-06-03 2021-06-03 Category imbalance processing method based on category combination and sample sampling

Publications (1)

Publication Number Publication Date
CN113361591A true CN113361591A (en) 2021-09-07

Family

ID=77531772

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110620136.2A Pending CN113361591A (en) 2021-06-03 2021-06-03 Category imbalance processing method based on category combination and sample sampling

Country Status (1)

Country Link
CN (1) CN113361591A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100094800A1 (en) * 2008-10-09 2010-04-15 Microsoft Corporation Evaluating Decision Trees on a GPU
US20110075920A1 (en) * 2009-09-14 2011-03-31 Siemens Medical Solutions Usa, Inc. Multi-Level Contextual Learning of Data
CN106022376A (en) * 2016-05-18 2016-10-12 安徽大学 Structured SVM-based unbalanced evaluation criterion direct optimization algorithm
CN106599913A (en) * 2016-12-07 2017-04-26 重庆邮电大学 Cluster-based multi-label imbalance biomedical data classification method
CN108171280A (en) * 2018-01-31 2018-06-15 国信优易数据有限公司 A kind of grader construction method and the method for prediction classification
CN109086412A (en) * 2018-08-03 2018-12-25 北京邮电大学 A kind of unbalanced data classification method based on adaptive weighted Bagging-GBDT
CN109447118A (en) * 2018-09-26 2019-03-08 中南大学 A kind of uneven learning method based on Adaboost and lack sampling
CN109359704A (en) * 2018-12-26 2019-02-19 北京邮电大学 A kind of more classification methods integrated based on adaptive equalization with dynamic layered decision
KR102194962B1 (en) * 2020-05-20 2020-12-24 주식회사 네이처모빌리티 System for providing bigdata based artificial intelligence automatic allocation matching service using assignmanet problem and simulated annealing

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
IN WOONG HAN et al.: "Risk prediction platform for pancreatic fistula after pancreatoduodenectomy using artificial intelligence", World Journal of Gastroenterology *
WU MENG et al.: "Research on improved AdaBoost algorithm based on multi-class imbalanced classification", Journal of Beijing Information Science & Technology University (Natural Science Edition) *
ZHANG QUANGUI et al.: "Multi-level joint learning recommendation method fusing metadata and implicit feedback information", Application Research of Computers *

Similar Documents

Publication Publication Date Title
Han et al. Unsupervised generative modeling using matrix product states
JP2019028839A (en) Classifier, method for learning of classifier, and method for classification by classifier
CN109543727B (en) Semi-supervised anomaly detection method based on competitive reconstruction learning
Ji et al. Unsupervised few-shot feature learning via self-supervised training
Wang et al. imDC: an ensemble learning method for imbalanced classification with miRNA data
Tomita et al. Sparse projection oblique randomer forests
Gallo et al. Image and text fusion for upmc food-101 using bert and cnns
Guo et al. Towards the classification of cancer subtypes by using cascade deep forest model in gene expression data
CN107357895B (en) Text representation processing method based on bag-of-words model
CN113241115A (en) Depth matrix decomposition-based circular RNA disease correlation prediction method
CN112232395B (en) Semi-supervised image classification method for generating countermeasure network based on joint training
Zhang Deep generative model for multi-class imbalanced learning
Rosli et al. Development of CNN transfer learning for dyslexia handwriting recognition
Liong et al. Automatic traditional Chinese painting classification: A benchmarking analysis
Iqbal et al. A dynamic weighted tabular method for convolutional neural networks
Watson et al. Adversarial random forests for density estimation and generative modeling
Poelmans et al. Text mining with emergent self organizing maps and multi-dimensional scaling: A comparative study on domestic violence
Matsuda et al. Single-layered complex-valued neural network with SMOTE for imbalanced data classification
CN109934281B (en) Unsupervised training method of two-class network
CN113361591A (en) Category imbalance processing method based on category combination and sample sampling
Kundu et al. Optimal Machine Learning Based Automated Malaria Parasite Detection and Classification Model Using Blood Smear Images.
Freed et al. Application of support vector machines to the classification of galaxy morphologies
CN115497564A (en) Antigen identification model establishing method and antigen identification method
Versteeg et al. Boosting local causal discovery in high-dimensional expression data
Liu et al. Learning from small data: A pairwise approach for ordinal regression

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210907)