CN110659682A - Data classification method based on MCWD-KSMOTE-AdaBoost-DenseNet algorithm - Google Patents
- Publication number: CN110659682A (application number CN201910895521.0A)
- Authority: CN (China)
- Prior art keywords: samples, data, cluster, weight, instances
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F18/2148 — Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting, characterised by the process organisation or structure, e.g. boosting cascade
- G06F18/2411 — Classification techniques based on the proximity to a decision surface, e.g. support vector machines
- G06N3/045 — Neural networks; combinations of networks
- G06N3/08 — Neural networks; learning methods
Abstract
The invention relates to a data classification method based on MCWD-KSMOTE-AdaBoost-DenseNet in the field of data processing, characterized by comprising the following steps: (1) determine a tag correlation matrix; (2) assign sub-cluster weights to adjust the imbalance of samples within each class; (3) predict the label information of all instances in the training set one by one; (4) normalize the data; (5) update the network weights with an Adam optimizer while training a convolutional neural network, using cross entropy as the objective loss function.
Description
Technical Field
The invention relates to the field of data processing and machine learning, in particular to a method for accurately classifying unbalanced data.
Background
Data that computers can process in batches must first be collected and organized by people through acquisition equipment such as sensors. The deep knowledge and rules hidden behind the data must then be extracted by methods such as learning-based analysis and data mining, improving people's ability to perceive and connect with external things. However, data generated in real life are often unbalanced, while accurate data classification has penetrated every aspect of daily life. The urgent problem is that most existing classifier models are designed for balanced data, whereas the data produced in practice usually contain classes in unbalanced proportions; this imbalance typically causes the classification accuracy to drop sharply, degrades the classification effect in severe cases, and may even fail to meet practical or scientific requirements. Existing undersampling methods discard a large number of negative-sample features, so the model cannot fully learn the characteristics of the negative samples and the classification accuracy on them decreases. Oversampling, on the other hand, generates positive samples that are not genuinely collected ones; while it increases the number and diversity of samples, it also introduces sample noise.
Algorithms for the data classification problem also face incomplete data labels, labels that are difficult to obtain, and large data volumes. Classical algorithms such as LP, BR, ECC and ML-KNN all require the labels in the training data to be complete, but as data volumes grow explosively, fully labeled instances are not easy to acquire. Under strong noise and poor sample anti-interference capability, AdaBoost struggles to effectively identify and filter noisy samples. Moreover, when class imbalance exists between minority and majority classes, the uneven sample distribution is not fully considered and the "marginalization" problem easily becomes prominent. At the data level, oversampling or undersampling is currently the main approach, but because of the unbalanced distribution of an unbalanced data set, a single classification algorithm performs poorly and the resulting model has low accuracy. Therefore, to realize accurate data classification, an efficient and accurate classification method must be established that effectively solves the data imbalance problem and improves the classification accuracy, so that problems can be found by staff in time and a better basis is provided for subsequent operations such as prediction.
Disclosure of Invention
In view of the problems in the prior art, the technical problem to be solved by the present invention is to provide a data classification method based on MCWD, KSMOTE-AdaBoost and DenseNet; the specific flow is shown in fig. 1.
The technical scheme comprises the following implementation steps:
(1) determining a tag correlation matrix
where I_c1 denotes the set of instances annotated with label c1, |I_c1| the number of instances annotated with c1, and |I_c1 ∩ I_c2| the number of instances annotated with both c1 and c2. s is a parameter set to avoid, to some extent, extreme cases caused by the label imbalance problem.
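The description names the quantities that enter the correlation matrix but the formula image itself is not reproduced. A minimal sketch of one plausible form — a label co-occurrence count smoothed by the parameter s; the exact expression in the patent may differ:

```python
import numpy as np

def label_correlation(Y, s=1.0):
    """Smoothed label correlation from a binary label matrix.

    Y: (n_instances, n_labels) with Y[i, j] = 1 if instance i carries label j.
    s: smoothing parameter guarding against extreme values under label imbalance.
    Entry (c1, c2) relates |I_c1 ∩ I_c2| to |I_c1| (a hypothetical form,
    not the patent's verbatim formula).
    """
    Y = np.asarray(Y, dtype=float)
    co = Y.T @ Y                       # co[c1, c2] = number of instances carrying both labels
    counts = Y.sum(axis=0)             # counts[c] = |I_c|
    return co / (counts[:, None] + s)  # smoothed per-label ratio
```

For three instances labeled {c1, c2}, {c1} and {c2}, the smoothed correlation between c1 and c2 is 1 / (2 + s).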
(2) Sub-cluster weights w (i) are assigned, adjusting the imbalance of samples within a class:
where c represents the total number of class clusters into which the sample set is partitioned and num(i) indicates the number of samples in the i-th class cluster; that is, the more samples a cluster contains, the lower its weight. A balanced distribution within each class is thus finally realized.
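The inverse relation between cluster size and weight can be sketched as follows; the normalization to a unit sum is an assumption, since the patent's exact formula for w(i) is not reproduced in the text:

```python
import numpy as np

def sub_cluster_weights(cluster_sizes):
    """Assign each cluster a weight inversely related to its size, so that
    larger clusters receive lower weight (step (2); the normalization scheme
    is an illustrative assumption)."""
    sizes = np.asarray(cluster_sizes, dtype=float)
    inv = 1.0 / sizes        # bigger cluster -> smaller raw weight
    return inv / inv.sum()   # weights sum to 1
```

For cluster sizes [10, 40] this yields weights [0.8, 0.2], so the smaller cluster dominates the synthesis budget.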
(3) Predict the label information of all instances in the training set one by one:
In the formula, KNN(I_test) denotes the k nearest neighbors of the test instance I_test, w_ij^(t-1) is the weight of the j-th label of the i-th instance at the previous iteration, t is the number of iterations, Y represents the label matrix, and ŷ represents the predicted label.
In addition, the weight is updated as follows:
where sgn(·) is the sign function, e is a high-confidence threshold in (0.5, 1) and c is a low-confidence threshold in (0, 0.5); in this way the weight is reassigned a value in [-1, 1].
Repeat the above steps: divide the sample set into a certain number of clusters with a clustering algorithm, and from the number of samples to be synthesized and the number of samples in each cluster obtain the weight occupied by each cluster and the number of samples that need to be synthesized in it. The weights obtained in each iteration are reset. When 80% of the label information in the data has been recovered, i.e., the instance labels no longer contain the missing value 0, the loop ends and the next step continues. The algorithm flow is shown in fig. 2.
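The per-cluster synthesis budget of the KSMOTE step can be illustrated as below; `allocate_synthetic` and `smote_interpolate` are hypothetical helper names, and classic SMOTE interpolation between a sample and one of its neighbors is assumed for the synthesis itself:

```python
import numpy as np

def allocate_synthetic(total_new, weights):
    """Split the synthetic-sample budget across clusters in proportion to each
    cluster's weight (rounding down; the remainder goes to the last cluster)."""
    alloc = [int(total_new * w) for w in weights]
    alloc[-1] += total_new - sum(alloc)
    return alloc

def smote_interpolate(x, neighbor, rng):
    """Classic SMOTE step: a random point on the segment between a minority
    sample and one of its nearest neighbors."""
    gap = rng.random()  # gap in [0, 1)
    return x + gap * (neighbor - x)
```

With a budget of 10 samples and cluster weights [0.8, 0.2], the sparse cluster receives 8 synthetic samples and the dense one 2.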
(4) Data normalization q_new:
In the formula, q_max and q_min are the maximum and minimum values of the raw data, i.e., q_new = (q − q_min) / (q_max − q_min).
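Assuming the standard min-max rule implied by the description (a uniform mapping onto [0, 1]), the normalization is:

```python
import numpy as np

def min_max_normalize(q):
    """Map raw data uniformly onto [0, 1] via q_new = (q - q_min) / (q_max - q_min)."""
    q = np.asarray(q, dtype=float)
    q_min, q_max = q.min(), q.max()
    return (q - q_min) / (q_max - q_min)
```

For raw values [2, 4, 6] this produces [0, 0.5, 1].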
(5) Network weights are updated using Adam optimizers in training convolutional neural networks, using cross entropy as the objective loss function (loss):
In the formula, g_t represents the gradient at step t and θ_(t-1) the parameters being updated; α defaults to 0.001. The specific network structure is shown in fig. 4.
In the formula, y represents the desired output and ŷ the actual output. The training error fitting process is shown in fig. 3.
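A sketch of the loss and optimizer of step (5) in plain NumPy: mean cross entropy over one-hot targets, and one standard Adam update with the stated default α = 0.001 (β1, β2 and ε are the usual Adam defaults, assumed here):

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy between one-hot targets and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)  # guard against log(0)
    return float(-np.mean(np.sum(y_true * np.log(y_pred), axis=1)))

def adam_step(theta, g, m, v, t, alpha=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: first/second moment estimates, bias correction, step."""
    m = b1 * m + (1 - b1) * g            # biased first moment
    v = b2 * v + (1 - b2) * g ** 2       # biased second moment
    m_hat = m / (1 - b1 ** t)            # bias-corrected moments
    v_hat = v / (1 - b2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

On the first step with gradient 1, the bias-corrected moments are both 1, so the parameter moves by almost exactly α.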
Compared with the prior art, the invention has the advantages that:
(1) The invention overcomes the problem that, when the total classification accuracy is maximized, the classification model is biased toward the majority classes and ignores the minority classes, leaving the minority-class accuracy low; the classification accuracy on unbalanced data can thus be effectively improved.
(2) The MCWD and KSMOTE-AdaBoost methods are applied to data classification and combined with DenseNet, obtaining higher classification accuracy. This shows that the invention achieves a better classification effect when classifying unbalanced data.
Drawings
For a better understanding of the present invention, reference is made to the following further description taken in conjunction with the accompanying drawings.
FIG. 1 is a flow chart of the steps of establishing an unbalanced data classification algorithm based on MCWD-KSMOTE-AdaBoost-DenseNet;
FIG. 2 is a flow chart for establishing an unbalanced data classification algorithm based on MCWD-KSMOTE-AdaBoost-DenseNet;
FIG. 3 is a training error fit graph;
fig. 4 is a diagram of a network update architecture.
Detailed description of the preferred embodiments
The present invention will be described in further detail below with reference to examples.
The data set selected for this embodiment contains four classes and 800 groups of samples in total: 200 groups each of stars, galaxies, quasars and unknown celestial bodies. From each of the four classes, 160 groups are drawn by random sampling as the training set and the remaining 40 groups serve as the test set. Finally, the total number of samples used for training is 640 and the total number used for testing is 160.
The overall flow of the unbalanced data classification algorithm provided by the invention is shown in fig. 1, and the specific steps are as follows:
Take the existing 200 groups of data, of which 50 each are labeled star, galaxy and quasar, and the rest are labeled unknown. Randomly choose 10 groups; if these 10 groups are marked only as stars or only as galaxies, never both, the association between stars and galaxies is estimated to be 0.
where I_c1 denotes the set of instances annotated with label c1, |I_c1| the number of instances annotated with c1, and |I_c1 ∩ I_c2| the number of instances annotated with both c1 and c2. The correlation matrix is obtained using the fully recovered label information of 80% of the sampled data.
(2) Sub-cluster weights w (i) are assigned, adjusting the imbalance of samples within a class:
where c represents the total number of class clusters into which the sample set is partitioned, e.g. 4 classes, and num(i) indicates the number of samples in the i-th class cluster, here 200 groups; that is, the more samples a cluster contains, the lower its weight. A balanced distribution within each class is finally realized.
Divide the sample set into a certain number of clusters with a clustering algorithm, and from the number of samples to be synthesized and the number of samples contained in each cluster obtain the weight occupied by each cluster and the number of samples to synthesize.
(3) Predict the label information of all instances in the training set one by one:
In the formula, KNN(I_test) denotes the k nearest neighbors of the test instance I_test, w_ij^(t-1) is the weight of the j-th label of the i-th instance at the previous iteration, t is the number of iterations, Y represents the label matrix, and ŷ represents the predicted label.
In addition, the weight is updated as follows:
where sgn(·) is the sign function, e is a high-confidence threshold in (0.5, 1) and c is a low-confidence threshold in (0, 0.5); in this way the weight is reassigned a value in [-1, 1].
Repeat the steps above: divide the sample set into a specific number of clusters with a clustering algorithm, and from the number of synthesized samples and the number of samples in each cluster obtain the weight occupied by each cluster and the number of samples to be synthesized. The weights obtained in each iteration are reset. When 80% of the label information in the data has been recovered, i.e., the instance labels no longer contain the missing value 0, the loop ends and the next step continues. The algorithm flow is shown in fig. 2.
(4) Data normalization q_new:
In the formula, q_max and q_min are the maximum and minimum values of the raw data; the data are thus uniformly mapped onto the interval [0, 1], which improves the convergence speed and the accuracy of the model.
(5) Network weights are updated using Adam optimizers in training convolutional neural networks, using cross entropy as the objective loss function (loss):
the neural network is trained in 60 times in total, the initial learning rate is set to be 0.01, and the initial learning rate is reduced by 10 times at 10 th, 30 th and 50 th times respectively. The training process is as shown in figure x, and the loss on the verification set is unstable due to the large learning rate of the first 10 times of training. As the learning rate decreases and the training increases, the loss on both the training set and the test set tends to stabilize and slowly decrease. The verification set loss hardly decreased after 30 training runs. To prevent overfitting, we finally retained the weights trained 35 times as the final model. As shown in fig. 3
In the formula, g_t represents the gradient at step t and θ_(t-1) the parameters being updated; α defaults to 0.001. The specific network structure is shown in fig. 4.
In the formula, y represents the desired output and ŷ the actual output. The training error fitting process is shown in fig. 3.
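The learning-rate schedule described in this embodiment (start at 0.01, divide by 10 at epochs 10, 30 and 50) can be sketched as:

```python
def step_lr(epoch, base_lr=0.01, milestones=(10, 30, 50), factor=0.1):
    """Learning rate at a given epoch under the embodiment's step schedule:
    base_lr, multiplied by `factor` at each milestone epoch already reached."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= factor
    return lr
```

So epochs 0-9 train at 0.01, epochs 10-29 at 0.001, and from epoch 50 on at 1e-5.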
To verify the accuracy of the present invention in classifying unbalanced data, four classification experiments were performed; the experimental results are shown in table 3. The accuracy of the established method combining MCWD, KSMOTE-AdaBoost and DenseNet in classifying the unbalanced data stays above 92%; the method achieves a high accuracy rate on the basis of guaranteed stability and shows a good classification effect. The established MCWD, KSMOTE-AdaBoost and DenseNet classification method is therefore effective, provides a better way to build an accurate data classification model, and has practical value.
Claims (1)
1. A data classification method based on the MCWD-KSMOTE-AdaBoost-DenseNet algorithm, characterized by comprising the following steps: (1) determining a tag correlation matrix; (2) assigning sub-cluster weights to adjust the imbalance of samples within each class; (3) predicting the label information of all instances in the training set one by one; (4) normalizing the data; (5) updating the network weights with an Adam optimizer when training the convolutional neural network, using cross entropy as the objective loss function; specifically comprising the following five steps:
Step one: determine the tag correlation matrix:
where I_c1 denotes the set of instances annotated with label c1, |I_c1| the number of instances annotated with c1, and |I_c1 ∩ I_c2| the number of instances annotated with both c1 and c2; s is a parameter set to avoid, to some extent, extreme cases caused by the label imbalance problem;
step two: sub-cluster weights w (i) are assigned, adjusting the imbalance of samples within a class:
where c represents the total number of class clusters into which the sample set is partitioned and num(i) represents the number of samples in the i-th class cluster; that is, the more samples a cluster contains, the smaller its weight, finally realizing a balanced distribution within each class;
Step three: predict the label information of all instances in the training set one by one:
In the formula, KNN(I_test) denotes the k nearest neighbors of the test instance I_test, w_ij^(t-1) is the weight of the j-th label of the i-th instance at the previous iteration, t is the number of iterations, Y represents the label matrix, and ŷ represents the predicted label;
in addition, the weight is updated as follows:
where sgn(·) is the sign function, e is a high-confidence threshold in (0.5, 1) and c is a low-confidence threshold in (0, 0.5); in this way the weight is reassigned a value in [-1, 1];
repeating the above steps: dividing the sample set into a specific number of clusters with a clustering algorithm, obtaining from the number of synthesized samples and the number of samples in each cluster the weight occupied by each cluster and the number of samples to be synthesized, and resetting the weights obtained in each iteration; when 80% of the label information in the data has been recovered, i.e., the instance labels no longer contain the missing value '0', the loop ends and the next step continues;
step four: data normalization q_new:
in the formula, q_max and q_min are the maximum and minimum values of the raw data;
step five: updating network weights by using an Adam optimizer when the convolutional neural network is trained, and using cross entropy as a target loss function (loss);
in the formula, g_t represents the gradient at step t and θ_(t-1) the parameters being updated; α defaults to 0.001.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910895521.0A CN110659682A (en) | 2019-09-21 | 2019-09-21 | Data classification method based on MCWD-KSMOTE-AdaBoost-DenseNet algorithm |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110659682A (en) | 2020-01-07 |
Family
ID=69037566
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910895521.0A Pending CN110659682A (en) | 2019-09-21 | 2019-09-21 | Data classification method based on MCWD-KSMOTE-AdaBoost-DenseNet algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110659682A (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111666872A (en) * | 2020-06-04 | 2020-09-15 | 电子科技大学 | Efficient behavior identification method under data imbalance |
CN111666872B (en) * | 2020-06-04 | 2022-08-05 | 电子科技大学 | Efficient behavior identification method under data imbalance |
CN112115638A (en) * | 2020-08-28 | 2020-12-22 | 合肥工业大学 | Transformer fault diagnosis method based on improved Adam algorithm optimization neural network |
CN112115638B (en) * | 2020-08-28 | 2023-09-26 | 合肥工业大学 | Transformer fault diagnosis method based on improved Adam algorithm optimization neural network |
CN113030197A (en) * | 2021-03-26 | 2021-06-25 | 哈尔滨工业大学 | Gas sensor drift compensation method |
CN113030197B (en) * | 2021-03-26 | 2022-11-04 | 哈尔滨工业大学 | Gas sensor drift compensation method |
CN113361624A (en) * | 2021-06-22 | 2021-09-07 | 北京邮电大学 | Machine learning-based sensing data quality evaluation method |
CN113408707A (en) * | 2021-07-05 | 2021-09-17 | 哈尔滨理工大学 | Network encryption traffic identification method based on deep learning |
Legal Events
| Date | Code | Title |
|---|---|---|
| | PB01 | Publication |
| | SE01 | Entry into force of request for substantive examination |
| | WD01 | Invention patent application deemed withdrawn after publication |

Application publication date: 20200107