CN111950652A - Semi-supervised learning data classification algorithm based on similarity - Google Patents

Semi-supervised learning data classification algorithm based on similarity

Info

Publication number
CN111950652A
Authority
CN
China
Prior art keywords
data
similarity
semi
class
value
Prior art date
Legal status
Withdrawn
Application number
CN202010852138.XA
Other languages
Chinese (zh)
Inventor
孙栓柱
陈广
高阳
周春蕾
李逗
孙彬
王林
王其祥
高进
李春岩
沈洋
黄治军
张磊
傅高健
周心澄
Current Assignee
Jiangsu Wanwei Aisi Network Intelligent Industry Innovation Center Co ltd
Jiangsu Fangtian Power Technology Co Ltd
Original Assignee
Jiangsu Wanwei Aisi Network Intelligent Industry Innovation Center Co ltd
Jiangsu Fangtian Power Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Jiangsu Wanwei Aisi Network Intelligent Industry Innovation Center Co ltd and Jiangsu Fangtian Power Technology Co Ltd
Priority to CN202010852138.XA
Publication of CN111950652A
Legal status: Withdrawn (current)

Classifications

    • G06F18/23213 (PHYSICS; COMPUTING; ELECTRIC DIGITAL DATA PROCESSING; Pattern recognition; Analysing; Clustering techniques; Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions, with fixed number of clusters, e.g. K-means clustering)
    • G06F18/214 (Pattern recognition; Analysing; Design or setup of recognition systems or techniques; Extraction of features in feature space; Generating training patterns; Bootstrap methods, e.g. bagging or boosting)
    • G06F18/22 (Pattern recognition; Analysing; Matching criteria, e.g. proximity measures)

Abstract

The invention discloses a similarity-based semi-supervised learning data classification algorithm comprising three parts: k-means clustering of unlabeled samples, semi-supervised similarity calculation, and a combination of the two that expands the minority class, performs model classification, and then evaluates the classification effect of the model. Rather than naively clustering with k = 2, the invention defines an algorithmic mechanism based on the smoothness and cluster assumptions of semi-supervised learning: the value of k is determined by measuring the similarity between the minority class and the cluster sets, and the data set closest to the minority class is then identified and expanded into the labeled data. Expanding the labeled data with highly similar unlabeled data effectively alleviates the data imbalance problem, thereby effectively improving the recall and F1 value of the minority class.

Description

Semi-supervised learning data classification algorithm based on similarity
Technical Field
The invention relates to a new data classification algorithm, in particular to a semi-supervised learning data classification algorithm based on similarity.
Background
The data classification task often requires establishing a mapping f: X → Y between an input space X and an output space Y. Both binary and multi-class classification tasks require a large amount of labeled data for training, which places demands on the quantity and quality of the supervised data.
In academic research, public data sets generally provide a large number of labeled samples with relatively balanced distributions, on which models and methods perform well. In real application scenarios, however, supervision information is limited, the class distribution of the data is imbalanced, and the labeled content is highly domain-specific. Labeled samples are scarce, and labeling strongly domain-specific samples is very costly and sometimes cannot be done accurately. The ratio of majority-class to minority-class samples may exceed 1000:1. Traditional classification methods tend to improve model accuracy at the expense of minority-class recall, yet in some scenarios the recall of the minority class is precisely the metric of concern.
Class-imbalanced data can be addressed in two ways. One is to adjust the data distribution, mainly by sampling: resampling or oversampling the minority class, or undersampling the majority class. The other is to improve the model algorithm, for example by assigning different misclassification losses via a cost-sensitive matrix tailored to the data distribution, strengthening the model's learning of the minority class. For the problems of scarce and strongly domain-specific supervised data, semi-supervised learning methods can be applied, chiefly disagreement-based, generative, discriminative, and graph-based methods.
Disclosure of Invention
The purpose of the invention is as follows: aiming at scenarios with limited supervised labeled data, imbalanced label classes, and highly domain-specific labeled content, the invention provides a novel data classification algorithm, namely a similarity-based semi-supervised learning data classification algorithm. The method expands the minority-class set by calculating the similarity between unlabeled and labeled data, improving the model's classification and recognition of the minority class and raising its recall and F1 value.
The technical scheme is as follows:
a semi-supervised learning data classification algorithm based on similarity comprises three parts: k-means clustering of unlabeled samples, semi-supervised similarity calculation, and combination of the k-means clustering of the unlabeled samples and the semi-supervised similarity calculation to expand the small number of classes for model classification and then evaluate the classification effect of the models.
In a further embodiment, the data are classified using a semi-supervised approach, and the screened-out high-value minority-class unlabeled data are added to the minority class as labeled data for training.
In a further embodiment, the method further comprises a high-value minority-class data classification framework, whose main steps are as follows:
step (1): processing original data, and separating marked data from unmarked data;
step (2): clustering unlabeled data separated in the step (1) based on a similarity convergence algorithm of k-means clustering, wherein a k value is determined by a clustering result and a similarity calculation result of a small quantity of classes in labeled data, and further determining a set P' with the highest similarity;
and (3): expanding the set P' determined in the step 2 to the marked data P for model training;
and (4): the results of step (1) were evaluated by recall and F1 values.
In a further embodiment, in step (1), the original data are divided into a labeled data set K and an unlabeled data set D; the labeled data set K is divided into positive examples P and negative examples N, where the number of positive examples P is much smaller than the number of negative examples N, so P is the minority class and N is the majority class. The loss of misclassifying the minority class P as the majority class N is higher than that of misclassifying the majority class N as the minority class P, i.e. Cost(P, N) > Cost(N, P), where Cost(i, j) denotes the loss of misclassifying class i as class j. The unlabeled data set D is partitioned into D = {D_1, D_2, ..., D_k}, where k is the number of clusters.
in a further embodiment, in step (2), the k-means clustering algorithm is as follows;
Figure BDA0002645092180000021
wherein
Figure BDA0002645092180000025
Wherein the value of k is the number of clusters, Di(i 1.. k) is a set of unmarked data D divided according to k values, and x is DiSample point of (1), μiIs DiOf the center of (c). The distance between dataset D and dataset P is calculated by the formula:
Figure BDA0002645092180000022
wherein
Figure BDA0002645092180000023
Is DiThe jth sample point in (a); distance (P, D)i) Convergence, stopping the increase of the k value, DiIs P' to be found.
In a further embodiment, in step (4), the recall rate measures how many of the positive examples in the sample are correctly predicted, calculated as:

\mathrm{Recall} = \frac{TP}{TP + FN}

where TP is the number of true positives and FN the number of false negatives;

the F1 value takes both recall and precision into account, calculated as:

F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}

where Recall is the recall rate, Precision is the precision rate, and Precision = TP / (TP + FP), TP being true positives and FP false positives.
Advantageous effects: a notable advantage of the method is that clustering is not done naively with k = 2; instead, an algorithmic mechanism is defined according to the smoothness and cluster assumptions of semi-supervised learning, the value of k is determined by measuring the similarity between the minority class and the cluster sets, and the data set closest to the minority class is then identified and expanded into the labeled data. Expanding the labeled data with highly similar unlabeled data effectively alleviates the data imbalance problem, thereby effectively improving the recall and F1 value of the minority class.
Drawings
FIG. 1 is a diagram of the raw data partitioning.
FIG. 2 is a flow chart of the clustering-based similarity convergence algorithm of the present invention.
FIG. 3 is an example graph of clustering results around the minority class P.
FIG. 4 illustrates the k-means cluster convergence process of the present invention.
FIG. 5 is a graph of cluster number and distance variation.
FIG. 6 shows the recall and F1 values of the data classification results.
Detailed Description
A similarity-based semi-supervised learning data classification algorithm comprises three parts: k-means clustering of unlabeled samples, semi-supervised similarity calculation, and a combination of the two that expands the minority class, performs model classification, and then evaluates the classification effect of the model. The data are classified using a semi-supervised method, and the selected high-similarity unlabeled data are added to the training set as labeled data.
The invention also comprises a high-value minority-class data classification framework, whose main steps are as follows:
step (1): processing original data, and separating marked data from unmarked data;
step (2): clustering unlabeled data separated in the step (1) based on a similarity convergence algorithm of k-means clustering, wherein a k value is determined by a clustering result and a similarity calculation result of a small quantity of classes in labeled data, and further determining a set P' with the highest similarity;
and (3): expanding the set P' determined in the step 2 to the marked data P for model training;
and (4): the results of step (1) were evaluated by recall and F1 values.
1. Processing of raw data
The raw data can be divided into a labeled data set K and an unlabeled data set D (the raw data partitioning is shown in FIG. 1). The samples in data set K comprise two classes, positive examples P and negative examples N, where the number of positive examples P is much smaller than the number of negative examples N, and we are mainly concerned with the recall of the minority class P. Thus, the penalty for misclassifying the minority class P as the majority class N is higher than for misclassifying the majority class N as the minority class P, i.e. Cost(P, N) > Cost(N, P), where Cost(i, j) denotes the penalty of misclassifying class i as class j.
The unlabeled data set D can be partitioned by clustering into D = {D_1, D_2, ..., D_k}, where k is the number of clusters. From prior knowledge, namely the smoothness and cluster assumptions of semi-supervised data, the cluster D_i (i = 1, ..., k) most similar to P consists, with high probability, of samples that would also be labeled P. Expanding this set P' into P can improve recall while limiting the loss of precision, and thus raise the F1 value. The difficulty lies in determining the cluster number k and judging the similarity.
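As a concrete illustration of the data split in step (1), the following minimal Python sketch partitions raw feature data into P, N, and D. The array names and the label convention (1 for P, 0 for N, -1 for unlabeled) are illustrative assumptions, not prescribed by the patent:

```python
import numpy as np

def split_raw_data(X: np.ndarray, y: np.ndarray):
    """Split raw data into minority class P, majority class N, unlabeled D."""
    labeled = y != -1                 # -1 marks unlabeled samples (assumption)
    P = X[labeled & (y == 1)]         # positive examples: the minority class
    N = X[labeled & (y == 0)]         # negative examples: the majority class
    D = X[~labeled]                   # unlabeled data set
    return P, N, D
```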
2. Similarity convergence algorithm based on k-means clustering
Instead of dividing the data set into 2 clusters according to the target label set {P, N}, the value of k is determined by iteratively increasing k and calculating the similarity between the target labeled samples and the unlabeled cluster samples, until the value converges.
(1) k-means clustering
For a given value of k, perform k-means clustering until the clustering converges.
k-means obtains the optimal partition by minimizing the squared error:

E = \sum_{i=1}^{k} \sum_{x \in D_i} \| x - \mu_i \|^2

where

\mu_i = \frac{1}{|D_i|} \sum_{x \in D_i} x

Here k is the number of clusters, D_i (i = 1, ..., k) is a subset of the unlabeled data D partitioned according to the value of k, x is a sample point of D_i, and \mu_i is the centroid of D_i.

However, since minimizing this objective exactly is NP-hard, the (approximately) optimal partition {D_1, D_2, ..., D_k} for a given k is usually obtained by iterative optimization. An example of the clustering results around the data P is shown in FIG. 3.
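For illustration, the iterative optimization can be delegated to an off-the-shelf k-means implementation. The following is a minimal sketch using scikit-learn; the library choice is an assumption, as the patent does not prescribe an implementation:

```python
from sklearn.cluster import KMeans

def cluster_unlabeled(D, k, seed=0):
    """Partition the unlabeled data D (n_samples x n_features) into k clusters."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(D)
    # Return the clusters as a list of arrays [D_1, ..., D_k].
    return [D[km.labels_ == i] for i in range(k)]
```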
(2) Calculation of similarity
For the partition D = {D_1, D_2, ..., D_k} obtained for the specific value of k in (1), select from D_i (i = 1, ..., k) the cluster P' with the minimum distance to, i.e. the highest similarity with, P. Here the classical Euclidean distance is chosen to measure the distance between the two data sets:

\mathrm{Distance}(P, D_i) = \frac{1}{|P|\,|D_i|} \sum_{p \in P} \sum_{j=1}^{|D_i|} \| p - x_j^{(i)} \|

where x_j^{(i)} (j = 1, ..., |D_i|) is the jth sample point in D_i.
(3) Determining the cluster number k and the unlabeled cluster sample set P' nearest to P
Increase the value of k and repeat steps (1) and (2) until the Distance(P, D_i) in (2) begins to converge, at which point the increase of k stops. The k value obtained at this point is the value to be determined, and the D_i minimizing Distance(P, D_i) is the P' to be found.
The specific algorithm is as follows:
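(The original gives the algorithm listing as a figure. The following Python sketch reconstructs the whole similarity convergence loop from steps (1) through (3); the average pairwise Euclidean set distance and the convergence tolerance `tol` are assumptions consistent with the description above, not the patent's exact listing.)

```python
import numpy as np
from sklearn.cluster import KMeans

def set_distance(P, Di):
    """Average pairwise Euclidean distance between data sets P and D_i."""
    diffs = P[:, None, :] - Di[None, :, :]        # shape (|P|, |D_i|, dim)
    return np.linalg.norm(diffs, axis=-1).mean()

def find_P_prime(P, D, k_start=2, k_max=20, tol=1e-3, seed=0):
    """Increase k until min_i Distance(P, D_i) converges; return (P', final k)."""
    prev_dist, best_cluster = None, None
    for k in range(k_start, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(D)
        clusters = [D[km.labels_ == i] for i in range(k)]
        dists = np.array([set_distance(P, Di) for Di in clusters])
        i_min = int(np.argmin(dists))
        if prev_dist is not None and abs(dists[i_min] - prev_dist) < tol:
            return clusters[i_min], k             # distance converged: P' found
        prev_dist, best_cluster = dists[i_min], clusters[i_min]
    return best_cluster, k_max                    # fallback: no convergence by k_max
```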
3. Model training with the expanded labeled data P
Add the set P' found in step 2 by the similarity convergence algorithm based on k-means clustering to the labeled data P, thereby effectively expanding the minority class and alleviating its imbalance problem. The expanded data set is then divided into a training set and a test set for training; the trained model can be a classical machine-learning classification model or a neural-network classification model, chosen according to actual requirements.
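A brief sketch of this step follows; the random-forest classifier is purely illustrative, since the patent leaves the model choice open:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def train_expanded(P, P_prime, N, seed=0):
    """Expand P with the pseudo-labeled cluster P', then train a classifier."""
    X = np.vstack([P, P_prime, N])
    y = np.concatenate([np.ones(len(P) + len(P_prime)), np.zeros(len(N))])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    model = RandomForestClassifier(random_state=seed).fit(X_tr, y_tr)
    return model, X_te, y_te
```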
4. Evaluation of results
The classification effect of the model can be evaluated through the recall rate and the F1 value.
Recall, also known as the recall ratio, measures how many of the positive examples in the sample are correctly predicted:

\mathrm{Recall} = \frac{TP}{TP + FN}

where TP is the number of true positives and FN the number of false negatives;

the F1 value takes both recall and precision into account, calculated as:

F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}

where Recall is the recall rate, Precision is the precision rate, and Precision = TP / (TP + FP), TP being true positives and FP false positives.
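Under the same illustrative assumptions as above, the evaluation reduces to two scikit-learn calls, with the positive label (1) denoting the minority class P:

```python
from sklearn.metrics import recall_score, f1_score

def evaluate(model, X_te, y_te):
    """Recall and F1 of the minority (positive) class on the held-out test set."""
    y_pred = model.predict(X_te)
    return {"recall": recall_score(y_te, y_pred, pos_label=1),
            "f1": f1_score(y_te, y_pred, pos_label=1)}
```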

Claims (6)

1. A similarity-based semi-supervised learning data classification algorithm, characterized by comprising three parts: k-means clustering of unlabeled samples, semi-supervised similarity calculation, and a combination of the two that expands the minority class, performs model classification, and then evaluates the classification effect of the model.
2. The similarity-based semi-supervised learning data classification algorithm according to claim 1, wherein the data are classified by a semi-supervised method, and the screened-out high-value minority-class unlabeled data are added to the minority class as labeled data for training.
3. The similarity-based semi-supervised learning data classification algorithm according to claim 1, further comprising a high-value minority-class data classification framework, whose main steps are as follows:
step (1): processing original data, and separating marked data from unmarked data;
step (2): clustering unlabeled data separated in the step (1) based on a similarity convergence algorithm of k-means clustering, wherein a k value is determined by a clustering result and a similarity calculation result of a small quantity of classes in labeled data, and further determining a set P' with the highest similarity;
and (3): expanding the set P' determined in the step 2 to the marked data P for model training;
and (4): the results of step (1) were evaluated by recall and F1 values.
4. The similarity-based semi-supervised learning data classification algorithm according to claim 3, wherein in step (1) the raw data are divided into a labeled data set K and an unlabeled data set D; the labeled data set K is divided into positive examples P and negative examples N, where the number of positive examples P is much smaller than the number of negative examples N, so P is the minority class and N is the majority class. The loss of misclassifying the minority class P as the majority class N is higher than that of misclassifying the majority class N as the minority class P, i.e. Cost(P, N) > Cost(N, P), where Cost(i, j) denotes the loss of misclassifying class i as class j; the unlabeled data set D is partitioned into D = {D_1, D_2, ..., D_k}, where k is the number of clusters.
5. The similarity-based semi-supervised learning data classification algorithm according to claim 3, wherein in step (2) the k-means clustering objective is:

E = \sum_{i=1}^{k} \sum_{x \in D_i} \| x - \mu_i \|^2

where

\mu_i = \frac{1}{|D_i|} \sum_{x \in D_i} x

Here k is the number of clusters, D_i (i = 1, ..., k) is a subset of the unlabeled data D partitioned according to the value of k, x is a sample point of D_i, and \mu_i is the centroid of D_i. The distance between data set P and cluster D_i is calculated by the formula:

\mathrm{Distance}(P, D_i) = \frac{1}{|P|\,|D_i|} \sum_{p \in P} \sum_{j=1}^{|D_i|} \| p - x_j^{(i)} \|

where x_j^{(i)} (j = 1, ..., |D_i|) is the jth sample point in D_i. When Distance(P, D_i) converges, the increase of the k value stops, and the corresponding D_i is the P' to be found.
6. The similarity-based semi-supervised learning data classification algorithm according to claim 3, wherein in step (4) the recall rate measures how many of the positive examples in the samples are correctly predicted, calculated as:

\mathrm{Recall} = \frac{TP}{TP + FN}

where TP is the number of true positives and FN the number of false negatives;

the F1 value takes both recall and precision into account, calculated as:

F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}

where Recall is the recall rate, Precision is the precision rate, and Precision = TP / (TP + FP), TP being true positives and FP false positives.
CN202010852138.XA (published as CN111950652A, en) · Priority date: 2020-08-21 · Filing date: 2020-08-21 · Title: Semi-supervised learning data classification algorithm based on similarity · Status: Withdrawn

Priority Applications (1)

Application Number: CN202010852138.XA (CN111950652A, en)
Priority Date: 2020-08-21 · Filing Date: 2020-08-21
Title: Semi-supervised learning data classification algorithm based on similarity


Publications (1)

Publication Number: CN111950652A
Publication Date: 2020-11-17

Family

ID=73359660

Family Applications (1)

Application Number: CN202010852138.XA (CN111950652A, en; Withdrawn)
Priority Date: 2020-08-21 · Filing Date: 2020-08-21
Title: Semi-supervised learning data classification algorithm based on similarity

Country Status (1)

Country Link
CN (1) CN111950652A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication Number: CN112348284A * · Priority Date: 2020-11-25 · Publication Date: 2021-02-09 · Assignee: 新智数字科技有限公司 · Title: Power load prediction method and device, readable medium and electronic equipment



Legal Events

Date Code Title Description
PB01: Publication
SE01: Entry into force of request for substantive examination
WW01: Invention patent application withdrawn after publication (application publication date: 2020-11-17)