CN111950652A - Semi-supervised learning data classification algorithm based on similarity - Google Patents
- Publication number
- CN111950652A (application CN202010852138.XA)
- Authority
- CN
- China
- Prior art keywords
- data
- similarity
- semi
- class
- value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
Abstract
The invention discloses a similarity-based semi-supervised learning data classification algorithm comprising three parts: k-means clustering of unlabeled samples, semi-supervised similarity calculation, and the combination of the two to expand the minority class, train the classification model, and evaluate its classification performance. Rather than crudely clustering with k = 2, the invention defines an algorithmic mechanism based on the smoothness and cluster assumptions of semi-supervised learning: the value of k is determined by measuring the similarity between the minority class and each cluster, the cluster closest to the minority class is identified, and its samples are added to the labeled data. Expanding the labeled data with highly similar unlabeled data effectively alleviates class imbalance and thereby improves the recall and F1 value on minority-class samples.
Description
Technical Field
The invention relates to a data classification algorithm, in particular to a similarity-based semi-supervised learning data classification algorithm.
Background
A data classification task typically requires learning a mapping f: X → Y from an input space X to an output space Y. Both binary and multi-class classification require a large amount of labeled data for training, which places demands on the quantity and quality of the supervised data.
Public data sets used in academic research generally contain many labeled samples with a relatively balanced class distribution, so models and methods perform well on them. In real application scenarios, however, supervision is limited, the class distribution is imbalanced, and the labeled content is highly domain-specific. Labeled samples are scarce, and for strongly domain-specific data the labeling cost is very high; accurate labeling may even be impossible. The ratio of majority-class to minority-class samples can exceed 1000:1. Traditional classification methods tend to improve overall accuracy at the expense of minority-class recall, yet in some scenarios the minority-class recall (Recall) is precisely the metric of interest.
For class-imbalanced data, one line of optimization adjusts the data distribution, mainly through sampling: resampling or oversampling the minority class, or undersampling the majority class. Another line improves the model algorithm, for example returning different losses through a cost-sensitive matrix according to the data distribution, which strengthens the model's learning of the minority class. For scarce, strongly domain-specific supervision, semi-supervised learning methods can be used, mainly disagreement-based methods, generative methods, discriminative methods, and graph-based methods.
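As an illustrative aside (not part of the patent text), the minority-class resampling mentioned above can be sketched as random oversampling; the function name `random_oversample` and the toy arrays are hypothetical:

```python
import numpy as np

def random_oversample(X, y, minority_label, seed=None):
    """Duplicate randomly chosen minority-class rows until both classes are balanced."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    n_extra = len(majority_idx) - len(minority_idx)
    extra = rng.choice(minority_idx, size=n_extra, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

X = np.array([[0.0], [0.1], [0.9], [1.0], [1.1]])
y = np.array([1, 1, 0, 0, 0])  # label 1 is the minority class
X_bal, y_bal = random_oversample(X, y, minority_label=1, seed=0)
```

After the call, both classes have the same number of samples; oversampling changes only the class frequencies, never the feature values.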
Disclosure of Invention
Purpose of the invention: the invention provides a novel data classification algorithm, a similarity-based semi-supervised learning data classification algorithm, for scenarios with limited supervised labeled data, imbalanced label categories, and strongly domain-specific labeled content. The method expands the minority-class set by computing the similarity between unlabeled and labeled data, improving the model's classification and recognition of the minority class and raising its recall (Recall) and F1 value.
The technical solution is as follows:
a semi-supervised learning data classification algorithm based on similarity comprises three parts: k-means clustering of unlabeled samples, semi-supervised similarity calculation, and combination of the k-means clustering of the unlabeled samples and the semi-supervised similarity calculation to expand the small number of classes for model classification and then evaluate the classification effect of the models.
In a further embodiment, the data is classified using a semi-supervised approach, and the unlabeled data of the high value few classes that are screened out are trained as labeled data to be added into the few classes.
In a further embodiment, the method further comprises a data classification framework for the high-value minority class, whose main steps are:
step (1): processing the original data and separating labeled from unlabeled data;
step (2): clustering the unlabeled data separated in step (1) with a similarity convergence algorithm based on k-means clustering, where the value of k is determined from the clustering results and the similarity to the minority class in the labeled data, thereby determining the set P' with the highest similarity;
step (3): expanding the labeled data P with the set P' determined in step (2) for model training;
step (4): evaluating the classification results of step (3) by recall and F1 values.
In a further embodiment, in step (1), the original data are divided into a labeled data set K and an unlabeled data set D. The labeled set K is divided into positive examples P and negative examples N, where the number of positive examples P is much smaller than that of negative examples N, so P is the minority class and N is the majority class. The loss of misclassifying the minority class P as the majority class N is higher than the reverse, i.e. Cost(P, N) > Cost(N, P), where Cost(i, j) denotes the loss of misclassifying class i as j. The unlabeled data set D is partitioned into D = {D_1, D_2, …, D_k}, where k is the number of clusters.
in a further embodiment, in step (2), the k-means clustering algorithm is as follows;
Wherein the value of k is the number of clusters, Di(i 1.. k) is a set of unmarked data D divided according to k values, and x is DiSample point of (1), μiIs DiOf the center of (c). The distance between dataset D and dataset P is calculated by the formula:
whereinIs DiThe jth sample point in (a); distance (P, D)i) Convergence, stopping the increase of the k value, DiIs P' to be found.
In a further embodiment, in step (4), the recall measures how many positive examples in the sample are predicted correctly:

Recall = TP / (TP + FN),

where TP is the number of true positives and FN the number of false negatives.
The F1 value balances recall and precision:

F1 = 2 · Precision · Recall / (Precision + Recall),

where Recall is the recall, Precision is the precision, Precision = TP / (TP + FP), TP is the number of true positives, and FP the number of false positives.
Beneficial effects: the notable advantage of the method is that instead of crudely clustering with k = 2, it defines an algorithmic mechanism based on the smoothness and cluster assumptions of semi-supervised learning: the value of k is determined by measuring the similarity between the minority class and each cluster, the data set closest to the minority class is identified, and it is added to the labeled data. Expanding the labeled data with highly similar unlabeled data effectively alleviates class imbalance and thereby improves the recall and F1 value on minority-class samples.
Drawings
FIG. 1 is a diagram of the raw-data partition.
FIG. 2 is a flow chart of the clustering-based similarity convergence algorithm of the present invention.
FIG. 3 is an example graph of clustering results around the minority class P.
FIG. 4 illustrates the k-means cluster convergence process of the present invention.
FIG. 5 is a graph of the variation of distance with the number of clusters.
FIG. 6 shows the recall and F1 values of the data classification results.
Detailed Description
A similarity-based semi-supervised learning data classification algorithm comprises three parts: k-means clustering of unlabeled samples, semi-supervised similarity calculation, and the combination of the two to expand the minority class, train the classification model, and evaluate its classification performance. The data are classified with a semi-supervised method, and the screened unlabeled samples are added to the training set as labeled data for training.
The invention also comprises a data classification framework for the high-value minority class, which mainly comprises the following steps:
step (1): processing the original data and separating labeled from unlabeled data;
step (2): clustering the unlabeled data separated in step (1) with a similarity convergence algorithm based on k-means clustering, where the value of k is determined from the clustering results and the similarity to the minority class in the labeled data, thereby determining the set P' with the highest similarity;
step (3): expanding the labeled data P with the set P' determined in step (2) for model training;
step (4): evaluating the classification results of step (3) by recall and F1 values.
1. Processing of raw data
The raw data can be divided into a labeled data set K and an unlabeled data set D (the raw-data partition diagram is shown in fig. 1). The samples in K fall into two classes, positive examples P and negative examples N, where the number of positive examples P is much smaller than that of negative examples N, and we care more about the recall of the minority class P. Thus the loss of misclassifying the minority class P as the majority class N is higher than the reverse, i.e. Cost(P, N) > Cost(N, P), where Cost(i, j) denotes the loss of misclassifying class i as j.
The unlabeled data set D can be partitioned by clustering into D = {D_1, D_2, …, D_k}, where k is the number of clusters. From prior knowledge, namely the smoothness and cluster assumptions of semi-supervised learning, the cluster D_i (i = 1, …, k) most similar to P consists, with high probability, of samples that would also be labeled P. Expanding this set P' into P raises the recall while limiting the drop in precision, thereby improving the F1 value. The difficulty lies in determining the cluster number k and judging the similarity.
2. Similarity convergence algorithm based on k-means clustering
Instead of dividing the data set into 2 classes according to the target label set Label = {P, N}, the value of k is determined by iteratively increasing k and computing the similarity between the labeled target samples and the unlabeled clusters, until the measure converges to a certain value.
(1) k-means clustering
For a given value of k, perform k-means clustering until the clustering converges.
k-means obtains the optimal partition by minimizing the squared error:

E = Σ_{i=1}^{k} Σ_{x∈D_i} ‖x − μ_i‖²,

where k is the number of clusters, D_i (i = 1, …, k) is a subset of the unlabeled data D obtained under the given k, x is a sample point of D_i, and μ_i is the centroid of D_i.
Minimizing this objective exactly is NP-hard, so the optimal partition {D_1, D_2, …, D_k} under a given k is usually approximated by iterative optimization. An example of the clustering results around the data P is shown in fig. 3.
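As an illustrative aside (not part of the patent text), the iterative optimization can be sketched with Lloyd's algorithm; the function name `kmeans` and the toy data are hypothetical:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update to reduce E."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # init from data points
    for _ in range(n_iter):
        # assign each sample to its nearest centroid (squared Euclidean distance)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # recompute centroids; keep the old one if a cluster emptied out
        new_centers = np.array([X[labels == i].mean(axis=0) if (labels == i).any()
                                else centers[i] for i in range(k)])
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return labels, centers

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels, centers = kmeans(X, k=2)
```

On well-separated data like the toy array above, the two nearby points end up sharing one label and the two distant points the other.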
(2) Calculation of similarity
For the partition D = {D_1, D_2, …, D_k} obtained for the specific value of k in (1), select the cluster D_i (i = 1, …, k) with the smallest distance to P, i.e. the highest similarity, where the classical Euclidean distance is chosen to measure the distance between the two data sets:

Distance(P, D_i) = (1 / (|P|·|D_i|)) Σ_{p∈P} Σ_{j=1}^{|D_i|} ‖p − d_j^{(i)}‖,

where d_j^{(i)} is the jth sample point of D_i.
(3) Determining the cluster number k and the unlabeled cluster set P' nearest to P
Increase the value of k and repeat steps (1) and (2) until Distance(P, D_i) in (2) begins to converge; then stop increasing k. The resulting k is the value to be determined, and the minimizing D_i in Distance(P, D_i) is the set P' to be found.
The specific algorithm is as follows:
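As an illustrative sketch (the patent's own listing is not reproduced here), the loop above might look as follows in Python. The function names `set_distance` and `find_similar_cluster`, the average-pairwise distance definition, the tolerance, and the k range are all assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def set_distance(P, Q):
    """Average pairwise Euclidean distance between two sample sets."""
    return np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1).mean()

def find_similar_cluster(P, D, k_max=10, tol=1e-3):
    """Grow k; at each k, find the unlabeled cluster nearest to the minority
    class P; stop once that minimum distance stops changing (converges)."""
    prev, best = None, D
    for k in range(2, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(D)
        dists = [set_distance(P, D[labels == i]) if (labels == i).any() else np.inf
                 for i in range(k)]
        i_min = int(np.argmin(dists))
        best = D[labels == i_min]
        if prev is not None and abs(dists[i_min] - prev) < tol:
            break  # Distance(P, D_i) has converged; best is the set P'
        prev = dists[i_min]
    return best

P = np.array([[0.0, 0.0], [0.1, 0.0]])                                    # minority class
near = np.array([[0.2, 0.0], [0.2, 0.1], [0.3, 0.0], [0.3, 0.1], [0.25, 0.05]])
far = near + 10.0                                                          # distant unlabeled blob
D = np.vstack([near, far])
P_prime = find_similar_cluster(P, D)
```

In this toy setup the returned P' is drawn from the blob adjacent to P, never from the distant one.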
3. model training by expanding labeled data P
The set P' found by the k-means-based similarity convergence algorithm in step (2) is added to the labeled data P, effectively expanding the minority class and alleviating its imbalance. The expanded data set is divided into a training set and a test set for training; the trained model may be a classical machine-learning classifier or a neural-network classifier, depending on actual requirements.
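As an illustrative aside (not part of the patent text), the expansion-and-training step can be sketched with scikit-learn's `LogisticRegression` standing in for the classical classifier; all arrays are hypothetical toy data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical sets: P (labeled minority), N (labeled majority),
# P_prime (unlabeled samples judged most similar to P).
P = np.array([[0.0], [0.2]])
N = np.array([[1.0], [1.2], [0.9], [1.1]])
P_prime = np.array([[0.1], [0.15]])

# Expand the minority class with P', then train on the expanded set.
X = np.vstack([P, P_prime, N])
y = np.array([1] * (len(P) + len(P_prime)) + [0] * len(N))
clf = LogisticRegression().fit(X, y)
```

In practice the expanded set would be split into training and test portions; the toy data here are fully used for fitting.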
4. Evaluation of results
The classification effect of the model can be evaluated through the recall rate and the F1 value.
Recall, also known as the recall rate, measures how many positive examples in the sample are predicted correctly:

Recall = TP / (TP + FN),

where TP is the number of true positives and FN the number of false negatives.
The F1 value balances recall and precision:

F1 = 2 · Precision · Recall / (Precision + Recall),

where Recall is the recall, Precision is the precision, Precision = TP / (TP + FP), TP is the number of true positives, and FP the number of false positives.
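The two metrics above can be computed directly from counts; this short sketch (function name hypothetical, not part of the patent) makes the formulas concrete:

```python
def recall_precision_f1(y_true, y_pred, positive=1):
    """Compute Recall = TP/(TP+FN), Precision = TP/(TP+FP), and their harmonic mean F1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

r, p, f1 = recall_precision_f1([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
# tp=2, fn=1, fp=1 → recall 2/3, precision 2/3, f1 2/3
```

Because F1 is the harmonic mean, it equals recall and precision whenever the two coincide, as in this example.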
Claims (6)
1. A similarity-based semi-supervised learning data classification algorithm, characterized by comprising three parts: k-means clustering of unlabeled samples, semi-supervised similarity calculation, and the combination of the two to expand the minority class, train the classification model, and evaluate its classification performance.
2. The similarity-based semi-supervised learning data classification algorithm according to claim 1, wherein the data are classified with a semi-supervised method, and the screened high-value unlabeled samples of the minority class are added to the minority class as labeled data for training.
3. The similarity-based semi-supervised learning data classification algorithm according to claim 1, further comprising a data classification framework for the high-value minority class, whose main steps are:
step (1): processing the original data and separating labeled from unlabeled data;
step (2): clustering the unlabeled data separated in step (1) with a similarity convergence algorithm based on k-means clustering, where the value of k is determined from the clustering results and the similarity to the minority class in the labeled data, thereby determining the set P' with the highest similarity;
step (3): expanding the labeled data P with the set P' determined in step (2) for model training;
step (4): evaluating the classification results of step (3) by recall and F1 values.
4. The similarity-based semi-supervised learning data classification algorithm according to claim 3, wherein in step (1) the original data are divided into a labeled data set K and an unlabeled data set D; the labeled set K is divided into positive examples P and negative examples N, where the number of positive examples P is much smaller than that of negative examples N, so P is the minority class and N is the majority class; the loss of misclassifying the minority class P as the majority class N is higher than the reverse, i.e. Cost(P, N) > Cost(N, P), where Cost(i, j) denotes the loss of misclassifying class i as j; and the unlabeled data set D is partitioned into D = {D_1, D_2, …, D_k}, where k is the number of clusters.
5. The similarity-based semi-supervised learning data classification algorithm according to claim 3, wherein in step (2) the k-means clustering algorithm minimizes the squared error

E = Σ_{i=1}^{k} Σ_{x∈D_i} ‖x − μ_i‖²,

where k is the number of clusters, D_i (i = 1, …, k) is a subset of the unlabeled data D obtained under the given k, x is a sample point of D_i, and μ_i is the centroid of D_i; and the distance between the data set D and the data set P is calculated as

Distance(P, D_i) = (1 / (|P|·|D_i|)) Σ_{p∈P} Σ_{j=1}^{|D_i|} ‖p − d_j^{(i)}‖,

where d_j^{(i)} is the jth sample point of D_i.
6. The similarity-based semi-supervised learning data classification algorithm according to claim 3, wherein in step (4) the recall measures how many positive examples in the sample are predicted correctly:

Recall = TP / (TP + FN),

where TP is the number of true positives and FN the number of false negatives; and the F1 value balances recall and precision:

F1 = 2 · Precision · Recall / (Precision + Recall),

where Recall is the recall, Precision is the precision, Precision = TP / (TP + FP), TP is the number of true positives, and FP the number of false positives.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010852138.XA CN111950652A (en) | 2020-08-21 | 2020-08-21 | Semi-supervised learning data classification algorithm based on similarity |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111950652A (en) | 2020-11-17 |
Family
ID=73359660
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010852138.XA Withdrawn CN111950652A (en) | 2020-08-21 | 2020-08-21 | Semi-supervised learning data classification algorithm based on similarity |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111950652A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112348284A (en) * | 2020-11-25 | 2021-02-09 | 新智数字科技有限公司 | Power load prediction method and device, readable medium and electronic equipment |
- 2020-08-21: CN application CN202010852138.XA filed; published as patent CN111950652A (en); status not active (withdrawn)
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | Application publication date: 20201117 ||