CN111950652A - Semi-supervised learning data classification algorithm based on similarity - Google Patents
- Publication number
- CN111950652A (application CN202010852138.XA)
- Authority
- CN
- China
- Prior art keywords
- data
- similarity
- semi
- class
- value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
Abstract
The invention discloses a similarity-based semi-supervised learning data classification algorithm comprising three parts: k-means clustering of unlabeled samples, semi-supervised similarity calculation, and the combination of the two to expand the minority class, train the classification model, and evaluate its classification performance. Rather than crudely clustering with k = 2, the invention defines an algorithmic mechanism based on the smoothness and cluster assumptions of semi-supervised learning: the value of k is determined by measuring the similarity between the minority class and each cluster, the cluster closest to the minority class is identified, and its samples are added to the labeled data. Expanding the labeled data with highly similar unlabeled data effectively alleviates class imbalance and thereby improves the recall and F1 value on minority-class samples.
Description
Technical Field
The invention relates to a data classification algorithm, in particular to a similarity-based semi-supervised learning data classification algorithm.
Background
A data classification task typically requires learning a mapping f: X → Y from an input space X to an output space Y. Both binary and multi-class classification require a large amount of labeled data for training, which places demands on the quantity and quality of the supervised data.
Public data sets used in academic research generally contain many labeled samples with a relatively balanced class distribution, so models and methods perform well on them. In real application scenarios, however, supervision is limited, the class distribution is imbalanced, and the labeled content is highly domain-specific. Labeled samples are scarce, and for strongly domain-specific data the labeling cost is very high; accurate labeling may even be impossible. The ratio of majority-class to minority-class samples can exceed 1000:1. Traditional classification methods tend to improve overall accuracy at the expense of minority-class recall, yet in some scenarios the minority-class recall (Recall) is precisely the metric of interest.
For class-imbalanced data, one line of optimization adjusts the data distribution, mainly through sampling: resampling or oversampling the minority class, or undersampling the majority class. Another line improves the model algorithm, for example returning different losses through a cost-sensitive matrix according to the data distribution, which strengthens the model's learning of the minority class. For scarce, strongly domain-specific supervision, semi-supervised learning methods can be used, mainly disagreement-based methods, generative methods, discriminative methods, and graph-based methods.
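As an illustrative aside (not part of the patent text), the minority-class resampling mentioned above can be sketched as random oversampling; the function name `random_oversample` and the toy arrays are hypothetical:

```python
import numpy as np

def random_oversample(X, y, minority_label, seed=None):
    """Duplicate randomly chosen minority-class rows until both classes are balanced."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    n_extra = len(majority_idx) - len(minority_idx)
    extra = rng.choice(minority_idx, size=n_extra, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

X = np.array([[0.0], [0.1], [0.9], [1.0], [1.1]])
y = np.array([1, 1, 0, 0, 0])  # label 1 is the minority class
X_bal, y_bal = random_oversample(X, y, minority_label=1, seed=0)
```

After the call, both classes have the same number of samples; oversampling changes only the class frequencies, never the feature values.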
Disclosure of Invention
Purpose of the invention: the invention provides a novel data classification algorithm, a similarity-based semi-supervised learning data classification algorithm, for scenarios with limited supervised labeled data, imbalanced label categories, and strongly domain-specific labeled content. The method expands the minority-class set by computing the similarity between unlabeled and labeled data, improving the model's classification and recognition of the minority class and raising its recall (Recall) and F1 value.
The technical solution is as follows:
a semi-supervised learning data classification algorithm based on similarity comprises three parts: k-means clustering of unlabeled samples, semi-supervised similarity calculation, and combination of the k-means clustering of the unlabeled samples and the semi-supervised similarity calculation to expand the small number of classes for model classification and then evaluate the classification effect of the models.
In a further embodiment, the data is classified using a semi-supervised approach, and the unlabeled data of the high value few classes that are screened out are trained as labeled data to be added into the few classes.
In a further embodiment, the method further comprises a data classification framework for the high-value minority class, whose main steps are:
step (1): processing the original data and separating labeled from unlabeled data;
step (2): clustering the unlabeled data separated in step (1) with a similarity convergence algorithm based on k-means clustering, where the value of k is determined from the clustering results and the similarity to the minority class in the labeled data, thereby determining the set P' with the highest similarity;
step (3): expanding the labeled data P with the set P' determined in step (2) for model training;
step (4): evaluating the classification results of step (3) by recall and F1 values.
In a further embodiment, in step (1), the original data are divided into a labeled data set K and an unlabeled data set D. The labeled set K is divided into positive examples P and negative examples N, where the number of positive examples P is much smaller than that of negative examples N, so P is the minority class and N is the majority class. The loss of misclassifying the minority class P as the majority class N is higher than the reverse, i.e. Cost(P, N) > Cost(N, P), where Cost(i, j) denotes the loss of misclassifying class i as j. The unlabeled data set D is partitioned into D = {D_1, D_2, …, D_k}, where k is the number of clusters.
in a further embodiment, in step (2), the k-means clustering algorithm is as follows;
Wherein the value of k is the number of clusters, Di(i 1.. k) is a set of unmarked data D divided according to k values, and x is DiSample point of (1), μiIs DiOf the center of (c). The distance between dataset D and dataset P is calculated by the formula:
whereinIs DiThe jth sample point in (a); distance (P, D)i) Convergence, stopping the increase of the k value, DiIs P' to be found.
In a further embodiment, in step (4), the recall measures how many positive examples in the sample are predicted correctly:

Recall = TP / (TP + FN),

where TP is the number of true positives and FN the number of false negatives.
The F1 value balances recall and precision:

F1 = 2 · Precision · Recall / (Precision + Recall),

where Recall is the recall, Precision is the precision, Precision = TP / (TP + FP), TP is the number of true positives, and FP the number of false positives.
Beneficial effects: the notable advantage of the method is that instead of crudely clustering with k = 2, it defines an algorithmic mechanism based on the smoothness and cluster assumptions of semi-supervised learning: the value of k is determined by measuring the similarity between the minority class and each cluster, the data set closest to the minority class is identified, and it is added to the labeled data. Expanding the labeled data with highly similar unlabeled data effectively alleviates class imbalance and thereby improves the recall and F1 value on minority-class samples.
Drawings
FIG. 1 is a diagram of the raw-data partition.
FIG. 2 is a flow chart of the clustering-based similarity convergence algorithm of the present invention.
FIG. 3 is an example graph of clustering results around the minority class P.
FIG. 4 illustrates the k-means cluster convergence process of the present invention.
FIG. 5 is a graph of the variation of distance with the number of clusters.
FIG. 6 shows the recall and F1 values of the data classification results.
Detailed Description
A similarity-based semi-supervised learning data classification algorithm comprises three parts: k-means clustering of unlabeled samples, semi-supervised similarity calculation, and the combination of the two to expand the minority class, train the classification model, and evaluate its classification performance. The data are classified with a semi-supervised method, and the screened unlabeled samples are added to the training set as labeled data for training.
The invention also comprises a data classification framework for the high-value minority class, which mainly comprises the following steps:
step (1): processing the original data and separating labeled from unlabeled data;
step (2): clustering the unlabeled data separated in step (1) with a similarity convergence algorithm based on k-means clustering, where the value of k is determined from the clustering results and the similarity to the minority class in the labeled data, thereby determining the set P' with the highest similarity;
step (3): expanding the labeled data P with the set P' determined in step (2) for model training;
step (4): evaluating the classification results of step (3) by recall and F1 values.
1. Processing of raw data
The raw data can be divided into a labeled data set K and an unlabeled data set D (the raw-data partition diagram is shown in fig. 1). The samples in K fall into two classes, positive examples P and negative examples N, where the number of positive examples P is much smaller than that of negative examples N, and we care more about the recall of the minority class P. Thus the loss of misclassifying the minority class P as the majority class N is higher than the reverse, i.e. Cost(P, N) > Cost(N, P), where Cost(i, j) denotes the loss of misclassifying class i as j.
The unlabeled data set D can be partitioned by clustering into D = {D_1, D_2, …, D_k}, where k is the number of clusters. From prior knowledge, namely the smoothness and cluster assumptions of semi-supervised learning, the cluster D_i (i = 1, …, k) most similar to P consists, with high probability, of samples that would also be labeled P. Expanding this set P' into P raises the recall while limiting the drop in precision, thereby improving the F1 value. The difficulty lies in determining the cluster number k and judging the similarity.
2. Similarity convergence algorithm based on k-means clustering
Instead of dividing the data set into 2 classes according to the target label set Label = {P, N}, the value of k is determined by iteratively increasing k and computing the similarity between the labeled target samples and the unlabeled clusters, until the measure converges to a certain value.
(1) k-means clustering
For a given value of k, perform k-means clustering until the clustering converges.
k-means obtains the optimal partition by minimizing the squared error:

E = Σ_{i=1}^{k} Σ_{x∈D_i} ‖x − μ_i‖²,

where k is the number of clusters, D_i (i = 1, …, k) is a subset of the unlabeled data D obtained under the given k, x is a sample point of D_i, and μ_i is the centroid of D_i.
Minimizing this objective exactly is NP-hard, so the optimal partition {D_1, D_2, …, D_k} under a given k is usually approximated by iterative optimization. An example of the clustering results around the data P is shown in fig. 3.
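As an illustrative aside (not part of the patent text), the iterative optimization can be sketched with Lloyd's algorithm; the function name `kmeans` and the toy data are hypothetical:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update to reduce E."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # init from data points
    for _ in range(n_iter):
        # assign each sample to its nearest centroid (squared Euclidean distance)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # recompute centroids; keep the old one if a cluster emptied out
        new_centers = np.array([X[labels == i].mean(axis=0) if (labels == i).any()
                                else centers[i] for i in range(k)])
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return labels, centers

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels, centers = kmeans(X, k=2)
```

On well-separated data like the toy array above, the two nearby points end up sharing one label and the two distant points the other.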
(2) Calculation of similarity
For the partition D = {D_1, D_2, …, D_k} obtained for the specific value of k in (1), select the cluster D_i (i = 1, …, k) with the smallest distance to P, i.e. the highest similarity, where the classical Euclidean distance is chosen to measure the distance between the two data sets:

Distance(P, D_i) = (1 / (|P|·|D_i|)) Σ_{p∈P} Σ_{j=1}^{|D_i|} ‖p − d_j^{(i)}‖,

where d_j^{(i)} is the jth sample point of D_i.
(3) Determining the cluster number k and the unlabeled cluster set P' nearest to P
Increase the value of k and repeat steps (1) and (2) until Distance(P, D_i) in (2) begins to converge; then stop increasing k. The resulting k is the value to be determined, and the minimizing D_i in Distance(P, D_i) is the set P' to be found.
The specific algorithm is as follows:
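As an illustrative sketch (the patent's own listing is not reproduced here), the loop above might look as follows in Python. The function names `set_distance` and `find_similar_cluster`, the average-pairwise distance definition, the tolerance, and the k range are all assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def set_distance(P, Q):
    """Average pairwise Euclidean distance between two sample sets."""
    return np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1).mean()

def find_similar_cluster(P, D, k_max=10, tol=1e-3):
    """Grow k; at each k, find the unlabeled cluster nearest to the minority
    class P; stop once that minimum distance stops changing (converges)."""
    prev, best = None, D
    for k in range(2, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(D)
        dists = [set_distance(P, D[labels == i]) if (labels == i).any() else np.inf
                 for i in range(k)]
        i_min = int(np.argmin(dists))
        best = D[labels == i_min]
        if prev is not None and abs(dists[i_min] - prev) < tol:
            break  # Distance(P, D_i) has converged; best is the set P'
        prev = dists[i_min]
    return best

P = np.array([[0.0, 0.0], [0.1, 0.0]])                                    # minority class
near = np.array([[0.2, 0.0], [0.2, 0.1], [0.3, 0.0], [0.3, 0.1], [0.25, 0.05]])
far = near + 10.0                                                          # distant unlabeled blob
D = np.vstack([near, far])
P_prime = find_similar_cluster(P, D)
```

In this toy setup the returned P' is drawn from the blob adjacent to P, never from the distant one.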
3. model training by expanding labeled data P
The set P' found by the k-means-based similarity convergence algorithm in step (2) is added to the labeled data P, effectively expanding the minority class and alleviating its imbalance. The expanded data set is divided into a training set and a test set for training; the trained model may be a classical machine-learning classifier or a neural-network classifier, depending on actual requirements.
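As an illustrative aside (not part of the patent text), the expansion-and-training step can be sketched with scikit-learn's `LogisticRegression` standing in for the classical classifier; all arrays are hypothetical toy data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical sets: P (labeled minority), N (labeled majority),
# P_prime (unlabeled samples judged most similar to P).
P = np.array([[0.0], [0.2]])
N = np.array([[1.0], [1.2], [0.9], [1.1]])
P_prime = np.array([[0.1], [0.15]])

# Expand the minority class with P', then train on the expanded set.
X = np.vstack([P, P_prime, N])
y = np.array([1] * (len(P) + len(P_prime)) + [0] * len(N))
clf = LogisticRegression().fit(X, y)
```

In practice the expanded set would be split into training and test portions; the toy data here are fully used for fitting.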
4. Evaluation of results
The classification effect of the model can be evaluated through the recall rate and the F1 value.
Recall, also known as the recall rate, measures how many positive examples in the sample are predicted correctly:

Recall = TP / (TP + FN),

where TP is the number of true positives and FN the number of false negatives.
The F1 value balances recall and precision:

F1 = 2 · Precision · Recall / (Precision + Recall),

where Recall is the recall, Precision is the precision, Precision = TP / (TP + FP), TP is the number of true positives, and FP the number of false positives.
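The two metrics above can be computed directly from counts; this short sketch (function name hypothetical, not part of the patent) makes the formulas concrete:

```python
def recall_precision_f1(y_true, y_pred, positive=1):
    """Compute Recall = TP/(TP+FN), Precision = TP/(TP+FP), and their harmonic mean F1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

r, p, f1 = recall_precision_f1([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
# tp=2, fn=1, fp=1 → recall 2/3, precision 2/3, f1 2/3
```

Because F1 is the harmonic mean, it equals recall and precision whenever the two coincide, as in this example.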
Claims (6)
1. A similarity-based semi-supervised learning data classification algorithm, characterized by comprising three parts: k-means clustering of unlabeled samples, semi-supervised similarity calculation, and the combination of the two to expand the minority class, train the classification model, and evaluate its classification performance.
2. The similarity-based semi-supervised learning data classification algorithm according to claim 1, wherein the data are classified with a semi-supervised method, and the screened high-value unlabeled samples of the minority class are added to the minority class as labeled data for training.
3. The similarity-based semi-supervised learning data classification algorithm according to claim 1, further comprising a data classification framework for the high-value minority class, whose main steps are:
step (1): processing the original data and separating labeled from unlabeled data;
step (2): clustering the unlabeled data separated in step (1) with a similarity convergence algorithm based on k-means clustering, where the value of k is determined from the clustering results and the similarity to the minority class in the labeled data, thereby determining the set P' with the highest similarity;
step (3): expanding the labeled data P with the set P' determined in step (2) for model training;
step (4): evaluating the classification results of step (3) by recall and F1 values.
4. The similarity-based semi-supervised learning data classification algorithm according to claim 3, wherein in step (1) the original data are divided into a labeled data set K and an unlabeled data set D; the labeled set K is divided into positive examples P and negative examples N, where the number of positive examples P is much smaller than that of negative examples N, so P is the minority class and N is the majority class; the loss of misclassifying the minority class P as the majority class N is higher than the reverse, i.e. Cost(P, N) > Cost(N, P), where Cost(i, j) denotes the loss of misclassifying class i as j; and the unlabeled data set D is partitioned into D = {D_1, D_2, …, D_k}, where k is the number of clusters.
5. The similarity-based semi-supervised learning data classification algorithm according to claim 3, wherein in step (2) the k-means clustering algorithm minimizes the squared error

E = Σ_{i=1}^{k} Σ_{x∈D_i} ‖x − μ_i‖²,

where k is the number of clusters, D_i (i = 1, …, k) is a subset of the unlabeled data D obtained under the given k, x is a sample point of D_i, and μ_i is the centroid of D_i; and the distance between the data set D and the data set P is calculated as

Distance(P, D_i) = (1 / (|P|·|D_i|)) Σ_{p∈P} Σ_{j=1}^{|D_i|} ‖p − d_j^{(i)}‖,

where d_j^{(i)} is the jth sample point of D_i.
6. The similarity-based semi-supervised learning data classification algorithm according to claim 3, wherein in step (4) the recall measures how many positive examples in the sample are predicted correctly:

Recall = TP / (TP + FN),

where TP is the number of true positives and FN the number of false negatives; and the F1 value balances recall and precision:

F1 = 2 · Precision · Recall / (Precision + Recall),

where Recall is the recall, Precision is the precision, Precision = TP / (TP + FP), TP is the number of true positives, and FP the number of false positives.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010852138.XA CN111950652A (en) | 2020-08-21 | 2020-08-21 | Semi-supervised learning data classification algorithm based on similarity |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111950652A (en) | 2020-11-17 |
Family
ID=73359660
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010852138.XA Withdrawn CN111950652A (en) | 2020-08-21 | 2020-08-21 | Semi-supervised learning data classification algorithm based on similarity |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111950652A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112348284A (en) * | 2020-11-25 | 2021-02-09 | 新智数字科技有限公司 | Power load prediction method and device, readable medium and electronic equipment |
- 2020-08-21: CN application CN202010852138.XA filed; published as patent CN111950652A (en); status not active (withdrawn)
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | Application publication date: 20201117 ||