CN111325264A - Multi-label data classification method based on entropy - Google Patents

Multi-label data classification method based on entropy Download PDF

Info

Publication number
CN111325264A
Authority
CN
China
Prior art keywords
label
entropy
training
classifier
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010096523.6A
Other languages
Chinese (zh)
Inventor
杜博
陈玉坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202010096523.6A priority Critical patent/CN111325264A/en
Publication of CN111325264A publication Critical patent/CN111325264A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/10Machine learning using kernel methods, e.g. support vector machines [SVM]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an entropy-based multi-label data classification method comprising a training phase and a testing phase. The training phase consists of selecting data samples, constructing a training set, constructing a label set, performing parameter analysis, and constructing a classifier. Suitable data samples are selected and divided into a training set and a test set at a ratio of 4:1; the entropy value of each label in the training samples is calculated; a suitable label set is selected by sorting the label entropy values; parameter analysis yields the optimal number of label subsets and the voting threshold; and a Label Powerset classifier is trained. In the testing phase, the samples in the test set are used as input, the trained classifier makes predictions, and the predictions are evaluated to obtain the multi-label data classification result.

Description

Multi-label data classification method based on entropy
Technical Field
The invention belongs to the field of machine learning multi-label classification, and particularly relates to a multi-label data classification method based on entropy.
Background
In the field of machine learning, traditional supervised learning is the most studied and most widely applied learning framework. Under this framework, for each object in the real world, the learning system uses a learning algorithm to learn a mapping between the input space and the output space, on the basis of which the class labels of unseen examples can be predicted. The traditional supervised learning framework has had great success when the object to be learned has a definite, single semantics, i.e., when the class label of the object is unique.
However, real-world objects tend not to have a unique semantics and may be ambiguous. As science and technology advance, the forms in which data are expressed grow ever richer, and the assumption of a single class label per sample can hardly describe the semantic information of a real object accurately. Owing to the complexity and ambiguity of objective objects themselves, many objects in real life may be associated with multiple class labels simultaneously. To reflect the multiple semantics of an ambiguous object intuitively, a natural approach is to explicitly assign the object a set of appropriate class labels, i.e., a label subset. Based on these considerations, the multi-label learning framework arose. Under this framework, each object is described by an example carrying multiple, no longer unique, class labels, and the goal of learning is to assign all appropriate class labels to unknown examples.
Many methods have been proposed by scholars at home and abroad for the multi-label classification problem. Existing multi-label learning methods fall into two broad categories: "problem transformation" methods and "algorithm adaptation" methods. A problem transformation method converts the multi-label classification problem into a series of single-label classification problems, so that existing single-label learning algorithms can be applied conveniently. An algorithm adaptation method improves and extends a current single-label learning algorithm so that it can be applied to the multi-label classification task.
Problem transformation methods typically convert the multi-label classification problem into other known learning problems, such as single-label classification or label ranking. Since single-label classification is a special case of multi-label classification and many efficient, accurate algorithms exist for it, problem transformation methods naturally turn multi-label classification into single-label classification problems of different types, whereas algorithm adaptation methods adapt other known learning algorithms to process multi-label classification directly.
In addition to the binary classification problem, the multi-class problem is also an object that many researchers consider transforming to when designing multi-label classification algorithms. The LP (Label Powerset) method first maps all the distinct label subsets that appear for samples in the training set to a series of distinct class values. Each unique label subset corresponds to one class; an unknown sample is classified by training a multi-class classifier, and the label subset corresponding to the class output by that classifier is taken as the final prediction for the sample. However, for a data set containing q labels, the number of label subsets can reach at most 2^q − 1, so the number of samples corresponding to many label subsets in an actual data set is very small. This easily causes a class-imbalance problem that degrades the final generalization performance, and the method cannot predict label subsets that do not appear in the training set. To overcome these deficiencies, the RAKEL (Random K-labelsets) algorithm was subsequently proposed. Its main idea is as follows: a series of multi-class classifiers is built under an ensemble learning framework, where each classifier randomly selects a subset from all label subsets of the label set and is constructed with the LP method, and the relevant label subset of an unknown sample is finally predicted by voting. The RAKEL method thereby overcomes the disadvantages of the LP method, but it brings other drawbacks: the randomly selected label sets may cause unbalanced data distributions in the single-label multi-class learning problems, and the dependency relationships between different labels in the same label set may cause serious information redundancy and overlap. Both deficiencies affect the generalization ability of multi-label learning.
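The Label Powerset transformation described above can be sketched as follows. This is a minimal illustration in Python (the patent's implementation uses MATLAB/libsvm); the toy label matrix is made up for demonstration:

```python
# Sketch of the Label Powerset (LP) transformation: each distinct label
# subset seen in the training set becomes one class of a multi-class problem.
def lp_transform(Y):
    """Map each row's label subset (a 0/1 vector) to a single class id."""
    classes = {}   # label subset (as tuple) -> class id
    y_single = []
    for row in Y:
        key = tuple(row)
        if key not in classes:
            classes[key] = len(classes)
        y_single.append(classes[key])
    return y_single, classes

# Toy multi-label matrix: 4 samples, 3 labels.
Y = [[1, 0, 1], [0, 1, 0], [1, 0, 1], [0, 0, 1]]
y, classes = lp_transform(Y)
print(y)             # [0, 1, 0, 2]
print(len(classes))  # 3 distinct label subsets -> 3 classes
```

A multi-class classifier is then trained on `y`; at prediction time, the class id is mapped back through `classes` to recover the label subset, which is why unseen subsets cannot be predicted.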
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a high-accuracy multi-label data classification method based on entropy.
In order to solve the technical problems, the invention adopts the following technical scheme that the multi-label data classification method based on entropy comprises the following steps:
(1) selecting a multi-label data sample, and constructing a training set and a testing set based on sparse representation and five-fold cross validation;
(2) calculating entropy values of labels in the training set, obtaining entropy value sequencing, and selecting k labels with the minimum entropy values to construct a label set;
(3) performing parameter analysis to obtain the optimal label set parameters and voting threshold, and constructing a Label Powerset classifier according to the parameter analysis results;
(4) inputting the constructed training set into the Label Powerset classifier constructed in the step (3) for classification training;
(5) inputting the constructed test set sample into a trained classifier for classification test to obtain a corresponding predicted value;
(6) and taking the predicted value obtained by the classifier as an output value of the test set, and evaluating and analyzing the output value.
Further, the specific implementation manner of the step (2) is as follows,
using the training set obtained in step (1), computation is performed with the libsvm functions; the command parameters are set to ['-c', '100', '-g', num2str(gamma), '-b 1'], where the penalty coefficient c is set to 100, the parameter g (the gamma coefficient) is set to the reciprocal of the number of classes in the training set, and the parameter b is set to 1, indicating that the probability of a sample belonging to each class is estimated during training; based on these parameters, samples are trained with the svmtrain function, the probability p that each label is true is predicted with the svmpredict function, the entropy value of each label is obtained from the entropy formula H = −p·log(p), all entropy values are sorted in ascending order with a sorting function, and the k labels with the smallest entropy values are selected to construct the label set;
after the data set is processed with the five-fold cross-validation of step (1), different training sets are generated, and performing the above operation on these different training sets yields different label sets.
Further, in step (3), based on parameter analysis of the RAKEL method, the approximate ranges of the optimal values of two parameters, namely the number k of labels contained in each label set and the voting threshold t, are determined, and the most suitable parameters are then obtained by controlling one variable at a time.
Further, the value of k is 4 and the value of t is 0.75.
Further, in step (4), according to the ascending ordering of label joint entropy, the first q label sets with the smallest entropy values are selected as label subsets for ensemble learning, where q is the number of label types of the data samples; for the constructed label subsets, a classifier based on the Label Powerset method converts the multi-label classification problem into a multi-class single-label problem; the classifier is then invoked for each of the q label subsets, all class labels are counted to obtain the actual number of votes for each class label in the classification results, the actual votes are divided by the maximum possible votes to obtain their ratio, this ratio is compared with the threshold t from the parameter analysis, and a label is regarded as a relevant label of the test sample when the ratio is greater than t.
The invention has the beneficial effects that:
(1) The invention provides a label subset selection strategy based on entropy analysis, which selects the k labels ranked lowest in entropy to obtain the label subset with the smallest joint entropy.
(2) The invention provides a multi-label classifier which, through parameter analysis, processes the training set after entropy analysis, distinguishes the most informative label subsets, and efficiently predicts the unknown labels of a sample through a voting process.
Drawings
FIG. 1 is a schematic flow chart of an embodiment of the present invention.
Detailed Description
To help those skilled in the art understand and implement the technical solution of the present invention, the invention is described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the embodiments described herein only illustrate and explain the invention and are not intended to limit it.
The invention discloses an entropy-based multi-label data classification method comprising a training phase and a testing phase. The training phase consists of selecting data samples, constructing a training set, constructing label subsets, performing parameter analysis, and constructing a classifier. Suitable data samples are selected and divided into a training set and a test set at a ratio of 4:1; the entropy value of each label in the training samples is calculated; suitable label subsets are selected by sorting the label entropy values; parameter analysis yields the optimal number of label subsets and the voting threshold; and training is performed with a Label Powerset classifier. In the testing phase, the samples in the test set are used as input, the trained classifier makes predictions, and the predictions are evaluated to obtain the multi-label data classification result.
The method is implemented on the Matlab platform on top of an SVM library: the SVM serves as the base classifier, and the libsvm functions are used for entropy calculation. The input multi-label data samples are read into a matrix X of size M × N, where M is the number of samples and N is the number of features of the multi-label samples. SVM technology is well known in the field of machine learning and is not described again here.
In an embodiment, the following operations are performed on a multi-label sample:
(1): processing the data sample by using sparse representation and five-fold cross validation to obtain a training set and a test set;
the specific operation of the step (1) is as follows: and carrying out sparse representation on the matrix X by using a sparse function to obtain a processed data set. And then processing the data set by adopting a five-fold cross validation mode, dividing all the data sets into 5 parts, repeatedly taking one part as a test set, training a classifier by taking the other four parts as a training set, calculating output results of the classifier on the test set, and averaging all the output results to obtain a final output result. The function for performing sparsification processing on the matrix is a library function in the MATLAB, which is not described herein again.
(2): calculating entropy values of labels in the training set, obtaining entropy value sequencing, and selecting k labels with the minimum entropy values to construct a label set;
In the embodiment, the specific operation of step (2) is as follows: using the training set obtained in step (1), computation is performed with the libsvm functions. The command parameters are set to ['-c', '100', '-g', num2str(gamma), '-b 1'], where the penalty coefficient c is set to 100, the parameter g (the gamma coefficient) is set to the reciprocal of the number of classes in the training set, and the parameter b is set to 1, indicating that the probability of a sample belonging to each class is estimated during training. Based on these parameters, samples are trained with the svmtrain function, the probability p that each label is true is predicted with the svmpredict function, the entropy value of each label is obtained from the entropy formula H = −p·log(p), all entropy values are sorted in ascending order, and the k labels with the smallest entropy values are selected to construct a label subset, where k is the number of labels contained in the label set.
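The entropy ranking in step (2) can be sketched as follows. The patent computes the probabilities with libsvm's svmpredict ('-b 1' probability estimates); the probabilities below are made-up placeholders for that output:

```python
import math

# Sketch of the entropy-based label ranking: for each label, the SVM's
# estimated probability p that the label is positive gives H = -p*log(p);
# the k labels with the smallest entropy are kept.
def label_entropies(probs):
    return [-p * math.log(p) if p > 0 else 0.0 for p in probs]

def smallest_entropy_labels(probs, k):
    H = label_entropies(probs)
    order = sorted(range(len(H)), key=lambda j: H[j])  # ascending entropy
    return order[:k]

probs = [0.9, 0.5, 0.99, 0.3, 0.7]        # per-label P(label = true)
print(smallest_entropy_labels(probs, 2))  # [2, 0]
```

Labels whose probability is close to 1 have entropy near 0 and rank first; a label with p = 0.5 is maximally uncertain under this measure and ranks last.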
(3): performing parameter analysis to obtain optimal tag set parameters and voting threshold values, and constructing a Label Power set classifier according to parameter analysis results;
In the embodiment, the specific operation of step (3) is as follows: based on the five-fold cross-validation of step (1), different training sets are generated, and performing the operation of step (2) on the different training set combinations yields different label sets. Based on the parameter analysis of the RAKEL method, the approximate ranges of the optimal values of two parameters, namely the number k of labels contained in each label set and the voting threshold t, are determined; the most suitable parameters are then found by controlling one variable at a time, so that the method achieves the best overall performance under the evaluation indexes. The parameter analysis experiments are conducted on the CAL500 and Birds data sets; the optimal range of k is 3-6 and the optimal range of t is 0.4-0.8. First, the voting threshold is fixed at 0.5 and the optimal k is determined by comparing the output results for different k values; experiments show that k = 4 is most suitable. Then, with k fixed at 4, the output results for different thresholds t are compared, and tests confirm that performance is best when t = 0.75. The classifier is adjusted according to these two parameter values.
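The control-variable search described above can be sketched as a two-pass sweep. The `evaluate` function is a hypothetical stand-in for training the classifier and scoring it on a validation fold, and the toy score surface is illustrative only:

```python
# Sketch of the control-variable parameter search: fix t, sweep k over
# its candidate range and keep the best k; then fix that k and sweep t.
def tune(evaluate, k_range=(3, 4, 5, 6),
         t_range=(0.4, 0.5, 0.6, 0.7, 0.75, 0.8)):
    best_k = max(k_range, key=lambda k: evaluate(k, 0.5))    # t fixed at 0.5
    best_t = max(t_range, key=lambda t: evaluate(best_k, t)) # k fixed
    return best_k, best_t

# Made-up score surface peaking at k=4, t=0.75 (higher is better).
score = lambda k, t: -((k - 4) ** 2) - (t - 0.75) ** 2
print(tune(score))  # (4, 0.75)
```

This one-variable-at-a-time scheme evaluates |k_range| + |t_range| settings instead of the full grid, at the cost of possibly missing interactions between k and t.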
(4): inputting the constructed training set into the Label Powerset classifier constructed in the step (3) for classification training;
In the embodiment, the specific operation of step (4) is as follows: based on the ascending label ordering from step (2), the label joint entropy is calculated according to the joint entropy formula H(x, y) = −Σ_{x,y} P(x, y)·log2[P(x, y)], where P denotes the probability that the labels are true. According to the ascending ordering of label joint entropy, the first q label sets with the smallest entropy values are selected as label subsets for ensemble learning, where q is the number of label types of the data samples. For the constructed label subsets, a classifier based on the Label Powerset method converts the multi-label classification problem into multi-class (single-label) problems; the classifier is then invoked for each of the q label subsets, all class labels are counted to obtain the actual number of votes for each class label in the classification results, the actual votes are divided by the maximum possible votes to obtain their ratio, this ratio is compared with the threshold t from the parameter analysis, and a label is regarded as a relevant label of the test sample when the ratio is greater than t. The constructed training set is input into this classifier for classification training, and the classification results of the multiple label subsets are integrated to obtain the output value.
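The final voting step can be sketched as follows; the vote counts below are illustrative, not from the patent's experiments:

```python
# Sketch of the voting step: each LP classifier votes for the labels in
# its predicted subset; a label is kept when its actual-vote / max-vote
# ratio exceeds the threshold t.
def vote_labels(votes, max_votes, t=0.75):
    """votes[j]: actual votes for label j; max_votes[j]: possible votes
    (how many label subsets contain label j)."""
    return [j for j, (v, m) in enumerate(zip(votes, max_votes))
            if m > 0 and v / m > t]

votes     = [3, 1, 4, 0]  # votes each label received across classifiers
max_votes = [4, 4, 4, 2]  # times each label appeared in some label subset
print(vote_labels(votes, max_votes, t=0.75))  # [2]
```

Note the strict inequality: label 0 with ratio 3/4 = 0.75 is excluded at t = 0.75, while label 2 with ratio 4/4 = 1.0 is kept, matching the "greater than the threshold" condition in the text.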
(5): inputting the constructed test set sample into a trained classifier for classification test to obtain a corresponding predicted value;
in the embodiment, the specific operation of step (5) is as follows: and taking the constructed test set sample as the input of the trained classifier to obtain a corresponding predicted value.
(6): and taking the predicted value obtained by the classifier as an output value of the test set, and evaluating and analyzing the output value.
The specific operation of step (6) is as follows: after the test set is input into the classifier, the predicted label set is output and evaluated with multiple evaluation indexes. The specific indexes are the Example-based F-measure, Hamming loss, One Error, Macro F-measure, and Micro F-measure, all of which are common evaluation indexes in multi-label learning.
In specific implementation, the automatic operation of the process can be realized by adopting a software mode. The apparatus for operating the process should also be within the scope of the present invention.
The advantageous effects of the present invention are verified by comparative experiments as follows.
The test uses eight data sets covering five fields including creatures, text, images, music and video, as shown in table 1 below:
TABLE 1 dataset attributes
ID | Data set | Field   | Samples | Features | Labels | Avg. labels per sample
1  | Birds    | Audio   | 645     | 260      | 19     | 1.014
2  | CAL500   | Music   | 502     | 68       | 174    | 26.044
3  | Emotions | Music   | 593     | 72       | 6      | 1.869
4  | Flags    | Images  | 194     | 19       | 7      | 3.392
5  | Genbase  | Biology | 662     | 1186     | 27     | 1.252
6  | Medical  | Text    | 978     | 1449     | 45     | 1.245
7  | Scene    | Images  | 2407    | 294      | 6      | 1.074
8  | Yeast    | Biology | 2417    | 103      | 14     | 4.237
Multi-label classification evaluation indexes: the evaluation indexes can be calculated as in Zhang M. L., Zhou Z. H. A review on multi-label learning algorithms [J]. IEEE Transactions on Knowledge and Data Engineering, 2013, 26(8): 1819-1837.
The Example-based F measure is a combined version of per-sample precision and recall:

F_example = (1/m) · Σ_{i=1}^{m} 2·p_i·r_i / (p_i + r_i)

where p_i and r_i are the precision and recall of the i-th sample, and m is the number of samples;
Hamming Loss represents the proportion of all misclassified labels (i.e., a correct label is not predicted or a wrong label is predicted):

HammingLoss = (1/m) · Σ_{i=1}^{m} (1/q) · |h(x_i) Δ Y_i|

where Δ denotes the symmetric difference between the true label set Y_i and the predicted label set h(x_i), and q is the number of labels;
One Error represents the proportion of samples whose top-ranked predicted label is not in the true label set:

OneError = (1/m) · Σ_{i=1}^{m} [[ argmax_y f(x_i, y) ∉ Y_i ]]

where f(x_i, ·) is the ranking function over the labels;
The Macro F-measure is a combined version of per-label precision and recall, and the Micro F-measure extends the F1 index of single-label classification to multi-label classification:

Macro-F = (1/q) · Σ_{j=1}^{q} 2·p_j·r_j / (p_j + r_j)

Micro-F = 2·Σ_j TP_j / (2·Σ_j TP_j + Σ_j FP_j + Σ_j FN_j)

where p_j and r_j are the precision and recall of the j-th label, and TP_j, FP_j, FN_j are counted over all samples.
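Two of these metrics can be computed as follows. This is a hedged sketch using the standard definitions on a toy prediction, not code from the patent:

```python
# Hamming loss: fraction of label cells where prediction and truth differ.
def hamming_loss(Y_true, Y_pred):
    m, q = len(Y_true), len(Y_true[0])
    wrong = sum(yt != yp for rt, rp in zip(Y_true, Y_pred)
                for yt, yp in zip(rt, rp))
    return wrong / (m * q)

# Example-based F: per-sample F1 (2*|intersection| / (|true| + |pred|)),
# averaged over samples.
def example_f(Y_true, Y_pred):
    total = 0.0
    for rt, rp in zip(Y_true, Y_pred):
        inter = sum(a and b for a, b in zip(rt, rp))
        denom = sum(rt) + sum(rp)
        total += 2 * inter / denom if denom else 1.0
    return total / len(Y_true)

Y_true = [[1, 0, 1], [0, 1, 0]]
Y_pred = [[1, 0, 0], [0, 1, 1]]
print(hamming_loss(Y_true, Y_pred))  # 2 wrong cells out of 6 -> 0.333...
print(example_f(Y_true, Y_pred))     # (2/3 + 2/3) / 2 -> 0.666...
```

Lower is better for Hamming loss, higher is better for the F measures; the ranks in Table 2 below account for this direction per index.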
table 2 comparative experimental results
Average rank     | Proposed method | Method 1 | Method 2 | Method 3 | Method 4 | Method 5
Hamming loss     | 1.875           | 5.25     | 2.375    | 4        | 1.75     | 5.25
One error        | 2.375           | 5.125    | 2.5      | 3.875    | 2.625    | 4.5
Example-based F  | 1.75            | 4.75     | 4.25     | 2.5      | 3.125    | 4.25
Micro-F          | 2.25            | 4.125    | 3.625    | 3.75     | 2.25     | 4.875
Macro-F          | 1.625           | 4.875    | 4.75     | 3.25     | 2.25     | 4.125
As can be seen from Table 2, the method of the invention ranks near the top on all five evaluation indexes tested, which shows that it has better classification performance. Compared with basic classical algorithms such as methods 1 and 5, the Hamming loss rank of the proposed method leads by a large margin, so its overall classification effect exceeds those classical algorithms; compared with prior representative methods such as methods 1, 2, 4, and 5, the Example-based F rank of the proposed method is also better, indicating that the invention performs better in example-based label classification.
Therefore, compared with existing multi-label classification methods, the proposed method has better classification performance. The invention addresses the problems that randomly selected label sets cause unbalanced data distributions in single-label multi-class learning and that dependency relationships between different labels in the same label set cause serious information redundancy and overlap. By selecting label subsets through entropy analysis and sorting, the selected label subsets carry more information than randomly selected ones, and label correlation is better exploited for multi-label classification, thereby greatly improving classification accuracy.
It should be understood that parts of the specification not set forth in detail are well within the prior art.
It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (5)

1. An entropy-based multi-label data classification method is characterized by comprising the following steps:
(1) selecting a multi-label data sample, and constructing a training set and a testing set based on sparse representation and five-fold cross validation;
(2) calculating entropy values of labels in the training set, obtaining entropy value sequencing, and selecting k labels with the minimum entropy values to construct a label set;
(3) performing parameter analysis to obtain the optimal label set parameters and voting threshold, and constructing a Label Powerset classifier according to the parameter analysis results;
(4) inputting the constructed training set into the Label Powerset classifier constructed in the step (3) for classification training;
(5) inputting the constructed test set sample into a trained classifier for classification test to obtain a corresponding predicted value;
(6) and taking the predicted value obtained by the classifier as an output value of the test set, and evaluating and analyzing the output value.
2. An entropy-based multi-label data classification method as claimed in claim 1, characterized in that: the specific implementation manner of the step (2) is as follows,
using the training set obtained in step (1), computation is performed with the libsvm functions; the command parameters are set to ['-c', '100', '-g', num2str(gamma), '-b 1'], where the penalty coefficient c is set to 100, the parameter g (the gamma coefficient) is set to the reciprocal of the number of classes in the training set, and the parameter b is set to 1, indicating that the probability of a sample belonging to each class is estimated during training; based on these parameters, samples are trained with the svmtrain function, the probability p that each label is true is predicted with the svmpredict function, the entropy value of each label is obtained from the entropy formula H = −p·log(p), all entropy values are sorted in ascending order with a sorting function, and the k labels with the smallest entropy values are selected to construct the label set;
after the data set is processed with the five-fold cross-validation of step (1), different training sets are generated, and performing the above operation on these different training sets yields different label sets.
3. The entropy-based multi-label data classification method according to claim 1, characterized in that: in step (3), based on parameter analysis of the RAKEL method, the approximate ranges of the optimal values of two parameters, namely the number k of labels contained in each label set and the voting threshold t, are determined, and the most suitable parameters are then obtained by controlling one variable at a time.
4. The entropy-based multi-label data classification method according to claim 3, characterized in that: the value of k is 4 and the value of t is 0.75.
5. The entropy-based multi-label data classification method according to claim 3, characterized in that: in step (4), according to the ascending ordering of label joint entropy, the first q label sets with the smallest entropy values are selected as label subsets for ensemble learning, where q is the number of label types of the data samples; for the constructed label subsets, a classifier based on the Label Powerset method converts the multi-label classification problem into a multi-class single-label problem; the classifier is then invoked for each of the q label subsets, all class labels are counted to obtain the actual number of votes for each class label in the classification results, the actual votes are divided by the maximum possible votes to obtain their ratio, this ratio is compared with the threshold t from the parameter analysis, and a label is regarded as a relevant label of the test sample when the ratio is greater than t.
CN202010096523.6A 2020-02-17 2020-02-17 Multi-label data classification method based on entropy Pending CN111325264A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010096523.6A CN111325264A (en) 2020-02-17 2020-02-17 Multi-label data classification method based on entropy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010096523.6A CN111325264A (en) 2020-02-17 2020-02-17 Multi-label data classification method based on entropy

Publications (1)

Publication Number Publication Date
CN111325264A true CN111325264A (en) 2020-06-23

Family

ID=71167070

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010096523.6A Pending CN111325264A (en) 2020-02-17 2020-02-17 Multi-label data classification method based on entropy

Country Status (1)

Country Link
CN (1) CN111325264A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111753790A (en) * 2020-07-01 2020-10-09 武汉楚精灵医疗科技有限公司 Video classification method based on random forest algorithm
CN112201300A (en) * 2020-10-23 2021-01-08 天津大学 Protein subcellular localization method based on depth image features and threshold learning strategy
CN112529100A (en) * 2020-12-24 2021-03-19 深圳前海微众银行股份有限公司 Training method and device for multi-classification model, electronic equipment and storage medium
CN112906779A (en) * 2021-02-07 2021-06-04 中山大学 Data classification method based on sample boundary value and integrated diversity
CN113255772A (en) * 2021-05-27 2021-08-13 北京玻色量子科技有限公司 Data analysis method and device
CN115543855A (en) * 2022-12-01 2022-12-30 江苏邑文微电子科技有限公司 Semiconductor device parameter testing method, device, electronic device and storage medium
CN112529100B (en) * 2020-12-24 2024-05-28 深圳前海微众银行股份有限公司 Training method and device for multi-classification model, electronic equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106971201A (en) * 2017-03-23 2017-07-21 重庆邮电大学 Multi-tag sorting technique based on integrated study

Similar Documents

Publication Publication Date Title
CN111325264A (en) Multi-label data classification method based on entropy
Raghu et al. Evaluation of causal structure learning methods on mixed data types
CN112069310B (en) Text classification method and system based on active learning strategy
CN107292350A Anomaly detection method for large-scale data
CN110110858B (en) Automatic machine learning method based on reinforcement learning
CN113190699A (en) Remote sensing image retrieval method and device based on category-level semantic hash
Wang et al. Novel and efficient randomized algorithms for feature selection
CN101561805A (en) Document classifier generation method and system
CN106033426A (en) A latent semantic min-Hash-based image retrieval method
JP2022530447A (en) Chinese word division method based on deep learning, equipment, storage media and computer equipment
CN105320764A (en) 3D model retrieval method and 3D model retrieval apparatus based on slow increment features
CN113377981A (en) Large-scale logistics commodity image retrieval method based on multitask deep hash learning
CN114897085A (en) Clustering method based on closed subgraph link prediction and computer equipment
CN113516019B (en) Hyperspectral image unmixing method and device and electronic equipment
CN112489689B (en) Cross-database voice emotion recognition method and device based on multi-scale difference countermeasure
CN114153839A (en) Integration method, device, equipment and storage medium of multi-source heterogeneous data
CN111553442B (en) Optimization method and system for classifier chain tag sequence
CN111027636B (en) Unsupervised feature selection method and system based on multi-label learning
CN111832645A (en) Classification data feature selection method based on discrete crow difference collaborative search algorithm
Li et al. MT-MAG: Accurate and interpretable machine learning for complete or partial taxonomic assignments of metagenome-assembled genomes
Salman et al. Gene expression analysis via spatial clustering and evaluation indexing
CN116595125A (en) Open domain question-answering method based on knowledge graph retrieval
CN114565063A (en) Software defect prediction method based on multi-semantic extractor
Akbacak et al. MLMQ-IR: Multi-label multi-query image retrieval based on the variance of Hamming distance
Mumuni et al. Automated data processing and feature engineering for deep learning and big data applications: a survey

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200623