CN114298191A - Classification method and system based on label subset - Google Patents


Info

Publication number
CN114298191A
Authority
CN
China
Prior art keywords
label
sample
subset
calculating
labels
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111566217.5A
Other languages
Chinese (zh)
Inventor
彭黎文 (Peng Liwen)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Police College
Original Assignee
Sichuan Police College
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Police College filed Critical Sichuan Police College
Priority to CN202111566217.5A
Publication of CN114298191A
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a classification method and system based on label subsets, belonging to the field of computer technology. The method comprises the following steps: converting a multi-label data sample set into a single-label data set; calculating a label subset for every sample in the single-label data set and constructing a new sample data set based on the label subsets, where calculating a label subset includes calculating the importance of all features in the sample, calculating the correlation and redundancy between features, and selecting the top-ranked labels by combining the correlation and redundancy to construct the label subset; appending the label subset of each sample to the original sample to obtain a new single-label data set; and constructing a single-label classification model based on the new single-label data set, then collecting the labels of each sample across the single-label classification models to obtain the final multi-label classification result. By analysing the correlation and redundancy among features (with labels also treated as features), the method selects a high-quality label subset and effectively improves the performance of the classification model.

Description

Classification method and system based on label subset
Technical Field
The invention relates to the technical field of computers, in particular to a classification method and a classification system based on a label subset.
Background
Multi-label classification deals with the situation in which one piece of data belongs to several categories at the same time. Such situations are common in practice; for example, an article may simultaneously belong to the news, economy and culture categories. To accurately classify such ambiguous objects in real scenes, many researchers have studied multi-label classification methods in depth.
Many multi-label classification algorithms exist in the field of machine learning. Existing multi-label classification schemes are generally built from several single-label classification models: each single-label model classifies the multi-label task separately, and the set of predictions of all single-label classifiers is taken as the final prediction of the multi-label task, so the accuracy of each single-label classifier directly affects the accuracy of the multi-label classification. In practical applications the number of samples available to each single-label classifier is small, so the prediction of an individual single-label classifier is poor, which degrades the final prediction of the multi-label task; moreover, conventional multi-label classification algorithms do not consider the correlation between labels.
In addition, research shows that a multi-label classification method that only considers the correlation between labels does not necessarily achieve good classification performance. That is, in feature selection, combining individually good features does not always improve the classification performance of the method, because the features may be highly correlated with one another, which introduces redundancy among the features and harms the classification performance of the method.
Disclosure of Invention
The invention aims to overcome the above problems of prior-art multi-label classification methods and provides a classification method and a classification system based on label subsets.
The purpose of the invention is realized by the following technical scheme:
A classification method based on label subsets is provided, the method comprising:
acquiring a multi-label data sample set, and converting the multi-label data sample set into a single-label data set;
calculating a label subset for every sample in the single-label data set, and constructing a new sample data set based on the label subsets; calculating a label subset comprises: calculating the importance of all features in the sample; calculating the correlation between features; calculating the redundancy between features; and selecting the top-ranked labels by combining the correlation and redundancy between features to construct the label subset;
appending the label subset of each sample to the corresponding sample of the original single-label data set to obtain a new single-label data set;
and constructing a single-label classification model based on the new single-label data set, then collecting the labels of each sample across the single-label classification models to obtain the final multi-label classification result.
As an option, the converting of the multi-label data sample set into the single-label data set comprises:
assigning each single label in the multi-label data sample set to every sample, thereby decomposing the set into as many data subsets as there are labels.
As an option, the method further comprises:
pre-processing the multi-labeled data sample set, the pre-processing comprising:
and deleting the samples with missing data characteristic values, keeping the samples with complete data characteristics, and then randomly dividing the multi-label data sample set into a training set and a testing set.
As an option, the importance of each feature is calculated by the F-score formula:
$$F_i=\frac{\left(\bar{x}_i^{(+)}-\bar{x}_i\right)^2+\left(\bar{x}_i^{(-)}-\bar{x}_i\right)^2}{\frac{1}{n^{+}-1}\sum_{k=1}^{n^{+}}\left(x_{k,i}^{(+)}-\bar{x}_i^{(+)}\right)^2+\frac{1}{n^{-}-1}\sum_{k=1}^{n^{-}}\left(x_{k,i}^{(-)}-\bar{x}_i^{(-)}\right)^2}$$
where the larger F_i is, the stronger the class-discriminating ability of feature x_i.
As an option, the correlation between features is calculated using mutual information, and the calculation formula is as follows:
$$I(X;Y)=\sum_{i}\sum_{j}p(x_i,y_j)\log\frac{p(x_i,y_j)}{p(x_i)\,p(y_j)}$$
where X denotes a feature variable, Y denotes a label variable, p(x_i) and p(y_j) are the marginal probabilities of X and Y, and p(x_i, y_j) is their joint probability distribution.
As an option, the selecting of the top-ranked labels by combining the correlation and redundancy between features comprises:
calculating the mean mutual information between all features and the target variable:
$$D(S,c)=\frac{1}{|S|}\sum_{x_i\in S}I(x_i;c)$$
where S denotes the set of selected features and c denotes the target variable, i.e. the class label variable;
and calculating the redundant information quantity between the characteristics according to the following calculation formula:
Figure BDA0003422058380000034
and selecting labels with low redundancy and high correlation according to the mean mutual information and the amount of redundant information, using the selection criterion
$$\max_{x_j\in W}\left[I(x_j;c)-\frac{1}{m-1}\sum_{x_i\in S}I(x_j;x_i)\right]$$
where m is the number of features; the labels ranked in the top 70% by this criterion are selected.
As an option, the ratio of the training set to the test set is 1:1.
As an option, a bayesian algorithm is used to build the single label classification model.
As an option, the multi-label data sample set contains a plurality of different labels and a plurality of different features.
The invention also provides a classification system based on the label subset, which comprises:
the sample acquisition module is used for acquiring a multi-label data sample set and converting the multi-label data sample set into a single-label data set;
the label subset calculation module is used for calculating a label subset for every sample in the single-label data set and constructing a new sample data set based on the label subsets; calculating a label subset comprises: calculating the importance of all features in the sample; calculating the correlation between features; calculating the redundancy between features; and selecting the top-ranked labels by combining the correlation and redundancy between features to construct the label subset;
the sample recombination module is used for appending the label subset of each sample to the original sample to obtain a new single-label data set;
and the modeling and classifying module is used for constructing a single-label classification model based on the new single-label data set and then collecting the labels of each sample across the single-label classification models to obtain the final multi-label classification result.
It should be further noted that the technical features corresponding to the above options can be combined with each other or replaced to form a new technical solution without conflict.
Compared with the prior art, the invention has the beneficial effects that:
the method comprises the steps of converting a multi-label data sample set into a single-label data set, effectively selecting an excellent label subset by calculating the importance of all features in the single-label data set, calculating the correlation among the features, calculating the redundancy among the features and combining the correlation and the redundancy among the features, wherein the obtained features have the redundancy as small as possible, the correlation among the features is large as possible, the influence of the redundancy among the features on the classification performance of the model is avoided, and the classification performance of the model is improved.
Drawings
Fig. 1 is a schematic flow chart of the classification method based on label subsets according to the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In addition, the technical features involved in the different embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The invention mainly converts a multi-label data sample set into a single-label data set and effectively selects a high-quality label subset by calculating, within the single-label data set, the importance of all features, the correlation between features and the redundancy between features, and then combining the correlation and the redundancy. The selected features have as little mutual redundancy and as much correlation as possible, the correlation between labels is taken into account, and the adverse effect of feature redundancy on classification performance is avoided, thereby achieving the aim of improving the classification performance of the model.
Example 1
In an exemplary embodiment, a classification method based on label subsets is provided, as shown in Fig. 1. The method comprises the following steps:
acquiring a multi-label data sample set, and converting the multi-label data sample set into a single-label data set;
calculating a label subset for every sample in the single-label data set, and constructing a new sample data set based on the label subsets; calculating a label subset comprises: calculating the importance of all features in the sample; calculating the correlation between features; calculating the redundancy between features; and selecting the top-ranked labels by combining the correlation and redundancy between features to construct the label subset;
appending the label subset of each sample to the corresponding sample of the original single-label data set to obtain a new single-label data set;
and constructing a single-label classification model based on the new single-label data set, then collecting the labels of each sample across the single-label classification models to obtain the final multi-label classification result.
Specifically, by calculating the importance of all features in the single-label data set, the correlation between features and the redundancy between features, and then combining the correlation and the redundancy, a high-quality label subset is effectively selected. The selected features have as little redundancy and as much correlation as possible, the correlation between labels is taken into account, and the adverse effect of feature redundancy on the classification performance of the model is avoided, so that the classification performance of the model is improved.
Further, after the label subset of each sample in the single-label data set has been calculated, it is put into each sample that has been converted to single-label form; the label subset is inserted directly into the data sample as features, yielding a new single-label data set that reflects the relationships between the labels.
Further, the labels of each sample are collected across the single-label classification models: for example, a data sample to be predicted is classified by every single-label model, and whenever the sample belongs to a model's class, that label is recorded for the sample.
Example 2
Based on Example 1, a classification method based on label subsets is provided, in which converting the multi-label data sample set into the single-label data set comprises:
assigning each single label in the multi-label data sample set to every sample, thereby decomposing the set into as many data subsets as there are labels.
Further, the method further comprises:
pre-processing the multi-labeled data sample set, the pre-processing comprising:
and deleting the samples with missing data characteristic values, keeping the samples with complete data characteristics, and then randomly dividing the multi-label data sample set into a training set and a testing set.
Specifically, assume that the multi-label data sample set D contains L labels in total and that each sample has q features. The given multi-label data sample set D is preprocessed; the preprocessing includes handling missing values by deleting samples with missing feature values and keeping samples whose features are complete. The set D is then randomly divided into a training set Train and a test set Test in a 1:1 ratio. Let S be the set of selected features, which is empty at the start, and let W be the candidate feature set, which initially contains all q features.
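A minimal sketch of this preprocessing step, assuming (purely for illustration, since the patent does not specify an implementation) that D is held in a pandas DataFrame with one column per feature and a Labels column; the column names and the fixed random seed are hypothetical choices:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def preprocess(D: pd.DataFrame, feature_cols: list) -> tuple:
    """Drop samples with missing feature values, then split the rest 1:1 into Train/Test."""
    # Keep only samples whose feature values are all present.
    complete = D.dropna(subset=feature_cols)
    # Random 1:1 split into a training set Train and a test set Test.
    train, test = train_test_split(complete, test_size=0.5, random_state=0)
    return train, test
```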
Further, assume that the multi-label data set D contains 8 samples and L = 5, the labels being {L1, L2, L3, L4, L5}. The original multi-label data set D is shown in the following table:
TABLE 1
id F1 F2 F3 F4 F5 F6 F7 Labels
1 G1 G2 G3 G4 G5 G6 G7 L1,L2,L3,L5
2 R1 R2 R3 R4 R5 R6 R7 L2,L4,L5
3 M1 M2 M3 M4 M5 M6 M7 L1,L3,L4,L5
4 Y1 Y2 Y3 Y4 Y5 Y6 Y7 L2,L3,L5
5 V1 V2 V3 V4 V5 V6 V7 L1,L2,L4
6 H1 H2 H3 H4 H5 H6 H7 L3,L4
7 X1 X2 X3 X4 X5 X6 X7 L2,L5
8 N1 N1 N1 N1 N1 N1 N7 L1,L2,L4
Here F1, F2, etc. denote features. Each single label in the label set is assigned to every sample, decomposing the set into data subsets equal in number to the labels.
In particular, label L1 corresponds to the following table:
TABLE 2
id F1 F2 F3 F4 F5 F6 F7 Labels
1 G1 G2 G3 G4 G5 G6 G7 L1
2 R1 R2 R3 R4 R5 R6 R7
3 M1 M2 M3 M4 M5 M6 M7
4 Y1 Y2 Y3 Y4 Y5 Y6 Y7
5 V1 V2 V3 V4 V5 V6 V7 L1
6 H1 H2 H3 H4 H5 H6 H7
7 X1 X2 X3 X4 X5 X6 X7
8 N1 N1 N1 N1 N1 N1 N7 L1
Label L2 corresponds to the following table:
TABLE 3
id F1 F2 F3 F4 F5 F6 F7 Labels
1 G1 G2 G3 G4 G5 G6 G7 L2
2 R1 R2 R3 R4 R5 R6 R7 L2
3 M1 M2 M3 M4 M5 M6 M7
4 Y1 Y2 Y3 Y4 Y5 Y6 Y7 L2
5 V1 V2 V3 V4 V5 V6 V7 L2
6 H1 H2 H3 H4 H5 H6 H7
7 X1 X2 X3 X4 X5 X6 X7 L2
8 N1 N1 N1 N1 N1 N1 N7 L2
Label L3 corresponds to the following table:
TABLE 4
id F1 F2 F3 F4 F5 F6 F7 Labels
1 G1 G2 G3 G4 G5 G6 G7 L3
2 R1 R2 R3 R4 R5 R6 R7
3 M1 M2 M3 M4 M5 M6 M7 L3
4 Y1 Y2 Y3 Y4 Y5 Y6 Y7 L3
5 V1 V2 V3 V4 V5 V6 V7
6 H1 H2 H3 H4 H5 H6 H7 L3
7 X1 X2 X3 X4 X5 X6 X7
8 N1 N1 N1 N1 N1 N1 N7
Label L4 corresponds to the following table:
TABLE 5
id F1 F2 F3 F4 F5 F6 F7 Labels
1 G1 G2 G3 G4 G5 G6 G7
2 R1 R2 R3 R4 R5 R6 R7 L4
3 M1 M2 M3 M4 M5 M6 M7 L4
4 Y1 Y2 Y3 Y4 Y5 Y6 Y7
5 V1 V2 V3 V4 V5 V6 V7 L4
6 H1 H2 H3 H4 H5 H6 H7 L4
7 X1 X2 X3 X4 X5 X6 X7
8 N1 N1 N1 N1 N1 N1 N7 L4
Label L5 corresponds to the following table:
TABLE 6
id F1 F2 F3 F4 F5 F6 F7 Labels
1 G1 G2 G3 G4 G5 G6 G7 L5
2 R1 R2 R3 R4 R5 R6 R7 L5
3 M1 M2 M3 M4 M5 M6 M7 L5
4 Y1 Y2 Y3 Y4 Y5 Y6 Y7 L5
5 V1 V2 V3 V4 V5 V6 V7
6 H1 H2 H3 H4 H5 H6 H7
7 X1 X2 X3 X4 X5 X6 X7 L5
8 N1 N1 N1 N1 N1 N1 N7
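The decomposition illustrated by Tables 2 to 6 could be sketched as follows; it assumes, for illustration only, that each sample's labels are stored as a Python set in a Labels column:

```python
import pandas as pd

def decompose(D: pd.DataFrame, all_labels: list) -> dict:
    """Build one single-label data set per label, mirroring Tables 2-6."""
    per_label = {}
    for lab in all_labels:
        Dl = D.copy()
        # A sample keeps the label if it carries it in the multi-label set,
        # otherwise its Labels cell is left empty.
        Dl["Labels"] = D["Labels"].apply(lambda s, lab=lab: lab if lab in s else "")
        per_label[lab] = Dl
    return per_label
```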
Further, a single-label classification algorithm could be applied directly to the converted data, but at that point the associations between the labels of the multi-label data would not be considered, even though the labels do have certain associations with one another. A new way of constructing a label subset for each sample is therefore proposed, in which the associations between the labels of a sample are captured by the label subset. The F-score is a measure of how well a feature distinguishes between two classes, so it enables effective feature selection.
Specifically, the importance of each feature is calculated by the F-score formula:
$$F_i=\frac{\left(\bar{x}_i^{(+)}-\bar{x}_i\right)^2+\left(\bar{x}_i^{(-)}-\bar{x}_i\right)^2}{\frac{1}{n^{+}-1}\sum_{k=1}^{n^{+}}\left(x_{k,i}^{(+)}-\bar{x}_i^{(+)}\right)^2+\frac{1}{n^{-}-1}\sum_{k=1}^{n^{-}}\left(x_{k,i}^{(-)}-\bar{x}_i^{(-)}\right)^2}$$
where $n^{+}$ is the number of positive-class samples and $n^{-}$ the number of negative-class samples; $\bar{x}_i$, $\bar{x}_i^{(+)}$ and $\bar{x}_i^{(-)}$ are the means of the i-th feature over the whole data set, over the positive-class samples and over the negative-class samples, respectively; $x_{k,i}^{(+)}$ is the value of the i-th feature of the k-th positive-class sample and $x_{k,i}^{(-)}$ the value of the i-th feature of the k-th negative-class sample.
The larger F_i is, the stronger the class-discriminating ability of feature x_i: the farther apart the two classes and the more compact each class, the better the classification effect, i.e. the stronger the discriminating power of the feature.
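A small sketch of the F-score of a single feature, written directly from the formula above; the 0/1 encoding of positive and negative classes is an assumption made for illustration:

```python
import numpy as np

def f_score(x: np.ndarray, y: np.ndarray) -> float:
    """F-score of one feature column x against a binary label vector y (1 = positive class)."""
    pos, neg = x[y == 1], x[y == 0]
    # Between-class scatter: squared distances of the class means from the overall mean.
    numerator = (pos.mean() - x.mean()) ** 2 + (neg.mean() - x.mean()) ** 2
    # Within-class scatter: unbiased variances (1/(n-1) normalisation) of the two classes.
    denominator = pos.var(ddof=1) + neg.var(ddof=1)
    return numerator / denominator
```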
However, the F-score cannot accurately measure the mutual information between features, and mutual information is an expression of the correlation between features; if it is not known, the strength of the correlation between features cannot be measured. The correlation between features is therefore calculated with mutual information:
$$I(X;Y)=\sum_{i}\sum_{j}p(x_i,y_j)\log\frac{p(x_i,y_j)}{p(x_i)\,p(y_j)}$$
where X denotes a feature variable, Y denotes a label variable, p(x_i) and p(y_j) are the marginal probabilities of X and Y, and p(x_i, y_j) is their joint probability distribution. Here the labels are also treated as features so that the relationships between labels are reflected.
Further, a model built by considering only the features that are most correlated with the class variable does not necessarily achieve good classification performance; that is, in feature selection, combining individually good features does not always improve the classification performance of the model, because the features may be highly correlated with one another, which introduces redundancy among them. Therefore, to effectively select a high-quality feature subset, the correlation and the redundancy between features must be combined when selecting the top-ranked labels, which includes:
calculating the mean mutual information between all features and the target variable:
$$D(S,c)=\frac{1}{|S|}\sum_{x_i\in S}I(x_i;c)$$
where S denotes the set of selected features and c denotes the target variable, i.e. the class label variable;
and calculating the redundant information quantity between the characteristics according to the following calculation formula:
Figure BDA0003422058380000112
and selecting labels with low redundancy and high correlation according to the mean mutual information and the amount of redundant information, using the selection criterion
$$\max_{x_j\in W}\left[I(x_j;c)-\frac{1}{m-1}\sum_{x_i\in S}I(x_j;x_i)\right]$$
where m is the number of features; the labels ranked in the top 70% by this criterion are selected to construct the new label subset of the sample.
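A greedy ranking loop in the spirit of the criterion above, reusing the mutual_information helper sketched earlier; the incremental form of the score and the handling of the 70% cut-off are assumptions where the text leaves the details open:

```python
import numpy as np

def select_top_features(X: np.ndarray, c: np.ndarray, keep_ratio: float = 0.7) -> list:
    """Rank the columns of X against target c by relevance minus redundancy,
    then keep the top keep_ratio fraction (70% in the text)."""
    q = X.shape[1]
    selected = []                 # S: features already chosen, empty at the start
    candidates = list(range(q))   # W: candidate features, initially all q of them
    while candidates:
        best, best_score = None, -np.inf
        for j in candidates:
            relevance = mutual_information(X[:, j], c)
            redundancy = (np.mean([mutual_information(X[:, j], X[:, i]) for i in selected])
                          if selected else 0.0)
            score = relevance - redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
        candidates.remove(best)
    return selected[: int(np.ceil(keep_ratio * q))]  # indices of the top-ranked 70%
```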
Finally, the label subset of each sample is computed iteratively in a loop, a new single-label data set is constructed from the new sample data set by putting the label subset of each sample into the sample, and a classification model is built on the new data samples.
Taking label L5 as an example, the new single label data set after adding the label subset is shown in the following table:
TABLE 7
id F1 F2 F3 F4 F5 F6 F7 L1 L2 L3 L4 Labels
1 G1 G2 G3 G4 G5 G6 G7 G8 G9 G10 G11 L5
2 R1 R2 R3 R4 R5 R6 R7 R8 R9 R10 R11 L5
3 M1 M2 M3 M4 M5 M6 M7 M8 M9 M10 M11 L5
4 Y1 Y2 Y3 Y4 Y5 Y6 Y7 Y8 Y9 Y10 Y11 L5
5 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
6 H1 H2 H3 H4 H5 H6 H7 H8 H9 H10 H11
7 X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 L5
8 N1 N1 N1 N1 N1 N1 N7 N8 N9 N10 N11
As can be seen from Table 7, F1 through F7 are the original features of the data. Assuming the multi-label data set contains 5 labels in total, the original labels L1, L2, L3 and L4 are then added to the single-label data as features. G8, G9, G10, G11 and so on are simply the values of these features and may be, for example, 0 or 1, indicating the absence or presence of the corresponding label. In this way the relationships between label L5 and the other labels L1, L2, L3 and L4 are established; likewise, the relationships between L1 and the other labels, between L2 and the other labels, between L3 and the other labels, and between L4 and the other labels are established.
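A sketch of how the augmented data set of Table 7 could be assembled for one target label; it assumes, as above, that labels are stored as sets in a Labels column and that the other labels are encoded as the 0/1 indicator features mentioned in the text:

```python
import pandas as pd

def augment_for_label(D: pd.DataFrame, target: str, label_subset: list) -> pd.DataFrame:
    """Add the labels of label_subset as 0/1 feature columns and keep target as the class."""
    out = D.copy()
    for lab in label_subset:                                   # e.g. L1..L4 when the target is L5
        out[lab] = D["Labels"].apply(lambda s, lab=lab: int(lab in s))
    out["Labels"] = D["Labels"].apply(lambda s: target if target in s else "")
    return out
```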
Further, the ratio of the training set to the test set is 1:1.
Further, a Bayesian algorithm is used to construct the single-label classification model.
Further, the multi-label data sample set contains a plurality of different labels and a plurality of different features.
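Putting the pieces together, one Bayesian classifier can be trained per augmented single-label data set and the predicted labels of a new sample collected across all models; the choice of scikit-learn's GaussianNB is an assumption, since the text only specifies "a Bayesian algorithm":

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def train_and_predict(train_sets: dict, x_new: np.ndarray) -> set:
    """train_sets maps each label to its (features, 0/1 class) training arrays.
    Returns the set of labels whose single-label model accepts x_new."""
    predicted = set()
    for lab, (X, y) in train_sets.items():
        model = GaussianNB().fit(X, y)                 # one single-label classifier per label
        if model.predict(x_new.reshape(1, -1))[0] == 1:
            predicted.add(lab)                         # record this label for the sample
    return predicted
```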
Example 3
A classification system based on label subsets is provided, the system comprising:
the sample acquisition module is used for acquiring a multi-label data sample set and converting the multi-label data sample set into a single-label data set;
the label subset calculation module is used for calculating a label subset for every sample in the single-label data set and constructing a new sample data set based on the label subsets; calculating a label subset comprises: calculating the importance of all features in the sample; calculating the correlation between features; calculating the redundancy between features; and selecting the top-ranked labels by combining the correlation and redundancy between features to construct the label subset;
the sample recombination module is used for appending the label subset of each sample to the original sample to obtain a new single-label data set;
and the modeling and classifying module is used for constructing a single-label classification model based on the new single-label data set and then collecting the labels of each sample across the single-label classification models to obtain the final multi-label classification result.
Example 4
This embodiment shares the inventive concept of Example 1. On the basis of Example 1, a storage medium is provided on which computer instructions are stored; when the computer instructions are executed, the steps of the label-subset-based classification method of Example 1 are performed.
Based on such understanding, the technical solution of the present embodiment or parts of the technical solution may be essentially implemented in the form of a software product, which is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method of the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Example 5
This embodiment, which shares the inventive concept of Example 1, also provides a terminal comprising a memory and a processor, the memory storing computer instructions executable on the processor; when the processor executes the computer instructions, the steps of the label-subset-based classification method of Example 1 are performed. The processor may be a single-core or multi-core central processing unit, an application-specific integrated circuit, or one or more integrated circuits configured to implement the present invention.
Each functional unit in the embodiments provided by the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The above detailed description is intended to describe the invention in detail and should not be construed as limiting the invention; it will be apparent to those skilled in the art that various modifications and substitutions can be made without departing from the spirit of the invention.

Claims (10)

1. A classification method based on label subsets, the method comprising:
acquiring a multi-label data sample set, and converting the multi-label data sample set into a single-label data set;
calculating a label subset for every sample in the single-label data set, and constructing a new sample data set based on the label subsets; calculating a label subset comprises: calculating the importance of all features in the sample; calculating the correlation between features; calculating the redundancy between features; and selecting the top-ranked labels by combining the correlation and redundancy between features to construct the label subset;
appending the label subset of each sample to the corresponding sample of the original single-label data set to obtain a new single-label data set;
and constructing a single-label classification model based on the new single-label data set, then collecting the labels of each sample across the single-label classification models to obtain the final multi-label classification result.
2. The method of claim 1,
the converting of the multi-label data sample set into the single-label data set comprises:
assigning each single label in the multi-label data sample set to every sample, thereby decomposing the set into as many data subsets as there are labels.
3. The method of claim 1, wherein the method further comprises:
pre-processing the multi-labeled data sample set, the pre-processing comprising:
and deleting the samples with missing data characteristic values, keeping the samples with complete data characteristics, and then randomly dividing the multi-label data sample set into a training set and a testing set.
4. The method of claim 1, wherein the importance of each feature is calculated by the F-score formula:
$$F_i=\frac{\left(\bar{x}_i^{(+)}-\bar{x}_i\right)^2+\left(\bar{x}_i^{(-)}-\bar{x}_i\right)^2}{\frac{1}{n^{+}-1}\sum_{k=1}^{n^{+}}\left(x_{k,i}^{(+)}-\bar{x}_i^{(+)}\right)^2+\frac{1}{n^{-}-1}\sum_{k=1}^{n^{-}}\left(x_{k,i}^{(-)}-\bar{x}_i^{(-)}\right)^2}$$
where the larger F_i is, the stronger the class-discriminating ability of feature x_i.
5. The method of claim 4, wherein the correlation between features is calculated using mutual information, and the calculation formula is as follows:
$$I(X;Y)=\sum_{i}\sum_{j}p(x_i,y_j)\log\frac{p(x_i,y_j)}{p(x_i)\,p(y_j)}$$
where X denotes a feature variable, Y denotes a label variable, p(x_i) and p(y_j) are the marginal probabilities of X and Y, and p(x_i, y_j) is their joint probability distribution.
6. The method according to claim 5, wherein the selecting of the top-ranked labels by combining the correlation and redundancy between features comprises:
calculating the mean mutual information between all features and the target variable:
$$D(S,c)=\frac{1}{|S|}\sum_{x_i\in S}I(x_i;c)$$
where S denotes the set of selected features and c denotes the target variable;
and calculating the redundant information quantity between the characteristics according to the following calculation formula:
Figure RE-FDA0003510723360000024
and selecting labels with low redundancy and high correlation according to the mean mutual information and the amount of redundant information, using the selection criterion
$$\max_{x_j\in W}\left[I(x_j;c)-\frac{1}{m-1}\sum_{x_i\in S}I(x_j;x_i)\right]$$
where m is the number of features; the labels ranked in the top 70% by this criterion are selected.
7. The method of claim 3, wherein the ratio of the training set to the test set is 1:1.
8. The method of claim 1, wherein a Bayesian algorithm is used to construct the single label classification model.
9. The method of claim 1, wherein the multi-labeled data sample set comprises a plurality of different labels and a plurality of different features.
10. A classification system based on label subsets, the system comprising:
the sample acquisition module is used for acquiring a multi-label data sample set and converting the multi-label data sample set into a single-label data set;
the label subset calculation module is used for calculating a label subset for every sample in the single-label data set and constructing a new sample data set based on the label subsets; calculating a label subset comprises: calculating the importance of all features in the sample; calculating the correlation between features; calculating the redundancy between features; and selecting the top-ranked labels by combining the correlation and redundancy between features to construct the label subset;
the sample recombination module is used for appending the label subset of each sample to the original sample to obtain a new single-label data set;
and the modeling and classifying module is used for constructing a single-label classification model based on the new single-label data set and then collecting the labels of each sample across the single-label classification models to obtain the final multi-label classification result.
CN202111566217.5A 2021-12-20 2021-12-20 Classification method and system based on label subset Pending CN114298191A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111566217.5A CN114298191A (en) 2021-12-20 2021-12-20 Classification method and system based on label subset

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111566217.5A CN114298191A (en) 2021-12-20 2021-12-20 Classification method and system based on label subset

Publications (1)

Publication Number Publication Date
CN114298191A true CN114298191A (en) 2022-04-08

Family

ID=80967444

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111566217.5A Pending CN114298191A (en) 2021-12-20 2021-12-20 Classification method and system based on label subset

Country Status (1)

Country Link
CN (1) CN114298191A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination