CN114612914A - Machine learning method and system for multi-label unbalanced data classification - Google Patents

Machine learning method and system for multi-label unbalanced data classification

Info

Publication number
CN114612914A
CN114612914A
Authority
CN
China
Prior art keywords: label, distribution, population, class, labels
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210309385.4A
Other languages
Chinese (zh)
Inventor
段继聪
于化龙
段宝敏
姜元昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University of Science and Technology
Original Assignee
Jiangsu University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University of Science and Technology filed Critical Jiangsu University of Science and Technology
Priority to CN202210309385.4A priority Critical patent/CN114612914A/en
Publication of CN114612914A publication Critical patent/CN114612914A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G06F 18/23 Clustering techniques
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Abstract

The invention provides a machine learning method and system for multi-label unbalanced data classification, built on a dual high-order strategy and an evolutionary-computation sampling method. Feature-type and label-type high-order strategies are used in combination to expand the multi-label unbalanced data. The fused evolutionary-computation method supplies a way to calculate the population balance fitness of a multi-label dataset, on the basis of which a dynamic down-sampling operation is carried out in the high-dimensional overlapping space according to changes in the average label imbalance rate IRLbl(P). The multi-label problem is converted into a traditional classification problem, so that a traditional classifier can participate directly in multi-label classification through the dual high-order strategy. The invention enables a traditional multi-class classifier to take part directly in multi-label unbalanced classification while taking the relations among labels into account, and effectively improves the value of the multi-label evaluation index F-measure of the algorithm.

Description

Machine learning method and system for multi-label unbalanced data classification
Technical Field
The invention relates to the technical field of artificial intelligence and machine-learning algorithm design, in particular to a machine learning method and system for multi-label unbalanced data classification.
Background
With the development of artificial intelligence, machine-learning algorithm design is gradually moving toward practicality, integration, and refinement, and multi-label classification algorithms are being applied ever more widely. However, traditional multi-label classification algorithms generally adopt a low-order strategy: they do not fully consider the relations among labels, ignore key learning information, and leave the label distribution unbalanced, so the resulting multi-label algorithms suffer from low prediction accuracy and poor robustness.
In summary, the design of modern machine-learning multi-label classification algorithms lacks an effective solution for considering the relations among labels, selecting a high-order strategy, and improving accuracy and robustness.
Therefore, a machine learning method and system for multi-label unbalanced data classification are needed to solve the above technical problems.
Disclosure of Invention
The invention aims to provide a machine learning method and a machine learning system for multi-label unbalanced data classification, which aim to overcome the defects in the prior art.
In order to achieve this purpose, the invention adopts the following technical scheme: a machine learning method for multi-label unbalanced data classification, including the following steps,
S1: performing multiple iterations according to the features of the multi-label dataset, and finally expanding the iterated topic distribution into the features of the multi-label dataset;
s2, performing dynamic down-sampling operation according to the population balance fitness of the multi-label data set;
s3, clustering according to the characteristics of the label distribution condition of the multi-label data set and forming a label cluster, and recording label serial numbers before clustering;
S4, controlling the maximum size of each label class cluster, so that no cluster contains more than 3 labels;
S5, converting the multi-label distribution in all class clusters into multi-class distributions, yielding several multi-classification datasets;
s6, calling a traditional multi-class unbalanced classifier, and respectively learning and predicting the multi-class data set obtained by conversion to obtain a predicted multi-class classification result;
s7, converting the predicted multi-class classification result into a multi-label distribution result again, and restoring the original label set sequence through conversion;
and S8, evaluating the prediction result with the Macro-F1 and Micro-F1 indexes. Before calculating the Macro-F1 and Micro-F1 values, Precision and Recall are calculated according to the following formulas:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
where TP (True Positive) means predicted positive and actually positive; FP (False Positive) means predicted positive but actually negative; FN (False Negative) means predicted negative but actually positive; and TN (True Negative) means predicted negative and actually negative.
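As an illustrative sketch of the S8 evaluation (not the patent's implementation; the toy indicator data and function names below are hypothetical), Macro-F1 averages the per-label F1 scores, while Micro-F1 first pools the TP/FP/FN counts over all labels:

```python
# Hedged sketch: Macro-F1 and Micro-F1 over multi-label predictions,
# using the Precision/Recall definitions above. All data is toy data.

def confusion_counts(y_true, y_pred):
    """Per-label TP/FP/FN over binary indicator lists."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn

def f1(tp, fp, fn):
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def macro_micro_f1(Y_true, Y_pred):
    """Y_true/Y_pred: one binary indicator list per label."""
    counts = [confusion_counts(t, p) for t, p in zip(Y_true, Y_pred)]
    macro = sum(f1(*c) for c in counts) / len(counts)           # mean of per-label F1
    tp, fp, fn = (sum(c[i] for c in counts) for i in range(3))  # pooled counts
    micro = f1(tp, fp, fn)
    return macro, micro

# Each inner list is one label's indicator over 4 instances (hypothetical).
Y_true = [[1, 0, 1, 1], [0, 1, 0, 0]]
Y_pred = [[1, 0, 0, 1], [0, 1, 1, 0]]
macro, micro = macro_micro_f1(Y_true, Y_pred)
```

The two averages diverge exactly when the labels are unbalanced, which is why the patent reports both.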
S4 works as follows: when a label class cluster contains more than 3 labels, it is clustered again and split into several smaller clusters, repeatedly, until no resulting cluster contains more than 3 labels.
The traditional multi-class classifier in S6 should satisfy the following requirements: when handling the imbalance problem, the selected multi-class classifier can effectively distinguish at least 5 classes; and it should take as little time as possible to process its tasks, with an algorithmic time complexity below O(n^3).
The S1 includes the steps of:
S1-1: with the goal of expanding the feature set of the multi-label data using the LDA topic model, presetting the number of iterations i and setting the number of topics K to 2; K is set to 2 to ensure that the resulting topic distribution is binary;
s1-2: regarding each instance as a document and regarding each label as a word in the document, determining Dirichlet distribution parameters, and introducing the Dirichlet distribution parameters into an LDA topic model;
s1-3: calculating an example-theme probability distribution matrix according to an LDA theme model calculation rule, wherein the matrix represents the probability value of each example belonging to each theme, and a binary discrete matrix is generated according to the probability value so as to determine the theme of each example;
S1-4: the training dataset and the test dataset share the same topic probability distribution, so the features of the training set are extracted first and, with the discrete matrix from S1-3 as the result, combined into a new multi-label dataset; a traditional multi-class classifier is then used to learn and predict the discrete matrix corresponding to the test set;
S1-5: expanding the feature space of the original training set with the discrete matrix from S1-3, and expanding the feature space of the test set with the discrete matrix predicted for the test set by the traditional multi-class classifier in S1-4; then checking whether the iterations are finished: if not, return to S1-4; otherwise end.
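The core of S1-3/S1-5 can be sketched roughly as follows, assuming a K = 2 instance-topic probability matrix has already been produced by an LDA fit; the matrix values and function names are hypothetical illustrations, not the patent's code:

```python
# Hedged sketch: turn an instance-topic probability matrix (K = 2) into a
# binary discrete matrix and append it to the feature space, as in S1-3
# and S1-5. doc_topic stands in for an actual LDA output.

def binarize_topics(doc_topic):
    """One-hot the dominant topic of each instance (K = 2 gives binary columns)."""
    out = []
    for row in doc_topic:
        k = max(range(len(row)), key=lambda j: row[j])
        out.append([1 if j == k else 0 for j in range(len(row))])
    return out

def expand_features(X, discrete):
    """Append the discrete topic columns to each instance's feature vector."""
    return [x + d for x, d in zip(X, discrete)]

doc_topic = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]   # hypothetical LDA output
X = [[1.0, 0.0], [0.5, 1.5], [0.0, 2.0]]           # hypothetical original features
X_expanded = expand_features(X, binarize_topics(doc_topic))
```

Iterating this expansion i times, as the patent prescribes, simply repeats the fit-binarize-append cycle on the already expanded feature space.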
The dynamic down-sampling operation in S2 specifically includes the following steps:
S2-1: calculating the value of the population balance fitness of the multi-label dataset. Following the standard practice in the field of evolutionary computation, each sample in the dataset is treated as an individual, and the individuals together form a population;
S2-2, mapping each individual simultaneously into a high-dimensional label space and a high-dimensional feature-label space, and placing all N_P individuals of the population into a connected network N = {V, E} with N_P vertices, where V denotes the vertex set and E the edge set; at this point the network is an unconnected graph T = {V, E} with only N_P vertices and an empty edge set;
S2-3: following Kruskal's principle, selecting each time the two individuals that are not recorded in the connection tabu list and are closest in overlapping-space distance. If connecting the two individuals would not reduce the number of connected components of the system, the operation is cancelled and recorded in the connection tabu list, and the two individuals are no longer allowed to be connected. If connecting them does reduce the connected components, go to S2-4; if no qualifying pair remains, go to S2-5. The overlapping-space distance is calculated according to the following formula:
[Equation image BDA0003565987430000031: definition of the overlapping-space distance D_C(p, q) in terms of D_L(p, q) and D_F(p, q); the exact combination is not recoverable from the text]
where D_C is the overlapping-space distance, and D_L(p, q) and D_F(p, q) are the Euclidean distances between the p-th and q-th individuals of the population after mapping into the high-dimensional label space and the high-dimensional feature-label space, respectively;
S2-4: marking the two individuals passed in from S2-3 as connected; the unconnected graph T = {V, E} is updated synchronously, and the number of connected components of the system decreases with this operation. If, as a result, either of the two connected individuals now has more than 1 connected vertex, that individual is recorded in the connection tabu list and is no longer allowed to connect with any other individual. Afterwards, go to S2-3;
S2-5: calculating the connected components at this point. Each set of interconnected individuals is treated as an individual combination; the combination with the most individuals is selected, and within it the shortest edge is chosen. Each of the two individuals at the endpoints of that edge is connected in turn to all other individuals in the combination, and the total length of the new edges so generated is calculated for each. The individual yielding the smaller total length is deleted from the population, realizing a single down-sampling operation; when the two total lengths are equal, both individuals are deleted from the population;
S2-6: calculating the value of the average label imbalance rate IRLbl(P) for each label of population P using the standard IRLbl calculation method. If the mean IRLbl(L) of the average label imbalance rates over all labels is still above the preset threshold ERT, return to S2-1; otherwise end.
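The stopping criterion of S2-6 can be sketched as follows, using the standard IRLbl definition (a label's imbalance rate is the most frequent label's count divided by that label's count); the toy population and the helper names are illustrative assumptions:

```python
# Hedged sketch of step S2-6: per-label IRLbl and the mean-vs-ERT check
# that decides whether another down-sampling round (back to S2-1) is needed.

def irlbl(Y):
    """Y: list of instances, each a binary label vector. Returns per-label IRLbl."""
    counts = [sum(y[j] for y in Y) for j in range(len(Y[0]))]
    peak = max(counts)
    return [peak / c if c else float("inf") for c in counts]

def needs_more_downsampling(Y, ert):
    """True means the mean imbalance rate still exceeds the threshold ERT."""
    rates = irlbl(Y)
    mean_ir = sum(rates) / len(rates)
    return mean_ir > ert

# Hypothetical population: label 0 appears in 4 instances, label 1 in one.
Y = [[1, 0], [1, 0], [1, 0], [1, 1]]
```

Here the majority label has IRLbl 1.0 by construction, so the mean IRLbl grows with the rarity of minority labels, which is what the loop in S2-1 to S2-6 drives down.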
The population balance fitness in S2-1 is calculated as follows:
[Equation image BDA0003565987430000041: EF_P expressed through EF_L and EF_F with coefficients c_1 and c_2; the exact form is not recoverable from the text]
where EF_P is the population balance fitness of population P; EF_L and EF_F are the population label balance fitness and the population feature balance fitness, respectively; N_L and N_F are the number of labels contained in the population and the dimensionality of its features, respectively; L and F denote the label set and feature set of the population; and c_1 and c_2 are two constant coefficients. The quantity shown in image BDA0003565987430000042 is the number of individuals in the population containing the i-th label, and the quantity shown in image BDA0003565987430000043 is the number of individuals whose j-th feature value is non-zero.
In step S1-2, the prior distribution of topics in each document and the distribution of words in each topic are determined from the Dirichlet parameters according to the following formulas:
θ_i ~ Dirichlet(α), i = 1, …, N
φ_k ~ Dirichlet(β), k = 1, …, K
where N is the number of instances in the dataset and K is the number of topics; θ is the prior distribution of topics in a document, which by analogy with the multi-label dataset is the distribution of topics over the dataset's instances; and φ is the distribution of words within a topic, which by analogy is the distribution of labels within a topic.
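For illustration only, the two Dirichlet priors above can be sampled by normalizing Gamma draws; the α and β values below are hypothetical, not taken from the patent:

```python
# Hedged sketch: sampling theta (one topic distribution per instance) and
# phi (one label distribution per topic) from Dirichlet priors, via the
# standard normalized-Gamma construction. Parameters are illustrative.
import random

def dirichlet(alpha, rng):
    """Sample one Dirichlet-distributed probability vector."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(draws)
    return [d / s for d in draws]

rng = random.Random(0)
N, K, n_labels = 4, 2, 3
theta = [dirichlet([0.5] * K, rng) for _ in range(N)]        # topic mix per instance
phi = [dirichlet([0.1] * n_labels, rng) for _ in range(K)]   # label mix per topic
```

Each sampled vector is non-negative and sums to 1, which is exactly the property the LDA calculation rule in S1-3 relies on.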
The step of S3 includes the following steps:
S3-1: according to the label distribution of the dataset, each label is treated as a clustering element and each instance as one dimension of that label's features; together they form the dataset on which the Jaccard similarity is calculated, and the label order is recorded;
S3-2: using the Jaccard similarity as the evaluation criterion, calculating the Jaccard similarity between all labels and then carrying out hierarchical clustering on the result. With A and B denoting the instance distributions of the two labels, the Jaccard similarity between labels is calculated according to the following formula:
J(A, B) = |A ∩ B| / |A ∪ B|
wherein A and B respectively represent the example distribution conditions of two labels;
S3-3: forming the clustering result into class clusters, so that the result is stored structurally in cluster form and can conveniently be processed further.
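A minimal sketch of S3-1/S3-2 follows, assuming hypothetical indicator columns for three labels; only the pairwise Jaccard similarities that hierarchical clustering would consume are computed here:

```python
# Hedged sketch: each label is a clustering element whose "features" are
# its instance indicator column; Jaccard(A, B) = |A ∩ B| / |A ∪ B|.
# Label names and columns are toy data.

def jaccard(a, b):
    """a, b: binary indicator lists of equal length (one entry per instance)."""
    inter = sum(x and y for x, y in zip(a, b))
    union = sum(x or y for x, y in zip(a, b))
    return inter / union if union else 0.0

# Indicator columns for three hypothetical labels over five instances.
labels = {
    "L1": [1, 1, 0, 0, 1],
    "L2": [1, 1, 0, 0, 0],
    "L3": [0, 0, 1, 0, 0],
}
sims = {(p, q): jaccard(labels[p], labels[q])
        for p in labels for q in labels if p < q}
```

A hierarchical (agglomerative) clusterer would then repeatedly merge the most similar pair, producing the clustering tree of Fig. 4.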
The conversion between multi-label and multi-class distributions in S5 works as follows: when no cluster contains more than 3 labels, the multi-label distribution within each cluster has at most 2^3 = 8 distinct label combinations, so the corresponding multi-class distribution has at most 8 classes. Moreover, since the number of labels may differ between clusters, the multi-label distribution must be mapped one-to-one onto the multi-class distribution before the multi-class classifier is trained, following the rule shown in the table below:
[Table image BDA0003565987430000052: one-to-one correspondence between label combinations and class indices]
After the multi-label distribution in each class cluster is converted into a multi-class distribution, the multi-label feature set expanded by the LDA topic model is combined with each cluster to form new multi-class datasets; the number of combinations equals the number of class clusters.
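One consistent way to realize the S5 mapping and its S7 inverse (the patent's exact correspondence table survives only as an image, so the binary encoding below is an assumption) is to read each cluster's label vector as a base-2 integer, giving at most 2^3 = 8 classes:

```python
# Hedged sketch: label-powerset style encoding for a cluster of at most
# 3 labels, and the inverse used when converting predictions back (S7).

def labels_to_class(label_vec):
    """Read the binary label vector as an integer: [1, 0, 1] -> 0b101 -> 5."""
    cls = 0
    for bit in label_vec:
        cls = (cls << 1) | bit
    return cls

def class_to_labels(cls, n_labels):
    """Inverse mapping: recover the label vector from a predicted class."""
    return [(cls >> (n_labels - 1 - i)) & 1 for i in range(n_labels)]

# Round trip: encoding then decoding restores the original label vector.
assert class_to_labels(labels_to_class([1, 0, 1]), 3) == [1, 0, 1]
```

Any bijection between label combinations and class indices works equally well, as long as the same table is used in both S5 and S7.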
In addition, the invention also discloses a system for the multi-label unbalanced classification machine-learning framework, comprising: a terminal device, which is an internet terminal device comprising a processor and a computer-readable storage medium, the processor being configured to execute instructions; and the computer-readable storage medium storing a plurality of instructions adapted to be loaded by the processor to perform a machine learning method for multi-label unbalanced data classification as claimed in any one of claims 1 to 9.
According to the above technical scheme, the invention has the following beneficial effects: it proposes a dual high-order strategy of feature type and label type, providing methodological guidance for considering the relations among labels. From the perspective of label relations, the LDA topic model serves as one high-order strategy to expand the internal relations of the labels into the data feature set, while clustering jointly considers related labels without affecting that expansion. In addition, forming class clusters limits the scale of the label distribution to a certain extent, and converting the clusters lets a traditional multi-class classifier participate directly in multi-label learning, with the label relations taken into account, without complex modification of the algorithm. Experiments show the invention to be robust, flexible, and efficient: it not only can enhance traditional algorithms but also performs better than existing algorithms.
Drawings
Fig. 1 is a detailed flowchart of a machine learning method for multi-label unbalanced data classification according to the present invention.
Fig. 2 is a schematic diagram of the application of the LDA topic model provided by the present invention on a multi-label dataset.
Fig. 3 is a schematic diagram of the LDA workflow provided by the present invention.
FIG. 4 is a schematic diagram of a clustering tree generated by clustering a tag set by hierarchical clustering according to the present invention.
Detailed Description
To make the technical means, creative features, objectives, and effects of the invention easy to understand, the invention is further described below with specific embodiments.
Please refer to fig. 1, fig. 2, fig. 3, and fig. 4 in combination. Fig. 1 is the method flowchart of the construction method and system for multi-label unbalanced data classification machine learning based on the dual high-order strategy and the evolutionary-computation sampling method; fig. 2 illustrates the use of the LDA topic model on a multi-label set under the first high-order strategy; fig. 3 shows the LDA workflow; and fig. 4 is the clustering tree obtained after hierarchical clustering of the label set under the second high-order strategy.
As shown in fig. 1, a method for constructing machine learning of multi-label unbalanced data classification includes:
step S1: according to the characteristics of the multi-label data set, calculating a theme to which each instance belongs by using an LDA theme model, representing a binary theme calculated by the LDA theme model, carrying out multiple iterations, and then expanding the distribution of the iterated theme into the characteristics of the multi-label data set;
the iteration times i are preset before the feature set of the multi-label data is expanded by using the LDA topic model, the topic number K is set to be 2, and the reason why the topic number is set to be 2 is to ensure that the obtained topic distribution is binary distribution;
step S1-2: regarding each instance as a document and regarding each tag as a word in the document, determining a dirichlet distribution parameter, and then introducing the dirichlet distribution parameter into the LDA topic model shown in fig. 2;
As shown in fig. 3, to satisfy the initial calculation conditions of the LDA topic model, the prior distribution of topics in each document and the distribution of words in each topic are determined from the Dirichlet parameters according to the following formulas:
θ_i ~ Dirichlet(α), i = 1, …, N
φ_k ~ Dirichlet(β), k = 1, …, K
where N is the number of instances in the dataset and K is the number of topics.
θ, the prior distribution of topics in a document, corresponds in the multi-label dataset to the distribution of topics over the instances;
φ, the distribution of words within a topic, corresponds to the distribution of labels within a topic.
Step S1-3: calculating an example-theme probability distribution matrix according to an LDA theme model calculation rule, wherein the matrix represents the probability value of each example belonging to each theme, and a binary discrete matrix is generated according to the probability value so as to determine the theme of each example;
step S1-4: the training data set and the test data set have the same theme probability distribution, so that the characteristics of the training set are firstly extracted, the discrete matrices in the step 2-3 are taken as results, the results are combined into a new multi-label data set, and then the traditional multi-class classifier is utilized to learn and predict the discrete matrices corresponding to the test set;
step S1-5: and (4) expanding the feature space of the original training set by using the discretization matrix in the step S1-3, expanding the feature space of the test set by using the traditional multi-class classifier in the step S1-4 to learn and predict the discretization matrix of the corresponding test set, checking whether iteration is finished, if not, turning to the step S1-4, and if not, finishing.
Step S2: to solve the class-imbalance problem in the multi-label dataset through the dynamic down-sampling operation, step S2 further includes:
Step S2-1: calculating the value of the population balance fitness of the multi-label dataset. Following the standard practice in the field of evolutionary computation, each sample in the dataset is treated as an individual, and the individuals together form a population. The population balance fitness of the multi-label dataset is calculated according to the following formula:
[Equation image BDA0003565987430000072: EF_P expressed through EF_L and EF_F with coefficients c_1 and c_2; the exact form is not recoverable from the text]
where EF_P is the population balance fitness of population P; EF_L and EF_F are the population label balance fitness and the population feature balance fitness, respectively; N_L and N_F are the number of labels contained in the population and the dimensionality of its features, respectively; L and F denote the label set and feature set of the population; and c_1 and c_2 are two constant coefficients. The quantity shown in image BDA0003565987430000081 is the number of individuals in the population containing the i-th label, and the quantity shown in image BDA0003565987430000082 is the number of individuals whose j-th feature value is non-zero;
step S2-2: simultaneously mapping each individual to a high-dimensional label space and a high-dimensional feature label space, and mapping N contained in the populationPThe individual units are simultaneously placed in a connection network N, one of which is formed with NPAnd the connected network N of each vertex is { V, E }. At this time, the connected net is formed with only NPA connected graph T, which is a vertex and an edge set is an empty set at this time, { V, E }, where V denotes a point set and E denotes an edge set;
step S2-3: according to the Kruskal principle, two individuals which are not recorded into a connected tabu list and have the closest overlapping space distance are selected each time. At this time, if the two individuals are connected and the connected component of the system is not reduced, the operation is cancelled, and the operation is recorded in the connection tabu table, so that the two individuals are not allowed to be connected. If the two individuals are connected, the connected component of the system is reduced, and the step 3-4 is switched. If no qualified individuals capable of executing the operation exist, the step 3-5 is switched to. Wherein, the overlapping space distance is calculated according to the following formula:
[Equation image BDA0003565987430000083: definition of the overlapping-space distance D_C(p, q) in terms of D_L(p, q) and D_F(p, q); the exact combination is not recoverable from the text]
where D_C is the overlapping-space distance, and D_L(p, q) and D_F(p, q) are the Euclidean distances between the p-th and q-th individuals of the population after mapping into the high-dimensional label space and the high-dimensional feature-label space, respectively;
step S2-4: the two individuals passed in by step S2-3 are marked as connected. At this time, the unconnected graph T ═ { V, E } is updated synchronously, and the connected components of the system are also reduced with this operation. If the two individuals executing the connection operation have the number of the connected top points exceeding 1 due to the operation, the individuals are recorded in the connection taboo table, and the individuals are not allowed to be connected with any other individuals. After the completion, go to step S2-3;
step S2-5: the connected component at this time is calculated. Each interconnected individual is treated as an individual combination, an individual combination with the largest number of individuals is selected, an edge with the shortest length is selected from the individual combinations, the individuals on two vertexes of the edge are sequentially connected with all other individuals in the individual combination, and the total length of the generated new edge is calculated. For individuals with a smaller overall length, they are removed from the population to achieve a single down-sampling operation. When the total length values are the same, deleting the two individuals from the population at the same time;
step S2-6: the value of the average label imbalance rate IRLbl (P) of each label of the population P is calculated using the standard IRLbl calculation method. And when the average value irlbl (l) of the average label unbalance rates irlbl (p) of all the labels is still higher than the preset threshold value ERT, returning to the step S2-1, otherwise, ending.
Step S3: according to the label distribution of the multi-label dataset, and in order to measure the differences between labels, the binary distribution of each label over the instances is taken as its features and the Jaccard similarity as the metric; the Jaccard similarity between labels is calculated according to the following formula:
J(A, B) = |A ∩ B| / |A ∪ B|
wherein A and B respectively represent the example distribution conditions of two labels;
As shown in fig. 4, a label clustering result is obtained and label class clusters are formed, with the label serial numbers recorded before clustering. To mine and learn the latent relationships among the labels with the hierarchical clustering algorithm, step S3 further includes:
step S3-1: according to the label distribution condition of the data set, regarding each label as a clustering element, regarding each instance as a one-dimensional label feature, regarding each instance as a data set for calculating the Jaccard similarity, and recording the label sequence;
step S3-2: calculating the similarity of the Jaccard among all the labels by taking the similarity of the Jaccard as an evaluation standard, and then carrying out hierarchical clustering according to a calculation result;
step S3-3: and forming a cluster by the clustering result and storing the cluster.
Step S4: controlling the maximum size of each label class cluster: clusters with more than 3 labels are clustered again and split into several clusters. The splitting method is to re-cluster all such clusters; after re-clustering, the new clusters replace the original ones. The size of each cluster is then checked again: if any cluster still has more than 3 labels, re-clustering is repeated until every cluster contains no more than 3 labels;
step S5: converting the multi-label distribution in all the class clusters into multi-class distribution, wherein the converted class of each class cluster does not exceed 8 classes at most, namely 23. Then combining the feature set expanded by the LDA topic model with the converted multiple classes respectively to convert the feature set into multiple classified data sets;
in the conversion process of the multi-label distribution and the multi-category distribution in step S5, the method further includes: when the scale of all the obtained clusters does not exceed 3 labels, the multi-label distribution in all the clusters does not exceed 8 labels at most, and the corresponding multi-class distribution is 8 classes, namely 23. In addition, the number of labels in each class cluster may be different, so before training the multi-class classifier, the multi-label distribution should be made to uniquely correspond to the multi-class distribution. The corresponding rule should be as shown in the following table:
Figure BDA0003565987430000101
step S6: calling a traditional multi-class imbalance classifier, and respectively learning and predicting the multi-class data set obtained by conversion to obtain a predicted multi-class classification result; in order to satisfy the converted multi-class data set, before calling the traditional multi-class unbalanced classifier, the type requirements further include: the limited number of labels in the cluster reduces the size of label distribution, but the reduced label distribution still presents an unbalanced situation and the time complexity is slightly increased. According to the characteristics of the converted data set, the selected traditional multi-class classifier should satisfy the following conditions: the requirement of imbalance problem processing capacity and the requirement of minimizing required time is met;
Step S7: converting the predicted multi-class classification results back into multi-label distribution results and restoring the original label order. After the conventional multi-class classifier has produced its multi-class predictions, each multi-class result is converted back to a multi-label distribution; this conversion is the inverse of the conversion performed in S5. Once the conversion is complete, the results of all clusters are merged and the original label-set order is restored.
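A minimal sketch of this inverse conversion, assuming each predicted class id encodes the cluster's label combination as a binary number and that step S3 recorded the original label indices of every cluster (all names here are illustrative):

```python
# Sketch of step S7: decode per-cluster class predictions back into binary
# label columns placed at their recorded original positions.

def restore_labels(cluster_preds, cluster_label_ids, n_labels):
    """cluster_preds: {cluster id: list of predicted class ids, one per
    instance}; cluster_label_ids: {cluster id: original label indices}."""
    n_instances = len(next(iter(cluster_preds.values())))
    out = [[0] * n_labels for _ in range(n_instances)]
    for c, preds in cluster_preds.items():
        ids = cluster_label_ids[c]
        for row, cls in enumerate(preds):
            for j, label_idx in enumerate(ids):
                # decode one bit of the class id into its original column
                out[row][label_idx] = (cls >> (len(ids) - 1 - j)) & 1
    return out
```
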
Step S8: evaluating the prediction results with the Macro-F1 and Micro-F1 indexes. Before the Macro-F1 and Micro-F1 values can be calculated, Precision and Recall must be computed as follows:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
wherein TP (True Positive) means the prediction is positive and the actual is positive; FP (False Positive) means the prediction is positive but the actual is negative; FN (False Negative) means the prediction is negative but the actual is positive; and TN (True Negative) means the prediction is negative and the actual is also negative. Finally, the prediction results are evaluated with the Macro-F1 and Micro-F1 indexes.
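The evaluation of step S8 can be sketched as follows, using the standard aggregations (Macro-F1 as the average of per-label F1 scores, Micro-F1 as the F1 over globally pooled counts; function names are illustrative):

```python
# Sketch of step S8: per-label Precision/Recall from TP/FP/FN counts,
# aggregated into Macro-F1 and Micro-F1.

def f1(precision, recall):
    s = precision + recall
    return 0.0 if s == 0 else 2 * precision * recall / s

def macro_micro_f1(y_true, y_pred):
    """y_true, y_pred: lists of binary label vectors (one row per instance)."""
    n_labels = len(y_true[0])
    tp = [0] * n_labels
    fp = [0] * n_labels
    fn = [0] * n_labels
    for t, p in zip(y_true, y_pred):
        for j in range(n_labels):
            if p[j] and t[j]:
                tp[j] += 1          # predicted positive, actually positive
            elif p[j] and not t[j]:
                fp[j] += 1          # predicted positive, actually negative
            elif not p[j] and t[j]:
                fn[j] += 1          # predicted negative, actually positive

    def ratio(a, b):
        return a / (a + b) if a + b else 0.0

    macro = sum(f1(ratio(tp[j], fp[j]), ratio(tp[j], fn[j]))
                for j in range(n_labels)) / n_labels
    TP, FP, FN = sum(tp), sum(fp), sum(fn)
    micro = f1(ratio(TP, FP), ratio(TP, FN))
    return macro, micro
```
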
It will be appreciated by those skilled in the art that the invention may be embodied in other specific forms without departing from its spirit or essential characteristics. The embodiments disclosed above are therefore to be considered in all respects as illustrative and not restrictive. All changes which come within the scope of the invention or its equivalents are intended to be embraced therein.

Claims (10)

1. A machine learning method for multi-label unbalanced data classification is characterized by comprising the following steps,
S1, performing multiple iterations according to the features of the multi-label data set, and finally expanding the iteration results into the feature set of the labeled data set;
S2, carrying out a dynamic down-sampling operation according to the population balance fitness of the multi-label data set;
S3, clustering according to the label distribution of the multi-label data set to form label clusters, and recording the label order before clustering;
S4, controlling the maximum scale of each label cluster, so that every cluster contains no more than 3 labels;
S5, converting the multi-label distribution in all clusters into multi-class distributions, thereby obtaining a plurality of multi-classification data sets;
S6, calling a conventional multi-class imbalance classifier to learn and predict on each converted multi-class data set, obtaining predicted multi-class classification results;
S7, converting the predicted multi-class classification results back into multi-label distribution results, and restoring the original label-set order through the conversion;
S8, evaluating the prediction results with the Macro-F1 and Micro-F1 indexes; before calculating the Macro-F1 and Micro-F1 values, Precision and Recall are calculated according to the following formulas:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
wherein TP (True Positive) means the prediction is positive and the actual is positive; FP (False Positive) means the prediction is positive but the actual is negative; FN (False Negative) means the prediction is negative but the actual is positive; and TN (True Negative) means the prediction is negative and the actual is also negative.
2. The machine learning method for multi-label unbalanced data classification as claimed in claim 1, wherein step S4 is as follows: when a label cluster contains more than 3 labels, the cluster is clustered repeatedly and split into several clusters again, until the scale of every resulting cluster does not exceed 3 labels.
3. The method of claim 1 or 2, wherein the conventional multi-class classifier of S6 satisfies the following requirements: when handling the imbalance problem, the selected multi-class classifier can effectively distinguish classification problems of at least 5 classes; and the selected multi-class classifier takes as little time as possible to process its tasks, with an algorithmic time complexity lower than O(n³).
4. The method for machine learning with multi-label unbalanced data classification as claimed in claim 3, wherein the step of S1 comprises the steps of:
S1-1: taking the expansion of the feature set of the multi-label data by the LDA topic model as the goal, presetting the number of iterations i and setting the number of topics K to 2; K is set to 2 to ensure that the obtained topic distribution is a binary distribution;
S1-2: regarding each instance as a document and each label as a word in the document, determining the Dirichlet distribution parameters, and introducing them into the LDA topic model;
S1-3: calculating the instance-topic probability distribution matrix according to the calculation rules of the LDA topic model, the matrix representing the probability that each instance belongs to each topic, and generating a binary discrete matrix from these probability values so as to determine the topic of each instance;
S1-4: since the training data set and the test data set share the same topic probability distribution, the features of the training set are extracted first, the discrete matrix from S1-3 is taken as the target, the two are combined into a new multi-label data set, and a conventional multi-class classifier is then used to learn and predict the discrete matrix corresponding to the test set;
S1-5: expanding the feature space of the original training set with the discrete matrix from S1-3, and expanding the feature space of the test set with the discrete matrix predicted for the test set by the conventional multi-class classifier in S1-4; then checking whether the iterations are finished: if not, return to S1-4, otherwise end.
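Steps S1-1 to S1-3 can be illustrated with scikit-learn's LDA implementation (one possible realization; the claim does not prescribe a particular library, and discretising by taking each instance's most probable topic is an assumption here):

```python
# Illustration of S1-1 to S1-3: each instance's label row plays the role of
# a document and each label that of a word; K = 2 topics keep the resulting
# topic distribution binary.

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def topic_features(Y, n_iter=10, seed=0):
    """Y: (n_instances, n_labels) binary label matrix.
    Returns the binary instance-topic matrix with K = 2 topics."""
    lda = LatentDirichletAllocation(n_components=2, max_iter=n_iter,
                                    random_state=seed)
    doc_topic = lda.fit_transform(Y)  # instance-topic probability matrix
    # binary discrete matrix: 1 marks each instance's most probable topic
    return (doc_topic == doc_topic.max(axis=1, keepdims=True)).astype(int)
```

The returned discrete matrix would then be appended to the feature space as described in S1-5.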
5. The method for machine learning with multi-label unbalanced data classification as claimed in claim 1, wherein the dynamic down-sampling operation in S2 specifically comprises the following steps:
S2-1: calculating the value of the population balance fitness of the multi-label data set; at this time, following the standard practice in the field of evolutionary computation, each sample in the data set is treated as an individual, and the individuals together form a population;
S2-2: mapping each individual simultaneously into a high-dimensional label space and a high-dimensional feature-label space, and placing all N_P individuals contained in the population into a connection network at the same time, forming a connected network N = {V, E} with N_P vertices; at this moment it forms only a graph T = {V, E} with N_P vertices and an empty edge set, where V denotes the vertex set and E denotes the edge set;
S2-3: according to the Kruskal principle, selecting each time the two individuals that are not recorded in the connection tabu list and have the smallest overlapping-space distance. If connecting the two individuals would not reduce the number of connected components of the system, the operation is cancelled and recorded in the connection tabu list, so that the two individuals are no longer allowed to be connected; if connecting them does reduce the number of connected components, go to S2-4; if no qualified pair of individuals exists, go to S2-5; wherein the overlapping-space distance is calculated according to the following formula:
[formula for the overlapping-space distance D_C, given as an image in the original, combining D_L(p, q) and D_F(p, q)]
wherein D_C is the overlapping-space distance, and D_L(p, q) and D_F(p, q) are the Euclidean distances between the p-th and q-th individuals in the population after they are mapped into the high-dimensional label space and the high-dimensional feature-label space, respectively;
S2-4: marking the two individuals passed in from S2-3 as connected; at this time the graph T = {V, E} is updated synchronously, and the number of connected components of the system decreases accordingly. If, as a result of this operation, either of the two connected individuals is now connected to more than 1 vertex, that individual is recorded in the connection tabu list and is no longer allowed to connect with any other individual; after this, go to S2-3;
S2-5: calculating the connected components at this moment and treating each group of connected individuals as an individual combination. From the combination containing the most individuals, the shortest edge is selected; each of the two individuals on the vertices of this edge is connected in turn with all the other individuals in the combination, and the total length of the newly generated edges is calculated. The individual with the smaller total length is deleted from the population, realizing a single down-sampling operation; when the two total lengths are equal, both individuals are deleted from the population simultaneously;
S2-6: calculating the per-label imbalance ratio IRLbl(P) of each label of the population P with the standard IRLbl calculation method; when the average IRLbl(L) of the per-label imbalance ratios over all labels is still higher than the preset threshold ERT, return to S2-1, otherwise end.
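The stopping criterion of S2-6 relies on the per-label imbalance ratio IRLbl; the claim calls this the "standard IRLbl calculation method" without restating it, so the common definition (frequency of the most frequent label divided by the frequency of the label in question) is assumed in this sketch:

```python
# Sketch of the quantities used in S2-6 under the assumed standard IRLbl
# definition: majority-label count divided by each label's own count.

def irlbl(Y):
    """Y: list of binary label vectors. Per-label imbalance ratios."""
    n_labels = len(Y[0])
    counts = [sum(row[j] for row in Y) for j in range(n_labels)]
    majority = max(counts)
    return [majority / c if c else float("inf") for c in counts]

def mean_ir(Y):
    """Average imbalance ratio over all labels; compared against ERT."""
    vals = irlbl(Y)
    return sum(vals) / len(vals)
```
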
6. The method for machine learning of multi-label unbalanced data classification as claimed in claim 5, wherein the algorithm of population balanced fitness in S2-1 is as follows:
[formula for the population balance fitness EF_P, given as an image in the original]
In the population balance fitness calculation formula, EF_P is the population balance fitness of population P; EF_L and EF_F are the population label balance fitness and the population feature balance fitness, respectively; N_L and N_F are the number of labels contained in the population and the dimensionality of the features contained in the population, respectively; L and F denote the label set and the feature set contained in the population; c_1 and c_2 are two constant coefficients; and the remaining two quantities, given as images in the original, are the number of individuals in the population containing the i-th label and the number of individuals in the population whose j-th feature value is not 0, respectively.
7. The method of claim 4, wherein in step S1-2 the prior distribution of topics in each document and the distribution of words in each topic are determined according to the Dirichlet parameters by the following formula:
θ_n ~ Dirichlet(α), n = 1, …, N;  φ_k ~ Dirichlet(β), k = 1, …, K
in the above formula, N is the number of instances in the data set and K is the number of topics; θ is the prior distribution of topics in a document, which by analogy with the multi-label data set is the distribution of topics over the data set's instances; φ is the distribution of words within a topic, which by the same analogy is the distribution of labels within a topic.
8. The method for machine learning with multi-label unbalanced data classification as claimed in claim 6, wherein the step of S3 comprises the following steps:
S3-1: according to the label distribution of the data set, each label is regarded as a clustering element and each instance as a one-dimensional label feature, forming the data set on which the Jaccard similarity is computed, and the label order is recorded;
S3-2: taking the Jaccard similarity as the evaluation criterion, the Jaccard similarity between all labels is calculated and hierarchical clustering is then performed on the result; with A and B denoting the instance distributions of two labels, the Jaccard similarity between the labels is calculated according to the following formula:
J(A, B) = |A ∩ B| / |A ∪ B|
wherein A and B respectively represent the example distribution conditions of two labels;
S3-3: forming the clustering results into clusters, so that the clustering results are stored structurally in cluster form, which facilitates their further processing.
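The Jaccard computation of S3-1/S3-2 can be sketched as follows (the pairwise similarity matrix produced here would then drive any hierarchical clustering routine; function names are illustrative):

```python
# Sketch of S3-1/S3-2: pairwise Jaccard similarity between label columns.

def jaccard(a, b):
    """a, b: binary instance-indicator lists for two labels."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 0.0

def similarity_matrix(label_cols):
    """Jaccard similarity between every pair of label columns."""
    n = len(label_cols)
    return [[jaccard(label_cols[i], label_cols[j]) for j in range(n)]
            for i in range(n)]
```
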
9. The method of claim 1, wherein the conversion of the multi-label distribution into the multi-class distribution in S5 comprises: when no resulting cluster exceeds 3 labels, the multi-label distribution within any cluster covers at most 2^3 = 8 label combinations, and the corresponding multi-class distribution therefore has at most 8 classes; in addition, since the number of labels in each cluster may differ, the multi-label distribution should be mapped one-to-one onto the multi-class distribution before training the multi-class classifier, the correspondence rule being as shown in the following table:
Label combination (l1, l2, l3)    Corresponding class
(0, 0, 0)                         0
(0, 0, 1)                         1
(0, 1, 0)                         2
(0, 1, 1)                         3
(1, 0, 0)                         4
(1, 0, 1)                         5
(1, 1, 0)                         6
(1, 1, 1)                         7
and after the multi-label distribution in each cluster has been converted into a multi-class distribution, the multi-label feature set expanded by the LDA topic model is combined with each cluster to form new multi-class data sets, the number of combinations being equal to the number of clusters.
10. A system for a multi-label imbalance classification machine learning framework, characterized by comprising a terminal device, wherein the terminal device is an Internet terminal device comprising a processor and a computer-readable storage medium; the processor is configured to implement the instructions; and the computer-readable storage medium is configured to store a plurality of instructions adapted to be loaded by the processor to perform the machine learning method for multi-label unbalanced data classification according to any one of claims 1 to 9.
CN202210309385.4A 2022-03-25 2022-03-25 Machine learning method and system for multi-label unbalanced data classification Pending CN114612914A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210309385.4A CN114612914A (en) 2022-03-25 2022-03-25 Machine learning method and system for multi-label unbalanced data classification


Publications (1)

Publication Number Publication Date
CN114612914A true CN114612914A (en) 2022-06-10

Family

ID=81867747


Country Status (1)

Country Link
CN (1) CN114612914A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115632996A (en) * 2022-12-19 2023-01-20 中国人民解放军国防科技大学 Network flow classification system and method based on federal online active learning



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination