CN110059756A - A multi-label classification system based on multi-objective optimization - Google Patents

A multi-label classification system based on multi-objective optimization

Info

Publication number
CN110059756A
Authority
CN
China
Prior art keywords
tag
classification
model
attribute
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910327482.4A
Other languages
Chinese (zh)
Inventor
章昭辉
蒋昌俊
王鹏伟
王海建
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Donghua University
National Dong Hwa University
Original Assignee
Donghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Donghua University
Priority to CN201910327482.4A
Publication of CN110059756A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217 Validation; Performance evaluation; Active pattern learning techniques
    • G06F18/2193 Validation; Performance evaluation; Active pattern learning techniques based on specific statistical tests
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155 Bayesian classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2431 Multiple classes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • General Factory Administration (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a multi-label classification system based on multi-objective optimization. The present invention uses the naive Bayes classification algorithm as the base model, converts the multi-label classification problem into multiple binary classification problems, and weakens the attribute independence assumption of naive Bayes through a weighting scheme on the sample feature attributes, constructing a naive Bayes multi-label classification model. The randomly generated attribute weights are then used as the genes of a genetic algorithm, and the NSGA-II genetic algorithm optimizes multiple objectives of the multi-label classification model to obtain a set of optimal classification models. Finally, a model selection strategy is used to vote on the final label assignment of each test sample. Comparing the present invention with 3 multi-label classification methods on 3 datasets, the proposed multi-label classification method achieves a significant improvement in the total score over 7 metrics, including Hamming loss and Micro-F1.

Description

A multi-label classification system based on multi-objective optimization
Technical field
The present invention relates to a system using a multi-label classification algorithm, and belongs to the field of computer science and technology.
Background art
Problem-transformation multi-label classification methods convert a multi-label classification problem into multiple single-label subproblems, solve these subproblems with single-label classification algorithms, and finally merge the results of all subproblems into the solution of the multi-label problem. Many base classifiers have been used in multi-label classification algorithms in this way, such as naive Bayes and support vector machines. Algorithm-adaptation multi-label classification methods adapt an existing algorithm to handle the multi-label classification problem directly, or formulate an optimization problem over all samples in the dataset. Since the samples of a multi-label classification problem carry multiple labels and the labels are correlated with each other, how to formulate and solve this optimization problem is very important. Although such algorithms are harder to implement, they neither destroy the correlations between labels nor change the structure of the dataset, and they can reflect the characteristics of the label categories, so their performance is often better. Common algorithm-adaptation multi-label classification algorithms are mostly improvements of algorithms such as k-nearest neighbours, AdaBoost, neural networks and decision trees.
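As an illustration of the problem-transformation (binary relevance) approach described above, the following minimal Python sketch trains one binary classifier per label and merges the per-label predictions; scikit-learn's GaussianNB is used here only as an example base classifier and is not prescribed by the present invention.

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    def binary_relevance_fit(X, Y):
        # one binary classifier per label column of Y (problem transformation)
        return [GaussianNB().fit(X, Y[:, j]) for j in range(Y.shape[1])]

    def binary_relevance_predict(models, X):
        # merge the per-label binary predictions into a multi-label prediction matrix
        return np.column_stack([m.predict(X) for m in models])

    # toy usage: 6 samples, 4 features, 3 labels
    rng = np.random.default_rng(0)
    X = rng.normal(size=(6, 4))
    Y = rng.integers(0, 2, size=(6, 3))
    print(binary_relevance_predict(binary_relevance_fit(X, Y), X))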
Summary of the invention
The object of the present invention is to provide a system that can be applied to multi-label classification in various fields. By combining known information, the system analyses the topic distribution of online news and provides an accurate decision-making basis for supervisory departments in a specific domain.
To achieve the above object, the technical solution of the present invention is to provide a multi-label classification system based on multi-objective optimization, characterized by comprising:
an online data acquisition module, for acquiring text data relevant to a specific domain from the Internet;
a model training module, including a doubly weighted naive Bayes multi-label classification model; the naive Bayes multi-label classification model converts the multi-label classification problem into multiple binary classification problems and weakens the attribute independence assumption of naive Bayes through a weighting scheme on the sample feature attributes, thereby constructing a naive Bayes multi-label classification model; randomly generated attribute weights are then used as the genes of a genetic algorithm, and the NSGA-II genetic algorithm is used to optimize multiple objectives of the naive Bayes multi-label classification model to obtain multiple optimal models, the set of optimal classification models finally being obtained from the training set;
a model selection module, in which the user weighs the optimal solutions according to different application scenarios and obtains the optimal solution using different model selection strategies.
Preferably, the formula of the naive Bayes multi-label classification model is as follows:
In the formula, P(c_i | X) denotes the probability that sample X belongs to c_i, where c_i denotes the i-th class of the class space C; P(c_i) denotes the probability of belonging to c_i over all samples; P(x_j) denotes the probability of the j-th attribute of the sample; w_ji denotes the weight of the j-th attribute for the i-th class.
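The formula referred to above is published only as an image in the original document and is not reproduced in this text. A plausible reconstruction from the variable definitions, assuming the standard attribute-weighted naive Bayes form (this exact form is an assumption and is not confirmed by the source), is, in LaTeX:

    P(c_i \mid X) \;\propto\; P(c_i) \prod_{j} \left( \frac{P(x_j \mid c_i)}{P(x_j)} \right)^{w_{ji}}

where the class with the highest posterior is chosen for each binary subproblem.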
Preferably, the model selection strategies in the model selection module include a dynamic model selection strategy and an ensemble model selection strategy, in which:
dynamic model selection strategy: the dynamic model selection strategy selects the top K classification models with respect to a preference objective O to predict the labels of the test dataset T separately, and then determines the predicted label set Y of the test dataset T by weighted voting;
ensemble model selection strategy: the ensemble model selection strategy uses the N classification models P generated by the optimization algorithm to classify the test dataset T separately, and then applies a voting mechanism to predict the label set Y.
The present invention can be efficiently applied to multi-label classification. The evaluation metrics of multi-label classification are mutually inconsistent or internally conflicting. Taking the naive Bayes classification algorithm as the base model, the multi-label classification problem is converted into multiple binary classification problems, and a weighting scheme on the sample feature attributes weakens the attribute independence assumption of naive Bayes, constructing a naive Bayes multi-label classification model. The randomly generated attribute weights are then used as the genes of a genetic algorithm, and the NSGA-II genetic algorithm optimizes multiple objectives of the multi-label classification model to obtain a set of optimal classification models. Finally, a model selection strategy is used to vote on the final label assignment of each test sample. Comparing the present invention with 3 multi-label classification methods on 3 datasets, the proposed multi-label classification method achieves a significant improvement in the total score over 7 metrics, including Hamming loss and Micro-F1.
Description of the drawings
Fig. 1 is the system module diagram;
Fig. 2 is the flow chart of the model training stage;
Fig. 3(a) and Fig. 3(b) show the influence of the population size and the number of generations on algorithm performance.
Specific embodiment
The present invention will be further explained below with reference to specific embodiments. It should be understood that these embodiments are merely illustrative of the present invention and do not limit its scope. In addition, it should also be understood that, after reading the teachings of the present invention, those skilled in the art can make various changes or modifications to the present invention, and such equivalent forms likewise fall within the scope defined by the appended claims of this application.
The multi-label classification system based on multi-objective optimization provided by the present invention mainly consists of 3 parts:
(1) Online data acquisition module. This module is responsible for acquiring, from the Internet, text data related to a specific domain, such as enterprises, policies, and related news and announcements.
(2) Model training module: this module constructs a doubly weighted naive Bayes multi-label classification model, then uses the NSGA-II genetic algorithm to optimize multiple objectives of the classification model to obtain multiple optimal models, and then obtains the set of optimal classification models from the training set.
(3) Model selection module: the user can weigh the optimal solutions according to different application scenarios and obtain the optimal solution using different model selection strategies.
The model training module and the model selection module are the key components of the present invention, and their specific implementation is as follows:
1. Model training module
The present invention proposes a doubly weighted naive Bayes multi-label classification (DWNB) model. It converts the multi-label classification problem into multiple binary classification problems and weakens the attribute independence assumption of naive Bayes through a weighting scheme on the sample feature attributes, constructing a naive Bayes multi-label classification model. The randomly generated attribute weights are then used as the genes of the genetic algorithm. Its formula is as follows:
Here, P(c_i | X) denotes the probability that sample X belongs to c_i, c_i denotes the i-th class of the class space, P(c_i) denotes the probability of belonging to c_i over all samples, P(x_j) denotes the probability of the j-th attribute of the sample, and w_ji denotes the weight of the j-th attribute for the i-th class.
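The formula itself appears only as an image in the original publication; the reconstruction given after the summary above is assumed here. As a concrete but non-authoritative illustration, the following Python sketch scores one sample under that assumed doubly weighted naive Bayes form; all shapes and names are illustrative.

    import numpy as np

    def dwnb_log_posterior(priors, lik, marg, W):
        # priors: (n_cls,)        estimated P(c_i)
        # lik:    (n_attr, n_cls) P(x_j | c_i) evaluated at this sample's attribute values
        # marg:   (n_attr,)       P(x_j) evaluated at this sample's attribute values
        # W:      (n_attr, n_cls) weight of attribute j for class i
        # assumed form: P(c_i | x) proportional to P(c_i) * prod_j (P(x_j|c_i) / P(x_j)) ** W[j, i]
        # computed in log space for numerical stability
        return np.log(priors) + np.sum(W * (np.log(lik) - np.log(marg)[:, None]), axis=0)

    # toy usage: one binary subproblem (label absent / present), 3 attributes
    priors = np.array([0.3, 0.7])
    lik = np.array([[0.2, 0.6], [0.5, 0.1], [0.4, 0.4]])
    marg = np.array([0.48, 0.22, 0.40])
    W = np.ones((3, 2))                      # unit weights recover ordinary naive Bayes
    print(np.argmax(dwnb_log_posterior(priors, lik, marg, W)))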
The present invention proposes a naive Bayes multi-label classification algorithm based on NSGA-II multi-objective optimization. Combined with DWNB, the NSGA-II genetic algorithm is used to optimize multiple objectives of the multi-label classification model to obtain the set of optimal classification models.
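The following sketch shows one possible way to wire the attribute weights into NSGA-II. It assumes the third-party pymoo package, which is not named in the present invention, and placeholder objective functions standing in for the Hamming-loss and Ranking-loss evaluation of a DWNB model built from a candidate weight vector.

    import numpy as np
    from pymoo.core.problem import ElementwiseProblem
    from pymoo.algorithms.moo.nsga2 import NSGA2
    from pymoo.optimize import minimize

    class DWNBWeightProblem(ElementwiseProblem):
        # each individual is a flattened attribute-weight matrix W (the "genes")
        def __init__(self, n_attr, n_cls, eval_hl, eval_rl):
            super().__init__(n_var=n_attr * n_cls, n_obj=2, xl=0.0, xu=1.0)
            self.shape = (n_attr, n_cls)
            self.eval_hl = eval_hl   # illustrative: Hamming loss of the DWNB model built from W
            self.eval_rl = eval_rl   # illustrative: Ranking loss of the DWNB model built from W

        def _evaluate(self, x, out, *args, **kwargs):
            W = x.reshape(self.shape)
            out["F"] = [self.eval_hl(W), self.eval_rl(W)]

    # placeholder objectives; in the real system these would evaluate DWNB on validation data
    problem = DWNBWeightProblem(n_attr=10, n_cls=2,
                                eval_hl=lambda W: float(np.mean(W)),
                                eval_rl=lambda W: float(np.var(W)))
    res = minimize(problem, NSGA2(pop_size=50), ("n_gen", 100), seed=1, verbose=False)
    pareto_weights, pareto_losses = res.X, res.F   # the optimal classification model set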
2. Model selection module
Dynamic model selection strategy: each classification model has its own strengths and weaknesses under different evaluation functions. The DYN selection strategy selects the top K classification models with respect to a preference objective O to predict the labels of the test dataset T separately, and then determines the predicted label set Y of T by weighted voting.
Ensemble model selection strategy: ensemble learning is an important idea in machine learning; it completes a task by constructing and combining multiple learners. Based on this idea, the ensemble model selection strategy uses the N classification models P generated by the optimization algorithm to classify the test dataset T separately, and then applies a voting mechanism to predict the label set Y.
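Both strategies can be illustrated with the following Python sketch. It assumes that each candidate model exposes a predict method returning a binary label matrix and that a per-model value of the preference objective O is available; all names are illustrative.

    import numpy as np

    def dynamic_selection(models, scores, X, k):
        # DYN: keep the top-k models on the preference objective O, then weighted voting
        # models: candidate classifiers, each with .predict(X) -> (n_samples, n_labels) 0/1 matrix
        # scores: per-model value of the preference objective (higher taken as better here)
        order = np.argsort(scores)[::-1][:k]
        weights = np.asarray(scores, dtype=float)[order]
        votes = sum(w * models[i].predict(X) for w, i in zip(weights, order))
        return (votes >= weights.sum() / 2).astype(int)    # weighted majority per label

    def ensemble_selection(models, X):
        # EN: let all N Pareto-optimal models vote with equal weight
        votes = sum(m.predict(X) for m in models)
        return (votes >= len(models) / 2).astype(int)      # simple majority per label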
An implementation case of the present invention is as follows:
1. The experiments use 3 datasets: the Yeast dataset, the Tmc2007 corpus and the Tongji University news dataset. The three datasets come from different fields, with different multi-label distributions and different label densities. The details of the datasets are shown in Table 1.
Table 1: Basic information of the experimental datasets
2. Parameter setting
Two parameters related to the genetic operations determine the performance of the DWNBML-EMO algorithm: the population size N and the number of generations Gen. In the experiments we first fix the number of optimization objectives and select HL and RL as the objective functions, use Tmc2007 as the experimental dataset, and train the model with ten-fold cross-validation. The population size N is increased up to 60 in steps of 10, and the change in the average performance of the algorithm is shown in Fig. 3(a) and Fig. 3(b).
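A minimal sketch of this parameter sweep is given below; it assumes scikit-learn's KFold and an illustrative train_and_score helper that runs the optimization for one fold and returns an aggregate performance value.

    import numpy as np
    from sklearn.model_selection import KFold

    def sweep_population_size(X, Y, train_and_score, sizes=range(10, 61, 10), n_gen=100):
        # average ten-fold cross-validation performance for each population size N
        results = {}
        for N in sizes:
            fold_scores = []
            for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
                fold_scores.append(train_and_score(X[train_idx], Y[train_idx],
                                                   X[test_idx], Y[test_idx],
                                                   pop_size=N, n_gen=n_gen))
            results[N] = float(np.mean(fold_scores))
        return results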
As shown in Fig. 3(a) and Fig. 3(b), for a fixed population size the performance of the algorithm keeps increasing as the generations iterate, but after a certain number of generations it no longer increases. As the population size increases, the algorithm converges more slowly but performs better, because a larger population improves population diversity and prevents premature convergence. However, once the population size reaches 60, the performance improvements of the algorithm are no longer obvious.
3. Performance comparison
This case selects 3 multi-label classification algorithms for comparison: ML-RBF, BP-MLL and RANK-SVM. The two most typical pairs of evaluation criteria, {HL, RL} and {MicF1, AP}, are used as the objective functions of the experiments to construct the multi-objective-optimized multi-label classification models, i.e. DWNBML-EMO{HL, RL} and DWNBML-EMO{MicF1, AP}, where HL and RL, as well as MicF1 and AP, conflict with each other inherently. To reduce the performance impact of preference within a model, the EN model selection strategy is used in the model selection phase. From Fig. 3(a) and Fig. 3(b) it can be seen that the population of the model converges to the Pareto front after 100 generations, and the model performance changes little once the population size exceeds 50; therefore the population size N of the optimization algorithm is set to 50 and the number of generations G is set to 100.
In this case the data models are trained with cross-validation. Each dataset is tested with the different algorithms, the classification results of the test samples are obtained, and the evaluation metric values are computed according to the formulas of the evaluation metrics in Table 2-1. Table 2, Table 3 and Table 4 list the evaluation metric values obtained by each algorithm on the Yeast, Tmc2007 and Tongji University news datasets, respectively.
Table 2: Experimental results of the various methods on the Yeast dataset
Table 3: Experimental results of the various methods on the Tmc2007 dataset
Table 4: Experimental results of the various methods on the domain-related dataset
In the 3 tables, a bold value indicates that the algorithm achieves the best result among all algorithms for that metric, and the total score is the number of metrics on which the algorithm performs best, i.e. the number of bold values in the table. DWNBML-EMO{HL, RL} scores 3, 4 and 1 on the three datasets, and DWNBML-EMO{MicF1, AP} scores 6, 5 and 6. It can be seen that the DWNBML-EMO algorithm achieves a large improvement in the total performance score.

Claims (3)

1. A multi-label classification system based on multi-objective optimization, characterized by comprising:
an online data acquisition module, for acquiring text data relevant to a specific domain from the Internet;
a model training module, including a doubly weighted naive Bayes multi-label classification model, wherein the naive Bayes multi-label classification model converts the multi-label classification problem into multiple binary classification problems and weakens the attribute independence assumption of naive Bayes through a weighting scheme on the sample feature attributes, thereby constructing a naive Bayes multi-label classification model; randomly generated attribute weights are then used as the genes of a genetic algorithm, and the NSGA-II genetic algorithm is used to optimize multiple objectives of the naive Bayes multi-label classification model to obtain multiple optimal models, the set of optimal classification models finally being obtained from the training set;
a model selection module, in which the user weighs the optimal solutions according to different application scenarios and obtains the optimal solution in the model selection module using different model selection strategies.
2. The multi-label classification system based on multi-objective optimization according to claim 1, characterized in that the formula of the naive Bayes multi-label classification model is as follows:
In the formula, P(c_i | X) denotes the probability that sample X belongs to c_i, c_i denotes the i-th class of the class space C; P(c_i) denotes the probability of belonging to c_i over all samples; P(x_j) denotes the probability of the j-th attribute of the sample; w_ji denotes the weight of the j-th attribute for the i-th class.
3. The multi-label classification system based on multi-objective optimization according to claim 1, characterized in that the model selection strategies in the model selection module include a dynamic model selection strategy and an ensemble model selection strategy, in which:
dynamic model selection strategy: the dynamic model selection strategy selects the top K classification models with respect to a preference objective O to predict the labels of the test dataset T separately, and then determines the predicted label set Y of the test dataset T by weighted voting;
ensemble model selection strategy: the ensemble model selection strategy uses the N classification models P generated by the optimization algorithm to classify the test dataset T separately, and then applies a voting mechanism to predict the label set Y.
CN201910327482.4A 2019-04-23 2019-04-23 A multi-label classification system based on multi-objective optimization Pending CN110059756A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910327482.4A CN110059756A (en) 2019-04-23 2019-04-23 A multi-label classification system based on multi-objective optimization

Publications (1)

Publication Number Publication Date
CN110059756A true CN110059756A (en) 2019-07-26

Family

ID=67320149

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910327482.4A Pending CN110059756A (en) 2019-04-23 2019-04-23 A multi-label classification system based on multi-objective optimization

Country Status (1)

Country Link
CN (1) CN110059756A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050289089A1 (en) * 2004-06-28 2005-12-29 Naoki Abe Methods for multi-class cost-sensitive learning
US20090125461A1 (en) * 2007-11-09 2009-05-14 Microsoft Corporation Multi-Label Active Learning
CN101901345A (en) * 2009-05-27 2010-12-01 复旦大学 Classification method of differential proteomics
CN105095494A (en) * 2015-08-21 2015-11-25 中国地质大学(武汉) Method for testing categorical data set
CN106960041A (en) * 2017-03-28 2017-07-18 山西同方知网数字出版技术有限公司 A kind of structure of knowledge method based on non-equilibrium data
CN107526805A (en) * 2017-08-22 2017-12-29 杭州电子科技大学 A kind of ML kNN multi-tag Chinese Text Categorizations based on weight

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘勘 et al., "中文微博的立场判别研究" (Research on stance detection in Chinese microblogs), 《知识管理论坛》 (Knowledge Management Forum) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110568286A (en) * 2019-09-12 2019-12-13 齐鲁工业大学 Transformer fault diagnosis method and system based on weighted double-hidden naive Bayes
CN110852441A (en) * 2019-09-26 2020-02-28 温州大学 Fire early warning method based on improved naive Bayes algorithm
CN110852441B (en) * 2019-09-26 2023-06-09 温州大学 Fire disaster early warning method based on improved naive Bayes algorithm
CN111476321A (en) * 2020-05-18 2020-07-31 哈尔滨工程大学 Air flyer identification method based on feature weighting Bayes optimization algorithm
CN112988981A (en) * 2021-05-14 2021-06-18 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Automatic labeling method based on genetic algorithm
CN112988981B (en) * 2021-05-14 2021-10-15 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Automatic labeling method based on genetic algorithm
CN113827981A (en) * 2021-08-17 2021-12-24 杭州电魂网络科技股份有限公司 Game loss user prediction method and system based on naive Bayes

Similar Documents

Publication Publication Date Title
CN110059756A (en) A multi-label classification system based on multi-objective optimization
Lines et al. Hive-cote: The hierarchical vote collective of transformation-based ensembles for time series classification
Chen et al. Supervised feature selection with a stratified feature weighting method
Li et al. Exploration of classification confidence in ensemble learning
Kong et al. Transductive multilabel learning via label set propagation
Idris et al. Intelligent churn prediction in telecom: employing mRMR feature selection and RotBoost based ensemble classification
Ghanem et al. Multi-class pattern classification in imbalanced data
CN106991296B (en) Integrated classification method based on randomized greedy feature selection
CN111709299B (en) Underwater sound target identification method based on weighting support vector machine
Idris et al. Customer churn prediction for telecommunication: Employing various features selection techniques and tree based ensemble classifiers
Cai et al. Imbalanced evolving self-organizing learning
Bi et al. Hierarchical multilabel classification with minimum bayes risk
CN111325264A (en) Multi-label data classification method based on entropy
Zhu et al. Collaborative decision-reinforced self-supervision for attributed graph clustering
Xie et al. Margin distribution based bagging pruning
kenji Nakano et al. Stacking methods for hierarchical classification
Zhang et al. Feature selection for high dimensional imbalanced class data based on F-measure optimization
CN105046323A (en) Regularization-based RBF network multi-label classification method
Pang et al. Improving deep forest by screening
Al Nuaimi et al. Online streaming feature selection with incremental feature grouping
Ali et al. A k nearest neighbour ensemble via extended neighbourhood rule and feature subsets
Liu et al. A weight-incorporated similarity-based clustering ensemble method
CN111832645A (en) Classification data feature selection method based on discrete crow difference collaborative search algorithm
Altınçay Decision trees using model ensemble-based nodes
Xie et al. Churn prediction with linear discriminant boosting algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190726