CN110059756A - A multi-label classification system based on multi-objective optimization - Google Patents

A multi-label classification system based on multi-objective optimization

Info

Publication number
CN110059756A
Authority
CN
China
Prior art keywords
tag
classification
model
attribute
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910327482.4A
Other languages
Chinese (zh)
Inventor
章昭辉
蒋昌俊
王鹏伟
王海建
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Donghua University
National Dong Hwa University
Original Assignee
Donghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Donghua University
Priority to CN201910327482.4A
Publication of CN110059756A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217 Validation; Performance evaluation; Active pattern learning techniques
    • G06F18/2193 Validation; Performance evaluation; Active pattern learning techniques based on specific statistical tests
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155 Bayesian classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2431 Multiple classes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • General Factory Administration (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a multi-label classification system based on multi-objective optimization. The present invention uses the naive Bayes classification algorithm as the base model, converts the multi-label classification problem into multiple binary classification problems, and weakens the attribute independence assumption of naive Bayes through a weighting scheme on the sample feature attributes, constructing a naive Bayes multi-label classification model. The randomly generated attribute weights are then used as the genes of a genetic algorithm, and the NSGA-II genetic algorithm optimizes multiple objectives of the multi-label classification model to obtain a set of optimal classification models. Finally, a model selection strategy is used to vote on the final label assignment of each test sample. Comparing the present invention with 3 multi-label classification methods on 3 datasets, the proposed multi-label classification method achieves a significant improvement in the total score over 7 metrics, including Hamming loss and Micro-F1.

Description

A multi-label classification system based on multi-objective optimization
Technical field
The present invention relates to a system using a multi-label classification algorithm, and belongs to the field of computer science and technology.
Background art
Problem-transformation multi-label classification methods convert a multi-label classification problem into multiple single-label subproblems, solve these subproblems with single-label classification algorithms, and finally merge the results of all subproblems into the solution of the multi-label problem. Many base classifiers have been used in multi-label classification algorithms in this way, such as naive Bayes and support vector machines. Algorithm-adaptation multi-label classification methods adapt an existing algorithm to handle the multi-label classification problem directly, or formulate an optimization problem over all samples in the dataset. Since the samples of a multi-label classification problem carry multiple labels and the labels are correlated with each other, how to formulate and solve this optimization problem is very important. Although such algorithms are harder to implement, they neither destroy the correlations between labels nor change the structure of the dataset, and they can reflect the characteristics of the label categories, so their performance is often better. Common algorithm-adaptation multi-label classification algorithms are mostly improvements of algorithms such as k-nearest neighbours, AdaBoost, neural networks and decision trees.
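As an illustration of the problem-transformation (binary relevance) approach described above, the following minimal Python sketch trains one binary classifier per label and merges the per-label predictions; scikit-learn's GaussianNB is used here only as an example base classifier and is not prescribed by the present invention.

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    def binary_relevance_fit(X, Y):
        # one binary classifier per label column of Y (problem transformation)
        return [GaussianNB().fit(X, Y[:, j]) for j in range(Y.shape[1])]

    def binary_relevance_predict(models, X):
        # merge the per-label binary predictions into a multi-label prediction matrix
        return np.column_stack([m.predict(X) for m in models])

    # toy usage: 6 samples, 4 features, 3 labels
    rng = np.random.default_rng(0)
    X = rng.normal(size=(6, 4))
    Y = rng.integers(0, 2, size=(6, 3))
    print(binary_relevance_predict(binary_relevance_fit(X, Y), X))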
Summary of the invention
The object of the present invention is to provide a system that can be applied to multi-label classification in various fields. By combining known information, the system analyses the topic distribution of online news and provides an accurate decision-making basis for supervisory departments in a specific domain.
To achieve the above object, the technical solution of the present invention is to provide a multi-label classification system based on multi-objective optimization, characterized by comprising:
an online data acquisition module, for acquiring text data relevant to a specific domain from the Internet;
a model training module, including a doubly weighted naive Bayes multi-label classification model; the naive Bayes multi-label classification model converts the multi-label classification problem into multiple binary classification problems and weakens the attribute independence assumption of naive Bayes through a weighting scheme on the sample feature attributes, thereby constructing a naive Bayes multi-label classification model; randomly generated attribute weights are then used as the genes of a genetic algorithm, and the NSGA-II genetic algorithm is used to optimize multiple objectives of the naive Bayes multi-label classification model to obtain multiple optimal models, the set of optimal classification models finally being obtained from the training set;
a model selection module, in which the user weighs the optimal solutions according to different application scenarios and obtains the optimal solution using different model selection strategies.
Preferably, the formula of the naive Bayes multi-label classification model is as follows:
In the formula, P(c_i | X) denotes the probability that sample X belongs to c_i, where c_i denotes the i-th class of the class space C; P(c_i) denotes the probability of belonging to c_i over all samples; P(x_j) denotes the probability of the j-th attribute of the sample; w_ji denotes the weight of the j-th attribute for the i-th class.
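The formula referred to above is published only as an image in the original document and is not reproduced in this text. A plausible reconstruction from the variable definitions, assuming the standard attribute-weighted naive Bayes form (this exact form is an assumption and is not confirmed by the source), is, in LaTeX:

    P(c_i \mid X) \;\propto\; P(c_i) \prod_{j} \left( \frac{P(x_j \mid c_i)}{P(x_j)} \right)^{w_{ji}}

where the class with the highest posterior is chosen for each binary subproblem.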
Preferably, the model selection strategies in the model selection module include a dynamic model selection strategy and an ensemble model selection strategy, in which:
dynamic model selection strategy: the dynamic model selection strategy selects the top K classification models with respect to a preference objective O to predict the labels of the test dataset T separately, and then determines the predicted label set Y of the test dataset T by weighted voting;
ensemble model selection strategy: the ensemble model selection strategy uses the N classification models P generated by the optimization algorithm to classify the test dataset T separately, and then applies a voting mechanism to predict the label set Y.
The present invention can be efficiently applied to multi-label classification. The evaluation metrics of multi-label classification are mutually inconsistent or internally conflicting. Taking the naive Bayes classification algorithm as the base model, the multi-label classification problem is converted into multiple binary classification problems, and a weighting scheme on the sample feature attributes weakens the attribute independence assumption of naive Bayes, constructing a naive Bayes multi-label classification model. The randomly generated attribute weights are then used as the genes of a genetic algorithm, and the NSGA-II genetic algorithm optimizes multiple objectives of the multi-label classification model to obtain a set of optimal classification models. Finally, a model selection strategy is used to vote on the final label assignment of each test sample. Comparing the present invention with 3 multi-label classification methods on 3 datasets, the proposed multi-label classification method achieves a significant improvement in the total score over 7 metrics, including Hamming loss and Micro-F1.
Description of the drawings
Fig. 1 is the system module diagram;
Fig. 2 is the flow chart of the model training stage;
Fig. 3(a) and Fig. 3(b) show the influence of the population size and the number of generations on algorithm performance.
Specific embodiment
The present invention will be further explained below with reference to specific embodiments. It should be understood that these embodiments are merely illustrative of the present invention and do not limit its scope. In addition, it should also be understood that, after reading the teachings of the present invention, those skilled in the art can make various changes or modifications to the present invention, and such equivalent forms likewise fall within the scope defined by the appended claims of this application.
The multi-label classification system based on multi-objective optimization provided by the present invention mainly consists of 3 parts:
(1) Online data acquisition module. This module is responsible for acquiring, from the Internet, text data related to a specific domain, such as enterprises, policies, and related news and announcements.
(2) Model training module: this module constructs a doubly weighted naive Bayes multi-label classification model, then uses the NSGA-II genetic algorithm to optimize multiple objectives of the classification model to obtain multiple optimal models, and then obtains the set of optimal classification models from the training set.
(3) Model selection module: the user can weigh the optimal solutions according to different application scenarios and obtain the optimal solution using different model selection strategies.
The model training module and the model selection module are the key components of the present invention, and their specific implementation is as follows:
1. Model training module
The present invention proposes a doubly weighted naive Bayes multi-label classification (DWNB) model. It converts the multi-label classification problem into multiple binary classification problems and weakens the attribute independence assumption of naive Bayes through a weighting scheme on the sample feature attributes, constructing a naive Bayes multi-label classification model. The randomly generated attribute weights are then used as the genes of the genetic algorithm. Its formula is as follows:
Here, P(c_i | X) denotes the probability that sample X belongs to c_i, c_i denotes the i-th class of the class space, P(c_i) denotes the probability of belonging to c_i over all samples, P(x_j) denotes the probability of the j-th attribute of the sample, and w_ji denotes the weight of the j-th attribute for the i-th class.
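The formula itself appears only as an image in the original publication; the reconstruction given after the summary above is assumed here. As a concrete but non-authoritative illustration, the following Python sketch scores one sample under that assumed doubly weighted naive Bayes form; all shapes and names are illustrative.

    import numpy as np

    def dwnb_log_posterior(priors, lik, marg, W):
        # priors: (n_cls,)        estimated P(c_i)
        # lik:    (n_attr, n_cls) P(x_j | c_i) evaluated at this sample's attribute values
        # marg:   (n_attr,)       P(x_j) evaluated at this sample's attribute values
        # W:      (n_attr, n_cls) weight of attribute j for class i
        # assumed form: P(c_i | x) proportional to P(c_i) * prod_j (P(x_j|c_i) / P(x_j)) ** W[j, i]
        # computed in log space for numerical stability
        return np.log(priors) + np.sum(W * (np.log(lik) - np.log(marg)[:, None]), axis=0)

    # toy usage: one binary subproblem (label absent / present), 3 attributes
    priors = np.array([0.3, 0.7])
    lik = np.array([[0.2, 0.6], [0.5, 0.1], [0.4, 0.4]])
    marg = np.array([0.48, 0.22, 0.40])
    W = np.ones((3, 2))                      # unit weights recover ordinary naive Bayes
    print(np.argmax(dwnb_log_posterior(priors, lik, marg, W)))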
The present invention proposes a naive Bayes multi-label classification algorithm based on NSGA-II multi-objective optimization. Combined with DWNB, the NSGA-II genetic algorithm is used to optimize multiple objectives of the multi-label classification model to obtain the set of optimal classification models.
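The following sketch shows one possible way to wire the attribute weights into NSGA-II. It assumes the third-party pymoo package, which is not named in the present invention, and placeholder objective functions standing in for the Hamming-loss and Ranking-loss evaluation of a DWNB model built from a candidate weight vector.

    import numpy as np
    from pymoo.core.problem import ElementwiseProblem
    from pymoo.algorithms.moo.nsga2 import NSGA2
    from pymoo.optimize import minimize

    class DWNBWeightProblem(ElementwiseProblem):
        # each individual is a flattened attribute-weight matrix W (the "genes")
        def __init__(self, n_attr, n_cls, eval_hl, eval_rl):
            super().__init__(n_var=n_attr * n_cls, n_obj=2, xl=0.0, xu=1.0)
            self.shape = (n_attr, n_cls)
            self.eval_hl = eval_hl   # illustrative: Hamming loss of the DWNB model built from W
            self.eval_rl = eval_rl   # illustrative: Ranking loss of the DWNB model built from W

        def _evaluate(self, x, out, *args, **kwargs):
            W = x.reshape(self.shape)
            out["F"] = [self.eval_hl(W), self.eval_rl(W)]

    # placeholder objectives; in the real system these would evaluate DWNB on validation data
    problem = DWNBWeightProblem(n_attr=10, n_cls=2,
                                eval_hl=lambda W: float(np.mean(W)),
                                eval_rl=lambda W: float(np.var(W)))
    res = minimize(problem, NSGA2(pop_size=50), ("n_gen", 100), seed=1, verbose=False)
    pareto_weights, pareto_losses = res.X, res.F   # the optimal classification model set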
2. Model selection module
Dynamic model selection strategy: each classification model has its own strengths and weaknesses under different evaluation functions. The DYN selection strategy selects the top K classification models with respect to a preference objective O to predict the labels of the test dataset T separately, and then determines the predicted label set Y of T by weighted voting.
Ensemble model selection strategy: ensemble learning is an important idea in machine learning; it completes a task by constructing and combining multiple learners. Based on this idea, the ensemble model selection strategy uses the N classification models P generated by the optimization algorithm to classify the test dataset T separately, and then applies a voting mechanism to predict the label set Y.
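Both strategies can be illustrated with the following Python sketch. It assumes that each candidate model exposes a predict method returning a binary label matrix and that a per-model value of the preference objective O is available; all names are illustrative.

    import numpy as np

    def dynamic_selection(models, scores, X, k):
        # DYN: keep the top-k models on the preference objective O, then weighted voting
        # models: candidate classifiers, each with .predict(X) -> (n_samples, n_labels) 0/1 matrix
        # scores: per-model value of the preference objective (higher taken as better here)
        order = np.argsort(scores)[::-1][:k]
        weights = np.asarray(scores, dtype=float)[order]
        votes = sum(w * models[i].predict(X) for w, i in zip(weights, order))
        return (votes >= weights.sum() / 2).astype(int)    # weighted majority per label

    def ensemble_selection(models, X):
        # EN: let all N Pareto-optimal models vote with equal weight
        votes = sum(m.predict(X) for m in models)
        return (votes >= len(models) / 2).astype(int)      # simple majority per label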
An implementation case of the present invention is as follows:
1. The experiments use 3 datasets: the Yeast dataset, the Tmc2007 corpus and the Tongji University news dataset. The three datasets come from different fields, with different multi-label distributions and different label densities. The details of the datasets are shown in Table 1.
Table 1: Basic information of the experimental datasets
2. Parameter setting
Two parameters related to the genetic operations determine the performance of the DWNBML-EMO algorithm: the population size N and the number of generations Gen. In the experiments we first fix the number of optimization objectives and select HL and RL as the objective functions, use Tmc2007 as the experimental dataset, and train the model with ten-fold cross-validation. The population size N is increased up to 60 in steps of 10, and the change in the average performance of the algorithm is shown in Fig. 3(a) and Fig. 3(b).
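A minimal sketch of this parameter sweep is given below; it assumes scikit-learn's KFold and an illustrative train_and_score helper that runs the optimization for one fold and returns an aggregate performance value.

    import numpy as np
    from sklearn.model_selection import KFold

    def sweep_population_size(X, Y, train_and_score, sizes=range(10, 61, 10), n_gen=100):
        # average ten-fold cross-validation performance for each population size N
        results = {}
        for N in sizes:
            fold_scores = []
            for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
                fold_scores.append(train_and_score(X[train_idx], Y[train_idx],
                                                   X[test_idx], Y[test_idx],
                                                   pop_size=N, n_gen=n_gen))
            results[N] = float(np.mean(fold_scores))
        return results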
As shown in Fig. 3(a) and Fig. 3(b), for a fixed population size the performance of the algorithm keeps increasing as the generations iterate, but after a certain number of generations it no longer increases. As the population size increases, the algorithm converges more slowly but performs better, because a larger population improves population diversity and prevents premature convergence. However, once the population size reaches 60, the performance improvements of the algorithm are no longer obvious.
3. Performance comparison
This case selects 3 multi-label classification algorithms for comparison: ML-RBF, BP-MLL and RANK-SVM. The two most typical pairs of evaluation criteria, {HL, RL} and {MicF1, AP}, are used as the objective functions of the experiments to construct the multi-objective-optimized multi-label classification models, i.e. DWNBML-EMO{HL, RL} and DWNBML-EMO{MicF1, AP}, where HL and RL, as well as MicF1 and AP, conflict with each other inherently. To reduce the performance impact of preference within a model, the EN model selection strategy is used in the model selection phase. From Fig. 3(a) and Fig. 3(b) it can be seen that the population of the model converges to the Pareto front after 100 generations, and the model performance changes little once the population size exceeds 50; therefore the population size N of the optimization algorithm is set to 50 and the number of generations G is set to 100.
In this case the data models are trained with cross-validation. Each dataset is tested with the different algorithms, the classification results of the test samples are obtained, and the evaluation metric values are computed according to the formulas of the evaluation metrics in Table 2-1. Table 2, Table 3 and Table 4 list the evaluation metric values obtained by each algorithm on the Yeast, Tmc2007 and Tongji University news datasets, respectively.
Table 2: Experimental results of the various methods on the Yeast dataset
Table 3: Experimental results of the various methods on the Tmc2007 dataset
Table 4: Experimental results of the various methods on the domain-related dataset
In the 3 tables, a bold value indicates that the algorithm achieves the best result among all algorithms for that metric, and the total score is the number of metrics on which the algorithm performs best, i.e. the number of bold values in the table. DWNBML-EMO{HL, RL} scores 3, 4 and 1 on the three datasets, and DWNBML-EMO{MicF1, AP} scores 6, 5 and 6. It can be seen that the DWNBML-EMO algorithm achieves a large improvement in the total performance score.

Claims (3)

1. A multi-label classification system based on multi-objective optimization, characterized by comprising:
an online data acquisition module, for acquiring text data relevant to a specific domain from the Internet;
a model training module, including a doubly weighted naive Bayes multi-label classification model, wherein the naive Bayes multi-label classification model converts the multi-label classification problem into multiple binary classification problems and weakens the attribute independence assumption of naive Bayes through a weighting scheme on the sample feature attributes, thereby constructing a naive Bayes multi-label classification model; randomly generated attribute weights are then used as the genes of a genetic algorithm, and the NSGA-II genetic algorithm is used to optimize multiple objectives of the naive Bayes multi-label classification model to obtain multiple optimal models, the set of optimal classification models finally being obtained from the training set;
a model selection module, in which the user weighs the optimal solutions according to different application scenarios and obtains the optimal solution in the model selection module using different model selection strategies.
2. The multi-label classification system based on multi-objective optimization according to claim 1, characterized in that the formula of the naive Bayes multi-label classification model is as follows:
In the formula, P(c_i | X) denotes the probability that sample X belongs to c_i, c_i denotes the i-th class of the class space C; P(c_i) denotes the probability of belonging to c_i over all samples; P(x_j) denotes the probability of the j-th attribute of the sample; w_ji denotes the weight of the j-th attribute for the i-th class.
3. The multi-label classification system based on multi-objective optimization according to claim 1, characterized in that the model selection strategies in the model selection module include a dynamic model selection strategy and an ensemble model selection strategy, in which:
dynamic model selection strategy: the dynamic model selection strategy selects the top K classification models with respect to a preference objective O to predict the labels of the test dataset T separately, and then determines the predicted label set Y of the test dataset T by weighted voting;
ensemble model selection strategy: the ensemble model selection strategy uses the N classification models P generated by the optimization algorithm to classify the test dataset T separately, and then applies a voting mechanism to predict the label set Y.
CN201910327482.4A 2019-04-23 2019-04-23 A multi-label classification system based on multi-objective optimization Pending CN110059756A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910327482.4A CN110059756A (en) 2019-04-23 2019-04-23 A multi-label classification system based on multi-objective optimization

Publications (1)

Publication Number Publication Date
CN110059756A true CN110059756A (en) 2019-07-26

Family

ID=67320149

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910327482.4A Pending CN110059756A (en) 2019-04-23 2019-04-23 A multi-label classification system based on multi-objective optimization

Country Status (1)

Country Link
CN (1) CN110059756A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050289089A1 (en) * 2004-06-28 2005-12-29 Naoki Abe Methods for multi-class cost-sensitive learning
US20090125461A1 (en) * 2007-11-09 2009-05-14 Microsoft Corporation Multi-Label Active Learning
CN101901345A (en) * 2009-05-27 2010-12-01 复旦大学 Classification method of differential proteomics
CN105095494A (en) * 2015-08-21 2015-11-25 中国地质大学(武汉) Method for testing categorical data set
CN106960041A (en) * 2017-03-28 2017-07-18 山西同方知网数字出版技术有限公司 A kind of structure of knowledge method based on non-equilibrium data
CN107526805A (en) * 2017-08-22 2017-12-29 杭州电子科技大学 A kind of ML kNN multi-tag Chinese Text Categorizations based on weight

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘勘 et al., "中文微博的立场判别研究" (Research on stance detection in Chinese microblogs), 《知识管理论坛》 (Knowledge Management Forum) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110568286A (en) * 2019-09-12 2019-12-13 齐鲁工业大学 Transformer fault diagnosis method and system based on weighted double-hidden naive Bayes
CN110852441A (en) * 2019-09-26 2020-02-28 温州大学 Fire early warning method based on improved naive Bayes algorithm
CN110852441B (en) * 2019-09-26 2023-06-09 温州大学 Fire disaster early warning method based on improved naive Bayes algorithm
CN111476321A (en) * 2020-05-18 2020-07-31 哈尔滨工程大学 Air flyer identification method based on feature weighting Bayes optimization algorithm
CN112988981A (en) * 2021-05-14 2021-06-18 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Automatic labeling method based on genetic algorithm
CN112988981B (en) * 2021-05-14 2021-10-15 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Automatic labeling method based on genetic algorithm
CN113827981A (en) * 2021-08-17 2021-12-24 杭州电魂网络科技股份有限公司 Game loss user prediction method and system based on naive Bayes

Similar Documents

Publication Publication Date Title
CN110059756A (en) A multi-label classification system based on multi-objective optimization
Lines et al. Hive-cote: The hierarchical vote collective of transformation-based ensembles for time series classification
Chen et al. Supervised feature selection with a stratified feature weighting method
Li et al. Exploration of classification confidence in ensemble learning
Kong et al. Transductive multilabel learning via label set propagation
Idris et al. Intelligent churn prediction in telecom: employing mRMR feature selection and RotBoost based ensemble classification
Ghanem et al. Multi-class pattern classification in imbalanced data
CN106991296B (en) Integrated classification method based on randomized greedy feature selection
CN111709299B (en) Underwater sound target identification method based on weighting support vector machine
Idris et al. Customer churn prediction for telecommunication: Employing various features selection techniques and tree based ensemble classifiers
Cai et al. Imbalanced evolving self-organizing learning
Bi et al. Hierarchical multilabel classification with minimum bayes risk
CN111325264A (en) Multi-label data classification method based on entropy
Zhu et al. Collaborative decision-reinforced self-supervision for attributed graph clustering
Xie et al. Margin distribution based bagging pruning
kenji Nakano et al. Stacking methods for hierarchical classification
Zhang et al. Feature selection for high dimensional imbalanced class data based on F-measure optimization
CN105046323A (en) Regularization-based RBF network multi-label classification method
Pang et al. Improving deep forest by screening
Al Nuaimi et al. Online streaming feature selection with incremental feature grouping
Ali et al. A k nearest neighbour ensemble via extended neighbourhood rule and feature subsets
Liu et al. A weight-incorporated similarity-based clustering ensemble method
CN111832645A (en) Classification data feature selection method based on discrete crow difference collaborative search algorithm
Altınçay Decision trees using model ensemble-based nodes
Xie et al. Churn prediction with linear discriminant boosting algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190726