CN106991296A - Ensemble classification method based on randomized greedy feature selection - Google Patents

Ensemble classification method based on randomized greedy feature selection

Info

Publication number
CN106991296A
CN106991296A (application CN201710209168.7A; granted as CN106991296B)
Authority
CN
China
Prior art keywords
feature
classification
sample
base classifier
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710209168.7A
Other languages
Chinese (zh)
Other versions
CN106991296B (en)
Inventor
孟军 (Meng Jun)
张晶 (Zhang Jing)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University of Technology
Priority to CN201710209168.7A
Publication of CN106991296A
Application granted
Publication of CN106991296B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G — PHYSICS
    • G16 — INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B — BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 40/00 — ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding


Abstract

An ensemble classification method based on randomized greedy feature selection, belonging to the fields of bioinformatics and data mining, classifies gene expression data related to plant stress response. The method comprises the following steps: (1) introduce randomness into a traditional greedy algorithm to perform feature selection; (2) use the weighted local modularity function, a community-detection evaluation index from complex-network analysis, as the heuristic information of the randomized greedy algorithm; (3) train a base classifier on each feature subset using the support vector machine algorithm; (4) cluster the base classifiers using the affinity propagation clustering algorithm; (5) integrate the base classifiers that serve as cluster exemplars and form the ensemble classification model by simple majority voting. The invention can identify from gene expression data whether a plant sample is stressed, greatly improves classification accuracy on microarray data, generalizes well, and is very stable.

Description

Ensemble classification method based on randomized greedy feature selection
Technical field
The invention belongs to the fields of bioinformatics and data mining, and more particularly relates to the selection of important genes from gene expression data and the construction of a selective ensemble classification model.
Background technology
The development of high-throughput sequencing technologies has provided researchers with massive gene expression data, and extracting valuable information from such data has become a research hotspot in bioinformatics. Plants are often affected by pests, diseases, and environmental factors during growth; predicting these stresses and taking preventive and control measures plays a very important role in the development of forestry, agriculture, animal husbandry, and environmental protection. Because gene expression data are high-dimensional, small-sample, and highly redundant, traditional single classification algorithms suffer from poor classification stability and low accuracy, so the analysis of such data requires classification models with stronger processing capability.
Because of the high dimensionality of gene expression data, important features must be selected for classification. Feature selection methods fall into three classes: filter, wrapper, and embedded. Simple and efficient filter methods are widely used in the analysis of gene expression data. Filter feature selection algorithms are divided into feature ranking and feature subset selection. Most current ranking methods ignore the interdependence between features and simply select individual features with strong classification ability. Feature subset selection methods can select a feature subset with strong classification ability while taking the overall classification performance of the feature set into account. Because finding the optimal feature subset is an NP-hard problem, a greedy algorithm is usually used to select a near-optimal feature subset, with the search guided by heuristic information that evaluates the classification performance of feature subsets. However, a traditional greedy algorithm explores only a very small region of the feature space and therefore yields merely a locally optimal solution. To solve this problem, randomness is introduced into the greedy algorithm.
In the paper "Introducing randomness into greedy ensemble pruning algorithms" (Applied Intelligence, 2015), Dai et al. improved traditional greedy ensemble pruning methods by introducing randomness to expand the search space of the greedy algorithm. By repeatedly executing the base classifier selection algorithm, multiple different sets of base classifiers are produced, and the set with the best classification performance is finally chosen to build the final ensemble classification model.
Traditional feature evaluation indices include mutual information, Pearson correlation, and rank-sum tests. In the paper "Feature Subset Selection for Cancer Classification Using Weight Local Modularity" (Scientific Reports, 2016), Zhao et al. proposed a feature selection algorithm based on a community-detection evaluation index from complex-network analysis and applied it to cancer data classification. This feature subset selection method uses the weighted local modularity index to evaluate the discriminative ability of a feature subset as a whole for the classes, rather than merely evaluating the classification ability of single features as most current evaluation indices do.
When the number of base classifiers is large, some redundant classifiers will exist, resulting in poor overall diversity. To improve ensemble classification performance, it is necessary to select among the base classifiers. Selective ensemble methods can be roughly divided into four classes: iterative optimization methods, ranking methods, clustering methods, and pattern mining methods.
In the paper "LibD3C: ensemble classifiers with a clustering and dynamic selection strategy" (Neurocomputing, 2014), working on clustering-based base classifier selection, Lin et al. first clustered the base classifier set using the K-means algorithm and then selected classifiers from the generated clusters using a dynamic base classifier selection strategy with cyclic sequences. In the paper "A spectral clustering based ensemble pruning approach" (Neurocomputing, 2014), Zhang et al. clustered the base classifiers using spectral clustering, divided them into two groups, and used the group with the better classification performance for the final ensemble. The present invention proposes a base classifier selection method based on affinity propagation clustering, which clusters faster and more accurately because that clustering algorithm does not require the number of clusters or the initial points to be set in advance.
Summary of the invention
In view of the shortcomings of the prior art described above, the object of the present invention is to provide an ensemble classification method based on randomized greedy feature selection that can select important genes and classify whether a plant is stressed.
The ensemble classification method based on randomized greedy feature selection proceeds as follows:
(1) Introduce randomness into a traditional greedy algorithm to perform feature selection
Selecting the optimal feature subset is an NP-hard problem, so a greedy algorithm is usually used to choose an approximately optimal feature subset. A greedy algorithm always makes the choice that currently appears best when solving a problem; rather than pursuing global optimality, it produces a locally optimal solution or, in some sense, an approximation of the globally optimal solution.
Greedy algorithms come in two kinds, forward search and backward search: the first starts from an empty feature subset and finds the optimal feature set by adding features step by step; the second starts from the full feature set and explores the feature space by deleting features step by step. A traditional greedy algorithm performing feature selection generally searches only a very small problem space. For gene microarray data, whose features typically number in the tens of thousands, selecting important genes with a traditional greedy algorithm therefore yields only a locally optimal solution. Randomness is thus introduced: the first feature is chosen at random rather than according to fixed heuristic information, which expands the search space over the features.
(2) Use the weighted local modularity function, a community-detection evaluation index from complex-network analysis, as the heuristic information of the randomized greedy algorithm
To extract valuable information from data sets of ever-growing scale, data mining and complex-network theory arose at different times. Many systems, such as the Internet, social networks, human disease gene networks, and scientist collaboration networks, can be expressed as complex networks. Complex networks exhibit small-world, scale-free, and community-structure properties. The present invention combines data mining with complex networks, performing feature selection with a community-detection evaluation index from complex networks as heuristic information. Classifying and grouping things is a basic way in which humans solve problems; likewise, learning classification rules is an important research problem in machine learning, data mining, and the study of complex networks. Most complex networks in the real world are composed of groups, each of which is called a community. The basic units that determine network function are composed of the vertices and edges within each community. A community is a subset of vertices: vertices within the same community are densely connected, while connections between vertices of different communities are sparse. Community detection aims to detect and reveal the community structure inherent in different types of complex networks; it can help us understand the function of a complex network, discover the hidden rules within it, and predict its behavior.
The traditional modularity function Q suffers from resolution limits and extreme degeneracy, so the present invention uses the improved weighted local modularity function as the index for evaluating the classification performance of a gene subset. The weighted local modularity function is calculated as follows:
1) Build a weighted undirected graph G(V, A), where the samples in the gene microarray data set are the vertices of the graph. For any two vertices v_1 and v_2, if v_1 \in k\text{-NN}(v_2) or v_2 \in k\text{-NN}(v_1), an edge with weight W_E = \exp(-d(v_1, v_2)) exists between the two vertices; k\text{-NN}(v_1) contains the k nearest neighbours of vertex v_1, and d(v_1, v_2) is the distance between the two vertices.
2) The samples are naturally divided into communities according to their classes.
3) For each feature subset, its importance based on the weighted local modularity function is calculated as:

\mathrm{Sig} = \sum_{i=1}^{C} \left[ \frac{w_i}{W_i} - \left( \frac{v_i}{2W_i} \right)^2 \right]   (1-1)

where C is the number of classes of the gene microarray data set to be classified; w_i is the sum of the internal edge weights in the i-th community; W_i is the sum of the internal edge weights plus the adjacent edge weights of community i; and v_i is the sum of the degrees of all vertices in community i, the degree of a vertex being the sum of the weights of its incident edges.
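The computation above can be condensed into a short program. The following is a minimal NumPy sketch, assuming Euclidean distances and the mutual k-NN edge rule just described; the function and parameter names are illustrative, not taken from the patent:

```python
import numpy as np

def weighted_local_modularity(X, y, k=5):
    """Importance Sig of a feature subset, expression (1-1).

    X holds the samples restricted to the candidate feature subset,
    y the class labels; communities are the classes.
    """
    n = X.shape[0]
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise Euclidean distances
    nn = np.argsort(d, axis=1)[:, 1:k + 1]                     # k nearest neighbours per vertex
    W = np.zeros((n, n))
    for v1 in range(n):
        for v2 in range(v1 + 1, n):
            if v2 in nn[v1] or v1 in nn[v2]:                   # k-NN adjacency rule
                W[v1, v2] = W[v2, v1] = np.exp(-d[v1, v2])     # edge weight W_E
    sig = 0.0
    for c in np.unique(y):
        inside = (y == c)                                      # community = one class
        w_i = W[np.ix_(inside, inside)].sum() / 2.0            # internal edge weights w_i
        adjacent = W[np.ix_(inside, ~inside)].sum()            # edges leaving the community
        W_i = w_i + adjacent                                   # internal plus adjacent weights
        v_i = W[inside, :].sum()                               # sum of vertex degrees v_i
        sig += w_i / W_i - (v_i / (2.0 * W_i)) ** 2
    return sig
```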
With randomness introduced, the feature selection process based on the weighted local modularity function is as follows (a sketch in code follows the list):
1) Set the current feature subset F = {};
2) randomly select a feature and add it to F;
3) for each feature g not contained in F, calculate its importance according to the feature set F + {g};
4) find the feature g′ that maximizes the importance in step 3), set F = F + {g′}, and repeat this step until the number of features in the subset F reaches the maximum threshold.
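Building on the scorer above, steps 1)-4) can be sketched as follows; `max_features` stands for the maximum threshold, and each independent call yields one feature subset:

```python
import random

def randomized_greedy_selection(X, y, max_features=50, k=5):
    """One run of the randomized greedy loop, scored by the
    weighted_local_modularity sketch above; repeated runs yield the
    diverse feature subsets later used to train base classifiers."""
    candidates = set(range(X.shape[1]))
    F = [random.choice(tuple(candidates))]        # step 2: first feature at random
    while len(F) < max_features:
        remaining = candidates - set(F)
        # step 3: importance of every candidate extension F + {g}
        scores = {g: weighted_local_modularity(X[:, F + [g]], y, k)
                  for g in remaining}
        F.append(max(scores, key=scores.get))     # step 4: keep the best g'
    return F
```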
(3) Train a base classifier on each feature subset using the support vector machine algorithm
The support vector machine is a supervised learning method: given the classes of known sample points, it seeks the correspondence between sample points and classes, so that the samples of the training set are separated according to class, or the class of a new sample point is predicted. The samples of a training set may be linearly separable, approximately linearly separable, or linearly inseparable; these are exactly the three types of classification problems.
1) For a two-class problem, if the sample points on the two sides of some hyperplane are divided into a positive class and a negative class, the decision function whose sign infers the class of a sample x is:

f(x) = w^T x + b   (1-2)

where w is the normal vector of the hyperplane and determines its direction; b is the offset and determines the distance between the hyperplane and the origin; x is the vector representing the sample.
2) The classification model must find w and b such that the classification error rate of the prediction function f(x) on the original samples is minimal. A loss function is a measure specifically used to evaluate prediction accuracy. The SVM method was proposed from the viewpoint of the optimal separating hyperplane in the linearly separable case: the optimal separating hyperplane must not only separate the two classes of samples without error but also maximize the margin between them. The former guarantees empirical risk minimization, while maximizing the margin in fact minimizes the confidence risk. The maximum-margin hyperplane must satisfy:

\min \frac{1}{2}\|w\|^2   (1-3)
\text{s.t. } y_j[(w^T x_j) + b] - 1 \ge 0,\ j = 1, 2, \ldots, n

where y_j is the class label of sample x_j.
3) The maximum-margin optimization problem of the optimal separating hyperplane is converted into its dual problem, so that the original classification problem is solved through the relatively simpler dual problem:

\max_{\alpha} \sum_{p=1}^{n} \alpha_p - \frac{1}{2} \sum_{p=1}^{n} \sum_{q=1}^{n} \alpha_p \alpha_q y_p y_q x_p^T x_q   (1-4)
\text{s.t. } \sum_{p=1}^{n} \alpha_p y_p = 0,\ \alpha_p \ge 0,\ p = 1, 2, \ldots, n

where \alpha_p and \alpha_q are the Lagrange multiplier coefficients of each sample obtained by applying the method of Lagrange multipliers to the dual problem.
4) Nonlinear classification is handled by introducing slack variables and a penalty factor, allowing a certain amount of classification error; the optimization objective is:

\min \frac{1}{2}\|w\|^2 + C \sum_{j=1}^{n} \zeta_j   (1-5)
\text{s.t. } y_j[(w^T x_j + b)] \ge 1 - \zeta_j,\ j = 1, 2, \ldots, n

where \zeta_j is a slack variable and C is the weight of the slack variables.
5) Through the nonlinear transformation defined by an inner-product (kernel) function, the SVM maps the input space to a higher-dimensional space and then seeks the optimal separating hyperplane in that space, so that a problem that is linearly inseparable in the low-dimensional space becomes linearly separable in the high-dimensional space. Let \varphi(x) denote the feature vector after mapping x; the model of the separating hyperplane in the feature space and the corresponding optimization model are:

f(x) = w^T \varphi(x) + b   (1-6)
\min \frac{1}{2}\|w\|^2   (1-7)
\text{s.t. } y_j[(w^T \varphi(x_j)) + b] - 1 \ge 0,\ j = 1, 2, \ldots, n

6) Because the dimension of the feature space may be very high or even infinite, directly computing \varphi(x_p)^T \varphi(x_q) is usually difficult, so a kernel function is introduced; its ingenuity lies in reducing the solution of a complicated optimization problem to inner-product operations on the original sample data:

\kappa(x_p, x_q) = \varphi(x_p)^T \varphi(x_q)   (1-8)
\max_{\alpha} \sum_{p=1}^{n} \alpha_p - \frac{1}{2} \sum_{p=1}^{n} \sum_{q=1}^{n} \alpha_p \alpha_q y_p y_q \kappa(x_p, x_q)   (1-9)
\text{s.t. } \sum_{p=1}^{n} \alpha_p y_p = 0,\ \alpha_p \ge 0,\ p = 1, 2, \ldots, n
By repeatedly executing the feature selection process of step (2), multiple feature subsets are produced. Each feature subset forms a corresponding training set used to train one SVM base classifier, as sketched below.
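A minimal sketch of this training step with scikit-learn, assuming the feature subsets come from repeated runs of the selection loop above; the RBF kernel follows the embodiment described later, and the remaining settings are illustrative defaults:

```python
from sklearn.svm import SVC

def train_base_classifiers(X_train, y_train, subsets):
    """One SVM base classifier per feature subset; `subsets` is a list of
    feature-index lists from repeated runs of randomized_greedy_selection."""
    base = []
    for F in subsets:
        clf = SVC(kernel='rbf')              # K(x, y) = exp(-gamma * ||x - y||^2)
        clf.fit(X_train[:, F], y_train)      # train on the subset's expression values
        base.append((F, clf))
    return base
```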
(4) Cluster the base classifiers using the affinity propagation clustering algorithm
1) Build the similarity matrix S as the input of the affinity propagation clustering algorithm, taking the classification results of each base classifier on the validation set as data points. Element s(e, m) of the matrix represents the similarity between data points e and m; a larger value indicates a greater similarity between the two data points.
In the gene selection stage, N gene subsets are selected. Each gene subset is used to form one training set containing only the expression values of the samples over that gene subset. N base classifiers are therefore obtained by training, and the classification results of each base classifier on the validation set constitute one data point, so that element s(e, m) (e = 1, 2, ..., N; m = 1, 2, ..., N) of the similarity matrix represents the similarity between base classifiers H_e and H_m. In calculating the similarity, the first consideration is the classification performance of the classifiers; the different numbers of features selected by the base classifiers are also a key factor. The similarity between base classifiers H_e and H_m is defined as:

s(e, m) = (N_{tt} + N_{ff}) / (N_{tt} + N_{tf} + N_{ft} + N_{ff}) - DN(e, m)   (1-10)

where N_{tt} is the number of validation samples classified correctly by both base classifiers at the same time, and N_{ff} is the number of validation samples misclassified by both at the same time. The number of validation samples classified correctly by H_e but misclassified by H_m is denoted N_{tf}; N_{ft} is the converse. The ratio of the number of samples on which the two classifiers agree in correctness to the total number of validation samples is exactly the similarity of their classification performance. DN(e, m) is the proportion of genes that differ between the gene sets used by the two base classifiers, relative to the total number of genes. (A sketch of this computation is given after this list.)
2) Set the values s(h, h) on the diagonal of the similarity matrix. This value is called the point of reference of data point h, i.e. of a base classifier's classification results on the validation set; the larger the value, the more suitable the data point is to serve as a cluster center, and the more clusters are generated. To ensure every data point has the same chance to become a cluster exemplar, the points of reference of all data points are set to the same value.
3) In the AP clustering algorithm, every data point is regarded as a potential cluster center, and messages are passed between data points continually until the algorithm converges or the iteration ends. The AP algorithm passes two kinds of messages during iteration: the responsibility r(e, m) reflects how well data point m is suited to serve as the cluster center of data point e, and the availability a(e, m) reflects the tendency of data point e to choose data point m as its cluster center. They are calculated as:

r(e, m) = s(e, m) - \max_{l \ne m} \{ a(e, l) + s(e, l) \},\ l \in \{1, 2, \ldots, N\}   (1-11)
a(e, m) = \min \left\{ 0,\ r(m, m) + \sum_{l \notin \{e, m\}} \max(0, r(l, m)) \right\}   (1-12)

To improve the stability of the AP algorithm, a damping coefficient \lambda is introduced, so that r(e, m) and a(e, m) are constrained by the values computed in the previous iteration. The improved update formulas are:

r_t = (1 - \lambda) r_t + \lambda r_{t-1}   (1-13)
a_t = (1 - \lambda) a_t + \lambda a_{t-1}   (1-14)

where r_t and a_t are the results of the t-th iteration, and r_{t-1} and a_{t-1} those of the (t-1)-th.
4) AP clustering determines the cluster exemplars automatically: if r(h, h) + a(h, h) > 0 during the iteration, data point h is selected as a cluster center. After the iteration ends, every remaining data point is assigned to its nearest cluster center.
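A minimal sketch of steps 1)-4), computing the similarity matrix of expression (1-10) and delegating the damped message passing of expressions (1-11)-(1-14) to scikit-learn's AffinityPropagation; the reading of DN(e, m) as the fraction of genes not shared by the two subsets, and the damping value, are assumptions:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

def classifier_similarity(preds, subsets, y_val):
    """Similarity matrix S of expression (1-10); preds[e] holds H_e's
    predictions on the validation set, subsets[e] its gene subset."""
    N = len(preds)
    S = np.zeros((N, N))
    for e in range(N):
        for m in range(N):
            # (N_tt + N_ff) / total: samples where H_e and H_m are
            # both correct or both wrong
            agree = np.mean((preds[e] == y_val) == (preds[m] == y_val))
            # DN(e, m): genes not shared, over the total genes of both
            # subsets (one reading of the patent's wording)
            total = set(subsets[e]) | set(subsets[m])
            dn = len(set(subsets[e]) ^ set(subsets[m])) / len(total)
            S[e, m] = agree - dn
    return S

def cluster_base_classifiers(S, damping=0.9, preference=0.1):
    """Affinity propagation on S; `preference` is the shared point of
    reference s(h, h) (0.1 in the embodiment), `damping` an illustrative
    choice for the coefficient lambda."""
    ap = AffinityPropagation(affinity='precomputed', damping=damping,
                             preference=preference)
    labels = ap.fit_predict(S)
    return ap.cluster_centers_indices_, labels   # exemplar indices, cluster labels
```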
(5) Integrate the base classifiers that serve as cluster exemplars and form the ensemble classification model by simple majority voting
Within the clusters of base classifiers thus formed, base classifiers in the same cluster are highly similar, while base classifiers belonging to different clusters differ considerably. Selecting the representative base classifier at each cluster center for integration therefore guarantees the diversity among the base classifiers used in the ensemble. Finally, the classification results of the selected base classifiers are fused by simple majority voting to form the ensemble classification model, as sketched below.
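A minimal sketch of the majority vote over the exemplar classifiers returned above (names are illustrative):

```python
import numpy as np

def ensemble_predict(base, exemplars, X_test):
    """Simple majority vote over the exemplar base classifiers;
    `base` pairs each feature subset with its trained SVM."""
    votes = np.array([clf.predict(X_test[:, F])
                      for F, clf in (base[i] for i in exemplars)])
    predictions = []
    for column in votes.T:                             # one column per test sample
        values, counts = np.unique(column, return_counts=True)
        predictions.append(values[np.argmax(counts)])  # class with the most votes
    return np.array(predictions)
```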
Beneficial effects of the present invention:
(1) Introducing randomness into the traditional greedy feature selection method expands the search range over the feature space.
(2) Heuristic information based on weighted local modularity measures the classification ability of a feature subset as a whole.
(3) Selective ensembling further improves the efficiency of the whole system as well as the classification ability of the ensemble classification model.
(4) Traditional feature selection and classification methods are improved for the characteristics of gene expression data, greatly improving classification performance on gene microarray data; the algorithm has low complexity and a fast running speed, and can readily be applied to the analysis of gene expression data.
Brief description of the drawings
Fig. 1 is the overall flowchart of the ensemble classification method based on randomized greedy feature selection of the present invention.
Fig. 2 is the architecture diagram of the ensemble classification method based on randomized greedy feature selection of the present invention.
Embodiment
As shown in Fig. 1, the general design idea of the present invention is as follows. Because gene expression data are high-dimensional, small-sample, and highly redundant, important genes must be selected before classification. First, the randomized greedy algorithm, using the weighted local modularity function as heuristic information, selects gene subsets. Repeated randomized feature selection produces multiple feature subsets, forming multiple different training sets for the ensemble classification model. Randomized feature selection not only filters out important genes for the classification model but also expands its search range over the feature space. To further improve the classification performance of the ensemble model and the efficiency of classification, the base classifiers are selected by a method based on affinity propagation clustering, and base classifiers that are highly diverse and classify well are picked out for the final ensemble.
Fig. 2 is the architecture diagram of the ensemble classification model of the present invention; the method comprises the following steps:
(1) Perform feature selection using the randomized greedy algorithm
1) For gene microarray data, the number of features is typically in the tens of thousands, so selecting important features with a traditional greedy algorithm yields only a locally optimal solution. The first feature is chosen at random rather than according to fixed heuristic information, which expands the search space over the features.
2) Use the weighted local modularity function, a community-detection evaluation index from complex-network analysis, as the heuristic information of the randomized greedy algorithm.
The important gene subset is selected by forward stepwise addition as follows:
(a) Set the current feature subset F = {};
(b) randomly select a feature and add it to F;
(c) for each feature g not contained in F, build the weighted undirected graph G(V, A) according to the feature set F + {g}. The samples in the data set are the vertices of the graph; for any two vertices v_1 and v_2, if v_1 \in k\text{-NN}(v_2) or v_2 \in k\text{-NN}(v_1), an edge with weight W_E = \exp(-d(v_1, v_2)) exists between the two samples. k\text{-NN}(v_1) contains the k nearest neighbours of vertex v_1, and distances between samples are computed with the Euclidean distance. In the experiments, k was stepped from 1 to 25 in increments of 2, and the value of k with the best classification performance was chosen;
(d) for each feature not contained in F, calculate its importance based on the weighted local modularity function, expression (1-1) above;
(e) find the g′ that maximizes the importance of step (d), set F = F + {g′}, and repeat this step until the number of features in the subset F reaches 50; experiments show that the best classification performance is obtained with between 10 and 20 genes.
(2) Train a base classifier on each feature subset using the support vector machine algorithm
By repeatedly executing the feature selection process of step (1), N feature subsets are produced. 60% of the samples are chosen as the training set; each feature subset forms a corresponding training set used to train a support vector machine, producing N SVM classifiers. The kernel function of the SVM classifiers is set to the RBF kernel, K(x, y) = \exp(-\gamma \|x - y\|^2), which adapts well to data sets of various characteristics, whether large-sample or small-sample, high-dimensional or low-dimensional.
(3) Cluster the base classifiers using the affinity propagation clustering algorithm
1) Build the similarity matrix S as the input of the affinity propagation clustering algorithm.
20% of the samples are chosen as the validation set, and the base classifiers are clustered according to their classification results on it. The similarity between base classifiers H_e and H_m is defined by expression (1-10) above:

s(e, m) = (N_{tt} + N_{ff}) / (N_{tt} + N_{tf} + N_{ft} + N_{ff}) - DN(e, m)

2) Set the values s(h, h) on the matrix diagonal; to ensure every data point has the same chance to become a cluster exemplar, the points of reference of all data points are set to the same value, 0.1.
3) In the AP clustering algorithm, every data point is regarded as a potential cluster center, and two kinds of messages are passed between data points continually until the algorithm converges or the iteration ends. r(e, m) represents how well data point m is suited to serve as the cluster center of data point e; a(e, m) represents the tendency of data point e to choose data point m as its cluster center; they are computed by expressions (1-11) and (1-12) above.
4) AP clustering determines the number of clusters automatically: if r(h, h) + a(h, h) > 0 during the iteration, data point h is selected as a cluster center. After the iteration ends, every remaining data point is assigned to its nearest cluster center.
(4) Integrate the base classifiers that serve as cluster exemplars and form the ensemble classification model by simple majority voting
Within the clusters of base classifiers thus formed, base classifiers in the same cluster are highly similar, while base classifiers belonging to different clusters differ considerably. Selecting the representative base classifier at each cluster center for integration therefore guarantees the diversity among the base classifiers used in the ensemble. Finally, the classification results of the selected base classifiers on the test set are fused by simple majority voting to form the ensemble classification model. Simple majority voting means that the final classification result for a sample is the class that the largest number of base classifiers agree on. An end-to-end sketch of the embodiment is given below.
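Tying the sketches above together, the embodiment can be outlined end to end as follows; the number of subsets N is not fixed by the patent, so `n_subsets` is an illustrative value:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def run_pipeline(X, y, n_subsets=40, max_features=50, k=5):
    """End-to-end sketch of the embodiment using the helpers above."""
    # 60% training / 20% validation / 20% test, as in the embodiment
    X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, train_size=0.6,
                                                  stratify=y)
    X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, test_size=0.5,
                                                stratify=y_rest)
    subsets = [randomized_greedy_selection(X_tr, y_tr, max_features, k)
               for _ in range(n_subsets)]
    base = train_base_classifiers(X_tr, y_tr, subsets)
    preds = [clf.predict(X_val[:, F]) for F, clf in base]
    S = classifier_similarity(preds, subsets, y_val)
    exemplars, _ = cluster_base_classifiers(S)
    y_pred = ensemble_predict(base, exemplars, X_te)
    return np.mean(y_pred == y_te)                # test-set accuracy
```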
By the method for the invention be applied to table 1,2 and 3 in arabidopsis data set, and by context of methods with it is existing integrated Sorting technique is compared.The accuracy rate and G-mean of the present invention is apparently higher than existing integrated approach.
Table 1 Experimental result comparison on the Arabidopsis-Drought data set
Table 2 Experimental result comparison on the Arabidopsis-Nitrogen data set
Table 3 Experimental result comparison on the Arabidopsis-TEV data set
In summary, the present invention designs an ensemble classification method based on randomized greedy feature selection that effectively improves the classification performance of the ensemble classification model. The invention can therefore be applied to the analysis of gene microarray data and provides a powerful tool for the timely and effective diagnosis of plant stress.

Claims (1)

1. An ensemble classification method based on randomized greedy feature selection, characterized in that the steps are as follows:
(1) Introduce randomness into a traditional greedy algorithm to perform feature selection
The first feature is randomly selected in order to expand the search space over the features;
(2) Use the weighted local modularity function, a community-detection evaluation index from complex-network analysis, as the heuristic information of the randomized greedy algorithm
Complex networks have small-world, scale-free, and community-structure properties; this ensemble classification method combines data mining with complex networks and performs feature selection using the community-detection evaluation index from complex networks as heuristic information;
The weighted local modularity function is calculated as follows:
1) Build a weighted undirected graph G(V, A), where the samples in the gene microarray data set are the vertices of the graph; for any two vertices v_1 and v_2, if v_1 \in k\text{-NN}(v_2) or v_2 \in k\text{-NN}(v_1), an edge with weight W_E = \exp(-d(v_1, v_2)) exists between the two vertices; k\text{-NN}(v_1) contains the k nearest neighbours of vertex v_1, and d(v_1, v_2) is the distance between the two vertices;
2) the samples are naturally divided into communities according to their classes;
3) for each feature subset, its importance based on the weighted local modularity function is calculated as:

\mathrm{Sig} = \sum_{i=1}^{C} \left[ \frac{w_i}{W_i} - \left( \frac{v_i}{2W_i} \right)^2 \right]   (1-1)

where C is the number of classes of the gene microarray data set to be classified; w_i is the sum of the internal edge weights in the i-th community; W_i is the sum of the internal edge weights plus the adjacent edge weights of community i; v_i is the sum of the degrees of all vertices in community i, the degree of a vertex being the sum of the weights of its incident edges;
With randomness introduced, the feature selection process based on the weighted local modularity function is as follows:
1) set the current feature subset F = {};
2) randomly select a feature and add it to F;
3) for each feature g not contained in F, calculate its importance according to the feature set F + {g};
4) find the feature g′ that maximizes the importance in step 3), set F = F + {g′}, and repeat this step until the number of features in the subset F reaches the maximum threshold;
(3) Train a base classifier on each feature subset using the support vector machine algorithm
1) For a two-class problem, if the sample points on the two sides of some hyperplane are divided into a positive class and a negative class, the decision function whose sign infers the class of a sample x is:

f(x) = w^T x + b   (1-2)

where w is the normal vector of the hyperplane and determines its direction; b is the offset and determines the distance between the hyperplane and the origin; x is the vector representing the sample;
2) under the condition that formula (1-3) is satisfied, find the maximum-margin hyperplane:

\min \frac{1}{2}\|w\|^2   (1-3)
\text{s.t. } y_j[(w^T x_j) + b] - 1 \ge 0,\ j = 1, 2, \ldots, n

where y_j is the class label of sample x_j;
3) convert the maximum-margin optimization problem of the optimal separating hyperplane into its dual problem, so that the original classification problem is solved through the relatively simpler dual problem:

\max_{\alpha} \sum_{p=1}^{n} \alpha_p - \frac{1}{2} \sum_{p=1}^{n} \sum_{q=1}^{n} \alpha_p \alpha_q y_p y_q x_p^T x_q   (1-4)
\text{s.t. } \sum_{p=1}^{n} \alpha_p y_p = 0,\ \alpha_p \ge 0,\ p = 1, 2, \ldots, n

where \alpha_p and \alpha_q are the Lagrange multiplier coefficients of each sample obtained by applying the method of Lagrange multipliers to the dual problem;
4) handle nonlinear classification by introducing slack variables and a penalty factor; the optimization objective is:

\min \frac{1}{2}\|w\|^2 + C \sum_{j=1}^{n} \zeta_j   (1-5)
\text{s.t. } y_j[(w^T x_j + b)] \ge 1 - \zeta_j,\ j = 1, 2, \ldots, n

where \zeta_j is a slack variable and C is the weight of the slack variables;
5) through the nonlinear transformation defined by an inner-product (kernel) function, the SVM maps the input space to a higher-dimensional space and then seeks the optimal separating hyperplane in that space, so that a problem that is linearly inseparable in the low-dimensional space becomes linearly separable in the high-dimensional space; let \varphi(x) denote the feature vector after mapping x; the model of the separating hyperplane in the feature space and the corresponding optimization model are:

f(x) = w^T \varphi(x) + b   (1-6)
\min \frac{1}{2}\|w\|^2   (1-7)
\text{s.t. } y_j[(w^T \varphi(x_j)) + b] - 1 \ge 0,\ j = 1, 2, \ldots, n
6) introduce a kernel function so that the solution of the complicated optimization problem is reduced to inner-product operations on the original sample data:

\kappa(x_p, x_q) = \varphi(x_p)^T \varphi(x_q)   (1-8)
\max_{\alpha} \sum_{p=1}^{n} \alpha_p - \frac{1}{2} \sum_{p=1}^{n} \sum_{q=1}^{n} \alpha_p \alpha_q y_p y_q \kappa(x_p, x_q)   (1-9)
\text{s.t. } \sum_{p=1}^{n} \alpha_p y_p = 0,\ \alpha_p \ge 0,\ p = 1, 2, \ldots, n

By repeatedly executing the feature selection process of step (2), multiple feature subsets are produced; each feature subset forms a corresponding training set used to train one SVM base classifier;
(4) Cluster the base classifiers using the affinity propagation clustering algorithm
1) Build the similarity matrix S as the input of the affinity propagation clustering algorithm, taking the classification results of each base classifier on the validation set as data points; element s(e, m) of the matrix represents the similarity between data points e and m, and a larger value indicates a greater similarity between the two data points;
In the gene selection stage, N gene subsets are selected; each gene subset is used to form one training set containing only the expression values of the samples over that gene subset; N base classifiers are therefore obtained by training, and the classification results of each base classifier on the validation set constitute one data point, so that element s(e, m) of the similarity matrix represents the similarity between base classifiers H_e and H_m, where e = 1, 2, ..., N and m = 1, 2, ..., N; in calculating the similarity, the first consideration is the classification performance of the classifiers, and the different numbers of features selected by the base classifiers are also a key factor; the similarity between base classifiers H_e and H_m is defined as:

s(e, m) = (N_{tt} + N_{ff}) / (N_{tt} + N_{tf} + N_{ft} + N_{ff}) - DN(e, m)   (1-10)

where N_{tt} is the number of validation samples classified correctly by both base classifiers at the same time, and N_{ff} is the number of validation samples misclassified by both at the same time; the number of validation samples classified correctly by H_e but misclassified by H_m is denoted N_{tf}, and N_{ft} is the converse; the ratio of the number of samples on which the two base classifiers agree in correctness to the total number of validation samples is exactly the similarity of their classification performance; DN(e, m) is the proportion of genes that differ between the gene sets used by the two base classifiers, relative to the total number of genes;
2) set the values s(h, h) on the diagonal of the similarity matrix; this value is called the point of reference of data point h, i.e. of a base classifier's classification results on the validation set; the larger the value, the more suitable the data point is to serve as a cluster center, and the more clusters are generated; to ensure every data point has the same chance to become a cluster exemplar, the points of reference of all data points are set to the same value;
3) in the AP clustering algorithm, every data point is regarded as a potential cluster center, and messages are passed between data points continually until the algorithm converges or the iteration ends; the AP algorithm passes two kinds of messages during iteration: r(e, m) represents how well data point m is suited to serve as the cluster center of data point e, and a(e, m) represents the tendency of data point e to choose data point m as its cluster center; they are calculated as:

r(e, m) = s(e, m) - \max_{l \ne m} \{ a(e, l) + s(e, l) \},\ l \in \{1, 2, \ldots, N\}   (1-11)
a(e, m) = \min \left\{ 0,\ r(m, m) + \sum_{l \notin \{e, m\}} \max(0, r(l, m)) \right\}   (1-12)

to improve the stability of the AP algorithm, a damping coefficient \lambda is introduced, so that r(e, m) and a(e, m) are constrained by the values computed in the previous iteration; the improved update formulas are:

r_t = (1 - \lambda) r_t + \lambda r_{t-1}   (1-13)
a_t = (1 - \lambda) a_t + \lambda a_{t-1}   (1-14)

where r_t and a_t are the results of the t-th iteration, and r_{t-1} and a_{t-1} those of the (t-1)-th;
4) AP clustering determines the cluster exemplars automatically: if r(h, h) + a(h, h) > 0 during the iteration, data point h is selected as a cluster center; after the iteration ends, every remaining data point is assigned to its nearest cluster center;
(5) Integrate the base classifiers that serve as cluster exemplars and form the ensemble classification model by simple majority voting.
CN201710209168.7A 2017-04-01 2017-04-01 Integrated classification method based on randomized greedy feature selection Active CN106991296B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710209168.7A CN106991296B (en) 2017-04-01 2017-04-01 Integrated classification method based on randomized greedy feature selection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710209168.7A CN106991296B (en) 2017-04-01 2017-04-01 Integrated classification method based on randomized greedy feature selection

Publications (2)

Publication Number Publication Date
CN106991296A true CN106991296A (en) 2017-07-28
CN106991296B CN106991296B (en) 2019-12-27

Family

ID=59415350

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710209168.7A Active CN106991296B (en) 2017-04-01 2017-04-01 Integrated classification method based on randomized greedy feature selection

Country Status (1)

Country Link
CN (1) CN106991296B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107609588A (en) * 2017-09-12 2018-01-19 大连大学 A kind of disturbances in patients with Parkinson disease UPDRS score Forecasting Methodologies based on voice signal
CN108021940A (en) * 2017-11-30 2018-05-11 中国银联股份有限公司 data classification method and system based on machine learning
CN108763873A (en) * 2018-05-28 2018-11-06 苏州大学 A kind of gene sorting method and relevant device
CN108845560A (en) * 2018-05-30 2018-11-20 国网浙江省电力有限公司宁波供电公司 A kind of power scheduling log Fault Classification
CN109800790A (en) * 2018-12-24 2019-05-24 厦门大学 A kind of feature selection approach towards high dimensional data
CN109801681A (en) * 2018-12-11 2019-05-24 江苏大学 A kind of SNP selection method based on improved fuzzy clustering algorithm
CN110674865A (en) * 2019-09-20 2020-01-10 燕山大学 Rule learning classifier integration method oriented to software defect class distribution unbalance
CN111081321A (en) * 2019-12-18 2020-04-28 江南大学 CNS drug key feature identification method
CN111178533A (en) * 2018-11-12 2020-05-19 第四范式(北京)技术有限公司 Method and device for realizing automatic semi-supervised machine learning
CN113743436A (en) * 2020-06-29 2021-12-03 北京沃东天骏信息技术有限公司 Feature selection method and device for generating user portrait
CN113820123A (en) * 2021-08-18 2021-12-21 北京航空航天大学 Gearbox fault diagnosis method based on improved CNN and selective integration

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105279520A (en) * 2015-09-25 2016-01-27 天津师范大学 Optimal character subclass selecting method based on classification ability structure vector complementation
CN105740891A (en) * 2016-01-27 2016-07-06 北京工业大学 Target detection method based on multilevel characteristic extraction and context model

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105279520A (en) * 2015-09-25 2016-01-27 天津师范大学 Optimal character subclass selecting method based on classification ability structure vector complementation
CN105740891A (en) * 2016-01-27 2016-07-06 北京工业大学 Target detection method based on multilevel characteristic extraction and context model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JUN MENG, JING ZHANG, YUSHI LUAN: "Gene Selection Integrated with Biological Knowledge for Plant Stress Response Using Neighborhood System and Rough Set Theory", IEEE/ACM Transactions on Computational Biology and Bioinformatics *
孟军, 尉双云: "Ensemble feature selection method based on affinity propagation clustering" (基于近邻传播聚类的集成特征选择方法), Computer Science (《计算机科学》) *
朱倩: "Research on associative classification algorithms for data with uncertain attributes" (属性不确定数据关联分类算法研究), China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库 信息科技辑》) *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107609588A (en) * 2017-09-12 2018-01-19 大连大学 A kind of disturbances in patients with Parkinson disease UPDRS score Forecasting Methodologies based on voice signal
CN107609588B (en) * 2017-09-12 2020-08-18 大连大学 Parkinson patient UPDRS score prediction method based on voice signals
CN108021940A (en) * 2017-11-30 2018-05-11 中国银联股份有限公司 data classification method and system based on machine learning
CN108021940B (en) * 2017-11-30 2023-04-18 中国银联股份有限公司 Data classification method and system based on machine learning
CN108763873A (en) * 2018-05-28 2018-11-06 苏州大学 A kind of gene sorting method and relevant device
CN108845560B (en) * 2018-05-30 2021-07-13 国网浙江省电力有限公司宁波供电公司 Power dispatching log fault classification method
CN108845560A (en) * 2018-05-30 2018-11-20 国网浙江省电力有限公司宁波供电公司 A kind of power scheduling log Fault Classification
CN111178533B (en) * 2018-11-12 2024-04-16 第四范式(北京)技术有限公司 Method and device for realizing automatic semi-supervised machine learning
CN111178533A (en) * 2018-11-12 2020-05-19 第四范式(北京)技术有限公司 Method and device for realizing automatic semi-supervised machine learning
CN109801681A (en) * 2018-12-11 2019-05-24 江苏大学 A kind of SNP selection method based on improved fuzzy clustering algorithm
CN109800790A (en) * 2018-12-24 2019-05-24 厦门大学 A kind of feature selection approach towards high dimensional data
CN110674865B (en) * 2019-09-20 2023-04-07 燕山大学 Rule learning classifier integration method oriented to software defect class distribution unbalance
CN110674865A (en) * 2019-09-20 2020-01-10 燕山大学 Rule learning classifier integration method oriented to software defect class distribution unbalance
CN111081321A (en) * 2019-12-18 2020-04-28 江南大学 CNS drug key feature identification method
CN111081321B (en) * 2019-12-18 2023-10-31 江南大学 CNS drug key feature identification method
CN113743436A (en) * 2020-06-29 2021-12-03 北京沃东天骏信息技术有限公司 Feature selection method and device for generating user portrait
CN113820123A (en) * 2021-08-18 2021-12-21 北京航空航天大学 Gearbox fault diagnosis method based on improved CNN and selective integration

Also Published As

Publication number Publication date
CN106991296B (en) 2019-12-27

Similar Documents

Publication Publication Date Title
CN106991296A (en) Ensemble classifier method based on the greedy feature selecting of randomization
Marinakis et al. Particle swarm optimization for pap-smear diagnosis
CN109063719B (en) Image classification method combining structure similarity and class information
CN107103332A (en) A kind of Method Using Relevance Vector Machine sorting technique towards large-scale dataset
CN106778832A (en) The semi-supervised Ensemble classifier method of high dimensional data based on multiple-objection optimization
Shi et al. Multi-label ensemble learning
CN106126972A (en) A kind of level multi-tag sorting technique for protein function prediction
CN104850890A (en) Method for adjusting parameter of convolution neural network based on example learning and Sadowsky distribution
CN107273909A (en) The sorting algorithm of high dimensional data
CN110378366A (en) A kind of cross-domain image classification method based on coupling knowledge migration
CN107943856A (en) A kind of file classification method and system based on expansion marker samples
CN110245252A (en) Machine learning model automatic generation method based on genetic algorithm
CN105760888A (en) Neighborhood rough set ensemble learning method based on attribute clustering
CN108734223A (en) The social networks friend recommendation method divided based on community
CN102324038A (en) A kind of floristics recognition methods based on digital picture
CN103914705A (en) Hyperspectral image classification and wave band selection method based on multi-target immune cloning
Saraswati et al. High-resolution Self-Organizing Maps for advanced visualization and dimension reduction
CN101866489A (en) Image dividing method based on immune multi-object clustering
CN102622609A (en) Method for automatically classifying three-dimensional models based on support vector machine
CN103971136A (en) Large-scale data-oriented parallel structured support vector machine classification method
CN101295362A (en) Combination supporting vector machine and pattern classification method of neighbor method
Xie et al. Margin distribution based bagging pruning
Dale et al. Quantitative analysis of ecological networks
CN105046323A (en) Regularization-based RBF network multi-label classification method
CN103258212A (en) Semi-supervised integrated remote-sensing image classification method based on attractor propagation clustering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant