CN106991296A - Ensemble classification method based on randomized greedy feature selection - Google Patents

Ensemble classification method based on randomized greedy feature selection

Info

Publication number
CN106991296A
CN106991296A (application CN201710209168.7A; granted as CN106991296B)
Authority
CN
China
Prior art keywords
feature
classification
sample
base classifier
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710209168.7A
Other languages
Chinese (zh)
Other versions
CN106991296B (en)
Inventor
孟军 (Meng Jun)
张晶 (Zhang Jing)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University of Technology
Priority to CN201710209168.7A
Publication of CN106991296A
Application granted
Publication of CN106991296B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G — PHYSICS
    • G16 — INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B — BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 40/00 — ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding


Abstract

An ensemble classification method based on randomized greedy feature selection, belonging to the fields of bioinformatics and data mining, classifies gene expression data related to plant stress response. The method comprises the following steps: (1) introduce randomness into a traditional greedy algorithm to perform feature selection; (2) use the weighted local modularity function, a community-detection evaluation index from complex-network analysis, as the heuristic information of the randomized greedy algorithm; (3) train a base classifier on each feature subset using the support vector machine algorithm; (4) cluster the base classifiers using the affinity propagation clustering algorithm; (5) integrate the base classifiers that serve as cluster exemplars and form the ensemble classification model by simple majority voting. The invention can identify from gene expression data whether a plant sample is stressed, greatly improves classification accuracy on microarray data, generalizes well, and is very stable.

Description

Ensemble classification method based on randomized greedy feature selection
Technical field
The invention belongs to the fields of bioinformatics and data mining, and more particularly relates to the selection of important genes from gene expression data and the construction of a selective ensemble classification model.
Background technology
The development of high-throughput sequencing technologies has provided researchers with massive gene expression data, and extracting valuable information from such data has become a research hotspot in bioinformatics. Plants are often affected by pests, diseases, and environmental factors during growth; predicting these stresses and taking preventive and control measures plays a very important role in the development of forestry, agriculture, animal husbandry, and environmental protection. Because gene expression data are high-dimensional, small-sample, and highly redundant, traditional single classification algorithms suffer from poor classification stability and low accuracy, so the analysis of such data requires classification models with stronger processing capability.
Because of the high dimensionality of gene expression data, important features must be selected for classification. Feature selection methods fall into three classes: filter, wrapper, and embedded. Simple and efficient filter methods are widely used in the analysis of gene expression data. Filter feature selection algorithms are divided into feature ranking and feature subset selection. Most current ranking methods ignore the interdependence between features and simply select individual features with strong classification ability. Feature subset selection methods can select a feature subset with strong classification ability while taking the overall classification performance of the feature set into account. Because finding the optimal feature subset is an NP-hard problem, a greedy algorithm is usually used to select a near-optimal feature subset, with the search guided by heuristic information that evaluates the classification performance of feature subsets. However, a traditional greedy algorithm explores only a very small region of the feature space and therefore yields merely a locally optimal solution. To solve this problem, randomness is introduced into the greedy algorithm.
In the paper "Introducing randomness into greedy ensemble pruning algorithms" (Applied Intelligence, 2015), Dai et al. improved traditional greedy ensemble pruning methods by introducing randomness to expand the search space of the greedy algorithm. By repeatedly executing the base classifier selection algorithm, multiple different sets of base classifiers are produced, and the set with the best classification performance is finally chosen to build the final ensemble classification model.
Traditional feature evaluation indices include mutual information, Pearson correlation, and rank-sum tests. In the paper "Feature Subset Selection for Cancer Classification Using Weight Local Modularity" (Scientific Reports, 2016), Zhao et al. proposed a feature selection algorithm based on a community-detection evaluation index from complex-network analysis and applied it to cancer data classification. This feature subset selection method uses the weighted local modularity index to evaluate the discriminative ability of a feature subset as a whole for the classes, rather than merely evaluating the classification ability of single features as most current evaluation indices do.
When the number of base classifiers is large, some redundant classifiers will exist, resulting in poor overall diversity. To improve ensemble classification performance, it is necessary to select among the base classifiers. Selective ensemble methods can be roughly divided into four classes: iterative optimization methods, ranking methods, clustering methods, and pattern mining methods.
In the paper "LibD3C: ensemble classifiers with a clustering and dynamic selection strategy" (Neurocomputing, 2014), working on clustering-based base classifier selection, Lin et al. first clustered the base classifier set using the K-means algorithm and then selected classifiers from the generated clusters using a dynamic base classifier selection strategy with cyclic sequences. In the paper "A spectral clustering based ensemble pruning approach" (Neurocomputing, 2014), Zhang et al. clustered the base classifiers using spectral clustering, divided them into two groups, and used the group with the better classification performance for the final ensemble. The present invention proposes a base classifier selection method based on affinity propagation clustering, which clusters faster and more accurately because that clustering algorithm does not require the number of clusters or the initial points to be set in advance.
Summary of the invention
In view of the shortcomings of the prior art described above, the object of the present invention is to provide an ensemble classification method based on randomized greedy feature selection that can select important genes and classify whether a plant is stressed.
The ensemble classification method based on randomized greedy feature selection proceeds as follows:
(1) Introduce randomness into a traditional greedy algorithm to perform feature selection
Selecting the optimal feature subset is an NP-hard problem, so a greedy algorithm is usually used to choose an approximately optimal feature subset. A greedy algorithm always makes the choice that currently appears best when solving a problem; rather than pursuing global optimality, it produces a locally optimal solution or, in some sense, an approximation of the globally optimal solution.
Greedy algorithms come in two kinds, forward search and backward search: the first starts from an empty feature subset and finds the optimal feature set by adding features step by step; the second starts from the full feature set and explores the feature space by deleting features step by step. A traditional greedy algorithm performing feature selection generally searches only a very small problem space. For gene microarray data, whose features typically number in the tens of thousands, selecting important genes with a traditional greedy algorithm therefore yields only a locally optimal solution. Randomness is thus introduced: the first feature is chosen at random rather than according to fixed heuristic information, which expands the search space over the features.
(2) Use the weighted local modularity function, a community-detection evaluation index from complex-network analysis, as the heuristic information of the randomized greedy algorithm
To extract valuable information from data sets of ever-growing scale, data mining and complex-network theory arose at different times. Many systems, such as the Internet, social networks, human disease gene networks, and scientist collaboration networks, can be expressed as complex networks. Complex networks exhibit small-world, scale-free, and community-structure properties. The present invention combines data mining with complex networks, performing feature selection with a community-detection evaluation index from complex networks as heuristic information. Classifying and grouping things is a basic way in which humans solve problems; likewise, learning classification rules is an important research problem in machine learning, data mining, and the study of complex networks. Most complex networks in the real world are composed of groups, each of which is called a community. The basic units that determine network function are composed of the vertices and edges within each community. A community is a subset of vertices: vertices within the same community are densely connected, while connections between vertices of different communities are sparse. Community detection aims to detect and reveal the community structure inherent in different types of complex networks; it can help us understand the function of a complex network, discover the hidden rules within it, and predict its behavior.
The traditional modularity function Q suffers from resolution limits and extreme degeneracy, so the present invention uses the improved weighted local modularity function as the index for evaluating the classification performance of a gene subset. The weighted local modularity function is calculated as follows:
1) Build a weighted undirected graph G(V, A), where the samples in the gene microarray data set are the vertices of the graph. For any two vertices v_1 and v_2, if v_1 \in k\text{-NN}(v_2) or v_2 \in k\text{-NN}(v_1), an edge with weight W_E = \exp(-d(v_1, v_2)) exists between the two vertices; k\text{-NN}(v_1) contains the k nearest neighbours of vertex v_1, and d(v_1, v_2) is the distance between the two vertices.
2) The samples are naturally divided into communities according to their classes.
3) For each feature subset, its importance based on the weighted local modularity function is calculated as:

\mathrm{Sig} = \sum_{i=1}^{C} \left[ \frac{w_i}{W_i} - \left( \frac{v_i}{2W_i} \right)^2 \right]   (1-1)

where C is the number of classes of the gene microarray data set to be classified; w_i is the sum of the internal edge weights in the i-th community; W_i is the sum of the internal edge weights plus the adjacent edge weights of community i; and v_i is the sum of the degrees of all vertices in community i, the degree of a vertex being the sum of the weights of its incident edges.
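The computation above can be condensed into a short program. The following is a minimal NumPy sketch, assuming Euclidean distances and the mutual k-NN edge rule just described; the function and parameter names are illustrative, not taken from the patent:

```python
import numpy as np

def weighted_local_modularity(X, y, k=5):
    """Importance Sig of a feature subset, expression (1-1).

    X holds the samples restricted to the candidate feature subset,
    y the class labels; communities are the classes.
    """
    n = X.shape[0]
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise Euclidean distances
    nn = np.argsort(d, axis=1)[:, 1:k + 1]                     # k nearest neighbours per vertex
    W = np.zeros((n, n))
    for v1 in range(n):
        for v2 in range(v1 + 1, n):
            if v2 in nn[v1] or v1 in nn[v2]:                   # k-NN adjacency rule
                W[v1, v2] = W[v2, v1] = np.exp(-d[v1, v2])     # edge weight W_E
    sig = 0.0
    for c in np.unique(y):
        inside = (y == c)                                      # community = one class
        w_i = W[np.ix_(inside, inside)].sum() / 2.0            # internal edge weights w_i
        adjacent = W[np.ix_(inside, ~inside)].sum()            # edges leaving the community
        W_i = w_i + adjacent                                   # internal plus adjacent weights
        v_i = W[inside, :].sum()                               # sum of vertex degrees v_i
        sig += w_i / W_i - (v_i / (2.0 * W_i)) ** 2
    return sig
```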
With randomness introduced, the feature selection process based on the weighted local modularity function is as follows (a sketch in code follows the list):
1) Set the current feature subset F = {};
2) randomly select a feature and add it to F;
3) for each feature g not contained in F, calculate its importance according to the feature set F + {g};
4) find the feature g′ that maximizes the importance in step 3), set F = F + {g′}, and repeat this step until the number of features in the subset F reaches the maximum threshold.
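Building on the scorer above, steps 1)-4) can be sketched as follows; `max_features` stands for the maximum threshold, and each independent call yields one feature subset:

```python
import random

def randomized_greedy_selection(X, y, max_features=50, k=5):
    """One run of the randomized greedy loop, scored by the
    weighted_local_modularity sketch above; repeated runs yield the
    diverse feature subsets later used to train base classifiers."""
    candidates = set(range(X.shape[1]))
    F = [random.choice(tuple(candidates))]        # step 2: first feature at random
    while len(F) < max_features:
        remaining = candidates - set(F)
        # step 3: importance of every candidate extension F + {g}
        scores = {g: weighted_local_modularity(X[:, F + [g]], y, k)
                  for g in remaining}
        F.append(max(scores, key=scores.get))     # step 4: keep the best g'
    return F
```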
(3) Train a base classifier on each feature subset using the support vector machine algorithm
The support vector machine is a supervised learning method: given the classes of known sample points, it seeks the correspondence between sample points and classes, so that the samples of the training set are separated according to class, or the class of a new sample point is predicted. The samples of a training set may be linearly separable, approximately linearly separable, or linearly inseparable; these are exactly the three types of classification problems.
1) For a two-class problem, if the sample points on the two sides of some hyperplane are divided into a positive class and a negative class, the decision function whose sign infers the class of a sample x is:

f(x) = w^T x + b   (1-2)

where w is the normal vector of the hyperplane and determines its direction; b is the offset and determines the distance between the hyperplane and the origin; x is the vector representing the sample.
2) The classification model must find w and b such that the classification error rate of the prediction function f(x) on the original samples is minimal. A loss function is a measure specifically used to evaluate prediction accuracy. The SVM method was proposed from the viewpoint of the optimal separating hyperplane in the linearly separable case: the optimal separating hyperplane must not only separate the two classes of samples without error but also maximize the margin between them. The former guarantees empirical risk minimization, while maximizing the margin in fact minimizes the confidence risk. The maximum-margin hyperplane must satisfy:

\min \frac{1}{2}\|w\|^2   (1-3)
\text{s.t. } y_j[(w^T x_j) + b] - 1 \ge 0,\ j = 1, 2, \ldots, n

where y_j is the class label of sample x_j.
3) The maximum-margin optimization problem of the optimal separating hyperplane is converted into its dual problem, so that the original classification problem is solved through the relatively simpler dual problem:

\max_{\alpha} \sum_{p=1}^{n} \alpha_p - \frac{1}{2} \sum_{p=1}^{n} \sum_{q=1}^{n} \alpha_p \alpha_q y_p y_q x_p^T x_q   (1-4)
\text{s.t. } \sum_{p=1}^{n} \alpha_p y_p = 0,\ \alpha_p \ge 0,\ p = 1, 2, \ldots, n

where \alpha_p and \alpha_q are the Lagrange multiplier coefficients of each sample obtained by applying the method of Lagrange multipliers to the dual problem.
4) Nonlinear classification is handled by introducing slack variables and a penalty factor, allowing a certain amount of classification error; the optimization objective is:

\min \frac{1}{2}\|w\|^2 + C \sum_{j=1}^{n} \zeta_j   (1-5)
\text{s.t. } y_j[(w^T x_j + b)] \ge 1 - \zeta_j,\ j = 1, 2, \ldots, n

where \zeta_j is a slack variable and C is the weight of the slack variables.
5) Through the nonlinear transformation defined by an inner-product (kernel) function, the SVM maps the input space to a higher-dimensional space and then seeks the optimal separating hyperplane in that space, so that a problem that is linearly inseparable in the low-dimensional space becomes linearly separable in the high-dimensional space. Let \varphi(x) denote the feature vector after mapping x; the model of the separating hyperplane in the feature space and the corresponding optimization model are:

f(x) = w^T \varphi(x) + b   (1-6)
\min \frac{1}{2}\|w\|^2   (1-7)
\text{s.t. } y_j[(w^T \varphi(x_j)) + b] - 1 \ge 0,\ j = 1, 2, \ldots, n

6) Because the dimension of the feature space may be very high or even infinite, directly computing \varphi(x_p)^T \varphi(x_q) is usually difficult, so a kernel function is introduced; its ingenuity lies in reducing the solution of a complicated optimization problem to inner-product operations on the original sample data:

\kappa(x_p, x_q) = \varphi(x_p)^T \varphi(x_q)   (1-8)
\max_{\alpha} \sum_{p=1}^{n} \alpha_p - \frac{1}{2} \sum_{p=1}^{n} \sum_{q=1}^{n} \alpha_p \alpha_q y_p y_q \kappa(x_p, x_q)   (1-9)
\text{s.t. } \sum_{p=1}^{n} \alpha_p y_p = 0,\ \alpha_p \ge 0,\ p = 1, 2, \ldots, n
By repeatedly executing the feature selection process of step (2), multiple feature subsets are produced. Each feature subset forms a corresponding training set used to train one SVM base classifier, as sketched below.
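A minimal sketch of this training step with scikit-learn, assuming the feature subsets come from repeated runs of the selection loop above; the RBF kernel follows the embodiment described later, and the remaining settings are illustrative defaults:

```python
from sklearn.svm import SVC

def train_base_classifiers(X_train, y_train, subsets):
    """One SVM base classifier per feature subset; `subsets` is a list of
    feature-index lists from repeated runs of randomized_greedy_selection."""
    base = []
    for F in subsets:
        clf = SVC(kernel='rbf')              # K(x, y) = exp(-gamma * ||x - y||^2)
        clf.fit(X_train[:, F], y_train)      # train on the subset's expression values
        base.append((F, clf))
    return base
```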
(4) Cluster the base classifiers using the affinity propagation clustering algorithm
1) Build the similarity matrix S as the input of the affinity propagation clustering algorithm, taking the classification results of each base classifier on the validation set as data points. Element s(e, m) of the matrix represents the similarity between data points e and m; a larger value indicates a greater similarity between the two data points.
In the gene selection stage, N gene subsets are selected. Each gene subset is used to form one training set containing only the expression values of the samples over that gene subset. N base classifiers are therefore obtained by training, and the classification results of each base classifier on the validation set constitute one data point, so that element s(e, m) (e = 1, 2, ..., N; m = 1, 2, ..., N) of the similarity matrix represents the similarity between base classifiers H_e and H_m. In calculating the similarity, the first consideration is the classification performance of the classifiers; the different numbers of features selected by the base classifiers are also a key factor. The similarity between base classifiers H_e and H_m is defined as:

s(e, m) = (N_{tt} + N_{ff}) / (N_{tt} + N_{tf} + N_{ft} + N_{ff}) - DN(e, m)   (1-10)

where N_{tt} is the number of validation samples classified correctly by both base classifiers at the same time, and N_{ff} is the number of validation samples misclassified by both at the same time. The number of validation samples classified correctly by H_e but misclassified by H_m is denoted N_{tf}; N_{ft} is the converse. The ratio of the number of samples on which the two classifiers agree in correctness to the total number of validation samples is exactly the similarity of their classification performance. DN(e, m) is the proportion of genes that differ between the gene sets used by the two base classifiers, relative to the total number of genes. (A sketch of this computation is given after this list.)
2) Set the values s(h, h) on the diagonal of the similarity matrix. This value is called the point of reference of data point h, i.e. of a base classifier's classification results on the validation set; the larger the value, the more suitable the data point is to serve as a cluster center, and the more clusters are generated. To ensure every data point has the same chance to become a cluster exemplar, the points of reference of all data points are set to the same value.
3) In the AP clustering algorithm, every data point is regarded as a potential cluster center, and messages are passed between data points continually until the algorithm converges or the iteration ends. The AP algorithm passes two kinds of messages during iteration: the responsibility r(e, m) reflects how well data point m is suited to serve as the cluster center of data point e, and the availability a(e, m) reflects the tendency of data point e to choose data point m as its cluster center. They are calculated as:

r(e, m) = s(e, m) - \max_{l \ne m} \{ a(e, l) + s(e, l) \},\ l \in \{1, 2, \ldots, N\}   (1-11)
a(e, m) = \min \left\{ 0,\ r(m, m) + \sum_{l \notin \{e, m\}} \max(0, r(l, m)) \right\}   (1-12)

To improve the stability of the AP algorithm, a damping coefficient \lambda is introduced, so that r(e, m) and a(e, m) are constrained by the values computed in the previous iteration. The improved update formulas are:

r_t = (1 - \lambda) r_t + \lambda r_{t-1}   (1-13)
a_t = (1 - \lambda) a_t + \lambda a_{t-1}   (1-14)

where r_t and a_t are the results of the t-th iteration, and r_{t-1} and a_{t-1} those of the (t-1)-th.
4) AP clustering determines the cluster exemplars automatically: if r(h, h) + a(h, h) > 0 during the iteration, data point h is selected as a cluster center. After the iteration ends, every remaining data point is assigned to its nearest cluster center.
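A minimal sketch of steps 1)-4), computing the similarity matrix of expression (1-10) and delegating the damped message passing of expressions (1-11)-(1-14) to scikit-learn's AffinityPropagation; the reading of DN(e, m) as the fraction of genes not shared by the two subsets, and the damping value, are assumptions:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

def classifier_similarity(preds, subsets, y_val):
    """Similarity matrix S of expression (1-10); preds[e] holds H_e's
    predictions on the validation set, subsets[e] its gene subset."""
    N = len(preds)
    S = np.zeros((N, N))
    for e in range(N):
        for m in range(N):
            # (N_tt + N_ff) / total: samples where H_e and H_m are
            # both correct or both wrong
            agree = np.mean((preds[e] == y_val) == (preds[m] == y_val))
            # DN(e, m): genes not shared, over the total genes of both
            # subsets (one reading of the patent's wording)
            total = set(subsets[e]) | set(subsets[m])
            dn = len(set(subsets[e]) ^ set(subsets[m])) / len(total)
            S[e, m] = agree - dn
    return S

def cluster_base_classifiers(S, damping=0.9, preference=0.1):
    """Affinity propagation on S; `preference` is the shared point of
    reference s(h, h) (0.1 in the embodiment), `damping` an illustrative
    choice for the coefficient lambda."""
    ap = AffinityPropagation(affinity='precomputed', damping=damping,
                             preference=preference)
    labels = ap.fit_predict(S)
    return ap.cluster_centers_indices_, labels   # exemplar indices, cluster labels
```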
(5) Integrate the base classifiers that serve as cluster exemplars and form the ensemble classification model by simple majority voting
Within the clusters of base classifiers thus formed, base classifiers in the same cluster are highly similar, while base classifiers belonging to different clusters differ considerably. Selecting the representative base classifier at each cluster center for integration therefore guarantees the diversity among the base classifiers used in the ensemble. Finally, the classification results of the selected base classifiers are fused by simple majority voting to form the ensemble classification model, as sketched below.
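A minimal sketch of the majority vote over the exemplar classifiers returned above (names are illustrative):

```python
import numpy as np

def ensemble_predict(base, exemplars, X_test):
    """Simple majority vote over the exemplar base classifiers;
    `base` pairs each feature subset with its trained SVM."""
    votes = np.array([clf.predict(X_test[:, F])
                      for F, clf in (base[i] for i in exemplars)])
    predictions = []
    for column in votes.T:                             # one column per test sample
        values, counts = np.unique(column, return_counts=True)
        predictions.append(values[np.argmax(counts)])  # class with the most votes
    return np.array(predictions)
```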
Beneficial effects of the present invention:
(1) Introducing randomness into the traditional greedy feature selection method expands the search range over the feature space.
(2) Heuristic information based on weighted local modularity measures the classification ability of a feature subset as a whole.
(3) Selective ensembling further improves the efficiency of the whole system as well as the classification ability of the ensemble classification model.
(4) Traditional feature selection and classification methods are improved for the characteristics of gene expression data, greatly improving classification performance on gene microarray data; the algorithm has low complexity and a fast running speed, and can readily be applied to the analysis of gene expression data.
Brief description of the drawings
Fig. 1 is the overall flowchart of the ensemble classification method based on randomized greedy feature selection of the present invention.
Fig. 2 is the architecture diagram of the ensemble classification method based on randomized greedy feature selection of the present invention.
Embodiment
As shown in Fig. 1, the general design idea of the present invention is as follows. Because gene expression data are high-dimensional, small-sample, and highly redundant, important genes must be selected before classification. First, the randomized greedy algorithm, using the weighted local modularity function as heuristic information, selects gene subsets. Repeated randomized feature selection produces multiple feature subsets, forming multiple different training sets for the ensemble classification model. Randomized feature selection not only filters out important genes for the classification model but also expands its search range over the feature space. To further improve the classification performance of the ensemble model and the efficiency of classification, the base classifiers are selected by a method based on affinity propagation clustering, and base classifiers that are highly diverse and classify well are picked out for the final ensemble.
Fig. 2 is the architecture diagram of the ensemble classification model of the present invention; the method comprises the following steps:
(1) Perform feature selection using the randomized greedy algorithm
1) For gene microarray data, the number of features is typically in the tens of thousands, so selecting important features with a traditional greedy algorithm yields only a locally optimal solution. The first feature is chosen at random rather than according to fixed heuristic information, which expands the search space over the features.
2) Use the weighted local modularity function, a community-detection evaluation index from complex-network analysis, as the heuristic information of the randomized greedy algorithm.
The important gene subset is selected by forward stepwise addition as follows:
(a) Set the current feature subset F = {};
(b) randomly select a feature and add it to F;
(c) for each feature g not contained in F, build the weighted undirected graph G(V, A) according to the feature set F + {g}. The samples in the data set are the vertices of the graph; for any two vertices v_1 and v_2, if v_1 \in k\text{-NN}(v_2) or v_2 \in k\text{-NN}(v_1), an edge with weight W_E = \exp(-d(v_1, v_2)) exists between the two samples. k\text{-NN}(v_1) contains the k nearest neighbours of vertex v_1, and distances between samples are computed with the Euclidean distance. In the experiments, k was stepped from 1 to 25 in increments of 2, and the value of k with the best classification performance was chosen;
(d) for each feature not contained in F, calculate its importance based on the weighted local modularity function, expression (1-1) above;
(e) find the g′ that maximizes the importance of step (d), set F = F + {g′}, and repeat this step until the number of features in the subset F reaches 50; experiments show that the best classification performance is obtained with between 10 and 20 genes.
(2) Train a base classifier on each feature subset using the support vector machine algorithm
By repeatedly executing the feature selection process of step (1), N feature subsets are produced. 60% of the samples are chosen as the training set; each feature subset forms a corresponding training set used to train a support vector machine, producing N SVM classifiers. The kernel function of the SVM classifiers is set to the RBF kernel, K(x, y) = \exp(-\gamma \|x - y\|^2), which adapts well to data sets of various characteristics, whether large-sample or small-sample, high-dimensional or low-dimensional.
(3) Cluster the base classifiers using the affinity propagation clustering algorithm
1) Build the similarity matrix S as the input of the affinity propagation clustering algorithm.
20% of the samples are chosen as the validation set, and the base classifiers are clustered according to their classification results on it. The similarity between base classifiers H_e and H_m is defined by expression (1-10) above:

s(e, m) = (N_{tt} + N_{ff}) / (N_{tt} + N_{tf} + N_{ft} + N_{ff}) - DN(e, m)

2) Set the values s(h, h) on the matrix diagonal; to ensure every data point has the same chance to become a cluster exemplar, the points of reference of all data points are set to the same value, 0.1.
3) In the AP clustering algorithm, every data point is regarded as a potential cluster center, and two kinds of messages are passed between data points continually until the algorithm converges or the iteration ends. r(e, m) represents how well data point m is suited to serve as the cluster center of data point e; a(e, m) represents the tendency of data point e to choose data point m as its cluster center; they are computed by expressions (1-11) and (1-12) above.
4) AP clustering determines the number of clusters automatically: if r(h, h) + a(h, h) > 0 during the iteration, data point h is selected as a cluster center. After the iteration ends, every remaining data point is assigned to its nearest cluster center.
(4) Integrate the base classifiers that serve as cluster exemplars and form the ensemble classification model by simple majority voting
Within the clusters of base classifiers thus formed, base classifiers in the same cluster are highly similar, while base classifiers belonging to different clusters differ considerably. Selecting the representative base classifier at each cluster center for integration therefore guarantees the diversity among the base classifiers used in the ensemble. Finally, the classification results of the selected base classifiers on the test set are fused by simple majority voting to form the ensemble classification model. Simple majority voting means that the final classification result for a sample is the class that the largest number of base classifiers agree on. An end-to-end sketch of the embodiment is given below.
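Tying the sketches above together, the embodiment can be outlined end to end as follows; the number of subsets N is not fixed by the patent, so `n_subsets` is an illustrative value:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def run_pipeline(X, y, n_subsets=40, max_features=50, k=5):
    """End-to-end sketch of the embodiment using the helpers above."""
    # 60% training / 20% validation / 20% test, as in the embodiment
    X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, train_size=0.6,
                                                  stratify=y)
    X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, test_size=0.5,
                                                stratify=y_rest)
    subsets = [randomized_greedy_selection(X_tr, y_tr, max_features, k)
               for _ in range(n_subsets)]
    base = train_base_classifiers(X_tr, y_tr, subsets)
    preds = [clf.predict(X_val[:, F]) for F, clf in base]
    S = classifier_similarity(preds, subsets, y_val)
    exemplars, _ = cluster_base_classifiers(S)
    y_pred = ensemble_predict(base, exemplars, X_te)
    return np.mean(y_pred == y_te)                # test-set accuracy
```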
By the method for the invention be applied to table 1,2 and 3 in arabidopsis data set, and by context of methods with it is existing integrated Sorting technique is compared.The accuracy rate and G-mean of the present invention is apparently higher than existing integrated approach.
Table 1 Experimental result comparison on the Arabidopsis-Drought data set
Table 2 Experimental result comparison on the Arabidopsis-Nitrogen data set
Table 3 Experimental result comparison on the Arabidopsis-TEV data set
In summary, the present invention designs an ensemble classification method based on randomized greedy feature selection that effectively improves the classification performance of the ensemble classification model. The invention can therefore be applied to the analysis of gene microarray data and provides a powerful tool for the timely and effective diagnosis of plant stress.

Claims (1)

1. An ensemble classification method based on randomized greedy feature selection, characterized in that the steps are as follows:
(1) Introduce randomness into a traditional greedy algorithm to perform feature selection
The first feature is randomly selected in order to expand the search space over the features;
(2) Use the weighted local modularity function, a community-detection evaluation index from complex-network analysis, as the heuristic information of the randomized greedy algorithm
Complex networks have small-world, scale-free, and community-structure properties; this ensemble classification method combines data mining with complex networks and performs feature selection using the community-detection evaluation index from complex networks as heuristic information;
The weighted local modularity function is calculated as follows:
1) Build a weighted undirected graph G(V, A), where the samples in the gene microarray data set are the vertices of the graph; for any two vertices v_1 and v_2, if v_1 \in k\text{-NN}(v_2) or v_2 \in k\text{-NN}(v_1), an edge with weight W_E = \exp(-d(v_1, v_2)) exists between the two vertices; k\text{-NN}(v_1) contains the k nearest neighbours of vertex v_1, and d(v_1, v_2) is the distance between the two vertices;
2) the samples are naturally divided into communities according to their classes;
3) for each feature subset, its importance based on the weighted local modularity function is calculated as:

\mathrm{Sig} = \sum_{i=1}^{C} \left[ \frac{w_i}{W_i} - \left( \frac{v_i}{2W_i} \right)^2 \right]   (1-1)

where C is the number of classes of the gene microarray data set to be classified; w_i is the sum of the internal edge weights in the i-th community; W_i is the sum of the internal edge weights plus the adjacent edge weights of community i; v_i is the sum of the degrees of all vertices in community i, the degree of a vertex being the sum of the weights of its incident edges;
With randomness introduced, the feature selection process based on the weighted local modularity function is as follows:
1) set the current feature subset F = {};
2) randomly select a feature and add it to F;
3) for each feature g not contained in F, calculate its importance according to the feature set F + {g};
4) find the feature g′ that maximizes the importance in step 3), set F = F + {g′}, and repeat this step until the number of features in the subset F reaches the maximum threshold;
(3) Train a base classifier on each feature subset using the support vector machine algorithm
1) For a two-class problem, if the sample points on the two sides of some hyperplane are divided into a positive class and a negative class, the decision function whose sign infers the class of a sample x is:

f(x) = w^T x + b   (1-2)

where w is the normal vector of the hyperplane and determines its direction; b is the offset and determines the distance between the hyperplane and the origin; x is the vector representing the sample;
2) under the condition that formula (1-3) is satisfied, find the maximum-margin hyperplane:

\min \frac{1}{2}\|w\|^2   (1-3)
\text{s.t. } y_j[(w^T x_j) + b] - 1 \ge 0,\ j = 1, 2, \ldots, n

where y_j is the class label of sample x_j;
3) convert the maximum-margin optimization problem of the optimal separating hyperplane into its dual problem, so that the original classification problem is solved through the relatively simpler dual problem:

\max_{\alpha} \sum_{p=1}^{n} \alpha_p - \frac{1}{2} \sum_{p=1}^{n} \sum_{q=1}^{n} \alpha_p \alpha_q y_p y_q x_p^T x_q   (1-4)
\text{s.t. } \sum_{p=1}^{n} \alpha_p y_p = 0,\ \alpha_p \ge 0,\ p = 1, 2, \ldots, n

where \alpha_p and \alpha_q are the Lagrange multiplier coefficients of each sample obtained by applying the method of Lagrange multipliers to the dual problem;
4) handle nonlinear classification by introducing slack variables and a penalty factor; the optimization objective is:

\min \frac{1}{2}\|w\|^2 + C \sum_{j=1}^{n} \zeta_j   (1-5)
\text{s.t. } y_j[(w^T x_j + b)] \ge 1 - \zeta_j,\ j = 1, 2, \ldots, n

where \zeta_j is a slack variable and C is the weight of the slack variables;
5) through the nonlinear transformation defined by an inner-product (kernel) function, the SVM maps the input space to a higher-dimensional space and then seeks the optimal separating hyperplane in that space, so that a problem that is linearly inseparable in the low-dimensional space becomes linearly separable in the high-dimensional space; let \varphi(x) denote the feature vector after mapping x; the model of the separating hyperplane in the feature space and the corresponding optimization model are:

f(x) = w^T \varphi(x) + b   (1-6)
\min \frac{1}{2}\|w\|^2   (1-7)
\text{s.t. } y_j[(w^T \varphi(x_j)) + b] - 1 \ge 0,\ j = 1, 2, \ldots, n
6) introduce a kernel function so that the solution of the complicated optimization problem is reduced to inner-product operations on the original sample data:

\kappa(x_p, x_q) = \varphi(x_p)^T \varphi(x_q)   (1-8)
\max_{\alpha} \sum_{p=1}^{n} \alpha_p - \frac{1}{2} \sum_{p=1}^{n} \sum_{q=1}^{n} \alpha_p \alpha_q y_p y_q \kappa(x_p, x_q)   (1-9)
\text{s.t. } \sum_{p=1}^{n} \alpha_p y_p = 0,\ \alpha_p \ge 0,\ p = 1, 2, \ldots, n

By repeatedly executing the feature selection process of step (2), multiple feature subsets are produced; each feature subset forms a corresponding training set used to train one SVM base classifier;
(4) Cluster the base classifiers using the affinity propagation clustering algorithm
1) Build the similarity matrix S as the input of the affinity propagation clustering algorithm, taking the classification results of each base classifier on the validation set as data points; element s(e, m) of the matrix represents the similarity between data points e and m, and a larger value indicates a greater similarity between the two data points;
In the gene selection stage, N gene subsets are selected; each gene subset is used to form one training set containing only the expression values of the samples over that gene subset; N base classifiers are therefore obtained by training, and the classification results of each base classifier on the validation set constitute one data point, so that element s(e, m) of the similarity matrix represents the similarity between base classifiers H_e and H_m, where e = 1, 2, ..., N and m = 1, 2, ..., N; in calculating the similarity, the first consideration is the classification performance of the classifiers, and the different numbers of features selected by the base classifiers are also a key factor; the similarity between base classifiers H_e and H_m is defined as:

s(e, m) = (N_{tt} + N_{ff}) / (N_{tt} + N_{tf} + N_{ft} + N_{ff}) - DN(e, m)   (1-10)

where N_{tt} is the number of validation samples classified correctly by both base classifiers at the same time, and N_{ff} is the number of validation samples misclassified by both at the same time; the number of validation samples classified correctly by H_e but misclassified by H_m is denoted N_{tf}, and N_{ft} is the converse; the ratio of the number of samples on which the two base classifiers agree in correctness to the total number of validation samples is exactly the similarity of their classification performance; DN(e, m) is the proportion of genes that differ between the gene sets used by the two base classifiers, relative to the total number of genes;
2) set the values s(h, h) on the diagonal of the similarity matrix; this value is called the point of reference of data point h, i.e. of a base classifier's classification results on the validation set; the larger the value, the more suitable the data point is to serve as a cluster center, and the more clusters are generated; to ensure every data point has the same chance to become a cluster exemplar, the points of reference of all data points are set to the same value;
3) in the AP clustering algorithm, every data point is regarded as a potential cluster center, and messages are passed between data points continually until the algorithm converges or the iteration ends; the AP algorithm passes two kinds of messages during iteration: r(e, m) represents how well data point m is suited to serve as the cluster center of data point e, and a(e, m) represents the tendency of data point e to choose data point m as its cluster center; they are calculated as:

r(e, m) = s(e, m) - \max_{l \ne m} \{ a(e, l) + s(e, l) \},\ l \in \{1, 2, \ldots, N\}   (1-11)
a(e, m) = \min \left\{ 0,\ r(m, m) + \sum_{l \notin \{e, m\}} \max(0, r(l, m)) \right\}   (1-12)

to improve the stability of the AP algorithm, a damping coefficient \lambda is introduced, so that r(e, m) and a(e, m) are constrained by the values computed in the previous iteration; the improved update formulas are:

r_t = (1 - \lambda) r_t + \lambda r_{t-1}   (1-13)
a_t = (1 - \lambda) a_t + \lambda a_{t-1}   (1-14)

where r_t and a_t are the results of the t-th iteration, and r_{t-1} and a_{t-1} those of the (t-1)-th;
4) AP clustering determines the cluster exemplars automatically: if r(h, h) + a(h, h) > 0 during the iteration, data point h is selected as a cluster center; after the iteration ends, every remaining data point is assigned to its nearest cluster center;
(5) Integrate the base classifiers that serve as cluster exemplars and form the ensemble classification model by simple majority voting.
CN201710209168.7A 2017-04-01 2017-04-01 Integrated classification method based on randomized greedy feature selection Active CN106991296B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710209168.7A CN106991296B (en) 2017-04-01 2017-04-01 Integrated classification method based on randomized greedy feature selection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710209168.7A CN106991296B (en) 2017-04-01 2017-04-01 Integrated classification method based on randomized greedy feature selection

Publications (2)

Publication Number Publication Date
CN106991296A true CN106991296A (en) 2017-07-28
CN106991296B CN106991296B (en) 2019-12-27

Family

ID=59415350

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710209168.7A Active CN106991296B (en) 2017-04-01 2017-04-01 Integrated classification method based on randomized greedy feature selection

Country Status (1)

Country Link
CN (1) CN106991296B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107609588A (en) * 2017-09-12 2018-01-19 大连大学 A kind of disturbances in patients with Parkinson disease UPDRS score Forecasting Methodologies based on voice signal
CN108021940A (en) * 2017-11-30 2018-05-11 中国银联股份有限公司 data classification method and system based on machine learning
CN108763873A (en) * 2018-05-28 2018-11-06 苏州大学 A kind of gene sorting method and relevant device
CN108845560A (en) * 2018-05-30 2018-11-20 国网浙江省电力有限公司宁波供电公司 A kind of power scheduling log Fault Classification
CN109800790A (en) * 2018-12-24 2019-05-24 厦门大学 A kind of feature selection approach towards high dimensional data
CN109801681A (en) * 2018-12-11 2019-05-24 江苏大学 A kind of SNP selection method based on improved fuzzy clustering algorithm
CN110674865A (en) * 2019-09-20 2020-01-10 燕山大学 Rule learning classifier integration method oriented to software defect class distribution unbalance
CN111081321A (en) * 2019-12-18 2020-04-28 江南大学 CNS drug key feature identification method
CN111178533A (en) * 2018-11-12 2020-05-19 第四范式(北京)技术有限公司 Method and device for realizing automatic semi-supervised machine learning
CN113743436A (en) * 2020-06-29 2021-12-03 北京沃东天骏信息技术有限公司 Feature selection method and device for generating user portrait
CN113820123A (en) * 2021-08-18 2021-12-21 北京航空航天大学 Gearbox fault diagnosis method based on improved CNN and selective integration

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105279520A (en) * 2015-09-25 2016-01-27 天津师范大学 Optimal character subclass selecting method based on classification ability structure vector complementation
CN105740891A (en) * 2016-01-27 2016-07-06 北京工业大学 Target detection method based on multilevel characteristic extraction and context model

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105279520A (en) * 2015-09-25 2016-01-27 天津师范大学 Optimal character subclass selecting method based on classification ability structure vector complementation
CN105740891A (en) * 2016-01-27 2016-07-06 北京工业大学 Target detection method based on multilevel characteristic extraction and context model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JUN MENG, JING ZHANG, YUSHI LUAN: "Gene Selection Integrated with Biological Knowledge for Plant Stress Response Using Neighborhood System and Rough Set Theory", IEEE/ACM Transactions on Computational Biology and Bioinformatics *
孟军, 尉双云: "Ensemble feature selection method based on affinity propagation clustering" (基于近邻传播聚类的集成特征选择方法), Computer Science (《计算机科学》) *
朱倩: "Research on associative classification algorithms for data with uncertain attributes" (属性不确定数据关联分类算法研究), China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库 信息科技辑》) *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107609588A (en) * 2017-09-12 2018-01-19 大连大学 A kind of disturbances in patients with Parkinson disease UPDRS score Forecasting Methodologies based on voice signal
CN107609588B (en) * 2017-09-12 2020-08-18 大连大学 Parkinson patient UPDRS score prediction method based on voice signals
CN108021940A (en) * 2017-11-30 2018-05-11 中国银联股份有限公司 data classification method and system based on machine learning
CN108021940B (en) * 2017-11-30 2023-04-18 中国银联股份有限公司 Data classification method and system based on machine learning
CN108763873A (en) * 2018-05-28 2018-11-06 苏州大学 A kind of gene sorting method and relevant device
CN108845560B (en) * 2018-05-30 2021-07-13 国网浙江省电力有限公司宁波供电公司 Power dispatching log fault classification method
CN108845560A (en) * 2018-05-30 2018-11-20 国网浙江省电力有限公司宁波供电公司 A kind of power scheduling log Fault Classification
CN111178533B (en) * 2018-11-12 2024-04-16 第四范式(北京)技术有限公司 Method and device for realizing automatic semi-supervised machine learning
CN111178533A (en) * 2018-11-12 2020-05-19 第四范式(北京)技术有限公司 Method and device for realizing automatic semi-supervised machine learning
CN109801681A (en) * 2018-12-11 2019-05-24 江苏大学 A kind of SNP selection method based on improved fuzzy clustering algorithm
CN109800790A (en) * 2018-12-24 2019-05-24 厦门大学 A kind of feature selection approach towards high dimensional data
CN110674865B (en) * 2019-09-20 2023-04-07 燕山大学 Rule learning classifier integration method oriented to software defect class distribution unbalance
CN110674865A (en) * 2019-09-20 2020-01-10 燕山大学 Rule learning classifier integration method oriented to software defect class distribution unbalance
CN111081321A (en) * 2019-12-18 2020-04-28 江南大学 CNS drug key feature identification method
CN111081321B (en) * 2019-12-18 2023-10-31 江南大学 CNS drug key feature identification method
CN113743436A (en) * 2020-06-29 2021-12-03 北京沃东天骏信息技术有限公司 Feature selection method and device for generating user portrait
CN113820123A (en) * 2021-08-18 2021-12-21 北京航空航天大学 Gearbox fault diagnosis method based on improved CNN and selective integration

Also Published As

Publication number Publication date
CN106991296B (en) 2019-12-27

Similar Documents

Publication Publication Date Title
CN106991296A (en) Ensemble classifier method based on the greedy feature selecting of randomization
Marinakis et al. Particle swarm optimization for pap-smear diagnosis
CN109063719B (en) Image classification method combining structure similarity and class information
CN107103332A (en) A kind of Method Using Relevance Vector Machine sorting technique towards large-scale dataset
CN106778832A (en) The semi-supervised Ensemble classifier method of high dimensional data based on multiple-objection optimization
Shi et al. Multi-label ensemble learning
CN106126972A (en) A kind of level multi-tag sorting technique for protein function prediction
CN104850890A (en) Method for adjusting parameter of convolution neural network based on example learning and Sadowsky distribution
CN107273909A (en) The sorting algorithm of high dimensional data
CN110378366A (en) A kind of cross-domain image classification method based on coupling knowledge migration
CN107943856A (en) A kind of file classification method and system based on expansion marker samples
CN110245252A (en) Machine learning model automatic generation method based on genetic algorithm
CN105760888A (en) Neighborhood rough set ensemble learning method based on attribute clustering
CN108734223A (en) The social networks friend recommendation method divided based on community
CN102324038A (en) A kind of floristics recognition methods based on digital picture
CN103914705A (en) Hyperspectral image classification and wave band selection method based on multi-target immune cloning
Saraswati et al. High-resolution Self-Organizing Maps for advanced visualization and dimension reduction
CN101866489A (en) Image dividing method based on immune multi-object clustering
CN102622609A (en) Method for automatically classifying three-dimensional models based on support vector machine
CN103971136A (en) Large-scale data-oriented parallel structured support vector machine classification method
CN101295362A (en) Combination supporting vector machine and pattern classification method of neighbor method
Xie et al. Margin distribution based bagging pruning
Dale et al. Quantitative analysis of ecological networks
CN105046323A (en) Regularization-based RBF network multi-label classification method
CN103258212A (en) Semi-supervised integrated remote-sensing image classification method based on attractor propagation clustering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant