CN105740653A - Redundancy removal feature selection method LLRFC score+ based on LLRFC and correlation analysis - Google Patents


Info

Publication number
CN105740653A
CN105740653A (application CN201610057637.3A)
Authority
CN
China
Prior art keywords
sample
class
feature
gene
llrfc
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610057637.3A
Other languages
Chinese (zh)
Inventor
李建更
李晓丹
张卫
王朋飞
李立杰
张岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201610057637.3A priority Critical patent/CN105740653A/en
Publication of CN105740653A publication Critical patent/CN105740653A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Epidemiology (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Public Health (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention provides LLRFC score+, a redundancy-removing feature selection method based on LLRFC (Locally Linear Representation Fisher Criterion) and correlation analysis. DNA (deoxyribonucleic acid) microarray technology offers a new direction for clinical tumor diagnosis. Because gene expression data differ across tumor types, analyzing tumor gene expression data allows researchers to identify tumors and tumor subtypes accurately at the molecular level, which has important biological significance for tumor diagnosis and treatment. The method ranks the feature genes of gene expression data in descending order under the LLRFC criterion and combines this ranking with a dynamic correlation analysis strategy that further eliminates redundant features; the resulting LLRFC score+ algorithm selects an optimal feature-gene subset. LLRFC score+ effectively improves classifier accuracy, does not require the sample data set to follow a normal distribution, and is applicable to data of various distribution types. It can help find the disease-causing genes of cancer, and supports the early diagnosis, tumor staging and typing, and prognosis-guided treatment of clinical tumor diseases.

Description

A redundancy-removing feature selection method, LLRFC score+, based on LLRFC and correlation analysis
Technical field
The present invention relates to the field of tumor classification in bioinformatics, and is a feature selection method for tumor gene expression profile data.
Background technology
In recent years, advances in biochip technology have made it possible to detect the expression of thousands of genes in massively parallel fashion, opening new approaches to the diagnosis, prevention, and treatment of human diseases at the molecular-biology level. By analyzing gene expression differences between tissue types (e.g., normal versus tumor cells, or different cancer stages) and classifying the corresponding gene expression data, clinicians can support the diagnosis and treatment of tumor diseases, subtype identification, and prognostic analysis. Tumor morbidity and mortality continue to rise, making cancer a leading threat to human health, so cancer classification based on biochip technology has become a research hotspot in bioinformatics.
Because microarray experiments are costly, the number of gene samples is small (typically tens to roughly a hundred cases), while the number of genes detected runs from several thousand to tens of thousands. Gene expression relationships are complex, and only a small number of genes carry disease-relevant information, all of which makes the analysis of expression profile data very challenging. This "small sample, high dimensionality" setting easily leads to the "curse of dimensionality": computation becomes expensive, and the presence of redundant genes further reduces classification accuracy and degrades the classifier. Effective dimensionality reduction of gene expression profile data is therefore needed to extract, from the mass of data, the key feature genes that play an important role in tumor identification.
At present, dimensionality reduction methods for tumor gene expression data fall into two categories: feature extraction and feature selection. Feature extraction maps high-dimensional data into a lower-dimensional subspace under certain constraints; the extracted features are usually linear combinations of the original features, lack clear meaning, and are hard to interpret biologically. Feature selection, in contrast, selects feature genes carrying more class information directly from the original mass of data; it not only effectively improves classification accuracy but also has important biological significance. Analyzing the biological functions of the selected genes makes it possible to explore tumor pathogenesis and help find the disease-causing genes of cancer, interpreting the origin of tumors from the perspective of gene expression. Feature selection methods are therefore widely used in tumor classification.
Feature selection methods are usually divided into three types: filter, wrapper, and embedded. Wrapper methods integrate feature selection with classification and select an optimal feature subset for a specific classifier; classification accuracy is high, but computation is expensive, the result depends on the choice of classifier, and generalization is poor. Embedded methods use some property of the classifier as the feature evaluation criterion, and their computational cost is also high. Filter methods rely only on the intrinsic structure of the training data: features are ranked by a criterion and those carrying more class information are selected. Because filter methods are independent of the classifier, run fast, handle large data sets well, and generalize strongly, they are widely adopted.
Traditional filter feature selection algorithms include the T-test, signal-to-noise ratio, and Fisher score, but none of them considers the interactions between features; these methods perform well for linear feature selection but poorly for nonlinear features. Researchers have also shown systematically that nonlinear dimensionality reduction models are better suited than linear models to tumor classification on gene expression profile data. LLE (Locally Linear Embedding) is a relatively recent nonlinear dimensionality reduction method that considers neighboring samples and builds a locally optimal weight matrix. The low-dimensional embedding obtained from the optimal weight matrix minimizes the reconstruction error with respect to the neighbors, preserving in the low-dimensional space the topological structure among neighboring points of the original space; it also yields an overall low-dimensional embedding of the data, achieving feature extraction. LLE detects the low-dimensional manifold structure of high-dimensional data well, but because it ignores sample class information it is not well suited to tumor classification. To address this, researchers proposed LLRFC (Locally Linear Representation Fisher Criterion), a supervised feature extraction method. Using class labels, it builds within-class and between-class neighbor graphs so that, while preserving the geometry of the original data, neighbor samples with the same label are kept as compact as possible and neighbor samples with different labels as dispersed as possible. This graph-spectral feature extraction method effectively improves classification accuracy, does not require Gaussian-distributed data, and applies to training samples of arbitrary spatial distribution. However, the features LLRFC extracts have no clear biological meaning and are hard to interpret; moreover, because of the complex relationships in gene expression data, the LLRFC algorithm does not consider interactions between feature genes, so redundancy remains among the selected feature genes.
Summary of the invention
Within the graph-embedding framework and its linearized, kernelized, and tensorized variants, many classical manifold learning methods can be reformulated, and new dimensionality reduction methods can be explored under this framework (LLRFC also falls under it). Several feature selection methods based on the graph framework have appeared in succession, such as Laplacian score, LSDF (Locality Sensitive Discriminant Feature) score, and MFA (Marginal Fisher Analysis) score, which discover more informative features by exploiting the intrinsic structure of the data.
Aiming at the deficiencies of the prior art, the present invention incorporates sample class information and proposes a new feature selection method, LLRFC score. It is a supervised filter method that uses the LLRFC score criterion to compute each feature gene's contribution to classification: the larger the score, the higher the contribution and the better the classification effect. Feature genes are sorted in descending order of score, and the top-scoring feature gene sequence (carrying more class information) is selected. According to information theory, when selecting d features from a feature space of size D (D >> d), merely ranking each feature independently by some statistic or separability criterion and taking the top d features, without considering the complex interactions among them, rarely yields an optimal feature set and may even perform poorly in simulation. If two highly correlated feature genes both appear in the selected set, then whenever one is a feature gene the other necessarily is too; with a fixed subset dimension, selecting both of two features with comparable predictive ability introduces unnecessary redundancy, which reduces the information carried by the subset while increasing computation. Therefore, when selecting features from tumor gene expression data, the redundancy among the key genes in the feature sequence should be minimized as far as possible.
The present invention applies a dynamic correlation analysis strategy to further remove redundancy from the feature sequence selected by LLRFC score, obtaining an optimal feature-gene subset and improving classification accuracy.
Gene expression profile data obtained by chip technology are usually represented as a numerical matrix: each row vector is the expression of all genes in one sample, each column vector is the expression of one feature gene across all samples, and each element is a gene's expression in the corresponding sample. For example, a gene expression matrix of n samples (each containing D feature genes) can be written X = [X1, X2, ..., Xn], where Xi ∈ R^D (i = 1, 2, ..., n) is the expression of all genes in sample i. The tumor sample set can also be written X = F = [f1, f2, ..., fD]^T, where fj ∈ R^n (j = 1, 2, ..., D) is the feature vector formed by feature j's expression in each sample (patient). Y = [Y1, Y2, ..., Yn] is the low-dimensional embedding of the original high-dimensional data produced by the manifold learning algorithm LLE, with Yi ∈ R^d (i = 1, 2, ..., n), d << D. In supervised manifold learning, the class label of a sample is defined as ci ∈ {1, 2, ..., nc}, where nc is the number of sample classes. Using the Euclidean distance between tumor samples together with class information (diseased, normal, or different tumor subtypes), the k sample points closest to Xi are defined as the k-neighborhood of Xi. For each sample Xi, under the constraint that the local linear reconstruction error is minimal, k1 neighbor samples with the same label and k2 neighbor samples with different labels are selected, and the corresponding within-class and between-class graphs are built. Because the class attributes of each training set differ and the per-class sample counts vary widely, the choice of parameters k1 and k2 depends on the specific data set. Based on empirical values and theoretical analysis, k1 is generally no larger than min{nc} − 1; for tumor gene expression data, k1 is typically chosen between 2 and 5, while setting k2 is somewhat more complex. In the LLRFC score algorithm, the between-class neighbor points formed from differently labeled samples are similar to the support vectors in a support vector machine: with k1 fixed, an SVM is trained and, according to the experimental results, the k value with the highest classification accuracy is chosen as k2.
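The matrix conventions above can be sketched in a few lines of numpy; the sizes and values here are invented purely for illustration.

```python
import numpy as np

# Toy expression matrix in the text's notation: n = 3 samples (rows),
# D = 4 feature genes (columns); values are made up for illustration.
X = np.array([[0.2, 1.1, 3.0, 0.7],
              [0.3, 0.9, 2.8, 0.6],
              [2.0, 0.1, 0.4, 1.9]])

F = X.T                      # F[j] is feature j's expression across all samples
c = np.array([1, 1, 2])      # class labels c_i in {1, ..., n_c}; here n_c = 2
```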
To achieve the above objective, the technical solution adopted by the present invention comprises the following steps:
1) Based on the label information, for each sample point Xi, build a within-class neighborhood from its k1 same-label sample points and a between-class neighborhood from its k2 differently labeled sample points, and reconstruct the corresponding within-class weight matrix W_intra and between-class weight matrix W_inter.
Here sample Xj belongs to the k nearest neighbors of Xi. The neighbor covariance matrix G of sample Xi has elements G_ij = (Xi − Xj)^T (Xi − Xj) and is symmetric and positive definite. IntraN(Xi) denotes the within-class neighbor set of Xi, and InterN(Xi) its between-class neighbor set.
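The neighbor covariance matrix G and the optimal reconstruction weights are computed per sample as in standard LLE; a minimal sketch follows. The small regularizer is our addition (the text does not specify one) to guard against a singular local Gram matrix.

```python
import numpy as np

def lle_weights(x_i, neighbors, reg=1e-3):
    """Locally linear reconstruction weights of x_i from its neighbors
    (rows of `neighbors`), as in standard LLE. `reg` regularizes the
    local Gram ("neighbor covariance") matrix -- an assumption of ours."""
    Z = neighbors - x_i                  # shift neighbors so x_i is the origin
    G = Z @ Z.T                          # local Gram matrix G
    tr = np.trace(G)
    G = G + reg * (tr if tr > 0 else 1.0) * np.eye(len(G))
    w = np.linalg.solve(G, np.ones(len(G)))
    return w / w.sum()                   # weights constrained to sum to 1
```

Applied once per sample to its k1 same-label neighbors this yields a row of W_intra, and to its k2 differently labeled neighbors a row of W_inter.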
2) Using the optimal reconstruction weights above (which preserve the local geometry of the manifold in the low-dimensional space), compute the within-class low-dimensional embedding error ε(Y_intra) and the between-class embedding error ε(Y_inter) for Xi, and build the optimal classification criterion S(Y).
ε(·) is the error of linearly reconstructing sample Xi from its k neighbor points X1, ..., Xk (j ≠ i), and tr(·) is the matrix trace.
In the low-dimensional embedding space, maximizing the between-class reconstruction error while minimizing the within-class reconstruction error makes samples compact within a class and dispersed between classes, which better serves tumor classification. Based on this analysis, the following criterion function is built:
The within-class cost matrix M_intra and between-class cost matrix M_inter (sparse and positive semidefinite) are defined as M_intra = (I − W_intra)^T (I − W_intra), M_inter = (I − W_inter)^T (I − W_inter). The larger S(Y), the more pronounced the classification effect.
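A minimal sketch of the cost matrices and criterion, assuming the trace-ratio form tr(Y M_inter Y^T) / tr(Y M_intra Y^T) implied by "the larger S(Y), the better"; the exact ratio form is our reading where the text is ambiguous.

```python
import numpy as np

def llrfc_criterion(Y, W_intra, W_inter):
    """LLRFC criterion S(Y) for an embedding Y (d x n): ratio of the
    between-class to the within-class local reconstruction error,
    expressed through the cost matrices M_intra and M_inter."""
    n = Y.shape[1]
    I = np.eye(n)
    M_intra = (I - W_intra).T @ (I - W_intra)   # within-class cost matrix
    M_inter = (I - W_inter).T @ (I - W_inter)   # between-class cost matrix
    return np.trace(Y @ M_inter @ Y.T) / np.trace(Y @ M_intra @ Y.T)
```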
3) Compute each feature's classification contribution S(fj) (the score of feature fj under the LLRFC optimality criterion, i.e., the LLRFC score), sort the scores in descending order, and obtain the corresponding feature sequence.
Following the graph-preserving criterion, introduce the linear transformation Y = A^T X, the transformation matrix A defining the corresponding mapping. Under the constraint A^T A = I (I the identity matrix), taking Y = A^T X = fj gives the score S(fj) = fj M_inter fj^T / (fj M_intra fj^T). Compute each feature gene fj's score, sort the original feature sequence F in descending order of score, and obtain the new feature sequence F' = [F1, F2, ..., FD].
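Under that reading, the per-feature score is a ratio of quadratic forms over the two cost matrices; a sketch (the function name and argument layout are ours):

```python
import numpy as np

def llrfc_score(F, M_intra, M_inter):
    """Per-feature LLRFC score: for each feature vector f (a row of F,
    shape D x n), S(f) = f M_inter f^T / (f M_intra f^T). Returns the
    scores and the descending-order ranking of the features."""
    scores = np.array([(f @ M_inter @ f) / (f @ M_intra @ f) for f in F])
    order = np.argsort(-scores)      # descending: highest contribution first
    return scores, order
```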
4) Use the dynamic correlation analysis strategy to evaluate the correlation between feature genes in the sequence F' = [F1, F2, ..., FD]. The correlation coefficient measures the degree of correlation between two features and is used to further eliminate similarity redundancy between features.
The correlation coefficient between two features is defined as
ρ_jk = Σ_i (f_ij − f̄_j)(f_ik − f̄_k) / sqrt( Σ_i (f_ij − f̄_j)² · Σ_i (f_ik − f̄_k)² ),
where i indexes the samples and j and k index the corresponding genes; f_ij and f_ik are the expression values of Fj and Fk on sample Xi, and f̄_j and f̄_k are the two features' means over all samples. The absolute value of the correlation coefficient lies between 0 and 1: the closer to 1, the greater the similarity between the features and the more the redundancy.
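The quantity described above is the standard Pearson correlation coefficient; a direct transcription:

```python
import numpy as np

def pearson(fj, fk):
    """Pearson correlation between two feature vectors over the samples --
    the standard formula we take the text's correlation coefficient to be."""
    dj, dk = fj - fj.mean(), fk - fk.mean()     # center both features
    return (dj @ dk) / np.sqrt((dj @ dj) * (dk @ dk))
```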
The present invention adopts a forward feature selection strategy with dynamic correlation analysis. The feature subset S is initialized to the empty set. Each round takes one feature gene Fj from the sequence F' = [F1, F2, ..., FD]; the first round moves F1 into the subset, so S = {F1}. Then the next feature gene in the remaining sequence F' is taken and its correlation coefficient with each feature in S is computed. If any ρ_jk exceeds a given threshold σ (note: the threshold differs between data sets and is here determined experimentally), the feature gene Fj is deleted from F' and the next feature gene is examined; only when every ρ_jk is below σ is the feature gene Fj moved into the feature subset S. The process repeats until the subset reaches the required size or the sequence F' is empty.
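The forward selection loop just described can be sketched as follows (the function name is ours; the sigma and m0 defaults follow the 11Tumors example later in the text):

```python
import numpy as np

def forward_select(F_sorted, sigma=0.9, m0=70):
    """Forward selection with dynamic correlation analysis: walk the
    score-ranked features and keep one only if its |Pearson r| with every
    already-kept feature stays below sigma. F_sorted is a sequence of
    feature vectors in descending LLRFC-score order."""
    kept = []
    for j, f in enumerate(F_sorted):
        if all(abs(np.corrcoef(f, F_sorted[k])[0, 1]) < sigma for k in kept):
            kept.append(j)           # uncorrelated with everything kept so far
        if len(kept) == m0:
            break                    # subset has reached the required size
    return kept                      # indices into the ranked sequence
```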
Compared with the prior art, the present invention turns the LLRFC feature extraction method into a feature selection method, so the selected features have clear biological meaning and good interpretability, and it combines dynamic correlation analysis to judge the similarity between feature genes and eliminate redundancy. The invention effectively improves classifier accuracy, does not require the sample data set to follow a normal distribution, and applies to data of various distribution types. It can help find the disease-causing genes of cancer and supports the early diagnosis, tumor staging and typing, and prognosis-guided treatment of clinical tumor diseases.
Accompanying drawing explanation
Fig. 1 is the flow chart of the technical scheme.
Fig. 2 compares the classification accuracy curves of the present method and other methods on the 11Tumors data set (k1 = 3, k2 = 3).
Fig. 3 compares the classification accuracy curves on the Brain_Tumor1 data set (k1 = 4, k2 = 9).
Fig. 4 compares the classification accuracy curves on the Brain_Tumor2 data set (k1 = 2, k2 = 4).
Fig. 5 compares the classification accuracy curves on the Lung_Cancer data set (k1 = 4, k2 = 6).
Fig. 6 compares the classification accuracy curves on the SRBCT (Small Round Blue Cell Tumor) data set (k1 = 4, k2 = 7).
Fig. 7 compares the classification accuracy curves on the DLBCL (Diffuse Large B-Cell Lymphomas) data set (k1 = 4, k2 = 7).
Detailed description of the invention
The present invention is described in further detail below with reference to the drawings and an embodiment.
Embodiment
Classification is verified on the data set of 11 different tumor types (11Tumors) from the website http://www.gems-system.org, comparing the classification accuracy of the LLRFC score+, LLRFC score, Laplacian score, Fisher score, and t-test feature selection methods on this data set. The data set characteristics are shown in the table below:
Table 1: 11Tumors (gene number: 12533)
Considering the balance of the tumor sample distribution, the data are randomly divided in half by class: one half is the training set, used for feature selection; the other half is the test set, used to obtain the classification accuracy. Since SVM is insensitive to data dimensionality, it shows great advantages on small-sample, high-dimensional problems. For gene expression profile data, the classifier is LIBSVM with a linear kernel and default parameters. The data set is divided randomly for training (if a class has an odd number of samples, the training set receives one more sample than the test set; e.g., for the Ovary class, 14 samples go to the training set and 13 to the test set). For the 11Tumors data set, the training set has 89 samples and the test set 85.
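The per-class half split described above (with odd-sized classes giving the extra sample to training, as in the Ovary example) can be sketched as:

```python
import numpy as np

def stratified_half_split(labels, rng=None):
    """Random per-class half split: half of each class goes to training,
    the rest to test; with an odd class count the extra sample goes to
    training (matching the 14-train / 13-test Ovary example)."""
    if rng is None:
        rng = np.random.default_rng(0)   # fixed seed is our choice for the sketch
    train, test = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        cut = (len(idx) + 1) // 2        # odd class -> one more in training
        train.extend(idx[:cut])
        test.extend(idx[cut:])
    return np.array(train), np.array(test)
```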
(1) Feature subset selection:
1) Based on the label information ci ∈ {1, 2, ..., 11}, for each sample point Xi build the within-class graph from its k1 same-label sample points (set to 3 for the 11Tumors data set) and the between-class graph from its k2 differently labeled sample points (set to 3 here, verified by experiment), and reconstruct the within-class weight matrix W_intra and between-class weight matrix W_inter.
The set of the 89 training samples of 11Tumors can be written X = [X1, X2, ..., X89], a matrix of size 89 × 12533, with Xi ∈ R^12533 (i = 1, 2, ..., 89) the expression of all genes in sample i; the sample set can also be written X = F = [f1, f2, ..., f12533]^T, where fj is the vector of feature gene j's expression values over the samples. Neighbor points are selected by the Euclidean distance between samples and the class information, and the within-class weight matrix W_intra and between-class weight matrix W_inter are reconstructed.
Here sample Xj belongs to the k nearest neighbors of Xi; the neighbor covariance matrix G of sample Xi has elements G_ij = (Xi − Xj)^T (Xi − Xj) and is symmetric and positive definite. IntraN(Xi) is the set of the 3 same-label neighbor points of Xi, and InterN(Xi) the set of its 3 differently labeled neighbor points.
2) Using the optimal reconstruction weights above, compute the within-class low-dimensional embedding error ε(Y_intra) and the between-class embedding error ε(Y_inter), and build the optimal classification criterion S(Y).
3) Compute each feature's classification contribution S(fj) (the score of feature fj under the LLRFC optimality criterion), sort the scores in descending order, and obtain the feature sequence.
Using the score of feature fj under the LLRFC optimality criterion, compute each feature gene's score, sort the original features in descending order of score, and obtain the feature gene sequence F' = [F1, F2, ..., F12533].
4) Use the dynamic correlation analysis strategy to evaluate the correlation between feature genes in F' = [F1, F2, ..., F12533], further eliminate similarity redundancy between features, and obtain the feature subset S.
The correlation coefficient between two features is defined as above.
The forward feature selection strategy with dynamic correlation analysis is adopted here: the feature subset S is initialized to the empty set, and each round one feature gene Fj is taken from F' = [F1, F2, ..., F12533]. The first round moves F1 into the subset, so S = {F1}; then F2 is taken from the remaining sequence F' and its correlation coefficient ρ(1,2) with feature F1 in S is computed. In general, two features are considered strongly correlated when their correlation coefficient lies between 0.8 and 0.95; for the 11Tumors data set, σ = 0.9 is set. If ρ(1,2) ≥ 0.9, feature F2 is deleted; if ρ(1,2) < 0.9, F2 is moved into the subset S. The next feature Fj of F' is then taken and its correlation coefficients with all features in the subset are computed: if any ρ_jk exceeds the given threshold σ, Fj is deleted from F' and the next feature gene is examined; only when every ρ_jk is below σ is Fj moved into the feature subset S. The process repeats until the subset reaches the required size m0 (here m0 = 70 is selected).
(2) Classification performance verification
Seventy feature genes are selected by the LLRFC score+ feature selection method. With the top i feature genes selected, the training and test sets are X'_train (89 × i) and X'_test (85 × i) respectively. In Matlab 2012b, the "svmtrain" function of the LIBSVM toolbox trains on the data set X'_train, and the "svmpredict" function predicts the results on the test set X'_test. Classification accuracy is obtained for each choice of i (from 1 to 70) feature genes; the test is repeated 30 times and the average classification accuracy is computed.
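The incremental evaluation loop can be sketched as follows. The text trains LIBSVM with a linear kernel; here a nearest-centroid classifier stands in only so that the sketch is self-contained, not as a reproduction of the reported results.

```python
import numpy as np

def accuracy_curve(Xtr, ytr, Xte, yte, ranked, m0):
    """Train on the top-i ranked features for i = 1..m0 and record test
    accuracy. A nearest-centroid classifier stands in for LIBSVM here."""
    accs = []
    for i in range(1, m0 + 1):
        cols = ranked[:i]                # top-i features of the ranked sequence
        cents = {c: Xtr[ytr == c][:, cols].mean(axis=0) for c in np.unique(ytr)}
        pred = [min(cents, key=lambda c: np.linalg.norm(x[cols] - cents[c]))
                for x in Xte]            # assign each test sample to nearest centroid
        accs.append(float(np.mean(np.array(pred) == yte)))
    return accs
```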
The accuracy comparison curves of the five methods, the present LLRFC score+ versus LLRFC score, Laplacian score, Fisher score, and T-test, are shown in Fig. 2.
The invention was also tested on the Brain_Tumor1, Brain_Tumor2, Lung_Cancer, SRBCT, and DLBCL data sets (characteristics in Table 2); results are shown in Figs. 3-7.
Table 2: Characteristics of the other data sets
The experimental results on these data sets show that the present method achieves a better classification effect than the other methods, mainly because these data sets have many tumor classes, relatively many samples or genes, complex sample geometry, and large redundancy, so the benefit is evident. LLRFC score considers only the spatial geometry and class information of the samples, not the redundancy between feature genes, so its effect is inferior to LLRFC score+. On the various data sets, the LLRFC score+, LLRFC score, and Laplacian score algorithms all consider the neighborhood problem and can effectively preserve the internal structure of the data, so their classification accuracies are close; but because Laplacian score is an unsupervised algorithm and does not weigh the interrelations between features, it may select redundant genes, and its accuracy is somewhat inferior to the other two methods. The small round blue cell tumor (SRBCT) data set has a smaller gene number and less redundancy, so the experimental effect there is less striking; Laplacian score, being unsupervised and classifying by preserving the data's own cluster structure, performs best on it, though when 30 features are selected the LLRFC score+ algorithm achieves a similar effect.

Claims (3)

1. A redundancy-removing feature selection method LLRFC score+ based on LLRFC and correlation analysis; the method incorporates sample class information and proposes a feature selection method LLRFC score; it is a supervised filter feature selection method that uses the LLRFC score criterion to compute each feature gene's contribution to classification; the larger the score, the higher the contribution and the better the classification effect; feature genes are sorted in descending order of score, and the top-scoring feature gene sequence (carrying more class information) is selected; according to information theory, when selecting d features from a feature space of size D (D >> d), merely ranking each feature independently by some statistic or separability criterion and taking the top d features, without considering the complex interactions among features, rarely yields an optimal feature set and may even perform poorly in simulation; if two highly correlated feature genes both appear in the selected set, then whenever one is a feature gene the other necessarily is too; with a fixed subset dimension, selecting both of two features with comparable predictive ability introduces unnecessary redundancy, reducing the information carried by the subset while increasing computation; therefore, when selecting features from tumor gene expression data, the redundancy among the key genes in the feature sequence is minimized as far as possible;
the feature sequence selected by LLRFC score is further stripped of redundancy by a dynamic correlation analysis strategy, obtaining an optimal feature-gene subset and improving classification accuracy;
It is characterized in that: the gene expression profile data obtained by microarray (chip) technology is represented in the form of a numerical matrix, where each row vector represents the expression of all genes in one sample, each column vector represents the expression of one characteristic gene across all samples, and each matrix element is the expression value of a gene in the corresponding sample; a gene expression matrix composed of n samples, each containing D characteristic genes, is expressed as X = [X_1, X_2, ..., X_n], where X_i ∈ R^D (i = 1, 2, ..., n) represents all gene expression values of sample i; the tumor sample set may also be expressed in another form, X = F = [f_1, f_2, ..., f_D]^T, where f_j ∈ R^n (j = 1, 2, ..., D) is the feature vector formed by the expression values of feature j in each sample (patient); Y = [Y_1, Y_2, ..., Y_n] is the low-dimensional embedding of the original high-dimensional data produced by the manifold learning algorithm LLE, with Y_i ∈ R^d (i = 1, 2, ..., n) and d << D; in supervised manifold learning, the class label of a sample is defined as c_i ∈ {1, 2, ..., n_c}, where n_c is the number of sample classes; according to the Euclidean distances between tumor samples and the class information (diseased vs. normal, or different tumor subtypes), the k sample points closest to a sample point X_i are defined as the k-neighborhood of X_i; for any sample X_i, under the premise that the local linear reconstruction error is minimized, k_1 neighbor points with the same label and k_2 neighbor points with different labels are selected, and the corresponding intra-class graph and inter-class graph are built; because the class attributes of training datasets differ and the number of samples per class varies widely, the choice of the parameters k_1 and k_2 depends on the specific dataset; from empirical values and theoretical analysis, k_1 is generally no greater than min{n_c} − 1, and for tumor gene expression profile data k_1 is typically chosen between 2 and 5; setting k_2 is somewhat more involved: in the LLRFCscore algorithm the inter-class neighbor points formed from differently labelled samples are analogous to the support vectors of a support vector machine, so with k_1 fixed an SVM is trained and, according to the experimental results, the k value giving the highest classification accuracy is chosen as k_2.
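The neighborhood construction described above (k_1 nearest same-label and k_2 nearest different-label sample points under Euclidean distance) can be sketched in Python; this is an illustrative reconstruction, not code from the patent, and the function name and signature are assumptions:

```python
import numpy as np

def class_neighborhoods(X, labels, i, k1, k2):
    """For sample i, return the indices of its k1 nearest same-label
    neighbors (intra-class neighborhood IntraN) and its k2 nearest
    different-label neighbors (inter-class neighborhood InterN),
    using Euclidean distance.
    X: (n, D) matrix, one sample per row; labels: length-n array."""
    d = np.linalg.norm(X - X[i], axis=1)   # distances to all samples
    d[i] = np.inf                          # exclude the sample itself
    same = np.where(labels == labels[i])[0]
    diff = np.where(labels != labels[i])[0]
    intra = same[np.argsort(d[same])][:k1] # k1 nearest same-label points
    inter = diff[np.argsort(d[diff])][:k2] # k2 nearest other-label points
    return intra, inter
```

With a labelled expression matrix in hand, calling this per sample gives the index sets from which the intra- and inter-class graphs of the claim are built.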
2. The redundancy-removing feature selection method LLRFCscore+ based on LLRFC and correlation analysis according to claim 1, characterized in that the method is implemented in the following steps:
1) Based on the data label information, for any sample point X_i, construct the intra-class neighborhood formed by its k_1 same-label sample points and the inter-class neighborhood formed by its k_2 differently-labelled sample points, and build the corresponding intra-class reconstruction weight matrix W_intra and inter-class weight matrix W_inter:
$$(W_{\mathrm{intra}})_{ij}=\begin{cases}\sum_{j} G_{ij}^{-1}\Big/\sum_{ij} G_{ij}^{-1}, & X_j\in \mathrm{IntraN}(X_i)\\ 0, & \text{otherwise}\end{cases}$$

$$(W_{\mathrm{inter}})_{ij}=\begin{cases}\sum_{j} G_{ij}^{-1}\Big/\sum_{ij} G_{ij}^{-1}, & X_j\in \mathrm{InterN}(X_i)\\ 0, & \text{otherwise}\end{cases}$$
where sample X_j belongs to the k nearest neighbors of sample X_i; the neighborhood covariance matrix G of sample X_i has elements G_jk = (X_i − X_j)^T (X_i − X_k) over the neighbor points and is symmetric and positive definite; IntraN(X_i) denotes the intra-class neighbor set of X_i, and InterN(X_i) denotes its inter-class neighbor set;
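The optimal reconstruction weights above (normalized row sums of G⁻¹, as in standard LLE) can be sketched as follows; the regularization term and the function name are my additions, included only for numerical stability when G is near-singular:

```python
import numpy as np

def reconstruction_weights(X, i, nbr_idx, reg=1e-3):
    """LLE-style optimal reconstruction weights of sample i from its
    neighbors nbr_idx: solve G w = 1 (rows of G^{-1} summed) and
    normalize, where G_jk = (X_i - X_j)^T (X_i - X_k) is the local
    neighborhood covariance matrix."""
    Z = X[i] - X[nbr_idx]                  # (k, D) displacement vectors
    G = Z @ Z.T                            # local covariance (Gram) matrix
    G = G + reg * np.trace(G) * np.eye(len(nbr_idx))  # regularize
    w = np.linalg.solve(G, np.ones(len(nbr_idx)))     # G^{-1} row sums
    return w / w.sum()                     # weights sum to one
```

Applied once per sample, with neighbors drawn from IntraN(X_i) or InterN(X_i), the results fill the nonzero rows of W_intra and W_inter respectively.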
2) Using the optimal reconstruction weights above (which preserve the local geometry of the manifold in the low-dimensional space), compute for X_i the intra-class low-dimensional embedding error ε(Y_intra) and the inter-class low-dimensional embedding error ε(Y_inter), and build the optimal classification criterion S(Y):
$$\varepsilon(Y_{\mathrm{intra}})=\min_{Y_j\in \mathrm{IntraN}(Y_i)}\Big\|Y_i-\sum_{j=1}^{k_1}(W_{\mathrm{intra}})_{ij}Y_j\Big\|^2=\mathrm{tr}\{Y(I-W_{\mathrm{intra}})^T(I-W_{\mathrm{intra}})Y^T\}$$

$$\varepsilon(Y_{\mathrm{inter}})=\min_{Y_j\in \mathrm{InterN}(Y_i)}\Big\|Y_i-\sum_{j=1}^{k_2}(W_{\mathrm{inter}})_{ij}Y_j\Big\|^2=\mathrm{tr}\{Y(I-W_{\mathrm{inter}})^T(I-W_{\mathrm{inter}})Y^T\}$$
ε(·) denotes the error of linearly reconstructing sample X_i from its k neighbor points X_1, ..., X_k (j ≠ i), and tr(·) is the matrix trace; in the low-dimensional embedding space, maximizing the inter-class reconstruction error while minimizing the intra-class reconstruction error makes samples compact within each class and dispersed between classes, which better serves tumor classification. Based on the above analysis, the following criterion function is built:
$$S(Y)=\max\frac{\varepsilon(Y_{\mathrm{inter}})}{\varepsilon(Y_{\mathrm{intra}})}=\max\frac{\mathrm{tr}\{Y(I-W_{\mathrm{inter}})^T(I-W_{\mathrm{inter}})Y^T\}}{\mathrm{tr}\{Y(I-W_{\mathrm{intra}})^T(I-W_{\mathrm{intra}})Y^T\}}=\max\frac{\mathrm{tr}\{YM_{\mathrm{inter}}Y^T\}}{\mathrm{tr}\{YM_{\mathrm{intra}}Y^T\}}$$
The intra-class cost matrix M_intra and the inter-class cost matrix M_inter (both sparse and positive semidefinite) are defined as M_intra = (I − W_intra)^T (I − W_intra) and M_inter = (I − W_inter)^T (I − W_inter); the larger S(Y) is, the better the class separation.
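Forming the two cost matrices from the weight matrices is a direct translation of the definitions above; a minimal sketch (function name is illustrative):

```python
import numpy as np

def cost_matrices(W_intra, W_inter):
    """Build the sparse, positive semidefinite cost matrices
    M = (I - W)^T (I - W) for the intra- and inter-class graphs."""
    n = W_intra.shape[0]
    I = np.eye(n)
    M_intra = (I - W_intra).T @ (I - W_intra)
    M_inter = (I - W_inter).T @ (I - W_inter)
    return M_intra, M_inter
```

Symmetry and positive semidefiniteness follow immediately from the M = B^T B form, matching the character claimed for M_intra and M_inter.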
3) Compute the classification contribution S(f_j) of each feature (the score of feature f_j under the LLRFC optimality criterion, i.e. the LLRFCscore), sort the scores in descending order, and obtain the corresponding feature sequence;
According to the graph-embedding preservation criterion, a linear transformation Y = A^T X is introduced, with the transition matrix A defining the corresponding mapping; under the constraint A^T A = I (I the identity matrix), to evaluate feature j, A is defined as the unit vector whose j-th element is 1 and all other elements 0; then Y = A^T X = f_j and
$$S(f_j)=\frac{\mathrm{tr}(f_j^T M_{\mathrm{inter}} f_j)}{\mathrm{tr}(f_j^T M_{\mathrm{intra}} f_j)}$$
The score of each characteristic gene f_j is computed, and the original feature sequence F is sorted in descending order of score, giving the new feature sequence F' = [F_1, F_2, ..., F_D];
4) With the dynamic correlation analysis strategy, evaluate the correlation between characteristic genes in the feature sequence F' = [F_1, F_2, ..., F_D]; the correlation coefficient is used to measure the degree of correlation between two features, further eliminating similarity redundancy between them;
The correlation coefficient between two features is defined as:
$$\rho(f_j,f_k)=\frac{\sum_{i=1}^{m}(f_{ij}-\bar{f}_j)(f_{ik}-\bar{f}_k)}{\sqrt{\sum_{i=1}^{m}(f_{ij}-\bar{f}_j)^2}\sqrt{\sum_{i=1}^{m}(f_{ik}-\bar{f}_k)^2}},\qquad i=1,2,\dots,m;\ \ j,k=1,2,\dots,D$$
where m is the number of samples and j, k index the corresponding genes; f_ij and f_ik are the expression values of features F_j and F_k on sample X_i, and f̄_j and f̄_k are the means of the two features over all samples; the absolute value of the correlation coefficient lies between 0 and 1, and the closer it is to 1, the greater the similarity between the two features and the higher the redundancy.
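The coefficient above is the standard Pearson correlation, which `numpy.corrcoef` computes directly; a minimal wrapper (the function name is illustrative):

```python
import numpy as np

def feature_correlation(fj, fk):
    """Pearson correlation coefficient between two feature vectors,
    i.e. the expression of genes j and k across the m samples."""
    return np.corrcoef(fj, fk)[0, 1]
```

Its absolute value is what is compared against the threshold σ in the selection strategy of claim 3.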
3. The redundancy-removing feature selection method LLRFCscore+ based on LLRFC and correlation analysis according to claim 1, characterized in that: the method selects the feature subset with a forward feature selection strategy based on dynamic correlation analysis; the feature subset S is initialized as the empty set, and characteristic genes are taken one at a time from the feature gene sequence F' = [F_1, F_2, ..., F_D]; first F_1 is moved into the subset, so S = {F_1}; then the next characteristic gene F_j is taken from the remaining sequence F' and its correlation coefficient with every feature already in subset S is computed; if any ρ_jk exceeds the given threshold σ, the characteristic gene F_j is deleted from the sequence F' and the next gene is examined; if and only if every ρ_jk is below the threshold σ is the corresponding characteristic gene F_j moved into the subset S; this process is repeated until the subset reaches the required size or the sequence F' is empty.
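The forward selection loop of claim 3 can be sketched as follows; this is an illustrative reconstruction under the assumption that features arrive as rows of a matrix already sorted by descending LLRFCscore (function and parameter names are mine):

```python
import numpy as np

def forward_select(F_sorted, sigma, max_features):
    """Forward feature selection with dynamic correlation analysis:
    scan the score-ranked features F_sorted (rows = features, columns
    = samples); keep a feature only if its absolute Pearson correlation
    with every already-selected feature is below the threshold sigma.
    Stop when max_features are selected or the sequence is exhausted."""
    selected = []                               # indices into F_sorted
    for j in range(F_sorted.shape[0]):
        ok = all(abs(np.corrcoef(F_sorted[j], F_sorted[s])[0, 1]) < sigma
                 for s in selected)             # vacuously true for S = {}
        if ok:
            selected.append(j)                  # move F_j into subset S
        if len(selected) == max_features:
            break                               # subset size requirement met
    return selected
```

Because the first feature faces an empty subset, it is always admitted, matching the claim's initialization step S = {F_1}.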
CN201610057637.3A 2016-01-27 2016-01-27 Redundancy removal feature selection method LLRFC score+ based on LLRFC and correlation analysis Pending CN105740653A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610057637.3A CN105740653A (en) 2016-01-27 2016-01-27 Redundancy removal feature selection method LLRFC score+ based on LLRFC and correlation analysis

Publications (1)

Publication Number Publication Date
CN105740653A true CN105740653A (en) 2016-07-06

Family

ID=56246840

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610057637.3A Pending CN105740653A (en) 2016-01-27 2016-01-27 Redundancy removal feature selection method LLRFC score+ based on LLRFC and correlation analysis

Country Status (1)

Country Link
CN (1) CN105740653A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108122223A (en) * 2017-12-18 2018-06-05 浙江工业大学 Ferrite depth of defect study recognition methods based on Fisher criterions
CN109190713A (en) * 2018-09-29 2019-01-11 王海燕 The minimally invasive fast inspection technology of oophoroma based on serum mass spectrum adaptive sparse feature selecting
CN110326051A (en) * 2017-03-03 2019-10-11 通用电气公司 The method of expression distinctive elements in biological sample for identification
CN110362603A (en) * 2018-04-04 2019-10-22 北京京东尚科信息技术有限公司 A kind of feature redundancy analysis method, feature selection approach and relevant apparatus
CN111814868A (en) * 2020-07-03 2020-10-23 苏州动影信息科技有限公司 Model based on image omics feature selection, construction method and application
CN112215290A (en) * 2020-10-16 2021-01-12 苏州大学 Q learning auxiliary data analysis method and system based on Fisher score
CN112802555A (en) * 2021-02-03 2021-05-14 南开大学 Complementary differential expression gene selection method based on mvAUC
CN113177604A (en) * 2021-05-14 2021-07-27 东北大学 High-dimensional data feature selection method based on improved L1 regularization and clustering
CN114913921A (en) * 2022-05-07 2022-08-16 厦门大学 System and method for identifying marker gene
CN116045427A (en) * 2023-03-30 2023-05-02 福建省特种设备检验研究院 Elevator car air purification system based on intelligent decision

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110326051B (en) * 2017-03-03 2023-11-14 环球生命科学解决方案运营英国有限公司 Method and analysis system for identifying expression discrimination elements in biological samples
CN110326051A (en) * 2017-03-03 2019-10-11 通用电气公司 The method of expression distinctive elements in biological sample for identification
CN108122223A (en) * 2017-12-18 2018-06-05 浙江工业大学 Ferrite depth of defect study recognition methods based on Fisher criterions
CN110362603A (en) * 2018-04-04 2019-10-22 北京京东尚科信息技术有限公司 A kind of feature redundancy analysis method, feature selection approach and relevant apparatus
CN109190713A (en) * 2018-09-29 2019-01-11 王海燕 The minimally invasive fast inspection technology of oophoroma based on serum mass spectrum adaptive sparse feature selecting
CN111814868A (en) * 2020-07-03 2020-10-23 苏州动影信息科技有限公司 Model based on image omics feature selection, construction method and application
CN112215290A (en) * 2020-10-16 2021-01-12 苏州大学 Q learning auxiliary data analysis method and system based on Fisher score
CN112215290B (en) * 2020-10-16 2024-04-09 苏州大学 Fisher score-based Q learning auxiliary data analysis method and Fisher score-based Q learning auxiliary data analysis system
CN112802555A (en) * 2021-02-03 2021-05-14 南开大学 Complementary differential expression gene selection method based on mvAUC
CN112802555B (en) * 2021-02-03 2022-04-19 南开大学 Complementary differential expression gene selection method based on mvAUC
CN113177604A (en) * 2021-05-14 2021-07-27 东北大学 High-dimensional data feature selection method based on improved L1 regularization and clustering
CN113177604B (en) * 2021-05-14 2024-04-16 东北大学 High-dimensional data feature selection method based on improved L1 regularization and clustering
CN114913921A (en) * 2022-05-07 2022-08-16 厦门大学 System and method for identifying marker gene
CN116045427A (en) * 2023-03-30 2023-05-02 福建省特种设备检验研究院 Elevator car air purification system based on intelligent decision
CN116045427B (en) * 2023-03-30 2023-10-10 福建省特种设备检验研究院 Elevator car air purification system based on intelligent decision

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160706

RJ01 Rejection of invention patent application after publication