CN105740653A - Redundancy removal feature selection method LLRFC score+ based on LLRFC and correlation analysis - Google Patents


Info

Publication number
CN105740653A
CN105740653A (application CN201610057637.3A)
Authority
CN
China
Prior art keywords
sample
class
feature
gene
llrfc
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610057637.3A
Other languages
Chinese (zh)
Inventor
李建更
李晓丹
张卫
王朋飞
李立杰
张岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201610057637.3A priority Critical patent/CN105740653A/en
Publication of CN105740653A publication Critical patent/CN105740653A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Epidemiology (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Public Health (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention provides LLRFC score+, a redundancy-removing feature selection method based on LLRFC (Locally Linear Representation Fisher Criterion) and correlation analysis. DNA (deoxyribonucleic acid) microarray technology offers a new direction for clinical tumor diagnosis. Because gene expression data differ across tumor types, analyzing tumor gene expression data allows researchers to identify tumors and tumor subtypes accurately at the molecular level, which has important biological significance for tumor diagnosis and treatment. The method ranks the feature genes of gene expression data in descending order under the LLRFC criterion and combines this ranking with a dynamic correlation analysis strategy that further eliminates redundant features; the resulting LLRFC score+ algorithm selects an optimal feature-gene subset. LLRFC score+ effectively improves classifier accuracy, does not require the sample data set to follow a normal distribution, and is applicable to data of various distribution types. It can help find the disease-causing genes of cancer, and supports the early diagnosis, tumor staging and typing, and prognosis-guided treatment of clinical tumor diseases.

Description

A redundancy-removing feature selection method, LLRFC score+, based on LLRFC and correlation analysis
Technical field
The present invention relates to the field of tumor classification in bioinformatics, and is a feature selection method for tumor gene expression profile data.
Background technology
In recent years, advances in biochip technology have made it possible to detect the expression of thousands of genes in massively parallel fashion, opening new approaches to the diagnosis, prevention, and treatment of human diseases at the molecular-biology level. By analyzing gene expression differences between tissue types (e.g., normal versus tumor cells, or different cancer stages) and classifying the corresponding gene expression data, clinicians can support the diagnosis and treatment of tumor diseases, subtype identification, and prognostic analysis. Tumor morbidity and mortality continue to rise, making cancer a leading threat to human health, so cancer classification based on biochip technology has become a research hotspot in bioinformatics.
Because microarray experiments are costly, the number of gene samples is small (typically tens to roughly a hundred cases), while the number of genes detected runs from several thousand to tens of thousands. Gene expression relationships are complex, and only a small number of genes carry disease-relevant information, all of which makes the analysis of expression profile data very challenging. This "small sample, high dimensionality" setting easily leads to the "curse of dimensionality": computation becomes expensive, and the presence of redundant genes further reduces classification accuracy and degrades the classifier. Effective dimensionality reduction of gene expression profile data is therefore needed to extract, from the mass of data, the key feature genes that play an important role in tumor identification.
At present, dimensionality reduction methods for tumor gene expression data fall into two categories: feature extraction and feature selection. Feature extraction maps high-dimensional data into a lower-dimensional subspace under certain constraints; the extracted features are usually linear combinations of the original features, lack clear meaning, and are hard to interpret biologically. Feature selection, in contrast, selects feature genes carrying more class information directly from the original mass of data; it not only effectively improves classification accuracy but also has important biological significance. Analyzing the biological functions of the selected genes makes it possible to explore tumor pathogenesis and help find the disease-causing genes of cancer, interpreting the origin of tumors from the perspective of gene expression. Feature selection methods are therefore widely used in tumor classification.
Feature selection methods are usually divided into three types: filter, wrapper, and embedded. Wrapper methods integrate feature selection with classification and select an optimal feature subset for a specific classifier; classification accuracy is high, but computation is expensive, the result depends on the choice of classifier, and generalization is poor. Embedded methods use some property of the classifier as the feature evaluation criterion, and their computational cost is also high. Filter methods rely only on the intrinsic structure of the training data: features are ranked by a criterion and those carrying more class information are selected. Because filter methods are independent of the classifier, run fast, handle large data sets well, and generalize strongly, they are widely adopted.
Traditional filter feature selection algorithms include the T-test, signal-to-noise ratio, and Fisher score, but none of them considers the interactions between features; these methods perform well for linear feature selection but poorly for nonlinear features. Researchers have also shown systematically that nonlinear dimensionality reduction models are better suited than linear models to tumor classification on gene expression profile data. LLE (Locally Linear Embedding) is a relatively recent nonlinear dimensionality reduction method that considers neighboring samples and builds a locally optimal weight matrix. The low-dimensional embedding obtained from the optimal weight matrix minimizes the reconstruction error with respect to the neighbors, preserving in the low-dimensional space the topological structure among neighboring points of the original space; it also yields an overall low-dimensional embedding of the data, achieving feature extraction. LLE detects the low-dimensional manifold structure of high-dimensional data well, but because it ignores sample class information it is not well suited to tumor classification. To address this, researchers proposed LLRFC (Locally Linear Representation Fisher Criterion), a supervised feature extraction method. Using class labels, it builds within-class and between-class neighbor graphs so that, while preserving the geometry of the original data, neighbor samples with the same label are kept as compact as possible and neighbor samples with different labels as dispersed as possible. This graph-spectral feature extraction method effectively improves classification accuracy, does not require Gaussian-distributed data, and applies to training samples of arbitrary spatial distribution. However, the features LLRFC extracts have no clear biological meaning and are hard to interpret; moreover, because of the complex relationships in gene expression data, the LLRFC algorithm does not consider interactions between feature genes, so redundancy remains among the selected feature genes.
Summary of the invention
Within the graph-embedding framework and its linearized, kernelized, and tensorized variants, many classical manifold learning methods can be reformulated, and new dimensionality reduction methods can be explored under this framework (LLRFC also falls under it). Several feature selection methods based on the graph framework have appeared in succession, such as Laplacian score, LSDF (Locality Sensitive Discriminant Feature) score, and MFA (Marginal Fisher Analysis) score, which discover more informative features by exploiting the intrinsic structure of the data.
Aiming at the deficiencies of the prior art, the present invention incorporates sample class information and proposes a new feature selection method, LLRFC score. It is a supervised filter method that uses the LLRFC score criterion to compute each feature gene's contribution to classification: the larger the score, the higher the contribution and the better the classification effect. Feature genes are sorted in descending order of score, and the top-scoring feature gene sequence (carrying more class information) is selected. According to information theory, when selecting d features from a feature space of size D (D >> d), merely ranking each feature independently by some statistic or separability criterion and taking the top d features, without considering the complex interactions among them, rarely yields an optimal feature set and may even perform poorly in simulation. If two highly correlated feature genes both appear in the selected set, then whenever one is a feature gene the other necessarily is too; with a fixed subset dimension, selecting both of two features with comparable predictive ability introduces unnecessary redundancy, which reduces the information carried by the subset while increasing computation. Therefore, when selecting features from tumor gene expression data, the redundancy among the key genes in the feature sequence should be minimized as far as possible.
The present invention applies a dynamic correlation analysis strategy to further remove redundancy from the feature sequence selected by LLRFC score, obtaining an optimal feature-gene subset and improving classification accuracy.
Gene expression profile data obtained by chip technology are usually represented as a numerical matrix: each row vector is the expression of all genes in one sample, each column vector is the expression of one feature gene across all samples, and each element is a gene's expression in the corresponding sample. For example, a gene expression matrix of n samples (each containing D feature genes) can be written X = [X1, X2, ..., Xn], where Xi ∈ R^D (i = 1, 2, ..., n) is the expression of all genes in sample i. The tumor sample set can also be written X = F = [f1, f2, ..., fD]^T, where fj ∈ R^n (j = 1, 2, ..., D) is the feature vector formed by feature j's expression in each sample (patient). Y = [Y1, Y2, ..., Yn] is the low-dimensional embedding of the original high-dimensional data produced by the manifold learning algorithm LLE, with Yi ∈ R^d (i = 1, 2, ..., n), d << D. In supervised manifold learning, the class label of a sample is defined as ci ∈ {1, 2, ..., nc}, where nc is the number of sample classes. Using the Euclidean distance between tumor samples together with class information (diseased, normal, or different tumor subtypes), the k sample points closest to Xi are defined as the k-neighborhood of Xi. For each sample Xi, under the constraint that the local linear reconstruction error is minimal, k1 neighbor samples with the same label and k2 neighbor samples with different labels are selected, and the corresponding within-class and between-class graphs are built. Because the class attributes of each training set differ and the per-class sample counts vary widely, the choice of parameters k1 and k2 depends on the specific data set. Based on empirical values and theoretical analysis, k1 is generally no larger than min{nc} − 1; for tumor gene expression data, k1 is typically chosen between 2 and 5, while setting k2 is somewhat more complex. In the LLRFC score algorithm, the between-class neighbor points formed from differently labeled samples are similar to the support vectors in a support vector machine: with k1 fixed, an SVM is trained and, according to the experimental results, the k value with the highest classification accuracy is chosen as k2.
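The matrix conventions above can be sketched in a few lines of numpy; the sizes and values here are invented purely for illustration.

```python
import numpy as np

# Toy expression matrix in the text's notation: n = 3 samples (rows),
# D = 4 feature genes (columns); values are made up for illustration.
X = np.array([[0.2, 1.1, 3.0, 0.7],
              [0.3, 0.9, 2.8, 0.6],
              [2.0, 0.1, 0.4, 1.9]])

F = X.T                      # F[j] is feature j's expression across all samples
c = np.array([1, 1, 2])      # class labels c_i in {1, ..., n_c}; here n_c = 2
```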
To achieve the above objective, the technical solution adopted by the present invention comprises the following steps:
1) Based on the label information, for each sample point Xi, build a within-class neighborhood from its k1 same-label sample points and a between-class neighborhood from its k2 differently labeled sample points, and reconstruct the corresponding within-class weight matrix W_intra and between-class weight matrix W_inter.
Here sample Xj belongs to the k nearest neighbors of Xi. The neighbor covariance matrix G of sample Xi has elements G_ij = (Xi − Xj)^T (Xi − Xj) and is symmetric and positive definite. IntraN(Xi) denotes the within-class neighbor set of Xi, and InterN(Xi) its between-class neighbor set.
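The neighbor covariance matrix G and the optimal reconstruction weights are computed per sample as in standard LLE; a minimal sketch follows. The small regularizer is our addition (the text does not specify one) to guard against a singular local Gram matrix.

```python
import numpy as np

def lle_weights(x_i, neighbors, reg=1e-3):
    """Locally linear reconstruction weights of x_i from its neighbors
    (rows of `neighbors`), as in standard LLE. `reg` regularizes the
    local Gram ("neighbor covariance") matrix -- an assumption of ours."""
    Z = neighbors - x_i                  # shift neighbors so x_i is the origin
    G = Z @ Z.T                          # local Gram matrix G
    tr = np.trace(G)
    G = G + reg * (tr if tr > 0 else 1.0) * np.eye(len(G))
    w = np.linalg.solve(G, np.ones(len(G)))
    return w / w.sum()                   # weights constrained to sum to 1
```

Applied once per sample to its k1 same-label neighbors this yields a row of W_intra, and to its k2 differently labeled neighbors a row of W_inter.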
2) Using the optimal reconstruction weights above (which preserve the local geometry of the manifold in the low-dimensional space), compute the within-class low-dimensional embedding error ε(Y_intra) and the between-class embedding error ε(Y_inter) for Xi, and build the optimal classification criterion S(Y).
ε(·) is the error of linearly reconstructing sample Xi from its k neighbor points X1, ..., Xk (j ≠ i), and tr(·) is the matrix trace.
In the low-dimensional embedding space, maximizing the between-class reconstruction error while minimizing the within-class reconstruction error makes samples compact within a class and dispersed between classes, which better serves tumor classification. Based on this analysis, the following criterion function is built:
The within-class cost matrix M_intra and between-class cost matrix M_inter (sparse and positive semidefinite) are defined as M_intra = (I − W_intra)^T (I − W_intra), M_inter = (I − W_inter)^T (I − W_inter). The larger S(Y), the more pronounced the classification effect.
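A minimal sketch of the cost matrices and criterion, assuming the trace-ratio form tr(Y M_inter Y^T) / tr(Y M_intra Y^T) implied by "the larger S(Y), the better"; the exact ratio form is our reading where the text is ambiguous.

```python
import numpy as np

def llrfc_criterion(Y, W_intra, W_inter):
    """LLRFC criterion S(Y) for an embedding Y (d x n): ratio of the
    between-class to the within-class local reconstruction error,
    expressed through the cost matrices M_intra and M_inter."""
    n = Y.shape[1]
    I = np.eye(n)
    M_intra = (I - W_intra).T @ (I - W_intra)   # within-class cost matrix
    M_inter = (I - W_inter).T @ (I - W_inter)   # between-class cost matrix
    return np.trace(Y @ M_inter @ Y.T) / np.trace(Y @ M_intra @ Y.T)
```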
3) Compute each feature's classification contribution S(fj) (the score of feature fj under the LLRFC optimality criterion, i.e., the LLRFC score), sort the scores in descending order, and obtain the corresponding feature sequence.
Following the graph-preserving criterion, introduce the linear transformation Y = A^T X, the transformation matrix A defining the corresponding mapping. Under the constraint A^T A = I (I the identity matrix), taking Y = A^T X = fj gives the score S(fj) = fj M_inter fj^T / (fj M_intra fj^T). Compute each feature gene fj's score, sort the original feature sequence F in descending order of score, and obtain the new feature sequence F' = [F1, F2, ..., FD].
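Under that reading, the per-feature score is a ratio of quadratic forms over the two cost matrices; a sketch (the function name and argument layout are ours):

```python
import numpy as np

def llrfc_score(F, M_intra, M_inter):
    """Per-feature LLRFC score: for each feature vector f (a row of F,
    shape D x n), S(f) = f M_inter f^T / (f M_intra f^T). Returns the
    scores and the descending-order ranking of the features."""
    scores = np.array([(f @ M_inter @ f) / (f @ M_intra @ f) for f in F])
    order = np.argsort(-scores)      # descending: highest contribution first
    return scores, order
```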
4) Use the dynamic correlation analysis strategy to evaluate the correlation between feature genes in the sequence F' = [F1, F2, ..., FD]. The correlation coefficient measures the degree of correlation between two features and is used to further eliminate similarity redundancy between features.
The correlation coefficient between two features is defined as
ρ_jk = Σ_i (f_ij − f̄_j)(f_ik − f̄_k) / sqrt( Σ_i (f_ij − f̄_j)² · Σ_i (f_ik − f̄_k)² ),
where i indexes the samples and j and k index the corresponding genes; f_ij and f_ik are the expression values of Fj and Fk on sample Xi, and f̄_j and f̄_k are the two features' means over all samples. The absolute value of the correlation coefficient lies between 0 and 1: the closer to 1, the greater the similarity between the features and the more the redundancy.
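The quantity described above is the standard Pearson correlation coefficient; a direct transcription:

```python
import numpy as np

def pearson(fj, fk):
    """Pearson correlation between two feature vectors over the samples --
    the standard formula we take the text's correlation coefficient to be."""
    dj, dk = fj - fj.mean(), fk - fk.mean()     # center both features
    return (dj @ dk) / np.sqrt((dj @ dj) * (dk @ dk))
```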
The present invention adopts a forward feature selection strategy with dynamic correlation analysis. The feature subset S is initialized to the empty set. Each round takes one feature gene Fj from the sequence F' = [F1, F2, ..., FD]; the first round moves F1 into the subset, so S = {F1}. Then the next feature gene in the remaining sequence F' is taken and its correlation coefficient with each feature in S is computed. If any ρ_jk exceeds a given threshold σ (note: the threshold differs between data sets and is here determined experimentally), the feature gene Fj is deleted from F' and the next feature gene is examined; only when every ρ_jk is below σ is the feature gene Fj moved into the feature subset S. The process repeats until the subset reaches the required size or the sequence F' is empty.
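The forward selection loop just described can be sketched as follows (the function name is ours; the sigma and m0 defaults follow the 11Tumors example later in the text):

```python
import numpy as np

def forward_select(F_sorted, sigma=0.9, m0=70):
    """Forward selection with dynamic correlation analysis: walk the
    score-ranked features and keep one only if its |Pearson r| with every
    already-kept feature stays below sigma. F_sorted is a sequence of
    feature vectors in descending LLRFC-score order."""
    kept = []
    for j, f in enumerate(F_sorted):
        if all(abs(np.corrcoef(f, F_sorted[k])[0, 1]) < sigma for k in kept):
            kept.append(j)           # uncorrelated with everything kept so far
        if len(kept) == m0:
            break                    # subset has reached the required size
    return kept                      # indices into the ranked sequence
```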
Compared with the prior art, the present invention turns the LLRFC feature extraction method into a feature selection method, so the selected features have clear biological meaning and good interpretability, and it combines dynamic correlation analysis to judge the similarity between feature genes and eliminate redundancy. The invention effectively improves classifier accuracy, does not require the sample data set to follow a normal distribution, and applies to data of various distribution types. It can help find the disease-causing genes of cancer and supports the early diagnosis, tumor staging and typing, and prognosis-guided treatment of clinical tumor diseases.
Accompanying drawing explanation
Fig. 1 is the flow chart of the technical scheme.
Fig. 2 compares the classification accuracy curves of the present method and other methods on the 11Tumors data set (k1 = 3, k2 = 3).
Fig. 3 compares the classification accuracy curves on the Brain_Tumor1 data set (k1 = 4, k2 = 9).
Fig. 4 compares the classification accuracy curves on the Brain_Tumor2 data set (k1 = 2, k2 = 4).
Fig. 5 compares the classification accuracy curves on the Lung_Cancer data set (k1 = 4, k2 = 6).
Fig. 6 compares the classification accuracy curves on the SRBCT (Small Round Blue Cell Tumor) data set (k1 = 4, k2 = 7).
Fig. 7 compares the classification accuracy curves on the DLBCL (Diffuse Large B-Cell Lymphomas) data set (k1 = 4, k2 = 7).
Detailed description of the invention
The present invention is described in further detail below with reference to the drawings and an embodiment.
Embodiment
Classification is verified on the data set of 11 different tumor types (11Tumors) from the website http://www.gems-system.org, comparing the classification accuracy of the LLRFC score+, LLRFC score, Laplacian score, Fisher score, and t-test feature selection methods on this data set. The data set characteristics are shown in the table below:
Table 1: 11Tumors (gene number: 12533)
Considering the balance of the tumor sample distribution, the data are randomly divided in half by class: one half is the training set, used for feature selection; the other half is the test set, used to obtain the classification accuracy. Since SVM is insensitive to data dimensionality, it shows great advantages on small-sample, high-dimensional problems. For gene expression profile data, the classifier is LIBSVM with a linear kernel and default parameters. The data set is divided randomly for training (if a class has an odd number of samples, the training set receives one more sample than the test set; e.g., for the Ovary class, 14 samples go to the training set and 13 to the test set). For the 11Tumors data set, the training set has 89 samples and the test set 85.
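The per-class half split described above (with odd-sized classes giving the extra sample to training, as in the Ovary example) can be sketched as:

```python
import numpy as np

def stratified_half_split(labels, rng=None):
    """Random per-class half split: half of each class goes to training,
    the rest to test; with an odd class count the extra sample goes to
    training (matching the 14-train / 13-test Ovary example)."""
    if rng is None:
        rng = np.random.default_rng(0)   # fixed seed is our choice for the sketch
    train, test = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        cut = (len(idx) + 1) // 2        # odd class -> one more in training
        train.extend(idx[:cut])
        test.extend(idx[cut:])
    return np.array(train), np.array(test)
```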
(1) Feature subset selection:
1) Based on the label information ci ∈ {1, 2, ..., 11}, for each sample point Xi build the within-class graph from its k1 same-label sample points (set to 3 for the 11Tumors data set) and the between-class graph from its k2 differently labeled sample points (set to 3 here, verified by experiment), and reconstruct the within-class weight matrix W_intra and between-class weight matrix W_inter.
The set of the 89 training samples of 11Tumors can be written X = [X1, X2, ..., X89], a matrix of size 89 × 12533, with Xi ∈ R^12533 (i = 1, 2, ..., 89) the expression of all genes in sample i; the sample set can also be written X = F = [f1, f2, ..., f12533]^T, where fj is the vector of feature gene j's expression values over the samples. Neighbor points are selected by the Euclidean distance between samples and the class information, and the within-class weight matrix W_intra and between-class weight matrix W_inter are reconstructed.
Here sample Xj belongs to the k nearest neighbors of Xi; the neighbor covariance matrix G of sample Xi has elements G_ij = (Xi − Xj)^T (Xi − Xj) and is symmetric and positive definite. IntraN(Xi) is the set of the 3 same-label neighbor points of Xi, and InterN(Xi) the set of its 3 differently labeled neighbor points.
2) Using the optimal reconstruction weights above, compute the within-class low-dimensional embedding error ε(Y_intra) and the between-class embedding error ε(Y_inter), and build the optimal classification criterion S(Y).
3) Compute each feature's classification contribution S(fj) (the score of feature fj under the LLRFC optimality criterion), sort the scores in descending order, and obtain the feature sequence.
Using the score of feature fj under the LLRFC optimality criterion, compute each feature gene's score, sort the original features in descending order of score, and obtain the feature gene sequence F' = [F1, F2, ..., F12533].
4) Use the dynamic correlation analysis strategy to evaluate the correlation between feature genes in F' = [F1, F2, ..., F12533], further eliminate similarity redundancy between features, and obtain the feature subset S.
The correlation coefficient between two features is defined as above.
The forward feature selection strategy with dynamic correlation analysis is adopted here: the feature subset S is initialized to the empty set, and each round one feature gene Fj is taken from F' = [F1, F2, ..., F12533]. The first round moves F1 into the subset, so S = {F1}; then F2 is taken from the remaining sequence F' and its correlation coefficient ρ(1,2) with feature F1 in S is computed. In general, two features are considered strongly correlated when their correlation coefficient lies between 0.8 and 0.95; for the 11Tumors data set, σ = 0.9 is set. If ρ(1,2) ≥ 0.9, feature F2 is deleted; if ρ(1,2) < 0.9, F2 is moved into the subset S. The next feature Fj of F' is then taken and its correlation coefficients with all features in the subset are computed: if any ρ_jk exceeds the given threshold σ, Fj is deleted from F' and the next feature gene is examined; only when every ρ_jk is below σ is Fj moved into the feature subset S. The process repeats until the subset reaches the required size m0 (here m0 = 70 is selected).
(2) Classification performance verification
Seventy feature genes are selected by the LLRFC score+ feature selection method. With the top i feature genes selected, the training and test sets are X'_train (89 × i) and X'_test (85 × i) respectively. In Matlab 2012b, the "svmtrain" function of the LIBSVM toolbox trains on the data set X'_train, and the "svmpredict" function predicts the results on the test set X'_test. Classification accuracy is obtained for each choice of i (from 1 to 70) feature genes; the test is repeated 30 times and the average classification accuracy is computed.
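The incremental evaluation loop can be sketched as follows. The text trains LIBSVM with a linear kernel; here a nearest-centroid classifier stands in only so that the sketch is self-contained, not as a reproduction of the reported results.

```python
import numpy as np

def accuracy_curve(Xtr, ytr, Xte, yte, ranked, m0):
    """Train on the top-i ranked features for i = 1..m0 and record test
    accuracy. A nearest-centroid classifier stands in for LIBSVM here."""
    accs = []
    for i in range(1, m0 + 1):
        cols = ranked[:i]                # top-i features of the ranked sequence
        cents = {c: Xtr[ytr == c][:, cols].mean(axis=0) for c in np.unique(ytr)}
        pred = [min(cents, key=lambda c: np.linalg.norm(x[cols] - cents[c]))
                for x in Xte]            # assign each test sample to nearest centroid
        accs.append(float(np.mean(np.array(pred) == yte)))
    return accs
```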
The accuracy comparison curves of the five methods, the present LLRFC score+ versus LLRFC score, Laplacian score, Fisher score, and T-test, are shown in Fig. 2.
The invention was also tested on the Brain_Tumor1, Brain_Tumor2, Lung_Cancer, SRBCT, and DLBCL data sets (characteristics in Table 2); results are shown in Figs. 3-7.
Table 2: Characteristics of the other data sets
The experimental results on these data sets show that the present method achieves a better classification effect than the other methods, mainly because these data sets have many tumor classes, relatively many samples or genes, complex sample geometry, and large redundancy, so the benefit is evident. LLRFC score considers only the spatial geometry and class information of the samples, not the redundancy between feature genes, so its effect is inferior to LLRFC score+. On the various data sets, the LLRFC score+, LLRFC score, and Laplacian score algorithms all consider the neighborhood problem and can effectively preserve the internal structure of the data, so their classification accuracies are close; but because Laplacian score is an unsupervised algorithm and does not weigh the interrelations between features, it may select redundant genes, and its accuracy is somewhat inferior to the other two methods. The small round blue cell tumor (SRBCT) data set has a smaller gene number and less redundancy, so the experimental effect there is less striking; Laplacian score, being unsupervised and classifying by preserving the data's own cluster structure, performs best on it, though when 30 features are selected the LLRFC score+ algorithm achieves a similar effect.

Claims (3)

1. A redundancy-removing feature selection method LLRFC score+ based on LLRFC and correlation analysis; the method incorporates sample class information and proposes a feature selection method LLRFC score; it is a supervised filter feature selection method that uses the LLRFC score criterion to compute each feature gene's contribution to classification; the larger the score, the higher the contribution and the better the classification effect; feature genes are sorted in descending order of score, and the top-scoring feature gene sequence (carrying more class information) is selected; according to information theory, when selecting d features from a feature space of size D (D >> d), merely ranking each feature independently by some statistic or separability criterion and taking the top d features, without considering the complex interactions among features, rarely yields an optimal feature set and may even perform poorly in simulation; if two highly correlated feature genes both appear in the selected set, then whenever one is a feature gene the other necessarily is too; with a fixed subset dimension, selecting both of two features with comparable predictive ability introduces unnecessary redundancy, reducing the information carried by the subset while increasing computation; therefore, when selecting features from tumor gene expression data, the redundancy among the key genes in the feature sequence is minimized as far as possible;
the feature sequence selected by LLRFC score is further stripped of redundancy by a dynamic correlation analysis strategy, obtaining an optimal feature-gene subset and improving classification accuracy;
It is characterized in that: the gene expression profile data obtained by microarray (chip) technology is represented in the form of a numerical matrix, where each row vector represents the expression of all genes in one sample, each column vector represents the expression of one characteristic gene across all samples, and each matrix element is the expression value of a gene in the corresponding sample; a gene expression matrix composed of n samples, each containing D characteristic genes, is expressed as X = [X_1, X_2, ..., X_n], where X_i ∈ R^D (i = 1, 2, ..., n) represents all gene expression values of sample i; the tumor sample set may also be expressed in another form, X = F = [f_1, f_2, ..., f_D]^T, where f_j ∈ R^n (j = 1, 2, ..., D) is the feature vector formed by the expression values of feature j in each sample (patient); Y = [Y_1, Y_2, ..., Y_n] is the low-dimensional embedding of the original high-dimensional data produced by the manifold learning algorithm LLE, with Y_i ∈ R^d (i = 1, 2, ..., n) and d << D; in supervised manifold learning, the class label of a sample is defined as c_i ∈ {1, 2, ..., n_c}, where n_c is the number of sample classes; according to the Euclidean distances between tumor samples and the class information (diseased vs. normal, or different tumor subtypes), the k sample points closest to a sample point X_i are defined as the k-neighborhood of X_i; for any sample X_i, under the premise that the local linear reconstruction error is minimized, k_1 neighbor points with the same label and k_2 neighbor points with different labels are selected, and the corresponding intra-class graph and inter-class graph are built; because the class attributes of training datasets differ and the number of samples per class varies widely, the choice of the parameters k_1 and k_2 depends on the specific dataset; from empirical values and theoretical analysis, k_1 is generally no greater than min{n_c} − 1, and for tumor gene expression profile data k_1 is typically chosen between 2 and 5; setting k_2 is somewhat more involved: in the LLRFCscore algorithm the inter-class neighbor points formed from differently labelled samples are analogous to the support vectors of a support vector machine, so with k_1 fixed an SVM is trained and, according to the experimental results, the k value giving the highest classification accuracy is chosen as k_2.
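The neighborhood construction described above (k_1 nearest same-label and k_2 nearest different-label sample points under Euclidean distance) can be sketched in Python; this is an illustrative reconstruction, not code from the patent, and the function name and signature are assumptions:

```python
import numpy as np

def class_neighborhoods(X, labels, i, k1, k2):
    """For sample i, return the indices of its k1 nearest same-label
    neighbors (intra-class neighborhood IntraN) and its k2 nearest
    different-label neighbors (inter-class neighborhood InterN),
    using Euclidean distance.
    X: (n, D) matrix, one sample per row; labels: length-n array."""
    d = np.linalg.norm(X - X[i], axis=1)   # distances to all samples
    d[i] = np.inf                          # exclude the sample itself
    same = np.where(labels == labels[i])[0]
    diff = np.where(labels != labels[i])[0]
    intra = same[np.argsort(d[same])][:k1] # k1 nearest same-label points
    inter = diff[np.argsort(d[diff])][:k2] # k2 nearest other-label points
    return intra, inter
```

With a labelled expression matrix in hand, calling this per sample gives the index sets from which the intra- and inter-class graphs of the claim are built.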
2. The redundancy-removing feature selection method LLRFCscore+ based on LLRFC and correlation analysis according to claim 1, characterized in that the method is implemented in the following steps:
1) Based on the data label information, for any sample point X_i, construct the intra-class neighborhood formed by its k_1 same-label sample points and the inter-class neighborhood formed by its k_2 differently-labelled sample points, and build the corresponding intra-class reconstruction weight matrix W_intra and inter-class weight matrix W_inter:
$$(W_{\mathrm{intra}})_{ij}=\begin{cases}\sum_{j} G_{ij}^{-1}\Big/\sum_{ij} G_{ij}^{-1}, & X_j\in \mathrm{IntraN}(X_i)\\ 0, & \text{otherwise}\end{cases}$$

$$(W_{\mathrm{inter}})_{ij}=\begin{cases}\sum_{j} G_{ij}^{-1}\Big/\sum_{ij} G_{ij}^{-1}, & X_j\in \mathrm{InterN}(X_i)\\ 0, & \text{otherwise}\end{cases}$$
where sample X_j belongs to the k nearest neighbors of sample X_i; the neighborhood covariance matrix G of sample X_i has elements G_jk = (X_i − X_j)^T (X_i − X_k) over the neighbor points and is symmetric and positive definite; IntraN(X_i) denotes the intra-class neighbor set of X_i, and InterN(X_i) denotes its inter-class neighbor set;
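The optimal reconstruction weights above (normalized row sums of G⁻¹, as in standard LLE) can be sketched as follows; the regularization term and the function name are my additions, included only for numerical stability when G is near-singular:

```python
import numpy as np

def reconstruction_weights(X, i, nbr_idx, reg=1e-3):
    """LLE-style optimal reconstruction weights of sample i from its
    neighbors nbr_idx: solve G w = 1 (rows of G^{-1} summed) and
    normalize, where G_jk = (X_i - X_j)^T (X_i - X_k) is the local
    neighborhood covariance matrix."""
    Z = X[i] - X[nbr_idx]                  # (k, D) displacement vectors
    G = Z @ Z.T                            # local covariance (Gram) matrix
    G = G + reg * np.trace(G) * np.eye(len(nbr_idx))  # regularize
    w = np.linalg.solve(G, np.ones(len(nbr_idx)))     # G^{-1} row sums
    return w / w.sum()                     # weights sum to one
```

Applied once per sample, with neighbors drawn from IntraN(X_i) or InterN(X_i), the results fill the nonzero rows of W_intra and W_inter respectively.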
2) Using the optimal reconstruction weights above (which preserve the local geometry of the manifold in the low-dimensional space), compute for X_i the intra-class low-dimensional embedding error ε(Y_intra) and the inter-class low-dimensional embedding error ε(Y_inter), and build the optimal classification criterion S(Y):
$$\varepsilon(Y_{\mathrm{intra}})=\min_{Y_j\in \mathrm{IntraN}(Y_i)}\Big\|Y_i-\sum_{j=1}^{k_1}(W_{\mathrm{intra}})_{ij}Y_j\Big\|^2=\mathrm{tr}\{Y(I-W_{\mathrm{intra}})^T(I-W_{\mathrm{intra}})Y^T\}$$

$$\varepsilon(Y_{\mathrm{inter}})=\min_{Y_j\in \mathrm{InterN}(Y_i)}\Big\|Y_i-\sum_{j=1}^{k_2}(W_{\mathrm{inter}})_{ij}Y_j\Big\|^2=\mathrm{tr}\{Y(I-W_{\mathrm{inter}})^T(I-W_{\mathrm{inter}})Y^T\}$$
ε(·) denotes the error of linearly reconstructing sample X_i from its k neighbor points X_1, ..., X_k (j ≠ i), and tr(·) is the matrix trace; in the low-dimensional embedding space, maximizing the inter-class reconstruction error while minimizing the intra-class reconstruction error makes samples compact within each class and dispersed between classes, which better serves tumor classification. Based on the above analysis, the following criterion function is built:
$$S(Y)=\max\frac{\varepsilon(Y_{\mathrm{inter}})}{\varepsilon(Y_{\mathrm{intra}})}=\max\frac{\mathrm{tr}\{Y(I-W_{\mathrm{inter}})^T(I-W_{\mathrm{inter}})Y^T\}}{\mathrm{tr}\{Y(I-W_{\mathrm{intra}})^T(I-W_{\mathrm{intra}})Y^T\}}=\max\frac{\mathrm{tr}\{YM_{\mathrm{inter}}Y^T\}}{\mathrm{tr}\{YM_{\mathrm{intra}}Y^T\}}$$
The intra-class cost matrix M_intra and the inter-class cost matrix M_inter (both sparse and positive semidefinite) are defined as M_intra = (I − W_intra)^T (I − W_intra) and M_inter = (I − W_inter)^T (I − W_inter); the larger S(Y) is, the better the class separation.
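Forming the two cost matrices from the weight matrices is a direct translation of the definitions above; a minimal sketch (function name is illustrative):

```python
import numpy as np

def cost_matrices(W_intra, W_inter):
    """Build the sparse, positive semidefinite cost matrices
    M = (I - W)^T (I - W) for the intra- and inter-class graphs."""
    n = W_intra.shape[0]
    I = np.eye(n)
    M_intra = (I - W_intra).T @ (I - W_intra)
    M_inter = (I - W_inter).T @ (I - W_inter)
    return M_intra, M_inter
```

Symmetry and positive semidefiniteness follow immediately from the M = B^T B form, matching the character claimed for M_intra and M_inter.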
3) Compute the classification contribution S(f_j) of each feature (the score of feature f_j under the LLRFC optimality criterion, i.e. the LLRFCscore), sort the scores in descending order, and obtain the corresponding feature sequence;
According to the graph-embedding preservation criterion, a linear transformation Y = A^T X is introduced, with the transition matrix A defining the corresponding mapping; under the constraint A^T A = I (I the identity matrix), to evaluate feature j, A is defined as the unit vector whose j-th element is 1 and all other elements 0; then Y = A^T X = f_j and
$$S(f_j)=\frac{\mathrm{tr}(f_j^T M_{\mathrm{inter}} f_j)}{\mathrm{tr}(f_j^T M_{\mathrm{intra}} f_j)}$$
The score of each characteristic gene f_j is computed, and the original feature sequence F is sorted in descending order of score, giving the new feature sequence F' = [F_1, F_2, ..., F_D];
4) With the dynamic correlation analysis strategy, evaluate the correlation between characteristic genes in the feature sequence F' = [F_1, F_2, ..., F_D]; the correlation coefficient is used to measure the degree of correlation between two features, further eliminating similarity redundancy between them;
The correlation coefficient between two features is defined as:
$$\rho(f_j,f_k)=\frac{\sum_{i=1}^{m}(f_{ij}-\bar{f}_j)(f_{ik}-\bar{f}_k)}{\sqrt{\sum_{i=1}^{m}(f_{ij}-\bar{f}_j)^2}\sqrt{\sum_{i=1}^{m}(f_{ik}-\bar{f}_k)^2}},\qquad i=1,2,\dots,m;\ \ j,k=1,2,\dots,D$$
where m is the number of samples and j, k index the corresponding genes; f_ij and f_ik are the expression values of features F_j and F_k on sample X_i, and f̄_j and f̄_k are the means of the two features over all samples; the absolute value of the correlation coefficient lies between 0 and 1, and the closer it is to 1, the greater the similarity between the two features and the higher the redundancy.
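The coefficient above is the standard Pearson correlation, which `numpy.corrcoef` computes directly; a minimal wrapper (the function name is illustrative):

```python
import numpy as np

def feature_correlation(fj, fk):
    """Pearson correlation coefficient between two feature vectors,
    i.e. the expression of genes j and k across the m samples."""
    return np.corrcoef(fj, fk)[0, 1]
```

Its absolute value is what is compared against the threshold σ in the selection strategy of claim 3.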
3. The redundancy-removing feature selection method LLRFCscore+ based on LLRFC and correlation analysis according to claim 1, characterized in that: the method selects the feature subset with a forward feature selection strategy based on dynamic correlation analysis; the feature subset S is initialized as the empty set, and characteristic genes are taken one at a time from the feature gene sequence F' = [F_1, F_2, ..., F_D]; first F_1 is moved into the subset, so S = {F_1}; then the next characteristic gene F_j is taken from the remaining sequence F' and its correlation coefficient with every feature already in subset S is computed; if any ρ_jk exceeds the given threshold σ, the characteristic gene F_j is deleted from the sequence F' and the next gene is examined; if and only if every ρ_jk is below the threshold σ is the corresponding characteristic gene F_j moved into the subset S; this process is repeated until the subset reaches the required size or the sequence F' is empty.
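The forward selection loop of claim 3 can be sketched as follows; this is an illustrative reconstruction under the assumption that features arrive as rows of a matrix already sorted by descending LLRFCscore (function and parameter names are mine):

```python
import numpy as np

def forward_select(F_sorted, sigma, max_features):
    """Forward feature selection with dynamic correlation analysis:
    scan the score-ranked features F_sorted (rows = features, columns
    = samples); keep a feature only if its absolute Pearson correlation
    with every already-selected feature is below the threshold sigma.
    Stop when max_features are selected or the sequence is exhausted."""
    selected = []                               # indices into F_sorted
    for j in range(F_sorted.shape[0]):
        ok = all(abs(np.corrcoef(F_sorted[j], F_sorted[s])[0, 1]) < sigma
                 for s in selected)             # vacuously true for S = {}
        if ok:
            selected.append(j)                  # move F_j into subset S
        if len(selected) == max_features:
            break                               # subset size requirement met
    return selected
```

Because the first feature faces an empty subset, it is always admitted, matching the claim's initialization step S = {F_1}.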
CN201610057637.3A 2016-01-27 2016-01-27 Redundancy removal feature selection method LLRFC score+ based on LLRFC and correlation analysis Pending CN105740653A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610057637.3A CN105740653A (en) 2016-01-27 2016-01-27 Redundancy removal feature selection method LLRFC score+ based on LLRFC and correlation analysis

Publications (1)

Publication Number Publication Date
CN105740653A true CN105740653A (en) 2016-07-06

Family

ID=56246840

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610057637.3A Pending CN105740653A (en) 2016-01-27 2016-01-27 Redundancy removal feature selection method LLRFC score+ based on LLRFC and correlation analysis

Country Status (1)

Country Link
CN (1) CN105740653A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108122223A (en) * 2017-12-18 2018-06-05 浙江工业大学 Ferrite depth of defect study recognition methods based on Fisher criterions
CN109190713A (en) * 2018-09-29 2019-01-11 王海燕 The minimally invasive fast inspection technology of oophoroma based on serum mass spectrum adaptive sparse feature selecting
CN110326051A (en) * 2017-03-03 2019-10-11 通用电气公司 The method of expression distinctive elements in biological sample for identification
CN110362603A (en) * 2018-04-04 2019-10-22 北京京东尚科信息技术有限公司 A kind of feature redundancy analysis method, feature selection approach and relevant apparatus
CN111814868A (en) * 2020-07-03 2020-10-23 苏州动影信息科技有限公司 Model based on image omics feature selection, construction method and application
CN112215290A (en) * 2020-10-16 2021-01-12 苏州大学 Q learning auxiliary data analysis method and system based on Fisher score
CN112802555A (en) * 2021-02-03 2021-05-14 南开大学 Complementary differential expression gene selection method based on mvAUC
CN113177604A (en) * 2021-05-14 2021-07-27 东北大学 High-dimensional data feature selection method based on improved L1 regularization and clustering
CN114913921A (en) * 2022-05-07 2022-08-16 厦门大学 System and method for identifying marker gene
CN116045427A (en) * 2023-03-30 2023-05-02 福建省特种设备检验研究院 Elevator car air purification system based on intelligent decision

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110326051B (en) * 2017-03-03 2023-11-14 环球生命科学解决方案运营英国有限公司 Method and analysis system for identifying expression discrimination elements in biological samples
CN110326051A (en) * 2017-03-03 2019-10-11 通用电气公司 The method of expression distinctive elements in biological sample for identification
CN108122223A (en) * 2017-12-18 2018-06-05 浙江工业大学 Ferrite depth of defect study recognition methods based on Fisher criterions
CN110362603A (en) * 2018-04-04 2019-10-22 北京京东尚科信息技术有限公司 A kind of feature redundancy analysis method, feature selection approach and relevant apparatus
CN109190713A (en) * 2018-09-29 2019-01-11 王海燕 The minimally invasive fast inspection technology of oophoroma based on serum mass spectrum adaptive sparse feature selecting
CN111814868A (en) * 2020-07-03 2020-10-23 苏州动影信息科技有限公司 Model based on image omics feature selection, construction method and application
CN112215290A (en) * 2020-10-16 2021-01-12 苏州大学 Q learning auxiliary data analysis method and system based on Fisher score
CN112215290B (en) * 2020-10-16 2024-04-09 苏州大学 Fisher score-based Q learning auxiliary data analysis method and Fisher score-based Q learning auxiliary data analysis system
CN112802555A (en) * 2021-02-03 2021-05-14 南开大学 Complementary differential expression gene selection method based on mvAUC
CN112802555B (en) * 2021-02-03 2022-04-19 南开大学 Complementary differential expression gene selection method based on mvAUC
CN113177604A (en) * 2021-05-14 2021-07-27 东北大学 High-dimensional data feature selection method based on improved L1 regularization and clustering
CN113177604B (en) * 2021-05-14 2024-04-16 东北大学 High-dimensional data feature selection method based on improved L1 regularization and clustering
CN114913921A (en) * 2022-05-07 2022-08-16 厦门大学 System and method for identifying marker gene
CN116045427A (en) * 2023-03-30 2023-05-02 福建省特种设备检验研究院 Elevator car air purification system based on intelligent decision
CN116045427B (en) * 2023-03-30 2023-10-10 福建省特种设备检验研究院 Elevator car air purification system based on intelligent decision

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160706

RJ01 Rejection of invention patent application after publication