CN103186717A

CN103186717A - Heuristic breadth-first searching method for cancer-related genes

Info

Publication number: CN103186717A
Application number: CN 201310019941
Authority: CN
Inventors: 黄上峰; 王树林; 李雪玲; 赵俊; 邱萍; 王耀雄; 葛运建; 双丰; 朱旻
Original assignee: Hefei Institutes of Physical Science of CAS
Current assignee: Hefei Institutes of Physical Science of CAS
Priority date: 2013-01-18
Filing date: 2013-01-18
Publication date: 2013-07-03

Abstract

The invention relates to a heuristic breadth-first search method for finding tumor-related genes. The method measures the importance of a gene by its occurrence frequency in the selected gene subsets: genes with higher occurrence frequencies are regarded as the most important tumor-related genes. On this basis, a classifier is designed and a gene ranking method based on HBSA is established. Studies have shown that informative gene selection plays an important role in improving classification performance, and that such genes may serve as important markers for clinical tumor diagnosis, so finding the smallest gene subset with the highest classification performance is an important research objective. Experimental results indicate that the method not only achieves good generalization performance but also discovers important tumor genes. The relationship between the occurrence frequency of the selected genes and the number of genes follows a power-law distribution. The genes in gene subsets with very high classification accuracy are closely related to specific tumor subtypes, and some of them may even be important genes directly related to the tumor.

Description

A method for finding tumor-related genes based on heuristic breadth-first search

Technical field

The present invention relates to techniques and theory such as the acquisition of tumor gene expression profile data, the selection of important tumor-related genes, and machine learning. In particular, the system applies a heuristic breadth-first search method, designed for the characteristics of tumor gene expression profile sample sets, to find important tumor-related genes and to classify tumor subtypes according to these genes. The system therefore belongs to the application of pattern recognition in biomedicine.

Background

From the viewpoint of molecular biology, a tumor arises because DNA damage on certain chromosomes causes abnormal gene expression in cells, leading to uncontrolled cell growth, lack of differentiation, and abnormal proliferation. Tumors are a complex class of genetic diseases and also a systems-biology disease, and the mechanism of tumor development is still not fully understood. It is well known that treatment outcomes for patients with advanced cancer are usually poor and that early diagnosis helps successful treatment. However, early diagnosis by traditional methods, such as X-ray detection of tumor masses, is difficult. In particular, different tumor subtypes respond differently to the same treatment. Studying gene expression profiles in depth with modern data mining is therefore of practical significance for revealing the process of tumor development and for molecular tumor diagnosis.

Since the publication of the first paper on tumor subtype classification based on gene expression profiles (GEP) [T. R. Golub, D. K. Slonim, P. Tamayo et al., "Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring," Science, vol. 286, no. 5439, pp. 531-537, 1999], this field has quickly become a research focus and has been studied extensively. Many gene expression profile data sets have been published on the Internet, such as colon tumor [U. Alon, N. Barkai, D. A. Notterman et al., "Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays," Proceedings of the National Academy of Sciences of the United States of America, vol. 96, no. 12, pp. 6745-6750, 1999], small round blue cell tumor (SRBCT) [J. Khan, J. S. Wei, M. Ringner et al., "Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks," Nature Medicine, vol. 7, no. 6, pp. 673-679, 2001], diffuse large B-cell lymphoma (DLBCL) [M. A. Shipp, K. N. Ross, P. Tamayo et al., "Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning," Nature Medicine, vol. 8, no. 1, pp. 68-74, 2002], and prostate tumor [D. Singh, P. G. Febbo, K. Ross et al., "Gene expression correlates of clinical prostate cancer behavior," Cancer Cell, vol. 1, no. 2, pp. 203-209, 2002]. All publicly released tumor data sets share one characteristic: the data are high-dimensional with small sample sizes, mainly because of constraints on resources, data acquisition time, and genotyping.

Over the past ten years, many supervised classification methods from pattern recognition have been widely applied to tumor classification based on gene expression profile data, such as support vector machines (SVM) [I. Guyon, W. J., and V. Vapnik, "Gene selection for cancer classification using support vector machine," Machine Learning, vol. 46, no. 1-3, pp. 389-422, 2002], artificial neural networks (ANN) [Y. Xu, F. M. Selaru, J. Yin et al., "Artificial neural networks and gene filtering distinguish between global gene expression profiles of Barrett's esophagus and esophageal cancer," Cancer Research, vol. 62, no. 12, pp. 3493-3497, 2002], the k-nearest neighbor method (KNN) [L. P. Li, T. A. Darden, C. R. Weinberg et al., "Gene assessment and sample classification for gene expression data using a genetic algorithm/k-nearest neighbor method," Combinatorial Chemistry & High Throughput Screening, vol. 4, no. 8, pp. 727-739, 2001], and nearest shrunken centroids (NSC) [R. Tibshirani, T. Hastie, B. Narasimhan et al., "Diagnosis of multiple cancer types by shrunken centroids of gene expression," Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 10, pp. 6567-6572, 2002]. All of these studies show that classification methods based on gene expression profile data are valuable for the early diagnosis and clinical prediction of tumors and have broad prospects.

However, because the number of genes is far larger than the number of samples, the data suffer from the curse of dimensionality, and dimensionality reduction of gene expression profile data is an indispensable step. Dimensionality reduction methods fall mainly into feature extraction methods, such as total principal component regression (TPCR) [Y. X. Tan, L. M. Shi, W. D. Tong et al., "Multi-class cancer classification by total principal component regression (TPCR) using microarray gene expression data," Nucleic Acids Research, vol. 33, no. 1, pp. 56-65, 2005], and gene selection methods. Both kinds of dimensionality reduction must be completed before the classification model is built. Unlike feature extraction, gene selection does not change the representation of the original data. By deleting redundant and irrelevant genes, gene selection can therefore not only improve tumor classification performance but also select an informative gene subset that contains important genes, and the informative genes in it can serve as markers for tumor detection and as candidate drug targets for treatment. More importantly, this helps to elucidate the underlying mechanism of tumor development, so gene selection plays an important role in tumor classification.

In fact, because of the characteristics of gene expression profiles, most complicated methods are not clearly better than the simplest ones, and the small improvement in prediction performance that an overly complicated method brings cannot compensate for its loss of biological meaning [B. Haibe-Kains, C. Desmedt, C. Sotiriou et al., "A comparative study of survival models for breast cancer prognostication based on microarray data: does a single gene beat them all?" Bioinformatics, vol. 24, no. 19, pp. 2200-2208, 2008]. Therefore, designing a method that finds the smallest gene subset with the highest or near-highest classification performance is very important for designing robust classification methods. Furthermore, identifying these smallest gene subsets means removing noise and redundancy from the gene expression profile to the greatest extent, which not only improves prediction accuracy but also reduces the cost of tumor diagnosis in clinical practice by using the fewest biological markers. However, because of the curse of dimensionality of gene expression profile data sets, selecting the smallest gene subset from thousands of genes involves two problems: overfitting and selection bias. Finding a small gene subset with very high classification performance in such a huge gene space may be mere coincidence. To obtain more reliable classification performance, both overfitting and selection bias must be avoided.

Ambroise et al. [C. Ambroise and G. J. McLachlan, "Selection bias in gene extraction on the basis of microarray gene-expression data," Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 10, pp. 6562-6566, 2002] found that if the test set is not completely independent of the classifier's learning process, selection bias leads to overly optimistic results. Wang et al. [L. P. Wang, F. Chu, and W. Xie, "Accurate cancer classification using expressions of very few genes," IEEE-ACM Transactions on Computational Biology and Bioinformatics, vol. 4, no. 1, pp. 40-53, 2007] further pointed out that, judged by this criterion, many earlier studies, such as [S. Dudoit, J. Fridlyand, and T. P. Speed, "Comparison of discrimination methods for the classification of tumors using gene expression data," Journal of the American Statistical Association, vol. 97, no. 457, pp. 77-87, 2002] and [J. Khan, J. S. Wei, M. Ringner et al., "Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks," Nature Medicine, vol. 7, no. 6, pp. 673-679, 2001], obtained overly optimistic experimental results, and they proposed a very simple method that achieves high classification performance with only a few genes. Their method combines gene ranking with exhaustive search to find the smallest gene subset and obtain unbiased classification accuracy. Although it yields highly accurate unbiased results, its computational cost becomes so high when many genes are preselected (for example, more than 300) that the algorithm is infeasible. The present invention therefore designs a breadth-first search method based on heuristic information to select important tumor-related genes and builds a tumor classification prediction model on this basis.

Summary of the invention

To overcome the curse of dimensionality of gene expression profile data, the object of the present invention is to propose a method for finding tumor-related genes based on heuristic breadth-first search. The occurrence frequency of a gene in the selected gene subsets is used to measure the gene, and the top-ranked genes are regarded as the most important tumor-related genes. A classifier is designed accordingly, and a gene ranking method based on HBSA (Heuristic Breadth-first Search Algorithm) is established. The steps are as follows:

(1) Let $G=\{g_1,\ldots,g_n\}$ denote a set of genes and $S=\{s_1,\ldots,s_m\}$ a set of samples, where $|G|=n$ is the number of genes and $|S|=m$ is the number of samples. The corresponding gene expression profile data set is represented as a matrix $X=(x_{i,j})_{m\times n}$, $1\le i\le m$, $1\le j\le n$, where $x_{i,j}$ is the expression level of gene $g_j$ in sample $s_i$; usually $n\gg m$;

Each vector $s_i$ in the gene expression matrix is regarded as a point in $n$-dimensional space, and each of the $m$ sample vectors consists of $n$ expression values. Let $L=\{c_1,\ldots,c_k\}$ denote the set of class labels in the data set, where $|L|=k$ is the number of classes. The class of each sample is usually known, so $S\times L=\{(s_i,l_i)\mid s_i\in R^n,\ l_i\in L,\ i=1,2,\ldots,m\}$ denotes the sample space with class labels;

(2) Select from the gene space $P(G)$ (the power set of the gene set $G$) an informative gene subset $T$ with the highest classification performance. It is assumed that a gene subset with strong classification performance is related to a specific tumor subtype. $\mathrm{Acc}(T)$ denotes the classification ability of gene subset $T$ on the sample data set and is usually measured by the prediction accuracy of a classifier. The selected informative gene subset $T$ should satisfy the following two objectives:

$$\min_{T\in P(G)} |T| \tag{1}$$

$$\max_{T\in P(G)} \mathrm{Acc}(T) \tag{2}$$

$$\text{s.t.}\quad \mathrm{Acc}(T)\ge \mathrm{Acc}(G),\quad T\subset G \tag{3}$$

where $|T|$ denotes the cardinality of gene subset $T$. A gene subset that satisfies objectives (1) and (2) is called an optimal gene subset $T^*$. The optimal set $A^*$ contains all optimal gene subsets $T^*$, that is,

$$A^*=\{T^*\mid T^* \text{ satisfies objectives (1) and (2) simultaneously}\};$$

When the classifier is designed, the number of training samples in each class is at least five times the number of features, that is:

$$(m/k)/s_n > 5 \tag{4}$$

where $k$ denotes the number of classes, $m$ the number of training samples, and $s_n$ the number of selected genes. For an ensemble classifier composed of $N$ individual classifiers, a confidence is defined for each sample to measure the reliability of its classification. Suppose a data set has $k$ subclasses, denoted $L=\{c_1,\ldots,c_k\}$. A test sample is assigned a vote vector $(m_1,\ldots,m_k)$, where each component $m_i$ is the number of votes the sample receives for subclass $c_i$ in $L$.

Let $m_{\max}$ and $m_{sec}$ denote the largest and second-largest vote counts in the vote vector $(m_1,\ldots,m_k)$. The confidence conf of a test sample is defined as $conf=m_{\max}/m_{sec}$; if $m_{sec}=0$, conf is set to $N$, so that $1\le conf\le N$. The larger conf is, the more reliable the classification of the sample.
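
The confidence definition above can be illustrated with a minimal sketch. The function name `confidence` and its argument names are assumptions made for this illustration and do not appear in the original text; the sketch only restates the rule conf = m_max / m_sec, with conf = N when the second-largest vote count is zero.

```python
def confidence(votes, n_classifiers):
    """Confidence of one test sample from its vote vector (m_1, ..., m_k).

    votes         -- vote counts per subclass, as cast by the ensemble
    n_classifiers -- N, the number of individual classifiers (assumed name)
    """
    if len(votes) < 2:
        raise ValueError("at least two subclasses are required")
    ordered = sorted(votes, reverse=True)
    m_max, m_sec = ordered[0], ordered[1]
    if m_sec == 0:
        # All votes went to a single subclass: conf is set to N.
        return float(n_classifiers)
    return m_max / m_sec
```

For example, with N = 10 classifiers and a vote vector (7, 2, 1), the sketch yields 7 / 2 = 3.5, while a unanimous vote (10, 0, 0) yields 10.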

The HBSA-based gene ranking method sorts genes by their occurrence frequency in the set of optimal gene subsets $A^*$; the top-ranked genes are regarded as the most important tumor-related genes.

The present invention measures the importance of a gene by its occurrence frequency in the selected gene subsets. Experimental results show that this method not only achieves good generalization performance but also finds important tumor genes. Furthermore, the present invention finds that the occurrence frequency of the selected genes follows a power-law distribution with respect to the number of genes. This indicates that the few top-ranked genes can be used as markers for tumor diagnosis, that the genes in gene subsets with very high classification accuracy are closely related to specific tumor subtypes, and that some of these genes may even be important genes directly related to the tumor. Analysis of the functions of the selected genes, the related biological pathways, and the protein interaction networks further demonstrates the advantage of this method in finding important tumor-related genes.

The beneficial effects of the invention are as follows. Tumors are a major genetic disease threatening human health, and their pathogenesis is still not fully understood, so revealing tumor pathogenesis and accurately diagnosing tumor subtypes for personalized treatment has long been a goal of research. The present invention is a step toward this goal: the method helps to reveal tumor pathogenesis, to diagnose tumor subtypes accurately, and to personalize tumor treatment.
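
The frequency-based ranking can be sketched in a few lines. The function name `rank_genes_by_frequency` and the tie-breaking rule (alphabetical order among genes with equal frequency) are assumptions added for illustration; the text only specifies sorting by occurrence frequency in A^*.

```python
from collections import Counter


def rank_genes_by_frequency(optimal_subsets):
    """Rank genes by how many subsets of A* they occur in.

    optimal_subsets -- iterable of gene collections (one per subset in A*)
    Returns (gene, frequency) pairs, most frequent first; ties are broken
    alphabetically, which is an illustrative choice, not part of the source.
    """
    counts = Counter(g for subset in optimal_subsets for g in set(subset))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

With the hypothetical input `[{"a", "d"}, {"a", "b"}, {"a", "c"}, {"d", "b"}]`, gene `a` occurs three times and is ranked first, followed by `b` and `d` with two occurrences each.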

Brief description of the drawings

Fig. 1 is a schematic diagram of searching for optimal gene subsets with the HBSA algorithm;

Fig. 2 is a schematic diagram of the search expansion tree generated by HBSA;

Fig. 3 is a schematic diagram of two regulatory pathways; the dotted lines indicate possible gene combinations across different regulatory pathways;

Fig. 4 is a flow chart of the analysis method of the present invention;

Fig. 5 is a block diagram of the construction method of the ensemble classifier based on HBSA-SVM;

Fig. 6 shows the power-law distribution of gene occurrence frequency against the number of genes on six data sets; the abscissa is the sequence number of the gene, the ordinate is the occurrence frequency, and both axes are logarithmic;

Fig. 7 compares the classification prediction accuracy of three gene ranking methods.

Embodiment

To set out the technical solution more clearly, the classification problem to be solved is first described formally, then the search strategy of the HBSA algorithm is introduced, and then the implementation of the HBSA algorithm is given. On this basis, an HBSA-based ensemble classifier construction method and an HBSA-based gene ranking method are designed to obtain unbiased prediction accuracy and to find important tumor-related genes. Experimental results show the feasibility and effectiveness of the technical solution, and comparison with other related methods shows its advantages. The selected genes can also be analyzed biomedically from three aspects (function of individual genes, pathway analysis, and protein networks) to further support the soundness of the method.

1. Problem description

Let $G=\{g_1,\ldots,g_n\}$ denote a set of genes and $S=\{s_1,\ldots,s_m\}$ a set of samples, where $|G|=n$ is the number of genes and $|S|=m$ is the number of samples. The corresponding gene expression profile data set can be represented as a matrix $X=(x_{i,j})_{m\times n}$, $1\le i\le m$, $1\le j\le n$, where $x_{i,j}$ is the expression level of gene $g_j$ in sample $s_i$; usually $n\gg m$. Each vector $s_i$ in the gene expression matrix is regarded as a point in $n$-dimensional space, and each of the $m$ sample vectors consists of $n$ expression values. Let $L=\{c_1,\ldots,c_k\}$ denote the set of class labels in the data set, where $|L|=k$ is the number of classes. The class of each sample is usually known, so $S\times L=\{(s_i,l_i)\mid s_i\in R^n,\ l_i\in L,\ i=1,2,\ldots,m\}$ denotes the sample space with class labels.

Selecting from the gene space $P(G)$ (the power set of the gene set $G$) an informative gene subset $T$ with the highest classification performance is a crucial problem, but it is NP-complete. Biomedical experts still do not fully know which genes, or how many, are related to a specific tumor subtype. The present invention therefore assumes that a gene subset with strong classification performance is related to a specific tumor subtype. $\mathrm{Acc}(T)$ denotes the classification ability of gene subset $T$ on the sample data set and is usually measured by the prediction accuracy of a classifier. The selected informative gene subset $T$ should satisfy the following two objectives.

$$\min_{T\in P(G)} |T| \tag{1}$$

$$\max_{T\in P(G)} \mathrm{Acc}(T) \tag{2}$$

$$\text{s.t.}\quad \mathrm{Acc}(T)\ge \mathrm{Acc}(G),\quad T\subset G \tag{3}$$

where $|T|$ denotes the cardinality of gene subset $T$. A gene subset that satisfies objectives (1) and (2) is called an optimal gene subset $T^*$. Note that the optimal gene subset $T^*$ of a data set is not unique, because genes that belong to the same regulatory pathway in a cell have very similar expression patterns and functions. The optimal set $A^*$ contains all optimal gene subsets $T^*$, that is,

$$A^*=\{T^*\mid T^* \text{ satisfies objectives (1) and (2) simultaneously}\}.$$

Although a single gene subset $T^*$ is enough for classification, finding as many optimal gene subsets as possible is very helpful for understanding the structure of a tumor data set and for finding the more important tumor-related genes. Because $|G|=n$ is usually very large (for example, a sample usually contains 2,000 to 30,000 genes), exhaustively searching the space of $2^n$ gene subsets for $A^*$ is impractical. A good solution is to use a heuristic method to find $A^*$ in a reduced subspace. One example is using a genetic algorithm (GA) to find the best informative gene subset. Another is selecting very small informative gene subsets with a gene ranking method combined with clustering and classification. However, different methods select gene subsets of different cardinalities, so it is very difficult to decide from the method alone which gene subset is optimal for a specific tumor data set. The present invention must therefore trade off the number of genes against classification accuracy. Jain et al. [A. K. Jain, R. P. W. Duin, and J. C. Mao, "Statistical pattern recognition: A review," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 4-37, 2000] proposed that the number of selected features should satisfy criterion (4): to avoid the curse of dimensionality when designing a classifier, the number of training samples in each class should be at least five times the number of features, that is,

$$(m/k)/s_n > 5 \tag{4}$$

where $k$ denotes the number of classes, $m$ the number of training samples, and $s_n$ the number of selected genes. For more complex classifiers, the ratio of sample size to dimensionality should be even larger. For example, to design a classifier with good generalization performance for a two-class tumor data set with only 80 samples, the present invention considers selecting at most 8 genes to construct the classifier. Because selecting enough genes on a small-sample data set can always yield very high classification performance, the goal of the present invention is to select the smallest gene subsets with near-optimal classification performance rather than gene subsets that have optimal classification performance but contain too many genes. Gene subsets that approximately satisfy objectives (1) and (2) are therefore also included in the set of optimal gene subsets $A^*$. Given this set of optimal gene subsets, how to obtain a more reliable prediction accuracy and how to find the most important tumor-related genes are the two most important questions.
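
As a worked check of criterion (4), the following sketch computes the largest admissible number of genes. The function name `max_selected_genes` and the `strict` flag are assumptions for illustration. The flag exists because the prose says "at least five times" (which admits 8 genes for 80 samples and 2 classes, matching the example in the text), whereas the strict inequality in (4) as printed would admit only 7.

```python
def max_selected_genes(m, k, ratio=5, strict=False):
    """Largest s_n such that (m / k) / s_n >= ratio (or > ratio if strict).

    m     -- number of training samples
    k     -- number of classes
    ratio -- required samples-per-class to features ratio (integer, 5 in the text)
    """
    if m <= 0 or k <= 0 or ratio <= 0:
        raise ValueError("m, k and ratio must be positive")
    limit, remainder = divmod(m, k * ratio)
    if strict and remainder == 0:
        # Equality (m / k) / s_n == ratio is not allowed under the strict reading.
        limit -= 1
    return max(limit, 0)
```

For 80 samples in two classes, `max_selected_genes(80, 2)` gives 8 under the "at least five times" reading and 7 with `strict=True`.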

2. Gene preselection

It is generally believed that differentially expressed genes are usually tumor-related genes, so filter-based gene ranking methods are commonly used to preselect differentially expressed genes from the original gene space, although, because of noise, the differentially expressed genes selected are not always tumor-related. The main idea of gene preselection is to assign each gene a score that represents its importance according to some scoring criterion. Many univariate methods, such as the t-test and the Bhattacharyya distance, have been widely used as scoring criteria. However, these methods require the data to follow a Gaussian distribution; otherwise they cannot achieve their best experimental performance. The advantage of rank-sum tests is that they do not require the data to follow any particular distribution. In fact, tumor data sets usually do not follow a Gaussian distribution, and experimental results show that the Wilcoxon rank sum test (WRST) outperforms the t-test for gene selection. However, WRST only suits two-class problems, whereas the Kruskal-Wallis rank sum test (KWRST) suits multi-class data sets. Extensive comparisons show that gene selection methods based on WRST or KWRST perform well on tumor classification problems. Because KWRST does not require the data to follow a particular distribution and suits small data sets, the present invention uses KWRST in its experiments to preselect an informative gene set $G^*$ containing the $p$ genes with the strongest classification ability.
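
The preselection step can be sketched with a plain-Python Kruskal-Wallis H statistic. This is only an illustration under stated assumptions: the function names are hypothetical, ties receive average ranks, the usual tie-correction factor is omitted for brevity, and genes are ranked by H rather than by a p-value. A statistics library routine would normally be preferred in practice.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            ranks[idx] = mean_rank
        pos = end + 1
    return ranks


def kruskal_wallis_h(values, labels):
    """H = 12 / (N (N + 1)) * sum_c R_c^2 / n_c - 3 (N + 1), without tie correction."""
    ranks = average_ranks(values)
    n_total = len(values)
    rank_sums, counts = {}, {}
    for rank, label in zip(ranks, labels):
        rank_sums[label] = rank_sums.get(label, 0.0) + rank
        counts[label] = counts.get(label, 0) + 1
    weighted = sum(rank_sums[c] ** 2 / counts[c] for c in counts)
    return 12.0 / (n_total * (n_total + 1)) * weighted - 3.0 * (n_total + 1)


def preselect_genes(X, labels, p):
    """Return the column indices of the p genes with the largest H statistic.

    X is a list of samples (rows), each a list of n expression values.
    """
    n_genes = len(X[0])
    scores = [kruskal_wallis_h([row[j] for row in X], labels) for j in range(n_genes)]
    return sorted(range(n_genes), key=lambda j: scores[j], reverse=True)[:p]
```

In this sketch a gene whose expression separates the classes perfectly receives the largest H and is selected first, while a gene with constant expression receives H = 0.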

3. Heuristic breadth-first search

Search strategy

The goal of the present invention is to find as many optimal gene subsets as possible. When $p$ is very small, a plain breadth-first search can achieve objectives (1) and (2). However, when $p$ is large (for example, $p=300$), the CPU time consumed by such a search is unacceptable. The present invention therefore designs a heuristic breadth-first search algorithm (HBSA) that uses $\mathrm{Acc}(T)$ as heuristic information to find the set of optimal gene subsets $A^*$. This method reduces the search space significantly. Its main idea is illustrated by the example shown in Fig. 1.

Suppose there is an informative gene set $G^*=\{a,b,c,d\}$ containing four informative genes selected from the data set by KWRST. First, a root node is generated and assigned the empty set $\emptyset$.

The root node is then expanded into 4 child nodes, each assigned one of the 4 genes $\{a,b,c,d\}$. Next, the 4 nodes of the first layer are expanded into 12 child nodes in the second layer. The classification accuracy of every node in the second layer is computed with $\mathrm{Acc}(T)$, where $T$ is the subset formed by the genes of all nodes on the path from the root node to the current leaf node. For example, the gene subset $T$ of node 6 is $\{a,b\}$, so the accuracy of node 6 is $\mathrm{Acc}(\{a,b\})$. The four nodes with the highest classification accuracy in the current layer are then selected and expanded into 8 child nodes. Note that all genes on one path are distinct. Finally, in the third layer, the prediction accuracy of each node is again measured with $\mathrm{Acc}(T)$. Suppose that nodes 19, 20, 22, and 23 obtain the highest classification accuracy; these 4 nodes are then selected for expansion, and the other nodes in this layer are discarded. If at least one node in this layer has a classification accuracy greater than or equal to the preset threshold, the search ends. Assuming the algorithm ends at this point, the set of optimal gene subsets is $A^*=\{\{a,d,b\},\{a,d,c\},\{b,c,a\},\{c,b,d\}\}$, which contains 4 optimal gene subsets.
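
The node counts in this example (4, then 12, then 8) follow from simple arithmetic, sketched below. The function name `layer_sizes` is an assumption for illustration; the sketch counts generated nodes per layer when the first layer is fully expanded and at most `w` nodes are expanded in every later layer, and it ignores any pruning of duplicate gene subsets, so it gives an upper bound.

```python
def layer_sizes(p, w, depth):
    """Number of nodes generated in layers 0..depth of the search tree.

    p     -- number of preselected genes
    w     -- search width (nodes kept for expansion from layer 2 onward)
    depth -- deepest layer to count
    """
    sizes = [1]          # layer 0 holds only the root node
    expanded = 1         # nodes of the previous layer that get expanded
    for i in range(1, depth + 1):
        # A node in layer i - 1 already holds i - 1 genes, leaving p - (i - 1).
        children = expanded * (p - (i - 1))
        sizes.append(children)
        # Layer 1 is always expanded in full; later layers keep at most w nodes.
        expanded = children if i == 1 else min(w, children)
    return sizes
```

For p = 4 and w = 4 this gives layers of 1, 4, 12, and 8 nodes, whereas an unrestricted breadth-first search would generate 24 nodes in the third layer.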

The basic principle of the HBSA algorithm is now described more formally. Specifically, the HBSA algorithm generates an expansion tree from the gene set $G^*=\{g_1,\ldots,g_p\}$, as shown in Fig. 2, where $G^*$ is the set of differentially expressed genes selected by KWRST. In the figure, $N_i^j$ denotes a node, where $i$ ($0\le i\le p$) is the number of the layer that contains the node and $j$ is the sequence number of the node within layer $i$. Each node has four fields. $N_i^j.set$ is a set containing exactly one gene; $N_i^j.parent$ is a pointer from node $N_i^j$ to its parent node; and $N_i^j.path$ is the gene subset formed by the genes of all nodes on the path from the root node $N_0^1$ to node $N_i^j$. Obviously, the length of the path of node $N_i^j$ equals the cardinality $i$ of its gene subset, and

$$N_i^j.path = N_i^j.parent.path \cup N_i^j.set.$$

The fourth field,

$$N_i^j.c = \mathrm{Acc}(N_i^j.path),$$

is the classification accuracy of the gene subset $N_i^j.path$. It is the heuristic information that guides node selection in layer $i$ and is usually evaluated with a classifier such as SVM or KNN. For the root node, $N_0^1.set=\emptyset$ (where $\emptyset$ denotes the empty set), $N_0^1.parent=nil$, and $N_0^1.c=0$.

The root node is expanded into $p$ child nodes $N_1^j$, $1\le j\le p$, which can be regarded as selected according to the heuristic information $\mathrm{KWRST}(g)$; node $N_1^j$ holds gene $g_j\in G^*$. All $p$ nodes are then expanded in the next layer. Each first-layer node $N_1^j$ is expanded into $p-1$ child nodes, so the second layer has $p(p-1)$ nodes, where

$$N_2^j.path = N_2^j.set \cup N_2^j.parent.path,$$

$$N_2^j.c = \mathrm{Acc}(N_2^j.path),$$

with $1\le j\le p(p-1)$. All nodes in the second layer are then sorted in descending order of $N_2^j.c$, and

$$\mathrm{Acc}_{\max}(2) = \max_{1\le j\le p(p-1)} \left(N_2^j.c\right)$$

is compared with the predefined threshold Acc_Max, where $\mathrm{Acc}_{\max}(2)$ denotes the maximum classification accuracy in the second layer. If $\mathrm{Acc}_{\max}(2)\ge$ Acc_Max, at least one optimal gene subset has been found and the search ends. Otherwise, the top $w$ nodes of the second layer are selected as expandable nodes, where the parameter $w$ is the search width. Whatever value $w$ is given, it is always set to $p$ in the first layer. The remaining layers are handled in the same way. Note that gene subsets on different paths may be identical if the order of the genes within a subset is ignored. Except for nodes in layer 0 and the first layer, if the classification accuracy of a node's gene subset has already been calculated, the classification accuracy of that node can be set to zero to avoid unnecessary expansion; such a node is called a closed node, and a closed node cannot be expanded. Finally, when the search ends, the gene subsets of the top $w$ nodes in the last layer are added to the set of optimal gene subsets $A^*$.
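
The search procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `Node`, `hbsa`, `acc`, `acc_max`, and `max_depth` are hypothetical, `acc` stands for any user-supplied accuracy estimate (for example, cross-validated SVM or KNN accuracy on the training set), and closed nodes are simply skipped instead of being given zero accuracy.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Optional, Sequence


@dataclass
class Node:
    gene: Optional[str]       # the single gene held by the node (N.set)
    parent: Optional["Node"]  # pointer to the parent node (N.parent)
    path: FrozenSet[str]      # genes on the path from the root (N.path)
    c: float                  # classification accuracy Acc(N.path) (N.c)


def hbsa(genes: Sequence[str],
         acc: Callable[[FrozenSet[str]], float],
         w: int,
         acc_max: float,
         max_depth: Optional[int] = None) -> List[FrozenSet[str]]:
    """Width-limited breadth-first search guided by acc; returns candidate subsets."""
    frontier = [Node(None, None, frozenset(), 0.0)]  # root node with the empty set
    seen = set()                                     # subsets already evaluated
    best: List[Node] = []
    for depth in range(1, (max_depth or len(genes)) + 1):
        layer = []
        for node in frontier:
            for g in genes:
                path = node.path | {g}
                if g in node.path or path in seen:
                    continue  # gene already on the path, or a closed node
                seen.add(path)
                layer.append(Node(g, node, path, acc(path)))
        if not layer:
            break
        layer.sort(key=lambda n: n.c, reverse=True)  # descending accuracy
        best = layer[:w]
        if layer[0].c >= acc_max:
            break                                    # threshold reached: stop
        # The first layer is always expanded in full; later layers keep w nodes.
        frontier = layer if depth == 1 else best
    return [n.path for n in best]
```

A toy call such as `hbsa(["a", "b", "c", "d"], lambda t: len(t & {"a", "d"}) / 2, w=4, acc_max=1.0)` stops at the second layer and returns four two-gene subsets, with `{a, d}` ranked first. In a real setting the accuracy callback would dominate the running time, which is why the search width `w` matters.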

The goal of HBSA is to select as many optimal gene subsets as possible using only the training set. Each gene subset in $A^*$ is constructed from the empty set by the search procedure above. The classification accuracy of each gene subset increases monotonically as the subset grows. When the classification accuracy of a gene subset reaches the preset threshold Acc_Max or the maximum of 100%, the cardinality of the resulting gene subset is minimal. The gene subsets in $A^*$ therefore approximately satisfy objectives (1) and (2). If the search width $w$ is set appropriately, the search is to some extent prevented from being misled, so that as many optimal gene subsets as possible can be selected.

Clearly, the search width of HBSA does not grow exponentially with search depth, so HBSA is in fact a kind of beam search, that is, an optimization of best-first search. Best-first search is a graph search method that ranks all partial solutions by some heuristic information. In beam search, only a predetermined number of the best partial solutions are kept as candidates: only the most promising nodes are retained in the search tree as candidates for further expansion, and the other nodes are pruned permanently. In general, from a local point of view, the HBSA-based gene selection method is incremental, but from a global point of view it is a hybrid method, because most gene combinations with low classification accuracy are discarded during the search. The implementation of HBSA can be made more flexible. For example, it is not necessary to expand a fixed number $w$ of top-ranked nodes in every layer; that is, $w$ may vary between layers. There are two ways to set the search width $w$: 1) in each layer, determine $w$ from the distribution of the classification accuracies of all nodes in that layer; 2) set a different Acc_Max for each layer, where the threshold of the current layer must be larger than that of the previous layer. Either way, the number of expanded nodes differs between layers. One advantage of the HBSA algorithm is therefore its strong adaptability to different data sets.

Another advantage of HBSA is that the algorithm has a certain biomedical meaning. Suppose T_i denotes a gene subset with very high classification accuracy selected at layer i. If appending a gene to T_i increases the classification accuracy Acc(T_{i+1}) of the enlarged subset T_{i+1} as much as possible, then in the ideal case that gene should be independent of, or only weakly correlated with, the genes in T_i. Otherwise, if the classification accuracy Acc(T_{i+1}) does not increase noticeably or even decreases, the gene subset T_{i+1} is discarded at layer i+1. Ideally, therefore, the genes in an optimal gene subset T^* should be mutually independent, so that T^* should be an independent variable group (IVG). This also indicates that the genes in an optimal gene subset T^* should lie on different regulatory pathways, although, because of noise, a gene added later may be weakly correlated with genes already in the subset.

Further, to determine which genes are the most important tumor-related genes, the importance of a gene is measured by its occurrence frequency in the set of gene subsets A^*. The higher the occurrence frequency of a gene, the more strongly the gene is related to the tumor and the more important it is. Defining gene importance in this way also has a sound biological meaning. For example, consider two subsets of three genes each,

G_{1} = {g_{1}^{'}, g_{2}^{'}, g_{3}^{'}}

and

G_{2} = {g_{1}^{''}, g_{2}^{''}, g_{3}^{''}},

where the present invention assumes that all genes in subset G_1 lie on pathway 1 (Pathway1) and all genes in subset G_2 lie on pathway 2 (Pathway2), as shown in Fig. 3. The present invention also assumes that the strength of the relation between these genes and the tumor decreases with the index of the gene within G_1 and G_2 respectively. Gene subsets composed only of genes from the same pathway are usually not selected by the HBSA algorithm, because their classification accuracy may be lower than that of gene subsets composed of uncorrelated genes; this is mainly caused by the similarity in expression and function of genes on the same pathway. The candidate gene combinations therefore comprise nine gene subsets, each combining one gene from G_1 with one gene from G_2. Among these, subsets that combine genes strongly related to the tumor tend to be selected by HBSA, whereas subsets that combine weakly related genes tend to be discarded by HBSA. As a result, the genes most strongly related to the tumor occur with very high frequency in the set of gene subsets A^*. From this point of view, measuring the importance of a gene by its occurrence frequency is reasonable.
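The frequency-based importance measure described above can be illustrated with a minimal Python sketch. The gene names and the contents of `a_star` are hypothetical placeholders, not results of the invention; the sketch only shows how occurrence frequencies could be counted over a set of selected gene subsets.

```python
from collections import Counter

def gene_frequencies(gene_subsets):
    """Count in how many selected gene subsets each gene occurs."""
    counts = Counter()
    for subset in gene_subsets:
        counts.update(set(subset))  # a gene counts at most once per subset
    return counts

def rank_genes(gene_subsets):
    """Return genes sorted by descending frequency (ties broken by name)."""
    counts = gene_frequencies(gene_subsets)
    return sorted(counts, key=lambda g: (-counts[g], g))

# Hypothetical A*: g1a..g3a stand for pathway-1 genes, g1b..g3b for pathway-2 genes.
a_star = [{"g1a", "g1b"}, {"g1a", "g2b"}, {"g2a", "g1b"}]
ranking = rank_genes(a_star)
```

In this toy input the two genes assumed to be most strongly tumor-related (`g1a`, `g1b`) occur twice each and therefore head the ranking.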

Algorithm implementation

In practice, it is not necessary to construct a search tree in order to obtain the set of optimal gene subsets A^*. It suffices to store the candidate gene subsets and their classification accuracies during the search. To simplify the implementation of HBSA, the present invention defines a classification matrix CM = (a_{i,j})_{w × p}, expressed as follows:

CM = \begin{pmatrix} a_{1,1} & \cdots & a_{1,p} \\ \vdots & \ddots & \vdots \\ a_{w,1} & \cdots & a_{w,p} \end{pmatrix}

The row label vector Row = (T^1, T^2, ..., T^w) labels the rows of CM in order, where T^i (1 ≤ i ≤ w) denotes a selected gene subset. The column label vector Column = ({g_1}, ..., {g_j}, ..., {g_p}) labels the columns of CM in order, where g_j ∈ G^*, and a_{i,j} = Acc(Row[i] ∪ Column[j]), where Row[i] is the gene subset of row i of CM and Column[j] is the single-gene set of column j of CM, with 1 ≤ i ≤ w and 1 ≤ j ≤ p. The framework of HBSA is given in Algorithm 1 below, where Acc(T) is the classification accuracy of gene subset T. For example, if Row[5] = {g_2, g_4} and Column[3] = {g_6}, then a_{5,3} = Acc(Row[5] ∪ Column[3]) = Acc({g_2, g_4, g_6}) is the classification accuracy of the gene subset {g_2, g_4, g_6}.

Algorithm 1: HBSA(M, p, w, Acc_Max, Depth)

Input: M is the gene expression profile matrix, p is the number of preselected genes, w is the number of gene subsets selected at each layer (i.e., the search width), Acc_Max is the preset maximum prediction accuracy threshold, and Depth is the maximum search depth.

Output: a set of optimal gene subsets A^*.

End of algorithm
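The body of Algorithm 1 is not reproduced in this text. The following Python sketch shows one way the described search could be organized, based only on the classification matrix and stopping rules stated in this section; it is an illustration under assumptions, not the patented procedure. The callable `acc`, the dictionary `cm` standing in for CM, and the toy accuracy function in the usage lines are all hypothetical.

```python
def hbsa(genes, acc, w, acc_max, depth):
    """Heuristic breadth-first (beam-style) search over gene subsets.

    genes   : list of p preselected gene identifiers
    acc     : callable mapping a frozenset of genes to an accuracy in [0, 1]
    w       : search width (number of subsets kept per layer)
    acc_max : accuracy threshold that stops the search
    depth   : maximum search depth
    Returns the subsets whose accuracy reached acc_max, or the best w
    subsets of the last layer if the depth limit was hit first.
    """
    rows = [frozenset()]  # Row labels: subsets kept from the previous layer
    for _ in range(depth):
        cm = {}  # sparse stand-in for CM: candidate subset -> accuracy
        for subset in rows:
            for g in genes:  # Column labels: single genes
                if g not in subset:
                    candidate = subset | {g}
                    if candidate not in cm:
                        cm[candidate] = acc(candidate)
        if not cm:
            break
        # Rank candidates by accuracy; break ties deterministically by name.
        ranked = sorted(cm, key=lambda s: (-cm[s], sorted(s)))
        reached = [s for s in ranked if cm[s] >= acc_max]
        if reached:
            return reached
        rows = ranked[:w]  # keep only the w most promising subsets
    return rows

# Toy usage: a made-up accuracy that rewards containing g2 and g5.
toy_genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
toy_acc = lambda s: len(s & {"g2", "g5"}) / 2
result = hbsa(toy_genes, toy_acc, w=2, acc_max=1.0, depth=3)
```

Each layer evaluates at most w × p candidates, which matches the O(Depth × w × p) cost stated for the algorithm when one evaluation of Acc(T) is counted as one unit of time.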

The HBSA algorithm predefines three stopping criteria. 1) The algorithm stops as soon as the classification accuracy of some gene subset on the whole training set is not less than the threshold Acc_Max. 2) If no gene subset reaches a classification accuracy of Acc_Max, the algorithm stops when the search depth reaches the predefined number of iterations, which guarantees that the algorithm terminates. In general, a suitable search depth is not known in advance, and if the search depth is set improperly, the result found may not be optimal. 3) An alternative criterion is to stop HBSA when |Accuracy_{iter+1} - Accuracy_{iter}| < δ, where δ is a very small positive number and Accuracy_{iter} is the maximum classification accuracy at iteration iter.

The most time-consuming operation in HBSA is computing Acc(T). If computing Acc(T) is assumed to take one unit of time, the complexity of computing the classification matrix CM is O(w × p), and the time complexity of the algorithm is O(Depth × w × p). Although HBSA is a polynomial-time algorithm, it is still very time-consuming in experiments. However, finding the optimal gene subsets is mainly done at the laboratory stage, whereas clinical tumor diagnosis is carried out with the gene subsets already selected, which costs very little CPU time (for example, at most a few seconds on an ordinary PC). The HBSA-based gene selection method of the present invention is therefore feasible.

Evaluation criteria

The present invention uses two machine learning methods, KNN and SVM, to compute the classification accuracy Acc(T) of a gene subset T. KNN is the most commonly used nonparametric method. To assign a class label to an unknown sample x, KNN uses a similarity measure such as the Euclidean distance to find the k nearest samples in the training set and determines the class label of x by majority vote. k is usually set to an odd number to avoid ties. In the experiments of the present invention, the Euclidean distance and 5 neighbors are used to measure sample similarity and make the decision. HBSA with a KNN classifier is called HBSA-KNN.
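The KNN decision rule described above (Euclidean distance, k = 5, majority vote) can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the experiments; the function name and the toy samples are assumptions.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, x, k=5):
    """Label x by majority vote among its k nearest training samples."""
    # Pair every training sample's Euclidean distance to x with its label.
    dists = sorted(
        (math.dist(sample, x), label) for sample, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy two-class data (hypothetical expression values for two genes).
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (6, 6)]
train_y = ["A", "A", "A", "B", "B", "B", "B"]
label_near_a = knn_predict(train_x, train_y, (0.5, 0.5))
label_near_b = knn_predict(train_x, train_y, (5.5, 5.5))
```

With two classes and an odd k, the vote cannot end in a tie, which is the reason given above for choosing an odd number of neighbors.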

A support vector machine (SVM) with the Gaussian radial basis function (RBF) kernel K(x, y) = exp(-γ||x - y||^2), denoted SVM-RBF, is used to evaluate the classification performance of the selected gene subsets. In the experiments, the present invention uses the LIBSVM software to classify samples, where the penalty parameter C and the Gaussian kernel parameter γ must be optimized during SVM training. C is the penalty factor for misclassified samples, and γ controls the sensitivity to changes in the input data. Because the search space is very large, a general two-dimensional grid search (for example, C = 2^-5, 2^-4, ..., 2^15 and γ = 2^-15, 2^-14, ..., 2^3) is a very time-consuming way to find the optimal parameter pair (C, γ). Further, the present invention finds that normalized tumor data sets are not very sensitive to changes in C, so the search space of parameter pairs can be reduced by restricting γ to the interval [10^-5, 10] and restricting C to 200 and 400, or fixing it at 200. In particular, if γ is of order O(10^-1), γ takes the values 0.1, 0.2, ..., 0.9; if γ is of order O(10^-2), γ takes the values 0.01, 0.02, ..., 0.09; and so on. HBSA with an SVM classifier is called HBSA-SVM, and HBSA as used hereinafter refers mainly to HBSA-KNN and HBSA-SVM.
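The kernel and the reduced parameter grid can be written out as a short Python sketch. It assumes the standard RBF form exp(-γ||x - y||^2) and enumerates nine values per order of magnitude as described above; how the top of the interval (the values 1, ..., 10) is sampled is an assumption, and no LIBSVM call is shown.

```python
import math

def rbf_kernel(x, y, gamma):
    """Gaussian RBF kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def gamma_grid(min_exp=-5):
    """Candidate gamma values in [10^min_exp, 10], nine per order of magnitude."""
    grid = []
    for e in range(min_exp, 0):  # orders 10^min_exp ... 10^-1
        grid.extend(d / 10 ** (-e) for d in range(1, 10))
    grid.extend(float(d) for d in range(1, 11))  # assumed: 1, 2, ..., 10
    return grid

# Reduced search space: C restricted to 200 and 400, gamma from the grid.
param_pairs = [(c, g) for c in (200, 400) for g in gamma_grid()]
```

Under these assumptions the reduced space holds 110 pairs, compared with 21 × 19 = 399 pairs for the general power-of-two grid quoted above.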

k-fold cross-validation (k-fold CV) is commonly used to evaluate classification models. In the experiments, k-fold CV is used only to compute Acc(T) on the training set. If k is set to Tr_n (the size of the training set), k-fold CV is called leave-one-out cross-validation (LOOCV). If k is set to 2, k-fold CV is called the holdout method. When k is set very low, the accuracy estimate of k-fold CV tends to have high bias and low variance. Conversely, when k is set too large (for example, k = Tr_n), the accuracy estimate of k-fold CV has low bias but high variance. Breiman et al. consider 10-fold CV to be somewhat better than LOOCV. Ambroise et al. and Asyali et al. recommend 10-fold CV for evaluating tumor classification, but whether 10-fold CV is actually better than LOOCV depends on the data set. To balance bias and variance, the present invention designs a new method for evaluating the classification performance of gene subsets. Let CV(k) denote the classification performance of k-fold CV, where 2 ≤ k ≤ m and m is the number of samples in the training set. The mean classification accuracy is defined as:

mean = \frac{1}{m - 1} \sum_{k = 2}^{m} CV(k)   (6)

The standard deviation of classification accuracy is defined as:

std = \sqrt{\frac{1}{m - 2} \sum_{k = 2}^{m} (CV(k) - mean)^{2}}   (7)

This evaluation method is called full-fold cross-validation (Full-fold CV method), and the mean classification accuracy obtained with it is called the full-fold cross-validation accuracy (Full-fold CV accuracy). Because computing Acc(T) with full-fold CV would greatly increase the computational cost of HBSA, the present invention still uses 10-fold CV to compute Acc(T) and uses it as the heuristic information of HBSA. Full-fold CV is used only to evaluate those optimal gene subsets in A^* that have the highest or second-highest 10-fold CV classification accuracy.
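Equations (6) and (7) can be computed directly once a k-fold CV accuracy is available for every k from 2 to m. The Python sketch below takes that accuracy as a callable `cv`, which is an assumption made for illustration; the toy function in the usage lines is made up and does not come from the experiments.

```python
import math

def full_fold_cv(cv, m):
    """Mean and standard deviation of cv(k) over k = 2, ..., m (Eqs. 6 and 7).

    cv : callable returning the k-fold CV accuracy for a given k
    m  : number of training samples (m >= 3 so that m - 2 > 0)
    """
    scores = [cv(k) for k in range(2, m + 1)]  # m - 1 values
    mean = sum(scores) / (m - 1)
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / (m - 2))
    return mean, std

# Toy accuracy function: 0.9 for even k, 0.8 for odd k (hypothetical values).
toy_cv = lambda k: 0.9 if k % 2 == 0 else 0.8
ff_mean, ff_std = full_fold_cv(toy_cv, m=5)
```

The loop makes the cost visible: one full-fold evaluation needs m - 1 separate cross-validations, which is why it is reserved for the final assessment of the best subsets.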

The implementations of HBSA-KNN and HBSA-SVM are broadly similar, but they differ in some respects. For HBSA-KNN, the training set is randomly divided into 10 parts when 10-fold CV is applied, and different partitions have a slight effect on the experimental results. To remove this effect, the present invention runs the HBSA-KNN algorithm 5 times, each time randomly dividing the data set into 10 parts, so that 5 slightly different sets A^* are obtained; the occurrence frequency of each gene is computed from these 5 sets of gene subsets A^*. For HBSA-SVM, however, the partition of the training set used for 10-fold cross-validation in the LIBSVM software is fixed, so the HBSA-SVM algorithm only needs to be run once.

Usually, for HBSA-SVM, the final prediction accuracy is evaluated on the independent test set with an SVM-RBF classifier that is trained, with parameter optimization, on the training set only; this method is called HBSA-SVM (Unbiased). However, more than one optimal parameter pair may give the constructed classifier the best 10-fold CV prediction accuracy on the training set, and classifiers built with these different optimal parameter pairs obtain different prediction accuracies on the independent test set. As a contrast to HBSA-SVM (Unbiased), a biased HBSA-SVM method is therefore designed to evaluate the selected gene subsets: it selects the parameter pair with which the constructed classifier achieves the best classification accuracy on the test set. This method is called HBSA-SVM (Biased).

Receiver Operator Characteristics (ROC) analysis is a visualization method for evaluating the performance of binary classification models. Several performance measures can be computed from a few basic counts on the test set: the number of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). From these, the sensitivity or true-positive rate (TPR), the false-positive rate (FPR), the positive predictive value (PPV) and the negative predictive value (NPV) are computed. The ROC curve is drawn with TPR on the Y-axis and FPR on the X-axis, and the area under the ROC curve (AUC) is used to measure the performance of the classification model.

Acc=accuracy=(TP+TN)/(TP+TN+FP+FN) (8)

SP=specificity=TN/(FP+TN) (9)

TPR=sensitivity=TP/(TP+FN) (10)

FPR=(1-specificity)=FP/(FP+TN) (11)

PPV=TP/(TP+FP) (12)

NPV=TN/(TN+FN) (13)
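Formulas (8) to (13) translate directly into code. The following Python sketch computes all six measures from the four basic counts; the function name and the example counts are hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the measures of Eqs. (8)-(13) from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),  # Eq. (8)
        "specificity": tn / (fp + tn),                # Eq. (9)
        "sensitivity": tp / (tp + fn),                # Eq. (10), TPR
        "fpr": fp / (fp + tn),                        # Eq. (11)
        "ppv": tp / (tp + fp),                        # Eq. (12)
        "npv": tn / (tn + fn),                        # Eq. (13)
    }

# Hypothetical test-set counts: 8 TP, 9 TN, 1 FP, 2 FN.
metrics = classification_metrics(tp=8, tn=9, fp=1, fn=2)
```

Note that FPR and specificity always sum to 1, as Eq. (11) states.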

Analysis framework

When HBSA is applied to the differentially expressed genes preselected on the training set by KWRST, many optimal gene subsets can be selected. However, searching for optimal gene subsets in such a huge gene space may overfit the training set. Some genes unrelated to the tumor are likely to be selected into the optimal gene subsets by mistake, which seriously biases the gene selection. Gene subsets that contain tumor-unrelated genes generalize poorly. To address this problem, the present invention designs an HBSA-based ensemble classifier and an HBSA-based gene ranking method, in order to obtain unbiased prediction accuracy and to find as many important genes as possible. The analysis workflow of the present invention is shown in Fig. 4.

Ensemble classifier based on HBSA

The HBSA-based ensemble classifier consists of multiple individual classifiers. The corresponding prediction accuracies (biased and unbiased) are obtained by evaluating, on the test set, the ensemble classifiers built from SVM (Biased) and SVM (Unbiased) respectively. The final decision is made by simple majority vote. Fig. 5 shows the SVM ensemble classifier formed from w individual SVM classifiers, each of which is built from one of the optimal gene subsets selected by HBSA-SVM.

To measure how reliably an ensemble classifier composed of N individual classifiers classifies each sample, the present invention defines a confidence for each sample. Suppose a data set has k subclasses, denoted L = {c_1, ..., c_k}. A test sample is assigned a vote vector (m_1, ..., m_k), where each component m_i is the number of votes for assigning the sample to subclass c_i in L, and \sum_{i = 1}^{k} m_i = N.

Let m_Max and m_Sec denote the largest and second-largest vote counts in the vote vector (m_1, ..., m_k). The confidence conf of a test sample is defined as conf = m_Max / m_Sec; if m_Sec = 0, conf is set to N, so that 1 ≤ conf ≤ N. The larger conf is, the more reliable the classification of the sample, whether the sample is classified correctly or incorrectly.
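The majority vote and the confidence measure can be sketched together in Python. The function name and the class labels `c1` and `c2` are hypothetical; the sketch assumes each of the N individual classifiers casts exactly one vote.

```python
from collections import Counter

def vote_and_confidence(predictions, labels):
    """Majority-vote label and confidence conf = m_max / m_sec.

    predictions : list of N labels, one per individual classifier
    labels      : all class labels c_1, ..., c_k of the data set
    """
    n = len(predictions)
    counts = Counter(predictions)
    # Vote vector (m_1, ..., m_k), sorted from largest to smallest.
    votes = sorted((counts.get(c, 0) for c in labels), reverse=True)
    m_max = votes[0]
    m_sec = votes[1] if len(votes) > 1 else 0
    conf = n if m_sec == 0 else m_max / m_sec  # conf = N when unanimous
    winner = counts.most_common(1)[0][0]
    return winner, conf

# Ten classifiers, seven of which vote for class c1.
split_label, split_conf = vote_and_confidence(["c1"] * 7 + ["c2"] * 3, ["c1", "c2"])
# Five classifiers in full agreement.
unanimous_label, unanimous_conf = vote_and_confidence(["c1"] * 5, ["c1", "c2"])
```

A confidence of 1 means the two leading classes are tied, while a confidence of N means the ensemble is unanimous.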

Gene ranking method based on HBSA

The present invention designs an HBSA-based gene ranking method that ranks genes by their occurrence frequency in the set of optimal gene subsets A^*. Its purpose is to find as many important tumor-related genes as possible, with the importance of a gene measured by its occurrence frequency. The top-ranked genes are considered the most important tumor-related genes. A classifier designed from these genes should therefore have the best generalization performance.

The present invention proposes a gene selection method based on heuristic breadth-first search that carries rich biomedical meaning, and designs an ensemble tumor classification system based on gene expression profiles. The HBSA-based method for ranking tumor-related genes by importance can indeed find important tumor-related genes; it measures the importance of a gene by its occurrence frequency in the selected gene subsets. Further, the present invention finds that the occurrence frequency of the selected genes follows a power-law distribution with respect to the number of genes. This indicates that the few top-ranked genes can serve as markers for tumor diagnosis, and that the genes in gene subsets with very high classification accuracy are closely associated with specific tumor subtypes.

4. Biological validation of the experimental results

The present invention further analyzes, from three aspects, the degree of association between high-frequency genes and tumors and verifies the validity of the experimental results: individual gene function, pathway analysis, and the protein-protein interaction (PPI) network. First, the relationship between the selected genes and tumors is verified against a list of known cancer-related genes; the remaining genes are then verified through the cancer linker degree (CLD) and the relevant biomedical literature. Further, the association between the selected genes and tumors can be verified indirectly through the association between the biological pathways involving these genes and tumor development. The following analysis is based mainly on the experimental results of the HBSA-KNN method.

Literature validation based on individual genes

Information on known tumor-related genes can be downloaded from the website http://cbio.mskcc.org/cancergenes. The 1086 known cancer-related genes collected were obtained by querying this website with the three terms "oncogene", "tumor suppressor" and "stability". The top 50 genes selected for each data set were analyzed using the relevant biomedical literature. Here the present invention takes the Leukemia and Prostate data sets as examples to analyze how the top 50 genes relate to tumors.

Among the top 50 genes selected by HBSA-KNN for the Leukemia data set, 10 genes (20%) are known tumor-related genes. By consulting the biomedical literature and computing the CLD for the top 10 genes, the present invention finds that 10/10 = 100% of the top 10 genes are tumor-related. The gene CD33 (M23197) is expressed on the malignant blast cells in most cases of acute myeloid leukemia (AML), but is not expressed on normal hematopoietic pluripotent stem cells. In the treatment of patients with acute myeloid leukemia, in vivo ablation of CD33+ cells can achieve therapeutic results. MARCKSL1, also called multidrug resistance-associated protein (MRP), is highly expressed in some vincristine-resistant cell lines. SP3 is regarded, in several different biochemical analyses, as a novel nuclear protein at a translocation breakpoint associated with an acute myeloid leukemia subtype. CD63 (X62654) belongs to a newly defined gene family encoding membrane proteins that includes CD33, and is associated with monoclonal antibodies that inhibit syncytium formation induced by human T-cell leukemia virus type 1. TCF3 (M31523) is involved in 19p13 chromosomal rearrangements and acts as a tumor suppressor gene in B-cell precursor acute myeloblastic leukemia. CST3, also called cystatin C, is upregulated in tumor patients.

For the Prostate data set, 12 of the top 50 genes selected by HBSA-KNN (24%) are known cancer-related genes. For the other selected genes, the present invention verifies the association of the top 10 genes with tumors through the relevant literature and finds that 9/10 = 90% of them are closely linked to tumors. In the hepsin gene (HPN), an 11-SNP haplotype is closely associated with prostate cancer, which indicates that HPN (X07732) is a potentially important candidate gene for prostate cancer susceptibility. SLC25A6, also called ANT3, is necessary in MCF-7 cells for cell death induced by TNF- and oxidative stress. KIBRA is involved in estrogen receptor transactivation in breast cancer cells. Altered RBP1 expression and hypermethylation are very common in prostate carcinogenesis, and prostate cancer and epithelial neoplasia frequently show RBP1 overexpression. CHD9 and NELL2 have CLDs of 4 and 2 respectively, as shown in the network-based analysis below. The gene A2R6W1, extracted from Aspergillus niger, is hypothesized to be a nuclear protein involved in zinc ion binding and DNA transcriptional regulation, and its relationship with tumors merits further study.

Network-based analysis of the top ten genes

Aragues et al. defined the cancer linker degree (CLD) of a protein as the number of cancer-related proteins connected to that protein, and consider the cancer-related proteins connected to a protein to be an important indication that the protein itself may be cancer-related. Using a protein-network-based method, the present invention analyzes the functions of the neighbors, in the Human Protein Reference Database (HPRD), of the proteins encoded by the selected genes; the data were downloaded as of August 2009. For the Leukemia data set, among the top 10 genes, two genes (ZYX and CCND3) are oncogenes, and five genes (APLP2, CD33, SP3, CD63, PSME1) and one other gene (CST3) are all directly or indirectly related to tumor-related genes, as shown in Fig. 10. APLP2, CD33, SP3, CD63 and PSME1 (ranked 1, 2, 5, 7 and 9 respectively) have cancer linker degrees of 4, 3, 9, 2 and 1 respectively. TTC3 and MARCKSL1 show no tumor-related genes connected to them, although MARCKSL1 is overexpressed in some vincristine-resistant cell lines.

For the Prostate data set, the results show that three genes, MAF, ABL1 and SERPINB5, are themselves tumor-related, and that most of the other top-ranked genes interact directly with known cancer-related genes. The CLDs of MAF, HPN, ABL1, CHD9, WWC1, NELL2 and RBP1 are 7, 1, 46, 4, 2, 2 and 4 respectively. The selected genes may not directly cause a particular tumor, but they probably play an important role in carcinogenesis or are subject to malignant regulation by the known cancer genes connected to them. The present invention therefore infers that these top-ranked genes can be regarded as candidate tumor markers.

Validation based on pathway analysis

The selected genes were also subjected to pathway analysis at the website http://vortex.cs.wayne.edu/projects.htm, where the p-value is computed by formula (15). The analysis is based on the assumption that the number of genes participating in different pathways follows a hypergeometric distribution. Suppose there are N genes, of which M participate in pathway F, and K genes considered important are selected at random. The p-value for having x or fewer genes in pathway F is then obtained by summing the probabilities that a random list of K genes contains 0, 1, ..., x genes in pathway F:

p = \sum_{i = 0}^{x} \frac{\binom{M}{i} \binom{N - M}{K - i}}{\binom{N}{K}}   (14)

When N is very large, the hypergeometric distribution tends to the binomial distribution. In this case, the p-value can also be computed by the following formula.

p = 1 - \sum_{i = 0}^{x - 1} \binom{K}{i} (\frac{M}{N})^{i} (1 - \frac{M}{N})^{K - i}   (15)
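Both p-value formulas can be evaluated exactly with integer binomial coefficients. The Python sketch below uses the standard hypergeometric term C(M, i) C(N - M, K - i) / C(N, K) for formula (14) and the binomial sum of formula (15); the function names and the small numbers in the usage lines are hypothetical.

```python
from math import comb

def hypergeom_p_lower(N, M, K, x):
    """Formula (14): probability that K randomly drawn genes include
    x or fewer of the M genes on pathway F, out of N genes in total."""
    total = sum(comb(M, i) * comb(N - M, K - i) for i in range(x + 1))
    return total / comb(N, K)

def binomial_p_upper(N, M, K, x):
    """Formula (15): 1 - sum_{i=0}^{x-1} C(K, i) q^i (1 - q)^(K - i), q = M/N."""
    q = M / N
    return 1.0 - sum(comb(K, i) * q ** i * (1 - q) ** (K - i) for i in range(x))

# Toy numbers: 10 genes in total, 4 on the pathway, 3 drawn at random.
p_none = hypergeom_p_lower(N=10, M=4, K=3, x=0)   # no drawn gene on the pathway
p_all = hypergeom_p_lower(N=10, M=4, K=3, x=3)    # covers every possible outcome
```

As written, formula (14) is a lower-tail probability (x or fewer genes) while formula (15) is an upper-tail probability (x or more genes), so the two functions answer complementary questions.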

The pathways involving the top 50 genes include cell proliferation (for example, cell cycle and DNA replication), genome stability (base excision repair, mismatch repair, etc.), angiogenesis (for example, the vascular endothelial growth factor (VEGF) signaling pathway), tumor metastasis (for example, the pathway of cell adhesion molecules), tumor suppression pathways (for example, the p53 signaling pathway), immune escape pathways (for example, the pathways of antigen processing and presentation, B cell receptor signaling pathway, primary immunodeficiency, etc.), and the development of one or more tumors.

Because many biological pathways are involved, the present invention uses the biomedical literature to verify the association with tumors of four biological pathways selected by HBSA-SVM and HBSA-KNN. For the Leukemia data set, the B-cell antigen receptor (BCR) signaling pathway is very important for the survival of chronic lymphocytic leukemia cells, which are regulated by overexpressed protein kinase Cβ. Heterogeneity in the self-renewal potential of leukemic stem cells supports the hypothesis that they derive from normal hematopoietic stem cells. Many transcription factors (TIF) are also oncogenic factors, so their mutation or abnormal regulation is closely linked to tumors. The DNA excision repair profiles of normal and leukemic human lymphocytes differ.

For the Prostate data set, Osman et al. hypothesized that one pathway of prostate cancer development involves p53 inactivation controlled by MDM2 overexpression, with p21 transactivation occurring through an alternative signaling pathway rather than through a p53-dependent mechanism. The insulin signaling pathway is involved in the pathology of various malignant tumors; through its effects on cell proliferation, differentiation and apoptosis it increases tumor risk, and it is associated with the occurrence and growth of prostate tumors. The relationship between changes in tumor cell nuclei and the morphology and function of chromosomes has been described in detail in the literature. For the cell cycle pathway, studies have revealed that androgen is a major regulator of the G1-S phase and can induce signals that stimulate the G1 cyclin-dependent kinases of prostate cancer cells.

From the above analysis of signaling pathways, the present invention concludes that most of the biological pathways involving the selected genes are directly related to tumor occurrence, tumor formation or tumor metastasis. Based on the gene function analysis, the PPI network analysis and the biological pathway analysis at the molecular level, the present invention infers that the top-ranked genes are very useful as important biomarkers for tumor diagnosis and are an important aid to understanding the mechanisms of tumor occurrence, development and metastasis.

Fig. 6 shows the power-law distribution of gene occurrence frequency against gene count on six data sets, that is, the relationship between the occurrence frequency of each gene and the rank of that gene. An important property of the occurrence frequency is that this relationship is linear on the log-log grid of the figure; the occurrence frequency of the selected genes therefore follows a power-law distribution with respect to the number of genes whose frequency exceeds that of the corresponding gene. The gene frequencies obtained with HBSA-KNN are accumulated over 5 runs, and the accumulated result still shows a power-law distribution, reflecting the "strong become stronger" effect.
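A straight line on a log-log grid can be checked numerically by fitting the slope of log(frequency) against log(rank). The Python sketch below does this with an ordinary least-squares fit; it is an illustrative check under assumptions, not the analysis used to produce Fig. 6, and the input frequencies are synthetic.

```python
import math

def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    frequencies : positive occurrence counts, at least two values.
    A roughly constant negative slope suggests power-law behavior.
    """
    freqs = sorted(frequencies, reverse=True)  # rank 1 = most frequent gene
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic frequencies following frequency = 100 / rank exactly.
synthetic = [100 / rank for rank in range(1, 11)]
slope = loglog_slope(synthetic)
```

For the synthetic input the fitted slope is -1 up to rounding, which corresponds to a power law with exponent 1.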

Fig. 7 compares the classification prediction accuracy of three gene ranking methods. The present invention further compares the HBSA-KNN-based ranking method with two other gene ranking methods, the Kruskal-Wallis rank sum test (KWRST) and Relief-F. The comparison shows that, when the number of top-ranked genes selected is small enough, the method of the present invention always outperforms KWRST and Relief-F in prediction accuracy. For the Prostate data set, although only the top two genes achieve a very high accuracy (88.24%), this value is clearly higher than the accuracy of KWRST and Relief-F when two genes are selected. This shows that the method of the present invention remains very effective, because this situation is still consistent with the goal of the invention: the important tumor-related genes are ranked at the top. However, the method aims to find as many important tumor-related genes as possible, and important tumor-related genes may still be redundant from the point of view of classification; the classification prediction accuracy may therefore decrease as the number of top-ranked genes selected increases.

Claims

1. A method for searching tumor-related genes based on heuristic breadth-first search, characterized in that: genes are measured by their occurrence frequency in the selected gene subsets, the top-ranked genes are considered the most important tumor-related genes, a classifier is designed accordingly, and a gene ranking method based on the HBSA algorithm is established, with the following steps:

Each vector s_i in the gene expression matrix is regarded as a point in n-dimensional space, and each of the m sample vectors consists of an expression vector of n elements; let L = {c_1, ..., c_k} denote the label set of the data set, where |L| = k is the number of classes in the data set; usually the class of each sample is known, so S × L = {(s_i, l_i) | s_i ∈ R^n, l_i ∈ L, i = 1, 2, ..., m} denotes the sample space with class labels;

\min_{T ∈ P(G)} |T|   (1)

\max_{T ∈ P(G)} Acc(T)   (2)

A^* = {T^* | T^* satisfies objectives (1) and (2) simultaneously}.

2. The method for searching tumor-related genes based on heuristic breadth-first search according to claim 1, characterized in that:

(m/k)/s_n > 5   (4)

where k is the number of classes, m is the number of training samples, and s_n is the number of selected genes; to measure how reliably an ensemble classifier composed of N individual classifiers classifies each sample, a confidence is defined for each sample: suppose a data set has k subclasses, denoted L = {c_1, ..., c_k}; a test sample is assigned a vote vector (m_1, ..., m_k), where each component m_i is the number of votes for assigning the sample to subclass c_i in L, and \sum_{i = 1}^{k} m_i = N.

3. The method for searching tumor-related genes based on heuristic breadth-first search according to claim 1, characterized in that, when the HBSA algorithm is implemented,

a classification matrix CM = (a_{i,j})_{w × p} is defined as follows:

CM = \begin{pmatrix} a_{1,1} & \cdots & a_{1,p} \\ \vdots & \ddots & \vdots \\ a_{w,1} & \cdots & a_{w,p} \end{pmatrix}

The row label vector Row = (T^1, T^2, ..., T^w) labels the rows of CM in order, where T^i (1 ≤ i ≤ w) denotes a selected gene subset; the column label vector Column = ({g_1}, ..., {g_j}, ..., {g_p}) labels the columns of CM in order, where g_j ∈ G^*, and a_{i,j} = Acc(Row[i] ∪ Column[j]), where Row[i] is the gene subset of row i of CM and Column[j] is the single-gene set of column j of CM, with 1 ≤ i ≤ w and 1 ≤ j ≤ p.

4. The method for searching tumor-related genes based on heuristic breadth-first search according to claim 1, characterized in that: the HBSA-based gene ranking method ranks genes by their occurrence frequency in the set of optimal gene subsets A^*, and the top-ranked genes are considered the most important tumor-related genes.