CN107526946A

CN107526946A - Gene expression data cancer classification method fusing self-learning and low-rank representation

Info

Publication number: CN107526946A
Application number: CN201611207518.8A
Authority: CN
Inventors: 於东军; 夏春秋; 韩珂
Original assignee: Nanjing University of Science and Technology
Current assignee: Nanjing University of Science and Technology
Priority date: 2016-12-23
Filing date: 2016-12-23
Publication date: 2017-12-29
Anticipated expiration: 2036-12-23
Also published as: CN107526946B

Abstract

The invention discloses a cancer classification method for gene expression data that fuses self-learning and low-rank representation, comprising: Step 1, for a given cancer gene expression data set, merging the data into a data matrix and normalizing it; Step 2, decomposing the resulting data matrix by low-rank representation to obtain a low-rank matrix and a sparse matrix; Step 3, using the label information of the training set to compute the initial point of each class in the low-rank matrix and in the sparse matrix; Step 4, applying an unsupervised clustering method to the low-rank matrix and to the sparse matrix to obtain one prediction result from each; Step 5, comparing the two prediction results: if no sample receives the same prediction in both, or the maximum number of iterations is reached, outputting the prediction result based on the low-rank matrix; otherwise, moving the samples with identical predictions from the test set into the training set and returning to Step 3. The invention can improve prediction accuracy when only a small number of labeled samples are available, reducing the time and labor cost of labeling samples.

Description

Gene expression data cancer classification method fusing self-learning and low-rank representation

Technical field

The present invention relates to the fields of bioinformatics, gene expression, and cancer classification, and specifically to a cancer classification method for gene expression data that fuses self-learning and low-rank representation.

Background art

Cancer is a fatal disease caused by abnormal cell growth, and no fully effective treatment exists so far. Early diagnosis can effectively help cancer treatment, so how to classify and predict cancer accurately is a problem of considerable research value. With the development of high-throughput techniques, gene expression data on cancer have accumulated rapidly, and machine learning techniques have also advanced significantly in recent years, so it has become possible to predict cancer classes from gene expression data with machine learning, for example: (1) Chen, X.Y. and Jian, C.R. Gene expression data clustering based on graph regularized subspace segmentation. Neurocomputing 2014; 143: 44-50. (2) Liao, Q., Guan, N. and Zhang, Q. Gauss-Seidel based non-negative matrix factorization for gene expression clustering. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2016. p. 2364-2368. (3) Liu, J.X., et al. RPCA-Based Tumor Classification Using Gene Expression Data. IEEE ACM T Comput Bi 2015; 12(4): 964-970. However, most existing methods are either unsupervised or supervised, and both kinds have their own shortcomings.

Unsupervised learning methods propose a model to discover latent structure in unlabeled data. Because none of the samples are labeled, label information cannot be used to correct errors while the model is trained. This property weakens the predictive ability of the model, which cannot deliver adequate prediction accuracy. Supervised learning is the opposite: it trains the model on labeled data. Because labels are available during training, a model obtained by supervised learning can deliver higher prediction accuracy. However, training a model with a supervised method requires a large amount of labeled data, and labeling is often very expensive and consumes considerable labor and time, especially for gene expression data.

Since both kinds of learning method have shortcomings that are hard to overcome, semi-supervised learning offers a new approach to the problems above: the model is trained on a large amount of unlabeled data together with a small amount of labeled data, and its predictions are much better than those of unsupervised methods. The self-learning model that the present method improves on is a traditional semi-supervised learning method: it adds the samples predicted with higher confidence to the training set, iterates training and prediction, and finally classifies all the data in the test set. Many effective semi-supervised methods have already been applied to the analysis of cancer gene expression data, for example: (1) Cai, X.F., et al. Local and Global Preserving Semi-supervised Dimensionality Reduction Based on Random Subspace for Cancer Classification. IEEE J Biomed Health 2014; 18(2): 500-507. (2) Halder, A. and Misra, S. Semi-supervised fuzzy K-NN for cancer classification from microarray gene expression data. 2014 First International Conference on Automation, Control, Energy & Systems (ACES-14) 2014: 266-270. However, processing gene expression data still poses challenges:

(1) Gene expression data have very high dimensionality

Each feature of gene expression data corresponds to one gene, and humans have no fewer than 25,000 genes, so gene expression data often have tens of thousands of feature components. When handling high-dimensional data, traditional classification methods are very sensitive to the noise and redundancy in the data and struggle to give accurate predictions;

(2) Gene expression data sets are very small

Measuring gene expression with gene microarray technology is expensive in money, time, and labor, so the data set obtained in one study is very small, usually containing only tens or hundreds of samples, and such a small amount of data makes it hard to train an effective model.

Summary of the invention

The object of the present invention is to provide a cancer classification method for gene expression data that fuses self-learning and low-rank representation, addressing the problems of cancer classification prediction from gene expression data in the prior art: high data dimensionality, small sample size, and little labeled data.

The technical solution that achieves this object is a cancer classification method for gene expression data that fuses self-learning and low-rank representation, comprising the following steps:

Step 1: for a given cancer gene expression data set, the labeled data form the training set and the unlabeled data form the test set; the data are merged into a data matrix X and normalized;

Step 2: the resulting data matrix is decomposed by low-rank representation to obtain a low-rank matrix Z and a sparse matrix E;

Step 3: using the label information of the training set, the initial point coordinates $p^{(i)}$ of each class i are computed on the low-rank matrix Z and on the sparse matrix E;

Step 4: an unsupervised clustering method is applied to the low-rank matrix Z and to the sparse matrix E, giving the prediction results $l_Z$ and $l_E$ respectively;

Step 5: the two prediction results $l_Z$ and $l_E$ are compared; if no sample receives the same prediction in both, or the maximum number of iterations is reached, the prediction result $l_Z$ based on the low-rank matrix is output; otherwise, the samples with identical predictions are moved from the test set to the training set, and the method returns to Step 3.

Compared with the prior art, the present invention has the following notable advantages: (1) the method incorporates low-rank representation, which can extract essential global features from the original high-dimensional data; (2) the method uses both the low-rank matrix and the sparse matrix obtained from the low-rank representation decomposition, which is more effective than traditional low-rank-representation methods that use the information in only one matrix.

Brief description of the drawings

Fig. 1 is an exemplary flow chart of the gene expression data cancer classification method fusing self-learning and low-rank representation.

Fig. 2 is a schematic diagram for a cancer gene expression data set: (a), (b), and (c) show the original data matrix and the low-rank matrix and sparse matrix obtained by low-rank decomposition, respectively. Each expression value in a matrix corresponds to the gray value of one pixel. In the horizontal bar above each matrix, each color block corresponds to one cancer class.

Detailed description of the embodiments

Cancer classification prediction from gene expression data is a typical high-dimensional, small-sample-size problem. To address it, the method borrows low-rank representation, a feature extraction technique commonly used for matrix recovery in image processing, which constrains the rank of the data matrix to obtain the essential low-dimensional structure of the data.

Combining a semi-supervised learning method with this feature extraction method makes it possible to address the problem of cancer classification prediction from gene expression data.

Embodiments of the present invention are described in detail below by way of example with reference to the accompanying drawings.

As shown in Fig. 1, according to a preferred embodiment of the present invention, the gene expression data cancer classification method fusing self-learning and low-rank representation predicts the class of the samples in a cancer gene expression data set. To reflect a practical application scenario, part of the data is treated as unlabeled and defined as the test set; the remaining samples are defined as the training set. During training and prediction, only the label information of the training-set samples may be used; the class information of the test set is used only for comparison with the predicted classes of the test set. Classification prediction is divided into two stages: a feature extraction stage and a training and prediction stage. The implementation of the two stages is described in detail below with reference to Fig. 1.

(1) Feature extraction stage

In the first step, the feature vectors of the training set and the test set are merged to build a data matrix X, which must satisfy:

$$X = [x_1, x_2, \dots, x_n] \in \mathbb{R}^{d \times n}$$

where $x_i$ is the column vector of one gene expression data sample and $d$ is the vector dimension. X contains n samples in total, where n is the sum of the numbers of samples in the training set and the test set. Each vector must be normalized.
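The construction of X can be sketched as follows. This is a minimal illustration, not the implementation described in the text: the function name `build_data_matrix` is hypothetical, and scaling each sample to unit Euclidean norm is an assumption, since the text only says that each vector must be normalized.

```python
import numpy as np

def build_data_matrix(train_samples, test_samples):
    """Stack samples as columns of X (d x n) and scale each column to unit L2 norm.

    Assumption: the normalization is unit L2 norm per sample; the source text
    does not specify which normalization is used.
    """
    # Rows of the inputs are samples; transpose so that each sample is a column.
    X = np.vstack([train_samples, test_samples]).T.astype(float)
    norms = np.linalg.norm(X, axis=0, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero samples unchanged
    return X / norms

rng = np.random.default_rng(0)
train = rng.random((4, 6))  # 4 labeled samples, 6 genes (toy sizes)
test = rng.random((3, 6))   # 3 unlabeled samples
X = build_data_matrix(train, test)
print(X.shape)  # (6, 7)
```

Keeping the training samples in the first columns makes it easy to track which columns of Z and E are labeled later on.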

In the second step, the given data matrix X is decomposed by low-rank representation; the resulting low-rank matrix Z and sparse matrix E must satisfy:

$$\min_{Z,E} \; \|Z\|_* + \lambda \|E\|_{2,1}$$

$$\text{s.t.} \quad X = XZ + E$$

where $Z = [z_1, z_2, \dots, z_n] \in \mathbb{R}^{n \times n}$; $\|Z\|_* = \sum_i \sigma_i(Z)$ is the nuclear norm of Z, and $\sigma_i(Z)$ is the i-th singular value of Z; $\|E\|_{2,1}$ is the $\ell_{2,1}$ norm of E; and λ is a balance parameter. In this example, choosing λ = 2 gives good results. The problem above can be solved with the alternating direction method of multipliers (ADMM).
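One common way to solve this problem is the inexact augmented Lagrangian form of ADMM, sketched below. The sketch is illustrative only: the auxiliary variable J (with the extra constraint Z = J), the multipliers Y1 and Y2, and the parameters `mu`, `mu_max`, `rho`, `tol`, and `max_iter` are assumptions that do not appear in the text, and the $\ell_{2,1}$ norm is taken here as the sum of the Euclidean norms of the columns of E.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: proximal operator of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def shrink_columns(M, tau):
    """Column-wise shrinkage: proximal operator of tau * l2,1 norm."""
    norms = np.linalg.norm(M, axis=0)
    scale = np.maximum(norms - tau, 0.0) / np.maximum(norms, 1e-12)
    return M * scale

def lrr(X, lam, mu=1e-2, mu_max=1e6, rho=1.5, tol=1e-6, max_iter=500):
    """Approximately solve min ||Z||_* + lam * ||E||_{2,1} s.t. X = XZ + E."""
    d, n = X.shape
    Z = np.zeros((n, n)); J = np.zeros((n, n)); E = np.zeros((d, n))
    Y1 = np.zeros((d, n)); Y2 = np.zeros((n, n))  # Lagrange multipliers
    XtX = X.T @ X
    inv = np.linalg.inv(np.eye(n) + XtX)
    for _ in range(max_iter):
        J = svt(Z + Y2 / mu, 1.0 / mu)                         # nuclear-norm step
        Z = inv @ (XtX - X.T @ E + J + (X.T @ Y1 - Y2) / mu)   # least-squares step
        R = X - X @ Z
        E = shrink_columns(R + Y1 / mu, lam / mu)              # l2,1 step
        r1, r2 = R - E, Z - J                                  # constraint residuals
        Y1 += mu * r1
        Y2 += mu * r2
        mu = min(rho * mu, mu_max)
        if max(np.abs(r1).max(), np.abs(r2).max()) < tol:
            break
    return Z, E

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3)) @ rng.standard_normal((3, 12))  # toy rank-3 data
X /= np.linalg.norm(X, axis=0)
Z, E = lrr(X, lam=2.0)
print(Z.shape, E.shape)  # (12, 12) (20, 12)
```

Each column of Z (and of E) is then treated as the feature vector of the corresponding sample in the clustering stage.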

Fig. 2 is a schematic diagram for a cancer gene expression data set that contains 14 cancer types (BR: breast cancer, PR: prostate cancer, LU: lung cancer, CO: colorectal adenocarcinoma, LY: lymphoma, BL: transitional cell carcinoma of the bladder, ML: melanoma, UT: endometrial adenocarcinoma, LE: leukemia, RE: renal cell carcinoma, PA: pancreatic cancer, OV: ovarian adenocarcinoma, MS: pleural mesothelioma, CNS: central nervous system cancer) and 198 samples, each with 11,370 feature components. In (a), the data distribution shows no salient pattern. In (b), the pattern is obvious: cancer samples of the same class fall into the same subspace, so classification based on this matrix should outperform classification based on the original matrix. In (c), the data distribution shows some pattern: because of sparsity the matrix has few nonzero values, and the feature components with more nonzero values have a larger influence on the final classification result. Classification based on the sparse matrix thus provides a different view that assists the final classification result.

(2) Training and prediction stage

Unlike traditional supervised methods, in which training and prediction are separate, this method is a semi-supervised clustering method in which model training and prediction are carried out together. As shown in Fig. 1, training and prediction are iterated, and each iteration is divided into three steps.

In the first step, initial points are computed for matrix Z and matrix E separately. The initial point coordinates $p_Z^{(i)}$ of each class i in Z are computed as follows:

$$p_Z^{(i)} = \frac{\sum_{j=1}^{n_l^{(i)}} z_{l,j}^{(i)}}{n_l^{(i)}}$$

where $n_l^{(i)}$ is the number of points in the i-th cluster of $Z_l$, $z_{l,j}^{(i)}$ is the j-th sample of the i-th cluster of $Z_l$, and $Z_l$ is the matrix formed by the labeled samples in Z. The initial point coordinates $p_E^{(i)}$ of each class i in E are obtained in the same way.
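The initial point of a class is simply the mean of the labeled columns that belong to it. A minimal sketch, assuming samples are stored as columns and that every class has at least one labeled sample (the function name `initial_points` and its arguments are hypothetical):

```python
import numpy as np

def initial_points(M, labeled_idx, labels, classes):
    """Return one row per class: the mean of the labeled columns of M in that class."""
    labeled_idx = np.asarray(labeled_idx)
    labels = np.asarray(labels)
    return np.stack([M[:, labeled_idx[labels == c]].mean(axis=1) for c in classes])

# Toy matrix with 4 samples (columns); the first three are labeled.
Z = np.array([[1.0, 3.0, 0.0, 9.0],
              [0.0, 2.0, 4.0, 9.0]])
P = initial_points(Z, labeled_idx=[0, 1, 2], labels=[0, 0, 1], classes=[0, 1])
print(P)  # class 0 -> [2., 1.], class 1 -> [0., 4.]
```

The same function can be applied to E to obtain the initial points of the sparse matrix.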

In the second step, starting from the initial points $p_Z^{(i)}$ and $p_E^{(i)}$ obtained in the first step, an unsupervised clustering method is applied to the low-rank matrix Z and the sparse matrix E. In this example, the standard K-means algorithm is chosen as the unsupervised clustering method, and choosing the Minkowski distance as the distance metric gives good results. The prediction result vectors based on the low-rank matrix and the sparse matrix are denoted $l_Z$ and $l_E$, respectively.
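A seeded K-means of this kind can be sketched as below. This is an illustration under assumptions: the order `p` of the Minkowski distance is not given in the text, samples are stored here as rows, and the cluster centers are updated with the arithmetic mean, which is the exact minimizer only for p = 2. Because cluster k starts from the initial point of class k, the cluster index is read directly as the predicted class.

```python
import numpy as np

def minkowski(A, B, p=2.0):
    """Pairwise Minkowski distances of order p between rows of A and rows of B."""
    diff = np.abs(A[:, None, :] - B[None, :, :])
    return (diff ** p).sum(axis=2) ** (1.0 / p)

def seeded_kmeans(points, centers, p=2.0, max_iter=100):
    """K-means on row vectors, started from the given class centers."""
    centers = np.array(centers, dtype=float)
    for _ in range(max_iter):
        assign = minkowski(points, centers, p).argmin(axis=1)
        new_centers = centers.copy()
        for k in range(len(centers)):
            members = points[assign == k]
            if len(members):  # keep the old center if a cluster becomes empty
                new_centers[k] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign, centers

points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
assign, centers = seeded_kmeans(points, centers=[[0.0, 0.5], [5.0, 5.0]], p=3.0)
print(assign.tolist())  # [0, 0, 1, 1]
```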

In the third step, the two prediction results $l_Z$ and $l_E$ are compared, and either suitable unlabeled samples are selected as labeled samples and added to the next iteration, or the iteration terminates and the result is output.

Suitable samples are selected as follows:

(1) An unlabeled sample is selected as a labeled sample for the next iteration if and only if the following holds:

$$l_{z_i} = l_{e_i}$$

where $l_{z_i}$ is the prediction for the i-th sample in the clustering result $l_Z$, and $l_{e_i}$ is the prediction for the i-th sample in the clustering result $l_E$;

(2) A set S is defined and initialized to the empty set; all samples that meet the criterion above are put into it.
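The selection rule amounts to an element-wise comparison of the two label vectors, restricted to samples that are still unlabeled. A small sketch with made-up label vectors (the variable names are illustrative only):

```python
import numpy as np

l_Z = np.array([0, 1, 1, 2, 0])   # predictions from the low-rank matrix
l_E = np.array([0, 2, 1, 2, 1])   # predictions from the sparse matrix
unlabeled = np.array([False, True, True, True, True])  # sample 0 is already labeled

# S holds the indices of unlabeled samples on which both predictions agree.
S = np.flatnonzero(unlabeled & (l_Z == l_E))
print(S.tolist())  # [2, 3]
```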

The criteria for deciding whether the algorithm terminates are as follows:

(1) If the number of iterations reaches the maximum, the algorithm terminates; otherwise, the samples with identical predictions are moved from the test set to the training set, and the next iteration begins;

(2) If S is empty, the algorithm terminates; otherwise, the samples with identical predictions are moved from the test set to the training set, and the next iteration begins.

If the algorithm does not terminate, the training set and test set are updated as follows: if sample i is in S, $z_i$ is removed from $Z_u$ and added to $Z_l$, and $e_i$ is removed from $E_u$ and added to $E_l$. Here $Z_l$ is the matrix formed by the labeled samples in Z and $Z_u$ the matrix formed by the unlabeled samples in Z; $E_l$ is the matrix formed by the labeled samples in E and $E_u$ the matrix formed by the unlabeled samples in E.

After the algorithm terminates, $l_Z$ is returned as the prediction result.
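The whole training and prediction stage can be sketched as one loop. This is a simplified illustration under stated assumptions, not the implementation from the text: samples are rows of `Zf` and `Ef` (the transposed columns of Z and E), clustering uses the Euclidean distance (Minkowski with p = 2), all samples are clustered in each pass, every class has at least one labeled sample, and all function and parameter names are hypothetical.

```python
import numpy as np

def class_means(F, idx, labels, n_classes):
    """Initial point of each class: mean of the labeled rows of F in that class."""
    return np.stack([F[idx[labels == c]].mean(axis=0) for c in range(n_classes)])

def seeded_kmeans(F, centers, n_iter=50):
    """K-means started from the class centers; cluster k is read as class k."""
    for _ in range(n_iter):
        d = np.linalg.norm(F[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        centers = np.stack([F[assign == k].mean(axis=0) if np.any(assign == k)
                            else centers[k] for k in range(len(centers))])
    return assign

def self_training(Zf, Ef, train_idx, train_labels, n_classes, max_iter=10):
    n = Zf.shape[0]
    labels = np.full(n, -1)
    labels[train_idx] = train_labels
    labeled = np.zeros(n, dtype=bool)
    labeled[train_idx] = True
    for it in range(max_iter):
        idx = np.flatnonzero(labeled)
        l_Z = seeded_kmeans(Zf, class_means(Zf, idx, labels[idx], n_classes))
        l_E = seeded_kmeans(Ef, class_means(Ef, idx, labels[idx], n_classes))
        S = np.flatnonzero(~labeled & (l_Z == l_E))  # agreeing unlabeled samples
        if S.size == 0 or it == max_iter - 1:        # termination criteria
            break
        labels[S] = l_Z[S]     # move agreeing samples into the training set
        labeled[S] = True
    return l_Z

Zf = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
               [10.0, 10.1], [10.2, 10.0], [10.1, 9.9]])
Ef = np.array([[1.0, 0.0], [0.9, 0.1], [1.1, 0.0],
               [0.0, 1.0], [0.1, 0.9], [0.0, 1.1]])
pred = self_training(Zf, Ef, np.array([0, 3]), np.array([0, 1]), n_classes=2)
print(pred.tolist())  # [0, 0, 0, 1, 1, 1]
```

Only samples on which both views agree are ever promoted to the training set, which is what limits the propagation of wrong labels between iterations.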

Claims

1. A gene expression data cancer classification method fusing self-learning and low-rank representation, characterized by comprising the following steps:

Step 1: for a given cancer gene expression data set, the labeled data form the training set and the unlabeled data form the test set; the data are merged into a data matrix X and normalized;

Step 2: the resulting data matrix is decomposed by low-rank representation to obtain a low-rank matrix Z and a sparse matrix E;

Step 3: using the label information of the training set, the initial point coordinates $p^{(i)}$ of each class i are computed on the low-rank matrix Z and on the sparse matrix E;

Step 4: an unsupervised clustering method is applied to the low-rank matrix Z and to the sparse matrix E, giving the prediction results $l_Z$ and $l_E$ respectively;

Step 5: the two prediction results $l_Z$ and $l_E$ are compared; if no sample receives the same prediction in both, or the maximum number of iterations is reached, the prediction result $l_Z$ based on the low-rank matrix is output; otherwise, the samples with identical predictions are moved from the test set to the training set, and the method returns to Step 3.
2. The gene expression data cancer classification method fusing self-learning and low-rank representation according to claim 1, characterized in that the given cancer gene expression data set in Step 1 comprises labeled data and unlabeled data, where the label is the cancer class.
3. The gene expression data cancer classification method fusing self-learning and low-rank representation according to claim 1, characterized in that in Step 1 the resulting data matrix X must satisfy:

$$X = [x_1, x_2, \dots, x_n] \in \mathbb{R}^{d \times n}$$

where $x_i$ is the column vector of one gene expression data sample, $d$ is the vector dimension, X contains n samples in total, and n is the sum of the numbers of samples in the training set and the test set; each vector must be normalized.
4. The gene expression data cancer classification method fusing self-learning and low-rank representation according to claim 1, characterized in that in Step 2 the given data matrix X is decomposed by low-rank representation, and the resulting low-rank matrix Z and sparse matrix E must satisfy:

$$\min_{Z,E} \; \|Z\|_* + \lambda \|E\|_{2,1}$$

$$\text{s.t.} \quad X = XZ + E$$

where $\|Z\|_* = \sum_i \sigma_i(Z)$ is the nuclear norm of Z, and $\sigma_i(Z)$ is the i-th singular value of Z; $\|E\|_{2,1}$ is the $\ell_{2,1}$ norm of E; and λ is a balance parameter.
5. The gene expression data cancer classification method fusing self-learning and low-rank representation according to claim 1, characterized in that in Step 3 the initial point coordinates $p_Z^{(i)}$ of each class i in Z are computed as follows:

$$p_Z^{(i)} = \frac{\sum_{j=1}^{n_l^{(i)}} z_{l,j}^{(i)}}{n_l^{(i)}}$$

where $n_l^{(i)}$ is the number of points in the i-th cluster of $Z_l$, $z_{l,j}^{(i)}$ is the j-th sample of the i-th cluster of $Z_l$, and $Z_l$ is the matrix formed by the labeled samples in Z; the initial point coordinates $p_E^{(i)}$ of each class i in E are computed as follows:

$$p_E^{(i)} = \frac{\sum_{j=1}^{n_l^{(i)}} e_{l,j}^{(i)}}{n_l^{(i)}}$$

where $e_{l,j}^{(i)}$ is the j-th sample of the i-th cluster of $E_l$, and $E_l$ is the matrix formed by the labeled samples in E.
6. The gene expression data cancer classification method fusing self-learning and low-rank representation according to claim 1, characterized in that in Step 4 an unsupervised clustering method is applied to the low-rank matrix Z and the sparse matrix E; the method requires initial cluster centers to be determined and a distance metric to be chosen to measure the similarity of two samples; in this step, the initial points are $p_Z^{(i)}$ and $p_E^{(i)}$ respectively, and the distance metric is the Minkowski distance.
7. The gene expression data cancer classification method fusing self-learning and low-rank representation according to claim 1, characterized in that in Step 5 the two prediction results $l_Z$ and $l_E$ are compared and unlabeled samples are selected as labeled samples and added to the next iteration, specifically as follows:

(1) An unlabeled sample is selected as a labeled sample for the next iteration if and only if the following holds:

$$l_{z_i} = l_{e_i}$$

where $l_{z_i}$ is the prediction for the i-th sample in the clustering result $l_Z$, and $l_{e_i}$ is the prediction for the i-th sample in the clustering result $l_E$;

(2) A set S is defined and initialized to the empty set; all samples that meet the criterion above are put into it.
8. The gene expression data cancer classification method fusing self-learning and low-rank representation according to claim 1, characterized in that in Step 5 the criteria for deciding whether the algorithm terminates are as follows:

(1) If the number of iterations reaches the maximum, the algorithm terminates; otherwise, the samples with identical predictions are moved from the test set to the training set, and the method returns to Step 3 for the next iteration;

(2) If S is empty, the algorithm terminates; otherwise, the samples with identical predictions are moved from the test set to the training set, and the method returns to Step 3 for the next iteration;

After the algorithm terminates, $l_Z$ is returned as the prediction result.
9. The gene expression data cancer classification method fusing self-learning and low-rank representation according to claim 7 or 8, characterized in that in Step 5 moving the samples with identical predictions from the test set to the training set is defined as follows:

If sample i is in S, $z_i$ is removed from $Z_u$ and added to $Z_l$, and $e_i$ is removed from $E_u$ and added to $E_l$; where $Z_l$ is the matrix formed by the labeled samples in Z and $Z_u$ the matrix formed by the unlabeled samples in Z; $E_l$ is the matrix formed by the labeled samples in E and $E_u$ the matrix formed by the unlabeled samples in E.