CN107526946A - Gene expression data cancer classification method combining self-learning and low-rank representation - Google Patents
- Publication number
- CN107526946A (application number CN201611207518.8A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Abstract
The invention discloses a gene expression data cancer classification method combining self-learning and low-rank representation, comprising: Step 1, for a given cancer gene expression data set, merging the data to construct a data matrix and normalizing it; Step 2, decomposing the resulting data matrix using low-rank representation to obtain a low-rank matrix and a sparse matrix; Step 3, using the label information of the training set to compute the initial point of each category in the low-rank matrix and in the sparse matrix; Step 4, applying an unsupervised clustering method to the low-rank matrix and to the sparse matrix, obtaining one prediction result from each; Step 5, comparing the two prediction results: if no sample receives the same prediction in both, or the maximum number of iterations is reached, outputting the prediction result based on the low-rank matrix; otherwise, moving the samples with identical predictions from the test set into the training set and returning to Step 3. The invention can improve prediction accuracy while using only a small number of labeled samples, reducing the time and labor cost of labeling samples.
Description
Technical Field
The invention relates to the field of bioinformatics gene expression and cancer classification, in particular to a gene expression data cancer classification method combining self-learning and low-rank representation.
Background
Cancer is a fatal disease caused by abnormal cell growth, and no fully effective treatment is available so far. Early diagnosis effectively aids cancer treatment, so accurately classifying and predicting cancers is a valuable problem. With the development of high-throughput technology, gene expression data on cancer are accumulating rapidly, and machine learning has advanced considerably in recent years, so it has become possible to predict cancer categories from gene expression data with machine learning; see, for example: (1) Chen, X.Y. and Jian, C.R. Gene expression data clustering based on graph regularized subspace segmentation. Neurocomputing 2014;143; (2) Liao, Q., Guan, N. and Zhang, Q. Gauss-Seidel based non-negative matrix factorization for gene expression clustering. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2016, p. 2364-2368; (3) Liu, J.X., et al. RPCA-Based Tumor Classification Using Gene Expression Data. IEEE/ACM T Comput Bi 2015;12(4):964-970. However, most existing methods are either unsupervised or supervised, and each kind has its own defects.
Unsupervised learning methods discover latent structure in unlabeled data by fitting a model. Since all samples are unlabeled, label information cannot be used for error correction while the model is trained. As a result, models obtained by unsupervised learning have poor predictive capability and cannot provide useful prediction accuracy. Supervised learning is the opposite: it trains models on labeled data. Because label data can be used during training, a model obtained by supervised learning can provide higher prediction accuracy. However, training a model with a supervised method requires a large amount of labeled data, which is expensive and consumes considerable manpower and time, especially when labeling gene expression data.
In view of the defects that neither kind of learning method can overcome, semi-supervised learning offers a new approach: models are trained with a large amount of unlabeled data and a small amount of labeled data, which can give much better predictions than unsupervised methods. The self-learning model that this method improves upon is a traditional semi-supervised learning method: it adds the samples predicted with higher confidence to the training set, iterates training and prediction, and finally classifies all data in the test set. A number of effective semi-supervised methods have been applied to the analysis of cancer gene expression data, for example: (1) Cai, X.F., et al. Local and Global Preserving Semisupervised Dimensionality Reduction Based on Random Subspace for Cancer Classification. IEEE J Biomed Health 2014;18(2):500-. However, processing gene expression data remains challenging:
(1) Gene expression data are high-dimensional
Since each feature of gene expression data corresponds to one gene, and humans have no fewer than 25,000 genes, gene expression data often have tens of thousands of feature components. When traditional classification methods process such high-dimensional data, they are very sensitive to noise and redundancy in the data and struggle to provide accurate predictions;
(2) Gene expression data sets are small
Because measuring gene expression with gene microarray technology is expensive and costly in time and labor, a data set obtained in one experiment is very small, often containing only dozens or hundreds of samples, and it is difficult to train an effective model on so little data.
Disclosure of Invention
The invention aims to provide a gene expression data cancer classification method combining self-learning and low-rank representation that addresses the problems encountered in the prior art when gene expression data are used for cancer classification prediction: the data dimension is high, the test set is small, and labeled data are scarce.
The technical solution for realizing the purpose of the invention is as follows: a gene expression data cancer classification method combining self-learning and low-rank representation comprises the following steps:
step 1, for a given cancer gene expression data set, a set with label data is a training set, and a non-label data set is a testing set; merging the data to construct a data matrix X, and carrying out normalization processing;
step 2, decomposing the obtained data matrix by using a low-rank expression method to obtain a low-rank matrix Z and a sparse matrix E;
step 3, respectively calculating the initial point coordinates $p^{(i)}$ of each category i on the low-rank matrix Z and the sparse matrix E by using the label information of the training set;
step 4, applying an unsupervised clustering method to the low-rank matrix Z and the sparse matrix E respectively, to obtain the prediction results $l_Z$ and $l_E$ based on the low-rank matrix Z and the sparse matrix E;
step 5, comparing the two prediction results $l_Z$ and $l_E$: if no sample receives the same prediction in both, or the maximum number of iterations is reached, outputting the prediction result $l_Z$ based on the low-rank matrix; otherwise, removing the samples with identical predictions from the test set, adding them to the training set, and returning to step 3.
Compared with the prior art, the invention has the following notable advantages: (1) the method incorporates low-rank representation and can extract essential global features from the original high-dimensional data; (2) the method uses both the low-rank matrix and the sparse matrix obtained from the low-rank representation decomposition, and is therefore more effective than traditional low-rank-representation-based methods, which use the information in only one matrix.
Drawings
FIG. 1 is an exemplary flow chart of a method for cancer classification incorporating self-learning and low rank representation of gene expression data.
FIG. 2 is a schematic diagram of a cancer gene expression data set, wherein (a), (b) and (c) are an original data matrix and a low-rank matrix and a sparse matrix after low-rank decomposition, respectively, and each expression value in the matrix corresponds to a gray value of a pixel point. In the bar above each matrix, each color patch corresponds to a category of cancer.
Detailed Description
Cancer classification prediction from gene expression data is a typical high-dimensional, small-sample problem. To address it, the method draws on low-rank representation, a feature extraction method commonly used for matrix recovery in image processing, which obtains the intrinsic low-dimensional structure of the data by constraining the rank of the data matrix.
By using a semi-supervised learning method and a feature extraction method, the problem of using gene expression data for cancer classification prediction can be solved.
Embodiments of the present invention will now be described in detail, by way of example, with reference to the accompanying drawings.
As shown in FIG. 1, according to the preferred embodiment of the present invention, the gene expression data cancer classification method combining self-learning and low-rank representation is used for class prediction of samples in a cancer gene expression data set. To reflect practical use, part of the data is treated as unlabeled and defined as the test set; the remaining samples are defined as the training set. During training and prediction, only the label information of the samples in the training set may be used; the class information of the test set is used only for comparison with the predicted classes of the test set. The classification prediction is divided into two stages, the feature extraction stage and the training and prediction stage, as shown in FIG. 1; the implementation of the two stages is described in detail below.
(1) Feature extraction stage
First, the feature vectors in the training set and the test set are combined to construct a data matrix X, which must satisfy:
$$X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{d \times n}$$
where $x_i$ is the column vector of one gene expression data sample, with vector dimension d. X contains n samples in total, where n is the sum of the numbers of samples in the training set and the test set. Each vector needs to be normalized.
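The construction of X can be sketched as follows. This is a minimal illustration, not the patent's implementation: the text does not say which normalization is applied, so scaling each sample to unit Euclidean norm is an assumption, and the function name `build_data_matrix` is hypothetical.

```python
import numpy as np

def build_data_matrix(train_samples, test_samples):
    """Stack samples as columns of X (d x n) and scale each column to unit L2 norm.

    Unit-norm scaling is an assumed choice; the source only says "normalized".
    """
    X = np.column_stack([*train_samples, *test_samples]).astype(float)
    norms = np.linalg.norm(X, axis=0, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero samples unchanged
    return X / norms

# Toy data: 3 labeled and 4 unlabeled samples with d = 5 features each.
rng = np.random.default_rng(0)
train = [rng.random(5) for _ in range(3)]
test = [rng.random(5) for _ in range(4)]
X = build_data_matrix(train, test)
print(X.shape)
```

Keeping the training samples in the first columns makes it easy to track which columns of X carry labels in the later steps.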
Second, the given data matrix X is decomposed using the low-rank representation method; the resulting low-rank matrix Z and sparse matrix E must satisfy:
$$\min_{Z,E} \|Z\|_* + \lambda \|E\|_{2,1}$$
$$\text{s.t. } X = XZ + E$$
where $Z = [z_1, z_2, \ldots, z_n] \in \mathbb{R}^{n \times n}$; $\|Z\|_* = \sum_i \sigma_i(Z)$ is the nuclear norm of Z, and $\sigma_i(Z)$ is the i-th singular value of Z; $\|E\|_{2,1} = \sum_{j=1}^{n} \sqrt{\sum_{i=1}^{d} E_{ij}^2}$ is the $l_{2,1}$ norm of E; λ is the balance parameter. In this example, λ = 2 gives better results. The Alternating Direction Method of Multipliers (ADMM) may be used to solve the above problem.
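One common way to solve this problem is an ADMM-style inexact augmented Lagrangian iteration with an auxiliary variable J = Z. The sketch below follows that scheme under stated assumptions: the source names the algorithm family but gives no solver details, so the function names and the penalty settings (`mu`, `rho`, `mu_max`, `tol`) are illustrative choices, not values from the patent.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: proximal operator of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def l21_shrink(Q, tau):
    """Column-wise shrinkage: proximal operator of tau * l2,1 norm."""
    norms = np.linalg.norm(Q, axis=0)
    scale = np.maximum(norms - tau, 0.0) / np.maximum(norms, 1e-12)
    return Q * scale

def lrr(X, lam=2.0, mu=1e-2, mu_max=1e6, rho=1.1, tol=1e-6, max_iter=1000):
    """Approximately solve min ||Z||_* + lam * ||E||_{2,1}  s.t.  X = XZ + E."""
    d, n = X.shape
    Z = np.zeros((n, n)); J = np.zeros((n, n)); E = np.zeros((d, n))
    Y1 = np.zeros((d, n)); Y2 = np.zeros((n, n))  # Lagrange multipliers
    XtX = X.T @ X
    inv = np.linalg.inv(np.eye(n) + XtX)
    for _ in range(max_iter):
        J = svt(Z + Y2 / mu, 1.0 / mu)                      # low-rank update
        Z = inv @ (XtX - X.T @ E + J + (X.T @ Y1 - Y2) / mu)
        R = X - X @ Z
        E = l21_shrink(R + Y1 / mu, lam / mu)               # sparse update
        r1, r2 = R - E, Z - J                               # constraint residuals
        Y1 += mu * r1
        Y2 += mu * r2
        mu = min(rho * mu, mu_max)
        if max(np.abs(r1).max(), np.abs(r2).max()) < tol:
            break
    return Z, E

# Toy data: 12 samples lying in a 3-dimensional subspace of R^20.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3)) @ rng.normal(size=(3, 12))
Z, E = lrr(X)
print(Z.shape, E.shape, np.abs(X - X @ Z - E).max())
```

Each column of Z expresses one sample as a combination of the other samples, while each column of E holds the part of that sample the combination does not explain.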
FIG. 2 is a schematic representation of a cancer gene expression data set comprising 14 cancer types in total (BR: breast cancer, PR: prostate cancer, LU: lung cancer, CO: colorectal adenocarcinoma, LY: lymphoma, BL: transitional cell carcinoma of the bladder, ML: melanoma, UT: endometrial adenocarcinoma, LE: leukemia, RE: renal cell carcinoma, PA: pancreatic cancer, OV: ovarian adenocarcinoma, MS: pleural mesothelioma, CNS: central nervous system cancer) and 198 samples, each with 11370 feature components. The data distribution in (a) shows no obvious structure. In (b) the structure of the data distribution is evident: cancer samples of the same category fall into the same subspace, so classification based on this matrix is clearly better than classification based on the original matrix. The data distribution in (c) also shows some structure: because of sparsity, the matrix has few non-zero values, and feature components with more non-zero values have a larger influence on the final classification result. Classification based on the sparse matrix offers a different perspective and plays an auxiliary role in the final classification result.
(2) Training and prediction phase
Unlike traditional supervised methods, in which training and prediction are separate, this method is a semi-supervised clustering method in which model training and prediction proceed simultaneously. As shown in FIG. 1, training and prediction iterate through three steps.
First, the initial points of matrix Z and matrix E are computed. The initial point coordinates $p_Z^{(i)}$ of each category i in Z are calculated as:
$$p_Z^{(i)} = \frac{\sum_{j=1}^{n_l^{(i)}} z_{l,j}^{(i)}}{n_l^{(i)}}$$
where $n_l^{(i)}$ is the number of points in the i-th cluster of $Z_l$, $z_{l,j}^{(i)}$ is the j-th sample of the i-th cluster in $Z_l$, and $Z_l$ is the matrix composed of the labeled samples in Z. The same method gives the initial point coordinates $p_E^{(i)}$ of each category i in E.
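The initial point of a category is simply the mean of that category's labeled columns. A minimal sketch, with the hypothetical helper name `initial_points` and a toy 2 x 4 matrix standing in for the labeled part of Z:

```python
import numpy as np

def initial_points(M_l, labels):
    """Return {category: mean column} for a labeled sample matrix M_l (features x n_l)."""
    labels = np.asarray(labels)
    return {c.item(): M_l[:, labels == c].mean(axis=1) for c in np.unique(labels)}

# Toy labeled matrix: columns 0-1 belong to category 0, columns 2-3 to category 1.
Z_l = np.array([[1.0, 3.0, 0.0, 2.0],
                [0.0, 2.0, 4.0, 6.0]])
labels = [0, 0, 1, 1]
p_Z = initial_points(Z_l, labels)
print(p_Z[0], p_Z[1])  # [2. 1.] [1. 5.]
```

The same function applied to the labeled columns of E yields the initial points for the sparse matrix.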
Second, based on the initial points $p_Z^{(i)}$ and $p_E^{(i)}$ obtained in the first step, an unsupervised clustering method is applied to the low-rank matrix Z and the sparse matrix E. In this example, the standard K-means algorithm is chosen as the unsupervised clustering method, with the Minkowski distance as the distance metric, to achieve better results. The prediction results based on the low-rank matrix and the sparse matrix are expressed as the vectors $l_Z$ and $l_E$, respectively.
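A seeded K-means of this kind might look like the sketch below. The order `p` of the Minkowski distance is not recoverable from the text, so it is exposed as a hypothetical parameter (p = 2 gives the Euclidean distance); the function name `seeded_kmeans` and the iteration cap are likewise assumptions.

```python
import numpy as np

def seeded_kmeans(M, centers, p=2.0, max_iter=100):
    """Cluster the columns of M, starting from the given centers (features x k)."""
    C = np.array(centers, dtype=float)
    assign = None
    for _ in range(max_iter):
        # Minkowski distance between every sample and every center: shape (n, k).
        diff = np.abs(M[:, :, None] - C[:, None, :])
        dist = (diff ** p).sum(axis=0) ** (1.0 / p)
        new_assign = dist.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break  # assignments stopped changing
        assign = new_assign
        for k in range(C.shape[1]):
            if np.any(assign == k):  # keep the old center if a cluster empties
                C[:, k] = M[:, assign == k].mean(axis=1)
    return assign, C

# Toy data: two samples near (0, 0) and two near (5, 5), seeded with one center each.
M = np.array([[0.0, 0.2, 5.0, 5.1],
              [0.0, 0.1, 5.0, 4.9]])
seeds = np.array([[0.0, 5.0],
                  [0.0, 5.0]])
l, C = seeded_kmeans(M, seeds)
print(l)  # [0 0 1 1]
```

Because cluster k starts from the initial point of category k, the cluster index can be read directly as the predicted category, with no label matching step afterwards.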
Third, the two prediction results $l_Z$ and $l_E$ are compared, and either suitable unlabeled samples are selected as labeled samples and added to the next iteration, or the iteration ends and the result is output.
The manner of selecting the appropriate sample is as follows:
(1) an unlabeled sample is selected as a labeled sample to be added to the next iteration if and only if the following holds:
$$l_{z_i} = l_{e_i}$$
where $l_{z_i}$ is the prediction for the i-th sample in the clustering result $l_Z$, and $l_{e_i}$ is the prediction for the i-th sample in the clustering result $l_E$;
(2) a set S is defined, initialized to an empty set, and all samples meeting the above criteria are placed in it.
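The selection rule amounts to keeping the unlabeled samples on which the two clusterings agree. A minimal sketch, where `agreeing_samples` and `unlabeled_idx` are hypothetical names introduced for illustration:

```python
import numpy as np

def agreeing_samples(l_Z, l_E, unlabeled_idx):
    """Return the set S: unlabeled sample indices with identical predictions."""
    l_Z, l_E = np.asarray(l_Z), np.asarray(l_E)
    return [i for i in unlabeled_idx if l_Z[i] == l_E[i]]

l_Z = [0, 1, 1, 2, 0]   # predictions from the low-rank matrix
l_E = [0, 1, 2, 2, 1]   # predictions from the sparse matrix
S = agreeing_samples(l_Z, l_E, unlabeled_idx=[2, 3, 4])
print(S)  # [3]
```

Samples 0 and 1 also agree but are already labeled, so only sample 3 enters S.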
The specific criteria for judging whether the algorithm is finished are as follows:
(1) if the number of iterations reaches the maximum, the algorithm terminates; otherwise, the samples with identical predictions are removed from the test set and added to the training set, and the next iteration begins;
(2) if S is the empty set, the algorithm terminates; otherwise, the samples with identical predictions are removed from the test set and added to the training set, and the next iteration begins.
If the algorithm has not terminated, the training set and the test set are updated as follows: if sample i is in S, $z_i$ is removed from $Z_u$ and added to $Z_l$, and $e_i$ is removed from $E_u$ and added to $E_l$. Here $Z_l$ is the matrix composed of the labeled samples in Z and $Z_u$ is the matrix composed of the unlabeled samples in Z; $E_l$ is the matrix composed of the labeled samples in E and $E_u$ is the matrix composed of the unlabeled samples in E.
After the algorithm terminates, $l_Z$ is returned and output as the prediction result.
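The whole training and prediction stage can be sketched as a single loop. The code below is an illustration under several assumptions rather than the patented implementation: categories are encoded as integers 0..k-1, every category has at least one labeled sample, the Minkowski order `p` and the round limit `max_rounds` are hypothetical parameters, and toy matrices stand in for the Z and E produced by the decomposition.

```python
import numpy as np

def cluster(M, labeled, y, n_classes, p=2.0, max_iter=100):
    """K-means on the columns of M, seeded with category means of labeled columns."""
    C = np.stack([M[:, [i for i in labeled if y[i] == c]].mean(axis=1)
                  for c in range(n_classes)], axis=1)
    for _ in range(max_iter):
        dist = (np.abs(M[:, :, None] - C[:, None, :]) ** p).sum(axis=0) ** (1 / p)
        assign = dist.argmin(axis=1)
        new_C = np.stack([M[:, assign == c].mean(axis=1) if np.any(assign == c)
                          else C[:, c] for c in range(n_classes)], axis=1)
        if np.allclose(new_C, C):
            break
        C = new_C
    return assign

def self_learning(Z, E, train_idx, train_labels, n_classes, max_rounds=10):
    """Iterate: seed, cluster Z and E, adopt agreeing samples; return l_Z."""
    n = Z.shape[1]
    y = dict(zip(train_idx, train_labels))          # known or adopted labels
    unlabeled = [i for i in range(n) if i not in y]
    for _ in range(max_rounds):                     # max_rounds must be >= 1
        labeled = sorted(y)
        l_Z = cluster(Z, labeled, y, n_classes)
        l_E = cluster(E, labeled, y, n_classes)
        S = [i for i in unlabeled if l_Z[i] == l_E[i]]
        if not S:                                   # no agreement: stop
            break
        for i in S:                                 # move S into the training set
            y[i] = int(l_Z[i])
        unlabeled = [i for i in unlabeled if i not in S]
    return l_Z

# Toy stand-ins for Z and E: samples 0-4 form one group, samples 5-9 another.
rng = np.random.default_rng(0)
Z = np.hstack([rng.normal(0.0, 0.1, (4, 5)), rng.normal(3.0, 0.1, (4, 5))])
E = np.hstack([rng.normal(1.0, 0.1, (3, 5)), rng.normal(-2.0, 0.1, (3, 5))])
pred = self_learning(Z, E, train_idx=[0, 5], train_labels=[0, 1], n_classes=2)
print(pred)
```

Tracking labels in a dictionary keyed by column index avoids physically splitting Z and E into labeled and unlabeled submatrices; moving a sample from the unlabeled to the labeled part is then just an insertion into `y`.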
Claims (9)
1. A gene expression data cancer classification method combining self-learning and low-rank representation is characterized by comprising the following steps:
step 1, for a given cancer gene expression data set, a set with label data is a training set, and a non-label data set is a testing set; merging the data to construct a data matrix X, and carrying out normalization processing;
step 2, decomposing the obtained data matrix by using a low-rank expression method to obtain a low-rank matrix Z and a sparse matrix E;
step 3, respectively calculating the initial point coordinates $p^{(i)}$ of each category i on the low-rank matrix Z and the sparse matrix E by using the label information of the training set;
step 4, applying an unsupervised clustering method to the low-rank matrix Z and the sparse matrix E respectively, to obtain the prediction results $l_Z$ and $l_E$ based on the low-rank matrix Z and the sparse matrix E;
step 5, comparing the two prediction results $l_Z$ and $l_E$: if no sample receives the same prediction in both, or the maximum number of iterations is reached, outputting the prediction result $l_Z$ based on the low-rank matrix; otherwise, removing the samples with identical predictions from the test set, adding them to the training set, and returning to step 3.
2. The method for cancer classification based on gene expression data combining self-learning and low-rank representation according to claim 1, wherein: the given cancer gene expression dataset of step 1 comprises tagged data and untagged data, wherein the tag is a cancer class.
3. The method for classifying cancer by gene expression data combining self-learning and low rank expression according to claim 1, wherein the data matrix X obtained in step 1 satisfies the following requirements:
$$X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{d \times n}$$
wherein $x_i$ is the column vector of one gene expression data sample, the vector dimension is d, X contains n samples in total, and n is the sum of the numbers of samples in the training set and the test set; each vector needs to be normalized.
4. The method for cancer classification based on gene expression data combining self-learning and low-rank representation according to claim 1, wherein: in the step 2, for a given data matrix X, decomposition is performed by using a low-rank expression method, and the obtained low-rank matrix Z and sparse matrix E need to satisfy the following conditions:
$$\min_{Z,E} \|Z\|_* + \lambda \|E\|_{2,1}$$
$$\text{s.t. } X = XZ + E$$
wherein $\|Z\|_* = \sum_i \sigma_i(Z)$ is the nuclear norm of Z, and $\sigma_i(Z)$ is the i-th singular value of Z; $\|E\|_{2,1}$ is the $l_{2,1}$ norm of E; λ is the balance parameter.
5. The method for cancer classification based on gene expression data combining self-learning and low-rank representation according to claim 1, wherein: in step 3, the initial point coordinates $p_Z^{(i)}$ of each category i in Z are calculated as follows:
$$p_Z^{(i)} = \frac{\sum_{j=1}^{n_l^{(i)}} z_{l,j}^{(i)}}{n_l^{(i)}}$$
wherein $n_l^{(i)}$ is the number of points in the i-th cluster of $Z_l$, $z_{l,j}^{(i)}$ is the j-th sample of the i-th cluster in $Z_l$, and $Z_l$ is the matrix composed of the labeled samples in Z; the initial point coordinates $p_E^{(i)}$ of each category i in E are calculated as follows:
$$p_E^{(i)} = \frac{\sum_{j=1}^{n_l^{(i)}} e_{l,j}^{(i)}}{n_l^{(i)}}$$
wherein $e_{l,j}^{(i)}$ is the j-th sample of the i-th cluster in $E_l$, and $E_l$ is the matrix composed of the labeled samples in E.
6. The method for cancer classification based on gene expression data combining self-learning and low-rank representation according to claim 1, wherein: in step 4, an unsupervised clustering method is applied to the low-rank matrix Z and the sparse matrix E; the method requires determining initial cluster centers and selecting a distance metric to measure the similarity of two samples; in this step, the initial points are $p_Z^{(i)}$ and $p_E^{(i)}$ respectively, and the distance metric is the Minkowski distance.
7. The method for cancer classification based on gene expression data combining self-learning and low-rank representation according to claim 1, wherein: in said step 5, the two prediction results $l_Z$ and $l_E$ are compared and unlabeled samples are selected as labeled samples and added to the next iteration, specifically as follows:
(1) an unlabeled sample is selected as a labeled sample to be added to the next iteration if and only if the following holds:
$$l_{z_i} = l_{e_i}$$
wherein $l_{z_i}$ is the prediction for the i-th sample in the clustering result $l_Z$, and $l_{e_i}$ is the prediction for the i-th sample in the clustering result $l_E$;
(2) a set S is defined, initialized to an empty set, and all samples meeting the above criteria are placed in it.
8. The method for cancer classification based on gene expression data combining self-learning and low-rank representation according to claim 1, wherein: in step 5, the specific criteria for judging whether the algorithm is finished are as follows:
(1) if the number of iterations reaches the maximum, the algorithm terminates; otherwise, the samples with identical predictions are removed from the test set and added to the training set, the method returns to step 3, and the next iteration begins;
(2) if S is the empty set, the algorithm terminates; otherwise, the samples with identical predictions are removed from the test set and added to the training set, the method returns to step 3, and the next iteration begins;
after the algorithm terminates, $l_Z$ is returned and output as the prediction result.
9. The method for cancer classification by fusing self-learning and low-rank representative gene expression data according to claim 7 or 8, wherein: in the step 5, the specific definitions of removing the test set from the samples with the same prediction and adding the samples into the training set are as follows:
if sample i is in S, $z_i$ is removed from $Z_u$ and added to $Z_l$, and $e_i$ is removed from $E_u$ and added to $E_l$; wherein $Z_l$ is the matrix composed of the labeled samples in Z, and $Z_u$ is the matrix composed of the unlabeled samples in Z; $E_l$ is the matrix composed of the labeled samples in E, and $E_u$ is the matrix composed of the unlabeled samples in E.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611207518.8A CN107526946B (en) | 2016-12-23 | 2016-12-23 | Gene expression data cancer classification method combining self-learning and low-rank representation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611207518.8A CN107526946B (en) | 2016-12-23 | 2016-12-23 | Gene expression data cancer classification method combining self-learning and low-rank representation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107526946A true CN107526946A (en) | 2017-12-29 |
CN107526946B CN107526946B (en) | 2021-07-06 |
Family
ID=60748589
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611207518.8A Active CN107526946B (en) | 2016-12-23 | 2016-12-23 | Gene expression data cancer classification method combining self-learning and low-rank representation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107526946B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108169728A (en) * | 2018-01-12 | 2018-06-15 | 西安电子科技大学 | Range extension target detection method based on Minkowski distances |
CN109378039A (en) * | 2018-08-20 | 2019-02-22 | 中国矿业大学 | Tumor gene expression profile data clustering method based on discrete constraint and capping norm |
CN109671468A (en) * | 2018-12-13 | 2019-04-23 | 韶关学院 | Characteristic gene selection and cancer classification method |
CN109903166A (en) * | 2018-12-25 | 2019-06-18 | 阿里巴巴集团控股有限公司 | Data risk prediction method, device and equipment |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102722892A (en) * | 2012-06-13 | 2012-10-10 | 西安电子科技大学 | SAR (synthetic aperture radar) image change detection method based on low-rank matrix factorization |
CN103400143A (en) * | 2013-07-12 | 2013-11-20 | 中国科学院自动化研究所 | Data subspace clustering method based on multiple view angles |
US20140025689A1 (en) * | 2012-04-24 | 2014-01-23 | International Business Machines Corporation | Determining a similarity between graphs |
CN103793600A (en) * | 2014-01-16 | 2014-05-14 | 西安电子科技大学 | Isolated component analysis and linear discriminant analysis combined cancer forecasting method |
CN105427296A (en) * | 2015-11-11 | 2016-03-23 | 北京航空航天大学 | Ultrasonic image low-rank analysis based thyroid lesion image identification method |
CN106096654A (en) * | 2016-06-13 | 2016-11-09 | 南京信息工程大学 | A kind of cell atypia automatic grading method tactful based on degree of depth study and combination |
CN106202968A (en) * | 2016-07-28 | 2016-12-07 | 北京博源兴康科技有限公司 | The data analysing method of cancer and device |
Non-Patent Citations (2)
Title |
---|
ANINDYA HALDER ET AL.: "Semi-supervised fuzzy K-NN for cancer classification from microarray gene expression data", 2014 First International Conference on Automation, Control, Energy and Systems (ACES) * |
刘晋苏: "Research on dynamic shadow detection in aerial photography from micro unmanned helicopters", China Master's Theses Full-text Database, Engineering Science and Technology II * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108169728A (en) * | 2018-01-12 | 2018-06-15 | 西安电子科技大学 | Range extension target detection method based on Minkowski distances |
CN109378039A (en) * | 2018-08-20 | 2019-02-22 | 中国矿业大学 | Tumor gene expression profile data clustering method based on discrete constraint and capping norm |
CN109378039B (en) * | 2018-08-20 | 2022-02-25 | 中国矿业大学 | Tumor gene expression profile data clustering method based on discrete constraint and capping norm |
CN109671468A (en) * | 2018-12-13 | 2019-04-23 | 韶关学院 | Feature gene selection and cancer classification method |
CN109671468B (en) * | 2018-12-13 | 2023-08-15 | 韶关学院 | Characteristic gene selection and cancer classification method |
CN109903166A (en) * | 2018-12-25 | 2019-06-18 | 阿里巴巴集团控股有限公司 | Data risk prediction method, device and equipment |
CN109903166B (en) * | 2018-12-25 | 2024-01-30 | 创新先进技术有限公司 | Data risk prediction method, device and equipment |
Also Published As
Publication number | Publication date |
---|---|
CN107526946B (en) | 2021-07-06 |
Similar Documents
Publication | Title |
---|---|
EP3478728B1 (en) | Method and system for cell annotation with adaptive incremental learning |
CN107526946B (en) | Gene expression data cancer classification method combining self-learning and low-rank representation |
Drab et al. | Clustering in analytical chemistry |
CN108038352B (en) | Method for mining whole genome key genes by combining differential analysis and association rules |
Kersten | Simultaneous feature selection and Gaussian mixture model estimation for supervised classification problems |
Cheng et al. | DGCyTOF: Deep learning with graphic cluster visualization to predict cell types of single cell mass cytometry data |
Yang et al. | Image-based classification of protein subcellular location patterns in human reproductive tissue by ensemble learning global and local features |
CN107016416B (en) | Data classification prediction method based on neighborhood rough set and PCA fusion |
Liu et al. | SRAS‐net: Low‐resolution chromosome image classification based on deep learning |
CN103440508A (en) | Remote sensing image target recognition method based on bag-of-visual-words model |
CN106485289A (en) | Method and equipment for classifying magnesite ore grade |
Shim et al. | Active cluster annotation for wafer map pattern classification in semiconductor manufacturing |
El Malki et al. | Machine learning for optimal electrode wettability in lithium ion batteries |
Yang et al. | Stacking-based and improved convolutional neural network: a new approach in rice leaf disease identification |
Tu et al. | Robust learning of mislabeled training samples for remote sensing image scene classification |
CN111863135B (en) | False positive structure variation filtering method, storage medium and computing device |
CN117078960A (en) | Near infrared spectrum analysis method and system based on image feature extraction |
Li et al. | SpaDiT: Diffusion Transformer for Spatial Gene Expression Prediction using scRNA-seq |
CN117034110A (en) | Stem cell exosome detection method based on deep learning |
CN114818900A (en) | Semi-supervised feature extraction method and user credit risk assessment method |
Lobo et al. | Bayesian residual analysis for spatially correlated data |
Zhang et al. | Multi-modal Learning with Missing Modality in Predicting Axillary Lymph Node Metastasis |
CN113724060A (en) | Credit risk assessment method and system |
Ahmed et al. | A CNN-based novel approach for the detection of compound Bangla handwritten characters |
CN113033170A (en) | Table standardization processing method, device, equipment and storage medium |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |