CN112687329A - Cancer prediction system based on non-cancer tissue mutation information and construction method thereof
- Publication number: CN112687329A
- Application number: CN201910992441.7A
- Authority: CN (China)
- Prior art keywords: cancer, input, nodes, layer, output
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
A cancer prediction system based on non-cancer tissue mutation information and a construction method thereof. The cancer prediction system includes an input layer, a plurality of hidden layers and an output layer connected in sequence. The system uses whole-exome mutation information to predict cancer, markedly improves prediction accuracy, reduces the amount of mutation data required, and can be widely used in clinical examination.
Description
Technical Field
The invention belongs to the field of artificial intelligence, and particularly relates to a cancer prediction system based on non-cancer tissue mutation information and a construction method thereof.
Background
Although targeted cancer therapy and tumor immunotherapy have cured many patients or significantly improved overall survival in certain diseases, cancer detection and treatment remain a serious problem. Timely and effective diagnosis, which enables early intervention and treatment, is one of the key factors in improving overall survival. The conventional method for early cancer screening is currently imaging, but its high radiation dose and high cost make it unsuitable for frequent use. Current cancer tests based on blood antigens, such as prostate-specific antigen (PSA), the tumor-associated antigen CA-125 and carcinoembryonic antigen (CEA), each target only one or a few cancer types and produce many false positives. The field of clinical cancer detection therefore urgently needs new blood-based detection methods to aid physicians in early diagnosis and screening.
On the other hand, owing to genetic and environmental factors, an individual's genome acquires mutations, and some of these "driver" mutations, together with the accumulation of further mutations, may ultimately lead to cancer. Mutation detection in an individual's genome therefore holds promise for predicting the occurrence of cancer. For example, BRCA1 is regarded as a susceptibility gene for breast and ovarian cancer, in which mutations increase cancer risk, yet only about 3% to 8% of all women with breast cancer carry BRCA1 or BRCA2 mutations. Likewise, BRCA1 mutations are found in only about 18% of ovarian cancers. Meanwhile, with the rapid development of next-generation high-throughput sequencing, large collaborative projects such as the 1000 Genomes Project and The Cancer Genome Atlas (TCGA) have provided abundant genomic information on patients and normal individuals, making it possible to predict cancer from genomic variation. Although pan-cancer analyses have found several genes, such as TP53 and PIK3CA, that are mutated in more than 10% of patients in most cancers, recent studies have found that mutations in these cancer driver genes are also prevalent in the blood and tissues of normal individuals, suggesting that cancer risk cannot be predicted from mutations in a single gene alone. Mutations at multiple key sites have therefore been used for cancer prediction. Existing methods mainly use a support vector machine (SVM) to predict breast cancer and multiple myeloma, but the prediction accuracy is only about 70%, and these studies used only a few hundred samples, which is not enough to demonstrate the reliability of the method.
Therefore, the field of cancer screening urgently needs more advanced and robust methods that can exploit the large number of mutations in an individual's genome for cancer risk prediction, validated on a large test set to demonstrate their reliability.
In recent years, extensive deep learning research has greatly advanced artificial intelligence, and deep learning has been widely applied in precision medicine, including drug molecule design, medical image diagnosis, and prediction of disease driver genes and mutations. In cancer diagnosis, however, its application is mainly based on the analysis of large numbers of clinical images, and deep learning has not yet played a major role in the early diagnosis of cancer.
Disclosure of Invention
In view of the above, the present invention utilizes a deep learning method to construct a deep neural network model to learn genome variation information of a large number of cancer patients and normal persons in an existing database, and can perform cancer risk prediction on a blood-derived sample, thereby establishing a system applicable to early cancer prediction.
In one aspect, the present invention provides a cancer prediction system based on non-cancerous tissue mutation information, the cancer prediction system comprising an input layer, a plurality of hidden layers, and an output layer, which are sequentially connected;
wherein the input layer is used for inputting mutation information, and the mutation information is represented over windows of a preset length into which each exon is cut;
each hidden layer comprises M fully-connected layers, each fully-connected layer comprises N nodes, M and N are positive integers greater than 1, and a Maxout activation function is embedded between the hidden layers, wherein each node in each fully-connected layer performs the following linear transformation:
$$z_{ij} = x^{T} W_{ij} + b_{ij}$$
wherein $z_{ij}$ represents the j-th activation unit of the i-th neuron, $x^{T}$ represents the transpose of the input, and $W_{ij}$ and $b_{ij}$ are parameters that the system needs to learn, representing respectively the weight matrix and the bias vector from the input layer to the activation unit $z_{ij}$; and the Maxout activation function applies a nonlinear transformation to the node information obtained by the linear transformation to obtain N nodes and outputs them to the next layer in the sequence, wherein the output of the Maxout activation function is the maximum value among the activation units:
$$h_i(x) = \max_{j \in [1, k]} z_{ij}$$
wherein k represents the number of activation units in the activation unit group of a Maxout neuron, and x represents the input of the activation function;
the output layer receives the N nodes obtained by the last hidden layer and outputs two nodes, which respectively represent the predicted probability that the individual has cancer or is normal.
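The linear transformation and Maxout activation of one hidden layer can be sketched in plain Python as follows. This is a minimal illustration, not the patented implementation: the function name `maxout_layer`, the nested-list layout of `W` and `b`, and the toy numbers are assumptions made for the example.

```python
from typing import List


def maxout_layer(x: List[float],
                 W: List[List[List[float]]],
                 b: List[List[float]]) -> List[float]:
    """Forward pass of one Maxout hidden layer.

    W[i][j] is the weight vector of activation unit j of neuron i and
    b[i][j] the matching bias; neuron i outputs max_j (x . W[i][j] + b[i][j]).
    """
    out = []
    for W_i, b_i in zip(W, b):
        # Linear transformation z_ij = x^T W_ij + b_ij for every activation unit.
        z_i = [sum(x_v * w_v for x_v, w_v in zip(x, w_ij)) + b_ij
               for w_ij, b_ij in zip(W_i, b_i)]
        # Maxout keeps the largest activation unit of the neuron.
        out.append(max(z_i))
    return out


# Toy example: 2 inputs, N = 2 neurons, k = 2 activation units per neuron.
x = [1.0, -2.0]
W = [[[1.0, 0.0], [0.0, 1.0]],
     [[-1.0, 0.0], [0.5, 0.5]]]
b = [[0.0, 0.0], [0.0, 1.0]]
h = maxout_layer(x, W, b)  # one output value per neuron
```

Here the first neuron compares the unit values 1.0 and -2.0, and the second compares -1.0 and 0.5, so each neuron passes on the larger of its two units.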
In some embodiments, Dropout is added to every two connected hidden layers to regularize the output of the previous hidden layer, so that an over-fitting phenomenon possibly brought by Maxout neurons is avoided;
$$r \sim \mathrm{Bernoulli}(p), \qquad \tilde{h} = r \odot h$$
wherein p represents the ratio of disconnected neuron connections, the Bernoulli function randomly generates a vector of 0s and 1s, r is a vector obeying the Bernoulli distribution, h is the input vector, and $\tilde{h}$ is the output vector after random disconnection.
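A minimal sketch of this random disconnection is given below. It assumes that p is the drop ratio, so each mask entry is 1 with probability 1 - p, and it omits the rescaling of surviving values that some libraries apply during training; the function name `dropout` and the example vector are hypothetical.

```python
import random
from typing import List, Optional


def dropout(h: List[float], p: float,
            rng: Optional[random.Random] = None) -> List[float]:
    """Zero each entry of h independently with probability p (training mode)."""
    rng = rng or random.Random()
    # Assumption: p is the drop ratio, so a mask entry is 1 with probability 1 - p.
    r = [0 if rng.random() < p else 1 for _ in h]
    # Element-wise product of the mask r and the input h.
    return [r_i * h_i for r_i, h_i in zip(r, h)]


h = [0.7, 1.2, 0.1, 2.4]
h_tilde = dropout(h, p=0.25, rng=random.Random(0))  # seeded for repeatability
```

At prediction time the mask would normally be switched off, so the layer passes its input through unchanged.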
In some embodiments, the two nodes output by the output layer are normalized using a Softmax activation function. In some embodiments, the normalization is performed by:
$$\sigma(z)_j = \frac{e^{z_j}}{\sum_{m=1}^{K} e^{z_m}}, \qquad j = 1, \ldots, K$$
where σ denotes the activation function, z denotes the vector composed of $z_1, \ldots, z_K$, K represents the number of input nodes, $z_j$ represents the input of the j-th node, and e is the natural constant.
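The Softmax normalization of the two output nodes can be illustrated as follows. This is a sketch under assumptions: the function name `softmax` is illustrative, the maximum is subtracted before exponentiation purely for numerical stability (it does not change the result), and the order of the two nodes (normal first, cancer second) is chosen only for the example.

```python
import math
from typing import List


def softmax(z: List[float]) -> List[float]:
    """Normalize raw node values so they are positive and sum to 1."""
    m = max(z)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Two output nodes; the ordering (normal, cancer) is an assumption.
p_normal, p_cancer = softmax([0.3, 2.1])
```

Because the two probabilities sum to 1, reporting either one is enough to rank individuals by predicted risk.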
In another aspect, the present invention provides a method for constructing the cancer prediction system, including:
acquiring mutation information of cancer patients and genetic variation information of normal persons, and taking the mutation information of all the cancer patients and the genetic variation information of a part of the normal persons as a training set;
cutting each exon into windows of a predetermined length, and retaining only the windows in which at least two cancer patients carry a mutation;
converting the mutation information of the cancer patients and the normal persons into a window × sample binarized matrix; randomly extracting, from the normal persons in the training set, binarized matrices each containing as many individuals as there are cancer patients, adding each of them to the window matrix of the cancer patients and binarizing the sum for use in the training set, wherein the extracted binarized matrices do not overlap one another;
constructing the cancer prediction system comprising an input layer, a plurality of hidden layers and an output layer connected in sequence; and training the system with the training set, and selecting the system with the highest accuracy in the first convergence interval as the final cancer prediction system.
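The data preparation steps above (cutting exons into windows, keeping windows mutated in at least two patients, binarizing, and combining patient columns with randomly drawn normal columns) might be sketched as below. All names (`cut_exon`, `binarize`, `keep_recurrent`, `augment`), the coordinate convention (end-exclusive windows) and the toy data are assumptions for illustration; real input would come from MAF and VCF files.

```python
import random
from typing import Dict, List, Set, Tuple

Window = Tuple[str, int, int]                 # (chromosome, start, end), end exclusive
Mutations = Dict[str, Set[Tuple[str, int]]]   # sample id -> {(chromosome, position)}


def cut_exon(chrom: str, start: int, end: int, size: int = 100) -> List[Window]:
    """Cut one exon into consecutive windows; the last window may be shorter."""
    return [(chrom, s, min(s + size, end)) for s in range(start, end, size)]


def binarize(windows: List[Window], mutations: Mutations) -> Dict[str, List[int]]:
    """One 0/1 column per sample: 1 if the sample has any mutation in the window."""
    return {
        sample: [int(any(c == w[0] and w[1] <= pos < w[2] for c, pos in sites))
                 for w in windows]
        for sample, sites in mutations.items()
    }


def keep_recurrent(windows: List[Window], patient_cols: Dict[str, List[int]],
                   min_patients: int = 2) -> List[int]:
    """Indices of windows mutated in at least `min_patients` cancer patients."""
    return [i for i in range(len(windows))
            if sum(col[i] for col in patient_cols.values()) >= min_patients]


def augment(patient_cols: Dict[str, List[int]], normal_cols: Dict[str, List[int]],
            draws: int = 2, seed: int = 0) -> List[List[int]]:
    """Add each patient column to a distinct random normal column and re-binarize."""
    rng = random.Random(seed)
    patients = list(patient_cols.values())
    normals = list(normal_cols.values())
    # Sampling without replacement keeps the draws non-overlapping.
    picked = rng.sample(range(len(normals)), draws * len(patients))
    out = []
    for d in range(draws):
        for i, p in enumerate(patients):
            n = normals[picked[d * len(patients) + i]]
            out.append([int(a + b > 0) for a, b in zip(p, n)])
    return out


# Toy data: one 250 bp exon, two patients, five normal individuals.
windows = cut_exon("chr1", 1000, 1250)
patients = {"P1": {("chr1", 1010), ("chr1", 1210)}, "P2": {("chr1", 1090)}}
normals = {f"N{i}": {("chr1", 1000 + 60 * i)} for i in range(5)}
p_cols = binarize(windows, patients)
kept = keep_recurrent(windows, p_cols)  # only the first window is hit in both patients
aug = augment(p_cols, binarize(windows, normals))
```

Read this way, two draws over the 986 patients of the embodiment plus the 2,003 normal training samples give 2 × 986 + 2,003 = 3,975 columns, which agrees with the training set size stated in the detailed description. `random.Random.sample` raises an error if fewer normal individuals are available than the draws require.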
In some embodiments, the window is 50-300bp in length (e.g., 100bp, 150bp, 200bp, 250bp, etc.).
In some embodiments, the method further comprises the step of evaluating the cancer prediction system.
In some embodiments, the classification performance of the cancer prediction system is assessed by plotting a receiver operating characteristic (ROC) curve and a precision-recall (PR) curve.
In some embodiments, the assessment indicator is selected from one or more of test set accuracy, sensitivity, specificity, area under the ROC curve (AUC), and average precision (AP).
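For reference, the five indicators can be computed from true labels and predicted cancer probabilities as sketched below. The sketch assumes label 1 means cancer, a decision threshold of 0.5, both classes present, and no tied scores when computing average precision; the function names and the toy data are illustrative only.

```python
from typing import Dict, List


def confusion_metrics(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Accuracy, sensitivity and specificity from binary labels (1 = cancer)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {"accuracy": (tp + tn) / len(y_true),
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp)}


def roc_auc(y_true: List[int], scores: List[float]) -> float:
    """AUC as the chance that a random cancer sample outscores a random normal one."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true: List[int], scores: List[float]) -> float:
    """Mean of the precision values at each true positive, ranked by score."""
    ranked = sorted(zip(scores, y_true), reverse=True)
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)


y_true = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1]   # predicted cancer probability
y_pred = [int(s >= 0.5) for s in scores]
metrics = confusion_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
ap = average_precision(y_true, scores)
```

In this toy data two of three cancer samples and three of four normal samples are classified correctly at the 0.5 threshold, while AUC and AP summarize the ranking across all thresholds.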
In another aspect, the present invention further provides a cancer prediction apparatus based on non-cancer tissue mutation information, which includes an input module, a data processing module and an output module connected in sequence;
wherein the input module is used for inputting mutation information, and the mutation information is represented over windows of a preset length into which each exon is cut;
the data processing module comprises a plurality of hidden layers, each hidden layer comprises M fully-connected layers, each fully-connected layer comprises N nodes, M and N are positive integers greater than 1, and a Maxout activation function is embedded between the hidden layers, wherein each node in the fully-connected layers performs the following linear transformation:
$$z_{ij} = x^{T} W_{ij} + b_{ij}$$
wherein $z_{ij}$ represents the j-th activation unit of the i-th neuron, $x^{T}$ represents the transpose of the input, and $W_{ij}$ and $b_{ij}$ are parameters that the system needs to learn, representing respectively the weight matrix and the bias vector from the input layer to the activation unit $z_{ij}$; and the Maxout activation function applies a nonlinear transformation to the node information obtained by the linear transformation to obtain N nodes and outputs them to the next layer in the sequence, wherein the output of the Maxout activation function is the maximum value among the activation units:
$$h_i(x) = \max_{j \in [1, k]} z_{ij}$$
wherein k represents the number of activation units in the activation unit group of a Maxout neuron, and x represents the input of the activation function;
the output module receives the N nodes obtained by the last hidden layer of the data processing module and outputs two nodes, which respectively represent the predicted probability that the individual has cancer or is normal.
In some embodiments, in the data processing module, Dropout is added to every two connected hidden layers to regularize the output of the previous hidden layer, so that an overfitting phenomenon possibly brought by Maxout neurons is avoided;
$$r \sim \mathrm{Bernoulli}(p), \qquad \tilde{h} = r \odot h$$
wherein p represents the ratio of disconnected neuron connections, the Bernoulli function randomly generates a vector of 0s and 1s, r is a vector obeying the Bernoulli distribution, h is the input vector, and $\tilde{h}$ is the output vector after random disconnection.
In some embodiments, the two nodes output by the output module are normalized using a Softmax activation function.
In some embodiments, the normalization is performed by:
$$\sigma(z)_j = \frac{e^{z_j}}{\sum_{m=1}^{K} e^{z_m}}, \qquad j = 1, \ldots, K$$
where σ denotes the activation function, z denotes the vector composed of $z_1, \ldots, z_K$, K represents the number of input nodes, $z_j$ represents the input of the j-th node, and e is the natural constant.
Compared with the prior art, the invention has the following beneficial effects:
Compared with existing methods that predict or diagnose cancer by detecting DNA mutation information in blood samples (such as BRCA1/2 single-gene testing, combined protein and multi-gene tests, and multi-site panel tests), the cancer prediction system, through its optimized neural network model and data preprocessing method, achieves a sensitivity of 100% and a specificity of 95%, which are higher than those of other methods. For comparison: traditional BRCA1/2 testing rests on the finding that the risk for normal people carrying a BRCA1/2 mutation is 45% to 85%; CancerSEEK, a recently published detection method based on several proteins and gene mutations, has an accuracy of only 33% for breast cancer; and the support vector machine (SVM), a traditional machine learning method that predicts cancer from multi-site mutations, has an accuracy of 69%.
Drawings
FIG. 1: construction of the MiScan model;
FIG. 2: overview of the MiScan model;
FIG. 3: training curve of the model, wherein the training period between the two dotted lines represents the stable training interval;
FIG. 4: comparison of the accuracy, sensitivity and specificity of the MiScan model with other machine learning methods on the test set, wherein the single models are SVM (support vector machine), DT (decision tree) and KNN (K nearest neighbor), and the ensemble methods are RF (random forest) and GBDT (gradient boosting decision tree);
FIG. 5: comparison of the MiScan model with other machine learning methods on ROC and PR curves;
FIG. 6: distribution of the probability values predicted by MiScan and the other methods for the training set and test set samples;
FIG. 7: robustness test based on SNV sampling;
FIG. 8: robustness test based on sampling of raw sequencing data;
FIG. 9: heat map of gene importance evaluated with the MiScan model;
wherein, after all windows corresponding to a gene are removed, the model is retrained and retested, and the genes are ranked according to five indexes. The 12 genes labeled are the 10 most highly mutated driver genes in breast cancer patients plus 2 genes responsible for genetic susceptibility to breast cancer (BRCA1 and BRCA2).
Detailed Description
In order that the objects, technical solutions and advantages of the present invention will become more apparent, the present invention will be further described in detail with reference to the accompanying drawings in conjunction with the following specific embodiments.
The invention develops a multilayer neural network that is based on single-nucleotide variation (SNV) and activated by a Maxout function, and predicts cancer risk through SNV analysis of whole-exome sequencing (WES) data.
The invention uses mutation information obtained by whole-exome sequencing from several large cancer databases and from normal persons as the training set. The large cancer databases include TCGA, ICGC and the like, and the cancer types cover more than 30 cancers, such as breast cancer, lung cancer, ovarian cancer and glioma.
The test set of the present examples was derived from the ICGC breast cancer database and from data used in published studies (e.g., Rheinbay, Esther, et al., Recurrent and functional regulatory mutations in breast cancer, Nature, 2017; Gracia-Aznarez, Francisco Javier, et al., Whole exome sequencing suggests much of non-BRCA1/BRCA2 familial breast cancer is due to moderate and low penetrance susceptibility alleles, PLoS ONE, 2013).
The invention takes breast cancer as an example to evaluate whether the cancer prediction system can use genome-wide variation to predict cancer. The mutation status of breast cancer shows wide heterogeneity in unsupervised clustering based on mutated genes, which makes it a good example. Existing studies that used highly somatically mutated genes as features for cancer classification did not perform well, indicating that broad-spectrum mutation patterns cannot be used directly to classify or predict cancer. The invention uses a window strategy in the analysis, which effectively reduces the number of features. It collects a large number of independent WES patient data sets and compares the model with other machine learning methods to assess the accuracy and robustness of the deep learning model. The cancer prediction system is applicable not only to breast cancer but also to various other cancers, provided sufficient sample data are available. The invention names the system MiScan (Maxout induced SNV-based cancer prediction model).
As shown in FIG. 1, the MiScan model is constructed and evaluated as follows:
Construction of the MiScan model
1. Somatic mutation information (MAF) for 986 breast cancer patients was downloaded from TCGA, and genetic variation information (VCF) for 2,504 normal persons was downloaded from phase 3 of the 1000 Genomes Project. 80% of the normal data set (n = 2,003) was randomly selected as part of the training set, and the rest was used for the test set.
2. Each exon is cut into 100 bp windows; the last window of each exon may be shorter than 100 bp. The windows are then filtered: a window is retained if at least two patients carry a mutation in it (not necessarily the same mutation), and is discarded otherwise. In the end, 13,885 valid windows remain.
3. The mutation information of the patients and the normal persons is converted into a window × sample binarized matrix. Each row is a window feature and each column is a sample; if a sample has a mutation in a window, the corresponding value is 1, otherwise it is 0. To balance the numbers of patients and normal persons in the training set and to let the model discover complex mutation-related networks, two non-overlapping binarized matrices, each containing the information of 986 individuals, are randomly extracted from the normal persons in the training set, and each is added to the window matrix of the patients and binarized. The final training set is a binarized window matrix of 13,885 window features and 3,975 samples.
4. The MiScan model was constructed using the Keras API in Python with TensorFlow as the back end. As shown in FIG. 2, the model contains one input layer, seven hidden layers and one output layer. Each hidden layer consists of 32 fully-connected layers with 128 nodes each, and Maxout activation functions are embedded between the hidden layers to prevent vanishing gradients or overfitting in the deep neural network. The output layer uses softmax as its activation function. In addition, a Dropout (random deactivation) layer is inserted between each pair of adjacent layers, with the ratio of dropped connections set to 0.25 (p = 0.25), to prevent overfitting. The optimizer of the model is Adam, with the learning rate set to 0.001 (lr = 0.001); the loss function is the cross-entropy loss (loss = 'categorical_crossentropy'), and the evaluation metric is accuracy (metrics = 'accuracy').
It should be noted that the number of hidden layers, the number of fully-connected layers, and the number of nodes in each fully-connected layer in the model are determined by factors such as window characteristics and the number of samples, and can be selected appropriately according to actual situations.
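To give a sense of scale, the number of trainable parameters implied by these settings can be estimated as below. The estimate is not stated in the source and rests on one reading of the architecture: each hidden layer is taken as k = 32 parallel affine maps with 128 outputs each, followed by an element-wise maximum, and the input has 13,885 window features. The function names are hypothetical.

```python
def maxout_params(n_in: int, n_units: int, k: int) -> int:
    """Parameters of k affine maps from n_in inputs to n_units outputs."""
    return k * (n_in * n_units + n_units)  # weights plus biases


def miscan_params(n_features: int = 13885, n_hidden: int = 7,
                  n_units: int = 128, k: int = 32, n_out: int = 2) -> int:
    """Rough parameter count under the assumed reading of the architecture."""
    total = maxout_params(n_features, n_units, k)                 # H1
    total += (n_hidden - 1) * maxout_params(n_units, n_units, k)  # H2..H7
    total += n_units * n_out + n_out                              # softmax output layer
    return total


first_layer = maxout_params(13885, 128, 32)
total = miscan_params()
```

Under this reading nearly all parameters sit in the first hidden layer, which is one reason the choice of window length and the window filter matter for model size.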
The constructed cancer prediction system based on the non-cancer tissue mutation information sequentially comprises an input layer I, hidden layers H1-H7 and an output layer O from left to right. The specific construction steps are as follows:
Step 2: the input layer I is fed into hidden layer H1, wherein H1 comprises 32 fully-connected layers, each fully-connected layer comprises 128 nodes, and each node in the fully-connected layers performs the following linear transformation:
$$z_{ij} = x^{T} W_{ij} + b_{ij}$$
wherein $z_{ij}$ represents the j-th activation unit of the i-th neuron, $x^{T}$ represents the transpose of the input, and $W_{ij}$ and $b_{ij}$ are parameters of the model to be learned, representing respectively the weight matrix and the bias vector from the input layer to the activation unit $z_{ij}$; the obtained node information is then nonlinearly transformed by the following Maxout activation function, whose output is the maximum value among the activation units:
$$h_i(x) = \max_{j \in [1, k]} z_{ij}$$
where k represents the number of activation units in the activation unit group of a Maxout neuron and x represents the input to the activation function.
Step 3: Dropout is used to regularize the output of hidden layer H1, avoiding the overfitting that Maxout neurons may cause:
$$r \sim \mathrm{Bernoulli}(p), \qquad \tilde{h} = r \odot h$$
wherein p represents the ratio of disconnected neuron connections, the Bernoulli function randomly generates a vector of 0s and 1s, r is a vector obeying the Bernoulli distribution, h is the input vector, and $\tilde{h}$ is the output vector after random disconnection; $\tilde{h}$ is then input to hidden layer H2; H2 also contains 32 fully-connected layers, each containing 128 nodes; the obtained node information is nonlinearly transformed by the Maxout activation function to obtain hidden layer H2 of the model;
Step 5: hidden layer H7 is fed into an output layer O comprising two nodes, and normalized with the Softmax activation function to obtain the probability $P_n$ that the individual is normal and the probability $P_c$ that the individual has cancer:
$$\sigma(z)_j = \frac{e^{z_j}}{\sum_{m=1}^{K} e^{z_m}}, \qquad j = 1, \ldots, K$$
where σ denotes the activation function, z denotes the vector composed of $z_1, \ldots, z_K$, K represents the number of input nodes, $z_j$ represents the input of the j-th node, and e is the natural constant, an infinite non-repeating decimal with a value of approximately 2.71828; the activated output, a one-dimensional vector, forms layer 9 of the model.
Step 6: the 9 layers of the neural network are connected in sequence from left to right to obtain MiScan, the cancer prediction system based on non-cancer tissue mutation information.
5. After the labels of the data set are converted to one-hot encoding, model training begins. The batch size is 500 (batch_size = 500) and the number of training epochs is 200 (epochs = 200). During training, the model is saved after each epoch and its accuracy is recorded. The first run of 5 consecutive epochs in which every accuracy exceeds 0.99 and the change in accuracy between adjacent epochs is less than 0.02 is defined as the first converged stable segment. The model with the highest accuracy within this segment is selected as the final MiScan model.
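The selection rule for the stable segment can be sketched as a scan over the per-epoch accuracies. This is one possible reading of the rule, assuming that the 5 epochs are consecutive and that the "derivative" is the absolute difference between adjacent epochs; the function name and parameters are hypothetical.

```python
def select_final_epoch(acc, window=5, min_acc=0.99, max_step=0.02):
    """Return the 1-based epoch with the highest accuracy in the first
    stable segment, or None if no stable segment exists.

    A stable segment is the first run of `window` consecutive epochs whose
    accuracies all exceed `min_acc` and whose adjacent differences are all
    below `max_step`.
    """
    for start in range(len(acc) - window + 1):
        seg = acc[start:start + window]
        above = all(a > min_acc for a in seg)
        smooth = all(abs(seg[i + 1] - seg[i]) < max_step
                     for i in range(window - 1))
        if above and smooth:
            best = max(range(window), key=lambda i: seg[i])
            return start + best + 1
    return None
```

For example, with the illustrative accuracies `[0.90, 0.995, 0.96, 0.991, 0.992, 0.996, 0.993, 0.994, 0.999]`, the first stable segment spans epochs 4 to 8 and the function returns epoch 6.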
Evaluation of MiScan model
1. Whole-exome sequencing data from paracancerous tissue or blood samples of breast cancer patients were downloaded (data sources: the ICGC database and two studies: Rheinbay, Esther, et al., Recurrent and functional regulatory mutations in breast cancer, Nature, 2017; Gracia-Aznarez, Francisco Javier, et al., Whole exome sequencing suggests much of non-BRCA1/BRCA2 familial breast cancer is due to moderate and low penetrance susceptibility alleles, PLoS ONE, 2013), and the data were preprocessed to obtain VCF files containing mutation information.
2. The downloaded patient data, together with the data of the remaining 501 normal individuals (20%) from the 1000 Genomes Project data described above, are used as the test data set.
3. The VCF file is converted to a window × sample binarized matrix, in which 1 indicates that the sample has a variant in the window.
4. The test set is predicted with MiScan, and a Receiver Operating Characteristic (ROC) curve and a Precision-Recall (PR) curve are drawn to evaluate the classification performance of the MiScan model. For each method, the relevant indices, namely test set accuracy, sensitivity, specificity, area under the ROC curve (AUC) and Average Precision (AP), are calculated from the predicted values and the true labels.
5. The test-set data are downsampled in two ways to test the robustness of the MiScan model: directly extracting a given proportion of the mutation information from the VCF files, and directly extracting a given amount of data from the raw sequencing data.
6. To evaluate the influence of individual genes on model prediction, all windows corresponding to each gene are removed in turn, training and testing are performed again, and the optimal model is selected under the same conditions to determine whether model performance is affected.
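The conversion in step 3, from per-sample variant positions to a window × sample 0/1 matrix, can be sketched as follows. This is a minimal sketch that assumes the variants have already been parsed out of the VCF files into (chromosome, position) pairs; the function name, the input layout, and the half-open window convention are hypothetical.

```python
def window_matrix(sample_positions, windows):
    """Build a window x sample 0/1 matrix.

    sample_positions: {sample_id: [(chrom, pos), ...]} variants per sample.
    windows: [(chrom, start, end), ...] half-open intervals [start, end).
    Returns (sample_ids, matrix) with one row per window, one column per sample.
    """
    samples = sorted(sample_positions)
    matrix = []
    for chrom, start, end in windows:
        row = []
        for s in samples:
            # 1 if the sample has at least one variant inside this window
            hit = any(c == chrom and start <= p < end
                      for c, p in sample_positions[s])
            row.append(1 if hit else 0)
        matrix.append(row)
    return samples, matrix
```

The nested loops keep the sketch short; a real exome-scale data set would likely need an indexed lookup instead.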
The invention also provides a cancer prediction device based on the non-cancer tissue mutation information, which comprises an input module, a data processing module and an output module which are sequentially connected;
wherein the input module is used for inputting mutation information, and the mutation information is a window with a preset length into which the exon is cut;
the data processing module comprises a plurality of hidden layers, each hidden layer comprises M full-connection layers, each full-connection layer comprises N nodes, M and N are positive integers larger than 1, and a Maxout activating function is embedded between the hidden layers, wherein each node in the full-connection layers is subjected to the following linear transformation:
$z_{ij} = x^{T} W_{ij} + b_{ij}$
where $z_{ij}$ denotes the j-th activation unit of the i-th neuron, $x^{T}$ denotes the transpose of the input, and $W_{ij}$ and $b_{ij}$ are the parameters that the system needs to learn, namely the weight matrix and the bias vector from the input layer to activation unit $z_{ij}$; the Maxout activation function applies a nonlinear transformation to the node information obtained from the linear transformation to obtain N nodes, which are output to the next layer connected in sequence, the output of the Maxout activation function being the maximum over the selected activation units:
$h_i(x) = \max_{j \in [1,k]} z_{ij}$
where k denotes the number of activation units in the activation unit group of a Maxout neuron, and x denotes the input of the activation function;
the output module receives the N nodes obtained from the last hidden layer of the data processing module and outputs two nodes, which respectively represent the predicted probabilities that the individual has cancer or is normal.
In the data processing module, Dropout is added between every two connected hidden layers to regularize the output of the preceding hidden layer, to avoid the overfitting that Maxout neurons may introduce:
$r \sim \mathrm{Bernoulli}(p)$
$\tilde{h} = r \odot h$
where p denotes the proportion of neuron connections that are disconnected, the Bernoulli function randomly generates a vector of 0s and 1s, r is a vector following the Bernoulli distribution, h is the input vector, and $\tilde{h}$ is the output vector after random disconnection.
The two nodes output by the output module are normalized with the Softmax activation function.
Preferably, the normalization is performed as follows:
$\sigma(z)_j = \frac{e^{z_j}}{\sum_{l=1}^{K} e^{z_l}}, \quad j = 1, \ldots, K$
where σ denotes the activation function, z denotes the vector composed of $z_1, \ldots, z_K$, K denotes the number of input nodes, $z_j$ denotes the input of the j-th node, and e is the natural constant.
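For the two-node output described here (K = 2), the Softmax normalization reduces to a logistic function of the difference between the two node inputs. The short derivation below illustrates this; the labels $z_n$ and $z_c$ for the inputs of the "normal" and "cancer" nodes are introduced only for this illustration and do not appear in the original text.

```latex
P_c = \sigma(z)_c = \frac{e^{z_c}}{e^{z_n} + e^{z_c}}
    = \frac{1}{1 + e^{-(z_c - z_n)}},
\qquad
P_n = \sigma(z)_n = 1 - P_c
```

The two outputs therefore always sum to 1, and only the difference $z_c - z_n$ determines the predicted probability.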
The functional modules of the present invention may be hardware, for example circuits, including digital circuits, analog circuits, and the like. Physical implementations of the hardware structures include, but are not limited to, physical devices such as transistors and memristors. The computing module in the computing device may be any suitable hardware processor, such as a CPU, GPU, FPGA, DSP, ASIC, and the like. The memory unit may be any suitable magnetic or magneto-optical storage medium, such as RRAM, DRAM, SRAM, EDRAM, HBM, HMC, etc.
It will be clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to perform all or part of the above described functions.
Example 1
Sample information
Whole-exome sequencing data for breast cancer from published studies were downloaded from the web, comprising 152 peripheral blood samples and 360 normal paracancerous tissue samples (data sources: the ICGC database and two studies: Rheinbay, Esther, et al., Recurrent and functional regulatory mutations in breast cancer, Nature, 2017; Gracia-Aznarez, Francisco Javier, et al., Whole exome sequencing suggests much of non-BRCA1/BRCA2 familial breast cancer is due to moderate and low penetrance susceptibility alleles, PLoS ONE, 2013).
Second, the operation steps
1. Data pre-processing and prediction
(1) BWA was used to align the raw WES FASTQ sequencing data to the reference genome hg19.
(2) Duplicate marking and base quality score recalibration (BQSR) were performed using Picard.
(3) GATK was used to call SNVs and insertions/deletions (INDELs).
(4) The called SNVs were filtered. The specific filtering parameters were set as follows:
QD<2.0||FS>60.0||MQ<40.0||MQRankSum<-12.5||ReadPosRankSum<-8.0||SOR>4.0
where QD (QualByDepth) is the variant site confidence divided by the unfiltered depth of non-reference reads; FS (FisherStrand) is the strand bias of the variant as assessed by Fisher's exact test; MQ (RMSMappingQuality) is the root mean square of the mapping quality across all samples; MQRankSum assesses confidence by comparing the mapping qualities of the reads supporting the reference and the variant; ReadPosRankSum assesses the reliability of the variant from its position within the reads, since error rates are higher at the two ends of a read; and SOR (StrandOddsRatio) comprehensively assesses the likelihood of strand bias.
(5) The found INDELs are filtered. The specific filtering parameters are set as follows:
QD<2.0||FS>200.0||SOR>10.0||MQRankSum<-12.5||ReadPosRankSum<-8.0
(6) The filtered SNV and INDEL files were merged into the final VCF file.
(7) The VCF file was converted to a window × sample matrix.
(8) The patient data (n = 512) and the data of the remaining 1000 Genomes Project samples (n = 501) were merged into the complete test set.
(9) The trained MiScan model was loaded to predict the test data, giving a predicted probability value for each sample.
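The filter expressions in steps (4) and (5) join their conditions with `||`, so a variant is filtered out as soon as any one condition holds. The sketch below restates that logic in plain Python for illustration only; it is not a GATK invocation. The rule tables copy the thresholds from the text, while the function name, the dictionary input, and the choice to skip annotations that are absent from a record are assumptions.

```python
# Thresholds copied from the filter expressions in the text.
SNV_RULES = [("QD", "<", 2.0), ("FS", ">", 60.0), ("MQ", "<", 40.0),
             ("MQRankSum", "<", -12.5), ("ReadPosRankSum", "<", -8.0),
             ("SOR", ">", 4.0)]
INDEL_RULES = [("QD", "<", 2.0), ("FS", ">", 200.0), ("SOR", ">", 10.0),
               ("MQRankSum", "<", -12.5), ("ReadPosRankSum", "<", -8.0)]


def fails_filter(info, rules):
    """Return True if any rule fires (logical OR), i.e. the variant is removed.

    info: {annotation_name: value} for one variant record.
    Annotations missing from `info` are skipped (an assumption of this sketch).
    """
    for key, op, threshold in rules:
        if key not in info:
            continue
        value = info[key]
        if (op == "<" and value < threshold) or (op == ">" and value > threshold):
            return True
    return False
```

Because the FS and SOR thresholds are looser for INDELs, the same annotation values can fail the SNV rules while passing the INDEL rules.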
2. Performance testing
MiScan and other machine learning methods are tested and compared mainly from the aspects of accuracy, sensitivity, specificity, ROC curves, PR curves and the like.
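The indices named above can be computed directly from the true labels and the predictions. The following is a minimal sketch, assuming labels of 1 for patients and 0 for normal individuals and that both classes are present; the function names are hypothetical, and the AUC is computed through its rank interpretation (the fraction of patient/normal pairs in which the patient receives the higher score) rather than by tracing the ROC curve.

```python
def confusion_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity for labels 1 (patient) / 0 (normal)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # fraction of patients detected
        "specificity": tn / (tn + fp),  # fraction of normals recognized
    }


def roc_auc(y_true, scores):
    """AUC as the probability that a random patient outscores a random normal.

    Ties between a patient score and a normal score count as one half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for a test set of about a thousand samples but would be replaced by a sort-based computation for larger data.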
3. Robustness testing
(1) Testing based on SNV extraction: 10% to 90% of the data (in steps of 10%) is randomly extracted from the VCF files of the test set, and data processing and prediction are then performed. This random process was repeated 100 times.
(2) Testing based on different sequencing depths: 10% to 90% of the data (in steps of 10%) is randomly extracted from the raw sequencing files before variant calling, and data processing and prediction are then performed. This random process was repeated 10 times for each patient sample in the test set.
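The SNV-based downsampling in (1) can be sketched as repeated random subsampling at each proportion. This is an illustrative sketch under assumptions: the variants of one sample are given as a list, the function names and the fixed seed are hypothetical, and the downstream processing and prediction steps are left out.

```python
import random


def downsample_variants(variants, fraction, rng):
    """Keep a random `fraction` of the variants, preserving their order."""
    n_keep = round(len(variants) * fraction)
    kept = set(rng.sample(range(len(variants)), n_keep))
    return [v for i, v in enumerate(variants) if i in kept]


def downsampling_schedule(variants, repeats, seed=0):
    """Yield (percent, repeat_index, subset) for 10%..90% in steps of 10%."""
    rng = random.Random(seed)
    for pct in range(10, 100, 10):
        for rep in range(repeats):
            yield pct, rep, downsample_variants(variants, pct / 100, rng)
```

With `repeats=100` this produces 900 subsets per sample, matching the 9 proportions and 100 repetitions described in (1).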
4. Assessment of Gene importance
(1) All windows corresponding to a given gene are removed, and the same model is retrained and retested on the window-reduced data set.
(2) Step (1) is repeated until all genes have been processed.
(3) The test results after deletion of each gene are sorted in ascending order by each of the five indices: accuracy, sensitivity, specificity, AUC and AP.
(4) Ten driver genes that are mutated at high frequency in breast cancer patients and 2 genes that confer genetic susceptibility to breast cancer (BRCA1 and BRCA2) were selected, and their relative positions are marked in the figure.
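Steps (1) to (3) amount to a leave-one-gene-out loop followed by a sort. The sketch below shows that control flow only; `train_and_score` is a hypothetical callback standing in for retraining the model on the remaining windows and returning one index (for example accuracy), and the function and argument names are likewise assumptions.

```python
def rank_gene_ablation(genes, window_genes, train_and_score):
    """Rank genes by the score obtained after removing all of their windows.

    genes: gene names to ablate one at a time.
    window_genes: gene name for each window index.
    train_and_score: callable taking the list of kept window indices and
        returning a single score (hypothetical stand-in for retraining).
    Returns [(gene, score), ...] sorted in ascending order of score.
    """
    results = {}
    for gene in genes:
        kept = [i for i, g in enumerate(window_genes) if g != gene]
        results[gene] = train_and_score(kept)
    return sorted(results.items(), key=lambda item: item[1])
```

Genes at the top of the ascending list are those whose removal hurts the chosen index most; running the function once per index gives the five rankings described in step (3).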
Third, result summary
1. Model construction and training
As can be seen from fig. 3, the model converges gradually during training. After the number of epochs (training periods) reached 100, training tended to stabilize. The first run of 5 consecutive epochs in which the training accuracy exceeds 99% and the accuracy deviation is less than 0.02 is called the stable segment (the segment between the two dashed lines). Within this stable segment, the model with the highest training accuracy (epoch 113) was selected as the final MiScan model.
2. Model performance assessment
To evaluate the performance of the model, MiScan was compared with other popular machine learning methods, including three single models, Decision Tree (DT), K-nearest neighbors (KNN) and Support Vector Machine (SVM), and two ensemble methods, Random Forest (RF) and gradient boosting decision trees (GBDT). MiScan had the highest accuracy (97%), whereas the prediction accuracies of the other machine learning methods were 86% (SVM), 73% (DT), 63% (RF), 56% (GBDT) and 49% (KNN) (FIG. 4, left).
Furthermore, MiScan predicts patients and normal individuals with high sensitivity (100%) and high specificity (95%), respectively, outperforming the other methods, whereas some methods such as KNN and GBDT wrongly predict most normal individuals as patients. Only MiScan predicts well without any prediction bias (fig. 4, right).
Meanwhile, as can be seen from the ROC and PR curves, the MiScan model has the strongest classification performance, with AUC and AP values of 0.994 and 0.989 respectively, significantly better than the other methods (fig. 5).
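As a rough consistency check, the reported accuracy can be related to the sensitivity and specificity through the test-set composition given earlier (512 patient samples and 501 normal samples). The calculation below is only an approximation, since the reported percentages are rounded.

```latex
\text{accuracy}
  = \frac{\text{sensitivity} \cdot N_{\text{patient}} + \text{specificity} \cdot N_{\text{normal}}}
         {N_{\text{patient}} + N_{\text{normal}}}
  \approx \frac{1.00 \times 512 + 0.95 \times 501}{512 + 501}
  = \frac{987.95}{1013} \approx 0.975
```

This agrees with the reported MiScan accuracy of about 97%.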
Further, the predicted probability distribution of the MiScan model was analyzed using training and testing data sets (FIG. 6). It can be clearly seen from the figure that the MiScan model has a good fit to the training set, and the predicted probability values of the patients and the normal persons in the test set are respectively concentrated around 0 and 1, which shows that the MiScan model can clearly distinguish the patients from the normal persons. All the evaluation indices for the predicted results for all the test samples are shown in table 1.
The above results demonstrate that the MiScan method has the optimal classification performance and prediction ability compared to other machine learning methods, and there is no preference for prediction of patients and normal persons.
Table 1: Prediction performance statistics of MiScan and other methods
3. Robustness testing
The purpose of the robustness test is to evaluate the interference resistance of the method and its predictive performance on low-quality data, and thereby to determine whether the method could be applied to early cancer screening at an acceptable cost in the future. The invention first performs SNV downsampling on the VCF file of each sample in the test set. The inventors noticed that even when the number of mutations covered by the data is as low as 10% of the original data, MiScan still has very high discriminative power, with AUC and AP values above 0.99, whereas SVM and RF, the better-performing of the other methods, deteriorate steadily as the number of SNVs covered by the data decreases, which indicates that the MiScan model is highly stable (FIG. 7, top). MiScan also showed the most stable prediction accuracy across the various proportions of SNV input (FIG. 7, bottom).
Since sequencing depth is one of the key factors affecting SNV detection, the present invention also examined the performance of these methods at low sequencing depths. The raw sequencing data in the test set were downsampled so that low-quality data could be simulated more realistically. The inventors noted that even when the amount of sequencing data was as low as one million reads (1M), MiScan was able to identify the patients with high sensitivity. In contrast, the performance of the other machine learning methods declined dramatically with decreasing sequencing depth, confirming the robustness of MiScan relative to these models (fig. 8). Since the genomic variation information of peripheral blood is very easy to obtain, the high accuracy and robustness of MiScan make it well suited to early diagnosis of breast cancer.
4. Assessment of gene importance
Deep learning, although powerful, is inherently a black box, which makes it difficult for researchers to study the associations between internal features. In the Maxout model, it is important to evaluate the contribution of each gene to the model. To evaluate the contribution of the variation in each gene to disease identification, the invention designs an algorithm for evaluating gene weight, whose basic principle is as follows: for a specific gene, all mutations corresponding to that gene are removed, and model training and prediction are carried out again (the result is denoted a defect model). All defect models are then ranked according to different indices to reveal the effect of each gene.
Surprisingly, although certain breast cancer-related genes (e.g., PIK3CA, MAP3K1, BRCA1, etc.) may contribute more to prediction accuracy than ordinary genes, omitting any single gene does not significantly reduce the classification and prediction ability of MiScan (fig. 9). More interestingly, deletion of any single gene reduced the prediction accuracy from 99% to 91% at most, and the classification indices (AUC and AP) always remained above 0.99, indicating that mutations in a few driver genes may not increase the likelihood of cancer. In contrast, the accumulation and synergy of mutations in many hot spots is more critical. This may also reflect the heterogeneity and complexity of cancer development and progression.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention and are not intended to limit the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. A cancer prediction system based on non-cancerous tissue mutation information, said cancer prediction system comprising an input layer, a plurality of hidden layers and an output layer connected in sequence;
wherein the input layer is used for inputting mutation information, and the mutation information is a window with a preset length into which the exon is cut;
each hidden layer comprises M fully-connected layers, each fully-connected layer comprises N nodes, M and N are positive integers larger than 1, and a Maxout activation function is embedded between the hidden layers, wherein each node in each fully-connected layer performs the following linear transformation:
$z_{ij} = x^{T} W_{ij} + b_{ij}$
where $z_{ij}$ denotes the j-th activation unit of the i-th neuron, $x^{T}$ denotes the transpose of the input, and $W_{ij}$ and $b_{ij}$ are the parameters that the system needs to learn, namely the weight matrix and the bias vector from the input layer to activation unit $z_{ij}$; the Maxout activation function applies a nonlinear transformation to the node information obtained from the linear transformation to obtain N nodes, which are output to the next layer connected in sequence, the output of the Maxout activation function being the maximum over the selected activation units:
$h_i(x) = \max_{j \in [1,k]} z_{ij}$
where k denotes the number of activation units in the activation unit group of a Maxout neuron, and x denotes the input of the activation function;
the output layer receives the N nodes obtained from the last hidden layer and outputs two nodes, which respectively represent the predicted probabilities that the individual has cancer or is normal.
2. The cancer prediction system of claim 1, wherein Dropout is added between every two consecutive hidden layers to regularize the output of the preceding hidden layer, to avoid the overfitting that Maxout neurons may introduce:
$r \sim \mathrm{Bernoulli}(p)$
3. The cancer prediction system of claim 1, wherein the two nodes output by the output layer are normalized with a Softmax activation function, the normalization preferably being performed as follows:
$\sigma(z)_j = \frac{e^{z_j}}{\sum_{l=1}^{K} e^{z_l}}, \quad j = 1, \ldots, K$
where σ denotes the activation function, z denotes the vector composed of $z_1, \ldots, z_K$, K denotes the number of input nodes, $z_j$ denotes the input of the j-th node, and e is the natural constant.
4. A method of constructing a cancer prediction system according to any one of claims 1 to 3, comprising:
acquiring mutation information of cancer patients and genetic variation information of normal individuals, and taking the mutation information of all the cancer patients and the genetic variation information of a portion of the normal individuals as a training set;
cutting each exon into windows of a predetermined length, and retaining the windows in which mutations are present in at least two cancer patients;
converting the mutation information of the cancer patients and normal individuals into a window × sample binarized matrix, randomly extracting from the training set the same number of binarized matrices as there are cancer patients, adding them respectively to the window matrices of the cancer patients and binarizing the result for use as the training set, wherein the extracted binarized matrices do not overlap with one another;
constructing the cancer prediction system comprising an input layer, a plurality of hidden layers and an output layer connected in sequence;
training the system with the training set, and selecting the system with the highest accuracy within the first convergence interval as the final cancer prediction system.
5. The construction method according to claim 4, wherein the window has a length of 50-300bp (e.g., 100bp, 150bp, 200bp, 250bp, etc.).
6. The method of construction of claim 4, wherein the method further comprises the step of evaluating the cancer prediction system.
7. The construction method of claim 6, wherein the classification performance of the cancer prediction system is evaluated by plotting a Receiver Operating Characteristic (ROC) curve and a Precision-Recall (PR) curve, preferably the evaluation index is selected from one or more of test set accuracy (accuracy), sensitivity (sensitivity), specificity (specificity), area under ROC curve (AUC), and Average Precision (AP).
8. A cancer prediction device based on non-cancer tissue mutation information comprises an input module, a data processing module and an output module which are connected in sequence;
wherein the input module is used for inputting mutation information, and the mutation information is a window with a preset length into which the exon is cut;
the data processing module comprises a plurality of hidden layers, each hidden layer comprises M full-connection layers, each full-connection layer comprises N nodes, M and N are positive integers larger than 1, and a Maxout activating function is embedded between the hidden layers, wherein each node in the full-connection layers is subjected to the following linear transformation:
$z_{ij} = x^{T} W_{ij} + b_{ij}$
where $z_{ij}$ denotes the j-th activation unit of the i-th neuron, $x^{T}$ denotes the transpose of the input, and $W_{ij}$ and $b_{ij}$ are the parameters that the system needs to learn, namely the weight matrix and the bias vector from the input layer to activation unit $z_{ij}$; the Maxout activation function applies a nonlinear transformation to the node information obtained from the linear transformation to obtain N nodes, which are output to the next layer connected in sequence, the output of the Maxout activation function being the maximum over the selected activation units:
$h_i(x) = \max_{j \in [1,k]} z_{ij}$
where k denotes the number of activation units in the activation unit group of a Maxout neuron, and x denotes the input of the activation function;
the output module receives the N nodes obtained from the last hidden layer of the data processing module and outputs two nodes, which respectively represent the predicted probabilities that the individual has cancer or is normal.
9. The cancer prediction apparatus of claim 8, wherein Dropout is added between every two consecutive hidden layers to regularize the output of the preceding hidden layer, to avoid the overfitting that Maxout neurons may introduce:
$r \sim \mathrm{Bernoulli}(p)$
10. The cancer prediction apparatus of claim 8, wherein the two nodes output by the output module are normalized with a Softmax activation function, the normalization preferably being performed as follows:
$\sigma(z)_j = \frac{e^{z_j}}{\sum_{l=1}^{K} e^{z_l}}, \quad j = 1, \ldots, K$
where σ denotes the activation function, z denotes the vector composed of $z_1, \ldots, z_K$, K denotes the number of input nodes, $z_j$ denotes the input of the j-th node, and e is the natural constant.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910992441.7A CN112687329B (en) | 2019-10-17 | 2019-10-17 | Cancer prediction system based on non-cancer tissue mutation information and construction method thereof |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112687329A true CN112687329A (en) | 2021-04-20 |
CN112687329B CN112687329B (en) | 2024-05-17 |
Family
ID=75444895
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910992441.7A Active CN112687329B (en) | 2019-10-17 | 2019-10-17 | Cancer prediction system based on non-cancer tissue mutation information and construction method thereof |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112687329B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113223613A (en) * | 2021-05-14 | 2021-08-06 | 西安电子科技大学 | Cancer detection method based on multi-dimensional single nucleotide variation characteristics |
CN118116585A (en) * | 2024-04-30 | 2024-05-31 | 奥明星程(杭州)生物科技有限公司 | Method and device for judging benign and malignant cancers through DNN |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104573410A (en) * | 2015-01-20 | 2015-04-29 | 合肥工业大学 | Cancer chemosensitivity prediction technique based on molecular subnet and random forest classifier |
US20190189242A1 (en) * | 2017-12-18 | 2019-06-20 | Personal Genome Diagnostics Inc. | Machine learning system and method for somatic mutation discovery |
CN110111840A (en) * | 2019-05-14 | 2019-08-09 | 吉林大学 | A kind of somatic mutation detection method |
US20190266493A1 (en) * | 2017-10-16 | 2019-08-29 | Illumina, Inc. | Deep Learning-Based Techniques for Pre-Training Deep Convolutional Neural Networks |
CN110265084A (en) * | 2019-06-05 | 2019-09-20 | 复旦大学 | The method and relevant device of riboSnitch element are rich in or lacked in prediction cancer gene group |
Non-Patent Citations (2)
Title |
---|
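祁亮;沈洁;: "Construction of a recurrence model for hepatocellular carcinoma patients using gene mutation information from the TCGA database combined with the machine learning software RapidMiner", Chinese Journal of Liver Diseases (Electronic Edition), no. 03 * |
胡丽娟;潘钦石;许刚;陈坚;丁鸿燕;王瑜敏;: "Establishment of an artificial neural network model for predicting EGFR gene mutation in patients with non-small cell lung cancer and analysis of its associated factors", Chinese Journal of Health Laboratory Technology, no. 15 * |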
Also Published As
Publication number | Publication date |
---|---|
CN112687329B (en) | 2024-05-17 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||