CN112687329A - Cancer prediction system based on non-cancer tissue mutation information and construction method thereof

Publication number
CN112687329A
CN112687329A CN201910992441.7A CN201910992441A CN112687329A CN 112687329 A CN112687329 A CN 112687329A CN 201910992441 A CN201910992441 A CN 201910992441A CN 112687329 A CN112687329 A CN 112687329A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910992441.7A
Other languages
Chinese (zh)
Other versions
CN112687329B (en)
Inventor
瞿昆
俞乔尼
黎斌
刘年平
方靖文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Application filed by University of Science and Technology of China USTC
Priority to CN201910992441.7A
Publication of CN112687329A
Application granted
Publication of CN112687329B
Current legal status: Active

Abstract

A cancer prediction system based on non-cancer tissue mutation information and a construction method thereof. The cancer prediction system includes an input layer, a plurality of hidden layers and an output layer which are sequentially connected. The system uses whole-exome mutation information to predict cancer, markedly improves prediction accuracy, reduces the amount of mutation data required, and can be widely used in clinical examination.

Description

Cancer prediction system based on non-cancer tissue mutation information and construction method thereof
Technical Field
The invention belongs to the field of artificial intelligence, and particularly relates to a cancer prediction system based on non-cancer tissue mutation information and a construction method thereof.
Background
Although targeted cancer therapy and tumor immunotherapy have cured many patients or significantly improved overall survival for certain diseases, cancer detection and treatment remain serious problems. Timely and effective diagnosis of cancer, which enables early intervention and treatment, is one of the key factors in improving overall survival. At present, early cancer screening relies mainly on imaging, which involves large radiation doses and high cost and is therefore unsuitable for frequent use. Current cancer tests based on blood antigens, such as detection of prostate-specific antigen (PSA), the tumor-associated antigen CA-125 and carcinoembryonic antigen (CEA), target only one or a few cancer types and produce many false positives. Therefore, the field of clinical cancer detection urgently needs new blood-based detection methods to aid physicians in early diagnosis and screening.
On the other hand, owing to genetic and environmental factors, an individual's genome is subject to mutations, and some of these "driver" mutations, together with the accumulation of further mutations, may ultimately lead to cancer. Mutation detection in an individual's genome therefore holds promise for predicting the occurrence of cancer. For example, BRCA1 is regarded as a susceptibility gene for breast and ovarian cancer, in which mutations increase cancer risk, yet only about 3% to 8% of all women with breast cancer carry BRCA1 or BRCA2 mutations. Likewise, BRCA1 mutations are found in only about 18% of ovarian cancers. Meanwhile, with the rapid development of next-generation high-throughput sequencing technologies, large-scale collaborative projects such as the 1000 Genomes Project and The Cancer Genome Atlas (TCGA) have provided abundant genomic information on patients and normal individuals, making it possible to predict cancer from genomic variation. Although pan-cancer analyses have found several genes, such as TP53 and PIK3CA, that are highly associated with more than 10% of patients in most cancers, recent studies have found that mutations in these cancer driver genes are also prevalent in the blood and tissues of normal individuals, suggesting that cancer risk cannot be predicted from mutations in a single gene alone. Mutations at multiple key sites have therefore been used for cancer prediction. Existing methods mainly use a support vector machine (SVM) to predict breast cancer and multiple myeloma, but the prediction accuracy is only about 70%, and the studies used only a few hundred samples, which is not enough to demonstrate the reliability of the method.
Therefore, the field of cancer screening urgently needs more advanced and robust methods that can exploit the large number of mutations in an individual's genome for cancer risk prediction, together with a large test set to demonstrate their reliability.
In recent years, extensive deep learning research has greatly advanced artificial intelligence technology, which has been widely applied in precision medicine, including drug molecule design, medical image diagnosis, and prediction of disease driver genes/mutations. However, its application in cancer diagnosis is mainly based on the analysis of large numbers of clinical images, and deep learning has not yet been shown to play a major role in the early diagnosis of cancer.
Disclosure of Invention
In view of the above, the present invention utilizes a deep learning method to construct a deep neural network model to learn genome variation information of a large number of cancer patients and normal persons in an existing database, and can perform cancer risk prediction on a blood-derived sample, thereby establishing a system applicable to early cancer prediction.
In one aspect, the present invention provides a cancer prediction system based on non-cancerous tissue mutation information, the cancer prediction system comprising an input layer, a plurality of hidden layers, and an output layer, which are sequentially connected;
wherein the input layer is used for inputting mutation information, and the mutation information is expressed over windows of a preset length into which the exons are cut;
each hidden layer comprises M fully-connected layers, each fully-connected layer comprises N nodes, M and N are positive integers greater than 1, and a Maxout activation function is embedded between the hidden layers, wherein each node in each fully-connected layer performs the following linear transformation:
$z_{ij} = x^{T} W_{ij} + b_{ij}$
wherein $z_{ij}$ represents the j-th activation unit of the i-th neuron, $x^{T}$ represents the transpose of the input, and $W_{ij}$ and $b_{ij}$ are parameters that the system needs to learn, representing respectively the weight matrix and the bias vector from the input layer to the activation unit $z_{ij}$; and the Maxout activation function applies a nonlinear transformation to the node information obtained by the linear transformation to obtain N nodes, which are output to the next sequentially connected layer, wherein the output of the Maxout activation function is the maximum value among the activation units:
$h_{i}(x) = \max_{j \in [1, k]} z_{ij}$
wherein k represents the number of activation units in the activation unit group of a Maxout neuron, and x represents the input of the activation function;
the output layer receives the N nodes obtained by the last hidden layer and outputs two nodes, which respectively represent the predicted probabilities that the individual has cancer or is normal.
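A minimal NumPy sketch of one Maxout hidden layer as described above may help make the two formulas concrete. It assumes that the M fully-connected layers serve as the k activation units of each of the N neurons (k = M); the function name `maxout_layer`, the array layout and the toy sizes are illustrative assumptions and are not taken from the patent.

```python
import numpy as np

def maxout_layer(x, W, b):
    """Apply one Maxout layer.

    x: input vector, shape (d,)
    W: weights, shape (k, d, n): k linear pieces for each of n neurons
    b: biases, shape (k, n)
    Returns h with shape (n,), where h[i] = max_j (x^T W[j][:, i] + b[j][i]).
    """
    # z[j, i] is the j-th activation unit of neuron i (z_ij in the text).
    z = np.einsum("d,kdn->kn", x, W) + b
    # Maxout keeps the largest activation unit of each neuron.
    return z.max(axis=0)

rng = np.random.default_rng(0)
d, k, n = 6, 4, 3                      # toy sizes; the embodiment uses k = 32, n = 128
x = rng.integers(0, 2, size=d).astype(float)   # binary window features
W = rng.normal(size=(k, d, n))
b = rng.normal(size=(k, n))
h = maxout_layer(x, W, b)
```

Because each output is the maximum of k affine functions, the layer is piecewise linear in its input, which is why no separate nonlinearity such as ReLU is needed.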
In some embodiments, Dropout is added between every two connected hidden layers to regularize the output of the previous hidden layer, thereby avoiding the overfitting that Maxout neurons may cause:
$r \sim \mathrm{Bernoulli}(p)$
$\tilde{h} = r \odot h$
wherein p represents the ratio of disconnected neuron connections, the Bernoulli function randomly generates a vector of 0s and 1s, r is a vector obeying the Bernoulli distribution, h is the input vector, $\odot$ denotes element-wise multiplication, and $\tilde{h}$ is the output vector after random disconnection.
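The Dropout step can be sketched in NumPy as follows. Two points are assumptions of this sketch rather than statements from the patent: p is read as the fraction of dropped connections, so the mask is drawn with keep probability 1 - p, and the surviving activations are rescaled by 1 / (1 - p) ("inverted dropout", the convention used by common frameworks). The function name `dropout` is hypothetical.

```python
import numpy as np

def dropout(h, drop_ratio, rng, training=True):
    """Randomly zero entries of h with probability drop_ratio.

    Assumption: the mask r is drawn with keep probability 1 - drop_ratio,
    and survivors are rescaled so the expected activation is unchanged.
    At inference time (training=False) the input is returned untouched.
    """
    if not training or drop_ratio == 0.0:
        return h
    keep = 1.0 - drop_ratio
    r = rng.binomial(1, keep, size=h.shape)   # vector of 0s and 1s
    return r * h / keep                        # element-wise product, then rescale

rng = np.random.default_rng(42)
h = np.ones(10_000)
h_tilde = dropout(h, drop_ratio=0.25, rng=rng)
```

With a drop ratio of 0.25, roughly a quarter of the entries of `h_tilde` are zero on any given draw.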
In some embodiments, the two nodes output by the output layer are normalized using a Softmax activation function. In some embodiments, the normalization is performed as follows:
$\sigma(z)_{j} = \frac{e^{z_{j}}}{\sum_{k=1}^{K} e^{z_{k}}}, \quad j = 1, \ldots, K$
where σ denotes the activation function, z denotes the vector composed of $z_{1}, \ldots, z_{K}$, K represents the number of input nodes, $z_{j}$ represents the input of the j-th node, and e is the natural constant.
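As a small worked example of the Softmax normalization, the sketch below converts two raw output values into probabilities that sum to 1. The input values and the ordering of the two nodes (normal first, cancer second) are assumptions chosen for illustration.

```python
import math

def softmax(z):
    """Return softmax(z)_j = exp(z_j) / sum_k exp(z_k) for a list of floats."""
    m = max(z)                                # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Two output nodes; the (normal, cancer) ordering is an assumption of this sketch.
p_normal, p_cancer = softmax([0.3, 2.1])
```

Subtracting the maximum before exponentiating leaves the result unchanged mathematically but avoids overflow for large inputs.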
In another aspect, the present invention provides a method for constructing the cancer prediction system, including:
acquiring mutation information of cancer patients and germline (inherited) variation information of normal persons, and using the mutation information of all the cancer patients and the germline variation information of a part of the normal persons as a training set;
cutting each exon into windows of a predetermined length, and retaining only the windows in which at least two cancer patients have mutations;
converting the mutation information of the cancer patients and the normal persons into a window × sample binary matrix; randomly extracting from the normal persons in the training set binary matrices each containing the same number of samples as there are cancer patients, adding each to the window matrix of the cancer patients and binarizing the result for use in the training set, wherein the extracted binary matrices do not overlap with one another;
constructing the cancer prediction system comprising an input layer, a plurality of hidden layers and an output layer connected in sequence; and training the system with the training set, and selecting the system with the highest accuracy in the first convergence interval as the final cancer prediction system.
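The matrix-building and balancing step above can be sketched with NumPy. This is one reading of the text, not a definitive implementation: it assumes the final training matrix is the concatenation of the simulated patient samples (patient matrix plus a normal background, binarized) and all training normals. The toy sizes and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_windows, n_patients, n_normals = 8, 3, 7   # toy sizes for illustration only

# Window x sample binary matrices: 1 means the sample has a mutation in that window.
patients = rng.integers(0, 2, size=(n_windows, n_patients))
normals = rng.integers(0, 2, size=(n_windows, n_normals))

# Draw two non-overlapping groups of normals, each as large as the patient set.
order = rng.permutation(n_normals)
groups = [order[:n_patients], order[n_patients:2 * n_patients]]

# Add each normal background to the patient matrix and binarize (sum > 0 becomes 1).
simulated = [((patients + normals[:, g]) > 0).astype(int) for g in groups]

# Assumed layout of the final training data: simulated patients, then all normals.
X = np.concatenate(simulated + [normals], axis=1)
y = np.array([1] * (2 * n_patients) + [0] * n_normals)   # 1 = cancer, 0 = normal
```

Adding a normal individual's germline background to each patient's somatic profile gives the positive samples the same kind of background variation that the negative samples carry.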
In some embodiments, the window is 50-300bp in length (e.g., 100bp, 150bp, 200bp, 250bp, etc.).
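The window strategy can be illustrated with a short pure-Python sketch. It assumes half-open coordinates [start, end) and represents each patient as a list of mutated positions; the function names `cut_exon` and `keep_windows` and the toy coordinates are hypothetical.

```python
def cut_exon(start, end, size=100):
    """Split the half-open interval [start, end) into windows of `size` bp.

    The last window may be shorter than `size`.
    """
    return [(s, min(s + size, end)) for s in range(start, end, size)]

def keep_windows(windows, patient_positions, min_patients=2):
    """Keep windows in which at least `min_patients` patients carry any mutation."""
    kept = []
    for s, e in windows:
        n_hit = sum(
            1 for positions in patient_positions
            if any(s <= p < e for p in positions)
        )
        if n_hit >= min_patients:
            kept.append((s, e))
    return kept

windows = cut_exon(1000, 1250)             # a 250 bp exon gives 100 + 100 + 50 bp
patients = [[1005, 1210], [1090], [1240]]  # mutated positions of three toy patients
kept = keep_windows(windows, patients)
```

The mutations within a retained window need not be identical; any mutation inside the window counts towards the patient threshold.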
In some embodiments, the method further comprises the step of evaluating the cancer prediction system.
In some embodiments, the classification performance of the cancer prediction system is assessed by plotting a receiver operating characteristic (ROC) curve and a precision-recall (PR) curve.
In some embodiments, the assessment index is selected from one or more of test set accuracy, sensitivity, specificity, area under the ROC curve (AUC), and average precision (AP).
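For reference, accuracy, sensitivity and specificity follow directly from the confusion matrix. The sketch below uses made-up labels (1 = cancer, 0 = normal) purely to show the arithmetic; the function name `confusion_metrics` is hypothetical.

```python
def confusion_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity and specificity for 0/1 labels (1 = cancer)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),   # fraction of cancer samples detected
        "specificity": tn / (tn + fp),   # fraction of normal samples cleared
    }

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
m = confusion_metrics(y_true, y_pred)
```

Here 3 of 4 cancer samples and 5 of 6 normal samples are classified correctly, giving a sensitivity of 0.75 and a specificity of about 0.83.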
In another aspect, the present invention further provides a cancer prediction apparatus based on non-cancer tissue mutation information, which includes an input module, a data processing module and an output module connected in sequence;
wherein the input module is used for inputting mutation information, and the mutation information is expressed over windows of a preset length into which the exons are cut;
the data processing module comprises a plurality of hidden layers, each hidden layer comprises M fully-connected layers, each fully-connected layer comprises N nodes, M and N are positive integers greater than 1, and a Maxout activation function is embedded between the hidden layers, wherein each node in the fully-connected layers performs the following linear transformation:
$z_{ij} = x^{T} W_{ij} + b_{ij}$
wherein $z_{ij}$ represents the j-th activation unit of the i-th neuron, $x^{T}$ represents the transpose of the input, and $W_{ij}$ and $b_{ij}$ are parameters that the system needs to learn, representing respectively the weight matrix and the bias vector from the input layer to the activation unit $z_{ij}$; and the Maxout activation function applies a nonlinear transformation to the node information obtained by the linear transformation to obtain N nodes, which are output to the next sequentially connected layer, wherein the output of the Maxout activation function is the maximum value among the activation units:
$h_{i}(x) = \max_{j \in [1, k]} z_{ij}$
wherein k represents the number of activation units in the activation unit group of a Maxout neuron, and x represents the input of the activation function;
the output module receives the N nodes obtained by the last hidden layer of the data processing module and outputs two nodes, which respectively represent the predicted probabilities that the individual has cancer or is normal.
In some embodiments, in the data processing module, Dropout is added between every two connected hidden layers to regularize the output of the previous hidden layer, thereby avoiding the overfitting that Maxout neurons may cause:
$r \sim \mathrm{Bernoulli}(p)$
$\tilde{h} = r \odot h$
wherein p represents the ratio of disconnected neuron connections, the Bernoulli function randomly generates a vector of 0s and 1s, r is a vector obeying the Bernoulli distribution, h is the input vector, $\odot$ denotes element-wise multiplication, and $\tilde{h}$ is the output vector after random disconnection.
In some embodiments, the two nodes output by the output module are normalized using a Softmax activation function.
In some embodiments, the normalization is performed as follows:
$\sigma(z)_{j} = \frac{e^{z_{j}}}{\sum_{k=1}^{K} e^{z_{k}}}, \quad j = 1, \ldots, K$
where σ denotes the activation function, z denotes the vector composed of $z_{1}, \ldots, z_{K}$, K represents the number of input nodes, $z_{j}$ represents the input of the j-th node, and e is the natural constant.
Compared with the prior art, the invention has the following beneficial effects:
Compared with existing methods for predicting and diagnosing cancer by detecting DNA mutation information in blood samples (such as BRCA1/2 single-gene testing, combined protein and multi-gene testing, and multi-site panel testing), the cancer prediction system of the invention, through its optimized neural network model and data preprocessing method, achieves a sensitivity of 100% and a specificity of 95%, both higher than those of other methods. For comparison: conventional BRCA1/2 testing relies on the 45% to 85% cancer risk of individuals carrying BRCA1/2 mutations; CancerSEEK, a recently published detection method based on several proteins and gene mutations, has an accuracy of only 33% for breast cancer; and the support vector machine (SVM), a conventional machine learning method that predicts cancer from multi-site mutations, has an accuracy of 69%.
Drawings
FIG. 1: Construction of the MiScan model;
FIG. 2: Overview of the MiScan model;
FIG. 3: Training curve of the model;
wherein the training period between the two dotted lines represents the stable training interval.
FIG. 4: Comparison of the accuracy, sensitivity and specificity of the MiScan model with other machine learning methods on the test set;
wherein the single-model methods are: SVM (support vector machine); DT (decision tree); KNN (K-nearest neighbors);
and the ensemble methods are: RF (random forest); GBDT (gradient boosting decision tree);
FIG. 5: Comparison of the MiScan model with other machine learning methods on ROC and PR curves;
FIG. 6: Distribution of the probability values predicted by MiScan and other methods for the training set and test set samples;
FIG. 7: Robustness test based on SNV sampling;
FIG. 8: Robustness test based on sampling of raw sequencing data;
FIG. 9: Heat map of gene importance evaluated with the MiScan model;
wherein, after all windows corresponding to a given gene are removed, the model is retrained and retested, and the genes are ranked according to five indexes. The 12 labeled genes are the 10 most highly mutated driver genes in breast cancer patients plus 2 genes responsible for hereditary susceptibility to breast cancer (BRCA1 and BRCA2).
Detailed Description
In order that the objects, technical solutions and advantages of the present invention will become more apparent, the present invention will be further described in detail with reference to the accompanying drawings in conjunction with the following specific embodiments.
The invention develops a multilayer neural network based on single-nucleotide variations (SNVs) and activated by the Maxout function, which predicts cancer risk through SNV analysis of whole-exome sequencing (WES) data.
The invention uses mutation information obtained by whole-exome sequencing from several large cancer databases and from normal individuals as the training set. The large cancer databases include TCGA, ICGC and the like, and the cancer types include more than 30 cancers such as breast cancer, lung cancer, ovarian cancer and glioma.
The test set of the present examples was derived from the ICGC breast cancer database and other data used in published studies (e.g., Rheinbay, Esther, et al., Recurrent and functional regulatory mutations in breast cancer, Nature, 2017; Gracia-Aznarez, Francisco Javier, et al., Whole exome sequencing suggests much of non-BRCA1/BRCA2 familial breast cancer is due to moderate and low penetrance susceptibility alleles, PLoS One, 2013).
The invention takes breast cancer as an example to evaluate whether the cancer prediction system can use genome-wide variation to predict cancer. Breast cancer is a good example because its mutation status shows broad heterogeneity in unsupervised clustering based on mutated genes. Existing studies that used highly somatically mutated genes as features for cancer classification did not perform well, indicating that broad-spectrum mutation patterns cannot be used directly to classify or predict cancer. The invention uses a window strategy in the analysis, which effectively reduces the number of features. The invention collects a large number of independent WES patient datasets and compares the model with other machine learning methods to assess the accuracy and robustness of the deep learning model. The cancer prediction system of the present invention is applicable not only to the prediction of breast cancer but also to various other cancers, as long as sufficient sample data is available. The invention names the system MiScan (Maxout induced SNV-based cancer prediction model).
As shown in FIG. 1, the specific construction and evaluation process of the MiScan model is as follows:
construction of MiScan model
1. Somatic mutation information (MAF) of 986 breast cancer patients was downloaded from TCGA, and germline variation information (VCF) of 2,504 normal individuals was downloaded from phase 3 of the 1000 Genomes Project. 80% of the normal data set (n = 2,003) was randomly selected for the training set, and the rest was used for the test set.
2. Each exon is cut into windows 100 bp long; the last window at the end of each exon may be shorter than 100 bp. The windows are then filtered: a window is retained if at least two patients have a mutation in it (not necessarily the same mutation), and is otherwise discarded. In the end, 13,885 valid windows remain.
3. The mutation information of the patients and the normal individuals is converted into a window × sample binary matrix. Each row is a window feature and each column is a sample; if a sample has a mutation in a given window, the corresponding value is 1, and otherwise it is 0. To balance the numbers of patients and normal individuals in the training set and to let the model discover complex mutation-related networks, two non-overlapping binary matrices, each containing the information of 986 individuals, are randomly drawn from the normal individuals in the training set, and each is added to the patient window matrix and the sum is binarized. The final training set is a binarized window matrix consisting of 13,885 window features and 3,975 samples.
4. The MiScan model was constructed using the Keras API in Python with TensorFlow as the back end. As shown in FIG. 2, the model contains one input layer, seven hidden layers and one output layer. Each hidden layer consists of 32 fully-connected layers with 128 nodes each, and Maxout activation functions are embedded between the hidden layers to prevent vanishing gradients or overfitting of the deep neural network. The output layer uses softmax as its activation function. In addition, one Dropout (random deactivation) layer is inserted between each pair of adjacent layers, with the ratio of dropped connections set to 0.25 (p = 0.25), to prevent overfitting. The optimizer of the model is Adam (optimizer = Adam) with the learning rate set to 0.001 (lr = 0.001); the loss function is the cross-entropy loss (loss = 'categorical_crossentropy'), and the evaluation metric is accuracy (metrics = 'accuracy').
It should be noted that the number of hidden layers, the number of fully-connected layers, and the number of nodes in each fully-connected layer in the model are determined by factors such as window characteristics and the number of samples, and can be selected appropriately according to actual situations.
The constructed cancer prediction system based on the non-cancer tissue mutation information sequentially comprises an input layer I, hidden layers H1-H7 and an output layer O from left to right. The specific construction steps are as follows:
Step 1: the training data set is fed into an input layer I containing 13,885 nodes to obtain layer 1 of the model;
Step 2: the input layer I is connected to hidden layer H1, wherein H1 comprises 32 fully-connected layers, each fully-connected layer comprises 128 nodes, and each node in the fully-connected layers performs the following linear transformation:
$z_{ij} = x^{T} W_{ij} + b_{ij}$
wherein $z_{ij}$ represents the j-th activation unit of the i-th neuron, $x^{T}$ represents the transpose of the input, and $W_{ij}$ and $b_{ij}$ are the parameters of the model to be learned, representing respectively the weight matrix and the bias vector from the input layer to the activation unit $z_{ij}$; the resulting node information is then nonlinearly transformed by the following Maxout activation function, whose output is the maximum value among the activation units:
$h_{i}(x) = \max_{j \in [1, k]} z_{ij}$
where k represents the number of activation units in the activation unit group of a Maxout neuron, and x represents the input of the activation function.
Step 3: Dropout is used to regularize the output of hidden layer H1, thereby avoiding the overfitting that Maxout neurons may cause:
$r \sim \mathrm{Bernoulli}(p)$
$\tilde{h} = r \odot h$
wherein p represents the ratio of disconnected neuron connections, the Bernoulli function randomly generates a vector of 0s and 1s, r is a vector obeying the Bernoulli distribution, h is the input vector, $\odot$ denotes element-wise multiplication, and $\tilde{h}$ is the output vector after random disconnection; $\tilde{h}$ is then input to hidden layer H2; H2 also contains 32 fully-connected layers, each containing 128 nodes; the resulting node information is nonlinearly transformed by the Maxout activation function to obtain hidden layer H2 of the model;
Step 4: by analogy, hidden layers H3, H4, H5, H6 and H7 are established in turn in the same way as in step 3, forming layers 4, 5, 6, 7 and 8 of the model; these hidden layers have the same structure as H1 and H2;
Step 5: the output of hidden layer H7 is input to an output layer O comprising two nodes, and normalization is performed with the Softmax activation function to obtain the probability $P_{n}$ that the individual is normal and the probability $P_{c}$ that the individual develops cancer:
$\sigma(z)_{j} = \frac{e^{z_{j}}}{\sum_{k=1}^{K} e^{z_{k}}}, \quad j = 1, \ldots, K$
where σ denotes the activation function, z denotes the vector composed of $z_{1}, \ldots, z_{K}$, K represents the number of input nodes, $z_{j}$ represents the input of the j-th node, and e is the natural constant, a non-terminating, non-repeating decimal with a value of approximately 2.71828; this yields the activated one-dimensional output vector, which is layer 9 of the model.
Step 6: the 9 layers of the neural network are connected in sequence from left to right to obtain MiScan, the cancer prediction system based on non-cancer tissue mutation information.
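Steps 1 to 6 can be summarized as a forward pass through seven Maxout layers followed by a two-node Softmax. The NumPy sketch below uses random, untrained weights and scaled-down sizes, so it only illustrates the data flow and tensor shapes; it is not the Keras implementation used in the embodiment, and all function names and sizes are assumptions.

```python
import numpy as np

def maxout(x, W, b):
    # W: (k, d, n), b: (k, n); returns the per-neuron maximum over k units.
    return (np.einsum("d,kdn->kn", x, W) + b).max(axis=0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def build_params(rng, n_in, n_hidden_layers=7, k=32, n=128, n_out=2):
    """Create random (untrained) parameters for H1..H7 and the output layer O."""
    sizes = [n_in] + [n] * n_hidden_layers
    hidden = [
        (rng.normal(scale=0.05, size=(k, d, n)), np.zeros((k, n)))
        for d in sizes[:-1]
    ]
    out = (rng.normal(scale=0.05, size=(n, n_out)), np.zeros(n_out))
    return hidden, out

def predict(x, params):
    """Inference-time forward pass; Dropout is inactive outside training."""
    hidden, (W_out, b_out) = params
    h = x
    for W, b in hidden:
        h = maxout(h, W, b)
    return softmax(h @ W_out + b_out)

rng = np.random.default_rng(1)
# Scaled down from 13,885 inputs, k = 32 and n = 128 to keep the example light.
params = build_params(rng, n_in=50, k=4, n=16)
x = rng.integers(0, 2, size=50).astype(float)
probs = predict(x, params)
```

Since the weights are random, the two output probabilities carry no meaning here; the example only confirms that the input, seven hidden layers and output layer compose as described.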
5. After the labels of the data set are converted to one-hot encoding, model training begins. The batch size is 500 (batch_size = 500) and the number of training epochs is 200 (epochs = 200). During training, the model is saved after each epoch and its accuracy is recorded. The first run of 5 epochs in which every accuracy exceeds 0.99 and the accuracy derivative between adjacent epochs is less than 0.02 is defined as the first convergence stable segment. The model with the highest accuracy in this segment is selected as the final MiScan model.
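The model-selection rule can be sketched as a scan over the per-epoch accuracy history. This sketch interprets the "accuracy derivative of adjacent epochs" as the absolute difference between consecutive accuracies, which is an assumption; the function name `first_stable_segment` and the accuracy values are made up for illustration.

```python
def first_stable_segment(acc, length=5, min_acc=0.99, max_step=0.02):
    """Return (start, end) of the first run of `length` consecutive epochs whose
    accuracies all exceed `min_acc` and whose adjacent differences are all
    below `max_step`; return None if no such run exists."""
    for s in range(len(acc) - length + 1):
        seg = acc[s:s + length]
        high_enough = all(a > min_acc for a in seg)
        steady = all(abs(b - a) < max_step for a, b in zip(seg, seg[1:]))
        if high_enough and steady:
            return s, s + length
    return None

# Hypothetical per-epoch training accuracies (0-based epoch indices).
history = [0.90, 0.97, 0.991, 0.985, 0.992, 0.995, 0.993, 0.996, 0.994, 0.997]
start, end = first_stable_segment(history)
best_epoch = max(range(start, end), key=lambda i: history[i])
```

In this toy history the first stable segment covers epochs 4 to 8, and the checkpoint from epoch 7 (accuracy 0.996) would be kept as the final model.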
Evaluation of MiScan model
1. Whole-exome sequencing data from paracancerous tissue/blood samples of breast cancer patients were downloaded (data sources: the ICGC database and two studies: Rheinbay, Esther, et al., Recurrent and functional regulatory mutations in breast cancer, Nature, 2017; Gracia-Aznarez, Francisco Javier, et al., Whole exome sequencing suggests much of non-BRCA1/BRCA2 familial breast cancer is due to moderate and low penetrance susceptibility alleles, PLoS One, 2013), and the data were pre-processed to obtain VCF files containing mutation information.
2. The downloaded patient data, together with the data of the remaining 501 normal individuals (20%) from the 1000 Genomes Project, are used as the test data set.
3. The VCF files are converted to a window × sample binary matrix, in which 1 indicates that the sample has a variation in the window.
4. The test set is predicted using MiScan, and receiver operating characteristic (ROC) and precision-recall (PR) curves are plotted to evaluate the classification performance of the MiScan model. For each method, the relevant indexes, namely test set accuracy, sensitivity, specificity, area under the ROC curve (AUC) and average precision (AP), are calculated from the predicted values and the true labels.
5. The test set data are downsampled to test the robustness of the MiScan model in two ways: directly extracting a certain proportion of the mutation information from the VCF files, and directly extracting a certain amount of data from the raw sequencing data.
6. In order to evaluate the influence of the genes on the model prediction, training and testing are performed again after all windows corresponding to each gene are removed, and the optimal model is selected by using the same conditions to determine whether the performance of the model is influenced.
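The two curve-based indexes used in the evaluation, AUC and AP, can be computed from predicted probabilities without plotting. The pure-Python sketch below uses the pairwise-ranking definition of AUC and the mean-precision-at-each-positive definition of AP; the labels and scores are made up for illustration, and the function names are hypothetical.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative
    (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, scores):
    """Mean of the precision values measured at the rank of each true positive."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits

y_true = [1, 1, 0, 1, 0, 0]               # 1 = cancer, 0 = normal
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]   # predicted cancer probabilities
auc = roc_auc(y_true, scores)
ap = average_precision(y_true, scores)
```

For this toy input, 8 of the 9 positive-negative pairs are ordered correctly, so the AUC is 8/9, and the AP is (1 + 1 + 0.75) / 3.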
The invention also provides a cancer prediction device based on the non-cancer tissue mutation information, which comprises an input module, a data processing module and an output module which are sequentially connected;
wherein the input module is used for inputting mutation information, and the mutation information is expressed over windows of a preset length into which the exons are cut;
the data processing module comprises a plurality of hidden layers, each hidden layer comprises M fully-connected layers, each fully-connected layer comprises N nodes, M and N are positive integers greater than 1, and a Maxout activation function is embedded between the hidden layers, wherein each node in the fully-connected layers performs the following linear transformation:
$z_{ij} = x^{T} W_{ij} + b_{ij}$
wherein $z_{ij}$ represents the j-th activation unit of the i-th neuron, $x^{T}$ represents the transpose of the input, and $W_{ij}$ and $b_{ij}$ are parameters that the system needs to learn, representing respectively the weight matrix and the bias vector from the input layer to the activation unit $z_{ij}$; and the Maxout activation function applies a nonlinear transformation to the node information obtained by the linear transformation to obtain N nodes, which are output to the next sequentially connected layer, wherein the output of the Maxout activation function is the maximum value among the activation units:
$h_{i}(x) = \max_{j \in [1, k]} z_{ij}$
wherein k represents the number of activation units in the activation unit group of a Maxout neuron, and x represents the input of the activation function;
the output module receives the N nodes obtained by the last hidden layer of the data processing module and outputs two nodes, which respectively represent the predicted probabilities that the individual has cancer or is normal.
In the data processing module, Dropout is added between every two connected hidden layers to regularize the output of the previous hidden layer, thereby avoiding the overfitting that Maxout neurons may cause:
$r \sim \mathrm{Bernoulli}(p)$
$\tilde{h} = r \odot h$
wherein p represents the ratio of disconnected neuron connections, the Bernoulli function randomly generates a vector of 0s and 1s, r is a vector obeying the Bernoulli distribution, h is the input vector, $\odot$ denotes element-wise multiplication, and $\tilde{h}$ is the output vector after random disconnection.
The two nodes output by the output module are normalized using a Softmax activation function.
Preferably, the normalization is performed as follows:
$\sigma(z)_{j} = \frac{e^{z_{j}}}{\sum_{k=1}^{K} e^{z_{k}}}, \quad j = 1, \ldots, K$
where σ denotes the activation function, z denotes the vector composed of $z_{1}, \ldots, z_{K}$, K represents the number of input nodes, $z_{j}$ represents the input of the j-th node, and e is the natural constant.
The functional blocks in the present invention may be hardware, for example, the hardware may be a circuit, including a digital circuit, an analog circuit, and the like. Physical implementations of hardware structures include, but are not limited to, physical devices including, but not limited to, transistors, memristors, and the like. The computing module in the computing device may be any suitable hardware processor, such as a CPU, GPU, FPGA, DSP, ASIC, and the like. The memory unit may be any suitable magnetic or magneto-optical storage medium, such as RRAM, DRAM, SRAM, EDRAM, HBM, HMC, etc.
It will be clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to perform all or part of the above described functions.
Example 1
I. Sample information
Whole-exome data of breast cancer patients from published studies were downloaded from the web, including 152 peripheral blood samples and 360 normal paracancerous tissue samples (data sources: the ICGC database and two studies: Rheinbay, Esther, et al., Recurrent and functional regulatory mutations in breast cancer, Nature, 2017; Gracia-Aznarez, Francisco Javier, et al., Whole exome sequencing suggests much of non-BRCA1/BRCA2 familial breast cancer is due to moderate and low penetrance susceptibility alleles, PLoS One, 2013).
II. Operation steps
1. Data pre-processing and prediction
(1) BWA software was used to align the raw WES-seq FASTQ sequencing data to the reference genome hg19.
(2) Duplicate marking and base quality score recalibration (BQSR) were performed using Picard.
(3) GATK was used to call SNVs and insertions/deletions (INDELs).
(4) The called SNVs were filtered. The specific filtering parameters were set as follows:
QD<2.0||FS>60.0||MQ<40.0||MQRankSum<-12.5||ReadPosRankSum<-8.0||SOR>4.0
wherein QD (QualByDepth) is the variant confidence divided by the unfiltered depth of non-reference reads; FS (FisherStrand) is the strand bias of the variant estimated by Fisher's exact test; MQ (RMSMappingQuality) is the root mean square of the mapping quality across all samples; MQRankSum assesses confidence by comparing the mapping qualities of reads supporting the reference and the variant alleles; ReadPosRankSum assesses the reliability of the variant from its position within reads, since error rates are higher at read ends; and SOR (StrandOddsRatio) gives an overall assessment of the likelihood of strand bias.
(5) The found INDELs are filtered. The specific filtering parameters are set as follows:
QD<2.0||FS>200.0||SOR>10.0||MQRankSum<-12.5||ReadPosRankSum<-8.0
(6) The filtered SNV and INDEL files were merged into the final VCF file.
(7) The VCF file was converted to a window × sample matrix.
(8) The patient data (n = 512) and the data of the remaining samples from the 1000 Genomes Project (n = 501) were merged to form the complete test set.
(9) The trained MiScan model was loaded and used to predict the test data, giving a predicted probability value for each sample.
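Steps (4), (5) and (7) above can be sketched as follows. This is a minimal illustration, not the pipeline actually used: the function names, the dictionary-based annotation format, the 1-based inclusive coordinates and the 100 bp window length are assumptions made here, and the sketch assumes every annotation is present for every variant.

```python
def fails_snv_filter(a):
    """True if an SNV's annotations match the hard-filter expression of step (4)."""
    return (a["QD"] < 2.0 or a["FS"] > 60.0 or a["MQ"] < 40.0
            or a["MQRankSum"] < -12.5 or a["ReadPosRankSum"] < -8.0
            or a["SOR"] > 4.0)

def fails_indel_filter(a):
    """True if an INDEL's annotations match the hard-filter expression of step (5)."""
    return (a["QD"] < 2.0 or a["FS"] > 200.0 or a["SOR"] > 10.0
            or a["MQRankSum"] < -12.5 or a["ReadPosRankSum"] < -8.0)

def build_windows(exons, window_len=100):
    """Cut each exon (chrom, start, end) into consecutive windows of window_len bp."""
    windows = []
    for chrom, start, end in exons:
        for w_start in range(start, end + 1, window_len):
            windows.append((chrom, w_start, min(w_start + window_len - 1, end)))
    return windows

def window_sample_matrix(windows, variants_by_sample):
    """Binary matrix: entry is 1 if the sample has any variant inside the window."""
    samples = sorted(variants_by_sample)
    matrix = []
    for chrom, w_start, w_end in windows:
        row = [int(any(c == chrom and w_start <= pos <= w_end
                       for c, pos in variants_by_sample[s]))
               for s in samples]
        matrix.append(row)
    return samples, matrix

# Toy data: one 250 bp exon and two samples (all values are made up).
windows = build_windows([("chr1", 100, 349)])
samples, matrix = window_sample_matrix(
    windows, {"S1": [("chr1", 150), ("chr1", 310)], "S2": [("chr1", 250)]})
print(windows)  # [('chr1', 100, 199), ('chr1', 200, 299), ('chr1', 300, 349)]
print(matrix)   # [[1, 0], [0, 1], [1, 0]]
```

Variants for which a filter function returns True would be discarded before the matrix is built; rows correspond to windows and columns to samples.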
2. Performance testing
MiScan and other machine learning methods are tested and compared mainly from the aspects of accuracy, sensitivity, specificity, ROC curves, PR curves and the like.
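For reference, accuracy, sensitivity and specificity can be computed from predicted labels as in the short sketch below. The label convention (1 = patient, 0 = normal), the function name and the toy labels are assumptions for illustration; the sketch also assumes both classes occur in the true labels.

```python
def confusion_metrics(y_true, y_pred):
    """Return (accuracy, sensitivity, specificity) for binary labels (1 = patient)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn)   # fraction of patients correctly identified
    specificity = tn / (tn + fp)   # fraction of normal individuals correctly identified
    return accuracy, sensitivity, specificity

# Toy example: 3 patients, 5 normal individuals, one false positive.
print(confusion_metrics([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 1, 1, 0, 0, 0, 0]))
# (0.875, 1.0, 0.8)
```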
3. Robustness testing
(1) Testing based on SNV downsampling: from the VCF files obtained for the test set, 10% to 90% of the data (in 10% increments) were randomly extracted, and data processing and prediction were then performed. This random process was repeated 100 times.
(2) Testing based on different sequencing depths: 10% to 90% of the data (in 10% increments) were randomly extracted from the raw sequencing files before variant calling, and data processing and prediction were then performed. This random process was repeated 10 times for each patient sample in the test set.
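The random extraction used in both robustness tests can be sketched as follows. The function name `downsample`, the use of a seeded generator and the toy record list are assumptions made for this illustration; the records stand in for VCF lines in test (1) or sequencing reads in test (2).

```python
import random

def downsample(records, fraction, seed):
    """Randomly keep the given fraction of records, without replacement."""
    rng = random.Random(seed)           # seeded so each repetition is reproducible
    k = round(len(records) * fraction)
    return rng.sample(records, k)

records = list(range(1000))                  # placeholder for variants or reads
fractions = [i / 10 for i in range(1, 10)]   # 10%, 20%, ..., 90%

# One repetition per fraction; the experiment repeats this with different seeds.
subsets = {f: downsample(records, f, seed=0) for f in fractions}
print({f: len(s) for f, s in subsets.items()})
```

Each subset would then go through the same processing and prediction steps as the full data.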
4. Assessment of Gene importance
(1) All windows corresponding to a given gene were removed, and the same model was retrained and retested on the window-reduced training set.
(2) Step (1) was repeated until all genes had been processed.
(3) The test results after deletion of each gene were sorted in ascending order according to five indices: accuracy, sensitivity, specificity, AUC and AP.
(4) Ten driver genes that are mutated at high frequency in breast cancer patients and two genes that confer genetic susceptibility to breast cancer (BRCA1 and BRCA2) were selected, and their relative positions are indicated in the figure.
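The gene-deletion procedure in steps (1) to (3) can be sketched as the loop below. The callback `train_and_score` is hypothetical: it stands in for retraining the model on the retained windows and returning one evaluation index. The toy scorer in the example exists only to make the sketch runnable.

```python
def rank_genes_by_deletion(window_genes, genes, train_and_score):
    """Delete each gene's windows in turn, retrain, and rank genes by the score.

    window_genes[i] is the gene to which window i belongs;
    train_and_score(keep) retrains on the window indices in keep and returns a metric.
    """
    results = {}
    for gene in genes:
        keep = [i for i, g in enumerate(window_genes) if g != gene]
        results[gene] = train_and_score(keep)
    # Ascending order: genes whose deletion hurts the model most come first.
    return sorted(results.items(), key=lambda kv: kv[1])

# Toy example with a placeholder scorer (fraction of windows retained).
window_genes = ["A", "A", "B", "C", "C", "C"]
ranking = rank_genes_by_deletion(
    window_genes, ["A", "B", "C"],
    train_and_score=lambda keep: len(keep) / len(window_genes))
print([gene for gene, _ in ranking])  # ['C', 'A', 'B']
```

In practice the loop would be run once per evaluation index, giving one ranking each for accuracy, sensitivity, specificity, AUC and AP.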
Third, result summary
1. Model construction and training
As can be seen from fig. 3, the model converges gradually during training. After the epoch (training period) count reached 100, training tended to stabilize. The first segment in which the training accuracy exceeds 99% for 5 consecutive epochs with an accuracy deviation below 0.02 is called the stable segment (the segment between the two dashed lines). Within this stable segment, the model with the highest training accuracy (epoch 113) was selected as the final MiScan model.
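One possible reading of this selection rule is sketched below. It is an interpretation, not the inventors' code: the function name is hypothetical, the "accuracy deviation" is assumed to mean the difference between the highest and lowest accuracy in the run, and accuracies are assumed to be fractions between 0 and 1. Under that assumption the deviation check is already implied by the 99% threshold, so it is kept only to mirror the text.

```python
def select_epoch(acc, run=5, min_acc=0.99, max_dev=0.02):
    """Return the 1-based epoch with the highest accuracy in the first stable run.

    A run is `run` consecutive epochs whose accuracies all exceed min_acc and
    whose spread (max - min, an assumed definition) is below max_dev.
    """
    for start in range(len(acc) - run + 1):
        segment = acc[start:start + run]
        if min(segment) > min_acc and max(segment) - min(segment) < max_dev:
            best = max(range(start, start + run), key=lambda i: acc[i])
            return best + 1
    return None  # no stable run found

# Made-up training accuracies for eight epochs.
history = [0.5, 0.9, 0.992, 0.995, 0.993, 0.996, 0.994, 0.999]
print(select_epoch(history))  # 6
```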
2. Model performance assessment
To evaluate the performance of the model, MiScan was compared with other popular machine learning methods, including three single models, Decision Trees (DT), K-nearest neighbors (KNN) and Support Vector Machines (SVM), and two ensemble methods, Random Forest (RF) and gradient boosting decision trees (GBDT). MiScan has the highest accuracy, 97%, while the prediction accuracies of the other machine learning methods are 86% (SVM), 73% (DT), 63% (RF), 56% (GBDT) and 49% (KNN) (FIG. 4, left).
Furthermore, MiScan predicts patients and normal individuals with high sensitivity (100%) and high specificity (95%) respectively, outperforming the other methods, whereas some methods such as KNN and GBDT wrongly predict most normal individuals as patients. This shows that only MiScan predicts well without any prediction preference (FIG. 4, right).
Meanwhile, as can be seen from the ROC and PR curves, the MiScan model has the strongest classification performance, and the AUC and AP values thereof reach 0.994 and 0.989 respectively, which is significantly better than the rest of the other methods (fig. 5).
Further, the predicted probability distribution of the MiScan model was analyzed using training and testing data sets (FIG. 6). It can be clearly seen from the figure that the MiScan model has a good fit to the training set, and the predicted probability values of the patients and the normal persons in the test set are respectively concentrated around 0 and 1, which shows that the MiScan model can clearly distinguish the patients from the normal persons. All the evaluation indices for the predicted results for all the test samples are shown in table 1.
The above results demonstrate that the MiScan method has the optimal classification performance and prediction ability compared to other machine learning methods, and there is no preference for prediction of patients and normal persons.
Table 1: MiScan and other methods prediction Performance statistics
Figure BDA0002237276790000151
3. Robustness testing
The objective of the robustness test is to evaluate the method's resistance to interference and its predictive performance on low-quality data, and thereby to determine whether the method could be applied to early cancer screening at an affordable cost in the future. The invention first performs SNV downsampling on the VCF file of each sample in the test set. The inventors noticed that even when the number of mutations covered by the data is as low as 10% of the original data, MiScan still has very high resolution, with AUC and AP values above 0.99, whereas SVM and RF, which otherwise classify comparatively well, degrade steadily as the number of SNVs covered by the data decreases. This indicates that the MiScan model is highly stable (FIG. 7, top). MiScan also showed the best stability in prediction accuracy at various ratios of SNV input (FIG. 7, bottom).
Since sequencing depth is one of the key factors affecting SNV detection, the present invention also examined the performance of these methods at low sequencing depths. The raw sequencing data in the test set were downsampled so that low-quality data could be simulated more realistically. The inventors noted that even when the amount of sequencing data was as low as one million reads (1M), MiScan was able to identify these patients with high sensitivity. In contrast, the performance of the other machine learning methods declined dramatically with decreasing sequencing depth, confirming the robustness of MiScan relative to these models (FIG. 8). Since the genomic variation information of peripheral blood is very easy to obtain, the high accuracy and robustness of MiScan make it well suited to the early diagnosis of breast cancer.
4. Assessment of gene importance
Deep learning, while powerful, has an internal black-box nature that makes it difficult for researchers to study associations between internal features. In the Maxout model, it is important to evaluate the contribution of each gene to the model. To evaluate the contribution of each gene's variation to disease identification, the invention designs an algorithm for evaluating gene weight, whose basic principle is as follows: for a specific gene, all mutations corresponding to that gene are removed, and model training and prediction are carried out again (the result is denoted a defect model). All defect models are then ranked according to different indices to assess the effect of each gene.
Surprisingly, although the contribution of certain breast cancer-related genes (e.g., PIK3CA, MAP3K1, BRCA1, etc.) to prediction accuracy may be stronger than that of ordinary genes, ignoring any single gene does not significantly reduce the classification and prediction ability of MiScan (FIG. 9). More interestingly, deletion of any single gene reduced the prediction accuracy from 99% to 91% at most, and the classification indices (AUC and AP) remained above 0.99, indicating that mutations in a few driver genes alone may not increase the likelihood of cancer. Instead, the accumulation and synergy of mutations in many hot spots is more critical. This may also reflect the heterogeneity and complexity of cancer development and progression.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention and are not intended to limit the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A cancer prediction system based on non-cancerous tissue mutation information, said cancer prediction system comprising an input layer, a plurality of hidden layers and an output layer connected in sequence;
wherein the input layer is used for inputting mutation information, and the mutation information is a window with a preset length into which the exon is cut;
each hidden layer comprises M fully-connected layers, each fully-connected layer comprises N nodes, M and N are positive integers larger than 1, and a Maxout activation function is embedded between the hidden layers, wherein each node in each fully-connected layer performs the following linear transformation:
$$z_{ij} = x^{T} W_{ij} + b_{ij}$$
wherein z_ij represents the j-th activation unit of the i-th neuron, x^T represents the transpose of the input, and W_ij and b_ij are parameters that the system needs to learn, representing respectively the weight matrix and the bias vector from the input layer to the activation unit z_ij; and the Maxout activation function performs a nonlinear transformation on the node information obtained by the linear transformation to obtain N nodes, which are output to the next sequentially connected layer, wherein the output of the Maxout activation function is the maximum value among the activation units:
$$h_i(x) = \max_{j \in [1, k]} z_{ij}$$
wherein k represents the number of activation units in the Maxout neuron activation unit group, and x represents the input of the activation function;
the output layer receives the N nodes obtained by the last hidden layer and outputs two nodes which respectively represent the probability of predicting the cancer or the normal of the individual.
2. The cancer prediction system of claim 1, wherein Dropout is added to every two consecutive hidden layers to regularize the output of the previous hidden layer to avoid overfitting phenomena that may be caused by Maxout neurons;
$$r \sim \mathrm{Bernoulli}(p)$$
$$\tilde{h} = r \odot h$$
wherein p represents the proportion of disconnected neuron connections, the Bernoulli function randomly generates a vector of 0s and 1s, r is a vector obeying the Bernoulli distribution, h is the input vector, and $\tilde{h}$ is the output vector after random disconnection.
3. A cancer prediction system as claimed in claim 1, wherein the two nodes output by the output layer are normalized using a Softmax activation function, preferably by:
$$\sigma(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \quad j = 1, \ldots, K$$
where σ denotes the activation function, z denotes the vector composed of z_1, ..., z_K, K represents the number of input nodes, z_j represents the input of the j-th node, and e is the natural constant.
4. A method of constructing a cancer prediction system according to any one of claims 1 to 3, comprising:
acquiring mutation information of cancer patients and genetic variation information of normal persons, and taking the mutation information of all the cancer patients and the genetic variation information of a part of the normal persons as a training set;
cutting each exon into windows of predetermined length, and retaining only windows in which mutations are present in at least two cancer patients;
converting the mutation information of the cancer patients and the normal persons into a window × sample binarized matrix; randomly extracting from the training set as many binarized matrices as there are cancer patients, adding them respectively to the window matrices of the cancer patients and binarizing the results for use as the training set, wherein the extracted binarized matrices do not overlap one another;
constructing the cancer prediction system comprising an input layer, a plurality of hidden layers and an output layer connected in sequence;
training the system using the training set, and selecting the system with the highest accuracy in the first convergence interval as the final cancer prediction system.
5. The construction method according to claim 4, wherein the window has a length of 50-300bp (e.g., 100bp, 150bp, 200bp, 250bp, etc.).
6. The method of construction of claim 4, wherein the method further comprises the step of evaluating the cancer prediction system.
7. The construction method of claim 6, wherein the classification performance of the cancer prediction system is evaluated by plotting a Receiver Operating Characteristic (ROC) curve and a Precision-Recall (PR) curve, preferably the evaluation index is selected from one or more of test set accuracy (accuracy), sensitivity (sensitivity), specificity (specificity), area under ROC curve (AUC), and Average Precision (AP).
8. A cancer prediction device based on non-cancer tissue mutation information comprises an input module, a data processing module and an output module which are connected in sequence;
wherein the input module is used for inputting mutation information, and the mutation information is a window with a preset length into which the exon is cut;
the data processing module comprises a plurality of hidden layers, each hidden layer comprises M full-connection layers, each full-connection layer comprises N nodes, M and N are positive integers larger than 1, and a Maxout activating function is embedded between the hidden layers, wherein each node in the full-connection layers is subjected to the following linear transformation:
$$z_{ij} = x^{T} W_{ij} + b_{ij}$$
wherein z_ij represents the j-th activation unit of the i-th neuron, x^T represents the transpose of the input, and W_ij and b_ij are parameters that the system needs to learn, representing respectively the weight matrix and the bias vector from the input layer to the activation unit z_ij; and the Maxout activation function performs a nonlinear transformation on the node information obtained by the linear transformation to obtain N nodes, which are output to the next sequentially connected layer, wherein the output of the Maxout activation function is the maximum value among the activation units:
$$h_i(x) = \max_{j \in [1, k]} z_{ij}$$
wherein k represents the number of activation units in the Maxout neuron activation unit group, and x represents the input of the activation function;
the output module receives the N nodes obtained by the last hidden layer of the data processing module and outputs two nodes which respectively represent the probability of predicting the cancer or the normal of an individual.
9. The cancer prediction apparatus of claim 8, wherein Dropout is added to every two consecutive hidden layers to regularize the output of the previous hidden layer to avoid overfitting phenomena that may be caused by Maxout neurons;
$$r \sim \mathrm{Bernoulli}(p)$$
$$\tilde{h} = r \odot h$$
wherein p represents the proportion of disconnected neuron connections, the Bernoulli function randomly generates a vector of 0s and 1s, r is a vector obeying the Bernoulli distribution, h is the input vector, and $\tilde{h}$ is the output vector after random disconnection.
10. The cancer prediction apparatus of claim 8, wherein the two nodes output by the output module are normalized using a Softmax activation function, preferably by:
$$\sigma(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \quad j = 1, \ldots, K$$
where σ denotes the activation function, z denotes the vector composed of z_1, ..., z_K, K represents the number of input nodes, z_j represents the input of the j-th node, and e is the natural constant.
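For readers who want a concrete picture of the Maxout and Dropout operations recited in the claims above, the following pure-Python sketch shows one hidden layer applied to a single input vector. It is illustrative only: the function names, the nested-list parameter layout and the toy numbers are assumptions, and the mask follows the claims' wording that p is the proportion of connections that are disconnected.

```python
import random

def maxout_layer(x, W, b):
    """h[i] = max over j of (x . W[i][j] + b[i][j]).

    W[i][j] is the weight vector of activation unit j of neuron i (assumed layout);
    b[i][j] is the matching bias.
    """
    return [max(sum(xd * wd for xd, wd in zip(x, W[i][j])) + b[i][j]
                for j in range(len(W[i])))
            for i in range(len(W))]

def dropout(h, p, rng):
    """Zero each entry of h independently with probability p (0/1 mask r)."""
    r = [0 if rng.random() < p else 1 for _ in h]
    return [ri * hi for ri, hi in zip(r, h)]

# Toy layer: 2 inputs, N = 2 neurons, k = 2 activation units per neuron.
x = [1.0, 2.0]
W = [[[1.0, 0.0], [0.0, 1.0]],
     [[-1.0, 0.0], [0.0, -1.0]]]
b = [[0.0, 0.0], [0.5, 0.0]]

h = maxout_layer(x, W, b)
print(h)                                  # [2.0, -0.5]
print(dropout(h, 0.5, random.Random(0)))  # some entries randomly set to zero
```

The dropout step applies during training only; at prediction time the layer output would be passed on unchanged.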
CN201910992441.7A 2019-10-17 2019-10-17 Cancer prediction system based on non-cancer tissue mutation information and construction method thereof Active CN112687329B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910992441.7A CN112687329B (en) 2019-10-17 2019-10-17 Cancer prediction system based on non-cancer tissue mutation information and construction method thereof

Publications (2)

Publication Number Publication Date
CN112687329A true CN112687329A (en) 2021-04-20
CN112687329B CN112687329B (en) 2024-05-17

Family

ID=75444895

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910992441.7A Active CN112687329B (en) 2019-10-17 2019-10-17 Cancer prediction system based on non-cancer tissue mutation information and construction method thereof

Country Status (1)

Country Link
CN (1) CN112687329B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113223613A (en) * 2021-05-14 2021-08-06 西安电子科技大学 Cancer detection method based on multi-dimensional single nucleotide variation characteristics
CN118116585A (en) * 2024-04-30 2024-05-31 奥明星程(杭州)生物科技有限公司 Method and device for judging benign and malignant cancers through DNN

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104573410A (en) * 2015-01-20 2015-04-29 合肥工业大学 Cancer chemosensitivity prediction technique based on molecular subnet and random forest classifier
US20190189242A1 (en) * 2017-12-18 2019-06-20 Personal Genome Diagnostics Inc. Machine learning system and method for somatic mutation discovery
CN110111840A (en) * 2019-05-14 2019-08-09 吉林大学 A kind of somatic mutation detection method
US20190266493A1 (en) * 2017-10-16 2019-08-29 Illumina, Inc. Deep Learning-Based Techniques for Pre-Training Deep Convolutional Neural Networks
CN110265084A (en) * 2019-06-05 2019-09-20 复旦大学 The method and relevant device of riboSnitch element are rich in or lacked in prediction cancer gene group

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
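祁亮;沈洁: "Construction of a recurrence model for hepatocellular carcinoma patients using gene mutation information from the TCGA database combined with the machine learning software RapidMiner", Chinese Journal of Liver Diseases (Electronic Version), no. 03 *
胡丽娟;潘钦石;许刚;陈坚;丁鸿燕;王瑜敏: "Establishment of an artificial neural network model for predicting EGFR gene mutation in patients with non-small cell lung cancer and analysis of its associated factors", Chinese Journal of Health Laboratory Technology, no. 15 *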

Also Published As

Publication number Publication date
CN112687329B (en) 2024-05-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant