CN112687329B

CN112687329B - Cancer prediction system based on non-cancer tissue mutation information and construction method thereof

Info

Publication number: CN112687329B
Application number: CN201910992441.7A
Authority: CN
Inventors: 瞿昆; 俞乔尼; 黎斌; 刘年平; 方靖文
Original assignee: University of Science and Technology of China USTC
Current assignee: University of Science and Technology of China USTC
Priority date: 2019-10-17
Filing date: 2019-10-17
Publication date: 2024-05-17
Anticipated expiration: 2039-10-17
Also published as: CN112687329A

Abstract

A cancer prediction system based on non-cancer tissue mutation information and a construction method thereof, wherein the cancer prediction system comprises an input layer, a plurality of hidden layers and an output layer which are sequentially connected. The cancer prediction system uses whole-exome mutation information to predict cancer, markedly improves prediction accuracy, reduces the amount of mutation data required, and can be widely used in clinical examination.

Description

Cancer prediction system based on non-cancer tissue mutation information and construction method thereof

Technical Field

The invention belongs to the field of artificial intelligence, and particularly relates to a cancer prediction system based on non-cancer tissue mutation information and a construction method thereof.

Background

Although targeted cancer therapy and tumor immunotherapy have cured many patients or significantly improved overall survival in certain diseases, the detection and treatment of cancer remain serious problems. Timely and effective diagnosis of cancer, which enables early intervention and treatment of early-stage patients, is one of the key factors in improving overall survival. The conventional method for early cancer screening is mainly imaging, which is unsuitable for frequent use because of its high radiation dose, high cost and the like. Current blood antigen-based cancer assays, such as those for prostate-specific antigen (PSA), the tumor-associated antigen CA-125 and carcinoembryonic antigen (CEA), target only a single or a small number of cancer types and have high false-positive rates. Thus, clinical cancer detection urgently needs new blood-based methods to aid physicians in early diagnosis and screening.

On the other hand, mutations arise in an individual's genome through genetic and environmental factors, and the occurrence of certain "driver" mutations, coupled with the accumulation of mutations, may ultimately lead to cancer. Mutation detection in an individual's genome therefore holds promise for predicting cancer. For example, BRCA1 is regarded as a susceptibility gene for breast and ovarian cancer, and mutations in this gene increase the risk of developing cancer, but only about 3%-8% of all women with breast cancer carry BRCA1 or BRCA2 mutations. Likewise, BRCA1 mutations are seen in only about 18% of ovarian cancers. Meanwhile, with the rapid development of next-generation high-throughput sequencing, large collaborative projects such as the 1000 Genomes Project and The Cancer Genome Atlas (TCGA) have provided abundant genomic information on patients and normal persons, making cancer prediction from genomic variation possible. Although pan-cancer analyses have found genes such as TP53 and PIK3CA to be associated with more than 10% of patients in most cancers, recent studies have found that mutations in these cancer driver genes are also prevalent in the blood and tissues of normal individuals, suggesting that predicting cancer risk from mutations in a single gene is not feasible. Mutations at multiple key sites have therefore been used to predict cancer. Existing methods mainly use a support vector machine (SVM) to predict breast cancer and multiple myeloma, but the prediction accuracy is only about 70%, and the studies used only about a hundred samples, so the results are insufficient to prove the reliability of the method.
Therefore, the field of cancer screening urgently needs more advanced and powerful methods that can use a large number of mutations in an individual's genome to predict cancer risk, together with large test sets to demonstrate their reliability.

In recent years, extensive deep learning research has greatly advanced artificial intelligence technology, which is now widely used in precision medicine, including drug molecule design, medical image diagnosis, and prediction of disease driver genes and mutations. However, its application in cancer diagnosis is mainly based on the analysis of large numbers of clinical images, and no study has yet demonstrated that deep learning can play a major role in the early diagnosis of cancer.

Disclosure of Invention

In view of the above, the present invention utilizes a deep learning method to construct a deep neural network model to learn genomic variation information of a large number of cancer patients and normal persons in the existing database, and can predict cancer risk of a blood-derived sample, thereby establishing a system applicable to early cancer prediction.

In one aspect, the present invention provides a cancer prediction system based on non-cancerous tissue mutation information, the cancer prediction system comprising an input layer, a plurality of hidden layers, and an output layer connected in sequence;

the input layer is used for inputting mutation information, wherein the mutation information is expressed over windows of a preset length into which each exon is cut;

each hidden layer comprises M fully connected layers, each fully connected layer has N nodes, M and N are positive integers greater than 1, Maxout activation functions are embedded between the hidden layers, and each node in each fully connected layer performs the following linear transformation:

$$z_{ij} = x^{T} W_{ij} + b_{ij}$$

wherein $z_{ij}$ represents the j-th activation unit of the i-th neuron, $x^{T}$ represents the transpose of the input, and $W_{ij}$ and $b_{ij}$ are parameters that the system needs to learn, representing respectively the weight matrix and bias vector from the input layer to the activation unit $z_{ij}$; the Maxout activation function performs a nonlinear transformation on the node information obtained by the linear transformation to obtain N nodes and outputs them to the next layer connected in sequence, wherein the output of the Maxout activation function is the maximum value among the activation units:

$$h_i(x) = \max_{j \in [1, k]} z_{ij}$$

wherein k represents the number of activation units in the activation unit group of a Maxout neuron, and x represents the input of the activation function;

The output layer receives the N nodes obtained by the last hidden layer and outputs two nodes, which respectively represent the predicted probability that the individual has cancer and the predicted probability that the individual is normal.
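The Maxout computation described above can be illustrated with a minimal pure-Python sketch. The function name `maxout_layer`, the toy weights and the tiny dimensions are assumptions made for illustration and are not taken from the patent; under one reading of the text, the k activation units per neuron correspond to the M fully connected layers and the number of neurons to N.

```python
from typing import List


def maxout_layer(x: List[float],
                 W: List[List[List[float]]],
                 b: List[List[float]]) -> List[float]:
    """Apply one Maxout hidden layer (illustrative sketch).

    W[i][j] is the weight vector of activation unit j of neuron i and
    b[i][j] the matching bias, so z_ij = x^T W_ij + b_ij.
    The output of neuron i is the maximum of z_ij over j.
    """
    outputs = []
    for W_i, b_i in zip(W, b):
        z_i = [sum(xv * wv for xv, wv in zip(x, w_ij)) + b_ij
               for w_ij, b_ij in zip(W_i, b_i)]
        outputs.append(max(z_i))
    return outputs


# Toy example: 2 inputs, 2 neurons, k = 3 activation units per neuron.
x = [1.0, 0.0]
W = [
    [[1.0, 0.0], [-1.0, 0.0], [0.5, 0.5]],   # neuron 0
    [[0.0, 1.0], [2.0, 0.0], [-2.0, 0.0]],   # neuron 1
]
b = [[0.0, 0.0, 1.0], [0.0, -1.0, 0.0]]
h = maxout_layer(x, W, b)
print(h)
```

Because each neuron takes the maximum over several learned linear pieces, the activation itself is piecewise linear and learned, rather than fixed in advance.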

In some embodiments, Dropout is added between every two connected hidden layers to regularize the output of the preceding hidden layer, so as to avoid the overfitting that Maxout neurons may cause:

$$r \sim \mathrm{Bernoulli}(p)$$

$$\tilde{h} = r \odot h$$

where p represents the proportion of neurons disconnected, the Bernoulli function randomly generates a vector of 0s and 1s, r is a vector obeying the Bernoulli distribution, h is the input vector, and $\tilde{h}$ is the output vector after random disconnection.
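A minimal sketch of this random disconnection follows, using only the Python standard library. It assumes, as the text states, that p is the fraction of neurons disconnected, so each mask entry is 1 with probability 1 - p; the function name `dropout` and the seeded generator are illustrative choices, not part of the patent.

```python
import random
from typing import List


def dropout(h: List[float], p: float, rng: random.Random) -> List[float]:
    """Randomly disconnect entries of h (illustrative sketch).

    Assumption: p is the fraction of neurons disconnected, so each
    mask entry r_i is 0 with probability p and 1 otherwise.
    The output is the element-wise product of the mask r and h.
    """
    r = [0.0 if rng.random() < p else 1.0 for _ in h]
    return [r_i * h_i for r_i, h_i in zip(r, h)]


rng = random.Random(0)          # fixed seed so the run is repeatable
h = [1.0] * 10000
h_tilde = dropout(h, p=0.25, rng=rng)
kept_fraction = sum(h_tilde) / len(h)
print(round(kept_fraction, 2))  # close to 1 - p
```

Deep learning frameworks commonly also rescale the surviving activations during training; that detail is omitted here because the text does not describe it.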

In some embodiments, the two nodes output by the output layer are normalized using a Softmax activation function. In some embodiments, the normalization is performed by:

$$\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$$

where σ represents the activation function, z represents the vector consisting of $z_1, \ldots, z_K$, K represents the number of input nodes, $z_j$ represents the input of the j-th node, and e is the natural constant.
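As a small worked illustration of the Softmax normalization for a two-node output, the following sketch computes the two probabilities in pure Python. The function name `softmax` and the example inputs are assumptions for illustration; subtracting the maximum before exponentiating is a common numerical safeguard and does not change the result.

```python
import math
from typing import List


def softmax(z: List[float]) -> List[float]:
    """Return exp(z_j) / sum_k exp(z_k) for each entry of z.

    Subtracting max(z) first leaves the result unchanged but avoids
    overflow for large inputs.
    """
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Equal inputs give equal probabilities for the two output nodes.
p_normal, p_cancer = softmax([0.0, 0.0])

# Unequal inputs: the larger input receives the larger probability.
probs = softmax([2.0, -1.0])
print(p_normal, p_cancer, probs)
```

With only two output nodes, the Softmax output for one node equals the logistic function of the difference between the two inputs, so the two probabilities always sum to 1.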

In another aspect, the present invention provides a method for constructing the cancer prediction system, comprising:

Acquiring mutation information of cancer patients and genetic mutation information of normal persons, and taking the mutation information of all the cancer patients and the genetic mutation information of a part of the normal persons as a training set;

Splitting each exon into windows of predetermined length, retaining windows where mutations exist in at least two cancer patients;

Converting the mutation information of cancer patients and normal persons into a window × sample binary matrix; randomly extracting, from the normal persons in the training set, binary matrices each containing the same number of samples as there are cancer patients, the extracted matrices not overlapping one another; and adding each extracted matrix to the window matrix of the cancer patients and binarizing the result, to serve as the training set;

Constructing the cancer prediction system comprising an input layer, a plurality of hidden layers, and an output layer connected in sequence; and training the system with the training set and selecting the system with the highest accuracy within the first convergence interval as the final cancer prediction system.
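The binarization and the overlay of patient mutations on non-overlapping groups of normal samples can be sketched as follows. This is one possible reading of the step, written with the standard library only; the function names `binarize` and `augment_patients`, the toy sample names and the way patients are paired with drawn normal samples are all assumptions for illustration.

```python
import random
from typing import Dict, List, Set


def binarize(samples: Dict[str, Set[int]], n_windows: int) -> Dict[str, List[int]]:
    """Map each sample's set of mutated window indices to a 0/1 vector."""
    return {name: [1 if w in hit else 0 for w in range(n_windows)]
            for name, hit in samples.items()}


def augment_patients(patients: List[List[int]],
                     normals: List[List[int]],
                     n_copies: int,
                     rng: random.Random) -> List[List[int]]:
    """Overlay every patient vector on n_copies distinct normal vectors.

    The normal samples are drawn without replacement, so the groups
    used for the different copies do not overlap (assumed pairing).
    """
    n = len(patients)
    if n_copies * n > len(normals):
        raise ValueError("not enough normal samples for non-overlapping draws")
    drawn = rng.sample(range(len(normals)), n_copies * n)  # distinct indices
    augmented = []
    for c in range(n_copies):
        group = drawn[c * n:(c + 1) * n]
        for patient, idx in zip(patients, group):
            summed = [a + b for a, b in zip(patient, normals[idx])]
            augmented.append([1 if v > 0 else 0 for v in summed])  # re-binarize
    return augmented


# Toy data: 4 windows, 2 patients, 4 normal persons.
patients = list(binarize({"P1": {0, 2}, "P2": {1}}, 4).values())
normals = list(binarize({"N1": {3}, "N2": set(), "N3": {0}, "N4": {1, 3}}, 4).values())
augmented = augment_patients(patients, normals, n_copies=2, rng=random.Random(0))
print(augmented)
```

In the embodiment described later, two such groups of 986 normal samples are combined with the 986 patients, which together with the 2,003 normal training samples is consistent with the stated total of 3,975 training samples.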

In some embodiments, the window is 50-300bp (e.g., 100bp, 150bp, 200bp, 250bp, etc.) in length.
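The window splitting and the retention rule (keep a window only if at least two cancer patients carry a mutation in it) can be sketched as below. The half-open coordinate convention, the function names `split_exon` and `keep_recurrent`, and the toy coordinates are assumptions for illustration only.

```python
from typing import Dict, List, Set, Tuple

Window = Tuple[str, int, int]  # (chromosome, start, end), half-open interval


def split_exon(chrom: str, start: int, end: int, size: int = 100) -> List[Window]:
    """Cut [start, end) into consecutive windows of `size` bp.

    The last window of an exon may be shorter than `size`.
    """
    return [(chrom, s, min(s + size, end)) for s in range(start, end, size)]


def keep_recurrent(windows: List[Window],
                   patient_mutations: Dict[str, List[Tuple[str, int]]],
                   min_patients: int = 2) -> List[Window]:
    """Keep windows mutated in at least `min_patients` distinct patients.

    The mutations need not be identical; any position inside the
    window counts.
    """
    kept = []
    for chrom, s, e in windows:
        carriers: Set[str] = {
            pid for pid, muts in patient_mutations.items()
            if any(c == chrom and s <= pos < e for c, pos in muts)
        }
        if len(carriers) >= min_patients:
            kept.append((chrom, s, e))
    return kept


# Toy exon of 250 bp and three patients with hypothetical mutation positions.
windows = split_exon("chr1", 1000, 1250)
mutations = {
    "P1": [("chr1", 1010)],
    "P2": [("chr1", 1090), ("chr1", 1240)],
    "P3": [("chr1", 1150)],
}
kept = keep_recurrent(windows, mutations)
print(windows)
print(kept)
```

Only the first window is retained in this toy case, because it is the only one in which two different patients carry a mutation.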

In some embodiments, the method further comprises the step of evaluating the cancer prediction system.

In some embodiments, the classification performance of the cancer prediction system is assessed by plotting a receiver operating characteristic (ROC) curve and a precision-recall (PR) curve.

In some embodiments, the evaluation index is selected from one or more of test set accuracy, sensitivity, specificity, area under the ROC curve (AUC), and average precision (AP).
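As a minimal sketch of how several of these indices can be computed from true labels and predicted scores, the following pure-Python example assumes the label convention 1 = cancer and 0 = normal and a 0.5 decision threshold; the function names `confusion_metrics` and `roc_auc` and the toy data are illustrative assumptions, and both classes are assumed to be present.

```python
from typing import Dict, Sequence


def confusion_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, sensitivity and specificity (1 = cancer, 0 = normal)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),   # fraction of patients detected
        "specificity": tn / (tn + fp),   # fraction of normals recognized
    }


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive sample scores
    higher than a random negative sample (ties count one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1]     # predicted cancer probability
y_pred = [1 if s >= 0.5 else 0 for s in scores]
metrics = confusion_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

The pairwise definition of AUC used here is quadratic in the number of samples, which is acceptable for a test set of about a thousand samples but would normally be replaced by a rank-based computation for larger data.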

In still another aspect, the present invention further provides a cancer prediction apparatus based on non-cancerous tissue mutation information, including an input module, a data processing module, and an output module connected in sequence;

The input module is used for inputting mutation information, wherein the mutation information is expressed over windows of a preset length into which each exon is cut;

The data processing module comprises a plurality of hidden layers, each hidden layer comprises M fully connected layers, each fully connected layer has N nodes, M and N are positive integers greater than 1, Maxout activation functions are embedded between the hidden layers, and each node in the fully connected layers performs the following linear transformation:

$$z_{ij} = x^{T} W_{ij} + b_{ij}$$

The output module receives the N nodes obtained by the last hidden layer of the data processing module and outputs two nodes, which respectively represent the predicted probability that the individual has cancer and the predicted probability that the individual is normal.

In some embodiments, in the data processing module, Dropout is added between every two connected hidden layers to regularize the output of the preceding hidden layer, so as to avoid the overfitting that Maxout neurons may cause:

$$r \sim \mathrm{Bernoulli}(p)$$

$$\tilde{h} = r \odot h$$

In some embodiments, the two nodes output by the output module are normalized using a Softmax activation function.

In some embodiments, the normalization is performed by:

$$\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$$

where the symbols are as defined for the cancer prediction system above.

Compared with the prior art, the invention has the following beneficial effects:

Compared with existing methods that predict and diagnose cancer by detecting DNA mutation information in blood samples (such as BRCA1/2 single-gene testing, combined protein and multi-gene testing, and multi-site panel testing), the cancer prediction system of the invention, through an optimized neural network model and data preprocessing method, achieves a sensitivity of up to 100% and a specificity of up to 95%, higher than those of other methods. For comparison, traditional BRCA1/2 testing rests on the finding that a normal person carrying a BRCA1/2 mutation has a 45%-85% risk of breast cancer; CancerSEEK, a recent test based on several proteins and gene mutations, has an accuracy of only 33% on breast cancer; and a traditional machine learning method, the support vector machine (SVM), predicts cancer from multi-site mutations with an accuracy of 69%.

Drawings

Fig. 1: construction process of the MiScan model;

Fig. 2: overview of the MiScan model;

Fig. 3: training curves of the model;

wherein the training period between the two dashed lines represents the stable training interval.

Fig. 4: comparison of the MiScan model with other machine learning methods in accuracy, sensitivity and specificity on the test set;

wherein the single models are: SVM (support vector machine); DT (decision tree); KNN (K-nearest neighbor);

and the ensemble methods are: RF (random forest); GBDT (gradient boosting decision tree);

Fig. 5: comparison of the MiScan model with other machine learning methods on ROC and PR curves;

Fig. 6: distributions of the probability values predicted by MiScan and other methods for training and test set samples;

Fig. 7: robustness test based on SNV sampling;

Fig. 8: robustness test based on sampling of raw sequencing data;

Fig. 9: heat map of gene importance evaluated with the MiScan model;

After all windows corresponding to each gene are removed, retraining and testing are carried out, and the genes are ranked separately by each of five indices (accuracy, sensitivity, specificity, AUC and AP). The 12 genes marked are 10 driver genes that are frequently mutated in breast cancer patients plus 2 genes that confer genetic susceptibility to breast cancer (BRCA1 and BRCA2).

Detailed Description

The present invention will be further described in detail below with reference to specific embodiments and with reference to the accompanying drawings, in order to make the objects, technical solutions and advantages of the present invention more apparent.

The present invention develops a Maxout-activated multi-layer neural network based on single-nucleotide variants (SNVs) to predict cancer risk through SNV analysis of whole-exome sequencing (WES) data.

The invention trains on a training set of mutation information obtained by whole-exome sequencing from large cancer databases and from normal persons. The large cancer databases, such as TCGA and ICGC, cover more than 30 cancers, including breast cancer, lung cancer, ovarian cancer, glioma and the like.

The test set of the embodiments of the present invention is derived from the ICGC breast cancer database and other data used in published studies (e.g., Rheinbay, Esther, et al., Recurrent and functional regulatory mutations in breast cancer. Nature, 2017; Gracia-Aznarez, Francisco Javier, et al., Whole exome sequencing suggests much of non-BRCA1/BRCA2 familial breast cancer is due to moderate and low penetrance susceptibility alleles. PLoS ONE, 2013).

The present invention takes breast cancer as an example to evaluate whether the cancer prediction system can predict cancer from genome-wide variation. The mutation status of this cancer shows wide heterogeneity in unsupervised clustering based on mutated genes, which makes breast cancer a good example. Previous studies that used genes with high somatic mutation rates as features for cancer classification did not achieve good results, indicating that a broad mutation spectrum cannot be used directly to classify or predict cancer. The invention uses a window strategy in the analysis, which effectively reduces the number of features. The invention collects a large number of independent WES patient data sets and compares the model with other machine learning methods to assess the accuracy and robustness of the deep learning model. The cancer prediction system of the present invention is applicable not only to breast cancer but also to other cancers, provided that enough sample data are available. The present invention names this system MiScan (Maxout-inferred SNV-based cancer prediction model).

As shown in fig. 1, the concrete construction and evaluation process of MiScan model is as follows:

1. Construction of MiScan model

1. Somatic mutation information (file format: MAF) of 986 breast cancer patients was downloaded from TCGA, together with genetic mutation information (file format: VCF) of 2,504 normal persons from phase 3 of the 1000 Genomes Project. Of the normal-person data, 80% (n=2,003) were randomly selected as the training set and the remainder served as the test set.

2. Each exon is cut into windows of 100bp in length; the last window at the end of each exon may be shorter than 100bp. The windows are then filtered: if at least two patients have mutations in a given window (not necessarily the same mutation), the window is kept; otherwise it is filtered out. In the end, 13,885 valid windows are retained.

3. The mutation information of patients and normal persons is converted into a window × sample binary matrix. Each row is a window feature and each column is a sample; if a sample has a mutation in a window, the corresponding value is 1, otherwise it is 0. To balance the numbers of patients and normal persons in the training set, and to let the model discover complex mutation-related networks, two non-overlapping binary matrices, each containing 986 individuals, are randomly drawn from the normal persons in the training set, and each is added to the window matrix of the patients and then binarized. The final training set is a binarized window matrix of 13,885 window features and 3,975 samples.

4. The MiScan model was constructed using the Keras API in Python with TensorFlow as the back end. As shown in fig. 2, the model comprises one input layer, seven hidden layers and one output layer. Each hidden layer consists of 32 fully connected layers with 128 nodes each, and Maxout activation functions are embedded between the hidden layers to prevent vanishing gradients or overfitting of the deep neural network. The output layer uses softmax as its activation function. In addition, one Dropout (random inactivation) layer is inserted between each pair of adjacent layers, with the dropout rate set to 0.25 (p=0.25) to prevent overfitting. Adam is the optimizer of the model (optimizer=adam), with the learning rate set to 0.001 (lr=0.001); the loss function is the cross-entropy loss (loss='categorical_crossentropy'), and the evaluation metric is accuracy (metrics='accuracy').

The number of hidden layers, the number of fully connected layers and the number of nodes in each fully connected layer in the model are determined by factors such as window characteristics and sample number, and can be properly selected according to practical situations.

The constructed cancer prediction system based on the non-cancer tissue mutation information sequentially comprises an input layer I, hidden layers H1-H7 and an output layer O from left to right. The specific construction steps are as follows:

Step 1: embedding the training data set into an input layer I containing 13,885 nodes to obtain model layer 1;

Step 2: inputting the input layer I into a hidden layer H1, wherein H1 comprises 32 fully connected layers, each fully connected layer comprises 128 nodes, and each node in the fully connected layers performs the following linear transformation:

$$z_{ij} = x^{T} W_{ij} + b_{ij}$$

wherein $z_{ij}$ represents the j-th activation unit of the i-th neuron, $x^{T}$ represents the transpose of the input, and $W_{ij}$ and $b_{ij}$ are parameters that the model needs to learn, representing respectively the weight matrix and bias vector from the input layer to the activation unit $z_{ij}$; the node information obtained is then nonlinearly transformed by the following Maxout activation function, whose output is the maximum value among the activation units:

$$h_i(x) = \max_{j \in [1, k]} z_{ij}$$

where k represents the number of activation units in the activation unit group of a Maxout neuron and x represents the input to the activation function.

Step 3: regularizing the output of the hidden layer H1 using Dropout, so as to avoid the overfitting that Maxout neurons may cause:

$$r \sim \mathrm{Bernoulli}(p)$$

$$\tilde{h} = r \odot h$$

where p represents the proportion of neurons disconnected, the Bernoulli function randomly generates a vector of 0s and 1s, r is a vector obeying the Bernoulli distribution, h is the input vector, and $\tilde{h}$ is the output vector after random disconnection; $\tilde{h}$ is then input to hidden layer H2; H2 likewise contains 32 fully connected layers, each containing 128 nodes; the node information obtained is nonlinearly transformed by the Maxout activation function to obtain model hidden layer H2;

Step 4: by analogy, hidden layers H3, H4, H5, H6 and H7 are established in turn in the same way as in step 3, each with the same structure as H1 and H2, to obtain model layers 4, 5, 6, 7 and 8;

Step 5: inputting the hidden layer H7 into an output layer O comprising two nodes, and normalizing with the Softmax activation function to obtain the normal probability $P_n$ and the cancer probability $P_c$ of the individual:

$$\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$$

where σ represents the activation function, z represents the vector consisting of $z_1, \ldots, z_K$, K represents the number of input nodes, $z_j$ represents the input of the j-th node, and e is the natural constant, an irrational number approximately equal to 2.71828; the output, an activated one-dimensional vector, forms layer 9 of the model.

Step 6: connecting the 9 layers of the neural network in sequence from left to right to obtain the cancer prediction system MiScan based on non-cancer tissue mutation information.

5. After the labels of the data set are converted into one-hot codes, training of the model begins. The batch size is 500 (batch_size=500) and the number of training epochs is 200 (epochs=200). The training process saves the model at each epoch and records its accuracy. The first run of 5 consecutive epochs in which the accuracy is above 0.99 and the change in accuracy between adjacent epochs is less than 0.02 is defined as the first stable convergence interval. The model with the highest accuracy in this interval is selected as the final MiScan model.
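The selection rule in item 5 can be sketched as a scan over the recorded per-epoch accuracies. This is one possible reading, in which the stable interval is exactly the first window of 5 consecutive epochs meeting both criteria; the function names `first_stable_interval` and `select_epoch`, the 0-based epoch indices and the toy accuracy values are assumptions for illustration.

```python
from typing import List, Optional, Tuple


def first_stable_interval(acc: List[float], run: int = 5,
                          min_acc: float = 0.99,
                          max_delta: float = 0.02) -> Optional[Tuple[int, int]]:
    """Return (start, end), 0-based and inclusive, of the first window of
    `run` consecutive epochs whose accuracies all exceed `min_acc` and
    whose adjacent differences are all below `max_delta`."""
    for start in range(len(acc) - run + 1):
        window = acc[start:start + run]
        high = all(a > min_acc for a in window)
        flat = all(abs(b - a) < max_delta for a, b in zip(window, window[1:]))
        if high and flat:
            return start, start + run - 1
    return None


def select_epoch(acc: List[float]) -> Optional[int]:
    """Pick the epoch with the highest accuracy inside the stable interval."""
    interval = first_stable_interval(acc)
    if interval is None:
        return None
    start, end = interval
    return max(range(start, end + 1), key=lambda i: acc[i])


# Hypothetical accuracy history, one value per epoch.
history = [0.80, 0.95, 0.991, 0.985, 0.992, 0.993, 0.996, 0.994, 0.995, 0.999]
interval = first_stable_interval(history)
best = select_epoch(history)
print(interval, best)
```

Note that the later epoch with accuracy 0.999 is not chosen in this toy history, because it lies outside the first stable interval; restricting the choice to that interval is what distinguishes the rule from simply taking the best epoch overall.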

2. Evaluation of MiScan model

1. Whole-exome sequencing data from paracancerous tissue/blood samples of breast cancer patients were downloaded (data sources: the ICGC database and two studies: Rheinbay, Esther, et al., Recurrent and functional regulatory mutations in breast cancer. Nature, 2017; Gracia-Aznarez, Francisco Javier, et al., Whole exome sequencing suggests much of non-BRCA1/BRCA2 familial breast cancer is due to moderate and low penetrance susceptibility alleles. PLoS ONE, 2013) and preprocessed to obtain VCF files containing mutation information.

2. The patient data downloaded above, together with the data of the remaining 501 normal persons (20%) from the 1000 Genomes Project, were used as the test data set.

3. The VCF files are converted into a window × sample binary matrix, in which 1 indicates that the sample has a variant in the window.

4. The classification performance of the MiScan model was evaluated by predicting the above test set with MiScan and plotting a receiver operating characteristic (ROC) curve and a precision-recall (PR) curve. For each method, the relevant indices, namely test set accuracy, sensitivity, specificity, area under the ROC curve (AUC) and average precision (AP), are calculated from the predicted values and the true labels.

5. The test set data are downsampled in two ways to test the robustness of the MiScan model: by directly extracting a proportion of the mutation information from the VCF files, and by directly extracting a certain amount of data from the raw sequencing data.

6. In order to evaluate the influence of genes on model prediction, after all windows corresponding to each gene are removed, training and testing are conducted again, and the optimal model is selected by using the same conditions so as to determine whether the performance of the model is influenced.

The invention also provides a cancer prediction device based on non-cancer tissue mutation information, which comprises an input module, a data processing module and an output module connected in sequence, wherein each node in the fully connected layers of the data processing module performs the following linear transformation:

$$z_{ij} = x^{T} W_{ij} + b_{ij}$$

In the data processing module, Dropout is added between every two connected hidden layers to regularize the output of the preceding hidden layer, so as to avoid the overfitting that Maxout neurons may cause:

$$r \sim \mathrm{Bernoulli}(p)$$

$$\tilde{h} = r \odot h$$

The two nodes output by the output module are normalized using a Softmax activation function.

Preferably, the normalization is performed by:

$$\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$$

The functional modules in the present invention may be hardware; for example, the hardware may be a circuit, including a digital circuit, an analog circuit, and the like. Physical implementations of the hardware structures include, but are not limited to, physical devices such as transistors and memristors. The computing modules in the computing device may be any suitable hardware processor, such as a CPU, GPU, FPGA, DSP or ASIC. The storage unit may be any suitable magnetic or magneto-optical storage medium, such as RRAM, DRAM, SRAM, EDRAM, HBM or HMC.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional modules is illustrated, and in practical application, the above-described functional allocation may be performed by different functional modules according to needs, i.e. the internal structure of the apparatus is divided into different functional modules to perform all or part of the functions described above.

Example 1

1. Sample information

Whole-exome data of breast cancer from published studies were downloaded from the internet, comprising 152 peripheral blood samples and 360 normal paracancerous tissue samples (data sources: the ICGC database and two studies: Rheinbay, Esther, et al., Recurrent and functional regulatory mutations in breast cancer. Nature, 2017; Gracia-Aznarez, Francisco Javier, et al., Whole exome sequencing suggests much of non-BRCA1/BRCA2 familial breast cancer is due to moderate and low penetrance susceptibility alleles. PLoS ONE, 2013).

2. Operating procedure

1. Data preprocessing and prediction

(1) The original WES-seq FASTQ sequencing data was aligned to reference genome hg19 using BWA software.

(2) Duplicate marking and base quality score recalibration (BQSR) were performed using Picard.

(3) SNVs and insertions/deletions (INDELs) were called using GATK.

(4) The SNVs found were filtered. The specific filter parameters were set as follows:

QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0 || SOR > 4.0

Wherein QD (QualByDepth) represents the variant confidence divided by the unfiltered depth of non-reference reads; FS (FisherStrand) denotes a Fisher exact test that assesses the likelihood that the current variant shows strand bias; MQ (RMSMappingQuality) is the root mean square of the mapping quality across all samples; MQRankSum evaluates confidence from the mapping qualities of the reads supporting the reference and the variant; ReadPosRankSum evaluates variant confidence from the position of the variant within the read, since error rates are typically higher at both ends of a read; SOR (StrandOddsRatio) comprehensively evaluates the possibility of strand bias.

(5) The INDELs found were filtered. The specific filter parameters were set as follows:

QD < 2.0 || FS > 200.0 || SOR > 10.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0

(6) The filtered SNV and INDEL files are merged into the final VCF file.

(7) The VCF file is converted into a matrix of windows x samples.

(8) The above patient data (n=512) and the data of the remaining samples from the 1000 Genomes Project (n=501) were combined as the complete test set.

(9) The trained MiScan model was loaded and used to predict the test data, giving a predicted probability value for each sample.
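The logic of the hard-filter expressions in steps (4) and (5) can be spelled out in a short sketch. The dictionaries below stand in for the annotation values of a called variant and are hypothetical; the function names `fails_snv_filter` and `fails_indel_filter`, and the choice to treat a missing annotation as not failing, are assumptions of this sketch rather than behavior stated in the text. In practice such expressions are normally evaluated by the variant-calling toolkit itself.

```python
from typing import Dict


def _lt(info: Dict[str, float], key: str, threshold: float) -> bool:
    """True if the annotation is present and below the threshold."""
    return key in info and info[key] < threshold


def _gt(info: Dict[str, float], key: str, threshold: float) -> bool:
    """True if the annotation is present and above the threshold."""
    return key in info and info[key] > threshold


def fails_snv_filter(info: Dict[str, float]) -> bool:
    """True if an SNV matches the step (4) expression and is removed."""
    return (_lt(info, "QD", 2.0) or _gt(info, "FS", 60.0)
            or _lt(info, "MQ", 40.0) or _lt(info, "MQRankSum", -12.5)
            or _lt(info, "ReadPosRankSum", -8.0) or _gt(info, "SOR", 4.0))


def fails_indel_filter(info: Dict[str, float]) -> bool:
    """True if an INDEL matches the step (5) expression and is removed."""
    return (_lt(info, "QD", 2.0) or _gt(info, "FS", 200.0)
            or _gt(info, "SOR", 10.0) or _lt(info, "MQRankSum", -12.5)
            or _lt(info, "ReadPosRankSum", -8.0))


# Hypothetical annotation values for two variants.
good = {"QD": 15.2, "FS": 1.3, "MQ": 60.0,
        "MQRankSum": 0.1, "ReadPosRankSum": -0.4, "SOR": 0.8}
strand_biased = dict(good, FS=75.0)

print(fails_snv_filter(good), fails_snv_filter(strand_biased),
      fails_indel_filter(strand_biased))
```

The example shows that the same strand-bias value can fail the SNV filter while passing the more permissive INDEL threshold.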

2. Performance testing

MiScan and other machine learning methods were tested and compared mainly in terms of accuracy, sensitivity, specificity, ROC curve and PR curve.

3. Robustness test

(1) Test based on SNV sampling: 10%-90% of the data (in steps of 10%) are randomly extracted from the VCF files obtained from the test set, and data processing and prediction are then carried out. This random process was repeated 100 times.

(2) Test based on different sequencing depths: 10%-90% of the data (in steps of 10%) are randomly extracted from the raw files before variant calling, and data processing and prediction are then carried out. This random process was repeated 10 times for the patient samples in each test set.
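The random extraction at 10% steps can be sketched with the standard library as follows. The helper name `downsample`, the toy variant list and the fixed seed are assumptions for illustration; the same helper could in principle be applied to variant records or to sequencing reads.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def downsample(items: Sequence[T], fraction: float, rng: random.Random) -> List[T]:
    """Keep a random `fraction` of the items, preserving original order."""
    k = round(len(items) * fraction)
    chosen = sorted(rng.sample(range(len(items)), k))
    return [items[i] for i in chosen]


# Hypothetical variant records for one sample: (chromosome, position).
variants = [("chr1", 100 + 10 * i) for i in range(50)]

rng = random.Random(42)  # fixed seed so the draw is repeatable
subsets = {pct: downsample(variants, pct / 100, rng)
           for pct in range(10, 100, 10)}

for pct in sorted(subsets):
    print(pct, len(subsets[pct]))
```

Repeating the draw with different seeds, as in the 100 and 10 repetitions described above, gives a distribution of prediction results at each sampling level rather than a single value.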

4. Assessment of Gene importance

(1) All windows corresponding to a given gene are removed, and the same model is retrained and tested on the training set with the reduced windows.

(2) The procedure of (1) was repeated until all genes had been processed.

(3) The test results after deletion of each gene are sorted in ascending order by each of the five indices: accuracy, sensitivity, specificity, AUC and AP.

(4) Ten driver genes mutated at high frequency in breast cancer patients and 2 genes (BRCA1 and BRCA2) responsible for genetic susceptibility to breast cancer were selected, and their relative positions are indicated in the figure.
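The leave-one-gene-out procedure above can be outlined in a short sketch. The retraining and testing of the model are represented by a caller-supplied function, here replaced by a toy stand-in, because the actual model training is outside the scope of the example; the function names `drop_gene_windows` and `rank_genes`, the window-to-gene annotation list and the toy scorer are all assumptions for illustration.

```python
from typing import Callable, List


def drop_gene_windows(matrix: List[List[int]],
                      window_genes: List[str],
                      gene: str) -> List[List[int]]:
    """Remove every window (row) annotated with `gene` from a
    window x sample matrix."""
    return [row for row, g in zip(matrix, window_genes) if g != gene]


def rank_genes(matrix: List[List[int]],
               window_genes: List[str],
               retrain_and_score: Callable[[List[List[int]]], float]) -> List[str]:
    """Score the model after deleting each gene in turn and return the
    genes sorted by that score in ascending order."""
    scores = {gene: retrain_and_score(drop_gene_windows(matrix, window_genes, gene))
              for gene in sorted(set(window_genes))}
    return sorted(scores, key=scores.get)


def toy_score(m: List[List[int]]) -> float:
    """Stand-in for retraining and testing: fraction of 1s remaining.
    A real run would train the model and return, e.g., test accuracy."""
    return sum(map(sum, m)) / (len(m) * len(m[0]))


# Toy matrix: 4 windows (rows) x 2 samples (columns).
matrix = [[1, 0], [0, 1], [1, 1], [0, 0]]
window_genes = ["TP53", "TP53", "BRCA1", "PIK3CA"]
ranking = rank_genes(matrix, window_genes, toy_score)
print(ranking)
```

Sorting in ascending order places first the genes whose removal lowers the chosen index the most, which is how the relative positions of the marked genes can be read.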

3. Summary of results

1. Model construction and training

As can be seen from fig. 3, the model gradually converges during training. After epoch 100, the training of the model tends to stabilize. The first interval in which the training accuracy of 5 consecutive epochs is greater than 99% and the accuracy deviation is less than 0.02 is referred to as the stable interval (the interval between the two dashed lines). The model with the highest training accuracy in this stable interval (epoch=113) was selected as the final MiScan model.

2. Model performance assessment

To evaluate the performance of the model, miScan is compared to other popular machine learning methods, including a single model-Decision Tree (DT), K-nearest neighbor (KNN), support Vector Machine (SVM), and two aggregate methods-Random Forest (RF) and gradient-rising decision tree (GBDT). MiScan has the highest accuracy-97%, while other machine learning methods have predictive accuracy of 86% (SVM), 73% (DT), 63% (RF), 56% (GBDT) and 49% (KNN) (fig. 4, left).

Furthermore MiScan has a high sensitivity (100%) and a high specificity (95%) for the prediction of patients and normal individuals, respectively, over other methods, while some such as KNN and GBDT erroneously predict most normal persons as patients, which shows that only MiScan is optimal and no prediction is preferred (fig. 4, right).

Meanwhile, as can be seen from ROC and PR curves, the MiScan model has the strongest classification performance, and AUC and AP values of the MiScan model reach 0.994 and 0.989 respectively, which are significantly superior to those of the rest other methods (figure 5).

Further, the predictive probability distribution of MiScan models was analyzed using training and testing datasets (fig. 6). The figure can obviously show that the MiScan model fits well to the training set, and the predicted probability values of the patients and the normal persons in the test set are respectively concentrated near 0 and 1, so that the MiScan model can obviously distinguish the patients from the normal persons. All evaluation indexes for all test sample predictions are shown in table 1.

The above results demonstrate that, compared with the other machine learning methods, the MiScan method has the best classification performance and predictive power, with no bias between the prediction of patients and normal individuals.

Table 1: Predictive performance statistics of MiScan and other methods

3. Robustness test

The robustness test has two purposes: to evaluate the anti-interference performance of the method, and to evaluate its prediction performance on low-quality data, so as to determine whether the method could be applied to early cancer screening in the future at an acceptable cost. The invention first downsamples the SNVs in the VCF file of each sample in the test set. The inventors noted that MiScan retained very high discriminative power even when the number of retained mutations was as low as 10% of the original data (both AUC and AP values were higher than 0.99), whereas SVM and RF, which originally had relatively good classification performance, deteriorated as the number of retained SNVs decreased. This strongly suggests that the MiScan model is highly stable (fig. 7, top). MiScan also showed the best stability in prediction accuracy at different proportions of input SNVs (fig. 7, bottom).
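
The SNV downsampling step can be sketched as below. This is an illustration under stated assumptions rather than the procedure actually used: the function name is hypothetical, the VCF content is passed as a list of text lines, header lines (starting with "#") are always kept, and variant records are drawn uniformly at random without replacement.

```python
import random
from typing import List


def downsample_vcf_lines(lines: List[str], fraction: float,
                         seed: int = 0) -> List[str]:
    """Keep all header lines and a random `fraction` of the variant records,
    preserving the original record order."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    header = [ln for ln in lines if ln.startswith("#")]
    records = [ln for ln in lines if ln.strip() and not ln.startswith("#")]
    keep = round(len(records) * fraction)
    chosen = sorted(rng.sample(range(len(records)), keep))
    return header + [records[i] for i in chosen]
```

Calling the function with `fraction=0.1` corresponds to the 10% setting discussed above; repeating the call for several fractions gives the series of inputs compared in fig. 7.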

Since sequencing depth is one of the key factors affecting SNV detection, the present invention also examines the performance of these methods at low sequencing depths. The original sequencing data in the test set were downsampled in order to simulate low-quality data more realistically. The inventors noted that MiScan was able to identify these patients with high sensitivity even when the sequencing data volume was as low as one million reads (1M). In contrast, the performance of the other machine learning methods dropped dramatically as sequencing depth decreased, confirming the robustness of MiScan relative to these models (fig. 8). Since genomic variation information from peripheral blood is very easy to obtain, MiScan is well suited for early diagnosis of breast cancer with high accuracy and robustness.
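
The text does not state which tool was used to downsample the raw reads. One possible way to draw a fixed number of reads from a sequencing file is reservoir sampling, sketched below. The function names are hypothetical, the reads are assumed to be in four-line FASTQ records, and paired-end files would additionally need both mates sampled together, which this sketch does not handle.

```python
import random
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

FastqRecord = Tuple[str, ...]  # four lines: header, sequence, "+", qualities


def iter_fastq_records(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Group an iterable of FASTQ lines into four-line records."""
    it = iter(lines)
    while True:
        record = tuple(islice(it, 4))
        if len(record) < 4:  # end of input (or a truncated final record)
            return
        yield record


def reservoir_sample(records: Iterable[FastqRecord], k: int,
                     seed: int = 0) -> List[FastqRecord]:
    """Uniformly sample up to `k` records in one pass (Algorithm R)."""
    rng = random.Random(seed)
    sample: List[FastqRecord] = []
    for i, rec in enumerate(records):
        if i < k:
            sample.append(rec)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = rec
    return sample
```

With `k` set to 1,000,000 this would yield an input comparable in size to the 1M setting mentioned above. The sampled reads would then still need to be aligned and variant-called before prediction.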

4. Assessment of gene importance

Although deep learning is powerful, its black-box nature makes it difficult for researchers to study the relationships among its internal features. In the Maxout model, it is particularly important to evaluate the contribution of each gene to the model. To evaluate the contribution of variation in each gene to disease recognition, the invention designs an algorithm for evaluating gene weights. The basic principle is as follows: for a particular gene, all mutations corresponding to that gene are removed, and the model is then retrained and used for prediction again (denoted as a defect model). All defect models are then ranked according to different criteria to account for the effect of each gene.
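
The principle described above can be sketched as a leave-one-gene-out loop. The sketch makes several assumptions that are not taken from the text: each sample is represented as a dictionary mapping gene names to a mutation feature value, the caller supplies hypothetical `fit` and `score` functions, and genes are ranked by the drop of a single score relative to the full model.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

Sample = Dict[str, float]  # gene name -> mutation feature (assumed encoding)


def drop_gene(samples: Sequence[Sample], gene: str) -> List[Sample]:
    """Remove every feature belonging to `gene` from each sample."""
    return [{g: v for g, v in s.items() if g != gene} for s in samples]


def rank_genes_by_ablation(
    genes: Sequence[str],
    train_x: Sequence[Sample], train_y: Sequence[int],
    test_x: Sequence[Sample], test_y: Sequence[int],
    fit: Callable[[Sequence[Sample], Sequence[int]], Any],
    score: Callable[[Any, Sequence[Sample], Sequence[int]], float],
) -> List[Tuple[str, float]]:
    """Retrain one "defect model" per gene and rank genes by score drop."""
    baseline = score(fit(train_x, train_y), test_x, test_y)
    drops: Dict[str, float] = {}
    for gene in genes:
        # Train and evaluate with all mutations of this gene removed.
        defect_model = fit(drop_gene(train_x, gene), train_y)
        defect_score = score(defect_model, drop_gene(test_x, gene), test_y)
        drops[gene] = baseline - defect_score
    # Largest drop first: these genes contribute most under this criterion.
    return sorted(drops.items(), key=lambda kv: kv[1], reverse=True)
```

Passing a different `score` function (for example accuracy, AUC, or AP) yields the different ranking criteria mentioned above. Because one model is retrained per gene, the cost grows linearly with the number of genes evaluated.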

Surprisingly, although certain breast cancer-related genes (e.g., PIK3CA, MAP3K1, BRCA1, etc.) may contribute more to prediction accuracy than ordinary genes, omitting any single gene does not significantly reduce the classification and prediction capacity of MiScan (fig. 9). More interestingly, deleting any single gene reduced the prediction accuracy from 99% to 91% at most, and the classification indexes (AUC and AP) always remained higher than 0.99, indicating that mutations in a few driver genes may not increase the likelihood of cancer. Instead, the accumulation and synergy of mutations at many hotspots is more critical. This may also reflect the heterogeneity and complexity of cancer occurrence and progression.

The foregoing description of the embodiments has been provided for the purpose of illustrating the general principles of the invention and is not meant to limit the invention thereto; any modifications, equivalents, and improvements may be made without departing from the spirit and principles of the invention.

Claims

1. A cancer prediction system based on non-cancerous tissue mutation information, the cancer prediction system comprising an input layer, a plurality of hidden layers, and an output layer connected in sequence;

z_ij = x^T W_ij + b_ij

the output layer receives the N nodes obtained from the last hidden layer and outputs two nodes, which respectively represent the probability that an individual is predicted to have cancer and the probability that the individual is predicted to be normal;

the construction and training method of the cancer prediction system comprises the following steps:

constructing the cancer prediction system comprising an input layer, a plurality of hidden layers, and an output layer connected in sequence;

training the system with a training set, and selecting the system with the highest accuracy in the first convergence interval as the final cancer prediction system.

2. The cancer prediction system of claim 1, wherein Dropout is added between every two connected hidden layers to regularize the output of the preceding hidden layer, avoiding the overfitting that Maxout neurons may cause;

r ~ Bernoulli(p)

3. The cancer prediction system of claim 1, wherein the two nodes output by the output layer are normalized using a Softmax activation function, the normalization being performed by:

σ(z)_j = e^{z_j} / Σ_{k=1}^{K} e^{z_k}

where σ represents the activation function, z represents a vector consisting of z_1, …, z_K, K represents the number of input nodes, z_j represents the input of the j-th node, and e is the natural constant.

4. The cancer prediction system of claim 1, wherein the window is 50-300bp in length.

5. The cancer prediction system of claim 1, wherein the window is 100bp or 150bp or 200bp or 250bp in length.

6. The cancer prediction system of claim 1, wherein the cancer prediction system further comprises the step of evaluating the cancer prediction system.

7. The cancer prediction system of claim 6, wherein the classification performance of the cancer prediction system is assessed by plotting a receiver operating characteristic (ROC) curve and a Precision-Recall (PR) curve, the assessment index being selected from one or more of test set accuracy, sensitivity, specificity, area under the ROC curve (AUC), and average precision (AP).

8. A cancer prediction device based on non-cancer tissue mutation information, comprising an input module, a data processing module and an output module which are sequentially connected;

z_ij = x^T W_ij + b_ij

the output module receives the N nodes obtained from the last hidden layer of the data processing module and outputs two nodes, which respectively represent the probability that an individual is predicted to have cancer and the probability that the individual is predicted to be normal;

the construction and training method of the cancer prediction device comprises the following steps:

constructing the cancer prediction device comprising an input layer, a plurality of hidden layers, and an output layer connected in sequence;

training the device with a training set, and selecting the device with the highest accuracy in the first convergence interval as the final cancer prediction device.

9. The cancer prediction device according to claim 8, wherein Dropout is added between every two connected hidden layers to regularize the output of the preceding hidden layer, avoiding the overfitting that Maxout neurons may cause;

r ~ Bernoulli(p)

10. The cancer prediction device according to claim 8, wherein the two nodes output by the output module are normalized using a Softmax activation function, the normalization being performed as follows: