CN112687329A - Cancer prediction system based on non-cancer tissue mutation information and construction method thereof

Info

Publication number: CN112687329A
Application number: CN201910992441.7A
Authority: CN
Inventors: 瞿昆; 俞乔尼; 黎斌; 刘年平; 方靖文
Original assignee: University of Science and Technology of China USTC
Current assignee: University of Science and Technology of China USTC
Priority date: 2019-10-17
Filing date: 2019-10-17
Publication date: 2021-04-20
Anticipated expiration: 2039-10-17
Also published as: CN112687329B

Abstract

A cancer prediction system based on mutation information from non-cancer tissue, and a method for constructing it. The cancer prediction system comprises an input layer, a plurality of hidden layers and an output layer connected in sequence. The system predicts cancer from whole-exome mutation information, markedly improves prediction accuracy, reduces the amount of mutation data required, and can be widely used in clinical testing.

Description

Cancer prediction system based on non-cancer tissue mutation information and construction method thereof

Technical Field

The invention belongs to the field of artificial intelligence, and particularly relates to a cancer prediction system based on non-cancer tissue mutation information and a construction method thereof.

Background

Although targeted cancer therapy and tumor immunotherapy have cured many patients or significantly improved overall survival in certain diseases, cancer detection and treatment remain serious challenges. Timely and effective diagnosis, which enables early intervention and treatment, is one of the key factors in improving overall survival. At present, early cancer screening relies mainly on imaging, which involves high radiation doses and high cost and is therefore unsuitable for frequent use. Current blood-antigen tests, such as those for prostate-specific antigen (PSA), the tumor-associated antigen CA-125 and carcinoembryonic antigen (CEA), target only one or a few cancer types and produce many false positives. There is therefore an urgent clinical need for new blood-based detection methods to help physicians with early diagnosis and screening.

On the other hand, owing to genetic and environmental factors, an individual's genome acquires mutations, and some of these "driver" mutations, together with the accumulation of further mutations, may ultimately lead to cancer. Detecting mutations in an individual's genome therefore holds promise for predicting the occurrence of cancer. For example, BRCA1 is regarded as a susceptibility gene for breast and ovarian cancer, in which mutations increase cancer risk, yet only about 3% to 8% of all women with breast cancer carry BRCA1 or BRCA2 mutations. Likewise, BRCA1 mutations are found in only about 18% of ovarian cancers. Meanwhile, with the rapid development of next-generation high-throughput sequencing, large-scale collaborative projects such as the 1000 Genomes Project and The Cancer Genome Atlas (TCGA) have provided abundant genomic information on patients and normal individuals, making it possible to predict cancer from genomic variation. Although pan-cancer analyses have found several genes, such as TP53 and PIK3CA, that are associated with more than 10% of patients in most cancers, recent studies have found that mutations in these cancer driver genes are also prevalent in the blood and tissues of normal individuals, suggesting that cancer risk cannot be predicted from mutations in a single gene alone. Mutations at multiple key sites have therefore been used for cancer prediction. Existing methods mainly use a support vector machine (SVM) to predict breast cancer and multiple myeloma, but their prediction accuracy is only about 70%, and the studies used only a few hundred samples, which is not enough to demonstrate the reliability of the approach.
Therefore, the field of cancer screening urgently needs more advanced and robust methods that exploit the large number of mutations in an individual's genome for cancer risk prediction, validated on a large test set to demonstrate their reliability.

In recent years, extensive deep learning research has greatly advanced artificial intelligence technology, which has been widely applied in precision medicine, including drug molecule design, medical image diagnosis, and the prediction of disease driver genes and mutations. However, its application to cancer diagnosis has relied mainly on the analysis of large numbers of clinical images, and deep learning has not yet been shown to play a major role in the early diagnosis of cancer.

Disclosure of Invention

In view of the above, the present invention utilizes a deep learning method to construct a deep neural network model to learn genome variation information of a large number of cancer patients and normal persons in an existing database, and can perform cancer risk prediction on a blood-derived sample, thereby establishing a system applicable to early cancer prediction.

In one aspect, the present invention provides a cancer prediction system based on non-cancerous tissue mutation information, the cancer prediction system comprising an input layer, a plurality of hidden layers, and an output layer, which are sequentially connected;

wherein the input layer is used for inputting mutation information, the mutation information being organized by windows of a preset length into which each exon is cut;

each hidden layer comprises M fully-connected layers, each fully-connected layer comprises N nodes, M and N are positive integers larger than 1, and a Maxout activation function is embedded between the hidden layers, wherein each node in each fully-connected layer performs the following linear transformation:

$$z_{ij} = x^{T} W_{ij} + b_{ij}$$

wherein $z_{ij}$ denotes the j-th activation unit of the i-th neuron, $x^{T}$ denotes the transpose of the input, and $W_{ij}$ and $b_{ij}$ are parameters that the system needs to learn, namely the weight matrix and the bias vector from the input layer to the activation unit $z_{ij}$; and the Maxout activation function applies a nonlinear transformation to the node information obtained by the linear transformation to obtain N nodes, which are output to the next layer connected in sequence, wherein the output of the Maxout activation function is the maximum value among the activation units of the group:

$$h_i(x) = \max_{j \in [1, k]} z_{ij}$$

wherein k denotes the number of activation units in the Maxout neuron activation unit group, and x denotes the input of the activation function;

the output layer receives the N nodes obtained by the last hidden layer and outputs two nodes, which respectively represent the predicted probabilities that the individual has cancer or is normal.
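As an illustrative sketch (not the patented implementation), the hidden-layer computation described above can be written in NumPy. The function name `maxout_layer`, the array layout (k linear pieces × inputs × units) and the toy sizes are assumptions made for this example only.

```python
import numpy as np

def maxout_layer(x, W, b):
    """Maxout hidden layer.

    x: (batch, d_in) input
    W: (k, d_in, n_units) weights, one matrix per linear piece
    b: (k, n_units) biases
    Returns (batch, n_units): for each unit i, the max over j of z_ij.
    """
    # z[j] = x @ W[j] + b[j] for each of the k linear pieces
    z = np.einsum("bd,kdn->kbn", x, W) + b[:, None, :]
    return z.max(axis=0)

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=(4, 10)).astype(float)  # 4 samples, 10 binary windows
W = rng.normal(size=(3, 10, 5))                     # k = 3 pieces, 5 units
b = rng.normal(size=(3, 5))
h = maxout_layer(x, W, b)
print(h.shape)
```

Under this reading, the M fully-connected layers of a hidden layer play the role of the k linear pieces, and the N nodes are the units that survive the maximum.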

In some embodiments, Dropout is added between every two connected hidden layers to regularize the output of the preceding hidden layer, so that the over-fitting that Maxout neurons may cause is avoided:

$$r \sim \mathrm{Bernoulli}(p), \qquad \tilde{h} = r \odot h$$

wherein p denotes the proportion of neuron connections that are disconnected, the Bernoulli function randomly generates a vector of 0s and 1s, r is a vector obeying the Bernoulli distribution, h is the input vector, and $\tilde{h}$ is the output vector after random disconnection.
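A minimal sketch of this masking step follows. It assumes that p is the fraction of units dropped, so each mask entry is 1 (kept) with probability 1 − p, which is how dropout layers are usually parameterized; the function name `dropout` and the toy sizes are hypothetical. Many library implementations also rescale the kept units during training, which this sketch omits.

```python
import numpy as np

def dropout(h, p, rng):
    """Randomly zero entries of h; p is the fraction of units dropped."""
    # Assumption: keep probability is 1 - p, so r is 1 for kept units.
    r = rng.binomial(1, 1.0 - p, size=h.shape)
    return r * h, r

rng = np.random.default_rng(0)
h = np.ones((1000, 128))
h_tilde, r = dropout(h, p=0.25, rng=rng)
print(round(1.0 - r.mean(), 2))  # empirical drop fraction
```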

In some embodiments, the two nodes output by the output layer are normalized using a Softmax activation function. In some embodiments, the normalization is performed as follows:

$$\sigma(z)_j = \frac{e^{z_j}}{\sum_{l=1}^{K} e^{z_l}}, \qquad j = 1, \ldots, K$$

where σ denotes the activation function, z denotes the vector composed of $z_1, \ldots, z_K$, K denotes the number of input nodes, $z_j$ denotes the input of the j-th node, and e is the base of the natural logarithm.
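For illustration, the normalization can be sketched in NumPy as below. Subtracting the row maximum before exponentiating is a common numerical-stability step added here; it does not change the result and is not described in the source.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; subtracting the max avoids overflow in exp."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([[2.0, 0.0], [-1.0, 3.0]])  # two samples, two output nodes
probs = softmax(logits)
print(probs.sum(axis=1))
```

With two output nodes, each row of `probs` holds a pair of probabilities that sum to one, matching the cancer/normal outputs described above.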

In another aspect, the present invention provides a method for constructing the cancer prediction system, including:

acquiring mutation information of cancer patients and germline variation information of normal persons, and using the mutation information of all the cancer patients and the germline variation information of a portion of the normal persons as a training set;

cutting each exon into windows of a predetermined length, and retaining only those windows in which at least two cancer patients carry a mutation;

converting the mutation information of the cancer patients and the normal persons into a window × sample binarized matrix; randomly extracting, from the normal persons in the training set, binarized matrices each containing as many samples as there are cancer patients, adding each of them to the window matrix of the cancer patients and binarizing the result for use in the training set, wherein the extracted binarized matrices do not overlap one another;

constructing the cancer prediction system comprising an input layer, a plurality of hidden layers and an output layer connected in sequence; and training the system with the training set, and selecting the system with the highest accuracy in the first convergence interval as the final cancer prediction system.

In some embodiments, the window is 50-300bp in length (e.g., 100bp, 150bp, 200bp, 250bp, etc.).
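The preprocessing steps of the construction method (windowing, window filtering, binarization, and overlaying patient mutations on normal backgrounds) can be sketched as follows. This is a toy illustration under stated assumptions: coordinates are half-open intervals on a single chromosome, and the names `exon_windows`, `binarize` and `augment_patients` and all positions are hypothetical.

```python
import numpy as np

def exon_windows(start, end, size=100):
    """Split the half-open interval [start, end) into windows of `size` bp;
    the last window may be shorter."""
    return [(s, min(s + size, end)) for s in range(start, end, size)]

def binarize(windows, sample_positions):
    """Build a window x sample 0/1 matrix from per-sample mutation positions."""
    X = np.zeros((len(windows), len(sample_positions)), dtype=np.uint8)
    for col, positions in enumerate(sample_positions):
        for pos in positions:
            for row, (s, e) in enumerate(windows):
                if s <= pos < e:
                    X[row, col] = 1
    return X

def augment_patients(patients, normals, n_copies, rng):
    """Add each patient column to a distinct normal column, then binarize."""
    n_pat = patients.shape[1]
    # Sampling without replacement keeps the extracted normal groups disjoint.
    picked = rng.choice(normals.shape[1], size=n_copies * n_pat, replace=False)
    blocks = []
    for c in range(n_copies):
        background = normals[:, picked[c * n_pat:(c + 1) * n_pat]]
        blocks.append(((patients + background) > 0).astype(np.uint8))
    return np.hstack(blocks)

windows = exon_windows(1000, 1250)                     # hypothetical 250 bp exon
patients = binarize(windows, [[1010, 1205], [1090], [1150]])
keep = patients.sum(axis=1) >= 2                       # window hit in >= 2 patients
normals = binarize(windows, [[1020], [], [1230], [1040], [1060], [1099]])
rng = np.random.default_rng(0)
train_pat = augment_patients(patients[keep], normals[keep], n_copies=2, rng=rng)
print(windows, keep.tolist(), train_pat.shape)
```

Because the overlay is an element-wise OR, every mutation carried by a patient is preserved in the augmented columns while the normal background varies.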

In some embodiments, the method further comprises the step of evaluating the cancer prediction system.

In some embodiments, the classification performance of the cancer prediction system is assessed by plotting a receiver operating characteristic (ROC) curve and a precision-recall (PR) curve.

In some embodiments, the evaluation metric is selected from one or more of test-set accuracy, sensitivity, specificity, area under the ROC curve (AUC), and average precision (AP).
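As a hedged illustration of these metrics, the sketch below computes accuracy, sensitivity and specificity from predicted cancer probabilities, and the AUC via its rank interpretation. The 0.5 decision threshold, the label convention (1 = cancer, 0 = normal), the function names and the toy data are assumptions; average precision is not shown.

```python
import numpy as np

def binary_metrics(y_true, p_cancer, threshold=0.5):
    """Accuracy, sensitivity and specificity; label 1 = cancer, 0 = normal."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(p_cancer) >= threshold).astype(int)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    tn = int(((y_pred == 0) & (y_true == 0)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

y = [1, 1, 1, 0, 0, 0, 0]                      # toy labels
p = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1, 0.3]        # toy predicted probabilities
m = binary_metrics(y, p)
auc = roc_auc(y, p)
print(m, auc)
```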

In another aspect, the present invention further provides a cancer prediction apparatus based on non-cancer tissue mutation information, which includes an input module, a data processing module and an output module connected in sequence;

wherein the input module is used for inputting mutation information, the mutation information being organized by windows of a preset length into which each exon is cut;

the data processing module comprises a plurality of hidden layers, each hidden layer comprises M fully-connected layers, each fully-connected layer comprises N nodes, M and N are positive integers larger than 1, and a Maxout activation function is embedded between the hidden layers, wherein each node in the fully-connected layers performs the following linear transformation:

$$z_{ij} = x^{T} W_{ij} + b_{ij}$$

the output module receives the N nodes obtained by the last hidden layer of the data processing module and outputs two nodes, which respectively represent the predicted probabilities that the individual has cancer or is normal.

In some embodiments, in the data processing module, Dropout is added between every two connected hidden layers to regularize the output of the preceding hidden layer, so that the overfitting that Maxout neurons may cause is avoided:

$$r \sim \mathrm{Bernoulli}(p), \qquad \tilde{h} = r \odot h$$

wherein $\tilde{h}$ is the output vector after random disconnection.

In some embodiments, the two nodes output by the output module are normalized using a Softmax activation function.

In some embodiments, the normalization is performed as follows:

$$\sigma(z)_j = \frac{e^{z_j}}{\sum_{l=1}^{K} e^{z_l}}, \qquad j = 1, \ldots, K$$

Compared with the prior art, the invention has the following beneficial effects:

Compared with existing methods that predict and diagnose cancer by detecting DNA mutation information in blood samples (such as BRCA1/2 single-gene testing, combined protein and multi-gene testing, and multi-site panel testing), the cancer prediction system of the invention, through its optimized neural network model and data preprocessing method, achieves a sensitivity of 100% and a specificity of 95%, higher than those of other methods. For comparison, traditional BRCA1/2 testing rests on the observation that otherwise normal carriers of BRCA1/2 mutations have a cancer risk of 45% to 85%; CancerSEEK, a recently published test based on several proteins and gene mutations, has an accuracy of only 33% for breast cancer; and the support vector machine (SVM), a traditional machine learning method that predicts cancer from multi-site mutations, has an accuracy of 69%.

Drawings

FIG. 1: construction of the MiScan model;

FIG. 2: overview of the MiScan model;

FIG. 3: training curve of the model;

wherein the training epochs between the two dotted lines represent the stable training interval.

FIG. 4: comparison of the accuracy, sensitivity and specificity of the MiScan model with other machine learning methods on the test set;

wherein the single models are: SVM (support vector machine); DT (decision tree); KNN (K-nearest neighbors);

and the ensemble methods are: RF (random forest); GBDT (gradient boosting decision tree);

FIG. 5: comparison of the MiScan model with other machine learning methods on ROC and PR curves;

FIG. 6: distribution of the probability values predicted by MiScan and other methods for the training set and test set samples;

FIG. 7: robustness test based on SNV sampling;

FIG. 8: robustness test based on sampling of the raw sequencing data;

FIG. 9: heat map of gene importance evaluated with the MiScan model;

wherein, after all windows corresponding to a given gene are removed, the model is retrained and retested, and the genes are ranked according to five metrics. The 12 genes marked are the 10 driver genes most frequently mutated in breast cancer patients plus the 2 genes responsible for hereditary susceptibility to breast cancer (BRCA1 and BRCA2).

Detailed Description

In order that the objects, technical solutions and advantages of the present invention will become more apparent, the present invention will be further described in detail with reference to the accompanying drawings in conjunction with the following specific embodiments.

The invention develops a multilayer neural network based on single-nucleotide variations (SNVs) and activated by the Maxout function, which predicts cancer risk through SNV analysis of whole-exome sequencing (WES) data.

The invention uses mutation information obtained by whole-exome sequencing, from large cancer databases and from normal persons, as the training set. The large cancer databases include TCGA, ICGC and the like, and the cancer types cover more than 30 cancers, such as breast cancer, lung cancer, ovarian cancer and glioma.

The test set of the present examples was derived from the ICGC breast cancer database and from data used in published studies (e.g., Rheinbay, Esther, et al., Recurrent and functional regulatory mutations in breast cancer, Nature, 2017; Gracia-Aznarez, Francisco Javier, et al., Whole exome sequencing suggests much of non-BRCA1/BRCA2 familial breast cancer is due to moderate and low penetrance susceptibility alleles, PLoS ONE, 2013).

The invention takes breast cancer as an example to evaluate whether the cancer prediction system can predict cancer from genome-wide variation. The mutation status of breast cancer shows broad heterogeneity in unsupervised clustering based on mutated genes, which makes it a suitable example. Existing studies that used frequently somatically mutated genes as features for cancer classification did not perform well, indicating that broad mutation patterns cannot be used directly to classify or predict cancer. The invention uses a window strategy in its analysis, which effectively reduces the number of features. The invention collects a large number of independent WES patient data sets and compares the deep learning model with other machine learning methods on them to assess its accuracy and robustness. The cancer prediction system of the invention is applicable not only to breast cancer but also to various other cancers, provided that sufficient sample data are available. The invention names the system MiScan (Maxout induced SNV-based cancer prediction model).

As shown in FIG. 1, the MiScan model is constructed and evaluated as follows:

Construction of the MiScan model

1. Somatic mutation information (MAF) for 986 breast cancer patients was downloaded from TCGA, and germline variation information (VCF) for 2,504 normal persons was downloaded from phase 3 of the 1000 Genomes Project. 80% of the normal data set (n = 2,003) was randomly selected for the training set, and the remainder was used for the test set.

2. Each exon is cut into windows 100 bp long; the last window of an exon may be shorter than 100 bp. The windows are then filtered: a window is retained if at least two patients carry a mutation in it (not necessarily the same mutation) and discarded otherwise. In the end, 13,885 valid windows remain.

3. The mutation information of the patients and the normal persons is converted into a window × sample binary matrix. Each row is a window feature and each column is a sample; if a sample carries a mutation in a window, the corresponding value is 1, otherwise it is 0. To balance the numbers of patients and normal persons in the training set, and to let the model discover complex mutation-related networks, two non-overlapping binarized matrices, each containing 986 individuals, are randomly extracted from the normal persons in the training set; each is added to the patients' window matrix and the sum is binarized. The final training data is a binarized window matrix of 13,885 window features and 3,975 samples, i.e. 2 × 986 = 1,972 augmented patient samples plus the 2,003 normal samples.

4. The MiScan model was built in Python using the Keras API with TensorFlow as the back end. As shown in FIG. 2, the model contains one input layer, seven hidden layers and one output layer. Each hidden layer consists of 32 fully-connected layers with 128 nodes each, and Maxout activation functions are embedded between hidden layers to prevent vanishing gradients or overfitting in the deep neural network. The output layer uses softmax as its activation function. In addition, a Dropout (random deactivation) layer is inserted between each pair of adjacent layers, with the fraction of dropped connections set to 0.25 (p = 0.25), to prevent overfitting. The optimizer of the model is Adam, with the learning rate set to 0.001 (lr = 0.001); the loss function is the cross-entropy loss (loss = 'categorical_crossentropy'), and the evaluation metric is accuracy (metrics = 'accuracy').

It should be noted that the number of hidden layers, the number of fully-connected layers, and the number of nodes in each fully-connected layer in the model are determined by factors such as window characteristics and the number of samples, and can be selected appropriately according to actual situations.

The constructed cancer prediction system based on the non-cancer tissue mutation information sequentially comprises an input layer I, hidden layers H1-H7 and an output layer O from left to right. The specific construction steps are as follows:

Step 1: feeding the training data set into an input layer I containing 13,885 nodes to obtain layer 1 of the model;

Step 2: inputting the input layer I into hidden layer H1, wherein H1 comprises 32 fully-connected layers, each fully-connected layer comprises 128 nodes, and each node in the fully-connected layers performs the following linear transformation:

$$z_{ij} = x^{T} W_{ij} + b_{ij}$$

wherein $z_{ij}$ denotes the j-th activation unit of the i-th neuron, $x^{T}$ denotes the transpose of the input, and $W_{ij}$ and $b_{ij}$ are parameters of the model to be learned, namely the weight matrix and the bias vector from the input layer to the activation unit $z_{ij}$; the resulting node information is then transformed nonlinearly by the following Maxout activation function, whose output is the maximum value among the activation units of the group:

$$h_i(x) = \max_{j \in [1, k]} z_{ij}$$

where k denotes the number of activation units in the Maxout neuron activation unit group and x denotes the input of the activation function.

Step 3: regularizing the output of hidden layer H1 with Dropout, so that the overfitting that Maxout neurons may cause is avoided:

$$r \sim \mathrm{Bernoulli}(p), \qquad \tilde{h} = r \odot h$$

wherein $\tilde{h}$ is the output vector after random disconnection; $\tilde{h}$ is then input to hidden layer H2; H2 likewise comprises 32 fully-connected layers, each containing 128 nodes; the resulting node information is transformed nonlinearly by the Maxout activation function to obtain the model hidden layer H2;

Step 4: proceeding in the same way as in step 3, hidden layers H3, H4, H5, H6 and H7 are established in turn, constructing layers 4, 5, 6, 7 and 8 of the model, wherein these hidden layers have the same structure as H1 and H2;

Step 5: inputting hidden layer H7 into an output layer O comprising two nodes, and normalizing with the Softmax activation function to obtain the probability $P_n$ that the individual is normal and the probability $P_c$ that the individual has cancer:

$$\sigma(z)_j = \frac{e^{z_j}}{\sum_{l=1}^{K} e^{z_l}}, \qquad j = 1, \ldots, K$$

where σ denotes the activation function, z denotes the vector composed of $z_1, \ldots, z_K$, K denotes the number of input nodes, $z_j$ denotes the input of the j-th node, and e is the base of the natural logarithm, an irrational constant approximately equal to 2.71828; the activated output, a one-dimensional vector, forms layer 9 of the model.

Step 6: connecting the 9 layers of the neural network in sequence from left to right to obtain the cancer prediction system MiScan based on non-cancer tissue mutation information.
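To give a feel for the size of the network just described, the following sketch counts trainable parameters. It assumes that each hidden layer is a Maxout layer whose 32 fully-connected layers act as 32 linear pieces feeding 128 output units, and that the output layer is an ordinary dense layer with two nodes; the function names are hypothetical and the count is only as good as those assumptions.

```python
def maxout_params(d_in, n_units, k):
    """Weights plus biases of one maxout layer with k linear pieces."""
    return k * (d_in * n_units + n_units)

def miscan_like_param_count(n_windows=13885, n_hidden=7, n_units=128, k=32, n_out=2):
    """Parameter count for input -> n_hidden maxout layers -> softmax output."""
    total = maxout_params(n_windows, n_units, k)                   # H1
    total += (n_hidden - 1) * maxout_params(n_units, n_units, k)   # H2..H7
    total += n_units * n_out + n_out                               # output layer O
    return total

n_params = miscan_like_param_count()
print(n_params)
```

Under these assumptions the first hidden layer, which connects all 13,885 window features, accounts for the large majority of the roughly 60 million parameters. Dropout layers add no parameters.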

5. After the labels of the data set are converted to one-hot encoding, model training begins. The batch size is 500 (batch_size = 500) and the number of training epochs is 200 (epochs = 200). During training, the model is saved after every epoch and its accuracy is recorded. The first run of 5 consecutive epochs in which the accuracy exceeds 0.99 and the change in accuracy between adjacent epochs is less than 0.02 is defined as the first stable convergence segment. The model with the highest accuracy in this segment is selected as the final MiScan model.
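The selection rule can be sketched in plain Python as below. The sketch interprets the "accuracy derivative of adjacent epochs" as the absolute difference between consecutive accuracies, which is an assumption; the function name `select_epoch` and the accuracy history are made up for illustration.

```python
def select_epoch(acc, run=5, min_acc=0.99, max_step=0.02):
    """Return the 0-based index of the best epoch in the first stable run.

    acc: per-epoch training accuracy.
    A stable run is `run` consecutive epochs with accuracy > min_acc and
    |acc[i] - acc[i-1]| < max_step between adjacent epochs of the run.
    Returns None when no such run exists.
    """
    for start in range(len(acc) - run + 1):
        seg = acc[start:start + run]
        high = all(a > min_acc for a in seg)
        flat = all(abs(seg[i] - seg[i - 1]) < max_step for i in range(1, run))
        if high and flat:
            return start + max(range(run), key=lambda i: seg[i])
    return None

# Hypothetical accuracy history, for illustration only.
history = [0.90, 0.97, 0.992, 0.985, 0.991, 0.993, 0.996, 0.994, 0.995, 0.999]
best = select_epoch(history)
print(best)
```

Only the first qualifying segment is searched, so a later epoch with higher accuracy (the last entry in the toy history) is deliberately ignored.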

Evaluation of MiScan model

1. Whole-exome sequencing data from paracancerous tissue and blood samples of breast cancer patients were downloaded (data sources: the ICGC database and two studies: Rheinbay, Esther, et al., Recurrent and functional regulatory mutations in breast cancer, Nature, 2017; Gracia-Aznarez, Francisco Javier, et al., Whole exome sequencing suggests much of non-BRCA1/BRCA2 familial breast cancer is due to moderate and low penetrance susceptibility alleles, PLoS ONE, 2013), and the data were pre-processed to obtain VCF files containing the mutation information.

2. The downloaded patient data, together with the data of the remaining 501 normal persons (20%) from the 1000 Genomes Project, are used as the test data set.

3. The VCF files are converted to a window × sample binarized matrix, in which 1 indicates that the sample has a variation in the window.

4. The test set is predicted with MiScan, and a receiver operating characteristic (ROC) curve and a precision-recall (PR) curve are plotted to evaluate the classification performance of the MiScan model. For each method, the relevant metrics, namely test-set accuracy, sensitivity, specificity, area under the ROC curve (AUC) and average precision (AP), are calculated from the predicted values and the true labels.

5. The test set data are downsampled to test the robustness of the MiScan model in two ways: by extracting a given proportion of the mutation information directly from the VCF files, and by extracting a given amount of data directly from the raw sequencing data.

6. To evaluate the influence of individual genes on the model's predictions, training and testing are performed again after all windows corresponding to a given gene are removed, and the optimal model is selected under the same conditions to determine whether the performance of the model is affected.

The invention also provides a cancer prediction device based on non-cancer tissue mutation information, which comprises an input module, a data processing module and an output module connected in sequence. The data processing module comprises a plurality of hidden layers with Maxout activation, in which each node of the fully-connected layers performs the following linear transformation:

$$z_{ij} = x^{T} W_{ij} + b_{ij}$$

In the data processing module, Dropout is added between every two connected hidden layers to regularize the output of the preceding hidden layer, so that the overfitting that Maxout neurons may cause is avoided:

$$r \sim \mathrm{Bernoulli}(p), \qquad \tilde{h} = r \odot h$$

wherein $\tilde{h}$ is the output vector after random disconnection.

The two nodes output by the output module are normalized using a Softmax activation function.

Preferably, the normalization is performed as follows:

$$\sigma(z)_j = \frac{e^{z_j}}{\sum_{l=1}^{K} e^{z_l}}, \qquad j = 1, \ldots, K$$

The functional blocks in the present invention may be hardware, for example, the hardware may be a circuit, including a digital circuit, an analog circuit, and the like. Physical implementations of hardware structures include, but are not limited to, physical devices including, but not limited to, transistors, memristors, and the like. The computing module in the computing device may be any suitable hardware processor, such as a CPU, GPU, FPGA, DSP, ASIC, and the like. The memory unit may be any suitable magnetic or magneto-optical storage medium, such as RRAM, DRAM, SRAM, EDRAM, HBM, HMC, etc.

It will be clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to perform all or part of the above described functions.

Example 1

First, sample information

Whole-exome breast cancer data from published studies were downloaded from the web, comprising 152 peripheral blood samples and 360 normal paracancerous tissue samples (data sources: the ICGC database and two studies: Rheinbay, Esther, et al., Recurrent and functional regulatory mutations in breast cancer, Nature, 2017; Gracia-Aznarez, Francisco Javier, et al., Whole exome sequencing suggests much of non-BRCA1/BRCA2 familial breast cancer is due to moderate and low penetrance susceptibility alleles, PLoS ONE, 2013).

Second, the operation steps

1. Data pre-processing and prediction

(1) BWA software was used to align the raw WES FASTQ sequencing data to the reference genome hg19.

(2) Duplicate marking and base quality score recalibration (BQSR) were performed using Picard.

(3) SNVs and insertions/deletions (INDELs) were called using GATK.

(4) And filtering the searched SNV. The specific filtering parameters are set as follows:

QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0 || SOR > 4.0

wherein QD (QualByDepth) is the variant confidence divided by the unfiltered depth of non-reference reads; FS (FisherStrand) is the likelihood, assessed by Fisher's exact test, that the variant shows strand bias; MQ (RMSMappingQuality) is the root mean square of the mapping quality over all samples; MQRankSum assesses confidence by comparing the mapping qualities of reads supporting the reference and the variant; ReadPosRankSum (ReadPosRankSumTest) assesses the reliability of the variant from its position within the reads, since the error rate is higher at the two ends of a read; and SOR (StrandOddsRatio) gives an overall assessment of the likelihood of strand bias.

(5) The found INDELs are filtered. The specific filtering parameters are set as follows:

QD < 2.0 || FS > 200.0 || SOR > 10.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0

(6) The filtered SNV and INDEL files are merged into the final VCF file.

(7) The VCF file is converted to a window × sample matrix.

(8) The patient data (n = 512) and the data of the remaining 1000 Genomes Project samples (n = 501) are merged into the complete test set.

(9) The trained MiScan model is loaded, the test data are predicted, and a predicted probability value is given for each sample.
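The hard-filter expressions in steps (4) and (5) read as conditions under which a variant is discarded. The sketch below expresses them as a Python predicate over a dictionary of annotation values. It is not GATK code: the rule tables mirror the thresholds above, while the function name `fails_filter`, the handling of missing annotations and the example values are assumptions.

```python
SNV_FILTER = {"QD": ("<", 2.0), "FS": (">", 60.0), "MQ": ("<", 40.0),
              "MQRankSum": ("<", -12.5), "ReadPosRankSum": ("<", -8.0),
              "SOR": (">", 4.0)}
INDEL_FILTER = {"QD": ("<", 2.0), "FS": (">", 200.0), "SOR": (">", 10.0),
                "MQRankSum": ("<", -12.5), "ReadPosRankSum": ("<", -8.0)}

def fails_filter(info, rules):
    """True if any rule fires, i.e. the variant should be filtered out.
    Annotations missing from `info` are skipped (an assumption of this sketch)."""
    for key, (op, cutoff) in rules.items():
        value = info.get(key)
        if value is None:
            continue
        if (op == "<" and value < cutoff) or (op == ">" and value > cutoff):
            return True
    return False

# Hypothetical annotation values for two variants.
good = {"QD": 12.3, "FS": 1.2, "MQ": 60.0, "MQRankSum": 0.1,
        "ReadPosRankSum": 0.4, "SOR": 0.8}
bad = dict(good, FS=75.0)
print(fails_filter(good, SNV_FILTER), fails_filter(bad, SNV_FILTER),
      fails_filter(bad, INDEL_FILTER))
```

The same FS value of 75.0 fails the SNV rules but passes the looser INDEL rules, which is the practical difference between the two expressions.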

2. Performance testing

MiScan and other machine learning methods are tested and compared mainly from the aspects of accuracy, sensitivity, specificity, ROC curves, PR curves and the like.

3. Robustness testing

(1) Test based on SNV sampling: from the VCF files of the test set, 10% to 90% of the data (in steps of 10%) are randomly extracted, and data processing and prediction are then performed. This random process was repeated 100 times.

(2) Test based on different sequencing depths: 10% to 90% of the data (in steps of 10%) are randomly extracted from the raw sequencing files before variant calling, and data processing and prediction are then performed. This random process was repeated 10 times for each patient sample in the test set.
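The SNV-level sampling in test (1) can be sketched as follows; read-level downsampling for test (2) would be done on the raw sequencing files and is not shown. The function name `downsample_variants`, the seeding scheme and the variant list are hypothetical.

```python
import random

def downsample_variants(variants, fraction, seed=None):
    """Return a random subset holding about `fraction` of the variants."""
    rng = random.Random(seed)
    n_keep = round(len(variants) * fraction)
    return rng.sample(variants, n_keep)

variants = [("chr1", 1000 + 7 * i) for i in range(200)]  # hypothetical SNV calls
sizes = {}
for pct in range(10, 100, 10):                            # 10%, 20%, ..., 90%
    subset = downsample_variants(variants, pct / 100, seed=pct)
    sizes[pct] = len(subset)
print(sizes)
```

Each subset would then pass through the same window conversion and prediction steps as the full data, and the whole loop would be repeated to average over random draws.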

4. Assessment of Gene importance

(1) All windows corresponding to a given gene are removed, and the same model is trained and tested again on the reduced set of windows.

(2) Step (1) is repeated until all genes have been processed.

(3) The test results after deletion of each gene are sorted in ascending order by five metrics: accuracy, sensitivity, specificity, AUC and AP.

(4) Ten driver genes that are mutated at high frequency in breast cancer patients and 2 genes that cause hereditary susceptibility to breast cancer (BRCA1 and BRCA2) were selected, and their relative positions are indicated in the figure.
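A minimal sketch of the ablation bookkeeping is given below: dropping the rows of the window × sample matrix that belong to one gene, and ranking genes by a post-removal score. Retraining itself is omitted. The window-to-gene mapping, the function names and the scores are made up for illustration and are not results from the source.

```python
import numpy as np

def drop_gene_windows(X, window_genes, gene):
    """Remove all rows (windows) of X that belong to `gene`.

    X: (n_windows, n_samples) 0/1 matrix
    window_genes: sequence of gene names, one per row of X
    """
    keep = np.array([g != gene for g in window_genes])
    return X[keep]

def rank_genes(scores):
    """Sort genes by post-removal score, ascending."""
    return sorted(scores, key=scores.get)

X = np.arange(12).reshape(6, 2) % 2                       # toy 6-window matrix
window_genes = ["TP53", "TP53", "BRCA1", "PIK3CA", "BRCA1", "BRCA2"]
reduced = drop_gene_windows(X, window_genes, "BRCA1")
# Hypothetical post-removal accuracies, for illustration only.
ranking = rank_genes({"TP53": 0.95, "BRCA1": 0.97, "PIK3CA": 0.93})
print(reduced.shape, ranking)
```

In an ascending ranking, genes whose removal hurts the model most appear first, which is how the relative positions of the marked genes can be read.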

Third, result summary

1. Model construction and training

As can be seen from FIG. 3, the model converges gradually during training. After about 100 epochs (training periods), training tends to stabilize. The first segment of 5 consecutive epochs in which the training accuracy is greater than 99% and the change in accuracy is less than 0.02 is called the stable segment (the segment between the two dashed lines). Within this stable segment, the model with the highest training accuracy (epoch 113) was selected as the final MiScan model.

2. Model performance assessment

To evaluate the performance of the model, MiScan was compared with other popular machine learning methods, including three single models, namely decision tree (DT), K-nearest neighbors (KNN) and support vector machine (SVM), and two ensemble methods, namely random forest (RF) and gradient boosting decision tree (GBDT). MiScan has the highest accuracy, 97%, while the prediction accuracies of the other machine learning methods are 86% (SVM), 73% (DT), 63% (RF), 56% (GBDT) and 49% (KNN) (FIG. 4, left).

Furthermore, MiScan predicts patients and normal individuals with high sensitivity (100%) and high specificity (95%), respectively, outperforming the other methods, whereas some methods, such as KNN and GBDT, wrongly predict most normal individuals as patients. Only MiScan achieves the best prediction without any prediction bias (FIG. 4, right).

Meanwhile, as can be seen from the ROC and PR curves, the MiScan model has the strongest classification performance; its AUC and AP values reach 0.994 and 0.989, respectively, significantly better than the other methods (FIG. 5).

Further, the distribution of the probabilities predicted by the MiScan model was analyzed on the training and test data sets (FIG. 6). The figure clearly shows that the MiScan model fits the training set well, and that the predicted probability values of the patients and the normal persons in the test set are concentrated around 0 and 1, respectively, which shows that the MiScan model can clearly distinguish patients from normal persons. All evaluation metrics for the predictions on all test samples are shown in Table 1.

The above results demonstrate that MiScan has the best classification performance and predictive ability compared with the other machine learning methods, with no bias toward predicting either patients or normal persons.

Table 1: MiScan and other methods prediction Performance statistics

3. Robustness testing

The objective of the robustness test is to evaluate the anti-interference performance of the method and its predictive effect on low-quality data, and thereby to determine whether the method can be applied to early cancer screening at a reasonable cost in the future. The invention first performs SNV downsampling on the VCF file of each sample in the test set. The inventors noticed that even when the number of mutations covered by the data is as low as 10% of the original data, MiScan still has very high discriminative power, with AUC and AP values higher than 0.99, whereas SVM and RF, which otherwise show relatively good classification performance, deteriorate steadily as the number of SNVs covered by the data decreases. This indicates that the MiScan model is highly stable (FIG. 7, top). MiScan also showed the best stability in prediction accuracy at the various proportions of SNV input (FIG. 7, bottom).
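SNV downsampling of a VCF file can be sketched as below. This is a hedged illustration, not the procedure actually used: the function name `downsample_vcf_lines` is hypothetical, every non-header line is assumed to be one SNV record, and records are assumed to be dropped uniformly at random.

```python
import random
from typing import Iterable, List


def downsample_vcf_lines(
    lines: Iterable[str], keep_fraction: float, seed: int = 0
) -> List[str]:
    """Keep all header lines and a random fraction of the variant records.

    Assumptions: header lines start with '#', each remaining line is one SNV
    record, and sampling is uniform without replacement.
    """
    rng = random.Random(seed)
    header: List[str] = []
    records: List[str] = []
    for line in lines:
        (header if line.startswith("#") else records).append(line)

    n_keep = round(len(records) * keep_fraction)
    # Sort the sampled indices so the surviving records keep their order.
    kept_idx = sorted(rng.sample(range(len(records)), n_keep))
    return header + [records[i] for i in kept_idx]


if __name__ == "__main__":
    # Invented toy VCF content with 10 records.
    vcf = ["##fileformat=VCFv4.2", "#CHROM\tPOS\tID\tREF\tALT"]
    vcf += [f"chr1\t{100 + i}\t.\tA\tG" for i in range(10)]
    for line in downsample_vcf_lines(vcf, keep_fraction=0.1):
        print(line)
```

Calling the function with `keep_fraction` values such as 0.1, 0.2, and so on up to 1.0 would give the series of reduced inputs on which the methods are compared.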

Since sequencing depth is one of the key factors affecting SNV detection, the present invention also examined the performance of these methods at low sequencing depths. The raw sequencing data in the test set were downsampled so that low-quality data could be simulated more realistically. The inventors noted that even when the amount of sequencing data was as low as one million reads (1M), MiScan was able to identify these patients with high sensitivity. In contrast, the performance of the other machine learning methods declined dramatically with decreasing sequencing depth, confirming the robustness of MiScan relative to these models (FIG. 8). Since the genomic variation information of peripheral blood is very easy to obtain, the high accuracy and robustness of MiScan make it well suited to early diagnosis of breast cancer.

4. Assessment of gene importance

Deep learning, while powerful, has an inherent black-box nature that makes it difficult for researchers to study associations between internal features. For the Maxout model, it is therefore important to evaluate the contribution of each gene to the model. To evaluate the contribution of each gene's variation to disease identification, the invention designs an algorithm for evaluating gene weight, whose basic principle is as follows: for a specific gene, all mutations corresponding to that gene are removed, and model training and prediction are carried out again (the resulting model is denoted a defect model). All defect models are then ranked according to different indices to quantify the effect of each gene.
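The principle of this gene-weight algorithm can be sketched as a leave-one-gene-out loop. The names `rank_genes_by_ablation`, `feature_to_gene` and `train_and_score` are hypothetical; in particular, `train_and_score` stands in for retraining the model on the remaining mutation features and returning one evaluation index (for example test accuracy), which is not reproduced here.

```python
from typing import Callable, List, Mapping, Tuple


def rank_genes_by_ablation(
    feature_to_gene: Mapping[str, str],
    train_and_score: Callable[[List[str]], float],
) -> List[Tuple[str, float]]:
    """Train one defect model per gene and rank the genes by its score.

    feature_to_gene maps each mutation feature to the gene it belongs to.
    train_and_score(kept_features) is a caller-supplied placeholder that
    retrains and evaluates a model using only kept_features.
    Genes whose removal hurts most (lowest score) come first.
    """
    results: List[Tuple[str, float]] = []
    for gene in sorted(set(feature_to_gene.values())):
        # Remove every mutation feature that belongs to this gene.
        kept = [f for f, g in feature_to_gene.items() if g != gene]
        results.append((gene, train_and_score(kept)))
    results.sort(key=lambda item: item[1])
    return results


if __name__ == "__main__":
    # Toy stand-in for retraining: the score is a sum of invented weights.
    weights = {"mut_1": 0.30, "mut_2": 0.10, "mut_3": 0.05}
    mapping = {"mut_1": "GENE_A", "mut_2": "GENE_A", "mut_3": "GENE_B"}
    print(rank_genes_by_ablation(mapping, lambda kept: sum(weights[f] for f in kept)))
```

Running the loop once per evaluation index (accuracy, AUC, AP) would give the several rankings that the text refers to.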

Surprisingly, although certain breast cancer-related genes (e.g., PIK3CA, MAP3K1, BRCA1, etc.) may contribute more to prediction accuracy than ordinary genes, ignoring any single gene does not significantly reduce the classification and prediction ability of MiScan (FIG. 9). More interestingly, deleting any single gene reduced the prediction accuracy from 99% to 91% at most, and the classification indices (AUC and AP) always remained higher than 0.99, indicating that mutations in a few driver genes alone may not increase the likelihood of cancer. In contrast, the accumulation and synergy of mutations at many hot spots is more critical. This may also reflect the heterogeneity and complexity of cancer development and progression.

The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention and are not intended to limit the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A cancer prediction system based on non-cancerous tissue mutation information, said cancer prediction system comprising an input layer, a plurality of hidden layers and an output layer connected in sequence;

z_ij = x^T W_ij + b_ij
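As an illustration of the linear term above, the following sketch computes the output of a layer of Maxout units. It assumes the standard Maxout definition, in which unit i outputs the maximum of its linear pieces z_ij over j; that step is not shown in the text here, and the function name `maxout_layer` and the toy weights are hypothetical.

```python
from typing import List, Sequence


def maxout_layer(
    x: Sequence[float],
    W: Sequence[Sequence[Sequence[float]]],
    b: Sequence[Sequence[float]],
) -> List[float]:
    """Forward pass of a Maxout layer.

    x        : input vector of length d
    W[i][j]  : weight vector (length d) of piece j of unit i
    b[i][j]  : bias of piece j of unit i
    Each piece is z_ij = x^T W_ij + b_ij; the unit output is assumed to be
    max_j z_ij (standard Maxout).
    """
    outputs: List[float] = []
    for W_i, b_i in zip(W, b):
        z_i = [
            sum(x_k * w_k for x_k, w_k in zip(x, w_ij)) + b_ij
            for w_ij, b_ij in zip(W_i, b_i)
        ]
        outputs.append(max(z_i))
    return outputs


if __name__ == "__main__":
    # Two units with two linear pieces each; all values are invented.
    x = [1.0, 2.0]
    W = [[[1.0, 0.0], [0.0, 1.0]], [[-1.0, -1.0], [0.5, 0.5]]]
    b = [[0.0, 0.0], [0.0, 1.0]]
    print(maxout_layer(x, W, b))
```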

2. The cancer prediction system of claim 1, wherein Dropout is added to every two consecutive hidden layers to regularize the output of the previous hidden layer to avoid overfitting phenomena that may be caused by Maxout neurons;

r ~ Bernoulli(p)

is the output vector after random disconnection.
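The random disconnection in claim 2 can be illustrated with the sketch below. It assumes that p is the probability of keeping a node (r = 1) and that the output vector is the element-wise product of the mask r and the previous layer's output; the function name `dropout` is hypothetical, and any rescaling used during training or inference is omitted.

```python
import random
from typing import List, Sequence, Tuple


def dropout(
    y: Sequence[float], p: float, rng: random.Random
) -> Tuple[List[float], List[int]]:
    """Apply a Bernoulli(p) mask to the output vector y of a hidden layer.

    Assumption: p is the probability that a node is kept (r_k = 1).
    Returns the masked vector and the mask itself.
    """
    r = [1 if rng.random() < p else 0 for _ in y]
    masked = [r_k * y_k for r_k, y_k in zip(r, y)]
    return masked, r


if __name__ == "__main__":
    rng = random.Random(0)
    print(dropout([1.0, 2.0, 3.0, 4.0], p=0.5, rng=rng))
```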

3. A cancer prediction system as claimed in claim 1, wherein the two nodes of the output layer output are normalised using a Softmax activation function, preferably by:

4. A method of constructing a cancer prediction system according to any one of claims 1 to 3, comprising:

constructing the cancer prediction system comprising an input layer, a plurality of hidden layers and an output layer connected in sequence;

and training the system by using a training set, and selecting the system with the highest accuracy in the first convergence interval as the final cancer prediction system.

5. The construction method according to claim 4, wherein the window has a length of 50-300bp (e.g., 100bp, 150bp, 200bp, 250bp, etc.).

6. The method of construction of claim 4, wherein the method further comprises the step of evaluating the cancer prediction system.

7. The construction method of claim 6, wherein the classification performance of the cancer prediction system is evaluated by plotting a Receiver Operating Characteristic (ROC) curve and a Precision-Recall (PR) curve, preferably the evaluation index is selected from one or more of test set accuracy (accuracy), sensitivity (sensitivity), specificity (specificity), area under ROC curve (AUC), and Average Precision (AP).

8. A cancer prediction device based on non-cancer tissue mutation information comprises an input module, a data processing module and an output module which are connected in sequence;

z_ij = x^T W_ij + b_ij

9. The cancer prediction apparatus of claim 8, wherein Dropout is added to every two consecutive hidden layers to regularize the output of the previous hidden layer to avoid overfitting phenomena that may be caused by Maxout neurons;

r ~ Bernoulli(p)

is the output vector after random disconnection.

10. The cancer prediction apparatus of claim 8, wherein the two nodes output by the output module are normalized by a Softmax activation function, and preferably the normalization is performed by: