CN116168762B

CN116168762B - Computer readable storage medium and device for predicting medulloblastoma typing by low depth whole genome sequencing technique and application thereof

Info

Publication number: CN116168762B
Application number: CN202310454365.0A
Authority: CN
Inventors: 马景娇
Original assignee: Genetron Health Beijing Co ltd
Current assignee: Genetron Health Beijing Co ltd
Priority date: 2023-04-25
Filing date: 2023-04-25
Publication date: 2023-06-27
Anticipated expiration: 2043-04-25
Also published as: CN116168762A

Abstract

The invention discloses a computer-readable storage medium and a device for predicting medulloblastoma typing by low-depth whole-genome sequencing, in the field of diagnosis, and applications thereof. The technical problem addressed is the molecular typing of medulloblastoma based on low-depth whole-genome DNA sequencing data. The computer-readable storage medium of the invention for predicting the molecular subtype of a medulloblastoma patient stores a computer program that causes a computer to execute the following steps: constructing, with a machine learning algorithm, a molecular typing model of medulloblastoma based on 48 features (high-frequency CNV detection results derived from sequencing data of medulloblastoma samples of known molecular subtype, plus age and sex); and predicting the molecular subtype with the constructed model from the same 48 features obtained by analyzing low-depth whole-genome sequencing data of the medulloblastoma patient to be tested. The method places low demands on the test sample and achieves high prediction accuracy.

Description

Computer readable storage medium and device for predicting medulloblastoma typing by low depth whole genome sequencing technique and application thereof

Technical Field

The present invention relates to a computer readable storage medium and apparatus for predicting medulloblastoma genotyping by low depth whole-genome sequencing technique in the diagnostic field and applications thereof.

Background

Medulloblastoma is the most common brain tumor in children and has a high mortality rate. In 2010, the scientific community reached a consensus on its molecular subtypes (WNT, SHH, Group 3 and Group 4) at a consensus conference held in Boston. Studies have shown that the molecular subtypes differ in genotypic characteristics and prognosis. At present, molecular typing of medulloblastoma from transcriptome (RNA expression) data measured on tumor samples with NanoString technology is widely accepted. However, RNA degrades during long-term storage of tumor samples, which raises the failure rate of the analysis, and NanoString technology has low throughput and a high detection cost.

Disclosure of Invention

The invention aims to solve the technical problem of molecular typing of medulloblastoma based on DNA whole genome low-depth sequencing data.

To solve the above technical problem, the present invention firstly provides a computer-readable storage medium predicting molecular subtypes of a medulloblastoma patient, the computer-readable storage medium storing a computer program causing a computer to execute the steps of:

a1 Obtaining sequencing data, age characteristics and sex characteristics of known molecular subtype medulloblastoma samples;

a2 Performing CNV detection on sequencing data of the known molecular subtype medulloblastoma sample to obtain a CNV result of the known molecular subtype medulloblastoma sample, and performing high-frequency CNV detection on the CNV result of the known molecular subtype medulloblastoma sample to obtain an arm-level SCNA and a focal SCNA result of the known molecular subtype medulloblastoma sample;

a3 Based on 48 features, constructing a molecular typing model of the medulloblastoma by using a machine learning algorithm; the 48 features include 36 chromosome-arm-level copy number variation features and 10 gene-level copy number variation features, as well as age and sex features, obtained from the arm-level SCNA and focal SCNA results; the 36 chromosome arms include: 1p,1q,2p,2q,3p,3q,4p,4q,5p,5q,6p,6q,7p,7q,8p,8q,9p,9q,10p,10q,11p,11q,13q,14q,16p,16q,17p,17q,18p,18q,19p,19q,20q,21p,21q and 22q; the p represents a short chromosome arm, and the q represents a long chromosome arm; the 10 genes include: MYCN, GLI2, MYC, PVT1, OTX2, SCAPER, WWOX, SIRPB1, PTCH1 and SMYD4;

a4 Obtaining sequencing data of a patient with the medulloblastoma to be detected; comparing the sequencing data to a reference genome to obtain a comparison result file;

a5 Performing CNV detection on the comparison result file to obtain CNV results of the patient with the medulloblastoma to be detected; combining the CNV result of the patient with the medulloblastoma to be detected with the CNV result of the known molecular subtype medulloblastoma sample to obtain a combined CNV result; detecting the combined CNV result to obtain a high-frequency CNV result; extracting an arm-level SCNA and a focal SCNA result of a patient with the medulloblastoma to be detected based on the high-frequency CNV result;

a6 Extracting the 48 features of the patient with the medulloblastoma to be tested based on the arm-level SCNA and focal SCNA results and age and sex information of the patient with the medulloblastoma to be tested, and predicting the molecular subtype of the patient with the medulloblastoma to be tested based on the 48 features of the patient with the medulloblastoma to be tested.

In the above computer-readable storage medium, 1p may specifically refer to the short arm of chromosome 1, and 1q to the long arm of chromosome 1. The chromosome-arm-level copy number variation feature may comprise a chromosome arm deletion feature or a chromosome arm amplification feature. The 10 genes may specifically be MYCN (NC_000002.12, Feb 3, 2014), GLI2 (NC_000002.12, Feb 3, 2014), MYC (NG_007161.2, Sep 20, 2017), PVT1 (NC_000008.11, Feb 3, 2014), OTX2 (NC_000014.9, Feb 3, 2014), SCAPER (NC_000015.10, Feb 3, 2014), WWOX (NC_000016.10, Feb 3, 2014), SIRPB1 (NC_000020.11, Feb 3, 2014), PTCH1 (NC_000009.12, Feb 3, 2014) and SMYD4 (NC_000017.11, Feb 3, 2014). The gene-level copy number variation feature may comprise a gene deletion feature or a gene amplification feature.
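The 48-feature layout described above can be sketched as follows. This is a minimal illustration only: the arm and gene names come from the text, while the column order, the +1/-1/0 coding of amplification/deletion/no change, and the 0/1 coding of sex are assumptions that the patent does not specify.

```python
# Arm and gene names are taken from the text; ordering and numeric coding
# are hypothetical choices made for this sketch.
ARMS = [
    "1p", "1q", "2p", "2q", "3p", "3q", "4p", "4q", "5p", "5q",
    "6p", "6q", "7p", "7q", "8p", "8q", "9p", "9q", "10p", "10q",
    "11p", "11q", "13q", "14q", "16p", "16q", "17p", "17q",
    "18p", "18q", "19p", "19q", "20q", "21p", "21q", "22q",
]
GENES = ["MYCN", "GLI2", "MYC", "PVT1", "OTX2",
         "SCAPER", "WWOX", "SIRPB1", "PTCH1", "SMYD4"]
CLINICAL = ["age", "sex"]

FEATURE_NAMES = ARMS + GENES + CLINICAL  # 36 + 10 + 2 = 48 columns


def encode_sample(arm_calls, gene_calls, age, sex):
    """Return a 48-element numeric feature vector for one sample.

    arm_calls / gene_calls map a name to +1 (amplification) or -1 (deletion);
    names that are absent are coded 0 (no change). Sex is coded 1.0 for "M"
    and 0.0 otherwise. All codings here are assumptions.
    """
    vec = [float(arm_calls.get(arm, 0)) for arm in ARMS]
    vec += [float(gene_calls.get(gene, 0)) for gene in GENES]
    vec += [float(age), 1.0 if sex == "M" else 0.0]
    return vec
```

For example, `encode_sample({"17p": -1}, {"MYC": 1}, 7, "M")` yields a vector with a deletion flag in the 17p column, an amplification flag in the MYC column, and the age and sex values in the last two columns.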

In the above computer-readable storage medium, the sequencing data of the patient with medulloblastoma to be tested may be low-depth whole-genome sequencing data. The low depth may be a sequencing depth of 2× or more.
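As a rough illustration of the depth requirement, mean depth can be estimated as the number of reads times the read length divided by the genome size. The sketch below assumes a human genome size of about 3.1 Gb; the function names and the read counts used in the example are hypothetical and do not come from the patent.

```python
def mean_depth(n_reads, read_length_bp, genome_size_bp=3.1e9):
    """Approximate mean sequencing depth: total sequenced bases / genome size.

    genome_size_bp defaults to ~3.1 Gb, an assumed round figure for the
    human genome.
    """
    return n_reads * read_length_bp / genome_size_bp


def meets_minimum_depth(depth, minimum=2.0):
    """True if the depth satisfies the 2x-or-more condition stated in the text."""
    return depth >= minimum
```

Under these assumptions, 62 million reads of 100 bp correspond to a mean depth of about 2×.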

In the above computer-readable storage medium, the machine learning algorithm may be naive Bayes, random forest, the AdaBoost iterative algorithm, logistic regression, or a support vector machine.

In the above computer-readable storage medium, the machine learning algorithm may specifically be a support vector machine. The kernel function of the support vector machine may be a linear kernel (kernel='linear'). The remaining parameters of the support vector machine may be default parameters. The parameters of the naive Bayes, random forest, AdaBoost iterative algorithm, or logistic regression may be default parameters.

In the above computer-readable storage medium, the sequencing data of the known molecular subtype medulloblastoma sample may be chip sequencing data (SNP array data). The CNV detection may use DNAcopy and/or ReadDepth software; the high-frequency CNV detection may use GISTIC2 software.

In order to solve the above technical problem, the present invention also provides a device for predicting molecular subtype of a medulloblastoma patient, which may include the following modules:

b1 Known medulloblastoma sample data acquisition module: sequencing data, age characteristics and sex characteristics for obtaining samples of known molecular subtypes of medulloblastoma;

b2 Known high frequency CNV detection module of medulloblastoma samples: the method comprises the steps of detecting and obtaining CNV results of a known molecular subtype medulloblastoma sample based on sequencing data of the known molecular subtype medulloblastoma sample, and performing high-frequency CNV detection based on the CNV results of the known molecular subtype medulloblastoma sample to obtain arm-level SCNA and focal SCNA results of the known molecular subtype medulloblastoma sample;

b3 Medulloblastoma classification model building module: used for constructing a molecular typing model of medulloblastoma based on 48 features by using a machine learning algorithm; the 48 features include 36 chromosome-arm-level copy number variation features and 10 gene-level copy number variation features, as well as age and sex features, obtained from the arm-level SCNA and focal SCNA results; the 36 chromosome arms include: 1p,1q,2p,2q,3p,3q,4p,4q,5p,5q,6p,6q,7p,7q,8p,8q,9p,9q,10p,10q,11p,11q,13q,14q,16p,16q,17p,17q,18p,18q,19p,19q,20q,21p,21q and 22q; p represents the short chromosome arm, and q represents the long chromosome arm; the 10 genes include: MYCN, GLI2, MYC, PVT1, OTX2, SCAPER, WWOX, SIRPB1, PTCH1 and SMYD4;

b4 CNV detection module for patient with medulloblastoma to be detected: the method comprises the steps of obtaining sequencing data of a patient with the medulloblastoma to be tested; obtaining a comparison result file of a patient with the medulloblastoma to be detected based on the comparison result of the sequencing data and the reference genome; detecting CNV results of a patient with the medulloblastoma to be detected based on the comparison result file;

b5 High frequency CNV detection module: the CNV result of the patient with the medulloblastoma to be detected and the CNV result of the known molecular subtype medulloblastoma sample are combined to obtain a combined CNV result; detecting a high frequency CNV result of the combined CNV; extracting an arm-level SCNA and a focal SCNA result of a patient with the medulloblastoma to be detected based on the high-frequency CNV result;

b6 Molecular subtype prediction module of patient with medulloblastoma to be detected: the method is used for extracting 48 characteristics of the patient with the medulloblastoma to be detected based on the arm-level SCNA and focal SCNA results of the patient with the medulloblastoma to be detected, and predicting molecular subtype typing of the patient with the medulloblastoma to be detected based on the 48 characteristics of the patient with the medulloblastoma to be detected.

In the above device, 1p may specifically refer to the short arm of chromosome 1, and 1q to the long arm of chromosome 1. The chromosome-arm-level copy number variation feature may comprise a chromosome arm deletion feature or a chromosome arm amplification feature. The 10 genes may specifically be MYCN (NC_000002.12, Feb 3, 2014), GLI2 (NC_000002.12, Feb 3, 2014), MYC (NG_007161.2, Sep 20, 2017), PVT1 (NC_000008.11, Feb 3, 2014), OTX2 (NC_000014.9, Feb 3, 2014), SCAPER (NC_000015.10, Feb 3, 2014), WWOX (NC_000016.10, Feb 3, 2014), SIRPB1 (NC_000020.11, Feb 3, 2014), PTCH1 (NC_000009.12, Feb 3, 2014) and SMYD4 (NC_000017.11, Feb 3, 2014). The gene-level copy number variation feature may comprise a gene deletion feature or a gene amplification feature.

In the above device, the sequencing data of the patient with medulloblastoma to be tested may be low-depth whole-genome sequencing data; the low depth may be a sequencing depth of 2× or more.

In the above device, the machine learning algorithm may be naive Bayes, random forest, the AdaBoost iterative algorithm, logistic regression, or a support vector machine.

In the above device, the machine learning algorithm may specifically be a support vector machine; the kernel function of the support vector machine may be a linear kernel (kernel='linear'). The remaining parameters of the support vector machine may be default parameters. The parameters of the naive Bayes, random forest, AdaBoost iterative algorithm, or logistic regression may be default parameters.

In the above device, the sequencing data of the known molecular subtype medulloblastoma sample may be chip sequencing data (SNP array data). The CNV detection may use DNAcopy and/or ReadDepth software; the high-frequency CNV detection may use GISTIC2 software.

Any of the following applications of the computer readable storage medium described above are also within the scope of the present invention:

c1 Use in the preparation of a product for predicting the molecular subtype of a patient with medulloblastoma;

c2 Use in developing or preparing a medicament for treating or alleviating medulloblastoma;

c3 Use in the development or manufacture of a product for guiding medication in medulloblastoma;

c4 Use in the preparation of a product for predicting the prognosis of a patient with medulloblastoma by molecular subtype.

Any of the following applications of the device described above is also within the scope of the present invention:

d1 Use in the preparation of a product for predicting the molecular subtype of a patient with medulloblastoma;

d2 Use in developing or preparing a medicament for treating or alleviating medulloblastoma;

d3 Use in the development or manufacture of a product for guiding medication in medulloblastoma;

d4 Use in the preparation of a product for predicting the prognosis of a patient with medulloblastoma by molecular subtype.

The invention provides a method based on low-depth (2× or greater) whole-genome DNA sequencing that uses a machine learning algorithm to perform molecular typing of medulloblastoma.

The technical scheme of the invention comprises two parts, model training and sample detection. The model training part is carried out with public data, and the sample detection part is applied to clinically collected medulloblastoma samples.

The model training part is as follows. The invention retrieved the GSE37385 data set (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE37385) from GEO (Gene Expression Omnibus), comprising 1097 medulloblastoma SNP array CEL data and clinical information. After a series of quality evaluations, 800 samples were retained and divided into a training set (640 samples) and a test set (160 samples) for the machine learning model. In addition, 32 medulloblastoma samples were collected from Robinson et al. (Robinson, G., Parker, M., Kranenburg, T. et al. Novel mutations target distinct subgroups of medulloblastoma. Nature 488, 43-48 (2012)) as a validation set.

For the samples of the training, test and validation sets above, PennCNV software (Wang K, Li M, Hadley D, et al. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Research 17:1665-1674, 2007) was used to calculate the Log R Ratio (LRR) and B Allele Frequency (BAF) of each sample. These values were used as input to DNAcopy software (Seshan VE, Olshen A. DNAcopy: DNA copy number data analysis. R package version 1.72.3), which analyzed copy number variation (CNV) to obtain the CNV result of each sample.

GISTIC2 software (Mermel CH, Schumacher SE, Hill B, Meyerson ML, et al. GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancer. Genome Biol. 2011;12(4):R41) was used to analyze the CNV data of the samples and obtain the high-frequency CNV results of each sample, namely arm-level SCNAs (arm-level somatic copy number alterations, defined as copy number changes spanning 50% or more of the chromosome arm length) and focal SCNAs (focal somatic copy number alterations, defined as copy number changes spanning less than 50% of the chromosome arm length; in the present invention, focal SCNAs are restricted to gene-level length). High-frequency CNV analysis with GISTIC2 requires a large sample cohort to construct the background mutation frequency. The 640 training samples were therefore analyzed as one cohort, and each sample of the test and validation sets was combined with the 640 training samples into a 641-sample cohort for GISTIC2 analysis, yielding the arm-level SCNAs and focal SCNAs of each sample.
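The 50% threshold that separates arm-level from focal SCNAs can be written as a small helper. This is a sketch of the stated definition only, not of how GISTIC2 implements it internally; the function name and arguments are hypothetical, and the additional gene-level restriction on focal SCNAs is not modeled.

```python
def classify_scna(altered_length_bp, arm_length_bp):
    """Label a copy number change by the fraction of the arm it spans.

    Returns "arm-level" when the altered length is at least 50% of the
    chromosome arm length, otherwise "focal".
    """
    if arm_length_bp <= 0:
        raise ValueError("arm_length_bp must be positive")
    fraction = altered_length_bp / arm_length_bp
    return "arm-level" if fraction >= 0.5 else "focal"
```

A change covering exactly half of an arm is labeled arm-level, because the definition uses "equal to or greater than 50%".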

The high-frequency CNVs obtained by GISTIC2, together with age and sex, were used as input feature data. With five-fold cross-validation (training set 640 samples, test set 160 samples), AdaBoost was used to select the 48 most important features. Based on these 48 features, five machine learning classification algorithms, namely Naive Bayes, Random Forest, AdaBoost, Logistic Regression and Support Vector Machine (SVM), were each used to classify the samples into the four subtypes WNT, SHH, Group3 and Group4, and AUROC was used to evaluate the performance of each algorithm. The results show that the overall performance of the SVM classification model is superior to that of the other four, so SVM was finally selected as the algorithm of the medulloblastoma molecular typing model.
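For a four-class problem, AUROC has to be computed per class or averaged in some way. The text does not say which scheme was used, so the sketch below assumes a one-vs-rest scheme and computes each AUROC as the probability that a positive sample scores higher than a negative one (ties count one half). Function names are hypothetical.

```python
def auroc(scores, labels):
    """AUROC via pairwise comparison of positive and negative scores."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def one_vs_rest_auroc(prob_rows, true_labels, classes):
    """Per-class AUROC; prob_rows[i][k] is the probability of classes[k].

    One-vs-rest is an assumption of this sketch, not a statement of the
    scheme used in the patent.
    """
    result = {}
    for k, cls in enumerate(classes):
        scores = [row[k] for row in prob_rows]
        labels = [t == cls for t in true_labels]
        result[cls] = auroc(scores, labels)
    return result
```

The quadratic pairwise loop is kept for clarity; at the sample sizes mentioned here (hundreds of samples) it is fast enough.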

The sample detection part is as follows. For a clinical medulloblastoma tissue sample to be tested, sequencing data are obtained and, once the data pass quality control, analyzed to obtain CNV data. The sample is combined with the 640 training samples into one cohort, and GISTIC2 analysis yields the arm-level SCNA and focal SCNA features of the sample. The age and gender of the patient are added, and the SVM classification model performs molecular typing, giving a probability for each of the molecular subtypes WNT, SHH, Group3 and Group4. If the difference between the two largest probabilities is greater than or equal to 0.1, the final result is the subtype with the largest probability; if the difference is less than 0.1, the molecular typing result cannot be determined.
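The 0.1-margin decision rule can be sketched in a few lines. The function name and the example probabilities are hypothetical; the difference is rounded before comparison only to guard against floating-point noise at the boundary, which is an implementation choice of this sketch.

```python
def call_subtype(probabilities, min_margin=0.1):
    """Return the top subtype, or None when the call cannot be determined.

    probabilities maps subtype name -> probability. The call is made only
    if the largest probability exceeds the second largest by at least
    min_margin.
    """
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    (best, p1), (_, p2) = ranked[0], ranked[1]
    # Rounding avoids rejecting a margin of exactly 0.1 due to float error.
    if round(p1 - p2, 9) >= min_margin:
        return best
    return None
```

With illustrative probabilities of 0.70 for SHH and 0.15 for Group4 the function returns "SHH"; with 0.45 and 0.40 it returns None, meaning the typing result is undetermined.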

Due to the adoption of the technical scheme, the invention has the following advantages:

1. Only low-depth WGS data from DNA are needed, so the requirements on the sample are low and samples are easy to obtain;

2. The SVM, which performed best among the five machine learning models evaluated, is used, so prediction accuracy is high.

The DNA sequencing data described above may be sequencing data having a sequencing depth of 2 x or greater.

Drawings

Fig. 1 is a model feature selection diagram. The abscissa is the features used by the model and the ordinate is the feature importance index of the AdaBoost analysis.

Fig. 2 is an AUROC picture of five classification models. A is the ROC curve of the SMB-NB, the left graph is a training set sample, the middle graph is a test set sample, and the right graph is a verification set sample; b is the ROC curve of SMB-RF, left graph is training set sample, middle graph is test set sample, right graph is validation set sample; c is the ROC curve of the SMB-AB, the left graph is a training set sample, the middle graph is a test set sample, and the right graph is a verification set sample; d is the ROC curve of SMB-LR, the left graph is a training set sample, the middle graph is a test set sample, and the right graph is a verification set sample; e is the ROC curve of the SMB-SVM, the left graph is a training set sample, the middle graph is a test set sample, and the right graph is a verification set sample. The ordinate is true positive rate, and the abscissa is false positive rate.

Fig. 3 shows the prediction accuracy of the five classification models, both when all samples are predicted and when samples whose two largest subtype probabilities differ by less than 0.1 are treated as unpredictable.

FIG. 4 shows one example of clinical molecular typing results of medulloblastoma.

Detailed Description

The following detailed description of the invention is provided in connection with the accompanying drawings that are presented to illustrate the invention and not to limit the scope thereof. The examples provided below are intended as guidelines for further modifications by one of ordinary skill in the art and are not to be construed as limiting the invention in any way.

The experimental methods in the following examples, unless otherwise specified, are conventional methods, and are carried out according to techniques or conditions described in the literature in the field or according to the product specifications. Materials, reagents and the like used in the examples described below are commercially available unless otherwise specified.

Example 1, construction and validation of a medulloblastoma typing model.

The invention obtains data from a public database and divides them into a training set, a test set and a validation set; a model is trained with a machine learning algorithm. This example describes the construction of the model and the verification of its performance.

1. Copy Number Variation (CNV) of the samples was detected.

The GSE37385 data set (https://www.ncbi.nlm.nih.gov/geo/query) was retrieved from the GEO (Gene Expression Omnibus) database, comprising a total of 1097 medulloblastoma SNP array CEL data (chip sequencing data) and clinical information. After a series of quality evaluations, the data of 800 samples were retained for constructing the machine learning model; the 800 medulloblastoma samples were randomly divided into a training set (640 samples) and a test set (160 samples) at a 4:1 ratio. In addition, data for 32 medulloblastoma samples were collected from Robinson et al. (Robinson, G., Parker, M., Kranenburg, T. et al. Novel mutations target distinct subgroups of medulloblastoma. Nature 488, 43-48 (2012)) as a validation set.
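A random 4:1 split of 800 sample identifiers can be sketched as below. The seed, the function name and the use of a simple (non-stratified) shuffle are assumptions; the text does not state how the random division was performed.

```python
import random


def split_train_test(sample_ids, train_fraction=0.8, seed=0):
    """Shuffle sample IDs reproducibly and split them into train and test.

    train_fraction=0.8 gives the 4:1 ratio; the seed is a hypothetical
    value used only to make the sketch deterministic.
    """
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]
```

Calling `split_train_test(range(800))` returns 640 training IDs and 160 test IDs with no overlap.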

For the data sets described above (training set, test set and validation set), PennCNV software (Wang K, Li M, Hadley D, et al. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Research 17:1665-1674, 2007) was used to calculate the Log R Ratio (LRR) and B Allele Frequency (BAF) of each sample. With LRR and BAF as input data, the copy number variation (CNV) of each sample was detected using DNAcopy software (Seshan VE, Olshen A. DNAcopy: DNA copy number data analysis. R package version 1.72.3).

2. High frequency CNV of the sample is detected.

GISTIC2 software (Mermel CH, Schumacher SE, Hill B, Meyerson ML, et al. GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancer. Genome Biol. 2011;12(4):R41) was used to analyze the CNV data of the training set, test set and validation set samples, respectively, to obtain the high-frequency CNV results of each sample, namely arm-level SCNAs (arm-level somatic copy number alterations, defined as copy number changes spanning 50% or more of the chromosome arm length) and focal SCNAs (focal somatic copy number alterations, defined as copy number changes spanning less than 50% of the chromosome arm length; in the present invention, focal SCNAs are restricted to gene-level length). GISTIC2 analysis needs a large sample cohort to construct the background mutation frequency. The 640 training samples were therefore analyzed as one cohort, and each sample of the test and validation sets was combined with the 640 training samples into a 641-sample cohort for GISTIC2 analysis, yielding the arm-level SCNAs and focal SCNAs of each sample.
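The cohort bookkeeping described above (640 training samples plus one query sample per GISTIC2 run) can be sketched as follows. The code only assembles the per-run cohorts; it does not invoke GISTIC2, and the function name and the dictionary-based data layout are hypothetical.

```python
def cohorts_for_gistic(training_cnv, query_cnv):
    """Yield (sample_id, cohort) pairs, one cohort per query sample.

    training_cnv and query_cnv map sample ID -> CNV segments. Each cohort
    holds every training sample plus exactly one query sample, so query
    samples never influence each other's background frequencies.
    """
    for sample_id, segments in query_cnv.items():
        if sample_id in training_cnv:
            raise ValueError(f"query ID {sample_id!r} collides with a training ID")
        cohort = dict(training_cnv)  # shallow copy keeps the training set intact
        cohort[sample_id] = segments
        yield sample_id, cohort
```

With 640 training entries, each yielded cohort contains 641 samples, matching the cohort size given in the text.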

3. Model feature selection.

Using five-fold cross-validation (640 cases in the training set and 160 cases in the test set), the arm-level SCNA and focal SCNA information obtained by GISTIC2 analysis and the clinical information (Age and Sex, both downloaded from the database) were used as input, and the AdaBoost (Adaptive Boosting) algorithm was used to select the important features of medulloblastoma, as shown in Fig. 1. In total, 48 features were finally selected for molecular classification of medulloblastoma. These include 36 arm-level SCNA features (chromosome arm copy number variation: amplification or deletion): human 1p (short arm of chromosome 1; p denotes the short arm; the feature is 1p amplification or 1p deletion), 1q (long arm of chromosome 1; q denotes the long arm; the feature is 1q amplification or 1q deletion), 2p, 2q, 3p, 3q, 4p, 4q, 5p, 5q, 6p, 6q, 7p, 7q, 8p, 8q, 9p, 9q, 10p, 10q, 11p, 11q, 13q, 14q, 16p, 16q, 17p, 17q, 18p, 18q, 19p, 19q, 20q, 21p, 21q and 22q; 10 focal SCNA features (gene-level copy number variation: amplification or deletion): MYCN (NC_000002.12, Feb 3, 2014), GLI2 (NC_000002.12, Feb 3, 2014), MYC (NG_007161.2, Sep 20, 2017), PVT1 (NC_000008.11, Feb 3, 2014), OTX2 (NC_000014.9, Feb 3, 2014), SCAPER (NC_000015.10, Feb 3, 2014), WWOX (NC_000016.10, Feb 3, 2014), SIRPB1 (NC_000020.11, Feb 3, 2014), PTCH1 (NC_000009.12, Feb 3, 2014) and SMYD4 (NC_000017.11, Feb 3, 2014); and the two features age (Age) and gender (Sex).

4. Constructing the molecular subtype classification models of medulloblastoma.

For the training set of 640 medulloblastoma samples, the arm-level SCNA, focal SCNA, age and gender features selected in step 3 were extracted for each sample to form the training set feature matrix, and the WNT, SHH, Group3 and Group4 subtypes given in the GSE37385 data were used as class labels. Five medulloblastoma classification models were constructed with five algorithms: Naive Bayes, Random Forest, the AdaBoost iterative algorithm, Logistic Regression and SVM (support vector machine; SVM parameter setting: kernel='linear', all other parameters default). The five models are called the Naive Bayes algorithm model (SMB-NB), the Random Forest algorithm model (SMB-RF), the AdaBoost algorithm model (SMB-AB), the Logistic Regression algorithm model (SMB-LR) and the SVM algorithm model (SMB-SVM), respectively.

5. Testing the prediction accuracy of the medulloblastoma classification models.

For the five classification models constructed in step 4, the performance of each model was tested with the test set and validation set samples. The 48-feature matrix selected in step 3 was input for each sample, and the five models were used to calculate the probability (range: 0-1) that each sample belongs to each of the four molecular subtypes WNT, SHH, Group3 and Group4. The subtype with the highest probability was taken as the subtype of the sample. The predicted subtypes were compared with the true subtypes of the samples, and the performance of each model was expressed as the area under the ROC (Receiver Operating Characteristic) curve, AUC (Area Under Curve) (Fig. 2), and as the prediction accuracy of each subtype (Fig. 3). The left panels of A-E in Fig. 2 show the AUC values of the five models (SMB-NB, SMB-RF, SMB-AB, SMB-LR and SMB-SVM) for molecular typing of the training set medulloblastoma samples; each is greater than 90%, indicating that all five models are effective for molecular typing of medulloblastoma. For the test set (middle panels of A-E in Fig. 2), the AUCs of SMB-NB, SMB-RF, SMB-AB, SMB-LR and SMB-SVM were 90.98%, 92.3%, 91.32%, 92.1% and 92.02%, respectively; for the validation set (right panels of A-E in Fig. 2), the AUCs of the five models were 91.37%, 95.28%, 93.36%, 92.25% and 93.13%, respectively.

As shown in Fig. 3, the accuracies of the five models SMB-NB, SMB-RF, SMB-AB, SMB-LR and SMB-SVM on the test set are: 76.25%, 75%, 77.5% and 80.62%; the accuracies on the validation set are: 81.25%, 75% and 78.12%. The accuracies of the three models SMB-NB, SMB-RF and SMB-AB differ considerably between the test set and the validation set, whereas the accuracies of SMB-LR and SMB-SVM are balanced between the two sets, with SMB-SVM slightly higher than SMB-LR.

Each model outputs a probability for each of the subtypes WNT, SHH, Group3 and Group4, and the subtype with the largest probability is taken as the final subtype of the sample. However, if the second-largest probability differs from the largest by less than 0.1, the sample is also quite likely to belong to the subtype with the second-largest probability. Samples for which the difference between the largest and second-largest of the four predicted subtype probabilities is less than 0.1 are therefore allowed to be treated as unpredictable samples. When unpredictable samples are allowed, the accuracies of SMB-NB, SMB-RF, SMB-AB, SMB-LR and SMB-SVM on the test set are 76.16%, 78.03%, 78.26%, 80.85% and 83.56%, respectively; the accuracies on the validation set are 85.71%, 80.65%, 85.19%, 80% and 82.14%, respectively. The accuracy of each model improves by about 3% compared with predicting all samples. The accuracies of SMB-LR and SMB-SVM are balanced, and SMB-SVM is slightly more accurate than SMB-LR, so SMB-SVM was finally selected as the model for molecular typing of medulloblastoma.
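The two accuracy figures compared above differ only in their denominator: accuracy over all samples versus accuracy over the samples that received a call. A minimal sketch, with hypothetical function names and with None standing for an unpredictable sample:

```python
def accuracy_all(predictions, truth):
    """Fraction of all samples whose predicted subtype matches the truth."""
    correct = sum(p == t for p, t in zip(predictions, truth))
    return correct / len(truth)


def accuracy_with_abstention(calls, truth):
    """Accuracy over called samples only; None marks an unpredictable sample.

    Returns (accuracy, n_called). Abstained samples are excluded from both
    the numerator and the denominator.
    """
    called = [(c, t) for c, t in zip(calls, truth) if c is not None]
    if not called:
        raise ValueError("no sample received a call")
    correct = sum(c == t for c, t in called)
    return correct / len(called), len(called)
```

Reporting the number of called samples alongside the accuracy matters, because excluding ambiguous samples raises accuracy at the cost of leaving some patients without a result.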

Example 2, predicting the typing of a patient with medulloblastoma using the SVM algorithm model.

This example describes the processing of a clinical medulloblastoma sample, from the raw data generated by NGS sequencing to the predicted molecular typing of medulloblastoma.

1. Sequencing data processing.

For the raw whole-genome sequencing data (FASTQ format, sequencing depth 2×) generated by NGS sequencing of a clinically obtained medulloblastoma tissue sample, Trimmomatic software (Anthony M. Bolger, Marc Lohse, Bjoern Usadel, Trimmomatic: a flexible trimmer for Illumina sequence data, Bioinformatics, Volume 30, Issue 15, August 2014, Pages 2114-2120) was used with default parameters to remove low-quality reads from the raw FASTQ data and obtain valid data in FASTQ format. Trimming proceeded as follows: (1) adapter sequences in the reads were excised; (2) bases with a quality value below 3 at the head and tail ends of the reads were excised; (3) sliding-window trimming was applied with a window of 4 bases, excising where the average quality value of the 4 bases fell below 4. If fewer than 36 bp (base pairs) of a read remained after these three kinds of trimming, the read was treated as low quality and removed. The valid data were then aligned to the human reference genome (GRCh37) with BWA-MEM (https://sourceforge.net/projects/bio-bwa/) software to obtain an alignment result file in BAM format.
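The quality-trimming logic described above can be illustrated with a simplified sketch that operates on a list of per-base quality values. It is not Trimmomatic's implementation: adapter removal is omitted, the order of operations is an assumption, and the window-average threshold is a required argument rather than a fixed constant. All names are hypothetical.

```python
def trim_read(quals, *, min_window_avg, leading=3, trailing=3, window=4, min_len=36):
    """Return (start, end) of the kept slice, or None if the read is dropped.

    quals is a list of per-base quality values for one read.
    """
    start, end = 0, len(quals)
    # Remove low-quality bases from the head and tail of the read.
    while start < end and quals[start] < leading:
        start += 1
    while end > start and quals[end - 1] < trailing:
        end -= 1
    # Slide a fixed-size window from the 5' end; cut the read at the first
    # window whose mean quality falls below the threshold.
    for i in range(start, end - window + 1):
        if sum(quals[i:i + window]) / window < min_window_avg:
            end = i
            break
    # Discard reads that are too short after trimming.
    if end - start < min_len:
        return None
    return start, end
```

For instance, a read with two low-quality bases at the head, 40 good bases and one low-quality base at the tail keeps the slice (2, 42), while a read with only 35 good bases before quality collapses is dropped.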

2. CNV detection and high frequency CNV analysis.

CNV detection was performed on the BAM-format alignment data obtained in step 1 using ReadDepth software (Miller Christopher A, Hampton Oliver, Coarfa Cristian, et al. ReadDepth: a parallel R package for detecting copy number alterations from short sequencing reads. [J]. PLoS One, 2011, 6: e16327) to obtain the CNV detection result of the tumor sample. This CNV result was combined with the CNV results of the 640 medulloblastoma samples in the training set of Example 1, and the high-frequency CNV results of all 641 samples were obtained by analysis with GISTIC2 software. Because GISTIC2 needs a large sample cohort to construct the background mutation frequency, the new medulloblastoma sample is analyzed together with the training set data, and the high-frequency CNV results of the medulloblastoma patient, namely the arm-level SCNA and focal SCNA results, are then extracted from the analysis output.

3. Molecular typing of medulloblastoma samples was predicted using SMB-SVM.

46 pieces of genomic variation feature information of the sample were extracted from the result of step 2, comprising 36 arm-level SCNAs: 1p, 1q, 2p, 2q, 3p, 3q, 4p, 4q, 5p, 5q, 6p, 6q, 7p, 7q, 8p, 8q, 9p, 9q, 10p, 10q, 11p, 11q, 13q, 14q, 16p, 16q, 17p, 17q, 18p, 18q, 19p, 19q, 20q, 21p, 21q, 22q; and 10 focal SCNAs: MYCN, GLI2, MYC, PVT1, OTX2, SCAPER, WWOX, SIRPB1, PTCH1, SMYD4. Together with the patient's age and sex, this gave 48 features in total, and the sample was predicted using the SMB-SVM model constructed in example 1. As shown in FIG. 4, the SHH subtype had the highest probability, 0.7622; the probability of the Group4 subtype was 0.1597, which differs from the SHH probability by more than 0.1, so the final prediction result was the SHH subtype.
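The assembly of the 48 features and the probability-gap rule described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the SMB-SVM implementation: the function names `build_features` and `call_subtype`, the numeric encoding of the SCNA values and of sex, the default of 0.0 for a missing SCNA value, and the choice to return `None` when the gap is not above 0.1 are all hypothetical.

```python
# The 36 arm-level SCNA features listed in the text.
ARM_LEVEL = [
    "1p", "1q", "2p", "2q", "3p", "3q", "4p", "4q", "5p", "5q",
    "6p", "6q", "7p", "7q", "8p", "8q", "9p", "9q", "10p", "10q",
    "11p", "11q", "13q", "14q", "16p", "16q", "17p", "17q",
    "18p", "18q", "19p", "19q", "20q", "21p", "21q", "22q",
]

# The 10 focal SCNA features listed in the text.
FOCAL = ["MYCN", "GLI2", "MYC", "PVT1", "OTX2",
         "SCAPER", "WWOX", "SIRPB1", "PTCH1", "SMYD4"]


def build_features(scna, age, sex):
    """Return a 48-element feature vector: 46 SCNA values, then age and sex.

    scna maps a feature name to a numeric value (encoding is an assumption);
    names missing from scna default to 0.0, which is also an assumption.
    """
    return [scna.get(name, 0.0) for name in ARM_LEVEL + FOCAL] + [age, sex]


def call_subtype(probs, min_gap=0.1):
    """Return the top subtype if it leads the runner-up by more than min_gap.

    probs maps subtype name to predicted probability. Returning None for an
    ambiguous call is an assumption; the text only describes the clear case.
    """
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    (best, p_best), (_, p_second) = ranked[0], ranked[1]
    return best if p_best - p_second > min_gap else None
```

With the two probabilities reported for this sample, `call_subtype({"SHH": 0.7622, "Group4": 0.1597})` selects SHH, because the gap of about 0.60 is larger than 0.1.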

The present invention is described in detail above. It will be apparent to those skilled in the art that the present invention can be practiced in a wide range of equivalent parameters, concentrations, and conditions without departing from the spirit and scope of the invention and without undue experimentation. While the invention has been described with respect to specific embodiments, it will be appreciated that the invention may be further modified. In general, this application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains.

Claims

1. A computer readable storage medium storing a computer program for predicting molecular subtypes of a patient with medulloblastoma, characterized in that: the computer program causes a computer to execute the steps of:

a1) Obtaining sequencing data, age and sex characteristics of known molecular subtype medulloblastoma samples;

2. The computer-readable storage medium of claim 1, wherein: the sequencing data of the patient with the medulloblastoma to be detected are low-depth whole genome sequencing data; the low depth means that the sequencing depth of the sequencing data is greater than or equal to 2×.

3. The computer-readable storage medium according to claim 1 or 2, wherein: the machine learning algorithm is naive Bayes, random forest, the AdaBoost iterative algorithm, logistic regression or a support vector machine.

4. The computer-readable storage medium according to claim 1 or 2, wherein: the machine learning algorithm is a support vector machine; the kernel function of the support vector machine is a linear kernel (kernel='linear').

5. A device for predicting molecular subtype in a medulloblastoma patient, characterized in that: the device comprises the following modules:

6. The apparatus according to claim 5, wherein: the sequencing data of the patient with the medulloblastoma to be detected are low-depth whole genome sequencing data; the low depth means that the sequencing depth of the sequencing data is greater than or equal to 2×.

7. The apparatus according to claim 5 or 6, wherein: the machine learning algorithm is naive Bayes, random forest, the AdaBoost iterative algorithm, logistic regression or a support vector machine.

8. The apparatus according to claim 5 or 6, wherein: the machine learning algorithm is a support vector machine; the kernel function of the support vector machine is a linear kernel (kernel='linear').

9. Use of the computer readable storage medium of any one of claims 1-4 for any one of the following:

10. Use of the device of any one of claims 5-8 for any one of the following: