CN116168762B - Computer readable storage medium and device for predicting medulloblastoma typing by low depth whole genome sequencing technique and application thereof - Google Patents

Info

Publication number
CN116168762B
Authority
CN
China
Prior art keywords
medulloblastoma
patient
cnv
scna
arm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310454365.0A
Other languages
Chinese (zh)
Other versions
CN116168762A (en
Inventor
马景娇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Genetron Health Beijing Co ltd
Original Assignee
Genetron Health Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Genetron Health Beijing Co ltd
Priority to CN202310454365.0A
Publication of CN116168762A
Application granted
Publication of CN116168762B
Legal status: Active

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention discloses a computer-readable storage medium and a device for predicting medulloblastoma typing by a low-depth whole-genome sequencing technology in the field of diagnosis, and applications thereof. The technical problem to be solved is how to perform molecular typing of medulloblastoma based on low-depth DNA whole-genome sequencing data. The computer-readable storage medium of the invention for predicting the molecular subtype of a medulloblastoma patient stores a computer program that causes a computer to execute the following steps: constructing a molecular typing model of medulloblastoma with a machine learning algorithm, based on 48 features drawn from the high-frequency CNV detection results of sequencing data of medulloblastoma samples of known molecular subtype together with age and sex; and predicting the molecular subtype of a medulloblastoma patient to be tested with the constructed model, based on the same 48 features obtained by analyzing the patient's low-depth whole-genome sequencing data. The method places low requirements on the test sample and has high prediction accuracy.

Description

Computer readable storage medium and device for predicting medulloblastoma typing by low depth whole genome sequencing technique and application thereof
Technical Field
The present invention relates to a computer-readable storage medium and a device for predicting medulloblastoma typing by a low-depth whole-genome sequencing technique in the diagnostic field, and applications thereof.
Background
Medulloblastoma is the most common brain tumor in children and has high mortality. In 2010, the scientific community reached consensus on its molecular subtypes (WNT, SHH, Group 3 and Group 4) at a consensus conference held in Boston. Studies have shown that the different molecular subtypes exhibit different genotypic characteristics and prognoses. At present, detecting transcriptome data of tumor samples with NanoString technology and using the RNA expression data for molecular typing of medulloblastoma is widely accepted. However, RNA degrades during long-term storage of tumor samples, which increases the failure rate of the analysis, and NanoString technology has low throughput and high detection cost.
Disclosure of Invention
The invention aims to solve the technical problem of molecular typing of medulloblastoma based on DNA whole genome low-depth sequencing data.
To solve the above technical problem, the present invention firstly provides a computer-readable storage medium predicting molecular subtypes of a medulloblastoma patient, the computer-readable storage medium storing a computer program causing a computer to execute the steps of:
a1 Obtaining sequencing data, age characteristics and sex characteristics of known molecular subtype medulloblastoma samples;
a2 Performing CNV detection on sequencing data of the known molecular subtype medulloblastoma sample to obtain a CNV result of the known molecular subtype medulloblastoma sample, and performing high-frequency CNV detection on the CNV result of the known molecular subtype medulloblastoma sample to obtain an arm-level SCNA and a focal SCNA result of the known molecular subtype medulloblastoma sample;
a3 Based on 48 features, constructing a molecular typing model of the medulloblastoma by using a machine learning algorithm; the 48 features include 36 chromosome-arm-level copy number variation features and 10 gene-level copy number variation features, as well as age and sex features, obtained from the arm-level SCNA and focal SCNA results; the 36 chromosome arms include: 1p,1q,2p,2q,3p,3q,4p,4q,5p,5q,6p,6q,7p,7q,8p,8q,9p,9q,10p,10q,11p,11q,13q,14q,16p,16q,17p,17q,18p,18q,19p,19q,20q,21p,21q and 22q; the p represents a short chromosome arm, and the q represents a long chromosome arm; the 10 genes include: MYCN, GLI2, MYC, PVT1, OTX2, SCAPER, WWOX, SIRPB1, PTCH1 and SMYD4;
a4 Obtaining sequencing data of a medulloblastoma patient to be tested; aligning the sequencing data to a reference genome to obtain an alignment result file;
a5 Performing CNV detection on the alignment result file to obtain the CNV result of the patient to be tested; merging the CNV result of the patient to be tested with the CNV results of the known molecular subtype medulloblastoma samples to obtain a merged CNV result; performing high-frequency CNV detection on the merged CNV result to obtain a high-frequency CNV result; extracting the arm-level SCNA and focal SCNA results of the patient to be tested from the high-frequency CNV result;
a6 Extracting the 48 features of the patient to be tested from the patient's arm-level SCNA and focal SCNA results and age and sex information, and predicting the molecular subtype of the patient to be tested based on these 48 features.
In the above computer-readable storage medium, 1p may specifically refer to the short arm of chromosome 1, and 1q may specifically refer to the long arm of chromosome 1. A chromosome-arm-level copy number variation feature may be a chromosome arm deletion feature or a chromosome arm amplification feature. The 10 genes may specifically be MYCN (NC_000002.12, Feb 3, 2014), GLI2 (NC_000002.12, Feb 3, 2014), MYC (NG_007161.2, Sep 20, 2017), PVT1 (NC_000008.11, Feb 3, 2014), OTX2 (NC_000014.9, Feb 3, 2014), SCAPER (NC_000015.10, Feb 3, 2014), WWOX (NC_000016.10, Feb 3, 2014), SIRPB1 (NC_000020.11, Feb 3, 2014), PTCH1 (NC_000009.12, Feb 3, 2014) and SMYD4 (NC_000017.11, Feb 3, 2014). A gene-level copy number variation feature may be a gene deletion feature or a gene amplification feature.
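To make the layout of the 48 features concrete, the following Python sketch assembles one feature vector per sample from the 36 arms, the 10 genes, age and sex. The numeric encoding (-1 for deletion, 0 for no change, +1 for amplification, 0/1 for sex) and the name `build_feature_vector` are assumptions made for this illustration; the source does not state how the features are encoded.

```python
# Feature layout sketch: 36 arm-level + 10 gene-level + age + sex = 48 features.
ARMS = ("1p,1q,2p,2q,3p,3q,4p,4q,5p,5q,6p,6q,7p,7q,8p,8q,9p,9q,10p,10q,"
        "11p,11q,13q,14q,16p,16q,17p,17q,18p,18q,19p,19q,20q,21p,21q,22q").split(",")
GENES = ["MYCN", "GLI2", "MYC", "PVT1", "OTX2",
         "SCAPER", "WWOX", "SIRPB1", "PTCH1", "SMYD4"]
FEATURE_NAMES = ARMS + GENES + ["Age", "Sex"]

# Assumed encoding (hypothetical; not specified in the source).
CNV_CODE = {"deletion": -1, "none": 0, "amplification": 1}
SEX_CODE = {"F": 0, "M": 1}


def build_feature_vector(arm_calls, gene_calls, age, sex):
    """Return 48 numbers in FEATURE_NAMES order; missing calls default to 'none'."""
    vec = [CNV_CODE[arm_calls.get(arm, "none")] for arm in ARMS]
    vec += [CNV_CODE[gene_calls.get(gene, "none")] for gene in GENES]
    vec += [float(age), SEX_CODE[sex]]
    return vec


# Hypothetical sample: 17p deleted, 17q amplified, MYC amplified, 7-year-old male.
example = build_feature_vector(
    {"17p": "deletion", "17q": "amplification"},
    {"MYC": "amplification"},
    age=7,
    sex="M",
)
```

Keeping the feature order in a single list such as `FEATURE_NAMES` helps ensure that training samples and samples to be tested are encoded identically.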
In the above computer-readable storage medium, the sequencing data of the medulloblastoma patient to be tested may be low-depth whole-genome sequencing data. The low depth may be a sequencing depth of 2× or more.
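As a rough check of the depth requirement, mean sequencing depth is commonly approximated as the total number of sequenced bases divided by the genome size. The sketch below applies that approximation; the genome size of about 3.1 Gb, the read counts and the function names are illustrative assumptions, not values from the source.

```python
def mean_depth(n_reads, read_length_bp, genome_size_bp=3.1e9):
    """Approximate mean depth: total sequenced bases / genome size (assumed ~3.1 Gb)."""
    return n_reads * read_length_bp / genome_size_bp


def meets_low_depth_requirement(depth, minimum=2.0):
    """True when the depth reaches the 2x minimum stated in the text."""
    return depth >= minimum


# Hypothetical run: 50 million reads of 150 bp each.
depth = mean_depth(n_reads=50_000_000, read_length_bp=150)
ok = meets_low_depth_requirement(depth)
```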
In the above computer-readable storage medium, the machine learning algorithm may be naive Bayes, random forest, the AdaBoost iterative algorithm, logistic regression, or a support vector machine.
In the above computer-readable storage medium, the machine learning algorithm may be a support vector machine. The kernel function of the support vector machine may be a linear kernel (kernel='linear'). The remaining parameters of the support vector machine may be default parameters. The parameters of the naive Bayes, random forest, AdaBoost iterative algorithm, or logistic regression may be default parameters.
In the above computer-readable storage medium, the sequencing data of the known molecular subtype medulloblastoma samples may be chip (SNP array) data. The CNV detection may be performed using DNAcopy and/or ReadDepth software; the high-frequency CNV detection may be performed using GISTIC2 software.
In order to solve the above technical problem, the present invention also provides a device for predicting molecular subtype of a medulloblastoma patient, which may include the following modules:
b1 Known medulloblastoma sample data acquisition module: used for obtaining sequencing data, age characteristics and sex characteristics of known molecular subtype medulloblastoma samples;
b2 High-frequency CNV detection module for known medulloblastoma samples: used for performing CNV detection on the sequencing data of the known molecular subtype medulloblastoma samples to obtain their CNV results, and performing high-frequency CNV detection on those CNV results to obtain the arm-level SCNA and focal SCNA results of the known molecular subtype medulloblastoma samples;
b3 Medulloblastoma classification model building module: used for constructing a molecular typing model of medulloblastoma based on 48 features by using a machine learning algorithm; the 48 features include 36 chromosome-arm-level copy number variation features and 10 gene-level copy number variation features, as well as age and sex features, obtained from the arm-level SCNA and focal SCNA results; the 36 chromosome arms include: 1p,1q,2p,2q,3p,3q,4p,4q,5p,5q,6p,6q,7p,7q,8p,8q,9p,9q,10p,10q,11p,11q,13q,14q,16p,16q,17p,17q,18p,18q,19p,19q,20q,21p,21q and 22q; p represents a short chromosome arm, and q represents a long chromosome arm; the 10 genes include: MYCN, GLI2, MYC, PVT1, OTX2, SCAPER, WWOX, SIRPB1, PTCH1 and SMYD4;
b4 CNV detection module for the medulloblastoma patient to be tested: used for obtaining sequencing data of the medulloblastoma patient to be tested; aligning the sequencing data to a reference genome to obtain an alignment result file of the patient to be tested; and detecting the CNV result of the patient to be tested based on the alignment result file;
b5 High-frequency CNV detection module: used for merging the CNV result of the patient to be tested with the CNV results of the known molecular subtype medulloblastoma samples to obtain a merged CNV result; performing high-frequency CNV detection on the merged CNV result to obtain a high-frequency CNV result; and extracting the arm-level SCNA and focal SCNA results of the patient to be tested from the high-frequency CNV result;
b6 Molecular subtype prediction module for the medulloblastoma patient to be tested: used for extracting the 48 features of the patient to be tested based on the patient's arm-level SCNA and focal SCNA results, and predicting the molecular subtype of the patient to be tested based on these 48 features.
In the above device, 1p may specifically refer to the short arm of chromosome 1, and 1q may specifically refer to the long arm of chromosome 1. A chromosome-arm-level copy number variation feature may be a chromosome arm deletion feature or a chromosome arm amplification feature. The 10 genes may specifically be MYCN (NC_000002.12, Feb 3, 2014), GLI2 (NC_000002.12, Feb 3, 2014), MYC (NG_007161.2, Sep 20, 2017), PVT1 (NC_000008.11, Feb 3, 2014), OTX2 (NC_000014.9, Feb 3, 2014), SCAPER (NC_000015.10, Feb 3, 2014), WWOX (NC_000016.10, Feb 3, 2014), SIRPB1 (NC_000020.11, Feb 3, 2014), PTCH1 (NC_000009.12, Feb 3, 2014) and SMYD4 (NC_000017.11, Feb 3, 2014). A gene-level copy number variation feature may be a gene deletion feature or a gene amplification feature.
In the above device, the sequencing data of the medulloblastoma patient to be tested may be low-depth whole-genome sequencing data; the low depth may be a sequencing depth of 2× or more.
In the above device, the machine learning algorithm may be naive Bayes, random forest, the AdaBoost iterative algorithm, logistic regression, or a support vector machine.
In the above device, the machine learning algorithm may specifically be a support vector machine; the kernel function of the support vector machine may be a linear kernel (kernel='linear'). The remaining parameters of the support vector machine may be default parameters. The parameters of the naive Bayes, random forest, AdaBoost iterative algorithm, or logistic regression may be default parameters.
The sequencing data of the known molecular subtype medulloblastoma samples may be chip (SNP array) data. The CNV detection may be performed using DNAcopy and/or ReadDepth software; the high-frequency CNV detection may be performed using GISTIC2 software.
Any of the following applications of the computer readable storage medium described above are also within the scope of the present invention:
c1 Use in the preparation of a product for predicting the molecular subtype of a medulloblastoma patient;
c2 Use in the development or preparation of a medicament for treating or alleviating medulloblastoma;
c3 Use in the development or preparation of a product for guiding medication for medulloblastoma;
c4 Use in the preparation of a product for predicting the prognosis associated with the molecular subtype of a medulloblastoma patient.
Any of the following applications of the device described above is also within the scope of the present invention:
d1 Use in the preparation of a product for predicting the molecular subtype of a medulloblastoma patient;
d2 Use in the development or preparation of a medicament for treating or alleviating medulloblastoma;
d3 Use in the development or preparation of a product for guiding medication for medulloblastoma;
d4 Use in the preparation of a product for predicting the prognosis associated with the molecular subtype of a medulloblastoma patient.
The invention provides a method based on low-depth (2× or more) DNA whole-genome sequencing that uses a machine learning algorithm to perform molecular typing of medulloblastoma.
The technical scheme of the invention comprises two parts, model training and sample detection: the model training part uses public data, and the sample detection part is applied to clinically collected medulloblastoma samples.
The model training part is as follows. The invention retrieved the GSE37385 data set (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE37385) from GEO (Gene Expression Omnibus), comprising 1097 medulloblastoma SNP array CEL data files and clinical information. After a series of quality evaluations, 800 samples were retained and used as the training set (640 samples) and test set (160 samples) for the machine learning model. In addition, 32 medulloblastoma samples were collected from Robinson et al. (Robinson, G., Parker, M., Kranenburg, T. et al. Novel mutations target distinct subgroups of medulloblastoma. Nature 488, 43-48 (2012)) as a validation set.
For the samples of the training set, test set and validation set above, PennCNV software (Wang K, Li M, Hadley D, et al. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Research 17:1665-1674, 2007) was used to calculate the Log R Ratio (LRR) and B Allele Frequency (BAF) of each sample. These were used as input data to DNAcopy software (Seshan VE, Olshen A. DNAcopy: DNA copy number data analysis. R package version 1.72.3.) to analyze CNV (copy number variation) and obtain the CNV result of each sample.
GISTIC2 software (Mermel CH, Schumacher SE, Hill B, Meyerson ML, et al. GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancer. Genome Biol. 2011;12(4):R41.) was used to analyze the CNV data of the samples and obtain the high-frequency CNV results of each sample, namely arm-level SCNA (arm-level somatic copy number alteration, defined as a copy number change spanning 50% or more of the chromosome arm length) and focal SCNA (focal somatic copy number alteration, defined as a copy number change spanning less than 50% of the chromosome arm length; in the present invention, focal SCNA is restricted to gene-level lengths). High-frequency CNV analysis with GISTIC2 requires a large sample cohort to construct the background mutation frequency. The 640 training-set samples were therefore analyzed as one cohort, and each sample of the test set and validation set was combined with the 640 training-set samples into a 641-sample cohort for GISTIC2 analysis, yielding the arm-level SCNA and focal SCNA of each sample.
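The 50% threshold that separates arm-level from focal SCNA can be illustrated with a few lines of Python. This is only a sketch of the stated definition; in the described workflow the call is made by GISTIC2 itself, and the function name and the example lengths are hypothetical.

```python
def classify_scna(segment_length_bp, arm_length_bp):
    """Label a copy-number segment by the fraction of the chromosome arm it covers."""
    if arm_length_bp <= 0:
        raise ValueError("arm_length_bp must be positive")
    fraction = segment_length_bp / arm_length_bp
    # 50% or more of the arm -> arm-level; otherwise focal.
    return "arm-level" if fraction >= 0.5 else "focal"


# Hypothetical lengths, chosen only to show both outcomes.
large_event = classify_scna(segment_length_bp=30_000_000, arm_length_bp=40_000_000)
small_event = classify_scna(segment_length_bp=200_000, arm_length_bp=40_000_000)
```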
The high-frequency CNV results obtained by GISTIC2, together with age and sex, were used as input feature data. With five-fold cross-validation (training set 640 samples, test set 160 samples), AdaBoost was used to select the 48 important features. Based on these 48 features, five machine learning classification algorithms, Naive Bayes, Random Forest, AdaBoost, Logistic Regression and Support Vector Machine (SVM), were each used to classify the samples into the four subtypes WNT, SHH, Group3 and Group4, and AUROC was used to evaluate the performance of each algorithm. The results show that the overall performance of the SVM classification model is superior to that of the other four, so SVM was finally selected as the algorithm of the medulloblastoma molecular typing model.
The sample detection part is as follows. For a clinical medulloblastoma tissue sample to be tested, sequencing data are obtained and, once the data pass quality control, analyzed to obtain CNV data. The sample is combined with the 640 training-set samples into one cohort, and GISTIC2 analysis is used to obtain the arm-level SCNA and focal SCNA features of the sample to be tested. After the age and sex information of the sample is added, the SVM classification model performs molecular typing and outputs a probability for each of the molecular subtypes WNT, SHH, Group3 and Group4. If the difference between the two largest subtype probabilities is greater than or equal to 0.1, the final result is the subtype with the largest probability; if the difference is less than 0.1, the molecular typing result cannot be determined.
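The final decision rule (report the top subtype only when it leads the runner-up by at least 0.1) can be sketched as follows. The function name `call_subtype` and the example probabilities are hypothetical; the probabilities would come from whichever classifier is used.

```python
SUBTYPES = ("WNT", "SHH", "Group3", "Group4")


def call_subtype(probabilities, min_gap=0.1):
    """Return the most probable subtype, or None when the top two differ by < min_gap."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    (best, p_best), (_, p_second) = ranked[0], ranked[1]
    return best if p_best - p_second >= min_gap else None


# Hypothetical classifier outputs for two samples.
clear_case = call_subtype({"WNT": 0.05, "SHH": 0.80, "Group3": 0.10, "Group4": 0.05})
ambiguous_case = call_subtype({"WNT": 0.02, "SHH": 0.03, "Group3": 0.50, "Group4": 0.45})
```

Here `clear_case` is `"SHH"`, while `ambiguous_case` is `None`, corresponding to a typing result that cannot be determined.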
Due to the adoption of the technical scheme, the invention has the following advantages:
1. Only low-depth DNA WGS sequencing data are needed, so the requirements on the sample are low and samples are easy to obtain;
2. Of the five machine learning models compared, the SVM with the best performance is used, and prediction accuracy is high.
The DNA sequencing data described above may be sequencing data having a sequencing depth of 2× or greater.
Drawings
Fig. 1 is a model feature selection diagram. The abscissa is the features used by the model and the ordinate is the feature importance index of the AdaBoost analysis.
Fig. 2 shows the ROC curves (AUROC) of the five classification models. A is the ROC curve of SMB-NB; B is the ROC curve of SMB-RF; C is the ROC curve of SMB-AB; D is the ROC curve of SMB-LR; E is the ROC curve of SMB-SVM. In each panel, the left graph is the training-set samples, the middle graph is the test-set samples, and the right graph is the validation-set samples. The ordinate is the true positive rate, and the abscissa is the false positive rate.
Fig. 3 shows the prediction accuracy of the five classification models, both when all samples are predicted and when samples whose two largest subtype probabilities differ by less than 0.1 are treated as unpredictable samples.
FIG. 4 shows one example of clinical molecular typing results of medulloblastoma.
Detailed Description
The following detailed description of the invention is provided in connection with the accompanying drawings that are presented to illustrate the invention and not to limit the scope thereof. The examples provided below are intended as guidelines for further modifications by one of ordinary skill in the art and are not to be construed as limiting the invention in any way.
The experimental methods in the following examples, unless otherwise specified, are conventional methods, and are carried out according to techniques or conditions described in the literature in the field or according to the product specifications. Materials, reagents and the like used in the examples described below are commercially available unless otherwise specified.
Example 1, construction and validation of a medulloblastoma typing model.
The invention obtains data from a public database, divides it into a training set, a test set and a validation set, and trains a model with a machine learning algorithm. This example describes the model construction process and the validation results of model performance.
1. Copy Number Variation (CNV) of the samples was detected.
The GSE37385 data set (https://www.ncbi.nlm.nih.gov/geo/query) was retrieved from the GEO (Gene Expression Omnibus) database, comprising a total of 1097 medulloblastoma SNP array CEL data files (chip data) and clinical information. After a series of quality evaluations, the data of 800 samples were retained for constructing the machine learning model; the 800 medulloblastoma samples were randomly divided into a training set (640 samples) and a test set (160 samples) at a 4:1 ratio. In addition, data of 32 medulloblastoma samples were collected from Robinson et al. (Robinson, G., Parker, M., Kranenburg, T. et al. Novel mutations target distinct subgroups of medulloblastoma. Nature 488, 43-48 (2012)) as a validation set.
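A random 4:1 split of this kind can be sketched in a few lines of Python. The fixed seed, the sample identifiers and the function name are assumptions for reproducibility of the illustration; the source does not describe how the randomization was performed.

```python
import random


def split_samples(sample_ids, train_fraction=0.8, seed=0):
    """Shuffle the sample IDs and split them into training and test lists."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # seeded shuffle (the seed is an assumption)
    n_train = round(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]


# 800 hypothetical sample identifiers -> 640 training and 160 test samples.
all_ids = [f"S{i:04d}" for i in range(800)]
train_ids, test_ids = split_samples(all_ids)
```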
For the data sets described above (training set, test set and validation set), PennCNV software (Wang K, Li M, Hadley D, et al. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Research 17:1665-1674, 2007) was used to calculate the Log R Ratio (LRR) and B Allele Frequency (BAF) of each sample. With LRR and BAF as input data, the copy number variation (CNV) of each sample was detected using DNAcopy software (Seshan VE, Olshen A. DNAcopy: DNA copy number data analysis. R package version 1.72.3).
2. High frequency CNV of the sample is detected.
GISTIC2 software (Mermel CH, Schumacher SE, Hill B, Meyerson ML, et al. GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancer. Genome Biol. 2011;12(4):R41.) was used to analyze the CNV data of the training-set, test-set and validation-set samples, respectively, to obtain the high-frequency CNV results of each sample, namely arm-level SCNA (arm-level somatic copy number alteration, defined as a copy number change spanning 50% or more of the chromosome arm length) and focal SCNA (focal somatic copy number alteration, defined as a copy number change spanning less than 50% of the chromosome arm length; in the present invention, focal SCNA is restricted to gene-level lengths). GISTIC2 analysis requires a large sample cohort to construct the background mutation frequency. The 640 training-set samples were analyzed as one cohort, and each sample of the test set and validation set was combined with the 640 training-set samples into a 641-sample cohort for GISTIC2 analysis, yielding the arm-level SCNA and focal SCNA of each sample.
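The cohort construction described above, in which every test or validation sample is analyzed together with the 640 training samples, could be organized as in the following sketch. It only prepares the per-sample cohorts and does not run GISTIC2; the dictionary layout (sample ID mapped to a list of CNV segments) and the function name are assumptions made for illustration.

```python
def build_gistic_cohorts(training_cnv, query_cnv):
    """Yield (query_id, cohort), where cohort = all training samples + one query sample."""
    for query_id, segments in query_cnv.items():
        if query_id in training_cnv:
            raise ValueError(f"{query_id} is already part of the training cohort")
        cohort = dict(training_cnv)   # copy, so the training data stay untouched
        cohort[query_id] = segments   # add exactly one sample to be analyzed
        yield query_id, cohort


# Hypothetical placeholders: 640 training samples and two samples to be analyzed.
training = {f"T{i:03d}": [] for i in range(640)}
queries = {"Q1": [], "Q2": []}
cohorts = dict(build_gistic_cohorts(training, queries))
```

Each value in `cohorts` holds 641 samples, so the background built from the training set is the same for every analyzed sample.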
3. Model feature selection.
Five-fold cross-validation (640 cases in the training set and 160 cases in the test set) was used. The arm-level SCNA and focal SCNA information of the samples obtained by GISTIC2 analysis, together with clinical information (Age and Sex, both downloaded from the database), was used as input, and the AdaBoost (Adaptive Boosting) algorithm was used to select the important features of medulloblastoma, as shown in Fig. 1. In total, 48 features were finally selected as the features for molecular typing of medulloblastoma. These include 36 arm-level SCNA (chromosome arm copy number variation: amplification or deletion) features: human 1p (short arm of chromosome 1; p denotes the short arm; the feature is 1p amplification or 1p deletion), 1q (long arm of chromosome 1; q denotes the long arm; the feature is 1q amplification or 1q deletion), 2p, 2q, 3p, 3q, 4p, 4q, 5p, 5q, 6p, 6q, 7p, 7q, 8p, 8q, 9p, 9q, 10p, 10q, 11p, 11q, 13q, 14q, 16p, 16q, 17p, 17q, 18p, 18q, 19p, 19q, 20q, 21p, 21q and 22q; 10 focal SCNA (gene-level copy number variation: amplification or deletion) features: MYCN (NC_000002.12, Feb 3, 2014), GLI2 (NC_000002.12, Feb 3, 2014), MYC (NG_007161.2, Sep 20, 2017), PVT1 (NC_000008.11, Feb 3, 2014), OTX2 (NC_000014.9, Feb 3, 2014), SCAPER (NC_000015.10, Feb 3, 2014), WWOX (NC_000016.10, Feb 3, 2014), SIRPB1 (NC_000020.11, Feb 3, 2014), PTCH1 (NC_000009.12, Feb 3, 2014) and SMYD4 (NC_000017.11, Feb 3, 2014); and the two features age (Age) and sex (Sex).
4. Constructing the molecular subtype classification model of medulloblastoma.
For the training set of 640 medulloblastoma samples, the features selected in step 3 (arm-level SCNA, focal SCNA, age and sex) of each sample form the training-set feature matrix, and the molecular subtypes given in the GSE37385 data (WNT subtype, SHH subtype, Group3 subtype and Group4 subtype) serve as class labels. Five medulloblastoma classification models were constructed using five algorithms: Naive Bayes, Random Forest, the AdaBoost iterative algorithm, Logistic Regression and SVM (support vector machine; SVM parameter setting: kernel='linear', all other parameters default). The five models are called the Naive Bayes algorithm model (SMB-NB), the Random Forest algorithm model (SMB-RF), the AdaBoost algorithm model (SMB-AB), the Logistic Regression algorithm model (SMB-LR) and the SVM algorithm model (SMB-SVM), respectively.
5. Testing the prediction accuracy of the medulloblastoma classification models.
For the five classification models constructed in step 4, the performance of each was tested with the test-set and validation-set samples. The 48-feature matrix selected in step 3 was input for each sample, and the five models were used to calculate the probability (range: 0-1) that each sample belongs to each of the four molecular subtypes WNT, SHH, Group3 and Group4. The invention takes the subtype with the highest probability as the subtype of the sample. The predicted subtypes were compared with the true subtypes of the samples, and the performance of each model was expressed by the area under the ROC (Receiver Operating Characteristic) curve, i.e. the AUC (Area Under Curve) value (Fig. 2), and by the prediction accuracy of each subtype (Fig. 3). The left panels of A-E in Fig. 2 show the AUC values of the five algorithm models (SMB-NB, SMB-RF, SMB-AB, SMB-LR and SMB-SVM) for molecular typing of the training-set medulloblastoma samples; each is greater than 90%, indicating that all five algorithm models are effective for molecular typing of medulloblastoma. For the test set (middle panels of A-E in Fig. 2), the AUCs of the five algorithm models SMB-NB, SMB-RF, SMB-AB, SMB-LR and SMB-SVM were 90.98%, 92.3%, 91.32%, 92.1% and 92.02%, respectively. For the validation set (right panels of A-E in Fig. 2), the AUCs of the five algorithm models were 91.37%, 95.28%, 93.36%, 92.25% and 93.13%, respectively.
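For readers unfamiliar with the AUC metric, the sketch below computes a one-vs-rest AUC for a single subtype as the probability that a randomly chosen sample of that subtype receives a higher predicted probability than a randomly chosen sample of another subtype (ties count as one half). Treating the four-class problem one-vs-rest is an assumption here; the source does not say how the multi-class AUC was aggregated, and the scores and labels below are hypothetical.

```python
def auc_one_vs_rest(scores, labels, positive):
    """AUC for one subtype: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    # Count a win for each correctly ordered pair and half a win for each tie.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical predicted WNT probabilities for four samples and their true subtypes.
wnt_scores = [0.9, 0.2, 0.8, 0.1]
true_labels = ["WNT", "WNT", "SHH", "Group4"]
wnt_auc = auc_one_vs_rest(wnt_scores, true_labels, positive="WNT")
```

In this small example three of the four positive/negative pairs are ordered correctly, so `wnt_auc` is 0.75.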
As shown in Fig. 3, the accuracy of the five algorithm models SMB-NB, SMB-RF, SMB-AB, SMB-LR and SMB-SVM on the test set is: 76.25%, 75%, 77.5% and 80.62%; the accuracy on the validation set is: 81.25%, 75% and 78.12%. The accuracy of the three models SMB-NB, SMB-RF and SMB-AB differs considerably between the test set and the validation set, whereas the accuracy of SMB-LR and SMB-SVM is balanced across the two sets, with SMB-SVM slightly higher than SMB-LR.
The algorithm model of the present invention outputs a probability value for each of the subtypes WNT, SHH, Group3 and Group4, and the present invention determines the subtype with the greatest probability as the final subtype of the sample. However, if the second-largest probability value differs from the largest by less than 0.1, the sample is also highly likely to belong to the subtype with the second-largest probability. Samples for which the difference between the largest and second-largest of the four subtype probabilities predicted by the model is less than 0.1 are therefore allowed to be treated as unpredictable samples. When unpredictable samples are allowed, the accuracies of the five algorithm models SMB-NB, SMB-RF, SMB-AB, SMB-LR and SMB-SVM on the test set are 76.16%, 78.03%, 78.26%, 80.85% and 83.56%, respectively; the accuracies on the validation set are 85.71%, 80.65%, 85.19%, 80% and 82.14%, respectively. The accuracy of each model is improved by about 3% compared with the accuracy obtained when all samples are predicted. The accuracies of SMB-LR and SMB-SVM are balanced, and SMB-SVM is slightly more accurate than SMB-LR, so SMB-SVM is finally selected as the algorithm model for molecular typing of medulloblastoma.
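The decision rule described above can be illustrated with a minimal Python sketch. The function names call_subtype and accuracy_on_predictable are hypothetical and are not taken from the patent; the only values taken from the text are the four subtype names and the 0.1 margin.

```python
import math

SUBTYPES = ("WNT", "SHH", "Group3", "Group4")

def call_subtype(probs, min_margin=0.1):
    """Return the most probable subtype, or None for an unpredictable sample.

    probs maps each subtype name to the probability output by the model.
    A sample is unpredictable when the largest and second-largest
    probabilities differ by less than min_margin (0.1 in the text).
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    (best, p_max), (_, p_second) = ranked[0], ranked[1]
    if p_max - p_second < min_margin:
        return None
    return best

def accuracy_on_predictable(calls, truths):
    """Accuracy over the samples that received a call (None is skipped)."""
    pairs = [(c, t) for c, t in zip(calls, truths) if c is not None]
    if not pairs:
        return math.nan
    return sum(c == t for c, t in pairs) / len(pairs)
```

Excluding unpredictable samples from the denominator is one plausible reading of "allowing unpredictable samples to exist"; the patent does not spell out the exact formula.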
Example 2. Predicting the typing of a medulloblastoma patient using the SVM algorithm model.
This example describes the process for a clinical medulloblastoma sample, from the raw data generated by NGS sequencing to the prediction of medulloblastoma molecular typing.
1. Sequencing data processing.
Raw whole genome sequencing data (FASTQ format, sequencing depth 2×) generated by NGS sequencing of a clinically obtained medulloblastoma tissue sample were processed with Trimmomatic (Anthony M. Bolger, Marc Lohse, Bjoern Usadel, Trimmomatic: a flexible trimmer for Illumina sequence data, Bioinformatics, Volume 30, Issue 15, August 2014, Pages 2114-2120) software (default parameters) to remove low-quality reads from the raw FASTQ data. Low-quality reads are defined as follows: (1) adapter sequences in the reads are excised; (2) bases with a base quality value below 3 at the head and tail ends of the reads are excised; (3) taking 4 bases as a window, windows in which the average quality value of the 4 bases is below 4 are excised (sliding-window excision); if a read contains any of the above 3 types of bases and the portion remaining after excision is shorter than 36 bp (base pairs), the read is a low-quality read. Removing these reads yields valid data in FASTQ format. The valid data were aligned to the human reference genome (GRCh37) using BWA-MEM (https://sourceforge.net/projects/bio-bwa/) software to obtain an alignment result file in BAM format.
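The quality-trimming rules (2), (3) and the 36 bp length filter can be sketched in plain Python as follows. This is a simplified illustration, not Trimmomatic's actual implementation: the function name trim_read is hypothetical, adapter removal is omitted, and the thresholds are the ones stated in the text.

```python
def trim_read(quals, end_min=3, window=4, window_min=4, min_len=36):
    """Return the (start, end) slice of a read to keep, or None to discard it.

    quals is the list of per-base quality values of one read.
    """
    start, end = 0, len(quals)
    # Rule (2): trim bases below end_min from the head and tail of the read.
    while start < end and quals[start] < end_min:
        start += 1
    while end > start and quals[end - 1] < end_min:
        end -= 1
    # Rule (3): scan a sliding window and cut the read at the first window
    # whose average quality falls below window_min.
    for i in range(start, end - window + 1):
        if sum(quals[i:i + window]) / window < window_min:
            end = i
            break
    # Length filter: reads shorter than min_len after trimming are low quality.
    if end - start < min_len:
        return None
    return start, end
```

For example, a read with two poor leading bases, forty good bases and one poor trailing base keeps only its forty good bases, while a 20 bp read is discarded regardless of quality.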
2. CNV detection and high frequency CNV analysis.
CNV detection was performed on the BAM-format alignment data obtained in step 1 using readDepth (Miller Christopher A, Hampton Oliver, Coarfa Cristian et al. ReadDepth: a parallel R package for detecting copy number alterations from short sequencing reads. [J]. PLoS One, 2011, 6: e16327.) software to obtain the CNV detection result of the tumor sample. This CNV result was combined with the CNV results of the 640 medulloblastoma samples in the training set of Example 1, and the high-frequency CNV results of all 641 samples were obtained by analysis with GISTIC2 software. GISTIC2 needs a large cohort of samples to construct the background mutation frequency, so a new medulloblastoma sample is analyzed together with the training set data, and the high-frequency CNV results of the medulloblastoma patient, namely the arm-level SCNA and focal SCNA results, are then extracted from the analysis results.
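The merging step can be pictured with a short Python sketch. The data layout here is an assumption made for illustration: each sample is represented as a list of (chromosome, start, end, log2_ratio) segments, and the function name merge_cnv_results is hypothetical. The file format actually expected by GISTIC2 is not described in the text and is not reproduced here.

```python
def merge_cnv_results(training_cnv, patient_id, patient_segments):
    """Combine one new patient's CNV segments with the training cohort.

    training_cnv maps sample IDs to lists of segments; the input mapping is
    left unchanged and a new combined mapping is returned.
    """
    if patient_id in training_cnv:
        raise ValueError(f"sample ID already in training cohort: {patient_id}")
    combined = dict(training_cnv)
    combined[patient_id] = list(patient_segments)
    return combined
```

With the 640 training samples of Example 1 and one new patient, the combined mapping holds 641 samples, which are then analyzed together as a cohort.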
3. Molecular typing of medulloblastoma samples was predicted using SMB-SVM.
From the result of step 2, 46 genomic variation features of the sample were extracted, comprising 36 arm-level SCNAs: 1p, 1q, 2p, 2q, 3p, 3q, 4p, 4q, 5p, 5q, 6p, 6q, 7p, 7q, 8p, 8q, 9p, 9q, 10p, 10q, 11p, 11q, 13q, 14q, 16p, 16q, 17p, 17q, 18p, 18q, 19p, 19q, 20q, 21p, 21q, 22q; and 10 focal SCNAs: MYCN, GLI2, MYC, PVT1, OTX2, SCAPER, WWOX, SIRPB1, PTCH1, SMYD4. Together with the patient's age and sex, these make up 48 features, with which the sample was predicted using the SMB-SVM model constructed in Example 1. As shown in FIG. 4, the SHH subtype has the highest probability, 0.7622; the Group4 subtype has a probability of 0.1597, which differs from the SHH probability by more than 0.1, so the final prediction result is the SHH subtype.
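Assembling the 48-feature vector can be sketched in Python as below. The arm and gene lists come from the text; everything else is an assumption made for illustration, including the function name build_features, the 0/1 encoding of sex, the use of 0.0 for an arm or gene with no reported alteration, and the ordering of the features.

```python
ARMS = (
    "1p 1q 2p 2q 3p 3q 4p 4q 5p 5q 6p 6q 7p 7q 8p 8q 9p 9q 10p 10q "
    "11p 11q 13q 14q 16p 16q 17p 17q 18p 18q 19p 19q 20q 21p 21q 22q"
).split()
GENES = "MYCN GLI2 MYC PVT1 OTX2 SCAPER WWOX SIRPB1 PTCH1 SMYD4".split()

def build_features(arm_scna, focal_scna, age, sex):
    """Return the 48 features: 36 arm-level SCNAs, 10 focal SCNAs, age, sex.

    arm_scna and focal_scna map arm or gene names to a copy number value;
    names that are absent are treated as unaltered (0.0, an assumption).
    """
    sex_code = {"female": 0.0, "male": 1.0}[sex]  # hypothetical encoding
    features = [float(arm_scna.get(arm, 0.0)) for arm in ARMS]
    features += [float(focal_scna.get(gene, 0.0)) for gene in GENES]
    features += [float(age), sex_code]
    return features
```

A vector built this way would be passed to the trained classifier, which in the text is the SMB-SVM model with a linear kernel.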
The present invention is described in detail above. It will be apparent to those skilled in the art that the present invention can be practiced in a wide range of equivalent parameters, concentrations, and conditions without departing from the spirit and scope of the invention and without undue experimentation. While the invention has been described with respect to specific embodiments, it will be appreciated that the invention may be further modified. In general, this application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains.

Claims (10)

1. A computer readable storage medium storing a computer program for predicting molecular subtypes of a patient with medulloblastoma, characterized in that: the computer program causes a computer to execute the steps of:
a1 Obtaining sequencing data, age and sex characteristics of known molecular subtype medulloblastoma samples;
a2 Performing CNV detection on sequencing data of the known molecular subtype medulloblastoma sample to obtain a CNV result of the known molecular subtype medulloblastoma sample, and performing high-frequency CNV detection on the CNV result of the known molecular subtype medulloblastoma sample to obtain an arm-level SCNA and a focal SCNA result of the known molecular subtype medulloblastoma sample;
a3 Based on 48 features, constructing a molecular typing model of the medulloblastoma by using a machine learning algorithm; the 48 features include 36 chromosome-arm-level copy number variation features and 10 gene-level copy number variation features, as well as age and sex features, obtained from the arm-level SCNA and focal SCNA results; the 36 chromosome arms include: 1p,1q,2p,2q,3p,3q,4p,4q,5p,5q,6p,6q,7p,7q,8p,8q,9p,9q,10p,10q,11p,11q,13q,14q,16p,16q,17p,17q,18p,18q,19p,19q,20q,21p,21q and 22q; the p represents a short chromosome arm, and the q represents a long chromosome arm; the 10 genes include: MYCN, GLI2, MYC, PVT1, OTX2, SCAPER, WWOX, SIRPB1, PTCH1 and SMYD4;
a4 Obtaining sequencing data of a patient with the medulloblastoma to be detected; comparing the sequencing data to a reference genome to obtain a comparison result file;
a5 Performing CNV detection on the comparison result file to obtain CNV results of the patient with the medulloblastoma to be detected; combining the CNV result of the patient with the medulloblastoma to be detected with the CNV result of the known molecular subtype medulloblastoma sample to obtain a combined CNV result; detecting the combined CNV result to obtain a high-frequency CNV result; extracting an arm-level SCNA and a focal SCNA result of a patient with the medulloblastoma to be detected based on the high-frequency CNV result;
a6 Extracting the 48 features of the patient with the medulloblastoma to be tested based on the arm-level SCNA and focal SCNA results and age and sex information of the patient with the medulloblastoma to be tested, and predicting the molecular subtype of the patient with the medulloblastoma to be tested based on the 48 features of the patient with the medulloblastoma to be tested.
2. The computer-readable storage medium of claim 1, wherein: the sequencing data of the patient with the medulloblastoma to be detected are low-depth whole genome sequencing data; the low depth means that the sequencing depth of the sequencing data is greater than or equal to 2×.
3. The computer-readable storage medium according to claim 1 or 2, wherein: the machine learning algorithm is naive Bayes, random forest, the AdaBoost iterative algorithm, logistic regression or a support vector machine.
4. The computer-readable storage medium according to claim 1 or 2, wherein: the machine learning algorithm is a support vector machine; the kernel function of the support vector machine is a linear kernel (kernel='linear').
5. A device for predicting molecular subtype in a medulloblastoma patient, characterized in that: the device comprises the following modules:
b1 Known medulloblastoma sample data acquisition module: used for obtaining sequencing data, age features and sex features of known molecular subtype medulloblastoma samples;
b2 Known medulloblastoma sample high-frequency CNV detection module: used for detecting and obtaining CNV results of the known molecular subtype medulloblastoma samples based on the sequencing data of the known molecular subtype medulloblastoma samples, and for performing high-frequency CNV detection based on the CNV results of the known molecular subtype medulloblastoma samples to obtain arm-level SCNA and focal SCNA results of the known molecular subtype medulloblastoma samples;
b3 Medulloblastoma typing model building module: used for constructing a molecular typing model of the medulloblastoma based on 48 features by using a machine learning algorithm; the 48 features include 36 chromosome-arm-level copy number variation features and 10 gene-level copy number variation features, as well as age and sex features, obtained from the arm-level SCNA and focal SCNA results; the 36 chromosome arms include: 1p,1q,2p,2q,3p,3q,4p,4q,5p,5q,6p,6q,7p,7q,8p,8q,9p,9q,10p,10q,11p,11q,13q,14q,16p,16q,17p,17q,18p,18q,19p,19q,20q,21p,21q and 22q; the p represents a short chromosome arm, and the q represents a long chromosome arm; the 10 genes include: MYCN, GLI2, MYC, PVT1, OTX2, SCAPER, WWOX, SIRPB1, PTCH1 and SMYD4;
b4 CNV detection module for the patient with the medulloblastoma to be detected: used for obtaining sequencing data of the patient with the medulloblastoma to be detected; obtaining a comparison result file of the patient with the medulloblastoma to be detected by comparing the sequencing data to a reference genome; and detecting CNV results of the patient with the medulloblastoma to be detected based on the comparison result file;
b5 High-frequency CNV detection module: used for combining the CNV result of the patient with the medulloblastoma to be detected with the CNV results of the known molecular subtype medulloblastoma samples to obtain a combined CNV result; detecting a high-frequency CNV result from the combined CNV result; and extracting the arm-level SCNA and focal SCNA results of the patient with the medulloblastoma to be detected based on the high-frequency CNV result;
b6 Molecular subtype prediction module for the patient with the medulloblastoma to be detected: used for extracting the 48 features of the patient with the medulloblastoma to be detected based on the arm-level SCNA and focal SCNA results of the patient with the medulloblastoma to be detected, and for predicting the molecular subtype of the patient with the medulloblastoma to be detected based on the 48 features of the patient with the medulloblastoma to be detected.
6. The device according to claim 5, wherein: the sequencing data of the patient with the medulloblastoma to be detected are low-depth whole genome sequencing data; the low depth means that the sequencing depth of the sequencing data is greater than or equal to 2×.
7. The device according to claim 5 or 6, wherein: the machine learning algorithm is naive Bayes, random forest, the AdaBoost iterative algorithm, logistic regression or a support vector machine.
8. The device according to claim 5 or 6, wherein: the machine learning algorithm is a support vector machine; the kernel function of the support vector machine is a linear kernel (kernel='linear').
9. Use of the computer readable storage medium of any one of claims 1-4 for any one of the following:
c1 Use in a product for predicting the molecular subtyping of a patient with medulloblastoma;
c2 Use in developing or preparing a medicament for treating or alleviating medulloblastoma;
c3 Use in developing or preparing a product for guiding medication for medulloblastoma;
c4 Use in preparing a product for predicting the prognosis of molecular subtyping of a patient with medulloblastoma.
10. Use of the device of any one of claims 5-8 for any one of the following:
d1 Use in a product for predicting the molecular subtyping of a patient with medulloblastoma;
d2 Use in developing or preparing a medicament for treating or alleviating medulloblastoma;
d3 Use in developing or preparing a product for guiding medication for medulloblastoma;
d4 Use in preparing a product for predicting the prognosis of molecular subtyping of a patient with medulloblastoma.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310454365.0A CN116168762B (en) 2023-04-25 2023-04-25 Computer readable storage medium and device for predicting medulloblastoma typing by low depth whole genome sequencing technique and application thereof

Publications (2)

Publication Number Publication Date
CN116168762A (en) 2023-05-26
CN116168762B (en) 2023-06-27

Family

ID=86418608

Country Status (1)

Country Link
CN (1) CN116168762B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108728533A (en) * 2017-04-20 2018-11-02 常青 The purposes of gene group and SNCA genes as the biomarker of 4 type medulloblastomas for medulloblastoma molecule parting
CN109182517A (en) * 2018-09-07 2019-01-11 杭州可帮基因科技有限公司 One group of gene and its application for medulloblastoma molecule parting
CN110029157A (en) * 2018-01-11 2019-07-19 北京大学 A method of the unicellular genome monoploid of detection tumour copies number variation

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2021521857A (en) * 2018-04-28 2021-08-30 北京▲師▼▲範▼大学 Molecular classification of multiple myeloma and its application

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
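Imaging findings and prognostic analysis of medulloblastoma of different molecular subtypes; Zhang Yuting et al.; Journal of Clinical Pediatrics (No. 05); pp. 20-24 *
Progress in molecular typing and prognosis of medulloblastoma; He Xiaorong et al.; Chinese Journal of Clinical and Experimental Pathology (No. 02); pp. 77-80 *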

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Computer readable storage media and devices for predicting the subtype of medulloblastoma using low depth whole genome sequencing technology and their applications

Granted publication date: 20230627

Pledgee: Beijing Daxing Development Rongda Financing Guarantee Co.,Ltd.

Pledgor: GENETRON HEALTH (BEIJING) Co.,Ltd.

Registration number: Y2024990000036