CN117316289A - Methylation sequencing typing method and system for central nervous system tumor - Google Patents

Methylation sequencing typing method and system for central nervous system tumor

Info

Publication number
CN117316289A
Authority
CN
China
Prior art keywords
methylation
sequencing
sample
tumor
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311144613.8A
Other languages
Chinese (zh)
Other versions
CN117316289B (en
Inventor
史之峰
周良辅
汤奇胜
李明辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huashan Hospital of Fudan University
Original Assignee
Huashan Hospital of Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huashan Hospital of Fudan University
Priority to CN202311144613.8A
Publication of CN117316289A
Application granted
Publication of CN117316289B
Legal status: Active

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30: Detection of binding sites or motifs
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B35/00: ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides

Abstract

The invention discloses a methylation sequencing typing method and system for central nervous system (CNS) tumors. The method comprises: extracting genomic DNA from a tumor sample to be classified to obtain sample genomic DNA; analyzing the methylation status of a plurality of independent genomic CpG positions in the sample genomic DNA, and constructing a methylation dataset based on the methylation status and a common (public) dataset; and learning, through a random forest model algorithm, the classification rules obtained by analyzing a plurality of methylation datasets, and classifying the subtype of the tumor sample to be classified according to the learning result. By standardizing data detection and analysis, the method yields a standardized CNS tumor typing result, accurately distinguishes different CNS tumor subtypes, and avoids the problems of similar histological features among subtypes and human interpretation errors. The methylation capture sequencing platform is designed specifically for CNS tumor identification, so sites useful for CNS tumor diagnosis can be added in a targeted manner, which keeps the platform extensible.

Description

Methylation sequencing typing method and system for central nervous system tumor
Technical Field
The invention belongs to the technical field of medicine detection, and particularly relates to a methylation sequencing typing method and a methylation sequencing typing system for central nervous system tumors.
Background
The central nervous system (Central Nervous System, CNS) is composed of a wide variety of cell types, also resulting in a wide variety of CNS tumor subtypes. The world health organization lists over 100 brain tumor subtypes in the CNS tumor classification, many of which exhibit overlapping histological features. Furthermore, even histologically identical tumors may belong to different molecular subgroups, with very different therapeutic requirements and prognosis. Thus, more advanced diagnostic tools are needed to distinguish between different subtypes of CNS tumors.
At present, CNS tumors are diagnosed mainly by staining pathological sections and interpreting them by eye. CNS tumors are complex and diverse, histological features are similar among subtypes, and manual interpretation leads to inconsistent diagnoses among different doctors. Many studies have documented inconsistent glioma diagnoses among pathologists, which can affect clinical decisions. For example, in a systematic study of the diagnoses of a cohort of 500 brain tumor patients, some degree of diagnostic disagreement was found in 42.8% of patients, and the disagreement was considered serious in 8.8%. In a study of diagnostic differences in adult glioma, 26% of 457 referred cases had inconsistent diagnoses, and the rate of inconsistency was higher for referrals from community hospitals. Of these, 16% of the inconsistent diagnoses were considered clinically significant, altering patient management and prognosis. A prospective study of 244 cases reviewed by 4 pathologists showed that diagnostic consistency can be improved by repeated review: in the first round the four experts agreed on only 52% of cases, and after the fourth round the proportion was 69%. Thus, for complex CNS tumors, where accuracy is paramount, even repeated review by the most experienced pathologists leaves about 30% of cases without agreement.
Currently, another approach to CNS tumor diagnosis is the typing method of the German tumor center, which is based on Illumina 450K methylation chip data. The 450K chip is an older Illumina methylation chip that has been discontinued; it carries about 450,000 probes and detects the corresponding 450,000 CpG sites. Specifically, the German method detected 2800 CNS tumors with 450K chips, selected the 32000 probe sites with the largest variation, performed unsupervised clustering on the 2800 samples, and finally obtained 91 subclasses based on the methylation data. A random forest machine learning algorithm was then built under the guidance of the 91 subclass labels, and the 10000 most important probe sites capable of distinguishing the 91 subclasses were screened out. The subclass of a CNS tumor is predicted by the random forest algorithm from the data of these 10000 methylation sites.
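The first step of the chip-based method, selecting the probe sites with the largest variation, can be illustrated with a minimal sketch. It assumes that "variation" is measured by the standard deviation of each probe's methylation value across samples; the source does not state the measure, and the function name `most_variable_probes` and the probe IDs are hypothetical.

```python
from statistics import pstdev

def most_variable_probes(beta, k):
    """Return the k probe IDs whose methylation values vary most across samples.

    beta maps a probe ID to a list of methylation values (0..1), one per sample.
    Standard deviation is an assumed measure of variation.
    """
    ranked = sorted(beta, key=lambda probe: pstdev(beta[probe]), reverse=True)
    return ranked[:k]

# Toy data: three hypothetical probes measured in four samples.
beta = {
    "probe_a": [0.10, 0.12, 0.11, 0.09],  # nearly constant
    "probe_b": [0.05, 0.95, 0.10, 0.90],  # strongly variable
    "probe_c": [0.40, 0.60, 0.45, 0.55],  # moderately variable
}
top = most_variable_probes(beta, 2)
print(top)
```

In the method described above the same ranking would be applied to roughly 450,000 probes with k = 32000.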
However, the chip technology platform also has non-negligible problems:
the chip generates fluorescent labels through probe hybridization and then calculates methylation values by scanning the fluorescence intensity, which is an analog signal rather than a true methylation value; inconsistency in the excitation wavelength and intensity of the methylated and unmethylated fluorescence also affects the accuracy of the methylation value; nonspecific hybridization occurs during probe hybridization on the chip, including methylated templates hybridizing to unmethylated probes, unmethylated templates hybridizing to methylated probes, and homologous genomic sequences hybridizing to the probes, all of which produce false signals;
background interference during scanning biases the methylation values of the chip; SNPs at the site of interest can affect probe hybridization and further interfere with the determination of methylation values. The data for the initial assay and algorithm development were generated on the Illumina commercial 450K chip platform, but the 450K chip has been discontinued. Because of format and other differences, data from the new chips show obvious batch differences from data from the old chip; the 850K chip has also been discontinued, and the currently used 935K chip adds new probes but removes some original ones, so about 5% of the original probes may be lost, which also affects chip-based algorithms.
Disclosure of Invention
This section is intended to outline some aspects of embodiments of the invention and to briefly introduce some preferred embodiments. Some simplifications or omissions may be made in this section as well as in the description summary and in the title of the application, to avoid obscuring the purpose of this section, the description summary and the title of the invention, which should not be used to limit the scope of the invention.
The present invention has been made in view of the above problems and/or the problems existing in the prior art.
In a first aspect of embodiments of the present invention, there is provided a methylation sequencing typing method for a central nervous system tumor, comprising: extracting genome DNA from a tumor sample to be classified to obtain sample genome DNA;
analyzing the methylation status of a plurality of independent genomic CpG positions in the sample genomic DNA and constructing a methylation dataset based on the methylation status and a common dataset;
and learning classification rules obtained by analyzing a plurality of methylation data sets through a random forest model algorithm, and classifying the subtypes of the tumor sample to be classified according to learning results.
As a preferred embodiment of the methylation sequencing typing method for central nervous system tumors according to the present invention, wherein: analysis of methylation status of a plurality of independent genomic CpG positions in the sample genomic DNA includes,
designing and customizing a methylation capture probe panel of a target region;
capturing a target sequence based on the methylation capture probe panel and constructing a sequencing library;
hybridizing the sequencing library with the methylation capture probe panel, amplifying an eluted product by PCR, performing on-machine sequencing, and detecting methylation states of a plurality of independent genome CpG positions in the genomic DNA of the sample.
As a preferred embodiment of the methylation sequencing typing method for central nervous system tumors according to the present invention, wherein: the regional design of the methylated capture probe panel includes,
the region covered by the methylation capture probe panel can be designed to cover CpG islands, promoter regions and enhancer regulatory elements across the whole genome, and can also target regions that are differentially methylated between CNS tumor subtypes;
the differentially methylated regions between CNS tumor subtypes are calculated based on a GEO dataset: the samples in the GEO dataset are clustered without supervision by a t-SNE clustering method to obtain multiple subtypes, the differential sites between CNS tumor subtypes in the GEO dataset are analyzed with a random forest machine learning algorithm, the 1000-40000 most important probe sites capable of distinguishing the multiple subtypes are screened out, and the probe panel is then designed in a targeted manner.
As a preferred embodiment of the methylation sequencing typing method for central nervous system tumors according to the present invention, wherein: the method further comprises,
the sequences of the methylation capture probe panel can be designed for a workflow in which the target DNA fragments are captured first and bisulfite conversion, enzymatic conversion or borane conversion is performed afterwards; that is, the probe sequence is complementary to the original DNA sequence, and probes complementary to either or both of the sense strand and the antisense strand can be designed;
when the target DNA fragments are captured first and bisulfite conversion is performed afterwards, methylation status detection comprises: taking a small amount of sample genomic DNA, fragmenting it and constructing a sequencing library, which mainly involves filling in the ends of the fragmented DNA, adding A at the 3' end and ligating sequencing adapters; hybridizing the library with the methylation capture probe panel; performing bisulfite conversion on the captured product; amplifying the converted product by PCR; and sequencing to detect the methylation status.
As a preferred embodiment of the methylation sequencing typing method for central nervous system tumors according to the present invention, wherein: the method further comprises,
the sequences of the methylation capture probe panel can also be designed for a workflow in which conversion is performed first and the target DNA fragments are captured afterwards, where the conversion can be bisulfite conversion, enzymatic conversion or borane conversion; the probe sequence is complementary to the converted sequence and can be designed as one or more of the following four sequences: the converted sequence of the sense strand, the converted sequence of the antisense strand, the sequence complementary to the converted sense strand, and the sequence complementary to the converted antisense strand;
when bisulfite conversion is performed first and the target DNA fragments are captured afterwards, methylation status detection comprises double-stranded library construction or single-stranded library construction; after the converted library is obtained, it is hybridized with the capture probes, and the eluted product is amplified by PCR and sequenced on the instrument to detect the methylation status;
double-stranded library construction comprises: taking a small amount of sample genomic DNA, fragmenting it and constructing a sequencing library, mainly by filling in the ends of the fragmented DNA, adding A at the 3' end and ligating sequencing adapters; then performing bisulfite, enzymatic or borane conversion; hybridizing with the methylation capture probe panel; and amplifying the captured product by PCR to obtain a converted methylation library with sequencing adapters;
single-stranded library construction comprises: taking a small amount of sample genomic DNA and performing bisulfite, enzymatic or borane conversion, the converted product being single-stranded DNA fragments; and then constructing a single-stranded DNA library, the product being a converted methylation library with sequencing adapters.
As a preferred embodiment of the methylation sequencing typing method for central nervous system tumors according to the present invention, wherein: learning classification rules by a random forest model algorithm includes,
candidate methylation markers for CNS tumor identification are screened according to screening criteria by analyzing a plurality of methylation datasets, suitable candidate methylation marker sites are selected, and low-quality sites are filtered out;
the filtered data are input into the random forest model algorithm, and molecular markers are further screened from the remaining sites to obtain 1000-40000 probe methylation sites capable of distinguishing CNS tumor subtypes;
meanwhile, the random forest model algorithm combines the methylation sites into a random forest of decision trees built on those sites; the random forest comprises 1000-30000 decision trees, each decision tree comprises 200-300 nodes, and each node comprises one methylation marker and its decision threshold; at each node, the relation between the sample's methylation value and the threshold determines whether the sample is assigned to a subtype or passes to the next decision node, and each decision tree produces one decision result, i.e. one vote for a subtype;
after the methylation values of all sites of a sample are input, the votes of the decision trees are tallied to give a score for each subtype; the CNS tumor subtype can be determined directly from these scores, or the scores can first be calibrated into probability scores, and the calibrated prediction can serve as a more accurate basis for classifying the tumor sample.
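The node-by-node decision and per-tree voting described above can be sketched as follows. This is a toy illustration under stated assumptions, not the patented model: the tree encoding (a tuple of marker, threshold, lower branch, upper branch), the marker names `cg_a`, `cg_b`, `cg_c` and the subtype labels are all hypothetical, and a real forest would hold 1000-30000 trained trees.

```python
from collections import Counter

def classify_with_tree(tree, sample):
    """Walk one decision tree until a leaf (a subtype name) is reached.

    An internal node is a tuple (marker, threshold, below, above);
    a leaf is a string naming the subtype.
    """
    node = tree
    while not isinstance(node, str):
        marker, threshold, below, above = node
        # Compare the sample's methylation value at this marker with the threshold.
        node = below if sample[marker] <= threshold else above
    return node

def forest_scores(forest, sample):
    """Tally one vote per tree and return the vote fraction for each subtype."""
    votes = Counter(classify_with_tree(tree, sample) for tree in forest)
    return {subtype: count / len(forest) for subtype, count in votes.items()}

# Three hand-written toy trees standing in for a trained forest.
forest = [
    ("cg_a", 0.5, "subtype_1", ("cg_b", 0.3, "subtype_2", "subtype_3")),
    ("cg_b", 0.4, "subtype_1", "subtype_3"),
    ("cg_c", 0.1, ("cg_a", 0.6, "subtype_1", "subtype_3"), "subtype_2"),
]
sample = {"cg_a": 0.8, "cg_b": 0.9, "cg_c": 0.2}

scores = forest_scores(forest, sample)
prediction = max(scores, key=scores.get)
print(prediction, scores)
```

The raw vote fractions play the role of the uncalibrated scores; the probability score calibration mentioned in the text would be a separate step applied to them.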
As a preferred embodiment of the methylation sequencing typing method for central nervous system tumors according to the present invention, wherein: screening criteria for candidate methylation markers identified for CNS tumors include,
the methylation values on the chip platform and the sequencing platform are well correlated;
the site is a high-quality methylation site usable for classification that captures the characteristics of a CNS tumor subtype, being hypermethylated or hypomethylated only in specific subtypes.
In a second aspect of embodiments of the present invention, there is provided a methylation sequencing typing system for a central nervous system tumor, comprising:
a sample acquisition unit for extracting genomic DNA from a tumor sample to be classified to obtain sample genomic DNA;
a state analysis unit for analyzing methylation states of a plurality of independent genomic CpG positions in the sample genomic DNA and constructing a methylation dataset based on the methylation states and a common dataset;
and the classification learning unit is used for learning classification rules obtained by analyzing a plurality of methylation data sets through a random forest model algorithm and classifying the subtypes of the tumor sample to be classified according to a learning result.
In a third aspect of embodiments of the present invention, there is provided an apparatus, comprising,
a processor;
a memory for storing processor-executable instructions;
the processor is configured to invoke the instructions stored in the memory to perform the method according to any of the embodiments of the present invention.
In a fourth aspect of embodiments of the present invention, there is provided a computer readable storage medium having stored thereon computer program instructions comprising:
the computer program instructions, when executed by a processor, implement a method according to any of the embodiments of the present invention.
Compared with the prior art, the invention has the beneficial effects that:
(1) Standardizing data detection and analysis yields a standardized CNS tumor typing result, avoids the current inconsistency of clinical CNS tumor diagnoses among different hospitals or doctors, and provides standardized, highly reproducible typing results through methylation marker analysis.
(2) Compared with interpretation by eye, the typing of the invention is more accurate: it uses tens of thousands of methylation markers, which carry far more dimensions of information than stained pathological sections, so different CNS tumor subtypes can be accurately distinguished and the problems of similar histological features among subtypes and human interpretation errors are avoided.
(3) The technical platform is designed specifically for CNS tumor identification and solves the problem of chip update iterations; the capture product keeps the probe set stable, adding only sites that are more useful for CNS typing without removing useful sites; this avoids the loss of identification sites caused by chip upgrades, which would undermine the stability of the database and the valuable data accumulated previously.
(4) The technical platform of the invention is based on sequencing data, for which the main factor affecting accuracy is sequencing depth: as long as the depth is sufficient, accuracy improves and the result approaches the true value. Chip technology, by contrast, cannot be improved in this way because of its probe sequence and hybridization background problems, and its results deviate substantially from the true value.
(5) The methylation capture sequencing platform is designed specifically for CNS tumor identification; sites useful for CNS tumor diagnosis can be added in a targeted manner, and the platform remains extensible simply by adding new capture probes.
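Point (4) ties accuracy to sequencing depth. One way to make this concrete is a simple binomial sampling model, which is an assumption introduced here and not part of the source: if each read covering a CpG site independently reports "methylated" with probability equal to the true methylation level, the standard error of the estimated level shrinks with the square root of the depth. The function name is hypothetical.

```python
import math

def methylation_standard_error(true_level, depth):
    """Standard error of the estimated methylation level at one CpG site.

    Assumes each of `depth` reads is an independent draw that reports
    'methylated' with probability `true_level` (binomial sampling).
    """
    return math.sqrt(true_level * (1 - true_level) / depth)

# The estimate tightens as depth grows.
for depth in (10, 100, 1000):
    print(depth, round(methylation_standard_error(0.5, depth), 4))
```

Under this model a tenfold increase in depth reduces the sampling error by a factor of about 3.16; it does not capture other error sources such as incomplete conversion or PCR duplicates.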
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. Wherein:
FIG. 1 is a flow chart showing a method and system for methylation sequencing typing of a central nervous system tumor according to the present invention;
FIG. 2 is a cluster map of CNS tumor subtypes t-SNE of a method and system for methylation sequencing typing of CNS tumors provided by the invention;
FIG. 3 is a schematic diagram of a random forest decision tree of the method and system for methylation sequencing typing of CNS tumors provided by the invention.
Detailed Description
So that the manner in which the above recited objects, features and advantages of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to the embodiments, some of which are illustrated in the appended drawings. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways other than those described herein, and persons skilled in the art will readily appreciate that the present invention is not limited to the specific embodiments disclosed below.
Further, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic can be included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.
While the embodiments of the present invention have been illustrated and described in detail in the drawings, the cross-sectional view of the device structure is not to scale in the general sense for ease of illustration, and the drawings are merely exemplary and should not be construed as limiting the scope of the invention. In addition, the three-dimensional dimensions of length, width and depth should be included in actual fabrication.
Also in the description of the present invention, it should be noted that the directions or positional relationships indicated by the terms "upper, lower, inner and outer", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the system or element to be referred to must have a specific direction, be constructed and operated in the specific direction, and thus should not be construed as limiting the present invention. Furthermore, the terms "first, second, or third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
The terms "mounted, connected, and coupled" should be construed broadly in this disclosure unless otherwise specifically indicated and defined, such as: can be fixed connection, detachable connection or integral connection; it may also be a mechanical connection, an electrical connection, or a direct connection, or may be indirectly connected through an intermediate medium, or may be a communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
Example 1
Referring to fig. 1 to 3, in one embodiment of the present invention, a methylation sequencing typing method for central nervous system tumor is provided, which mainly comprises the following steps:
S1: extracting genomic DNA from the tumor sample to be classified to obtain sample genomic DNA. It should be noted that:
the tumor sample to be classified is obtained from the patient's tumor, and genomic DNA is isolated from it, for example using the QIAamp DNA Mini Kit (Qiagen, Valencia, CA) or a similar kit.
S2: analyzing methylation status of a plurality of independent genomic CpG positions in the genomic DNA of the sample. It should be noted that:
analysis of the methylation status includes:
designing and customizing a methylation capture probe panel of a target region; capturing a target sequence based on the methylation capture probe panel and constructing a sequencing library; hybridizing the sequencing library with the methylation capture probe panel, amplifying the eluted product by PCR, sequencing on the instrument, and detecting the methylation status of a plurality of independent genomic CpG positions in the sample genomic DNA.
Furthermore, the region covered by the methylation capture probe panel can be designed to cover CpG islands, promoter regions and enhancer regulatory elements across the whole genome, and can also target regions that are differentially methylated between CNS tumor subtypes;
specifically, the differentially methylated regions between CNS tumor subtypes are calculated based on GEO datasets such as GSE90496, GSE122994, GSE143843, GSE58218, GSE121723 and GSE74193: the samples in the GEO datasets are clustered without supervision by a t-SNE clustering method to obtain 91 subtypes, as shown in fig. 2; the differential sites between CNS tumor subtypes in the GEO datasets are analyzed with a random forest machine learning algorithm, the 10000 most important probe sites capable of distinguishing the 91 subtypes are screened out, and the probe panel is designed in a targeted manner;
the specific steps for screening the probe sites are as follows:
the chip batch effect between FFPE and fresh-tissue samples was corrected with a linear mixed model (for the 2801 samples of the German tumor center); probes on the X and Y chromosomes, probes affected by SNP sites, probes that cannot be uniquely aligned to the genome, and the like were filtered out; a random forest classification model was built with a random forest R package, and the 40000 probes with the highest importance were finally selected;
the methylation capture sequencing data and chip data in the in-house database were then analyzed, the methylation capture sequencing data were matched to the top 40000 probe sites by position, and 10000 methylation sites that are present in the methylation capture sequencing results with good coverage and high consistency were finally selected.
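The probe filtering and position matching steps above can be sketched as follows. This is an illustration only: the data layout (probe ID mapped to chromosome and position, sequencing sites mapped to read depth), the function names and the depth cutoff of 30 are assumptions, and the batch correction and random forest ranking steps are not shown.

```python
def filter_probes(probes, snp_affected, multi_mapped):
    """Drop probes on chrX/chrY, SNP-affected probes, and non-uniquely mapping probes."""
    return {
        probe_id: (chrom, pos)
        for probe_id, (chrom, pos) in probes.items()
        if chrom not in ("chrX", "chrY")
        and probe_id not in snp_affected
        and probe_id not in multi_mapped
    }

def match_to_sequencing(probes, seq_depth, min_depth):
    """Keep probes whose genomic position is covered in sequencing at >= min_depth."""
    return sorted(
        probe_id
        for probe_id, locus in probes.items()
        if seq_depth.get(locus, 0) >= min_depth
    )

# Hypothetical chip probes: ID -> (chromosome, position).
probes = {
    "p1": ("chr1", 1000),
    "p2": ("chrX", 2000),   # removed: sex chromosome
    "p3": ("chr2", 3000),   # removed: listed as SNP-affected
    "p4": ("chr3", 4000),   # kept by the filter, but poorly covered in sequencing
    "p5": ("chr4", 5000),
}
kept = filter_probes(probes, snp_affected={"p3"}, multi_mapped=set())

# Hypothetical capture-sequencing coverage: (chromosome, position) -> read depth.
seq_depth = {("chr1", 1000): 120, ("chr3", 4000): 8, ("chr4", 5000): 95}
matched = match_to_sequencing(kept, seq_depth, min_depth=30)
print(matched)
```

A consistency check between chip and sequencing values would follow as a further filter before fixing the final site list.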
Further, two designs can be used for the sequence design of the probe, which correspond to two experimental procedures:
a. capture the target DNA fragments first and convert afterwards: the probe sequence is complementary to the original DNA sequence, and probes complementary to either or both of the sense strand and the antisense strand can be designed;
b. convert first and capture the target DNA fragments afterwards: the probe sequence is complementary to the converted sequence; the conversion may be bisulfite conversion, enzymatic conversion (e.g. TET, APOBEC or the like, using for example the NEB Enzymatic Methyl-seq Kit or similar kits and methods) or borane conversion (e.g. TAPS), and the probe can be designed as a combination of one or more of the following four sequences: the converted sequence of the sense strand, the converted sequence of the antisense strand, the sequence complementary to the converted sense strand, and the sequence complementary to the converted antisense strand.
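The four candidate sequences in design b can be derived in silico from a reference sequence. The sketch below assumes bisulfite-type or enzymatic chemistry, in which unmethylated C reads as T, and it assumes that CpG cytosines are methylated (kept as C) while all other cytosines are converted; borane (TAPS) chemistry converts methylated C instead and would need the opposite rule. All function names and the toy sequence are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]

def convert(seq, cpg_methylated=True):
    """In-silico conversion of one strand: unmethylated C is read as T.

    With cpg_methylated=True, a C followed by G is treated as methylated
    and left unchanged; every other C becomes T.
    """
    out = []
    for i, base in enumerate(seq):
        in_cpg = base == "C" and seq[i + 1 : i + 2] == "G"
        if base == "C" and not (in_cpg and cpg_methylated):
            out.append("T")
        else:
            out.append(base)
    return "".join(out)

def candidate_probe_targets(sense):
    """Return the four sequences named in the text for a sense-strand region."""
    antisense = reverse_complement(sense)
    conv_sense = convert(sense)
    conv_antisense = convert(antisense)
    return {
        "converted_sense": conv_sense,
        "converted_antisense": conv_antisense,
        "complement_of_converted_sense": reverse_complement(conv_sense),
        "complement_of_converted_antisense": reverse_complement(conv_antisense),
    }

targets = candidate_probe_targets("ACGTCCAG")
for name, seq in targets.items():
    print(name, seq)
```

Because the two strands are no longer complementary after conversion, the four sequences differ, which is why the text lists them as separate design options.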
Further, constructing a sequencing library and sequencing to detect methylation status includes,
(1) Scheme one: capturing the target DNA fragments first, then performing bisulfite conversion;
500 ng-1 μg of genomic DNA is fragmented and a sequencing library is constructed, mainly by filling in the ends of the fragmented DNA, adding A at the 3' end and ligating sequencing adapters; the library is then hybridized with the capture probes and the captured product is bisulfite-converted, for which the EZ DNA Methylation Gold Kit (Zymo Research, Irvine, CA) or a similar kit can be used;
after bisulfite conversion, unmethylated cytosine (C) in the DNA sequence is changed into uracil (U), which is amplified as thymine (T), whereas methylated cytosine (5mC) is not altered by bisulfite treatment and remains cytosine.
It should be noted that bisulfite conversion turns the methylation modification state into a base sequence difference that is easy to identify and can be detected by sequencing or similar means; the converted product is amplified by PCR and then sequenced on the instrument to detect the methylation status.
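Because methylated cytosines remain C and unmethylated cytosines are read as T, the methylation level at a CpG site can be estimated by counting bases across reads. The sketch below assumes the reads are already aligned to the same strand of the reference, have equal length and contain no indels; the function name and toy reads are hypothetical.

```python
def methylation_level(reference, reads, cpg_index):
    """Fraction of reads showing C (methylated) at a reference CpG cytosine.

    Reads showing T at that position count as unmethylated; any other base
    is treated as uninformative and ignored.
    """
    if reference[cpg_index : cpg_index + 2] != "CG":
        raise ValueError("cpg_index must point at the C of a CpG")
    methylated = sum(1 for read in reads if read[cpg_index] == "C")
    unmethylated = sum(1 for read in reads if read[cpg_index] == "T")
    informative = methylated + unmethylated
    return methylated / informative if informative else None

reference = "TTCGAA"
reads = ["TTCGAA", "TTCGAA", "TTTGAA", "TTCGAA"]  # three methylated, one converted
level = methylation_level(reference, reads, 2)
print(level)
```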
(2) Scheme two: performing bisulfite conversion first, then capturing the target DNA fragments;
in scheme two, conversion is performed before capture, and library construction can follow either of two workflows, double-stranded or single-stranded:
a. double-stranded library construction: 500 ng-1 μg of genomic DNA is fragmented and a sequencing library is constructed, mainly by filling in the ends of the fragmented DNA, adding A at the 3' end and ligating sequencing adapters; bisulfite, enzymatic or borane conversion is then performed, the library is hybridized with the capture probes, and the captured product is amplified by PCR to obtain a converted methylation library with sequencing adapters.
b. single-stranded library construction: 500 ng-1 μg of genomic DNA is subjected to bisulfite, enzymatic or borane conversion, the converted product being single-stranded DNA fragments; a single-stranded DNA library is then constructed using a PBAT scheme or a similar scheme, such as the IDT xGen Methyl-Seq Library Prep Kit (formerly the Accel-NGS Methyl-Seq Library Kit) or a similar kit; the product is a converted methylation library with sequencing adapters.
After the converted library is obtained, it is hybridized with the capture probes, and the eluted product is amplified by PCR and then sequenced on the instrument to detect the methylation status.
S3: classification rules obtained by analyzing a plurality of methylation datasets are learned through a random forest model algorithm, and the tumor sample to be classified is assigned a subtype according to the learning result. It should be noted that:
the classification rules are obtained by analyzing a plurality of methylation datasets, including public datasets containing 450K chip results, such as GEO datasets GSE90496, GSE122994, GSE143843, GSE58218, GSE121723, and GSE74193, an in-house dataset containing 850K chip results for a series of CNS samples, and methylation capture sequencing results;
by analyzing the plurality of methylation datasets, candidate methylation markers for CNS tumor identification are screened according to set criteria, suitable candidate methylation marker sites are selected, and low-quality sites are filtered out;
the screening criteria for candidate methylation markers for CNS tumor identification include:
(1) the methylation values measured on the chip and on the sequencing platform are well correlated;
(2) the site is a high-quality methylation site usable for classification, that is, it is characteristic of a CNS tumor subtype, being hypermethylated or hypomethylated only in specific subtypes.
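The two screening criteria above can be sketched as a simple filter. This is a minimal illustration, not the disclosed procedure: the Pearson correlation as the measure of cross-platform agreement, the mean-difference measure of subtype specificity, the thresholds `min_r=0.9` and `min_gap=0.3`, and all function names are assumptions made for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if degenerate)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def subtype_gap(values, labels):
    """Largest gap between one subtype's mean and the mean of all others."""
    best = 0.0
    for s in set(labels):
        inside = [v for v, l in zip(values, labels) if l == s]
        outside = [v for v, l in zip(values, labels) if l != s]
        if inside and outside:
            gap = abs(sum(inside) / len(inside) - sum(outside) / len(outside))
            best = max(best, gap)
    return best


def screen_sites(chip_vals, seq_vals, labels, min_r=0.9, min_gap=0.3):
    """Keep sites that agree across platforms and separate some subtype."""
    kept = []
    for site in chip_vals:
        r = pearson(chip_vals[site], seq_vals[site])
        gap = subtype_gap(seq_vals[site], labels)
        if r >= min_r and gap >= min_gap:
            kept.append(site)
    return kept


# Toy data: four samples, two subtypes, three hypothetical sites.
labels = ["A", "A", "B", "B"]
chip_vals = {"cg1": [0.9, 0.8, 0.1, 0.2],
             "cg2": [0.5, 0.6, 0.5, 0.6],
             "cg3": [0.9, 0.8, 0.1, 0.2]}
seq_vals = {"cg1": [0.85, 0.8, 0.15, 0.2],
            "cg2": [0.55, 0.5, 0.5, 0.55],
            "cg3": [0.2, 0.9, 0.8, 0.1]}
kept = screen_sites(chip_vals, seq_vals, labels)
```

In the toy data only `cg1` is both consistent between platforms and subtype-specific; `cg2` shows no subtype difference and `cg3` disagrees between platforms.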
The filtered data are fed into the random forest model algorithm, and molecular markers are further screened from the remaining sites to obtain 10000 probe methylation sites capable of distinguishing CNS tumor subtypes;
meanwhile, the random forest model algorithm combines the methylation sites into a random forest of decision trees built on those sites; the random forest contains 1000-30000 decision trees, each decision tree contains 200-300 nodes, and each node contains one methylation marker and its decision threshold; at each node, the relation between the sample's methylation value and the threshold determines whether the sample is assigned to a subtype or passed to the next decision node; each decision tree thus produces one subtype decision and casts one vote for that subtype;
after the methylation values of all sites of a sample are input, the votes of the decision trees are collected and tallied to give a score for each subtype, as shown in figure 3; in short, the subtype with the highest classification score across the decision trees is taken as the random forest's final subtype identification result;
after the random forest prediction is obtained, probability score correction is performed: the model uses an L2-norm-regularized multinomial logistic regression algorithm (Multinomial Logistic Regression) to correct the raw scores predicted by the random forest algorithm and obtain corrected probability scores.
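The node-by-node decision and the voting step that precede score correction can be sketched as follows. The nested-dictionary tree layout, the function names, and the use of the vote fraction as the subtype score are assumptions made for this toy example; a real forest would be trained, and would be far larger than the three hand-written trees shown here.

```python
from collections import Counter

# Assumed toy representation: a tree is either a leaf {"subtype": name}
# or a node {"site": marker_id, "threshold": t, "low": tree, "high": tree}.


def predict_tree(node, sample):
    """Walk one tree: compare the sample's methylation value with the
    node threshold until a leaf (subtype decision) is reached."""
    while "subtype" not in node:
        branch = "high" if sample[node["site"]] > node["threshold"] else "low"
        node = node[branch]
    return node["subtype"]


def forest_scores(trees, sample):
    """Each tree casts one vote; the score is the fraction of votes."""
    votes = Counter(predict_tree(t, sample) for t in trees)
    return {subtype: n / len(trees) for subtype, n in votes.items()}


def classify(trees, sample):
    """Return the subtype with the highest score, plus all scores."""
    scores = forest_scores(trees, sample)
    return max(scores, key=scores.get), scores


trees = [
    {"site": "cg1", "threshold": 0.5,
     "low": {"subtype": "B"}, "high": {"subtype": "A"}},
    {"site": "cg2", "threshold": 0.3,
     "low": {"subtype": "A"}, "high": {"subtype": "B"}},
    {"site": "cg1", "threshold": 0.7,
     "low": {"subtype": "B"},
     "high": {"site": "cg2", "threshold": 0.6,
              "low": {"subtype": "A"}, "high": {"subtype": "B"}}},
]
sample = {"cg1": 0.8, "cg2": 0.4}
best, scores = classify(trees, sample)
```

Here two of the three trees vote for subtype A, so A is returned with a raw score of 2/3; it is this raw score that the subsequent correction step takes as input.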
It should be noted that the glmnet R package is used to fit the L2-regularized multinomial logistic regression model, with the methylation classification result as the response variable and the random forest model scores as the explanatory variables; the L2 penalty coefficient is determined by 10-fold cross-validation so as to reduce overfitting, correct the classification model scores, and improve the classification performance of the random forest model.
The raw probability scores predicted by the random forest model are used as input to the trained probability score correction model to obtain corrected classification results and probability scores. Although the classification model can classify the sample to be tested, it cannot predict an accurate probability for each class; the probability score correction model yields corrected probability scores, which are taken as the final prediction results, as shown in table 1.
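The score correction step can be sketched in plain Python as a softmax regression with an L2 penalty. This is not the glmnet fit described above: it uses simple gradient descent with a fixed penalty `lam` instead of a cross-validated one, and the function names, learning rate, epoch count, and toy data are all assumptions made for illustration.

```python
from math import exp


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def linear(W, b, x):
    """Per-class linear scores W[k] . x + b[k]."""
    return [sum(w * v for w, v in zip(W[k], x)) + b[k] for k in range(len(b))]


def fit_calibrator(scores, labels, n_classes, lam=0.01, lr=0.5, epochs=500):
    """Gradient descent on mean cross-entropy plus lam * sum(W ** 2).

    scores: raw random forest scores, one list per sample.
    labels: integer class index per sample.
    lam is fixed here (assumption); the text selects it by 10-fold CV.
    """
    n, n_feat = len(scores), len(scores[0])
    W = [[0.0] * n_feat for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        gW = [[0.0] * n_feat for _ in range(n_classes)]
        gb = [0.0] * n_classes
        for x, y in zip(scores, labels):
            p = softmax(linear(W, b, x))
            for k in range(n_classes):
                err = p[k] - (1.0 if k == y else 0.0)
                gb[k] += err / n
                for j in range(n_feat):
                    gW[k][j] += err * x[j] / n
        for k in range(n_classes):
            b[k] -= lr * gb[k]
            for j in range(n_feat):
                # The L2 penalty shrinks the weights toward zero.
                W[k][j] -= lr * (gW[k][j] + 2 * lam * W[k][j])
    return W, b


def calibrate(W, b, x):
    """Corrected probability scores for one vector of raw scores."""
    return softmax(linear(W, b, x))


# Toy raw scores for two subtypes (class 0 and class 1).
raw = [[0.7, 0.3], [0.6, 0.4], [0.8, 0.2],
       [0.3, 0.7], [0.4, 0.6], [0.2, 0.8]]
truth = [0, 0, 0, 1, 1, 1]
W, b = fit_calibrator(raw, truth, n_classes=2)
corrected = calibrate(W, b, [0.65, 0.35])
```

The corrected scores sum to 1 and preserve the ranking of the raw scores in this toy case; in practice the penalty strength would be chosen by cross-validation as the text describes.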
Table 1: Corrected random forest model prediction probability scores.
Compared with the prior art, the method provided by the invention has the following beneficial effects:
(1) Standardized data detection and analysis yield a standardized CNS tumor typing result, which avoids the present situation in which clinical diagnoses of CNS tumors are inconsistent between hospitals or between doctors; methylation marker analysis gives standardized, highly reproducible typing results.
(2) Compared with judgment by the naked eye, the typing of the invention is more accurate: tens of thousands of methylation markers carry far more dimensions of information than pathological section staining, so different subtypes of CNS tumors can be accurately distinguished, and the problems of similar histological features between subtypes and of human interpretation error are avoided.
(3) The technical platform is specially designed for CNS tumor identification and solves the problem of chip version iteration; because the capture product is designed specifically for CNS tumor identification, the stability of the probe set can be ensured, and sites beneficial to CNS typing are only added while useful sites are never removed; this avoids the loss of identification sites caused by chip upgrades, which would affect the stability of the database and the valuable data accumulated previously.
(4) The technical platform of the invention is based on sequencing data, and the main factor affecting data accuracy is sequencing depth: as long as the depth is sufficient, accuracy improves and the result approaches the true value; chip technology, by contrast, cannot be improved in this way because of probe sequence and hybridization background problems, and its results differ too much from the true value.
(5) The methylation capture sequencing platform is specially designed for CNS tumor identification; sites beneficial to CNS tumor diagnosis can be added in a targeted manner, and the platform remains extensible, requiring only the addition of new capture probes.
In a second aspect of the present disclosure,
there is provided a methylation sequencing typing system for a central nervous system tumor, comprising:
a sample acquisition unit for extracting genomic DNA from a tumor sample to be classified to obtain sample genomic DNA;
a state analysis unit for analyzing the methylation states of a plurality of independent genomic CpG positions in the sample genomic DNA and constructing a methylation dataset based on the methylation states and a public dataset;
and a classification learning unit for learning, through a random forest model algorithm, classification rules obtained by analyzing a plurality of methylation datasets, and for classifying the tumor sample to be classified into a subtype according to the learning result.
In a third aspect of the present disclosure,
there is provided an apparatus comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to invoke the instructions stored in the memory to perform any of the foregoing methods.
In a fourth aspect of the present disclosure,
there is provided a computer readable storage medium having computer program instructions stored thereon, wherein:
the computer program instructions, when executed by a processor, implement any of the foregoing methods.
The present invention may be a method, apparatus, system, and/or computer program product, which may include a computer-readable storage medium having computer-readable program instructions embodied thereon for performing various aspects of the present invention.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include the following: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in a groove having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media, as used herein, are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through fiber optic cables), or electrical signals transmitted through wires.
It should be noted that the above embodiments only illustrate the technical solution of the present invention and do not limit it. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that the technical solution may be modified or equivalently substituted without departing from its spirit and scope, and all such modifications and substitutions fall within the scope of the claims of the present invention.

Claims (10)

1. A methylation sequencing typing method for a central nervous system tumor, characterized by comprising:
extracting genomic DNA from a tumor sample to be classified to obtain sample genomic DNA;
analyzing the methylation states of a plurality of independent genomic CpG positions in the sample genomic DNA and constructing a methylation dataset based on the methylation states and a public dataset;
and learning, through a random forest model algorithm, classification rules obtained by analyzing a plurality of methylation datasets, and classifying the tumor sample to be classified into a subtype according to the learning result.
2. The methylation sequencing typing method of a central nervous system tumor of claim 1, wherein analyzing the methylation states of a plurality of independent genomic CpG positions in the sample genomic DNA includes:
designing and customizing a methylation capture probe panel for the target region;
capturing the target sequence based on the methylation capture probe panel and constructing a sequencing library;
hybridizing the sequencing library with the methylation capture probe panel, amplifying the eluted product by PCR, sequencing it on the instrument, and detecting the methylation states of a plurality of independent genomic CpG positions in the sample genomic DNA.
3. The methylation sequencing typing method of a central nervous system tumor of claim 2, wherein the region design of the methylation capture probe panel includes:
the region covered by the methylation capture probe panel may be designed to cover CpG islands, promoter regions, and enhancer regulatory elements across the whole genome, or may target the differentially methylated regions between CNS tumor subtypes;
the differentially methylated regions between CNS tumor subtypes are calculated based on a GEO dataset: the samples in the GEO dataset are subjected to self-organizing clustering by the t-SNE clustering method to obtain multiple subtypes, the differential sites between the CNS tumor subtypes in the GEO dataset are analyzed with the random forest machine learning algorithm, the 1000-40000 most important probe sites capable of distinguishing the multiple subtypes are selected, and the probe panel is then designed accordingly.
4. The methylation sequencing typing method of a central nervous system tumor of claim 3, characterized in that:
the sequence design of the methylation capture probe panel may follow the scheme of capturing the target DNA fragments first and then performing bisulfite conversion, enzymatic conversion, or borane conversion, in which case the probe sequence is complementary to the original DNA sequence, and probes complementary to either or both of the sense strand and the antisense strand may be designed;
in the scheme of capturing the target DNA fragments first and then performing bisulfite conversion, the methylation state is detected as follows: a small amount of sample genomic DNA is fragmented and a sequencing library is constructed, mainly by end repair of the fragmented DNA and addition of a sequencing adapter at the 3' end; the library is hybridized with the methylation capture probe panel, the captured product is bisulfite-converted, and the conversion product is amplified by PCR and sequenced on the instrument to detect the methylation state.
5. The methylation sequencing typing method of a central nervous system tumor of claim 3, characterized in that:
the sequence design of the methylation capture probe panel may alternatively follow the scheme of performing conversion first and then capturing the target DNA fragments, where the conversion may be bisulfite conversion, enzymatic conversion, or borane conversion; the probe sequence is then complementary to the converted sequence and may be designed as one or more of the following four sequences: the converted sense-strand sequence, the converted antisense-strand sequence, the sequence complementary to the converted sense-strand sequence, and the sequence complementary to the converted antisense-strand sequence;
in the scheme of performing bisulfite conversion first and then capturing the target DNA fragments, methylation state detection includes double-stranded library construction and single-stranded library construction; after the converted library is obtained, it is hybridized with the capture probes, and the eluted product is amplified by PCR and sequenced on the instrument to detect the methylation state;
the double-stranded library construction comprises: fragmenting a small amount of sample genomic DNA and constructing a sequencing library, mainly by end repair of the fragmented DNA, addition of A at the 3' end, and ligation of sequencing adapters; then performing bisulfite, enzymatic, or borane conversion, hybridizing with the methylation capture probe panel, and amplifying the captured product by PCR to obtain a converted methylation library carrying sequencing adapters;
the single-stranded library construction comprises: performing bisulfite, enzymatic, or borane conversion on a small amount of sample genomic DNA, the conversion product being single-stranded DNA fragments, and then constructing a single-stranded DNA library, the product being a converted methylation library carrying sequencing adapters.
6. The methylation sequencing typing method of a central nervous system tumor according to any one of claims 1 to 5, wherein learning classification rules through a random forest model algorithm includes:
by analyzing a plurality of methylation datasets, screening candidate methylation markers for CNS tumor identification according to set criteria, selecting suitable candidate methylation marker sites, and filtering out low-quality sites;
feeding the filtered data into the random forest model algorithm, and further screening molecular markers from the remaining sites to obtain 1000-40000 probe methylation sites capable of distinguishing CNS tumor subtypes;
meanwhile, the random forest model algorithm combines the methylation sites into a random forest of decision trees built on those sites; the random forest comprises 1000-30000 decision trees, each decision tree comprises 200-300 nodes, and each node comprises one methylation marker and its threshold; at each node, the relation between the sample's methylation value and the threshold determines whether the sample is assigned to a subtype or passed to the next decision node; each decision tree produces one subtype decision and casts one vote for that subtype;
after the methylation values of all sites of each sample are input, the votes of the decision trees are collected and tallied to give a score for each subtype, from which the CNS tumor subtype can be determined directly; alternatively, probability score correction can be performed after the scoring results are obtained, and the corrected prediction can serve as the basis for a more accurate classification of the tumor sample.
7. The methylation sequencing typing method of a central nervous system tumor of claim 6, wherein the screening criteria for candidate methylation markers for CNS tumor identification include:
the methylation values measured on the chip and on the sequencing platform are well correlated;
the site is a high-quality methylation site usable for classification, that is, it is characteristic of a CNS tumor subtype, being hypermethylated or hypomethylated only in specific subtypes.
8. A system for performing the methylation sequencing typing method of a central nervous system tumor according to any one of claims 1 to 7, characterized by comprising:
a sample acquisition unit for extracting genomic DNA from a tumor sample to be classified to obtain sample genomic DNA;
a state analysis unit for analyzing the methylation states of a plurality of independent genomic CpG positions in the sample genomic DNA and constructing a methylation dataset based on the methylation states and a public dataset;
and a classification learning unit for learning, through a random forest model algorithm, classification rules obtained by analyzing a plurality of methylation datasets, and for classifying the tumor sample to be classified into a subtype according to the learning result.
9. An apparatus, characterized in that the apparatus comprises:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to invoke the instructions stored in the memory to perform the method of any of claims 1-7.
10. A computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the method of any of claims 1 to 7.
CN202311144613.8A 2023-09-06 2023-09-06 Methylation sequencing typing method and system for central nervous system tumor Active CN117316289B (en)


Publications (2)

Publication Number Publication Date
CN117316289A true CN117316289A (en) 2023-12-29
CN117316289B CN117316289B (en) 2024-04-26

Family

ID=89248933







Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant