CN115497561B - Methylation marker layered screening method and device - Google Patents


Info

Publication number
CN115497561B
Authority
CN
China
Prior art keywords
methylation
samples
sample
cancer
screening
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211067693.7A
Other languages
Chinese (zh)
Other versions
CN115497561A (en)
Inventor
曾秋红
李俊
黄毅
易鑫
杨玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Guiinga Medical Laboratory
Beijing Jiyinjia Medical Laboratory Co ltd
Original Assignee
Shenzhen Guiinga Medical Laboratory
Beijing Jiyinjia Medical Laboratory Co ltd

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/40 Population genetics; Linkage disequilibrium
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment

Abstract

A method and device for hierarchical screening of methylation markers. The method for layering samples comprises: a data acquisition step comprising acquiring methylation modification data of samples; a preprocessing step comprising preprocessing the methylation modification data to obtain preprocessed samples of each type; a dimension reduction step comprising performing dimension reduction separately on each type of preprocessed sample; and a layering step comprising clustering the dimension-reduced samples and determining the optimal number of clusters, thereby layering the samples. The method is one-to-many: it can screen methylation markers for multiple types of samples at the same time, and is efficient and reliable.

Description

Methylation marker layered screening method and device
Technical Field
The invention relates to the field of bioinformatics, in particular to a method and a device for hierarchical screening of methylation markers.
Background
According to statistics from the National Cancer Center, there were about 4.064 million new cancer cases in China in 2016, and the world-standardized incidence rate was 186.46 per 100,000. The five cancers with the most new cases were lung cancer, colorectal cancer, gastric cancer, liver cancer, and breast cancer. Early screening, early diagnosis, and timely treatment are effective ways to reduce cancer mortality. The European Society for Medical Oncology (ESMO) has stated that cancer morbidity and mortality in Western countries are decreasing year by year, mainly owing to early cancer screening, early excision of benign adenomas, and early treatment of cancerous lesions.
At present, a few tumor markers are used clinically, such as carcinoembryonic antigen (CEA), alpha-fetoprotein (AFP), cancer antigen 125 (CA125), carbohydrate antigen 19-9 (CA19-9), and prostate-specific antigen (PSA), but their sensitivity or specificity generally cannot meet the requirements of clinical diagnosis. In particular, certain tumor markers may also be elevated under some physiological conditions or in benign lesions; for example, serum CA125 may be elevated during menstruation, early pregnancy, cirrhosis, or chronic active hepatitis, and cholestasis may lead to elevated serum CA19-9. Clinicians therefore usually test several markers at once and interpret them together with other information such as clinical symptoms and imaging examinations. As a result, these tumor markers by themselves are not well suited to broad screening of healthy populations. Nevertheless, discovering and using tumor-specific biomarkers, so that the affected organ can be identified and treated early in tumorigenesis, is a key factor in improving therapeutic outcomes and extending patient survival.
DNA methylation is an important epigenetic modification in eukaryotes, in which methylation occurs only at cytosines, i.e., cytosine is converted to 5-methylcytosine by DNA methyltransferases. Numerous studies have shown that DNA methylation plays an important role in the development of cancer, and that abnormal methylation changes are one of the hallmark events in cancer progression. CpG islands in human gene promoter regions are usually unmethylated; in cancer they can become hypermethylated, potentially leading to transcriptional silencing of some important tumor suppressor genes and DNA repair genes, while the genome as a whole usually appears demethylated, a state closely related to genomic stability. Existing studies indicate that tumor DNA methylation patterns can be used for tumor classification, diagnosis, and treatment, so DNA methylation may serve as a potential biomarker for analysis of plasma components in healthy people, organ damage, immune response monitoring, early cancer screening, recurrence detection, organ transplantation monitoring, and localization of the primary site of metastatic cancer.
However, existing screening methods based on DNA methylation are not amenable to one-to-many screening and have low accuracy.
Disclosure of Invention
According to a first aspect, in an embodiment, there is provided a method of layering samples, comprising:
a data acquisition step comprising acquiring methylation modification data of samples;
a preprocessing step comprising preprocessing the methylation modification data to obtain preprocessed samples of each type;
a dimension reduction step comprising performing dimension reduction separately on each type of preprocessed sample;
and a layering step comprising clustering the dimension-reduced samples and determining the optimal number of clusters, thereby layering the samples.
According to a second aspect, in an embodiment, there is provided a method of hierarchical screening for methylation markers comprising:
a first layer screening step comprising, for the N groups of samples obtained by the method of any one of the first aspect, screening for methylation markers that distinguish the N groups, i.e., first layer methylation markers;
and a second layer screening step comprising, for the N groups of samples obtained by the method of any one of the first aspect, screening within each group for methylation markers of the different types of samples, i.e., second layer methylation markers.
According to a third aspect, in an embodiment, there is provided an apparatus for layering samples, comprising:
a data acquisition module for acquiring methylation modification data of samples in a database;
a preprocessing module for preprocessing the methylation modification data to obtain preprocessed samples of each type;
a dimension reduction module for performing dimension reduction separately on each type of preprocessed sample;
and a layering module for clustering the dimension-reduced samples and determining the optimal number of clusters, thereby layering the samples.
According to a fourth aspect, in one embodiment, there is provided an apparatus for hierarchical screening of methylation markers, comprising:
a first layer screening module configured to screen, for the N groups of samples obtained by the method of any one of the first aspect, methylation markers that distinguish the N groups, i.e., first layer methylation markers;
and a second layer screening module configured to screen, within each of the N groups of samples obtained by the method of any one of the first aspect, methylation markers of the different types of samples, i.e., second layer methylation markers.
According to a fifth aspect, in an embodiment, there is provided an apparatus for layering samples, comprising:
a memory for storing a program;
a processor for implementing the method of any one of the first aspects by executing a program stored in the memory.
According to a sixth aspect, in one embodiment, there is provided an apparatus for hierarchical screening of methylation markers, comprising:
a memory for storing a program;
a processor for implementing the method of any one of the second aspects by executing a program stored in the memory.
According to a seventh aspect, in an embodiment, a computer readable storage medium is provided, on which a program is stored, the program being executable by a processor to implement the method as in any one of the first or second aspects.
The method and device for hierarchical screening of methylation markers according to the above embodiments are one-to-many: methylation markers for multiple types of samples can be screened at the same time, which is efficient and reliable.
In one embodiment, the method is applicable not only to the methylation level of a single methylation site but also to the methylation level of a region. Measures of methylation level include, but are not limited to, average methylation rate, methylation entropy, epi-polymorphism, methylation haplotype load (MHL), and haplotype count.
In one embodiment, compared with a single methylation site, a methylation region reduces false positives, i.e., it is less affected by the detection technology. In sequencing data, some methylation sites may go undetected and produce missing values; using a region-level methylation level can avoid such missing values.
Drawings
FIG. 1 is a schematic flow diagram of hierarchical screening in one embodiment;
FIG. 2 is a sample layering result graph in one embodiment;
FIG. 3 is a schematic flow chart of hierarchical screening of methylation markers in one embodiment;
FIG. 4 is a schematic illustration of a methylation marker evaluation flow scheme in one embodiment;
FIG. 5 is a graph of prediction accuracy results in one embodiment.
Detailed Description
The application is described in further detail below with reference to the drawings by means of specific embodiments. In the following embodiments, numerous specific details are set forth to provide a better understanding of the present application. However, one skilled in the art will readily recognize that some of these features may be omitted in various situations, or replaced by other materials or methods. In some instances, operations related to the present application are not shown or described in the specification, in order to avoid obscuring its core portions; a detailed description of such operations is unnecessary, because a person skilled in the art can fully understand them from the description and from general knowledge in the art.
Furthermore, the described features, operations, or characteristics of the description may be combined in any suitable manner in various embodiments. Also, various steps or acts in the method descriptions may be interchanged or modified in a manner apparent to those of ordinary skill in the art. Thus, the various orders in the description and drawings are for clarity of description of only certain embodiments, and are not meant to be required orders unless otherwise indicated.
The numbering of the components itself, e.g. "first", "second", etc., is used herein merely to distinguish between the described objects and does not have any sequential or technical meaning.
According to a first aspect, in an embodiment, there is provided a method of layering samples, comprising:
a data acquisition step comprising acquiring methylation modification data of a sample;
a preprocessing step comprising preprocessing the methylation modification data to obtain preprocessed samples of each type;
a dimension reduction step comprising performing dimension reduction separately on each type of preprocessed sample;
and a layering step comprising clustering the dimension-reduced samples and determining the optimal number of clusters, thereby layering the samples.
In an embodiment, the preprocessing step comprises probe filtering and sample filtering.
In an embodiment, in the dimension reduction step, the method for performing dimension reduction on each type of sample includes: calculating the dispersion of each probe across the samples of the target type, sorting the probes by dispersion in descending order, taking the probes whose dispersion ranks above a preset ranking as effective features, clustering the samples, and determining the optimal number of clusters according to the indexes, thereby achieving dimension reduction for each type of sample.
In an embodiment, in the dimension reduction step, when the probes whose dispersion ranks above the preset ranking are taken as effective features, the preset ranking (K) may be set according to the practical application, including but not limited to the top 10%, 20%, 30%, and so on.
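The feature selection described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes that dispersion is measured by population variance, and the function name `top_dispersed_probes`, the probe IDs, and the data layout are hypothetical.

```python
from statistics import pvariance


def top_dispersed_probes(levels, fraction=0.1):
    """Return the probe IDs whose dispersion ranks in the top `fraction`.

    levels: dict mapping probe ID -> list of methylation levels across samples.
    Dispersion is measured here by population variance (an assumption; the
    source does not name a specific dispersion measure).
    """
    ranked = sorted(levels, key=lambda probe: pvariance(levels[probe]), reverse=True)
    keep = max(1, int(len(ranked) * fraction))  # always keep at least one probe
    return ranked[:keep]


# Hypothetical methylation levels for five probes measured in four samples.
levels = {
    "cg01": [0.10, 0.90, 0.50, 0.20],
    "cg02": [0.50, 0.50, 0.50, 0.50],
    "cg03": [0.40, 0.60, 0.45, 0.55],
    "cg04": [0.05, 0.95, 0.10, 0.90],
    "cg05": [0.30, 0.31, 0.29, 0.30],
}
selected = top_dispersed_probes(levels, fraction=0.4)
```

The selected probes would then serve as the features passed to the clustering step.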
In one embodiment, the indexes include, but are not limited to, at least one of the variance ratio criterion (VRC; also known as the Calinski-Harabasz criterion), the gap statistic (GS), the silhouette coefficient (average silhouette method), and the like.
In one embodiment, the indexes include, but are not limited to, at least two of the variance ratio criterion (VRC; also known as the Calinski-Harabasz criterion), the gap statistic (GS), the silhouette coefficient (average silhouette method), and the like.
In an embodiment, in the dimension reduction step, if two or more indexes give the same optimal number of clusters, that number is taken as the final optimal number of clusters; otherwise, the optimal number of clusters determined by the silhouette coefficient is taken as the final optimal number of clusters, thereby achieving dimension reduction for each type of sample.
In an embodiment, in the dimension reduction step, the method for clustering samples includes at least one of a hierarchy-based clustering algorithm and a density-based clustering algorithm.
In one embodiment, the dimension reduction step further comprises calculating, for each probe capture region, the methylation level of all samples in each type.
In one embodiment, the methylation level comprises an average methylation rate, methylation entropy, epi-polymorphism, methylation haplotype load (MHL), or haplotype count.
In one embodiment, the methylation level comprises a mean of beta values.
In an embodiment, the dimension reduction step further includes calculating, for each probe capture region, the mean of the beta values of all samples in each type.
beta value = M / (M + U + offset), where M represents the methylated signal intensity, U represents the unmethylated signal intensity, and offset is an offset that prevents the denominator from being 0. The beta value is thus the proportion of the total signal intensity that is methylated.
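As a worked illustration of the formula, the short sketch below computes beta values and their mean. The default `offset=100.0` and the function names are assumptions made for the example; the source does not fix a value for the offset.

```python
def beta_value(methylated, unmethylated, offset=100.0):
    """beta = M / (M + U + offset); the offset keeps the denominator above 0."""
    return methylated / (methylated + unmethylated + offset)


def mean_beta(signals, offset=100.0):
    """Mean beta value over (M, U) signal-intensity pairs, e.g. one pair per sample."""
    betas = [beta_value(m, u, offset) for m, u in signals]
    return sum(betas) / len(betas)


# Hypothetical (M, U) intensities for one probe in three samples.
signals = [(900.0, 0.0), (450.0, 450.0), (0.0, 0.0)]
average = mean_beta(signals)
```

Note that a probe with no signal at all (M = U = 0) yields a beta value of 0 rather than a division error, which is the purpose of the offset.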
The beta value applies to chip data. For sequencing data, other indicators may be substituted for the beta value, including but not limited to methylation entropy, epi-polymorphism, methylation haplotype load (MHL), or haplotype count.
In an embodiment, in the layering step, the dispersion of each probe across all samples is calculated, the probes are sorted by dispersion in descending order, the probes whose dispersion ranks above a preset ranking are taken as effective features, the samples are clustered, and the optimal number of clusters is determined according to the indexes.
In an embodiment, in the layering step, the method for clustering samples includes at least one of the following: the unweighted pair group method with arithmetic mean (UPGMA), the neighbor-joining method for phylogenetic trees, partition-based clustering algorithms, hierarchy-based clustering algorithms, and network-based clustering algorithms.
In one embodiment, in the layering step, the indexes include, but are not limited to, at least two of the variance ratio criterion (VRC; also known as the Calinski-Harabasz criterion), the gap statistic (GS), the silhouette coefficient (average silhouette method), and the like.
In an embodiment, in the layering step, if two or more indexes give the same optimal number of clusters, that number is taken as the final optimal number of clusters; otherwise, the optimal number of clusters N determined by the silhouette coefficient is taken as the final optimal number of clusters.
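The consensus rule for the final number of clusters can be sketched as below. The sketch assumes that each index has already produced its own optimal cluster number; the function name `final_cluster_number` and the dictionary keys are hypothetical.

```python
from collections import Counter


def final_cluster_number(best_k_by_index, fallback="silhouette"):
    """Pick the final cluster number from per-index optima.

    best_k_by_index: dict mapping index name -> optimal cluster number.
    If at least two indexes agree, their shared value is returned;
    otherwise the value chosen by the silhouette coefficient is used.
    """
    counts = Counter(best_k_by_index.values())
    k, votes = counts.most_common(1)[0]
    if votes >= 2:
        return k
    return best_k_by_index[fallback]


agree = final_cluster_number({"vrc": 4, "gap": 4, "silhouette": 5})
disagree = final_cluster_number({"vrc": 3, "gap": 4, "silhouette": 5})
```

With only three indexes at most one value can be shared by two of them; if more indexes were added, a tie between two agreeing pairs would need an extra rule that the source does not specify.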
In an embodiment, in the layering step, the plurality of types of samples are finally divided into N groups, and each group contains at least one type of sample. The term "plurality" as used herein means two or more.
In one embodiment, the type comprises a cancer species, a developmental lineage, a tissue type, or a cell type.
In one embodiment, in the layering step, the plurality of cancer species may be eventually divided into N groups, each of which may include one or more cancer species.
In one embodiment, in the data acquisition step, the methylation modification data of the sample is derived from a database.
In an embodiment, in the step of obtaining data, the database comprises a public database.
In one embodiment, in the data acquisition step, methylation modification data of the sample may also be derived from self-test data.
In an embodiment, in the step of obtaining data, the sample includes, but is not limited to, at least one of a tissue sample and a body fluid sample.
In an embodiment, in the step of obtaining data, the tissue sample includes, but is not limited to, at least one of cancer tissue and normal tissue.
In an embodiment, in the step of data acquisition, the body fluid sample includes, but is not limited to, a plasma sample.
In one embodiment, in the data acquisition step, the sample includes, but is not limited to, a normal cell sample.
In an embodiment, in the data acquisition step, the sample includes, but is not limited to, a cancer sample.
In one embodiment, in the data acquisition step, the cancer sample includes, but is not limited to, a primary tumor tissue sample from a pan-cancer cohort.
In one embodiment, in the step of obtaining data, the tumor includes, but is not limited to, at least one of hepatocellular carcinoma, cholangiocarcinoma, lung adenocarcinoma, lung squamous carcinoma, gastric cancer, esophageal carcinoma, colon carcinoma, rectal adenocarcinoma, pancreatic cancer, breast carcinoma, ovarian carcinoma, cervical cancer, endometrial carcinoma, uterine sarcoma, prostate carcinoma, bladder urothelial carcinoma, adrenocortical carcinoma, renal chromophobe carcinoma, renal clear cell carcinoma, renal papillary cell carcinoma, head and neck squamous cell carcinoma, thyroid carcinoma, thymoma, mesothelioma, sarcoma, skin melanoma, ocular melanoma, pheochromocytoma, paraganglioma, brain low-grade glioma, glioblastoma.
In an embodiment, in the preprocessing step, the probe filtering rule includes: rejecting a probe if an SNP locus lies within 10 bp upstream or downstream of the probe; and also rejecting probes on sex chromosomes and probes whose sample missing-value proportion exceeds a preset threshold.
The sample missing-value proportion is the ratio of the number of samples in which no signal is detected on a given probe to the total number of samples. For example, if no signal is detected on the probe in 30 of 100 samples (recorded as the missing value NA), the sample missing-value proportion of the probe is 30/100.
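The missing-value rule can be illustrated with a small sketch. It assumes missing values (NA) are represented as `None`; the threshold of 0.2 and the names `missing_proportion` and `filter_probes` are hypothetical choices for the example.

```python
def missing_proportion(values):
    """Fraction of samples with no detected signal (None stands for NA)."""
    return sum(v is None for v in values) / len(values)


def filter_probes(probe_values, max_missing=0.2):
    """Keep probes whose missing-value proportion does not exceed the threshold."""
    return [
        probe
        for probe, values in probe_values.items()
        if missing_proportion(values) <= max_missing
    ]


# The example from the text: 30 of 100 samples are missing for one probe.
probe_values = {
    "cg_sparse": [None] * 30 + [0.5] * 70,
    "cg_dense": [None] * 5 + [0.5] * 95,
}
kept = filter_probes(probe_values, max_missing=0.2)
```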
In an embodiment, in the preprocessing step, the sample filtering rule includes: identifying abnormal samples using at least one algorithm, and rejecting a sample if at least one of the algorithms used identifies it as abnormal.
In one embodiment, a sample is rejected if two or more algorithms identify it as abnormal.
In principle, a single algorithm may be used to identify abnormal samples, but the result is more reliable when a sample is judged abnormal by several methods. If only one algorithm is used, more samples may be rejected. The algorithms can generally be adjusted according to actual needs, and the number of algorithms can be chosen according to the number of samples finally included in the analysis.
In an embodiment, the algorithm for identifying abnormal samples includes, but is not limited to, at least one of Isolation Forest, the Local Outlier Factor (LOF) detection algorithm, the density-based clustering algorithm DBSCAN (Density-Based Spatial Clustering of Applications with Noise), a partition-based clustering algorithm, a hierarchy-based clustering algorithm, and a network-based clustering algorithm, and preferably at least two of the foregoing algorithms.
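The voting rule for rejecting abnormal samples can be sketched as follows. The sketch assumes each algorithm has already been run and has returned the set of sample IDs it flags; the function name `reject_samples`, the sample IDs, and the flagged sets are hypothetical.

```python
def reject_samples(flags_by_algorithm, min_votes=2):
    """Return the samples flagged as abnormal by at least `min_votes` algorithms.

    flags_by_algorithm: dict mapping algorithm name -> set of flagged sample IDs.
    min_votes=1 rejects a sample as soon as any algorithm flags it;
    min_votes=2 requires agreement between two or more algorithms.
    """
    votes = {}
    for flagged in flags_by_algorithm.values():
        for sample in flagged:
            votes[sample] = votes.get(sample, 0) + 1
    return sorted(sample for sample, n in votes.items() if n >= min_votes)


# Hypothetical outputs of three outlier detectors.
flags = {
    "isolation_forest": {"S1", "S2"},
    "lof": {"S2", "S3"},
    "dbscan": {"S1", "S2"},
}
strict = reject_samples(flags, min_votes=2)
lenient = reject_samples(flags, min_votes=1)
```

Requiring more votes rejects fewer samples, which matches the trade-off described above between reliability and the number of samples retained.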
According to a second aspect, in an embodiment, there is provided a method of hierarchical screening for methylation markers comprising:
a first layer screening step comprising, for the N groups of samples obtained by the method of any one of the first aspect, screening for methylation markers that distinguish the N groups, i.e., first layer methylation markers;
and a second layer screening step comprising, for the N groups of samples obtained by the method of any one of the first aspect, screening within each group for methylation markers of the different types of samples, i.e., second layer methylation markers.
In one embodiment, the first layer screening step includes:
a sample selection step comprising obtaining the samples of the N groups for first layer methylation marker screening;
a methylation region partitioning step comprising assigning CpG sites to the same methylation region when the correlation coefficient between any two adjacent CpG sites is greater than a preset value and the number of CpG sites is greater than a preset number;
a methylation marker screening step comprising testing whether the methylation level of each methylation region differs with statistical significance among the N groups; if not, no further steps are carried out for that region; if so, comparing the methylation levels of the N groups pairwise to judge whether the methylation level of the methylation region differs with statistical significance between each two groups; and if the methylation level of a methylation region differs with statistical significance between the samples of a specific group and the samples of M other groups, determining that the methylation region is a specific methylation marker of that group, where M is any natural number less than N.
In an embodiment, in the methylation region partitioning step, the correlation coefficient includes, but is not limited to, the Pearson correlation coefficient or the Spearman correlation coefficient.
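The region partitioning step can be sketched as below, using the Pearson correlation between adjacent CpG sites. This is a simplified illustration: the thresholds `min_r=0.8` and `min_count=2`, the function names, and the input layout (sites already ordered by genomic position) are assumptions for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # treat a constant site as uncorrelated
    return cov / (sx * sy)


def partition_regions(sites, min_r=0.8, min_count=2):
    """Group adjacent CpG sites into regions.

    sites: list of (site_id, levels across samples), ordered by position.
    A region is extended while adjacent sites correlate above `min_r`, and is
    kept only if it contains more than `min_count` sites.
    """
    regions, current = [], []
    for site_id, levels in sites:
        # Close the running region when the adjacent-site correlation is too low.
        if current and pearson(current[-1][1], levels) <= min_r:
            if len(current) > min_count:
                regions.append([s for s, _ in current])
            current = []
        current.append((site_id, levels))
    if len(current) > min_count:
        regions.append([s for s, _ in current])
    return regions


sites = [
    ("c1", [0.10, 0.20, 0.30, 0.40]),
    ("c2", [0.15, 0.25, 0.35, 0.45]),
    ("c3", [0.20, 0.40, 0.60, 0.80]),
    ("c4", [0.80, 0.10, 0.70, 0.20]),
    ("c5", [0.79, 0.12, 0.71, 0.20]),
]
regions = partition_regions(sites)
```

In this toy input, c1 to c3 form one region, while c4 and c5 correlate with each other but are too few to pass the site-count threshold.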
In one embodiment, in the methylation marker screening step, M may be set to N-1, N-2, etc. according to the practical application; the specificity of the methylation marker is best when M is N-1.
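The role of M can be illustrated with a short sketch that takes the pairwise comparison results for one methylation region as given. It reads "differs from M other groups" as "differs from at least M other groups", which is an interpretive assumption; the function name `specific_groups` and the group labels are hypothetical.

```python
def specific_groups(significant_pairs, groups, m):
    """Return the groups for which a region is a specific marker.

    significant_pairs: set of frozenset({a, b}) for group pairs whose
    methylation levels differ significantly in this region.
    A group qualifies when it differs from at least `m` other groups.
    """
    result = []
    for group in groups:
        n_different = sum(
            frozenset((group, other)) in significant_pairs
            for other in groups
            if other != group
        )
        if n_different >= m:
            result.append(group)
    return result


groups = ["A", "B", "C"]
# Hypothetical result: group A differs from both B and C; B and C do not differ.
significant_pairs = {frozenset(("A", "B")), frozenset(("A", "C"))}
strict = specific_groups(significant_pairs, groups, m=len(groups) - 1)
relaxed = specific_groups(significant_pairs, groups, m=1)
```

With M = N-1 only group A qualifies, whereas M = 1 also admits B and C, which illustrates why the largest M gives the most specific marker.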
In one embodiment, in the methylation marker screening step, the test method includes, but is not limited to, at least one of analysis of variance and the Kruskal-Wallis test.
In one embodiment, in the methylation marker screening step, the p-value is corrected by a multiple comparison method to obtain a corrected p-value (padj); if the corrected p-value is less than a preset threshold, it is determined that the methylation level of the methylation region differs with statistical significance among the N groups.
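The source does not name the multiple comparison method. As one possible choice, the sketch below applies the Benjamini-Hochberg procedure to obtain corrected p-values; the use of this procedure, the function name, and the example p-values are all assumptions.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonic adjusted values.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four methylation regions.
pvalues = [0.01, 0.04, 0.03, 0.005]
padj = benjamini_hochberg(pvalues)
significant = [p < 0.05 for p in padj]
```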
In one embodiment, in the methylation marker screening step, when the p-value of a pairwise comparison is less than a preset threshold and the absolute value of the methylation level difference is greater than a preset threshold, it is determined that the methylation level of the methylation region differs with statistical significance between the two groups.
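The two-part pairwise criterion can be written as a single predicate. The thresholds `p_threshold=0.05` and `min_delta=0.2` and the function name `differs` are illustrative assumptions; the source leaves both thresholds as preset values.

```python
def differs(p_value, level_a, level_b, p_threshold=0.05, min_delta=0.2):
    """True when both the p-value and the effect-size criteria are met."""
    return p_value < p_threshold and abs(level_a - level_b) > min_delta


# Hypothetical pairwise results for one methylation region.
clear_difference = differs(0.01, 0.80, 0.30)   # small p, large difference
tiny_effect = differs(0.01, 0.50, 0.45)        # small p, difference too small
not_significant = differs(0.20, 0.90, 0.10)    # large difference, p too high
```

Combining the two conditions guards against regions that are statistically significant only because of large sample sizes but whose methylation difference is too small to be useful as a marker.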
In one embodiment, the methylation level comprises an average methylation rate, methylation entropy, epi-polymorphism, methylation haplotype load (MHL), or haplotype count.
In one embodiment, the methylation level comprises a mean of beta values.
In one embodiment, the methylation level difference comprises the difference, between the two groups of samples, in the methylation level of each CpG site within the region.
In one embodiment, the methylation level comprises the mean of the beta values of all CpG sites contained in the methylation region.
In one embodiment, in the methylation marker screening step, any two of the N groups are compared pairwise, and the statistical methods used include, but are not limited to, at least one of Tukey's Honest Significant Difference (Tukey's HSD), the Least Significant Difference (LSD) test, Dunnett's t-test, the Student-Newman-Keuls (SNK) method, Duncan's new multiple range test, and the like.
In one embodiment, the second layer screening step is performed with reference to the first layer screening step.
In one embodiment, the second layer screening step includes:
a sample selection step comprising classifying the samples in each group by cancer species, the number X of cancer species being equal to the number of sample categories; for example, if the samples in a group involve X cancer species, the samples in that group are divided into X categories;
a methylation region partitioning step comprising assigning CpG sites to the same methylation region when the Pearson correlation coefficient between any two adjacent CpG sites is greater than a preset value and the number of CpG sites is greater than a preset number;
a methylation marker screening step comprising testing whether the methylation level of each methylation region differs with statistical significance among the X categories and, if so, comparing the methylation levels of the categories pairwise to judge whether the methylation level of the methylation region differs with statistical significance between each two categories; if the methylation level of a methylation region differs with statistical significance between the samples of a specific category and the samples of Y other categories, the methylation region is determined to be a specific methylation marker of that category, where Y is a natural number less than X.
In one embodiment, in the methylation marker screening step, Y may be set to X-1, X-2, etc. according to the practical application; the specificity of the methylation marker is best when Y is X-1.
In one embodiment, in the methylation marker screening step, the test method includes, but is not limited to, at least one of analysis of variance and the Kruskal-Wallis test.
In one embodiment, in the methylation marker screening step, the p-value is corrected by a multiple comparison method to obtain a corrected p-value (padj); if the corrected p-value is less than a preset threshold, it is determined that the methylation level of the methylation region differs with statistical significance among the X categories.
In one embodiment, in the methylation marker screening step, when the p-value of a pairwise comparison is less than a preset threshold and the absolute value of the beta value difference is greater than a preset threshold, it is determined that the methylation level of the methylation region differs with statistical significance between the two categories.
In one embodiment, the methylation level comprises an average of beta values for all CpG sites contained in the methylation region.
According to a third aspect, in an embodiment, there is provided an apparatus for layering samples, comprising:
a data acquisition module for acquiring methylation modification data of samples in a database;
a preprocessing module for preprocessing the methylation modification data to obtain preprocessed samples of each type;
a dimension reduction module for performing dimension reduction separately on each type of preprocessed sample;
and a layering module for clustering the dimension-reduced samples and determining the optimal number of clusters, thereby layering the samples.
According to a fourth aspect, in one embodiment, there is provided an apparatus for hierarchical screening of methylation markers, comprising:
a first layer screening module configured to screen, for the N groups of samples obtained by the method of any one of the first aspect, methylation markers that distinguish the N groups, i.e., first layer methylation markers;
and a second layer screening module configured to screen, within each of the N groups of samples obtained by the method of any one of the first aspect, methylation markers of the different types of samples, i.e., second layer methylation markers.
According to a fifth aspect, in an embodiment, there is provided an apparatus for layering samples, comprising:
a memory for storing a program;
a processor for implementing the method of any one of the first aspects by executing a program stored in the memory.
According to a sixth aspect, in one embodiment, there is provided an apparatus for hierarchical screening of methylation markers, comprising:
A memory for storing a program;
a processor for implementing the method of any one of the second aspects by executing a program stored in the memory.
According to a seventh aspect, in an embodiment, a computer readable storage medium is provided, on which a program is stored, the program being executable by a processor to implement the method as in any one of the first or second aspects.
According to an eighth aspect, in an embodiment, there is provided a methylation marker assessment method comprising:
a verification step, comprising constructing a first-layer classification model from the first-layer methylation markers screened according to the method of any one of the second aspect, and verifying the model with an independent data set, wherein the data comprise samples of the N groupings;
a sample selection step, comprising dividing the data into a training set and a verification set, wherein the training set data are used for model construction and the verification set data are used for model performance verification;
a feature selection step, comprising further screening the first-layer methylation markers by recursive feature elimination (RFE), and taking the retained subset of methylation markers as the final features;
a classification model construction step, comprising training a machine learning model on the final features obtained in the feature selection step to obtain an optimal methylation classification model;
a model prediction step, comprising predicting the verification set data with the methylation classification model constructed in the classification model construction step, and evaluating the performance of the model;
and constructing a second-layer classification model for each grouping from the previously screened methylation markers of that grouping, and verifying it with an independent data set.
In one embodiment, in the model prediction step, the model performance evaluation index includes, but is not limited to, at least one of accuracy, precision, recall, and the like.
In one embodiment, a method for hierarchical screening of methylation markers is provided, which is applicable in fields including, but not limited to, organ damage, early cancer screening, tissue tracing, and organ transplantation. For early cancer screening, some tumor markers currently in clinical use have low sensitivity and specificity and cannot adequately meet clinical requirements; molecular markers may assist in early screening of cancer. The invention provides a method for screening methylation markers in a layered manner, which can be used for screening methylation markers with cell specificity, tissue specificity, and cancer species specificity. These markers can also assist in analyzing the plasma composition of healthy individuals, evaluating organ injury, monitoring immune response, early cancer screening, recurrence detection, organ transplantation monitoring, locating the primary site of metastatic cancer, and the like.
In one embodiment, a method for hierarchical screening of methylation markers is provided, which can be applied to the analysis of plasma composition in healthy people, organ damage assessment, immune response monitoring, early cancer screening, recurrence detection, organ transplantation monitoring, and the localization of primary sites of metastatic cancer.
In one embodiment, the present invention provides a method for layering multiple types of samples, such as multiple cancer species. In particular, the sample may be a plasma or tissue sample, including but not limited to cancer tissue, normal cells, and normal tissue. Taking cancer sample stratification as an example, the flow comprises:
Methylation modification data of primary tumor tissue of the pan-cancer cohort recorded in a public database are obtained.
The tumor tissue comprises: hepatocellular carcinoma, cholangiocarcinoma, lung adenocarcinoma, lung squamous carcinoma, gastric cancer, esophageal carcinoma, colon cancer, rectal adenocarcinoma, pancreatic cancer, breast cancer, ovarian cancer, cervical cancer, endometrial cancer, uterine sarcoma, prostate cancer, bladder urothelial cancer, adrenocortical cancer, renal chromophobe cell carcinoma, renal clear cell carcinoma, renal papillary cell carcinoma, head and neck squamous cell carcinoma, thyroid cancer, thymoma, mesothelioma, sarcoma, skin melanoma, uveal melanoma, pheochromocytoma and paraganglioma, brain low-grade glioma, and glioblastoma.
The downloaded methylation modification data are preprocessed, including probe filtering, sample filtering, and the like. Probe filtering rules: if an SNP locus lies within 10 bp upstream or downstream of a probe, the probe is rejected; probes on sex chromosomes and probes for which the proportion of samples with missing values exceeds a preset threshold are also removed, where the preset threshold can be set to 10%, 20%, 30%, and the like. Sample filtering rules: different algorithms are used to identify abnormal samples, such as Isolation Forest, Local Outlier Factor (LOF), Density-Based Spatial Clustering of Applications with Noise (DBSCAN), partition-based clustering algorithms, hierarchical clustering algorithms, and grid-based clustering algorithms; if two or more algorithms consider a sample to be abnormal, the sample is rejected.
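The filtering rules above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the `has_nearby_snp` flag, the use of `None` for a missing beta value, and the chromosome labels `chrX`/`chrY` are all assumptions made for the example, and the per-algorithm outlier flags are taken as given inputs rather than computed here.

```python
from typing import Dict, List, Optional, Set

# Assumed chromosome naming; adapt to the annotation actually used.
SEX_CHROMOSOMES = {"chrX", "chrY"}


def keep_probe(chrom: str,
               has_nearby_snp: bool,
               betas: List[Optional[float]],
               max_missing: float = 0.2) -> bool:
    """Return True if a probe passes the three probe filtering rules."""
    if has_nearby_snp:              # SNP within 10 bp upstream/downstream
        return False
    if chrom in SEX_CHROMOSOMES:    # probe on a sex chromosome
        return False
    # Proportion of samples whose beta value is missing for this probe.
    missing = sum(b is None for b in betas) / len(betas)
    return missing <= max_missing   # rejected only if it exceeds the threshold


def reject_samples(flags_by_algorithm: Dict[str, Set[str]],
                   min_votes: int = 2) -> Set[str]:
    """Samples flagged as abnormal by at least `min_votes` algorithms."""
    votes: Dict[str, int] = {}
    for flagged in flags_by_algorithm.values():
        for sample in flagged:
            votes[sample] = votes.get(sample, 0) + 1
    return {s for s, n in votes.items() if n >= min_votes}
```

In practice the sets passed to `reject_samples` would come from whichever outlier detectors are chosen; the voting step itself is independent of how each detector works.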
Dimension reduction is then performed separately on each type of preprocessed sample. For example, dimension reduction of one type of sample proceeds as follows: first, the degree of dispersion of each probe across the samples of the target type is calculated, the probes are sorted by dispersion from largest to smallest, and the probes ranked within a preset top fraction (K) are taken as effective features, where K can be set to 10%, 20%, 30%, and the like according to the actual application; then, hierarchical clustering is performed on the samples; the optimal number of clusters is then determined from indices including, but not limited to, the variance ratio criterion (VRC; also known as the Calinski-Harabasz criterion), the gap statistic (GS), and the silhouette coefficient (average silhouette method). If the optimal cluster numbers given by two or more indices agree, that number is the final optimal cluster number; otherwise, the optimal cluster number determined by the silhouette coefficient is selected. Finally, each type of sample is reduced in dimension according to the optimal cluster number: for each resulting class, the mean of a methylation level index (including but not limited to the beta value) over all samples in the class is calculated at each probe capture region. The multiple samples of a class are thus represented by one value per region (i.e., dimension reduction) for subsequent sample layering.
In one embodiment, for a particular sample type, the classes within that type are first determined, and the mean beta value of the samples in each class is then calculated.
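The feature selection and averaging parts of this dimension reduction can be sketched as below. The sketch assumes the class labels have already been produced by the hierarchical clustering step, uses the population standard deviation as the dispersion measure (one of several the text allows), and uses hypothetical function and variable names throughout.

```python
from statistics import mean, pstdev
from typing import Dict, List


def top_dispersed_probes(betas: Dict[str, List[float]],
                         top_fraction: float = 0.10) -> List[str]:
    """Probes ranked in the top `top_fraction` by standard deviation."""
    ranked = sorted(betas, key=lambda p: pstdev(betas[p]), reverse=True)
    n_keep = max(1, int(len(ranked) * top_fraction))
    return ranked[:n_keep]


def class_mean_profiles(betas: Dict[str, List[float]],
                        labels: List[str]) -> Dict[str, Dict[str, float]]:
    """Collapse the samples of each class to one mean beta value per probe.

    betas[probe][i] is the beta value of sample i; labels[i] is the class
    assigned to sample i by the clustering step (assumed given here).
    """
    classes = sorted(set(labels))
    profiles: Dict[str, Dict[str, float]] = {c: {} for c in classes}
    for probe, values in betas.items():
        for c in classes:
            profiles[c][probe] = mean(
                v for v, lab in zip(values, labels) if lab == c
            )
    return profiles
```

Each class is then carried forward as a single profile, so a cancer type that splits into two classes contributes two rows to the later sample layering.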
The dimension-reduced samples are then clustered and the optimal number of clusters is determined, thereby layering the samples. First, the degree of dispersion of each probe capture region across all samples is calculated, and the top 10% of probes with the greatest dispersion are taken as effective features; then, the samples are clustered using the unweighted pair-group method with arithmetic mean (UPGMA); finally, the optimal cluster number is determined from indices including the variance ratio criterion (VRC; also known as the Calinski-Harabasz criterion), the gap statistic (GS), the silhouette coefficient (average silhouette method), and the like. If the optimal cluster numbers given by two or more indices agree, that number is the final optimal cluster number N; when they disagree, the optimal cluster number N determined by the silhouette coefficient is taken. In particular, the plurality of cancer species may ultimately be divided into N groupings, each of which may contain one or more cancer species.
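The rule for reconciling the three indices can be written as a small helper. This is a sketch under the assumption that each index has already reported its own best cluster number; the dictionary keys (`"vrc"`, `"gap"`, `"silhouette"`) are hypothetical names chosen for the example.

```python
from collections import Counter
from typing import Dict


def final_cluster_number(best_k: Dict[str, int],
                         fallback: str = "silhouette") -> int:
    """Pick the final optimal cluster number.

    If two or more indices agree on a cluster number, use it; otherwise
    fall back to the number chosen by the silhouette coefficient.
    """
    counts = Counter(best_k.values())
    k, votes = counts.most_common(1)[0]
    if votes >= 2:
        return k
    return best_k[fallback]
```

With exactly three indices at most one value can receive two votes, so the result is unambiguous; with more indices a tie-breaking rule would need to be added.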
In another embodiment, the invention also provides a method for hierarchical screening of methylation markers, taking cancer sample hierarchical methylation marker screening as an example, comprising:
According to the N groups determined in the previous step, the methylation markers among the N groups, namely the first layer methylation markers, are screened first.
Sample selection: the data are divided into a training set and a validation set, and the samples of the N groupings used for first-layer methylation marker screening are determined.
Methylation region division: adjacent CpG sites are assigned to the same methylation region when the Pearson correlation coefficient between any two adjacent CpG sites is greater than a preset value and the number of CpG sites in the region is greater than a preset number; the mean of the beta values of all CpG sites contained in the methylation region is taken as the methylation level of the region.
The preset value of the Pearson correlation coefficient can be set to 0.8, 0.85, 0.9, 0.95, and the like according to the actual application; the preset number of CpG sites in the same methylation region can be set to 3, 4, 5, 6, and the like according to the actual application. Compared with a single methylation site, a methylation region reduces false positives, i.e., it is less susceptible to the detection technique; in applications using sequencing data, some methylation sites may go undetected and produce missing values, which can be avoided by using a regional methylation level.
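The region division rule can be illustrated with a short sketch. It assumes the CpG sites are already ordered by genomic position on one chromosome and ignores genomic distance between sites, which the text does not specify; the function names and default thresholds (0.9 and 3) are illustrative choices, not fixed by the method.

```python
from math import sqrt
from statistics import mean
from typing import List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def divide_regions(site_betas: List[List[float]],
                   min_r: float = 0.9,
                   min_sites: int = 3) -> List[List[int]]:
    """Group adjacent CpG sites into methylation regions.

    site_betas[i] holds the beta values of CpG site i across samples.
    A region is kept only if it contains more than `min_sites` sites.
    """
    regions: List[List[int]] = []
    current = [0]
    for i in range(1, len(site_betas)):
        if pearson(site_betas[i - 1], site_betas[i]) > min_r:
            current.append(i)          # extend the running region
        else:
            if len(current) > min_sites:
                regions.append(current)
            current = [i]              # start a new candidate region
    if len(current) > min_sites:
        regions.append(current)
    return regions


def region_levels(site_betas: List[List[float]],
                  region: List[int]) -> List[float]:
    """Per-sample methylation level: mean beta over the region's sites."""
    n_samples = len(site_betas[0])
    return [mean(site_betas[i][s] for i in region) for s in range(n_samples)]
```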
Methylation marker screening: first, analysis of variance or the Kruskal-Wallis test (KW test) is used to test whether each methylation region differs significantly among the N groupings. The p-value is corrected using a multiple comparison method (Bonferroni) to obtain a corrected p-value (padj); if padj is smaller than a preset threshold, the methylation region is judged to differ significantly among the N groupings. The preset threshold for padj may be set to 0.01, 0.05, and the like according to the actual application. When this test shows that the overall difference is statistically significant, every pair of the N groupings is compared; different statistical methods may be chosen depending on the application, such as Tukey's Honest Significant Difference (Tukey's HSD), the Least Significant Difference (LSD) method, the Dunnett t-test, the Student-Newman-Keuls (SNK) method, Duncan's new multiple range test, and the like. When the p-value of a pairwise comparison is less than a preset threshold and the absolute value of the beta difference is greater than a preset threshold, the methylation region is judged to differ significantly between the two groupings. The preset threshold for the pairwise p-value may be set to 0.01, 0.05, and the like, and the preset threshold for the absolute beta difference may be set to 0.15, 0.2, 0.25, 0.3, 0.35, and the like, according to the actual application. If a methylation region differs between a specific grouping and each of M other groupings, the methylation region is a specific methylation marker of that specific grouping. M may be set to N-1, N-2, and the like according to the actual application; when M is N-1, the specificity of the methylation marker is best.
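The correction and decision steps of this screening can be sketched as follows. The sketch assumes the overall-test p-values and the pairwise p-values (for example from Tukey's HSD) have already been computed elsewhere; the function names, the dictionary layout, and the default thresholds are illustrative assumptions.

```python
from typing import Dict, List, Tuple


def bonferroni(p_values: List[float]) -> List[float]:
    """Bonferroni-corrected p-values (padj), one per tested region."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def is_specific_marker(target: str,
                       pairwise_p: Dict[Tuple[str, str], float],
                       group_mean_beta: Dict[str, float],
                       m_required: int,
                       p_threshold: float = 0.05,
                       delta_threshold: float = 0.2) -> bool:
    """Decide whether one region is a specific marker of `target`.

    pairwise_p maps an unordered pair of groupings to the p-value of their
    pairwise comparison; group_mean_beta maps each grouping to its mean
    regional methylation level. The region counts as differing from another
    grouping when p < p_threshold and |beta difference| > delta_threshold.
    """
    n_different = 0
    for other in group_mean_beta:
        if other == target:
            continue
        p = pairwise_p.get((target, other), pairwise_p.get((other, target)))
        if p is None:
            continue  # no comparison available for this pair
        delta = abs(group_mean_beta[target] - group_mean_beta[other])
        if p < p_threshold and delta > delta_threshold:
            n_different += 1
    return n_different >= m_required
```

Setting `m_required` to N-1 corresponds to the strictest case described above, where the region must differ from every other grouping.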
Based on the N groups determined before, the methylation markers of different types of samples inside each group, namely the methylation markers of the second layer, are then screened. Taking one of the N groupings as an example, the screening method is similar to the screening of the first tier methylation markers, including:
sample selection: samples in a specific group are classified according to the original labels, and if the samples in the specific group relate to X cancer species, the samples can be classified into X categories.
Methylation region division: adjacent CpG sites are assigned to the same methylation region when the Pearson correlation coefficient between any two adjacent CpG sites is greater than a preset value and the number of CpG sites in the region is greater than a preset number; the mean of the beta values of all CpG sites contained in the methylation region is taken as the methylation level of the region. The preset value of the Pearson correlation coefficient can be set to 0.8, 0.85, 0.9, 0.95, and the like according to the actual application; the preset number of CpG sites in the same methylation region can be set to 3, 4, 5, 6, and the like according to the actual application.
Methylation marker screening: first, analysis of variance or the Kruskal-Wallis test (KW test) is used to test whether each methylation region differs significantly among the X categories. The p-value is corrected using Bonferroni to obtain a corrected p-value (padj); if padj is smaller than a preset threshold, the methylation region is judged to differ significantly among the X categories. The preset threshold for padj may be set to 0.01, 0.05, and the like according to the actual application. When this test shows that the overall difference is statistically significant, every pair of the X categories is compared; different statistical methods may be chosen depending on the application, such as Tukey's Honest Significant Difference (Tukey's HSD), the Least Significant Difference (LSD) method, the Dunnett t-test, the Student-Newman-Keuls (SNK) method, Duncan's new multiple range test, and the like. When the p-value of a pairwise comparison is smaller than a preset threshold and the absolute value of the beta difference is larger than a preset threshold, the methylation region is judged to differ significantly between the two categories. The preset threshold for the pairwise p-value may be set to 0.01, 0.05, and the like, and the preset threshold for the absolute beta difference may be set to 0.15, 0.2, 0.25, 0.3, 0.35, and the like, according to the actual application. If a methylation region differs between a specific category and each of Y other categories, the methylation region is judged to be a specific methylation marker of that specific category. Y may be set to X-1, X-2, and the like according to the actual application; when Y is X-1, the specificity of the methylation marker is best.
In another embodiment, the invention also provides a methylation marker assessment method comprising:
A first-layer classification model is constructed from the previously screened first-layer methylation markers and verified with an independent data set. The data include samples of the N groupings.
Sample selection: the data are divided into a training set and a verification set, wherein the training set data are used for model construction, and the verification set data are used for model performance verification.
Feature selection: the screened first-layer methylation markers are further screened using recursive feature elimination (RFE), and the retained subset of methylation markers is taken as the final features. Recursive feature elimination selects a reference model, feeds all features into it, ranks the features by importance while checking model performance, and removes unimportant features so that important features are retained.
Classification model construction: a machine learning model is trained on the final features obtained in the feature selection step to obtain an optimal methylation classification model. Machine learning methods include, but are not limited to, support vector machines, logistic regression, neural networks, random forests, extreme gradient boosting (eXtreme Gradient Boosting, XGBoost), and the like.
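The elimination loop described above can be sketched generically. This is an illustration only: `fit_importances` is a hypothetical callback standing in for "fit the reference model on the surviving features and return one importance score per feature", and the sketch omits the model performance check that the text mentions alongside the ranking.

```python
from typing import Callable, Dict, List, Sequence

Importances = Dict[str, float]


def recursive_feature_elimination(
        features: Sequence[str],
        fit_importances: Callable[[List[str]], Importances],
        n_keep: int,
        step: int = 1) -> List[str]:
    """Repeatedly refit, rank by importance, and drop the weakest features.

    Stops once `n_keep` features remain. The returned list is ordered
    from most to least important according to the last fit.
    """
    remaining = list(features)
    while len(remaining) > n_keep:
        importances = fit_importances(remaining)  # refit on survivors
        remaining.sort(key=lambda f: importances[f], reverse=True)
        n_drop = min(step, len(remaining) - n_keep)
        remaining = remaining[:len(remaining) - n_drop]
    return remaining
```

With a random forest as the reference model, the callback would typically return the forest's per-feature importance scores after fitting on the training set restricted to the remaining methylation regions.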
Model prediction: based on the constructed classification model, the verification set data is predicted, and the performance of the model is evaluated. The model performance evaluation index comprises: accuracy, precision, recall, etc.
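The evaluation indices named above can be computed directly from the true and predicted labels of the verification set. The following is a minimal sketch with hypothetical function names; precision and recall are computed per class, treating one label as the positive class.

```python
from typing import List, Tuple


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall(y_true: List[str], y_pred: List[str],
                     label: str) -> Tuple[float, float]:
    """Precision and recall for one class, treated as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)  # predicted as `label`
    actual = sum(t == label for t in y_true)     # truly `label`
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall
```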
A second-layer classification model is then constructed for each grouping from the previously screened methylation markers of that grouping and verified with an independent data set. Taking one of the N groupings as an example, model construction and verification are similar to those of the first layer, including:
sample selection: samples in a specific group are classified according to the original labels, and if the samples in the specific group relate to X cancer species, the samples can be classified into X categories.
Feature selection: the screened second-layer methylation markers of the grouping are further screened using recursive feature elimination (RFE), and the retained subset of methylation markers is taken as the final features. Recursive feature elimination selects a reference model, feeds all features into it, ranks the features by importance while checking model performance, and removes unimportant features so that important features are retained.
Classification model construction: a machine learning model is trained on the final features obtained by feature selection to obtain an optimal methylation classification model. Machine learning methods include, but are not limited to, support vector machines, logistic regression, neural networks, random forests, extreme gradient boosting (eXtreme Gradient Boosting, XGBoost), and the like.
Model prediction: based on the constructed classification model, the verification set data is predicted, and the performance of the model is evaluated. The model performance evaluation index comprises: accuracy, precision, recall, etc.
Example 1
The present embodiment provides a method for layering multiple types of samples. In this embodiment, a plurality of cancer species are layered; as shown in fig. 1, the method includes:
Step 1.1, obtaining methylation modification data of primary tumor tissues of the pan-cancer cohort from a public database.
Step 1.2, preprocessing the downloaded methylation modification data, wherein the data preprocessing comprises probe filtering and sample filtering.
Step 1.3, performing dimension reduction treatment on each type of sample.
Step 1.4, clustering according to the samples subjected to the dimension reduction treatment, and determining the optimal clustering number to finish sample layering.
In this example, methylation modification data of samples of 30 cancer species were first obtained from public databases; the specific data are shown in table 1. The data were then preprocessed, the preprocessing comprising probe filtering and sample filtering. Probe filtering rules: if an SNP locus lies within 10 bp upstream or downstream of a probe, the probe is rejected; probes on sex chromosomes and probes for which the proportion of samples with missing values exceeds 20% are also removed. Sample filtering rules: different methods are used to identify abnormal samples, such as Isolation Forest, Local Outlier Factor, density-based clustering algorithms, partition-based clustering algorithms, hierarchical clustering algorithms, and grid-based clustering algorithms; if two or more algorithms consider a sample to be abnormal, the sample is removed.
The data of this example is derived from the TCGA (The Cancer Genome Atlas) database and the GEO (Gene Expression Omnibus) database.
TABLE 1
After data preprocessing is completed, dimension reduction is performed on each type of sample. For one type of sample: first, the standard deviation of each probe across the samples of the target type is calculated, and the top 10% of probes with the largest standard deviation are taken as effective features; then, hierarchical clustering is performed on the samples; next, the optimal cluster number is determined from indices including the variance ratio criterion (VRC; also known as the Calinski-Harabasz criterion), the gap statistic (GS), and the silhouette coefficient (average silhouette method). If the optimal cluster numbers given by two or more methods agree, that number is taken as the final optimal cluster number; otherwise, the optimal cluster number determined by the silhouette coefficient is selected. Finally, each type of sample is reduced in dimension according to the optimal cluster number: for each class, the mean beta value of all samples in the class is calculated at each probe.
Finally, the dimension-reduced samples are clustered and the optimal cluster number is determined, thereby layering the samples. The specific steps are: first, a measure of dispersion, here the standard deviation, is calculated for each probe across all samples, the probes are sorted by standard deviation from largest to smallest, and the top 10% of probes are taken as effective features; then, the samples are clustered using the unweighted pair-group method with arithmetic mean (UPGMA); finally, the optimal cluster number is determined from indices including the variance ratio criterion (VRC; also known as the Calinski-Harabasz criterion), the gap statistic (GS), the silhouette coefficient (average silhouette method), and the like. In this example, the 30 cancer species are divided into 12 groupings. The specific results are shown in fig. 2.
Fig. 2 shows the sample stratification results, wherein the number after each cancer species denotes its category after dimension reduction, different group numbers denote different groupings, and the groupings are separated by dotted lines. Among the cancer species listed in table 1, cervical cancer and esophageal cancer each involve both adenocarcinoma and squamous carcinoma, so their categories after dimension reduction are shown in table 2; the categories of the other cancer species in table 1 are unchanged and are not listed here.
TABLE 2 categories after dimension reduction of cancer species
TABLE 3 Sample stratification results (grouping of cancer species)
Cancer species (English abbreviation) | Stratification
SKCM, UVM | group_7
LIHC | group_9
THYM | group_3
ACC, GBM, LGG, PCPG | group_6
OV | group_1
BRCA, PRAD | group_8
UCEC, UCS | group_2
MESO, SARC | group_4
BLCA, CESC, ESCA, HNSC, LUSC | group_12
KICH, KIRC, KIRP, THCA | group_5
CHOL, LUAD, PAAD | group_10
CESC, COAD, ESCA, READ, STAD | group_11
With the sample layering strategy adopted in this embodiment, when a new sample type is added, the grouping to which the new sample belongs can be determined first, so the previously screened methylation markers are not completely overturned, which improves the applicability and stability of the markers.
Example 2
The present example provides a method for hierarchical screening of methylation markers, comprising screening of first and second layers of methylation markers. Taking cancer sample stratified methylation marker screening as an example, the methylation marker screening step is shown in fig. 3, and includes:
Step 2.1, determining the samples used for screening methylation markers.
Step 2.2, calculating the methylation level of each pre-divided methylation region.
Step 2.3, screening for methylation markers.
In this example, the first-layer methylation markers are screened first: according to the 12 groupings in Example 1, the 30 cancer species in the training set data are divided into 12 groupings, and methylation levels are calculated for the pre-divided methylation regions. The methylation region division rule is: the Pearson correlation coefficient between any two adjacent CpG sites in the same methylation region is greater than 0.9, and the number of CpG sites in the same methylation region is greater than 3. Analysis of variance is then used to test whether each methylation region differs significantly among the 12 groupings, and the p-value of the analysis of variance is Bonferroni-corrected to obtain padj; if padj is less than 0.05, the methylation region is judged to differ significantly among the 12 groupings. When the analysis of variance shows that the difference is statistically significant, every pair of the 12 groupings is further compared using Tukey's Honest Significant Difference (Tukey's HSD). When the p-value of a pairwise comparison is less than 0.05 and the absolute value of the beta difference is greater than 0.2, the methylation region is judged to differ significantly between the two groupings. If a methylation region differs between a specific grouping and each of the other 11 groupings, the methylation region is a specific methylation marker of that specific grouping. By analogy, the specific methylation markers of each of the 12 groupings can be obtained.
The second-layer methylation markers are screened next. Since there are 12 groupings, second-layer screening is performed within each of the 12 groupings; in particular, if a grouping contains only one type of sample, no second-layer methylation marker screening is needed for it. Taking the third grouping in table 3 as an example, if the grouping includes 4 types of samples, the samples in the third grouping are classified into 4 categories. Methylation levels are then calculated for the pre-divided methylation regions, analysis of variance is used to test whether each methylation region differs significantly among the 4 categories, and the p-value of the analysis of variance is Bonferroni-corrected to obtain padj; if padj is less than 0.05, the methylation region differs significantly among the 4 categories. When the analysis of variance shows that the difference is statistically significant, every pair of the 4 categories is further compared using Tukey's Honest Significant Difference (Tukey's HSD). When the p-value of a pairwise comparison is less than 0.05 and the absolute value of the beta difference is greater than 0.2, the methylation region is judged to differ significantly between the two categories. If a methylation region differs between a specific category and each of the other 3 categories, the methylation region is judged to be a specific methylation marker of that specific category. Similarly, the specific methylation markers of each of the 4 categories can be obtained; in this way the methylation markers inside each grouping are obtained, which completes the screening of the second-layer methylation markers.
In this example, hierarchical methylation markers are screened based on the sample stratification results. When a new sample type is added, the grouping to which the new sample belongs can be determined first, so the previously screened methylation markers are not completely overturned, which helps ensure the applicability and stability of the markers.
Example 3
The present embodiment provides a methylation marker assessment method, as shown in fig. 4, comprising:
step 3.1, determining samples for model training and prediction;
step 3.2, further screening model features;
step 3.3, model construction;
step 3.4, model prediction.
In this embodiment, the first-layer model is constructed first: the samples for first-layer model training are determined, and the first-layer methylation markers from Example 2 are further screened using recursive feature elimination (RFE) to obtain the final methylation markers of the model. A first-layer random forest classification model is constructed from the final features obtained by feature selection. Second-layer models are then constructed within the 12 groupings, at most 12 in total; in particular, if a grouping contains only one type of sample, no second-layer model is needed for that grouping. Taking the third grouping as an example: the samples for model construction are the 4 categories of the third grouping, and the second-layer methylation markers of the third grouping from Example 2 are further screened using recursive feature elimination (RFE) to obtain the final methylation markers of the model. A second-layer random forest classification model for the third grouping is constructed from the final features obtained by feature selection. Finally, the verification set data are predicted and the model prediction accuracy (the number of correctly predicted samples as a percentage of the true number of samples of the cancer species) is evaluated; the overall accuracy across cancer species is 95.7%, and the specific performance is shown in fig. 5. It can be seen that the prediction accuracy for uterine sarcoma and esophageal squamous carcinoma is about 70%, the prediction accuracy for the other cancer species is above 90%, and the prediction accuracy for lung adenocarcinoma, prostate cancer, and thyroid cancer is close to 100%.
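At prediction time the two layers can be chained as in the sketch below. It is an illustration under stated assumptions: the classifiers are passed in as plain callables, the function and parameter names are hypothetical, and the toy classifiers at the end are stand-ins for trained random forest models, with arbitrary decision rules chosen only so the example runs. The group labels follow table 3 (group_9 contains only LIHC; group_7 contains SKCM and UVM).

```python
from typing import Callable, Dict, Sequence

Classifier = Callable[[Sequence[float]], str]


def predict_two_layer(sample: Sequence[float],
                      first_layer: Classifier,
                      second_layer: Dict[str, Classifier],
                      single_type_groups: Dict[str, str]) -> str:
    """Predict a cancer species with the layered models.

    The first-layer model assigns the sample to a grouping. A grouping
    with a single cancer species needs no second-layer model, so its
    species is returned directly; otherwise the second-layer model of
    that grouping makes the final call.
    """
    group = first_layer(sample)
    if group in single_type_groups:
        return single_type_groups[group]
    return second_layer[group](sample)


# Toy stand-ins for trained models (hypothetical decision rules).
def toy_first_layer(sample: Sequence[float]) -> str:
    return "group_9" if sample[0] > 0.5 else "group_7"


def toy_group_7(sample: Sequence[float]) -> str:
    return "SKCM" if sample[1] > 0.5 else "UVM"
```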
In one embodiment, the present invention provides a method for layering multiple types of samples, in particular, the samples may be plasma, tissue samples, including but not limited to normal cells, cancer tissue, and normal tissue. The method can be used for constructing human developmental lineages, exploring the development relationship of tissue/cell types, cancer lineages and the like.
The existing method for screening the methylation markers is basically one-to-one screening, and in one embodiment, the invention provides a method for screening the methylation markers in a layering manner, which is a one-to-many method, and can be used for screening the methylation markers of various types of samples at the same time, so that the method is efficient and reliable.
"one-to-one" means: a set of markers is obtained by comparison between the two types of samples.
"one-to-many" means: a comparison is made between multiple types (greater than 2 types) of samples simultaneously, with multiple sets of markers being obtained.
In one embodiment, the detection of the methylation markers screened in plasma/tissue of the present invention can be used for analysis of plasma composition in healthy human blood, organ damage, monitoring of immune response, early screening of cancer, detection of recurrence, monitoring of organ transplantation, and localization of primary sites of metastatic cancer.
In one embodiment, the invention further screens stratified methylation markers on the basis of the sample stratification results. When a new sample type is added, the group to which the new sample belongs can be determined first, so that the previously obtained methylation markers need not be discarded entirely, which ensures the applicability and stability of the markers.
In one embodiment, the overall method of the invention is applicable not only to detection of the methylation level of a single methylation site but also to detection of regional methylation levels. Measures of methylation level include, but are not limited to, average methylation rate, methylation entropy, epipolymorphism, methylation haplotype load (MHL), haplotype counts, and the like. Compared with a single methylation site, a methylation region reduces the false positives of single sites, i.e., it is less affected by the detection technique; in sequencing data, some methylation sites may go undetected and produce missing values, and using a regional methylation level can avoid such missing values.
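As a non-limiting illustration, three of the regional measures named above may be computed from read-level methylation patterns as sketched below. The formulas are commonly used formulations and are an assumption here, since the text does not define them: epipolymorphism is taken as one minus the sum of squared pattern frequencies, and methylation entropy as the Shannon entropy of the pattern frequencies divided by the number of CpG sites. The input encoding (one string of '0'/'1' per read) is likewise hypothetical.

```python
import math
from collections import Counter


def region_measures(reads):
    """Return (mean methylation rate, epipolymorphism, methylation entropy).

    reads: list of equal-length strings over {'0', '1'}, one string per read
        and one character per CpG site in the region ('1' = methylated).
    """
    n_reads = len(reads)
    n_sites = len(reads[0])
    # Average methylation rate: methylated calls over all calls in the region.
    mean_rate = sum(r.count("1") for r in reads) / (n_reads * n_sites)
    # Frequency of each distinct methylation pattern among the reads.
    freqs = [count / n_reads for count in Counter(reads).values()]
    epipolymorphism = 1.0 - sum(p * p for p in freqs)
    entropy = -sum(p * math.log2(p) for p in freqs) / n_sites
    return mean_rate, epipolymorphism, entropy
```

Because these measures pool all CpG sites and reads in a region, a single undetected site does not by itself produce a missing value.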
In one embodiment, the present invention provides a method for layering multiple types of samples, in which:
a. In identifying abnormal samples, methods that may be employed include, but are not limited to, Isolation Forest, the Local Outlier Factor (LOF) detection algorithm, density-based clustering (Density-Based Spatial Clustering of Applications with Noise, DBSCAN), partition-based clustering algorithms, hierarchical clustering algorithms, network-based clustering algorithms, and the like.
b. In dimension reduction of the samples, indexes for evaluating the degree of dispersion include, but are not limited to, variance, standard deviation, square deviation, coefficient of variation, and the like, and clustering methods include, but are not limited to, hierarchical clustering and density-based clustering.
c. In hierarchical clustering of the samples, clustering methods include, but are not limited to, the unweighted pair-group method with arithmetic means (UPGMA), the neighbor-joining method for phylogenetic trees, and the like.
d. In determining the optimal number of clusters, the indexes include, but are not limited to, the variance ratio criterion (VRC; also known as the Calinski-Harabasz criterion), the gap statistic (GS), the silhouette coefficient (average silhouette method), and the like.
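As a non-limiting illustration, the selection of the most dispersed probes and the rule recited in the claims for reconciling several cluster-number indexes may be sketched as follows. Variance is used here as the dispersion index, the per-index optimal cluster numbers are assumed to have been computed elsewhere, and the names `top_dispersed_probes`, `final_cluster_number` and the key `"silhouette"` (the silhouette, or contour, coefficient) are hypothetical.

```python
from collections import Counter
from statistics import pvariance


def top_dispersed_probes(beta, k):
    """Return the IDs of the k probes with the largest variance across samples.

    beta: dict mapping probe ID -> list of beta values, one per sample.
    """
    ranked = sorted(beta, key=lambda probe: pvariance(beta[probe]), reverse=True)
    return ranked[:k]


def final_cluster_number(best_k):
    """Combine the optimal cluster numbers suggested by two or three indexes.

    best_k: dict mapping index name -> optimal cluster number by that index;
        it must contain the key "silhouette".
    If at least two indexes agree, their common value is returned; otherwise
    the value given by the silhouette coefficient is used.
    """
    value, votes = Counter(best_k.values()).most_common(1)[0]
    return value if votes >= 2 else best_k["silhouette"]
```

For example, `final_cluster_number({"vrc": 12, "gap": 12, "silhouette": 10})` returns 12, whereas three mutually different suggestions fall back to the silhouette value.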
In one embodiment, the invention provides a method for layered screening of methylation markers. In determining whether there is a difference among the groups of samples, methods that may be used include, but are not limited to, analysis of variance and the Kruskal-Wallis (KW) test. In making pairwise comparisons, statistical methods that may be employed include, but are not limited to, Tukey's honest significant difference (Tukey's HSD) test, the least significant difference (LSD) test, the Dunnett t test, the Student-Newman-Keuls (SNK) test, and Duncan's new multiple range test.
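As a non-limiting illustration of the first of these two stages, the Kruskal-Wallis H statistic for the methylation level of one region across several groups may be computed as sketched below. The function name is hypothetical, and only the statistic is computed; in practice the p-value would be obtained from the chi-square distribution with (number of groups - 1) degrees of freedom, for example through a statistics library.

```python
from collections import Counter


def kruskal_wallis_h(groups):
    """Return the tie-corrected Kruskal-Wallis H statistic.

    groups: list of lists, each holding the methylation levels of one group.
    """
    pooled = sorted(value for group in groups for value in group)
    n_total = len(pooled)
    # Assign each distinct value the mean of the ranks it occupies.
    rank_of = {}
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2.0
        i = j
    rank_term = sum(
        sum(rank_of[value] for value in group) ** 2 / len(group)
        for group in groups
    )
    h = 12.0 / (n_total * (n_total + 1)) * rank_term - 3.0 * (n_total + 1)
    # Standard correction for tied values.
    ties = Counter(pooled).values()
    correction = 1.0 - sum(t ** 3 - t for t in ties) / (n_total ** 3 - n_total)
    return h / correction if correction > 0 else 0.0
```

Only regions for which this overall test indicates a difference proceed to the pairwise comparisons.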
In one embodiment, methods that may be employed for constructing the classification model include, but are not limited to, support vector machines, logistic regression, neural networks, random forests, and extreme gradient boosting (XGBoost).
In one embodiment, the application scenarios of the present invention include, but are not limited to, the following: sample types include, but are not limited to, plasma, tissue samples, and the like. Sample stratification strategies include, but are not limited to, construction of human developmental lineages, exploration of the developmental relationships of tissue/cell types, cancer type lineages, and the like. Applications of methylation markers include, but are not limited to, analysis of plasma components in healthy humans, organ damage, immune response monitoring, early screening for cancer, recurrence detection, organ transplant monitoring, localization of the primary site of metastatic cancer, and the like.
Those skilled in the art will appreciate that all or part of the functions of the methods in the above embodiments may be implemented by hardware or by a computer program. When all or part of the functions in the above embodiments are implemented by a computer program, the program may be stored in a computer-readable storage medium, which may include read-only memory, random access memory, a magnetic disk, an optical disk, a hard disk, and the like, and the program is executed by a computer to realize the above functions. For example, the program is stored in the memory of a device, and all or part of the above functions are realized when the program in the memory is executed by a processor. In addition, the program may be stored in a storage medium such as a server, another computer, a magnetic disk, an optical disk, a flash disk, or a removable hard disk, and be downloaded or copied into the memory of a local device, or used to update the system version of the local device; all or part of the functions in the above embodiments are then realized when the program in the memory is executed by a processor.
The foregoing description of the invention has been presented for purposes of illustration and description, and is not intended to be limiting. Several simple deductions, modifications or substitutions may also be made by a person skilled in the art to which the invention pertains, based on the idea of the invention.

Claims (15)

1. A method of hierarchical screening for methylation markers, comprising:
a first layer screening step, including layering samples to obtain N groups of samples, and screening methylation markers among the N groups to obtain first layer methylation markers;
a second layer screening step, which comprises screening, according to the N groups of samples obtained in the first layer screening step, methylation markers among the different types of samples within each group, namely second layer methylation markers;
in the first layer screening step, the method for layering the sample comprises the following steps:
a data acquisition step comprising acquiring methylation modification data of a sample;
preprocessing the methylation modification data to obtain preprocessed samples of all types;
the dimension reduction processing step comprises the step of respectively carrying out dimension reduction processing on each type of sample after pretreatment;
The layering step comprises the steps of clustering samples after dimension reduction treatment, determining the optimal clustering number and layering the samples;
in the dimension reduction processing step, the dimension reduction processing method for each type of sample comprises: calculating the degree of dispersion of each probe in the samples of the target type, sorting the probes in descending order of the degree of dispersion, taking the probes whose degree of dispersion ranks above a preset rank as effective features, clustering the samples, and determining the optimal number of clusters according to indexes, thereby realizing dimension reduction of each type of sample;
in the layering step, the degree of dispersion of each probe in all samples is calculated, the probes are sorted in descending order of the degree of dispersion, the probes whose degree of dispersion ranks above a preset rank are taken as effective features, the samples are clustered, and the optimal number of clusters is determined according to indexes; the indexes comprise at least two of the variance ratio criterion, the gap statistic and the silhouette coefficient; if the optimal numbers of clusters given by two or more indexes are consistent, that number is the final optimal number of clusters; otherwise, the optimal number of clusters N determined by the silhouette coefficient is taken as the final optimal number of clusters; the plurality of types of samples are finally divided into N groups, each group containing at least one type of sample;
The second layer screening step is performed with reference to the first layer screening step;
the second layer screening step includes:
a sample selection step, which comprises classifying the samples in each group by cancer type, the number X of cancer types being the same as the number of sample categories;
a methylation region dividing step, which comprises dividing CpG sites into the same methylation region when the Pearson correlation coefficient between any two adjacent CpG sites is greater than a preset value and the number of CpG sites is greater than a preset number, and taking the average of the beta values of all CpG sites contained in the methylation region as the methylation level of the region;
a methylation marker screening step, which comprises checking whether the methylation level of each methylation region differs with statistical significance among the X categories; if so, comparing the methylation levels of any two of the X categories pairwise and judging whether the methylation level of the methylation region differs with statistical significance between the two categories; and if a methylation region differs between a specific type of sample and Y other types of samples, judging the methylation region to be a specific methylation marker of the specific type of sample, Y being a natural number smaller than X.
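As a non-limiting illustration, the methylation region dividing step recited above may be sketched as follows. The thresholds `r_min` and `min_sites` stand in for the preset value and preset number, whose actual values are not given here, and the function names and input layout (one list of beta values per CpG site, ordered by genomic position) are hypothetical.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


def split_regions(sites, r_min=0.8, min_sites=2):
    """Group adjacent, correlated CpG sites into methylation regions.

    sites: list of per-site beta vectors (one value per sample), ordered by
        genomic position.
    Returns (start, end) index pairs, end inclusive, for every run in which
    each adjacent pair has correlation > r_min and which holds more than
    min_sites sites.
    """
    regions, start = [], 0
    for i in range(1, len(sites) + 1):
        # Close the current run at the end of the data or at a weak link.
        if i == len(sites) or pearson(sites[i - 1], sites[i]) <= r_min:
            if i - start > min_sites:
                regions.append((start, i - 1))
            start = i
    return regions


def region_level(sites, region, sample):
    """Methylation level of a region in one sample: the mean beta value."""
    first, last = region
    return mean(sites[j][sample] for j in range(first, last + 1))
```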
2. The method of claim 1, wherein in the dimension reduction processing step, the indexes include at least two of the variance ratio criterion, the gap statistic, and the silhouette coefficient;
in the dimension reduction processing step, if the optimal numbers of clusters given by two or more indexes are consistent, that number is the final optimal number of clusters; otherwise, the optimal number of clusters determined by the silhouette coefficient is selected as the final optimal number of clusters, thereby realizing dimension reduction of each type of sample;
in the dimension reduction processing step, the method for clustering the samples comprises at least one of a hierarchical clustering algorithm and a density-based clustering algorithm.
3. The method of claim 1, wherein the dimension-reduction processing step further comprises separately calculating the methylation level of all samples in each class at each probe capture region.
4. The method of claim 3, wherein the methylation level comprises an average methylation rate, methylation entropy, epipolymorphism, methylation haplotype load, or haplotype count.
5. The method of claim 3, wherein the methylation level comprises an average of beta values.
6. The method of claim 1, wherein in the layering step, the method of clustering samples comprises at least one of: the unweighted pair-group method with arithmetic means, the neighbor-joining method for phylogenetic trees, partition-based clustering algorithms, hierarchical clustering algorithms, and network-based clustering algorithms.
7. The method of claim 1, wherein the type comprises a cancer species, a developmental lineage, a tissue type, or a cell type.
8. The method of claim 1, wherein in the data acquisition step, the methylation modification data of the sample is derived from a database;
in the data acquisition step, the sample comprises at least one of a tissue sample and a body fluid sample;
in the data acquisition step, the tissue sample comprises at least one of cancer tissue and normal tissue;
in the data acquisition step, the sample comprises a cancer sample;
in the data acquisition step, the cancer sample comprises a primary tumor tissue sample of a cancer;
in the data acquisition step, the tumor includes at least one of hepatocellular carcinoma, cholangiocarcinoma, lung adenocarcinoma, lung squamous carcinoma, gastric cancer, esophageal carcinoma, colon cancer, rectal adenocarcinoma, pancreatic cancer, breast cancer, ovarian cancer, cervical cancer, endometrial cancer, uterine sarcoma, prostate cancer, bladder urothelial cancer, adrenocortical cancer, renal chromophobe cell carcinoma, renal clear cell carcinoma, renal papillary cell carcinoma, head and neck squamous cell carcinoma, thyroid cancer, thymoma, mesothelioma, sarcoma, skin melanoma, ocular melanoma, pheochromocytoma, paraganglioma, brain low-grade glioma, and glioblastoma.
9. The method of claim 1, wherein in the preprocessing step, the preprocessing includes probe filtering and sample filtering; in the preprocessing step, the probe filtering rule includes: if an SNP locus lies within 10 bp upstream or downstream of a probe, rejecting the probe; and also rejecting probes on sex chromosomes and probes for which the proportion of samples with missing values exceeds a preset threshold;
in the preprocessing step, the sample filtering rule includes: identifying an abnormal sample by adopting at least one algorithm, and eliminating the sample if the identification result of at least one algorithm in the adopted algorithms shows that the sample is abnormal;
algorithms for identifying abnormal samples include at least one of isolated forests, local anomaly factor detection algorithms, density-based clustering algorithms, partition-based clustering algorithms, hierarchical-based clustering algorithms, network-based clustering algorithms.
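As a non-limiting illustration, the probe filtering and sample filtering rules recited above may be sketched as follows. The dictionary layout of a probe, the chromosome names, the treatment of a probe as a single genomic position, and the default `max_missing` value are assumptions made for this sketch; the outlier flags are assumed to come from separately run detection algorithms.

```python
def keep_probe(probe, max_missing=0.2, snp_window=10):
    """Return True if a probe passes all three probe filtering rules.

    probe: dict with keys
        "chrom": chromosome name, e.g. "chr1" (hypothetical naming),
        "pos": genomic position of the probe,
        "snp_positions": positions of known SNPs on the same chromosome,
        "betas": beta values across samples, with None for a missing value.
    """
    # Rule 1: discard probes on sex chromosomes.
    if probe["chrom"] in {"chrX", "chrY"}:
        return False
    # Rule 2: discard probes with an SNP within the window on either side.
    if any(abs(snp - probe["pos"]) <= snp_window for snp in probe["snp_positions"]):
        return False
    # Rule 3: discard probes missing in too large a proportion of samples.
    missing = sum(value is None for value in probe["betas"]) / len(probe["betas"])
    return missing <= max_missing


def is_abnormal(flags):
    """A sample is rejected if at least one detection algorithm flags it.

    flags: iterable of booleans, one per outlier-detection algorithm.
    """
    return any(flags)
```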
10. The method of claim 1, wherein the first layer screening step comprises:
a sample selection step comprising obtaining samples of N groupings for a first tier methylation marker screen;
a methylation region dividing step, which comprises dividing CpG sites into the same methylation region when the correlation coefficient between any two adjacent CpG sites is greater than a preset value and the number of CpG sites is greater than a preset number;
a methylation marker screening step, which comprises checking whether the methylation level of each methylation region differs with statistical significance among the N groups; if so, comparing the methylation levels of any two of the N groups pairwise and judging whether the methylation level of the methylation region differs with statistical significance between the two groups; if the methylation level of a methylation region differs with statistical significance between the samples of a specific group and the samples of M other groups, judging the methylation region to be a specific methylation marker of the specific group, M being any natural number smaller than N; and if the methylation level does not differ with statistical significance among the N groups, carrying out no further steps.
11. The method of claim 10, wherein in the methylation region dividing step, the correlation coefficient comprises a Pearson correlation coefficient or a Spearman correlation coefficient;
in the methylation marker screening step, the testing method comprises at least one of analysis of variance and the Kruskal-Wallis test;
in the methylation marker screening step, the p value is corrected by a multiple comparison method to obtain a corrected p value, and if the corrected p value is smaller than a preset threshold, the methylation level of the methylation region is judged to differ with statistical significance among the N groups;
in the methylation marker screening step, when the p value of a pairwise comparison is smaller than a preset threshold and the absolute value of the difference in methylation level is greater than a preset threshold, the methylation level of the methylation region is judged to differ with statistical significance between the two groups.
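As a non-limiting illustration, the p-value correction and the decision rules recited above may be sketched as follows. The Benjamini-Hochberg procedure is used only as one example of a multiple comparison correction and is an assumption, as are the default thresholds `p_max` and `min_delta`, the function names, and the reading of "M other groups" as "at least M other groups".

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def differs(p_pair, level_a, level_b, p_max=0.05, min_delta=0.2):
    """Pairwise rule: small p-value and large absolute methylation difference."""
    return p_pair < p_max and abs(level_a - level_b) > min_delta


def is_specific_marker(target, groups, differs_from, m_required):
    """True if the region separates `target` from at least m_required other groups.

    differs_from: callable (target, other) -> bool giving the pairwise result.
    """
    hits = sum(1 for g in groups if g != target and differs_from(target, g))
    return hits >= m_required
```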
12. The method of claim 10, wherein in the methylation marker screening step, the testing method comprises at least one of analysis of variance and the Kruskal-Wallis test;
in the methylation marker screening step, the p value is corrected by a multiple comparison method to obtain a corrected p value, and if the corrected p value is smaller than a preset threshold, the methylation level of the methylation region is judged to differ with statistical significance among the X categories;
in the methylation marker screening step, when the p value of a pairwise comparison is smaller than a preset threshold and the absolute value of the beta difference is greater than a preset threshold, the methylation level of the methylation region is judged to differ with statistical significance between the two categories;
the methylation level includes the average of the beta values of all CpG sites contained in the methylation region.
13. A device for hierarchical screening of methylation markers, comprising:
The first layer screening module is used for layering the samples to obtain N groups of the samples, and screening methylation markers among the N groups to obtain first layer methylation markers;
the second layer screening module is used for screening, according to the N groups of samples obtained by the first layer screening module, methylation markers among the different types of samples within each group, namely second layer methylation markers;
in the first layer screening module, the method for layering the samples comprises the following steps:
a data acquisition step comprising acquiring methylation modification data of a sample;
preprocessing the methylation modification data to obtain preprocessed samples of all types;
the dimension reduction processing step comprises the step of respectively carrying out dimension reduction processing on each type of sample after pretreatment;
the layering step comprises the steps of clustering samples after dimension reduction treatment, determining the optimal clustering number and layering the samples;
in the dimension reduction processing step, the dimension reduction processing method for each type of sample comprises: calculating the degree of dispersion of each probe in the samples of the target type, sorting the probes in descending order of the degree of dispersion, taking the probes whose degree of dispersion ranks above a preset rank as effective features, clustering the samples, and determining the optimal number of clusters according to indexes, thereby realizing dimension reduction of each type of sample;
in the layering step, the degree of dispersion of each probe in all samples is calculated, the probes are sorted in descending order of the degree of dispersion, the probes whose degree of dispersion ranks above a preset rank are taken as effective features, the samples are clustered, and the optimal number of clusters is determined according to indexes; the indexes comprise at least two of the variance ratio criterion, the gap statistic and the silhouette coefficient; if the optimal numbers of clusters given by two or more indexes are consistent, that number is the final optimal number of clusters; otherwise, the optimal number of clusters N determined by the silhouette coefficient is taken as the final optimal number of clusters; the plurality of types of samples are finally divided into N groups, each group containing at least one type of sample;
the second layer screening module operates with reference to the first layer screening module;
the second layer screening module includes:
the sample selection module is used for classifying the samples in each group by cancer type, the number X of cancer types being the same as the number of sample categories;
the methylation region dividing module is used for dividing CpG sites into the same methylation region when the Pearson correlation coefficient between any two adjacent CpG sites is greater than a preset value and the number of CpG sites is greater than a preset number, and taking the average of the beta values of all CpG sites contained in the methylation region as the methylation level of the region;
the methylation marker screening module is used for checking whether the methylation level of each methylation region differs with statistical significance among the X categories; if so, comparing the methylation levels of any two of the X categories pairwise and judging whether the methylation level of the methylation region differs with statistical significance between the two categories; and if a methylation region differs between a specific type of sample and Y other types of samples, judging the methylation region to be a specific methylation marker of the specific type of sample, Y being a natural number smaller than X.
14. A device for hierarchical screening of methylation markers, comprising:
a memory for storing a program;
a processor configured to implement the method according to any one of claims 1 to 12 by executing a program stored in the memory.
15. A computer readable storage medium having stored thereon a program executable by a processor to implement the method of any one of claims 1 to 12.
CN202211067693.7A 2022-09-01 2022-09-01 Methylation marker layered screening method and device Active CN115497561B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211067693.7A CN115497561B (en) 2022-09-01 2022-09-01 Methylation marker layered screening method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211067693.7A CN115497561B (en) 2022-09-01 2022-09-01 Methylation marker layered screening method and device

Publications (2)

Publication Number Publication Date
CN115497561A CN115497561A (en) 2022-12-20
CN115497561B true CN115497561B (en) 2023-08-29

Family

ID=84469435

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211067693.7A Active CN115497561B (en) 2022-09-01 2022-09-01 Methylation marker layered screening method and device

Country Status (1)

Country Link
CN (1) CN115497561B (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant