CN116884491B - Method and kit for screening methylation site collection in high throughput manner and application of kit - Google Patents


Info

Publication number
CN116884491B
Authority
CN
China
Prior art keywords
methylation
tumor
site
sites
samples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311150628.5A
Other languages
Chinese (zh)
Other versions
CN116884491A (en)
Inventor
张亚飞
顾凯丽
邓啸
刘佳
蒋泽宇
王方杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Meijie Transformation Medical Research Suzhou Co ltd
Original Assignee
Meijie Transformation Medical Research Suzhou Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Meijie Transformation Medical Research Suzhou Co ltd
Priority to CN202311150628.5A
Publication of CN116884491A
Application granted
Publication of CN116884491B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00: ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/20: Polymerase chain reaction [PCR]; Primer or probe design; Probe optimisation
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869: Methods for sequencing
    • C12Q1/6874: Methods for sequencing involving nucleic acid arrays, e.g. sequencing by hybridisation
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C12Q1/6886: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00: ICT programming tools or database systems specially adapted for bioinformatics
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00: Oligonucleotides characterized by their use
    • C12Q2600/154: Methylation markers

Abstract

The application provides a method for high-throughput screening of methylation site sets, a kit and applications thereof. The method comprises the following steps: (1) collecting methylation data of a tumor group and a corresponding normal group from a public database; (2) analyzing the differences in CpG methylation sites between the tumor samples and the normal samples in the public database with a statistical test, and screening the CpG methylation sites with significant differences to obtain a candidate CpG methylation site set; (3) randomly extracting methylation data of tumor samples and normal tissue samples from a public database, and constructing an initial methylation model using the candidate CpG methylation sites screened in step (2); (4) extracting methylation data of tumor samples and normal samples from a public database, applying the initial methylation model of step (3), and performing fine screening with a permutation test to obtain the final methylation site set most relevant to the tumor group.

Description

Method and kit for screening methylation site collection in high throughput manner and application of kit
Technical Field
The application belongs to the technical field of gene detection, and particularly relates to a method for screening methylation site sets in a high throughput manner, a kit and application thereof.
Background
DNA methylation refers to the process in which DNA methyltransferases selectively add methyl groups to specific bases of DNA, and is an important mode of epigenetic regulation. In mammals, DNA methylation occurs predominantly at the cytosine (C) on the 5' side of cytosine-phosphate-guanine (CpG) dinucleotides, producing 5-methylcytosine (5mC). DNA methylation does not alter the DNA sequence, but it can change chromosome structure, DNA conformation, DNA stability and the way DNA interacts with proteins, thereby controlling gene expression.
In recent years, extensive research has shown that abnormal DNA methylation is closely related to the occurrence and development of tumors and to the malignant transformation of cells. It is thought to act mainly through the following mechanisms: 1. cytosine in methylated CpG dinucleotides deaminates at a higher frequency to become thymine, causing gene mutation; 2. tumor suppressor genes and DNA repair genes are silenced by hypermethylation; 3. oncogenes are activated by reduced methylation; 4. reduced global methylation of the genome activates transposons and repeat sequences, lowering chromosomal stability. Methylation can therefore serve as a biomarker and prognostic indicator for early tumor diagnosis, and is important for tumor screening and risk assessment, early diagnosis, classification, prognosis and treatment monitoring. On this basis, screening methylation markers is an important prerequisite for tumor-related diagnosis and treatment.
CN108410980A discloses a method, kit and application for screening target regions for methylation PCR detection. That method uses methylation chip data and the corresponding transcriptome data in a database to pre-screen methylation sites that differ significantly and are negatively correlated with transcript expression, and finally selects, from sites disclosed in the literature, the site set with the best sensitivity and specificity. Its advantage is that the site set can be detected by methylation PCR, which is simple and convenient. Its disadvantages are that only previously reported sites can be obtained, so important potential biomarkers that are not yet fully studied or published are easily missed, and that a large amount of literature must be read, which makes the method difficult to implement. In addition, when many sites are screened, the throughput of methylation PCR becomes limiting and efficient detection in a single run is difficult.
Therefore, providing a more efficient and simpler method for screening methylation site sets, together with a detection method that is not limited by the number of sites, is of great significance for methylation-based early screening and early diagnosis of tumors and for detection of minimal residual disease (MRD).
Disclosure of Invention
In view of the shortcomings of the prior art, the application aims to provide a method for high-throughput screening of methylation site sets, a kit and applications thereof. The method performs a primary screen of methylation sites from public databases, followed by verification and fine screening with a random forest model, to obtain a reduced methylation site set most relevant to the tumor, which lowers the detection cost. The kit has high detection throughput and no limit on the number of sites, can detect all target methylation sites in a single run, and is convenient and efficient to use, thereby laying a foundation for methylation-based early screening and early diagnosis of tumors and for detection of minimal residual disease (MRD).
In order to achieve the aim of the application, the application adopts the following technical scheme:
In a first aspect, the application provides a method of high throughput screening of a set of methylation sites, the method comprising:
(1) Collecting methylation data of a tumor group and a corresponding normal group in a public database;
(2) Analyzing the difference of CpG methylation sites of the tumor sample and the normal sample in the public database by adopting a statistical test method, and screening CpG methylation sites with obvious difference to obtain a candidate CpG methylation site set;
(3) Randomly extracting methylation data of tumor samples and normal tissue samples in a public database, and constructing an initial methylation model by using the candidate CpG methylation sites screened in the step (2);
(4) Extracting methylation data of tumor samples and normal samples in a public database, adopting the initial methylation model in the step (3), and carrying out fine screening by using a permutation test, so as to screen out a final methylation site set most relevant to a tumor group.
In the application, a permutation test is used to further refine the sites selected according to the significance of their methylation differences. Compared with conventional screening based on methylation rate differences and/or gene expression level differences, this fine screening greatly reduces noisy methylation sites and removes methylation sites that were wrongly included because of chance differences between sample groups. It can effectively improve the efficiency of distinguishing tumor samples from normal samples while saving detection cost.
Preferably, in step (1), the public database is a TCGA database and/or a GEO database dataset.
Preferably, in step (2), the CpG methylation sites with significant differences have significant methylation frequency differences and significant P values in the tumor sample group and the normal sample group.
Preferably, in the step (2), the absolute value of the methylation frequency difference of the CpG methylation sites with significant difference in the tumor sample group and the normal sample group is more than or equal to 0.5.
Preferably, in step (2), the test method is a Wilcoxon rank sum test corrected by the Benjamini-Hochberg method and/or a strict Wald test based on the β negative binomial distribution.
Preferably, the CpG methylation sites with significant differences have an FDR (False Discovery Rate) value < 0.01 in the Wilcoxon rank sum test corrected by the Benjamini-Hochberg method; or a P value < 0.001 in the strict Wald test based on the β negative binomial distribution.
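The primary screen described above can be illustrated with a minimal sketch. It is not the applicant's implementation: it assumes methylation fractions are supplied as plain Python lists per site, uses a normal approximation for the rank sum test without a tie correction of the variance, and the function names (rank_sum_pvalue, benjamini_hochberg, screen_sites) are hypothetical. The Wald test branch is not covered.

```python
from statistics import NormalDist, mean

def rank_sum_pvalue(x, y):
    """Two-sided Wilcoxon rank sum p-value (normal approximation).

    Tied values receive average ranks; the variance is not tie-corrected,
    which is a simplification made for this sketch.
    """
    pooled = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    n1, n2 = len(x), len(y)
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # average of ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (u - mu) / sigma
    return 2 * (1 - NormalDist().cdf(abs(z)))

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (FDR) in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda k: pvalues[k])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

def screen_sites(tumor, normal, min_delta=0.5, max_fdr=0.01):
    """tumor / normal: dict mapping site id -> list of methylation fractions."""
    sites = sorted(tumor)
    pvals = [rank_sum_pvalue(tumor[s], normal[s]) for s in sites]
    fdr = benjamini_hochberg(pvals)
    return [
        s for s, q in zip(sites, fdr)
        if q < max_fdr and abs(mean(tumor[s]) - mean(normal[s])) >= min_delta
    ]
```

A site is kept only when both conditions hold: the absolute difference in mean methylation fraction is at least 0.5 and the adjusted value is below 0.01.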
Preferably, in step (3), the public database is a TCGA database and/or a GEO database.
Preferably, in the step (3), the number of tumor samples randomly extracted is more than or equal to 100, and the number of normal tissue samples is more than or equal to 100.
Preferably, in step (3), the model is constructed using a random forest method.
Preferably, in step (3), the model is a mapping between the methylation rate of each site and the tumor or normal status of the corresponding sample, and the model can calculate a score for each sample according to this mapping.
Preferably, in step (3), the model comprises, for each methylation site, the increase in mean squared error (IncMSE) and the increase in node purity (IncNodePurity, measured by the residual sum of squares).
Preferably, in step (4), the public database is a TCGA database and/or a GEO database dataset.
Preferably, in the step (4), the number of tumor samples is more than or equal to 300, and the number of normal samples is more than or equal to 300; the tumor samples comprise stage I-III tumor samples, wherein the proportion of stage I tumor samples is more than or equal to 30%.
Preferably, in step (4), the permutation test includes the following steps:
(a) Randomly extracting a certain number of samples from the sample set, the number being greater than 80% of the total sample size, establishing a random forest model for verification, and calculating the IncMSE and IncNodePurity of each candidate methylation site;
(b) Comparing the IncMSE and IncNodePurity of the candidate methylation sites in step (a) with those of the initial methylation model in step (3), and calculating the absolute difference ΔIncMSE of the IncMSE and the absolute difference ΔIncNodePurity of the IncNodePurity;
(c) Repeating the random extraction, modeling and calculation n times, where n ≥ 1000, so that two data sets are obtained for each methylation site in the original model: data set A, $\Delta\mathrm{IMSE} = \{\Delta\mathrm{IncMSE}_1, \Delta\mathrm{IncMSE}_2, \ldots, \Delta\mathrm{IncMSE}_n\}$; and data set B, $\Delta\mathrm{INP} = \{\Delta\mathrm{IncNodePurity}_1, \Delta\mathrm{IncNodePurity}_2, \ldots, \Delta\mathrm{IncNodePurity}_n\}$;
(d) Calculating the P values of the IncMSE and the IncNodePurity for every site in the candidate methylation site set, and screening out the final methylation site set most relevant to the tumor group according to the P value of each methylation site.
The outliers of data set A and/or data set B are counted for each site in the candidate methylation site set. An outlier is a value greater than the upper limit or less than the lower limit of the data set; the numbers of outliers in data set A and data set B of each candidate methylation site are recorded as $m_A$ and $m_B$, respectively.
The outlier calculation method is as follows:
$Q_1$ is the first quartile of the data set and $Q_3$ is the third quartile; the interquartile range is $\mathrm{IQR} = Q_3 - Q_1$;
upper limit $= Q_3 + 1.5 \times \mathrm{IQR}$;
lower limit $= Q_1 - 1.5 \times \mathrm{IQR}$;
for each site in the candidate methylation site set, $P_{\mathrm{IncMSE}} = m_A / n$ and $P_{\mathrm{NodePurity}} = m_B / n$.
A P value threshold is set for each methylation site, and the final methylation site set most relevant to the tumor group is screened out; the threshold is $P_{\mathrm{IncMSE}} < 0.01$ and/or $P_{\mathrm{NodePurity}} < 0.01$.
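The outlier-based P value can be sketched as follows. This is an illustrative example only: the quartile convention (the default of Python's statistics.quantiles) is an assumption, since the text does not specify one, and the helper names outlier_p_value and select_sites are hypothetical. The sketch applies the "and" variant of the threshold, requiring both P values to be below 0.01.

```python
from statistics import quantiles

def outlier_p_value(deltas):
    """Return P = m / n, where m counts values outside
    [Q1 - 1.5*IQR, Q3 + 1.5*IQR] and n is the number of values."""
    # Quartile convention is an assumption (statistics.quantiles default).
    q1, _median, q3 = quantiles(deltas, n=4)
    iqr = q3 - q1
    upper = q3 + 1.5 * iqr
    lower = q1 - 1.5 * iqr
    m = sum(1 for d in deltas if d > upper or d < lower)
    return m / len(deltas)

def select_sites(delta_incmse, delta_purity, threshold=0.01):
    """Keep sites whose P_IncMSE and P_NodePurity are both below threshold.

    delta_incmse / delta_purity: dict mapping site id -> list of n deltas
    (data set A and data set B in the text).
    """
    kept = []
    for site in delta_incmse:
        p_a = outlier_p_value(delta_incmse[site])
        p_b = outlier_p_value(delta_purity[site])
        if p_a < threshold and p_b < threshold:
            kept.append(site)
    return kept

# Toy data: a site with stable deltas and a site with 2% extreme deltas.
stable = [0.10 + 0.001 * (i % 10) for i in range(1000)]
noisy = stable[:980] + [5.0] * 20
kept = select_sites({"cgA": stable, "cgB": noisy}, {"cgA": stable, "cgB": stable})
```

In this toy data the site with 20 extreme values out of 1000 has P = 0.02 and is removed, while the stable site has P = 0 and is retained.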
In the application, the first step of the permutation test is to establish the null hypothesis H0: the inclusion of a methylation site S from the candidate methylation site set in the original model is a random event, its contribution to the model is random and it can be removed, and the corresponding IncMSE and/or IncNodePurity in random forest models built from different sample sets vary largely at random relative to the original model.
The second step is to calculate the test statistics ΔIncMSE and/or ΔIncNodePurity of each site in the candidate methylation site set, i.e. the absolute difference between the site's IncMSE and/or IncNodePurity in the new random forest model and in the original model.
The third step is to construct the set of test statistics for each site in the candidate methylation site set: a certain number of samples (> 80% of the total sample size) is randomly extracted from the new sample set, and this is repeated n times, each time constructing a new random forest model and calculating the test statistics ΔIncMSE and/or ΔIncNodePurity of each methylation site. Two data sets are thus formed for each methylation site: data set A, $\Delta\mathrm{IMSE} = \{\Delta\mathrm{IncMSE}_1, \Delta\mathrm{IncMSE}_2, \ldots, \Delta\mathrm{IncMSE}_n\}$, and data set B, $\Delta\mathrm{INP} = \{\Delta\mathrm{IncNodePurity}_1, \Delta\mathrm{IncNodePurity}_2, \ldots, \Delta\mathrm{IncNodePurity}_n\}$.
The fourth step is to calculate the P values corresponding to the IncMSE and/or the IncNodePurity of each site in the candidate methylation site set, namely $P_{\mathrm{IncMSE}}$ and $P_{\mathrm{NodePurity}}$. If the P value is greater than the preset threshold, the H0 hypothesis is accepted and the methylation site is removed; otherwise the H0 hypothesis is rejected and the methylation site is retained.
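The repeated resampling of the second and third steps can be sketched as a loop around a model-fitting callback. This is a minimal sketch under stated assumptions: fit_importance is a hypothetical callable that would, in practice, fit a random forest on the subset and return a per-site importance such as IncMSE; here a toy stand-in (the per-site mean) is used so that the example runs with the standard library alone. The names resample_deltas and toy_importance are also hypothetical.

```python
import random
from statistics import mean

def resample_deltas(samples, fit_importance, baseline,
                    n_rounds=1000, fraction=0.8, seed=0):
    """Build the per-site set of absolute importance differences.

    samples: list of per-sample records.
    fit_importance: hypothetical callable, subset -> {site: importance}.
    baseline: {site: importance} from the initial model.
    Each round draws strictly more than `fraction` of the samples
    without replacement, refits, and records |new - baseline| per site.
    """
    rng = random.Random(seed)
    k = min(len(samples), int(len(samples) * fraction) + 1)
    deltas = {site: [] for site in baseline}
    for _ in range(n_rounds):
        subset = rng.sample(samples, k)
        importance = fit_importance(subset)
        for site, base in baseline.items():
            deltas[site].append(abs(importance[site] - base))
    return deltas

def toy_importance(subset):
    """Stand-in for a random forest importance; NOT a real importance measure."""
    sites = subset[0].keys()
    return {s: mean(rec[s] for rec in subset) for s in sites}

# Toy usage: 10 samples, two sites, 50 resampling rounds.
samples = [{"cg1": float(i), "cg2": 1.0} for i in range(10)]
baseline = toy_importance(samples)
deltas = resample_deltas(samples, toy_importance, baseline, n_rounds=50)
```

The resulting lists correspond to data set A (or data set B, if the callback returns IncNodePurity) for each site, and can then be passed to the outlier count that yields the P values.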
According to the application, a final methylation model is constructed from the final methylation site set, scores are calculated for tumor and normal samples with the model, and a receiver operating characteristic (ROC) curve is drawn. The score corresponding to the point on the curve with the shortest straight-line distance to the coordinate (0, 1) is taken as the scoring threshold for distinguishing tumor samples from normal samples. At this threshold, the proportion of true tumor samples identified as tumor samples is the true positive rate (TPR) of the model, and the proportion of true normal samples identified as tumor samples is the false positive rate (FPR) of the model. In the application, the true positive rate of the model is more than or equal to 90%, the false positive rate is less than or equal to 5%, and the area under the curve (AUC) is more than or equal to 0.95.
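The threshold selection on the ROC curve can be illustrated with a short sketch. It assumes that a higher score indicates tumor and that a sample is called tumor when its score is at or above the threshold; neither convention is stated in the text, and the function name closest_to_corner_threshold is hypothetical.

```python
def closest_to_corner_threshold(scores, labels):
    """Pick the score threshold whose (FPR, TPR) point is nearest to (0, 1).

    scores: model scores, assumed higher for tumor (an assumption).
    labels: 1 for tumor, 0 for normal.
    Returns (threshold, tpr, fpr).
    """
    pos = sum(labels)
    neg = len(labels) - pos
    best = None
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
        tpr, fpr = tp / pos, fp / neg
        dist = (fpr ** 2 + (1 - tpr) ** 2) ** 0.5  # distance to (0, 1)
        if best is None or dist < best[0]:
            best = (dist, t, tpr, fpr)
    return best[1], best[2], best[3]

# Toy usage with perfectly separated scores.
threshold, tpr, fpr = closest_to_corner_threshold(
    [0.1, 0.2, 0.3, 0.6, 0.7, 0.9], [0, 0, 0, 1, 1, 1]
)
```

Candidate thresholds are limited to the observed scores, which is sufficient because the ROC curve only changes at those values.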
In a second aspect, the present application provides a target set of abnormally methylated sites associated with colorectal cancer, the target set being screened by the method of high throughput screening of a set of methylation sites according to the first aspect, the target set comprising: one or more methylation sites of 68 CpG sites, the CpG site coordinates being based on the hg38 version of the human genome, the target set being shown in Table 1.
TABLE 1
In a third aspect, the application provides a probe library for detecting a target set of methylation sites associated with colorectal cancer, the probe library comprising: fully methylated form probes directed to the sites in the target set according to the second aspect, and fully unmethylated form probes directed to the sites in the target set according to the second aspect.
In the application, a fully methylated form probe is a probe designed from the original reference sequence at the position of the target methylation site, and a fully unmethylated form probe is a probe designed from a sequence in which the cytosine (C) base of the target methylation site is replaced by a thymine (T) base.
Preferably, the probe covers the methylation sites of the target collection of the second aspect.
Preferably, the methylation site is located in the middle of the probe.
Preferably, the GC content of the probe is 40% -60%, for example, 40%, 45%, 50%, 55% or 60% or the like.
Preferably, the probe length is 110-130 bp, for example, 110 bp, 115 bp, 120 bp, 125 bp or 130 bp, and preferably 120 bp.
In a fourth aspect, the application provides a kit for detecting a methylation site set based on liquid-phase chip targeted capture and high-throughput sequencing, wherein the kit comprises the probe library of the third aspect for detecting a target set of methylation sites associated with colorectal cancer.
Preferably, the kit further comprises any one or a combination of at least two of methylation conversion reagents, genome pre-library construction reagents, linkers, hybridization capture reagents, hybridization product amplification reagents, or streptavidin magnetic beads.
Preferably, the linker is a methylated linker.
Preferably, the methylation conversion reagent is selected from the group consisting of a bisulphite conversion reagent or an enzymatic methylation conversion reagent.
In a fifth aspect, the application provides use of the probe library of the third aspect for detecting a target set of methylation sites associated with colorectal cancer, and/or of the kit of the fourth aspect for detecting a methylation site set based on liquid-phase chip targeted capture and high-throughput sequencing, in the preparation of products for detecting tumors or minimal residual disease.
The numerical ranges recited herein include not only the recited point values but also any point values between them that are not recited; for reasons of space and brevity, the application does not exhaustively list the specific point values that the ranges include.
Compared with the prior art, the application has the following beneficial effects:
(1) The screening method of the application does not depend on previously published literature reports and can screen tumor-related methylation sites in an unbiased manner.
(2) The method performs a primary screen of methylation sites from public databases, followed by verification and fine screening with a random forest model, to obtain a reduced methylation site set most relevant to the tumor, which lowers the detection cost.
(3) The kit has high detection throughput and no limit on the number of sites, can detect all target methylation sites in a single run, and is convenient and efficient to use.
(4) The methylation site screening method and the kit lay a foundation for methylation-based early screening and early diagnosis of tumors and for detection of minimal residual disease (MRD).
Drawings
FIG. 1 is a receiver operating characteristic (ROC) curve of the screened site set for distinguishing clinical samples of colorectal cancer from those of healthy subjects.
FIG. 2 is a quality control diagram of a methylation hybridization capture library constructed by the kit.
Detailed Description
The technical scheme of the application is further described by the following specific embodiments. It will be apparent to those skilled in the art that the examples are merely to aid in understanding the application and are not to be construed as a specific limitation thereof.
Where specific techniques or conditions are not indicated in the examples, they follow the techniques or conditions described in the literature in this field or the product specifications. Reagents or apparatus for which no manufacturer is noted are conventional products commercially available through regular channels.
Materials:
Genome pre-library construction reagents were purchased from Tiangen Biochemical Technologies (Beijing) Inc., cat# NG302-02.
The linker was purchased from NEB, cat# E7140S.
Enzymatic methylation conversion reagent was purchased from NEB, cat# E7125S.
The linker was purchased from IDT (trade name 8nt UDI Adapters, cat# 2003200).
Hybridization capture reagents were purchased from Twist: fast wash buffers (trade name Twist Fast Wash Buffers, cat# 100972), binding and purification beads (trade name Twist Binding and Purification Beads, cat# 100984), fast hybridization reagent (trade name Twist Fast Hybridization Reagent, cat# 100968) and universal blockers (trade name Twist Universal Blockers, cat# 100767).
Hybridization product amplification reagents were from the KAPA library amplification kit (trade name KAPA Library Amplification Kits for Illumina platforms, cat# KK2621).
Purification magnetic beads were purchased from Beckman (trade name purification magnetic beads, cat# A63882).
Cell-free DNA (cfDNA) was obtained from a clinical partner.
Example 1
This example provides a target set of methylation sites for detecting colorectal cancer; the set comprises 68 methylation sites, and the specific site information is shown in Table 1. The site set is pre-screened using methylation data from public databases, an initial model is built with a random forest on the screened set of differentially methylated candidate sites, another group of methylation data is then put into the model to evaluate the P value of each site's contribution to the model, and the site set most relevant to the tumor is obtained by fine screening. The specific screening steps are as follows:
(1) Collecting methylation data of tumor groups and corresponding normal groups from the TCGA database and/or GEO database datasets.
(2) Analyzing the differences in CpG methylation sites between the tumor samples and the normal samples in the public database with a statistical test, and screening the CpG methylation sites with significant differences to obtain a candidate CpG methylation site set. CpG methylation sites with significant differences have a significant methylation frequency difference and a significant P value between the tumor sample group and the normal sample group: the absolute value of the methylation frequency difference between the two groups is ≥ 0.5; the test is a Wilcoxon rank sum test corrected by the Benjamini-Hochberg method and/or a strict Wald test based on the β negative binomial distribution; and the sites have an FDR value < 0.01 in the Benjamini-Hochberg-corrected Wilcoxon rank sum test, or a P value < 0.001 in the strict Wald test based on the β negative binomial distribution.
(3) Randomly extracting methylation data of tumor samples and normal tissue samples from the TCGA database and the GEO database, and constructing an initial methylation model using the candidate CpG methylation sites screened in step (2). The model is constructed by a random forest method; it is a mapping between the methylation rate of each site and the tumor or normal status of the corresponding sample, and can calculate a score for each sample according to this mapping. The model includes the increase in mean squared error (IncMSE) and the increase in node purity (IncNodePurity, measured by the residual sum of squares) of each methylation site.
(4) Fine screening of the candidate methylation site set: extracting methylation data of tumor samples and normal samples from the TCGA database and GEO database datasets, wherein the number of tumor samples is ≥ 300 and the number of normal samples is ≥ 300; the tumor samples comprise stage I-III tumor samples, of which stage I tumor samples account for ≥ 30%; then applying the initial methylation model of step (3) and performing fine screening with a permutation test to obtain the final methylation site set most relevant to the tumor group.
The step of substitution testing includes:
(a) Randomly extracting a number of samples from the sample set, wherein the number of extracted samples is greater than 80% of the total number of samples, establishing a random forest model for verification, and calculating the mean square error entropy (IncMSE) and the residual sum of squares (IncNodePurity) of each candidate methylation site;
(b) Comparing the IncMSE and IncNodePurity of each candidate methylation site from step (a) with the IncMSE and IncNodePurity of the initial methylation model of step (3), and calculating the absolute difference of the mean square error entropy, ΔIncMSE, and the absolute difference of the residual sum of squares, ΔIncNodePurity;
(c) Repeating the random extraction, modeling and calculation n times, wherein the number of repetitions n is ≥ 1000, so that two data sets, data set A and data set B, are obtained for each methylation site in the original model; data set A: ΔIncMSE {ΔIncMSE_1, ΔIncMSE_2, … ΔIncMSE_n}; data set B: ΔIncNodePurity {ΔIncNodePurity_1, ΔIncNodePurity_2, … ΔIncNodePurity_n};
(d) Calculating the P values of the mean square error entropy (IncMSE) and the residual sum of squares (IncNodePurity) for every site in the candidate methylation site set; screening out the final methylation site set most relevant to the tumor group according to the P value of each methylation site;
the outliers of data set A and/or data set B corresponding to each site in the candidate methylation site set are counted separately; an outlier is defined as a value greater than the upper limit or less than the lower limit of the data set, and the numbers of outliers in data set A and data set B of each candidate methylation site are recorded as m_A and m_B, respectively.
The outliers are calculated as follows:
Q1 is the first quartile of the data set and Q3 is the third quartile; the interquartile range IQR = Q3 - Q1;
upper limit = Q3 + 1.5 × IQR;
lower limit = Q1 - 1.5 × IQR;
for each site in the candidate methylation site set, P_IncMSE = m_A / n and P_NodePurity = m_B / n are calculated.
A P value threshold is set for each methylation site, and the final methylation site set most relevant to the tumor group is screened out; the threshold is: P_IncMSE < 0.01 and/or P_NodePurity < 0.01.
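For illustration only, steps (a) to (d) and the outlier-based P value may be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in this example: `importance_fn` is a hypothetical stand-in for fitting a random forest on a subsample and reading one site's IncMSE or IncNodePurity, the 85% sampling fraction is one arbitrary choice satisfying "greater than 80%", the quartile convention is not specified in the text, and the "and/or" threshold is applied in its permissive reading.

```python
import math
import random
import statistics


def outlier_p_value(deltas):
    """Return m / n: the fraction of values outside the 1.5 x IQR limits."""
    # Assumption: "inclusive" quartiles; the text does not name a convention.
    q1, _, q3 = statistics.quantiles(deltas, n=4, method="inclusive")
    iqr = q3 - q1
    upper = q3 + 1.5 * iqr
    lower = q1 - 1.5 * iqr
    m = sum(1 for d in deltas if d > upper or d < lower)
    return m / len(deltas)


def resample_deltas(samples, reference, importance_fn,
                    n_repeats=1000, fraction=0.85, seed=0):
    """Collect |importance(subsample) - reference| over n_repeats random draws.

    importance_fn is hypothetical: it stands in for building a random forest
    on the subsample and returning one site's IncMSE or IncNodePurity.
    """
    rng = random.Random(seed)
    k = min(len(samples), math.ceil(fraction * len(samples)))
    deltas = []
    for _ in range(n_repeats):
        subset = rng.sample(samples, k)  # draw without replacement
        deltas.append(abs(importance_fn(subset) - reference))
    return deltas


def site_is_retained(p_inc_mse, p_node_purity, threshold=0.01):
    """Apply the 0.01 threshold, reading 'and/or' permissively as 'or'."""
    return p_inc_mse < threshold or p_node_purity < threshold
```

With n ≥ 1000 repetitions, data set A and data set B of a site would each be passed to `outlier_p_value`, and the two resulting P values to `site_is_retained`.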
The target set obtained by screening comprises: one or more methylation sites of 68 CpG sites, the CpG site coordinates being based on the hg38 version of the human genome.
Example 2
In this example, using the methylation sites screened and the model corrected in Example 1, 90 stage I colorectal cancer samples, 90 stage II colorectal cancer samples, 120 stage III colorectal cancer samples and 300 normal samples were selected from the TCGA database; the methylation data of the samples were input into the model and an ROC curve was drawn; the specific results are shown in FIG. 1. The results show that, in distinguishing colorectal cancer samples from normal samples, the model achieves a true positive rate of up to 93% and a specificity (1 - false positive rate) of up to 96%, and the resulting score threshold is 0.34.
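As an illustration of how a score threshold may be read off an ROC curve, the following Python sketch computes the (threshold, true positive rate, false positive rate) points and selects the threshold that maximizes Youden's J. The selection rule and the convention that a score at or above the threshold indicates tumor are assumptions; the text does not state how the value 0.34 was chosen.

```python
def roc_points(scores, labels):
    """Return (threshold, TPR, FPR) for each distinct score, highest first.

    labels: 1 for tumor, 0 for normal. A sample is called tumor when its
    score is >= threshold (assumed direction).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((t, tp / n_pos, fp / n_neg))
    return points


def best_threshold(scores, labels):
    """Pick the threshold maximizing Youden's J = TPR - FPR (an assumption)."""
    return max(roc_points(scores, labels), key=lambda p: p[1] - p[2])[0]
```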
Example 3
In this example, a probe library was designed and synthesized for the set of 68 methylation sites selected in Example 1. The probe library comprises probes directed to the fully methylated form and the fully unmethylated form of the sites in the target set, wherein the methylation site is located in the middle of the probe, the GC content of the probe is 40%-60%, and the probe length is 120 bp; the probes efficiently capture the molecular fragments corresponding to the target set, with a capture efficiency of more than 70%.
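The design criteria above can be checked programmatically. The following Python sketch is illustrative only; the function names and the tolerance used to decide that a site is "in the middle" of the probe are assumptions not given in the text.

```python
def gc_content(seq):
    """Fraction of G and C bases in a probe sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def probe_meets_criteria(seq, site_index, length=120,
                         gc_min=0.40, gc_max=0.60, center_tolerance=10):
    """Check length, GC content and site position for one probe.

    site_index is the 0-based position of the CpG site within the probe.
    center_tolerance (in bases) is an assumed definition of "the middle".
    """
    if len(seq) != length:
        return False
    if not gc_min <= gc_content(seq) <= gc_max:
        return False
    return abs(site_index - length // 2) <= center_tolerance
```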
Example 4
This example uses 13 colorectal cancer cfDNA samples and 11 healthy human cfDNA samples collected clinically for high throughput methylation pre-library construction.
1. Library construction
1.1 End repair and A-tailing (addition of adenine).
1.1.1 Take out the relevant reagents and place them on ice for later use; thaw the 5× ERA enzyme mixture and mix it by flicking the tube with a finger, without vortexing; the remaining reagents can be mixed by brief vortexing.
1.1.2 Take one new 200 μL thin-walled tube and prepare the reaction system on ice according to Table 2, with a cfDNA input of 30 ng.
TABLE 2
1.1.3 Add 10 μL of the 5× ERA enzyme mixture to the thin-walled tube and mix by gently pipetting up and down 10 times, without vortexing; keep the tube in an ice bath during this step.
1.1.4 Place the prepared reaction system in a PCR instrument pre-cooled to 4 ℃ and run the program in Table 3, with the heated lid set to 70 ℃.
TABLE 3
1.1.5 After the program is finished, remove the thin-walled tube from the PCR instrument, place it on ice, and immediately proceed to the adapter ligation step.
1.2 Adapter ligation
1.2.1 Add 2.5 μL of adapter solution to the reaction system of step 1.1.5, mix by gently pipetting up and down, and place on ice.
1.2.2 Prepare the adapter ligation reaction system according to Table 4, mix gently, and place on ice.
TABLE 4
1.2.3 Add 47.5 μL of the ligation reaction solution prepared in step 1.2.2 to the reaction solution of step 1.2.1, mix by gently pipetting up and down 10 times, and incubate for 15 minutes in a metal bath or PCR instrument preset to 20 ℃. If a PCR instrument is used for this step, set its heated lid to ≤ 40 ℃.
1.3 Purification of the adapter ligation products
1.3.1 The purified beads were taken out of the refrigerator at 4℃in advance, vortexed and mixed well and left at room temperature for 30 minutes.
1.3.2 The purification magnetic beads were vortexed to mix, 110 μL of the beads was added to the reaction system of step 1.2.3, and the mixture was gently pipetted up and down 10 times, mixed thoroughly, and incubated at room temperature for 5 minutes.
1.3.3 The centrifuge tube was placed on a magnetic rack for 5 minutes; after the solution cleared, the supernatant was discarded and the magnetic beads were retained.
1.3.4 With the centrifuge tube kept on the magnetic rack, 200 μL of freshly prepared 80% ethanol was added to the reaction tube; after standing for 30 s, the supernatant was removed and the magnetic beads were retained.
1.3.5 The above steps were repeated once.
1.3.6 The residual ethanol in the centrifuge tube was removed as much as possible using a 10. Mu.L pipette, and the tube was placed on a magnetic rack for 3-5 minutes until the beads were completely dried.
1.3.7 Add 29. Mu.L of nuclease-free water to the centrifuge tube, blow mix, spin briefly, and incubate for 2 minutes at room temperature.
1.3.8 The centrifuge tube was placed on a magnetic rack for 2-3 minutes until the solution was clear.
1.3.9 28 μl of the supernatant was carefully pipetted into a new PCR tube, the beads discarded and placed on ice for later use in subsequent experiments.
2. Enzymatic conversion
2.1 Oxidation of 5-methylcytosine and 5-hydroxymethylcytosine
2.1.1 Preparation of TET2 buffer: add 100 μl TET2 reaction buffer to TET2 reaction buffer supplement and mix well.
2.1.2 Ingredients shown in table 5 were prepared on ice and 28 μl adaptor-ligated DNA from step 1.3.9 was added:
TABLE 5
2.1.3 Dilute the 500 mM Fe(II) solution by adding 1 μL to 1249 μL of nuclease-free water; add the diluted Fe(II) solution to the components of step 2.1.2 and mix, as shown in Table 6:
TABLE 6
2.1.4 Place the reaction system in a PCR instrument and incubate at 37 ℃ for 1 hour, with the heated lid set to ≥ 45 ℃.
2.1.5 After the run is finished, transfer the sample onto ice, add 1 μL of stop reagent, and mix well.
2.1.6 Place the reaction system in a PCR instrument and incubate at 37 ℃ for 30 minutes, then hold at 4 ℃, with the heated lid set to ≥ 45 ℃.
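As a worked check of the Fe(II) dilution in step 2.1.3, the following Python sketch computes the working concentration. It assumes the step is read as 1 μL of 500 mM stock added to 1249 μL of nuclease-free water, i.e. a 1:1250 dilution; the function name is illustrative.

```python
def diluted_concentration(stock_mM, stock_uL, diluent_uL):
    """Concentration after mixing stock with diluent (C1 * V1 = C2 * V2)."""
    return stock_mM * stock_uL / (stock_uL + diluent_uL)


# Assumed reading of step 2.1.3: 1 uL stock into 1249 uL water.
working_mM = diluted_concentration(stock_mM=500, stock_uL=1, diluent_uL=1249)
print(f"{working_mM:.1f} mM")  # 500 mM / 1250 = 0.4 mM
```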
2.2 Purification of the oxidized DNA
2.2.1 The purified beads were taken out of the refrigerator at 4℃in advance, vortexed and mixed well and left at room temperature for 30 minutes.
2.2.2 The purification magnetic beads were vortexed to mix, 90 μL of the beads was added to the reaction system of step 2.1.6, and the mixture was gently pipetted up and down 10 times, mixed thoroughly, and incubated at room temperature for 5 minutes.
2.2.3 The centrifuge tube was placed on a magnetic rack for 5 minutes; after the solution cleared, the supernatant was discarded and the magnetic beads were retained.
2.2.4 With the centrifuge tube kept on the magnetic rack, 200 μL of freshly prepared 80% ethanol was added to the reaction tube; after standing for 30 s, the supernatant was removed and the magnetic beads were retained.
2.2.5 The above steps were repeated once.
2.2.6 The residual ethanol in the centrifuge tube was removed as much as possible using a 10. Mu.L pipette, and the tube was placed on a magnetic rack for 3-5 minutes until the beads were completely dried.
2.2.7 17 μL of nuclease-free water was added to the centrifuge tube, mixed by pipetting, briefly centrifuged, and incubated at room temperature for 2 minutes.
2.2.8 The centrifuge tube was placed on a magnetic rack for 2-3 minutes until the solution was clear.
2.2.9 16 μL of the supernatant was carefully pipetted into a new PCR tube, the beads were discarded, and the tube was placed on ice for subsequent experiments.
2.3 DNA denaturation
2.3.1 The PCR instrument was preheated to 85 ℃.
2.3.2 Add 4 μL of formamide to the 16 μL of purified DNA from step 2.2.9 and mix well.
2.3.3 Place the reaction solution in the PCR instrument and incubate at 85 ℃ for 10 minutes, with the heated lid set to 105 ℃; immediately after the program finishes, take out the reaction solution and place it on ice.
2.4 Cytosine deamination
2.4.1 To the reaction solution in step 2.3.3 on ice were added the components shown in Table 7:
TABLE 7
2.4.2 Mix the above reaction solution, place it in a PCR instrument, incubate at 37 ℃ for 3 hours, then hold at 4 ℃, with the heated lid set to ≥ 45 ℃.
2.5 Deaminated DNA purification
2.5.1 The purified beads were taken out of the refrigerator at 4℃in advance, vortexed and mixed well and left at room temperature for 30 minutes.
2.5.2 The purification magnetic beads were vortexed to mix, 100 μL of the beads was added to the reaction system, and the mixture was gently pipetted up and down 10 times, mixed thoroughly, and incubated at room temperature for 5 minutes.
2.5.3 The centrifuge tube was placed on a magnetic rack for 5 minutes; after the solution cleared, the supernatant was discarded and the magnetic beads were retained.
2.5.4 With the centrifuge tube kept on the magnetic rack, 200 μL of freshly prepared 80% ethanol was added to the reaction tube; after standing for 30 s, the supernatant was removed and the magnetic beads were retained.
2.5.5 The above steps were repeated once.
2.5.6 The residual ethanol in the centrifuge tube was removed as much as possible using a 10. Mu.L pipette, and the tube was placed on a magnetic rack for 3-5 minutes until the beads were completely dried.
2.5.7 21 μL of nuclease-free water was added to the centrifuge tube, mixed by pipetting, briefly centrifuged, and incubated at room temperature for 2 minutes.
2.5.8 The centrifuge tube was placed on a magnetic rack for 2-3 minutes until the solution was clear.
2.5.9 mu.L of the supernatant was carefully pipetted into a new PCR tube, the beads discarded and placed on ice for subsequent experiments.
3. PCR amplification
3.1 The components shown in table 8 were prepared on ice and added to the DNA of step 2.5.9 and mixed well:
TABLE 8
3.2 PCR instrument program was run as in table 9:
TABLE 9
3.3 Purification
3.3.1 The purified beads were taken out of the refrigerator at 4℃in advance, vortexed and mixed well and left at room temperature for 30 minutes.
3.3.2 The purification magnetic beads were vortexed to mix, 45 μL of the beads was added to the reaction system, and the mixture was gently pipetted up and down 10 times, mixed thoroughly, and incubated at room temperature for 5 minutes.
3.3.3 The centrifuge tube was placed on a magnetic rack for 5 minutes; after the solution cleared, the supernatant was discarded and the magnetic beads were retained.
3.3.4 With the centrifuge tube kept on the magnetic rack, 200 μL of freshly prepared 80% ethanol was added to the reaction tube; after standing for 30 s, the supernatant was removed and the magnetic beads were retained.
3.3.5 The above steps were repeated once.
3.3.6 The residual ethanol in the centrifuge tube was removed as much as possible using a 10. Mu.L pipette, and the tube was placed on a magnetic rack for 3-5 minutes until the beads were completely dried.
3.3.7 21 μL of nuclease-free water was added to the centrifuge tube, mixed by pipetting, briefly centrifuged, and incubated at room temperature for 2 minutes.
3.3.8 The centrifuge tube was placed on a magnetic rack for 2-3 minutes until the solution was clear.
3.3.9 mu.L of the supernatant was carefully pipetted into a new PCR tube, the beads discarded and placed on ice for subsequent experiments.
Example 5
In this example, probe hybridization capture libraries were constructed from the 24 methylation pre-libraries of Example 4.
1. Reagent and PCR instrument preparation
1.1 Before use, the hybridization reagents must be thawed according to the requirements of Table 10, mixed by vortexing, and briefly centrifuged:
TABLE 10
Note: if crystals have precipitated in the rapid hybridization reagent, incubate it at 65 ℃ until they are completely dissolved.
1.2 PCR procedure (gradient PCR instrument) was set up as in table 11:
TABLE 11
2. Library hybridization
2.1 The pre-libraries of Example 4 were randomly divided into 8 groups, with an input of 300 ng per pre-library; each group was placed in a new 1.5 mL centrifuge tube, 8 μL of blocking agent, 5 μL of hybridization buffer and 4 μL of probe were added, and the tube was placed in a vacuum concentrator and dried at a constant temperature of 45 ℃ until no liquid remained.
2.2 20 μL of rapid hybridization reagent, preheated at 65 ℃ for 10 minutes, was added to each dried centrifuge tube; the tube was left at room temperature for 5-10 minutes, mixed by vortexing, and briefly centrifuged; the 20 μL of solution was then transferred to a 0.2 mL low-adsorption PCR tube and briefly centrifuged again.
2.3 30 μL of hybridization enhancer was layered on the liquid surface in the 0.2 mL low-adsorption PCR tube as an oil seal (keep the layers separate; do not shake or mix), and the tube was transferred to the PCR instrument to run the hybridization program (heated lid 85 ℃); the instrument was held at 95 ℃ in the first step, and the run was started from the second step.
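The random grouping of step 2.1 and the volume needed for a 300 ng input can be sketched in Python as follows. This is illustrative only: equal-sized groups (24 pre-libraries in 8 groups of 3) are an assumption, and the library identifiers and the concentration in the usage line are hypothetical, since the text reports neither.

```python
import random


def pool_libraries(library_ids, n_pools=8, seed=0):
    """Shuffle library IDs and deal them into n_pools groups of near-equal size."""
    ids = list(library_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::n_pools] for i in range(n_pools)]


def input_volume_uL(conc_ng_per_uL, mass_ng=300):
    """Volume of a pre-library that contains mass_ng of DNA."""
    return mass_ng / conc_ng_per_uL


# Hypothetical identifiers for the 24 pre-libraries of Example 4.
libraries = [f"lib{i:02d}" for i in range(1, 25)]
pools = pool_libraries(libraries)
```

For example, a pre-library at a hypothetical 25 ng/μL would require `input_volume_uL(25.0)`, i.e. 12 μL, to reach 300 ng.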
3. Preparation of the cleaning reagent
3.1 The rapid binding buffer, rapid washing solution 1 and washing solution 2 were placed on a constant-temperature metal bath at 48 ℃ and heated for 30 minutes until dissolved; the streptavidin magnetic beads were taken out from 2-8 ℃ storage half an hour in advance and equilibrated at room temperature; the washing solutions were aliquoted and stored according to the requirements in Table 12:
TABLE 12
3.2 Preparation of streptavidin magnetic beads
3.2.1 The streptavidin magnetic beads were removed from the refrigerator at 4 ℃, left at room temperature for 30 minutes, and vortexed for 15 s to mix well. For each hybridization reaction, 50 μL of magnetic beads was pipetted into a new 1.5 mL low-adsorption centrifuge tube.
3.2.2 200 μL of rapid binding buffer was added and mixed by pipetting 10 times. The centrifuge tube was transferred to a magnetic rack until the liquid was completely clear, the supernatant was discarded, and the magnetic beads were retained; this procedure was repeated twice more, for three washes in total.
3.2.3 200 μL of rapid binding buffer was added to each hybridization reaction and mixed quickly by pipetting (to prevent the magnetic beads from drying on the tube wall).
4. Streptavidin magnetic bead capture library
4.1 After the hybridization program is finished, the hybridization product is quickly transferred, with the tube still on the PCR instrument, into the bead suspension of step 3.2.3, mixed by shaking, and then shaken on a mixer (800 rpm) at room temperature for 30 minutes.
4.2 Streptavidin magnetic bead cleaning 1
4.2.1 After shaking, the centrifuge tube from the previous step was briefly centrifuged and placed on a magnetic rack until the liquid cleared, and the supernatant was discarded; 200 μL of rapid washing solution 1 preheated to 66 ℃ was added, mixed by brief vortexing, and incubated at 66 ℃ for 5 minutes; the centrifuge tube was then transferred to the magnetic rack until the liquid was completely clear.
4.2.2 The supernatant was carefully aspirated and discarded, 200 μL of rapid washing solution 1 preheated to 66 ℃ was added again, mixed by brief vortexing, and incubated at 66 ℃ for 5 minutes.
4.2.3 The liquid was transferred to a new 1.5 mL centrifuge tube and placed on a magnetic rack until clear; the supernatant was aspirated and discarded, and the magnetic beads were retained.
4.3 Streptavidin magnetic bead cleaning 2
4.3.1 200 μL of washing solution 2 preheated to 48 ℃ was added to the centrifuge tube of step 4.2.3, mixed and briefly centrifuged, and incubated at 48 ℃ for 5 minutes; the centrifuge tube was transferred to a magnetic rack until the liquid was completely clear, the supernatant was carefully aspirated and discarded, and the magnetic beads were retained; this washing step was repeated twice more, for three washes in total.
4.3.2 The centrifuge tube was transferred to a magnetic rack until the liquid was completely clear. The supernatant was carefully aspirated and the magnetic beads were retained. The centrifuge tube was removed from the magnetic rack, briefly centrifuged, and returned to the magnetic rack, and the residual washing solution 2 was removed with a small-volume pipette.
4.3.3 The centrifuge tube was removed from the magnetic rack, 45 μL of nuclease-free water was added, and the beads were resuspended by pipetting and set aside for later use.
5. Hybridization library amplification
5.1 The amplification buffer (2×) and the universal amplification primer were removed and thawed on ice, mixed well before use, and centrifuged transiently. The amplification system was formulated as in table 13:
TABLE 13
5.2 Add 22.5 μL of the magnetic bead resuspension from step 4.3.3 to the amplification system and mix by pipetting; place the mixture in a PCR instrument and set the PCR conditions according to Table 14; store the remaining 22.5 μL of magnetic bead resuspension at -20 ℃.
TABLE 14
5.3 Purification after amplification of the hybridization library
5.3.1 The purified beads were removed from the refrigerator at 4℃in advance and allowed to equilibrate to room temperature for 30 minutes.
5.3.2 The beads were shaken until they were fully resuspended, 90. Mu.L of beads were added to the PCR tube, and then gently and repeatedly blown 10 times with a pipette to mix them well. Incubate at room temperature for 5-10 min.
5.3.3 The centrifuge tube was placed on a magnetic rack for 5 minutes; after the solution cleared, the supernatant was removed and the magnetic beads were retained.
5.3.4 With the centrifuge tube kept on the magnetic rack, 200 μL of freshly prepared 80% ethanol was added to the reaction tube; after standing for 30 s, the supernatant was removed and the magnetic beads were retained.
5.3.5 The above steps were repeated once.
5.3.6 The residual ethanol in the centrifuge tube was removed as completely as possible, and the tube was left on the magnetic rack for 3-5 minutes until the magnetic beads were completely dry.
5.3.7 To each sample, 21. Mu.L of nuclease-free water was added, vortexed, centrifuged briefly, and incubated at room temperature for 5-10 minutes.
5.3.8 The centrifuge tube was placed on a magnetic rack for 2-3 minutes until the solution was clear.
5.3.9 mu.L of the supernatant was carefully pipetted into a new 200. Mu.L PCR tube and the beads discarded.
5.3.10 The library obtained at this point can be stored at 4 ℃ for 1 week, or at -20 ℃ for long-term storage.
5.3.11 The library was quantified using Qubit 3.0 and the library fragment size was measured using an Agilent 2100 Bioanalyzer; the results are shown in FIG. 2. The remaining final library was stored at -20 ℃.
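A mass concentration from Qubit and a mean fragment size from the Bioanalyzer are commonly combined into a molar concentration before sequencing. The following Python sketch shows that conversion using the usual approximation of 660 g/mol per base pair of double-stranded DNA; the input values in the usage line are hypothetical, as the text reports no concentrations or fragment sizes.

```python
def library_molarity_nM(conc_ng_per_uL, mean_fragment_bp, g_per_mol_per_bp=660.0):
    """Convert a dsDNA library concentration from ng/uL to nM.

    nM = (ng/uL) / (660 g/mol/bp * fragment length in bp) * 1e6
    """
    return conc_ng_per_uL / (g_per_mol_per_bp * mean_fragment_bp) * 1e6


# Hypothetical example: 10 ng/uL at a mean fragment size of 300 bp.
print(f"{library_molarity_nM(10.0, 300):.1f} nM")
```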
Example 6
In this example, quality control and methylation frequency analysis were performed on the data obtained from the 24 clinical cfDNA samples of Example 5. The methylation frequencies of the set of 68 sites for each sample were input into the model, the score of each sample was calculated according to the model, and the sample type was determined according to the score threshold obtained in Example 2; the results are shown in Table 15:
TABLE 15
As shown in the table, using the methylation site set and kit screened by this method, 12 colorectal cancer patients were identified by model scoring; the true positive rate was 12/13 (92.3%), the positive predictive value was 12/12 (100%), the true negative rate was 11/11 (100%), and the negative predictive value was 11/12 (91.7%). Tumor signals can thus be identified efficiently, which lays a foundation for early screening and early diagnosis of tumors and for the detection of minimal residual disease.
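The four rates above follow directly from the counts implied by the text: 13 colorectal cancer samples, of which 12 were called positive, and 11 healthy samples, all called negative. A minimal Python sketch of the arithmetic, with an illustrative function name:

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, PPV and NPV from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "ppv": tp / (tp + fp),           # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
    }


# Counts implied by the text: 12 of 13 cancers detected, 11 of 11 healthy negative.
metrics = diagnostic_metrics(tp=12, fn=1, tn=11, fp=0)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```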
The applicant states that the detailed method of the present application is illustrated by the above examples, but the present application is not limited to the detailed method described above; that is, the present application need not be practiced in dependence upon the detailed method described above. It should be apparent to those skilled in the art that any changes or substitutions readily conceivable by one skilled in the art within the technical scope disclosed herein fall within the scope of protection and disclosure of the present application.

Claims (9)

1. A method of high throughput screening of a collection of methylation sites, the method comprising:
(1) Collecting methylation data of a tumor group and a corresponding normal group in a public database;
(2) Analyzing the difference of CpG methylation sites of the tumor sample and the normal sample in the public database by adopting a statistical test method, and screening CpG methylation sites with obvious difference to obtain a candidate CpG methylation site set;
(3) Randomly extracting methylation data of tumor samples and normal samples in a public database, and constructing an initial methylation model by using the candidate CpG methylation sites screened in the step (2);
(4) Extracting methylation data of tumor samples and normal samples in a public database, adopting the initial methylation model in the step (3), and carrying out fine screening by using a permutation test method to screen out a final methylation site set most relevant to a tumor group;
the public database is a TCGA database and/or a GEO database data set;
the number of tumor samples is more than or equal to 300, and the number of normal samples is more than or equal to 300; the tumor samples comprise stage I-III tumor samples, wherein the proportion of the stage I tumor samples is more than or equal to 30%;
the permutation test comprises the following steps:
(a) Randomly extracting a number of samples from a sample set, wherein the number of extracted samples is greater than 80% of the total number of samples, establishing a random forest model for verification, and calculating the mean square error entropy (IncMSE) and the residual sum of squares (IncNodePurity) of each candidate methylation site;
(b) Comparing the mean square error entropy and the residual sum of squares of the candidate methylation sites in the step (a) with the mean square error entropy and the residual sum of squares of the initial methylation model in the step (3), and calculating the absolute difference ΔIncMSE of the mean square error entropy and the absolute difference ΔIncNodePurity of the residual sum of squares;
(c) Repeating the steps of random extraction, modeling and calculation n times, wherein the number of repetitions n is more than or equal to 1000, so that two data sets, a data set A and a data set B, are obtained for each methylation site in the original model; data set A: ΔIncMSE {ΔIncMSE_1, ΔIncMSE_2, … ΔIncMSE_n}; data set B: ΔIncNodePurity {ΔIncNodePurity_1, ΔIncNodePurity_2, … ΔIncNodePurity_n};
(d) Calculating the P values of the mean square error entropy and the residual sum of squares of each site in the candidate methylation site set; screening out a final methylation site set most relevant to the tumor group according to the P value of each methylation site;
setting a P value threshold of each methylation site, and screening out a final methylation site set most relevant to a tumor group; the threshold is: P_IncMSE < 0.01 and/or P_NodePurity < 0.01.
2. The method of high throughput screening of methylation site sets of claim 1, wherein in step (1), the public database is a TCGA database and/or GEO database dataset;
in the step (2), the CpG methylation sites with significant difference have significant methylation frequency difference and significant P value in the tumor sample group and the normal sample group;
in the step (2), the absolute value of the methylation frequency difference of the CpG methylation sites with obvious difference in the tumor sample group and the normal sample group is more than or equal to 0.5;
in step (2), the test method is a Wilcoxon rank sum test corrected by the Benjamini-Hochberg method and/or a strict Wald test based on a β negative binomial distribution;
the CpG methylation sites with significant difference have an FDR value < 0.01 after the Wilcoxon rank sum test corrected by the Benjamini-Hochberg method; or a P value < 0.001 in the strict Wald test based on the β negative binomial distribution.
3. The method of high throughput screening of methylation site sets of claim 1, wherein in step (3), the public database is a TCGA database and/or a GEO database;
in the step (3), the number of tumor samples randomly extracted is more than or equal to 100, and the number of normal tissue samples is more than or equal to 100;
in the step (3), the model is constructed by adopting a random forest method;
in the step (3), the model is a mapping relation between the methylation rate of each site and the tumor or normal sample corresponding to the sample, and the model can calculate scores of the samples according to the mapping relation;
in step (3), the model includes the mean square error entropy and the residual sum of squares of each methylation site.
4. A set of targets for methylation abnormal sites associated with colorectal cancer, wherein the set of targets is screened by the method of high throughput screening for a set of methylation sites of any one of claims 1-3, the set of targets comprising: one or more of the 68 CpG sites, the CpG site coordinates based on the hg38 version of the human genome.
5. A probe pool for detecting a collection of methylation site targets associated with colorectal cancer, the probe pool comprising: a fully methylated form of the probes directed to the sites in the collection of targets of claim 4, and a fully unmethylated form of the probes directed to the sites in the collection of targets of claim 4.
6. The library of probes for detecting a set of methylation site targets associated with colorectal cancer of claim 5, wherein the probes cover the methylation sites of the set of targets of claim 4;
the methylation site is positioned in the middle of the probe;
the GC content of the probe is 40% -60%;
the length of the probe is 110-130 bp.
7. A kit for detecting a methylation site collection based on a liquid phase chip targeted capture high throughput sequencing technology, which is characterized by comprising the probe library for detecting the methylation site target collection related to colorectal cancer according to claim 5.
8. The kit for detecting a methylation site collection based on the liquid chip-based targeted capture high throughput sequencing technology of claim 7, further comprising any one or a combination of at least two of a methylation conversion reagent, a genome pre-library construction reagent, a linker, a hybridization capture reagent, a hybridization product amplification reagent, or streptavidin magnetic beads;
the linker is a methylated linker;
the methylation conversion reagent is selected from a bisulphite conversion reagent or an enzymatic methylation conversion reagent.
9. Use of the probe library for detecting methylation site target sets related to colorectal cancer according to claim 5 or 6 and/or the kit for detecting methylation site sets based on the liquid phase chip-based targeted capture high-throughput sequencing technology according to claim 7 or 8 in the preparation of products for detecting tumors or minimal residual disease.
CN202311150628.5A 2023-09-07 2023-09-07 Method and kit for screening methylation site collection in high throughput manner and application of kit Active CN116884491B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311150628.5A CN116884491B (en) 2023-09-07 2023-09-07 Method and kit for screening methylation site collection in high throughput manner and application of kit

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311150628.5A CN116884491B (en) 2023-09-07 2023-09-07 Method and kit for screening methylation site collection in high throughput manner and application of kit

Publications (2)

Publication Number Publication Date
CN116884491A CN116884491A (en) 2023-10-13
CN116884491B true CN116884491B (en) 2023-12-12

Family

ID=88268455

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311150628.5A Active CN116884491B (en) 2023-09-07 2023-09-07 Method and kit for screening methylation site collection in high throughput manner and application of kit

Country Status (1)

Country Link
CN (1) CN116884491B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108410980A (en) * 2018-01-22 2018-08-17 深圳华大基因股份有限公司 Screen method, kit and the application of the target area for the PCR detections that methylate
CN109686414A (en) * 2018-12-28 2019-04-26 陈洪亮 It is only used for the choosing method of the special DNA methylation assay Sites Combination of Hepatocarcinoma screening
CN112941180A (en) * 2021-02-25 2021-06-11 浙江大学医学院附属妇产科医院 Group of lung cancer DNA methylation molecular markers and application thereof in preparation of lung cancer early diagnosis kit
WO2022159035A1 (en) * 2021-01-20 2022-07-28 National University Of Singapore Heatrich-bs: heat enrichment of cpg-rich regions for bisulfite sequencing

Also Published As

Publication number Publication date
CN116884491A (en) 2023-10-13

Similar Documents

Publication Publication Date Title
CN107541791A (en) Construction method, kit and the application in plasma DNA DNA methylation assay library
CN105861710A (en) Sequencing joint and preparation method and application thereof in ultra-low frequency mutation detection
CN112176057B (en) Marker for detecting pancreatic duct adenocarcinoma by using CpG site methylation level and application thereof
WO2016049878A1 (en) Snp profiling-based parentage testing method and application
CN111440896A (en) Novel β coronavirus variation detection method, probe and kit
CN114317736B (en) Methylation marker combination for pan-cancer species detection and application thereof
CN111979307B (en) Targeted sequencing method for detecting gene fusion
CN108103164B (en) Method for detecting copy number variation by using multiple fluorescent competitive PCR
CN111961729A (en) Kit for detecting content of 5-hydroxymethylcytosine and application thereof
CN107142320B (en) Gene marker for detecting liver cancer and application thereof
CN115678979A (en) Pineapple liquid phase chip and application thereof
CN113061652A (en) Method for determining 5hmC content in gene marker based on glucose modification
CN109295500B (en) Single cell methylation sequencing technology and application thereof
CN115341031A (en) Screening method of pan-cancer methylation biomarker, biomarker and application
CN103998625B (en) Method and system for virus detection
CN114182022A (en) Method for detecting liver cancer specific mutation based on cfDNA base mutation frequency distribution
CN112259165B (en) Method and system for detecting microsatellite instability state
CN116884491B (en) Method and kit for screening methylation site collection in high throughput manner and application of kit
CN117487920A (en) Methylation biomarker combination for detecting metastatic prostate cancer tumor burden and application thereof
CN111748628A (en) Primer and kit for detecting thyroid cancer prognosis related gene variation
WO2022246783A1 (en) Probe composition for identifying or assisting identification of mammalian species, and kit and application thereof
CN114250269A (en) Probe composition, second-generation sequencing library based on probe composition and application of second-generation sequencing library
CN109517819A (en) Detection probe, method and kit for detecting gene mutations, methylation modifications and/or hydroxymethylation modifications at multiple targets
CN114657248B (en) Plasma free DNA methylation biomarker for detecting early colorectal cancer or composition, application and kit thereof
CN112011595A (en) Whole genome amplification method for SARS-CoV-2 virus, application and sequencing method and kit

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant