CN113817822A - Tumor diagnosis kit based on methylation detection and application thereof - Google Patents

Publication number
CN113817822A
Authority
CN
China
Prior art keywords
dna molecules
region
ratio
methylated dna
methylated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010564746.0A
Other languages
Chinese (zh)
Other versions
CN113817822B (en)
Inventor
焦宇辰
曲春枫
宋欠欠
王宇婷
王沛
王京京
陈坤
王思振
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Genetron Health Beijing Co ltd
Cancer Hospital and Institute of CAMS and PUMC
Original Assignee
Genetron Health Beijing Co ltd
Cancer Hospital and Institute of CAMS and PUMC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Genetron Health Beijing Co ltd and Cancer Hospital and Institute of CAMS and PUMC
Priority to CN202010564746.0A
Publication of CN113817822A
Application granted
Publication of CN113817822B
Legal status: Active

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C12Q1/6886 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00 Oligonucleotides characterized by their use
    • C12Q2600/154 Methylation markers

Abstract

The invention discloses a tumor diagnosis kit based on methylation detection and application thereof. The invention establishes a tumor prediction index based on low-depth whole genome sequencing, namely low-depth whole genome methylation characteristics, and can establish a prediction model in a tumor/non-tumor sample through a random forest algorithm according to the low-depth whole genome methylation characteristics to realize early screening of the tumor sample. The tumor prediction index of the invention has good practicability and wide application prospect.

Description

Tumor diagnosis kit based on methylation detection and application thereof
Technical Field
The invention relates to a tumor diagnosis kit based on methylation detection and application thereof in the field of biomedicine.
Background
Circulating tumor DNA (ctDNA) carries tumor-specific genomic variations and epigenetic modification features, and can be applied to early cancer screening, diagnosis and staging, guidance of targeted therapy, evaluation of treatment response, relapse monitoring and other purposes. At present, liquid biopsy of tumors usually uses PCR and targeted sequencing to detect a specific set of gene-level or epigenomic variations, such as ctDNA point mutations, gene fusions and methylation of specific genes. 1) PCR is low in cost and simple to operate, and is usually used to detect one or a few known variations; it cannot detect complex mutations such as gene fusions or unknown mutations, and its coverage is small. 2) Targeted sequencing is suitable for detecting multiple targets, including complex mutations, but the kits are generally expensive, complicated to operate and time-consuming. In practice, a suitable detection method has to be selected according to the number and characteristics of the targets. Because ctDNA makes up only a small proportion of cfDNA (cell-free DNA, i.e. DNA circulating freely in the blood), NGS-based detection of ctDNA genomic variation is essentially low-frequency variant detection: it places high demands on the sensitivity and lower detection limit of the method and requires a high sequencing depth, so the detection cost is high and large-scale adoption is difficult. In addition, the detection range is confined to a preset target gene region, the result depends strongly on the choice of target region, and the predictions of different detection combinations fluctuate considerably.
The above problems can be overcome by using indexes based on low-depth whole genome sequencing. Existing indexes of this kind, such as ctDNA structure (fragment size, breakpoint distribution) and copy number variation, can also serve as marker features of tumors. These markers can be obtained with low-depth sequencing at low cost, which facilitates early screening of large populations. 1) Structural features of ctDNA such as fragment size and breakpoint distribution are often related to functional genomic features such as nucleosome occupancy, transcription factor binding and open chromatin regions. Stephen Cristiano et al. calculated the ratio of the number of short cfDNA fragments (100-150 bp) to the number of long cfDNA fragments (151-220 bp) in tumor characteristic regions at 1-2X sequencing depth; the area under the ROC curve (AUC) for tumor/healthy diagnosis was 0.94, with a prediction sensitivity > 70% for 7 different cancer types and a specificity of 95% (Cristiano et al, 2019). Kun Sun et al. calculated the breakpoint abundance of cfDNA in tissue-specific open chromatin regions at a sequencing depth of 3.2X and found that the breakpoint abundance index of HCC samples was significantly higher than that of healthy samples, indicating that this index can distinguish tumor samples from healthy samples (Sun et al, 2019). 2) Copy number variation (CNV) of ctDNA in blood: besides oncogene mutations and gene fusions, CNV is one of the major DNA structural variations that cause canceration, and many tumors have specific CNVs. For example, in liver cancer samples, copy number abnormalities tend to occur on chromosomes chr1 and chr8 (Jiang et al, 2015); when the median absolute deviation t-MAD was calculated from whole genome copy number results, the AUC for tumor/healthy sample diagnosis was 0.69 (Florent Mouliere et al, 2018).
The existing liquid biopsy markers based on low-depth whole genome sequencing are limited to indexes that depend on genome sequence information, and methylation features at low sequencing depth have not been taken into consideration. On the experimental side, no library construction method has been established that simultaneously records the two important tumor-specific markers in ctDNA, namely genome sequence and methylation modification; existing gene variation detection and methylation detection follow different technical routes and require two separate samples for independent library construction.
Disclosure of Invention
The invention aims at providing a tumor diagnosis kit based on methylation detection.
The tumor diagnosis kit based on methylation detection comprises a feature detection reagent, wherein the feature detection reagent comprises a DNA methylation feature detection reagent, and the DNA methylation feature comprises: the ratio of the number of methylated DNA molecules in a region and/or the ratio of the length of methylated DNA molecules in a region;
the ratio of the number of methylated DNA molecules in the region comprises: the ratio of the number of methylated DNA molecules in a region to the total number of molecules of whole genomic DNA, and/or the ratio of the number of methylated DNA molecules in a region to the sum of the number of methylated DNA molecules in a region and the number of unmethylated DNA molecules in a region;
the length ratio of the methylated DNA molecules in the region comprises: the ratio of the number of short segment methylated DNA molecules to the number of long segment methylated DNA molecules in the region, and/or the ratio of the number of small segment methylated DNA molecules to the total number of small segment methylated DNA molecules and long segment methylated DNA molecules in the region.
In the kit, the DNA molecule is a cfDNA molecule or a DNA molecule fragment obtained by breaking a whole genome;
the methylated DNA molecule is a DNA molecule containing a methylation site (i.e. a DNA molecule which is methylated); the unmethylated DNA molecule is a DNA molecule that does not contain a methylation site (i.e., a DNA molecule that is not methylated).
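The two number ratios defined above reduce to simple arithmetic once the molecules have been counted. The following is a minimal sketch under stated assumptions: the function name, dictionary keys and example counts are hypothetical and are not taken from the invention.

```python
def methylated_number_ratios(region_methylated, region_unmethylated, genome_total):
    """Return the two number ratios for one region.

    region_methylated   -- methylated DNA molecules counted in the region
    region_unmethylated -- unmethylated DNA molecules counted in the region
    genome_total        -- all DNA molecules counted across the whole genome
    """
    if genome_total <= 0:
        raise ValueError("genome_total must be positive")
    classified = region_methylated + region_unmethylated
    if classified <= 0:
        raise ValueError("region has no methylated or unmethylated molecules")
    return {
        # methylated molecules in the region / all molecules in the whole genome
        "methylated_over_genome_total": region_methylated / genome_total,
        # methylated / (methylated + unmethylated) within the region
        "methylated_over_classified": region_methylated / classified,
    }


# Made-up counts, for illustration only.
ratios = methylated_number_ratios(300, 700, 4000)
print(ratios)
```

The denominator of the first ratio is the whole-genome total, so it also counts molecules that could not be classified; the second ratio ignores such molecules.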
The methylated DNA molecules and unmethylated DNA molecules can be identified, and methylation detected, by DNA methylation sequencing methods, such as: bisulfite sequencing, restriction enzyme (e.g., HhaI)-based sequencing, targeted enrichment methylation site sequencing.
For example, in bisulfite sequencing methods:
the methylated DNA molecules are: molecules containing a C base that is not converted to T;
unmethylated DNA molecules are: molecules containing a C base that is converted to T.
In targeted enrichment methylation site sequencing methods, such as antibody enrichment:
the methylated DNA molecules are: molecules captured by a methylation-specific antibody;
unmethylated DNA molecules are: molecules not captured by a methylation-specific antibody.
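The bisulfite rule above can be sketched as a comparison of a read against the reference at cytosine positions. This is only an illustration under several assumptions that the text does not specify: the read is on the same strand as the reference and aligned without gaps, every reference C is inspected (a real pipeline would likely restrict this to CpG context), and a molecule with at least one retained C is called methylated even if other cytosines were converted. The function name and the "uninformative" label are hypothetical.

```python
def classify_bisulfite_read(reference, read):
    """Classify one bisulfite-treated read against its reference sequence."""
    if len(reference) != len(read):
        raise ValueError("reference and read must have the same length")
    retained = 0   # reference C still read as C (not converted)
    converted = 0  # reference C read as T (converted)
    for ref_base, read_base in zip(reference.upper(), read.upper()):
        if ref_base != "C":
            continue
        if read_base == "C":
            retained += 1
        elif read_base == "T":
            converted += 1
    if retained:
        return "methylated"      # assumption: any retained C wins
    if converted:
        return "unmethylated"
    return "uninformative"       # no cytosine covered by the read


print(classify_bisulfite_read("ACGTCG", "ACGTTG"))  # one C retained
print(classify_bisulfite_read("ACGTCG", "ATGTTG"))  # every C converted
```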
Whether the DNA molecule is a cfDNA molecule or a DNA fragment obtained by breaking the whole genome depends on the sample to be tested: for example, when the sample to be tested is a blood sample, the DNA molecule is a cfDNA molecule, and when the sample to be tested is a tissue sample, the DNA molecule is a DNA fragment obtained by breaking the whole genome.
In the above kit, the DNA methylation characteristics further comprise:
the length ratio of the DNA molecules in question in the region;
the length ratio of the DNA molecules in question in the region comprises: the ratio of the number of short segment DNA molecules in question to the number of long segment DNA molecules in question in the region, and/or the ratio of the number of small segment DNA molecules in question to the total number of small segment DNA molecules in question and long segment DNA molecules in question in the region;
the DNA molecule in question is a DNA molecule that cannot be determined to be methylated or unmethylated during detection. For example, in a restriction enzyme-based sequencing method using cfDNA as the sample, DNA molecules in question arise after cleavage with a methylation-sensitive restriction enzyme. Each methylation-sensitive restriction enzyme recognizes its own specific sequence, so DNA molecules lacking that sequence are not recognized by the enzyme during digestion; whether they are methylated cannot be known, and they are therefore in question. For example, digestion with the restriction enzyme HhaI, which recognizes the GCGC sequence, yields: methylated DNA molecules, unmethylated DNA molecules, and DNA molecules in question.
For example, when methylation detection is performed by sequencing based on the restriction enzyme HhaI, the molecules are defined as follows:
the methylated DNA molecule is: a DNA molecule containing the sequence GCGC and not digested by HhaI;
the unmethylated DNA molecule is: a DNA molecule containing a sequence GCGC and capable of being digested by HhaI;
the DNA molecules in question are: DNA molecules which do not contain the sequence GCGC.
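The three HhaI categories above form a small decision rule. The sketch below assumes that an upstream step already knows, for each molecule, the sequence of the locus it covers and whether it was cut by HhaI; how that is determined is not described here, so the `was_digested` flag, the function name and the example sequences are all hypothetical.

```python
from collections import Counter

HHAI_SITE = "GCGC"  # recognition sequence of HhaI, as stated in the text


def classify_hhai_molecule(locus_sequence, was_digested):
    """Apply the three-way HhaI rule to one DNA molecule."""
    if HHAI_SITE not in locus_sequence.upper():
        return "in_question"   # no GCGC: methylation state cannot be judged
    # GCGC present: digestion means unmethylated, no digestion means methylated
    return "unmethylated" if was_digested else "methylated"


# Made-up molecules: (sequence of the covered locus, cut by HhaI?)
molecules = [
    ("ATGCGCTA", False),
    ("ATGCGCTA", True),
    ("ATATATTA", False),
]
counts = Counter(classify_hhai_molecule(seq, cut) for seq, cut in molecules)
print(counts)
```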
In the kit, the short segment is a segment between S1 and S2, the long segment is a segment between (S2+1) and S3, and the small segment is a segment smaller than S1;
the S1 is 1-100bp, for example, the S1 is 10bp, 20bp, 30bp, 40bp, 50bp, 60bp, 70bp, 80bp, 90bp or 100 bp; for example, the S1 is 5bp, 15bp, 25bp, 35bp, 45bp, 55bp, 65bp, 75bp, 85bp or 95 bp.
The S2 is 150-169bp, for example, the S2 is 150bp, 152bp, 155bp, 157bp, 160bp, 162bp, 165bp, 167bp or 169 bp.
The S3 is 151-250bp, for example, the S3 is 151-220bp; for example, the S3 is 200-250bp; for example, the S3 is 160bp, 170bp, 180bp, 190bp, 200bp, 210bp, 220bp, 230bp, 240bp or 250bp.
The region comprises a whole genome and/or a specific region, and the specific region comprises one or more of a CpG island region, a promoter region, a tumor specific region and a bin region.
For example, the DNA methylation characteristics of the whole genome can be detected independently, and the whole genome region and one or more of a CpG island region, a promoter region, a tumor specific region and a bin region can be detected simultaneously; or detecting one or more of CpG island region, promoter region, tumor specific region and bin region.
For example, when the region is the whole genome and S1 is 100bp, S2 is 150bp, and S3 is 220bp, the features for the small, short, and long segments are expressed as follows:
the ratio of the number of short segment methylated DNA molecules to the number of long segment methylated DNA molecules is: whole genome 100-150bp methylated molecule count / whole genome 151-220bp methylated molecule count; the ratio of the number of small segment methylated DNA molecules to the total number of small segment and long segment methylated DNA molecules is: whole genome <100bp methylated molecule count / (whole genome <100bp methylated molecule count + whole genome 151-.
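With S1 = 100bp, S2 = 150bp and S3 = 220bp, the two length ratios can be computed from the lengths of the methylated molecules alone. The sketch below follows the segment definitions given earlier (small: shorter than S1; short: S1 to S2; long: S2+1 to S3). The function name and the list of lengths are made up for illustration, and treating both ends of each range as inclusive is an assumption.

```python
def length_ratios(lengths, s1=100, s2=150, s3=220):
    """Return (short/long, small/(small+long)) for a list of molecule lengths in bp."""
    small = sum(1 for n in lengths if n < s1)
    short = sum(1 for n in lengths if s1 <= n <= s2)
    long_ = sum(1 for n in lengths if s2 + 1 <= n <= s3)
    # Molecules longer than S3 fall in none of the three classes.
    short_over_long = short / long_ if long_ else None
    small_share = small / (small + long_) if (small + long_) else None
    return short_over_long, small_share


# Hypothetical lengths of methylated molecules in the whole genome.
methylated_lengths = [80, 95, 120, 140, 150, 151, 180, 220, 221]
short_over_long, small_share = length_ratios(methylated_lengths)
print(short_over_long, small_share)
```

The same function applies unchanged to DNA molecules in question or to a single region, because only the input list changes.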
The CpG island region is a region of 500-1000bp in length and > 50% GC content.
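The two criteria stated above (length of 500-1000bp and GC content above 50%) can be checked directly on a sequence. This is only a sketch of that stated definition, not a CpG island caller; the function name is hypothetical.

```python
def meets_cpg_island_criteria(sequence):
    """Check the length (500-1000 bp) and GC content (> 50%) criteria."""
    seq = sequence.upper()
    if not 500 <= len(seq) <= 1000:
        return False
    gc_count = sum(1 for base in seq if base in "GC")
    return gc_count / len(seq) > 0.5


print(meets_cpg_island_criteria("GC" * 300))    # 600 bp, GC content 100%
print(meets_cpg_island_criteria("GCAT" * 150))  # 600 bp, GC content exactly 50%
```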
The promoter region refers to a promoter region in the whole genome.
For example, the CpG island region and promoter region may be divided according to known databases, such as NCBI, UCSC, etc.
The tumor specific region is divided according to different tumor specific gene marker group regions.
The bin region is obtained by dividing a whole genome into a plurality of bins, for example, the whole genome is divided into a plurality of bin regions uniformly or non-uniformly according to the length; for example, in the 5Mb/bin partition, each bin does not cross the centromere, and bins at the centromere edge are allowed to be less than 5 Mb.
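The 5Mb/bin partition can be sketched as tiling each chromosome arm separately, so that no bin crosses the centromere. The sketch makes assumptions that the text does not spell out: coordinates are 0-based and half-open, the centromere interval itself is excluded, and the q arm is tiled from the chromosome end backwards so that the shorter bin sits at the centromere edge. The chromosome length and centromere coordinates are invented.

```python
BIN_SIZE = 5_000_000  # 5 Mb per bin


def arm_bins(chrom_length, cen_start, cen_end, bin_size=BIN_SIZE):
    """Return (start, end) bins for one chromosome; none crosses the centromere."""
    bins = []
    # p arm: tile from position 0 toward the centromere; the last bin may be short.
    start = 0
    while start < cen_start:
        end = min(start + bin_size, cen_start)
        bins.append((start, end))
        start = end
    # q arm: tile from the chromosome end back toward the centromere,
    # so the short bin is again the one touching the centromere.
    q_bins = []
    end = chrom_length
    while end > cen_end:
        start = max(end - bin_size, cen_end)
        q_bins.append((start, end))
        end = start
    bins.extend(reversed(q_bins))
    return bins


# Toy chromosome: 23 Mb long with a centromere at 7-9 Mb (made-up coordinates).
bins = arm_bins(23_000_000, 7_000_000, 9_000_000)
for b in bins:
    print(b)
```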
In the kit, further, when the region is a specific region, the ratio of the number of methylated DNA molecules in the region also includes the ratio of the number of methylated DNA molecules in the specific region to the total number of methylated DNA molecules in the whole genome.
For example, for a CpG island region, the ratio of the number of methylated DNA molecules also includes the ratio of the number of methylated DNA molecules in the CpG island region to the total number of methylated DNA molecules in the whole genome.
For example, for a promoter region, the ratio of the number of methylated DNA molecules also includes the ratio of the number of methylated DNA molecules in the promoter region to the total number of methylated DNA molecules in the genome.
In the kit, the characteristic detection reagent further comprises a CNV characteristic detection reagent, wherein the CNV comprises a CNV of a chromosome arm and/or a CNV of a hotspot gene.
The CNV of a hotspot gene refers to the CNV of a mutated gene related to tumors.
In one embodiment of the present invention, a tumor diagnosis kit based on methylation detection comprises a feature detection reagent, wherein the feature detection reagent comprises a DNA methylation feature detection reagent, and the DNA methylation feature comprises:
the ratio of the number of methylated DNA molecules of the whole genome and the length ratio of methylated DNA molecules of the whole genome;
the ratio of the number of methylated DNA molecules of the whole genome comprises:
the ratio of the number of methylated DNA molecules of the whole genome to the total number of DNA molecules of the whole genome;
the ratio of the number of methylated DNA molecules of the whole genome to the sum of the number of methylated DNA molecules of the whole genome and the number of unmethylated DNA molecules;
the genome-wide methylated DNA molecule length ratios include:
the ratio of the number of short segment methylated DNA molecules to the number of long segment methylated DNA molecules in the whole genome;
the ratio of the number of small segment methylated DNA molecules to the total number of small segment and long segment methylated DNA molecules in the whole genome.
In another embodiment, in the above kit, the DNA methylation profile further comprises: the length ratio of the DNA molecules in question of the whole genome;
the length ratio of the DNA molecules in question of the whole genome comprises:
the ratio of the number of DNA molecules in question in the whole genome short segment to the number of DNA molecules in question in the long segment;
the ratio of the number of DNA molecules in question in a small fragment of the whole genome to the total number of DNA molecules in question in a small fragment and in a long fragment.
In another embodiment, a tumor diagnostic kit based on methylation detection comprises a feature detection reagent comprising a DNA methylation feature detection reagent, the DNA methylation feature comprising:
the ratio of the number of methylated DNA molecules of the whole genome and the length ratio of methylated DNA molecules of the whole genome;
the ratio of the number of methylated DNA molecules in the CpG island region and the length ratio of methylated DNA molecules in the CpG island region;
the ratio of the number of methylated DNA molecules of the whole genome comprises:
the ratio of the number of methylated DNA molecules of the whole genome to the total number of DNA molecules of the whole genome;
the ratio of the number of methylated DNA molecules of the whole genome to the sum of the number of methylated DNA molecules of the whole genome and the number of unmethylated DNA molecules;
the genome-wide methylated DNA molecule length ratios include:
the ratio of the number of short segment methylated DNA molecules to the number of long segment methylated DNA molecules in the whole genome;
the ratio of the number of small segment methylated DNA molecules to the total number of small segment and long segment methylated DNA molecules in the whole genome.
The ratio of the number of methylated DNA molecules within the CpG island region comprises:
the ratio of the number of methylated DNA molecules in the CpG island region to the total number of whole genome DNA molecules;
the ratio of the number of methylated DNA molecules in the CpG island region to the sum of the number of methylated DNA molecules and the number of unmethylated DNA molecules in the CpG island region;
the ratio of the number of methylated DNA molecules in the CpG island region to the total number of methylated DNA molecules in the whole genome.
The methylated DNA molecule length ratio in the CpG island region comprises:
the ratio of the number of short segment methylated DNA molecules to the number of long segment methylated DNA molecules in the CpG island region;
the ratio of the number of small segment methylated DNA molecules to the total number of small segment and long segment methylated DNA molecules in the CpG island region.
In one embodiment, a tumor diagnostic kit based on methylation detection comprises a feature detection reagent comprising a DNA methylation feature detection reagent, wherein the DNA methylation feature comprises:
the ratio of the number of methylated DNA molecules of the whole genome, the length ratio of methylated DNA molecules of the whole genome and the length ratio of DNA molecules in question of the whole genome;
the ratio of the number of methylated DNA molecules in the CpG island region and the length ratio of methylated DNA molecules in the CpG island region;
the ratio of the number of methylated DNA molecules of the whole genome comprises:
the ratio of the number of methylated DNA molecules of the whole genome to the total number of DNA molecules of the whole genome;
the ratio of the number of methylated DNA molecules of the whole genome to the sum of the number of methylated DNA molecules of the whole genome and the number of unmethylated DNA molecules;
the genome-wide methylated DNA molecule length ratios include:
the ratio of the number of short segment methylated DNA molecules to the number of long segment methylated DNA molecules in the whole genome;
the ratio of the number of small segment methylated DNA molecules to the total number of small segment and long segment methylated DNA molecules in the whole genome.
The length ratio of the DNA molecules in question of the whole genome comprises:
the ratio of the number of DNA molecules in question in the whole genome short segment to the number of DNA molecules in question in the long segment;
the ratio of the number of DNA molecules in question in a small fragment of the whole genome to the total number of DNA molecules in question in a small fragment and in a long fragment.
The ratio of the number of methylated DNA molecules within the CpG island region comprises:
the ratio of the number of methylated DNA molecules in the CpG island region to the total number of whole genome DNA molecules;
the ratio of the number of methylated DNA molecules in the CpG island region to the sum of the number of methylated DNA molecules and the number of unmethylated DNA molecules in the CpG island region;
the ratio of the number of methylated DNA molecules in the CpG island region to the total number of methylated DNA molecules in the whole genome.
The methylated DNA molecule length ratio in the CpG island region comprises:
the ratio of the number of short segment methylated DNA molecules to the number of long segment methylated DNA molecules in the CpG island region;
the ratio of the number of small segment methylated DNA molecules to the total number of small segment and long segment methylated DNA molecules in the CpG island region.
In another embodiment, in the above kit, the DNA methylation profile further comprises: the ratio of the number of methylated DNA molecules in the promoter region and the length ratio of methylated DNA molecules in the promoter region;
the ratio of the number of methylated DNA molecules in the promoter region comprises:
the ratio of the number of methylated DNA molecules in the promoter region to the total number of whole genome DNA molecules;
the ratio of the number of methylated DNA molecules in the promoter region to the sum of the number of methylated DNA molecules and the number of unmethylated DNA molecules in the promoter region;
the ratio of the number of methylated DNA molecules in the promoter region to the total number of methylated DNA molecules in the whole genome.
The length ratio of the methylated DNA molecules in the promoter region comprises:
the ratio of the number of short segment methylated DNA molecules to the number of long segment methylated DNA molecules in the promoter region;
the ratio of the number of small segment methylated DNA molecules to the total number of small segment and long segment methylated DNA molecules in the promoter region.
In one embodiment, a tumor diagnosis kit based on methylation detection comprises a feature detection reagent, wherein the feature detection reagent comprises a DNA methylation feature detection reagent and a CNV feature detection reagent;
the DNA methylation profile includes:
the ratio of the number of methylated DNA molecules of the whole genome, the length ratio of methylated DNA molecules of the whole genome and the length ratio of DNA molecules in question of the whole genome;
the ratio of the number of methylated DNA molecules in the CpG island region and the length ratio of methylated DNA molecules in the CpG island region;
the ratio of the number of methylated DNA molecules in the promoter region and the length ratio of methylated DNA molecules in the promoter region.
The ratio of the number of methylated DNA molecules of the whole genome comprises:
the ratio of the number of methylated DNA molecules of the whole genome to the total number of DNA molecules of the whole genome;
the ratio of the number of methylated DNA molecules of the whole genome to the sum of the number of methylated DNA molecules of the whole genome and the number of unmethylated DNA molecules;
the genome-wide methylated DNA molecule length ratios include:
the ratio of the number of short segment methylated DNA molecules to the number of long segment methylated DNA molecules in the whole genome;
the ratio of the number of small segment methylated DNA molecules to the total number of small segment and long segment methylated DNA molecules in the whole genome.
The length ratio of the DNA molecules in question of the whole genome comprises:
the ratio of the number of DNA molecules in question in the whole genome short segment to the number of DNA molecules in question in the long segment;
the ratio of the number of DNA molecules in question in a small fragment of the whole genome to the total number of DNA molecules in question in a small fragment and in a long fragment.
The ratio of the number of methylated DNA molecules within the CpG island region comprises:
the ratio of the number of methylated DNA molecules in the CpG island region to the total number of whole genome DNA molecules;
the ratio of the number of methylated DNA molecules in the CpG island region to the sum of the number of methylated DNA molecules and the number of unmethylated DNA molecules in the CpG island region;
the ratio of the number of methylated DNA molecules in the CpG island region to the total number of methylated DNA molecules in the whole genome.
The methylated DNA molecule length ratio in the CpG island region comprises:
the ratio of the number of short segment methylated DNA molecules to the number of long segment methylated DNA molecules in the CpG island region;
the ratio of the number of small segment methylated DNA molecules to the total number of small segment and long segment methylated DNA molecules in the CpG island region.
The ratio of the number of methylated DNA molecules in the promoter region comprises:
the ratio of the number of methylated DNA molecules in the promoter region to the total number of whole genome DNA molecules;
the ratio of the number of methylated DNA molecules in the promoter region to the sum of the number of methylated DNA molecules and the number of unmethylated DNA molecules in the promoter region;
the ratio of the number of methylated DNA molecules in the promoter region to the total number of methylated DNA molecules in the whole genome.
The length ratio of the methylated DNA molecules in the promoter region comprises:
the ratio of the number of short segment methylated DNA molecules to the number of long segment methylated DNA molecules in the promoter region;
the ratio of the number of small segment methylated DNA molecules to the total number of small segment and long segment methylated DNA molecules in the promoter region.
The CNV signature includes CNVs of chromosomal arms and/or CNVs of hotspot genes.
The chromosomal arm copy number variation is the CNV number of each chromosomal arm.
In another embodiment, in the above kit, the DNA methylation profile further comprises: the ratio of the number of methylated DNA molecules in the bin region, the length ratio of methylated DNA molecules in the bin region and/or the length ratio of DNA molecules in the bin region.
The ratio of the number of methylated DNA molecules of the whole genome comprises:
the ratio of the number of methylated DNA molecules of the whole genome to the total number of DNA molecules of the whole genome;
the ratio of the number of methylated DNA molecules of the whole genome to the sum of the number of methylated DNA molecules of the whole genome and the number of unmethylated DNA molecules.
The genome-wide methylated DNA molecule length ratios include:
the ratio of the number of short segment methylated DNA molecules to the number of long segment methylated DNA molecules in the whole genome;
the ratio of the number of small segment methylated DNA molecules to the total number of small segment and long segment methylated DNA molecules in the whole genome.
The length ratio of the DNA molecules in question of the whole genome comprises:
the ratio of the number of DNA molecules in question in the whole genome short segment to the number of DNA molecules in question in the long segment;
the ratio of the number of DNA molecules in question in a small fragment of the whole genome to the total number of DNA molecules in question in a small fragment and in a long fragment.
The feature detection reagents may be the reagents used in the examples to detect the "low depth Whole Genome (WGS) methylation profile", the "low depth WGS insert distribution profile", and/or the "low depth WGS CNV profile".
The kit may further include a data processing system that converts the information on each feature into information for judging whether or not the patient has cancer.
The invention also aims to provide the application of the kit in preparing a cancer detection product.
It is a further object of the present invention to provide a tumor diagnosis system based on methylation detection, said system comprising device A and device B;
the device A is used for detecting the characteristics of the sample;
the device B converts the characteristic data information into judgment information on whether the patient is suffering from cancer.
Methods for DNA methylation detection include bisulfite sequencing, restriction-enzyme-based sequencing, and targeted-enrichment sequencing of methylation sites. The method specifically comprises the following steps:
extracting sample DNA and constructing a sequencing library, which may include amplification after bisulfite treatment, methylation-sensitive restriction enzyme treatment, or targeted enrichment;
after sequencing, performing analysis and statistics to obtain the DNA methylation features.
A prediction model is established from the DNA methylation features and/or the CNV features, and prediction with this model enables early screening of tumor samples.
The features used to construct the model are obtained by sequencing, preferably low-depth sequencing (i.e., low-depth DNA methylation features), and more preferably restriction-enzyme-based low-depth sequencing.
Optionally, the model construction method includes: selecting a specific number of tumor and non-tumor samples to construct a training set, collecting the DNA methylation feature and/or CNV feature data of the training set, and constructing a classifier prediction model, for example by a random forest algorithm. The features of the sample to be tested are then collected and input into the prediction model to realize early screening of tumor samples.
In the present invention, the tumor may be a solid tumor and/or a hematological tumor. For example, the solid tumor can be liver cancer, colon cancer, breast cancer, and/or gastric cancer. The liver cancer may be hepatocellular carcinoma.
The invention establishes tumor prediction indexes suitable for low-depth whole genome sequencing, namely low-depth DNA methylation features; the sequencing depth can be as low as 0.5X. Based on these features, a prediction model can be established from tumor and non-tumor samples by a random forest algorithm, thereby realizing early screening of tumor samples. The DNA methylation features can be used alone as tumor markers and achieve good prediction accuracy; they can also be combined with other indexes (ctDNA fragment size, breakpoint distribution and copy number variation) to establish a comprehensive tumor prediction index, further improving the sensitivity and specificity of tumor prediction.
According to the invention, comprehensive DNA information is obtained through whole genome sequencing and is not limited to specific detection targets. Genome sequence and methylation modification information can be obtained simultaneously from a single library construction and sequencing run, so the operation is simple and the sample input requirement is low. Methylation detection based on the specific endonuclease does not affect the detection of genome-sequence-related indexes, so complete genomic and epigenomic information can be obtained at the same time. A good preliminary screening effect is achieved with low-depth sequencing at an average depth of only 0.8X-1.5X per sample, and also when the depth is as low as 0.5X. The cost is low, making the method suitable for primary screening in large populations and giving it practical clinical value. The method can be run at low depth to save cost, but is also applicable to high-depth sequencing. Using the overall methylation features at low depth as a tumor early screening index improves detection accuracy beyond that of the existing low-depth sequencing indexes (fragment distribution and copy number variation). Combining the methylation features with ctDNA fragment size, breakpoint distribution, copy number variation and similar features into a comprehensive tumor prediction index can further improve prediction accuracy. The tumor prediction index of the invention has good practicability and wide application prospects.
Drawings
FIG. 1 is a schematic diagram of the structure of the ligation product.
FIG. 2 shows the consistency of the genome-wide fragment distribution feature between digested and non-digested libraries. The horizontal and vertical axes are the whole-genome short-fragment distribution features of the non-digested library and the digested library, respectively.
FIG. 3 shows the consistency of the fragment distribution features of different genomic regions between the digested and non-digested libraries of one sample. The horizontal and vertical axes are the short-fragment distribution features of the different genomic regions in the non-digested library and the digested library, respectively.
FIG. 4 is a ROC curve for liver cancer prediction based on low-depth whole genome features.
FIG. 5 shows the pan-cancer prediction results based on the sum of all low-depth WGS features. Negative indicates non-tumor patients.
FIG. 6 is a ROC curve for colon cancer prediction based on the sum of all low-depth WGS features.
FIG. 7 is a ROC curve for breast cancer prediction based on the sum of all low-depth WGS features.
FIG. 8 is a ROC curve for gastric cancer prediction based on the sum of all low-depth WGS features.
FIG. 9 is a ROC curve for pan-cancer prediction based on the sum of all low-depth WGS features.
Detailed Description
The present invention is described in further detail below with reference to specific embodiments, which are given for the purpose of illustration only and are not intended to limit the scope of the invention. The experimental procedures in the following examples are conventional unless otherwise specified. Materials, reagents, instruments and the like used in the following examples are commercially available unless otherwise specified. All quantitative tests in the following examples were performed in triplicate and the results averaged. In the following examples, unless otherwise specified, the first position of each nucleotide sequence in the sequence listing is the 5'-terminal nucleotide of the corresponding DNA/RNA, and the last position is the 3'-terminal nucleotide of the corresponding DNA/RNA.
The examples illustrate the invention using restriction-enzyme-based low-depth sequencing, with cfDNA samples taken as an example.
Example 1 construction of Whole genome MC library
First, methylation-sensitive restriction enzyme digestion
cfDNA was extracted from the plasma sample to be tested using the Apostle MiniMax™ cell-free DNA enrichment and isolation kit (standard version) (Apostle, Cat #: A17622-50). 5-40 ng of cfDNA was taken, the reaction system was prepared according to Table 1, and digestion followed by enzyme inactivation was carried out in a PCR instrument (Bio-rad Thermal Cycler, T100) according to the program in Table 2 to obtain the digestion product (stored at 4 °C).
TABLE 1 Reaction system
[Table 1 appears as an image in the original publication (BDA0002547260050000081, BDA0002547260050000091).]
TABLE 2 Reaction program
Temperature    Time
37 °C          30 min
65 °C          20 min
4 °C           ∞ (hold)
Secondly, blunt-end repair and A-tailing of the digestion product
The digestion product obtained in step one was processed with the NEBNext® Ultra™ II DNA Library Prep Kit (cat. no. E7645): the reaction system was prepared according to Table 3, and end repair and 3'-end A-tailing were then carried out in a PCR instrument (Bio-rad Thermal Cycler, T100) according to the reaction program in Table 4 to obtain the reaction product (stored at 4 °C).
TABLE 3 Reaction system
Reagent                                                    Volume (μl)
Digestion product of step one                              40
(green) NEBNext Ultra II End Prep Reaction Buffer          5.6
NEBNext Ultra II End Prep Enzyme Mix                       2.4
Total volume                                               48
TABLE 4 Reaction program
Temperature    Time
20 °C          30 min
65 °C          30 min
4 °C           ∞ (hold)
Thirdly, adapter ligation of the reaction product
Using the NEBNext® Ultra™ II DNA Library Prep Kit (cat. no. E7645), the reaction system was prepared according to Table 5 and incubated at 20 °C for 30 min (Bio-rad Thermal Cycler, T100) to obtain the ligation product (stored at 4 °C).
TABLE 5 Reaction system
Reagent                                  Volume (μl)
Reaction product of step two             48
MC Adapter (25 μM)                       1.5
DNase/RNase-Free Water                   0.5
NEBNext Ultra II Ligation Master Mix     24
NEBNext Ligation Enhancer                0.8
Total volume                             74.8
The MC Adapter was prepared as follows:
The adapter sequence information is shown in Table 6.
Each single-stranded DNA molecule in Table 6 was dissolved in TE buffer and diluted to 100 μM. The two single-stranded DNA molecules of each group were mixed in equal volumes (50 μl each) and annealed (annealing program: 95 °C, 15 min; 25 °C, 2 h), giving 12 DNA solutions each containing one linker; the 12 solutions were then mixed in equal volumes to give the Adapter Mix (i.e., the MC Adapter).
The annealing program was run on a PCR instrument (Bio-rad Thermal Cycler, T100).
TABLE 6 Adapter sequence information
[Table 6 appears as an image in the original publication (BDA0002547260050000101).]
Notes on Table 6:
In Table 6, each upstream sequence (name containing "F") consists of: sequencing primer binding sequence + random tag + anchor sequence + T. Each downstream sequence (name containing "R") consists of: anchor sequence + sequencing primer binding sequence.
In Table 6, the 8 N's represent an 8 bp random tag, where N is A, C, T or G. In practice, the random tag may be 8-14 bp long.
The underlined 12 bp is the anchor sequence. The underlined parts of the upstream and downstream sequences of each set are reverse complementary, so that annealing joins the two strands to form a linker. The anchor sequence also serves as a built-in fixed-sequence label for marking the original template molecule. In practice, the anchor sequence may be 12-20 bp long, should contain no more than 3 consecutive identical bases, should not interact with other parts of the primer (e.g., by forming hairpins or dimers), should be base-balanced at each position across the 12 sets, and should differ between sets by more than 3 mismatched bases.
The T in bold at the end of the upstream sequence pairs with the "A" at the end of the original molecule, enabling TA ligation.
In the upstream sequence, positions 1 to 21 from the 5' end (from the Illumina TruSeq sequencing kit) are the sequencing primer binding sequence, of which positions 1 to 19 from the 5' end are the library amplification primer portion.
In the downstream sequence, the non-underlined part (from the Illumina Nextera sequencing kit) is the sequencing primer binding sequence, of which positions 1 to 22 from the 3' end are the part used for designing the library amplification primer.
Table 6 contains 12 sets of linkers in total, which can form 12 × 12 = 144 label combinations; together with the sequence information of the molecules themselves, this is sufficient to distinguish all molecules in the original sample. In practice the number of sets can be increased (at higher synthesis cost) or decreased (with a slightly weaker distinguishing effect) as appropriate.
The structure of the ligation product is shown in FIG. 1, where a is the linker portion, b and f are the library amplification primers, c is the 8 bp random tag (8N in Table 6), d is the 12 bp anchor sequence (underlined in Table 6), and e is the insert (cfDNA).
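The labeling capacity described in the notes on Table 6 can be checked with a short calculation. The following Python sketch is illustrative only: the constant names are hypothetical, the anchors are represented by integer IDs because the actual sequences are given only in Table 6, and it assumes that one linker set attaches independently at each end of a molecule.

```python
from itertools import product

N_ADAPTER_SETS = 12     # number of linker sets stated for Table 6
RANDOM_TAG_LENGTH = 8   # the "8N" random tag, each N being A, C, T or G

# Anchor labels are represented by hypothetical integer IDs 1..12.
anchor_ids = range(1, N_ADAPTER_SETS + 1)

# One anchor at each end of a molecule gives ordered pairs of anchors.
end_pair_labels = list(product(anchor_ids, repeat=2))

# Number of distinct 8 bp random tags over a 4-letter alphabet.
n_random_tags = 4 ** RANDOM_TAG_LENGTH

print(len(end_pair_labels))  # 144
print(n_random_tags)         # 65536
```

Increasing or decreasing `N_ADAPTER_SETS` shows how the number of anchor combinations scales with the number of synthesized sets.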
Fourthly, purification of the ligation product
112.2 μl of AMPure XP magnetic beads (Beckman, A63880) was added to the ligation product obtained in step three, mixed by vortexing, and left at room temperature for 10 min. The tube was placed on a magnetic rack until the solution cleared (about 10 min) and the supernatant was discarded; the beads were washed twice with 80% (v/v) ethanol, discarding the supernatant each time. After the ethanol had evaporated, 31 μl of DNase/RNase-Free Water was added for elution, mixed by vortexing, and left at room temperature for 10 min. After 5 min on the magnetic rack, all of the supernatant was transferred with a 10 μl multichannel pipette (in 3 aspirations) to a new 8-tube strip, giving the purified product, i.e., the MC library.
Fifthly, amplification and purification of whole genome library
1. 400ng of the MC library prepared in the fourth step was used to prepare a reaction system (KAPA Hyper Prep Kit, KK8505) according to Table 7, and PCR amplification (Bio-rad Thermal Cycler, T100) was carried out according to Table 8 to obtain a PCR amplification product (stored at 4 ℃).
TABLE 7 Reaction system
Component              Volume (μl)
HIFI (KAPA KK8505)     25
M_D** (10 μM)          5
Template               20
Total volume           50
In Table 7, M_D is an equimolar mixture of M_D i5 and M_D i7; M_D i5 and M_D i7 are single-stranded DNAs with the following sequences:
M_D i5:
5’-AATGATACGGCGACCACCGAGATCTACAC********ACACTCTTTCCCTACACGACGCTCT-3’;
M_D i7:5’-CAAGCAGAAGACGGCATACGAGAT********GTCTCGTGGGCTCGGAGATGTGTATAA-3’。
Here "********" marks the position of the index sequence. The index is 6-8 bp long and distinguishes the reads of different samples, which allows multiple samples to be pooled for sequencing.
TABLE 8 Reaction program
[Table 8 appears as an image in the original publication (BDA0002547260050000111).]
2. The PCR amplification products obtained in step 1 were mixed by vortexing, 10 μl was taken from each reaction, and every 33 reactions were pooled into one 1.5 ml centrifuge tube. 70-140 μl (1-2 volumes) of AMPure XP magnetic beads (Beckman, A63880) was added, mixed by vortexing, left at room temperature for 10 min, and placed on a magnetic rack for 5 min. Once the solution had cleared, the supernatant was discarded and the beads were washed twice with 200 μl of 80% (v/v) aqueous ethanol, discarding the supernatant each time. After the ethanol had evaporated, 100 μl of DNase/RNase-Free Water was added, mixed by vortexing, and left at room temperature for 10 min; after 5 min on the magnetic rack, the supernatant was collected as the product and stored at -20 °C.
3. The product obtained in step 2 is the sequencing library (i.e., the digested library), which can be used for low-depth whole genome sequencing on the Illumina HiSeq X platform. A sequencing data volume of 3-5 G per sample, with an average sequencing depth of 0.8X-1.5X, is sufficient for the subsequent analysis.
Methylation detection based on the specific endonuclease does not affect the detection of genome-sequence-related indexes. For 40 human cfDNA samples, digested and non-digested libraries were constructed separately, and the whole-genome short-fragment distribution feature of cfDNA was calculated for both libraries: the number of short-fragment molecules (100-150 bp)/the number of long-fragment molecules (151-220 bp). The genome-wide fragment distribution remained consistent between the digested and non-digested libraries (R² = 0.996, FIG. 2).
The non-digested library was constructed as follows: cfDNA was extracted from the plasma sample to be tested using the Apostle MiniMax™ cell-free DNA enrichment and isolation kit (standard version) (Apostle, Cat #: A17622-50), 5-40 ng of cfDNA was brought to 40 μl with water, and steps two, three, four and five were then performed directly (as for the digested library) to obtain the non-digested library.
In addition, for one of the cfDNA samples, the insert distribution (the insert being the amplified fragment with the adapter sequence removed) of each genomic region also remained well consistent between the digested and non-digested libraries (R² = 0.965, FIG. 3). For FIG. 3, the whole genome of one cfDNA sample was divided into several hundred regions of 5 Mb, and the short-fragment distribution feature of each 5 Mb region was calculated for the digested and non-digested libraries: the number of short-fragment molecules (100-150 bp)/the number of long-fragment molecules (151-220 bp). Each point in FIG. 3 represents the short-fragment distribution feature of one 5 Mb region.
The results show that the fragment distribution of each 5 Mb region remains consistent between digested and non-digested libraries, and that methylation detection based on the specific endonuclease does not affect the detection of other genome-sequence-related indexes.
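A consistency check of this kind can be sketched as follows. This is a minimal Python illustration that assumes R² is the squared Pearson correlation between paired short-to-long fragment ratios from the two libraries; the text does not state how R² was computed, and the function name and example values are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)

# Hypothetical short/long ratios for four 5 Mb regions in each library.
non_digested = [1.0, 2.0, 3.0, 4.0]
digested = [1.0, 3.0, 2.0, 4.0]
print(r_squared(non_digested, digested))  # 0.64
```

A value close to 1 indicates that the ratios measured in the digested library track those of the non-digested library.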
Example 2 Analysis method for tumor early screening indexes from low-depth whole genome sequencing
A digested library was prepared according to the method of Example 1 and sequenced at low depth to obtain sequencing data, which were then analyzed according to the following procedure.
First, data quality control and alignment
1. Adapter sequences in the sequencing reads were removed using Trimmomatic (v0.36), and the clean reads were aligned to the hg19 reference genome using BWA (v0.7.10).
2. Duplicate reads were removed, as were reads in regions of low alignment quality and in blacklisted regions (http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeMapability/), to obtain the deduplicated BAM file.
Secondly, classification of molecules
Based on the deduplicated BAM file, the cfDNA molecules of the sample to be tested were classified according to the sequencing result (the HhaI recognition sequence is GCGC):
1. methylated DNA molecules: DNA molecules that contain the sequence GCGC and were not digested by HhaI;
2. unmethylated DNA molecules: DNA molecules that contain the sequence GCGC and can be digested by HhaI;
3. DNA molecules in question: DNA molecules that do not contain the sequence GCGC.
The sum of the three is recorded as the total number of DNA molecules of the whole genome.
DNA molecules of 100-150 bp are designated short fragments, DNA molecules of 151-220 bp long fragments, and DNA molecules of <100 bp small fragments.
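The size classes defined above can be sketched in Python as follows. This is a minimal illustration rather than the applicant's implementation: the function names `size_class` and `tally` and the tuple layout are hypothetical, the methylation class of each molecule is assumed to have been assigned upstream, and molecules longer than 220 bp, which the text does not assign to a class, are labeled "other" here.

```python
from collections import Counter

def size_class(length_bp):
    """Map a fragment length (bp) to the size classes defined in the text."""
    if length_bp < 100:
        return "small"
    if length_bp <= 150:
        return "short"
    if length_bp <= 220:
        return "long"
    return "other"  # >220 bp: no class is defined in the text (assumption)

def tally(molecules):
    """Count molecules per (methylation class, size class).

    `molecules` is an iterable of (meth_class, length_bp) pairs, where
    meth_class is 'methylated', 'unmethylated' or 'in_question'.
    """
    counts = Counter()
    for meth_class, length_bp in molecules:
        counts[(meth_class, size_class(length_bp))] += 1
    return counts

# Hypothetical example molecules.
example = [("methylated", 120), ("methylated", 180), ("in_question", 95),
           ("unmethylated", 150), ("in_question", 230)]
counts = tally(example)
print(counts[("methylated", "short")])  # 1
```

Counts keyed this way can serve as the inputs to the ratio features defined in the following section.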
Thirdly, extraction of DNA methylation features
1. Genome-wide DNA methylation features: the number ratio of methylated DNA molecules of the whole genome, the length ratio of methylated DNA molecules of the whole genome, and the length ratio of DNA molecules in question of the whole genome.
Specifically, the number ratio of methylated DNA molecules of the whole genome is:
the number of methylated DNA molecules of the whole genome/the total number of DNA molecules of the whole genome;
the number of methylated DNA molecules of the whole genome/(the number of methylated DNA molecules of the whole genome + the number of unmethylated DNA molecules of the whole genome).
The length ratio of methylated DNA molecules of the whole genome is:
the ratio of the number of short-fragment methylated DNA molecules of the whole genome to the number of long-fragment methylated DNA molecules, i.e., the number of 100-150 bp methylated DNA molecules of the whole genome/the number of 151-220 bp methylated DNA molecules of the whole genome;
the ratio of the number of small-fragment methylated DNA molecules of the whole genome to the sum of the number of small-fragment methylated DNA molecules and the number of long-fragment methylated DNA molecules, i.e., the number of <100 bp methylated DNA molecules of the whole genome/(the number of <100 bp methylated DNA molecules of the whole genome + the number of 151-220 bp methylated DNA molecules of the whole genome).
The length ratio of DNA molecules in question of the whole genome is:
the ratio of the number of short-fragment DNA molecules in question of the whole genome to the number of long-fragment DNA molecules in question, i.e., the number of 100-150 bp DNA molecules in question of the whole genome/the number of 151-220 bp DNA molecules in question of the whole genome;
the ratio of the number of small-fragment DNA molecules in question of the whole genome to the sum of the number of small-fragment DNA molecules in question and the number of long-fragment DNA molecules in question, i.e., the number of <100 bp DNA molecules in question of the whole genome/(the number of <100 bp DNA molecules in question of the whole genome + the number of 151-220 bp DNA molecules in question of the whole genome).
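The genome-wide ratios listed above can be computed from molecule counts as in the following Python sketch. The dictionary layout (counts keyed by methylation class and size class), the feature names and the example numbers are hypothetical, and returning `None` for a zero denominator is an assumption made for this illustration.

```python
def safe_ratio(numerator, denominator):
    """Return numerator/denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None

def genome_wide_features(n):
    """Compute the six genome-wide ratios from counts.

    `n` maps (meth_class, size_class) -> molecule count (hypothetical layout).
    """
    def total(meth_class):
        return sum(v for (m, _), v in n.items() if m == meth_class)

    meth = total("methylated")
    unmeth = total("unmethylated")
    in_question = total("in_question")
    feats = {
        # Number ratios of methylated molecules.
        "meth_over_total": safe_ratio(meth, meth + unmeth + in_question),
        "meth_over_gcgc": safe_ratio(meth, meth + unmeth),
    }
    # Length ratios for methylated molecules and molecules in question.
    for cls in ("methylated", "in_question"):
        short = n.get((cls, "short"), 0)
        long_ = n.get((cls, "long"), 0)
        small = n.get((cls, "small"), 0)
        feats[cls + "_short_over_long"] = safe_ratio(short, long_)
        feats[cls + "_small_over_small_plus_long"] = safe_ratio(small, small + long_)
    return feats

# Hypothetical counts for one sample.
n = {("methylated", "short"): 30, ("methylated", "long"): 60,
     ("methylated", "small"): 10, ("unmethylated", "short"): 50,
     ("unmethylated", "long"): 50, ("in_question", "short"): 200,
     ("in_question", "long"): 400, ("in_question", "small"): 100}
feats = genome_wide_features(n)
print(feats["meth_over_gcgc"])  # 0.5
```

The same two length ratios apply unchanged when the counts are restricted to a CpG island, promoter or bin region.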
2. DNA methylation features within CpG island regions: the number ratio of methylated DNA molecules in the CpG island region and the length ratio of methylated DNA molecules in the CpG island region, where a CpG island region is a region 500-1000 bp long with a GC content above 50%.
Specifically, the number ratio of methylated DNA molecules in the CpG island region is:
the number of methylated DNA molecules in the CpG island region/the total number of DNA molecules of the whole genome;
the number of methylated DNA molecules in the CpG island region/(the number of methylated DNA molecules in the CpG island region + the number of unmethylated DNA molecules in the CpG island region);
the number of methylated DNA molecules in the CpG island region/the number of methylated DNA molecules of the whole genome.
The length ratio of methylated DNA molecules in the CpG island region is:
the ratio of the number of short-fragment methylated DNA molecules to the number of long-fragment methylated DNA molecules in the CpG island region, i.e., the number of 100-150 bp methylated DNA molecules in the CpG island region/the number of 151-220 bp methylated DNA molecules in the CpG island region;
the ratio of the number of small-fragment methylated DNA molecules in the CpG island region to the sum of the number of small-fragment methylated DNA molecules and the number of long-fragment methylated DNA molecules in the CpG island region, i.e., the number of <100 bp methylated DNA molecules in the CpG island region/(the number of <100 bp methylated DNA molecules in the CpG island region + the number of 151-220 bp methylated DNA molecules in the CpG island region).
3. DNA methylation features in the promoter region: the number ratio of methylated DNA molecules in the promoter region and the length ratio of methylated DNA molecules in the promoter region.
Specifically, the number ratio of methylated DNA molecules in the promoter region is:
the number of methylated DNA molecules in the promoter region/the total number of DNA molecules of the whole genome;
the number of methylated DNA molecules in the promoter region/(the number of methylated DNA molecules in the promoter region + the number of unmethylated DNA molecules in the promoter region);
the number of methylated DNA molecules in the promoter region/the number of methylated DNA molecules of the whole genome.
The length ratio of methylated DNA molecules in the promoter region is:
the ratio of the number of short-fragment methylated DNA molecules to the number of long-fragment methylated DNA molecules in the promoter region, i.e., the number of 100-150 bp methylated DNA molecules in the promoter region/the number of 151-220 bp methylated DNA molecules in the promoter region;
the ratio of the number of small-fragment methylated DNA molecules in the promoter region to the sum of the number of small-fragment methylated DNA molecules and the number of long-fragment methylated DNA molecules in the promoter region, i.e., the number of <100 bp methylated DNA molecules in the promoter region/(the number of <100 bp methylated DNA molecules in the promoter region + the number of 151-220 bp methylated DNA molecules in the promoter region).
4. DNA methylation features in the bin region are as follows:
the length ratio of DNA molecules in question in the bin region: the ratio of the number of short-fragment DNA molecules in question in the bin region to the number of long-fragment DNA molecules in question, i.e., the number of 100-150 bp DNA molecules in question in the bin region/the number of 151-220 bp DNA molecules in question in the bin region;
the ratio of the number of small-fragment DNA molecules in question in the bin region to the sum of the number of small-fragment DNA molecules in question and the number of long-fragment DNA molecules in question in the bin region, i.e., the number of <100 bp DNA molecules in question in the bin region/(the number of <100 bp DNA molecules in question in the bin region + the number of 151-220 bp DNA molecules in question in the bin region).
Here, bin regions are the segments into which the genome is divided. For example, the whole genome is divided into consecutive bins of 5 Mb; no bin spans a centromere, and bins at the edge of a centromere are allowed to be shorter than 5 Mb. For each bin, the two ratios above (the short-to-long ratio, and the ratio of small fragments to the sum of small and long fragments, of DNA molecules in question) are centered and standardized to a mean of 0 and a standard deviation of 1, and are denoted the bin-region length ratios of DNA molecules in question 5Mb-bin-1 and 5Mb-bin-2, respectively. They are calculated as follows:
5Mb-bin-1 = the number of 100-150 bp DNA molecules in question in a given bin region/the number of 151-220 bp DNA molecules in question in that bin region;
5Mb-bin-2 = the number of <100 bp DNA molecules in question in a given bin region/(the number of <100 bp DNA molecules in question in that bin region + the number of 151-220 bp DNA molecules in question in that bin region).
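The per-bin ratios and their standardization can be sketched in Python as follows. The text does not specify over which set of values the centering and scaling are performed; this illustration assumes they are applied across the bins of one sample and uses the population standard deviation. The function names and the example counts are hypothetical.

```python
from statistics import mean, pstdev

def bin_ratios(bin_counts):
    """Return the raw 5Mb-bin-1 and 5Mb-bin-2 ratios for each bin.

    `bin_counts` is a list of dicts holding the 'small', 'short' and 'long'
    counts of DNA molecules in question for each bin (hypothetical layout).
    """
    ratio_1 = [b["short"] / b["long"] for b in bin_counts]
    ratio_2 = [b["small"] / (b["small"] + b["long"]) for b in bin_counts]
    return ratio_1, ratio_2

def standardize(values):
    """Center to mean 0 and scale to standard deviation 1.

    Uses the population standard deviation (an assumption); the values
    must not all be identical, otherwise the scale factor is zero.
    """
    mu = mean(values)
    sd = pstdev(values)
    return [(v - mu) / sd for v in values]

# Hypothetical counts for three bins.
bins = [{"small": 10, "short": 50, "long": 100},
        {"small": 20, "short": 60, "long": 100},
        {"small": 30, "short": 70, "long": 100}]
raw_1, raw_2 = bin_ratios(bins)
z_1 = standardize(raw_1)
z_2 = standardize(raw_2)
print([round(z, 3) for z in z_1])
```

Each bin then contributes two standardized values to the feature vector of the sample.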
Fourthly, extraction of copy number variation features
CNV (copy number variation) detection at the chromosome-arm level and CNV detection of hotspot genes (e.g., CCND1, FGF19, MYC, TERT) were performed using software such as Readdepth, QDNAseq, or WisecondorX.
The CNV of each chromosome arm and the CNV of each hotspot gene are used as the copy number variation features.
Fifthly, establishing a model
A number of tumor and non-tumor samples are selected to construct a training set; DNA methylation features, copy number variation features and/or cfDNA fragment features are selected; a prediction model is established by a random forest algorithm; and early screening of tumor samples is realized through the prediction model.
The prediction model is established by the random forest algorithm as follows:
a training set comprising a number of tumor and non-tumor samples is constructed;
the relevant features of the training set are extracted, and a tumor early screening model is established from the quantitative value of each feature using the random forest method with 10-fold cross validation. The model is built with the "randomForest" package in the R language: a random forest model is first built from the training samples using the randomForest function (ntree = 1000, all other parameters at their defaults), and the model is then used to test the validation samples.
10-fold means that the data set is divided into ten parts; in turn, 9 parts are used as training data and 1 part as test data. Each round yields a corresponding accuracy (or error rate), and the average over the 10 rounds is used as the estimate of the accuracy of the algorithm. Generally, 10-fold cross validation is repeated several times (for example, 10 repetitions of 10-fold cross validation) and the results are averaged to estimate the accuracy of the algorithm.
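The 10-fold procedure can be sketched in Python as follows. This is only an illustration of the splitting and averaging logic: the examples build the model with the R randomForest package, whereas the sketch uses a simple threshold classifier as a stand-in. All function names and the synthetic data are hypothetical, and at least as many samples as folds are assumed.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and deal them into k folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(features, labels, fit, predict, k=10, seed=0):
    """Average accuracy over k folds, each fold serving once as test data."""
    accuracies = []
    for test_idx in k_fold_indices(len(labels), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        model = fit([features[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        correct = sum(predict(model, features[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k

# Stand-in classifier (not a random forest): midpoint of the class means.
def fit_threshold(xs, ys):
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def predict_threshold(threshold, x):
    return 1 if x > threshold else 0

# Synthetic, well-separated one-feature data: 20 non-tumor, 20 tumor.
xs = [0.1 * i for i in range(20)] + [5 + 0.1 * i for i in range(20)]
ys = [0] * 20 + [1] * 20
print(cross_validate(xs, ys, fit_threshold, predict_threshold))  # 1.0
```

Repeating the call with different `seed` values and averaging the results corresponds to the repeated 10-fold cross validation mentioned above.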
Sixthly, tumor detection
DNA is extracted from the sample to be tested, the relevant features are extracted and input into the prediction model, and the sample is classified as tumor or non-tumor.
Example 3 prediction of liver cancer based on DNA methylation characteristics
1. Samples
Plasma samples from 174 hepatocellular carcinoma (HCC) patients and 208 high-risk non-hepatocellular-carcinoma (nonHCC) patients (patients with hepatitis or cirrhosis).
Inclusion criteria for hepatocellular carcinoma patients: positive by B-mode ultrasound, positive by dynamic CT/MRI imaging and pathological examination, and diagnosed with hepatocellular carcinoma.
Inclusion criteria for non-hepatocellular-carcinoma patients: hepatitis or cirrhosis patients negative by B-mode ultrasound; healthy people were excluded.
2. Methods
For each sample, a digested library was prepared according to the method of Example 1, low-depth whole genome sequencing was performed to obtain sequencing data, and features were then extracted according to the method of Example 2. The selected features include:
(1) genome-wide DNA methylation features: the number ratio of methylated DNA molecules of the whole genome, the length ratio of methylated DNA molecules of the whole genome, and the length ratio of DNA molecules in question of the whole genome;
(2) DNA methylation features within CpG island regions: the number ratio of methylated DNA molecules in the CpG island region and the length ratio of methylated DNA molecules in the CpG island region;
(3) DNA methylation features in the promoter region: the number ratio of methylated DNA molecules in the promoter region and the length ratio of methylated DNA molecules in the promoter region;
(4) DNA molecule length features of genomic bin regions: the whole genome is divided into small units (bins) of 5 Mb, and the length ratios of DNA molecules in question, 5Mb-bin-1 and 5Mb-bin-2, are calculated for each bin region;
(5) CNV features: the CNV of each chromosome arm.
The above features were combined into four groups; a model was constructed from each group, and the performance of the four models was evaluated.
The feature combinations were as follows:
Low-depth whole genome sequencing (WGS) methylation feature set:
the number ratio of methylated DNA molecules of the whole genome and the length ratio of methylated DNA molecules of the whole genome;
the number ratio of methylated DNA molecules in the CpG island region and the length ratio of methylated DNA molecules in the CpG island region;
the number ratio of methylated DNA molecules in the promoter region and the length ratio of methylated DNA molecules in the promoter region.
Low-depth WGS insert distribution feature set:
the length ratio of DNA molecules in question of the whole genome;
DNA molecule length features of genomic bin regions: 5Mb-bin-1 and 5Mb-bin-2.
Low-depth WGS CNV feature set:
the CNV of each chromosome arm.
Sum of all low-depth WGS features:
the combination of the above three groups.
And (3) respectively establishing a tumor early screening model by using a random forest method and adopting 10-fold cross validation on the characteristics. The model is established by using a software package of 'randomForest' in the R language. Firstly, a random forest model is established based on a training sample by using a randomForest function (ntree is 1000, and the other parameters are default parameters), and then the random forest model is used for detecting a sample to be detected.
Detection method: the quantified feature values serve as the model input and the liver cancer / non-liver cancer label as the model output; the validation samples are tested with the constructed model.
3. Model prediction results
The random forest classifier built from the low-depth WGS methylation feature group alone predicted liver cancer samples more accurately (AUC 0.951, curve 2 in Fig. 4) than the classifiers built from the existing low-depth WGS insert distribution feature group (AUC 0.885, curve 3 in Fig. 4) and the low-depth WGS CNV feature group (AUC 0.751, curve 4 in Fig. 4). Combining all features into the "total low-depth WGS feature group" improved the prediction accuracy further (AUC 0.963, curve 1 in Fig. 4).
Among the 382 HCC/non-HCC samples, at a specificity of 90%, the low-depth whole-genome methylation features predicted liver cancer with a sensitivity of 87%, better than the WGS insert distribution features and the WGS CNV features (sensitivities of 78% and 55%, respectively); the combined index integrating the three feature types raised the sensitivity further to 90%.
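A sensitivity at a fixed specificity, as reported above, can be read off the model scores as in the following sketch. The function name, the convention that scores strictly above the threshold are called positive, and the toy scores are assumptions; the source does not describe how its operating point was chosen.

```python
def sensitivity_at_specificity(scores, labels, target_specificity=0.90):
    """Return (threshold, sensitivity) at the lowest threshold whose
    specificity reaches the target. Scores above the threshold are positive."""
    neg = [s for s, y in zip(scores, labels) if y == 0]
    pos = [s for s, y in zip(scores, labels) if y == 1]
    if not neg or not pos:
        raise ValueError("need at least one positive and one negative sample")
    for thr in sorted(set(scores)):
        # Specificity only grows as the threshold rises, so the first
        # threshold meeting the target also gives the highest sensitivity.
        specificity = sum(s <= thr for s in neg) / len(neg)
        if specificity >= target_specificity:
            sensitivity = sum(s > thr for s in pos) / len(pos)
            return thr, sensitivity
    raise ValueError("target specificity not reachable")


# Toy scores, invented for the example (10 non-cancer, 5 cancer).
neg_scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.70]
pos_scores = [0.30, 0.50, 0.60, 0.80, 0.90]
scores = neg_scores + pos_scores
labels = [0] * len(neg_scores) + [1] * len(pos_scores)

thr, sens = sensitivity_at_specificity(scores, labels, 0.90)
print(thr, sens)
```

With the toy data, nine of the ten non-cancer scores fall at or below 0.45, so that threshold gives 90% specificity, and four of the five cancer scores lie above it.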
In conclusion, low-depth whole-genome methylation features predict tumors more accurately than the existing indexes (fragment distribution and copy number variation), and combining the methylation features with these sequence-related indexes can further improve detection accuracy.
Example 4. Low-depth whole-genome methylation features are applicable across cancer types
1. Samples to be tested
Plasma samples from cancer patients (25 colon cancer, 33 breast cancer, 65 gastric cancer). The inclusion criteria were as follows. Colon cancer: positive for colon cancer on enteroscopy. Breast cancer: positive on breast ultrasound and pathological examination. Gastric cancer: a space-occupying lesion found by X-ray barium meal examination or endoscopy, and positive for gastric cancer on pathological examination.
Plasma samples from non-cancer subjects (60 healthy individuals). The inclusion criteria were: participation in a health examination with no abnormality on chest fluoroscopy, tumor marker testing or B-mode ultrasound.
2. Methods
For each sample to be tested, an enzyme digestion library was prepared and sequenced according to the method of Example 1, and the features of the sample were then extracted according to the method of Example 2; the features used were those of the "total low-depth WGS feature group" in Example 3. The methylation features include: the number ratio of whole-genome methylated DNA molecules, the length ratio of whole-genome methylated DNA molecules, and the length ratio of whole-genome DNA molecules of undetermined methylation status.
The tumor patients were assigned to the cancer sample group and the healthy individuals to the non-cancer sample group. With the above features as the training set, an early screening model was built and evaluated by the random forest method with 10-fold cross-validation. The analysis used the "randomForest" package in R. A random forest model was first fitted to the training samples with the randomForest function (ntree = 1000, all other parameters at their defaults), and the model was then used to test the validation samples.
3. Results
The prediction model based on the features of the "total low-depth WGS feature group" distinguished tumor patients (colon, breast and gastric cancer) from non-tumor subjects, with a significant difference between the two groups of samples (Fig. 5). The ability of the model to separate tumor from non-tumor samples differed slightly between cancer types: with the 25 colon cancer samples and the 60 healthy samples as test samples, the area under the ROC curve (AUC) was 0.95 (Fig. 6); with the 33 breast cancer samples and the 60 healthy samples, the AUC was 0.931 (Fig. 7); with the 65 gastric cancer samples and the 60 healthy samples, the AUC was 0.916 (Fig. 8).
When the patients with all three tumor types were pooled into a single cancer sample group (123 cases in total), the same prediction model gave an AUC of 0.927 (Fig. 9).
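The per-cancer-type and pooled AUC values above follow one pattern: each cancer group is compared against the same healthy controls. A minimal sketch of that calculation is given below, using the rank-based (Mann-Whitney) form of the AUC. The function names and the toy scores are assumptions and do not reproduce the source data.

```python
def roc_auc(pos_scores, neg_scores):
    """Probability that a random cancer sample scores above a random
    non-cancer sample; ties count as one half."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


def auc_by_cancer_type(scores_by_type, healthy_scores):
    """AUC of each cancer type against the shared healthy controls,
    plus the AUC of all cancer samples pooled together."""
    result = {name: roc_auc(s, healthy_scores) for name, s in scores_by_type.items()}
    pooled = [s for group in scores_by_type.values() for s in group]
    result["pooled"] = roc_auc(pooled, healthy_scores)
    return result


# Toy scores, invented for the example.
healthy = [0.7, 0.3, 0.2, 0.4]
by_type = {
    "colon": [0.9, 0.8, 0.4],
    "breast": [0.6, 0.5],
}
aucs = auc_by_cancer_type(by_type, healthy)
print(aucs)
```

Because the controls are shared, differences between the per-type AUC values reflect only how well each cancer type separates from the same healthy group.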
The results show that genome methylation features can serve as markers distinguishing cancer from non-cancer and are applicable to different tumor types.

Claims (10)

1. A tumor diagnosis kit based on methylation detection, comprising a feature detection reagent, wherein the feature detection reagent comprises a DNA methylation feature detection reagent, and the DNA methylation feature comprises: the ratio of the number of methylated DNA molecules in a region and/or the ratio of the length of methylated DNA molecules in a region;
the ratio of the number of methylated DNA molecules in the region comprises: the ratio of the number of methylated DNA molecules in a region to the total number of molecules of whole genomic DNA, and/or the ratio of the number of methylated DNA molecules in a region to the sum of the number of methylated DNA molecules in a region and the number of unmethylated DNA molecules in a region;
the length ratio of the methylated DNA molecules in the region comprises: the ratio of the number of short segment methylated DNA molecules to the number of long segment methylated DNA molecules in the region, and/or the ratio of the number of small segment methylated DNA molecules to the total number of small segment methylated DNA molecules and long segment methylated DNA molecules in the region.
2. The kit according to claim 1, wherein the DNA molecule is a cfDNA molecule or a DNA molecule fragment after whole genome disruption; the methylated DNA molecule is a DNA molecule containing a methylation site; the unmethylated DNA molecule is a DNA molecule that does not contain a methylation site.
3. The kit of claim 1 or 2, wherein the DNA methylation signature further comprises:
the length ratio of the DNA molecules in question in the region;
the length ratio of the DNA molecules in question in the region comprises: the ratio of the number of short-segment DNA molecules in question to the number of long-segment DNA molecules in question in the region, and/or the ratio of the number of small-segment DNA molecules in question to the total number of small-segment and long-segment DNA molecules in question in the region; the DNA molecule in question is a DNA molecule that cannot be judged to be methylated or unmethylated in the detection process.
4. The kit of any one of claims 1 to 3, wherein the short segment is a segment between S1 and S2, the long segment is a segment between (S2+1) and S3, and the small segment is a segment smaller than S1;
the S1 is 1-100bp, the S2 is 150-169bp, the S3 is 151-250bp, and preferably the S3 is 151-220 bp.
5. The kit of any one of claims 1 to 4, wherein the region comprises a whole genome and/or a specific region, wherein the specific region is one or more of a CpG island region, a promoter region, a tumor specific region and a bin region; the bin region is obtained by dividing a whole genome into a plurality of bins.
6. The kit of claim 5, wherein the DNA methylation features further comprise, for a specific region, the ratio of the number of methylated DNA molecules in that region to the number of methylated DNA molecules in the whole genome.
7. The kit of any one of claims 1-6, wherein the feature detection reagents further comprise CNV feature detection reagents, wherein the CNV features comprise CNV of chromosome arms and/or CNV of hot spot genes.
8. The kit according to any one of claims 1 to 7, further comprising a data processing system for converting information on each characteristic into judgment information on whether or not cancer is present.
9. Use of a kit according to any one of claims 1 to 8 in the manufacture of a product for the detection of cancer.
10. A tumor diagnosis system based on methylation detection, comprising a device A and a device B;
the device A is used for detecting the features of any one of claims 1 to 7 in a sample to be tested;
the device B converts the feature data into judgment information on whether the subject has cancer.
CN202010564746.0A 2020-06-19 2020-06-19 Tumor diagnosis kit based on methylation detection and application thereof Active CN113817822B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010564746.0A CN113817822B (en) 2020-06-19 2020-06-19 Tumor diagnosis kit based on methylation detection and application thereof

Publications (2)

Publication Number Publication Date
CN113817822A true CN113817822A (en) 2021-12-21
CN113817822B CN113817822B (en) 2024-02-13

Family

ID=78924590

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010564746.0A Active CN113817822B (en) 2020-06-19 2020-06-19 Tumor diagnosis kit based on methylation detection and application thereof

Country Status (1)

Country Link
CN (1) CN113817822B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005198533A (en) * 2004-01-14 2005-07-28 National Cancer Center-Japan Method for judging prognosis of mammalian neuroblastoma
CN101233240A (en) * 2004-03-26 2008-07-30 斯昆诺有限公司 Base specific cleavage of methylation-specific amplification products in combination with mass analysis
US20070037184A1 (en) * 2005-06-16 2007-02-15 Applera Corporation Methods and kits for evaluating dna methylation
US20190241979A1 (en) * 2012-09-20 2019-08-08 The Chinese University Of Hong Kong Non-invasive determination of methylome of tumor from plasma
CN107385039A (en) * 2017-07-27 2017-11-24 北京泛生子基因科技有限公司 A kind of reagent set and detection method for the horizontal detection of mankind's mgmt gene promoter methylation
CN109680060A (en) * 2017-10-17 2019-04-26 华东师范大学 Methylate marker and its application in diagnosing tumor, classification
CN109385464A (en) * 2018-07-27 2019-02-26 中山大学附属第六医院 A kind of DNA methylation detection kit and method
CN110358836A (en) * 2019-07-31 2019-10-22 中山大学附属第六医院 The combination of the reagent of screening or diagnosing tumour, kit and device
CN110904225A (en) * 2019-11-19 2020-03-24 中国医学科学院肿瘤医院 Combined marker for liver cancer detection and application thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YARUI MA ET AL.: "Methylation silencing of TGF-β receptor type II is involved in malignant transformation of esophageal squamous cell carcinoma", CLINICAL EPIGENETICS, vol. 12, no. 25, pages 1 - 12 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115662519A (en) * 2022-09-29 2023-01-31 昂凯生命科技(苏州)有限公司 cfDNA fragment feature combination and system for predicting cancer based on machine learning
CN115662519B (en) * 2022-09-29 2023-11-03 南京医科大学 cfDNA fragment characteristic combination and system for predicting cancer based on machine learning

Also Published As

Publication number Publication date
CN113817822B (en) 2024-02-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant