WO2022255944A2 - Method for detection and quantification of methylated DNA
- Publication number
- WO2022255944A2 (application PCT/SG2022/050367)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- dna
- enzyme
- sequence
- reverse primer
- random nucleotides
Classifications
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6806—Preparing nucleic acids for analysis, e.g. for polymerase chain reaction [PCR] assay
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6844—Nucleic acid amplification reactions
- C12Q1/6858—Allele-specific amplification
Definitions
- the present invention generally relates to the detection and quantification of nucleic acid.
- the present invention relates to the detection and quantification of methylated DNA.
- DNA methylation is the covalent transfer of a methyl group to the 5-carbon position of the DNA base cytosine.
- DNA methylation occurs at cytosines within a CpG site, i.e. a cytosine that immediately precedes a guanine base.
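The CpG definition above can be made concrete with a short sketch. The function name and the example sequence are hypothetical and chosen only for illustration; positions are 0-based and only one strand is scanned.

```python
def find_cpg_sites(seq: str) -> list[int]:
    """Return 0-based positions of every cytosine immediately followed by a guanine."""
    s = seq.upper()
    return [i for i in range(len(s) - 1) if s[i] == "C" and s[i + 1] == "G"]


# Hypothetical example sequence, not taken from the disclosure.
example = "ATCGGCGTACCG"
print(find_cpg_sites(example))
```

Each reported position is a candidate methylation site; a cytosine followed by any other base is treated as a non-CpG cytosine.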
- This epigenetic modification is regulated by DNA methyltransferases and is widely known as a repressive mark that plays a key role in transcriptional silencing.
- DNA methylation in promoter regions leads to a decrease in gene expression and is a common mechanism of silencing of tumor suppressor genes. DNA methylation can also result in the induction of mutations and decreased genomic stability. Spontaneous deamination of 5-methylcytosine forms thymine, thus generating a point mutation.
- Cancer cells have distinct and aberrant patterns of DNA methylation compared to normal cells, and often display large regions of global hypo-methylation across the genome and localized areas of hyper-methylation, which are usually located at islands or clusters of CpG sites in gene promoter regions. Differential patterns of methylation in cancer cells can be used to detect the presence of cancer, such as for cancer screening purposes or for monitoring disease progression and treatment response.
- PCR reactions are also limited in the number of CpG sites that can be assessed in a single reaction, which is typically about one to three per primer pair. Moreover, the conditions for primer design in these PCR reactions are rather stringent, as the primer should contain the target CpG site(s), as well as at least three to five thymines converted from unmethylated non-CpG cytosines, in order to ensure that only properly converted DNA will be amplified. These requirements of methyl-specific PCR reactions exclude the selection of targetable regions that do not fulfil the selection criteria.
- the present disclosure describes a methodology for the identification and quantification of methylated DNA for cancer screening and detection of early-stage (stage I/II) cancer, which is often undetectable by conventional screening methods, minimal residual disease following cancer surgery or therapy, and cancer relapse.
- the method of the present disclosure seeks to achieve high sensitivity and specificity for the detection of methylated DNA, high efficiency of DNA conversion with minimum fragmentation and loss in DNA yield, suppression of low-level errors due to sequencing, and minimal invasiveness.
- the present disclosure refers to a method of detecting methylated DNA pattern in DNA in a biological sample, comprising:
- step (b) purifying the converted DNA from step (a);
- each forward primer of the plurality of forward and reverse primer pairs comprises a barcode sequence on its 5’ end, wherein the barcode sequence of each forward primer is different from each other, wherein the target regions comprise one or more CpG sites;
- step (d) subjecting the tagged converted DNA from step (c) to a second PCR amplification with universal indexed primers to thereby create a sequencing library with components required for multiplex sequencing;
- the present disclosure refers to a kit for detecting methylated DNA pattern in DNA in a biological sample according to the method of the first aspect, comprising:
- each forward primer of the plurality of forward and reverse primer pairs comprises a barcode sequence on its 5’ end, wherein the barcode sequence of each forward primer is different from each other, wherein the target regions comprise one or more CpG sites;
- Fig. 1 illustrates the overall experimental workflow, from the conversion of DNA to sequencing.
- Fig. 2 illustrates an example of primer design for capturing converted DNA.
- the italicised sequences represent the adaptor sequences required for the second amplification with the universal indexed Illumina P5 and P7 primers, respectively.
- the underlined sequence represents the target-specific sequence.
- Y and R represent a degenerate base (C or T and A or G, respectively) following the IUB code.
- NNNNNNNNNN represents a random barcode sequence.
- the underlined bases indicate an 8 bp index barcode.
- each sample will be assigned a unique combination of forward and reverse indexes.
- Fig. 3 illustrates the expected sequencing library profile on a TapeStation.
- Fig. 4, comprising Figs. 4(a) and 4(b), illustrates examples of sequence alignment to the human hg19 genome for a single sample visualized using Integrated Genome Viewer (IGV), wherein Fig. 4(a) shows an amplicon designed to the plus strand of the genome, and Fig. 4(b) shows an amplicon designed to the minus strand of the genome.
- IGV Integrated Genome Viewer
- Fig. 5 illustrates the conversion efficiency of non-CpG cytosines to thymines. Samples with conversion <0.97 will be repeated.
- Fig. 6 illustrates examples of the correlation of CpG methylation within each amplicon.
- Fig. 8 shows examples of average amplicon methylation values across normal, breast, colorectal, lung and ovarian cancer samples.
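An average amplicon methylation value can be illustrated with a small sketch. The encoding is an assumption made here for illustration: each read covering an amplicon is a string with one character per CpG site, "M" for a methylated call and "U" for an unmethylated call. The function names and example reads are hypothetical.

```python
def cpg_methylation_fractions(read_calls: list[str]) -> list[float]:
    """Per-CpG fraction of reads called methylated ('M') within one amplicon."""
    n_sites = len(read_calls[0])
    if any(len(r) != n_sites for r in read_calls):
        raise ValueError("every read must cover the same CpG sites")
    return [sum(r[j] == "M" for r in read_calls) / len(read_calls)
            for j in range(n_sites)]


def amplicon_average(read_calls: list[str]) -> float:
    """Mean of the per-CpG methylation fractions for one amplicon."""
    fractions = cpg_methylation_fractions(read_calls)
    return sum(fractions) / len(fractions)


# Hypothetical reads over an amplicon with three CpG sites.
reads = ["MMU", "MUU", "MMM", "MUU"]
print(cpg_methylation_fractions(reads))
print(amplicon_average(reads))
```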
- Fig. 9 shows the sample distribution used for the training set and the best 3-fold cross-validation scores.
- Fig. 10 illustrates the N-gram method of detecting cfDNA methylation patterns.
- Fig. 11 illustrates the Skip-gram method of detecting cfDNA methylation patterns.
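The figure descriptions do not spell out how the N-gram and Skip-gram methods are applied to methylation data. The sketch below is therefore only one plausible reading, borrowed from how these terms are used in text processing: a read is assumed to be encoded as a string of per-CpG calls ("M" methylated, "U" unmethylated), an N-gram is a run of n consecutive calls, and a skip-gram is a pair of calls separated by a fixed number of intervening sites. All names are hypothetical.

```python
from collections import Counter


def ngrams(states: str, n: int) -> Counter:
    """Count every run of n consecutive methylation calls in one read."""
    return Counter(states[i:i + n] for i in range(len(states) - n + 1))


def skipgrams(states: str, skip: int) -> Counter:
    """Count pairs of calls separated by exactly `skip` intervening CpG sites."""
    return Counter(states[i] + states[i + skip + 1]
                   for i in range(len(states) - skip - 1))


# Hypothetical read with five CpG calls.
read = "MMUMM"
print(ngrams(read, 2))
print(skipgrams(read, 1))
```

Counts like these could serve as features for a downstream classifier, but whether the disclosure uses them this way is not stated in this section.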
- Fig. 12 shows the sensitivity performance of different prediction models, set at 95% specificity threshold of the training set.
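Reporting sensitivity at a fixed specificity can be sketched as follows. This is a minimal illustration under stated assumptions: each sample has a model score where higher means more cancer-like, the cutoff is chosen on normal (non-cancer) training samples so that at least 95% of them fall at or below it, and sensitivity is the fraction of cancer samples scoring above that cutoff. The function names and scores are hypothetical.

```python
import math


def threshold_at_specificity(normal_scores: list[float], specificity: float = 0.95) -> float:
    """Smallest observed cutoff leaving at least `specificity` of normals at or below it."""
    ranked = sorted(normal_scores)
    # Number of normal samples that must be called negative; the small epsilon
    # guards against floating-point error pushing ceil() one step too high.
    k = math.ceil(specificity * len(ranked) - 1e-9)
    return ranked[k - 1]


def sensitivity(cancer_scores: list[float], cutoff: float) -> float:
    """Fraction of cancer samples scoring strictly above the cutoff."""
    return sum(score > cutoff for score in cancer_scores) / len(cancer_scores)


# Hypothetical scores for illustration only.
normal = [i / 100 for i in range(1, 21)]   # 20 normal training samples
cancer = [0.05, 0.30, 0.50, 0.19, 0.80]
cutoff = threshold_at_specificity(normal)
print(cutoff, sensitivity(cancer, cutoff))
```

Fixing the cutoff on the training set and then comparing models at that same specificity keeps the sensitivity figures of different prediction models comparable.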
- the present disclosure describes a methodology for detecting methylated DNA pattern in DNA with high sensitivity and specificity, for the purpose of cancer screening and detection of early-stage (stage I/II) cancer, minimal residual disease following cancer surgery or therapy, and cancer relapse.
- the present disclosure refers to a method of detecting methylated DNA pattern in DNA in a biological sample, comprising:
- step (b) purifying the converted DNA from step (a);
- each forward primer of the plurality of forward and reverse primer pairs comprises a barcode sequence on its 5’ end, wherein the barcode sequence of each forward primer is different from each other, wherein the target regions comprise one or more CpG sites;
- step (d) subjecting the tagged converted DNA from step (c) to a second PCR amplification with universal indexed primers to thereby create a sequencing library with components required for multiplex sequencing;
- the un-methylated cytosine of the DNA is converted to uracil by deamination to thereby generate converted DNA, as disclosed in step (a) of the method of the first aspect.
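The effect of the conversion step on a sequence can be simulated with a short sketch. The function names, the example sequence and the set of methylated positions are hypothetical; the model simply assumes complete conversion, with every unmethylated cytosine deaminated to uracil and every methylated cytosine left intact, and uracil read as thymine after PCR amplification.

```python
def convert(seq: str, methylated_positions: set[int]) -> str:
    """Deaminate unmethylated C to U; cytosines at methylated positions stay C."""
    out = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated_positions:
            out.append("U")
        else:
            out.append(base)
    return "".join(out)


def after_pcr(converted: str) -> str:
    """During amplification, uracil is copied as thymine."""
    return converted.replace("U", "T")


# Hypothetical sequence with a methylated CpG cytosine at position 1.
converted = convert("ACGTCCGA", methylated_positions={1})
print(converted)
print(after_pcr(converted))
```

After sequencing, a C that survives at a CpG position is read as evidence of methylation, while a T at that position indicates an unmethylated cytosine.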
- the DNA is extracted from the biological sample before step (a).
- the DNA may be extracted using any method or kit known in the art.
- the DNA is extracted from the biological sample before step (a) using organic extraction methods, such as phenol/chloroform extraction.
- the DNA is extracted from the biological sample before step (a) using kits such as, but not limited to, QIAamp Circulating Nucleic Acid Kit (Qiagen), MagMAX Cell-Free DNA Isolation Kit (Applied Biosystems), Cell/Blood DNA Kit (CatchGene), Tissue DNA Kit (CatchGene) and DNeasy Blood and Tissue Kits (Qiagen).
- the extracted DNA is converted by the method disclosed herein, comprising:
- the first enzyme is a Ten-eleven translocation (TET) enzyme or an isoform thereof.
- TET enzyme is selected from the group consisting of TET1 enzyme or an isoform thereof, TET2 enzyme or an isoform thereof, and TET3 enzyme or an isoform thereof.
- the purification of DNA is performed using an agent such as paramagnetic beads.
- the paramagnetic beads are selected from the group consisting of AMPure XP beads, SPRI beads, and Dynabeads.
- the second enzyme is a cytidine deaminase or an enzyme with cytidine deaminase properties.
- the cytidine deaminase is selected from the group consisting of APOBEC enzyme, CDA, and activation-induced cytidine deaminase.
- the enzyme with cytidine deaminase properties is selected from the group consisting of M.SssI and M.HpaII.
- the extracted DNA is converted using sodium bisulfite.
- the DNA is not extracted from the biological sample before step (a), and is converted using direct conversion methods in which no DNA extraction is required.
- the un-methylated cytosine of the unextracted DNA is directly converted to uracil by deamination using bisulfite to thereby generate the converted DNA.
- the un-methylated cytosine of the unextracted DNA is directly converted to uracil using direct conversion kits selected from the group consisting of EpiTect Fast FFPE Bisulfite Kit, innuCONVERT Bisulfite All-In-One Kit, and Zymo EZ DNA Methylation-Direct Kit.
- the DNA used in the method of the first aspect is present in a biological sample.
- the biological sample containing the DNA is selected from the group consisting of a liquid sample, a tissue sample, and a cell sample.
- the liquid sample is a bodily fluid selected from the group consisting of blood, bone marrow, cerebrospinal fluid, peritoneal fluid, pleural fluid, lymph fluid, ascites, serous fluid, sputum, lacrimal fluid, stool, urine, saliva, ductal fluid from breast, gastric juice, and pancreatic juice.
- the bodily fluid is blood.
- the tissue sample may include, but is not limited to, a frozen tissue sample or a fixed tissue sample (such as a formalin-fixed tissue sample).
- the tissue sample or the cell sample may be any type of tissue or cell in the body.
- the tissue sample or cell sample may be a tissue or cell from bone, epithelium, cartilage, adipose tissue, nerves, muscle, connective tissue, esophagus, stomach, liver, gallbladder, pancreas, adrenal glands, bladder, large intestine, small intestine, kidneys, colon, thymus, spleen, brain, spinal cord, heart, lungs, eyes, cornea, skin, or islet tissue or organs.
- the cell sample may also be from blood, such as white blood cells and platelets.
- the cell sample may be cancer cells, stem cells, endothelial cells, or fat cells.
- the biological sample is obtained from a subject having and/or suspected of having a disease.
- the disease is cancer.
- the cancer is selected from the group consisting of leukemia, lymphoma, ovarian cancer, lung cancer, colorectal cancer, breast cancer, pancreatic cancer, prostate cancer, nasopharyngeal cancer, liver cancer, cholangiocarcinoma, esophageal cancer, urothelial cancer, and gastrointestinal cancer.
- the cancer is an early stage cancer.
- the cancer is a Stage I cancer.
- the cancer is a Stage II cancer. In another example, the cancer is a Stage III cancer. In another example, the cancer is a late stage cancer. In another example, the cancer is an original cancer. In another example, the cancer is a relapsed cancer.
- the cancer is relapsed if cancer cells are detected at, in the region of, or distant from the primary site of the tumour, about 1 week, about 2 weeks, about 3 weeks, about 1 month, about 2 months, about 3 months, about 4 months, about 5 months, about 6 months, about 7 months, about 8 months, about 9 months, about 10 months, about 11 months, about 1 year, about 2 years, about 3 years, about 4 years, about 5 years, about 6 years, about 7 years, about 8 years, about 9 years, or about 10 years after complete remission of the primary cancer.
- the disease is minimal residual disease of the primary cancer following curative surgery or therapy.
- MRD minimal residual disease
- the DNA is cell-free DNA (cfDNA).
- cfDNA refers to non-encapsulated DNA which is circulating in a liquid sample disclosed herein and not contained within cells.
- plasma cfDNA is derived from both normal (healthy, non-diseased) cells and tumor cells.
- the DNA is circulating tumor DNA (ctDNA).
- the cfDNA fragments from tumor cells are shorter than cfDNA fragments from normal cells.
- the differences in plasma cfDNA concentrations and cfDNA fragment lengths between individuals with and without cancer can be assayed as cancer-specific signals.
- the liquid sample is bodily fluids selected from the group consisting of blood, bone marrow, cerebral spinal fluid, peritoneal fluid, pleural fluid, lymph fluid, ascites, serous fluid, sputum, lacrimal fluid, stool, urine, saliva, ductal fluid from breast, gastric juice, and pancreatic juice.
- the DNA is encapsulated within tissues and/or cells.
- the tissue or cell may be any type of tissue or cell in the body.
- the tissue sample may include, but is not limited to frozen tissue sample, fixed tissue sample (such as formalin-fixed tissue sample).
- the tissue is from bone, epithelial, cartilage, adipose tissue, nerves, muscle, connective tissue, esophagus, stomach, liver, gallbladder, pancreas, adrenal glands, bladder, gallbladder, large intestine, small intestine, kidneys, liver, pancreas, colon, stomach, thymus, spleen, brain, spinal cord, heart, lungs, eyes, corneal, skin, or islet tissue or organs.
- the cell is from bone, epithelial, cartilage, adipose tissue, nerves, muscle, connective tissue, esophagus, stomach, liver, gallbladder, pancreas, adrenal glands, bladder, gallbladder, large intestine, small intestine, kidneys, liver, pancreas, colon, stomach, thymus, spleen, brain, spinal cord, heart, lungs, eyes, corneal, skin, or islet tissue or organs.
- the cell may be a cancer cell, a stem cell, an endothelial cell, or a fat cell.
- the cell is a blood cell.
- the blood cell may be a white blood cell, or a platelet.
- the amount of DNA used in the method disclosed herein is at least 5 ng. In another example, the amount of DNA used in the method disclosed herein is about 5 ng, or about 10 ng, or about 15 ng, or about 20 ng, or about 30 ng, or about 40 ng, or about 50 ng, or about 60 ng, or about 70 ng, or about 80 ng, or about 90 ng, or about 100 ng, or about 110 ng, or about 120 ng, or about 130 ng, or about 140 ng, or about 150 ng, or about 160 ng, or about 170 ng, or about 180 ng, or about 190 ng, or about 200 ng, or about 300 ng, or about 400 ng, or about 500 ng, or about 600 ng, or about 700 ng, or about 800 ng, or about 900 ng, or about 1000 ng, or at least 1000 ng.
- the converted DNA is then purified as disclosed in step (b) of the method of the first aspect, using an agent such as DNA purification beads.
- the DNA purification beads may be paramagnetic beads, such as AMPure XP beads, and SPRI beads.
- the converted and purified DNA is then tagged with a barcode sequence by performing a first PCR amplification using a plurality of forward and reverse primer pairs specific to the converted sequence of target regions, wherein each forward primer of the plurality of forward and reverse primer pairs comprises a barcode sequence on its 5’ end, wherein the barcode sequence of each forward primer is different from each other, wherein the target regions comprise CpG sites, as disclosed in step (c) of the method of the first aspect.
- the term “barcode sequence” is a commonly used term in the art of nucleic acid sequencing and used within the definition as known in the art.
- the term “barcode sequence” refers to the encoded molecules or barcodes that include variable amount of information within the nucleic acid sequence.
- the barcode sequence is a tag that can be read out using any of a variety of sequence identification techniques, for example, nucleic acid sequencing, probe hybridization based assay, and the like.
- the barcode sequence is used in the method as described herein to tag different converted DNA sequences of target regions of a sample, such that when the barcode sequence tags to the converted DNA sequences of target regions, each different converted DNA sequence of target region would then have a unique barcode sequence that is attached to it and read out with the converted DNA sequence of target region from the sample.
- the barcode sequence is an oligonucleotide comprising 10 to 16 random nucleotides, or 10 to 15 random nucleotides, or 10 to 13 random nucleotides, or 10 random nucleotides, or 11 random nucleotides, or 12 random nucleotides, or 13 random nucleotides, or 14 random nucleotides, or 15 random nucleotides, or 16 random nucleotides.
- the barcode sequence is an oligonucleotide comprising 10 random nucleotides.
- the barcode sequence may be defined as NNNNNNNNNN, which may have sequences such as, but not limited to, TAGCTAACGT, GCAAGGTCAA, ACCTGTGTAT and the like.
- the number of the forward and reverse primer pairs is at least 5. In another example, the number of the forward and reverse primer pairs is at least 10. In another example, the number of the forward and reverse primer pairs is at least 15. In another example, the number of the forward and reverse primer pairs is at least 20. In another example, the number of the forward and reverse primer pairs is at least 30. In another example, the number of the forward and reverse primer pairs is at least 40. In another example, the number of the forward and reverse primer pairs is at least 50. In another example, the number of the forward and reverse primer pairs is at least 60. In another example, the number of the forward and reverse primer pairs is at least 70. In another example, the number of the forward and reverse primer pairs is at least 80.
- the number of the forward and reverse primer pairs is at least 90. In another example, the number of the forward and reverse primer pairs is at least 100. In another example, the number of the forward and reverse primer pairs is at least 110. In another example, the number of the forward and reverse primer pairs is at least 120. In another example, the number of the forward and reverse primer pairs is at least 130. In another example, the number of the forward and reverse primer pairs is at least 140. In another example, the number of the forward and reverse primer pairs is at least 150. In another example, the number of the forward and reverse primer pairs is at least 160. In another example, the number of the forward and reverse primer pairs is at least 170. In another example, the number of the forward and reverse primer pairs is at least 180.
- the number of the forward and reverse primer pairs is at least 190. In another example, the number of the forward and reverse primer pairs is at least 200. In another example, the number of the forward and reverse primer pairs is 5. In another example, the number of the forward and reverse primer pairs is 22. In another example, the number of the forward and reverse primer pairs is 95. In another example, the number of the forward and reverse primer pairs is 159. In another example, there is no upper limit on the number of the forward and reverse primer pairs.
- the forward and reverse primer pairs comprise sequences as disclosed in Table 1.
Table 1. Sample primer sequences (159 pairs).
- the exemplified sequences disclosed in Table 1 show only the target-specific sequences of each primer. These sequences do not show the barcode sequence (for forward primers only) and the adaptor sequence required for the second amplification with universal indexed primers.
- the full sequence of each forward primer used in step (c) of the method of the first aspect contains the adaptor sequence, followed by the barcode sequence and then the target-specific sequence (the sequences disclosed in Table 1).
- Fig. 2 shows the full sequence of CLIP4_methyl_2F, which is one exemplary forward primer among the 159 primer pairs comprising the target-specific sequences in Table 1.
- each reverse primer used in step (c) of the method of the first aspect contains the adaptor sequence followed by the target-specific sequence (the sequences disclosed in Table 1).
- Fig. 2 shows the full sequence of CLIP4_methyl_2R, which is one exemplary reverse primer among the 159 primer pairs comprising the target-specific sequences in Table 1.
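The primer layout described above (adaptor, then barcode, then target-specific sequence for forward primers; adaptor then target-specific sequence for reverse primers) can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, and the adaptor and target-specific strings passed in would be placeholders, not the sequences of Fig. 2 or Table 1.

```python
import random

def build_forward_primer(adaptor, target_specific, barcode_len=10, rng=random):
    """Assemble adaptor + random barcode + target-specific sequence (5' to 3').

    Returns the full primer and the barcode that was drawn.
    """
    barcode = "".join(rng.choice("ACGT") for _ in range(barcode_len))
    return adaptor + barcode + target_specific, barcode

def build_reverse_primer(adaptor, target_specific):
    """Reverse primers carry no barcode: adaptor + target-specific sequence."""
    return adaptor + target_specific

# Placeholder sequences for illustration only (assumed, not from the source).
primer, barcode = build_forward_primer("ACACGACG", "GGTTYGTAGT", rng=random.Random(0))
print(len(barcode), primer.startswith("ACACGACG"), primer.endswith("GGTTYGTAGT"))
```

Passing a seeded `random.Random` instance makes the drawn barcode reproducible, which is convenient for testing; in a synthesized primer pool the barcode positions would simply be mixed bases.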
- the primer pair comprises degenerate bases.
- the forward primer in the primer pair comprises one or more degenerate bases, while the reverse primer in the primer pair has no degenerate base.
- the reverse primer in the primer pair comprises one or more degenerate bases, while the forward primer in the primer pair has no degenerate base.
- both the forward and reverse primers in the primer pair comprise one or more degenerate bases.
- degenerate primers are used when the primer landing site overlaps with a CpG site.
- a CpG site bound by the forward primer has a sequence of either CG (methylated) or TG (un-methylated).
- the degenerate base Y is used in forward primers to specify either a cytosine or thymine, thus allowing the primer to cover both un-methylated and methylated DNA.
- a CpG site bound by reverse primers has a sequence of either CA (un-methylated) or CG (methylated).
- the degenerate base R is used in reverse primers to specify either an adenine or guanine, thus allowing the primer to cover both un-methylated and methylated DNA.
- the degenerate base is selected from the group consisting of C, T, A and G.
- each primer of the primer pair comprises 1, 2, or 3 degenerate bases.
- each primer of the primer pair has one degenerate base.
- the primer pair does not comprise a degenerate base, i.e. has no degenerate base.
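To illustrate how a degenerate primer covers both methylation states, the following sketch expands the degenerate bases Y (C or T) and R (A or G) into the concrete sequences they represent. The example primer fragment is hypothetical and is not taken from Table 1.

```python
from itertools import product

# Degenerate codes named in the text: Y = C or T, R = A or G.
DEGENERATE = {"Y": "CT", "R": "AG"}

def expand_primer(primer: str) -> list[str]:
    """Return every concrete sequence a degenerate primer can represent."""
    choices = [DEGENERATE.get(base, base) for base in primer.upper()]
    return ["".join(combo) for combo in product(*choices)]

# Hypothetical forward-primer fragment overlapping one CpG site.
print(expand_primer("GGTTYGTAG"))  # ['GGTTCGTAG', 'GGTTTGTAG']
```

A primer with k degenerate bases expands to 2^k sequences, which is one reason to limit each primer to a small number of degenerate positions.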
- the target regions comprise CpG sites.
- CpG site refers to a cytosine that immediately precedes a guanine base.
- DNA methylation occurs at cytosines within a CpG site.
- each forward and reverse primer pair covers a target region which comprises at least 1 CpG site. In one example, each forward and reverse primer pair covers a target region which comprises at least 2 CpG sites. In one example, each forward and reverse primer pair covers a target region which comprises at least 3 CpG sites. In one example, each forward and reverse primer pair covers a target region which comprises at least 5 CpG sites. In one example, each forward and reverse primer pair covers a target region which comprises at least 8 CpG sites. In one example, each forward and reverse primer pair covers a target region which comprises at least 10 CpG sites. In one example, each forward and reverse primer pair covers a target region which comprises at least 15 CpG sites.
- each forward and reverse primer pair covers a target region which comprises at least 20 CpG sites. In one example, each forward and reverse primer pair covers a target region which comprises at least 25 CpG sites. In one example, each forward and reverse primer pair covers a target region which comprises at least 30 CpG sites. In one example, each forward and reverse primer pair covers a target region which comprises at least 35 CpG sites. In one example, each forward and reverse primer pair covers a target region which comprises at least 40 CpG sites. In one example, each forward and reverse primer pair covers a target region which comprises at least 50 CpG sites. In one example, each forward and reverse primer pair covers a target region which comprises at least 60 CpG sites.
- each forward and reverse primer pair covers a target region which comprises at least 70 CpG sites. In one example, each forward and reverse primer pair covers a target region which comprises at least 80 CpG sites. In one example, each forward and reverse primer pair covers a target region which comprises at least 90 CpG sites. In one example, each forward and reverse primer pair covers a target region which comprises at least 100 CpG sites. In another example, there is no upper limit on the number of CpG sites within the target region covered by each forward and reverse primer pair.
- the first PCR amplification comprises a number of PCR cycles selected from the group consisting of 2, 3, 4 and 5 PCR cycles. In one example, the first PCR amplification comprises 2 PCR cycles. In one example, the first PCR amplification comprises 3 PCR cycles. In one example, the first PCR amplification comprises 4 PCR cycles. In one example, the first PCR amplification comprises 5 PCR cycles. As each forward primer carries on its 5’ end a randomly assigned barcode sequence as disclosed herein, the first PCR amplification allows individual DNA molecules to be tagged uniquely in this first step of sequencing library formation.
- a second PCR amplification is performed with universal indexed primers as disclosed in step (d) of the method of the first aspect, to create a sequencing library with components required for multiplex sequencing on a next-generation sequencing platform selected from the group consisting of Illumina platform, Ion Torrent sequencing technology, MGI sequencing platform, Oxford Nanopore sequencing, PacBio SMRT sequencing and 10X Genomics platform, as disclosed in step (e) of the method of the first aspect.
- the universal indexed primers used in step (d) of the method of the first aspect are shown in Fig. 2, which comprise: a forward primer comprising the sequence of
- the above exemplary sequences of the universal indexed primers used in step (d) of the method of the first aspect are the Indexed Illumina primers.
- the underlined index barcodes are 8 bp barcode sequences that are specified by Illumina.
- the underlined part can vary for different samples. Each sample within each sequencing run will have a unique combination of forward and reverse indexes.
- the underlined index barcodes have the sequences provided by Illumina for next-generation sequencing on the Illumina platform, and may be any sequence listed in the “Illumina Adapter Sequences” handbook, February 2019 version (https://dnatech.genomecenter.ucdavis.edu/wp-content/uploads/2019/03/illumina-adapter-sequences-2019-1000000002694-10).
- Exemplary index barcodes for forward primers that may be used are listed in the column “i5 Bases for Sample Sheet iSeq, MiniSeq, NextSeq, HiSeq 3000/4000”, for example CTAGCGCT and TCGATATC.
- Exemplary index barcodes for reverse primers that may be used are listed in the column “i7 Bases in Adapter”, for example AACCGCGG and GGTTATAA.
- step (f) of the method of the first aspect comprising:
- the assignment of DNA sequence to an original parental DNA molecule refers to the cluster reassignment of sequencing reads with the same barcode sequence. This generates barcode clusters wherein each cluster contains reads from the same amplicon and with the same barcode sequence. Consensus calling is performed for each barcode cluster to obtain the consensus reads. These consensus reads are the DNA sequence that is subsequently compared to the reference genome for variations to be detected.
- the initial step of cluster reassignment and generation of barcode clusters is important because it greatly reduces sequencing errors and improves confidence for accurate variant calling.
- step (f) of the method of the first aspect refers to the following process: Barcode sequences are extracted and clustered in two steps: 1. Initial grouping by exact match of the combination of amplicon_name + barcode sequence; and 2. Cluster reassignment: within each group sharing the same amplicon_name, barcodes are further reassigned using global pairwise alignment with a maximum of 2 base differences between barcodes. Barcode clusters with fewer than 3 associated reads (after cluster reassignment) are considered unreliable and removed from downstream analysis.
- Next, the methylated DNA pattern of the DNA is reconstructed as disclosed in step (g) of the method of the first aspect.
- the consensus DNA sequence is compared to a reference genome using a sequence alignment tool and variant analysis of the DNA sequence is conducted by comparing the consensus reads to the reference genome to detect the variations.
- the term “reference genome” refers to DNA sequences known in the art that may be obtainable from public databases. Exemplary bioinformatics analysis methods for reconstructing the methylated DNA pattern include bwa-meth, Bismark, MethylDackel, bisulfite-treated reads analysis tools (BRAT), methyQA, mrsFAST, BSMAP, VerJInxer, RMAP-bs, MethylCoder, BS-seeker2, and Bison.
- Steps (a) to (g) of the method of the first aspect as described above thereby enable assessment of: 1) the conversion efficiency of non-CpG cytosines to thymine as quality control, and 2) the frequency of methylation at each CpG cytosine across all consensus sequencing reads with distinct barcode sequences (i.e. reads that are derived from unique DNA molecules corresponding to a specific amplicon).
- the methods as disclosed herein may be used to detect mutations or polymorphisms at CpG sites.
- methylation of a CpG site is defined as concordance of the CG sequence for a CpG site, regardless of whether the site is on the plus or minus strand.
- non-methylation of a CpG site is defined as:
- Variations at CpG guanines will also be flagged during this process due to the unexpected occurrence of a non-G base that disrupts the CpG site.
- the allele frequency of this variation can be determined by its frequency across all consensus sequencing reads with distinct barcode sequences.
- the method as described in the first aspect further comprises the following steps:
- the method as described in the first aspect further comprises the step of analyzing methylated DNA pattern prior to performing step (h).
- Natural Language Processing, N-gram and Skip-gram are used for analyzing methylated DNA pattern.
- N-gram may be used to capture methylation pattern-specific information and generate new features that can be further analyzed.
- the generated new features can be used as data input for further statistical modelling techniques, such as those in step (h) and/or (i).
- the statistical modelling technique is logistic regression.
- Skip-gram may be used to determine patterns between initially unrelated or non-adjacent CpG sites by skipping N number of sites between 2 sites within an amplicon.
- the determined patterns can be used as data input for further statistical modelling techniques, such as those in step (h) and/or (i).
- the statistical modelling technique is logistic regression.
- the utilities of methylation frequency and methylation patterns derived from N-gram and Skip-gram may be used to detect cancer.
- the cancer is lung cancer.
- the statistical modelling technique is selected from the group consisting of logistic regression, tree based classifiers and deep neural networks.
- the present disclosure refers to a kit for detecting methylated DNA pattern in DNA in a biological sample according to the method of the first aspect, comprising:
- each forward primer of the plurality of forward and reverse primer pairs comprises a barcode sequence on its 5’ end, wherein the barcode sequence of each forward primer is different from each other, wherein the target regions comprise one or more CpG sites;
- the first enzyme, the second enzyme, the plurality of forward and reverse primer pairs, the barcode sequence, the CpG sites, and the plurality of universal indexed primers are disclosed herein.
- the first DNA polymerase is selected from the group consisting of Phusion U Hot Start DNA Polymerase (Thermo Scientific), ZymoTaq DNA Polymerase (Zymo Research) and Q5U Hot Start High-Fidelity DNA Polymerase (NEB).
- the reagent capable of removing excess primers is selected from the group consisting of paramagnetic beads and single-strand exonucleases. Exemplary paramagnetic beads include AMPure XP beads, SPRI beads, and Dynabeads.
- the second DNA polymerase is selected from the group consisting of KAPA HiFi DNA Polymerase (Roche), Platinum Taq DNA Polymerase or Platinum SuperFi DNA Polymerase (Invitrogen) and Q5 High-Fidelity DNA Polymerase (NEB).
- a primer includes a plurality of primers, including mixtures and combinations thereof.
- the terms “increase” and “decrease” refer to the relative alteration of a chosen trait or characteristic in a subset of a population in comparison to the same trait or characteristic as present in the whole population. An increase thus indicates a change on a positive scale, whereas a decrease indicates a change on a negative scale.
- the term “change”, as used herein, also refers to the difference between a chosen trait or characteristic of an isolated population subset in comparison to the same trait or characteristic in the population as a whole. However, this term is without valuation of the difference seen.
- the term “about” in the context of concentration of a substance, size of a substance, length of time, or other stated values means +/- 5% of the stated value, or +/- 4% of the stated value, or +/- 3% of the stated value, or +/- 2% of the stated value, or +/- 1% of the stated value, or +/- 0.5% of the stated value.
- range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the disclosed ranges. Accordingly, the description of a range should be considered to have specifically disclosed all the possible sub-ranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed sub-ranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.
- Plasma cell-free DNA was extracted using the QIAamp Circulating Nucleic Acid Kit (Qiagen). To convert all un-methylated cytosines in the genome to uracils while preserving methylated cytosines, the plasma cfDNA was subjected to enzymatic conversion using the NEBNext Enzymatic Methyl-Seq Conversion Module (New England BioLabs). Briefly, DNA was treated with the TET2 enzyme that oxidizes 5-methylcytosine and 5-hydroxymethylcytosine, protecting these bases from deamination by APOBEC in the next step.
- the TET2-treated DNA was purified using AMPure XP beads (Beckman Coulter) prior to the addition of APOBEC enzyme which deaminates un-methylated cytosines to uracils.
- purification using AMPure XP beads generated single-stranded DNA that is similar to that of sodium-bisulfite-converted DNA.
- a multiplex amplicon-based next generation sequencing (NGS) platform was developed to capture and sequence targeted regions of the converted genome. These regions were selected based on literature review of known methylated regions in specific cancers and from analyses of methylation data from normal and tumor tissues in the Cancer Genome Atlas (TCGA) database. Each amplicon covers at least 1 CpG site. In initial validation experiments, primers for 22 amplicons were designed and the panel has since been increased to >100 amplicons (159 amplicons). The design of the assay is intended to be scalable to include multiple targets for the specific identification of multiple cancers.
- Each forward primer additionally includes on the 5’ end, a random 10 nucleotide sequence to serve as barcode sequence for the identification of unique DNA molecules.
- degeneracy was incorporated for the primer designs to enable the capture of both un-methylated and methylated CpGs.
- a combinatorial amplicon-based NGS assay targeting hotspot mutations in 32 genes that are commonly mutated in lung cancer was developed to complement the multiplex amplicon-based NGS platform described above, to improve the sensitivity of cancer detection.
- Said combinatorial amplicon panel incorporates molecular barcode sequences for error suppression and improved coverage, enabling 100% specificity and 100% detection sensitivity at 1% and 5% VAF for single nucleotide variants (SNVs) and insertions/deletions (indels) and 89% detection sensitivity at 0.1% VAF using HD780 (Horizon Discovery) reference standards.
- the design of the panel incorporates tiled amplicons that can generate longer or shorter amplicons, thus enabling the profiling of the size distribution of cfDNA fragments.
- the combinatorial amplicon panel can detect cfDNA methylation.
- the combinatorial amplicon panel can detect cfDNA concentration.
- the combinatorial amplicon panel can detect ctDNA fragmentation profile.
- the combinatorial amplicon panel can detect cfDNA methylation, cfDNA concentration and ctDNA fragmentation profile or any combinations thereof.
- the amount of converted cfDNA used for library preparation varied slightly depending on the amount used for enzymatic conversion, but typically represented 5-10 ng starting amount of cfDNA prior to conversion.
- both forward and reverse primers were combined in a single reaction using Phusion U Hot Start DNA Polymerase (Thermo Fisher Scientific) under the following thermocycling conditions: Denaturation at 98°C for 30s, followed by 3 or 4 cycles of 98°C for 10s, 55-57°C for 6 min, and 72°C for 5 min (3 cycles with 55°C for the 22-amplicon panel, 4 cycles with 56°C or 57°C for larger panels).
- a final amplification was performed to amplify the targets and to complete the library with indexed sequencing adaptors for sequencing on the Illumina platform.
- purified product was amplified with indexed P5 adapter sequence and indexed P7 adapter sequence using KAPA HiFi HotStart ReadyMix (Roche) under the following thermocycling conditions: Denaturation at 98°C for 45 s, followed by 19 to 21 cycles of 98°C for 15 s, 60°C for 30 s, and 72°C for 30 s, with a final extension at 72°C for 1 min.
- the amplified library was purified twice with 0.8x then 0.7x AMPure XP beads to remove non-specific products.
- the quality and quantity of the sequencing library were assessed using the 4200 TapeStation system (Agilent Technologies, USA) and KAPA Library Quantification Kit for Illumina® Platforms (Roche), respectively. Paired-end sequencing (2 x 151 bp) of the final dual-indexed libraries was performed on the Illumina platform as per manufacturer’s instructions.
- FASTQ files were processed using a custom pipeline. First, expected amplicons were identified and labelled in the FASTQ files based on the expected primer sequences in Read 1 and paired Read 2. For amplicons with degenerate primers, data from each pair of degenerate primers were aggregated and assigned to the same amplicon based on the expected primer sequences. Primer sequences and upstream barcode sequences were trimmed using cutadapt. Primer-trimmed sequences were mapped to the Homo sapiens GRCh37 (hg19) reference genome using bwa-meth, which is specifically designed for the alignment of bisulfite-converted sequences.
- the name of the primer which has the best match to a read is concatenated to the name of the mapped output reads (for both read 1 and read 2).
- the primer name assigned to read 1 may not always match that of read 2, which can be due to non-specific binding.
- An “amplicon name” is assigned to each paired read by combining the matching primer name of read 1 and read 2 (concatenated by semicolon).
- Molecular tag (or barcode) sequences were included in the trimmed “primer” sequences of read 1, and can be extracted given the unique structure of primer sequences in read 1. The extracted molecular tag sequences were clustered in two steps: 1. Initial grouping by exact match of the combination of amplicon_name + barcode sequence; and 2. Cluster reassignment: within each group sharing the same amplicon_name, barcodes were further reassigned using global pairwise alignment with a maximum of 2 base differences between barcodes. Barcode clusters with fewer than 3 associated reads (after cluster reassignment) were considered unreliable and removed from downstream analysis.
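The two-step barcode clustering can be sketched as below. This is a simplified illustration rather than the custom pipeline itself: it assumes equal-length barcodes and substitutes a Hamming distance for the global pairwise alignment named in the text, and it folds less abundant barcodes into more abundant ones, which is an assumed reassignment order. The function names and the example amplicon name are hypothetical.

```python
from collections import Counter, defaultdict

def hamming(a: str, b: str) -> int:
    """Count mismatches between two equal-length barcodes."""
    return sum(x != y for x, y in zip(a, b))

def cluster_barcodes(reads, max_diff=2, min_reads=3):
    """reads: iterable of (amplicon_name, barcode) tuples, one per read pair.

    Returns {(amplicon_name, representative_barcode): read_count}.
    """
    # Step 1: initial grouping by exact match of amplicon_name + barcode.
    exact = Counter(reads)

    by_amplicon = defaultdict(list)
    for (amplicon, barcode), count in exact.items():
        by_amplicon[amplicon].append((barcode, count))

    # Step 2: within each amplicon, fold each barcode into the most abundant
    # already-accepted barcode that differs by at most max_diff bases.
    clusters = {}
    for amplicon, items in by_amplicon.items():
        centers = []  # representatives, most abundant first
        for barcode, count in sorted(items, key=lambda bc: (-bc[1], bc[0])):
            match = next(
                (c for c in centers
                 if len(c) == len(barcode) and hamming(c, barcode) <= max_diff),
                None,
            )
            if match is None:
                centers.append(barcode)
                clusters[(amplicon, barcode)] = count
            else:
                clusters[(amplicon, match)] += count

    # Step 3: discard clusters with fewer than min_reads associated reads.
    return {key: n for key, n in clusters.items() if n >= min_reads}

amp = "CLIP4_methyl_2F;CLIP4_methyl_2R"
reads = [(amp, "TAGCTAACGT")] * 3 + [(amp, "TAGCTAACGA")] + [(amp, "GCAAGGTCAA")] * 2
print(cluster_barcodes(reads))
```

In this example the single read with barcode TAGCTAACGA (one mismatch) joins the TAGCTAACGT cluster, giving it 4 reads, while the GCAAGGTCAA cluster has only 2 reads and is removed.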
- Consensus calling was done for each molecular tag (or barcode) cluster, by first performing global alignment among all associated reads using MAFFT.
- the consensus base in each aligned position is called by determining the majority representative base type, the percentage of which was no less than an automatically determined threshold.
- the threshold is a function of the total number of reads for that barcode sequence. If no representative base can be called, the position is assigned N (as opposed to one of A, C, T, G).
- a new quality score was assigned to each position, which is either the 90th percentile of all the quality values from the representative base type in that position (if a consensus base is found), or the 10th percentile of all quality values in that position (if no consensus base is found).
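The consensus-calling rule can be sketched as follows, under several stated assumptions: the reads of a cluster are taken as already aligned and of equal length (the MAFFT step is omitted), the read-count-dependent threshold is replaced by a fixed `min_fraction` parameter because its functional form is not given, and a nearest-rank percentile is used. All names are hypothetical.

```python
import math
from collections import Counter

def percentile(values, pct):
    """Nearest-rank percentile (an assumed convention)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def consensus(reads, quals, min_fraction=0.6):
    """reads: equal-length aligned sequences from one barcode cluster.
    quals: per-base quality scores with the same shape as reads.
    min_fraction: stand-in for the automatically determined threshold.
    """
    seq, qual = [], []
    for pos in range(len(reads[0])):
        column = [r[pos] for r in reads]
        column_q = [q[pos] for q in quals]
        base, count = Counter(column).most_common(1)[0]
        if count / len(column) >= min_fraction:
            # Consensus found: 90th percentile of qualities supporting it.
            seq.append(base)
            support = [q for b, q in zip(column, column_q) if b == base]
            qual.append(percentile(support, 90))
        else:
            # No consensus: call N with the 10th percentile of all qualities.
            seq.append("N")
            qual.append(percentile(column_q, 10))
    return "".join(seq), qual

reads = ["ACGT", "ACGT", "ATGA"]
quals = [[30, 30, 30, 30], [20, 35, 30, 10], [25, 12, 30, 40]]
print(consensus(reads, quals))  # ('ACGT', [30, 35, 30, 30])
```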
- the consensus reads were then written to a new FASTQ file. With molecular barcoding, the sequencing is error-free and increases confidence of methylated/non-methylated calls due to the high quality of sequencing data.
- Conversion efficiency is defined as the average conversion fraction of non-CpG cytosines to thymines. Samples with amplicons with conversion efficiency <0.97 were repeated.
- a methylation fraction was calculated at each CpG position and mean methylation fraction of an amplicon is defined as the average methylation fraction of all the considered CpG cytosines.
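The two quantities defined above, conversion efficiency and per-CpG methylation fraction, can be computed from consensus reads as in the sketch below. It assumes a gap-free alignment in which every consensus read has the same length and orientation as the unconverted reference amplicon; the function name and the toy sequences are hypothetical.

```python
def methylation_summary(reference, consensus_reads):
    """reference: unconverted forward-strand amplicon sequence.
    consensus_reads: converted consensus reads aligned position-by-position
    to the reference (gap-free alignment is assumed for simplicity).
    """
    cpg_pos = [i for i in range(len(reference) - 1) if reference[i:i + 2] == "CG"]
    non_cpg_pos = [i for i, b in enumerate(reference)
                   if b == "C" and reference[i + 1:i + 2] != "G"]

    # Conversion efficiency: fraction of non-CpG cytosines read as T.
    converted = total = 0
    for read in consensus_reads:
        for i in non_cpg_pos:
            if read[i] in ("C", "T"):
                total += 1
                converted += read[i] == "T"
    efficiency = converted / total if total else float("nan")

    # Methylation fraction per CpG: C retained = methylated, T = un-methylated.
    fractions = {}
    for i in cpg_pos:
        calls = [read[i] for read in consensus_reads if read[i] in ("C", "T")]
        if calls:
            fractions[i] = calls.count("C") / len(calls)
    mean_fraction = (sum(fractions.values()) / len(fractions)
                     if fractions else float("nan"))
    return efficiency, fractions, mean_fraction

reference = "ACGTCAACGT"  # CpG cytosines at 0-based positions 1 and 7
reads = ["ACGTTAATGT", "ACGTTAACGT", "ATGTTAATGT", "ACGTTAATGT"]
print(methylation_summary(reference, reads))  # (1.0, {1: 0.75, 7: 0.25}, 0.5)
```

A sample whose efficiency falls below the 0.97 cut-off stated in the text would be flagged for repetition before any methylation fraction is interpreted.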
- the methylation pattern in DNA sequences can also contain information of their source.
- cancer-specific methylation patterns were evaluated via alternative approaches namely N-gram and Skip-gram.
- an N-gram model is a type of probabilistic language model for predicting the next item in a sequence in the form of a (n - 1) order Markov model.
- N-gram features such as bigram, trigram, quad-gram and pentagram combinations were constructed to capture methylation patterns in adjacent 2, 3, 4 or 5-CpG sites, respectively.
- the N-gram for each amplicon was normalized by taking the average of all the reads and then divided by the maximum number of N-grams that can be formed for the particular amplicon.
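One possible reading of the N-gram construction and normalization is sketched below. It assumes each consensus read has been reduced to a string with one character per CpG site ('M' for methylated, 'U' for un-methylated), and it interprets the normalization as the per-read average count divided by the number of N-gram windows available in the amplicon. Both the encoding and this interpretation are assumptions.

```python
from collections import Counter

def ngram_features(read_patterns, n):
    """read_patterns: one string per consensus read, one character per CpG
    site in amplicon order ('M' = methylated, 'U' = un-methylated).
    Returns {pattern: normalized frequency} over n adjacent CpG sites.
    """
    n_sites = len(read_patterns[0])
    max_ngrams = n_sites - n + 1  # windows available per read
    if max_ngrams < 1:
        return {}
    totals = Counter()
    for pattern in read_patterns:
        for start in range(max_ngrams):
            totals[pattern[start:start + n]] += 1
    # Average over reads, then divide by the number of possible windows.
    return {gram: count / len(read_patterns) / max_ngrams
            for gram, count in totals.items()}

features = ngram_features(["MMUM", "MMMM"], n=2)
print(sorted(features.items()))
```

With this normalization the feature values of an amplicon sum to 1 regardless of how many CpG sites it contains, so amplicons of different sizes become comparable.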
- grid search was performed to reduce the number of features that were then used for the training of a logistic regression model for cancer/non-cancer prediction.
- Skip-gram, another approach used in Natural Language Processing, was adopted to find patterns between initially unrelated or non-adjacent CpG sites by skipping N number of sites between 2 sites within an amplicon (Fig. 10). Similar to the N-gram approach, the Skip-gram for each amplicon was normalized to account for different numbers of CpG sites in different amplicons.
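A Skip-gram feature over CpG sites can be sketched in the same way, pairing each site with the site that lies a fixed number of positions further along. The 'M'/'U' read encoding and the normalization (per-read average divided by the number of available pairs) are assumptions carried over for illustration.

```python
from collections import Counter

def skipgram_features(read_patterns, skip):
    """Pair each CpG site i with site i + skip + 1, i.e. skip `skip` sites.

    read_patterns: one 'M'/'U' string per consensus read, one character per
    CpG site. Returns {two-character pattern: normalized frequency}.
    """
    n_sites = len(read_patterns[0])
    max_pairs = n_sites - skip - 1  # site pairs available per read
    if max_pairs < 1:
        return {}
    totals = Counter()
    for pattern in read_patterns:
        for i in range(max_pairs):
            totals[pattern[i] + pattern[i + skip + 1]] += 1
    return {pair: count / len(read_patterns) / max_pairs
            for pair, count in totals.items()}

print(skipgram_features(["MUMU", "MMMM"], skip=1))  # {'MM': 0.75, 'UU': 0.25}
```

Setting `skip=0` reduces this to the bigram case over adjacent sites.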
- Methylation frequency for each feature was log transformed using the formula np.log(x + 0.00001), where x is the methylation frequency, also known as the methylation beta-value. Highly correlated features were removed from the model by calculating the variance inflation factor (VIF) score for each feature to detect multicollinearity. Features with the highest VIF scores were then dropped iteratively until a maximum VIF score of <10 was reached.
- VIF variance inflation factor
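The log transform and the iterative VIF filter can be sketched with NumPy as follows. The VIF of feature j equals 1 / (1 - R_j^2), which is also the j-th diagonal entry of the inverse correlation matrix; using that identity is an implementation choice made here, and the function names are hypothetical. Perfectly collinear features make the correlation matrix singular and would need separate handling.

```python
import numpy as np


def log_transform(beta, pseudocount=0.00001):
    """Log-transform methylation beta-values; the pseudocount avoids log(0)."""
    return np.log(np.asarray(beta, dtype=float) + pseudocount)


def vif_scores(X):
    """VIF per column, taken from the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))


def drop_high_vif(X, names, threshold=10.0):
    """Drop the highest-VIF feature repeatedly until every VIF is below threshold."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        scores = vif_scores(X)
        worst = int(np.argmax(scores))
        if scores[worst] < threshold:
            break
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return X, names
```

The surviving columns would then be passed on to recursive feature elimination.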
- Recursive feature elimination was performed on the remaining features to determine the set of features for the best performance of the model. If a sample was healthy, its corresponding target array value was set to 0, while if a sample was cancerous, its corresponding target array value was set to 1.
- the scikit-learn (http://scikit-learn.org) LogisticRegression module was used to machine-learn parameters for an LR classifier using the log-transformed methylation signatures as features and the target array as the target values. The liblinear solver (http://www.csie.ntu.edu.tw/~cjlin/liblinear) implemented in scikit-learn was utilized for this process.
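A minimal sketch of such a classifier is given below. The feature matrix is synthetic and serves only to make the example runnable; the number of features, the sample sizes and all other hyperparameters (left at scikit-learn defaults) are assumptions, not values from the disclosure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for log-transformed methylation features (assumption):
# healthy samples have low beta-values, cancer samples higher ones.
rng = np.random.default_rng(0)
healthy = rng.normal(loc=-8.0, scale=1.0, size=(30, 4))
cancer = rng.normal(loc=-3.0, scale=1.0, size=(30, 4))
X = np.vstack([healthy, cancer])

# Target array: 0 for healthy samples, 1 for cancer samples.
y = np.array([0] * 30 + [1] * 30)

# Logistic regression fitted with the liblinear solver.
model = LogisticRegression(solver="liblinear")
model.fit(X, y)

# Predicted probability of the cancer class for each sample.
cancer_probability = model.predict_proba(X)[:, 1]
```

In practice the probabilities would be evaluated on held-out samples or by cross-validation rather than on the training data.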
- Plasma cfDNA methylation, cfDNA levels, cfDNA fragmentation profiles and ctDNA detection can each provide complementary information for enhanced accuracy in cancer detection.
- the combinatorial amplicon-based panel approach described herein combining the detection of cfDNA methylation, cfDNA concentration and ctDNA fragmentation profile can mitigate the limitations of individual approaches and improve the overall accuracy of cancer detection.
- a machine learning classifier prediction model that integrates these multiple classes of data generated from plasma cfDNA was used.
- cfDNA Biomarker features were first trained using the dataset of 60 healthy individuals and 39 early (stage I-III) and 56 late-stage lung cancer patients. Aggregate cfDNA Biomarker features encompass plasma cfDNA concentration, fragment size ratio and the ctDNA detection score, each of which is log normalized. The ctDNA detection score was determined by first classifying each variant to one of six classes based on evidence in public databases for the prevalence and pathogenesis of the variant in cancer.
- Each of the classes was assigned a score, with the highest score assigned to the highest class, and the ctDNA detection score of each sample was calculated by aggregating the score multiplied by allele frequency of each variant detected.
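The aggregation can be illustrated as a weighted sum. The per-class scores are not given in the text, so the values in `CLASS_SCORES` are purely hypothetical placeholders, and reading "aggregating" as summation is likewise an assumption.

```python
# Hypothetical scores for the six variant classes (higher class, higher score).
CLASS_SCORES = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6}


def ctdna_detection_score(variants):
    """Sum of class score x allele frequency over all variants detected in a sample.

    variants: iterable of (variant_class, allele_frequency) pairs.
    """
    return sum(CLASS_SCORES[variant_class] * allele_frequency
               for variant_class, allele_frequency in variants)
```

A sample with no detected variants receives a score of 0 under this reading.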
- the ctDNA detection score, plasma cfDNA concentration and cfDNA fragment size were incorporated into a single ‘Biomarkers’ logistic regression model.
- a Stacking Ensemble technique was adopted to merge the 3 (mAF + Biomarkers + N-gram or Skip-gram) models and generate a final prediction probability value for cancer.
- Plasma CEA levels are also commonly used in cancer screening and detection. Its utility is demonstrated in the combinatorial multi-omic approach by the assessment of plasma CEA levels. Plasma CEA levels, detected by the Beckman Access II immunoanalyzer, were higher in lung cancer samples compared to normal controls, giving a sensitivity of 46.15% and 73.21% for early and late-stage lung cancer detection, respectively, at a specificity of 95% (Fig. 12). When combined with a mAF, N-gram and Biomarkers Ensemble prediction model, the addition of CEA provided an additional diagnostic sensitivity of 5.2% and 3.6% in the detection of early and late-stage lung cancer, respectively.
- Feature selection was done using ANOVA F-Test via the f_classif() function from scikit-learn (https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html). F_Scores for all methylation sites across 4 different cancer categories were computed and ranked.
- Random Forest as implemented in the scikit-learn (http://scikit-learn.org) package’s RandomForestClassifier module was used, using the methylation signatures as features and cancer type as the target label. The default settings of the RandomForestClassifier were used. For robustness, five rounds of 3-fold CV were performed for each iteration of the model. The performance of the Random Forest Classifier seemed to plateau at around 23 features and these were selected as the final features for the model. For prediction, probability scores were calculated for each cancer type and the cancer type with the highest probability score was predicted as the likely cancer type for that particular sample.
- the present disclosure describes the methodology for the identification of methylated DNA for the detection of early stage cancer, minimal residual disease following cancer surgery or therapy, and cancer relapse, with high sensitivity and specificity, especially in situations where these diseases are undetectable by conventional screening methods.
- a blood-based test is used for the identification of methylated signatures in plasma cell-free DNA (cfDNA) that can indicate the presence of cancer and specify its tissue of origin (i.e. cancer type) before the development of overt symptoms.
- cfDNA plasma cell-free DNA
- the present disclosure uses enzymatic conversion as an alternative to conventional sodium bisulfite treatment to convert un-methylated cytosines to uracils.
- cfDNA was treated with TET2 enzyme that oxidizes 5-methylcytosine and 5-hydroxymethylcytosine, protecting these bases from deamination by APOBEC in the next step.
- the cfDNA was purified using AMPure XP beads prior to the addition of APOBEC enzyme which deaminates un-methylated cytosines to uracils.
- purification using AMPure XP beads generated single-stranded DNA that is similar to that of sodium-bisulfite-converted cfDNA, but typically obtained in higher recovery yields and with little fragmentation compared to bisulfite-converted DNA. As little as 5 ng starting amount of cfDNA has been successfully put through conversion, library preparation and sequencing in the present workflow.
- the converted cfDNA molecules were selectively enriched using a multiplicity of primers specific to the converted sequence of target regions in a single PCR reaction.
- the converted cfDNA was added to a PCR reaction containing more than 5 ‘forward’ and ‘reverse’ primer pairs and subject to 2, 3, 4 or 5 cycles of PCR in a first limited amplification reaction.
- this PCR allows individual cfDNA molecules to be tagged uniquely in this first step of sequencing library formation.
- the reactions were purified to remove excess primers.
- a final PCR amplification with universal indexed primers was done to create libraries with components required for multiplex sequencing on a next-generation sequencing platform such as Illumina.
- Each ‘forward’ and ‘reverse’ primer pair forms an amplicon that covers at least 1 CpG site.
- the primer designs incorporate degeneracy for the capture of both un-methylated and methylated CpGs and thus overcome methylation-related drop-off of coverage and capture.
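The role of the degenerate bases can be illustrated in silico. The sketch below writes the converted top strand of a target with `Y` (C or T) at CpG cytosines and `T` at non-CpG cytosines, and derives the reverse-primer orientation by reverse complementation, where `Y` pairs with `R` (A or G). It is a simplified illustration with hypothetical function names, not the primer design procedure of the disclosure; a cytosine at the very end of the input is treated as non-CpG because the following base is unknown.

```python
def converted_template(sequence):
    """Converted top strand: CpG cytosines become Y, other cytosines become T."""
    out = []
    for i, base in enumerate(sequence):
        if base == "C":
            out.append("Y" if sequence[i + 1:i + 2] == "G" else "T")
        else:
            out.append(base)
    return "".join(out)


# Complement table extended with the degenerate bases Y (C/T) and R (A/G).
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G", "Y": "R", "R": "Y"}


def reverse_complement(sequence):
    """Reverse complement, as needed for a primer that anneals to the converted strand."""
    return "".join(COMPLEMENT[base] for base in reversed(sequence))
```

A forward primer would be taken from `converted_template(...)` directly, while a reverse primer would be taken from `reverse_complement(converted_template(...))`, so a single degenerate primer pair matches both the methylated and the un-methylated form of the target.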
- the presence of a barcode sequence was detected using specialized Bioinformatics methods to count and assign each DNA sequence from high-throughput sequencing to an original parental DNA molecule, carrying the same tag.
- the parental DNA molecule is the original cfDNA molecule right after enzymatic conversion.
- the cfDNA methylation pattern of the biological sample was then reconstructed.
- the number of unique cfDNA molecules corresponding to targeted regions of the genome was enumerated.
- the specific DNA methylation pattern of each molecule was reconstructed by comparing to a reference genome using a sequence alignment tool (for example, bwa-meth) designed for the alignment of bisulfite-converted sequences. Variations of the samples’ genome sequence compared to this reference genome were detected by variant analysis. This allows for the assessment of 1) the conversion efficiency of non-CpG cytosines to thymine as quality control and 2) the frequency of methylation at each CpG cytosine across all consensus sequencing reads with distinct barcode sequences (i.e. reads that are derived from unique DNA molecules corresponding to a specific amplicon).
- the target regions to be analysed were selected based on externally validated regions and from genome-wide analyses of methylation data in the TCGA database. Even when using a relatively small panel of 22 targets, the method has a high sensitivity (>90%) and specificity (90%) for the detection of cancer. This panel has been expanded to include more targets (159) and can be further expanded. Increased target number greatly improved the sensitivity and specificity of the test. The combination of target regions, and their associated CpG sites that are covered by each primer pair, renders the present method novel.
- the method of the present disclosure may be used on a blood-based test (for example, to detect methylated DNA pattern in cfDNA in the blood) that is fast and non-invasive (only one draw of blood is needed).
- the method is scalable for the detection of multiple cancers in a single test and is suitable for cancer screening in an asymptomatic population.
- DNA methylation in cancer occurs predominantly in CpG islands within gene promoter regions and is thus more accessible to comprehensive profiling.
- DNA methylation typically occurs in a tissue-specific manner which increases the specificity of identifying the tissue of origin of the cancer. The frequency of methylation can be calculated which gives an indication of tumor load and can be used for disease monitoring.
- Degenerate primers are used to capture both methylated and un-methylated strands in regions that are CpG rich and would otherwise be inaccessible with regular primers for bisulfite sequencing.
- the initial multiplex PCR reaction is scalable and allows the capture of multiple genomic regions for the identification of several cancer types in a single assay.
- a statistical model trained using methylation sequencing data from hundreds of known normal and cancer cfDNA enables accurate detection of cancer in independent samples.
- the technological significance lies in the generalizable use of primers for target capture, which allows working with smaller, limiting amounts of DNA, especially when enzymatic conversion is used instead of conventional sodium bisulfite treatment.
- the unique combination of targets is selected for the sensitive and specific detection of multiple cancers.
Abstract
Disclosed is a method of detecting methylated DNA pattern in DNA in a biological sample. Also disclosed is a kit for detecting methylated DNA pattern in DNA in a biological sample according to the method as disclosed herein.
Description
METHOD FOR DETECTION AND QUANTIFICATION OF METHYLATED DNA
FIELD OF THE INVENTION
[0001] The present invention generally relates to the detection and quantification of nucleic acid. In particular, the present invention relates to the detection and quantification of methylated DNA.
BACKGROUND
[0002] DNA methylation is the covalent transfer of a methyl group to the 5-carbon position of the DNA base cytosine. In vertebrates, DNA methylation occurs at cytosines within a CpG site, i.e. a cytosine that immediately precedes a guanine base. This epigenetic modification is regulated by DNA methyltransferases and is widely known as a repressive mark that plays a key role in transcriptional silencing.
[0003] In the context of cancer, DNA methylation in promoter regions leads to a decrease in gene expression and is a common mechanism of silencing of tumor suppressor genes. DNA methylation can also result in the induction of mutations and decreased genomic stability. Spontaneous deamination of cytosine forms thymine, thus generating a point mutation. Cancer cells have distinct and aberrant patterns of DNA methylation compared to normal cells, and often display large regions of global hypo-methylation across the genome and localized areas of hyper-methylation, which are usually located at islands or clusters of CpG sites in gene promoter regions. Differential patterns of methylation in cancer cells can be used to detect the presence of cancer, such as for cancer screening purposes or for monitoring disease progression and treatment response.
[0004] Conventional methods for cancer screening, early cancer detection and disease monitoring have various drawbacks. For example, existing cancer screening methods, such as blood tumor marker tests or CT scans, are often limited by their sensitivity or specificity. These methods subject the patient to unnecessary follow-up that can be invasive, expensive and stressful. Further, conventional cancer screening methods such as colonoscopy and pap smear are often time-consuming, invasive and only detect one type of cancer per test. In addition, late cancer diagnosis when the cancer has already metastasized leaves the patient ineligible for curative surgery and limits the patient’s effective therapeutic window and treatment options. Moreover, for cancer patients, disease monitoring by repeat tissue biopsy is infeasible and repeat imaging scans are usually only recommended every 3 months to minimize radiation exposure. Also, disease monitoring through the detection of mutations in tissue or liquid biopsies is limited in sensitivity because mutations can occur anywhere along the length of a gene, rendering the comprehensive identification of mutations technically challenging. Finally, cancer mutations are often not specific to a particular cancer type, making it difficult to identify the tissue of origin of the tumor.
[0005] In addition, conventional methods for multiplex detection of methylated DNA, for example, in plasma cell-free DNA (cfDNA), also face various challenges. Conventional treatment with sodium bisulfite to convert un-methylated cytosines to uracils is harsh and often leads to DNA fragmentation and poor yield. This sodium bisulfite conversion method requires high starting amount of DNA, which can be challenging especially in the case of plasma cfDNA from individuals with no or low tumor load. Further, sequencing errors limit the sensitivity of detection as signal is indistinguishable from technical noise. In addition, target capture of CpG-rich hyper-methylated regions in cancer often requires two sets of primers for the separate identification of un-methylated and methylated DNA in methyl-specific PCR reactions. These PCR reactions are also limited in the number of CpG sites that can be assessed in a single reaction, which is typically about one to three per primer pair. Moreover, the conditions for primer design in these PCR reactions are rather stringent, as the primer should contain the target CpG site(s), as well as at least three to five thymines converted from unmethylated non-CpG cytosines, in order to ensure that only properly converted DNA will be amplified. These requirements of methyl-specific PCR reactions exclude the selection of targetable regions that do not fulfil the selection criteria.
[0006] Thus, there is a need for a method to address the disadvantages of the conventional methods as described above. The present disclosure describes a methodology for the identification and quantification of methylated DNA for cancer screening and detection of early-stage (stage I/II) cancer which is often undetectable by conventional screening methods, minimal residual disease following cancer surgery or therapy, and cancer relapse. The method of the present disclosure seeks to achieve high sensitivity and specificity for the detection of methylated DNA, high efficiency of DNA conversion with minimum fragmentation and loss in DNA yield, suppression of low-level errors due to sequencing, and minimal invasiveness.
SUMMARY
[0007] In a first aspect, the present disclosure refers to a method of detecting methylated DNA pattern in DNA in a biological sample, comprising:
(a) converting un-methylated cytosine of the DNA to uracil by deamination to thereby generate converted DNA;
(b) purifying the converted DNA from step (a);
(c) tagging a barcode sequence on the converted DNA, by performing a first PCR amplification using a plurality of forward and reverse primer pairs specific to the converted sequence of target regions, wherein each forward primer of the plurality of forward and reverse primer pairs comprises a barcode sequence on its 5’ end, wherein the barcode sequence of each forward primer is different from each other, wherein the target regions comprise one or more CpG sites;
(d) subjecting the tagged converted DNA from step (c) to a second PCR amplification with universal indexed primers to thereby create a sequencing library with components required for multiplex sequencing;
(e) subjecting the sequencing library to multiplex sequencing on a next-generation sequencing platform;
(f) detecting the presence of a barcode sequence using Bioinformatics methods to count and assign each DNA sequence from the next-generation sequencing to an original parental DNA molecule carrying the same barcode sequence, comprising:
(i) performing cluster reassignment of sequencing reads with the same barcode sequence to thereby generate barcode clusters wherein each barcode cluster contains reads from the same amplicon and with the same barcode sequence; and
(ii) performing consensus calling for each barcode cluster to thereby obtain consensus reads;
(g) reconstructing the methylated DNA pattern of the DNA by
(I) comparing the DNA sequence to a reference genome using a sequence alignment tool; and
(II) conducting variant analysis of the DNA sequence by comparing the consensus reads to the reference genome to detect the variations; to thereby assess 1) the conversion efficiency of non-CpG cytosines to thymine as quality control, and 2) the frequency of methylation at each CpG cytosine across all consensus sequencing reads with distinct barcode sequences (i.e. reads that are derived from unique DNA molecules corresponding to a specific amplicon).
[0008] In a second aspect, the present disclosure refers to a kit for detecting methylated DNA pattern in DNA in a biological sample according to the method of the first aspect, comprising:
(a) a first enzyme capable of oxidizing 5-methylcytosine and 5-hydroxymethylcytosine of the DNA;
(b) a second enzyme capable of converting un-methylated cytosine of the DNA to uracil by deamination;
(c) a plurality of forward and reverse primer pairs specific to the converted sequence of target regions, wherein each forward primer of the plurality of forward and reverse primer pairs comprises a barcode sequence on its 5’ end, wherein the barcode sequence of each forward primer is different from each other, wherein the target regions comprise one or more CpG sites;
(d) a plurality of universal indexed primers for creating the sequencing library;
(e) a first DNA polymerase capable of amplifying DNA with uracil bases, for amplification of converted DNA;
(f) a reagent capable of removing excess primers;
(g) a second DNA polymerase capable of amplifying DNA, for creating the sequencing library; and
(h) sodium bisulfite.
BRIEF DESCRIPTION OF THE DRAWINGS
[0009] The invention will be better understood with reference to the detailed description when considered in conjunction with the non-limiting examples and the accompanying drawings, in which:
[0010] Fig. 1 illustrates the overall experimental workflow, from the conversion of DNA to sequencing.
[0011] Fig. 2 illustrates example of primer design for capturing converted DNA.
Top: For CLIP4_methyl_2F and CLIP4_methyl_2R, the italicised sequences represent the adaptor sequences required for the second amplification with the universal indexed Illumina P5 and P7 primers, respectively. The underlined sequence represents the target-specific sequence. Y and R represent a degenerate base (C or T and A or G, respectively) following the IUB code. For CLIP4_methyl_2F, NNNNNNNNNN represents a random barcode sequence.
Bottom: For the indexed Illumina P5 and P7 primers, the underlined bases indicate an 8 bp index barcode. For multiplex sequencing, each sample will be assigned a unique combination of forward and reverse indexes.
[0012] Fig. 3 illustrates expected sequencing library profile on Tapestation.
[0013] Fig. 4, comprising Figs. 4(a) and 4(b), illustrates examples of sequence alignment to Human hg19 genome for a single sample visualized using Integrative Genomics Viewer (IGV), wherein Fig. 4(a) shows amplicon designed to the plus strand of the genome, and Fig. 4(b) shows amplicon designed to the minus strand of the genome.
[0014] Fig. 5 illustrates the conversion efficiency of non-CpG cytosines to thymines. Samples with conversion <0.97 will be repeated.
[0015] Fig. 6 illustrates the examples of correlation of CpG methylation within each amplicon.
Top: Amplicon that contains highly correlated CpG methylation (Pearson Correlation Coefficient>0.9 at each site).
Bottom: Amplicon with low correlation of CpG methylation. The axes indicate chromosomal position.
[0016] Fig. 7 illustrates examples of median methylation beta-values across normal (n=57) and cancer (n=152) samples. For amplicons with low CpG correlations (<0.8 correlation value), individual CpG position data is considered.
[0017] Fig. 8 shows examples of average amplicon methylation values across normal, breast, colorectal, lung and ovarian cancer samples.
[0018] Fig. 9 shows sample distribution used for training set and best 3-fold cross validation scores.
[0019] Fig. 10 illustrates the N-gram method of detecting cfDNA methylation patterns.
[0020] Fig. 11 illustrates the Skip-gram method of detecting cfDNA methylation patterns.
Examples of 1-Skip and 2-Skip analyses are shown.
[0021] Fig. 12 shows the sensitivity performance of different prediction models, set at 95% specificity threshold of the training set.
DETAILED DESCRIPTION
[0022] The present disclosure describes a methodology for detecting methylated DNA pattern in DNA with high sensitivity and specificity, for the purpose of cancer screening and detection of early-stage (stage I/II) cancer, minimal residual disease following cancer surgery or therapy, and cancer relapse.
[0023] In a first aspect, the present disclosure refers to a method of detecting methylated DNA pattern in DNA in a biological sample, comprising:
(a) converting un-methylated cytosine of the DNA to uracil by deamination to thereby generate converted DNA;
(b) purifying the converted DNA from step (a);
(c) tagging a barcode sequence on the converted DNA, by performing a first PCR amplification using a plurality of forward and reverse primer pairs specific to the converted sequence of target regions, wherein each forward primer of the plurality of forward and reverse primer pairs comprises a barcode sequence on its 5’ end, wherein the barcode sequence of each forward primer is different from each other, wherein the target regions comprise one or more CpG sites;
(d) subjecting the tagged converted DNA from step (c) to a second PCR amplification with universal indexed primers to thereby create a sequencing library with components required for multiplex sequencing;
(e) subjecting the sequencing library to multiplex sequencing on a next-generation sequencing platform;
(f) detecting the presence of a barcode sequence using Bioinformatics methods to count and assign each DNA sequence from the next-generation sequencing to an original parental DNA molecule carrying the same barcode sequence, comprising:
(i) performing cluster reassignment of sequencing reads with the same barcode sequence to thereby generate barcode clusters wherein each barcode cluster contains reads from the same amplicon and with the same barcode sequence; and
(ii) performing consensus calling for each barcode cluster to thereby obtain consensus reads;
(g) reconstructing the methylated DNA pattern of the DNA by
(I) comparing the DNA sequence to a reference genome using a sequence alignment tool; and
(II) conducting variant analysis of the DNA sequence by comparing the consensus reads to the reference genome to detect the variations; to thereby assess 1) the conversion efficiency of non-CpG cytosines to thymine as quality control, and 2) the frequency of methylation at each CpG cytosine across all consensus sequencing reads with distinct barcode sequences (i.e. reads that are derived from unique DNA molecules corresponding to a specific amplicon).
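Step (f)(i) can be pictured as grouping reads by amplicon and barcode. The sketch below assumes that each read has already been assigned to an amplicon, that the barcode occupies the first 10 bases of the read (matching the 10 random nucleotides described for Fig. 2), and that barcodes are matched exactly; tolerance for sequencing errors within the barcode is omitted. Names such as `cluster_by_barcode` are hypothetical.

```python
from collections import defaultdict

BARCODE_LENGTH = 10  # assumption: barcode is the first 10 bases of each read


def cluster_by_barcode(reads):
    """Group reads into barcode clusters keyed by (amplicon, barcode).

    reads: iterable of (amplicon_name, read_sequence) pairs.
    """
    clusters = defaultdict(list)
    for amplicon, sequence in reads:
        barcode = sequence[:BARCODE_LENGTH]
        insert = sequence[BARCODE_LENGTH:]
        clusters[(amplicon, barcode)].append(insert)
    return dict(clusters)


def unique_molecules(clusters):
    """Count distinct barcodes, i.e. original parental molecules, per amplicon."""
    counts = defaultdict(int)
    for amplicon, _barcode in clusters:
        counts[amplicon] += 1
    return dict(counts)
```

Each cluster would then go through consensus calling as in step (f)(ii), giving one consensus read per original molecule.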
[0024] Firstly, the un-methylated cytosine of the DNA is converted to uracil by deamination to thereby generate converted DNA, as disclosed in step (a) of the method of the first aspect.
[0025] In one example, the DNA is extracted from the biological sample before step (a). The DNA may be extracted using any method or kit known in the art. In one example, the DNA is extracted from the biological sample before step (a) using organic extraction methods, such as phenol/chloroform extraction. In another example, the DNA is extracted from the biological sample before step (a) using kits such as, but not limited to, QIAamp Circulating Nucleic Acid Kit (Qiagen), MagMAX Cell-Free DNA Isolation Kit (Applied Biosystems), Cell/Blood DNA Kit (CatchGene), Tissue DNA Kit (CatchGene) and DNeasy Blood and Tissue Kits (Qiagen).
[0026] In one example, the extracted DNA is converted by the method disclosed herein, comprising:
• treating the DNA using a first enzyme that oxidizes 5-methylcytosine and 5-hydroxymethylcytosine of the DNA to thereby protect the 5-methylcytosine and 5-hydroxymethylcytosine from deamination;
• purifying the DNA;
• converting the un-methylated cytosine of the DNA to uracil by deamination using a second enzyme to thereby generate converted DNA.
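The net effect of the three steps above on a DNA sequence can be mimicked in silico: cytosines protected by the first enzyme stay as C, and all other cytosines become U, which is read as T after amplification and sequencing. The sketch below is a simplified model that assumes complete protection and complete deamination; the function names are hypothetical.

```python
def enzymatic_conversion(sequence, methylated_positions):
    """Model conversion: un-methylated C becomes U, protected (methylated) C is kept.

    sequence: DNA string (single strand).
    methylated_positions: 0-based indices of methylated or hydroxymethylated cytosines.
    """
    methylated = set(methylated_positions)
    return "".join(
        "U" if base == "C" and i not in methylated else base
        for i, base in enumerate(sequence)
    )


def as_sequenced(converted):
    """After PCR, uracil is copied as thymine, so U appears as T in the reads."""
    return converted.replace("U", "T")
```

Comparing `as_sequenced(...)` with the original sequence shows a C to T change at every un-methylated cytosine and no change at the methylated ones.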
[0027] In one example, the first enzyme is a Ten-eleven translocation (TET) enzyme or an isoform thereof. In another example, the TET enzyme is selected from the group consisting of TET1 enzyme or an isoform thereof, TET2 enzyme or an isoform thereof, and TET3 enzyme or an isoform thereof.
[0028] In one example, the purification of DNA is performed using an agent such as paramagnetic beads. In one example, the paramagnetic beads are selected from the group consisting of AMPure XP beads, SPRI beads, and Dynabeads.
[0029] In one example, the second enzyme is a cytidine deaminase or an enzyme with cytidine deaminase properties. In another example, the cytidine deaminase is selected from the group consisting of APOBEC enzyme, CDA, and activation-induced cytidine deaminase. In another example, the enzyme with cytidine deaminase properties is selected from the group consisting of M.SssI and M.HpaII.
[0030] In another example, the extracted DNA is converted using sodium bisulfite.
[0031] In another example, the DNA is not extracted from the biological sample before step (a), and is converted using direct conversion methods in which no DNA extraction is required.
[0032] In another example, the un-methylated cytosine of the unextracted DNA is directly converted to uracil by deamination using bisulfite to thereby generate the converted DNA. In another example, the un-methylated cytosine of the unextracted DNA is directly converted to uracil using direct conversion kits selected from the group consisting of EpiTect Fast FFPE Bisulfite Kit, innuCONVERT Bisulfite All-In-One Kit, and Zymo EZ DNA Methylation-Direct Kit.
[0033] The DNA used in the method of the first aspect is present in a biological sample. In one example, the biological sample containing the DNA is selected from the group consisting of a liquid sample, a tissue sample, or a cell sample. In another example, the liquid sample is bodily fluids selected from the group consisting of blood, bone marrow, cerebral spinal fluid, peritoneal fluid, pleural fluid, lymph fluid, ascites, serous fluid, sputum, lacrimal fluid, stool, urine, saliva, ductal fluid from breast, gastric juice, and pancreatic juice. In one example, the bodily fluid is blood. In some examples, the tissue sample may include, but is not limited to frozen tissue sample, fixed tissue sample (such as formalin-fixed tissue sample). In another example, the tissue sample or the cell sample may be any type of tissue or cell in the body. For example, the tissue sample or cell sample may be a tissue or cell from bone, epithelial, cartilage, adipose tissue, nerves, muscle, connective tissue, esophagus, stomach, liver, gallbladder, pancreas, adrenal glands, bladder, gallbladder, large intestine, small intestine, kidneys, liver, pancreas, colon, stomach, thymus, spleen, brain, spinal cord, heart, lungs, eyes, corneal, skin, or islet tissue or organs. The cell sample may also be from blood, such as white blood cells and platelets. In another example, the cell sample may be cancer cells, stem cells, endothelial cells, or fat cells.
[0034] In another example, the biological sample is obtained from a subject having and/or suspected of having a disease. In another example, the disease is cancer. In yet another example, the cancer is selected from the group consisting of leukemia, lymphoma, ovarian cancer, lung cancer, colorectal cancer, breast cancer, pancreatic cancer, prostate cancer, nasopharyngeal cancer, liver cancer, cholangiocarcinoma, esophageal cancer, urothelial cancer, and gastrointestinal cancer. In another example, the cancer is an early stage cancer. In another example, the cancer is a Stage I cancer. In another example, the cancer is a Stage II cancer. In another example, the cancer is a Stage III cancer. In another example, the cancer is a late stage cancer. In another example, the cancer is an original cancer. In another example, the cancer is a relapsed cancer. In another example, the cancer is relapsed if cancer cells are detected at, in the region of, or distant from the primary site of the tumour, about 1 week, about 2 weeks, about 3 weeks, about 1 month, about 2 months, about 3 months, about 4 months, about 5 months, about 6 months, about 7 months, about 8 months, about 9 months, about 10 months, about 11 months, about 1 year, about 2 years, about 3 years, about 4 years, about 5 years, about 6 years, about 7 years, about 8 years, about 9 years, or about 10 years after complete remission of the primary cancer. In another example, the disease is minimal residual disease of the primary cancer following curative surgery or therapy. As used herein, minimal residual disease (MRD) is a term used to describe the presence of tumour cells disseminated from the primary lesion to distant organs in patients who lack any clinical or radiological signs of metastasis, or residual tumour cells left behind after therapy, that eventually lead to cancer relapse.
[0035] In one example, the DNA is cell-free DNA (cfDNA). As used herein, cfDNA refers to non-encapsulated DNA which is circulating in a liquid sample disclosed herein and not contained within cells. In one example, plasma cfDNA is derived from both normal (healthy, non-diseased) cells and tumor cells. In one example, the DNA is circulating tumor DNA (ctDNA). In one example, the cfDNA fragments from tumor cells are shorter than cfDNA fragments from normal cells. In one example, the differences in plasma cfDNA concentrations and cfDNA fragment lengths between individuals with and without cancer can be assayed as cancer-specific signals. In one example, the liquid sample is bodily fluids selected from the group consisting of blood, bone marrow, cerebral spinal fluid, peritoneal fluid, pleural fluid, lymph fluid, ascites, serous fluid, sputum, lacrimal fluid, stool, urine, saliva, ductal fluid from breast, gastric juice, and pancreatic juice.
[0036] In another example, the DNA is encapsulated within tissues and/or cells. In another example, the tissue or cell may be any type of tissue or cell in the body. In some examples, the tissue sample may include, but is not limited to frozen tissue sample, fixed tissue sample (such as formalin-fixed tissue sample). In another example, the tissue is from bone, epithelial, cartilage, adipose tissue, nerves, muscle, connective tissue, esophagus, stomach, liver, gallbladder, pancreas, adrenal glands, bladder, gallbladder, large intestine, small intestine, kidneys, liver, pancreas, colon, stomach, thymus, spleen, brain, spinal cord, heart, lungs, eyes, corneal, skin, or islet tissue or organs. In one example, the cell is from bone, epithelial, cartilage, adipose tissue, nerves, muscle, connective tissue, esophagus, stomach, liver, gallbladder, pancreas, adrenal glands, bladder, gallbladder, large intestine, small intestine, kidneys, liver, pancreas, colon, stomach, thymus, spleen, brain, spinal cord, heart, lungs, eyes, corneal, skin, or islet tissue or organs. In another example, the cell may be a cancer cell, a stem cell, an endothelial cell, or a fat cell. In yet another example, the cell is a blood cell. The blood cell may be a white blood cell, or a platelet.
[0037] As used herein, when cfDNA or DNA encapsulated in blood cells in the peripheral blood is used, the method as disclosed herein is carried out on a non-invasive basis.
[0038] In one example, the amount of DNA used in the method disclosed herein is at least 5 ng. In another example, the amount of DNA used in the method disclosed herein is about 5 ng, or about 10 ng, or about 15 ng, or about 20 ng, or about 30 ng, or about 40 ng, or about 50 ng, or about 60 ng, or about 70 ng, or about 80 ng, or about 90 ng, or about 100 ng, or about 110 ng, or about 120 ng, or about 130 ng, or about 140 ng, or about 150 ng, or about 160 ng, or about 170 ng, or about 180 ng, or about 190 ng, or about 200 ng, or about 300 ng, or about 400 ng, or about 500 ng, or about 600 ng, or about 700 ng, or about 800 ng, or about 900 ng, or about 1000 ng, or at least 1000 ng.
[0039] After conversion of un-methylated cytosine of the DNA to uracil by deamination, the converted DNA is then purified as disclosed in step (b) of the method of the first aspect, using an agent such as DNA purification beads. The DNA purification beads may be paramagnetic beads, such as AMPure XP beads, and SPRI beads.
[0040] The converted and purified DNA is then tagged with a barcode sequence by performing a first PCR amplification using a plurality of forward and reverse primer pairs specific to the converted sequence of target regions, wherein each forward primer of the
plurality of forward and reverse primer pairs comprises a barcode sequence on its 5’ end, wherein the barcode sequence of each forward primer is different from each other, wherein the target regions comprise CpG sites, as disclosed in step (c) of the method of the first aspect. As used herein, the term “barcode sequence” is a commonly used term in the art of nucleic acid sequencing and used within the definition as known in the art. Thus, the term “barcode sequence” refers to the encoded molecules or barcodes that include variable amount of information within the nucleic acid sequence. For example, the barcode sequence is a tag that can be read out using any of a variety of sequence identification techniques, for example, nucleic acid sequencing, probe hybridization based assay, and the like. In some examples, the barcode sequence is used in the method as described herein to tag different converted DNA sequences of target regions of a sample, such that when the barcode sequence tags to the converted DNA sequences of target regions, each different converted DNA sequence of target region would then have a unique barcode sequence that is attached to it and read out with the converted DNA sequence of target region from the sample.
[0041] In one example, the barcode sequence is an oligonucleotide comprising 10 to 16 random nucleotides, or 10 to 15 random nucleotides, or 10 to 13 random nucleotides, or 10 random nucleotides, or 11 random nucleotides, or 12 random nucleotides, or 13 random nucleotides, or 14 random nucleotides, or 15 random nucleotides, or 16 random nucleotides. In another example, the barcode sequence is an oligonucleotide comprising 10 random nucleotides. As exemplified in the Experimental Section (Fig. 2), the barcode sequence may be defined as NNNNNNNNNN, which may have the sequences such as, but is not limited to, TAGCTAACGT, GCAAGGTCAA, ACCTGTGTAT and the like.
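As a rough illustration of why a short random barcode suffices to tag individual molecules, the following minimal Python sketch generates a barcode matching the NNNNNNNNNN pattern and counts how many distinct barcodes exist for a given length. The function names and the fixed random seed are assumptions made for illustration only and are not part of the disclosed method.

```python
import random

BASES = "ACGT"

def random_barcode(length=10, rng=random):
    """Return one random barcode of the given length (the NNNNNNNNNN pattern)."""
    return "".join(rng.choice(BASES) for _ in range(length))

def barcode_space(length):
    """Number of distinct barcodes of a given length over A, C, G and T."""
    return len(BASES) ** length

rng = random.Random(0)  # fixed seed only so that the sketch is repeatable
example = random_barcode(10, rng)
print(example, barcode_space(10), barcode_space(16))
```

A 10-nucleotide barcode gives 4^10 = 1,048,576 possible sequences, and a 16-nucleotide barcode gives 4^16 = 4,294,967,296.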
[0042] In one example, the number of the forward and reverse primer pairs is at least 5. In another example, the number of the forward and reverse primer pairs is at least 10. In another example, the number of the forward and reverse primer pairs is at least 15. In another example, the number of the forward and reverse primer pairs is at least 20. In another example, the number of the forward and reverse primer pairs is at least 30. In another example, the number of the forward and reverse primer pairs is at least 40. In another example, the number of the forward and reverse primer pairs is at least 50. In another example, the number of the forward and reverse primer pairs is at least 60. In another example, the number of the forward and reverse primer pairs is at least 70. In another example, the number of the forward and reverse primer pairs is at least 80. In
another example, the number of the forward and reverse primer pairs is at least 90. In another example, the number of the forward and reverse primer pairs is at least 100. In another example, the number of the forward and reverse primer pairs is at least 110. In another example, the number of the forward and reverse primer pairs is at least 120. In another example, the number of the forward and reverse primer pairs is at least 130. In another example, the number of the forward and reverse primer pairs is at least 140. In another example, the number of the forward and reverse primer pairs is at least 150. In another example, the number of the forward and reverse primer pairs is at least 160. In another example, the number of the forward and reverse primer pairs is at least 170. In another example, the number of the forward and reverse primer pairs is at least 180. In another example, the number of the forward and reverse primer pairs is at least 190. In another example, the number of the forward and reverse primer pairs is at least 200. In another example, the number of the forward and reverse primer pairs is 5. In another example, the number of the forward and reverse primer pairs is 22. In another example, the number of the forward and reverse primer pairs is 95. In another example, the number of the forward and reverse primer pairs is 159. In another example, there is no upper limit on the number of the forward and reverse primer pairs.
[0043] In another example, the forward and reverse primer pairs comprise sequences as disclosed in Table 1.
[0044] Table 1. Sample primer sequences (159 pairs).
[0045] The exemplified sequences disclosed in Table 1 show only the target-specific sequences of each primer. These sequences do not show the barcode sequence (for forward primers only) and the adaptor sequence required for the second amplification with universal indexed primers.
[0046] The full sequence of each forward primer used in step (c) of the method of the first aspect contains the adaptor sequence, followed by the barcode sequence and then the target-specific sequence (the sequences disclosed in Table 1). Fig. 2 shows the full sequence of CLIP4_methyl_2F, which is one exemplary forward primer among the 159 primer pairs comprising the target-specific sequences in Table 1.
[0047] The full sequence of each reverse primer used in step (c) of the method of the first aspect contains the adaptor sequence followed by the target-specific sequence (the sequences disclosed in Table 1). Fig. 2 shows the full sequence of CLIP4_methyl_2R, which is one exemplary reverse primer among the 159 primer pairs comprising the target-specific sequences in Table 1.
[0048] In another example, the primer pair comprises degenerate bases. In one example, the forward primer in the primer pair comprises one or more degenerate bases, while the reverse primer in the primer pair has no degenerate base. In another example, the reverse primer in the primer pair comprises one or more degenerate bases, while the forward primer in the primer pair has no degenerate base. In yet another example, both the forward and reverse primers in the primer pair comprise one or more degenerate bases. As used herein, degenerate primers are used when the primer landing site overlaps with a CpG site. A CpG site bound by the forward primer has a sequence of either CG (methylated) or TG (un-methylated). The degenerate base Y is used in forward primers to specify either a cytosine or thymine, thus allowing the primer to cover both un-methylated and methylated DNA. In addition, a CpG site bound by reverse primers has a sequence of either CA (un-methylated) or CG (methylated). The degenerate base R is used in reverse primers to specify either an adenine or guanine, thus allowing the primer to cover both un-methylated and methylated DNA. In another example, the degenerate base is selected from the group consisting of C, T, A and G. In another example, each primer of the primer pair comprises 1, 2, or 3 degenerate bases. In another example, each primer of the primer pair has one degenerate base. In another example, the primer pair does not comprise a degenerate base, i.e. has no degenerate base.
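The effect of the degenerate bases Y and R can be illustrated with a short Python sketch that expands a degenerate primer into every concrete sequence it represents. The two primer fragments used here are hypothetical and are not taken from Table 1; only the Y (C or T) and R (A or G) codes come from the description above.

```python
from itertools import product

# Codes used in the description: Y = C or T (forward primers), R = A or G (reverse primers).
DEGENERATE = {"Y": "CT", "R": "AG"}

def expand_primer(primer):
    """List every concrete sequence that a degenerate primer stands for."""
    choices = [DEGENERATE.get(base, base) for base in primer]
    return ["".join(combo) for combo in product(*choices)]

forward_variants = expand_primer("GTTAYGGA")  # hypothetical forward primer fragment
reverse_variants = expand_primer("CCRAACRT")  # hypothetical reverse primer fragment
print(forward_variants)
print(reverse_variants)
```

A primer with k degenerate positions expands to 2^k sequences, so a primer with 1, 2 or 3 degenerate bases corresponds to 2, 4 or 8 concrete oligonucleotides.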
[0049] In one example, the target regions comprise CpG sites. As used herein, CpG site refers to a cytosine that immediately precedes a guanine base. In vertebrates, DNA methylation occurs at cytosines within a CpG site.
[0050] In one example, each forward and reverse primer pair covers a target region which comprises at least 1 CpG site. In one example, each forward and reverse primer pair covers a target region which comprises at least 2 CpG sites. In one example, each forward and reverse primer pair covers a target region which comprises at least 3 CpG sites. In one example, each forward and reverse primer pair covers a target region which comprises at least 5 CpG sites. In one example, each forward and reverse primer pair covers a target region which comprises at least 8 CpG sites. In one example, each forward and reverse primer pair covers a target region which comprises at least 10 CpG sites. In one example, each forward and reverse primer pair covers a target region which comprises at least 15 CpG sites. In one example, each forward and reverse primer pair covers a target region which comprises at least 20 CpG sites. In one example, each forward and reverse primer pair covers a target region which comprises at least 25 CpG sites. In one example, each forward and reverse primer pair covers a target region which comprises at least 30 CpG sites. In one example,
each forward and reverse primer pair covers a target region which comprises at least 35 CpG sites. In one example, each forward and reverse primer pair covers a target region which comprises at least 40 CpG sites. In one example, each forward and reverse primer pair covers a target region which comprises at least 50 CpG sites. In one example, each forward and reverse primer pair covers a target region which comprises at least 60 CpG sites. In one example, each forward and reverse primer pair covers a target region which comprises at least 70 CpG sites. In one example, each forward and reverse primer pair covers a target region which comprises at least 80 CpG sites. In one example, each forward and reverse primer pair covers a target region which comprises at least 90 CpG sites. In one example, each forward and reverse primer pair covers a target region which comprises at least 100 CpG sites. In another example, there is no upper limit on the number of CpG sites within the target region covered by each forward and reverse primer pair.
[0051] The first PCR amplification comprises a number of PCR cycles selected from the group consisting of 2, 3, 4 and 5 PCR cycles. In one example, the first PCR amplification comprises 2 PCR cycles. In one example, the first PCR amplification comprises 3 PCR cycles. In one example, the first PCR amplification comprises 4 PCR cycles. In one example, the first PCR amplification comprises 5 PCR cycles. As each forward primer carries on its 5’ end a randomly assigned barcode sequence as disclosed herein, the first PCR amplification allows individual DNA molecules to be tagged uniquely in this first step of sequencing library formation.
[0052] After the first PCR amplification, a second PCR amplification is performed with universal indexed primers as disclosed in step (d) of the method of the first aspect, to create a sequencing library with components required for multiplex sequencing on a next-generation sequencing platform selected from the group consisting of Illumina platform, Ion Torrent sequencing technology, MGI sequencing platform, Oxford Nanopore sequencing, PacBio SMRT sequencing and 10X Genomics platform, as disclosed in step (e) of the method of the first aspect.
[0053] In one example, the universal indexed primers used in step (d) of the method of the first aspect are shown in Fig. 2, which comprise: a forward primer comprising the sequence of
AATGATACGGCGACCACCGAGATCTACACCTAGCGCTACACTCTTTCCCTACACGACGCTCTTCCGATC*T; and
a reverse primer comprising the sequence of
CAAGCAGAAGACGGCATACGAGATAACCGCGGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC*T.
[0054] The above exemplary sequences of the universal indexed primers used in step (d) of the method of the first aspect are the Indexed Illumina primers. The underlined index barcodes are 8 bp barcode sequences that are specified by Illumina. The underlined part can vary for different samples. Each sample within each sequencing run will have a unique combination of forward and reverse indexes. In another example, the underlined index barcodes have the sequences provided by Illumina for next-generation sequencing on the Illumina platform, and may be any sequence listed in the “Illumina Adapter Sequences” handbook, February 2019 version (https://dnatech.genomecenter.ucdavis.edu/wp- cqntent/uploa/2019/03/illumina-adapter-sequences-2019-100000000 2694-10). Exemplary index barcodes for forward primers that may be used are listed in the column “i5 Bases for Sample Sheet iSeq, MiniSeq, NextSeq, HiSeq 3000/4000”, for example CTAGCGCT and TCGATATC. Exemplary index barcodes for reverse primers that may be used are listed in the column “i7 Bases in Adapter”, for example AACCGCGG and GGTTATAA.
[0055] After next-generation sequencing, the presence of a barcode sequence is then detected using bioinformatics methods to count and assign each DNA sequence from the next-generation sequencing to an original parental DNA molecule carrying the same barcode sequence, as disclosed in step (f) of the method of the first aspect, comprising:
(i) performing cluster reassignment of sequencing reads with the same barcode sequence to thereby generate barcode clusters wherein each barcode cluster contains reads from the same amplicon and with the same barcode sequence; and
(ii) performing consensus calling for each barcode cluster to thereby obtain consensus reads.
[0056] In one example, the assignment of DNA sequence to an original parental DNA molecule refers to the cluster reassignment of sequencing reads with the same barcode sequence. This generates barcode clusters wherein each cluster contains reads from the same amplicon and with the same barcode sequence. Consensus calling is performed for each barcode cluster to obtain the consensus reads. These consensus reads are the DNA sequence that is subsequently compared to the reference genome for variations to be detected. The
initial step of cluster reassignment and generation of barcode clusters is important because it greatly reduces sequencing errors and improves confidence for accurate variant calling.
[0057] As used herein, the term “count” recited in step (f) of the method of the first aspect refers to the following process: Barcode sequences are extracted and clustered in two steps: 1. Initial grouping by exact match of the combination of amplicon_name + barcode sequence; and 2. Cluster reassignment, in which, within each group sharing the same amplicon_name, barcodes are further reassigned using global pairwise alignment with a maximum of 2 base differences between barcodes. Barcode clusters with fewer than 3 associated reads (after cluster reassignment) are considered unreliable clusters and removed from downstream analysis.
[0058] Next, the methylated DNA pattern of the DNA is reconstructed as disclosed in step (g) of the method of the first aspect. The consensus DNA sequence is compared to a reference genome using a sequence alignment tool and variant analysis of the DNA sequence is conducted by comparing the consensus reads to the reference genome to detect the variations. As used herein, the term “reference genome” refers to DNA sequences known in the art that may be obtainable from public databases. Exemplary bioinformatics analysis tools for reconstructing the methylated DNA pattern include bwa-meth, Bismark, MethylDackel, bisulfite-treated reads analysis tools (BRAT), methyQA, mrsFAST, BSMAP, VerJInxer, RMAP-bs, MethylCoder, BS-seeker2, and Bison.
[0059] Steps (a) to (g) of the method of the first aspect as described above thereby enable assessment of: 1) the conversion efficiency of non-CpG cytosines to thymine as quality control, and 2) the frequency of methylation at each CpG cytosine across all consensus sequencing reads with distinct barcode sequences (i.e. reads that are derived from unique DNA molecules corresponding to a specific amplicon).
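The two quantities named above can be sketched as simple ratios. The following minimal Python example assumes that the base calls observed at reference cytosine positions have already been collected from the consensus reads and expressed on the plus strand; the function names and the example inputs are hypothetical.

```python
def conversion_efficiency(non_cpg_calls):
    """Fraction of non-CpG cytosine calls read as T (converted) among all C/T calls."""
    converted = non_cpg_calls.count("T")
    unconverted = non_cpg_calls.count("C")
    total = converted + unconverted
    return converted / total if total else None

def methylation_frequency(cpg_calls):
    """Fraction of consensus reads reading C (methylated) at one CpG cytosine."""
    methylated = cpg_calls.count("C")
    unmethylated = cpg_calls.count("T")
    total = methylated + unmethylated
    return methylated / total if total else None

# Hypothetical calls: ten non-CpG cytosines in one read, and one CpG site across four reads.
print(conversion_efficiency("TTTTTTTTTC"))
print(methylation_frequency(["C", "C", "T", "C"]))
```

Calls other than C or T are ignored in both ratios in this sketch; such positions correspond to sequence variations rather than to methylation states.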
[0060] In one example, the methods as disclosed herein may be used to detect mutations or polymorphisms at CpG sites. As used herein, relative to the reference genome, methylation of a CpG site is defined as concordance of the CG sequence for a CpG site, regardless of whether the site is on the plus or minus strand. As used herein, relative to the reference genome, non-methylation of a CpG site is defined as:
• A sequence of TG for a CpG on the plus strand. In this case, the unmethylated cytosine has been converted to thymine; or
• A sequence of CA for a CpG on the minus strand. In this case, the unmethylated cytosine on the minus strand was converted to thymine, which has a complementary adenine base on the plus strand (see Fig. 4).
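The definitions above can be expressed as a small classification rule. The following Python sketch assumes that the two bases observed at a reference CG position are reported on the plus strand together with the strand of the sequenced molecule; the function name and the "variant" label for any other observation are assumptions made for illustration.

```python
def cpg_state(observed_dinucleotide, strand):
    """Classify a reference CG site from the two bases observed on the plus strand."""
    if observed_dinucleotide == "CG":
        return "methylated"      # concordant with the reference, on either strand
    if strand == "+" and observed_dinucleotide == "TG":
        return "unmethylated"    # plus-strand cytosine converted to thymine
    if strand == "-" and observed_dinucleotide == "CA":
        return "unmethylated"    # minus-strand conversion seen as A on the plus strand
    return "variant"             # any other observation is neither state

for observed, strand in [("CG", "+"), ("TG", "+"), ("CA", "-"), ("AG", "+")]:
    print(observed, strand, cpg_state(observed, strand))
```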
[0061] During the reconstruction of the methylated DNA pattern of cfDNA, variations at CpG cytosines to a non-C/T base (i.e. mutation to A or G), will be flagged due to the unexpected occurrence of a non-C/T base that disrupts the CpG site. The allele frequency of this variation can be determined by its frequency across all consensus sequencing reads with distinct barcode sequences.
[0062] Variations at CpG guanines will also be flagged during this process due to the unexpected occurrence of a non-G base that disrupts the CpG site. The allele frequency of this variation can be determined by its frequency across all consensus sequencing reads with distinct barcode sequences.
[0063] In one example, the method as described in the first aspect further comprises the following steps:
(h) using a statistical modelling technique to thereby predict presence or absence of cancer; and
(i) using a statistical modelling technique to thereby identify specific cancer types when the presence of cancer is predicted in step (h).
In one example, the method as described in the first aspect further comprises the step of analyzing methylated DNA pattern prior to performing step (h). In one example, Natural Language Processing, N-gram and Skip-gram are used for analyzing methylated DNA pattern. In one example, N-gram may be used to capture methylation pattern-specific information and generate new features that can be further analyzed. In one example, the generated new features can be used as data input for further statistical modelling techniques, such as those in step (h) and/or (i). In one example, the statistical modelling technique is logistic regression. In one example, Skip-gram may be used to determine patterns between initially unrelated or non-adjacent CpG sites by skipping N number of sites between 2 sites within an amplicon. In one example, the determined patterns can be used as data input for further statistical modelling techniques, such as those in step (h) and/or (i). In one example, the statistical modelling technique is logistic regression. In one example, the utilities of methylation frequency and methylation patterns derived from N-gram and Skip-gram may be used to detect cancer. In one example, the cancer is lung cancer.
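One possible way to derive N-gram and Skip-gram features from methylation patterns is sketched below in Python. The encoding of each consensus read as a string of 1 (methylated) and 0 (un-methylated) across the CpG sites of an amplicon, the feature naming, and the position-independent counting are all assumptions made for illustration; the disclosure does not specify the exact feature encoding.

```python
from collections import Counter

def ngrams(pattern, n):
    """Contiguous runs of n CpG states within one read's methylation pattern."""
    return [pattern[i:i + n] for i in range(len(pattern) - n + 1)]

def skipgrams(pattern, skip):
    """Pairs of CpG states separated by `skip` intervening sites."""
    step = skip + 1
    return [pattern[i] + pattern[i + step] for i in range(len(pattern) - step)]

def featurize(patterns, n=2, skip=1):
    """Count N-gram and Skip-gram occurrences over all reads of one amplicon."""
    features = Counter()
    for pattern in patterns:
        features.update(f"ngram{n}:{g}" for g in ngrams(pattern, n))
        features.update(f"skip{skip}:{g}" for g in skipgrams(pattern, skip))
    return features

# Hypothetical patterns for three consensus reads covering four CpG sites.
features = featurize(["1101", "1111", "0000"])
print(sorted(features.items()))
```

The resulting counts form a feature vector per amplicon that could serve as input to a classifier such as logistic regression.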
[0064] In another example, the statistical modelling technique is selected from the group consisting of logistic regression, tree based classifiers and deep neural networks.
[0065] In a second aspect, the present disclosure refers to a kit for detecting methylated DNA pattern in DNA in a biological sample according to the method of the first aspect, comprising:
(a) a first enzyme capable of oxidizing 5-methylcytosine and 5-hydroxymethylcytosine of the DNA;
(b) a second enzyme capable of converting un-methylated cytosine of the DNA to uracil by deamination;
(c) a plurality of forward and reverse primer pairs specific to the converted sequence of target regions, wherein each forward primer of the plurality of forward and reverse primer pairs comprises a barcode sequence on its 5’ end, wherein the barcode sequence of each forward primer is different from each other, wherein the target regions comprise one or more CpG sites;
(d) a plurality of universal indexed primers for creating the sequencing library;
(e) a first DNA polymerase capable of amplifying DNA with uracil bases, for amplification of converted DNA;
(f) a reagent capable of removing excess primers;
(g) a second DNA polymerase capable of amplifying DNA, for creating the sequencing library; and
(h) sodium bisulfite.
[0066] The first enzyme, the second enzyme, the plurality of forward and reverse primer pairs, the barcode sequence, the CpG sites, and the plurality of universal indexed primers are disclosed herein.
[0067] In one example, the first DNA polymerase is selected from the group consisting of Phusion U Hot Start DNA Polymerase (Thermo Scientific), ZymoTaq DNA Polymerase (Zymo Research) and Q5U Hot Start High-Fidelity DNA Polymerase (NEB). In another example, the reagent capable of removing excess primers is selected from the group consisting of paramagnetic beads and single-strand exonucleases. Exemplary paramagnetic beads include AMPure XP beads, SPRI beads, and Dynabeads. In another example, the second DNA polymerase is selected from the group consisting of KAPA HiFi DNA Polymerase (Roche), Platinum Taq DNA Polymerase or Platinum SuperFi DNA Polymerase (Invitrogen) and Q5 High-Fidelity DNA Polymerase (NEB).
[0068] As used in this application, the singular form “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise. For example, the term “a primer” includes a plurality of primers, including mixtures and combinations thereof.
[0069] As used herein, the terms “increase” and “decrease” refer to the relative alteration of a chosen trait or characteristic in a subset of a population in comparison to the same trait or characteristic as present in the whole population. An increase thus indicates a change on a positive scale, whereas a decrease indicates a change on a negative scale. The term “change”, as used herein, also refers to the difference between a chosen trait or characteristic of an isolated population subset in comparison to the same trait or characteristic in the population as a whole. However, this term is without valuation of the difference seen.
[0070] As used herein, the term “about” in the context of concentration of a substance, size of a substance, length of time, or other stated values means +/- 5% of the stated value, or +/- 4% of the stated value, or +/- 3% of the stated value, or +/- 2% of the stated value, or +/- 1% of the stated value, or +/- 0.5% of the stated value.
[0071] Throughout this disclosure, certain embodiments may be disclosed in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the disclosed ranges. Accordingly, the description of a range should be considered to have specifically disclosed all the possible sub-ranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed sub-ranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.
[0072] The invention illustratively described herein may suitably be practiced in the absence of any element or elements, limitation or limitations, not specifically disclosed herein. Thus, for example, the terms "comprising", "including", "containing", etc. shall be read expansively and without limitation. Additionally, the terms and expressions employed herein have been used as terms of description and not of limitation, and there is no intention in the use of such terms and expressions of excluding any equivalents of the features shown and described or portions thereof, but it is recognized that various modifications are possible
within the scope of the invention claimed. Thus, it should be understood that although the present invention has been specifically disclosed by preferred embodiments and optional features, modification and variation of the inventions embodied herein disclosed may be resorted to by those skilled in the art, and that such modifications and variations are considered to be within the scope of this invention.
[0073] The invention has been described broadly and generically herein. Each of the narrower species and subgeneric groupings falling within the generic disclosure also form part of the invention. This includes the generic description of the invention with a proviso or negative limitation removing any subject matter from the genus, regardless of whether or not the excised material is specifically recited herein.
[0074] Other embodiments are within the following claims and non-limiting examples.
EXAMPLES
[0075] Methods
[0076] Sample collection and Processing
[0077] Blood from healthy individuals or patients with cancer was collected into Streck Cell-Free DNA tubes and plasma was isolated by centrifugation. Plasma cell-free DNA (cfDNA) was extracted using the QIAamp Circulating Nucleic Acid Kit (Qiagen). To convert all un-methylated cytosines in the genome to uracils while preserving methylated cytosines, the plasma cfDNA was subjected to enzymatic conversion using the NEBNext Enzymatic Methyl-Seq Conversion Module (New England BioLabs). Briefly, DNA was treated with the TET2 enzyme that oxidizes 5-methylcytosine and 5-hydroxymethylcytosine, protecting these bases from deamination by APOBEC in the next step. Next, the DNA was purified using AMPure XP beads (Beckman Coulter) prior to the addition of APOBEC enzyme which deaminates un-methylated cytosines to uracils. Lastly, purification using AMPure XP beads generated single-stranded DNA that is similar to sodium-bisulfite-converted DNA.
[0078] Design of Multiplex PCR panel for identification of DNA methylation in targeted regions
[0079] A multiplex amplicon-based next generation sequencing (NGS) platform was developed to capture and sequence targeted regions of the converted genome. These regions were selected based on literature review of known methylated regions in specific cancers and
from analyses of methylation data from normal and tumor tissues in the Cancer Genome Atlas (TCGA) database. Each amplicon covers at least 1 CpG site. In initial validation experiments, primers for 22 amplicons were designed and the panel has since been increased to >100 amplicons (159 amplicons). The design of the assay is intended to be scalable to include multiple targets for the specific identification of multiple cancers.
[0080] Each forward primer additionally includes, on the 5’ end, a random 10-nucleotide sequence to serve as a barcode sequence for the identification of unique DNA molecules. In CpG-rich regions in which it was not possible to design primers in between CpG sites, degeneracy was incorporated into the primer designs to enable the capture of both un-methylated and methylated CpGs.
[0081] A combinatorial amplicon-based NGS assay targeting hotspot mutations in 32 genes that are commonly mutated in lung cancer was developed to complement the multiplex amplicon-based NGS platform described above, to improve the sensitivity of cancer detection. Said combinatorial amplicon panel incorporates molecular barcode sequences for error suppression and improved coverage, enabling 100% specificity and 100% detection sensitivity at 1% and 5% VAF for single nucleotide variants (SNVs) and insertions/deletions (indels) and 89% detection sensitivity at 0.1% VAF using HD780 (Horizon Discovery) reference standards. The design of the panel incorporates tiled amplicons that can generate longer or shorter amplicons, thus enabling the profiling of the size distribution of cfDNA fragments. In one example, the combinatorial amplicon panel can detect cfDNA methylation. In one example, the combinatorial amplicon panel can detect cfDNA concentration. In one example, the combinatorial amplicon panel can detect ctDNA fragmentation profile. In one example, the combinatorial amplicon panel can detect cfDNA methylation, cfDNA concentration and ctDNA fragmentation profile or any combinations thereof.
[0082] Preparation of whole-genome sequencing library
[0083] The amount of converted cfDNA used for library preparation varied slightly depending on the amount used for enzymatic conversion, but typically represented 5-10 ng starting amount of cfDNA prior to conversion. For the target capture PCR, both forward and reverse primers were combined in a single reaction using Phusion U Hot Start DNA Polymerase (Thermo Fisher Scientific) under the following thermocycling conditions: Denaturation at 98°C for 30s, followed by 3 or 4 cycles of 98°C for 10s, 55-57°C for 6 min,
and 72°C for 5 min (3 cycles with 55°C for the 22-amplicon panel, 4 cycles with 56°C or 57°C for larger panels). At the end of the reaction, for the 22-amplicon panel, excess primers were removed by purification with 1.5x AMPure XP beads twice. For larger panels, excess primers were removed by purification with 1.2x AMPure XP beads, treatment with Thermolabile Exonuclease I (New England BioLabs) for 10 min, and a second round of purification with 1.3x AMPure XP beads.
[0084] A final amplification was performed to amplify the targets and to complete the library with indexed sequencing adaptors for sequencing on the Illumina platform. Briefly, purified product was amplified with indexed P5 adapter sequence and indexed P7 adapter sequence using KAPA HiFi HotStart ReadyMix (Roche) under the following thermocycling conditions: Denaturation at 98°C for 45 s, followed by 19 to 21 cycles of 98°C for 15 s, 60°C for 30 s, and 72°C for 30 s, with a final extension at 72°C for 1 min. The amplified library was purified twice with 0.8x then 0.7x AMPure XP beads to remove non-specific products. The quality and quantity of the sequencing library were assessed using the 4200 Tapestation system (Agilent Technologies, USA) and KAPA Library Quantification Kit for Illumina® Platforms (Roche) respectively. Paired-end sequencing (2x151 bp) of the final dual-indexed libraries was performed on the Illumina platform as per manufacturer’s instructions.
[0085] Data Analysis
[0086] FASTQ files were processed using a custom pipeline. First, expected amplicons were identified and labelled in the FASTQ files based on the expected primer sequences in Read 1 and paired Read 2. For amplicons with degenerate primers, data generated from each pair of degenerate primers are aggregated and assigned to the same amplicon based on the expected primer sequences. Primer sequences and upstream barcode sequences were trimmed using cutadapt, and the primer-trimmed sequences were mapped to the Homo sapiens GRCh37 (hg19) reference genome using bwa-meth, which is specifically designed for the alignment of bisulfite-converted sequences. For “primer” trimmed fastq files, the name of the primer which has the best match to a read is concatenated to the name of the mapped output reads (for both read 1 and read 2). The primer name assigned to read 1 may not always match that of read 2, which can be due to non-specific binding. An “amplicon name” is assigned to each paired read by combining the matching primer name of read 1 and read 2 (concatenated by semicolon).
[0087] Molecular tag (or barcode) sequences were included in the trimmed “primer” sequences of read 1 and could be extracted given the unique structure of the primer sequences in read 1. The extracted molecular tag sequences were clustered in two steps: (1) initial grouping by exact match of the combination of amplicon_name + barcode sequence; and (2) cluster reassignment, in which, within each group sharing the same amplicon_name, barcodes were further reassigned using global pairwise alignment with a maximum of 2 base differences between barcodes. Barcode clusters with fewer than 3 associated reads (after cluster reassignment) were considered unreliable and removed from downstream analysis.
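By way of a non-limiting illustration, the two-step clustering described above may be sketched in Python as follows. The sketch makes several assumptions that are not stated in the disclosure: Levenshtein edit distance stands in for the number of differences in a global pairwise alignment, reassignment is done greedily from the most abundant barcode to the least abundant, and the function names `edit_distance` and `cluster_barcodes` are hypothetical.

```python
from collections import Counter, defaultdict


def edit_distance(a, b):
    """Levenshtein distance, used here as a stand-in (assumption) for the
    number of differences in a global pairwise alignment of two barcodes."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def cluster_barcodes(reads, max_diff=2, min_reads=3):
    """reads: iterable of (amplicon_name, barcode) tuples, one per read pair.
    Returns {(amplicon_name, seed_barcode): read_count} for retained clusters."""
    # Step 1: exact-match grouping on amplicon_name + barcode.
    exact = Counter(reads)
    by_amplicon = defaultdict(list)
    for (amplicon, barcode), n in exact.items():
        by_amplicon[amplicon].append((barcode, n))

    clusters = {}
    for amplicon, items in by_amplicon.items():
        # Step 2 (assumed greedy scheme): visit barcodes from most to least
        # abundant and merge each into the first seed within max_diff.
        seeds = []
        for barcode, n in sorted(items, key=lambda x: (-x[1], x[0])):
            for seed in seeds:
                if edit_distance(barcode, seed) <= max_diff:
                    clusters[(amplicon, seed)] += n
                    break
            else:
                seeds.append(barcode)
                clusters[(amplicon, barcode)] = n
    # Remove clusters supported by fewer than min_reads reads.
    return {key: n for key, n in clusters.items() if n >= min_reads}
```

A barcode that differs from an abundant barcode by a single base is absorbed into that cluster, whereas small clusters that remain below 3 reads after reassignment are discarded.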
[0088] Consensus calling was done for each molecular tag (or barcode) cluster by first performing global alignment among all associated reads using MAFFT. The consensus base at each aligned position was called by determining the majority representative base type, the percentage of which had to be no less than an automatically determined threshold. The threshold is a function of the total number of reads for that barcode sequence. If no representative base could be called, the position was assigned N (as opposed to one of A, C, T, G). A new quality score was assigned to each position, which is either the 90th percentile of all the quality values from the representative base type in that position (if a consensus base is found), or the 10th percentile of all quality values in that position (if no consensus base is found). The consensus reads were then written to a new FASTQ file. Molecular barcoding suppresses sequencing errors in the consensus reads, which increases confidence in methylated/non-methylated calls.
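A minimal sketch of the per-position consensus rule is given below for illustration only. It assumes that the reads of a cluster have already been aligned to equal length (the MAFFT step is not reproduced), uses a nearest-rank percentile, and takes the majority threshold from a hypothetical `threshold_fn` placeholder, since the disclosure does not specify the function that maps read count to threshold.

```python
import math
from collections import Counter


def percentile(values, pct):
    """Nearest-rank percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def consensus(reads, quals, threshold_fn=lambda n_reads: 0.6):
    """reads: equal-length aligned sequences from one barcode cluster.
    quals: per-read lists of quality scores, same shape as reads.
    threshold_fn: hypothetical placeholder returning the minimum majority
    fraction for a cluster of n_reads reads."""
    min_frac = threshold_fn(len(reads))
    seq, qual = [], []
    for pos in range(len(reads[0])):
        column = [r[pos] for r in reads]
        col_q = [q[pos] for q in quals]
        base, count = Counter(column).most_common(1)[0]
        if base in "ACGT" and count / len(column) >= min_frac:
            # Consensus found: 90th percentile of qualities of the majority base.
            seq.append(base)
            support = [q for b, q in zip(column, col_q) if b == base]
            qual.append(percentile(support, 90))
        else:
            # No consensus: call N with the 10th percentile of all qualities.
            seq.append("N")
            qual.append(percentile(col_q, 10))
    return "".join(seq), qual
```

Raising the threshold makes the consensus more conservative: positions with a weak majority are reported as N with a low quality score instead of a base call.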
[0089] Analysis of conversion efficiency and methylation frequency
[0090] Adaptor-trimmed, barcode-clustered consensus FASTQ reads were mapped to the Homo sapiens GRCh37 (hg19) reference genome using bwa-meth. The reads were subjected to several filtering steps prior to the evaluation of conversion efficiency (non-CpG Cs) and methylation frequency (CpG Cs). First, a read was considered only if at least two-thirds (66%) of its CpG cytosines were properly covered and assigned to a base (A, C, T or G) instead of N; reads with more than one-third of their CpG cytosines assigned as N were excluded. Subsequently, data from all the reads were aggregated at the amplicon level, and cytosines that met any of the following criteria were excluded:
• >40% N fraction at an expected cytosine position. This filters out positions with low quality sequencing.
• <80% C or T base fraction at position. This filters out potential SNPs or positions with low quality.
• <60% G base fraction of the adjacent G base of an expected CpG site. This allows for the identification of putative SNPs at the G coordinate of a CpG site that would disrupt the site.
• >40% G base fraction of the 3’ adjacent base of a non-CpG cytosine. This allows for the identification of putative SNPs that result in the formation of an unexpected CpG.
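The read-level and position-level filters listed above can be expressed compactly as in the following sketch. It is illustrative only: the function names are hypothetical, and it assumes that every fraction is computed over all reads covering the position, including N calls, which the disclosure does not state explicitly.

```python
def frac(counts, bases):
    """Fraction of reads at a position whose base call is in `bases`."""
    total = sum(counts.values())
    return sum(counts.get(b, 0) for b in bases) / total if total else 0.0


def read_passes(cpg_calls):
    """cpg_calls: base calls of one read at its expected CpG cytosines.
    Keeps the read if at least two-thirds of them are not N."""
    called = sum(1 for b in cpg_calls if b != "N")
    return 3 * called >= 2 * len(cpg_calls)


def keep_cytosine(c_counts, next_counts, is_cpg):
    """c_counts / next_counts: base-call tallies (keys A, C, G, T, N) at an
    expected cytosine and at its 3' neighbour, aggregated over one amplicon.
    Returns False if any of the four exclusion criteria is met."""
    if frac(c_counts, "N") > 0.40:       # low-quality position
        return False
    if frac(c_counts, "CT") < 0.80:      # putative SNP or low quality
        return False
    g_next = frac(next_counts, "G")
    if is_cpg and g_next < 0.60:         # putative SNP disrupting the CpG
        return False
    if not is_cpg and g_next > 0.40:     # putative SNP creating a new CpG
        return False
    return True
```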
[0091] Conversion efficiency is defined as the average conversion fraction of non-CpG cytosines to thymines. Samples with amplicons with conversion efficiency <0.97 were repeated.
[0092] A methylation fraction was calculated at each CpG position and mean methylation fraction of an amplicon is defined as the average methylation fraction of all the considered CpG cytosines.
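As a worked illustration of these two definitions, the sketch below computes both quantities from per-position counts of C and T calls. It assumes the conventional ratios for converted DNA, namely T/(C+T) as the conversion fraction at a non-CpG cytosine and C/(C+T) as the methylation fraction at a CpG cytosine; the function names are hypothetical.

```python
from statistics import mean


def conversion_efficiency(non_cpg_sites):
    """non_cpg_sites: list of (c_count, t_count), one per retained non-CpG
    cytosine. Returns the average conversion fraction T / (C + T)."""
    return mean(t / (c + t) for c, t in non_cpg_sites)


def mean_methylation(cpg_sites):
    """cpg_sites: list of (c_count, t_count), one per retained CpG cytosine.
    Returns (per-site methylation fractions, amplicon mean)."""
    per_site = [c / (c + t) for c, t in cpg_sites]
    return per_site, mean(per_site)


def passes_conversion_qc(non_cpg_sites, minimum=0.97):
    """True if the amplicon meets the 0.97 conversion-efficiency cut-off."""
    return conversion_efficiency(non_cpg_sites) >= minimum
```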
[0093] In addition to the mean methylation frequency of an amplicon, the methylation pattern in DNA sequences can also contain information about their source. To supplement the methylation frequency data, cancer-specific methylation patterns were evaluated via two alternative approaches, namely N-gram and Skip-gram.
[0094] The N-gram approach, which is similar to a technique used in Natural Language Processing, was adopted to capture pattern-specific information and create new features that can be further analyzed (Fig. 9). An N-gram model is a type of probabilistic language model for predicting the next item in a sequence in the form of an (n - 1)-order Markov model. N-gram features such as bigram, trigram, quad-gram and pentagram combinations were constructed to capture methylation patterns in 2, 3, 4 or 5 adjacent CpG sites, respectively. As amplicons that cover more CpG sites would have higher numbers of N-gram combinations, the N-gram for each amplicon was normalized by taking the average over all the reads and then dividing by the maximum number of N-grams that can be formed for the particular amplicon. Of all possible N-gram combinations derived herein, grid search was performed to reduce the number of features, which were then used for the training of a logistic regression model for cancer/non-cancer prediction.
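The normalization described above may be illustrated with the following sketch. The encoding of each consensus read as a string with one character per CpG site ('M' for methylated, 'U' for un-methylated) and the function name `ngram_features` are assumptions made for the example.

```python
from collections import Counter


def ngram_features(reads, n):
    """reads: per-molecule methylation strings for one amplicon, one
    character per CpG site ('M' or 'U'; assumed encoding).
    Returns {pattern: normalized frequency} for N-grams of length n."""
    n_sites = len(reads[0])
    max_ngrams = n_sites - n + 1      # windows of n adjacent CpGs per read
    if max_ngrams < 1:
        return {}
    totals = Counter()
    for read in reads:
        for i in range(max_ngrams):
            totals[read[i:i + n]] += 1
    # Average over reads, then divide by the maximum number of N-grams
    # that the amplicon can form.
    return {p: c / len(reads) / max_ngrams for p, c in totals.items()}
```

With this normalization the feature values of one amplicon sum to 1 for a given n, so amplicons with many CpG sites do not dominate those with few.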
[0095] Skip-gram, another approach used in Natural Language Processing, was adopted to find patterns between initially unrelated or non-adjacent CpG sites by skipping N number of sites between 2 sites within an amplicon (Fig. 10). Similar to the N-gram approach, the Skip-gram for each amplicon was normalized to account for different numbers of CpG sites in different amplicons.
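A corresponding sketch for Skip-gram features is shown below. As before, the 'M'/'U' string encoding and the function name are assumptions, and the normalization (average over reads divided by the number of site pairs available at the given skip) is chosen by analogy with the N-gram normalization and is not specified in this form in the disclosure.

```python
from collections import Counter


def skipgram_features(reads, skip):
    """reads: per-molecule methylation strings ('M'/'U' per CpG site).
    skip: number of CpG sites skipped between the two paired sites.
    Returns {two-character pattern: normalized frequency}."""
    n_sites = len(reads[0])
    max_pairs = n_sites - skip - 1    # pairs (i, i + skip + 1) per read
    if max_pairs < 1:
        return {}
    totals = Counter()
    for read in reads:
        for i in range(max_pairs):
            totals[read[i] + read[i + skip + 1]] += 1
    return {p: c / len(reads) / max_pairs for p, c in totals.items()}
```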
[0096] Logistic regression model for cancer prediction
[0097] To build a training set for a logistic regression model for cancer prediction, >200 samples from healthy individuals and cancer patients were processed and analyzed. Using the data from these samples, methylation across individual CpG sites within an amplicon was examined for concordance.
[0098] Concordance of CpG methylation across CpG sites for an amplicon was computed using the Pearson Correlation to identify highly correlated features. Absolute values of the Pearson Correlation Coefficient (PCC) were calculated for methylation/non-methylation at each pair of CpG positions, and a PCC threshold of >0.8 was used for filtering out highly concordant features. This was to ensure that there was no multicollinearity among the amplicons (independent variables) when building the Logistic Regression Model.
[0099] For amplicons with concordance of >0.8 for CpG methylation, the average percent methylation across all the sites was considered as one feature. For amplicons with poor CpG methylation concordance (<0.8), the methylation frequency of each CpG within the amplicon was considered as a separate feature.
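The feature construction described above may be sketched as follows. The disclosure does not state how pairwise PCC values are summarized into a single concordance value per amplicon; the sketch assumes that an amplicon is concordant when every pairwise absolute PCC exceeds the threshold. The function names and the feature name `amplicon_mean` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences
    (undefined if either sequence has zero variance)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def amplicon_features(site_values, threshold=0.8):
    """site_values: {cpg_name: [methylation fraction per sample]} for one
    amplicon. Returns {feature_name: [value per sample]}."""
    names = list(site_values)
    # Assumed rule: concordant if all pairwise |PCC| values exceed threshold.
    concordant = all(
        abs(pearson(site_values[a], site_values[b])) > threshold
        for a, b in combinations(names, 2)
    )
    if concordant:
        n_samples = len(site_values[names[0]])
        avg = [sum(site_values[s][i] for s in names) / len(names)
               for i in range(n_samples)]
        return {"amplicon_mean": avg}
    # Poor concordance: keep each CpG as a separate feature.
    return dict(site_values)
```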
[00100] Methylation frequency for each feature was log transformed using the formula np.log(x + 0.00001), where x is the methylation frequency, also known as the methylation beta-value. Highly correlated features were removed from the model by calculating the variance inflation factor (VIF) score for each feature to detect multicollinearity. Features with the highest VIF scores were then dropped iteratively until the maximum VIF score was <10.
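For illustration, the log transform and the iterative VIF-based pruning can be sketched with NumPy as below. The VIF of feature j is computed in the standard way as 1/(1 - R_j^2), where R_j^2 is obtained by regressing feature j on the remaining features with an intercept; the function names are hypothetical and the disclosure does not identify the library used for this step.

```python
import numpy as np


def log_transform(x):
    """Log transform of methylation beta-values with a small offset."""
    return np.log(np.asarray(x, dtype=float) + 0.00001)


def vif_scores(X):
    """X: (n_samples, n_features) array. Returns one VIF per column."""
    n, p = X.shape
    scores = np.empty(p)
    for j in range(p):
        y = X[:, j]
        # Regress column j on the other columns plus an intercept.
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())
        scores[j] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return scores


def drop_high_vif(X, names, max_vif=10.0):
    """Drop the feature with the highest VIF, one at a time, until the
    maximum VIF is below max_vif. Returns (reduced array, kept names)."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        scores = vif_scores(X)
        worst = int(np.argmax(scores))
        if scores[worst] < max_vif:
            break
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return X, names
```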
[00101] Recursive feature elimination (RFE) was performed on the remaining features to determine the set of features for the best performance of the model. If a sample was healthy, its corresponding target array value was set to 0, while if a sample was cancerous, its corresponding target array value was set to 1. For the model-building set, the scikit-learn (http://scikit-learn.org) package’s LogisticRegression module was used to machine-learn parameters for an LR classifier using the log-transformed methylation signatures as features and the target array as the target values. The liblinear solver implemented in scikit-learn (http://www.csie.ntu.edu.tw/~cjlin/liblinear) was utilized for this process. In order to avoid overfitting and build a robust model, a cross-validation approach with a ridge (L2) penalty was utilized. 3-fold cross validation was performed for each iteration of the model and a probability threshold of 0.5 was used to assign samples as normal (<0.5) or cancer (>0.5). Sensitivity and specificity values were calculated for each fold, and the overall sensitivity and specificity were reported as the average of the fold scores.
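A minimal scikit-learn sketch of this training and evaluation loop is given below. It is not the pipeline of the disclosure: the number of features retained by RFE, the use of stratified and shuffled folds, the random seed and the placement of RFE inside each fold are assumptions, and the function name is hypothetical.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline


def cross_validated_sens_spec(X, y, n_features=5, threshold=0.5, seed=0):
    """X: (n_samples, n_features) log-transformed features.
    y: array of 0 (healthy) / 1 (cancer).
    Returns mean sensitivity and mean specificity over 3 folds."""
    sens, spec = [], []
    folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
    for train, test in folds.split(X, y):
        # The L2 (ridge) penalty is LogisticRegression's default.
        model = make_pipeline(
            RFE(LogisticRegression(solver="liblinear"),
                n_features_to_select=n_features),
            LogisticRegression(solver="liblinear"),
        )
        model.fit(X[train], y[train])
        # Probability above the threshold is called cancer (1).
        pred = (model.predict_proba(X[test])[:, 1] > threshold).astype(int)
        tp = np.sum((pred == 1) & (y[test] == 1))
        tn = np.sum((pred == 0) & (y[test] == 0))
        sens.append(tp / np.sum(y[test] == 1))
        spec.append(tn / np.sum(y[test] == 0))
    return float(np.mean(sens)), float(np.mean(spec))
```

Fitting RFE inside each fold keeps the held-out samples out of feature selection, which is one common way to limit optimistic bias in the reported sensitivity and specificity.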
[00102] The utilities of methylation frequency and of methylation patterns by N-gram and Skip-gram for the detection of cancer were evaluated by training individual logistic regression models using a dataset derived from 60 healthy individuals, 39 early-stage (stage I-III) lung cancer patients and 56 late-stage lung cancer patients. 3-fold cross validation, each with 50 repeats, of the logistic regression models with a threshold set at 95% specificity demonstrated 33.96-54.72% and 81.61-86.21% sensitivity of detection of early- and late-stage lung cancers, respectively (Fig. 11).
[00103] Plasma cfDNA methylation, cfDNA levels, cfDNA fragmentation profiles and ctDNA detection can each provide complementary information for enhanced accuracy in cancer detection. The combinatorial amplicon-based panel approach described herein combining the detection of cfDNA methylation, cfDNA concentration and ctDNA fragmentation profile can mitigate the limitations of individual approaches and improve the overall accuracy of cancer detection. Thus, a machine learning classifier prediction model that integrates these multiple classes of data generated from plasma cfDNA was used.
[00104] To establish a prediction classifier model of normal vs lung cancer status, individual logistic regression models of mAF, N-gram, Skip-gram or aggregate cfDNA ‘Biomarker’ features were first trained using the dataset of 60 healthy individuals, 39 early-stage (stage I-III) lung cancer patients and 56 late-stage lung cancer patients. Aggregate cfDNA Biomarker features encompass plasma cfDNA concentration, fragment size ratio and the ctDNA detection score, each of which is log normalized. The ctDNA detection score was determined by first classifying each variant into one of six classes based on evidence in public databases for the prevalence and pathogenesis of the variant in cancer. Each of the classes was assigned a score, with the highest score assigned to the highest class, and the ctDNA detection score of each sample was calculated by aggregating the score multiplied by the allele frequency of each variant detected. The ctDNA detection score, plasma cfDNA concentration and cfDNA fragment size were incorporated into a single ‘Biomarkers’ logistic regression model. A Stacking Ensemble technique was adopted to merge the 3 (mAF + Biomarkers + N-gram or Skip-gram) models and generate a final prediction probability value for cancer.
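The arithmetic of the ctDNA detection score may be illustrated as follows. The class scores used here (1 to 6) are hypothetical placeholders, since the disclosure states only that six classes exist and that the highest class receives the highest score; aggregation is assumed to be a sum, and the pseudocount in the log normalization is likewise an assumption.

```python
import math

# Hypothetical class scores (assumption): the source gives no numeric values.
CLASS_SCORES = {1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0, 5: 5.0, 6: 6.0}


def ctdna_detection_score(variants, class_scores=CLASS_SCORES):
    """variants: iterable of (variant_class, allele_frequency), one entry
    per detected variant. Returns the sum of class score x allele frequency."""
    return sum(class_scores[cls] * af for cls, af in variants)


def log_normalize(value, pseudocount=1e-5):
    """Log transform with a small assumed pseudocount so that a score of
    zero (no variants detected) remains defined."""
    return math.log(value + pseudocount)
```

For example, under these placeholder scores a sample with a class 6 variant at 2% allele frequency and a class 3 variant at 10% allele frequency would score 6 x 0.02 + 3 x 0.10 = 0.42 before log normalization.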
[00105] At a specificity of 95%, 3-fold cross-validation analysis using an Ensemble mAF + Biomarkers + N-gram model yielded an average sensitivity of 79.49% and 91.07% for early- and late-stage lung cancer, respectively, with an overall sensitivity of 86.32% (Fig. 12). Considering both early and late-stage detection sensitivities, the combinatorial approach provided an additional diagnostic value of 24.8-45.5% for early-stage and 4.9-9.5% for late-stage lung cancer compared with individual models alone, supporting the clinical utility of the combinatorial approach.
[00106] Measurement of protein tumor marker levels is also commonly used in cancer screening and detection. Its utility in the combinatorial multi-omic approach is demonstrated by the assessment of plasma CEA levels. Plasma CEA levels, detected by the Beckman Access II immunoanalyzer, were higher in lung cancer samples compared to normal controls, giving a sensitivity of 46.15% and 73.21% for early and late-stage lung cancer detection, respectively, at a specificity of 95% (Fig. 12). When combined with a mAF, N-gram and Biomarkers Ensemble prediction model, the addition of CEA provided an additional diagnostic sensitivity of 5.2% and 3.6% in the detection of early and late-stage lung cancer, respectively.
[00107] Random forest model for determination of cancer type
[00108] For samples predicted to be cancer by the logistic regression model described above, a random forest classification algorithm was trained for identification of the specific type of cancer using data from several types of cancer samples, including breast, colorectal, lung and ovarian cancers.
[00109] Feature selection was done using the ANOVA F-test via the f_classif() function from scikit-learn (https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html). F-scores for all methylation sites across 4 different cancer categories were computed and ranked.
[00110] Random Forest, as implemented in the scikit-learn (http://scikit-learn.org) package’s RandomForestClassifier module, was used, with the methylation signatures as features and cancer type as the target label. The default settings of the RandomForestClassifier were used. For robustness, five rounds of 3-fold CV were performed for each iteration of the model. The performance of the Random Forest Classifier seemed to plateau at around 23 features, and these were selected as the final features for the model. For prediction, probability scores were calculated for each cancer type, and the cancer type with the highest probability score was predicted as the likely cancer type for that particular sample.
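The feature ranking and cancer-type prediction described above may be sketched with scikit-learn as follows. Chaining SelectKBest with f_classif in front of a default RandomForestClassifier is an assumed way of keeping the 23 top-ranked features; the fixed random seed and the function names are likewise assumptions made for reproducibility of the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline


def build_type_classifier(k=23, seed=0):
    """ANOVA F-test selection of the k top-ranked features, followed by a
    random forest with otherwise default settings."""
    return make_pipeline(SelectKBest(f_classif, k=k),
                         RandomForestClassifier(random_state=seed))


def repeated_cv_accuracy(model, X, y, seed=0):
    """Five rounds of 3-fold cross validation; returns 15 accuracy scores."""
    cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=5, random_state=seed)
    return cross_val_score(model, X, y, cv=cv)


def predict_cancer_type(model, X):
    """Returns (predicted type per sample, per-type probability matrix);
    the predicted type is the one with the highest probability score."""
    proba = model.predict_proba(X)
    return model.classes_[np.argmax(proba, axis=1)], proba
```

The columns of the probability matrix follow the order of `model.classes_`, so each row can be read as the per-type probability scores for one sample.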
[00111] Finally, individual sensitivities and specificities for each cancer type across all 5 iterations of the models were combined and reported.
[00112] All analyses and both modelling parts were conducted in the Python programming language, version 3.7.3.
[00113] Results
[00114] The present disclosure describes the methodology for the identification of methylated DNA for the detection of early-stage cancer, minimal residual disease following cancer surgery or therapy, and cancer relapse, with high sensitivity and specificity, especially in situations in which these diseases are undetectable by conventional screening methods. In one example, a blood-based test is used for the identification of methylated signatures in plasma cell-free DNA (cfDNA) that can indicate the presence of cancer and specify its tissue of origin (i.e. cancer type) before the development of overt symptoms. To identify sites of DNA methylation, the present disclosure uses enzymatic conversion as an alternative to conventional sodium bisulfite treatment to convert un-methylated cytosines to uracils. First, cfDNA was treated with TET2 enzyme, which oxidizes 5-methylcytosine and 5-hydroxymethylcytosine, protecting these bases from deamination by APOBEC in the next step. Next, the cfDNA was purified using AMPure XP beads prior to the addition of APOBEC enzyme, which deaminates un-methylated cytosines to uracils. Lastly, purification using AMPure XP beads generated single-stranded DNA that is similar to sodium-bisulfite-converted cfDNA, but typically obtained in higher recovery yields and with less fragmentation than bisulfite-converted DNA. As little as 5 ng of starting cfDNA has been successfully put through conversion, library preparation and sequencing in the present workflow.
[00115] In the target capture and library amplification step, the converted cfDNA molecules were selectively enriched using a multiplicity of primers specific to the converted sequence of target regions in a single PCR reaction. The converted cfDNA was added to a PCR reaction containing more than 5 ‘forward’ and ‘reverse’ primer pairs and subjected to 2, 3, 4 or 5 cycles of PCR in a first limited amplification reaction. As each forward primer carries on its 5’ end a randomly assigned barcode sequence, this PCR allows individual cfDNA molecules to be tagged uniquely in this first step of sequencing library formation. Subsequently, the reactions were purified to remove excess primers. A final PCR amplification with universal indexed primers was done to create libraries with components required for multiplex sequencing on a next-generation sequencing platform such as Illumina.
[00116] Each ‘forward’ and ‘reverse’ primer pair forms an amplicon that covers at least 1 CpG site. In CpG-rich regions in which it was not possible to design primers not overlapping CpG sites, the primer designs incorporate degeneracy for the capture of both un-methylated and methylated CpGs and thus overcome methylation-related drop-off of coverage and capture.
[00117] Following sequencing, the presence of a barcode sequence was detected using specialized bioinformatics methods to count and assign each DNA sequence from high-throughput sequencing to an original parental DNA molecule carrying the same tag. In the method as disclosed herein, the parental DNA molecule is the original cfDNA molecule right after enzymatic conversion.
[00118] The cfDNA methylation pattern of the biological sample was then reconstructed. The number of unique cfDNA molecules corresponding to targeted regions of the genome was enumerated. The specific DNA methylation pattern of each molecule was reconstructed by comparing to a reference genome using a sequence alignment tool (for example, bwa-meth) designed for the alignment of bisulfite-converted sequences. Variations of the samples’ genome sequence compared to this reference genome were detected by variant analysis. This allows for the assessment of 1) the conversion efficiency of non-CpG cytosines to thymine as quality control and 2) the frequency of methylation at each CpG cytosine across all consensus sequencing reads with distinct barcode sequences (i.e. reads that are derived from unique cfDNA molecules corresponding to a specific amplicon). The methylation information was therefore obtained and, because of the incorporation of barcode sequences at the PCR step, low-level errors of sequencing were suppressed (<1%), which allowed for accurate determination of the methylation status at each CpG site.
[00119] For training of a statistical model for cancer vs non-cancer prediction, methylation across individual CpG sites within an amplicon was examined for pairwise concordance using a large number of samples. Concordance of CpG methylation across CpG sites for an amplicon was computed using the Pearson Correlation to identify highly correlated features. Absolute values of the Pearson Correlation Coefficient (PCC) were calculated for methylation/non-methylation at each pair of CpG positions, and a PCC threshold of >0.8 was used to filter out highly concordant features. This was to ensure that there was no multicollinearity among the amplicons (independent variables) when building the Logistic Regression Model.
[00120] For amplicons with concordance of >0.8 for CpG methylation, the average percent methylation across all the sites was considered as one feature. For amplicons with poor CpG methylation concordance (<0.8), the methylation frequency of each CpG within the amplicon was considered as an individual feature. Methylation data obtained from 209 plasma cfDNA samples (57 normal, 152 cancer) were then used as a “training set” for a logistic regression model to calculate probabilities and AUC/ROC curves for cancer prediction. 3-fold cross validation of this model reported 94.1% sensitivity and 87.7% specificity for cancer detection (Fig. 9). Data from specific cancer types were also used for a second model in which a random forest classification algorithm is used to predict the tissue of origin (or cancer type) in samples in which cancer was detected in the first step (Fig. 9).
[00121] Discussion
[00122] The method of the present disclosure has the following advantages:
1. The target regions to be analysed were selected based on externally validated regions and from genome-wide analyses of methylation data in the TCGA database. Even when using a relatively small panel of 22 targets, the method has a high sensitivity (>90%) and specificity (90%) for the detection of cancer. This panel has been expanded to include more targets (159) and can be further expanded. Increased target number greatly improved the sensitivity and specificity of the test. The combination of target regions, and their associated CpG sites that are covered by each primer pair, renders the present method novel.
2. The method of the present disclosure may be used on a blood-based test (for example, to detect methylated DNA pattern in cfDNA in the blood) that is fast and non-invasive (only one draw of blood is needed). In addition, the method is scalable for the detection of multiple cancers in a single test and is suitable for cancer screening in an asymptomatic population.
3. Unlike somatic mutations, which can occur anywhere along the length of a gene and are thus difficult to profile comprehensively, DNA methylation in cancer occurs predominantly in CpG islands within gene promoter regions and is thus more accessible to comprehensive profiling. In addition, DNA methylation typically occurs in a tissue-specific manner, which increases the specificity of identifying the tissue of origin of the cancer. The frequency of methylation can be calculated, which gives an indication of tumor load and can be used for disease monitoring.
4. Enzymatic conversion of un-methylated cytosines to uracils enables high efficiency conversion with little fragmentation and loss in DNA yield.
5. Primers with barcode sequences in a multiplex amplicon-capture assay allow the suppression of low-level errors due to sequencing and improve sensitivity of identification of methylated sites with high confidence.
6. Degenerate primers are used to capture both methylated and un-methylated strands in regions that are CpG rich and would otherwise be inaccessible with regular primers for bisulfite sequencing.
7. The initial multiplex PCR reaction is scalable and allows the capture of multiple genomic regions for the identification of several cancer types in a single assay.
8. Use of dual index combinations reduces the possibility of index swapping during sequencing.
9. A statistical model trained using methylation sequencing data from hundreds of known normal and cancer cfDNA enables accurate detection of cancer in independent samples.
10. The technological significance lies in the generalizable use of primers for target capture, which allows working with smaller, limiting amounts of DNA, especially when enzymatic conversion is used instead of conventional sodium bisulfite treatment. In addition, the unique combination of targets is selected for the sensitive and specific detection of multiple cancers.
11. The method of the present disclosure may be used in the following applications:
• Identification of methylation signatures specific to cancers.
• Identification of methylation signatures that are specific to particular cancers.
• Cancer screening in healthy individuals and individuals at high risk for the tested cancers. One of the intended uses of the method of the present disclosure is for cancer screening and early cancer detection. In a validation experiment, the method of the present disclosure showed 90.9% sensitivity for the detection of stage I colorectal cancer.
• Disease monitoring in cancer patients, including monitoring response to treatment and cancer relapse, and detecting minimal residual disease (MRD) following cancer surgery or therapy. The method of the present disclosure is suitable for regular disease monitoring as only a blood draw is required.
Claims
1. A method of detecting methylated DNA pattern in DNA in a biological sample, comprising:
(a) converting un-methylated cytosine of the DNA to uracil by deamination to thereby generate converted DNA;
(b) purifying the converted DNA from step (a);
(c) tagging a barcode sequence on the converted DNA, by performing a first PCR amplification using a plurality of forward and reverse primer pairs specific to the converted sequence of target regions, wherein each forward primer of the plurality of forward and reverse primer pairs comprises a barcode sequence on its 5’ end, wherein the barcode sequence of each forward primer is different from each other, wherein the target regions comprise one or more CpG sites;
(d) subjecting the tagged converted DNA from step (c) to a second PCR amplification with universal indexed primers to thereby create a sequencing library with components required for multiplex sequencing;
(e) subjecting the sequencing library to multiplex sequencing on a next-generation sequencing platform;
(f) detecting the presence of a barcode sequence using Bioinformatics methods to count and assign each DNA sequence from the next-generation sequencing to an original parental DNA molecule carrying the same barcode sequence, comprising:
(i) performing cluster reassignment of sequencing reads with the same barcode sequence to thereby generate barcode clusters wherein each barcode cluster contains reads from the same amplicon and with the same barcode sequence; and
(ii) performing consensus calling for each barcode cluster to thereby obtain consensus reads;
(g) reconstructing the methylated DNA pattern of the DNA by
(I) comparing the DNA sequence to a reference genome using a sequence alignment tool; and
(II) conducting variant analysis of the DNA sequence by comparing the consensus reads to the reference genome to detect the variations;
to thereby assess 1) the conversion efficiency of non-CpG cytosines to thymine as quality control, and 2) the frequency of methylation at each CpG cytosine across all consensus sequencing reads with distinct barcode sequences (i.e. reads that are derived from unique DNA molecules corresponding to a specific amplicon).
2. The method of claim 1, further comprising the following steps before step (a):
1) extracting the DNA from the biological sample;
2) treating the DNA using a first enzyme that oxidizes 5-methylcytosine and 5-hydroxymethylcytosine of the DNA to thereby protect the 5-methylcytosine and 5-hydroxymethylcytosine from deamination in step (a); and
3) purifying the DNA from step (2).
3. The method of claim 2, wherein in step (a), the un-methylated cytosine of the DNA is converted to uracil by deamination using a second enzyme to thereby generate converted DNA.
4. The method of claim 1, wherein the DNA is unextracted from the biological sample, and wherein in step (a), the un-methylated cytosine of the DNA is converted to uracil by deamination using bisulfite to thereby generate converted DNA.
5. The method of any one of the preceding claims, further comprising:
(h) performing a statistical modelling technique to thereby predict presence or absence of cancer; and
(i) performing a statistical modelling technique to thereby identify specific cancer types when the presence of cancer is predicted in step (h).
6. The method of claim 5, wherein the statistical modelling technique is selected from the group consisting of logistic regression, tree based classifiers and deep neural networks.
7. The method of any one of the preceding claims, further comprising the following step prior to step (h):
analyzing methylated DNA pattern by capturing methylation pattern-specific information and generating new features and/or determining patterns within an amplicon as data input for statistical modelling techniques of step (h) and/or step (i).
8. The method of any one of the preceding claims, wherein the biological sample is selected from the group consisting of a liquid sample, a tissue sample, or a cell sample.
9. The method of claim 1, wherein the DNA is selected from the group consisting of cell-free DNA (cfDNA) and DNA encapsulated within tissues and/or cells.
10. The method of claim 2, wherein the first enzyme is a Ten-eleven translocation (TET) enzyme or an isoform thereof; and wherein optionally, the TET enzyme is selected from the group consisting of TET1 enzyme or an isoform thereof, TET2 enzyme or an isoform thereof, and TET3 enzyme or an isoform thereof.
11. The method of claim 3, wherein the second enzyme is a cytidine deaminase or an enzyme with cytidine deaminase properties; wherein optionally, the cytidine deaminase is selected from the group consisting of APOBEC enzyme, CDA, and activation-induced cytidine deaminase; and wherein optionally, the enzyme with cytidine deaminase properties is selected from the group consisting of M.SssI and M.HpaII.
12. The method of any one of the preceding claims, wherein the amount of DNA is at least 5 ng.
13. The method of any one of the preceding claims, wherein the barcode sequence is an oligonucleotide comprising 10 to 16 random nucleotides, or 10 to 15 random nucleotides, or 10 to 13 random nucleotides, or 10 random nucleotides, or 11 random nucleotides, or 12 random nucleotides, or 13 random nucleotides, or 14 random nucleotides, or 15 random nucleotides, or 16 random nucleotides; and wherein optionally, the barcode sequence is an oligonucleotide having 10 random nucleotides.
14. The method of any one of the preceding claims, wherein the number of the forward and reverse primer pairs is at least 5; and wherein optionally, the number of the forward and reverse primer pairs is at least 159.
15. The method of any one of the preceding claims, wherein each forward and reverse primer pair covers a target region which comprises at least 1 CpG site.
16. The method of any one of the preceding claims, wherein the forward and reverse primer pairs comprise sequences as disclosed in Table 1.
17. The method of any one of the preceding claims, wherein the forward primer in the primer pair comprises one or more degenerate bases, and/or the reverse primer in the primer pair comprises one or more degenerate bases; wherein optionally, the degenerate base is selected from the group consisting of C, T, A and G.
18. The method of claim 17, wherein each primer of the primer pair comprises 1, 2, or 3 degenerate bases; and wherein optionally, each primer of the primer pair has one degenerate base.
19. The method of any one of claims 1-18, wherein the primer pair does not comprise a degenerate base.
20. The method of any one of the preceding claims, wherein the first PCR amplification comprises a number of PCR cycles selected from the group consisting of 2, 3, 4 and 5 PCR cycles.
21. The method of claim 1, wherein the universal indexed primers comprise: a forward primer comprising the sequence of
AATGATACGGCGACCACCGAGATCTACACCTAGCGCTACACTCTTTCCCTACACGACGCTCTTCCGATC*T; and
a reverse primer comprising the sequence of
CAAGCAGAAGACGGCATACGAGATAACCGCGGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC*T.
22. The method of claim 1, wherein the methylated DNA pattern is reconstructed using a Bioinformatics analysis method selected from the group consisting of bwa-meth, Bismark, MethylDackel, bisulfite-treated reads analysis tools (BRAT), methyQA, mrsFAST, BSMAP, VerJInxer, RMAP-bs, MethylCoder, BS-seeker2, and Bison.
23. A kit for detecting methylated DNA pattern in DNA in a biological sample according to the method of claim 1, comprising:
(a) a first enzyme capable of oxidizing 5-methylcytosine and 5-hydroxymethylcytosine of the DNA;
(b) a second enzyme capable of converting un-methylated cytosine of the DNA to uracil by deamination;
(c) a plurality of forward and reverse primer pairs specific to the converted sequence of target regions, wherein each forward primer of the plurality of forward and reverse primer pairs comprises a barcode sequence on its 5’ end, wherein the barcode sequence of each forward primer is different from each other, wherein the target regions comprise one or more CpG sites;
(d) a plurality of universal indexed primers for creating the sequencing library;
(e) a first DNA polymerase capable of amplifying DNA with uracil bases, for amplification of converted DNA;
(f) a reagent capable of removing excess primers;
(g) a second DNA polymerase capable of amplifying DNA, for creating the sequencing library; and
(h) sodium bisulfite.
24. The kit of claim 23, wherein the first enzyme is selected from the group consisting of a Ten-eleven translocation (TET) enzyme or an isoform thereof; and wherein optionally, the TET enzyme is selected from the group consisting of TET1 enzyme or an isoform thereof, TET2 enzyme or an isoform thereof, and TET3 enzyme or an isoform thereof.
25. The kit of claim 23, wherein the second enzyme is a cytidine deaminase or an enzyme with cytidine deaminase properties; wherein optionally, the cytidine deaminase is selected from the group consisting of APOBEC enzyme, CDA, and activation-induced cytidine deaminase; and wherein optionally, the enzyme with cytidine deaminase properties is selected from the group consisting of M.SssI and M.HpaII.
26. The kit of claim 23, wherein the reagent is selected from the group consisting of paramagnetic beads and single- strand exonucleases; wherein optionally the paramagnetic beads are selected from the group consisting of AMPure XP beads, SPRI beads, and Dynabeads.
27. The kit of claim 23, wherein the barcode sequence is an oligonucleotide comprising 10 to 16 random nucleotides, or 10 to 15 random nucleotides, or 10 to 13 random nucleotides, or 10 random nucleotides, or 11 random nucleotides, or 12 random nucleotides, or 13 random nucleotides, or 14 random nucleotides, or 15 random nucleotides, or 16 random nucleotides; and wherein optionally, the barcode sequence is an oligonucleotide having 10 random nucleotides.
28. The kit of claim 23, wherein the number of the forward and reverse primer pairs is at least 5; and wherein optionally, the number of the forward and reverse primer pairs is at least 159.
29. The kit of claim 23, wherein each forward and reverse primer pair covers at least 1 CpG site.
30. The kit of claim 23, wherein the forward and reverse primer pairs comprise sequences as disclosed in Table 1.
31. The kit of claim 23, wherein the forward primer in the primer pair comprises one or more degenerate bases, and/or the reverse primer in the primer pair comprises one or more degenerate bases; and wherein optionally, the degenerate base is selected from the group consisting of C, T, A and G.
32. The kit of claim 31, wherein each primer of the primer pair comprises 1, 2, or 3 degenerate bases; and wherein optionally, each primer of the primer pair has one degenerate base.
33. The kit of claim 23, wherein the primer pair does not comprise a degenerate base.
34. The kit of claim 23, wherein the universal indexed primers comprise: a forward primer comprising the sequence of AATGATACGGCGACCACCGAGATCTACACCTAGCGCTACACTCTTTCCCTACACGACGCTCTTCCGATC*T; and a reverse primer comprising the sequence of CAAGCAGAAGACGGCATACGAGATAACCGCGGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC*T.
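The barcode lengths recited in claim 27 and the conversion chemistry recited in claim 23 can be illustrated with a minimal sketch. It assumes that each barcode position is drawn from the four bases A, C, G and T, and that an unprotected cytosine is read as thymine after deamination and amplification while a protected (methylated) cytosine is still read as cytosine. The function names `barcode_space` and `convert_unmethylated` and the example sequence are hypothetical and do not come from the specification.

```python
from typing import Set


def barcode_space(length: int) -> int:
    """Number of distinct barcodes made of `length` random nucleotides (A, C, G, T)."""
    return 4 ** length


def convert_unmethylated(seq: str, methylated_positions: Set[int]) -> str:
    """In-silico conversion of a single strand.

    Cytosines at 0-based positions listed in `methylated_positions` are treated
    as protected and stay C; every other C is reported as T, which is how a
    deaminated cytosine (uracil) appears after amplification.
    """
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq.upper())
    )


if __name__ == "__main__":
    # Barcode diversity for the lengths recited in claim 27 (10 to 16 nucleotides).
    for n in range(10, 17):
        print(n, barcode_space(n))
    # Hypothetical fragment with one methylated CpG cytosine at position 1.
    print(convert_unmethylated("ACGTCCGA", {1}))
```

Under these assumptions a 10-nucleotide barcode gives 4^10 = 1,048,576 possible sequences, and only the protected cytosine in the example fragment survives conversion.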
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
SG10202105843Q | 2021-06-02 | ||
SG10202105843Q | 2021-06-02 |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2022255944A2 (en) | 2022-12-08 |
WO2022255944A3 (en) | 2023-01-12 |
Family
ID=84324624
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/SG2022/050367 WO2022255944A2 (en) | 2021-06-02 | 2022-05-30 | Method for detection and quantification of methylated dna |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2022255944A2 (en) |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018031760A1 (en) * | 2016-08-10 | 2018-02-15 | Grail, Inc. | Methods of preparing dual-indexed dna libraries for bisulfite conversion sequencing |
WO2019126313A1 (en) * | 2017-12-22 | 2019-06-27 | The University Of Chicago | Multiplex 5mc marker barcode counting for methylation detection in cell-free dna |
AU2020359506A1 (en) * | 2019-09-30 | 2022-03-10 | Integrated Dna Technologies, Inc. | Methods of preparing dual indexed methyl-seq libraries |
- 2022-05-30 WO PCT/SG2022/050367 patent/WO2022255944A2/en unknown
Also Published As
Publication number | Publication date |
---|---|
WO2022255944A3 (en) | 2023-01-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20210180139A1 (en) | Cancer detection methods | |
Milani et al. | DNA methylation for subtype classification and prediction of treatment outcome in patients with childhood acute lymphoblastic leukemia | |
CN111742062B (en) | Methylation markers for diagnosing cancer | |
DK2898100T3 (en) | NON-INVASIVE DETERMINATION OF A FOSTER METHYLOM OR PLASMA TUMOR | |
CA3126428A1 (en) | Compositions and methods for isolating cell-free dna | |
CN112236520A (en) | Methylation signatures and target methylation probe plates | |
WO2017201606A1 (en) | Cell-free detection of methylated tumour dna | |
EP3658684B1 (en) | Enhancement of cancer screening using cell-free viral nucleic acids | |
TW202124728A (en) | Determination of base modifications of nucleic acids | |
EP4235676A2 (en) | Methods for non-invasive assessment of genetic alterations | |
WO2016097120A1 (en) | Method for the prognosis of hepatocellular carcinoma | |
JP2014519319A (en) | Methods and compositions for detecting cancer through general loss of epigenetic domain stability | |
CN112941180A (en) | Group of lung cancer DNA methylation molecular markers and application thereof in preparation of lung cancer early diagnosis kit | |
US20230203590A1 (en) | Methods and means for diagnosing lung cancer | |
WO2023226939A1 (en) | Methylation biomarker for detecting colorectal cancer lymph node metastasis and use thereof | |
WO2022255944A2 (en) | Method for detection and quantification of methylated dna | |
Gallardo-Gómez et al. | Serum methylation of GALNT9, UPF3A, WARS, and LDB2 as non-invasive biomarkers for the early detection of colorectal cancer and premalignant adenomas | |
US20220290245A1 (en) | Cancer detection and classification | |
WO2022262831A1 (en) | Substance and method for tumor assessment | |
CN115772566B (en) | Methylation biomarker for auxiliary detection of lung cancer somatic ERBB2 gene mutation and application thereof | |
Gallardo-Gómez et al. | Serum methylation of GALNT9, UPF3A, WARS, and LDB2 as noninvasive biomarkers for the early detection of colorectal cancer and advanced adenomas | |
US20230323473A1 (en) | Methods for multimodal epigenetic sequencing assays | |
Michel et al. | Non-invasive multi-cancer diagnosis using DNA hypomethylation of LINE-1 retrotransposons | |
WO2024047250A1 (en) | Sensitive and specific determination of dna methylation profiles | |
TW202330938A (en) | Substance and method for evaluating tumor |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
NENP | Non-entry into the national phase | Ref country code: DE |