US20230242981A1 - Method for sequencing a direct repeat
- Publication number
- US20230242981A1 (application US18/057,201)
- Authority
- US
- United States
- Prior art keywords
- sequence
- repeat
- sequences
- template
- sequencing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 111
- 238000012163 sequencing technique Methods 0.000 title claims abstract description 41
- 108091081062 Repeated sequence (DNA) Proteins 0.000 claims abstract description 27
- 238000011144 upstream manufacturing Methods 0.000 claims abstract description 25
- 238000006243 chemical reaction Methods 0.000 claims abstract description 24
- 238000009396 hybridization Methods 0.000 claims abstract description 24
- 238000003786 synthesis reaction Methods 0.000 claims abstract description 10
- 108020004414 DNA Proteins 0.000 claims description 93
- 239000002773 nucleotide Substances 0.000 claims description 80
- 125000003729 nucleotide group Chemical group 0.000 claims description 80
- 239000012634 fragment Substances 0.000 claims description 65
- 230000003321 amplification Effects 0.000 claims description 22
- 238000003199 nucleic acid amplification method Methods 0.000 claims description 22
- 230000002441 reversible effect Effects 0.000 claims description 11
- 239000000758 substrate Substances 0.000 claims description 5
- 238000001574 biopsy Methods 0.000 claims description 3
- 230000000813 microbial effect Effects 0.000 claims description 3
- 230000003612 virological effect Effects 0.000 claims description 3
- 238000012545 processing Methods 0.000 claims description 2
- 239000002585 base Substances 0.000 description 83
- 239000000523 sample Substances 0.000 description 51
- 150000007523 nucleic acids Chemical class 0.000 description 43
- 102000039446 nucleic acids Human genes 0.000 description 42
- 108020004707 nucleic acids Proteins 0.000 description 42
- 230000000295 complement effect Effects 0.000 description 26
- 102000040430 polynucleotide Human genes 0.000 description 26
- 108091033319 polynucleotide Proteins 0.000 description 26
- 239000002157 polynucleotide Substances 0.000 description 26
- 108091034117 Oligonucleotide Proteins 0.000 description 24
- 230000035772 mutation Effects 0.000 description 14
- 210000001519 tissue Anatomy 0.000 description 13
- 210000004027 cell Anatomy 0.000 description 11
- 239000000203 mixture Substances 0.000 description 11
- JLCPHMBAVCMARE-UHFFFAOYSA-N [3-[[3-[[3-[[3-[[3-[[3-[[3-[[3-[[3-[[3-[[3-[[5-(2-amino-6-oxo-1H-purin-9-yl)-3-[[3-[[3-[[3-[[3-[[3-[[5-(2-amino-6-oxo-1H-purin-9-yl)-3-[[5-(2-amino-6-oxo-1H-purin-9-yl)-3-hydroxyoxolan-2-yl]methoxy-hydroxyphosphoryl]oxyoxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(5-methyl-2,4-dioxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxyoxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(5-methyl-2,4-dioxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(4-amino-2-oxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(5-methyl-2,4-dioxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(5-methyl-2,4-dioxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(4-amino-2-oxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(4-amino-2-oxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(4-amino-2-oxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(4-amino-2-oxopyrimidin-1-yl)oxolan-2-yl]methyl [5-(6-aminopurin-9-yl)-2-(hydroxymethyl)oxolan-3-yl] hydrogen phosphate Polymers 
Cc1cn(C2CC(OP(O)(=O)OCC3OC(CC3OP(O)(=O)OCC3OC(CC3O)n3cnc4c3nc(N)[nH]c4=O)n3cnc4c3nc(N)[nH]c4=O)C(COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3CO)n3cnc4c(N)ncnc34)n3ccc(N)nc3=O)n3cnc4c(N)ncnc34)n3ccc(N)nc3=O)n3ccc(N)nc3=O)n3ccc(N)nc3=O)n3cnc4c(N)ncnc34)n3cnc4c(N)ncnc34)n3cc(C)c(=O)[nH]c3=O)n3cc(C)c(=O)[nH]c3=O)n3ccc(N)nc3=O)n3cc(C)c(=O)[nH]c3=O)n3cnc4c3nc(N)[nH]c4=O)n3cnc4c(N)ncnc34)n3cnc4c(N)ncnc34)n3cnc4c(N)ncnc34)n3cnc4c(N)ncnc34)O2)c(=O)[nH]c1=O JLCPHMBAVCMARE-UHFFFAOYSA-N 0.000 description 10
- 206010069754 Acquired gene mutation Diseases 0.000 description 9
- 230000037439 somatic mutation Effects 0.000 description 9
- OPTASPLRGRRNAP-UHFFFAOYSA-N cytosine Chemical compound NC=1C=CNC(=O)N=1 OPTASPLRGRRNAP-UHFFFAOYSA-N 0.000 description 8
- 239000000463 material Substances 0.000 description 8
- 230000008569 process Effects 0.000 description 8
- 108091032973 (ribonucleotides)n+m Proteins 0.000 description 6
- 206010028980 Neoplasm Diseases 0.000 description 6
- -1 deoxyribose sugars Chemical class 0.000 description 6
- 238000001308 synthesis method Methods 0.000 description 6
- RWQNBRDOKXIBIV-UHFFFAOYSA-N thymine Chemical compound CC1=CNC(=O)NC1=O RWQNBRDOKXIBIV-UHFFFAOYSA-N 0.000 description 6
- GFFGJBXGBJISGV-UHFFFAOYSA-N Adenine Chemical compound NC1=NC=NC2=C1N=CN2 GFFGJBXGBJISGV-UHFFFAOYSA-N 0.000 description 5
- 102000053602 DNA Human genes 0.000 description 5
- 201000011510 cancer Diseases 0.000 description 5
- 238000005516 engineering process Methods 0.000 description 5
- UYTPUPDQBNUYGX-UHFFFAOYSA-N guanine Chemical compound O=C1NC(N)=NC2=C1N=CN2 UYTPUPDQBNUYGX-UHFFFAOYSA-N 0.000 description 5
- KDCGOANMDULRCW-UHFFFAOYSA-N 7H-purine Chemical compound N1=CNC2=NC=NC2=C1 KDCGOANMDULRCW-UHFFFAOYSA-N 0.000 description 4
- 241000124008 Mammalia Species 0.000 description 4
- ISAKRJDGNUQOIC-UHFFFAOYSA-N Uracil Chemical compound O=C1C=CNC(=O)N1 ISAKRJDGNUQOIC-UHFFFAOYSA-N 0.000 description 4
- 238000004458 analytical method Methods 0.000 description 4
- 238000000137 annealing Methods 0.000 description 4
- 230000015572 biosynthetic process Effects 0.000 description 4
- 229940104302 cytosine Drugs 0.000 description 4
- 238000003780 insertion Methods 0.000 description 4
- 230000037431 insertion Effects 0.000 description 4
- 150000002500 ions Chemical class 0.000 description 4
- 239000000178 monomer Substances 0.000 description 4
- 238000007481 next generation sequencing Methods 0.000 description 4
- 108090000623 proteins and genes Proteins 0.000 description 4
- 238000012175 pyrosequencing Methods 0.000 description 4
- 235000000346 sugar Nutrition 0.000 description 4
- 229930024421 Adenine Natural products 0.000 description 3
- 108700028369 Alleles Proteins 0.000 description 3
- HMFHBZSHGGEWLO-SOOFDHNKSA-N D-ribofuranose Chemical compound OC[C@H]1OC(O)[C@H](O)[C@@H]1O HMFHBZSHGGEWLO-SOOFDHNKSA-N 0.000 description 3
- 102000016928 DNA-directed DNA polymerase Human genes 0.000 description 3
- 108010014303 DNA-directed DNA polymerase Proteins 0.000 description 3
- 108091092878 Microsatellite Proteins 0.000 description 3
- 108091028664 Ribonucleotide Proteins 0.000 description 3
- PYMYPHUHKUWMLA-LMVFSUKVSA-N Ribose Natural products OC[C@@H](O)[C@@H](O)[C@@H](O)C=O PYMYPHUHKUWMLA-LMVFSUKVSA-N 0.000 description 3
- 208000003028 Stuttering Diseases 0.000 description 3
- 229960000643 adenine Drugs 0.000 description 3
- HMFHBZSHGGEWLO-UHFFFAOYSA-N alpha-D-Furanose-Ribose Natural products OCC1OC(O)C(O)C1O HMFHBZSHGGEWLO-UHFFFAOYSA-N 0.000 description 3
- 238000013459 approach Methods 0.000 description 3
- 210000004369 blood Anatomy 0.000 description 3
- 239000008280 blood Substances 0.000 description 3
- 210000001124 body fluid Anatomy 0.000 description 3
- 238000007796 conventional method Methods 0.000 description 3
- 238000012217 deletion Methods 0.000 description 3
- 230000037430 deletion Effects 0.000 description 3
- 239000005547 deoxyribonucleotide Substances 0.000 description 3
- 125000002637 deoxyribonucleotide group Chemical group 0.000 description 3
- 239000000975 dye Substances 0.000 description 3
- 230000000694 effects Effects 0.000 description 3
- 230000006870 function Effects 0.000 description 3
- 238000010369 molecular cloning Methods 0.000 description 3
- 239000012188 paraffin wax Substances 0.000 description 3
- 239000013610 patient sample Substances 0.000 description 3
- 229920000642 polymer Polymers 0.000 description 3
- 239000002336 ribonucleotide Substances 0.000 description 3
- 125000002652 ribonucleotide group Chemical group 0.000 description 3
- 238000012360 testing method Methods 0.000 description 3
- 229940113082 thymine Drugs 0.000 description 3
- YBJHBAHKTGYVGT-ZKWXMUAHSA-N (+)-Biotin Chemical group N1C(=O)N[C@@H]2[C@H](CCCCC(=O)O)SC[C@@H]21 YBJHBAHKTGYVGT-ZKWXMUAHSA-N 0.000 description 2
- 108091035707 Consensus sequence Proteins 0.000 description 2
- 108010025464 Cyclin-Dependent Kinase 4 Proteins 0.000 description 2
- 102100036252 Cyclin-dependent kinase 4 Human genes 0.000 description 2
- 230000005778 DNA damage Effects 0.000 description 2
- 231100000277 DNA damage Toxicity 0.000 description 2
- 102000004190 Enzymes Human genes 0.000 description 2
- 108090000790 Enzymes Proteins 0.000 description 2
- 102100028138 F-box/WD repeat-containing protein 7 Human genes 0.000 description 2
- ZHNUHDYFZUAESO-UHFFFAOYSA-N Formamide Chemical compound NC=O ZHNUHDYFZUAESO-UHFFFAOYSA-N 0.000 description 2
- 102100030708 GTPase KRas Human genes 0.000 description 2
- 206010018338 Glioma Diseases 0.000 description 2
- 101000584612 Homo sapiens GTPase KRas Proteins 0.000 description 2
- 101000611023 Homo sapiens Tumor necrosis factor receptor superfamily member 6 Proteins 0.000 description 2
- 206010025323 Lymphomas Diseases 0.000 description 2
- 208000000172 Medulloblastoma Diseases 0.000 description 2
- 241001465754 Metazoa Species 0.000 description 2
- 102100025725 Mothers against decapentaplegic homolog 4 Human genes 0.000 description 2
- 101710143112 Mothers against decapentaplegic homolog 4 Proteins 0.000 description 2
- 102100022219 NF-kappa-B essential modulator Human genes 0.000 description 2
- 208000012902 Nervous system disease Diseases 0.000 description 2
- 208000029726 Neurodevelopmental disease Diseases 0.000 description 2
- 208000025966 Neurological disease Diseases 0.000 description 2
- 238000012408 PCR amplification Methods 0.000 description 2
- CZPWVGJYEJSRLH-UHFFFAOYSA-N Pyrimidine Chemical compound C1=CN=CN=C1 CZPWVGJYEJSRLH-UHFFFAOYSA-N 0.000 description 2
- 102100026715 Serine/threonine-protein kinase STK11 Human genes 0.000 description 2
- 102000008579 Transposases Human genes 0.000 description 2
- 108010020764 Transposases Proteins 0.000 description 2
- 102100040403 Tumor necrosis factor receptor superfamily member 6 Human genes 0.000 description 2
- 201000003588 autosomal dominant cerebellar ataxia, deafness and narcolepsy Diseases 0.000 description 2
- 239000012472 biological sample Substances 0.000 description 2
- 229910052799 carbon Inorganic materials 0.000 description 2
- 210000001175 cerebrospinal fluid Anatomy 0.000 description 2
- 150000001875 compounds Chemical class 0.000 description 2
- 239000000470 constituent Substances 0.000 description 2
- 238000001514 detection method Methods 0.000 description 2
- 239000012530 fluid Substances 0.000 description 2
- 239000007850 fluorescent dye Substances 0.000 description 2
- 125000000623 heterocyclic group Chemical group 0.000 description 2
- 229910052739 hydrogen Inorganic materials 0.000 description 2
- 125000002887 hydroxy group Chemical group [H]O* 0.000 description 2
- 230000000977 initiatory effect Effects 0.000 description 2
- 210000002751 lymph Anatomy 0.000 description 2
- 201000001441 melanoma Diseases 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 231100000590 oncogenic Toxicity 0.000 description 2
- 230000002246 oncogenic effect Effects 0.000 description 2
- 210000005259 peripheral blood Anatomy 0.000 description 2
- 239000011886 peripheral blood Substances 0.000 description 2
- 210000002381 plasma Anatomy 0.000 description 2
- 150000003212 purines Chemical class 0.000 description 2
- 150000003230 pyrimidines Chemical class 0.000 description 2
- 230000010076 replication Effects 0.000 description 2
- 210000003296 saliva Anatomy 0.000 description 2
- 210000002966 serum Anatomy 0.000 description 2
- 238000003860 storage Methods 0.000 description 2
- 210000001138 tear Anatomy 0.000 description 2
- 229940035893 uracil Drugs 0.000 description 2
- 238000005406 washing Methods 0.000 description 2
- HNXRLRRQDUXQEE-ALURDMBKSA-N (2s,3r,4s,5r,6r)-2-[[(2r,3s,4r)-4-hydroxy-2-(hydroxymethyl)-3,4-dihydro-2h-pyran-3-yl]oxy]-6-(hydroxymethyl)oxane-3,4,5-triol Chemical compound O[C@@H]1[C@@H](O)[C@@H](O)[C@@H](CO)O[C@H]1O[C@@H]1[C@@H](CO)OC=C[C@H]1O HNXRLRRQDUXQEE-ALURDMBKSA-N 0.000 description 1
- 102100039583 116 kDa U5 small nuclear ribonucleoprotein component Human genes 0.000 description 1
- PIINGYXNCHTJTF-UHFFFAOYSA-N 2-(2-azaniumylethylamino)acetate Chemical group NCCNCC(O)=O PIINGYXNCHTJTF-UHFFFAOYSA-N 0.000 description 1
- ASJSAQIRZKANQN-CRCLSJGQSA-N 2-deoxy-D-ribose Chemical compound OC[C@@H](O)[C@@H](O)CC=O ASJSAQIRZKANQN-CRCLSJGQSA-N 0.000 description 1
- FWMNVWWHGCHHJJ-SKKKGAJSSA-N 4-amino-1-[(2r)-6-amino-2-[[(2r)-2-[[(2r)-2-[[(2r)-2-amino-3-phenylpropanoyl]amino]-3-phenylpropanoyl]amino]-4-methylpentanoyl]amino]hexanoyl]piperidine-4-carboxylic acid Chemical compound C([C@H](C(=O)N[C@H](CC(C)C)C(=O)N[C@H](CCCCN)C(=O)N1CCC(N)(CC1)C(O)=O)NC(=O)[C@H](N)CC=1C=CC=CC=1)C1=CC=CC=C1 FWMNVWWHGCHHJJ-SKKKGAJSSA-N 0.000 description 1
- CLGFIVUFZRGQRP-UHFFFAOYSA-N 7,8-dihydro-8-oxoguanine Chemical compound O=C1NC(N)=NC2=C1NC(=O)N2 CLGFIVUFZRGQRP-UHFFFAOYSA-N 0.000 description 1
- HCAJQHYUCKICQH-VPENINKCSA-N 8-Oxo-7,8-dihydro-2'-deoxyguanosine Chemical compound C1=2NC(N)=NC(=O)C=2NC(=O)N1[C@H]1C[C@H](O)[C@@H](CO)O1 HCAJQHYUCKICQH-VPENINKCSA-N 0.000 description 1
- 102100033793 ALK tyrosine kinase receptor Human genes 0.000 description 1
- 102100034580 AT-rich interactive domain-containing protein 1A Human genes 0.000 description 1
- 102100034571 AT-rich interactive domain-containing protein 1B Human genes 0.000 description 1
- 102100027452 ATP-dependent DNA helicase Q4 Human genes 0.000 description 1
- 102100030374 Actin, cytoplasmic 2 Human genes 0.000 description 1
- 241000251468 Actinopterygii Species 0.000 description 1
- 102100035886 Adenine DNA glycosylase Human genes 0.000 description 1
- 102100034540 Adenomatous polyposis coli protein Human genes 0.000 description 1
- 102100034614 Ankyrin repeat domain-containing protein 11 Human genes 0.000 description 1
- 241001156002 Anthonomus pomorum Species 0.000 description 1
- 102100027308 Apoptosis regulator BAX Human genes 0.000 description 1
- 108050006685 Apoptosis regulator BAX Proteins 0.000 description 1
- 102100021569 Apoptosis regulator Bcl-2 Human genes 0.000 description 1
- 102100035683 Axin-2 Human genes 0.000 description 1
- 102100021631 B-cell lymphoma 6 protein Human genes 0.000 description 1
- 108091012583 BCL2 Proteins 0.000 description 1
- 108700020463 BRCA1 Proteins 0.000 description 1
- 102000036365 BRCA1 Human genes 0.000 description 1
- 101150072950 BRCA1 gene Proteins 0.000 description 1
- 108700020462 BRCA2 Proteins 0.000 description 1
- 102000052609 BRCA2 Human genes 0.000 description 1
- 241000894006 Bacteria Species 0.000 description 1
- 208000014803 Baraitser-Winter cerebrofrontofacial syndrome Diseases 0.000 description 1
- 201000002876 Baraitser-Winter syndrome Diseases 0.000 description 1
- 208000024400 Blepharophimosis-intellectual disability syndrome, Ohdo type Diseases 0.000 description 1
- 208000019495 Bohring-Opitz syndrome Diseases 0.000 description 1
- 102100025423 Bone morphogenetic protein receptor type-1A Human genes 0.000 description 1
- 101000964894 Bos taurus 14-3-3 protein zeta/delta Proteins 0.000 description 1
- 101150008921 Brca2 gene Proteins 0.000 description 1
- 206010006187 Breast cancer Diseases 0.000 description 1
- 208000026310 Breast neoplasm Diseases 0.000 description 1
- 206010064063 CHARGE syndrome Diseases 0.000 description 1
- OKTJSMMVPCPJKN-UHFFFAOYSA-N Carbon Chemical compound [C] OKTJSMMVPCPJKN-UHFFFAOYSA-N 0.000 description 1
- 102100028914 Catenin beta-1 Human genes 0.000 description 1
- ZEOWTGPWHLSLOG-UHFFFAOYSA-N Cc1ccc(cc1-c1ccc2c(n[nH]c2c1)-c1cnn(c1)C1CC1)C(=O)Nc1cccc(c1)C(F)(F)F Chemical compound Cc1ccc(cc1-c1ccc2c(n[nH]c2c1)-c1cnn(c1)C1CC1)C(=O)Nc1cccc(c1)C(F)(F)F ZEOWTGPWHLSLOG-UHFFFAOYSA-N 0.000 description 1
- 108091007854 Cdh1/Fizzy-related Proteins 0.000 description 1
- 102000038594 Cdh1/Fizzy-related Human genes 0.000 description 1
- 102100025064 Cellular tumor antigen p53 Human genes 0.000 description 1
- 241000282693 Cercopithecidae Species 0.000 description 1
- 102100038215 Chromodomain-helicase-DNA-binding protein 7 Human genes 0.000 description 1
- 201000001432 Coffin-Siris syndrome Diseases 0.000 description 1
- 206010009944 Colon cancer Diseases 0.000 description 1
- 208000001333 Colorectal Neoplasms Diseases 0.000 description 1
- 108010043471 Core Binding Factor Alpha 2 Subunit Proteins 0.000 description 1
- 241000938605 Crocodylia Species 0.000 description 1
- 108010058546 Cyclin D1 Proteins 0.000 description 1
- 102000006311 Cyclin D1 Human genes 0.000 description 1
- 108010009392 Cyclin-Dependent Kinase Inhibitor p16 Proteins 0.000 description 1
- 108010009540 DNA (Cytosine-5-)-Methyltransferase 1 Proteins 0.000 description 1
- 102100036279 DNA (cytosine-5)-methyltransferase 1 Human genes 0.000 description 1
- 102100021122 DNA damage-binding protein 2 Human genes 0.000 description 1
- 102100034157 DNA mismatch repair protein Msh2 Human genes 0.000 description 1
- 102100021147 DNA mismatch repair protein Msh6 Human genes 0.000 description 1
- 230000006820 DNA synthesis Effects 0.000 description 1
- 101100174544 Danio rerio foxo1a gene Proteins 0.000 description 1
- 108010036364 Deoxyribonuclease IV (Phage T4-Induced) Proteins 0.000 description 1
- AHCYMLUZIRLXAA-SHYZEUOFSA-N Deoxyuridine 5'-triphosphate Chemical compound O1[C@H](COP(O)(=O)OP(O)(=O)OP(O)(O)=O)[C@@H](O)C[C@@H]1N1C(=O)NC(=O)C=C1 AHCYMLUZIRLXAA-SHYZEUOFSA-N 0.000 description 1
- 108010086291 Deubiquitinating Enzyme CYLD Proteins 0.000 description 1
- 102100023274 Dual specificity mitogen-activated protein kinase kinase 4 Human genes 0.000 description 1
- 102000012199 E3 ubiquitin-protein ligase Mdm2 Human genes 0.000 description 1
- 108050002772 E3 ubiquitin-protein ligase Mdm2 Proteins 0.000 description 1
- 101150097734 EPHB2 gene Proteins 0.000 description 1
- 206010014733 Endometrial cancer Diseases 0.000 description 1
- 206010014759 Endometrial neoplasm Diseases 0.000 description 1
- 102100031968 Ephrin type-B receptor 2 Human genes 0.000 description 1
- 102100029055 Exostosin-1 Human genes 0.000 description 1
- 102100029074 Exostosin-2 Human genes 0.000 description 1
- 101710105178 F-box/WD repeat-containing protein 7 Proteins 0.000 description 1
- 101150106966 FOXO1 gene Proteins 0.000 description 1
- 102100023593 Fibroblast growth factor receptor 1 Human genes 0.000 description 1
- 101710182386 Fibroblast growth factor receptor 1 Proteins 0.000 description 1
- 102100023600 Fibroblast growth factor receptor 2 Human genes 0.000 description 1
- 101710182389 Fibroblast growth factor receptor 2 Proteins 0.000 description 1
- 102100027842 Fibroblast growth factor receptor 3 Human genes 0.000 description 1
- 101710182396 Fibroblast growth factor receptor 3 Proteins 0.000 description 1
- 208000002893 Floating-Harbor syndrome Diseases 0.000 description 1
- 102100035427 Forkhead box protein O1 Human genes 0.000 description 1
- 102100035421 Forkhead box protein O3 Human genes 0.000 description 1
- 241000233866 Fungi Species 0.000 description 1
- 102100029974 GTPase HRas Human genes 0.000 description 1
- 102100039788 GTPase NRas Human genes 0.000 description 1
- 208000032612 Glial tumor Diseases 0.000 description 1
- 102100032530 Glypican-3 Human genes 0.000 description 1
- 108700039143 HMGA2 Proteins 0.000 description 1
- 208000002927 Hamartoma Diseases 0.000 description 1
- 102100031880 Helicase SRCAP Human genes 0.000 description 1
- 208000021236 Hereditary diffuse leukoencephalopathy with axonal spheroids and pigmented glia Diseases 0.000 description 1
- 108091027305 Heteroduplex Proteins 0.000 description 1
- 241000238631 Hexapoda Species 0.000 description 1
- 102100035108 High affinity nerve growth factor receptor Human genes 0.000 description 1
- 102100028999 High mobility group protein HMGI-C Human genes 0.000 description 1
- 102100033070 Histone acetyltransferase KAT6B Human genes 0.000 description 1
- 102100022103 Histone-lysine N-methyltransferase 2A Human genes 0.000 description 1
- 102100027768 Histone-lysine N-methyltransferase 2D Human genes 0.000 description 1
- 102100038970 Histone-lysine N-methyltransferase EZH2 Human genes 0.000 description 1
- 102100039121 Histone-lysine N-methyltransferase MECOM Human genes 0.000 description 1
- 101150073387 Hmga2 gene Proteins 0.000 description 1
- 102100030308 Homeobox protein Hox-A11 Human genes 0.000 description 1
- 102100030307 Homeobox protein Hox-A13 Human genes 0.000 description 1
- 102100021090 Homeobox protein Hox-A9 Human genes 0.000 description 1
- 102100020761 Homeobox protein Hox-C13 Human genes 0.000 description 1
- 102100039545 Homeobox protein Hox-D11 Human genes 0.000 description 1
- 102100040227 Homeobox protein Hox-D13 Human genes 0.000 description 1
- 101000608799 Homo sapiens 116 kDa U5 small nuclear ribonucleoprotein component Proteins 0.000 description 1
- 101000779641 Homo sapiens ALK tyrosine kinase receptor Proteins 0.000 description 1
- 101000924266 Homo sapiens AT-rich interactive domain-containing protein 1A Proteins 0.000 description 1
- 101000924255 Homo sapiens AT-rich interactive domain-containing protein 1B Proteins 0.000 description 1
- 101000580577 Homo sapiens ATP-dependent DNA helicase Q4 Proteins 0.000 description 1
- 101000756632 Homo sapiens Actin, cytoplasmic 1 Proteins 0.000 description 1
- 101000773237 Homo sapiens Actin, cytoplasmic 2 Proteins 0.000 description 1
- 101000824278 Homo sapiens Acyl-[acyl-carrier-protein] hydrolase Proteins 0.000 description 1
- 101001000351 Homo sapiens Adenine DNA glycosylase Proteins 0.000 description 1
- 101000924577 Homo sapiens Adenomatous polyposis coli protein Proteins 0.000 description 1
- 101000924727 Homo sapiens Alternative prion protein Proteins 0.000 description 1
- 101000924476 Homo sapiens Ankyrin repeat domain-containing protein 11 Proteins 0.000 description 1
- 101000874569 Homo sapiens Axin-2 Proteins 0.000 description 1
- 101000971234 Homo sapiens B-cell lymphoma 6 protein Proteins 0.000 description 1
- 101000934638 Homo sapiens Bone morphogenetic protein receptor type-1A Proteins 0.000 description 1
- 101000916173 Homo sapiens Catenin beta-1 Proteins 0.000 description 1
- 101000883739 Homo sapiens Chromodomain-helicase-DNA-binding protein 7 Proteins 0.000 description 1
- 101001041466 Homo sapiens DNA damage-binding protein 2 Proteins 0.000 description 1
- 101001134036 Homo sapiens DNA mismatch repair protein Msh2 Proteins 0.000 description 1
- 101000968658 Homo sapiens DNA mismatch repair protein Msh6 Proteins 0.000 description 1
- 101001115395 Homo sapiens Dual specificity mitogen-activated protein kinase kinase 4 Proteins 0.000 description 1
- 101000918311 Homo sapiens Exostosin-1 Proteins 0.000 description 1
- 101000918275 Homo sapiens Exostosin-2 Proteins 0.000 description 1
- 101001060231 Homo sapiens F-box/WD repeat-containing protein 7 Proteins 0.000 description 1
- 101000877681 Homo sapiens Forkhead box protein O3 Proteins 0.000 description 1
- 101000584633 Homo sapiens GTPase HRas Proteins 0.000 description 1
- 101000744505 Homo sapiens GTPase NRas Proteins 0.000 description 1
- 101001014668 Homo sapiens Glypican-3 Proteins 0.000 description 1
- 101000704158 Homo sapiens Helicase SRCAP Proteins 0.000 description 1
- 101000596894 Homo sapiens High affinity nerve growth factor receptor Proteins 0.000 description 1
- 101000944174 Homo sapiens Histone acetyltransferase KAT6B Proteins 0.000 description 1
- 101001045846 Homo sapiens Histone-lysine N-methyltransferase 2A Proteins 0.000 description 1
- 101001045848 Homo sapiens Histone-lysine N-methyltransferase 2B Proteins 0.000 description 1
- 101001008894 Homo sapiens Histone-lysine N-methyltransferase 2D Proteins 0.000 description 1
- 101000882127 Homo sapiens Histone-lysine N-methyltransferase EZH2 Proteins 0.000 description 1
- 101001033728 Homo sapiens Histone-lysine N-methyltransferase MECOM Proteins 0.000 description 1
- 101001083158 Homo sapiens Homeobox protein Hox-A11 Proteins 0.000 description 1
- 101001002988 Homo sapiens Homeobox protein Hox-C13 Proteins 0.000 description 1
- 101000962591 Homo sapiens Homeobox protein Hox-D11 Proteins 0.000 description 1
- 101001037168 Homo sapiens Homeobox protein Hox-D13 Proteins 0.000 description 1
- 101000916644 Homo sapiens Macrophage colony-stimulating factor 1 receptor Proteins 0.000 description 1
- 101000573901 Homo sapiens Major prion protein Proteins 0.000 description 1
- 101001030211 Homo sapiens Myc proto-oncogene protein Proteins 0.000 description 1
- 101001128138 Homo sapiens NACHT, LRR and PYD domains-containing protein 2 Proteins 0.000 description 1
- 101000973618 Homo sapiens NF-kappa-B essential modulator Proteins 0.000 description 1
- 101000720704 Homo sapiens Neuronal migration protein doublecortin Proteins 0.000 description 1
- 101000981336 Homo sapiens Nibrin Proteins 0.000 description 1
- 101000945735 Homo sapiens Parafibromin Proteins 0.000 description 1
- 101000605639 Homo sapiens Phosphatidylinositol 4,5-bisphosphate 3-kinase catalytic subunit alpha isoform Proteins 0.000 description 1
- 101001064282 Homo sapiens Platelet-activating factor acetylhydrolase IB subunit beta Proteins 0.000 description 1
- 101000728236 Homo sapiens Polycomb group protein ASXL1 Proteins 0.000 description 1
- 101000585703 Homo sapiens Protein L-Myc Proteins 0.000 description 1
- 101000642815 Homo sapiens Protein SSXT Proteins 0.000 description 1
- 101000695187 Homo sapiens Protein patched homolog 1 Proteins 0.000 description 1
- 101000579425 Homo sapiens Proto-oncogene tyrosine-protein kinase receptor Ret Proteins 0.000 description 1
- 101000779418 Homo sapiens RAC-alpha serine/threonine-protein kinase Proteins 0.000 description 1
- 101000798015 Homo sapiens RAC-beta serine/threonine-protein kinase Proteins 0.000 description 1
- 101001012157 Homo sapiens Receptor tyrosine-protein kinase erbB-2 Proteins 0.000 description 1
- 101000932478 Homo sapiens Receptor-type tyrosine-protein kinase FLT3 Proteins 0.000 description 1
- 101000654718 Homo sapiens SET-binding protein Proteins 0.000 description 1
- 101000702542 Homo sapiens SWI/SNF-related matrix-associated actin-dependent regulator of chromatin subfamily E member 1 Proteins 0.000 description 1
- 101000984753 Homo sapiens Serine/threonine-protein kinase B-raf Proteins 0.000 description 1
- 101000628562 Homo sapiens Serine/threonine-protein kinase STK11 Proteins 0.000 description 1
- 101000631760 Homo sapiens Sodium channel protein type 1 subunit alpha Proteins 0.000 description 1
- 101000951145 Homo sapiens Succinate dehydrogenase [ubiquinone] cytochrome b small subunit, mitochondrial Proteins 0.000 description 1
- 101000874160 Homo sapiens Succinate dehydrogenase [ubiquinone] iron-sulfur subunit, mitochondrial Proteins 0.000 description 1
- 101000934888 Homo sapiens Succinate dehydrogenase cytochrome b560 subunit, mitochondrial Proteins 0.000 description 1
- 101000628885 Homo sapiens Suppressor of fused homolog Proteins 0.000 description 1
- 101000891113 Homo sapiens T-cell acute lymphocytic leukemia protein 1 Proteins 0.000 description 1
- 101000800488 Homo sapiens T-cell leukemia homeobox protein 1 Proteins 0.000 description 1
- 101000655119 Homo sapiens T-cell leukemia homeobox protein 3 Proteins 0.000 description 1
- 101000837626 Homo sapiens Thyroid hormone receptor alpha Proteins 0.000 description 1
- 101000702545 Homo sapiens Transcription activator BRG1 Proteins 0.000 description 1
- 101000837845 Homo sapiens Transcription factor E3 Proteins 0.000 description 1
- 101000823316 Homo sapiens Tyrosine-protein kinase ABL1 Proteins 0.000 description 1
- 101001026790 Homo sapiens Tyrosine-protein kinase Fes/Fps Proteins 0.000 description 1
- 101000997832 Homo sapiens Tyrosine-protein kinase JAK2 Proteins 0.000 description 1
- 208000026350 Inborn Genetic disease Diseases 0.000 description 1
- 206010062717 Increased upper airway secretion Diseases 0.000 description 1
- 201000003488 KBG syndrome Diseases 0.000 description 1
- 208000007367 Kabuki syndrome Diseases 0.000 description 1
- 208000008839 Kidney Neoplasms Diseases 0.000 description 1
- 101150083522 MECP2 gene Proteins 0.000 description 1
- 229910015837 MSH2 Inorganic materials 0.000 description 1
- 108700012912 MYCN Proteins 0.000 description 1
- 101150022024 MYCN gene Proteins 0.000 description 1
- 102100028198 Macrophage colony-stimulating factor 1 receptor Human genes 0.000 description 1
- 102100025818 Major prion protein Human genes 0.000 description 1
- 208000016703 Mandibulofacial dysostosis-microcephaly syndrome Diseases 0.000 description 1
- 206010027406 Mesothelioma Diseases 0.000 description 1
- 102100039124 Methyl-CpG-binding protein 2 Human genes 0.000 description 1
- 108010074346 Mismatch Repair Endonuclease PMS2 Proteins 0.000 description 1
- 102000008071 Mismatch Repair Endonuclease PMS2 Human genes 0.000 description 1
- 102100025751 Mothers against decapentaplegic homolog 2 Human genes 0.000 description 1
- 101710143123 Mothers against decapentaplegic homolog 2 Proteins 0.000 description 1
- 208000034578 Multiple myelomas Diseases 0.000 description 1
- 102100038895 Myc proto-oncogene protein Human genes 0.000 description 1
- 208000028738 Myhre syndrome Diseases 0.000 description 1
- 108700026495 N-Myc Proto-Oncogene Proteins 0.000 description 1
- 102100030124 N-myc proto-oncogene protein Human genes 0.000 description 1
- 101710090077 NF-kappa-B essential modulator Proteins 0.000 description 1
- 102100029166 NT-3 growth factor receptor Human genes 0.000 description 1
- 108700019961 Neoplasm Genes Proteins 0.000 description 1
- 102000048850 Neoplasm Genes Human genes 0.000 description 1
- 208000005890 Neuroma Diseases 0.000 description 1
- 102100025929 Neuronal migration protein doublecortin Human genes 0.000 description 1
- 102100024403 Nibrin Human genes 0.000 description 1
- 102000001759 Notch1 Receptor Human genes 0.000 description 1
- 108010029755 Notch1 Receptor Proteins 0.000 description 1
- 108091028043 Nucleic acid sequence Proteins 0.000 description 1
- 201000003048 Ohdo syndrome Diseases 0.000 description 1
- 206010068842 Olmsted syndrome Diseases 0.000 description 1
- 108700020796 Oncogene Proteins 0.000 description 1
- 102000043276 Oncogene Human genes 0.000 description 1
- 206010033128 Ovarian cancer Diseases 0.000 description 1
- 206010061535 Ovarian neoplasm Diseases 0.000 description 1
- 108010011536 PTEN Phosphohydrolase Proteins 0.000 description 1
- 102000014160 PTEN Phosphohydrolase Human genes 0.000 description 1
- 206010061902 Pancreatic neoplasm Diseases 0.000 description 1
- 102100034743 Parafibromin Human genes 0.000 description 1
- 102100038332 Phosphatidylinositol 4,5-bisphosphate 3-kinase catalytic subunit alpha isoform Human genes 0.000 description 1
- 206010035226 Plasma cell myeloma Diseases 0.000 description 1
- 108010051742 Platelet-Derived Growth Factor beta Receptor Proteins 0.000 description 1
- 102100030655 Platelet-activating factor acetylhydrolase IB subunit beta Human genes 0.000 description 1
- 102100026547 Platelet-derived growth factor receptor beta Human genes 0.000 description 1
- 102100040990 Platelet-derived growth factor subunit B Human genes 0.000 description 1
- 102100029799 Polycomb group protein ASXL1 Human genes 0.000 description 1
- 208000008601 Polycythemia Diseases 0.000 description 1
- 206010036790 Productive cough Diseases 0.000 description 1
- 206010060862 Prostate cancer Diseases 0.000 description 1
- 208000000236 Prostatic Neoplasms Diseases 0.000 description 1
- 102100030128 Protein L-Myc Human genes 0.000 description 1
- 102100035586 Protein SSXT Human genes 0.000 description 1
- 102100028680 Protein patched homolog 1 Human genes 0.000 description 1
- 208000007531 Proteus syndrome Diseases 0.000 description 1
- 108700020978 Proto-Oncogene Proteins 0.000 description 1
- 102000052575 Proto-Oncogene Human genes 0.000 description 1
- 108010019674 Proto-Oncogene Proteins c-sis Proteins 0.000 description 1
- 102100028286 Proto-oncogene tyrosine-protein kinase receptor Ret Human genes 0.000 description 1
- 102100033810 RAC-alpha serine/threonine-protein kinase Human genes 0.000 description 1
- 102100032315 RAC-beta serine/threonine-protein kinase Human genes 0.000 description 1
- 102000004229 RNA-binding protein EWS Human genes 0.000 description 1
- 108090000740 RNA-binding protein EWS Proteins 0.000 description 1
- 238000003559 RNA-seq method Methods 0.000 description 1
- 102100030086 Receptor tyrosine-protein kinase erbB-2 Human genes 0.000 description 1
- 102100020718 Receptor-type tyrosine-protein kinase FLT3 Human genes 0.000 description 1
- 206010038389 Renal cancer Diseases 0.000 description 1
- 102100025373 Runt-related transcription factor 1 Human genes 0.000 description 1
- 102100032741 SET-binding protein Human genes 0.000 description 1
- 108700028341 SMARCB1 Proteins 0.000 description 1
- 101150008214 SMARCB1 gene Proteins 0.000 description 1
- 102100025746 SWI/SNF-related matrix-associated actin-dependent regulator of chromatin subfamily B member 1 Human genes 0.000 description 1
- 102100031029 SWI/SNF-related matrix-associated actin-dependent regulator of chromatin subfamily E member 1 Human genes 0.000 description 1
- 240000004808 Saccharomyces cerevisiae Species 0.000 description 1
- 206010039491 Sarcoma Diseases 0.000 description 1
- 208000005867 Schinzel-Giedion syndrome Diseases 0.000 description 1
- 101000702553 Schistosoma mansoni Antigen Sm21.7 Proteins 0.000 description 1
- 101000714192 Schistosoma mansoni Tegument antigen Proteins 0.000 description 1
- 102100027103 Serine/threonine-protein kinase B-raf Human genes 0.000 description 1
- 101710181599 Serine/threonine-protein kinase STK11 Proteins 0.000 description 1
- 102100028910 Sodium channel protein type 1 subunit alpha Human genes 0.000 description 1
- 102100038014 Succinate dehydrogenase [ubiquinone] cytochrome b small subunit, mitochondrial Human genes 0.000 description 1
- 102100035726 Succinate dehydrogenase [ubiquinone] iron-sulfur subunit, mitochondrial Human genes 0.000 description 1
- 102100025393 Succinate dehydrogenase cytochrome b560 subunit, mitochondrial Human genes 0.000 description 1
- 102100026939 Suppressor of fused homolog Human genes 0.000 description 1
- 102100040365 T-cell acute lymphocytic leukemia protein 1 Human genes 0.000 description 1
- 102100033111 T-cell leukemia homeobox protein 1 Human genes 0.000 description 1
- 102100032568 T-cell leukemia homeobox protein 3 Human genes 0.000 description 1
- 102100033456 TGF-beta receptor type-1 Human genes 0.000 description 1
- 102100033455 TGF-beta receptor type-2 Human genes 0.000 description 1
- 102000003568 TRPV3 Human genes 0.000 description 1
- 102100028702 Thyroid hormone receptor alpha Human genes 0.000 description 1
- 102100031027 Transcription activator BRG1 Human genes 0.000 description 1
- 102100028507 Transcription factor E3 Human genes 0.000 description 1
- 108010011702 Transforming Growth Factor-beta Type I Receptor Proteins 0.000 description 1
- 108010082684 Transforming Growth Factor-beta Type II Receptor Proteins 0.000 description 1
- 206010052779 Transplant rejections Diseases 0.000 description 1
- 101150043371 Trpv3 gene Proteins 0.000 description 1
- 108010078814 Tumor Suppressor Protein p53 Proteins 0.000 description 1
- 102100033254 Tumor suppressor ARF Human genes 0.000 description 1
- 102100022596 Tyrosine-protein kinase ABL1 Human genes 0.000 description 1
- 102100037333 Tyrosine-protein kinase Fes/Fps Human genes 0.000 description 1
- 102100033444 Tyrosine-protein kinase JAK2 Human genes 0.000 description 1
- 102100024250 Ubiquitin carboxyl-terminal hydrolase CYLD Human genes 0.000 description 1
- 208000002495 Uterine Neoplasms Diseases 0.000 description 1
- 108010053100 Vascular Endothelial Growth Factor Receptor-3 Proteins 0.000 description 1
- 102100033179 Vascular endothelial growth factor receptor 3 Human genes 0.000 description 1
- 241000700605 Viruses Species 0.000 description 1
- 102000040856 WT1 Human genes 0.000 description 1
- 108700020467 WT1 Proteins 0.000 description 1
- 101150084041 WT1 gene Proteins 0.000 description 1
- 201000003790 Weaver syndrome Diseases 0.000 description 1
- 239000002253 acid Substances 0.000 description 1
- 150000007513 acids Chemical class 0.000 description 1
- 230000003213 activating effect Effects 0.000 description 1
- 201000008445 adult-onset leukoencephalopathy with axonal spheroids and pigmented glia Diseases 0.000 description 1
- 125000001931 aliphatic group Chemical group 0.000 description 1
- 239000003513 alkali Substances 0.000 description 1
- 125000003275 alpha amino acid group Chemical group 0.000 description 1
- 150000001412 amines Chemical class 0.000 description 1
- 210000004381 amniotic fluid Anatomy 0.000 description 1
- 230000000692 anti-sense effect Effects 0.000 description 1
- 239000007864 aqueous solution Substances 0.000 description 1
- QVGXLLKOCUKJST-UHFFFAOYSA-N atomic oxygen Chemical compound [O] QVGXLLKOCUKJST-UHFFFAOYSA-N 0.000 description 1
- 230000008901 benefit Effects 0.000 description 1
- 229910002056 binary alloy Inorganic materials 0.000 description 1
- 229960002685 biotin Drugs 0.000 description 1
- 235000020958 biotin Nutrition 0.000 description 1
- 239000011616 biotin Substances 0.000 description 1
- 210000000988 bone and bone Anatomy 0.000 description 1
- 210000004556 brain Anatomy 0.000 description 1
- 210000000481 breast Anatomy 0.000 description 1
- 125000003178 carboxy group Chemical group [H]OC(*)=O 0.000 description 1
- 230000008859 change Effects 0.000 description 1
- 230000002759 chromosomal effect Effects 0.000 description 1
- 108091092240 circulating cell-free DNA Proteins 0.000 description 1
- 210000001072 colon Anatomy 0.000 description 1
- 239000002131 composite material Substances 0.000 description 1
- 238000010276 construction Methods 0.000 description 1
- 238000012937 correction Methods 0.000 description 1
- 210000004748 cultured cell Anatomy 0.000 description 1
- 230000006378 damage Effects 0.000 description 1
- 230000009615 deamination Effects 0.000 description 1
- 238000006481 deamination reaction Methods 0.000 description 1
- 238000004925 denaturation Methods 0.000 description 1
- 230000036425 denaturation Effects 0.000 description 1
- 239000005549 deoxyribonucleoside Substances 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 229910052805 deuterium Inorganic materials 0.000 description 1
- 201000010099 disease Diseases 0.000 description 1
- 208000037265 diseases, disorders, signs and symptoms Diseases 0.000 description 1
- 229940079593 drug Drugs 0.000 description 1
- 239000003814 drug Substances 0.000 description 1
- 239000000839 emulsion Substances 0.000 description 1
- 102000052116 epidermal growth factor receptor activity proteins Human genes 0.000 description 1
- 108700015053 epidermal growth factor receptor activity proteins Proteins 0.000 description 1
- 150000002170 ethers Chemical class 0.000 description 1
- 239000000284 extract Substances 0.000 description 1
- 210000001508 eye Anatomy 0.000 description 1
- 210000003754 fetus Anatomy 0.000 description 1
- 206010016629 fibroma Diseases 0.000 description 1
- 238000012632 fluorescent imaging Methods 0.000 description 1
- 238000007672 fourth generation sequencing Methods 0.000 description 1
- 230000005021 gait Effects 0.000 description 1
- 208000016361 genetic disease Diseases 0.000 description 1
- 230000007614 genetic variation Effects 0.000 description 1
- 208000001580 genitopatellar syndrome Diseases 0.000 description 1
- 239000011521 glass Substances 0.000 description 1
- 210000003780 hair follicle Anatomy 0.000 description 1
- 125000005843 halogen group Chemical group 0.000 description 1
- 230000036541 health Effects 0.000 description 1
- 201000005787 hematologic cancer Diseases 0.000 description 1
- 208000024200 hematopoietic and lymphoid system neoplasm Diseases 0.000 description 1
- 238000012165 high-throughput sequencing Methods 0.000 description 1
- 108010021685 homeobox protein HOXA13 Proteins 0.000 description 1
- 108010027263 homeobox protein HOXA9 Proteins 0.000 description 1
- 239000001257 hydrogen Substances 0.000 description 1
- 208000003532 hypothyroidism Diseases 0.000 description 1
- 230000002989 hypothyroidism Effects 0.000 description 1
- 230000003100 immobilizing effect Effects 0.000 description 1
- 238000012606 in vitro cell culture Methods 0.000 description 1
- 238000011065 in-situ storage Methods 0.000 description 1
- 238000010348 incorporation Methods 0.000 description 1
- 230000003993 interaction Effects 0.000 description 1
- 230000000968 intestinal effect Effects 0.000 description 1
- 210000004153 islets of langerhan Anatomy 0.000 description 1
- 238000005304 joining Methods 0.000 description 1
- 201000010982 kidney cancer Diseases 0.000 description 1
- 238000002372 labelling Methods 0.000 description 1
- 201000010260 leiomyoma Diseases 0.000 description 1
- 208000032839 leukemia Diseases 0.000 description 1
- 208000015486 malignant pancreatic neoplasm Diseases 0.000 description 1
- 210000004962 mammalian cell Anatomy 0.000 description 1
- 201000010994 mandibulofacial dysostosis, Guion-Almeida type Diseases 0.000 description 1
- 238000005259 measurement Methods 0.000 description 1
- 230000001404 mediated effect Effects 0.000 description 1
- 238000002844 melting Methods 0.000 description 1
- 230000008018 melting Effects 0.000 description 1
- 238000002156 mixing Methods 0.000 description 1
- 208000024703 mutilating palmoplantar keratoderma with periorificial keratotic plaques Diseases 0.000 description 1
- YOHYSYJDKVYCJI-UHFFFAOYSA-N n-[3-[[6-[3-(trifluoromethyl)anilino]pyrimidin-4-yl]amino]phenyl]cyclopropanecarboxamide Chemical compound FC(F)(F)C1=CC=CC(NC=2N=CN=C(NC=3C=C(NC(=O)C4CC4)C=CC=3)C=2)=C1 YOHYSYJDKVYCJI-UHFFFAOYSA-N 0.000 description 1
- JTSLALYXYSRPGW-UHFFFAOYSA-N n-[5-(4-cyanophenyl)-1h-pyrrolo[2,3-b]pyridin-3-yl]pyridine-3-carboxamide Chemical compound C=1C=CN=CC=1C(=O)NC(C1=C2)=CNC1=NC=C2C1=CC=C(C#N)C=C1 JTSLALYXYSRPGW-UHFFFAOYSA-N 0.000 description 1
- 238000002663 nebulization Methods 0.000 description 1
- 208000015122 neurodegenerative disease Diseases 0.000 description 1
- 230000000926 neurological effect Effects 0.000 description 1
- 238000006386 neutralization reaction Methods 0.000 description 1
- 229910052757 nitrogen Inorganic materials 0.000 description 1
- 238000001668 nucleic acid synthesis Methods 0.000 description 1
- 125000003835 nucleoside group Chemical group 0.000 description 1
- 238000002515 oligonucleotide synthesis Methods 0.000 description 1
- 230000002611 ovarian Effects 0.000 description 1
- 229910052760 oxygen Inorganic materials 0.000 description 1
- 239000001301 oxygen Substances 0.000 description 1
- 238000004806 packaging method and process Methods 0.000 description 1
- 201000002528 pancreatic cancer Diseases 0.000 description 1
- 208000008443 pancreatic carcinoma Diseases 0.000 description 1
- 208000007312 paraganglioma Diseases 0.000 description 1
- 230000000849 parathyroid Effects 0.000 description 1
- 230000037361 pathway Effects 0.000 description 1
- 208000028591 pheochromocytoma Diseases 0.000 description 1
- 208000026435 phlegm Diseases 0.000 description 1
- 230000001817 pituitary effect Effects 0.000 description 1
- 239000013612 plasmid Substances 0.000 description 1
- 210000004910 pleural fluid Anatomy 0.000 description 1
- 238000005498 polishing Methods 0.000 description 1
- 238000009598 prenatal testing Methods 0.000 description 1
- 238000002360 preparation method Methods 0.000 description 1
- 125000002924 primary amino group Chemical group [H]N([H])* 0.000 description 1
- 229910002059 quaternary alloy Inorganic materials 0.000 description 1
- 230000008707 rearrangement Effects 0.000 description 1
- 238000010188 recombinant method Methods 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 238000012552 review Methods 0.000 description 1
- 150000003291 riboses Chemical class 0.000 description 1
- 125000000548 ribosyl group Chemical group C1([C@H](O)[C@H](O)[C@H](O1)CO)* 0.000 description 1
- 238000005096 rolling process Methods 0.000 description 1
- 210000000582 semen Anatomy 0.000 description 1
- 239000004065 semiconductor Substances 0.000 description 1
- 238000007841 sequencing by ligation Methods 0.000 description 1
- 238000010008 shearing Methods 0.000 description 1
- 239000007787 solid Substances 0.000 description 1
- 239000000243 solution Substances 0.000 description 1
- 238000000527 sonication Methods 0.000 description 1
- 125000006850 spacer group Chemical group 0.000 description 1
- 241000894007 species Species 0.000 description 1
- 210000003802 sputum Anatomy 0.000 description 1
- 208000024794 sputum Diseases 0.000 description 1
- 210000002784 stomach Anatomy 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
- 150000008163 sugars Chemical class 0.000 description 1
- 229910052717 sulfur Inorganic materials 0.000 description 1
- 208000011580 syndromic disease Diseases 0.000 description 1
- 210000001179 synovial fluid Anatomy 0.000 description 1
- 230000002194 synthesizing effect Effects 0.000 description 1
- 210000001685 thyroid gland Anatomy 0.000 description 1
- 239000001226 triphosphate Substances 0.000 description 1
- 235000011178 triphosphate Nutrition 0.000 description 1
- 125000002264 triphosphate group Chemical class [H]OP(=O)(O[H])OP(=O)(O[H])OP(=O)(O[H])O* 0.000 description 1
- 108010064892 trkC Receptor Proteins 0.000 description 1
- 210000002700 urine Anatomy 0.000 description 1
- 229910052720 vanadium Inorganic materials 0.000 description 1
Classifications
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6869—Methods for sequencing
- C12Q1/6874—Methods for sequencing involving nucleic acid arrays, e.g. sequencing by hybridisation
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6869—Methods for sequencing
Definitions
- Some sequencing methods require comparing two sequences within a single sequence read to determine whether the sequences differ.
- such methods can be challenging to perform because the software that performs this task needs to accurately identify the beginnings and ends of the sequences in a sequence read that should be compared, extract the sequences that should be compared, and then perform an alignment of those sequences.
- These steps can be challenging to automatically perform consistently for all different sequences, sequence compositions and lengths. For example, the existence of repeated sequences within a sequence read can cause slippage of an alignment, which may produce erroneous results.
- the present disclosure provides an alternative, better way of comparing sequences within the same sequence read.
- a method of sequencing a template that comprises a direct repeat, i.e., a template comprising a first repeat sequence and a second repeat sequence that is in direct orientation with the first repeat sequence.
- the method may comprise, in the same reaction, hybridizing a primer to a first site that is upstream of the first repeat sequence and hybridizing a primer to a second site that is upstream of the second repeat sequence.
- the first and second sites i.e., the sites to which the first and second primers bind
- the hybridization product produced by this step contains the template with two primers annealed to it, both upstream of a repeat sequence by the same distance (e.g., the same number of bases).
- the method involves sequencing the template using a sequencing-by-synthesis method (e.g., using fluorescent dye terminators) to produce a sequence read that comprises a combination of the first and second repeat sequences, i.e., a sequence read that is essentially two reads (one from the first primer and the other from the second primer) that are merged with one another. Differences between the sequence of the first and second repeats can be identified as low-quality base calls.
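The way two primer extensions merge into one read can be pictured with a toy model. The sketch below is an illustration only, not the disclosed base-calling software: it assumes that each sequencing cycle receives an equal signal contribution from each primer, and the function name `merged_read`, the placeholder symbol `N` and the two quality labels are hypothetical.

```python
def merged_read(repeat1: str, repeat2: str) -> list[tuple[str, str]]:
    """Toy model of a read produced by two primers extended in the same reaction.

    Cycle i reports the base at position i of both repeats at once. When the
    two bases agree the call is clean; when they differ the signal is mixed,
    which is modeled here as a low-quality call.
    """
    calls = []
    for base1, base2 in zip(repeat1, repeat2):
        if base1 == base2:
            calls.append((base1, "high"))
        else:
            # Mixed signal from two different bases: no confident call.
            calls.append(("N", "low"))
    return calls


# Hypothetical repeats that differ at one position (index 4).
first_repeat = "ACGTACGTAC"
second_repeat = "ACGTTCGTAC"
calls = merged_read(first_repeat, second_repeat)
low_quality_positions = [i for i, (_, q) in enumerate(calls) if q == "low"]
print(low_quality_positions)  # [4]
```

Because both primers sit the same distance upstream of their repeats, position i of the read always pairs base i of the first repeat with base i of the second, so no alignment step is needed in this model.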
- the first repeat sequence and the second repeat sequence are amplified from opposite strands of a double-stranded fragment of DNA.
- the sequences of the first and second repeats should be identical except for positions that correspond to damaged nucleotides in the double-stranded fragment of DNA or errors that occur during amplification.
- any differences between the top and bottom strands of the double-stranded fragment can be identified in the sequence read as a “low quality” base call, i.e., a base that is associated with poor underlying data due to there being, in effect, two different bases at a particular position in the sequence.
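The relationship between the two strands can be illustrated in a few lines of Python. This is a minimal sketch with made-up sequences: the helper `reverse_complement` and the example of a C-to-T change on the top strand (a commonly cited outcome of cytosine deamination) are assumptions chosen for illustration, not data from the disclosure.

```python
# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


# Hypothetical undamaged fragment: the bottom strand is the reverse
# complement of the top strand.
top = "AACGTC"
bottom = reverse_complement(top)  # "GACGTT"

# With no damage, the complement of the bottom strand matches the top strand.
assert reverse_complement(bottom) == top

# Hypothetical damage on the top strand only: the C at index 2 reads as T.
damaged_top = "AATGTC"
mismatches = [
    i
    for i, (a, b) in enumerate(zip(damaged_top, reverse_complement(bottom)))
    if a != b
]
print(mismatches)  # [2]
```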
- the first repeat may be amplified from one strand of a double-stranded fragment of genomic DNA and the second repeat may be amplified from the other strand of the same double-stranded fragment of genomic DNA.
- the sequences of the first and second repeats are often the same. However, in cases where there is damage in the original molecule, the sequences of the first and second repeats (within a single molecule) may differ.
- the first and second repeats are typically identical except for positions that correspond to (a) damaged nucleotides in the double-stranded fragment of genomic DNA from which those strands were copied or (b) errors that occur during amplification of the direct repeat molecule (e.g., nucleotides that are mis-incorporated or deletions caused by a stutter or slippage event during amplification).
- the first and second repeats are typically at least 95% identical in sequence.
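A simple way to check a figure such as "at least 95% identical" is to count matching positions. The sketch below assumes an ungapped, position-by-position comparison of equal-length repeats; the source does not state how identity is computed, so the function `percent_identity` and this definition are assumptions.

```python
def percent_identity(repeat1: str, repeat2: str) -> float:
    """Percentage of positions at which two equal-length sequences match."""
    if len(repeat1) != len(repeat2):
        raise ValueError("this sketch only compares repeats of equal length")
    if not repeat1:
        raise ValueError("repeats must not be empty")
    matches = sum(a == b for a, b in zip(repeat1, repeat2))
    return 100.0 * matches / len(repeat1)


# Hypothetical 20-base repeats with a single mismatch at the last base.
first_repeat = "ACGTACGTACGTACGTACGT"
second_repeat = "ACGTACGTACGTACGTACGA"
identity = percent_identity(first_repeat, second_repeat)
print(identity)  # 95.0
```

Repeats that differ by an insertion or deletion (for example after a slippage event) would need a gapped alignment instead, which this sketch does not attempt.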
- the different repeats in a template molecule can be sequenced using two primers (one for each repeat) at the same time to determine if the repeats (which correspond to the top and complement of the bottom strands of an initial fragment of genomic DNA) differ.
- the sequences of the first and second repeats are merged in the same sequence read. Any differences between those sequences can be observed as a low-quality base call because the underlying data for that base call are essentially derived from two bases (one base read by the first primer and the other base read by the second primer, where those bases are the same distance downstream from the primers). If there is a low-quality base call at a particular position, then the method may comprise excluding that base call from future analysis. The method may be used to identify damaged nucleotides and amplification errors, as well as sequencing errors (i.e., errors that stem from the sequence reaction itself, not in the sequencing template).
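Excluding low-quality base calls from later analysis can be sketched as a masking step. The example assumes that each base call carries a Phred-style quality score (Q = -10·log10 of the estimated error probability); the threshold of 20, the function name `mask_low_quality` and the example scores are illustrative assumptions, not values from the disclosure.

```python
def mask_low_quality(bases: str, quals: list[int], min_qual: int = 20) -> str:
    """Replace base calls whose quality is below min_qual with 'N'."""
    if len(bases) != len(quals):
        raise ValueError("bases and quals must have the same length")
    return "".join(b if q >= min_qual else "N" for b, q in zip(bases, quals))


# Hypothetical merged read: the call at index 3 has a poor score, as would be
# expected where the two repeats contribute different bases.
read = "ACGTACGT"
scores = [38, 37, 39, 8, 36, 38, 40, 37]
masked = mask_low_quality(read, scores)
print(masked)  # ACGNACGT
```

Masking rather than deleting keeps the remaining positions in register, so the retained calls can still be compared with a reference coordinate by coordinate.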
- the method finds particular use in analyzing samples of DNA that contain damaged DNA, samples in which the amount of DNA is limited and/or samples that contain fragments having a low copy number mutation (e.g., a sequence caused by a mutation that is present at low copy number relative to sequences that do not contain the mutation).
- ctDNA circulating tumor
- tissue sections e.g., tissue sections.
- the sample may be DNA obtained from tissue embedded in paraffin (i.e., an FFPE sample).
- the mutant sequences may only be present at a very limited copy number (e.g., less than 10, less than 5 copies or even 1 copy in a background of hundreds or thousands of copies of the wild type sequence). In these situations, without an effective way to eliminate errors generated by DNA damage, it can be almost impossible to identify a true sequence variation with significant confidence.
- FIG. 1 schematically illustrates a direct repeat template that has been made from a fragment of double-stranded genomic DNA.
- FIG. 2 schematically illustrates where the first and second primers used in the method hybridize to a direct repeat template.
- FIG. 3 schematically illustrates an example of the method.
- FIG. 4 schematically illustrates an exemplary method by which a direct repeat molecule can be produced.
- FIG. 5 schematically illustrates another exemplary method by which a direct repeat molecule can be produced.
- nucleic acids are written left to right in 5′ to 3′ orientation; amino acid sequences are written left to right in amino to carboxy orientation, respectively.
- sample as used herein relates to a material or mixture of materials, typically containing one or more analytes of interest.
- the term as used in its broadest sense refers to any plant, animal, microbial or viral material containing genomic DNA, such as, for example, tissue or fluid isolated from an individual (including without limitation plasma, serum, cerebrospinal fluid, lymph, tears, saliva and tissue sections) or from in vitro cell culture constituents, as well as samples from the environment.
- nucleic acid sample denotes a sample containing nucleic acids.
- Nucleic acid samples used herein may be complex in that they contain multiple different molecules that contain sequences. Genomic DNA samples from a mammal (e.g., mouse or human) are types of complex samples. Complex samples may have more than about 10^4, 10^5, 10^6, 10^7, 10^8, 10^9 or 10^10 different nucleic acid molecules.
- a DNA target may originate from any source such as genomic DNA, or an artificial DNA construct. Any sample containing nucleic acids, e.g., genomic DNA from tissue culture cells or a sample of tissue, may be employed herein.
- mixture refers to a combination of elements that are interspersed and not in any particular order.
- a mixture is heterogeneous and not spatially separable into its different constituents.
- examples of mixtures of elements include a number of different elements that are dissolved in the same aqueous solution and a number of different elements attached to a solid support at random positions (i.e., in no particular order).
- a mixture is not addressable.
- an array of spatially separated surface-bound polynucleotides as is commonly known in the art, is not a mixture of surface-bound polynucleotides because the species of surface-bound polynucleotides are spatially distinct, and the array is addressable.
- nucleotide is intended to include those moieties that can be copied using a polymerase. Nucleotides contain not only the known purine and pyrimidine bases, but also other heterocyclic bases that have been modified, e.g., “damaged” bases that have been oxidized or deadenylated. Such modifications include methylated purines or pyrimidines, acylated purines or pyrimidines, alkylated riboses or other heterocycles. In addition, the term “nucleotide” includes those moieties that contain hapten or fluorescent labels and may contain not only conventional ribose and deoxyribose sugars, but other sugars as well.
- Modified nucleosides or nucleotides also include modifications on the sugar moiety, e.g., wherein one or more of the hydroxyl groups are replaced with halogen atoms or aliphatic groups, or are functionalized as ethers, amines, or the like.
- nucleic acid and “polynucleotide” are used interchangeably herein to describe a polymer of any length, e.g., greater than about 2 bases, greater than about 10 bases, greater than about 100 bases, greater than about 500 bases, greater than 1000 bases, greater than 10,000 bases, greater than 100,000 bases, greater than about 1,000,000, up to about 10^10 or more bases composed of nucleotides, e.g., deoxyribonucleotides or ribonucleotides, and may be produced enzymatically or synthetically (e.g., PNA as described in U.S. Pat. No.
- Naturally-occurring nucleotides include guanine, cytosine, adenine, thymine and uracil (G, C, A, T and U, respectively).
- DNA and RNA have a deoxyribose and ribose sugar backbone, respectively, whereas PNA’s backbone is composed of repeating N-(2-aminoethyl)-glycine units linked by peptide bonds.
- locked nucleic acid (LNA), often referred to as inaccessible RNA, is a modified RNA nucleotide.
- the ribose moiety of an LNA nucleotide is modified with an extra bridge connecting the 2′ oxygen and 4′ carbon. The bridge “locks” the ribose in the 3′-endo (North) conformation, which is often found in A-form duplexes.
- LNA nucleotides can be mixed with DNA or RNA residues in the oligonucleotide whenever desired.
- unstructured nucleic acid is a nucleic acid containing non-natural nucleotides that bind to each other with reduced stability.
- an unstructured nucleic acid may contain a G′ residue and a C′ residue, where these residues correspond to non-naturally occurring forms, i.e., analogs, of G and C that base pair with each other with reduced stability, but retain an ability to base pair with naturally occurring C and G residues, respectively.
- Unstructured nucleic acid is described in US20050233340, which is incorporated by reference herein for disclosure of UNA.
- oligonucleotide denotes a single-stranded multimer of nucleotides of from about 2 to 200 nucleotides, up to 500 nucleotides in length. Oligonucleotides may be synthetic or may be made enzymatically, and, in some embodiments, are 30 to 150 nucleotides in length. Oligonucleotides may contain ribonucleotide monomers (i.e., may be oligoribonucleotides) or deoxyribonucleotide monomers, or both ribonucleotide monomers and deoxyribonucleotide monomers. An oligonucleotide may be 10 to 20, 21 to 30, 31 to 40, 41 to 50, 51 to 60, 61 to 70, 71 to 80, 80 to 100, 100 to 150 or 150 to 200 nucleotides in length, for example.
- Primer means an oligonucleotide, either natural or synthetic, that is capable, upon forming a duplex with a polynucleotide template, of acting as a point of initiation of nucleic acid synthesis and being extended from its 3′ end along the template so that an extended duplex is formed.
- the sequence of nucleotides added during the extension process is determined by the sequence of the template polynucleotide. Usually primers are extended by a DNA polymerase.
- Primers are generally of a length compatible with their use in synthesis of primer extension products and are usually in the range of 8 to 100 nucleotides in length, such as 10 to 75, 15 to 60, 15 to 40, 18 to 30, 20 to 40, 21 to 50, 22 to 45, 25 to 40, and so on.
- Typical primers can be in the range of between 10-50 nucleotides long, such as 15-45, 18-40, 20-30, 21-25 and so on, and any length between the stated ranges.
- the primers are usually not more than about 10, 12, 15, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 35, 40, 45, 50, 55, 60, 65, or 70 nucleotides in length.
- a primer can be activated prior to primer extension.
- some primers have a 3′ block and internal RNA base. The RNA base can be removed by RNaseH or another treatment, thereby producing a 3′ hydroxyl group which can be extended. Other methods for activating primers exist.
- Primers are usually single-stranded for maximum efficiency in amplification but may alternatively be double-stranded or partially double-stranded. If double-stranded, the primer is usually first treated to separate its strands before being used to prepare extension products. This denaturation step is typically effected by heat, but may alternatively be carried out using alkali, followed by neutralization. Also included in this definition are toehold exchange primers, as described in Zhang et al (Nature Chemistry 2012 4: 208-214), which is incorporated by reference herein.
- a “primer” is complementary to a template, and complexes by hydrogen bonding or hybridization with the template to give a primer/template complex for initiation of synthesis by a polymerase, which is extended by the addition of covalently bonded bases linked at its 3′ end complementary to the template in the process of DNA synthesis.
- hybridization refers to a process in which a region of a nucleic acid strand anneals to and forms a stable duplex, either a homoduplex or a heteroduplex, under normal hybridization conditions with a second complementary nucleic acid strand and does not form a stable duplex with unrelated nucleic acid molecules under the same normal hybridization conditions.
- the formation of a duplex is accomplished by annealing two complementary nucleic acid strand regions in a hybridization reaction.
- the hybridization reaction can be made to be highly specific by adjustment of the hybridization conditions (often referred to as hybridization stringency) under which the hybridization reaction takes place, such that two nucleic acid strands will not form a stable duplex, e.g., a duplex that retains a region of double-strandedness under normal stringency conditions, unless the two nucleic acid strands contain a certain number of nucleotides in specific sequences which are substantially or completely complementary. “Normal hybridization or normal stringency conditions” are readily determined for any given hybridization reaction.
- hybridizing refers to any process by which a strand of nucleic acid binds with a complementary strand through base pairing.
- a nucleic acid is considered to be “selectively hybridizable” to a reference nucleic acid sequence if the two sequences specifically hybridize to one another under moderate to high stringency hybridization and wash conditions.
- Moderate and high stringency hybridization conditions are known (see, e.g., Ausubel, et al., Short Protocols in Molecular Biology, 3rd ed., Wiley & Sons 1995 and Sambrook et al., Molecular Cloning: A Laboratory Manual, Third Edition, 2001 Cold Spring Harbor, N.Y.).
- One example of high stringency conditions includes hybridization at about 42° C.
- amplifying refers to the process of synthesizing nucleic acid molecules that are complementary to one or both strands of a template nucleic acid.
- Amplifying a nucleic acid molecule may include denaturing the template nucleic acid, annealing primers to the template nucleic acid at a temperature that is below the melting temperatures of the primers, and enzymatically elongating from the primers to generate an amplification product.
- the denaturing, annealing and elongating steps each can be performed one or more times.
- the denaturing, annealing and elongating steps are performed multiple times such that the amount of amplification product is increasing, often times exponentially, although exponential amplification is not required by the present methods.
- Amplification typically requires the presence of deoxyribonucleoside triphosphates, a DNA polymerase enzyme and an appropriate buffer and/or co-factors for optimal activity of the polymerase enzyme.
- the term “amplification product” refers to the nucleic acids, which are produced from the amplifying process as defined herein.
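The growth in amplification product over repeated denaturing, annealing and elongating cycles can be sketched as follows. This is an idealized model only: the per-cycle `efficiency` parameter and the function name are assumptions introduced for illustration and do not appear in the text.

```python
def amplified_copies(initial_copies: float, cycles: int, efficiency: float = 1.0) -> float:
    """Idealized amount of product after `cycles` rounds of amplification.

    efficiency is a hypothetical per-cycle yield: 1.0 means every template
    is copied in every cycle (perfect doubling), 0.0 means no amplification.
    """
    if cycles < 0:
        raise ValueError("cycles must be non-negative")
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles


if __name__ == "__main__":
    print(amplified_copies(1, 10))        # perfect doubling over 10 cycles
    print(amplified_copies(1, 10, 0.8))   # sub-exponential yield per cycle
```

With an efficiency below 1.0 the product still increases each cycle, which is consistent with the statement that exponential amplification is not required.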
- determining means determining if an element is present or not. These terms include both quantitative and/or qualitative determinations. Assessing may be relative or absolute. “Assessing the presence of” includes determining the amount of something present, as well as determining whether it is present or absent.
- ligating refers to the enzymatically catalyzed joining of the terminal nucleotide at the 5′ end of a first DNA molecule to the terminal nucleotide at the 3′ end of a second DNA molecule.
- a “plurality” contains at least 2 members. In certain cases, a plurality may have at least 2, at least 5, at least 10, at least 100, at least 1,000, at least 10,000, at least 100,000, at least 10^6, at least 10^7, at least 10^8 or at least 10^9 or more members.
- oligonucleotide binding site refers to a site to which an oligonucleotide hybridizes in a target polynucleotide. If an oligonucleotide “provides” a binding site for a primer, then the primer may hybridize to that oligonucleotide or its complement.
- strand refers to a nucleic acid made up of nucleotides linked together by covalent bonds, e.g., phosphodiester bonds.
- DNA usually exists in a double-stranded form, and as such, has two complementary strands of nucleic acid referred to herein as the “Watson” (or “top”) and “Crick” (or “bottom”) strands.
- complementary strands of a chromosomal region may be referred to as “plus” and “minus” strands, the “first” and “second” strands, the “coding” and “noncoding” strands, the “top” and “bottom” strands or the “sense” and “antisense” strands.
- the assignment of a strand as being a Watson or Crick strand is arbitrary and does not imply any particular orientation, function or structure.
- extending refers to the extension of a primer by the addition of nucleotides using a polymerase. If a primer that is annealed to a nucleic acid is extended, the nucleic acid acts as a template for extension reaction.
- sequencing refers to a method by which the identity of at least 10 consecutive nucleotides (e.g., the identity of at least 20, at least 50, at least 100 or at least 200 or more consecutive nucleotides) of a polynucleotide is obtained.
- next-generation sequencing or “high-throughput sequencing”, as used herein, refer to the so-called parallelized sequencing-by-synthesis or sequencing-by-ligation platforms currently employed by Illumina, Life Technologies, and Roche, etc.
- Next-generation sequencing methods may also include nanopore sequencing methods such as that commercialized by Oxford Nanopore Technologies, electronic-detection based methods such as Ion Torrent technology commercialized by Life Technologies, or single-molecule fluorescence-based methods such as that commercialized by Pacific Biosciences.
- barcode sequence refers to a unique sequence of nucleotides that can be used to a) identify and/or track the source of a polynucleotide in a reaction, b) count how many times an initial molecule is sequenced and c) pair sequence reads from different strands of the same molecule. Barcode sequences may vary widely in size and composition; the following references provide guidance for selecting sets of barcode sequences appropriate for particular embodiments: Casbon (Nuc. Acids Res. 2011, 22 e81), Brenner, U.S. Pat. No. 5,635,400; Brenner et al., Proc. Natl. Acad.
- a barcode sequence may have a length in range of from 2 to 36 nucleotides, or from 6 to 30 nucleotides, or from 8 to 20 nucleotides.
- a barcode may contain a “degenerate base region” or “DBR”, where the terms “degenerate base region” and “DBR” refer to a type of molecular barcode that has complexity that is sufficient to help one distinguish between fragments to which the DBR has been added.
- substantially every tagged fragment may have a different DBR sequence.
- a high complexity DBR may be used (e.g., one that is composed of at least 10,000 or 100,000, or more sequences).
- some fragments may be tagged with the same DBR sequence, but those fragments can still be distinguished by the combination of i. the DBR sequence, ii. the sequence of the fragment, iii.
- a DBR may comprise one or more (e.g., at least 2, at least 3, at least 4, at least 5, or 5 to 30 or more) nucleotides selected from R, Y, S, W, K, M, B, D, H, V, N (as defined by the IUPAC code).
- a double-stranded barcode can be made by making an oligonucleotide containing degenerate sequence (e.g., an oligonucleotide that has a run of 2-10 or more “Ns”) and then copying the complement of the barcode onto the other strand, as described below.
- Oligonucleotides that contain a variable sequence can be made by making a number of oligonucleotides separately, mixing the oligonucleotides together, and amplifying them en masse.
- the population of oligonucleotides that contain a variable sequence can be made as a single oligonucleotide that contains degenerate positions (i.e., positions that contain more than one type of nucleotide).
- such a population of oligonucleotides can be made by fabricating them individually or using an array of the oligonucleotides using in situ synthesis methods, cleaving the oligonucleotides from the substrate and optionally amplifying them. Examples of such methods are described in, e.g., Cleary et al. (Nature Methods 2004 1: 241-248) and LeProust et al. (Nucleic Acids Research 2010 38: 2522-2540).
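The relationship between a degenerate base region and its complexity can be made concrete with a short sketch. The IUPAC table below is the standard nucleotide code; the function names `dbr_complexity` and `expand_dbr` are hypothetical and introduced only for this illustration.

```python
from itertools import product

# Standard IUPAC nucleotide codes mapped to the bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def dbr_complexity(dbr: str) -> int:
    """Number of distinct sequences a degenerate region can encode."""
    total = 1
    for code in dbr.upper():
        total *= len(IUPAC[code])
    return total


def expand_dbr(dbr: str) -> list[str]:
    """Enumerate every concrete sequence encoded by a (short) degenerate region."""
    return ["".join(p) for p in product(*(IUPAC[c] for c in dbr.upper()))]


if __name__ == "__main__":
    print(dbr_complexity("NNNNNNN"))  # a run of seven Ns
    print(expand_dbr("RY"))
```

Under this calculation a run of seven “N” positions already encodes more than 10,000 sequences, which is the order of magnitude given for a high complexity DBR.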
- a barcode may be error correcting.
- Descriptions of exemplary error identifying (or error correcting) sequences can be found throughout the literature (e.g., in U.S. Patent Application Publications US2010/0323348 and US2009/0105959, both incorporated herein by reference).
- Error-correctable codes may be necessary for quantitating absolute numbers of molecules.
- Many reports in the literature use codes that were originally developed for error-correction of binary systems (Hamming codes, Reed Solomon codes etc.) or apply these to quaternary systems (e.g. quaternary Hamming codes; see Generalized DNA barcode design based on Hamming codes, Bystrykh 2012 PLoS One. 2012 7: e36852).
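As a rough illustration of why distance between barcodes matters for error correction, the sketch below measures Hamming distance within a barcode set and assigns an observed barcode to its unique nearest neighbor. This is plain nearest-neighbor decoding, not a construction of Hamming or Reed Solomon codes, and the barcode set and function names are assumptions made for the example.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes: list[str]) -> int:
    """Smallest Hamming distance between any two barcodes in the set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


def correct_barcode(observed: str, barcodes: list[str], max_errors: int = 1):
    """Return the single known barcode within max_errors of observed, else None."""
    hits = [b for b in barcodes if hamming(observed, b) <= max_errors]
    return hits[0] if len(hits) == 1 else None


if __name__ == "__main__":
    known = ["AAAA", "CCCC", "GGTT"]  # hypothetical barcode set
    print(min_pairwise_distance(known))
    print(correct_barcode("AAAT", known))
    print(correct_barcode("ACGT", known))
```

A set whose minimum pairwise distance is at least 3 lets any single substitution be assigned unambiguously, which is the property that the cited code families are designed to guarantee.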
- a barcode may additionally be used to determine the number of initial target polynucleotide molecules that have been analyzed, i.e., to “count” the number of initial target polynucleotide molecules that have been analyzed.
- PCR amplification of molecules that have been tagged with a barcode can result in multiple sub-populations of products that are clonally-related in that each of the different sub-populations is amplified from a single tagged molecule.
- the number of molecules tagged in the first step of the method can be estimated by counting the number of DBR sequences associated with a target sequence that is represented in the population of PCR products. This number is useful because, in certain embodiments, the population of PCR products made using this method may be sequenced to produce a plurality of sequences.
- the number of different barcode sequences that are associated with the sequences of a target polynucleotide can be counted, and this number can be used (along with, e.g., the sequence of the fragment, the sequence of the ends of the fragment, and/or the site of insertion of the DBR into the fragment) to estimate the number of initial template nucleic acid molecules that have been sequenced.
- Such tags can also be useful in correcting sequencing errors.
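The counting step described above can be sketched as grouping reads by target and counting the distinct DBR sequences seen for each. The input layout, a list of `(target, dbr)` pairs, and the function name are assumptions for illustration. The result is an estimate, since two initial molecules may carry the same DBR.

```python
from collections import defaultdict


def count_initial_molecules(reads):
    """Estimate initial molecules per target as the number of distinct DBRs.

    reads: iterable of (target_id, dbr_sequence) tuples, one per sequence read.
    Reads that share both target and DBR are treated as PCR copies of one molecule.
    """
    seen = defaultdict(set)
    for target, dbr in reads:
        seen[target].add(dbr)
    return {target: len(dbrs) for target, dbrs in seen.items()}


if __name__ == "__main__":
    reads = [
        ("targetA", "ACGT"),
        ("targetA", "ACGT"),  # clonal copy of the first molecule
        ("targetA", "TTGA"),
        ("targetB", "ACGT"),
    ]
    print(count_initial_molecules(reads))
```

Adding the fragment end coordinates to the grouping key, as the text suggests, would separate molecules that happen to share a DBR.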
- sample identifier sequence or “sample index” refer to a type of barcode that can be appended to a target polynucleotide, where the sequence identifies the source of the target polynucleotide (i.e., the sample from which the target polynucleotide is derived).
- each sample is tagged with a different sample identifier sequence (e.g., one sequence is appended to each sample, where the different samples are appended to different sequences), and the tagged samples are pooled. After the pooled sample is sequenced, the sample identifier sequence can be used to identify the source of the sequences.
- adapter refers to a nucleic acid that can be joined to at least one strand of a double-stranded DNA molecule.
- adapter refers to molecules that are at least partially double-stranded.
- An adaptor may be 20 to 150 bases in length, e.g., 40 to 120 bases, although adaptors outside of this range are envisioned.
- adaptor-tagged refers to a nucleic acid that has been tagged by, i.e., covalently linked with, an adaptor.
- An adaptor can be joined to a 5′ end and/or a 3′ end of a nucleic acid molecule.
- tagged DNA refers to DNA molecules that have an added adaptor sequence, i.e., a “tag” of synthetic origin.
- An adaptor sequence can be added (i.e., “appended”) by ligation.
- complexity refers to the total number of different sequences in a population. For example, if a population has 4 different sequences then that population has a complexity of 4. A population may have a complexity of at least 4, at least 8, at least 16, at least 100, at least 1,000, at least 10,000 or at least 100,000 or more, depending on the desired result.
- polynucleotides described herein may be referred to by a formula. Unless otherwise indicated the polynucleotides defined by a formula are oriented in the 5′ to 3′ direction.
- the components of the formula refer to separately definable sequences of nucleotides within a polynucleotide, where, unless implicit from the context, the sequences are linked together covalently such that a polynucleotide described by a formula is a single molecule. In some cases, the components of the formula are immediately adjacent to one another in the single molecule.
- a region defined by a formula may have additional sequences, a primer binding site, a molecular barcode, a promoter, or a spacer, etc., at its 3′ end, its 5′ end or both the 3′ and 5′ ends.
- the various component sequences of a polynucleotide may independently be of any desired length as long as they are capable of performing the desired function (e.g., hybridization to another sequence).
- the various component sequences of a polynucleotide may independently have a length in the range of 8-80 nucleotides, e.g., 10-50 nucleotides or 12-30 nucleotides.
- opposite strands refers to the top and bottom strands, where the strands are complementary to one another, except for damaged nucleotides.
- sequence variation refers to a difference in sequence, e.g., a substitution, deletion, insertion or rearrangement of one or more nucleotides in one sequence relative to another.
- amplification error refers to a mis-incorporated base, or a deletion/insertion caused by polymerase stutter. Stutter usually occurs in repeat sequences, e.g., short tandem repeats (STRs) or microsatellite repeats, and is presumed to be due to miscopying or slippage by the polymerase.
- target enrichment refers to a method in which selected sequences are separated from other sequences in a sample. This may be done by hybridization to a probe, e.g., hybridizing a biotinylated oligonucleotide to the sample to produce duplexes between the oligonucleotide and the target sequence, immobilizing the duplexes via the biotin group, washing the immobilized duplexes, and then releasing the target sequences from the oligonucleotides.
- a selected sequence may be enriched by amplifying that sequence, e.g., by PCR using one or more primers that hybridize to a site that is proximal to the target sequence.
- a minority variant is a variant that is present at a frequency of less than 50%, relative to other molecules in the sample.
- a minority variant may be a first allele of a polymorphic target sequence, where, in a sample, the ratio of molecules that contain the first allele of the polymorphic target sequence compared to molecules that contain other alleles of the polymorphic target sequence is 1:5 or less, 1:10 or less, 1:100 or less, 1:1,000 or less, 1:10,000 or less, 1:100,000 or less or 1:1,000,000 or less.
- duplex sequencing refers to a method in which sequences for both strands of a double-stranded molecule of genomic DNA are obtained. In duplex sequencing, the sequences derived from the top strand of double-stranded molecule of genomic DNA are distinguishable from sequences derived from the bottom strand of that molecule in such a way that the sequences for the top and bottom strands from the same double-stranded molecule of genomic DNA can be compared.
- direct repeat refers to a molecule that contains two copies of near identical sequences, i.e., sequences that are of the same length and that are at least 95% identical in nucleotide sequence.
- distance depends on the sequencing-by-synthesis method being used for sequencing. For example, in methods that rely on reversible chain terminators the distance between the 3′ end of a primer and a downstream nucleotide can be defined by the number of bases. In semiconductor or pyrosequencing methods the distance between the 3′ end of a primer and a downstream nucleotide can be defined by the number of flows because, in those methods, several nucleotides can be added in a single flow. Thus, “equidistant” can mean the same number of nucleotides if a reversible chain terminator-based sequencing method is used or the same number of flows if a semiconductor- or pyrosequencing-based sequencing method is used.
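The difference between counting bases and counting flows can be shown with a small sketch. It assumes a simplified flow-based chemistry in which one nucleotide type is offered per flow, in a fixed repeating order, and a whole homopolymer run is incorporated in a single flow. The flow order `"TACG"` and the function name are assumptions for illustration, not values taken from the text.

```python
def flows_to_extend(synth_seq: str, flow_order: str = "TACG") -> int:
    """Number of flows needed to synthesize synth_seq from the primer's 3' end.

    synth_seq is the strand being synthesized (5' to 3'). Each flow offers one
    nucleotide type; all consecutive matching bases are added in that flow.
    """
    unknown = set(synth_seq) - set(flow_order)
    if unknown:
        raise ValueError(f"bases not in flow order: {sorted(unknown)}")
    pos = 0
    flows = 0
    while pos < len(synth_seq):
        nucleotide = flow_order[flows % len(flow_order)]
        flows += 1
        # Incorporate the full homopolymer run, if the next base matches.
        while pos < len(synth_seq) and synth_seq[pos] == nucleotide:
            pos += 1
    return flows


if __name__ == "__main__":
    print(len("TTA"), flows_to_extend("TTA"))  # three bases, two flows
    print(len("AT"), flows_to_extend("AT"))    # two bases, five flows
```

Two spacers of equal length in bases can therefore differ in length when measured in flows, which is why equidistance has to be judged in the unit that the chosen platform uses.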
- the reverse complement of a sequence may be indicated by the prime (“ ′ ”) symbol.
- the reverse complement of a sequence referred to as “W” may be referred to as “W′”.
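For readers who want the prime notation spelled out, a reverse complement can be computed as in the sketch below. The function name is an assumption; the example sequence is the one used elsewhere in this description (SEQ ID NO: 1).

```python
# Translation table for Watson-Crick complements of DNA bases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    w = "GATCGGATCGA"
    w_prime = reverse_complement(w)
    print(w_prime)
    print(reverse_complement(w_prime) == w)  # applying it twice restores W
```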
- nucleic acid includes a plurality of such nucleic acids
- compound includes reference to one or more compounds and equivalents thereof known to those skilled in the art, and so forth.
- the practice of the present invention may employ, unless otherwise indicated, conventional techniques and descriptions of organic chemistry, polymer technology, molecular biology (including recombinant techniques), cell biology, biochemistry, and immunology, which are within the skill of the art.
- Such conventional techniques include polymer array synthesis, hybridization, ligation, and detection of hybridization using a label. Specific illustrations of suitable techniques can be had by reference to the example herein below. However, other equivalent conventional procedures can, of course, also be used.
- Such conventional techniques and descriptions can be found in standard laboratory manuals such as Genome Analysis: A Laboratory Manual Series (Vols.
- a way to sequence a template that has a direct repeat, i.e., a template that comprises a first repeat sequence and a second repeat sequence, wherein the first and second repeat sequences are in a direct repeat and either identical or nearly identical.
- the first repeat sequence and the second repeat sequence may be amplified from opposite strands of a double-stranded fragment of DNA.
- the sequences of the repeats may be identical except for positions that correspond to damaged nucleotides in the double-stranded fragment of DNA or errors that occur during amplification.
- An example of such a direct repeat is illustrated in FIG. 1 . As shown, within each repeat molecule the first repeat and the second repeat are amplified from opposite strands of a fragment of double-stranded DNA, e.g., genomic DNA.
- the first repeat has the same or a very similar sequence as one strand (the top strand of the fragment, for example) of a fragment of double-stranded genomic DNA whereas the second repeat has the same or a very similar sequence as the reverse complement of the other strand of the fragment (e.g., the bottom strand of the fragment).
- the first and second repeat sequences should be identical except for nucleotides that correspond to (i.e., are at a position that corresponds to the position of) damaged nucleotides in the fragment of double-stranded genomic DNA or errors that have occurred during amplification.
- the double-stranded fragment may be made synthetically or derived from a double-stranded plasmid, for example.
- a “damaged nucleotide” refers to any derivative of adenine, cytosine, guanine, and thymine that has been altered in a way that allows it to pair with a different base.
- some bases can be oxidized, alkylated or deaminated in a way that affects base pairing.
- 7,8-dihydro-8-oxoguanine (8-oxo-dG) is a derivative of guanine that base pairs with adenine instead of cytosine. This derivative causes a G to T transversion after replication. Deamination of cytosine produces uracil, which can base pair with adenine, leading to a C to T change after replication.
- Other examples of damaged nucleotides that are capable of mismatched pairing are known.
- the sequences of first and second repeats have identical lengths and are at least 95% identical (e.g., at least 95% identical, at least 96% identical, at least 97% identical, at least 98% identical, at least 99% identical or 100% identical, depending on, e.g., the extent of DNA damage in the fragment of double-stranded genomic DNA and/or amplification errors) and, with the exception of nucleotides that correspond to damaged nucleotides and amplification errors, should be identical.
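The identity criterion above (equal length, at least 95% identical) can be checked position by position, with no alignment needed because the repeats have the same length. The function names and the default threshold argument are assumptions for this sketch.

```python
def percent_identity(first: str, second: str) -> float:
    """Percentage of positions at which two equal-length sequences match."""
    if len(first) != len(second) or not first:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(first, second))
    return 100.0 * matches / len(first)


def is_near_identical(first: str, second: str, threshold: float = 95.0) -> bool:
    """True if the repeats have identical length and meet the identity threshold."""
    if len(first) != len(second) or not first:
        return False
    return percent_identity(first, second) >= threshold


if __name__ == "__main__":
    first = "GATCGGATCGA" * 2      # 22 nt, for illustration only
    second = "T" + first[1:]       # one differing position
    print(round(percent_identity(first, second), 2))
    print(is_near_identical(first, second))
```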
- the molecules may have a unit length of 1, meaning that there is only one copy of first repeat and one copy of the second repeat in each molecule.
- the template molecules may be single stranded or double stranded.
- the template is in its single stranded form when it is being sequenced.
- the sequence of the first and second repeats may have a length of at least 50 nucleotides and in some embodiments may be in the range of 50 nucleotides to 2 kb in length, e.g., 50-500 nt or 50-300 nt.
- the direct repeat template may be in a sample that contains other direct repeat templates.
- the complexity and median length of the sequence of the first repeat may vary and may be approximately the same as the complexity and median length of the sequence of the second repeat, since those sequences are almost identical.
- the first repeat and the second repeat may each have a complexity of at least 10^3, e.g., at least 10^4, at least 10^5, at least 10^6, at least 10^7, at least 10^8, at least 10^9 or at least 10^10, for example, meaning that in the population, the first repeat and the second repeat are each represented by at least 10^3 different sequences.
- the lengths of the first and second repeats may depend on the lengths of the fragments of DNA in the sample from which the molecules are made.
- the fragments may have a median size that is no more than 2 kb in length (e.g., in the range of 50 bp to 2 kb, e.g., 75 bp to 1.5 kb, 100 bp to 1 kb, 100 bp to 500 bp).
- the lengths of the fragment may be tailored to the sequencing platform being used. Examples of how these molecules can be made will be described in greater detail below.
- the direct repeat molecule may be made by copying a double-stranded fragment of DNA to produce the direct repeat molecule, where the first and second repeats of the direct repeat molecule are to be amplified from opposite strands of the double-stranded fragment of DNA.
- the method may comprise, in the same reaction, hybridizing a primer to a first site that is upstream of the first repeat sequence and hybridizing a primer to a second site that is upstream of the second repeat sequence.
- the first and second sites (i.e., the sites to which the first and second primers bind, respectively) are upstream of the first and second repeat sequences, respectively, and equidistant from the first and second repeat sequences. This is illustrated in FIG. 2 .
- the first primer binds to a site that is upstream of (i.e., 3′ to) the first repeat whereas the second primer binds to a site that is upstream of (i.e., 3′ to) the second repeat, where the distances between the primers and their respective repeats are the same.
- if the 3′ end of the first primer hybridizes to a nucleotide that is upstream of (i.e., 3′ to) the first repeat by n bases (where n is in the range of, e.g., 5 to 30) then the 3′ end of the second primer hybridizes to a nucleotide that is upstream of (i.e., 3′ to) the second repeat by n bases.
- the distance between the primer binding sites and the repeats can be defined by the number of bases for some sequencing methods (e.g., Illumina’s dye terminator sequencing method); the distance can be defined by “flows” in other methods (e.g., Ion Torrent or pyrosequencing methods).
- the method may comprise subjecting the hybridization product to a sequencing-by-synthesis sequencing reaction to produce a sequence read that comprises a combination of the first and second repeat sequences, meaning that the sequences are merged into one.
- sequencing-by-synthesis methods are those that involve extending a primer using a template and detecting which nucleotide is added at each position. Sequencing-by-synthesis methods include, but are not limited to, Illumina’s reversible dye terminator method, Thermo’s Ion Torrent method (which detects ions as they are released by DNA polymerase) and pyrosequencing, although others are known.
- the sequence of a template is determined using reversible terminators chemistry (Turcatti et al., Nucleic Acids Res. 2008 36:e25).
- a single fluorescently labeled, 3′-blocked nucleotide is added in a templated primer extension reaction.
- the identity of the fluorescent label added is detected by fluorescent imaging.
- the labels and terminators are chemically removed in order to prepare the primer extension product for the next cycle.
- sequence read produced using this method will be a combination of the first and the second repeat sequences, where the term “combination” is intended to mean that the sequences of the first and second repeats are merged, superimposed or melded into one.
- if the sequence of the first repeat is GATCGGATCGA (SEQ ID NO: 1) and the sequence of the second repeat is GATCGGATCGA (SEQ ID NO: 1), then the sequence read will contain only one copy of the sequence GATCGGATCGA (SEQ ID NO: 1), where some of the signal used to generate the sequence read is generated by extension of the first primer and some of the signal used to generate the sequence read is generated by the extension of the second primer in the same reaction.
- Differences in the sequences of the first and second repeats can be identified because the underlying signal corresponding to the difference will be mixed (i.e., will be a composite of signals produced by two different bases at that position). Positions that have a mixed signal can be identified because they are associated with a low-quality base call. As such, differences in the sequences of the first and second repeats can be identified as positions that have a low-quality base call.
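The way two simultaneous extensions merge into one read can be illustrated with a toy model. It assumes one signal channel per base, equal contributions from the two primers, and a simple purity threshold in place of a real base-calling algorithm. All names, weights and thresholds are assumptions for illustration only.

```python
BASES = "ACGT"


def merged_signal(first: str, second: str, w1: float = 0.5, w2: float = 0.5):
    """Per-position channel intensities when both repeats are read in the same cycles."""
    if len(first) != len(second):
        raise ValueError("repeats must have the same length")
    signal = []
    for b1, b2 in zip(first, second):
        channels = dict.fromkeys(BASES, 0.0)
        channels[b1] += w1  # contribution from extension of the first primer
        channels[b2] += w2  # contribution from extension of the second primer
        signal.append(channels)
    return signal


def call_bases(signal, min_purity: float = 0.9):
    """Call the strongest channel per position; report mixed positions as 'N'."""
    read, mixed_positions = [], []
    for i, channels in enumerate(signal):
        base = max(channels, key=channels.get)
        purity = channels[base] / sum(channels.values())
        if purity < min_purity:
            read.append("N")
            mixed_positions.append(i)
        else:
            read.append(base)
    return "".join(read), mixed_positions


if __name__ == "__main__":
    same = call_bases(merged_signal("GATCGGATCGA", "GATCGGATCGA"))
    diff = call_bases(merged_signal("GATCGGGATCGA", "GATCGGTATCGA"))
    print(same)  # one merged copy, no mixed positions
    print(diff)  # a single mixed position where the repeats differ
```

A real base caller would typically still emit a base with a low quality score rather than an “N”; the flagged positions here play the role of the low-quality base calls described above.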
- the sequence read comprises, for each position of the sequence read, a quality score indicating the reliability of the base(s) called at that position.
- Base calling is the process by which an order of nucleotides in a template is inferred during a sequencing reaction.
- next generation sequencing platforms that use fluorescently labeled reversible terminators have a unique color for each base. These are incorporated into the complementary strand of the DNA template and captured with a sensitive CCD camera. These images are processed into signals which are used to infer the order of nucleotides, also known as base calling.
- Base calling accuracy can be measured a variety of different ways.
- base calling accuracy can be measured using a Q score (Phred quality score), which is a common metric to assess the accuracy of a sequencing run.
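For reference, the Phred quality score relates to the estimated probability of an incorrect base call by Q = -10 log10(P). A minimal sketch of the conversion in both directions follows; the function names are assumptions.

```python
import math


def phred_from_error(p_error: float) -> float:
    """Phred quality score for a given probability of an incorrect base call."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("p_error must be in (0, 1]")
    return -10.0 * math.log10(p_error)


def error_from_phred(q: float) -> float:
    """Probability of an incorrect base call for a given Phred quality score."""
    return 10.0 ** (-q / 10.0)


if __name__ == "__main__":
    print(phred_from_error(0.001))  # about Q30: one error in 1,000 calls
    print(error_from_phred(20))     # about 0.01: one error in 100 calls
```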
- the method may be used to identify positions that differ in the first and second repeats.
- a position in the sequence read that is uncalled or associated with a low-quality score indicates that first and second repeat sequences differ at a nucleotide that corresponds to that position.
- the sequence read may contain only one copy of the sequence GATCGG[G/T]ATCGA (SEQ ID NO: 3), where “G/T” is a base that has a mixed signal and is therefore associated with a poor quality base call.
- the quality of the base calls for the non-G/T bases will be high and the quality of the base call for the G/T base will be poor because some of the signal for that position, as analyzed by the base calling algorithm, will be generated by extension of the first primer and some of the signal will be generated by the extension of the second primer, in the same reaction.
- the method may further comprise analyzing the underlying signals for that position to determine the identities of the nucleotides at that position in the first and second repeats.
- the underlying signals (i.e., prior to base calling and referred to as primary sequence data) could be analyzed to determine that the position contains a mixture of G and T, thereby indicating that the first repeat contains a G or T at that position, and the second repeat contains the other nucleotide.
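Resolving a mixed position from the underlying signals can be sketched as listing every base whose channel carries a substantial share of the total intensity. The per-channel dictionary layout, the `min_fraction` threshold and the function name are assumptions for illustration. Real primary sequence data would need platform-specific normalization first.

```python
def bases_present(channels: dict, min_fraction: float = 0.25) -> list:
    """Bases whose share of the total signal at one position meets min_fraction."""
    total = sum(channels.values())
    if total <= 0:
        raise ValueError("no signal at this position")
    return sorted(b for b, v in channels.items() if v / total >= min_fraction)


if __name__ == "__main__":
    mixed = {"A": 0.02, "C": 0.01, "G": 0.50, "T": 0.47}  # hypothetical intensities
    clean = {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01}
    print(bases_present(mixed))
    print(bases_present(clean))
```

As in the text, this identifies which two nucleotides are present but not which repeat carries which one.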
- the method may comprise reading a combination of signals obtained by simultaneous extension of the first and second primers to produce primary sequencing data, processing the primary sequencing data using a base-calling algorithm to produce a sequence read composed of a sequence of base calls, each base call associated with a quality score indicating the reliability of the base call; and outputting the sequence read based on the quality scores.
- the quality scores allow differences between the first and second repeats to be identified.
- the first and second sites in the template are the same sequence.
- a single primer may be used in the method, where the primer binds to two sites in the template.
- the first and second sites in the template (i.e., the sequences to which the first and second primers bind) may instead be different sequences.
- two or more primers may be used in the method, where the primers bind different sequences in the template, one upstream of the first repeat and the other upstream of the second repeat.
- the method may involve determining how many strands of the first repeat are sequenced relative to the number of strands of the second repeat, or if a sufficient number of molecules have been sequenced. These embodiments may be implemented by adding a calibration sequence to the template, as shown in FIG. 3 .
- the template may comprise: a first calibrator sequence that is present between the first site and the first repeat; and a second calibrator sequence that is present between the second site and the second repeat, wherein the first and second calibrator sequences are the same length (e.g., may be two, three or four bases in length or the same number of flows in length, depending on the sequencing method used) and have a different sequence; and the sequence read of step (b) includes positions that correspond to the first and second calibrator sequences.
- the underlying signals corresponding to the first and second calibrator sequences can be examined to determine how many strands of the first and second repeats are sequenced in the reaction.
- the underlying signals corresponding to the first and second calibrator sequences can be examined to determine if a sufficient number of molecules have been sequenced.
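Because the two calibrator sequences differ at every position, the signal at those positions separates cleanly by primer, and the relative contribution of each repeat can be estimated. The sketch below assumes two-base calibrators “TT” and “AA” as illustrative values, a per-channel dictionary of intensities for each calibrator position, and a hypothetical minimum total signal as the sufficiency test.

```python
def strand_fractions(key_signal, key1: str = "TT", key2: str = "AA"):
    """Fraction of calibrator signal attributable to each repeat's primer extension."""
    if not (len(key_signal) == len(key1) == len(key2)):
        raise ValueError("one signal entry is needed per calibrator position")
    if any(a == b for a, b in zip(key1, key2)):
        raise ValueError("calibrators must differ at every position")
    s1 = sum(channels[base] for channels, base in zip(key_signal, key1))
    s2 = sum(channels[base] for channels, base in zip(key_signal, key2))
    total = s1 + s2
    if total <= 0:
        raise ValueError("no calibrator signal detected")
    return s1 / total, s2 / total


def enough_signal(key_signal, min_total: float) -> bool:
    """Crude sufficiency check: total calibrator intensity meets a chosen minimum."""
    return sum(sum(channels.values()) for channels in key_signal) >= min_total


if __name__ == "__main__":
    # Hypothetical intensities at the two calibrator positions.
    key_signal = [
        {"A": 40.0, "C": 0.0, "G": 0.0, "T": 60.0},
        {"A": 38.0, "C": 0.0, "G": 0.0, "T": 62.0},
    ]
    print(strand_fractions(key_signal))
    print(enough_signal(key_signal, min_total=150.0))
```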
- templates are clonally amplified, and the amplification products are sequenced in a highly parallel fashion.
- Such methods are reviewed in, e.g., Metzker et al. (Genome Res. 2005 15:1767-1776) and Bentley (Curr. Opin. Genet. Dev. 2006 16: 545-55).
- In Illumina sequencing, the templates are spread in a flow cell and immobilized on a support (typically glass; see Fedurco et al., Nucleic Acids Res. 2006 34:e22), where they are amplified in place by bridge PCR, which generates clusters of identical templates (or “colonies”) on the support.
- the present method may be implemented by amplifying the template on a substrate by bridge PCR to produce a colony that comprises copies of the template, hybridizing one or more primers to the colony, wherein a primer hybridizes to a first site that is upstream of the first repeat sequence and a primer hybridizes to a second site that is upstream of the second repeat sequence, wherein the first and second sites are upstream of the first and second repeat sequences, respectively, and equidistant from the first and second repeat sequences; and obtaining the sequence of the template by a sequencing-by-synthesis sequencing reaction to produce a sequence read that comprises a combination of the first and second repeat sequences.
- the top and bottom strands of the bridge PCR amplification products may be sequenced by Illumina’s sequencing method (which is referred to as “paired end” sequencing).
- the sequence of a top strand of a bridge PCR product can be compared to the sequence of a bottom strand of a bridge PCR product. Positions that are associated with a low-quality base call as a result of a difference in sequence between the first and second repeats should have a low-quality base call in both strands.
- after sequencing both strands of the product by paired end sequencing one can produce a consensus sequence for the top strand of the initial double-stranded fragment and a consensus sequence for the bottom strand of the initial double-stranded fragment.
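One simple way to combine the two reads of a paired end run is to keep a base only where both reads agree and both quality scores are acceptable, masking everything else. The sketch below assumes the second read has already been reverse complemented into the orientation of the first; the function name and the `min_q` threshold are assumptions for illustration.

```python
def masked_consensus(read_a: str, qual_a, read_b: str, qual_b, min_q: int = 20) -> str:
    """Consensus of two same-orientation reads; disagreeing or low-quality positions become 'N'."""
    if not (len(read_a) == len(read_b) == len(qual_a) == len(qual_b)):
        raise ValueError("reads and quality lists must have the same length")
    consensus = []
    for a, qa, b, qb in zip(read_a, qual_a, read_b, qual_b):
        if a == b and qa >= min_q and qb >= min_q:
            consensus.append(a)
        else:
            consensus.append("N")  # mask rather than guess
    return "".join(consensus)


if __name__ == "__main__":
    quals = [30] * 6 + [5] + [30] * 5  # position 6 has a low-quality call in both reads
    print(masked_consensus("GATCGGGATCGA", quals, "GATCGGTATCGA", quals))
```

Masking is the more conservative of the two options mentioned; a model that weights each call by its quality score would retain more information at the cost of added complexity.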
- Low quality bases can be masked or integrated into a model in which the quality scores are taken into account. Sequences that are not present in both the top and bottom strands of the initial double-stranded fragment can thereby
- FIG. 3 illustrates an example of the method.
- the template is a double stranded molecule and one or both strands need to be sequenced (sequencing of the bottom strand is shown).
- the direct repeat template has flow cell sequences (e.g., Illumina’s P5 and P7 sequences) at the ends and a primer binding site between the first and second repeats.
- this molecule is amplified from a double-stranded fragment, where the first and second repeat sequences (W* and W or C* and C) are amplified from opposite strands of a double-stranded fragment of DNA and are identical except for positions that correspond to damaged nucleotides in the double-stranded fragment of DNA or errors that occur during amplification.
- the method may involve hybridizing two primers (designated P1 and P2, which can be the same or different) to the template (after it has been amplified).
- the repeats each have a calibrator sequence (referred to as “key 1” and “key 2”) that can be used to determine the relative number of copies of the first and second repeats that are sequenced in a reaction.
- the part of the sequence read obtained from primer P1 should contain key 1 (TT) and the part of the sequence read obtained from primer P2 should contain key 2 (AA).
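One way to picture how the keys could act as calibrators is sketched below in Python. The sketch assumes, hypothetically, that per-cycle signal intensities for each base channel are available for the key cycles, and that the TT key is read only from the first primer and the AA key only from the second primer, so that the T:A signal ratio approximates the relative contribution of the two repeats. The function name and the intensity format are illustrative and are not taken from the disclosure.

```python
def estimate_repeat_fraction(key_cycles, key1_base="T", key2_base="A"):
    """Estimate the fraction of key signal contributed by the first repeat.

    key_cycles: list of dicts mapping base -> intensity, one dict per key
    cycle (a hypothetical data format). key1_base and key2_base are the
    bases expected from the first and second primers during the key
    cycles (TT and AA in the example described in the text).
    """
    signal_first = sum(cycle.get(key1_base, 0.0) for cycle in key_cycles)
    signal_second = sum(cycle.get(key2_base, 0.0) for cycle in key_cycles)
    total = signal_first + signal_second
    if total == 0:
        raise ValueError("no key signal detected")
    return signal_first / total


# Illustrative (made-up) intensities for the two key cycles.
cycles = [
    {"T": 480.0, "A": 520.0, "C": 5.0, "G": 3.0},
    {"T": 500.0, "A": 500.0, "C": 4.0, "G": 6.0},
]
fraction_first = estimate_repeat_fraction(cycles)
```

A fraction near 0.5 would suggest that both repeats contribute about equally to the merged read; a strongly skewed value could indicate that one primer is extending less efficiently than the other.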
- the primers may be extended but not read for the first few cycles, thereby allowing one to obtain the sequence of the keys and/or repeats faster.
- the direct repeat template may have different, non-complementary sequences (Sequences 1 and 2 in FIG. 3) of at least 10 nucleotides in length (e.g., at least 10, 12 or 14 nucleotides) that allow the fragments to be amplified by a single pair of primers: a first primer that hybridizes to one sequence and another that hybridizes to the complement of the other sequence.
- these sequences may be compatible with the sequencing platform being used. These sequences do not need to be at the very end of a molecule although, in many embodiments, the sequences are within 50 nt, e.g., within 30 nt, of the end of the molecule.
- the template molecule should have a junction sequence between the first and second repeats.
- the junction sequence should be at least 10 nucleotides in length (e.g., 10 to 100 nt).
- the template may contain a molecular barcode (e.g., a sample identifier or molecule identifier) at any position (outside of the repeats).
- the method described above can be employed to analyze genomic DNA from virtually any organism, including, but not limited to, plants, animals (e.g., reptiles, mammals, insects, worms, fish, etc.), tissue samples, bacteria, fungi (e.g., yeast), phage, viruses, cadaveric tissue, archaeological/ ancient samples, etc.
- the genomic DNA used in the method may be derived from a mammal, wherein in certain embodiments the mammal is a human.
- the sample may contain genomic DNA from a mammalian cell, such as, a human, mouse, rat, or monkey cell.
- the sample may be made from cultured cells or cells of a clinical sample, e.g., a tissue biopsy, scrape or lavage or cells of a forensic sample (i.e., cells of a sample collected at a crime scene).
- the nucleic acid sample may be obtained from a biological sample such as cells, tissues, bodily fluids, and stool. Bodily fluids of interest include but are not limited to, blood, serum, plasma, saliva, mucous, phlegm, cerebral spinal fluid, pleural fluid, tears, lactal duct fluid, lymph, sputum, synovial fluid, urine, amniotic fluid, and semen.
- a sample may be obtained from a subject, e.g., a human.
- the sample comprises fragments of human genomic DNA.
- the sample may be obtained from a cancer patient.
- the sample may be made by extracting fragmented DNA from a patient sample, e.g., a formalin-fixed paraffin embedded tissue sample.
- the patient sample may be a sample of cell-free “circulating” DNA from a bodily fluid, e.g., peripheral blood, e.g., from the blood of a patient or of a pregnant female.
- the DNA fragments used in the initial step of the method should be non-amplified DNA that has not been denatured beforehand.
- the DNA in the initial sample may be made by extracting genomic DNA from a biological sample, and then fragmenting it.
- the fragmenting may be done mechanically (e.g., by sonication, nebulization, or shearing, etc.) or using a double stranded DNA “dsDNA” fragmentase enzyme (New England Biolabs, Ipswich MA).
- after the DNA is fragmented, the ends may be polished and A-tailed prior to ligation to one or more adaptors. Alternatively, the ends may be polished and ligated to adaptors in a blunt-end ligation reaction.
- the DNA in the initial sample may already be fragmented (e.g., as is the case for FFPE (formalin-fixed paraffin embedded) samples and circulating cell-free DNA (cfDNA), e.g., ctDNA).
- the fragments in the initial sample may have a median size that is below 1 kb (e.g., in the range of 50 bp to 500 bp, or 80 bp to 400 bp), although fragments having a median size outside of this range may be used.
- the amount of DNA in a sample may be limiting.
- the initial sample of fragmented DNA may contain less than 200 ng of fragmented human DNA, e.g., 1 pg to 20 pg, 10 pg to 200 ng, 100 pg to 200 ng, 1 ng to 200 ng or 5 ng to 50 ng, or less than 10,000 (e.g., less than 5,000, less than 1,000, less than 500, less than 100, less than 10 or less than 1) haploid genome equivalents, depending on the genome.
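The relationship between DNA mass and haploid genome equivalents can be sketched with a short calculation. The example below assumes a mass of roughly 3.3 pg per haploid human genome, a commonly used approximation that is not stated in the text; the function name is illustrative.

```python
# Assumed approximate mass of one haploid human genome, in picograms.
HAPLOID_HUMAN_GENOME_PG = 3.3


def haploid_genome_equivalents(mass_ng, genome_mass_pg=HAPLOID_HUMAN_GENOME_PG):
    """Convert a DNA mass in nanograms to haploid genome equivalents."""
    return mass_ng * 1000.0 / genome_mass_pg


# Under this assumption, about 33 ng of human DNA corresponds to roughly
# 10,000 haploid genome equivalents.
equivalents_33_ng = haploid_genome_equivalents(33.0)
```

For other organisms the genome mass would differ, which is consistent with the statement that the number of genome equivalents depends on the genome.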
- sample identifiers i.e., a sequence that identifies the sample to which the sequence is added, which can identify the patient, or a tissue, etc.
- sample identifiers can be added to the polynucleotides prior to sequencing, so that multiple (e.g., at least 2, at least 4, at least 8, at least 16, at least 48, at least 96 or more) samples can be multiplexed.
- the sample identifier may be ligated to the initial polynucleotides as part of the asymmetric adaptor, or the sample identifier may be ligated to the polynucleotides in the sub-samples, before or after amplification of those polynucleotides.
- the tag may be added by primer extension, i.e., using a primer that has a 3′ end that hybridizes to an adaptor sequence, and a 5′ tail that contains the sample identifier.
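As a rough illustration of how multiplexed reads might be separated by sample identifier after sequencing, the following Python sketch assigns reads to samples. It assumes, hypothetically, that the identifier occupies a fixed number of bases at the start of each read and must match exactly; real pipelines typically tolerate mismatches and may read the identifier separately. All names are illustrative.

```python
from collections import defaultdict


def demultiplex(reads, sample_ids, id_length):
    """Group reads by sample identifier.

    reads: iterable of read strings that begin with the identifier.
    sample_ids: dict mapping identifier sequence -> sample name.
    id_length: number of leading bases that make up the identifier.
    Reads whose identifier is not recognized are kept, untrimmed, under
    the key "unassigned".
    """
    by_sample = defaultdict(list)
    for read in reads:
        tag, insert = read[:id_length], read[id_length:]
        sample = sample_ids.get(tag)
        if sample is None:
            by_sample["unassigned"].append(read)
        else:
            by_sample[sample].append(insert)
    return dict(by_sample)


sample_ids = {"ACGT": "patient_1", "TTAG": "patient_2"}
reads = ["ACGTTTGACCA", "TTAGGGCATCG", "GGGGAAAA"]
grouped = demultiplex(reads, sample_ids, id_length=4)
```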
- the population of direct repeat molecules may be made in a variety of different ways. These methods rely on creating circular molecules, retaining physical proximity between the two strands of one double-stranded DNA molecule, or physically isolating two strands of one double-stranded molecule, during manipulation steps. The methods also divide into strategies requiring one, or more, adaptor types. These methods can be done by fragmenting, polishing and then tailing the ends of the fragments before adaptor ligation. Alternatively, transposases can be used to add adaptor sequences.
- standard transposons can be used but then modified to create a Y-shaped adaptor using oligonucleotide replacement (Grunenwald H, Baas B, Goryshin I, Zhang B, Adey A, Hu S, Shendure J, Caruccio N, Maffitt M 2011. Nextera PCR-free DNA library preparation for next-generation sequencing. [Poster presentation, AGBT 2011]; Gertz J, Varley KE, Davis NS, Baas BJ, Goryshin IY, Vaidyanathan R, Kuersten S, Myers RM 2012. Transposase mediated construction of RNA-seq libraries. Genome Res 22: 134-141).
- the direct repeat template may be made by (a) ligating adaptor sequences onto both ends of top and bottom strands of a population of fragments of double-stranded genomic DNA to produce double-stranded molecules comprising (i) a top strand comprising a 5′ sequence (e.g., X) at the 5′ end and a junction sequence (e.g., J) at the 3′ end; and (ii) a bottom strand comprising a 5′ sequence (e.g., Y′) at the 5′ end, and the complement of the junction sequence (J′) at the 3′ end; and (b) extending the 3′ end of the top strands (i.e., the strand that contains sequence X) using the bottom strand as a template, thereby copying the complement of the bottom strand, as well as sequences J and Y, into the same molecule as the top strand to produce a direct repeat molecule of formula: X-TOP-J-BOT′-Y, where
- TOP and BOT′ vary in the population and have a median length of at least 50 nucleotides and X and Y are different, non-complementary sequences of at least 10 nucleotides in length that do not vary in the population; and J is a junction sequence. Examples of this method are shown in the figures and described in greater detail below.
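The formula X-TOP-J-BOT′-Y can be pictured with a small Python sketch that assembles a direct repeat molecule in silico from the two strands of a fragment. All sequences are written 5′ to 3′. The X, J and Y sequences and the short fragment used here are made up for illustration, and the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def direct_repeat(top, bottom, x, j, y):
    """Assemble X-TOP-J-BOT'-Y, where BOT' is the complement of the bottom strand."""
    bot_prime = reverse_complement(bottom)
    return x + top + j + bot_prime + y


def repeat_differences(top, bot_prime):
    """Return the positions at which the two repeats differ."""
    return [i for i, (a, b) in enumerate(zip(top, bot_prime)) if a != b]


# A fragment whose top strand carries one altered base (position 5) that
# is not mirrored in the bottom strand, e.g., as a result of damage.
top = "ACGGTTA"
bottom = reverse_complement("ACGGTCA")
molecule = direct_repeat(top, bottom,
                         x="AAGGCCTTAA", j="GATCGATCGA", y="CCTTGGAACC")
differences = repeat_differences(top, reverse_complement(bottom))
```

Because both repeats are in the same orientation, an undamaged fragment yields TOP and BOT′ that are identical, and any difference between them marks a candidate damaged or mis-copied position.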
- a direct repeat molecule may be made by ligating a single adaptor onto both ends of top and bottom strands of a population of fragments of double-stranded genomic DNA, such that, the individual molecules are in a covalently open circle and, in the individual molecules in the population, sequence X is added onto the 5′ end of the top strands of the fragment and sequence Y′ is ligated onto the 5′ end of the bottom strands of the fragments.
- This method involves extending the 3′ end of the top strands (i.e., the strand that contains sequence X) using the bottom strand as a template, thereby copying the complement of the bottom strand, as well as sequence Y, into the same molecule as the top strand.
- Such a molecule can be amplified using primers that have a 3′ end that is the same as or that hybridize to sequence X and Y.
- An example of such a method is illustrated in FIGS. 4 and 5, where the top strand of the fragments of genomic DNA are indicated as “forward” and “reverse” respectively and sequences X and Y′ are indicated as sequences R1 and R2.
- the direct repeat molecules may be made by ligating a single adaptor onto both ends of top and bottom strands of a population of fragments of double-stranded genomic DNA, such that, the individual molecules are in a covalently closed circle and, in the individual molecules in the population, sequence X is added onto the 5′ end of the top strands of the fragment and sequence Y′ is ligated onto the 5′ end of the bottom strands of the fragments.
- This method involves creating one or more nicks by reacting, e.g., an adaptor containing dUTP and a mixture of UDG/endonuclease IV, extending the 3′ end of the top strands (i.e., the strand that contains sequence X) using the bottom strand as a template, thereby copying the complement of the bottom strand, as well as sequence Y, into the same molecule as the top strand.
- Such a molecule can be amplified using primers that have a 3′ end that is the same as or that hybridizes to sequence X and Y.
- the direct repeat template may be of the formula X-TOP-J-BOT′-Y, wherein (i) within each repeat molecule TOP and BOT′ are amplified from opposite strands of a double-stranded fragment of genomic DNA and are identical except for positions that correspond to damaged nucleotides in the double-stranded fragment of genomic DNA or errors that occur during amplification; (ii) TOP and BOT′ have a median length of at least 50 nucleotides; (iii) X and Y are different, non-complementary sequences of at least 10 nucleotides; and (iv) J is a junction sequence of, e.g., at least 10 nucleotides in length.
- the direct repeat template may have a strand of the formula X-(T)TOP(A)-J-(T)BOT′(A)-Y, wherein (T) and (A) are thymine and adenine nucleotides that are immediately adjacent to TOP and BOT′.
- Such molecules may be made by, for example (a) ligating adaptor sequences onto both ends of top and bottom strands of a population of fragments of double-stranded genomic DNA to produce double-stranded molecules comprising: (i) a top strand comprising sequence X at the 5′ end and sequence J at the 3′ end; and (ii) a bottom strand comprising sequence Y′ at the 5′ end, and sequence J′ at the 3′ end; and (b) extending the 3′ end of the top strands using the bottom strands as a template, thereby adding the complement of the bottom strands and sequence Y onto the 3′ end of the top strands.
- This method is illustrated in FIGS. 4 and 5 .
- kits for practicing the subject method as described above.
- the various components of the kit may be present in separate containers or certain compatible components may be pre-combined into a single container, as desired.
- the subject kits may further include instructions for using the components of the kit to practice the subject methods, i.e., to provide instructions for sample analysis.
- the instructions for practicing the subject methods are generally recorded on a suitable recording medium.
- the instructions may be printed on a substrate, such as paper or plastic, etc.
- the instructions may be present in the kits as a package insert, in the labeling of the container of the kit or components thereof (i.e., associated with the packaging or subpackaging) etc.
- the instructions are present as an electronic storage data file present on a suitable computer readable storage medium, e.g., CD-ROM, diskette, etc.
- the actual instructions are not present in the kit, but means for obtaining the instructions from a remote source, e.g., via the internet, are provided.
- An example of this embodiment is a kit that includes a web address where the instructions can be viewed and/or from which the instructions can be downloaded. As with the instructions, this means for obtaining the instructions is recorded on a suitable substrate.
- the method described above may be employed to analyze any type of sample, including, but not limited to samples that contain heritable mutations, samples that contain somatic mutations, samples from mosaic individuals, pregnant females (in which some of the sample contains DNA from a developing fetus), and samples that contain a mixture of DNA from different sources.
- the method may be used to identify a minority variant that, in some cases, may be due to a somatic mutation in a person.
- the method may be employed to detect an oncogenic mutation (which may be a somatic mutation) in, e.g., PIK3CA, NRAS, KRAS, JAK2, HRAS, FGFR3, FGFR1, EGFR, CDK4, BRAF, RET, PDGFRA, KIT or ERBB2, which may be associated with breast cancer, melanoma, renal cancer, endometrial cancer, ovarian cancer, pancreatic cancer, leukemia, colorectal cancer, prostate cancer, mesothelioma, glioma, medulloblastoma, polycythemia, lymphoma, sarcoma or multiple myeloma (see, e.g., Chial 2008 Proto-oncogenes to oncogenes to cancer.
- oncogenic mutations include mutations in, e.g., APC, AXIN2, CDH1, GPC3, CYLD, EXT1, EXT2, PTCH, SUFU, FH, SDHB, SDHC, SDHD, VHL, TP53, WT1, STK11/LKB1, PTEN, TSC1, TSC2, CDKN2A, CDK4, RB1, NF1, BMPR1A, MEN1, SMAD4, BHD, HRPT2, NF2, MUTYH, ATM, BLM, BRCA1, BRCA2, FANCA, FANCC, FANCD2, FANCE, FANCF, FANCG, NBS1, RECQL4, WRN, MSH2, MLH1, MSH6, PMS2, XPA, XPC, ERCC2-5, DDB2 or MET, which may be associated with colon, thyroid, parathyroid, pituitary, islet cell, stomach, intestinal, embryonal, bone
- the method may be employed to detect a somatic mutation in genes that are implicated in cancer, e.g., CTNNB1, BCL2, TNFRSF6/FAS, BAX, FBXW7/CDC4, GLI, HPVE6, MDM2, NOTCH1, AKT2, FOXO1A, FOXO3A, CCND1, HPVE7, TAL1, TFE3, ABL1, ALK, EPHB2, FES, FGFR2, FLT3, FLT4, KRAS2, NTRK1, NTRK3, PDGFB, PDGFRB, EWSR1, RUNX1, SMAD2, TGFBR1, TGFBR2, BCL6, EVI1, HMGA2, HOXA9, HOXA11, HOXA13, HOXC13, HOXD11, HOXD13, HOX11, HOX11L2, MAP2K4, MLL, MYC, MYCN, MYCL1, PTNP1, PTNP11, RARA, SS18
- mutations of interest include mutations in, e.g., ARID1A, ARID1B, SMARCA4, SMARCB1, SMARCE1, AKT1, ACTB/ACTG1, CHD7, ANKRD11, SETBP1, MLL2, ASXL1, which may be at least associated with rare syndromes such as Coffin-Siris syndrome, Proteus syndrome, Baraitser-Winter syndrome, CHARGE syndrome, KBG syndrome, Schinzel-Giedion syndrome, Kabuki syndrome or Bohring-Opitz syndrome (see, e.g., Veltman and Brunner 2012 De novo mutations in human genetic disease. Nature Reviews Genetics 13:565-575). Hence, the method may be employed to detect a mutation in those genes.
- the method may be employed to detect a mutation in genes that are implicated in a variety of neurodevelopmental disorders, e.g., KAT6B, THRA, EZH2, SRCAP, CSF1R, TRPV3, DNMT1, EFTUD2, SMAD4, LIS1, DCX, which may be associated with Ohdo syndrome, hypothyroidism, Genitopatellar syndrome, Weaver syndrome, Floating-Harbor syndrome, hereditary diffuse leukoencephalopathy with spheroids, Olmsted syndrome, ADCA-DN (autosomal-dominant cerebellar ataxia, deafness and narcolepsy), mandibulofacial dysostosis with microcephaly or Myhre syndrome (see, e.g., Ku et al.
- the method may also be employed to detect a somatic mutation in genes that are implicated in a variety of neurological and neurodegenerative disorders, e.g., SCN1A, MECP2, IKBKG/NEMO or PRNP (see, e.g., Poduri et al. (2014) Somatic mutation, genetic variation, and neurological disease. Science 341(6141):1237758).
- a sample may be collected from a patient at a first location, e.g., in a clinical setting such as in a hospital or at a doctor’s office, and the sample may be forwarded to a second location, e.g., a laboratory where it is processed, and the above-described method is performed to generate a report.
- a “report” as described herein, is an electronic or tangible document which includes report elements that provide test results that may indicate the presence and/or quantity of minority variant(s) in the sample.
- the report may be forwarded to another location (which may be the same location as the first location), where it may be interpreted by a health professional (e.g., a clinician, a laboratory technician, or a physician such as an oncologist, surgeon, pathologist or virologist), as part of a clinical decision.
- the method may be used to analyze diseases that are associated with mutations and to analyze transplant rejection, and has applications in non-invasive prenatal testing.
Abstract
Described herein is a method of sequencing a template that comprises a direct repeat, comprising: (a) in the same reaction, hybridizing a primer to a first site that is upstream of the first repeat sequence and hybridizing a primer to a second site that is upstream of the second repeat sequence, wherein the first and second sites are: (i) upstream of the first and second repeat sequences, respectively, and (ii) equidistant from the first and second repeat sequences; and (b) subjecting the hybridization product of (a) to a sequencing-by-synthesis sequencing reaction to produce a sequence read that comprises a combination of the first and second repeat sequences.
Description
- This application claims the benefit of U.S. Provisional Application Serial No. 62/818,527, filed on Mar. 14, 2019, which application is incorporated by reference herein.
- Some sequencing methods require comparing two sequences within a single sequence read to determine if there is a difference between the sequences. However, such methods can be challenging to perform because the software that performs this task needs to accurately identify the beginnings and ends of the sequences in a sequence read that should be compared, extract the sequences that should be compared, and then perform an alignment of those sequences. These steps can be challenging to perform automatically and consistently for all different sequences, sequence compositions and lengths. For example, the existence of repeated sequences within a sequence read can cause slippage of an alignment, which may produce erroneous results.
- The present disclosure provides an alternative, better way of comparing sequences within the same sequence read.
- A method of sequencing a template that comprises a direct repeat, i.e., a template comprising a first repeat sequence and a second repeat sequence that is in direct orientation with the first repeat, is provided. In some embodiments, the method may comprise, in the same reaction, hybridizing a primer to a first site that is upstream of the first repeat sequence and hybridizing a primer to a second site that is upstream of the second repeat sequence. In these embodiments the first and second sites (i.e., the sites to which the first and second primers bind) should be upstream of the first and second repeat sequences, respectively (i.e., the repeat sequences are downstream from the 3′ ends of the primers) and equidistant from the first and second repeat sequences. The hybridization product produced by this step contains the template with two primers annealed to it, both upstream of a repeat sequence by the same distance (e.g., the same number of bases). Next, the method involves sequencing the template using a sequencing-by-synthesis method (e.g., using fluorescent dye terminators) to produce a sequence read that comprises a combination of the first and second repeat sequences, i.e., a sequence read that is essentially two reads (one from the first primer and the other from the second primer) that are merged with one another. Differences between the sequence of the first and second repeats can be identified as low-quality base calls.
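The equidistance requirement can be expressed as a simple coordinate check. The Python sketch below assumes, hypothetically, that positions are counted along the template in the direction of primer extension, that each primer site is described by the coordinate at which its 3′ end sits, and that each repeat is described by the coordinate of its first base. The function names and coordinates are illustrative only.

```python
def primer_offsets(site1_end, repeat1_start, site2_end, repeat2_start):
    """Return the distance from each primer's 3' end to the start of its repeat."""
    d1 = repeat1_start - site1_end
    d2 = repeat2_start - site2_end
    if d1 < 0 or d2 < 0:
        raise ValueError("each primer site must lie upstream of its repeat")
    return d1, d2


def are_equidistant(site1_end, repeat1_start, site2_end, repeat2_start):
    """True if both primers are the same number of bases upstream of their repeats."""
    d1, d2 = primer_offsets(site1_end, repeat1_start, site2_end, repeat2_start)
    return d1 == d2


# Both primers end 2 bases upstream of their repeats, so each sequencing
# cycle reads the same relative position in the two repeats.
in_register = are_equidistant(20, 22, 150, 152)
out_of_register = are_equidistant(20, 22, 150, 155)
```

If the two offsets differed, the two extension products would be out of register and the merged read would show mixed signal at most positions rather than only where the repeats differ.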
- In some embodiments, within each template molecule the first repeat sequence and the second repeat sequence are amplified from opposite strands of a double-stranded fragment of DNA. In these embodiments, the sequences of the first and second repeats should be identical except for positions that correspond to damaged nucleotides in the double-stranded fragment of DNA or errors that occur during amplification. Thus, any differences between the top and bottom strands of the double-stranded fragment can be identified in the sequence read as a “low quality” base call, i.e., a base that is associated with poor underlying data due to there being, in effect, two different bases at a particular position in the sequence. In more detail, within each template molecule the first repeat may be amplified from one strand of a double-stranded fragment of genomic DNA and the second repeat may be amplified from the other strand of the same double-stranded fragment of genomic DNA. Within a molecule, the sequences of the first and second repeats are often the same. However, in cases where there is damage in the original molecule, the sequences of the first and second repeats (within a single molecule) may differ. As such, within each repeat molecule, the first and second repeats are typically identical except for positions that correspond to (a) damaged nucleotides in the double-stranded fragment of genomic DNA from which those strands were copied or (b) errors that occur during amplification of the direct repeat molecule (e.g., nucleotides that are mis-incorporated or deletions caused by a stutter or slippage event during amplification). As such, the first and second repeats are typically at least 95% identical in sequence.
Thus, the different repeats in a template molecule can be sequenced using two primers (one for each repeat) at the same time to determine if the repeats (which correspond to the top and complement of the bottom strands of an initial fragment of genomic DNA) differ. Because two primers are used, the sequences of the first and second repeats are merged in the same sequence read. Any differences between those sequences can be observed as a low-quality base call because the underlying data for that base call are essentially derived from two bases (one base read by the first primer and the other base read by the second primer, where those bases are the same distance downstream from the primers). If there is a low-quality base call at a particular position, then the method may comprise excluding that base call from future analysis. The method may be used to identify damaged nucleotides and amplification errors, as well as sequencing errors (i.e., errors that stem from the sequencing reaction itself, not in the sequencing template).
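The idea that a difference between the repeats shows up as a low-quality base call in the merged read can be illustrated with a toy simulation in Python. The sketch assumes a simplified signal model in which each primer contributes a fixed share of the signal at every cycle, and it uses a hypothetical "purity" threshold as a stand-in for a base caller's quality score; neither is taken from the disclosure.

```python
def merged_read(repeat1, repeat2, weight1=0.5, min_purity=0.9):
    """Simulate a merged read from two repeats sequenced in register.

    Returns (calls, low_quality), where calls is the called sequence and
    low_quality lists the positions whose strongest base accounts for
    less than min_purity of the total signal.
    """
    calls, low_quality = [], []
    for pos, (b1, b2) in enumerate(zip(repeat1, repeat2)):
        signal = {}
        signal[b1] = signal.get(b1, 0.0) + weight1
        signal[b2] = signal.get(b2, 0.0) + (1.0 - weight1)
        best = max(signal, key=signal.get)
        purity = signal[best] / sum(signal.values())
        calls.append(best)
        if purity < min_purity:
            low_quality.append(pos)
    return "".join(calls), low_quality


def mask_positions(read, positions, mask_char="N"):
    """Replace the given positions with a mask character."""
    flagged = set(positions)
    return "".join(mask_char if i in flagged else base
                   for i, base in enumerate(read))


# The repeats differ at position 5, so that cycle carries mixed signal.
calls, low_quality = merged_read("ACGGTTA", "ACGGTCA")
masked = mask_positions(calls, low_quality)
```

In this toy model the positions at which the repeats agree give pure signal, and the single discordant position is flagged and masked so that it can be excluded from downstream analysis.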
- The method finds particular use in analyzing samples of DNA that contain damaged DNA, samples in which the amount of DNA is limited and/or samples that contain fragments having a low copy number mutation (e.g., a sequence caused by a mutation that is present at low copy number relative to sequences that do not contain the mutation). These features are often present in patient samples that can be obtained non-invasively, e.g., circulating tumor DNA (ctDNA) samples, which can be obtained from peripheral blood, or invasively, e.g., tissue sections. In some embodiments, the sample may be DNA obtained from tissue embedded in paraffin (i.e., an FFPE sample). In such samples, the mutant sequences may only be present at a very limited copy number (e.g., less than 10, less than 5 copies or even 1 copy in a background of hundreds or thousands of copies of the wild type sequence). In these situations, without an effective way to eliminate errors generated by DNA damage, it can be almost impossible to identify a true sequence variation with significant confidence.
- The invention is best understood from the following detailed description when read in conjunction with the accompanying drawings. It is emphasized that, according to common practice, the various features of the drawings are not to scale. Indeed, the dimensions of the various features are arbitrarily expanded or reduced for clarity. Included in the drawings are the following figures.
- FIG. 1 schematically illustrates a direct repeat template that has been made from a fragment of double-stranded genomic DNA.
- FIG. 2 schematically illustrates where the first and second primers used in the method hybridize to a direct repeat template.
- FIG. 3 schematically illustrates an example of the method.
- FIG. 4 schematically illustrates an exemplary method by which a direct repeat molecule can be produced.
- FIG. 5 schematically illustrates another exemplary method by which a direct repeat molecule can be produced.
- Unless defined otherwise herein, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention, the preferred methods and materials are described.
- All patents and publications, including all sequences disclosed within such patents and publications, referred to herein are expressly incorporated by reference.
- Numeric ranges are inclusive of the numbers defining the range. Unless otherwise indicated, nucleic acids are written left to right in 5′ to 3′ orientation; amino acid sequences are written left to right in amino to carboxy orientation, respectively.
- The headings provided herein are not limitations of the various aspects or embodiments of the invention. Accordingly, the terms defined immediately below are more fully defined by reference to the specification as a whole.
- Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Singleton, et al., DICTIONARY OF MICROBIOLOGY AND MOLECULAR BIOLOGY, 2D ED., John Wiley and Sons, New York (1994), and Hale & Markham, THE HARPER COLLINS DICTIONARY OF BIOLOGY, Harper Perennial, N.Y. (1991) provide one of skill with the general meaning of many of the terms used herein. Still, certain terms are defined below for the sake of clarity and ease of reference.
- It is further noted that the claims may be drafted to exclude any optional element. As such, this statement is intended to serve as antecedent basis for use of such exclusive terminology as “solely”, “only” and the like in connection with the recitation of claim elements, or the use of a “negative” limitation.
- The term “sample” as used herein relates to a material or mixture of materials, typically containing one or more analytes of interest. In one embodiment, the term as used in its broadest sense, refers to any plant, animal, microbial or viral material containing genomic DNA, such as, for example, tissue or fluid isolated from an individual (including without limitation plasma, serum, cerebrospinal fluid, lymph, tears, saliva and tissue sections) or from in vitro cell culture constituents, as well as samples from the environment.
- The term “nucleic acid sample,” as used herein, denotes a sample containing nucleic acids. Nucleic acid samples used herein may be complex in that they contain multiple different molecules that contain sequences. Genomic DNA samples from a mammal (e.g., mouse or human) are types of complex samples. Complex samples may have more than about 10^4, 10^5, 10^6, 10^7, 10^8, 10^9 or 10^10 different nucleic acid molecules. A DNA target may originate from any source such as genomic DNA, or an artificial DNA construct. Any sample containing nucleic acids, e.g., genomic DNA from tissue culture cells or a sample of tissue, may be employed herein.
- The term “mixture” as used herein, refers to a combination of elements, that are interspersed and not in any particular order. A mixture is heterogeneous and not spatially separable into its different constituents. Examples of mixtures of elements include a number of different elements that are dissolved in the same aqueous solution and a number of different elements attached to a solid support at random positions (i.e., in no particular order). A mixture is not addressable. To illustrate by example, an array of spatially separated surface-bound polynucleotides, as is commonly known in the art, is not a mixture of surface-bound polynucleotides because the species of surface-bound polynucleotides are spatially distinct, and the array is addressable.
- The term “nucleotide” is intended to include those moieties that can be copied using a polymerase. Nucleotides contain not only the known purine and pyrimidine bases, but also other heterocyclic bases that have been modified, e.g., “damaged” bases that have been oxidized or deadenylated. Such modifications include methylated purines or pyrimidines, acylated purines or pyrimidines, alkylated riboses or other heterocycles. In addition, the term “nucleotide” includes those moieties that contain hapten or fluorescent labels and may contain not only conventional ribose and deoxyribose sugars, but other sugars as well. Modified nucleosides or nucleotides also include modifications on the sugar moiety, e.g., wherein one or more of the hydroxyl groups are replaced with halogen atoms or aliphatic groups, or are functionalized as ethers, amines, or the like.
- The terms “nucleic acid” and “polynucleotide” are used interchangeably herein to describe a polymer of any length, e.g., greater than about 2 bases, greater than about 10 bases, greater than about 100 bases, greater than about 500 bases, greater than 1000 bases, greater than 10,000 bases, greater than 100,000 bases, greater than about 1,000,000, up to about 10^10 or more bases composed of nucleotides, e.g., deoxyribonucleotides or ribonucleotides, and may be produced enzymatically or synthetically (e.g., PNA as described in U.S. Pat. No. 5,948,902 and the references cited therein) which can hybridize with naturally occurring nucleic acids in a sequence specific manner analogous to that of two naturally occurring nucleic acids, e.g., can participate in Watson-Crick base pairing interactions. Naturally-occurring nucleotides include guanine, cytosine, adenine, thymine, uracil (G, C, A, T and U respectively). DNA and RNA have a deoxyribose and ribose sugar backbone, respectively, whereas PNA’s backbone is composed of repeating N-(2-aminoethyl)-glycine units linked by peptide bonds. In PNA various purine and pyrimidine bases are linked to the backbone by methylenecarbonyl bonds. A locked nucleic acid (LNA), often referred to as inaccessible RNA, is a modified RNA nucleotide. The ribose moiety of an LNA nucleotide is modified with an extra bridge connecting the 2′ oxygen and 4′ carbon. The bridge “locks” the ribose in the 3′-endo (North) conformation, which is often found in the A-form duplexes. LNA nucleotides can be mixed with DNA or RNA residues in the oligonucleotide whenever desired. The term “unstructured nucleic acid,” or “UNA,” is a nucleic acid containing non-natural nucleotides that bind to each other with reduced stability.
For example, an unstructured nucleic acid may contain a G′ residue and a C′ residue, where these residues correspond to non-naturally occurring forms, i.e., analogs, of G and C that base pair with each other with reduced stability, but retain an ability to base pair with naturally occurring C and G residues, respectively. Unstructured nucleic acid is described in US20050233340, which is incorporated by reference herein for disclosure of UNA.
- The term “oligonucleotide” as used herein denotes a single-stranded multimer of nucleotides of from about 2 to 200 nucleotides, up to 500 nucleotides in length. Oligonucleotides may be synthetic or may be made enzymatically, and, in some embodiments, are 30 to 150 nucleotides in length. Oligonucleotides may contain ribonucleotide monomers (i.e., may be oligoribonucleotides) or deoxyribonucleotide monomers, or both ribonucleotide monomers and deoxyribonucleotide monomers. An oligonucleotide may be 10 to 20, 21 to 30, 31 to 40, 41 to 50, 51 to 60, 61 to 70, 71 to 80, 80 to 100, 100 to 150 or 150 to 200 nucleotides in length, for example.
- “Primer” means an oligonucleotide, either natural or synthetic, that is capable, upon forming a duplex with a polynucleotide template, of acting as a point of initiation of nucleic acid synthesis and being extended from its 3′ end along the template so that an extended duplex is formed. The sequence of nucleotides added during the extension process is determined by the sequence of the template polynucleotide. Usually primers are extended by a DNA polymerase. Primers are generally of a length compatible with their use in synthesis of primer extension products and are usually in the range of between 8 to 100 nucleotides in length, such as 10 to 75, 15 to 60, 15 to 40, 18 to 30, 20 to 40, 21 to 50, 22 to 45, 25 to 40, and so on. Typical primers can be in the range of between 10-50 nucleotides long, such as 15-45, 18-40, 20-30, 21-25 and so on, and any length between the stated ranges. In some embodiments, the primers are usually not more than about 10, 12, 15, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 35, 40, 45, 50, 55, 60, 65, or 70 nucleotides in length. In some embodiments a primer can be activated prior to primer extension. For example, some primers have a 3′ block and internal RNA base. The RNA base can be removed by RNaseH or another treatment, thereby producing a 3′ hydroxyl group which can be extended. Other methods for activating primers exist.
- Primers are usually single-stranded for maximum efficiency in amplification but may alternatively be double-stranded or partially double-stranded. If double-stranded, the primer is usually first treated to separate its strands before being used to prepare extension products. This denaturation step is typically effected by heat, but may alternatively be carried out using alkali, followed by neutralization. Also included in this definition are toehold exchange primers, as described in Zhang et al (Nature Chemistry 2012 4: 208-214), which is incorporated by reference herein.
- Thus, a “primer” is complementary to a template, and complexes by hydrogen bonding or hybridization with the template to give a primer/template complex for initiation of synthesis by a polymerase, which is extended by the addition of covalently bonded bases linked at its 3′ end complementary to the template in the process of DNA synthesis.
- The term “hybridization” or “hybridizes” refers to a process in which a region of a nucleic acid strand anneals to and forms a stable duplex, either a homoduplex or a heteroduplex, under normal hybridization conditions with a second complementary nucleic acid strand and does not form a stable duplex with unrelated nucleic acid molecules under the same normal hybridization conditions. The formation of a duplex is accomplished by annealing two complementary nucleic acid strand regions in a hybridization reaction. The hybridization reaction can be made to be highly specific by adjustment of the hybridization conditions (often referred to as hybridization stringency) under which the hybridization reaction takes place, such that two nucleic acid strands will not form a stable duplex, e.g., a duplex that retains a region of double-strandedness under normal stringency conditions, unless the two nucleic acid strands contain a certain number of nucleotides in specific sequences which are substantially or completely complementary. “Normal hybridization or normal stringency conditions” are readily determined for any given hybridization reaction. See, for example, Ausubel et al., Current Protocols in Molecular Biology, John Wiley & Sons, Inc., New York, or Sambrook et al., Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratory Press. As used herein, the term “hybridizing” or “hybridization” refers to any process by which a strand of nucleic acid binds with a complementary strand through base pairing.
- A nucleic acid is considered to be “selectively hybridizable” to a reference nucleic acid sequence if the two sequences specifically hybridize to one another under moderate to high stringency hybridization and wash conditions. Moderate and high stringency hybridization conditions are known (see, e.g., Ausubel, et al., Short Protocols in Molecular Biology, 3rd ed., Wiley & Sons 1995 and Sambrook et al., Molecular Cloning: A Laboratory Manual, Third Edition, 2001 Cold Spring Harbor, N.Y.). One example of high stringency conditions includes hybridization at about 42° C. in 50% formamide, 5X SSC, 5X Denhardt’s solution, 0.5% SDS and 100 µg/ml denatured carrier DNA followed by washing two times in 2X SSC and 0.5% SDS at room temperature and two additional times in 0.1X SSC and 0.5% SDS at 42° C.
- The term “amplifying” as used herein refers to the process of synthesizing nucleic acid molecules that are complementary to one or both strands of a template nucleic acid. Amplifying a nucleic acid molecule may include denaturing the template nucleic acid, annealing primers to the template nucleic acid at a temperature that is below the melting temperatures of the primers, and enzymatically elongating from the primers to generate an amplification product. The denaturing, annealing and elongating steps each can be performed one or more times. In certain cases, the denaturing, annealing and elongating steps are performed multiple times such that the amount of amplification product is increasing, often times exponentially, although exponential amplification is not required by the present methods. Amplification typically requires the presence of deoxyribonucleoside triphosphates, a DNA polymerase enzyme and an appropriate buffer and/or co-factors for optimal activity of the polymerase enzyme. The term “amplification product” refers to the nucleic acids, which are produced from the amplifying process as defined herein.
- The terms “determining,” “measuring,” “evaluating,” “assessing,” “assaying,” and “analyzing” are used interchangeably herein to refer to any form of measurement and include determining if an element is present or not. These terms include both quantitative and/or qualitative determinations. Assessing may be relative or absolute. “Assessing the presence of” includes determining the amount of something present, as well as determining whether it is present or absent.
- The term “ligating,” as used herein, refers to the enzymatically catalyzed joining of the terminal nucleotide at the 5′ end of a first DNA molecule to the terminal nucleotide at the 3′ end of a second DNA molecule.
- A “plurality” contains at least 2 members. In certain cases, a plurality may have at least 2, at least 5, at least 10, at least 100, at least 1,000, at least 10,000, at least 100,000, at least 10⁶, at least 10⁷, at least 10⁸ or at least 10⁹ or more members.
- An “oligonucleotide binding site” refers to a site to which an oligonucleotide hybridizes in a target polynucleotide. If an oligonucleotide “provides” a binding site for a primer, then the primer may hybridize to that oligonucleotide or its complement.
- The term “strand” as used herein refers to a nucleic acid made up of nucleotides covalently linked together by covalent bonds, e.g., phosphodiester bonds. In a cell, DNA usually exists in a double-stranded form, and as such, has two complementary strands of nucleic acid referred to herein as the “Watson” (or “top”) and “Crick” (or “bottom”) strands. In certain cases, complementary strands of a chromosomal region may be referred to as “plus” and “minus” strands, the “first” and “second” strands, the “coding” and “noncoding” strands, the “top” and “bottom” strands or the “sense” and “antisense” strands. The assignment of a strand as being a Watson or Crick strand is arbitrary and does not imply any particular orientation, function or structure.
- The term “extending”, as used herein, refers to the extension of a primer by the addition of nucleotides using a polymerase. If a primer that is annealed to a nucleic acid is extended, the nucleic acid acts as a template for extension reaction.
- The term “sequencing,” as used herein, refers to a method by which the identity of at least 10 consecutive nucleotides (e.g., the identity of at least 20, at least 50, at least 100 or at least 200 or more consecutive nucleotides) of a polynucleotide is obtained.
- The terms “next-generation sequencing” or “high-throughput sequencing”, as used herein, refer to the so-called parallelized sequencing-by-synthesis or sequencing-by-ligation platforms currently employed by Illumina, Life Technologies, and Roche, etc. Next-generation sequencing methods may also include nanopore sequencing methods such as that commercialized by Oxford Nanopore Technologies, electronic-detection based methods such as Ion Torrent technology commercialized by Life Technologies, or single-molecule fluorescence-based methods such as that commercialized by Pacific Biosciences.
- The term “barcode sequence” or “molecular barcode”, as used herein, refers to a unique sequence of nucleotides that can be used to a) identify and/or track the source of a polynucleotide in a reaction, b) count how many times an initial molecule is sequenced and c) pair sequence reads from different strands of the same molecule. Barcode sequences may vary widely in size and composition; the following references provide guidance for selecting sets of barcode sequences appropriate for particular embodiments: Casbon (Nuc. Acids Res. 2011, 22 e81), Brenner, U.S. Pat. No. 5,635,400; Brenner et al., Proc. Natl. Acad. Sci., 97: 1665-1670 (2000); Shoemaker et al., Nature Genetics, 14: 450-456 (1996); Morris et al., European patent publication 0799897A1; Wallace, U.S. Pat. No. 5,981,179; and the like. In particular embodiments, a barcode sequence may have a length in range of from 2 to 36 nucleotides, or from 6 to 30 nucleotides, or from 8 to 20 nucleotides.
- In some cases, a barcode may contain a “degenerate base region” or “DBR”, where the terms “degenerate base region” and “DBR” refer to a type of molecular barcode that has complexity that is sufficient to help one distinguish between fragments to which the DBR has been added. In some cases, substantially every tagged fragment may have a different DBR sequence. In these embodiments, a high complexity DBR may be used (e.g., one that is composed of at least 10,000 or 100,000, or more sequences). In other embodiments, some fragments may be tagged with the same DBR sequence, but those fragments can still be distinguished by the combination of i. the DBR sequence, ii. the sequence of the fragment, iii. the sequence of the ends of the fragment, and/or iv. the site of insertion of the DBR into the fragment. In some embodiments, at least 95%, e.g., at least 96%, at least 97%, at least 98%, at least 99% or at least 99.5% of the target polynucleotides become associated with a different DBR sequence. In some embodiments, a DBR may comprise one or more (e.g., at least 2, at least 3, at least 4, at least 5, or 5 to 30 or more) nucleotides selected from R, Y, S, W, K, M, B, D, H, V, N (as defined by the IUPAC code). In some cases, a double-stranded barcode can be made by making an oligonucleotide containing degenerate sequence (e.g., an oligonucleotide that has a run of 2-10 or more “Ns”) and then copying the complement of the barcode onto the other strand, as described below.
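- By way of a non-limiting illustration, the complexity of a DBR written with IUPAC codes can be computed as the product of the number of bases permitted at each position. The following Python sketch is provided for illustration only; the function names `dbr_complexity` and `expand_dbr` are hypothetical and are not part of the described method.

```python
from itertools import product

# IUPAC nucleotide codes mapped to the bases each code can represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def dbr_complexity(degenerate_seq):
    """Return the number of distinct sequences encoded by a degenerate sequence."""
    total = 1
    for code in degenerate_seq.upper():
        total *= len(IUPAC[code])
    return total


def expand_dbr(degenerate_seq):
    """Enumerate every concrete sequence encoded by a (short) degenerate sequence."""
    pools = [IUPAC[code] for code in degenerate_seq.upper()]
    return ["".join(bases) for bases in product(*pools)]


print(dbr_complexity("NNNNNNNN"))  # 4**8 = 65536
print(expand_dbr("RY"))            # ['AC', 'AT', 'GC', 'GT']
```

Under this arithmetic, a run of seven “N” positions (4⁷ = 16,384 sequences) already exceeds a complexity of 10,000.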
- Oligonucleotides that contain a variable sequence, e.g., a DBR, can be made by making a number of oligonucleotides separately, mixing the oligonucleotides together, and by amplifying them en masse. In other words, the population of oligonucleotides that contain a variable sequence can be made as a single oligonucleotide that contains degenerate positions (i.e., positions that contain more than one type of nucleotide). Alternatively, such a population of oligonucleotides can be made by fabricating them individually or using an array of the oligonucleotides using in situ synthesis methods, cleaving the oligonucleotides from the substrate and optionally amplifying them. Examples of such methods are described in, e.g., Cleary et al. (Nature Methods 2004 1: 241-248) and LeProust et al. (Nucleic Acids Research 2010 38: 2522-2540).
- In some cases, a barcode may be error correcting. Descriptions of exemplary error identifying (or error correcting) sequences can be found throughout the literature (e.g., as described in U.S. Patent Application Publications US2010/0323348 and US2009/0105959, both incorporated herein by reference). Error-correctable codes may be necessary for quantitating absolute numbers of molecules. Many reports in the literature use codes that were originally developed for error-correction of binary systems (Hamming codes, Reed Solomon codes etc.) or apply these to quaternary systems (e.g. quaternary Hamming codes; see Generalized DNA barcode design based on Hamming codes, Bystrykh, PLoS One 2012 7: e36852).
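- By way of a non-limiting illustration, one simple way to use a set of well-separated barcodes is nearest-neighbor decoding by Hamming distance. The Python sketch below is not a Hamming code construction and is not taken from the cited references; the barcode set, the function names and the `max_distance` parameter are assumptions chosen for illustration.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def correct_barcode(observed, known_barcodes, max_distance=1):
    """Return the single known barcode within max_distance of the observed
    sequence, or None if there is no match or the match is ambiguous."""
    hits = [
        barcode
        for barcode in known_barcodes
        if len(barcode) == len(observed)
        and hamming_distance(observed, barcode) <= max_distance
    ]
    return hits[0] if len(hits) == 1 else None


# Hypothetical barcode set in which every pair differs at all 6 positions.
KNOWN = ["AAAAAA", "CCCGGG", "GGGTTT"]

print(correct_barcode("CCCGGA", KNOWN))  # CCCGGG (one substitution corrected)
print(correct_barcode("ACGTAC", KNOWN))  # None (too far from every barcode)
```

In such a scheme, a single substitution can be corrected unambiguously only if every pair of barcodes in the set differs at three or more positions.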
- In some embodiments, a barcode may additionally be used to determine the number of initial target polynucleotide molecules that have been analyzed, i.e., to “count” the number of initial target polynucleotide molecules that have been analyzed. PCR amplification of molecules that have been tagged with a barcode can result in multiple sub-populations of products that are clonally-related in that each of the different sub-populations is amplified from a single tagged molecule. As would be apparent, even though there may be several thousand or millions or more of molecules in any of the clonally-related sub-populations of PCR products and the number of target molecules in those clonally-related sub-populations may vary greatly, the number of molecules tagged in the first step of the method can be estimated by counting the number of DBR sequences associated with a target sequence that is represented in the population of PCR products. This number is useful because, in certain embodiments, the population of PCR products made using this method may be sequenced to produce a plurality of sequences. The number of different barcode sequences that are associated with the sequences of a target polynucleotide can be counted, and this number can be used (along with, e.g., the sequence of the fragment, the sequence of the ends of the fragment, and/or the site of insertion of the DBR into the fragment) to estimate the number of initial template nucleic acid molecules that have been sequenced. Such tags can also be useful in correcting sequencing errors.
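- By way of a non-limiting illustration, the counting described above can be sketched in Python as follows. The sketch assumes that each sequence read has already been reduced to a pair consisting of a target identifier and a DBR sequence; the function name and the example data are hypothetical.

```python
from collections import defaultdict


def count_initial_molecules(reads):
    """Estimate the number of initial molecules per target.

    `reads` is an iterable of (target_id, dbr_sequence) pairs. Reads that
    share both a target and a DBR are treated as clonally related PCR copies
    of one tagged molecule, so each target's estimate is its number of
    distinct DBR sequences.
    """
    dbrs_by_target = defaultdict(set)
    for target_id, dbr in reads:
        dbrs_by_target[target_id].add(dbr)
    return {target: len(dbrs) for target, dbrs in dbrs_by_target.items()}


# Hypothetical reads: six reads derived from three tagged molecules.
example_reads = [
    ("target_1", "ACGTAC"),
    ("target_1", "ACGTAC"),
    ("target_1", "TTGCAA"),
    ("target_2", "GGATCC"),
    ("target_2", "GGATCC"),
    ("target_2", "GGATCC"),
]

print(count_initial_molecules(example_reads))  # {'target_1': 2, 'target_2': 1}
```

The estimate does not depend on how many PCR copies each tagged molecule produced, which is why the number of reads per sub-population may vary greatly without affecting the count.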
- The terms “sample identifier sequence” or “sample index” refer to a type of barcode that can be appended to a target polynucleotide, where the sequence identifies the source of the target polynucleotide (i.e., the sample from which the target polynucleotide is derived). In use, each sample is tagged with a different sample identifier sequence (e.g., one sequence is appended to each sample, where the different samples are appended to different sequences), and the tagged samples are pooled. After the pooled sample is sequenced, the sample identifier sequence can be used to identify the source of the sequences.
- The term “adaptor” refers to a nucleic acid that can be joined to at least one strand of a double-stranded DNA molecule. Adaptors are at least partially double-stranded. An adaptor may be 20 to 150 bases in length, e.g., 40 to 120 bases, although adaptors outside of this range are envisioned.
- The term “adaptor-tagged,” as used herein, refers to a nucleic acid that has been tagged by, i.e., covalently linked with, an adaptor. An adaptor can be joined to a 5′ end and/or a 3′ end of a nucleic acid molecule.
- The term “tagged DNA” as used herein refers to DNA molecules that have an added adaptor sequence, i.e., a “tag” of synthetic origin. An adaptor sequence can be added (i.e., “appended”) by ligation.
- The term “complexity” refers to the total number of different sequences in a population. For example, if a population has 4 different sequences then that population has a complexity of 4. A population may have a complexity of at least 4, at least 8, at least 16, at least 100, at least 1,000, at least 10,000 or at least 100,000 or more, depending on the desired result.
- The term “of the formula” means that the individual molecules in a population are described by, i.e., encompassed by, the formula.
- Certain polynucleotides described herein may be referred to by a formula. Unless otherwise indicated the polynucleotides defined by a formula are oriented in the 5′ to 3′ direction. The components of the formula refer to separately definable sequences of nucleotides within a polynucleotide, where, unless implicit from the context, the sequences are linked together covalently such that a polynucleotide described by a formula is a single molecule. In some cases, the components of the formula are immediately adjacent to one another in the single molecule. Unless otherwise indicated or implicit from the context, a region defined by a formula may have additional sequences, a primer binding site, a molecular barcode, a promoter, or a spacer, etc., at its 3′ end, its 5′ end or both the 3′ and 5′ ends. As would be apparent, the various component sequences of a polynucleotide may independently be of any desired length as long as they are capable of performing the desired function (e.g., hybridization to another sequence). For example, the various component sequences of a polynucleotide may independently have a length in the range of 8-80 nucleotides, e.g., 10-50 nucleotides or 12-30 nucleotides.
- The term “opposite strands”, as used herein, refers to the top and bottom strands, where the strands are complementary to one another, except for damaged nucleotides.
- The term “potential sequence variation”, as used herein, refers to a sequence variation, e.g., a substitution, deletion, insertion or rearrangement of one or more nucleotides in one sequence relative to another.
- The term “amplification error” refers to a mis-incorporated base, or a deletion/insertion caused by polymerase stutter. Stutter usually occurs in repeat sequences, e.g., short tandem repeats (STRs) or microsatellite repeats, and is presumed to be due to miscopying or slippage by the polymerase.
- The term “target enrichment”, as used herein, refers to a method in which selected sequences are separated from other sequences in a sample. This may be done by hybridization to a probe, e.g., hybridizing a biotinylated oligonucleotide to the sample to produce duplexes between the oligonucleotide and the target sequence, immobilizing the duplexes via the biotin group, washing the immobilized duplexes, and then releasing the target sequences from the oligonucleotides. Alternatively, a selected sequence may be enriched by amplifying that sequence, e.g., by PCR using one or more primers that hybridize to a site that is proximal to the target sequence.
- The terms “minority variant” and “sequence variation”, as used herein, refer to a variant that is present at a frequency of less than 50%, relative to other molecules in the sample. In some cases, a minority variant may be a first allele of a polymorphic target sequence, where, in a sample, the ratio of molecules that contain the first allele of the polymorphic target sequence compared to molecules that contain other alleles of the polymorphic target sequence is 1:5 or less, 1:10 or less, 1:100 or less, 1:1,000 or less, 1:10,000 or less, 1:100,000 or less or 1:1,000,000 or less.
- The term “duplex sequencing” refers to a method in which sequences for both strands of a double-stranded molecule of genomic DNA are obtained. In duplex sequencing, the sequences derived from the top strand of double-stranded molecule of genomic DNA are distinguishable from sequences derived from the bottom strand of that molecule in such a way that the sequences for the top and bottom strands from the same double-stranded molecule of genomic DNA can be compared.
- The term “direct repeat” refers to a molecule that contains two copies of nearly identical sequences, i.e., sequences that are of the same length and that are at least 95% identical in nucleotide sequence.
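- By way of a non-limiting illustration, the two criteria in this definition (same length, at least 95% identity) can be checked with the following Python sketch. The function names and the example sequences are hypothetical, and identity is computed position by position without alignment because the two repeats have the same length.

```python
def percent_identity(first, second):
    """Position-by-position percent identity of two equal-length sequences."""
    if len(first) != len(second) or not first:
        raise ValueError("sequences must be non-empty and of the same length")
    matches = sum(x == y for x, y in zip(first, second))
    return 100.0 * matches / len(first)


def is_direct_repeat(first, second, min_identity=95.0):
    """Return True if the two repeats have the same length and are at least
    `min_identity` percent identical."""
    if len(first) != len(second) or not first:
        return False
    return percent_identity(first, second) >= min_identity


first_repeat = "ACGT" * 10                       # hypothetical 40-nt repeat
one_mismatch = "T" + first_repeat[1:]            # 39 of 40 positions match
three_mismatches = "TTT" + first_repeat[3:]      # 37 of 40 positions match

print(percent_identity(first_repeat, one_mismatch))      # 97.5
print(is_direct_repeat(first_repeat, one_mismatch))      # True
print(is_direct_repeat(first_repeat, three_mismatches))  # False
```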
- The term “distance” as used herein depends on the sequencing-by-synthesis method being used for sequencing. For example, in methods that rely on reversible chain terminators the distance between the 3′ end of a primer and a downstream nucleotide can be defined by the number of bases. In semiconductor or pyrosequencing methods the distance between the 3′ end of a primer and a downstream nucleotide can be defined by the number of flows because, in those methods, several nucleotides can be added in a single flow. Thus, “equidistant” can mean the same number of nucleotides if a reversible chain terminator-based sequencing method is used or the same number of flows if a semiconductor- or pyrosequencing-based sequencing method is used.
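- By way of a non-limiting illustration, the difference between a distance measured in bases and a distance measured in flows can be sketched in Python as follows. The sketch assumes a simplified model in which one nucleotide species is offered per flow and an entire homopolymer run is incorporated in a single flow; the cyclic flow order "TACG", the function name and the example sequences are assumptions made for illustration only.

```python
def flows_to_extend(extension, flow_order="TACG"):
    """Count the nucleotide flows needed to synthesize `extension`.

    `extension` is the strand being synthesized, written 5' to 3'. Flows are
    offered cyclically in `flow_order`; a flow incorporates every consecutive
    base that matches the offered nucleotide.
    """
    if set(extension) - set(flow_order):
        raise ValueError("extension contains a base absent from the flow order")
    flows = 0
    position = 0
    while position < len(extension):
        offered = flow_order[flows % len(flow_order)]
        flows += 1
        while position < len(extension) and extension[position] == offered:
            position += 1
    return flows


# Two hypothetical spacers between a primer's 3' end and its repeat.
print(len("TTAC"), flows_to_extend("TTAC"))  # 4 bases, 3 flows
print(len("TAC"), flows_to_extend("TAC"))    # 3 bases, 3 flows
print(flows_to_extend("GCAT"))               # 13 flows for only 4 bases
```

In this model the spacers "TTAC" and "TAC" are equidistant in flows but not in bases, which is why the applicable definition of “equidistant” depends on the sequencing chemistry.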
- For ease of reference, the reverse complement of a sequence may be indicated by the prime (“ ′ ”) symbol. For example, the reverse complement of a sequence referred to as “W” may be referred to as “W′”.
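- By way of a non-limiting illustration, the reverse complement of a DNA sequence can be computed with the following Python sketch. The function name `reverse_complement` is hypothetical, and only the unambiguous bases A, C, G and T are handled.

```python
# Translation table mapping each DNA base to its Watson-Crick partner.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return sequence.translate(_COMPLEMENT)[::-1]


w = "GATCGGATCGA"
w_prime = reverse_complement(w)

print(w_prime)                      # TCGATCCGATC
print(reverse_complement(w_prime))  # GATCGGATCGA
```

Taking the reverse complement twice returns the original sequence, i.e., (W′)′ is W.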
- Other definitions of terms may appear throughout the specification.
- Before the present invention is described, it is to be understood that this invention is not limited to particular embodiments described, as such may, of course, vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting, since the scope of the present invention will be limited only by the appended claims.
- Where a range of values is provided, it is understood that each intervening value, to the tenth of the unit of the lower limit unless the context clearly dictates otherwise, between the upper and lower limits of that range is also specifically disclosed. Each smaller range between any stated value or intervening value in a stated range and any other stated or intervening value in that stated range is encompassed within the invention. The upper and lower limits of these smaller ranges may independently be included or excluded in the range, and each range where either, neither or both limits are included in the smaller ranges is also encompassed within the invention, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the invention.
- Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention, some potential and preferred methods and materials are now described. All publications mentioned herein are incorporated herein by reference to disclose and describe the methods and/or materials in connection with which the publications are cited. It is understood that the present disclosure supersedes any disclosure of an incorporated publication to the extent there is a contradiction.
- It must be noted that as used herein and in the appended claims, the singular forms “a”, “an”, and “the” include plural referents unless the context clearly dictates otherwise. Thus, for example, reference to “a nucleic acid” includes a plurality of such nucleic acids and reference to “the compound” includes reference to one or more compounds and equivalents thereof known to those skilled in the art, and so forth.
- The practice of the present invention may employ, unless otherwise indicated, conventional techniques and descriptions of organic chemistry, polymer technology, molecular biology (including recombinant techniques), cell biology, biochemistry, and immunology, which are within the skill of the art. Such conventional techniques include polymer array synthesis, hybridization, ligation, and detection of hybridization using a label. Specific illustrations of suitable techniques can be had by reference to the example herein below. However, other equivalent conventional procedures can, of course, also be used. Such conventional techniques and descriptions can be found in standard laboratory manuals such as Genome Analysis: A Laboratory Manual Series (Vols. I-IV), Using Antibodies: A Laboratory Manual, Cells: A Laboratory Manual, PCR Primer: A Laboratory Manual, and Molecular Cloning: A Laboratory Manual (all from Cold Spring Harbor Laboratory Press), Stryer, L. (1995) Biochemistry (4th Ed.) Freeman, New York, Gait, “Oligonucleotide Synthesis: A Practical Approach” 1984, IRL Press, London, Nelson and Cox (2000), Lehninger, A., Principles of Biochemistry 3rd Ed., W. H. Freeman Pub., New York, N.Y. and Berg et al. (2002) Biochemistry, 5th Ed., W. H. Freeman Pub., New York, N.Y., all of which are herein incorporated in their entirety by reference for all purposes.
- The publications discussed herein are provided solely for their disclosure prior to the filing date of the present application. Nothing herein is to be construed as an admission that the present invention is not entitled to antedate such publication by virtue of prior invention. Further, the dates of publication provided may be different from the actual publication dates which may need to be independently confirmed.
- Provided herein, among other things, is a way to sequence a template that has a direct repeat, i.e., a template that comprises a first repeat sequence and a second repeat sequence, wherein the first and second repeat sequences are in a direct repeat and either identical or nearly identical. In some embodiments, within each template molecule, the first repeat sequence and the second repeat sequence may be amplified from opposite strands of a double-stranded fragment of DNA. In embodiments in which the fragment of DNA is double-stranded genomic DNA (e.g., eukaryotic genomic DNA, which may be isolated from a tissue biopsy or may be cell-free DNA (cfDNA), microbial genomic DNA or viral genomic DNA), the sequences of the repeats may be identical except for positions that correspond to damaged nucleotides in the double-stranded fragment of DNA or errors that occur during amplification. An example of such a direct repeat is illustrated in FIG. 1. As shown, within each repeat molecule the first repeat and the second repeat are amplified from opposite strands of a fragment of double-stranded DNA, e.g., genomic DNA. The first repeat has the same or a very similar sequence as one strand (the top strand of the fragment, for example) of a fragment of double-stranded genomic DNA whereas the second repeat has the same or a very similar sequence as the reverse complement of the other strand of the fragment (e.g., the bottom strand of the fragment). In embodiments in which the fragment is genomic DNA, the first and second repeat sequences should be identical except for nucleotides that correspond to (i.e., are at a position that corresponds to the position of) damaged nucleotides in the fragment of double-stranded genomic DNA or errors that have occurred during amplification. In other embodiments, the double-stranded fragment may be made synthetically or derived from a double-stranded plasmid, for example.
- A “damaged nucleotide” refers to any derivative of adenine, cytosine, guanine, and thymine that has been altered in a way that allows it to pair with a different base. In non-damaged DNA, A base pairs with T and C base pairs with G. However, some bases can be oxidized, alkylated or deaminated in a way that affects base pairing. For example, 7,8-dihydro-8-oxoguanine (8-oxo-dG) is a derivative of guanine that base pairs with adenine instead of cytosine. This derivative causes a G to T transversion after replication. Deamination of cytosine produces uracil, which can base pair with adenine, leading to a C to T change after replication. Other examples of damaged nucleotides that are capable of mismatched pairing are known.
- Within a direct repeat template molecule, the sequences of the first and second repeats have identical lengths and are at least 95% identical (e.g., at least 95% identical, at least 96% identical, at least 97% identical, at least 98% identical, at least 99% identical or 100% identical, depending on, e.g., the extent of DNA damage in the fragment of double-stranded genomic DNA and/or amplification errors) and, with the exception of nucleotides that correspond to damaged nucleotides and amplification errors, should be identical. As shown, the molecules may have a unit length of 1, meaning that there is only one copy of the first repeat and one copy of the second repeat in each molecule. The template molecules may be single stranded or double stranded. However, as would be appreciated, the template is in its single stranded form when it is being sequenced. The sequence of the first and second repeats may have a length of at least 50 nucleotides and in some embodiments may be in the range of 50 nucleotides to 2 kb in length, e.g., 50-500 nt or 50-300 nt. In some embodiments, the direct repeat template may be in a sample that contains other direct repeat templates. Within the population, the complexity and median length of the sequence of the first repeat may vary and may be approximately the same as the complexity and median length of the sequence of the second repeat, since those sequences are almost identical. In the population, the first repeat and the second repeat may each have a complexity of at least 10³, e.g., at least 10⁴, at least 10⁵, at least 10⁶, at least 10⁷, at least 10⁸, at least 10⁹ or at least 10¹⁰, for example, meaning that in the population, the first repeat and the second repeat are each represented by at least 10³ different sequences. The lengths of the first and second repeats may depend on the lengths of the fragments of DNA in the sample from which the molecules are made. In some embodiments, the fragments may have a median size that is no more than 2 kb in length (e.g., in the range of 50 bp to 2 kb, e.g., 75 bp to 1.5 kb, 100 bp to 1 kb, 100 bp to 500 bp). The lengths of the fragments may be tailored to the sequencing platform being used. Examples of how these molecules can be made will be described in greater detail below.
- Therefore, in any embodiment, the direct repeat molecule may be made by copying a double-stranded fragment of DNA to produce the direct repeat molecule, where the first and second repeats of the direct repeat molecule are to be amplified from opposite strands of the double-stranded fragment of DNA.
- As noted above, in some embodiments the method may comprise, in the same reaction, hybridizing a primer to a first site that is upstream of the first repeat sequence and hybridizing a primer to a second site that is upstream of the second repeat sequence. In these embodiments, the first and second sites (i.e., the sites to which the first and second primers bind, respectively) are upstream of the first and second repeat sequences, respectively, and equidistant from the first and second repeat sequences. This is illustrated in FIG. 2. As illustrated, the first primer binds to a site that is upstream of (i.e., 3′ to) the first repeat whereas the second primer binds to a site that is upstream of (i.e., 3′ to) the second repeat, where the distances between the primers and their respective repeats are the same. Illustrated by example, if the 3′ end of the first primer hybridizes to a nucleotide that is upstream of (i.e., 3′ to) the first repeat by n bases (where n is in the range of, e.g., 5 to 30) then the 3′ end of the second primer hybridizes to a nucleotide that is upstream of (i.e., 3′ to) the second repeat by n bases. While the distance between the primer binding sites and the repeats can be defined by the number of bases for some sequencing methods (e.g., Illumina’s dye terminator sequencing method), the distance can be defined by “flows” in other methods (e.g., Ion Torrent or pyrosequencing methods).
- After hybridization of the primers, the method may comprise subjecting the hybridization product to a sequencing-by-synthesis sequencing reaction to produce a sequence read that comprises a combination of the first and second repeat sequences, meaning that the sequences are merged into one. In some embodiments, sequencing-by-synthesis methods are those that involve extending a primer using a template and detecting which nucleotide is added at each position. Sequencing-by-synthesis methods include, but are not limited to, Illumina’s reversible dye terminator method, Thermo’s Ion Torrent method (which detects ions as they are released by DNA polymerase) and pyrosequencing, although others are known. In the reversible dye terminator approach, the sequence of a template is determined using reversible terminator chemistry (Turcatti et al., Nucleic Acids Res. 2008 36:e25). In every sequencing cycle a single fluorescently labeled, 3′-blocked nucleotide is added in a templated primer extension reaction. After incorporation, the identity of the fluorescent label added is detected by fluorescent imaging. In each round, the labels and terminators are chemically removed in order to prepare the primer extension product for the next cycle. A more detailed description of the process can be found in Bentley, supra.
- As noted above, the sequence read produced using this method will be a combination of the first and the second repeat sequences, where the term “combination” is intended to mean that the sequences of the first and second repeats are merged, superimposed or melded into one. By way of example, if the sequence of the first repeat is GATCGGATCGA (SEQ ID NO: 1) and the sequence of the second repeat is GATCGGATCGA (SEQ ID NO: 1), then the sequence read will contain only one copy of the sequence GATCGGATCGA (SEQ ID NO: 1), where some of the signal used to generate the sequence read is generated by extension of the first primer and some of the signal used to generate the sequence read is generated by the extension of the second primer in the same reaction.
- Differences in the sequences of the first and second repeats can be identified because the underlying signal corresponding to the difference will be mixed (i.e., will be a composite of signals produced by two different bases at that position). Positions that have a mixed signal can be identified because they are associated with a low-quality base call. As such, differences in the sequences of the first and second repeats can be identified as positions that have a low-quality base call. In these embodiments, the sequence read comprises, for each position of the sequence read, a quality score indicating the reliability of the base(s) called at that position. Base calling is the process by which an order of nucleotides in a template is inferred during a sequencing reaction. For example, next generation sequencing platforms that use fluorescently labeled reversible terminators have a unique color for each base. These are incorporated into the complementary strand of the DNA template and captured with a sensitive CCD camera. These images are processed into signals which are used to infer the order of nucleotides, also known as base calling.
- Base calling accuracy can be measured in a variety of different ways. In some embodiments base calling accuracy can be measured using a Q score (Phred quality score), which is a common metric to assess the accuracy of a sequencing run. Q scores are logarithmically related to the base calling error probability P, where Q = -10 log10(P) (i.e., Q = -10 log P / log 10). In this system, a Q score of 40 corresponds to a probability of an incorrect base call of 1 in 10,000, or 99.99% base calling accuracy; a lower Q score of 10 corresponds to a probability of an incorrect base call of 1 in 10. Lower Q scores can lead to increases in false positive variant calls and reduce the overall confidence an investigator has in their sequencing data. Details of base calling and methods for calculating the quality of a base call are described in a variety of publications, including, e.g., Ledergerber et al. (Brief Bioinform. 2011 12: 489-497), Whiteford et al. (Bioinformatics 2009 25: 2194-2199), Erlich (Nat. Methods. 2008 5: 679-682) and Kao et al. (Genome Res. 2009 19: 1884-95), which are incorporated by reference for disclosure of those methods.
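- The relationship between a Q score and the error probability can be illustrated with a short sketch. The function names below are illustrative only and are not part of any sequencing platform's software.

```python
import math


def phred_to_error_prob(q):
    """Return the probability of an incorrect base call for a Phred score q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Return the Phred score Q = -10 * log10(p) for an error probability p."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in the interval (0, 1]")
    return -10 * math.log10(p)


# Q40 corresponds to 1 error in 10,000 calls; Q10 to 1 error in 10 calls.
for q in (10, 20, 30, 40):
    p = phred_to_error_prob(q)
    print(f"Q{q}: error probability {p:.4g}, accuracy {100 * (1 - p):.2f}%")
```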
- In some embodiments, the method may be used to identify positions that differ in the first and second repeats. In these embodiments, a position in the sequence read that is uncalled or associated with a low-quality score indicates that the first and second repeat sequences differ at a nucleotide that corresponds to that position. By way of example, if the sequence of the first repeat is GATCGGATCGA (SEQ ID NO: 1) and the sequence of the second repeat is GATCGTATCGA (SEQ ID NO: 2), then the sequence read may contain only one copy of the sequence GATCG[G/T]ATCGA (SEQ ID NO: 3), where “G/T” is a base that has a mixed signal and is therefore associated with a poor quality base call. In this example, the quality of the base calls for the non-G/T bases will be high and the quality of the base call for the G/T base will be poor because some of the signal for that position, as analyzed by the base calling algorithm, will be generated by extension of the first primer and some of the signal will be generated by the extension of the second primer, in the same reaction.
- After a position that has a low-quality base call has been identified (or, in some cases a position that is uncalled), the method may further comprise analyzing the underlying signals for that position to determine the identities of the nucleotides at that position in the first and second repeats. For example, in the example described in the prior paragraph, the underlying signals (i.e., prior to base calling and referred to as primary sequence data) could be analyzed to determine that the position contains a mixture of G and T, thereby indicating that the first repeat contains a G or T at that position, and the second repeat contains the other nucleotide. As such, in any embodiment, the method may comprise reading a combination of signals obtained by simultaneous extension of the first and second primers to produce primary sequencing data, processing the primary sequencing data using a base-calling algorithm to produce a sequence read composed of a sequence of base calls, each base call associated with a quality score indicating the reliability of the base call; and outputting the sequence read based on the quality scores. The quality scores allow differences between the first and second repeats to be identified.
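- The analysis described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a platform base caller: the per-position signals are modeled as four channel intensities (one per base), each primer is assumed to contribute half of the signal, and the function names and the 0.25 threshold are hypothetical.

```python
BASES = "ACGT"


def simulate_merged_signals(repeat1, repeat2):
    """Model the combined signal when two equal-length repeats are read at once.

    Assumption: each primer extension contributes 0.5 units of signal to the
    channel of the base it incorporates at that position.
    """
    if len(repeat1) != len(repeat2):
        raise ValueError("repeats must be the same length")
    signals = []
    for b1, b2 in zip(repeat1, repeat2):
        channels = {b: 0.0 for b in BASES}
        channels[b1] += 0.5
        channels[b2] += 0.5
        signals.append(channels)
    return signals


def call_merged_read(signals, min_minor_fraction=0.25):
    """Call one base per position and report positions with a mixed signal.

    A position is left uncalled ('N') when the second-strongest channel
    carries at least min_minor_fraction of the total signal.
    """
    read = []
    mixed = {}
    for pos, channels in enumerate(signals):
        total = sum(channels.values())
        ranked = sorted(channels, key=channels.get, reverse=True)
        top, second = ranked[0], ranked[1]
        if total > 0 and channels[second] / total >= min_minor_fraction:
            read.append("N")
            mixed[pos] = tuple(sorted((top, second)))
        elif total > 0:
            read.append(top)
        else:
            read.append("N")
    return "".join(read), mixed


signals = simulate_merged_signals("GATCGGATCGA", "GATCGTATCGA")
read, mixed = call_merged_read(signals)
print(read)   # merged read with the differing position uncalled
print(mixed)  # 0-based positions mapped to the two bases present
```

- In this sketch the two bases at a mixed position are reported without assigning them to a particular repeat, consistent with the description above that the first repeat contains one of the two nucleotides and the second repeat contains the other.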
- In some embodiments, the first and second sites in the template (i.e., the sequences to which the first and second primers bind) are the same sequence. In these embodiments, a single primer may be used in the method, where the primer binds to two sites in the template. In alternative embodiments, the first and second sites in the template (i.e., the sequences to which the first and second primers bind) may be different sequences. In these embodiments, two or more primers may be used in the method, where the primers bind to different sequences in the template, one upstream of the first repeat and the other upstream of the second repeat.
- In some embodiments, the method may involve determining how many strands of the first repeat are sequenced relative to the number of strands of the second repeat, or if a sufficient number of molecules have been sequenced. These embodiments may be implemented by adding a calibration sequence to the template, as shown in FIG. 3. In these embodiments, the template may comprise: a first calibrator sequence that is present between the first site and the first repeat; and a second calibrator sequence that is present between the second site and the second repeat, wherein the first and second calibrator sequences are the same length (e.g., may be two, three or four bases in length or the same number of flows in length, depending on the sequencing method used) and have a different sequence; and the sequence read of step (b) includes positions that correspond to the first and second calibrator sequences. In these embodiments, the underlying signals corresponding to the first and second calibrator sequences (prior to base calling) can be examined to determine how many strands of the first and second repeats are sequenced in the reaction. Likewise, the underlying signals corresponding to the first and second calibrator sequences (prior to base calling) can be examined to determine if a sufficient number of molecules have been sequenced.
- In many sequencing-by-synthesis methods, template molecules are clonally amplified, and the amplification products are sequenced in a highly parallel fashion. Such methods are reviewed in, e.g., Metzker et al. (Genome Res. 2005 15:1767-1776) and Bentley (Curr. Opin. Genet. Dev. 2006 16: 545-55). In Illumina sequencing the templates are spread in a flow cell and immobilized on a support (typically glass; see Fedurco et al., Nucleic Acids Res. 2006 34:e22), where they are amplified in place by bridge PCR, which generates clusters of identical templates (or “colonies”) on the support. 
As such, the present method may be implemented by amplifying the template on a substrate by bridge PCR to produce a colony that comprises copies of the template, hybridizing one or more primers to the colony, wherein a primer hybridizes to a first site that is upstream of the first repeat sequence and a primer hybridizes to a second site that is upstream of the second repeat sequence, wherein the first and second sites are: upstream of the first and second repeat sequences, respectively, and equidistant from the first and second repeat sequences; and obtaining the sequence of the template by a sequencing-by-synthesis sequencing reaction to produce a sequence read that comprises a combination of the first and second repeat sequences. In some embodiments (and as illustrated in FIG. 3) the top and bottom strands of the bridge PCR amplification products may be sequenced by Illumina’s sequencing method (which is referred to as “paired end” sequencing). As such, in some embodiments, the sequence of a top strand of a bridge PCR product can be compared to the sequence of a bottom strand of a bridge PCR product. Positions that are associated with a low-quality base call as a result of a difference in sequence between the first and second repeats should have a low-quality base call in both strands. In some embodiments, after sequencing both strands of the product by paired end sequencing one can produce a consensus sequence for the top strand of the initial double-stranded fragment and a consensus sequence for the bottom strand of the initial double-stranded fragment. Low quality bases can be masked or integrated into a model in which the quality scores are taken into account. Sequences that are not present in both the top and bottom strands of the initial double-stranded fragment can thereby be eliminated from future analysis.
- FIG. 3 illustrates an example of the method. In this example, the template is a double stranded molecule and one or both strands need to be sequenced (sequencing of the bottom strand is shown). In this example, the direct repeat template has flow cell sequences (e.g., Illumina’s P5 and P7 sequences) at the ends and a primer binding site between the first and second repeats. As shown, this molecule is amplified from a double-stranded fragment, where the first and second repeat sequences (W* and W or C* and C) are amplified from opposite strands of a double-stranded fragment of DNA and are identical except for positions that correspond to damaged nucleotides in the double-stranded fragment of DNA or errors that occur during amplification. As shown, the method may involve hybridizing two primers (designated P1 and P2, which can be the same or different) to the template (after it has been amplified). In this embodiment, the repeats each have a calibrator sequence (referred to as “key 1” and “key 2”) that can be used to determine the relative number of copies of the first and second repeats that are sequenced in a reaction. As shown, the part of the sequence read obtained from primer P1 should contain key 1 (TT) and the part of the sequence read obtained from primer P2 should contain key 2 (AA). In this example, there is a difference in sequence in the first and second repeats, which can be identified as a base call with a low quality (as a result of the template having a mixed nucleotide at that position).
- In embodiments in which there is non-informational sequence immediately downstream of a primer binding site, the primers may be extended but not read for the first few cycles, thereby allowing one to obtain the sequence of the keys and/or repeats faster.
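- The use of the calibrator sequences (“key 1” and “key 2”) to estimate the relative number of strands sequenced from each primer can be sketched as follows. This is an illustration under stated assumptions: the signals at the key positions are modeled as per-base channel intensities, the keys are assumed to differ at every position (as TT and AA do), and the function name and example intensities are hypothetical.

```python
def first_primer_fraction(key_signals, key1="TT", key2="AA"):
    """Estimate the fraction of sequenced strands extended from the first primer.

    key_signals: one dict of channel intensities ('A', 'C', 'G', 'T') per
    calibrator position, taken from the primary data prior to base calling.
    """
    if not (len(key_signals) == len(key1) == len(key2)):
        raise ValueError("one signal entry is needed per calibrator position")
    fractions = []
    for channels, b1, b2 in zip(key_signals, key1, key2):
        if b1 == b2:
            raise ValueError("calibrators must differ at every position")
        s1, s2 = channels[b1], channels[b2]
        if s1 + s2 == 0:
            raise ValueError("no signal at a calibrator position")
        fractions.append(s1 / (s1 + s2))
    # Average the per-position estimates across the calibrator.
    return sum(fractions) / len(fractions)


# Hypothetical intensities at the two key positions of one cluster.
example = [
    {"A": 40, "C": 1, "G": 0, "T": 60},
    {"A": 38, "C": 0, "G": 2, "T": 62},
]
print(first_primer_fraction(example))  # fraction of signal attributed to key 1
```

- The total signal at the key positions could be compared in the same way against a chosen minimum to judge whether a sufficient number of molecules have been sequenced; any such minimum would be an empirical choice.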
- In some embodiments, the direct repeat template may have different, non-complementary sequences (FIG. 3) of at least 10 nucleotides (e.g., at least 10, 12 or 14 nucleotides in length) that allow the fragments to be amplified by a single pair of primers: a first primer that hybridizes to one sequence and another that hybridizes to the complement of the other sequence. These sequences may be compatible with the sequencing platform being used. These sequences do not need to be at the very end of a molecule although, in many embodiments, the sequences are within 50 nt, e.g., within 30 nt of the end of the molecule. As would be apparent, the template molecule should have a junction sequence between the first and second repeats. The junction sequence should be of at least 10 nucleotides (e.g., 10 to 100 nt). The template may contain a molecular barcode (e.g., a sample identifier or molecule identifier) at any position (outside of the repeats).
- The method described above can be employed to analyze genomic DNA from virtually any organism, including, but not limited to, plants, animals (e.g., reptiles, mammals, insects, worms, fish, etc.), tissue samples, bacteria, fungi (e.g., yeast), phage, viruses, cadaveric tissue, archaeological/ancient samples, etc. In certain embodiments, the genomic DNA used in the method may be derived from a mammal, wherein in certain embodiments the mammal is a human. In exemplary embodiments, the sample may contain genomic DNA from a mammalian cell, such as a human, mouse, rat, or monkey cell. The sample may be made from cultured cells or cells of a clinical sample, e.g., a tissue biopsy, scrape or lavage or cells of a forensic sample (i.e., cells of a sample collected at a crime scene). In particular embodiments, the nucleic acid sample may be obtained from a biological sample such as cells, tissues, bodily fluids, and stool. 
Bodily fluids of interest include but are not limited to, blood, serum, plasma, saliva, mucous, phlegm, cerebral spinal fluid, pleural fluid, tears, lactal duct fluid, lymph, sputum, synovial fluid, urine, amniotic fluid, and semen. In particular embodiments, a sample may be obtained from a subject, e.g., a human. In some embodiments, the sample comprises fragments of human genomic DNA. In some embodiments, the sample may be obtained from a cancer patient. In some embodiments, the sample may be made by extracting fragmented DNA from a patient sample, e.g., a formalin-fixed paraffin embedded tissue sample. In some embodiments, the patient sample may be a sample of cell-free “circulating” DNA from a bodily fluid, e.g., peripheral blood, e.g., from the blood of a patient or of a pregnant female. The DNA fragments used in the initial step of the method should be non-amplified DNA that has not been denatured beforehand.
- The DNA in the initial sample may be made by extracting genomic DNA from a biological sample, and then fragmenting it. In some embodiments, the fragmenting may be done mechanically (e.g., by sonication, nebulization, or shearing, etc.) or using a double stranded DNA “dsDNA” fragmentase enzyme (New England Biolabs, Ipswich MA). In some of these methods (e.g., the mechanical and fragmentase methods), after the DNA is fragmented, the ends may be polished and A-tailed prior to ligation to one or more adaptors. Alternatively, the ends may be polished and ligated to adaptors in a blunt-end ligation reaction. In other embodiments, the DNA in the initial sample may already be fragmented (e.g., as is the case for FFPE (formalin-fixed paraffin embedded) samples and circulating cell-free DNA (cfDNA), e.g., ctDNA). The fragments in the initial sample may have a median size that is below 1 kb (e.g., in the range of 50 bp to 500 bp, or 80 bp to 400 bp), although fragments having a median size outside of this range may be used.
- In some embodiments, the amount of DNA in a sample may be limiting. For example, the initial sample of fragmented DNA may contain less than 200 ng of fragmented human DNA, e.g., 1 pg to 20 pg, 10 pg to 200 ng, 100 pg to 200 ng, 1 ng to 200 ng or 5 ng to 50 ng, or less than 10,000 (e.g., less than 5,000, less than 1,000, less than 500, less than 100, less than 10 or less than 1) haploid genome equivalents, depending on the genome.
- In some embodiments, sample identifiers (i.e., a sequence that identifies the sample to which the sequence is added, which can identify the patient, or a tissue, etc.) can be added to the polynucleotides prior to sequencing, so that multiple (e.g., at least 2, at least 4, at least 8, at least 16, at least 48, at least 96 or more) samples can be multiplexed. In these embodiments, the sample identifier may be ligated to the initial polynucleotides as part of the asymmetric adaptor, or the sample identifier may be ligated to the polynucleotides in the sub-samples, before or after amplification of those polynucleotides. Alternatively, the tag may be added by primer extension, i.e., using a primer that has a 3′ end that hybridizes to an adaptor sequence, and a 5′ tail that contains the sample identifier.
- The population of direct repeat molecules may be made in a variety of different ways. These methods rely on creating circular molecules, retaining physical proximity between the two strands of one double-stranded DNA molecule, or physically isolating two strands of one double-stranded molecule, during manipulation steps. The methods also divide into strategies requiring one, or more, adaptor types. These methods can be done by fragmenting, polishing and then tailing the ends of the fragments before adaptor ligation. Alternatively, transposases can be used to add adaptor sequences. In some embodiments, standard transposons can be used but then modified to create a Y-shaped adaptor using oligonucleotide replacement (Grunenwald H, Baas B, Goryshin I, Zhang B, Adey A, Hu S, Shendure J, Caruccio N, Maffitt M 2011. Nextera PCR-free DNA library preparation for next-generation sequencing. [Poster presentation, AGBT 2011]; Gertz J, Varley KE, Davis NS, Baas BJ, Goryshin IY, Vaidyanathan R, Kuersten S, Myers RM 2012. Transposase mediated construction of RNA-seq libraries. Genome Res 22: 134-141).
- In some embodiments, the direct repeat template may be made by (a) ligating adaptor sequences onto both ends of top and bottom strands of a population of fragments of double-stranded genomic DNA to produce double-stranded molecules comprising (i) a top strand comprising a 5′ sequence (e.g., X) at the 5′ end and a junction sequence (e.g., J) at the 3′ end; and (ii) a bottom strand comprising a 5′ sequence (e.g., Y′) at the 5′ end, and the complement of the junction sequence (J′) at the 3′ end; and (b) extending the 3′ end of the top strands (i.e., the strand that contains sequence X) using the bottom strand as a template, thereby copying the complement of the bottom strand, as well as sequences J and Y, into the same molecule as the top strand to produce a direct repeat molecule of formula: X-TOP-J-BOT′-Y, wherein, within each repeat molecule, TOP and BOT′ are amplified from opposite strands of a double-stranded fragment of genomic DNA and are identical except for positions that correspond to damaged nucleotides in the double-stranded fragment of genomic DNA or amplification errors. In these embodiments, TOP and BOT′ vary in the population and have a median length of at least 50 nucleotides; X and Y are different, non-complementary sequences of at least 10 nucleotides in length that do not vary in the population; and J is a junction sequence. Examples of this method are shown in the figures and described in greater detail below.
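- The sequence of a direct repeat molecule of formula X-TOP-J-BOT′-Y can be sketched in silico as follows. This sketch models only the sequence of the final product, not the ligation and extension steps themselves; the X, Y′ and J sequences shown are hypothetical placeholders, all strands are written 5′ to 3′, and the function names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def build_direct_repeat(x, top, j, bottom, y_prime):
    """Assemble X-TOP-J-BOT'-Y from the top strand, the bottom strand and Y'.

    BOT' is the complement of the bottom strand and Y is the complement of Y',
    so an undamaged, error-free fragment yields BOT' identical to TOP.
    """
    return x + top + j + reverse_complement(bottom) + reverse_complement(y_prime)


# Hypothetical adaptor and junction sequences (placeholders only).
X = "ACACTCTTTC"
Y_PRIME = "GTGACTGGAG"
J = "CCTTGGAACC"

top = "GATCGGATCGA"                # SEQ ID NO: 1 used as the top strand
bottom = reverse_complement(top)   # undamaged bottom strand
print(build_direct_repeat(X, top, J, bottom, Y_PRIME))

# A single changed base on the bottom strand (e.g., a damaged nucleotide)
# yields repeats that differ at exactly one position.
damaged_bottom = reverse_complement("GATCGTATCGA")
bot_prime = reverse_complement(damaged_bottom)
print([i for i, (a, b) in enumerate(zip(top, bot_prime)) if a != b])
```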
- In some embodiments and as shown in FIGS. 4 and 5, a direct repeat molecule may be made by ligating a single adaptor onto both ends of top and bottom strands of a population of fragments of double-stranded genomic DNA, such that the individual molecules are in a covalently open circle and, in the individual molecules in the population, sequence X is added onto the 5′ end of the top strands of the fragments and sequence Y′ is ligated onto the 5′ end of the bottom strands of the fragments. This method involves extending the 3′ end of the top strands (i.e., the strand that contains sequence X) using the bottom strand as a template, thereby copying the complement of the bottom strand, as well as sequence Y, into the same molecule as the top strand. Such a molecule can be amplified using primers that have a 3′ end that is the same as, or that hybridizes to, sequences X and Y. An example of such a method is illustrated in FIGS. 4 and 5, where the top and bottom strands of the fragments of genomic DNA are indicated as “forward” and “reverse”, respectively, and sequences X and Y′ are indicated as sequences R1 and R2.
- In some embodiments, the direct repeat molecules may be made by ligating a single adaptor onto both ends of top and bottom strands of a population of fragments of double-stranded genomic DNA, such that the individual molecules are in a covalently closed circle and, in the individual molecules in the population, sequence X is added onto the 5′ end of the top strands of the fragments and sequence Y′ is ligated onto the 5′ end of the bottom strands of the fragments. This method involves creating one or more nicks by reacting, e.g., an adaptor containing dUTP and a mixture of UDG/endonuclease IV, and extending the 3′ end of the top strands (i.e., the strand that contains sequence X) using the bottom strand as a template, thereby copying the complement of the bottom strand, as well as sequence Y, into the same molecule as the top strand. Such a molecule can be amplified using primers that have a 3′ end that is the same as, or that hybridizes to, sequences X and Y.
- A similar product may be made by emulsion PCR, using an immobilization approach, or rolling circle amplification, single-adapter methods and methods using more than one adapter, as described in WO2018229547, which is incorporated by reference in its entirety. In some embodiments, the direct repeat template may be of the formula X-TOP-J-BOT′-Y, wherein (i) within each repeat molecule TOP and BOT′ are amplified from opposite strands of a double-stranded fragment of genomic DNA and are identical except for positions that correspond to damaged nucleotides in the double-stranded fragment of genomic DNA or errors that occur during amplification; (ii) TOP and BOT′ have a median length of at least 50 nucleotides; (iii) X and Y are different, non-complementary sequences of at least 10 nucleotides; and (iv) J is a junction sequence of, e.g., at least 10 nucleotides in length. In some embodiments, the direct repeat template may have a strand of the formula X-(T)TOP(A)-J-(T)BOT′(A)-Y, wherein (T) and (A) are thymine and adenine nucleotides that are immediately adjacent to TOP and BOT′. Such molecules may be made by, for example, (a) ligating adaptor sequences onto both ends of top and bottom strands of a population of fragments of double-stranded genomic DNA to produce double-stranded molecules comprising: (i) a top strand comprising sequence X at the 5′ end and sequence J at the 3′ end; and (ii) a bottom strand comprising sequence Y′ at the 5′ end, and sequence J′ at the 3′ end; and (b) extending the 3′ end of the top strands using the bottom strands as a template, thereby adding the complement of the bottom strands and sequence Y onto the 3′ end of the top strands. This method is illustrated in FIGS. 4 and 5.
- Also provided by this disclosure is a kit for practicing the subject method, as described above. The various components of the kit may be present in separate containers or certain compatible components may be pre-combined into a single container, as desired.
- In addition to above-mentioned components, the subject kits may further include instructions for using the components of the kit to practice the subject methods, i.e., to provide instructions for sample analysis. The instructions for practicing the subject methods are generally recorded on a suitable recording medium. For example, the instructions may be printed on a substrate, such as paper or plastic, etc. As such, the instructions may be present in the kits as a package insert, in the labeling of the container of the kit or components thereof (i.e., associated with the packaging or subpackaging) etc. In other embodiments, the instructions are present as an electronic storage data file present on a suitable computer readable storage medium, e.g., CD-ROM, diskette, etc. In yet other embodiments, the actual instructions are not present in the kit, but means for obtaining the instructions from a remote source, e.g., via the internet, are provided. An example of this embodiment is a kit that includes a web address where the instructions can be viewed and/or from which the instructions can be downloaded. As with the instructions, this means for obtaining the instructions is recorded on a suitable substrate.
- As would be readily apparent, the method described above may be employed to analyze any type of sample, including, but not limited to, samples that contain heritable mutations, samples that contain somatic mutations, samples from mosaic individuals, samples from pregnant females (in which some of the sample contains DNA from a developing fetus), and samples that contain a mixture of DNA from different sources. In certain embodiments, the method may be used to identify a minority variant that, in some cases, may be due to a somatic mutation in a person.
- In some embodiments, the method may be employed to detect an oncogenic mutation (which may be a somatic mutation) in, e.g., PIK3CA, NRAS, KRAS, JAK2, HRAS, FGFR3, FGFR1, EGFR, CDK4, BRAF, RET, PGDFRA, KIT or ERBB2, which may be associated with breast cancer, melanoma, renal cancer, endometrial cancer, ovarian cancer, pancreatic cancer, leukemia, colorectal cancer, prostate cancer, mesothelioma, glioma, medulloblastoma, polycythemia, lymphoma, sarcoma or multiple myeloma (see, e.g., Chial 2008 Proto-oncogenes to oncogenes to cancer. Nature Education 1:1). Other oncogenic mutations (which may be somatic mutations) of interest include mutations in, e.g., APC, AXIN2, CDH1, GPC3, CYLD, EXT1, EXT2, PTCH, SUFU, FH, SDHB, SDHC, SDHD, VHL, TP53, WT1, STK11/LKB1, PTEN, TSC1, TSC2, CDKN2A, CDK4, RB1, NF1, BMPR1A, MEN1, SMAD4, BHD, HRPT2, NF2, MUTYH, ATM, BLM, BRCA1, BRCA2, FANCA, FANCC, FANCD2, FANCE, FANCF, FANCG, NBS1, RECQL4, WRN, MSH2, MLH1, MSH6, PMS2, XPA, XPC, ERCC2-5, DDB2 or MET, which may be associated with colon, thyroid, parathyroid, pituitary, islet cell, stomach, intestinal, embryonal, bone, renal, breast, brain, ovarian, pancreatic, uterine, eye, hair follicle, blood or uterus cancers, pilotrichomas, medulloblastomas, leiomyomas, paragangliomas, pheochromocytomas, hamartomas, gliomas, fibromas, neuromas, lymphomas or melanomas. In some embodiments, the method may be employed to detect a somatic mutation in genes that are implicated in cancer, e.g., CTNNB1, BCL2, TNFRSF6/FAS, BAX, FBXW7/CDC4, GLI, HPVE6, MDM2, NOTCH1, AKT2, FOXO1A, FOXO3A, CCND1, HPVE7, TAL1, TFE3, ABL1, ALK, EPHB2, FES, FGFR2, FLT3, FLT4, KRAS2, NTRK1, NTRK3, PDGFB, PDGFRB, EWSR1, RUNX1, SMAD2, TGFBR1, TGFBR2, BCL6, EVI1, HMGA2, HOXA9, HOXA11, HOXA13, HOXC13, HOXD11, HOXD13, HOX11, HOX11L2, MAP2K4, MLL, MYC, MYCN, MYCL1, PTNP1, PTNP11, RARA, SS18 (see, e.g., Vogelstein and Kinzler 2004 Cancer genes and the pathways they control. Nature Medicine 10:789-799). 
The method may be employed to detect any somatic mutation that is implicated in cancer and catalogued by COSMIC (Catalogue of Somatic Mutations in Cancer), data from which can be accessed on the internet.
- Other mutations of interest include mutations in, e.g., ARID1A, ARID1B, SMARCA4, SMARCB1, SMARCE1, AKT1, ACTB/ACTG1, CHD7, ANKRD11, SETBP1, MLL2, ASXL1, which may be at least associated with rare syndromes such as Coffin-Siris syndrome, Proteus syndrome, Baraitser-Winter syndrome, CHARGE syndrome, KBG syndrome, Schinzel-Giedion syndrome, Kabuki syndrome or Bohring-Opitz syndrome (see, e.g., Veltman and Brunner 2012 De novo mutations in human genetic disease. Nature Reviews Genetics 13:565-575). Hence, the method may be employed to detect a mutation in those genes.
- In other embodiments, the method may be employed to detect a mutation in genes that are implicated in a variety of neurodevelopmental disorders, e.g., KAT6B, THRA, EZH2, SRCAP, CSF1R, TRPV3, DNMT1, EFTUD2, SMAD4, LIS1, DCX, which may be associated with Ohdo syndrome, hypothyroidism, Genitopatellar syndrome, Weaver syndrome, Floating-Harbor syndrome, hereditary diffuse leukoencephalopathy with spheroids, Olmsted syndrome, ADCA-DN (autosomal-dominant cerebellar ataxia, deafness and narcolepsy), mandibulofacial dysostosis with microcephaly or Myhre syndrome (see, e.g., Ku et al. (2012) A new paradigm emerges from study of de novo mutations in the context of neurodevelopmental disease. Molecular Psychiatry 18:141-153). The method may also be employed to detect a somatic mutation in genes that are implicated in a variety of neurological and neurodegenerative disorders, e.g., SCN1A, MECP2, IKBKG/NEMO or PRNP (see, e.g., Poduri et al. (2014) Somatic mutation, genetic variation, and neurological disease. Science 341(6141):1237758).
- In some embodiments, a sample may be collected from a patient at a first location, e.g., in a clinical setting such as in a hospital or at a doctor’s office, and the sample may be forwarded to a second location, e.g., a laboratory where it is processed, and the above-described method is performed to generate a report. A “report” as described herein, is an electronic or tangible document which includes report elements that provide test results that may indicate the presence and/or quantity of minority variant(s) in the sample. Once generated, the report may be forwarded to another location (which may be the same location as the first location), where it may be interpreted by a health professional (e.g., a clinician, a laboratory technician, or a physician such as an oncologist, surgeon, pathologist or virologist), as part of a clinical decision.
- The method may be used to analyze diseases that are associated with mutations, to analyze transplant rejection, and in non-invasive prenatal testing.
- Accordingly, the preceding merely illustrates the principles of the invention. It will be appreciated that those skilled in the art will be able to devise various arrangements which, although not explicitly described or shown herein, embody the principles of the invention and are included within its spirit and scope. Furthermore, all examples and conditional language recited herein are principally intended to aid the reader in understanding the principles of the invention and the concepts contributed by the inventors to furthering the art and are to be construed as being without limitation to such specifically recited examples and conditions. Moreover, all statements herein reciting principles, aspects, and embodiments of the invention as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents and equivalents developed in the future, i.e., any elements developed that perform the same function, regardless of structure. The scope of the present invention, therefore, is not intended to be limited to the exemplary embodiments shown and described herein. Rather, the scope and spirit of the present invention is embodied by the appended claims.
Claims (20)
1. A method of sequencing a template that comprises a first repeat sequence and a second repeat sequence, wherein the first and second repeat sequences are in a direct repeat and either identical or nearly identical, comprising:
(a) in the same reaction, hybridizing a primer to a first site that is upstream of the first repeat sequence and hybridizing a primer to a second site that is upstream of the second repeat sequence, wherein the first and second sites are:
(i) upstream of the first and second repeat sequences, respectively, and
(ii) equidistant from the first and second repeat sequences; and
(b) subjecting the hybridization product of (a) to a sequencing-by-synthesis sequencing reaction to produce a sequence read that comprises a combination of the first and second repeat sequences.
2. The method of claim 1, wherein within each template the first repeat sequence and the second repeat sequence are amplified from opposite strands of a double-stranded fragment of DNA and are identical except for positions that correspond to damaged nucleotides in the double-stranded fragment of DNA or errors that occur during amplification.
3. The method of claim 2, wherein the double-stranded fragment of DNA is genomic DNA.
4. The method of claim 3, wherein the genomic DNA is eukaryotic genomic DNA.
5. The method of claim 3, wherein the genomic DNA is isolated from a tissue biopsy.
6. The method of claim 3, wherein the genomic DNA is cell-free DNA (cfDNA).
7. The method of claim 3, wherein the genomic DNA is microbial genomic DNA.
8. The method of claim 3, wherein the genomic DNA is viral genomic DNA.
9. The method of claim 1, wherein the sequence read of (b) comprises, for each position of the sequence read, a quality score indicating the reliability of the base(s) called at that position.
10. The method of claim 9, wherein a position in the sequence read that is uncalled or associated with a low-quality score indicates that first and second repeat sequences differ at a nucleotide that corresponds to that position.
11. The method of claim 10, further comprising analyzing primary sequencing data for a position that has a low-quality score to determine the identities of the nucleotides at that position in the first and second repeats.
12. The method of claim 1, wherein step (b) comprises:
(i) reading a combination of signals obtained by simultaneous extension of the first and second primers to produce primary sequencing data;
(ii) processing the primary sequencing data using a base-calling algorithm to produce a sequence read composed of a sequence of base calls, each base call associated with a quality score indicating the reliability of the base call; and
(iii) outputting the sequence read based on (ii).
13. The method of claim 1, wherein the sequencing-by-synthesis of step (b) comprises simultaneously extending the first and second primers in the presence of reversible chain terminators.
14. The method of claim 1, wherein the first and second sites in the template are the same sequence.
15. The method of claim 1, wherein the first and second sites in the template are different sequences.
16. The method of claim 1, wherein the template comprises:
(i) a first calibrator sequence that is present between the first site and the first repeat; and
(ii) a second calibrator sequence that is present between the second site and the second repeat, wherein the first and second calibrator sequences are the same length and have a different sequence; and
the sequence read of step (b) includes positions that correspond to the first and second calibrator sequences.
17. The method of claim 16, further comprising analyzing the signals corresponding to the first and second calibrator sequences to determine how many strands of the first and second repeats are sequenced in the reaction.
18. The method of claim 17, further comprising analyzing the signals corresponding to the first and second calibrator sequences to determine if a sufficient number of molecules have been sequenced.
19. The method of claim 1, wherein first and second repeats are less than 2,000 nucleotides in length.
20. The method of claim 1, wherein the method is done by:
amplifying the template on a substrate by bridge PCR to produce a colony that comprises copies of the template;
hybridizing one or more primers to the colony, wherein a primer hybridizes to a first site that is upstream of the first repeat sequence and a primer hybridizes to a second site that is upstream of the second repeat sequence, wherein the first and second sites are: upstream of the first and second repeat sequences, respectively, and equidistant from the first and second repeat sequences; and
obtaining the sequence of the template by a sequencing-by-synthesis sequencing reaction to produce a sequence read that comprises a combination of the first and second repeat sequences.
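The relationship recited in claims 9 to 12 (a combined signal from two simultaneously extended primers, and uncalled or low-quality positions marking where the two repeats differ) can be illustrated with a toy model. The sketch below is not the base-calling algorithm of the application or of any sequencing instrument. It assumes an idealized four-channel signal in which each repeat contributes half of the intensity at every cycle, and it uses a hypothetical "purity" value (strongest channel divided by total signal) as a stand-in for a quality score. The function names, the 0.5 contribution per repeat, and the `min_purity` threshold are all illustrative assumptions.

```python
BASES = "ACGT"


def combined_signal(repeat1, repeat2):
    """Toy per-cycle channel intensities when both primers extend at once.

    Assumption: each repeat adds 0.5 to the channel of its base, so a
    matching position gives one channel at 1.0 and a mismatch gives two
    channels at 0.5 each.
    """
    if len(repeat1) != len(repeat2):
        raise ValueError("repeats must be the same length")
    signals = []
    for b1, b2 in zip(repeat1, repeat2):
        cycle = {b: 0.0 for b in BASES}
        cycle[b1] += 0.5
        cycle[b2] += 0.5
        signals.append(cycle)
    return signals


def call_bases(signals, min_purity=0.9):
    """Call one base per cycle and flag cycles with a mixed signal.

    Purity (top channel / total signal) is a hypothetical proxy for a
    quality score. Cycles below min_purity are reported as 'N' together
    with the bases that showed signal in that cycle.
    """
    read, flagged = [], []
    for pos, cycle in enumerate(signals):
        total = sum(cycle.values())
        base, top = max(cycle.items(), key=lambda kv: kv[1])
        purity = top / total if total else 0.0
        if purity < min_purity:
            read.append("N")
            # Candidate nucleotides for the two repeats at this position
            flagged.append((pos, sorted(b for b, v in cycle.items() if v > 0)))
        else:
            read.append(base)
    return "".join(read), flagged


# Two made-up repeats that differ only at position 3 (0-based)
read, flagged = call_bases(combined_signal("ACGTAC", "ACGAAC"))
print(read)     # ACGNAC
print(flagged)  # [(3, ['A', 'T'])]
```

In this idealized model, positions where the repeats agree are called cleanly, and the single discordant position is left uncalled with both candidate nucleotides recovered from the underlying signal, which loosely mirrors the analysis of primary sequencing data described in claim 11. Real signals are noisy and unevenly balanced between the two strands, so a practical threshold would have to be calibrated empirically.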
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US18/057,201 US20230242981A1 (en) | 2019-03-14 | 2022-11-18 | Method for sequencing a direct repeat |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201962818527P | 2019-03-14 | 2019-03-14 | |
PCT/IB2020/051702 WO2020183280A1 (en) | 2019-03-14 | 2020-02-27 | Method for sequencing a direct repeat |
US202117435687A | 2021-09-01 | 2021-09-01 | |
US18/057,201 US20230242981A1 (en) | 2019-03-14 | 2022-11-18 | Method for sequencing a direct repeat |
Related Parent Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/435,687 Continuation US11512346B2 (en) | 2019-03-14 | 2020-02-27 | Method for sequencing a direct repeat |
PCT/IB2020/051702 Continuation WO2020183280A1 (en) | 2019-03-14 | 2020-02-27 | Method for sequencing a direct repeat |
Publications (1)
Publication Number | Publication Date |
---|---|
US20230242981A1 (en) | 2023-08-03 |
Family ID: 69784500
Family Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/435,687 Active US11512346B2 (en) | 2019-03-14 | 2020-02-27 | Method for sequencing a direct repeat |
US18/057,201 Pending US20230242981A1 (en) | 2019-03-14 | 2022-11-18 | Method for sequencing a direct repeat |
Family Applications Before (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/435,687 Active US11512346B2 (en) | 2019-03-14 | 2020-02-27 | Method for sequencing a direct repeat |
Country Status (4)
Country | Link |
---|---|
US (2) | US11512346B2 (en) |
EP (1) | EP3938541B9 (en) |
ES (1) | ES2953889T3 (en) |
WO (1) | WO2020183280A1 (en) |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5981179A (en) | 1991-11-14 | 1999-11-09 | Digene Diagnostics, Inc. | Continuous amplification reaction |
US5604097A (en) | 1994-10-13 | 1997-02-18 | Spectragen, Inc. | Methods for sorting polynucleotides using oligonucleotide tags |
US6458530B1 (en) | 1996-04-04 | 2002-10-01 | Affymetrix Inc. | Selecting tag nucleic acids |
US5948902A (en) | 1997-11-20 | 1999-09-07 | South Alabama Medical Science Foundation | Antisense oligonucleotides to human serine/threonine protein phosphatase genes |
US20050233340A1 (en) | 2004-04-20 | 2005-10-20 | Barrett Michael T | Methods and compositions for assessing CpG methylation |
WO2007107710A1 (en) * | 2006-03-17 | 2007-09-27 | Solexa Limited | Isothermal methods for creating clonal single molecule arrays |
EP2164985A4 (en) | 2007-06-01 | 2014-05-14 | 454 Life Sciences Corp | System and method for identification of individual samples from a multiplex mixture |
US20100323348A1 (en) | 2009-01-31 | 2010-12-23 | The Regents Of The University Of Colorado, A Body Corporate | Methods and Compositions for Using Error-Detecting and/or Error-Correcting Barcodes in Nucleic Acid Amplification Process |
US10767222B2 (en) * | 2013-12-11 | 2020-09-08 | Accuragen Holdings Limited | Compositions and methods for detecting rare sequence variants |
EP3638786A1 (en) | 2017-06-15 | 2020-04-22 | Genome Research Limited | Duplex sequencing using direct repeat molecules |
2020
- 2020-02-27 ES ES20710617T (ES2953889T3): Active
- 2020-02-27 US US17/435,687 (US11512346B2): Active
- 2020-02-27 EP EP20710617.0A (EP3938541B9): Active
- 2020-02-27 WO PCT/IB2020/051702 (WO2020183280A1): Application Filing
2022
- 2022-11-18 US US18/057,201 (US20230242981A1): Pending
Also Published As
Publication number | Publication date |
---|---|
EP3938541C0 (en) | 2023-06-07 |
EP3938541A1 (en) | 2022-01-19 |
EP3938541B1 (en) | 2023-06-07 |
WO2020183280A1 (en) | 2020-09-17 |
EP3938541B9 (en) | 2023-10-04 |
US11512346B2 (en) | 2022-11-29 |
ES2953889T3 (en) | 2023-11-16 |
US20220042092A1 (en) | 2022-02-10 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
KR102210852B1 (en) | Systems and methods to detect rare mutations and copy number variation | |
US20230065345A1 (en) | Method for bidirectional sequencing | |
US11111524B2 (en) | Method of identifying sequence variants using concatenation | |
JP2022519159A (en) | Analytical method of circulating cells | |
AU2017363180B2 (en) | Methods for preparing DNA reference material and controls | |
US10533214B2 (en) | Method for measuring mutational load | |
US11788116B2 (en) | Method for the analysis of minimal residual disease | |
JP2023139307A (en) | Methods and systems for detecting insertions and deletions | |
US11512346B2 (en) | Method for sequencing a direct repeat | |
US11078482B2 (en) | Duplex sequencing using direct repeat molecules | |
JP2022512848A (en) | Methods, compositions and systems for calibrating epigenetic compartment assays | |
US20220356467A1 (en) | Methods for duplex sequencing of cell-free dna and applications thereof | |
US20220056508A1 (en) | Method for amplifying a genomic sample | |
WO2023287876A1 (en) | Efficient duplex sequencing using high fidelity next generation sequencing reads |
Legal Events
Code | Title | Description |
---|---|---|
AS | Assignment | Owner name: GENOME RESEARCH LIMITED, UNITED KINGDOM. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:OSBORNE, ROBERT;REEL/FRAME:061846/0182. Effective date: 20200622 |
STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
STPP | Information on status: patent application and granting procedure in general | Free format text: FINAL REJECTION MAILED |