US11008616B2 - Correcting for deamination-induced sequence errors - Google Patents
- Publication number
- US11008616B2 (application US16/866,252)
- Authority
- US
- United States
- Prior art keywords
- nucleic acids
- designated position
- subset
- variation
- variant
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6813—Hybridisation assays
- C12Q1/6827—Hybridisation assays for detection of mutation or polymorphism
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6869—Methods for sequencing
- C12Q1/6874—Methods for sequencing involving nucleic acid arrays, e.g. sequencing by hybridisation
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6806—Preparing nucleic acids for analysis, e.g. for polymerase chain reaction [PCR] assay
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6869—Methods for sequencing
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
- G16B25/20—Polymerase chain reaction [PCR]; Primer or probe design; Probe optimisation
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
Definitions
- A tumor is an abnormal growth of cells. Fragmented DNA is often released into bodily fluid when cells, such as tumor cells, die. Thus, some of the cell-free DNA in body fluids is tumor DNA.
- A tumor can be benign or malignant.
- A malignant tumor is often referred to as a cancer.
- Cancer is a major cause of disease worldwide. Each year, tens of millions of people are diagnosed with cancer around the world, and more than half eventually die from it. In many countries, cancer ranks as the second most common cause of death following cardiovascular diseases. Early detection is associated with improved outcomes for many cancers.
- Cancer is caused by the accumulation of mutations and/or epigenetic variations within an individual's normal cells, at least some of which result in improperly regulated cell division.
- Mutations commonly include copy number variations (CNVs), copy number aberrations (CNAs), single nucleotide variations (SNVs), gene fusions, and indels. Epigenetic variations include modifications to the 5th atom of the 6-atom ring of cytosine and association of DNA with chromatin and transcription factors.
- Cancers are often detected by biopsies of tumors followed by analysis of cells, markers, or DNA extracted from cells. More recently, it has been proposed that cancers can also be detected from cell-free nucleic acids in body fluids, such as blood or urine (see, e.g., Siravegna et al., Nature Reviews 2017). Such tests have the advantage that they are non-invasive and can be performed without identifying suspected cancer cells through biopsy. However, such tests are complicated by the fact that the amount of nucleic acids in body fluids is very low, and the nucleic acids that are present require processing into a more homogeneous form before sequencing can occur.
- One aspect of the disclosure relates to a method for identifying variant nucleotides in a population of nucleic acids comprising: (a) contacting a population of nucleic acids comprising double-stranded molecules with single-stranded overhangs at one or both ends with a protein having 5′-3′ polymerase activity and a 3′-5′ exonuclease activity, wherein the protein digests 3′ overhangs and fills in 5′ overhangs with complementary nucleic acids, to generate double-stranded blunt-ended nucleic acids at one or both ends; (b) determining sequences of the double-stranded blunt-ended nucleic acids to provide sequenced nucleic acids; (c) for each designated position in a reference sequence, (i) identifying a subset of sequenced nucleic acids including the designated position, and (ii) identifying sequenced nucleic acids in the subset in which the designated position is occupied by a variant nucleotide; and (d) calling presence of a variant nucle
- step (c)(ii) identifies the number of consensus sequences in the subset in which the designated position is occupied by a variant nucleotide and presence of a variant nucleotide at each designated position is called when the number of consensus sequences in the subset with the variation meets a threshold except as specified in steps (d)(i) and (ii).
- the variant nucleotide is categorized as a deamination error based on the representation of the C to T variation at the designated position within a defined proximity of the 5′-end in sequenced nucleic acids in the subset or representation of the G to A variation at the designated position within a defined proximity of the 3′-end in sequenced nucleic acids in the subset.
- (c)(ii) further comprises identifying the number of sequenced nucleic acids in the subset in which the designated position is occupied by a reference nucleotide.
- (b) comprises determining sequences of both strands of the double-stranded blunt-ended nucleic acid.
- (c) is performed for at least one designated position wherein the sequenced nucleic acids in the subset with the variation include sequences of both strands of the double-stranded blunt-ended nucleic acid.
- (b) comprises determining sequences from both ends of a strand.
- the method further comprises linking the double-stranded blunt-ended nucleic acids to adapters comprising barcodes, amplifying the nucleic acids primed from primer molecules binding to the adapters, wherein (b) comprises determining sequences of amplified nucleic acid molecules and classifying the sequences of the amplified nucleic acid molecules into families, the members of a family having the same start and stop points on the nucleic acid and the same barcodes, and determining consensus nucleotides at each of a plurality of positions for the families from the sequences of their respective members. The consensus sequences are not determined for families having only one member.
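The family grouping and consensus step described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the `(start, stop, barcode, sequence)` read layout, the function name `build_consensus`, and the use of a simple per-position majority vote are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_consensus(reads):
    """Group reads into families and return one consensus sequence per family.

    Each read is a (start, stop, barcode, sequence) tuple. This layout is an
    assumption made for illustration only.
    """
    families = defaultdict(list)
    for start, stop, barcode, seq in reads:
        # Members of a family share start point, stop point and barcode.
        families[(start, stop, barcode)].append(seq)

    consensus = {}
    for key, members in families.items():
        if len(members) < 2:
            continue  # no consensus is determined for single-member families
        # Majority vote per column; family members share start and stop,
        # so their sequences are assumed to have equal length.
        consensus[key] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*members)
        )
    return consensus


reads = [
    (100, 105, "AAT", "ACGTA"),
    (100, 105, "AAT", "ACGTA"),
    (100, 105, "AAT", "ACTTA"),  # one member carries an amplification error
    (100, 105, "GGC", "ACGTA"),  # singleton family, skipped
]
print(build_consensus(reads))
```

The majority vote removes the isolated error in the third read, and the singleton family contributes no consensus sequence.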
- the population of nucleic acids are from a cell-free nucleic acid sample of a subject.
- the cell-free nucleic acid sample can be from a body fluid of a subject having a cancer or having signs or symptoms consistent with having a cancer.
- the body fluid can be selected from the group consisting of blood, plasma, saliva, urine, and cerebrospinal fluid.
- Blood and blood products e.g. plasma and serum
- the C to T variation at the designated position is classified as a deamination error if its representation is at least 50% in a first fraction of the subset in which the designated position is within a defined proximity of the 5′ end or the G to A variation at the designated position is classified as a deamination error if its representation is at least 50% in a second fraction of the subset in which the designated position is within a defined proximity of the 3′ end.
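The 50% representation test above can be sketched as follows. The `(base, distance)` observation format, the function name, and the default proximity of 20 nucleotides are assumptions for illustration; `distance` is taken to be the number of nucleotides between the designated position and the relevant end (5′ for C to T, 3′ for G to A).

```python
def is_deamination_by_end_fraction(observations, variant_base,
                                   proximity=20, min_fraction=0.5):
    """Return True if the variant makes up at least min_fraction of the
    observations in which the designated position lies near the relevant end.

    observations: list of (base, distance_from_end) pairs, one per sequenced
    nucleic acid in the subset (a hypothetical layout).
    """
    near_end = [base for base, distance in observations if distance <= proximity]
    if not near_end:
        return False  # no molecule places the position near the end
    return near_end.count(variant_base) / len(near_end) >= min_fraction


# C to T example: distances are measured from the 5' end.
subset = [("T", 3), ("T", 5), ("C", 10), ("C", 60), ("C", 80)]
print(is_deamination_by_end_fraction(subset, "T"))
```

Here two of the three molecules that place the position within 20 nucleotides of the 5′ end carry T, so the variation would be classified as a deamination error under this rule.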
- the C to T variation at the designated position can be classified as a deamination error based on the variation having at least twice the representation in a first fraction of the subset in which the designated position is within a defined proximity of the 5′ end as in other sequenced nucleic acids in the subset, or the G to A variation at the designated position is classified as a deamination error based on the variation having at least twice the representation in a second fraction of the subset in which the designated position is within a defined proximity of the 3′ end as in other sequenced nucleic acids in the subset.
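The twofold enrichment test can be sketched in the same style. The observation format and the handling of the edge case where no variant is seen away from the end are assumptions, not taken from the disclosure.

```python
def enriched_near_end(observations, variant_base, proximity=20, fold=2.0):
    """Return True if the variant is at least `fold` times as frequent among
    molecules with the position near the end as among the other molecules.

    observations: list of (base, distance_from_end) pairs (hypothetical layout).
    """
    near = [base for base, distance in observations if distance <= proximity]
    far = [base for base, distance in observations if distance > proximity]
    if not near or not far:
        return False  # the comparison needs both fractions
    near_rate = near.count(variant_base) / len(near)
    far_rate = far.count(variant_base) / len(far)
    if far_rate == 0:
        # Assumption: any near-end signal counts as enriched when the
        # variant is absent elsewhere.
        return near_rate > 0
    return near_rate >= fold * far_rate


near_part = [("T", 2), ("T", 6), ("C", 10), ("C", 15)]
far_part = [("T", 40)] + [("C", 50 + i) for i in range(9)]
print(enriched_near_end(near_part + far_part, "T"))
```

In the example the variant rate is 0.5 near the end and 0.1 elsewhere, a fivefold enrichment, so the test passes.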
- the threshold is that the variation is present in at least 1% of sequenced nucleic acids in the subset.
- the C to T or G to A variation is categorized as a deamination error at least based on the surrounding context being TCG to TTG or CGA to CAA.
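A check for the surrounding sequence context might look like the sketch below. The function name, the 0-based position convention, and the use of a plain string for the reference are assumptions. Note that CGA to CAA is the reverse complement of TCG to TTG, so the two contexts describe the same event viewed from opposite strands.

```python
# Reference trinucleotide and variant base for the two named contexts.
DEAMINATION_CONTEXTS = {("TCG", "T"), ("CGA", "A")}


def in_deamination_context(reference, position, variant_base):
    """Return True if the variant at `position` (0-based, hypothetical
    convention) converts TCG to TTG or CGA to CAA in the reference."""
    if position < 1 or position + 1 >= len(reference):
        return False  # no complete trinucleotide around the position
    context = reference[position - 1:position + 2].upper()
    return (context, variant_base.upper()) in DEAMINATION_CONTEXTS


reference = "AATCGAAC"
print(in_deamination_context(reference, 3, "T"))  # C in TCG changed to T
print(in_deamination_context(reference, 4, "A"))  # G in CGA changed to A
```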
- the defined proximity to the 5′ end is defined as being within 20 nucleotides or within a fewer number of nucleotides to the 5′ end and the defined proximity to the 3′ end is defined as being within 20 nucleotides or within a fewer number of nucleotides to the 3′ end.
- the defined proximity to the 5′ end can be defined as being within 20 nucleotides to the 5′ end and the defined proximity to the 3′ end is defined as being within 20 nucleotides to the 3′ end.
- the protein is Klenow.
- (c) and (d) are performed in a computer-operated system or the like.
- the disclosure relates to a computer-implemented method for identifying variant nucleotides in a population of nucleic acids.
- the reference sequence is a sequence of a human genome.
- the reference sequence can be a sequence of a human chromosome.
- the reference sequence can comprise noncontiguous regions of a human genome.
- At least one of the variant nucleotides called is known to be associated with a cancer.
- the method can be performed on nucleic acid populations from samples from a population of subjects having or suspected of having a cancer, wherein subjects in the population thereafter receive different treatments depending on which variant nucleotides are called in the individual subject.
- variant nucleotides classified as deamination errors are at least 1% of the called variant nucleotides.
- variant nucleotides classified as deamination errors are at least 10% of the called variant nucleotides.
- the presence of a variant is not called if at least 5 variant nucleotides are classified as deamination errors.
- the population of nucleic acids are derived from a solid tissue.
- the body fluid is plasma.
- the adapters comprising barcodes linked to the 5′ ends are different from the adapters comprising barcodes linked to the 3′-end.
- a frequency of the deamination error is at least 1%.
- a frequency of the deamination error is at least 10%.
- the variant nucleotide is categorized as a deamination error based on the average distance of the C to T variation at the designated position being less than the average distance of the reference nucleotide at the designated position from the 5′-end of sequenced nucleic acids in the subset or the G to A variation at the designated position being less than the average distance of the reference nucleotide at the designated position from the 3′-end of sequenced nucleic acids in the subset.
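The average-distance criterion can be sketched as a comparison of two means. The observation format is again an assumption: each entry pairs the base seen at the designated position with that position's distance from the relevant end of the molecule.

```python
from statistics import mean


def variant_closer_to_end(observations, variant_base, reference_base):
    """Return True if molecules carrying the variant place the designated
    position closer to the end, on average, than molecules carrying the
    reference base.

    observations: list of (base, distance_from_end) pairs (hypothetical layout).
    """
    variant_distances = [d for base, d in observations if base == variant_base]
    reference_distances = [d for base, d in observations if base == reference_base]
    if not variant_distances or not reference_distances:
        return False  # nothing to compare
    return mean(variant_distances) < mean(reference_distances)


subset = [("T", 4), ("T", 8), ("C", 40), ("C", 70)]
print(variant_closer_to_end(subset, "T", "C"))
```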
- the variant nucleotide is a single nucleotide variant (SNV).
- One aspect of the disclosure relates to a method identifying variant nucleotides in a nucleic acid, comprising: (a) contacting a double-stranded nucleic acid with single-stranded overhangs with a protein having 5′-3′ polymerase activity and a 3′-5′ exonuclease activity thereby producing a double-stranded blunt-ended nucleic acid; (b) determining a sequence of the double-stranded blunt-ended nucleic acid; (c) comparing the determined sequence to a reference sequence, wherein the determined sequence includes at least one C to T variation in at least one designated position within 20 nucleotides or fewer of the 5′ end of the determined sequence or at least one G to A variation within 20 nucleotides or fewer of the 3′ end of the determined sequence; (d) calling a sequence for the nucleic acid as the determined sequence except in at least one of the positions in which a C to T variation is present within 20 nucleotides or fewer of the 5′ end
- the C to T or G to A variation occurs in a surrounding context of TCG to TTG or CGA to CAA.
- One aspect of the disclosure relates to a method identifying variant nucleotides in a population of nucleic acids comprising: (a) contacting a population of nucleic acids of overlapping sequences at least one of which is a double-stranded molecule with single-stranded overhangs at one or both ends with a protein having 5′-3′ polymerase activity and a 3′-5′ exonuclease activity, wherein the protein digests 3′ overhangs and fills in 5′ overhangs to generate double-stranded blunt-ended nucleic acids; (b) linking the double-stranded blunt-ended nucleic acids to adapters comprising barcodes, amplifying the nucleic acids primed from primer molecules binding to the adapters, wherein (c) determining sequences of amplified nucleic acid molecules and classifying the sequences of the amplified nucleic acid molecules into families, the members of a family having the same start and stop points on the nucleic acid and the same adapters, and determining consensus sequence
- step (c) identifies the number of consensus sequences in the subset in which the designated position is occupied by a variant nucleotide and presence of a variant nucleotide at each designated position is called when the number of consensus sequences in the subset with the variation meets a threshold except as specified in steps (d)(i) and (ii).
- the disclosure relates to a method for identifying false positive variant nucleotides in a population of nucleic acids comprising: (a) contacting a population of nucleic acids at least one of which is a double-stranded molecule with single-stranded overhangs at one or both ends and overlapping sequences with a protein having 5′-3′ polymerase activity and a 3′-5′ exonuclease activity, wherein the protein digests 3′ overhangs and fills in 5′ overhangs with complementary nucleic acids to generate double-stranded blunt-ended nucleic acids at one or both ends; (b) determining sequences of the double-stranded blunt-ended nucleic acids to provide sequenced nucleic acids (c) for each designated position in a reference sequence, identifying a subset of sequenced nucleic acids including the designated position and identifying sequenced nucleic acids in the subset in which the designated position is occupied by a reference nucleotide and the sequenced nucleic acids in
- step (c) identifies the number of consensus sequences in the subset in which the designated position is occupied by a variant nucleotide and presence of a variant nucleotide at each designated position is called when the number of consensus sequences in the subset with the variation meets a threshold except as specified in steps (d)(i) and (ii).
- the disclosure relates to a method of determining minor allele frequency of a “C” to “T” or a “G” to “A” variant at a designated position in a reference sequence in a population of sequenced nucleic acids mapping to the designated position, wherein minor allele frequency compares a number of sequenced nucleic acids mapping to the designated position comprising the variant (“variant number”) to a total number of sequenced nucleic acids mapping to the designated position, the method comprising adjusting the variant number of T or A variants at the designated position for probability of deamination errors, wherein probability of error is a function of distance of the variant from a 5′ terminus of a molecule in the case of “T” and from the 3′ end of the molecule in case of “A”.
- a C to T variant positioned within a selected distance from the 5′ end of a sequenced polynucleotide, or a G to A variant positioned within a selected distance from the 3′ end of a sequenced nucleic acid is not counted in the variant number.
- all C to T variants are discounted from the variant number when the ratio of C to T variants positioned within a selected distance from the 5′ end of a sequenced polynucleotide to C to T variants positioned outside the selected distance from the 5′ end of a sequenced nucleic acid is greater than a predetermined ratio (e.g., greater than 50%), or when the ratio of G to A variants positioned within a selected distance from the 3′ end of a sequenced nucleic acid to G to A variants positioned outside the selected distance from the 3′ end of a sequenced nucleic acid is greater than a predetermined ratio (e.g., greater than 50%).
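The ratio rule above can be sketched as follows, assuming the input is simply the list of end distances for the molecules that carry the variant. The function name, the default distance of 20, and the treatment of the case with no variants outside the selected distance are assumptions.

```python
def discount_all_variants(variant_distances, selected_distance=20, max_ratio=0.5):
    """Return True if every variant at the position should be discounted
    because variants cluster within `selected_distance` of the end.

    variant_distances: distance from the relevant end for each molecule
    carrying the variant (hypothetical input).
    """
    inside = sum(1 for d in variant_distances if d <= selected_distance)
    outside = len(variant_distances) - inside
    if outside == 0:
        # Assumption: with no variants outside the window, any variant
        # inside it is enough to trigger the discount.
        return inside > 0
    return inside / outside > max_ratio


print(discount_all_variants([3, 5, 12, 40, 90]))  # ratio 3/2 exceeds 0.5
print(discount_all_variants([5, 40, 60, 90]))     # ratio 1/3 does not
```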
- the variant number is determined as the sum of probabilities that each C to T variant or each G to A variant is a true variant.
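One way to picture the probability-weighted variant number is the sketch below. The disclosure states only that the probability of error is a function of distance from the end; the exponential form, the `scale` parameter, and both function names are purely hypothetical choices made so the example runs.

```python
import math


def true_variant_probability(distance, scale=10.0):
    """Hypothetical model: the probability that a C to T (or G to A) variant
    is real rises toward 1 as its distance from the 5' (or 3') end grows."""
    return 1.0 - math.exp(-distance / scale)


def adjusted_maf(variant_distances, total_molecules, scale=10.0):
    """Minor allele frequency with the variant number computed as the sum of
    per-variant probabilities of being a true variant."""
    variant_number = sum(
        true_variant_probability(d, scale) for d in variant_distances
    )
    return variant_number / total_molecules


distances = [2, 5, 50]  # three variant-bearing molecules out of 100
raw_maf = len(distances) / 100
print(raw_maf, adjusted_maf(distances, 100))
```

Under this assumed model the two variants close to the end contribute little, so the adjusted frequency falls well below the raw value of 0.03.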
- the disclosure relates to a method comprising administering to a subject determined to have a cancer marker by the method of any of the previous claims, a therapeutic intervention effective to treat a cancer characterized by the cancer marker.
- the disclosure further provides a method comprising receiving data for the identity of one or more variant nucleotides in cell free nucleic acids of a subject by performing a method of any of the preceding claims; determining presence of a cancer marker from the one or more variant nucleotides; and administering a therapeutic intervention effective to treat a cancer characterized by the cancer marker.
- the disclosure relates to a system.
- One such system comprises:
- a computer in communication with the communication interface wherein the computer comprises one or more computer processors and a computer readable medium comprising machine-executable code that, upon execution by the one or more computer processors, implements a method comprising:
- step (c) identifies the number of consensus sequences in the subset in which the designated position is occupied by a variant nucleotide and presence of a variant nucleotide at each designated position is called when the number of consensus sequences in the subset with the variation meets a threshold except as specified in steps (d)(i) and (ii).
- the disclosure further provides a system, comprising:
- a communication interface that receives, over a communication network, sequencing reads generated by a nucleic acid sequencer
- a computer in communication with the communication interface comprising one or more computer processors and a computer readable medium comprising machine-executable code that, upon execution by the one or more computer processors, implements a method comprising:
- the disclosure further provides a system, comprising:
- a communication interface that receives, over a communication network, sequencing reads generated by a nucleic acid sequencer
- a computer in communication with the communication interface comprising one or more computer processors and a computer readable medium comprising machine-executable code that, upon execution by the one or more computer processors, implements a method comprising:
- step (c) identifies the number of consensus sequences in the subset in which the designated position is occupied by a variant nucleotide and presence of a variant nucleotide at each designated position is called when the number of consensus sequences in the subset with the variation meets a threshold except as specified in steps (d)(i) and (ii).
- the disclosure further provides a system, comprising:
- a communication interface that receives, over a communication network, sequencing reads generated by a nucleic acid sequencer
- a computer in communication with the communication interface comprising one or more computer processors and a computer readable medium comprising machine-executable code that, upon execution by the one or more computer processors, implements a method comprising:
- step (c) identifies the number of consensus sequences in the subset in which the designated position is occupied by a variant nucleotide and presence of a variant nucleotide at each designated position is called when the number of consensus sequences in the subset with the variation meets a threshold except as specified in steps (d)(i) and (ii).
- the disclosure further provides a system, comprising:
- a communication interface that receives, over a communication network, sequencing reads generated by a nucleic acid sequencer
- a computer in communication with the communication interface comprising one or more computer processors and a computer readable medium comprising machine-executable code that, upon execution by the one or more computer processors, implements a method comprising:
- any of the above systems can further include a nucleic acid sequencer.
- the nucleic acid sequencer sequences a sequencing library generated from cell-free DNA molecules derived from a subject, wherein the sequencing library comprises the cell-free DNA molecules and adapters, wherein the adapters comprise barcodes.
- the nucleic acid sequencer performs sequencing-by-synthesis on the sequencing library to generate the sequencing reads.
- the nucleic acid sequencer performs pyrosequencing, single-molecule sequencing, nanopore sequencing, semiconductor sequencing, sequencing-by-ligation or sequencing-by-hybridization on the sequencing library to generate the sequencing reads.
- the nucleic acid sequencer uses a clonal single molecule array derived from the sequencing library to generate the sequencing reads.
- the nucleic acid sequencer comprises a chip having an array of microwells for sequencing the sequencing library to generate the sequencing reads.
- the computer readable medium comprises a memory, a hard drive or a computer server.
- the communication network comprises a telecommunication network, an internet, an extranet, or an intranet.
- the communication network includes one or more computer servers capable of distributed computing, such as cloud computing.
- the computer is located on a computer server that is remotely located from the nucleic acid sequencer.
- the sequencing library further comprises sample barcodes that differentiate a sample from one or more samples.
- Some systems further comprise an electronic display in communication with the computer over a network, wherein the electronic display comprises a user interface for displaying results upon implementing (a)-(c), such as a graphical user interface (GUI) or web-based user interface.
- the electronic display is in a personal computer.
- the electronic display is in an internet enabled computer, optionally at a location remote from the computer.
- the results of the systems and methods disclosed herein are used as an input to generate a report in a paper format.
- this report may provide an indication of the called variants and/or the variants which are deemed to be deamination errors.
- FIG. 1 shows an overview of end repair with Klenow polymerase.
- FIG. 2 shows a C→T deamination scheme.
- FIG. 3 shows preference of C→T conversion at 5′ end of molecule and G→A conversion at 3′ end of molecule.
- FIG. 4 shows a plot comparing the frequency of errors for C to T and G to A variations and those of other variations with distance from the molecular ends.
- the error frequency of C to T and G to A variations is much higher close to molecular ends whereas that of other variations is independent of position relative to molecular ends.
- the points labeled “C>T or G>A” show the average of the rate of C>T errors stratified by the distance measured from 5′ end, and the rate of G>A errors stratified by the distance measured from 3′ end and the points labeled “other errors” show the average of: the rate of C>A+C>G errors stratified by the distance measured from 5′ end, and the rate of G>T+G>C errors stratified by the distance measured from 3′ end.
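The stratification behind such a plot can be sketched as a per-distance error rate. The `(distance, is_error)` input format and the function name are assumptions; the data below are invented solely to exercise the code and are not results from the disclosure.

```python
from collections import defaultdict


def error_rate_by_distance(observations):
    """Return {distance: error rate} for observations given as
    (distance_from_end, is_error) pairs (hypothetical layout)."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for distance, is_error in observations:
        totals[distance] += 1
        if is_error:
            errors[distance] += 1
    return {d: errors[d] / totals[d] for d in sorted(totals)}


# Toy input: distances are measured from the 5' end for C>T calls.
toy = [(1, True), (1, False), (2, False), (2, False)]
print(error_rate_by_distance(toy))
```

Running the same function on C>T calls keyed by 5′ distance and on G>A calls keyed by 3′ distance, then averaging the two results per distance, would give the kind of series the figure description refers to.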
- FIG. 5 shows a computer system
- FIG. 6 shows five sequencing families including a G to A substitution classified as a deamination error.
- the left-hand segment of reference genome sequence is SEQ ID NO:1
- the middle segment of reference genome sequence is SEQ ID NO:2
- the right-hand segment of reference genome sequence is SEQ ID NO:3.
- FIG. 7 shows five sequencing families including a G to A substitution classified as a bona fide mutation.
- the left-hand segment of reference genome sequence is SEQ ID NO:1
- the middle segment of reference genome sequence is SEQ ID NO:2
- the right-hand segment of reference genome sequence is SEQ ID NO:4.
- a subject refers to an animal, such as a mammalian species (preferably human) or avian (e.g., bird) species, or other organism, such as a plant. More specifically, a subject can be a vertebrate, e.g., a mammal such as a mouse, a primate, a simian or a human. Animals include farm animals, sport animals, and pets.
- a subject can be a healthy individual, an individual that has symptoms or signs or is suspected of having a disease or a predisposition to the disease, or an individual that is in need of therapy or suspected of needing therapy.
- a genetic variant refers to an alteration, variant or polymorphism in a nucleic acid sample or genome of a subject. Such alteration, variant or polymorphism can be with respect to a reference genome, which may be a reference genome of the species (e.g., for human, hG19 or hG38), the subject or other individual. Variations include one or more single nucleotide variations (SNVs), insertions, deletions, repeats, small insertions, small deletions, small repeats, structural variant junctions, variable length tandem repeats, and/or flanking sequences. Copy number variants (CNVs), transversions, gene fusions and other rearrangements are also forms of genetic variation.
- a variation can be a base change, insertion, deletion, repeat, copy number variation, transversion, or a combination thereof.
- a cancer marker is a genetic variant associated with presence or risk of developing a cancer.
- a cancer marker can provide an indication a subject has cancer or a higher risk of developing cancer than an age and gender matched subject of the same species that does not have the cancer marker.
- a cancer marker may or may not be causative of cancer.
- a barcode is a short nucleic acid (e.g., less than 500, 100, 50 or 10 nucleotides long), used to label nucleic acid molecules to distinguish nucleic acids from different samples (e.g., representing a sample index), or different nucleic acid molecules in the same sample (e.g., representing a barcode), of different types, or which have undergone different processing.
- Tags can be single stranded, double-stranded or at least partially double-stranded. Tags can have the same length or varied lengths. Tags can be blunt-end or have an overhang. Tags can be attached to one end or both ends of the nucleic acids.
- Barcodes can be decoded to reveal information such as the sample of origin, form or processing of a nucleic acid.
- Tags can be used to allow pooling and parallel processing of multiple samples comprising nucleic acids bearing different barcodes and/or sample indexes with the nucleic acids subsequently being deconvoluted by reading the barcodes.
- Barcodes can also be referred to as molecular identifiers, sample identifier, index tag, and/or tags. Additionally or alternatively, barcodes can be used to distinguish different molecules in the same sample. This includes uniquely barcoding each different molecule in the sample, or non-uniquely barcoding each molecule.
- a limited number of barcodes may be used to barcode each molecule such that different molecules can be distinguished based on their start/stop position where they map on a reference genome in combination with at least one tag.
- a sufficient number of different barcodes are used such that there is a low probability (e.g. <10%, <5%, <1%, or <0.1%) that any two molecules having the same start/stop also have the same barcode.
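The number of barcodes needed to keep that probability low can be estimated with a birthday-style calculation. The sketch below assumes barcodes are assigned uniformly and independently at random, which the text does not state; the function name is hypothetical.

```python
def collision_probability(num_barcodes, num_molecules):
    """Probability that at least two of `num_molecules` molecules sharing the
    same start/stop also share a barcode, assuming uniform random assignment
    from `num_barcodes` distinct barcodes."""
    p_all_distinct = 1.0
    for i in range(num_molecules):
        p_all_distinct *= max(num_barcodes - i, 0) / num_barcodes
    return 1.0 - p_all_distinct


# Two molecules with the same start/stop and 100 barcodes: 1% collision chance.
print(collision_probability(100, 2))
# More molecules at the same coordinates raise the chance quickly.
print(collision_probability(100, 5))
```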
- Some barcodes include multiple molecular identifiers to label samples, forms of molecule within a sample, and molecules within a form having the same start and stop points. Such barcodes can exist in the form A1i, wherein the letter indicates a sample type, the Arabic number indicates a form of molecule within a sample, and the Roman numeral indicates a molecule within a form.
- Adapters are short nucleic acids (e.g., less than 500, 100 or 50 nucleotides long) usually at least partly double-stranded for linkage to either or both ends of a sample nucleic acid molecule.
- Adapters can include primer binding sites to permit amplification of a nucleic acid molecule flanked by adapters at both ends, and/or a sequencing primer binding site, including primer binding sites for next generation sequencing (NGS).
- Adapters can also include binding sites for capture probes, such as an oligonucleotide attached to a flow cell support.
- Adapters can also include a barcode as described above.
- Barcodes are preferably positioned relative to primer and sequencing primer binding sites, such that a barcode is included in amplicons and sequencing reads of a nucleic acid molecule.
- the same or different adapters can be linked to the respective ends of a nucleic acid molecule. Sometimes the same adapter is linked to the respective ends except that the barcode is different.
- a preferred adapter is a Y-shaped adapter in which one end is blunt ended or tailed as described herein, for joining to a nucleic acid molecule, which is also blunt ended or tailed with one or more complementary nucleotides.
- Another preferred adapter is a bell-shaped adapter, likewise with a blunt or tailed end for joining to a nucleic acid to be analyzed.
- sequencing refers to any of a number of technologies used to determine the sequence of a biomolecule, e.g., a nucleic acid such as DNA or RNA.
- Exemplary sequencing methods include, but are not limited to, targeted sequencing, single molecule real-time sequencing, exon sequencing, electron microscopy-based sequencing, panel sequencing, transistor-mediated sequencing, direct sequencing, random shotgun sequencing, Sanger dideoxy termination sequencing, whole-genome sequencing, sequencing by hybridization, pyrosequencing, capillary electrophoresis, gel electrophoresis, duplex sequencing, cycle sequencing, single-base extension sequencing, solid-phase sequencing, high-throughput sequencing, massively parallel signature sequencing, emulsion PCR, co-amplification at lower denaturation temperature-PCR (COLD-PCR), multiplex PCR, sequencing by reversible dye terminator, paired-end sequencing, near-term sequencing, exonuclease sequencing, sequencing by ligation, short-read sequencing, single-molecule sequencing, sequencing-by
- next generation sequencing refers to sequencing technologies having increased throughput as compared to traditional Sanger- and capillary electrophoresis-based approaches, for example, with the ability to generate hundreds of thousands of relatively small sequence reads at a time.
- next generation sequencing techniques include, but are not limited to, sequencing by synthesis, sequencing by ligation, and sequencing by hybridization.
- sequencing run refers to any step or portion of a sequencing experiment performed to determine some information relating to at least one biomolecule (e.g., a nucleic acid molecule such as DNA or RNA).
- DNA (deoxyribonucleic acid) is a chain of nucleotides of four types: A (adenine), T (thymine), C (cytosine), and G (guanine).
- RNA (ribonucleic acid) is a chain of nucleotides of four types: A (adenine), U (uracil), C (cytosine), and G (guanine).
- in complementary base pairing in DNA, adenine (A) pairs with thymine (T) and cytosine (C) pairs with guanine (G).
- in RNA, adenine (A) pairs with uracil (U) and cytosine (C) pairs with guanine (G).
- nucleic acid sequencing data denotes any information or data that is indicative of the order of the nucleotide bases (e.g., adenine, guanine, cytosine, and thymine or uracil) in a molecule (e.g., a whole genome, whole transcriptome, exome, oligonucleotide, polynucleotide, or fragment) of a nucleic acid such as DNA or RNA.
- such data encompasses sequence information obtained using any available technique, platform or technology, including, but not limited to: capillary electrophoresis, microarrays, ligation-based systems, polymerase-based systems, hybridization-based systems, direct or indirect nucleotide identification systems, pyrosequencing, ion- or pH-based detection systems, and electronic signature-based systems.
- a “polynucleotide”, “nucleic acid”, “nucleic acid molecule”, or “oligonucleotide” refers to a linear polymer of nucleosides (including deoxyribonucleosides, ribonucleosides, or analogs thereof) joined by internucleosidic linkages. Typically, a polynucleotide comprises at least three nucleosides. Oligonucleotides often range in size from a few monomeric units, e.g. 3-4, to hundreds of monomeric units.
- when a polynucleotide is represented by a sequence of letters, such as “ATGCCTG,” it will be understood that the nucleotides are in 5′→3′ order from left to right and that “A” denotes deoxyadenosine, “C” denotes deoxycytidine, “G” denotes deoxyguanosine, and “T” denotes thymidine, unless otherwise noted.
- the letters A, C, G, and T may be used to refer to the bases themselves, to nucleosides, or to nucleotides comprising the bases, as is standard in the art.
- a reference sequence is a known sequence used for purposes of comparison with experimentally determined sequences.
- a known sequence can be an entire genome, a chromosome, or any segment thereof.
- a reference typically includes at least 20, 50, 100, 200, 250, 300, 350, 400, 450, 500, 1000, or more nucleotides.
- a reference sequence can align with a single contiguous sequence of a genome or chromosome or can include noncontiguous segments aligning with different regions of a genome or chromosome.
- Reference human genomes include, e.g., hG19 and hG38.
- designated position in a reference sequence refers to a genomic coordinate in the reference sequence.
- a first single stranded nucleic acid sequence overlaps with a second single stranded sequence if the first nucleic acid sequence or its complement and the second nucleic acid sequence or its complement align with overlapping but non-identical segments of a contiguous reference sequence, such as the sequence of a human chromosome.
- a fully or partially double-stranded nucleic acid overlaps with another fully or partially double-stranded nucleic acid if either of its strands overlaps those of the other nucleic acid.
- a “C” to “T” variant or conversion refers to the presence of base “T” in a sequenced polynucleotide at a coordinate position occupied in a reference sequence by base “C”.
- a “G” to “A” variant or conversion refers to the presence of base “A” in a sequenced polynucleotide at a coordinate position occupied in a reference sequence by base “G”.
- a nucleic acid molecule can be conceptually divided into a 5′ terminal end, an internal portion and a 3′ terminal end. Terminal ends can be designated based on a predetermined number of nucleotides from the terminus.
- the 5′ terminal end can be represented by, e.g., the 20 terminal nucleotides at the 5′ end.
- the 3′ terminal end can be represented by, e.g., the 20 terminal nucleotides at the 3′ end.
- the nucleic acid molecule can be divided into a terminal portion, as described, and a remainder.
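The conceptual division of a molecule into terminal ends and an internal portion can be sketched as a small classifier. The 0-based indexing from the 5′ terminus, the function name, and the choice to let the 5′ label win on molecules too short to have an internal portion are assumptions.

```python
def region_of(index, length, terminal=20):
    """Classify a position in a molecule of `length` nucleotides.

    index: 0-based offset from the 5' terminus (hypothetical convention).
    terminal: number of nucleotides counted as a terminal end.
    """
    if not 0 <= index < length:
        raise ValueError("index lies outside the molecule")
    if index < terminal:
        return "5' terminal end"  # checked first, so it wins on short molecules
    if index >= length - terminal:
        return "3' terminal end"
    return "internal"


for i in (0, 19, 20, 139, 140, 159):
    print(i, region_of(i, 160))
```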
- minor allele frequency refers to the frequency at which minor alleles (e.g., not the most common allele) occur in a given population of nucleic acids, such as a sample. Genetic variants at a low minor allele frequency may have a relatively low frequency of presence in a sample.
- a “minor allele fraction” refers to the fraction of DNA molecules harboring an allelic alteration in a given sample.
- a MAF of a somatic variant can be less than 0.5, 0.1, 0.05, or 0.01. For example, a MAF of a somatic variant is <0.05.
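As a worked illustration of the definition, the fraction of molecules carrying each non-major allele at one position can be computed as below. The input format (one base per sequenced molecule covering the position) and the function name are assumptions.

```python
from collections import Counter


def minor_allele_fractions(bases):
    """Return {allele: fraction} for every allele other than the most common
    one, given the base observed in each molecule at a single position."""
    counts = Counter(bases)
    if not counts:
        return {}
    total = sum(counts.values())
    major = counts.most_common(1)[0][0]
    return {base: n / total for base, n in counts.items() if base != major}


pileup = "C" * 97 + "T" * 3  # 100 molecules, 3 of them carrying T
print(minor_allele_fractions(pileup))
```

In this toy input the T allele has a fraction of 0.03, which falls under the <0.05 example given for a somatic variant.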
- processing can refer to determining a difference, e.g., a difference in number or sequence.
- for example, gene expression, copy number variation (CNV), indel, and/or single nucleotide variant (SNV) values or sequences can be processed.
- An adapter is an artificially synthesized sequence that can be coupled to a nucleic acid molecule or a polynucleotide sequence by any approach including ligation, hybridization, and/or amplification.
- Adapters can also include binding sites for capture probes, such as an oligonucleotide attached to a flow cell support. Adapters can also include a barcode as described above. Tags are preferably positioned relative to primer and sequencing primer binding sites, such that a tag is included in amplicons and sequencing reads of a nucleic acid molecule.
- the same or different adapters can be linked to the respective ends of a nucleic acid molecule. Sometimes the same adapter is linked to the respective ends except that the tag is different.
- a preferred adapter is a Y-shaped adapter in which one end is blunt ended or tailed as described herein, for joining to a nucleic acid molecule, which is also blunt ended or tailed with one or more complementary nucleotides.
- Another preferred adapter is a bell-shaped adapter, likewise with a blunt or tailed end for joining to a nucleic acid to be analyzed.
- Sequencing nucleic acids can identify variations associated with the presence, susceptibility or prognosis of disease. However, the value of such information can be compromised by errors introduced by the sequencing process including preparing nucleic acids for sequencing or by other factors, such as environmental conditions which affect the quality of the sample of nucleic acids during transportation and/or initial laboratory processing. Environmental conditions affecting quality include temperature and length of storage period before processing.
- the disclosure is premised in certain aspects on the observation that blunting single-stranded overhangs on nucleic acids in a sample has a significant propensity for introducing deamination-induced sequencing errors in which a cytosine (C) is changed to thymine (T) at the 5′ end of a nucleic acid strand resulting in a guanine (G) to adenine (A) change in the complementary base at the 3′-end of the complementary nucleic acid strand.
- Nucleic acids can be subject to deamination in which base “C” is converted to base “T”. In this case, in a double-stranded molecule, one strand will have “T”, and the complementary strand will have “G”. Such errors can be detected upon sequencing if the sequences of the different strands are tracked.
- the method can be performed on any nucleic acid that is partially double-stranded with at least one single-stranded overhang or a population including such a nucleic acid.
- the method is performed on a population of nucleic acids at least some of which are partially double-stranded with single-stranded overhangs at one or both ends.
- the methods can be performed, for example, on a population including at least 2, 10,000, 1,000,000, 1,000,000,000, 10,000,000,000 or more different such nucleic acids.
- at least some nucleic acids including those with single-stranded overhangs in the population are of overlapping sequence.
- Such populations can exist naturally or as a result of fragmentation during preparation of a sample or can be generated enzymatically such as by partial restriction digestion.
- an example of such a nucleic acid population is cell-free nucleic acids, such as exist in blood and other body fluids.
- such nucleic acids are typically in heterogeneous form including double-stranded DNA with single-stranded overhangs at one or both ends, as well as single-stranded DNA and RNA. Double-stranded blunt-ended DNA can also be present.
- the nucleic acid population can be prepared for sequencing by enzymatic blunt-ending of double-stranded nucleic acids with single-stranded overhangs at one or both ends.
- the population can be treated with a protein with a 5′-3′ DNA polymerase activity and a 3′-5′ exonuclease activity in the presence of nucleotides (e.g., A, C, G and T or U).
- Exemplary proteins are DNA polymerases, such as Klenow large fragment and T4 DNA polymerase.
- for a 5′ overhang, the protein extends the recessed 3′ end on the complementary strand until it is flush with the 5′ end, producing a blunt end.
- for a 3′ overhang, the protein digests from the 3′ end up to and sometimes beyond the 5′ end of the opposing strand. If digestion proceeds beyond the 5′ end of the opposing strand, the gap can be filled in by polymerase activity as for a 5′ overhang. Blunt-ending of double-stranded nucleic acids facilitates attachment of adapters and subsequent amplification.
- FIG. 1 shows a scheme by which a Klenow enzyme fills in 5′ overhangs and digests 3′ overhangs.
- FIGS. 2 and 3 show a scheme in which C-T deamination-induced errors are introduced at the 5′-end of a Watson strand and complementary G-A errors at the 3′ end of the complementary Crick strand.
- Deamination-induced C to T conversions are shown by the circled T's.
- the circled A's represent corresponding changes in the complementary strand.
- Deamination-induced errors at the 5′ end of the Watson strand are reproduced as a complementary nucleotide at the 3′ end of the Crick strand due to extension of the 3′ end based on the 5′ overhang of the Watson strand, e.g., a C to T conversion on the Watson strand and a G to A conversion on the Crick strand.
- Deamination-induced errors in the double-stranded region are not reproduced by way of the filling or digesting processes, and the two strands have non-complementary nucleotides at that position or nucleotide.
- Deamination-induced errors in the 3′ end of the Watson strand are digested away.
- Deamination-induced errors near the 5′ end of the Crick strand may be retained if the 3′ end of the Watson strand is digested back so as to require fill-in of the nucleotide complementary to the deamination-induced error.
- only C to T variations at the 5′ end of a strand and G to A variations at the 3′ end of a strand are represented in both strands of a nucleic acid molecule.
- a “C” to “T” conversion positioned at a 5′ overhang in the Watson strand of the original molecule will be represented by a T error, and propagated in all amplified molecules as A on the complementary strands.
- a “C” to “T” conversion positioned at a double-stranded portion of the original molecule will be represented by G on one strand, and as A on the complementary strand.
- the error is likely to be propagated as “T” on one strand, e.g., the Watson strand, and a mixture of “A” and “G” at the same position on the complementary strand, e.g., the Crick strand.
- a “C” to “T” conversion positioned in a 3′ overhang in the Watson strand of the original molecule will be digested and eliminated from the overhang to form a blunt-ended double-stranded molecule.
- a “C” to “T” conversion positioned near the 5′ end of the Crick strand of a molecule having a 3′ overhang on the Watson strand may have the 3′ overhang digested back beyond the conversion and, upon fill-in, be represented in the Watson/Crick strand as T/A. This will likely be propagated in all amplified molecules as T/A.
- “C” to “T” conversion in the double-stranded portion of the original molecule can be detected as errors, as the reads from the original Watson strand will contain T, but reads from the original Crick strand will contain G.
- a “C” to “T” conversion positioned at a 5′ overhang in the Watson strand of the original molecule will produce complementary T/A on the Watson/Crick strands, respectively.
- conversions of nucleotides in both 5′ and 3′ overhangs typically do not provide self-evident errors or double-stranded support, e.g., A/T (Watson/Crick) or C/G (Watson/Crick).
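The outcomes described above can be reproduced with a toy model. The following Python sketch is illustrative only and is not the disclosed implementation. It assumes each strand is a mapping from coordinate to base, with coordinates increasing 5′ to 3′ along the Watson strand, that fill-in copies the template exactly, and that digestion stops flush with the opposing 5′ end (digestion beyond the 5′ end is not modeled). The names `blunt_end`, `strand`, and `mismatches` are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def strand(seq, start):
    """Map each base of seq to a Watson-strand coordinate, beginning at start."""
    return {start + i: base for i, base in enumerate(seq)}


def blunt_end(watson, crick):
    """Blunt a duplex; both strands are dicts keyed by Watson coordinate.

    The Crick dict holds the base physically present on the Crick strand at
    each coordinate, so its 3' end is at its lowest coordinate.
    """
    w, c = dict(watson), dict(crick)
    lo_w, hi_w = min(w), max(w)
    lo_c, hi_c = min(c), max(c)
    # Watson 5' overhang (left): extend the recessed Crick 3' end by copying.
    for pos in range(lo_w, lo_c):
        c[pos] = COMPLEMENT[w[pos]]
    # Crick 3' overhang (left): digest it away.
    for pos in range(lo_c, lo_w):
        del c[pos]
    # Crick 5' overhang (right): extend the recessed Watson 3' end by copying.
    for pos in range(hi_w + 1, hi_c + 1):
        w[pos] = COMPLEMENT[c[pos]]
    # Watson 3' overhang (right): digest it away.
    for pos in range(hi_c + 1, hi_w + 1):
        del w[pos]
    return w, c


def mismatches(w, c):
    """Coordinates where the two blunted strands are not complementary."""
    return sorted(p for p in w if COMPLEMENT[w[p]] != c[p])


# True Watson sequence ACGACGTACG with C->T conversions at coordinates 1
# (inside the 5' overhang) and 4 (inside the double-stranded region).
damaged_watson = strand("ATGATGTACG", 0)
intact_crick = strand("TGCATGC", 3)  # pairs with true Watson coordinates 3..9
w, c = blunt_end(damaged_watson, intact_crick)
print(w[1], c[1])        # overhang conversion: copied into the Crick strand
print(mismatches(w, c))  # double-stranded conversion: strands disagree
```

In this model the conversion in the overhang ends up as a fully complementary T/A pair, whereas the conversion in the double-stranded region remains visible as a T/G mismatch when both strands are tracked.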
- Nucleic acid populations can be subject to additional processing such as conversion of single-stranded nucleic acids to double-stranded and/or conversion of RNA to DNA. These forms of nucleic acid can also be linked to adapters and amplified.
- nucleic acids subject to blunt-ending as described above, and optionally other nucleic acids in a sample are sequenced to produce sequenced nucleic acids.
- a sequenced nucleic acid can refer either to the sequence of a nucleic acid, including sequence reads produced after redundantly sequencing a nucleic acid (e.g., through amplification or re-reading of a single molecule) or a nucleic acid whose sequence has been determined. Sequencing is performed so as to provide sequence data of individual nucleic acid molecules in a sample either directly or indirectly from a consensus sequence of amplification products of an individual nucleic acid molecule in the sample.
- double-stranded nucleic acids with single-stranded overhangs in a sample after blunt-ending are linked at both ends to adapters including barcodes or tags (attached by ligation or by primer extension), and the sequencing determines nucleic acid sequences as well as barcodes in the adapters.
- the blunt-ended DNA molecules can be blunt-end ligated with a blunt end of an at least partially double-stranded adapter (e.g., a Y shaped or bell-shaped adapter).
- blunt ends of sample nucleic acids and adapters can be tailed with complementary nucleotides to facilitate ligation.
- the adapters may have a tail, e.g., at least one nucleotide attached or linked onto one of the strands, and the at least one nucleotide is complementary to an overhang introduced on the nucleic acid molecule of interest.
- the tail on the adapter can be any one or more of the nucleotides, A, T, C, or G.
- the sample may be contacted with a sufficient number of adapters that there is a low probability (e.g., <1% or <0.1%) that any two instances of the same nucleic acid receive the same combination of barcodes from the adapters linked at one end or both ends.
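How many distinct barcodes count as "sufficient" can be estimated with a birthday-problem calculation. The sketch below is an assumption-laden illustration, not part of the disclosure: it supposes barcodes are drawn uniformly and independently at each end, and the function name `collision_probability` is hypothetical.

```python
def collision_probability(num_barcodes, num_molecules, ends=2):
    """Probability that at least two of num_molecules identical fragments
    receive the same barcode combination, assuming uniform random tagging."""
    combos = num_barcodes ** ends
    if num_molecules > combos:
        return 1.0  # pigeonhole: a repeat is certain
    p_all_unique = 1.0
    for i in range(num_molecules):
        p_all_unique *= (combos - i) / combos
    return 1.0 - p_all_unique


# Example: 64 barcodes at each of two ends, 5 copies of the same fragment.
print(collision_probability(64, 5))
```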
- the use of adapters in this manner permits grouping of sequences with the same start and stop points on a reference nucleic acid and linked to the same combination of barcodes into families of reads generated from the same original molecule. Such a family represents sequences of amplification products of a nucleic acid in the sample before amplification.
- sequences of family members can be compiled to derive consensus nucleotide(s) or a complete consensus sequence for a nucleic acid molecule in the original sample, as modified by blunt ending and adapter attachment.
- the nucleotide occupying a specified position of a nucleic acid in the sample is determined to be the consensus of nucleotides occupying that corresponding position in family member sequences.
- a consensus nucleotide can be determined by methods such as voting or confidence score, to name two methods. Families can include sequences of one or both strands of a double-stranded nucleic acid.
- when members of a family include sequences of both strands from a double-stranded nucleic acid, sequences of one strand are converted to their complement for purposes of compiling all sequences to derive consensus nucleotide(s) or sequences.
- Some families may include only a single member sequence. In this case, this sequence can be taken as the sequence of a nucleic acid in the sample before amplification. Alternatively, families with only a single member sequence can be eliminated from subsequent analysis.
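The family grouping and voting steps can be sketched as follows. This is a minimal illustration under stated assumptions: reads are already aligned, strand-converted, and of equal length within a family; ties are reported as `N`; and the names `group_families` and `consensus` are hypothetical. A confidence-score approach, which the text also mentions, is not shown.

```python
from collections import Counter, defaultdict


def group_families(reads):
    """Group reads sharing start, stop, and barcode combination.

    reads: iterable of (start, stop, barcodes, sequence) tuples.
    """
    families = defaultdict(list)
    for start, stop, barcodes, seq in reads:
        families[(start, stop, barcodes)].append(seq)
    return families


def consensus(seqs):
    """Majority vote at each position; an exact tie yields 'N'."""
    out = []
    for column in zip(*seqs):
        (top, top_count), *rest = Counter(column).most_common(2)
        tie = bool(rest) and rest[0][1] == top_count
        out.append("N" if tie else top)
    return "".join(out)


reads = [
    (100, 104, ("AAT", "GGC"), "ACGT"),
    (100, 104, ("AAT", "GGC"), "ACGT"),
    (100, 104, ("AAT", "GGC"), "ATGT"),  # one amplification or read error
    (100, 104, ("CCA", "TTG"), "ACGA"),  # different barcodes: separate family
]
for key, members in group_families(reads).items():
    print(key, consensus(members))
```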
- Nucleotide variations in sequenced nucleic acids can be determined by comparing sequenced nucleic acids with a reference sequence.
- the reference sequence is often a known sequence, e.g., a known whole or partial genome sequence from a subject, such as a whole genome sequence of a human subject.
- the reference sequence can be hg19.
- the sequenced nucleic acids can represent sequences determined directly for a nucleic acid in a sample, or a consensus of sequences of amplification products of such a nucleic acid, as described above.
- a comparison can be performed at one or more designated positions on a reference sequence.
- a subset of sequenced nucleic acids can be identified including a position corresponding with a designated position of the reference sequence when the respective sequences are maximally aligned.
- within the subset, it can be determined which sequenced nucleic acids, if any, include a nucleotide variation at the designated position, and optionally which, if any, include a reference nucleotide (i.e., the same as in the reference sequence) and/or the number of sequences in the subset including the reference nucleotide.
- a variant may be called when supported by the sequenced nucleic acids including the nucleotide variation. For example, if the number of sequenced nucleic acids in the subset including a nucleotide variant exceeds a threshold, then a variant nucleotide can be called at the designated position.
- the threshold can be a simple number, such as at least 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 sequenced nucleic acids within the subset including the nucleotide variant, or it can be a ratio, such as at least 0.5, 1, 2, 3, 4, 5, 10, 15, or 20% of sequenced nucleic acids within the subset including the nucleotide variant, among other possibilities.
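Threshold-based calling at a single designated position might look like the following sketch. The function name `call_variants` and its parameters are hypothetical, and the default thresholds are placeholders rather than recommended values.

```python
from collections import Counter


def call_variants(bases, reference_base, min_count=1, min_fraction=0.0):
    """Return {variant_base: count} for variants meeting both thresholds.

    bases: the nucleotide observed at the designated position in each
    sequenced nucleic acid of the subset.
    """
    total = len(bases)
    counts = Counter(b for b in bases if b != reference_base)
    return {
        base: n
        for base, n in counts.items()
        if n >= min_count and n / total >= min_fraction
    }


subset = ["C"] * 95 + ["T"] * 4 + ["A"]
print(call_variants(subset, "C", min_count=2))        # count threshold
print(call_variants(subset, "C", min_fraction=0.05))  # ratio threshold
```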
- the comparison can be repeated for any designated position of interest in the reference sequence. Sometimes a comparison can be performed for designated positions occupying at least 20, 100, 200, or 300 contiguous positions on a reference sequence, e.g., 20-500, or 50-300 contiguous positions. C to T or G to A variations supported by sequenced nucleic acids in the subset with the same confidence as that used to call other variations may nevertheless contain deamination-induced sequencing errors.
- Deamination-induced sequencing errors may be inadvertently included in called variant nucleotides unless measures are taken to eliminate them from the called variant nucleotides.
- Deamination-induced errors can be recognized by either or both of two basic criteria.
- deamination errors are context dependent. Deamination of cytosine to thymine occurs more frequently when the cytosine is flanked by thymine and guanine (i.e., as TCG) than when it is flanked by other nucleotides. Similarly, a variation of guanine to adenine (on the complementary strand) occurs more frequently when the guanine is flanked by C and A (i.e., as CGA) than when it is flanked by other nucleotides. Thus, deamination-induced errors can be called when a C to T or G to A variation occurs in a TCG to TTG or CGA to CAA context, respectively. In some methods, about 90% of deamination errors occur in these contexts.
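The context criterion reduces to a lookup on the reference trinucleotide. A minimal sketch, assuming a 0-based position into a reference string on the reported strand; the function name `in_deamination_context` is hypothetical.

```python
# (reference base, variant base, reference trinucleotide) combinations
# corresponding to TCG -> TTG and CGA -> CAA.
DEAMINATION_CONTEXTS = {("C", "T", "TCG"), ("G", "A", "CGA")}


def in_deamination_context(reference, pos, variant_base):
    """True if the variant at pos is C>T in TCG or G>A in CGA."""
    if pos < 1 or pos > len(reference) - 2:
        return False  # no flanking nucleotide on one side
    context = reference[pos - 1:pos + 2]
    return (reference[pos], variant_base, context) in DEAMINATION_CONTEXTS


reference = "ATCGAA"
print(in_deamination_context(reference, 2, "T"))  # C>T within TCG
print(in_deamination_context(reference, 3, "A"))  # G>A within CGA
```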
- deamination-induced errors depend on the distance between a designated position and an end of a sequenced nucleic acid or, in other words, the number of nucleotides separating these positions. For example, deamination-induced errors occurring in an internal portion of a sequence are likely to be detectable as a “T” in a read from one strand and a “G” in a read from the complementary strand. However, deamination-induced errors occurring proximate to the ends (terminal end) of a nucleic acid being sequenced may not be evident because such errors are introduced by the process of blunt-ended repair, which can result in two perfectly complementary strands.
- sequence reads containing deamination of cytosine to thymine may more frequently occur proximate to the 5′ end of a sequenced nucleic acid and deamination of a guanine to an adenine may more frequently occur proximate to the 3′ end.
- the average distance between a C to T variation arising from deamination at a designated position and the 5′ end of sequenced nucleic acids is less than the average distance between the reference nucleotide at the designated position and the 5′ end of sequenced nucleic acids.
- the average distance between a G to A variation arising from deamination at a designated position and the 3′ end of sequenced nucleic acids is less than the average distance between the reference nucleotide at the designated position and the 3′ end of sequenced nucleic acids.
- if a G to A or C to T variation at a designated position represents a real variation rather than a sequencing error, there should be no systematic difference (beyond what may arise due to random factors) between the average distances of these variations and the ends of sequenced nucleic acids compared with those of the reference nucleotide at the designated position.
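The distance comparison can be sketched as below. The data layout is an assumption: each sequenced nucleic acid is a `(start, stop, base)` tuple in reference coordinates with `stop` exclusive, the 5′ end at `start`, and `base` being the nucleotide observed at the designated position. The function name `mean_end_distances` is hypothetical, and no statistical test is applied.

```python
from statistics import mean


def mean_end_distances(molecules, pos, ref_base, variant_base, end="5p"):
    """Mean distance from pos to the chosen end, for variant-bearing and
    reference-bearing molecules. Returns (variant_mean, reference_mean);
    an entry is None when no molecule carries that base."""

    def distance(start, stop):
        return pos - start if end == "5p" else stop - 1 - pos

    variant = [distance(s, e) for s, e, b in molecules if b == variant_base]
    reference = [distance(s, e) for s, e, b in molecules if b == ref_base]
    return (
        mean(variant) if variant else None,
        mean(reference) if reference else None,
    )


molecules = [
    (98, 260, "T"), (99, 250, "T"),  # variant close to the 5' end
    (40, 200, "C"), (10, 180, "C"),  # reference base far from the 5' end
]
print(mean_end_distances(molecules, 100, "C", "T"))
```

A much smaller mean for the variant-bearing molecules than for the reference-bearing ones, as in this made-up input, is the pattern the text associates with deamination-induced C to T errors.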
- provided herein are methods of determining minor allele frequency of a “C” to “T” or a “G” to “A” variant at a designated position in a reference sequence in a population of sequenced polynucleotides mapping to the designated position, wherein minor allele frequency compares a number of sequenced polynucleotides mapping to the designated position comprising the variant (“variant number”) to total number of sequenced polynucleotides mapping to the designated position, the method comprising adjusting the variant number of T or A variants at the genomic coordinate for probability of deamination errors, wherein probability of error is a function of distance of the variant from a 5′ terminus of a molecule in the case of “T” and from the 3′ end of the molecule in case of “A”.
- the chance of a “T” variant in a molecule resulting from a deamination error is a function of the distance of the variant position from the 5′ end of the molecule. More specifically, the closer the variant is to the 5′ end of the molecule, the more likely it is that the variant is a deamination-induced C to T transition. This is because errors are propagated where there is a 5′ overhang that is filled in, and shorter overhangs at the 5′ end are more likely than longer overhangs. Similarly, G to A deamination errors are more likely the closer the position is to the 3′ terminus of the molecule, for similar reasons.
- the asymptotic amount represents the general deamination rate. This rate may vary from sample to sample.
- the relevant proximity to the ends of sequenced nucleic acids in which deamination-induced errors are likely to occur corresponds approximately to the length of single-stranded overhangs in a nucleic acid population being sequenced, but can be slightly longer in the case of a 3′ overhang due to digestion beyond the end of the complementary strand and subsequent filling in.
- the proximity can be defined for example, as less than or equal to 30, 25, 20, 15, 10 or 5 nucleotides from the 3′ or 5′ end of a sequenced nucleic acid strand (“terminal proximity”).
- the proximity can be defined the same or differently for the 3′ or 5′ end.
- a subset of sequenced nucleic acids is identified including a position corresponding with a designated position of the reference sequence when the respective sequences are maximally aligned. Some of the sequenced nucleic acids within this subset have the designated position occurring within a defined proximity of the 5′ end. These sequenced nucleic acids can be referred to as a first fraction of the subset. Some of the sequenced nucleic acids within the subset have the designated position occurring within a defined proximity of the 3′ end. These sequenced nucleic acids can be referred to as a second fraction of the subset.
- a “C” to “T” conversion can then be recognized by its representation in sequenced nucleic acids constituting the first fraction and a “G” to “A” conversion by its representation in sequenced nucleic acids constituting the second fraction.
- Representation can be defined simply as the number of sequenced nucleic acids present including a C to T or G to A variation at the designated position in the relevant fraction.
- a C to T deamination error can be called if a certain number, e.g., at least 1, 2, 3, 4, 5 or 6 sequenced nucleic acids of the first fraction include a C to T variation at the designated position.
- a G to A deamination error can be called if a certain number, e.g., at least 1, 2, 3, 4, 5, or 6 sequenced nucleic acids of the second fraction include a G to A variation at the designated position.
- Representation can also be defined by the proportion of nucleic acids within the first or second fraction including a C to T or G to A variation at the designated position as compared with the proportion outside the first fraction or second fraction respectively.
- a deamination error can be called if the representation of a C to T or G to A variation at the designated position within the relevant fraction is at least 25, 30, 40, 50, 60 or 70% of sequenced nucleic acids within the relevant fraction.
- Overrepresentation can also be defined by the relative proportion of sequenced nucleic acids within the relevant fraction with C to T or G to A variation at the designated position compared with the corresponding proportion of sequenced nucleic acids with the C to T or G to A variation outside the fraction but in the same subset.
- a higher representation of sequenced nucleic acids within the relevant fraction with the C to T or G to A variation than outside the fraction is an indication the variation is a deamination error. For example, if 50% of sequenced nucleic acids in a first fraction of the subset include a C to T variation at the designated position, and only 1% of nucleic acids outside the fraction but within the subset (where the designated position is not within the defined proximity of the 5′ end) include it, then the C to T variation is probably a deamination-induced error.
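The split into a terminal fraction and the remainder of the subset, and the comparison of proportions, can be sketched as follows. Assumptions: molecules are `(start, stop, base)` tuples with `stop` exclusive and the 5′ end at `start`; "within proximity" is taken to mean a distance of fewer than `proximity` nucleotides from the end; and the decision thresholds `min_inside` and `min_ratio` are placeholders. All names are hypothetical.

```python
def fraction_representation(molecules, pos, variant_base, proximity=20, end="5p"):
    """Return (proportion with the variant inside the terminal fraction,
    proportion with the variant outside it but within the subset)."""
    inside, outside = [], []
    for start, stop, base in molecules:
        distance = pos - start if end == "5p" else stop - 1 - pos
        bucket = inside if distance < proximity else outside
        bucket.append(base == variant_base)

    def proportion(flags):
        return sum(flags) / len(flags) if flags else 0.0

    return proportion(inside), proportion(outside)


def looks_like_deamination(p_inside, p_outside, min_inside=0.25, min_ratio=5.0):
    """Flag overrepresentation of the variant in the terminal fraction."""
    return p_inside >= min_inside and p_inside >= min_ratio * p_outside


# Mirror of the 50% versus 1% example: 10 molecules with the position near
# the 5' end (5 carry T) and 100 molecules with it far from the end (1 carries T).
near = [(95, 300, "T")] * 5 + [(95, 300, "C")] * 5
far = [(20, 300, "T")] + [(20, 300, "C")] * 99
p_in, p_out = fraction_representation(near + far, 100, "T")
print(p_in, p_out, looks_like_deamination(p_in, p_out))
```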
- Determining minor allele fraction can comprise calculating a ratio of molecules mapping to a designated position that comprise a particular variant to total molecules mapping to the designated position. So, for example, if 100 molecules map to the genomic coordinate, and 13 of them comprise the variant, the minor allele fraction can be calculated as 13%. However, if certain variants are considered to be the result of deamination error, these can be discounted from the count. So, for example, if 7 of the 13 variants are designated as errors, the ratio can be calculated as 6/93, or about 6.5%. In certain instances, all variants at the designated position may be discounted, for example, if variants at the coordinate located at the 5′ end of the molecule account for more than 50% of all variants at the coordinate.
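The arithmetic in this example can be checked with a short sketch. The function name and parameters are hypothetical. Following the worked numbers, molecules designated as errors are removed from both the numerator and the denominator, and the "discount all" rule is modeled by returning zero when terminal variants exceed the stated share of all variants.

```python
def adjusted_minor_allele_fraction(
    variant_count,
    total_count,
    error_count=0,
    terminal_variant_count=0,
    discard_above=0.5,
):
    """Minor allele fraction after discounting designated deamination errors."""
    if variant_count and terminal_variant_count / variant_count > discard_above:
        return 0.0  # all variants at the position are discounted
    kept = variant_count - error_count
    return kept / (total_count - error_count)


print(adjusted_minor_allele_fraction(13, 100))                 # 13/100
print(adjusted_minor_allele_fraction(13, 100, error_count=7))  # 6/93
```

The second call evaluates 6/93, which is approximately 0.0645.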
- Deamination-induced errors can be so categorized based on either context or representation or both. For example, if a C to T or G to A variation occurs in a context indicated above suggesting a deamination error, then the extent of overrepresentation in the relevant fraction of the subset required to categorize the variation as a deamination error may be reduced compared with what would be required if the categorization were based on overrepresentation alone.
- Whether an apparent variant is called as a deamination error can be based on several factors.
- the existence of a variant at a locus can be called as such when the absolute number of variant molecules is above a certain threshold (e.g., by ratio or by percentage).
- the existence of a variant can be reported out if the allele fraction (the percent of molecules mapping to a locus bearing the variant) is above a threshold, for example, determined by the expected rate in control samples. When reported out, both the presence of the variant and the minor allele fraction of the variant can be reported out.
- deamination errors can be treated in any of a number of different ways. In one embodiment, any “T” variants positioned within a predetermined terminal proximity may simply be attributed to error and discounted.
- the ratio of “T” variants positioned within the predetermined terminal proximity to those positioned outside the predetermined terminal proximity is determined. If that amount is above a certain threshold amount, e.g., above 20%, above 30%, above 40%, or above 50%, then the error rate is considered high enough that no variant is reported at that position. If the amount is below the threshold level, then the variant is subjected to normal reporting requirements. In another method, if the minor allele fraction is above the expected general error rate, then the variant is reported out regardless of the existence of error and may or may not be corrected for error.
- a “T” variant is scored by the probability that it is a true variant rather than a deamination error, and scores at all positions are added to produce a number to be incorporated in the minor allele fraction. So, for example, the chance of a variant at the first (terminal) 5′ nucleotide being a true variant may be 50%. The chance of a variant at the tenth 5′ nucleotide might be 75%. The chance of a variant beyond the 20th 5′ nucleotide might be 95%.
- Such probabilities can be determined empirically, for example by examining at least 10, at least 50, at least 100 or at least 500 control samples.
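A probability-weighted count can be sketched as follows. The step table below reuses the illustrative 50%, 75%, and 95% figures from the text, but how positions between those anchor points are treated is an assumption made here purely for illustration; in practice the table would come from control samples. All names are hypothetical.

```python
# Hypothetical table: (highest 1-based position from the 5' end, probability
# that a "T" variant at or below that position is a true variant).
TRUE_VARIANT_PROBABILITY_STEPS = [(9, 0.50), (20, 0.75)]
DEFAULT_PROBABILITY = 0.95  # beyond the 20th nucleotide


def true_variant_probability(position_from_5p):
    """Look up the probability for a 1-based distance from the 5' end."""
    for upper_bound, probability in TRUE_VARIANT_PROBABILITY_STEPS:
        if position_from_5p <= upper_bound:
            return probability
    return DEFAULT_PROBABILITY


def weighted_minor_allele_fraction(variant_positions, total_molecules):
    """Sum per-molecule probabilities instead of counting each variant as 1."""
    expected_true = sum(true_variant_probability(p) for p in variant_positions)
    return expected_true / total_molecules


# Four variant-bearing molecules out of 100, at these distances from the 5' end.
print(weighted_minor_allele_fraction([1, 10, 25, 40], 100))
```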
- each family member within a family including families representing both strands of the nucleic acid in the original sample includes the deamination error. If different strands have different nucleotides, the error is self-evident.
- the number of designated positions in the reference sequence in which a variant nucleotide is categorized as a deamination error in a particular sample can vary.
- the number of such designated positions can be at least 1, 2, 3, 4, 5, 6, 7, 8, 9 or 10 among other possibilities.
- the present methods can be computer-implemented, such that any or all of the steps described in the specification or appended claims other than wet chemistry steps can be performed in a suitably programmed computer.
- the computer can be a mainframe, personal computer, tablet, smart phone, cloud, online data storage, remote data storage, or the like.
- the computer can be operated in one or more locations.
- a computer program for analyzing a nucleic acid population can include codes for performing any of the steps other than wet chemistry steps described in the specification or in the appended claims; for example codes for determining sequences of the double-stranded blunt-ended nucleic acids to provide sequenced nucleic acids; code for identifying a subset of sequenced nucleic acids including the designated position and identifying the number of sequenced nucleic acids in the subset in which the designated position is occupied by a variant nucleotide at each designated position in a reference sequence; and code for calling presence of a variant nucleotide at each designated position at which the number of sequenced nucleic acids in the subset with the variation meets a threshold, except that presence of a variant nucleotide at a designated position is not called if: (i) the variant is a C to T or G to A variation compared with the reference nucleotide; and (ii) the variant nucleotide is categorized as a deamination error based on: (1) nucleo
- the present methods can be implemented in a system (e.g., a data processing system) for analyzing a nucleic acid population.
- the system can also include a processor, a system bus, a main memory and optionally an auxiliary memory coupled to one another to perform one or more of the steps described in the specification or appended claims, such as the following: determining sequences of the double-stranded blunt-ended nucleic acids to provide sequenced nucleic acids; identifying a subset of sequenced nucleic acids including the designated position and identifying the number of sequenced nucleic acids in the subset in which the designated position is occupied by a variant nucleotide at each designated position in a reference sequence; and calling presence of a variant nucleotide at each designated position at which the number of sequenced nucleic acids in the subset with the variation meets a threshold, except that presence of a variant nucleotide at a designated position is not called if: (i) the variant is a C to T or G to A variation compared with
- the system can also include a display or printer for outputting results, such as variant nucleotides and deamination-induced errors, a keyboard and/or pointer for providing user input, such as setting thresholds or defined proximities, among other accessories.
- the system can also include a sequencing apparatus coupled to the memory to provide raw sequencing data.
- Various steps of the present methods can utilize information and/or programs and generate results that are stored on computer-readable media (e.g., hard drive, auxiliary memory, external memory, server, database, portable memory device (e.g., CD-R, DVD, ZIP disk, flash memory cards), and the like).
- information used for and results generated by the methods that can be stored on computer-readable media include reference sequences, thresholds or defined proximities for nucleotide variant or deamination-induced error calls, raw sequencing data, sequenced nucleic acids, variant nucleotides and their associations with disease, and deamination-induced errors.
- the present disclosure also includes an article of manufacture for analyzing a nucleic acid population that includes a machine-readable medium containing one or more programs which when executed implement the steps of the present methods.
- the disclosure can be implemented in hardware and/or software. For example, different aspects of the disclosure can be implemented in either client-side logic or server-side logic.
- the disclosure or components thereof can be embodied in a fixed media program component containing logic instructions and/or data that when loaded into an appropriately configured computing device cause that device to perform according to the disclosure.
- a fixed media containing logic instructions can be delivered to a viewer on a fixed media for physically loading into a viewer's computer or a fixed media containing logic instructions may reside on a remote server that a viewer accesses through a communication medium to download a program component.
- FIG. 5 shows a computer system 901 that is programmed or otherwise configured to implement methods of the present disclosure.
- the computer system 901 includes a central processing unit (CPU, also “processor” and “computer processor” herein) 905 , which can be a single core or multi core processor, or a plurality of processors for parallel processing.
- the computer system 901 also includes memory or memory location 910 (e.g., random-access memory, read-only memory, flash memory), electronic storage unit 915 (e.g., hard disk), communication interface 920 (e.g., network adapter) for communicating with one or more other systems, and peripheral devices 925 , such as cache, other memory, data storage and/or electronic display adapters.
- the memory 910 , storage unit 915 , interface 920 and peripheral devices 925 are in communication with the CPU 905 .
- the storage unit 915 can be a data storage unit (or data repository) for storing data.
- the computer system 901 can be operatively coupled to a computer network (“network”) 930 with the aid of the communication interface 920 .
- the network 930 can be the Internet, an internet and/or extranet, or an intranet and/or extranet that is in communication with the Internet.
- the network 930 in some cases is a telecommunication and/or data network.
- the network 930 can include a local area network.
- the network 930 can include one or more computer servers, which can enable distributed computing, such as cloud computing.
- the network 930 in some cases with the aid of the computer system 901 , can implement a peer-to-peer network, which may enable devices coupled to the computer system 901 to behave as a client or a server.
- the CPU 905 can execute a sequence of machine-readable instructions, which can be embodied in a program or software.
- the instructions may be stored in a memory location, such as the memory 910 .
- the instructions can be directed to the CPU 905 , which can subsequently program or otherwise configure the CPU 905 to implement methods of the present disclosure. Examples of operations performed by the CPU 905 can include fetch, decode, execute, and writeback.
- the CPU 905 can be part of a circuit, such as an integrated circuit.
- One or more other components of the system 901 can be included in the circuit.
- the circuit is an application specific integrated circuit (ASIC).
- the storage unit 915 can store files, such as drivers, libraries and saved programs.
- the storage unit 915 can store user data, e.g., user preferences and user programs.
- the computer system 901 in some cases can include one or more additional data storage units that are external to the computer system 901 , such as located on a remote server that is in communication with the computer system 901 through an intranet or the Internet.
- the computer system 901 can communicate with one or more remote computer systems through the network 930 .
- the computer system 901 can communicate with a remote computer system of a user.
- remote computer systems include personal computers (e.g., portable PC), slate or tablet PC's (e.g., Apple® iPad, Samsung® Galaxy Tab), telephones, Smart phones (e.g., Apple® iPhone, Android-enabled device, Blackberry®), or personal digital assistants.
- the user can access the computer system 901 via the network 930 .
- Methods as described herein can be implemented by way of machine (e.g., computer processor) executable code stored on an electronic storage location of the computer system 901, such as, for example, on the memory 910 or electronic storage unit 915.
- the machine executable or machine readable code can be provided in the form of software.
- the code can be executed by the processor 905 .
- the code can be retrieved from the storage unit 915 and stored on the memory 910 for ready access by the processor 905 .
- the electronic storage unit 915 can be precluded, and machine-executable instructions are stored on memory 910 .
- the code can be pre-compiled and configured for use with a machine having a processor adapted to execute the code, or can be compiled during runtime.
- the code can be supplied in a programming language that can be selected to enable the code to execute in a precompiled or as-compiled fashion.
- aspects of the systems and methods provided herein can be embodied in programming.
- Various aspects of the technology may be thought of as “products” or “articles of manufacture” typically in the form of machine (or processor) executable code and/or associated data that is carried on or embodied in a type of machine readable medium.
- Machine-executable code can be stored on an electronic storage unit, such as memory (e.g., read-only memory, random-access memory, flash memory) or a hard disk.
- “Storage” type media can include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide non-transitory storage at any time for the software programming. All or portions of the software may at times be communicated through the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer into the computer platform of an application server.
- another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links. The physical elements that carry such waves, such as wired or wireless links, optical links or the like, also may be considered as media bearing the software.
- As used herein, unless restricted to non-transitory, tangible “storage” media, terms such as computer or machine “readable medium” refer to any medium that participates in providing instructions to a processor for execution.
- a machine readable medium, such as computer-executable code, may take many forms, including but not limited to a tangible storage medium, a carrier wave medium or a physical transmission medium.
- Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, such as may be used to implement the databases, etc. shown in the drawings.
- Volatile storage media include dynamic memory, such as main memory of such a computer platform.
- Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that comprise a bus within a computer system.
- Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications.
- Common forms of computer-readable media therefore include, for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a ROM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data.
- Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution.
- the computer system 901 can include or be in communication with an electronic display 935 that comprises a user interface (UI) 940 for providing, for example, a report.
- Examples of UI's include, without limitation, a graphical user interface (GUI) and web-based user interface.
- Methods and systems of the present disclosure can be implemented by way of one or more algorithms.
- An algorithm can be implemented by way of software upon execution by the central processing unit 905 .
- a sample can be any biological sample isolated from a subject.
- Samples can include body tissues, such as known or suspected solid tumors, whole blood, platelets, serum, plasma, stool, red blood cells, white blood cells or leucocytes, endothelial cells, tissue biopsies, cerebrospinal fluid, synovial fluid, lymphatic fluid, ascites fluid, interstitial or extracellular fluid, the fluid in spaces between cells, including gingival crevicular fluid, bone marrow, pleural effusions, saliva, mucus, sputum, semen, sweat, and urine. Samples are preferably body fluids, particularly blood and fractions thereof, and urine. Such samples include nucleic acids shed from tumors.
- the nucleic acids can include DNA and RNA and can be in double- and/or single-stranded forms.
- a sample can be in the form originally isolated from a subject or can have been subjected to further processing to remove or add components, such as cells, enrich for one component relative to another, or convert one form of nucleic acid to another, such as RNA to DNA or single-stranded nucleic acids to double-stranded.
- a body fluid for analysis is plasma or serum containing cell-free nucleic acids, e.g., cell-free DNA (cfDNA).
- the volume of plasma can depend on the desired read depth for sequenced regions. Exemplary volumes are 0.4-40 ml, 5-20 ml, 10-20 ml. For example, the volume can be 0.5 ml, 1 ml, 5 ml, 10 ml, 20 ml, 30 ml, or 40 ml. A volume of sampled plasma may be 5 to 20 ml.
- the sample can comprise various amounts of nucleic acid that contains genome equivalents.
- a sample of about 30 ng DNA can contain about 10,000 (10^4) haploid human genome equivalents and, in the case of cfDNA, about 200 billion (2×10^11) individual polynucleotide molecules.
- a sample of about 100 ng of DNA can contain about 30,000 haploid human genome equivalents and, in the case of cfDNA, about 600 billion individual molecules.
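
The mass-to-count figures above can be sanity-checked with rough arithmetic. The following is a minimal sketch only: the constants (about 3.3 pg per haploid human genome, a genome of about 3.1×10^9 base pairs, and a typical cfDNA fragment length of 168 nucleotides) and the function names are assumptions introduced for illustration and are not values stated in this passage.

```python
# Illustrative arithmetic only; every constant below is an assumption.
HAPLOID_GENOME_MASS_PG = 3.3      # assumed mass of one haploid human genome (pg)
HAPLOID_GENOME_SIZE_BP = 3.1e9    # assumed haploid human genome length (bp)
MEAN_CFDNA_FRAGMENT_BP = 168      # assumed typical cfDNA fragment length (nt)


def genome_equivalents(mass_ng: float) -> float:
    """Approximate number of haploid genome equivalents in mass_ng of DNA."""
    return mass_ng * 1000.0 / HAPLOID_GENOME_MASS_PG


def cfdna_molecules(mass_ng: float) -> float:
    """Approximate number of cfDNA fragments in mass_ng of DNA, assuming each
    genome equivalent is split into fragments of MEAN_CFDNA_FRAGMENT_BP."""
    fragments_per_genome = HAPLOID_GENOME_SIZE_BP / MEAN_CFDNA_FRAGMENT_BP
    return genome_equivalents(mass_ng) * fragments_per_genome


if __name__ == "__main__":
    for mass in (30, 100):
        print(f"{mass} ng: ~{genome_equivalents(mass):,.0f} genome equivalents, "
              f"~{cfdna_molecules(mass):.1e} cfDNA molecules")
```

Under these assumptions the estimates land in the same order of magnitude as the quoted figures (roughly 10^4 genome equivalents and on the order of 10^11 molecules for 30 ng), which is all such a back-of-envelope check can show.
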
- a sample can comprise nucleic acids from different sources, e.g., from cells and cell free.
- a sample can comprise nucleic acids carrying mutations.
- a sample can comprise DNA carrying germline mutations and/or somatic mutations.
- a sample can comprise DNA carrying cancer-associated mutations (e.g., cancer-associated somatic mutations).
- Exemplary amounts of cell free nucleic acids in a sample before amplification range from about 1 fg to about 1 µg, e.g., 1 pg to 200 ng, 1 ng to 100 ng, 10 ng to 1000 ng.
- the amount can be up to about 600 ng, up to about 500 ng, up to about 400 ng, up to about 300 ng, up to about 200 ng, up to about 100 ng, up to about 50 ng, or up to about 20 ng of cell-free nucleic acid molecules.
- the amount can be at least 1 fg, at least 10 fg, at least 100 fg, at least 1 pg, at least 10 pg, at least 100 pg, at least 1 ng, at least 10 ng, at least 100 ng, at least 150 ng, or at least 200 ng of cell-free nucleic acid molecules.
- the amount can be up to 1 femtogram (fg), 10 fg, 100 fg, 1 picogram (pg), 10 pg, 100 pg, 1 ng, 10 ng, 100 ng, 150 ng, or 200 ng of cell-free nucleic acid molecules.
- the method can comprise obtaining 1 femtogram (fg) to 200 ng.
- a cell-free nucleic acid sample refers to a sample containing cell-free nucleic acids.
- Cell-free nucleic acids are nucleic acids not contained within or otherwise bound to a cell or in other words nucleic acids remaining in a sample after removing intact cells.
- Cell-free nucleic acids can refer to all non-encapsulated nucleic acid sourced from a bodily fluid (e.g., blood, urine, CSF, etc.) from a subject.
- Cell-free nucleic acids include DNA (cfDNA), RNA (cfRNA), and hybrids thereof, including genomic DNA, mitochondrial DNA, circulating DNA, siRNA, miRNA, circulating RNA (cRNA), tRNA, rRNA, small nucleolar RNA (snoRNA), Piwi-interacting RNA (piRNA), long non-coding RNA (long ncRNA), or fragments of any of these.
- Cell-free nucleic acids can be double-stranded, single-stranded, or a hybrid thereof.
- a cell-free nucleic acid can be released into bodily fluid through secretion or cell death processes, e.g., cellular necrosis and apoptosis.
- ctDNA circulating tumor DNA
- cffDNA Cell-free fetal DNA
- a cell-free nucleic acid or proteins associated with it can have one or more epigenetic modifications, for example, a cell-free nucleic acid can be acetylated, 5-methylated, ubiquitylated, phosphorylated, sumoylated, ribosylated, and/or citrullinated.
- Cell-free nucleic acids have an exemplary size distribution of about 100-500 nucleotides, with molecules of 110 to about 230 nucleotides representing about 90% of molecules, with a mode of about 168 nucleotides in humans and a second minor peak in a range between 240 to 440 nucleotides.
- Cell-free nucleic acids can be about 160 to about 180 nucleotides, or about 320 to about 360 nucleotides, or about 440 to about 480 nucleotides.
- Cell-free nucleic acids can be isolated from bodily fluids through a partitioning step in which cell-free nucleic acids, as found in solution, are separated from intact cells and other non-soluble components of the bodily fluid. Partitioning may include techniques such as centrifugation or filtration. Alternatively cells in bodily fluids can be lysed and cell-free and cellular nucleic acids processed together. Generally, after addition of buffers and wash steps, cell-free nucleic acids can be precipitated with an alcohol. Further clean up steps may be used such as silica based columns to remove contaminants or salts. Non-specific bulk carrier nucleic acids, for example, may be added throughout the reaction to optimize certain aspects of the procedure such as yield.
- samples can include various forms of nucleic acid including double-stranded DNA, single stranded DNA and single stranded RNA.
- single stranded DNA and RNA can be converted to double-stranded forms so they are included in subsequent processing and analysis steps.
- Sample nucleic acids flanked by adapters can be amplified by PCR and other amplification methods typically primed from primers binding to primer binding sites in adapters flanking a DNA molecule to be amplified.
- Amplification methods can involve cycles of extension, denaturation and annealing resulting from thermocycling or can be isothermal as in transcription mediated amplification.
- Other amplification methods include the ligase chain reaction, strand displacement amplification, nucleic acid sequence based amplification, and self-sustained sequence based replication.
- One or more amplifications can be applied to introduce barcodes to a nucleic acid molecule using conventional nucleic acid amplification methods.
- the amplification can be conducted in one or more reaction mixtures.
- Molecule tags and sample indexes/tags can be introduced simultaneously, or in any sequential order. Molecule tags and sample indexes/tags can be introduced prior to and/or after sequence capturing. In some cases, only the molecule tags are introduced prior to probe capturing while the sample indexes/tags are introduced after sequence capturing. In some cases, both the molecule tags and the sample indexes/tags are introduced prior to probe capturing. In some cases, the sample indexes/tags are introduced after sequence capturing.
- sequence capturing involves introducing a single-stranded nucleic acid molecule complementary to a targeted sequence, e.g., a coding sequence of a genomic region whose mutation is associated with a cancer type.
- the amplifications generate a plurality of non-uniquely or uniquely tagged nucleic acid amplicons with molecule tags and sample indexes/tags at a size ranging from 200 nt to 700 nt, 250 nt to 350 nt, or 320 nt to 550 nt.
- the amplicons have a size of about 300 nt.
- the amplicons have a size of about 500 nt.
- Barcodes can be incorporated into or otherwise joined to adapters by chemical synthesis, ligation, overlap extension PCR among other methods. Generally, assignment of unique or non-unique barcodes in reactions follows methods and systems described by US patent applications 20010053519, 20110160078, and U.S. Pat. Nos. 6,582,908 and 7,537,898 and 9,598,731.
- Tags can be linked to sample nucleic acids randomly or non-randomly. In some cases, they are introduced at an expected ratio of identifiers (e.g., a combination of barcodes) to microwells.
- the collection of barcodes can be unique, e.g., all the barcodes have different nucleotide sequences.
- the collection of barcodes can be non-unique, e.g., some of the barcodes have the same nucleotide sequence, and some of the barcodes have different nucleotide sequence.
- the identifiers may be loaded so that more than 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 50, 100, 500, 1000, 5000, 10000, 50,000, 100,000, 500,000, 1,000,000, 10,000,000, 50,000,000 or 1,000,000,000 identifiers are loaded per genome sample. In some cases, the identifiers may be loaded so that less than 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 50, 100, 500, 1000, 5000, 10000, 50,000, 100,000, 500,000, 1,000,000, 10,000,000, 50,000,000 or 1,000,000,000 identifiers are loaded per genome sample.
- the average number of identifiers loaded per sample genome is less than, or greater than, about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 50, 100, 500, 1000, 5000, 10000, 50,000, 100,000, 500,000, 1,000,000, 10,000,000, 50,000,000 or 1,000,000,000 identifiers per genome sample.
- a preferred format uses 20-50 different tags, ligated to both ends of a target molecule, creating 20-50 × 20-50 tags, e.g., 400-2500 tags. Such numbers of tags are sufficient that different molecules having the same start and stop points have a high probability (e.g., at least 94%, 99.5%, 99.99%, 99.999%) of receiving different combinations of tags.
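
The tag-count arithmetic above can be illustrated with a short calculation. This is a sketch under simplifying assumptions that the passage does not state: every tag combination is taken to be equally likely and assigned independently, and the function names are hypothetical.

```python
from math import prod


def tag_combinations(n_tags: int) -> int:
    """Number of ordered tag pairs when one of n_tags is attached to each end."""
    return n_tags * n_tags


def prob_all_distinct(n_combinations: int, n_molecules: int) -> float:
    """Probability that n_molecules sharing start and stop points all receive
    different tag combinations (uniform, independent assignment assumed)."""
    if n_molecules > n_combinations:
        return 0.0
    # Birthday-problem product: the i-th molecule must avoid the i combinations
    # already taken by earlier molecules.
    return prod(1 - i / n_combinations for i in range(n_molecules))


if __name__ == "__main__":
    for n_tags in (20, 50):
        combos = tag_combinations(n_tags)
        print(n_tags, combos, prob_all_distinct(combos, 2))
```

With 20 tags per end there are 400 combinations, and with 50 there are 2500. The probability of distinct combinations falls as more molecules share the same start and stop points, which is why the quoted probabilities depend on the expected number of such co-located molecules.
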
- identifiers may be predetermined or random or semi-random sequence oligonucleotides.
- a plurality of barcodes may be used such that barcodes are not necessarily unique to one another in the plurality.
- barcodes may be attached (e.g., by ligation or PCR amplification) to individual molecules such that the combination of the barcode and the sequence it may be attached to creates a unique sequence that may be individually tracked.
- detection of non-uniquely tagged barcodes in combination with sequence data of beginning (start) and end (stop) portions of sequence reads may allow assignment of a unique identity to a particular molecule.
- the length, or number of base pairs, of an individual sequence read may also be used to assign a unique identity to such a molecule.
- fragments from a single strand of nucleic acid having been assigned a unique identity may thereby permit subsequent identification of fragments from the parent strand, and/or a complementary strand.
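
As an illustration of how non-unique barcodes combined with start and stop positions can assign an identity to a molecule, the following minimal sketch groups reads into families keyed on barcode, start and stop. The `Read` record and the function name are hypothetical and introduced only for this example. A real pipeline would also need to handle sequencing errors in barcodes and strand orientation, which this sketch ignores.

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    barcode: str    # tag sequence on the read; may be shared by other molecules
    start: int      # reference coordinate where the read begins
    stop: int       # reference coordinate where the read ends
    sequence: str   # base calls of the read


def group_into_families(reads):
    """Group reads sharing the same (barcode, start, stop) into one family,
    treating that triple as the identity of the original molecule."""
    families = defaultdict(list)
    for read in reads:
        families[(read.barcode, read.start, read.stop)].append(read)
    return dict(families)


if __name__ == "__main__":
    reads = [
        Read("ACGT", 100, 268, "GATTACA"),
        Read("ACGT", 100, 268, "GATTACA"),
        Read("ACGT", 105, 270, "GATTACA"),  # same barcode, different molecule
    ]
    for key, members in group_into_families(reads).items():
        print(key, len(members))
```

Read length could be added to the key in the same way where it is used as a further discriminator.
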
- Sequencing methods include, for example, Sanger sequencing, high-throughput sequencing, pyrosequencing, sequencing-by-synthesis, single-molecule sequencing, nanopore sequencing, semiconductor sequencing, sequencing-by-ligation, sequencing-by-hybridization, RNA-Seq (Illumina), Digital Gene Expression (Helicos), Next generation sequencing, Single Molecule Sequencing by Synthesis (SMSS) (Helicos), massively-parallel sequencing, Clonal Single Molecule Array (Solexa), shotgun sequencing, ION TORRENT™, Oxford Nanopore, Roche Genia, Maxam-Gilbert sequencing, primer walking, and sequencing using PacBio, SOLiD, ION TORRENT™, or Nanopore platforms. Sequencing reactions can be performed in a variety of sample processing units, which may be multiple lanes, multiple channels, multiple wells, or other means of processing multiple sample sets substantially simultaneously. Sample processing unit can also include multiple sample
- the sequencing reactions can be performed on one or more fragment types known to contain markers of cancer or other disease.
- the sequencing reactions can also be performed on any nucleic acid fragments present in the sample.
- the sequencing reactions may provide for sequence coverage of the genome of at least 5%, 10%, 15%, 20%, 25%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, 99%, 99.9% or 100%. In other cases, sequence coverage of the genome may be less than 5%, 10%, 15%, 20%, 25%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, 99%, 99.9% or 100%.
- Simultaneous sequencing reactions may be performed using multiplex sequencing.
- cell free polynucleotides may be sequenced with at least 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 50000, 100,000 sequencing reactions. In other cases, cell free polynucleotides may be sequenced with less than 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 50000, 100,000 sequencing reactions. Sequencing reactions may be performed sequentially or simultaneously. Subsequent data analysis may be performed on all or part of the sequencing reactions. In some cases, data analysis may be performed on at least 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 50000, 100,000 sequencing reactions.
- data analysis may be performed on less than 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 50000, 100,000 sequencing reactions.
- An exemplary read depth is 1000-50000 reads per locus (base).
- the present methods can be used to diagnose the presence of conditions, particularly cancer, in a subject, to characterize conditions (e.g., staging cancer or determining heterogeneity of a cancer), to monitor response to treatment of a condition, and to effect a prognosis of the risk of developing a condition or of the subsequent course of a condition.
- Cancer cells, like most cells, can be characterized by a rate of turnover, in which old cells die and are replaced by newer cells. Generally, dead cells in contact with vasculature in a given subject may release DNA or fragments of DNA into the blood stream. This is also true of cancer cells during various stages of the disease. Cancer cells may also be characterized, dependent on the stage of the disease, by various genetic aberrations such as copy number variation as well as rare mutations. This phenomenon may be used to detect the presence or absence of cancers in individuals using the methods and systems described herein.
- the types and number of cancers that may be detected may include blood cancers, brain cancers, lung cancers, skin cancers, nose cancers, throat cancers, liver cancers, bone cancers, lymphomas, pancreatic cancers, skin cancers, bowel cancers, rectal cancers, thyroid cancers, bladder cancers, kidney cancers, mouth cancers, stomach cancers, solid state tumors, heterogeneous tumors, homogenous tumors and the like.
- Cancers can be detected from genetic variations including mutations, rare mutations, indels, copy number variations, transversions, translocations, inversion, deletions, aneuploidy, partial aneuploidy, polyploidy, chromosomal instability, chromosomal structure alterations, gene fusions, chromosome fusions, gene truncations, gene amplification, gene duplications, chromosomal lesions, DNA lesions, abnormal changes in nucleic acid chemical modifications, abnormal changes in epigenetic patterns.
- Genetic data can also be used for characterizing a specific form of cancer. Cancers are often heterogeneous in both composition and staging. Genetic profile data may allow characterization of specific sub-types of cancer that may be important in the diagnosis or treatment of that specific sub-type. This information may also provide a subject or practitioner clues regarding the prognosis of a specific type of cancer and allow either a subject or practitioner to adapt treatment options in accord with the progress of the disease. Some cancers progress, becoming more aggressive and genetically unstable. Other cancers may remain benign, inactive or dormant. The system and methods of this disclosure may be useful in determining disease progression.
- the present analysis is also useful in determining the efficacy of a particular treatment option.
- Successful treatment options may increase the amount of copy number variation or rare mutations detected in a subject's blood if the treatment is successful as more cancers may die and shed DNA. In other examples, this may not occur.
- certain treatment options may be correlated with genetic profiles of cancers over time. This correlation may be useful in selecting a therapy.
- the present methods can be used to monitor residual disease or recurrence of disease.
- the present methods can also be used for detecting genetic variations in conditions other than cancer.
- Immune cells such as B cells
- Clonal expansions may be monitored using copy number variation detection and certain immune states may be monitored.
- copy number variation analysis may be performed over time to produce a profile of how a particular disease may be progressing.
- Copy number variation or even rare mutation detection may be used to determine how a population of pathogens is changing during the course of infection. This may be particularly important during chronic infections, such as HIV/AIDS or hepatitis infections, whereby viruses may change life cycle state and/or mutate into more virulent forms during the course of infection.
- the present methods may be used to determine or profile rejection activities of the host body as immune cells attempt to destroy transplanted tissue, to monitor the status of transplanted tissue, and to alter the course of treatment or prevention of rejection.
- a disease may be heterogeneous. Disease cells may not be identical.
- some tumors are known to comprise different types of tumor cells, some cells in different stages of the cancer.
- heterogeneity may comprise multiple foci of disease. Again, in the example of cancer, there may be multiple tumor foci, perhaps where one or more foci are the result of metastases that have spread from a primary site.
- the present methods can be used to generate a profile, fingerprint or set of data that is a summation of genetic information derived from different cells in a heterogeneous disease.
- This set of data may comprise copy number variation and rare mutation analyses alone or in combination.
- the present methods can be used to diagnose, prognose, monitor or observe cancers or other diseases of fetal origin. That is, these methodologies may be employed in a pregnant subject to diagnose, prognose, monitor or observe cancers or other diseases in an unborn subject whose DNA and other polynucleotides may co-circulate with maternal molecules.
- the number and types of variant nucleotides in a sample can provide an indication of the amenability of the subject providing the sample to treatment, i.e., therapeutic intervention.
- presence of a high number of variant nucleotides is a positive indicator for immunotherapy because the presence of such mutations is associated with neoepitopes forming targets for immunotherapy.
- Immunotherapy can include use of an antibody against any of PD-1, PD-2, PD-L1, PD-L2, CTLA-4, OX40, B7.1, B7He, LAG3, CD137, KIR, CCR5, CD27, or CD40 among other treatments.
- Other exemplary agents for immunotherapy include proinflammatory cytokines, such as IL-1β, IL-6, and TNF-α.
- Other exemplary agents are T-cells activated against a tumor, such as by expressing a chimeric antigen receptor that targets a tumor antigen.
- Immunotherapy stimulates the immune system to attack tumor antigens distinguished from wildtype counterparts by the presence of mutation(s).
- variant nucleotides provide targets for existing drugs or indicate resistance to such drugs. Eliminating false positives due to deamination-induced sequencing errors increases the accuracy with which the number and types of variant nucleotides can be determined. Thus, subjects analyzed by the present methods can thereafter be subject to differential treatment regimes depending on the nucleotide variants discovered. For example, a greater proportion of subjects whose number of determined variant nucleotides is at or exceeds a threshold can receive immunotherapy than subjects whose number of determined variant nucleotides is below the threshold.
- Nucleotide variations in sequenced nucleic acids can be determined by comparing sequenced nucleic acids with a reference sequence.
- the reference sequence is often a known sequence, e.g., a known whole or partial genome sequence from a subject, such as a whole genome sequence of a human subject.
- the reference sequence can be hg19.
- the sequenced nucleic acids can represent sequences determined directly for a nucleic acid in a sample, or a consensus of sequences of amplification products of such a nucleic acid, as described above.
- a comparison can be performed at one or more designated positions on a reference sequence.
- a subset of sequenced nucleic acids can be identified including a position corresponding with a designated position of the reference sequence when the respective sequences are maximally aligned.
- Within the subset, it can be determined which, if any, sequenced nucleic acids include a nucleotide variation at the designated position, and optionally which, if any, include a reference nucleotide (i.e., the same nucleotide as in the reference sequence). If the number of sequenced nucleic acids in the subset including a nucleotide variant exceeds a threshold, then a variant nucleotide can be called at the designated position.
- the threshold can be a simple number, such as at least 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 sequenced nucleic acids within the subset including the nucleotide variant, or it can be a ratio, such as at least 0.5, 1, 2, 3, 4, 5, 10, 15, or 20 of sequenced nucleic acids within the subset including the nucleotide variant, among other possibilities.
- the comparison can be repeated for any designated position of interest in the reference sequence. Sometimes a comparison can be performed for designated positions occupying at least 20, 100, 200, or 300 contiguous positions on a reference sequence, e.g., 20-500, or 50-300 contiguous positions.
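
The threshold-based call at a designated position can be sketched as follows. This is an illustrative example only, not the claimed method: the function name, the default thresholds, and the representation of the subset as a list of observed bases are assumptions made for the sketch. It applies both a count threshold and an optional fraction threshold.

```python
from collections import Counter


def call_variant(bases_at_position, reference_base, min_count=2, min_fraction=0.0):
    """Return the variant base called at a designated position, or None.

    bases_at_position: the base observed at the designated position in each
    sequenced nucleic acid of the subset (one entry per nucleic acid).
    min_count and min_fraction are hypothetical thresholds for illustration.
    """
    # Tally only the bases that differ from the reference nucleotide.
    variants = Counter(b for b in bases_at_position if b != reference_base)
    if not variants:
        return None
    base, count = variants.most_common(1)[0]
    fraction = count / len(bases_at_position)
    if count >= min_count and fraction >= min_fraction:
        return base
    return None


if __name__ == "__main__":
    observed = list("GGGAAG")  # four reference G, two variant A
    print(call_variant(observed, "G"))
```

Repeating the call over a range of designated positions amounts to looping this function over the per-position base lists. A count-only threshold such as this does not by itself distinguish true variants from deamination artifacts.
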
- FIG. 6 shows families of sequencing reads of cell free DNA.
- the sequencing reads map to various segments of an ALK gene (CD246) on human chromosome 2.
- the reference sequence of the relevant region of the ALK gene is shown at the bottom of the figure (the gap in the sequence represents additional nucleotides not shown for conciseness of the figure).
- the figure shows five families of sequencing reads having 2, 3, 6, 3 and 6 reads respectively from top to bottom. Reads from one orientation are shown in black and reads from the other orientation are shown in white. Each of the families shows a G to A mismatch in each read of the family. Viewed in isolation, these families of sequencing reads provide sufficient evidence to call a G to A mutation. However, this picture changes when the position of the G to A mutation is considered relative to the 3′ end of the sequence reads as follows:
- FIG. 7 is presented in similar format to FIG. 6 showing sequencing reads from five families with 8, 4, 2, 5 and 4 members respectively. Again each of the five families has an apparent G to A substitution in each of its reads. However, in this case, the relative positions of the substitution to the 3′ end of sequencing reads is different as shown below:
Landscapes
- Life Sciences & Earth Sciences (AREA)
- Chemical & Material Sciences (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Health & Medical Sciences (AREA)
- Engineering & Computer Science (AREA)
- Organic Chemistry (AREA)
- Physics & Mathematics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Zoology (AREA)
- Wood Science & Technology (AREA)
- Analytical Chemistry (AREA)
- Genetics & Genomics (AREA)
- Molecular Biology (AREA)
- Biotechnology (AREA)
- Biophysics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Immunology (AREA)
- Biochemistry (AREA)
- Microbiology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Theoretical Computer Science (AREA)
- Chemical Kinetics & Catalysis (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
- Apparatus Associated With Microorganisms And Enzymes (AREA)
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US16/866,252 US11008616B2 (en) | 2017-11-03 | 2020-05-04 | Correcting for deamination-induced sequence errors |
| US17/210,202 US11718873B2 (en) | 2017-11-03 | 2021-03-23 | Correcting for deamination-induced sequence errors |
| US18/336,281 US20240141425A1 (en) | 2017-11-03 | 2023-06-16 | Correcting for deamination-induced sequence errors |
Applications Claiming Priority (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US201762581609P | 2017-11-03 | 2017-11-03 | |
| PCT/US2018/059056 WO2019090147A1 (en) | 2017-11-03 | 2018-11-02 | Correcting for deamination-induced sequence errors |
| US16/866,252 US11008616B2 (en) | 2017-11-03 | 2020-05-04 | Correcting for deamination-induced sequence errors |
Related Parent Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/US2018/059056 Continuation WO2019090147A1 (en) | 2017-11-03 | 2018-11-02 | Correcting for deamination-induced sequence errors |
Related Child Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US17/210,202 Continuation US11718873B2 (en) | 2017-11-03 | 2021-03-23 | Correcting for deamination-induced sequence errors |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| US20200377941A1 US20200377941A1 (en) | 2020-12-03 |
| US11008616B2 US11008616B2 (en) | 2021-05-18 |
Family
ID=66332356
Family Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US16/866,252 Active US11008616B2 (en) | 2017-11-03 | 2020-05-04 | Correcting for deamination-induced sequence errors |
| US17/210,202 Active 2039-06-17 US11718873B2 (en) | 2017-11-03 | 2021-03-23 | Correcting for deamination-induced sequence errors |
| US18/336,281 Pending US20240141425A1 (en) | 2017-11-03 | 2023-06-16 | Correcting for deamination-induced sequence errors |
Country Status (6)
| Country | Link |
|---|---|
| US (3) | US11008616B2 |
| EP (1) | EP3704265A4 |
| JP (3) | JP7304852B2 |
| CN (1) | CN111542616A |
| CA (1) | CA3079252A1 |
| WO (1) | WO2019090147A1 |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP4581625A1 (en) * | 2022-08-29 | 2025-07-09 | Foundation Medicine, Inc. | Methods and systems for detecting tumor shedding |
Citations (15)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20010053519A1 (en) | 1990-12-06 | 2001-12-20 | Fodor Stephen P.A. | Oligonucleotides |
| US6630144B1 (en) * | 1999-08-30 | 2003-10-07 | The United States Of America As Represented By The Secretary Of The Army | Monoclonal antibodies to Ebola glycoprotein |
| US7537898B2 (en) | 2001-11-28 | 2009-05-26 | Applied Biosystems, Llc | Compositions and methods of selective nucleic acid isolation |
| US20110160078A1 (en) | 2009-12-15 | 2011-06-30 | Affymetrix, Inc. | Digital Counting of Individual Molecules by Stochastic Attachment of Diverse Labels |
| WO2013142389A1 (en) | 2012-03-20 | 2013-09-26 | University Of Washington Through Its Center For Commercialization | Methods of lowering the error rate of massively parallel dna sequencing using duplex consensus sequencing |
| WO2013191637A1 (en) | 2012-06-19 | 2013-12-27 | Sjoeblom Tobias | Method and device for efficient calculation of allele ratio confidence intervals and uses thereof |
| WO2014149134A2 (en) | 2013-03-15 | 2014-09-25 | Guardant Health Inc. | Systems and methods to detect rare mutations and copy number variation |
| US20150031559A1 (en) | 2010-09-21 | 2015-01-29 | Population Genetics Technologies Ltd | Increased Confidence of Allele Calls with Molecular Counting |
| US20150044191A1 (en) * | 2013-08-09 | 2015-02-12 | President And Fellows Of Harvard College | Methods for identifying a target site of a cas9 nuclease |
| US20150066385A1 (en) | 2013-08-30 | 2015-03-05 | 10X Technologies, Inc. | Sequencing methods |
| US20150087535A1 (en) | 2012-03-13 | 2015-03-26 | Abhijit Ajit Patel | Measurement of nucleic acid variants using highly-multiplexed error-suppressed deep sequencing |
| WO2015164432A1 (en) | 2014-04-21 | 2015-10-29 | Natera, Inc. | Detecting mutations and ploidy in chromosomal segments |
| US9598731B2 (en) | 2012-09-04 | 2017-03-21 | Guardant Health, Inc. | Systems and methods to detect rare mutations and copy number variation |
| US20170204459A1 (en) | 2014-06-06 | 2017-07-20 | Cornell University | Method for identification and enumeration of nucleic acid sequence, expression, copy, or dna methylation changes, using combined nuclease, ligase, polymerase, and sequencing reactions |
| US20180251848A1 (en) | 2014-09-12 | 2018-09-06 | The Board Of Trustees Of The Leland Stanford Junior University | Identification and use of circulating nucleic acids |
Family Cites Families (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2008512129A (ja) * | 2004-09-10 | 2008-04-24 | Sequenom, Inc. | Methods for long-range sequence analysis of nucleic acids |
| US8486630B2 (en) * | 2008-11-07 | 2013-07-16 | Industrial Technology Research Institute | Methods for accurate sequence data and modified base position determination |
| CA2867489A1 (en) * | 2012-03-30 | 2013-10-03 | Pacific Biosciences Of California, Inc. | Methods and composition for sequencing modified nucleic acids |
| US9092401B2 (en) * | 2012-10-31 | 2015-07-28 | Counsyl, Inc. | System and methods for detecting genetic variation |
| CA2905429A1 (en) * | 2013-03-14 | 2014-10-02 | Abbott Molecular Inc. | Minimizing errors using uracil-dna-n-glycosylase |
| GB201502374D0 (en) * | 2015-02-13 | 2015-04-01 | Prokyma Technologies Ltd | Method and apparatus relating to treatment of a blood sample for sequencing of circulating tumour cells |
| WO2016149261A1 (en) * | 2015-03-16 | 2016-09-22 | Personal Genome Diagnostics, Inc. | Systems and methods for analyzing nucleic acid |
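| JP6675164B2 (ja) * | 2015-07-28 | 2020-04-01 | Riken Genesis Co., Ltd. | Mutation determination method, mutation determination program, and recording medium |
| CN116640847A (zh) * | 2016-02-02 | 2023-08-25 | Guardant Health, Inc. | Cancer evolution detection and diagnosis |
| CN109511265B (zh) * | 2016-05-16 | 2023-07-14 | Accuragen Holdings Limited | Method of improved sequencing by strand identification |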
2018
- 2018-11-02: WO application PCT/US2018/059056, published as WO2019090147A1 (en), status: Ceased
- 2018-11-02: CA application CA3079252A, published as CA3079252A1 (en), status: Pending
- 2018-11-02: CN application CN201880085431.9A, published as CN111542616A (zh), status: Pending
- 2018-11-02: EP application EP18874697.8A, published as EP3704265A4 (en), status: Pending
- 2018-11-02: JP application JP2020524480A, published as JP7304852B2 (ja), status: Active
2020
- 2020-05-04: US application US16/866,252, published as US11008616B2 (en), status: Active
2021
- 2021-03-23: US application US17/210,202, published as US11718873B2 (en), status: Active
2023
- 2023-03-01: JP application JP2023030896A, published as JP7606554B2 (ja), status: Active
- 2023-06-16: US application US18/336,281, published as US20240141425A1 (en), status: Pending
2024
- 2024-12-13: JP application JP2024218936A, published as JP2025028203A (ja), status: Pending
Patent Citations (16)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20010053519A1 (en) | 1990-12-06 | 2001-12-20 | Fodor Stephen P.A. | Oligonucleotides |
| US6582908B2 (en) | 1990-12-06 | 2003-06-24 | Affymetrix, Inc. | Oligonucleotides |
| US6630144B1 (en) * | 1999-08-30 | 2003-10-07 | The United States Of America As Represented By The Secretary Of The Army | Monoclonal antibodies to Ebola glycoprotein |
| US7537898B2 (en) | 2001-11-28 | 2009-05-26 | Applied Biosystems, Llc | Compositions and methods of selective nucleic acid isolation |
| US20110160078A1 (en) | 2009-12-15 | 2011-06-30 | Affymetrix, Inc. | Digital Counting of Individual Molecules by Stochastic Attachment of Diverse Labels |
| US20150031559A1 (en) | 2010-09-21 | 2015-01-29 | Population Genetics Technologies Ltd | Increased Confidence of Allele Calls with Molecular Counting |
| US20150087535A1 (en) | 2012-03-13 | 2015-03-26 | Abhijit Ajit Patel | Measurement of nucleic acid variants using highly-multiplexed error-suppressed deep sequencing |
| WO2013142389A1 (en) | 2012-03-20 | 2013-09-26 | University Of Washington Through Its Center For Commercialization | Methods of lowering the error rate of massively parallel dna sequencing using duplex consensus sequencing |
| WO2013191637A1 (en) | 2012-06-19 | 2013-12-27 | Sjoeblom Tobias | Method and device for efficient calculation of allele ratio confidence intervals and uses thereof |
| US9598731B2 (en) | 2012-09-04 | 2017-03-21 | Guardant Health, Inc. | Systems and methods to detect rare mutations and copy number variation |
| WO2014149134A2 (en) | 2013-03-15 | 2014-09-25 | Guardant Health Inc. | Systems and methods to detect rare mutations and copy number variation |
| US20150044191A1 (en) * | 2013-08-09 | 2015-02-12 | President And Fellows Of Harvard College | Methods for identifying a target site of a cas9 nuclease |
| US20150066385A1 (en) | 2013-08-30 | 2015-03-05 | 10X Technologies, Inc. | Sequencing methods |
| WO2015164432A1 (en) | 2014-04-21 | 2015-10-29 | Natera, Inc. | Detecting mutations and ploidy in chromosomal segments |
| US20170204459A1 (en) | 2014-06-06 | 2017-07-20 | Cornell University | Method for identification and enumeration of nucleic acid sequence, expression, copy, or dna methylation changes, using combined nuclease, ligase, polymerase, and sequencing reactions |
| US20180251848A1 (en) | 2014-09-12 | 2018-09-06 | The Board Of Trustees Of The Leland Stanford Junior University | Identification and use of circulating nucleic acids |
Non-Patent Citations (13)
| Title |
|---|
| Akre et al. "Mutation Processes in 293-Based Clones Overexpressing the DNA Cytosine Deaminase APOBEC3B," PLoS One, May 10, 2016, vol. 11, No. 5, e0155391, pp. 1-17. entire document. |
| Arbeithuber et al. DNA Research. 2016. 23(6):547-559. (Year: 2016). * |
| Cannistraro, V.J. et al. "Rapid Deamination of Cyclobutane Pyrimidine Dimer Photoproducts at TCG Sites in a Translationally and Rotationally Positioned Nucleosome in Vivo" J. Biol. Chem. (2015) 290(44):26597-26609. |
| Chen et al. "DNA damage is a major cause of sequencing errors, directly confounding variant identification," bioRxiv, Aug. 23, 2016, pp. 1-30. |
| Clark, T.A. et al. "Analytical Validation of a Hybrid Capture Based Next-Generation Sequencing Clinical Assay for Genomic Profiling of Cell-Free Circulating Tumor DNA," J. Mol. Diagnostics (2018) 20(5):686-702. |
| International search report and written opinion dated Jan. 17, 2019 for PCT/US2018/059056. |
| Ma et al. Genome Biology. 2019. 20:50. (Year: 2019). * |
| Newman, et al. An ultrasensitive method for quantitating circulating tumor DNA with broad patient coverage. Nat Med. May 2014;20(5):548-54. doi: 10.1038/nm.3519. Epub Apr. 6, 2014. |
| Paweletz, C.P. et al. "Bias-corrected targeted next-generation sequencing for rapid, multiplexed detection of actionable alterations in cell-free DNA from advanced lung cancer patients" Clin Canc Res (2016) 22(4):915-922. |
| Phallen, J. et al. "Direct detection of early-stage cancers using circulating tumor DNA" Sci Trans Med (2017) vol. 9, Issue 403, eaan2415. DOI: 10.1126/scitranslmed.aan2415. |
| Salk et al. Nat Rev Genet. 2018. 19(5):269-285. (Year: 2018). * |
| Siravegna, G. et al. "Integrating liquid biopsies into the management of cancer" Nature Reviews Clinical Oncology (2017)14:531-548. |
| Sloan et al. Trends Biotechnol. 2018. 36(7):729-740. (Year: 2018). * |
Also Published As
| Publication number | Publication date |
|---|---|
| EP3704265A1 (en) | 2020-09-09 |
| US11718873B2 (en) | 2023-08-08 |
| CN111542616A (zh) | 2020-08-14 |
| JP7606554B2 (ja) | 2024-12-25 |
| JP2025028203A (ja) | 2025-02-28 |
| US20240141425A1 (en) | 2024-05-02 |
| CA3079252A1 (en) | 2019-05-09 |
| US20210395816A1 (en) | 2021-12-23 |
| US20200377941A1 (en) | 2020-12-03 |
| JP7304852B2 (ja) | 2023-07-07 |
| WO2019090147A1 (en) | 2019-05-09 |
| EP3704265A4 (en) | 2021-09-29 |
| JP2021502072A (ja) | 2021-01-28 |
| JP2023060046A (ja) | 2023-04-27 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| US12385097B2 (en) | Normalizing tumor mutation burden | |
| US12106825B2 (en) | Computational modeling of loss of function based on allelic frequency | |
| US12567492B1 (en) | Genetic variant detection based on merged and unmerged reads | |
| JP2024056984A (ja) | Methods, compositions and systems for calibrating epigenetic partitioning assays | |
| US20240141425A1 (en) | Correcting for deamination-induced sequence errors | |
| US20200075124A1 (en) | Methods and systems for detecting allelic imbalance in cell-free nucleic acid samples | |
| US20220068433A1 (en) | Computational detection of copy number variation at a locus in the absence of direct measurement of the locus | |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | FEPP | Fee payment procedure | Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
| | AS | Assignment | Owner name: GUARDANT HEALTH, INC., CALIFORNIA. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:SIKORA, MARCIN;KENNEDY, ANDREW;JAIMOVICH, ARIEL;AND OTHERS;SIGNING DATES FROM 20181214 TO 20190417;REEL/FRAME:053175/0100 |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: FINAL REJECTION MAILED |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE AFTER FINAL ACTION FORWARDED TO EXAMINER |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT RECEIVED |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED |
| | STCF | Information on status: patent grant | Free format text: PATENTED CASE |
| | MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY. Year of fee payment: 4 |