US20240087680A1 - Methods for multi-resolution analysis of cell-free nucleic acids - Google Patents
Methods for multi-resolution analysis of cell-free nucleic acids
- Publication number
- US20240087680A1 (application US18/503,392)
- Authority
- US
- United States
- Prior art keywords
- bait
- regions
- sample
- genomic
- bait set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Concepts (machine-extracted; each entry lists concept ID, name, type, query match, sections, count)
- 150000007523 nucleic acids Chemical class 0.000 title claims abstract description 169
- 102000039446 nucleic acids Human genes 0.000 title claims abstract description 159
- 108020004707 nucleic acids Proteins 0.000 title claims abstract description 159
- 238000000034 method Methods 0.000 title claims abstract description 99
- 238000004458 analytical method Methods 0.000 title claims description 20
- 108010047956 Nucleosomes Proteins 0.000 claims abstract description 63
- 210000001623 nucleosome Anatomy 0.000 claims abstract description 63
- 201000010099 disease Diseases 0.000 claims abstract description 35
- 208000037265 diseases, disorders, signs and symptoms Diseases 0.000 claims abstract description 35
- 210000004027 cell Anatomy 0.000 claims abstract description 24
- 210000001519 tissue Anatomy 0.000 claims abstract description 21
- 238000012163 sequencing technique Methods 0.000 claims description 101
- 230000035772 mutation Effects 0.000 claims description 93
- 102000053602 DNA Human genes 0.000 claims description 78
- 108020004414 DNA Proteins 0.000 claims description 78
- 239000000203 mixture Substances 0.000 claims description 44
- 230000002068 genetic effect Effects 0.000 claims description 35
- 238000000954 titration curve Methods 0.000 claims description 24
- 238000003556 assay Methods 0.000 claims description 18
- 108091093088 Amplicon Proteins 0.000 claims description 14
- 239000002773 nucleotide Substances 0.000 claims description 11
- 125000003729 nucleotide group Chemical group 0.000 claims description 9
- 108091035707 Consensus sequence Proteins 0.000 claims description 7
- 102200048928 rs121434568 Human genes 0.000 claims description 6
- 210000004369 blood Anatomy 0.000 claims description 5
- 239000008280 blood Substances 0.000 claims description 5
- 210000002966 serum Anatomy 0.000 claims description 5
- 238000002560 therapeutic procedure Methods 0.000 claims description 5
- 206010058467 Lung neoplasm malignant Diseases 0.000 claims description 4
- 102000052116 epidermal growth factor receptor activity proteins Human genes 0.000 claims description 4
- 108700015053 epidermal growth factor receptor activity proteins Proteins 0.000 claims description 4
- 201000005202 lung cancer Diseases 0.000 claims description 4
- 208000020816 lung neoplasm Diseases 0.000 claims description 4
- YOHYSYJDKVYCJI-UHFFFAOYSA-N n-[3-[[6-[3-(trifluoromethyl)anilino]pyrimidin-4-yl]amino]phenyl]cyclopropanecarboxamide Chemical compound FC(F)(F)C1=CC=CC(NC=2N=CN=C(NC=3C=C(NC(=O)C4CC4)C=CC=3)C=2)=C1 YOHYSYJDKVYCJI-UHFFFAOYSA-N 0.000 claims description 4
- 229940121358 tyrosine kinase inhibitor Drugs 0.000 claims description 4
- 239000005483 tyrosine kinase inhibitor Substances 0.000 claims description 4
- 150000004917 tyrosine kinase inhibitor derivatives Chemical class 0.000 claims description 4
- 208000000649 small cell carcinoma Diseases 0.000 claims description 3
- 239000000523 sample Substances 0.000 description 153
- 230000006870 function Effects 0.000 description 61
- 206010028980 Neoplasm Diseases 0.000 description 50
- 108700028369 Alleles Proteins 0.000 description 32
- 230000035945 sensitivity Effects 0.000 description 31
- 201000011510 cancer Diseases 0.000 description 30
- 238000001514 detection method Methods 0.000 description 28
- 102000040430 polynucleotide Human genes 0.000 description 22
- 108091033319 polynucleotide Proteins 0.000 description 22
- 238000003780 insertion Methods 0.000 description 21
- 230000037431 insertion Effects 0.000 description 21
- 230000015654 memory Effects 0.000 description 21
- 239000002157 polynucleotide Substances 0.000 description 21
- 108090000623 proteins and genes Proteins 0.000 description 21
- 238000012217 deletion Methods 0.000 description 19
- 230000037430 deletion Effects 0.000 description 19
- 238000012360 testing method Methods 0.000 description 19
- 238000003860 storage Methods 0.000 description 18
- 230000004927 fusion Effects 0.000 description 17
- 206010069754 Acquired gene mutation Diseases 0.000 description 16
- 230000037439 somatic mutation Effects 0.000 description 16
- 238000004891 communication Methods 0.000 description 15
- 230000003321 amplification Effects 0.000 description 14
- 238000004422 calculation algorithm Methods 0.000 description 14
- 230000037437 driver mutation Effects 0.000 description 14
- 238000003199 nucleic acid amplification method Methods 0.000 description 14
- 238000013459 approach Methods 0.000 description 12
- 102100025064 Cellular tumor antigen p53 Human genes 0.000 description 10
- 108010078814 Tumor Suppressor Protein p53 Proteins 0.000 description 10
- 238000012544 monitoring process Methods 0.000 description 10
- 102100030708 GTPase KRas Human genes 0.000 description 8
- 230000008569 process Effects 0.000 description 8
- 230000001225 therapeutic effect Effects 0.000 description 8
- 238000013461 design Methods 0.000 description 7
- 229920002477 rna polymer Polymers 0.000 description 7
- 230000000392 somatic effect Effects 0.000 description 7
- 108700024394 Exon Proteins 0.000 description 6
- 108091028043 Nucleic acid sequence Proteins 0.000 description 6
- 210000001124 body fluid Anatomy 0.000 description 6
- 238000002474 experimental method Methods 0.000 description 6
- 239000012530 fluid Substances 0.000 description 6
- 239000012634 fragment Substances 0.000 description 6
- 238000005457 optimization Methods 0.000 description 6
- 238000000638 solvent extraction Methods 0.000 description 6
- 238000007619 statistical method Methods 0.000 description 6
- 238000004448 titration Methods 0.000 description 6
- 102100023600 Fibroblast growth factor receptor 2 Human genes 0.000 description 5
- 101710182389 Fibroblast growth factor receptor 2 Proteins 0.000 description 5
- 101001126417 Homo sapiens Platelet-derived growth factor receptor alpha Proteins 0.000 description 5
- 101000686031 Homo sapiens Proto-oncogene tyrosine-protein kinase ROS Proteins 0.000 description 5
- 101000984753 Homo sapiens Serine/threonine-protein kinase B-raf Proteins 0.000 description 5
- 238000007476 Maximum Likelihood Methods 0.000 description 5
- 102100030485 Platelet-derived growth factor receptor alpha Human genes 0.000 description 5
- 102100023347 Proto-oncogene tyrosine-protein kinase ROS Human genes 0.000 description 5
- 102100027103 Serine/threonine-protein kinase B-raf Human genes 0.000 description 5
- 239000012491 analyte Substances 0.000 description 5
- 230000004069 differentiation Effects 0.000 description 5
- 238000013507 mapping Methods 0.000 description 5
- 230000003287 optical effect Effects 0.000 description 5
- 230000008707 rearrangement Effects 0.000 description 5
- 102100039788 GTPase NRas Human genes 0.000 description 4
- 101000967216 Homo sapiens Eosinophil cationic protein Proteins 0.000 description 4
- 101000744505 Homo sapiens GTPase NRas Proteins 0.000 description 4
- 101000605639 Homo sapiens Phosphatidylinositol 4,5-bisphosphate 3-kinase catalytic subunit alpha isoform Proteins 0.000 description 4
- 101000712530 Homo sapiens RAF proto-oncogene serine/threonine-protein kinase Proteins 0.000 description 4
- 241001465754 Metazoa Species 0.000 description 4
- 102100025725 Mothers against decapentaplegic homolog 4 Human genes 0.000 description 4
- 101710143112 Mothers against decapentaplegic homolog 4 Proteins 0.000 description 4
- 102100038332 Phosphatidylinositol 4,5-bisphosphate 3-kinase catalytic subunit alpha isoform Human genes 0.000 description 4
- 102100033479 RAF proto-oncogene serine/threonine-protein kinase Human genes 0.000 description 4
- 201000000582 Retinoblastoma Diseases 0.000 description 4
- 230000004075 alteration Effects 0.000 description 4
- 230000005540 biological transmission Effects 0.000 description 4
- 210000000349 chromosome Anatomy 0.000 description 4
- 210000003722 extracellular fluid Anatomy 0.000 description 4
- 238000011528 liquid biopsy Methods 0.000 description 4
- 239000003550 marker Substances 0.000 description 4
- 238000012545 processing Methods 0.000 description 4
- 230000002441 reversible effect Effects 0.000 description 4
- 102100034580 AT-rich interactive domain-containing protein 1A Human genes 0.000 description 3
- -1 BRCA Proteins 0.000 description 3
- 102000036365 BRCA1 Human genes 0.000 description 3
- 108700020463 BRCA1 Proteins 0.000 description 3
- 101150072950 BRCA1 gene Proteins 0.000 description 3
- 102000052609 BRCA2 Human genes 0.000 description 3
- 108700020462 BRCA2 Proteins 0.000 description 3
- 101001042041 Bos taurus Isocitrate dehydrogenase [NAD] subunit beta, mitochondrial Proteins 0.000 description 3
- 101150008921 Brca2 gene Proteins 0.000 description 3
- 102100028914 Catenin beta-1 Human genes 0.000 description 3
- 102100037858 G1/S-specific cyclin-E1 Human genes 0.000 description 3
- 102100032610 Guanine nucleotide-binding protein G(s) subunit alpha isoforms XLas Human genes 0.000 description 3
- 101000924266 Homo sapiens AT-rich interactive domain-containing protein 1A Proteins 0.000 description 3
- 101000916173 Homo sapiens Catenin beta-1 Proteins 0.000 description 3
- 101000738568 Homo sapiens G1/S-specific cyclin-E1 Proteins 0.000 description 3
- 101001014590 Homo sapiens Guanine nucleotide-binding protein G(s) subunit alpha isoforms XLas Proteins 0.000 description 3
- 101001014594 Homo sapiens Guanine nucleotide-binding protein G(s) subunit alpha isoforms short Proteins 0.000 description 3
- 101000960234 Homo sapiens Isocitrate dehydrogenase [NADP] cytoplasmic Proteins 0.000 description 3
- 101000599886 Homo sapiens Isocitrate dehydrogenase [NADP], mitochondrial Proteins 0.000 description 3
- 101001014610 Homo sapiens Neuroendocrine secretory protein 55 Proteins 0.000 description 3
- 101000797903 Homo sapiens Protein ALEX Proteins 0.000 description 3
- 101000628562 Homo sapiens Serine/threonine-protein kinase STK11 Proteins 0.000 description 3
- 101000819111 Homo sapiens Trans-acting T-cell-specific transcription factor GATA-3 Proteins 0.000 description 3
- 102100039905 Isocitrate dehydrogenase [NADP] cytoplasmic Human genes 0.000 description 3
- 102100037845 Isocitrate dehydrogenase [NADP], mitochondrial Human genes 0.000 description 3
- 108091092878 Microsatellite Proteins 0.000 description 3
- 102100030086 Receptor tyrosine-protein kinase erbB-2 Human genes 0.000 description 3
- 101100495925 Schizosaccharomyces pombe (strain 972 / ATCC 24843) chr3 gene Proteins 0.000 description 3
- 102100026715 Serine/threonine-protein kinase STK11 Human genes 0.000 description 3
- 102100021386 Trans-acting T-cell-specific transcription factor GATA-3 Human genes 0.000 description 3
- 230000006907 apoptotic process Effects 0.000 description 3
- 238000001574 biopsy Methods 0.000 description 3
- 230000000295 complement effect Effects 0.000 description 3
- 238000013500 data storage Methods 0.000 description 3
- 210000004602 germ cell Anatomy 0.000 description 3
- 230000011987 methylation Effects 0.000 description 3
- 238000007069 methylation reaction Methods 0.000 description 3
- 210000002381 plasma Anatomy 0.000 description 3
- 102000054765 polymorphisms of proteins Human genes 0.000 description 3
- 230000010076 replication Effects 0.000 description 3
- 210000003296 saliva Anatomy 0.000 description 3
- 239000013077 target material Substances 0.000 description 3
- 230000005945 translocation Effects 0.000 description 3
- KDCGOANMDULRCW-UHFFFAOYSA-N 7H-purine Chemical compound N1=CNC2=NC=NC2=C1 KDCGOANMDULRCW-UHFFFAOYSA-N 0.000 description 2
- 206010003445 Ascites Diseases 0.000 description 2
- 102000038594 Cdh1/Fizzy-related Human genes 0.000 description 2
- 108091007854 Cdh1/Fizzy-related Proteins 0.000 description 2
- 108091026890 Coding region Proteins 0.000 description 2
- 206010009944 Colon cancer Diseases 0.000 description 2
- 108020004635 Complementary DNA Proteins 0.000 description 2
- 108010058546 Cyclin D1 Proteins 0.000 description 2
- 108010025464 Cyclin-Dependent Kinase 4 Proteins 0.000 description 2
- 108010009392 Cyclin-Dependent Kinase Inhibitor p16 Proteins 0.000 description 2
- 102100036252 Cyclin-dependent kinase 4 Human genes 0.000 description 2
- 102100024458 Cyclin-dependent kinase inhibitor 2A Human genes 0.000 description 2
- 102100031480 Dual specificity mitogen-activated protein kinase kinase 1 Human genes 0.000 description 2
- 102100023266 Dual specificity mitogen-activated protein kinase kinase 2 Human genes 0.000 description 2
- 101710182386 Fibroblast growth factor receptor 1 Proteins 0.000 description 2
- 102100027842 Fibroblast growth factor receptor 3 Human genes 0.000 description 2
- 101710182396 Fibroblast growth factor receptor 3 Proteins 0.000 description 2
- 102100024165 G1/S-specific cyclin-D1 Human genes 0.000 description 2
- 102100027541 GTP-binding protein Rheb Human genes 0.000 description 2
- 102100025477 GTP-binding protein Rit1 Human genes 0.000 description 2
- 102100029974 GTPase HRas Human genes 0.000 description 2
- 102100022057 Hepatocyte nuclear factor 1-alpha Human genes 0.000 description 2
- 102100035108 High affinity nerve growth factor receptor Human genes 0.000 description 2
- 101000574654 Homo sapiens GTP-binding protein Rit1 Proteins 0.000 description 2
- 101000584633 Homo sapiens GTPase HRas Proteins 0.000 description 2
- 101001045751 Homo sapiens Hepatocyte nuclear factor 1-alpha Proteins 0.000 description 2
- 101000596894 Homo sapiens High affinity nerve growth factor receptor Proteins 0.000 description 2
- 101001109719 Homo sapiens Nucleophosmin Proteins 0.000 description 2
- 101000779418 Homo sapiens RAC-alpha serine/threonine-protein kinase Proteins 0.000 description 2
- 101000771237 Homo sapiens Serine/threonine-protein kinase A-Raf Proteins 0.000 description 2
- 101000997832 Homo sapiens Tyrosine-protein kinase JAK2 Proteins 0.000 description 2
- 101000934996 Homo sapiens Tyrosine-protein kinase JAK3 Proteins 0.000 description 2
- 101001087416 Homo sapiens Tyrosine-protein phosphatase non-receptor type 11 Proteins 0.000 description 2
- 108010068342 MAP Kinase Kinase 1 Proteins 0.000 description 2
- 108010068353 MAP Kinase Kinase 2 Proteins 0.000 description 2
- 208000000172 Medulloblastoma Diseases 0.000 description 2
- ZYTPOUNUXRBYGW-YUMQZZPRSA-N Met-Met Chemical compound CSCC[C@H]([NH3+])C(=O)N[C@H](C([O-])=O)CCSC ZYTPOUNUXRBYGW-YUMQZZPRSA-N 0.000 description 2
- 208000003445 Mouth Neoplasms Diseases 0.000 description 2
- 101150097381 Mtor gene Proteins 0.000 description 2
- 102100022678 Nucleophosmin Human genes 0.000 description 2
- 102000014160 PTEN Phosphohydrolase Human genes 0.000 description 2
- 108010011536 PTEN Phosphohydrolase Proteins 0.000 description 2
- 206010036790 Productive cough Diseases 0.000 description 2
- 102100033810 RAC-alpha serine/threonine-protein kinase Human genes 0.000 description 2
- 101150020518 RHEB gene Proteins 0.000 description 2
- 101150111584 RHOA gene Proteins 0.000 description 2
- 102100029437 Serine/threonine-protein kinase A-Raf Human genes 0.000 description 2
- 102100023085 Serine/threonine-protein kinase mTOR Human genes 0.000 description 2
- 102100022387 Transforming protein RhoA Human genes 0.000 description 2
- 102100033444 Tyrosine-protein kinase JAK2 Human genes 0.000 description 2
- 102100025387 Tyrosine-protein kinase JAK3 Human genes 0.000 description 2
- 102100033019 Tyrosine-protein phosphatase non-receptor type 11 Human genes 0.000 description 2
- ISAKRJDGNUQOIC-UHFFFAOYSA-N Uracil Chemical compound O=C1C=CNC(=O)N1 ISAKRJDGNUQOIC-UHFFFAOYSA-N 0.000 description 2
- OIRDTQYFTABQOQ-KQYNXXCUSA-N adenosine group Chemical group [C@@H]1([C@H](O)[C@H](O)[C@@H](CO)O1)N1C=NC=2C(N)=NC=NC12 OIRDTQYFTABQOQ-KQYNXXCUSA-N 0.000 description 2
- 230000008901 benefit Effects 0.000 description 2
- 210000001772 blood platelet Anatomy 0.000 description 2
- 210000001185 bone marrow Anatomy 0.000 description 2
- 238000010804 cDNA synthesis Methods 0.000 description 2
- 210000001175 cerebrospinal fluid Anatomy 0.000 description 2
- 230000008859 change Effects 0.000 description 2
- 239000002299 complementary DNA Substances 0.000 description 2
- OPTASPLRGRRNAP-UHFFFAOYSA-N cytosine Chemical compound NC=1C=CNC(=O)N=1 OPTASPLRGRRNAP-UHFFFAOYSA-N 0.000 description 2
- 238000007405 data analysis Methods 0.000 description 2
- 230000007423 decrease Effects 0.000 description 2
- 238000009826 distribution Methods 0.000 description 2
- 229940079593 drug Drugs 0.000 description 2
- 239000003814 drug Substances 0.000 description 2
- 239000000835 fiber Substances 0.000 description 2
- 230000014509 gene expression Effects 0.000 description 2
- 210000003731 gingival crevicular fluid Anatomy 0.000 description 2
- UYTPUPDQBNUYGX-UHFFFAOYSA-N guanine Chemical compound O=C1NC(N)=NC2=C1N=CN2 UYTPUPDQBNUYGX-UHFFFAOYSA-N 0.000 description 2
- 238000009396 hybridization Methods 0.000 description 2
- 210000004185 liver Anatomy 0.000 description 2
- 230000001926 lymphatic effect Effects 0.000 description 2
- 239000000463 material Substances 0.000 description 2
- 201000001441 melanoma Diseases 0.000 description 2
- 108010085203 methionylmethionine Proteins 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 201000005962 mycosis fungoides Diseases 0.000 description 2
- 230000017074 necrotic cell death Effects 0.000 description 2
- 230000001338 necrotic effect Effects 0.000 description 2
- 210000000056 organ Anatomy 0.000 description 2
- 201000008968 osteosarcoma Diseases 0.000 description 2
- 239000013610 patient sample Substances 0.000 description 2
- 230000002093 peripheral effect Effects 0.000 description 2
- 102200048955 rs121434569 Human genes 0.000 description 2
- 229920006395 saturated elastomer Polymers 0.000 description 2
- 210000000582 semen Anatomy 0.000 description 2
- 241000894007 species Species 0.000 description 2
- 210000003802 sputum Anatomy 0.000 description 2
- 208000024794 sputum Diseases 0.000 description 2
- 238000006467 substitution reaction Methods 0.000 description 2
- 210000004243 sweat Anatomy 0.000 description 2
- 210000001179 synovial fluid Anatomy 0.000 description 2
- RWQNBRDOKXIBIV-UHFFFAOYSA-N thymine Chemical compound CC1=CNC(=O)NC1=O RWQNBRDOKXIBIV-UHFFFAOYSA-N 0.000 description 2
- 210000002700 urine Anatomy 0.000 description 2
- 208000024893 Acute lymphoblastic leukemia Diseases 0.000 description 1
- 208000031261 Acute myeloid leukaemia Diseases 0.000 description 1
- 206010061424 Anal cancer Diseases 0.000 description 1
- 208000007860 Anus Neoplasms Diseases 0.000 description 1
- 241000271566 Aves Species 0.000 description 1
- 208000010839 B-cell chronic lymphocytic leukemia Diseases 0.000 description 1
- 208000032791 BCR-ABL1 positive chronic myelogenous leukemia Diseases 0.000 description 1
- 206010004146 Basal cell carcinoma Diseases 0.000 description 1
- 206010004593 Bile duct cancer Diseases 0.000 description 1
- 206010005003 Bladder cancer Diseases 0.000 description 1
- 206010005949 Bone cancer Diseases 0.000 description 1
- 208000018084 Bone neoplasm Diseases 0.000 description 1
- 208000003174 Brain Neoplasms Diseases 0.000 description 1
- 206010006143 Brain stem glioma Diseases 0.000 description 1
- 206010006187 Breast cancer Diseases 0.000 description 1
- 208000026310 Breast neoplasm Diseases 0.000 description 1
- 208000011691 Burkitt lymphomas Diseases 0.000 description 1
- 239000002126 C01EB10 - Adenosine Substances 0.000 description 1
- 206010007275 Carcinoid tumour Diseases 0.000 description 1
- 206010008342 Cervix carcinoma Diseases 0.000 description 1
- 201000009047 Chordoma Diseases 0.000 description 1
- 206010065163 Clonal evolution Diseases 0.000 description 1
- 208000001333 Colorectal Neoplasms Diseases 0.000 description 1
- RYGMFSIKBFXOCR-UHFFFAOYSA-N Copper Chemical compound [Cu] RYGMFSIKBFXOCR-UHFFFAOYSA-N 0.000 description 1
- 208000009798 Craniopharyngioma Diseases 0.000 description 1
- 108010025468 Cyclin-Dependent Kinase 6 Proteins 0.000 description 1
- 102100026804 Cyclin-dependent kinase 6 Human genes 0.000 description 1
- 230000007067 DNA methylation Effects 0.000 description 1
- 230000007018 DNA scission Effects 0.000 description 1
- 206010014733 Endometrial cancer Diseases 0.000 description 1
- 206010014759 Endometrial neoplasm Diseases 0.000 description 1
- 201000008228 Ependymoblastoma Diseases 0.000 description 1
- 206010014967 Ependymoma Diseases 0.000 description 1
- 206010014968 Ependymoma malignant Diseases 0.000 description 1
- 208000000461 Esophageal Neoplasms Diseases 0.000 description 1
- 102100038595 Estrogen receptor Human genes 0.000 description 1
- 208000006168 Ewing Sarcoma Diseases 0.000 description 1
- 206010053717 Fibrous histiocytoma Diseases 0.000 description 1
- 208000022072 Gallbladder Neoplasms Diseases 0.000 description 1
- 208000032612 Glial tumor Diseases 0.000 description 1
- 206010018338 Glioma Diseases 0.000 description 1
- 208000017604 Hodgkin disease Diseases 0.000 description 1
- 208000021519 Hodgkin lymphoma Diseases 0.000 description 1
- 208000010747 Hodgkins lymphoma Diseases 0.000 description 1
- 101000882584 Homo sapiens Estrogen receptor Proteins 0.000 description 1
- 101000584612 Homo sapiens GTPase KRas Proteins 0.000 description 1
- 101001052493 Homo sapiens Mitogen-activated protein kinase 1 Proteins 0.000 description 1
- 101001052490 Homo sapiens Mitogen-activated protein kinase 3 Proteins 0.000 description 1
- 101001012157 Homo sapiens Receptor tyrosine-protein kinase erbB-2 Proteins 0.000 description 1
- 206010021042 Hypopharyngeal cancer Diseases 0.000 description 1
- 206010056305 Hypopharyngeal neoplasm Diseases 0.000 description 1
- 206010061218 Inflammation Diseases 0.000 description 1
- 208000037396 Intraductal Noninfiltrating Carcinoma Diseases 0.000 description 1
- 206010073094 Intraductal proliferative breast lesion Diseases 0.000 description 1
- 206010061252 Intraocular melanoma Diseases 0.000 description 1
- 208000007766 Kaposi sarcoma Diseases 0.000 description 1
- 208000008839 Kidney Neoplasms Diseases 0.000 description 1
- 206010023825 Laryngeal cancer Diseases 0.000 description 1
- 206010061523 Lip and/or oral cavity cancer Diseases 0.000 description 1
- 206010062038 Lip neoplasm Diseases 0.000 description 1
- 208000006644 Malignant Fibrous Histiocytoma Diseases 0.000 description 1
- 208000032271 Malignant tumor of penis Diseases 0.000 description 1
- 241000124008 Mammalia Species 0.000 description 1
- 102100024193 Mitogen-activated protein kinase 1 Human genes 0.000 description 1
- 102100024192 Mitogen-activated protein kinase 3 Human genes 0.000 description 1
- 208000034578 Multiple myelomas Diseases 0.000 description 1
- 201000003793 Myelodysplastic syndrome Diseases 0.000 description 1
- 102100029166 NT-3 growth factor receptor Human genes 0.000 description 1
- 206010028729 Nasal cavity cancer Diseases 0.000 description 1
- 206010028767 Nasal sinus cancer Diseases 0.000 description 1
- 208000001894 Nasopharyngeal Neoplasms Diseases 0.000 description 1
- 206010061306 Nasopharyngeal cancer Diseases 0.000 description 1
- 206010029260 Neuroblastoma Diseases 0.000 description 1
- 208000015914 Non-Hodgkin lymphomas Diseases 0.000 description 1
- 206010030155 Oesophageal carcinoma Diseases 0.000 description 1
- 108091034117 Oligonucleotide Proteins 0.000 description 1
- 206010031096 Oropharyngeal cancer Diseases 0.000 description 1
- 206010057444 Oropharyngeal neoplasm Diseases 0.000 description 1
- 206010033128 Ovarian cancer Diseases 0.000 description 1
- 206010061535 Ovarian neoplasm Diseases 0.000 description 1
- 206010061902 Pancreatic neoplasm Diseases 0.000 description 1
- 206010061332 Paraganglion neoplasm Diseases 0.000 description 1
- 208000003937 Paranasal Sinus Neoplasms Diseases 0.000 description 1
- 208000000821 Parathyroid Neoplasms Diseases 0.000 description 1
- 208000002471 Penile Neoplasms Diseases 0.000 description 1
- 206010034299 Penile cancer Diseases 0.000 description 1
- 208000009565 Pharyngeal Neoplasms Diseases 0.000 description 1
- 206010034811 Pharyngeal cancer Diseases 0.000 description 1
- 208000007641 Pinealoma Diseases 0.000 description 1
- 208000007913 Pituitary Neoplasms Diseases 0.000 description 1
- 206010035226 Plasma cell myeloma Diseases 0.000 description 1
- 241000288906 Primates Species 0.000 description 1
- 206010060862 Prostate cancer Diseases 0.000 description 1
- 208000000236 Prostatic Neoplasms Diseases 0.000 description 1
- CZPWVGJYEJSRLH-UHFFFAOYSA-N Pyrimidine Chemical compound C1=CN=CN=C1 CZPWVGJYEJSRLH-UHFFFAOYSA-N 0.000 description 1
- 208000015634 Rectal Neoplasms Diseases 0.000 description 1
- 206010038389 Renal cancer Diseases 0.000 description 1
- 208000006265 Renal cell carcinoma Diseases 0.000 description 1
- 208000004337 Salivary Gland Neoplasms Diseases 0.000 description 1
- 206010061934 Salivary gland cancer Diseases 0.000 description 1
- 206010039491 Sarcoma Diseases 0.000 description 1
- 208000009359 Sezary Syndrome Diseases 0.000 description 1
- 208000021388 Sezary disease Diseases 0.000 description 1
- 208000000453 Skin Neoplasms Diseases 0.000 description 1
- 208000021712 Soft tissue sarcoma Diseases 0.000 description 1
- 208000005718 Stomach Neoplasms Diseases 0.000 description 1
- 208000002847 Surgical Wound Diseases 0.000 description 1
- 208000031673 T-Cell Cutaneous Lymphoma Diseases 0.000 description 1
- 208000024313 Testicular Neoplasms Diseases 0.000 description 1
- 206010057644 Testis cancer Diseases 0.000 description 1
- 206010043515 Throat cancer Diseases 0.000 description 1
- 208000024770 Thyroid neoplasm Diseases 0.000 description 1
- 208000015778 Undifferentiated pleomorphic sarcoma Diseases 0.000 description 1
- 206010046431 Urethral cancer Diseases 0.000 description 1
- 206010046458 Urethral neoplasms Diseases 0.000 description 1
- 208000007097 Urinary Bladder Neoplasms Diseases 0.000 description 1
- 208000006105 Uterine Cervical Neoplasms Diseases 0.000 description 1
- 208000002495 Uterine Neoplasms Diseases 0.000 description 1
- 201000005969 Uveal melanoma Diseases 0.000 description 1
- 241000251539 Vertebrata <Metazoa> Species 0.000 description 1
- 206010047741 Vulval cancer Diseases 0.000 description 1
- 208000004354 Vulvar Neoplasms Diseases 0.000 description 1
- 208000016025 Waldenstroem macroglobulinemia Diseases 0.000 description 1
- 208000033559 Waldenström macroglobulinemia Diseases 0.000 description 1
- 208000008383 Wilms tumor Diseases 0.000 description 1
- 229960005305 adenosine Drugs 0.000 description 1
- 208000020990 adrenal cortex carcinoma Diseases 0.000 description 1
- 208000007128 adrenocortical carcinoma Diseases 0.000 description 1
- 201000011165 anus cancer Diseases 0.000 description 1
- 230000001640 apoptogenic effect Effects 0.000 description 1
- 208000001119 benign fibrous histiocytoma Diseases 0.000 description 1
- 208000026900 bile duct neoplasm Diseases 0.000 description 1
- 239000012472 biological sample Substances 0.000 description 1
- 210000000601 blood cell Anatomy 0.000 description 1
- 210000000481 breast Anatomy 0.000 description 1
- 208000002458 carcinoid tumor Diseases 0.000 description 1
- 201000010881 cervical cancer Diseases 0.000 description 1
- 239000003153 chemical reaction reagent Substances 0.000 description 1
- 208000006990 cholangiocarcinoma Diseases 0.000 description 1
- 208000029742 colonic neoplasm Diseases 0.000 description 1
- 238000012790 confirmation Methods 0.000 description 1
- 230000008094 contradictory effect Effects 0.000 description 1
- 201000007241 cutaneous T cell lymphoma Diseases 0.000 description 1
- 229940104302 cytosine Drugs 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 238000003745 diagnosis Methods 0.000 description 1
- 230000003292 diminished effect Effects 0.000 description 1
- 208000028715 ductal breast carcinoma in situ Diseases 0.000 description 1
- 201000007273 ductal carcinoma in situ Diseases 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 210000002889 endothelial cell Anatomy 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 210000003743 erythrocyte Anatomy 0.000 description 1
- 201000004101 esophageal cancer Diseases 0.000 description 1
- 230000029142 excretion Effects 0.000 description 1
- 208000024519 eye neoplasm Diseases 0.000 description 1
- 201000010175 gallbladder cancer Diseases 0.000 description 1
- 206010017758 gastric cancer Diseases 0.000 description 1
- 230000007614 genetic variation Effects 0.000 description 1
- 230000037442 genomic alteration Effects 0.000 description 1
- 201000009277 hairy cell leukemia Diseases 0.000 description 1
- 201000010536 head and neck cancer Diseases 0.000 description 1
- 208000014829 head and neck neoplasm Diseases 0.000 description 1
- 201000010235 heart cancer Diseases 0.000 description 1
- 208000024348 heart neoplasm Diseases 0.000 description 1
- 201000006866 hypopharynx cancer Diseases 0.000 description 1
- 238000003384 imaging method Methods 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 238000012405 in silico analysis Methods 0.000 description 1
- 238000010348 incorporation Methods 0.000 description 1
- 230000004054 inflammatory process Effects 0.000 description 1
- 238000002955 isolation Methods 0.000 description 1
- 201000010982 kidney cancer Diseases 0.000 description 1
- 206010023841 laryngeal neoplasm Diseases 0.000 description 1
- 210000000265 leukocyte Anatomy 0.000 description 1
- 208000012987 lip and oral cavity carcinoma Diseases 0.000 description 1
- 201000006721 lip cancer Diseases 0.000 description 1
- 210000004072 lung Anatomy 0.000 description 1
- 208000015486 malignant pancreatic neoplasm Diseases 0.000 description 1
- 208000026045 malignant tumor of parathyroid gland Diseases 0.000 description 1
- 210000005075 mammary gland Anatomy 0.000 description 1
- 238000007726 management method Methods 0.000 description 1
- 238000004519 manufacturing process Methods 0.000 description 1
- 238000002493 microarray Methods 0.000 description 1
- 201000008026 nephroblastoma Diseases 0.000 description 1
- 210000004882 non-tumor cell Anatomy 0.000 description 1
- 201000008106 ocular cancer Diseases 0.000 description 1
- 201000002575 ocular melanoma Diseases 0.000 description 1
- 201000005443 oral cavity cancer Diseases 0.000 description 1
- 201000006958 oropharynx cancer Diseases 0.000 description 1
- JMANVNJQNLATNU-UHFFFAOYSA-N oxalonitrile Chemical compound N#CC#N JMANVNJQNLATNU-UHFFFAOYSA-N 0.000 description 1
- 201000002528 pancreatic cancer Diseases 0.000 description 1
- 208000008443 pancreatic carcinoma Diseases 0.000 description 1
- 208000003154 papilloma Diseases 0.000 description 1
- 208000029211 papillomatosis Diseases 0.000 description 1
- 208000007312 paraganglioma Diseases 0.000 description 1
- 201000007052 paranasal sinus cancer Diseases 0.000 description 1
- 238000003909 pattern recognition Methods 0.000 description 1
- 208000020943 pineal parenchymal cell neoplasm Diseases 0.000 description 1
- 208000010916 pituitary tumor Diseases 0.000 description 1
- 208000010626 plasma cell neoplasm Diseases 0.000 description 1
- 238000002360 preparation method Methods 0.000 description 1
- 208000025638 primary cutaneous T-cell non-Hodgkin lymphoma Diseases 0.000 description 1
- 208000029340 primitive neuroectodermal tumor Diseases 0.000 description 1
- 210000002307 prostate Anatomy 0.000 description 1
- 102000004169 proteins and genes Human genes 0.000 description 1
- 238000011084 recovery Methods 0.000 description 1
- 206010038038 rectal cancer Diseases 0.000 description 1
- 201000001275 rectum cancer Diseases 0.000 description 1
- 208000015347 renal cell adenocarcinoma Diseases 0.000 description 1
- 230000004044 response Effects 0.000 description 1
- 201000009410 rhabdomyosarcoma Diseases 0.000 description 1
- 102200085789 rs121913279 Human genes 0.000 description 1
- 238000007790 scraping Methods 0.000 description 1
- 239000004065 semiconductor Substances 0.000 description 1
- 238000011896 sensitive detection Methods 0.000 description 1
- 238000013207 serial dilution Methods 0.000 description 1
- 201000000849 skin cancer Diseases 0.000 description 1
- 239000010454 slate Substances 0.000 description 1
- 201000002314 small intestine cancer Diseases 0.000 description 1
- 239000000243 solution Substances 0.000 description 1
- 206010041823 squamous cell carcinoma Diseases 0.000 description 1
- 201000011549 stomach cancer Diseases 0.000 description 1
- 201000003120 testicular cancer Diseases 0.000 description 1
- 229940113082 thymine Drugs 0.000 description 1
- 208000008732 thymoma Diseases 0.000 description 1
- 201000002510 thyroid cancer Diseases 0.000 description 1
- 108010064892 trkC Receptor Proteins 0.000 description 1
- 210000004881 tumor cell Anatomy 0.000 description 1
- 239000000439 tumor marker Substances 0.000 description 1
- 229940035893 uracil Drugs 0.000 description 1
- 201000005112 urinary bladder cancer Diseases 0.000 description 1
- 206010046766 uterine cancer Diseases 0.000 description 1
- 208000037965 uterine sarcoma Diseases 0.000 description 1
- 206010046885 vaginal cancer Diseases 0.000 description 1
- 208000013139 vaginal neoplasm Diseases 0.000 description 1
- 201000005102 vulva cancer Diseases 0.000 description 1
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6806—Preparing nucleic acids for analysis, e.g. for polymerase chain reaction [PCR] assay
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6813—Hybridisation assays
- C12Q1/6827—Hybridisation assays for detection of mutation or polymorphism
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
- G16B25/10—Gene or protein expression profiling; Expression-ratio estimation or normalisation
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/30—Unsupervised data analysis
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B5/00—ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q2535/00—Reactions characterised by the assay type for determining the identity of a nucleotide base or a sequence of oligonucleotides
- C12Q2535/122—Massive parallel sequencing
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q2537/00—Reactions characterised by the reaction format or use of a specific feature
- C12Q2537/10—Reactions characterised by the reaction format or use of a specific feature the purpose or use of
- C12Q2537/159—Reduction of complexity, e.g. amplification of subsets, removing duplicated genomic regions
Definitions
- a typical analysis approach may comprise enriching a nucleic acid sample for targeted regions of a genome, followed by sequencing of enriched nucleic acids and analysis of sequence read data for genetic variants of interest. These nucleic acids may be enriched using a bait mixture selected for a particular assay according to assay constraints, including limited sequencing load and utility associated with each genomic region of interest.
- the present disclosure provides a bait set panel comprising one or more bait sets that selectively enrich for one or more nucleosome-associated regions of a genome, said nucleosome-associated regions comprising genomic regions having one or more genomic base positions with differential nucleosomal occupancy, wherein the differential nucleosomal occupancy is characteristic of a cell or a tissue type of origin or a disease state.
- each of the one or more nucleosome-associated regions of a bait set panel comprises at least one of: (i) significant structural variation, comprising a variation in nucleosomal positioning, said structural variation selected from the group consisting of: an insertion, a deletion, a translocation, a gene rearrangement, methylation status, a micro-satellite, a copy number variation, a copy number-related structural variation, or any other variation which indicates differentiation; and (ii) instability, comprising one or more significant fluctuations or peaks in a genome partitioning map indicating one or more locations of nucleosomal map disruptions in a genome.
- the one or more bait sets of a bait set panel are configured to capture nucleosome-associated regions of the genome based on a function of a plurality of reference nucleosomal occupancy profiles (i) associated with one or more disease states and one or more non-disease states; (ii) associated with a known somatic mutation, such as SNV, CNV, indel, or re-arrangement; and/or (iii) associated with differential expression patterns.
- the one or more bait sets of a bait set panel selectively enrich for one or more nucleosome-associated regions in a cell-free deoxyribonucleic acid (cfDNA) sample.
- the present disclosure provides a method for enriching a nucleic acid sample for nucleosome-associated regions of a genome comprising (a) bringing a nucleic acid sample in contact with a bait set panel, said bait set panel comprising one or more bait sets that selectively enrich for one or more nucleosome-associated regions of a genome; and (b) enriching the nucleic acid sample for one or more nucleosome-associated regions of a genome.
- the one or more bait sets in a bait set panel are configured to capture nucleosome-associated regions of the genome based on a function of a plurality of reference nucleosomal occupancy profiles associated with one or more disease states and one or more non-disease states.
- the one or more bait sets in a bait set panel selectively enrich for the one or more nucleosome-associated regions in a cfDNA sample.
- the method for enriching a nucleic acid sample for nucleosome-associated regions of a genome further comprises sequencing the enriched nucleic acids to produce sequence reads of the nucleosome-associated regions of a genome.
- the present disclosure provides a method for generating a bait set comprising (a) identifying one or more regions of a genome, said regions associated with a nucleosome profile, and (b) selecting a bait set to selectively capture said regions.
- a bait set in a bait set panel selectively enriches for one or more nucleosome-associated regions in a cell-free deoxyribonucleic acid sample.
- the present disclosure provides a bait panel comprising a first bait set that selectively hybridizes to a first set of genomic regions of a nucleic acid sample comprising a predetermined amount of DNA, which is provided at a first concentration ratio that is less than a saturation point of the first bait set; and a second bait set that selectively hybridizes to a second set of genomic regions of the nucleic acid sample, which is provided at a second concentration ratio that is associated with a saturation point of the second bait set.
- the first set of genomic regions comprises one or more backbone genomic regions and the second set of genomic regions comprises one or more hotspot genomic regions.
- the present disclosure provides a method for enriching for multiple genomic regions comprising bringing a predetermined amount of a nucleic acid sample in contact with a bait panel comprising (i) a first bait set that selectively hybridizes to a first set of genomic regions of the nucleic acid sample, provided at a first concentration ratio that is less than a saturation point of the first bait set, and (ii) a second bait set that selectively hybridizes to a second set of genomic regions of the nucleic acid sample, provided at a second concentration ratio that is associated with a saturation point of the second bait set; and enriching the nucleic acid sample for the first set of genomic regions and the second set of genomic regions.
- the method further comprises sequencing the enriched nucleic acids to produce sequence reads of the first set of genomic regions and the second set of genomic regions.
- the saturation point of a bait set is determined by (a) for each of the baits in the bait set, generating a titration curve comprising (i) measuring the capture efficiency of the bait as a function of the concentration of the bait, and (ii) identifying an inflection point within the titration curve, thereby identifying a saturation point associated with the bait; and (b) selecting a saturation point that is larger than substantially all of the saturation points associated with baits in the bait set, thereby determining the saturation point of the bait set.
- the capture efficiency of a bait is determined by (a) providing a plurality of nucleic acid samples obtained from a plurality of subjects in a cohort; (b) hybridizing the bait with each of the nucleic acid samples, at each of a plurality of concentrations of the bait; (c) enriching, with the bait, a plurality of genomic regions of the nucleic acid samples, at each of the plurality of concentrations of the bait; and (d) measuring, as the capture efficiency at each of the plurality of concentrations of the bait, the number of unique nucleic acid molecules or of nucleic acid molecules with representation of both strands of an original double-stranded nucleic acid molecule.
- an inflection point is a first concentration of the bait such that observed capture efficiency does not increase significantly at concentrations of the bait greater than the first concentration.
- An inflection point may be a first concentration of the bait such that an observed increase between (1) the capture efficiency at a bait concentration of twice the first concentration compared to (2) the capture efficiency at the first bait concentration, is less than about 1%, less than about 2%, less than about 3%, less than about 4%, less than about 5%, less than about 6%, less than about 7%, less than about 8%, less than about 9%, less than about 10%, less than about 12%, less than about 14%, less than about 16%, less than about 18%, or less than about 20%.
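As a minimal sketch of the saturation-point procedure described above, the following Python finds, for one bait, the first tested concentration at which doubling the concentration raises capture efficiency by less than a chosen fraction, and then takes the largest per-bait value as the saturation point of the bait set. The function names, the linear interpolation between titration points, the use of a relative (rather than absolute) increase, and the 5% default are assumptions made for illustration and are not taken from the disclosure.

```python
from bisect import bisect_left


def interpolate(xs, ys, x):
    """Linearly interpolate y at x; xs must be sorted in ascending order."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_left(xs, x)
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def bait_saturation_point(concs, effs, max_gain=0.05):
    """Return the first concentration c where efficiency(2c) exceeds
    efficiency(c) by less than max_gain (relative), or None if not reached.

    concs: bait concentrations, ascending (arbitrary units).
    effs:  measured capture efficiency at each concentration.
    """
    for c, e in zip(concs, effs):
        # Skip points where 2c lies outside the measured range (no extrapolation).
        if 2 * c > concs[-1] or e <= 0:
            continue
        gain = (interpolate(concs, effs, 2 * c) - e) / e
        if gain < max_gain:
            return c
    return None


def bait_set_saturation_point(per_bait_points):
    """Assumed simplification: use the maximum per-bait saturation point."""
    known = [p for p in per_bait_points if p is not None]
    return max(known) if known else None


# Hypothetical two-fold titration series for two baits.
concs = [1, 2, 4, 8, 16]
bait_a = bait_saturation_point(concs, [0.30, 0.55, 0.80, 0.82, 0.83])
bait_b = bait_saturation_point(concs, [0.20, 0.40, 0.60, 0.80, 0.81])
set_point = bait_set_saturation_point([bait_a, bait_b])
```

Taking the maximum is the simplest reading of "larger than substantially all of the saturation points"; a high percentile of the per-bait values would be an alternative if a few outlier baits should be ignored.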
- the nucleic acid sample comprises a cell-free nucleic acid sample.
- a method for enriching for multiple genomic regions further comprises sequencing the enriched nucleic acid sample to produce a plurality of sequence reads.
- a method for enriching for multiple genomic regions further comprises producing an output comprising a nucleic acid sequence representative of the nucleic acid sample.
- the present disclosure provides a bait panel comprising a first bait set that selectively captures backbone regions of a genome, said backbone regions associated with a ranking function of sequencing load and utility, wherein the ranking function of each backbone region has a value less than a predetermined threshold value; and a second bait set that selectively captures hotspot regions of a genome, said hotspot regions associated with a ranking function of sequencing load and utility, wherein the ranking function of each hotspot region has a value greater than or equal to the predetermined threshold value.
- the hotspot regions comprise one or more nucleosome informative regions, said nucleosome informative regions comprising a region of maximum nucleosome differentiation.
- the bait panel further comprises a second bait set that selectively captures disease informative regions.
- the baits in the first bait set are at a first relative concentration to the bait panel, and the baits in the second bait set are at a second relative concentration to the bait panel.
- the present disclosure provides a method for generating a bait set comprising identifying one or more backbone genomic regions of interest, wherein the identifying the one or more backbone genomic regions comprises maximizing a ranking function of sequencing load and utility associated with each of the backbone genomic regions; identifying one or more hot-spot genomic regions of interest; creating a first bait set that selectively captures the backbone genomic regions of interest; and creating a second bait set that selectively captures the hot-spot genomic regions of interest, wherein the second bait set has a higher capture efficiency than the first bait set.
- the one or more hot-spots are selected using one or more of the following: (i) maximizing a ranking function of sequencing load and utility associated with each of the hot-spot genomic regions, (ii) nucleosome profiling across the one or more genomic regions of interest, (iii) predetermined cancer driver mutations or prevalence across a relevant patient cohort, and (iv) empirically identified cancer driver mutations.
- identifying one or more hotspots of interest comprises using a programmed computer processor to rank a set of hot-spot genomic regions based on a ranking function of sequencing load and utility associated with each of the hot-spot genomic regions.
- identifying the one or more backbone genomic regions of interest comprises ranking a set of backbone genomic regions based on a ranking function of sequencing load and utility associated with each of the backbone genomic regions of interest.
- identifying the one or more hot-spot genomic regions of interest comprises utilizing a set of empirically determined minor allele frequency (MAF) values or clonality of a variant measured by its MAF in relationship to the highest presumed driver or clonal mutation in a sample.
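One way to read "clonality of a variant measured by its MAF in relationship to the highest presumed driver or clonal mutation in a sample" is as the ratio of each variant's MAF to the largest MAF observed in the same sample. The sketch below follows that reading; the function name, the variant labels, and the MAF values are hypothetical.

```python
def relative_clonality(sample_mafs):
    """Map each variant to its MAF divided by the highest MAF in the sample.

    sample_mafs: dict of variant label -> minor allele frequency (0..1).
    A value near 1.0 suggests a clonal variant; small values suggest subclones.
    """
    if not sample_mafs:
        return {}
    top = max(sample_mafs.values())
    if top <= 0:
        raise ValueError("at least one variant must have a positive MAF")
    return {variant: maf / top for variant, maf in sample_mafs.items()}


# Hypothetical sample: one presumed driver and two lower-frequency variants.
clonality = relative_clonality({"var_A": 0.10, "var_B": 0.05, "var_C": 0.001})
```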
- sequencing load of a genomic region is calculated by multiplying together one or more of (i) size of the genomic region in base pairs, (ii) relative fraction of reads spent on sequencing fragments mapping to the genomic region, (iii) relative coverage as a result of sequence bias of the genomic region, (iv) relative coverage as a result of amplification bias of the genomic region, and (v) relative coverage as a result of capture bias of the genomic region.
- utility of a genomic region is calculated by multiplying together one or more of (i) frequency of one or more actionable mutations in the genomic region, (ii) frequency of one or more mutations associated with above-average minor allele frequencies (MAFs) in the genomic region, (iii) fraction of patients in a cohort harboring a somatic mutation within the genomic region, (iv) sum of MAF for variants in patients in a cohort, said patients harboring a somatic mutation within the genomic region, and (v) ratio of (1) MAF for variants in patients in a cohort, said patients harboring a somatic mutation within the genomic region, to (2) maximum MAF for a given patient in the cohort.
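Both quantities above are products of whichever factors are available for a region. A minimal sketch, assuming that any factor not supplied defaults to 1.0 (that is, it is left out of the product); the function and parameter names and the example numbers are illustrative only.

```python
from math import prod


def sequencing_load(size_bp, read_fraction=1.0, sequence_bias=1.0,
                    amplification_bias=1.0, capture_bias=1.0):
    """Product of region size (bp) and the relative read/coverage factors."""
    return prod([size_bp, read_fraction, sequence_bias,
                 amplification_bias, capture_bias])


def utility(actionable_freq=1.0, high_maf_freq=1.0, patient_fraction=1.0,
            maf_sum=1.0, maf_ratio=1.0):
    """Product of the mutation-frequency and MAF-based factors for a region."""
    return prod([actionable_freq, high_maf_freq, patient_fraction,
                 maf_sum, maf_ratio])


# Hypothetical 1,200 bp region that draws half the relative read share and
# has a 4% actionable-mutation frequency in 25% of a cohort.
load = sequencing_load(1200, read_fraction=0.5)
value = utility(actionable_freq=0.04, patient_fraction=0.25)
```

Because every factor is multiplicative, a factor of zero (for example, no patients in the cohort with a somatic mutation in the region) drives the utility of that region to zero.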
- actionable mutations comprise one or more of (i) druggable mutations, (ii) mutations for therapeutic monitoring, (iii) disease specific mutations, (iv) tissue specific mutations, (v) cell type specific mutations, (vi) resistance mutations, and (vii) diagnostic mutations.
- mutations associated with higher minor allele frequencies comprise one or more driver mutations or are known from external data or annotation sources.
- the present disclosure provides a bait panel comprising a plurality of bait sets, each bait set (i) comprising one or more baits that selectively capture one or more genomic regions with utility in the same quantile across the plurality of baits, and (ii) having a different relative concentration from each of the other bait sets with utility in a different quantile across the plurality of baits.
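The quantile-based panel above can be sketched as a grouping step: regions are ranked by utility, split into equal-sized quantile bins, and each bin becomes a bait set with its own relative concentration. The function names, the equal-count binning rule, and the choice to give higher-utility bins higher concentrations in the example are assumptions; the text only requires that the concentrations differ between bait sets.

```python
from collections import defaultdict


def utility_quantile_bins(utilities, n_bins):
    """Map region name -> bin index (0 = lowest utility) by utility rank."""
    ordered = sorted(utilities, key=utilities.get)
    n = len(ordered)
    return {name: i * n_bins // n for i, name in enumerate(ordered)}


def build_bait_sets(utilities, relative_concentrations):
    """Group regions into one bait set per utility quantile.

    relative_concentrations: one distinct value per quantile, lowest first.
    """
    if len(set(relative_concentrations)) != len(relative_concentrations):
        raise ValueError("each bait set needs a different relative concentration")
    bins = utility_quantile_bins(utilities, len(relative_concentrations))
    grouped = defaultdict(list)
    for name, b in bins.items():
        grouped[b].append(name)
    return {
        b: {"regions": sorted(names),
            "relative_concentration": relative_concentrations[b]}
        for b, names in sorted(grouped.items())
    }


# Hypothetical utilities for four regions, split into two quantiles.
panel = build_bait_sets({"r1": 0.9, "r2": 0.1, "r3": 0.5, "r4": 0.3},
                        relative_concentrations=[0.25, 1.0])
```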
- the present disclosure provides a method of selecting a set of panel blocks comprising (a) for each panel block, (i) calculating a utility of the panel block, (ii) calculating a sequencing load of the panel block, and (iii) calculating a ranking function of the panel block; and (b) performing an optimization process to select a set of panel blocks that maximizes the total ranking function values of the selected panel blocks.
- a ranking function of a panel block is calculated as the utility of a panel block divided by the sequencing load of a panel block.
- the combinatorial optimization process comprises a greedy algorithm.
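A greedy selection of panel blocks could look like the sketch below: each block's ranking value is its utility divided by its sequencing load, and blocks are added in descending rank order while they still fit. The fixed sequencing-load budget is an assumed constraint (the text mentions limited sequencing load but does not state how it is enforced), and the class and function names and the example blocks are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PanelBlock:
    name: str
    utility: float
    load: float  # sequencing load; must be positive

    @property
    def rank(self):
        """Ranking function: utility per unit of sequencing load."""
        return self.utility / self.load


def select_blocks(blocks, load_budget):
    """Greedy pick: highest rank first, skipping blocks that exceed the budget."""
    chosen, used = [], 0.0
    for block in sorted(blocks, key=lambda b: b.rank, reverse=True):
        if used + block.load <= load_budget:
            chosen.append(block)
            used += block.load
    return chosen


# Hypothetical candidate blocks and an assumed total load budget of 5 units.
candidates = [
    PanelBlock("A", utility=8.0, load=2.0),
    PanelBlock("B", utility=3.0, load=3.0),
    PanelBlock("C", utility=6.0, load=2.0),
    PanelBlock("D", utility=1.0, load=4.0),
]
selected = select_blocks(candidates, load_budget=5.0)
selected_names = [b.name for b in selected]
```

A greedy pass like this is fast and simple but is not guaranteed to find the best possible set under a budget; an exact knapsack-style solver would be the alternative where optimality matters.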
- the present disclosure provides a method comprising (a) providing a plurality of bait mixtures, wherein each bait mixture comprises a first bait set that selectively hybridizes to a first set of genomic regions and a second bait set that selectively hybridizes to a second set of genomic regions, and wherein the bait mixtures comprise the first bait set at different concentrations and the second bait set at the same concentrations; (b) contacting each bait mixture with a nucleic acid sample to capture nucleic acid from the sample with the bait sets, wherein the nucleic acid samples have a nucleic acid concentration around the saturation point of the second bait set; (c) sequencing the nucleic acids captured with each bait mixture to produce sets of sequence reads; (d) determining the relative number of sequence reads for the first set of genomic regions and the second set of genomic regions for each bait mixture; and (e) identifying at least one bait mixture that provides read depths for the second set of genomic regions and, optionally, first set of genomic regions, at predetermined amounts.
- the present disclosure provides a method for improving accuracy of detecting an insertion or deletion (indel) from a plurality of sequence reads derived from cell-free deoxyribonucleic acid (cfDNA) molecules in a bodily sample of a subject, which plurality of sequence reads are generated by nucleic acid sequencing, comprising (a) for each of the plurality of sequence reads associated with the cell-free DNA molecules, providing: a predetermined expectation of an indel being detected in one or more sequence reads of the plurality of sequence reads; a predetermined expectation that a detected indel is a true indel present in a given cell-free DNA molecule of the cell-free DNA molecules, given that an indel has been detected in the one or more of the sequence reads; and a predetermined expectation that a detected indel is introduced by non-biological error, given that an indel has been detected in the one or more of the sequence reads; (b) providing quantitative measures of one or more model parameters characteristic of sequence reads generated by
- the present disclosure provides a kit comprising (a) a sample comprising a predetermined amount of DNA; and (b) a bait set panel comprising (i) a first bait set that selectively hybridizes to a first set of genomic regions of a nucleic acid sample comprising a predetermined amount of DNA, provided at a first concentration ratio that is less than a saturation point of the first bait set and (ii) a second bait set that selectively hybridizes to a second set of genomic regions of the nucleic acid sample, provided at a second concentration ratio that is associated with a saturation point of the second bait set.
- the method for improving accuracy of detecting an insertion or deletion (indel) from a plurality of sequence reads derived from cell-free deoxyribonucleic acid (cfDNA) molecules in a bodily sample of a subject further comprises enriching one or more loci from the cell-free DNA in the bodily sample before step (a), thereby producing enriched polynucleotides.
- the method further comprises amplifying the enriched polynucleotides to produce families of amplicons, wherein each family comprises amplicons originating from a single strand of the cell-free DNA molecules.
- the non-biological error comprises error in sequencing at a plurality of genomic base locations. In some embodiments, the non-biological error comprises error in amplification at a plurality of genomic base locations.
- model parameters comprise one or more of (e.g., one or more of, two or more of, three or more of, or four of) (i) for each of one or more variant alleles, a frequency of the variant allele and a frequency of non-reference alleles other than the variant allele; (ii) a frequency of an indel error in the entire forward strand of a family of strands, wherein a family comprises a collection of amplicons originating from a single strand of the cell-free DNA molecules; (iii) a frequency of an indel error in the entire reverse strand of a family of strands; and (iv) a frequency of an indel error in a sequence read.
- the step of performing a hypothesis test comprises performing a multi-parameter maximization algorithm.
- the multi-parameter maximization algorithm comprises a Nelder-Mead algorithm.
- the classifying of a candidate indel as a true indel or an introduced indel comprises (a) maximizing a multi-parameter likelihood function, (b) classifying a candidate indel as a true indel if the maximum likelihood function value is greater than a predetermined threshold value, and (c) classifying a candidate indel as an introduced indel if the maximum likelihood function value is less than or equal to a predetermined threshold value.
- the present disclosure provides a non-transitory computer-readable medium comprising machine executable code that, upon execution by one or more computer processors, implements a method for generating a bait set comprising identifying one or more backbone genomic regions of interest, wherein the identifying the one or more backbone genomic regions comprises maximizing a ranking function of sequencing load and utility associated with each of the backbone genomic regions; identifying one or more hot-spot genomic regions of interest; creating a first bait set that selectively captures the backbone genomic regions of interest; and creating a second bait set that selectively captures the hot-spot genomic regions of interest, wherein the second bait set has a higher capture efficiency than the first bait set.
- the present disclosure provides a non-transitory computer-readable medium comprising machine executable code that, upon execution by one or more computer processors, implements a method of selecting a set of panel blocks comprising (a) for each panel block, (i) calculating a utility of the panel block, (ii) calculating a sequencing load of the panel block, and (iii) calculating a ranking function of the panel block; and (b) performing an optimization process to select a set of panel blocks that maximizes the total ranking function values of the selected panel blocks.
- the present disclosure provides a non-transitory computer-readable medium comprising machine executable code that, upon execution by one or more computer processors, implements a method for improving accuracy of detecting an insertion or deletion (indel) from a plurality of sequence reads derived from cell-free deoxyribonucleic acid (cfDNA) molecules in a bodily sample of a subject, which plurality of sequence reads are generated by nucleic acid sequencing, comprising (a) for each of the plurality of sequence reads associated with the cell-free DNA molecules, providing: a predetermined expectation of an indel being detected in one or more sequence reads of the plurality of sequence reads; a predetermined expectation that a detected indel is a true indel present in a given cell-free DNA molecule of the cell-free DNA molecules, given that an indel has been detected in the one or more of the sequence reads; and a predetermined expectation that a detected indel is introduced by non-biological error, given that an indel has been detected
- the present disclosure provides a method for enriching for multiple genomic regions, comprising: (a) bringing a predetermined amount of nucleic acid from a sample in contact with a bait mixture comprising (i) a first bait set that selectively hybridizes to a first set of genomic regions of the nucleic acid from the sample, which first bait set is provided at a first concentration that is less than a saturation point of the first bait set, and (ii) a second bait set that selectively hybridizes to a second set of genomic regions of the nucleic acid sample, which second bait set is provided at a second concentration that is associated with a saturation point of the second bait set; and (b) enriching the nucleic acid sample for the first set of genomic regions and the second set of genomic regions.
- the second bait set has a saturation point that is larger than substantially all of the saturation points associated with baits in the second bait set when a bait of the second bait set is subjected to a titration curve generated by (i) measuring the capture efficiency of a bait of the second bait set as a function of the concentration of the bait, and (ii) identifying an inflection point within the titration curve, thereby identifying a saturation point associated with the bait.
- the saturation point is selected such that an observed capture efficiency increases by less than 20% at a concentration of the bait twice that of the first concentration.
- the saturation point is selected such that an observed capture efficiency increases by less than 10% at a concentration of the bait twice that of the first concentration. In some embodiments, the saturation point is selected such that an observed capture efficiency increases by less than 5% at a concentration of the bait twice that of the first concentration. In some embodiments, the saturation point is selected such that an observed capture efficiency increases by less than 2% at a concentration of the bait twice that of the first concentration. In some embodiments, the saturation point is selected such that an observed capture efficiency increases by less than 1% at a concentration of the bait twice that of the first concentration.
- the first bait set or the second bait set selectively enrich for one or more nucleosome-associated regions of a genome, said nucleosome-associated regions comprising genomic regions having one or more genomic base positions with differential nucleosomal occupancy, wherein the differential nucleosomal occupancy is characteristic of a cell or tissue type of origin or disease state.
- the nucleic acid sample comprises a cell-free nucleic acid sample.
- the method further comprises: (c) sequencing the enriched nucleic acid sample to produce a plurality of sequence reads.
- the method further comprises: (d) producing an output comprising a nucleic acid sequence representative of the nucleic acid sample.
- the present disclosure provides a method for generating a bait set comprising: (a) identifying one or more predetermined backbone genomic regions, wherein the identifying the one or more backbone genomic regions comprises maximizing a ranking function of sequencing load and utility associated with each of the backbone genomic regions; (b) identifying one or more predetermined hot-spot genomic regions, wherein the one or more hot-spots are selected using one or more of the following: (i) maximizing a ranking function of sequencing load and utility associated with each of the hot-spot genomic regions, (ii) nucleosome profiling across the one or more predetermined genomic regions, (iii) predetermined cancer driver mutations or prevalence across a relevant patient cohort, and (iv) empirically identified cancer driver mutations; (c) creating a first bait set that selectively captures the predetermined backbone genomic regions; and (d) creating a second bait set that selectively captures the predetermined hotspot genomic regions, wherein the second bait set has a higher capture efficiency than the first bait set.
- a predetermined region e.g., a predetermined backbone region or a predetermined hotspot region
- a region of interest e.g., a backbone region of interest or a hotspot region of interest, respectively.
- the identifying the one or more predetermined hotspots comprises using a programmed computer processor to rank a set of hotspot genomic regions based on a ranking function of sequencing load and utility associated with each of the hotspot genomic regions.
- the identifying the one or more predetermined backbone genomic regions comprises: (i) ranking a set of backbone genomic regions based on a ranking function of sequencing load and utility associated with each of the predetermined backbone genomic regions; (ii) utilizing a set of empirically determined minor allele frequency (MAF) values or clonality of a variant measured by its MAF in relationship to the highest presumed driver or clonal mutation in a sample; or (iii) a combination of (i) and (ii).
- the sequencing load of a genomic region is calculated by multiplying together one or more of: (i) size of the genomic region in base pairs, (ii) relative fraction of reads spent on sequencing fragments mapping to the genomic region, (iii) relative coverage as a result of sequence bias of the genomic region, (iv) relative coverage as a result of amplification bias of the genomic region, and (v) relative coverage as a result of capture bias of the genomic region.
- the utility of a genomic region is calculated by multiplying together one or more of: (i) frequency of one or more actionable mutations in the genomic region, (ii) frequency of one or more mutations associated with above-average minor allele frequencies (MAFs) in the genomic region, (iii) fraction of patients in a cohort harboring a somatic mutation within the genomic region, (iv) sum of MAF for variants in patients in a cohort, said patients harboring a somatic mutation within the genomic region, and (v) ratio of (1) MAF for variants in patients in a cohort, said patients harboring a somatic mutation within the genomic region, to (2) maximum MAF for a given patient in the cohort.
- the actionable mutations comprise one or more of: (i) druggable mutations, (ii) mutations for therapeutic monitoring, (iii) disease specific mutations, (iv) tissue specific mutations, (v) cell type specific mutations, (vi) resistance mutations, and (vii) diagnostic mutations.
- the mutations associated with higher minor allele frequencies comprise one or more driver mutations or are known from external data or annotation sources.
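The sequencing load and utility products described above can be sketched in a few lines. This is a minimal illustration, not the disclosed implementation: the `Region` class, the example factor values, and the choice of utility divided by sequencing load as the ranking function are all assumptions, since the text only states that the ranking function depends on both quantities.

```python
from dataclasses import dataclass
from math import prod
from typing import Sequence


@dataclass
class Region:
    """Hypothetical container for one candidate genomic region."""
    name: str
    load_factors: Sequence[float]     # e.g. size in bp, relative read fraction, bias terms
    utility_factors: Sequence[float]  # e.g. actionable-mutation frequency, cohort fraction

    @property
    def sequencing_load(self) -> float:
        # Load is the product of the supplied load factors.
        return prod(self.load_factors)

    @property
    def utility(self) -> float:
        # Utility is the product of the supplied utility factors.
        return prod(self.utility_factors)


def rank_regions(regions):
    """Sort regions best-first by utility per unit load (assumed ranking function)."""
    return sorted(regions, key=lambda r: r.utility / r.sequencing_load, reverse=True)


# Made-up example values, for illustration only.
regions = [
    Region("region_A", [200, 1.0], [0.30, 0.5]),
    Region("region_B", [1000, 1.2], [0.40, 0.5]),
    Region("region_C", [150, 0.8], [0.05, 0.2]),
]
ranked_names = [r.name for r in rank_regions(regions)]
```

Regions near the top of such a ranking would be natural hot-spot candidates, while lower-ranked regions could be assigned to the backbone or dropped.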
- the present disclosure provides a method comprising: (a) providing a plurality of bait mixtures, wherein each bait mixture comprises a first bait set that selectively hybridizes to a first set of genomic regions and a second bait set that selectively hybridizes to a second set of genomic regions, and wherein the bait mixtures comprise the first bait set at different concentrations and the second bait set at the same concentrations; (b) contacting each bait mixture with a nucleic acid sample to capture nucleic acid from the sample with the bait sets, wherein the second bait set in each mixture is provided at a concentration that is at or above a saturation point of the second bait set, wherein nucleic acid from the sample is captured by the bait sets; (c) sequencing a portion of the nucleic acids captured with each bait mixture to produce sets of sequence reads within an allocated number of sequence reads; (d) determining the read depth of sequence reads for the first bait set and the second bait set for each bait mixture; and (e) identifying at least one bait mixture that provides read depths for the
- the second bait set has a saturation point when subjected to titration, which titration comprises: generating a titration curve comprising: (i) measuring the capture efficiency of the second bait set as a function of the concentration of the baits; and (ii) identifying an inflection point within the titration curve, thereby identifying a saturation point associated with the second bait set.
- the saturation point is selected such that an observed capture efficiency increases by less than 20% at a concentration of the bait twice that of the first concentration. In some embodiments, the saturation point is selected such that an observed capture efficiency increases by less than 10% at a concentration of the bait twice that of the first concentration. In some embodiments, the saturation point is selected such that an observed capture efficiency increases by less than 5% at a concentration of the bait twice that of the first concentration. In some embodiments, the saturation point is selected such that an observed capture efficiency increases by less than 2% at a concentration of the bait twice that of the first concentration. In some embodiments, the saturation point is selected such that an observed capture efficiency increases by less than 1% at a concentration of the bait twice that of the first concentration.
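The doubling criterion above can be applied to titration data as follows. This sketch assumes that the percentage increase is measured relative to the efficiency at the lower concentration and that the titration series includes each concentration and its double; the function name and the data are hypothetical.

```python
def saturation_point(concentrations, efficiencies, max_gain=0.10):
    """Return the lowest tested concentration c such that doubling c raises
    capture efficiency by less than max_gain (relative increase).

    Returns None if no tested concentration meets the criterion.
    Only concentrations whose double was also tested are considered.
    """
    measured = dict(zip(concentrations, efficiencies))
    for c in sorted(measured):
        doubled = measured.get(2 * c)
        if doubled is None or measured[c] <= 0:
            continue
        relative_gain = (doubled - measured[c]) / measured[c]
        if relative_gain < max_gain:
            return c
    return None


# Made-up titration series (arbitrary concentration units, fractional efficiency).
conc = [1, 2, 4, 8, 16]
eff = [0.20, 0.38, 0.60, 0.70, 0.75]

sat_10 = saturation_point(conc, eff)                 # 10% criterion
sat_20 = saturation_point(conc, eff, max_gain=0.20)  # 20% criterion
```

A stricter threshold (for example 1%) pushes the selected saturation point to a higher concentration, or yields no point at all if the titration series does not extend far enough.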
- the first bait set or the second bait set selectively enrich for one or more nucleosome-associated regions of a genome, said nucleosome-associated regions comprising genomic regions having one or more genomic base positions with differential nucleosomal occupancy, wherein the differential nucleosomal occupancy is characteristic of a cell or tissue type of origin or disease state.
- the first set of genomic regions or the second genomic regions comprises one or more actionable mutations, wherein the one or more actionable mutations comprise one or more of: (i) druggable mutations, (ii) mutations for therapeutic monitoring, (iii) disease specific mutations, (iv) tissue specific mutations, (v) cell type specific mutations, (vi) resistance mutations, and (vii) diagnostic mutations.
- the first and second genomic regions comprise at least a portion of each of at least 5 genes selected from Table 3. In some embodiments, the first and second genomic regions have a size between about 25 kilobases and 1,000 kilobases and a read depth of between 1,000 counts/base and 50,000 counts/base.
- the present disclosure provides a method for enriching multiple genomic regions, comprising: (a) bringing a predetermined amount of nucleic acid from a sample in contact with a bait mixture comprising: (i) a first bait set that selectively hybridizes to a first set of genomic regions of the nucleic acid from the sample, which first bait set is provided at a first concentration that is less than a saturation point of the first bait set, and (ii) a second bait set that selectively hybridizes to a second set of genomic regions of the nucleic acid from the sample, which second bait set is provided at a second concentration that is at or above a saturation point of the second bait set; and (b) enriching the nucleic acid from the sample for the first set of genomic regions and the second set of genomic regions, thereby producing an enriched nucleic acid.
- the second bait set has a saturation point that is larger than substantially all of the saturation points associated with baits in the second bait set when a bait of the second bait set is subjected to a titration curve generated by (i) measuring capture efficiency of a bait of the second bait set as a function of the concentration of the bait, and (ii) identifying an inflection point within the titration curve, thereby identifying a saturation point associated with the bait.
- the saturation point of the first bait set is selected such that an observed capture efficiency increases by less than 10% at a concentration of the bait twice that of the first concentration.
- the first bait set or the second bait set selectively enrich for one or more nucleosome-associated regions of a genome, the nucleosome-associated regions comprising genomic regions having one or more genomic base positions with differential nucleosomal occupancy, wherein the differential nucleosomal occupancy is characteristic of a cell or tissue type of origin or disease state.
- the method further comprises (c) sequencing the enriched nucleic acid to produce a plurality of sequence reads.
- the method further comprises (d) producing an output comprising nucleic acid sequences representative of the nucleic acid from the sample.
- the present disclosure provides a method comprising: (a) providing a plurality of bait mixtures, wherein each of the plurality of bait mixtures comprises a first bait set that selectively hybridizes to a first set of genomic regions and a second bait set that selectively hybridizes to a second set of genomic regions, wherein the first bait set is at different concentrations across the plurality of bait mixtures and the second bait set is at the same concentration across the plurality of bait mixtures; (b) contacting each of the plurality of bait mixtures with a nucleic acid sample to capture nucleic acids from the nucleic acid sample with the first bait set and the second bait set, wherein the second bait set in each bait mixture is provided at a first concentration that is at or above a saturation point of the second bait set, wherein nucleic acids from the nucleic acid sample are captured by the first bait set and the second bait set; (c) sequencing a portion of the nucleic acids captured with each bait mixture to produce sets of sequence reads within an allocated number of sequence reads;
- the second bait set has a saturation point when subjected to titration, which titration comprises generating a titration curve comprising: (i) measuring capture efficiency of the second bait set as a function of the concentration of the baits; and (ii) identifying an inflection point within the titration curve, thereby identifying a saturation point associated with the second bait set.
- the saturation point is selected such that an observed capture efficiency increases by less than 10% at a concentration of the bait set twice that of the first concentration.
- the first bait set or the second bait set selectively enrich for one or more nucleosome-associated regions of a genome, the nucleosome-associated regions comprising genomic regions having one or more genomic base positions with differential nucleosomal occupancy, wherein the differential nucleosomal occupancy is characteristic of a cell or tissue type of origin or disease state.
- the first set of genomic regions comprises one or more actionable mutations, wherein the one or more actionable mutations comprise one or more of: (i) druggable mutations, (ii) mutations for therapeutic monitoring, (iii) disease specific mutations, (iv) tissue specific mutations, (v) cell type specific mutations, (vi) resistance mutations, and (vii) diagnostic mutations.
- the first genomic regions comprise at least a portion of each of at least 5 genes selected from Table 1. In some embodiments, the first genomic regions have a size between about 25 kilobases and 1,000 kilobases and a read depth of between 1,000 counts/base and 50,000 counts/base. In some embodiments, the saturation point of the second bait set is selected such that an observed capture efficiency increases by less than 10% at a concentration of the bait twice that of the second concentration.
- the second set of genomic regions comprises one or more actionable mutations, wherein the one or more actionable mutations comprise one or more of: (i) druggable mutations, (ii) mutations for therapeutic monitoring, (iii) disease specific mutations, (iv) tissue specific mutations, (v) cell type specific mutations, (vi) resistance mutations, and (vii) diagnostic mutations.
- the second genomic regions comprise at least a portion of each of at least 5 genes selected from Table 1.
- the second genomic regions have a size between about 25 kilobases and 1,000 kilobases and a read depth of between 1,000 counts/base and 50,000 counts/base.
- FIG. 1 illustrates how a plurality of reads may be generated for each locus enriched from a cell-free nucleic acid sample.
- FIG. 2 illustrates an example of an insertion being supported by a large family.
- FIG. 3 illustrates an example of small families of reads (which may appear to provide evidence for a real variant) and large families of reads (which may indicate a likely random error stemming from PCR or sequencing).
- FIG. 4 illustrates the various parameters that may be used in a hypothesis test and how each parameter may be related to a particular probability, e.g., of a family of reads matching a reference, of a strand's reads matching a reference, and of a read matching a reference.
- FIG. 5 illustrates an example of a computer system that may be programmed or otherwise configured to implement methods of the present disclosure.
- FIG. 6 illustrates an exemplary saturation curve showing unique molecule count on the y-axis as a function of input cfDNA amount on the x-axis.
- genetic variant generally refers to an alteration, variant or polymorphism in a nucleic acid sample or genome of a subject. Such alteration, variant or polymorphism can be with respect to a reference genome, which may be a reference genome of the subject or other individual.
- Single nucleotide polymorphisms are a form of polymorphisms.
- one or more polymorphisms comprise one or more single nucleotide variations (SNVs), insertions, deletions, repeats, small insertions, small deletions, small repeats, structural variant junctions, variable length tandem repeats, and/or flanking sequences.
- Copy number variations (CNVs), transversions, and other rearrangements are also forms of genetic variation.
- a genomic alteration may be a base change, insertion, deletion, repeat, copy number variation, or transversion.
- polynucleotide generally refers to a molecule comprising one or more nucleic acid subunits (a “nucleic acid molecule”).
- a polynucleotide can include one or more subunits selected from adenosine (A), cytosine (C), guanine (G), thymine (T) and uracil (U), or variants thereof.
- a nucleotide can include A, C, G, T or U, or variants thereof.
- a nucleotide can include any subunit that can be incorporated into a growing nucleic acid strand.
- Such subunit can be an A, C, G, T, or U, or any other subunit that is specific to one or more complementary A, C, G, T or U, or complementary to a purine (i.e., A or G, or variant thereof) or a pyrimidine (i.e., C, T or U, or variant thereof).
- Identification of a subunit can enable individual nucleic acid bases or groups of bases (e.g., AA, TA, AT, GC, CG, CT, TC, GT, TG, AC, CA, or uracil-counterparts thereof) to be resolved.
- a polynucleotide is deoxyribonucleic acid (DNA) or ribonucleic acid (RNA), or derivatives thereof.
- a polynucleotide can be single-stranded or double stranded.
- a polynucleotide can comprise any type of nucleic acids, such as DNA and/or RNA.
- a polynucleotide can be genomic DNA, complementary DNA (cDNA), or any other deoxyribonucleic acid.
- a polynucleotide can be a cell-free nucleic acid. As used herein, the terms cell-free nucleic acid and extracellular nucleic acid can be used interchangeably.
- a polynucleotide can be cell-free DNA (cfDNA).
- the polynucleotide can be circulating DNA. The circulating DNA can comprise circulating tumor DNA (ctDNA).
- the cell-free or extracellular nucleic acids can be derived from any bodily fluid including, but not limited to, whole blood, platelets, serum, plasma, synovial fluid, lymphatic fluid, ascites fluid, interstitial or extracellular fluid, the fluid in spaces between cells, gingival crevicular fluid, bone marrow, cerebrospinal fluid, saliva, mucous, sputum, semen, sweat, urine, cervical fluid or lavage, vaginal fluid or lavage, mammary gland or lavage, and/or any combination thereof.
- the cell-free or extracellular nucleic acids can be derived from plasma.
- a bodily fluid containing cells can be processed to remove the cells in order to purify and/or extract cell-free or extracellular nucleic acids.
- a polynucleotide can be double-stranded or single-stranded. Alternatively, a polynucleotide can comprise a combination of a double-stranded portion and a single-stranded portion.
- Polynucleotides do not have to be cell-free.
- the polynucleotides can be isolated from a sample.
- a sample can be a composition comprising an analyte.
- a sample can be any biological sample isolated from a subject including, without limitation, bodily fluid, whole blood, platelets, serum, plasma, stool, red blood cells, white blood cells or leucocytes, endothelial cells, tissue biopsies, synovial fluid, lymphatic fluid, ascites fluid, interstitial or extracellular fluid, the fluid in spaces between cells, including gingival crevicular fluid, bone marrow, cerebrospinal fluid, saliva, mucous, sputum, semen, sweat, urine, or any other bodily fluids, and/or any combination thereof.
- a bodily fluid can include saliva, blood, or serum.
- a polynucleotide can be cell-free DNA isolated from a bodily fluid, e.g., blood or serum.
- a sample can also be a tumor sample, which can be obtained from a subject by various approaches, including, but not limited to, venipuncture, excretion, ejaculation, massage, biopsy, needle aspirate, lavage, scraping, surgical incision, or intervention or other approaches.
- a sample is a nucleic acid sample, e.g., a purified nucleic acid sample.
- a nucleic acid sample comprises cell-free DNA (cfDNA). An analyte in a sample can be in various stages of purity.
- a raw sample taken directly from a subject can contain the analyte in an unpurified state.
- a sample also may be enriched for an analyte.
- An analyte also may be present in the sample in isolated or substantially isolated form.
- the polynucleotides can comprise sequences associated with cancer, such as acute lymphoblastic leukemia (ALL), acute myeloid leukemia (AML), adrenocortical carcinoma, Kaposi Sarcoma, anal cancer, basal cell carcinoma, bile duct cancer, bladder cancer, bone cancer, osteosarcoma, malignant fibrous histiocytoma, brain stem glioma, brain cancer, craniopharyngioma, ependymoblastoma, ependymoma, medulloblastoma, medulloepithelioma, pineal parenchymal tumor, breast cancer, bronchial tumor, Burkitt lymphoma, Non-Hodgkin lymphoma, carcinoid tumor, cervical cancer, chordoma, chronic lymphocytic leukemia (CLL), chronic myelogenous leukemia (CML), colon cancer, colorectal cancer, cutaneous T-cell lymphoma,
- a sample can comprise various amounts of nucleic acid that contain genome equivalents.
- a sample of about 30 ng DNA can contain about 10,000 (10^4) haploid human genome equivalents and, in the case of cfDNA, about 200 billion (2×10^11) individual polynucleotide molecules.
- a sample of about 100 ng of DNA can contain about 30,000 haploid human genome equivalents and, in the case of cfDNA, about 600 billion individual molecules.
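The genome-equivalent figures above can be checked with a back-of-the-envelope calculation. The three constants below are approximate textbook values supplied here as assumptions (they do not appear in the text), so the results agree with the stated figures only to within the "about" qualifier.

```python
# Approximate values, assumed for illustration only.
HAPLOID_GENOME_MASS_PG = 3.3     # mass of one haploid human genome, in picograms
HAPLOID_GENOME_SIZE_BP = 3.2e9   # length of one haploid human genome, in base pairs
MEAN_CFDNA_FRAGMENT_BP = 160     # typical cfDNA fragment length, in base pairs


def haploid_genome_equivalents(dna_ng):
    """Number of haploid genome copies in dna_ng nanograms of DNA."""
    return dna_ng * 1000 / HAPLOID_GENOME_MASS_PG  # 1 ng = 1000 pg


def cfdna_molecules(dna_ng):
    """Approximate number of cfDNA fragments in dna_ng nanograms of cfDNA."""
    fragments_per_genome = HAPLOID_GENOME_SIZE_BP / MEAN_CFDNA_FRAGMENT_BP
    return haploid_genome_equivalents(dna_ng) * fragments_per_genome


ge_30 = haploid_genome_equivalents(30)    # roughly 9,000, i.e. about 10^4
mol_30 = cfdna_molecules(30)              # roughly 1.8e11, i.e. about 2e11
ge_100 = haploid_genome_equivalents(100)  # roughly 30,000
mol_100 = cfdna_molecules(100)            # roughly 6e11
```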
- a sample can comprise nucleic acids from different sources.
- a sample can comprise germline DNA or somatic DNA.
- a sample can comprise nucleic acids carrying mutations.
- a sample can comprise DNA carrying germline mutations and/or somatic mutations.
- a sample can also comprise DNA carrying cancer-associated mutations (e.g., cancer-associated somatic mutations).
- subject generally refers to an animal, such as a mammalian species (e.g., human) or avian (e.g., bird) species, or other organism, such as a plant. More specifically, the subject can be a vertebrate, a mammal, a mouse, a primate, a simian or a human. Animals include, but are not limited to, farm animals, sport animals, and pets.
- a subject can be a healthy individual, an individual that has or is suspected of having a disease or a pre-disposition to the disease, or an individual that is in need of therapy or suspected of needing therapy.
- a subject can be a patient.
- a genome generally refers to an entirety of an organism's hereditary information.
- a genome can be encoded either in DNA or in RNA.
- a genome can comprise coding regions that code for proteins as well as non-coding regions.
- a genome can include the sequence of all chromosomes together in an organism.
- the human genome has a total of 46 chromosomes. The sequence of all of these together may constitute a human genome.
- a genome may comprise a diploid or a haploid genome.
- bait generally refers to a target-specific oligonucleotide (e.g., a capture probe) designed and used to capture specific genomic regions of interest (e.g., targets, or predetermined genomic regions of interest).
- the bait may capture its intended targets by selectively hybridizing to complementary nucleic acids.
- bait panel or “bait set panel,” as used herein, generally refers to a set of baits targeted toward a selected set of genomic regions of interest.
- a bait panel or bait set panel may be referred to as a bait mixture.
- the bait panel may capture its intended targets in a single selective hybridization step.
- error rate of detecting a genetic variant (e.g., an indel), as used herein, generally refers to the percentage of candidate (e.g., detected) genetic variants detected through analysis of one or more sequence reads that are identified as an introduced genetic variant attributable to non-biological origin (e.g., sequencing or amplification error).
- for example, if analysis of one or more sequence reads identifies 100 candidate genetic variants, of which 90 are attributable to biological origin and 10 are attributable to non-biological origin, then this analysis has an accuracy of detecting the genetic variant of 90% and an error rate of 10%.
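The arithmetic of this definition is shown below for the 100-candidate case. The function name is hypothetical.

```python
def accuracy_and_error_rate(n_biological, n_non_biological):
    """Return (accuracy, error_rate) as fractions of all candidate variants."""
    total = n_biological + n_non_biological
    return n_biological / total, n_non_biological / total


accuracy, error_rate = accuracy_and_error_rate(90, 10)  # 100 candidates in total
```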
- the term “about” and its grammatical equivalents in relation to a reference numerical value can include a range of values up to plus or minus 10% from that value.
- the amount “about 10” can include amounts from 9 to 11.
- the term “about” in relation to a reference numerical value can include a range of values plus or minus 10%, 9%, 8%, 7%, 6%, 5%, 4%, 3%, 2%, or 1% from that value.
- the term “at least” and its grammatical equivalents in relation to a reference numerical value can include the reference numerical value and greater than that value.
- the amount “at least 10” can include the value 10 and any numerical value above 10, such as 11, 100, and 1,000.
- the term “at most” and its grammatical equivalents in relation to a reference numerical value can include the reference numerical value and less than that value.
- the amount “at most 10” can include the value 10 and any numerical value under 10, such as 9, 8, 5, 1, 0.5, and 0.1.
- processing can refer to determining a difference, e.g., a difference in number or sequence.
- a difference in number or sequence e.g., gene expression, copy number variation (CNV), indel, and/or single nucleotide variant (SNV) values or sequences can be processed.
- the present disclosure provides methods and systems for multi-resolution analysis of cell-free nucleic acids (e.g., deoxyribonucleic acid (DNA)), wherein targeted genomic regions of interest may be enriched with capture probes (“baits”) selected for one or more bait set panels using a differential tiling and capture scheme.
- a differential tiling and capture scheme uses bait sets of different relative concentrations to differentially tile (e.g., at different “resolutions”) across genomic regions associated with baits, subject to a set of constraints (e.g., sequencer constraints such as sequencing load, utility of each bait, etc.), and capture them at a desired level for downstream sequencing.
- the targeted genomic regions of interest may include single-nucleotide variants (SNVs) and indels (i.e., insertions or deletions).
- the targeted genomic regions of interest may comprise backbone genomic regions of interest (“backbone regions”) or hot-spot genomic regions of interest (“hot-spot regions” or “hotspot regions” or “hot-spots” or “hotspots”). While “hotspots” can refer to particular loci associated with sequence variants, “backbone” regions can refer to larger genomic regions, each of which can have one or more potential sequence variants.
- a backbone region can be a region containing one or more cancer-associated mutations, while a hotspot can be a locus with a particular mutation associated with recurring cancer.
- Both backbone and hot-spot genomic regions of interest may comprise tumor-relevant marker genes commonly included in liquid biopsy assays (e.g., BRAF, BRCA, EGFR, KRAS, PIK3CA, ROS1, TP53, and others), for which one or more variants may be expected to be seen in subjects with cancer.
- hot-spot genomic regions of interest may be selected to be represented by a higher proportion of sequence reads compared to the backbone genomic regions of interest in the experimental protocol.
- This experimental protocol may comprise steps including isolation, amplification, capture, sequencing, and data analysis.
- the selection of regions as hot-spot regions or backbone regions may be driven by considerations such as the capture efficiency, sequencing load, and/or utility associated with each of the regions and their corresponding bait.
- Utility may be assessed by the clinical relevance (e.g., “clinical value”) of a genomic marker of interest (e.g., a tumor marker) toward a liquid biopsy assay, e.g., predetermined cancer driver mutations, genomic regions with prevalence across a relevant patient cohort, empirically identified cancer driver mutations, or nucleosome-associated genomic regions.
- utility can be measured by a metric representative of expected yield of actionable and/or disease-associated genetic variants in detection or contribution toward determining tissue of origin or disease state of a sample.
- Utility may be a monotonically increasing function of clinical value.
- a multi-resolution analysis approach to generate a bait set panel that preferentially enriches “hot-spot regions” as compared to backbone regions will enable efficient use of sequencing reads for genetic variant detection for cancer detection and assessment applications, by focusing sequencing at higher read depths for hot-spot regions over backbone regions.
- Using this approach may enable the improvement of a sample assay, given a limited or constrained sequencing load (e.g., number of sequenced reads per sample assayed), such that a greater number of clinically actionable genetic variants may be detected per sample assay compared to an un-optimized sample assay.
- the present disclosure provides methods for improving accuracy of detecting an insertion or deletion (indel) from a plurality of sequence reads derived from cell-free deoxyribonucleic acid (cfDNA) molecules in a bodily sample of a subject, which plurality of sequence reads are generated by nucleic acid sequencing. For each of the plurality of sequence reads associated with cfDNA molecules, a candidate indel may be identified.
- Each candidate indel may then be classified as either a true indel or an introduced indel, using a combination of predetermined expectations of (i) an indel being detected in one or more sequence reads of the plurality of sequence reads, (ii) that a detected indel is a true indel present in a given cfDNA molecule of the cell-free DNA molecules, given that an indel has been detected in the one or more of the sequence reads, and/or (iii) that a detected indel is introduced by non-biological error, given that an indel has been detected in the one or more of the sequence reads, in conjunction with one or more model parameters to perform a hypothesis test.
- This approach may reduce error and improve accuracy of detecting an indel from sequence read data.
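One way to picture the classification step is a simple two-hypothesis Bayesian comparison over a family of reads. The sketch below is a stand-in, not the disclosed hypothesis test: the binomial read model, the prior, and both per-read probabilities are assumptions chosen only for illustration.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def posterior_true_indel(k, n, prior_true=0.01,
                         p_read_if_true=0.95, p_read_if_error=0.02):
    """Posterior probability that an indel seen in k of n reads is a true indel.

    prior_true      : assumed prior probability that the molecule carries the indel
    p_read_if_true  : assumed chance a read shows the indel when it is real
    p_read_if_error : assumed chance a read shows the indel through PCR or
                      sequencing error when it is not real
    """
    like_true = binom_pmf(k, n, p_read_if_true)
    like_error = binom_pmf(k, n, p_read_if_error)
    evidence = prior_true * like_true + (1 - prior_true) * like_error
    return prior_true * like_true / evidence


well_supported = posterior_true_indel(9, 10)    # indel in 9 of 10 reads
poorly_supported = posterior_true_indel(1, 10)  # indel in 1 of 10 reads
```

Under these assumed parameters, a candidate supported by most reads in its family is classified as a true indel, while one supported by a single read is classified as introduced.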
- Regions of a genome are selected for sequencing. These regions may be collectively referred to as a panel or a panel block.
- the panel is divided into a first set of genomic regions and a second set of genomic regions.
- the first set of genomic regions may be referred to as the backbone region, while the second set may be referred to as the hotspot regions.
- These regions may be divided between genes or within genes or outside genes as desired by the practitioner. For example, an exon of a gene may be divided into portions allocated to the hotspot region and portions allocated to the backbone region.
- a first bait set and a second bait set are prepared which selectively hybridize to the first genomic regions and the second genomic regions, respectively.
- bait set concentrations are determined which, for a test sample having a predetermined amount of DNA, capture DNA in the sample at a saturation point (for the bait set directed to the hotspot regions) and below the saturation point (for the bait set directed to the backbone regions). Capturing DNA molecules from a sample at the saturation point contributes to detecting genetic variants at the highest level of sensitivity because molecules carrying genetic variants are more likely to be captured.
- a “read budget” is a way to conceptualize the amount of genetic information that can be extracted from a sample.
- a per-sample read budget can be selected that identifies the total number of base reads to be allocated to a test sample comprising a predetermined amount of DNA in a sequencing experiment.
- the read budget can be based on total reads produced, e.g., including redundant reads produced through amplification. Alternatively, it can be based on number of unique molecules detected in the sample.
- read budget can reflect the amount of double-stranded support for a call at a locus. That is, the percentage of loci for which reads from both strands of a DNA molecule are detected.
- Factors of a read budget include read depth and panel length.
- a read budget of 3,000,000,000 reads can be allocated as 150,000 bases at an average read depth of 20,000 reads/base.
- Read depth can refer to number of molecules producing a read at a locus.
- the reads at each base can be allocated between bases in the backbone region of the panel, at a first average read depth, and bases in the hotspot region of the panel, at a deeper read depth.
- for example, if a read budget consists of 100,000 read counts for a given sample, then those 100,000 read counts will be divided between reads of backbone regions and reads of hotspot regions. Allocating a large number of those reads (e.g., 90,000 reads) to backbone regions will result in a small number of reads (e.g., the remaining 10,000 reads) being allocated to hotspot regions. Conversely, allocating a large number of reads (e.g., 90,000 reads) to hotspot regions will result in a small number of reads (e.g., the remaining 10,000 reads) being allocated to backbone regions.
- a skilled worker can allocate a read budget to provide desired levels of sensitivity and specificity.
- the read budget can be between 100,000,000 reads and 100,000,000,000 reads, e.g., between 500,000,000 reads and 50,000,000,000 reads, or between 1,000,000,000 reads and 5,000,000,000 reads across, for example, 20,000 bases to 100,000 bases.
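The read-budget arithmetic in the preceding examples can be reproduced directly. The function names are hypothetical, and the split is expressed here as a single backbone share, which is an illustrative simplification.

```python
def read_budget(panel_bases, mean_read_depth):
    """Total reads needed to cover panel_bases at mean_read_depth reads/base."""
    return panel_bases * mean_read_depth


def split_budget(total_reads, backbone_share):
    """Divide a read budget between backbone and hotspot regions.

    backbone_share is the fraction of reads given to the backbone;
    the hotspot regions receive whatever remains.
    """
    backbone_reads = round(total_reads * backbone_share)
    return backbone_reads, total_reads - backbone_reads


budget = read_budget(150_000, 20_000)            # 150,000 bases at 20,000 reads/base
backbone_heavy = split_budget(100_000, 0.9)      # most reads to the backbone
hotspot_heavy = split_budget(100_000, 0.1)       # most reads to the hotspots
```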
- First and second sensitivity levels are selected for detection of genetic variants in the backbone and hotspot regions, respectively.
- Sensitivity refers to the detection limit of a genetic variant as a function of frequency in a sample.
- the sensitivity may be at least 1%, at least 0.1%, at least 0.01%, at least 0.001%, at least 0.0001%, or at least 0.00001%, meaning that a given sequence can be detected in a sample at a frequency of at least 1%, at least 0.1%, at least 0.01%, at least 0.001%, at least 0.0001%, or at least 0.00001%, respectively. That is, genetic variants present in the sample at these levels are detectable by sequencing.
- sensitivity selected for hotspot regions will be higher than sensitivity selected for backbone regions.
- the sensitivity level for hotspot regions may be selected to be at least 0.001%, while the sensitivity level for backbone regions may be selected to be at least 0.1% or at least 1%.
- the relative concentrations of bait sets directed to backbone regions and hotspot regions can be selected to optimize reads in a sequencing experiment with respect to selected read budget and selected sensitivities for the backbone and hotspot regions for a selected sample. So, for example, given a test sample containing a predetermined amount of DNA, and a hotspot bait set that captures DNA for the hotspot regions at saturation, an amount of backbone bait set that is below saturation for the sample is selected such that in a sequencing experiment producing reads within the selected read budget, the resultant read set detects genetic variants in the hotspot regions and in the backbone regions at the preselected sensitivity levels.
- the relative amounts of the bait sets is a function of several factors.
- One of these factors is the relative proportion of the panel allocated to the hotspot regions and to the backbone regions respectively.
- the larger the relative percentage of hotspot regions in the panel, the fewer the reads in the budget that can be allocated to the backbone region.
- Another factor is the selected sensitivity of detection for hotspot regions. For a given sample, the higher the sensitivity that is necessary for the hotspot regions, the lower sensitivity will be for the backbone region.
- Another factor is the read budget. For a given sensitivity for the hotspot regions, the smaller the read budget, the lower the sensitivity possible for the backbone region.
- Another factor is the size of the overall panel. For any given read budget, the larger the panel, the more sensitivity of the backbone regions must be sacrificed to achieve the desired sensitivity at the hotspot regions.
- the relative sensitivity level of hotspot regions can be high enough to achieve targeted detection levels, while the sensitivity level at backbone regions is not so low that meaningful levels of genetic variants are missed. These relative levels are selected by the practitioner to achieve the desired results.
- the skilled worker will use a bait mixture calculated to capture all (or substantially all) hotspot regions in a sample and a portion of the backbone regions, such that the read depth of the captured regions will provide desired hotspot and backbone sensitivities.
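The interplay of the factors listed above (hotspot share of the panel, hotspot depth, read budget, and panel size) can be made concrete with a small calculation. The sketch below is illustrative only: it assumes the budget is counted as depth times bases, as in the read budget example, and the function name `backbone_depth` and all input values other than the 3,000,000,000-read, 150,000-base panel are hypothetical.

```python
def backbone_depth(total_budget: int, backbone_bases: int,
                   hotspot_bases: int, hotspot_depth: int) -> float:
    """Average depth left for backbone bases once hotspot depth is paid for."""
    if backbone_bases <= 0:
        raise ValueError("panel must contain backbone bases")
    remaining = total_budget - hotspot_bases * hotspot_depth
    if remaining < 0:
        raise ValueError("budget cannot cover the requested hotspot depth")
    return remaining / backbone_bases


# 150,000-base panel, 10% hotspot, hotspot depth assumed to be 100,000 reads/base.
base = backbone_depth(3_000_000_000, 135_000, 15_000, 100_000)
print(round(base))  # 11111

# A larger hotspot share, a deeper hotspot target, a smaller budget, or a
# larger panel each lower the depth available to the backbone.
print(backbone_depth(3_000_000_000, 130_000, 20_000, 100_000) < base)  # True
print(backbone_depth(3_000_000_000, 135_000, 15_000, 150_000) < base)  # True
print(backbone_depth(2_000_000_000, 135_000, 15_000, 100_000) < base)  # True
print(backbone_depth(3_000_000_000, 235_000, 15_000, 100_000) < base)  # True
```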
- a bait set panel may comprise one or more bait sets that selectively enrich for one or more nucleosome-associated regions of a genome.
- Nucleosome-associated regions may comprise genomic regions having one or more genomic base positions with differential nucleosomal occupancy. Differential nucleosomal occupancy may be characteristic of a cell or tissue type of origin or disease state. Analysis of differential nucleosomal occupancy may be performed using one or more nucleosomal occupancy profiles of a given cell or tissue type. Examples of nucleosomal occupancy profiling techniques include Statham et al., Genomics Data, Volume 3, March 2015, Pages 94-96 (2015), which is entirely incorporated herein by reference.
- Cell-free nucleic acids in a sample obtained from a subject may be primarily shed through a combination of apoptotic and necrotic processes in cells, tissues, and organs.
- nucleosomal patterns or profiles associated with apoptotic processes and necrotic processes may be evident from analyzing cell-free nucleic acid fragments for nucleosome-associated regions of a genome.
- nucleosome-associated patterns can be used, independently or in conjunction with detected somatic variants, to monitor a condition in a subject. For example, as a tumor expands, the ratio of necrosis to apoptosis in the tumor micro-environment may change. Such changes in necrosis and/or apoptosis can be detected by selectively enriching a cell-free nucleic acid sample for one or more nucleosome-associated regions. As another example, a distribution of fragment lengths may be observed due to differential nucleosomal protection across different cell types, or across tumor vs. non-tumor cells. Analysis of nucleosome-associated regions for fragment length distribution may be clinically relevant for cancer detection and assessment applications. This analysis may comprise selectively enriching for nucleosome-associated regions, then sequencing the enriched regions to produce a plurality of sequence reads representative of the nucleic acid sample, and analyzing the sequence reads for genetic variants and nucleosome profiles of interest.
- nucleosome-associated regions may be used for modular panel design. See below.
- Such modular panel design may allow for designs of a set of probes or baits that selectively enrich regions of the genome that are relevant for nucleosomal profiling. By incorporating this “nucleosomal awareness,” sequence data from many individuals can be gleaned to optimize the procedure of panel design, e.g., the determination of which genomic locations to target and the optimal concentration of probes for these genomic locations.
- panels of probes, baits or primers can be designed to target specific portions of the genome (“hotspots”) with known patterns or clusters of structural variation or instability.
- For example, statistical analysis of sequence data reveals a series of accumulated somatic events and structural variations, and thereby enables clonal evolution studies. The data analysis reveals important biological insights, including differential coverage across cohorts, patterns indicating the presence of certain subsets of tumors, foreign structural events in samples with high somatic mutation load, and differential coverage attributed from blood cells versus tumor cells.
- a localized genomic region refers to a short region of the genome that may range in length from, or from about, 2 to 200 base pairs, from 2 to 190 base pairs, from 2 to 180 base pairs, from 2 to 170 base pairs, from 2 to 160 base pairs, from 2 to 150 base pairs, from 2 to 140 base pairs, from 2 to 130 base pairs, from 2 to 120 base pairs, from 2 to 110 base pairs, from 2 to 100 base pairs, from 2 to 90 base pairs, from 2 to 80 base pairs, from 2 to 70 base pairs, from 2 to 60 base pairs, from 2 to 50 base pairs, from 2 to 40 base pairs, from 2 to 30 base pairs, from 2 to 20 base pairs, from 2 to 10 base pairs, and/or from 2 to 5 base pairs.
- Each localized genomic region may contain a pattern or cluster of significant structural variation or instability. Genome partitioning maps may be provided to identify relevant localized genomic regions. A localized genomic region may contain a pattern or cluster of significant structural variation or structural instability. A cluster may be a hotspot region within a localized genomic region. The hotspot region may contain one or more significant fluctuations or peaks.
- a structural variation may be selected from the group consisting of: an insertion, a deletion, a translocation, a gene re-arrangement, methylation status, a micro-satellite, a copy number variation, a copy number-related structural variation, or any other variation which indicates differentiation. A structural variation can cause a variation in nucleosomal positioning.
- a genome partitioning map may be obtained by: (a) providing samples of cell-free DNA or RNA from two or more subjects in a cohort, (b) obtaining a plurality of sequence reads from each of the samples of cell-free DNA or RNA, and (c) analyzing the plurality of sequence reads to identify one or more localized genomic regions, each of which contains a pattern or cluster of significant structural variation or instability. Statistical analysis may be performed on sequence information to associate a set of sequence reads with one or more nucleosomal occupancy profiles representing distinct cohorts (e.g., a group of subjects with a common characteristic such as a disease state or a non-disease state).
- the statistical analysis may comprise providing one or more genome partitioning maps listing relevant genomic intervals representative of genes of interest for further analysis.
- the statistical analysis may further comprise selecting a set of one or more localized genomic regions based on the genome partitioning maps.
- the statistical analysis may further comprise analyzing one or more localized genomic regions in the set to obtain a set of one or more nucleosomal map disruptions.
- the statistical analysis may comprise one or more of (e.g., one or more, two or more, or three of): pattern recognition, deep learning, and unsupervised learning.
- a nucleosomal map disruption is a measured value that characterizes a given localized genomic region in terms of biologically relevant information.
- a nucleosomal map disruption may be associated with a driver mutation chosen from the group consisting of: wild-type, somatic variant, germline variant, and DNA methylation.
- nucleosomal map disruptions may be used to classify a set of sequence reads as being associated with one or more nucleosomal occupancy profiles representing distinct cohorts. These nucleosomal occupancy profiles may be associated with one or more assessments. An assessment may be considered as part of a therapeutic intervention (e.g., treatment options, selection of treatment, further assessment by biopsy and/or imaging).
- An assessment may be selected from the group consisting of: indication, tumor type, tumor severity, tumor aggressiveness, tumor resistance to treatment, and tumor clonality.
- An assessment of tumor clonality may be determined from observing heterogeneity in nucleosomal map disruption across cell-free DNA molecules in a sample. An assessment of relative contributions of each of two or more clones is determined.
- Each of the one or more nucleosome-associated regions of a bait set panel may comprise at least one of: (i) significant structural variation, comprising a variation in nucleosomal positioning, said structural variation selected from the group consisting of: an insertion, a deletion, a translocation, a gene rearrangement, methylation status, a micro-satellite, a copy number variation, a copy number-related structural variation, or any other variation which indicates differentiation; and (ii) instability, comprising one or more significant fluctuations or peaks in a genome partitioning map indicating one or more locations of nucleosomal map disruptions in a genome.
- the one or more bait sets of a bait set panel may be configured to capture nucleosome-associated regions of the genome based on a function of a plurality of reference nucleosomal occupancy profiles associated with one or more disease states and one or more non-disease states.
- the one or more bait sets of a bait set panel may selectively enrich for one or more nucleosome-associated regions in a cell-free deoxyribonucleic acid (cfDNA) sample.
- the bait set may selectively enrich for one or more nucleosome-associated regions by bringing a nucleic acid sample in contact with the bait set, and allowing the bait set to selectively hybridize to the set of nucleosome-associated genomic regions associated with the bait set.
- a method for enriching a nucleic acid sample for nucleosome-associated regions of a genome may comprise (a) bringing a nucleic acid sample in contact with a bait set panel, said bait set panel comprising one or more bait sets that selectively enrich for one or more nucleosome-associated regions of a genome; and (b) enriching the nucleic acid sample for one or more nucleosome-associated regions of a genome.
- the one or more bait sets in a bait set panel may be configured to capture nucleosome-associated regions of the genome based on a function of a plurality of reference nucleosomal occupancy profiles associated with one or more disease states and one or more non-disease states.
- the plurality of reference nucleosomal occupancy profiles may serve as a “map” for which analysis may reveal patterns or clusters of genomic regions and/or locations which may be targeted for capture for nucleosome-associated variant detection.
- the one or more bait sets in a bait set panel may selectively enrich for the one or more nucleosome-associated regions in a cell-free deoxyribonucleic acid (cfDNA) sample.
- the method for enriching a nucleic acid sample for nucleosome-associated regions of a genome may further comprise sequencing the enriched nucleic acids to produce sequence reads of the nucleosome-associated regions of a genome. These sequence reads may be aligned to a reference genome and analyzed for nucleosome-associated and/or genetic variants (e.g., SNVs and/or indels).
- a method for generating a bait set may comprise (a) identifying one or more regions of a genome, said regions associated with a nucleosome profile, and (b) selecting a bait set to selectively capture said regions.
- a bait set in a bait set panel may selectively enrich for one or more nucleosome-associated genomic regions in a cell-free deoxyribonucleic acid (cfDNA) sample.
- the bait set may selectively enrich for one or more nucleosome-associated regions by bringing a nucleic acid sample in contact with the bait set, and allowing the bait set to selectively hybridize to the set of nucleosome-associated genomic regions associated with the bait set.
- a bait panel may comprise a first bait set that selectively hybridizes to a first set of genomic regions of a nucleic acid sample comprising a predetermined amount of DNA, wherein the first bait set may be provided at a first concentration ratio that is less than a saturation point of the first bait set; and a second bait set that selectively hybridizes to a second set of genomic regions of the nucleic acid sample, wherein the second bait set may be provided at a second concentration ratio that is associated with a saturation point of the second bait set.
- a concentration associated with a saturation point can be at or above the saturation point. In some embodiments, a concentration associated with a saturation point is at or above a point that is 10% below the saturation point.
- the first set of genomic regions may comprise one or more backbone genomic regions.
- the second set of genomic regions may comprise one or more hotspot genomic regions.
- the predetermined amount of DNA may be about 200 ng, about 150 ng, about 125 ng, about 100 ng, about 75 ng, about 50 ng, about 25 ng, about 10 ng, about 5 ng, and/or about 1 ng.
- a method for enriching for multiple genomic regions may comprise bringing a predetermined amount of a nucleic acid sample in contact with a bait panel comprising (i) a first bait set that selectively hybridizes to a first set of genomic regions of the nucleic acid sample, which may be provided at a first concentration ratio that is less than a saturation point of the first bait set, and (ii) a second bait set that selectively hybridizes to a second set of genomic regions of the nucleic acid sample, which may be provided at a second concentration ratio that is associated with a saturation point of the second bait set; and enriching the nucleic acid sample for the first set of genomic regions and the second set of genomic regions.
- Enriching can comprise the following steps: (a) bringing sample nucleic acid into contact with a bait set; (b) capturing nucleic acids from the sample by hybridizing them to probes in the bait set; and (c) separating captured nucleic acids from un-captured nucleic acids.
- capture of the second set of genomic regions at a saturation point of its bait set may yield high-sensitivity detection of variants of the second set of genomic regions (e.g., hot-spot regions), while capture of the first set of genomic regions below the saturation point of its bait set may be desired for the first set of genomic regions (e.g., backbone regions).
- the flexibility of this method to adjust the capture of different bait sets at or below their saturation levels may be leveraged to strategically select genomic regions of interest for hot-spot or backbone bait set panels, given each genomic region's characteristics such as sequencing load and utility.
- the method may further comprise sequencing the enriched nucleic acids to produce a plurality of sequence reads of the first set of genomic regions and the second set of genomic regions. These sequence reads may be analyzed for cancer-relevant genetic variants (e.g., SNVs and indels) for cancer detection and assessment applications.
- saturation point refers to saturation of binding kinetics.
- as the concentration of a bait or set of baits increases, the amount of target that binds to the bait or set of baits also increases.
- the amount of target in a given sample will be fixed, and thus, at a certain point, effectively all the target in the sample will be bound to the bait (or set of baits). Therefore, as bait concentrations increase beyond this point, the amount of bound target will not substantially increase because the system will approach binding equilibrium (the rates at which bait molecules bind and release target molecules will start to converge).
- Saturation point refers to a concentration or amount of bait at which point increasing that concentration or amount does not substantially increase the amount of target material captured from a sample, e.g., that point at which increases in the concentration of bait produce increasingly diminished increases in total amount of target material captured.
- the point at which increasing the concentration or amount of a bait does not substantially increase the amount of target material captured from a sample is the point at which increasing the concentration or amount of bait produces no increase in the amount of target captured from the sample.
- the saturation point can be an inflection point on a saturation curve measuring the amount of captured target nucleic acid with increasing concentrations of the bait set.
- the saturation point can be the point at which an increase of 100% in the bait concentration (e.g., 2× or twice the concentration) increases an amount of target captured by any of less than 20%, less than 19%, less than 18%, less than 17%, less than 16%, less than 15%, less than 14%, less than 13%, less than 12%, less than 11%, less than 10%, less than 9%, less than 8%, less than 7%, less than 6%, less than 5%, less than 4%, less than 3%, less than 2%, or less than 1%.
- the saturation point can be the point at which an increase of 50% in the bait concentration increases an amount of target captured by any of less than 20%, less than 19%, less than 18%, less than 17%, less than 16%, less than 15%, less than 14%, less than 13%, less than 12%, less than 11%, less than 10%, less than 9%, less than 8%, less than 7%, less than 6%, less than 5%, less than 4%, less than 3%, less than 2%, or less than 1%.
- the saturation point can be the point at which an increase of 20% in the bait concentration increases an amount of target captured by any of less than 20%, less than 19%, less than 18%, less than 17%, less than 16%, less than 15%, less than 14%, less than 13%, less than 12%, less than 11%, less than 10%, less than 9%, less than 8%, less than 7%, less than 6%, less than 5%, less than 4%, less than 3%, less than 2%, or less than 1%.
- the saturation point can be the point at which an increase of 10% in the bait concentration increases an amount of target captured by any of less than 20%, less than 19%, less than 18%, less than 17%, less than 16%, less than 15%, less than 14%, less than 13%, less than 12%, less than 11%, less than 10%, less than 9%, less than 8%, less than 7%, less than 6%, less than 5%, less than 4%, less than 3%, less than 2%, or less than 1%.
- the saturation point can be the point at which an increase of 100% in the bait concentration (e.g., 2× or twice the concentration) increases an amount of target captured by at most 20%.
- the saturation point can be the point at which an increase of 50% in the bait concentration (e.g., 1.5× the concentration) increases an amount of target captured by at most 20%.
- the saturation point can be the point at which an increase of 20% in the bait concentration (e.g., 1.2× the concentration) increases an amount of target captured by at most 20%.
- the saturation point can be the point at which an increase of 10% in the bait concentration (e.g., 1.1× the concentration) increases an amount of target captured by at most 20%.
- a saturation curve can be generated, for example, by titrating differing amounts of target nucleic acids against a fixed or varying amount of baits (e.g., baits fixed on a microarray) to measure the amount of target nucleic acid (including, for example, the number of unique molecules) bound to the baits.
- a saturation curve also can be generated, for example, by titrating differing amounts of baits against a fixed or varying amount of target nucleic acids to measure the amount of target nucleic acid (including, for example, the number of unique molecules) bound to the baits.
- a saturation curve can be generated using a subset of sequence reads as a measure of target nucleic acid (e.g., unique molecule count) captured.
- sequence reads can be categorized as having either single stranded support (when all reads within a group of unique reads are from the same original nucleic acid strand of a double stranded nucleic acid such as DNA) or double stranded support (when the reads within a group of unique reads are from both original nucleic acid strands of a double stranded nucleic acid such as DNA).
- for double stranded support, the skilled worker would understand to count only captured unique molecules for which both strands are observed.
- Double stranded support can be determined, for example, by differentially tagging each of the two different strands of a nucleic acid such that the reads for each strand can be counted separately.
- a target nucleic acid with double stranded support will require a higher amount of bait to reach saturation for that target than would be required for a target with single stranded support.
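The distinction between single stranded and double stranded support can be sketched as a grouping step over tagged reads. The example below is an illustration under stated assumptions: each read is reduced to a hypothetical `(molecule_id, strand)` pair, where the strand label stands in for whatever differential tag distinguishes the two original strands, and the function names are not from the source.

```python
from collections import defaultdict


def classify_support(reads):
    """Map each molecule to 'double' if reads from both strands were seen."""
    strands_seen = defaultdict(set)
    for molecule_id, strand in reads:
        if strand not in ("+", "-"):
            raise ValueError(f"unexpected strand label: {strand!r}")
        strands_seen[molecule_id].add(strand)
    return {
        molecule_id: "double" if len(strands) == 2 else "single"
        for molecule_id, strands in strands_seen.items()
    }


def count_double_stranded(reads):
    """Count unique molecules for which both original strands are observed."""
    return sum(1 for kind in classify_support(reads).values() if kind == "double")


reads = [("m1", "+"), ("m1", "-"), ("m1", "+"), ("m2", "+"), ("m2", "+")]
print(classify_support(reads))       # {'m1': 'double', 'm2': 'single'}
print(count_double_stranded(reads))  # 1
```

Counting only the "double" molecules gives the stricter measure of capture, which is why more bait may be needed before that count stops rising.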
- FIG. 6 depicts an exemplary saturation curve showing unique molecule count on the y-axis as a function of input bait amount on the x-axis.
- the amount of bait panel was titrated to generate the curve.
- Exemplary experimental titration curve designs are shown in Table 1 and Table 2 below. Number of unique sequence reads vs. input bait amount can be used to generate a titration curve as shown in FIG. 6 .
- a person of skill in the art can calculate a saturation point. For example, looking at Vol. 0.8×, the unique molecule count is approximately 2700. At 2× the amount of bait (Vol. 1.6×), the unique molecule count is approximately 3200, a difference of 500. Thus, doubling the amount of bait results in an increase in capture of about 18.5%. By contrast, at Vol. 2×, the unique molecule count is approximately 3250, and at 1 μl, the unique molecule count is approximately 3500, a difference of 250. Doubling the amount of bait here results in an increase in capture of only about 7.7%. Accordingly, a person of skill in the art looking to use a saturation point at which an increase of 100% in the bait concentration increases an amount of target captured by less than 8% might therefore use Vol. 2× of bait as the saturation point.
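The percentages in this worked example follow from dividing the gain in unique molecule count by the count before doubling. A minimal check, using only the approximate counts quoted above (the helper names `relative_gain` and `first_saturated` and the numeric bait-amount keys are illustrative assumptions):

```python
def relative_gain(count_before: float, count_after_doubling: float) -> float:
    """Fractional increase in captured molecules when the bait amount is doubled."""
    return (count_after_doubling - count_before) / count_before


def first_saturated(points, max_gain):
    """Return the first bait amount whose doubling gain is below max_gain.

    points: list of (bait_amount, count_before, count_after_doubling),
    ordered by increasing bait amount.
    """
    for bait_amount, before, after in points:
        if relative_gain(before, after) < max_gain:
            return bait_amount
    return None


print(f"{relative_gain(2700, 3200):.1%}")  # 18.5%
print(f"{relative_gain(3250, 3500):.1%}")  # 7.7%

# Bait amounts 0.8 and 2 are in the units used in FIG. 6.
points = [(0.8, 2700, 3200), (2, 3250, 3500)]
print(first_saturated(points, max_gain=0.08))  # 2
```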
- the bait set can capture any of at least 40%, at least 50%, at least 60%, at least 70%, at least 80%, at least 85%, at least 86%, at least 87%, at least 88%, at least 89%, at least 90%, at least 91%, at least 92%, at least 93%, at least 94%, at least 95%, at least 96%, at least 97%, at least 98%, and/or at least 99% of a target sequence in a sample.
- Saturation point can refer to the saturation point of a bait set or of a particular bait, depending on the context in which the term is used.
- the saturation point of a bait set may be determined by the following method: (a) for each of the baits in the bait set, generating a titration curve comprising (i) measuring the capture efficiency of the bait on a given amount of input sample (e.g., test sample) as a function of the concentration of the bait, and (ii) identifying an inflection point within the titration curve, thereby identifying a saturation point associated with the bait; and (b) selecting a saturation point that is larger than substantially all of the saturation points associated with baits in the bait set, thereby determining the saturation point of the bait set.
- the selection of a saturation point may be influenced by capture efficiency of a bait and the associated costs, such that the concentration at the saturation point may be high enough to achieve a desired capture efficiency, while still low enough to ensure reasonable assay reagent costs.
- the capture efficiency of a bait may be determined by (a) providing a plurality of nucleic acid samples obtained from a plurality of subjects in a cohort; (b) hybridizing the bait with each of the nucleic acid samples, at each of a plurality of concentrations of the bait; (c) enriching with the bait, a plurality of genomic regions of the nucleic acid samples, at each of the plurality of concentrations of the bait; and (d) measuring number of unique nucleic acid molecules or nucleic acid molecules with representation of both strands of an original double-stranded nucleic acid molecule representing the capture efficiency at each of the plurality of concentrations of the bait.
- the capture efficiency of a bait (e.g., the percentage of molecules containing the target genomic region of the bait that are captured from a sample comprising such molecules) can be measured as a function of the concentration of the bait.
- An inflection point may be a first concentration of a bait such that observed capture efficiency does not increase significantly at concentrations of the bait greater than the first concentration.
- An inflection point may be a first concentration of the bait such that an observed increase between (1) the capture efficiency at a bait concentration of twice the first concentration compared to (2) the capture efficiency at the first bait concentration, is less than about 1%, less than about 2%, less than about 3%, less than about 4%, less than about 5%, less than about 6%, less than about 7%, less than about 8%, less than about 9%, less than about 10%, less than about 12%, less than about 14%, less than about 16%, less than about 18%, less than about 20%, less than about 30%, less than about 40%, or less than about 50%.
- Such an identified inflection point can be considered a saturation point associated with a bait.
- a bait can be used at a concentration of a saturation point in an assay to enable optimal capture of a target genomic region and hence sensitivity of detecting genetic variants of the target genomic region.
- the saturation point associated with a bait set is the saturation point of the weakest bait in that bait set.
- the bait set has a saturation point that is larger than substantially all of the saturation points associated with baits in the bait set when a bait of the bait set is subjected to a titration curve generated by (i) measuring the capture efficiency of a bait of the bait set as a function of the concentration of the bait, and (ii) identifying an inflection point within the titration curve, thereby identifying a saturation point associated with the bait.
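The two-step procedure described above (a saturation point per bait, then one for the bait set) can be sketched as follows. This is one possible reading, not the definitive method: the per-bait inflection point is taken as the lowest measured concentration at which doubling the concentration raises capture efficiency by less than a chosen fraction, and the bait set value is taken as the maximum over all baits, which satisfies "larger than substantially all" and matches the "weakest bait" wording. The titration values, bait names, and function names are hypothetical.

```python
def bait_saturation_point(curve, max_gain=0.10):
    """Lowest concentration c where efficiency(2c) exceeds efficiency(c) by < max_gain.

    curve: dict mapping bait concentration -> measured capture efficiency.
    Only concentrations whose doubled value was also measured are considered.
    """
    for conc in sorted(curve):
        doubled = 2 * conc
        if doubled in curve and curve[conc] > 0:
            gain = (curve[doubled] - curve[conc]) / curve[conc]
            if gain < max_gain:
                return conc
    return None


def bait_set_saturation_point(curves, max_gain=0.10):
    """Saturation point of a bait set: the largest per-bait saturation point."""
    points = {name: bait_saturation_point(c, max_gain) for name, c in curves.items()}
    unsaturated = [name for name, p in points.items() if p is None]
    if unsaturated:
        raise ValueError(f"no saturation point found for: {unsaturated}")
    return max(points.values())


curves = {
    "bait_a": {1: 0.40, 2: 0.60, 4: 0.64, 8: 0.65},
    "bait_b": {1: 0.20, 2: 0.35, 4: 0.55, 8: 0.58},  # the weaker bait
}
print(bait_saturation_point(curves["bait_a"]))  # 2
print(bait_saturation_point(curves["bait_b"]))  # 4
print(bait_set_saturation_point(curves))        # 4
```

Taking the maximum favors capture over reagent cost; a high percentile of the per-bait values would be a cheaper alternative that still covers substantially all baits.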
- the nucleic acid sample may be a cell-free nucleic acid sample (e.g., cfDNA).
- a method for enriching for multiple genomic regions may further comprise sequencing the enriched nucleic acid sample to produce a plurality of sequence reads.
- a method for enriching for multiple genomic regions may further comprise producing an output comprising a nucleic acid sequence representative of the nucleic acid sample. This nucleic acid sequence may then be aligned to a reference genome and analyzed for cancer-relevant genetic variants through bioinformatics approaches.
- An original molecule can produce redundant sequence reads, for example, after amplification and sequencing of amplicons, or by repeated sequencing of the same molecule.
- Redundant sequence reads from an original molecule can be collapsed into a consensus sequence (e.g., a “unique sequence”) representing the sequence of the original molecule. This can be done by generating a consensus sequence for the full molecule, for part of the molecule or at a single nucleotide position in the molecule (consensus nucleotide).
- sequenced polynucleotide refers either to sequence reads generated from amplicons of an original molecule, or a consensus sequence of an original molecule derived from such amplicons.
- Reads can be unique based on the sequence of an original molecule, or based on the sequence of an original molecule plus one or more barcode sequences attached to an original molecule. For example, two identical original molecules can still yield unique reads if their barcodes are different. Likewise, two different original molecules will produce unique reads even if their barcodes are the same. Consensus sequences can be unique sequences when they are generated by grouping unique reads.
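Collapsing redundant reads into consensus sequences can be illustrated with a simple majority vote. The sketch below makes several simplifying assumptions that are not in the source: reads are already aligned and of equal length within a family, a family is keyed by a hypothetical `(barcode, start, length)` tuple, and ties are broken by first occurrence.

```python
from collections import Counter, defaultdict


def collapse_reads(reads):
    """Group reads into families and return one consensus sequence per family.

    reads: iterable of (barcode, start_position, sequence).
    """
    families = defaultdict(list)
    for barcode, start, seq in reads:
        families[(barcode, start, len(seq))].append(seq)

    consensus = {}
    for key, seqs in families.items():
        # Majority vote at each nucleotide position (the consensus nucleotide).
        consensus[key] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*seqs)
        )
    return consensus


reads = [
    ("AA", 100, "ACGT"),
    ("AA", 100, "ACGT"),
    ("AA", 100, "ACTT"),  # redundant read carrying a likely sequencing error
    ("BB", 100, "ACGT"),  # identical sequence, different barcode
]
for key, seq in collapse_reads(reads).items():
    print(key, seq)
# ('AA', 100, 4) ACGT
# ('BB', 100, 4) ACGT
```

The two output families have the same sequence but remain distinct because their barcodes differ, which is the case described above for two identical original molecules.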
- a bait panel may comprise a first bait set that selectively captures backbone regions of a genome, said backbone regions associated with a ranking function of sequencing load and utility, wherein the ranking function of each backbone region has a value less than a predetermined threshold value; and a second bait set that selectively captures hotspot regions of a genome, said hotspot regions associated with a ranking function of sequencing load and utility, wherein the ranking function of each hotspot region has a value greater than or equal to the predetermined threshold value.
- This approach may use at least two bait sets corresponding to backbone and hotspot regions.
- Hotspot regions may be relatively more important than backbone regions to capture and analyze in a given cell-free nucleic acid sample due to their relatively high utility and/or relatively low sequencing load.
- the selection of a given region as a hotspot region or a backbone region depends on its ranking function value, which is calculated as a function of sequencing load and utility.
- a ranking function value may be calculated as utility of a genomic region divided by sequencing load of a genomic region.
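The threshold rule above can be sketched directly, using the ratio form of the ranking function (utility divided by sequencing load). The region names, utility and load values, and the threshold below are hypothetical and serve only to show the partition.

```python
def ranking_value(utility: float, sequencing_load: float) -> float:
    """Ranking function value: utility per unit of sequencing load."""
    if sequencing_load <= 0:
        raise ValueError("sequencing load must be positive")
    return utility / sequencing_load


def partition_regions(regions, threshold):
    """Split regions into (hotspot, backbone) lists by ranking value.

    regions: dict mapping region name -> (utility, sequencing_load).
    A region is a hotspot when its ranking value is >= threshold.
    """
    hotspot, backbone = [], []
    for name, (utility, load) in regions.items():
        target = hotspot if ranking_value(utility, load) >= threshold else backbone
        target.append(name)
    return hotspot, backbone


regions = {
    "region_A": (8.0, 2.0),  # ranking value 4.0
    "region_B": (1.0, 4.0),  # ranking value 0.25
    "region_C": (3.0, 3.0),  # ranking value 1.0, exactly at the threshold
}
print(partition_regions(regions, threshold=1.0))
# (['region_A', 'region_C'], ['region_B'])
```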
- the backbone or hotspot regions may comprise one or more nucleosome informative regions.
- Nucleosome informative regions may comprise a region of maximum nucleosome differentiation.
- the bait panel may further comprise a second bait set that selectively captures disease informative regions.
- the baits in the first bait set may be at a first concentration (e.g., a first concentration relative to the bait panel), and the baits in the second bait set may be at a second concentration (e.g., a second concentration relative to the bait panel).
- a method for generating a bait set may comprise identifying one or more backbone genomic regions of interest, wherein the identifying the one or more backbone genomic regions may comprise maximizing a ranking function of sequencing load and utility associated with each of the backbone genomic regions; identifying one or more hotspot genomic regions of interest; creating a first bait set that selectively captures the backbone genomic regions of interest; and creating a second bait set that selectively captures the hot-spot genomic regions of interest.
- the second bait set may have a higher capture efficiency than the first bait set.
- the one or more hot-spots may be selected using one or more of (e.g., one or more, two or more, three or more, or four of) the following: (i) maximizing a ranking function of sequencing load and utility associated with each of the hot-spot genomic regions, (ii) nucleosome profiling across the one or more genomic regions of interest, (iii) predetermined cancer driver mutations or prevalence across a relevant patient cohort, and (iv) empirically identified cancer driver mutations.
- Identifying one or more hotspots of interest may comprise using a programmed computer processor to rank a set of hotspot genomic regions based on a ranking function of sequencing load and utility associated with each of the hotspot genomic regions.
- Identifying the one or more backbone genomic regions of interest may comprise ranking a set of backbone genomic regions based on a ranking function of sequencing load and utility associated with each of the backbone genomic regions of interest.
- Identifying the one or more hot-spot genomic regions of interest may comprise utilizing a set of empirically determined minor allele frequency (MAF) values or clonality of a variant measured by its MAF in relationship to the highest presumed driver or clonal mutation in a sample obtained from one or more subjects in a cohort of interest.
- Genomic regions that have relatively high MAF values in a cohort of interest may be suitable hotspots because they may indicate cancer-relevant assessments such as detection, cell type or tissue of origin, tumor burden, and/or treatment efficacy.
- Sequencing load of a genomic region may be calculated by multiplying together one or more of (e.g., one or more, two or more, three or more, four or more, or five of) (i) size of the genomic region in base pairs, (ii) relative fraction of reads spent on sequencing fragments mapping to the genomic region, (iii) relative coverage as a result of sequence bias of the genomic region, (iv) relative coverage as a result of amplification bias of the genomic region, and (v) relative coverage as a result of capture bias of the genomic region.
- This indicator may be calculated for each genomic region in a bait panel set to identify the “costs” associated with generating sequence reads associated with the genomic region from a nucleic acid sample.
- the sequencing load of a genomic region is linearly proportional to the size of the genomic region in base pairs.
- the relative fraction of reads spent on sequencing fragments mapping to the genomic region also influences the sequencing load of the genomic region, since some genomic regions may be especially difficult to sequence reliably (e.g., due to high GC-content or the presence of highly repeating sequences) and hence may require higher sequencing depth for analysis at the bait's desired resolution.
- relative coverage as a result of sequence bias, amplification bias, and/or capture bias of the genomic region may also affect the sequencing load of the genomic region.
- the total sequencing load of a given assay's sequencing run may then be calculated by summing all sequencing loads of the baits (including hot-spots and backbone regions) in the assay's selected bait panel set.
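- The sequencing load arithmetic described above can be sketched as follows. This is an illustrative sketch only: the function names, the default value of 1.0 for any factor that is not supplied, and the example numbers are assumptions made here, not part of the disclosure.

```python
import math


def sequencing_load(size_bp, read_fraction=1.0, sequence_bias=1.0,
                    amplification_bias=1.0, capture_bias=1.0):
    """Multiply the factors together; an omitted factor defaults to 1.0 (assumption)."""
    return math.prod([size_bp, read_fraction, sequence_bias,
                      amplification_bias, capture_bias])


def total_sequencing_load(baits):
    """Total load of a panel: the sum of the per-region loads."""
    return sum(sequencing_load(**bait) for bait in baits)


# Hypothetical two-region panel.
panel = [
    {"size_bp": 1000, "read_fraction": 1.5},  # load 1500.0
    {"size_bp": 500, "capture_bias": 2.0},    # load 1000.0
]
total = total_sequencing_load(panel)
```

Because the load is a plain product, doubling the region size in base pairs doubles the load, consistent with the linear proportionality noted above.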
- utility of a genomic region may be calculated by multiplying together one or more of (e.g., one or more, two or more, three or more, four or more, five or more, six or more, or seven of) the following utility factors: (i) presence of one or more actionable mutations in the genomic region, (ii) frequency of one or more actionable mutations in the genomic region, (iii) presence of one or more mutations associated with above-average minor allele frequencies (MAFs) in the genomic region, (iv) frequency of one or more mutations associated with above-average MAFs in the genomic region, (v) fraction of patients in a cohort harboring a somatic mutation within the genomic region, (vi) sum of MAFs for variants in patients in a cohort, said patients harboring a somatic mutation within the genomic region, and (vii) ratio of (1) MAF for variants in patients in a cohort, said patients harboring a somatic mutation within the genomic region, to (2) maximum MAF for a given patient in the cohort.
- the goal of calculating utility of a genomic region may be to help assess its relative importance for inclusion in a bait set panel. For example, the presence and/or frequency of one or more actionable mutations in the genomic region affect the utility of a genomic region for inclusion in a bait set panel, since genomic regions containing highly frequent mutations are good markers (e.g., indicators) of disease states including cancer. Similarly, the selection of genomic regions with presence and/or frequency of mutations associated with above-average MAFs will enable highly sensitive detection of these mutations in a liquid biopsy assay.
- the fraction of patients in a cohort harboring a somatic mutation within the genomic region may indicate driver mutations that are suitable as a marker for the cohort's disease (e.g., breast, colorectal, pancreatic, prostate, melanoma, lung, or liver).
- the sum of MAF for variants in patients in a cohort may be used as a utility factor.
- the ratio of (1) MAF for variants in patients in a cohort, said patients harboring a somatic mutation within the genomic region, to (2) maximum MAF for a given patient in the cohort may be used as a utility factor.
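- The cohort-based utility factors (v) through (vii) can be sketched as follows. This is a minimal example under stated assumptions: each patient is represented by a hypothetical dictionary mapping a region name to the MAF of a somatic variant found there, and the cohort values are invented for illustration.

```python
def cohort_utility_factors(patients, region):
    """Return (fraction of carriers, sum of MAFs, per-carrier MAF / max MAF)."""
    carriers = [p for p in patients if p.get(region, 0.0) > 0.0]
    fraction = len(carriers) / len(patients)
    maf_sum = sum(p[region] for p in carriers)
    # Ratio of the region's MAF to the highest MAF observed in the same patient.
    ratios = [p[region] / max(p.values()) for p in carriers]
    return fraction, maf_sum, ratios


# Hypothetical cohort of four patients.
cohort = [
    {"region_A": 0.5, "region_B": 0.25},
    {"region_A": 0.25},
    {"region_B": 0.125},
    {"region_C": 0.5},
]
fraction, maf_sum, ratios = cohort_utility_factors(cohort, "region_B")
```

How the per-patient ratios are combined into a single factor (for example, by averaging) is not specified in the text, so the sketch returns them as a list.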
- Mutations associated with higher minor allele frequencies may comprise one or more driver mutations or may be known from external data or annotation sources.
- Actionable mutations may comprise mutations whose detected presence may influence or determine clinical decisions (e.g., diagnosis, cancer monitoring, therapy monitoring, assessment of therapy efficacy). Actionable mutations may comprise one or more of (e.g., one or more, two or more, three or more, four or more, five or more, six or more, or seven of) (i) druggable mutations, (ii) mutations for therapeutic monitoring, (iii) disease specific mutations, (iv) tissue specific mutations, (v) cell type specific mutations, (vi) resistance mutations, and (vii) diagnostic mutations.
- Druggable mutations may include those mutations whose detected presence in a nucleic acid sample from a subject may indicate that the subject is an appropriate candidate for treatment with a certain drug associated with the mutation (e.g., detection of EGFR L858R mutation may indicate the need to treat with a tyrosine kinase inhibitor (TKI) treatment).
- Mutations for therapeutic monitoring include those mutations whose detected presence or increased level in a nucleic acid sample from a subject may indicate that the subject's cancer is responding to a treatment course.
- Resistance mutations include those mutations whose detected presence or increased level in a nucleic acid sample from a subject may indicate that the subject's cancer has become resistant to a treatment course (e.g., emergence of EGFR T790M mutation may indicate the onset of resistance). Mutations may be specific to a disease (e.g., tumor type), tissue type, or cell type, whose detection may indicate cancer, inflammation, or another disease state in a particular organ, tissue, or cell type.
- genomic regions used in the methods of the present disclosure comprise at least a portion of at least 5, at least 10, at least 15, at least 20, at least 25, at least 30, at least 35, at least 40, at least 45, at least 50, at least 55, at least 60, at least 65, at least 70, at least 75, at least 80, at least 85, at least 90, at least 95, or 97 of the genes of Table 3.
- genomic regions used in the methods of the present disclosure comprise at least 5, at least 10, at least 15, at least 20, at least 25, at least 30, at least 35, at least 40, at least 45, at least 50, at least 55, at least 60, at least 65, or 70 of the SNVs of Table 3.
- genomic regions used in the methods of the present disclosure comprise at least 1, at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, or 18 of the CNVs of Table 3.
- genomic regions used in the methods of the present disclosure comprise at least 1, at least 2, at least 3, at least 4, at least 5, or 6 of the fusions of Table 3.
- genomic regions used in the methods of the present disclosure comprise at least a portion of at least 1, at least 2, or 3 of the indels of Table 3.
- genomic regions used in the methods of the present disclosure comprise at least a portion of at least 5, at least 10, at least 15, at least 20, at least 25, at least 30, at least 35, at least 40, at least 45, at least 50, at least 55, at least 60, at least 65, at least 70, at least 75, at least 80, at least 85, at least 90, at least 95, at least 100, at least 105, at least 110, or 115 of the genes of Table 4.
- genomic regions used in the methods of the present disclosure comprise at least 5, at least 10, at least 15, at least 20, at least 25, at least 30, at least 35, at least 40, at least 45, at least 50, at least 55, at least 60, at least 65, at least 70, or 73 of the SNVs of Table 4.
- genomic regions used in the methods of the present disclosure comprise at least 1, at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, or 18 of the CNVs of Table 4. In some embodiments, genomic regions used in the methods of the present disclosure comprise at least 1, at least 2, at least 3, at least 4, at least 5, or 6 of the fusions of Table 4.
- genomic regions used in the methods of the present disclosure comprise at least a portion of at least 1, at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, or 18 of the indels of Table 4.
- Each of these genomic locations of interest may be identified as a backbone region or hot-spot region for a given bait set panel. An exemplary listing of hot-spot genomic locations of interest may be found in Table 5.
- genomic regions used in the methods of the present disclosure comprise at least a portion of at least 1, at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, at least 18, at least 19, or at least 20 of the genes of Table 5.
- Each hot-spot genomic region is listed with several characteristics, including the associated gene, chromosome on which it resides, the start and stop position of the genome representing the gene's locus, the length of the gene's locus in base pairs, the exons covered by the gene, and the critical feature (e.g., type of mutation) that a given genomic region of interest may seek to capture.
- a bait panel may comprise a plurality of bait sets, each bait set (i) comprising one or more baits that selectively capture one or more genomic regions with utility in the same quantile across the plurality of baits, and (ii) having a different relative concentration from each of the other bait sets with utility in a different quantile across the plurality of baits.
- Quantiles may be, for example, two halves, three thirds, four quarters, etc.
- a bait panel may comprise three bait sets, each bait set comprising baits that selectively capture genomic regions with utility in the upper third, middle third, or lower third of utility values across the plurality of baits, with each of the three bait sets having a different relative concentration.
- a bait panel may comprise a plurality of bait sets, each bait set (i) comprising one or more baits that selectively capture one or more genomic regions with sequencing load in the same quantile across the plurality of baits, and (ii) having a different relative concentration from each of the other bait sets with sequencing load in a different quantile across the plurality of baits.
- a bait panel may comprise a plurality of bait sets, each bait set (i) comprising one or more baits that selectively capture one or more genomic regions with ranking function value (e.g., utility divided by sequencing load) in the same quantile across the plurality of baits, and (ii) having a different relative concentration from each of the other bait sets with ranking function value in a different quantile across the plurality of baits.
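- The quantile-based grouping described above can be sketched as follows. This is an illustration only: the region names, the ranking values, the rank-based tertile assignment, and the relative concentrations (including the choice to give higher quantiles a higher concentration) are assumptions made here.

```python
def assign_bait_sets(values, n_quantiles):
    """Map each region name to a quantile index (0 = lowest) by rank order."""
    ranked = sorted(values, key=values.get)
    return {name: (i * n_quantiles) // len(ranked)
            for i, name in enumerate(ranked)}


# Hypothetical ranking function values (utility / sequencing load).
ranking = {"r1": 0.2, "r2": 5.0, "r3": 1.1, "r4": 0.7, "r5": 3.0, "r6": 0.1}
bait_set_of = assign_bait_sets(ranking, n_quantiles=3)

# Hypothetical relative concentration for each bait set (lower, middle, upper third).
relative_concentration = {0: 0.25, 1: 0.5, 2: 1.0}
concentration_of = {name: relative_concentration[q]
                    for name, q in bait_set_of.items()}
```

The same function applies unchanged when the values are utilities or sequencing loads rather than ranking function values.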
- a method of selecting a set of panel blocks may comprise (a) for each panel block, (i) calculating a utility of the panel block, (ii) calculating a sequencing load of the panel block, and (iii) calculating a ranking function of the panel block; and (b) performing an optimization process to select a set of panel blocks that maximizes the total ranking function values of the selected panel blocks.
- a ranking function of a panel block may be calculated as the utility of a panel block divided by the sequencing load of a panel block.
- the combinatorial optimization process may optimize the total sum of ranking function values of all panel blocks selected for the set of panel blocks in a single assay. This approach may enable an optimal panel selection given constraints in sequence load and utility.
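- One way to picture the panel block selection is the brute-force sketch below. It assumes, for illustration, that the constraint is a budget on total sequencing load and that the objective is the sum of ranking function values (utility divided by sequencing load) of the selected blocks. The block names, values, budget, and the exhaustive search are hypothetical; the disclosure does not specify the optimization procedure, and exhaustive search is practical only for small panels.

```python
from itertools import combinations


def select_blocks(blocks, load_budget):
    """blocks maps a name to (utility, sequencing_load).

    Returns the subset with the highest total ranking value whose
    total sequencing load does not exceed load_budget.
    """
    names = sorted(blocks)
    best, best_score = (), 0.0
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            load = sum(blocks[n][1] for n in subset)
            if load > load_budget:
                continue  # violates the assumed sequencing load budget
            score = sum(blocks[n][0] / blocks[n][1] for n in subset)
            if score > best_score:
                best, best_score = subset, score
    return best, best_score


# Hypothetical panel blocks: (utility, sequencing load).
blocks = {
    "block_A": (8.0, 2.0),  # ranking value 4.0
    "block_B": (6.0, 4.0),  # ranking value 1.5
    "block_C": (2.0, 1.0),  # ranking value 2.0
    "block_D": (1.0, 4.0),  # ranking value 0.25
}
selected, score = select_blocks(blocks, load_budget=7.0)
```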
- a method may comprise (a) providing a plurality of bait mixtures, wherein each of the plurality of bait mixtures comprises a first bait set that selectively hybridizes to a first set of genomic regions and a second bait set that selectively hybridizes to a second set of genomic regions, wherein the first bait set is at different concentrations across the plurality of bait mixtures and the second bait set is at the same concentration across the plurality of bait mixtures; (b) contacting each of the plurality of bait mixtures with a nucleic acid sample to capture nucleic acids from the nucleic acid sample with the first bait set and the second bait set, wherein the nucleic acids from the nucleic acid samples are captured by the first bait set and the second bait set; (c) sequencing a portion of the nucleic acids captured with each bait mixture to produce sets of sequence reads within an allocated number of sequence reads; (d) determining the read depth for the first bait set and the second bait set for each bait mixture;
- the read depths for the second set of genomic regions provide a sensitivity of detecting a genetic variant of at least 0.0001% MAF.
- a first set of genomic regions and/or a second set of regions have a size between 25 kilobases and 1,000 kilobases.
- a first set of genomic regions and/or a second set of regions have a read depth of between 1,000 counts/base and 50,000 counts/base.
- a method is provided for improving accuracy of detecting an insertion or deletion (indel) from a plurality of sequence reads derived from cell-free deoxyribonucleic acid (cfDNA) molecules in a bodily sample of a subject, which plurality of sequence reads are generated by nucleic acid sequencing. For each of the plurality of sequence reads associated with cfDNA molecules, a candidate indel may be identified.
- Each candidate indel may then be classified as either a true indel or an introduced indel, using a combination of predetermined expectations of (i) an indel being detected in one or more sequence reads of the plurality of sequence reads, (ii) that a detected indel is a true indel present in a given cell-free DNA molecule of the cell-free DNA molecules, given that an indel has been detected in the one or more of the sequence reads, and/or (iii) that a detected indel is introduced by non-biological error, given that an indel has been detected in the one or more of the sequence reads, in conjunction with one or more model parameters to perform a hypothesis test.
- This approach may reduce error and improve accuracy of detecting an indel from sequence read data.
- FIG. 1 illustrates how a plurality of reads may be generated for each locus enriched from a cell-free nucleic acid sample.
- Each enriched nucleic acid molecule (e.g., DNA molecule) is amplified to produce a family of amplicons.
- These amplicons may then be sequenced on both forward and reverse strands to produce a plurality of sequence read data.
- candidate indels may be detected and classified as either true indels or introduced (e.g., non-biological) indels.
- This algorithm presumes that for any given DNA molecule for which a plurality of sequence reads is analyzed for variants comprising indels, there exists a predetermined expectation (e.g., probability) of an indel being present either in the original molecule (e.g., a “true” biological indel) or introduced at some point in a protocol that culminates in a set of sequence reads (e.g., an introduced non-biological indel stemming from error, including amplification or sequencing error).
- the model may aim to perform a hypothesis test which asks, given a pattern of reads mapping to a particular base position (e.g., covering the base position somewhere in the read), whether the observed pattern is more indicative of an indel in a sequence being present at the beginning of the protocol (e.g., a true biological indel) or introduced during the protocol (a non-biological indel).
- a method for improving accuracy of detecting an insertion or deletion (indel) from a plurality of sequence reads derived from cell-free deoxyribonucleic acid (cfDNA) molecules in a bodily sample of a subject, which plurality of sequence reads are generated by nucleic acid sequencing may comprise (a) for each of the plurality of sequence reads associated with the cell-free DNA molecules, providing: a predetermined expectation of an indel being detected in one or more sequence reads of the plurality of sequence reads; a predetermined expectation that a detected indel is a true indel present in a given cell-free DNA molecule of the cell-free DNA molecules, given that an indel has been detected in the one or more of the sequence reads; and a predetermined expectation that a detected indel is introduced by non-biological error, given that an indel has been detected in the one or more of the sequence reads; (b) providing quantitative measures of one or more model parameters characteristic of sequence reads generated by nucleic acid
- the method for improving accuracy of detecting an insertion or deletion (indel) from a plurality of sequence reads derived from cell-free deoxyribonucleic acid (cfDNA) molecules in a bodily sample of a subject may further comprise enriching one or more loci from the cell-free DNA in the bodily sample before step (a), thereby producing enriched polynucleotides.
- the method may further comprise amplifying the enriched polynucleotides to produce families of amplicons, wherein each family comprises amplicons originating from a single strand of the cell-free DNA molecules.
- the non-biological error may comprise error in sequencing at a plurality of genomic base locations.
- the non-biological error may comprise error in amplification at a plurality of genomic base locations.
- FIG. 2 illustrates an example of small families of reads (which may appear to provide evidence for a true indel variant) and large families of reads (which may indicate a likely introduced error stemming from PCR or sequencing).
- true indels may be expected to be detected or measured as small families of reads, since they may not be expected to affect large numbers of DNA molecules biologically.
- introduced indels may be expected to be detected or measured as larger families of reads, which may indicate an introduced error during PCR or sequencing.
- Some untrimmed or erroneous reads may cause the algorithm to disqualify the family based on a hypothesis test that classifies an indel (e.g., insertion or deletion) as introduced rather than biological.
- FIG. 3 illustrates an example of an insertion being supported by a large family upon aligning and comparing a plurality of sequence reads to a reference genome.
- some untrimmed or erroneous reads may cause the algorithm to disqualify the family based on a hypothesis test that classifies an indel (e.g., insertion or deletion) as introduced rather than biological.
- Model parameters may comprise one or more of (e.g., one or more, two or more, three or more, or four of) (i) for each of one or more variant alleles, a frequency of the variant allele (α) and a frequency of non-reference alleles other than the variant allele (α′); (ii) a frequency of an indel error in the entire forward strand of a family of strands (β₁), wherein a family comprises a collection of amplicons originating from a single strand of the cell-free DNA molecules; (iii) a frequency of an indel error in the entire reverse strand of a family of strands (β₂); and (iv) a frequency of an indel error in a sequence read (γ).
- FIG. 4 illustrates the various parameters that may be used in a hypothesis test and how each parameter may be related to a particular probability, e.g., of a family of reads matching a reference, of a strand's reads matching a reference, and of a read matching a reference.
- FIG. 2 also illustrates how a parameter test containing a maximum likelihood function may be performed. If the parameter test is greater than a predetermined threshold when performed on a candidate indel, then the candidate may be classified as a true indel. If the parameter test is less than or equal to a predetermined threshold when performed on a candidate indel, then the candidate may be classified as an introduced (e.g., non-biological) indel.
- the step of performing a hypothesis test may comprise performing a multi-parameter maximization algorithm.
- the multi-parameter maximization algorithm may comprise a Nelder-Mead algorithm.
- the classifying of a candidate indel as a true indel or an introduced indel may comprise (a) maximizing a multi-parameter likelihood function, (b) classifying a candidate indel as a true indel if the maximum likelihood function value is greater than a predetermined threshold value, and (c) classifying a candidate indel as an introduced indel if the maximum likelihood function value is less than or equal to a predetermined threshold value.
- the multi-parameter likelihood function may be given as:
- $$\Pr\{\text{Reads}\mid\alpha,\alpha',\beta_1,\beta_2,\gamma\}=\prod_{\text{Families}}\Big(\alpha\cdot\big((1-\beta_1)(1-\gamma)^{R_1}\gamma^{V_1+O_1}+\beta_1\gamma^{R_1}(1-\gamma)^{V_1+O_1}\big)\cdot\big((1-\beta_2)(1-\gamma)^{R_2}\gamma^{V_2+O_2}+\beta_2\gamma^{R_2}(1-\gamma)^{V_2+O_2}\big)+\alpha'\cdot(\ldots)+(1-\alpha-\alpha')\cdot(\ldots)\Big)$$
- Pr{Reads | α, α′, β₁, β₂, γ} may represent a probability of an observed configuration of reads according to the model illustrated in FIG. 4 (and described in paragraph [00112]).
- One assumption of the model may be that, given certain values of parameters (e.g., α, α′, β₁, β₂, and γ), an observed configuration of reads within a family is statistically independent from an observed configuration of reads within all other families.
- Pr{Reads | α, α′, β₁, β₂, γ} can be expressed as a product of Pr{reads in family f | α, α′, β₁, β₂, γ} over all families.
- This per-family probability itself may comprise a weighted sum of at least three components, wherein each component corresponds to a possible family type: a) having the variant allele (with weight α), b) having another non-reference variant allele (with weight α′), or c) having the reference allele (with weight 1 − α − α′).
- These components being summed may be probabilities of the observed read configuration for the respective family type.
- Pr{reads in family f | α, α′, β₁, β₂, γ, and family f having the variant allele} may itself be a product of the probability of the observed configuration of reads from the forward strand and the probability of the observed configuration of reads from the reverse strand.
- Each of these probabilities may itself be a weighted sum of at least two components, wherein each component corresponds to a possible outcome: X) the strand-specific indel error did affect this family strand (with weight β₁ or β₂) and Y) the strand-specific indel error did not affect this family strand (with weight 1 − β₁ or 1 − β₂).
- the probability of a specific read configuration may be a product of probabilities for individual reads, since it is postulated by the model that these reads have a statistically independent probability of falling into one of the three categories: i) read supports the variant allele, ii) read supports other non-reference variant allele, or iii) read supports the reference allele. These probabilities are listed in Table 6 below.
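- The nested structure of the likelihood can be summarized schematically as follows. This is a sketch of the decomposition described in the text, not a full specification: here α and α′ denote the variant and other non-reference allele frequencies, β₁ and β₂ the forward-strand and reverse-strand indel error frequencies, and γ the per-read indel error frequency. The shorthand θ, the family-type index t, the strand index s, and the symbols P and S are notation introduced only for this summary. The per-read probabilities that enter each strand term are those of Table 6 and are not reproduced here.

```latex
\begin{aligned}
\theta &= (\alpha,\ \alpha',\ \beta_1,\ \beta_2,\ \gamma) \\[4pt]
\Pr\{\text{Reads} \mid \theta\}
  &= \prod_{f \,\in\, \text{Families}} \Pr\{\text{reads in family } f \mid \theta\} \\[4pt]
\Pr\{\text{reads in family } f \mid \theta\}
  &= \alpha\, P_f^{\text{variant}}
   + \alpha'\, P_f^{\text{other}}
   + (1 - \alpha - \alpha')\, P_f^{\text{reference}} \\[4pt]
P_f^{t}
  &= S_{f,1}^{t} \cdot S_{f,2}^{t},
  \qquad t \in \{\text{variant},\ \text{other},\ \text{reference}\} \\[4pt]
S_{f,s}^{t}
  &= \beta_s \Pr\{\text{reads on strand } s \mid t,\ \text{strand error}\}
   + (1 - \beta_s) \Pr\{\text{reads on strand } s \mid t,\ \text{no strand error}\},
  \qquad s \in \{1, 2\}
\end{aligned}
```

Each strand-level probability is in turn a product over the individual reads on that strand, reflecting the postulated independence of reads.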
- the present disclosure provides computer control systems that are programmed to implement methods of the disclosure.
- the present disclosure provides a system comprising a computer comprising a processor and computer memory, wherein the computer is in communication with a communications network, and wherein computer memory comprises code which, when executed by the processor, (1) receives sequence data into computer memory from the communications network; (2) determines whether a genetic variant in the sequence data represents a mutant; and (3) reports out, over the communications network, the determination.
- a communications network can be any available network that connects to the Internet.
- the communications network can utilize, for example, a high-speed transmission network including, without limitation, Broadband over Powerlines (BPL), Cable Modem, Digital Subscriber Line (DSL), Fiber, Satellite and Wireless.
- a system comprising: a local area network; one or more DNA sequencers comprising computer memory configured to store DNA sequence data which are connected to the local area network; a bioinformatics computer comprising a computer memory and a processor, which computer is connected to the local area network; wherein the computer further comprises code which, when executed, copies DNA sequence data stored on the DNA sequencer, writes the copied data to memory in the bioinformatics computer and performs steps as described herein.
- FIG. 5 shows a computer system 501 that is programmed or otherwise configured to implement methods for generating a bait set, for selecting a set of panel blocks, and for improving accuracy of detecting an indel from a plurality of sequence reads derived from cfDNA molecules.
- the computer system 501 can regulate various aspects of the present disclosure, such as, for example, methods for generating a bait set, for selecting a set of panel blocks, or for improving accuracy of detecting an indel from a plurality of sequence reads derived from cfDNA molecules.
- the computer system 501 can be an electronic device of a user or a computer system that is remotely located with respect to the electronic device.
- the electronic device can be a mobile electronic device.
- the computer system 501 includes a central processing unit (CPU, also “processor” and “computer processor” herein) 505 , which can be a single core or multi core processor, or a plurality of processors for parallel processing.
- the computer system 501 also includes memory or memory location 510 (e.g., random-access memory, read-only memory, flash memory), electronic storage unit 515 (e.g., hard disk), communication interface 520 (e.g., network adapter) for communicating with one or more other systems, and peripheral devices 525 , such as cache, other memory, data storage and/or electronic display adapters.
- the memory 510 , storage unit 515 , interface 520 and peripheral devices 525 are in communication with the CPU 505 through a communication bus (solid lines), such as a motherboard.
- the storage unit 515 can be a data storage unit (or data repository) for storing data.
- the computer system 501 can be operatively coupled to a computer network (“network”) 530 with the aid of the communication interface 520 .
- the network 530 can be the Internet, an internet and/or extranet, or an intranet and/or extranet that is in communication with the Internet.
- the network 530 in some cases is a telecommunication and/or data network.
- the network 530 can include one or more computer servers, which can enable distributed computing, such as cloud computing.
- the network 530, in some cases with the aid of the computer system 501, can implement a peer-to-peer network, which may enable devices coupled to the computer system 501 to behave as a client or a server.
- the CPU 505 can execute a sequence of machine-readable instructions, which can be embodied in a program or software.
- the instructions may be stored in a memory location, such as the memory 510 .
- the instructions can be directed to the CPU 505 , which can subsequently program or otherwise configure the CPU 505 to implement methods of the present disclosure. Examples of operations performed by the CPU 505 can include fetch, decode, execute, and writeback.
- the CPU 505 can be part of a circuit, such as an integrated circuit.
- One or more other components of the system 501 can be included in the circuit.
- the circuit is an application specific integrated circuit (ASIC).
- the storage unit 515 can store files, such as drivers, libraries and saved programs.
- the storage unit 515 can store user data, e.g., user preferences and user programs.
- the computer system 501 in some cases can include one or more additional data storage units that are external to the computer system 501 , such as located on a remote server that is in communication with the computer system 501 through an intranet or the Internet.
- the computer system 501 can communicate with one or more remote computer systems through the network 530 .
- the computer system 501 can communicate with a remote computer system of a user.
- remote computer systems include personal computers (e.g., portable PC), slate or tablet PCs (e.g., Apple® iPad, Samsung® Galaxy Tab), telephones, smartphones (e.g., Apple® iPhone, Android-enabled device, Blackberry®), or personal digital assistants.
- the user can access the computer system 501 via the network 530 .
- Methods as described herein can be implemented by way of machine (e.g., computer processor) executable code stored on an electronic storage location of the computer system 501 , such as, for example, on the memory 510 or electronic storage unit 515 .
- the machine executable or machine readable code can be provided in the form of software.
- the code can be executed by the processor 505 .
- the code can be retrieved from the storage unit 515 and stored on the memory 510 for ready access by the processor 505 .
- the electronic storage unit 515 can be precluded, and machine-executable instructions are stored on memory 510 .
- the code can be pre-compiled and configured for use with a machine having a processor adapted to execute the code, or can be compiled during runtime.
- the code can be supplied in a programming language that can be selected to enable the code to execute in a pre-compiled or as-compiled fashion.
- aspects of the systems and methods provided herein can be embodied in programming.
- Various aspects of the technology may be thought of as “products” or “articles of manufacture” typically in the form of machine (or processor) executable code and/or associated data that is carried on or embodied in a type of machine readable medium.
- Machine-executable code can be stored on an electronic storage unit, such as memory (e.g., read-only memory, random-access memory, flash memory) or a hard disk.
- “Storage” type media can include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide non-transitory storage at any time for the software programming. All or portions of the software may at times be communicated through the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer into the computer platform of an application server.
- another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links.
- Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, such as may be used to implement the databases, etc. shown in the drawings.
- Volatile storage media include dynamic memory, such as main memory of such a computer platform.
- Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that comprise a bus within a computer system.
- Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications.
- Common forms of computer-readable media therefore include, for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a ROM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data.
- Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution.
- the computer system 501 can include or be in communication with an electronic display 535 that comprises a user interface (UI) 540 for providing, for example, input parameters for methods for generating a bait set, for selecting a set of panel blocks, or for improving accuracy of detecting an indel from a plurality of sequence reads derived from cfDNA.
- UIs include, without limitation, a graphical user interface (GUI) and web-based user interface.
- Methods and systems of the present disclosure can be implemented by way of one or more algorithms.
- An algorithm can be implemented by way of software upon execution by the central processing unit 505 .
- the algorithm can, for example, generate a bait set, select a set of panel blocks, or improve accuracy of detecting an indel from a plurality of sequence reads derived from cfDNA molecules.
- Analytical sensitivity (as defined by the limit of detection and by positive percent agreement) and precision were assessed throughout the reportable allelic fraction and copy number ranges via multiple serial dilution studies of orthogonally-characterized contrived material and patient samples.
- Analytical specificity was assessed by calculating the false positive rate in pre-characterized healthy donor sample mixtures serially diluted across the lower reportable range down to allelic fractions below the limit of detection.
- Positive predictive value (PPV) was estimated as a function of allelic fraction/copy number from pre-characterized clinical patient samples and prevalence-adjusted using a cohort of 2,585 consecutive clinical samples. Orthogonal qualitative and quantitative confirmation was performed using ddPCR.
- Analytical performance is summarized in Table 7 below. Analytical specificity was 100% for single nucleotide variants (SNVs), fusions, and copy number alterations (CNAs) and 96% (24/25) for indels across 25 defined samples. Relative to other methods, this assay demonstrated 20%-50% increases in fusion molecule recovery, depending on the sequence context. Retrospective in silico analysis of 2,585 consecutive clinical samples demonstrated a >15% relative increase in actionable fusion detection, a 6%-15% increase in actionable indel detection (excluding newly reportable indels), and a 3%-6% increase in actionable SNV detection.
- Table 7 Analytical performance characteristics based on standard cfDNA input (30 ng). Analytical sensitivity/limit of detection estimates are provided for clinically actionable variants and can vary by sequence context and cfDNA input. Positive predictive value is estimated across the entire reportable panel space (PPV was 100% for clinically actionable variants).
- the assay comprehensively detected all adult solid tumor guideline-recommended somatic genomic variants with high sensitivity, accuracy, and specificity.
- Hotspot and backbone panels were designed for both default probe replication and optimized probe replication.
- the hotspot panel is approximately 12 kb and targets regions of genomic targets that may be indicative of drug response, a disease status (e.g., cancer), and/or a genomic target listed under National Comprehensive Cancer Network (“NCCN”) guidelines.
- the backbone panel is approximately 140 kb and covers the rest of the panel content.
- the hotspot and backbone panel may comprise any genetic locations in Table 3.
- a titration experiment was performed for panel input amount for each of the four panels at 5 ng, 15 ng, and 30 ng of cfDNA as set forth in Table 1.
- FIG. 6 shows input amount versus unique molecule count for the generic panel. The unique molecule count saturated at about Vol. 3× for the backbone bait and about Vol. 1.2× for the hotspot bait (data not shown), suggesting that the optimized backbone panel was less variable compared to the default panel.
- a concentration of backbone bait and a concentration of hotspot bait were determined.
- a mixture of backbone bait (e.g., Vol. A) and hotspot bait (e.g., Vol. B) was generated and the molecule count for the hotspot/backbone bait mixture was compared with molecule count for a generic panel.
- the molecule counts from the hotspot panel were higher than the backbone panel. The difference became more noticeable at higher cfDNA input amount as the backbone bait saturated out much faster, e.g., at lower input amount, as compared to the hotspot bait. A similar trend was seen with the double-stranded count (data not shown). Family size was also higher for the hotspot panel than the backbone panel (data not shown).
- the difference in family sizes may indicate that the hotspot panel captures more molecules than the backbone panel, even though the effect was masked in the molecule counts. For example, given the large family sizes at 5 ng, it is likely that most of the unique molecules were captured, so there was no obvious difference between the hotspot and backbone panels. Given the family size differences, it is likely that the hotspot panel captured more PCR duplicates than the backbone panel.
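- By way of non-limiting illustration, the molecule count and family size metrics discussed above can be sketched as follows. The sketch assumes, hypothetically, that reads sharing the same start position, end position, and molecular barcode belong to one family; the field names `start`, `end`, and `barcode` and the example reads are illustrative only and are not taken from the disclosure.

```python
from collections import Counter


def family_stats(reads):
    """Return (unique_molecule_count, mean_family_size) for a list of reads.

    Each read is a dict with hypothetical keys 'start', 'end', and 'barcode'.
    Reads sharing all three values are assumed to derive from one original
    molecule, so each distinct key is counted as one family.
    """
    families = Counter((r["start"], r["end"], r["barcode"]) for r in reads)
    unique_molecules = len(families)
    if unique_molecules == 0:
        return 0, 0.0
    mean_family_size = sum(families.values()) / unique_molecules
    return unique_molecules, mean_family_size


# Illustrative reads: three families of sizes 3, 2, and 1.
example_reads = (
    [{"start": 100, "end": 266, "barcode": "AC"}] * 3
    + [{"start": 100, "end": 266, "barcode": "GT"}] * 2
    + [{"start": 120, "end": 290, "barcode": "AC"}]
)

unique_count, mean_size = family_stats(example_reads)
print(unique_count, mean_size)
```

- Under these assumptions, a panel that yields a similar unique molecule count but a larger mean family size is capturing more PCR duplicates per original molecule.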
Abstract
Description
- This application is a continuation of U.S. patent application Ser. No. 18/482,779, filed Oct. 6, 2023, which is a continuation of U.S. patent application Ser. No. 18/155,523 filed Jan. 17, 2023, which is a continuation of U.S. patent application Ser. No. 18/055,298 filed Nov. 14, 2022, which is a continuation of U.S. patent application Ser. No. 17/383,385 filed Jul. 22, 2021, which is a continuation of U.S. patent application Ser. No. 16/338,445, filed Mar. 29, 2019, which is a U.S. national stage application of International Patent Application No. PCT/US2017/054607, filed Sep. 29, 2017, which claims priority to U.S. Provisional Application No. 62/402,940, filed Sep. 30, 2016, U.S. Provisional Application No. 62/468,201, filed Mar. 7, 2017, and U.S. Provisional Application No. 62/489,391, filed Apr. 24, 2017, each of which is entirely incorporated herein by reference.
- The instant application contains a Sequence Listing which has been submitted electronically in XML format and is hereby incorporated by reference in its entirety. Said XML copy, created on Apr. 25, 2023, is named 5714_016US4_SL.xml and is 15,197 bytes in size.
- Analysis of cell-free nucleic acids (e.g., deoxyribonucleic acid or ribonucleic acid) for tumor-derived genetic variants is a critical step in a typical analysis pipeline for cancer detection, assessment, and monitoring applications. Most current methods of cancer diagnostic assays of cell-free nucleic acids focus on the detection of tumor-related somatic variants, including single-nucleotide variants (SNVs), copy-number variations (CNVs), fusions, and insertions/deletions (indels), which are all mainstream targets for liquid biopsy. A typical analysis approach may comprise enriching a nucleic acid sample for targeted regions of a genome, followed by sequencing of enriched nucleic acids and analysis of sequence read data for genetic variants of interest. These nucleic acids may be enriched using a bait mixture selected for a particular assay according to assay constraints, including limited sequencing load and utility associated with each genomic region of interest.
- In an aspect, the present disclosure provides a bait set panel comprising one or more bait sets that selectively enrich for one or more nucleosome-associated regions of a genome, said nucleosome-associated regions comprising genomic regions having one or more genomic base positions with differential nucleosomal occupancy, wherein the differential nucleosomal occupancy is characteristic of a cell or a tissue type of origin or a disease state.
- In some embodiments, each of the one or more nucleosome-associated regions of a bait set panel comprises at least one of: (i) significant structural variation, comprising a variation in nucleosomal positioning, said structural variation selected from the group consisting of: an insertion, a deletion, a translocation, a gene rearrangement, methylation status, a micro-satellite, a copy number variation, a copy number-related structural variation, or any other variation which indicates differentiation; and (ii) instability, comprising one or more significant fluctuations or peaks in a genome partitioning map indicating one or more locations of nucleosomal map disruptions in a genome.
- In some embodiments, the one or more bait sets of a bait set panel are configured to capture nucleosome-associated regions of the genome based on a function of a plurality of reference nucleosomal occupancy profiles (i) associated with one or more disease states and one or more non-disease states; (ii) associated with a known somatic mutation, such as SNV, CNV, indel, or re-arrangement; and/or (iii) associated with differential expression patterns. In an embodiment, the one or more bait sets of a bait set panel selectively enrich for one or more nucleosome-associated regions in a cell-free deoxyribonucleic acid (cfDNA) sample.
- In another aspect, the present disclosure provides a method for enriching a nucleic acid sample for nucleosome-associated regions of a genome comprising (a) bringing a nucleic acid sample in contact with a bait set panel, said bait set panel comprising one or more bait sets that selectively enrich for one or more nucleosome-associated regions of a genome; and (b) enriching the nucleic acid sample for one or more nucleosome-associated regions of a genome.
- In some embodiments, the one or more bait sets in a bait set panel are configured to capture nucleosome-associated regions of the genome based on a function of a plurality of reference nucleosomal occupancy profiles associated with one or more disease states and one or more non-disease states. In an embodiment, the one or more bait sets in a bait set panel selectively enrich for the one or more nucleosome-associated regions in a cfDNA sample. In an embodiment, the method for enriching a nucleic acid sample for nucleosome-associated regions of a genome further comprises sequencing the enriched nucleic acids to produce sequence reads of the nucleosome-associated regions of a genome.
- In another aspect, the present disclosure provides a method for generating a bait set comprising (a) identifying one or more regions of a genome, said regions associated with a nucleosome profile, and (b) selecting a bait set to selectively capture said regions. In an embodiment, a bait set in a bait set panel selectively enriches for one or more nucleosome-associated regions in a cell-free deoxyribonucleic acid sample.
- In another aspect, the present disclosure provides a bait panel comprising a first bait set that selectively hybridizes to a first set of genomic regions of a nucleic acid sample comprising a predetermined amount of DNA, which is provided at a first concentration ratio that is less than a saturation point of the first bait set; and a second bait set that selectively hybridizes to a second set of genomic regions of the nucleic acid sample, which is provided at a second concentration ratio that is associated with a saturation point of the second bait set. In an embodiment, the first set of genomic regions comprises one or more backbone genomic regions and the second set of genomic regions comprises one or more hotspot genomic regions.
- In another aspect, the present disclosure provides a method for enriching for multiple genomic regions comprising bringing a predetermined amount of a nucleic acid sample in contact with a bait panel comprising (i) a first bait set that selectively hybridizes to a first set of genomic regions of the nucleic acid sample, provided at a first concentration ratio that is less than a saturation point of the first bait set, and (ii) a second bait set that selectively hybridizes to a second set of genomic regions of the nucleic acid sample, provided at a second concentration ratio that is associated with a saturation point of the second bait set; and enriching the nucleic acid sample for the first set of genomic regions and the second set of genomic regions.
- In some embodiments, the method further comprises sequencing the enriched nucleic acids to produce sequence reads of the first set of genomic regions and the second set of genomic regions.
- In some embodiments, the saturation point of a bait set is determined by (a) for each of the baits in the bait set, generating a titration curve comprising (i) measuring the capture efficiency of the bait as a function of the concentration of the bait, and (ii) identifying an inflection point within the titration curve, thereby identifying a saturation point associated with the bait; and (b) selecting a saturation point that is larger than substantially all of the saturation points associated with baits in the bait set, thereby determining the saturation point of the bait set.
- In some embodiments, the capture efficiency of a bait is determined by (a) providing a plurality of nucleic acid samples obtained from a plurality of subjects in a cohort; (b) hybridizing the bait with each of the nucleic acid samples, at each of a plurality of concentrations of the bait; (c) enriching, with the bait, a plurality of genomic regions of the nucleic acid samples, at each of the plurality of concentrations of the bait; and (d) measuring the number of unique nucleic acid molecules, or of nucleic acid molecules with representation of both strands of an original double-stranded nucleic acid molecule, as a representation of the capture efficiency at each of the plurality of concentrations of the bait.
- In some embodiments, an inflection point is a first concentration of the bait such that observed capture efficiency does not increase significantly at concentrations of the bait greater than the first concentration. An inflection point may be a first concentration of the bait such that an observed increase between (1) the capture efficiency at a bait concentration of twice the first concentration compared to (2) the capture efficiency at the first bait concentration, is less than about 1%, less than about 2%, less than about 3%, less than about 4%, less than about 5%, less than about 6%, less than about 7%, less than about 8%, less than about 9%, less than about 10%, less than about 12%, less than about 14%, less than about 16%, less than about 18%, or less than about 20%.
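- By way of non-limiting illustration, the saturation point determination described above can be sketched as follows. The sketch assumes that capture efficiency was measured on a two-fold dilution series, so that a measurement exists at twice each candidate concentration, and it interprets "larger than substantially all" as a hypothetical coverage fraction (95% by default). The function names, bait names, and titration values are illustrative only.

```python
import math


def find_saturation_point(curve, max_relative_increase=0.05):
    """Return the first concentration c at which doubling the bait
    concentration raises capture efficiency by less than the threshold.

    curve maps bait concentration -> capture efficiency (for example, a
    unique molecule count). Returns None if no such concentration exists.
    """
    for conc in sorted(curve):
        doubled = 2 * conc
        if doubled not in curve or curve[conc] <= 0:
            continue  # no measurement at 2c, or no usable baseline
        increase = (curve[doubled] - curve[conc]) / curve[conc]
        if increase < max_relative_increase:
            return conc
    return None


def bait_set_saturation_point(bait_points, coverage=0.95):
    """Pick a value at least as large as `coverage` of the per-bait points."""
    points = sorted(bait_points)
    if not points:
        raise ValueError("at least one per-bait saturation point is required")
    index = math.ceil(coverage * len(points)) - 1
    return points[max(index, 0)]


# Illustrative titration curves (relative concentration -> molecule count).
curves = {
    "bait_1": {0.25: 1000, 0.5: 1800, 1.0: 2600, 2.0: 2700, 4.0: 2710},
    "bait_2": {0.25: 500, 0.5: 1000, 1.0: 1900, 2.0: 3000, 4.0: 3100},
    "bait_3": {0.25: 2000, 0.5: 2050, 1.0: 2060, 2.0: 2065, 4.0: 2066},
}

per_bait = {name: find_saturation_point(c) for name, c in curves.items()}
set_point = bait_set_saturation_point(per_bait.values())
print(per_bait, set_point)
```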
- In some embodiments, the nucleic acid sample comprises a cell-free nucleic acid sample. In an embodiment, a method for enriching for multiple genomic regions further comprises sequencing the enriched nucleic acid sample to produce a plurality of sequence reads. In an embodiment, a method for enriching for multiple genomic regions further comprises producing an output comprising a nucleic acid sequence representative of the nucleic acid sample.
- In another aspect, the present disclosure provides a bait panel comprising a first bait set that selectively captures backbone regions of a genome, said backbone regions associated with a ranking function of sequencing load and utility, wherein the ranking function of each backbone region has a value less than a predetermined threshold value; and a second bait set that selectively captures hotspot regions of a genome, said hotspot regions associated with a ranking function of sequencing load and utility, wherein the ranking function of each hotspot region has a value greater than or equal to the predetermined threshold value.
- In some embodiments, the hotspot regions comprise one or more nucleosome informative regions, said nucleosome informative regions comprising a region of maximum nucleosome differentiation. In an embodiment, the bait panel further comprises a second bait set that selectively captures disease informative regions. In an embodiment, the baits in the first bait set are at a first relative concentration to the bait panel, and the baits in the second bait set are at a second relative concentration to the bait panel.
- In another aspect, the present disclosure provides a method for generating a bait set comprising identifying one or more backbone genomic regions of interest, wherein the identifying the one or more backbone genomic regions comprises maximizing a ranking function of sequencing load and utility associated with each of the backbone genomic regions; identifying one or more hot-spot genomic regions of interest; creating a first bait set that selectively captures the backbone genomic regions of interest; and creating a second bait set that selectively captures the hot-spot genomic regions of interest, wherein the second bait set has a higher capture efficiency than the first bait set.
- In some embodiments, the one or more hot-spots are selected using one or more of the following: (i) maximizing a ranking function of sequencing load and utility associated with each of the hot-spot genomic regions, (ii) nucleosome profiling across the one or more genomic regions of interest, (iii) predetermined cancer driver mutations or prevalence across a relevant patient cohort, and (iv) empirically identified cancer driver mutations.
- In some embodiments, identifying one or more hotspots of interest comprises using a programmed computer processor to rank a set of hot-spot genomic regions based on a ranking function of sequencing load and utility associated with each of the hot-spot genomic regions. In some embodiments, identifying the one or more backbone genomic regions of interest comprises ranking a set of backbone genomic regions based on a ranking function of sequencing load and utility associated with each of the backbone genomic regions of interest. In some embodiments, identifying the one or more hot-spot genomic regions of interest comprises utilizing a set of empirically determined minor allele frequency (MAF) values or clonality of a variant measured by its MAF in relationship to the highest presumed driver or clonal mutation in a sample.
- In some embodiments, sequencing load of a genomic region is calculated by multiplying together one or more of (i) size of the genomic region in base pairs, (ii) relative fraction of reads spent on sequencing fragments mapping to the genomic region, (iii) relative coverage as a result of sequence bias of the genomic region, (iv) relative coverage as a result of amplification bias of the genomic region, and (v) relative coverage as a result of capture bias of the genomic region.
- In some embodiments, utility of a genomic region is calculated by multiplying together one or more of (i) frequency of one or more actionable mutations in the genomic region, (ii) frequency of one or more mutations associated with above-average minor allele frequencies (MAFs) in the genomic region, (iii) fraction of patients in a cohort harboring a somatic mutation within the genomic region, (iv) sum of MAF for variants in patients in a cohort, said patients harboring a somatic mutation within the genomic region, and (v) ratio of (1) MAF for variants in patients in a cohort, said patients harboring a somatic mutation within the genomic region, to (2) maximum MAF for a given patient in the cohort.
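- By way of non-limiting illustration, the sequencing load and utility calculations described above can be sketched as products of whichever factors are supplied, reflecting the "one or more of" language. The parameter names and example values below are hypothetical and are not taken from the disclosure.

```python
import math


def _product(factors):
    """Multiply the factors that were supplied (those that are not None)."""
    used = [f for f in factors if f is not None]
    if not used:
        raise ValueError("supply at least one factor")
    return math.prod(used)


def sequencing_load(size_bp=None, read_fraction=None, sequence_bias=None,
                    amplification_bias=None, capture_bias=None):
    """Sequencing load of a genomic region as a product of factors (i)-(v)."""
    return _product([size_bp, read_fraction, sequence_bias,
                     amplification_bias, capture_bias])


def utility(actionable_frequency=None, high_maf_frequency=None,
            patient_fraction=None, maf_sum=None, relative_maf=None):
    """Utility of a genomic region as a product of factors (i)-(v)."""
    return _product([actionable_frequency, high_maf_frequency,
                     patient_fraction, maf_sum, relative_maf])


# Illustrative values for a single hypothetical region.
load = sequencing_load(size_bp=1200, read_fraction=0.5, capture_bias=2.0)
value = utility(actionable_frequency=0.25, patient_fraction=0.4)
print(load, value)
```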
- In some embodiments, actionable mutations comprise one or more of (i) druggable mutations, (ii) mutations for therapeutic monitoring, (iii) disease specific mutations, (iv) tissue specific mutations, (v) cell type specific mutations, (vi) resistance mutations, and (vii) diagnostic mutations. In an embodiment, mutations associated with higher minor allele frequencies comprise one or more driver mutations or are known from external data or annotation sources.
- In another aspect, the present disclosure provides a bait panel comprising a plurality of bait sets, each bait set (i) comprising one or more baits that selectively capture one or more genomic regions with utility in the same quantile across the plurality of baits, and (ii) having a different relative concentration from each of the other bait sets with utility in a different quantile across the plurality of baits.
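- By way of non-limiting illustration, grouping baits by utility quantile and assigning each group its own relative concentration can be sketched as follows. The use of quartiles, the region names, the utility values, and the concentration values are all assumptions made for this example.

```python
import bisect
import statistics


def assign_quantile_bins(utilities, n_bins=4):
    """Map each region name to a utility quantile bin (0 = lowest utility)."""
    cuts = statistics.quantiles(utilities.values(), n=n_bins)
    return {name: bisect.bisect_right(cuts, u) for name, u in utilities.items()}


def build_bait_sets(utilities, concentration_by_bin):
    """Group regions into bait sets, one per quantile bin, each with the
    relative concentration assigned to that bin."""
    bins = assign_quantile_bins(utilities, n_bins=len(concentration_by_bin))
    bait_sets = {}
    for name, bin_index in bins.items():
        entry = bait_sets.setdefault(
            bin_index,
            {"concentration": concentration_by_bin[bin_index], "regions": []},
        )
        entry["regions"].append(name)
    return bait_sets


# Hypothetical utilities for eight regions and one concentration per quartile.
utilities = {f"region_{i}": float(i) for i in range(1, 9)}
concentration_by_bin = {0: 0.25, 1: 0.5, 2: 0.75, 3: 1.0}

bins = assign_quantile_bins(utilities)
bait_sets = build_bait_sets(utilities, concentration_by_bin)
print(bait_sets)
```

- In this sketch, higher-utility quantiles receive higher relative concentrations, but any mapping in which each quantile has a distinct concentration would satisfy the description above.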
- In another aspect, the present disclosure provides a method of selecting a set of panel blocks comprising (a) for each panel block, (i) calculating a utility of the panel block, (ii) calculating a sequencing load of the panel block, and (iii) calculating a ranking function of the panel block; and (b) performing an optimization process to select a set of panel blocks that maximizes the total ranking function values of the selected panel blocks.
- In some embodiments, a ranking function of a panel block is calculated as the utility of a panel block divided by the sequencing load of a panel block. In some embodiments, the optimization process is a combinatorial optimization process that comprises a greedy algorithm.
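- By way of non-limiting illustration, a greedy selection of panel blocks ranked by utility divided by sequencing load can be sketched as follows. The sketch assumes that selection is constrained by a total sequencing load budget, which is an assumption of this example rather than a stated requirement; the block names and values are hypothetical. A greedy pass is a heuristic and is not guaranteed to find the globally optimal set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PanelBlock:
    name: str
    utility: float
    load: float

    @property
    def rank(self):
        """Ranking function: utility divided by sequencing load."""
        return self.utility / self.load


def select_blocks(blocks, load_budget):
    """Greedily add blocks in descending rank order while the cumulative
    sequencing load stays within the budget."""
    selected, used = [], 0.0
    for block in sorted(blocks, key=lambda b: b.rank, reverse=True):
        if used + block.load <= load_budget:
            selected.append(block)
            used += block.load
    return selected


# Hypothetical panel blocks.
blocks = [
    PanelBlock("A", utility=9.0, load=3.0),   # rank 3.0
    PanelBlock("B", utility=4.0, load=4.0),   # rank 1.0
    PanelBlock("C", utility=10.0, load=2.0),  # rank 5.0
    PanelBlock("D", utility=1.0, load=5.0),   # rank 0.2
]

chosen = select_blocks(blocks, load_budget=6.0)
print([b.name for b in chosen])
```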
- In another aspect, the present disclosure provides a method comprising (a) providing a plurality of bait mixtures, wherein each bait mixture comprises a first bait set that selectively hybridizes to a first set of genomic regions and a second bait set that selectively hybridizes to a second set of genomic regions, and wherein the bait mixtures comprise the first bait set at different concentrations and the second bait set at the same concentrations; (b) contacting each bait mixture with a nucleic acid sample to capture nucleic acid from the sample with the bait sets, wherein the nucleic acid samples have a nucleic acid concentration around the saturation point of the second bait set; (c) sequencing the nucleic acids captured with each bait mixture to produce sets of sequence reads; (d) determining the relative number of sequence reads for the first set of genomic regions and the second set of genomic regions for each bait mixture; and (e) identifying at least one bait mixture that provides read depths for the second set of genomic regions and, optionally, first set of genomic regions, at predetermined amounts.
- In another aspect, the present disclosure provides a method for improving accuracy of detecting an insertion or deletion (indel) from a plurality of sequence reads derived from cell-free deoxyribonucleic acid (cfDNA) molecules in a bodily sample of a subject, which plurality of sequence reads are generated by nucleic acid sequencing, comprising (a) for each of the plurality of sequence reads associated with the cell-free DNA molecules, providing: a predetermined expectation of an indel being detected in one or more sequence reads of the plurality of sequence reads; a predetermined expectation that a detected indel is a true indel present in a given cell-free DNA molecule of the cell-free DNA molecules, given that an indel has been detected in the one or more of the sequence reads; and a predetermined expectation that a detected indel is introduced by non-biological error, given that an indel has been detected in the one or more of the sequence reads; (b) providing quantitative measures of one or more model parameters characteristic of sequence reads generated by nucleic acid sequencing; (c) detecting one or more candidate indels in the plurality of sequence reads associated with the cell-free DNA molecules; and (d) for each candidate indel, performing a hypothesis test using one or more of the model parameters to classify said candidate indel as a true indel or an introduced indel, thereby improving accuracy of detecting an indel.
- In another aspect, the present disclosure provides a kit comprising (a) a sample comprising a predetermined amount of DNA; and (b) a bait set panel comprising (i) a first bait set that selectively hybridizes to a first set of genomic regions of a nucleic acid sample comprising a predetermined amount of DNA, provided at a first concentration ratio that is less than a saturation point of the first bait set and (ii) a second bait set that selectively hybridizes to a second set of genomic regions of the nucleic acid sample, provided at a second concentration ratio that is associated with a saturation point of the second bait set.
- In some embodiments, the method for improving accuracy of detecting an insertion or deletion (indel) from a plurality of sequence reads derived from cell-free deoxyribonucleic acid (cfDNA) molecules in a bodily sample of a subject further comprises enriching one or more loci from the cell-free DNA in the bodily sample before step (a), thereby producing enriched polynucleotides.
- In some embodiments, the method further comprises amplifying the enriched polynucleotides to produce families of amplicons, wherein each family comprises amplicons originating from a single strand of the cell-free DNA molecules. In some embodiments, the non-biological error comprises error in sequencing at a plurality of genomic base locations. In some embodiments, the non-biological error comprises error in amplification at a plurality of genomic base locations.
- In some embodiments, model parameters comprise one or more of (e.g., one or more of, two or more of, three or more of, or four of) (i) for each of one or more variant alleles, a frequency of the variant allele (α) and a frequency of non-reference alleles other than the variant allele (α′); (ii) a frequency of an indel error in the entire forward strand of a family of strands (β1), wherein a family comprises a collection of amplicons originating from a single strand of the cell-free DNA molecules; (iii) a frequency of an indel error in the entire reverse strand of a family of strands (β2); and (iv) a frequency of an indel error in a sequence read (γ).
- In some embodiments, the step of performing a hypothesis test comprises performing a multi-parameter maximization algorithm. In some embodiments, the multi-parameter maximization algorithm comprises a Nelder-Mead algorithm. In an embodiment, the classifying of a candidate indel as a true indel or an introduced indel comprises (a) maximizing a multi-parameter likelihood function, (b) classifying a candidate indel as a true indel if the maximum likelihood function value is greater than a predetermined threshold value, and (c) classifying a candidate indel as an introduced indel if the maximum likelihood function value is less than or equal to a predetermined threshold value.
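- By way of non-limiting illustration, classifying a candidate indel by maximizing a likelihood function and comparing against a threshold can be sketched as follows. This is a deliberately simplified stand-in: it collapses the error parameters (β1, β2, γ) into a single per-family error rate `beta`, maximizes over the variant allele frequency α alone using a grid search rather than a Nelder-Mead algorithm, and thresholds a log-likelihood ratio rather than the raw maximum likelihood value. All names, the threshold, and the example counts are assumptions made for this example.

```python
import math


def log_likelihood(k, n, p):
    """Binomial log-likelihood (up to a constant) of k indel-supporting
    families out of n, given per-family indel probability p."""
    p = min(max(p, 1e-12), 1 - 1e-12)  # keep the logarithms finite
    return k * math.log(p) + (n - k) * math.log1p(-p)


def classify_indel(k, n, beta, threshold=3.0, grid=1000):
    """Classify a candidate indel as 'true indel' or 'introduced indel'.

    Null hypothesis: every observation is error, with probability beta.
    Alternative: a true variant at frequency alpha plus error, so that
    p = alpha + (1 - alpha) * beta. Alpha is maximized over a grid.
    """
    null = log_likelihood(k, n, beta)
    best = max(
        log_likelihood(k, n, alpha + (1 - alpha) * beta)
        for alpha in (i / grid for i in range(grid + 1))
    )
    llr = best - null  # non-negative because alpha = 0 is on the grid
    label = "true indel" if llr > threshold else "introduced indel"
    return label, llr


# Illustrative counts: 2000 families covering the locus, error rate 0.05%.
print(classify_indel(k=12, n=2000, beta=0.0005))
print(classify_indel(k=1, n=2000, beta=0.0005))
```

- A multi-parameter optimizer such as Nelder-Mead would replace the grid search when the forward-strand, reverse-strand, and per-read error rates are fitted jointly with α.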
- In another aspect, the present disclosure provides a non-transitory computer-readable medium comprising machine executable code that, upon execution by one or more computer processors, implements a method for generating a bait set comprising identifying one or more backbone genomic regions of interest, wherein the identifying the one or more backbone genomic regions comprises maximizing a ranking function of sequencing load and utility associated with each of the backbone genomic regions; identifying one or more hot-spot genomic regions of interest; creating a first bait set that selectively captures the backbone genomic regions of interest; and creating a second bait set that selectively captures the hot-spot genomic regions of interest, wherein the second bait set has a higher capture efficiency than the first bait set.
- In another aspect, the present disclosure provides a non-transitory computer-readable medium comprising machine executable code that, upon execution by one or more computer processors, implements a method of selecting a set of panel blocks comprising (a) for each panel block, (i) calculating a utility of the panel block, (ii) calculating a sequencing load of the panel block, and (iii) calculating a ranking function of the panel block; and (b) performing an optimization process to select a set of panel blocks that maximizes the total ranking function values of the selected panel blocks.
- In another aspect, the present disclosure provides a non-transitory computer-readable medium comprising machine executable code that, upon execution by one or more computer processors, implements a method for improving accuracy of detecting an insertion or deletion (indel) from a plurality of sequence reads derived from cell-free deoxyribonucleic acid (cfDNA) molecules in a bodily sample of a subject, which plurality of sequence reads are generated by nucleic acid sequencing, the method comprising: (a) for each of the plurality of sequence reads associated with the cell-free DNA molecules, providing: a predetermined expectation of an indel being detected in one or more sequence reads of the plurality of sequence reads; a predetermined expectation that a detected indel is a true indel present in a given cell-free DNA molecule of the cell-free DNA molecules, given that an indel has been detected in the one or more of the sequence reads; and a predetermined expectation that a detected indel is introduced by non-biological error, given that an indel has been detected in the one or more of the sequence reads; (b) providing quantitative measures of one or more model parameters characteristic of sequence reads generated by nucleic acid sequencing; (c) detecting one or more candidate indels in the plurality of sequence reads associated with the cell-free DNA molecules; and (d) for each candidate indel, performing a hypothesis test using one or more of the model parameters to classify said candidate indel as a true indel or an introduced indel, thereby improving accuracy of detecting an indel.
- In another aspect, the present disclosure provides a method for enriching for multiple genomic regions, comprising: (a) bringing a predetermined amount of nucleic acid from a sample in contact with a bait mixture comprising (i) a first bait set that selectively hybridizes to a first set of genomic regions of the nucleic acid from the sample, which first bait set is provided at a first concentration that is less than a saturation point of the first bait set, and (ii) a second bait set that selectively hybridizes to a second set of genomic regions of the nucleic acid sample, which second bait set is provided at a second concentration that is associated with a saturation point of the second bait set; and (b) enriching the nucleic acid sample for the first set of genomic regions and the second set of genomic regions.
- In some embodiments, the second bait set has a saturation point that is larger than substantially all of the saturation points associated with baits in the second bait set when a bait of the second bait set is subjected to a titration curve generated by (i) measuring the capture efficiency of a bait of the second bait set as a function of the concentration of the bait, and (ii) identifying an inflection point within the titration curve, thereby identifying a saturation point associated with the bait. In some embodiments, the saturation point is selected such that an observed capture efficiency increases by less than 20% at a concentration of the bait twice that of the first concentration.
- In some embodiments, the saturation point is selected such that an observed capture efficiency increases by less than 10% at a concentration of the bait twice that of the first concentration. In some embodiments, the saturation point is selected such that an observed capture efficiency increases by less than 5% at a concentration of the bait twice that of the first concentration. In some embodiments, the saturation point is selected such that an observed capture efficiency increases by less than 2% at a concentration of the bait twice that of the first concentration. In some embodiments, the saturation point is selected such that an observed capture efficiency increases by less than 1% at a concentration of the bait twice that of the first concentration.
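- The doubling criterion above can be illustrated with a short sketch. It assumes a titration series measured at doubling concentrations and returns the lowest measured concentration at which doubling the bait concentration raises the observed capture efficiency by less than the chosen relative threshold. The function name, the data values, and the arbitrary concentration units are hypothetical.

```python
def find_saturation_point(titration, max_gain=0.10):
    """Return the lowest measured concentration c such that the capture
    efficiency at 2*c exceeds the efficiency at c by less than max_gain
    (relative increase). Returns None if no measured point qualifies.

    titration: dict mapping concentration -> observed capture efficiency.
    """
    for conc in sorted(titration):
        doubled = 2 * conc
        if doubled not in titration:
            continue  # no measurement at twice this concentration
        gain = (titration[doubled] - titration[conc]) / titration[conc]
        if gain < max_gain:
            return conc
    return None


# Made-up titration series (arbitrary concentration units), for illustration only.
titration = {1: 0.20, 2: 0.38, 4: 0.60, 8: 0.75, 16: 0.80, 32: 0.81, 64: 0.812}

for threshold in (0.20, 0.10, 0.05, 0.02, 0.01):
    print(f"max gain {threshold:.0%}: saturation at", find_saturation_point(titration, threshold))
```

- A stricter threshold (for example, 1% rather than 20%) pushes the selected saturation point to a higher concentration, which is why the embodiments list a range of thresholds.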
- In some embodiments, the first bait set or the second bait set selectively enrich for one or more nucleosome-associated regions of a genome, said nucleosome-associated regions comprising genomic regions having one or more genomic base positions with differential nucleosomal occupancy, wherein the differential nucleosomal occupancy is characteristic of a cell or tissue type of origin or disease state. In some embodiments, the nucleic acid sample comprises a cell-free nucleic acid sample. In some embodiments, the method further comprises: (c) sequencing the enriched nucleic acid sample to produce a plurality of sequence reads. In some embodiments, the method further comprises: (d) producing an output comprising a nucleic acid sequence representative of the nucleic acid sample.
- In another aspect, the present disclosure provides a method for generating a bait set comprising: (a) identifying one or more predetermined backbone genomic regions, wherein the identifying the one or more backbone genomic regions comprises maximizing a ranking function of sequencing load and utility associated with each of the backbone genomic regions; (b) identifying one or more predetermined hot-spot genomic regions, wherein the one or more hot-spots are selected using one or more of the following: (i) maximizing a ranking function of sequencing load and utility associated with each of the hot-spot genomic regions, (ii) nucleosome profiling across the one or more predetermined genomic regions, (iii) predetermined cancer driver mutations or prevalence across a relevant patient cohort, and (iv) empirically identified cancer driver mutations; (c) creating a first bait set that selectively captures the predetermined backbone genomic regions; and (d) creating a second bait set that selectively captures the predetermined hotspot genomic regions, wherein the second bait set has a higher capture efficiency than the first bait set. In some embodiments, a predetermined region (e.g., a predetermined backbone region or a predetermined hotspot region) is a region of interest (e.g., a backbone region of interest or a hotspot region of interest, respectively).
- In some embodiments, the identifying the one or more predetermined hotspots comprises using a programmed computer processor to rank a set of hotspot genomic regions based on a ranking function of sequencing load and utility associated with each of the hotspot genomic regions. In some embodiments, the identifying the one or more predetermined backbone genomic regions comprises: (i) ranking a set of backbone genomic regions based on a ranking function of sequencing load and utility associated with each of the predetermined backbone genomic regions; (ii) utilizing a set of empirically determined minor allele frequency (MAF) values or clonality of a variant measured by its MAF in relationship to the highest presumed driver or clonal mutation in a sample; or (iii) a combination of (i) and (ii).
- In some embodiments, the sequencing load of a genomic region is calculated by multiplying together one or more of: (i) size of the genomic region in base pairs, (ii) relative fraction of reads spent on sequencing fragments mapping to the genomic region, (iii) relative coverage as a result of sequence bias of the genomic region, (iv) relative coverage as a result of amplification bias of the genomic region, and (v) relative coverage as a result of capture bias of the genomic region. In some embodiments, the utility of a genomic region is calculated by multiplying together one or more of: (i) frequency of one or more actionable mutations in the genomic region, (ii) frequency of one or more mutations associated with above-average minor allele frequencies (MAFs) in the genomic region, (iii) fraction of patients in a cohort harboring a somatic mutation within the genomic region, (iv) sum of MAF for variants in patients in a cohort, said patients harboring a somatic mutation within the genomic region, and (v) ratio of (1) MAF for variants in patients in a cohort, said patients harboring a somatic mutation within the genomic region, to (2) maximum MAF for a given patient in the cohort.
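- The two products described above can be sketched as follows. This is a minimal illustration: the function and parameter names are hypothetical, the numeric inputs are made up, and any factor left at its default of 1.0 is simply omitted from the product, reflecting the "one or more of" language.

```python
from math import prod


def sequencing_load(size_bp, read_fraction=1.0, sequence_bias=1.0,
                    amplification_bias=1.0, capture_bias=1.0):
    """Product of sequencing-load factors (i)-(v); a factor left at 1.0
    does not affect the result."""
    return prod([size_bp, read_fraction, sequence_bias,
                 amplification_bias, capture_bias])


def utility(*factors):
    """Product of whichever utility factors (i)-(v) are supplied."""
    if not factors:
        raise ValueError("at least one utility factor is required")
    return prod(factors)


# Made-up example: a 1,000 bp region using only factors (i), (ii), and (v).
load = sequencing_load(1000, read_fraction=0.5, capture_bias=1.2)

# Made-up example: actionable-mutation frequency times cohort fraction.
region_utility = utility(0.04, 0.25)

print(load, region_utility)
```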
- In some embodiments, the actionable mutations comprise one or more of: (i) druggable mutations, (ii) mutations for therapeutic monitoring, (iii) disease specific mutations, (iv) tissue specific mutations, (v) cell type specific mutations, (vi) resistance mutations, and (vii) diagnostic mutations. In some embodiments, the mutations associated with higher minor allele frequencies comprise one or more driver mutations or are known from external data or annotation sources.
- In another aspect, the present disclosure provides a method comprising: (a) providing a plurality of bait mixtures, wherein each bait mixture comprises a first bait set that selectively hybridizes to a first set of genomic regions and a second bait set that selectively hybridizes to a second set of genomic regions, and wherein the bait mixtures comprise the first bait set at different concentrations and the second bait set at the same concentrations; (b) contacting each bait mixture with a nucleic acid sample to capture nucleic acid from the sample with the bait sets, wherein the second bait set in each mixture is provided at a concentration that is at or above a saturation point of the second bait set, wherein nucleic acid from the sample is captured by the bait sets; (c) sequencing a portion of the nucleic acids captured with each bait mixture to produce sets of sequence reads within an allocated number of sequence reads; (d) determining the read depth of sequence reads for the first bait set and the second bait set for each bait mixture; and (e) identifying at least one bait mixture that provides read depths for the second set of genomic regions; wherein the read depths for the second set of genomic regions provides a sensitivity of detecting of at least 0.0001%.
- In some embodiments, the second bait set has a saturation point when subjected to titration, which titration comprises: generating a titration curve comprising: (i) measuring the capture efficiency of the second bait set as a function of the concentration of the baits; and (ii) identifying an inflection point within the titration curve, thereby identifying a saturation point associated with the second bait set.
- In some embodiments, the saturation point is selected such that an observed capture efficiency increases by less than 20% at a concentration of the bait twice that of the first concentration. In some embodiments, the saturation point is selected such that an observed capture efficiency increases by less than 10% at a concentration of the bait twice that of the first concentration. In some embodiments, the saturation point is selected such that an observed capture efficiency increases by less than 5% at a concentration of the bait twice that of the first concentration. In some embodiments, the saturation point is selected such that an observed capture efficiency increases by less than 2% at a concentration of the bait twice that of the first concentration. In some embodiments, the saturation point is selected such that an observed capture efficiency increases by less than 1% at a concentration of the bait twice that of the first concentration.
- In some embodiments, the first bait set or the second bait set selectively enrich for one or more nucleosome-associated regions of a genome, said nucleosome-associated regions comprising genomic regions having one or more genomic base positions with differential nucleosomal occupancy, wherein the differential nucleosomal occupancy is characteristic of a cell or tissue type of origin or disease state. In some embodiments, the first set of genomic regions or the second genomic regions comprises one or more actionable mutations, wherein the one or more actionable mutations comprise one or more of: (i) druggable mutations, (ii) mutations for therapeutic monitoring, (iii) disease specific mutations, (iv) tissue specific mutations, (v) cell type specific mutations, (vi) resistance mutations, and (vii) diagnostic mutations.
- In some embodiments, the first and second genomic regions comprise at least a portion of each of at least 5 genes selected from Table 3. In some embodiments, the first and second genomic regions have a size between about 25 kilobases to 1,000 kilobases and a read depth of between 1,000 counts/base and 50,000 counts/base.
- In one aspect, the present disclosure provides a method for enriching multiple genomic regions, comprising: (a) bringing a predetermined amount of nucleic acid from a sample in contact with a bait mixture comprising: (i) a first bait set that selectively hybridizes to a first set of genomic regions of the nucleic acid from the sample, which first bait set is provided at a first concentration that is less than a saturation point of the first bait set, and (ii) a second bait set that selectively hybridizes to a second set of genomic regions of the nucleic acid from the sample, which second bait set is provided at a second concentration that is at or above a saturation point of the second bait set; and (b) enriching the nucleic acid from the sample for the first set of genomic regions and the second set of genomic regions, thereby producing an enriched nucleic acid.
- In some embodiments, the second bait set has a saturation point that is larger than substantially all of the saturation points associated with baits in the second bait set when a bait of the second bait set is subjected to a titration curve generated by (i) measuring capture efficiency of a bait of the second bait set as a function of the concentration of the bait, and (ii) identifying an inflection point within the titration curve, thereby identifying a saturation point associated with the bait. In some embodiments, the saturation point of the first bait set is selected such that an observed capture efficiency increases by less than 10% at a concentration of the bait twice that of the first concentration. In some embodiments, the first bait set or the second bait set selectively enrich for one or more nucleosome-associated regions of a genome, the nucleosome-associated regions comprising genomic regions having one or more genomic base positions with differential nucleosomal occupancy, wherein the differential nucleosomal occupancy is characteristic of a cell or tissue type of origin or disease state. In some embodiments, the method further comprises (c) sequencing the enriched nucleic acid to produce a plurality of sequence reads. In some embodiments, the method further comprises (d) producing an output comprising nucleic acid sequences representative of the nucleic acid from the sample.
- In one aspect, the present disclosure provides a method comprising: (a) providing a plurality of bait mixtures, wherein each of the plurality of bait mixtures comprises a first bait set that selectively hybridizes to a first set of genomic regions and a second bait set that selectively hybridizes to a second set of genomic regions, wherein the first bait set is at different concentrations across the plurality of bait mixtures and the second bait set is at the same concentration across the plurality of bait mixtures; (b) contacting each of the plurality of bait mixtures with a nucleic acid sample to capture nucleic acids from the nucleic acid sample with the first bait set and the second bait set, wherein the second bait set in each bait mixture is provided at a first concentration that is at or above a saturation point of the second bait set, wherein nucleic acids from the nucleic acid sample are captured by the first bait set and the second bait set; (c) sequencing a portion of the nucleic acids captured with each bait mixture to produce sets of sequence reads within an allocated number of sequence reads; (d) determining the read depth of sequence reads for the first bait set and the second bait set for each bait mixture; and (e) identifying at least one bait mixture that provides read depths for the second set of genomic regions; wherein the read depths for the second set of genomic regions provides a sensitivity of detecting of a genetic variant of at least 0.0001% minor allele frequency (MAF). In some embodiments, steps (d) and/or (e) are optional.
- In some embodiments, the second bait set has a saturation point when subjected to titration, which titration comprises generating a titration curve comprising: (i) measuring capture efficiency of the second bait set as a function of the concentration of the baits; and (ii) identifying an inflection point within the titration curve, thereby identifying a saturation point associated with the second bait set. In some embodiments, the saturation point is selected such that an observed capture efficiency increases by less than 10% at a concentration of the bait set twice that of the first concentration. In some embodiments, the first bait set or the second bait set selectively enrich for one or more nucleosome-associated regions of a genome, the nucleosome-associated regions comprising genomic regions having one or more genomic base positions with differential nucleosomal occupancy, wherein the differential nucleosomal occupancy is characteristic of a cell or tissue type of origin or disease state. In some embodiments, the first set of genomic regions comprises one or more actionable mutations, wherein the one or more actionable mutations comprise one or more of: (i) druggable mutations, (ii) mutations for therapeutic monitoring, (iii) disease specific mutations, (iv) tissue specific mutations, (v) cell type specific mutations, (vi) resistance mutations, and (vii) diagnostic mutations. In some embodiments, the first genomic regions comprise at least a portion of each of at least 5 genes selected from Table 1. In some embodiments, the first genomic regions have a size between about 25 kilobases to 1,000 kilobases and a read depth of between 1,000 counts/base and 50,000 counts/base. In some embodiments, the saturation point of the second bait set is selected such that an observed capture efficiency increases by less than 10% at a concentration of the bait twice that of the second concentration. 
In some embodiments, the second set of genomic regions comprises one or more actionable mutations, wherein the one or more actionable mutations comprise one or more of: (i) druggable mutations, (ii) mutations for therapeutic monitoring, (iii) disease specific mutations, (iv) tissue specific mutations, (v) cell type specific mutations, (vi) resistance mutations, and (vii) diagnostic mutations. In some embodiments, the second genomic regions comprise at least a portion of each of at least 5 genes selected from Table 1. In some embodiments, the second genomic regions have a size between about 25 kilobases to 1,000 kilobases and a read depth of between 1,000 counts/base and 50,000 counts/base.
- Additional aspects and advantages of the present disclosure will become readily apparent to those skilled in this art from the following detailed description, wherein only illustrative embodiments of the present disclosure are shown and described. As will be realized, the present disclosure is capable of other and different embodiments, and its several details are capable of modifications in various obvious respects, all without departing from the disclosure. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.
- All publications, patents, and patent applications mentioned in this specification are herein incorporated by reference to the same extent as if each individual publication, patent, or patent application was specifically and individually indicated to be incorporated by reference. To the extent publications and patents or patent applications incorporated by reference contradict the disclosure contained in the specification, the specification is intended to supersede and/or take precedence over any such contradictory material.
- The novel features of the disclosure are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present disclosure will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the disclosure are utilized, and the accompanying drawings (also “Figure” and “FIG.” herein), of which:
- FIG. 1 illustrates how a plurality of reads may be generated for each locus enriched from a cell-free nucleic acid sample.
- FIG. 2 illustrates an example of an insertion being supported by a large family.
- FIG. 3 illustrates an example of small families of reads (which may appear to provide evidence for a real variant) and large families of reads (which may indicate a likely random error stemming from PCR or sequencing).
- FIG. 4 illustrates the various parameters that may be used in a hypothesis test and how each parameter may be related to a particular probability, e.g., of a family of reads matching a reference, of a strand's reads matching a reference, and of a read matching a reference.
- FIG. 5 illustrates an example of a computer system that may be programmed or otherwise configured to implement methods of the present disclosure.
- FIG. 6 illustrates an exemplary saturation curve showing unique molecule count on the y-axis as a function of input cfDNA amount on the x-axis.
- While various embodiments of the invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous variations, changes, and substitutions may occur to those skilled in the art without departing from the invention. It should be understood that various alternatives to the embodiments of the invention described herein may be employed.
- The term “genetic variant,” as used herein, generally refers to an alteration, variant or polymorphism in a nucleic acid sample or genome of a subject. Such alteration, variant or polymorphism can be with respect to a reference genome, which may be a reference genome of the subject or other individual. Single nucleotide polymorphisms (SNPs) are a form of polymorphisms. In some examples, one or more polymorphisms comprise one or more single nucleotide variations (SNVs), insertions, deletions, repeats, small insertions, small deletions, small repeats, structural variant junctions, variable length tandem repeats, and/or flanking sequences. Copy number variations (CNVs), transversions and other rearrangements are also forms of genetic variation. A genomic alteration may be a base change, insertion, deletion, repeat, copy number variation, or transversion.
- The term “polynucleotide,” or “polynucleic acid” as used herein, generally refers to a molecule comprising one or more nucleic acid subunits (a “nucleic acid molecule”). A polynucleotide can include one or more subunits selected from adenosine (A), cytosine (C), guanine (G), thymine (T) and uracil (U), or variants thereof. A nucleotide can include A, C, G, T or U, or variants thereof. A nucleotide can include any subunit that can be incorporated into a growing nucleic acid strand. Such subunit can be an A, C, G, T, or U, or any other subunit that is specific to one or more complementary A, C, G, T or U, or complementary to a purine (i.e., A or G, or variant thereof) or a pyrimidine (i.e., C, T or U, or variant thereof). Identification of a subunit can enable individual nucleic acid bases or groups of bases (e.g., AA, TA, AT, GC, CG, CT, TC, GT, TG, AC, CA, or uracil-counterparts thereof) to be resolved. In some examples, a polynucleotide is deoxyribonucleic acid (DNA) or ribonucleic acid (RNA), or derivatives thereof. A polynucleotide can be single-stranded or double stranded.
- A polynucleotide can comprise any type of nucleic acids, such as DNA and/or RNA. For example, if a polynucleotide is DNA, it can be genomic DNA, complementary DNA (cDNA), or any other deoxyribonucleic acid. A polynucleotide can be a cell-free nucleic acid. As used herein, the terms cell-free nucleic acid and extracellular nucleic acid can be used interchangeably. A polynucleotide can be cell-free DNA (cfDNA). For example, the polynucleotide can be circulating DNA. The circulating DNA can comprise circulating tumor DNA (ctDNA). The cell-free or extracellular nucleic acids can be derived from any bodily fluid including, but not limited to, whole blood, platelets, serum, plasma, synovial fluid, lymphatic fluid, ascites fluid, interstitial or extracellular fluid, the fluid in spaces between cells, gingival crevicular fluid, bone marrow, cerebrospinal fluid, saliva, mucous, sputum, semen, sweat, urine, cervical fluid or lavage, vaginal fluid or lavage, mammary gland or lavage, and/or any combination thereof. In some embodiments, the cell-free or extracellular nucleic acids can be derived from plasma. In some embodiments, a bodily fluid containing cells can be processed to remove the cells in order to purify and/or extract cell-free or extracellular nucleic acids. A polynucleotide can be double-stranded or single-stranded. Alternatively, a polynucleotide can comprise a combination of a double-stranded portion and a single-stranded portion.
- Polynucleotides do not have to be cell-free. In some cases, the polynucleotides can be isolated from a sample. A sample can be a composition comprising an analyte. For example, a sample can be any biological sample isolated from a subject including, without limitation, bodily fluid, whole blood, platelets, serum, plasma, stool, red blood cells, white blood cells or leucocytes, endothelial cells, tissue biopsies, synovial fluid, lymphatic fluid, ascites fluid, interstitial or extracellular fluid, the fluid in spaces between cells, including gingival crevicular fluid, bone marrow, cerebrospinal fluid, saliva, mucous, sputum, semen, sweat, urine, or any other bodily fluids, and/or any combination thereof. A bodily fluid can include saliva, blood, or serum. For example, a polynucleotide can be cell-free DNA isolated from a bodily fluid, e.g., blood or serum. A sample can also be a tumor sample, which can be obtained from a subject by various approaches, including, but not limited to, venipuncture, excretion, ejaculation, massage, biopsy, needle aspirate, lavage, scraping, surgical incision, or intervention or other approaches. In some embodiments, a sample is a nucleic acid sample, e.g., a purified nucleic acid sample. In some embodiments, a nucleic acid sample comprises cell-free DNA (cfDNA). An analyte in a sample can be in various stages of purity. For example, a raw sample taken directly from a subject can contain the analyte in an unpurified state. A sample also may be enriched for an analyte. An analyte also may be present in the sample in isolated or substantially isolated form.
- The polynucleotides can comprise sequences associated with cancer, such as acute lymphoblastic leukemia (ALL), acute myeloid leukemia (AML), adrenocortical carcinoma, Kaposi Sarcoma, anal cancer, basal cell carcinoma, bile duct cancer, bladder cancer, bone cancer, osteosarcoma, malignant fibrous histiocytoma, brain stem glioma, brain cancer, craniopharyngioma, ependymoblastoma, ependymoma, medulloblastoma, medulloeptithelioma, pineal parenchymal tumor, breast cancer, bronchial tumor, Burkitt lymphoma, Non-Hodgkin lymphoma, carcinoid tumor, cervical cancer, chordoma, chronic lymphocytic leukemia (CLL), chronic myelogenous leukemia (CML), colon cancer, colorectal cancer, cutaneous T-cell lymphoma, ductal carcinoma in situ, endometrial cancer, esophageal cancer, Ewing Sarcoma, eye cancer, intraocular melanoma, retinoblastoma, fibrous histiocytoma, gallbladder cancer, gastric cancer, glioma, hairy cell leukemia, head and neck cancer, heart cancer, hepatocellular (liver) cancer, Hodgkin lymphoma, hypopharyngeal cancer, kidney cancer, laryngeal cancer, lip cancer, oral cavity cancer, lung cancer, non-small cell carcinoma, small cell carcinoma, melanoma, mouth cancer, myelodysplastic syndromes, multiple myeloma, medulloblastoma, nasal cavity cancer, paranasal sinus cancer, neuroblastoma, nasopharyngeal cancer, oral cancer, oropharyngeal cancer, osteosarcoma, ovarian cancer, pancreatic cancer, papillomatosis, paraganglioma, parathyroid cancer, penile cancer, pharyngeal cancer, pituitary tumor, plasma cell neoplasm, prostate cancer, rectal cancer, renal cell cancer, rhabdomyosarcoma, salivary gland cancer, Sezary syndrome, skin cancer, nonmelanoma, small intestine cancer, soft tissue sarcoma, squamous cell carcinoma, testicular cancer, throat cancer, thymoma, thyroid cancer, urethral cancer, uterine cancer, uterine sarcoma, vaginal cancer, vulvar cancer, Waldenstrom macroglobulinemia, and/or Wilms Tumor.
- A sample can comprise various amounts of nucleic acid that contain genome equivalents. For example, a sample of about 30 ng DNA can contain about 10,000 (10^4) haploid human genome equivalents and, in the case of cfDNA, about 200 billion (2×10^11) individual polynucleotide molecules. Similarly, a sample of about 100 ng of DNA can contain about 30,000 haploid human genome equivalents and, in the case of cfDNA, about 600 billion individual molecules.
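- The orders of magnitude quoted above can be reproduced with a back-of-the-envelope calculation. The constants below are assumptions introduced for illustration and are not stated in this disclosure: a haploid human genome of roughly 3.3×10^9 base pairs weighing roughly 3.3 pg, and a typical cfDNA fragment length of roughly 165 base pairs.

```python
HAPLOID_GENOME_BP = 3.3e9   # ASSUMPTION: approximate haploid human genome size (bp)
HAPLOID_GENOME_PG = 3.3     # ASSUMPTION: approximate mass of one haploid genome (pg)
CFDNA_FRAGMENT_BP = 165     # ASSUMPTION: typical cfDNA fragment length (bp)


def haploid_genome_equivalents(mass_ng):
    """Number of haploid genome copies in mass_ng nanograms of DNA."""
    return mass_ng * 1000.0 / HAPLOID_GENOME_PG  # 1 ng = 1000 pg


def cfdna_molecules(mass_ng):
    """Approximate number of cfDNA fragments, assuming each genome copy
    is fully fragmented into pieces of CFDNA_FRAGMENT_BP base pairs."""
    fragments_per_genome = HAPLOID_GENOME_BP / CFDNA_FRAGMENT_BP
    return haploid_genome_equivalents(mass_ng) * fragments_per_genome


for mass in (30, 100):
    print(f"{mass} ng: ~{haploid_genome_equivalents(mass):.2e} genome equivalents, "
          f"~{cfdna_molecules(mass):.2e} cfDNA molecules")
```

- Under these assumptions, 30 ng corresponds to roughly 9×10^3 genome equivalents and roughly 1.8×10^11 fragments, in line with the rounded figures of about 10^4 and about 2×10^11 given above.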
- A sample can comprise nucleic acids from different sources. For example, a sample can comprise germline DNA or somatic DNA. A sample can comprise nucleic acids carrying mutations. For example, a sample can comprise DNA carrying germline mutations and/or somatic mutations. A sample can also comprise DNA carrying cancer-associated mutations (e.g., cancer-associated somatic mutations).
- The term “subject,” as used herein, generally refers to an animal, such as a mammalian species (e.g., human) or avian (e.g., bird) species, or other organism, such as a plant. More specifically, the subject can be a vertebrate, a mammal, a mouse, a primate, a simian or a human. Animals include, but are not limited to, farm animals, sport animals, and pets. A subject can be a healthy individual, an individual that has or is suspected of having a disease or a pre-disposition to the disease, or an individual that is in need of therapy or suspected of needing therapy. A subject can be a patient.
- The term “genome,” as used herein, generally refers to an entirety of an organism's hereditary information. A genome can be encoded either in DNA or in RNA. A genome can comprise coding regions that code for proteins as well as non-coding regions. A genome can include the sequence of all chromosomes together in an organism. For example, the human genome has a total of 46 chromosomes. The sequence of all of these together may constitute a human genome. A genome may comprise a diploid or a haploid genome.
- The term “bait,” as used herein, generally refers to a target-specific oligonucleotide (e.g., a capture probe) designed and used to capture specific genomic regions of interest (e.g., targets, or predetermined genomic regions of interest). The bait may capture its intended targets by selectively hybridizing to complementary nucleic acids.
- The term “bait panel” or “bait set panel,” as used herein, generally refers to a set of baits targeted toward a selected set of genomic regions of interest. A bait panel or bait set panel may be referred to as a bait mixture. The bait panel may capture its intended targets in a single selective hybridization step.
- The term “accuracy,” of detecting a genetic variant (e.g., an indel), as used herein, generally refers to the percentage of candidate (e.g., detected) genetic variants detected through analysis of one or more sequence reads that are identified as a true genetic variant attributable to biological origin (e.g., not attributable to introduced error such as that stemming from sequencing or amplification error). The term “error rate,” of detecting a genetic variant (e.g., an indel), as used herein, generally refers to the percentage of candidate (e.g., detected) genetic variants detected through analysis of one or more sequence reads that are identified as an introduced genetic variant attributable to non-biological origin (e.g., sequencing or amplification error). For example, if analysis of one or more sequence reads identifies 100 candidate genetic variants, of which 90 are attributable to biological origin and 10 are attributed to non-biological origin, then this analysis has an accuracy of detecting the genetic variant of 90% and an error rate of 10%.
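- The worked example in these definitions can be expressed as a short calculation. The function name is hypothetical; the counts are those given in the example above.

```python
def accuracy_and_error_rate(n_true, n_introduced):
    """Return (accuracy %, error rate %) for a set of candidate variants,
    where n_true are attributed to biological origin and n_introduced
    to non-biological origin (e.g., sequencing or amplification error)."""
    total = n_true + n_introduced
    if total == 0:
        raise ValueError("no candidate variants to evaluate")
    return 100.0 * n_true / total, 100.0 * n_introduced / total


# 100 candidates: 90 of biological origin, 10 of non-biological origin.
accuracy, error_rate = accuracy_and_error_rate(90, 10)
print(accuracy, error_rate)
```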
- The term “about” and its grammatical equivalents in relation to a reference numerical value can include a range of values up to plus or minus 10% from that value. For example, the amount “about 10” can include amounts from 9 to 11. In other embodiments, the term “about” in relation to a reference numerical value can include a range of values plus or minus 10%, 9%, 8%, 7%, 6%, 5%, 4%, 3%, 2%, or 1% from that value.
- The term “at least” and its grammatical equivalents in relation to a reference numerical value can include the reference numerical value and greater than that value. For example, the amount “at least 10” can include the value 10 and any numerical value above 10, such as 11, 100, and 1,000.
- The term “at most” and its grammatical equivalents in relation to a reference numerical value can include the reference numerical value and less than that value. For example, the amount “at most 10” can include the value 10 and any numerical value under 10, such as 9, 8, 5, 1, 0.5, and 0.1.
- The terms “processing”, “calculating”, and “comparing” can be used interchangeably. The terms can refer to determining a difference, e.g., a difference in number or sequence. For example, gene expression, copy number variation (CNV), indel, and/or single nucleotide variant (SNV) values or sequences can be processed.
- The present disclosure provides methods and systems for multi-resolution analysis of cell-free nucleic acids (e.g., deoxyribonucleic acid (DNA)), wherein targeted genomic regions of interest may be enriched with capture probes (“baits”) selected for one or more bait set panels using a differential tiling and capture scheme. A differential tiling and capture scheme uses bait sets of different relative concentrations to differentially tile (e.g., at different “resolutions”) across genomic regions associated with baits, subject to a set of constraints (e.g., sequencer constraints such as sequencing load, utility of each bait, etc.), and capture them at a desired level for downstream sequencing. These targeted genomic regions of interest may include single-nucleotide variants (SNVs) and indels (i.e., insertions or deletions). The targeted genomic regions of interest may comprise backbone genomic regions of interest (“backbone regions”) or hot-spot genomic regions of interest (“hot-spot regions” or “hotspot regions” or “hot-spots” or “hotspots”). While “hotspots” can refer to particular loci associated with sequence variants, “backbone” regions can refer to larger genomic regions, each of which can have one or more potential sequence variants. For example, a backbone region can be a region containing one or more cancer-associated mutations, while a hotspot can be a locus with a particular mutation associated with recurring cancer. Both backbone and hot-spot genomic regions of interest may comprise tumor-relevant marker genes commonly included in liquid biopsy assays (e.g., BRAF, BRCA, EGFR, KRAS, PIK3CA, ROS1, TP53, and others), for which one or more variants may be expected to be seen in subjects with cancer.
- Among the set of tumor-relevant marker genes that may be selected for inclusion in a bait set panel, hot-spot genomic regions of interest may be selected to be represented by a higher proportion of sequence reads compared to the backbone genomic regions of interest in the experimental protocol. This experimental protocol may comprise steps including isolation, amplification, capture, sequencing, and data analysis. The selection of regions as hot-spot regions or backbone regions may be driven by considerations such as the capture efficiency, sequencing load, and/or utility associated with each of the regions and their corresponding bait. Utility may be assessed by the clinical relevance (e.g., “clinical value”) of a genomic marker of interest (e.g., a tumor marker) toward a liquid biopsy assay, e.g., predetermined cancer driver mutations, genomic regions with prevalence across a relevant patient cohort, empirically identified cancer driver mutations, or nucleosome-associated genomic regions. For example, utility can be measured by a metric representative of expected yield of actionable and/or disease-associated genetic variants in detection or contribution toward determining tissue of origin or disease state of a sample. Utility may be a monotonically increasing function of clinical value.
- Given that each sequencing run of a given sample of cell-free nucleic acids is typically limited by a certain total number of reads, a multi-resolution analysis approach to generate a bait set panel that preferentially enriches “hot-spot regions” as compared to backbone regions will enable efficient use of sequencing reads for genetic variant detection for cancer detection and assessment applications, by focusing sequencing at higher read depths for hot-spot regions over backbone regions. Using this approach may enable the improvement of a sample assay, given a limited or constrained sequencing load (e.g., number of sequenced reads per sample assayed), such that a greater number of clinically actionable genetic variants may be detected per sample assay compared to an un-optimized sample assay.
- The present disclosure provides methods for improving accuracy of detecting an insertion or deletion (indel) from a plurality of sequence reads derived from cell-free deoxyribonucleic acid (cfDNA) molecules in a bodily sample of a subject, which plurality of sequence reads are generated by nucleic acid sequencing. For each of the plurality of sequence reads associated with cfDNA molecules, a candidate indel may be identified. Each candidate indel may then be classified as either a true indel or an introduced indel, using a combination of predetermined expectations of (i) an indel being detected in one or more sequence reads of the plurality of sequence reads, (ii) that a detected indel is a true indel present in a given cfDNA molecule of the cell-free DNA molecules, given that an indel has been detected in the one or more of the sequence reads, and/or (iii) that a detected indel is introduced by non-biological error, given that an indel has been detected in the one or more of the sequence reads, in conjunction with one or more model parameters to perform a hypothesis test. This approach may reduce error and improve accuracy of detecting an indel from sequence read data.
- One embodiment of multi-resolution analysis proceeds as follows. Regions of a genome are selected for sequencing. These regions may be collectively referred to as a panel or a panel block. The panel is divided into a first set of genomic regions and a second set of genomic regions. The first set of genomic regions may be referred to as the backbone region, while the second set may be referred to as the hotspot regions. These regions may be divided between genes or within genes or outside genes as desired by the practitioner. For example, an exon of a gene may be divided into portions allocated to the hotspot region and portions allocated to the backbone region.
- A first bait set and a second bait set are prepared which selectively hybridize to the first genomic regions and the second genomic regions, respectively. Using methods described herein, e.g., preparation of titration curves, bait set concentrations are determined which, for a test sample having a predetermined amount of DNA, capture DNA in the sample at a saturation point (for the bait set directed to the hotspot regions) and below the saturation point (for the bait set directed to the backbone regions). Capturing DNA molecules from a sample at the saturation point contributes to detecting genetic variants at the highest level of sensitivity because molecules bearing genetic variants are more likely to be captured.
- The amount of sequencing data that can be obtained from a sample is finite, and constrained by such factors as the quality of nucleic acid templates, number of target sequences, scarcity of specific sequences, limitations in sequencing techniques, and practical considerations such as time and expense. Thus, a “read budget” is a way to conceptualize the amount of genetic information that can be extracted from a sample. A per-sample read budget can be selected that identifies the total number of base reads to be allocated to a test sample comprising a predetermined amount of DNA in a sequencing experiment. The read budget can be based on total reads produced, e.g., including redundant reads produced through amplification. Alternatively, it can be based on number of unique molecules detected in the sample. In certain embodiments, the read budget can reflect the amount of double-stranded support for a call at a locus, that is, the percentage of loci for which reads from both strands of a DNA molecule are detected.
- Factors of a read budget include read depth and panel length. For example, a read budget of 3,000,000,000 reads can be allocated as 150,000 bases at an average read depth of 20,000 reads/base. Read depth can refer to number of molecules producing a read at a locus. In the present disclosure, the reads at each base can be allocated between bases in the backbone region of the panel, at a first average read depth and bases in the hotspot region of the panel, at a deeper read depth.
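- The relationship between read budget, panel length, and read depth can be illustrated with a short calculation. The following is a minimal sketch; the function names and the split of the panel into 30,000 hotspot bases read at 50,000 reads/base are hypothetical values chosen for illustration and are not taken from the disclosure.

```python
def read_budget(panel_bases: int, mean_depth: int) -> int:
    """Total base reads needed to cover a panel at a given average read depth."""
    return panel_bases * mean_depth


def backbone_depth(budget: int, panel_bases: int,
                   hotspot_bases: int, hotspot_depth: int) -> float:
    """Average depth left for backbone bases after the hotspot allocation."""
    hotspot_reads = hotspot_bases * hotspot_depth
    backbone_bases = panel_bases - hotspot_bases
    if hotspot_reads > budget:
        raise ValueError("hotspot allocation exceeds the read budget")
    if backbone_bases <= 0:
        raise ValueError("panel must contain at least one backbone base")
    return (budget - hotspot_reads) / backbone_bases


# Figures from the text: 150,000 bases at an average of 20,000 reads/base.
budget = read_budget(150_000, 20_000)

# Hypothetical split: 30,000 hotspot bases read at 50,000 reads/base.
depth = backbone_depth(budget, 150_000, hotspot_bases=30_000, hotspot_depth=50_000)
print(f"budget = {budget:,} reads; backbone depth = {depth:,.0f} reads/base")
```

Under these assumed numbers, reading the hotspot bases more deeply than the panel average leaves the backbone bases with a depth below that average, which is the trade-off described in the text.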
- By way of non-limiting example, if a read budget consists of 100,000 read counts for a given sample, those 100,000 read counts will be divided between reads of backbone regions and reads of hotspot regions. Allocating a large number of those reads (e.g., 90,000 reads) to backbone regions will result in a small number of reads (e.g., the remaining 10,000 reads) being allocated to hotspot regions. Conversely, allocating a large number of reads (e.g., 90,000 reads) to hotspot regions will result in a small number of reads (e.g., the remaining 10,000 reads) being allocated to backbone regions. Thus, a skilled worker can allocate a read budget to provide desired levels of sensitivity and specificity. In certain embodiments, the read budget can be between 100,000,000 reads and 100,000,000,000 reads, e.g., between 500,000,000 reads and 50,000,000,000 reads, or between 1,000,000,000 reads and 5,000,000,000 reads across, for example, 20,000 bases to 100,000 bases.
- First and second sensitivity levels are selected for detection of genetic variants in the backbone and hotspot regions, respectively. Sensitivity, as used herein, refers to the detection limit of a genetic variant as a function of frequency in a sample. For example, the sensitivity may be at least 1%, at least 0.1%, at least 0.01%, at least 0.001%, at least 0.0001%, or at least 0.00001%, meaning that a given sequence can be detected in a sample at a frequency of at least 1%, at least 0.1%, at least 0.01%, at least 0.001%, at least 0.0001%, or at least 0.00001%, respectively. That is, genetic variants present in the sample at these levels are detectable by sequencing. Typically, sensitivity selected for hotspot regions will be higher than sensitivity selected for backbone regions. For example, the sensitivity level for hotspot regions may be selected to be at least 0.001%, while the sensitivity level for backbone regions may be selected to be at least 0.1% or at least 1%.
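- The link between read depth and the variant frequency that can be detected can be illustrated with a simple expectation. The following sketch assumes, purely for illustration, that the expected number of variant-supporting molecules at a locus equals the read depth multiplied by the variant frequency; it ignores capture efficiency, sampling noise, and sequencing error, and it is not the detection method of the disclosure. The function name and the depth of 20,000 reads/base are hypothetical.

```python
def expected_variant_molecules(read_depth: int, variant_frequency_percent: float) -> float:
    """Expected count of molecules carrying a variant at one locus.

    Simplifying assumption: expectation = depth x frequency, with no
    allowance for capture efficiency, sampling noise, or sequencing error.
    """
    return read_depth * variant_frequency_percent / 100


# Hypothetical depth of 20,000 reads/base at several variant frequencies.
for frequency in (1.0, 0.1, 0.001):
    molecules = expected_variant_molecules(20_000, frequency)
    print(f"{frequency}% variant frequency: about {molecules:g} supporting molecules expected")
```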
- The relative concentrations of bait sets directed to backbone regions and hotspot regions can be selected to optimize reads in a sequencing experiment with respect to selected read budget and selected sensitivities for the backbone and hotspot regions for a selected sample. So, for example, given a test sample containing a predetermined amount of DNA, and a hotspot bait set that captures DNA for the hotspot regions at saturation, an amount of backbone bait set that is below saturation for the sample is selected such that in a sequencing experiment producing reads within the selected read budget, the resultant read set detects genetic variants in the hotspot regions and in the backbone regions at the preselected sensitivity levels.
- The relative amounts of the bait sets are a function of several factors. One of these factors is the relative proportion of the panel allocated to the hotspot regions and to the backbone regions, respectively. The larger the relative percentage of hotspot regions in the panel, the smaller the number of reads, and thus the share of the read budget, that can be allocated to the backbone regions. Another factor is the selected sensitivity of detection for hotspot regions. For a given sample, the higher the sensitivity that is necessary for the hotspot regions, the lower the sensitivity will be for the backbone regions. Another factor is the read budget. For a given sensitivity for the hotspot regions, the smaller the read budget, the lower the sensitivity possible for the backbone regions. Another factor is the size of the overall panel. For any given read budget, the larger the panel, the more sensitivity of the backbone regions must be sacrificed to achieve the desired sensitivity at the hotspot regions.
- It will be evident that for any given read budget, increasing the percentage of reads allocated to the backbone regions will decrease the sensitivity of detection at the hotspot regions. Conversely, increasing the sensitivity of detection at the hotspot regions, by increasing the amount of the read budget allocated to hotspot regions, decreases the sensitivity of detection at the backbone regions. Accordingly, the relative sensitivity levels can be selected so that sensitivity at hotspot regions is high enough to achieve targeted detection levels, while sensitivity at backbone regions is not so low that meaningful levels of genetic variants are missed. These relative levels are selected by the practitioner to achieve the desired results. In some embodiments, the skilled worker will use a bait mixture calculated to capture all (or substantially all) hotspot regions in a sample and a portion of the backbone regions, such that the read depth of the captured regions will provide desired hotspot and backbone sensitivities.
- In an aspect, a bait set panel may comprise one or more bait sets that selectively enrich for one or more nucleosome-associated regions of a genome. Nucleosome-associated regions may comprise genomic regions having one or more genomic base positions with differential nucleosomal occupancy. Differential nucleosomal occupancy may be characteristic of a cell or tissue type of origin or disease state. Analysis of differential nucleosomal occupancy may be performed using one or more nucleosomal occupancy profiles of a given cell or tissue type. Examples of nucleosomal occupancy profiling techniques include Statham et al., Genomics Data, Volume 3, March 2015, Pages 94-96 (2015), which is entirely incorporated herein by reference. Cell-free nucleic acids in a sample obtained from a subject may be primarily shed through a combination of apoptotic and necrotic processes in cells, tissues, and organs. As a result of variable nucleosomal occupancy and protection against DNA cleavage in certain locations of a genome, nucleosomal patterns or profiles associated with apoptotic processes and necrotic processes may be evident from analyzing cell-free nucleic acid fragments for nucleosome-associated regions of a genome.
- Detection of such nucleosome-associated patterns can be used, independently or in conjunction with detected somatic variants, to monitor a condition in a subject. For example, as a tumor expands, the ratio of necrosis to apoptosis in the tumor micro-environment may change. Such changes in necrosis and/or apoptosis can be detected by selectively enriching a cell-free nucleic acid sample for one or more nucleosome-associated regions. As another example, a distribution of fragment lengths may be observed due to differential nucleosomal protection across different cell types, or across tumor vs. non-tumor cells. Analysis of nucleosome-associated regions for fragment length distribution may be clinically relevant for cancer detection and assessment applications. This analysis may comprise selectively enriching for nucleosome-associated regions, then sequencing the enriched regions to produce a plurality of sequence reads representative of the nucleic acid sample, and analyzing the sequence reads for genetic variants and nucleosome profiles of interest.
- Once nucleosome-associated regions have been identified, they may be used for modular panel design. See below. Such modular panel design may allow for designs of a set of probes or baits that selectively enrich regions of the genome that are relevant for nucleosomal profiling. By incorporating this “nucleosomal awareness,” sequence data from many individuals can be gleaned to optimize the procedure of panel design, e.g., the determination of which genomic locations to target and the optimal concentration of probes for these genomic locations.
- By incorporating knowledge of both somatic variations and structural variations and instability, panels of probes, baits or primers can be designed to target specific portions of the genome (“hotspots”) with known patterns or clusters of structural variation or instability. For example, statistical analysis of sequence data reveals a series of accumulated somatic events and structural variations, and thereby enables clonal evolution studies. The data analysis reveals important biological insights, including differential coverage across cohorts, patterns indicating the presence of certain subsets of tumors, foreign structural events in samples with high somatic mutation load, and differential coverage attributable to blood cells versus tumor cells.
- A localized genomic region refers to a short region of the genome that may range in length from, or from about, 2 to 200 base pairs, from 2 to 190 base pairs, from 2 to 180 base pairs, from 2 to 170 base pairs, from 2 to 160 base pairs, from 2 to 150 base pairs, from 2 to 140 base pairs, from 2 to 130 base pairs, from 2 to 120 base pairs, from 2 to 110 base pairs, from 2 to 100 base pairs, from 2 to 90 base pairs, from 2 to 80 base pairs, from 2 to 70 base pairs, from 2 to 60 base pairs, from 2 to 50 base pairs, from 2 to 40 base pairs, from 2 to 30 base pairs, from 2 to 20 base pairs, from 2 to 10 base pairs, and/or from 2 to 5 base pairs. Each localized genomic region may contain a pattern or cluster of significant structural variation or instability. Genome partitioning maps may be provided to identify relevant localized genomic regions. A localized genomic region may contain a pattern or cluster of significant structural variation or structural instability. A cluster may be a hotspot region within a localized genomic region. The hotspot region may contain one or more significant fluctuations or peaks. A structural variation may be selected from the group consisting of: an insertion, a deletion, a translocation, a gene re-arrangement, methylation status, a micro-satellite, a copy number variation, a copy number-related structural variation, or any other variation which indicates differentiation. A structural variation can cause a variation in nucleosomal positioning.
- A genome partitioning map may be obtained by: (a) providing samples of cell-free DNA or RNA from two or more subjects in a cohort, (b) obtaining a plurality of sequence reads from each of the samples of cell-free DNA or RNA, and (c) analyzing the plurality of sequence reads to identify one or more localized genomic regions, each of which contains a pattern or cluster of significant structural variation or instability. Statistical analysis may be performed on sequence information to associate a set of sequence reads with one or more nucleosomal occupancy profiles representing distinct cohorts (e.g., a group of subjects with a common characteristic such as a disease state or a non-disease state).
- The statistical analysis may comprise providing one or more genome partitioning maps listing relevant genomic intervals representative of genes of interest for further analysis. The statistical analysis may further comprise selecting a set of one or more localized genomic regions based on the genome partitioning maps. The statistical analysis may further comprise analyzing one or more localized genomic regions in the set to obtain a set of one or more nucleosomal map disruptions. The statistical analysis may comprise one or more of (e.g., one or more, two or more, or three of): pattern recognition, deep learning, and unsupervised learning.
- A nucleosomal map disruption is a measured value that characterizes a given localized genomic region in terms of biologically relevant information. A nucleosomal map disruption may be associated with a driver mutation chosen from the group consisting of: wild-type, somatic variant, germline variant, and DNA methylation.
- One or more nucleosomal map disruptions may be used to classify a set of sequence reads as being associated with one or more nucleosomal occupancy profiles representing distinct cohorts. These nucleosomal occupancy profiles may be associated with one or more assessments. An assessment may be considered as part of a therapeutic intervention (e.g., treatment options, selection of treatment, further assessment by biopsy and/or imaging).
- An assessment may be selected from the group consisting of: indication, tumor type, tumor severity, tumor aggressiveness, tumor resistance to treatment, and tumor clonality. An assessment of tumor clonality may be determined from observing heterogeneity in nucleosomal map disruption across cell-free DNA molecules in a sample. An assessment of relative contributions of each of two or more clones may also be determined.
- Each of the one or more nucleosome-associated regions of a bait set panel may comprise at least one of: (i) significant structural variation, comprising a variation in nucleosomal positioning, said structural variation selected from the group consisting of: an insertion, a deletion, a translocation, a gene rearrangement, methylation status, a micro-satellite, a copy number variation, a copy number-related structural variation, or any other variation which indicates differentiation; and (ii) instability, comprising one or more significant fluctuations or peaks in a genome partitioning map indicating one or more locations of nucleosomal map disruptions in a genome. The one or more bait sets of a bait set panel may be configured to capture nucleosome-associated regions of the genome based on a function of a plurality of reference nucleosomal occupancy profiles associated with one or more disease states and one or more non-disease states.
- The one or more bait sets of a bait set panel may selectively enrich for one or more nucleosome-associated regions in a cell-free deoxyribonucleic acid (cfDNA) sample. For example, the bait set may selectively enrich for one or more nucleosome-associated regions by bringing a nucleic acid sample in contact with the bait set, and allowing the bait set to selectively hybridize to the set of nucleosome-associated genomic regions associated with the bait set.
- In an aspect, a method for enriching a nucleic acid sample for nucleosome-associated regions of a genome may comprise (a) bringing a nucleic acid sample in contact with a bait set panel, said bait set panel comprising one or more bait sets that selectively enrich for one or more nucleosome-associated regions of a genome; and (b) enriching the nucleic acid sample for one or more nucleosome-associated regions of a genome. The one or more bait sets in a bait set panel may be configured to capture nucleosome-associated regions of the genome based on a function of a plurality of reference nucleosomal occupancy profiles associated with one or more disease states and one or more non-disease states. The plurality of reference nucleosomal occupancy profiles may serve as a “map” for which analysis may reveal patterns or clusters of genomic regions and/or locations which may be targeted for capture for nucleosome-associated variant detection.
- The one or more bait sets in a bait set panel may selectively enrich for the one or more nucleosome-associated regions in a cell-free deoxyribonucleic acid (cfDNA) sample. The method for enriching a nucleic acid sample for nucleosome-associated regions of a genome may further comprise sequencing the enriched nucleic acids to produce sequence reads of the nucleosome-associated regions of a genome. These sequence reads may be aligned to a reference genome and analyzed for nucleosome-associated and/or genetic variants (e.g., SNVs and/or indels).
- In an aspect, a method for generating a bait set may comprise (a) identifying one or more regions of a genome, said regions associated with a nucleosome profile, and (b) selecting a bait set to selectively capture said regions. A bait set in a bait set panel may selectively enrich for one or more nucleosome-associated genomic regions in a cell-free deoxyribonucleic acid (cfDNA) sample. For example, the bait set may selectively enrich for one or more nucleosome-associated regions by bringing a nucleic acid sample in contact with the bait set, and allowing the bait set to selectively hybridize to the set of nucleosome-associated genomic regions associated with the bait set.
- In an aspect, a bait panel may comprise a first bait set that selectively hybridizes to a first set of genomic regions of a nucleic acid sample comprising a predetermined amount of DNA, wherein the first bait set may be provided at a first concentration ratio that is less than a saturation point of the first bait set; and a second bait set that selectively hybridizes to a second set of genomic regions of the nucleic acid sample, wherein the second bait set may be provided at a second concentration ratio that is associated with a saturation point of the second bait set. As used herein, a concentration associated with a saturation point can be at or above the saturation point. In some embodiments, a concentration associated with a saturation point is at or above a point that is 10% below the saturation point. The first set of genomic regions may comprise one or more backbone genomic regions. The second set of genomic regions may comprise one or more hotspot genomic regions. The predetermined amount of DNA may be about 200 ng, about 150 ng, about 125 ng, about 100 ng, about 75 ng, about 50 ng, about 25 ng, about 10 ng, about 5 ng, and/or about 1 ng.
- In an aspect, a method for enriching for multiple genomic regions may comprise bringing a predetermined amount of a nucleic acid sample in contact with a bait panel comprising (i) a first bait set that selectively hybridizes to a first set of genomic regions of the nucleic acid sample, which may be provided at a first concentration ratio that is less than a saturation point of the first bait set, and (ii) a second bait set that selectively hybridizes to a second set of genomic regions of the nucleic acid sample, which may be provided at a second concentration ratio that is associated with a saturation point of the second bait set; and enriching the nucleic acid sample for the first set of genomic regions and the second set of genomic regions.
- Enriching can comprise the following steps: (a) bringing sample nucleic acid into contact with a bait set; (b) capturing nucleic acids from the sample by hybridizing them to probes in the bait set; and (c) separating captured nucleic acids from un-captured nucleic acids.
- Using this approach, capture of the second set of genomic regions at a saturation point of its bait set may yield high-sensitivity detection of variants of the second set of genomic regions (e.g., hot-spot regions), while capture of the first set of genomic regions below the saturation point of its bait set may be desired for the first set of genomic regions (e.g., backbone regions). The flexibility of this method to adjust the capture of different bait sets at or below their saturation levels may be leveraged to strategically select genomic regions of interest for hot-spot or backbone bait set panels, given each genomic region's characteristics such as sequencing load and utility.
- The method may further comprise sequencing the enriched nucleic acids to produce a plurality of sequence reads of the first set of genomic regions and the second set of genomic regions. These sequence reads may be analyzed for cancer-relevant genetic variants (e.g., SNVs and indels) for cancer detection and assessment applications.
- The skilled worker will appreciate that saturation point refers to saturation of binding kinetics. In essence, as the concentration of a bait (or set of baits) increases, the amount of target that binds to the bait (or set of baits) will also increase. However, the amount of target in a given sample will be fixed, and thus, at a certain point, effectively all the target in the sample will be bound to the bait (or set of baits). Therefore, as bait concentrations increase beyond this point, the amount of bound target will not substantially increase because the system will approach binding equilibrium (the rates at which bait molecules bind and release target molecules will start to converge).
- Saturation point refers to a concentration or amount of bait at which point increasing that concentration or amount does not substantially increase the amount of target material captured from a sample, e.g., that point at which increases in the concentration of bait produce increasingly diminished increases in total amount of target material captured. In some embodiments, the point at which increasing the concentration or amount of a bait does not substantially increase the amount of target material captured from a sample is the point at which increasing the concentration or amount of bait produces no increase in the amount of target captured from the sample. The saturation point can be an inflection point on a saturation curve measuring the amount of captured target nucleic acid with increasing concentrations of the bait set. For example, the saturation point can be the point at which an increase of 100% in the bait concentration (e.g., 2× or twice the concentration) increases an amount of target captured by any of less than 20%, less than 19%, less than 18%, less than 17%, less than 16%, less than 15%, less than 14%, less than 13%, less than 12%, less than 11%, less than 10%, less than 9%, less than 8%, less than 7%, less than 6%, less than 5%, less than 4%, less than 3%, less than 2%, or less than 1%. In some embodiments, an increase of 50% in the bait concentration (e.g., 1.5× or one-and-a-half times the concentration) increases an amount of target captured by any of less than 20%, less than 19%, less than 18%, less than 17%, less than 16%, less than 15%, less than 14%, less than 13%, less than 12%, less than 11%, less than 10%, less than 9%, less than 8%, less than 7%, less than 6%, less than 5%, less than 4%, less than 3%, less than 2%, or less than 1%. 
In some embodiments, an increase of 20% in the bait concentration (e.g., 1.2×) increases an amount of target captured by any of less than 20%, less than 19%, less than 18%, less than 17%, less than 16%, less than 15%, less than 14%, less than 13%, less than 12%, less than 11%, less than 10%, less than 9%, less than 8%, less than 7%, less than 6%, less than 5%, less than 4%, less than 3%, less than 2%, or less than 1%. In some embodiments, an increase of 10% in the bait concentration (e.g., 1.1×) increases an amount of target captured by any of less than 20%, less than 19%, less than 18%, less than 17%, less than 16%, less than 15%, less than 14%, less than 13%, less than 12%, less than 11%, less than 10%, less than 9%, less than 8%, less than 7%, less than 6%, less than 5%, less than 4%, less than 3%, less than 2%, or less than 1%.
- As another example, the saturation point can be the point at which an increase of 100% in the bait concentration (e.g., 2× or twice the concentration) increases an amount of target captured by at most 20%. The saturation point can be the point at which an increase of 50% in the bait concentration (e.g., 1.5× or one-and-a-half times the concentration) increases an amount of target captured by at most 20%. The saturation point can be the point at which an increase of 20% in the bait concentration (e.g., 1.2×) increases an amount of target captured by at most 20%. The saturation point can be the point at which an increase of 10% in the bait concentration (e.g., 1.1×) increases an amount of target captured by at most 20%.
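- The fold-increase criterion above can be applied to a titration curve as sketched below. This is an illustration under stated assumptions: the function names are hypothetical, and the hyperbolic capture curve (with its parameters `max_capture` and `half_saturation`) is a stand-in model chosen for the example, not a curve taken from the disclosure. In practice the capture function would come from measured titration data.

```python
from typing import Callable, Optional, Sequence


def find_saturation_point(capture: Callable[[float], float],
                          concentrations: Sequence[float],
                          fold_increase: float = 2.0,
                          max_gain: float = 0.20) -> Optional[float]:
    """Return the lowest tested bait concentration at which multiplying the
    concentration by `fold_increase` raises captured target by at most
    `max_gain` (0.20 means 20%). Return None if no tested point qualifies."""
    for conc in sorted(concentrations):
        baseline = capture(conc)
        if baseline <= 0:
            continue  # no measurable capture at this concentration
        gain = capture(conc * fold_increase) / baseline - 1.0
        if gain <= max_gain:
            return conc
    return None


def hyperbolic_capture(conc: float, max_capture: float = 1000.0,
                       half_saturation: float = 1.0) -> float:
    """Hypothetical capture curve: rises with bait amount, then plateaus."""
    return max_capture * conc / (half_saturation + conc)


# Hypothetical bait amounts in arbitrary units.
tested = [0.5, 1.0, 1.5, 2.5, 4.0]
print(find_saturation_point(hyperbolic_capture, tested))
```

The same function covers the 1.5×, 1.2×, and 1.1× variants of the criterion by changing `fold_increase`, and the stricter thresholds by changing `max_gain`.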
- A saturation curve can be generated, for example, by titrating differing amounts of target nucleic acids against a fixed or varying amount of baits (e.g., baits fixed on a microarray) to measure the amount of target nucleic acid (including, for example, the number of unique molecules) bound to the baits. A saturation curve also can be generated, for example, by titrating differing amounts of baits against a fixed or varying amount of target nucleic acids to measure the amount of target nucleic acid (including, for example, the number of unique molecules) bound to the baits. In some embodiments, a saturation curve can be generated using a subset of sequence reads as a measure of target nucleic acid (e.g., unique molecule count) captured. For example, sequence reads can be categorized as having either single stranded support (when all reads within a group of unique reads are from the same original nucleic acid strand of a double stranded nucleic acid such as DNA) or double stranded support (when the reads within a group of unique reads are from both original nucleic acid strands of a double stranded nucleic acid such as DNA). In embodiments selecting for double stranded support, the skilled worker would understand to count only captured unique molecules for which both strands are observed. Double stranded support can be determined, for example, by differentially tagging each of the two different strands of a nucleic acid such that the reads for each strand can be counted separately. In some embodiments, a target nucleic acid with double stranded support will require a higher amount of bait to reach saturation for that target than would be required for a target nucleic acid with single stranded support.
-
FIG. 6 depicts an exemplary saturation curve showing unique molecule count on the y-axis as a function of input bait amount on the x-axis. At each input amount (shown as a series of volumes of a bait solution), the amount of bait panel was titrated to generate the curve. Exemplary experimental titration curve designs are shown in Table 1 and Table 2 below. Number of unique sequence reads vs. input bait amount can be used to generate a titration curve as shown in FIG. 6.
-
TABLE 1 Titration curve design. Entries under Vol. A to Vol. H are the input target amount (0, 5, 15 or 30 ng).

| Amount of bait (backbone or hotspot; μl) | Vol. A | Vol. B | Vol. C | Vol. D | Vol. E | Vol. F | Vol. G | Vol. H |
|---|---|---|---|---|---|---|---|---|
| Backbone 1 (ng of input target nucleic acid) | 0 | 5 | 5 | 0 | 5 | 0 | 5 | 5 |
| Backbone 2 (ng of input target nucleic acid) | 30 | 30 | 30 | 0 | 30 | 0 | 30 | 30 |
| Hotspot 1 (ng of input target nucleic acid) | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 |
| Hotspot 2 (ng of input target nucleic acid) | 0 | 0 | 0 | 0 | 0 | 15 | 0 | 0 |
| Hotspot 3 (ng of input target nucleic acid) | 0 | 0 | 0 | 0 | 0 | 30 | 0 | 0 |
| Backbone 3 (ng of input target nucleic acid) | 5 | 5 | 5 | 0 | 5 | 0 | 5 | 5 |
| Backbone 4 (ng of input target nucleic acid) | 0 | 0 | 15 | 0 | 15 | 0 | 15 | 15 |
| Backbone 5 (ng of input target nucleic acid) | 30 | 30 | 30 | 0 | 30 | 0 | 30 | 30 |
| Hotspot 4 (ng of input target nucleic acid) | 5 | 5 | 0 | 5 | 0 | 5 | 0 | 0 |
| Hotspot 5 (ng of input target nucleic acid) | 0 | 0 | 0 | 0 | 0 | 15 | 0 | 0 |
| Hotspot 6 (ng of input target nucleic acid) | 30 | 30 | 0 | 30 | 0 | 30 | 0 | 0 |

-
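TABLE 2 Titration curve design. Hybridization performed at 65° C.

| Condition # | Hotspot bait (μl) | Backbone bait (μl) | Input target nucleic acid amount (ng) |
|---|---|---|---|
| 1 | A | B | 5 |
| 2 | A | B | 5 |
| 3 | A | B | 5 |
| 4 | A | B | 5 |
| 5 | A2 | B1 | 5 |
| 6 | A2 | B1 | 5 |
| 7 | A2 | B2 | 5 |
| 8 | A2 | B2 | 5 |
| 9 | A | B1 | 15 |
| 10 | A | B1 | 15 |
| 11 | A | B2 | 15 |
| 12 | A | B2 | 15 |
| 13 | A2 | B1 | 15 |
| 14 | A2 | B1 | 15 |
| 15 | A2 | B2 | 15 |
| 16 | A2 | B2 | 15 |
| 17 | A2 | B2 | 30 |
| 18 | A2 | B2 | 30 |

- Using a titration curve such as that of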
FIG. 6, a person of skill in the art can calculate a saturation point. For example, looking at Vol. 0.8×, the unique molecule count is approximately 2700. At 2× the amount of bait (Vol. 1.6×), the unique molecule count is approximately 3200, a difference of 500. Thus, doubling the amount of bait results in an increase in capture of about 18.5%. By contrast, at Vol. 2×, the unique molecule count is approximately 3250, and at 1 μl, the unique molecule count is approximately 3500, a difference of 250. Doubling the amount of bait here results in an increase in capture of only about 7.7%. Accordingly, a person of skill in the art looking to use a saturation point at which an increase of 100% in the bait concentration increases the amount of target captured by less than 8% might use Vol. 2× of bait as the saturation point.
- At the saturation point, the bait set can capture any of at least 40%, at least 50%, at least 60%, at least 70%, at least 80%, at least 85%, at least 86%, at least 87%, at least 88%, at least 89%, at least 90%, at least 91%, at least 92%, at least 93%, at least 94%, at least 95%, at least 96%, at least 97%, at least 98%, and/or at least 99% of a target sequence in a sample. Saturation point can refer to the saturation point of a bait set or of a particular bait, depending on the context in which the term is used.
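- The doubling arithmetic in the FIG. 6 example can be checked with a short calculation. The following is a minimal sketch using only the approximate counts quoted in the text; the function name `percent_increase` is a hypothetical helper, not part of the disclosure.

```python
def percent_increase(count_before: float, count_after: float) -> float:
    """Percent increase in unique molecule count between two bait amounts."""
    return 100.0 * (count_after - count_before) / count_before


# Approximate unique molecule counts quoted in the FIG. 6 example.
first = percent_increase(2700, 3200)   # Vol. 0.8x versus Vol. 1.6x
second = percent_increase(3250, 3500)  # Vol. 2x versus the doubled amount

print(f"{first:.1f}% {second:.1f}%")
```

- Under an 8% criterion for a doubling of bait, only the second comparison falls below the cutoff, which is consistent with choosing Vol. 2× as the saturation point in the example.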
- The saturation point of a bait set may be determined by the following method: (a) for each of the baits in the bait set, generating a titration curve comprising (i) measuring the capture efficiency of the bait on a given amount of input sample (e.g., test sample) as a function of the concentration of the bait, and (ii) identifying an inflection point within the titration curve, thereby identifying a saturation point associated with the bait; and (b) selecting a saturation point that is larger than substantially all of the saturation points associated with baits in the bait set, thereby determining the saturation point of the bait set. The selection of a saturation point may be influenced by capture efficiency of a bait and the associated costs, such that the concentration at the saturation point may be high enough to achieve a desired capture efficiency, while still low enough to ensure reasonable assay reagent costs.
- The capture efficiency of a bait may be determined by (a) providing a plurality of nucleic acid samples obtained from a plurality of subjects in a cohort; (b) hybridizing the bait with each of the nucleic acid samples, at each of a plurality of concentrations of the bait; (c) enriching with the bait, a plurality of genomic regions of the nucleic acid samples, at each of the plurality of concentrations of the bait; and (d) measuring number of unique nucleic acid molecules or nucleic acid molecules with representation of both strands of an original double-stranded nucleic acid molecule representing the capture efficiency at each of the plurality of concentrations of the bait. Typically, the capture efficiency of a bait (e.g., the percentage of molecules containing the target genomic region of the bait that are captured from a sample comprising such molecules) increases rapidly with concentration until an inflection point is reached, after which the percentage of captured molecules increases much more slowly.
- An inflection point may be a first concentration of a bait such that observed capture efficiency does not increase significantly at concentrations of the bait greater than the first concentration. An inflection point may be a first concentration of the bait such that an observed increase between (1) the capture efficiency at a bait concentration of twice the first concentration compared to (2) the capture efficiency at the first bait concentration, is less than about 1%, less than about 2%, less than about 3%, less than about 4%, less than about 5%, less than about 6%, less than about 7%, less than about 8%, less than about 9%, less than about 10%, less than about 12%, less than about 14%, less than about 16%, less than about 18%, less than about 20%, less than about 30%, less than about 40%, or less than about 50%. Such an identified inflection point can be considered a saturation point associated with a bait. A bait can be used at a concentration of a saturation point in an assay to enable optimal capture of a target genomic region and hence sensitivity of detecting genetic variants of the target genomic region. In some embodiments, the saturation point associated with a bait set is the saturation point of the weakest bait in that bait set. For example, the bait set has a saturation point that is larger than substantially all of the saturation points associated with baits in the bait set when a bait of the bait set is subjected to a titration curve generated by (i) measuring the capture efficiency of a bait of the bait set as a function of the concentration of the bait, and (ii) identifying an inflection point within the titration curve, thereby identifying a saturation point associated with the bait. 
When each bait in the bait set is at a first concentration that is at least at its saturation point, the bait set will have captured target sequences such that observed capture efficiency of the target sequences increases by less than 20% at a concentration of the baits twice that of the first concentration.
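- The per-bait inflection point and the bait set saturation point described above can be sketched as follows. This is an illustration under stated assumptions, not the disclosed implementation: each titration curve is assumed to be a doubling series (so the efficiency at twice each concentration is measured), the "increase" is taken as a relative increase, and "larger than substantially all" is simplified to the maximum of the per-bait points. All names and numbers are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# A titration curve: (bait concentration, capture efficiency) pairs.
Curve = List[Tuple[float, float]]


def efficiency_at(curve: Curve, conc: float) -> Optional[float]:
    """Return the measured efficiency at a concentration, if present."""
    for c, eff in curve:
        if abs(c - conc) < 1e-9:
            return eff
    return None


def bait_saturation_point(curve: Curve, max_gain: float = 0.20) -> Optional[float]:
    """First concentration c at which doubling to 2c raises the capture
    efficiency by less than max_gain (relative increase; an assumption)."""
    for c, eff in sorted(curve):
        doubled = efficiency_at(curve, 2 * c)
        if doubled is not None and eff > 0 and (doubled - eff) / eff < max_gain:
            return c
    return None


def bait_set_saturation_point(curves: Dict[str, Curve],
                              max_gain: float = 0.20) -> Optional[float]:
    """Largest per-bait saturation point (simplification of
    'larger than substantially all')."""
    points = [bait_saturation_point(c, max_gain) for c in curves.values()]
    if any(p is None for p in points):
        return None  # at least one bait never saturates in the measured range
    return max(points)


# Hypothetical titration data for two baits.
curves = {
    "bait_a": [(1, 0.30), (2, 0.55), (4, 0.62), (8, 0.64)],
    "bait_b": [(1, 0.10), (2, 0.25), (4, 0.50), (8, 0.55)],
}
set_point = bait_set_saturation_point(curves)
```

- In this toy data the weaker bait (`bait_b`) saturates later, so it sets the saturation point of the bait set.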
- The nucleic acid sample may be a cell-free nucleic acid sample (e.g., cfDNA). A method for enriching for multiple genomic regions may further comprise sequencing the enriched nucleic acid sample to produce a plurality of sequence reads. A method for enriching for multiple genomic regions may further comprise producing an output comprising a nucleic acid sequence representative of the nucleic acid sample. This nucleic acid sequence may then be aligned to a reference genome and analyzed for cancer-relevant genetic variants through bioinformatics approaches.
- An original molecule can produce redundant sequence reads, for example, after amplification and sequencing of amplicons, or by repeated sequencing of the same molecule. Redundant sequence reads from an original molecule can be collapsed into a consensus sequence (e.g., a “unique sequence”) representing the sequence of the original molecule. This can be done by generating a consensus sequence for the full molecule, for part of the molecule or at a single nucleotide position in the molecule (consensus nucleotide). As used herein “sequenced polynucleotide” refers either to sequence reads generated from amplicons of an original molecule, or a consensus sequence of an original molecule derived from such amplicons. Unique reads are reads that are different from every other read. Reads can be unique based on the sequence of an original molecule, or based on the sequence of an original molecule plus one or more barcode sequences attached to an original molecule. For example, two identical original molecules can still yield unique reads if their barcodes are different. Likewise, two different original molecules will produce unique reads even if their barcodes are the same. Consensus sequences can be unique sequences when they are generated by grouping unique reads.
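- Collapsing redundant reads into consensus sequences can be sketched as follows. This is a minimal sketch, assuming reads are grouped into families by barcode plus a start position (a hypothetical stand-in for the identity of the original molecule) and that the consensus nucleotide at each position is chosen by simple majority vote; neither choice is specified by the text.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# Hypothetical read layout: (barcode, start position, sequence).
Read = Tuple[str, int, str]


def collapse_reads(reads: List[Read]) -> Dict[Tuple[str, int], str]:
    """Group reads into families and build one consensus sequence per family."""
    families = defaultdict(list)
    for barcode, start, seq in reads:
        families[(barcode, start)].append(seq)

    consensus = {}
    for key, seqs in families.items():
        length = min(len(s) for s in seqs)
        # Majority vote at each position yields the consensus nucleotide.
        consensus[key] = "".join(
            Counter(s[i] for s in seqs).most_common(1)[0][0]
            for i in range(length)
        )
    return consensus


reads = [
    ("BC1", 100, "ACGT"),
    ("BC1", 100, "ACGT"),
    ("BC1", 100, "ACTT"),  # redundant read carrying one error
    ("BC2", 100, "ACGT"),  # same sequence, different barcode
    ("BC1", 250, "GGCA"),  # same barcode, different molecule
]
unique_molecules = collapse_reads(reads)
```

- The three BC1 reads at position 100 collapse to one consensus, while the BC2 read and the BC1 read at position 250 remain separate unique molecules, mirroring the barcode examples in the paragraph above.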
- In an aspect, a bait panel may comprise a first bait set that selectively captures backbone regions of a genome, said backbone regions associated with a ranking function of sequencing load and utility, wherein the ranking function of each backbone region has a value less than a predetermined threshold value; and a second bait set that selectively captures hotspot regions of a genome, said hotspot regions associated with a ranking function of sequencing load and utility, wherein the ranking function of each hotspot region has a value greater than or equal to the predetermined threshold value. This approach may use at least two bait sets corresponding to backbone and hotspot regions.
- Hotspot regions may be relatively more important than backbone regions to capture and analyze in a given cell-free nucleic acid sample due to their relatively high utility and/or relatively low sequencing load. The selection of a given region as a hotspot region or a backbone region depends on its ranking function value, which is calculated as a function of sequencing load and utility. A ranking function value may be calculated as utility of a genomic region divided by sequencing load of a genomic region.
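- The threshold rule above (ranking function value = utility divided by sequencing load; hotspot if at or above the threshold, backbone otherwise) can be sketched as follows. The region names, values, and threshold are hypothetical and chosen only to illustrate the comparison.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Region:
    name: str
    utility: float
    sequencing_load: float

    @property
    def ranking(self) -> float:
        """Ranking function value: utility divided by sequencing load."""
        return self.utility / self.sequencing_load


def classify(regions: List[Region], threshold: float) -> Tuple[List[str], List[str]]:
    """Split regions into (hotspot, backbone) names by ranking value."""
    hotspot = [r.name for r in regions if r.ranking >= threshold]
    backbone = [r.name for r in regions if r.ranking < threshold]
    return hotspot, backbone


# Hypothetical regions; the values are illustrative only.
regions = [
    Region("region_a", utility=8.0, sequencing_load=2.0),  # ranking 4.0
    Region("region_b", utility=3.0, sequencing_load=6.0),  # ranking 0.5
    Region("region_c", utility=5.0, sequencing_load=5.0),  # ranking 1.0
]
hotspot, backbone = classify(regions, threshold=1.0)
```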
- The backbone or hotspot regions may comprise one or more nucleosome informative regions. Nucleosome informative regions may comprise a region of maximum nucleosome differentiation. The bait panel may further comprise a second bait set that selectively captures disease informative regions. The baits in the first bait set may be at a first concentration (e.g., a first concentration relative to the bait panel), and the baits in the second bait set may be at a second concentration (e.g., a second concentration relative to the bait panel).
- In an aspect, a method for generating a bait set may comprise identifying one or more backbone genomic regions of interest, wherein the identifying the one or more backbone genomic regions may comprise maximizing a ranking function of sequencing load and utility associated with each of the backbone genomic regions; identifying one or more hotspot genomic regions of interest; creating a first bait set that selectively captures the backbone genomic regions of interest; and creating a second bait set that selectively captures the hot-spot genomic regions of interest. The second bait set may have a higher capture efficiency than the first bait set.
- The one or more hot-spots may be selected using one or more of (e.g., one or more, two or more, three or more, or four of) the following: (i) maximizing a ranking function of sequencing load and utility associated with each of the hot-spot genomic regions, (ii) nucleosome profiling across the one or more genomic regions of interest, (iii) predetermined cancer driver mutations or prevalence across a relevant patient cohort, and (iv) empirically identified cancer driver mutations.
- Identifying one or more hotspots of interest may comprise using a programmed computer processor to rank a set of hotspot genomic regions based on a ranking function of sequencing load and utility associated with each of the hotspot genomic regions. Identifying the one or more backbone genomic regions of interest may comprise ranking a set of backbone genomic regions based on a ranking function of sequencing load and utility associated with each of the backbone genomic regions of interest. Identifying the one or more hot-spot genomic regions of interest may comprise utilizing a set of empirically determined minor allele frequency (MAF) values or clonality of a variant measured by its MAF in relationship to the highest presumed driver or clonal mutation in a sample obtained from one or more subjects in a cohort of interest. Genomic regions that have relatively high MAF values in a cohort of interest may be suitable hotspots because they may indicate cancer-relevant assessments such as detection, cell type or tissue of origin, tumor burden, and/or treatment efficacy.
- Sequencing load of a genomic region may be calculated by multiplying together one or more of (e.g., one or more, two or more, three or more, four or more, or five of) (i) size of the genomic region in base pairs, (ii) relative fraction of reads spent on sequencing fragments mapping to the genomic region, (iii) relative coverage as a result of sequence bias of the genomic region, (iv) relative coverage as a result of amplification bias of the genomic region, and (v) relative coverage as a result of capture bias of the genomic region. This indicator may be calculated for each genomic region in a bait panel set to identify the “costs” associated with generating sequence reads associated with the genomic region from a nucleic acid sample.
- The sequencing load of a genomic region is linearly proportional to the size of the genomic region in base pairs. The relative fraction of reads spent on sequencing fragments mapping to the genomic region also influences the sequencing load of the genomic region, since some genomic regions may be especially difficult to sequence reliably (e.g., due to high GC-content or the presence of highly repeating sequences) and hence may require higher sequencing depth for analysis at the bait's desired resolution. Similarly, relative coverage as a result of sequence bias, amplification bias, and/or capture bias of the genomic region may also affect the sequencing load of the genomic region. The total sequencing load of a given assay's sequencing run may then be calculated by summing all sequencing loads of the baits (including hot-spots and backbone regions) in the assay's selected bait panel set.
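- The product-then-sum calculation described above can be sketched as follows. This is a minimal sketch: the function names and the example factor values are hypothetical, and any subset of the relative factors (read fraction, sequence bias, amplification bias, capture bias) may be supplied.

```python
import math
from typing import Dict, Tuple


def sequencing_load(size_bp: int, *relative_factors: float) -> float:
    """Region size in base pairs multiplied by any supplied relative factors."""
    return size_bp * math.prod(relative_factors)


def total_sequencing_load(panel: Dict[str, Tuple[int, Tuple[float, ...]]]) -> float:
    """Sum of per-region sequencing loads across a bait panel."""
    return sum(sequencing_load(size, *factors) for size, factors in panel.values())


# Hypothetical panel: region -> (size in bp, relative factors).
panel = {
    "hotspot_1": (200, (1.5, 2.0)),  # e.g., read fraction and capture bias
    "backbone_1": (120, ()),         # size only
}
total = total_sequencing_load(panel)
```

- Because the load scales linearly with region size, a region twice as long with the same relative factors contributes twice the load.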
- In some examples, utility of a genomic region may be calculated by multiplying together one or more of (e.g., one or more, two or more, three or more, four or more, five or more, six or more, or seven of) the following utility factors: (i) presence of one or more actionable mutations in the genomic region, (ii) frequency of one or more actionable mutations in the genomic region, (iii) presence of one or more mutations associated with above-average minor allele frequencies (MAFs) in the genomic region, (iv) frequency of one or more mutations associated with above-average MAFs in the genomic region, (v) fraction of patients in a cohort harboring a somatic mutation within the genomic region, (vi) sum of MAFs for variants in patients in a cohort, said patients harboring a somatic mutation within the genomic region, and (vii) ratio of (1) MAF for variants in patients in a cohort, said patients harboring a somatic mutation within the genomic region, to (2) maximum MAF for a given patient in the cohort.
- The goal of calculating utility of a genomic region may be to help assess its relative importance for inclusion in a bait set panel. For example, the presence and/or frequency of one or more actionable mutations in the genomic region affect the utility of a genomic region for inclusion in a bait set panel, since genomic regions containing highly frequent mutations are good markers (e.g., indicators) of disease states including cancer. Similarly, the selection of genomic regions with presence and/or frequency of mutations associated with above-average MAFs will enable highly sensitive detection of these mutations in a liquid biopsy assay.
- The fraction of patients in a cohort harboring a somatic mutation within the genomic region may indicate driver mutations that are suitable as a marker for the cohort's disease (e.g., breast, colorectal, pancreatic, prostate, melanoma, lung, or liver cancer). To maximize the chances of detecting the highest MAF or driver variant, the sum of MAF for variants in patients in a cohort, said patients harboring a somatic mutation within the genomic region, may be used as a utility factor. To give maximal weight to the driver mutations, the ratio of (1) MAF for variants in patients in a cohort, said patients harboring a somatic mutation within the genomic region, to (2) maximum MAF for a given patient in the cohort may be used as a utility factor. Mutations associated with higher minor allele frequencies may comprise one or more driver mutations or may be known from external data or annotation sources.
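- The cohort-based utility factors (v) to (vii) can be sketched as follows. This is an illustration under stated assumptions: each patient is assumed to carry at most one variant per region, and the per-patient MAF ratios of factor (vii) are averaged across carriers, an aggregation the text does not specify. The cohort data and names are hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical cohort: patient -> {region: MAF of the variant in that region}.
Cohort = Dict[str, Dict[str, float]]


def cohort_utility_factors(cohort: Cohort, region: str) -> Tuple[float, float, float]:
    """Return (fraction of carriers, sum of MAFs, mean MAF / patient max MAF)."""
    carriers = {p: v[region] for p, v in cohort.items() if region in v}
    fraction = len(carriers) / len(cohort)
    maf_sum = sum(carriers.values())
    # Ratio of the region's MAF to the highest MAF seen in the same patient.
    ratios = [maf / max(cohort[p].values()) for p, maf in carriers.items()]
    mean_ratio = sum(ratios) / len(ratios) if ratios else 0.0
    return fraction, maf_sum, mean_ratio


cohort = {
    "p1": {"r1": 0.10, "r2": 0.02},
    "p2": {"r1": 0.04},
    "p3": {"r2": 0.05},
    "p4": {"r2": 0.01, "r3": 0.20},
}
fraction, maf_sum, mean_ratio = cohort_utility_factors(cohort, "r1")
utility = fraction * maf_sum * mean_ratio  # product of the chosen factors
```

- A ratio near 1 indicates that the region's variant is at or near the highest MAF in the patients who carry it, which is the behavior expected of a driver or clonal mutation.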
- Actionable mutations may comprise mutations whose detected presence may influence or determine clinical decisions (e.g., diagnosis, cancer monitoring, therapy monitoring, assessment of therapy efficacy). Actionable mutations may comprise one or more of (e.g., one or more, two or more, three or more, four or more, five or more, six or more, or seven of) (i) druggable mutations, (ii) mutations for therapeutic monitoring, (iii) disease specific mutations, (iv) tissue specific mutations, (v) cell type specific mutations, (vi) resistance mutations, and (vii) diagnostic mutations.
- Druggable mutations may include those mutations whose detected presence in a nucleic acid sample from a subject may indicate that the subject is an appropriate candidate for treatment with a certain drug associated with the mutation (e.g., detection of EGFR L858R mutation may indicate the need for treatment with a tyrosine kinase inhibitor (TKI)). Mutations for therapeutic monitoring include those mutations whose detected presence or increased level in a nucleic acid sample from a subject may indicate that the subject's cancer is responding to a treatment course. Resistance mutations include those mutations whose detected presence or increased level in a nucleic acid sample from a subject may indicate that the subject's cancer has become resistant to a treatment course (e.g., emergence of EGFR T790M mutation may indicate the onset of resistance). Mutations may be specific to a disease (e.g., tumor type), tissue type, or cell type, whose detection may indicate cancer, inflammation, or another disease state in a particular organ, tissue, or cell type.
- Exemplary listings of genomic locations of interest may be found in Table 3 and Table 4. In some embodiments, genomic regions used in the methods of the present disclosure comprise at least a portion of at least 5, at least 10, at least 15, at least 20, at least 25, at least 30, at least 35, at least 40, at least 45, at least 50, at least 55, at least 60, at least 65, at least 70, at least 75, at least 80, at least 85, at least 90, at least 95, or 97 of the genes of Table 3. In some embodiments, genomic regions used in the methods of the present disclosure comprise at least 5, at least 10, at least 15, at least 20, at least 25, at least 30, at least 35, at least 40, at least 45, at least 50, at least 55, at least 60, at least 65, or 70 of the SNVs of Table 3. In some embodiments, genomic regions used in the methods of the present disclosure comprise at least 1, at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, or 18 of the CNVs of Table 3. In some embodiments, genomic regions used in the methods of the present disclosure comprise at least 1, at least 2, at least 3, at least 4, at least 5, or 6 of the fusions of Table 3. In some embodiments, genomic regions used in the methods of the present disclosure comprise at least a portion of at least 1, at least 2, or 3 of the indels of Table 3. In some embodiments, genomic regions used in the methods of the present disclosure comprise at least a portion of at least 5, at least 10, at least 15, at least 20, at least 25, at least 30, at least 35, at least 40, at least 45, at least 50, at least 55, at least 60, at least 65, at least 70, at least 75, at least 80, at least 85, at least 90, at least 95, at least 100, at least 105, at least 110, or 115 of the genes of Table 4. 
In some embodiments, genomic regions used in the methods of the present disclosure comprise at least 5, at least 10, at least 15, at least 20, at least 25, at least 30, at least 35, at least 40, at least 45, at least 50, at least 55, at least 60, at least 65, at least 70, or 73 of the SNVs of Table 4. In some embodiments, genomic regions used in the methods of the present disclosure comprise at least 1, at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, or 18 of the CNVs of Table 4. In some embodiments, genomic regions used in the methods of the present disclosure comprise at least 1, at least 2, at least 3, at least 4, at least 5, or 6 of the fusions of Table 4. In some embodiments, genomic regions used in the methods of the present disclosure comprise at least a portion of at least 1, at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, or 18 of the indels of Table 4. Each of these genomic locations of interest may be identified as a backbone region or hot-spot region for a given bait set panel. An exemplary listing of hot-spot genomic locations of interest may be found in Table 5. In some embodiments, genomic regions used in the methods of the present disclosure comprise at least a portion of at least 1, at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, at least 18, at least 19, or at least 20 of the genes of Table 5. 
Each hot-spot genomic region is listed with several characteristics, including the associated gene, chromosome on which it resides, the start and stop position of the genome representing the gene's locus, the length of the gene's locus in base pairs, the exons covered by the gene, and the critical feature (e.g., type of mutation) that a given genomic region of interest may seek to capture.
-
TABLE 3

| Variant class | Genes |
|---|---|
| Point Mutations (SNVs) | AKT1, ALK, APC, AR, ARAF, ARID1A, ATM, BRAF, BRCA1, BRCA2, CCND1, CCND2, CCNE1, CDH1, CDK4, CDK6, CDKN2A, CDKN2B, CTNNB1, EGFR, ERBB2, ESR1, EZH2, FBXW7, FGFR1, FGFR2, FGFR3, GATA3, GNA11, GNAQ, GNAS, HNF1A, HRAS, IDH1, IDH2, JAK2, JAK3, KIT, KRAS, MAP2K1, MAP2K2, MET, MLH1, MPL, MYC, NF1, NFE2L2, NOTCH1, NPM1, NRAS, NTRK1, PDGFRA, PIK3CA, PTEN, PTPN11, RAF1, RB1, RET, RHEB, RHOA, RIT1, ROS1, SMAD4, SMO, SRC, STK11, TERT, TP53, TSC1, VHL |
| Amplifications (CNVs) | AR, BRAF, CCND1, CCND2, CCNE1, CDK4, CDK6, EGFR, ERBB2, FGFR1, FGFR2, KIT, KRAS, MET, MYC, PDGFRA, PIK3CA, RAF1 |
| Fusions | ALK, FGFR2, FGFR3, NTRK1, RET, ROS1 |
| Indels | EGFR (exons 19 & 20), ERBB2 (exons 19 & 20), MET (exon 14 skipping) |

-
TABLE 4

| Variant class | Genes |
|---|---|
| Point Mutations (SNVs) | AKT1, ALK, APC, AR, ARAF, ARID1A, ATM, BRAF, BRCA1, BRCA2, CCND1, CCND2, CCNE1, CDH1, CDK4, CDK6, CDKN2A, DDR2, CTNNB1, EGFR, ERBB2, ESR1, EZH2, FBXW7, FGFR1, FGFR2, FGFR3, GATA3, GNA11, GNAQ, GNAS, HNF1A, HRAS, IDH1, IDH2, JAK2, JAK3, KIT, KRAS, MAP2K1, MAP2K2, MET, MLH1, MPL, MYC, NF1, NFE2L2, NOTCH1, NPM1, NRAS, NTRK1, PDGFRA, PIK3CA, PTEN, PTPN11, RAF1, RB1, RET, RHEB, RHOA, RIT1, ROS1, SMAD4, SMO, MAPK1, STK11, TERT, TP53, TSC1, VHL, MAPK3, MTOR, NTRK3 |
| Amplifications (CNVs) | AR, BRAF, CCND1, CCND2, CCNE1, CDK4, CDK6, EGFR, ERBB2, FGFR1, FGFR2, KIT, KRAS, MET, MYC, PDGFRA, PIK3CA, RAF1 |
| Fusions | ALK, FGFR2, FGFR3, NTRK1, RET, ROS1 |
| Indels | EGFR (exons 19 & 20), ERBB2 (exons 19 & 20), MET (exon 14 skipping), ATM, APC, ARID1A, BRCA1, BRCA2, CDH1, CDKN2A, GATA3, KIT, MLH1, MTOR, NF1, PDGFRA, PTEN, RB1, SMAD4, STK11, TP53, TSC1, VHL |

-
TABLE 5

| Gene | Chromosome | Start Position | Stop Position | Length (bp) | Exons Covered | Critical Feature |
|---|---|---|---|---|---|---|
| ALK | chr2 | 29446405 | 29446655 | 250 | intron 19 | Fusion |
| ALK | chr2 | 29446062 | 29446197 | 135 | intron 20 | Fusion |
| ALK | chr2 | 29446198 | 29446404 | 206 | 20 | Fusion |
| ALK | chr2 | 29447353 | 29447473 | 120 | intron 19 | Fusion |
| ALK | chr2 | 29447614 | 29448316 | 702 | intron 19 | Fusion |
| ALK | chr2 | 29448317 | 29448441 | 124 | 19 | Fusion |
| ALK | chr2 | 29449366 | 29449777 | 411 | intron 18 | Fusion |
| ALK | chr2 | 29449778 | 29449950 | 172 | 18 | Fusion |
| BRAF | chr7 | 140453064 | 140453203 | 139 | 15 | BRAF V600 |
| CTNNB1 | chr3 | 41266007 | 41266254 | 247 | 3 | S37 |
| EGFR | chr7 | 55240528 | 55240827 | 299 | 18 and 19 | G719 and deletions |
| EGFR | chr7 | 55241603 | 55241746 | 143 | 20 | Insertions/T790M |
| EGFR | chr7 | 55242404 | 55242523 | 119 | 21 | L858R |
| ERBB2 | chr17 | 37880952 | 37881174 | 222 | 20 | Insertions |
| ESR1 | chr6 | 152419857 | 152420111 | 254 | 10 | V534, P535, L536, Y537, D538 |
| FGFR2 | chr10 | 123279482 | 123279693 | 211 | 6 | S252 |
| GATA3 | chr10 | 8111426 | 8111571 | 145 | 5 | SS/Indels |
| GATA3 | chr10 | 8115692 | 8116002 | 310 | 6 | SS/Indels |
| GNAS | chr20 | 57484395 | 57484488 | 93 | 8 | R844 |
| IDH1 | chr2 | 209113083 | 209113394 | 311 | 4 | R132 |
| IDH2 | chr15 | 90631809 | 90631989 | 180 | 4 | R140, R172 |
| KIT | chr4 | 55524171 | 55524258 | 87 | 1 | |
| KIT | chr4 | 55561667 | 55561957 | 290 | 2 | |
| KIT | chr4 | 55564439 | 55564741 | 302 | 3 | |
| KIT | chr4 | 55565785 | 55565942 | 157 | 4 | |
| KIT | chr4 | 55569879 | 55570068 | 189 | 5 | |
| KIT | chr4 | 55573253 | 55573463 | 210 | 6 | |
| KIT | chr4 | 55575579 | 55575719 | 140 | 7 | |
| KIT | chr4 | 55589739 | 55589874 | 135 | 8 | |
| KIT | chr4 | 55592012 | 55592226 | 214 | 9 | |
| KIT | chr4 | 55593373 | 55593718 | 345 | 10 and 11 | 557, 559, 560, 576 |
| KIT | chr4 | 55593978 | 55594297 | 319 | 12 and 13 | V654 |
| KIT | chr4 | 55595490 | 55595661 | 171 | 14 | T670, S709 |
| KIT | chr4 | 55597483 | 55597595 | 112 | 15 | D716 |
| KIT | chr4 | 55598026 | 55598174 | 148 | 16 | L783 |
| KIT | chr4 | 55599225 | 55599368 | 143 | 17 | C809, R815, D816, L818, D820, S821F, N822, Y823 |
| KIT | chr4 | 55602653 | 55602785 | 132 | 18 | A829P |
| KIT | chr4 | 55602876 | 55602996 | 120 | 19 | |
| KIT | chr4 | 55603330 | 55603456 | 126 | 20 | |
| KIT | chr4 | 55604584 | 55604733 | 149 | 21 | |
| KRAS | chr12 | 25378537 | 25378717 | 180 | 4 | A146 |
| KRAS | chr12 | 25380157 | 25380356 | 199 | 3 | Q61 |
| KRAS | chr12 | 25398197 | 25398328 | 131 | 2 | G12/G13 |
| MET | chr7 | 116411535 | 116412255 | 720 | 13, 14, intron 13, intron 14 | MET exon 14 SS |
| NRAS | chr1 | 115256410 | 115256609 | 199 | 3 | Q61 |
| NRAS | chr1 | 115258660 | 115258791 | 131 | 2 | G12/G13 |
| PIK3CA | chr3 | 178935987 | 178936132 | 145 | 10 | E545K |
| PIK3CA | chr3 | 178951871 | 178952162 | 291 | 21 | H1047R |
| PTEN | chr10 | 89692759 | 89693018 | 259 | 5 | R130 |
| SMAD4 | chr18 | 48604616 | 48604849 | 233 | 12 | D537 |
| TERT | chr5 | 1294841 | 1295512 | 671 | promoter | chr5:1295228 |
| TP53 | chr17 | 7573916 | 7574043 | 127 | 11 | Q331, R337, R342 |
| TP53 | chr17 | 7577008 | 7577165 | 157 | 8 | R273 |
| TP53 | chr17 | 7577488 | 7577618 | 130 | 7 | R248 |
| TP53 | chr17 | 7578127 | 7578299 | 172 | 6 | R213/Y220 |
| TP53 | chr17 | 7578360 | 7578564 | 204 | 5 | R175/Deletions |
| TP53 | chr17 | 7579301 | 7579600 | 299 | 4 | |

Totals: 12574 (total target region); 16330 (total probe coverage)

- In an aspect, a bait panel may comprise a plurality of bait sets, each bait set (i) comprising one or more baits that selectively capture one or more genomic regions with utility in the same quantile across the plurality of baits, and (ii) having a different relative concentration from each of the other bait sets with utility in a different quantile across the plurality of baits. Quantiles may be, for example, two halves, three thirds, four quarters, etc. For example, a bait panel may comprise three bait sets, each bait set comprising baits that selectively capture genomic regions with utility in the upper third, middle third, or lower third of utility values across the plurality of baits, with each of the three bait sets having a different relative concentration.
- A bait panel may comprise a plurality of bait sets, each bait set (i) comprising one or more baits that selectively capture one or more genomic regions with sequencing load in the same quantile across the plurality of baits, and (ii) having a different relative concentration from each of the other bait sets with sequencing load in a different quantile across the plurality of baits. A bait panel may comprise a plurality of bait sets, each bait set (i) comprising one or more baits that selectively capture one or more genomic regions with ranking function value (e.g., utility divided by sequencing load) in the same quantile across the plurality of baits, and (ii) having a different relative concentration from each of the other bait sets with ranking function value in a different quantile across the plurality of baits.
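- The quantile grouping described above can be sketched as follows. This is a minimal sketch, assuming rank-based quantiles of equal size; the bait names, scores, and relative concentrations are hypothetical. The score can be utility, sequencing load, or ranking function value.

```python
from typing import Dict, List


def quantile_bait_sets(scores: Dict[str, float], n_sets: int) -> List[List[str]]:
    """Split baits into n_sets rank-based quantiles, lowest scores first."""
    ranked = sorted(scores, key=scores.get)
    sets: List[List[str]] = [[] for _ in range(n_sets)]
    for rank, name in enumerate(ranked):
        sets[rank * n_sets // len(ranked)].append(name)
    return sets


# Hypothetical utility values for six baits.
scores = {"b1": 0.2, "b2": 0.9, "b3": 0.5, "b4": 0.1, "b5": 0.7, "b6": 0.4}
lower, middle, upper = quantile_bait_sets(scores, n_sets=3)

# Each bait set receives a different relative concentration (values illustrative).
relative_concentration = {"lower": 1.0, "middle": 2.0, "upper": 4.0}
```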
- In an aspect, a method of selecting a set of panel blocks may comprise (a) for each panel block, (i) calculating a utility of the panel block, (ii) calculating a sequencing load of the panel block, and (iii) calculating a ranking function of the panel block; and (b) performing an optimization process to select a set of panel blocks that maximizes the total ranking function values of the selected panel blocks. A ranking function of a panel block may be calculated as the utility of a panel block divided by the sequencing load of a panel block. The combinatorial optimization process may optimize the total sum of ranking function values of all panel blocks selected for the set of panel blocks in a single assay. This approach may enable an optimal panel selection given constraints in sequence load and utility. The combinatorial optimization process may be a greedy algorithm.
- In an aspect, a method may comprise (a) providing a plurality of bait mixtures, wherein each of the plurality of bait mixtures comprises a first bait set that selectively hybridizes to a first set of genomic regions and a second bait set that selectively hybridizes to a second set of genomic regions, wherein the first bait set is at different concentrations across the plurality of bait mixtures and the second bait set is at the same concentration across the plurality of bait mixtures; (b) contacting each of the plurality of bait mixtures with a nucleic acid sample to capture nucleic acids from the nucleic acid sample with the first bait set and the second bait set, wherein the nucleic acids from the nucleic acid samples are captured by the first bait set and the second bait set; (c) sequencing a portion of the nucleic acids captured with each bait mixture to produce sets of sequence reads within an allocated number of sequence reads; (d) determining the read depth for the first bait set and the second bait set for each bait mixture; and (e) identifying at least one bait mixture that provides read depths for the second set of genomic regions and, optionally, the first set of genomic regions, at predetermined amounts. In some embodiments, the read depths for the second set of genomic regions provide a sensitivity of detecting a genetic variant of at least 0.0001% MAF. In some embodiments, a first set of genomic regions and/or a second set of regions have a size between 25 kilobases and 1,000 kilobases. In some embodiments, a first set of genomic regions and/or a second set of regions have a read depth of between 1,000 counts/base and 50,000 counts/base.
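- The panel-block selection described earlier in this passage (ranking function equal to utility divided by sequencing load, optimized with a greedy algorithm) can be sketched as below. This is an illustrative sketch only: the `PanelBlock` class, the `select_panel_blocks` function, and the explicit `load_budget` constraint are assumptions introduced here, since the text states only that selection is subject to constraints in sequence load and utility.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PanelBlock:
    name: str
    utility: float          # hypothetical utility score
    sequencing_load: float  # hypothetical load; must be > 0

    @property
    def ranking(self) -> float:
        # Ranking function as described: utility / sequencing load.
        return self.utility / self.sequencing_load


def select_panel_blocks(blocks, load_budget):
    """Greedy selection: take blocks in descending ranking order
    while the cumulative sequencing load stays within the budget."""
    selected, used = [], 0.0
    for block in sorted(blocks, key=lambda b: b.ranking, reverse=True):
        if used + block.sequencing_load <= load_budget:
            selected.append(block)
            used += block.sequencing_load
    return selected
```

- A greedy pass of this kind is simple and fast but is not guaranteed to find the globally best set under a load budget; an exact combinatorial method could be substituted where that matters.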
- A method is disclosed for improving accuracy of detecting an insertion or deletion (indel) from a plurality of sequence reads derived from cell-free deoxyribonucleic acid (cfDNA) molecules in a bodily sample of a subject, which plurality of sequence reads are generated by nucleic acid sequencing. For each of the plurality of sequence reads associated with cfDNA molecules, a candidate indel may be identified. Each candidate indel may then be classified as either a true indel or an introduced indel, using a combination of predetermined expectations of (i) an indel being detected in one or more sequence reads of the plurality of sequence reads, (ii) that a detected indel is a true indel present in a given cell-free DNA molecule of the cell-free DNA molecules, given that an indel has been detected in the one or more of the sequence reads, and/or (iii) that a detected indel is introduced by non-biological error, given that an indel has been detected in the one or more of the sequence reads, in conjunction with one or more model parameters to perform a hypothesis test. This approach may reduce error and improve accuracy of detecting an indel from sequence read data.
- FIG. 1 illustrates how a plurality of reads may be generated for each locus enriched from a cell-free nucleic acid sample. Each enriched nucleic acid molecule (e.g., DNA molecule) is amplified to produce a family of amplicons. These amplicons may then be sequenced on both forward and reverse strands to produce a plurality of sequence read data. From the plurality of sequence read data, candidate indels may be detected and classified as either true indels or introduced (e.g., non-biological) indels.
- This algorithm presumes that for any given DNA molecule for which a plurality of sequence reads is analyzed for variants comprising indels, there exists a predetermined expectation (e.g., probability) of an indel being present either in the original molecule (e.g., a “true” biological indel) or introduced at some point in a protocol that culminates in a set of sequence reads (e.g., an introduced non-biological indel stemming from error, including amplification or sequencing error). The model may aim to perform a hypothesis test which asks, given a pattern of reads mapping to a particular base position (e.g., covering the base position somewhere in the read), whether the observed pattern is more indicative of an indel being present in a sequence at the beginning of the protocol (e.g., a true biological indel) or of an indel introduced during the protocol (e.g., a non-biological indel).
- In an aspect, a method for improving accuracy of detecting an insertion or deletion (indel) from a plurality of sequence reads derived from cell-free deoxyribonucleic acid (cfDNA) molecules in a bodily sample of a subject, which plurality of sequence reads are generated by nucleic acid sequencing, may comprise (a) for each of the plurality of sequence reads associated with the cell-free DNA molecules, providing: a predetermined expectation of an indel being detected in one or more sequence reads of the plurality of sequence reads; a predetermined expectation that a detected indel is a true indel present in a given cell-free DNA molecule of the cell-free DNA molecules, given that an indel has been detected in the one or more of the sequence reads; and a predetermined expectation that a detected indel is introduced by non-biological error, given that an indel has been detected in the one or more of the sequence reads; (b) providing quantitative measures of one or more model parameters characteristic of sequence reads generated by nucleic acid sequencing; (c) detecting one or more candidate indels in the plurality of sequence reads associated with the cell-free DNA molecules; and (d) for each candidate indel, performing a hypothesis test using one or more of the model parameters to classify said candidate indel as a true indel or an introduced indel, thereby improving accuracy of detecting an indel.
- The method for improving accuracy of detecting an insertion or deletion (indel) from a plurality of sequence reads derived from cell-free deoxyribonucleic acid (cfDNA) molecules in a bodily sample of a subject may further comprise enriching one or more loci from the cell-free DNA in the bodily sample before step (a), thereby producing enriched polynucleotides.
- The method may further comprise amplifying the enriched polynucleotides to produce families of amplicons, wherein each family comprises amplicons originating from a single strand of the cell-free DNA molecules. The non-biological error may comprise error in sequencing at a plurality of genomic base locations. The non-biological error may comprise error in amplification at a plurality of genomic base locations.
- FIG. 2 illustrates an example of small families of reads (which may appear to provide evidence for a true indel variant) and large families of reads (which may indicate a likely introduced error stemming from PCR or sequencing). In general, true indels may be expected to be detected or measured as small families of reads, since they may not be expected to affect large numbers of DNA molecules biologically. In contrast, introduced indels may be expected to be detected or measured as larger families of reads, which may indicate an introduced error during PCR or sequencing. Some untrimmed or erroneous reads may cause the algorithm to disqualify the family based on a hypothesis test that classifies an indel (e.g., insertion or deletion) as introduced rather than biological.
- FIG. 3 illustrates an example of an insertion being supported by a large family upon aligning and comparing a plurality of sequence reads to a reference genome. As in the above case in FIG. 2, some untrimmed or erroneous reads may cause the algorithm to disqualify the family based on a hypothesis test that classifies an indel (e.g., insertion or deletion) as introduced rather than biological.
- Model parameters may comprise one or more of (e.g., one or more, two or more, three or more, or four of) (i) for each of one or more variant alleles, a frequency of the variant allele (α) and a frequency of non-reference alleles other than the variant allele (α′); (ii) a frequency of an indel error in the entire forward strand of a family of strands (β1), wherein a family comprises a collection of amplicons originating from a single strand of the cell-free DNA molecules; (iii) a frequency of an indel error in the entire reverse strand of a family of strands (β2); and (iv) a frequency of an indel error in a sequence read (γ).
- FIG. 4 illustrates the various parameters that may be used in a hypothesis test and how each parameter may be related to a particular probability, e.g., of a family of reads matching a reference, of a strand's reads matching a reference, and of a read matching a reference. FIG. 4 also illustrates how a parameter test containing a maximum likelihood function may be performed. If the parameter test is greater than a predetermined threshold when performed on a candidate indel, then the candidate may be classified as a true indel. If the parameter test is less than or equal to a predetermined threshold when performed on a candidate indel, then the candidate may be classified as an introduced (e.g., non-biological) indel.
- The step of performing a hypothesis test may comprise performing a multi-parameter maximization algorithm. The multi-parameter maximization algorithm may comprise a Nelder-Mead algorithm. The classifying of a candidate indel as a true indel or an introduced indel may comprise (a) maximizing a multi-parameter likelihood function, (b) classifying a candidate indel as a true indel if the maximum likelihood function value is greater than a predetermined threshold value, and (c) classifying a candidate indel as an introduced indel if the maximum likelihood function value is less than or equal to a predetermined threshold value. The multi-parameter likelihood function may be given as:
-
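- A form consistent with the factorization described in the paragraphs that follow is sketched here. The shorthand is introduced for illustration and is an assumption rather than the notation of the original equation: θ collects the five model parameters, L_f^{(t)} is the probability of the reads in family f given family type t (a, b, or c), R_{f,s} is the set of reads in family f from strand s (1 for forward, 2 for reverse), and p_{t,X}(r) and p_{t,Y}(r) are the per-read probabilities from Table 6 when the strand-specific error is present (X) or absent (Y). The text states the strand-level factorization explicitly for the variant-allele family type; applying the same form to types b) and c) is assumed here.

```latex
\Pr\{\text{Reads}\mid\theta\}
  = \prod_{f}\Bigl[\alpha\,L_f^{(a)} + \alpha'\,L_f^{(b)}
      + (1-\alpha-\alpha')\,L_f^{(c)}\Bigr],
\qquad \theta = (\alpha,\alpha',\beta_1,\beta_2,\gamma),

L_f^{(t)}
  = \prod_{s\in\{1,2\}}\Bigl[\beta_s \prod_{r\in R_{f,s}} p_{t,X}(r)
      + (1-\beta_s) \prod_{r\in R_{f,s}} p_{t,Y}(r)\Bigr],
\qquad t\in\{a,b,c\}.
```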
- A multi-parameter likelihood function Pr{Reads|α, α′, β1, β2, γ} may represent a probability of an observed configuration of reads according to the model illustrated in FIG. 4 (and described in paragraph [00112]). One assumption of the model may be that, given certain values of parameters (e.g., α, α′, β1, β2, and γ), an observed configuration of reads within a family is statistically independent from an observed configuration of reads within all other families. Therefore, the probability Pr{Reads|α, α′, β1, β2, γ} can be expressed as a product of Pr{reads in family f|α, α′, β1, β2, γ} over all families. This per-family probability itself may comprise a weighted sum of at least three components, wherein each component corresponds to a possible family type: a) having the variant allele (with weight α), b) having other non-reference variant allele (with weight α′), or c) having the reference allele (with weight 1−α−α′). These components being summed may be probabilities of observed read configuration for the respective family type: Pr{reads in family f|α, α′, β1, β2, γ, and family f having variant allele}, Pr{reads in family f|α, α′, β1, β2, γ, and family f having other non-reference variant allele}, and Pr{reads in family f|α, α′, β1, β2, γ, and family f having reference allele}.
- Since the model postulates that within a family each strand may be affected by an indel error independently of the other strand, the probability of observed read configuration for a family having variant allele Pr{reads in family f|α, α′, β1, β2, γ, and family f having variant allele} may be itself a product of the probability of observed configuration of reads from the forward strand and the probability of observed configuration of reads from the reverse strand. Each of these probabilities may be itself a weighted sum of at least two components, wherein each component corresponds to a possible outcome: X) the strand-specific indel error did affect this family strand (with weight β1 or β2) and Y) the strand-specific indel error did not affect this family strand (with weight 1−β1 or 1−β2).
- Finally, within a family of assumed type a), b), or c), and/or within a strand of assumed type X) or Y), the probability of a specific read configuration may be a product of probabilities for individual reads, since it is postulated by the model that these reads have a statistically independent probability of falling into one of the three categories: i) read supports the variant allele, ii) read supports other non-reference variant allele, or iii) read supports the reference allele. These probabilities are listed in Table 6 below.
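- The nested structure described above (product over families, weighted sum over family types, product over strands, weighted sum over strand-error outcomes, product over reads) can be sketched in code. This is a minimal sketch under stated assumptions: the function names, the representation of each strand as a tuple of read counts `(n_variant, n_other, n_reference)`, and the extension of the strand-level factorization to family types b) and c) are introduced here for illustration; the per-read probabilities are transcribed from Table 6.

```python
import math


def table6(gamma):
    """Per-read probabilities from Table 6, keyed by
    (family type, strand error present?) ->
    (supports variant, supports other, supports reference)."""
    g, h = gamma, 1.0 - gamma
    return {
        ("variant", True): (g, h, h), ("variant", False): (h, g, g),
        ("other", True): (h, g, h), ("other", False): (g, h, g),
        ("reference", True): (h, h, g), ("reference", False): (g, g, h),
    }


def strand_prob(counts, family_type, beta, probs):
    """Weighted sum over outcomes X (error present) and Y (absent)."""
    def config(error_present):
        per_read = probs[(family_type, error_present)]
        return math.prod(p ** n for p, n in zip(per_read, counts))
    return beta * config(True) + (1.0 - beta) * config(False)


def family_prob(fwd, rev, alpha, alpha_other, beta1, beta2, gamma):
    """Weighted sum over the three family types a), b), c)."""
    probs = table6(gamma)
    weights = {"variant": alpha, "other": alpha_other,
               "reference": 1.0 - alpha - alpha_other}
    return sum(w * strand_prob(fwd, t, beta1, probs)
                 * strand_prob(rev, t, beta2, probs)
               for t, w in weights.items())


def log_likelihood(families, alpha, alpha_other, beta1, beta2, gamma):
    """Log of the product over families; raises ValueError if any
    family has probability zero under the given parameters."""
    return sum(math.log(family_prob(fwd, rev, alpha, alpha_other,
                                    beta1, beta2, gamma))
               for fwd, rev in families)


def classify(statistic, threshold):
    """True indel if the test statistic exceeds the threshold."""
    return "true indel" if statistic > threshold else "introduced indel"
```

- In practice the log-likelihood would be maximized over the parameters before the threshold comparison, for example by passing its negative to a Nelder-Mead routine such as `scipy.optimize.minimize(..., method="Nelder-Mead")`; that step, the exact test statistic, and the threshold value are left out of this sketch.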
-
TABLE 6
Family | Strand error | i) read supports variant | ii) read supports other | iii) read supports reference |
---|---|---|---|---|
a) variant allele | present | γ | 1 − γ | 1 − γ |
a) variant allele | absent | 1 − γ | γ | γ |
b) other variant allele | present | 1 − γ | γ | 1 − γ |
b) other variant allele | absent | γ | 1 − γ | γ |
c) reference allele | present | 1 − γ | 1 − γ | γ |
c) reference allele | absent | γ | γ | 1 − γ |
- While preferred embodiments of the present invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. It is not intended that the invention be limited by the specific examples provided within the specification. While the invention has been described with reference to the aforementioned specification, the descriptions and illustrations of the embodiments herein are not meant to be construed in a limiting sense. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the invention. Furthermore, it shall be understood that all aspects of the invention are not limited to the specific depictions, configurations or relative proportions set forth herein which depend upon a variety of conditions and variables. It should be understood that various alternatives to the embodiments of the invention described herein may be employed in practicing the invention. It is therefore contemplated that the invention shall also cover any such alternatives, modifications, variations or equivalents. It is intended that the following claims define the scope of the invention and that methods and structures within the scope of these claims and their equivalents be covered thereby.
- The present disclosure provides computer control systems that are programmed to implement methods of the disclosure. In one aspect, the present disclosure provides a system comprising a computer comprising a processor and computer memory, wherein the computer is in communication with a communications network, and wherein computer memory comprises code which, when executed by the processor, (1) receives sequence data into computer memory from the communications network; (2) determines whether a genetic variant in the sequence data represents a mutant; and (3) reports out, over the communications network, the determination.
- A communications network can be any available network that connects to the Internet. The communications network can utilize, for example, a high-speed transmission network including, without limitation, Broadband over Powerlines (BPL), Cable Modem, Digital Subscriber Line (DSL), Fiber, Satellite and Wireless.
- In another aspect, provided herein is a system comprising: a local area network; one or more DNA sequencers comprising computer memory configured to store DNA sequence data which are connected to the local area network; a bioinformatics computer comprising a computer memory and a processor, which computer is connected to the local area network; wherein the computer further comprises code which, when executed, copies DNA sequence data stored on the DNA sequencer, writes the copied data to memory in the bioinformatics computer and performs steps as described herein.
- FIG. 5 shows a computer system 501 that is programmed or otherwise configured to implement methods for generating a bait set, for selecting a set of panel blocks, and for improving accuracy of detecting an indel from a plurality of sequence reads derived from cfDNA molecules. The computer system 501 can regulate various aspects of the present disclosure, such as, for example, methods for generating a bait set, for selecting a set of panel blocks, or for improving accuracy of detecting an indel from a plurality of sequence reads derived from cfDNA molecules. The computer system 501 can be an electronic device of a user or a computer system that is remotely located with respect to the electronic device. The electronic device can be a mobile electronic device.
- The computer system 501 includes a central processing unit (CPU, also “processor” and “computer processor” herein) 505, which can be a single core or multi core processor, or a plurality of processors for parallel processing. The computer system 501 also includes memory or memory location 510 (e.g., random-access memory, read-only memory, flash memory), electronic storage unit 515 (e.g., hard disk), communication interface 520 (e.g., network adapter) for communicating with one or more other systems, and peripheral devices 525, such as cache, other memory, data storage and/or electronic display adapters. The memory 510, storage unit 515, interface 520 and peripheral devices 525 are in communication with the CPU 505 through a communication bus (solid lines), such as a motherboard. The storage unit 515 can be a data storage unit (or data repository) for storing data. The computer system 501 can be operatively coupled to a computer network (“network”) 530 with the aid of the communication interface 520. The network 530 can be the Internet, an internet and/or extranet, or an intranet and/or extranet that is in communication with the Internet. The network 530 in some cases is a telecommunication and/or data network. The network 530 can include one or more computer servers, which can enable distributed computing, such as cloud computing. The network 530, in some cases with the aid of the computer system 501, can implement a peer-to-peer network, which may enable devices coupled to the computer system 501 to behave as a client or a server.
- The CPU 505 can execute a sequence of machine-readable instructions, which can be embodied in a program or software. The instructions may be stored in a memory location, such as the memory 510. The instructions can be directed to the CPU 505, which can subsequently program or otherwise configure the CPU 505 to implement methods of the present disclosure. Examples of operations performed by the CPU 505 can include fetch, decode, execute, and writeback.
- The CPU 505 can be part of a circuit, such as an integrated circuit. One or more other components of the system 501 can be included in the circuit. In some cases, the circuit is an application specific integrated circuit (ASIC).
- The storage unit 515 can store files, such as drivers, libraries and saved programs. The storage unit 515 can store user data, e.g., user preferences and user programs. The computer system 501 in some cases can include one or more additional data storage units that are external to the computer system 501, such as located on a remote server that is in communication with the computer system 501 through an intranet or the Internet.
- The computer system 501 can communicate with one or more remote computer systems through the network 530. For instance, the computer system 501 can communicate with a remote computer system of a user. Examples of remote computer systems include personal computers (e.g., portable PC), slate or tablet PC's (e.g., Apple® iPad, Samsung® Galaxy Tab), telephones, Smart phones (e.g., Apple® iPhone, Android-enabled device, Blackberry®), or personal digital assistants. The user can access the computer system 501 via the network 530.
- Methods as described herein can be implemented by way of machine (e.g., computer processor) executable code stored on an electronic storage location of the computer system 501, such as, for example, on the memory 510 or electronic storage unit 515. The machine executable or machine readable code can be provided in the form of software. During use, the code can be executed by the processor 505. In some cases, the code can be retrieved from the storage unit 515 and stored on the memory 510 for ready access by the processor 505. In some situations, the electronic storage unit 515 can be precluded, and machine-executable instructions are stored on memory 510.
- The code can be pre-compiled and configured for use with a machine having a processor adapted to execute the code, or can be compiled during runtime. The code can be supplied in a programming language that can be selected to enable the code to execute in a pre-compiled or as-compiled fashion.
- Aspects of the systems and methods provided herein, such as the computer system 501, can be embodied in programming. Various aspects of the technology may be thought of as “products” or “articles of manufacture” typically in the form of machine (or processor) executable code and/or associated data that is carried on or embodied in a type of machine readable medium. Machine-executable code can be stored on an electronic storage unit, such as memory (e.g., read-only memory, random-access memory, flash memory) or a hard disk. “Storage” type media can include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide non-transitory storage at any time for the software programming. All or portions of the software may at times be communicated through the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer into the computer platform of an application server. Thus, another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links. The physical elements that carry such waves, such as wired or wireless links, optical links or the like, also may be considered as media bearing the software. As used herein, unless restricted to non-transitory, tangible “storage” media, terms such as computer or machine “readable medium” refer to any medium that participates in providing instructions to a processor for execution.
- Hence, a machine readable medium, such as computer-executable code, may take many forms, including but not limited to, a tangible storage medium, a carrier wave medium or physical transmission medium. Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, such as may be used to implement the databases, etc. shown in the drawings. Volatile storage media include dynamic memory, such as main memory of such a computer platform. Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that comprise a bus within a computer system. Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media therefore include for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a ROM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data. Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution.
- The computer system 501 can include or be in communication with an electronic display 535 that comprises a user interface (UI) 540 for providing, for example, input parameters for methods for generating a bait set, for selecting a set of panel blocks, or for improving accuracy of detecting an indel from a plurality of sequence reads derived from cfDNA. Examples of UIs include, without limitation, a graphical user interface (GUI) and web-based user interface.
- Methods and systems of the present disclosure can be implemented by way of one or more algorithms. An algorithm can be implemented by way of software upon execution by the central processing unit 505. The algorithm can, for example, generate a bait set, select a set of panel blocks, or improve accuracy of detecting an indel from a plurality of sequence reads derived from cfDNA molecules.
- Analytical sensitivity (as defined by the limit of detection and by positive percent agreement) and precision were assessed throughout the reportable allelic fraction and copy number ranges via multiple serial dilution studies of orthogonally-characterized contrived material and patient samples. Analytical specificity was assessed by calculating the false positive rate in pre-characterized healthy donor sample mixtures serially diluted across the lower reportable range down to allelic fractions below the limit of detection. Positive predictive value (PPV) was estimated as a function of allelic fraction/copy number from pre-characterized clinical patient samples and prevalence-adjusted using a cohort of 2,585 consecutive clinical samples. Orthogonal qualitative and quantitative confirmation was performed using ddPCR.
- Analytical performance is summarized in Table 7 below. Analytical specificity was 100% for single nucleotide variants (SNVs), fusions, and copy number alterations (CNAs) and 96% (24/25) for indels across 25 defined samples. Relative to other methods, this assay demonstrated 20%-50% increases in fusion molecule recovery, depending on the sequence context. Retrospective in silico analysis of 2,585 consecutive clinical samples demonstrated a >15% relative increase in actionable fusion detection, a 6%-15% increase in actionable indel detection (excluding newly reportable indels), and a 3%-6% increase in actionable SNV detection.
-
TABLE 7
Alterations | Reportable Range | 95% Limit of Detection | Allelic Fraction/Copy Number | Analytical Sensitivity | Allelic Fraction/Copy Number | PPV |
---|---|---|---|---|---|---|
SNVs | ≥0.04% | 0.25% | ≥0.25% | >99.9% | ≥0.25% | 98.7% |
 | | | 0.05-0.25% | 63.8% | <0.25% | 92.3% |
Indels | ≥0.02% | 0.2% | ≥0.25% | >99.9% | ≥0.25% | 98.4% |
 | | | 0.05-0.25% | 67.8% | <0.25% | 88.5% |
Fusions | ≥0.04% | 0.4% | ≥0.3% | 100% | any | 100% |
 | | | <0.3% | 83.0% | | |
CNAs | ≥2.12 copies | 2.24-2.93 copies | 2.3 copies | 95.0% | any | 100% |
- Table 7: Analytical performance characteristics based on standard cfDNA input (30 ng). Analytical sensitivity/limit of detection estimates are provided for clinically actionable variants and can vary by sequence context and cfDNA input. Positive predictive value is estimated across the entire reportable panel space (PPV was 100% for clinically actionable variants).
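- The text does not state how the prevalence adjustment of PPV was carried out. As a hedged illustration only, the standard Bayes relationship between sensitivity, specificity, prevalence, and PPV can be sketched as below; the function names and the input values in the usage note are assumptions, not figures or methods taken from the study. The specificity helper reproduces the arithmetic behind the reported 96% (24/25) for indels.

```python
def specificity_from_counts(true_negatives, false_positives):
    """Specificity = TN / (TN + FP)."""
    return true_negatives / (true_negatives + false_positives)


def prevalence_adjusted_ppv(sensitivity, specificity, prevalence):
    """PPV by Bayes' rule for a given prevalence of true positives.

    Assumes at least one of the two terms is non-zero.
    """
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)
```

- For instance, `specificity_from_counts(24, 1)` gives 0.96, and with illustrative inputs of 90% sensitivity, 90% specificity, and 50% prevalence, `prevalence_adjusted_ppv` returns 0.9; lowering the prevalence lowers the PPV even when sensitivity and specificity are unchanged.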
- In sum, the assay comprehensively detected all adult solid tumor guideline-recommended somatic genomic variants with high sensitivity, accuracy, and specificity.
- In this experiment, the appropriate probe replication and the saturation point for each panel were determined. Hotspot and backbone panels were designed for both default probe replication and optimized probe replication. The hotspot panel is approximately 12 kb and targets genomic regions that may be indicative of drug response or a disease status (e.g., cancer) and/or that are listed under National Comprehensive Cancer Network (“NCCN”) guidelines. The backbone panel is approximately 140 kb and covers the rest of the panel content. The hotspot and backbone panels may comprise any genetic locations in Table 3. A titration experiment was performed for panel input amount for each of the four panels at 5 ng, 15 ng, and 30 ng of cfDNA as set forth in Table 1. FIG. 6 shows input amount versus unique molecule count for the generic panel. The unique molecule count saturated at about Vol. 3× for the backbone bait and about Vol. 1.2× for the hotspot bait (data not shown), suggesting that the optimized backbone panel was less variable compared to the default panel.
- Based on the saturation point of each panel in Example 2, a concentration of backbone bait and a concentration of hotspot bait were determined. A mixture of backbone bait (e.g., Vol. A) and hotspot bait (e.g., Vol. B) was generated and the molecule count for the hotspot/backbone bait mixture was compared with the molecule count for a generic panel. The molecule counts from the hotspot panel were higher than those from the backbone panel. The difference became more noticeable at higher cfDNA input amounts because the backbone bait saturated much sooner (e.g., at a lower input amount) than the hotspot bait. A similar trend was seen with the double-stranded count (data not shown). Family size was also higher for the hotspot panel than the backbone panel (data not shown). The difference in family sizes may indicate that the hotspot panel is capturing more than the backbone panel, even though this effect was masked in the molecule counts. For example, with the large family sizes for 5 ng, it is likely that most of the unique molecules were captured, so there was no obvious difference between the hotspot and backbone panels. With the family size differences, it is likely that more PCR duplicates were being captured by the hotspot panel than the backbone panel.
- In sum, this experiment demonstrates that hotspot regions may be selectively captured with an increased hotspot panel amount.
Claims (20)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US18/503,392 US12100482B2 (en) | 2016-09-30 | 2023-11-07 | Methods for multi-resolution analysis of cell-free nucleic acids |
Applications Claiming Priority (10)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201662402940P | 2016-09-30 | 2016-09-30 | |
US201762468201P | 2017-03-07 | 2017-03-07 | |
US201762489391P | 2017-04-24 | 2017-04-24 | |
PCT/US2017/054607 WO2018064629A1 (en) | 2016-09-30 | 2017-09-29 | Methods for multi-resolution analysis of cell-free nucleic acids |
US201916338445A | 2019-03-29 | 2019-03-29 | |
US17/383,385 US20210358567A1 (en) | 2016-09-30 | 2021-07-22 | Systems and methods for detecting insertions and deletions |
US18/055,298 US11817177B2 (en) | 2016-09-30 | 2022-11-14 | Methods for multi-resolution analysis of cell-free nucleic acids |
US18/155,523 US11817179B2 (en) | 2016-09-30 | 2023-01-17 | Methods for multi-resolution analysis of cell-free nucleic acids |
US18/482,779 US20240233868A9 (en) | 2016-09-30 | 2023-10-06 | Methods for multi-resolution analysis of cell-free nucleic acids |
US18/503,392 US12100482B2 (en) | 2016-09-30 | 2023-11-07 | Methods for multi-resolution analysis of cell-free nucleic acids |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US18/482,779 Continuation US20240233868A9 (en) | 2016-09-30 | 2023-10-06 | Methods for multi-resolution analysis of cell-free nucleic acids |
Publications (2)
Publication Number | Publication Date |
---|---|
US20240087680A1 (en) | 2024-03-14 |
US12100482B2 (en) | 2024-09-24 |
Family
ID=61760169
Family Applications (8)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/338,445 Abandoned US20200013482A1 (en) | 2016-09-30 | 2017-09-29 | Methods for multi-resolution analysis of cell-free nucleic acids |
US16/719,768 Active US11062791B2 (en) | 2016-09-30 | 2019-12-18 | Methods for multi-resolution analysis of cell-free nucleic acids |
US17/383,385 Pending US20210358567A1 (en) | 2016-09-30 | 2021-07-22 | Systems and methods for detecting insertions and deletions |
US18/055,298 Active US11817177B2 (en) | 2016-09-30 | 2022-11-14 | Methods for multi-resolution analysis of cell-free nucleic acids |
US18/155,523 Active US11817179B2 (en) | 2016-09-30 | 2023-01-17 | Methods for multi-resolution analysis of cell-free nucleic acids |
US18/482,779 Pending US20240233868A9 (en) | 2016-09-30 | 2023-10-06 | Methods for multi-resolution analysis of cell-free nucleic acids |
US18/503,392 Active US12100482B2 (en) | 2016-09-30 | 2023-11-07 | Methods for multi-resolution analysis of cell-free nucleic acids |
US18/506,734 Active US12094573B2 (en) | 2016-09-30 | 2023-11-10 | Methods for multi-resolution analysis of cell-free nucleic acids |
Family Applications Before (6)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/338,445 Abandoned US20200013482A1 (en) | 2016-09-30 | 2017-09-29 | Methods for multi-resolution analysis of cell-free nucleic acids |
US16/719,768 Active US11062791B2 (en) | 2016-09-30 | 2019-12-18 | Methods for multi-resolution analysis of cell-free nucleic acids |
US17/383,385 Pending US20210358567A1 (en) | 2016-09-30 | 2021-07-22 | Systems and methods for detecting insertions and deletions |
US18/055,298 Active US11817177B2 (en) | 2016-09-30 | 2022-11-14 | Methods for multi-resolution analysis of cell-free nucleic acids |
US18/155,523 Active US11817179B2 (en) | 2016-09-30 | 2023-01-17 | Methods for multi-resolution analysis of cell-free nucleic acids |
US18/482,779 Pending US20240233868A9 (en) | 2016-09-30 | 2023-10-06 | Methods for multi-resolution analysis of cell-free nucleic acids |
Family Applications After (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US18/506,734 Active US12094573B2 (en) | 2016-09-30 | 2023-11-10 | Methods for multi-resolution analysis of cell-free nucleic acids |
Country Status (10)
Country | Link |
---|---|
US (8) | US20200013482A1 (en) |
EP (2) | EP3461274B1 (en) |
JP (5) | JP6560465B1 (en) |
KR (1) | KR102344635B1 (en) |
CN (2) | CN118460676A (en) |
AU (2) | AU2017336153B2 (en) |
CA (2) | CA3027919C (en) |
ES (1) | ES2840003T3 (en) |
SG (1) | SG11201811159SA (en) |
WO (1) | WO2018064629A1 (en) |
Families Citing this family (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AU2012242847B2 (en) | 2011-04-15 | 2017-01-19 | The Johns Hopkins University | Safe sequencing system |
AU2013338393C1 (en) | 2012-10-29 | 2024-07-25 | The Johns Hopkins University | Papanicolaou test for ovarian and endometrial cancers |
US9932576B2 (en) | 2012-12-10 | 2018-04-03 | Resolution Bioscience, Inc. | Methods for targeted genomic analysis |
US11286531B2 (en) | 2015-08-11 | 2022-03-29 | The Johns Hopkins University | Assaying ovarian cyst fluid |
DK3374525T3 (en) | 2015-11-11 | 2021-04-06 | Resolution Bioscience Inc | HIGH-EFFICIENT CONSTRUCTION OF DNA LIBRARIES |
EP3443066B1 (en) | 2016-04-14 | 2024-10-02 | Guardant Health, Inc. | Methods for early detection of cancer |
US20190287645A1 (en) * | 2016-07-06 | 2019-09-19 | Guardant Health, Inc. | Methods for fragmentome profiling of cell-free nucleic acids |
MX2019002093A (en) | 2016-08-25 | 2019-06-20 | Resolution Bioscience Inc | Methods for the detection of genomic copy changes in dna samples. |
EP3461274B1 (en) | 2016-09-30 | 2020-11-04 | Guardant Health, Inc. | Methods for multi-resolution analysis of cell-free nucleic acids |
WO2019067092A1 (en) | 2017-08-07 | 2019-04-04 | The Johns Hopkins University | Methods and materials for assessing and treating cancer |
CA3097146A1 (en) * | 2018-04-16 | 2019-10-24 | Memorial Sloan Kettering Cancer Center | Systems and methods for detecting cancer via cfdna screening |
IL298458A (en) | 2020-05-22 | 2023-01-01 | Aqtual Inc | Methods for characterizing cell-free nucleic acid fragments |
EP4407042A3 (en) | 2020-07-10 | 2024-09-18 | Guardant Health, Inc. | Methods of detecting genomic rearrangements using cell free nucleic acids |
WO2023282916A1 (en) | 2021-07-09 | 2023-01-12 | Guardant Health, Inc. | Methods of detecting genomic rearrangements using cell free nucleic acids |
JP2023540221A (en) | 2020-08-25 | 2023-09-22 | Guardant Health, Inc. | Methods and systems for predicting variant origin |
EP4407558A1 (en) | 2023-01-26 | 2024-07-31 | Koninklijke Philips N.V. | Obtaining a medical image at a target plane |
CN117935921B (en) * | 2024-03-21 | 2024-06-11 | Beijing Berry Genomics Biotechnology Co., Ltd. | Method, apparatus, medium and program product for determining deletion/repetition type |
Family Cites Families (99)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CH686982A5 (en) | 1993-12-16 | 1996-08-15 | Maurice Stroun | Method for diagnosis of cancers. |
US5648245A (en) | 1995-05-09 | 1997-07-15 | Carnegie Institution Of Washington | Method for constructing an oligonucleotide concatamer library by rolling circle replication |
US6156504A (en) | 1996-03-15 | 2000-12-05 | The Penn State Research Foundation | Detection of extracellular tumor-associated nucleic acid in blood plasma or serum using nucleic acid amplification assays |
WO1997035589A1 (en) | 1996-03-26 | 1997-10-02 | Kopreski Michael S | Method enabling use of extracellular rna extracted from plasma or serum to detect, monitor or evaluate cancer |
US6232066B1 (en) * | 1997-12-19 | 2001-05-15 | Neogen, Inc. | High throughput assay system |
CA2366459A1 (en) | 1999-03-26 | 2000-10-05 | Affymetrix, Inc. | Universal arrays |
CA2366778C (en) | 1999-04-09 | 2008-07-22 | Exact Sciences Corporation | Methods for detecting nucleic acids indicative of cancer |
DE60045059D1 (en) | 1999-04-20 | 2010-11-18 | Nat Inst Of Advanced Ind Scien | Methods and probes for determining the concentration of nucleic acid molecules and methods for analyzing the data obtained |
US6440706B1 (en) | 1999-08-02 | 2002-08-27 | Johns Hopkins University | Digital amplification |
US6849403B1 (en) | 1999-09-08 | 2005-02-01 | Exact Sciences Corporation | Apparatus and method for drug screening |
US6586177B1 (en) | 1999-09-08 | 2003-07-01 | Exact Sciences Corporation | Methods for disease detection |
WO2001042781A2 (en) | 1999-12-07 | 2001-06-14 | Exact Sciences Corporation | Supracolonic aerodigestive neoplasm detection |
US20020072058A1 (en) | 2000-03-24 | 2002-06-13 | Voelker Leroy L. | Method for amplifying quinolone-resistance-determining-regions and identifying polymorphic variants thereof |
EP1158055A1 (en) | 2000-05-26 | 2001-11-28 | Xu Qi University of Teaxs Laboratoire de Leucémie Chen | Method for diagnosing cancers |
CA2426824A1 (en) | 2000-10-24 | 2002-07-25 | The Board Of Trustees Of The Leland Stanford Junior University | Direct multiplex characterization of genomic dna |
DK1342794T3 (en) | 2002-03-05 | 2006-04-24 | Epigenomics Ag | Method and apparatus for determining tissue specificity of free-flowing DNA in body fluids |
US7727720B2 (en) | 2002-05-08 | 2010-06-01 | Ravgen, Inc. | Methods for detection of genetic disorders |
US7635564B2 (en) * | 2002-10-25 | 2009-12-22 | Agilent Technologies, Inc. | Biopolymeric arrays having replicate elements |
US10229244B2 (en) | 2002-11-11 | 2019-03-12 | Affymetrix, Inc. | Methods for identifying DNA copy number changes using hidden markov model based estimations |
WO2005010145A2 (en) | 2003-07-05 | 2005-02-03 | The Johns Hopkins University | Method and compositions for detection and enumeration of genetic variations |
EP1524321B2 (en) | 2003-10-16 | 2014-07-23 | Sequenom, Inc. | Non-invasive detection of fetal genetic traits |
DE10348407A1 (en) | 2003-10-17 | 2005-05-19 | Widschwendter, Martin, Prof. | Prognostic and diagnostic markers for cell proliferative disorders of breast tissues |
CA2544041C (en) * | 2003-10-28 | 2015-12-08 | Bioarray Solutions Ltd. | Optimization of gene expression analysis using immobilized capture probes |
US20100216153A1 (en) | 2004-02-27 | 2010-08-26 | Helicos Biosciences Corporation | Methods for detecting fetal nucleic acids and diagnosing fetal abnormalities |
US7937225B2 (en) | 2004-09-03 | 2011-05-03 | New York University | Systems, methods and software arrangements for detection of genome copy number variation |
EP1647600A3 (en) | 2004-09-17 | 2006-06-28 | Affymetrix, Inc. (A US Entity) | Methods for identifying biological samples by addition of nucleic acid bar-code tags |
US9109256B2 (en) | 2004-10-27 | 2015-08-18 | Esoterix Genetic Laboratories, Llc | Method for monitoring disease progression or recurrence |
US7424371B2 (en) | 2004-12-21 | 2008-09-09 | Helicos Biosciences Corporation | Nucleic acid analysis |
US7393665B2 (en) | 2005-02-10 | 2008-07-01 | Population Genetics Technologies Ltd | Methods and compositions for tagging and identifying polynucleotides |
ITRM20050068A1 (en) | 2005-02-17 | 2006-08-18 | Istituto Naz Per Le Malattie I | METHOD FOR THE DETECTION OF NUCLEIC ACIDS OF BACTERIAL OR PATENT PATOGEN AGENTS IN URINE. |
EP1712639B1 (en) | 2005-04-06 | 2008-08-27 | Maurice Stroun | Method for the diagnosis of cancer by detecting circulating DNA and RNA |
US7666593B2 (en) | 2005-08-26 | 2010-02-23 | Helicos Biosciences Corporation | Single molecule sequencing of captured nucleic acids |
DK1929039T4 (en) | 2005-09-29 | 2014-02-17 | Keygene Nv | High throughput screening of mutagenized populations |
WO2007087312A2 (en) | 2006-01-23 | 2007-08-02 | Population Genetics Technologies Ltd. | Molecular counting |
US8383338B2 (en) | 2006-04-24 | 2013-02-26 | Roche Nimblegen, Inc. | Methods and systems for uniform enrichment of genomic regions |
IL282783B2 (en) | 2006-05-18 | 2023-09-01 | Caris Mpi Inc | System and method for determining individualized medical intervention for a disease state |
US20080090239A1 (en) | 2006-06-14 | 2008-04-17 | Daniel Shoemaker | Rare cell analysis using sample splitting and dna tags |
CA2669728C (en) | 2006-11-15 | 2017-04-11 | Biospherex Llc | Multitag sequencing and ecogenomics analysis |
US20080131887A1 (en) | 2006-11-30 | 2008-06-05 | Stephan Dietrich A | Genetic Analysis Systems and Methods |
US20100196898A1 (en) | 2007-05-24 | 2010-08-05 | The Brigham & Women's Hospital, Inc. | Disease-associated genetic variations and methods for obtaining and using same |
CN101720359A (en) | 2007-06-01 | 2010-06-02 | 454 Life Sciences Corporation | System and method for identification of individual samples from a multiplex mixture |
KR101829565B1 (en) | 2007-07-23 | 2018-03-29 | The Chinese University of Hong Kong | Determining a nucleic acid sequence imbalance |
JP2011508450A (en) | 2007-12-28 | 2011-03-10 | 3M Innovative Properties Company | Downconverted light source with uniform wavelength emission |
WO2009099602A1 (en) | 2008-02-04 | 2009-08-13 | Massachusetts Institute Of Technology | Selection of nucleic acids by solution hybridization to oligonucleotide baits |
JP2011511644A (en) | 2008-02-12 | 2011-04-14 | Novartis AG | Methods for isolating cell-free apoptotic or fetal nucleic acids |
WO2009120372A2 (en) | 2008-03-28 | 2009-10-01 | Pacific Biosciences Of California, Inc. | Compositions and methods for nucleic acid sequencing |
US20090318305A1 (en) | 2008-06-18 | 2009-12-24 | Xi Erick Lin | Methods for selectively capturing and amplifying exons or targeted genomic regions from biological samples |
WO2010021936A1 (en) | 2008-08-16 | 2010-02-25 | The Board Of Trustees Of The Leland Stanford Junior University | Digital pcr calibration for high throughput sequencing |
US8383345B2 (en) | 2008-09-12 | 2013-02-26 | University Of Washington | Sequence tag directed subassembly of short sequencing reads into long sequencing reads |
US20100323348A1 (en) | 2009-01-31 | 2010-12-23 | The Regents Of The University Of Colorado, A Body Corporate | Methods and Compositions for Using Error-Detecting and/or Error-Correcting Barcodes in Nucleic Acid Amplification Process |
US9085798B2 (en) | 2009-04-30 | 2015-07-21 | Prognosys Biosciences, Inc. | Nucleic acid constructs and methods of use |
US10662474B2 (en) | 2010-01-19 | 2020-05-26 | Verinata Health, Inc. | Identification of polymorphic sequences in mixtures of genomic DNA by whole genome sequencing |
EP2366031B1 (en) | 2010-01-19 | 2015-01-21 | Verinata Health, Inc | Sequencing methods in prenatal diagnoses |
EP2591433A4 (en) | 2010-07-06 | 2017-05-17 | Life Technologies Corporation | Systems and methods to detect copy number variation |
WO2012014877A1 (en) | 2010-07-29 | 2012-02-02 | TOTO Ltd. | Photocatalyst coated body and photocatalyst coating liquid |
EP3115468B1 (en) | 2010-09-21 | 2018-07-25 | Agilent Technologies, Inc. | Increasing confidence of allele calls with molecular counting |
US8725422B2 (en) | 2010-10-13 | 2014-05-13 | Complete Genomics, Inc. | Methods for estimating genome-wide copy number variations |
EP2630263B2 (en) | 2010-10-22 | 2021-11-10 | Cold Spring Harbor Laboratory | Varietal counting of nucleic acids for obtaining genomic copy number information |
KR20190100425A (en) * | 2010-12-30 | 2019-08-28 | Foundation Medicine, Inc. | Optimization of multigene analysis of tumor samples |
JP6153874B2 (en) * | 2011-02-09 | 2017-06-28 | Natera, Inc. | Method for non-invasive prenatal ploidy calls |
WO2012129363A2 (en) | 2011-03-24 | 2012-09-27 | President And Fellows Of Harvard College | Single cell nucleic acid detection and analysis |
AU2012242847B2 (en) | 2011-04-15 | 2017-01-19 | The Johns Hopkins University | Safe sequencing system |
AU2012249759A1 (en) | 2011-04-25 | 2013-11-07 | Bio-Rad Laboratories, Inc. | Methods and compositions for nucleic acid analysis |
CN103890245B (en) | 2011-05-20 | 2020-11-17 | Fluidigm Corporation | Nucleic acid encoding reactions |
US9340826B2 (en) | 2011-08-01 | 2016-05-17 | Celemics, Inc. | Method of preparing nucleic acid molecules |
CA2852098C (en) | 2011-10-21 | 2023-05-02 | Chronix Biomedical | Colorectal cancer associated circulating nucleic acid biomarkers |
WO2013060762A1 (en) | 2011-10-25 | 2013-05-02 | Roche Diagnostics Gmbh | Method for diagnosing a disease based on plasma-dna distribution |
PT2814959T (en) | 2012-02-17 | 2018-04-12 | Hutchinson Fred Cancer Res | Compositions and methods for accurately identifying mutations |
US11261494B2 (en) * | 2012-06-21 | 2022-03-01 | The Chinese University Of Hong Kong | Method of measuring a fractional concentration of tumor DNA |
US20160040229A1 (en) | 2013-08-16 | 2016-02-11 | Guardant Health, Inc. | Systems and methods to detect rare mutations and copy number variation |
KR102028375B1 (en) | 2012-09-04 | 2019-10-04 | Guardant Health, Inc. | Systems and methods to detect rare mutations and copy number variation |
US20140066317A1 (en) | 2012-09-04 | 2014-03-06 | Guardant Health, Inc. | Systems and methods to detect rare mutations and copy number variation |
US20140287408A1 (en) * | 2013-03-13 | 2014-09-25 | Abbott Molecular Inc. | Target sequence enrichment |
GB2528205B (en) | 2013-03-15 | 2020-06-03 | Guardant Health Inc | Systems and methods to detect rare mutations and copy number variation |
ES2946689T3 (en) | 2013-03-15 | 2023-07-24 | Univ Leland Stanford Junior | Identification and use of circulating nucleic acid tumor markers |
EP2999792B1 (en) | 2013-05-23 | 2018-11-14 | The Board of Trustees of The Leland Stanford Junior University | Transposition into native chromatin for personal epigenomics |
EP3378952B1 (en) | 2013-12-28 | 2020-02-05 | Guardant Health, Inc. | Methods and systems for detecting genetic variants |
AU2015249846B2 (en) | 2014-04-21 | 2021-07-22 | Natera, Inc. | Detecting mutations and ploidy in chromosomal segments |
EP3143163B1 (en) | 2014-05-13 | 2020-11-25 | Board of Regents, The University of Texas System | Gene mutations and copy number alterations of egfr, kras and met |
ES2903103T3 (en) | 2014-07-25 | 2022-03-31 | Bgi Genomics Co Ltd | Method for determining the cell-free fraction of fetal nucleic acids in a peripheral blood sample from a pregnant woman and use thereof |
CN117402950A (en) | 2014-07-25 | 2024-01-16 | University of Washington | Method for determining tissue and/or cell type leading to the production of cell-free DNA and method for identifying diseases or disorders using the same |
US20160053301A1 (en) | 2014-08-22 | 2016-02-25 | Clearfork Bioscience, Inc. | Methods for quantitative genetic analysis of cell free dna |
WO2016040901A1 (en) | 2014-09-12 | 2016-03-17 | The Board Of Trustees Of The Leland Stanford Junior University | Identification and use of circulating nucleic acids |
AU2015339148B2 (en) * | 2014-10-29 | 2022-03-10 | 10X Genomics, Inc. | Methods and compositions for targeted nucleic acid sequencing |
GB2552267B (en) | 2014-12-31 | 2020-06-10 | Guardant Health Inc | Detection and treatment of disease exhibiting disease cell heterogeneity and systems and methods for communicating test results |
EP3256605B1 (en) | 2015-02-10 | 2022-02-09 | The Chinese University Of Hong Kong | Detecting mutations for cancer screening and fetal analysis |
US20160281166A1 (en) * | 2015-03-23 | 2016-09-29 | Parabase Genomics, Inc. | Methods and systems for screening diseases in subjects |
WO2016179049A1 (en) | 2015-05-01 | 2016-11-10 | Guardant Health, Inc | Diagnostic methods |
EP3325667B1 (en) | 2015-07-21 | 2020-11-11 | Guardant Health, Inc. | Locked nucleic acids for capturing fusion genes |
HUE064231T2 (en) | 2015-07-23 | 2024-02-28 | Univ Hong Kong Chinese | Analysis of fragmentation patterns of cell-free dna |
US20170058332A1 (en) | 2015-09-02 | 2017-03-02 | Guardant Health, Inc. | Identification of somatic mutations versus germline variants for cell-free dna variant calling applications |
US11302416B2 (en) | 2015-09-02 | 2022-04-12 | Guardant Health | Machine learning for somatic single nucleotide variant detection in cell-free tumor nucleic acid sequencing applications |
CN108474040B (en) | 2015-10-09 | 2023-05-16 | Guardant Health, Inc. | Population-based treatment recommendations using cell-free DNA |
WO2017062970A1 (en) | 2015-10-10 | 2017-04-13 | Guardant Health, Inc. | Methods and applications of gene fusion detection in cell-free dna analysis |
SG11201805119QA (en) | 2015-12-17 | 2018-07-30 | Guardant Health Inc | Methods to determine tumor gene copy number by analysis of cell-free dna |
CN115881230A (en) | 2015-12-17 | 2023-03-31 | Illumina, Inc. | Differentiating methylation levels in complex biological samples |
WO2017136603A1 (en) | 2016-02-02 | 2017-08-10 | Guardant Health, Inc. | Cancer evolution detection and diagnostic |
US9850523B1 (en) * | 2016-09-30 | 2017-12-26 | Guardant Health, Inc. | Methods for multi-resolution analysis of cell-free nucleic acids |
EP3461274B1 (en) | 2016-09-30 | 2020-11-04 | Guardant Health, Inc. | Methods for multi-resolution analysis of cell-free nucleic acids |
2017
- 2017-09-29 EP EP17857586.6A, published as EP3461274B1 (Active)
- 2017-09-29 CN CN202410611574.6A, published as CN118460676A (Pending)
- 2017-09-29 CA CA3027919A, published as CA3027919C (Active)
- 2017-09-29 AU AU2017336153A, published as AU2017336153B2 (Active)
- 2017-09-29 SG SG11201811159SA, published as SG11201811159SA (status unknown)
- 2017-09-29 ES ES17857586T, published as ES2840003T3 (Active)
- 2017-09-29 KR KR1020187038033A, published as KR102344635B1 (IP Right Grant)
- 2017-09-29 US US16/338,445, published as US20200013482A1 (Abandoned)
- 2017-09-29 EP EP20203923.6A, published as EP3792922A1 (Pending)
- 2017-09-29 WO PCT/US2017/054607, published as WO2018064629A1 (status unknown)
- 2017-09-29 CA CA3126055A, published as CA3126055A1 (Pending)
- 2017-09-29 JP JP2018568202A, published as JP6560465B1 (Active)
- 2017-09-29 CN CN201780049135.9A, published as CN109642250B (Active)
2019
- 2019-07-18 JP JP2019132431A, published as JP6806854B2 (Active)
- 2019-12-18 US US16/719,768, published as US11062791B2 (Active)
2020
- 2020-12-04 JP JP2020201573A, published as JP7022188B2 (Active)
2021
- 2021-07-22 US US17/383,385, published as US20210358567A1 (Pending)
2022
- 2022-02-04 JP JP2022016332A, published as JP7385686B2 (Active)
- 2022-11-14 US US18/055,298, published as US11817177B2 (Active)
2023
- 2023-01-17 US US18/155,523, published as US11817179B2 (Active)
- 2023-06-27 AU AU2023204088A, published as AU2023204088A1 (Pending)
- 2023-10-06 US US18/482,779, published as US20240233868A9 (Pending)
- 2023-11-07 US US18/503,392, published as US12100482B2 (Active)
- 2023-11-10 JP JP2023192159A, published as JP2024012567A (Pending)
- 2023-11-10 US US18/506,734, published as US12094573B2 (Active)
Non-Patent Citations (2)
Title |
---|
Uchida, Junji, et al. "Diagnostic accuracy of noninvasive genotyping of EGFR in lung cancer patients by deep sequencing of plasma cell-free DNA." Clinical chemistry 61.9 (2015): 1191-1196. (Year: 2015) * |
Zhu, Guanshan, et al. "Highly sensitive droplet digital PCR method for detection of EGFR-activating mutations in plasma cell–free DNA from patients with advanced non–small cell lung cancer." The Journal of Molecular Diagnostics 17.3 (2015): 265-272. (Year: 2015) * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US12100482B2 (en) | | Methods for multi-resolution analysis of cell-free nucleic acids |
US9850523B1 (en) | | Methods for multi-resolution analysis of cell-free nucleic acids |
US12054774B2 (en) | | Methods and systems for detecting genetic variants |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | FEPP | Fee payment procedure | Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
| | AS | Assignment | Owner name: GUARDANT HEALTH, INC., CALIFORNIA. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: CHUDOVA, DARYA; ELTOUKHY, HELMY; MORTIMER, STEFANIE ANN WARD; AND OTHERS; SIGNING DATES FROM 20170913 TO 20170922; REEL/FRAME: 066455/0379 |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: FINAL REJECTION MAILED |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED |
| | STCF | Information on status: patent grant | Free format text: PATENTED CASE |