WO2021168146A1 - Methods and systems for a liquid biopsy assay - Google Patents
Methods and systems for a liquid biopsy assay
- Publication number
- WO2021168146A1 PCT/US2021/018622
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- bin
- sequence
- segment
- level
- measure
- Prior art date
- 238000001574 biopsy Methods 0.000 title claims description 364
- 239000007788 liquid Substances 0.000 title claims description 324
- 238000004166 bioassay Methods 0.000 title claims description 80
- 206010028980 Neoplasm Diseases 0.000 claims abstract description 452
- 230000000392 somatic Effects 0.000 claims abstract description 133
- 239000000523 sample Substances 0.000 claims description 585
- 201000011510 cancer Diseases 0.000 claims description 388
- 230000000875 corresponding Effects 0.000 claims description 364
- 150000007523 nucleic acids Chemical class 0.000 claims description 187
- 238000006243 chemical reaction Methods 0.000 claims description 175
- 108020004707 nucleic acids Proteins 0.000 claims description 161
- 210000001519 tissues Anatomy 0.000 claims description 154
- 229920003013 deoxyribonucleic acid Polymers 0.000 claims description 140
- 239000006185 dispersion Substances 0.000 claims description 126
- 238000000034 method Methods 0.000 claims description 120
- 238000004458 analytical method Methods 0.000 claims description 109
- 241000282414 Homo sapiens Species 0.000 claims description 98
- 238000002560 therapeutic procedure Methods 0.000 claims description 92
- 230000035772 mutation Effects 0.000 claims description 87
- 238000004422 calculation algorithm Methods 0.000 claims description 86
- 210000004602 germ cell Anatomy 0.000 claims description 79
- 241000894007 species Species 0.000 claims description 78
- 210000000349 Chromosomes Anatomy 0.000 claims description 74
- 239000012472 biological sample Substances 0.000 claims description 72
- 210000004369 Blood Anatomy 0.000 claims description 64
- 230000003321 amplification Effects 0.000 claims description 64
- 239000008280 blood Substances 0.000 claims description 64
- 238000001514 detection method Methods 0.000 claims description 62
- 210000004027 cells Anatomy 0.000 claims description 56
- 238000003199 nucleic acid amplification method Methods 0.000 claims description 55
- 238000009826 distribution Methods 0.000 claims description 44
- 206010006187 Breast cancer Diseases 0.000 claims description 41
- 230000035945 sensitivity Effects 0.000 claims description 40
- 102000017256 epidermal growth factor-activated receptor activity proteins Human genes 0.000 claims description 36
- 108040009258 epidermal growth factor-activated receptor activity proteins Proteins 0.000 claims description 36
- 230000011218 segmentation Effects 0.000 claims description 31
- 101700025368 ERBB2 Proteins 0.000 claims description 29
- 230000015654 memory Effects 0.000 claims description 29
- 208000002154 Non-Small-Cell Lung Carcinoma Diseases 0.000 claims description 26
- 206010060862 Prostate cancer Diseases 0.000 claims description 21
- 238000003860 storage Methods 0.000 claims description 18
- 230000001225 therapeutic Effects 0.000 claims description 18
- 102000036638 BRCA1 Human genes 0.000 claims description 16
- 108010042977 BRCA1 Protein Proteins 0.000 claims description 16
- 238000010304 firing Methods 0.000 claims description 16
- 230000001264 neutralization Effects 0.000 claims description 15
- 229920000665 Exon Polymers 0.000 claims description 14
- 108010000750 BRCA2 Protein Proteins 0.000 claims description 13
- 102000002280 BRCA2 Protein Human genes 0.000 claims description 13
- 102100007290 CD274 Human genes 0.000 claims description 13
- 101710012053 CD274 Proteins 0.000 claims description 13
- 102100015262 MYC Human genes 0.000 claims description 13
- 101700075357 MYC Proteins 0.000 claims description 13
- 210000002381 Plasma Anatomy 0.000 claims description 13
- 230000000694 effects Effects 0.000 claims description 13
- 102100016490 CCNE1 Human genes 0.000 claims description 11
- 101700061678 CCNE1 Proteins 0.000 claims description 11
- 230000001537 neural Effects 0.000 claims description 10
- 238000007476 Maximum Likelihood Methods 0.000 claims description 9
- 210000002966 Serum Anatomy 0.000 claims description 8
- 208000005017 Glioblastoma Diseases 0.000 claims description 7
- 206010018338 Glioma Diseases 0.000 claims description 7
- 201000011231 colorectal cancer Diseases 0.000 claims description 7
- 206010073251 Clear cell renal cell carcinoma Diseases 0.000 claims description 6
- 206010014733 Endometrial cancer Diseases 0.000 claims description 6
- 206010025650 Malignant melanoma Diseases 0.000 claims description 6
- 206010033128 Ovarian cancer Diseases 0.000 claims description 6
- 201000005216 brain cancer Diseases 0.000 claims description 6
- 201000001441 melanoma Diseases 0.000 claims description 6
- 206010005003 Bladder cancer Diseases 0.000 claims description 5
- 206010039491 Sarcoma Diseases 0.000 claims description 5
- 201000005112 urinary bladder cancer Diseases 0.000 claims description 5
- 206010004146 Basal cell carcinoma Diseases 0.000 claims description 4
- 206010004593 Bile duct cancer Diseases 0.000 claims description 4
- 206010008342 Cervix carcinoma Diseases 0.000 claims description 4
- 206010017758 Gastric cancer Diseases 0.000 claims description 4
- 206010051066 Gastrointestinal stromal tumour Diseases 0.000 claims description 4
- 108009000344 Head and Neck Squamous Cell Carcinoma Proteins 0.000 claims description 4
- 208000000172 Medulloblastoma Diseases 0.000 claims description 4
- 206010027191 Meningioma Diseases 0.000 claims description 4
- 206010027406 Mesothelioma Diseases 0.000 claims description 4
- 206010029260 Neuroblastoma Diseases 0.000 claims description 4
- 206010030155 Oesophageal carcinoma Diseases 0.000 claims description 4
- 206010031096 Oropharyngeal cancer Diseases 0.000 claims description 4
- 208000008443 Pancreatic Carcinoma Diseases 0.000 claims description 4
- 206010038389 Renal cancer Diseases 0.000 claims description 4
- 208000000587 Small Cell Lung Carcinoma Diseases 0.000 claims description 4
- 206010041067 Small cell lung cancer Diseases 0.000 claims description 4
- 208000000102 Squamous Cell Carcinoma of Head and Neck Diseases 0.000 claims description 4
- 206010057644 Testis cancer Diseases 0.000 claims description 4
- 208000008732 Thymoma Diseases 0.000 claims description 4
- 201000005188 adrenal gland cancer Diseases 0.000 claims description 4
- 201000010881 cervical cancer Diseases 0.000 claims description 4
- 201000010240 chromophobe renal cell carcinoma Diseases 0.000 claims description 4
- 201000011523 endocrine gland cancer Diseases 0.000 claims description 4
- 201000004101 esophageal cancer Diseases 0.000 claims description 4
- 201000011243 gastrointestinal stromal tumor Diseases 0.000 claims description 4
- 201000010536 head and neck cancer Diseases 0.000 claims description 4
- 201000000459 head and neck squamous cell carcinoma Diseases 0.000 claims description 4
- 201000010982 kidney cancer Diseases 0.000 claims description 4
- 201000007270 liver cancer Diseases 0.000 claims description 4
- 201000006958 oropharynx cancer Diseases 0.000 claims description 4
- 201000002528 pancreatic cancer Diseases 0.000 claims description 4
- 239000011886 peripheral blood Substances 0.000 claims description 4
- 201000000849 skin cancer Diseases 0.000 claims description 4
- 201000011549 stomach cancer Diseases 0.000 claims description 4
- 201000003120 testicular cancer Diseases 0.000 claims description 4
- 201000002510 thyroid cancer Diseases 0.000 claims description 4
- 201000005969 uveal melanoma Diseases 0.000 claims description 4
- 101710042656 BQ2027_MB1231C Proteins 0.000 claims description 3
- 206010005949 Bone cancer Diseases 0.000 claims description 3
- KTEIFNKAUNYNJU-GFCCVEGCSA-N Crizotinib Chemical compound O([C@H](C)C=1C(=C(F)C=CC=1Cl)Cl)C(C(=NC=1)N)=CC=1C(=C1)C=NN1C1CCNCC1 KTEIFNKAUNYNJU-GFCCVEGCSA-N 0.000 claims description 3
- 239000002136 L01XE07 - Lapatinib Substances 0.000 claims description 3
- 239000002146 L01XE16 - Crizotinib Substances 0.000 claims description 3
- BCFGMOOMADDAQU-UHFFFAOYSA-N Lapatinib Chemical compound O1C(CNCCS(=O)(=O)C)=CC=C1C1=CC=C(N=CN=C2NC=3C=C(Cl)C(OCC=4C=C(F)C=CC=4)=CC=3)C2=C1 BCFGMOOMADDAQU-UHFFFAOYSA-N 0.000 claims description 3
- 210000002751 Lymph Anatomy 0.000 claims description 3
- 108010010691 Trastuzumab Proteins 0.000 claims description 3
- 201000009047 chordoma Diseases 0.000 claims description 3
- 229960005061 crizotinib Drugs 0.000 claims description 3
- 201000010175 gallbladder cancer Diseases 0.000 claims description 3
- 201000007492 gastroesophageal junction adenocarcinoma Diseases 0.000 claims description 3
- 229960004891 lapatinib Drugs 0.000 claims description 3
- 201000003709 ovarian serous carcinoma Diseases 0.000 claims description 3
- 201000010279 papillary renal cell carcinoma Diseases 0.000 claims description 3
- 201000000582 retinoblastoma Diseases 0.000 claims description 3
- 238000005096 rolling process Methods 0.000 claims description 3
- 238000007671 third-generation sequencing Methods 0.000 claims description 3
- 229960000575 trastuzumab Drugs 0.000 claims description 3
- 201000003701 uterine corpus endometrial carcinoma Diseases 0.000 claims description 3
- 201000010370 uterine corpus serous adenocarcinoma Diseases 0.000 claims description 3
- 208000006265 Renal Cell Carcinoma Diseases 0.000 claims description 2
- 102000027760 ERBB2 Human genes 0.000 claims 3
- 241000854491 Delta Species 0.000 claims 2
- 239000002773 nucleotide Substances 0.000 description 107
- 125000003729 nucleotide group Chemical group 0.000 description 96
- 230000004075 alteration Effects 0.000 description 79
- 230000002068 genetic Effects 0.000 description 74
- 230000002759 chromosomal Effects 0.000 description 69
- 238000010200 validation analysis Methods 0.000 description 65
- 238000001914 filtration Methods 0.000 description 58
- 238000005516 engineering process Methods 0.000 description 50
- 230000001976 improved Effects 0.000 description 48
- 150000001413 amino acids Chemical class 0.000 description 47
- 229920002393 Microsatellite Polymers 0.000 description 43
- 239000000203 mixture Substances 0.000 description 39
- 239000007787 solid Substances 0.000 description 35
- 238000007481 next generation sequencing Methods 0.000 description 32
- 238000000605 extraction Methods 0.000 description 30
- 238000003752 polymerase chain reaction Methods 0.000 description 28
- 229920001850 Nucleic acid sequence Polymers 0.000 description 27
- 102100016662 ERBB2 Human genes 0.000 description 26
- 201000010099 disease Diseases 0.000 description 26
- 230000000869 mutational Effects 0.000 description 25
- 108009000071 Non-small cell lung cancer Proteins 0.000 description 24
- 230000004927 fusion Effects 0.000 description 24
- 238000010606 normalization Methods 0.000 description 24
- 102000004169 proteins and genes Human genes 0.000 description 23
- 108090000623 proteins and genes Proteins 0.000 description 23
- 238000002626 targeted therapy Methods 0.000 description 22
- -1 DNA and/or RNA) Chemical class 0.000 description 21
- 230000001717 pathogenic Effects 0.000 description 21
- 238000003786 synthesis reaction Methods 0.000 description 21
- 102100009279 KRAS Human genes 0.000 description 18
- 101710033922 KRAS Proteins 0.000 description 18
- 102100006473 MAP2K1 Human genes 0.000 description 18
- 102100019471 PIK3CA Human genes 0.000 description 18
- 101710027440 PIK3CA Proteins 0.000 description 18
- 108010068342 MAP Kinase Kinase 1 Proteins 0.000 description 17
- 229920000160 (ribonucleotides)n+m Polymers 0.000 description 16
- 238000011109 contamination Methods 0.000 description 16
- 238000011156 evaluation Methods 0.000 description 16
- 238000007781 pre-processing Methods 0.000 description 16
- 230000004044 response Effects 0.000 description 16
- 102100016102 NTRK1 Human genes 0.000 description 15
- 101700043017 NTRK1 Proteins 0.000 description 15
- 238000004364 calculation method Methods 0.000 description 15
- 238000006467 substitution reaction Methods 0.000 description 14
- 239000000090 biomarker Substances 0.000 description 13
- 238000006481 deamination reaction Methods 0.000 description 13
- 231100000590 oncogenic Toxicity 0.000 description 13
- 230000002246 oncogenic Effects 0.000 description 13
- 244000052769 pathogens Species 0.000 description 13
- 230000002085 persistent Effects 0.000 description 13
- 108020004999 Messenger RNA Proteins 0.000 description 12
- ISAKRJDGNUQOIC-UHFFFAOYSA-N Uracil Natural products O=C1C=CNC(=O)N1 ISAKRJDGNUQOIC-UHFFFAOYSA-N 0.000 description 12
- 238000007796 conventional method Methods 0.000 description 12
- 229920002106 messenger RNA Polymers 0.000 description 12
- 230000011987 methylation Effects 0.000 description 12
- 238000007069 methylation reaction Methods 0.000 description 12
- 210000002220 Organoids Anatomy 0.000 description 11
- 150000001875 compounds Chemical class 0.000 description 11
- 239000003814 drug Substances 0.000 description 11
- 238000002744 homologous recombination Methods 0.000 description 11
- 238000002360 preparation method Methods 0.000 description 11
- 238000003908 quality control method Methods 0.000 description 11
- 238000007482 whole exome sequencing Methods 0.000 description 11
- 102100007495 AR Human genes 0.000 description 10
- 108010080146 androgen receptors Proteins 0.000 description 10
- 230000001419 dependent Effects 0.000 description 10
- 206010061818 Disease progression Diseases 0.000 description 9
- 229920002459 Intron Polymers 0.000 description 9
- 206010058467 Lung neoplasm malignant Diseases 0.000 description 9
- 238000003745 diagnosis Methods 0.000 description 9
- 239000012530 fluid Substances 0.000 description 9
- 230000012010 growth Effects 0.000 description 9
- 238000003780 insertion Methods 0.000 description 9
- 201000005202 lung cancer Diseases 0.000 description 9
- 230000015572 biosynthetic process Effects 0.000 description 8
- 238000005070 sampling Methods 0.000 description 8
- 238000002864 sequence alignment Methods 0.000 description 8
- 210000001124 Body Fluids Anatomy 0.000 description 7
- 229920001405 Coding region Polymers 0.000 description 7
- 206010027476 Metastasis Diseases 0.000 description 7
- 206010061289 Metastatic neoplasm Diseases 0.000 description 7
- 239000003153 chemical reaction reagent Substances 0.000 description 7
- 238000007405 data analysis Methods 0.000 description 7
- 238000003384 imaging method Methods 0.000 description 7
- 230000001965 increased Effects 0.000 description 7
- 239000000463 material Substances 0.000 description 7
- 230000001394 metastastic Effects 0.000 description 7
- 238000011160 research Methods 0.000 description 7
- 230000002194 synthesizing Effects 0.000 description 7
- 231100000277 DNA damage Toxicity 0.000 description 6
- 210000000265 Leukocytes Anatomy 0.000 description 6
- 102100001119 NRAS Human genes 0.000 description 6
- 101710033916 NRAS Proteins 0.000 description 6
- 210000003296 Saliva Anatomy 0.000 description 6
- 230000002159 abnormal effect Effects 0.000 description 6
- 238000004891 communication Methods 0.000 description 6
- 238000011161 development Methods 0.000 description 6
- 230000018109 developmental process Effects 0.000 description 6
- 201000009910 diseases by infectious agent Diseases 0.000 description 6
- 230000014509 gene expression Effects 0.000 description 6
- 230000003394 haemopoietic Effects 0.000 description 6
- 238000010801 machine learning Methods 0.000 description 6
- 230000004048 modification Effects 0.000 description 6
- 238000006011 modification reaction Methods 0.000 description 6
- 230000036961 partial Effects 0.000 description 6
- 230000037361 pathway Effects 0.000 description 6
- 230000001131 transforming Effects 0.000 description 6
- 238000004450 types of analysis Methods 0.000 description 6
- 102100011141 ALK Human genes 0.000 description 5
- 101710033641 ALK Proteins 0.000 description 5
- 108060000721 ATR Proteins 0.000 description 5
- 102100004328 BRAF Human genes 0.000 description 5
- 101700004551 BRAF Proteins 0.000 description 5
- 238000001712 DNA sequencing Methods 0.000 description 5
- 206010061819 Disease recurrence Diseases 0.000 description 5
- 102100016692 ESR1 Human genes 0.000 description 5
- 102200006648 HRAS Q61K Human genes 0.000 description 5
- 102200007373 KRAS Q61H Human genes 0.000 description 5
- 102200006520 KRAS Q61L Human genes 0.000 description 5
- 102200006525 KRAS Q61R Human genes 0.000 description 5
- 238000003657 Likelihood-ratio test Methods 0.000 description 5
- 102100013322 MTOR Human genes 0.000 description 5
- 101700036611 MTOR Proteins 0.000 description 5
- 210000004940 Nucleus Anatomy 0.000 description 5
- 102200085788 PIK3CA H1047L Human genes 0.000 description 5
- 102200085789 PIK3CA H1047R Human genes 0.000 description 5
- 102200085790 PIK3CA H1047Y Human genes 0.000 description 5
- 102100019330 STK11 Human genes 0.000 description 5
- 101700065463 STK11 Proteins 0.000 description 5
- 125000003275 alpha amino acid group Chemical group 0.000 description 5
- 239000011324 bead Substances 0.000 description 5
- 230000001413 cellular Effects 0.000 description 5
- 230000002596 correlated Effects 0.000 description 5
- 230000036541 health Effects 0.000 description 5
- 210000002865 immune cell Anatomy 0.000 description 5
- 238000003364 immunohistochemistry Methods 0.000 description 5
- 230000000670 limiting Effects 0.000 description 5
- 230000033607 mismatch repair Effects 0.000 description 5
- 238000000513 principal component analysis Methods 0.000 description 5
- 102220117341 rs11554290 Human genes 0.000 description 5
- 102220197778 rs121913254 Human genes 0.000 description 5
- 102220225553 rs766964168 Human genes 0.000 description 5
- 238000001356 surgical procedure Methods 0.000 description 5
- OPTASPLRGRRNAP-UHFFFAOYSA-N Cytosine Chemical compound NC=1C=CNC(=O)N=1 OPTASPLRGRRNAP-UHFFFAOYSA-N 0.000 description 4
- 108020004388 MicroRNAs Proteins 0.000 description 4
- 229920000970 Repeated sequence (DNA) Polymers 0.000 description 4
- 210000004243 Sweat Anatomy 0.000 description 4
- 102100019730 TP53 Human genes 0.000 description 4
- 102100008209 TSC1 Human genes 0.000 description 4
- 101700061326 TSC1 Proteins 0.000 description 4
- IQFYYKKMVGJFEH-XLPZGREQSA-N Thymidine Natural products O=C1NC(=O)C(C)=CN1[C@@H]1O[C@H](CO)[C@@H](O)C1 IQFYYKKMVGJFEH-XLPZGREQSA-N 0.000 description 4
- RWQNBRDOKXIBIV-UHFFFAOYSA-N Thymine Chemical compound CC1=CNC(=O)NC1=O RWQNBRDOKXIBIV-UHFFFAOYSA-N 0.000 description 4
- 108010078814 Tumor Suppressor Protein p53 Proteins 0.000 description 4
- 210000002700 Urine Anatomy 0.000 description 4
- 238000010276 construction Methods 0.000 description 4
- 238000005259 measurement Methods 0.000 description 4
- 229920001239 microRNA Polymers 0.000 description 4
- 239000002679 microRNA Substances 0.000 description 4
- 238000011176 pooling Methods 0.000 description 4
- 230000000306 recurrent Effects 0.000 description 4
- VYPSYNLAJGMNEJ-UHFFFAOYSA-N silicium dioxide Chemical compound O=[Si]=O VYPSYNLAJGMNEJ-UHFFFAOYSA-N 0.000 description 4
- 238000007619 statistical method Methods 0.000 description 4
- 239000000126 substance Substances 0.000 description 4
- GUAHPAJOXVYFON-ZETCQYMHSA-N (8S)-8-azaniumyl-7-oxononanoate Chemical compound C[C@H](N)C(=O)CCCCCC(O)=O GUAHPAJOXVYFON-ZETCQYMHSA-N 0.000 description 3
- 102100007788 APC Human genes 0.000 description 3
- 101700010938 APC Proteins 0.000 description 3
- 206010069754 Acquired gene mutation Diseases 0.000 description 3
- 206010000880 Acute myeloid leukaemia Diseases 0.000 description 3
- 210000003567 Ascitic Fluid Anatomy 0.000 description 3
- 208000005623 Carcinogenesis Diseases 0.000 description 3
- 210000001175 Cerebrospinal Fluid Anatomy 0.000 description 3
- 229940104302 Cytosine Drugs 0.000 description 3
- 238000007399 DNA isolation Methods 0.000 description 3
- 230000033616 DNA repair Effects 0.000 description 3
- 229920002024 GDNA Polymers 0.000 description 3
- 102000033185 GNAS Human genes 0.000 description 3
- 101700086896 GNAS Proteins 0.000 description 3
- 101710041546 Galphas Proteins 0.000 description 3
- 101700030371 IDH2 Proteins 0.000 description 3
- 102100002772 IDH2 Human genes 0.000 description 3
- 108020004391 Introns Proteins 0.000 description 3
- 102100019516 JAK2 Human genes 0.000 description 3
- 101700016050 JAK2 Proteins 0.000 description 3
- 210000004910 Pleural fluid Anatomy 0.000 description 3
- 102000012338 Poly(ADP-ribose) Polymerases Human genes 0.000 description 3
- 108010061844 Poly(ADP-ribose) Polymerases Proteins 0.000 description 3
- 210000001138 Tears Anatomy 0.000 description 3
- 231100000005 chromosome aberration Toxicity 0.000 description 3
- 238000002648 combination therapy Methods 0.000 description 3
- 230000000295 complement Effects 0.000 description 3
- 238000010192 crystallographic characterization Methods 0.000 description 3
- 238000011143 downstream manufacturing Methods 0.000 description 3
- 229940079593 drugs Drugs 0.000 description 3
- 230000001973 epigenetic Effects 0.000 description 3
- 230000011132 hemopoiesis Effects 0.000 description 3
- 230000002401 inhibitory effect Effects 0.000 description 3
- 239000011159 matrix material Substances 0.000 description 3
- 210000000056 organs Anatomy 0.000 description 3
- 230000002974 pharmacogenomic Effects 0.000 description 3
- 238000004393 prognosis Methods 0.000 description 3
- 238000011002 quantification Methods 0.000 description 3
- 230000002829 reduced Effects 0.000 description 3
- 230000000391 smoking Effects 0.000 description 3
- 230000001629 suppression Effects 0.000 description 3
- 230000004083 survival Effects 0.000 description 3
- 230000002123 temporal effect Effects 0.000 description 3
- 238000011144 upstream manufacturing Methods 0.000 description 3
- 102100001248 AKT1 Human genes 0.000 description 2
- 101700006234 AKT1 Proteins 0.000 description 2
- 101700006583 AKT2 Proteins 0.000 description 2
- 102100007877 APOE Human genes 0.000 description 2
- 101700025839 APOE Proteins 0.000 description 2
- 102100011069 ARAF Human genes 0.000 description 2
- 101700086422 ARAF Proteins 0.000 description 2
- 102100000648 ATM Human genes 0.000 description 2
- 108060006202 ATM Proteins 0.000 description 2
- 102100002848 ATRX Human genes 0.000 description 2
- 241000251468 Actinopterygii Species 0.000 description 2
- 239000004114 Ammonium polyphosphate Substances 0.000 description 2
- 206010059512 Apoptosis Diseases 0.000 description 2
- 238000010207 Bayesian analysis Methods 0.000 description 2
- 241000283690 Bos taurus Species 0.000 description 2
- 102100019529 CCND1 Human genes 0.000 description 2
- 102100019530 CCND2 Human genes 0.000 description 2
- 101700059002 CCND2 Proteins 0.000 description 2
- 101700016900 CDH1 Proteins 0.000 description 2
- 102100019398 CDK4 Human genes 0.000 description 2
- 101700008359 CDK4 Proteins 0.000 description 2
- 102100006130 CDK6 Human genes 0.000 description 2
- 102000033243 CDKN2A Human genes 0.000 description 2
- 102100002043 CTNNB1 Human genes 0.000 description 2
- 101710005974 CTNNB1 Proteins 0.000 description 2
- 210000003467 Cheek Anatomy 0.000 description 2
- 208000005443 Circulating Neoplastic Cells Diseases 0.000 description 2
- 206010009944 Colon cancer Diseases 0.000 description 2
- 229920000453 Consensus sequence Polymers 0.000 description 2
- 108010058546 Cyclin D1 Proteins 0.000 description 2
- 108010025468 Cyclin-Dependent Kinase 6 Proteins 0.000 description 2
- 108010009392 Cyclin-Dependent Kinase Inhibitor p16 Proteins 0.000 description 2
- 101700011568 DIB1 Proteins 0.000 description 2
- 108009000206 DNA Mismatch Repair Proteins 0.000 description 2
- 108009000097 DNA Replication Proteins 0.000 description 2
- 230000004543 DNA replication Effects 0.000 description 2
- 206010013710 Drug interaction Diseases 0.000 description 2
- 229940121647 EGFR inhibitors Drugs 0.000 description 2
- 102100016041 EZH2 Human genes 0.000 description 2
- 101700041849 EZH2 Proteins 0.000 description 2
- 241000283073 Equus caballus Species 0.000 description 2
- 108010062201 F-Box-WD Repeat-Containing Protein 7 Proteins 0.000 description 2
- 102100020077 FBXW7 Human genes 0.000 description 2
- 102100017996 FGFR1 Human genes 0.000 description 2
- 102100018000 FGFR2 Human genes 0.000 description 2
- 102000027766 FGFR3 Human genes 0.000 description 2
- 102100004573 FLT3 Human genes 0.000 description 2
- 101710009074 FLT3 Proteins 0.000 description 2
- 102100018976 FZR1 Human genes 0.000 description 2
- 240000008168 Ficus benjamina Species 0.000 description 2
- 102100011541 GID4 Human genes 0.000 description 2
- 101700083349 GID4 Proteins 0.000 description 2
- 102100008019 GNA11 Human genes 0.000 description 2
- 101700048596 GNA11 Proteins 0.000 description 2
- 102100014305 GNAQ Human genes 0.000 description 2
- 101700035643 GNAQ Proteins 0.000 description 2
- 102100016995 HNF1A Human genes 0.000 description 2
- 101700018864 HNF1A Proteins 0.000 description 2
- 102100009283 HRAS Human genes 0.000 description 2
- 101710033925 HRAS Proteins 0.000 description 2
- 102200006663 HRAS Q22K Human genes 0.000 description 2
- 102100004121 IDH1 Human genes 0.000 description 2
- 101700024037 IDH1 Proteins 0.000 description 2
- 101700066748 IDH3B Proteins 0.000 description 2
- 102100019518 JAK3 Human genes 0.000 description 2
- 101700007593 JAK3 Proteins 0.000 description 2
- 102100004453 KMT2D Human genes 0.000 description 2
- 101700014096 KMT2D Proteins 0.000 description 2
- 102200006538 KRAS G12C Human genes 0.000 description 2
- 102200006541 KRAS G12S Human genes 0.000 description 2
- 108010068353 MAP Kinase Kinase 2 Proteins 0.000 description 2
- 102100015877 MAP2K2 Human genes 0.000 description 2
- 102100016823 MAPK1 Human genes 0.000 description 2
- 101700083887 MAPK1 Proteins 0.000 description 2
- 102100014416 MLH1 Human genes 0.000 description 2
- 101700072814 MPK12 Proteins 0.000 description 2
- 102100000250 MPL Human genes 0.000 description 2
- 102100018883 MYCL Human genes 0.000 description 2
- 101700048958 MYCL Proteins 0.000 description 2
- 210000000214 Mouth Anatomy 0.000 description 2
- 241000699666 Mus <mouse, genus> Species 0.000 description 2
- 108010026664 MutL Protein Homolog 1 Proteins 0.000 description 2
- 102100017234 NFE2L2 Human genes 0.000 description 2
- 101710031938 NFE2L2 Proteins 0.000 description 2
- 102100012131 NOTCH1 Human genes 0.000 description 2
- 101710036042 NOTCH1 Proteins 0.000 description 2
- 102100020079 NPM1 Human genes 0.000 description 2
- 101710026364 NPM1 Proteins 0.000 description 2
- 102100020121 NSD3 Human genes 0.000 description 2
- 101700012744 NSD3 Proteins 0.000 description 2
- 102100016105 NTRK3 Human genes 0.000 description 2
- 102000007530 Neurofibromin 1 Human genes 0.000 description 2
- 108010085793 Neurofibromin 1 Proteins 0.000 description 2
- 102100019764 PDCD1 Human genes 0.000 description 2
- 102100007289 PDCD1LG2 Human genes 0.000 description 2
- 101710011976 PDCD1LG2 Proteins 0.000 description 2
- 102100004940 PDGFRA Human genes 0.000 description 2
- 101710018349 PDGFRA Proteins 0.000 description 2
- 102100008799 PTEN Human genes 0.000 description 2
- 108010011536 PTEN Phosphohydrolase Proteins 0.000 description 2
- 102100017818 PTPN11 Human genes 0.000 description 2
- 101710018405 PTPN11 Proteins 0.000 description 2
- 210000004912 Pericardial fluid Anatomy 0.000 description 2
- 229920000776 Poly(Adenosine diphosphate-ribose) polymerase Polymers 0.000 description 2
- 108010029485 Protein Isoforms Proteins 0.000 description 2
- 102000001708 Protein Isoforms Human genes 0.000 description 2
- 101710037934 QRSL1 Proteins 0.000 description 2
- 102100006051 RET Human genes 0.000 description 2
- 238000002123 RNA extraction Methods 0.000 description 2
- 102100002050 ROS1 Human genes 0.000 description 2
- 101710027587 ROS1 Proteins 0.000 description 2
- 101700054115 ROS1A Proteins 0.000 description 2
- 102100017680 SMAD4 Human genes 0.000 description 2
- 101700062085 SMAD4 Proteins 0.000 description 2
- 108060007796 SPATA2 Proteins 0.000 description 2
- 241000282898 Sus scrofa Species 0.000 description 2
- 108010081291 Type 1 Fibroblast Growth Factor Receptor Proteins 0.000 description 2
- 108010081268 Type 2 Fibroblast Growth Factor Receptor Proteins 0.000 description 2
- 108010081267 Type 3 Fibroblast Growth Factor Receptor Proteins 0.000 description 2
- 230000001594 aberrant Effects 0.000 description 2
- 239000002250 absorbent Substances 0.000 description 2
- 230000002745 absorbent Effects 0.000 description 2
- 230000002411 adverse Effects 0.000 description 2
- 230000006907 apoptotic process Effects 0.000 description 2
- 101700017456 asd-1 Proteins 0.000 description 2
- LSNNMFCWUKXFEE-UHFFFAOYSA-M bisulfite Chemical compound OS([O-])=O LSNNMFCWUKXFEE-UHFFFAOYSA-M 0.000 description 2
- 230000033077 cellular process Effects 0.000 description 2
- 108091006028 chimera Proteins 0.000 description 2
- 238000007635 classification algorithm Methods 0.000 description 2
- 238000004140 cleaning Methods 0.000 description 2
- 230000003247 decreasing Effects 0.000 description 2
- 108091002536 depatuxizumab mafodotin Proteins 0.000 description 2
- 229950008925 depatuxizumab mafodotin Drugs 0.000 description 2
- 238000002059 diagnostic imaging Methods 0.000 description 2
- 238000010586 diagram Methods 0.000 description 2
- 239000003623 enhancer Substances 0.000 description 2
- 230000002708 enhancing Effects 0.000 description 2
- 230000029578 entry into host Effects 0.000 description 2
- 230000002255 enzymatic Effects 0.000 description 2
- 239000000284 extract Substances 0.000 description 2
- 230000002550 fecal Effects 0.000 description 2
- 239000007850 fluorescent dye Substances 0.000 description 2
- 238000002509 fluorescent in situ hybridization Methods 0.000 description 2
- WSFSSNUMVMOOMR-UHFFFAOYSA-N formaldehyde Chemical compound O=C WSFSSNUMVMOOMR-UHFFFAOYSA-N 0.000 description 2
- 238000006062 fragmentation reaction Methods 0.000 description 2
- 108020001507 fusion proteins Proteins 0.000 description 2
- 102000037240 fusion proteins Human genes 0.000 description 2
- 210000002980 germ line cell Anatomy 0.000 description 2
- 150000003278 haem Chemical class 0.000 description 2
- 230000002489 hematologic Effects 0.000 description 2
- 201000005787 hematologic cancer Diseases 0.000 description 2
- 229920001519 homopolymer Polymers 0.000 description 2
- 238000009169 immunotherapy Methods 0.000 description 2
- 238000001764 infiltration Methods 0.000 description 2
- 239000003112 inhibitor Substances 0.000 description 2
- 230000000977 initiatory Effects 0.000 description 2
- 238000009114 investigational therapy Methods 0.000 description 2
- 150000002500 ions Chemical class 0.000 description 2
- 238000002955 isolation Methods 0.000 description 2
- 238000002865 local sequence alignment Methods 0.000 description 2
- 238000007477 logistic regression Methods 0.000 description 2
- 201000005244 lung non-small cell carcinoma Diseases 0.000 description 2
- 230000003211 malignant Effects 0.000 description 2
- 230000000873 masking Effects 0.000 description 2
- 230000003287 optical Effects 0.000 description 2
- 230000003647 oxidation Effects 0.000 description 2
- 238000007254 oxidation reaction Methods 0.000 description 2
- 238000003068 pathway analysis Methods 0.000 description 2
- 229920000553 poly(phenylenevinylene) Polymers 0.000 description 2
- 238000000575 proteomic Methods 0.000 description 2
- 238000001959 radiotherapy Methods 0.000 description 2
- 238000007637 random forest analysis Methods 0.000 description 2
- 230000001105 regulatory Effects 0.000 description 2
- 230000003252 repetitive Effects 0.000 description 2
- 239000000377 silicon dioxide Substances 0.000 description 2
- 210000001082 somatic cell Anatomy 0.000 description 2
- 238000010186 staining Methods 0.000 description 2
- 230000003068 static Effects 0.000 description 2
- 238000002700 transcriptomic Methods 0.000 description 2
- 238000000844 transformation Methods 0.000 description 2
- 238000009966 trimming Methods 0.000 description 2
- 108010064892 trkC Receptor Proteins 0.000 description 2
- MHKBMNACOMRIAW-UHFFFAOYSA-N 2,3-dinitrophenol Chemical compound OC1=CC=CC([N+]([O-])=O)=C1[N+]([O-])=O MHKBMNACOMRIAW-UHFFFAOYSA-N 0.000 description 1
- LRSASMSXMSNRBT-UHFFFAOYSA-N 5-Methylcytosine Chemical compound CC1=CNC(=O)N=C1N LRSASMSXMSNRBT-UHFFFAOYSA-N 0.000 description 1
- 101710006658 AAEL003512 Proteins 0.000 description 1
- 102100019002 ABL1 Human genes 0.000 description 1
- 101700000782 ABL1 Proteins 0.000 description 1
- 101710032514 ACTI Proteins 0.000 description 1
- 102100000003 ACVR1B Human genes 0.000 description 1
- 101710026724 ACVR1B Proteins 0.000 description 1
- 101700001100 AHL16 Proteins 0.000 description 1
- PNEYBMLMFCGWSK-UHFFFAOYSA-N Al2O3 Inorganic materials [O-2].[O-2].[O-2].[Al+3].[Al+3] PNEYBMLMFCGWSK-UHFFFAOYSA-N 0.000 description 1
- 102100001250 AKT2 Human genes 0.000 description 1
- 102100006725 AKT3 Human genes 0.000 description 1
- 101700004058 AKT3 Proteins 0.000 description 1
- 102200003102 ALK F1174L Human genes 0.000 description 1
- 102200003022 ALK I1171N Human genes 0.000 description 1
- 102100015488 ALOX12B Human genes 0.000 description 1
- 101710030791 ALOX12B Proteins 0.000 description 1
- 101700015800 AMER1 Proteins 0.000 description 1
- 102100017961 AMER1 Human genes 0.000 description 1
- 102100013036 ARFRP1 Human genes 0.000 description 1
- 101710036280 ARFRP1 Proteins 0.000 description 1
- 102100012672 ARTN Human genes 0.000 description 1
- 101700061329 ARTN Proteins 0.000 description 1
- 102100016224 ASXL1 Human genes 0.000 description 1
- 101700021058 ASXL1 Proteins 0.000 description 1
- 101700033894 ATRX Proteins 0.000 description 1
- 230000035533 AUC Effects 0.000 description 1
- 102100010553 AURKB Human genes 0.000 description 1
- 101700037792 AURKB Proteins 0.000 description 1
- 102100006784 AXIN1 Human genes 0.000 description 1
- 101700062862 AXIN1 Proteins 0.000 description 1
- 102100011565 AXL Human genes 0.000 description 1
- 101710039535 AXL Proteins 0.000 description 1
- 101700008384 AXL1 Proteins 0.000 description 1
- GZOSMCIZMLWJML-VJLLXTKPSA-N Abiraterone Chemical compound C([C@H]1[C@H]2[C@@H]([C@]3(CC[C@H](O)CC3=CC2)C)CC[C@@]11C)C=C1C1=CC=CN=C1 GZOSMCIZMLWJML-VJLLXTKPSA-N 0.000 description 1
- 229960000643 Adenine Drugs 0.000 description 1
- GFFGJBXGBJISGV-UHFFFAOYSA-N Adenine Natural products NC1=NC=NC2=C1N=CN2 GFFGJBXGBJISGV-UHFFFAOYSA-N 0.000 description 1
- 208000010507 Adenocarcinoma of Lung Diseases 0.000 description 1
- 229960001686 Afatinib Drugs 0.000 description 1
- 240000006108 Allium ampeloprasum Species 0.000 description 1
- 235000005254 Allium ampeloprasum Nutrition 0.000 description 1
- 206010001897 Alzheimer's disease Diseases 0.000 description 1
- 241000269328 Amphibia Species 0.000 description 1
- 208000000058 Anaplasia Diseases 0.000 description 1
- 235000002198 Annona diversifolia Nutrition 0.000 description 1
- 244000303258 Annona diversifolia Species 0.000 description 1
- 102000004000 Aurora Kinase A Human genes 0.000 description 1
- 108090000461 Aurora Kinase A Proteins 0.000 description 1
- 241000271566 Aves Species 0.000 description 1
- 208000003950 B-Cell Lymphoma Diseases 0.000 description 1
- 108060005926 BAP1 Proteins 0.000 description 1
- 102100003854 BARD1 Human genes 0.000 description 1
- 101700043219 BARD1 Proteins 0.000 description 1
- 102100015994 BCL11B Human genes 0.000 description 1
- 101710035195 BCL11B Proteins 0.000 description 1
- 102100013894 BCL2 Human genes 0.000 description 1
- 108060000885 BCL2 Proteins 0.000 description 1
- 101710032374 BCL2L1 Proteins 0.000 description 1
- 102100015655 BCL2L1 Human genes 0.000 description 1
- 102100015652 BCL2L2 Human genes 0.000 description 1
- 101710032376 BCL2L2-PABPN1 Proteins 0.000 description 1
- 102100011377 BCL6 Human genes 0.000 description 1
- 101700024247 BCL6 Proteins 0.000 description 1
- 102100004555 BCOR Human genes 0.000 description 1
- 108060000889 BCOR Proteins 0.000 description 1
- 102100015993 BCORL1 Human genes 0.000 description 1
- 101710026316 BCORL1 Proteins 0.000 description 1
- 102100015500 BRD4 Human genes 0.000 description 1
- 101700009767 BRD4 Proteins 0.000 description 1
- 102100011166 BRIP1 Human genes 0.000 description 1
- 101710030368 BRIP1 Proteins 0.000 description 1
- 102100015738 BTG1 Human genes 0.000 description 1
- 101700001484 BTG1 Proteins 0.000 description 1
- 102100015734 BTG2 Human genes 0.000 description 1
- 101700003039 BTG2 Proteins 0.000 description 1
- 102100009312 BTK Human genes 0.000 description 1
- 101700058566 BTK Proteins 0.000 description 1
- 101710019578 BZIP46 Proteins 0.000 description 1
- 210000001185 Bone Marrow Anatomy 0.000 description 1
- 210000004556 Brain Anatomy 0.000 description 1
- 210000000481 Breast Anatomy 0.000 description 1
- 206010055113 Breast cancer metastatic Diseases 0.000 description 1
- 102100004931 CALR Human genes 0.000 description 1
- 101700033040 CALR Proteins 0.000 description 1
- 101710009063 CARD11 Proteins 0.000 description 1
- 102100002586 CARD11 Human genes 0.000 description 1
- 102100008990 CASP8 Human genes 0.000 description 1
- 101700075287 CASP8 Proteins 0.000 description 1
- 102100008372 CBL Human genes 0.000 description 1
- 108050009659 CBL Proteins 0.000 description 1
- 108010014064 CCCTC-Binding Factor Proteins 0.000 description 1
- 102100016486 CCND3 Human genes 0.000 description 1
- 101700079292 CCND3 Proteins 0.000 description 1
- 102100000189 CD22 Human genes 0.000 description 1
- 101700020617 CD22 Proteins 0.000 description 1
- 101700017377 CD70 Proteins 0.000 description 1
- 102100005830 CD70 Human genes 0.000 description 1
- 102100019443 CD79A Human genes 0.000 description 1
- 101700037975 CD79A Proteins 0.000 description 1
- 102100019449 CD79B Human genes 0.000 description 1
- 101700045471 CD79B Proteins 0.000 description 1
- 102100019399 CDC73 Human genes 0.000 description 1
- 108060001261 CDC73 Proteins 0.000 description 1
- 102100016540 CDK12 Human genes 0.000 description 1
- 101700081487 CDK12 Proteins 0.000 description 1
- 102100003970 CDK8 Human genes 0.000 description 1
- 102100002974 CDKN1A Human genes 0.000 description 1
- 102100019348 CEBPA Human genes 0.000 description 1
- 101700058775 CEBPA Proteins 0.000 description 1
- 102100019702 CHEK1 Human genes 0.000 description 1
- 101710015564 CHEK1 Proteins 0.000 description 1
- 102100019698 CHEK2 Human genes 0.000 description 1
- 108060006647 CHEK2 Proteins 0.000 description 1
- 102100006706 CIC Human genes 0.000 description 1
- 101700065513 CIC Proteins 0.000 description 1
- 101710040418 COL9A2 Proteins 0.000 description 1
- 102100004143 COL9A3 Human genes 0.000 description 1
- 101710040374 COL9A3 Proteins 0.000 description 1
- 101700069295 COMP Proteins 0.000 description 1
- 102100003767 CREBBP Human genes 0.000 description 1
- 101710006045 CREBBP Proteins 0.000 description 1
- 101700072217 CRKL Proteins 0.000 description 1
- 102100011432 CRKL Human genes 0.000 description 1
- 102100005175 CSF1R Human genes 0.000 description 1
- 101700063802 CSF1R Proteins 0.000 description 1
- 102100006433 CSF3R Human genes 0.000 description 1
- 101700017008 CSF3R Proteins 0.000 description 1
- 102100011232 CTCF Human genes 0.000 description 1
- 101710005993 CTNNA1 Proteins 0.000 description 1
- 102100001886 CTNNA1 Human genes 0.000 description 1
- 102100014050 CUL3 Human genes 0.000 description 1
- 101700002039 CUL3 Proteins 0.000 description 1
- 102100015948 CUL4A Human genes 0.000 description 1
- 101700019301 CUL4A Proteins 0.000 description 1
- 102100002212 CXCR4 Human genes 0.000 description 1
- 101710003734 CXCR4 Proteins 0.000 description 1
- 102100007078 CYP17A1 Human genes 0.000 description 1
- 101710035271 CYP17A1 Proteins 0.000 description 1
- 241000282836 Camelus dromedarius Species 0.000 description 1
- 241000282472 Canis lupus familiaris Species 0.000 description 1
- 241000283707 Capra Species 0.000 description 1
- 208000008787 Cardiovascular Disease Diseases 0.000 description 1
- 210000002230 Centromere Anatomy 0.000 description 1
- 241000282693 Cercopithecidae Species 0.000 description 1
- 241000283153 Cetacea Species 0.000 description 1
- 241000251730 Chondrichthyes Species 0.000 description 1
- 230000037250 Clearance Effects 0.000 description 1
- 206010065163 Clonal evolution Diseases 0.000 description 1
- 108020004705 Codon Proteins 0.000 description 1
- 108010060313 Core Binding Factor beta Subunit Proteins 0.000 description 1
- 102000008147 Core Binding Factor beta Subunit Human genes 0.000 description 1
- 241001481833 Coryphaena hippurus Species 0.000 description 1
- 108010025415 Cyclin-Dependent Kinase 8 Proteins 0.000 description 1
- 108010009356 Cyclin-Dependent Kinase Inhibitor p15 Proteins 0.000 description 1
- 102000009512 Cyclin-Dependent Kinase Inhibitor p15 Human genes 0.000 description 1
- 108010009367 Cyclin-Dependent Kinase Inhibitor p18 Proteins 0.000 description 1
- 102000009503 Cyclin-Dependent Kinase Inhibitor p18 Human genes 0.000 description 1
- 108010016788 Cyclin-Dependent Kinase Inhibitor p21 Proteins 0.000 description 1
- 108010016777 Cyclin-Dependent Kinase Inhibitor p27 Proteins 0.000 description 1
- 102000000577 Cyclin-Dependent Kinase Inhibitor p27 Human genes 0.000 description 1
- 102100019302 DAXX Human genes 0.000 description 1
- 108060002123 DAXX Proteins 0.000 description 1
- 102100006837 DDR1 Human genes 0.000 description 1
- 101700058746 DDR1 Proteins 0.000 description 1
- 102100001013 DIS3 Human genes 0.000 description 1
- 101700051424 DIS3 Proteins 0.000 description 1
- 108020003215 DNA Probes Proteins 0.000 description 1
- 230000004544 DNA amplification Effects 0.000 description 1
- 230000007067 DNA methylation Effects 0.000 description 1
- 239000003298 DNA probe Substances 0.000 description 1
- 102100006402 DNMT3A Human genes 0.000 description 1
- 101710038368 DNMT3A Proteins 0.000 description 1
- 102100018143 DOT1L Human genes 0.000 description 1
- 101700067544 DOT1L Proteins 0.000 description 1
- 241000238557 Decapoda Species 0.000 description 1
- 206010011953 Decreased activity Diseases 0.000 description 1
- 206010012601 Diabetes mellitus Diseases 0.000 description 1
- 206010059866 Drug resistance Diseases 0.000 description 1
- 206010013786 Dry skin Diseases 0.000 description 1
- 102100018744 EED Human genes 0.000 description 1
- 101700036896 EED Proteins 0.000 description 1
- 102100019209 EMSY Human genes 0.000 description 1
- 101700023519 EMSY Proteins 0.000 description 1
- 229940034984 ENDOCRINE THERAPY ANTINEOPLASTIC AND IMMUNOMODULATING AGENTS Drugs 0.000 description 1
- 102100002185 EP300 Human genes 0.000 description 1
- 101700011490 EP300 Proteins 0.000 description 1
- 102100001810 EPHA3 Human genes 0.000 description 1
- 101700049294 EPHA3 Proteins 0.000 description 1
- 102100009838 EPHB1 Human genes 0.000 description 1
- 101700067933 EPHB1 Proteins 0.000 description 1
- 102100009831 EPHB4 Human genes 0.000 description 1
- 102000027776 ERBB3 Human genes 0.000 description 1
- 101700041204 ERBB3 Proteins 0.000 description 1
- 102100009851 ERBB4 Human genes 0.000 description 1
- 101700023619 ERBB4 Proteins 0.000 description 1
- 102100014008 ERCC4 Human genes 0.000 description 1
- 102100003763 ERG Human genes 0.000 description 1
- 101700055371 ERG Proteins 0.000 description 1
- 108010067770 Endopeptidase K Proteins 0.000 description 1
- 108010055323 EphB4 Receptor Proteins 0.000 description 1
- 229960001433 Erlotinib Drugs 0.000 description 1
- AAKJLRGGTJKAMG-UHFFFAOYSA-N Erlotinib Chemical compound C=12C=C(OCCOC)C(OCCOC)=CC2=NC=NC=1NC1=CC=CC(C#C)=C1 AAKJLRGGTJKAMG-UHFFFAOYSA-N 0.000 description 1
- 102100011167 FANCL Human genes 0.000 description 1
- 101700079540 FAS Proteins 0.000 description 1
- 102100008329 FASN Human genes 0.000 description 1
- 101710008102 FASN Proteins 0.000 description 1
- 102100000261 FGF10 Human genes 0.000 description 1
- 101700052889 FGF10 Proteins 0.000 description 1
- 102100000263 FGF12 Human genes 0.000 description 1
- 101700081916 FGF12 Proteins 0.000 description 1
- 102100015611 FGF14 Human genes 0.000 description 1
- 101700085636 FGF14 Proteins 0.000 description 1
- 102100015614 FGF19 Human genes 0.000 description 1
- 101700047578 FGF19 Proteins 0.000 description 1
- 102100008636 FGF23 Human genes 0.000 description 1
- 101700036284 FGF23 Proteins 0.000 description 1
- 102100008645 FGF3 Human genes 0.000 description 1
- 102100007406 FGF4 Human genes 0.000 description 1
- 101700036125 FGF4 Proteins 0.000 description 1
- 102100007407 FGF6 Human genes 0.000 description 1
- 101700012851 FGF6 Proteins 0.000 description 1
- 102100020189 FGFR4 Human genes 0.000 description 1
- 101700075612 FGFR4 Proteins 0.000 description 1
- 102100015351 FLCN Human genes 0.000 description 1
- 101700018953 FLCN Proteins 0.000 description 1
- 102100006565 FLT1 Human genes 0.000 description 1
- 101710030892 FLT1 Proteins 0.000 description 1
- 102100017921 FOXL2 Human genes 0.000 description 1
- 102100006369 FUBP1 Human genes 0.000 description 1
- 101700076383 FUBP1 Proteins 0.000 description 1
- 108010087740 Fanconi Anemia Complementation Group A Protein Proteins 0.000 description 1
- 102000009095 Fanconi Anemia Complementation Group A Protein Human genes 0.000 description 1
- 102000018825 Fanconi Anemia Complementation Group C Protein Human genes 0.000 description 1
- 108010027673 Fanconi Anemia Complementation Group C Protein Proteins 0.000 description 1
- 102000007122 Fanconi Anemia Complementation Group G Protein Human genes 0.000 description 1
- 108010033305 Fanconi Anemia Complementation Group G Protein Proteins 0.000 description 1
- 108010059417 Fanconi Anemia Complementation Group L Protein Proteins 0.000 description 1
- 102000016627 Fanconi Anemia Complementation Group N Protein Human genes 0.000 description 1
- 108010067741 Fanconi Anemia Complementation Group N Protein Proteins 0.000 description 1
- 201000000106 Fanconi anemia complementation group A Diseases 0.000 description 1
- 201000000129 Fanconi anemia complementation group C Diseases 0.000 description 1
- 201000000127 Fanconi anemia complementation group G Diseases 0.000 description 1
- 201000000141 Fanconi anemia complementation group L Diseases 0.000 description 1
- 241000282326 Felis catus Species 0.000 description 1
- 210000003754 Fetus Anatomy 0.000 description 1
- 108010010285 Forkhead Box Protein L2 Proteins 0.000 description 1
- 102000017691 GABRA6 Human genes 0.000 description 1
- 108060004390 GABRA6 Proteins 0.000 description 1
- 102100012697 GATA4 Human genes 0.000 description 1
- 101700002184 GATA4 Proteins 0.000 description 1
- 102100012696 GATA6 Human genes 0.000 description 1
- 101700008134 GATA6 Proteins 0.000 description 1
- 101700023910 GCAB Proteins 0.000 description 1
- 102100014511 GNA13 Human genes 0.000 description 1
- 101700021123 GNA13 Proteins 0.000 description 1
- 102100006614 GRM3 Human genes 0.000 description 1
- 101700043312 GRM3 Proteins 0.000 description 1
- 102100007245 GSK3B Human genes 0.000 description 1
- 210000001035 Gastrointestinal Tract Anatomy 0.000 description 1
- XGALLCVXEZPNRQ-UHFFFAOYSA-N Gefitinib Chemical compound C=12C=C(OCCCN3CCOCC3)C(OC)=CC2=NC=NC=1NC1=CC=C(F)C(Cl)=C1 XGALLCVXEZPNRQ-UHFFFAOYSA-N 0.000 description 1
- 108010051975 Glycogen Synthase Kinase 3 beta Proteins 0.000 description 1
- 241000282575 Gorilla Species 0.000 description 1
- UYTPUPDQBNUYGX-UHFFFAOYSA-N Guanine Chemical compound O=C1NC(N)=NC2=C1N=CN2 UYTPUPDQBNUYGX-UHFFFAOYSA-N 0.000 description 1
- 102100004959 H3-3A Human genes 0.000 description 1
- 101710007246 H3-3A Proteins 0.000 description 1
- 102100002572 HDAC1 Human genes 0.000 description 1
- 101700036927 HDAC1 Proteins 0.000 description 1
- 102100000579 HGF Human genes 0.000 description 1
- 101710024788 HOMER1 Proteins 0.000 description 1
- 102200006562 HRAS A146T Human genes 0.000 description 1
- 102200006603 HRAS D119N Human genes 0.000 description 1
- 102200006657 HRAS G13C Human genes 0.000 description 1
- 102100004763 HSD3B1 Human genes 0.000 description 1
- 101710007627 HSD3B1 Proteins 0.000 description 1
- WZUVPPKBWHMQCE-VYIIXAMBSA-N Haematoxylin Chemical compound C12=CC(O)=C(O)C=C2C[C@@]2(O)C1C1=CC=C(O)C(O)=C1OC2 WZUVPPKBWHMQCE-VYIIXAMBSA-N 0.000 description 1
- 206010018987 Haemorrhage Diseases 0.000 description 1
- 241000243251 Hydra Species 0.000 description 1
- 206010020488 Hydrocele Diseases 0.000 description 1
- 206010020751 Hypersensitivity Diseases 0.000 description 1
- 206010020772 Hypertension Diseases 0.000 description 1
- 102100008570 ID3 Human genes 0.000 description 1
- 101700069322 ID3 Proteins 0.000 description 1
- 102100014263 IGF1R Human genes 0.000 description 1
- 101700025802 IGF1R Proteins 0.000 description 1
- 102100008723 IKBKE Human genes 0.000 description 1
- 101710002884 IKBKE Proteins 0.000 description 1
- 102100008719 IKZF1 Human genes 0.000 description 1
- 101700005406 IKZF1 Proteins 0.000 description 1
- 108060003940 IL6 Proteins 0.000 description 1
- 102100008239 INPP4B Human genes 0.000 description 1
- 101710031804 INPP4B Proteins 0.000 description 1
- 101710004181 INTS2 Proteins 0.000 description 1
- 102100013278 IRF2 Human genes 0.000 description 1
- 101700001385 IRF2 Proteins 0.000 description 1
- 102100013274 IRF4 Human genes 0.000 description 1
- 101700047660 IRF4 Proteins 0.000 description 1
- 101700043815 IRS2 Proteins 0.000 description 1
- 102100002730 IRS2 Human genes 0.000 description 1
- 102000018358 Immunoglobulins Human genes 0.000 description 1
- 108060003951 Immunoglobulins Proteins 0.000 description 1
- 102100019517 JAK1 Human genes 0.000 description 1
- 101700034277 JAK1 Proteins 0.000 description 1
- 101700085508 KCNH2 Proteins 0.000 description 1
- 101700011826 KCNH6 Proteins 0.000 description 1
- 102100010208 KDM5A Human genes 0.000 description 1
- 101700048012 KDM5A Proteins 0.000 description 1
- 102100010203 KDM5C Human genes 0.000 description 1
- 101700028518 KDM5C Proteins 0.000 description 1
- 102100013867 KDM6A Human genes 0.000 description 1
- 101700004928 KDM6A Proteins 0.000 description 1
- 102100013180 KDR Human genes 0.000 description 1
- 101700033678 KDR Proteins 0.000 description 1
- 101710030888 KDR Proteins 0.000 description 1
- 101700060925 KLHL6 Proteins 0.000 description 1
- 102100012858 KLHL7 Human genes 0.000 description 1
- 101700082930 KLHL7 Proteins 0.000 description 1
- 102100004455 KMT2A Human genes 0.000 description 1
- 101700000155 KMT2A Proteins 0.000 description 1
- 101700013012 KMT2B Proteins 0.000 description 1
- 102200006537 KRAS G12A Human genes 0.000 description 1
- 102200006539 KRAS G12D Human genes 0.000 description 1
- 102200006540 KRAS G12R Human genes 0.000 description 1
- 102200006531 KRAS G12V Human genes 0.000 description 1
- 102200006532 KRAS G13D Human genes 0.000 description 1
- 102200006533 KRAS G13R Human genes 0.000 description 1
- 102000004034 Kelch-like ECH-associated protein 1 Human genes 0.000 description 1
- 108090000484 Kelch-like ECH-associated protein 1 Proteins 0.000 description 1
- 210000003734 Kidney Anatomy 0.000 description 1
- 239000005411 L01XE02 - Gefitinib Substances 0.000 description 1
- 239000005551 L01XE03 - Erlotinib Substances 0.000 description 1
- 101710036514 LONP1 Proteins 0.000 description 1
- 241000270322 Lepidosauria Species 0.000 description 1
- 208000007046 Leukemia, Myeloid, Acute Diseases 0.000 description 1
- 210000004185 Liver Anatomy 0.000 description 1
- 210000004072 Lung Anatomy 0.000 description 1
- 206010025323 Lymphomas Diseases 0.000 description 1
- 102100012659 MAGI1 Human genes 0.000 description 1
- 108010075654 MAP Kinase Kinase Kinase 1 Proteins 0.000 description 1
- 101710027479 MAP2K1 Proteins 0.000 description 1
- 102100015874 MAP2K4 Human genes 0.000 description 1
- 101710027749 MAP2K4 Proteins 0.000 description 1
- 102100016522 MAP3K1 Human genes 0.000 description 1
- 102100016525 MAP3K13 Human genes 0.000 description 1
- 101710007521 MAP3K13 Proteins 0.000 description 1
- 102100016825 MAPK3 Human genes 0.000 description 1
- 101700001448 MAPK3 Proteins 0.000 description 1
- 101700031439 MCL1 Proteins 0.000 description 1
- 102100002383 MCL1 Human genes 0.000 description 1
- 102100019155 MDM2 Human genes 0.000 description 1
- 101700032565 MDM2 Proteins 0.000 description 1
- 108050005300 MDM4 Proteins 0.000 description 1
- 102000017274 MDM4 Human genes 0.000 description 1
- 102100008527 MEF2B Human genes 0.000 description 1
- 101700077280 MEF2B Proteins 0.000 description 1
- 101700028785 MEK1 Proteins 0.000 description 1
- 102100019218 MEN1 Human genes 0.000 description 1
- 101700045140 MEN1 Proteins 0.000 description 1
- 102100007644 MERTK Human genes 0.000 description 1
- 101710026102 MIC-ACT-2 Proteins 0.000 description 1
- 102100014646 MITF Human genes 0.000 description 1
- 101700053443 MKK1 Proteins 0.000 description 1
- 102100014833 MKNK1 Human genes 0.000 description 1
- 101700087081 MKNK1 Proteins 0.000 description 1
- 101700052154 MPK1 Proteins 0.000 description 1
- 108060005135 MPL Proteins 0.000 description 1
- 102000002251 MRE11 Homologue Protein Human genes 0.000 description 1
- 108010000318 MRE11 Homologue Protein Proteins 0.000 description 1
- 101710005594 MRPL36 Proteins 0.000 description 1
- 102100013820 MSH2 Human genes 0.000 description 1
- 101700083509 MSH2 Proteins 0.000 description 1
- 229910015837 MSH2 Inorganic materials 0.000 description 1
- 102100005745 MSH3 Human genes 0.000 description 1
- 108060002317 MSH3 Proteins 0.000 description 1
- 102100002001 MSH6 Human genes 0.000 description 1
- 101700030163 MSH6 Proteins 0.000 description 1
- 101710029065 MST1R Proteins 0.000 description 1
- 102100003099 MST1R Human genes 0.000 description 1
- 108060006326 MTAP Proteins 0.000 description 1
- 102100002025 MTAP Human genes 0.000 description 1
- 102100003827 MUTYH Human genes 0.000 description 1
- 101700053678 MUTYH Proteins 0.000 description 1
- 102100018882 MYCN Human genes 0.000 description 1
- 102100010074 MYD88 Human genes 0.000 description 1
- 101700079836 MYD88 Proteins 0.000 description 1
- 229920002521 Macromolecule Polymers 0.000 description 1
- 241000124008 Mammalia Species 0.000 description 1
- 241000211181 Manta Species 0.000 description 1
- 108010050345 Microphthalmia-Associated Transcription Factor Proteins 0.000 description 1
- 108010074346 Mismatch Repair Endonuclease PMS2 Proteins 0.000 description 1
- 102000008071 Mismatch Repair Endonuclease PMS2 Human genes 0.000 description 1
- MFRNYXJJRJQHNW-NARUGQRUSA-N Monomethyl auristatin F Chemical compound CN[C@@H](C(C)C)C(=O)N[C@@H](C(C)C)C(=O)N(C)C([C@@H](C)CC)[C@H](OC)CC(=O)N1CCC[C@H]1[C@H](OC)[C@@H](C)C(=O)N[C@H](C(O)=O)CC1=CC=CC=C1 MFRNYXJJRJQHNW-NARUGQRUSA-N 0.000 description 1
- 108010010748 N-Myc Proto-Oncogene Protein Proteins 0.000 description 1
- 101700015287 NBN Proteins 0.000 description 1
- 102100010499 NF2 Human genes 0.000 description 1
- 101700071070 NF2 Proteins 0.000 description 1
- 102100015758 NFKBIA Human genes 0.000 description 1
- 101710003044 NFKBIA Proteins 0.000 description 1
- 102100018287 NKX2-1 Human genes 0.000 description 1
- 101710012901 NKX2-1 Proteins 0.000 description 1
- 102100012126 NOTCH2 Human genes 0.000 description 1
- 101710036046 NOTCH2 Proteins 0.000 description 1
- 102100012125 NOTCH3 Human genes 0.000 description 1
- 101710036045 NOTCH3 Proteins 0.000 description 1
- 101710034230 NR2F1 Proteins 0.000 description 1
- 102100017061 NT5C2 Human genes 0.000 description 1
- 101710039719 NT5C2 Proteins 0.000 description 1
- 102100016106 NTRK2 Human genes 0.000 description 1
- 108060005033 NTRK2 Proteins 0.000 description 1
- 210000002445 Nipples Anatomy 0.000 description 1
- 108020004711 Nucleic Acid Probes Proteins 0.000 description 1
- 108091005503 Nucleic proteins Proteins 0.000 description 1
- 229920000272 Oligonucleotide Polymers 0.000 description 1
- 241000283898 Ovis Species 0.000 description 1
- 102100001444 P2RY8 Human genes 0.000 description 1
- 101700043923 P2RY8 Proteins 0.000 description 1
- 102100017733 PARP2 Human genes 0.000 description 1
- 101700053624 PARP2 Proteins 0.000 description 1
- 101700049297 PARP3 Proteins 0.000 description 1
- 102100017728 PARP3 Human genes 0.000 description 1
- 101700057981 PAX5 Proteins 0.000 description 1
- 102100018849 PAX5 Human genes 0.000 description 1
- 102100017777 PBRM1 Human genes 0.000 description 1
- 101710027500 PBRM1 Proteins 0.000 description 1
- 102100004939 PDGFRB Human genes 0.000 description 1
- 108060006638 PDPK1 Proteins 0.000 description 1
- 102100018803 PDPK1 Human genes 0.000 description 1
- 102100013913 PIK3C2B Human genes 0.000 description 1
- 101710009776 PIK3C2B Proteins 0.000 description 1
- 102100013907 PIK3C2G Human genes 0.000 description 1
- 101710009772 PIK3C2G Proteins 0.000 description 1
- 102100019474 PIK3CB Human genes 0.000 description 1
- 101710027441 PIK3CB Proteins 0.000 description 1
- 102100014818 PIK3R1 Human genes 0.000 description 1
- 101710039899 PIK3R1 Proteins 0.000 description 1
- 102100016457 PIM1 Human genes 0.000 description 1
- 101700018532 PIM1 Proteins 0.000 description 1
- 102100012348 POLD1 Human genes 0.000 description 1
- 101710040092 POLD1 Proteins 0.000 description 1
- 102100000077 PPARG Human genes 0.000 description 1
- 101700070851 PPARG Proteins 0.000 description 1
- 102100018719 PPP2R1A Human genes 0.000 description 1
- 101710025266 PPP2R1A Proteins 0.000 description 1
- 102100018014 PPP2R2A Human genes 0.000 description 1
- 101710025230 PPP2R2A Proteins 0.000 description 1
- 102100019670 PRKCI Human genes 0.000 description 1
- 101710038827 PRKCI Proteins 0.000 description 1
- 102100003214 PRKN Human genes 0.000 description 1
- 101700032550 PRKN Proteins 0.000 description 1
- 102100017348 PTPRO Human genes 0.000 description 1
- 101700025670 PTPRO Proteins 0.000 description 1
- 101700071602 PTPRU Proteins 0.000 description 1
- 241000282577 Pan troglodytes Species 0.000 description 1
- 108010065129 Patched-1 Receptor Proteins 0.000 description 1
- 102000012850 Patched-1 Receptor Human genes 0.000 description 1
- ZYFVNVRFVHJEIU-UHFFFAOYSA-N PicoGreen Chemical compound CN(C)CCCN(CCCN(C)C)C1=CC(=CC2=[N+](C3=CC=CC=C3S2)C)C2=CC=CC=C2N1C1=CC=CC=C1 ZYFVNVRFVHJEIU-UHFFFAOYSA-N 0.000 description 1
- 206010035226 Plasma cell myeloma Diseases 0.000 description 1
- 108010051742 Platelet-Derived Growth Factor beta Receptor Proteins 0.000 description 1
- 206010036790 Productive cough Diseases 0.000 description 1
- 102100009495 QKI Human genes 0.000 description 1
- 101700050983 QKI Proteins 0.000 description 1
- 102100002253 RAC1 Human genes 0.000 description 1
- 101700023645 RAC1 Proteins 0.000 description 1
- 101700025113 RAC3 Proteins 0.000 description 1
- 102100001187 RAD21 Human genes 0.000 description 1
- 108060006873 RAD21 Proteins 0.000 description 1
- 102000001195 RAD51 Human genes 0.000 description 1
- 101710003586 RAD51AP2 Proteins 0.000 description 1
- 102100012634 RAD51B Human genes 0.000 description 1
- 101710002981 RAD51B Proteins 0.000 description 1
- 102100012633 RAD51C Human genes 0.000 description 1
- 101710002945 RAD51C Proteins 0.000 description 1
- 102100012914 RAD51D Human genes 0.000 description 1
- 101710002943 RAD51D Proteins 0.000 description 1
- 108050002092 RAD52 Proteins 0.000 description 1
- 102100002821 RAD52 Human genes 0.000 description 1
- 101710017584 RAD54L Proteins 0.000 description 1
- 102100016115 RAF1 Human genes 0.000 description 1
- 101700007719 RAF1 Proteins 0.000 description 1
- 102100000880 RBM10 Human genes 0.000 description 1
- 101700056415 RBM10 Proteins 0.000 description 1
- 101700001630 RET Proteins 0.000 description 1
- 108020005452 RHEB Proteins 0.000 description 1
- 102100005090 RHEB Human genes 0.000 description 1
- 101700020165 RHOA Proteins 0.000 description 1
- 102100004989 RHOA Human genes 0.000 description 1
- 101700068478 RIT1 Proteins 0.000 description 1
- 101710002310 RNASE1 Proteins 0.000 description 1
- 101710007825 RNASE3 Proteins 0.000 description 1
- 102100016246 RNF43 Human genes 0.000 description 1
- 101700048059 RNF43 Proteins 0.000 description 1
- 102100012618 RPTOR Human genes 0.000 description 1
- 108010068097 Rad51 Recombinase Proteins 0.000 description 1
- 108010061204 Rapamycin-Insensitive Companion of mTOR Protein Proteins 0.000 description 1
- 102000012007 Rapamycin-Insensitive Companion of mTOR Protein Human genes 0.000 description 1
- 241000700159 Rattus Species 0.000 description 1
- 108010029031 Regulatory-Associated Protein of mTOR Proteins 0.000 description 1
- 241000282849 Ruminantia Species 0.000 description 1
- BFDMCHRDSYTOLE-UHFFFAOYSA-N SC#N.NC(N)=N.ClC(Cl)Cl.OC1=CC=CC=C1 Chemical compound SC#N.NC(N)=N.ClC(Cl)Cl.OC1=CC=CC=C1 BFDMCHRDSYTOLE-UHFFFAOYSA-N 0.000 description 1
- 101700080317 SCAR2 Proteins 0.000 description 1
- 101700046029 SCN8A Proteins 0.000 description 1
- 101700029504 SDHA Proteins 0.000 description 1
- 102100010184 SDHA Human genes 0.000 description 1
- 102100013990 SDHB Human genes 0.000 description 1
- 101700032216 SDHB Proteins 0.000 description 1
- 102100007252 SDHC Human genes 0.000 description 1
- 101700029781 SDHC Proteins 0.000 description 1
- 102100011961 SDHD Human genes 0.000 description 1
- 101700035481 SDHD Proteins 0.000 description 1
- 102100000940 SETD2 Human genes 0.000 description 1
- 101700071021 SETD2 Proteins 0.000 description 1
- 108060007427 SF3B1 Proteins 0.000 description 1
- 102100014711 SF3B1 Human genes 0.000 description 1
- 102100018035 SGK1 Human genes 0.000 description 1
- 101700056898 SGK1 Proteins 0.000 description 1
- 102100017669 SMAD2 Human genes 0.000 description 1
- 101700012842 SMAD2 Proteins 0.000 description 1
- 102100019447 SMARCA4 Human genes 0.000 description 1
- 101710025703 SMARCA4 Proteins 0.000 description 1
- 102000011740 SMARCB1 Protein Human genes 0.000 description 1
- 108010076630 SMARCB1 Protein Proteins 0.000 description 1
- 101700021542 SMO Proteins 0.000 description 1
- 102100015930 SMOX Human genes 0.000 description 1
- 101700076839 SMOX Proteins 0.000 description 1
- 102100004219 SNCAIP Human genes 0.000 description 1
- 101710019013 SNCAIP Proteins 0.000 description 1
- 102100005349 SOCS1 Human genes 0.000 description 1
- 102100018829 SOX2 Human genes 0.000 description 1
- 101700006931 SOX2 Proteins 0.000 description 1
- 102100015788 SOX9 Human genes 0.000 description 1
- 101700030874 SOX9 Proteins 0.000 description 1
- 102100008368 SPEN Human genes 0.000 description 1
- 101700038841 SPEN Proteins 0.000 description 1
- 102100005038 SPOP Human genes 0.000 description 1
- 101700060127 SPOP Proteins 0.000 description 1
- 102000001332 SRC Human genes 0.000 description 1
- 101710009384 SRC Proteins 0.000 description 1
- 102100015334 STAG2 Human genes 0.000 description 1
- 101700002785 STAG2 Proteins 0.000 description 1
- 102100019667 STAT3 Human genes 0.000 description 1
- 108010017324 STAT3 Transcription Factor Proteins 0.000 description 1
- 102100004542 SUFU Human genes 0.000 description 1
- 108060007940 SUFU Proteins 0.000 description 1
- 102100019630 SYK Human genes 0.000 description 1
- 101700073994 SYK Proteins 0.000 description 1
- 210000003765 Sex Chromosomes Anatomy 0.000 description 1
- 208000007056 Sickle Cell Anemia Diseases 0.000 description 1
- 108020004682 Single-Stranded DNA Proteins 0.000 description 1
- 210000003491 Skin Anatomy 0.000 description 1
- 210000003802 Sputum Anatomy 0.000 description 1
- 108010090804 Streptavidin Proteins 0.000 description 1
- 108010089643 Suppressor of Cytokine Signaling 1 Protein Proteins 0.000 description 1
- 206010042971 T-cell lymphoma Diseases 0.000 description 1
- 101700024656 TBX3 Proteins 0.000 description 1
- 102100009772 TBX3 Human genes 0.000 description 1
- 102100016327 TEK Human genes 0.000 description 1
- 101710037124 TEK Proteins 0.000 description 1
- 101710037949 TENT5C Proteins 0.000 description 1
- 102100005538 TENT5C Human genes 0.000 description 1
- 101700048164 TET2 Proteins 0.000 description 1
- 102100003998 TET2 Human genes 0.000 description 1
- 102100003994 TGFBR2 Human genes 0.000 description 1
- 210000003411 Telomere Anatomy 0.000 description 1
- 210000001550 Testis Anatomy 0.000 description 1
- 229940113082 Thymine Drugs 0.000 description 1
- 210000001685 Thyroid Gland Anatomy 0.000 description 1
- 108010082684 Transforming Growth Factor-beta Type II Receptor Proteins 0.000 description 1
- 206010066901 Treatment failure Diseases 0.000 description 1
- 108020004417 Untranslated RNA Proteins 0.000 description 1
- 210000003932 Urinary Bladder Anatomy 0.000 description 1
- 241001416177 Vicugna pacos Species 0.000 description 1
- 241000700605 Viruses Species 0.000 description 1
- 101700009925 WNK1 Proteins 0.000 description 1
- 102000013814 Wnt Human genes 0.000 description 1
- 108050003627 Wnt Proteins 0.000 description 1
- 108010069188 X-linked Nuclear Protein Proteins 0.000 description 1
- 101710010287 YWHAZ Proteins 0.000 description 1
- 241000068283 Yam virus X Species 0.000 description 1
- 229960000853 abiraterone Drugs 0.000 description 1
- 239000002253 acid Substances 0.000 description 1
- 230000003044 adaptive Effects 0.000 description 1
- 238000007792 addition Methods 0.000 description 1
- OIRDTQYFTABQOQ-KQYNXXCUSA-N adenosine Chemical compound C1=NC=2C(N)=NC=NC=2N1[C@@H]1O[C@H](CO)[C@@H](O)[C@H]1O OIRDTQYFTABQOQ-KQYNXXCUSA-N 0.000 description 1
- 238000009098 adjuvant therapy Methods 0.000 description 1
- 238000005377 adsorption chromatography Methods 0.000 description 1
- 231100000494 adverse effect Toxicity 0.000 description 1
- ULXXDDBFHOBEHA-CWDCEQMOSA-N afatinib Chemical compound N1=CN=C2C=C(O[C@@H]3COCC3)C(NC(=O)/C=C/CN(C)C)=CC2=C1NC1=CC=C(F)C(Cl)=C1 ULXXDDBFHOBEHA-CWDCEQMOSA-N 0.000 description 1
- 231100001075 aneuploidy Toxicity 0.000 description 1
- 238000005571 anion exchange chromatography Methods 0.000 description 1
- 238000011123 anti-EGFR therapy Methods 0.000 description 1
- 230000002424 anti-apoptotic Effects 0.000 description 1
- 239000000427 antigen Substances 0.000 description 1
- 108091007172 antigens Proteins 0.000 description 1
- 102000038129 antigens Human genes 0.000 description 1
- 230000001640 apoptogenic Effects 0.000 description 1
- 230000003851 biochemical process Effects 0.000 description 1
- YBJHBAHKTGYVGT-ZKWXMUAHSA-N biotin Chemical compound N1C(=O)N[C@@H]2[C@H](CCCCC(=O)O)SC[C@@H]21 YBJHBAHKTGYVGT-ZKWXMUAHSA-N 0.000 description 1
- 229960002685 biotin Drugs 0.000 description 1
- 235000020958 biotin Nutrition 0.000 description 1
- 239000011616 biotin Substances 0.000 description 1
- 201000001528 bladder urothelial carcinoma Diseases 0.000 description 1
- 230000000740 bleeding Effects 0.000 description 1
- 231100000319 bleeding Toxicity 0.000 description 1
- 239000010839 body fluid Substances 0.000 description 1
- 108010018804 c-Mer Tyrosine Kinase Proteins 0.000 description 1
- 230000036952 cancer formation Effects 0.000 description 1
- 238000002619 cancer immunotherapy Methods 0.000 description 1
- 231100000504 carcinogenesis Toxicity 0.000 description 1
- 230000015556 catabolic process Effects 0.000 description 1
- 230000030833 cell death Effects 0.000 description 1
- 238000002512 chemotherapy Methods 0.000 description 1
- 230000035512 clearance Effects 0.000 description 1
- 230000000052 comparative effect Effects 0.000 description 1
- 230000001010 compromised Effects 0.000 description 1
- 238000005520 cutting process Methods 0.000 description 1
- 201000003883 cystic fibrosis Diseases 0.000 description 1
- 230000009089 cytolysis Effects 0.000 description 1
- 230000002950 deficient Effects 0.000 description 1
- 230000004059 degradation Effects 0.000 description 1
- 238000006731 degradation reaction Methods 0.000 description 1
- 230000002939 deleterious Effects 0.000 description 1
- 230000001809 detectable Effects 0.000 description 1
- 235000005911 diet Nutrition 0.000 description 1
- 230000037213 diet Effects 0.000 description 1
- 230000004069 differentiation Effects 0.000 description 1
- SHIBSTMRCDJXLN-KCZCNTNESA-N digoxigenin Chemical compound C1([C@@H]2[C@@]3([C@@](CC2)(O)[C@H]2[C@@H]([C@@]4(C)CC[C@H](O)C[C@H]4CC2)C[C@H]3O)C)=CC(=O)OC1 SHIBSTMRCDJXLN-KCZCNTNESA-N 0.000 description 1
- 230000035622 drinking Effects 0.000 description 1
- 235000021271 drinking Nutrition 0.000 description 1
- 230000037336 dry skin Effects 0.000 description 1
- 239000000975 dye Substances 0.000 description 1
- 230000004064 dysfunction Effects 0.000 description 1
- 238000003708 edge detection Methods 0.000 description 1
- KCXVZYZYPLLWCC-UHFFFAOYSA-N edta Chemical compound OC(=O)CN(CC(O)=O)CCN(CC(O)=O)CC(O)=O KCXVZYZYPLLWCC-UHFFFAOYSA-N 0.000 description 1
- 230000002500 effect on skin Effects 0.000 description 1
- 230000005670 electromagnetic radiation Effects 0.000 description 1
- 238000009261 endocrine therapy Methods 0.000 description 1
- 230000002357 endometrial Effects 0.000 description 1
- 238000001861 endoscopic biopsy Methods 0.000 description 1
- LFQSCWFLJHTTHZ-UHFFFAOYSA-N ethanol Chemical compound CCO LFQSCWFLJHTTHZ-UHFFFAOYSA-N 0.000 description 1
- 238000007387 excisional biopsy Methods 0.000 description 1
- 230000004634 feeding behavior Effects 0.000 description 1
- 238000011010 flushing procedure Methods 0.000 description 1
- 238000005755 formation reaction Methods 0.000 description 1
- 238000007672 fourth generation sequencing Methods 0.000 description 1
- 229960002584 gefitinib Drugs 0.000 description 1
- 238000001502 gel electrophoresis Methods 0.000 description 1
- 238000007429 general method Methods 0.000 description 1
- 239000003365 glass fiber Substances 0.000 description 1
- 238000002873 global sequence alignment Methods 0.000 description 1
- 125000000267 glycino group Chemical group [H]N([*])C([H])([H])C(=O)O[H] 0.000 description 1
- PCHJSUWPFVWCPO-UHFFFAOYSA-N gold Chemical compound [Au] PCHJSUWPFVWCPO-UHFFFAOYSA-N 0.000 description 1
- 238000009499 grossing Methods 0.000 description 1
- 238000007490 hematoxylin and eosin (H&E) staining Methods 0.000 description 1
- 238000009396 hybridization Methods 0.000 description 1
- 229910052739 hydrogen Inorganic materials 0.000 description 1
- 239000001257 hydrogen Substances 0.000 description 1
- 230000003463 hyperproliferative Effects 0.000 description 1
- 230000001900 immune effect Effects 0.000 description 1
- 230000001771 impaired Effects 0.000 description 1
- 238000000338 in vitro Methods 0.000 description 1
- 238000001727 in vivo Methods 0.000 description 1
- 238000007386 incisional biopsy Methods 0.000 description 1
- 238000007689 inspection Methods 0.000 description 1
- 230000003993 interaction Effects 0.000 description 1
- 238000002372 labelling Methods 0.000 description 1
- 101700085186 lcb2 Proteins 0.000 description 1
- 230000003902 lesions Effects 0.000 description 1
- 238000009630 liquid culture Methods 0.000 description 1
- 239000006193 liquid solution Substances 0.000 description 1
- 238000011068 load Methods 0.000 description 1
- 230000004777 loss-of-function mutation Effects 0.000 description 1
- 201000005249 lung adenocarcinoma Diseases 0.000 description 1
- 230000002934 lysing Effects 0.000 description 1
- 238000010841 mRNA extraction Methods 0.000 description 1
- 230000036210 malignancy Effects 0.000 description 1
- 210000004962 mammalian cells Anatomy 0.000 description 1
- 238000004519 manufacturing process Methods 0.000 description 1
- 239000003550 marker Substances 0.000 description 1
- 230000001404 mediated Effects 0.000 description 1
- 230000003340 mental Effects 0.000 description 1
- 238000002705 metabolomic Methods 0.000 description 1
- 230000001431 metabolomic Effects 0.000 description 1
- 201000001997 microphthalmia with limb anomalies Diseases 0.000 description 1
- 108010059074 monomethylauristatin F Proteins 0.000 description 1
- 230000003562 morphometric Effects 0.000 description 1
- 201000009251 multiple myeloma Diseases 0.000 description 1
- 230000001338 necrotic Effects 0.000 description 1
- 230000001613 neoplastic Effects 0.000 description 1
- ZNHPZUKZSNBOSQ-BQYQJAHWSA-N neratinib Chemical compound C=12C=C(NC\C=C\CN(C)C)C(OCC)=CC2=NC=C(C#N)C=1NC(C=C1Cl)=CC=C1OCC1=CC=CC=N1 ZNHPZUKZSNBOSQ-BQYQJAHWSA-N 0.000 description 1
- 229950008835 neratinib Drugs 0.000 description 1
- 238000010899 nucleation Methods 0.000 description 1
- 239000002853 nucleic acid probe Substances 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 238000002888 pairwise sequence alignment Methods 0.000 description 1
- 239000012188 paraffin wax Substances 0.000 description 1
- 239000002245 particle Substances 0.000 description 1
- 230000001575 pathological Effects 0.000 description 1
- 238000002205 phenol-chloroform extraction Methods 0.000 description 1
- 244000144977 poultry Species 0.000 description 1
- 230000002028 premature Effects 0.000 description 1
- 238000004321 preservation Methods 0.000 description 1
- 239000000092 prognostic biomarker Substances 0.000 description 1
- 230000000750 progressive Effects 0.000 description 1
- 200000000025 progressive disease Diseases 0.000 description 1
- 238000007388 punch biopsy Methods 0.000 description 1
- 230000036647 reaction Effects 0.000 description 1
- 238000005215 recombination Methods 0.000 description 1
- 231100000596 recommended exposure limit Toxicity 0.000 description 1
- 238000011084 recovery Methods 0.000 description 1
- 230000000268 renotropic Effects 0.000 description 1
- 230000001718 repressive Effects 0.000 description 1
- 230000002441 reversible Effects 0.000 description 1
- 102220331758 rs1057519698 Human genes 0.000 description 1
- 102220197960 rs1057519783 Human genes 0.000 description 1
- 102220197961 rs1057519784 Human genes 0.000 description 1
- 102220198074 rs1057519859 Human genes 0.000 description 1
- 102220198249 rs1057519936 Human genes 0.000 description 1
- 102220014333 rs112445441 Human genes 0.000 description 1
- 102220197780 rs121434596 Human genes 0.000 description 1
- 102220197834 rs121913535 Human genes 0.000 description 1
- 102220096619 rs376189676 Human genes 0.000 description 1
- 102220054065 rs727503108 Human genes 0.000 description 1
- 102220010982 rs730880460 Human genes 0.000 description 1
- 238000007480 sanger sequencing Methods 0.000 description 1
- 239000004065 semiconductor Substances 0.000 description 1
- 238000000926 separation method Methods 0.000 description 1
- 238000007841 sequencing by ligation Methods 0.000 description 1
- 238000007389 shave biopsy Methods 0.000 description 1
- 238000010008 shearing Methods 0.000 description 1
- 230000006403 short-term memory Effects 0.000 description 1
- 231100000486 side effect Toxicity 0.000 description 1
- 230000011664 signaling Effects 0.000 description 1
- 238000004513 sizing Methods 0.000 description 1
- 239000004296 sodium metabisulphite Substances 0.000 description 1
- 239000007790 solid phase Substances 0.000 description 1
- 108010057210 telomerase RNA Proteins 0.000 description 1
- 229920000511 telomere Polymers 0.000 description 1
- 230000035897 transcription Effects 0.000 description 1
- 102000003995 transcription factors Human genes 0.000 description 1
- 108090000464 transcription factors Proteins 0.000 description 1
- 230000002103 transcriptional Effects 0.000 description 1
- 101700026210 trx Proteins 0.000 description 1
- 210000004881 tumor cells Anatomy 0.000 description 1
- 230000004614 tumor growth Effects 0.000 description 1
- 238000009827 uniform distribution Methods 0.000 description 1
- 230000003827 upregulation Effects 0.000 description 1
- 230000000007 visual effect Effects 0.000 description 1
- 238000011179 visual inspection Methods 0.000 description 1
- XLYOFNOQVPJJNP-UHFFFAOYSA-N water Substances O XLYOFNOQVPJJNP-UHFFFAOYSA-N 0.000 description 1
- 108010073629 xeroderma pigmentosum group F protein Proteins 0.000 description 1
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/10—Ploidy or copy number detection
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/20—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
Abstract
Methods, systems, and software are provided for validating a copy number variation, validating a somatic sequence variant, and/or determining circulating tumor fraction estimates using on-target and off-target sequence reads in a test subject. A copy number status annotation for a genomic segment is validated by applying a first dataset to a plurality of filters comprising a measure of central tendency bin-level sequence ratio filter, a confidence filter, and a measure of central tendency-plus-deviation bin-level sequence ratio filter. A somatic sequence variant is validated by comparing a variant allele fragment count for a candidate somatic sequence variant at a respective locus against a dynamic variant count threshold for the locus in a respective reference sequence. A circulating tumor fraction is estimated based on a measure of fit between genomic segment-level coverage ratios and integer copy states across a plurality of simulated circulating tumor fractions.
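The tumor fraction estimate summarized in the abstract can be illustrated with a minimal sketch. It is not the claimed method: the function names, the diploid baseline, the linear (non-log) coverage ratios, the squared-distance measure of fit, and the 1% grid of simulated tumor fractions are all assumptions made for illustration only.

```python
# Hypothetical sketch: choose the simulated tumor fraction whose integer copy
# states best explain the observed segment-level coverage ratios.

def expected_ratio(copy_number, tumor_fraction, baseline=2):
    """Coverage ratio expected for a segment at an integer tumor copy number,
    assuming a diploid baseline in both tumor and normal cells."""
    mixed = tumor_fraction * copy_number + (1.0 - tumor_fraction) * baseline
    return mixed / baseline


def fit_error(segment_ratios, segment_weights, tumor_fraction, max_copy_number):
    """Weighted squared distance from each segment to its nearest integer state."""
    states = [expected_ratio(cn, tumor_fraction) for cn in range(max_copy_number + 1)]
    return sum(
        weight * min((ratio - state) ** 2 for state in states)
        for ratio, weight in zip(segment_ratios, segment_weights)
    )


def estimate_tumor_fraction(segment_ratios, segment_weights=None,
                            max_copy_number=4, grid=None):
    """Return the grid value (simulated tumor fraction) with the lowest fit error."""
    if segment_weights is None:
        segment_weights = [1.0] * len(segment_ratios)
    if grid is None:
        grid = [i / 100 for i in range(1, 101)]  # simulated fractions 1% .. 100%
    return min(
        grid,
        key=lambda tf: fit_error(segment_ratios, segment_weights, tf, max_copy_number),
    )


# Segments at copy numbers 1, 2, 3, and 4 in a sample with 40% tumor fraction.
ratios = [0.8, 1.0, 1.2, 1.4]
estimate = estimate_tumor_fraction(ratios)
```

A score this simple is ambiguous when the copy-number ceiling is raised, because a tumor fraction of 0.2 with copy states 0, 2, 4, and 6 reproduces the same ratios. A practical implementation would need some additional constraint or penalty to break such ties.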
Description
METHODS AND SYSTEMS FOR A LIQUID BIOPSY ASSAY
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] This application claims priority to U.S. Provisional Patent Application No. 63/041,272, filed June 19, 2020, U.S. Provisional Patent Application No. 63/041,293, filed June 19, 2020, U.S. Provisional Patent Application No. 63/041,424, filed June 19, 2020, and U.S. Provisional Patent Application No. 62/978,130, filed February 18, 2020, the contents of which are hereby incorporated by reference, in their entireties, for all purposes.
FIELD OF THE INVENTION
[0002] The present disclosure relates generally to the use of cell-free DNA sequencing data to provide clinical support for personalized treatment of cancer.
BACKGROUND
[0003] Precision oncology is the practice of tailoring cancer therapy to the unique genomic, epigenetic, and/or transcriptomic profile of an individual’s cancer. Personalized cancer treatment builds upon conventional therapeutic regimens used to treat cancer based only on the gross classification of the cancer, e.g., treating all breast cancer patients with a first therapy and all lung cancer patients with a second therapy. This field was born out of many observations that different patients diagnosed with the same type of cancer, e.g., breast cancer, responded very differently to common treatment regimens. Over time, researchers have identified genomic, epigenetic, and transcriptomic markers that improve predictions as to how an individual cancer will respond to a particular treatment modality.
[0004] There is growing evidence that cancer patients who receive therapy guided by their genetics have better outcomes. For example, studies have shown that targeted therapies result in significantly improved progression-free cancer survival. See, e.g., Radovich M. et al., Oncotarget, 7(35):56491-500 (2016). Similarly, reports from the IMPACT trial, a large (n = 1307) retrospective analysis of consecutive, prospectively molecularly profiled patients with advanced cancer who participated in a large, personalized medicine trial, indicate that patients receiving targeted therapies matched to their tumor biology had a response rate of 16.2%, as opposed to a response rate of 5.2% for patients receiving non-matched therapy. Tsimberidou AM et al., ASCO 2018, Abstract LBA2553 (2018).
[0005] In fact, therapy targeted to specific genomic alterations is already the standard of care in several tumor types, e.g., as suggested in the National Comprehensive Cancer Network (NCCN) guidelines for melanoma, colorectal cancer, and non-small cell lung cancer. In practice, implementation of these targeted therapies requires determining the status of the diagnostic marker in each eligible cancer patient. While this can be accomplished for the few, well known mutations associated with treatment recommendations in the NCCN guidelines using individual assays or small next generation sequencing (NGS) panels, the growing number of actionable genomic alterations and increasing complexity of diagnostic classifiers necessitates a more comprehensive evaluation of each patient’s cancer genome, epigenome, and/or transcriptome.
[0006] For instance, some evidence suggests that use of combination therapies where each component is matched to an actionable genomic alteration holds the greatest potential for treating individual cancers. To this point, a retroactive study of cancer patients treated with one or more therapeutic regimens revealed that patients who received therapies matched to a higher percentage of their genomic alterations experienced a greater frequency of stable disease (e.g., a longer time to recurrence), longer time to treatment failure, and greater overall survival. Wheler JJ et al., Cancer Res., 76:3690-701 (2016). Thus, comprehensive evaluation of each cancer patient’s genome, epigenome, and/or transcriptome should maximize the benefits provided by precision oncology, by facilitating more fine-tuned combination therapies, use of novel off-label drug indications, and/or tissue agnostic immunotherapy. See, for example, Schwaederle M. et al., J Clin Oncol., 33(32):3817-25 (2015); Schwaederle M. et al., JAMA Oncol., 2(11):1452-59 (2016); and Wheler JJ et al., Cancer Res., 76(13):3690-701 (2016). Further, the use of comprehensive next generation sequencing analysis of cancer genomes facilitates better access and a larger patient pool for clinical trial enrollment. Coyne GO et al., Curr. Probl. Cancer, 41(3):182-93 (2017); and Markman M., Oncology, 31(3):158, 168.
[0007] The use of large NGS genomic analysis is growing in order to address the need for more comprehensive characterization of an individual’s cancer genome. See, for example, Fernandes GS et al., Clinics, 72(10):588-94. Recent studies indicate that of the patients for which large NGS genomic analysis is performed, 30-40% then receive clinical care based on the assay results, which is limited by at least the identification of actionable genomic alterations, the availability of medication for treatment of identified actionable genomic alterations, and the clinical condition of the subject. See, Ross JS et al., JAMA Oncol., 1(1):40-49 (2015); Ross JS et al., Arch. Pathol. Lab Med., 139:642-49 (2015); Hirshfield KM et al., Oncologist, 21(11):1315-25 (2016); and Groisberg R. et al., Oncotarget, 8:39254-67 (2017).
[0008] However, these large NGS genomic analyses are conventionally performed on solid tumor samples. For instance, each of the studies referenced in the paragraph above performed NGS analysis of FFPE tumor blocks from patients. Solid tissue biopsies remain the gold standard for diagnosis and identification of predictive biomarkers because they represent well-known and validated methodologies that provide a high degree of accuracy. Nevertheless, there are significant limitations to the use of solid tissue material for large NGS genomic analyses of cancers. For example, tumor biopsies are subject to sampling bias caused by spatial and/or temporal genetic heterogeneity, e.g., between two regions of a single tumor and/or between different cancerous tissues (such as between primary and metastatic tumor sites or between two different primary tumor sites). Such intertumor or intratumor heterogeneity can cause sub-clonal or emerging mutations to be overlooked when using localized tissue biopsies, with the potential for sampling bias to be exacerbated over time as sub-clonal populations further evolve and/or shift in predominance.
[0009] Additionally, the acquisition of solid tissue biopsies often requires invasive surgical procedures, e.g., when the primary tumor site is located at an internal organ. These procedures can be expensive, time consuming, and carry a significant risk to the patient, e.g., when the patient’s health is poor and may not be able to tolerate invasive medical procedures and/or the tumor is located in a particularly sensitive or inoperable location, such as in the brain or heart. Further, the amount of tissue, if any, that can be procured depends on multiple factors, including the location of the tumor, the size of the tumor, the fragility of the patient, and the risk of comorbidities related to biopsies, such as bleeding and infections. For instance, recent studies report that tissue samples in a majority of advanced non-small cell lung cancer patients are limited to small biopsies and cannot be obtained at all in up to 31% of patients. Ilie and Hofman, Transl. Lung Cancer Res., 5(4):420-23 (2016). Even when a tissue biopsy is obtained, the sample may be too scant for comprehensive testing.
[0010] Further, the method of tissue collection, preservation (e.g., formalin fixation), and/or storage of tissue biopsies can result in sample degradation and variable quality DNA. This, in turn, leads to inaccuracies in downstream assays and analysis, including next-generation sequencing (NGS) for the identification of biomarkers. Ilie and Hofman, Transl. Lung Cancer Res., 5(4):420-23 (2016).
[0011] In addition, the invasive nature of the biopsy procedure, the time and cost associated with obtaining the sample, and the compromised state of cancer patients receiving therapy render repeat testing of cancerous tissues impracticable, if not impossible. As a result, solid tissue biopsy analysis is not amenable to many monitoring schemes that would benefit cancer patients, such as disease progression analysis, treatment efficacy evaluation, disease recurrence monitoring, and other techniques that require data from several time points.
[0012] Cell-free DNA (cfDNA) has been identified in various bodily fluids, e.g., blood serum, plasma, urine, etc. Chan et al., Ann. Clin. Biochem., 40(Pt 2):122-30 (2003). This cfDNA originates from necrotic or apoptotic cells of all types, including germline cells, hematopoietic cells, and diseased (e.g., cancerous) cells. Advantageously, genomic alterations in cancerous tissues can be identified from cfDNA isolated from cancer patients. See, e.g., Stroun et al., Oncology, 46(5):318-22 (1989); Goessl et al., Cancer Res., 60(21):5941-45 (2000); and Frenel et al., Clin. Cancer Res. 21(20):4586-96 (2015). Thus, one approach to overcoming the problems presented by the use of solid tissue biopsies described above is to analyze cell-free nucleic acids (e.g., cfDNA) and/or nucleic acids in circulating tumor cells present in biological fluids, e.g., via a liquid biopsy.
[0013] Specifically, liquid biopsies offer several advantages over conventional solid tissue biopsy analysis. For instance, because bodily fluids can be collected in a minimally invasive or non-invasive fashion, sample collection is simpler, faster, safer, and less expensive than solid tumor biopsies. Such methods require only small amounts of sample (e.g., 10 mL or less of whole blood per biopsy) and reduce the discomfort and risk of complications experienced by patients during conventional tissue biopsies. In fact, liquid biopsy samples can be collected with limited or no assistance from medical professionals and can be performed at almost any location. Further, liquid biopsy samples can be collected from any patient, regardless of the location of their cancer, their overall health, and any previous biopsy collection. This allows for analysis of the cancer genome of patients from whom a solid tumor sample cannot be easily and/or safely obtained. In addition, because cell-free DNA in the bodily fluids arises from many different types of tissues in the patient, the genomic alterations present in the pool of cell-free DNA are representative of various different clonal sub-populations of the cancerous tissue of the subject, facilitating a more comprehensive analysis of the cancerous genome of the subject than is possible from one or more sections of a single solid tumor sample.
[0014] Liquid biopsies also enable serial genetic testing prior to cancer detection, during the early stages of cancer progression, throughout the course of treatment, and during remission, e.g., to monitor for disease recurrence. The ability to conduct serial testing via non-invasive liquid biopsies throughout the course of disease could prove beneficial for many patients, e.g., through monitoring patient response to therapies, the emergence of new actionable genomic alterations, and/or drug-resistance alterations. These types of information allow medical professionals to more quickly tailor and update therapeutic regimens, e.g., facilitating more timely intervention in the case of disease progression. See, e.g., Ilie and Hofman, Transl. Lung Cancer Res., 5(4):420-23 (2016).
[0015] Nevertheless, while liquid biopsies are promising tools for improving outcomes using precision oncology, there are significant challenges specific to the use of cell-free DNA for evaluation of a subject’s cancer genome. For instance, there is a highly variable signal-to-noise ratio from one liquid biopsy sample to the next. This occurs because cfDNA originates from a variety of different cells in a subject, both healthy and diseased. Depending on the stage and type of cancer in any particular subject, the fraction of cfDNA fragments originating from cancerous cells (the “tumor fraction” or “ctDNA fraction” of the sample/subject) can range from almost 0% to well over 50%. Other factors, including tumor type and mutation profile, can also impact the amount of DNA released from cancerous tissues. For instance, cfDNA clearance through the liver and kidneys is affected by a variety of factors, including renal dysfunction or other tissue damaging factors (e.g., chemotherapy, surgery, and/or radiotherapy).
[0016] This, in turn, leads to problems detecting and/or validating cancer-specific genomic alterations in a liquid sample. This is particularly true during early stages of the disease — when cancer therapies have much higher success rates — because the tumor fraction in the patient is lowest at this point. Thus, early stage cancer patients can have ctDNA fractions below the limit of detection (LOD) for one or more informative genomic alterations, limiting clinical utility because of the risk of false negatives and/or providing an incomplete picture of the cancer genome of the patient. Further, because cancers, and even individual tumors, can be clonally diverse, actionable genomic alterations that arise in only a subset of clonal populations are diluted below the overall tumor fraction of the sample, further frustrating attempts to tailor combination therapies to the various actionable mutations in the patient’s cancer genome. Consequently, most studies using liquid biopsy samples to date have focused on late stage patients for assay validation and research.
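The dilution of sub-clonal alterations described above can be made concrete with a short worked calculation. This is a simplified sketch under stated assumptions: the function name, the copy-number parameters, and the 0.5% limit of detection used in the example are hypothetical and are not taken from this disclosure.

```python
# Hypothetical sketch: expected variant allele fraction (VAF) of a somatic
# variant in cfDNA, given the tumor fraction and the share of tumor cells
# (clonal fraction) that carry the variant.

def expected_vaf(tumor_fraction, clonal_fraction=1.0, mutant_copies=1,
                 tumor_copy_number=2, normal_copy_number=2):
    """Mutant allele copies divided by all allele copies at the locus."""
    total_copies = (tumor_fraction * tumor_copy_number
                    + (1.0 - tumor_fraction) * normal_copy_number)
    return tumor_fraction * clonal_fraction * mutant_copies / total_copies


ASSUMED_LOD = 0.005  # illustrative 0.5% limit of detection, not from the source

clonal_vaf = expected_vaf(0.02)                           # about 1.0% VAF
subclonal_vaf = expected_vaf(0.02, clonal_fraction=0.25)  # about 0.25% VAF

clonal_detectable = clonal_vaf >= ASSUMED_LOD
subclonal_detectable = subclonal_vaf >= ASSUMED_LOD
```

Under these assumptions, a heterozygous clonal variant in a sample with a 2% tumor fraction sits at roughly 1% VAF, while the same variant confined to a quarter of the tumor cells falls to roughly 0.25%, below the assumed limit of detection.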
[0017] Another challenge associated with liquid biopsies is the accurate determination of tumor fraction in a sample. This difficulty arises from at least the heterogeneity of cancers and the increased frequency of large chromosomal duplications and deletions found in cancers. As a result, the frequency of genomic alterations from cancerous tissues varies from locus to locus based on at least (i) their prevalence in different sub-clonal populations of the subject’s cancer, and (ii) their location within the genome, relative to large chromosomal copy number variations. The difficulty in accurately determining the tumor fraction of liquid biopsy samples affects accurate measurement of various cancer features shown to have diagnostic value for the analysis of solid tumor biopsies. These include allelic ratios, copy number variations, overall mutational burden, frequency of abnormal methylation patterns, etc., all of which are correlated with the percentage of DNA fragments that arise from cancerous tissue, as opposed to healthy tissue.
[0018] Altogether, these factors result in highly variable concentrations of ctDNA — from patient to patient and possibly from locus to locus — that confound accurate measurement of disease indicators and actionable genomic alterations. Further, the quantity and quality of cfDNA obtained from liquid biopsy samples are highly dependent on the particular methodology for collecting the samples, storing the samples, sequencing the samples, and standardizing the sequencing data.
[0019] While validation studies of existing liquid biopsy assays have shown high sensitivity and specificity, few studies have corroborated results with orthogonal methods, or between particular testing platforms, e.g., different NGS technologies and/or targeted panel sequencing versus whole genome/exome sequence. Reports of liquid biopsy-based studies are limited by comparison to non-comprehensive tissue testing algorithms including Sanger sequencing, small NGS hotspot panels, polymerase chain reaction (PCR), and fluorescent in situ hybridization (FISH), which may not contain all NCCN guideline genes in their reportable range, thus suffering in comparison to a more comprehensive liquid biopsy assay.
[0020] As an example, conventional liquid biopsy assays do not provide accurate classifications of copy number variations (CNVs) for genomic targets (e.g., biomarkers), where CNVs are a form of genomic alteration with known relevance to cancer. Conventional methodologies typically assign a genomic target to an integer copy number and/or one of three copy number states (e.g., amplified, neutral, or deleted) using a copy ratio cutoff above or below which an amplified or deleted status is called, respectively, or in which a neutral status is otherwise called. Such methodologies make these assignments based on the fact that at a given tumor fraction and a known ploidy, the copy number in a segment is positively correlated with its copy ratio and thus the copy ratio can be mathematically converted to an integer copy number. For example, one conventional method, ichorCNA, utilizes software that estimates tumor fraction in circulating cfDNA from ultra-low-pass whole genome sequencing, which is then used to determine genomic alterations such as copy number alterations. See, Adalsteinsson et al., Nat Commun., 8:1324 (2017).
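The conversion between copy ratio and integer copy number referred to above can be sketched as follows. This is a generic illustration, not the procedure of any cited tool: the function names, the default ploidy of 2, the linear (non-log) copy ratio, and the cutoffs of 0.8 and 1.2 are assumptions chosen for the example.

```python
# Hypothetical sketch: copy ratio <-> integer copy number at a given tumor
# fraction and ploidy, plus a three-state call from fixed ratio cutoffs.

def expected_copy_ratio(copy_number, tumor_fraction,
                        tumor_ploidy=2.0, normal_ploidy=2.0):
    """Copy ratio expected for a segment with the given tumor copy number."""
    segment = tumor_fraction * copy_number + (1.0 - tumor_fraction) * normal_ploidy
    genome = tumor_fraction * tumor_ploidy + (1.0 - tumor_fraction) * normal_ploidy
    return segment / genome


def copy_number_from_ratio(copy_ratio, tumor_fraction,
                           tumor_ploidy=2.0, normal_ploidy=2.0):
    """Invert expected_copy_ratio and round to the nearest non-negative integer."""
    if tumor_fraction <= 0.0:
        raise ValueError("tumor_fraction must be positive to solve for copy number")
    genome = tumor_fraction * tumor_ploidy + (1.0 - tumor_fraction) * normal_ploidy
    raw = (copy_ratio * genome - (1.0 - tumor_fraction) * normal_ploidy) / tumor_fraction
    return max(0, round(raw))


def call_state(copy_ratio, deleted_cutoff=0.8, amplified_cutoff=1.2):
    """Three-state call using assumed, illustrative ratio cutoffs."""
    if copy_ratio >= amplified_cutoff:
        return "amplified"
    if copy_ratio <= deleted_cutoff:
        return "deleted"
    return "neutral"


# The same four-copy segment at a high and at a low tumor fraction.
ratio_high_tf = expected_copy_ratio(4, 0.50)   # 1.5
ratio_low_tf = expected_copy_ratio(4, 0.01)    # about 1.01
copies = copy_number_from_ratio(ratio_high_tf, 0.50)
```

In this sketch the four-copy segment is called amplified at a 50% tumor fraction but neutral at 1%, and the recovered integer copy number depends directly on the tumor fraction supplied, which is why an inaccurate tumor fraction estimate propagates into the copy number call.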
[0021] However, this approach can be problematic due to the current challenges in accurately determining tumor fraction in liquid biopsy samples. For example, estimating the ctDNA fraction of total cell-free DNA in plasma can be difficult due to highly variable tumor fractions that can range from 0 to approximately 90%, and in many cases can be below 1% and/or below the limit of detection. See, Shigematsu and Koyama, Nihon Jinzo Gakkai Shi., 30(9): 1115-22 (1988). Methods based on mean, median, maximum or other point estimates of somatic variant allele fractions (VAFs) require the difficult task of accurate quantification and classification of somatic and germline variants in liquid biopsy samples, which can be further complicated by the absence of a matched normal sample or the presence of artifactual variants and/or clonal heterogeneity. In addition to the reliance on potentially inaccurate tumor fraction estimations, methods that utilize ultra-low-pass whole genome sequencing assays may be inappropriate for analyzing copy number variations from capture-based deep sequencing assays.
[0022] Additional challenges arise in cases where non-focal copy number variations are identified (e.g., where an entire chromosome or a large portion of a chromosome is amplified or deleted). Non-focal copy number variations are often difficult to interpret, as these large-scale copy number changes may represent real copy number variations or may be artifacts resulting from incorrect normalization due to low sample quality, capture failures, or other unknown issues during library preparation or sequencing. Because such large-scale copy number changes are unlikely to be associated with therapeutically actionable genomic alterations, the ability to differentiate between real and artifactual copy number variations is an important and unmet need in precision oncology applications. For example, two conventional methods that are insufficient to distinguish focal copy number variations from non-focal copy number variations include CNVkit and AVENIO. See, for example, Talevich et al., PLoS Comput Biol, 12:1004873 (2016), and Roche, “AVENIO ctDNA Expanded Kit,” (2018), the contents of which are incorporated herein by reference, in their entireties, for all purposes.
[0023] As another example, conventional liquid biopsy assays do not provide a method for accurately detecting variants (e.g., variant alleles) in ctDNA NGS assays. As described above, many patients may not have abundant ctDNA in early stage disease and may shed variants below the limit of detection (LOD) for ctDNA assays, resulting in false negatives. Detecting these variants at low circulating fractions is also technically challenging due to constraints of sequencing by synthesis. Additionally, differentiating between germline and somatic variants in ctDNA is difficult, as is differentiating between mutations derived from clonal hematopoiesis (CH) and the solid tumor being assayed. In such cases, mutations in hematopoietic lineage cells may be mistaken for tumor-derived mutations. Indeed, researchers have identified several genes frequently mutated in CH with potential importance in cancer, such as JAK2, TP53, GNAS, IDH2, and KRAS. See, Mayrhofer et al., 2018, “Cell-free DNA profiling of metastatic prostate cancer reveals microsatellite instability, structural rearrangements and clonal hematopoiesis,” Genome Med, (10), pg. 85; Hu et al., 2018, “False-Positive Plasma Genotyping Due to Clonal Hematopoiesis,” Clin Cancer Res, (24), pg. 4437.
[0024] Additionally, conventional liquid biopsy assays do not provide accurate circulating tumor fraction estimates (ctFEs). Accurate ctFEs provide several benefits to liquid biopsy applications, including classification of variants as somatic or germline, detection of clinically relevant copy number variations, and/or use of ctFEs as biomarkers.
[0025] For example, because up to 30% of breast cancer patients and up to 55% of lung cancer patients relapse after initial treatment, as well as a significant portion of patients in other cancer cohorts, the ability to detect metastasis and disease recurrence earlier in these patients could significantly improve patient outcomes. See, Colleoni et al., 2016, “Annual Hazard Rates of Recurrence for Breast Cancer During 24 Years of Follow-Up: Results From the International Breast Cancer Study Group Trials I to V,” J Clin Oncol, (34), pg. 927; Yates et al., 2017, “Genomic Evolution of Breast Cancer Metastasis and Relapse,” Cancer Cell, (32), pg. 169; Uramoto et al., 2014, “Recurrence after surgery in patients with NSCLC,” Transl Lung Cancer Res, (3), pg. 242; Taunk et al., 2017, “Immunotherapy and radiation therapy for operable early stage and locally advanced non-small cell lung cancer,” Transl Lung Cancer Res, (6), pg. 178. Indeed, recent retrospective and prospective studies have shown that ctDNA after completion of treatment or surgery can act as a biomarker for disease recurrence in many cancer types, including breast cancer, lung cancer, melanoma, bladder cancer, and colon cancer. See, Coombes et al., 2019, “Personalized Detection of Circulating Tumor DNA Antedates Breast Cancer Metastatic Recurrence,” Clin Cancer Res, (25), pg. 4255; Tie et al., 2019, “Circulating Tumor DNA Analyses as Markers of Recurrence Risk and Benefit of Adjuvant Therapy for Stage III Colon Cancer,” JAMA Oncol, print; McEvoy et al., 2019, “Monitoring melanoma recurrence with circulating tumor DNA: a proof of concept from three case studies,” Oncotarget, (10), pg. 113; Christensen et al., 2019, “Early Detection of Metastatic Relapse and Monitoring of Therapeutic Efficacy by Ultra-Deep Sequencing of Plasma Cell-Free DNA in Patients With Urothelial Bladder Carcinoma,” J Clin Oncol, (37), pg. 1547; Isaksson et al., 2019, “Pre-operative plasma cell-free circulating tumor DNA and serum protein tumor markers as predictors of lung adenocarcinoma recurrence,” Acta Oncol, (58), pg. 1079. Higher ctFEs are associated with disease progression at radiographic evaluation and an increased metastatic lesion count.
[0026] Furthermore, ctFEs correlate with important clinical outcomes, and provide a minimally invasive method to monitor patients for response to therapy, disease relapse, and disease progression. However, conventional methodologies used for determining ctFEs in liquid biopsy samples rely on low-pass, whole-genome sequencing, which cannot also be used for variant detection (see, for example, Adalsteinsson et al, “Scalable whole-exome sequencing of cell-free DNA reveals high concordance with metastatic tumors,” (2017) Nature Communications Nov 6;8(1):1324, doi:10.1038/s41467-017-00965-y; and ichorCNA, the Broad Institute, available on the internet at github.com/broadinstitute/ichorCNA). Other traditional approaches use variant allele fractions (VAFs) to estimate tumor fraction, but such approaches are confounded by variant tissue source and capture bias resulting in high levels of noise. Additionally, conventional methodologies for determining tumor purity estimates in solid tumor biopsy samples rely solely on on-target probe regions, which cannot be used in conjunction with targeted gene panels containing small numbers of genes.
[0027] The information disclosed in this Background section is only for enhancement of understanding of the general background of the invention and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person skilled in the art.
SUMMARY
[0028] Given the above background, there is a need in the art for improved methods and systems for supporting clinical decisions in precision oncology using liquid biopsy assays. In particular, there is a need in the art for improved methods and systems for identifying focal copy number variations in liquid biopsy assays. The present disclosure solves this and other needs in the art by providing improvements in validating copy number variation annotations, thus identifying focal copy number variations in genomic segments obtained from liquid biopsy assays. For example, by applying a plurality of amplification and/or deletion filters to a dataset comprising bin-level copy ratios, segment-level copy ratios, and segment-level confidence intervals for a plurality of bins and segments, respectively, the systems and methods described herein reject or validate a focal copy number status annotation at a locus that is potentially actionable using precision oncology.
[0029] For example, in one aspect, the present disclosure provides a method of validating a copy number variation in a test subject, at a computer system having one or more processors, and memory storing one or more programs for execution by the one or more processors. The method comprises obtaining a first dataset that comprises a plurality of bin-level sequence ratios, each respective bin-level sequence ratio in the plurality of bin-level sequence ratios corresponding to a respective bin in a plurality of bins. Each respective bin in the plurality of bins represents a corresponding region of a human reference genome, and each respective bin-level sequence ratio in the plurality of bin-level sequence ratios is determined from a sequencing of a plurality of cell-free nucleic acids in a first liquid biopsy sample of the test subject and one or more reference samples.
[0030] The first dataset also comprises a plurality of segment-level sequence ratios, each respective segment-level sequence ratio in the plurality of segment-level sequence ratios corresponding to a segment in a plurality of segments. Each respective segment in the plurality of segments represents a corresponding region of the human reference genome encompassing a subset of adjacent bins in the plurality of bins, and each respective segment-level sequence ratio in the plurality of segment-level sequence ratios is determined from a measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of adjacent bins encompassed by the respective segment.
[0031] The first dataset further comprises a plurality of segment-level measures of dispersion, where each respective segment-level measure of dispersion in the plurality of segment-level measures of dispersion (i) corresponds to a respective segment in the plurality of segments and (ii) is determined using the plurality of bin-level sequence ratios corresponding to the subset of adjacent bins encompassed by the respective segment.
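The relationship between bin-level values and the two segment-level quantities can be sketched as follows. This is a minimal illustration only: the median and the sample standard deviation are assumed here as the measure of central tendency and the measure of dispersion, respectively, and the function name and the index-pair representation of segments are hypothetical.

```python
from statistics import median, stdev


def summarize_segments(bin_ratios, segments):
    """Return a (central tendency, dispersion) pair for each segment.

    bin_ratios: bin-level sequence ratios ordered along the genome.
    segments:   (start, end) index pairs, end exclusive, each covering a run
                of adjacent bins.
    Assumption: median and sample standard deviation stand in for the
    unspecified measures of central tendency and dispersion.
    """
    summaries = []
    for start, end in segments:
        ratios = bin_ratios[start:end]
        center = median(ratios)
        # A single-bin segment has no spread to estimate.
        spread = stdev(ratios) if len(ratios) > 1 else 0.0
        summaries.append((center, spread))
    return summaries
```

Other choices (for example, a mean with a bootstrap confidence interval) would fit the same description; the sketch only shows that both segment-level values derive from the bins the segment encompasses.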
[0032] In this aspect, the method comprises validating a copy number status annotation of a respective segment in the plurality of segments that is annotated with a copy number variation by applying the first dataset to an algorithm having a plurality of filters. A first filter in the plurality of filters is a measure of central tendency bin-level sequence ratio filter that is fired when a measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of bins encompassed by the respective segment fails to satisfy one or more bin-level sequence ratio thresholds. A second filter in the plurality of filters is a confidence filter that is fired when the segment-level measure of dispersion corresponding to the respective segment fails to satisfy a confidence threshold. A third filter in the plurality of filters is a measure of central tendency-plus-deviation bin-level sequence ratio filter that is fired when a measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of bins encompassed by the respective segment fails to satisfy one or more measure of central tendency-plus-deviation bin-level sequence ratio thresholds.
In the third filter, the one or more measure of central tendency-plus-deviation bin-level sequence ratio thresholds are derived from (i) a measure of central tendency of the bin-level sequence ratios corresponding to the plurality of bins that map to the same chromosome of the human reference genome as the respective segment, and (ii) a measure of dispersion across the bin-level sequence ratios corresponding to the plurality of bins that map to the respective chromosome.
[0033] When a filter in the plurality of filters is fired, the copy number status annotation of the respective segment is rejected; and when no filter in the plurality of filters is fired, the copy number status annotation of the respective segment is validated.
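The reject-or-validate logic of the three filters can be sketched for an amplification call as shown below. This is a hedged illustration, not the disclosed implementation: the `Segment` class, the filter names, the use of median and sample standard deviation, the direction of each comparison, and all threshold parameters are assumptions introduced for the example.

```python
from dataclasses import dataclass
from statistics import median, stdev


@dataclass
class Segment:
    bin_ratios: list             # bin-level ratios for bins inside the segment
    dispersion: float            # segment-level measure of dispersion
    chromosome_bin_ratios: list  # bin-level ratios for all bins on the chromosome


def validate_amplification(seg, ratio_threshold, confidence_threshold,
                           n_deviations):
    """Return (validated, fired_filters) for a segment annotated as amplified.

    Assumed conventions: a filter "fires" when its condition is not met, and
    any fired filter rejects the annotation.
    """
    fired = []
    center = median(seg.bin_ratios)

    # Filter 1: central tendency of the segment's bins must reach a cutoff.
    if center < ratio_threshold:
        fired.append("central_tendency")

    # Filter 2: segment-level dispersion must be tight enough.
    if seg.dispersion > confidence_threshold:
        fired.append("confidence")

    # Filter 3: the segment must stand out from its own chromosome, using a
    # threshold of chromosome central tendency plus a multiple of dispersion.
    chrom_center = median(seg.chromosome_bin_ratios)
    chrom_spread = stdev(seg.chromosome_bin_ratios)
    if center < chrom_center + n_deviations * chrom_spread:
        fired.append("central_tendency_plus_deviation")

    return (not fired, fired)
```

In this sketch, a segment whose ratios match an amplification spanning its whole chromosome passes the first two filters but fires the third, because its central tendency does not exceed the chromosome-wide baseline. That is the behavior expected when separating focal from non-focal events.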
[0034] In another aspect, the present disclosure provides a method for treating a patient with a cancer containing a copy number variation of a target gene. The method comprises determining whether the patient has an aggressive form of cancer associated with a focal copy number variation of the target gene by obtaining a first biological sample of the cancer from the patient and performing copy number variation analysis on the first biological sample to identify the copy number status of the target gene in the cancer.
[0035] The copy number variation analysis generates a first dataset comprising a plurality of bin-level sequence ratios, each respective bin-level sequence ratio in the plurality of bin-level sequence ratios corresponding to a respective bin in a plurality of bins. Each respective bin in the plurality of bins represents a corresponding region of a human reference genome, and each respective bin-level sequence ratio in the plurality of bin-level sequence ratios is determined from a sequencing of a plurality of nucleic acids in the first biological sample of the cancer from the patient and one or more reference samples.
[0036] The first dataset also comprises a plurality of segment-level sequence ratios, each respective segment-level sequence ratio in the plurality of segment-level sequence ratios corresponding to a segment in a plurality of segments. Each respective segment in the plurality of segments represents a corresponding region of the human reference genome encompassing a subset of adjacent bins in the plurality of bins, and the plurality of segment-level sequence ratios is determined from a measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of adjacent bins encompassed by the respective segment.
[0037] The first dataset further comprises a plurality of segment-level measures of dispersion, where each respective segment-level measure of dispersion in the plurality of segment-level measures of dispersion (i) corresponds to a respective segment in the plurality of segments and (ii) is determined using the plurality of bin-level sequence ratios corresponding to the subset of adjacent bins encompassed by the respective segment.
[0038] The method further comprises determining whether the copy number variation of the target gene is a focal copy number variation by applying the first dataset to an algorithm having a plurality of copy number variation filters. When the patient has the aggressive form of cancer associated with focal copy number variation of the target gene, a first therapy for the aggressive form of the cancer to the patient is administered, and when the patient does not have the aggressive form of cancer associated with focal copy number variation of the target gene, a second therapy for a less aggressive form of the cancer to the patient is administered.
[0039] Additionally, there is a need in the art for improved methods and systems for identifying somatic tumor mutations in cell-free DNA, particularly where the sample has low tumor fractions. Advantageously, the present disclosure solves this and other needs in the art by providing improved somatic variant identification methodology that better accounts for locus-specific and/or sample-specific considerations to more accurately identify true somatic mutations in a liquid biopsy sample. For example, by using an application of Bayes' theorem to account for one or more of (i) the prevalence of variants at a specific locus in a specific cancer type, (ii) the variant allele fraction for the variant being evaluated, (iii) the prevalence of sequencing errors at a particular locus, and (iv) the actual sequencing error rate of a particular reaction, the variant filter methodologies described herein tune the specificity and sensitivity of variant count thresholds in a locus-specific fashion to achieve higher accuracy of true somatic variant calling in a liquid biopsy assay.
[0040] For example, in one aspect, the present disclosure provides a method of validating a somatic sequence variant in a test subject having a cancer condition. The method is performed at a computer system having one or more processors, and memory storing one or more programs for execution by the one or more processors. The method includes obtaining, from a first sequencing reaction, a corresponding sequence of each cell-free DNA fragment in a first plurality of cell-free DNA fragments in a liquid biopsy sample of the test subject, thus obtaining a first plurality of sequence reads. Each respective sequence read in the first plurality of sequence reads is aligned to a reference sequence for the species of the subject, thus identifying a variant allele fragment count for a candidate variant that maps to a locus in the reference sequence, and a locus fragment count for the locus encompassing the candidate variant.
[0041] The method further includes comparing the variant allele fragment count for the candidate variant against a dynamic variant count threshold for the locus in the reference sequence that the candidate variant maps to. The dynamic variant count threshold is based upon a pre-test odds of a positive variant call for the locus based on the prevalence of variants in a genomic region that includes the locus from a first set of nucleic acids obtained from a cohort of subjects having the cancer condition.
[0042] The method then includes rejecting or validating the variant as a true somatic variant based upon the dynamic variant count threshold. For instance, when the variant allele fragment count for the candidate variant satisfies the dynamic variant count threshold for the locus, the presence of the somatic sequence variant in the test subject is validated. And when the variant allele fragment count for the candidate variant does not satisfy the dynamic variant count threshold for the locus, the presence of the somatic sequence variant in the test subject is rejected.
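One way a locus-specific threshold could follow from pre-test odds is sketched below. This is an illustrative model only: it assumes a binomial likelihood ratio between a true variant at an expected allele fraction and a sequencing error at a given rate, and it picks the smallest variant fragment count whose posterior probability reaches a target. The function names, parameters, and the 0.99 default are hypothetical and are not taken from the disclosure.

```python
import math


def dynamic_count_threshold(locus_depth, prevalence, expected_vaf,
                            error_rate, min_posterior=0.99):
    """Smallest variant fragment count that reaches the target posterior.

    prevalence:   assumed prior probability of a true variant at this locus
                  in the relevant cancer cohort (gives the pre-test odds).
    expected_vaf: assumed allele fraction of a true variant.
    error_rate:   assumed per-fragment error rate at this locus.
    Returns None if no count up to locus_depth reaches the target.
    """
    if not 0.0 < error_rate < expected_vaf < 1.0:
        raise ValueError("require 0 < error_rate < expected_vaf < 1")
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must be strictly between 0 and 1")

    log_prior_odds = math.log(prevalence / (1.0 - prevalence))
    log_target_odds = math.log(min_posterior / (1.0 - min_posterior))
    # Binomial coefficients cancel in the likelihood ratio, so only the
    # per-fragment log terms are needed.
    per_variant = math.log(expected_vaf / error_rate)
    per_reference = math.log((1.0 - expected_vaf) / (1.0 - error_rate))

    for count in range(1, locus_depth + 1):
        log_lr = count * per_variant + (locus_depth - count) * per_reference
        if log_prior_odds + log_lr >= log_target_odds:
            return count
    return None


def validate_variant(variant_count, threshold):
    """Validate the candidate when its count satisfies the threshold."""
    return threshold is not None and variant_count >= threshold
```

Under these assumptions, a locus that is frequently mutated in the cohort (higher pre-test odds) requires fewer supporting fragments than a rarely mutated locus at the same depth and error rate, which is the sense in which the threshold is dynamic.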
[0043] Additionally, there is a need in the art for improved methods and systems for determining accurate circulating tumor fraction estimates (ctFEs) in liquid biopsy assays. The present disclosure solves this and other needs in the art by providing methods and systems for estimating the circulating tumor fraction of a liquid biopsy sample from a targeted-panel sequencing reaction. For example, by fitting segment-level coverage ratios for on-target and off-target sequence reads distributed relatively uniformly along the genome to integer copy states across a range of simulated tumor fractions (e.g., using maximum likelihood estimation, for example, with an expectation-maximization algorithm), the systems and methods described herein can generate an accurate estimate of the circulating tumor fraction of a liquid biopsy sample. This is achieved, in some embodiments, by identifying the expected coverage ratios, given the fitted integer copy states, that best match the experimental coverage ratios. Such an accurate estimate of the circulating tumor fraction can be used in conjunction with on-target sequencing results to improve variant detection, as well as serve as an informative biomarker itself.
[0044] For example, in one aspect, the present disclosure provides a method of estimating a circulating tumor fraction for a test subject from panel-enriched sequencing data for a plurality of sequences, at a computer system having one or more processors, and memory storing one or more programs for execution by the one or more processors.
[0045] The method includes obtaining, from a first panel-enriched sequencing reaction, a first plurality of sequences. The plurality of sequences includes a corresponding sequence for each cell-free DNA fragment in a first plurality of cell-free DNA fragments obtained from a liquid biopsy sample from the test subject, wherein each respective cell-free DNA fragment in the first plurality of cell-free DNA fragments corresponds to a respective probe sequence in a plurality of probe sequences used to enrich cell-free DNA fragments in the liquid biopsy sample in the first panel-enriched sequencing reaction.
[0046] The first plurality of sequences also includes a corresponding sequence for each cell-free DNA fragment in a second plurality of cell-free DNA fragments obtained from the liquid biopsy sample, wherein each respective cell-free DNA fragment in the second plurality of DNA fragments does not correspond to any probe sequence in the plurality of probe sequences.
[0047] The method includes determining a plurality of bin-level coverage ratios from the plurality of sequences, each respective bin-level coverage ratio in the plurality of bin-level coverage ratios corresponding to a respective bin in a plurality of bins. Each respective bin in the plurality of bins represents a corresponding region of a human reference genome. Each respective bin-level coverage ratio in the plurality of bin-level coverage ratios is determined from a comparison of (i) a number of sequence reads in the plurality of sequences that map to the corresponding bin and (ii) a number of sequence reads from one or more reference samples that map to the corresponding bin.
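The comparison of sample and reference read counts per bin can be sketched as a depth-normalized log2 ratio. The normalization by total read count, the log2 scale, and the pseudocount are assumptions made for illustration; the function name is hypothetical.

```python
import math


def bin_coverage_ratios(sample_counts, reference_counts, pseudocount=0.5):
    """Depth-normalized log2 coverage ratio for each bin.

    sample_counts / reference_counts: reads mapped to each bin, in the same
    bin order. The pseudocount (an assumption) avoids taking log2 of zero.
    """
    if len(sample_counts) != len(reference_counts):
        raise ValueError("sample and reference must cover the same bins")
    sample_total = sum(sample_counts)
    reference_total = sum(reference_counts)
    ratios = []
    for sample, reference in zip(sample_counts, reference_counts):
        sample_norm = (sample + pseudocount) / sample_total
        reference_norm = (reference + pseudocount) / reference_total
        ratios.append(math.log2(sample_norm / reference_norm))
    return ratios
```

With this convention, a bin whose share of reads matches the reference has a ratio of zero, and a bin with relatively more reads than the reference has a positive ratio.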
[0048] The method further includes determining a plurality of segment-level coverage ratios by forming a plurality of segments by grouping respective subsets of adjacent bins in the plurality of bins based on a similarity between the respective coverage ratios of the subset of adjacent bins, and determining, for each respective segment in the plurality of segments, a segment-level coverage ratio based on the corresponding bin-level coverage ratios for each bin in the respective segment.
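Grouping adjacent bins by similarity can be illustrated with a simple greedy pass. This is a deliberately minimal stand-in: production tools typically use statistical segmentation such as circular binary segmentation, and the `max_jump` parameter, the use of the median, and the function name are assumptions for this sketch.

```python
from statistics import median


def segment_bins(bin_ratios, max_jump=0.3):
    """Greedily group adjacent bins into segments of similar coverage ratio.

    Returns (start, end, level) tuples, end exclusive, where level is the
    median ratio of the bins in the segment. A new segment starts when a bin
    differs from the running segment median by more than max_jump.
    """
    if not bin_ratios:
        return []
    segments = []
    start = 0
    for i in range(1, len(bin_ratios)):
        current_level = median(bin_ratios[start:i])
        if abs(bin_ratios[i] - current_level) > max_jump:
            segments.append((start, i, current_level))
            start = i
    segments.append((start, len(bin_ratios), median(bin_ratios[start:])))
    return segments
```

The segment-level coverage ratio here is the median of the member bins, which is one possible choice for a value based on the corresponding bin-level coverage ratios.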
[0049] For each respective simulated circulating tumor fraction in a plurality of simulated circulating tumor fractions, the method includes fitting each respective segment in the plurality of segments to a respective integer copy state in a plurality of integer copy states, by identifying the respective integer copy state in the plurality of integer copy states that best matches the segment-level coverage ratio, thus generating, for each respective simulated circulating tumor fraction in the plurality of simulated tumor fractions, a respective set of integer copy states for the plurality of segments.
[0050] The method further includes determining the circulating tumor fraction for the test subject based on a comparison between the corresponding segment-level coverage ratios and integer copy states across the plurality of simulated circulating tumor fractions. In some embodiments, the comparison includes optimization of an error between corresponding segment-level coverage ratios and integer copy states across the plurality of simulated circulating tumor fractions. In some embodiments, the comparison includes finding two or more local optima for fit (e.g., local minima for an error between corresponding segment-level coverage ratios and integer copy states across the plurality of simulated circulating tumor fractions) and choosing the local optimum (e.g., minimum) that is most consistent with one or more alternative estimations of the tumor fraction.
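The fit across simulated tumor fractions can be sketched as a grid search. This is a simplified illustration under stated assumptions: coverage ratios are on a linear scale normalized to a diploid reference, the expected ratio follows a two-component mixture, the error is a sum of squared differences, and the copy state set is capped at four copies. The disclosure mentions maximum likelihood estimation with expectation-maximization; the least-squares grid search here is a swapped-in, simpler technique, and all names are hypothetical.

```python
def expected_ratio(copy_state, tumor_fraction):
    """Expected linear coverage ratio under a diploid-normalized mixture."""
    return (tumor_fraction * copy_state + (1.0 - tumor_fraction) * 2) / 2


def fit_tumor_fraction(segment_ratios, simulated_fractions,
                       copy_states=(0, 1, 2, 3, 4)):
    """Return (best_fraction, best_states, errors_by_fraction).

    For each simulated fraction, every segment is assigned the integer copy
    state whose expected ratio is closest to the observed ratio; the fraction
    with the smallest total squared error is selected.
    """
    errors = {}
    fits = {}
    for fraction in simulated_fractions:
        states = []
        total_error = 0.0
        for ratio in segment_ratios:
            best = min(copy_states,
                       key=lambda c: abs(expected_ratio(c, fraction) - ratio))
            states.append(best)
            total_error += (expected_ratio(best, fraction) - ratio) ** 2
        errors[fraction] = total_error
        fits[fraction] = states
    best_fraction = min(errors, key=errors.get)
    return best_fraction, fits[best_fraction], errors
```

The returned error profile may contain more than one local minimum, because different combinations of tumor fraction and copy states can explain the same ratios nearly equally well. Selecting among such minima using an independent estimate of the tumor fraction, as described above, is not implemented in this sketch.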
[0051] Additional aspects and advantages of the present disclosure will become readily apparent to those skilled in this art from the following detailed description, wherein only illustrative embodiments of the present disclosure are shown and described. As will be realized, the present disclosure is capable of other and different embodiments, and its several details are capable of modifications in various obvious respects, all without departing from the disclosure. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.
BRIEF DESCRIPTION OF THE DRAWINGS
[0052] Figures 1A, 1B, 1C1, 1D1, 1C2, 1D2, 1E2, 1F2, 1C3, and 1D3 collectively illustrate a block diagram of an example computing device for supporting clinical decisions in precision oncology using liquid biopsy assays (e.g., by validating a copy number variation, validating a somatic sequence variant in a test subject having a cancer condition, estimating the circulating tumor fraction of a liquid biopsy sample based on on-target and off-target sequence reads from targeted-panel sequencing data, etc.), in which dashed boxes represent optional portions of the method, in accordance with some embodiments of the present disclosure.
[0053] Figure 2A illustrates an example workflow for generating a clinical report based on information generated from analysis of one or more patient specimens, in accordance with some embodiments of the present disclosure.
[0054] Figure 2B illustrates an example of a distributed diagnostic environment for collecting and evaluating patient data for the purpose of precision oncology, in accordance with some embodiments of the present disclosure.
[0055] Figure 3 provides an example flow chart of processes and features for liquid biopsy sample collection and analysis for use in precision oncology, in which dashed boxes represent optional portions of the method, in accordance with some embodiments of the present disclosure.
[0056] Figures 4A, 4B, 4C, 4D, 4E, 4F1, 4F2, 4G1, 4G2, 4G3, and 4F3 collectively illustrate an example bioinformatics pipeline for precision oncology. Figure 4A provides an overview flow chart of processes and features in a bioinformatics pipeline, in accordance with some embodiments of the present disclosure. Figure 4B provides an overview of a bioinformatics pipeline executed with either a liquid biopsy sample alone or a liquid biopsy sample and a matched normal sample. Figure 4C illustrates that paired end reads from tumor and normal isolates are zipped and stored separately under the same order identifier, in accordance with some embodiments of the present disclosure. Figure 4D illustrates quality correction for FASTQ files, in accordance with some embodiments of the present disclosure. Figure 4E illustrates processes for obtaining tumor and normal BAM alignment files, in accordance with some embodiments of the present disclosure. Figure 4F1 provides a flow chart of a method for validating a copy number variation, in which dashed boxes represent optional portions of the method, in accordance with some embodiments of the present disclosure. Figure 4F2 provides a flow chart of a method for validating a somatic sequence variant in a test subject having a cancer condition, in which dashed boxes represent optional portions of the method, in accordance with some embodiments of the present disclosure. Figures 4G1, 4G2, and 4G3 illustrate a method of variant detection, in which dashed boxes represent optional portions of the method, in accordance with some embodiments of the present disclosure. Figure 4F3 provides an overview of a method for estimating the circulating tumor fraction for a liquid biopsy sample, based on targeted panel sequencing data, in which dashed boxes represent optional portions of the method, in accordance with some embodiments of the present disclosure.
[0057] Figures 5A1, 5B1, 5C1, 5D1, and 5E1 collectively provide a flow chart of processes and features for validating a copy number variation in a test subject, in which dashed boxes represent optional portions of the method, in accordance with some embodiments of the present disclosure.
[0058] Figures 5A2 and 5B2 collectively provide a flow chart of processes and features for validating a somatic sequence variant in a test subject, in which dashed boxes represent optional portions of the method, in accordance with some embodiments of the present disclosure.
[0059] Figures 5A3 and 5B3 collectively provide a flow chart of processes and features for estimating the circulating tumor fraction of a liquid biopsy sample based on on-target and off-target sequence reads from a targeted-panel sequencing data, in which dashed boxes represent optional portions of the method, in accordance with some embodiments of the present disclosure.
[0060] Figures 6A1, 6B1, and 6C1 collectively provide a flow chart of processes and features for treating a patient with a cancer containing a copy number variation of a target gene, in which dashed boxes represent optional portions of the method, in accordance with some embodiments of the present disclosure.
[0061] Figure 6A2 illustrates a flow chart of a method for obtaining a distribution of variant detection sensitivities as a function of circulating variant allele fraction from a cohort of subjects, in accordance with some embodiments of the present disclosure.
[0062] Figures 6A3, 6B3, and 6C3 collectively illustrate a process for fitting segment-level coverage ratios to an integer copy number (6A3 and 6B3) and subsequently determining the error associated with the fit (6C3) at a particular simulated circulating tumor fraction, in accordance with some embodiments of the present disclosure.
[0063] Figures 7A1 and 7B1 illustrate a non-focal amplified segment and a focal amplified segment comprising the MYC gene, in accordance with some embodiments of the present disclosure.
[0064] Figure 7C1 illustrates a focal deleted segment comprising the BRCA2 gene, in accordance with some embodiments of the present disclosure.
[0065] Figures 7A2 and 7B2 collectively illustrate a method of inferring an effect of a sequence variant as a gain-of-function or a loss-of-function of a gene, in accordance with some embodiments of the present disclosure.
[0066] Figure 7A3 illustrates an overview of an experimental and analytical workflow used for validation of the performance of a method for estimating the circulating tumor fraction of a liquid biopsy sample based on on-target and off-target sequence reads from a targeted-panel sequencing data, in accordance with some embodiments of the present disclosure.
[0067] Figures 8A, 8B, 8C, and 8D collectively illustrate results of an inter-assay comparison between a liquid biopsy assay, a digital droplet polymerase chain reaction (ddPCR), and a solid-tumor biopsy assay, in accordance with various embodiments of the present disclosure.
[0068] Figures 9A, 9B, 9C, 9D, 9E, 9F, 9G, and 9H collectively illustrate results of a comparison between circulating tumor fraction estimate (ctFE) and variant allele fraction (VAF) using an Off-Target Tumor Estimation Routine (OTTER) method, in accordance with various embodiments of the present disclosure.
[0069] Figures 10A and 10B collectively illustrate results of evaluating ctFE and mutational landscape according to cancer type, in accordance with various embodiments of the present disclosure.
[0070] Figures 11A, 11B, and 11C collectively illustrate results of evaluating associations between ctFE and advanced disease states, in accordance with various embodiments of the present disclosure.
[0071] Figures 12A, 12B, and 12C collectively illustrate results of comparing ctFE with recent clinical response outcomes, in accordance with various embodiments of the present disclosure.
[0072] Figure 13 illustrates a first table describing sensitivity for all SNVs, indels, CNVs, and rearrangements targeted in reference samples, in accordance with various embodiments of the present disclosure.
[0073] Figure 14 illustrates a second table describing sensitivity for all SNVs, indels, CNVs, and rearrangements targeted in reference samples, in accordance with various embodiments of the present disclosure.
[0074] Figure 15 illustrates a third table describing comparisons between the presently disclosed liquid biopsy assay and a commercial liquid biopsy kit, in accordance with various embodiments of the present disclosure.
[0075] Figures 16A, 16B, and 16C collectively illustrate a fourth table describing variants detected by a liquid biopsy assay, in accordance with various embodiments of the present disclosure.
[0076] Figure 17 illustrates a fifth table describing dynamic filtering methodology to further reduce discordance, in accordance with various embodiments of the present disclosure.
[0077] Figure 18 illustrates a sixth table describing cancer groups included in clinical profiling analysis, in accordance with various embodiments of the present disclosure.
[0078] Figure 19 illustrates an example plot of the errors between corresponding segment-level coverage ratios and integer copy states determined across a plurality of simulated circulating tumor fractions ranging from about 0 to about 1, in accordance with some embodiments of the disclosure.
[0079] Like reference numerals refer to corresponding parts throughout the several views of the drawings.
DETAILED DESCRIPTION
Introduction
[0080] As described above, conventional liquid biopsy assays do not provide accurate determination of copy number variations (CNVs) for actionable genomic targets, particularly focal amplifications. For example, some conventional methodologies determine copy number variations by mathematically converting copy ratios (e.g., of experimental samples compared to reference samples) to integer copy numbers based on tumor fraction estimates and known ploidy. These approaches have disadvantages due to the presence of artifactual variants and/or clonal heterogeneity in liquid biopsy samples, leading to unreliable tumor fraction estimates and, subsequently, unreliable copy number annotations. Furthermore, the identification of therapeutically actionable copy number variations is limited when using conventional methods because many large-scale (e.g., non-focal) copy number variations contain artifactual variants due to errors in normalization, poor sample quality, and/or other technical issues.
[0081] Thus, there is a need in the art for improved methods of validating CNV calls in order to distinguish between real and artifactual copy number variations. Specifically, there is a need in the art for a method of detecting focal copy number variations, e.g., in order to identify therapeutically actionable genomic alterations.
[0082] Advantageously, disclosed herein are methods and systems that do provide accurate determination of copy number variations by detecting actionable, focal copy number variations in circulating tumor DNA (ctDNA) with high confidence without the need for tumor fraction estimation. For example, in some embodiments, the methods and systems described herein utilize annotation and filtering that applies a statistical method to bin-level copy ratios, segment-level copy ratios, and corresponding segment-level confidence intervals of binned and segmented sequence reads aligned to a reference genome. The statistical method filters out segments with non-focal copy number variations, which are either non-actionable, e.g., in the case of a copy number variation spanning a significant portion of a chromosome, or artifactual, e.g., due to incorrect data normalization.
[0083] As an example, Figure 4F1 illustrates a workflow of a method 400-1 for validating copy number variation, e.g., to identify therapeutically actionable genomic alterations, in accordance with some embodiments of the present disclosure.
[0084] In some embodiments, the methods described herein utilize conventional methodologies to putatively identify copy number variations, which are then validated using the methodologies described herein. For instance, in some embodiments, copy number variations (CNVs) are analyzed using a combination of an open-source tool, e.g., CNVkit, to putatively identify copy number variations, and a script, e.g., a Python script, to validate or reject the putative copy number variations, using the validation methodologies described herein. In other embodiments, the validation methodologies described herein are used to identify focal copy number variations independently of conventional bioinformatics tools, e.g., CNVkit.
[0085] As described herein, in some embodiments, the methods described herein include one or more data collection steps, in addition to data analysis and downstream steps. For example, as described below, e.g., with reference to Figures 2 and 3, in some embodiments, the methods include collection of a liquid biopsy sample and, optionally, one or more matching biological samples from the subject (e.g., a matched cancerous and/or matched non-cancerous sample from the subject). Likewise, as described below, e.g., with reference to Figures 2 and 3, in some embodiments, the methods include extraction of DNA from the liquid biopsy sample and, optionally, one or more matching biological samples from the subject (e.g., a matched cancerous and/or matched non-cancerous sample from the subject). Similarly, as described below, e.g., with reference to Figures 2 and 3, in some embodiments, the methods include nucleic acid sequencing of DNA from the liquid biopsy sample and, optionally, one or more matching biological samples from the subject (e.g., a matched cancerous and/or matched non-cancerous sample from the subject).
[0086] However, in other embodiments, the methods described herein begin with obtaining nucleic acid sequencing results, e.g., raw or collapsed sequence reads of DNA from a liquid biopsy sample and, optionally, one or more matching biological samples from the subject (e.g., a matched cancerous and/or matched non-cancerous sample from the subject), from which the statistics needed for focal CNV validation (e.g., bin-level sequence ratios, segment-level sequence ratios, and segment-level measures of dispersion) can be determined. For example, in some embodiments, sequencing data 122 for a patient 121 is accessed and/or downloaded over network 105 by system 100.
[0087] Likewise, in some embodiments, the methods described herein begin with obtaining genomic bin values (e.g., bin counts or bin coverages) for a sequencing of a liquid biopsy sample and, optionally, one or more matching biological samples from the subject (e.g., a matched cancerous and/or matched non-cancerous sample from the subject), from which the statistics needed for focal CNV validation (e.g., bin-level sequence ratios, segment-level sequence ratios, and segment-level measures of dispersion) can be determined. For example, in some embodiments, genomic bin values 135-cf-bv for a patient 121 are accessed and/or downloaded over network 105 by system 100.
[0088] Similarly, in some embodiments, the methods described herein begin with obtaining the statistics needed for focal CNV validation (e.g., bin-level sequence ratios, segment-level sequence ratios, and segment-level measures of dispersion) for a sequencing of a liquid biopsy sample and, optionally, one or more matching biological samples from the subject (e.g., a matched cancerous and/or matched non-cancerous sample from the subject), e.g., as an output of a conventional bioinformatics tool (such as CNVkit). For example, in some embodiments, bin-level sequence ratios 135-cf-br, segment-level sequence ratios 135-cf-sr, and segment-level measures of dispersion for a patient 121 are accessed and/or downloaded over network 105 by system 100.
[0089] Referring again to method 400-1 in Figure 4F1, in some embodiments, the method includes obtaining a dataset including cell-free DNA sequencing data (Block 402-1), and determining the statistics needed for focal CNV validation (e.g., bin-level sequence ratios, segment-level sequence ratios, and segment-level measures of dispersion). For instance, in some embodiments, system 100 obtains sequencing data 122 (e.g., sequence reads 123 and/or aligned sequences 124) and applies a copy number segmentation algorithm 153-b (e.g., CNVkit) to the sequencing data.
[0090] For example, in some embodiments, sequence reads 123 obtained from the sequencing dataset 122 are aligned to a reference human construct (Block 404-1), generating a plurality of aligned reads 124 (Block 406-1). Aligned cfDNA sequence reads are then optionally processed (e.g., using normalization, filtering, and/or quality control) (Block 408-1).
[0091] A copy number segmentation algorithm 153-b is then used for genomic region binning, coverage calculation, bias correction, normalization to a reference pool, segmentation, and/or visualization (Block 410-1). For example, in some embodiments, aligned sequence reads are sorted into bins (e.g., on-target bins 153-b-1-a and off-target bins 153-b-1-b) of pre-specified bin sizes (e.g., 100-150 base pairs) based on their genomic location using binning subroutine 153-b-1. For example, in some embodiments, binning subroutine 153-b-1 reads in mapped sequences 124 and pre-selected bins (e.g., target bins 153-b-1-a and off-target bins 153-b-1-b for target panel sequencing analysis) and assigns respective sequences to the bins based on their mapping within the reference genome. Bin values 135-bv (e.g., liquid biopsy genomic bin values 135-cf-bv) for each of the bins, e.g., bin counts or bin coverages, can be read out from binning subroutine 153-b-1. Bin values 135-bv are optionally pre-processed, e.g., normalized, standardized, corrected, etc., as described in further detail herein.
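As a minimal sketch of the binning step described above, the following Python example assigns mapped read positions to pre-selected bins and returns a count per bin. The function name `count_reads_per_bin`, the half-open `(chrom, start, end)` bin representation, and the use of a single mapping position per read are illustrative assumptions and are not taken from the disclosure or from CNVkit.

```python
from bisect import bisect_right


def count_reads_per_bin(bins, read_positions):
    """Count reads falling in each bin (hypothetical helper).

    bins: list of non-overlapping (chrom, start, end) half-open intervals.
    read_positions: iterable of (chrom, pos) mapping positions.
    Returns a list of counts in the same order as `bins`.
    """
    # Group bins by chromosome and sort by start for binary search.
    by_chrom = {}
    for idx, (chrom, start, end) in enumerate(bins):
        by_chrom.setdefault(chrom, []).append((start, end, idx))
    for intervals in by_chrom.values():
        intervals.sort()
    starts = {c: [s for s, _, _ in v] for c, v in by_chrom.items()}

    counts = [0] * len(bins)
    for chrom, pos in read_positions:
        if chrom not in by_chrom:
            continue  # read maps outside every pre-selected bin
        i = bisect_right(starts[chrom], pos) - 1
        if i >= 0:
            start, end, idx = by_chrom[chrom][i]
            if start <= pos < end:
                counts[idx] += 1
    return counts
```

In this sketch, reads that map outside all bins are simply ignored; any normalization or bias correction of the resulting bin values would be applied afterwards.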
[0092] Bin values 135-bv are then used to determine bin-level sequence ratios 135-br (e.g., liquid biopsy bin-level sequence ratios 135-cf-br). Briefly, a copy ratio subroutine 153-b-2 reads in bin values 135-bv and reference bin coverages 153-b-2-a determined for one or more reference samples (e.g., a matched non-cancerous sample of the subject or an average from a plurality of non-cancerous reference samples), and compares bin values for corresponding bins, thereby generating bin-level sequence ratios 135-br.
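The comparison of sample and reference bin values can be illustrated with a short Python sketch. It assumes, for illustration only, that the bin-level sequence ratio is expressed as a log2 ratio of sample coverage to reference coverage; the function name `bin_level_log2_ratios` and the `pseudocount` parameter are hypothetical.

```python
import math


def bin_level_log2_ratios(sample_cov, reference_cov, pseudocount=0.01):
    """Return log2(sample / reference) for corresponding bins.

    The log2 scale and the pseudocount (which avoids division by zero
    for empty bins) are assumptions made for this sketch.
    """
    if len(sample_cov) != len(reference_cov):
        raise ValueError("sample and reference must cover the same bins")
    return [
        math.log2((s + pseudocount) / (r + pseudocount))
        for s, r in zip(sample_cov, reference_cov)
    ]
```

On this scale, a bin with twice the reference coverage has a ratio near 1, a bin with half the reference coverage has a ratio near -1, and a copy-neutral bin has a ratio near 0.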
[0093] These bin-level sequence ratios 135-br are then used to group adjacent bins, having similar sequence ratios, into segments, e.g., using circular binary segmentation. For example, in some embodiments, segmentation subroutine 153-a-3 reads in and applies a segmentation model (e.g., a circular binary segmentation model) to bin-level sequence ratios 135-br, thereby generating a plurality of genomic segments, each corresponding to one or more contiguous bins.
[0094] Segment-level sequence ratios 135-sr (e.g., liquid biopsy segment-level sequence ratios 135-cf-sr) and segment-level measures of dispersion 135-sd (e.g., liquid biopsy segment-level measures of dispersion 135-cf-sd) can be determined using a statistics subroutine 153-a-4, which may be read out from the copy number segmentation algorithm 153-b, as illustrated in Figure 1D1, or may be separately implemented, e.g., by reading-in segment annotations (e.g., including bin assignments to each segment) generated by the segmentation subroutine 153-a-3 and bin-level sequence ratios 135-br from the copy ratio subroutine 153-b-2.
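A separately implemented statistics step of the kind described above might look like the following sketch, which takes the bin-level sequence ratios assigned to one segment and returns a segment-level ratio with a confidence interval. The use of the mean as the segment-level ratio, the percentile bootstrap for the interval, and the name `segment_stats` are assumptions for illustration; the disclosure does not prescribe them.

```python
import random
from statistics import mean


def segment_stats(bin_ratios, n_boot=1000, alpha=0.05, seed=0):
    """Return (segment_ratio, ci_low, ci_high) for one segment.

    bin_ratios: bin-level ratios of the bins assigned to the segment.
    The interval is a percentile bootstrap over bins (an assumption).
    """
    rng = random.Random(seed)
    n = len(bin_ratios)
    # Resample bins with replacement and record each resampled mean.
    boot_means = sorted(
        mean(rng.choices(bin_ratios, k=n)) for _ in range(n_boot)
    )
    low = boot_means[int((alpha / 2) * n_boot)]
    high = boot_means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return mean(bin_ratios), low, high
```

A narrow interval indicates that the bins within the segment agree with one another, while a wide interval suggests a noisy or poorly supported segment.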
[0095] Optionally, a copy number annotation subroutine 153-a-5 reads in one or both of segment-level sequence ratios 135-sr (e.g., liquid biopsy segment-level sequence ratios 135-cf-sr) and segment-level measures of dispersion 135-sd, to provide copy number status annotations (e.g., amplified, neutral, or deleted) 135-cn (e.g., liquid biopsy copy number annotations 135-cf-cn) for one or more of the identified segments.
[0096] In some embodiments, the process above is also performed for a matched tumor tissue biopsy of the subject, e.g., thereby generating one or more tumor segment copy number annotations 135-t-cn.
[0097] The bin-level copy ratios, segment-level copy ratios, and corresponding segment-level confidence interval statistics obtained from the output of the copy number segmentation algorithm 153 (e.g., CNVkit) are used as inputs for a focal amplification / deletion validation algorithm, to determine whether putative segment amplifications and/or deletions can be validated. The copy number segmentation algorithm 153 applies a plurality of filters to statistics for one or more identified segments (Block 412-1). In some embodiments, these filters include one or more of:
• a bin-level measure of central tendency sequence ratio filter 153-a-1, e.g., a median bin-level copy ratio filter (Block 414-1);
• a segment-level measure of dispersion confidence filter 153-a-2, e.g., a segment-level confidence interval filter (Block 416-1);
• a bin-level measure of central tendency plus deviation filter 153-a-3, e.g., a median-plus-median absolute deviation (MAD) bin-level copy ratio filter (Block 418-1); and
• a segment-level sequence ratio filter 153-a-4, e.g., a segment-level copy ratio filter (Block 419-1).
In some embodiments, the plurality of filters includes at least two of the above filters. In some embodiments, the plurality of filters includes at least three of the above filters. In some embodiments, the plurality of filters includes all four of the above filters.
[0098] The copy number status annotation (e.g., amplified, neutral, deleted) for each segment is validated if it passes, or rejected if it fails, the plurality of copy number status annotation validation filters (Block 420-1). Specifically, when a filter in the plurality of filters is fired, the copy number annotation of the segment is rejected, and the copy number variation is determined to be a non-focal copy number variation. When no filter in the plurality of filters is fired, the copy number annotation of the segment is validated, and the copy number variation is determined to be a focal copy number variation (Block 422-1).
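The reject-if-any-filter-fires logic can be sketched in Python for a putative amplification as follows. Only the overall decision rule (any fired filter rejects the annotation; no fired filter validates it) follows the description above. The specific firing conditions, the comparison against a chromosome-wide median-plus-MAD background, the threshold values, and all names are hypothetical placeholders chosen for illustration.

```python
from statistics import median


def mad(values):
    """Median absolute deviation of a list of values."""
    m = median(values)
    return median(abs(v - m) for v in values)


def validate_amplification(seg_bins, chrom_bins, seg_ratio, seg_ci,
                           min_median=0.5, max_ci_width=0.4,
                           min_seg_ratio=0.6):
    """Return (is_focal, fired_filters) for one putative amplification.

    seg_bins:   bin-level ratios inside the segment
    chrom_bins: bin-level ratios across the segment's chromosome
    seg_ratio:  segment-level ratio
    seg_ci:     (low, high) segment-level confidence interval
    All thresholds and firing conditions are hypothetical.
    """
    background = median(chrom_bins) + mad(chrom_bins)
    fired = []
    if median(seg_bins) < min_median:
        fired.append("median_bin_ratio")
    if seg_ci[1] - seg_ci[0] > max_ci_width:
        fired.append("segment_ci")
    if median(seg_bins) <= background:
        fired.append("median_plus_mad")
    if seg_ratio < min_seg_ratio:
        fired.append("segment_ratio")
    # Any fired filter rejects the annotation as non-focal.
    return (not fired), fired
```

Under these placeholder rules, a segment whose bins stand well above the rest of its chromosome is validated as focal, whereas a segment on a uniformly gained chromosome fires the median-plus-MAD filter and is rejected as non-focal.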
[0099] Validated copy number variations (e.g., focal amplifications and/or focal deletions of target genes) can then be used for variant analysis and clinical report generation. For example, focal copy number variations can be matched to the appropriate therapies and/or clinical trials (Block 426-1). A patient report indicating the validated copy number variations and any matched therapies and/or clinical trials can then be generated for use in precision oncology applications (Block 426-1).
[0100] Additional embodiments of the presently disclosed systems and methods are described in further detail below with reference to Figures 2A and 4F1 (see, Example Workflow for Precision Oncology: Copy Number Variation Analysis) and Example 2 - Identification of Focal Copy Number Variation (see, Examples).
[0101] Copy number variations are considered a biomarker for cancer diagnosis and certain copy number variations are targets of treatment. For example, a subset of copy number variations that can be investigated using the methods disclosed herein include amplifications in MET, EGFR, ERBB2, CD274, CCNE1, and MYC, and deletions in BRCA1 and BRCA2. However, the analysis is not limited to these reportable genes. The method utilizes bin-level copy ratios, in addition to segment-level copy ratios, to validate the copy number variations of target genomic segments, thus allowing a highly sensitive characterization of local (both internal and external) changes in copy number to detect true copy number variations with greater accuracy. The presently disclosed systems and methods enable an automatic and reliable way to detect actionable, focal copy number variations via a liquid biopsy assay that is not achieved by conventional methods and is considerably less invasive than a tissue biopsy. The combination of liquid biopsy and copy number variation detection benefits physicians, clinicians, and medical institutions by providing a powerful tool for diagnosing cancer conditions and administering treatments. Furthermore, the methods disclosed herein can be performed alone or alongside traditional solid tumor biopsy methods as a validation method for detecting copy number variations.
[0102] Specifically, the annotation and filtering algorithm can be used to distinguish between actionable and non-actionable copy number variations of target biomarkers that are informative for precision oncology. For example, as reported in Example 2 (Identification of Focal Copy Number Variation; see Examples, below), when applied to two experimental samples both containing a conventionally obtained amplification status for the MYC gene, the method rejected the amplification in a first sample as a non-focal amplification, and validated the amplification in a second sample as a focal, and likely actionable, amplification.
[0103] The identification of actionable genomic alterations in a patient’s cancer genome is a difficult and computationally demanding problem. For instance, the determination of various prognostic metrics useful for precision oncology, such as variant allelic ratio, copy number variation, tumor mutational burden, microsatellite instability status, etc., requires analysis of hundreds of millions to billions of sequenced nucleic acid bases. An example of a typical bioinformatics pipeline established for this purpose includes at least five stages of analysis: assessment of the quality of raw next generation sequencing data, generation of collapsed nucleic acid fragment sequences and alignment of such sequences to a reference genome, detection of structural variants in the aligned sequence data, annotation of identified variants, and visualization of the data. See, Wadapurkar and Vyas, Informatics in Medicine Unlocked, 11:75-82 (2018), the content of which is hereby incorporated by reference, in its entirety, for all purposes. Each one of these procedures is computationally taxing in its own right.
[0104] For instance, the overall temporal and spatial computational complexities of simple global and local pairwise sequence alignment algorithms are quadratic in nature (e.g., second-order problems) and increase rapidly as a function of the sizes of the nucleic acid sequences (n and m) being compared. Specifically, the temporal and spatial complexities of these sequence alignment algorithms can be estimated as O(mn), where O is the upper bound on the asymptotic growth rate of the algorithm, n is the number of bases in the first nucleic acid sequence, and m is the number of bases in the second nucleic acid sequence. See, Baichoo and Ouzounis, BioSystems, 156-157:72-85 (2017), the content of which is hereby incorporated by reference, in its entirety, for all purposes. Given that the human genome contains more than 3 billion bases, these alignment algorithms are extremely computationally taxing, especially when used to analyze next generation sequencing (NGS) data, which can generate more than 3 billion sequence reads per reaction.
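The O(mn) behavior can be seen in a minimal Python sketch of global pairwise alignment scoring by dynamic programming (the Needleman-Wunsch recurrence). The scoring values and the function name are illustrative assumptions; production aligners use heuristics and indexing rather than this textbook form.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, cells_filled) for a global alignment of a and b.

    The table has (m + 1) x (n + 1) entries and every interior cell is
    filled once, which gives O(mn) time and O(mn) space.
    """
    m, n = len(a), len(b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        table[i][0] = i * gap  # prefix of a aligned against gaps
    for j in range(1, n + 1):
        table[0][j] = j * gap  # prefix of b aligned against gaps

    cells = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = table[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch
            )
            table[i][j] = max(diag,
                              table[i - 1][j] + gap,
                              table[i][j - 1] + gap)
            cells += 1
    return table[m][n], cells
```

Doubling the length of both sequences quadruples the number of cells filled, which is why naive pairwise alignment does not scale to genome-sized inputs.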
[0105] This is particularly true when performed in the context of a liquid biopsy assay, because liquid biopsy samples contain a complex mixture of short DNA fragments originating from many different germline (e.g., healthy) and diseased (e.g., cancerous) tissues. Thus, the cellular origins of the sequence reads are unknown, and the sequence signals originating from cancerous cells, which may constitute multiple sub-clonal populations, must be computationally deconvoluted from signals originating from germline and hematopoietic origins, in order to provide relevant information about the subject’s cancer. Thus, in addition to the computationally taxing processes required to align sequence reads to a human genome, there is a computational problem of determining whether a particular abnormal signal, e.g., one or more sequence reads corresponding to a genomic alteration, (i) is not an artifact, and (ii) originated from a cancerous source in the subject. This is increasingly difficult during the early stages of cancer, when treatment is presumably most effective and only small amounts of ctDNA are diluted by germline and hematopoietic DNA.
[0106] In addition to the computationally demanding problem of aligning sequencing data to a human reference genome, the method comprises dividing the plurality of aligned sequence reads into “bins” (e.g., regions of a predefined span of base pairs corresponding to a reference genome), determining the copy ratio of each bin by calculating the differential read depths between experimental and reference samples, and grouping subsets of adjacent bins with shared copy ratios into segments. Grouping bins into segments divides each chromosome into regions of equal copy number, which minimizes noise in the data. Such methods essentially perform change-point or edge detection, which is either temporally limited or computationally intense. For example, in some embodiments, the segmentation is performed using circular binary segmentation. Circular binary segmentation calculates a statistic for each genomic position, where the statistic comprises a likelihood ratio for the null hypothesis (no change in copy ratio at the respective position) against the alternative (one change in copy ratio at the respective position), and where the null hypothesis is rejected if the statistic is greater than a predefined distribution threshold. Notably, in circular binary segmentation, the chromosome is assumed to be circularized, such that the calculation is performed recursively for each position (e.g., each bin) around the circumference of the circle to identify all change-points across the length of the chromosome. Furthermore, for each position (e.g., bin) under investigation, a reference distribution is generated using a permutation approach, where the copy ratios for the plurality of bins are randomized (typically 10,000 times). For some embodiments that utilize bins approximately 100-150 bases long spanning a human reference genome of several billion bases, the number of permutations required to perform this recursive method contributes to a computationally intense procedure. See, for example, Olshen et al., Biostatistics, 5(4):557-572 (2004), doi:10.1093/biostatistics/kxh008, which is hereby incorporated herein by reference in its entirety.
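To illustrate why permutation-based change-point detection is computationally intense, the following Python sketch tests for a single change-point in a list of bin-level copy ratios. It is a deliberately simplified, non-circular variant: it uses a scaled difference of means rather than the likelihood ratio statistic described above, considers only one split rather than recursing, and uses far fewer permutations. All names and defaults are hypothetical.

```python
import math
import random


def max_split_statistic(values):
    """Return (best_statistic, best_split_index) over all split points."""
    n = len(values)
    total = sum(values)
    best_stat, best_idx = 0.0, None
    left_sum = 0.0
    for i in range(1, n):
        left_sum += values[i - 1]
        diff = left_sum / i - (total - left_sum) / (n - i)
        stat = abs(diff) / math.sqrt(1.0 / i + 1.0 / (n - i))
        if stat > best_stat:
            best_stat, best_idx = stat, i
    return best_stat, best_idx


def change_point_p_value(values, n_perm=200, seed=0):
    """Return (split_index, permutation p-value) for one change-point."""
    rng = random.Random(seed)
    observed, idx = max_split_statistic(values)
    shuffled = list(values)
    hits = 0
    for _ in range(n_perm):
        # Shuffling destroys positional structure, giving a draw
        # from the null (no change-point) reference distribution.
        rng.shuffle(shuffled)
        if max_split_statistic(shuffled)[0] >= observed:
            hits += 1
    return idx, (hits + 1) / (n_perm + 1)
```

Even in this reduced form, each test costs on the order of the number of bins multiplied by the number of permutations, and a full segmentation repeats such tests recursively on every resulting sub-segment.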
[0107] Advantageously, the present disclosure provides various systems and methods that improve the computational elucidation of actionable genomic alterations from a liquid biopsy sample of a cancer patient. Specifically, the present disclosure improves a computer-implemented method for identifying focal copy number variations by validating copy number status annotations assigned to genomic segments. As a further example, the application of the plurality of filters to the bin-level copy ratios, segment-level copy ratios, and corresponding segment-level confidence intervals is iterated, on a computer system, over each segment in the plurality of segments, and in some embodiments requires calculations using the copy ratios of each bin in the plurality of bins for each chromosome, for each segment in the plurality of segments. Taken together, the methods disclosed herein are a computational process designed to solve a computational problem.
[0108] Advantageously, the methods and systems described herein provide an improvement to the abovementioned technical problem (e.g., performing complex computer-implemented methods for analyzing a plurality of sequence reads for detection and validation of copy number variations in human genetic targets). The methods described herein therefore solve a problem in the computing art by improving upon conventional methods for identifying copy number variations for cancer diagnosis and treatment. For example, the application of a plurality of filters to the bin-level copy ratios, segment-level copy ratios, and corresponding segment-level confidence intervals provides a means for detecting true copy number variations for clinically relevant biomarkers and filtering out artifactual variations that are not therapeutically actionable, thus improving the accuracy and precision of genomic alteration detection in precision oncology.
[0109] The methods and systems described herein also improve precision oncology methods for assigning and/or administering treatment because of the improved accuracy of copy number variation detection. The identification of therapeutically actionable, focal copy number variations that can be included in a clinical report for patient and/or clinician review, and/or matched with appropriate therapies and/or clinical trials for treatment and/or monitoring, allows for more accurate assignment of treatments. Furthermore, the removal of non-therapeutically actionable, non-focal copy number variations reduces the risk of patients undergoing unnecessary or potentially harmful regimens due to misdiagnoses.
[0110] As described above, conventional liquid biopsy assays also do not provide accurate determination of variants (e.g., somatic variants), particularly at low circulating variant fractions. This is due, in large part, to the use of static variant count filters that require a common amount of support to call a variant positively as a somatic variant in sequencing data, regardless of the identity of the variant and its position within the genome. That is, conventional methods require that at least X number of unique sequence reads (e.g., 8 sequence reads) provide support for (e.g., encompass) a particular variant in order for that variant to be confirmed as a true somatic variant. While this may be fine for liquid biopsy samples having a high tumor fraction, where more copies of each somatic variant would be expected to be found, it results in a high number of false negatives when samples with lower tumor fractions are analyzed. On the other hand, simply lowering the threshold to allow calling of variants with lower support for a particular variant will increase the number of false positives, that is, the number of untrue positive somatic variant calls, which are actually sequencing errors.
[0111] While there are many methods of performing noise suppression on ultra-high depth sequencing data commonly generated for liquid biopsy assays, there remains the fundamental fidelity boundary of sequencing by synthesis that cannot be overcome. Along with this, there are a variety of complexities and non-linearities within the ability to map reads across complex sets of genomic features and from these data, successfully call a variant. While it is possible to filter very stringently, one of the goals of liquid biopsy assays is to detect alterations at very low circulating fractions. This requires that low levels of support be sufficient to make a positive alteration call given that at 0.1% circulating fraction and an average depth of 5000x, only 5 reads containing alternate alleles will be present. Because of this, it is impossible to have a consistent set of thresholds that will be used to filter variants as any filter will either be too stringent or too permissive depending on the variant context and local sequence specific error generation models.
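The arithmetic in the preceding paragraph (a 0.1% circulating fraction at 5000x depth yields about 5 alternate-allele reads) can be checked with a short Python sketch. The sketch also estimates how often such a variant would clear a static threshold of 8 supporting reads, under the simplifying assumption, made here for illustration only, that the alternate read count is Poisson distributed. The function names are hypothetical.

```python
import math


def expected_alt_reads(depth, circulating_fraction):
    """Expected number of reads carrying the alternate allele."""
    return depth * circulating_fraction


def prob_at_least(k, mean_reads):
    """P(X >= k) for X ~ Poisson(mean_reads); a modeling assumption."""
    cdf = sum(
        math.exp(-mean_reads) * mean_reads ** i / math.factorial(i)
        for i in range(k)
    )
    return 1.0 - cdf


mean_reads = expected_alt_reads(5000, 0.001)  # 5 expected reads
p_pass_static = prob_at_least(8, mean_reads)  # about 0.13
```

Under this assumption, a true variant at 0.1% circulating fraction would pass a static 8-read filter only about 13% of the time, which illustrates the false negative problem that a fixed threshold creates at low circulating fractions.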
[0112] Advantageously, the present disclosure provides methods and systems that more accurately call somatic variants by adjusting the variant count threshold in a locus-by-locus fashion, e.g., by lowering the variant count threshold when there is an increased likelihood (orthogonal to the variant count in the sequencing reaction) that a variant at a particular locus is a true somatic variant and/or by raising the variant count threshold when there is an increased likelihood (orthogonal to the variant count in the sequencing reaction) that a variant at a particular locus is a result of a sequencing error, rather than a true somatic variant.
[0113] For example, in some embodiments, the methods and systems described herein employ a generalized application of Bayes’ Theorem through the likelihood ratio test that allows dynamic calibration of filtering thresholds for diagnostic assays. These thresholds are based on one or more of a sample-specific error rate, a methodology-specific sequencing error rate (e.g., from a pool of process-matched healthy control samples), an estimate of the variant allele fraction for the variant being evaluated, and a historical likelihood that a variant would be present at a particular locus in a particular cancer (e.g., derived from an extensive cohort of human solid tumor tissue samples to inform probability models). This results in high sensitivity and specificity in variant detection, allowing identification of actionable oncologic targets, as well as determination of a precise limit of detection to reduce the occurrence of false negatives.
[0114] For instance, in some embodiments, the dynamic variant filtering methodology described herein uses an application of Bayes’ theorem to dynamically tune a variant count threshold for calling a somatic variant at a particular genomic region based on the prevalence of similar mutations within that genomic region in similar cancers. For instance, where there is a high prevalence of a somatic variant in a given gene for a particular cancer (e.g., BRCA1 mutations are common in breast cancers), the dynamic filtering method accounts for this prior (e.g., the prior knowledge that BRCA mutations are commonly found in breast cancers) by setting a lower variant count threshold to call somatic variants in the BRCA1 gene for a breast cancer. That is, the dynamic filtering methodology requires less evidence in order to call a variant in the BRCA1 gene when the subject has breast cancer than when the subject has a different cancer that is not associated with a high prevalence of BRCA1 mutations.
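One way to picture how a prior can shift a variant count threshold is the following Python sketch. It assumes, purely for illustration, a binomial model in which a true somatic variant produces alternate reads at an expected variant allele fraction and a sequencing error produces them at a locus error rate; the posterior odds are the prior odds multiplied by the likelihood ratio, and the threshold is the smallest supporting read count whose posterior reaches a cutoff. The model, the function name, and every numeric value are hypothetical and are not the disclosed implementation.

```python
import math


def min_supporting_reads(depth, error_rate, expected_vaf, prior,
                         posterior_cutoff=0.95):
    """Smallest alt-read count whose posterior reaches the cutoff.

    Assumes expected_vaf > error_rate so that the log likelihood
    ratio increases with the number of supporting reads.
    Returns None if no count up to `depth` suffices.
    """
    prior_log_odds = math.log(prior / (1 - prior))
    cutoff_log_odds = math.log(posterior_cutoff / (1 - posterior_cutoff))
    for k in range(depth + 1):
        # Binomial coefficients cancel in the likelihood ratio.
        log_lr = (
            k * math.log(expected_vaf / error_rate)
            + (depth - k) * math.log((1 - expected_vaf) / (1 - error_rate))
        )
        if prior_log_odds + log_lr >= cutoff_log_odds:
            return k
    return None


# Hypothetical priors: a gene commonly mutated in the subject's cancer
# type versus a gene rarely mutated in that cancer type.
common = min_supporting_reads(5000, 0.0002, 0.001, prior=0.3)
rare = min_supporting_reads(5000, 0.0002, 0.001, prior=0.001)
```

With these illustrative inputs, the high-prevalence prior yields a threshold of 5 supporting reads and the low-prevalence prior yields 9, showing how the same evidence model can demand less support where a somatic variant is more plausible a priori.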
[0115] In some embodiments, the dynamic variant filtering methodology described herein uses an application of Bayes’ theorem to dynamically tune a variant count threshold for calling a somatic variant based on an estimated variant allele fraction for the variant being evaluated. That is, the dynamic filtering methodology takes into account the fact that in a sample having a lower tumor fraction, and therefore a lower variant allele fraction, fewer sequences encompassing a somatic variant would be expected than in a sample having a higher tumor fraction, and therefore a higher variant allele fraction. Accordingly, the sensitivity and specificity of the dynamic filter are tuned to account for the expectation that a higher percentage of variant sequences with low sequence counts (e.g., lower support) represent true somatic variants in a sample with a low tumor fraction than in a sample with a high tumor fraction, for which a higher percentage of variant sequences with low sequence counts represent sequencing errors.
[0116] In some embodiments, the dynamic variant filtering methodology described herein uses an application of Bayes’ theorem to dynamically tune a variant count threshold for calling a somatic variant at a particular genomic locus based on a historical sequencing error rate for the locus. That is, the dynamic filtering methodology takes into account the fact that at genomic loci that are more prone to sequencing errors, such as loci with short nucleotide repeat sequences (e.g., di-nucleotide or tri-nucleotide repeats), there is a higher likelihood that a particular variant is a product of a sequencing error, rather than a true somatic mutation, than at a locus that is not prone to sequencing errors.
[0117] Similarly, in some embodiments, the dynamic variant filtering methodology described herein uses an application of Bayes’ theorem to dynamically tune a variant count threshold for calling a somatic variant at a particular genomic locus based on a reaction-specific sequencing error rate. That is, the dynamic filtering methodology takes into account the fact that in reactions with higher sequencing error rates there is a higher likelihood that a particular variant is a product of a sequencing error, rather than a true somatic mutation.
[0118] The present disclosure provides improved systems and methods for precision oncology based on improved variant calling in liquid biopsy data. The various improvements described herein, e.g., improved variant detection at low circulating fractions, are embodied in an example liquid biopsy workflow described in Examples 2 and 3. These examples describe an example liquid biopsy assay employing a 105-gene hybrid-capture next-generation sequencing (NGS) panel spanning 270 kb of the human genome, configured to detect targets in four variant classes, including single nucleotide variants (SNVs), insertions and/or deletions (indels), copy number variants (CNVs), and gene rearrangements. To establish robust clinical performance, extensive validation studies were conducted that demonstrated high sensitivity and specificity. Accordingly, the example liquid biopsy assay detected actionable variants with high accuracy in comparison to a commercial ctDNA NGS kit, commercial solid tumor biopsy-based assays, such as a solid tumor biopsy NGS tissue assay, and digital droplet PCR (ddPCR). As shown in the results of Figure 17, the methods and systems disclosed herein reduced false positive variant calling by 11.45% compared to conventional variant detection methods.
[0119] As described in detail above, the identification of actionable genomic alterations in a patient’s cancer genome is a difficult and computationally demanding problem.
[0120] Advantageously, the present disclosure provides various systems and methods that improve the computational elucidation of actionable genomic alterations from a liquid biopsy sample of a cancer patient. Specifically, the present disclosure improves a method for identifying variants in ctDNA using a dynamic thresholding approach. As described above, the disclosed methods and systems are necessarily computer-implemented due to their complexity and heavy computational requirements, and thus solve a problem in the computing art.
[0121] Advantageously, the methods and systems described herein provide an improvement to the abovementioned technical problem (e.g., performing complex computer-implemented methods for identifying variants in ctDNA using a dynamic thresholding approach). The methods described herein therefore solve a problem in the computing art by improving upon conventional methods for identifying variants (e.g., actionable oncologic targets) for cancer diagnosis and treatment. For example, the application of Bayes’ Theorem through the likelihood ratio test provides a means for improving detection of true positive variants and reducing detection of false positive variants for clinically relevant biomarkers, thus improving the accuracy and precision of genomic alteration detection in precision oncology.
[0122] The methods and systems described herein also improve precision oncology methods for assigning and/or administering treatment because of the improved accuracy of variation detection. The identification of therapeutically actionable variants that can be included in a clinical report for patient and/or clinician review, and/or matched with appropriate therapies and/or clinical trials for treatment and/or monitoring, allows for more accurate assignment of treatments. Furthermore, the removal of false positive variant detection reduces the risk of patients undergoing unnecessary or potentially harmful regimens due to misdiagnoses.
[0123] Additionally, as described above, conventional liquid biopsy assays do not provide accurate determination of circulating tumor fraction estimates (ctFEs). For example, while low-pass, whole-genome sequencing can be used to estimate tumor fractions, somatic variant sequences are poorly identified from low-pass, whole genome sequencing data, particularly from samples having low tumor fractions. Accordingly, conventional liquid biopsy assays typically use targeted-panel sequencing in order to achieve higher sequence coverage required to identify somatic variants present at low levels within the sample. However, targeted-panel sequencing data does not span a large enough portion of the genome to accurately estimate tumor fraction. Rather, tumor fraction estimates obtained using variant allele fractions (VAFs) in targeted-panel sequencing data are noisy, due to variant tissue source and capture bias.
[0124] Advantageously, the present disclosure provides methods and systems that do provide accurate determination of circulating tumor fraction estimates by using on-target and off-target sequence reads from targeted-panel sequencing data. For example, in some embodiments, the methods and systems described herein fit experimental coverage ratios for segmented sequence reads across the genome to integer copy numbers across a range of simulated tumor fractions. These fitted copy numbers can then be used to determine the expected coverage ratio for the segment, at the given simulated tumor fraction. The aggregate difference between the experimental coverage ratios for all segments and the expected coverage ratios based on the fitted copy number at the given simulated tumor fraction is used as a measure of the accuracy of the fit. That is, where the experimental coverage ratios closely match the expected coverage ratios, the simulated tumor fraction is a good estimate of the actual tumor fraction of the sample. Likewise, where the experimental coverage ratios do not closely match the expected coverage ratios, the simulated tumor fraction is a poor estimate of the actual tumor fraction of the sample.
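By way of non-limiting illustration only, the fitting procedure described above can be sketched as a grid search over simulated tumor fractions. The sketch rests on several assumptions that are not taken from the present disclosure: the expected coverage ratio is modeled as a mixture of tumor cells with integer copy number C and diploid normal cells, copy numbers are restricted to the range 0 to max_copy, and the aggregate difference is a plain sum of squared errors rather than a likelihood. All function and parameter names are hypothetical.

```python
def expected_ratio(copy_number, tumor_fraction):
    """Expected coverage ratio of a segment versus a diploid baseline.

    Assumed mixture model: tumor cells carry `copy_number` copies and
    normal cells carry 2 copies of the segment.
    """
    mixed_copies = tumor_fraction * copy_number + (1 - tumor_fraction) * 2
    return mixed_copies / 2


def fit_error(observed_ratios, tumor_fraction, max_copy=5):
    """Sum of squared differences after fitting each segment to an integer copy number."""
    total = 0.0
    for ratio in observed_ratios:
        # Invert the mixture model, then snap to the nearest allowed integer.
        raw_copy = 2 + 2 * (ratio - 1) / tumor_fraction
        fitted_copy = min(max(round(raw_copy), 0), max_copy)
        total += (ratio - expected_ratio(fitted_copy, tumor_fraction)) ** 2
    return total


def estimate_tumor_fraction(observed_ratios, grid=None):
    """Return the simulated tumor fraction whose fitted ratios best match the data."""
    if grid is None:
        grid = [i / 100 for i in range(5, 100, 5)]  # 0.05, 0.10, ..., 0.95
    return min(grid, key=lambda f: fit_error(observed_ratios, f))


# Synthetic, noise-free segments simulated at a tumor fraction of 0.30.
observed = [expected_ratio(c, 0.30) for c in (2, 3, 1, 4, 0)]
best_fraction = estimate_tumor_fraction(observed)
```

A plain least-squares fit of this kind is weakly identifiable, because a smaller tumor fraction with larger copy-number changes can produce similar ratios. The cap on copy number is one simple guard against that ambiguity in this sketch, and a practical implementation would likely need a statistical model and real noise handling.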
[0125] By using on-target and off-target sequence reads, the systems and methods described herein leverage data collected across a majority of the human genome, which allows for more accurate estimation of circulating tumor fraction than data that is limited to on-target probe regions. Advantageously, this method allows for both accurate tumor fraction estimation and robust variant identification from a single, low-cost sequencing reaction. Previously, in order to generate suitable data for both accurate tumor fraction estimation and robust variant identification, two sequencing reactions would need to be performed: a low-pass whole genome sequencing reaction to generate data across the genome for estimating circulating tumor fraction and a targeted-panel sequencing reaction to generate sufficiently deep sequencing data to identify variants.
[0126] Accordingly, the systems and methods described herein can be used in conjunction with variant detection methods that rely on targeted panel sequencing, such as high-depth sequencing reactions. By ensuring uniform distribution of sequence reads across a genome (e.g., by a process of binning sequencing reads and correcting bins for size, GC content, sequencing depth, etc.), the systems and methods described herein ensure that any variation detected in regions of the genome is representative of the reference genome. This approach reduces noise resulting from capture bias, which can result in unreliable circulating tumor fraction estimates.
[0127] By using a maximum likelihood estimation (e.g., an expectation-maximization algorithm) to fit on-target and off-target sequence reads to genomic variations (e.g., integer copy states), the systems and methods described herein further improve the accuracy and reliability of circulating tumor fraction estimates. For example, in some embodiments, the sequencing coverage of on-target and off-target sequence reads is used to determine a test coverage ratio for regions of the genome in a test liquid biopsy sample. The test coverage ratio is compared to a set of expected coverage ratios obtained using assumptions for expected copy states and expected tumor fractions, which gives a distance (e.g., an error) of the test coverage ratio from the expected copy state. Using this model, by minimizing the distance (e.g., the error) between test parameters and expected parameters, it is possible to estimate the test tumor fraction with high confidence.
[0128] An improved method for obtaining accurate circulating tumor fraction estimates provides several benefits to liquid biopsies. Advantageously, more reliable ctFEs improve the classification accuracy of detected variants as somatic or germline variants (e.g., any variant detected at or below the ctFE can be classified as a somatic variant with high confidence). In addition, accurate ctFEs can greatly improve the sensitivity of detection of clinically relevant copy number variations, including integer copy number calling. Furthermore, in some embodiments, ctFEs are used as biomarkers for tumor burden, metastases, disease progression, or treatment resistance. For example, ctFEs have been shown to correlate with tumor volumes and vary in response to treatment.
[0129] As a result, the methods and systems disclosed herein provide a sensitive, cost-effective, and minimally invasive method to monitor patients for response to therapy, disease burden, relapse, progression, and/or emerging resistance mutations, which can translate into better care for patients. When used as part of the course of care, serial ctFE monitoring can predict objective measures of progression in at-risk individuals. Due to cost and convenience of sampling, the methods and systems disclosed herein can be applied at shorter time intervals than radiographic methods and can allow for more timely intervention in the case of disease progression.
[0130] Additionally, the methods and systems disclosed herein provide benefits to clinicians by generating more accurate variant calls and/or informative ctFE biomarkers that can aid in the prediction of clinical outcomes in patients and/or the selection of appropriate treatment plans.
[0131] Specifically, a validation of the performance of a method for on-target and off-target tumor estimation, in accordance with some embodiments of the present disclosure, revealed a correlation between ctFEs and metastases and disease progression. For example, as reported in Examples 2 and 3, when the method is applied to matched, de-identified clinical data for a cohort of 1,000 patients, high ctFEs were found to (i) correlate well with estimates derived from low-pass, whole genome sequencing, (ii) be a highly specific predictor of metastases, (iii) be positively correlated with reported “progressive disease” and (iv) be negatively correlated with better clinical outcomes. Figure 7A3 provides an overview of an experimental and analytical workflow used for validation of the off-target tumor estimation routine (OTTER).
[0132] As described in detail above, the identification of actionable genomic alterations in a patient’s cancer genome is a difficult and computationally demanding problem.
[0133] Advantageously, the present disclosure provides various systems and methods that improve the computational elucidation of actionable genomic alterations from a liquid biopsy sample of a cancer patient. Specifically, the present disclosure improves upon the accuracy of circulating tumor fractions estimated from targeted-panel sequencing. Moreover, because the methods described herein eliminate the need to process data from two different sequencing reactions, the disclosure lowers the computational budget for accurately estimating circulating tumor fractions and identifying actionable variants. As described above, the disclosed methods and systems are necessarily computer-implemented due to their complexity and heavy computational requirements, and thus solve a problem in the computing art.
[0134] Advantageously, the methods and systems described herein provide an improvement to the abovementioned technical problem (e.g., performing complex computer-implemented methods for determining accurate circulating tumor fraction estimates). The methods described herein therefore solve a problem in the computing art by improving upon conventional methods for determining tumor fraction estimates for cancer diagnosis, monitoring, and treatment. For example, the application of a maximum likelihood estimation (e.g., an expectation-maximization algorithm) to estimate genomic alterations using on-target and off-target sequence reads in liquid biopsy samples improves upon conventional approaches for precision oncology by providing highly reliable circulating tumor fraction estimates, while allowing concurrent variant detection in targeted panel sequencing of liquid biopsy samples. This in turn lowers the computational budget required for these processes, thereby improving the speed and lowering the power requirements of the computer.
[0135] The methods and systems described herein also improve precision oncology methods for assigning and/or administering treatment because of the improved accuracy of circulating tumor fraction estimations. Accurate ctFEs can be reported as biomarkers and/or used in downstream analysis for identification of therapeutically actionable variants to be included in a clinical report for patient and/or clinician review. Additionally, ctFEs and any therapeutically actionable variants identified using ctFEs can be matched with appropriate therapies and/or clinical trials, allowing for more accurate assignment of treatments. The improved accuracy of biomarker detection increases the chance of efficacy and reduces the risk of patients undergoing unnecessary or potentially harmful regimens due to misdiagnoses.
Definitions
[0136] As used herein, the term “subject” refers to any living or non-living organism including, but not limited to, a human (e.g., a male human, female human, fetus, pregnant female, child, or the like), a non-human mammal, or a non-human animal. Any human or non-human animal can serve as a subject, including but not limited to mammal, reptile, avian, amphibian, fish, ungulate, ruminant, bovine (e.g., cattle), equine (e.g., horse), caprine and ovine (e.g., sheep, goat), swine (e.g., pig), camelid (e.g., camel, llama, alpaca), monkey, ape (e.g., gorilla, chimpanzee), ursid (e.g., bear), poultry, dog, cat, mouse, rat, fish, dolphin, whale and shark. In some embodiments, a subject is a male or female of any age (e.g., a man, a woman, or a child).
[0137] As used herein, the terms “control,” “control sample,” “reference,” “reference sample,” “normal,” and “normal sample” describe a sample from a non-diseased tissue. In some embodiments, such a sample is from a subject that does not have a particular condition (e.g., cancer). In other embodiments, such a sample is an internal control from a subject, e.g., who may or may not have the particular disease (e.g., cancer), but is from a healthy tissue of the subject. For example, where a liquid or solid tumor sample is obtained from a subject with cancer, an internal control sample may be obtained from a healthy tissue of the subject, e.g., a white blood cell sample from a subject without a blood cancer or a solid germline tissue sample from the subject. Accordingly, a reference sample can be obtained from the subject or from a database, e.g., from a second subject who does not have the particular disease (e.g., cancer).
[0138] As used herein the term “cancer,” “cancerous tissue,” or “tumor” refers to an abnormal mass of tissue in which the growth of the mass surpasses, and is not coordinated with, the growth of normal tissue, including both solid masses (e.g., as in a solid tumor) and fluid masses (e.g., as in a hematological cancer). A cancer or tumor can be defined as “benign” or “malignant” depending on the following characteristics: degree of cellular differentiation including morphology and functionality, rate of growth, local invasion and metastasis. A “benign” tumor can be well differentiated, have characteristically slower growth than a malignant tumor and remain localized to the site of origin. In addition, in some cases a benign tumor does not have the capacity to infiltrate, invade or metastasize to distant sites. A “malignant” tumor can be poorly differentiated (anaplasia) and have characteristically rapid growth accompanied by progressive infiltration, invasion, and destruction of the surrounding tissue. Furthermore, a malignant tumor can have the capacity to metastasize to distant sites. Accordingly, a cancer cell is a cell found within the abnormal mass of tissue whose growth is not coordinated with the growth of normal tissue. Accordingly, a “tumor sample” refers to a biological sample obtained or derived from a tumor of a subject, as described herein.
[0139] Non-limiting examples of cancer types include ovarian cancer, cervical cancer, uveal melanoma, colorectal cancer, chromophobe renal cell carcinoma, liver cancer, endocrine tumor, oropharyngeal cancer, retinoblastoma, biliary cancer, adrenal cancer, neural cancer, neuroblastoma, basal cell carcinoma, brain cancer, breast cancer, non-clear cell renal cell carcinoma, glioblastoma, glioma, kidney cancer, gastrointestinal stromal tumor, medulloblastoma, bladder cancer, gastric cancer, bone cancer, non-small cell lung cancer, thymoma, prostate cancer, clear cell renal cell carcinoma, skin cancer, thyroid cancer, sarcoma, testicular cancer, head and neck cancer (e.g., head and neck squamous cell carcinoma), meningioma, peritoneal cancer, endometrial cancer, pancreatic cancer, mesothelioma, esophageal cancer, small cell lung cancer, Her2 negative breast cancer, ovarian serous carcinoma, HR+ breast cancer, uterine serous carcinoma, uterine corpus endometrial carcinoma, gastroesophageal junction adenocarcinoma, gallbladder cancer, chordoma, and papillary renal cell carcinoma.
[0140] As used herein, the terms “cancer state” or “cancer condition” refer to a characteristic of a cancer patient's condition, e.g., a diagnostic status, a type of cancer, a location of cancer, a primary origin of a cancer, a cancer stage, a cancer prognosis, and/or one or more additional characteristics of a cancer (e.g., tumor characteristics such as morphology, heterogeneity, size, etc.). In some embodiments, one or more additional personal characteristics of the subject are used to further describe the cancer state or cancer condition of the subject, e.g., age, gender, weight, race, personal habits (e.g., smoking, drinking, diet), other pertinent medical conditions (e.g., high blood pressure, dry skin, other diseases), current medications, allergies, pertinent medical history, current side effects of cancer treatments and other medications, etc.
[0141] As used herein, the term “liquid biopsy” sample refers to a liquid sample obtained from a subject that includes cell-free DNA. Examples of liquid biopsy samples include, but are not limited to, blood, whole blood, plasma, serum, urine, cerebrospinal fluid, fecal material, saliva, sweat, tears, pleural fluid, pericardial fluid, or peritoneal fluid of the subject. In some embodiments, a liquid biopsy sample is a cell-free sample, e.g., a cell-free blood sample. In some embodiments, a liquid biopsy sample is obtained from a subject with cancer. In some embodiments, a liquid biopsy sample is collected from a subject with an unknown cancer status, e.g., for use in determining a cancer status of the subject. Likewise, in some embodiments, a liquid biopsy is collected from a subject with a non-cancerous disorder, e.g., a cardiovascular disease. In some embodiments, a liquid biopsy is collected from a subject with an unknown status for a non-cancerous disorder, e.g., for use in determining a non-cancerous disorder status of the subject.
[0142] As used herein, the terms “cell-free DNA” and “cfDNA” interchangeably refer to DNA fragments that circulate in a subject’s body (e.g., bloodstream) and originate from one or more healthy cells and/or from one or more cancer cells. These DNA molecules are found outside cells, in bodily fluids such as blood, whole blood, plasma, serum, urine, cerebrospinal fluid, fecal material, saliva, sweat, tears, pleural fluid, pericardial fluid, or peritoneal fluid of a subject, and are believed to be fragments of genomic DNA expelled from healthy and/or cancerous cells, e.g., upon apoptosis and lysis of the cellular envelope.
[0143] As used herein, the term “locus” refers to a position (e.g., a site) within a genome, e.g., on a particular chromosome. In some embodiments, a locus refers to a single nucleotide position, on a particular chromosome, within a genome. In some embodiments, a locus refers to a group of nucleotide positions within a genome. In some instances, a locus is defined by a mutation (e.g., substitution, insertion, deletion, inversion, or translocation) of consecutive nucleotides within a cancer genome. In some instances, a locus is defined by a gene, a sub-genic structure (e.g., a regulatory element, exon, intron, or combination thereof), or a predefined span of a chromosome. Because normal mammalian cells have diploid genomes, a normal mammalian genome (e.g., a human genome) will generally have two copies of every locus in the genome, or at least two copies of every locus located on the autosomal chromosomes, e.g., one copy on the maternal autosomal chromosome and one copy on the paternal autosomal chromosome.
[0144] As used herein, the term “allele” refers to a particular sequence of one or more nucleotides at a chromosomal locus. In a haploid organism, the subject has one allele at every chromosomal locus. In a diploid organism, the subject has two alleles at every chromosomal locus.
[0145] As used herein, the term “base pair” or “bp” refers to a unit consisting of two nucleobases bound to each other by hydrogen bonds. Generally, the size of an organism's genome is measured in base pairs because DNA is typically double stranded. However, some viruses have single-stranded DNA or RNA genomes.
[0146] As used herein, the terms “genomic alteration,” “mutation,” and “variant” refer to a detectable change in the genetic material of one or more cells. A genomic alteration, mutation, or variant can refer to various types of changes in the genetic material of a cell, including changes in the primary genome sequence at single or multiple nucleotide positions, e.g., a single nucleotide variant (SNV), a multi-nucleotide variant (MNV), an indel (e.g., an insertion or deletion of nucleotides), a DNA rearrangement (e.g., an inversion or translocation of a portion of a chromosome or chromosomes), a variation in the copy number of a locus (e.g., an exon, gene, or a large span of a chromosome) (CNV), a partial or complete change in the ploidy of the cell, as well as changes in the epigenetic information of a genome, such as altered DNA methylation patterns. In some embodiments, a mutation is a change in the genetic information of the cell relative to a particular reference genome, or one or more ‘normal’ alleles found in the population of the species of the subject. For instance, mutations can be found in both germline cells (e.g., non-cancerous, ‘normal’ cells) of a subject and in abnormal cells (e.g., pre-cancerous or cancerous cells) of the subject. As such, a mutation in a germline of the subject (e.g., which is found in substantially all ‘normal cells’ in the subject) is identified relative to a reference genome for the species of the subject. However, many loci of a reference genome of a species are associated with several variant alleles that are significantly represented in the population of the subject and are not associated with a diseased state, e.g., such that they would not be considered ‘mutations.’ By contrast, in some embodiments, a mutation in a cancerous cell of a subject can be identified relative to either a reference genome of the subject or to the subject’s own germline genome.
In certain instances, identification of both types of variants can be informative. For instance, in some instances, a mutation that is present in both the cancer genome of the subject and the germline of the subject is informative for precision oncology when the mutation is a so-called ‘driver mutation,’ which contributes to the initiation and/or development of a cancer. However, in other instances, a mutation that is present in both the cancer genome of the subject and the germline of the subject is not informative for precision oncology, e.g., when the mutation is a so-called ‘passenger mutation,’ which does not contribute to the initiation and/or development of the cancer. Likewise, in some instances, a mutation that is present in the cancer genome of the subject but not the germline of the subject is informative for precision oncology, e.g., where the mutation is a driver mutation and/or the mutation facilitates a therapeutic approach, e.g., by differentiating cancer cells from normal cells in a therapeutically actionable way. However, in some instances, a mutation that is present in the cancer genome but not the germline of a subject is not informative for precision oncology, e.g., where the mutation is a passenger mutation and/or where the mutation fails to differentiate the cancer cell from a germline cell in a therapeutically actionable way.
[0147] As used herein, the terms “focal copy number variation,” “focal copy number alteration,” “focal copy number variant,” and the like interchangeably refer to a genomic variation, relative to a reference genome, in the copy number of a small genomic segment. Unless otherwise specified, a small genomic segment is less than 30 Mb. However, in some embodiments, a small genomic segment is less than 25 Mb, less than 20 Mb, less than 15 Mb, less than 10 Mb, less than 5 Mb, less than 4 Mb, less than 3 Mb, less than 2 Mb, less than 1 Mb, or smaller. Generally, focal copy number variations range from several hundred bases to tens of Mb. In some embodiments, a focal copy number variation consists of one or a few exons of a gene or several genes. For more information on focal copy number variations see, for example, Nord et al., Int. J. Cancer, 126, 1390-1402 (2010), which is hereby incorporated herein by reference in its entirety.
[0148] As used herein, the term “reference allele” refers to the sequence of one or more nucleotides at a chromosomal locus that is either the predominant allele represented at that chromosomal locus within the population of the species (e.g., the “wild-type” sequence), or an allele that is predefined within a reference genome for the species.
[0149] As used herein, the term “variant allele” refers to a sequence of one or more nucleotides at a chromosomal locus that is either not the predominant allele represented at that chromosomal locus within the population of the species (e.g., not the “wild-type” sequence), or not an allele that is predefined within a reference sequence construct (e.g., a reference genome or set of reference genomes) for the species. In some instances, sequence isoforms found within the population of a species that do not affect a change in a protein encoded by the genome, or that result in an amino acid substitution that does not substantially affect the function of an encoded protein, are not variant alleles.
[0150] As used herein, the term “variant allele fraction,” “VAF,” “allelic fraction,” or “AF” refers to the number of times a variant or mutant allele was observed (e.g., a number of reads supporting a candidate variant allele) divided by the total number of times the position was sequenced (e.g., a total number of reads covering a candidate locus).
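By way of non-limiting illustration only, the variant allele fraction defined above can be computed as in the following sketch. The function name and the read counts are hypothetical.

```python
def variant_allele_fraction(variant_reads, total_reads):
    """Reads supporting the variant allele divided by all reads covering the locus."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= variant_reads <= total_reads:
        raise ValueError("variant_reads must lie between 0 and total_reads")
    return variant_reads / total_reads


# Illustrative counts only: 12 supporting reads out of 800 covering reads.
vaf = variant_allele_fraction(12, 800)
```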
[0151] As used herein, the terms “variant fragment count” and “variant allele fragment count” interchangeably refer to a quantification, e.g., a raw or normalized count, of the number of sequences representing unique cell-free DNA fragments encompassing a variant allele in a sequencing reaction. That is, a variant fragment count represents a count of sequence reads representing unique molecules in the liquid biopsy sample, after duplicate sequence reads in the raw sequencing data have been collapsed, e.g., through the use of unique molecular indices (UMI) and bagging, etc. as described herein.
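By way of non-limiting illustration only, a variant fragment count can be sketched by collapsing duplicate reads into unique molecules before counting. The sketch assumes each read is represented as a hypothetical tuple of (UMI, fragment start, fragment end, observed allele), groups reads that share the same UMI and fragment coordinates, and counts a molecule as supporting the variant only when a strict majority of its reads carry the variant allele. The exact-match grouping and the majority rule are simplifying assumptions and are not taken from the present disclosure.

```python
def variant_fragment_count(reads, variant_allele):
    """Count unique molecules whose collapsed reads support the variant allele.

    `reads` is an iterable of (umi, start, end, allele) tuples (hypothetical layout).
    """
    molecules = {}
    for umi, start, end, allele in reads:
        # Reads sharing a UMI and fragment coordinates are treated as duplicates.
        molecules.setdefault((umi, start, end), []).append(allele)

    count = 0
    for alleles in molecules.values():
        # Strict majority rule: more than half of the duplicates carry the variant.
        if alleles.count(variant_allele) * 2 > len(alleles):
            count += 1
    return count


# Illustrative reads only: four molecules, two of which support the "T" allele.
example_reads = [
    ("AAC", 100, 260, "T"), ("AAC", 100, 260, "T"),  # one molecule, two duplicates
    ("GGT", 100, 260, "T"),                          # second supporting molecule
    ("CCA", 95, 250, "C"),                           # reference-only molecule
    ("TTG", 110, 270, "T"), ("TTG", 110, 270, "C"),  # tie, not counted
]
fragment_count = variant_fragment_count(example_reads, "T")
```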
[0152] As used herein, the term “germline variants” refers to genetic variants inherited from maternal and paternal DNA. Germline variants may be determined through a matched tumor-normal calling pipeline.
[0153] As used herein, the term “somatic variants” refers to variants arising as a result of dysregulated cellular processes associated with neoplastic cells, e.g., a mutation. Somatic variants may be detected via subtraction from a matched normal sample.
[0154] As used herein, the term “single nucleotide variant” or “SNV” refers to a substitution of one nucleotide to a different nucleotide at a position (e.g., site) of a nucleotide sequence, e.g., a sequence read from an individual. A substitution from a first nucleobase X to a second nucleobase Y may be denoted as “X>Y.” For example, a cytosine to thymine SNV may be denoted as “C>T.”
[0155] As used herein, the term “insertions and deletions” or “indels” refers to a variant resulting from the gain or loss of DNA base pairs within an analyzed region.
[0156] As used herein, the term “copy number variation” or “CNV” refers to the process by which large structural changes in a genome associated with tumor aneuploidy and other dysregulated repair systems are detected. These processes are used to detect large scale insertions or deletions of entire genomic regions. CNV is defined as structural insertions or deletions greater than a certain base pair (“bp”) in size, such as 500 bp.
[0157] As used herein, the term “gene fusion” refers to the product of large-scale chromosomal aberrations resulting in the creation of a chimeric protein. These expressed products can be non-functional, or they can be highly over or underactive. This can cause deleterious effects in cancer such as hyper-proliferative or anti-apoptotic phenotypes.
[0158] As used herein, the term “loss of heterozygosity” refers to the loss of one copy of a segment (e.g., including part or all of one or more genes) of the genome of a diploid subject (e.g., a human) or loss of one copy of a sequence encoding a functional gene product in the genome of the diploid subject, in a tissue, e.g., a cancerous tissue, of the subject. As used herein, when referring to a metric representing loss of heterozygosity across the entire genome of the subject, loss of heterozygosity is caused by the loss of one copy of various segments in the genome of the subject. Loss of heterozygosity across the entire genome may be estimated without sequencing the entire genome of a subject, and methods for such estimations based on gene panel targeting-based sequencing methodologies are described in the art. Accordingly, in some embodiments, a metric representing loss of heterozygosity across the entire genome of a tissue of a subject is represented as a single value, e.g., a percentage or fraction of the genome. In some cases, a tumor is composed of various sub-clonal populations, each of which may have a different degree of loss of heterozygosity across their respective genomes. Accordingly, in some embodiments, loss of heterozygosity across the entire genome of a cancerous tissue refers to an average loss of heterozygosity across a heterogeneous tumor population. As used herein, when referring to a metric for loss of heterozygosity in a particular gene, e.g., a DNA repair protein such as a protein involved in the homologous DNA recombination pathway (e.g., BRCA1 or BRCA2), loss of heterozygosity refers to complete or partial loss of one copy of the gene encoding the protein in the genome of the tissue and/or a mutation in one copy of the gene that prevents translation of a full-length gene product, e.g., a frameshift or truncating (creating a premature stop codon in the gene) mutation in the gene of interest.
In some cases, a tumor is composed of various sub-clonal populations, each of which may have a different mutational status in a gene of interest. Accordingly, in some embodiments, loss of heterozygosity for a particular gene of interest is represented by an average value for loss of heterozygosity for the gene across all sequenced sub-clonal populations of the cancerous tissue. In other embodiments, loss of heterozygosity for a particular gene of interest is represented by a count of the number of unique incidences of loss of heterozygosity in the gene of interest across all sequenced sub-clonal populations of the cancerous tissue (e.g., the number of unique frame-shift and/or truncating mutations in the gene identified in the sequencing data).
[0159] As used herein, the term “microsatellites” refers to short, repeated sequences of DNA. The smallest nucleotide repeated unit of a microsatellite is referred to as the “repeated unit” or “repeat unit.” In some embodiments, the stability of a microsatellite locus is evaluated by comparing some metric of the distribution of the number of repeated units at a microsatellite locus to a reference number or distribution.
[0160] As used herein, the term “microsatellite instability” or “MSI” refers to a genetic hypermutability condition associated with various cancers that results from impaired DNA mismatch repair (MMR) in a subject. Among other phenotypes, MSI causes changes in the size of microsatellite loci, e.g., a change in the number of repeated units at microsatellite loci, during DNA replication. Accordingly, the size of microsatellite repeats is varied in MSI cancers as compared to the size of the corresponding microsatellite repeats in the germline of a cancer subject. The term “Microsatellite Instability-High” or “MSI-H” refers to a state of a cancer (e.g., a tumor) that has a significant MMR defect, resulting in microsatellite loci with significantly different lengths than the corresponding microsatellite loci in normal cells of the same individual. The term “Microsatellite Stable” or “MSS” refers to a state of a cancer (e.g., a tumor) without significant MMR defects, such that there is no significant difference between the lengths of the microsatellite loci in cancerous cells and the lengths of the corresponding microsatellite loci in normal (e.g., non-cancerous) cells in the same individual. The term “Microsatellite Equivocal” or “MSE” refers to a state of a cancer (e.g., a tumor) having an intermediate microsatellite length phenotype that cannot be clearly classified as MSI-H or MSS based on statistical cutoffs used to define those two categories.
[0161] As used herein, the term “gene product” refers to an RNA (e.g., mRNA or miRNA) or protein molecule transcribed or translated from a particular genomic locus, e.g., a particular gene. The genomic locus can be identified using a gene name, a chromosomal location, or any other genetic mapping metric.
[0162] As used herein, the terms “expression level,” “abundance level,” or simply “abundance” refer to an amount of a gene product (an RNA species, e.g., mRNA or miRNA, or protein molecule) transcribed or translated by a cell, or an average amount of a gene product transcribed or translated across multiple cells. When referring to mRNA or protein expression, the term generally refers to the amount of any RNA or protein species corresponding to a particular genomic locus, e.g., a particular gene. However, in some embodiments, an expression level can refer to the amount of a particular isoform of an mRNA or protein corresponding to a particular gene that gives rise to multiple mRNA or protein isoforms. The genomic locus can be identified using a gene name, a chromosomal location, or any other genetic mapping metric.
[0163] As used herein, the term “ratio” refers to any comparison of a first metric X, or a first mathematical transformation thereof X' (e.g., measurement of a number of units of a genomic sequence in a first one or more biological samples or a first mathematical transformation thereof) to another metric Y or a second mathematical transformation thereof Y' (e.g., the number of units of a respective genomic sequence in a second one or more biological samples or a second mathematical transformation thereof) expressed as X/Y, Y/X, logN(X/Y), logN(Y/X), X'/Y, Y/X', logN(X'/Y), logN(Y/X'), X/Y', Y'/X, logN(X/Y'), logN(Y'/X), X'/Y', Y'/X', logN(X'/Y'), or logN(Y'/X'), where N is any real number greater than 1 and where example mathematical transformations of X and Y include, but are not limited to, raising X or Y to a power Z, multiplying X or Y by a constant Q, where Z and Q are any real numbers, and/or taking a base-M logarithm of X and/or Y, where M is a real number greater than 1. In one non-limiting example, X is transformed to X' prior to ratio calculation by raising X to the power of two (X^2) and Y is transformed to Y' prior to ratio calculation by raising Y to the power of 3.2 (Y^3.2) and the ratio of X and Y is computed as log2(X'/Y').
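As a non-limiting illustration of the ratio definition above, the following minimal Python sketch computes logN(X'/Y') for power-transformed metrics. The function name `transformed_log_ratio` and its parameter names are hypothetical and chosen only for this example; the input values 8.0 and 2.0 are arbitrary.

```python
import math


def transformed_log_ratio(x, y, x_power=1.0, y_power=1.0, log_base=2.0):
    """Return log_N(X'/Y'), where X' = x**x_power and Y' = y**y_power.

    Hypothetical helper for illustration only; N is `log_base`.
    """
    if x <= 0 or y <= 0:
        raise ValueError("x and y must be positive to take a logarithm of their ratio")
    if log_base <= 1:
        raise ValueError("log_base (N) must be greater than 1")
    x_prime = x ** x_power  # first mathematical transformation, X -> X'
    y_prime = y ** y_power  # second mathematical transformation, Y -> Y'
    return math.log(x_prime / y_prime, log_base)


# The non-limiting example from the text: X' = X^2, Y' = Y^3.2, ratio = log2(X'/Y').
# With the arbitrary inputs X = 8 and Y = 2 this equals 2*log2(8) - 3.2*log2(2) = 2.8.
example_ratio = transformed_log_ratio(8.0, 2.0, x_power=2.0, y_power=3.2, log_base=2.0)
```

Leaving both powers at 1.0 reduces the sketch to the untransformed form logN(X/Y).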
[0164] As used herein, the term “relative abundance” refers to a ratio of a first amount of a compound measured in a sample, e.g., a gene product (an RNA species, e.g., mRNA or miRNA, or protein molecule) or nucleic acid fragments having a particular characteristic (e.g., aligning to a particular locus or encompassing a particular allele), to a second amount of a compound measured in a second sample. In some embodiments, relative abundance refers to a ratio of an amount of a species of a compound to a total amount of the compound in the same sample, for instance, a ratio of the amount of mRNA transcripts encoding a particular gene in a sample (e.g., aligning to a particular region of the exome) to the total amount of mRNA transcripts in the sample. In other embodiments, relative abundance refers to a ratio of an amount of a compound or species of a compound in a first sample to an amount of the compound or the species of the compound in a second sample, for instance, a ratio of a normalized amount of mRNA transcripts encoding a particular gene in a first sample to a normalized amount of mRNA transcripts encoding the particular gene in a second and/or reference sample.
[0165] As used herein, the terms “sequencing,” “sequence determination,” and the like refer to any biochemical processes that may be used to determine the order of biological macromolecules such as nucleic acids or proteins. For example, sequencing data can include
all or a portion of the nucleotide bases in a nucleic acid molecule such as an mRNA transcript or a genomic locus.
[0166] As used herein, the term “genetic sequence” refers to a recordation of a series of nucleotides present in a subject’s RNA or DNA as determined by sequencing of nucleic acids from the subject.
[0167] As used herein, the term “sequence reads” or “reads” refers to nucleotide sequences produced by any nucleic acid sequencing process described herein or known in the art. Reads can be generated from one end of nucleic acid fragments (“single-end reads”) or from both ends of nucleic acid fragments (e.g., paired-end reads, double-end reads). The length of the sequence read is often associated with the particular sequencing technology. High-throughput methods, for example, provide sequence reads that can vary in size from tens to hundreds of base pairs (bp). In some embodiments, the sequence reads are of a mean, median or average length of about 15 bp to 900 bp long (e.g., about 20 bp, about 25 bp, about 30 bp, about 35 bp, about 40 bp, about 45 bp, about 50 bp, about 55 bp, about 60 bp, about 65 bp, about 70 bp, about 75 bp, about 80 bp, about 85 bp, about 90 bp, about 95 bp, about 100 bp, about 110 bp, about 120 bp, about 130 bp, about 140 bp, about 150 bp, about 200 bp, about 250 bp, about 300 bp, about 350 bp, about 400 bp, about 450 bp, or about 500 bp). In some embodiments, the sequence reads are of a mean, median or average length of about 1000 bp, 2000 bp, 5000 bp, 10,000 bp, or 50,000 bp or more. Nanopore® sequencing, for example, can provide sequence reads that can vary in size from tens to hundreds to thousands of base pairs. Illumina® parallel sequencing, for example, can provide sequence reads that do not vary as much, for example, most of the sequence reads can be smaller than 200 bp. A sequence read (or sequencing read) can refer to sequence information corresponding to a nucleic acid molecule (e.g., a string of nucleotides). For example, a sequence read can correspond to a string of nucleotides (e.g., about 20 to about 150) from part of a nucleic acid fragment, can correspond to a string of nucleotides at one or both ends of a nucleic acid fragment, or can correspond to nucleotides of the entire nucleic acid fragment.
A sequence read can be obtained in a variety of ways, e.g., using sequencing techniques or using probes, e.g., in hybridization arrays or capture probes, or amplification techniques, such as the polymerase chain reaction (PCR) or linear amplification using a single primer or isothermal amplification.
[0168] As used herein, the term “read segment” refers to any form of nucleotide sequence read including the raw sequence reads obtained directly from a nucleic acid sequencing
technique or from a sequence derived therefrom, e.g., an aligned sequence read, a collapsed sequence read, or a stitched sequence read.
[0169] As used herein, the term “read count” refers to the total number of nucleic acid reads generated, which may or may not be equivalent to the number of nucleic acid molecules generated, during a nucleic acid sequencing reaction.
[0170] As used herein, the term “read-depth,” “sequencing depth,” or “depth” can refer to a total number of unique nucleic acid fragments encompassing a particular locus or region of the genome of a subject that are sequenced in a particular sequencing reaction. Sequencing depth can be expressed as “Yx”, e.g., 50x, 100x, etc., where “Y” refers to the number of unique nucleic acid fragments encompassing a particular locus that are sequenced in a sequencing reaction. In such a case, Y is necessarily an integer, because it represents the actual sequencing depth for a particular locus. Alternatively, read-depth, sequencing depth, or depth can refer to a measure of central tendency (e.g., a mean or mode) of the number of unique nucleic acid fragments that encompass one of a plurality of loci or regions of the genome of a subject that are sequenced in a particular sequencing reaction. For example, in some embodiments, sequencing depth refers to the average depth of every locus across an arm of a chromosome, a targeted sequencing panel, an exome, or an entire genome. In such a case, Y may be expressed as a fraction or a decimal, because it refers to an average coverage across a plurality of loci. When a mean depth is recited, the actual depth for any particular locus may be different than the overall recited depth. Metrics can be determined that provide a range of sequencing depths in which a defined percentage of the total number of loci fall, for instance, a range of sequencing depths within which 90%, 95%, or 99% of the loci fall. As understood by the skilled artisan, different sequencing technologies provide different sequencing depths. For instance, low-pass whole genome sequencing can refer to technologies that provide a sequencing depth of less than 5x, less than 4x, less than 3x, or less than 2x, e.g., from about 0.5x to about 3x.
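By way of illustration only, the distinction between an integer per-locus depth, a fractional mean depth, and a depth range containing a defined percentage of loci can be sketched in Python as follows. The function names, the index-based range calculation, and the depth values are assumptions made for this example and are not taken from the disclosure.

```python
import math
from statistics import mean


def mean_depth(per_locus_depths):
    """Mean depth across loci; may be fractional even though each locus depth is an integer."""
    return mean(per_locus_depths)


def central_depth_range(per_locus_depths, fraction=0.90):
    """Return (low, high) depths bounding at least `fraction` of the loci.

    Assumption: equal tails are trimmed from the sorted depths; other
    percentile conventions would be equally valid.
    """
    ordered = sorted(per_locus_depths)
    n = len(ordered)
    tail = (1.0 - fraction) / 2.0
    low_index = math.floor(tail * (n - 1))            # round outward on the low side
    high_index = math.ceil((1.0 - tail) * (n - 1))    # round outward on the high side
    return ordered[low_index], ordered[high_index]


# Hypothetical unique-fragment counts at five loci of a targeted panel.
depths = [40, 50, 50, 60, 52]
panel_mean = mean_depth(depths)                   # 50.4, i.e., "50.4x"
full_range = central_depth_range(depths, 1.0)     # (40, 60)
```

A mean below 5 computed this way over a whole genome would correspond to the low-pass regime mentioned above.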
[0171] As used herein, the term “sequencing breadth” refers to what fraction of a particular reference exome (e.g., human reference exome), a particular reference genome (e.g., human reference genome), or part of the exome or genome has been analyzed. Sequencing breadth can be expressed as a fraction, a decimal, or a percentage, and is generally calculated as (the number of loci analyzed / the total number of loci in a reference exome or reference genome). The denominator of the fraction can be a repeat-masked genome, and thus 100% can correspond to all of the reference genome minus the masked
parts. A repeat-masked exome or genome can refer to an exome or genome in which sequence repeats are masked (e.g., sequence reads align to unmasked portions of the exome or genome). In some embodiments, any part of an exome or genome can be masked and, thus, sequencing breadth can be evaluated for any desired portion of a reference exome or genome. In some embodiments, “broad sequencing” refers to sequencing/analysis of at least 0.1% of an exome or genome.
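As a minimal sketch of the breadth calculation described above, assuming loci are represented as simple integer identifiers, the following Python function divides the number of analyzed loci by the number of unmasked reference loci. The function name and the set-based representation are hypothetical choices for illustration.

```python
def sequencing_breadth(loci_analyzed, reference_loci, masked_loci=()):
    """Fraction of the unmasked reference loci that were analyzed.

    The denominator is the reference minus any masked (e.g., repeat-masked)
    loci, so 1.0 corresponds to the whole unmasked reference.
    """
    denominator = set(reference_loci) - set(masked_loci)
    if not denominator:
        raise ValueError("no unmasked reference loci to evaluate")
    covered = set(loci_analyzed) & denominator
    return len(covered) / len(denominator)


# Hypothetical reference of 1000 loci with the last 100 masked; 450 loci analyzed.
breadth = sequencing_breadth(range(0, 450), range(0, 1000), masked_loci=range(900, 1000))
is_broad = breadth >= 0.001  # "broad sequencing" threshold of at least 0.1% from the text
```

Here the breadth is 450/900, which can equivalently be reported as 0.5 or 50%.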
[0172] As used herein, the terms “sequence ratio” and “coverage ratio” interchangeably refer to any measurement of a number of units of a genomic sequence in a first one or more biological samples (e.g., a test and/or tumor sample) compared to the number of units of the respective genomic sequence in a second one or more biological samples (e.g., a reference and/or control sample). In some embodiments, a sequence ratio is a copy ratio, a log2-transformed copy ratio (e.g., log2 copy ratio), a coverage ratio, a base fraction, an allele fraction (e.g., a variant allele fraction), and/or a tumor ploidy. In some embodiments, a sequence ratio is a logN-transformed copy ratio, where N is any real number greater than 1.
[0173] As used herein, the term “sequencing probe” refers to a molecule that binds to a nucleic acid with affinity that is based on the expected nucleotide sequence of the RNA or DNA present at that locus.
[0174] As used herein, the term “targeted panel” or “targeted gene panel” refers to a combination of probes for sequencing (e.g., by next-generation sequencing) nucleic acids present in a biological sample from a subject (e.g., a tumor sample, liquid biopsy sample, germline tissue sample, white blood cell sample, or tumor or tissue organoid sample), selected to map to one or more loci of interest on one or more chromosomes. An example set of loci/genes useful for precision oncology, e.g., via solid or liquid biopsy assay, that can be analyzed using a targeted panel is described in Table 1. In some embodiments, in addition to loci that are informative for precision oncology, a targeted panel includes one or more probes for sequencing one or more of a locus associated with a different medical condition, a locus used for internal control purposes, or a locus from a pathogenic organism (e.g., an oncogenic pathogen).
[0175] As used herein, the term “reference exome” refers to any sequenced or otherwise characterized exome, whether partial or complete, of any tissue from any organism or pathogen that may be used to reference identified sequences from a subject. Typically, a reference exome will be derived from a subject of the same species as the subject whose
sequences are being evaluated. Example reference exomes used for human subjects as well as many other organisms are provided in the on-line genome browser hosted by the National Center for Biotechnology Information (“NCBI”). An “exome” refers to the complete transcriptional profile of an organism or pathogen, expressed in nucleic acid sequences. As used herein, a reference sequence or reference exome often is an assembled or partially assembled exomic sequence from an individual or multiple individuals. In some embodiments, a reference exome is an assembled or partially assembled exomic sequence from one or more human individuals. The reference exome can be viewed as a representative example of a species’ set of expressed genes. In some embodiments, a reference exome comprises sequences assigned to chromosomes.
[0176] As used herein, the term “reference genome” refers to any sequenced or otherwise characterized genome, whether partial or complete, of any organism or pathogen that may be used to reference identified sequences from a subject. Typically, a reference genome will be derived from a subject of the same species as the subject whose sequences are being evaluated. Exemplary reference genomes used for human subjects as well as many other organisms are provided in the on-line genome browser hosted by the National Center for Biotechnology Information (“NCBI”) or the University of California, Santa Cruz (UCSC). A “genome” refers to the complete genetic information of an organism or pathogen, expressed in nucleic acid sequences. As used herein, a reference sequence or reference genome often is an assembled or partially assembled genomic sequence from an individual or multiple individuals. In some embodiments, a reference genome is an assembled or partially assembled genomic sequence from one or more human individuals. The reference genome can be viewed as a representative example of a species’ set of genes. In some embodiments, a reference genome comprises sequences assigned to chromosomes. Exemplary human reference genomes include but are not limited to NCBI build 34 (UCSC equivalent: hg16), NCBI build 35 (UCSC equivalent: hg17), NCBI build 36.1 (UCSC equivalent: hg18), GRCh37 (UCSC equivalent: hg19), and GRCh38 (UCSC equivalent: hg38). For a haploid genome, there can be only one nucleotide at each locus. For a diploid genome, heterozygous loci can be identified; each heterozygous locus can have two alleles, where either allele can allow a match for alignment to the locus.
[0177] As used herein, the term “bioinformatics pipeline” refers to a series of processing stages used to determine characteristics of a subject’s genome or exome based on sequencing data of the subject’s genome or exome. A bioinformatics pipeline may be used to determine
characteristics of a germline genome or exome of a subject and/or a cancer genome or exome of a subject. In some embodiments, the pipeline extracts information related to genomic alterations in the cancer genome of a subject, which is useful for guiding clinical decisions for precision oncology, from sequencing results of a biological sample, e.g., a tumor sample, liquid biopsy sample, reference normal sample, etc., from the subject. Certain processing stages in a bioinformatics pipeline may be ‘connected,’ meaning that the results of a first respective processing stage are informative and/or essential for execution of a second, downstream processing stage. For instance, in some embodiments, a bioinformatics pipeline includes a first respective processing stage for identifying genomic alterations that are unique to the cancer genome of a subject and a second respective processing stage that uses the quantity and/or identity of the identified genomic alterations to determine a metric that is informative for precision oncology, e.g., a tumor mutational burden. In some embodiments, the bioinformatics pipeline includes a reporting stage that generates a report of relevant and/or actionable information identified by upstream stages of the pipeline, which may or may not further include recommendations for aiding clinical therapy decisions.
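The notion of ‘connected’ processing stages can be illustrated with the following simplified Python sketch, in which the output of a first stage is the required input of a second stage. The stage names, the representation of variants as tuples, the set-difference approach to identifying cancer-specific alterations, and the mutations-per-megabase form of tumor mutational burden are all assumptions made for this example, not the pipeline of the present disclosure.

```python
def identify_somatic_variants(sample_variants, germline_variants):
    """First stage (hypothetical): keep alterations not also seen in the germline."""
    return set(sample_variants) - set(germline_variants)


def tumor_mutational_burden(somatic_variants, panel_size_megabases):
    """Second stage (hypothetical): mutations per megabase, using the first stage's output."""
    if panel_size_megabases <= 0:
        raise ValueError("panel size must be positive")
    return len(somatic_variants) / panel_size_megabases


def run_pipeline(sample_variants, germline_variants, panel_size_megabases):
    """Chain the two connected stages and collect results for a reporting stage."""
    somatic = identify_somatic_variants(sample_variants, germline_variants)
    burden = tumor_mutational_burden(somatic, panel_size_megabases)
    return {"somatic_variants": sorted(somatic), "tmb_per_mb": burden}


# Hypothetical variants as (chromosome, position, reference allele, alternate allele).
sample = [("chr1", 100, "A", "T"), ("chr2", 200, "G", "C"), ("chr3", 300, "C", "T")]
germline = [("chr2", 200, "G", "C")]
report = run_pipeline(sample, germline, panel_size_megabases=2.0)
```

The second stage cannot run until the first has finished, which is the dependency the term ‘connected’ describes.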
[0178] As used herein, the term “limit of detection” or “LOD” refers to the minimal quantity of a feature that can be identified with a particular level of confidence. Accordingly, limit of detection can be used to describe an amount of a substance that must be present in order for a particular assay to reliably detect the substance. A limit of detection can also be used to describe a level of support needed for an algorithm to reliably identify a genomic alteration based on sequencing data, for example, a minimal number of unique sequence reads to support identification of a sequence variant such as a SNV.
[0179] As used herein, the term “BAM File” or “Binary file containing Alignment Maps” refers to a file storing sequencing data aligned to a reference sequence (e.g., a reference genome or exome). In some embodiments, a BAM file is a compressed binary version of a SAM (Sequence Alignment Map) file that includes, for each of a plurality of unique sequence reads, an identifier for the sequence read, information about the nucleotide sequence, information about the alignment of the sequence to a reference sequence, and optionally metrics relating to the quality of the sequence read and/or the quality of the sequence alignment. While BAM files generally relate to files having a particular format, for simplicity they are used herein to simply refer to a file, of any format, containing information about a sequence alignment, unless specifically stated otherwise.
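For illustration, the per-read information enumerated above (identifier, nucleotide sequence, alignment, and quality metrics) can be seen in a single text-format SAM record, which has eleven mandatory tab-separated fields. The following pure-Python sketch parses such a line; it is not a BAM reader, since BAM files are compressed binary and are ordinarily accessed through a dedicated library. The class name, field names, and example record are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AlignedRead:
    read_id: str           # QNAME: identifier for the sequence read
    flag: int              # FLAG: bitwise alignment flags
    reference_name: str    # RNAME: reference sequence the read aligned to
    position: int          # POS: 1-based leftmost aligned position
    mapping_quality: int   # MAPQ: quality of the alignment
    cigar: str             # CIGAR: how the read aligns to the reference
    sequence: str          # SEQ: nucleotide sequence of the read
    base_qualities: str    # QUAL: per-base quality string


def parse_sam_line(line):
    """Parse the mandatory fields of one SAM alignment line (header lines are not handled)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 tab-separated fields")
    return AlignedRead(
        read_id=fields[0],
        flag=int(fields[1]),
        reference_name=fields[2],
        position=int(fields[3]),
        mapping_quality=int(fields[4]),
        cigar=fields[5],
        sequence=fields[9],
        base_qualities=fields[10],
    )


# Hypothetical record for illustration only.
record = parse_sam_line("read1\t0\tchr7\t55191822\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII")
```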
[0180] As used herein, the term “measure of central tendency” refers to a central or representative value for a distribution of values. Non-limiting examples of measures of central tendency include an arithmetic mean, weighted mean, midrange, midhinge, trimean, geometric mean, geometric median, Winsorized mean, median, and mode of the distribution of values.
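Several of the listed measures can be computed with the Python standard library, as in the following minimal sketch. The function name and the example values are hypothetical, and only a subset of the non-limiting examples is shown.

```python
from statistics import geometric_mean, mean, median, mode


def central_tendency_summary(values):
    """Return several measures of central tendency for a non-empty list of positive numbers."""
    return {
        "arithmetic_mean": mean(values),
        "median": median(values),
        "mode": mode(values),
        "midrange": (min(values) + max(values)) / 2,   # midpoint of the extremes
        "geometric_mean": geometric_mean(values),      # requires positive values
    }


# Hypothetical distribution with one large outlier.
summary = central_tendency_summary([2, 4, 4, 8, 32])
```

For this skewed distribution the arithmetic mean (10) and midrange (17) are pulled toward the outlier, while the median and mode (both 4) are not, which is one reason a particular measure may be preferred for a given data set.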
[0181] As used herein, the term “Positive Predictive Value” or “PPV” means the likelihood that a variant is properly called given that a variant has been called by an assay. PPV can be expressed as (number of true positives)/ (number of false positives + number of true positives).
[0182] As used herein, the term “assay” refers to a technique for determining a property of a substance, e.g., a nucleic acid, a protein, a cell, a tissue, or an organ. An assay (e.g., a first assay or a second assay) can comprise a technique for determining the copy number variation of nucleic acids in a sample, the methylation status of nucleic acids in a sample, the fragment size distribution of nucleic acids in a sample, the mutational status of nucleic acids in a sample, or the fragmentation pattern of nucleic acids in a sample. Any assay known to a person having ordinary skill in the art can be used to detect any of the properties of nucleic acids mentioned herein. Properties of a nucleic acid can include a sequence, genomic identity, copy number, methylation state at one or more nucleotide positions, size of the nucleic acid, presence or absence of a mutation in the nucleic acid at one or more nucleotide positions, and pattern of fragmentation of a nucleic acid (e.g., the nucleotide position(s) at which a nucleic acid fragments). An assay or method can have a particular sensitivity and/or specificity, and its relative usefulness as a diagnostic tool can be measured using ROC-AUC statistics.
[0183] As used herein, the term “classification” can refer to any number(s) or other character(s) that are associated with a particular property of a sample. For example, in some embodiments, the term “classification” can refer to a type of cancer in a subject, a stage of cancer in a subject, a prognosis for a cancer in a subject, a tumor load, a presence of tumor metastasis in a subject, and the like. The classification can be binary (e.g., positive or negative) or have more levels of classification (e.g., a scale from 1 to 10 or 0 to 1). The terms “cutoff” and “threshold” can refer to predetermined numbers used in an operation. For example, a cutoff size can refer to a size above which fragments are excluded. A threshold value can be a value above or below which a particular classification applies. Either of these terms can be used in either of these contexts.
[0184] As used herein, the term “sensitivity” or “true positive rate” (TPR) refers to the number of true positives divided by the sum of the number of true positives and false negatives. Sensitivity can characterize the ability of an assay or method to correctly identify a proportion of the population that truly has a condition. For example, sensitivity can characterize the ability of a method to correctly identify the number of subjects within a population having cancer. In another example, sensitivity can characterize the ability of a method to correctly identify the one or more markers indicative of cancer.
[0185] As used herein, the term “specificity” or “true negative rate” (TNR) refers to the number of true negatives divided by the sum of the number of true negatives and false positives. Specificity can characterize the ability of an assay or method to correctly identify a proportion of the population that truly does not have a condition. For example, specificity can characterize the ability of a method to correctly identify the number of subjects within a population not having cancer. In another example, specificity characterizes the ability of a method to correctly identify one or more markers indicative of cancer.
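The definitions of sensitivity and specificity, together with the Positive Predictive Value defined earlier, can be expressed directly from confusion-matrix counts. The following Python sketch is illustrative only; the function name and the example counts are hypothetical.

```python
def classification_metrics(true_pos, false_pos, true_neg, false_neg):
    """Compute sensitivity (TPR), specificity (TNR), and PPV from confusion-matrix counts."""
    def safe_divide(numerator, denominator):
        # Return None when a metric is undefined (no cases in the denominator).
        return numerator / denominator if denominator else None

    return {
        "sensitivity": safe_divide(true_pos, true_pos + false_neg),   # TP / (TP + FN)
        "specificity": safe_divide(true_neg, true_neg + false_pos),   # TN / (TN + FP)
        "ppv": safe_divide(true_pos, true_pos + false_pos),           # TP / (TP + FP)
    }


# Hypothetical counts: 100 subjects with the condition, 1000 without.
metrics = classification_metrics(true_pos=90, false_pos=50, true_neg=950, false_neg=10)
```

With these counts the sensitivity is 0.90 and the specificity is 0.95, yet the PPV is only 90/140, showing that PPV also depends on how common the condition is in the tested population.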
[0186] As used herein, an “actionable genomic alteration” or “actionable variant” refers to a genomic alteration (e.g., a SNV, MNV, indel, rearrangement, copy number variation, or ploidy variation), or value of another cancer metric derived from nucleic acid sequencing data (e.g., a tumor mutational burden, MSI status, or tumor fraction), that is known or believed to be associated with a therapeutic course of action that is more likely to produce a positive effect in a cancer patient that has the actionable variant than in a similarly situated cancer patient that does not have the actionable variant. For instance, administration of EGFR inhibitors (e.g., afatinib, erlotinib, gefitinib) is more effective for treating non-small cell lung cancer in patients with an EGFR mutation in exons 19/21 than for treating non-small cell lung cancer in patients that do not have an EGFR mutation in exons 19/21. Accordingly, an EGFR mutation in exon 19/21 is an actionable variant. In some instances, an actionable variant is only associated with an improved treatment outcome in one or a group of specific cancer types. In other instances, an actionable variant is associated with an improved treatment outcome in substantially all cancer types.
[0187] As used herein, a “variant of uncertain significance” or “VUS” refers to a genomic alteration (e.g., a SNV, MNV, indel, rearrangement, copy number variation, or ploidy variation), or value of another cancer metric derived from nucleic acid sequencing data (e.g., a tumor mutational burden, MSI status, or tumor fraction), whose impact on disease development/progression is unknown.
[0188] As used herein, a “benign variant” or “likely benign variant” refers to a genomic alteration (e.g., a SNV, MNV, indel, rearrangement, copy number variation, or ploidy variation), or value of another cancer metric derived from nucleic acid sequencing data (e.g., a tumor mutational burden, MSI status, or tumor fraction), that is known or believed to not contribute to disease development/progression.
[0189] As used herein, a “pathogenic variant” or “likely pathogenic variant” refers to a genomic alteration (e.g., a SNV, MNV, indel, rearrangement, copy number variation, or ploidy variation), or value of another cancer metric derived from nucleic acid sequencing data (e.g., a tumor mutational burden, MSI status, or tumor fraction), that is known or believed to contribute to disease development/progression.
[0190] As used herein, an “effective amount” or “therapeutically effective amount” is an amount sufficient to effect a beneficial or desired clinical result upon treatment. An effective amount can be administered to a subject in one or more doses. In terms of treatment, an effective amount is an amount that is sufficient to palliate, ameliorate, stabilize, reverse or slow the progression of the disease, or otherwise reduce the pathological consequences of the disease. The effective amount is generally determined by the physician on a case-by-case basis and is within the skill of one in the art. Several factors are typically taken into account when determining an appropriate dosage to achieve an effective amount. These factors include age, sex and weight of the subject, the condition being treated, the severity of the condition and the form and effective concentration of the therapeutic agent being administered.
[0191] The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the description of the invention and the appended claims, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Furthermore, to the extent that the terms “including,” “includes,” “having,” “has,” “with,” or
variants thereof are used in either the detailed description and/or the claims, such terms are intended to be inclusive in a manner similar to the term “comprising.”
[0192] As used herein, the term “if” may be construed to mean “when” or “upon” or “in response to determining” or “in response to detecting,” depending on the context. Similarly, the phrase “if it is determined” or “if [a stated condition or event] is detected” may be construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event],” depending on the context.
[0193] It will also be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first subject could be termed a second subject, and, similarly, a second subject could be termed a first subject, without departing from the scope of the present disclosure. The first subject and the second subject are both subjects, but they are not the same subject. Furthermore, the terms “subject,” “user,” and “patient” are used interchangeably herein.
[0194] Reference will now be made in detail to embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, including example systems, methods, techniques, instruction sequences, and computing machine program products that embody illustrative implementations. However, the illustrative discussions below are not intended to be exhaustive or to limit the implementations to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The features described herein are not limited by the illustrated ordering of acts or events, as some acts can occur in different orders and/or concurrently with other acts or events.
[0195] The implementations provided herein are chosen and described in order to best explain the principles and their practical applications, to thereby enable others skilled in the art to best utilize the various embodiments with various modifications as are suited to the particular use contemplated. In some instances, well-known methods, procedures, components, circuits, and networks have not been described in detail so as not to unnecessarily obscure aspects of the embodiments. In other instances, it will be apparent to
one of ordinary skill in the art that the present disclosure may be practiced without one or more of the specific details.
[0196] It will be appreciated that, in the development of any such actual implementation, numerous implementation-specific decisions are made in order to achieve the designer’s specific goals, such as compliance with use case- and business-related constraints, and that these specific goals will vary from one implementation to another and from one designer to another. Moreover, it will be appreciated that though such a design effort might be complex and time-consuming, it will nevertheless be a routine undertaking of engineering for those of ordinary skill in the art having the benefit of the present disclosure.
Example System Embodiments
[0197] Now that an overview of some aspects of the present disclosure and some definitions used in the present disclosure have been provided, details of an exemplary system for providing clinical support for personalized cancer therapy using a liquid biopsy assay are now described in conjunction with Figures 1A, 1B, 1C1, 1D1, 1C2, 1D2, 1E2, 1F2, 1C3, and 1D3. Figures 1A, 1B, 1C1, 1D1, 1C2, 1D2, 1E2, 1F2, 1C3, and 1D3 collectively illustrate the topology of an example system for providing clinical support for personalized cancer therapy using a liquid biopsy assay, in accordance with some embodiments of the present disclosure. Advantageously, the example system illustrated in Figures 1A, 1B, 1C1, 1D1, 1C2, 1D2, 1E2, 1F2, 1C3, and 1D3 improves upon conventional methods for providing clinical support for personalized cancer therapy by validating copy number variations, thus identifying focal copy number variations for actionable treatment, validating a somatic sequence variant in a test subject having a cancer condition, and/or determining circulating tumor fraction estimates using on-target and off-target sequence reads.
[0198] Figure 1A is a block diagram illustrating a system in accordance with some implementations. The device 100 in some implementations includes one or more processing units CPU(s) 102 (also referred to as processors), one or more network interfaces 104, a user interface 106, e.g., including a display 108 and/or an input 110 (e.g., a mouse, touchpad, keyboard, etc.), a non-persistent memory 111, a persistent memory 112, and one or more communication buses 114 for interconnecting these components. The one or more communication buses 114 optionally include circuitry (sometimes called a chipset) that interconnects and controls communications between system components. The non-persistent memory 111 typically includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, ROM, EEPROM, flash memory, whereas the persistent memory 112 typically
includes CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, magnetic disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid state storage devices. The persistent memory 112 optionally includes one or more storage devices remotely located from the CPU(s) 102. The persistent memory 112, and the non-volatile memory device(s) within the non-persistent memory 111, comprise non-transitory computer readable storage medium. In some implementations, the non-persistent memory 111 or alternatively the non-transitory computer readable storage medium stores the following programs, modules and data structures, or a subset thereof, sometimes in conjunction with the persistent memory 112:
• an operating system 116, which includes procedures for handling various basic system services and for performing hardware dependent tasks;
• a network communication module (or instructions) 118 for connecting the system 100 with other devices and/or a communication network 105;
• a test patient data store 120 for storing one or more collections of features from patients (e.g., subjects);
• a bioinformatics module 140 for processing sequencing data and extracting features from sequencing data, e.g., from liquid biopsy sequencing assays;
• a feature analysis module 160 for evaluating patient features, e.g., genomic alterations, compound genomic features, and clinical features; and
• a reporting module 180 for generating and transmitting reports that provide clinical support for personalized cancer therapy.
[0199] Although Figures 1A, 1B, 1C1, 1D1, 1C2, 1D2, 1E2, 1F2, 1C3, and 1D3 depict a “system 100,” the figures are intended more as a functional description of the various features that may be present in computer systems than as a structural schematic of the implementations described herein. In practice, and as recognized by those of ordinary skill in the art, items shown separately could be combined and some items could be separated. Moreover, although Figure 1 depicts certain data and modules in non-persistent memory 111, some or all of these data and modules may be in persistent memory 112. For example, in various implementations, one or more of the above identified elements are stored in one or more of the previously mentioned memory devices and correspond to a set of instructions for performing a function described above. The above identified modules, data, or programs (e.g., sets of instructions) need not be implemented as separate software programs, procedures, datasets, or modules, and thus various subsets of these modules and data may be combined or otherwise re-arranged in various implementations.
[0200] In some implementations, the non-persistent memory 111 optionally stores a subset of the modules and data structures identified above. Furthermore, in some embodiments, the memory stores additional modules and data structures not described above. In some embodiments, one or more of the above-identified elements is stored in a computer system, other than that of system 100, that is addressable by system 100 so that system 100 may retrieve all or a portion of such data when needed.
[0201] For purposes of illustration in Figure 1A, system 100 is represented as a single computer that includes all of the functionality for providing clinical support for personalized cancer therapy. However, while a single machine is illustrated, the term “system” shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.
[0202] For example, in some embodiments, system 100 includes one or more computers. In some embodiments, the functionality for providing clinical support for personalized cancer therapy is spread across any number of networked computers and/or resides on each of several networked computers and/or is hosted on one or more virtual machines at a remote location accessible across the communications network 105. For example, different portions of the various modules and data stores illustrated in Figures 1A, 1B, 1C1, 1D1, 1C2, 1D2, 1E2, 1F2, 1C3, and 1D3 can be stored and/or executed on the various instances of a processing device and/or processing server/database in the distributed diagnostic environment 210 illustrated in Figure 2B (e.g., processing devices 224, 234, 244, and 254, processing server 262, and database 264).
[0203] The system may operate in the capacity of a server or a client machine in a client-server network environment, as a peer machine in a peer-to-peer (or distributed) network environment, or as a server or a client machine in a cloud computing infrastructure or environment. The system may be a personal computer (PC), a tablet PC, a set-top box (STB), a Personal Digital Assistant (PDA), a cellular telephone, a web appliance, a server, a network router, a switch or bridge, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine.
[0204] In another implementation, the system comprises a virtual machine that includes a module for executing instructions for performing any one or more of the methodologies disclosed herein. In computing, a virtual machine (VM) is an emulation of a computer system that is based on computer architectures and provides functionality of a physical computer. Some such implementations may involve specialized hardware, software, or a combination of hardware and software.
[0205] One of skill in the art will appreciate that any of a wide array of different computer topologies are used for the application and all such topologies are within the scope of the present disclosure.
Test Patient Data Store (120)
[0206] Referring to Figure 1B, in some embodiments, the system (e.g., system 100) includes a patient data store 120 that stores data for patients 121-1 to 121-M (e.g., cancer patients or patients being tested for cancer) including one or more sequencing data 122, feature data 125, and clinical assessments 139. These data are used and/or generated by the various processes stored in the bioinformatics module 140 and feature analysis module 160 of system 100, to ultimately generate a report providing clinical support for personalized cancer therapy of a patient. While the feature scope of patient data 121 across all patients may be informationally dense, an individual patient’s feature set may be sparsely populated across the entirety of the collective feature scope of all features across all patients. That is to say, the data stored for one patient may include a different set of features than the data stored for another patient. Further, while illustrated as a single data construct in Figure 1B, different sets of patient data may be stored in different databases or modules spread across one or more system memories.
[0207] In some embodiments, sequencing data 122 from one or more sequencing reactions 122-i, including a plurality of sequence reads 123-1 to 123-K, is stored in the test patient data store 120. The data store may include different sets of sequencing data from a single subject, corresponding to different samples from the patient, e.g., a tumor sample, liquid biopsy sample, tumor organoid derived from a patient tumor, and/or a normal sample, and/or to samples acquired at different times, e.g., while monitoring the progression, regression, remission, and/or recurrence of a cancer in a subject. The sequence reads may be in any suitable file format, e.g., BCL, FASTA, FASTQ, etc. In some embodiments, sequencing data 122 is accessed by a sequencing data processing module 141, which performs various pre-processing, genome alignment, and demultiplexing operations, as described in detail below with reference to bioinformatics module 140. In some embodiments, sequence data that has been aligned to a reference construct, e.g., BAM file 124, is stored in test patient data store 120.
[0208] In some embodiments, the test patient data store 120 includes feature data 125, e.g., that is useful for identifying clinical support for personalized cancer therapy. In some embodiments, the feature data 125 includes personal characteristics 126 of the patient, such as patient name, date of birth, gender, ethnicity, physical address, smoking status, alcohol consumption characteristic, anthropometric data, etc.
[0209] In some embodiments, the feature data 125 includes medical history data 127 for the patient, such as cancer diagnosis information (e.g., date of initial diagnosis, date of metastatic diagnosis, cancer staging, tumor characterization, tissue of origin, previous treatments and outcomes, adverse effects of therapy, therapy group history, clinical trial history, previous and current medications, surgical history, etc.), previous or current symptoms, previous or current therapies, previous treatment outcomes, previous disease diagnoses, diabetes status, diagnoses of depression, diagnoses of other physical or mental maladies, and family medical history. In some embodiments, the feature data 125 includes clinical features 128, such as pathology data 128-1, medical imaging data 128-2, and tissue culture and/or tissue organoid culture data 128-3.
[0210] In some embodiments, yet other clinical features, such as previous laboratory testing results, are stored in the test patient data store 120. Medical history data 127 and clinical features may be collected from various sources, including at intake directly from the patient, from an electronic medical record (EMR) or electronic health record (EHR) for the patient, or curated from other sources, such as fields from various testing records (e.g., genetic sequencing reports).
[0211] In some embodiments, the feature data 125 includes genomic features 131 for the patient. Non-limiting examples of genomic features include allelic states 132 (e.g., the identity of alleles at one or more loci, support for wild type or variant alleles at one or more loci, support for SNVs/MNVs at one or more loci, support for indels at one or more loci, and/or support for gene rearrangements at one or more loci), allelic fractions 133 (e.g., ratios of variant to reference alleles (or vice versa)), methylation states 134 (e.g., a distribution of methylation patterns at one or more loci and/or support for aberrant methylation patterns at one or more loci), genomic copy numbers 135 (e.g., a copy number value at one or more loci and/or support for an aberrant (increased or decreased) copy number at one or more loci), tumor mutational burden 136 (e.g., a measure of the number of mutations in the cancer genome of the subject), and microsatellite instability status 137 (e.g., a measure of the repeated unit length at one or more microsatellite loci and/or a classification of the MSI status for the patient’s cancer). In some embodiments, one or more of the genomic features 131 are determined by a nucleic acid bioinformatics pipeline, e.g., as described in detail below with reference to Figure 4 (e.g., Figures 4A-E, 4F1, 4F2, and 4F3). In particular, in some embodiments, the feature data 125 include genomic copy numbers 135 (e.g., 135-1 for Patient 1 121-1), variant allele fractions 133, and/or circulating tumor fraction estimates 131-i, as determined using the improved methods for analyzing copy number variations (CNVs) using the copy number variation analysis module 153, validating somatic sequence variants, and/or determining circulating tumor fraction estimates, and as described in further detail below with reference to Figures 1 and 4 (e.g., Figures 1C1, 1D1, 4F1; Figures 1C2, 1D2, and 4F2; and/or Figures 1C3, 1D3, and 4F3). In some embodiments, one or more of the genomic features 131 are obtained from an external testing source, e.g., not connected to the bioinformatics pipeline as described below.
[0212] For example, referring to Figure 1C1, the one or more genomic features 131 include genomic copy numbers 135 comprising liquid biopsy genomic copy numbers 135-cf and optional tumor biopsy genomic copy numbers 135-t, in accordance with some embodiments of the present disclosure. In some embodiments, the liquid biopsy genomic copy numbers 135-cf are determined by a nucleic acid bioinformatics pipeline (e.g., as described in detail below with reference to Figures 4A-E and 4F1) using a plurality of sequence reads 123 obtained from a sequencing of cell-free nucleic acids from a liquid biopsy sample. In some embodiments, the liquid biopsy genomic copy numbers comprise a plurality of copy number annotations (e.g., 135-cf-1, 135-cf-2, ..., 135-cf-N), where each copy number annotation corresponds to a genomic target (e.g., a gene or a region of a genome). In some embodiments, a copy number annotation comprises a qualitative status and/or a quantitative copy number. In some alternative embodiments, the optional tumor biopsy genomic copy numbers 135-t are determined by a nucleic acid bioinformatics pipeline using a plurality of sequence reads 123 obtained from a sequencing of nucleic acids from a tumor (e.g., tissue) biopsy. In some embodiments, the optional tumor biopsy genomic copy numbers comprise a plurality of optional copy number annotations (e.g., 135-1-t-1, 135-1-t-2, ..., 135-1-t-O), where each copy number annotation corresponds to a genomic target (e.g., a gene or a region of a genome).
[0213] Referring again to Figure 1B, in some embodiments, the feature data 125 further includes data 138 from other -omics fields of study. Non-limiting examples of -omics fields of study that may yield feature data useful for providing clinical support for personalized cancer therapy include transcriptomics, epigenomics, proteomics, metabolomics, metabonomics, microbiomics, lipidomics, glycomics, cellomics, and organoidomics.
[0214] In some embodiments, yet other features may include features derived from machine learning approaches, e.g., based at least in part on evaluation of any relevant molecular or clinical features, considered alone or in combination, not limited to those listed above. For instance, in some embodiments, one or more latent features learned from evaluation of cancer patient training datasets improve the diagnostic and prognostic power of the various analysis algorithms in the feature analysis module 160.
[0215] The skilled artisan will know of other types of features useful for providing clinical support for personalized cancer therapy. The listing of features above is merely representative and should not be construed to be limiting.
[0216] In some embodiments, a test patient data store 120 includes clinical assessment data 139 for patients, e.g., based on the feature data 125 collected for the subject. In some embodiments, the clinical assessment data 139 includes a catalogue of actionable variants and characteristics 139-1 (e.g., genomic alterations and compound metrics based on genomic features known or believed to be targetable by one or more specific cancer therapies), matched therapies 139-2 (e.g., the therapies known or believed to be particularly beneficial for treatment of subjects having actionable variants), and/or clinical reports 139-3 generated for the subject, e.g., based on identified actionable variants and characteristics 139-1 and/or matched therapies 139-2.
[0217] In some embodiments, clinical assessment data 139 is generated by analysis of feature data 125 using the various algorithms of feature analysis module 160, as described in further detail below. In some embodiments, clinical assessment data 139 is generated, modified, and/or validated by evaluation of feature data 125 by a clinician, e.g., an oncologist. For instance, in some embodiments, a clinician (e.g., at clinical environment 220) uses feature analysis module 160, or accesses test patient data store 120 directly, to evaluate feature data 125 to make recommendations for personalized cancer treatment of a patient.
Similarly, in some embodiments, a clinician (e.g., at clinical environment 220) reviews recommendations determined using feature analysis module 160 and approves, rejects, or modifies the recommendations, e.g., prior to the recommendations being sent to a medical professional treating the cancer patient.
Bioinformatics Module (140)
[0218] Referring again to Figure 1A, the system (e.g., system 100) includes a bioinformatics module 140 that includes a feature extraction module 145 and optional ancillary data processing constructs, such as a sequence data processing module 141 and/or one or more reference sequence constructs 158 (e.g., a reference genome, exome, or targeted-panel construct that includes reference sequences for a plurality of loci targeted by a sequencing panel).
[0219] In some embodiments, bioinformatics module 140 includes a sequence data processing module 141 that includes instructions for processing sequence reads, e.g., raw sequence reads 123 from one or more sequencing reactions 122-i, prior to analysis by the various feature extraction algorithms, as described in detail below. In some embodiments, sequence data processing module 141 includes one or more pre-processing algorithms 142 that prepare the data for analysis. In some embodiments, the pre-processing algorithms 142 include instructions for converting the file format of the sequence reads from the output of the sequencer (e.g., a BCL file format) into a file format compatible with downstream analysis of the sequences (e.g., a FASTQ or FASTA file format). In some embodiments, the pre-processing algorithms 142 include instructions for evaluating the quality of the sequence reads (e.g., by interrogating quality metrics like Phred score, base-calling error probabilities, Quality (Q) scores, and the like) and/or removing sequence reads that do not satisfy a threshold quality (e.g., an inferred base call accuracy of at least 80%, at least 90%, at least 95%, at least 99%, at least 99.5%, at least 99.9%, or higher). In some embodiments, the pre-processing algorithms 142 include instructions for filtering the sequence reads for one or more properties, e.g., removing sequences failing to satisfy a lower or upper size threshold or removing duplicate sequence reads.
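As a non-limiting illustration of the quality and property filters described above, the following minimal Python sketch converts Phred scores to base-calling error probabilities using the standard relation p = 10^(-Q/10), and then removes reads that fail an accuracy threshold, fall outside a size window, or duplicate an earlier read. The function names, the Phred+33 quality encoding, the default thresholds, and the use of exact sequence identity to detect duplicates are assumptions made for this example and are not taken from the disclosure.

```python
def phred_to_error_prob(q: int) -> float:
    """Standard Phred relation: error probability p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def mean_accuracy(quality_string: str, offset: int = 33) -> float:
    """Mean inferred base-call accuracy of one read.

    Assumes a Phred+33 encoded FASTQ quality string (an assumption of this sketch).
    """
    probs = [phred_to_error_prob(ord(c) - offset) for c in quality_string]
    return 1.0 - sum(probs) / len(probs)


def filter_reads(reads, min_accuracy=0.99, min_len=20, max_len=300):
    """Keep reads that pass the size, quality, and duplicate filters.

    `reads` is an iterable of (name, sequence, quality_string) tuples.
    Duplicates are detected by exact sequence identity, which is a
    simplification chosen for illustration only.
    """
    seen = set()
    kept = []
    for name, seq, qual in reads:
        if not (min_len <= len(seq) <= max_len):
            continue  # fails the lower or upper size threshold
        if mean_accuracy(qual) < min_accuracy:
            continue  # fails the threshold quality
        if seq in seen:
            continue  # duplicate sequence read
        seen.add(seq)
        kept.append((name, seq, qual))
    return kept
```

In this sketch the accuracy threshold is applied to the mean per-base accuracy of each read; a per-base or windowed criterion would be an equally plausible reading of the passage.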
[0220] In some embodiments, sequence data processing module 141 includes one or more alignment algorithms 143, for aligning pre-processed sequence reads 123 to a reference sequence construct 158, e.g., a reference genome, exome, or targeted-panel construct. Many algorithms for aligning sequencing data to a reference construct are known in the art, for example, BWA, Blat, SHRiMP, LastZ, and MAQ. One example of a sequence read alignment package is the Burrows-Wheeler Alignment tool (BWA), which uses a Burrows-Wheeler Transform (BWT) to align short sequence reads against a large reference construct, allowing for mismatches and gaps. See Li and Durbin, Bioinformatics, 25(14): 1754-60 (2009), the content of which is incorporated herein by reference, in its entirety, for all purposes. Sequence read alignment packages import raw or pre-processed sequence reads 122, e.g., in BCL, FASTA, or FASTQ file formats, and output aligned sequence reads 124, e.g., in SAM or BAM file formats.
[0221] In some embodiments, sequence data processing module 141 includes one or more demultiplexing algorithms 144, for dividing sequence read or sequence alignment files generated from sequencing reactions of pooled nucleic acids into separate sequence read or sequence alignment files, each of which corresponds to a different source of nucleic acids in the nucleic acid sequencing pool. For instance, because of the cost of sequencing reactions, it is common practice to pool nucleic acids from a plurality of samples into a single sequencing reaction. The nucleic acids from each sample are tagged with a sample-specific and/or molecule-specific sequence tag (e.g., a UMI), which is sequenced along with the molecule.
In some embodiments, demultiplexing algorithms 144 sort these sequence tags in the sequence read or sequence alignment files to demultiplex the sequencing data into separate files for each of the samples included in the sequencing reaction.
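By way of illustration only, the demultiplexing step described above can be sketched as follows. The sketch assumes, hypothetically, that each read begins with a fixed-length sample-specific tag and that tags must match exactly; the function name and data layout are not part of the disclosure, and practical demultiplexers typically also tolerate sequencing errors in the tag.

```python
from collections import defaultdict


def demultiplex(reads, tag_to_sample, tag_len):
    """Split pooled reads into per-sample groups by their leading sequence tag.

    `reads` is an iterable of (name, sequence) tuples and `tag_to_sample`
    maps each tag to a sample identifier. Returns (per_sample, unassigned),
    where per_sample maps sample identifier to a list of (name, insert)
    tuples with the tag trimmed off.
    """
    per_sample = defaultdict(list)
    unassigned = []
    for name, seq in reads:
        tag, insert = seq[:tag_len], seq[tag_len:]
        sample = tag_to_sample.get(tag)
        if sample is None:
            unassigned.append((name, seq))  # tag not recognized
        else:
            per_sample[sample].append((name, insert))
    return dict(per_sample), unassigned
```

Reads whose tag is not recognized are returned separately rather than discarded, so that the fraction of unassigned reads can be inspected as a quality check.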
[0222] Bioinformatics module 140 includes a feature extraction module 145, which includes instructions for identifying diagnostic features, e.g., genomic features 131, from sequencing data 122 of biological samples from a subject, e.g., one or more of a solid tumor sample, a liquid biopsy sample, or a normal tissue (e.g., control) sample. For instance, in some embodiments, a feature extraction algorithm compares the identity of one or more nucleotides at a locus from the sequencing data 122 to the identity of the nucleotides at that locus in a reference sequence construct (e.g., a reference genome, exome, or targeted-panel construct) to determine whether the subject has a variant at that locus. In some embodiments, a feature extraction algorithm evaluates data other than the raw sequence, to identify a genomic alteration in the subject, e.g., an allelic ratio, a relative copy number, a repeat unit distribution, etc.
[0223] For instance, in some embodiments, feature extraction module 145 includes one or more variant identification modules that include instructions for various variant calling processes. In some embodiments, variants in the germline of the subject are identified, e.g., using a germline variant identification module 146. In some embodiments, variants in the cancer genome, e.g., somatic variants, are identified, e.g., using a somatic variant identification module 150. While separate germline and somatic variant identification modules are illustrated in Figure 1A, in some embodiments they are integrated into a single module. In some embodiments, the variant identification module includes instructions for identifying one or more of nucleotide variants (e.g., single nucleotide variants (SNV) and multi-nucleotide variants (MNV)) using one or more SNV/MNV calling algorithms (e.g., algorithms 147 and/or 151), indels (e.g., insertions or deletions of nucleotides) using one or more indel calling algorithms (e.g., algorithms 148 and/or 152), and genomic rearrangements (e.g., inversions, translocations, and fusions of nucleotide sequences) using one or more genomic rearrangement calling algorithms (e.g., algorithms 149 and/or 153).
[0224] For example, referring to Figures 1C2 and 1D2, in some embodiments, feature extraction module 145 comprises, in the variant identification module 146, a variant thresholding module 146-a, a sequence variant data store 146-r, and a variant validation module 146-o. In some such embodiments, the sequence variant data store 146-r comprises one or more candidate variants for a test subject identified by aligning to a reference sequence a plurality of sequence reads obtained from sequencing a liquid biopsy sample of the test subject, the one or more candidate variants corresponding to a respective one or more loci in the reference sequence. The plurality of sequence reads aligned to the reference sequence is used to identify a variant allele fragment count for each candidate variant. The sequence variant data store 146-r further comprises, in some embodiments, a plurality of variants from a first set of nucleic acids obtained from a cohort of subjects (e.g., from a tumor tissue biopsy for each subject in a baseline cohort of subjects). The variant thresholding module 146-a performs a function for each candidate variant in the one or more candidate variants where, for each corresponding locus 146-b (e.g., 146-b-1, ..., 146-b-P), a dynamic variant count threshold 146-d (e.g., 146-d-1) is obtained based on a pre-test odds of a positive variant call for the locus, based on the prevalence of variants in the genomic region that includes the locus, using the plurality of variants for the baseline cohort. The variant thresholding module 146-a compares the variant allele fragment count 146-c (e.g., 146-c-1) for the candidate variant against the dynamic variant count threshold 146-d for the locus corresponding to the candidate variant. In some embodiments, the variant validation module 146-o determines whether the candidate variant is validated or rejected as a somatic sequence variant based on the comparison. For example, when the variant allele fragment count for the candidate variant satisfies the dynamic variant count threshold for the locus, the somatic sequence variant is validated, and when the variant allele fragment count for the candidate variant does not satisfy the dynamic variant count threshold for the locus, the somatic sequence variant is rejected.
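The comparison described above can be illustrated with a toy model. The passage does not give a formula for the dynamic variant count threshold, so the following sketch makes its own assumptions: the pre-test odds are taken as prevalence / (1 - prevalence), each supporting fragment is assumed to multiply the odds by a fixed, hypothetical likelihood ratio, and the threshold is the smallest fragment count that brings the post-test odds to a hypothetical target. All names and default values are illustrative and are not the method of the disclosure.

```python
import math


def pretest_odds(prevalence: float) -> float:
    """Convert a regional variant prevalence (0 < p < 1) into odds."""
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must be strictly between 0 and 1")
    return prevalence / (1.0 - prevalence)


def dynamic_count_threshold(prevalence, lr_per_fragment=10.0,
                            target_odds=99.0, floor=1):
    """Smallest fragment count n with pretest_odds * lr_per_fragment**n >= target_odds.

    `lr_per_fragment`, `target_odds`, and `floor` are hypothetical tuning
    parameters introduced for this sketch only.
    """
    needed = math.log(target_odds / pretest_odds(prevalence)) / math.log(lr_per_fragment)
    return max(floor, math.ceil(needed))


def validate_variant(fragment_count, prevalence, **params):
    """Validated when the fragment count satisfies the locus threshold, else rejected."""
    return fragment_count >= dynamic_count_threshold(prevalence, **params)
```

Under these assumptions, a locus in a frequently mutated region requires fewer supporting fragments than a locus in a rarely mutated region, which is the qualitative behavior a prevalence-dependent threshold is meant to provide.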
[0225] In some embodiments, the dynamic variant count threshold is determined based on a distribution of variant detection sensitivities as a function of circulating variant allele fraction from the cohort of subjects (e.g., the baseline cohort). For example, referring to Figure 1C2, in some such embodiments, the variant thresholding module 146-a takes as input one or more variant allele fractions 133 from the genomic features module 131. In some such embodiments, the variant allele fractions 133 comprise a plurality of variant allele fractions obtained from tumor tissue biopsies 133-t (e.g., 133-t-1, 133-t-2, ..., 133-t-O) for the cohort of subjects. In some embodiments, the variant allele fractions comprise a plurality of variant allele fractions obtained from liquid biopsy samples 133-cf (e.g., 133-cf-1, 133-cf-2, ..., 133-cf-N) for the cohort of subjects. In some embodiments, the circulating variant allele fraction is obtained by comparing the liquid biopsy variant allele fractions 133-cf to the tumor biopsy variant allele fraction 133-t.
[0226] Additional embodiments for using variant allele fractions (e.g., variant allele frequencies) to identify somatic variants are detailed below (see, Example Methods: Variant Identification).
[0227] A SNV/MNV algorithm 147 may identify a substitution of a single nucleotide that occurs at a specific position in the genome. For example, at a specific base position, or locus, in the human genome, the C nucleotide may appear in most individuals, but in a minority of individuals, the position is occupied by an A. This means that there is a SNP at this specific position and the two possible nucleotide variations, C or A, are said to be alleles for this position. SNPs underlie differences in human susceptibility to a wide range of diseases (e.g., sickle-cell anemia, β-thalassemia and cystic fibrosis result from SNPs). The severity of illness and the way the body responds to treatments are also manifestations of genetic variations. For example, a single-base mutation in the APOE (apolipoprotein E) gene is associated with a lower risk for Alzheimer's disease. A single-nucleotide variant (SNV) is a variation in a single nucleotide without any limitations of frequency and may arise in somatic cells. A somatic single-nucleotide variation (e.g., caused by cancer) may also be called a single-nucleotide alteration. An MNP (multiple-nucleotide polymorphism) module may identify the substitution of consecutive nucleotides at a specific position in the genome.
[0228] An indel calling algorithm 148 may identify an insertion or deletion of bases in the genome of an organism classified among small genetic variations. While indels usually measure from 1 to 10 000 base pairs in length, a microindel is defined as an indel that results in a net change of 1 to 50 nucleotides. Indels can be contrasted with a SNP or point mutation. An indel inserts and/or deletes nucleotides from a sequence, while a point mutation is a form of substitution that replaces one of the nucleotides without changing the overall number in the DNA. Indels, being insertions and/or deletions, can be used as genetic markers in natural populations, especially in phylogenetic studies. Indel frequency tends to be markedly lower than that of single nucleotide polymorphisms (SNP), except near highly repetitive regions, including homopolymers and microsatellites.
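The distinction drawn above between substitutions (which leave the number of nucleotides unchanged) and indels (which change it) can be illustrated with a short sketch. The sketch assumes, for illustration only, that each variant is represented by a reference allele string and an alternate allele string; the function name and labels are hypothetical.

```python
def classify_variant(ref: str, alt: str) -> str:
    """Classify a variant from its reference and alternate allele strings.

    Equal-length alleles are substitutions: one base is an SNV, several
    consecutive bases are an MNV. Unequal lengths change the number of
    nucleotides and are therefore insertions or deletions.
    """
    if ref == alt:
        return "no_variant"
    if len(ref) == len(alt):
        return "SNV" if len(ref) == 1 else "MNV"
    return "insertion" if len(alt) > len(ref) else "deletion"
```

For example, a C-to-A change at a single position is classified as an SNV, whereas replacing "CG" with "C" is classified as a deletion.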
[0229] A genomic rearrangement algorithm 149 may identify hybrid genes formed from two previously separate genes. Such fusions can occur as a result of translocation, interstitial deletion, or chromosomal inversion. Gene fusion can play an important role in tumorigenesis. Fusion genes can contribute to tumor formation because fusion genes can produce much more active abnormal protein than non-fusion genes. Often, fusion genes are oncogenes that cause cancer; these include BCR-ABL, TEL-AML1 (ALL with t(12;21)), AML1-ETO (M2 AML with t(8;21)), and TMPRSS2-ERG with an interstitial deletion on chromosome 21, often occurring in prostate cancer. In the case of TMPRSS2-ERG, by disrupting androgen receptor (AR) signaling and inhibiting AR expression by oncogenic ETS transcription factor, the fusion product regulates prostate cancer. Most fusion genes are found in hematological cancers, sarcomas, and prostate cancer. BCAM-AKT2 is a fusion gene that is specific and unique to high-grade serous ovarian cancer. Oncogenic fusion genes may lead to a gene product with a new or different function from the two fusion partners. Alternatively, a proto-oncogene is fused to a strong promoter, and thereby the oncogenic function is set to function by an upregulation caused by the strong promoter of the upstream fusion partner. The latter is common in lymphomas, where oncogenes are juxtaposed to the promoters of the immunoglobulin genes. Oncogenic fusion transcripts may also be caused by trans-splicing or read-through events. Since chromosomal translocations play such a significant role in neoplasia, a specialized database of chromosomal aberrations and gene fusions in cancer has been created. This database is called the Mitelman Database of Chromosome Aberrations and Gene Fusions in Cancer.
[0230] In some embodiments, feature extraction module 145 includes instructions for identifying one or more complex genomic alterations (e.g., features that incorporate more than a change in the primary sequence of the genome) in the cancer genome of the subject. For instance, in some embodiments, feature extraction module 145 includes modules for identifying one or more of copy number variation (e.g., copy number variation analysis module 153), microsatellite instability status (e.g., microsatellite instability analysis module 154), tumor mutational burden (e.g., tumor mutational burden analysis module 155), tumor ploidy (e.g., tumor ploidy analysis module 156), and homologous recombination pathway deficiencies (e.g., homologous recombination pathway analysis module 157).
[0231] For example, referring to Figure 1D1, the copy number variation analysis module 153 performs a method that validates a copy number annotation of a genomic segment in a test subject, in accordance with some embodiments of the present disclosure. The method comprises obtaining an input data store 153-r (e.g., a dataset), where the input data store includes a bin-level sequence ratio data structure 153-r-1 containing a plurality of bin-level sequence ratios; a segment-level sequence ratio data structure 153-r-2 containing a plurality of segment-level sequence ratios; and a segment-level dispersion measure data structure 153-r-3 containing a plurality of segment-level measures of dispersion. In some embodiments, the method further comprises passing the data in the input data store 153-r to an amplification/deletion filter construct 153-a, thus applying the dataset to a plurality of filters. The amplification/deletion filter construct 153-a comprises a plurality of filters, including an optional measure of central tendency bin-level sequence ratio filter 153-a-1; an optional segment-level measure of dispersion confidence filter 153-a-2; an optional measure of central tendency-plus-deviation bin-level sequence ratio filter 153-a-3; and/or an optional segment-level sequence ratio filter 153-a-4. In some embodiments, the copy number variation analysis module further provides an output via the validation construct 153-o, where, when a filter in the amplification/deletion filter construct 153-a is fired, the copy number annotation of the genomic segment is rejected, and when no filter in the amplification/deletion filter construct 153-a is fired, the copy number annotation of the genomic segment is validated.
In some embodiments, copy number annotations validated using the copy number variation analysis module 153 in the feature extraction module 145 are used to populate the plurality of genomic copy numbers 135 in the one or more genomic features 131 of the test patient data store 120.
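The reject-if-any-filter-fires logic described above can be sketched as follows for a candidate amplification. The four checks are named after filters 153-a-1 through 153-a-4, but the specific comparisons, the use of the median as the measure of central tendency, the use of the population standard deviation as the deviation, and every threshold value are assumptions introduced for this example; the passage above does not specify them.

```python
from dataclasses import dataclass
from statistics import median, pstdev


@dataclass
class Segment:
    bin_ratios: list       # bin-level sequence ratios within the segment
    segment_ratio: float   # segment-level sequence ratio
    dispersion: float      # segment-level measure of dispersion


def fired_filters(seg, genome_bin_ratios, min_median=0.5, max_dispersion=0.5,
                  k=2.0, min_segment_ratio=0.5):
    """Return the names of the filters that fire for a candidate amplification.

    All thresholds are hypothetical defaults chosen for illustration.
    """
    seg_median = median(seg.bin_ratios)
    # Background level: central tendency of all bins plus k deviations.
    background = median(genome_bin_ratios) + k * pstdev(genome_bin_ratios)
    checks = {
        "central_tendency_bin_ratio": seg_median < min_median,
        "dispersion_confidence": seg.dispersion > max_dispersion,
        "central_tendency_plus_deviation": seg_median < background,
        "segment_ratio": seg.segment_ratio < min_segment_ratio,
    }
    return [name for name, fired in checks.items() if fired]


def validate_amplification(seg, genome_bin_ratios, **thresholds):
    """Validated only when no filter fires; rejected otherwise."""
    return not fired_filters(seg, genome_bin_ratios, **thresholds)
```

Returning the names of the fired filters, rather than a bare boolean, makes it possible to report why a given copy number annotation was rejected; a deletion would use mirrored comparisons.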
[0232] As another example, referring to Figure 1D3, in some embodiments, feature extraction module 145 comprises a tumor fraction estimation module 145-tf. In some embodiments, the tumor fraction estimation module 145-tf comprises a sequence ratio data structure 145-tf-r including a plurality of sequence ratios (e.g., coverage ratios) obtained from a sequencing of a test liquid biopsy sample of a subject. In some embodiments, the sequence ratio data structure 145-tf-r includes the sequence ratios that are used as input to determine tumor fraction estimates for the test liquid biopsy sample. In some embodiments, the tumor fraction estimation module 145-tf also comprises a tumor purity algorithm construct 145-tf-a that executes, for example, a maximum likelihood estimation (e.g., an expectation-maximization algorithm) to calculate an estimate of the circulating tumor fraction. The tumor purity algorithm construct 145-tf-a comprises an optional input data filtration construct 145-tf-k (e.g., for removing one or more inputs passed from the sequence ratio data structure based on a minimum probe threshold or a position on a sex chromosome) and a plurality of model parameters 145-tf-d (e.g., 145-tf-d-1, 145-tf-d-2, ...) used for executing the algorithm. In some embodiments, model parameters include expected sequence ratios for a set of copy states at a given tumor purity; a distance (e.g., an error) from a test sequence ratio to the closest expected sequence ratio at the given tumor purity; a minimum distance (e.g., a minimum error) from a test sequence ratio to the closest expected sequence ratio at the given tumor purity (e.g., an assigned test copy state selected from a minimal distance expected copy state); and/or a tumor purity score (e.g., a sum of weighted errors).
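By way of non-limiting illustration, the model parameters described above (expected sequence ratios per copy state, minimum distance to the closest expected ratio, and a tumor purity score formed as a sum of weighted errors) can be sketched as a simple grid search over candidate tumor purities. This sketch substitutes a plain grid search for the expectation-maximization algorithm mentioned above, and it assumes a linear-scale coverage ratio, a diploid normal background, and a fixed set of candidate copy states; these choices and all function names are assumptions made for illustration.

```python
from typing import Iterable, Sequence

# Assumed candidate tumor copy states; not taken from the disclosure.
COPY_STATES = (1, 2, 3, 4)


def expected_ratio(copy_number: int, purity: float, ploidy: int = 2) -> float:
    """Expected linear coverage ratio for a tumor copy state at a given purity.

    Assumes tumor-derived and diploid normal DNA mix linearly.
    """
    return (purity * copy_number + (1.0 - purity) * ploidy) / ploidy


def purity_score(ratios: Sequence[float],
                 weights: Sequence[float],
                 purity: float) -> float:
    """Sum of weighted minimum errors between observed and expected ratios."""
    score = 0.0
    for ratio, weight in zip(ratios, weights):
        # Distance to the closest expected ratio, i.e. the assigned copy state.
        min_error = min(abs(ratio - expected_ratio(cn, purity))
                        for cn in COPY_STATES)
        score += weight * min_error
    return score


def estimate_tumor_fraction(ratios: Sequence[float],
                            weights: Sequence[float],
                            purity_grid: Iterable[float]) -> float:
    """Return the candidate purity with the lowest tumor purity score."""
    return min(purity_grid, key=lambda p: purity_score(ratios, weights, p))
```

For example, with segment ratios generated by `expected_ratio` at a purity of 0.3 for copy states 1, 3, 2, and 4, and a grid of purities from 0.05 to 0.95 in steps of 0.05, the score is minimized at 0.3. The weights could, for instance, reflect the number of probes supporting each segment, which is likewise an assumption of this sketch.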
[0233] In some embodiments, referring to Figure 1C3, the tumor fraction estimation module 145-tf is used to obtain one or more circulating tumor fraction estimates 131-i that are included as feature data 125 in a test patient data store 120. For example, in some embodiments, a plurality of circulating tumor fraction estimates is obtained from a test liquid biopsy sample of a subject 131-i-cf (e.g., 131-i-cf-1, 131-i-cf-2, ..., 131-i-cf-N). In some embodiments, the plurality of circulating tumor fraction estimates is obtained from a single patient at different collection times.
[0234] Further details and specific embodiments regarding methods for analysis and validation of copy number variation, validation of a somatic sequence variant, and/or determination of a circulating tumor fraction estimate are provided below with reference to Figures 4, 5, and 6 (e.g., Figures 4F1, 5A1-5E1, and 6A1-6C1; Figures 4F2, 5A2-5B2, and 6A2, and/or Figures 4F3, 5A3-5B3, and 6A3-6C3).
Feature Analysis Module (160)
[0235] Referring again to Figure 1A, the system (e.g., system 100) includes a feature analysis module 160 that includes one or more genomic alteration interpretation algorithms 161, one or more optional clinical data analysis algorithms 165, an optional therapeutic curation algorithm 166, and an optional recommendation validation module 167. In some embodiments, feature analysis module 160 identifies actionable variants and characteristics 139-1 and corresponding matched therapies 139-2 and/or clinical trials using one or more analysis algorithms (e.g., algorithms 162, 163, 164, and 165) to evaluate feature data 125.
The identified actionable variants and characteristics 139-1 and corresponding matched therapies 139-2, which are optionally stored in test patient data store 120, are then curated by feature analysis module 160 to generate a clinical report 139-3, which is optionally validated by a user, e.g., a clinician, before being transmitted to a medical professional, e.g., an oncologist, treating the patient.
[0236] In some embodiments, the genomic alteration interpretation algorithms 161 include instructions for evaluating the effect that one or more genomic features 131 of the subject, e.g., as identified by feature extraction module 145, have on the characteristics of the patient’s cancer and/or whether one or more targeted cancer therapies may improve the clinical outcome for the patient. For example, in some embodiments, one or more genomic variant analysis algorithms 163 evaluate various genomic features 131 by querying a database, e.g., a look-up-table (“LUT”) of actionable genomic alterations, targeted therapies associated with the actionable genomic alterations, and any other conditions that should be met before administering the targeted therapy to a subject having the actionable genomic alteration. For instance, evidence suggests that depatuxizumab mafodotin (an anti-EGFR mAb conjugated to monomethyl auristatin F) has improved efficacy for the treatment of recurrent glioblastomas having EGFR focal amplifications. See van den Bent M. et al., Cancer Chemother Pharmacol., 80(6):1209-17 (2017). Accordingly, the actionable genomic alteration LUT would have an entry for the focal amplification of the EGFR gene indicating that depatuxizumab mafodotin is a targeted therapy for glioblastomas (e.g., recurrent glioblastomas) having a focal gene amplification. In some instances, the LUT may also include counter-indications for the associated targeted therapy, e.g., adverse drug interactions or personal characteristics that are counter-indicated for administration of the particular targeted therapy.
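By way of non-limiting illustration, a query against an actionable genomic alteration LUT can be sketched as a keyed lookup followed by checks on the additional conditions. The single entry below mirrors the EGFR example given above; the data layout, the function name, and the counter-indication placeholder `hypothetical_interacting_drug` are assumptions introduced for illustration and do not describe any actual drug interaction.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, Optional, Tuple


@dataclass(frozen=True)
class LutEntry:
    """One hypothetical row of an actionable genomic alteration LUT."""
    therapy: str
    cancer_types: FrozenSet[str]
    counter_indications: FrozenSet[str] = frozenset()


# Keyed by (gene, alteration type). The therapy pairing follows the example in
# the text; the counter-indication is a placeholder, not a real interaction.
ACTIONABLE_LUT: Dict[Tuple[str, str], LutEntry] = {
    ("EGFR", "focal_amplification"): LutEntry(
        therapy="depatuxizumab mafodotin",
        cancer_types=frozenset({"glioblastoma"}),
        counter_indications=frozenset({"hypothetical_interacting_drug"}),
    ),
}


def match_therapy(gene: str,
                  alteration: str,
                  cancer_type: str,
                  current_medications: Iterable[str] = ()) -> Optional[str]:
    """Return the matched therapy, or None if no entry or a condition fails."""
    entry = ACTIONABLE_LUT.get((gene, alteration))
    if entry is None:
        return None  # alteration is not listed as actionable
    if cancer_type not in entry.cancer_types:
        return None  # additional condition (cancer type) is not met
    if entry.counter_indications & set(current_medications):
        return None  # therapy is counter-indicated for this patient
    return entry.therapy
```

In this sketch, a glioblastoma with an EGFR focal amplification returns the matched therapy, while the same alteration in a different cancer type, or in a patient taking the placeholder medication, returns no match.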
[0237] In some embodiments, a genomic alteration interpretation algorithm 161 determines whether a particular genomic feature 131 should be reported to a medical professional treating the cancer patient. In some embodiments, genomic features 131 (e.g., genomic alterations and compound features) are reported when there is clinical evidence that the feature significantly impacts the biology of the cancer, impacts the prognosis for the cancer, and/or impacts pharmacogenomics, e.g., by indicating or counter-indicating particular therapeutic approaches. For instance, a genomic alteration interpretation algorithm 161 may classify a particular CNV feature 135 as “Reportable,” e.g., meaning that the CNV has been identified as influencing the character of the cancer, the overall disease state, and/or pharmacogenomics; as “Not Reportable,” e.g., meaning that the CNV has not been identified as influencing the character of the cancer, the overall disease state, and/or pharmacogenomics; as “No Evidence,” e.g., meaning that no evidence exists supporting that the CNV is “Reportable” or “Not Reportable”; or as “Conflicting Evidence,” e.g., meaning that evidence exists supporting both that the CNV is “Reportable” and that the CNV is “Not Reportable.”
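By way of non-limiting illustration, the four reporting classes described above follow from two yes/no questions about the available evidence, which can be sketched as follows. The enumeration and function names are hypothetical, and how the two evidence flags would be derived from clinical literature is outside the scope of this sketch.

```python
from enum import Enum


class Reportability(Enum):
    REPORTABLE = "Reportable"
    NOT_REPORTABLE = "Not Reportable"
    NO_EVIDENCE = "No Evidence"
    CONFLICTING_EVIDENCE = "Conflicting Evidence"


def classify_feature(evidence_for_reportable: bool,
                     evidence_for_not_reportable: bool) -> Reportability:
    """Map the presence of supporting evidence onto the four classes."""
    if evidence_for_reportable and evidence_for_not_reportable:
        return Reportability.CONFLICTING_EVIDENCE
    if evidence_for_reportable:
        return Reportability.REPORTABLE
    if evidence_for_not_reportable:
        return Reportability.NOT_REPORTABLE
    return Reportability.NO_EVIDENCE
```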
[0238] In some embodiments, the genomic alteration interpretation algorithms 161 include one or more pathogenic variant analysis algorithms 162, which evaluate various genomic features to identify the presence of an oncogenic pathogen associated with the patient’s cancer and/or targeted therapies associated with an oncogenic pathogen infection in the cancer. For instance, RNA expression patterns of some cancers are associated with the presence of an oncogenic pathogen that is helping to drive the cancer. See, for example, U.S. Patent Application Serial No. 16/802,126, filed February 26, 2020, the content of which is hereby incorporated by reference, in its entirety, for all purposes. In some instances, the recommended therapy for the cancer is different when the cancer is associated with the oncogenic pathogen infection than when it is not. Accordingly, in some embodiments, e.g., where feature data 125 includes RNA abundance data for the cancer of the patient, one or more pathogenic variant analysis algorithms 162 evaluate the RNA abundance data for the patient’s cancer to determine whether a signature exists in the data that indicates the presence of the oncogenic pathogen in the cancer. Similarly, in some embodiments, bioinformatics module 140 includes an algorithm that searches for the presence of pathogenic nucleic acid sequences in sequencing data 122. See, for example, U.S. Provisional Patent Application Serial No. 62/978,067, filed February 18, 2020, the content of which is hereby incorporated by reference, in its entirety, for all purposes. Accordingly, in some embodiments, one or more pathogenic variant analysis algorithms 162 evaluate whether the presence of an oncogenic pathogen in a subject is associated with an actionable therapy for the infection.
In some embodiments, system 100 queries a database, e.g., a look-up-table (“LUT”), of actionable oncogenic pathogen infections, targeted therapies associated with the actionable infections, and any other conditions that should be met before administering the targeted therapy to a subject that is infected with the oncogenic pathogen. In some instances, the LUT may also include counter-indications for the associated targeted therapy, e.g., adverse drug interactions or personal characteristics that are counter-indicated for administration of the particular targeted therapy.
[0239] In some embodiments, the genomic alteration interpretation algorithms 161 include one or more multi-feature analysis algorithms 164 that evaluate a plurality of features to classify a cancer with respect to the effects of one or more targeted therapies. For instance, in some embodiments, feature analysis module 160 includes one or more classifiers trained against feature data, one or more clinical therapies, and their associated clinical outcomes for a plurality of training subjects to classify cancers based on their predicted clinical outcomes following one or more therapies.
[0240] In some embodiments, the classifier is implemented as an artificial intelligence engine and may include gradient boosting models, random forest models, neural networks (NN), regression models, Naive Bayes models, and/or machine learning algorithms (MLA). An MLA or a NN may be trained from a training data set that includes one or more features 125, including personal characteristics 126, medical history 127, clinical features 128, genomic features 131, and/or other -omic features 138. MLAs include supervised algorithms (such as algorithms where the features/classifications in the data set are annotated) using linear regression, logistic regression, decision trees, classification and regression trees, naive Bayes, nearest neighbor clustering; unsupervised algorithms (such as algorithms where no features/classification in the data set are annotated) using Apriori, k-means clustering, principal component analysis, random forest, adaptive boosting; and semi-supervised algorithms (such as algorithms where an incomplete number of features/classifications in the data set are annotated) using a generative approach (such as a mixture of Gaussian distributions, mixture of multinomial distributions, hidden Markov models), low density separation, graph-based approaches (such as mincut, harmonic function, manifold regularization), heuristic approaches, or support vector machines.
[0241] NNs include conditional random fields, convolutional neural networks, attention based neural networks, deep learning, long short term memory networks, or other neural models where the training data set includes a plurality of tumor samples, RNA expression data for each sample, and pathology reports covering imaging data for each sample.
[0242] While MLA and neural networks identify distinct approaches to machine learning, the terms may be used interchangeably herein. Thus, a mention of MLA may include a corresponding NN or a mention of NN may include a corresponding MLA unless explicitly stated otherwise. Training may include providing optimized datasets, labeling these traits as they occur in patient records, and training the MLA to predict or classify based on new inputs. Artificial NNs are efficient computing models which have shown their strengths in solving hard problems in artificial intelligence. They have also been shown to be universal approximators, that is, they can represent a wide variety of functions when given appropriate parameters.
[0243] In some embodiments, system 100 includes a classifier training module that includes instructions for training one or more untrained or partially trained classifiers based on feature data from a training dataset. In some embodiments, system 100 also includes a database of training data for use in training the one or more classifiers. In other embodiments, the classifier training module accesses a remote storage device hosting training data. In some embodiments, the training data includes a set of training features, including but not limited to, various types of the feature data 125 illustrated in Figure 1B. In some embodiments, the classifier training module uses patient data 121, e.g., when test patient data store 120 also stores a record of treatments administered to the patient and patient outcomes following therapy.
[0244] In some embodiments, feature analysis module 160 includes one or more clinical data analysis algorithms 165, which evaluate clinical features 128 of a cancer to identify targeted therapies which may benefit the subject. For example, in some embodiments, e.g., where feature data 125 includes pathology data 128-1, one or more clinical data analysis algorithms 165 evaluate the data to determine whether an actionable therapy is indicated based on the histopathology of a tumor biopsy from the subject, e.g., which is indicative of a particular cancer type and/or stage of cancer. In some embodiments, system 100 queries a database, e.g., a look-up-table (“LUT”), of actionable clinical features (e.g., pathology features), targeted therapies associated with the actionable features, and any other conditions that should be met before administering the targeted therapy to a subject associated with the actionable clinical features 128 (e.g., pathology features 128-1). In some embodiments, system 100 evaluates the clinical features 128 (e.g., pathology features 128-1) directly to determine whether the patient’s cancer is sensitive to a particular therapeutic agent. Further details on example methods, systems, and algorithms for classifying cancer and identifying targeted therapies based on clinical data, such as pathology data 128-1, imaging data 128-2, and/or tissue culture/organoid data 128-3, are discussed, for example, in U.S. Patent Application No. 16/830,186, filed on March 25, 2020, U.S. Patent Application No. 16/789,363, filed on Feb. 12, 2020, and U.S. Provisional Application No. 63/007,874, filed on April 9, 2020, the contents of which are hereby incorporated by reference, in their entireties, for all purposes.
[0245] In some embodiments, feature analysis module 160 includes a clinical trials module that evaluates test patient data 121 to determine whether the patient is eligible for inclusion in a clinical trial for a cancer therapy, e.g., a clinical trial that is currently recruiting patients, a clinical trial that has not yet begun recruiting patients, and/or an ongoing clinical trial that may recruit additional patients in the future. In some embodiments, the clinical trials module evaluates test patient data 121 to determine whether the results of a clinical trial are relevant for the patient, e.g., the results of an ongoing clinical trial and/or the results of a completed clinical trial. For instance, in some embodiments, system 100 queries a database, e.g., a look-up-table (“LUT”) of clinical trials, e.g., active and/or completed clinical trials, and compares patient data 121 with inclusion criteria for the clinical trials, stored in the database, to identify clinical trials with inclusion criteria that closely match and/or exactly match the patient’s data 121. In some embodiments, a record of matching clinical trials, e.g., those clinical trials that the patient may be eligible for and/or that may inform personalized treatment decisions for the patient, is stored in clinical assessment database 139.
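By way of non-limiting illustration, comparing patient data 121 with stored inclusion criteria to find exact and close matches can be sketched as follows. The representation of inclusion criteria as sets of allowed values, the trial identifiers, the field names, and the 0.75 cutoff for a "close" match are all assumptions made for illustration.

```python
from typing import Any, Dict, List, Mapping, Set, Tuple

# Inclusion criteria: field name -> set of allowed values (assumed layout).
Criteria = Mapping[str, Set[Any]]


def match_fraction(patient: Mapping[str, Any], criteria: Criteria) -> float:
    """Fraction of a trial's inclusion criteria satisfied by the patient."""
    if not criteria:
        return 0.0
    met = sum(1 for name, allowed in criteria.items()
              if patient.get(name) in allowed)
    return met / len(criteria)


def find_matching_trials(patient: Mapping[str, Any],
                         trials: Mapping[str, Criteria],
                         close_threshold: float = 0.75
                         ) -> List[Tuple[str, str]]:
    """Return (trial id, 'exact' or 'close') for trials meeting the cutoff."""
    matches = []
    for trial_id, criteria in trials.items():
        fraction = match_fraction(patient, criteria)
        if fraction == 1.0:
            matches.append((trial_id, "exact"))
        elif fraction >= close_threshold:
            matches.append((trial_id, "close"))
    return matches
```

A record of the returned matches could then be stored alongside the other clinical assessment data; a production system would additionally need to handle exclusion criteria and numeric ranges, which this sketch omits.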
[0246] In some embodiments, feature analysis module 160 includes a therapeutic curation algorithm 166 that assembles actionable variants and characteristics 139-1, matched therapies 139-2, and/or relevant clinical trials identified for the patient, as described above. In some embodiments, a therapeutic curation algorithm 166 evaluates certain criteria related to which actionable variants and characteristics 139-1, matched therapies 139-2, and/or relevant clinical trials should be reported and/or whether certain matched therapies, considered alone or in combination, may be counter-indicated for the patient, e.g., based on personal characteristics 126 of the patient and/or known drug-drug interactions. In some embodiments, the therapeutic curation algorithm then generates one or more clinical reports 139-3 for the patient. In some embodiments, the therapeutic curation algorithm generates a first clinical report 139-3-1 that is to be reported to a medical professional treating the patient and a second clinical report 139-3-2 that will not be communicated to the medical professional, but may be used to improve various algorithms within the system.
[0247] In some embodiments, feature analysis module 160 includes a recommendation validation module 167 that includes an interface allowing a clinician to review, modify, and approve a clinical report 139-3 prior to the report being sent to a medical professional, e.g., an oncologist, treating the patient.
[0248] In some embodiments, each of the one or more feature collections, sequencing modules, bioinformatics modules (including, e.g., alteration module(s), structural variant calling and data processing modules), classification modules and outcome modules are communicatively coupled to a data bus to transfer data between each module for processing and/or storage. In some alternative embodiments, each of the feature collection, alteration module(s), structural variant and feature store are communicatively coupled to each other for independent communication without sharing the data bus.
[0249] Further details on systems and exemplary embodiments of modules and feature collections are discussed in PCT Application PCT/US19/69149, titled “A METHOD AND PROCESS FOR PREDICTING AND ANALYZING PATIENT COHORT RESPONSE, PROGRESSION, AND SURVIVAL,” filed December 31, 2019, which is hereby incorporated herein by reference in its entirety.
Example Methods
[0250] Now that details of a system 100 for providing clinical support for personalized cancer therapy, e.g., with improved validation of copy number variation, improved validation of somatic sequence variants, and/or improved determination of circulating tumor fraction estimates have been disclosed, details regarding processes and features of the system, in accordance with various embodiments of the present disclosure, are disclosed below. Specifically, example processes are described below with reference to Figures 2A, 3, 4, 5, 6 and 7 (e.g., Figures 2A, 3, 4A-E; Figures 4F1, 5A1-5E1, 6A1-6C1, and 7A1-7C1; Figures 4F2, 5A2-5B2, 6A2, and 7A2-7B2; and/or Figures 4F3, 5A3-5B3, 6A3-6C3, and 7A3). In some embodiments, such processes and features of the system are carried out by modules 118, 120, 140, 160, and/or 170, as illustrated in Figure 1A. Referring to these methods, the systems described herein (e.g., system 100) include instructions for determining and validating focal copy number variations that are improved compared to conventional methods for copy number analysis, instructions for validating somatic variants that are improved compared to conventional methods for somatic variant detection, and/or instructions for determining accurate circulating tumor fraction estimates that are improved compared to conventional methods for obtaining circulating tumor fraction estimates.
Figure 2B: Distributed Diagnostic and Clinical Environment
[0251] In some aspects, the methods described herein for providing clinical support for personalized cancer therapy are performed across a distributed diagnostic/clinical environment, e.g., as illustrated in Figure 2B. However, in some embodiments, the improved methods described herein for supporting clinical decisions in precision oncology using liquid biopsy assays (e.g., by validating a copy number variation in a test subject, validating a somatic sequence variant in a test subject having a cancer condition, determining accurate circulating tumor fraction estimates, etc.) are performed at a single location, e.g., at a single computing system or environment, although ancillary procedures supporting the methods described herein, and/or procedures that make further use of the results of the methods described herein, may be performed across a distributed diagnostic/clinical environment.
[0252] Figure 2B illustrates an example of a distributed diagnostic/clinical environment 210. In some embodiments, the distributed diagnostic/clinical environment is connected via communication network 105. In some embodiments, one or more biological samples, e.g., one or more liquid biopsy samples, solid tumor biopsies, normal tissue samples, and/or control samples, are collected from a subject in clinical environment 220, e.g., a doctor’s office, hospital, or medical clinic, or at a home health care environment (not depicted). Advantageously, while solid tumor samples should be collected within a clinical setting, liquid biopsy samples can be acquired in a less invasive fashion and are more easily collected outside of a traditional clinical setting. In some embodiments, one or more biological samples, or portions thereof, are processed within the clinical environment 220 where collection occurred, using a processing device 224, e.g., a nucleic acid sequencer for obtaining sequencing data, a microscope for obtaining pathology data, a mass spectrometer for obtaining proteomic data, etc. In some embodiments, one or more biological samples, or portions thereof, are sent to one or more external environments, e.g., sequencing lab 230, pathology lab 240, and/or molecular biology lab 250, each of which includes a processing device 234, 244, and 254, respectively, to generate biological data 121 for the subject. Each environment includes a communications device 222, 232, 242, and 252, respectively, for communicating biological data 121 about the subject to a processing server 262 and/or database 264, which may be located in yet another environment, e.g., processing/storage center 260. Thus, in some embodiments, different portions of the systems and methods described herein are fulfilled by different processing devices located in different physical environments.
[0253] Accordingly, in some embodiments, a method for providing clinical support for personalized cancer therapy, e.g., with improved validation of copy number variations, improved validation of somatic sequence variants, and/or improved determination of circulating tumor fraction estimates, is performed across one or more environments, as illustrated in Figure 2B. For instance, in some such embodiments, a liquid biopsy sample is collected at clinical environment 220 or in a home healthcare environment. The sample, or a portion thereof, is sent to sequencing lab 230 where raw sequence reads 123 of nucleic acids in the sample are generated by sequencer 234. The raw sequencing data 123 is communicated, e.g., from communications device 232, to database 264 at processing/storage center 260, where processing server 262 extracts features from the sequence reads by executing one or more of the processes in bioinformatics module 140, thereby generating genomic features 131 for the sample. Processing server 262 may then analyze the identified features by executing one or more of the processes in feature analysis module 160, thereby generating clinical assessment 139, including a clinical report 139-3. A clinician may access clinical report 139-3, e.g., at processing/storage center 260 or through communications network 105, via recommendation validation module 167. After final approval, clinical report 139-3 is transmitted to a medical professional, e.g., an oncologist, at clinical environment 220, who uses the report to support clinical decision making for personalized treatment of the patient’s cancer.
Figure 2A: Example Workflow for Precision Oncology
[0254] Figure 2A is a flowchart of an example workflow 200 for collecting and analyzing data in order to generate a clinical report 139-3 to support clinical decision making in precision oncology. Advantageously, the methods described herein improve this process, for example, by improving various stages within feature extraction 206, including validating copy number variations, validating somatic sequence variants, and/or determining circulating tumor fraction estimates.
[0255] Briefly, the workflow begins with patient intake and sample collection 201, where one or more liquid biopsy samples, one or more tumor biopsies, and one or more normal and/or control tissue samples are collected from the patient (e.g., at a clinical environment 220 or home healthcare environment, as illustrated in Figure 2B). In some embodiments, personal data 126 corresponding to the patient and a record of the one or more biological samples obtained (e.g., patient identifiers, patient clinical data, sample type, sample identifiers, cancer conditions, etc.) are entered into a data analysis platform, e.g., test patient data store 120.
Accordingly, in some embodiments, the methods disclosed herein include obtaining one or more biological samples from one or more subjects, e.g., cancer patients. In some embodiments, the subject is a human, e.g., a human cancer patient.
[0256] In some embodiments, one or more of the biological samples obtained from the patient is a biological liquid sample, also referred to as a liquid biopsy sample. In some embodiments, one or more of the biological samples obtained from the patient are selected from blood, plasma, serum, urine, vaginal fluid, fluid from a hydrocele (e.g., of the testis), vaginal flushing fluids, pleural fluid, ascitic fluid, cerebrospinal fluid, saliva, sweat, tears, sputum, bronchoalveolar lavage fluid, discharge fluid from the nipple, aspiration fluid from different parts of the body (e.g., thyroid, breast), etc. In some embodiments, the liquid biopsy sample includes blood and/or saliva. In some embodiments, the liquid biopsy sample is peripheral blood. In some embodiments, blood samples are collected from patients in commercial blood collection containers, e.g., using PAXgene® Blood DNA Tubes. In some embodiments, saliva samples are collected from patients in commercial saliva collection containers, e.g., using an Oragene® DNA Saliva Kit.
[0257] In some embodiments, the liquid biopsy sample has a volume of from about 1 mL to about 50 mL. For example, in some embodiments, the liquid biopsy sample has a volume of about 1 mL, about 2 mL, about 3 mL, about 4 mL, about 5 mL, about 6 mL, about 7 mL, about 8 mL, about 9 mL, about 10 mL, about 11 mL, about 12 mL, about 13 mL, about 14 mL, about 15 mL, about 16 mL, about 17 mL, about 18 mL, about 19 mL, about 20 mL, or greater.
[0258] Liquid biopsy samples include cell-free nucleic acids, including cell-free DNA (cfDNA). As described above, cfDNA isolated from cancer patients includes DNA originating from cancerous cells, also referred to as circulating tumor DNA (ctDNA), cfDNA originating from germline (e.g., healthy or non-cancerous) cells, and cfDNA originating from hematopoietic cells (e.g., white blood cells). The relative proportions of cancerous and non-cancerous cfDNA present in a liquid biopsy sample vary depending on the characteristics (e.g., the type, stage, lineage, genomic profile, etc.) of the patient’s cancer. As used herein, the ‘tumor burden’ of the subject refers to the percentage of cfDNA that originated from cancerous cells.
[0259] As described herein, cfDNA is a particularly useful source of biological data for various implementations of the methods and systems described herein, because it is readily obtained from various body fluids. Advantageously, use of bodily fluids facilitates serial monitoring because of the ease of collection, as these fluids are collectable by non-invasive or minimally invasive methodologies. This is in contrast to methods that rely upon solid tissue samples, such as biopsies, which often require invasive surgical procedures. Further, because bodily fluids, such as blood, circulate throughout the body, the cfDNA population represents a sampling of many different tissue types from many different locations.
[0260] In some embodiments, a liquid biopsy sample is separated into two different samples. For example, in some embodiments, a blood sample is separated into a blood plasma sample, containing cfDNA, and a buffy coat preparation, containing white blood cells.
[0261] In some embodiments, a plurality of liquid biopsy samples is obtained from a respective subject at intervals over a period of time (e.g., using serial testing). For example, in some such embodiments, the time between obtaining liquid biopsy samples from a respective subject is at least 1 day, at least 2 days, at least 1 week, at least 2 weeks, at least 1 month, at least 2 months, at least 3 months, at least 4 months, at least 6 months, or at least 1 year.
[0262] In some embodiments, one or more biological samples collected from the patient is a solid tissue sample, e.g., a solid tumor sample or a solid normal tissue sample. Methods for obtaining solid tissue samples, e.g., of cancerous and/or normal tissue, are known in the art and are dependent upon the type of tissue being sampled. For example, bone marrow biopsies and isolation of circulating tumor cells can be used to obtain samples of blood cancers, endoscopic biopsies can be used to obtain samples of cancers of the digestive tract, bladder, and lungs, needle biopsies (e.g., fine-needle aspiration, core needle aspiration, vacuum-assisted biopsy, and image-guided biopsy) can be used to obtain samples of subdermal tumors, skin biopsies, e.g., shave biopsy, punch biopsy, incisional biopsy, and excisional biopsy, can be used to obtain samples of dermal cancers, and surgical biopsies can be used to obtain samples of cancers affecting internal organs of a patient. In some embodiments, a solid tissue sample is a formalin-fixed tissue (FFT). In some embodiments, a solid tissue sample is a macro-dissected formalin fixed paraffin embedded (FFPE) tissue. In some embodiments, a solid tissue sample is a fresh frozen tissue sample.
[0263] In some embodiments, a dedicated normal sample is collected from the patient, for co-processing with a liquid biopsy sample. Generally, the normal sample is of a non-cancerous tissue, and can be collected using any tissue collection means described above. In some embodiments, buccal cells collected from the inside of a patient’s cheeks are used as a normal sample. Buccal cells can be collected by placing an absorbent material, e.g., a swab, in the subject’s mouth and rubbing it against their cheek, e.g., for at least 15 seconds or for at least 30 seconds. The swab is then removed from the patient’s mouth and inserted into a tube, such that the tip of the tube is submerged into a liquid that serves to extract the buccal cells off of the absorbent material. An example of buccal cell recovery and collection devices is provided in U.S. Patent No. 9,138,205, the content of which is hereby incorporated by reference, in its entirety, for all purposes. In some embodiments, the buccal swab DNA is used as a source of normal DNA in circulating heme malignancies.
[0264] The biological samples collected from the patient are, optionally, sent to various analytical environments (e.g., sequencing lab 230, pathology lab 240, and/or molecular biology lab 250) for processing (e.g., data collection) and/or analysis (e.g., feature extraction). Wet lab processing 204 may include cataloguing samples (e.g., accessioning), examining clinical features of one or more samples (e.g., pathology review), and nucleic acid sequence analysis (e.g., extraction, library prep, capture + hybridize, pooling, and sequencing). In some embodiments, the workflow includes clinical analysis of one or more biological samples collected from the subject, e.g., at a pathology lab 240 and/or a molecular and cellular biology lab 250, to generate clinical features such as pathology features 128-1, imaging data 128-2, and/or tissue culture / organoid data 128-3.
[0265] In some embodiments, the pathology data 128-1 collected during clinical evaluation includes visual features identified by a pathologist’s inspection of a specimen (e.g., a solid tumor biopsy), e.g., of stained H&E or IHC slides. In some embodiments, the sample is a solid tissue biopsy sample. In some embodiments, the tissue biopsy sample is a formalin-fixed tissue (FFT), e.g., a formalin-fixed paraffin-embedded (FFPE) tissue. In some embodiments, the tissue biopsy sample is an FFPE or FFT block. In some embodiments, the tissue biopsy sample is a fresh-frozen tissue biopsy. The tissue biopsy sample can be prepared in thin sections (e.g., by cutting and/or affixing to a slide), to facilitate pathology review (e.g., by staining with immunohistochemistry stain for IHC review and/or with hematoxylin and eosin stain for H&E pathology review). For instance, analysis of slides for H&E staining or IHC staining may reveal features such as tumor infiltration, programmed death-ligand 1 (PD-L1) status, human leukocyte antigen (HLA) status, or other immunological features.
[0266] In some embodiments, a liquid sample (e.g., blood) collected from the patient (e.g., in EDTA-containing collection tubes) is prepared on a slide (e.g., by smearing) for pathology review. In some embodiments, macrodissected FFPE tissue sections, which may be mounted on a histopathology slide, from solid tissue samples (e.g., tumor or normal tissue) are analyzed by pathologists. In some embodiments, tumor samples are evaluated to determine, e.g., the tumor purity of the sample, the percent tumor cellularity as a ratio of tumor to normal nuclei, etc. For each section, background tissue may be excluded or removed such that the section meets a tumor purity threshold, e.g., where at least 20% of the nuclei in the section are tumor nuclei, or where at least 25%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or more of the nuclei in the section are tumor nuclei.
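The purity criterion above can be illustrated with a minimal sketch. It assumes purity is computed as the fraction of counted nuclei that are tumor nuclei; the function names and the 20% default are illustrative choices, not part of the described system.

```python
def tumor_purity(tumor_nuclei: int, normal_nuclei: int) -> float:
    """Fraction of counted nuclei that are tumor nuclei (assumed definition)."""
    total = tumor_nuclei + normal_nuclei
    if total == 0:
        raise ValueError("no nuclei were counted")
    return tumor_nuclei / total


def meets_purity_threshold(tumor_nuclei: int, normal_nuclei: int,
                           threshold: float = 0.20) -> bool:
    """True when the section reaches the chosen purity threshold (e.g., 20%)."""
    return tumor_purity(tumor_nuclei, normal_nuclei) >= threshold
```

A section with 30 tumor nuclei and 70 normal nuclei would have a purity of 0.3 under this definition and would pass a 20% threshold.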
[0267] Conversion of solid tumor test to liquid biopsy test. In one embodiment, the solid tissue sample is insufficient for NGS testing (for example, the sample is too small or too degraded, the amount or quality of nucleic acids extracted from the sample does not result in quality NGS results that would result in reliable determination of variants and/or other genetic characteristics of the sample), and the physician or patient may decide to convert the solid tissue test that was ordered to a liquid biopsy test to be performed on a liquid biopsy sample collected from the same patient. The resulting report and/or display of the results on a portal may include an “xF Conversion Badge” to distinguish any order that has been converted from solid tissue test to a liquid biopsy test (compared to, for example, a liquid biopsy test that was not initially ordered as a solid tissue test). This will allow a user to identify which orders have been converted by this process, and distinguish between orders that were intentionally placed for the liquid biopsy panel.
[0268] In some embodiments, pathology data 128-1 is extracted, in addition to or instead of visual inspection, using computational approaches to digital pathology, e.g., providing morphometric features extracted from digital images of stained tissue samples. A review of digital pathology methods is provided in Bera, K. et al., Nat. Rev. Clin. Oncol., 16:703-15 (2019), the content of which is hereby incorporated by reference, in its entirety, for all purposes. In some embodiments, pathology data 128-1 includes features determined using machine learning algorithms to evaluate pathology data collected as described above.
[0269] Further details on methods, systems, and algorithms for using pathology data to classify cancer and identify targeted therapies are discussed, for example, in U.S. Patent Application No. 16/830,186, filed on March 25, 2020, and U.S. Provisional Application No. 63/007,874, filed on April 9, 2020, the contents of which are hereby incorporated by reference, in their entireties, for all purposes.
[0270] In some embodiments, imaging data 128-2 collected during clinical evaluation includes features identified by review of in-vitro and/or in-vivo imaging results (e.g., of a tumor site), for example a size of a tumor, tumor size differentials over time (such as during treatment or during other periods of change). In some embodiments, imaging data 128-2 includes features determined using machine learning algorithms to evaluate imaging data collected as described above.
[0271] Further details on methods, systems, and algorithms for using medical imaging to classify cancer and identify targeted therapies are discussed, for example, in U.S. Patent Application No. 16/830,186, filed on March 25, 2020, and U.S. Provisional Application No. 63/007,874, filed on April 9, 2020, the contents of which are hereby incorporated by reference, in their entireties, for all purposes.
[0272] In some embodiments, tissue culture / organoid data 128-3 collected during clinical evaluation includes features identified by evaluation of cultured tissue from the subject. For instance, in some embodiments, tissue samples obtained from the patients (e.g., tumor tissue, normal tissue, or both) are cultured (e.g., in liquid culture, solid-phase culture, and/or organoid culture) and various features, such as cell morphology, growth characteristics, genomic alterations, and/or drug sensitivity, are evaluated. In some embodiments, tissue culture / organoid data 128-3 includes features determined using machine learning algorithms to evaluate tissue culture / organoid data collected as described above. Examples of tissue organoid (e.g., personal tumor organoid) culturing and feature extractions thereof are described in U.S. Provisional Application Serial No. 62/924,621, filed on October 22, 2019, and U.S. Patent Application Serial No. 16/693,117, filed on November 22, 2019, the contents of which are hereby incorporated by reference, in their entireties, for all purposes.
[0273] Nucleic acid sequencing of one or more samples collected from the subject is performed, e.g., at sequencing lab 230, during wet lab processing 204. An example workflow for nucleic acid sequencing is illustrated in Figure 3. In some embodiments, the one or more biological samples obtained at the sequencing lab 230 are accessioned (302), to track the sample and data through the sequencing process.
[0274] Next, nucleic acids, e.g., RNA and/or DNA are extracted (304) from the one or more biological samples. Methods for isolating nucleic acids from biological samples are known in the art, and are dependent upon the type of nucleic acid being isolated (e.g., cfDNA, DNA, and/or RNA) and the type of sample from which the nucleic acids are being isolated (e.g., liquid biopsy samples, white blood cell buffy coat preparations, formalin-fixed paraffin-embedded (FFPE) solid tissue samples, and fresh frozen solid tissue samples). The selection of any particular nucleic acid isolation technique for use in conjunction with the embodiments described herein is well within the skill of the person having ordinary skill in the art, who will consider the sample type, the state of the sample, the type of nucleic acid being sequenced and the sequencing technology being used.
[0275] For instance, many techniques for DNA isolation, e.g., genomic DNA isolation, from a tissue sample are known in the art, such as organic extraction, silica adsorption, and anion exchange chromatography. Likewise, many techniques for RNA isolation, e.g., mRNA isolation, from a tissue sample are known in the art. Examples include acid guanidinium thiocyanate-phenol-chloroform extraction (see, for example, Chomczynski and Sacchi, 2006, Nat Protoc, 1(2):581-85, which is hereby incorporated by reference herein) and silica bead/glass fiber adsorption (see, for example, Poeckh, T. et al., 2008, Anal Biochem, 373(2):253-62, which is hereby incorporated by reference herein). The selection of any particular DNA or RNA isolation technique for use in conjunction with the embodiments described herein is well within the skill of the person having ordinary skill in the art, who will consider the tissue type, the state of the tissue, e.g., fresh, frozen, formalin-fixed, paraffin-embedded (FFPE), and the type of nucleic acid analysis that is to be performed.
[0276] In some embodiments where the biological sample is a liquid biopsy sample, e.g., a blood or blood plasma sample, cfDNA is isolated from blood samples using commercially available reagents, including proteinase K, to generate a liquid solution of cfDNA.
[0277] In some embodiments, isolated DNA molecules are mechanically sheared to an average length using an ultrasonicator (for example, a Covaris ultrasonicator). In some embodiments, isolated nucleic acid molecules are analyzed to determine their fragment size, e.g., through gel electrophoresis techniques and/or the use of a device such as a LabChip GX Touch. The skilled artisan will know of an appropriate range of fragment sizes, based on the sequencing technique being employed, as different sequencing techniques have differing fragment size requirements for robust sequencing. In some embodiments, quality control testing is performed on the extracted nucleic acids (e.g., DNA and/or RNA), e.g., to assess the nucleic acid concentration and/or fragment size. For example, sizing of DNA fragments provides valuable information used for downstream processing, such as determining whether DNA fragments require additional shearing prior to sequencing.
[0278] Wet lab processing 204 then includes preparing a nucleic acid library from the isolated nucleic acids (e.g., cfDNA, DNA, and/or RNA). For example, in some embodiments, DNA libraries (e.g., gDNA and/or cfDNA libraries) are prepared from isolated DNA from the one or more biological samples. In some embodiments, the DNA libraries are prepared using a commercial library preparation kit, e.g., the KAPA Hyper Prep Kit, a New England Biolabs (NEB) kit, or a similar kit.
[0279] Conversion of solid tumor test to liquid biopsy test. In one embodiment, the solid tissue sample is insufficient for NGS testing (for example, the sample is too small or too degraded, the amount or quality of nucleic acids extracted from the sample does not result in quality NGS results that would result in reliable determination of variants and/or other genetic characteristics of the sample), and the physician or patient may decide to convert the solid tissue test that was ordered to a liquid biopsy test to be performed on a liquid biopsy sample collected from the same patient. The resulting report and/or display of the results on a portal may include an “xF Conversion Badge” to distinguish any order that has been converted from solid tissue test to a liquid biopsy test (compared to, for example, a liquid biopsy test that was not initially ordered as a solid tissue test). This will allow a user to identify which orders have been converted by this process, and distinguish between orders that were intentionally placed for the liquid biopsy panel.
[0280] In some embodiments, during library preparation, adapters (e.g., UDI adapters, such as Roche SeqCap dual end adapters, or UMI adapters such as full length or stubby Y adapters) are ligated onto the nucleic acid molecules. In some embodiments, the adapters include unique molecular identifiers (UMIs), which are short nucleic acid sequences (e.g., 3-10 base pairs) that are added to ends of DNA fragments during adapter ligation. In some embodiments, UMIs are degenerate base pairs that serve as a unique tag that can be used to identify sequence reads originating from a specific DNA fragment. In some embodiments, e.g., when multiplex sequencing will be used to sequence DNA from a plurality of samples (e.g., from the same or different subjects) in a single sequencing reaction, a patient-specific index is also added to the nucleic acid molecules. In some embodiments, the patient-specific index is a short nucleic acid sequence (e.g., 3-20 nucleotides) that is added to ends of DNA fragments during library construction and that serves as a unique tag that can be used to identify sequence reads originating from a specific patient sample. Examples of identifier sequences are described, for example, in Kivioja et al., Nat. Methods 9(1):72-74 (2011) and Islam et al., Nat. Methods 11(2):163-66 (2014), the contents of which are hereby incorporated by reference, in their entireties, for all purposes.
[0281] In some embodiments, an adapter includes a PCR primer landing site, designed for efficient binding of a PCR or second-strand synthesis primer used during the sequencing reaction. In some embodiments, an adapter includes an anchor binding site, to facilitate binding of the DNA molecule to anchor oligonucleotide molecules on a sequencer flow cell, serving as a seed for the sequencing process by providing a starting point for the sequencing reaction. During PCR amplification following adapter ligation, the UMIs, patient indexes, and binding sites are replicated along with the attached DNA fragment. This provides a way to identify sequence reads that came from the same original fragment in downstream analysis.
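As an illustration of how replicated UMIs allow reads from the same original fragment to be identified downstream, the following minimal sketch groups reads into families. It assumes each read already carries its extracted UMI and an alignment start position; the `Read` type and `group_by_fragment` function are hypothetical names, and the sketch does no UMI error correction.

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    name: str    # read identifier
    umi: str     # UMI sequence extracted from the adapter (assumed available)
    chrom: str   # chromosome the read aligned to
    start: int   # alignment start position


def group_by_fragment(reads):
    """Group read names by (UMI, chromosome, start) as a proxy for source fragment."""
    families = defaultdict(list)
    for read in reads:
        families[(read.umi, read.chrom, read.start)].append(read.name)
    return dict(families)
```

Reads sharing both a UMI and an alignment position are treated here as PCR copies of one fragment; reads with the same position but different UMIs are kept apart.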
[0282] In some embodiments, DNA libraries are amplified and purified using commercial reagents (e.g., Axygen MAG PCR clean up beads). In some such embodiments, the concentration and/or quantity of the DNA molecules are then quantified using a fluorescent dye and a fluorescence microplate reader, standard spectrofluorometer, or filter fluorometer. In some embodiments, library amplification is performed on a device (e.g., an Illumina C-Bot2) and the resulting flow cell containing amplified target-captured DNA libraries is sequenced on a next generation sequencer (e.g., an Illumina HiSeq 4000 or an Illumina NovaSeq 6000) to a unique on-target depth selected by the user. In some embodiments, DNA library preparation is performed with an automated system, using a liquid handling robot (e.g., a SciClone NGSx).
[0283] In some embodiments, where feature data 125 includes methylation states 132 for one or more genomic locations, nucleic acids isolated from the biological sample (e.g., cfDNA) are treated to convert unmethylated cytosines to uracils, e.g., prior to generating the sequencing library. Accordingly, when the nucleic acids are sequenced, all cytosines called in the sequencing reaction were necessarily methylated, since the unmethylated cytosines were converted to uracils and accordingly would have been called as thymidines, rather than cytosines, in the sequencing reaction. Commercial kits are available for bisulfite-mediated conversion of unmethylated cytosines to uracils, for instance, the EZ DNA Methylation™-Gold, EZ DNA Methylation™-Direct, and EZ DNA Methylation™-Lightning kits (available from Zymo Research Corp (Irvine, CA)). Commercial kits are also available for enzymatic conversion of unmethylated cytosines to uracils, for example, the APOBEC-Seq kit (available from NEBiolabs, Ipswich, MA).
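The logic of reading methylation state from converted DNA can be sketched in a few lines. This is a simplified, single-strand illustration that assumes complete conversion and error-free sequencing; `bisulfite_readout` and `called_methylated` are hypothetical helper names.

```python
def bisulfite_readout(sequence: str, methylated_positions: set) -> str:
    """Simulate the sequenced readout after conversion.

    Unmethylated C is converted to U and read as T; methylated C stays C.
    """
    out = []
    for i, base in enumerate(sequence.upper()):
        if base == "C" and i not in methylated_positions:
            out.append("T")
        else:
            out.append(base)
    return "".join(out)


def called_methylated(reference: str, converted_read: str) -> list:
    """Positions where the reference has C and the converted read still shows C."""
    return [
        i
        for i, (ref_base, read_base) in enumerate(zip(reference.upper(), converted_read))
        if ref_base == "C" and read_base == "C"
    ]
```

For the sequence `ACGTCC` with only position 1 methylated, the simulated readout is `ACGTTT`, and comparing it with the reference recovers position 1 as the only methylated cytosine.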
[0284] In some embodiments, wet lab processing 204 includes pooling (308) DNA molecules from a plurality of libraries, corresponding to different samples from the same and/or different patients, to form a sequencing pool of DNA libraries. When the pool of DNA libraries is sequenced, the resulting sequence reads correspond to nucleic acids isolated from multiple samples. The sequence reads can be separated into different sequence read files, corresponding to the various samples represented in the sequencing reaction, based on the unique identifiers present in the added nucleic acid fragments. In this fashion, a single sequencing reaction can generate sequence reads from multiple samples. Advantageously, this allows for the processing of more samples per sequencing reaction.
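Separating pooled reads by sample can be sketched as follows. As a simplifying assumption, the sample index is taken to be the first `index_length` bases of each read and must match exactly; real sequencers typically report the index in a separate index read and tolerate mismatches. The function name and the "undetermined" bucket are illustrative.

```python
from collections import defaultdict


def demultiplex(reads, index_to_sample, index_length):
    """Split pooled read sequences into per-sample lists by their leading index."""
    by_sample = defaultdict(list)
    for seq in reads:
        index, insert = seq[:index_length], seq[index_length:]
        # Reads whose index is not in the sample sheet go to a catch-all bucket.
        sample = index_to_sample.get(index, "undetermined")
        by_sample[sample].append(insert)
    return dict(by_sample)
```

Each resulting list corresponds to one per-sample sequence read file in the description above.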
[0285] In some embodiments, wet lab processing 204 includes enriching (310) a sequencing library, or pool of sequencing libraries, for target nucleic acids, e.g., nucleic acids encompassing loci that are informative for precision oncology and/or used as internal controls for the sequencing or bioinformatics processes. In some embodiments, enrichment is achieved by hybridizing target nucleic acids in the sequencing library to probes that hybridize to the target sequences, and then isolating the captured nucleic acids away from off-target nucleic acids that are not bound by the capture probes. In some embodiments, one or more off-target nucleic acids will remain in the final sequencing pool.
[0286] Advantageously, enriching for target sequences prior to sequencing nucleic acids significantly reduces the costs and time associated with sequencing, facilitates multiplex sequencing by allowing multiple samples to be mixed together for a single sequencing reaction, and significantly reduces the computational burden of aligning the resulting sequence reads, as a result of significantly reducing the total amount of nucleic acids analyzed from each sample.
[0287] In some embodiments, the enrichment is performed prior to pooling multiple nucleic acid sequencing libraries. However, in other embodiments, the enrichment is performed after pooling nucleic acid sequencing libraries, which has the advantage of reducing the number of enrichment assays that have to be performed.
[0288] In some embodiments, the enrichment is performed prior to generating a nucleic acid sequencing library. This has the advantage that fewer reagents are needed to perform both the enrichment (because there are fewer target sequences at this point, prior to library amplification) and the library production (because there are fewer nucleic acid molecules to tag and amplify after the enrichment). However, this raises the possibility of pull-down bias and/or that small variations in the enrichment protocol will result in less consistent results.
[0289] In some embodiments, nucleic acid libraries are pooled (two or more DNA libraries may be mixed to create a pool) and treated with reagents to reduce off-target capture, for example Human COT-1 and/or IDT xGen Universal Blockers. Pools may be dried in a vacufuge and resuspended. DNA libraries or pools may be hybridized to a probe set (for example, a probe set specific to a panel that includes loci from at least 100, 600, 1,000, 10,000, etc. of the 19,000 known human genes) and amplified with commercially available reagents (for example, the KAPA HiFi HotStart Ready Mix). For example, in some embodiments, a pool is incubated in an incubator, PCR machine, water bath, or other temperature-modulating device to allow probes to hybridize. Pools may then be mixed with Streptavidin-coated beads or another means for capturing hybridized DNA-probe molecules, such as DNA molecules representing exons of the human genome and/or genes selected for a genetic panel.
[0290] Pools may be amplified and purified more than once using commercially available reagents, for example, the KAPA HiFi Library Amplification kit and Axygen MAG PCR clean up beads, respectively. The pools or DNA libraries may be analyzed to determine the concentration or quantity of DNA molecules, for example by using a fluorescent dye (for example, PicoGreen pool quantification) and a fluorescence microplate reader, standard spectrofluorometer, or filter fluorometer. In one example, the DNA library preparation and/or capture is performed with an automated system, using a liquid handling robot (for example, a SciClone NGSx).
[0291] In some embodiments, e.g., where a whole genome sequencing method will be used, nucleic acid sequencing libraries are not target-enriched prior to sequencing, in order to obtain sequencing data on substantially all of the competent nucleic acids in the sequencing library. Similarly, in some embodiments, e.g., where a whole genome sequencing method will be used, nucleic acid sequencing libraries are not mixed, because of bandwidth limitations related to obtaining significant sequencing depth across an entire genome. However, in other embodiments, e.g., where a low pass whole genome sequencing (LPWGS) methodology will be used, nucleic acid sequencing libraries can still be pooled, because very low average sequencing coverage is achieved across a respective genome, e.g., between about 0.5x and about 5x.
[0292] In some embodiments, a plurality of nucleic acid probes (e.g., a probe set) is used to enrich one or more target sequences in a nucleic acid sample (e.g., an isolated nucleic acid sample or a nucleic acid sequencing library), e.g., where one or more target sequences is informative for precision oncology. For instance, in some embodiments, one or more of the target sequences encompasses a locus that is associated with an actionable allele. That is, variations of the target sequence are associated with targeted therapeutic approaches. In some embodiments, one or more of the target sequences and/or a property of one or more of the target sequences is used in a classifier trained to distinguish two or more cancer states.
[0293] In some embodiments, the probe set includes probes targeting one or more gene loci, e.g., exon or intron loci. In some embodiments, the probe set includes probes targeting one or more loci not encoding a protein, e.g., regulatory loci, miRNA loci, and other non-coding loci, e.g., that have been found to be associated with cancer. In some embodiments, the plurality of loci includes at least 25, 50, 100, 150, 200, 250, 300, 350, 400, 500, 750, 1000, 2500, 5000, or more human genomic loci.
[0294] In some embodiments, the probe set includes probes targeting one or more of the genes listed in Table 1. In some embodiments, the probe set includes probes targeting at least 5 of the genes listed in Table 1. In some embodiments, the probe set includes probes targeting at least 10 of the genes listed in Table 1. In some embodiments, the probe set includes probes targeting at least 25 of the genes listed in Table 1. In some embodiments, the probe set includes probes targeting at least 50 of the genes listed in Table 1. In some embodiments, the probe set includes probes targeting at least 75 of the genes listed in Table 1. In some embodiments, the probe set includes probes targeting at least 100 of the genes listed in Table 1. In some embodiments, the probe set includes probes targeting all of the genes listed in Table 1.
[0295] Table 1. An example panel of 105 genes that are informative for precision oncology.
[0296] In some embodiments, the probe set includes probes targeting one or more of the genes listed in List 1, provided below. In some embodiments, the probe set includes probes targeting at least 5 of the genes listed in List 1. In some embodiments, the probe set includes probes targeting at least 10 of the genes listed in List 1. In some embodiments, the probe set includes probes targeting at least 25 of the genes listed in List 1. In some embodiments, the probe set includes probes targeting at least 50 of the genes listed in List 1. In some embodiments, the probe set includes probes targeting at least 70 of the genes listed in List 1. In some embodiments, the probe set includes probes targeting all of the genes listed in List 1.
[0297] In some embodiments, the probe set includes probes targeting one or more of the genes listed in List 2, provided below. In some embodiments, the probe set includes probes targeting at least 5 of the genes listed in List 2. In some embodiments, the probe set includes probes targeting at least 10 of the genes listed in List 2. In some embodiments, the probe set includes probes targeting at least 25 of the genes listed in List 2. In some embodiments, the probe set includes probes targeting at least 50 of the genes listed in List 2. In some embodiments, the probe set includes probes targeting at least 75 of the genes listed in List 2. In some embodiments, the probe set includes probes targeting at least 100 of the genes listed in List 2. In some embodiments, the probe set includes probes targeting all of the genes listed in List 2.
[0298] In some embodiments, panels of genes including one or more genes from the following lists are used for analyzing specimens, sequencing, and/or identification. In some embodiments, panels of genes for analyzing specimens, sequencing, and/or identification include one or more genes from List 1 or List 2. In some embodiments, panels of genes for analyzing specimens, sequencing, and/or identification include one or more genes from:
[0299] List 1: AKT1 (14q32.33), ALK (2p23.2-23.1), APC (5q22.2), AR (Xq12), ARAF (Xp11.3), ARID1A (1p36.11), ATM (11q22.3), BRAF (7q34), BRCA1 (17q21.31), BRCA2 (13q13.1), CCND1 (11q13.3), CCND2 (12p13.32), CCNE1 (19q12), CDH1 (16q22.1), CDK4 (12q14.1), CDK6 (7q21.2), CDKN2A (9p21.3), CTNNB1 (3p22.1), DDR2 (1q23.3), EGFR (7p11.2), ERBB2 (17q12), ESR1 (6q25.1-25.2), EZH2 (7q36.1), FBXW7 (4q31.3), FGFR1 (8p11.23), FGFR2 (10q26.13), FGFR3 (4p16.3), GATA3 (10p14), GNA11 (19p13.3), GNAQ (9q21.2), GNAS (20q13.32), HNF1A (12q24.31), HRAS (11p15.5), IDH1 (2q34), IDH2 (15q26.1), JAK2 (9p24.1), JAK3 (19p13.11), KIT (4q12), KRAS (12p12.1), MAP2K1 (15q22.31), MAP2K2 (19p13.3), MAPK1 (22q11.22), MAPK3 (16p11.2), MET (7q31.2), MLH1 (3p22.2), MPL (1p34.2), MTOR (1p36.22), MYC (8q24.21), NF1 (17q11.2), NFE2L2 (2q31.2), NOTCH1 (9q34.3), NPM1 (5q35.1), NRAS (1p13.2), NTRK1 (1q23.1), NTRK3 (15q25.3), PDGFRA (4q12), PIK3CA (3q26.32), PTEN (10q23.31), PTPN11 (12q24.13), RAF1 (3p25.2), RB1 (13q14.2), RET (10q11.21), RHEB (7q36.1), RHOA (3p21.31), RIT1 (1q22), ROS1 (6q22.1), SMAD4 (18q21.2), SMO (7q32.1), STK11 (19p13.3), TERT (5p15.33), TP53 (17p13.1), TSC1 (9q34.13), and VHL (3p25.3).
[0300] List 2: ABL1, ACVR1B, AKT1, AKT2, AKT3, ALK, ALOX12B, AMER1 (FAM123B), APC, AR, ARAF, ARFRP1, ARID1A, ASXL1, ATM, ATR, ATRX, AURKA, AURKB, AXIN1, AXL, BAP1, BARD1, BCL2, BCL2L1, BCL2L2, BCL6, BCOR, BCORL1, BRAF, BRCA1, BRCA2, BRD4, BRIP1, BTG1, BTG2, BTK, C11orf30 (EMSY), C17orf39 (GID4), CALR, CARD11, CASP8, CBFB, CBL, CCND1, CCND2, CCND3, CCNE1, CD22, CD274 (PD-L1), CD70, CD79A, CD79B, CDC73, CDH1, CDK12, CDK4, CDK6, CDK8, CDKN1A, CDKN1B, CDKN2A, CDKN2B, CDKN2C, CEBPA, CHEK1, CHEK2, CIC, CREBBP, CRKL, CSF1R, CSF3R, CTCF, CTNNA1, CTNNB1, CUL3, CUL4A, CXCR4, CYP17A1, DAXX, DDR1, DDR2, DIS3, DNMT3A, DOT1L, EED, EGFR, EP300, EPHA3, EPHB1, EPHB4, ERBB2, ERBB3, ERBB4, ERCC4, ERG, ERRFI1, ESR1, EZH2, FAM46C, FANCA, FANCC, FANCG, FANCL, FAS, FBXW7, FGF10, FGF12, FGF14, FGF19, FGF23, FGF3, FGF4, FGF6, FGFR1, FGFR2, FGFR3, FGFR4, FH, FLCN, FLT1, FLT3, FOXL2, FUBP1, GABRA6, GATA3, GATA4, GATA6, GNA11, GNA13, GNAQ, GNAS, GRM3, GSK3B, H3F3A, HDAC1, HGF, HNF1A, HRAS, HSD3B1, ID3, IDH1, IDH2, IGF1R, IKBKE, IKZF1, INPP4B, IRF2, IRF4, IRS2, JAK1, JAK2, JAK3, JUN, KDM5A, KDM5C, KDM6A, KDR, KEAP1, KEL, KIT, KLHL6, KMT2A, KMT2D (MLL2), KRAS, LTK, LYN, MAF, MAP2K1 (MEK1), MAP2K2 (MEK2), MAP2K4, MAP3K1, MAP3K13, MAPK1, MCL1, MDM2, MDM4, MED12, MEF2B, MEN1, MERTK, MET, MITF, MKNK1, MLH1, MPL, MRE11A, MSH2, MSH3, MSH6, MST1R, MTAP, MTOR, MUTYH, MYC, MYCL (MYCL1), MYCN, MYD88, NBN, NF1, NF2, NFE2L2, NFKBIA, NKX2-1, NOTCH1, NOTCH2, NOTCH3, NPM1, NRAS, NSD3 (WHSC1L1), NT5C2, NTRK1, NTRK2, NTRK3, P2RY8, PALB2, PARK2, PARP1, PARP2, PARP3, PAX5, PBRM1, PDCD1 (PD-1), PDCD1LG2 (PD-L2), PDGFRA, PDGFRB, PDK1, PIK3C2B, PIK3C2G, PIK3CA, PIK3CB, PIK3R1, PIM1, PMS2, POLD1, POLE, PPARG, PPP2R1A, PPP2R2A, PRDM1, PRKAR1A, PRKCI, PTCH1, PTEN, PTPN11, PTPRO, QKI, RAC1, RAD21, RAD51, RAD51B, RAD51C, RAD51D, RAD52, RAD54L, RAF1, RARA, RB1, RBM10, REL, RET, RICTOR, RNF43, ROS1, RPTOR, SDHA, SDHB, SDHC, SDHD, SETD2, SF3B1, SGK1, SMAD2, SMAD4, SMARCA4, SMARCB1, SMO, SNCAIP, SOCS1, SOX2, SOX9, SPEN, SPOP, SRC, STAG2, STAT3, STK11, SUFU, SYK, TBX3, TEK, TERC, TERT, TET2, ncRNA, Promoter, TGFBR2, TIPARP, TNFAIP3, TNFRSF14, TP53, TSC1, TSC2, TYRO3, U2AF1, VEGFA, VHL, WHSC1, WT1, XPO1, XRCC2, ZNF217, and ZNF703.
[0301] Generally, probes for enrichment of nucleic acids (e.g., cfDNA obtained from a liquid biopsy sample) include DNA, RNA, or a modified nucleic acid structure with a base sequence that is complementary to a locus of interest. For instance, a probe designed to hybridize to a locus in a cfDNA molecule can contain a sequence that is complementary to either strand, because the cfDNA molecules are double stranded. In some embodiments, each probe in the plurality of probes includes a nucleic acid sequence that is identical or complementary to at least 10, at least 11, at least 12, at least 13, at least 14, or at least 15 consecutive bases of a locus of interest. In some embodiments, each probe in the plurality of probes includes a nucleic acid sequence that is identical or complementary to at least 20, 25, 30, 40, 50, 75, 100, 150, 200, or more consecutive bases of a locus of interest.
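The "identical or complementary to at least N consecutive bases" criterion can be checked in silico with a minimal sketch. It assumes plain A/C/G/T strings, treats "complementary" as the reverse complement, and requires exact matches; the helper names and the default of 15 bases are illustrative only.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string containing only A, C, G, T."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def longest_shared_run(probe: str, locus: str) -> int:
    """Length of the longest stretch of the probe that occurs verbatim in the locus."""
    best = 0
    for i in range(len(probe)):
        j = i + best + 1  # only look for runs longer than the best so far
        while j <= len(probe) and probe[i:j] in locus:
            best = j - i
            j += 1
    return best


def probe_matches_locus(probe: str, locus: str, min_run: int = 15) -> bool:
    """True if the probe is identical or complementary to >= min_run consecutive locus bases."""
    probe, locus = probe.upper(), locus.upper()
    identical = longest_shared_run(probe, locus)
    complementary = longest_shared_run(reverse_complement(probe), locus)
    return max(identical, complementary) >= min_run
```

Because either strand of a double-stranded cfDNA molecule can be targeted, the check accepts a probe that matches the given locus strand or its reverse complement.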
[0302] Targeted panels provide several benefits for nucleic acid sequencing. For example, in some embodiments, algorithms for discriminating between, e.g., a first and second cancer condition can be trained on smaller, more informative data sets (e.g., fewer genes), which leads to more computationally efficient training of classifiers that discriminate between the first and second cancer states. Such improvements in computational efficiency, owing to the reduced size of the discriminating gene set, can advantageously either be used to speed up classifier training or be used to improve the performance of such classifiers (e.g., through more extensive training of the classifier).
[0303] In some embodiments, the gene panel is a whole-exome panel that analyzes the exomes of a biological sample. In some embodiments, the gene panel is a whole-genome panel that analyzes the genome of a specimen. In some embodiments, the gene panel is optimized for use with liquid biopsy samples (e.g., to provide clinical decision support for solid tumors). See, for example, Table 1 above.
[0304] In some embodiments, the probes include additional nucleic acid sequences that do not share any homology to the locus of interest. For example, in some embodiments, the probes also include nucleic acid sequences containing an identifier sequence, e.g., a unique molecular identifier (UMI), e.g., that is unique to a particular sample or subject. Examples of identifier sequences are described, for example, in Kivioja et al., 2011, Nat. Methods 9(1), pp. 72-74 and Islam et al., 2014, Nat. Methods 11(2), pp. 163-66, which are incorporated by reference herein. Similarly, in some embodiments, the probes also include primer nucleic acid sequences useful for amplifying the nucleic acid molecule of interest, e.g., using PCR. In some embodiments, the probes also include a capture sequence designed to hybridize to an anti-capture sequence for recovering the nucleic acid molecule of interest from the sample.
[0305] Likewise, in some embodiments, the probes each include a non-nucleic acid affinity moiety covalently attached to a nucleic acid molecule that is complementary to the locus of interest, for recovering the nucleic acid molecule of interest. Non-limiting examples of non-nucleic acid affinity moieties include biotin, digoxigenin, and dinitrophenol. In some embodiments, the probe is attached to a solid-state surface or particle, e.g., a dipstick or magnetic bead, for recovering the nucleic acid of interest. In some embodiments, the methods described herein include amplifying the nucleic acids that bound to the probe set prior to further analysis, e.g., sequencing. Methods for amplifying nucleic acids, e.g., by PCR, are well known in the art.
[0306] Sequence reads are then generated (312) from the sequencing library or pool of sequencing libraries. Sequencing data may be acquired by any methodology known in the art, for example, next generation sequencing (NGS) techniques such as sequencing-by-synthesis technology (Illumina), pyrosequencing (454 Life Sciences), ion semiconductor technology (Ion Torrent sequencing), single-molecule real-time sequencing (Pacific Biosciences), sequencing by ligation (SOLiD sequencing), nanopore sequencing (Oxford Nanopore Technologies), or paired-end sequencing. In some embodiments, massively parallel sequencing is performed using sequencing-by-synthesis with reversible dye terminators. In some embodiments, sequencing is performed using next generation sequencing technologies, such as short-read technologies. In other embodiments, long-read sequencing or another sequencing method known in the art is used.
[0307] Next-generation sequencing produces millions of short reads (e.g., sequence reads) for each biological sample. Accordingly, in some embodiments, the plurality of sequence reads obtained by next-generation sequencing of cfDNA molecules are DNA sequence reads. In some embodiments, the sequence reads have an average length of at least fifty nucleotides. In other embodiments, the sequence reads have an average length of at least 50, 60, 70, 80, 90, 100, 150, 200, 250, 300, or more nucleotides.
[0308] In some embodiments, sequencing is performed after enriching for nucleic acids (e.g., cfDNA, gDNA, and/or RNA) encompassing a plurality of predetermined target sequences, e.g., human genes and/or non-coding sequences associated with cancer. Advantageously, sequencing a nucleic acid sample that has been enriched for target nucleic acids, rather than all nucleic acids isolated from a biological sample, significantly reduces the average time and cost of the sequencing reaction. Accordingly, in some embodiments, the methods described herein include obtaining a plurality of sequence reads of nucleic acids that have been hybridized to a probe set for hybrid-capture enrichment (e.g., of one or more genes listed in Table 1).
[0309] In some embodiments, panel-targeting sequencing is performed to an average on-target depth of at least 500x, at least 750x, at least 1000x, at least 2500x, at least 5000x, at least 10,000x, or greater depth. In some embodiments, samples are further assessed for uniformity above a sequencing depth threshold (e.g., 95% of all targeted base pairs at 300x sequencing depth). In some embodiments, the sequencing depth threshold is a minimum depth selected by a user or practitioner.
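The uniformity check described above can be illustrated with a minimal sketch. The function names, the default values (300x, 95%), and the example depths are hypothetical choices for illustration, not part of the disclosed system.

```python
from typing import Sequence


def fraction_at_or_above(depths: Sequence[int], threshold: int) -> float:
    """Return the fraction of targeted positions with depth >= threshold."""
    if not depths:
        raise ValueError("no targeted positions supplied")
    covered = sum(1 for depth in depths if depth >= threshold)
    return covered / len(depths)


def passes_uniformity(depths: Sequence[int],
                      threshold: int = 300,
                      min_fraction: float = 0.95) -> bool:
    """True when at least min_fraction of positions reach the depth threshold."""
    return fraction_at_or_above(depths, threshold) >= min_fraction


# Hypothetical per-base depths over ten targeted positions.
example_depths = [520, 480, 310, 295, 700, 650, 305, 900, 410, 330]
```

With these example values, nine of ten positions reach 300x, so the sample would fall short of a 95% uniformity requirement.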
[0310] In some embodiments, the sequence reads are obtained by a whole genome or whole exome sequencing methodology. In some such embodiments, whole exome capture is performed with an automated system, using a liquid handling robot (for example, a SciClone NGSx). Whole genome sequencing, and to some extent whole exome sequencing, is typically performed at lower sequencing depth than smaller target-panel sequencing reactions, because many more loci are being sequenced. For example, in some embodiments, whole genome or whole exome sequencing is performed to an average sequencing depth of at least 3x, at least 5x, at least 10x, at least 15x, at least 20x, or greater. In some embodiments, low-pass whole genome sequencing (LPWGS) techniques are used for whole genome or whole exome sequencing. LPWGS is typically performed to an average sequencing depth of about 0.25x to about 5x, more typically to an average sequencing depth of about 0.5x to about 3x.
[0311] Because of the differences in the sequencing methodologies, data obtained from targeted-panel sequencing is better suited for certain analyses than data obtained from whole genome/whole exome sequencing, and vice versa. For instance, because of the higher sequencing depth achieved by targeted-panel sequencing, the resulting sequence data is better suited for the identification of variant alleles present at low allelic fractions in the sample, e.g., less than 20%. By contrast, data generated from whole genome/whole exome sequencing is better suited for the estimation of genome-wide metrics, such as tumor mutational burden, because the entire genome is better represented in the sequencing data. Accordingly, in some embodiments, a nucleic acid sample, e.g., a cfDNA, gDNA, or mRNA sample, is evaluated using both targeted-panel sequencing and whole genome/whole exome sequencing (e.g., LPWGS).
[0312] In some embodiments, the raw sequence reads resulting from the sequencing reaction are output from the sequencer in a native file format, e.g., a BCL file. In some embodiments, the native file is passed directly to a bioinformatics pipeline (e.g., variant analysis 206), components of which are described in detail below. In other embodiments, pre-processing is performed prior to passing the sequences to the bioinformatics platform.
For instance, in some embodiments, the format of the sequence read file is converted from the native file format (e.g., BCL) to a file format compatible with one or more algorithms used in the bioinformatics pipeline (e.g., FASTQ or FASTA). In some embodiments, the raw sequence reads are filtered to remove sequences that do not meet one or more quality thresholds. In some embodiments, raw sequence reads generated from the same unique nucleic acid molecule in the sequencing reaction are collapsed into a single sequence read representing the molecule, e.g., using UMIs as described above. In some embodiments, one or more of these pre-processing activities is performed within the bioinformatics pipeline itself.
[0313] In one example, a sequencer may generate a BCL file. A BCL file may include raw image data of a plurality of patient specimens which are sequenced. BCL image data is an image of the flow cell across each cycle during sequencing. A cycle may be implemented by illuminating a patient specimen with a specific wavelength of electromagnetic radiation, generating a plurality of images which may be processed into base calls via BCL to FASTQ processing algorithms which identify which base pairs are present at each cycle. The resulting FASTQ file includes the entirety of reads for each patient specimen paired with a quality metric, e.g., in a range from 0 to 64 where a 64 is the best quality and a 0 is the worst quality. In embodiments where both a liquid biopsy sample and a normal tissue sample are sequenced, sequence reads in the corresponding FASTQ files may be matched, such that a liquid biopsy-normal analysis may be performed.
[0314] FASTQ format is a text-based format for storing both a biological sequence, such as a nucleotide sequence, and its corresponding quality scores. These FASTQ files are analyzed to determine what genetic variants or copy number changes are present in the sample. Each FASTQ file contains reads that may be paired-end or single reads, and may be short-reads or long-reads, where each read represents one detected sequence of nucleotides in a nucleic acid molecule that was isolated from the patient sample or a copy of the nucleic acid molecule, detected by the sequencer. Each read in the FASTQ file is also associated with a quality rating. The quality rating may reflect the likelihood that an error occurred during the sequencing procedure that affected the associated read. In some embodiments, the results of paired-end sequencing of each isolated nucleic acid sample are contained in a split pair of FASTQ files, for efficiency. Thus, in some embodiments, forward (Read 1) and reverse (Read 2) sequences of each isolated nucleic acid sample are stored separately but in the same order and under the same identifier.
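As an illustration of the FASTQ structure described above, the following is a minimal parsing sketch. It assumes four-line records and a Phred+33 quality encoding (a common convention, but an assumption here); the record type, function names, and example reads are hypothetical.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]


def parse_fastq(handle: TextIO, offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-read FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        name = header[1:].split()[0]
        yield FastqRecord(name, sequence, [ord(c) - offset for c in quality])


def error_probability(phred: int) -> float:
    """Convert a Phred score to the estimated base-call error probability."""
    return 10 ** (-phred / 10)


def same_order(read1: List[FastqRecord], read2: List[FastqRecord]) -> bool:
    """Check that split Read 1 / Read 2 files list the same identifiers in order."""
    return [r.name for r in read1] == [r.name for r in read2]


# Hypothetical split pair of FASTQ files held in memory.
r1 = list(parse_fastq(io.StringIO("@frag1\nACGT\n+\nII5!\n@frag2\nGGCA\n+\nIIII\n")))
r2 = list(parse_fastq(io.StringIO("@frag1\nTTGA\n+\nIIII\n@frag2\nCCAT\n+\nIIII\n")))
```

The `same_order` check reflects the requirement that forward and reverse reads are stored separately but in the same order and under the same identifier.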
[0315] In various embodiments, the bioinformatics pipeline may filter FASTQ data from the corresponding sequence data file for each respective biological sample. Such filtering may include correcting or masking sequencer errors and removing (trimming) low quality sequences or bases, adapter sequences, contaminations, chimeric reads, overrepresented sequences, biases caused by library preparation, amplification, or capture, and other errors.
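Two of the filtering steps named above, adapter removal and trimming of low-quality bases, can be sketched as follows. This is a simplified illustration that assumes exact adapter matches and a fixed quality cutoff; dedicated trimming tools use more tolerant matching. The function names, the cutoff of 20, and the example adapter are hypothetical.

```python
from typing import List, Tuple


def trim_adapter(seq: str, quals: List[int], adapter: str) -> Tuple[str, List[int]]:
    """Remove the first exact adapter occurrence and everything after it."""
    index = seq.find(adapter)
    if index == -1:
        return seq, quals
    return seq[:index], quals[:index]


def trim_low_quality_tail(seq: str, quals: List[int],
                          min_quality: int = 20) -> Tuple[str, List[int]]:
    """Drop bases from the 3' end until one meets the quality cutoff."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_quality:
        end -= 1
    return seq[:end], quals[:end]


# Hypothetical read: four insert bases followed by a ten-base adapter.
read_seq = "ACGTAGATCGGAAG"
read_quals = [30, 30, 30, 5] + [30] * 10
seq, quals = trim_adapter(read_seq, read_quals, "AGATCGGAAG")
seq, quals = trim_low_quality_tail(seq, quals)
```

In this example the adapter is removed first, leaving `ACGT`, and the final low-quality base is then trimmed.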
[0316] While workflow 200 illustrates obtaining a biological sample, extracting nucleic acids from the biological sample, and sequencing the isolated nucleic acids, in some embodiments, sequencing data used in the improved systems and methods described herein (e.g., which include improved methods for validating copy number variations, improved methods for validating a somatic sequence variant in a test subject having a cancer condition, and/or improved methods for determining accurate circulating tumor fraction estimates) is obtained by receiving previously generated sequence reads, in electronic form.
[0317] Referring again to Figure 2A, nucleic acid sequencing data 122 generated from the one or more patient samples is then evaluated (e.g., via variant analysis 206) in a bioinformatics pipeline, e.g., using bioinformatics module 140 of system 100, to identify genomic alterations and other metrics in the cancer genome of the patient. An example overview for a bioinformatics pipeline is described below with respect to Figure 4 (e.g., Figure 4A-E, 4F1-3, and/or 4G1-3). Advantageously, in some embodiments, the present disclosure improves bioinformatics pipelines, like pipeline 206, by improving methods and systems for the validation of copy number variations, the validation of somatic sequence variants, and/or the determination of circulating tumor fraction estimates.
[0318] Figure 4A illustrates an example bioinformatics pipeline 206 (e.g., as used for feature extraction in the workflows illustrated in Figures 2A and 3) for providing clinical support for precision oncology. As shown in Figure 4A, sequencing data 122 obtained from the wet lab processing 204 (e.g., sequence reads 314) is input into the pipeline.
[0319] In various embodiments, the bioinformatics pipeline includes a circulating tumor DNA (ctDNA) pipeline for analyzing liquid biopsy samples. The pipeline may detect SNVs, INDELs, copy number amplifications/deletions and genomic rearrangements (for example, fusions). The pipeline may employ unique molecular index (UMI)-based consensus base calling as a method of error suppression as well as a Bayesian tri-nucleotide context-based position level error suppression. In various embodiments, it is able to detect variants having a 0.1%, 0.15%, 0.2%, 0.25%, 0.3%, 0.4%, or 0.5% variant allele fraction.
[0320] In some embodiments, the sequencing data is processed (e.g., using sequence data processing module 141) to prepare it for genomic feature identification 385. For instance, in some embodiments as described above, the sequencing data is present in a native file format provided by the sequencer. Accordingly, in some embodiments, the system (e.g., system 100) applies a pre-processing algorithm 142 to convert the file format (318) to one that is recognized by one or more downstream processing algorithms. For example, BCL file outputs from a sequencer can be converted to a FASTQ file format using the bcl2fastq or bcl2fastq2 conversion software (Illumina®). FASTQ format is a text-based format for storing both a biological sequence, such as a nucleotide sequence, and its corresponding quality scores. These FASTQ files are analyzed to determine what genetic variants, copy number changes, etc., are present in the sample.
[0321] In some embodiments, other preprocessing functions are performed, e.g., filtering sequence reads 122 based on a desired quality, e.g., size and/or quality of the base calling. In some embodiments, quality control checks are performed to ensure the data is sufficient for variant calling. For instance, entire reads, individual nucleotides, or multiple nucleotides that are likely to have errors may be discarded based on the quality rating associated with the read in the FASTQ file, the known error rate of the sequencer, and/or a comparison between each nucleotide in the read and one or more nucleotides in other reads that have been aligned to the same location in the reference genome. Filtering may be done in part or in its entirety by various software tools, for example, a software tool such as Skewer. See, Jiang, H. et al., BMC Bioinformatics 15(182): 1-12 (2014). FASTQ files may be analyzed for rapid assessment of quality control and reads, for example, by a sequencing data QC software such as AfterQC, Kraken, RNA-SeQC, FastQC, or another similar software program. For paired end reads, reads may be merged.
[0322] In some embodiments, when both a liquid biopsy sample and a normal tissue sample from the patient are sequenced, two FASTQ output files are generated, one for the liquid biopsy sample and one for the normal tissue sample. A ‘matched’ (e.g., panel-specific) workflow is run to jointly analyze the liquid biopsy-normal matched FASTQ files. When a matched normal sample is not available from the patient, FASTQ files from the liquid biopsy sample are analyzed in the ‘tumor-only’ mode. See, for example, Figure 4B. If two or more patient samples are processed simultaneously on the same sequencer flow cell, e.g., a liquid biopsy sample and a normal tissue sample, the adapters used for each patient sample differ in sequence and thereby barcode the nucleic acids extracted from each sample, so that each read can be associated with the correct patient sample and assigned to the correct FASTQ file.
[0323] For efficiency, in some embodiments, the results of paired-end sequencing of each isolate are contained in a split pair of FASTQ files. Forward (Read 1) and reverse (Read 2) sequences of each tumor and normal isolate are stored separately but in the same order and under the same identifier. See, for example, Figure 4C. In various embodiments, the bioinformatics pipeline may filter FASTQ data from each isolate. Such filtering may include correcting or masking sequencer errors and removing (trimming) low quality sequences or bases, adapter sequences, contaminations, chimeric reads, overrepresented sequences, biases caused by library preparation, amplification, or capture, and other errors. See, for example, Figure 4D.
[0324] Similarly, in some embodiments, sequencing (312) is performed on a pool of nucleic acid sequencing libraries prepared from different biological samples, e.g., from the same or different patients. Accordingly, in some embodiments, the system demultiplexes (320) the data (e.g., using demultiplexing algorithm 144) to separate sequence reads into separate files for each sequencing library included in the sequencing pool, e.g., based on UMI or patient identifier sequences added to the nucleic acid fragments during sequencing library preparation, as described above. In some embodiments, the demultiplexing algorithm is part of the same software package as one or more pre-processing algorithms 142. For instance, the bcl2fastq or bcl2fastq2 conversion software (Illumina®) includes instructions for both converting the native file format output from the sequencer and demultiplexing sequence reads 122 output from the reaction.
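The demultiplexing step described above can be sketched in simplified form. The sketch assumes, purely for illustration, that each read begins with an inline sample barcode of fixed length and that barcodes must match exactly; the function name, barcodes, and sample labels are hypothetical, and production software typically reads barcodes from separate index reads and tolerates mismatches.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Read = Tuple[str, str]  # (read name, sequence)


def demultiplex(reads: Iterable[Read],
                barcode_to_sample: Dict[str, str],
                barcode_length: int) -> Dict[str, List[Read]]:
    """Group reads by sample using an exact-match inline barcode prefix."""
    bins: Dict[str, List[Read]] = defaultdict(list)
    for name, sequence in reads:
        barcode = sequence[:barcode_length]
        insert = sequence[barcode_length:]
        sample = barcode_to_sample.get(barcode, "undetermined")
        bins[sample].append((name, insert))
    return dict(bins)


# Hypothetical pooled reads from two libraries plus one unassignable read.
pooled = [("r1", "ACGTTTTT"), ("r2", "TGCAGGGG"), ("r3", "NNNNCCCC")]
samples = {"ACGT": "liquid_biopsy", "TGCA": "normal_tissue"}
by_sample = demultiplex(pooled, samples, barcode_length=4)
```

Reads whose barcode is not recognized are kept in a separate "undetermined" bin rather than discarded, so that they can be inspected during quality control.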
[0325] The sequence reads are then aligned (322), e.g., using an alignment algorithm 143, to a reference sequence construct 158, e.g., a reference genome, reference exome, or other reference construct prepared for a particular targeted-panel sequencing reaction. For example, in some embodiments, individual sequence reads 123, in electronic form (e.g., in FASTQ files), are aligned against a reference sequence construct for the species of the subject (e.g., a reference human genome) by identifying a sequence in a region of the reference sequence construct that best matches the sequence of nucleotides in the sequence read. In some embodiments, the sequence reads are aligned to a reference exome or reference genome using known methods in the art to determine alignment position information. The alignment position information may indicate a beginning position and an end position of a region in the reference genome that corresponds to a beginning nucleotide base and end nucleotide base of a given sequence read. Alignment position information may also include sequence read length, which can be determined from the beginning position and end position. A region in the reference genome may be associated with a gene or a segment of a gene. Any of a variety of alignment tools can be used for this task.
[0326] For instance, local sequence alignment algorithms compare subsequences of different lengths in the query sequence (e.g., sequence read) to subsequences in the subject sequence (e.g., reference construct) to create the best alignment for each portion of the query sequence. In contrast, global sequence alignment algorithms align the entirety of the sequences, e.g., end to end. Examples of local sequence alignment algorithms include the Smith-Waterman algorithm (see, for example, Smith and Waterman, J. Mol. Biol., 147(1):195-97 (1981), which is incorporated herein by reference), Lalign (see, for example, Huang and Miller, Adv. Appl. Math, 12:337-57 (1991), which is incorporated by reference herein), and PatternHunter (see, for example, Ma B. et al., Bioinformatics, 18(3):440-45 (2002), which is incorporated by reference herein).
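A compact sketch of Smith-Waterman local alignment follows, using a linear gap penalty. The scoring values (match +2, mismatch -1, gap -2), the function name, and the example sequences are illustrative assumptions rather than parameters taken from this disclosure.

```python
from typing import Tuple


def smith_waterman(query: str, ref: str, match: int = 2,
                   mismatch: int = -1, gap: int = -2) -> Tuple[int, str, str, int]:
    """Return (best score, aligned query, aligned ref, 0-based ref start)."""
    rows, cols = len(query) + 1, len(ref) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_cell = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if query[i - 1] == ref[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,  # align two bases
                              score[i - 1][j] + gap,      # gap in reference
                              score[i][j - 1] + gap)      # gap in query
            if score[i][j] > best:
                best, best_cell = score[i][j], (i, j)
    # Trace back from the best cell until a zero-score cell is reached.
    i, j = best_cell
    aligned_query, aligned_ref = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if query[i - 1] == ref[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            aligned_query.append(query[i - 1])
            aligned_ref.append(ref[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            aligned_query.append(query[i - 1])
            aligned_ref.append("-")
            i -= 1
        else:
            aligned_query.append("-")
            aligned_ref.append(ref[j - 1])
            j -= 1
    return best, "".join(reversed(aligned_query)), "".join(reversed(aligned_ref)), j


result = smith_waterman("ACGT", "TTACGTTT")
```

Because cell scores are floored at zero, the alignment may begin and end anywhere in either sequence, which is the property that distinguishes local from global alignment.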
[0327] In some embodiments, the read mapping process starts by building an index of either the reference genome or the reads, which is then used to retrieve the set of positions in the reference sequence where the reads are more likely to align. Once this subset of possible mapping locations has been identified, alignment is performed in these candidate regions with slower and more sensitive algorithms. See, for example, Hatem et al., 2013, “Benchmarking short sequence mapping tools,” BMC Bioinformatics 14: p. 184; and Flicek and Birney, 2009, “Sense from sequence reads: methods for alignment and assembly,” Nat Methods 6(Suppl. 11), S6-S12, each of which is hereby incorporated by reference. In some embodiments, the mapping tools methodology makes use of a hash table or a Burrows-Wheeler transform (BWT). See, for example, Li and Homer, 2010, “A survey of sequence alignment algorithms for next-generation sequencing,” Brief Bioinformatics 11, pp. 473-483, which is hereby incorporated by reference.
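The index-then-align strategy described above can be illustrated with a hash-table sketch: a k-mer index of the reference is built once, and each read's k-mers vote for candidate start positions that a slower aligner would then examine. The function names, the choice of k = 4, and the toy sequences are hypothetical; a Burrows-Wheeler index would serve the same role with a different data structure.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def build_kmer_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer in the reference to its 0-based start positions."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_positions(read: str, index: Dict[str, List[int]],
                        k: int) -> List[Tuple[int, int]]:
    """Return (implied read start, supporting k-mer count), best first."""
    votes: Counter = Counter()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], ()):
            votes[pos - offset] += 1  # start position implied by this seed
    return votes.most_common()


toy_reference = "GATTACAGATTTCCAGGCTA"
toy_index = build_kmer_index(toy_reference, k=4)
candidates = candidate_positions("TTTCCAGG", toy_index, k=4)
```

Only the highest-voted candidate regions would then be passed to a more sensitive algorithm, such as a local aligner, for the final alignment.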
[0328] Other software programs designed to align reads include, for example, Novoalign (Novocraft, Inc.), Bowtie, Burrows Wheeler Aligner (BWA), and/or programs that use a Smith-Waterman algorithm. Candidate reference genomes include, for example, hg19, GRCh38, hg38, GRCh37, and/or other reference genomes developed by the Genome Reference Consortium. In some embodiments, the alignment generates a SAM file, which stores the locations of the start and end of each read according to coordinates in the reference genome and the coverage (number of reads) for each nucleotide in the reference genome.
[0329] For example, in some embodiments, each read of a FASTQ file is aligned to a location in the human genome having a sequence that best matches the sequence of nucleotides in the read. There are many software programs designed to align reads, for example, Novoalign (Novocraft, Inc.), Bowtie, Burrows Wheeler Aligner (BWA), programs that use a Smith-Waterman algorithm, etc. Alignment may be directed using a reference genome (for example, hg19, GRCh38, hg38, GRCh37, other reference genomes developed by the Genome Reference Consortium, etc.) by comparing the nucleotide sequences in each read with portions of the nucleotide sequence in the reference genome to determine the portion of the reference genome sequence that is most likely to correspond to the sequence in the read. In some embodiments, one or more SAM files are generated for the alignment, which store the locations of the start and end of each read according to coordinates in the reference genome and the coverage (number of reads) for each nucleotide in the reference genome. The SAM files may be converted to BAM files. In some embodiments, the BAM files are sorted, and duplicate reads are marked for deletion, resulting in de-duplicated BAM files.
[0330] In some embodiments, adapter-trimmed FASTQ files are aligned to the 19th edition of the human reference genome build (HG19) using Burrows-Wheeler Aligner (BWA, Li and Durbin, Bioinformatics, 25(14):1754-60 (2009)). Following alignment, reads are grouped by alignment position and UMI family and collapsed into consensus sequences, for example, using fgbio tools (e.g., available on the internet at fulcrumgenomics.github.io/fgbio/). Bases with insufficient quality or significant disagreement among family members (for example, when it is uncertain whether the base is an adenine, cytosine, guanine, etc.) may be replaced by N's to represent a wildcard nucleotide type. PHRED scores are then scaled based on initial base calling estimates combined across all family members. Following single-strand consensus generation, duplex consensus sequences are generated by comparing the forward and reverse oriented PCR products with mirrored UMI sequences. In various embodiments, a consensus can be generated across read pairs. Otherwise, single-strand consensus calls will be used. Following consensus calling, filtering is performed to remove low-quality consensus fragments. The consensus fragments are then re-aligned to the human reference genome using BWA. A BAM output file is generated after the re-alignment, then sorted by alignment position, and indexed.
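The grouping of reads into UMI families and their collapse into consensus sequences can be sketched as below. This is a simple majority-vote illustration, not the likelihood-based model implemented in fgbio; the 70% agreement cutoff, the function names, and the example reads are hypothetical, and reads within a family are assumed to have equal length.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

AlignedRead = Tuple[int, str, str]  # (alignment position, UMI, sequence)


def consensus(sequences: List[str], min_agreement: float = 0.7) -> str:
    """Per-column majority vote; columns with weak agreement become 'N'."""
    bases = []
    for column in zip(*sequences):
        base, count = Counter(column).most_common(1)[0]
        bases.append(base if count / len(column) >= min_agreement else "N")
    return "".join(bases)


def collapse_families(reads: Iterable[AlignedRead]) -> Dict[Tuple[int, str], str]:
    """Group reads by (position, UMI) and collapse each family to a consensus."""
    families: Dict[Tuple[int, str], List[str]] = defaultdict(list)
    for position, umi, sequence in reads:
        families[(position, umi)].append(sequence)
    return {key: consensus(seqs) for key, seqs in families.items()}


example_reads = [
    (100, "AACG", "ACGTA"),
    (100, "AACG", "ACGTA"),
    (100, "AACG", "ACCTA"),  # disagrees at the third base
    (100, "TTGC", "ACGTA"),  # same position, different UMI
    (250, "AACG", "GGGCC"),
    (250, "AACG", "GGTCC"),  # disagrees at the third base
]
collapsed = collapse_families(example_reads)
```

Reads that share an alignment position but carry different UMIs remain separate families, which is what allows PCR duplicates to be distinguished from independent molecules.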
[0331] In some embodiments, where both a liquid biopsy sample and a normal tissue sample are analyzed, this process produces a liquid biopsy BAM file (e.g., Liquid BAM 124-1-i-cf) and a normal BAM file (e.g., Germline BAM 124-1-i-g), as illustrated in Figure 4A. In various embodiments, BAM files may be analyzed to detect genetic variants and other genetic features, including single nucleotide variants (SNVs), copy number variants (CNVs), gene rearrangements, etc.
[0332] In some embodiments, the sequencing data is normalized, e.g., to account for pull-down, amplification, and/or sequencing bias (e.g., mappability, GC bias, etc.). See, for example, Schwartz et al., PLoS ONE 6(1):e16685 (2011) and Benjamini and Speed, Nucleic Acids Research 40(10):e72 (2012), the contents of which are hereby incorporated by reference, in their entireties, for all purposes.
[0333] In some embodiments, SAM files generated after alignment are converted to BAM files 124. Thus, after preprocessing sequencing data generated for a pooled sequencing reaction, BAM files are generated for each of the sequencing libraries present in the master sequencing pools. For example, as illustrated in Figure 4A, separate BAM files are generated for each of three samples acquired from subject 1 at time i (e.g., tumor BAM 124-1-i-t corresponding to alignments of sequence reads of nucleic acids isolated from a solid tumor sample from subject 1, Liquid BAM 124-1-i-cf corresponding to alignments of sequence reads of nucleic acids isolated from a liquid biopsy sample from subject 1, and Germline BAM 124-1-i-g corresponding to alignments of sequence reads of nucleic acids isolated from a normal tissue sample from subject 1), and one or more samples acquired from one or more additional subjects at time j (e.g., Tumor BAM 124-2-j-t corresponding to alignments of sequence reads of nucleic acids isolated from a solid tumor sample from subject 2). In some embodiments, BAM files are sorted, and duplicate reads are marked for deletion, resulting in de-duplicated BAM files. For example, tools like SamBAMBA mark and filter duplicate alignments in the sorted BAM files.
[0334] Many of the embodiments described below, in conjunction with Figure 4 (e.g., Figure 4A-E, 4F1-3, and/or 4G1-3), relate to analyses performed using sequencing data from cfDNA of a cancer patient, e.g., obtained from a liquid biopsy sample of the patient. Generally, these embodiments are independent and, thus, not reliant upon any particular sequencing data generation methods, e.g., sample preparation, sequencing, and/or data pre-processing methodologies. However, in some embodiments, the methods described below include one or more features 204 of generating sequencing data, as illustrated in Figures 2A and 3.
[0335] Alignment files prepared as described above (e.g., BAM files 124) are then passed to a feature extraction module 145, where the sequences are analyzed (324) to identify genomic alterations (e.g., SNVs/MNVs, indels, genomic rearrangements, copy number variations, etc.) and/or determine various characteristics of the patient’s cancer (e.g., MSI status, TMB, tumor ploidy, HRD status, tumor fraction, tumor purity, methylation patterns, etc.). Many software packages for identifying genomic alterations are known in the art, for example, freebayes, PolyBayes, SAMtools, GATK, pindel, Breakdancer, Cortex, Crest, Delly, Gridss, Hydra, Lumpy, Manta, and Socrates. For a review of many of these variant calling packages see, for example, Cameron, D.L. et al., Nat. Commun., 10(3240):1-11 (2019), the content of which is hereby incorporated by reference, in its entirety, for all purposes. Generally, these software packages identify variants in sorted SAM or BAM files 124, relative to one or more reference sequence constructs 158. The software packages then output a file, e.g., a raw VCF (variant call format) file, listing the variants (e.g., genomic features 131) called and identifying their location relative to the reference sequence construct (e.g., where the sequence of the sample nucleic acids differs from the corresponding sequence in the reference construct). In some embodiments, system 100 digests the contents of the native output file to populate feature data 125 in test patient data store 120. In other embodiments, the native output file serves as the record of these genomic features 131 in test patient data store 120.
[0336] Generally, the systems described herein can employ any combination of available variant calling software packages and internally developed variant identification algorithms. In some embodiments, the output of a particular algorithm of a variant calling software is further evaluated, e.g., to improve variant identification. Accordingly, in some embodiments, system 100 employs an available variant calling software package to perform some or all of the functionality of one or more of the algorithms shown in feature extraction module 145.
[0337] In some embodiments, as illustrated in Figure 1A, separate algorithms (or the same algorithm implemented using different parameters) are applied to identify variants unique to the cancer genome of the patient and variants existing in the germline of the subject. In other embodiments, variants are identified indiscriminately and later classified as either germline or somatic, e.g., based on sequencing data, population data, or a combination thereof. In some embodiments, variants are classified as germline variants, and/or non-actionable variants, when they are represented in the population above a threshold level, e.g., as determined using a population database such as ExAC or gnomAD. For instance, in some embodiments, variants that are represented in at least 1% of the alleles in a population are annotated as germline and/or non-actionable. In other embodiments, variants that are represented in at least 2%, at least 3%, at least 4%, at least 5%, at least 7.5%, at least 10%, or more of the alleles in a population are annotated as germline and/or non-actionable. In some embodiments, sequencing data from a matched sample from the patient, e.g., a normal tissue sample, is used to annotate variants identified in a cancerous sample from the subject. That is, variants that are present in both the cancerous sample and the normal sample represent those variants that were in the germline prior to the patient developing cancer and can be annotated as germline variants.
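The population-frequency and matched-normal rules described above can be combined in a short sketch. The variant tuples, allele frequencies, label strings, and function name are hypothetical placeholders; a real implementation would look frequencies up in a population database such as ExAC or gnomAD rather than in an in-memory dictionary.

```python
from typing import Dict, FrozenSet, Iterable, Tuple

Variant = Tuple[str, int, str, str]  # (chromosome, position, ref, alt)


def annotate_origin(variants: Iterable[Variant],
                    population_af: Dict[Variant, float],
                    normal_variants: FrozenSet[Variant] = frozenset(),
                    max_population_af: float = 0.01) -> Dict[Variant, str]:
    """Label each variant using matched-normal evidence, then population data."""
    labels: Dict[Variant, str] = {}
    for variant in variants:
        if variant in normal_variants:
            # Also seen in the patient's normal tissue sample.
            labels[variant] = "germline"
        elif population_af.get(variant, 0.0) >= max_population_af:
            # Common in the population (at least 1% of alleles by default).
            labels[variant] = "germline/non-actionable"
        else:
            labels[variant] = "candidate somatic"
    return labels


# Hypothetical variants and population allele frequencies.
v_common = ("chr1", 1000, "A", "G")
v_normal = ("chr2", 2000, "C", "T")
v_rare = ("chr3", 3000, "G", "A")
labels = annotate_origin(
    [v_common, v_normal, v_rare],
    population_af={v_common: 0.12, v_rare: 0.0001},
    normal_variants=frozenset({v_normal}),
)
```

Raising `max_population_af` to one of the higher thresholds listed above (for example, 0.05) makes the population filter less aggressive.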
[0338] In various aspects, the detected genetic variants and genetic features are analyzed as a form of quality control. For example, a pattern of detected genetic variants or features may indicate an issue related to the sample, sequencing procedure, and/or bioinformatics pipeline (e.g., contamination of the sample, mislabeling of the sample, a change in reagents, a change in the sequencing procedure and/or bioinformatics pipeline, etc.).
[0339] Figure 4E illustrates an example workflow for genomic feature identification (324). This particular workflow is only an example of one possible collection and arrangement of algorithms for feature extraction from sequencing data 124. Generally, any combination of the modules and algorithms of feature extraction module 145, e.g., illustrated in Figure 1A, can be used for a bioinformatics pipeline, and particularly for a bioinformatics pipeline for analyzing liquid biopsy samples. For instance, in some embodiments, an architecture useful for the methods and systems described herein includes at least one of the modules or variant calling algorithms shown in feature extraction module 145. In some embodiments, an architecture includes at least 2, 3, 4, 5, 6, 7, 8, 9, 10, or more of the modules or variant calling algorithms shown in feature extraction module 145. Further, in some embodiments, feature extraction modules and/or algorithms not illustrated in Figure 1A find use in the methods and systems described herein.
Variant Identification
[0340] In some embodiments, variant analysis of aligned sequence reads, e.g., in SAM or BAM format, includes identification of single nucleotide variants (SNVs), multiple nucleotide variants (MNVs), indels (e.g., nucleotide additions and deletions), and/or genomic rearrangements (e.g., inversions, translocations, and gene fusions) using variant identification module 146, e.g., which includes a SNV/MNV calling algorithm (e.g., SNV/MNV calling algorithm 147), an indel calling algorithm (e.g., indel calling algorithm 148), and/or one or more genomic rearrangement calling algorithms (e.g., genomic rearrangement calling algorithm 149). An overview of an example method for variant identification is shown in Figure 4E. Essentially, the module first identifies a difference between the sequence of an aligned sequence read 124 and the reference sequence to which the sequence read is aligned (e.g., an SNV/MNV, an indel, or a genomic rearrangement) and makes a record of the variant, e.g., in a variant call format (VCF) file. For instance, software packages such as freebayes and pindel are used to call variants using sorted BAM files and reference BED files as the input. For a review of variant calling packages see, for example, Cameron, D.L. et al., Nat. Commun., 10(3240):1-11 (2019). A raw VCF (variant call format) file is output, showing the locations where the nucleotide base in the sample is not the same as the nucleotide base in that position in the reference sequence construct.
[0341] In some embodiments, as illustrated in Figure 4E, raw VCF data is then normalized, e.g., by parsimony and left alignment. For example, software packages such as vcfbreakmulti and vt are used to normalize multi-nucleotide polymorphic variants in the raw VCF file and a variant normalized VCF file is output. See, for example, E. Garrison, “Vcflib: A C++ library for parsing and manipulating VCF files,” GitHub, available on the internet at github.com/ekg/vcflib (2012), the content of which is hereby incorporated by reference, in its entirety, for all purposes. In some embodiments, a normalization algorithm is included within the architecture of a broader variant identification software package.
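Normalization by parsimony and left alignment can be sketched for a single biallelic variant as follows. The sketch follows the generally published trim-and-extend procedure for variant normalization and is not the source code of vt or vcflib; the function name and the toy reference sequence are hypothetical, and positions are 1-based as in VCF.

```python
from typing import Tuple


def normalize_variant(pos: int, ref: str, alt: str,
                      genome: str) -> Tuple[int, str, str]:
    """Left-align and trim a biallelic variant; pos is 1-based."""
    if ref == alt:
        raise ValueError("ref and alt alleles are identical")
    changed = True
    while changed:
        changed = False
        # Parsimony on the right: drop a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both alleles one base leftward.
        if not ref or not alt:
            if pos == 1:
                raise ValueError("cannot extend left of the first base")
            pos -= 1
            left_base = genome[pos - 1]
            ref, alt = left_base + ref, left_base + alt
            changed = True
    # Parsimony on the left: drop shared leading bases, keeping one base each.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


toy_genome = "GGGCACACACAGGG"  # contains a CA repeat at 1-based positions 4-11
deletion = normalize_variant(8, "CAC", "C", toy_genome)
padded_snv = normalize_variant(5, "ACA", "AGA", toy_genome)
```

In the first example, a two-base deletion reported in the middle of the repeat is shifted to its leftmost equivalent representation; in the second, redundant flanking bases around a single-nucleotide change are removed.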
[0342] An algorithm is then used to annotate the variants in the (e.g., normalized) VCF file, e.g., to determine the source of the variation, e.g., whether the variant is from the germline of the subject (e.g., a germline variant), a cancerous tissue (e.g., a somatic variant), a sequencing error, or of an undeterminable source. In some embodiments, an annotation algorithm is included within the architecture of a broader variant identification software package. However, in some embodiments, an external annotation algorithm is applied to (e.g., normalized) VCF data obtained from a conventional variant identification software package. The choice to use a particular annotation algorithm is well within the purview of the skilled artisan, and in some embodiments is based upon the data being annotated.
[0343] For example, in some embodiments, where both a liquid biopsy sample and a normal tissue sample of the patient are analyzed, variants identified in the normal tissue sample inform annotation of the variants in the liquid biopsy sample. In some embodiments, where a particular variant is identified in the normal tissue sample, that variant is annotated as a germline variant in the liquid biopsy sample. Similarly, in some embodiments, where a particular variant identified in the liquid biopsy sample is not identified in the normal tissue sample, the variant is annotated as a somatic variant when the variant otherwise satisfies any additional criteria placed on somatic variant calling, e.g., a threshold variant allele fraction (VAF) in the sample.
[0344] By contrast, in some embodiments, where only a liquid biopsy sample is being analyzed, the annotation algorithm relies on other characteristics of the variant in order to annotate the origin of the variant. For instance, in some embodiments, the annotation algorithm evaluates the VAF of the variant in the sample, e.g., alone or in combination with additional characteristics of the sample, e.g., tumor fraction. Accordingly, in some embodiments, where the VAF is within a first range encompassing a value that corresponds to a 1:1 distribution of variant and reference alleles in the sample, the algorithm annotates the variant as a germline variant, because it is presumably represented in cfDNA originating from both normal and cancer tissues. Similarly, in some embodiments, where the VAF is below a baseline variant threshold, the algorithm annotates the variant as undeterminable, because there is not sufficient evidence to distinguish between the possibility that the variant arose as a result of an amplification or sequencing error and the possibility that the variant originated from a cancerous tissue. Similarly, in some embodiments, where the VAF falls between the first range and the baseline variant threshold, the algorithm annotates the variant as a somatic variant.
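By way of a non-limiting illustration, the VAF-based annotation described above for a liquid-biopsy-only analysis can be sketched as follows. The function name, the germline window of 40% to 60% VAF, and the default baseline threshold of 0.25% VAF are assumptions chosen for illustration only and are not taken from the disclosure; the disclosure does not state how a VAF above the first range is annotated, so the sketch returns a separate label for that case.

```python
def annotate_by_vaf(vaf, baseline_threshold=0.0025, germline_range=(0.40, 0.60)):
    """Annotate the origin of a variant from its VAF alone.

    All values are fractions (0.0025 corresponds to 0.25% VAF).
    germline_range is a hypothetical window around the 1:1 allele ratio.
    """
    low, high = germline_range
    if low <= vaf <= high:
        # Consistent with a heterozygous germline variant present in
        # cfDNA from both normal and cancerous tissue.
        return "germline"
    if vaf < baseline_threshold:
        # Too little support to separate a true variant from an
        # amplification or sequencing error.
        return "undeterminable"
    if vaf < low:
        # Between the baseline threshold and the germline window.
        return "somatic"
    # Above the germline window: not addressed in the text above.
    return "unclassified"


for example_vaf in (0.5, 0.001, 0.05):
    print(example_vaf, annotate_by_vaf(example_vaf))
```

In practice the baseline threshold would vary by region, variant type, and sequencing depth, as described in the paragraphs that follow.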
[0345] In some embodiments, the baseline variant threshold is a value from 0.01% VAF to 0.5% VAF. In some embodiments, the baseline variant threshold is a value from 0.05% VAF to 0.35% VAF. In some embodiments, the baseline variant threshold is a value from 0.1% VAF to 0.25% VAF. In some embodiments, the baseline variant threshold is about 0.01% VAF, 0.015% VAF, 0.02% VAF, 0.025% VAF, 0.03% VAF, 0.035% VAF, 0.04% VAF, 0.045% VAF, 0.05% VAF, 0.06% VAF, 0.07% VAF, 0.075% VAF, 0.08% VAF, 0.09% VAF, 0.1% VAF, 0.15% VAF, 0.2% VAF, 0.25% VAF, 0.3% VAF, 0.35% VAF,
0.4% VAF, 0.45% VAF, 0.5% VAF, or greater. In some embodiments, the baseline variant threshold is different for variants located in a first region, e.g., a region identified as a mutational hotspot and/or having high genomic complexity, than for variants located in a second region, e.g., a region that is not identified as a mutational hotspot and/or having average genomic complexity. For example, in some embodiments, the baseline variant threshold is a value from 0.01% to 0.25% for variants located in the first region and is a value from 0.1% to 0.5% for variants located in the second region.
[0346] In some embodiments, the first region is a region of interest in the genome that may have been manually selected based on criteria (for example, selection may be based on a known likelihood that a region is associated with variants) and the second region is a region that did not meet the selection criteria. In some embodiments, the baseline variant threshold is a value from 0.01% to 0.5% for variants located in the first region and is a value from 1% to 5% for variants located in the second region. In some embodiments, the first region is a region of interest in the genome that may have been manually selected based on criteria (for example, selection may be based on a known likelihood that a region is associated with variants) and the second region is a region selected based on a second set of criteria.
[0347] In some embodiments, a baseline variant threshold is influenced by the sequencing depth of the reaction, e.g., a locus-specific sequencing depth and/or an average sequencing depth (e.g., across a targeted panel and/or complete reference sequence construct). In some embodiments, the baseline variant threshold is dependent upon the type of variant being detected. For example, in some embodiments, different baseline variant thresholds are set for SNPs/MNVs than for indels and/or genomic rearrangements. For instance, while an apparent SNP may be introduced by amplification and/or sequencing errors, it is much less likely that a genomic rearrangement is introduced this way and, thus, a lower baseline variant threshold may be appropriate for genomic rearrangements than for SNPs/MNVs.
[0348] In some embodiments, one or more additional criteria are required to be satisfied before a variant can be annotated as a somatic variant. For instance, in some embodiments, a threshold number of unique sequence reads encompassing the variant must be present to annotate the variant as somatic. In some embodiments, the threshold number of unique sequence reads is 2, 3, 4, 5, 7, 10, 12, 15, or greater. In some embodiments, the threshold number of unique sequence reads is only applied when certain conditions are met, e.g., when the variant allele is located in a region of a certain genomic complexity. In some embodiments, the certain genomic complexity is a low genomic complexity. In some embodiments, the certain genomic complexity is an average genomic complexity. In some embodiments, the certain genomic complexity is a high genomic complexity.
[0349] In some embodiments, a threshold sequencing coverage, e.g., a locus-specific and/or an average sequencing depth (e.g., across a targeted panel and/or complete reference sequence construct) must be satisfied to annotate the variant as somatic. In some embodiments, the threshold sequencing coverage is 50X, 100X, 150X, 200X, 250X, 300X,
350X, 400X or greater. In some embodiments, the variant is located in a microsatellite instable (MSI) region. In some embodiments, the variant is not located in a microsatellite instable (MSI) region. In some embodiments, the variant has sufficient signal-to-noise ratio.
[0350] In some embodiments, bases contributing to the variant satisfy a threshold mapping quality to annotate the variant as somatic. In some embodiments, alignments contributing to the variant must satisfy a threshold alignment quality to annotate the variant as somatic. In some embodiments, a threshold value is determined for a variant detected in a somatic (cancer) sample by analyzing the threshold metric (for example, the baseline variant threshold is determined by analyzing VAF, or the threshold sequencing coverage is determined by analyzing coverage) associated with that variant in a group of germline (normal) samples that were each processed by the same sample processing and sequencing protocol as the somatic sample (process-matched). This may be used to ensure the variants are not caused by observed artifact generating processes.
[0351] In some embodiments, the threshold value is set above the median base fraction of the threshold metric value associated with the variant in more than a specified percentage of process-matched germline samples, e.g., 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more standard deviations above the median base fraction of the threshold metric value associated with 25%, 30%, 40%, 50%, 60%, 70%, 75%, or more of the process-matched germline samples. For example, in one embodiment, the threshold value is set to a value 5 standard deviations above the median base fraction of the threshold metric value associated with the variant in more than 50% of the process-matched germline samples.
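As a simplified, non-limiting sketch of the calculation described above, a threshold can be placed a chosen number of standard deviations above the median of a metric observed in process-matched germline samples. The function name and the example values are hypothetical, and the sketch deliberately ignores the qualifier regarding the percentage of process-matched samples, which would require a more specific definition than the text provides.

```python
import statistics


def process_matched_threshold(normal_values, n_sd=5.0):
    """Median of the metric in process-matched normals plus n_sd standard deviations.

    normal_values: metric values (e.g., base fractions at one locus) observed
    in germline samples processed with the same protocol as the test sample.
    """
    median = statistics.median(normal_values)
    # Population standard deviation; a sample standard deviation
    # (statistics.stdev) would be an equally defensible assumption.
    sd = statistics.pstdev(normal_values)
    return median + n_sd * sd


# Hypothetical base fractions for one variant in three process-matched normals.
threshold = process_matched_threshold([0.001, 0.002, 0.003])
print(threshold)
```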
[0352] In some embodiments, variants around homopolymer and multimer regions known to generate artifacts may be specifically filtered to avoid such artifacts. For example, in some embodiments, strand specific filtering is performed in the direction of the read in order to minimize stranded artifacts. Similarly, in some embodiments, variants that do not exceed the stranded minimum deviation for their specific locus within a known artifact generating region may be filtered to avoid artifacts.
[0353] Variants may be filtered using dynamic methods, such as through the application of Bayes’ Theorem through a likelihood ratio test. In some such embodiments, the threshold is dynamically calibrated to account for variants with low support (e.g., due to low tumor fraction, low circulating tumor fraction, and/or low sequencing depths). The dynamic threshold may be based on, for example, factors such as sample specific error rate, the error
rate from a healthy reference pool (e.g., a pool of process matched healthy control samples for validation of variants detected in tumor samples), and information from internal human solid tumors (e.g., for validation of variants detected in liquid biopsy samples). Accordingly, in some embodiments, the dynamic filtering method employs a tri-nucleotide context-based Bayesian model. That is, in some embodiments, the threshold for filtering any particular putative variant is dynamically calibrated using a context-based Bayesian model that considers one or more of a sample-specific sequencing error rate, a process-matched control sequencing error rate, and/or a variant-specific frequency (e.g., determined from similar cancers). In this fashion, a minimum number of alternative alleles required to positively identify a true variant is determined for individual alleles and/or loci.
[0354] In some embodiments, the dynamic threshold is selected from a Bayesian probability model, where the selection is based on one or more error rates and/or information from one or more baseline variant distributions. For example, in some embodiments, the dynamic threshold is selected based on a variant detection specificity that is calculated using a distribution of variant detection sensitivities, where the distribution of variant detection sensitivities is a function of circulating variant allele fraction from a plurality of baseline and/or reference alleles (e.g., from a cohort of subjects). Filtration of variants using a dynamic threshold (e.g., to validate the presence of a somatic variant) is performed by comparing the number of unique sequence reads encompassing the variant (e.g., a variant allele fragment count for the variant) against the dynamic threshold.
[0355] As described herein, in some embodiments, the methods described herein (e.g., methods 400-2, 450, and 500-2 as illustrated in Figures 4 and 5) include one or more data collection steps, in addition to data analysis and downstream steps. For example, as described herein, e.g., with reference to Figures 2 and 3, in some embodiments, the methods include collection of a liquid biopsy sample and, optionally, one or more matching biological samples from the subject (e.g., a matched cancerous and/or matched non-cancerous sample from the subject). Likewise, as described herein, e.g., with reference to Figures 2 and 3, in some embodiments, the methods include extraction of cfDNA from the liquid biopsy sample and, optionally, one or more matching biological samples from the subject (e.g., a matched cancerous and/or matched non-cancerous sample from the subject). Similarly, as described herein, e.g., with reference to Figures 2 and 3, in some embodiments, the methods include nucleic acid sequencing of cfDNA from the liquid biopsy sample and, optionally, one or
more matching biological samples from the subject (e.g., a matched cancerous and/or matched non-cancerous sample from the subject).
[0356] However, in other embodiments, the methods described herein begin with obtaining nucleic acid sequencing results, e.g., raw or collapsed sequence reads of cfDNA from a liquid biopsy sample and, optionally, one or more matching biological samples from the subject (e.g., a matched cancerous and/or matched non-cancerous sample from the subject), from which the statistics needed for somatic variant identification (e.g., variant allele count 133-ac and/or variant allele fraction 133-af) can be determined. For example, in some embodiments, sequencing data 122 for a patient 121 is accessed and/or downloaded over network 105 by system 100.
[0357] Similarly, in some embodiments, the methods described herein begin with obtaining the genomic features needed for somatic variant identification (e.g., variant allele count 133-ac and/or variant allele fraction 133-af) for a sequencing of a liquid biopsy sample and, optionally, one or more matching biological samples from the subject (e.g., a matched cancerous and/or matched non-cancerous sample from the subject). For example, in some embodiments, variant allele counts 133-cf-ac and/or variant allele fractions 133-cf-af for sequencing data 122 of patient 121 are accessed and/or downloaded over network 105 by system 100.
[0358] One goal of the liquid biopsy assays described herein is to detect variant alterations at low circulating fractions, which requires that low levels of support be sufficient to call a variant. Therefore, consistent thresholds to filter variants that do not take into account variant context and local sequence specific error cannot be used.
[0359] In some embodiments, a dynamic variant filtering method is applied which uses an application of Bayes' Theorem through the likelihood ratio test. The dynamic threshold is based on the sample-specific error rate, the error rate from a healthy reference pool, and information from internal human solid tumors. The basic application of the likelihood ratio test is as follows:
post-test-odds = pre-test-odds * sensitivity / (1 - specificity)
[0360] Given a fixed value for post-test-odds, the specificity can be solved for. The specificity represents the minimum acceptable quantile of an error distribution (e.g., a BetaBinomial, Beta, or Poisson error distribution). The above equation can be rearranged as follows:
specificity = 1 - pre-test-odds * sensitivity / post-test-odds
[0361] Specificity can then be plugged into the quantile error (e.g., BetaBinomial, Beta, or Poisson) function to derive the minimum number of alternative alleles that can be observed at a given depth to validate a candidate somatic variant.
[0362] In some embodiments, the post-test odds are post-test probability / (1 - post-test probability). The post-test probability is the probability of having a positive variant given Bayes' Theorem. The post-test-odds is predefined.
[0363] In some embodiments, the pre-test odds are pre-test probability / (1 - pre-test probability). The pre-test probability is the probability of having a positive variant given the patient's cancer-type and the prevalence of variant alterations within a genomic region encompassing a candidate somatic sequence variant in a reference population having the same cancer type.
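For illustration only, the conversion from probabilities to odds and the rearranged likelihood ratio relationship described above can be sketched as follows. The function names and the numeric inputs (a 1% pre-test probability, a 99% post-test probability, and a sensitivity of 0.9) are hypothetical values chosen to exercise the arithmetic.

```python
def odds_from_probability(p):
    """Convert a probability p (0 <= p < 1) to odds p / (1 - p)."""
    return p / (1.0 - p)


def required_specificity(pre_test_odds, sensitivity, post_test_odds):
    """Solve post = pre * sensitivity / (1 - specificity) for specificity."""
    return 1.0 - pre_test_odds * sensitivity / post_test_odds


# Hypothetical inputs, not values from the disclosure.
pre = odds_from_probability(0.01)    # pre-test probability of 1%
post = odds_from_probability(0.99)   # predefined post-test probability of 99%
spec = required_specificity(pre, sensitivity=0.9, post_test_odds=post)
print(spec)

# Substituting back into the first form of the equation recovers post-test odds.
print(pre * 0.9 / (1.0 - spec), post)
```

A lower pre-test probability drives the required specificity closer to 1, which in turn demands more supporting reads before a candidate variant is accepted.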
[0364] In some embodiments, a pre-test-odds multiplier is applied to the pre-test odds for a resistance mutation that would develop and/or become more prominent within a heterogeneous population of cancer cells in response to therapeutic treatment. The multiplier is applied to specific genomic regions (e.g., exon windows) containing the resistance mutation position. In some embodiments, the multiplier is only applied in specified cancer contexts. For example, in some embodiments, a multiplier is applied to a pre-test odds for a genomic region containing a mutation that is resistant to at least one cancer therapy used to treat the type of cancer the subject has. For example, if a given mutation is known to have resistance to a therapy used to treat breast cancer, but not to any of the therapies used to treat brain cancer, a multiplier will be applied to the pre-test odds for the genomic region encompassing the mutation if the subject has breast cancer, but not if the subject has brain cancer.
[0365] In some embodiments, sensitivity is the fraction of variants detected by the liquid biopsy assay at a given variant allele fraction (e.g., 0.1%, 0.25%, 0.5%, etc.).
[0366] Calculating the pre-test probability. In some embodiments, the pre-test probability is calculated using historical data for a set of reference subjects having the same type of cancer, e.g., from sequencing of solid tumor samples. In this fashion, it is possible to accurately assess the prevalence of specific variants within the population of advanced human tumors. In some embodiments, the set of reference subjects is at least 10 reference subjects.
In some embodiments, the set of reference subjects is at least 50 reference subjects. In some embodiments, the set of reference subjects is at least 100 reference subjects. In some
embodiments, the set of reference subjects is at least 500 reference subjects. In some embodiments, the set of reference subjects is at least 1000 reference subjects. In some embodiments, the set of reference subjects is at least 5000 reference subjects. In some embodiments, the set of reference subjects is at least 10000 reference subjects.
[0367] In some embodiments, variant prevalence is calculated by indexing genomic regions (e.g., exons) in the reference sample and counting the number of variants in each genomic region (e.g., exon) for each cancer-type. The number of patients who have at least one variant in the genomic region (e.g., the exon) / the number of patients equals the variant prevalence. The pre-test-odds are calculated from the prevalence by pre-test-odds = prevalence / (1 - prevalence).
[0368] In some embodiments, for a cancer type where the number of patients in the reference set is too low to calculate prevalence, a default pan-cancer cancer type is used. Where no prevalence can be calculated, the mean variant prevalence across cancer types is used.
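The prevalence calculation described above can be sketched, in a non-limiting way, as follows. The record layout, the region identifiers, and the function names are hypothetical. The sketch returns None when no same-cancer reference subjects exist, leaving the pan-cancer or mean-prevalence fallback to the caller, and it does not guard against a prevalence of exactly 1.

```python
def variant_prevalence(reference_subjects, cancer_type, region):
    """Fraction of same-cancer reference subjects with >= 1 variant in region."""
    cohort = [s for s in reference_subjects if s["cancer_type"] == cancer_type]
    if not cohort:
        return None  # caller falls back to a pan-cancer or mean prevalence
    carriers = sum(1 for s in cohort if region in s["variant_regions"])
    return carriers / len(cohort)


def pre_test_odds(prevalence):
    """pre-test-odds = prevalence / (1 - prevalence)."""
    return prevalence / (1.0 - prevalence)


# Hypothetical reference data: each subject lists the indexed regions
# (e.g., exons) in which at least one variant was observed.
reference_subjects = [
    {"cancer_type": "breast", "variant_regions": {"GENE_A:exon_2"}},
    {"cancer_type": "breast", "variant_regions": set()},
    {"cancer_type": "breast", "variant_regions": {"GENE_B:exon_5"}},
    {"cancer_type": "breast", "variant_regions": set()},
    {"cancer_type": "lung", "variant_regions": {"GENE_A:exon_2"}},
]

prevalence = variant_prevalence(reference_subjects, "breast", "GENE_A:exon_2")
print(prevalence, pre_test_odds(prevalence))
```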
[0369] In some embodiments, pre-test-odds are not calculated each time an input sample is run. Rather, in some embodiments, they are read from a pre-existing file, which is evaluated and regenerated if deemed necessary.
[0370] Calculating the pre-test-odds multiplier. Resistance mutations have historically low prevalence and variant allele fraction and may incorrectly be filtered by the dynamic variant filtering method due to low pre-test-odds. The resistance mutations develop in response to therapeutic treatment, and detecting resistance mutations early provides insights into the current treatment strategy. Low variant allele frequency, low prevalence resistance mutations in historic solid tumor samples have been identified. The high sensitivity of the liquid biopsy assay described herein permits the early detection of these resistance mutations in circulating DNA. Examples of such resistance mutations include PIK3CA p.E545K in breast cancer, EGFR p.T790M in non-small cell lung cancer, and AR p.H875Y for prostate cancer.
[0371] In some embodiments, to estimate the pre-test-odds-multiplier required to pass resistance mutations down to low variant allele fractions (e.g., 0.1% or 0.25% VAF), the average depth at each variant position is taken from the reference pool (e.g., the reference pool used to determine the pre-test odds), at a high minimum average depth (e.g., 2500X). For each resistance mutation, the number of alternate alleles required to achieve a 0.1% or 0.25% VAF is calculated. The total alternate alleles and depth for each resistance mutation are input to the Dynamic Variant Filtering method, and multipliers are applied until those resistance mutations pass the filtering strategy.
[0372] In some embodiments, the minimum multiplier required to pass resistance mutations is determined when the input sample alternate allele count is greater than the background alternate allele count (as outlined in Calculating Testing Sample Alt Allele Count and Calculating Background Alt Allele Count below). In some embodiments, the multiplier is selected based on the multiplier required to pass the variant at a low variant allele fraction (e.g., 0.1% VAF or 0.25% VAF). In some embodiments, a maximum value for the multiplier is applied, in order to prevent excessive artifacts from passing the filter. Large multipliers may permit false positive variants to pass the Dynamic Variant Filtering method; however, large multipliers are necessary to pass resistance mutations that have historically low prevalence. In some embodiments, the maximum multiplier is between 750 and 1500. In some embodiments, the maximum multiplier is between 900 and 1100. In some embodiments, the maximum multiplier is between 1000 and 1050.
[0373] In some embodiments, the usage of the pre-test-odds-multiplier is limited by cancer-type context and genomic region (e.g., exon-window). In some embodiments, therefore, the multipliers will not be applied to all genomic regions (e.g., exon-windows) given a specified cancer-type, nor all cancer-types given a specific genomic region (e.g., exon-window).
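As a non-limiting sketch, restricting the multiplier to specific combinations of cancer type and genomic region, and capping it at a maximum value, could look like the following. The table contents, region labels, multiplier values, and the cap of 1000 are hypothetical and serve only to illustrate the lookup.

```python
MAX_MULTIPLIER = 1000.0  # hypothetical cap, within the 750 to 1500 range above

# Hypothetical table keyed by (cancer type, exon window). Only listed
# combinations receive a multiplier; all others are left unchanged.
RESISTANCE_MULTIPLIERS = {
    ("nsclc", "EGFR_exon20"): 500.0,
    ("breast", "HYPOTHETICAL_window"): 5000.0,  # exceeds the cap on purpose
}


def adjusted_pre_test_odds(pre_test_odds, cancer_type, region,
                           table=RESISTANCE_MULTIPLIERS, cap=MAX_MULTIPLIER):
    """Apply a capped, context-specific multiplier to the pre-test odds."""
    multiplier = table.get((cancer_type, region), 1.0)
    return pre_test_odds * min(multiplier, cap)


print(adjusted_pre_test_odds(0.001, "nsclc", "EGFR_exon20"))   # multiplier applied
print(adjusted_pre_test_odds(0.001, "brain", "EGFR_exon20"))   # no multiplier
```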
[0374] Calculating testing sample variant allele count. In some embodiments, the filtering method (the statistical method used for the Dynamic Variant Filtering method) is selected from a beta-binomial distribution model, a beta distribution model, and a Poisson distribution model. In some embodiments, the model is a beta-binomial model. In some embodiments, when applying a quantile beta-binomial distribution, the sum of the input sample alternate reads is divided by the input sample sequencing depth at each variant position, and then multiplied by the reference pool depth (the sequencing depth at genomic positions for a pool of reference, e.g., healthy normal, controls).
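By way of illustration, the depth normalization described above, which rescales the input sample alternate read count to the reference pool depth so that it can be compared against a background count at that depth, can be sketched as follows. The function name and the example numbers are hypothetical.

```python
def normalized_alt_count(alt_reads, sample_depth, pool_depth):
    """Rescale the sample's alternate read count to the reference pool depth."""
    if sample_depth <= 0:
        raise ValueError("sample_depth must be positive")
    return alt_reads / sample_depth * pool_depth


# Hypothetical example: 5 alternate reads at 5000X in the input sample,
# rescaled to a reference pool depth of 2500X.
print(normalized_alt_count(alt_reads=5, sample_depth=5000, pool_depth=2500))
```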
[0375] Calculating background variant allele count. In some embodiments, the background variant allele count calculation takes into account the background error from a pool of reference subjects (e.g., healthy normal subjects), the input sample error, and the prevalence of historical variants in the reference cancer subjects. The quantile beta-binomial model considers (i) the reference pool depth (the sequencing depth at genomic positions for a pool of reference, e.g., healthy normal, controls), (ii) the background posterior error average from the input sample, and (iii) an alpha calculated from the pre-test-odds, sensitivity, and post-test-odds (e.g., where alpha is equal to 1 - specificity = pre-test-odds * sensitivity / post-test-odds). The pre-test-odds calculated for a specific genomic region (e.g., exon window) and cancer-type will yield a unique alpha for each variant, given that the variants do not fall in the same genomic region (e.g., exon window).
[0376] In some embodiments, the background posterior error incorporates a trinucleotide error average (e.g., a reaction-specific sequencing error rate), the reference pool error (e.g., a locus-specific, process-matched sequencing error rate; e.g., a sum of alternate reads for each position / depth from a pool of healthy normal controls), and a shrinkage weight parameter.
In some embodiments, the trinucleotide error average is an aggregate of the input sample background average, where the input sample background average equals the error counts for each position divided by the position-specific sequencing depth. In some embodiments, the sample background average is then aggregated for each trinucleotide context. The trinucleotide error average is used to calculate the shrinkage weight parameter. In some embodiments, the shrinkage weight parameter equals the trinucleotide error average divided by the sum of the trinucleotide error average and the reference pool error. In instances when the shrinkage weight parameter is undefined, it is set to 1. In some embodiments, the background posterior error is calculated as:
background posterior error = shrinkage weight parameter * trinucleotide error average + (1 - shrinkage weight parameter) * healthy subject error (i.e., the reference pool error).
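The shrinkage calculation described above can be sketched, for illustration only, as follows. The function names are hypothetical, and the sketch assumes that the healthy subject error in the formula is the same quantity as the reference pool error, and that the weight is undefined only when both error terms are zero.

```python
def shrinkage_weight(trinucleotide_error, pool_error):
    """Trinucleotide error divided by the sum of both error terms."""
    denominator = trinucleotide_error + pool_error
    if denominator == 0:
        return 1.0  # undefined weight (0 / 0) is set to 1
    return trinucleotide_error / denominator


def background_posterior_error(trinucleotide_error, pool_error):
    """Weighted blend of sample-specific and reference pool error rates."""
    w = shrinkage_weight(trinucleotide_error, pool_error)
    return w * trinucleotide_error + (1.0 - w) * pool_error


# Hypothetical error rates: 0.1% in the sample's trinucleotide context,
# 0.3% at this locus in the healthy reference pool.
print(shrinkage_weight(0.001, 0.003))
print(background_posterior_error(0.001, 0.003))
```

Under this weighting, whichever error term is larger receives the smaller weight, so the posterior error is pulled toward the lower of the two rates.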
[0377] In some embodiments, a reference pool error can be used in place of an input sample background average, for calculating the background posterior average error rate.
[0378] In some embodiments, the alpha for the beta-binomial distribution is calculated using the pre-test-odds, sensitivity, and post-test-odds, where:
alpha = 1 - specificity = pre-test-odds * sensitivity / post-test-odds
[0379] Accordingly, in some embodiments, the background posterior average, the reference pool depth, and the alpha are used in calculating the input to the quantile beta-binomial function. The alpha is used in calculating the mean value of the beta-binomial distribution, which equals 1 - alpha / 2. The size of the quantile beta-binomial is the matrix of the reference pool depth. The shape 1 parameter for the quantile beta-binomial function is the reference pool depth multiplied by the background posterior average error rate, and the shape
2 parameter of the quantile beta-binomial function is the shape 1 parameter subtracted from reference pool depth.
[0380] The output from the quantile BetaBinomial function is the minimum value a variant needs to be called. Any variant that has a normalized allele count below the quantile(BetaBinomial) output will be filtered due to the high background error observed at that position.
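As a minimal, non-limiting sketch of the quantile beta-binomial step, the following pure-Python code computes the smallest alternate allele count whose cumulative beta-binomial probability reaches a target quantile. It assumes that the value 1 - alpha / 2 is the probability passed to the quantile function, that size is the reference pool depth, and that the shape parameters are derived as described above. The function names and the example inputs are hypothetical, and the posterior error must lie strictly between 0 and 1 for the shape parameters to be valid.

```python
import math


def betabinom_logpmf(k, n, a, b):
    """Log probability of k successes in n trials under BetaBinomial(n, a, b)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    log_beta_num = math.lgamma(k + a) + math.lgamma(n - k + b) - math.lgamma(n + a + b)
    log_beta_den = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return log_choose + log_beta_num - log_beta_den


def betabinom_quantile(q, n, a, b):
    """Smallest k in 0..n whose cumulative probability is at least q."""
    cumulative = 0.0
    for k in range(n + 1):
        cumulative += math.exp(betabinom_logpmf(k, n, a, b))
        if cumulative >= q:
            return k
    return n  # guards against rounding when q is extremely close to 1


def min_alt_count(pool_depth, posterior_error, alpha):
    """Minimum normalized alternate allele count needed to call a variant."""
    shape1 = pool_depth * posterior_error   # expected error reads
    shape2 = pool_depth - shape1            # expected non-error reads
    return betabinom_quantile(1.0 - alpha / 2.0, pool_depth, shape1, shape2)


def passes_filter(normalized_alt, pool_depth, posterior_error, alpha):
    """A variant below the quantile output is filtered as background."""
    return normalized_alt >= min_alt_count(pool_depth, posterior_error, alpha)


# Hypothetical inputs: 2500X reference pool depth, 0.1% posterior error.
print(min_alt_count(2500, 0.001, alpha=1e-4))
```

Because alpha shrinks as the pre-test odds fall, variants in rarely mutated regions require more supporting reads, which is the behavior the pre-test-odds multiplier is intended to offset for resistance mutations.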
[0381] For example, Figure 4F2 illustrates a flow chart of a method 400-2 for validating a somatic sequence variant in a test subject having a cancer condition, in accordance with some embodiments of the present disclosure.
[0382] In some embodiments, the method includes obtaining (402-2) cell-free DNA sequencing data 122 from a sequencing reaction of a liquid biopsy sample of a test subject 121 (e.g., sequence reads 123-1-1-1, ..., 123-1-1-K for sequence run 122-1-1 for a liquid biopsy sample from patient 121-1, as illustrated in Figure 1B). As described herein, in some embodiments, the obtaining includes a step of sequencing cell-free nucleic acids from a liquid biopsy sample. Example methods for sequencing cell-free nucleic acids are described herein.
[0383] Sequence reads 123 from the sequencing data 122 are then aligned (404-2) to a human reference sequence (e.g., a human genome or a portion of a human genome, e.g., 1%, 5%, 10%, 15%, 20%, 25%, 30%, 40%, 50%, 60%, 75%, 90%, 95%, 99%, or more of the human genome, or to a map of a human reference genome or a set of human reference genomes, or a portion thereof), thereby generating a plurality of aligned reads 124.
Optionally, the pre-aligned sequence reads 123 and/or aligned sequence reads 124 are pre-processed (408-2) using any of the methods disclosed above (e.g., normalization, bias correction, etc.). In some embodiments, as described herein, device 100 obtains previously aligned sequence reads.
[0384] The aligned sequence reads 124 are then evaluated to identify mismatches with the reference construct (e.g., reference genome or set of reference genomes), thereby identifying one or more candidate somatic sequence variants 132-c at respective genomic loci. The number of aligned sequence reads containing the sequence variant at the locus is determined, thereby defining a variant allele fragment count 132-c-ac (e.g., variant allele fragment count 132-c-l-ac as illustrated in Figure 1C2). In some embodiments, the number of aligned sequence reads containing the locus of the candidate variant allele (regardless of the identity of the allele represented in the sequence read) is also determined, thereby
defining a variant allele locus count 132-c-lc (e.g., variant allele locus count 132-c-l-lc as illustrated in Figure 1C2). Accordingly, in some embodiments, the variant allele fragment count 132-c-ac can be compared to the variant allele locus count 132-c-lc to determine a variant allele fraction 132-c-vf (e.g., variant allele fraction 132-c-l-vf as illustrated in Figure 1C2) for the candidate variant allele. This represents a measure of the portion of sequence reads encompassing the nucleotide(s) that is altered in the candidate variant allele that include the candidate variant. In some embodiments, as described below, this measure can be used to define a sensitivity for the detection of the candidate variant based on a distribution of detection sensitivities corresponding to detection of a variant within a genomic region encompassing the locus in reference samples with defined variant allele fractions.
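For illustration only, the relationship between the variant allele fragment count, the variant allele locus count, and the variant allele fraction can be sketched as follows. The sketch assumes a simplified input in which each aligned read covering the locus has already been reduced to the allele it reports; the function names and that representation are hypothetical.

```python
def count_support(observed_alleles, variant_allele):
    """Return (variant allele fragment count, variant allele locus count).

    observed_alleles: one entry per aligned read covering the locus,
    giving the allele that read reports at the locus.
    """
    locus_count = len(observed_alleles)
    allele_count = sum(1 for allele in observed_alleles if allele == variant_allele)
    return allele_count, locus_count


def variant_allele_fraction(allele_count, locus_count):
    """Fraction of reads covering the locus that carry the candidate variant."""
    if locus_count == 0:
        return None  # no coverage at the locus
    return allele_count / locus_count


# Hypothetical reads at one locus: three report the reference allele "A"
# and one reports the candidate variant allele "T".
allele_count, locus_count = count_support(["A", "T", "A", "A"], "T")
print(allele_count, locus_count, variant_allele_fraction(allele_count, locus_count))
```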
[0385] Method 400-2 then includes obtaining (412-2) a dynamic variant count threshold 191 for the candidate variant allele. As described herein, in some embodiments, the dynamic variant count threshold is based upon a prevalence of sequence variations in a genomic region encompassing the locus of the candidate variant allele in cancer patients sharing one or more similarities with the test subject. For example, in some embodiments, this prevalence defines a pre-test odds that the test subject has a sequence variant within the genomic region encompassing the locus at which the candidate sequence variant is located. In some embodiments, this pre-test odds is used in an application of Bayes theorem to derive a minimal amount of support required of the sequencing reaction to validate the presence of the candidate sequence variant in a cancerous tissue of the subject at a desired confidence level. Information about Bayes theorem and Bayesian inference can be found, for instance, in Section 8.7 of Stuart, A. and Ord, K. (1994), Kendall's Advanced Theory of Statistics:
Volume I, Distribution Theory, Edward Arnold; and Gelman, A. et al. (2013), Bayesian Data Analysis, Third Edition, Chapman and Hall/CRC, ISBN 978-1-4398-4095-5, the disclosures of both of which are incorporated herein by reference for their teachings of how to implement Bayes' theorem and Bayesian inference.
[0386] In some embodiments, the prevalence of sequence variants in the genomic region encompassing the locus of the candidate variant allele is determined from a population of reference cancer subjects having the same type of cancer. In some embodiments, the population of reference cancer subjects is further defined by a matching personal characteristic, e.g., an age, gender, race, smoking status, or any other personal characteristic. In some embodiments, the population of reference subjects is further defined by a plurality of
matching personal characteristics, e.g., at least 2, 3, 4, 5, 6, 7, 8, 9, 10, or more personal characteristics, in addition to cancer type.
[0387] For instance, in some embodiments, the prevalence of sequence variants is determined from variant prevalence training data 192, as illustrated in Figure 1F. The variant prevalence training data 192 includes data on the variants found in a cancerous tissue from a plurality of reference subjects 193. For example, training data 192 for reference subject 1 193-1 includes a cancer type 194-1 and a list of somatic sequence variants 195-1, including individual variants 196-1-1 . . . 196-1-S. To determine a prevalence for a particular candidate sequence variant detected for a test subject, a genomic region encompassing the locus of the candidate sequence variant is defined (e.g., the exon of a gene in which a candidate sequence variant is detected). Then, it is determined what portion of reference subjects 193, that have the same cancer as the test subject, have a sequence variant located within the defined genomic region (e.g., the exon of the gene).
[0388] In some embodiments, e.g., when only a limited set of defined candidate variants will be validated, sequence variant prevalence is predetermined and stored in a database, e.g., in non-persistent memory 111, or in an addressable remote server, as a look-up table. In other embodiments, system 100 determines a sequence variant prevalence for a genomic region and matching patient profile upon identification of a candidate sequence variant, e.g., by filtering variant prevalence training data 192 for the relevant genomic region and matching reference subjects.
[0389] Generally, the genomic region encompassing the candidate sequence variant is larger than a single nucleotide. For example, in some embodiments, the genomic region includes at least 10 nucleotides, at least 50 nucleotides, at least 100 nucleotides, at least 250 nucleotides, at least 500 nucleotides, at least 1000 nucleotides, at least 2500 nucleotides, or more nucleotides. In some embodiments, the genomic region is no larger than 10,000 nucleotides, no larger than 7500 nucleotides, no larger than 5000 nucleotides, no larger than 2500 nucleotides, or fewer nucleotides. In some embodiments, the genomic region is from 10 nucleotides to 10,000 nucleotides. In some embodiments, the genomic region is from 25 nucleotides to 5000 nucleotides. In some embodiments, the genomic region is from 50 nucleotides to 2500 nucleotides.
[0390] In some embodiments, when the candidate sequence variant falls within a protein coding sequence, the genomic region is defined as the exon in which the candidate sequence variant is located. In some embodiments, the genomic region is defined as several adjacent exons, including the exon in which the candidate sequence variant is located. In some embodiments, when the candidate sequence variant falls within a protein coding sequence, the genomic region is defined as all exons of the gene in which the candidate sequence variant is located. In some embodiments, when the candidate sequence variant falls within a protein coding sequence, the genomic region is defined as the entire gene in which the candidate sequence variant is located. Similarly, in some embodiments, when the candidate sequence variant falls within an intronic sequence of a gene, the genomic region is defined as the entire intron in which the candidate sequence variant is located, or several adjacent introns including the intron in which the candidate sequence variant is located.
[0391] In some embodiments, the genomic region encompassing the candidate sequence variant is a fixed window encompassing, e.g., surrounding, the candidate sequence variant. For example, in some embodiments, when the candidate sequence variant falls within a non-coding portion of the genome, the genomic region is defined as a fixed window surrounding the candidate sequence variant. However, in some embodiments, when the sequence variant falls within a non-coding genetic element, e.g., a promoter, enhancer, etc., the genomic region is defined as the entirety of the genetic element.
[0392] In some implementations, the genomic region encompassing the candidate sequence variant is dependent upon the sequence context of the locus. For example, when the candidate sequence variant falls within a coding sequence, the exon or several adjacent exons define the genomic region, but when the candidate sequence variant falls within a non-coding sequence, the genomic region is defined by a fixed window encompassing the candidate sequence variant.
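The context-dependent region definition described above can be sketched as follows. This is an illustrative example only: the function name `define_region`, the representation of exons as inclusive `(start, end)` tuples on a single chromosome, and the default window size are assumptions, not part of the disclosure.

```python
def define_region(pos, exons, window=50):
    """Return an inclusive (start, end) genomic region around `pos`.

    exons: list of inclusive (start, end) intervals on the same chromosome.
    window: total width of the fixed window used outside exons (assumed).
    """
    # Coding context: use the exon containing the candidate variant.
    for start, end in exons:
        if start <= pos <= end:
            return (start, end)
    # Non-coding context: fall back to a fixed window centered on the locus,
    # clipped so the region never starts before position 1.
    half = window // 2
    return (max(1, pos - half), pos + half)
```

Extending the coding branch to several adjacent exons, or the non-coding branch to an annotated genetic element such as a promoter, would follow the same pattern with an additional annotation lookup.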
[0393] In some embodiments, the genomic region encompassing the candidate sequence variant is dependent upon a known or inferred effect of the sequence variant. For instance, as described in more detail below, in some embodiments, when the candidate sequence variant causes, or is inferred to cause, a partial or complete loss of function mutation in a gene, the genomic region is defined by all exons of the gene in which the candidate sequence variant is located. Similarly, as described in more detail below, in some embodiments, when the candidate sequence variant causes, or is inferred to cause, a gain of function mutation in a gene having one or more hotspots for gain of function mutations, the genomic region is defined as those exons of the gene encompassing the one or more hotspots.
[0394] In some embodiments, when the candidate sequence variant falls within a genomic region associated with a known therapeutic resistance gene for the cancer of the subject, the pre-test odds determined based on the historical prevalence data is multiplied by a pre-test-odds multiplier (e.g., as described above).
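For illustration, the adjustment of pre-test odds can be sketched as below. The conversion from prevalence to odds uses the standard definition odds = p / (1 - p); the function names and the multiplier value used in the usage note are hypothetical, since the multiplier itself is described elsewhere in the disclosure.

```python
def pretest_odds(prevalence, in_resistance_region=False, multiplier=1.0):
    """Convert a prevalence (probability) to odds, optionally scaled.

    multiplier: pre-test-odds multiplier applied only when the variant
    falls within a region tied to a known resistance gene (value assumed).
    """
    if not 0.0 <= prevalence < 1.0:
        raise ValueError("prevalence must be in [0, 1)")
    odds = prevalence / (1.0 - prevalence)
    if in_resistance_region:
        odds *= multiplier
    return odds


def odds_to_probability(odds):
    """Inverse conversion, from odds back to a probability."""
    return odds / (1.0 + odds)
```

For example, a prevalence of 0.2 corresponds to odds of 0.25; with an assumed multiplier of 4 the adjusted odds become 1.0, i.e., a pre-test probability of 0.5.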
[0395] In some embodiments, the Bayesian analysis is further informed by defining the specificity of variant detection based on an apparent variant allele fraction in the sample. For example, in some embodiments, the variant allele fraction for the candidate sequence variant is determined by a comparison of the variant allele fragment count 132-c-ac to the variant allele locus count 132-c-lc (e.g., a ratio of the variant allele fragment count to the variant allele locus count), thereby determining a variant allele fraction 132-c-vf. In some embodiments, the variant allele fraction is then compared to a distribution of variant detection specificities established based on a set of training samples (e.g., sensitivity distribution training data) with known variant allele fractions. For example, in some embodiments, nucleic acids from each of a plurality of training samples 181 having a known variant allele fraction 184 for one or more variant alleles 183 are sequenced according to a process-matched sequencing reaction (e.g., using a substantially identical or identical sequencing reaction), and it is determined whether each sequence variant can be detected, e.g., defining a detection status 185 for each locus/variant 183. Over a large number of training samples, a specificity of detection of variants having different variant allele fractions can be determined. In some embodiments, the specificity is determined on a locus-by-locus basis, such that the specificity of detection is specific for the genomic region or locus encompassing the candidate sequence variant. In some embodiments, the specificity is determined globally, e.g., not on a locus-by-locus basis.
[0396] A correlation can then be established between the measured detection specificity and the variant allele fraction (e.g., variant detection sensitivity distribution 186). In some embodiments, the correlation is a linear or non-linear fit between measured detection specificities and variant allele fractions. In other embodiments, the correlation is determined by binning specificities (e.g., in bins 187) as a function of ranges of variant allele fractions 188, and determining a measure of central tendency (e.g., a mean) for the specificities 189 in the bin. The variant allele fraction 132-c-vf determined for the candidate sequence variant is then compared to the established correlation (e.g., variant detection sensitivity distribution 186) to define the specificity of detection for the candidate sequence variant.
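The binned form of this correlation can be sketched as follows. The example is a minimal illustration under stated assumptions: the function names, the half-open bin convention [lower, upper), and the use of the arithmetic mean as the measure of central tendency are choices made for the sketch, and the training values in the usage note are invented.

```python
from bisect import bisect_right
from statistics import mean


def variant_allele_fraction(fragment_count, locus_count):
    """Ratio of variant allele fragment count to variant allele locus count."""
    return fragment_count / locus_count


def build_binned_distribution(training, edges):
    """Mean detection measure per variant-allele-fraction bin.

    training: iterable of (vaf, measured_value) pairs from training samples.
    edges: sorted bin edges; bin i covers [edges[i], edges[i + 1]).
    """
    bins = [[] for _ in range(len(edges) - 1)]
    for vaf, value in training:
        i = bisect_right(edges, vaf) - 1
        if 0 <= i < len(bins):
            bins[i].append(value)
    # Empty bins have no estimate and are reported as None.
    return [mean(b) if b else None for b in bins]


def lookup(vaf, edges, bin_means):
    """Return the binned value for a candidate variant's VAF, or None."""
    i = bisect_right(edges, vaf) - 1
    if 0 <= i < len(bin_means):
        return bin_means[i]
    return None
```

With edges `[0.0, 0.01, 0.05, 1.0]` and training pairs `(0.005, 0.90)`, `(0.008, 0.94)`, `(0.02, 0.99)`, a candidate with a variant allele fraction of 0.03 falls in the second bin and is assigned that bin's mean.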
[0397] In some embodiments, the Bayesian analysis is further informed by accounting for the sequencing error rate for the variant allele and, accordingly, the probability that the candidate sequence variant is a product of a sequencing error, rather than a genomic variant. In some embodiments, a reaction-specific error rate (e.g., a trinucleotide sequencing error rate) is determined for the sequencing reaction (e.g., using an internal control spiked into the reaction). In some embodiments, a locus-specific error rate is determined from historical sequencing errors at the genomic region, or specific locus, encompassing the candidate sequence variant. In some embodiments, both a reaction-specific sequencing error rate and a locus-specific error rate are used to define a variant count distribution (e.g., variant count distribution 190), representing the number of variant allele counts (e.g., variant allele fragment count 132-c-ac) necessary to validate the presence of the candidate variant sequence in the cancer of the subject at a defined detection sensitivity. In some embodiments, a beta-binomial distribution is established based on the reaction-specific sequencing error rate and the locus-specific error rate.
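One way a beta-binomial error model could yield a minimum variant count is sketched below. This is not presented as the method of the disclosure: how the reaction-specific and locus-specific error rates are combined is not specified in this passage, so the sketch takes a single combined `error_rate`, and the `concentration` parameter, the `max_false_positive` cutoff, and all function names are assumptions.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def betabinom_pmf(k, n, a, b):
    """P(X = k) for a beta-binomial with n trials and shape parameters a, b."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + log_beta(k + a, n - k + b) - log_beta(a, b))


def min_variant_count(depth, error_rate, concentration=100.0,
                      max_false_positive=1e-3):
    """Smallest k with P(X >= k | errors only) <= max_false_positive.

    The beta prior is parameterized by its mean (error_rate) and an assumed
    concentration: a = mean * concentration, b = (1 - mean) * concentration.
    """
    a = error_rate * concentration
    b = (1.0 - error_rate) * concentration
    tail = 1.0  # P(X >= 0)
    for k in range(depth + 1):
        if tail <= max_false_positive:
            return k
        tail -= betabinom_pmf(k, depth, a, b)  # now P(X >= k + 1)
    return depth + 1  # no achievable count meets the cutoff at this depth
```

Under this parameterization, a higher error rate or greater depth at a locus raises the count required, which is the sense in which the resulting threshold is dynamic.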
[0398] Method 400-2 then includes applying (414-2) the dynamic variant count threshold (e.g., locus-specific dynamic variant count threshold 191) to the sequencing data, e.g., by determining whether the variant allele fragment count 132-c-ac for the candidate sequence variant satisfies the threshold, and validating the candidate sequence variant (e.g., creating a record 132-v of the validation) when the threshold is satisfied or rejecting the candidate sequence variant when the threshold is not satisfied. In some embodiments, one or more additional filters, relating to global sequencing metrics and/or locus-specific sequencing metrics (e.g., one or more of variant locus coverage filter(s) 463, variant allele fraction filter(s) 465, variant support mapping filter(s) 467, variant support sequencing quality filter(s) 469, and low complexity region filter(s) 471, as illustrated in Figure 1D2) must be satisfied before validating a candidate sequence variant.
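Applying the threshold together with additional filters can be sketched as follows. The function name, the `metrics` dictionary keys, the use of "greater than or equal to" as the meaning of satisfying the threshold, and the example coverage cutoff in the usage note are all assumptions made for illustration.

```python
def validate_candidate(metrics, count_threshold, filters=()):
    """Return (validated, failed_checks) for one candidate variant.

    metrics: dict of per-variant values, e.g. {"variant_count": 5, ...}.
    filters: iterable of (name, predicate) pairs; each predicate takes the
        metrics dict and returns True when the filter is satisfied.
    """
    failed = []
    # Dynamic variant count threshold (assumed to be satisfied when the
    # observed count is at least the threshold).
    if metrics["variant_count"] < count_threshold:
        failed.append("dynamic_count_threshold")
    # Any additional global or locus-specific filters must also pass.
    for name, predicate in filters:
        if not predicate(metrics):
            failed.append(name)
    return (not failed, failed)
```

For example, with a filter list `[("coverage", lambda m: m["depth"] >= 100)]`, a candidate with 5 supporting fragments at depth 500 passes a threshold of 4, while a candidate with 3 supporting fragments at depth 50 is rejected on both checks.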
[0399] As described in further detail herein, in some embodiments, one or more validated variant statuses 132-v are used to match (424-2) the subject with a targeted therapy and/or a clinical trial. In some embodiments, as described in further detail herein, one or more validated variant statuses 132-v for one or more actionable variants 139-1-1, one or more matched therapies 139-1-2, and/or one or more matched clinical trials are used to generate (426-2) a patient report 139-1-3. In some embodiments, the patient report is transmitted to a medical professional treating the subject. In some embodiments, the patient is then administered (428-2) a personalized course of therapy, e.g., based on a matched therapy and/or clinical trial.
[0400] In some embodiments, the methods of validating a candidate somatic sequence variant using a dynamic threshold described herein fall within the context of a larger variant detection method, e.g., method 450 as illustrated in Figures 4G1-4G3. For example, in some embodiments, the method includes obtaining (452) cfDNA sequence reads, as described herein, and aligning (454) those reads to a reference construct (e.g., a reference genome or mapped representation of several reference genomes), to generate aligned sequences 124 (e.g., a plurality of unique sequence reads). In some embodiments, putative somatic sequence variants are identified (456), e.g., those sequence variants having a variant allele fraction that is lower than expected for a germline sequence variant (which should be around 50% after accounting for an estimated circulating tumor fraction for the liquid biopsy sample), e.g., less than 30%, less than 20%, less than 10%, etc. One or more candidate somatic sequence variants are then validated by applying one or more filters. For instance, as described herein, a dynamic variant count threshold is determined (459) and then used to apply (460) a dynamic probabilistic variant count filter to sequencing data for the candidate somatic sequence variant. In some embodiments, the method also includes applying (462) a variant loci coverage filter. In some embodiments, the method also includes applying (464) a variant allele fraction filter. In some embodiments, the method also includes applying (466) a variant support mapping filter. In some embodiments, the method also includes applying (468) a variant support sequencing quality filter. In some embodiments, the method also includes applying (470) a low complexity region filter. When all selected candidate somatic sequence variants have been validated or rejected according to these filters (472), the process proceeds with a reporting function.
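The identification of putative somatic sequence variants by variant allele fraction (step 456) can be sketched as below. The dictionary keys and the function name are hypothetical, and the default cutoff of 0.30 is simply one of the example values given above rather than a prescribed setting.

```python
def putative_somatic(variants, max_vaf=0.30):
    """Select variants whose allele fraction is below a germline-like level.

    variants: iterable of dicts with "alt_count" (variant-supporting
        fragments) and "depth" (total fragments at the locus).
    max_vaf: variant allele fraction cutoff (assumed example value).
    """
    selected = []
    for v in variants:
        if v["depth"] <= 0:
            continue  # no coverage: allele fraction is undefined, so skip
        if v["alt_count"] / v["depth"] < max_vaf:
            selected.append(v)
    return selected
```

Variants selected in this way would then be passed to the dynamic variant count filter and the other per-variant filters listed above.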
[0401] In some embodiments, method 450 also includes validating (474) the sequencing data globally, using any of the metrics described herein. In some embodiments, the validation includes applying (476) a loci minimal coverage filter. In some embodiments, the validation includes applying (478) a loci central tendency coverage filter. In some embodiments, the validation includes applying (480) a total sequence read filter. In some embodiments, the validation includes applying (481) a sequence read quality filter. In some embodiments, the validation includes applying (482) a sequencing control filter. The entire sequencing reaction is then validated or rejected (483) based on whether the sequencing data passes these global filters.
[0402] In some embodiments, method 450 also includes validating (485) one or more germline mutations. In some embodiments, candidate germline sequence variants are identified (484), e.g., those sequence variants having a variant allele fraction that is higher than expected for a somatic sequence variant. In some embodiments, the validation includes applying (486) a germline-specific variant allele fraction filter. In some embodiments, the validation includes applying (487) a variant support mapping filter. In some embodiments, the validation includes applying (488) a variant support sequencing quality filter. When all selected candidate germline sequence variants have been validated or rejected according to these filters (489), the process proceeds with a reporting function.
[0403] As described in further detail herein, in some embodiments, one or more validated variant statuses 132-v are used to match (490) the subject with a targeted therapy and/or a clinical trial. In some embodiments, as described in further detail herein, one or more validated variant statuses 132-v for one or more actionable variants 139