EP4320618A2 - Method for analyzing cell-free DNA sequence data to examine nucleosome protection and chromatin accessibility - Google Patents
Info
- Publication number
- EP4320618A2 (EP22785557.4A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- cancer
- cell
- determining
- fragment
- read data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 381
- 108010047956 Nucleosomes Proteins 0.000 title claims description 128
- 210000001623 nucleosome Anatomy 0.000 title claims description 128
- 210000003483 chromatin Anatomy 0.000 title claims description 99
- 108010077544 Chromatin Proteins 0.000 title claims description 97
- 238000007405 data analysis Methods 0.000 title description 7
- 108091028043 Nucleic acid sequence Proteins 0.000 title description 2
- 206010028980 Neoplasm Diseases 0.000 claims abstract description 644
- 201000011510 cancer Diseases 0.000 claims abstract description 356
- 239000012634 fragment Substances 0.000 claims abstract description 258
- 238000009826 distribution Methods 0.000 claims abstract description 76
- 238000003745 diagnosis Methods 0.000 claims abstract description 17
- 230000002708 enhancing effect Effects 0.000 claims abstract description 8
- 210000004027 cell Anatomy 0.000 claims description 270
- 108091023040 Transcription factor Proteins 0.000 claims description 262
- 102000040945 Transcription factor Human genes 0.000 claims description 261
- 108020004414 DNA Proteins 0.000 claims description 130
- 108090000623 proteins and genes Proteins 0.000 claims description 111
- 108700009124 Transcription Initiation Site Proteins 0.000 claims description 104
- 208000000236 Prostatic Neoplasms Diseases 0.000 claims description 88
- 206010060862 Prostate cancer Diseases 0.000 claims description 86
- 230000014509 gene expression Effects 0.000 claims description 79
- 238000012070 whole genome sequencing analysis Methods 0.000 claims description 79
- 206010041067 Small cell lung cancer Diseases 0.000 claims description 73
- 208000000587 small cell lung carcinoma Diseases 0.000 claims description 71
- 210000002381 plasma Anatomy 0.000 claims description 70
- 230000027455 binding Effects 0.000 claims description 56
- 210000001519 tissue Anatomy 0.000 claims description 52
- 238000011282 treatment Methods 0.000 claims description 49
- 230000000955 neuroendocrine Effects 0.000 claims description 44
- 238000003556 assay Methods 0.000 claims description 38
- 208000020816 lung neoplasm Diseases 0.000 claims description 37
- 206010058467 Lung neoplasm malignant Diseases 0.000 claims description 36
- 201000005202 lung cancer Diseases 0.000 claims description 36
- 208000002154 non-small cell lung carcinoma Diseases 0.000 claims description 36
- 208000029729 tumor suppressor gene on chromosome 11 Diseases 0.000 claims description 36
- 210000004369 blood Anatomy 0.000 claims description 34
- 239000008280 blood Substances 0.000 claims description 34
- 206010055113 Breast cancer metastatic Diseases 0.000 claims description 33
- 230000008859 change Effects 0.000 claims description 32
- 230000004481 post-translational protein modification Effects 0.000 claims description 26
- 230000002103 transcriptional effect Effects 0.000 claims description 24
- 239000003814 drug Substances 0.000 claims description 22
- 101001012157 Homo sapiens Receptor tyrosine-protein kinase erbB-2 Proteins 0.000 claims description 21
- 102100030086 Receptor tyrosine-protein kinase erbB-2 Human genes 0.000 claims description 21
- 230000000717 retained effect Effects 0.000 claims description 18
- 238000007477 logistic regression Methods 0.000 claims description 17
- 208000009956 adenocarcinoma Diseases 0.000 claims description 16
- 239000011159 matrix material Substances 0.000 claims description 14
- 101150079937 NEUROD1 gene Proteins 0.000 claims description 12
- 108700020297 NeuroD Proteins 0.000 claims description 12
- 102100032063 Neurogenic differentiation factor 1 Human genes 0.000 claims description 12
- 101000572976 Homo sapiens POU domain, class 2, transcription factor 3 Proteins 0.000 claims description 11
- 102100026466 POU domain, class 2, transcription factor 3 Human genes 0.000 claims description 11
- 229940079593 drug Drugs 0.000 claims description 11
- 238000001914 filtration Methods 0.000 claims description 11
- 230000036210 malignancy Effects 0.000 claims description 10
- 238000001353 Chip-sequencing Methods 0.000 claims description 7
- 238000009499 grossing Methods 0.000 claims description 7
- 208000010658 metastatic prostate carcinoma Diseases 0.000 claims description 6
- 238000013528 artificial neural network Methods 0.000 claims description 5
- 239000003153 chemical reaction reagent Substances 0.000 claims description 5
- 238000003066 decision tree Methods 0.000 claims description 5
- 230000007717 exclusion Effects 0.000 claims description 5
- 238000011275 oncology therapy Methods 0.000 claims description 5
- 210000002966 serum Anatomy 0.000 claims description 5
- 238000012706 support-vector machine Methods 0.000 claims description 5
- 238000003239 susceptibility assay Methods 0.000 claims description 5
- 210000002230 centromere Anatomy 0.000 claims description 4
- 102000054766 genetic haplotypes Human genes 0.000 claims description 4
- 206010041823 squamous cell carcinoma Diseases 0.000 claims description 4
- 238000009966 trimming Methods 0.000 claims description 3
- 238000002560 therapeutic procedure Methods 0.000 abstract description 18
- 239000012472 biological sample Substances 0.000 abstract description 8
- 238000012544 monitoring process Methods 0.000 abstract description 7
- 238000004458 analytical method Methods 0.000 description 138
- 102100038595 Estrogen receptor Human genes 0.000 description 137
- 108010038795 estrogen receptors Proteins 0.000 description 135
- 239000000523 sample Substances 0.000 description 82
- 238000012163 sequencing technique Methods 0.000 description 62
- 108010080146 androgen receptors Proteins 0.000 description 50
- 102100032187 Androgen receptor Human genes 0.000 description 48
- 238000013459 approach Methods 0.000 description 47
- 238000001514 detection method Methods 0.000 description 43
- 206010061289 metastatic neoplasm Diseases 0.000 description 41
- 208000026310 Breast neoplasm Diseases 0.000 description 39
- 238000012937 correction Methods 0.000 description 39
- 206010006187 Breast cancer Diseases 0.000 description 38
- 230000001394 metastatic effect Effects 0.000 description 32
- 238000001574 biopsy Methods 0.000 description 31
- 241000699666 Mus <mouse, genus> Species 0.000 description 30
- 239000002131 composite material Substances 0.000 description 30
- 238000011160 research Methods 0.000 description 30
- 238000004891 communication Methods 0.000 description 29
- 108010033040 Histones Proteins 0.000 description 27
- 230000000670 limiting effect Effects 0.000 description 27
- 101000901099 Homo sapiens Achaete-scute homolog 1 Proteins 0.000 description 26
- 102100022142 Achaete-scute homolog 1 Human genes 0.000 description 25
- 201000010099 disease Diseases 0.000 description 25
- 208000037265 diseases, disorders, signs and symptoms Diseases 0.000 description 25
- 230000000694 effects Effects 0.000 description 24
- 239000013610 patient sample Substances 0.000 description 22
- 102000003998 progesterone receptors Human genes 0.000 description 22
- 108090000468 progesterone receptors Proteins 0.000 description 22
- 238000012360 testing method Methods 0.000 description 21
- 230000004048 modification Effects 0.000 description 20
- 238000012986 modification Methods 0.000 description 20
- 230000022532 regulation of transcription, DNA-dependent Effects 0.000 description 20
- 150000007523 nucleic acids Chemical class 0.000 description 19
- 102000039446 nucleic acids Human genes 0.000 description 18
- 108020004707 nucleic acids Proteins 0.000 description 18
- 238000013518 transcription Methods 0.000 description 18
- 230000035897 transcription Effects 0.000 description 18
- 230000001965 increasing effect Effects 0.000 description 16
- 238000007451 chromatin immunoprecipitation sequencing Methods 0.000 description 14
- 230000006870 function Effects 0.000 description 14
- 230000033228 biological regulation Effects 0.000 description 13
- 230000000875 corresponding effect Effects 0.000 description 13
- 210000003958 hematopoietic stem cell Anatomy 0.000 description 13
- 239000000203 mixture Substances 0.000 description 11
- 230000035772 mutation Effects 0.000 description 11
- 238000003559 RNA-seq method Methods 0.000 description 10
- 230000004075 alteration Effects 0.000 description 10
- 238000010801 machine learning Methods 0.000 description 10
- 230000001105 regulatory effect Effects 0.000 description 10
- 238000012549 training Methods 0.000 description 10
- 230000008901 benefit Effects 0.000 description 9
- 230000003247 decreasing effect Effects 0.000 description 9
- 230000035945 sensitivity Effects 0.000 description 9
- 238000003860 storage Methods 0.000 description 9
- 102100033417 Glucocorticoid receptor Human genes 0.000 description 8
- 206010020751 Hypersensitivity Diseases 0.000 description 8
- 238000012512 characterization method Methods 0.000 description 8
- 238000002474 experimental method Methods 0.000 description 8
- 238000013467 fragmentation Methods 0.000 description 8
- 238000006062 fragmentation reaction Methods 0.000 description 8
- 238000003364 immunohistochemistry Methods 0.000 description 8
- 208000037819 metastatic cancer Diseases 0.000 description 8
- 208000011575 metastatic malignant neoplasm Diseases 0.000 description 8
- 230000008520 organization Effects 0.000 description 8
- 241000699670 Mus sp. Species 0.000 description 7
- 101710163270 Nuclease Proteins 0.000 description 7
- 208000026935 allergic disease Diseases 0.000 description 7
- 230000003321 amplification Effects 0.000 description 7
- 238000002487 chromatin immunoprecipitation Methods 0.000 description 7
- 238000011161 development Methods 0.000 description 7
- 230000018109 developmental process Effects 0.000 description 7
- 230000004069 differentiation Effects 0.000 description 7
- 230000001973 epigenetic effect Effects 0.000 description 7
- 230000009610 hypersensitivity Effects 0.000 description 7
- 238000013507 mapping Methods 0.000 description 7
- 108091008916 nuclear estrogen receptors subtypes Proteins 0.000 description 7
- 238000003199 nucleic acid amplification method Methods 0.000 description 7
- 238000004393 prognosis Methods 0.000 description 7
- 238000007482 whole exome sequencing Methods 0.000 description 7
- 102100029283 Hepatocyte nuclear factor 3-alpha Human genes 0.000 description 6
- 101001062353 Homo sapiens Hepatocyte nuclear factor 3-alpha Proteins 0.000 description 6
- 238000000585 Mann–Whitney U test Methods 0.000 description 6
- 102100038358 Prostate-specific antigen Human genes 0.000 description 6
- 238000001793 Wilcoxon signed-rank test Methods 0.000 description 6
- 230000015556 catabolic process Effects 0.000 description 6
- 238000003776 cleavage reaction Methods 0.000 description 6
- 230000002596 correlated effect Effects 0.000 description 6
- 230000007423 decrease Effects 0.000 description 6
- 238000006731 degradation reaction Methods 0.000 description 6
- 238000005516 engineering process Methods 0.000 description 6
- 230000011987 methylation Effects 0.000 description 6
- 238000007069 methylation reaction Methods 0.000 description 6
- 230000007017 scission Effects 0.000 description 6
- 230000004083 survival effect Effects 0.000 description 6
- 230000001225 therapeutic effect Effects 0.000 description 6
- 210000004881 tumor cell Anatomy 0.000 description 6
- 108091032973 (ribonucleotides)n+m Proteins 0.000 description 5
- 230000007067 DNA methylation Effects 0.000 description 5
- 238000001712 DNA sequencing Methods 0.000 description 5
- 108010008532 Deoxyribonuclease I Proteins 0.000 description 5
- 102000007260 Deoxyribonuclease I Human genes 0.000 description 5
- 108700039691 Genetic Promoter Regions Proteins 0.000 description 5
- 101000958299 Homo sapiens Protein lyl-1 Proteins 0.000 description 5
- 206010027476 Metastases Diseases 0.000 description 5
- 108010072866 Prostate-Specific Antigen Proteins 0.000 description 5
- 102100038231 Protein lyl-1 Human genes 0.000 description 5
- 208000003721 Triple Negative Breast Neoplasms Diseases 0.000 description 5
- 230000004913 activation Effects 0.000 description 5
- 239000000090 biomarker Substances 0.000 description 5
- 230000003394 haemopoietic effect Effects 0.000 description 5
- 229940088597 hormone Drugs 0.000 description 5
- 239000005556 hormone Substances 0.000 description 5
- 108091008039 hormone receptors Proteins 0.000 description 5
- 210000002865 immune cell Anatomy 0.000 description 5
- 238000011528 liquid biopsy Methods 0.000 description 5
- 210000004185 liver Anatomy 0.000 description 5
- 238000007481 next generation sequencing Methods 0.000 description 5
- 238000010606 normalization Methods 0.000 description 5
- 238000012545 processing Methods 0.000 description 5
- 210000002307 prostate Anatomy 0.000 description 5
- 238000011002 quantification Methods 0.000 description 5
- 208000024891 symptom Diseases 0.000 description 5
- 230000009897 systematic effect Effects 0.000 description 5
- 208000022679 triple-negative breast carcinoma Diseases 0.000 description 5
- 108090000079 Glucocorticoid Receptors Proteins 0.000 description 4
- 102100022047 Hepatocyte nuclear factor 4-gamma Human genes 0.000 description 4
- 101000926939 Homo sapiens Glucocorticoid receptor Proteins 0.000 description 4
- 101001045749 Homo sapiens Hepatocyte nuclear factor 4-gamma Proteins 0.000 description 4
- 101000804764 Homo sapiens Lymphotactin Proteins 0.000 description 4
- 101000819111 Homo sapiens Trans-acting T-cell-specific transcription factor GATA-3 Proteins 0.000 description 4
- 102100035304 Lymphotactin Human genes 0.000 description 4
- 241000124008 Mammalia Species 0.000 description 4
- 241001465754 Metazoa Species 0.000 description 4
- NKANXQFJJICGDU-QPLCGJKRSA-N Tamoxifen Chemical compound C=1C=CC=CC=1C(/CC)=C(C=1C=CC(OCCN(C)C)=CC=1)/C1=CC=CC=C1 NKANXQFJJICGDU-QPLCGJKRSA-N 0.000 description 4
- 102100021386 Trans-acting T-cell-specific transcription factor GATA-3 Human genes 0.000 description 4
- 230000010632 Transcription Factor Activity Effects 0.000 description 4
- 230000009471 action Effects 0.000 description 4
- 238000003339 best practice Methods 0.000 description 4
- 230000001413 cellular effect Effects 0.000 description 4
- 238000002790 cross-validation Methods 0.000 description 4
- 238000009261 endocrine therapy Methods 0.000 description 4
- 238000011156 evaluation Methods 0.000 description 4
- 238000010195 expression analysis Methods 0.000 description 4
- 230000002068 genetic effect Effects 0.000 description 4
- 238000012268 genome sequencing Methods 0.000 description 4
- 230000037442 genomic alteration Effects 0.000 description 4
- 238000012165 high-throughput sequencing Methods 0.000 description 4
- 239000000463 material Substances 0.000 description 4
- 230000007246 mechanism Effects 0.000 description 4
- 108020004999 messenger RNA Proteins 0.000 description 4
- 230000009401 metastasis Effects 0.000 description 4
- 210000004910 pleural fluid Anatomy 0.000 description 4
- 239000013641 positive control Substances 0.000 description 4
- 238000002360 preparation method Methods 0.000 description 4
- 230000002829 reductive effect Effects 0.000 description 4
- 238000012552 review Methods 0.000 description 4
- 230000011664 signaling Effects 0.000 description 4
- 230000000392 somatic effect Effects 0.000 description 4
- 238000002626 targeted therapy Methods 0.000 description 4
- RWQNBRDOKXIBIV-UHFFFAOYSA-N thymine Chemical compound CC1=CNC(=O)NC1=O RWQNBRDOKXIBIV-UHFFFAOYSA-N 0.000 description 4
- 238000010200 validation analysis Methods 0.000 description 4
- 240000008168 Ficus benjamina Species 0.000 description 3
- 238000000729 Fisher's exact test Methods 0.000 description 3
- 102100034227 Grainyhead-like protein 2 homolog Human genes 0.000 description 3
- 102100022057 Hepatocyte nuclear factor 1-alpha Human genes 0.000 description 3
- 102100021088 Homeobox protein Hox-B13 Human genes 0.000 description 3
- 101001069929 Homo sapiens Grainyhead-like protein 2 homolog Proteins 0.000 description 3
- 101001045751 Homo sapiens Hepatocyte nuclear factor 1-alpha Proteins 0.000 description 3
- 101001041145 Homo sapiens Homeobox protein Hox-B13 Proteins 0.000 description 3
- 101000768460 Homo sapiens Protein unc-13 homolog A Proteins 0.000 description 3
- 101000829203 Homo sapiens Serine/arginine repetitive matrix protein 4 Proteins 0.000 description 3
- 101000687905 Homo sapiens Transcription factor SOX-2 Proteins 0.000 description 3
- 108010059724 Micrococcal Nuclease Proteins 0.000 description 3
- 102100023663 Serine/arginine repetitive matrix protein 4 Human genes 0.000 description 3
- 210000001744 T-lymphocyte Anatomy 0.000 description 3
- 102100024270 Transcription factor SOX-2 Human genes 0.000 description 3
- 239000003098 androgen Substances 0.000 description 3
- 230000002280 anti-androgenic effect Effects 0.000 description 3
- 239000000051 antiandrogen Substances 0.000 description 3
- 210000000988 bone and bone Anatomy 0.000 description 3
- 238000004364 calculation method Methods 0.000 description 3
- 230000011712 cell development Effects 0.000 description 3
- 238000000205 computational method Methods 0.000 description 3
- 230000036541 health Effects 0.000 description 3
- 238000001727 in vivo Methods 0.000 description 3
- 230000002601 intratumoral effect Effects 0.000 description 3
- 238000010172 mouse model Methods 0.000 description 3
- 210000005259 peripheral blood Anatomy 0.000 description 3
- 239000011886 peripheral blood Substances 0.000 description 3
- 102000005962 receptors Human genes 0.000 description 3
- 108020003175 receptors Proteins 0.000 description 3
- 239000007787 solid Substances 0.000 description 3
- 238000001228 spectrum Methods 0.000 description 3
- 238000010186 staining Methods 0.000 description 3
- 230000000153 supplemental effect Effects 0.000 description 3
- 101150034533 ATIC gene Proteins 0.000 description 2
- 201000009030 Carcinoma Diseases 0.000 description 2
- 102100025064 Cellular tumor antigen p53 Human genes 0.000 description 2
- 206010065163 Clonal evolution Diseases 0.000 description 2
- 238000007399 DNA isolation Methods 0.000 description 2
- 102000052510 DNA-Binding Proteins Human genes 0.000 description 2
- 108700020911 DNA-Binding Proteins Proteins 0.000 description 2
- 230000004568 DNA-binding Effects 0.000 description 2
- KCXVZYZYPLLWCC-UHFFFAOYSA-N EDTA Chemical compound OC(=O)CN(CC(O)=O)CCN(CC(O)=O)CC(O)=O KCXVZYZYPLLWCC-UHFFFAOYSA-N 0.000 description 2
- 102100031785 Endothelial transcription factor GATA-2 Human genes 0.000 description 2
- 206010064571 Gene mutation Diseases 0.000 description 2
- 101001066265 Homo sapiens Endothelial transcription factor GATA-2 Proteins 0.000 description 2
- 101000882584 Homo sapiens Estrogen receptor Proteins 0.000 description 2
- 101000829208 Homo sapiens Serine/arginine repetitive matrix protein 3 Proteins 0.000 description 2
- 102000048850 Neoplasm Genes Human genes 0.000 description 2
- 108700019961 Neoplasm Genes Proteins 0.000 description 2
- 206010061309 Neoplasm progression Diseases 0.000 description 2
- 108700020796 Oncogene Proteins 0.000 description 2
- RJKFOVLPORLFTN-LEKSSAKUSA-N Progesterone Chemical compound C1CC2=CC(=O)CC[C@]2(C)[C@@H]2[C@@H]1[C@@H]1CC[C@H](C(=O)C)[C@@]1(C)CC2 RJKFOVLPORLFTN-LEKSSAKUSA-N 0.000 description 2
- 102100027901 Protein unc-13 homolog A Human genes 0.000 description 2
- 102000015097 RNA Splicing Factors Human genes 0.000 description 2
- 108010039259 RNA Splicing Factors Proteins 0.000 description 2
- 238000012300 Sequence Analysis Methods 0.000 description 2
- 102100023665 Serine/arginine repetitive matrix protein 3 Human genes 0.000 description 2
- 108010078814 Tumor Suppressor Protein p53 Proteins 0.000 description 2
- ISAKRJDGNUQOIC-UHFFFAOYSA-N Uracil Chemical compound O=C1C=CNC(=O)N1 ISAKRJDGNUQOIC-UHFFFAOYSA-N 0.000 description 2
- 238000010317 ablation therapy Methods 0.000 description 2
- 238000000540 analysis of variance Methods 0.000 description 2
- 229940030495 antiandrogen sex hormone and modulator of the genital system Drugs 0.000 description 2
- 210000000481 breast Anatomy 0.000 description 2
- 238000004422 calculation algorithm Methods 0.000 description 2
- 230000005773 cancer-related death Effects 0.000 description 2
- 230000030833 cell death Effects 0.000 description 2
- 238000006243 chemical reaction Methods 0.000 description 2
- 150000001875 compounds Chemical class 0.000 description 2
- 238000011109 contamination Methods 0.000 description 2
- OPTASPLRGRRNAP-UHFFFAOYSA-N cytosine Chemical compound NC=1C=CNC(=O)N=1 OPTASPLRGRRNAP-UHFFFAOYSA-N 0.000 description 2
- 230000034994 death Effects 0.000 description 2
- 231100000517 death Toxicity 0.000 description 2
- 238000012350 deep sequencing Methods 0.000 description 2
- 230000007850 degeneration Effects 0.000 description 2
- 238000010586 diagram Methods 0.000 description 2
- 230000005750 disease progression Effects 0.000 description 2
- 239000006185 dispersion Substances 0.000 description 2
- 229940034984 endocrine therapy antineoplastic and immunomodulating agent Drugs 0.000 description 2
- 239000003623 enhancer Substances 0.000 description 2
- 238000010201 enrichment analysis Methods 0.000 description 2
- 230000002255 enzymatic effect Effects 0.000 description 2
- 230000004049 epigenetic modification Effects 0.000 description 2
- 239000000262 estrogen Substances 0.000 description 2
- 201000007281 estrogen-receptor positive breast cancer Diseases 0.000 description 2
- 238000000605 extraction Methods 0.000 description 2
- 230000004077 genetic alteration Effects 0.000 description 2
- 231100000118 genetic alteration Toxicity 0.000 description 2
- 238000011331 genomic analysis Methods 0.000 description 2
- 238000003205 genotyping method Methods 0.000 description 2
- UYTPUPDQBNUYGX-UHFFFAOYSA-N guanine Chemical compound O=C1NC(N)=NC2=C1N=CN2 UYTPUPDQBNUYGX-UHFFFAOYSA-N 0.000 description 2
- 230000002962 histologic effect Effects 0.000 description 2
- 239000003112 inhibitor Substances 0.000 description 2
- 101150044508 key gene Proteins 0.000 description 2
- 230000003902 lesion Effects 0.000 description 2
- 231100000518 lethal Toxicity 0.000 description 2
- 230000001665 lethal effect Effects 0.000 description 2
- 238000012317 liver biopsy Methods 0.000 description 2
- 239000003550 marker Substances 0.000 description 2
- 239000000178 monomer Substances 0.000 description 2
- 239000013642 negative control Substances 0.000 description 2
- QJGQUHMNIGDVPM-UHFFFAOYSA-N nitrogen group Chemical group [N] QJGQUHMNIGDVPM-UHFFFAOYSA-N 0.000 description 2
- 239000002773 nucleotide Substances 0.000 description 2
- 125000003729 nucleotide group Chemical group 0.000 description 2
- 230000002265 prevention Effects 0.000 description 2
- 230000002250 progressing effect Effects 0.000 description 2
- 201000005825 prostate adenocarcinoma Diseases 0.000 description 2
- 201000001514 prostate carcinoma Diseases 0.000 description 2
- 238000003908 quality control method Methods 0.000 description 2
- 230000008261 resistance mechanism Effects 0.000 description 2
- 238000013515 script Methods 0.000 description 2
- 208000000649 small cell carcinoma Diseases 0.000 description 2
- 238000001356 surgical procedure Methods 0.000 description 2
- 230000002459 sustained effect Effects 0.000 description 2
- 238000013268 sustained release Methods 0.000 description 2
- 239000012730 sustained-release form Substances 0.000 description 2
- 229960001603 tamoxifen Drugs 0.000 description 2
- 229940113082 thymine Drugs 0.000 description 2
- 230000009466 transformation Effects 0.000 description 2
- 238000013519 translation Methods 0.000 description 2
- 229960000575 trastuzumab Drugs 0.000 description 2
- 238000011269 treatment regimen Methods 0.000 description 2
- 230000005751 tumor progression Effects 0.000 description 2
- 230000003827 upregulation Effects 0.000 description 2
- 238000011144 upstream manufacturing Methods 0.000 description 2
- 238000012800 visualization Methods 0.000 description 2
- GUAHPAJOXVYFON-ZETCQYMHSA-N (8S)-8-amino-7-oxononanoic acid zwitterion Chemical compound C[C@H](N)C(=O)CCCCCC(O)=O GUAHPAJOXVYFON-ZETCQYMHSA-N 0.000 description 1
- LKJPYSCBVHEWIU-KRWDZBQOSA-N (R)-bicalutamide Chemical compound C([C@@](O)(C)C(=O)NC=1C=C(C(C#N)=CC=1)C(F)(F)F)S(=O)(=O)C1=CC=C(F)C=C1 LKJPYSCBVHEWIU-KRWDZBQOSA-N 0.000 description 1
- 206010069754 Acquired gene mutation Diseases 0.000 description 1
- 229930024421 Adenine Natural products 0.000 description 1
- GFFGJBXGBJISGV-UHFFFAOYSA-N Adenine Chemical compound NC1=NC=NC2=C1N=CN2 GFFGJBXGBJISGV-UHFFFAOYSA-N 0.000 description 1
- 208000010507 Adenocarcinoma of Lung Diseases 0.000 description 1
- 206010002091 Anaesthesia Diseases 0.000 description 1
- 208000023275 Autoimmune disease Diseases 0.000 description 1
- 102100022983 B-cell lymphoma/leukemia 11B Human genes 0.000 description 1
- LSNNMFCWUKXFEE-UHFFFAOYSA-M Bisulfite Chemical compound OS([O-])=O LSNNMFCWUKXFEE-UHFFFAOYSA-M 0.000 description 1
- 102100028226 COUP transcription factor 2 Human genes 0.000 description 1
- 108091033409 CRISPR Proteins 0.000 description 1
- 238000010354 CRISPR gene editing Methods 0.000 description 1
- 102000029816 Collagenase Human genes 0.000 description 1
- 108060005980 Collagenase Proteins 0.000 description 1
- 208000035473 Communicable disease Diseases 0.000 description 1
- 108010043471 Core Binding Factor Alpha 2 Subunit Proteins 0.000 description 1
- 108010079362 Core Binding Factor Alpha 3 Subunit Proteins 0.000 description 1
- 102000053602 DNA Human genes 0.000 description 1
- 230000033616 DNA repair Effects 0.000 description 1
- 230000004543 DNA replication Effects 0.000 description 1
- 102100037799 DNA-binding protein Ikaros Human genes 0.000 description 1
- 102000004163 DNA-directed RNA polymerases Human genes 0.000 description 1
- 108090000626 DNA-directed RNA polymerases Proteins 0.000 description 1
- 102000016911 Deoxyribonucleases Human genes 0.000 description 1
- 108010053770 Deoxyribonucleases Proteins 0.000 description 1
- 101100477411 Dictyostelium discoideum set1 gene Proteins 0.000 description 1
- QRLVDLBMBULFAL-UHFFFAOYSA-N Digitonin Natural products CC1CCC2(OC1)OC3C(O)C4C5CCC6CC(OC7OC(CO)C(OC8OC(CO)C(O)C(OC9OCC(O)C(O)C9OC%10OC(CO)C(O)C(OC%11OC(CO)C(O)C(O)C%11O)C%10O)C8O)C(O)C7O)C(O)CC6(C)C5CCC4(C)C3C2C QRLVDLBMBULFAL-UHFFFAOYSA-N 0.000 description 1
- 206010061818 Disease progression Diseases 0.000 description 1
- 101100310856 Drosophila melanogaster spri gene Proteins 0.000 description 1
- 102100039563 ETS translocation variant 1 Human genes 0.000 description 1
- 102100039578 ETS translocation variant 4 Human genes 0.000 description 1
- 102100039577 ETS translocation variant 5 Human genes 0.000 description 1
- 102000004190 Enzymes Human genes 0.000 description 1
- 108090000790 Enzymes Proteins 0.000 description 1
- 108700024394 Exon Proteins 0.000 description 1
- 101001066288 Gallus gallus GATA-binding factor 3 Proteins 0.000 description 1
- 229920002527 Glycogen Polymers 0.000 description 1
- 238000011460 HER2-targeted therapy Methods 0.000 description 1
- 108010081348 HRT1 protein Hairy Proteins 0.000 description 1
- 102100021881 Hairy/enhancer-of-split related with YRPW motif protein 1 Human genes 0.000 description 1
- 102100027893 Homeobox protein Nkx-2.1 Human genes 0.000 description 1
- 102100028092 Homeobox protein Nkx-3.1 Human genes 0.000 description 1
- 101100163881 Homo sapiens ASCL1 gene Proteins 0.000 description 1
- 101000903697 Homo sapiens B-cell lymphoma/leukemia 11B Proteins 0.000 description 1
- 101000860860 Homo sapiens COUP transcription factor 2 Proteins 0.000 description 1
- 101000599038 Homo sapiens DNA-binding protein Ikaros Proteins 0.000 description 1
- 101000863721 Homo sapiens Deoxyribonuclease-1 Proteins 0.000 description 1
- 101000813729 Homo sapiens ETS translocation variant 1 Proteins 0.000 description 1
- 101000813747 Homo sapiens ETS translocation variant 4 Proteins 0.000 description 1
- 101000813745 Homo sapiens ETS translocation variant 5 Proteins 0.000 description 1
- 101000632178 Homo sapiens Homeobox protein Nkx-2.1 Proteins 0.000 description 1
- 101000578249 Homo sapiens Homeobox protein Nkx-3.1 Proteins 0.000 description 1
- 101001033715 Homo sapiens Insulinoma-associated protein 1 Proteins 0.000 description 1
- 101000598002 Homo sapiens Interferon regulatory factor 1 Proteins 0.000 description 1
- 101000628547 Homo sapiens Metalloreductase STEAP1 Proteins 0.000 description 1
- 101000603698 Homo sapiens Neurogenin-2 Proteins 0.000 description 1
- 101000979342 Homo sapiens Nuclear factor NF-kappa-B p105 subunit Proteins 0.000 description 1
- 101000572986 Homo sapiens POU domain, class 3, transcription factor 2 Proteins 0.000 description 1
- 101001091365 Homo sapiens Plasma kallikrein Proteins 0.000 description 1
- 101000605534 Homo sapiens Prostate-specific antigen Proteins 0.000 description 1
- 101000876829 Homo sapiens Protein C-ets-1 Proteins 0.000 description 1
- 101001132658 Homo sapiens Retinoic acid receptor gamma Proteins 0.000 description 1
- 101000716809 Homo sapiens Secretogranin-1 Proteins 0.000 description 1
- 102000008394 Immunoglobulin Fragments Human genes 0.000 description 1
- 108010021625 Immunoglobulin Fragments Proteins 0.000 description 1
- 102100039091 Insulinoma-associated protein 1 Human genes 0.000 description 1
- 102100036981 Interferon regulatory factor 1 Human genes 0.000 description 1
- PIWKPBJCKXDKJR-UHFFFAOYSA-N Isoflurane Chemical compound FC(F)OC(Cl)C(F)(F)F PIWKPBJCKXDKJR-UHFFFAOYSA-N 0.000 description 1
- 238000012313 Kruskal-Wallis test Methods 0.000 description 1
- 206010050017 Lung cancer metastatic Diseases 0.000 description 1
- 102100026712 Metalloreductase STEAP1 Human genes 0.000 description 1
- 241000699660 Mus musculus Species 0.000 description 1
- 101100289867 Mus musculus Lyl1 gene Proteins 0.000 description 1
- 101100348669 Mus musculus Nkx3-1 gene Proteins 0.000 description 1
- 102100038554 Neurogenin-2 Human genes 0.000 description 1
- 102100023050 Nuclear factor NF-kappa-B p105 subunit Human genes 0.000 description 1
- 102000007399 Nuclear hormone receptor Human genes 0.000 description 1
- 108020005497 Nuclear hormone receptor Proteins 0.000 description 1
- 208000035327 Oestrogen receptor positive breast cancer Diseases 0.000 description 1
- 239000012661 PARP inhibitor Substances 0.000 description 1
- 102100026459 POU domain, class 3, transcription factor 2 Human genes 0.000 description 1
- 229930012538 Paclitaxel Natural products 0.000 description 1
- 229940121906 Poly ADP ribose polymerase inhibitor Drugs 0.000 description 1
- 241001237728 Precis Species 0.000 description 1
- 102100035251 Protein C-ets-1 Human genes 0.000 description 1
- 238000002123 RNA extraction Methods 0.000 description 1
- 108700005075 Regulator Genes Proteins 0.000 description 1
- 102100033912 Retinoic acid receptor gamma Human genes 0.000 description 1
- 102100025373 Runt-related transcription factor 1 Human genes 0.000 description 1
- 102100025369 Runt-related transcription factor 3 Human genes 0.000 description 1
- 108010017324 STAT3 Transcription Factor Proteins 0.000 description 1
- 102100020867 Secretogranin-1 Human genes 0.000 description 1
- 102100035348 Serine/threonine-protein phosphatase 2B catalytic subunit alpha isoform Human genes 0.000 description 1
- 102100024040 Signal transducer and activator of transcription 3 Human genes 0.000 description 1
- 206010066901 Treatment failure Diseases 0.000 description 1
- 102100027881 Tumor protein 63 Human genes 0.000 description 1
- 101710140697 Tumor protein 63 Proteins 0.000 description 1
- 108700029634 Y-Linked Genes Proteins 0.000 description 1
- 230000001594 aberrant effect Effects 0.000 description 1
- GZOSMCIZMLWJML-VJLLXTKPSA-N abiraterone Chemical compound C([C@H]1[C@H]2[C@@H]([C@]3(CC[C@H](O)CC3=CC2)C)CC[C@@]11C)C=C1C1=CC=CN=C1 GZOSMCIZMLWJML-VJLLXTKPSA-N 0.000 description 1
- 229960000853 abiraterone Drugs 0.000 description 1
- 230000021736 acetylation Effects 0.000 description 1
- 238000006640 acetylation reaction Methods 0.000 description 1
- 229960000643 adenine Drugs 0.000 description 1
- 239000002671 adjuvant Substances 0.000 description 1
- 238000005576 amination reaction Methods 0.000 description 1
- 230000037005 anaesthesia Effects 0.000 description 1
- 238000012801 analytical assay Methods 0.000 description 1
- 238000009167 androgen deprivation therapy Methods 0.000 description 1
- 102000001307 androgen receptors Human genes 0.000 description 1
- 229940043275 anti-HER2 drug Drugs 0.000 description 1
- 230000006907 apoptotic process Effects 0.000 description 1
- 239000003886 aromatase inhibitor Substances 0.000 description 1
- 229940046844 aromatase inhibitors Drugs 0.000 description 1
- 101150036080 at gene Proteins 0.000 description 1
- 229960003852 atezolizumab Drugs 0.000 description 1
- 239000011324 bead Substances 0.000 description 1
- 229960000997 bicalutamide Drugs 0.000 description 1
- 238000001369 bisulfite sequencing Methods 0.000 description 1
- 210000000601 blood cell Anatomy 0.000 description 1
- RMRJXGBAOAMLHD-IHFGGWKQSA-N buprenorphine Chemical compound C([C@]12[C@H]3OC=4C(O)=CC=C(C2=4)C[C@@H]2[C@]11CC[C@]3([C@H](C1)[C@](C)(O)C(C)(C)C)OC)CN2CC1CC1 RMRJXGBAOAMLHD-IHFGGWKQSA-N 0.000 description 1
- 229960001736 buprenorphine Drugs 0.000 description 1
- 239000012830 cancer therapeutic Substances 0.000 description 1
- 229910052799 carbon Inorganic materials 0.000 description 1
- 190000008236 carboplatin Chemical compound 0.000 description 1
- 229960004562 carboplatin Drugs 0.000 description 1
- 231100000504 carcinogenesis Toxicity 0.000 description 1
- 238000012754 cardiac puncture Methods 0.000 description 1
- 108091092259 cell-free RNA Proteins 0.000 description 1
- 238000005119 centrifugation Methods 0.000 description 1
- 239000003795 chemical substances by application Substances 0.000 description 1
- 238000002512 chemotherapy Methods 0.000 description 1
- 208000011654 childhood malignant neoplasm Diseases 0.000 description 1
- 239000013611 chromosomal DNA Substances 0.000 description 1
- 229960002424 collagenase Drugs 0.000 description 1
- 239000003086 colorant Substances 0.000 description 1
- 238000005056 compaction Methods 0.000 description 1
- 239000000306 component Substances 0.000 description 1
- 230000001010 compromised effect Effects 0.000 description 1
- 238000010205 computational analysis Methods 0.000 description 1
- 238000010276 construction Methods 0.000 description 1
- 238000009109 curative therapy Methods 0.000 description 1
- 229940104302 cytosine Drugs 0.000 description 1
- 238000000354 decomposition reaction Methods 0.000 description 1
- 230000002950 deficient Effects 0.000 description 1
- 230000003831 deregulation Effects 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- UVYVLBIGDKGWPX-KUAJCENISA-N digitonin Chemical compound O([C@@H]1[C@@H]([C@]2(CC[C@@H]3[C@@]4(C)C[C@@H](O)[C@H](O[C@H]5[C@@H]([C@@H](O)[C@@H](O[C@H]6[C@@H]([C@@H](O[C@H]7[C@@H]([C@@H](O)[C@H](O)CO7)O)[C@H](O)[C@@H](CO)O6)O[C@H]6[C@@H]([C@@H](O[C@H]7[C@@H]([C@@H](O)[C@H](O)[C@@H](CO)O7)O)[C@@H](O)[C@@H](CO)O6)O)[C@@H](CO)O5)O)C[C@@H]4CC[C@H]3[C@@H]2[C@@H]1O)C)[C@@H]1C)[C@]11CC[C@@H](C)CO1 UVYVLBIGDKGWPX-KUAJCENISA-N 0.000 description 1
- UVYVLBIGDKGWPX-UHFFFAOYSA-N digitonine Natural products CC1C(C2(CCC3C4(C)CC(O)C(OC5C(C(O)C(OC6C(C(OC7C(C(O)C(O)CO7)O)C(O)C(CO)O6)OC6C(C(OC7C(C(O)C(O)C(CO)O7)O)C(O)C(CO)O6)O)C(CO)O5)O)CC4CCC3C2C2O)C)C2OC11CCC(C)CO1 UVYVLBIGDKGWPX-UHFFFAOYSA-N 0.000 description 1
- 230000003467 diminishing effect Effects 0.000 description 1
- 108010007093 dispase Proteins 0.000 description 1
- 238000002224 dissection Methods 0.000 description 1
- 230000037437 driver mutation Effects 0.000 description 1
- 230000009977 dual effect Effects 0.000 description 1
- 229940121647 egfr inhibitor Drugs 0.000 description 1
- 230000008030 elimination Effects 0.000 description 1
- 238000003379 elimination reaction Methods 0.000 description 1
- 229960004671 enzalutamide Drugs 0.000 description 1
- WXCXUHSOUPDCQV-UHFFFAOYSA-N enzalutamide Chemical compound C1=C(F)C(C(=O)NC)=CC=C1N1C(C)(C)C(=O)N(C=2C=C(C(C#N)=CC=2)C(F)(F)F)C1=S WXCXUHSOUPDCQV-UHFFFAOYSA-N 0.000 description 1
- 229940088598 enzyme Drugs 0.000 description 1
- 102000052116 epidermal growth factor receptor activity proteins Human genes 0.000 description 1
- 108700015053 epidermal growth factor receptor activity proteins Proteins 0.000 description 1
- 230000004076 epigenetic alteration Effects 0.000 description 1
- 238000007419 epigenetic assay Methods 0.000 description 1
- 230000006718 epigenetic regulation Effects 0.000 description 1
- 229940011871 estrogen Drugs 0.000 description 1
- 201000007280 estrogen-receptor negative breast cancer Diseases 0.000 description 1
- 230000002496 gastric effect Effects 0.000 description 1
- 229940096919 glycogen Drugs 0.000 description 1
- 238000003306 harvesting Methods 0.000 description 1
- 230000011132 hemopoiesis Effects 0.000 description 1
- 239000008241 heterogeneous mixture Substances 0.000 description 1
- 238000001794 hormone therapy Methods 0.000 description 1
- 238000009396 hybridization Methods 0.000 description 1
- 238000003384 imaging method Methods 0.000 description 1
- 230000005746 immune checkpoint blockade Effects 0.000 description 1
- 238000001114 immunoprecipitation Methods 0.000 description 1
- 238000009169 immunotherapy Methods 0.000 description 1
- 230000008676 import Effects 0.000 description 1
- 238000000126 in silico method Methods 0.000 description 1
- 230000001939 inductive effect Effects 0.000 description 1
- 208000015181 infectious disease Diseases 0.000 description 1
- 230000005764 inhibitory process Effects 0.000 description 1
- 238000007689 inspection Methods 0.000 description 1
- 230000010354 integration Effects 0.000 description 1
- 230000003993 interaction Effects 0.000 description 1
- 239000000543 intermediate Substances 0.000 description 1
- 238000011835 investigation Methods 0.000 description 1
- 229960002725 isoflurane Drugs 0.000 description 1
- 238000002955 isolation Methods 0.000 description 1
- 231100000225 lethality Toxicity 0.000 description 1
- 238000012417 linear regression Methods 0.000 description 1
- 230000004807 localization Effects 0.000 description 1
- 201000005249 lung adenocarcinoma Diseases 0.000 description 1
- 210000001165 lymph node Anatomy 0.000 description 1
- 230000003211 malignant effect Effects 0.000 description 1
- 238000005259 measurement Methods 0.000 description 1
- 238000010197 meta-analysis Methods 0.000 description 1
- 230000005012 migration Effects 0.000 description 1
- 238000013508 migration Methods 0.000 description 1
- 238000007479 molecular analysis Methods 0.000 description 1
- 238000010369 molecular cloning Methods 0.000 description 1
- 230000000877 morphologic effect Effects 0.000 description 1
- 230000000869 mutational effect Effects 0.000 description 1
- YOHYSYJDKVYCJI-UHFFFAOYSA-N n-[3-[[6-[3-(trifluoromethyl)anilino]pyrimidin-4-yl]amino]phenyl]cyclopropanecarboxamide Chemical compound FC(F)(F)C1=CC=CC(NC=2N=CN=C(NC=3C=C(NC(=O)C4CC4)C=CC=3)C=2)=C1 YOHYSYJDKVYCJI-UHFFFAOYSA-N 0.000 description 1
- 230000017074 necrotic cell death Effects 0.000 description 1
- 238000011227 neoadjuvant chemotherapy Methods 0.000 description 1
- 230000001537 neural effect Effects 0.000 description 1
- 230000007372 neural signaling Effects 0.000 description 1
- 201000002120 neuroendocrine carcinoma Diseases 0.000 description 1
- 230000004031 neuronal differentiation Effects 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 230000002018 overexpression Effects 0.000 description 1
- 229960001592 paclitaxel Drugs 0.000 description 1
- 230000001575 pathological effect Effects 0.000 description 1
- 230000007170 pathology Effects 0.000 description 1
- 230000037361 pathway Effects 0.000 description 1
- 229960002621 pembrolizumab Drugs 0.000 description 1
- 230000002085 persistent effect Effects 0.000 description 1
- 229960002087 pertuzumab Drugs 0.000 description 1
- 125000002467 phosphate group Chemical group [H]OP(=O)(O[H])O[*] 0.000 description 1
- 229920000642 polymer Polymers 0.000 description 1
- 102000054765 polymorphisms of proteins Human genes 0.000 description 1
- 102000040430 polynucleotide Human genes 0.000 description 1
- 108091033319 polynucleotide Proteins 0.000 description 1
- 230000003389 potentiating effect Effects 0.000 description 1
- 208000037920 primary disease Diseases 0.000 description 1
- 230000008569 process Effects 0.000 description 1
- 239000000047 product Substances 0.000 description 1
- 239000000186 progesterone Substances 0.000 description 1
- 229960003387 progesterone Drugs 0.000 description 1
- 201000007283 progesterone-receptor positive breast cancer Diseases 0.000 description 1
- -1 promoters Proteins 0.000 description 1
- 230000000644 propagated effect Effects 0.000 description 1
- 208000023958 prostate neoplasm Diseases 0.000 description 1
- 102000004169 proteins and genes Human genes 0.000 description 1
- 239000002096 quantum dot Substances 0.000 description 1
- 230000000306 recurrent effect Effects 0.000 description 1
- 230000009467 reduction Effects 0.000 description 1
- 238000007634 remodeling Methods 0.000 description 1
- 230000003252 repetitive effect Effects 0.000 description 1
- 230000010076 replication Effects 0.000 description 1
- 230000001718 repressive effect Effects 0.000 description 1
- 230000008672 reprogramming Effects 0.000 description 1
- 238000012827 research and development Methods 0.000 description 1
- 230000004044 response Effects 0.000 description 1
- 238000013432 robust analysis Methods 0.000 description 1
- 238000005070 sampling Methods 0.000 description 1
- 238000002864 sequence alignment Methods 0.000 description 1
- 230000037439 somatic mutation Effects 0.000 description 1
- 230000009870 specific binding Effects 0.000 description 1
- 230000007480 spreading Effects 0.000 description 1
- 238000003892 spreading Methods 0.000 description 1
- 238000012289 standard assay Methods 0.000 description 1
- 238000010561 standard procedure Methods 0.000 description 1
- 238000007619 statistical method Methods 0.000 description 1
- 238000007920 subcutaneous administration Methods 0.000 description 1
- CCEKAJIANROZEO-UHFFFAOYSA-N sulfluramid Chemical group CCNS(=O)(=O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F CCEKAJIANROZEO-UHFFFAOYSA-N 0.000 description 1
- 239000006228 supernatant Substances 0.000 description 1
- 238000001847 surface plasmon resonance imaging Methods 0.000 description 1
- 230000008685 targeting Effects 0.000 description 1
- RCINICONZNJXQF-MZXODVADSA-N taxol Chemical compound O([C@@H]1[C@@]2(C[C@@H](C(C)=C(C2(C)C)[C@H](C([C@]2(C)[C@@H](O)C[C@H]3OC[C@]3([C@H]21)OC(C)=O)=O)OC(=O)C)OC(=O)[C@H](O)[C@@H](NC(=O)C=1C=CC=CC=1)C=1C=CC=CC=1)O)C(=O)C1=CC=CC=C1 RCINICONZNJXQF-MZXODVADSA-N 0.000 description 1
- 230000002123 temporal effect Effects 0.000 description 1
- 230000005758 transcription activity Effects 0.000 description 1
- 108091008023 transcriptional regulators Proteins 0.000 description 1
- 238000011222 transcriptome analysis Methods 0.000 description 1
- 238000013520 translational research Methods 0.000 description 1
- 229940035893 uracil Drugs 0.000 description 1
- XLYOFNOQVPJJNP-UHFFFAOYSA-N water Substances O XLYOFNOQVPJJNP-UHFFFAOYSA-N 0.000 description 1
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
- G16B25/10—Gene or protein expression profiling; Expression-ratio estimation or normalisation
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H20/00—ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance
- G16H20/10—ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance relating to drugs or medications, e.g. for ensuring correct administration to patients
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/20—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
Definitions
- Metastatic cancer is a late stage of cancer that often leads to cancer-related deaths.
- treatment options are often based on clinical diagnostics from the primary tumor.
- molecular changes in the tumor, such as genetic alterations or phenotype changes, can emerge during metastatic progression or the development of treatment resistance.
- hormone receptor conversions in breast cancer are frequently observed during the development of targeted treatment resistance. Therefore, it is important to classify tumor subtypes and identify patterns of transcriptional regulation that drive tumor phenotype changes during therapy. This type of work has critical implications for studying mechanisms of resistance to therapies and informing clinical treatment decisions in order to provide patients with life-prolonging treatment and care.
- breast cancer is among the most common cancers, accounting for 23% of cancer diagnoses and 14% of cancer-related deaths among women worldwide.
- Targeted therapy is guided by tumor subtype, including the expression of three receptors: estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2).
- breast cancer tumors will undergo a switch in hormone subtype during tumor recurrence or as a mechanism of resistance to endocrine therapy.
- clinical determination of tumor subtype remains restricted to use of tissue biopsies, which are not routinely collected in late-stage cancers or repeatedly taken during the course of therapy.
- prostate cancer is the second most common cause of cancer mortality among men with an estimated 33,000 deaths in the United States in 2020.
- Castration-resistant prostate cancer (CRPC) describes the stage in which the disease has developed resistance to androgen deprivation therapy; progression to metastatic CRPC (mCRPC) marks an invariably lethal stage with no curative treatment.
- mCRPC is recognized to comprise multiple distinct lineages and molecular subtypes, which are generally classified by specific genomic or epigenetic modifications.
- Prostate cancer can be categorized by phenotypic features that include a spectrum of trans-differentiated disease states, including neuroendocrine (NE) carcinomas, a low androgen regulated disease state (ARlowPC), and double negative prostate cancer (DNPC; AR negative, NE negative).
- the disclosure provides a computer-implemented method of enhancing sequence read data from cell-free DNA samples for cell type prediction.
- the method comprises: receiving, by a computing system, sequence read data, wherein the sequence read data includes a plurality of fragment reads, wherein each fragment read has a fragment length and a GC content indicating a percentage of bases in the fragment read that are G or C; determining, by the computing system, GC bias values for each fragment read based on the fragment length and the GC content of the fragment read; generating, by the computing system, a genomic coverage distribution that is adjusted for GC bias using the sequence read data and the GC bias values; and predicting, by the computing system, the cell type based on the genomic coverage distribution.
- predicting the cell type based on the genomic coverage distribution includes predicting a cell phenotype. In one embodiment, predicting the cell phenotype includes predicting a tissue type, a cancer type, or a cancer subtype. In one embodiment, predicting the cell phenotype includes predicting expression of one or more genes of interest.
- determining the GC bias value based on the fragment length and the GC content of the fragment read includes: counting a number of observed reads of each combination of fragment length and GC content to determine GC counts for the sequence read data; dividing the GC counts by corresponding GC frequencies in a GC frequency matrix to determine a GC bias for each fragment length; normalizing a mean GC bias for each fragment length to determine rough GC bias values; and smoothing the rough GC bias values to determine the GC bias values.
- the GC frequency matrix stores a frequency for each GC content for each fragment length of a plurality of fragment lengths in mappable regions of a reference genome.
- the plurality of fragment lengths includes each fragment length from a short length threshold to a long length threshold.
- the short length threshold is in a range of 10-20 base pairs
- the long length threshold is in a range of 450-550 base pairs.
- the short length threshold is 15 base pairs
- the long length threshold is 500 base pairs.
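The GC bias steps described above (count, divide by expected frequency, normalize, smooth) can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: fragments are represented as hypothetical `(length, n_gc)` pairs, the GC frequency matrix is a nested dictionary, the normalization sets the mean bias of each fragment length to 1, and the smoothing is a simple moving average because the text does not name a smoothing method.

```python
from collections import Counter


def estimate_gc_bias(fragments, gc_freq, smooth_half_width=1):
    """Estimate GC bias per (fragment length, GC count) combination.

    fragments: iterable of (length, n_gc) pairs observed in the sample.
    gc_freq: {length: {n_gc: expected frequency in mappable regions}}.
    Returns {length: {n_gc: smoothed bias}} with mean rough bias 1 per length.
    """
    counts = Counter(fragments)  # observed reads per combination
    bias = {}
    for length, expected in gc_freq.items():
        # Observed count divided by expected frequency.
        rough = {gc: counts.get((length, gc), 0) / freq
                 for gc, freq in expected.items() if freq > 0}
        if not rough:
            continue
        mean = sum(rough.values()) / len(rough)
        if mean == 0:
            continue  # no reads of this length were observed
        rough = {gc: value / mean for gc, value in rough.items()}
        # Assumed smoothing: moving average over neighbouring GC bins.
        keys = sorted(rough)
        smoothed = {}
        for i, gc in enumerate(keys):
            lo = max(0, i - smooth_half_width)
            hi = min(len(keys), i + smooth_half_width + 1)
            window = [rough[k] for k in keys[lo:hi]]
            smoothed[gc] = sum(window) / len(window)
        bias[length] = smoothed
    return bias
```

In a full pipeline the outer loop would run over every fragment length between the short and long length thresholds (for example 15 to 500 base pairs).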
- the method further comprises: determining genomic regions of interest for a cell type; and filtering the genomic regions of interest to identify cell-type-informative sites.
- determining the genomic regions of interest includes: determining a mean mappability in a fixed size window around each genomic region of interest; and discarding genomic regions of interest having a mean mappability less than a predetermined threshold.
- filtering the genomic regions of interest to identify cell-type-informative sites includes determining sites that have differential signals between a first cell type and a second cell type.
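The mappability filter described above can be illustrated with a short sketch. It covers only the mean-mappability step, not the differential-signal selection. The per-base mappability list, the window half-width, and the threshold value are hypothetical inputs; the text does not specify them.

```python
def filter_by_mappability(sites, mappability, half_window, min_mean):
    """Keep sites whose mean mappability in a fixed window meets a threshold.

    sites: 0-based positions on a single chromosome.
    mappability: per-base scores in [0, 1] for that chromosome.
    half_window, min_mean: illustrative parameters (assumptions).
    """
    kept = []
    for pos in sites:
        start = max(0, pos - half_window)
        end = min(len(mappability), pos + half_window + 1)
        window = mappability[start:end]
        if window and sum(window) / len(window) >= min_mean:
            kept.append(pos)
    return kept
```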
- generating the genomic coverage distribution includes: determining fragment midpoints in a window around each cell-type-informative site; assigning a weight for each fragment read based on an inverse of the GC bias value for each fragment read; using the weighted fragment reads to determine GC-corrected midpoint coverage profiles; excluding positions that overlap excluded regions; determining a mean profile based on determining an average of GC-corrected midpoint coverage profiles for all sites; smoothing the mean profile to generate a smoothed mean profile; and normalizing the smoothed mean profile by dividing by a mean of surrounding coverage to determine a normalized mean profile.
- the excluded regions include one or more regions that are within an ENCODE unified GRCh38 exclusion list, centromeres, gaps in the human genome assembly, fix patches, alternative haplotypes, regions of zero mappability, or regions that have coverage of at least 10 standard deviations above a mean.
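The coverage-profile steps above can be sketched as a small set of functions. This is an illustrative outline under assumptions: fragments are hypothetical `(start, end, gc_bias)` tuples with an exclusive end, excluded regions are plain intervals, the smoothing is a moving average (the text does not name a filter), and "surrounding coverage" is taken to be the mean over the whole profiled window.

```python
def site_profile(fragments, site, half_window, excluded=()):
    """GC-weighted fragment-midpoint counts around one site.

    fragments: iterable of (start, end, gc_bias), end exclusive.
    excluded: iterable of (start, end) intervals to mask, end exclusive.
    Returns a list for positions site-half_window .. site+half_window;
    masked positions are None.
    """
    left = site - half_window
    size = 2 * half_window + 1
    profile = [0.0] * size
    for start, end, gc_bias in fragments:
        offset = (start + end) // 2 - left  # midpoint relative to window
        if 0 <= offset < size and gc_bias > 0:
            profile[offset] += 1.0 / gc_bias  # weight = inverse GC bias
    for ex_start, ex_end in excluded:
        for pos in range(max(ex_start, left), min(ex_end, left + size)):
            profile[pos - left] = None
    return profile


def mean_profile(profiles):
    """Average each position across sites, skipping masked values."""
    averaged = []
    for column in zip(*profiles):
        kept = [v for v in column if v is not None]
        averaged.append(sum(kept) / len(kept) if kept else 0.0)
    return averaged


def smooth(values, half_width):
    """Assumed smoothing step: centered moving average."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - half_width): i + half_width + 1]
        out.append(sum(window) / len(window))
    return out


def normalize(values):
    """Divide by the mean of the window (assumed 'surrounding coverage')."""
    mean = sum(values) / len(values)
    return [v / mean for v in values] if mean else list(values)
```

A normalized mean profile would then be obtained as `normalize(smooth(mean_profile(per_site_profiles), half_width))`.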
- predicting the cell type based on the genomic coverage distribution includes: generating one or more features based on the genomic coverage distribution; providing the one or more features as input to a classifier model; and determining the cell type based on an output of the classifier model.
- the one or more features include a mean of coverage in a first predetermined window around each cell-type-informative site, a mean of coverage in a second predetermined window of a different size than the first predetermined window around each cell-type-informative site, and an amplitude of the genomic coverage distribution around each cell-type-informative site.
- the first predetermined window is larger than the second predetermined window.
- the first predetermined window has a width in a range of 1800-2200 base pairs
- the second predetermined window has a width in a range of 40-80 base pairs.
- the first predetermined window has a width of 2000 base pairs
- the second predetermined window has a width of 60 base pairs.
- the amplitude of the genomic coverage distribution around each cell-type-informative site is determined by: trimming the genomic coverage distribution to a window that contains 10 peaks; performing a fast Fourier transform on the window of the genomic coverage distribution; and determining a magnitude of the 10th frequency.
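The three features described above can be sketched as follows. This is a hedged illustration: the profile is assumed to be a list of normalized coverage values indexed by position, `amp_half_window` is a hypothetical parameter standing in for the window that contains 10 peaks, and a single DFT coefficient is evaluated directly, which equals the corresponding fast Fourier transform output but avoids an external FFT library.

```python
import cmath
import math


def window_mean(profile, center, width):
    """Mean coverage in a window of the given width around center."""
    half = width // 2
    window = profile[max(0, center - half): center + half]
    return sum(window) / len(window)


def dft_magnitude(signal, k):
    """Magnitude of the k-th discrete Fourier coefficient of signal."""
    n = len(signal)
    total = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(signal))
    return abs(total)


def profile_features(profile, center, wide=2000, narrow=60,
                     amp_half_window=None, k=10):
    """Return mean coverage, central coverage, and amplitude features."""
    if amp_half_window is None:
        trimmed = profile
    else:
        trimmed = profile[max(0, center - amp_half_window):
                          center + amp_half_window]
    return {
        "mean_coverage": window_mean(profile, center, wide),
        "central_coverage": window_mean(profile, center, narrow),
        "amplitude": dft_magnitude(trimmed, k),
    }
```

When the trimmed window holds exactly 10 nucleosome peaks, the 10th frequency corresponds to the nucleosome repeat, so its magnitude tracks the strength of nucleosome phasing around the sites.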
- the classifier model includes a logistic regression model, an artificial neural network, a decision tree, a support vector machine, or a Bayesian network.
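As one example of the listed classifier options, a logistic regression can be sketched in plain Python. This is a toy illustration, not the disclosed model: the training routine (batch gradient descent), the learning rate, the epoch count, and the feature layout are all assumptions chosen for brevity.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=1.0, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss.

    X: list of feature vectors (e.g. coverage features per sample).
    y: list of 0/1 labels for two cell types.
    """
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict_proba(model, x):
    """Probability that feature vector x belongs to the class labelled 1."""
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
```

In practice an established library implementation with regularization and cross-validation would normally be preferred over a hand-written loop.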
- the disclosure provides a method of determining a chromatin accessibility profile for a cell of interest from a sample comprising cell-free DNA derived from the cell of interest.
- the method comprises: obtaining sequence read data from the cell-free DNA; receiving, by a computing system, sequence read data, wherein the sequence read data includes a plurality of fragment reads, wherein each fragment read has a fragment length and a GC content indicating a percentage of bases in the fragment read that are G or C; determining, by the computing system, GC bias values for each fragment read based on the fragment length and the GC content of the fragment read; generating, by the computing system, a genomic coverage distribution that is adjusted for GC bias using the sequence read data and the GC bias values; and determining the chromatin accessibility profile from the genomic coverage distribution.
- the method further comprises determining a phenotype of the cell of interest based on the chromatin occupancy profile. In one embodiment, determining the cell phenotype comprises determining a tissue type, a cancer type, a cancer subtype, a malignancy aggressiveness phenotype, and/or a drug responsivity phenotype. In one embodiment, the method further comprises performing one or more steps of the computer implemented method described herein.
- the disclosure provides a method for determining a cell type of a cell of interest from a sample comprising cell-free DNA derived from the cell of interest.
- the method comprises: obtaining sequence read data generated from the sample comprising cell-free DNA; performing the computer-implemented method described herein; and determining the cell type of the cell of interest based on the prediction provided by the computing system.
- determining the cell type comprises determining a cell phenotype. In one embodiment, determining the cell phenotype comprises determining a tissue type, a cancer type, a cancer subtype, a malignancy aggressiveness phenotype, and/or a drug responsivity phenotype. In one embodiment, determining the cell phenotype includes determining expression of one or more genes of interest.
- the disclosure provides a method of detecting the presence of a cancer cell in a subject, comprising: obtaining sequence read data generated from the sample comprising cell-free DNA obtained from the subject; performing the computer-implemented method described herein; and determining the presence of a cancer cell in the subject based on the prediction provided by the computing system.
- the method is performed a plurality of times over time, wherein the detected cancer cell(s) in the subject at each performance of the method are further characterized to determine a cancer subtype or phenotype of the detected cancer cell(s) based on the prediction provided by the computing system.
- the method is performed a plurality of times over time, and the method further comprises detecting a change in phenotype of the detected cancer cell(s) over time.
- the subject receives a cancer therapy between performances of the method, and the method further comprises determining the responsivity of the cancer cell(s) to the treatment.
- the disclosure provides a method of determining a cancer subtype of a target cancer cell from a sample comprising cell-free DNA derived from the target cancer cell.
- the method comprises: obtaining sequence read data generated from the sample comprising cell-free
- the sample is obtained from a subject with cancer.
- the cancer is characterized as metastatic breast cancer.
- determining the cancer subtype comprises determining whether the cancer is ER+ versus ER-. In one embodiment, determining the cancer subtype comprises determining whether the cancer is PR+ versus PR-. In one embodiment, determining the cancer subtype comprises determining whether the cancer is HER2+ versus HER2-. In one embodiment, determining the cancer subtype comprises determining two or all of: whether the cancer is ER+ versus ER-, whether the cancer is PR+ versus PR-, and whether the cancer is HER2+ versus HER2-.
- the cancer is characterized as metastatic prostate cancer.
- determining the cancer subtype comprises determining whether the cancer is AR+ (ARPC) versus AR-. In one embodiment, determining the cancer subtype comprises determining whether the cancer is ARPC versus AR-low. In one embodiment, determining the cancer subtype comprises determining whether the cancer has a neuroendocrine prostate cancer (NEPC) phenotype signature or not. In one embodiment, determining the cancer subtype comprises determining whether the cancer is amphicrine.
- determining the cancer subtype comprises determining two or all of: whether the cancer is AR+ (ARPC) or AR-, whether the cancer is AR-low or ARPC, whether the cancer has a neuroendocrine prostate cancer (NEPC) phenotype signature or not, whether the cancer is AR-low or NEPC, whether the cancer is amphicrine or ARPC or NEPC.
- the cancer is characterized as lung cancer.
- determining the cancer subtype comprises determining whether the cancer is small cell lung cancer (SCLC) or non-small cell lung cancer (NSCLC).
- the method further comprises determining whether the NSCLC is adenocarcinoma or squamous cell carcinoma.
- the sequence read data is generated from a panel of genomic targets.
- the panel of genomic targets comprises transcription factor binding sites (TFBSs) of one or more transcription factors associated with SCLC.
- the one or more transcription factors associated with SCLC comprise one or more of ASCL1, NEUROD1, POU2F3, REST, and the like, and the method comprises determining the nucleosome occupancy of the TFBSs.
- the TFBSs are identified by ChIP-seq data, or the like, and are retained in the panel if they are proximal to a transcription start site of a gene associated with lung cancer.
- the panel of genomic targets comprises transcription start sites (TSSs) for one or more markers associated with lung cancer, wherein the method comprises determining the nucleosome occupancy of the TSSs.
- the sample is obtained from a subject.
- the method can further comprise administering an effective treatment to the subject based on the determined cancer subtype.
- the method further comprises performing the method on a plurality of samples obtained from the subject at a plurality of distinct time points after an initial diagnosis of cancer.
- the sequence read data is generated by ultra-low pass whole genome sequencing.
- sequence read data is generated by a chromatin accessibility assay.
- sequence read data is generated in an ATAC-seq method.
- sequence read data is generated in a ChIP-seq method.
- sequence read data is generated in a DNase sensitivity assay.
- sequence read data is generated in a CUT&RUN assay.
- the CUT&RUN assay incorporates an affinity reagent that targets a post-translational modification to one or more of H3K27ac, H3K4me1 and H3K27ac.
- the method can further comprise generating the sequence read data.
- the sequence read data comprises sequence read data generated from a panel of genomic targets.
- the panel of genomic targets comprises transcription factor binding sites (TFBSs) of one or more transcription factors associated with a cancer type of interest.
- the method comprises determining the nucleosome occupancy of the TFBSs.
- the TFBSs are identified by ChIP-seq data, or the like, and are retained in the panel if they are proximal to a transcription start site of a gene associated with the cancer type of interest.
- the panel of genomic targets comprises transcription start sites (TSSs) for one or more markers associated with the cancer type of interest, wherein the method comprises determining the nucleosome occupancy of the TSSs.
- the sample can be blood, plasma, or serum, and the like.
- the disclosure provides a computer-implemented method of enhancing sequence read data from cell-free DNA samples for cell type prediction.
- the method comprises: receiving, by a computing system, sequence read data, wherein the sequence read data includes a plurality of fragment reads, and wherein each fragment read has a fragment length; determining, by the computing system, a fragment size variability for at least one gene associated with a cell type; and predicting, by the computing system, the cell type based on the fragment size variability for the at least one gene.
- determining the fragment size variability includes determining a fragment size coefficient of variation.
- predicting the cell type based on the genomic coverage distribution includes predicting a cell phenotype.
- predicting the cell phenotype includes predicting a cancer subtype.
- predicting the cell phenotype includes predicting a cancer subtype of prostate cancer.
- predicting the cancer subtype includes distinguishing between ARPC and NEPC.
- predicting the cell type based on the fragment size variability includes: generating one or more features based on the fragment size variability; providing the one or more features as input to a classifier model; and determining the cell type based on an output of the classifier model.
- generating the one or more features based on the fragment size variability includes generating a log2 fold change value of a fragment size coefficient of variation in a first cell type versus a second cell type.
- the log2 fold change value predicts at least one of gene expression and gene transcriptional activity between the first cell type and the second cell type.
- the first cell type is an ARPC cell and the second cell type is an NEPC cell.
- the classifier model includes a logistic regression model, an artificial neural network, a decision tree, a support vector machine, or a Bayesian network.
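The fragment size features described in the embodiments above can be sketched in a few lines. This is a minimal illustration, not the disclosed implementation: the function names, the use of the population standard deviation, and the fragment lengths are all assumptions made for the example.

```python
import math
import statistics


def fragment_size_cv(fragment_lengths):
    """Coefficient of variation of fragment lengths (std / mean).

    Assumption: the population standard deviation is used; the source
    does not say which estimator applies.
    """
    mean = statistics.fmean(fragment_lengths)
    return statistics.pstdev(fragment_lengths) / mean


def log2_fold_change(cv_first, cv_second):
    """log2 ratio of the CV in a first cell type versus a second cell type."""
    return math.log2(cv_first / cv_second)


# Hypothetical fragment lengths (bp) overlapping one gene in two cell types.
first_type_lengths = [150, 160, 167, 170, 180, 190]
second_type_lengths = [165, 166, 167, 167, 168, 169]

cv_first = fragment_size_cv(first_type_lengths)
cv_second = fragment_size_cv(second_type_lengths)
feature = log2_fold_change(cv_first, cv_second)
```

A positive `feature` means fragment sizes are more variable in the first cell type at that gene; one such value per gene could then be supplied as an input feature to a classifier model.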
- the disclosure provides a method for determining a cell type of a cell of interest from a sample comprising cell-free DNA derived from the cell of interest, comprising: obtaining sequence read data generated from the sample comprising cell-free DNA; performing the computer-implemented method described herein (e.g., relating to predicting the cell type based on the fragment size variability); and determining the cell type of the cell of interest based on the prediction provided by the computing system.
- determining the cell type comprises determining a cell phenotype. In one embodiment, determining the cell phenotype comprises determining a cancer subtype. In one embodiment, determining the cancer subtype includes distinguishing between ARPC and NEPC.
- the disclosure provides a method of detecting the presence of a cancer cell in a subject, comprising: obtaining sequence read data generated from a sample comprising cell-free DNA obtained from the subject; performing the computer-implemented method described herein (e.g., relating to predicting the cell type based on the fragment size variability); and determining the presence of a cancer cell in the subject based on the prediction provided by the computing system.
- the method is performed a plurality of times over time, wherein the detected cancer cell(s) in the subject at each performance of the method are further characterized to determine a cancer subtype or phenotype of the detected cancer cell(s) based on the prediction provided by the computing system. In one embodiment, the method is performed a plurality of times over time, and wherein the method further comprises detecting a change in phenotype of the detected cancer cell(s) over time. In one embodiment, the subject receives a cancer therapy between performances of the method, wherein the method further comprises determining the responsivity of the cancer cell(s) to the treatment.
- In another aspect, the disclosure provides a method of determining a cancer subtype of a target cancer cell from a sample comprising cell-free DNA derived from the target cancer cell, the method comprising: obtaining sequence read data generated from the sample comprising cell-free DNA; performing the computer-implemented method described herein (e.g., relating to predicting the cell type based on the fragment size variability); and determining the cell type of the originating cell based on the predicted cancer subtype provided by the computing system.
- the sample is obtained from a subject with cancer.
- the cancer is characterized as metastatic prostate cancer.
- determining the cancer subtype comprises determining whether the cancer is AR+ (ARPC) versus AR-.
- determining the cancer subtype comprises determining whether the cancer is ARPC versus AR-low prostate cancer (ARLPC).
- determining the cancer subtype comprises determining whether the cancer has a neuroendocrine prostate cancer (NEPC) phenotype signature or not.
- the sample is obtained from a subject and the method further comprises administering an effective treatment to the subject based on the determined cancer subtype.
- the method further comprises performing the method on a plurality of samples obtained from the subject at a plurality of distinct time points after an initial diagnosis of cancer.
- the sequence read data is generated by ultra-low pass whole genome sequencing. In one embodiment, the sequence read data is generated by a chromatin accessibility assay. In one embodiment, the sequence read data is generated in an ATAC-seq method. In one embodiment, the sequence read data is generated in a ChIP-seq method. In one embodiment, the sequence read data is generated in a DNAse sensitivity assay. In one embodiment, the sequence read data is generated in a CUT&RUN assay. In one embodiment, the CUT&RUN assay incorporates an affinity reagent that targets a post-translational modification to one or more of H3K27ac, H3K4me1 and H3K27ac. In one embodiment, the method further comprises generating the sequence read data.
- the sequence read data is generated from a panel of genomic targets.
- the panel of genomic targets comprises transcription factor binding sites (TFBSs) of one or more transcription factors associated with a cancer type of interest.
- the method comprises determining the nucleosome occupancy of the TFBSs.
- TFBSs are identified by ChIP-seq data, or the like, and are retained in the panel if they are proximal to a transcription start site of a gene associated with the cancer type of interest.
- the panel of genomic targets comprises transcription start sites (TSSs) for one or more markers associated with the cancer type of interest, wherein the method comprises determining the nucleosome occupancy of the TSSs.
- the sample is blood, plasma, or serum.
- FIGURE 1 is a flowchart that illustrates a non-limiting example embodiment of a method of cancer subtype prediction according to various aspects of the present disclosure.
- FIGURE 2 is a flowchart that illustrates a non-limiting example embodiment of a procedure for determining informative sites for tissue, cell-type, cancer-type, or cancer-subtype of interest and filtering to identify cancer subtype-specific informative sites according to various aspects of the present disclosure.
- FIGURE 3 is a flowchart that illustrates a non-limiting example embodiment of a procedure for determining a GC frequency matrix for a genome according to various aspects of the present disclosure.
- FIGURE 4 is a flowchart that illustrates a non-limiting example embodiment of a procedure for using a GC frequency matrix to determine GC bias values for sequence read data according to various aspects of the present disclosure.
- FIGURE 5 is a flowchart illustrating a non-limiting example embodiment of a procedure for using GC bias values to generate a nucleosome profile of sequence read data for subtype-specific informative sites according to various aspects of the present disclosure.
- FIGURE 6 is a block diagram that illustrates aspects of an exemplary computing device appropriate for use as a computing device of the present disclosure.
- FIGURES 7A and 7B illustrate the Griffin framework for cfDNA nucleosome profiling to predict cancer subtypes and tumor phenotype.
- FIGURE 7A is an illustration of a group of accessible sites (left panel) and inaccessible sites (right panel), such as a TFBS.
- the nucleosomes (in grey) are positioned in an organized manner around the accessible sites (box; left panel), but not around the inaccessible ones (right panel). These nucleosomes protect the DNA from degradation when it is released into peripheral blood.
- the protected fragments from the plasma are sequenced and aligned, leading to a coverage profile which reflects the nucleosome protection in the cells of origin.
- FIGURE 7B is a schematic showing the Griffin workflow for cfDNA nucleosome profiling analysis.
- cfDNA whole genome sequencing (WGS) data with >0.1x coverage is aligned to the hg38 genome build.
- Sites of interest are selected from any assay. Paired-end reads aligned to each site are collected, fragment midpoint coverage is counted, and corrected for GC bias to produce a coverage profile.
- Coverage profiles from all sites in a group e.g., open chromatin for tumor subtype
- Composite profiles are normalized using the surrounding region (-5 kb to +5 kb).
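The workflow steps above (midpoint counting, GC correction, and normalization to the surrounding region) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the Griffin implementation: the function names are hypothetical, fragments are assumed to lie on a single chromosome, GC correction is assumed to be a per-fragment weight of 1/bias looked up by (fragment length, GC count), and sites are averaged rather than summed.

```python
def midpoint_profile(fragments, site, gc_bias, flank):
    """GC-corrected fragment-midpoint coverage in [site - flank, site + flank].

    fragments: iterable of (start, end, gc_count) on one chromosome.
    gc_bias:   dict mapping (fragment_length, gc_count) -> bias value;
               fragments missing from the dict are left uncorrected (bias 1.0).
    """
    profile = [0.0] * (2 * flank + 1)
    for start, end, gc_count in fragments:
        length = end - start
        offset = (start + length // 2) - site
        if -flank <= offset <= flank:
            bias = gc_bias.get((length, gc_count), 1.0)
            profile[offset + flank] += 1.0 / bias  # down-weight over-represented fragments
    return profile


def composite_profile(fragments, sites, gc_bias, flank=5000):
    """Average the per-site profiles, then normalize to the mean of the window."""
    total = [0.0] * (2 * flank + 1)
    for site in sites:
        for i, value in enumerate(midpoint_profile(fragments, site, gc_bias, flank)):
            total[i] += value
    averaged = [value / len(sites) for value in total]
    window_mean = sum(averaged) / len(averaged)
    if window_mean == 0:
        return averaged  # no coverage in the window; nothing to normalize
    return [value / window_mean for value in averaged]
```

With the default `flank=5000`, the normalization window corresponds to the -5 kb to +5 kb region named above; after normalization the profile has a mean of 1, so a dip below 1 near offset 0 indicates reduced nucleosome protection at the site.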
- FIGURES 8A to 8G illustrate that Griffin GC bias correction improves detection of tissue specific accessibility from cfDNA.
- FIGURE 8A graphically illustrates the aggregated GC content at 10,000 GRHL2 binding sites and their surrounding 2 kb region. Mean GC content (line) and interquartile range (shading) are shown.
- FIGURE 8B graphically illustrates cfDNA GC bias is unique to each sample and each fragment length. GC bias computed for cfDNA from a healthy donor (HD_46; dashed shades) and a metastatic breast cancer (MBC_315; solid shades) sample are shown for various fragment sizes.
- FIGURE 8C graphically illustrates the composite coverage profile of 10,000 GRHL2 binding sites before and after GC correction, shown for HD_46 (dashed) and MBC_315 (solid).
- the 'central coverage' has a higher value due to effects of GC bias, which can obscure differential signals between samples.
- the central coverage of the MBC sample has lower value, which is consistent with increased GRHL2 activity in breast cancer but not immune cells making up the healthy donor sample.
- FIGURE 8D graphically illustrates composite coverage profiles of 10,000 LYL1 sites before and after GC correction, shown for two MBC samples with deep WGS (9-25x, orange), two healthy donors (17-20x, green), and 191 MBC samples with ULP-WGS (0.1-0.3x, blue).
- cfDNA contains a mixture of tumor and blood cells; therefore, central coverage value is expected to be positively correlated with tumor fraction (lower represents increased accessibility).
- the boxed range represents the median ± IQR
- whiskers represent the range of the non-outlier data (maximum extent is 1.5x the IQR).
- Outliers are plotted in grey. The p-value was calculated using the Wilcoxon signed-rank test (two-sided).
- FIGURE 8G illustrates boxplots showing the distribution of the mean absolute deviation (of the central coverage across 215 healthy donors [1-2x WGS]) across the 377 TFs, before and after GC correction. Box elements are the same as (8F). The p-value was calculated using the Wilcoxon signed-rank test (two-sided).
- FIGURES 9A and 9B illustrate that Griffin enables accurate cancer detection and tissue-of-origin prediction.
- FIGURE 9A illustrates receiver operator characteristic (ROC) curve for logistic regression classification of cancer vs. healthy controls in three datasets, the DELFI dataset (Cristiano, S. et al. Genome-wide cell-free DNA fragmentation in patients with cancer. Nature 570, 385-389 (2019)), LUCAS dataset, and LUCAS validation dataset (Mathios, D. et al. Detection and characterization of lung cancer using cell-free DNA fragmentomes. Nat Commun 12, 5060 (2021)).
- FIGURE 9B illustrates boxplots of the AUC values for 1000 bootstrap iterations.
- the boxed range represents the median ± IQR
- whiskers represent the range of the non-outlier data (maximum extent is 1.5x the IQR). Values below the boxplots show the median and 95% confidence interval.
- FIGURES 10A to 10H illustrate that Griffin enables accurate prediction of breast cancer estrogen receptor subtypes from ultra-low pass WGS.
- FIGURE 10C illustrates a CoMut (Crowdis, J., He, M. X., Reardon, B. & Van Allen, E. M. CoMut: visualizing integrated molecular information with comutation plots. Bioinformatics 36, 4348-4349 (2020)) plot showing information about 101 MBC patients with >0.10 tumor fraction. The top row shows the ER status used for training and assessing the regression model. For most patients, this was the metastatic ER status obtained from IHC; if the metastatic ER status was not available, the primary ER status was used.
- FIGURE 10D is a receiver operator characteristic (ROC) curve for a logistic regression model predicting ER+ and ER- subtype. ROC curve, accuracy and AUC are shown for all patients and for patients grouped by tumor fraction (TFx), 0.05-0.1 and >0.1. 95% CIs were obtained by bootstrapping.
- FIGURE 10E graphically illustrates performance of the model on samples from three validation cohorts. For patients with multiple timepoints, the first sample was used.
- FIGURE 10F graphically illustrates subtype prediction in patients separated by clinical metastatic ER status and clinical primary tumor ER status. P-values were calculated using a Fisher's exact test (two-sided).
- FIGURE 10G illustrates the ROC curve for predicting ER loss among patients with a primary ER positive tumor. The 95% CI was obtained by bootstrapping.
- FIGURE 10H illustrates the timeline for two patients (MBC_1413 and MBC_1099) with multiple biopsies of different subtypes and multiple cfDNA samples.
- ER+ prediction probability is shown for all cfDNA samples that passed the >0.05 tumor fraction and 0.1x coverage thresholds. The decision boundary for ER+ (>0.5) and ER- (<0.5) is indicated with a dotted line. Timelines in months from metastatic diagnosis to death are shown for each patient. For patient MBC_1413, a metastatic biopsy (pleural fluid) was taken on the day of metastatic diagnosis and indicated ER- disease. However, approximately 7 months later, another metastatic biopsy (liver) showed weak ER+ staining (5%). A final biopsy (pleural fluid), taken at approximately 12 months, showed ER- staining once again.
- two ER- biopsies were taken at 0 months (bone) and 7 months (liver).
- cfDNA was drawn after this point, however between the two cfDNA draws, another biopsy (liver) indicated the presence of low level ER+ disease.
- FIGURES 11A and 11B illustrate the workflow for characterizing advanced prostate cancer through matched tumor and liquid biopsies from PDX models.
- FIGURE 11A, top panel, illustrates that blood and tissue samples were taken from 26 patient-derived xenograft (PDX) mouse models with tumors originating from metastatic castration-resistant prostate cancer (mCRPC) with AR-positive adenocarcinoma (ARPC), neuroendocrine prostate carcinoma (NEPC) and AR-low non-neuroendocrine prostate carcinoma (ARLPC) phenotypes.
- cfDNA was extracted from pooled plasma collected from 7-10 mice and whole genome sequencing (WGS) was performed.
- FIGURE 11A, middle panel, illustrates two distinct ctDNA features that were analyzed at transcription factor binding sites (TFBSs) and open chromatin sites throughout the genome using Griffin (see Example 1 and Doebley et al. (2021). Griffin: Framework for clinical cancer subtyping from nucleosome profiling of cell-free DNA. MedRxiv 2021.08.31.21262867 and Methods).
- FIGURE 11A, bottom right panel, shows phenotype classification: a probabilistic model that accounted for ctDNA tumor content and was informed by PDX features was applied to 159 samples in three patient cohorts.
- FIGURE 11B illustrates PDX phenotypes and mouse plasma sequencing. Inclusion status is based on final mean depth after mouse read subtraction (samples with <3x coverage were excluded unless AR coordinate amplification signal was reliably detected; lower dotted line). Phenotype status, including 6 NEPC, 18 ARPC (2 excluded), and 2 ARLPC. Average depth of coverage before and after mouse subtraction (mean coverage 20.5x; upper dotted line). Percentage of the cfDNA sample that contains human ctDNA after mouse read subtraction.
- FIGURES 12A to 12G illustrate the analysis of tumor histone modifications and ctDNA reveals nucleosome patterns consistent with transcriptional regulation in CRPC phenotype-specific genes.
- FIGURE 12A illustrates H3K27ac peak signals between ARLPC, ARPC, and NEPC PDX tumor phenotypes at 10,000 AR binding sites (left) and at ASCL1 binding sites (right). Binding sites were selected from the GTRD (Yevshin et al. (2019). GTRD: a database on gene transcription regulation — 2019 update. Nucleic Acids Res 47, D100-D105) (Methods).
- FIGURES 12B and 12C graphically illustrate composite coverage profiles at 1000 AR (12B) and ASCL1 (12C) binding sites in ctDNA analyzed using Griffin. Coverage profile means (lines) and 95% confidence interval with 1000 bootstraps (shading) are shown. The region ±150 bp is indicated with vertical dotted lines and yellow shading.
- FIGURE 12D is a heatmap of log2 fold change in key genes up and down regulated between ARPC and NEPC established through RNA-Seq (left) grouped by the type of histone modification which dictates translation levels: Group 1 shows genes where the predominant PTM mark is attributed to H3K27ac or H3K4me1 active marks in the gene promoters or putative distal enhancers, lacking the H3K27me3 heterochromatic mark in the gene body; Group 2 features gene body spanning H3K27me3 repression marks. Central columns show differential peak intensity for each of the assayed histone modifications, separated by whether they appear upstream or in the promoter or the body of each gene.
- FIGURE 12E graphically illustrates a comparison of the log2 fold change (ARPC vs. NEPC) of mean mRNA expression vs mean coefficient of variation (CV) in the 47 phenotypic lineage marker genes' promoter regions.
- FIGURE 12F (top) provides illustrations of expected ctDNA coverage profiles for Group 1 genes with and without H3K27ac or H3K4me1 modification leading to active and inactive transcription, respectively.
- FIGURE 12F (bottom) shows the ±1000 bp surrounding the promoter region for AR and ASCL1 in ARPC and NEPC.
- FIGURE 12G is an illustration of expected ctDNA coverage profiles for Group 2 genes with repressed transcription caused by H3K27me3 modifications in the gene body.
- Neuronal gene UNC13A has increased nucleosome phasing in ctDNA of ARPC samples compared to NEPC.
- This list of TFs was initially selected as having differential expression between ARPC and NEPC from LuCaP PDX RNA-Seq analysis.
- Heatmap colors indicate increased accessibility (low values; lighter) and decreased accessibility (higher values; darker) in ctDNA.
- TFs with increased accessibility in NEPC samples (log2-fold-change > 0.05, Mann-Whitney U test p < 0.05) are indicated with red text; increased accessibility in ARPC (log2-fold-change < -0.05, p < 0.05) are indicated with blue text.
- FIGURES 14A to 14G illustrate comprehensive evaluation of ctDNA features throughout the genome for CRPC phenotype classification in PDX models.
- FIGURE 14A illustrates a volcano plot of log2-fold change of ATAC-Seq peak intensity between 5 ARPC and 5 NEPC lines; the dotted line demarcates sites by q-value < 0.05.
- FIGURES 14B and 14C graphically illustrate composite coverage profiles at open chromatin sites specific to ARPC (14B) and NEPC (14C) PDX tumors analyzed by Griffin. Sites from (14A) were filtered for overlap with known TFBSs in 338 factors from GTRD (Yevshin et al. (2019). Nucleic Acids Res 47, D100-D105).
- FIGURE 14E graphically illustrates performance of classifying ARPC vs NEPC PDX from ctDNA using supervised machine learning (XGBoost) in various region types (all genes, TFBSs, and open regions, Methods). Area under the receiver operating characteristic curve (AUC) with 95% confidence interval (100 repeats of stratified cross validation) is shown for performance of all feature types.
- FIGURE 14F shows example composite coverage profiles at open chromatin sites specific to ARPC (left) and NEPC (right) identified in 14B-14C. Simulated admixtures generated using ARPC mixed with healthy donor (HD) (left) and NEPC mixed with HD (right) are shown for varying tumor fractions.
- FIGURE 14G graphically illustrates performance for classification on admixture samples using the probabilistic mixture model.
- Five ctDNA admixtures were generated for each phenotype from PDX lines, each at various sequencing coverages and tumor fractions. In total, 125 admixtures were evaluated. The mean AUC across the 5 admixtures is shown for each configuration.
- FIGURES 15A to 15C illustrate accurate classification of NEPC phenotypes from plasma in three patient cohorts using a probabilistic model informed by PDX ctDNA features.
- FIGURE 15A graphically illustrates the receiver operating characteristic (ROC) curve for 101 mCRPC patients (DFCI cohort I) with ultra-low-pass WGS (ULP-WGS) data. The optimal performance of 90.4% sensitivity (for predicting NEPC) and 97.5% specificity (for predicting ARPC) corresponding to a prediction score cutoff of 0.3314 is indicated with horizontal and vertical dotted lines, respectively.
- FIGURE 15B illustrates prediction scores for 11 plasma samples from seven patients (DFCI cohort II) with both WGS and ULP-WGS data.
- the 0.3314 score cutoff threshold (dotted line) was used for classifying NEPC and ARPC. Tumor fractions were estimated by ichorCNA from WGS data. Patients were treated for adenocarcinoma (ARPC) or had high PSA values.
- FIGURE 15C illustrates prediction scores for 47 plasma samples with clinical phenotypes comprising 26 ARPC, 5 NEPC, and 16 mixed or ambiguous phenotypes (triangles), including double-negative prostate cancer (DNPC). Scores are shown for WGS and ULP-WGS (0.1X) for the same ctDNA sample.
- the cutoff threshold of 0.3314 (dotted line) was used for classifying NEPC and ARPC. Tumor fractions were estimated by ichorCNA on the WGS data.
- FIGURE 16 is a schematic of an integrated, non-invasive targeted sequencing assay based on cfDNA for detection of genetic mutations and prediction of key tumor epigenetic features in SCLC.
- FIGURES 17A and 17B illustrate the detection of transcription factor (TF) expression in SCLC models using targeted sequencing of cfDNA.
- FIGURE 17A is a schematic of experimental workflow for proof-of-concept negative control ("healthy donor") and positive control ("flank tumors" from SCLC cellular models) samples.
- FIGURE 17B graphically illustrates aggregated coverage across TFBSs in targeted sequencing data for healthy donors (top row) and flank tumors (bottom row).
- the TFBS is expected to be located at position 0 on the x axis. Data are color-coded by expected TF expression. Healthy donor-derived cfDNA is expected to reflect REST expression but not ASCL1, NEUROD1, or POU2F3. In SCLC models, systematic differences in coverage distribution as a function of TF expression are apparent.
- FIGURES 18A to 18C illustrate transcription factor activity inference using TFBS coverage distributions from SCLC patient samples with available matched tumor gene expression data.
- FIGURE 18A graphically illustrates aggregated coverage across TFBSs in targeted sequencing data for healthy donors (top row) and patients with SCLC (bottom row) for whom matched tumor tissue with gene expression data was available. Samples are color-coded by expected TF expression. Systematic differences in coverage distribution as a function of expected TF expression are again apparent.
- FIGURE 18B illustrates gene expression of key genes in selected patient samples displayed as a heatmap. Cells are color coded by Z-score and the inset text is the log2(TPM+1).
- FIGURE 18C illustrates peak to trough amplitude calculated from coverage distributions at TFBS in each patient sample displayed as a heatmap. The amplitude is displayed by color and also as inset text. Trough depth magnitude corresponds to gene expression of the key TFs in these bona fide SCLC patient samples.
- FIGURE 19 is a series of graphs illustrating quantification of transcription factor binding site peak to trough amplitude across sample types. The distribution of TFBS peak to trough amplitude, calculated from aggregated coverage distributions, is shown according to the expected ground truth of TF expression.
- ASCL1 site peak to trough amplitude is associated with both SCLC status and ASCL1 positivity, while NEUROD1 and POU2F3 peak to trough amplitude is associated only with TF positivity.
- FIGURES 20A and 20B graphically illustrate gene expression inference using TSS coverage distributions in flank tumor positive control samples.
- FIGURE 20A illustrates TSS coverage distribution from targeted sequencing of cfDNA, grouped by gene expression quintile in SCLC flank tumor models (quintiles 1-5) and blood ("B", dark blue). Shown are 1,912 TSS corresponding to 1,213 genes, which were selected based on low expression in whole blood and correlation between TSS coverage distribution and gene expression. TSS coverage distribution varies systematically according to expression of the corresponding gene.
- FIGURE 20B illustrates receiver operating characteristic curves for prediction of gene expression as above or below a threshold value (shown for thresholds of 0.1, 0.5, 1.0, and 2.0), as inferred from the coverage distribution of the corresponding TSS.
- An estimator of gene expression was calculated from the TSS coverage profile as the magnitude of the difference of the average coverage depth at positions +130 and +145 relative to the TSS minus the average depth at positions -45, -30, and -15 (shown as a dotted line in 20A).
- the AUC of the ROC curve is shown in parentheses for each gene expression cutoff. TSS coverage distributions can be used to predict whether a gene is expressed above or below a certain value with good test characteristics in this preliminary analysis that is restricted to especially variable, and therefore challenging, genes.
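The estimator described for FIGURE 20B is simple arithmetic on a TSS coverage profile and can be written out directly. The sketch below is illustrative only: the function name and the dictionary representation of the profile (offset relative to the TSS mapped to coverage depth) are assumptions, while the offsets themselves are the ones stated above.

```python
def tss_expression_estimator(coverage_by_offset,
                             downstream_offsets=(130, 145),
                             upstream_offsets=(-45, -30, -15)):
    """Magnitude of (mean depth at +130, +145) minus (mean depth at -45, -30, -15).

    coverage_by_offset: dict mapping position relative to the TSS (bp)
    to coverage depth at that position.
    """
    downstream = sum(coverage_by_offset[o] for o in downstream_offsets) / len(downstream_offsets)
    upstream = sum(coverage_by_offset[o] for o in upstream_offsets) / len(upstream_offsets)
    return abs(downstream - upstream)


# Hypothetical normalized coverage values at the five offsets used by the estimator.
example_profile = {-45: 0.5, -30: 0.4, -15: 0.6, 130: 1.2, 145: 1.0}
estimate = tss_expression_estimator(example_profile)
```

For the hypothetical values shown, the downstream mean is 1.1 and the upstream mean is 0.5, so the estimator evaluates to approximately 0.6; thresholding such a value is one way to produce the above/below calls evaluated by the ROC curves.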
- FIGURES 21A to 21C are a series of graphs illustrating use of aggregated coverage profiles across large rationally selected subsets of the TSS panel for prediction of SCLC vs NSCLC status in lung cancer PDX models and patient samples.
- an amplitude feature was calculated from each coverage distribution curve as the difference between the coverage at the -45 position and the +120 position relative to the TSS, facilitating comparison within and between samples.
- FIGURES 22A and 22B are a series of graphs illustrating use of aggregated coverage profiles across large rationally selected subsets of the TSS panel for prediction of SCLC vs NSCLC status in lung cancer PDX models (22A) and patient samples (22B).
- An SCLC PDX that transdifferentiated from an adenocarcinoma is identified with a thick red line.
- FIGURE 23 is a flowchart that illustrates a non-limiting example embodiment of a method of cell (e.g., cancer, e.g., prostate cancer) subtype prediction according to an aspect of the present disclosure.
- the present disclosure is based on the inventors' development of a facile and sensitive approach to assess the chromatin architecture from cell-free DNA (cfDNA), and to provide accurate signal to detect and differentiate cell and/or tissue phenotypes based on the determined chromatin architecture.
- Sequencing analysis of ctDNA to detect genomic alterations has also served to classify some subsets of tumors based on genetic differences.
- studying the tumor phenotype from ctDNA remains challenging and is still a nascent area of research.
- in the bloodstream, cfDNA is protected from degradation by nucleosomes and other DNA binding proteins, leading to a coverage pattern that reflects the genomic organization in the cells-of-origin.
- the genomic organization includes patterns of chromatin accessibility and transcriptional regulation, which, in turn, drive the differential phenotypes of the cells of origin.
- cfDNA can provide a non-invasive route to identify tumor subtypes through the analysis of tumor phenotypes beyond the traditional analysis of genotype, which involves DNA alterations.
- the inventors have addressed the shortcomings of the art to produce a facile, robust, and sensitive approach to detecting and differentiating cell phenotypes.
- the approach is based in part on a core method, called "Griffin", to examine nucleosome protection and chromatin accessibility by quantifying cfDNA fragments around accessible sites.
- Griffin implements fragment length-based GC correction to remove GC biases that obscure signals, which are especially prevalent in ULP-WGS applications (e.g., as low as 0.1x coverage of WGS).
- Griffin is flexible to analyze any region throughout the genome that may be informative for differential chromatin accessibility between cell/tissue/cancer phenotype settings. For example, key transcriptional factors distinguishing between tumor subtypes can be predicted using Griffin via the analysis at binding sites of these transcription factors. Furthermore, Griffin can be applied to a variety of input data developed by different assay approaches to study chromatin architecture and accessibility, including ATAC-seq, ChIP-seq, transcription factor profiling data, CUT&RUN, and the like. Moreover, in sharp contrast to existing technologies, Griffin can address countless hypotheses by enabling the analysis of multiple 'omics', such as the following:
- the Griffin approach is adaptable to existing ctDNA sequencing techniques and, thus, permits scalability, adaptability, and accessibility, even from ULP-WGS data, which is highly susceptible to bias and signal obfuscation.
- Major applications of the approach include tumor (subtype) classification, identification of mixed histologies/phenotypes, detection of potential subtype switches (transdifferentiation) during therapy in "real time", and prediction of biomarkers (e.g., ARv7 splice variant) that can signal therapy resistance.
- the disclosure provides a computer-implemented method of enhancing sequence read data from cell-free DNA samples for cell type prediction.
- cell type prediction is used in a general sense to refer to predicting the identity of, or a characteristic of, a cell of origin (i.e., a cell contributing DNA in the cfDNA sample).
- the characteristic can be a distinguishable phenotype compared to cells with a same or similar developmental lineage, including developmental lineages with a transformation event (i.e., for cancer cells).
- the characteristic can be a distinguishable developmental lineage compared to a distinct developmental lineage.
- the method encompasses predicting or differentiating among different cell lineages, different tissue types, different tissue subtypes, different cancer types, different cancer subtypes (i.e., subtypes of the same cancer type), and the like.
- the only requirement is that the cell type, as broadly defined, be distinguishable by a unique nucleosome occupancy and/or chromatin accessibility profile.
- the method comprises: receiving, by a computing system, sequence read data, wherein the sequence read data includes a plurality of fragment reads, wherein each fragment read has a fragment length and a GC content indicating a percentage of bases in the fragment read that are G or C; determining, by the computing system, GC bias values for each fragment read based on the fragment length and the GC content of the fragment read; generating, by the computing system, a genomic coverage distribution that is adjusted for GC bias using the sequence read data and the GC bias values; and predicting, by the computing system, the cell type based on the genomic coverage distribution.
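One plausible way to derive GC bias values keyed by fragment length and GC content is to compare how often each combination is observed in the sample with how often it is expected from the reference genome. The sketch below makes that assumption explicit: the observed/expected ratio, the per-length normalization to a mean of 1, and all names are illustrative choices rather than the procedure defined in the disclosure.

```python
from collections import Counter, defaultdict


def gc_bias_values(observed_fragments, genome_gc_frequency):
    """Estimate a GC bias value for each (fragment_length, gc_count) bin.

    observed_fragments:  iterable of (fragment_length, gc_count), one per read.
    genome_gc_frequency: dict mapping (fragment_length, gc_count) to the number
                         of times that combination is expected from the genome.
    Assumption: bias = observed / expected, rescaled so the mean bias across
    GC bins is 1.0 within each fragment length.
    """
    observed = Counter(observed_fragments)
    ratios = {
        key: observed.get(key, 0) / expected
        for key, expected in genome_gc_frequency.items()
        if expected > 0
    }

    # Collect the ratios per fragment length so each length is normalized separately.
    by_length = defaultdict(list)
    for (length, _gc_count), ratio in ratios.items():
        by_length[length].append(ratio)

    bias = {}
    for (length, gc_count), ratio in ratios.items():
        mean_ratio = sum(by_length[length]) / len(by_length[length])
        bias[(length, gc_count)] = ratio / mean_ratio if mean_ratio > 0 else 1.0
    return bias
```

A bias above 1 marks a bin that is over-represented in the sample relative to the genome; weighting each fragment by the reciprocal of its bias when counting coverage is one way to obtain a coverage distribution adjusted for GC bias.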
- FIG. 1 is a flowchart that illustrates a non-limiting example embodiment of a method of cell type prediction according to various aspects of the present disclosure.
- the method 100 includes use of the GRIFFIN techniques described elsewhere herein to enable meaningful features to be extracted from short nucleic acid sequences of cancer DNA obtained from sequencing of cell-free DNA fragments in a sample.
- the method 100 may be used for various different types of cell type prediction, including but not limited to tissue type prediction, cell type prediction, cancer type prediction, and cancer subtype prediction.
- the method 100 proceeds to subroutine block 102, where genomic regions of interest are determined and filtered to identify cell-type-informative sites.
- Any suitable technique for determining and filtering cell-type-informative sites may be used, and different techniques will likely be used for different types of cancer, different molecular subtypes of a cancer type, different tissues, different cell types, and different types of assays.
- One non-limiting example embodiment of a suitable procedure for determining and filtering cell-type-informative sites is illustrated in FIG. 2 and described in further detail below.
- a GC frequency matrix is determined for combinations of fragment lengths and GC content.
- fragments having certain amounts of G and C bases (“GC content”) will be overrepresented in the sequence read data. This bias is not constant, as fragments of different sizes will have different GC biases.
- One non-limiting example technique for determining a GC frequency matrix is illustrated in FIG. 3 and described in further detail below.
- subroutine block 102 and subroutine block 104 may be performed on reference genome data before obtaining a sample or sequence data to be analyzed.
- sequence read data is received.
- the sequence read data represents sequence reads generated for a sample obtained from a subject.
- the sequence read data may be obtained from an archive or other previously obtained sample.
- the GC frequency matrix is used to determine GC bias values for the sequence read data. Any suitable technique may be used in subroutine block 108, including but not limited to the non-limiting example illustrated in FIG. 4 and described in further detail below.
- the GC bias values are used to generate a genomic coverage distribution of the sequence read data for the cell-type-informative sites.
- any suitable technique may be used in subroutine block 110, including but not limited to the non-limiting example illustrated in FIG. 5 and described in further detail below.
- features are extracted from the genomic coverage distribution. Any features suitable for use with a classifier model may be extracted, and may depend on the type of classifier model used, the assay that generated the sequence reads, and/or the cell type (e.g., type of cancer, cancer subtypes, tissue, or cell type) to be detected. As one non-limiting example, for estrogen receptor (ER) subtyping in breast cancer, three features may be extracted: mean coverage, central coverage, and amplitude.
- Mean coverage may be extracted by determining the mean coverage in a window around an informative site.
- the window around the informative site for determining mean coverage may be any suitable size, including but not limited to a range from 1800-2200 bp (from +/- 900 bp to +/- 1100 bp).
- a suitable size for the window for determining mean coverage is 2000 bp (+/- 1000 bp).
- Central coverage may be extracted by determining the mean coverage in a smaller window around the informative site.
- the window around the informative site for determining central coverage may be any suitable size, including but not limited to a range from 40-80 bp (from +/- 20 bp to +/- 40 bp).
- a suitable size for the window for determining central coverage is 60 bp (+/- 30 bp).
- Amplitude may be extracted by trimming the genomic coverage distribution to an area that includes a given number of peaks (such as an area of +/- 960 bp that contains 10 peaks), performing a fast Fourier transform, and taking the magnitude of a frequency based on the given number of peaks (e.g., the 10th frequency for the area that contains 10 peaks).
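- As a minimal sketch of the three features described above (mean coverage, central coverage, and amplitude), the following assumes the genomic coverage distribution is available as a list with one normalized coverage value per base pair. The function name `extract_features` and its parameters are hypothetical and used only for illustration; the amplitude is computed as the magnitude of a single discrete Fourier transform coefficient, which matches the corresponding bin of a fast Fourier transform.

```python
import cmath
import math


def extract_features(profile, center, mean_half=1000, central_half=30,
                     amp_half=960, n_peaks=10):
    """Return (mean coverage, central coverage, amplitude) for one profile.

    profile: list of coverage values, one per bp (hypothetical input layout).
    center:  index of the informative site within `profile`.
    """
    def window_mean(half):
        window = profile[center - half:center + half]
        return sum(window) / len(window)

    mean_coverage = window_mean(mean_half)        # e.g. +/- 1000 bp
    central_coverage = window_mean(central_half)  # e.g. +/- 30 bp

    # Trim to the area expected to hold `n_peaks` peaks, then take the
    # magnitude of the n_peaks-th Fourier coefficient.
    trimmed = profile[center - amp_half:center + amp_half]
    n = len(trimmed)
    coefficient = sum(
        value * cmath.exp(-2j * math.pi * n_peaks * t / n)
        for t, value in enumerate(trimmed)
    )
    return mean_coverage, central_coverage, abs(coefficient)
```

- With the default arguments, a profile must extend at least 1000 bp on each side of `center`; the resulting three values could then be supplied to a classifier model.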
- the features are provided as input to a classifier model to predict the cell subtype.
- a classifier model may be used.
- the classifier model may be a logistic regression model.
- the method 100 then proceeds to an end block and terminates.
- further action may be taken once the cancer subtype is determined, including but not limited to an appropriate cancer diagnosis, identifying cancer subtype change or switch, recommending a new course of treatment, altering an existing course of treatment, or any other appropriate action.
- cfDNA released by hematopoietic cells which leads to a lower ctDNA fraction (i.e., tumor fraction).
- an unsupervised probabilistic model was developed to estimate the proportion of cell types contributing to an individual plasma sample.
- A feature of this model is the explicit modeling of the ctDNA tumor fraction in patients.
- the input into this model includes signals generated from patient-derived xenografts (PDXs).
- PDXs provide a resource that is ideal for studying the properties of ctDNA, developing new analytical tools, and validating both genetic and phenotypic features by comparison to matching tumors.
- using estimates of ctDNA fraction and these input PDX signals, the model applies a statistical mixture model approach to estimate the mixture weight parameter that represents the proportion of cell types.
- the mixture weight parameter may be used as a prediction score to classify cell types, such as ARPC and NEPC, as discussed below in Example 2 and illustrated in FIG. 14-15.
- Other cell types, such as phenotypes and subtypes can also be modeled and predicted using this framework.
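- To illustrate how a mixture weight can be recovered once the tumor fraction is known, the sketch below uses a simple least-squares estimate. This is a stand-in for illustration and not the probabilistic model of the disclosure. It assumes, hypothetically, that an observed feature vector is a linear blend of two reference signals (e.g., PDX-derived signals for two cell types) scaled by the tumor fraction, plus a background signal from non-tumor cfDNA. All names are illustrative.

```python
def estimate_mixture_weight(observed, signal_a, signal_b, background,
                            tumor_fraction):
    """Least-squares estimate of the weight w of cell type A in the tumor.

    Assumed model (illustrative only), applied feature by feature:
        observed = tumor_fraction * (w * signal_a + (1 - w) * signal_b)
                   + (1 - tumor_fraction) * background
    """
    diff = [a - b for a, b in zip(signal_a, signal_b)]
    residual = [
        y - (1 - tumor_fraction) * h - tumor_fraction * b
        for y, h, b in zip(observed, background, signal_b)
    ]
    denominator = tumor_fraction * sum(d * d for d in diff)
    if denominator == 0:
        raise ValueError("tumor fraction is zero or reference signals are identical")
    weight = sum(r * d for r, d in zip(residual, diff)) / denominator
    # Clip to the valid range for a proportion.
    return min(1.0, max(0.0, weight))
```

- Under these assumptions, a weight near 1 would favor the first cell type and a weight near 0 the second, so the value can serve as a prediction score in the sense described above.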
- FIG. 2 is a flowchart that illustrates a non-limiting example embodiment of a procedure for determining genomic regions of interest and filtering to identify cell-type-informative sites according to various aspects of the present disclosure.
- the cell types of interest for which the cell-type-informative sites are determined and filtered are different cancer types, different cancer subtypes, different tissue types, or different cell types.
- the procedure 200 advances to block 202, where a list of sites likely to be informative in the cell type of interest is selected.
- Sites may be selected using available data, including but not limited to public research databases and repositories and published scientific and sequencing data. These data may be derived from assays including, but not limited to, the assay for transposase-accessible chromatin with sequencing (ATAC-seq), micrococcal nuclease sequencing (MNase-seq), DNase hypersensitivity site mapping, chromatin immunoprecipitation sequencing (ChIP-seq), and cleavage under targets and release using nuclease (CUT&RUN).
- Sites from these data that distinguish between cell types are selected using any suitable comparison, including but not limited to statistical hypothesis testing using two-group Mann-Whitney U (also called Wilcoxon rank-sum) tests or Student's t-tests, and multi-group Kruskal-Wallis tests or analysis of variance (ANOVA). Additional filtering may be performed using fold change between groups.
- a mean mappability score (metric representing the uniqueness of the genomic sequence) is determined in a fixed size window around each site likely to be informative, and at optional block 206, sites having a mean mappability score less than a predetermined threshold are discarded.
- Mappability may be determined based on reference data, such as the mappability score track from the UCSC genome browser. In some embodiments, the actions of optional block 204 and optional block 206 may not be performed.
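- The optional mappability filter of blocks 204 and 206 can be sketched as follows. The sketch assumes a per-base mappability track for one chromosome is already loaded as a list of scores; the function name, the window size, and the threshold are hypothetical values chosen for illustration, not values stated in the disclosure.

```python
def filter_by_mappability(sites, mappability, half_window=500, threshold=0.95):
    """Keep sites whose mean mappability in a fixed window meets a threshold.

    sites:        iterable of site positions (0-based indices into `mappability`).
    mappability:  list of per-base mappability scores for one chromosome.
    half_window and threshold are illustrative assumptions.
    """
    kept = []
    for position in sites:
        start = max(0, position - half_window)
        end = min(len(mappability), position + half_window)
        window = mappability[start:end]
        # Discard sites whose mean score falls below the threshold.
        if window and sum(window) / len(window) >= threshold:
            kept.append(position)
    return kept
```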
- the remaining sites that are informative for determining cell type are identified.
- Any suitable technique may be used.
- The Cancer Genome Atlas (TCGA) ATAC-seq data may be used to identify sites that have differential ATAC signal between ER-positive and ER-negative TCGA samples.
- FDR false discovery rate
- ATAC-seq read counts around each site may be provided as input to DESeq2 software, which may then identify differential sites and produce an adjusted fold change and FDR-corrected p-value for each site.
- the sites may be further refined by examining the fold change and retaining all sites with a log2 fold change greater than 0.5 in the subtype of interest relative to the other subtype.
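- The fold-change refinement can be sketched as a simple filter over per-site differential results. The record layout below (dictionaries with `site`, `log2_fc`, and `fdr` keys) is a hypothetical representation rather than the actual output format of DESeq2, and the FDR cutoff of 0.05 is an assumption; only the log2 fold change cutoff of 0.5 comes from the text above.

```python
def select_subtype_sites(results, min_log2_fc=0.5, max_fdr=0.05):
    """Split differential sites into two subtype-specific lists.

    results: list of dicts with keys 'site', 'log2_fc' (subtype A relative
             to subtype B), and 'fdr' (hypothetical record layout).
    """
    site_lists = {"subtype_a": [], "subtype_b": []}
    for row in results:
        if row["fdr"] > max_fdr:
            continue  # not significant at the assumed FDR cutoff
        if row["log2_fc"] > min_log2_fc:
            site_lists["subtype_a"].append(row["site"])
        elif row["log2_fc"] < -min_log2_fc:
            # A negative value means higher signal in subtype B.
            site_lists["subtype_b"].append(row["site"])
    return site_lists
```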
- ER-positive and ER-negative sites may be separated into those that are shared with hematopoietic cells and those that are not shared with hematopoietic cells, using a separate dataset of hematopoietic ChIP-seq peaks, to generate a total of four subtype-specific informative site lists.
- FIG. 3 is a flowchart that illustrates a non-limiting example embodiment of a procedure for determining a GC frequency matrix for a genome according to various aspects of the present disclosure.
- the technique described in FIG. 3 is different from previous techniques, such as the approach described in Benjamini & Speed, 2012 and implemented in deepTools (Ramírez, Dündar, Diehl, Grüning, & Manke, 2014), at least because the previous techniques did not compensate for fragments of different lengths, and were never shown to work for cell-free DNA sequencing data.
- a separate GC bias curve is determined for each different fragment length.
- the procedure 300 advances to an end block and terminates.
- a range of fragment lengths between a short length threshold and a long length threshold are analyzed in the procedure 300.
- the short length threshold may be in a range of 10-20 bp
- the long length threshold may be in a range of 450-550 bp.
- the short length threshold may be 15 bp
- the long length threshold may be 500 bp.
- the for-loop may operate on each fragment length between the short length threshold and the long length threshold.
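- One way to picture the GC frequency matrix is as a count, for every fragment length in the analyzed range, of how many positions in the reference could give rise to a fragment with a given number of G or C bases. The sketch below makes that concrete for a single reference string. It is a simplified illustration under stated assumptions: the function name is hypothetical, GC content is tracked as a base count (equivalent to a percentage for a fixed length), and exclusion of unmappable or ambiguous regions is omitted.

```python
from collections import Counter


def gc_frequency_matrix(reference, short_len=15, long_len=500):
    """Return {fragment_length: Counter({gc_base_count: n_positions})}."""
    sequence = reference.upper()
    # Prefix sums of G/C indicators give each window's GC count in O(1).
    prefix = [0]
    for base in sequence:
        prefix.append(prefix[-1] + (1 if base in "GC" else 0))

    matrix = {}
    # Loop over every fragment length between the two thresholds.
    for length in range(short_len, long_len + 1):
        counts = Counter()
        for start in range(len(sequence) - length + 1):
            counts[prefix[start + length] - prefix[start]] += 1
        matrix[length] = counts
    return matrix
```

- Because the matrix depends only on the reference, it could be computed once and reused across samples, consistent with the note above that these blocks may be performed before sequence data are obtained.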
- FIG. 4 is a flowchart that illustrates a non-limiting example embodiment of a procedure for using a GC frequency matrix to determine GC bias values for sequence read data according to various aspects of the present disclosure.
- the number of observed reads of each fragment length and GC content are counted to determine GC counts for the sequence read data.
- the GC counts are divided by the values in the GC frequency matrix to determine GC bias for each fragment length.
- a mean GC bias is normalized for each fragment length to determine rough GC bias values.
- the mean GC bias may be normalized to 1. This results in a rough GC bias value for every possible combination of fragment size and GC content.
- the rough GC bias values are smoothed to determine the GC bias values.
- all GC bias values for similar sized fragments (as a non-limiting example, for 165 bp fragments, fragments of sizes from 155 bp to 175 bp may be considered) may be determined.
- the GC bias values for the similar sized fragments may be sorted by GC content, and kernel smoothing may be performed by taking the median of the nearest neighbors to determine the GC bias values.
- the procedure 400 then advances to an end block and terminates.
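- The steps of procedure 400 can be sketched as two small functions: one that divides observed counts by the GC frequency matrix and normalizes the mean bias for each fragment length to 1, and one that smooths by taking the median of the nearest neighbors among similar sized fragments. The names, the unweighted mean used for normalization, and the neighbor count are assumptions made for illustration.

```python
from collections import Counter
from statistics import median


def rough_gc_bias(fragments, frequency):
    """fragments: iterable of (length, gc_count) for observed reads.
    frequency: {length: {gc_count: n_positions}} from the reference."""
    observed = Counter(fragments)
    bias = {}
    for length, row in frequency.items():
        ratios = {gc: observed[(length, gc)] / n for gc, n in row.items() if n > 0}
        mean_ratio = sum(ratios.values()) / len(ratios) if ratios else 0.0
        if mean_ratio > 0:
            # Normalize so the mean bias for this fragment length is 1.
            bias[length] = {gc: r / mean_ratio for gc, r in ratios.items()}
    return bias


def smooth_gc_bias(bias, size_radius=10, n_neighbors=5):
    """Median of the nearest neighbors (by GC fraction) among fragments
    whose length is within `size_radius` bp of the target length."""
    smoothed = {}
    for length, row in bias.items():
        pooled = [
            (gc / other_length, value)
            for other_length, other_row in bias.items()
            if abs(other_length - length) <= size_radius
            for gc, value in other_row.items()
        ]
        smoothed[length] = {}
        for gc in row:
            target = gc / length
            nearest = sorted(pooled, key=lambda p: abs(p[0] - target))[:n_neighbors]
            smoothed[length][gc] = median(value for _, value in nearest)
    return smoothed
```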
- FIG. 5 is a flowchart illustrating a non-limiting example embodiment of a procedure for using GC bias values to generate a genomic coverage distribution of sequence read data for cell-type-specific informative sites according to various aspects of the present disclosure.
- the procedure 500 advances to block 502, where fragment midpoints in a window around each cell-type-specific informative site are determined.
- a weight is assigned to each fragment based on the appropriate GC bias value for the fragment length and GC content (i.e., the GC bias value for the fragment length and GC content determined at subroutine block 108, e.g., by procedure 400). The weight is then based on that appropriate GC bias value.
- the weights are used to determine GC-corrected midpoint profiles.
- positions are excluded that overlap excluded regions.
- the excluded regions may be determined using any suitable technique.
- the excluded regions may be obtained from one or more excluded region lists.
- Excluded region lists may include, but are not limited to, an ENCODE unified GRCh38 exclusion list, centromeres, gaps in the human genome assembly, fix patches, alternative haplotypes, regions of zero mappability, and regions with unusually high coverage (e.g., 10 standard deviations above the mean).
- GC-corrected midpoint profiles for all sites are averaged to determine a mean profile.
- the mean profile is smoothed to generate a smoothed mean profile.
- Any suitable technique for smoothing may be used.
- the mean profile may be smoothed using a Savitzky-Golay filter with a window length of 165 bp and a 3rd order polynomial.
- the smoothed mean profile is normalized by dividing by the mean of the surrounding coverage.
- surrounding coverage in a range of 9,000-11,000 bp (+/- 4,500 bp to +/- 5,500 bp), such as 10,000 bp (+/- 5,000 bp) is considered for normalization. This allows samples with different depths of sequencing coverage to be compared.
- the normalized mean profile may be used as the resulting genomic coverage distribution.
- the procedure 500 then advances to an end block and terminates.
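- A condensed sketch of procedure 500 follows. It builds a GC-corrected midpoint profile per site, averages the profiles across sites, and normalizes by the mean of the surrounding coverage. Several points are assumptions made for illustration: the function names are hypothetical, each fragment's weight is taken as the inverse of its GC bias value (the text above states only that the weight is based on that value), and the exclusion and smoothing steps are left out to keep the example short.

```python
def gc_corrected_profile(fragments, site, gc_bias, half_window=5000):
    """fragments: iterable of (start, end, gc_count), with end exclusive.
    gc_bias:   {fragment_length: {gc_count: bias_value}}."""
    profile = [0.0] * (2 * half_window)
    for start, end, gc_count in fragments:
        midpoint = (start + end) // 2
        offset = midpoint - site + half_window
        bias = gc_bias.get(end - start, {}).get(gc_count)
        if 0 <= offset < len(profile) and bias:
            # Assumed weighting: inverse of the GC bias value.
            profile[offset] += 1.0 / bias
    return profile


def mean_normalized_profile(profiles):
    """Average per-site profiles, then divide by the mean surrounding coverage."""
    n_sites = len(profiles)
    mean_profile = [sum(column) / n_sites for column in zip(*profiles)]
    surrounding = sum(mean_profile) / len(mean_profile)
    if surrounding == 0:
        raise ValueError("no coverage in the surrounding window")
    return [value / surrounding for value in mean_profile]
```

- In a fuller implementation, a smoothing step such as `scipy.signal.savgol_filter` with a window length of 165 and a polynomial order of 3 could be applied to the mean profile before normalization.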
- FIG. 6 is a block diagram that illustrates aspects of an exemplary computing device appropriate for use as a computing device of the present disclosure.
- the techniques described above, including but not limited to the techniques described in method 100, may be implemented in full or in part on one or more computing systems that include one or more computing devices, such as computing device 600, that are communicatively coupled to each other.
- the exemplary computing device 600 describes various elements that are common to many different types of computing devices, including but not limited to desktop computing devices, laptop computing devices, server computing devices, mobile computing devices, and computing devices that are part of a cloud computing system. While FIG. 6 is described with reference to a computing device that is implemented as a device on a network, the description below is applicable to servers, personal computers, mobile phones, smart phones, tablet computers, embedded computing devices, and other devices that may be used to implement portions of embodiments of the present disclosure. Some embodiments of a computing device may be implemented in or may include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other customized device. Moreover, those of ordinary skill in the art and others will recognize that the computing device 600 may be any one of any number of currently available or yet to be developed devices.
- the computing device 600 includes at least one processor 602 and a system memory 610 connected by a communication bus 608.
- the system memory 610 may be volatile or nonvolatile memory, such as read only memory (“ROM”), random access memory (“RAM”), EEPROM, flash memory, or similar memory technology.
- EEPROM electrically erasable programmable read-only memory
- system memory 610 typically stores data and/or program modules that are immediately accessible to and/or currently being operated on by the processor 602.
- the processor 602 may serve as a computational center of the computing device 600 by supporting the execution of instructions.
- the computing device 600 may include a network interface 606 comprising one or more components for communicating with other devices over a network. Embodiments of the present disclosure may access basic services that utilize the network interface 606 to perform communications using common network protocols.
- the network interface 606 may also include a wireless network interface configured to communicate via one or more wireless communication protocols, such as Wi-Fi, 2G, 3G, LTE, WiMAX, Bluetooth, Bluetooth low energy, and/or the like.
- the network interface 606 illustrated in FIG. 6 may represent one or more wireless interfaces or physical communication interfaces described and illustrated above with respect to particular components of the computing device 600.
- the computing device 600 also includes a storage medium 604.
- services may be accessed using a computing device that does not include means for persisting data to a local storage medium. Therefore, the storage medium 604 depicted in FIG. 6 is represented with a dashed line to indicate that the storage medium 604 is optional.
- the storage medium 604 may be volatile or nonvolatile, removable or nonremovable, implemented using any technology capable of storing information such as, but not limited to, a hard drive, solid state drive, CD ROM, DVD, or other disk storage, magnetic cassettes, magnetic tape, magnetic disk storage, and/or the like.
- FIG. 6 does not show some of the typical components of many computing devices.
- the computing device 600 may include input devices, such as a keyboard, keypad, mouse, microphone, touch input device, touch screen, tablet, and/or the like. Such input devices may be coupled to the computing device 600 by wired or wireless connections including RF, infrared, serial, parallel, Bluetooth, Bluetooth low energy, USB, or other suitable connection protocols using wireless or physical connections.
- the computing device 600 may also include output devices such as a display, speakers, printer, etc. Since these devices are well known in the art, they are not illustrated or described further herein.
- the computer-implemented method implementing the Griffin workflow is highly adaptable to different types of input data reflective of the chromatin architecture (e.g., nucleosome occupancy and chromatin accessibility).
- the method can be applied to various contexts of analyses depending on the source and character of the originating cells or tissues being analyzed.
- the disclosure provides a method of determining a chromatin accessibility profile for a cell of interest from a sample comprising cell-free DNA derived from the cell of interest.
- This method applies the Griffin data optimization workflow, described in more detail above, to determine a chromatin accessibility profile for a cell of interest.
- the method is flexible and permits input data obtained from a variety of sequencing and capture protocols.
- the method comprises: obtaining sequence read data from the cell-free DNA; receiving, by a computing system, sequence read data, wherein the sequence read data includes a plurality of fragment reads, wherein each fragment read has a fragment length and a GC content indicating a percentage of bases in the fragment read that are G or C; determining, by the computing system, GC bias values for each fragment read based on the fragment length and the GC content of the fragment read; generating, by the computing system, a genomic coverage distribution that is adjusted for GC bias using the sequence read data and the GC bias values; and determining the chromatin accessibility profile from the genomic coverage distribution.
- the method can further comprise determining a phenotype of the cell of interest based on the chromatin accessibility profile.
- determinations of cell phenotype can include determining the tissue type of origin of the cell, determining if the cell is transformed (e.g., is cancerous or malignant), determining the cancer type or cancer subtype, determining a malignancy aggressiveness phenotype, and/or determining a drug responsivity phenotype.
- the term malignancy aggressiveness phenotype refers to the relative aggressiveness of a transformed (e.g., cancer) cell in terms of rate of reproduction, migration, drug responsivity, and the like.
- the phenotype can be qualitative or can be assessed by various metrics to allow for quantitative comparison.
- drug responsivity phenotype refers to the relative responsivity (i.e., susceptibility or resistance) of a cancer cell to a cancer therapy.
- the metric can be quantitative or qualitative. These determinations can be made using various classifiers, described in more detail above, based on sequence data optimized by the Griffin workflow. Elements of the Griffin workflow and computer implemented method are described in more detail above and incorporated into the present aspect without limitation. Exemplary, nonlimiting implementations of the Griffin workflow and associated classifiers to subtype cancer cells with distinct phenotypes are provided in the Examples.
- the Griffin workflow enhances data from a variety of sequencing and capture platforms to provide profiles of nucleosome accessibility, and these profiles can provide highly accurate insight as to the nature of cells that contribute to the ctDNA present in biological samples.
- These insights enable detecting and characterizing cells that contribute to the ctDNA, including enabling the ability to detect cells of a certain type and/or differentiate cells between various subtypes.
- the disclosure also provides a method for determining or identifying a cell type of a cell of interest from a sample comprising cell-free DNA derived from the cell of interest. The method of this aspect comprises: obtaining sequence read data generated from the sample comprising cell-free
- the determining step can be performed by any of a number of appropriate classifiers based on the data enhanced by the Griffin workflow.
- the determining step can comprise determining a cell phenotype, such as determining tissue type, a cancer type, a cancer subtype, a malignancy aggressiveness phenotype, a drug responsivity phenotype, or expression (or expression level) of a gene of interest.
- the disclosure provides a method for detecting the presence of a cancer cell in a subject.
- the method comprises: obtaining sequence read data generated from the sample comprising cell-free DNA obtained from the subject; performing the computer-implemented method described in more detail above (and which is incorporated into this aspect in all of its embodiments); and determining the presence of a cancer cell in the subject based on the prediction provided by the computing system.
- the method is performed a plurality of times. Accordingly, the method can be a method of monitoring for the presence and/or identity of cancer in the subject.
- the cancer cell(s) detected in the subject at each performance of the method can be further characterized. For example, the cell(s) can be monitored over time using this method to determine a cancer subtype or phenotype of the detected cancer cell(s) based on the prediction provided by the computing system.
- the method further comprises detecting a change in phenotype of the detected cancer cell(s) over time. For example, as described in more detail below, certain cancer types can progress from one subtype to another during the course of disease. Cancer cells can evolve and essentially switch between characterized subtypes.
- non-small cell lung cancer can be monitored for transdifferentiation to small cell lung cancer (SCLC).
- SCLC subtypes can be monitored for transdifferentiation to distinct subtypes.
- the method can be performed starting before or during the course of treatment for cancer. Accordingly, the cancer can be monitored for responsivity to the treatment, or for changes in phenotype during the course of treatment. These characteristics can inform any appropriate adjustments to the treatment regimen.
- the method comprises implementing a treatment or treatment change based on the monitored status of the cancer cells as determined by the method.
- the disclosure provides a method of determining a cancer subtype of a target cancer cell from a sample comprising cell-free DNA derived from the target cancer cell.
- the method comprises: obtaining sequence read data generated from the sample comprising cell-free DNA; performing the computer-implemented method described in more detail above (and which is incorporated into this aspect in all of its embodiments); and determining the cell type of the target cancer cell based on the predicted cancer subtype provided by the computing system.
- the sample can be a biological sample from the subject, e.g., a subject with cancer or suspected to have cancer. Exemplary biological samples are described in more detail below.
- the method comprises obtaining the biological sample from the subject and/or generating the sequence read data from the sample, according to standard techniques appropriate for the desired sequencing platform and/or targeted capture technology.
- the Griffin platform has been employed to successfully distinguish between important subtypes of cancers for various different, unrelated cancers, indicating the broad applicability to cancer types in general.
- the cancer is characterized as metastatic breast cancer.
- the determining step comprises determining the status of the breast cancer as ER+ versus ER-, which refers to the expression of estrogen receptor (ER) and whether the cancer cells respond to exposure to the estrogen hormone. This status can be critical for informing the appropriate course of therapy because ER+ breast cancers can be addressed by administration of endocrine therapies.
- the determining step comprises determining the status of the breast cancer as PR+ versus PR-, which refers to the expression of progesterone receptor (PR) and whether the cancer cells respond to exposure to the progesterone hormone. Similarly, this status can be critical for informing the appropriate course of therapy because PR+ breast cancers can also be addressed by administration of appropriate hormonal therapies, such as tamoxifen and aromatase inhibitors.
- the determining step comprises determining the status of the breast cancer as HER2+ versus HER2-, which refers to the expression of human epidermal growth factor receptor 2 (HER2).
- HER2+ breast cancer cells tend to result in poorer prognosis as they grow faster and have a higher likelihood of spreading, e.g., to the lymph nodes.
- This status can be critical for informing the appropriate course of therapy because HER2+ breast cancers can be addressed by administration of appropriate HER2-targeted therapy, such as trastuzumab or pertuzumab.
- the disclosure also encompasses embodiments of determining the expression status of multiple informative markers.
- the method can comprise determining: whether the cancer is ER+ versus ER-; whether the cancer is PR+ versus PR-; and/or whether the cancer is HER2+ versus HER2-, in any combination.
- the method comprises determining whether the cancer is ER+ versus ER-, whether the cancer is PR+ versus PR-, and whether the cancer is HER2+ versus HER2-.
- Patients with triple-negative breast cancer (i.e., ER-, PR-, HER2-)
- the cancer is characterized as metastatic prostate cancer.
- determining the subtype of the prostate cancer addresses determining whether the cancer expresses various markers characteristic of distinguishable subtypes.
- the step of determining the cancer subtype comprises determining whether the prostate cancer is AR+ (ARPC) versus AR-, which refers to the status for expression of androgen receptors.
- the step of determining the cancer subtype comprises determining whether the prostate cancer is AR+ (ARPC) versus AR-low.
- Prostate cancers that are AR+ are often treated with androgen receptor signaling inhibitors (ARSI) that repress the androgen receptor activity in the cells.
- the step of determining the cancer subtype comprises determining whether the prostate cancer has a neuroendocrine prostate cancer (NEPC) phenotype signature or not.
- NEPC cells lack AR activity and possess distinct transcriptional programming regulation profiles from CRPC cells, including different epigenetic modifications, that result in a distinct phenotype that requires alternative therapeutic intervention.
- the step of determining the cancer subtype comprises determining whether the prostate cancer is amphicrine, which refers to possessing both exocrine and neuroendocrine characteristics in the same cell. As is demonstrated in Example 2 below, the Griffin workflow can be leveraged to accurately distinguish these cell types from input sequence reads generated from ctDNA.
- determining the cancer subtype comprises determining 2, 3, 4, or all of the following: whether the cancer is AR+ (ARPC) or AR-, whether the cancer is AR-low or ARPC, whether the cancer has a neuroendocrine prostate cancer (NEPC) phenotype signature or not, whether the cancer is AR-low or NEPC, whether the cancer is amphicrine or ARPC or NEPC, in any combination.
- the cancer is characterized as metastatic lung cancer.
- determining the subtype of the lung cancer comprises determining whether the cancer is small cell lung cancer (SCLC) or non-small cell lung cancer (NSCLC). If the lung cancer is NSCLC, in a further embodiment, the method further comprises determining whether the NSCLC is adenocarcinoma or squamous cell carcinoma.
- the input sequence read data can be generated from a variety of platforms and with a variety of techniques, including whole genome analysis.
- the inventors established that whole genome analysis, however, is not required. Instead, the inventors designed and implemented a panel of genomic targets deemed to be relevant to the scientific inquiry (e.g., subtyping lung cancer cells). Accordingly, in some embodiments, the lung cancer is further subtyped using sequence read data generated from a panel of genomic targets.
- the panel of genomic targets comprises transcription factor binding sites (TFBSs) of one or more transcription factors associated with a designated subtype that is the subject of analysis, e.g., SCLC.
- the one or more associated transcription factors comprise one or more of ASCL1, NEUROD1, POU2F3, REST, and the like.
- the method comprises determining the nucleosome occupancy of the TFBSs using any appropriate technique (e.g., CUT & RUN, and the like).
- the TFBSs can be identified by ChIP-seq data, or similar techniques known in the art.
- Candidate TFBSs can be retained in the panel if they are proximal to a transcription start site (TSS) of a gene associated with lung cancer, or the subtype of lung cancer that is of interest in the subtyping.
- proximal can mean within a proximity such that the TFBS is functionally influential on the start of transcription at the TSS.
- the functional influence or relationship can be established if the TSS is the closest TSS to the TFBS.
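- The closest-TSS criterion can be sketched with a binary search over sorted TSS coordinates. The example assumes all positions lie on the same chromosome and that a mapping from TSS position to gene name is available; the function names and the data layout are hypothetical.

```python
from bisect import bisect_left


def nearest_tss(tfbs_position, sorted_tss):
    """Return the TSS position closest to a TFBS (sorted_tss must be non-empty)."""
    index = bisect_left(sorted_tss, tfbs_position)
    # Only the neighbors on either side of the insertion point can be closest.
    candidates = sorted_tss[max(0, index - 1):index + 1]
    return min(candidates, key=lambda tss: abs(tss - tfbs_position))


def retain_tfbs(tfbs_positions, tss_to_gene, genes_of_interest):
    """Keep TFBSs whose closest TSS belongs to a gene of interest."""
    sorted_tss = sorted(tss_to_gene)
    return [
        position
        for position in tfbs_positions
        if tss_to_gene[nearest_tss(position, sorted_tss)] in genes_of_interest
    ]
```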
- the panel of genomic targets comprises transcription start sites (TSSs) for one or more markers associated with lung cancer (or the specific subtype of lung cancer that is of interest).
- the method comprises determining the nucleosome occupancy of the TSSs through known techniques.
- the biological sample described herein can be any sample obtained from a subject that is likely to have cell free DNA.
- Illustrative, non-limiting examples of the sample encompassed by the disclosure include blood, plasma, or serum, which are particularly useful for assessing cfDNA and ctDNA from a subject.
- the methods can further comprise obtaining the biological sample from the subject. Additionally, for a subject that is determined to have cancer or a cancer subtype at any time, the method can further comprise prescribing appropriate treatment or actively treating the subject appropriately based on the determination of the cancer type or subtype according to accepted practice in the medical field for the determined cancer.
- the described method can be performed multiple times to provide multiple assessments. This can be useful to provide methods for monitoring the presence or evolution of cell types or subtypes from a source.
- the methods can be performed from sequence read data obtained from biological samples obtained from a subject before and/or for time points at or after initial diagnosis of cancer.
- the Griffin workflow is flexible and is not limited to a certain set of genomic regions of interest, nor to a specific type of sequence data for generating coverage profiles.
- Exemplary, non-limiting approaches for generating sequence read data include whole genome sequencing (for example depths between 0.05X coverage and 100X coverage) and chromatin accessibility assays.
- the sequence read data is generated by, or regions of interest are identified using, techniques such as ATAC-seq, ChIP-seq, DNase sensitivity assays, and the like, which are known in the art.
- the sequence data is generated by CUT & RUN. See, e.g., WO 2019/060907, incorporated herein by reference in its entirety.
- the CUT & RUN assay can incorporate use of one or more affinity reagents (e.g., antibodies or antibody fragments) that target post-translational modifications of H3K27ac, H3K4me1 and/or H3K27ac.
- the method comprises affirmatively generating the sequence read data, using for example, any of the illustrative approaches described herein or other appropriate approaches known in the art.
- the sequence read data can be produced from a panel of genomic targets. It will be understood that this targeted panel approach is applicable beyond lung cancer subtyping to other types of cancers.
- the sequence read data can comprise sequence read data generated from a panel of genomic targets.
- the panel of genomic targets can be designed and assembled according to the approach described in Example 3 in the context of lung cancer (see also FIG. 16).
- the panel can comprise TFBSs of one or more transcription factors associated with a cancer type of interest.
- the transcription factors associated with a cancer type of interest can be readily identified from the art.
- the TFBSs relating to the designated transcription factor(s) can be determined by standard assays that establish binding sites in the genome, such as ChIP-seq data, and the like. Furthermore, candidate TFBSs can be further retained based on an assessment of association or proximity with transcription start sites (TSSs) of genes with transcription levels (on, off, high, low, etc.) associated with a relevant cancer or cancer subtype.
- the panel of genomic targets comprises transcription start sites (TSSs) for one or more markers associated with the cancer type of interest.
- TSSs transcription start sites
- the panel can be constructed using the TFBSs and/or TSSs in any combination. Once established, directed sequencing reads are generated from the targets. In some embodiments, the nucleosome occupancy of the TFBSs and/or TSSs is determined.
- the sequence read data is the input into the computer-implemented Griffin method described above to facilitate the appropriate subtyping or other analysis.
- the disclosure provides a computer-implemented method of enhancing sequence read data from cell-free DNA samples for cell type prediction.
- the method comprises: receiving, by a computing system, sequence read data, wherein the sequence read data includes a plurality of fragment reads, and wherein each fragment read has a fragment length; determining, by the computing system, a fragment size variability for at least one gene associated with a cell type; and predicting, by the computing system, the cell type based on the fragment size variability for the at least one gene.
- FIG. 23 is a flowchart that illustrates a non-limiting example embodiment of enhancing sequence read data from cell-free DNA samples for improved cell type prediction according to various aspects of the present disclosure.
- a computing system receives sequence read data, wherein the sequence read data includes a plurality of fragment reads, and wherein each fragment read has a fragment length.
- the computing system determines a fragment size variability for at least one gene associated with a cell type.
- locations of genes whose mRNA expression and transcriptional activity are known to be associated with given cell types, such as the 47 genes illustrated in FIG. 12D that are known to be associated with prostate cancer, may be used.
- a coefficient of variation of the fragment size of fragments at locations associated with one or more genes may be determined and used as fragment size variability values.
- the coefficient of variation (CV) has been found to be particularly useful in distinguishing cell types based on fragment size variability when analyzing fragments at genes that are associated with the cell types. In particular, CV has been found to be less affected by the depth of sequencing coverage than other techniques (such as measurements of entropy).
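- The coefficient of variation described above can be sketched as the standard deviation of the fragment lengths at a gene divided by their mean. The function name and the use of the population standard deviation are assumptions for illustration; the disclosure does not specify which estimator is used.

```python
from statistics import mean, pstdev


def fragment_size_cv(fragment_lengths):
    """Coefficient of variation (standard deviation / mean) of fragment lengths.

    Assumption: the population standard deviation is used.
    """
    return pstdev(fragment_lengths) / mean(fragment_lengths)


# Hypothetical fragment lengths (bp) overlapping two placeholder genes.
lengths_by_gene = {
    "GENE_A": [150, 150, 150, 150],
    "GENE_B": [100, 200],
}
cv_by_gene = {gene: fragment_size_cv(v) for gene, v in lengths_by_gene.items()}
```

- Because the CV is a ratio of two statistics of the same fragments, it is dimensionless, which is consistent with the observation above that it is less affected by the depth of sequencing coverage than count-dependent measures.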
- the computing system predicts the cell type based on the fragment size variability for at least one gene.
- features may be generated based on the fragment size variability, and the features may be provided as input to a classifier model to determine whether the features represent a given cell type.
- a ratio of the fragment size variability in a first cell type versus a second cell type may be used as a feature.
- the classifier model may be used to determine whether the calculated features for a given sample are more like features of a first cell type or a second cell type. Any suitable classifier model, including but not limited to logistic regression models, artificial neural networks, decision trees, support vector machines, and Bayesian networks, may be used.
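- One possible reading of the ratio feature and classifier step above is sketched below: the mean fragment size variability at genes associated with a first cell type is divided by the mean at genes associated with a second cell type, and the ratio is passed through a logistic function. This interpretation of the ratio, the function names, and the weight and bias values are all assumptions for illustration; a real model would be fitted to labeled training samples.

```python
import math
from statistics import mean


def variability_ratio(cv_by_gene, genes_type_a, genes_type_b):
    """Ratio of mean CV at type-A genes to mean CV at type-B genes."""
    a = mean(cv_by_gene[g] for g in genes_type_a)
    b = mean(cv_by_gene[g] for g in genes_type_b)
    return a / b


def logistic_score(feature, weight, bias):
    """Logistic regression score in (0, 1) for a single feature."""
    return 1.0 / (1.0 + math.exp(-(weight * feature + bias)))


# Hypothetical per-gene CV values and hypothetical model parameters.
cv_by_gene = {"A1": 0.2, "A2": 0.4, "B1": 0.1, "B2": 0.2}
ratio = variability_ratio(cv_by_gene, ["A1", "A2"], ["B1", "B2"])
score = logistic_score(ratio, weight=1.0, bias=-1.0)
```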
- Example 2 One non-limiting example embodiment of the use of the method 700 is described in Example 2, where analysis of fragment size variability is used to distinguish prostate cancer cell types of androgen receptor pathway active prostate cancer (ARPC) varieties and neuroendocrine prostate cancer (NEPC) varieties.
- ARPC androgen receptor pathway active prostate cancer
- NEPC neuroendocrine prostate cancer
- subject means a mammal being assessed for treatment and/or being treated.
- the mammal is a human.
- the terms “subject,” “individual,” and “patient” encompass, without limitation, individuals having cancer. While subjects may be human, the term also encompasses other mammals, particularly those mammals useful as laboratory models for human disease, e.g., mouse, rat, dog, non-human primate, and the like.
- treating and grammatical variants thereof may refer to any indicia of success in the treatment or amelioration or prevention of a disease or condition (e.g., a cancer, infectious disease, or autoimmune disease), including any objective or subjective parameter such as abatement; remission; diminishing of symptoms or making the disease condition more tolerable to the patient; slowing in the rate of degeneration or decline; or making the final point of degeneration less debilitating.
- the treatment or amelioration of symptoms can be based on objective or subjective parameters, including the results of an examination by a physician.
- the term “treating” includes the administration of the compounds or agents of the present disclosure to prevent or delay, to alleviate, to improve clinical outcomes, to decrease occurrence of symptoms, to improve quality of life, to lengthen disease-free status, to stabilize, to prolong survival, to arrest or inhibit development of the symptoms or conditions associated with a disease or condition (e.g., a cancer), or any combination thereof.
- therapeutic effect refers to the reduction, elimination, or prevention of the disease or condition, symptoms of the disease or condition, or side effects of the disease or condition in the subject.
- “nucleic acid” or “polynucleic acid” refer to a polymer of nucleotide monomer units or “residues”, typically DNA or RNA.
- the nucleotide monomer subunits, or residues, of the nucleic acids each contain a nitrogenous base (i.e., nucleobase), a five-carbon sugar, and a phosphate group.
- nucleobase a nitrogenous base
- the identity of each residue is typically indicated herein with reference to the identity of the nucleobase (or nitrogenous base) structure of each residue.
- Canonical nucleobases include adenine (A), guanine (G), thymine (T), uracil (U) (in RNA instead of thymine (T) residues) and cytosine (C).
- the nucleic acids of the present disclosure can include any modified nucleobase, nucleobase analogs, and/or non-canonical nucleobase, as are well-known in the art.
- Example 1 is set forth so as to provide those of ordinary skill in the art with a complete disclosure and description of how to make and use the present invention, and is not intended to limit the scope of what the inventors regard as their invention, nor is it intended to represent that the experiments below are all or the only experiments performed.
- cfDNA Cell-free DNA
- ctDNA circulating tumor DNA
- genomic alterations from ctDNA have helped to distinguish molecular subsets of tumors.
- these genomic alterations, including somatic mutations, may not always fully explain treatment failure or identify therapeutic targets, exemplifying a major limitation of cancer precision medicine.
- Tumor subtypes are often characterized by distinct transcriptional regulation, which can change during treatment resistance, leading to different clinical tumor phenotypes.
- prostate and lung cancers may undergo trans-differentiation from adenocarcinoma to small-cell neuroendocrine phenotypes.
- MBC metastatic breast cancer
- treatment is guided based on clinical subtypes determined by the expression of the estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2), often in the primary tumor; endocrine therapies are prescribed to patients with ER-positive (ER+) or PR-positive (PR+) carcinomas while patients with HER2 positive tumors are prescribed anti-HER2 drugs.
- ER estrogen receptor
- PR progesterone receptor
- HER2 human epidermal growth factor receptor 2
- TNBC triple negative breast cancer
- ER- ER-negative subtypes
- mixtures of clinical subtypes may also co-exist across or within metastatic lesions in the same patient, presenting major clinical challenges. Therefore, accurate subtype classification and identification of transcriptional patterns underlying emergent clinical phenotypes during therapy have critical implications for studying mechanisms of resistance and informing treatment decisions.
- nucleosomes are positioned in an organized manner that allows access for DNA binding proteins (FIG. 7A). This nucleosome organization results in a loss of sequencing coverage, reflecting DNA degradation at the unprotected binding site, with peaks of coverage at the surrounding protected locations.
- nucleosome profiling from cfDNA has been demonstrated for cancer detection and tumor tissue-of-origin prediction, including the analysis of shorter cfDNA fragments which tend to be enriched from tumor cells. While tumor subtyping from cfDNA has been explored in prostate cancer by analyzing TFBS locations, it is believed that there have not been demonstrations of subtype classification from cfDNA in other cancers. Specifically, predicting histological subtypes in breast cancer has not been shown from cfDNA. Furthermore, current cfDNA nucleosome profiling approaches have not been optimized for ULP-WGS data. Studying the clinical phenotype of tumors from ctDNA remains challenging due to lack of robust computational methods but has obvious potential clinical benefits for guiding treatment decisions in patients with metastatic cancer.
- a computational framework called Griffin was developed to classify tumor subtypes from nucleosome profiling of cfDNA.
- Griffin overcomes current analytical challenges to profile nucleosome accessibility and transcriptional regulation from the analysis of standard cfDNA genome sequencing, including ULP-WGS (0.1x) coverage.
- Griffin employs a novel GC correction procedure that is specific for DNA fragment sizes and therefore unique for cfDNA sequencing data.
- Griffin was applied to perform cancer detection and tumor tissue-of-origin analysis with high performance. Then, the first application of breast cancer ER subtyping from cfDNA was demonstrated, showing strong classification accuracy and insights into tumor heterogeneity and prognosis, all achieved from analysis of ULP-WGS data.
- Griffin is a generalizable framework that can detect molecular changes in transcriptional regulation and chromatin accessibility from cfDNA and possibly direct personalized treatment to improve patient outcomes.
- Griffin was developed as an analysis framework with a GC correction procedure to accurately profile nucleosome occupancy from cfDNA. Griffin processes fragment coverage to distinguish accessible and inaccessible features of nucleosome protection (FIG. 7A). Griffin is designed to be applied to whole genome sequencing (WGS) data of cfDNA from patients with cancer to quantify nucleosome protection around sites of interest and is optimized to work for ULP-WGS data (FIG. 7B). Sites of interest can be selected from various chromatin-based assays, such as assay for transposase-accessible chromatin using sequencing (ATAC-seq), and are tailored to address specific problems including cancer detection and tumor subtyping.
- WGS whole genome sequencing
- ATAC-seq assay for transposase-accessible chromatin using sequencing
- the analysis workflow begins with computing the genome-wide fragment-based GC bias for each sample. Then, for the region at each site of interest, the fragment midpoint coverage is computed and reweighted to remove GC biases (Methods). Midpoint coverage rather than full fragment coverage is used because it produces higher amplitude nucleosome protection signals (not shown). Next, a composite coverage profile is computed as the mean of the GC-corrected coverage across the set of sites specific for a tissue type, tumor type, transcription factor (TF), or any phenotypic comparison of interest.
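- The final step of the workflow above, averaging GC-corrected coverage across a set of sites, can be sketched as a position-wise mean over per-site profiles. The function name and the list-of-lists representation are assumptions for illustration.

```python
def composite_profile(site_profiles):
    """Position-wise mean of equal-length per-site coverage profiles.

    site_profiles: list of lists, one GC-corrected coverage profile per
    site, all aligned so that the same index is the same offset from the site.
    """
    n_sites = len(site_profiles)
    width = len(site_profiles[0])
    if any(len(p) != width for p in site_profiles):
        raise ValueError("all site profiles must have the same length")
    return [sum(p[i] for p in site_profiles) / n_sites for i in range(width)]


# Hypothetical toy profiles for two sites, three positions each.
profile = composite_profile([[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]])
```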
- a novel aspect of Griffin is the implementation of a fragment-based GC bias correction.
- GC-content is non-uniform, which leads to GC-related coverage biases (FIG. 8A).
- FIG. 8A Wang, J. et al. Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome Res. 22, 1798-1812 (2012).
- GC bias varies between samples and between different fragment lengths within a sample (Benjamini, Y. & Speed, T. P. Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Research 40, e72-e72 (2012)) (FIG.
- nucleosome accessibility prediction (FIG. 8C).
- Griffin computes the global estimated mean fragment coverage ("expected") using a fragment length position model (Benjamini, Y. & Speed, T. P. Nucleic Acids Research 40, e72-e72 (2012)) (Methods, FIG. 8B). Then, when calculating coverage profiles around sites of interest, each fragment is assigned a weight based on the global expected coverage for its length and GC bias. This correction eliminates unexpected increases (or decreases) in coverage at binding sites, removing technical biases to enhance the tissue-associated accessibility signals when analyzing WGS (9-25x, FIG. 8C) cancer patient cfDNA and ULP-WGS (0.1-0.3x, FIG. 8D).
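- A simplified sketch of a fragment-based GC bias table and the resulting per-fragment weights is given below. It assumes that the bias for each (fragment length, GC count) bin is the ratio of observed fragments to the number of genomic positions that could yield such a fragment, normalized to a mean of 1 within each fragment length, and that each fragment is weighted by the inverse of its bias. These modeling choices and all names are assumptions for illustration and may differ from the Griffin implementation.

```python
from collections import defaultdict


def gc_bias_table(observed, genome_freq):
    """Estimate GC bias per (fragment_length, gc_count) bin.

    observed: dict mapping (length, gc_count) to observed fragment count.
    genome_freq: dict mapping (length, gc_count) to the number of genomic
    positions that would produce a fragment with that length and GC count.
    """
    raw = {}
    for key, n_possible in genome_freq.items():
        if n_possible > 0:
            raw[key] = observed.get(key, 0) / n_possible
    # Normalize within each fragment length so the mean bias is 1 (assumption).
    by_length = defaultdict(list)
    for (length, _gc), ratio in raw.items():
        by_length[length].append(ratio)
    bias = {}
    for (length, gc), ratio in raw.items():
        length_mean = sum(by_length[length]) / len(by_length[length])
        bias[(length, gc)] = ratio / length_mean if length_mean > 0 else 0.0
    return bias


def fragment_weight(bias, length, gc_count):
    """Inverse-bias weight; bins with no usable estimate get weight 0."""
    b = bias.get((length, gc_count), 0.0)
    return 1.0 / b if b > 0 else 0.0


# Hypothetical counts for 150 bp fragments with 60 or 90 G/C bases.
observed = {(150, 60): 30, (150, 90): 10}
genome_freq = {(150, 60): 100, (150, 90): 100}
bias = gc_bias_table(observed, genome_freq)
```

- With these toy counts, the over-represented bin is down-weighted and the under-represented bin is up-weighted, so the weighted coverage no longer depends on GC content.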
- the estimated TFBS accessibility was compared with the amount of tumor-derived DNA (i.e. tumor fraction) predicted by ichorCNA for ULP-WGS data from 191 MBC cfDNA samples with > 0.1 tumor fraction (Adalsteinsson, V. A. et al. Scalable whole-exome sequencing of cell-free DNA reveals high concordance with metastatic tumors. Nature Communications 8, (2017)).
- the tumor fraction was expected to be negatively correlated with the central coverage around tumor-specific sites, and positively correlated for blood-specific sites.
- the RMSE decreased (0.062 to 0.046), indicating less inter-sample variation in the data after GC correction (FIG. 8E).
- the central coverage for the 377 TFs was examined in a cohort of 215 healthy donors (Cristiano, S. et al. Genome-wide cell-free DNA fragmentation in patients with cancer. Nature 570, 385-389 (2019)) before and after GC correction.
- the performance was likely reflective of the higher tumor fractions observed in late-stage cancer relative to early-stage cancer.
- DHS DNase I Hypersensitivity Sites
- Griffin was applied to profile nucleosome accessibility at these four sets of ER subtype-specific accessible chromatin sites, extracting a total of 12 features (FIG. 10B).
- Circulating tumour DNA in metastatic breast cancer to guide clinical trial enrolment and precision oncology: A cohort study. PLoS medicine 17.10 (2020): e1003363) and using the model trained on the original MBC dataset, we were able to predict ER status with 0.92 accuracy (0.96 AUC) in all samples with >0.05 tumor fraction. Looking only at samples with >0.1 tumor fraction, the accuracy was 0.96 and the AUC was 0.98. This analysis further supports that Griffin can perform accurate ER status prediction in independent datasets.
- Griffin a new framework and analysis tool for studying transcriptional regulation and tumor phenotypes.
- Griffin uses a novel cfDNA fragment length-specific normalization of GC-content biases that obscure chromatin accessibility information. It is demonstrated that Griffin can be used to detect cancer from low pass WGS with high accuracy. Additionally, an approach was developed to perform ER subtyping in breast cancer from ULP-WGS, which is the first time that ER phenotype prediction has been shown from ctDNA.
- Griffin is versatile and can be used for various applications in cancer. This disclosure highlights cancer detection, tissue-of-origin, and tumor subtype use-cases. However, Griffin can also be used for any biological comparison where transcriptional regulation and chromatin accessibility differences can be delineated.
- the applications described here use TFBSs from chromatin immunoprecipitation sequencing (ChIP-seq) and accessible chromatin sites from ATAC-seq.
- ChIP-seq chromatin immunoprecipitation sequencing
- Griffin differs from existing methods due to its ability to analyze custom sites of interest that are specific to any biological context. These sites may be obtained from external sources and different assays, such as ChIP-seq, DNase I hypersensitivity, ATAC-seq or cleavage under targets and release using nuclease (CUT&RUN).
- Griffin is optimized for the analysis of ULP-WGS (0.1x) of cfDNA, while other nucleosome profiling methods have focused on deeper coverage sequencing. Griffin takes advantage of analyzing the breadth of sites as opposed to individual loci, which was inspired by a similar strategy used by Ulz, P. et al. Inference of transcription factor binding from cell-free DNA enables tumor subtype prediction and early detection. Nature Communications 10, 4666 (2019). It is demonstrated that Griffin has better performance for both detecting cancer and predicting ER status from ULP-WGS data when compared to the Ulz method, because of its novel bias correction and versatility to analyze any set of genomic regions. However, Griffin is not limited to low coverage data.
- Increased cfDNA sequencing coverage can allow for analysis of specific gene promoters and cis-regulatory elements and may be able to inform gene expression (Ulz, P. et al. Inferring expressed genes by whole-genome sequencing of plasma DNA. Nature Genetics 48, 1273-1278 (2016)). While recent studies show the promise of cfDNA methylation and cfRNA analysis for tumor phenotype analysis and cancer detection (Beltran, H. et al. Circulating tumor DNA profile recognizes transformation to castration-resistant neuroendocrine prostate cancer. J Clin Invest 130, 1653-1668 (2020); Wu, A. et al. Genome-wide plasma DNA methylation features of metastatic prostate cancer. J Clin Invest 130, 1991-2000 (2020); Shen, S.
- a limitation of the binary ER classification is the decreased accuracy for samples with lower tumor fraction (0.05 to 0.1); however, patients with cfDNA tumor fraction > 10% have poorer prognosis (Stover, D. G. et al. Association of Cell-Free DNA Tumor Fraction and Somatic Copy Number Alterations With Survival in Metastatic Triple-Negative Breast Cancer. JCO 36, 543-553 (2018)) and would benefit more from tumor monitoring. It may be possible to improve performance of ER subtyping for lower tumor fraction samples with additional sequencing depth or joint analysis of multiple cfDNA timepoints from the same patient.
- the breast cancer subtyping was focused on ER prediction because ER status has important utility in predicting likely benefit from endocrine therapy (Early Breast Cancer Trialists' Collaborative Group (EBCTCG). Relevance of breast cancer hormone receptors and other factors to the efficacy of adjuvant tamoxifen: patient-level meta-analysis of randomised trials. The Lancet 378, 771-784 (2011)). While PR expression is also determined in the clinic and ER-/PR+ tumors are considered hormone receptor positive, these are rare, not reproducible or less useful for prognosis (Hefti, M. M. et al. Estrogen receptor negative/progesterone receptor positive breast cancer is not a reproducible subtype. Breast Cancer Research 15, R68 (2013)).
- HER2 overexpression is relevant for prognosis and for determining treatment such as trastuzumab (Slamon, D. J. et al. Human breast cancer: correlation of relapse and survival with amplification of the HER-2/neu oncogene. Science 235, 177-182 (1987)).
- an insufficient number of open chromatin sites were identified that were specific for distinguishing HER2 status.
- ERBB2 encodes the HER2 protein
- the Griffin framework is a unique advance on our previous method to analyze genomic alterations and estimate tumor fraction from ULP-WGS of cfDNA (Adalsteinsson, V. A. et al. Scalable whole-exome sequencing of cell-free DNA reveals high concordance with metastatic tumors. Nature Communications 8, (2017)). Together, these methods form a suite of tools to establish a new paradigm to study both tumor genotype and phenotype from ULP-WGS of cfDNA. Griffin has the potential to reveal clinically relevant tumor phenotypes, which will support the study of therapeutic resistance, inform treatment decisions, and accelerate applications in cancer precision medicine.
- GC content influences the efficiency of amplification and sequencing leading to different expected coverages (coverage bias) for fragments with different GC contents and fragment lengths. This is called GC bias and is unique to each sample.
- centromeres
- fix patches
- alternative haplotypes for hg38 downloaded from UCSC table browser
- the pipeline takes a bam file, a bedGraph file of valid (mappable, non-excluded) regions, and genome GC frequencies for those regions. For each given sample, we fetched all reads aligning to the valid regions on autosomes using pysam (github.com/pysam-developers/pysam) (Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078-2079 (2009)).
- the Griffin nucleosome profiling pipeline performs nucleosome profiling around sites of interest. This pipeline takes a bam file, a site list, and assorted other parameters described below. For a given bam file and site list, we fetched all reads in a window (-5000 to +5000 bp) around each site using pysam (excluding those that failed quality control measures). We then filtered read pairs by fragment length and selected those in a range of fragment lengths (100-200 bp unless otherwise specified). For each read pair, we determined the GC bias for the fragment, assigned a weight to that fragment based on its GC bias, and identified the location of the fragment midpoint.
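- The per-site steps above (fragment length filtering, weighting, and midpoint placement) can be sketched as follows. Fragments are represented here as plain (start, end) tuples standing in for read pairs fetched from a bam file; the function name, the half-open coordinate convention, and the optional weight callback are assumptions for illustration.

```python
def midpoint_coverage(fragments, site_pos, half_window=5000,
                      min_len=100, max_len=200, weight_fn=None):
    """Weighted fragment-midpoint coverage in a window around one site.

    fragments: iterable of (start, end) half-open coordinates on the same
    chromosome as the site. weight_fn(start, end) returns the GC-correction
    weight for a fragment; if None, every fragment has weight 1.
    Returns a list of length 2 * half_window + 1, where index half_window
    corresponds to the site position itself.
    """
    profile = [0.0] * (2 * half_window + 1)
    for start, end in fragments:
        length = end - start
        if not (min_len <= length <= max_len):
            continue  # outside the selected fragment length range
        midpoint = start + length // 2
        offset = midpoint - site_pos
        if -half_window <= offset <= half_window:
            weight = 1.0 if weight_fn is None else weight_fn(start, end)
            profile[offset + half_window] += weight
    return profile


# Hypothetical fragments around a site at position 1000, with a small window.
frags = [(920, 1080), (900, 1050), (995, 1045), (930, 1090)]
cov = midpoint_coverage(frags, site_pos=1000, half_window=10)
```

- In this toy input, one fragment is discarded for being shorter than 100 bp and one falls outside the window, leaving two midpoints at offsets 0 and +10.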
- cfDNA tumor fraction was estimated using ichorCNA (Adalsteinsson, V. A. et al. Scalable whole-exome sequencing of cell-free DNA reveals high concordance with metastatic tumors. Nature Communications 8, (2017)).
- An hg38 panel of normals (PoN) with a 1 Mb bin size was created using all 215 healthy donors in the dataset.
- ichorCNA was then run on all cancer and healthy samples to estimate tumor fraction.
- ichorCNA_fracReadsInChrYForMale was set to 0.001. Defaults were used for all other settings.
- the LUCAS cohort included 158 patients who had no history of cancer and no future cancer diagnosis and 129 patients who were diagnosed with lung cancer within days of blood draw (0-44 days).
- the validation cohort included 46 patients with cancer and 385 patients without cancer. All samples were realigned to hg38 as described below in sequence data processing. Tumor fraction was determined using ichorCNA as described above with a panel of normals constructed from 54 separate non-cancer samples from this same study.
- MBC Metastatic breast cancer
- WGS data of cfDNA from patients with metastatic breast cancer (MBC) and healthy donors were obtained from an existing dataset (Adalsteinsson, V. A. et al. Nature Communications 8, (2017)). Bam files were downloaded from dbGaP (accession code: phs001417.v1.p1). This data consisted of ~0.1x ultra-low pass whole genome sequencing (ULP-WGS) from 100 bp paired-end Illumina sequencing reads.
- ULP-WGS ultra-low pass whole genome sequencing
- ER estrogen receptor
- each sample was labeled as ER+ or ER- using information about the ER status from medical records. If metastatic ER status was known, the sample was labeled according to this status. If metastatic ER status was not known, the sample was labeled according to the primary tumor ER status (20 samples from 11 patients). ER low samples (11 samples from 6 patients) were labeled ER positive for the purpose of the binary classifier. For three patients (MBC_1405, MBC_1406, MBC_1408), we had information about multiple metastatic biopsies with different ER statuses. In these cases, we used the last biopsy taken for the purpose of the binary ER status classifier.
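- The labeling rules above can be expressed as a small function: the most recent metastatic biopsy takes precedence, the primary tumor status is the fallback, and ER-low is treated as positive for the binary classifier. The function name and the status strings are assumptions for illustration.

```python
def er_label(metastatic_biopsies, primary_status):
    """Binary ER label following the rules described above.

    metastatic_biopsies: list of ER statuses ('positive', 'negative', 'low')
    in chronological order; may be empty if metastatic status is unknown.
    primary_status: ER status of the primary tumor, or None if unknown.
    """
    # Use the last metastatic biopsy when available, else the primary tumor.
    status = metastatic_biopsies[-1] if metastatic_biopsies else primary_status
    if status is None:
        return None
    # ER-low samples are grouped with ER-positive samples.
    return "ER+" if status in ("positive", "low") else "ER-"


label = er_label(["positive", "negative"], "positive")
```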
- WGS of cfDNA samples from patients with MBC were obtained from an existing study as described above (Adalsteinsson, V. A. et al. Nature Communications 8, (2017)). Additional information, including primary ER status, metastatic ER status, and survival time, was abstracted from the medical records. Use of this data was approved by an institutional review board (Dana-Farber Cancer Institute IRB protocol identifiers 05-246, 09-204, 12-431 [NCT01738438; Closure effective date 6/30/2014]).
- Transcription factor binding site selection: Transcription factor binding sites (TFBSs) were downloaded from the GTRD database (Yevshin, I. et al. GTRD: A database on gene transcription regulation - 2019 update. Nucleic Acids Research 47, D100-D105 (2019)).
- This database contains a compilation of ChIP seq data from various sources.
- we used the meta clusters data version 19.10 (downloaded from gtrd.biouml.org/downloads/19.10/chip-seq/Homo%20sapiens_meta_clusters.interval.gz). This contains meta peaks observed in one or more ChIP seq experiments.
- the GTRD database contains some ChIP seq experiments for targets that are not transcription factors (TFs).
- DNase I hypersensitivity sites for a variety of tissue types were downloaded from zenodo.org/record/3838751/files/DHS_Index_and_Vocabulary_hg38_WM20190703.txt.gz (Meuleman, W. et al. Index and biological spectrum of human DNase I hypersensitive sites. Nature 584, 244-251 (2020)). These sites were split by tissue type for a total of 16 site lists. The 'summit' column was used as the site position. The sites were sorted by the number of samples where that site had been observed ('numsamples') and the top 10,000 most frequently observed sites were selected for each tissue type.
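- The site selection described above (split by tissue type, rank by 'numsamples', keep the top sites, and use 'summit' as the position) can be sketched as follows. The 'summit' and 'numsamples' keys follow the column names given above; the 'tissue' and 'chrom' keys and the function name are assumptions for illustration.

```python
def top_sites_by_tissue(rows, n_top=10000):
    """Select the most frequently observed sites for each tissue type.

    rows: iterable of dicts with keys 'tissue', 'chrom', 'summit', and
    'numsamples'. Returns a dict mapping tissue to a list of
    (chrom, summit) tuples ordered from most to least frequently observed.
    """
    by_tissue = {}
    for row in rows:
        by_tissue.setdefault(row["tissue"], []).append(row)
    selected = {}
    for tissue, sites in by_tissue.items():
        ranked = sorted(sites, key=lambda r: r["numsamples"], reverse=True)
        selected[tissue] = [(r["chrom"], r["summit"]) for r in ranked[:n_top]]
    return selected


# Hypothetical rows; real input would be parsed from the downloaded index file.
rows = [
    {"tissue": "T1", "chrom": "chr1", "summit": 100, "numsamples": 5},
    {"tissue": "T1", "chrom": "chr1", "summit": 200, "numsamples": 9},
    {"tissue": "T1", "chrom": "chr2", "summit": 300, "numsamples": 7},
    {"tissue": "T2", "chrom": "chr3", "summit": 400, "numsamples": 1},
]
sites = top_sites_by_tissue(rows, n_top=2)
```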
- a differential expression experiment was run using the 'DESeq' and 'results' functions followed by log fold change shrinkage using the 'lfcShrink' function. Sites with a q-value < 5*10^-4 were selected. Additionally, selected sites were further filtered based on the log2 fold change between ER+ and ER- tumors. Sites with a log2 fold change > 0.5 were classified as ER+ specific, while sites with a log2 fold change < -0.5 were classified as ER- specific. These site lists were further split into sites shared with hematopoietic cells and those not shared with hematopoietic cells. Hematopoietic sites were obtained from a database of single cell ATAC-seq data (Satpathy, A.
- nucleosome profiling with and without GC correction was performed on the top 10,000 sites for each of 377 TFs.
- the MAD of the central coverage values was calculated both before and after GC correction.
- the MAD values before and after GC correction were compared using a Wilcoxon signed-rank test (two-sided).
- the realignment procedure was the same as above but using the hg19 genome (downloaded from hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz) and hg19 known polymorphic sites for base recalibration (downloaded from gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg37/Mills_and_1000G_gold_standard.indels.
- First, we applied nucleosome profiling using 100-200 bp fragments to the 377 TFs from GTRD and extracted 3 features per profile for a total of 1131 features. We then used PCA to identify the components that explained 80% of the variance as described above. Second, we applied nucleosome profiling using 100-200 bp fragments to the 4 ER differential ATAC-seq lists and extracted 3 features per profile for a total of 12 features. Lastly, we applied nucleosome profiling using 35-150 bp fragments to the 4 ER differential ATAC-seq lists and extracted 3 features per list for a total of 12 features.
- Sequencing data used in this study was obtained from dbGaP (accession phs001417.v1.p1) and EGA (dataset ID EGAD00001005339).
- Griffin software and the subtype classifier tool can be obtained from github.com/adoebley/Griffin. Code for analysis and machine learning models can be accessed at github.com/adoebley/Griffin_analyses.
- Example 1 above is a proof-of-concept demonstration that sequence analysis applying an embodiment of the Griffin workflow can enhance sequence signals with sufficient power and specificity to allow determination of breast cancer subtypes from low pass sequencing data.
- This Example expands the application of the Griffin workflow to other cancer types and makes use of data from an alternative sequence profiling platform. Specifically, histone modification profiling was performed using CUT&RUN on different subtypes of prostate cancer cells. As with Example 1, the Griffin workflow provided robust signals that clearly differentiated the subtypes of prostate cancer, demonstrating the power and flexibility of the analytic workflow.
- Background
- Metastatic castration-resistant prostate cancer (mCRPC) describes the stage in which the disease has developed resistance to androgen ablation therapies and is lethal. Androgen receptor signaling inhibitors (ARSI), designed for the treatment of CRPC, repress androgen receptor (AR) activity and improve survival, but these therapies eventually fail. Since the adoption of ARSI as standard-of-care for mCRPC, there has been a prominent increase in the frequency of treatment-resistant tumors with neuroendocrine (NE) differentiation and features of small cell carcinomas. These aggressive tumors may develop through a resistance mechanism of trans-differentiation from AR-positive adenocarcinoma (ARPC) to NE prostate cancer (NEPC) that lacks AR activity.
- ARPC AR-positive adenocarcinoma
- NEPC neuroendocrine
- Additional phenotypes can also arise based on expression of AR activity and NE genes, including AR-low prostate cancer (ARLPC) and double-negative prostate cancer (DNPC; AR-null/NE-null).
- ARLPC AR-low prostate cancer
- DNPC double-negative prostate cancer
- Distinguishing prostate cancer subtypes has clinical relevance in view of differential responses to therapeutics, but the need for a biopsy to diagnose tumor histology can be challenging: invasive procedures are expensive and accompanied by morbidity, a subset of tumors are not accessible to biopsy, and bone sites pose particular challenges with respect to sample quality.
- Circulating tumor DNA (ctDNA) released from tumor cells into the blood as cell-free DNA (cfDNA) is a non-invasive "liquid biopsy" solution for accessing tumor molecular information.
- the analysis of ctDNA to detect mutation and copy-number alterations has served to classify genomic subtypes of CRPC tumors.
- the defining losses of TP53 and RB1 in NEPC do not always lead to NE trans-differentiation. Rather, ARPC and NEPC tumors are associated with distinct reprogramming of transcriptional regulation.
- Methylation analysis of cfDNA in mCRPC to profile the epigenome shows promise for distinguishing phenotypes, but requires specialized assays such as bisulfite treatment, enzymatic treatment, or immunoprecipitation.
- cfDNA represents DNA protected by nucleosomes when released from dying cells into circulation, leading to DNA fragmentation that is reflective of the non-random enzymatic cleavage by nucleases.
- Emerging approaches to analyze cfDNA fragmentation patterns from plasma for studying cancer can be performed directly from standard whole genome sequencing (WGS).
- cfDNA fragments have the characteristic size of 167 bp, consistent with protection by a single core nucleosome octamer and histone linkers, but the size distribution may vary between healthy individuals and cancer patients.
- TSS transcription start site
- TFBS transcription factor binding site
- nucleosome positioning and spacing are dynamic in active and repressed gene regulation. A detailed understanding of the nucleosome organization and positioning patterns associated with transcriptional regulation has not been fully explored in cfDNA.
- A major challenge for ctDNA analysis is the low tumor content (tumor fraction) in patient plasma samples.
- plasma from patient-derived xenograft (PDX) models may contain nearly pure human ctDNA after bioinformatic exclusion of mouse DNA reads. This provides a resource that is ideal for studying the properties of ctDNA, developing new analytical tools, and validating both genetic and phenotypic features by comparison to matching tumors.
- Deep WGS of ctDNA from mouse plasma across 24 CRPC PDX lines with diverse phenotypes was performed.
- the models consisted of 18 classified as ARPC, two classified as AR-low and NE-negative prostate cancer (ARLPC), and six classified as NEPC (FIG. 11A).
- CUT&RUN Cleavage Under Targets and Release using Nuclease
- PTMs H3K27me3 histone post-translational modifications
- nucleosome organization inferred from ctDNA reflects the transcriptional activity state regulated by histone PTMs (Zhou et al. (2011). Charting histone modifications and the functional organization of mammalian genomes. Nat Rev Genet 12, 7-18).
- ctDNA coverage at TFBSs was aggregated into composite profiles representing the inferred activity (Example 1 and Doebley et al., 2021; Ulz et al. (2019). Inference of transcription factor binding from cell-free DNA enables tumor subtype prediction and early detection. Nature Communications 10, 4666). Similarly, features in the composite profiles of subtype-specific open chromatin regions were extracted for analyzing the signatures of chromatin accessibility in ctDNA. Altogether, a multi-omic sequencing dataset was assembled from matching tumor and plasma for a total of 24 PDX lines, making this a unique molecular resource and platform for developing transcriptional regulation signatures of tumor phenotype prediction from ctDNA.
- Characterizing transcriptional activity of AR and ASCL1 in PDX phenotypes through analysis of tumor histone modifications and ctDNA
- RNA Splicing Factors SRRM3 and SRRM4 Distinguish Molecular Phenotypes of Castration-Resistant Neuroendocrine Prostate Cancer. Cancer Research 81, 4736-4750). The transcriptional activity was further characterized in different tumor phenotypes by studying epigenetic regulation via histone PTMs.
- Broad peak regions for H3K4me1 (median of 17,643 regions, range 1,894 - 64,934), H3K27ac (median 7,093, range 1,610 - 34,047), and H3K27me3 (median 8,737, range 2,024 - 42,495) were identified in the tumors of the 24 PDX lines and an additional nine LuCaP PDX lines where only tumor was available (total of 25 ARPC, 2 ARLPC, and 6 NEPC) (Methods).
- H3K27ac and H3K4me1 mark putative active regulatory regions of enhancers and promoters.
- H3K27me3 is a gene-repressive heterochromatic mark.
- AR and ASCL1 are two key differentially expressed TFs with known regulatory roles in ARPC and NEPC phenotypes, respectively (Brady et al. (2021). Temporal evolution of cellular heterogeneity during the progression to advanced AR-negative prostate cancer. Nat Commun 12, 3372; Cejas et al. (2021). Subtype heterogeneity and epigenetic convergence in neuroendocrine prostate cancer. Nat Commun 12, 5775; Rapa et al. (2008). Human ASH1 expression in prostate cancer with neuroendocrine differentiation. Mod Pathol 21, 700-707; Wang et al. (2020). Molecular tracing of prostate cancer lethality. Oncogene 39, 7225-7238).
- the ctDNA composite coverage profiles were analyzed at TFBSs to evaluate the nucleosome accessibility, whereby lower normalized central (±30 bp window) mean coverage across these sites suggests more nucleosome depletion (Methods).
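As a rough illustration of the central coverage feature, the sketch below computes the mean normalized coverage in the ±30 bp window of a binned composite profile. It assumes 15 bp bins labeled by their start offset relative to the site and assumes normalization by the mean coverage of the whole profile; both the function name and the normalization choice are illustrative assumptions rather than the disclosed implementation.

```python
from statistics import mean

def central_coverage(positions, coverage, half_window=30):
    """Mean normalized coverage of bins whose start offset lies in
    [-half_window, half_window). Normalization by the profile mean is
    an assumption made for this sketch."""
    norm = mean(coverage)
    if norm == 0:
        raise ValueError("profile has no coverage")
    central = [c / norm for p, c in zip(positions, coverage)
               if -half_window <= p < half_window]
    return mean(central)

# Hypothetical composite profile: 15 bp bins from -90 to +75 around the site,
# with a dip in coverage at the center.
positions = list(range(-90, 90, 15))
coverage = [10, 10, 10, 10, 5, 5, 5, 5, 10, 10, 10, 10]
value = central_coverage(positions, coverage)
print(round(value, 3))
```

A lower value indicates a deeper central dip, which the text interprets as stronger nucleosome depletion at the sites.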
- the composite coverage profile at ASCL1 TFBSs showed the strongest nucleosome depletion for NEPC samples (mean central coverage 0.69) compared to ARLPC (0.86) and ARPC (0.88) (FIG. 12C). These observations were consistent with the differential binding activity by AR and ASCL1 in their respective phenotypes from tumor tissue. Furthermore, the coverage patterns of nucleosome depletion in ctDNA resembled the NDR flanked by nucleosomes with H3K27ac and H3K4me1 peak profiles, which was exemplified when analyzing only nucleosome-sized fragments (140 bp - 200 bp) generated by CUT&RUN (FIG. 12A). Together, these results suggest that the nucleosome depletion in ctDNA at AR and ASCL1 binding sites represents active TF binding and regulatory activity in specific prostate PDX tumor phenotypes.
- Nucleosome patterns at gene promoters inferred from ctDNA are consistent with transcriptional activity for phenotype-specific genes
- RNA Splicing Factors SRRM3 and SRRM4 Distinguish Molecular Phenotypes of Castration-Resistant Neuroendocrine Prostate Cancer. Cancer Research 81, 4736-4750) were selected and confirmed by differential expression analysis from PDX tumor RNA-Seq data (FIG. 12D, Methods).
- increased coverage was observed at the TSS of AR (1.08) in NEPC and ASCL1 (0.42) in ARPC, which supports the nucleosome depletion in the absence of PTMs and inactive transcription.
- TFBSs in PDX ctDNA were identified based on the intersection of 338 TFs analyzed using Griffin and 404 differentially expressed TFs between ARPC and NEPC PDX tumors (Methods). Of these TFs, 38 had significantly different accessibility in ctDNA between ARPC and NEPC phenotypes (two-tailed Mann-Whitney U test, Benjamini-Hochberg adjusted p < 0.05). Through unsupervised hierarchical clustering of composite TFBS central coverage values for the 107 TFs, distinct groups of TFs were observed in PDX ctDNA (FIG. 13).
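The multiple-testing step can be sketched as follows. This is a minimal, standard-library implementation of the Benjamini-Hochberg adjustment applied to a list of per-TF p-values; the p-values shown are made up for illustration, and the per-TF two-sided Mann-Whitney U p-values are assumed to have been computed beforehand (for example with `scipy.stats.mannwhitneyu(..., alternative="two-sided")`).

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values are monotone non-decreasing in the rank of the raw p-value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values for four TFs.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
significant = [p < 0.05 for p in adj]
print(adj, significant)
```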
- FOXA1, and GRHL2 were significantly more accessible in ARPC (and ARLPC) samples compared to NEPC (log2 fold-change < -0.57, adjusted p < 1.3 x 10^-3).
- AR, HOXB13, and NKX3-1 had higher accessibility in ARPC compared to NEPC (log2 fold-change < -0.37, adjusted p < 1.3 x 10^-3), but with only moderate accessibility in ARLPC, as expected.
- Other TFs including RUNX1, BCL11B, POU3F2, NEUROG2, and SOX2 also had higher activity in NEPC (log2 fold-change > 0.06, adjusted p < 0.048), although the difference was modest.
- HEY1, IRF1, and IKZF1 had a similar trend consistent with increased accessibility in NEPC samples but were not significantly different from ARPC (adjusted p > 0.10).
- Other notable factors such as MYC and ETS transcription family genes (ETV4, ETV5, ETS1, ETV1) had high accessibility across all phenotypes, while NEUROD1, RUNX3, and TP63 were inaccessible in nearly all samples.
- ETV4 ETS transcription family genes
- ASCL1, NR3C1, HNF4G, HNF1A, and SOX2 Arora et al. (2013). Glucocorticoid Receptor Confers Resistance to Antiandrogens by Bypassing Androgen Receptor Blockade.
- Phenotype-specific open chromatin regions in PDX tumor tissue are reflected in ctDNA profiles of nucleosome accessibility
- Nucleosome profiling from cfDNA sequencing analysis has shown agreement with overall chromatin accessibility in tumor tissue (Snyder et al. (2016). Cell-free DNA Comprises an in Vivo Nucleosome Footprint that Informs Its Tissues-Of-Origin. Cell 164, 57-68; Sun et al. (2019). Orientation-aware plasma cell-free DNA fragmentation analysis in open chromatin regions informs tissue of origin. Genome Research 29, 418-427; Ulz et al. (2019). Inference of transcription factor binding from cell-free DNA enables tumor subtype prediction and early detection.
- cfDNA released by hematopoietic cells which leads to a lower ctDNA fraction (i.e., tumor fraction).
- a probabilistic model was developed to estimate the proportion of ARPC and NEPC from an individual plasma sample, accounting for the tumor fraction (Methods).
- the focus was placed on the phenotype-specific open chromatin composite site features, and the PDX plasma ctDNA signals (FIGS. 14B and 14C) were used to inform the model.
- the model produces a normalized prediction score that represents the estimated signature of ARPC (lower values) and NEPC (higher values).
- the study presented here is believed to be the largest sequencing study to date of human ctDNA from mouse plasma of PDX models.
- the sequencing of mouse plasma provided a unique opportunity to comprehensively interrogate the epigenetic nucleosome patterns in ctDNA from well-characterized tumor models.
- Computational methodologies were developed and applied to construct a multitude of ctDNA features, each of which was associated with the transcriptional regulation in the LuCaP PDX models across CRPC tumor phenotypes.
- a probabilistic model was developed to accurately classify ARPC and NEPC phenotypes from patient plasma in three clinical cohorts.
- PDX mouse plasma overcomes the challenge of low ctDNA content or incomplete knowledge of the tumor when studying patient samples and can expedite development of cfDNA diagnostics, basic cancer research, and clinical translation.
- LuCaP ctDNA sequencing data complements the maturing characterization of CRPC tumor phenotypes from tissue.
- the ctDNA data and the disclosed approaches expand on the potential utility of PDX models for translational research. While these data were focused on ARPC and NEPC phenotypes, this study can serve as a framework for the use of PDX plasma from additional CRPC phenotypes and other cancer models.
- LuCaP PDX ctDNA sequencing data confirmed the activity of key regulators between ARPC and NEPC phenotypes, including a set of 47 established differentially expressed gene markers. While gene expression inference from ctDNA has been shown in proof-of-concept studies (Ulz et al. (2016b). Inferring expressed genes by whole-genome sequencing of plasma DNA. Nature Genetics 48, 1273-1278; Zhu et al. (2021). Tissue-specific cell-free DNA degradation quantifies circulating tumor DNA burden. Nature Communications 12, 2229), the PDX ctDNA allowed for a detailed dissection of nucleosome organization associated with transcriptional activity of individual genes that define the tumor phenotypes.
- ASCL1 Glucocorticoid Receptor Confers Resistance to Antiandrogens by Bypassing Androgen Receptor Blockade. Cell 155, 1309-1322; Chaytor et al., 2019; Shukla et al., 2017).
- ASCL1 is a pioneer TF with roles in neuronal differentiation and was recently described to be active during NE trans-differentiation and in NEPC (Cejas et al., 2021; Rapa et al., 2008). To our knowledge, this study is the first to demonstrate ASCL1 binding site accessibility and provide a detailed characterization of its transcriptional activity in NEPC from plasma ctDNA.
- This model does not require training on patient samples but does require tumor fraction estimates (ichorCNA; Adalsteinsson et al. (2017). Scalable whole-exome sequencing of cell-free DNA reveals high concordance with metastatic tumors. Nature Communications 8) and a prediction score cutoff determined from DFCI cohort I.
- the framework presented here can be extended to model multiple phenotype classes, provided the informative parameters for these additional states can be learned. Insights from additional datasets such as single-cell nucleosome and accessibility profiling (Fang et al. (2021). Comprehensive analysis of single cell ATAC-seq data with SnapATAC. Nat Commun 12, 1337; Wu et al. (2021). Single-cell CUT&Tag analysis of chromatin modifications in differentiation and tumor progression. Nat Biotechnol 39, 819-824) of PDX tumors and clinical samples may improve the resolution for ctDNA analysis.
- Tumor heterogeneity and coexistence of different molecular phenotypes are common in mCRPC where treatment-induced phenotypic plasticity may vary within and between tumors in an individual patient. Larger studies with comprehensive assessment of the tumor histologies will be needed for developing future extensions of the model to predict mixed phenotypes from ctDNA.
- LuCaP patient-derived xenograft tumors (established at the University of Washington) were initiated from tumor specimens resected from men with advanced prostate cancer. The establishment and characterization of the PDX models were described previously (Lam et al. (2016). Generation of Prostate Cancer Patient-Derived Xenografts to Investigate Mechanisms of Novel Treatments and Treatment Resistance. In Prostate Cancer: Methods and Protocols, Z. Culig, ed. (New York, NY: Springer), pp. 1-27). PDXs were propagated in vivo in male NOD scid IL2R-gamma-null (NSG) mice from Jackson Labs (cat#005557).
- mice were caged in a pathogen-free facility, given unlimited access to food and water, and maintained on a 12-hour light/dark cycle. Surgeries were performed under isoflurane anesthesia, and mice were given supplemental buprenorphine sustained release (SR). PDX lines were evaluated using histopathology by at least two expert pathologists, and histological phenotypic subtype annotations were orthogonally validated based on transcriptome-derived signature marker expression scores to define phenotypes (Beltran et al. (2016).
- UW cohort Blood samples were collected from men with metastatic castration resistant prostate cancer at the University of Washington (collected under University of Washington Human Subjects Division IRB protocol number CC6932 between years 2014-2021). In this study, 61 plasma samples from 30 patients were analyzed. After initial ultra-low pass whole genome sequencing (ULP-WGS) analysis, 47 plasma samples from 30 patients were retained for further high depth of coverage whole genome sequencing (WGS) analysis. All samples were de-identified prior to ctDNA analysis and a double blinded approach was employed for evaluating clinical phenotype predictions. The initial patient selection was done based on clinical disease burden information and the availability of clinically derived phenotypic subtype annotation. Clinical information on these patients is protected due to IRB protocol restrictions.
- DFCI cohort I Plasma was collected from men diagnosed with mCRPC and treated at the Dana-Farber Cancer Institute (DFCI), Brigham and Women's Hospital, or Weill Cornell Medicine (WCM) between April 2003 and August 2021. All patients provided written informed consent for research participation and genomic analysis of their biospecimen and blood. The use of samples was approved by the DFCI (#01-045 and 09-171) and WCM (1305013903) IRBs. ULP-WGS data at mean coverage 0.5x (range 0.3x - 0.9x) for 101 patients were published previously (Berchuck et al. (2022). Detecting Neuroendocrine Prostate Cancer Through Tissue-Informed Cell-Free DNA Methylation Analysis. Clinical Cancer Research 28, 928-938).
- DFCI cohort II Plasma samples in this cohort were collected from men diagnosed with mCRPC and treated at the Dana-Farber Cancer Institute (DFCI). All patients provided written informed consent for blood collection and the analysis of their clinical and genetic data for research purposes (DFCI Protocol # 01-045 and 11-104). WGS data at mean coverage 27x (range 11x - 44x) (Viswanathan et al. (2016). Structural Alterations Driving Castration-Resistant Prostate Cancer Revealed by Linked-Read Genome Sequencing. Cell 174, 433-447.e19), and ULP-WGS data at mean coverage 0.13x (range 0.07x - 0.18x) (Adalsteinsson et al. (2017).
- Healthy donor plasma cfDNA WGS data used in this study were obtained from previously published studies. Two samples (HD45 and HD46) with coverage of 13x and 15x, respectively, were accessed from dbGaP under accession phs001417 (Adalsteinsson et al. (2017). Nature Communications 8; Viswanathan et al. (2016). Cell 174, 433-447.e19). These donors were consented under DFCI protocol IRB (# 03-022).
- Blood samples were collected from NSG mice bearing subcutaneous PDX tumors at the time of sacrifice.
- the PDX lines were maintained at vivaria in the University of Washington and FHCRC.
- the blood was processed following methods described for human plasma DNA processing for subsequent DNA isolation.
- Blood was collected in purple cap EDTA tubes and processed within 4 hours. All blood samples were double spun using centrifugation at 2500g for 10 minutes followed by a 16000g spin of the plasma fraction for 10 minutes at room temperature.
- 7-10 mouse plasma samples were pooled. Processed plasma samples were preserved in clean, screw- capped cryo-microfuge tubes and stored at -80°C prior to cfDNA isolation.
- the QIAamp Circulating Nucleic Acid Kit was used to isolate cfDNA from PDX mouse-derived plasma using the recommended protocol.
- the pooled plasma samples from 7-10 mice for each PDX line contained ~2-3 mL total plasma volume for each line.
- the filter retention-based cfDNA kit method does not implement any fragment size class enrichment.
- Carrier RNA spike-in was excluded from elution buffer.
- Isolated cfDNA was quantified using the Qubit dsDNA HS assay (Invitrogen) and the cfDNA fragment size profiles were analyzed using Tapestation HS D5000 and HS D1000 assays (Agilent).
- NGS libraries were prepared with 50 ng input cfDNA.
- Illumina NGS sequencing libraries were prepared with the KAPA hyperprep kit, adopting nine cycles of amplification, and purified using lab standardized SPRI beads. KAPA UDI dual indexed library adapters were used. Library concentrations were balanced and pooled for multiplexing and sequenced using the Illumina HiSeq 2500 at the Fred Hutch Genomics Shared Resources (200 cycles) and Illumina NovaSeq platform at the Broad Institute Genomics Platform Walkup-Seq Services using S4 flow cells (300 cycles). To match with Illumina HiSeq 2500 data, truncated 200 cycles FASTQ files were generated (100 bp paired end reads).
- ARPC and ARLPC vs. ARPC The results were then filtered using a list of 1,635 human transcription factors published previously (Lambert et al. (2016). The Human Transcription Factors. Cell 172, 650-665), which resulted in 514 genes with FDR < 0.05 and fold change > 3. Out of these 514, deregulation of gene expression for 404 transcription factor genes delineated ARPC from NEPC.
- CUT&RUN is an antibody targeted enzyme tethering chromatin profiling assay in which controlled cleavage by micrococcal nuclease releases specific protein-DNA complexes into the supernatant for paired-end DNA sequencing analysis.
- CUT&RUN assays were performed for three histone modifications, H3K27ac, H3K4me1, and H3K27me3, according to published protocols (Skene and Henikoff (2017). An efficient targeted nuclease strategy for high-resolution mapping of DNA binding sites. ELife 6, e21856).
- CUT&RUN was performed on LuCaP PDX tumors using ~75 mg flash-frozen tissue pieces.
- frozen tissues were thoroughly chopped into small pieces and converted into smaller clusters of cells using collagenase and dispase.
- Cell clusters were permeabilized using digitonin and nutated with target antibody in EDTA antibody buffer.
- Time-sensitive micrococcal nuclease enzyme treatments were performed on ice. Released DNA was precipitated along with glycogen carrier, and subsequent NGS libraries were prepared using a picogram-input DNA library preparation protocol.
- Paired-end (50 bp) sequencing was performed and reads were aligned using bowtie2 version 2.4.2 (Langmead et al. (2019). Scaling read aligners to hundreds of threads on general-purpose processors. Bioinformatics 35, 421-432) to the hg38 human reference assembly. Aligned reads were processed as described in the SEACR protocol (github.com/FredHutch/SEACR#preparing-input-bedgraph-files). Peaks were called using SEACR version 1.3 (Meers et al. (2019). Peak calling by Sparse Enrichment Analysis for CUT&RUN chromatin profiling.
- Genome-wide peak heatmaps, targeted heatmaps, and respective profiles were plotted using deepTools. bigWig formatted files for each phenotype were obtained using the mean function in wiggletools 1.2.8 and deepTools computeMatrix. Phenotype-specific informative region coordinates were obtained from diffBind v3.5.0, and the top 10,000 most significant regions (all with FDR < 0.05) differentially open between ARPC and NEPC lines were used for downstream feature analyses (see Gene body and promoter region selection for additional subsetting criteria applied on a feature by feature basis). For heatmaps and profiles the plotHeatmap function was used.
- Differential PTM analysis was performed with the Diffbind version 2.16.0 package (Ross-Innes et al. (2012). Differential oestrogen receptor binding is associated with clinical outcome in breast cancer. Nature 481, 389-393) in R-4.0.1 using standard parameters (bioconductor.riken.jp/packages/3.0/bioc/html/DiffBind.html).
- ARPC, NEPC and ARLPC samples were grouped by histopathological and transcriptome signature defined phenotypes described in the "PDX mouse models" section. Samples were loaded with the dba function, reads counted with the dba.count function, and contrast specified as phenotype with dba.contrast and a minimum of 2 members.
- Differential peak sites were computed with the dba.analyze function with default settings. Differential peak binding of NEPC and ARLPC was computed against ARPC samples. Unique binding sites in NEPC and ARLPC were catalogued using bedtools v2.29.2 (Quinlan and Hall (2010). BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841-842). Intergroup differentially bound peaks were annotated using ChIPseeker 1.28.3 (Yu et al. (2015). ChIPseeker: an R/Bioconductor package for ChIP peak annotation, comparison and visualization. Bioinformatics 31, 2382-2383) and TxDb.Hsapiens.UCSC.hg38.knownGene 3.2.2 in R 4.1.0.
- ATAC-Seq sequence data for 15 tumor samples from 10 PDX lines were published previously and FASTQ files made available upon request (Cejas et al. (2021). Subtype heterogeneity and epigenetic convergence in neuroendocrine prostate cancer. Nat Commun 12, 5775). These lines included LuCaP PDX lines with ARPC histology (23.1, 77, 78, 81, 96) and NEPC histology (two replicates each of 49, 93, 145.1, 173.1 and one replicate of 145.2). Paired end reads were aligned using bowtie2 2.4.2 (Langmead et al. (2019). Scaling read aligners to hundreds of threads on general-purpose processors.
- RNA-Seq derived phenotypes Phenotype specific binding sites were isolated by first selecting for positive fold change open chromatin enrichment and then using Intervene 0.6.5 (Khan and Mathelier (2017). Intervene: a tool for intersection and visualization of multiple gene or genomic region sets. BMC Bioinformatics 18, 287) where regions were considered overlapping if they shared at least 1 bp.
- Regions with FDR adjusted p-values < 0.05 were then subset to those overlapping the 338,000 established TFBSs (338 TFs x 1,000 binding sites, see Griffin analysis for site selection) by at least 1 bp using BedTools Intersect. Only regions that overlapped an established TFBS were retained.
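The overlap criterion can be expressed in a few lines. The sketch below is a plain-Python stand-in for the interval intersection step, assuming BED-style half-open coordinates; the function names and the example coordinates are hypothetical, and a real analysis of this size would use an indexed tool such as BedTools rather than this quadratic scan.

```python
def overlaps(a, b, min_bp=1):
    """True if intervals a and b (chrom, start, end; half-open) share
    at least `min_bp` bases on the same chromosome."""
    if a[0] != b[0]:
        return False
    shared = min(a[2], b[2]) - max(a[1], b[1])
    return shared >= min_bp

def retain_overlapping(regions, tfbs, min_bp=1):
    """Keep only regions overlapping at least one established TFBS."""
    return [r for r in regions if any(overlaps(r, t, min_bp) for t in tfbs)]

# Hypothetical coordinates for illustration only.
regions = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 100, 200)]
tfbs = [("chr1", 199, 250), ("chr1", 400, 450)]
kept = retain_overlapping(regions, tfbs)
print(kept)
```

The second region only touches a TFBS at its boundary (zero shared bases), so it is not retained under the 1 bp criterion.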
- Griffin is a method for profiling nucleosome protection and accessibility on predefined genomic loci (see Example 1 and Doebley et al. (2021). Griffin: Framework for clinical cancer subtyping from nucleosome profiling of cell-free DNA. MedRxiv 2021.08.31.21262867). Griffin filters sites by mappability, estimates and corrects GC bias on a per fragment level, and generates GC-corrected coverage profiles around each site. First, griffin takes a site list and examines the mappability in a window (+/- 5000 bp around each site). Mappability (hg38 Umap multi-read mappability for 50bp reads) was obtained from UCSC genome browser (Karimzadeh et al. (2018).
- GC biases were then smoothed by taking the median of values for fragments with similar lengths and GC contents (k nearest neighbors smoothing) to generate smoothed GC bias values.
- nucleosome profiling was performed in each sample. For each mappable site of interest, fragments aligning to the region ±5000 bp from the site were fetched from the bam file. Fragments were filtered to remove duplicates and low-quality alignments (<20 mapping quality) and by fragment length. Nucleosome size fragments (140-250 bp) were retained. Fragments were then GC corrected by assigning each fragment a weight of 1/GC_bias for that given fragment length and GC content and the fragment midpoint was identified. The number of weighted fragment midpoints in 15bp bins across the site were counted.
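The filtering, GC weighting, and midpoint binning steps above can be sketched as follows. This is an illustrative, standard-library version operating on already-parsed fragments; the dictionary-based fragment records, the `gc_bias` lookup keyed by (length, GC count), and the function name are assumptions made for the example, not the Griffin code, which reads fragments from a BAM file.

```python
from collections import defaultdict

def weighted_midpoint_profile(fragments, site, gc_bias, window=5000,
                              bin_size=15, min_len=140, max_len=250,
                              min_mapq=20):
    """Count GC-weighted fragment midpoints in bins around `site`.
    Returns {bin start offset relative to site: summed weight}."""
    profile = defaultdict(float)
    for f in fragments:
        length = f["end"] - f["start"]
        if f["is_duplicate"] or f["mapq"] < min_mapq:
            continue                      # drop duplicates and low-quality alignments
        if not (min_len <= length <= max_len):
            continue                      # keep nucleosome-size fragments only
        offset = (f["start"] + f["end"]) // 2 - site
        if not (-window <= offset < window):
            continue
        weight = 1.0 / gc_bias[(length, f["gc"])]   # GC correction weight
        profile[(offset // bin_size) * bin_size] += weight
    return dict(profile)

# Hypothetical fragments and bias values for illustration.
gc_bias = {(167, 80): 0.5, (170, 90): 2.0}
fragments = [
    {"start": 1000, "end": 1167, "mapq": 60, "is_duplicate": False, "gc": 80},
    {"start": 1000, "end": 1167, "mapq": 60, "is_duplicate": True, "gc": 80},
    {"start": 1000, "end": 1167, "mapq": 10, "is_duplicate": False, "gc": 80},
    {"start": 1000, "end": 1100, "mapq": 60, "is_duplicate": False, "gc": 50},
    {"start": 1010, "end": 1180, "mapq": 60, "is_duplicate": False, "gc": 90},
]
profile = weighted_midpoint_profile(fragments, site=1090, gc_bias=gc_bias)
print(profile)
```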
- TFBS Transcription factor binding site
- TFs 1,314 transcription factors
- CIS-BP CIS-BP database
- TFs from GTRD that were also in CIS-BP and had a known binding motif were retained.
- Selected TF binding genomic loci were then filtered for mappability as described above (Griffin analysis) and TFs with fewer than 10,000 highly mappable sites on autosomes were excluded, resulting in 338 TFs.
- In downstream analysis of LuCaP PDX cfDNA, if any lines did not meet specific criteria in a region (including differentially open histone modification regions), that feature/region combination was excluded from analysis, leading to a variable, lower number of regions considered depending on the feature. These criteria included requiring at least 10 total fragments in a region for all Fragment size analysis (see below) and a non-zero number of "short" and "long" fragments for the short-long ratio; short-long ratios less than 0.01 or greater than 10.0 were also excluded as outliers. Any region with no coverage in a line was excluded from all analyses. This resulted in gene lists that differed in numbers between genomic contexts and feature types.
- Fragments were first filtered to remove duplicates and low-quality alignments
- fragment short-long ratio (FSLR) was computed as the ratio of short
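The region-level exclusion criteria for the short-long ratio can be illustrated with a small sketch. The fragment-length ranges that define "short" and "long" are not given in this passage, so they are required arguments here, and the values used in the example (100-150 bp and 151-220 bp) are purely hypothetical placeholders; the function name is also an assumption.

```python
def short_long_ratio(lengths, short_range, long_range,
                     min_fragments=10, min_ratio=0.01, max_ratio=10.0):
    """Return the short/long fragment ratio for one region, or None if
    the region fails any of the exclusion criteria."""
    if len(lengths) < min_fragments:
        return None                                   # too few fragments
    n_short = sum(short_range[0] <= n <= short_range[1] for n in lengths)
    n_long = sum(long_range[0] <= n <= long_range[1] for n in lengths)
    if n_short == 0 or n_long == 0:
        return None                                   # ratio undefined or zero
    ratio = n_short / n_long
    if ratio < min_ratio or ratio > max_ratio:
        return None                                   # treated as an outlier
    return ratio

# Hypothetical size classes and fragment lengths for illustration.
SHORT, LONG = (100, 150), (151, 220)
ok = short_long_ratio([120] * 4 + [170] * 8, SHORT, LONG)
too_few = short_long_ratio([120, 170, 170], SHORT, LONG)
outlier = short_long_ratio([120] * 11 + [170], SHORT, LONG)
print(ok, too_few, outlier)
```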
- Admixtures for evaluating benchmarking performance were constructed using 5 ARPC (LuCaP 35, 35CR, 58, 92, 136CR) and 5 NEPC (LuCaP 49, 93, 145.2, 173.1, 208.4) lines mixed to 1%, 5%, 10%, 20%, and 30% tumor fraction with a single healthy donor plasma line (NPH004, EGAD00001005343) at ~25X mean coverage, assuming 100% tumor fraction in post-mouse-subtracted PDX sequencing data. After extracting chromosomal DNA with SAMtools (Danecek et al. (2021). Twelve years of SAMtools and BCFtools.
- ichorCNA (Adalsteinsson et al. (2017). Scalable whole-exome sequencing of cell-free DNA reveals high concordance with metastatic tumors. Nature Communications 8) with binSize 1,000,000 bp and hg19 reference genome. Default tumor fraction estimates reported by ichorCNA were used. See github.com/GavinHaLab/CRPCSubtypingPaper/tree/main/ichorCNA_configuration for complete configuration settings.
- a probabilistic model was developed to classify the mCRPC phenotype (ARPC or NEPC) in an individual patient plasma ctDNA sample.
- This is a generative mixture model that is unsupervised: it does not train on the patient cohort of interest.
- the model accepts the pre-estimated tumor fraction from ichorCNA for the given patient ctDNA sample, as well as the pre-computed ctDNA feature values from the LuCaP PDX ctDNA and healthy donor ctDNA as prior information. For each patient ctDNA sample, it fits the heterogeneous tumor fractions against the pure PDX LuCaP models.
- Q has range [0,1], where higher values indicate an increased proportion of the sample having a NEPC phenotype and was used as the NEPC prediction score metric.
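To make the idea of a tumor-fraction-aware mixture concrete, the sketch below estimates a NEPC proportion by maximum likelihood on a grid. It is a simplified illustration only: the linear mixing of ARPC, NEPC, and healthy reference feature means, the independent Gaussian noise with fixed standard deviations, the grid search, and all names and numbers are assumptions of this example and are not taken from the disclosed model, whose implementation is available at the repository cited below.

```python
import math

def estimate_nepc_score(x, tfx, mu_arpc, mu_nepc, mu_healthy, sigma,
                        steps=1000):
    """Grid-search the NEPC proportion q in [0, 1] that maximizes a
    Gaussian log-likelihood of the observed features x, where each
    expected feature is tfx*(q*NEPC + (1-q)*ARPC) + (1-tfx)*healthy."""
    best_q, best_ll = 0.0, -math.inf
    for k in range(steps + 1):
        q = k / steps
        ll = 0.0
        for xj, a, n, h, s in zip(x, mu_arpc, mu_nepc, mu_healthy, sigma):
            expected = tfx * (q * n + (1 - q) * a) + (1 - tfx) * h
            ll += -0.5 * ((xj - expected) / s) ** 2 - math.log(s)
        if ll > best_ll:
            best_q, best_ll = q, ll
    return best_q

# Hypothetical reference feature means (e.g. central coverage at two site sets).
mu_arpc, mu_nepc, mu_healthy = [0.9, 0.7], [0.7, 1.0], [1.0, 1.0]
sigma = [0.05, 0.05]
tfx, true_q = 0.4, 0.7
# Simulate a noiseless sample with a known NEPC proportion.
x = [tfx * (true_q * n + (1 - true_q) * a) + (1 - tfx) * h
     for a, n, h in zip(mu_arpc, mu_nepc, mu_healthy)]
score = estimate_nepc_score(x, tfx, mu_arpc, mu_nepc, mu_healthy, sigma)
print(score)
```

The design point this illustrates is that the healthy-donor component is weighted by one minus the tumor fraction, so a low tumor fraction shrinks the observable difference between phenotypes rather than biasing the score toward either one.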
- Code and implementation of the method can be found at github.com/GavinHaLab/CRPCSubtypingPaper/tree/main/GenerativeMixtureModel.
- Analysis and classification of clinical patient samples
- the model was then validated on two cohorts, beginning with the already published DFCI cohort II (Adalsteinsson et al. (2017). Nature Communications 8; Choudhury et al. (2016). Tumor fraction in cell-free DNA as a biomarker in prostate cancer. JCI Insight 3; Viswanathan et al. (2016). Structural Alterations Driving Castration-Resistant Prostate Cancer Revealed by Linked-Read Genome Sequencing. Cell 174, 433-447.e19). The analysis was restricted to eleven samples from six patients with matched ULP-WGS and WGS data with paired-end reads. Tumor fraction estimates from ichorCNA were obtained from the original study (Adalsteinsson et al. (2017). Nature Communications 8). All samples were considered adenocarcinoma (ARPC) based on clinical histories (see Human subjects). The scoring threshold of 0.3314, determined from DFCI cohort I, was used for phenotype classification.
- Example 1 applied an embodiment of the Griffin workflow to enhance sequence signals to allow accurate determination of breast cancer subtypes from low pass sequencing data.
- Example 2 applied an embodiment of the Griffin workflow approach to differentiate subtypes of other cancers, namely prostate cancer, successfully leveraging data from an alternative sequence profiling platform (e.g., from the CUT&RUN platform for nucleosome accessibility), demonstrating the power and flexibility of the Griffin analytic workflow for different cancers and input data.
- This Example describes the development of targeted sequencing panels to use in conjunction with the Griffin workflow to understand transcriptional features of small cell lung cancer, non-small cell lung cancer, and other cancer types from blood ctDNA.
- This Example describes an innovative analytical assay based on analysis of cell free DNA, demonstrating clear translational potential for clinical lung cancer diagnostics.
- Cell-free DNA circulating in the blood of cancer patients has been widely used to assess gene mutations, and through analyses of whole genome DNA has more recently been used to infer activation of certain transcription factors.
- Cancer cells give rise to cell-free DNA via cell death and that cell-free DNA is overwhelmingly nucleosomal, i.e. bound to a histone octamer, which protects the DNA from degradation.
- Histone positioning in the genome is influenced by components of chromatin, including transcription factors and the RNA polymerase complex.
- TFBSs: transcription factor binding sites
- TSSs: transcription start sites
- One innovation is the identification of highly informative TFBSs and TSSs that can be used to differentiate between NSCLC and SCLC or between subtypes of NSCLC or SCLC, which then facilitate the use of hybridization capture-based DNA sequencing of ctDNA to generate high resolution maps of nucleosome occupancy at TFBSs of key TFs in SCLC (ASCL1, NEUROD1, POU2F3, REST) and TSSs for genes that are markers of key transcriptional features of lung cancer cells.
- These informative sites can also be examined in low-coverage whole genome sequencing to extract similar transcriptional features.
- Targeted capture panels are routinely applied in the clinic to call mutations from ctDNA in the blood, and application of targeted sequencing to assess transcriptional activity in cancer cells is very feasible as a clinical test.
- The technology is especially relevant and viable in SCLC, which kills ~30,000 people in the US each year.
- Tissue sampling in SCLC is typically only performed once during a patient's disease course and is often done by transbronchial fine needle aspiration, which yields a very small amount of tissue. Surgery is very rarely performed.
- SCLC has a high level of ctDNA compared to most other cancer types, reflecting its highly metastatic nature, making this assay both practical for application to SCLC patients and potentially especially valuable.
- SCLC subtypes are defined not by mutations but by activation of key transcription factors (such as ASCL1, NEUROD1, and POU2F3) and their downstream programs.
- The disclosed targeted assay is designed to differentiate transcriptional subtypes of SCLC from ctDNA, providing powerful clinical applications for use of this assay.
- The panel is also designed to call gene mutations in exons from a panel of ~600 genes.
- The assay has broad clinical utility for correlative analyses of both mutations and transcriptional activity in clinical samples.
- Transdifferentiation of driver mutation-positive lung cancer to SCLC is treated differently from disease that is progressing but has not acquired a notable histologic change.
- Transdifferentiation is likely significantly underdiagnosed because currently it can only be assessed via biopsy of a progressing lesion, which is often infeasible or undesirable.
- This assay can also be applied to lung adenocarcinoma patients who develop resistance to EGFR inhibitors to determine whether this resistance is associated with activation of SCLC transcriptional profiles.
- The major non-invasive applications include the following:
- FIG. 16 shows the generation of a capture panel.
- The approach included rationally designing a targeted sequencing panel for integrated detection of SCLC genetic mutations, transcription factor (TF) subtype identity, and expression of key gene programs.
- Public mutation databases and functional mutation data were interrogated for coding mutations in approximately 600 genes related to SCLC.
- TF subtype identity: TFBSs for four key SCLC-related TFs (ASCL1, NEUROD1, POU2F3, and REST) were targeted.
- TSSs corresponding to the vast majority of protein-coding genes in the genome were targeted. To select specific sites, multiple sources of data were integrated as follows.
- ChIP-seq data were used to identify TFBSs, resulting in 4-30k sites per factor. These candidate sites were then annotated with the distance to the nearest gene TSS. Sites were retained if the nearest gene was known to be upregulated in SCLC cells that expressed the factor of interest, as determined by available RNA-seq data. This resulted in ~400-700 SCLC-focused sites per factor. In the final probe set, a 1 kb window symmetrically encompassing each of these ~2k sites (500 bp on each side) was targeted.
- TSS profiling: beginning with an established transcript annotation, non-coding transcripts, Y chromosome genes, and TSSs corresponding to multi-exon genes that had lower-confidence annotations were removed, resulting in approximately 36k theoretically targeted TSSs. In the probe set, regions 260 bp downstream of the TSS and 100 bp upstream were targeted.
- Use of application-specific orthogonal chromatin profiling data to select sites is a key feature of the approach. However, it will be noted that other types of chromatin profiling data, such as ATAC-seq, CUT&RUN/TAG, DNase-seq, or modified histone ChIP-seq, could readily be substituted or added and yield the same or similar results.
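- The site filtering and probe-window definitions described above can be sketched as follows. This is an illustrative sketch only: the function names, the dictionary layout with a `nearest_gene` key, and the half-open coordinate convention are assumptions, and the strand-aware handling of "upstream" and "downstream" is the conventional interpretation rather than something the text spells out.

```python
from typing import Dict, Iterable, List, Set, Tuple


def retain_sites(sites: Iterable[Dict], upregulated_genes: Set[str]) -> List[Dict]:
    """Keep candidate TFBSs whose nearest gene is in the upregulated set."""
    return [s for s in sites if s["nearest_gene"] in upregulated_genes]


def tfbs_window(site: int, flank: int = 500) -> Tuple[int, int]:
    """Symmetric 1 kb window: 500 bp on each side of the binding site."""
    return max(0, site - flank), site + flank


def tss_window(tss: int, strand: str,
               upstream: int = 100, downstream: int = 260) -> Tuple[int, int]:
    """Window covering 100 bp upstream and 260 bp downstream of a TSS.

    Upstream and downstream are taken relative to the direction of
    transcription, so the window is mirrored for minus-strand genes.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError(f"strand must be '+' or '-', got {strand!r}")
    return max(0, start), end
```

- Under these assumptions, a plus-strand TSS at position 10,000 yields the window (9900, 10260), and the same TSS on the minus strand yields (9740, 10100).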
- A data analysis pipeline was developed to quantify cfDNA fragments protected by nucleosomes in both the TFBS and TSS captured DNA.
- The analysis pipeline, Griffin (described in more detail above), uses fragment length-based GC correction to remove GC biases that obscure signals.
- A fragment size-aware GC-bias correction approach helps to maximize signal-to-noise and optimizes the analysis of captured DNA.
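- A fragment size-aware GC correction of the kind described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the Griffin implementation: the bias for each (fragment length, GC count) bin is assumed to be the ratio of observed to expected fragment frequency, each fragment is weighted by the inverse of its bias, fragments are counted at their midpoints, and fragments falling in bins with no bias estimate are given a weight of 1. All function names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Bin = Tuple[int, int]  # (fragment length in bp, number of G/C bases)


def gc_bias_table(observed: Dict[Bin, int], expected: Dict[Bin, int]) -> Dict[Bin, float]:
    """Estimate GC bias per (length, GC count) bin as observed/expected frequency."""
    obs_total = sum(observed.values())
    exp_total = sum(expected.values())
    if obs_total == 0 or exp_total == 0:
        raise ValueError("observed and expected counts must be non-empty")
    table = {}
    for key, exp_count in expected.items():
        obs_count = observed.get(key, 0)
        # Skip bins without data on either side; they get no bias estimate.
        if exp_count > 0 and obs_count > 0:
            table[key] = (obs_count / obs_total) / (exp_count / exp_total)
    return table


def fragment_weight(length: int, gc_count: int, table: Dict[Bin, float]) -> float:
    """Inverse-bias weight; bins without an estimate default to 1.0 (assumption)."""
    bias = table.get((length, gc_count))
    return 1.0 / bias if bias else 1.0


def corrected_midpoint_coverage(fragments: Iterable[Tuple[int, int, int]],
                                table: Dict[Bin, float],
                                region_length: int) -> List[float]:
    """GC-corrected coverage from (start, length, gc_count) fragments.

    Start positions are relative to the region; each fragment contributes
    its weight at its midpoint.
    """
    coverage = [0.0] * region_length
    for start, length, gc_count in fragments:
        midpoint = start + length // 2
        if 0 <= midpoint < region_length:
            coverage[midpoint] += fragment_weight(length, gc_count, table)
    return coverage
```

- Keying the bias on both fragment length and GC count, rather than GC content alone, is what makes the correction fragment size-aware: over-represented bins are down-weighted and under-represented bins are up-weighted before coverage is aggregated.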
- FIGS. 17A and 17B illustrate the detection of transcription factor (TF) expression in SCLC models using targeted sequencing of cfDNA.
- FIG. 17A is a schematic of experimental workflow for proof-of-concept negative control ("healthy donor") and positive control ("flank tumors" from SCLC cellular models) samples.
- FIG. 17B graphically illustrates aggregated coverage across TFBSs in targeted sequencing data for healthy donors (top row) and flank tumors (bottom row).
- The TFBS is expected to be located at position 0 on the x axis. Data are coded by expected TF expression. Healthy donor-derived cfDNA is expected to reflect REST expression but not ASCL1, NEUROD1, or POU2F3 expression. In SCLC models, systematic differences in coverage distribution as a function of TF expression are apparent.
- FIGS. 18A-18C illustrate transcription factor activity inference using TFBS coverage distributions from SCLC patient samples with available matched tumor gene expression data.
- FIG. 18A graphically illustrates aggregated coverage across TFBSs in targeted sequencing data for healthy donors (top row) and patients with SCLC (bottom row) for whom matched tumor tissue with gene expression data was available. Samples are coded by expected TF expression. Systematic differences in coverage distribution as a function of expected TF expression are again apparent.
- FIG. 18B illustrates gene expression of key genes in selected patient samples displayed as a heatmap. Cells are coded by Z-score and the inset text is the log2(TPM+1).
- Trough depth magnitude corresponds to gene expression of the key TFs in these bona fide SCLC patient samples.
- FIG. 19 is a series of graphs illustrating quantification of transcription factor binding site peak-to-trough amplitude by sample type. The distribution of TFBS peak-to-trough amplitude, calculated from aggregated coverage distributions, is shown according to the expected ground truth of TF expression.
- ASCL1 site peak to trough amplitude is associated with both SCLC status and ASCL1 positivity, while NEUROD1 and POU2F3 peak to trough amplitude is associated only with TF positivity.
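- One way a peak-to-trough amplitude could be computed from an aggregated coverage profile is sketched below. The text does not give the exact definition, so this is an assumption: the trough is taken as the minimum coverage within a central window around position 0 (where the TFBS is expected), and the peak as the maximum coverage in the flanks. The function name and the 50 bp half-width are hypothetical.

```python
from typing import Sequence


def peak_to_trough(positions: Sequence[int], coverage: Sequence[float],
                   trough_halfwidth: int = 50) -> float:
    """Peak-to-trough amplitude of an aggregated coverage profile.

    positions are offsets relative to the binding site (0 = site center).
    Assumption: trough = min coverage with |position| <= trough_halfwidth,
    peak = max coverage outside that central window.
    """
    if len(positions) != len(coverage):
        raise ValueError("positions and coverage must have the same length")
    central = [c for p, c in zip(positions, coverage) if abs(p) <= trough_halfwidth]
    flank = [c for p, c in zip(positions, coverage) if abs(p) > trough_halfwidth]
    if not central or not flank:
        raise ValueError("profile must cover both the central window and the flanks")
    return max(flank) - min(central)
```

- A deeper trough at the site relative to the flanking peaks gives a larger amplitude, which is the direction of association with TF positivity described above.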
- FIGS. 20A and 20B graphically illustrate gene expression inference using TSS coverage distributions in flank tumor positive control samples.
- FIG. 20A illustrates TSS coverage distribution from targeted sequencing of cfDNA, grouped by gene expression quintile in SCLC flank tumor models (quintiles 1-5) and blood ("B", dark blue). Shown are 1,912 TSSs corresponding to 1,213 genes, which were selected based on low expression in whole blood and correlation between TSS coverage distribution and gene expression. TSS coverage distribution varies systematically according to expression of the corresponding gene.
- FIG. 20B illustrates receiver operating characteristic curves for prediction of gene expression as above or below a threshold value (shown for thresholds of 0.1, 0.5, 1.0, and 2.0), as inferred from the coverage distribution of the corresponding TSS.
- FIG. 21 is a series of graphs illustrating use of aggregated coverage profiles across large, rationally selected subsets of the TSS panel for prediction of SCLC vs NSCLC status in lung cancer PDX models and patient samples.
- As shown overlaid on the NSCLC PDX model, an amplitude feature was calculated from each coverage distribution curve as the difference between the coverage at the -45 position and the +120 position relative to the TSS, facilitating comparison within and between samples.
- FIG. 22 is a series of graphs illustrating use of aggregated coverage profiles across large, rationally selected subsets of the TSS panel for prediction of SCLC vs NSCLC status in lung cancer PDX models and patient samples.
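- The TSS amplitude feature described for FIG. 21 can be sketched as follows. The positions -45 and +120 come from the text; the function name, the dictionary representation of the coverage curve, and the sign convention (coverage at -45 minus coverage at +120) are assumptions.

```python
from typing import Mapping


def tss_amplitude(coverage_by_position: Mapping[int, float],
                  upstream_position: int = -45,
                  downstream_position: int = 120) -> float:
    """Difference between aggregated coverage at -45 and +120 relative to the TSS.

    coverage_by_position maps an offset from the TSS (in bp) to the
    aggregated, normalized coverage at that offset.
    Assumption: the feature is coverage(-45) - coverage(+120).
    """
    return (coverage_by_position[upstream_position]
            - coverage_by_position[downstream_position])
```

- Because the feature is a difference of two points on the same normalized curve, it reduces each coverage profile to a single number that can be compared within and between samples.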
- An SCLC PDX that transdifferentiated from an adenocarcinoma is identified with a thick red line.
- Griffin uses a unique normalization of cfDNA sequence data that is specific to nucleosome profiling and chromatin accessibility analysis. This includes GC-bias correction, repetitive sequence filtering, and local coverage normalization. These normalization techniques are not available in existing proof-of-concept methods such as that of Ulz P, et al. Inference of transcription factor binding from cell-free DNA enables tumor subtype prediction and early detection. Nat Commun. 2019;10(1):4666. Further, multi-omic feature extraction from Griffin for use in machine learning classifier construction to predict cancer subtype is unique to this approach.
- A targeted sequencing panel is expected to yield higher resolution while retaining practical cost, and is more readily integrated with resequencing of regions of interest for genetic mutation detection (i.e., cancer gene panel sequencing).
- From the output of Griffin, many features can be extracted for each binding site of interest, and machine learning classifiers can be used to predict histological subtypes of lung cancer from the Griffin-optimized cfDNA data.
Abstract
According to one aspect, the disclosure provides a computer-implemented method for enhancing sequence read data from cell-free DNA samples for cell type prediction. The method comprises receiving sequence read data comprising a plurality of fragment reads, each fragment read having a fragment length and a GC content indicating the percentage of bases in the fragment read that are G or C. GC bias values are determined by a computing system for each fragment read based on the fragment length and the GC content of the fragment read. A genomic coverage distribution is generated that is adjusted for GC bias using the sequence read data and the GC bias values. Based on the genomic coverage distribution, the cell type is predicted. This method can be leveraged to assess cell subtypes and phenotypes based on the cell-free DNA present in biological samples, for example for the diagnosis and monitoring of cancer and for precision cancer therapy.
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US202163172590P | 2021-04-08 | 2021-04-08 | |
US202163276378P | 2021-11-05 | 2021-11-05 | |
PCT/US2022/024082 WO2022217096A2 (fr) | 2021-04-08 | 2022-04-08 | Method for analyzing cell-free DNA sequence data to examine nucleosome protection and chromatin accessibility |
Publications (1)
Publication Number | Publication Date |
---|---|
EP4320618A2 (fr) | 2024-02-14 |
Family
ID=83545807
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP22785557.4A Pending EP4320618A2 (fr) | 2021-04-08 | 2022-04-08 | Method for analyzing cell-free DNA sequence data to examine nucleosome protection and chromatin accessibility |
Country Status (5)
Country | Link |
---|---|
EP (1) | EP4320618A2 (fr) |
JP (1) | JP2024515565A (fr) |
AU (1) | AU2022255198A1 (fr) |
CA (1) | CA3214391A1 (fr) |
WO (1) | WO2022217096A2 (fr) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115376616B (zh) * | 2022-10-24 | 2023-04-28 | Zhenhe (Beijing) Biotechnology Co., Ltd. | Multi-classification method and device based on cfDNA multi-omics |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8725422B2 (en) * | 2010-10-13 | 2014-05-13 | Complete Genomics, Inc. | Methods for estimating genome-wide copy number variations |
US10497461B2 (en) * | 2012-06-22 | 2019-12-03 | Sequenom, Inc. | Methods and processes for non-invasive assessment of genetic variations |
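CN117402950A (zh) * | 2014-07-25 | 2024-01-16 | University of Washington | Methods of determining tissues and/or cell types giving rise to cell-free DNA, and methods of identifying a disease or disorder using same |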
US20190287645A1 (en) * | 2016-07-06 | 2019-09-19 | Guardant Health, Inc. | Methods for fragmentome profiling of cell-free nucleic acids |
WO2018227202A1 (fr) * | 2017-06-09 | 2018-12-13 | Bellwether Bio, Inc. | Determination of cancer type in a subject by probabilistic modeling of circulating nucleic acid fragment endpoints |
CN112805563A (zh) * | 2018-05-18 | 2021-05-14 | Johns Hopkins University | Cell-free DNA for assessing and/or treating cancer |
JP2022532897A (ja) * | 2019-05-14 | 2022-07-20 | Tempus Labs, Inc. | Systems and methods for multi-label cancer classification |
- 2022
- 2022-04-08 EP EP22785557.4A patent/EP4320618A2/fr active Pending
- 2022-04-08 CA CA3214391A patent/CA3214391A1/fr active Pending
- 2022-04-08 AU AU2022255198A patent/AU2022255198A1/en active Pending
- 2022-04-08 JP JP2023561726A patent/JP2024515565A/ja active Pending
- 2022-04-08 WO PCT/US2022/024082 patent/WO2022217096A2/fr active Application Filing
Also Published As
Publication number | Publication date |
---|---|
WO2022217096A3 (fr) | 2022-12-29 |
AU2022255198A1 (en) | 2023-11-23 |
JP2024515565A (ja) | 2024-04-10 |
WO2022217096A2 (fr) | 2022-10-13 |
CA3214391A1 (fr) | 2022-10-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP7455757B2 (ja) | Machine learning implementation for multi-analyte assay of biological samples | |
Doebley et al. | A framework for clinical cancer subtyping from nucleosome profiling of cell-free DNA | |
Schwarz et al. | Spatial and temporal heterogeneity in high-grade serous ovarian cancer: a phylogenetic analysis | |
US11978535B2 (en) | Methods of detecting somatic and germline variants in impure tumors | |
Haferlach et al. | Landscape of genetic lesions in 944 patients with myelodysplastic syndromes | |
Riester et al. | Combination of a novel gene expression signature with a clinical nomogram improves the prediction of survival in high-risk bladder cancer | |
Naumov et al. | Genome-scale analysis of DNA methylation in colorectal cancer using Infinium HumanMethylation450 BeadChips | |
Tran et al. | Cancer genomics: technology, discovery, and translation | |
EP3430170B1 (fr) | Methods for genome characterization | |
US20210257047A1 (en) | Methods and systems for refining copy number variation in a liquid biopsy assay | |
CN112602156A (zh) | Systems and methods for detecting residual disease | |
CN114026646A (zh) | Systems and methods for assessing tumor fraction | |
De Sarkar et al. | Nucleosome patterns in circulating tumor DNA reveal transcriptional regulation of advanced prostate cancer phenotypes | |
Munoz et al. | Molecular profiling and the reclassification of cancer: divide and conquer | |
US20230175058A1 (en) | Methods and systems for abnormality detection in the patterns of nucleic acids | |
Brannon et al. | Enhanced specificity of high sensitivity somatic variant profiling in cell-free DNA via paired normal sequencing: design, validation, and clinical experience of the MSK-ACCESS liquid biopsy assay | |
US20240279745A1 (en) | Systems and methods for multi-analyte detection of cancer | |
KR20240104202A (ko) | Multimodal analysis of circulating tumor nucleic acid molecules | |
Adams et al. | Global mutational profiling of formalin-fixed human colon cancers from a pathology archive | |
Ren et al. | SinoDuplex: an improved duplex sequencing approach to detect low-frequency variants in plasma cfDNA samples | |
US20210358571A1 (en) | Systems and methods for predicting pathogenic status of fusion candidates detected in next generation sequencing data | |
US20180371553A1 (en) | Methods and compositions for the analysis of cancer biomarkers | |
Wang et al. | Terminal modifications independent cell-free RNA sequencing enables sensitive early cancer detection and classification | |
EP4320618A2 (fr) | Method for analyzing cell-free DNA sequence data to examine nucleosome protection and chromatin accessibility | |
US20220301656A1 (en) | Genome sequencing as an alternative to cytogenetic analysis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE |
| PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase | Free format text: ORIGINAL CODE: 0009012 |
| STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE |
| 17P | Request for examination filed | Effective date: 20231108 |
| AK | Designated contracting states | Kind code of ref document: A2 Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
| DAV | Request for validation of the european patent (deleted) | |
| DAX | Request for extension of the european patent (deleted) | |