WO2024026376A2 - Methods and systems for multiomic analysis - Google Patents
- Publication number
- WO2024026376A2 (PCT/US2023/071068)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- instances
- cells
- cell
- nucleotides
Concepts (machine-extracted)
- 238000000034 method Methods 0.000 title claims abstract description 284
- 238000004458 analytical method Methods 0.000 title claims abstract description 58
- 208000037265 diseases, disorders, signs and symptoms Diseases 0.000 claims abstract description 57
- 201000010099 disease Diseases 0.000 claims abstract description 55
- 238000002474 experimental method Methods 0.000 claims abstract description 33
- 239000000090 biomarker Substances 0.000 claims abstract description 29
- 230000000694 effects Effects 0.000 claims abstract description 22
- 229960005486 vaccine Drugs 0.000 claims abstract description 13
- 210000004027 cell Anatomy 0.000 claims description 316
- 108090000623 proteins and genes Proteins 0.000 claims description 162
- 125000003729 nucleotide group Chemical group 0.000 claims description 96
- 239000002773 nucleotide Substances 0.000 claims description 89
- 230000008859 change Effects 0.000 claims description 88
- 150000007523 nucleic acids Chemical class 0.000 claims description 88
- 102000039446 nucleic acids Human genes 0.000 claims description 86
- 108020004707 nucleic acids Proteins 0.000 claims description 86
- 102000004169 proteins and genes Human genes 0.000 claims description 70
- 238000003199 nucleic acid amplification method Methods 0.000 claims description 65
- 230000003321 amplification Effects 0.000 claims description 64
- 230000035772 mutation Effects 0.000 claims description 64
- 102000053602 DNA Human genes 0.000 claims description 60
- 108020004414 DNA Proteins 0.000 claims description 60
- 229920002477 rna polymer Polymers 0.000 claims description 52
- 238000006243 chemical reaction Methods 0.000 claims description 51
- 238000005192 partition Methods 0.000 claims description 45
- 230000014509 gene expression Effects 0.000 claims description 41
- 238000005259 measurement Methods 0.000 claims description 37
- 230000011987 methylation Effects 0.000 claims description 37
- 238000007069 methylation reaction Methods 0.000 claims description 37
- 206010028980 Neoplasm Diseases 0.000 claims description 33
- 230000001413 cellular effect Effects 0.000 claims description 33
- 230000004048 modification Effects 0.000 claims description 33
- 238000012986 modification Methods 0.000 claims description 33
- 108010026552 Proteome Proteins 0.000 claims description 29
- 201000011510 cancer Diseases 0.000 claims description 29
- 238000012163 sequencing technique Methods 0.000 claims description 27
- 238000011282 treatment Methods 0.000 claims description 26
- 239000002299 complementary DNA Substances 0.000 claims description 24
- 238000004590 computer program Methods 0.000 claims description 21
- 230000007246 mechanism Effects 0.000 claims description 21
- 108020004999 messenger RNA Proteins 0.000 claims description 21
- 230000008569 process Effects 0.000 claims description 20
- 239000003153 chemical reaction reagent Substances 0.000 claims description 18
- 108010029485 Protein Isoforms Proteins 0.000 claims description 17
- 102000001708 Protein Isoforms Human genes 0.000 claims description 17
- 230000037433 frameshift Effects 0.000 claims description 17
- 239000011324 bead Substances 0.000 claims description 16
- 230000005945 translocation Effects 0.000 claims description 16
- 238000004422 calculation algorithm Methods 0.000 claims description 15
- 230000004927 fusion Effects 0.000 claims description 15
- 230000010076 replication Effects 0.000 claims description 14
- 239000005546 dideoxynucleotide Substances 0.000 claims description 13
- 230000002438 mitochondrial effect Effects 0.000 claims description 13
- 230000001580 bacterial effect Effects 0.000 claims description 11
- 239000000203 mixture Substances 0.000 claims description 11
- 206010006187 Breast cancer Diseases 0.000 claims description 10
- 229910052799 carbon Inorganic materials 0.000 claims description 10
- 239000003623 enhancer Substances 0.000 claims description 10
- 108091070501 miRNA Proteins 0.000 claims description 10
- 239000002679 microRNA Substances 0.000 claims description 10
- 230000001225 therapeutic effect Effects 0.000 claims description 9
- 108020004705 Codon Proteins 0.000 claims description 8
- 230000015572 biosynthetic process Effects 0.000 claims description 8
- 238000012790 confirmation Methods 0.000 claims description 8
- 239000003446 ligand Substances 0.000 claims description 8
- 230000002934 lysing effect Effects 0.000 claims description 8
- 230000002503 metabolic effect Effects 0.000 claims description 8
- 230000026731 phosphorylation Effects 0.000 claims description 8
- 238000006366 phosphorylation reaction Methods 0.000 claims description 8
- 150000003384 small molecules Chemical class 0.000 claims description 8
- 125000006850 spacer group Chemical group 0.000 claims description 8
- 238000003786 synthesis reaction Methods 0.000 claims description 8
- 238000013519 translation Methods 0.000 claims description 8
- 208000026310 Breast neoplasm Diseases 0.000 claims description 7
- 230000009946 DNA mutation Effects 0.000 claims description 7
- 238000003776 cleavage reaction Methods 0.000 claims description 7
- 230000009145 protein modification Effects 0.000 claims description 7
- 238000010839 reverse transcription Methods 0.000 claims description 7
- 230000007017 scission Effects 0.000 claims description 7
- 108091029430 CpG site Proteins 0.000 claims description 6
- 230000004913 activation Effects 0.000 claims description 6
- 239000000470 constituent Substances 0.000 claims description 6
- 108700021021 mRNA Vaccine Proteins 0.000 claims description 6
- 229940126582 mRNA vaccine Drugs 0.000 claims description 6
- 238000012216 screening Methods 0.000 claims description 6
- 108020005196 Mitochondrial DNA Proteins 0.000 claims description 5
- 150000002632 lipids Chemical class 0.000 claims description 5
- 239000000463 material Substances 0.000 claims description 5
- ASJSAQIRZKANQN-CRCLSJGQSA-N 2-deoxy-D-ribose Chemical compound OC[C@@H](O)[C@@H](O)CC=O ASJSAQIRZKANQN-CRCLSJGQSA-N 0.000 claims description 4
- OKTJSMMVPCPJKN-UHFFFAOYSA-N Carbon Chemical compound [C] OKTJSMMVPCPJKN-UHFFFAOYSA-N 0.000 claims description 4
- 210000004443 dendritic cell Anatomy 0.000 claims description 4
- 238000001943 fluorescence-activated cell sorting Methods 0.000 claims description 4
- 125000001153 fluoro group Chemical group F* 0.000 claims description 4
- 230000002427 irreversible effect Effects 0.000 claims description 4
- 239000002207 metabolite Substances 0.000 claims description 4
- 125000002924 primary amino group Chemical group [H]N([H])* 0.000 claims description 4
- 208000006402 Ductal Carcinoma Diseases 0.000 claims description 3
- 230000006037 cell lysis Effects 0.000 claims description 3
- 208000032839 leukemia Diseases 0.000 claims description 3
- 238000002705 metabolomic analysis Methods 0.000 claims description 3
- 230000001431 metabolomic effect Effects 0.000 claims description 3
- 230000008672 reprogramming Effects 0.000 claims description 3
- 230000008685 targeting Effects 0.000 claims description 3
- 210000005260 human cell Anatomy 0.000 claims description 2
- 210000004962 mammalian cell Anatomy 0.000 claims description 2
- 239000003814 drug Substances 0.000 abstract description 19
- 230000002068 genetic effect Effects 0.000 abstract description 17
- 229940079593 drug Drugs 0.000 abstract description 13
- 238000011161 development Methods 0.000 abstract description 9
- 238000007405 data analysis Methods 0.000 abstract description 3
- 238000010205 computational analysis Methods 0.000 abstract 1
- 238000000205 computational method Methods 0.000 abstract 1
- 238000003860 storage Methods 0.000 description 49
- 239000000523 sample Substances 0.000 description 38
- 238000012545 processing Methods 0.000 description 31
- 238000006073 displacement reaction Methods 0.000 description 25
- 108010014303 DNA-directed DNA polymerase Proteins 0.000 description 24
- 102000016928 DNA-directed DNA polymerase Human genes 0.000 description 24
- 238000012800 visualization Methods 0.000 description 19
- 108700028369 Alleles Proteins 0.000 description 18
- 210000000349 chromosome Anatomy 0.000 description 16
- 210000001519 tissue Anatomy 0.000 description 15
- 238000001514 detection method Methods 0.000 description 14
- 238000003559 RNA-seq method Methods 0.000 description 13
- 108091093088 Amplicon Proteins 0.000 description 11
- 108060002716 Exonuclease Proteins 0.000 description 10
- ISAKRJDGNUQOIC-UHFFFAOYSA-N Uracil Chemical compound O=C1C=CNC(=O)N1 ISAKRJDGNUQOIC-UHFFFAOYSA-N 0.000 description 10
- 102000013165 exonuclease Human genes 0.000 description 10
- 244000005700 microbiome Species 0.000 description 10
- 239000011541 reaction mixture Substances 0.000 description 10
- 239000007787 solid Substances 0.000 description 10
- 241000894006 Bacteria Species 0.000 description 9
- JJWKPURADFRFRB-UHFFFAOYSA-N carbonyl sulfide Chemical compound O=C=S JJWKPURADFRFRB-UHFFFAOYSA-N 0.000 description 9
- 230000001915 proofreading effect Effects 0.000 description 9
- 238000003556 assay Methods 0.000 description 8
- 230000036541 health Effects 0.000 description 8
- 238000011160 research Methods 0.000 description 8
- 230000035945 sensitivity Effects 0.000 description 8
- 241000894007 species Species 0.000 description 8
- 230000014616 translation Effects 0.000 description 8
- 101710126859 Single-stranded DNA-binding protein Proteins 0.000 description 7
- 238000012217 deletion Methods 0.000 description 7
- 230000037430 deletion Effects 0.000 description 7
- 230000018109 developmental process Effects 0.000 description 7
- 230000006870 function Effects 0.000 description 7
- 210000004602 germ cell Anatomy 0.000 description 7
- 238000003780 insertion Methods 0.000 description 7
- 230000037431 insertion Effects 0.000 description 7
- 239000011159 matrix material Substances 0.000 description 7
- 230000002441 reversible effect Effects 0.000 description 7
- 239000000126 substance Substances 0.000 description 7
- 102000004190 Enzymes Human genes 0.000 description 6
- 108090000790 Enzymes Proteins 0.000 description 6
- 238000013459 approach Methods 0.000 description 6
- 238000004891 communication Methods 0.000 description 6
- 230000000875 corresponding effect Effects 0.000 description 6
- 238000005516 engineering process Methods 0.000 description 6
- 238000010348 incorporation Methods 0.000 description 6
- 238000007481 next generation sequencing Methods 0.000 description 6
- 102000040430 polynucleotide Human genes 0.000 description 6
- 108091033319 polynucleotide Proteins 0.000 description 6
- 239000002157 polynucleotide Substances 0.000 description 6
- 230000004044 response Effects 0.000 description 6
- 230000000392 somatic effect Effects 0.000 description 6
- RWQNBRDOKXIBIV-UHFFFAOYSA-N thymine Chemical compound CC1=CNC(=O)NC1=O RWQNBRDOKXIBIV-UHFFFAOYSA-N 0.000 description 6
- 230000002103 transcriptional effect Effects 0.000 description 6
- 241000124008 Mammalia Species 0.000 description 5
- OPTASPLRGRRNAP-UHFFFAOYSA-N cytosine Chemical compound NC=1C=CNC(=O)N=1 OPTASPLRGRRNAP-UHFFFAOYSA-N 0.000 description 5
- 208000028715 ductal breast carcinoma in situ Diseases 0.000 description 5
- 239000012634 fragment Substances 0.000 description 5
- 239000004973 liquid crystal related substance Substances 0.000 description 5
- 230000003287 optical effect Effects 0.000 description 5
- 230000001124 posttranscriptional effect Effects 0.000 description 5
- 229940035893 uracil Drugs 0.000 description 5
- 229920001621 AMOLED Polymers 0.000 description 4
- 206010069754 Acquired gene mutation Diseases 0.000 description 4
- 208000031261 Acute myeloid leukaemia Diseases 0.000 description 4
- LSNNMFCWUKXFEE-UHFFFAOYSA-M Bisulfite Chemical compound OS([O-])=O LSNNMFCWUKXFEE-UHFFFAOYSA-M 0.000 description 4
- 108010017826 DNA Polymerase I Proteins 0.000 description 4
- 102000004594 DNA Polymerase I Human genes 0.000 description 4
- 101000932478 Homo sapiens Receptor-type tyrosine-protein kinase FLT3 Proteins 0.000 description 4
- 208000037396 Intraductal Noninfiltrating Carcinoma Diseases 0.000 description 4
- 108060004795 Methyltransferase Proteins 0.000 description 4
- 102100020718 Receptor-type tyrosine-protein kinase FLT3 Human genes 0.000 description 4
- 230000004075 alteration Effects 0.000 description 4
- 230000008901 benefit Effects 0.000 description 4
- 230000002596 correlated effect Effects 0.000 description 4
- 238000013500 data storage Methods 0.000 description 4
- 102000054767 gene variant Human genes 0.000 description 4
- UYTPUPDQBNUYGX-UHFFFAOYSA-N guanine Chemical compound O=C1NC(N)=NC2=C1N=CN2 UYTPUPDQBNUYGX-UHFFFAOYSA-N 0.000 description 4
- 230000033001 locomotion Effects 0.000 description 4
- 230000002093 peripheral effect Effects 0.000 description 4
- CVWXJKQAOSCOAB-UHFFFAOYSA-N quizartinib Chemical compound O1C(C(C)(C)C)=CC(NC(=O)NC=2C=CC(=CC=2)C=2N=C3N(C4=CC=C(OCCN5CCOCC5)C=C4S3)C=2)=N1 CVWXJKQAOSCOAB-UHFFFAOYSA-N 0.000 description 4
- 229950001626 quizartinib Drugs 0.000 description 4
- 230000011664 signaling Effects 0.000 description 4
- 230000037439 somatic mutation Effects 0.000 description 4
- 238000010200 validation analysis Methods 0.000 description 4
- 230000000007 visual effect Effects 0.000 description 4
- 206010059866 Drug resistance Diseases 0.000 description 3
- 101000605639 Homo sapiens Phosphatidylinositol 4,5-bisphosphate 3-kinase catalytic subunit alpha isoform Proteins 0.000 description 3
- 102100034343 Integrase Human genes 0.000 description 3
- 101150039798 MYC gene Proteins 0.000 description 3
- 108010052285 Membrane Proteins Proteins 0.000 description 3
- 102000018697 Membrane Proteins Human genes 0.000 description 3
- 241001465754 Metazoa Species 0.000 description 3
- 108091034117 Oligonucleotide Proteins 0.000 description 3
- 102100038332 Phosphatidylinositol 4,5-bisphosphate 3-kinase catalytic subunit alpha isoform Human genes 0.000 description 3
- 108010092799 RNA-directed DNA polymerase Proteins 0.000 description 3
- 241000251539 Vertebrata <Metazoa> Species 0.000 description 3
- 239000012472 biological sample Substances 0.000 description 3
- 230000003247 decreasing effect Effects 0.000 description 3
- MTHSVFCYNBDYFN-UHFFFAOYSA-N diethylene glycol Chemical compound OCCOCCO MTHSVFCYNBDYFN-UHFFFAOYSA-N 0.000 description 3
- 238000009510 drug design Methods 0.000 description 3
- 238000001914 filtration Methods 0.000 description 3
- 230000010354 integration Effects 0.000 description 3
- 230000003993 interaction Effects 0.000 description 3
- 238000013507 mapping Methods 0.000 description 3
- 244000005706 microflora Species 0.000 description 3
- 230000001717 pathogenic effect Effects 0.000 description 3
- 230000037361 pathway Effects 0.000 description 3
- 238000000513 principal component analysis Methods 0.000 description 3
- 230000004952 protein activity Effects 0.000 description 3
- 238000002331 protein detection Methods 0.000 description 3
- 230000001105 regulatory effect Effects 0.000 description 3
- 238000012360 testing method Methods 0.000 description 3
- 229940113082 thymine Drugs 0.000 description 3
- YBJHBAHKTGYVGT-ZKWXMUAHSA-N (+)-Biotin Chemical compound N1C(=O)N[C@@H]2[C@H](CCCCC(=O)O)SC[C@@H]21 YBJHBAHKTGYVGT-ZKWXMUAHSA-N 0.000 description 2
- RYVNIFSIEDRLSJ-UHFFFAOYSA-N 5-(hydroxymethyl)cytosine Chemical compound NC=1NC(=O)N=CC=1CO RYVNIFSIEDRLSJ-UHFFFAOYSA-N 0.000 description 2
- 229930024421 Adenine Natural products 0.000 description 2
- GFFGJBXGBJISGV-UHFFFAOYSA-N Adenine Chemical compound NC1=NC=NC2=C1N=CN2 GFFGJBXGBJISGV-UHFFFAOYSA-N 0.000 description 2
- 241000203069 Archaea Species 0.000 description 2
- 108010077544 Chromatin Proteins 0.000 description 2
- 238000001712 DNA sequencing Methods 0.000 description 2
- 102000052510 DNA-Binding Proteins Human genes 0.000 description 2
- 241000196324 Embryophyta Species 0.000 description 2
- 102000004533 Endonucleases Human genes 0.000 description 2
- 108010042407 Endonucleases Proteins 0.000 description 2
- 102000018651 Epithelial Cell Adhesion Molecule Human genes 0.000 description 2
- 108010066687 Epithelial Cell Adhesion Molecule Proteins 0.000 description 2
- 241000233866 Fungi Species 0.000 description 2
- 102100031487 Growth arrest-specific protein 6 Human genes 0.000 description 2
- 108010033040 Histones Proteins 0.000 description 2
- 241000282412 Homo Species 0.000 description 2
- 101100335080 Homo sapiens FLT3 gene Proteins 0.000 description 2
- 101000923005 Homo sapiens Growth arrest-specific protein 6 Proteins 0.000 description 2
- TWRXJAOTZQYOKJ-UHFFFAOYSA-L Magnesium chloride Chemical compound [Mg+2].[Cl-].[Cl-] TWRXJAOTZQYOKJ-UHFFFAOYSA-L 0.000 description 2
- 241000535824 Mastacembelocleidus bam Species 0.000 description 2
- ROAIXOJGRFKICW-UHFFFAOYSA-N Methenamine hippurate Chemical compound C1N(C2)CN3CN1CN2C3.OC(=O)CNC(=O)C1=CC=CC=C1 ROAIXOJGRFKICW-UHFFFAOYSA-N 0.000 description 2
- 108010010677 Phosphodiesterase I Proteins 0.000 description 2
- 101710193739 Protein RecA Proteins 0.000 description 2
- 102000018780 Replication Protein A Human genes 0.000 description 2
- 108010027643 Replication Protein A Proteins 0.000 description 2
- 108010001244 Tli polymerase Proteins 0.000 description 2
- 108020004566 Transfer RNA Proteins 0.000 description 2
- DRTQHJPVMGBUCF-XVFCMESISA-N Uridine Chemical compound O[C@@H]1[C@H](O)[C@@H](CO)O[C@H]1N1C(=O)NC(=O)C=C1 DRTQHJPVMGBUCF-XVFCMESISA-N 0.000 description 2
- 241000700605 Viruses Species 0.000 description 2
- 229960000643 adenine Drugs 0.000 description 2
- 238000000540 analysis of variance Methods 0.000 description 2
- 239000000427 antigen Substances 0.000 description 2
- 108091007433 antigens Proteins 0.000 description 2
- 102000036639 antigens Human genes 0.000 description 2
- 230000027455 binding Effects 0.000 description 2
- 210000003483 chromatin Anatomy 0.000 description 2
- 230000000295 complement effect Effects 0.000 description 2
- 238000004883 computer application Methods 0.000 description 2
- 229940104302 cytosine Drugs 0.000 description 2
- 238000013461 design Methods 0.000 description 2
- 238000003745 diagnosis Methods 0.000 description 2
- 208000035475 disorder Diseases 0.000 description 2
- 238000009826 distribution Methods 0.000 description 2
- 239000003596 drug target Substances 0.000 description 2
- 238000006911 enzymatic reaction Methods 0.000 description 2
- 230000037406 food intake Effects 0.000 description 2
- 238000013467 fragmentation Methods 0.000 description 2
- 238000006062 fragmentation reaction Methods 0.000 description 2
- 238000012252 genetic analysis Methods 0.000 description 2
- 102000054766 genetic haplotypes Human genes 0.000 description 2
- 238000003205 genotyping method Methods 0.000 description 2
- KWIUHFFTVRNATP-UHFFFAOYSA-N glycine betaine Chemical compound C[N+](C)(C)CC([O-])=O KWIUHFFTVRNATP-UHFFFAOYSA-N 0.000 description 2
- 238000000126 in silico method Methods 0.000 description 2
- 238000009533 lab test Methods 0.000 description 2
- 239000003550 marker Substances 0.000 description 2
- 238000002483 medication Methods 0.000 description 2
- 108091027963 non-coding RNA Proteins 0.000 description 2
- 102000042567 non-coding RNA Human genes 0.000 description 2
- 239000013612 plasmid Substances 0.000 description 2
- 229920001223 polyethylene glycol Polymers 0.000 description 2
- 238000004393 prognosis Methods 0.000 description 2
- APTZNLHMIGJTEW-UHFFFAOYSA-N pyraflufen-ethyl Chemical compound C1=C(Cl)C(OCC(=O)OCC)=CC(C=2C(=C(OC(F)F)N(C)N=2)Cl)=C1F APTZNLHMIGJTEW-UHFFFAOYSA-N 0.000 description 2
- 108020004418 ribosomal RNA Proteins 0.000 description 2
- 239000010979 ruby Substances 0.000 description 2
- 229910001750 ruby Inorganic materials 0.000 description 2
- 238000000926 separation method Methods 0.000 description 2
- 230000003068 static effect Effects 0.000 description 2
- 238000006467 substitution reaction Methods 0.000 description 2
- 239000010409 thin film Substances 0.000 description 2
- 238000011222 transcriptome analysis Methods 0.000 description 2
- 210000003171 tumor-infiltrating lymphocyte Anatomy 0.000 description 2
- 230000003827 upregulation Effects 0.000 description 2
- 238000002255 vaccination Methods 0.000 description 2
- 238000012070 whole genome sequencing analysis Methods 0.000 description 2
- WKKCYLSCLQVWFD-UHFFFAOYSA-N 1,2-dihydropyrimidin-4-amine Chemical compound N=C1NCNC=C1 WKKCYLSCLQVWFD-UHFFFAOYSA-N 0.000 description 1
- ABEXEQSGABRUHS-UHFFFAOYSA-N 16-methylheptadecyl 16-methylheptadecanoate Chemical compound CC(C)CCCCCCCCCCCCCCCOC(=O)CCCCCCCCCCCCCCC(C)C ABEXEQSGABRUHS-UHFFFAOYSA-N 0.000 description 1
- MXHRCPNRJAMMIM-SHYZEUOFSA-N 2'-deoxyuridine Chemical compound C1[C@H](O)[C@@H](CO)O[C@H]1N1C(=O)NC(=O)C=C1 MXHRCPNRJAMMIM-SHYZEUOFSA-N 0.000 description 1
- LRSASMSXMSNRBT-UHFFFAOYSA-N 5-methylcytosine Chemical compound CC1=CNC(=O)N=C1N LRSASMSXMSNRBT-UHFFFAOYSA-N 0.000 description 1
- 108700015125 Adenovirus DBP Proteins 0.000 description 1
- 101150062763 BMRF1 gene Proteins 0.000 description 1
- 241000322342 Bacillus phage M2 Species 0.000 description 1
- 241000701844 Bacillus virus phi29 Species 0.000 description 1
- 238000012169 CITE-Seq Methods 0.000 description 1
- 241000253373 Caldanaerobacter subterraneus subsp. tengcongensis Species 0.000 description 1
- 108091006146 Channels Proteins 0.000 description 1
- VYZAMTAEIAYCRO-UHFFFAOYSA-N Chromium Chemical compound [Cr] VYZAMTAEIAYCRO-UHFFFAOYSA-N 0.000 description 1
- 108091029523 CpG island Proteins 0.000 description 1
- 101150026402 DBP gene Proteins 0.000 description 1
- 108020001738 DNA Glycosylase Proteins 0.000 description 1
- 102000028381 DNA glycosylase Human genes 0.000 description 1
- 238000007399 DNA isolation Methods 0.000 description 1
- 101710134178 DNA polymerase processivity factor BMRF1 Proteins 0.000 description 1
- 108700020911 DNA-Binding Proteins Proteins 0.000 description 1
- 101710116602 DNA-Binding protein G5P Proteins 0.000 description 1
- 108010043461 Deep Vent DNA polymerase Proteins 0.000 description 1
- 101100300807 Drosophila melanogaster spn-A gene Proteins 0.000 description 1
- 101800001466 Envelope glycoprotein E1 Proteins 0.000 description 1
- 241000701533 Escherichia virus T4 Species 0.000 description 1
- LYCAIKOWRPUZTN-UHFFFAOYSA-N Ethylene glycol Chemical compound OCCO LYCAIKOWRPUZTN-UHFFFAOYSA-N 0.000 description 1
- 241000206602 Eukaryota Species 0.000 description 1
- 108700024394 Exon Proteins 0.000 description 1
- 108010010803 Gelatin Proteins 0.000 description 1
- 240000000594 Heliconia bihai Species 0.000 description 1
- 208000009889 Herpes Simplex Diseases 0.000 description 1
- 241000027036 Hippa Species 0.000 description 1
- 101000841267 Homo sapiens Long chain 3-hydroxyacyl-CoA dehydrogenase Proteins 0.000 description 1
- 101000804764 Homo sapiens Lymphotactin Proteins 0.000 description 1
- 101000688343 Homo sapiens Protein phosphatase 1 regulatory subunit 14B Proteins 0.000 description 1
- 208000026350 Inborn Genetic disease Diseases 0.000 description 1
- 108010002350 Interleukin-2 Proteins 0.000 description 1
- 108090000862 Ion Channels Proteins 0.000 description 1
- 102000004310 Ion Channels Human genes 0.000 description 1
- 241000764238 Isis Species 0.000 description 1
- 241001386813 Kraken Species 0.000 description 1
- 241000713666 Lentivirus Species 0.000 description 1
- 238000003657 Likelihood-ratio test Methods 0.000 description 1
- 102100029107 Long chain 3-hydroxyacyl-CoA dehydrogenase Human genes 0.000 description 1
- 102100035304 Lymphotactin Human genes 0.000 description 1
- 102100038895 Myc proto-oncogene protein Human genes 0.000 description 1
- OKIZCWYLBDKLSU-UHFFFAOYSA-M N,N,N-Trimethylmethanaminium chloride Chemical compound [Cl-].C[N+](C)(C)C OKIZCWYLBDKLSU-UHFFFAOYSA-M 0.000 description 1
- 108020004485 Nonsense Codon Proteins 0.000 description 1
- 108091028043 Nucleic acid sequence Proteins 0.000 description 1
- 108700020796 Oncogene Proteins 0.000 description 1
- 239000008118 PEG 6000 Substances 0.000 description 1
- 229910019142 PO4 Inorganic materials 0.000 description 1
- 241001495084 Phylo Species 0.000 description 1
- 229920001030 Polyethylene Glycol 4000 Polymers 0.000 description 1
- 229920002584 Polyethylene Glycol 6000 Polymers 0.000 description 1
- 229920002594 Polyethylene Glycol 8000 Polymers 0.000 description 1
- 229920001213 Polysorbate 20 Polymers 0.000 description 1
- 102100024146 Protein phosphatase 1 regulatory subunit 14B Human genes 0.000 description 1
- 230000008305 RNA mechanism Effects 0.000 description 1
- 108091005682 Receptor kinases Proteins 0.000 description 1
- 102000018120 Recombinases Human genes 0.000 description 1
- 108010091086 Recombinases Proteins 0.000 description 1
- 101710162453 Replication factor A Proteins 0.000 description 1
- 101710176758 Replication protein A 70 kDa DNA-binding subunit Proteins 0.000 description 1
- 101710176276 SSB protein Proteins 0.000 description 1
- 241000011473 Salmonella virus HK620 Species 0.000 description 1
- 241000270295 Serpentes Species 0.000 description 1
- 101710082933 Single-strand DNA-binding protein Proteins 0.000 description 1
- 102100036011 T-cell surface glycoprotein CD4 Human genes 0.000 description 1
- 101150104425 T4 gene Proteins 0.000 description 1
- 101800001690 Transmembrane protein gp41 Proteins 0.000 description 1
- 102000008579 Transposases Human genes 0.000 description 1
- 108010020764 Transposases Proteins 0.000 description 1
- 108700025716 Tumor Suppressor Genes Proteins 0.000 description 1
- 102000044209 Tumor Suppressor Genes Human genes 0.000 description 1
- 108010067390 Viral Proteins Proteins 0.000 description 1
- 238000001772 Wald test Methods 0.000 description 1
- 230000009471 action Effects 0.000 description 1
- 230000002411 adverse Effects 0.000 description 1
- 150000001345 alkine derivatives Chemical class 0.000 description 1
- 230000000845 anti-microbial effect Effects 0.000 description 1
- 239000004599 antimicrobial Substances 0.000 description 1
- 125000004429 atom Chemical group 0.000 description 1
- 150000001540 azides Chemical class 0.000 description 1
- 108010058966 bacteriophage T7 induced DNA polymerase Proteins 0.000 description 1
- 230000037429 base substitution Effects 0.000 description 1
- DRTQHJPVMGBUCF-PSQAKQOGSA-N beta-L-uridine Natural products O[C@H]1[C@@H](O)[C@H](CO)O[C@@H]1N1C(=O)NC(=O)C=C1 DRTQHJPVMGBUCF-PSQAKQOGSA-N 0.000 description 1
- 229960003237 betaine Drugs 0.000 description 1
- 238000010364 biochemical engineering Methods 0.000 description 1
- 230000007321 biological mechanism Effects 0.000 description 1
- 229960002685 biotin Drugs 0.000 description 1
- 235000020958 biotin Nutrition 0.000 description 1
- 239000011616 biotin Substances 0.000 description 1
- 230000000903 blocking effect Effects 0.000 description 1
- 210000001124 body fluid Anatomy 0.000 description 1
- 244000309466 calf Species 0.000 description 1
- 239000003560 cancer drug Substances 0.000 description 1
- 150000001720 carbohydrates Chemical class 0.000 description 1
- 230000024245 cell differentiation Effects 0.000 description 1
- 230000010261 cell growth Effects 0.000 description 1
- 238000012512 characterization method Methods 0.000 description 1
- 125000003636 chemical group Chemical group 0.000 description 1
- 238000007385 chemical modification Methods 0.000 description 1
- 208000011654 childhood malignant neoplasm Diseases 0.000 description 1
- 239000013611 chromosomal DNA Substances 0.000 description 1
- 230000002759 chromosomal effect Effects 0.000 description 1
- 238000004140 cleaning Methods 0.000 description 1
- 238000003759 clinical diagnosis Methods 0.000 description 1
- 239000003086 colorant Substances 0.000 description 1
- 150000001875 compounds Chemical class 0.000 description 1
- 239000013068 control sample Substances 0.000 description 1
- 230000009089 cytolysis Effects 0.000 description 1
- 210000000172 cytosol Anatomy 0.000 description 1
- 230000002950 deficient Effects 0.000 description 1
- MXHRCPNRJAMMIM-UHFFFAOYSA-N desoxyuridine Natural products C1C(O)C(CO)OC1N1C(=O)NC(=O)C=C1 MXHRCPNRJAMMIM-UHFFFAOYSA-N 0.000 description 1
- 238000010586 diagram Methods 0.000 description 1
- 231100000673 dose–response relationship Toxicity 0.000 description 1
- 230000009977 dual effect Effects 0.000 description 1
- 238000011156 evaluation Methods 0.000 description 1
- 230000005284 excitation Effects 0.000 description 1
- 231100000221 frame shift mutation induction Toxicity 0.000 description 1
- 229920000159 gelatin Polymers 0.000 description 1
- 239000008273 gelatin Substances 0.000 description 1
- 235000019322 gelatine Nutrition 0.000 description 1
- 235000011852 gelatine desserts Nutrition 0.000 description 1
- 239000010437 gem Substances 0.000 description 1
- 229910001751 gemstone Inorganic materials 0.000 description 1
- 208000016361 genetic disease Diseases 0.000 description 1
- 238000013412 genome amplification Methods 0.000 description 1
- 238000012268 genome sequencing Methods 0.000 description 1
- 230000037442 genomic alteration Effects 0.000 description 1
- 238000011331 genomic analysis Methods 0.000 description 1
- 230000008297 genomic mechanism Effects 0.000 description 1
- 108091005608 glycosylated proteins Proteins 0.000 description 1
- 102000035122 glycosylated proteins Human genes 0.000 description 1
- 230000012010 growth Effects 0.000 description 1
- 238000007490 hematoxylin and eosin (H&E) staining Methods 0.000 description 1
- 125000002887 hydroxy group Chemical group [H]O* 0.000 description 1
- 238000005417 image-selected in vivo spectroscopy Methods 0.000 description 1
- 210000002865 immune cell Anatomy 0.000 description 1
- 210000000987 immune system Anatomy 0.000 description 1
- 230000008676 import Effects 0.000 description 1
- 238000007901 in situ hybridization Methods 0.000 description 1
- 208000015181 infectious disease Diseases 0.000 description 1
- 239000003112 inhibitor Substances 0.000 description 1
- 238000012739 integrated shape imaging system Methods 0.000 description 1
- 206010073095 invasive ductal breast carcinoma Diseases 0.000 description 1
- 201000010985 invasive ductal carcinoma Diseases 0.000 description 1
- 238000011835 investigation Methods 0.000 description 1
- 150000002500 ions Chemical class 0.000 description 1
- 230000003902 lesion Effects 0.000 description 1
- 230000000670 limiting effect Effects 0.000 description 1
- 230000004807 localization Effects 0.000 description 1
- 238000007477 logistic regression Methods 0.000 description 1
- 229910001629 magnesium chloride Inorganic materials 0.000 description 1
- 159000000003 magnesium salts Chemical class 0.000 description 1
- 238000012423 maintenance Methods 0.000 description 1
- 238000007726 management method Methods 0.000 description 1
- 230000008774 maternal effect Effects 0.000 description 1
- 238000001840 matrix-assisted laser desorption--ionisation time-of-flight mass spectrometry Methods 0.000 description 1
- 125000002496 methyl group Chemical group [H]C([H])([H])* 0.000 description 1
- 238000007855 methylation-specific PCR Methods 0.000 description 1
- YACKEPLHDIMKIO-UHFFFAOYSA-N methylphosphonic acid Chemical group CP(O)(O)=O YACKEPLHDIMKIO-UHFFFAOYSA-N 0.000 description 1
- 238000000386 microscopy Methods 0.000 description 1
- QJGQUHMNIGDVPM-UHFFFAOYSA-N nitrogen group Chemical group [N] QJGQUHMNIGDVPM-UHFFFAOYSA-N 0.000 description 1
- 230000037434 nonsense mutation Effects 0.000 description 1
- 238000004806 packaging method and process Methods 0.000 description 1
- 239000012188 paraffin wax Substances 0.000 description 1
- 230000008775 paternal effect Effects 0.000 description 1
- 230000007170 pathology Effects 0.000 description 1
- 230000002688 persistence Effects 0.000 description 1
- NBIIXXVUZAFLBC-UHFFFAOYSA-K phosphate Chemical group [O-]P([O-])([O-])=O NBIIXXVUZAFLBC-UHFFFAOYSA-K 0.000 description 1
- 239000010452 phosphate Substances 0.000 description 1
- 125000002467 phosphate group Chemical group [H]OP(=O)(O[H])O[*] 0.000 description 1
- 229920002523 polyethylene Glycol 1000 Polymers 0.000 description 1
- 235000010486 polyoxyethylene sorbitan monolaurate Nutrition 0.000 description 1
- 239000000256 polyoxyethylene sorbitan monolaurate Substances 0.000 description 1
- 230000004481 post-translational protein modification Effects 0.000 description 1
- 238000002360 preparation method Methods 0.000 description 1
- 230000002265 prevention Effects 0.000 description 1
- 230000000644 propagated effect Effects 0.000 description 1
- 238000000746 purification Methods 0.000 description 1
- 238000012175 pyrosequencing Methods 0.000 description 1
- 238000013442 quality metrics Methods 0.000 description 1
- 230000002285 radioactive effect Effects 0.000 description 1
- 230000008707 rearrangement Effects 0.000 description 1
- 238000011084 recovery Methods 0.000 description 1
- 230000007115 recruitment Effects 0.000 description 1
- 230000002829 reductive effect Effects 0.000 description 1
- 238000003757 reverse transcription PCR Methods 0.000 description 1
- 238000012552 review Methods 0.000 description 1
- 239000003161 ribonuclease inhibitor Substances 0.000 description 1
- 102200085639 rs104886003 Human genes 0.000 description 1
- 102200085789 rs121913279 Human genes 0.000 description 1
- 102220197892 rs121913284 Human genes 0.000 description 1
- 150000003839 salts Chemical class 0.000 description 1
- 238000010008 shearing Methods 0.000 description 1
- 230000037432 silent mutation Effects 0.000 description 1
- 108700014590 single-stranded DNA binding proteins Proteins 0.000 description 1
- 239000010454 slate Substances 0.000 description 1
- 238000001228 spectrum Methods 0.000 description 1
- 238000010186 staining Methods 0.000 description 1
- 238000013517 stratification Methods 0.000 description 1
- 239000004094 surface-active agent Substances 0.000 description 1
- 208000024891 symptom Diseases 0.000 description 1
- WZWYJBNHTWCXIM-UHFFFAOYSA-N tenoxicam Chemical compound O=C1C=2SC=CC=2S(=O)(=O)N(C)C1=C(O)NC1=CC=CC=N1 WZWYJBNHTWCXIM-UHFFFAOYSA-N 0.000 description 1
- 229960002871 tenoxicam Drugs 0.000 description 1
- 238000002560 therapeutic procedure Methods 0.000 description 1
- RYYWUUFWQRZTIU-UHFFFAOYSA-K thiophosphate Chemical compound [O-]P([O-])([O-])=S RYYWUUFWQRZTIU-UHFFFAOYSA-K 0.000 description 1
- 210000001541 thymus gland Anatomy 0.000 description 1
- 231100000027 toxicology Toxicity 0.000 description 1
- 238000013518 transcription Methods 0.000 description 1
- 230000035897 transcription Effects 0.000 description 1
- 238000012546 transfer Methods 0.000 description 1
- 230000010474 transient expression Effects 0.000 description 1
- 230000007704 transition Effects 0.000 description 1
- 239000001226 triphosphate Substances 0.000 description 1
- 235000011178 triphosphate Nutrition 0.000 description 1
- 125000002264 triphosphate group Chemical class [H]OP(=O)(O[H])OP(=O)(O[H])OP(=O)(O[H])O* 0.000 description 1
- 238000011144 upstream manufacturing Methods 0.000 description 1
- DRTQHJPVMGBUCF-UHFFFAOYSA-N uracil arabinoside Natural products OC1C(O)C(CO)OC1N1C(=O)NC(=O)C=C1 DRTQHJPVMGBUCF-UHFFFAOYSA-N 0.000 description 1
- 229940045145 uridine Drugs 0.000 description 1
- 238000007482 whole exome sequencing Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B5/00—ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6844—Nucleic acid amplification reactions
- C12Q1/686—Polymerase chain reaction [PCR]
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
Definitions
- the present disclosure is generally related to the fields of genomics, transcriptomics, and bioinformatics and high-throughput single cell analysis.
- High-throughput single cell analysis can provide extensive and valuable information about a subject (e.g., a human patient) or a population which can be used to make informed decisions regarding health-related matters.
- Such methods and systems may have vast applications in diagnostics, prognostics, personalized and precision medicine, drug design, discovery, and development.
- a method of single cell analysis comprising: (a) providing or obtaining a plurality of cells; (b) performing one or more experiments on single cells of the plurality of cells to generate at least a first data set and a second data set from the plurality of cells, wherein the first data set is a genomic data set and the second data set is a transcriptomic data set and/or a proteomic data set; (c) identifying a correlation between the first data set and the second data set for at least a portion of the plurality of cells; and (d) using the correlation obtained in (c), identifying a disease biomarker, designing a therapeutic, or designing a vaccine for a disease.
- performing the one or more experiments comprises performing primary template directed amplification (PTA).
- the one or more experiments or screens comprise a genomics experiment, a transcriptomic experiment, a proteomics experiment, or any combination thereof.
- the one or more experiments comprise high-throughput single cell analysis, wherein single cells of the plurality of cells are screened in high-throughput.
- the one or more experiments are performed using a miniaturized high-throughput single cell screening system.
- the method comprises compartmentalizing the plurality of cells into a plurality of partitions, wherein a partition of the plurality of partitions comprises a single cell of the plurality of cells.
- the plurality of partitions comprises a plurality of wells, a plurality of droplets, or both.
- the wells are miniaturized wells.
- the miniaturized high-throughput single cell screening system comprises a microfluidic device, a miniaturized array, or both.
- the one or more experiments comprise performing one or more reactions.
- a partition of the plurality of partitions comprises a single cell therein, and the one or more experiments or screens comprise performing one or more reactions on the single cell in the partition.
- the one or more reactions comprise cell lysis.
- the one or more reactions comprise an amplification reaction.
- the amplification reaction comprises primary template directed amplification (PTA).
- the one or more reactions comprise lysing the single cell, extracting the molecular information of the single cell, thereby releasing cellular nucleic acids, proteins, lipids, and metabolites from the single cell in the partition, and performing an amplification reaction on the cellular nucleic acid molecule.
- performing the one or more reactions comprises using one or more reagents.
- the one or more reagent(s) comprise one or more of at least one amplification primer, at least one nucleic acid polymerase, and a mixture of nucleotides, wherein the mixture of nucleotides comprises at least one terminator nucleotide which terminates nucleic acid replication by the polymerase.
- the terminator nucleotide is an irreversible terminator.
- the terminator nucleotide is selected from the group consisting of nucleotides with modification to the alpha group, C3 spacer nucleotides, locked nucleic acids (LNA), inverted nucleic acids, 2' fluoro nucleotides, 3' phosphorylated nucleotides, 2'-O-Methyl modified nucleotides, and trans nucleic acids.
- the nucleotides with modification to the alpha group are alpha-thio dideoxynucleotides.
- the terminator nucleotide comprises modifications of the r group of the 3’ carbon of the deoxyribose.
- the terminator nucleotide is selected from the group consisting of dideoxynucleotides, inverted dideoxynucleotides, 3' biotinylated nucleotides, 3' amino nucleotides, 3’-phosphorylated nucleotides, 3'-O-methyl nucleotides, 3' carbon spacer nucleotides including 3' C3 spacer nucleotides, 3' C18 nucleotides, 3' Hexanediol spacer nucleotides, acyclonucleotides, and combinations thereof.
- a partition of the plurality of partitions comprises at least a single cell and a bead.
- the bead delivers a reagent for performing a reaction on the single cell in the partition.
- the reagent is bound to the bead via a cleavable linker and is configured to be released from the bead via cleavage of the cleavable linker.
- the reagent comprises a barcode configured to identify the cell or a constituent of the cell.
- the bead can envelop the entire cell to enable chemical reactions at a miniaturized scale.
- the constituent of the cell comprises genomic material of the cell, ribonucleic acid (RNA), deoxyribonucleic acid (DNA), or any combination thereof.
- the method comprises lysing the cell in the partition, releasing a cellular nucleic acid molecule of the cell in the partition, releasing the barcode from the bead via cleavage of the cleavable linker, and hybridizing the cellular nucleic acid molecule to the barcode.
- the one or more reactions comprise lysing the single cell, thereby releasing cellular nucleic acid molecules in the partition, performing one or more amplification reactions on the cellular nucleic acid molecules thereby generating amplified cellular nucleic acid molecules, and wherein the method further comprises extracting the amplified cellular nucleic acid molecules from the partition, and sequencing the amplified cellular nucleic acid molecules.
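The barcode steps above tag each cell's nucleic acids before sequencing, so that sequenced molecules can later be traced back to their cell of origin. A minimal sketch of that assignment step follows, assuming each read begins with a fixed-length barcode. The barcode sequences, the `BARCODES` map, and the `demultiplex` helper are hypothetical illustrations and are not taken from the disclosure.

```python
from collections import defaultdict

# Hypothetical barcode-to-cell map; real barcode designs are not given in the text.
BARCODES = {"ACGT": "cell_1", "TGCA": "cell_2"}


def demultiplex(reads, barcodes=BARCODES):
    """Group reads by the cell whose barcode prefixes the read."""
    bc_len = len(next(iter(barcodes)))  # assumes all barcodes share one length
    by_cell = defaultdict(list)
    for read in reads:
        cell = barcodes.get(read[:bc_len])
        if cell is not None:  # reads with an unknown barcode are dropped
            by_cell[cell].append(read[bc_len:])  # keep only the insert sequence
    return dict(by_cell)


reads = ["ACGTAAAA", "TGCACCCC", "GGGGTTTT", "ACGTTTTT"]
assignments = demultiplex(reads)
```

Exact-prefix matching is the simplest possible design choice here; a production pipeline would typically also tolerate sequencing errors in the barcode.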
- generating the first data set comprises performing primary template directed amplification (PTA) and generating the second data set comprises performing a reverse transcription reaction.
- performing the reverse transcription reaction comprises generating a cDNA library.
- generating the first data set comprises determining a methylation site in a cellular nucleic acid molecule using PTA, thereby generating a methylation library.
- the method further comprises comparing the methylation library to a reference library for a single cell of the plurality of cells, wherein the methylation library and the reference library are generated from the same cell.
- identifying the correlation comprises calculating or assigning a penetrance score to the correlation of these molecular data (biomarkers), wherein the penetrance score quantifies the correlation.
- the penetrance score guides identifying the disease biomarker, identifying collection of biomarkers which may comprise one or more of the multiomic modalities, designing the therapeutic, designing the vaccine for the disease, or any combination thereof.
- a high penetrance score indicates a strong correlation between the first data set and the second data set.
- the high penetrance score indicates that the expression of a gene identified in the first data set leads to a transcriptomic event, a proteomic event or both, and wherein the gene is identified as a disease biomarker.
- a low penetrance score indicates a weak correlation between the first data set and the second data set, and that the expression of a gene identified in the first data set does not lead to a transcriptomic event, a proteomic event, or either, and wherein the gene is not identified as a disease biomarker.
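The text states that the penetrance score quantifies the correlation between the two data sets but does not give a formula. As a sketch only, one could assume the score is the absolute Pearson correlation between per-cell variant presence (0 or 1, from the genomic data set) and per-cell expression of the same gene (from the transcriptomic data set). The `pearson` and `penetrance_score` functions below are hypothetical and are not the disclosed algorithm.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # no variation in one data set: treat as uncorrelated
    return cov / sqrt(var_x * var_y)


def penetrance_score(genotype, expression):
    """Hypothetical score in [0, 1]: |r| between variant presence and expression."""
    return abs(pearson(genotype, expression))


# Four cells: the variant is present in the last two, which also express more.
score = penetrance_score([0, 0, 1, 1], [1.0, 1.0, 5.0, 5.0])
```

Under this assumption, a score near 1 would correspond to the "high penetrance" case described above, and a score near 0 to the "low penetrance" case.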
- identifying the correlation is performed with the aid of a computer system comprising a computer program.
- the computer program comprises one or more bioinformatics algorithms or workflows.
- the first data set and the second data set are combined or integrated into a database with or without links to related datasets independently generated across the research community.
- a system for determining a penetrance score comprising: a computing system comprising at least one processor and instructions executable by the at least one processor to provide an application configured to perform operations comprising: receiving multiomics data from one or more sources and at least one biological state; and applying an algorithm configured to process the data and generate a penetrance score.
- the computing system comprises a cloud computing platform.
- the multiomics data comprises data obtained from analysis of one or more of genomic DNA, transcript RNA, proteins, lipids, or metabolites.
- the correlation is quantified by a penetrance score.
- the penetrance score is at least 0.5. In some embodiments, the penetrance score is at least 0.9.
- a method of developing a treatment for a disease comprising: (a) generating multiomics data from one or more single cells, wherein generating comprises performing Primary Template Directed Amplification (PTA), and wherein the multiomics data comprises two or more of genome data, transcriptome data, and proteomics data; (b) correlating one or more mutations in genome data with corresponding mutations in one or both of (i) an mRNA of the transcriptome data and (ii) a protein of the proteome data; and (c) generating a treatment targeting one or both of the mRNA and the protein, thereby developing the treatment for the disease.
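Step (b) of the method above correlates genome mutations with corresponding changes in the mRNA or protein of the same cell. A minimal sketch of one way to do this is shown below, assuming variants from each modality have already been called per cell and normalized to comparable string labels. The `concordant_variants` helper, the cell identifiers, and the variant labels are hypothetical.

```python
def concordant_variants(dna_calls, rna_calls):
    """Return, per cell, the variants observed in both DNA and RNA."""
    result = {}
    for cell, dna in dna_calls.items():
        shared = set(dna) & set(rna_calls.get(cell, ()))
        if shared:  # cells with no shared variant are omitted
            result[cell] = sorted(shared)
    return result


# Hypothetical per-cell calls; labels are placeholders, not real variants.
dna = {"c1": ["GENE_A:c.100A>G", "GENE_B:c.5del"], "c2": ["GENE_A:c.100A>G"]}
rna = {"c1": ["GENE_A:c.100A>G"], "c2": []}
shared = concordant_variants(dna, rna)
```

Variants that appear in DNA but not in the transcript (such as `GENE_B:c.5del` in cell `c1` here) would be candidates for the "silent change" category mentioned elsewhere in the text.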
- the disease comprises or is cancer.
- the treatment comprises an mRNA vaccine. In some embodiments, the treatment comprises reprogramming a dendritic cell to target one or both of the mRNA or protein.
- the mutation in genome data comprises a DNA mutation. In some embodiments, the DNA mutation is selected from the group consisting of SNV*X, CNV*X, translocation, INDEL, frameshift, stop codon, mitochondrial, promoter/enhancer, TCR/BCR, and other change.
- the mRNA comprises a transcript change. In some embodiments, the transcript change is selected from the group consisting of expression, splice variant, fusion, lncRNA, miRNA, TCR/BCR, promoter, truncated gene, mitochondrial, or mutation.
- the protein comprises a protein change.
- the protein change is selected from the group consisting of over/under expressed, truncated, surface bound, frameshift, misfolded, metabolic, ligand independence, conformation, activity change, or fused.
- the multiomics data comprises one or more measurements. In some embodiments, one or more of the measurements is a silent change. In some embodiments, the multiomics data comprises data from one or more of a genome, a transcriptome, a proteome, a metabolome, a lipidome, or an epigenome. In some embodiments, the multiomics data comprises data from a genome. In some embodiments, the one or more measurements are selected from the group consisting of: copy number variation, translocation, and mutation burden.
- the disease comprises cancer.
- cancer comprises breast cancer.
- breast cancer comprises ductal carcinoma.
- the cancer comprises leukemia.
- the single cells, e.g., single cancer cells
- the multiomics data comprises data from a methylome.
- the one or more measurements are selected from the group consisting of: methylation at CpG sites, gene activation, and gene repression.
- the multiomics data comprises data from a transcriptome.
- the one or more measurements are selected from the group consisting of: expressed genes, gene fusions, and splice variants.
- the multiomics data comprises data from a proteome.
- the one or more measurements are selected from the group consisting of: translation level, phosphorylation state, and protein modification.
- the one or more sources comprise an individual organism.
- the one or more sources comprise cells.
- the cells are mammalian cells, human cells, bacterial cells, cancer cells, an immortalized cell line, a primary patient cell line, or any combination thereof.
- the cells are obtained from a tissue.
- the cells are obtained from a tissue cross-section.
- the biological state comprises a disease state.
- the disease state comprises cancer.
- the algorithm further generates a mechanism based on the data.
- the mechanism is generated by detecting one or more changes in one or more measurements.
- the change comprises a genome DNA change.
- the genome DNA change is selected from the group consisting of SNV*X, CNV*X, translocation, INDEL, frameshift, stop codon, mitochondrial, promoter/enhancer, TCR/BCR, and other change.
- the change comprises a transcript change.
- the transcript change is selected from the group consisting of expression, splice variant, fusion, lncRNA, miRNA, TCR/BCR, promoter, truncated gene, mitochondrial, or mutation.
- the change comprises a protein change.
- the protein change is selected from the group consisting of over/under expressed, truncated, surface bound, frameshift, misfolded, metabolic, ligand independence, conformation, activity change, or fused.
- the mechanism is determined to be one or more of a genomic, transcriptomic, proteomic, lipidomic, or metabolomic mechanism.
- a method for validating a disease target for a disease comprising (a) selecting cells from a tissue; (b) banking the cells; (c) performing one or more multiomic methods on the cells to generate multiomics data; and (d) applying a computer algorithm to process the multiomics data and generate a disease target.
- selecting the cells comprises FACS sorting, microfluidics, spatial cell selection, or ultra-high throughput cell sorting.
- the number of cells is at least about 200, 300, 400, 500, 1000, 2000, 3000, 4000, 5000, 6000, 10,000 or greater.
- the disease is cancer.
- the multiomics methods comprise PTA.
- the multiomics data comprises data from one or more of a genome, epigenome, transcriptome, proteome, lipidome, or metabolome.
- the method further comprises a treatment based on the disease target.
- the treatment comprises an mRNA vaccine or small molecule.
- the method or system is capable of detecting a number of RNA variants per cell of at least 750, 1000, 1500, 2000, 2500 or higher. In some embodiments, the method or system is capable of detecting a number of genes per cell of from about 1000 to about 8000. In some embodiments, the method or system is capable of detecting a number of RNA variants per cell of at least 750, 1000, 1500, 2000, 2500 or higher and a number of genes per cell of from about 1000 to about 8000.
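The "genes per cell" figure above is a per-cell count of detected genes. A minimal sketch of how such a metric could be computed from a count matrix follows, assuming a gene counts as detected when it has at least `min_count` reads in that cell. The `genes_per_cell` function, the threshold, and the toy counts are illustrative assumptions; the disclosure does not specify a detection criterion.

```python
def genes_per_cell(counts, min_count=1):
    """Count, for each cell, the genes with at least `min_count` reads."""
    return {
        cell: sum(1 for n in gene_counts.values() if n >= min_count)
        for cell, gene_counts in counts.items()
    }


# Toy matrix as nested dicts: cell -> gene -> read count (hypothetical values).
counts = {
    "c1": {"g1": 3, "g2": 0, "g3": 1},
    "c2": {"g1": 0},
}
detected = genes_per_cell(counts)
```

The same pattern applies to the "RNA variants per cell" figure if the inner dictionary holds per-variant supporting read counts instead of per-gene counts.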
- the methods comprise full length synthesis of RNA transcripts in the cell, wherein a plurality of amplification products achieved from performing the method are substantially unbiased over a range of 5'-3' gene body percentiles.
- the methods and systems of the present disclosure are capable of amplifying and detecting transcripts of at least 1 kb, 1.5 kb, 2 kb, 2.5 kb, 3 kb, 3.5 kb, 4 kb, 4.5 kb, or longer.
- these transcripts may consist of coding information from one or more genes and represent aberrations of splicing which can affect, but are not limited to, transcript isoforms or gene fusion events.
- FIG. 1 depicts a workflow comprising providing a sample, cell selection, and multiomic analysis (including genome, methylome, transcriptome, and proteome);
- FIG. 2 depicts various multiomic modalities which contribute to penetrance score
- FIG. 3 depicts a workflow schematic of measuring penetrance score using multiomic analysis
- FIG. 4 depicts a list of various biological inquiries useful for multiomics measurements
- FIG. 5 depicts another workflow schematic for the types of changes in multiomics measurements which in some instances is used for determining a mechanism
- FIG. 6 depicts a workflow schematic for spatially selecting cells from a frozen specimen, banking the cells, performing multiomic chemistry processes, providing multiomic data/measurements to a computational engine process, and validating targets;
- FIG. 7 depicts a schematic of factors which in some instances dictate cell fate
- FIG. 8 schematically illustrates the various components and applications of the methods and systems of the present disclosure
- FIG. 9 depicts a workflow schematic for mammalian and bacterial multiomics analysis using the methods and systems of the present disclosure
- FIG. 10 schematically illustrates a workflow involving the computational components and systems of the present disclosure
- FIG. 11 depicts an example of a web/mobile application provision system; in this case, a system providing browser-based and/or native mobile user interfaces;
- FIG. 12 depicts an example of a cloud-based web/mobile application provision system; in this case, a system comprising an elastically load balanced, auto-scaling web server and application server resources as well as synchronously replicated databases;
- FIG. 13 depicts change in cellular growth rates of MOLM-13 cell lines in the presence of the cancer drug quizartinib where resistant clones are fostering over genetically native cells;
- FIG. 14A depicts a genomic view of allele variation in the FLT3 gene in resistant and parental strains
- FIG. 14B depicts CNV genomic data of resistant and parental strains
- FIG. 14C depicts karyotypes of resistant and parental strains
- FIG. 14D depicts a principal component analysis of the transcriptomics data of parental and resistant cells
- FIG. 14E depicts a clustered heat map of transcriptomic data
- FIG. 14F depicts a mechanism for transcriptional bypass of FLT3 signaling in resistant cells
- FIGS. 14G-14H depict alternative exon utilization in transcriptional data
- FIG. 15A depicts a PCA of SNV data, showing discrimination between groups based on genomic variation
- FIG. 15B depicts clustered SNV data, showing groups of genomic positions with similar zygosity across biological groups
- FIG. 16A depicts SNV-gene expression interactions, highlighting specific mutations within genes associated with expression changes that are significant across biological groups;
- FIG. 16B depicts the location of a SNV in the MYC gene;
- FIG. 16C depicts a plot of MYC gene expression and SNV genotype for the parental and resistant cells showing similar grouping of resistant cells with the signature;
- FIG. 17 depicts H&E and a-ER staining of the primary cancer cells prior to sequencing;
- FIG. 18A depicts heterogeneity in CNV in primary breast cancer cells;
- FIG. 18B depicts known CNV in DCIS
- FIG. 19 depicts SNV PIK3CA mutations detected in single cells derived from 3 separate patients
- FIG. 20 depicts SNV and CNV detected in single cells of a DCIS patient
- FIG. 21 depicts correlations between genomic and transcriptomic data
- FIGs. 22A-22C show experimental data generated using the methods and systems of the present disclosure (ResolveOME) and its comparison to droplet RNA sequencing demonstrating superior RNA performance with respect to enhanced gene body coverage, increased representation across transcript sizes, and robust variant calling;
- FIG. 23A shows significant isoforms across parental or resistant clones of the MOLM-13 (transcript ‘A’ and ‘B’) from the same genes;
- FIG. 23B shows transcripts that are significantly associated to changes in copy number ploidy across the genomes of MOLM-13 cells
- FIG. 23C shows genomic variants of MOLM-13, in regulatory regions of the genome (depicted by color), that are also significantly associated with transcript changes across resistant cells.
- there is an unmet need for comprehensive and effective approaches to generate one or more datasets, including genomics, transcriptomics, proteomics, and methylomics datasets, and to identify correlations therebetween, such as to diagnose patients, identify biomarkers, design therapeutics or vaccines, prescribe medications, and/or implement individualized/personalized medicine approaches.
- a comprehensive approach comprising elements of high-throughput single cell analysis, genomics, transcriptomics, proteomics, bioinformatics, software engineering, and data analysis for generating and analyzing data sets that have vast applications for identifying disease biomarkers, diagnosing patients, and designing drugs or vaccines.
- systems and methods for processing and visualization of biological data, e.g., biomarkers. Further provided herein are systems and methods that generate a penetrance score. Further provided herein are systems and methods to interrogate disease mechanisms. Further provided herein are systems and methods for validating therapeutic targets using penetrance data and mechanism.
- template target nucleic acids
- determining means determining if an element is present or not (for example, detection). These terms can include quantitative, qualitative, or quantitative and qualitative determinations. Assessing can be relative or absolute. “Detecting the presence of” can include determining the amount of something present in addition to determining whether it is present or absent depending on the context.
- the term “gene” can refer to a linear sequence of nucleotides along a segment of DNA that provides the coded instructions for synthesis of RNA, which, when translated into protein, leads to the expression of hereditary character.
- nucleic acid molecule can mean DNA, RNA, single-stranded, double-stranded or triple-stranded and any chemical modifications thereof. Virtually any modification of the nucleic acid is contemplated.
- a “nucleic acid molecule” can be of almost any length, from 10, 20, 30, 40, 50, 60, 75, 100, 125, 150, 175, 200, 225, 250, 275, 300, 400, 500, 600, 700, 800, 900, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 6000, 7000, 8000, 9000, 10,000, 15,000, 20,000, 30,000, 40,000, 50,000, 75,000, 100,000, 150,000, 200,000, 500,000, 1,000,000, 1,500,000, 2,000,000, 5,000,000 or even more bases in length, including increments therein, up to a full-length chromosomal DNA molecule.
- the nucleic acid isolated from a sample is typically RNA.
- a single-stranded nucleic acid molecule is “complementary” to another single-stranded nucleic acid molecule, in certain embodiments of the subject matter described herein, when it can base-pair (hybridize) with all or a portion of the other nucleic acid molecule to form a double helix (double-stranded nucleic acid molecule), based on the ability of guanine (G) to base pair with cytosine (C) and adenine (A) to base pair with thymine (T) or uridine (U).
- the nucleotide sequence 5'-TATAC-3' is complementary to the nucleotide sequence 5'-GTATA-3'.
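The complementarity rule above (G pairs with C, A pairs with T or U, and the two strands run antiparallel) can be checked mechanically. The sketch below is illustrative only; the `reverse_complement` and `is_complementary` helpers are hypothetical names, and the check covers full-length complementarity rather than the partial base-pairing the definition also allows.

```python
# Map each base to its pairing partner; U (RNA) pairs with A like T does.
COMPLEMENT = str.maketrans("ACGTU", "TGCAA")


def reverse_complement(seq):
    """Return the DNA strand that base-pairs with `seq`, written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def is_complementary(a, b):
    """True if `b` (5' to 3') pairs with all of `a` (5' to 3')."""
    return reverse_complement(a) == b.upper()


# The example from the text: 5'-TATAC-3' and 5'-GTATA-3'.
pairs = is_complementary("TATAC", "GTATA")
```

Reversing the string after complementing is what accounts for the antiparallel orientation of the two strands.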
- mutation can refer to a change in the genome with respect to the standard wild-type sequence. Mutations can be deletions, insertions, or rearrangements of nucleic acid sequences at a position in the genome, or they can be single base changes at a position in the genome, referred to as “point mutations.” Mutations can be inherited, or they can occur in one or more cells during the lifespan of an individual. In some instances, mutation and variant are used synonymously.
- kit or “research kit” can refer to a collection of products that are used to perform a biological research reaction, procedure, or synthesis, such as, for example, a detection, assay, separation, purification, etc., which are typically shipped together, usually within a common packaging, to an end user.
- Described herein is a cloud-based solution for the storage, query, and analysis of longitudinal data comprising a multiplicity of whole genomes, a large number of public and proprietary annotation sources as well as associated high quality phenotypic data, including microbiome metagenomes and metabolomics profiles.
- the data analyzed by the platforms, systems, media, and methods described herein comprises more than 1,000, more than 5,000, more than 10,000, more than 20,000, more than 50,000, more than 100,000, more than 500,000, or more than 1,000,000 whole genomes.
- the data analyzed by the platforms, systems, media, and methods described herein comprises genomic data.
- the genomic data is produced, by way of example, at a next generation sequencing (NGS) lab.
- an AWS analysis pipeline based on Illumina’s HiSeq X and the ISIS Analysis Software are utilized to produce the genomic data.
- Sequencing reads are mapped to the hg38 human reference sequence and variant callers are used to call single nucleotide variants (SNVs) and insertions and deletions (indels).
- the genomic data comprises a multiplicity of unique SNVs.
- the genomic data comprises over 1 million, over 10 million, over 50 million, over 100 million, over 500 million, or over 1 billion unique SNVs.
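The variant-calling step described above distinguishes single nucleotide variants from insertions and deletions. As a rough sketch, assuming each call is represented by a VCF-style reference allele and alternate allele (an assumption, since the text does not name a file format), the distinction can be made from allele lengths alone. The `classify_variant` function is a hypothetical helper, not part of any named variant caller.

```python
def classify_variant(ref, alt):
    """Classify a call by comparing reference and alternate allele lengths."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV" if ref != alt else "no_change"
    if len(ref) < len(alt):
        return "insertion"
    if len(ref) > len(alt):
        return "deletion"
    return "other"  # equal-length multi-base substitution


calls = [("A", "G"), ("A", "AT"), ("AT", "A"), ("AT", "GC")]
labels = [classify_variant(ref, alt) for ref, alt in calls]
```

Counting the distinct calls labelled `SNV` across all genomes would give the "unique SNVs" figure quoted above.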
- the data analyzed by the platforms, systems, media, and methods described herein comprises metadata.
- the whole genomes are associated with high quality phenotypic information.
- a proprietary phenotype ingestion process enables the cleaning and standardization of phenotype data across disparate data sources.
- the ingestion process includes: data integrity checks; standardization of units; standardization of terms; ontology/vocabulary mapping; and maintenance of the proprietary data dictionary.
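The ingestion steps listed above (integrity checks, unit standardization, term standardization) can be sketched for a single phenotype record. Everything in this example is an assumption made for illustration: the record layout, the `TO_CM` conversion table, the `TERM_MAP` vocabulary, and the `standardize` function are hypothetical, and the proprietary process itself is not described in the text.

```python
# Hypothetical conversion factors to a common unit (centimetres).
TO_CM = {"cm": 1.0, "m": 100.0, "in": 2.54}

# Hypothetical vocabulary mapping of field-name synonyms to one standard term.
TERM_MAP = {"ht": "height", "stature": "height"}


def standardize(record):
    """Validate one record, then normalize its field name and unit."""
    field = record["field"].strip().lower()
    unit = record["unit"].strip().lower()
    value = record["value"]
    # Data integrity checks: value must be present and the unit must be known.
    if value is None:
        raise ValueError("missing value")
    if unit not in TO_CM:
        raise ValueError(f"unknown unit: {unit}")
    return {
        "field": TERM_MAP.get(field, field),  # standardization of terms
        "value": value * TO_CM[unit],         # standardization of units
        "unit": "cm",
    }


clean = standardize({"field": " Stature ", "value": 2, "unit": "M"})
```

Rejecting records with unknown units, rather than passing them through, is one possible reading of "data integrity checks"; a real pipeline might instead quarantine them for review.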
- the phenotype data comprises more than 1000, more than 5000, more than 10,000, more than 100,000, more than 1,000,000, or more than 10,000,000 phenotype data fields with more than 1 million, more than 5 million, more than 10 million, more than 50 million, more than 100 million, more than 500 million, or more than 1 billion data points.
- Phenotypic data in some instances comprises cellular phenotype data. In some instances, cellular phenotypic data is obtained from microscopy.
- cell phenotypic data comprises one or more observable phenotypic traits such as cell shape or morphology, size, texture, internal structure, patterns of distribution of one or more specific proteins, glycosylated proteins, nucleic acid molecules, lipid molecules, glycosylated lipid molecules, carbohydrate molecules, metabolites, and ions.
- phenotypic data describes populations of cells described herein.
- phenotypic data describes phenotypic traits of an organism such as a human.
- phenotypic data comprises a clinical designation or category, for example, a clinical diagnosis, a clinical parameter name, a clinical parameter value, a laboratory test name or a laboratory test value.
- a phenotype is associated with an observable disease characteristic.
- the data analyzed by the platforms, systems, media, and methods described herein comprises annotation data.
- Annotation data is also cleaned and standardized through an automated end-to-end solution, which allows: idempotence, immutability, persistence; high quality data; consistency between data sources; and scalability and flexibility.
- Samples described herein may represent biologic information obtained from individuals or populations of individuals (e.g., genomic information).
- samples comprise single cells.
- samples comprise 1, 2, 5, 10, 20, 25, 50, 75, 100, 200, 500, or more than 1000 cells from the same or different individual.
- samples comprise 1000, 2000, 5000, 10,000, 20,000, 50,000, 75,000, or at least 100,000 cells from the same or different individual.
- Samples may be obtained from any species, including but not limited to viruses, bacteria, plants, fungi, protozoa, archaea, or animals.
- samples are obtained from vertebrates.
- samples are obtained from mammals.
- samples are obtained from humans.
- Samples in some instances are obtained from any bodily fluid or tissue.
- samples are obtained from diseased tissue such as a tumor.
- a method of single cell analysis comprising: (a) providing or obtaining a plurality of cells; (b) performing one or more experiments on single cells of the plurality of cells to generate at least a first data set and a second data set from the plurality of cells, wherein the first data set is a genomic data set and the second data set is a transcriptomic data set and/or a proteomic data set and/or a methylomic data set; (c) identifying a correlation between the first data set and the second data set for at least a portion of the plurality of cells; and (d) using the correlation obtained in (c), identifying a disease biomarker, designing a therapeutic, or designing a vaccine for a disease.
- performing the one or more experiments comprises performing primary template directed amplification (PTA).
- the one or more experiments or screens comprise a genomics experiment, a transcriptomic experiment, a proteomics experiment, a methylomics experiment or any combination thereof.
- the one or more experiments comprise high-throughput single cell analysis, wherein single cells of the plurality of cells are screened in high-throughput.
- the one or more experiments are performed using a miniaturized high-throughput single cell screening system.
- the method comprises compartmentalizing the plurality of cells into a plurality of partitions, wherein a partition of the plurality of partitions comprises a single cell of the plurality of cells.
- the plurality of partitions comprises a plurality of wells, a plurality of droplets, or both.
- the wells are miniaturized wells.
- the miniaturized high-throughput single cell screening system comprises a microfluidic device, a miniaturized array, or both.
- the one or more experiments comprise performing one or more reactions.
- a partition of the plurality of partitions comprises a single cell therein, and the one or more experiments or screens comprise performing one or more reactions on the single cell in the partition.
- the one or more reactions comprise cell lysis.
- the one or more reactions comprise an amplification reaction.
- the amplification reaction comprises primary template directed amplification (PTA).
- the one or more reactions comprise lysing the single cell, extracting the genomic material of the single cell, thereby releasing a cellular nucleic acid molecule from the single cell in the partition, and performing an amplification reaction on the cellular nucleic acid molecule.
- performing the one or more reactions comprises using one or more reagents.
- the one or more reagent(s) comprise one or more of at least one amplification primer, at least one nucleic acid polymerase, and a mixture of nucleotides, wherein the mixture of nucleotides comprises at least one terminator nucleotide which terminates nucleic acid replication by the polymerase.
- the terminator nucleotide is an irreversible terminator.
- the terminator nucleotide is selected from the group consisting of nucleotides with modification to the alpha group, C3 spacer nucleotides, locked nucleic acids (LNA), inverted nucleic acids, 2' fluoro nucleotides, 3' phosphorylated nucleotides, 2'-O-Methyl modified nucleotides, and trans nucleic acids.
- the nucleotides with modification to the alpha group are alpha-thio dideoxynucleotides.
- the terminator nucleotide comprises modifications of the R group of the 3’ carbon of the deoxyribose.
- the terminator nucleotide is selected from the group consisting of dideoxynucleotides, inverted dideoxynucleotides, 3' biotinylated nucleotides, 3' amino nucleotides, 3’-phosphorylated nucleotides, 3'-O-methyl nucleotides, 3' carbon spacer nucleotides including 3' C3 spacer nucleotides, 3' C18 nucleotides, 3' Hexanediol spacer nucleotides, acyclonucleotides, and combinations thereof.
- a partition of the plurality of partitions comprises at least a single cell and a bead.
- the bead delivers a reagent for performing a reaction on the single cell in the partition.
- the reagent is bound to the bead via a cleavable linker and is configured to be released from the bead via cleavage of the cleavable linker.
- the reagent comprises a barcode configured to identify the cell or a constituent of the cell.
- the constituent of the cell comprises genomic material of the cell, ribonucleic acid (RNA), deoxyribonucleic acid (DNA), or any combination thereof.
- the method comprises lysing the cell in the partition, releasing a cellular nucleic acid molecule of the cell in the partition, releasing the barcode from the bead via cleavage of the cleavable linker, and hybridizing the cellular nucleic acid molecule to the barcode.
- the one or more reactions comprise lysing the single cell, thereby releasing cellular nucleic acid molecules in the partition, performing one or more amplification reactions on the cellular nucleic acid molecules thereby generating amplified cellular nucleic acid molecules, and wherein the method further comprises extracting the amplified cellular nucleic acid molecules from the partition, and sequencing the amplified cellular nucleic acid molecules.
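Because each barcode identifies a cell or a constituent of a cell, sequenced reads can later be assigned back to their cell of origin. The sketch below assumes, purely for illustration, that the barcode occupies a fixed-length prefix of each read and that the set of valid barcodes is known; neither the read layout nor the barcode length is specified in the source, and `group_reads_by_barcode` is a hypothetical helper.

```python
from collections import defaultdict

BARCODE_LEN = 8  # assumed barcode length, not taken from the source


def group_reads_by_barcode(reads, known_barcodes, barcode_len=BARCODE_LEN):
    """Group read inserts by their leading cell barcode.

    Reads whose prefix is not a known barcode are discarded.
    """
    groups = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_len], read[barcode_len:]
        if barcode in known_barcodes:
            groups[barcode].append(insert)
    return dict(groups)
```

Exact matching is the simplest choice; a practical pipeline would usually also tolerate sequencing errors in the barcode, which is omitted here.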
- generating the first data set comprises performing primary template directed amplification (PTA) and generating the second data set comprises performing a reverse transcription reaction.
- performing the reverse transcription reaction comprises generating a cDNA library.
- generating the first data set comprises determining a methylation site in a cellular nucleic acid molecule using PTA, thereby generating a methylation library.
- the method further comprises comparing the methylation library to a reference library for a single cell of the plurality of cells, wherein the methylation library and the reference library are generated from the same cell.
- identifying the correlation comprises calculating or assigning a penetrance score to the correlation, wherein the penetrance score quantifies the correlation.
- the penetrance score guides identifying the disease biomarker, designing the therapeutic, designing the vaccine for the disease, or any combination thereof.
- a high penetrance score indicates a strong correlation between the first data set and the second data set.
- the high penetrance score indicates that the expression of a gene identified in the first data set leads to a transcriptomic event, a proteomic event or both, and wherein the gene is identified as a disease biomarker.
- a low penetrance score indicates a weak correlation between the first data set and the second data set, and that the expression of a gene identified in the first data set does not lead to a transcriptomic event, a proteomic event, or either, and wherein the gene is not identified as a disease biomarker.
- identifying the correlation is performed with the aid of a computer system comprising a computer program.
- the computer program comprises a bioinformatics algorithm.
- the first data set and the second data set are combined or integrated into a database.
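The source states that a penetrance score quantifies the correlation between the first and second data sets but does not give a formula. One possible proxy, shown here strictly as an assumption, is the fraction of cells carrying a genomic variant in a gene that also show a transcript or protein change for that gene. The record layout and the `penetrance_score` function are hypothetical.

```python
def penetrance_score(cells, gene):
    """Fraction of variant-carrying cells that also show a downstream change.

    `cells` is a list of dicts with the keys 'genomic_variants',
    'transcript_changes', and 'protein_changes', each a set of gene names.
    Returns None when no cell carries the variant.
    """
    carriers = [c for c in cells if gene in c["genomic_variants"]]
    if not carriers:
        return None
    concordant = sum(
        1
        for c in carriers
        if gene in c["transcript_changes"] or gene in c["protein_changes"]
    )
    return concordant / len(carriers)
```

Under this assumed definition, a value near 1 corresponds to the strong correlation described above for a high penetrance score, and a value near 0 to the weak correlation described for a low score.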
- a method of developing a treatment for a disease comprising: (a) generating multiomics data from one or more single cells, wherein generating comprises performing Primary Template Directed Amplification (PTA), and wherein the multiomics data comprises two or more of genome data, transcriptome data, and proteomics data; (b) correlating one or more mutations in genome data with corresponding mutations in one or both of (i) an mRNA of the transcriptome data and (ii) a protein of the proteome data; and (c) generating a treatment targeting one or both of the mRNA and the protein, thereby developing the treatment for the disease.
- the disease comprises or is cancer.
- the correlation is quantified by a penetrance score.
- the penetrance score is at least 0.5. In some embodiments, the penetrance score is at least 0.9.
- the treatment comprises an mRNA vaccine. In some embodiments, the treatment comprises reprogramming a dendritic cell to target one or both of the mRNA or protein.
- the mutation in genome data comprises a DNA mutation. In some embodiments, the DNA mutation is selected from the group consisting of SNV*X, CNV*X, translocation, INDEL, frameshift, stop codon, mitochondrial, promoter/enhancer, TCR/BCR, and other change.
- the mRNA comprises a transcript change. In some embodiments, the transcript change is selected from the group consisting of expression, splice variant, fusion, lncRNA, miRNA, TCR/BCR, promoter, truncated gene, mitochondrial, or mutation.
- the protein comprises a protein change.
- the protein change is selected from the group consisting of over/under expressed, truncated, surface bound, frameshift, misfolded, metabolic, ligand independence, conformation, activity change, or fused.
- the disease comprises cancer.
- cancer comprises breast cancer.
- the breast cancer comprises ductal carcinoma.
- the cancer comprises leukemia.
- the single cells, e.g., single cancer cells
- the method or system is capable of detecting a number of RNA variants per cell of at least 750, 1000, 1500, 2000, 2500 or higher.
- the method or system is capable of detecting a number of genes per cell of from about 1000 to about 8000.
- the method or system is capable of detecting a number of RNA variants per cell of at least 750, 1000, 1500, 2000, 2500 or higher and a number of genes per cell of from about 1000 to about 8000.
- the methods comprise full length synthesis of RNA transcripts in the cell, wherein a plurality of amplification products achieved from performing the method are substantially unbiased over a range of 5’-3’ gene body percentiles.
- the methods and systems of the present disclosure are capable of amplifying and detecting transcripts of at least 1 kb, 1.5 kb, 2 kb, 2.5 kb, 3 kb, 3.5 kb, 4 kb, 4.5 kb, or longer.
- multiomics may include analysis of at least one feature of a proteome, genome, transcriptome, metabolome, lipidome, or epigenome.
- Proteomics may include translation level, phosphorylation state, and protein modification.
- Transcriptomics may include, without limitations, analysis of ribosomal RNA (rRNA), messenger RNA (mRNA), transfer RNA (tRNA), micro-RNA (miRNA), and other non-coding RNA (ncRNA), or a combination thereof.
- Epigenomics may include, without limitations, analysis of methylation patterns (e.g.
- a method comprises one or more steps of isolating a single cell from a population of cells, wherein the single cell comprises RNA and genomic DNA; amplifying the RNA by RT-PCR to generate a cDNA library; isolating the cDNA from the genomic DNA; contacting the genomic DNA with at least one amplification primer, at least one nucleic acid polymerase, and a mixture of nucleotides; and sequencing the cDNA library and the genomic DNA library.
- the mixture of nucleotides comprises at least one terminator nucleotide which terminates nucleic acid replication by the polymerase to generate a genomic DNA library.
- Methods described herein may be used as a replacement for any number of other known methods in the art which are used for single cell sequencing (multiomics or the like).
- a method described herein comprises PTA and a method of polyadenylated mRNA transcripts.
- a method described herein comprises PTA and a method of non-polyadenylated mRNA transcripts.
- a method described herein comprises PTA and a method of total (polyadenylated and non-polyadenylated) mRNA transcripts.
- PTA may substitute genomic DNA sequencing methods such as MDA, PicoPlex, DOP-PCR, MALBAC, or target-specific amplifications.
- PTA replaces the standard genomic DNA sequencing method in a multiomics method including DR-seq (Dey et al., 2015), G&T seq (MacAulay et al., 2015), scMT-seq (Hu et al., 2016), sc-GEM (Cheow et al., 2016), scTrio-seq (Hou et al., 2016), simultaneous multiplexed measurement of RNA and proteins (Darmanis et al., 2016), scCOOL-seq (Guo et al., 2017), CITE-seq (Stoeckius et al., 2017), REAP-seq (Peterson et al., 2017), scNMT-seq (Clark et al., 2018), or
- PTA is combined with a standard RNA sequencing method to obtain genome and transcriptome data.
- a multiomics method described herein comprises PTA and one of the following: Drop-seq (Macosko, et al.
- an RT reaction mix is used to generate a cDNA library.
- the RT reaction mixture comprises a crowding reagent, at least one primer, a template switching oligonucleotide (TSO), a reverse transcriptase, and a dNTP mix.
- an RT reaction mix comprises an RNAse inhibitor.
- an RT reaction mix comprises one or more surfactants.
- an RT reaction mix comprises Tween-20 and/or Triton-X.
- an RT reaction mix comprises Betaine.
- an RT reaction mix comprises one or more salts.
- an RT reaction mix comprises a magnesium salt (e.g., magnesium chloride) and/or tetramethyl ammonium chloride.
- an RT reaction mix comprises gelatin.
- an RT reaction mix comprises PEG (PEG1000, PEG2000, PEG4000, PEG6000, PEG8000, or PEG of other length).
- Multiomic methods described herein may provide both genomic and RNA transcript information from a single cell (e.g., a combined or dual protocol).
- genomic information from the single cell is obtained from the PTA method, and RNA transcript information is obtained from reverse transcription to generate a cDNA library.
- a whole transcript method is used to obtain the cDNA library.
- 3’ or 5’ end counting is used to obtain the cDNA library.
- cDNA libraries are not obtained using UMIs.
- a multiomic method provides RNA transcript information from the single cell for at least 500, 1000, 2000, 5000, 8000, 10,000, 12,000, or at least 15,000 genes.
- a multiomic method provides RNA transcript information from the single cell for about 500, 1000, 2000, 5000, 8000, 10,000, 12,000, or about 15,000 genes. In some instances, a multiomic method provides RNA transcript information from the single cell for 100-12,000, 1000-10,000, 2000-15,000, 5000-15,000, 10,000-20,000, 8000-15,000, or 10,000-15,000 genes. In some instances, a multiomic method provides genomic sequence information for at least 80%, 90%, 92%, 95%, 97%, 98%, or at least 99% of the genome of the single cell. In some instances, a multiomic method provides genomic sequence information for about 80%, 90%, 92%, 95%, 97%, 98%, or about 99% of the genome of the single cell.
- RNA may be amplified in the multiomics methods described herein.
- RNA is amplified to isolate mRNA transcripts.
- template-switching polynucleotides are used.
- amplification of RNA uses labeled primers.
- a label comprises biotin.
- at least some of the cDNA polynucleotides are isolated with affinity binding to the label.
- multiomics methods comprise amplification of RNA to generate a cDNA library.
- a cDNA library is generated having at least 10, 20, 30, 50, 75, 100, 125, 150, 175, 200, 225, 250, 300, 350, 400, or at least 500 ng of DNA.
- a cDNA library is generated having 10-500, 20-500, 30-500, 50-500, 50-400, 50-300, 100-500, 100-400, 100-300, 100-200, 200-500, 300-500, or 400-750 ng of DNA.
- at least some polynucleotides in the cDNA library comprise a barcode.
- the cDNA comprises polynucleotides corresponding to at least 100, 500, 1000, 1500, 2000, 2500, 3000, 3500, or at least 4000 genes.
- the cDNA comprises a 5’ to 3’ transcript bias of 0.5-1.5, 0.6-1.5, 0.7-1.5, 0.8-1.5, 0.9-1.5, 1-1.5, 1-2.0, 1.2-2.0, or 0.5-2.0.
- Multiomic methods may comprise analysis of single cells from a population of cells. In some instances, at least 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, or at least 8000 cells are analyzed. In some instances, about 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, or about 8000 cells are analyzed. In some instances, 5-100, 10-100, 50-500, 100-500, 100-1000, 50-5000, 100-5000, 500-1000, 500-10000, 1000-10000, or 5000-20,000 cells are analyzed.
- Multiomic methods may generate yields of amplified genomic DNA from the PTA reaction based on the type of single cell.
- the amount of DNA generated from a single cell is about 0.1, 1, 1.5, 2, 3, 5, or about 10 micrograms.
- the amount of DNA generated from a single cell is about 0.1, 1, 1.5, 2, 3, 5, or about 10 femtograms.
- the amount of DNA generated from a single cell is at least 0.1, 1, 1.5, 2, 3, 5, or at least 10 micrograms.
- the amount of DNA generated from a single cell is at least 0.1, 1, 1.5, 2, 3, 5, or at least 10 femtograms.
- the amount of DNA generated from a single cell is about 0.1-10, 1-10, 1.5-10, 2-20, 2-50, 1-3, or 0.5-3.5 micrograms. In some instances, the amount of DNA generated from a single cell is about 0.1-10, 1-10, 1.5-10, 2-20, 2-4, 1-3, or 0.5-4 femtograms. In some instances, the amount of DNA generated from a single cell is about 0.5-2.5, 0.5-3, 0.5-5, 0.2-5, 1-2.5, or 1-5 ng of DNA. In some instances, the amount of DNA generated from a single cell is at least 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 2.25, 2.5, 2.75, 3, 3.25, 3.5, 4, or at least 5 ng of DNA.
- DNA libraries may comprise an allelic balance.
- the allelic balance is 50-100, 60-100, 70-100, 80-100, 60-95, 70-95, 80-95, 85-95, 90-95, 90-98, 90-99, 85-99, or 95-99 percent.
- the allelic balance is at least 50, 60, 70, 80, 83, 85, 87, 90, 92, 95, 98, or at least 99 percent.
- DNA libraries may comprise a sensitivity for one or more SNVs.
- the sensitivity is 0.50-1, 0.60-1, 0.70-1, 0.80-1, 0.60-0.95, 0.70-0.95, 0.80-0.95, 0.85-0.95, 0.90-0.95, 0.90-0.98, 0.90-0.99, 0.85-0.99, or 0.95-0.99.
- the sensitivity is at least 0.50, 0.60, 0.70, 0.80, 0.83, 0.85, 0.87, 0.90, 0.92, 0.95, 0.98, or at least 0.99.
- DNA libraries may comprise a precision for one or more SNVs.
- the precision is 0.50-1, 0.60-1, 0.70-1, 0.80-1, 0.60-0.95, 0.70-0.95, 0.80-0.95, 0.85-0.95, 0.90-0.95, 0.90-0.98, 0.90-0.99, 0.85-0.99, or 0.95-0.99.
- the precision is at least 0.50, 0.60, 0.70, 0.80, 0.83, 0.85, 0.87, 0.90, 0.92, 0.95, 0.98, or at least 0.99.
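The source reports SNV sensitivity and precision values without stating how they are computed. Under the standard definitions, sensitivity is TP / (TP + FN) and precision is TP / (TP + FP), where the called SNVs are compared against a trusted reference set. The sketch below assumes such a truth set is available and represents each SNV as a hypothetical `(chromosome, position, alt_base)` tuple.

```python
def snv_sensitivity_precision(called, truth):
    """Return (sensitivity, precision) of called SNVs against a truth set.

    Each SNV is any hashable identifier, e.g. (chromosome, position, alt_base).
    A metric is NaN when its denominator is zero.
    """
    called, truth = set(called), set(truth)
    tp = len(called & truth)   # called and real
    fp = len(called - truth)   # called but not real
    fn = len(truth - called)   # real but missed
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, precision
```

With five true SNVs and four calls of which three are correct, sensitivity is 3/5 and precision is 3/4.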
- a penetrance score represents the contribution of one or more pieces of molecular information that associate with the physical signs and symptoms of a genetic disorder.
- subjects with one or more biomarkers do not develop physical features of the disorder, and the condition has incomplete (or low) penetrance.
- penetrance is determined from one or more biomarkers and/or biological mechanisms (pathways).
- changes to biomarkers are used to determine penetrance score.
- these phenotypic changes are due to a functional element (e.g., an RNA or a protein).
- a change is silent (having no impact to protein).
- a phenotypic change is manifested as a change in measurements obtained from one or more multiomic modalities.
- multiomics comprises DNA (e.g., genome/epigenome), RNA (e.g., transcriptome), protein (proteome), and/or other molecules (e.g., lipidome, metabolome).
- multiomics enables determination of a mechanism with interdependent components for a disease state or disorder.
- a penetrance score and/or mechanism are used to identify validated therapeutic targets.
- treatments are generated based on the therapeutic target.
- the treatment comprises a vaccine, antibody, a genetic therapy, modified immune cells, or small molecule.
- a workflow comprises one or more steps of detecting a genetic change, transcript change, methylation change, and protein change.
- systems and methods comprise a workflow according to FIG. 3.
- a first step comprises detecting a genetic change.
- a lack of a genetic change indicates no genomic mechanism (e.g., allele-related).
- an optional second step comprises detecting a methylation change.
- a lack of methylation change indicates a gene is not silenced.
- a third step comprises detecting a transcript change.
- a lack of transcript change indicates no transcriptome mechanism. In some instances, a lack of transcript changes indicates a transient expression of the expressed gene. In some instances, a lack of transcript change indicates incomplete penetrance. In some instances, a fourth step comprises detecting a protein change. In some instances, a lack of change in the proteome indicates no proteomic mechanism. In some instances, lack of a change in the proteome indicates incomplete penetrance. In some instances, a change detected in two or more steps indicates high penetrance. In some instances, a change detected in three or more steps indicates high penetrance. In some instances, a change detected in four or more steps indicates high penetrance. In some instances, detected changes in the genome, transcriptome, and proteome indicate high penetrance.
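The stepwise workflow above can be sketched as a small decision function. The readings attached to a missing change are taken from the statements above; the function name, the dictionary layout, the configurable threshold, and the label "incomplete" for a result below the threshold are assumptions made for illustration.

```python
STEPS = ("genome", "methylome", "transcriptome", "proteome")

# Readings the text gives for a step in which no change is detected.
NO_CHANGE_NOTES = {
    "genome": "no genomic mechanism",
    "methylome": "gene is not silenced",
    "transcriptome": "no transcriptome mechanism, transient expression, or incomplete penetrance",
    "proteome": "no proteomic mechanism or incomplete penetrance",
}


def interpret_workflow(changes, min_steps_for_high=2):
    """Summarize per-step change calls.

    `changes` maps a step name to True/False; an optional step that was not
    run (e.g. 'methylome') is simply left out. The text describes thresholds
    of two, three, or four steps, hence the parameter.
    """
    notes = [NO_CHANGE_NOTES[s] for s in STEPS if s in changes and not changes[s]]
    detected = sum(1 for s in STEPS if changes.get(s))
    label = "high" if detected >= min_steps_for_high else "incomplete"
    return label, notes
```

For example, a genomic and a transcript change without a proteome change is labeled "high" at the two-step threshold but "incomplete" at the three-step threshold.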
- systems and methods described herein in some instances comprise one or more steps shown in FIG. 5.
- Systems and methods described herein in some instances comprise one or more measurements shown in FIG. 5.
- systems for determining a penetrance score comprise one or more of a computing system comprising at least one processor and instructions executable by the at least one processor to provide an application configured to perform operations.
- the operations comprise one or more of: receiving multiomics data from one or more sources and at least one biological state; and applying an algorithm configured to process the data and generate a penetrance score.
- the system comprises a standalone computing platform.
- the system comprises a cloud computing platform.
- the multiomics data comprises data from one or more of a genome, a transcriptome, a proteome, a metabolome, a lipidome, or an epigenome (such as a methylome). In some instances, the multiomics data comprises data from two or more of a genome, a transcriptome, a proteome, a metabolome, a lipidome, or an epigenome. In some instances, the multiomics data comprises data from a genome, a transcriptome, a proteome, a metabolome, a lipidome, or an epigenome.
- the multiomics data comprises data obtained from processes which analyze one or more of a genome, a transcriptome, a proteome, a metabolome, a lipidome, or an epigenome.
- multiomics data is obtained from a sample described herein.
- multiomics data is obtained from a single cell.
- multiomics data is obtained from a single cell from a tissue.
- systems described herein analyze multiomics data from single cells in a tissue.
- one or more measurements are selected from copy number variation, translocation, mutation burden, methylation at CpG sites, gene activation, gene repression, expressed genes, gene fusions, splice variants, translation level, phosphorylation state, and protein modification.
- two or more measurements are selected from copy number variation, translocation, mutation burden, methylation at CpG sites, gene activation, gene repression, expressed genes, gene fusions, splice variants, translation level, phosphorylation state, and protein modification.
- four or more measurements are selected from copy number variation, translocation, mutation burden, methylation at CpG sites, gene activation, gene repression, expressed genes, gene fusions, splice variants, translation level, phosphorylation state, and protein modification.
- eight or more measurements are selected from copy number variation, translocation, mutation burden, methylation at CpG sites, gene activation, gene repression, expressed genes, gene fusions, splice variants, translation level, phosphorylation state, and protein modification.
- Penetrance scores may be measured from one or more changes to measurements obtained from multiomics data.
- a change is established against a reference sequence.
- the reference sequence is obtained from a healthy or non-disease control sample.
- a reference sequence is obtained from bulk measurements of a sample population.
- a change comprises one or more of a genome DNA change, a transcript change, and a proteome change.
- a change comprises one or more of a genomic SNV*X (single nucleotide change), genomic CNV*X (copy number variation change), genomic translocation, genomic INDEL, genomic frameshift, genomic stop codon, genomic mitochondrial, genomic promoter/enhancer, genomic TCR/BCR, transcript expression, transcript splice variant, transcript fusion, transcript lncRNA, transcript miRNA, transcript TCR/BCR, transcript promoter, transcript truncated gene, transcript mitochondrial, transcript mutation, over/under expressed protein, truncated protein, surface bound protein, frameshift protein, misfolded protein, metabolic protein, protein ligand independence, protein conformation, protein activity change, and fused protein.
- a change comprises two or more of a genomic SNV*X, genomic CNV*X, genomic translocation, genomic INDEL, genomic frameshift, genomic stop codon, genomic mitochondrial, genomic promoter/enhancer, genomic TCR/BCR, transcript expression, transcript splice variant, transcript fusion, transcript lncRNA, transcript miRNA, transcript TCR/BCR, transcript promoter, transcript truncated gene, transcript mitochondrial, transcript mutation, over/under expressed protein, truncated protein, surface bound protein, frameshift protein, misfolded protein, metabolic protein, protein ligand independence, protein conformation, protein activity change, and fused protein.
- a change comprises five or more of a genomic SNV*X, genomic CNV*X, genomic translocation, genomic INDEL, genomic frameshift, genomic stop codon, genomic mitochondrial, genomic promoter/enhancer, genomic TCR/BCR, transcript expression, transcript splice variant, transcript fusion, transcript lncRNA, transcript miRNA, transcript TCR/BCR, transcript promoter, transcript truncated gene, transcript mitochondrial, transcript mutation, over/under expressed protein, truncated protein, surface bound protein, frameshift protein, misfolded protein, metabolic protein, protein ligand independence, protein conformation, protein activity change, and fused protein.
- a measurement change is used to determine a mechanism.
- a mechanism comprises a determinant of cell fate.
- a cell fate is shown in FIG. 7.
- a penetrance score may be represented in different ways.
- a penetrance score comprises a numerical value.
- a penetrance score is categorical.
- a numerical value is used to determine a categorical value.
- categorical values comprise high or low.
- Biological inquiries may be used to interrogate changes in measurements obtained from multiomics data.
- methods described herein perform one or more biological inquiries.
- a biological inquiry comprises throughput number of cells processed, throughput number of cells recovered, throughput sequencing, DNA mutation - SNV, DNA copy number variation, RNA - 3’ gene expression, RNA - genes analyzed/detected, RNA - low level genes detected, RNA - mitochondrial gene expression, protein - translation panel, RNA - chromatin panel, RNA - chromatin state, RNA - BCR/TCR, and RNA - full transcript gene.
- An example of a workflow comprising biological inquiries for both mammalian and bacterial samples is shown in FIG. 9.
- systems and methods described herein comprise obtaining a sample, and performing one or more methods comprising biological inquiries.
- obtaining cells comprises one or more of FACS sorting, microfluidics, spatial cell selection, and ultra-high throughput methods.
- methods comprise simultaneous genome/transcriptome analysis to prepare libraries (e.g., using PTA).
- libraries are then sequenced to obtain multiomics data.
- methods of target validation. In some instances, a target is associated with a disease state or condition. In some instances, a target validation workflow comprises one or more steps of FIG. 6.
- a workflow for validating a target comprises one or more of obtaining a sample, storing a sample, performing one or more multiomic methods on the sample to generate multiomics data, using a computation engine to process the data, and validating a target.
- the sample comprises cells from a tissue.
- the sample comprises cells from a frozen tissue.
- the sample comprises a section of tissue.
- cells are collected and then banked. In some instances, no more than 5000, 4000, 3000, 2000, 1000, 900, 800, 700, 600, 500, 400, 300, 200, 100, 50, 25, or no more than 10 cells are banked.
- Multiomic methods comprise methods described herein such as those which analyze and provide data on any one of the genome, methylome, transcriptome, and proteome.
- target validation is associated with targets related to immunology, cancer genomics, neurology, PGT, microbiome, toxicology, bioprocessing, or cardiology.
- sites of methylated DNA are detected using enzymatic methods.
- sites of methylated DNA are detected using non-enzymatic methods.
- these methods further comprise parallel analysis of the transcriptome and/or proteome of the same cell.
- Methods of detecting methylated genomic bases include selective restriction with methylation-sensitive endonucleases, followed by processing with the PTA method. Sites cut by such enzymes are determined from sequencing, and methylated bases are identified.
- libraries are amplified with methylation-specific primers which selectively anneal to methylated sequences.
- bisulfite treatment of genomic DNA libraries is used to detect a methylation signature.
- Bisulfite conversion of DNA results in conversion of unmodified cytosine (C) to uracil (U) that will be read as thymine (T) upon sequencing of PCR amplified DNA.
- Both 5meC and 5hmC are protected against conversion and will not be converted to U. Therefore, they will both be read as C upon sequencing.
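The bisulfite conversion rule described above can be illustrated with a short simulation: every unmodified C is read as T, while a C at a protected position (5meC or 5hmC) is still read as C. The function name and the use of 0-based positions are assumptions for this sketch, and only a single strand is modeled.

```python
def bisulfite_read(sequence, protected_positions=frozenset()):
    """Simulate the sequenced read of a bisulfite-treated, PCR-amplified strand.

    `protected_positions` holds 0-based indices of cytosines carrying 5meC or
    5hmC; these resist conversion. Every other C is converted to U and is
    therefore read as T after amplification.
    """
    out = []
    for i, base in enumerate(sequence.upper()):
        if base == "C" and i not in protected_positions:
            out.append("T")
        else:
            out.append(base)
    return "".join(out)
```

Note that the read cannot distinguish 5meC from 5hmC, since both appear as C.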
- non-methylation-specific PCR is conducted, followed by one or more methods to discriminate between bisulfite-reacted bases, including direct pyrosequencing, MS-SnuPE, HRM, COBRA, MS-SSCA, or base-specific cleavage/MALDI-TOF.
- genomic DNA samples are split for parallel analysis of the genome (or an enriched portion thereof) and methylome analysis.
- analysis of the genome and methylome comprises enrichment of genomic fragments (e.g., exome, or other targets) or whole genome sequencing.
- the methylation signature is preserved during PTA.
- processing with the PTA method while preserving the methylation signature is used to create a reference library.
- methylation patterns are detected using the methods described herein to create a methylation-specific library.
- the methylation-specific library is compared to the reference library.
- the methylation-specific library and the reference library are prepared from the same cell.
- comparing the methylation-specific library to the reference library allows for identification of a methylation signature.
- the genomic DNA library is treated with bisulfite.
- the genomic library treated with bisulfite is amplified with the PTA method to produce a methylation-specific library.
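Comparing a bisulfite-derived methylation-specific library with a reference library from the same cell can be sketched at the level of a single pair of sequences. The example assumes the two sequences are the same strand, equal in length, and already aligned without gaps; these simplifications and the function name are not taken from the source.

```python
def call_methylation(reference, converted):
    """Classify each reference cytosine using the bisulfite-converted read.

    Returns {0-based position: call}. A C that is still read as C was
    protected (5meC or 5hmC); a C read as T was unmethylated.
    """
    if len(reference) != len(converted):
        raise ValueError("sequences must be aligned and of equal length")
    calls = {}
    for i, (ref_base, conv_base) in enumerate(zip(reference.upper(), converted.upper())):
        if ref_base != "C":
            continue
        if conv_base == "C":
            calls[i] = "methylated"
        elif conv_base == "T":
            calls[i] = "unmethylated"
        else:
            calls[i] = "ambiguous"  # e.g. a true mutation or a sequencing error
    return calls
```

Using a reference from the same cell matters here: a genuine C-to-T variant would otherwise be indistinguishable from an unmethylated cytosine.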
- the data obtained from single-cell analysis methods utilizing PTA described herein may be compiled into a database. Described herein are methods and systems of bioinformatic data integration. Data from the proteome, genome, transcriptome, methylome or other data is in some instances combined/integrated into a database and analyzed. Bioinformatic data integration methods and systems in some instances comprise one or more of protein detection (FACS and/or NGS), mRNA detection, and/or genome variance detection. In some instances, this data is correlated with a disease state or condition. In some instances, data from a plurality of single cells is compiled to describe properties of a larger cell population, such as cells from a specific sample, region, organism, or tissue.
- protein data is acquired from fluorescently labeled antibodies which selectively bind to proteins on a cell.
- a method of protein detection comprises grouping cells based on fluorescent markers and reporting sample location post-sorting.
- a method of protein detection comprises detecting sample barcodes, detecting protein barcodes, comparing to designed sequences, and grouping cells based on barcode and copy number.
- protein data is acquired from oligo barcoded antibodies which selectively bind to proteins on a cell. Such oligo barcodes covalently linked to the antibody are used as a reference to the specific antigen binding site for the detection of a particular antigen or translated protein.
- transcriptome data is acquired from sample and RNA specific barcodes.
- a method of mRNA detection comprises detecting sample and RNA specific barcodes, aligning to genome, aligning to RefSeq/Encode, reporting exon/intron/intergenic sequences, analyzing exon-exon junctions, grouping cells based on barcode and expression variance, and clustering analysis of variance and top variable genes.
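The selection of top variable genes mentioned above can be illustrated with a short sketch. It assumes expression values have already been summarized per gene and per cell; the function name `top_variable_genes`, the gene labels, and the use of population variance are illustrative choices, not details taken from the described method.

```python
from statistics import pvariance


def top_variable_genes(counts, n=2):
    """Return the n genes with the highest expression variance across cells.

    counts maps a gene name to a list of per-cell expression values.
    Ties are broken alphabetically so the result is deterministic.
    """
    variances = {gene: pvariance(values) for gene, values in counts.items()}
    ranked = sorted(variances, key=lambda gene: (-variances[gene], gene))
    return ranked[:n]


# Hypothetical expression values for four cells
counts = {
    "GENE_A": [5, 5, 5, 5],
    "GENE_B": [0, 10, 0, 10],
    "GENE_C": [1, 2, 3, 4],
}
top_genes = top_variable_genes(counts, n=2)
print(top_genes)
```

The resulting gene list could then feed a clustering step; which clustering method is used is left open here.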
- genomic data is acquired from sample and DNA specific barcodes.
- a method of genome variance detection comprises detecting sample and DNA specific barcodes, aligning to the genome, determining genome recovery and SNV mapping rate, filtering reads on exon-exon junctions, generating a variant call file (VCF), and clustering analysis of variance and top variable mutations.
- a mutation is a difference between an analyzed sequence (e.g., using the methods described herein) and a reference sequence.
- Reference sequences are in some instances obtained from other organisms, other individuals of the same or similar species, populations of organisms, or other areas of the same genome.
- mutations are identified on a plasmid or chromosome.
- a mutation is an SNV (single nucleotide variation), SNP (single nucleotide polymorphism), or CNV (copy number variation, or CNA/copy number aberration).
- a mutation is base substitution, insertion, or deletion.
- a mutation is a transition, transversion, nonsense mutation, silent mutation, synonymous or non-synonymous mutation, non-pathogenic mutation, missense mutation, or frameshift mutation (deletion or insertion).
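As an illustration of a mutation being a difference between an analyzed sequence and a reference sequence, the sketch below lists base substitutions between two aligned sequences and labels each as a transition or transversion. It assumes equal-length, gap-free DNA sequences; the function names and example sequences are hypothetical, and insertions, deletions, and coding consequences are out of scope.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref, alt):
    """Label a single-base substitution as 'transition' or 'transversion'."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid or ref == alt:
        raise ValueError("expected two different bases from A, C, G, T")
    # Purine<->purine or pyrimidine<->pyrimidine changes are transitions
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def find_substitutions(reference, sample):
    """Return (position, ref, alt, class) for each mismatch (0-based)."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, r, s, classify_substitution(r, s))
        for pos, (r, s) in enumerate(zip(reference.upper(), sample.upper()))
        if r != s
    ]


substitutions = find_substitutions("ACGTAC", "GCGTAA")
print(substitutions)
```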
- PTA results in higher detection sensitivity and/or lower rates of false positives for the detection of mutations when compared to methods such as in-silico prediction, ChIP-seq, GUIDE-seq, circle-seq, HTGTS (High-Throughput Genome-Wide Translocation Sequencing), IDLV (integration-deficient lentivirus), Digenome-seq, FISH (fluorescence in situ hybridization), or DISCOVER-seq.
- PTA Primary Template-Directed Amplification
- amplicons are preferentially generated from the primary template (“direct copies”) using a polymerase (e.g., a strand displacing polymerase). Consequently, errors are propagated at a lower rate from daughter amplicons during subsequent amplifications compared to MDA.
- the terminated amplification products can undergo direct ligation after removal of the terminators, allowing for the attachment of a cell barcode to the amplification primers so that products from all cells can be pooled after undergoing parallel amplification reactions.
- template nucleic acids are not bound to a solid support.
- direct copies of template nucleic acids are not bound to a solid support.
- one or more primers are not bound to a solid support.
- no primers are not bound to a solid support.
- a primer is attached to a first solid support, and a template nucleic acid is attached to a second solid support, wherein the first and the second solid supports are not the same.
- PTA is used to analyze single cells from a larger population of cells. In some instances, PTA is used to analyze more than one cell from a larger population of cells, or an entire population of cells.
- nucleic acid polymerases with strand displacement activity for amplification.
- such polymerases comprise strand displacement activity and low error rate.
- such polymerases comprise strand displacement activity and proofreading exonuclease activity, such as 3’->5’ proofreading activity.
- nucleic acid polymerases are used in conjunction with other components such as reversible or irreversible terminators, or additional strand displacement factors.
- the polymerase has strand displacement activity, but does not have exonuclease proofreading activity.
- such polymerases include bacteriophage phi29 (Φ29) polymerase, which also has a very low error rate that is the result of the 3’->5’ proofreading exonuclease activity (see, e.g., U.S. Pat. Nos. 5,198,543 and 5,001,050).
- examples of strand displacing nucleic acid polymerases include, e.g., genetically modified phi29 (Φ29) DNA polymerase, Klenow Fragment of DNA polymerase I (Jacobsen et al., Eur. J. Biochem.
- Bst DNA polymerase e.g., Bst large fragment DNA polymerase (Exo(-) Bst; Aliotta et al., Genet. Anal. (Netherlands) 12: 185-195 (1996)), exo(-)Bca DNA polymerase (Walker and Linn, Clinical Chemistry 42: 1604-1608 (1996)), Bsu DNA polymerase, VentR DNA polymerase including VentR (exo-) DNA polymerase (Kong et al., J. Biol. Chem.
- Deep Vent DNA polymerase including Deep Vent (exo-) DNA polymerase, IsoPol DNA polymerase, DNA polymerase I, Therminator DNA polymerase, T5 DNA polymerase (Chatterjee et al., Gene 97: 13-19 (1991)), Sequenase (U.S. Biochemicals), T7 DNA polymerase, T7-Sequenase, T7 gp5 DNA polymerase, PRD1 DNA polymerase, T4 DNA polymerase (Kaboord and Benkovic, Curr. Biol. 5: 149-157 (1995)). Additional strand displacing nucleic acid polymerases are also compatible with the methods described herein.
- the ability of a given polymerase to carry out strand displacement replication can be determined, for example, by using the polymerase in a strand displacement replication assay (e.g., as disclosed in U.S. Pat. No. 6,977,148). Such assays in some instances are performed at a temperature suitable for optimal activity for the enzyme being used, for example, 32°C for phi29 DNA polymerase, from 46°C to 64°C for exo(-) Bst DNA polymerase, or from about 60°C to 70°C for an enzyme from a hyperthermophilic organism.
- Another useful assay for selecting a polymerase is the primer-block assay described in Kong et al., J. Biol. Chem. 268: 1965-1975
- the assay consists of a primer extension assay using an M13 ssDNA template in the presence or absence of an oligonucleotide that is hybridized upstream of the extending primer to block its progress.
- Other enzymes capable of displacing the blocking primer in this assay are in some instances useful for the disclosed method.
- polymerases incorporate dNTPs and terminators at approximately equal rates.
- the ratio of rates of incorporation for dNTPs and terminators for a polymerase described herein is about 1:1, about 1.5:1, about 2:1, about 3:1, about 4:1, about 5:1, about 10:1, about 20:1, about 50:1, about 100:1, about 200:1, about 500:1, or about 1000:1.
- the ratio of rates of incorporation for dNTPs and terminators for a polymerase described herein is 1:1 to 1000:1, 2:1 to 500:1, 5:1 to 100:1, 10:1 to 1000:1, 100:1 to 1000:1, 500:1 to 2000:1, 50:1 to 1500:1, or 25:1 to 1000:1.
- strand displacement factors such as, e.g., helicase.
- additional amplification components such as polymerases, terminators, or other components.
- a strand displacement factor is used with a polymerase that does not have strand displacement activity.
- a strand displacement factor is used with a polymerase having strand displacement activity.
- strand displacement factors may increase the rate that smaller, double stranded amplicons are reprimed.
- any DNA polymerase that can perform strand displacement replication in the presence of a strand displacement factor is suitable for use in the PTA method, even if the DNA polymerase does not perform strand displacement replication in the absence of such a factor.
- Strand displacement factors useful in strand displacement replication in some instances include (but are not limited to) BMRF1 polymerase accessory subunit (Tsurumi et al., J. Virology 67(12):7648-7653 (1993)), adenovirus DNA-binding protein (Zijderveld and van der Vliet, J. Virology 68(2): 1158-1164
- herpes simplex viral protein ICP8 (Boehmer and Lehman, J. Virology 67(2):711-715 (1993); Skaliter and Lehman, Proc. Natl. Acad. Sci. USA 91(22):10665-10669 (1994)); single-stranded DNA binding proteins (SSB; Rigler and Romano, J. Biol. Chem. 270:8910-8919
- SSB single-stranded DNA binding protein
- RPA Replication Protein A
- mtSSB human mitochondrial SSB
- recombinases, e.g., Recombinase A (RecA) family proteins, T4 UvsX, T4 UvsY, Sak4 of Phage HK620, Rad51, Dmc1, or RadB.
- a helicase is used in conjunction with a polymerase.
- the PTA method comprises use of a single-strand DNA binding protein (SSB, T4 gp32, or other single stranded DNA binding protein), a helicase, and a polymerase (e.g., SauDNA polymerase, Bsu polymerase, Bst2.0, GspM, GspM2.0, GspSSD, or other suitable polymerase).
- reverse transcriptases are used in conjunction with the strand displacement factors described herein.
- amplification is conducted using a polymerase and a nicking enzyme (e.g., “NEAR”), such as those described in US 9,617,586.
- the nicking enzyme is Nt.BspQI, Nb.BbvCI, Nb.BsmI, Nb.BsrDI, Nb.BtsI, Nt.AlwI, Nt.BbvCI, Nt.BstNBI, Nt.CviPII, Nb.Bpu10I, or Nt.Bpu10I.
- amplification methods comprising use of terminator nucleotides, polymerases, and additional factors or conditions.
- factors are used in some instances to fragment the nucleic acid template(s) or amplicons during amplification.
- factors comprise endonucleases.
- factors comprise transposases.
- mechanical shearing is used to fragment nucleic acids during amplification.
- nucleotides are added during amplification that may be fragmented through the addition of additional proteins or conditions. For example, uracil is incorporated into amplicons; treatment with uracil D-glycosylase fragments nucleic acids at uracil-containing positions.
- amplification methods comprising use of terminator nucleotides, which terminate nucleic acid replication thus decreasing the size of the amplification products.
- terminator nucleotides are in some instances used in conjunction with polymerases, strand displacement factors, or other amplification components described herein.
- terminator nucleotides reduce or lower the efficiency of nucleic acid replication.
- Such terminators in some instances reduce extension rates by at least 99.9%, 99%, 98%, 95%, 90%, 85%, 80%, 75%, 70%, or at least 65%.
- Such terminators reduce extension rates by 50%-90%, 60%-80%, 65%-90%, 70%-85%, 60%-90%, 70%-99%, 80%-99%, or 50%-80%.
- terminators reduce the average amplicon product length by at least 99.9%, 99%, 98%, 95%, 90%, 85%, 80%, 75%, 70%, or at least 65%. Terminators in some instances reduce the average amplicon length by 50%-90%, 60%-80%, 65%-90%, 70%-85%, 60%-90%, 70%-99%, 80%-99%, or 50%-80%. In some instances, amplicons comprising terminator nucleotides form loops or hairpins which reduce a polymerase's ability to use such amplicons as templates.
- the use of terminators slows the rate of amplification at initial amplification sites through the incorporation of terminator nucleotides (e.g., dideoxynucleotides that have been modified to make them exonuclease-resistant to terminate DNA extension), resulting in smaller amplification products.
- PTA amplification products undergo direct ligation of adapters without the need for fragmentation, allowing for efficient incorporation of cell barcodes and unique molecular identifiers (UMI).
- UMI unique molecular identifiers
- Terminator nucleotides are present at various concentrations depending on factors such as polymerase, template, or other factors. For example, the amount of terminator nucleotides in some instances is expressed as a ratio of non-terminator nucleotides to terminator nucleotides in a method described herein. Such concentrations in some instances allow control of amplicon lengths. In some instances, the ratio of terminator to non-terminator nucleotides is modified for the amount of template present or the size of the template. In some instances, the ratio of terminator to non-terminator nucleotides is reduced for smaller sample sizes (e.g., femtogram to picogram range).
- the ratio of non-terminator to terminator nucleotides is about 2:1, 5:1, 7:1, 10:1, 20:1, 50:1, 100:1, 200:1, 500:1, 1000:1, 2000:1, or 5000:1. In some instances the ratio of non-terminator to terminator nucleotides is 2:1-10:1, 5:1-20:1, 10:1-100:1, 20:1-200:1, 50:1-1000:1, 50:1-500:1, 75:1-150:1, or 100:1-500:1. In some instances, at least one of the nucleotides present during amplification using a method described herein is a terminator nucleotide.
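How the non-terminator to terminator ratio might control amplicon length can be illustrated with a deliberately simplified model. The sketch assumes that the polymerase incorporates dNTPs and terminators at equal rates, that every incorporation is independent, and that extension stops only when a terminator is incorporated, so amplicon length follows a geometric distribution. These assumptions and the function names are illustrative only; real reactions also depend on processivity, exonuclease activity, and template effects.

```python
def termination_probability(ratio):
    """Chance that one incorporation event is a terminator.

    ratio is the non-terminator:terminator ratio expressed as ratio:1,
    assuming equal incorporation rates for both nucleotide types.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return 1.0 / (ratio + 1.0)


def expected_amplicon_length(ratio):
    """Mean of the geometric length distribution, i.e. ratio + 1 bases."""
    return 1.0 / termination_probability(ratio)


def fraction_terminated_by(length, ratio):
    """Fraction of amplicons terminated at or before `length` bases."""
    p = termination_probability(ratio)
    return 1.0 - (1.0 - p) ** length


for ratio in (10, 100, 1000):
    mean_len = expected_amplicon_length(ratio)
    print(f"{ratio}:1 -> mean length ~{mean_len:.0f} bases")
```

Under this toy model, increasing the ratio from 10:1 to 1000:1 lengthens the expected product roughly a hundredfold, which is consistent with the statement that such concentrations allow control of amplicon lengths.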
- each terminator need not be present at approximately the same concentration; in some instances, ratios of each terminator present in a method described herein are optimized for a particular set of reaction conditions, sample type, or polymerase.
- each terminator may possess a different efficiency for incorporation into the growing polynucleotide chain of an amplicon, in response to pairing with the corresponding nucleotide on the template strand.
- a terminator pairing with cytosine is present at about 3%, 5%, 10%, 15%, 20%, 25%, or 50% higher concentration than the average terminator concentration.
- a terminator pairing with thymine is present at about 3%, 5%, 10%, 15%, 20%, 25%, or 50% higher concentration than the average terminator concentration.
- a terminator pairing with guanine is present at about 3%, 5%, 10%, 15%, 20%, 25%, or 50% higher concentration than the average terminator concentration. In some instances, a terminator pairing with adenine is present at about 3%, 5%, 10%, 15%, 20%, 25%, or 50% higher concentration than the average terminator concentration. In some instances, a terminator pairing with uracil is present at about 3%, 5%, 10%, 15%, 20%, 25%, or 50% higher concentration than the average terminator concentration. Any nucleotide capable of terminating nucleic acid extension by a nucleic acid polymerase in some instances is used as a terminator nucleotide in the methods described herein.
- a reversible terminator is used to terminate nucleic acid replication.
- a non-reversible terminator is used to terminate nucleic acid replication.
- non-limiting examples of terminators include reversible and non-reversible nucleic acids and nucleic acid analogs, such as, e.g., 3’ blocked reversible terminator comprising nucleotides, 3’ unblocked reversible terminator comprising nucleotides, terminators comprising 2’ modifications of deoxynucleotides, terminators comprising modifications to the nitrogenous base of deoxynucleotides, or any combination thereof.
- terminator nucleotides are dideoxynucleotides.
- nucleotide modifications that terminate nucleic acid replication and may be suitable for practicing the invention include, without limitation, any modifications of the R group of the 3’ carbon of the deoxyribose such as inverted dideoxynucleotides, 3' biotinylated nucleotides, 3' amino nucleotides, 3'-phosphorylated nucleotides, 3'-O-methyl nucleotides, 3' carbon spacer nucleotides including 3' C3 spacer nucleotides, 3' C18 nucleotides, 3' Hexanediol spacer nucleotides, acyclonucleotides, and combinations thereof.
- terminators are polynucleotides comprising 1, 2, 3, 4, or more bases in length.
- terminators do not comprise a detectable moiety or tag (e.g., mass tag, fluorescent tag, dye, radioactive atom, or other detectable moiety).
- terminators do not comprise a chemical moiety allowing for attachment of a detectable moiety or tag (e.g., “click” azide/alkyne, conjugate addition partner, or other chemical handle for attachment of a tag).
- all terminator nucleotides comprise the same modification that reduces amplification at a region (e.g., the sugar moiety, base moiety, or phosphate moiety) of the nucleotide.
- At least one terminator has a different modification that reduces amplification.
- all terminators have substantially similar fluorescent excitation or emission wavelengths.
- terminators without modification to the phosphate group are used with polymerases that do not have exonuclease proofreading activity. Terminators, when used with polymerases which have 3’->5’ proofreading exonuclease activity (such as, e.g., phi29) that can remove the terminator nucleotide, are in some instances further modified to make them exonuclease-resistant.
- dideoxynucleotides are modified with an alpha-thio group that creates a phosphorothioate linkage which makes these nucleotides resistant to the 3’->5’ proofreading exonuclease activity of nucleic acid polymerases.
- Such modifications in some instances reduce the exonuclease proofreading activity of polymerases by at least 99.5%, 99%, 98%, 95%, 90%, or at least 85%.
- examples of other terminator nucleotide modifications providing resistance to the 3’->5’ exonuclease activity include in some instances: nucleotides with modification to the alpha group, such as alpha-thio dideoxynucleotides creating a phosphorothioate bond, C3 spacer nucleotides, locked nucleic acids (LNA), inverted nucleic acids, 2' Fluoro bases, 3' phosphorylation, 2'-O-Methyl modifications (or other 2’-O-alkyl modification), propyne-modified bases (e.g., deoxycytosine, deoxyuridine), L-DNA nucleotides, L-RNA nucleotides, nucleotides with inverted linkages (e.g., 5’-5’ or 3’-3’), 5’ inverted bases (e.g., 5’ inverted 2’,3’-dideoxy dT), methylphosphonate backbones, and trans nucleic acids.
- LNA locked nucleic acids
- nucleotides with modification include base-modified nucleic acids comprising free 3’ OH groups (e.g., 2-nitrobenzyl alkylated HOMedU triphosphates, bases comprising modification with large chemical groups, such as solid supports or other large moiety).
- a polymerase with strand displacement activity but without 3'->5' exonuclease proofreading activity is used with terminator nucleotides with or without modifications to make them exonuclease resistant.
- nucleic acid polymerases include, without limitation, Bst DNA polymerase, Bsu DNA polymerase, Deep Vent (exo-) DNA polymerase, Klenow Fragment (exo-) DNA polymerase, Therminator DNA polymerase, and VentR(exo-).
- Described herein are computer-implemented systems for visualization of biological data.
- the data comprises genomic, transcriptomic, proteomic, methylation and epigenomic data.
- computer-implemented systems comprising one or more modules.
- computer-implemented systems comprising at least one memory storing computer-executable instructions; and at least one processor configured to access the at least one memory and execute the computer-executable instructions, wherein the computer-executable instructions comprise one or more of a frontend, a backend, and a pipeline module.
- an exemplary arrangement of modules is shown in FIG. 10.
- modules are accessed from a cloud-based database or interface.
- Methods and systems described herein in some instances comprise one or more steps of accessing a web-based software application; providing or otherwise linking an input file (such as a file comprising whole genomes sequencing, RNA, or other biological information); processing the file; applying one or more filters or annotations to the data in the file; querying one or more databases; and displaying a visualization of the filtered and/or annotated data.
- the systems and methods described herein may comprise a frontend module.
- the frontend module comprises a Vue.js application that provides the user interface and visualizations for the systems and methods described herein.
- the frontend makes requests to the backend to query data.
- a frontend comprises computer-executable instructions for one or more of: displaying complex visualizations such as the circos plot, phylogenetic tree, etc. (e.g., as navigable tabs); displaying quality metrics; visualizing filters and filtering interactions; and presenting data tables for cell information.
- a web version of IGV is integrated into the frontend.
- the systems and methods described herein may comprise a backend module.
- the backend comprises a Flask framework application and provides one or more backend features for the methods and systems described herein.
- the backend is written in Python.
- a backend comprises computer-executable instructions for one or more of: user authentication and registration; data computations and filtering; access of a Vaex open-source library for speeding up data interactions; interacting with a database and HDF5 files to process data requests; presenting and encoding data for visualizations; and presenting data for IGV.
- the systems and methods described herein may comprise a pipeline module.
- the pipeline comprises a computationally intensive workflow that runs genomics analysis tools to extract signatures of biomarkers from sequencing files and loads them into a database.
- the methods and systems described herein comprise one or more pipeline modules.
- pipeline modules comprise multi-omics, such as WGS/exome, methylation, proteome, proteome bacterial, or RNA-seq/transcriptome.
- a pipeline comprises one or more sub-modules.
- a pipeline comprises one or more data files.
- a pipeline comprises one or more of sequencing input files, sub-pipeline modules, and summary files.
- Pipelines may be configured for whole genome or exome sequencing data.
- a WGS/exome pipeline is configured to input one or more fastQ files.
- a WGS/exome pipeline comprises one or more of alignment, haplotype caller, joint genotyping, heterozygous site detector (a pipeline used for the analysis of cell lines without a priori knowledge of reference heterozygous variant sites), statistics, ADO, and CNV sub-pipelines, which are needed to derive insights from sequencing data.
- the files contain sequence(ing) information/data.
- files comprise sequence data from the clusters that pass filter on a flow cell.
- the files comprise FastQ files.
- the database comprises a PostgreSQL database.
- the databases are accessed from a backend module. Computer-executable instructions in some instances comprise one or more of: accepting a sequencing information file as input (e.g., FastQ); running joint genotyping to produce a VCF file; and linking variants to COSMIC, ClinVar, or another variant list.
- a VCF file contains the variants called from multiple samples (cells) all together and represents high confidence variants distributed across the cells. These variants in some instances represent changes in nucleotides observed in a cell in relation to the reference genome. In some instances, these variants are placed along the genome using genomic coordinates (e.g., chr1 base 18903). Such a configuration having a specific location for a variant allows in some instances association of information compiled in databases to this given variant.
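The association of database information with a variant through its genomic coordinate can be sketched as below. The example reads the first five tab-separated columns of VCF data lines (CHROM, POS, ID, REF, ALT) and looks each variant up in an in-memory dictionary. The dictionary `annotation_db` and its entry are hypothetical stand-ins for a resource such as ClinVar or COSMIC, and the parser is a minimal illustration rather than a full VCF implementation.

```python
def parse_vcf_lines(lines):
    """Yield (chrom, pos, ref, alt) keys from VCF text lines.

    Header lines starting with '#' are skipped and multi-allelic ALT
    fields are split into one key per alternative allele.
    """
    variants = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _variant_id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):
            variants.append((chrom, int(pos), ref, allele))
    return variants


def annotate(variants, annotation_db):
    """Attach an annotation to each variant key, if one is known."""
    return {key: annotation_db.get(key, "unannotated") for key in variants}


vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t18903\t.\tA\tG\t50\tPASS\t.",
    "chr2\t500\t.\tC\tT,G\t40\tPASS\t.",
]
# Hypothetical annotation table keyed by genomic coordinate and alleles
annotation_db = {("chr1", 18903, "A", "G"): "example database entry"}

annotated = annotate(parse_vcf_lines(vcf_lines), annotation_db)
for key, note in annotated.items():
    print(key, note)
```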
- Pipelines may be configured for multi-omics analysis.
- multi-omics comprises two or more types of biological information.
- multi-omics comprises two or more of transcript (transcriptome), genomic, proteomic, methylome, or other form of sample analysis.
- methods described herein display and/or process multi-omics data. Data in some instances is obtained from a single cell. Data in other instances is obtained by evaluation of a population of cells.
- methods described herein display transcript and genomic data.
- methods described herein utilize transcript, genomic data, and proteomics data.
- methods described herein utilize transcript, genomic data, and methylome data.
- an alignment pipeline comprises one or more of a compressed alignment file describing the alignment information of the reads in the project against a given reference (e.g., hg38), such as a .bam file, and an index file of the alignment file.
- a haplotype caller pipeline comprises one or more of a genomic variant call format (GVCF) file containing the detected variants for a given sample, and an indexer file associated with the GVCF file.
- GVCF genomic variant call format
- a joint-genotyping pipeline comprises one or more of a genomic variant call format (GVCF) file containing the joint variant calling of multiple samples, and an indexer file associated with the joint-genotyped GVCF file.
- a heterozygous site detector pipeline comprises one or more of a genomic variant call format (GVCF) file containing the called variants with a high degree of prevalence across a dataset and high confidence; and an indexer file associated with the GVCF file.
- a statistics pipeline comprises one or more of a tabulator-separated value table describing whole genome sequence (WGS) level statistics estimated from the aligned reads (e.g., 1X, 5X, 10X coverage, etc.); and a tabulator-separated value table showing exome-panel specific statistics (e.g., on-target, off-target, and near-target events).
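The 1X, 5X, and 10X coverage statistics can be illustrated as the fraction of positions whose read depth meets each threshold. The sketch below assumes a list of per-base depths is already available; the function names, the sample label `cell_01`, and the depth values are hypothetical, and a production pipeline would derive depths from the alignment file instead.

```python
def coverage_breadth(depths, thresholds=(1, 5, 10)):
    """Fraction of positions covered at or above each depth threshold."""
    if not depths:
        raise ValueError("no positions supplied")
    total = len(depths)
    return {
        f"{t}X": sum(1 for depth in depths if depth >= t) / total
        for t in thresholds
    }


def to_tsv_row(sample, breadth):
    """Format one sample's statistics as a tab-separated row."""
    return "\t".join([sample] + [f"{value:.3f}" for value in breadth.values()])


# Hypothetical per-base read depths for a short region
depths = [0, 1, 3, 5, 8, 10, 12, 20]
breadth = coverage_breadth(depths)

print("sample\t" + "\t".join(breadth))
print(to_tsv_row("cell_01", breadth))
```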
- WGS whole genome sequence
- an ADO pipeline comprises one or more of a tabulator-separated value table showing allele frequencies of N number of queried heterozygous sites. This table is in some instances used to estimate WGS allele balance.
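One way allele frequencies at queried heterozygous sites might be turned into an allele balance or dropout estimate is sketched here. It assumes each site is known to be heterozygous and is described by its reference and alternate read counts; a site is counted as dropped out when only one allele is observed. This criterion, the function names, and the counts are assumptions made for illustration and are not taken from the described pipeline.

```python
def allele_fractions(sites):
    """Alternate-allele fraction per covered site.

    sites is a list of (ref_count, alt_count) tuples at known
    heterozygous positions; uncovered sites are skipped.
    """
    return [alt / (ref + alt) for ref, alt in sites if ref + alt > 0]


def allelic_dropout_rate(sites):
    """Share of covered heterozygous sites where one allele is missing."""
    covered = [(ref, alt) for ref, alt in sites if ref + alt > 0]
    if not covered:
        raise ValueError("no covered sites")
    dropped = sum(1 for ref, alt in covered if ref == 0 or alt == 0)
    return dropped / len(covered)


# Hypothetical read counts at five heterozygous sites (last one uncovered)
sites = [(10, 10), (20, 0), (5, 15), (0, 8), (0, 0)]
fractions = allele_fractions(sites)
ado_rate = allelic_dropout_rate(sites)
print(fractions, ado_rate)
```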
- a CNV pipeline comprises one or more of a tabulator-separated value table describing, for a sample, the estimated copy number for bins of size N across the whole genome; and a tabulator-separated value table describing, for a sample, the type of event (insertion, deletion) for all bins of size N across the genome.
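A per-bin copy number table of the kind described can be approximated with a simple read-depth calculation. The sketch assumes equal-sized bins, a diploid baseline taken as the median bin count, and no correction for GC content or mappability. The labels "gain" and "loss" are placeholders for the event types such a table would record, and all names and counts are hypothetical.

```python
from statistics import median


def estimate_copy_number(bin_counts, ploidy=2):
    """Estimate integer copy number per bin from read counts.

    Counts are scaled so that the median bin corresponds to `ploidy`.
    """
    baseline = median(bin_counts)
    if baseline <= 0:
        raise ValueError("median bin count must be positive")
    return [round(ploidy * count / baseline) for count in bin_counts]


def label_events(copy_numbers, ploidy=2):
    """Label each bin relative to the expected ploidy."""
    labels = []
    for cn in copy_numbers:
        if cn > ploidy:
            labels.append("gain")
        elif cn < ploidy:
            labels.append("loss")
        else:
            labels.append("neutral")
    return labels


# Hypothetical read counts for six consecutive bins of size N
bin_counts = [100, 98, 150, 102, 50, 100]
copy_numbers = estimate_copy_number(bin_counts)
events = label_events(copy_numbers)
print(copy_numbers)
print(events)
```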
- Pipelines may be configured for bacterial sequencing data.
- a bacterial pipeline is configured to input a fastQ file.
- a bacterial pipeline comprises one or more of: compressed FASTQ files containing trimmed and filtered high quality sequences; a tabulator-separated value table describing taxonomic assignation of each read to a given species using a database (such as Kraken’s database); a fasta file describing the genome assembly, at the level of contigs, constructed from the reads in the dataset; a fasta file describing the genome assembly, at the level of scaffolds, constructed from the reads in the dataset; and a BAM file describing the alignments of the reads in reference to the assembled genome (e.g., contigs).
- a bacterial pipeline comprises one or more summary files.
- summary files comprise one or more of: a tabulator-separated value table describing the taxonomic assignment of contigs in an assembly based on the proportion of reads mapped to them; and a tabulator-separated value table showing the estimated completeness of a given assembly based on a set of phylogenetic marker genes.
- Pipelines may be configured for RNA-seq data.
- an RNA-seq pipeline is configured to accept one or more of a compressed alignment file describing the alignment information of the reads in the project against a given reference (e.g., hg38); an index file of the compressed alignment file; a compressed alignment file describing the alignment information of the reads in the project against a RNA-Seq specific index for a given reference and an index file for the alignment file.
- an RNA-seq pipeline comprises one or more summary files.
- summary files comprise one or more of a tabulator-separated table describing the matrix of counts of the genomic features (e.g., exons in a gene) across samples; a tabulator-separated table describing the number of unique splice-junction overlaps; a tabulator-separated table describing overall alignment metrics (e.g., number of genes with counts, etc.); and a tabulator-separated table showing the estimated ratio of exon to non-exon alignment events.
- Systems and methods described herein may comprise filters for visualizing data.
- filters comprise one or more of: Germline mutation, Somatic mutation, Copy number variation, Single nucleotide variation, Insertions and deletions, Tumor Mutation Burden (TMB) Analysis, Catalogue of Somatic Mutations in Cancer (COSMIC), ClinVar, and Predicted Coding Change.
- TMB Tumor Mutation Burden
- FIG. 1 Further described herein are computer-implemented systems comprising: at least one memory storing computer-executable instructions; and at least one processor configured to access the at least one memory and execute the computer-executable instructions to perform: receiving a query, wherein the query comprises genomic data from one or more samples; querying a database; wherein the database comprises a plurality of genomic data and a plurality of phenotype data; generating, using at least the genomic data, a genome summary, the genome summary comprising genes and gene variants of the cohort; determining a graphical representation of the genome summary; and sending the graphical representation to a display device.
- GUI graphical user interface
- a GUI comprises a project browser or dashboard.
- a GUI comprises drop down menus for one or more of project, owner, analysis type, and status.
- a GUI comprises a list of previous and current projects.
- projects and data are shared among a group of users.
- projects are saved for future modification or access.
- the GUI is facilitated by a frontend.
- Computer-implemented systems may comprise a genome browser.
- a genome browser is configured to display sections of a genome and/or variants.
- a genome browser comprises an IGV (Integrative Genomics Viewer).
- IGV Integrative Genomics Viewer
- the bin size is selectable from the entire genome down to the individual base.
- individual mutations in some instances are viewed to determine the alternative allele or base change.
- each mutation is selectable, further detailing the nature of the modification and presenting it to the user.
- Computer-implemented systems may comprise an interface for annotating variants. This is an important step to empower interpretation of downstream coding changes in protein structure and function.
- Variant information in some instances comprises one or more of features (name, gene id, gene type, strand, Tdl, HgncId), predictions (SIFT/sorting intolerant from tolerant, LRT/likelihood ratio test, FATHMM, PROVEAN/protein variation effect analyzer, MetaSVM, MetaLR), conservation among species (e.g., vertebrates, mammals, etc.), evidence (pathology-related data from databases such as COSMIC), and biological population.
- a variant annotation interface assesses the degree of conservation among (100) vertebrates and (30) mammals.
- this display is helpful in the investigation of de-novo variant alleles which are not annotated by ClinVar, Cosmic, Genecards or Ensembl.
- the comparison allows the determination of conservation of alleles found in the sample compared to the same allele found in an alternative species. Conserved alleles, where the conservation is high, are shifted right, whereas alleles which have low conservation are shifted left. As an example, in the Phylo 30-way mammal plot the allele is highly conserved across all 30 mammals, indicating the gene is highly conserved and likely to be important for the health of all mammalian species.
- Variants in some instances are annotated as one or more of Germline mutation, Somatic mutation, Copy number variation (CNV), Single nucleotide variation (SNV), Insertions and deletions, Catalogue of Somatic Mutations in Cancer (COSMIC), ClinVar, and Predicted Coding Change. Additional resources are also accessed in some instances, such as GeneCards, Ensembl, ClinVar and COSMIC. In some instances, variants comprise complex markers such as those obtained using Tumor Mutation Burden (TMB) Analysis.
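Tumor mutation burden is commonly reported as the number of somatic mutations per megabase of sequenced region. The sketch below shows that arithmetic only; the source does not state how its TMB analysis is computed, so the function name, the inputs, and the example numbers are assumptions.

```python
def tumor_mutation_burden(somatic_mutation_count, covered_bases):
    """Return somatic mutations per megabase of covered sequence.

    Assumption: TMB = mutation count / (covered bases / 1,000,000).
    Which mutations are counted and which regions are treated as
    covered are choices the source does not specify.
    """
    if covered_bases <= 0:
        raise ValueError("covered_bases must be positive")
    return somatic_mutation_count / (covered_bases / 1_000_000)


# Hypothetical example: 300 somatic mutations over 30 Mb of covered sequence.
tmb = tumor_mutation_burden(300, 30_000_000)
```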
- Computer-implemented systems may comprise an interface for tracing variant lineages.
- lineages comprise somatic, ancestral, or reference lineages.
- Lineage trees in some instances are generated from specific chromosomes, and graphically display variants in a chart format.
- Computer-implemented systems may comprise an interface for analyzing cells.
- samples comprise one or more cells.
- Cells in some instances are searched, or summary information about each cell is displayed, such as cell name and variants detected (somatic, germline, SNPs, and indels).
- metrics high, medium, and low are used to describe confidence of variant calls for each cell.
- inter-cell distances are graphed.
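One way to obtain inter-cell distances for graphing is to compare the sets of variants detected in each cell. The sketch below uses the Jaccard distance as an assumed metric; the source does not name the distance measure, and the cell names and variant labels are hypothetical.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b| for two sets of variant labels."""
    union = a | b
    if not union:
        return 0.0  # two cells with no variants are treated as identical
    return 1.0 - len(a & b) / len(union)


def pairwise_distances(cell_variants):
    """Return {(cell_1, cell_2): distance} for every unordered pair of cells."""
    return {
        (c1, c2): jaccard_distance(cell_variants[c1], cell_variants[c2])
        for c1, c2 in combinations(sorted(cell_variants), 2)
    }


# Hypothetical variant calls per cell.
cells = {
    "cell_A": {"chr1:100A>T", "chr2:200G>C", "chr3:300C>T"},
    "cell_B": {"chr1:100A>T", "chr2:200G>C"},
    "cell_C": {"chr7:700T>G"},
}
distances = pairwise_distances(cells)
```

The resulting dictionary can be passed to any plotting or clustering routine; cells sharing most of their variants end up close together, while cells with no shared variants are at distance 1.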
- Computer-implemented systems may comprise an interface for visualizing sequencing metrics (e.g., Picard metrics). Metrics include but are not limited to chromosome M population, percent pass/fail reads aligned, WGS mean coverage, and WGS percent excluded duplicate reads. Each metric in some instances is also displayed on an individual per-cell basis.
- Computer-implemented systems may comprise an interface for visualizing genomic data.
- data may be visualized using a circos plot.
- Circos plots in some instances comprise additional variant information, such as number of somatic, germline, SNP or indel variants.
- a circos plot comprises a lineage tree.
- a user interface is configured to apply one or more filters to the circos plot.
- two or more groups of cells or samples are compared (optionally filtered by number of variants).
- views of one or more chromosomes are displayed or hidden.
- data from one or more cells is hidden or displayed.
- variant filters comprise one or more of variant type (SNP, indel), origin (somatic vs. germline), annotation (COSMIC, CLINVAR, coding change), or features.
- features comprise name, gene id, gene type, strand, Tdl, HgncId.
- variant filters comprise predictions (SIFT, FATHMM, PROVEAN, MetaSVM, and MetaLR). Upon selection of a region or chromosome within a cell's genome, a pop-up window is in some instances presented to the user which includes a genome viewing frame (e.g., IGV) plot.
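The variant filters listed above (variant type, origin, and annotation) can be sketched as a simple predicate over variant records. The `Variant` record, its field names, and the example gene names are hypothetical; the source describes the filter categories but not a data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    gene: str
    variant_type: str                    # "SNP" or "indel"
    origin: str                          # "somatic" or "germline"
    annotations: frozenset = frozenset()  # e.g. {"COSMIC", "CLINVAR"}


def filter_variants(variants, variant_type=None, origin=None, annotation=None):
    """Keep variants matching every filter that is not None."""
    selected = []
    for v in variants:
        if variant_type is not None and v.variant_type != variant_type:
            continue
        if origin is not None and v.origin != origin:
            continue
        if annotation is not None and annotation not in v.annotations:
            continue
        selected.append(v)
    return selected


# Hypothetical records.
variants = [
    Variant("GENE1", "SNP", "somatic", frozenset({"COSMIC"})),
    Variant("GENE2", "indel", "germline", frozenset({"CLINVAR"})),
    Variant("GENE3", "SNP", "germline"),
]
somatic_snps = filter_variants(variants, variant_type="SNP", origin="somatic")
clinvar_hits = filter_variants(variants, annotation="CLINVAR")
```

Leaving a filter as `None` disables it, so the same function serves both single-criterion and combined queries.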
- This window can be configured in terms of genome window bin size, allowing visualization from the entire chromosome down to the individual bases across that genome, which can be completed in a matter of seconds.
- the window size in some instances is scrollable by simply dragging the window left or right.
- each sample in some instances is interrogated to determine, for example, the specific change which is highlighted by a color change from the parental allele.
- the alternative allele is selected to determine the base change, while the parent allele can be detected to determine pathogenic risk score based on several public algorithms as well as the conservation of the allele across several vertebrate and mammalian species.
- This variant annotation further provides links to several databases to provide greater detail of the impact of the genomic alteration.
- the systems and methods described herein may provide a visualization of genomic and multiomic data having a large number of data sets.
- the genomic data comprises at least 1, 2, 5, 10, 15, 20, 25, 30, 40, 50, 75, 100, 125, 150, 200, 250, 300, 400, 500, 600, 750, 1000, or at least 1500 data sets.
- the genomic data comprises 1-1000, 5-1000, 10-1000, 5-10,000, 100-10,000, 100-1000, 10-500, 10-750, 50-750, or 50-500 data sets.
- each sample data set corresponds to a single cell.
- each data set comprises at least 500, 1000, 2000, 5000, 10,000, 50,000, 100,000, 150,000, 250,000, 500,000, 1 million, 2 million, 3 million, 4 million, 5 million, 6 million, 10 million or at least 15 million variants.
- each data set comprises about 500, 1000, 2000, 5000, 10,000, 50,000, 100,000, 150,000, 250,000, 500,000, 1 million, 2 million, 3 million, 4 million, 5 million, 6 million, 10 million or about 15 million variants.
- each data set comprises 100-1 million, 100-100,000, 100,000-1 million, 100,000-5 million, 100-500,000, 500-5 million, 1 million-2 million, 2 million to 6 million, 3 million to 10 million, or 4 million to 7 million variants.
- data sets comprise at least 1, 2, 5, 10, 20, 25, 50, 75, 80, 85, 90, 95, 100, 110, 120, 150, 200, or at least 250 million rows of data.
- data sets comprise no more than 1, 2, 5, 10, 20, 25, 50, 75, 80, 85, 90, 95, 100, 110, 120, 150, 200, or no more than 250 million rows of data.
- data sets comprise 1-250, 1-100, 1-50, 1-25, 5-25, 5-50, 10-100, 10-200, 50-200, 50-150, 100-400 or 100-300 million rows of data.
- a system for visualizing genomic data comprises one or more devices comprising at least one processor and instructions executable by the at least one processor to provide a first application configured to perform operations comprising: i. accessing one or more data sets comprising genomic data; and ii. generating a visual representation of the one or more data sets.
- the visualization comprises a circos plot. In some instances, the circos plot is generated in no more than 10, 5, 4, 3, 2, 1, 0.5, 0.2, 0.1, 0.05, or no more than 0.01 seconds.
- the circos plot is generated in 0.01-10, 0.05-10, 0.1-50, 0.5-10, 1-10, 2-10, 5-10, 0.01-0.05, 0.01-0.1, 0.01-0.5, 0.1-0.5, 0.1-1, or 0.1-5 seconds. In some instances, the circos plot is generated in no more than 10, 5, 4, 3, 2, 1, 0.5, 0.2, 0.1, 0.05, or no more than 0.01 seconds for a data set having at least 5 cells.
- the circos plot is generated in 0.01-10, 0.05-10, 0.1-50, 0.5-10, 1-10, 2-10, 5-10, 0.01-0.05, 0.01-0.1, 0.01-0.5, 0.1-0.5, 0.1-1, or 0.1-5 seconds for a data set having at least 5 cells. In some instances, the circos plot is generated in no more than 10, 5, 4, 3, 2, 1, 0.5, 0.2, 0.1, 0.05, or no more than 0.01 seconds for a data set having at least 10 cells.
- the circos plot is generated in 0.01-10, 0.05-10, 0.1-50, 0.5-10, 1-10, 2-10, 5-10, 0.01-0.05, 0.01-0.1, 0.01-0.5, 0.1-0.5, 0.1-1, or 0.1-5 seconds for a data set having at least 10 cells. In some instances, the circos plot is generated in no more than 10, 5, 4, 3, 2, 1, 0.5, 0.2, 0.1, 0.05, or no more than 0.01 seconds for a data set having at least 20 cells.
- the circos plot is generated in 0.01-10, 0.05-10, 0.1-50, 0.5-10, 1-10, 2-10, 5-10, 0.01-0.05, 0.01-0.1, 0.01-0.5, 0.1-0.5, 0.1-1, or 0.1-5 seconds for a data set having at least 20 cells. In some instances, the circos plot is generated in no more than 10, 5, 4, 3, 2, 1, 0.5, 0.2, 0.1, 0.05, or no more than 0.01 seconds for a data set having at least 1 million variants per cell. In some instances, the circos plot is generated in 0.01-10, 0.05-10, 0.1-50, 0.5-10, 1-10, 2-10, 5-10, 0.01-0.05, 0.01-0.1, 0.01-0.5, 0.
- the circos plot is generated in no more than 10, 5, 4, 3, 2, 1, 0.5, 0.2, 0.1, 0.05, or no more than 0.01 seconds for a data set having at least 4 million variants per cell. In some instances, the circos plot is generated in 0.01-10, 0.05-10, 0.1-50, 0.5-10, 1-10, 2-10, 5-10, 0.01-0.05, 0.01-0.1, 0.01-0.5, 0.1-0.5, 0.1-1, or 0.1-5 seconds for a data set having at least 4 million variants per cell. In some instances, the circos plot is generated using no more than 1, 2, 3, 4, 5, 6, 7, or no more than 8 processors.
- processors may be used. As many processors as needed may be used.
- the circos plot is generated using at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 30, 40 or more processors. In some instances, the circos plot is generated using about 1, 2, 3, 4, 5, 6, 7, 8, 10, 20, 30, 40 or more processors.
- the visualization further comprises a phylogenetic tree. In some instances, the visualization further comprises sequencing quality metrics. In some instances, the visualization further comprises annotated variations. In some instances, the visualization further comprises number of variations. In some instances, the visualization further comprises cell and cell population statistics.
- platforms comprising: a database, in a computer memory, comprising biologic information for members of a population of individuals or samples, the biologic information comprising genome data, the biologic information obtained by analysis of one or more biologic samples from each sample and/or individual, each individual and/or sample having an ID; and a processor configured to provide a biologic information visual synthesis application comprising: a software module presenting an interface allowing a user to query the database by one or more of: inputting a phenotype, inputting a gene name, inputting an individual ID, and inputting a sample ID; a software module generating a genome browser, the genome browser comprising: a whole genome display comprising an icon representing each chromosome, each icon indicating a density of gene variants; and a chromosome display comprising an iconic representation of each chromosome, the representation indicating a density of gene variants located at the relevant portion of the chromosome, wherein selection of a chromosome by a user generates a linear
- the platforms, systems, media, and methods described herein include biologic data pertaining to a population of individuals, or use of the same.
- the population of individuals comprises more than 1,000, more than 5,000, more than 10,000, more than 20,000, more than 50,000, or more than 100,000, more than 500,000, more than 1,000,000 more than 10,000,000, more than 50,000,000, or more than 100,000,000 individuals.
- the individuals in the population participated in academic medical research studies using consents allowing for genetic testing of specimens.
- biologic specimens and phenotype data are collected for individuals from pharmaceutical clinical trials, academic research, and health care settings.
- biologic data pertaining to a population of individuals is collected from integrated health records for individuals representing a spectrum of diseases with unmet medical needs.
- biologic information comprises genetic information.
- the biologic information comprises whole human genome sequencing information.
- the biologic information comprises human transcriptome sequencing information.
- biologic information comprises genetic information from humans, non-human primates, animals, plants, fungi, protozoa, archaea, or bacteria.
- biologic information comprises genetic information from the microbiome.
- the biologic information may comprise genomic information.
- genomic information refers to genetic information found within a biological sample arising from the genome (or DNA - nuclear, mitochondrial or otherwise).
- genomic information comprises nucleic acid sequence copy number, location, and sequence.
- the genomic information is not limited to protein-coding sequence, it may refer to intronic sequence and intergenic sequence, each known to harbor multiple functional elements whereby DNA changes at those elements may be consequential in normal development and disease.
- genomic information comprises post-transcriptional modifications such as methylation.
- genomic information is found within a chromosome, plasmid, or other medium comprising nucleic acids.
- the biologic information may comprise transcript information.
- transcript information refers to information obtained from a transcriptome within a biological sample.
- transcript information comprises expression levels of genes and sequence of corresponding nucleic acids expressed from genes.
- the biologic information may comprise microbiome information.
- microbiome refers to the bacteria and other microorganisms that live in and on the human body.
- the microbiome information comprises metagenomic microbiome characterization.
- the microbiome information comprises one or more of: microflora genus and/or species information, microflora relative abundance information, and microflora gene and/or gene variant information.
- the biologic information may comprise proteome information.
- the proteome information comprises information regarding abundance, localization, identity, post-transcriptional modifications, or other protein information.
- the biologic information may comprise methylome information.
- methylome information comprises post-transcriptional modifications such as the location of 5-methylcytosine (5-mC), 5-hydroxymethylcytosine (5-hmC), CpG islands, ATAC seq, methyl histone modification, other post-transcriptional modification to nucleic acids, and/or any combinations thereof.
- the biologic information may comprise metabolome information.
- metabolome refers to the small-molecule chemicals found within a biological sample.
- metabolome information comprises the presence of one or more small-molecule chemicals.
- the metabolome information comprises a qualitative measurement of one or more small-molecule chemicals.
- the metabolome information comprises a quantitative measurement of one or more small-molecule chemicals.
- the microbiome information comprises measurements of at least 100, at least 200, at least 300, at least 400, at least 500, at least 600, at least 700, at least 800, at least 900, at least 1000, at least 1100, at least 1200, at least 1300, at least 1400, or at least 1500 substances (e.g., molecules).
- Databases and visualizations described herein may comprise sensitive information pertaining to an individual’s health.
- platforms for data security comprising one or more of an access control for one or more users; a security framework; and biological data from an individual.
- one or more security measures are implemented via security frameworks to restrict access or protect an individual’s health information.
- Security frameworks in some instances comprise standards.
- security frameworks include HIPAA standards.
- security frameworks comprise NIST cybersecurity framework.
- Access controls in some instances restrict access to certain individuals or groups of individuals. Access controls in some instances comprise passwords, biometrics, or other method of user authentication.
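As an illustration of password-based authentication combined with group-level access restriction, the sketch below uses the Python standard library's PBKDF2 implementation and a constant-time comparison. It is a minimal example under assumed names (`hash_password`, `verify_password`, `can_access`) and an assumed iteration count; it is not presented as the disclosed system's mechanism, nor as sufficient on its own to satisfy any security framework.

```python
import hashlib
import hmac
import os


def hash_password(password, salt=None, iterations=200_000):
    """Derive a salted PBKDF2-HMAC-SHA256 digest; returns (salt, digest)."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt for each stored password
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations
    )
    return salt, digest


def verify_password(password, salt, expected_digest, iterations=200_000):
    """Recompute the digest and compare it in constant time."""
    _, digest = hash_password(password, salt, iterations)
    return hmac.compare_digest(digest, expected_digest)


def can_access(user_groups, allowed_groups):
    """Grant access if the user belongs to at least one allowed group."""
    return bool(set(user_groups) & set(allowed_groups))


# Hypothetical credentials and group names.
salt, stored = hash_password("example-passphrase")
```

Only the salt and digest are stored; the plaintext password is never retained, and the group check is evaluated separately from authentication.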
- the system can be applied in a variety of fields.
- the system provides useful data and analysis to pharmaceutical companies, including informaticians, bench scientists, medical director, the senior executive team, or commercial organizations.
- Such data and analysis includes analysis of clinical trial data for patient stratification and biomarker discovery, identification and in silico validation of novel genetic targets, discovery of novel disease and dose response biomarkers/signatures, compound repurposing and expand indications of marketed drugs, rescue of failed clinical trial assets, real time genetic analysis of adverse events, or targeted accelerated recruitment for clinical trials.
- the system in some instances offers analysis of specific cohorts, analysis of individual patients, or large-scale analysis of variation in populations.
- Clinics, hospitals and cancer centers, including physicians and genetic counsellors, in some instances will find the system useful in the analysis of individuals, analysis of cohorts, wellness focus, or oncology focus.
- the data and analysis in some instances also have value to insurance companies, actuarial teams, or health economists.
- the system can serve as or enable a reference set of knowledge/evidence, a hypothesis generation engine, a platform for analysis of pharma’s own data, a platform for combination of pharma data and data and analysis provided by the system, a platform for combining data from multiple collaborators, a platform for sharing data within a company, etc.
- the system can similarly be used as part of a care tool to identify the most relevant results for treatment and prevention, a reference set of knowledge/evidence, or a tool to identify other physicians with similar patients/ share knowledge.
- the system can be useful as part of a tool to detect individual care pathways and incentivize healthy living, or a tool to help quantify risk in the insured population.
- kits comprising reagents for acquiring biological information.
- the kit is configured to obtain genomic or transcriptome data.
- the kit is configured to obtain genomic, methylome, transcriptome or proteome data from single cells.
- kits comprising reagents for obtaining biological data from single cells, and instructions for using the kit.
- the instructions comprise links to a web-based portal or mobile based software application to import, analyze, and/or compare biological data obtained from the kit.
- the platforms, systems, media, and methods described herein include a digital processing device, or use of the same.
- the digital processing device includes one or more hardware central processing units (CPUs) or general-purpose graphics processing units (GPGPUs) that carry out the device’s functions.
- the digital processing device further comprises an operating system configured to perform executable instructions.
- one or more resources related to the systems described herein is stored locally.
- the digital processing device is optionally connected to a computer network.
- the digital processing device is optionally connected to the Internet such that it accesses the World Wide Web.
- the digital processing device is optionally connected to a cloud computing infrastructure.
- the digital processing device is optionally connected to an intranet.
- the digital processing device is optionally connected to a data storage device.
- suitable digital processing devices include, by way of examples, server computers, desktop computers, laptop computers, and notebook computers.
- the digital processing device includes an operating system configured to perform executable instructions.
- the operating system is, for example, software, including programs and data, which manages the device’s hardware and provides services for execution of applications.
- suitable server operating systems include, by way of examples, FreeBSD, OpenBSD, NetBSD®, Linux, Apple® Mac OS X Server®, Oracle® Solaris®, Windows Server®, and Novell® NetWare®.
- suitable personal computer operating systems include, by way of examples, Microsoft® Windows®, Apple® Mac OS X®, UNIX®, and UNIX-like operating systems such as GNU/Linux®.
- the operating system is provided by cloud computing.
- the device includes a storage and/or memory device.
- the storage and/or memory device is one or more physical apparatuses used to store data or programs on a temporary or permanent basis.
- the device is volatile memory and requires power to maintain stored information.
- the device is non-volatile memory and retains stored information when the digital processing device is not powered.
- the non-volatile memory comprises flash memory.
- the volatile memory comprises dynamic random-access memory (DRAM).
- the non-volatile memory comprises ferroelectric random-access memory (FRAM).
- the non-volatile memory comprises phase-change random access memory (PRAM).
- the device is a storage device including, by way of examples, CD-ROMs, DVDs, flash memory devices, magnetic disk drives, magnetic tape drives, optical disk drives, and cloud computing-based storage.
- the storage and/or memory device is a combination of devices such as those disclosed herein.
- data may be stored on and/or using a DNA data storage system. Any suitable data storage system and database may be used.
- the digital processing device includes a display to send visual information to a user.
- the display is a cathode ray tube (CRT).
- the display is a liquid crystal display (LCD).
- the display is a thin film transistor liquid crystal display (TFT-LCD).
- the display is an organic light emitting diode (OLED) display.
- an OLED display is a passive-matrix OLED (PMOLED) or active-matrix OLED (AMOLED) display.
- the display is a plasma display.
- the display is a video projector.
- the display is a wearable display.
- the display is a combination of devices such as those disclosed herein.
- the digital processing device includes an input device to receive information from a user.
- the input device is a keyboard.
- the input device is a pointing device including, by way of examples, a mouse, trackball, track pad, joystick, game controller, or stylus.
- the input device is a touch screen or a multi-touch screen.
- the input device is a microphone to capture voice or other sound input.
- the input device is a video camera or other sensor to capture motion or visual input.
- the input device is a Kinect, Leap Motion, or the like.
- the input device is a combination of devices such as those disclosed herein.
- Non-transitory computer readable storage medium
- the platforms, systems, media, and methods disclosed herein include one or more non-transitory computer readable storage media encoded with a program including instructions executable by the operating system of an optionally networked digital processing device.
- a computer readable storage medium is a tangible component of a digital processing device.
- a computer readable storage medium is optionally removable from a digital processing device.
- a computer readable storage medium includes, by way of examples, CD-ROMs, DVDs, flash memory devices, solid state memory, magnetic disk drives, magnetic tape drives, optical disk drives, cloud computing systems and services, and the like.
- the program and instructions are permanently, substantially permanently, semi-permanently, or non-transitorily encoded on the media.
- Referring to FIG. 10, a block diagram is shown depicting an exemplary machine that includes a computer system 1100 (e.g., a processing or computing system) within which a set of instructions can execute for causing a device to perform or execute any one or more of the aspects and/or methodologies for static code scheduling of the present disclosure.
- the components in FIG. 10 are examples only and do not limit the scope of use or functionality of any hardware, software, embedded logic component, or a combination of two or more such components implementing particular embodiments.
- Computer system 1100 may include one or more processors 1101, a memory 1103, and a storage 1108 that communicate with each other, and with other components, via a bus 1140.
- the bus 1140 may also link a display 1132, one or more input devices 1133 (which may, for example, include a keypad, a keyboard, a mouse, a stylus, etc.), one or more output devices 1134, one or more storage devices 1135, and various tangible storage media 1136. All of these elements may interface directly or via one or more interfaces or adaptors to the bus 1140.
- the various tangible storage media 1136 can interface with the bus 1140 via storage medium interface 1126.
- Computer system 1100 may have any suitable physical form, including but not limited to one or more integrated circuits (ICs), printed circuit boards (PCBs), mobile handheld devices (such as mobile telephones or PDAs), laptop or notebook computers, distributed computer systems, computing grids, or servers.
- Computer system 1100 includes one or more processor(s) 1101 (e.g., central processing units (CPUs), general purpose graphics processing units (GPGPUs), or quantum processing units (QPUs)) that carry out functions.
- processor(s) 1101 optionally contains a cache memory unit 1102 for temporary local storage of instructions, data, or computer addresses.
- Processor(s) 1101 are configured to assist in execution of computer readable instructions.
- Computer system 1100 may provide functionality for the components depicted in FIG. 10 as a result of the processor(s) 1101 executing non-transitory, processor-executable instructions embodied in one or more tangible computer-readable storage media, such as memory 1103, storage 1108, storage devices 1135, and/or storage medium 1136.
- the computer-readable media may store software that implements particular embodiments, and processor(s) 1101 may execute the software.
- Memory 1103 may read the software from one or more other computer-readable media (such as mass storage device(s) 1135, 1136) or from one or more other sources through a suitable interface, such as network interface 1120.
- the software may cause processor(s) 1101 to carry out one or more processes or one or more steps of one or more processes described or illustrated herein. Carrying out such processes or steps may include defining data structures stored in memory 1103 and modifying the data structures as directed by the software.
- the memory 1103 may include various components (e.g., machine readable media) including, but not limited to, a random-access memory component (e.g., RAM 1104) (e.g., static RAM (SRAM), dynamic RAM (DRAM), ferroelectric random-access memory (FRAM), phase-change random access memory (PRAM), etc.), a read-only memory component (e.g., ROM 1105), and any combinations thereof.
- ROM 1105 may act to communicate data and instructions unidirectionally to processor(s) 1101.
- RAM 1104 may act to communicate data and instructions bidirectionally with processor(s) 1101.
- ROM 1105 and RAM 1104 may include any suitable tangible computer-readable media described below.
- a basic input/output system 1106 (BIOS) including basic routines that help to transfer information between elements within computer system 1100, such as during start-up, may be stored in the memory 1103.
- Fixed storage 1108 is connected bidirectionally to processor(s) 1101, optionally through storage control unit 1107.
- Fixed storage 1108 provides additional data storage capacity and may also include any suitable tangible computer-readable media described herein.
- Storage 1108 may be used to store operating system 1109, executable(s) 1110, data 1111, applications 1112 (application programs), and the like.
- Storage 1108 can also include an optical disk drive, a solid-state memory device (e.g., flash-based systems), or a combination of any of the above.
- Information in storage 1108 may, in appropriate cases, be incorporated as virtual memory in memory 1103.
- storage device(s) 1135 may be removably interfaced with computer system 1100 (e.g., via an external port connector (not shown)) via a storage device interface 1125.
- storage device(s) 1135 and an associated machine-readable medium may provide non-volatile and/or volatile storage of machine-readable instructions, data structures, program modules, and/or other data for the computer system 1100.
- software may reside, completely or partially, within a machine-readable medium on storage device(s) 1135.
- software may reside, completely or partially, within processor(s) 1101
- Bus 1140 connects a wide variety of subsystems.
- reference to a bus may encompass one or more digital signal lines serving a common function, where appropriate.
- Bus 1140 may be any of several types of bus structures including, but not limited to, a memory bus, a memory controller, a peripheral bus, a local bus, and any combinations thereof, using any of a variety of bus architectures.
- Computer system 1100 may also include an input device 1133.
- a user of computer system 1100 may enter commands and/or other information into computer system 1100 via input device(s) 1133.
- Examples of an input device(s) 1133 include, but are not limited to, an alpha-numeric input device (e.g., a keyboard), a pointing device (e.g., a mouse or touchpad), a touchpad, a touch screen, a multi-touch screen, a joystick, a stylus, a gamepad, an audio input device (e.g., a microphone, a voice response system, etc.), an optical scanner, a video or still image capture device (e.g., a camera), and any combinations thereof.
- the input device is a Kinect, Leap Motion, or the like.
- Input device(s) 1133 may be interfaced to bus 1140 via any of a variety of input interfaces 1123 (e.g., input interface 1123) including, but not limited to, serial, parallel, game port, USB, FIREWIRE, THUNDERBOLT, or any combination of the above.
- when computer system 1100 is connected to network 1130, computer system 1100 may communicate with other devices, specifically mobile devices and enterprise systems, distributed computing systems, cloud storage systems, cloud computing systems, and the like, connected to network 1130. Communications to and from computer system 1100 may be sent through network interface 1120.
- network interface 1120 may receive incoming communications (such as requests or responses from other devices) in the form of one or more packets (such as Internet Protocol (IP) packets) from network 1130, and computer system 1100 may store the incoming communications in memory 1103 for processing.
- Computer system 1100 may similarly store outgoing communications (such as requests or responses to other devices) in the form of one or more packets in memory 1103, to be communicated to network 1130 from network interface 1120.
- Processor(s) 1101 may access these communication packets stored in memory 1103 for processing.
- Examples of the network interface 1120 include, but are not limited to, a network interface card, a modem, and any combination thereof.
- Examples of a network 1130 or network segment 1130 include, but are not limited to, a distributed computing system, a cloud computing system, a wide area network (WAN) (e.g., the Internet, an enterprise network), a local area network (LAN) (e.g., a network associated with an office, a building, a campus or other relatively small geographic space), a telephone network, a direct connection between two computing devices, a peer-to-peer network, and any combinations thereof.
- a network, such as network 1130 may employ a wired and/or a wireless mode of communication. In general, any network topology may be used.
- Information and data can be displayed through a display 1132.
- Examples of a display 1132 include, but are not limited to, a cathode ray tube (CRT), a liquid crystal display (LCD), a thin film transistor liquid crystal display (TFT-LCD), an organic light emitting diode (OLED) display such as a passive-matrix OLED (PMOLED) or active-matrix OLED (AMOLED) display, a plasma display, and any combinations thereof.
- the display 1132 can interface to the processor(s) 1101, memory 1103, and fixed storage 1108, as well as other devices, such as input device(s) 1133, via the bus 1140.
- the display 1132 is linked to the bus 1140 via a video interface 1122, and transport of data between the display 1132 and the bus 1140 can be controlled via the graphics control 1121.
- the display is a video projector.
- the display is a head-mounted display (HMD) such as a VR headset.
- suitable VR headsets include, by way of example, HTC Vive, Oculus Rift, Samsung Gear VR, Microsoft HoloLens, Razer OSVR, FOVE VR, Zeiss VR One, Avegant Glyph, Freefly VR headset, and the like.
- the display is a combination of devices such as those disclosed herein.
- computer system 1100 may include one or more other peripheral output devices 1134 including, but not limited to, an audio speaker, a printer, a storage device, and any combinations thereof.
- peripheral output devices may be connected to the bus 1140 via an output interface 1124.
- Examples of an output interface 1124 include, but are not limited to, a serial port, a parallel connection, a USB port, a FIREWIRE port, a THUNDERBOLT port, and any combinations thereof.
- computer system 1100 may provide functionality as a result of logic hardwired or otherwise embodied in a circuit, which may operate in place of or together with software to execute one or more processes or one or more steps of one or more processes described or illustrated herein.
- Reference to software in this disclosure may encompass logic, and reference to logic may encompass software.
- reference to a computer-readable medium may encompass a circuit (such as an IC) storing software for execution, a circuit embodying logic for execution, or both, where appropriate.
- the present disclosure encompasses any suitable combination of hardware, software, or both.
- DSP digital signal processor
- ASIC application specific integrated circuit
- FPGA field programmable gate array
- a general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine.
- a processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
- a software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
- An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium.
- the storage medium may be integral to the processor.
- the processor and the storage medium may reside in an ASIC.
- the ASIC may reside in a user terminal.
- the processor and the storage medium may reside as discrete components in a user terminal.
- suitable computing devices include, by way of example, server computers, desktop computers, laptop computers, notebook computers, subnotebook computers, netbook computers, netpad computers, set-top computers, media streaming devices, handheld computers, Internet appliances, mobile smartphones, tablet computers, and personal digital assistants.
- Suitable tablet computers in various embodiments, include those with booklet, slate, and convertible configurations, known to those of skill in the art.
- the computing device includes an operating system configured to perform executable instructions.
- the operating system is, for example, software, including programs and data, which manages the device’s hardware and provides services for execution of applications.
- Operating systems in some instances are stored locally or accessed via a network.
- server operating systems include, by way of example, FreeBSD, OpenBSD, NetBSD®, Linux, Apple® Mac OS X Server®, Oracle® Solaris®, Windows Server®, and Novell® NetWare®.
- suitable personal computer operating systems include, by way of example, Microsoft® Windows®, Apple® Mac OS X®, UNIX®, and UNIX-like operating systems such as GNU/Linux®.
- the operating system is provided by cloud computing.
- suitable mobile smartphone operating systems include, by way of example, Nokia® Symbian® OS, Apple® iOS®, Research In Motion® BlackBerry OS®, Google® Android®, Microsoft® Windows Phone® OS, Microsoft® Windows Mobile® OS, Linux®, and Palm® WebOS®.
- Computer systems described herein may be utilized as part of the systems and methods of the present invention.
- a computer system may be utilized as a device configured for use by a researcher, patient, partner, caretaker, or healthcare provider.
- Non-transitory computer readable storage medium
- the platforms, systems, media, and methods disclosed herein include one or more non-transitory computer readable storage media encoded with a program including instructions executable by the operating system of an optionally networked computing device.
- a computer readable storage medium is a tangible component of a computing device.
- a computer readable storage medium is optionally removable from a computing device.
- a computer readable storage medium includes, by way of example, CD-ROMs, DVDs, flash memory devices, solid state memory, magnetic disk drives, magnetic tape drives, optical disk drives, distributed computing systems including cloud computing systems and services, and the like.
- the program and instructions are permanently, substantially permanently, semipermanently, or non-transitorily encoded on the media.
- the platforms, systems, media, and methods disclosed herein include at least one computer program, or use of the same.
- a computer program includes a sequence of instructions, executable by one or more processor(s) of the computing device’s CPU, written to perform a specified task.
- Computer readable instructions may be implemented as program modules, such as functions, objects, Application Programming Interfaces (APIs), computing data structures, and the like, that perform particular tasks or implement particular abstract data types. In light of the disclosure provided herein, those of skill in the art will recognize that a computer program may be written in various versions of various languages.
- the functionality of the computer readable instructions may be combined or distributed as desired in various environments.
- a computer program comprises one sequence of instructions.
- a computer program comprises a plurality of sequences of instructions. In some embodiments, a computer program is provided from one location. In other embodiments, a computer program is provided from a plurality of locations. In various embodiments, a computer program includes one or more software modules or features. In various embodiments, a computer program includes, in part or in whole, one or more web applications, one or more mobile applications, one or more standalone applications, one or more web browser plug-ins, extensions, add-ins, or add-ons, or combinations thereof.
- one or more web applications, one or more mobile applications, one or more standalone applications, one or more web browser plug-ins, extensions, add-ins, or add-ons, or combinations thereof are utilized to perform the methods as described herein.
- one or more web applications, one or more mobile applications, one or more standalone applications, one or more web browser plug-ins, extensions, add-ins, or add-ons, or combinations thereof are utilized as part of the systems as described herein.
- one or more web applications, one or more mobile applications, one or more standalone applications, one or more web browser plug-ins, extensions, add-ins, or add-ons, or combinations thereof are utilized to fully or partially automate the systems and methods as described herein. In some embodiments, automation allows methods to be carried out which are beyond the limits of what can be processed by a human.
- a computer program includes a web application.
- a web application in various embodiments, utilizes one or more software frameworks and one or more database systems.
- a web application is created upon a software framework such as Microsoft® .NET or Ruby on Rails (RoR).
- a web application utilizes one or more database systems including, by way of examples, relational, non-relational, object oriented, associative, XML, and document-oriented database systems.
- suitable relational database systems include, by way of examples, Microsoft® SQL Server, MySQL™, and Oracle®.
- a web application in various embodiments, is written in one or more versions of one or more languages.
- a web application may be written in one or more markup languages, presentation definition languages, client-side scripting languages, server-side coding languages, database query languages, or combinations thereof.
- a web application is written to some extent in a markup language such as Hypertext Markup Language (HTML), Extensible Hypertext Markup Language (XHTML), or Extensible Markup Language (XML).
- a web application is written to some extent in a presentation definition language such as Cascading Style Sheets (CSS).
- a web application is written to some extent in a client-side scripting language such as Asynchronous JavaScript and XML (AJAX), Flash® ActionScript, JavaScript, or Silverlight®.
- a web application is written to some extent in a server-side coding language such as Active Server Pages (ASP), ColdFusion®, Perl, Java™, JavaServer Pages (JSP), Hypertext Preprocessor (PHP), Python™, Ruby, Tcl, Smalltalk, WebDNA®, or Groovy.
- a web application is written to some extent in a database query language such as Structured Query Language (SQL).
- a web application integrates enterprise server products such as IBM® Lotus Domino®.
- a web application includes a media player element.
- a media player element utilizes one or more of many suitable multimedia technologies including, by way of examples, Adobe® Flash®, HTML 5, Apple® QuickTime®, Microsoft® Silverlight®, Java™, and Unity®.
- an application provision system comprises one or more databases 1200 accessed by a relational database management system (RDBMS) 1210. Suitable RDBMSs include Firebird, MySQL, PostgreSQL, SQLite, Oracle Database, Microsoft SQL Server, IBM DB2, IBM Informix, SAP Sybase, Teradata, and the like.
- the application provision system further comprises one or more application servers 1220 (such as Java servers, .NET servers, PHP servers, and the like) and one or more web servers 1230 (such as Apache, IIS, GWS and the like).
- the web server(s) optionally expose one or more web services via application programming interfaces (APIs) 1240.
- an application provision system alternatively has a distributed, cloud-based architecture 1300 and comprises elastically load balanced, auto-scaling web server resources 1310 and application server resources 1320 as well as synchronously replicated databases 1330.
- the web applications may be utilized as part of the systems as described herein.
- the web applications may be utilized to perform the methods as described herein.
- web applications are utilized to provide features or modules of the systems described herein.
- web applications are utilized to fully or partially automate systems and methods described herein. In some embodiments, automation allows methods to be carried out which are beyond the limits of what can be processed by a human.
- a computer program includes a mobile application provided to a mobile computing device.
- the mobile application is provided to a mobile computing device at the time it is manufactured.
- the mobile application is provided to a mobile computing device via the computer network described herein.
- a mobile application is created by techniques known to those of skill in the art using hardware, languages, and development environments known to the art. Those of skill in the art will recognize that mobile applications are written in several languages. Suitable programming languages include, by way of examples, C, C++, C#, Objective-C, Java™, JavaScript, Pascal, Object Pascal, Python™, Ruby, VB.NET, WML, and XHTML/HTML with or without CSS, or combinations thereof.
- Suitable mobile application development environments are available from several sources. Commercially available development environments include, by way of examples, AirplaySDK, alcheMo, Appcelerator®, Celsius, Bedrock, Flash Lite, .NET Compact Framework, Rhomobile, and WorkLight Mobile Platform. Other development environments are available without cost including, by way of examples, Lazarus, MobiFlex, MoSync, and PhoneGap. Also, mobile device manufacturers distribute software developer kits including, by way of examples, iPhone and iPad (iOS) SDK, Android™ SDK, BlackBerry® SDK, BREW SDK, Palm® OS SDK, Symbian SDK, webOS SDK, and Windows® Mobile SDK.
- the mobile applications may be utilized as part of the systems as described herein.
- the mobile applications may be utilized to perform the methods as described herein.
- mobile applications are utilized to provide features or modules of the systems described herein.
- mobile applications are utilized to fully or partially automate systems and methods described herein. In some embodiments, automation allows methods to be carried out which are beyond the limits of what can be processed by a human.
- Web browser plug-in
- the computer program includes a web browser plug-in (e.g., extension, etc.).
- a plug-in is one or more software components that add specific functionality to a larger software application. Makers of software applications support plug-ins to enable third-party developers to create abilities which extend an application, to support easily adding new features, and to reduce the size of an application. When supported, plug-ins enable customizing the functionality of a software application. For example, plug-ins are commonly used in web browsers to play video, generate interactivity, scan for viruses, and display particular file types. Those of skill in the art will be familiar with several web browser plug-ins including Adobe® Flash® Player, Microsoft® Silverlight®, and Apple® QuickTime®.
- the toolbar comprises one or more web browser extensions, add-ins, or add-ons. In some embodiments, the toolbar comprises one or more explorer bars, tool bands, or desk bands. In view of the disclosure provided herein, those of skill in the art will recognize that several plug-in frameworks are available that enable development of plug-ins in various programming languages, including, by way of examples, C++, Delphi, Java™, PHP, Python™, and VB.NET, or combinations thereof.
- Web browsers are software applications, designed for use with network-connected computing devices, for retrieving, presenting, and traversing information resources on the World Wide Web. Suitable web browsers include, by way of examples, Microsoft® Edge®, Mozilla® Firefox®, Google® Chrome, Apple® Safari®, Opera Software® Opera®, and KDE Konqueror. In some embodiments, the web browser is a mobile web browser. Mobile web browsers (also called microbrowsers, mini-browsers, and wireless browsers) are designed for use on mobile computing devices including, by way of examples, handheld computers, tablet computers, netbook computers, subnotebook computers, smartphones, and personal digital assistants (PDAs).
- Suitable mobile web browsers include, by way of examples, Google® Android® browser, RIM BlackBerry® Browser, Apple® Safari®, Palm® Blazer, Palm® WebOS® Browser, Mozilla® Firefox® for mobile, Microsoft® Internet Explorer® Mobile, Amazon® Kindle® Basic Web, Nokia® Browser, Opera Software® Opera® Mobile, and Sony® PSP™ browser.
- the platforms, systems, media, and methods disclosed herein include software, server, and/or database modules, or use of the same.
- software modules are created by techniques known to those of skill in the art using machines, software, and languages known to the art.
- the software modules disclosed herein are implemented in a multitude of ways.
- a software module comprises a file, a section of code, a programming object, a programming structure, or combinations thereof.
- a software module comprises a plurality of files, a plurality of sections of code, a plurality of programming objects, a plurality of programming structures, or combinations thereof.
- the one or more software modules comprise, by way of examples, a web application, a mobile application, and a standalone application.
- software modules are in one computer program or application.
- software modules are in more than one computer program or application.
- software modules are hosted on one machine.
- software modules are hosted on more than one machine.
- software modules are hosted on a distributed computing platform such as a cloud computing platform.
- software modules are hosted on one or more machines in one location. In other embodiments, software modules are hosted on one or more machines in more than one location.
- users query one or more databases to identify information about biological data in their data sets. For example, a user may use an interface to display specific information about a variant, such as the variant's role in cancer or other diseases.
- the platforms, systems, media, and methods disclosed herein include one or more databases, or use of the same.
- databases are suitable for storage and retrieval of, for example, patient, photo, video, skin condition, visit, physician, and insurance information.
- suitable databases include, by way of examples, relational databases, nonrelational databases, object-oriented databases, object databases, entity-relationship model databases, associative databases, XML databases, document-oriented databases, and graph databases. Further examples include SQL, PostgreSQL, MySQL, Oracle, DB2, Sybase, and MongoDB.
- a database is Internet-based.
- a database is web-based.
- a database is cloud computing based.
- a database is a distributed database.
- a database is based on one or more local computer storage devices.
- Databases may comprise information (e.g., annotations) regarding genetic variants.
- databases provide information on somatic, germline, or somatic and germline variants.
- a database comprises one or more of ClinVar, COSMIC, NCBI database of Genotypes and Phenotypes (dbGaP), gnomAD, 69 genomes from CGI, Personalized Genome Project, NCI Genomic Data Commons (GDC), cBioPortal, Intogen, and the Pediatric Cancer Genome Project.
- databases provide information on variants related to cancer or other disease.
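The variant lookup described above can be illustrated with a minimal sketch. It assumes a small in-memory SQLite table; the table name `variant_annotation`, its columns, and the placeholder rows are hypothetical and are not the schema of ClinVar, COSMIC, or any other database named in this disclosure.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory table of variant annotations (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE variant_annotation ("
        "gene TEXT, protein_change TEXT, note TEXT)"
    )
    # Placeholder rows for illustration only; the notes are not real annotations.
    rows = [
        ("PIK3CA", "E545K", "placeholder annotation"),
        ("GENE_X", "A1B", "placeholder annotation"),
    ]
    conn.executemany("INSERT INTO variant_annotation VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def lookup_variant(conn, gene, protein_change):
    """Return all annotation rows for one variant via a parameterized query."""
    cur = conn.execute(
        "SELECT gene, protein_change, note FROM variant_annotation "
        "WHERE gene = ? AND protein_change = ?",
        (gene, protein_change),
    )
    return cur.fetchall()


conn = build_demo_db()
hits = lookup_variant(conn, "PIK3CA", "E545K")    # one matching row
misses = lookup_variant(conn, "GENE_Y", "C2D")    # no rows for an unknown variant
```

Parameterized queries (the `?` placeholders) keep user-supplied gene names out of the SQL text, which matters when the interface is exposed to end users.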
- EXAMPLE 1 PROCESSING CELLS FOR A MULTIOMIC WORKFLOW
- An exemplary workflow is shown in FIG. 1.
- a sample is obtained from a diseased tissue, such as a frozen or FFPE sample.
- Cells are collected from the sample using any number of techniques known in the art. In some instances, cells are collected from specific geographic (spatial) locations on a tissue. The cells are then processed using one or more multiomics workflows to collect measurements (FIG. 2) from the genome, transcriptome, methylome, and proteome.
- Such workflows target biological inquiries (FIG. 4).
- data from measurements are entered as an input file into a cloud computing platform (FIG. 10).
- a penetrance score (FIG. 3) and mechanism (FIG. 5) are generated using an algorithm. Changes having a high penetrance score are selected to validate as specific drug targets (FIG. 6).
- Personalized treatments e.g., vaccine or small molecule
- a MOLM-13 drug-resistant model was generated using quizartinib to target FLT3.
- the patient from which the MOLM-13 cell line was generated harbored an internal tandem duplication (ITD) in the receptor kinase FLT3 gene, resulting in hyperactive growth signaling and sensitivity to the FLT3 inhibitor quizartinib.
- the generation of resistance in culture can be seen in FIG. 13.
- the quizartinib-resistant cells also harbor an N841K mutation, which has also been found in AML patients.
- a genetic analysis of parental and resistant genes can be seen in FIG. 14A.
- Genomic and transcriptomic libraries were prepared. First, the cytosol was lysed. Then the mRNA transcriptome was converted to cDNA using 1st strand synthesis. Next, nuclear lysis occurred. Whole genome amplification via PTA occurred. The transcriptome cDNA and genomic DNA were then isolated. The cDNA was pre-amplified via PTA and a library was prepared for NGS of the transcriptomic library. Likewise, library prep of the PTA-amplified genomic DNA occurred, and the genomic library was analyzed via NGS. Resistant cells showed a loss of Chromosome 5 and a gain of 19q, consistent with karyotypic data, as depicted in FIG.
- FIG. 14D depicts a principal component analysis of the transcriptomics data of parental and resistant cells.
- a clustered heat map, as depicted in FIG. 14E, showed that resistant cells had an upregulation of the enhancer factor CEBPA (mutated in AML patients).
- GAS6 was also upregulated.
- Transcriptional bypass of FLT3 signaling by GAS6 upregulation can drive Axl signaling in resistant cells, as depicted in FIG. 14F.
- Full transcript (compared to end-counting) allows for insights into exon usage, as depicted in FIGS. 14G-14H. Isoform biases in parental versus resistant cells manifest both as alternative 5’ exon utilization (PPP1R14B) and alternative internal exon utilization (HADHA), resulting in different transcript lengths.
- the genomic and transcriptomic data can be correlated. Linking the SNV and transcription modulation data reveals that an intronic single nucleotide genotypic shift between parental and resistant cells within the MYC gene correlated with differential MYC transcript levels. Results are depicted in FIGS. 16A-16C. Overall, the genome had approximately two orders of magnitude more plasticity than the transcriptome. There were 300 expression variants and 28,134 genetic variants. Genome plasticity drove greater differentiation of cell clusters. These cell foundational changes were verified within the transcriptome. The evolutionary pressure on the drug resistance is high.
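As a quick check of the "approximately two orders of magnitude" statement, the ratio of the two variant counts reported above can be computed directly. This is only a worked arithmetic sketch using the two numbers given in the text; the variable names are illustrative.

```python
import math

expression_variants = 300    # expression variants reported in the text
genetic_variants = 28_134    # genetic variants reported in the text

# Fold difference between genetic and expression variant counts.
ratio = genetic_variants / expression_variants

# Same difference expressed in orders of magnitude (base-10 logarithm).
orders_of_magnitude = math.log10(ratio)

print(f"ratio = {ratio:.1f}, log10(ratio) = {orders_of_magnitude:.2f}")
```

The ratio is a little under 100-fold, so the base-10 logarithm is just below 2, consistent with the rounded figure in the text.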
- EXAMPLE 4 A MULTIOMIC VIEW OF DUCTAL CARCINOMA IN SITU (DCIS)/INVASIVE DUCTAL CARCINOMA
- a 7 cm DCIS (grade II) and a 1.2 cm invasive cancer (grade I) were analyzed.
- the cancer was ER+ PR+ HER2-.
- Normal and tumor tissue were digested to single cells.
- the tissue was stained with H&E staining and formalin-fixed, paraffin embedded prior to genomic DNA isolation (FIG. 17).
- the transcriptome and genome were analyzed using the methods described in Example 5.
- Known DCIS copy number alterations harbor prototypical tumor suppressor genes, as depicted in FIG. 18B.
- Patient 2 had 10/13 cells with a PIK3CA E545K mutation.
- Patient 3 had 0/8 cells with PIK3CA mutations.
- SNV and CNV were compared across the 19 cells analyzed.
- a principal component analysis of the gene expression profiles results in a separation of EpCAM high and low cells, as depicted in FIG. 21. Clustering by genes enriched in breast cancer showed low levels of expression in the EpCAM low cells. IL-2 and CD4 expression suggests these cells are tumor infiltrating lymphocytes.
- RNA mechanisms of resistance were jointly identified, including transcriptional bypass mechanisms in response to drug treatment. Unification of these DNA/RNA data identified candidate regulatory SNVs proximal to genes differentially influencing their expression between parental and resistant cells, thereby exposing novel genes and modes of drug resistance.
- EXAMPLE 5 DEVELOPING A VACCINATION TARGET BASED ON A “HIGH PENETRANCE” TARGET
- Following the general procedure of Example 1, the data is used to create a vaccination target to a specific “high penetrance” target according to the workflow of FIG. 9.
- If a change is detected in the genome (at a single base, as a translocation, or as a copy number variant) and the same mutation is also detected in the transcriptome, then the penetrance for this change may be high.
- This can also involve a mutation in a promoter, enhancer, or pioneer factor for a splice variant.
- a splice variant arising from an alternate single nucleotide variant. If for example, this splice variant codes for a surface marker presentation or translation/expression.
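The idea that a change observed in both the genome and the transcriptome may have high penetrance can be sketched as a simple filter. This is an illustrative assumption only: the `Change` record, its field names, and the boolean rule are hypothetical, and they do not reproduce the penetrance score of FIG. 3, whose calculation is not given in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """One detected change (hypothetical record layout)."""
    name: str
    in_genome: bool          # seen as an SNV, translocation, or CNV
    in_transcriptome: bool   # the same change also seen in RNA data


def may_be_high_penetrance(change):
    """Flag a change only when both data layers support it."""
    return change.in_genome and change.in_transcriptome


# Made-up example records, not results from this disclosure.
changes = [
    Change("change_A", in_genome=True, in_transcriptome=True),
    Change("change_B", in_genome=True, in_transcriptome=False),
    Change("change_C", in_genome=False, in_transcriptome=True),
]

candidates = [c.name for c in changes if may_be_high_penetrance(c)]
```

A real scoring scheme would likely be graded rather than boolean (for example, weighting by the fraction of cells carrying the change), but that goes beyond what the text specifies.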
- the same genomic or transcriptomic sequence can be used to target the immune system to this specific cell with the specific mutation or genomic, transcriptomic, or proteomic alteration.
- this oligonucleotide can be introduced to the same animal (person or study subject) to elicit a response to this modified gene in its transcriptomic or proteomic state.
- a dendritic cell may be “reprogrammed” with this information.
- the methods and systems of the present disclosure (Resolve amplification of cDNA) enable full-length synthesis of most transcripts found in the cell.
- the cDNA is enriched across its entire length. Therefore, a bias of amplification or subsequent sequencing reads from the 5’ or 3’ end of a transcript does not occur.
- FIGs. 22A-22C show example data illustrating this point.
- ResolveOME refers to data generated using the methods and systems of the present disclosure.
- Droplet-RNAseq shows an example data set generated using a system other than the systems of the present disclosure as a comparison.
- ResolveOME a method according to the present disclosure which may comprise using PTA
- the methods of the present disclosure demonstrate a superior performance in analyzing RNA with high coverage over a wide range of 5’-3’ gene body percentile values, as shown on the graph. Conversely, the Droplet-RNAseq method leads to low coverage in the early sections of the x-axis and higher coverage further along the x-axis and toward the end. As such, this data set is asymmetric and biased.
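One simple way to quantify the kind of end bias described here is to compare mean coverage in the 3’ half of the gene body with mean coverage in the 5’ half. The sketch below is an assumption for illustration: the function name `end_bias`, the ten-bin profiles, and their values are made up and are not the data plotted in the figure.

```python
def end_bias(coverage):
    """Ratio of mean coverage in the 3' half to the 5' half of a profile.

    `coverage` is a list of normalized coverage values ordered from the
    5' end to the 3' end of the gene body. A value near 1.0 indicates
    even coverage; values well above 1.0 indicate 3' bias.
    """
    if len(coverage) < 2:
        raise ValueError("need at least two bins")
    half = len(coverage) // 2
    five_prime = sum(coverage[:half]) / half
    three_prime = sum(coverage[half:]) / (len(coverage) - half)
    if five_prime == 0:
        raise ValueError("no coverage in the 5' half")
    return three_prime / five_prime


# Made-up profiles for illustration only.
uniform_profile = [1.0] * 10
three_prime_biased = [0.1, 0.1, 0.2, 0.2, 0.3, 0.5, 0.7, 0.9, 1.0, 1.0]

uniform_score = end_bias(uniform_profile)
biased_score = end_bias(three_prime_biased)
```

With these made-up inputs the uniform profile scores 1.0 and the skewed profile scores well above 1, which is the asymmetry the text attributes to end-counting methods.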
- FIG. 22B shows a graph demonstrating single cell analysis data comparing ResolveOME to droplet RNAseq in terms of transcript length.
- This graph demonstrates that ResolveOME (methods and systems of the present disclosure involving PTA) shows superior RNA performance with respect to increased representation across various transcript sizes. This coverage is shown across a broader and longer set of transcript lengths.
- the competing technology, droplet RNAseq, starts losing enrichment after 1.5 kb, while ResolveOME (the method of the present disclosure) is capable of amplifying and detecting transcripts over 4 kb.
- FIG. 22C shows an additional graph characterizing number of detected DNA variants per cell vs. number of detected genes per cell.
- the data set on the top (depicted with circles) is generated using the methods and systems of the present disclosure, demonstrating more robust variant calling for a wider range of number of detected genes per cell.
- the competing technology (depicted in squares) detects variants over a narrower range of number of detected genes per cell. As such, the competing technology is more limited.
- the methods and systems of the present disclosure demonstrate a number of detected RNA variants per cell ranging from 150 to 2750.
- FIGs 23A-23C demonstrate unification of genomic lesions and gene expression in AML model of drug resistance.
- FIG. 23A shows differential transcript utilization (DTU) between MOLM-13 parental and drug-resistant single cells. Color intensity indicates transcript proportion of A or B isoform of indicated transcript.
- FIG. 23B shows a heatmap with transcripts on the y-axis that show a statistical (ZLM p < 0.01) association with ploidy level across all cells in the MOLM-13 dataset. Color of the tiles represents the average standardized expression value at a given ploidy level.
- the right panel shows the output of the ZLM model testing the expression given the ploidy. Bars are colored based on the -log10 p-value of the ZLM model testing transcriptional differences between parental and resistant cells.
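The bar coloring described here relies on the negative base-10 logarithm of a p-value, a transform that turns small p-values into large positive numbers. A minimal sketch follows; the helper name `neg_log10` and the example p-values are illustrative assumptions, not values from the figure.

```python
import math


def neg_log10(p):
    """Return -log10(p) for a p-value in the interval (0, 1]."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p-value must be in (0, 1]")
    return -math.log10(p)


# Illustrative p-values only.
threshold = 0.01
scores = {p: neg_log10(p) for p in (0.5, 0.05, 0.01, 0.0001)}

# A p-value passes the p < 0.01 cutoff exactly when its score exceeds 2.
passing = [p for p in scores if p < threshold]
```

On this scale each additional unit corresponds to a tenfold smaller p-value, which is why it is a common choice for color scales in heatmaps.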
- FIG. 23C shows a bubble plot of SNV-transcript expression associations (p < 0.05) determined by ZLM modeling between parental and resistant cells.
- Candidate SNVs are shown in the y-axis and genotypes in the x-axis. Size of the circle denotes the genotype prevalence of the variant in the MOLM-13 cell type set (parental or resistant). Colors of points denote the standardized mean expression level of the transcript in the set.
- ENCODE genotypic features mapping to the given single nucleotide variant are indicated in the right bar and are categorized in the heatmap as regulatory (top) or genic (bottom).
- Bacterial colonies represent a unique opportunity for naturally discrete cells which tend to have accelerated evolutionary forces. Being able to process a high number of bacteria could have a tangible impact on human health.
- AMR antimicrobial resistance
Abstract
The present disclosure provides methods and systems for performing experiments and computational methods for generating, analyzing, and using multi-omics data and leveraging such multiomics data and computational analysis for applications such as identifying biomarkers, diagnostics, prognostics, drug and vaccine discovery and development, personalized and precision medicine, and any combination thereof. In some aspects, a correlation between genomics data and transcriptomics/proteomics data are used to determine the effects of a genetic event on a transcriptomics/proteomics effect and/or the effect of a genomics event in development of the course of a disease. Such information and analyses are then used for the aforementioned applications.
Description
METHODS AND SYSTEMS FOR MULTIOMIC ANALYSIS
CROSS-REFERENCE
[0001] This application claims the benefit of U.S. Provisional Application No. 63/392,580, filed on July 27, 2022, which is incorporated herein by reference in its entirety and for all purposes.
BACKGROUND
[0002] The present disclosure is generally related to the fields of genomics, transcriptomics, and bioinformatics and high-throughput single cell analysis. High-throughput single cell analysis can provide extensive and valuable information about a subject (e.g., a human patient) or a population which can be used to make informed decisions regarding health-related matters. Such methods and systems may have vast applications in diagnostics, prognostics, personalized and precision medicine, drug design, discovery, and development.
SUMMARY
[0003] There is an unmet need for comprehensive and effective approaches to generate one or more datasets including genomics, transcriptomics, proteomics, and methylomics, and to use such datasets to identify correlations (in some cases direct correlations) therebetween, such as to diagnose patients, identify biomarkers, design therapeutics or vaccines, prescribe medications, and/or implement individualized/personalized medicine approaches. In some aspects, provided herein is a comprehensive approach comprising elements of high-throughput single cell analysis, genomics, transcriptomics, proteomics, bioinformatics, software engineering, and data analysis for generating and analyzing data sets that have vast applications for identifying disease biomarkers, diagnosing patients, and designing drugs or vaccines. Provided are also methods for conducting such biomarker identifications, diagnosis, prognosis, and drug design.
[0004] In an aspect, provided herein is a method of single cell analysis comprising: (a) providing or obtaining a plurality of cells; (b) performing one or more experiments on single cells of the plurality of cells to generate at least a first data set and a second data set from the plurality of cells, wherein the first data set is a genomic data set and the second data set is a transcriptomic data set and/or a proteomic data set; (c) identifying a correlation between the first data set and the second data set for at least a portion of the plurality of cells; and (d) using the correlation obtained in (c), identifying a disease biomarker, designing a therapeutic, or designing a vaccine for a disease.
[0005] In some embodiments, performing the one or more experiments comprises performing primary template directed amplification (PTA). In some embodiments, the one or more experiments or screens comprise a genomics experiment, a transcriptomic experiment, a proteomics experiment, or any combination thereof. In some embodiments, the one or more experiments comprise high-throughput single cell analysis, wherein single cells of the plurality of cells are screened in high-throughput. In some embodiments, the one or more experiments are performed using a miniaturized high-throughput single cell screening system. In some embodiments, the method comprises compartmentalizing the plurality of cells into a plurality of partitions, wherein a partition of the plurality of partitions comprises a single cell of the plurality of cells. In some embodiments, the plurality of partitions comprises a plurality of wells, a plurality of droplets, or both. In some embodiments, the wells are miniaturized wells. In some embodiments, the miniaturized high-throughput single cell screening system comprises a microfluidic device, a miniaturized array, or both.
[0006] In some embodiments, the one or more experiments comprise performing one or more reactions. In some embodiments, a partition of the plurality of partitions comprises a single cell therein, and the one or more experiments or screens comprise performing one or more reactions on the single cell in the partition. In some embodiments, the one or more reactions comprise cell lysis. In some embodiments, the one or more reactions comprise an amplification reaction. In some embodiments, the amplification reaction comprises primary template directed amplification (PTA).
[0007] In some embodiments, the one or more reactions comprise lysing the single cell, extracting the molecular information of the single cell, thereby releasing cellular nucleic acids, proteins, lipids, and metabolites from the single cell in the partition, and performing an amplification reaction on a cellular nucleic acid molecule.
[0008] In some embodiments, performing the one or more reactions comprises using one or more reagents. In some embodiments, the one or more reagent(s) comprise one or more of at least one amplification primer, at least one nucleic acid polymerase, and a mixture of nucleotides, wherein the mixture of nucleotides comprises at least one terminator nucleotide which terminates nucleic acid replication by the polymerase.
[0009] In some embodiments, the terminator nucleotide is an irreversible terminator. In some embodiments, the terminator nucleotide is selected from the group consisting of nucleotides with modification to the alpha group, C3 spacer nucleotides, locked nucleic acids (LNA), inverted nucleic acids, 2' fluoro nucleotides, 3' phosphorylated nucleotides, 2'-O-Methyl modified nucleotides, and trans nucleic acids. In some embodiments, the nucleotides with modification to the alpha group are alpha-thio dideoxynucleotides. In some embodiments, the terminator nucleotide comprises modifications of the r group of the 3' carbon of the deoxyribose. In some embodiments, the terminator nucleotide is selected from the group consisting of dideoxynucleotides, inverted dideoxynucleotides, 3' biotinylated nucleotides, 3' amino nucleotides, 3'-phosphorylated nucleotides, 3'-O-methyl nucleotides, 3' carbon spacer nucleotides including 3' C3 spacer nucleotides, 3' C18 nucleotides, 3' Hexanediol spacer nucleotides, acyclonucleotides, and combinations thereof.
[0010] In some embodiments, a partition of the plurality of partitions comprises at least a single cell and a bead. In some embodiments, the bead delivers a reagent for performing a reaction on the single cell in the partition. In some embodiments, the reagent is bound to the bead via a cleavable linker and is configured to be released from the bead via cleavage of the cleavable linker. In some embodiments, the reagent comprises a barcode configured to identify the cell or a constituent of the cell. In some embodiments, the bead can envelop the entire cell to enable chemical reactions at a miniaturized scale.
[0011] In some embodiments, the constituent of the cell comprises genomic material of the cell, ribonucleic acid (RNA), deoxyribonucleic acid (DNA), or any combination thereof. In some embodiments, the method comprises lysing the cell in the partition, releasing a cellular nucleic acid molecule of the cell in the partition, releasing the barcode from the bead via cleavage of the cleavable linker, and hybridizing the cellular nucleic acid molecule to the barcode. In some embodiments, the one or more reactions comprise lysing the single cell, thereby releasing cellular nucleic acid molecules in the partition, performing one or more amplification reactions on the cellular nucleic acid molecules thereby generating amplified cellular nucleic acid molecules, and wherein the method further comprises extracting the amplified cellular nucleic acid molecules from the partition, and sequencing the amplified cellular nucleic acid molecules.
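By way of illustration only, the barcode-based identification of cells described above can be sketched computationally as grouping sequencing reads by cell barcode. The sketch below assumes, hypothetically, that each read begins with a fixed-length barcode followed by the cellular insert; the barcode length, the function name `demultiplex`, and the example reads are not taken from the disclosure.

```python
from collections import defaultdict

BARCODE_LEN = 8  # assumed barcode length; not specified in the disclosure


def demultiplex(reads, known_barcodes, barcode_len=BARCODE_LEN):
    """Group read inserts by the cell barcode found at the start of each read.

    Reads whose prefix is not in `known_barcodes` are collected under None.
    """
    by_cell = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_len], read[barcode_len:]
        key = barcode if barcode in known_barcodes else None
        by_cell[key].append(insert)
    return dict(by_cell)


# Made-up reads: two cells plus one read with an unrecognized barcode
reads = ["AAAACCCCTTGACA", "GGGGTTTTACGTAC", "AAAACCCCGGATCC", "NNNNNNNNACGT"]
cells = demultiplex(reads, {"AAAACCCC", "GGGGTTTT"})
```

In this sketch, reads with unrecognized barcodes are retained under a `None` key rather than discarded, so that barcode error rates can be inspected afterwards.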
[0012] In some embodiments, generating the first data set comprises performing primary template directed amplification (PTA) and generating the second data set comprises performing a reverse transcription reaction. In some embodiments, performing the reverse transcription reaction comprises generating a cDNA library. In some embodiments, generating the first data set comprises determining a methylation site in a cellular nucleic acid molecule using PTA, thereby generating a methylation library. In some embodiments, the method further comprises comparing the methylation library to a reference library for a single cell of the plurality of cells, wherein the methylation library and the reference library are generated from the same cell.
[0013] In some embodiments, identifying the correlation comprises calculating or assigning a penetrance score to the correlation of these molecular data (biomarkers), wherein the penetrance score quantifies the correlation. In some embodiments, the penetrance score guides identifying the disease biomarker, identifying a collection of biomarkers which may comprise one or more of the multiomic modalities, designing the therapeutic, designing the vaccine for the disease, or any combination thereof. In some embodiments, a high penetrance score indicates a strong correlation between the first data set and the second data set. In some embodiments, the high penetrance score indicates that the expression of a gene identified in the first data set leads to a transcriptomic event, a proteomic event, or both, and wherein the gene is identified as a disease biomarker. In some embodiments, a low penetrance score indicates a weak correlation between the first data set and the second data set, and that the expression of a gene identified in the first data set does not lead to a transcriptomic event, a proteomic event, or either, and wherein the gene is not identified as a disease biomarker.
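By way of illustration only, the following sketch shows one way a penetrance-style score could quantify the correlation between a genomic event and a transcriptomic event across single cells. No formula is specified at this point in the disclosure; the use of the absolute Pearson correlation, the names `penetrance_score`, `has_variant`, and `expression`, and the example values are assumptions made for this example.

```python
from math import sqrt


def penetrance_score(has_variant, expression):
    """Hypothetical score in [0, 1]: absolute Pearson correlation between a
    per-cell genotype call (0 or 1) and a per-cell expression value."""
    if len(has_variant) != len(expression) or len(has_variant) < 2:
        raise ValueError("need two equal-length per-cell vectors")
    n = len(has_variant)
    mean_g = sum(has_variant) / n
    mean_e = sum(expression) / n
    cov = sum((g - mean_g) * (e - mean_e) for g, e in zip(has_variant, expression))
    var_g = sum((g - mean_g) ** 2 for g in has_variant)
    var_e = sum((e - mean_e) ** 2 for e in expression)
    if var_g == 0 or var_e == 0:
        return 0.0  # no variation in one data set: no measurable correlation
    return abs(cov / sqrt(var_g * var_e))


# Made-up values for six cells: three carry the variant, three do not
genotype = [1, 1, 1, 0, 0, 0]
strong = [9.0, 8.5, 9.5, 1.0, 1.5, 0.5]  # expression tracks the genotype
weak = [5.0, 1.0, 3.0, 5.0, 1.0, 3.0]    # expression unrelated to genotype
print(round(penetrance_score(genotype, strong), 2))
print(round(penetrance_score(genotype, weak), 2))
```

Under these assumptions, a value near 1 would correspond to the high penetrance score described above and a value near 0 to the low penetrance score.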
[0014] In some embodiments, identifying the correlation is performed with the aid of a computer system comprising a computer program. In some embodiments, the computer program comprises one or more bioinformatics algorithms or workflows. In some embodiments, the first data set and the second data set are combined or integrated into a database with or without links to related datasets independently generated across the research community.
[0015] In some aspects, described herein is a system for determining a penetrance score comprising: a computing system comprising at least one processor and instructions executable by the at least one processor to provide an application configured to perform operations comprising: receiving multiomics data from one or more sources and at least one biological state; and applying an algorithm configured to process the data and generate a penetrance score. In some embodiments, the computing system comprises a cloud computing platform. In some embodiments, the multiomics data comprises data obtained from analysis of one or more of genomic DNA, transcript RNA, proteins, lipids, or metabolites.
[0016] In some embodiments, the correlation is quantified by a penetrance score. In some embodiments, the penetrance score is at least 0.5. In some embodiments, the penetrance score is at least 0.9.
[0017] In an aspect, provided herein is a method of developing a treatment for a disease, wherein the method comprises: (a) generating multiomics data from one or more single cells, wherein generating comprises performing Primary Template Directed Amplification (PTA), and wherein the multiomics data comprises two or more of genome data, transcriptome data, and proteomics data; (b) correlating one or more mutations in genome data with corresponding mutations in one or both of (i) an mRNA of the transcriptome data and (ii) a protein of the proteome data; and (c) generating a treatment targeting one or both of the mRNA and the protein, thereby developing the treatment for the disease. In some embodiments, the disease comprises or is cancer.
[0018] In some embodiments, the treatment comprises an mRNA vaccine. In some embodiments, the treatment comprises reprogramming a dendritic cell to target one or both of the mRNA or protein. In some embodiments, the mutation in genome data comprises a DNA mutation. In some embodiments, the DNA mutation is selected from the group consisting of SNV*X, CNV*X, translocation, INDEL, frameshift, stop codon, mitochondrial, promoter/enhancer, TCR/BCR, and other change. In some embodiments, the mRNA comprises a transcript change. In some embodiments, the transcript change is selected from the group consisting of expression, splice variant, fusion, lncRNA, miRNA, TCR/BCR, promoter, truncated gene, mitochondrial, or mutation.
[0019] In some embodiments, the protein comprises a protein change. In some embodiments, the protein change is selected from the group consisting of over/under expressed, truncated, surface bound, frameshift, misfolded, metabolic, ligand independence, conformation, activity change, or fused.
[0020] In some embodiments, the multiomics data comprises one or more measurements. In some embodiments, one or more of the measurements is a silent change. In some embodiments, the multiomics data comprises data from one or more of a genome, a transcriptome, a proteome, a metabolome, a lipidome, or an epigenome. In some embodiments, the multiomics data comprises data from a genome. In some embodiments, the one or more measurements are selected from the group consisting of: copy number variation, translocation, and mutation burden.
[0021] In some embodiments, the disease comprises cancer. In some embodiments, the cancer comprises breast cancer. In some embodiments, the breast cancer comprises ductal carcinoma. In some embodiments, the cancer comprises leukemia. In some embodiments, the single cells (e.g., single cancer cells) are obtained from an FFPE sample.
[0022] In some embodiments, the multiomics data comprises data from a methylome. In some embodiments, the one or more measurements are selected from the group consisting of: methylation at CpG sites, gene activation, and gene repression. In some embodiments, the multiomics data comprises data from a transcriptome. In some embodiments, the one or more measurements are selected from the group consisting of: expressed genes, gene fusions, and splice variants.
[0023] In some embodiments, the multiomics data comprises data from a proteome. In some embodiments, the one or more measurements are selected from the group consisting of: translation level, phosphorylation state, and protein modification. In some embodiments, the one or more sources comprise an individual organism. In some embodiments, the one or more sources comprise cells. In some embodiments, the cells are mammalian cells, human cells, bacterial cells, cancer cells, an immortalized cell line, a primary patient cell line, or any combination thereof. In some embodiments, the cells are obtained from a tissue. In some embodiments, the cells are obtained from a tissue cross-section. In some embodiments, the biological state comprises a disease state. In some embodiments, the disease state comprises cancer.
[0024] In some embodiments, the algorithm further generates a mechanism based on the data. In some embodiments, the mechanism is generated by detecting one or more changes in one or more measurements.
[0025] In some embodiments, the change comprises a genome DNA change. In some embodiments, the genome DNA change is selected from the group consisting of SNV*X, CNV*X, translocation, INDEL, frameshift, stop codon, mitochondrial, promoter/enhancer, TCR/BCR, and other change. In some embodiments, the change comprises a transcript change. In some embodiments, the transcript change is selected from the group consisting of expression, splice variant, fusion, lncRNA, miRNA, TCR/BCR, promoter, truncated gene, mitochondrial, or mutation. In some embodiments, the change comprises a protein change. In some embodiments, the protein change is selected from the group consisting of over/under expressed, truncated, surface bound, frameshift, misfolded, metabolic, ligand independence, conformation, activity change, or fused. In some embodiments, the mechanism is determined to be one or more of a genomic, transcriptomic, proteomic, lipidomic, or metabolomic mechanism.
[0026] In some aspects, described herein is a method for validating a disease target for a disease comprising (a) selecting cells from a tissue; (b) banking the cells; (c) performing one or more multiomic methods on the cells to generate multiomics data; and (d) applying a computer algorithm to process the multiomics data and generate a disease target. In some embodiments, selecting the cells comprises FACS sorting, microfluidics, spatial cell selection, or ultra-high throughput cell sorting. In some embodiments, the number of cells is at least about 200, 300, 400, 500, 1000, 2000, 3000, 4000, 5000, 6000, 10,000 or greater. In some embodiments, the disease is cancer. In some embodiments, the multiomics methods comprise PTA. In some embodiments, the multiomics data comprises data from one or more of a genome, epigenome, transcriptome, proteome, lipidome, or metabolome. In some embodiments, the method further comprises a treatment based on the disease target. In some embodiments, the treatment comprises an mRNA vaccine or small molecule.
[0027] In some embodiments, the method or system is capable of detecting a number of RNA variants per cell of at least 750, 1000, 1500, 2000, 2500 or higher. In some embodiments, the method or system is capable of detecting a number of genes per cell of from about 1000 to about 8000. In some embodiments, the method or system is capable of detecting a number of RNA variants per cell of at least 750, 1000, 1500, 2000, 2500 or higher and a number of genes per cell of from about 1000 to about 8000.
[0028] In some embodiments, the methods comprise full length synthesis of RNA transcripts in the cell, wherein a plurality of amplification products achieved from performing the method are substantially unbiased over a range of 5’-3’ gene body percentiles.
[0029] In some embodiments, the methods and systems of the present disclosure are capable of amplifying and detecting transcripts of at least 1 kb, 1.5 kb, 2 kb, 2.5 kb, 3 kb, 3.5 kb, 4 kb, 4.5 kb, or longer. In some embodiments, these transcripts may consist of coding information from one or more genes and represent aberrations of splicing which can affect, without limitation, transcript isoforms or gene fusion events.
INCORPORATION BY REFERENCE
[0030] All publications, patents, and patent applications mentioned in this specification are herein incorporated by reference to the same extent as if each individual publication, patent, or patent application was specifically and individually indicated to be incorporated by reference.
BRIEF DESCRIPTION OF THE DRAWINGS
[0031] The novel features of the invention are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present invention will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the invention are utilized, and the accompanying drawings of which:
[0033] FIG. 1 depicts a workflow comprising providing a sample, cell selection, and multiomic analysis (including genome, methylome, transcriptome, and proteome);
[0034] FIG. 2 depicts various multiomic modalities which contribute to penetrance score;
[0035] FIG. 3 depicts a workflow schematic of measuring penetrance score using multiomic analysis;
[0036] FIG. 4 depicts a list of various biological inquiries useful for multiomics measurements;
[0037] FIG. 5 depicts another workflow schematic for the types of changes in multiomics measurements which in some instances is used for determining a mechanism;
[0038] FIG. 6 depicts a workflow schematic for spatially selecting cells from a frozen specimen, banking the cells, performing multiomic chemistry processes, providing multiomic data/measurements to a computational engine process, and validating targets;
[0039] FIG. 7 depicts a schematic of factors which in some instances dictate cell fate;
[0040] FIG. 8 schematically illustrates the various components and applications of the methods and systems of the present disclosure;
[0041] FIG. 9 depicts a workflow schematic for mammalian and bacterial multiomics analysis using the methods and systems of the present disclosure;
[0042] FIG. 10 schematically illustrates a workflow involving the computational components and systems of the present disclosure;
[0043] FIG. 11 depicts an example of a web/mobile application provision system; in this case, a system providing browser-based and/or native mobile user interfaces;
[0044] FIG. 12 depicts an example of a cloud-based web/mobile application provision system; in this case, a system comprising elastically load balanced, auto-scaling web server and application server resources as well as synchronously replicated databases;
[0045] FIG. 13 depicts change in cellular growth rates of MOLM-13 cell lines in the presence of the cancer drug quizartinib where resistant clones are thriving over genetically native cells;
[0046] FIG. 14A depicts a genomic view of allele variation in the FLT3 gene in resistant and parental strains;
[0047] FIG. 14B depicts CNV genomic data of resistant and parental strains;
[0048] FIG. 14C depicts karyotypes of resistant and parental strains;
[0049] FIG. 14D depicts a principal component analysis of the transcriptomics data of parental and resistant cells;
[0050] FIG. 14E depicts a clustered heat map of transcriptomic data;
[0051] FIG. 14F depicts a mechanism for transcriptional bypass of FLT3 signaling in resistant cells;
[0052] FIGS. 14G-14H depict alternative exon utilization in transcriptional data;
[0053] FIG. 15A depicts a PCA of SNV data, showing discrimination between groups based on genomic variation;
[0054] FIG. 15B depicts clustered SNV data, showing groups of genomic positions with similar zygosity across biological groups;
[0055] FIG. 16A depicts SNV-gene expression interactions, highlighting specific mutations within genes associated with expression changes that are significant across biological groups;
[0056] FIG. 16B depicts the location of a SNV in the MYC gene;
[0057] FIG. 16C depicts a plot of MYC gene expression and SNV genotype for the parental and resistant cells showing similar grouping of resistant cells with the signature;
[0058] FIG. 17 depicts H&E and a-ER staining of the primary cancer cells prior to sequencing;
[0059] FIG. 18A depicts heterogeneity in CNV in primary breast cancer cells;
[0060] FIG. 18B depicts known CNV in DCIS;
[0061] FIG. 19 depicts SNV PIK3CA mutations detected in single cells derived from 3 separate patients;
[0062] FIG. 20 depicts SNV and CNV detected in single cells of a DCIS patient;
[0063] FIG. 21 depicts correlations between genomic and transcriptomic data;
[0064] FIGs. 22A-22C show experimental data generated using the methods and systems of the present disclosure (ResolveOME) and its comparison to droplet RNA sequencing demonstrating superior RNA performance with respect to enhanced gene body coverage, increased representation across transcript sizes, and robust variant calling;
[0065] FIG. 23A shows significant isoforms across parental or resistant clones of MOLM-13 (transcript ‘A’ and ‘B’) from the same genes;
[0066] FIG. 23B shows transcripts that are significantly associated to changes in copy number ploidy across the genomes of MOLM-13 cells;
[0067] FIG. 23C shows genomic variants of MOLM-13, in regulatory regions of the genome (depicted by color), that are also significantly associated with transcript changes across resistant cells.
DETAILED DESCRIPTION
[0068] There is an unmet need for comprehensive and effective approaches to generate one or more datasets including genomics, transcriptomics, proteomics, and methylomics, and identifying correlations therebetween, such as to diagnose patients, identify biomarkers, design therapeutics or vaccines, prescribe medications, and/or implement individualized/personalized medicine approaches. In some aspects, provided herein is a comprehensive approach comprising elements of high-throughput single cell analysis, genomics, transcriptomics, proteomics, bioinformatics, software engineering, and data analysis for generating and analyzing data sets that have vast applications for identifying disease biomarkers, diagnosing patients, and designing drugs or vaccines. Provided are also methods for conducting such biomarker identifications, diagnosis, prognosis, and drug design.
[0069] Provided herein are systems and methods for processing and visualization of biological data (e.g., biomarkers). Further provided herein are systems and methods that result in generating a penetrance score. Further provided herein are systems and methods to interrogate disease mechanisms. Further provided herein are systems and methods for validating therapeutic targets using penetrance data and mechanism. Provided herein are systems and methods for providing accurate and scalable Primary Template-Directed Amplification (PTA) and sequencing in combination with additional cell analysis techniques (multiomics). Further provided herein are methods of multiomic analysis, including analysis of proteins, DNA, and RNA from single cells, and corresponding post-transcriptional or post-translational modifications in combination with PTA. Such methods and compositions facilitate highly accurate amplification of target (or “template”) nucleic acids, which increases accuracy and sensitivity of downstream applications, such as Next-Generation Sequencing.
[0070] The methods and systems described herein in some instances automate many of the required functions formerly requiring labor-intensive processes as well as dedicated personnel to curate, analyze, and interpret complex biological data.
[0071] Unless otherwise defined, all technical terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. As used in this specification and the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise. Any reference to “or” herein is intended to encompass “and/or” unless otherwise stated.
[0072] Unless defined otherwise, all terms of art, notations and other technical and scientific terms or terminology used herein are intended to have the same meaning as is commonly understood by one of ordinary skill in the art to which the claimed subject matter pertains. In some cases, terms with commonly understood meanings are defined herein for clarity and/or for ready reference, and the inclusion of such definitions herein should not necessarily be construed to represent a substantial difference over what is generally understood in the art.
[0073] Throughout this application, various embodiments may be presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the disclosure. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.
[0074] As used in the specification and claims, the singular forms “a”, “an” and “the” include plural references unless the context clearly dictates otherwise. For example, the term “a sample” includes a plurality of samples, including mixtures thereof.
[0075] The terms “determining,” “measuring,” “evaluating,” “assessing,” “assaying,” and “analyzing” are often used interchangeably herein and can refer to forms of measurement. The terms include determining if an element is present or not (for example, detection). These terms can include quantitative, qualitative or quantitative and qualitative determinations. Assessing can be relative or absolute. “Detecting the presence of” can include determining the amount of something present in addition to determining whether it is present or absent depending on the context.
[0076] As used herein, the term “gene” can refer to a linear sequence of nucleotides along a segment of DNA that provides the coded instructions for synthesis of RNA, which, when translated into protein, leads to the expression of hereditary character.
[0077] As used herein, the term “nucleic acid molecule” can mean DNA, RNA, single-stranded, double-stranded or triple-stranded and any chemical modifications thereof. Virtually any modification of the nucleic acid is contemplated. A “nucleic acid molecule” can be of almost any length, from 10, 20, 30, 40, 50, 60, 75, 100, 125, 150, 175, 200, 225, 250, 275, 300, 400, 500, 600, 700, 800, 900, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 6000, 7000, 8000, 9000, 10,000, 15,000, 20,000, 30,000, 40,000, 50,000, 75,000, 100,000, 150,000, 200,000, 500,000, 1,000,000, 1,500,000, 2,000,000, 5,000,000 or even more bases in length, including increments therein, up to a full-length chromosomal DNA molecule. For methods that analyze expression of a gene, the nucleic acid isolated from a sample is typically RNA.
[0078] A single-stranded nucleic acid molecule is “complementary” to another single-stranded nucleic acid molecule, in certain embodiments of the subject matter described herein, when it can base-pair (hybridize) with all or a portion of the other nucleic acid molecule to form a double helix (double-stranded nucleic acid molecule), based on the ability of guanine (G) to base pair with cytosine (C) and adenine (A) to base pair with thymine (T) or uridine (U). For example, the nucleotide sequence 5'-TATAC-3' is complementary to the nucleotide sequence 5'-GTATA-3'.
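By way of illustration only, the base-pairing rule in the preceding paragraph can be expressed as a short computation. The sketch below covers only complementarity over the full length of both strands (not pairing with a portion of a longer molecule), and the function names `reverse_complement` and `is_complementary` are chosen for this example.

```python
# Watson-Crick partners; uridine (U) pairs with adenine (A) as thymine (T) does
_PAIR = {"A": "T", "T": "A", "G": "C", "C": "G", "U": "A"}


def reverse_complement(seq):
    """Return the DNA strand, written 5'->3', that base-pairs with `seq` (5'->3')."""
    return "".join(_PAIR[base] for base in reversed(seq.upper()))


def is_complementary(a, b):
    """True if strand `a` pairs along its full length with strand `b` (U treated as T)."""
    return reverse_complement(a) == b.upper().replace("U", "T")


# The example from the text: 5'-TATAC-3' and 5'-GTATA-3'
print(reverse_complement("TATAC"))
print(is_complementary("TATAC", "GTATA"))
```

Because the two strands of a double helix are antiparallel, the partner strand is read in reverse order, which is why the complement is reversed before comparison.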
[0079] As used herein, the term “mutation” can refer to a change in the genome with respect to the standard wild-type sequence. Mutations can be deletions, insertions, or rearrangements of nucleic acid sequences at a position in the genome, or they can be single base changes at a position in the genome, referred to as “point mutations.” Mutations can be inherited, or they can occur in one or more cells during the lifespan of an individual. In some instances, mutation and variant are used synonymously.
[0080] As used herein, the term “kit” or “research kit” can refer to a collection of products that are used to perform a biological research reaction, procedure, or synthesis, such as, for example, a detection, assay, separation, purification, etc., which are typically shipped together, usually within a common packaging, to an end user.
[0081] Described herein is a cloud-based solution for the storage, query, and analysis of longitudinal data comprising a multiplicity of whole genomes, a large number of public and proprietary annotation sources as well as associated high quality phenotypic data, including microbiome metagenomes and metabolomics profiles. In various embodiments, the data analyzed by the platforms, systems, media, and methods described herein comprises more than 1,000, more than 5,000, more than 10,000, more than 20,000, more than 50,000, more than 100,000, more than 500,000, or more than 1,000,000 whole genomes.
[0082] The data analyzed by the platforms, systems, media, and methods described herein comprises genomic data. The genomic data is produced, by way of example, at a next generation sequencing (NGS) lab. In some cases, an AWS analysis pipeline based on Illumina’s HiSeq X and the ISIS Analysis Software are utilized to produce the genomic data. Sequencing reads are mapped to the hg38 human reference sequence and variant callers are used to call single nucleotide variants (SNVs) and insertions and deletions (indels). The genomic data comprises a multiplicity of unique SNVs. By way of examples, the genomic data comprises over 1 million, over 10 million, over 50 million, over 100 million, over 500 million, or over 1 billion unique SNVs.
[0083] The data analyzed by the platforms, systems, media, and methods described herein comprises metadata. The whole genomes are associated with high quality phenotypic information. A proprietary phenotype ingestion process enables the cleaning and standardization of phenotype data across disparate data sources. In some embodiments, the ingestion process includes: data integrity checks; standardization of units; standardization of terms; ontology/vocabulary mapping; and maintenance of the proprietary data dictionary.
[0084] In various embodiments, the phenotype data comprises more than 1000, more than 5000, more than 10,000, more than 100,000, more than 1,000,000, or more than 10,000,000 phenotype data fields with more than 1 million, more than 5 million, more than 10 million, more than 50 million, more than 100 million, more than 500 million, or more than 1 billion data points. Phenotypic data in some instances comprises cellular phenotype data. In some instances, cellular phenotypic data is obtained from microscopy. In some instances, cell phenotypic data comprises one or more observable phenotypic traits such as cell shape or morphology, size, texture, internal structure, patterns of distribution of one or more specific proteins, glycosylated proteins, nucleic acid molecules, lipid molecules, glycosylated lipid molecules, carbohydrate molecules, metabolites, and ions. In some instances, phenotypic data describes populations of cells described herein. In some instances, phenotypic data describes phenotypic traits of an organism such as a human. In some instances, phenotypic data comprises a clinical designation or category, for example, a clinical diagnosis, a clinical parameter name, a clinical parameter value, a laboratory test name or a laboratory test value. In some instances, a phenotype is associated with an observable disease characteristic.
[0085] The data analyzed by the platforms, systems, media, and methods described herein comprises annotation data. Annotation data is also cleaned and standardized through an automated end-to-end solution, which allows: idempotence, immutability, persistence; high quality data; consistency between data sources; and scalability and flexibility.
[0086] Samples described herein may represent biologic information obtained from individuals or populations of individuals (e.g., genomic information). In some instances, samples comprise single cells. In some instances, samples comprise 1, 2, 5, 10, 20, 25, 50, 75, 100, 200, 500, or more than 1000 cells from the same or different individual. In some instances, samples comprise 1000, 2000, 5000, 10,000, 20,000, 50,000, 75,000, or at least 100,000 cells from the same or different individual. Samples may be obtained from any species, including but not limited to viruses, bacteria, plants, fungi, protozoa, archaea, or animals. In some instances, samples are obtained from vertebrates. In some instances, samples are obtained from mammals. In some instances, samples are obtained from humans. Samples in some instances are obtained from any bodily fluid or tissue. In some instances, samples are obtained from diseased tissue such as a tumor.
[0087] In an aspect, provided herein is a method of single cell analysis comprising: (a) providing or obtaining a plurality of cells; (b) performing one or more experiments on single cells of the plurality of cells to generate at least a first data set and a second data set from the plurality of cells, wherein the first data set is a genomic data set and the second data set is a transcriptomic data set and/or a proteomic data set and/or a methylomic data set; (c) identifying a correlation between the first data set and the second data set for at least a portion of the plurality of cells; and (d) using the correlation obtained in (c), identifying a disease biomarker, designing a therapeutic, or designing a vaccine for a disease.
[0088] In some embodiments, performing the one or more experiments comprises performing primary template directed amplification (PTA). In some embodiments, the one or more experiments or screens comprise a genomics experiment, a transcriptomic experiment, a proteomics experiment, a methylomics experiment or any combination thereof. In some embodiments, the one or more experiments comprise high-throughput single cell analysis, wherein single cells of the plurality of cells are screened in high-throughput. In some embodiments, the one or more experiments are performed using a miniaturized high-throughput single cell screening system. In some embodiments, the method comprises compartmentalizing the plurality of cells into a plurality of partitions, wherein a partition of the plurality of partitions comprises a single cell of the plurality of cells. In some embodiments, the plurality of partitions comprises a plurality of wells, a plurality of droplets, or both. In some embodiments, the wells are miniaturized wells. In some embodiments, the miniaturized high-throughput single cell screening system comprises a microfluidic device, a miniaturized array, or both.
[0089] In some embodiments, the one or more experiments comprise performing one or more reactions. In some embodiments, a partition of the plurality of partitions comprises a single cell therein, and the one or more experiments or screens comprise performing one or more reactions on the single cell in the partition. In some embodiments, the one or more reactions comprise cell lysis. In some embodiments, the one or more reactions comprise an amplification reaction. In some embodiments, the amplification reaction comprises primary template directed amplification (PTA).
[0090] In some embodiments, the one or more reactions comprise lysing the single cell, extracting the genomic material of the single cell, thereby releasing a cellular nucleic acid molecule from the single cell in the partition, and performing an amplification reaction on the cellular nucleic acid molecule.
[0091] In some embodiments, performing the one or more reactions comprises using one or more reagents. In some embodiments, the one or more reagent(s) comprise one or more of at least one amplification primer, at least one nucleic acid polymerase, and a mixture of nucleotides, wherein the mixture of nucleotides comprises at least one terminator nucleotide which terminates nucleic acid replication by the polymerase.
[0092] In some embodiments, the terminator nucleotide is an irreversible terminator. In some embodiments, the terminator nucleotide is selected from the group consisting of nucleotides with modification to the alpha group, C3 spacer nucleotides, locked nucleic acids (LNA), inverted nucleic acids, 2' fluoro nucleotides, 3' phosphorylated nucleotides, 2'-O-Methyl modified nucleotides, and trans nucleic acids. In some embodiments, the nucleotides with modification to the alpha group are alpha-thio dideoxynucleotides. In some embodiments, the terminator nucleotide comprises modifications of the r group of the 3’ carbon of the deoxyribose. In some embodiments, the terminator nucleotide is selected from the group consisting of dideoxynucleotides, inverted dideoxynucleotides, 3' biotinylated nucleotides, 3' amino nucleotides, 3’-phosphorylated nucleotides, 3'-O-methyl nucleotides, 3' carbon spacer nucleotides including 3' C3 spacer nucleotides, 3' C18 nucleotides, 3' Hexanediol spacer nucleotides, acyclonucleotides, and combinations thereof.
[0093] In some embodiments, a partition of the plurality of partitions comprises at least a single cell and a bead. In some embodiments, the bead delivers a reagent for performing a reaction on the single cell in the partition. In some embodiments, the reagent is bound to the bead via a cleavable linker and is configured to be released from the bead via cleavage of the cleavable linker. In some embodiments, the reagent comprises a barcode configured to identify the cell or a constituent of the cell.
[0094] In some embodiments, the constituent of the cell comprises genomic material of the cell, ribonucleic acid (RNA), deoxyribonucleic acid (DNA), or any combination thereof. In some embodiments, the method comprises lysing the cell in the partition, releasing a cellular nucleic acid molecule of the cell in the partition, releasing the barcode from the bead via cleavage of the cleavable linker, and hybridizing the cellular nucleic acid molecule to the barcode. In some embodiments, the one or more reactions comprise lysing the single cell, thereby releasing cellular nucleic acid molecules in the partition, performing one or more amplification reactions on the cellular nucleic acid molecules thereby generating amplified cellular nucleic acid molecules, and wherein the method further comprises extracting the amplified cellular nucleic acid molecules from the partition, and sequencing the amplified cellular nucleic acid molecules.
[0095] In some embodiments, generating the first data set comprises performing primary template directed amplification (PTA) and generating the second data set comprises performing a reverse transcription reaction. In some embodiments, performing the reverse transcription reaction comprises generating a cDNA library. In some embodiments, generating the first data set comprises determining a methylation site in a cellular nucleic acid molecule using PTA, thereby generating a methylation library. In some embodiments, the method further comprises comparing the methylation library to a reference library for a single cell of the plurality of cells, wherein the methylation library and the reference library are generated from the same cell.
[0096] In some embodiments, identifying the correlation comprises calculating or assigning a penetrance score to the correlation, wherein the penetrance score quantifies the correlation. In some embodiments, the penetrance score guides identifying the disease biomarker, designing the therapeutic, designing the vaccine for the disease, or any combination thereof. In some embodiments, a high penetrance score indicates a strong correlation between the first data set and the second data set. In some embodiments, the high penetrance score indicates that the expression of a gene identified in the first data set leads to a transcriptomic event, a proteomic event or both, and wherein the gene is identified as a disease biomarker. In some embodiments, a low penetrance score indicates a weak correlation between the first data set and the second data set, and that the expression of a gene identified in the first data set does not lead to a
transcriptomic event, a proteomic event, or either, and wherein the gene is not identified as a disease biomarker.
[0097] In some embodiments, identifying the correlation is performed with the aid of a computer system comprising a computer program. In some embodiments, the computer program comprises a bioinformatics algorithm. In some embodiments, the first data set and the second data set are combined or integrated into a database.
[0098] In an aspect, provided herein is a method of developing a treatment for a disease, wherein the method comprises: (a) generating multiomics data from one or more single cells, wherein generating comprises performing Primary Template Directed Amplification (PTA), and wherein the multiomics data comprises two or more of genome data, transcriptome data, and proteomics data; (b) correlating one or more mutations in genome data with corresponding mutations in one or both of (i) an mRNA of the transcriptome data and (ii) a protein of the proteome data; and (c) generating a treatment targeting one or both of the mRNA and the protein, thereby developing the treatment for the disease. In some embodiments, the disease comprises or is cancer.
[0099] In some embodiments, the correlation is quantified by a penetrance score. In some embodiments, the penetrance score is at least 0.5. In some embodiments, the penetrance score is at least 0.9.
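The disclosure does not give a formula for the penetrance score, so the following is only an illustrative sketch under a stated assumption: the score is taken to be the fraction of single cells carrying a genomic variant in which a corresponding transcript-level change is also observed. The function name `penetrance_score` and the record keys `genomic_variant` and `transcript_change` are hypothetical and are not part of the disclosure. The 0.5 and 0.9 cutoffs are the thresholds named in paragraph [0099].

```python
from typing import Iterable, Mapping


def penetrance_score(cells: Iterable[Mapping[str, bool]]) -> float:
    """ASSUMED definition (not from the disclosure): fraction of
    variant-carrying cells that also show the transcript-level change."""
    carriers = [c for c in cells if c["genomic_variant"]]
    if not carriers:
        raise ValueError("no cells carry the genomic variant")
    concordant = sum(1 for c in carriers if c["transcript_change"])
    return concordant / len(carriers)


# Hypothetical per-cell observations from a paired genome/transcriptome run.
cells = [
    {"genomic_variant": True, "transcript_change": True},
    {"genomic_variant": True, "transcript_change": True},
    {"genomic_variant": True, "transcript_change": False},
    {"genomic_variant": False, "transcript_change": False},
]

score = penetrance_score(cells)   # 2 of 3 carriers are concordant
meets_0_5 = score >= 0.5          # threshold "at least 0.5"
meets_0_9 = score >= 0.9          # threshold "at least 0.9"
```

Other definitions (for example, a correlation coefficient between paired measurements) would fit the text equally well; the sketch only shows how a numeric score can be compared against the stated thresholds.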
[0100] In some embodiments, the treatment comprises an mRNA vaccine. In some embodiments, the treatment comprises reprogramming a dendritic cell to target one or both of the mRNA or protein. In some embodiments, the mutation in genome data comprises a DNA mutation. In some embodiments, the DNA mutation is selected from the group consisting of SNV*X, CNV*X, translocation, INDEL, frameshift, stop codon, mitochondrial, promoter/enhancer, TCR/BCR, and other change. In some embodiments, the mRNA comprises a transcript change. In some embodiments, the transcript change is selected from the group consisting of expression, splice variant, fusion, lncRNA, miRNA, TCR/BCR, promoter, truncated gene, mitochondrial, or mutation.
[0101] In some embodiments, the protein comprises a protein change. In some embodiments, the protein change is selected from the group consisting of over/under expressed, truncated, surface bound, frameshift, misfolded, metabolic, ligand independence, conformation, activity change, or fused.
[0102] In some embodiments, the disease comprises cancer. In some embodiments, cancer comprises breast cancer. In some embodiments, the breast cancer comprises ductal carcinoma. In some embodiments, the cancer comprises leukemia. In some embodiments, the single cells (e.g., single cancer cells) are obtained from an FFPE sample.
[0103] In some embodiments, the method or system is capable of detecting a number of RNA variants per cell of at least 750, 1000, 1500, 2000, 2500 or higher. In some embodiments, the method or system is capable of detecting a number of genes per cell of from about 1000 to about 8000. In some embodiments, the method or system is capable of detecting a number of RNA variants per cell of at least 750, 1000, 1500, 2000, 2500 or higher and a number of genes per cell of from about 1000 to about 8000.
[0104] In some embodiments, the methods comprise full length synthesis of RNA transcripts in the cell, wherein a plurality of amplification products achieved from performing the method are substantially unbiased over a range of 5’-3’ gene body percentiles.
[0105] In some embodiments, the methods and systems of the present disclosure are capable of amplifying and detecting transcripts of at least 1 kb, 1.5 kb, 2 kb, 2.5 kb, 3 kb, 3.5 kb, 4 kb, 4.5 kb, or longer.
Multiomics
[0106] Provided herein are methods for multiomics sample preparation and/or analysis. In some embodiments, multiomics may include analysis of at least one feature of a proteome, genome, transcriptome, metabolome, lipidome, or epigenome. Proteomics may include translation level, phosphorylation state, and protein modification. Transcriptomics may include, without limitations, analysis of ribosomal RNA (rRNA), messenger RNA (mRNA), transfer RNA (tRNA), micro-RNA (miRNA), and other non-coding RNA (ncRNA), or a combination thereof. Epigenomics may include, without limitations, analysis of methylation patterns (e.g., “methylome”) or histone modifications.
[0107] In some instances, a method comprises one or more steps of isolating a single cell from a population of cells, wherein the single cell comprises RNA and genomic DNA; amplifying the RNA by RT-PCR to generate a cDNA library; isolating the cDNA from the genomic DNA; contacting the genomic DNA with at least one amplification primer, at least one nucleic acid polymerase, and a mixture of nucleotides; and sequencing the cDNA library and the genomic DNA library. In some instances, the mixture of nucleotides comprises at least one terminator nucleotide which terminates nucleic acid replication by the polymerase to generate a genomic DNA library.
[0108] Methods described herein (e.g., PTA) may be used as a replacement for any number of other known methods in the art which are used for single cell sequencing (multiomics or the like). In some instances, a method described herein comprises PTA and a method of analyzing polyadenylated mRNA transcripts. In some instances, a method described herein comprises PTA and a method of analyzing non-polyadenylated mRNA transcripts. In some instances, a method described herein comprises PTA and a method of analyzing total (polyadenylated and non-polyadenylated) mRNA transcripts. PTA may substitute for genomic DNA sequencing methods such as MDA, PicoPlex, DOP-PCR, MALBAC, or target-specific amplifications. In some instances, PTA replaces the standard genomic DNA sequencing method in a multiomics method including DR-seq (Dey et al., 2015), G&T seq (MacAulay et al., 2015), scMT-seq (Hu et al., 2016), sc-GEM (Cheow et al., 2016), scTrio-seq (Hou et al., 2016), simultaneous multiplexed measurement of RNA and proteins (Darmanis et al., 2016), scCOOL-seq (Guo et al., 2017), CITE-seq (Stoeckius et al., 2017), REAP-seq (Peterson et al., 2017), scNMT-seq (Clark et al., 2018), or SIDR-seq (Han et al., 2018).
[0109] In some instances, PTA is combined with a standard RNA sequencing method to obtain genome and transcriptome data. In some instances, a multiomics method described herein comprises PTA and one of the following: Drop-seq (Macosko, et al. 2015), mRNA-seq (Tang et al., 2009), InDrop (Klein et al., 2015), MARS-seq (Jaitin et al., 2014), Smart-seq2 (Hashimshony, et al., 2012; Fish et al., 2016), CEL-seq (Jaitin et al., 2014), STRT-seq (Islam, et al., 2011), Quartz-seq (Sasagawa et al., 2013), CEL-seq2 (Hashimshony, et al. 2016), cytoSeq (Fan et al., 2015), SuPeR-seq (Fan et al., 2011), RamDA-seq (Hayashi, et al. 2018), MATQ-seq (Sheng et al., 2017), or SMARTer (Verboom et al., 2019).
[0110] Various reaction conditions and mixes may be used for generating cDNA libraries for transcriptome analysis. In some instances, an RT reaction mix is used to generate a cDNA library. In some instances, the RT reaction mixture comprises a crowding reagent, at least one primer, a template switching oligonucleotide (TSO), a reverse transcriptase, and a dNTP mix. In some instances, an RT reaction mix comprises an RNAse inhibitor. In some instances, an RT reaction mix comprises one or more surfactants. In some instances, an RT reaction mix comprises Tween-20 and/or Triton-X. In some instances, an RT reaction mix comprises Betaine. In some instances, an RT reaction mix comprises one or more salts. In some instances, an RT reaction mix comprises a magnesium salt (e.g., magnesium chloride) and/or tetramethyl ammonium chloride. In some instances, an RT reaction mix comprises gelatin. In some instances, an RT reaction mix comprises PEG (PEG1000, PEG2000, PEG4000, PEG6000, PEG8000, or PEG of other length).
[0111] Multiomic methods described herein may provide both genomic and RNA transcript information from a single cell (e.g., a combined or dual protocol). In some instances, genomic information from the single cell is obtained from the PTA method, and RNA transcript information is obtained from reverse transcription to generate a cDNA library. In some instances, a whole transcript method is used to obtain the cDNA library. In some instances, 3’ or 5’ end counting is used to obtain the cDNA library. In some instances, cDNA libraries are not obtained using UMIs. In some instances, a multiomic method provides RNA transcript information from the single cell for at least 500, 1000, 2000, 5000, 8000, 10,000, 12,000, or at least 15,000 genes. In some instances, a multiomic method provides RNA transcript information from the single cell for about 500, 1000, 2000, 5000, 8000, 10,000, 12,000, or about 15,000 genes. In some instances, a multiomic method provides RNA transcript information from the single cell for 100-12,000, 1000-10,000, 2000-15,000, 5000-15,000, 10,000-20,000, 8000-15,000, or 10,000-15,000 genes. In some instances, a multiomic method provides genomic sequence information for at least 80%, 90%, 92%, 95%, 97%, 98%, or at least 99% of the genome of the single cell. In some instances, a multiomic method provides genomic sequence information for about 80%, 90%, 92%, 95%, 97%, 98%, or about 99% of the genome of the single cell. RNA may be amplified in the multiomics methods described herein. In some instances, RNA is amplified to isolate mRNA transcripts. In some instances, template-switching polynucleotides are used. In some instances, amplification of RNA uses labeled primers. In some instances, a label comprises biotin. In some instances, at least some of the cDNA polynucleotides are isolated with affinity binding to the label. In some instances, multiomics methods comprise amplification of RNA to generate a cDNA library. In some instances, a cDNA library is generated having at least 10, 20, 30, 50, 75, 100, 125, 150, 175, 200, 225, 250, 300, 350, 400, or at least 500 ng of DNA. In some instances, a cDNA library is generated having 10-500, 20-500, 30-500, 50-500, 50-400, 50-300, 100-500, 100-400, 100-300, 100-200, 200-500, 300-500, or 400-750 ng of DNA. In some instances, at least some polynucleotides in the cDNA library comprise a barcode. In some instances, the cDNA comprises polynucleotides corresponding to at least 100, 500, 1000, 1500, 2000, 2500, 3000, 3500, or at least 4000 genes. In some instances, the cDNA comprises a 5’ to 3’ transcript bias of 0.5-1.5, 0.6-1.5, 0.7-1.5, 0.8-1.5, 0.9-1.5, 1-1.5, 1-2.0, 1.2-2.0, or 0.5-2.0.
[0112] Multiomic methods may comprise analysis of single cells from a population of cells. In some instances, at least 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, or at least 8000 cells are analyzed. In some instances, about 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, or about 8000 cells are analyzed. In some instances, 5-100, 10-100, 50-500, 100-500, 100-1000, 50-5000, 100-5000, 500-1000, 500-10000, 1000-10000, or 5000-20,000 cells are analyzed.
[0113] Multiomic methods may generate yields of amplified genomic DNA from the PTA reaction based on the type of single cell. In some instances, the amount of DNA generated from a single cell is about 0.1, 1, 1.5, 2, 3, 5, or about 10 micrograms. In some instances, the amount of DNA generated from a single cell is about 0.1, 1, 1.5, 2, 3, 5, or about 10 femtograms. In some instances, the amount of DNA generated from a single cell is at least 0.1, 1, 1.5, 2, 3, 5, or at least 10 micrograms. In some instances, the amount of DNA generated from a single cell is at least 0.1, 1, 1.5, 2, 3, 5, or at least 10 femtograms. In some instances, the amount of DNA generated from a single cell is about 0.1-10, 1-10, 1.5-10, 2-20, 2-50, 1-3, or 0.5-3.5 micrograms. In some instances, the amount of DNA generated from a single cell is about 0.1-10, 1-10, 1.5-10, 2-20, 2-4, 1-3, or 0.5-4 femtograms. In some instances, the amount of DNA generated from a single cell is about 0.5-2.5, 0.5-3, 0.5-5, 0.2-5, 1-2.5, or 1-5 ng of DNA. In some instances, the amount of DNA generated from a single cell is at least 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 2.25, 2.5, 2.75, 3, 3.25, 3.5, 4, or at least 5 ng of DNA.
[0114] DNA libraries may comprise an allelic balance. In some instances, the allelic balance is 50-100, 60-100, 70-100, 80-100, 60-95, 70-95, 80-95, 85-95, 90-95, 90-98, 90-99, 85-99, or 95-99 percent. In some instances, the allelic balance is at least 50, 60, 70, 80, 83, 85, 87, 90, 92, 95, 98, or at least 99 percent.
[0115] DNA libraries may comprise a sensitivity for one or more SNVs. In some instances, the sensitivity is 0.50-1, 0.60-1, 0.70-1, 0.80-1, 0.60-0.95, 0.70-0.95, 0.80-0.95, 0.85-0.95, 0.90-0.95, 0.90-0.98, 0.90-0.99, 0.85-0.99, or 0.95-0.99. In some instances, the sensitivity is at least 0.50, 0.60, 0.70, 0.80, 0.83, 0.85, 0.87, 0.90, 0.92, 0.95, 0.98, or at least 0.99.
[0116] DNA libraries may comprise a precision for one or more SNVs. In some instances, the precision is 0.50-1, 0.60-1, 0.70-1, 0.80-1, 0.60-0.95, 0.70-0.95, 0.80-0.95, 0.85-0.95, 0.90-0.95, 0.90-0.98, 0.90-0.99, 0.85-0.99, or 0.95-0.99. In some instances, the precision is at least 0.50, 0.60, 0.70, 0.80, 0.83, 0.85, 0.87, 0.90, 0.92, 0.95, 0.98, or at least 0.99.
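The text does not spell out how SNV sensitivity and precision are computed. A minimal sketch using the conventional definitions, sensitivity = TP / (TP + FN) and precision = TP / (TP + FP), is given below, on the assumption that a trusted truth set of SNVs is available for comparison. The function name `snv_sensitivity_precision` and the tuple encoding of an SNV as (chromosome, position, reference base, alternate base) are illustrative choices, not part of the disclosure.

```python
from typing import Iterable, Tuple

# An SNV is encoded here as (chromosome, 1-based position, ref base, alt base).
Snv = Tuple[str, int, str, str]


def snv_sensitivity_precision(
    called: Iterable[Snv], truth: Iterable[Snv]
) -> Tuple[float, float]:
    """Return (sensitivity, precision) of a call set against a truth set."""
    called_set, truth_set = set(called), set(truth)
    tp = len(called_set & truth_set)   # called and true
    fp = len(called_set - truth_set)   # called but not true
    fn = len(truth_set - called_set)   # true but missed
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, precision


# Hypothetical example: 5 true SNVs, 4 calls, 3 of which are correct.
truth = [
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 75, "G", "A"),
    ("chr2", 900, "T", "C"),
    ("chr3", 42, "A", "T"),
]
called = [
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 75, "G", "A"),
    ("chr4", 10, "C", "G"),   # false positive
]

sensitivity, precision = snv_sensitivity_precision(called, truth)
```

In this example three of five true SNVs are recovered (sensitivity 0.6) and three of four calls are correct (precision 0.75).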
Penetrance
[0117] Provided herein are systems and methods for quantifying penetrance (e.g., penetrance score). In some instances, a penetrance score represents the contribution of one or more pieces of molecular information that associate with the physical signs and symptoms of a genetic disorder. In some instances, subjects with one or more biomarkers do not develop physical features of the disorder, and the condition has incomplete (or low) penetrance. In some instances, penetrance is determined from one or more biomarkers and/or biological mechanisms (pathways). In some instances, changes to biomarkers (type, measurement, etc.) are used to determine penetrance score.
[0118] In some instances, these phenotypic changes are due to a functional element (e.g., an RNA or a protein). In some instances, a change is silent (having no impact to protein). In some instances, a phenotypic change is manifested as a change in measurements obtained from one or more multiomic modalities.
[0119] In some instances, multiomics comprises DNA (e.g., genome/epigenome), RNA (e.g., transcriptome), protein (proteome), and/or other molecules (e.g., lipidome, metabolome). In some instances, multiomics enables determination of a mechanism with interdependent components for a disease state or disorder. In some instances, a penetrance score and/or
mechanism are used to identify validated therapeutic targets. In some instances, treatments are generated based on the therapeutic target. In some instances, the treatment comprises a vaccine, antibody, a genetic therapy, modified immune cells, or small molecule.
[0120] Provided herein are systems and methods comprising workflows for measuring penetrance (e.g., penetrance score). In some instances, a workflow comprises one or more steps of detecting a genetic change, transcript change, methylation change, and protein change. In some instances, systems and methods comprise a workflow according to FIG. 3. In some instances, a first step comprises detecting a genetic change. In some instances, a lack of a genetic change indicates no genomic mechanism (e.g., allele-related). In some instances, an optional second step comprises detecting a methylation change. In some instances, a lack of methylation change indicates a gene is not silenced. In some instances, a third step comprises detecting a transcript change. In some instances, a lack of transcript change indicates no transcriptome mechanism. In some instances, a lack of transcript change indicates a transient expression of the expressed gene. In some instances, a lack of transcript change indicates incomplete penetrance. In some instances, a fourth step comprises detecting a protein change. In some instances, a lack of change in the proteome indicates no proteomic mechanism. In some instances, lack of a change in the proteome indicates incomplete penetrance. In some instances, a change detected in two or more steps indicates high penetrance. In some instances, a change detected in three or more steps indicates high penetrance. In some instances, a change detected in four or more steps indicates high penetrance. In some instances, detected changes in the genome, transcriptome, and proteome indicate high penetrance. Systems and methods described herein in some instances comprise one or more steps shown in FIG. 5. Systems and methods described herein in some instances comprise one or more measurements shown in FIG. 5.
[0121] Provided herein are systems for determining a penetrance score.
In some instances, systems comprise one or more of a computing system comprising at least one processor and instructions executable by the at least one processor to provide an application configured to perform operations. In some instances, the operations comprise one or more of: receiving multiomics data from one or more sources and at least one biological state; and applying an algorithm configured to process the data and generate a penetrance score. In some instances, the system comprises a standalone computing platform. In some instances, the system comprises a cloud computing platform. In some instances, the multiomics data comprises data from one or more of a genome, a transcriptome, a proteome, a metabolome, a lipidome, or an epigenome (such as a methylome). In some instances, the multiomics data comprises data from two or more of a genome, a transcriptome, a proteome, a metabolome, a lipidome, or an epigenome. In some instances, the multiomics data comprises data from a genome, a transcriptome, a proteome, a metabolome, a lipidome, or an epigenome. In some instances, the multiomics data comprises data obtained from processes which analyze one or more of a genome, a transcriptome, a proteome, a metabolome, a lipidome, or an epigenome. In some instances, multiomics data is obtained from a sample described herein. In some instances, multiomics data is obtained from a single cell. In some instances, multiomics data is obtained from a single cell from a tissue. In some instances, systems described herein analyze multiomics data from single cells in a tissue. In some instances, one or more measurements are selected from copy number variation, translocation, mutation burden, methylation at CpG sites, gene activation, gene repression, expressed genes, gene fusions, splice variants, translation level, phosphorylation state, and protein modification. In some instances, two or more measurements are selected from copy number variation, translocation, mutation burden, methylation at CpG sites, gene activation, gene repression, expressed genes, gene fusions, splice variants, translation level, phosphorylation state, and protein modification. In some instances, four or more measurements are selected from copy number variation, translocation, mutation burden, methylation at CpG sites, gene activation, gene repression, expressed genes, gene fusions, splice variants, translation level, phosphorylation state, and protein modification. In some instances, eight or more measurements are selected from copy number variation, translocation, mutation burden, methylation at CpG sites, gene activation, gene repression, expressed genes, gene fusions, splice variants, translation level, phosphorylation state, and protein modification.
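The stepwise workflow of paragraph [0120] (genetic change, optional methylation change, transcript change, protein change) can be sketched as a simple rule that counts the steps in which a change was detected. This is only an illustration: the class `LayerChanges`, the function `classify_penetrance`, and the `min_layers` parameter are hypothetical names, and the default cutoff of two steps is just one of the alternatives listed in the text ("two or more", "three or more", "four or more").

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LayerChanges:
    """Whether a change was detected at each step of the workflow."""
    genetic: bool
    transcript: bool
    protein: bool
    methylation: Optional[bool] = None  # optional step; None = not measured


def classify_penetrance(changes: LayerChanges, min_layers: int = 2) -> str:
    """Return 'high' if changes were detected in at least min_layers steps,
    otherwise 'low'. The cutoff is an assumption chosen by the caller."""
    detected = [changes.genetic, changes.transcript, changes.protein]
    if changes.methylation is not None:
        detected.append(changes.methylation)
    n_detected = sum(detected)  # True counts as 1, False as 0
    return "high" if n_detected >= min_layers else "low"


# Changes seen in genome, transcriptome, and proteome: high penetrance.
all_three = classify_penetrance(LayerChanges(True, True, True))

# Genetic change only, no downstream change: low (incomplete) penetrance.
genome_only = classify_penetrance(LayerChanges(True, False, False))

# Stricter rule requiring all four steps, methylation included.
strict = classify_penetrance(
    LayerChanges(True, True, True, methylation=True), min_layers=4
)
```

A production system as described in paragraph [0121] would derive each boolean from the underlying measurements (for example, copy number variation or splice variants) relative to a reference; the sketch starts from those per-step calls.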
[0122] Penetrance scores may be measured from one or more changes to measurements obtained from multiomics data. In some instances, a change is established against a reference sequence. In some instances, the reference sequence is obtained from a healthy or non-disease control sample. In some instances, a reference sequence is obtained from bulk measurements of a sample population. In some instances, a change comprises one or more of a genome DNA change, a transcript change, and a proteome change. In some instances, a change comprises one or more of a genomic SNV*X (single nucleotide change), genomic CNV*X (copy number variation change), genomic translocation, genomic INDEL, genomic frameshift, genomic stop codon, genomic mitochondrial, genomic promoter/enhancer, genomic TCR/BCR, transcript expression, transcript splice variant, transcript fusion, transcript lncRNA, transcript miRNA, transcript TCR/BCR, transcript promoter, transcript truncated gene, transcript mitochondrial, transcript mutation, over/under expressed protein, truncated protein, surface bound protein, frameshift protein, misfolded protein, metabolic protein, protein ligand independence, protein conformation, protein activity change, and fused protein. In some instances, a change comprises two or more of a genomic SNV*X, genomic CNV*X, genomic translocation, genomic INDEL, genomic frameshift, genomic stop codon, genomic mitochondrial, genomic promoter/enhancer, genomic TCR/BCR, transcript expression, transcript splice variant, transcript fusion, transcript lncRNA, transcript miRNA, transcript TCR/BCR, transcript promoter, transcript truncated gene, transcript mitochondrial, transcript mutation, over/under expressed protein, truncated protein, surface bound protein, frameshift protein, misfolded protein, metabolic protein, protein ligand independence, protein conformation, protein activity change, and fused protein. In some instances, a change comprises five or more of a genomic SNV*X, genomic CNV*X, genomic translocation, genomic INDEL, genomic frameshift, genomic stop codon, genomic mitochondrial, genomic promoter/enhancer, genomic TCR/BCR, transcript expression, transcript splice variant, transcript fusion, transcript lncRNA, transcript miRNA, transcript TCR/BCR, transcript promoter, transcript truncated gene, transcript mitochondrial, transcript mutation, over/under expressed protein, truncated protein, surface bound protein, frameshift protein, misfolded protein, metabolic protein, protein ligand independence, protein conformation, protein activity change, and fused protein. In some instances, a measurement change is used to determine a mechanism. In some instances, a mechanism comprises a determinant of cell fate. In some instances, a cell fate is shown in FIG. 7. A penetrance score may be represented in different ways. In some instances, a penetrance score comprises a numerical value. In some instances, a penetrance score is categorical. In some instances, a numerical value is used to determine a categorical value. In some instances, categorical values comprise high or low.
[0123] Biological inquiries may be used to interrogate changes in measurements obtained from multiomics data. In some instances, methods described herein perform one or more biological inquiries. In some instances, a biological inquiry comprises throughput number of cells processed, throughput number of cells recovered, throughput sequencing, DNA mutation - SNV, DNA copy number variation, RNA - 3’ gene expression, RNA - genes analyzed/detected, RNA - low level genes detected, RNA - mitochondrial gene expression, protein - translation panel, RNA - chromatin panel, RNA - chromatin state, RNA - BCR/TCR, and RNA - full transcript gene. An example of a workflow comprising biological inquiries for both mammalian and bacterial samples is shown in FIG. 9. In some instances, systems and methods described herein comprise obtaining a sample, and performing one or more methods comprising biological inquiries. In some instances, obtaining cells comprises one or more of FACS sorting, microfluidics, spatial cell selection, and ultra-high throughput methods. In some instances, methods comprise simultaneous genome/transcriptome analysis to prepare libraries (e.g., using PTA). In some instances, libraries are then sequenced to obtain multiomics data.
[0124] Provided herein are methods of target validation. In some instances, a target is associated with a disease state or condition. In some instances, a target validation workflow
comprises one or more steps of FIG. 6. In some instances, a workflow for validating a target comprises one or more of obtaining a sample, storing a sample, performing one or more multiomic methods on the sample to generate multiomics data, using a computation engine to process the data, and validating a target. In some instances, the sample comprises cells from a tissue. In some instances, the sample comprises cells from a frozen tissue. In some instances, the sample comprises a section of tissue. In some instances, cells are collected and then banked. In some instances, no more than 5000, 4000, 3000, 2000, 1000, 900, 800, 700, 600, 500, 400, 300, 200, 100, 50, 25, or no more than 10 cells are banked. Multiomic methods comprise methods described herein such as those which analyze and provide data on any one of the genome, methylome, transcriptome, and proteome. In some instances, target validation is associated with targets related to immunology, cancer genomics, neurology, PGT, microbiome, toxicology, bioprocessing, or cardiology.
Methylome analysis
[0125] Described herein are methods comprising PTA, wherein sites of methylated DNA in single cells are determined using the PTA method. In some instances, sites of methylated DNA are detected using enzymatic methods. In some instances, sites of methylated DNA are detected using non-enzymatic methods. In some instances, these methods further comprise parallel analysis of the transcriptome and/or proteome of the same cell. Methods of detecting methylated genomic bases include selective restriction with methylation-sensitive endonucleases, followed by processing with the PTA method. Sites cut by such enzymes are determined from sequencing, and methylated bases are identified. In some instances, libraries are amplified with methylation-specific primers which selectively anneal to methylated sequences.
[0126] In another instance, bisulfite treatment of genomic DNA libraries is used to detect a methylation signature. Bisulfite conversion of DNA results in conversion of unmodified cytosine (C) to uracil (U) that will be read as thymine (T) upon sequencing of PCR amplified DNA. Both 5meC and 5hmC are protected against conversion and will not be converted to U. Therefore, they will both be read as C upon sequencing. Alternatively, non-methylation-specific PCR is conducted, followed by one or more methods to discriminate between bisulfite-reacted bases, including direct pyrosequencing, MS-SnuPE, HRM, COBRA, MS-SSCA, or base-specific cleavage/MALDI-TOF. In some instances, genomic DNA samples are split for parallel analysis of the genome (or an enriched portion thereof) and methylome analysis. In some instances, analysis of the genome and methylome comprises enrichment of genomic fragments (e.g., exome, or other targets) or whole genome sequencing.
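The bisulfite read-out described above (unmodified C read as T, protected 5meC/5hmC read as C) can be illustrated with a small sketch. It considers only a single strand and a read already aligned base-for-base to its reference; the function names and the toy sequence are hypothetical and serve only to show the logic.

```python
def bisulfite_convert(seq: str, protected_positions: set[int]) -> str:
    """Simulate the sequenced read after bisulfite treatment and PCR.

    Unprotected C is converted to U and read as T; C at a protected
    (5meC or 5hmC) position is left unchanged and read as C.
    """
    return "".join(
        "T" if base == "C" and i not in protected_positions else base
        for i, base in enumerate(seq)
    )


def call_methylation(reference: str, converted_read: str) -> dict[int, bool]:
    """Return {position: is_protected} for every reference C covered by the read."""
    if len(reference) != len(converted_read):
        raise ValueError("this sketch assumes an ungapped, full-length alignment")
    calls = {}
    for i, (ref, obs) in enumerate(zip(reference, converted_read)):
        if ref == "C" and obs in ("C", "T"):
            calls[i] = obs == "C"  # C retained -> protected; T -> unmodified
    return calls


reference = "ACGTCCGA"
read = bisulfite_convert(reference, protected_positions={1})
calls = call_methylation(reference, read)
```

Because 5meC and 5hmC are both read as C, a call of `True` in this sketch means "protected" and does not distinguish between the two modifications.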
[0127] In some instances, the methylation signature is preserved during PTA. In some instances, processing with the PTA method while preserving the methylation signature is used
to create a reference library. In some instances, after a reference library is created, methylation patterns are detected using the methods described herein to create a methylation-specific library. In some embodiments, the methylation-specific library is compared to the reference library. In some instances, the methylation-specific library and the reference library are prepared from the same cell. In some instances, comparing the methylation-specific library to the reference library allows for identification of a methylation signature. In some instances, after a reference library is created, the genomic DNA library is treated with bisulfite. In some instances, the genomic library treated with bisulfite is amplified with the PTA method to produce a methylation-specific library.
Bioinformatics
[0128] The data obtained from single-cell analysis methods utilizing PTA described herein may be compiled into a database. Described herein are methods and systems of bioinformatic data integration. Data from the proteome, genome, transcriptome, methylome or other data is in some instances combined/integrated into a database and analyzed. Bioinformatic data integration methods and systems in some instances comprise one or more of protein detection (FACS and/or NGS), mRNA detection, and/or genome variance detection. In some instances, this data is correlated with a disease state or condition. In some instances, data from a plurality of single cells is compiled to describe properties of a larger cell population, such as cells from a specific sample, region, organism, or tissue. In some instances, protein data is acquired from fluorescently labeled antibodies which selectively bind to proteins on a cell. In some instances, a method of protein detection comprises grouping cells based on fluorescent markers and reporting sample location post-sorting. In some instances, a method of protein detection comprises detecting sample barcodes, detecting protein barcodes, comparing to designed sequences, and grouping cells based on barcode and copy number. In some instances, protein data is acquired from oligo barcoded antibodies which selectively bind to proteins on a cell. Such oligo barcodes covalently linked to the antibody are used as a reference to the specific antigen binding site for the detection of a particular antigen or translated protein. In some instances, transcriptome data is acquired from sample and RNA specific barcodes. In some instances, a method of mRNA detection comprises detecting sample and RNA specific barcodes, aligning to genome, aligning to RefSeq/Encode, reporting Exon/Intron/Intergenic sequences, analyzing exon-exon junctions, grouping cells based on barcode and expression variance, and clustering analysis of variance and top variable genes. 
In some instances, genomic data is acquired from sample and DNA specific barcodes. In some instances, a method of genome variance detection comprises detecting sample and DNA specific barcodes, aligning to the genome, determining genome recovery and SNV mapping rate, filtering reads on exon-exon junctions, generating a variant call file (VCF), and clustering analysis of variance and top variable mutations.
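The selection of "top variable" genes or mutations mentioned above can be sketched as a ranking of features by their variance across cells. The data layout (one list of per-cell values per feature), the use of population variance, and all names below are assumptions chosen for illustration; the disclosure does not specify a particular statistic.

```python
from statistics import pvariance


def top_variable_features(matrix: dict[str, list[float]], n: int) -> list[str]:
    """Return the n feature names with the largest variance across cells."""
    ranked = sorted(matrix, key=lambda name: pvariance(matrix[name]), reverse=True)
    return ranked[:n]


# Hypothetical expression values for three genes measured in four cells.
expression = {
    "GENE_A": [1.0, 1.0, 1.0, 1.0],  # constant across cells
    "GENE_B": [0.0, 5.0, 0.0, 5.0],  # strongly variable
    "GENE_C": [2.0, 3.0, 2.0, 3.0],  # mildly variable
}
top = top_variable_features(expression, 2)
```

The selected features would then typically be passed to a clustering step to group cells.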
Mutations
[0129] In some instances, the methods (e.g., multiomic PTA) described herein result in higher detection sensitivity and/or lower rates of false positives for the detection of mutations. In some instances, a mutation is a difference between an analyzed sequence (e.g., using the methods described herein) and a reference sequence. Reference sequences are in some instances obtained from other organisms, other individuals of the same or similar species, populations of organisms, or other areas of the same genome. In some instances, mutations are identified on a plasmid or chromosome. In some instances, a mutation is an SNV (single nucleotide variation), SNP (single nucleotide polymorphism), or CNV (copy number variation, or CNA/copy number aberration). In some instances, a mutation is a base substitution, insertion, or deletion. In some instances, a mutation is a transition, transversion, nonsense mutation, silent mutation, synonymous or non-synonymous mutation, non-pathogenic mutation, missense mutation, or frameshift mutation (deletion or insertion). In some instances, PTA results in higher detection sensitivity and/or lower rates of false positives for the detection of mutations when compared to methods such as in-silico prediction, ChIP-seq, GUIDE-seq, circle-seq, HTGTS (High-Throughput Genome-Wide Translocation Sequencing), IDLV (integration-deficient lentivirus), Digenome-seq, FISH (fluorescence in situ hybridization), or DISCOVER-seq.
Primary Template-Directed Amplification
[0130] Described herein are nucleic acid amplification methods, such as “Primary Template-Directed Amplification (PTA).” In some instances, PTA is combined with other analysis workflows for multiomic analysis. With the PTA method, amplicons are preferentially generated from the primary template (“direct copies”) using a polymerase (e.g., a strand displacing polymerase). Consequently, errors are propagated at a lower rate from daughter amplicons during subsequent amplifications compared to MDA. The result is an easily executed method that, unlike existing WGA protocols, can amplify low DNA input including the genomes of single cells with high coverage breadth and uniformity in an accurate and reproducible manner. Moreover, the terminated amplification products can undergo direct ligation after removal of the terminators, allowing for the attachment of a cell barcode to the amplification primers so that products from all cells can be pooled after undergoing parallel amplification reactions. In some instances, template nucleic acids are not bound to a solid support. In some instances, direct copies of template nucleic acids are not bound to a solid support. In some instances, one or more primers are not bound to a solid support. In some
instances, no primers are bound to a solid support. In some instances, a primer is attached to a first solid support, and a template nucleic acid is attached to a second solid support, wherein the first and the second solid supports are not the same. In some instances, PTA is used to analyze single cells from a larger population of cells. In some instances, PTA is used to analyze more than one cell from a larger population of cells, or an entire population of cells.
[0131] Described herein are methods employing nucleic acid polymerases with strand displacement activity for amplification. In some instances, such polymerases comprise strand displacement activity and low error rate. In some instances, such polymerases comprise strand displacement activity and proofreading exonuclease activity, such as 3’->5’ proofreading activity. In some instances, nucleic acid polymerases are used in conjunction with other components such as reversible or irreversible terminators, or additional strand displacement factors. In some instances, the polymerase has strand displacement activity, but does not have exonuclease proofreading activity. For example, in some instances such polymerases include bacteriophage phi29 (Φ29) polymerase, which also has very low error rate that is the result of the 3’->5’ proofreading exonuclease activity (see, e.g., U.S. Pat. Nos. 5,198,543 and 5,001,050). In some instances, examples of strand displacing nucleic acid polymerases include, e.g., genetically modified phi29 (Φ29) DNA polymerase, Klenow Fragment of DNA polymerase I (Jacobsen et al., Eur. J. Biochem. 45:623-627 (1974)), phage M2 DNA polymerase (Matsumoto et al., Gene 84:247 (1989)), phage phiPRD1 DNA polymerase (Jung et al., Proc. Natl. Acad. Sci. USA 84:8287 (1987); Zhu and Ito, Biochim. Biophys. Acta. 1219:267-276 (1994)), Bst DNA polymerase (e.g., Bst large fragment DNA polymerase (Exo(-) Bst; Aliotta et al., Genet. Anal. (Netherlands) 12:185-195 (1996))), exo(-)Bca DNA polymerase (Walker and Linn, Clinical Chemistry 42:1604-1608 (1996)), Bsu DNA polymerase, VentR DNA polymerase including VentR (exo-) DNA polymerase (Kong et al., J. Biol. Chem. 268:1965-1975 (1993)), Deep Vent DNA polymerase including Deep Vent (exo-) DNA polymerase, IsoPol DNA polymerase, DNA polymerase I, Therminator DNA polymerase, T5 DNA polymerase (Chatterjee et al., Gene 97:13-19 (1991)), Sequenase (U.S. Biochemicals), T7 DNA polymerase, T7-Sequenase, T7 gp5 DNA polymerase, PRD1 DNA polymerase, and T4 DNA polymerase (Kaboord and Benkovic, Curr. Biol. 5:149-157 (1995)). Additional strand displacing nucleic acid polymerases are also compatible with the methods described herein. The ability of a given polymerase to carry out strand displacement replication can be determined, for example, by using the polymerase in a strand displacement replication assay (e.g., as disclosed in U.S. Pat. No. 6,977,148). Such assays in some instances are performed at a temperature suitable for optimal activity for the enzyme being used, for example, 32°C for phi29 DNA polymerase, from 46°C to 64°C for exo(-) Bst DNA polymerase, or from about 60°C to 70°C for
an enzyme from a hyperthermophilic organism. Another useful assay for selecting a polymerase is the primer-block assay described in Kong et al., J. Biol. Chem. 268:1965-1975 (1993). The assay consists of a primer extension assay using an M13 ssDNA template in the presence or absence of an oligonucleotide that is hybridized upstream of the extending primer to block its progress. Other enzymes capable of displacing the blocking primer in this assay are in some instances useful for the disclosed method. In some instances, polymerases incorporate dNTPs and terminators at approximately equal rates. In some instances, the ratio of rates of incorporation for dNTPs and terminators for a polymerase described herein is about 1:1, about 1.5:1, about 2:1, about 3:1, about 4:1, about 5:1, about 10:1, about 20:1, about 50:1, about 100:1, about 200:1, about 500:1, or about 1000:1. In some instances, the ratio of rates of incorporation for dNTPs and terminators for a polymerase described herein is 1:1 to 1000:1, 2:1 to 500:1, 5:1 to 100:1, 10:1 to 1000:1, 100:1 to 1000:1, 500:1 to 2000:1, 50:1 to 1500:1, or 25:1 to 1000:1.
[0132] Described herein are methods of amplification wherein strand displacement can be facilitated through the use of a strand displacement factor, such as, e.g., helicase. Such factors are in some instances used in conjunction with additional amplification components, such as polymerases, terminators, or other components. In some instances, a strand displacement factor is used with a polymerase that does not have strand displacement activity. In some instances, a strand displacement factor is used with a polymerase having strand displacement activity. Without being bound by theory, strand displacement factors may increase the rate that smaller, double stranded amplicons are reprimed. In some instances, any DNA polymerase that can perform strand displacement replication in the presence of a strand displacement factor is suitable for use in the PTA method, even if the DNA polymerase does not perform strand displacement replication in the absence of such a factor. Strand displacement factors useful in strand displacement replication in some instances include (but are not limited to) BMRF1 polymerase accessory subunit (Tsurumi et al., J. Virology 67(12):7648-7653 (1993)), adenovirus DNA-binding protein (Zijderveld and van der Vliet, J. Virology 68(2):1158-1164 (1994)), herpes simplex viral protein ICP8 (Boehmer and Lehman, J. Virology 67(2):711-715 (1993); Skaliter and Lehman, Proc. Natl. Acad. Sci. USA 91(22):10665-10669 (1994)); single-stranded DNA binding proteins (SSB; Rigler and Romano, J. Biol. Chem. 270:8910-8919 (1995)); phage T4 gene 32 protein (Villemain and Giedroc, Biochemistry 35:14395-14404 (1996)); T7 helicase-primase; T7 gp2.5 SSB protein; Tte-UvrD (from Thermoanaerobacter tengcongensis), calf thymus helicase (Siegel et al., J. Biol. Chem. 267:13629-13635 (1992)); bacterial SSB (e.g., E. coli SSB), Replication Protein A (RPA) in eukaryotes, human mitochondrial SSB (mtSSB), and recombinases (e.g., Recombinase A (RecA) family proteins, T4 UvsX, T4 UvsY, Sak4 of Phage HK620, Rad51, Dmc1, or RadB). Combinations of factors that facilitate strand displacement and priming are also consistent with the methods described herein. For example, a helicase is used in conjunction with a polymerase. In some instances, the PTA method comprises use of a single-strand DNA binding protein (SSB, T4 gp32, or other single stranded DNA binding protein), a helicase, and a polymerase (e.g., Sau DNA polymerase, Bsu polymerase, Bst2.0, GspM, GspM2.0, GspSSD, or other suitable polymerase). In some instances, reverse transcriptases are used in conjunction with the strand displacement factors described herein. In some instances, amplification is conducted using a polymerase and a nicking enzyme (e.g., “NEAR”), such as those described in US 9,617,586. In some instances, the nicking enzyme is Nt.BspQI, Nb.BbvCI, Nb.BsmI, Nb.BsrDI, Nb.BtsI, Nt.AlwI, Nt.BbvCI, Nt.BstNBI, Nt.CviPII, Nb.Bpu10I, or Nt.Bpu10I.
[0133] Described herein are amplification methods comprising use of terminator nucleotides, polymerases, and additional factors or conditions. For example, such factors are used in some instances to fragment the nucleic acid template(s) or amplicons during amplification. In some instances, such factors comprise endonucleases. In some instances, factors comprise transposases. In some instances, mechanical shearing is used to fragment nucleic acids during amplification. In some instances, nucleotides are added during amplification that may be fragmented through the addition of additional proteins or conditions. For example, uracil is incorporated into amplicons; treatment with uracil D-glycosylase fragments nucleic acids at uracil-containing positions. Additional systems for selective nucleic acid fragmentation are also in some instances employed, for example an engineered DNA glycosylase that cleaves modified cytosine-pyrene base pairs. (Kwon, et al. Chem Biol. 2003, 10(4), 351)
[0134] Described herein are amplification methods comprising use of terminator nucleotides, which terminate nucleic acid replication thus decreasing the size of the amplification products. Such terminators are in some instances used in conjunction with polymerases, strand displacement factors, or other amplification components described herein. In some instances, terminator nucleotides reduce or lower the efficiency of nucleic acid replication. Such terminators in some instances reduce extension rates by at least 99.9%, 99%, 98%, 95%, 90%, 85%, 80%, 75%, 70%, or at least 65%. Such terminators in some instances reduce extension rates by 50%-90%, 60%-80%, 65%-90%, 70%-85%, 60%-90%, 70%-99%, 80%-99%, or 50%- 80%. In some instances, terminators reduce the average amplicon product length by at least 99.9%, 99%, 98%, 95%, 90%, 85%, 80%, 75%, 70%, or at least 65%. Terminators in some instances reduce the average amplicon length by 50%-90%, 60%-80%, 65%-90%, 70%-85%, 60%-90%, 70%-99%, 80%-99%, or 50%-80%. In some instances, amplicons comprising
terminator nucleotides form loops or hairpins which reduce a polymerase's ability to use such amplicons as templates. Use of terminators in some instances slows the rate of amplification at initial amplification sites through the incorporation of terminator nucleotides (e.g., dideoxynucleotides that have been modified to make them exonuclease-resistant to terminate DNA extension), resulting in smaller amplification products. By producing smaller amplification products than the currently used methods (e.g., average length of 50-2000 nucleotides in length for PTA methods as compared to an average product length of >10,000 nucleotides for MDA methods) PTA amplification products in some instances undergo direct ligation of adapters without the need for fragmentation, allowing for efficient incorporation of cell barcodes and unique molecular identifiers (UMI).
[0135] Terminator nucleotides are present at various concentrations depending on factors such as polymerase, template, or other factors. For example, the amount of terminator nucleotides in some instances is expressed as a ratio of non-terminator nucleotides to terminator nucleotides in a method described herein. Such concentrations in some instances allow control of amplicon lengths. In some instances, the ratio of terminator to non-terminator nucleotides is modified for the amount of template present or the size of the template. In some instances, the ratio of terminator to non-terminator nucleotides is reduced for smaller sample sizes (e.g., femtogram to picogram range). In some instances, the ratio of non-terminator to terminator nucleotides is about 2:1, 5:1, 7:1, 10:1, 20:1, 50:1, 100:1, 200:1, 500:1, 1000:1, 2000:1, or 5000:1. In some instances the ratio of non-terminator to terminator nucleotides is 2:1-10:1, 5:1-20:1, 10:1-100:1, 20:1-200:1, 50:1-1000:1, 50:1-500:1, 75:1-150:1, or 100:1-500:1. In some instances, at least one of the nucleotides present during amplification using a method described herein is a terminator nucleotide. Each terminator need not be present at approximately the same concentration; in some instances, ratios of each terminator present in a method described herein are optimized for a particular set of reaction conditions, sample type, or polymerase. Without being bound by theory, each terminator may possess a different efficiency for incorporation into the growing polynucleotide chain of an amplicon, in response to pairing with the corresponding nucleotide on the template strand. For example, in some instances a terminator pairing with cytosine is present at about 3%, 5%, 10%, 15%, 20%, 25%, or 50% higher concentration than the average terminator concentration. 
In some instances, a terminator pairing with thymine is present at about 3%, 5%, 10%, 15%, 20%, 25%, or 50% higher concentration than the average terminator concentration. In some instances, a terminator pairing with guanine is present at about 3%, 5%, 10%, 15%, 20%, 25%, or 50% higher concentration than the average terminator concentration. In some instances, a terminator pairing with adenine is present at about 3%, 5%, 10%, 15%, 20%, 25%, or 50% higher concentration than the
average terminator concentration. In some instances, a terminator pairing with uracil is present at about 3%, 5%, 10%, 15%, 20%, 25%, or 50% higher concentration than the average terminator concentration. Any nucleotide capable of terminating nucleic acid extension by a nucleic acid polymerase in some instances is used as a terminator nucleotide in the methods described herein. In some instances, a reversible terminator is used to terminate nucleic acid replication. In some instances, a non-reversible terminator is used to terminate nucleic acid replication. In some instances, non-limiting examples of terminators include reversible and non-reversible nucleic acids and nucleic acid analogs, such as, e.g., 3’ blocked reversible terminator comprising nucleotides, 3’ unblocked reversible terminator comprising nucleotides, terminators comprising 2’ modifications of deoxynucleotides, terminators comprising modifications to the nitrogenous base of deoxynucleotides, or any combination thereof. In one embodiment, terminator nucleotides are dideoxynucleotides. Other nucleotide modifications that terminate nucleic acid replication and may be suitable for practicing the invention include, without limitation, any modifications of the R group of the 3’ carbon of the deoxyribose such as inverted dideoxynucleotides, 3' biotinylated nucleotides, 3' amino nucleotides, 3'-phosphorylated nucleotides, 3'-O-methyl nucleotides, 3' carbon spacer nucleotides including 3' C3 spacer nucleotides, 3' C18 nucleotides, 3' Hexanediol spacer nucleotides, acyclonucleotides, and combinations thereof. In some instances, terminators are polynucleotides comprising 1, 2, 3, 4, or more bases in length. In some instances, terminators do not comprise a detectable moiety or tag (e.g., mass tag, fluorescent tag, dye, radioactive atom, or other detectable moiety). 
In some instances, terminators do not comprise a chemical moiety allowing for attachment of a detectable moiety or tag (e.g., “click” azide/alkyne, conjugate addition partner, or other chemical handle for attachment of a tag). In some instances, all terminator nucleotides comprise the same modification that reduces amplification to a region (e.g., the sugar moiety, base moiety, or phosphate moiety) of the nucleotide. In some instances, at least one terminator has a different modification that reduces amplification. In some instances, all terminators have substantially similar fluorescent excitation or emission wavelengths. In some instances, terminators without modification to the phosphate group are used with polymerases that do not have exonuclease proofreading activity. Terminators, when used with polymerases which have 3’->5’ proofreading exonuclease activity (such as, e.g., phi29) that can remove the terminator nucleotide, are in some instances further modified to make them exonuclease-resistant. For example, dideoxynucleotides are modified with an alpha-thio group that creates a phosphorothioate linkage which makes these nucleotides resistant to the 3’->5’ proofreading exonuclease activity of nucleic acid polymerases. Such modifications in some instances reduce the exonuclease proofreading activity of polymerases by at least 99.5%, 99%, 98%, 95%, 90%, or at least 85%. Examples of other terminator nucleotide modifications providing resistance to the 3’->5’ exonuclease activity include in some instances: nucleotides with modification to the alpha group, such as alpha-thio dideoxynucleotides creating a phosphorothioate bond, C3 spacer nucleotides, locked nucleic acids (LNA), inverted nucleic acids, 2' Fluoro bases, 3' phosphorylation, 2'-O-Methyl modifications (or other 2’-O-alkyl modification), propyne-modified bases (e.g., deoxycytosine, deoxyuridine), L-DNA nucleotides, L-RNA nucleotides, nucleotides with inverted linkages (e.g., 5’-5’ or 3’-3’), 5’ inverted bases (e.g., 5’ inverted 2’,3’-dideoxy dT), methylphosphonate backbones, and trans nucleic acids. In some instances, nucleotides with modification include base-modified nucleic acids comprising free 3’ OH groups (e.g., 2-nitrobenzyl alkylated HOMedU triphosphates, bases comprising modification with large chemical groups, such as solid supports or other large moiety). In some instances, a polymerase with strand displacement activity but without 3'->5' exonuclease proofreading activity is used with terminator nucleotides with or without modifications to make them exonuclease resistant. Such nucleic acid polymerases include, without limitation, Bst DNA polymerase, Bsu DNA polymerase, Deep Vent (exo-) DNA polymerase, Klenow Fragment (exo-) DNA polymerase, Therminator DNA polymerase, and VentR(exo-).
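The statement in paragraph [0135] that the ratio of non-terminator to terminator nucleotides allows control of amplicon length can be made concrete with a deliberately simplified model. The sketch below assumes, purely for illustration, that every extension step independently incorporates either a normal nucleotide or a terminator, with odds set by the concentration ratio multiplied by the polymerase's relative incorporation rate; under that assumption product length is geometrically distributed. This model, the function names, and the parameter `rate_ratio` are not part of the disclosure, and real product lengths also depend on the polymerase, template, and reaction conditions.

```python
def termination_probability(ratio: float, rate_ratio: float = 1.0) -> float:
    """Per-step chance of incorporating a terminator (simplified model).

    ratio      -- non-terminator : terminator concentration ratio (e.g., 100 for 100:1)
    rate_ratio -- assumed relative incorporation rate, dNTP : terminator
    """
    if ratio <= 0 or rate_ratio <= 0:
        raise ValueError("ratio and rate_ratio must be positive")
    return 1.0 / (1.0 + ratio * rate_ratio)


def expected_amplicon_length(ratio: float, rate_ratio: float = 1.0) -> float:
    """Mean of the geometric length distribution, i.e. 1 / termination probability."""
    return 1.0 + ratio * rate_ratio


# Expected lengths for a few of the ratios listed in paragraph [0135],
# assuming equal incorporation rates for dNTPs and terminators.
lengths = {r: expected_amplicon_length(r) for r in (10, 100, 1000)}
```

Under these assumptions, raising the ratio from 10:1 to 1000:1 lengthens the expected product roughly one hundredfold, which is the qualitative behavior the paragraph describes.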
Visualization of biological information
[0136] Described herein are computer-implemented systems for visualization of biological data. In some instances, the data comprises genomic, transcriptomic, proteomic, methylation and epigenomic data. Further described herein are computer-implemented systems comprising one or more modules. Further described herein are computer-implemented systems comprising at least one memory storing computer-executable instructions; and at least one processor configured to access the at least one memory and execute the computer-executable instructions, wherein the computer-executable instructions comprise one or more of a frontend, a backend, and a pipeline module. In some instances, an exemplary arrangement of modules is shown in FIG. 10. In some instances, modules are accessed from a cloud-based database or interface. Methods and systems described herein in some instances comprise one or more steps of accessing a web-based software application; providing or otherwise linking an input file (such as a file comprising whole genome sequencing, RNA, or other biological information); processing the file; applying one or more filters or annotations to the data in the file; querying one or more databases; and displaying a visualization of the filtered and/or annotated data.
[0137] The systems and methods described herein may comprise a frontend module. In some instances, the frontend module comprises a Vue.js application that provides the user interface and visualizations for the systems and methods described herein. In some instances, the frontend makes requests to the backend to query data. In some instances, a frontend comprises
computer-executable instructions for one or more of: displaying complex visualizations such as the Circos plot, phylogenetic tree, etc. (e.g., as navigable tabs); displaying quality metrics; visualizing filters and filtering interactions; and presenting data tables for cell information. In some instances, a web version of IGV is integrated into the frontend.
[0138] The systems and methods described herein may comprise a backend module. In some instances, the backend comprises a Flask framework application and provides one or more backend features for the methods and systems described herein. In some instances, the backend is written in Python. In some instances, a backend comprises computer-executable instructions for one or more of: user authentication and registration; data computations and filtering; access of a Vaex open-source library for speeding up data interactions; interacting with a database and HDF5 files to process data requests; presenting and encoding data for visualizations; and presenting data for IGV.
[0139] The systems and methods described herein may comprise a pipeline module. In some instances, the pipeline comprises a computationally intensive workflow that runs genomics analysis tools to extract signatures of biomarkers from sequencing files and loads them into a database. In some instances, the methods and systems described herein comprise one or more pipeline modules. In some instances, pipeline modules comprise multi-omics, such as WGS/exome, methylation, proteome, proteome bacterial, or RNA-seq/transcriptome. In some instances, pipeline comprises one or more sub-modules. In some instances, a pipeline comprises one or more data files. In some instances, a pipeline comprises one or more of sequencing input files, sub-pipeline modules, and summary files.
[0140] Pipelines may be configured for whole genome or exome sequencing data. In some instances, a WGS/exome pipeline is configured to input one or more FastQ files. In some instances, a WGS/exome pipeline comprises one or more of alignment, haplotype caller, joint genotyping, heterozygous site detector (a pipeline used for the analysis of cell lines without a priori knowledge of reference heterozygous variant sites), statistics, ADO, and CNV sub-pipelines, which are used to derive insights from sequencing data. In some instances, the files contain sequence(ing) information/data. In some instances, files comprise sequence data from the clusters that pass filter on a flow cell. In some instances, the files comprise FastQ files. In some instances, the database comprises a PostgreSQL database. In some instances, the databases are accessed from a backend module. In some instances, the pipeline comprises computer-executable instructions for one or more of: accepting a sequencing information file as input (e.g., FastQ); running joint genotyping to produce a VCF file; and linking variants to COSMIC, ClinVar, or another variant list. In some instances, a VCF file contains the variants called from multiple samples (cells) all together and represents high confidence variants distributed across the cells. These variants in some instances represent
changes in nucleotides observed in a cell in relation to the reference genome. In some instances, these variants are placed along the genome using genomic coordinates (e.g., chr1 base 18903). Such a configuration having a specific location for a variant allows in some instances association of information compiled in databases to this given variant.
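The association of database information with a variant via its genomic coordinates can be sketched as a dictionary lookup keyed on chromosome, position, reference allele, and alternate allele. The example reads the first five tab-separated VCF columns (CHROM, POS, ID, REF, ALT); the annotation table and its contents are hypothetical stand-ins for a resource such as COSMIC or ClinVar, and multi-allelic records are not split in this sketch.

```python
VariantKey = tuple[str, int, str, str]


def parse_vcf_line(line: str) -> VariantKey:
    """Extract (chrom, pos, ref, alt) from one VCF data line."""
    chrom, pos, _variant_id, ref, alt = line.rstrip("\n").split("\t")[:5]
    return chrom, int(pos), ref, alt


def annotate(vcf_lines: list[str], annotations: dict[VariantKey, str]):
    """Pair each variant with its annotation, or 'unannotated' if none is known."""
    results = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip meta-information, the header line, and blank lines
        key = parse_vcf_line(line)
        results.append((key, annotations.get(key, "unannotated")))
    return results


vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t18903\t.\tA\tG\t50\tPASS\t.",
    "chr2\t500\t.\tC\tT\t40\tPASS\t.",
]
# Hypothetical annotation table; a real one would be loaded from a variant database.
annotations = {("chr1", 18903, "A", "G"): "hypothetical database entry"}
annotated = annotate(vcf_lines, annotations)
```

Keying on the full (chromosome, position, reference, alternate) tuple rather than position alone avoids attaching an annotation to a different substitution at the same site.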
[0141] Pipelines may be configured for multi-omics analysis. In some instances, multi-omics comprises two or more types of biological information. In some instances, multi-omics comprises two or more of transcript (transcriptome), genomic, proteomic, methylome, or other form of sample analysis. In some instances, methods described herein display and/or process multi-omics data. Data in some instances is obtained from a single cell. Data in other instances is obtained by evaluation of a population of cells. In some instances, methods described herein display transcript and genomic data. In some instances, methods described herein utilize transcript, genomic data, and proteomics data. In some instances, methods described herein utilize transcript, genomic data, and methylome data.
[0142] In some instances, an alignment pipeline comprises one or more of a compressed alignment file (e.g., a .bam file) describing the alignment information of the reads in the project against a given reference (e.g., hg38) and an index file of the alignment file. In some instances, the pipeline comprises a .bam file.
[0143] In some instances, a haplotype caller pipeline comprises one or more of a genomic variant call format (GVCF) file containing the detected variants for a given sample and an indexer file associated with the GVCF file.
[0144] In some instances, a joint-genotyping pipeline comprises one or more of a genomic variant call format (GVCF) file containing the joint variant calling of multiple samples and an indexer file associated with the joint-genotyped GVCF file.
[0145] In some instances, a heterozygous site detector pipeline comprises one or more of a genomic variant call format (GVCF) file containing the called variants with a high degree of prevalence across a dataset and high confidence; and an indexer file associated with the GVCF file.
[0146] In some instances, a statistics pipeline comprises one or more of a tabulator-separated value table describing whole genome sequence (WGS) level statistics estimated from the aligned reads (e.g., 1X, 5X, 10X coverage, etc.); and a tabulator-separated value table showing exome-panel specific statistics (e.g., on-target, off-target, and near-target events).
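As a rough illustration of coverage statistics of this kind, the sketch below computes the fraction of positions covered at or above 1X, 5X, and 10X from a list of per-position read depths and formats one row of a tab-separated table. The function names, column names, and input representation are assumptions made for the example; the source does not specify how the statistics are computed.

```python
from typing import Iterable


def coverage_breadth(depths: Iterable[int], thresholds=(1, 5, 10)) -> dict[int, float]:
    """Fraction of positions whose read depth is at least each threshold."""
    depths = list(depths)
    if not depths:
        return {t: 0.0 for t in thresholds}
    return {t: sum(d >= t for d in depths) / len(depths) for t in thresholds}


def to_tsv(sample: str, breadth: dict[int, float]) -> str:
    """Format one sample's breadth values as a header line plus one data row."""
    header = "sample\t" + "\t".join(f"pct_{t}X" for t in breadth)
    row = sample + "\t" + "\t".join(f"{100 * v:.1f}" for v in breadth.values())
    return header + "\n" + row


# Toy per-position depths for a hypothetical cell.
breadth = coverage_breadth([0, 1, 5, 10, 20])
table = to_tsv("cell_1", breadth)
```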
[0147] In some instances, an ADO pipeline comprises one or more of a tabulator-separated value table showing allele frequencies of N number of queried heterozygous sites. This table is in some instances used to estimate WGS allele balance.
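One way such a table of allele frequencies might be summarized is sketched below. The definitions used here are assumptions for illustration and are not taken from the source: a queried heterozygous site is counted as a dropout when only one allele is observed (alternate-allele fraction of 0.0 or 1.0), and as balanced when the fraction lies within a hypothetical window around 0.5.

```python
def summarize_het_sites(alt_fractions, balanced_window=0.3):
    """Summarize alternate-allele fractions observed at known heterozygous sites.

    A site is a dropout when only one allele is seen (fraction 0.0 or 1.0) and
    balanced when the fraction lies within 0.5 +/- balanced_window.
    Both definitions are illustrative assumptions.
    """
    n = len(alt_fractions)
    if n == 0:
        raise ValueError("no heterozygous sites queried")
    dropouts = sum(f in (0.0, 1.0) for f in alt_fractions)
    balanced = sum(abs(f - 0.5) <= balanced_window for f in alt_fractions)
    return {"sites": n, "ado_rate": dropouts / n, "balanced_rate": balanced / n}


# Four queried sites: two show both alleles, two show only one.
summary = summarize_het_sites([0.5, 0.0, 1.0, 0.45])
```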
[0148] In some instances, a CNV pipeline comprises one or more of a tabulator-separated value table describing, for a sample, the estimated copy number for bins of size N across the whole genome; and a tabulator-separated value table describing, for a sample, the type of event (insertion, deletion) for all bins of size N across the genome.
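The relationship between the two tables can be sketched as follows. This is a deliberately simplified illustration and not the method of the source: per-bin read counts are scaled by the median count, which is assumed to correspond to an expected copy number of 2, and each bin is then labeled with the event types named above. The function names and the median normalization are assumptions.

```python
import statistics


def estimate_copy_numbers(bin_counts, expected=2):
    """Scale per-bin read counts by the median count, assumed to reflect `expected` copies."""
    median = statistics.median(bin_counts)
    if median == 0:
        raise ValueError("median bin count is zero; cannot normalize")
    return [round(expected * c / median) for c in bin_counts]


def classify_bins(copy_numbers, expected=2):
    """Label each bin by comparing its estimated copy number with the expected value."""
    events = []
    for cn in copy_numbers:
        if cn > expected:
            events.append("insertion")
        elif cn < expected:
            events.append("deletion")
        else:
            events.append("none")
    return events


# Toy read counts for five consecutive bins of equal size.
copy_numbers = estimate_copy_numbers([100, 100, 150, 50, 100])
events = classify_bins(copy_numbers)
```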
[0149] Pipelines may be configured for bacterial sequencing data. In some instances, a bacterial pipeline is configured to input a FastQ file. In some instances, a bacterial pipeline comprises one or more of: compressed FASTQ files containing trimmed and filtered high-quality sequences; a tabulator-separated value table describing the taxonomic assignment of each read to a given species using a database (such as Kraken’s database); a fasta file describing the genome assembly, at the level of contigs, constructed from the reads in the dataset; a fasta file describing the genome assembly, at the level of scaffolds, constructed from the reads in the dataset; and a BAM file describing the alignments of the reads in reference to the assembled genome (e.g., contigs). In some instances, a bacterial pipeline comprises one or more summary files. In some instances, summary files comprise one or more of: a tabulator-separated value table describing the taxonomic assignment of contigs in an assembly based on the proportion of reads mapped to them; and a tabulator-separated value table showing the estimated completeness of a given assembly based on a set of phylogenetic marker genes.
[0150] Pipelines may be configured for RNA-seq data. In some instances, an RNA-seq pipeline is configured to accept one or more of a compressed alignment file describing the alignment information of the reads in the project against a given reference (e.g., hg38); an index file of the compressed alignment file; a compressed alignment file describing the alignment information of the reads in the project against an RNA-Seq specific index for a given reference; and an index file for the alignment file. In some instances, an RNA-seq pipeline comprises one or more summary files. In some instances, summary files comprise one or more of a tabulator-separated table describing the matrix of counts of the genomic features (e.g., exons in a gene) across samples; a tabulator-separated table describing the number of unique splice-junction overlaps; a tabulator-separated table describing overall alignment metrics (e.g., number of genes with counts, etc.); and a tabulator-separated table showing the estimated ratio of exon to non-exon alignment events.
[0151] Systems and methods described herein may comprise filters for visualizing data. In some instances, filters comprise one or more of: Germline mutation, Somatic mutation, Copy number variation, Single nucleotide variation, Insertions and deletions, Tumor Mutation Burden (TMB) Analysis, Catalogue of Somatic Mutations in Cancer (COSMIC), ClinVar, and Predicted Coding Change.
[0152] Further described herein are computer-implemented systems comprising: at least one memory storing computer-executable instructions; and at least one processor configured to
access the at least one memory and execute the computer-executable instructions to perform: receiving a query, wherein the query comprises genomic data from one or more samples; querying a database; wherein the database comprises a plurality of genomic data and a plurality of phenotype data; generating, using at least the genomic data, a genome summary, the genome summary comprising genes and gene variants of the cohort; determining a graphical representation of the genome summary; and sending the graphical representation to a display device.
[0153] Also described herein are computer-implemented systems comprising: at least one memory storing computer-executable instructions; and at least one processor configured to access the at least one memory and execute the computer-executable instructions to generate a graphical user interface (GUI) that accepts a query from a user comprising sequencing information from one or more samples and presents to the user genome information. In some instances, a GUI comprises a project browser or dashboard. In some instances, a GUI comprises drop down menus for one or more of project, owner, analysis type, and status. In some instances, a GUI comprises a list of previous and current projects. In some instances, projects and data are shared among a group of users. In some instances, projects are saved for future modification or access. In some instances, GUI is facilitated by a frontend. In some instances, user control queries a backend to reflect actions of the user through.
[0154] Computer-implemented systems may comprise a genome browser. In some instances, a genome browser is configured to display sections of a genome and/or variants. In some instances, a genome browser comprises an IGV (Integrative Genomics Viewer). In some instances, in the IGV window the bin size is selectable from the entire genome down to the individual base. In this view individual mutations in some instances are viewed to determine the alternative allele or base change. In some instances, each mutation is selectable, further detailing the nature of the modification and presenting it to the user.
[0155] Computer-implemented systems may comprise an interface for annotating variants. This is an important step to empower interpretation of downstream coding changes in protein structure and function. Variant information in some instances comprises one or more of features (name, gene id, gene type, strand, Tdl, HgncId), predictions (SIFT/sorting intolerant from tolerant, LRT/likelihood ratio test, FATHMM, PROVEAN/protein variation effect analyzer, MetaSVM, MetaLR), conservation among species (e.g., vertebrates, mammals, etc.), evidence (pathology-related data from databases such as COSMIC), and biological population. In some instances, a variant annotation interface assesses the degree of conservation among (100) vertebrates and (30) mammals. In some instances, this display is helpful in the investigation of de novo variant alleles which are not annotated by ClinVar, COSMIC, GeneCards, or Ensembl.
The comparison allows the determination of conservation of alleles found in the sample compared to the same allele found in an alternative species. Highly conserved alleles are shifted right, whereas alleles with low conservation are shifted left. As an example, in the Phylo 30-way mammal plot the allele is highly conserved across all 30 mammals, indicating the gene is highly conserved and likely to be important for the health of all mammalian species. Having assessed the potential for the mutation to be pathogenic, if annotated the user in some instances navigates to a variety of external databases (e.g., GeneCards, Ensembl, ClinVar, and COSMIC) by simply selecting the hyperlink for that specific database.
[0156] Variants in some instances are annotated as one or more of Germline mutation, Somatic mutation, Copy number variation (CNV), Single nucleotide variation (SNV), Insertions and deletions, Catalogue of Somatic Mutations in Cancer (COSMIC), ClinVar, and Predicted Coding Change. Additional resources are also accessed in some instances, such as GeneCards, Ensembl, ClinVar, and COSMIC. In some instances, variants comprise complex markers such as those obtained using Tumor Mutation Burden (TMB) Analysis.
[0157] Computer-implemented systems may comprise an interface for tracing variant lineages. In some instances, lineages comprise somatic, ancestral, or reference lineages. Lineage trees in some instances are generated from specific chromosomes, and graphically display variants in a chart format.
[0158] Computer-implemented systems may comprise an interface for analyzing cells. In some instances, samples comprise one or more cells. Cells in some instances are searched, or summary information about each cell is displayed, such as cell name and variants detected (somatic, germline, SNPs, and indels). In some instances, metrics high, medium, and low are used to describe confidence of variant calls for each cell. In some instances, inter-cell distances are graphed.
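The source does not state which distance measure is graphed. As one hedged possibility, the sketch below computes a Jaccard distance between the sets of variants detected in each pair of cells; the choice of Jaccard distance, the function names, and the variant identifiers are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard_distance(a: set, b: set) -> float:
    """1 minus the share of variants common to both cells; 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def pairwise_distances(cells: dict[str, set]) -> dict[tuple[str, str], float]:
    """Distance for every unordered pair of cells, keyed by sorted cell names."""
    return {
        (x, y): jaccard_distance(cells[x], cells[y])
        for x, y in combinations(sorted(cells), 2)
    }


# Hypothetical variant identifiers detected in three cells.
cells = {
    "cell_a": {"v1", "v2"},
    "cell_b": {"v2", "v3"},
    "cell_c": {"v1", "v2"},
}
distances = pairwise_distances(cells)
```

The resulting pairwise values could feed a heat map or a clustering step, depending on how the distances are to be graphed.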
[0159] Computer-implemented systems may comprise an interface for visualizing sequencing metrics (e.g., Picard metrics). Metrics include but are not limited to chromosome M population, percent pass/fail reads aligned, WGS mean coverage, and WGS percent excluded duplicate reads. Each metric in some instances is also displayed on an individual per-cell basis.
[0160] Computer-implemented systems may comprise an interface for visualizing genomic data. In some instances, data may be visualized using a circos plot. Circos plots in some instances comprise additional variant information, such as number of somatic, germline, SNP or indel variants. Variants in some instances are visualized at the chromosome level. In some instances, a circos plot comprises a lineage tree. In some instances, a user interface is configured to apply one or more filters to the circos plot. In some instances, two or more groups of cells or samples are compared (optionally filtered by number of variants). In some instances, views of one or more chromosomes are displayed or hidden. In some instances, data from one or more cells is hidden or displayed. In some instances, variant filters comprise one or more of variant type (SNP, indel), origin (somatic vs. germline), annotation (COSMIC, ClinVar, coding change), or features. In some instances, features comprise name, gene id, gene type, strand, Tdl, and HgncId. In some instances, variant filters comprise predictions (SIFT, FATHMM, PROVEAN, MetaSVM, and MetaLR). Upon selection of a region or chromosome within a cell's genome, a pop-up window is in some instances presented to the user which includes a genome viewing frame (e.g., IGV) plot. This window can be configured in terms of genome window bin size, allowing visualization from the entire chromosome down to the individual bases across that genome, which can be completed in a matter of seconds. The window is in some instances scrollable by simply dragging the window left or right. In the IGV window, each sample in some instances is interrogated to determine, for example, the specific change, which is highlighted by a color change from the parental allele. The alternative allele is selected to determine the base change, while the parent allele can be detected to determine pathogenic risk score based on several public algorithms as well as the conservation of the allele across several vertebrate and mammalian species. This variant annotation further provides links to several databases to provide greater detail of the impact of the genomic alteration.
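The variant filters described above can be sketched as predicates combined over a list of variant records. The record layout and the `apply_filters` function below are hypothetical and serve only to show how type, origin, and annotation filters might compose; they are not drawn from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    chrom: str
    pos: int
    kind: str                              # "SNP" or "indel"
    origin: str                            # "somatic" or "germline"
    annotations: frozenset = frozenset()   # e.g. {"COSMIC", "ClinVar"}


def apply_filters(variants, kind=None, origin=None, annotation=None):
    """Keep variants that satisfy every filter that is not None."""
    return [
        v for v in variants
        if (kind is None or v.kind == kind)
        and (origin is None or v.origin == origin)
        and (annotation is None or annotation in v.annotations)
    ]


# Toy records; coordinates and annotations are placeholders.
variants = [
    Variant("chr1", 18903, "SNP", "somatic", frozenset({"COSMIC"})),
    Variant("chr1", 20500, "indel", "germline"),
    Variant("chr2", 7000, "SNP", "germline", frozenset({"ClinVar"})),
]
somatic_snps = apply_filters(variants, kind="SNP", origin="somatic")
clinvar_hits = apply_filters(variants, annotation="ClinVar")
```

Leaving a filter as `None` disables it, which mirrors a user interface in which each filter can be switched on or off independently.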
[0161] The systems and methods described herein may provide a visualization of genomic and multiomic data having a large number of data sets. In some instances, the genomic data comprises at least 1, 2, 5, 10, 15, 20, 25, 30, 40, 50, 75, 100, 125, 150, 200, 250, 300, 400, 500, 600, 750, 1000, or at least 1500 data sets. In some instances, the genomic data comprises 1-1000, 5-1000, 10-1000, 5-10,000, 100-10,000, 100-1000, 10-500, 10-750, 50-750, or 50-500 data sets. In some instances, each sample data set corresponds to a single cell.
[0162] The systems and methods described herein may provide a visualization of genomic data comprising data sets with a large number of variants. In some instances, each data set comprises at least 500, 1000, 2000, 5000, 10,000, 50,000, 100,000, 150,000, 250,000, 500,000, 1 million, 2 million, 3 million, 4 million, 5 million, 6 million, 10 million or at least 15 million variants. In some instances, each data set comprises about 500, 1000, 2000, 5000, 10,000, 50,000, 100,000, 150,000, 250,000, 500,000, 1 million, 2 million, 3 million, 4 million, 5 million, 6 million, 10 million or about 15 million variants. In some instances, each data set comprises 100-1 million, 100-100,000, 100,000-1 million, 100,000-5 million, 100-500,000, 500-5 million, 1 million-2 million, 2 million to 6 million, 3 million to 10 million, or 4 million to 7 million variants.
[0163] The systems and methods described herein may provide a visualization of genomic data within data sets represented in table format. In some instances, data sets comprise at least 1, 2, 5, 10, 20, 25, 50, 75, 80, 85, 90, 95, 100, 110, 120, 150, 200, or at least 250 million rows of data. In some instances, data sets comprise no more than 1, 2, 5, 10, 20, 25, 50, 75, 80, 85, 90, 95, 100, 110, 120, 150, 200, or no more than 250 million rows of data. In some instances, data sets comprise 1-250, 1-100, 1-50, 1-25, 5-25, 5-50, 10-100, 10-200, 50-200, 50-150, 100-400 or 100-300 million rows of data.
[0164] The systems and methods described herein may provide a visualization of genomic data in a short period of time. In some instances, a system for visualizing genomic data comprises one or more devices comprising at least one processor and instructions executable by the at least one processor to provide a first application configured to perform operations comprising: i. accessing one or more data sets comprising genomic data; and ii. generating a visual representation of the one or more data sets. In some instances, the visualization comprises a circos plot. In some instances, the circos plot is generated in no more than 10, 5, 4, 3, 2, 1, 0.5, 0.2, 0.1, 0.05, or no more than 0.01 seconds. In some instances, the circos plot is generated in 0.01-10, 0.05-10, 0.1-50, 0.5-10, 1-10, 2-10, 5-10, 0.01-0.05, 0.01-0.1, 0.01-0.5, 0.1-0.5, 0.1-1, or 0.1-5 seconds. In some instances, the circos plot is generated in no more than 10, 5, 4, 3, 2, 1, 0.5, 0.2, 0.1, 0.05, or no more than 0.01 seconds for a data set having at least 5 cells. In some instances, the circos plot is generated in 0.01-10, 0.05-10, 0.1-50, 0.5-10, 1-10, 2-10, 5-10, 0.01-0.05, 0.01-0.1, 0.01-0.5, 0.1-0.5, 0.1-1, or 0.1-5 seconds for a data set having at least 5 cells. In some instances, the circos plot is generated in no more than 10, 5, 4, 3, 2, 1, 0.5, 0.2, 0.1, 0.05, or no more than 0.01 seconds for a data set having at least 10 cells. In some instances, the circos plot is generated in 0.01-10, 0.05-10, 0.1-50, 0.5-10, 1-10, 2-10, 5-10, 0.01-0.05, 0.01-0.1, 0.01-0.5, 0.1-0.5, 0.1-1, or 0.1-5 seconds for a data set having at least 10 cells. In some instances, the circos plot is generated in no more than 10, 5, 4, 3, 2, 1, 0.5, 0.2, 0.1, 0.05, or no more than 0.01 seconds for a data set having at least 20 cells. In some instances, the circos plot is generated in 0.01-10, 0.05-10, 0.1-50, 0.5-10, 1-10, 2-10, 5-10, 0.01-0.05, 0.01-0.1, 0.01-0.5, 0.1-0.5, 0.1-1, or 0.1-5 seconds for a data set having at least 20 cells. In some instances, the circos plot is generated in no more than 10, 5, 4, 3, 2, 1, 0.5, 0.2, 0.1, 0.05, or no more than 0.01 seconds for a data set having at least 1 million variants per cell. In some instances, the circos plot is generated in 0.01-10, 0.05-10, 0.1-50, 0.5-10, 1-10, 2-10, 5-10, 0.01-0.05, 0.01-0.1, 0.01-0.5, 0.1-0.5, 0.1-1, or 0.1-5 seconds for a data set having at least 1 million variants per cell. In some instances, the circos plot is generated in no more than 10, 5, 4, 3, 2, 1, 0.5, 0.2, 0.1, 0.05, or no more than 0.01 seconds for a data set having at least 4 million variants per cell. In some instances, the circos plot is generated in 0.01-10, 0.05-10, 0.1-50, 0.5-10, 1-10, 2-10, 5-10, 0.01-0.05, 0.01-0.1, 0.01-0.5, 0.1-0.5, 0.1-1, or 0.1-5 seconds for a data set having at least 4 million variants per cell. In some instances, the circos plot is generated using no more than 1, 2, 3, 4, 5, 6, 7, or no more than 8 processors. Alternatively, in some examples, more processors may be used. As many processors as needed may be used. In some instances, the circos plot is generated using at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 30, 40 or more processors. In some instances, the circos plot is generated using about 1, 2, 3, 4, 5, 6, 7, 8, 10, 20, 30, 40 or more processors.
[0165] In some instances, the visualization further comprises a phylogenetic tree. In some instances, the visualization further comprises sequencing quality metrics. In some instances, the visualization further comprises annotated variations. In some instances, the visualization further comprises number of variations. In some instances, the visualization further comprises cell and cell population statistics.
[0166] Also described herein are platforms comprising: a database, in a computer memory, comprising biologic information for members of a population of individuals or samples, the biologic information comprising genome data, the biologic information obtained by analysis of one or more biologic samples from each sample and/or individual, each individual and/or sample having an ID; and a processor configured to provide a biologic information visual synthesis application comprising: a software module presenting an interface allowing a user to query the database by one or more of: inputting a phenotype, inputting a gene name, inputting an individual ID, and inputting a sample ID; a software module generating a genome browser, the genome browser comprising: a whole genome display comprising an icon representing each chromosome, each icon indicating a density of gene variants; and a chromosome display comprising an iconic representation of each chromosome, the representation indicating a density of gene variants located at the relevant portion of the chromosome, wherein selection of a chromosome by a user generates a linear display of the chromosome demonstrating the spatial relationship on the chromosome of the genes and the variants; and a software module generating a lineage viewer, the lineage viewer comprising: a geographic display of autosomal ancestry, a geographic display of maternal line ancestry, and a geographic display of paternal line ancestry and multi-omic displays.
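A density of gene variants along a chromosome, as indicated by the chromosome display, could for example be derived by counting variants in fixed-size bins. The sketch below is a minimal illustration under that assumption; the bin size of one million bases and the function name `variant_density` are hypothetical choices, not details from the source.

```python
from collections import Counter


def variant_density(variants, bin_size=1_000_000):
    """Count variants per (chromosome, bin index), where bin index = position // bin_size."""
    return Counter((chrom, pos // bin_size) for chrom, pos in variants)


# Toy (chromosome, position) pairs.
density = variant_density([
    ("chr1", 18903),
    ("chr1", 999_999),
    ("chr1", 1_000_000),
    ("chr2", 5),
])
```

Each count can then be mapped to a color or bar height at the corresponding portion of the chromosome icon.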
Population of individuals
[0167] The platforms, systems, media, and methods described herein include biologic data pertaining to a population of individuals, or use of the same. In various embodiments, the population of individuals comprises more than 1,000, more than 5,000, more than 10,000, more than 20,000, more than 50,000, more than 100,000, more than 500,000, more than 1,000,000, more than 10,000,000, more than 50,000,000, or more than 100,000,000 individuals. In some cases, the individuals in the population participated in academic medical research studies using consents allowing for genetic testing of specimens. In such cases, biologic specimens and
phenotype data are collected for individuals from pharmaceutical clinical trials, academic research, and health care settings. In some cases, biologic data pertaining to a population of individuals is collected from integrated health records for individuals representing a spectrum of diseases with unmet medical needs.
Biologic information
[0168] The platforms, systems, media, and methods described herein include biologic information, or use of the same. In some instances, biologic information comprises genetic information. In some embodiments, the biologic information comprises whole human genome sequencing information. In some embodiments, the biologic information comprises human transcriptome sequencing information. In some instances, biologic information comprises genetic information from humans, non-human primates, animals, plants, fungi, protozoa, archaea, or bacteria. In some instances, biologic information comprises genetic information from the microbiome.
[0169] The biologic information may comprise genomic information. As used herein, genomic information refers to genetic information found within a biological sample arising from the genome (or DNA - nuclear, mitochondrial or otherwise). In some instances, genomic information comprises nucleic acid sequence copy number, location, and sequence. The genomic information is not limited to protein-coding sequence; it may refer to intronic sequence and intergenic sequence, each known to harbor multiple functional elements whereby DNA changes at those elements may be consequential in normal development and disease. In some instances, genomic information comprises post-transcriptional modifications such as methylation. In some instances, genomic information is found within a chromosome, plasmid, or other medium comprising nucleic acids.
[0170] The biologic information may comprise transcript information. As used herein, transcript information refers to information obtained from a transcriptome within a biological sample. In some instances, transcript information comprises expression levels of genes and sequence of corresponding nucleic acids expressed from genes.
[0171] The biologic information may comprise microbiome information. As used herein, “microbiome” refers to the bacteria and other microorganisms that live in and on the human body. In some embodiments, the microbiome information comprises metagenomic microbiome characterization. In various embodiments, the microbiome information comprises one or more of: microflora genus and/or species information, microflora relative abundance information, and microflora gene and/or gene variant information.
[0172] The biologic information may comprise proteome information. In some embodiments, the proteome information comprises information regarding abundance, localization, identity, post-translational modifications, or other protein information.
[0173] The biologic information may comprise methylome information. In some embodiments, methylome information comprises post-transcriptional modifications such as the location of 5-methylcytosine (5-mC), 5-hydroxymethylcytosine (5-hmC), CpG islands, ATAC seq, methyl histone modification, other post-transcriptional modification to nucleic acids, and/or any combinations thereof.
[0174] The biologic information may comprise metabolome information. As used herein, “metabolome” refers to the small-molecule chemicals found within a biological sample. In some embodiments, metabolome information comprises the presence of one or more small-molecule chemicals. In further embodiments, the metabolome information comprises a qualitative measurement of one or more small-molecule chemicals. In still further embodiments, the metabolome information comprises a quantitative measurement of one or more small-molecule chemicals. In various embodiments, the metabolome information comprises measurements of at least 100, at least 200, at least 300, at least 400, at least 500, at least 600, at least 700, at least 800, at least 900, at least 1000, at least 1100, at least 1200, at least 1300, at least 1400, or at least 1500 substances (e.g., molecules).
Data security
[0175] Databases and visualizations described herein may comprise sensitive information pertaining to an individual’s health. Provided herein are platforms for data security comprising one or more of an access control for one or more users; a security framework; and biological data from an individual. In some instances, one or more security measures are implemented via security frameworks to restrict access or protect an individual’s health information. Security frameworks in some instances comprise standards. In some instances, security frameworks include HIPAA standards. In some instances, security frameworks comprise the NIST Cybersecurity Framework. Access controls in some instances restrict access to certain individuals or groups of individuals. Access controls in some instances comprise passwords, biometrics, or other methods of user authentication.
Use Cases and applications
[0176] The system can be applied in a variety of fields. In some instances, the system provides useful data and analysis to pharmaceutical companies, including informaticians, bench scientists, medical directors, the senior executive team, or commercial organizations. Such data and analysis, in some instances, includes analysis of clinical trial data for patient stratification and biomarker discovery, identification and in silico validation of novel genetic targets,
discovery of novel disease and dose response biomarkers/signatures, compound repurposing and expansion of indications of marketed drugs, rescue of failed clinical trial assets, real-time genetic analysis of adverse events, or targeted accelerated recruitment for clinical trials. For academic research groups, including physicians/principal investigators, informaticians, research scientists and geneticists, the system in some instances offers analysis of specific cohorts, analysis of individual patients, or large-scale analysis of variation in populations. Clinics, hospitals and cancer centers, including physicians and genetic counsellors, in some instances will find the system useful in the analysis of individuals, analysis of cohorts, wellness focus, or oncology focus. The data and analysis in some instances also have value to insurance companies, actuarial teams, or health economists.
[0177] Specifically, for pharma and researchers, the system can serve as or enable a reference set of knowledge/evidence, a hypothesis generation engine, a platform for analysis of pharma’s own data, a platform for combination of pharma data and data and analysis provided by the system, a platform for combining data from multiple collaborators, a platform for sharing data within a company, etc. For physicians or genetic counsellors, the system can similarly be used as part of a care tool to identify the most relevant results for treatment and prevention, a reference set of knowledge/evidence, or a tool to identify other physicians with similar patients and share knowledge. In addition, for insurance companies, the system can be useful as part of a tool to detect individual care pathways and incentivize healthy living, or a tool to help quantify the risk that they have in the insured population.
Kits
[0178] The systems described herein may accompany or be provided as a service with a kit. In some instances, the kit comprises reagents for acquiring biological information. In some instances, the kit is configured to obtain genomic or transcriptome data. In some instances, the kit is configured to obtain genomic, methylome, transcriptome or proteome data from single cells. In some instances, provided herein are kits comprising reagents for obtaining biological data from single cells, and instructions for using the kit. In some instances, the instructions comprise links to a web-based portal or mobile based software application to import, analyze, and/or compare biological data obtained from the kit.
Digital processing device
[0179] In some embodiments, the platforms, systems, media, and methods described herein include a digital processing device, or use of the same. In further embodiments, the digital processing device includes one or more hardware central processing units (CPUs) or general-purpose graphics processing units (GPGPUs) that carry out the device’s functions. In still further embodiments, the digital processing device further comprises an operating system configured to perform executable instructions. In some embodiments, one or more resources related to the systems described herein are stored locally. In some embodiments, the digital processing device is optionally connected to a computer network. In further embodiments, the digital processing device is optionally connected to the Internet such that it accesses the World Wide Web. In still further embodiments, the digital processing device is optionally connected to a cloud computing infrastructure. In other embodiments, the digital processing device is optionally connected to an intranet. In other embodiments, the digital processing device is optionally connected to a data storage device.
[0180] In accordance with the description herein, suitable digital processing devices include, by way of examples, server computers, desktop computers, laptop computers, and notebook computers.
[0181] In some embodiments, the digital processing device includes an operating system configured to perform executable instructions. The operating system is, for example, software, including programs and data, which manages the device’s hardware and provides services for execution of applications. Those of skill in the art will recognize that suitable server operating systems include, by way of examples, FreeBSD, OpenBSD, NetBSD®, Linux, Apple® Mac OS X Server®, Oracle® Solaris®, Windows Server®, and Novell® NetWare®. Those of skill in the art will recognize that suitable personal computer operating systems include, by way of examples, Microsoft® Windows®, Apple® Mac OS X®, UNIX®, and UNIX-like operating systems such as GNU/Linux®. In some embodiments, the operating system is provided by cloud computing.
[0182] In some embodiments, the device includes a storage and/or memory device. The storage and/or memory device is one or more physical apparatuses used to store data or programs on a temporary or permanent basis. In some embodiments, the device is volatile memory and requires power to maintain stored information. In some embodiments, the device is non-volatile memory and retains stored information when the digital processing device is not powered. In further embodiments, the non-volatile memory comprises flash memory. In some embodiments, the volatile memory comprises dynamic random-access memory (DRAM). In some embodiments, the non-volatile memory comprises ferroelectric random-access memory (FRAM). In some embodiments, the non-volatile memory comprises phase-change random access memory (PRAM). In other embodiments, the device is a storage device including, by way of examples, CD-ROMs, DVDs, flash memory devices, magnetic disk drives, magnetic tape drives, optical disk drives, and cloud computing-based storage. In further embodiments, the storage and/or memory device is a combination of devices such as those disclosed herein. In some embodiments, data may be stored on and/or using a DNA data storage system. Any suitable data storage system and database may be used.
[0183] In some embodiments, the digital processing device includes a display to send visual information to a user. In some embodiments, the display is a cathode ray tube (CRT). In some embodiments, the display is a liquid crystal display (LCD). In further embodiments, the display is a thin film transistor liquid crystal display (TFT-LCD). In some embodiments, the display is an organic light emitting diode (OLED) display. In various further embodiments, an OLED display is a passive-matrix OLED (PMOLED) or active-matrix OLED (AMOLED) display. In some embodiments, the display is a plasma display. In other embodiments, the display is a video projector. In some embodiments, the display is a wearable display. In still further embodiments, the display is a combination of devices such as those disclosed herein.
[0184] In some embodiments, the digital processing device includes an input device to receive information from a user. In some embodiments, the input device is a keyboard. In some embodiments, the input device is a pointing device including, by way of examples, a mouse, trackball, track pad, joystick, game controller, or stylus. In some embodiments, the input device is a touch screen or a multi-touch screen. In other embodiments, the input device is a microphone to capture voice or other sound input. In other embodiments, the input device is a video camera or other sensor to capture motion or visual input. In further embodiments, the input device is a Kinect, Leap Motion, or the like. In still further embodiments, the input device is a combination of devices such as those disclosed herein.
Non-transitory computer readable storage medium
[0185] In some embodiments, the platforms, systems, media, and methods disclosed herein include one or more non-transitory computer readable storage media encoded with a program including instructions executable by the operating system of an optionally networked digital processing device. In further embodiments, a computer readable storage medium is a tangible component of a digital processing device. In still further embodiments, a computer readable storage medium is optionally removable from a digital processing device. In some embodiments, a computer readable storage medium includes, by way of examples, CD-ROMs, DVDs, flash memory devices, solid state memory, magnetic disk drives, magnetic tape drives, optical disk drives, cloud computing systems and services, and the like. In some cases, the program and instructions are permanently, substantially permanently, semi-permanently, or non-transitorily encoded on the media.
Computing system
[0186] Referring to FIG. 10, a block diagram is shown depicting an exemplary machine that includes a computer system 1100 (e.g., a processing or computing system) within which a set of instructions can execute for causing a device to perform or execute any one or more of the aspects and/or methodologies for static code scheduling of the present disclosure. The components in FIG. 10 are examples only and do not limit the scope of use or functionality of any hardware, software, embedded logic component, or a combination of two or more such components implementing particular embodiments.
[0187] Computer system 1100 may include one or more processors 1101, a memory 1103, and a storage 1108 that communicate with each other, and with other components, via a bus 1140. The bus 1140 may also link a display 1132, one or more input devices 1133 (which may, for example, include a keypad, a keyboard, a mouse, a stylus, etc.), one or more output devices 1134, one or more storage devices 1135, and various tangible storage media 1136. All of these elements may interface directly or via one or more interfaces or adaptors to the bus 1140. For instance, the various tangible storage media 1136 can interface with the bus 1140 via storage medium interface 1126. Computer system 1100 may have any suitable physical form, including but not limited to one or more integrated circuits (ICs), printed circuit boards (PCBs), mobile handheld devices (such as mobile telephones or PDAs), laptop or notebook computers, distributed computer systems, computing grids, or servers.
[0188] Computer system 1100 includes one or more processor(s) 1101 (e.g., central processing units (CPUs), general purpose graphics processing units (GPGPUs), or quantum processing units (QPUs)) that carry out functions. Processor(s) 1101 optionally contain a cache memory unit 1102 for temporary local storage of instructions, data, or computer addresses. Processor(s) 1101 are configured to assist in execution of computer readable instructions. Computer system 1100 may provide functionality for the components depicted in FIG. 10 as a result of the processor(s) 1101 executing non-transitory, processor-executable instructions embodied in one or more tangible computer-readable storage media, such as memory 1103, storage 1108, storage devices 1135, and/or storage medium 1136. The computer-readable media may store software that implements particular embodiments, and processor(s) 1101 may execute the software. Memory 1103 may read the software from one or more other computer-readable media (such as mass storage device(s) 1135, 1136) or from one or more other sources through a suitable interface, such as network interface 1120. The software may cause processor(s) 1101 to carry out one or more processes or one or more steps of one or more processes described or illustrated herein. Carrying out such processes or steps may include defining data structures stored in memory 1103 and modifying the data structures as directed by the software.
[0189] The memory 1103 may include various components (e.g., machine readable media) including, but not limited to, a random-access memory component (e.g., RAM 1104, such as static RAM (SRAM), dynamic RAM (DRAM), ferroelectric random-access memory (FRAM), or phase-change random access memory (PRAM)), a read-only memory component (e.g., ROM 1105), and any combinations thereof. ROM 1105 may act to communicate data and instructions unidirectionally to processor(s) 1101, and RAM 1104 may act to communicate data and instructions bidirectionally with processor(s) 1101. ROM 1105 and RAM 1104 may include any suitable tangible computer-readable media described below. In one example, a basic input/output system 1106 (BIOS), including basic routines that help to transfer information between elements within computer system 1100, such as during start-up, may be stored in the memory 1103.
[0190] Fixed storage 1108 is connected bidirectionally to processor(s) 1101, optionally through storage control unit 1107. Fixed storage 1108 provides additional data storage capacity and may also include any suitable tangible computer-readable media described herein. Storage 1108 may be used to store operating system 1109, executable(s) 1110, data 1111, applications 1112 (application programs), and the like. Storage 1108 can also include an optical disk drive, a solid-state memory device (e.g., flash-based systems), or a combination of any of the above. Information in storage 1108 may, in appropriate cases, be incorporated as virtual memory in memory 1103.
[0191] In one example, storage device(s) 1135 may be removably interfaced with computer system 1100 (e.g., via an external port connector (not shown)) via a storage device interface 1125. Particularly, storage device(s) 1135 and an associated machine-readable medium may provide non-volatile and/or volatile storage of machine-readable instructions, data structures, program modules, and/or other data for the computer system 1100. In one example, software may reside, completely or partially, within a machine-readable medium on storage device(s) 1135. In another example, software may reside, completely or partially, within processor(s) 1101.
[0192] Bus 1140 connects a wide variety of subsystems. Herein, reference to a bus may encompass one or more digital signal lines serving a common function, where appropriate. Bus 1140 may be any of several types of bus structures including, but not limited to, a memory bus, a memory controller, a peripheral bus, a local bus, and any combinations thereof, using any of a variety of bus architectures. As an example, such architectures include an Industry Standard Architecture (ISA) bus, an Enhanced ISA (EISA) bus, a Micro Channel Architecture (MCA) bus, a Video Electronics Standards Association local bus (VLB), a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, an Accelerated Graphics Port (AGP) bus, a HyperTransport (HTX) bus, a serial advanced technology attachment (SATA) bus, and any combinations thereof.
[0193] Computer system 1100 may also include an input device 1133. In one example, a user of computer system 1100 may enter commands and/or other information into computer system 1100 via input device(s) 1133. Examples of input device(s) 1133 include, but are not limited to, an alpha-numeric input device (e.g., a keyboard), a pointing device (e.g., a mouse or touchpad), a touch screen, a multi-touch screen, a joystick, a stylus, a gamepad, an audio input device (e.g., a microphone, a voice response system, etc.), an optical scanner, a video or still image capture device (e.g., a camera), and any combinations thereof. In some embodiments, the input device is a Kinect, Leap Motion, or the like. Input device(s) 1133 may be interfaced to bus 1140 via any of a variety of input interfaces 1123 including, but not limited to, serial, parallel, game port, USB, FIREWIRE, THUNDERBOLT, or any combination of the above.
[0194] In particular embodiments, when computer system 1100 is connected to network 1130, computer system 1100 may communicate with other devices, specifically mobile devices and enterprise systems, distributed computing systems, cloud storage systems, cloud computing systems, and the like, connected to network 1130. Communications to and from computer system 1100 may be sent through network interface 1120. For example, network interface 1120 may receive incoming communications (such as requests or responses from other devices) in the form of one or more packets (such as Internet Protocol (IP) packets) from network 1130, and computer system 1100 may store the incoming communications in memory 1103 for processing. Computer system 1100 may similarly store outgoing communications (such as requests or responses to other devices) in the form of one or more packets in memory 1103, from which they are communicated to network 1130 via network interface 1120. Processor(s) 1101 may access these communication packets stored in memory 1103 for processing.
[0195] Examples of the network interface 1120 include, but are not limited to, a network interface card, a modem, and any combination thereof. Examples of a network 1130 or network segment 1130 include, but are not limited to, a distributed computing system, a cloud computing system, a wide area network (WAN) (e.g., the Internet, an enterprise network), a local area network (LAN) (e.g., a network associated with an office, a building, a campus or other relatively small geographic space), a telephone network, a direct connection between two computing devices, a peer-to-peer network, and any combinations thereof. A network, such as network 1130, may employ a wired and/or a wireless mode of communication. In general, any network topology may be used.
[0196] Information and data can be displayed through a display 1132. Examples of a display 1132 include, but are not limited to, a cathode ray tube (CRT), a liquid crystal display (LCD), a thin film transistor liquid crystal display (TFT-LCD), an organic light emitting diode (OLED) display such as a passive-matrix OLED (PMOLED) or active-matrix OLED (AMOLED) display, a plasma display, and any combinations thereof. The display 1132 can interface to the processor(s) 1101, memory 1103, and fixed storage 1108, as well as other devices, such as input device(s) 1133, via the bus 1140. The display 1132 is linked to the bus 1140 via a video interface 1122, and transport of data between the display 1132 and the bus 1140 can be controlled via the graphics control 1121. In some embodiments, the display is a video projector. In some embodiments, the display is a head-mounted display (HMD) such as a VR headset. In further embodiments, suitable VR headsets include, by way of example, HTC Vive, Oculus Rift, Samsung Gear VR, Microsoft HoloLens, Razer OSVR, FOVE VR, Zeiss VR One, Avegant Glyph, Freefly VR headset, and the like. In still further embodiments, the display is a combination of devices such as those disclosed herein.
[0197] In addition to a display 1132, computer system 1100 may include one or more other peripheral output devices 1134 including, but not limited to, an audio speaker, a printer, a storage device, and any combinations thereof. Such peripheral output devices may be connected to the bus 1140 via an output interface 1124. Examples of an output interface 1124 include, but are not limited to, a serial port, a parallel connection, a USB port, a FIREWIRE port, a THUNDERBOLT port, and any combinations thereof.
[0198] In addition, or as an alternative, computer system 1100 may provide functionality as a result of logic hardwired or otherwise embodied in a circuit, which may operate in place of or together with software to execute one or more processes or one or more steps of one or more processes described or illustrated herein. Reference to software in this disclosure may encompass logic, and reference to logic may encompass software. Moreover, reference to a computer-readable medium may encompass a circuit (such as an IC) storing software for execution, a circuit embodying logic for execution, or both, where appropriate. The present disclosure encompasses any suitable combination of hardware, software, or both.
[0199] Those of skill in the art will appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality.
[0200] The various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
[0201] The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by one or more processor(s), or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
[0202] In accordance with the description herein, suitable computing devices include, by way of example, server computers, desktop computers, laptop computers, notebook computers, subnotebook computers, netbook computers, netpad computers, set-top computers, media streaming devices, handheld computers, Internet appliances, mobile smartphones, tablet computers, and personal digital assistants. Suitable tablet computers, in various embodiments, include those with booklet, slate, and convertible configurations, known to those of skill in the art.
[0203] In some embodiments, the computing device includes an operating system configured to perform executable instructions. The operating system is, for example, software, including programs and data, which manages the device’s hardware and provides services for execution of applications. Operating systems in some instances are stored locally or accessed via a network. Those of skill in the art will recognize that suitable server operating systems include, by way of example, FreeBSD, OpenBSD, NetBSD®, Linux, Apple® Mac OS X Server®, Oracle® Solaris®, Windows Server®, and Novell® NetWare®. Those of skill in the art will recognize that suitable personal computer operating systems include, by way of example, Microsoft® Windows®, Apple® Mac OS X®, UNIX®, and UNIX-like operating systems such as GNU/Linux®. In some embodiments, the operating system is provided by cloud computing. Those of skill in the art will also recognize that suitable mobile smartphone operating systems include, by way of example, Nokia® Symbian® OS, Apple® iOS®, Research In Motion® BlackBerry OS®, Google® Android®, Microsoft® Windows Phone® OS, Microsoft® Windows Mobile® OS, Linux®, and Palm® WebOS®.
[0204] Computer systems described herein may be utilized as part of the systems and methods of the present invention. In some embodiments, a computer system may be utilized as a device configured for use by a researcher, patient, partner, caretaker, or healthcare provider.
Non-transitory computer readable storage medium
[0205] In some embodiments, the platforms, systems, media, and methods disclosed herein include one or more non-transitory computer readable storage media encoded with a program including instructions executable by the operating system of an optionally networked computing device. In further embodiments, a computer readable storage medium is a tangible component of a computing device. In still further embodiments, a computer readable storage medium is optionally removable from a computing device. In some embodiments, a computer readable storage medium includes, by way of example, CD-ROMs, DVDs, flash memory devices, solid state memory, magnetic disk drives, magnetic tape drives, optical disk drives, distributed computing systems including cloud computing systems and services, and the like. In some cases, the program and instructions are permanently, substantially permanently, semi-permanently, or non-transitorily encoded on the media.
Computer program
[0206] In some embodiments, the platforms, systems, media, and methods disclosed herein include at least one computer program, or use of the same. A computer program includes a sequence of instructions, executable by one or more processor(s) of the computing device’s CPU, written to perform a specified task. Computer readable instructions may be implemented as program modules, such as functions, objects, Application Programming Interfaces (APIs), computing data structures, and the like, that perform particular tasks or implement particular abstract data types. In light of the disclosure provided herein, those of skill in the art will recognize that a computer program may be written in various versions of various languages.
[0207] The functionality of the computer readable instructions may be combined or distributed as desired in various environments. In some embodiments, a computer program comprises one sequence of instructions. In some embodiments, a computer program comprises a plurality of sequences of instructions. In some embodiments, a computer program is provided from one location. In other embodiments, a computer program is provided from a plurality of locations. In various embodiments, a computer program includes one or more software modules or features. In various embodiments, a computer program includes, in part or in whole, one or more web applications, one or more mobile applications, one or more standalone applications, one or more web browser plug-ins, extensions, add-ins, or add-ons, or combinations thereof.
[0208] In some embodiments, one or more web applications, one or more mobile applications, one or more standalone applications, one or more web browser plug-ins, extensions, add-ins, or add-ons, or combinations thereof are utilized to perform the methods as described herein. In some embodiments, one or more web applications, one or more mobile applications, one or more standalone applications, one or more web browser plug-ins, extensions, add-ins, or add-ons, or combinations thereof are utilized as part of the systems as described herein. In some embodiments, one or more web applications, one or more mobile applications, one or more standalone applications, one or more web browser plug-ins, extensions, add-ins, or add-ons, or combinations thereof are utilized to fully or partially automate the systems and methods as described herein. In some embodiments, automation allows methods to be carried out which are beyond the limits of what can be processed by a human.
Web application
[0209] In some embodiments, a computer program includes a web application. In light of the disclosure provided herein, those of skill in the art will recognize that a web application, in various embodiments, utilizes one or more software frameworks and one or more database systems. In some embodiments, a web application is created upon a software framework such as Microsoft® .NET or Ruby on Rails (RoR). In some embodiments, a web application utilizes one or more database systems including, by way of examples, relational, non-relational, object oriented, associative, XML, and document-oriented database systems. In further embodiments, suitable relational database systems include, by way of examples, Microsoft® SQL Server, MySQL™, and Oracle®. Those of skill in the art will also recognize that a web application, in various embodiments, is written in one or more versions of one or more languages. A web application may be written in one or more markup languages, presentation definition languages, client-side scripting languages, server-side coding languages, database query languages, or combinations thereof. In some embodiments, a web application is written to some extent in a markup language such as Hypertext Markup Language (HTML), Extensible Hypertext Markup Language (XHTML), or Extensible Markup Language (XML). In some embodiments, a web application is written to some extent in a presentation definition language such as Cascading Style Sheets (CSS). In some embodiments, a web application is written to some extent in a client-side scripting language such as Asynchronous JavaScript and XML (AJAX), Flash® ActionScript, JavaScript, or Silverlight®. In some embodiments, a web application is written to some extent in a server-side coding language such as Active Server Pages (ASP), ColdFusion®, Perl, Java™, JavaServer Pages (JSP), Hypertext Preprocessor (PHP), Python™, Ruby, Tcl, Smalltalk, WebDNA®, or Groovy. In some embodiments, a web application is written to some extent in a database query language such as Structured Query Language (SQL). In some embodiments, a web application integrates enterprise server products such as IBM® Lotus Domino®. In some embodiments, a web application includes a media player element. In various further embodiments, a media player element utilizes one or more of many suitable multimedia technologies including, by way of examples, Adobe® Flash®, HTML 5, Apple® QuickTime®, Microsoft® Silverlight®, Java™, and Unity®.
[0210] Referring to FIG. 11, in a particular embodiment, an application provision system comprises one or more databases 1200 accessed by a relational database management system (RDBMS) 1210. Suitable RDBMSs include Firebird, MySQL, PostgreSQL, SQLite, Oracle Database, Microsoft SQL Server, IBM DB2, IBM Informix, SAP Sybase, Teradata, and the like. In this embodiment, the application provision system further comprises one or more application servers 1220 (such as Java servers, .NET servers, PHP servers, and the like) and one or more web servers 1230 (such as Apache, IIS, GWS and the like). The web server(s) optionally expose one or more web services via application programming interfaces (APIs) 1240. Via a network, such as the Internet, the system provides browser-based and/or mobile native user interfaces.
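By way of illustration only, the arrangement of a relational database reached through a web-facing API might be sketched as follows. This is a minimal sketch, not the system of FIG. 11: SQLite (one of the RDBMSs listed above) stands in for the database, a plain WSGI callable stands in for the application and web server tiers, and the `measurement` table, the `/measurements` path, and the `sample` query parameter are all hypothetical names chosen for the example.

```python
import json
import sqlite3
from urllib.parse import parse_qs


def make_database():
    """Create an in-memory SQLite database with a hypothetical measurement table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE measurement (sample TEXT, modality TEXT, value REAL)")
    return conn


def make_app(conn):
    """Return a WSGI application that serves rows from `conn` as JSON.

    The route and parameter names are illustrative assumptions only.
    """
    def app(environ, start_response):
        query = parse_qs(environ.get("QUERY_STRING", ""))
        sample = query.get("sample", [None])[0]
        if environ.get("PATH_INFO") != "/measurements" or sample is None:
            start_response("404 Not Found", [("Content-Type", "application/json")])
            return [b'{"error": "not found"}']
        # Parameterized query: the sample identifier is never spliced into the SQL text.
        rows = conn.execute(
            "SELECT modality, value FROM measurement WHERE sample = ? ORDER BY modality",
            (sample,),
        ).fetchall()
        body = json.dumps([{"modality": m, "value": v} for m, v in rows]).encode("utf-8")
        start_response(
            "200 OK",
            [("Content-Type", "application/json"), ("Content-Length", str(len(body)))],
        )
        return [body]

    return app
```

Such a callable could be hosted by any WSGI-capable web server; for local experimentation, the Python standard library's `wsgiref.simple_server.make_server` accepts it directly.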
[0211] Referring to FIG. 12, in a particular embodiment, an application provision system alternatively has a distributed, cloud-based architecture 1300 and comprises elastically load balanced, auto-scaling web server resources 1310 and application server resources 1320 as well as synchronously replicated databases 1330.
[0212] The web applications may be utilized as part of the systems as described herein. The web applications may be utilized to perform the methods as described herein. In some applications, web applications are utilized to provide features or modules of the systems described herein. In some applications, web applications are utilized to fully or partially automate systems and methods described herein. In some embodiments, automation allows methods to be carried out which are beyond the limits of what can be processed by a human.
Mobile application
[0213] In some embodiments, a computer program includes a mobile application provided to a mobile computing device. In some embodiments, the mobile application is provided to a mobile computing device at the time it is manufactured. In other embodiments, the mobile application is provided to a mobile computing device via the computer network described herein.
[0214] In view of the disclosure provided herein, a mobile application is created by techniques known to those of skill in the art using hardware, languages, and development environments known to the art. Those of skill in the art will recognize that mobile applications are written in several languages. Suitable programming languages include, by way of examples, C, C++, C#, Objective-C, Java™, JavaScript, Pascal, Object Pascal, Python™, Ruby, VB.NET, WML, and XHTML/HTML with or without CSS, or combinations thereof.
[0215] Suitable mobile application development environments are available from several sources. Commercially available development environments include, by way of examples, AirplaySDK, alcheMo, Appcelerator®, Celsius, Bedrock, Flash Lite, .NET Compact Framework, Rhomobile, and WorkLight Mobile Platform. Other development environments are available without cost including, by way of examples, Lazarus, MobiFlex, MoSync, and PhoneGap. Also, mobile device manufacturers distribute software developer kits including, by way of examples, iPhone and iPad (iOS) SDK, Android™ SDK, BlackBerry® SDK, BREW SDK, Palm® OS SDK, Symbian SDK, webOS SDK, and Windows® Mobile SDK.
[0216] Those of skill in the art will recognize that several commercial forums are available for distribution of mobile applications including, by way of examples, Apple® App Store, Google® Play, Chrome WebStore, BlackBerry® App World, App Store for Palm devices, App Catalog for webOS, Windows® Marketplace for Mobile, Ovi Store for Nokia® devices, and Samsung® Apps.
[0217] The mobile applications may be utilized as part of the systems as described herein. The mobile applications may be utilized to perform the methods as described herein. In some applications, mobile applications are utilized to provide features or modules of the systems described herein. In some applications, mobile applications are utilized to fully or partially automate systems and methods described herein. In some embodiments, automation allows methods to be carried out which are beyond the limits of what can be processed by a human.
Web browser plug-in
[0218] In some embodiments, the computer program includes a web browser plug-in (e.g., extension, etc.). In computing, a plug-in is one or more software components that add specific functionality to a larger software application. Makers of software applications support plug-ins to enable third-party developers to create abilities which extend an application, to support easily adding new features, and to reduce the size of an application. When supported, plug-ins enable customizing the functionality of a software application. For example, plug-ins are commonly used in web browsers to play video, generate interactivity, scan for viruses, and display particular file types. Those of skill in the art will be familiar with several web browser plug-ins including Adobe® Flash® Player, Microsoft® Silverlight®, and Apple® QuickTime®. In some embodiments, the toolbar comprises one or more web browser extensions, add-ins, or add-ons. In some embodiments, the toolbar comprises one or more explorer bars, tool bands, or desk bands.
[0219] In view of the disclosure provided herein, those of skill in the art will recognize that several plug-in frameworks are available that enable development of plug-ins in various programming languages, including, by way of examples, C++, Delphi, Java™, PHP, Python™, and VB.NET, or combinations thereof.
[0220] Web browsers (also called Internet browsers) are software applications, designed for use with network-connected computing devices, for retrieving, presenting, and traversing information resources on the World Wide Web. Suitable web browsers include, by way of examples, Microsoft® Edge®, Mozilla® Firefox®, Google® Chrome, Apple® Safari®, Opera Software® Opera®, and KDE Konqueror. In some embodiments, the web browser is a mobile web browser. Mobile web browsers (also called microbrowsers, mini-browsers, and wireless browsers) are designed for use on mobile computing devices including, by way of examples, handheld computers, tablet computers, netbook computers, subnotebook computers, smartphones, and personal digital assistants (PDAs). Suitable mobile web browsers include, by way of examples, Google® Android® browser, RIM BlackBerry® Browser, Apple® Safari®, Palm® Blazer, Palm® WebOS® Browser, Mozilla® Firefox® for mobile, Microsoft® Internet Explorer® Mobile, Amazon® Kindle® Basic Web, Nokia® Browser, Opera Software® Opera® Mobile, and Sony® PSP™ browser.
Software modules
[0221] In some embodiments, the platforms, systems, media, and methods disclosed herein include software, server, and/or database modules, or use of the same. In view of the disclosure provided herein, software modules are created by techniques known to those of skill in the art using machines, software, and languages known to the art. The software modules disclosed herein are implemented in a multitude of ways. In various embodiments, a software module comprises a file, a section of code, a programming object, a programming structure, or combinations thereof. In further various embodiments, a software module comprises a plurality of files, a plurality of sections of code, a plurality of programming objects, a plurality of programming structures, or combinations thereof. In various embodiments, the one or more software modules comprise, by way of examples, a web application, a mobile application, and a standalone application. In some embodiments, software modules are in one computer program or application. In other embodiments, software modules are in more than one computer program or application. In some embodiments, software modules are hosted on one machine. In other embodiments, software modules are hosted on more than one machine. In further embodiments, software modules are hosted on a distributed computing platform such as a cloud computing platform. In some embodiments, software modules are hosted on one or more machines in one location. In other embodiments, software modules are hosted on one or more machines in more than one location.
Databases
[0222] In some instances, users query one or more databases to identify information about biological data in their data sets. For example, a user may use an interface to display specific information about a variant, such as the variant's role in cancer or other diseases. In some embodiments, the platforms, systems, media, and methods disclosed herein include one or more databases, or use of the same. In view of the disclosure provided herein, those of skill in the art will recognize that many databases are suitable for storage and retrieval of, for example, patient, photo, video, skin condition, visit, physician, and insurance information. In various embodiments, suitable databases include, by way of examples, relational databases, non-relational databases, object-oriented databases, object databases, entity-relationship model databases, associative databases, XML databases, document-oriented databases, and graph databases. Further examples include SQL, PostgreSQL, MySQL, Oracle, DB2, Sybase, and MongoDB. In some embodiments, a database is Internet-based. In further embodiments, a database is web-based. In still further embodiments, a database is cloud computing based. In a particular embodiment, a database is a distributed database. In other embodiments, a database is based on one or more local computer storage devices.
[0223] Databases may comprise information (e.g., annotations) regarding genetic variants. In some instances, databases provide information on somatic, germline, or somatic and germline variants. In some instances, a database comprises one or more of ClinVar, COSMIC, NCBI database of Genotypes and Phenotypes (dbGaP), gnomAD, 69 genomes from CGI, Personalized Genome Project, NCI Genomic Data Commons (GDC), cBioPortal, Intogen, and the Pediatric Cancer Genome Project. In some instances, databases provide information on variants related to cancer or other disease.
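By way of illustration only, a variant lookup of the kind described above can be sketched with an in-memory relational database. The table name, column names, and rows below are hypothetical placeholders; they are not drawn from ClinVar, COSMIC, or any other database named herein, and the annotation text is deliberately left as a placeholder.

```python
import sqlite3

# Hypothetical schema and placeholder rows (assumptions for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE variant_annotation (gene TEXT, variant TEXT, note TEXT)")
conn.executemany(
    "INSERT INTO variant_annotation VALUES (?, ?, ?)",
    [
        ("PIK3CA", "H1047R", "placeholder annotation"),
        ("PIK3CA", "E545K", "placeholder annotation"),
        ("FLT3", "N841K", "placeholder annotation"),
    ],
)

def lookup(gene, variant):
    """Return the stored notes for one gene/variant pair (empty list if none)."""
    cur = conn.execute(
        "SELECT note FROM variant_annotation WHERE gene = ? AND variant = ?",
        (gene, variant),
    )
    return cur.fetchall()

hits = lookup("PIK3CA", "H1047R")
```

A parameterized query is used so that user-supplied gene or variant names are never interpolated into the SQL string.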
EXAMPLES
[0224] The following examples are included for illustrative purposes only and are not intended to limit the scope of the invention.
EXAMPLE 1: PROCESSING CELLS FOR A MULTIOMIC WORKFLOW
[0225] An exemplary workflow is shown in FIG. 1. A sample is obtained from a diseased tissue, such as a frozen or FFPE sample. Cells are collected from the sample using any number of techniques known in the art. In some instances, cells are collected from specific geographic (spatial) locations on a tissue. The cells are then processed using one or more multiomics workflows to collect measurements (FIG. 2) from the genome, transcriptome, methylome, and proteome. Such workflows in some instances target biological inquiries (FIG. 4). In some instances, data from measurements are entered as an input file into a cloud computing platform (FIG. 10). Using the measurements, a penetrance score (FIG. 3) and mechanism (FIG. 5) are generated using an algorithm. Changes having a high penetrance score are selected to validate as specific drug targets (FIG. 6). Personalized treatments (e.g., vaccine or small molecule) are then designed for the specific drug target.
EXAMPLE 2: SAMPLE PROCESSING STEPS AND MODALITIES
[0226] Following the general procedure of Example 1, both mammalian and bacterial cells can be analyzed according to workflow of FIG. 9.
[0227] While preferred embodiments of the present invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the invention. It should be understood that various alternatives to the embodiments of the invention described herein may be employed in practicing the invention. It is intended that the following claims define the scope of the invention and that methods and structures within the scope of these claims and their equivalents be covered thereby.
EXAMPLE 3: A MULTIOMIC VIEW OF AML
[0228] A MOLM-13 drug-resistant model was generated using quizartinib to target FLT3. The patient from which the MOLM-13 cell line was generated harbored an internal tandem duplication (ITD) in the receptor kinase FLT3 gene, resulting in hyperactive growth signaling and sensitivity to the FLT3 inhibitor quizartinib. The generation of resistance in culture can be seen in FIG. 13. The quizartinib-resistant cells also harbor an N841K mutation, which has also been found in AML patients. A genetic analysis of parental and resistant genes can be seen in FIG. 14A.
[0229] Genomic and transcriptomic libraries were prepared. First, the cytosol was lysed. Then the mRNA transcriptome was converted to cDNA using 1st strand synthesis. Next, nuclear lysis occurred. Whole genome amplification via PTA occurred. The transcriptome cDNA and genomic DNA were then isolated. The cDNA was pre-amplified via PTA and a library was prepared for NGS of the transcriptomic library. Likewise, library prep of the PTA-amplified genomic DNA occurred, and the genomic library was analyzed via NGS. Resistant cells showed a loss of Chromosome 5 and a gain of 19q, consistent with karyotypic data, as depicted in FIGS. 14B-14C.
[0230] Analysis of the transcriptome showed differences between single cells in the parental cell lines and the resistant cell lines. FIG. 14D depicts a principal component analysis of the transcriptomics data of parental and resistant cells. A clustered heat map, as depicted in FIG. 14E, showed that resistant cells had an upregulation of the enhancer factor CEBPA (mutated in AML patients). GAS6 was also upregulated. Transcriptional bypass of FLT3 signaling by GAS6 upregulation can drive Axl signaling in resistant cells, as depicted in FIG. 14F. Full-transcript coverage (compared to end-counting) allows for insights into exon usage, as depicted in FIGS. 14G-14H. Isoform biases in parental versus resistant cells manifest both as alternative 5’ exon utilization (PPP1R14B) and alternative internal exon utilization (HADHA), resulting in different transcript lengths (FIGS. 14G-14H).
[0231] Single nucleotide variations were also analyzed between parental and resistant genotypes. An SNV matrix was created, and genotypes were coded as -1 (0/0), 0 (0/1 or 1/0), and 1 (1/1). The matrix described the presence of 28,134 SNVs across samples. A PCA was performed using the matrix and projected into two dimensions. The PCA is depicted in FIG. 15A. Multinomial logistic regression of the SNV matrix was performed whereby the condition Parental or Resistant was modeled. Subsequently, p-values were derived using a Wald test and filtered at p < 0.01, which resulted in 520 SNVs that appear in the heatmap (FIG. 15B). Hierarchical clustering was applied over the matrix using Manhattan distance and ward.D as the clustering algorithm.
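As a minimal sketch of the genotype coding and distance measure described in this paragraph, the following converts genotype calls to the -1/0/1 coding and computes the Manhattan distance between cells. The cell names and genotype calls are hypothetical toy values rather than data from the study, and the PCA, regression, and ward.D clustering steps are not reproduced here.

```python
# Genotype coding described in the text: 0/0 -> -1, 0/1 or 1/0 -> 0, 1/1 -> 1.
GENOTYPE_CODE = {"0/0": -1, "0/1": 0, "1/0": 0, "1/1": 1}

def encode_genotypes(calls):
    """Convert one cell's genotype strings into the -1/0/1 coding."""
    return [GENOTYPE_CODE[c] for c in calls]

def manhattan(a, b):
    """Manhattan (city-block) distance between two coded genotype vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical toy calls: three cells by four SNVs (not data from the study).
cells = {
    "parental_1": ["0/0", "0/1", "1/1", "0/0"],
    "parental_2": ["0/0", "0/1", "1/1", "0/1"],
    "resistant_1": ["1/1", "1/0", "0/0", "1/1"],
}
matrix = {name: encode_genotypes(calls) for name, calls in cells.items()}

d_within = manhattan(matrix["parental_1"], matrix["parental_2"])
d_between = manhattan(matrix["parental_1"], matrix["resistant_1"])
```

In this toy input the two parental cells differ at one SNV by one coding step, while the parental and resistant cells differ at three SNVs by two steps each, so the between-condition distance is larger.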
[0232] The genomic and transcriptomic data can be correlated. Linking the SNV and transcription modulation data reveals that an intronic single nucleotide genotypic shift between parental and resistant cells within the MYC gene correlated with differential MYC transcript levels. Results are depicted in FIGS. 16A-16C. Overall, the genome had approximately two orders of magnitude more plasticity than the transcriptome. There were 300 expression variants and 28,134 genetic variants. Genome plasticity drove greater differentiation of cell clusters. These cell foundational changes were verified within the transcriptome. The evolutionary pressure on the drug resistance is high.
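The "approximately two orders of magnitude" comparison can be checked from the two counts reported in this paragraph. The sketch below simply takes the base-10 logarithm of their ratio; treating the ratio of variant counts as the measure of relative plasticity is an assumption made for illustration.

```python
import math

genetic_variants = 28134    # SNVs reported in the SNV matrix
expression_variants = 300   # expression variants reported in the text

ratio = genetic_variants / expression_variants   # roughly 94-fold
orders_of_magnitude = math.log10(ratio)          # just under 2
```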
EXAMPLE 4: A MULTIOMIC VIEW OF DUCTAL CARCINOMA IN SITU (DCIS)/INVASIVE DUCTAL CARCINOMA
[0233] A 7 cm DCIS (grade II) and a 1.2 cm invasive cancer (grade I) were analyzed. The cancer was ER+ PR+ HER2-. Normal and tumor tissue were digested to single cells. The tissue was stained with H&E staining and formalin-fixed, paraffin embedded prior to genomic DNA isolation (FIG. 17). The transcriptome and genome were analyzed using the methods described in Example 5.
[0234] There was single-cell heterogeneity in CNV profiles of the primary breast cancer cells. Additionally, high and low EpCAM cells showed specificity in CNV profiles, as depicted in FIG. 18A. Known DCIS copy number alterations harbor prototypical tumor suppressor genes, as depicted in FIG. 18B.
[0235] An analysis of SNV in primary breast cancer cells showed a variety of mutually exclusive single-cell oncogene PIK3CA mutations, as depicted in FIG. 19. Patient 1 had 2/19 cells with a PIK3CA H1047R mutation and 13/19 cells with a PIK3CA N345K mutation. Patient 2 had 10/13 cells with a PIK3CA E545K mutation. Patient 3 had 0/8 cells with PIK3CA mutations. For patient 1, SNV and CNV were compared across the 19 cells analyzed. Heterogeneity was observed within single cells. However, some cells showed neither SNV nor CNV mutations (e.g., FIG. 20).
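The per-patient cell counts reported above can be expressed as fractions of cells carrying each PIK3CA mutation. The sketch below uses only the counts stated in the text; the final line assumes, consistent with the mutations being described as mutually exclusive, that the two Patient 1 mutations never occur in the same cell.

```python
from fractions import Fraction

# (mutant cells, total cells) as reported in the text for each patient and mutation.
reported = {
    ("Patient 1", "H1047R"): (2, 19),
    ("Patient 1", "N345K"): (13, 19),
    ("Patient 2", "E545K"): (10, 13),
    ("Patient 3", "any PIK3CA"): (0, 8),
}

cell_fraction = {key: Fraction(m, n) for key, (m, n) in reported.items()}
percent = {key: round(100 * float(f), 1) for key, f in cell_fraction.items()}

# Assumption: the two Patient 1 mutations never share a cell, so the fractions add.
patient1_mutant = cell_fraction[("Patient 1", "H1047R")] + cell_fraction[("Patient 1", "N345K")]
```

Under that assumption, 15 of the 19 Patient 1 cells carry one of the two mutations and 4 carry neither.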
[0236] A principal component analysis of the gene expression profiles results in a separation of EpCAM high and low cells, as depicted in FIG. 21. Clustering by genes enriched in breast cancer showed low levels of expression in the EpCAM low cells. IL-2 and CD4 expression suggests these cells are tumor infiltrating lymphocytes.
[0237] The plasticity of the genome is significantly higher than found in the transcriptome and is the driver of cellular evolution. This method described transcriptional signatures that exposed the presence of tumor infiltrating lymphocytes in the tumor sample and guided interpretation of genotype. RNA mechanisms of resistance were jointly identified, including transcriptional bypass mechanisms in response to drug treatment. Unification of these DNA/RNA data identified candidate regulatory SNVs proximal to genes differentially influencing their expression between parental and resistant cells, thereby exposing novel genes and modes of drug resistance.
EXAMPLE 5: DEVELOPING A VACCINATION TARGET BASED ON A “HIGH PENETRANCE” TARGET
[0238] Following the general procedure of Example 1, the data is used to create a vaccination target to a specific “high penetrance” target according to workflow of FIG. 9.
[0239] If a change is detected in the genome, whether at a single base, as a translocation, or as a copy number variant, and the same mutation is also detected in the transcriptome, then the penetrance for this change may be high. This can also involve a mutation in a promoter, enhancer, or pioneer factor for a splice variant, or a splice variant arising from an alternate single nucleotide variant. For example, this splice variant may code for a surface marker presentation or translation/expression. In this case, the same genomic or transcriptomic sequence can be used to target the immune system to the specific cell with the specific mutation or genomic, transcriptomic, or proteomic alteration. Similar to an mRNA vaccine, this oligonucleotide can be introduced to the same animal (person or study subject) to elicit a response to this modified gene in its transcriptomic or proteomic state. Alternatively, a dendritic cell may be “reprogrammed” with this information.
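One possible way to quantify the situation described in this example is to count, among single cells in which a genomic change is detected, the fraction in which the same change is also detected in the transcriptome. This definition is an illustrative assumption only; the disclosure does not specify this formula, and the function name and input data below are hypothetical.

```python
def penetrance_score(cells):
    """Fraction of cells carrying a genomic change that also show it in the transcriptome.

    `cells` is a list of (in_genome, in_transcriptome) booleans, one pair per cell.
    This definition is an illustrative assumption, not the disclosed algorithm.
    """
    with_change = [c for c in cells if c[0]]
    if not with_change:
        return 0.0  # no cell carries the genomic change
    return sum(1 for c in with_change if c[1]) / len(with_change)

# Hypothetical calls for one variant across five cells.
calls = [(True, True), (True, True), (True, False), (False, False), (True, True)]
score = penetrance_score(calls)
```

With these toy calls, four cells carry the genomic change and three of them also show it in the transcriptome, giving a score of 0.75. A score defined this way falls between 0 and 1.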
EXAMPLE 6: SUPERIOR RNA REPRESENTATION COMPARED TO FIELD
[0240] The methods and systems of the present disclosure (Resolve amplification of cDNA) enable full length synthesis of most transcripts found in the cell. In some examples, including this example, the cDNA is enriched across its entire length. Therefore, a bias of amplification or subsequent sequencing reads from the 5’ or 3’ end of a transcript does not occur. FIGS. 22A-22C show example data illustrating this point. On the data graph shown in FIG. 22A, ResolveOME refers to data generated using the methods and systems of the present disclosure. Droplet-RNAseq shows an example data set generated using a system other than the systems of the present disclosure as a comparison. Data generated using ResolveOME (a method according to the present disclosure which may comprise using PTA), as detailed anywhere and throughout the present disclosure, do not demonstrate a bias of amplification or subsequent sequencing reads from the 5’ or 3’ end of a transcript. The symmetrical pattern seen in the ResolveOME data points demonstrates this point. Such bias is observed in the other data set on the graph (Droplet-RNAseq). The methods of the present disclosure demonstrate a superior performance in analyzing RNA with high coverage over a wide range of 5’-3’ gene body percentile values, as shown on the graph. Conversely, the Droplet-RNAseq method leads to low coverage in the early sections of the x-axis and higher coverage further along the x-axis and toward the end. As such, this data set is asymmetric and biased.
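The symmetry of coverage across the 5’-3’ gene body can be summarized with a simple ratio, sketched below. The metric (mean coverage in the 3’ half divided by mean coverage in the 5’ half) and both coverage profiles are hypothetical choices for illustration; they are not the measured data of FIG. 22A.

```python
def end_bias(coverage):
    """Ratio of mean coverage in the 3' half to mean coverage in the 5' half.

    `coverage` lists coverage values ordered from the 5' end to the 3' end.
    A value near 1.0 indicates symmetric coverage; this metric is an assumption.
    """
    half = len(coverage) // 2
    five_prime = sum(coverage[:half]) / half
    three_prime = sum(coverage[-half:]) / half
    return three_prime / five_prime

# Hypothetical profiles over ten gene body bins (not measured data).
even_profile = [0.8, 0.9, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.8]
skewed_profile = [0.1, 0.1, 0.2, 0.2, 0.3, 0.5, 0.7, 0.9, 1.0, 1.0]

even_ratio = end_bias(even_profile)      # close to 1.0 (symmetric)
skewed_ratio = end_bias(skewed_profile)  # well above 1.0 (3'-biased)
```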
[0241] FIG. 22B shows a graph of single cell analysis data comparing ResolveOME to droplet RNAseq in terms of transcript length. This graph shows that ResolveOME (methods and systems of the present disclosure involving PTA) demonstrates superior RNA performance with respect to increased representation across various transcript sizes. This coverage is shown across a broader and longer set of transcript lengths. The competing technology, droplet RNAseq, starts losing enrichment after 1.5 kb, while ResolveOME (the method of the present disclosure) is capable of amplifying and detecting transcripts over 4 kb. This increased evenness of coverage and broader detection of transcripts impacts robustness of downstream biomarker detection, such as allele variation, where a 2.5x increase in variant detection is achieved, coming from variation detected outside the 5’ or 3’ end of the transcript (FIG. 22C). FIG. 22C shows an additional graph characterizing number of detected DNA variants per cell vs. number of detected genes per cell. The data set on the top (depicted with circles) is generated using the methods and systems of the present disclosure, demonstrating more robust variant calling for a wider range of number of detected genes per cell. The competing technology (depicted in squares) detects variants over a narrower range of number of detected genes per cell. As such, the competing technology is more limited. The methods and systems of the present disclosure demonstrate a number of detected RNA variants per cell ranging from 150 to 2750. The competing technology (droplet RNAseq) is limited to a number of detected RNA variants per cell ranging from 750 to 1250.
EXAMPLE 7: ISOFORM EXPRESSION UNDERLYING COMPLEX DISEASE MECHANISMS
[0242] FIGS. 23A-23C demonstrate unification of genomic lesions and gene expression in an AML model of drug resistance. FIG. 23A shows differential transcript utilization (DTU) between MOLM-13 parental and drug-resistant single cells. Color intensity indicates transcript proportion of A or B isoform of indicated transcript. FIG. 23B shows a heatmap with transcripts in the y-axis that show a statistical (ZLM p < 0.01) association with ploidy level across all cells in the MOLM-13 dataset. Color of the tiles represents the average standardized expression value at a given ploidy level. The right panel shows the output of the ZLM model testing the expression given the ploidy. Bars are colored based on the -log10 p-value of the ZLM model testing transcriptional differences between parental and resistant cells. Blocks of concordance (ploidy and expression increased or decreased concomitantly) or discordance (ploidy and expression inversely correlated) are shown for a given transcript and chromosomal location. FIG. 23C shows a bubble plot showing SNV-transcript expression associations (p < 0.05) determined by ZLM modeling between parental and resistant cells. Candidate SNVs are shown in the y-axis and genotypes in the x-axis. Size of the circle denotes the genotype prevalence of the variant in the MOLM-13 cell type set (parental or resistant). Colors of points denote the standardized mean expression level of the transcript in the set. ENCODE genotypic features mapping to the given single nucleotide variant are indicated in the right bar and are categorized in the heatmap as regulatory (top) or genic (bottom).
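The concordance and discordance blocks described for FIG. 23B can be illustrated with a simple sign comparison, sketched below. The function name, transcript names, and numeric values are hypothetical, and the sketch does not reproduce the ZLM modeling used to produce the figure.

```python
def classify(ploidy_change, expression_change):
    """Label a transcript by whether ploidy and expression move together."""
    if ploidy_change == 0 or expression_change == 0:
        return "no change"
    same_direction = (ploidy_change > 0) == (expression_change > 0)
    return "concordant" if same_direction else "discordant"

# Hypothetical (ploidy change, standardized expression change) values per transcript.
examples = {
    "transcript_A": (1, 0.8),    # gain with increased expression
    "transcript_B": (-1, -0.5),  # loss with decreased expression
    "transcript_C": (1, -0.6),   # gain with decreased expression
}
labels = {name: classify(*values) for name, values in examples.items()}
```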
EXAMPLE 8: METAGENOMIC APPLICATIONS
[0243] As highlighted in FIG. 9, bacterial colonies represent a unique opportunity for naturally discrete cells which tend to have accelerated evolutionary forces. Being able to process a high number of bacteria opens up tangible impacts to human health.
[0244] In situations where infection accelerates in the presence of antimicrobials, understanding the molecular determinants of antimicrobial resistance (AMR) is key in determining treatment. Standard genomic methods, which typically involve bulk sequencing of colonies, lack the sensitivity to see rare (small numbers of bacteria) mutations that drive the change. In addition, traditional sequencing represents genomic information independently across component nucleotides such that it is unknown if more than one mutation is found in the same bacterium. A unique advantage to the workflow here is that we fully characterize each allele for each bacterium sequenced, so we can report phased mutations or specific cells in a quantifiable manner (e.g., a mutation detected in 10 out of 1000 cells would be a 1% allele frequency).
[0245] In addition to the review of genetic changes, empowering the multiome enables us to capture multiple mechanisms of AMR beyond genetic changes. By reviewing transcriptomic measurements within the bacterium, we can see expression states of pathways and membrane channels. Expression levels can be used to see if there is active enrichment (increased expression) or repression (decreased expression) of genes that may be involved in drug uptake or active drug efflux. We can also look at expressed mutations to see if proteins which may metabolize specific drugs or drug classes have translated genomic variants to impact proteins.
[0246] While preferred embodiments of the present invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. None of the descriptions are meant to be limiting. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the methods and systems of the present disclosure. It should be understood that various alternatives to the embodiments of the invention described herein may be employed in practicing the invention. It is intended that the following claims define the scope of the invention and that methods and structures within the scope of these claims and their equivalents be covered thereby.
Claims
WHAT IS CLAIMED IS:
1. A method of single cell analysis comprising:
(a) providing or obtaining a plurality of cells;
(b) performing one or more experiments on single cells of the plurality of cells to generate at least a first data set and a second data set from the plurality of cells, wherein the first data set is a genomic data set and the second data set is a transcriptomic data set and/or a proteomic data set;
(c) identifying a correlation between the first data set and the second data set for at least a portion of the plurality of cells; and
(d) using the correlation obtained in (c), identifying a disease biomarker, designing a therapeutic, or designing a vaccine for a disease.
2. The method of claim 1, wherein performing the one or more experiments comprises performing primary template directed amplification (PTA).
3. The method of claim 1, wherein the one or more experiments or screens comprise a genomics experiment, a transcriptomic experiment, a proteomics experiment, a methylomic experiment or any combination thereof.
4. The method of claim 1, wherein the one or more experiments comprise high-throughput single cell analysis, wherein single cells of the plurality of cells are screened in high-throughput.
5. The method of claim 4, wherein the one or more experiments are performed using a miniaturized high-throughput single cell screening system.
6. The method of claim 5, wherein the method comprises compartmentalizing the plurality of cells into a plurality of partitions, wherein a partition of the plurality of partitions comprises a single cell of the plurality of cells.
7. The method of claim 6, wherein the plurality of partitions comprises a plurality of wells, a plurality of droplets, or both.
8. The method of claim 7, wherein the wells are miniaturized wells.
9. The method of claim 5, wherein the miniaturized high-throughput single cell screening system comprises a microfluidic device, a miniaturized array, or both.
10. The method of claim 1, wherein the one or more experiments comprise performing one or more reactions.
11. The method of claim 6, wherein a partition of the plurality of partitions comprises a single cell therein, and the one or more experiments or screens comprise performing one or more reactions on the single cell in the partition.
12. The method of claim 11, wherein the one or more reactions comprise cell lysis.
13. The method of claim 11, wherein the one or more reactions comprise an amplification reaction.
14. The method of claim 13, wherein the amplification reaction comprises primary template directed amplification (PTA).
15. The method of claim 11, wherein the one or more reactions comprise lysing the single cell, extracting the genomic material of the single cell, thereby releasing a cellular nucleic acid molecule from the single cell in the partition, and performing an amplification reaction on the cellular nucleic acid molecule.
16. The method of any one of claims 10-15, wherein performing the one or more reactions comprises using one or more reagents.
17. The method of claim 16, wherein the one or more reagent(s) comprise one or more of at least one amplification primer, at least one nucleic acid polymerase, and a mixture of nucleotides, wherein the mixture of nucleotides comprises at least one terminator nucleotide which terminates nucleic acid replication by the polymerase.
18. The method of claim 17, wherein the terminator nucleotide is an irreversible terminator.
19. The method of claim 17, wherein the terminator nucleotide is selected from the group consisting of nucleotides with modification to the alpha group, C3 spacer nucleotides, locked nucleic acids (LNA), inverted nucleic acids, 2' fluoro nucleotides, 3' phosphorylated nucleotides, 2'-O-Methyl modified nucleotides, and trans nucleic acids.
20. The method of claim 19, wherein the nucleotides with modification to the alpha group are alpha-thio dideoxynucleotides.
21. The method of claim 17, wherein the terminator nucleotide comprises modifications of the r group of the 3’ carbon of the deoxyribose.
22. The method of claim 17, wherein the terminator nucleotide is selected from the group consisting of dideoxynucleotides, inverted dideoxynucleotides, 3' biotinylated nucleotides, 3' amino nucleotides, 3'-phosphorylated nucleotides, 3'-O-methyl nucleotides, 3' carbon spacer nucleotides including 3' C3 spacer nucleotides, 3' C18 nucleotides, 3' Hexanediol spacer nucleotides, acyclonucleotides, and combinations thereof.
23. The method of any one of claims 5-9 or 11-22, wherein a partition of the plurality of partitions comprises at least a single cell and a bead.
24. The method of claim 23, wherein the bead delivers a reagent for performing a reaction on the single cell in the partition.
25. The method of claim 24, wherein the reagent is bound to the bead via a cleavable linker and is configured to be released from the bead via cleavage of the cleavable linker.
26. The method of claim 24 or 25, wherein the reagent comprises a barcode configured to identify the cell or a constituent of the cell.
27. The method of claim 26, wherein the constituent of the cell comprises genomic material of the cell, ribonucleic acid (RNA), deoxyribonucleic acid (DNA), or any combination thereof.
28. The method of claim 26, wherein the method comprises lysing the cell in the partition, releasing a cellular nucleic acid molecule of the cell in the partition, releasing the barcode from the bead via cleavage of the cleavable linker, and hybridizing the cellular nucleic acid molecule to the barcode.
29. The method of claim 11, wherein the one or more reactions comprise lysing the single cell, thereby releasing cellular nucleic acid molecules in the partition, performing one or more amplification reactions on the cellular nucleic acid molecules thereby generating amplified cellular nucleic acid molecules, and wherein the method further comprises extracting the amplified cellular nucleic acid molecules from the partition, and sequencing the amplified cellular nucleic acid molecules.
30. The method of claim 1, wherein generating the first data set comprises performing primary template directed amplification (PTA) and generating the second data set comprises performing a reverse transcription reaction.
31. The method of claim 30, performing the reverse transcription reaction comprises generating a cDNA library.
32. The method of claim 1, wherein generating the first data set comprises determining a methylation site in a cellular nucleic acid molecule using PTA, thereby generating a methylation library.
33. The method of claim 32, further comprising comparing the methylation library to a reference library for a single cell of the plurality of cells, wherein the methylation library and the reference library are generated from the same cell.
34. The method of any one of the preceding claims, wherein identifying the correlation comprises calculating or assigning a penetrance score to the correlation, wherein the penetrance score quantifies the correlation.
35. The method of claim 34, wherein the penetrance score guides identifying the disease biomarker, designing the therapeutic, designing the vaccine for the disease, or any combination thereof.
36. The method of claim 34, wherein a high penetrance score indicates a strong correlation between the first data set and the second data set.
37. The method of claim 36, wherein the high penetrance score indicates that the expression of a gene identified in the first data set leads to a transcriptomic event, a proteomic event or both, and wherein the gene is identified as a disease biomarker.
38. The method of any one of claims 34-37, wherein a low penetrance score indicates a weak correlation between the first data set and the second data set, and that the expression of a gene identified in the first data set does not lead to a transcriptomic event, a proteomic event, or either, and wherein the gene is not identified as a disease biomarker.
39. The method of any one of the preceding claims, wherein identifying the correlation is performed with the aid of a computer system comprising a computer program.
40. The method of claim 39, wherein the computer program comprises a bioinformatics algorithm.
41. The method of claim 1, wherein the first data set and the second data set are combined or integrated into a database.
42. A method of developing a cancer treatment comprising:
(a) generating multiomics data from one or more single cells, wherein generating comprises performing Primary Template Directed Amplification (PTA), and wherein the multiomics data comprises two or more of genome data, transcriptome data, and proteomics data;
(b) correlating one or more mutations in genome data with corresponding mutations in one or both of (i) an mRNA of the transcriptome data and (ii) a protein of the proteome data; and
(c) generating a treatment targeting one or both of the mRNA and the protein.
43. The method of claim 42, wherein the correlation is quantified by a penetrance score.
44. The method of claim 43, wherein the penetrance score is at least 0.5.
45. The method of claim 43, wherein the penetrance score is at least 0.9.
46. The method of claim 43, wherein the treatment comprises an mRNA vaccine.
47. The method of claim 43, wherein the treatment comprises reprogramming a dendritic cell to target one or both of the mRNA or protein.
48. The method of claim 43, wherein the mutation in genome data comprises a DNA mutation.
49. The method of claim 48, wherein the DNA mutation is selected from the group consisting of SNV*X, CNV*X, translocation, INDEL, frameshift, stop codon, mitochondrial, promoter/enhancer, TCR/BCR, and other change.
50. The method of claim 43, wherein the mRNA comprises a transcript change.
51. The method of claim 50, wherein the transcript change is selected from the group consisting of expression, splice variant, fusion, lncRNA, miRNA, TCR/BCR, promoter, truncated gene, mitochondrial, or mutation.
52. The method of claim 43, wherein the protein comprises a protein change.
53. The method of claim 52, wherein the protein change is selected from the group consisting of over/under expressed, truncated, surface bound, frameshift, misfolded, metabolic, ligand independence, confirmation, activity change, or fused.
54. The method of claim 43, wherein the cancer comprises breast cancer.
55. The method of claim 43, wherein the breast cancer comprises ductal carcinoma.
56. The method of claim 43, wherein the cancer comprises leukemia.
57. The method of claim 43, wherein the single cancer cells are obtained from an FFPE sample.
58. A method for validating a disease target for a disease comprising:
(a) selecting cells from a tissue;
(b) banking the cells;
(c) performing one or more multiomic methods on the cells to generate multiomics data; and
(d) applying a computer algorithm to process the multiomics data and generate a disease target.
59. The method of claim 58, wherein selecting the cells comprises FACS sorting, microfluidics, spatial cell selection, or ultra-high throughput cell sorting.
60. The method of claim 58, wherein the number of cells is at least about 200, 300, 400, 500, 1000, 2000, 3000, 4000, 5000, 6000, 10,000 or greater.
61. The method of claim 58, wherein the disease is cancer.
62. The method of claim 58, wherein the multiomics methods comprise PTA.
63. The method of claim 58, wherein the multiomics data comprises data from one or more of a genome, epigenome, transcriptome, proteome, lipidome, or methylome.
64. The method of claim 58, wherein the method further comprises a treatment based on the disease target.
65. The method of claim 58, wherein the treatment comprises an mRNA vaccine or small molecule.
66. A system for determining a penetrance score comprising: a computing system comprising at least one processor and instructions executable by the at least one processor to provide an application configured to perform operations comprising:
receiving multiomics data from one or more sources and at least one biological state; and applying an algorithm configured to process the data and generate a penetrance score.
67. The system of claim 66, wherein the computing system comprises a cloud computing platform.
68. The system of claim 66, wherein the multiomics data comprises data obtained from analysis of one or more of genomic DNA, transcript RNA, proteins, lipids, or metabolites.
69. The system of claim 66, wherein the multiomics data comprises one or more measurements.
70. The system of claim 68, wherein one or more of the measurements is a silent change.
71. The system of claim 66, wherein the multiomics data comprises data from one or more of a genome, a transcriptome, a proteome, a metabolome, a lipidome, or an epigenome.
72. The system of claim 69, wherein the multiomics data comprises data from a genome.
73. The system of claim 71, wherein the one or more measurements are selected from the group consisting of: copy number variation, translocation, and mutation burden.
74. The system of claim 69, wherein the multiomics data comprises data from a methylome.
75. The system of claim 73, wherein the one or more measurements are selected from the group consisting of: methylation at CpG sites, gene activation, and gene repression.
76. The system of claim 69, wherein the multiomics data comprises data from a transcriptome.
77. The system of claim 75, wherein the one or more measurements are selected from the group consisting of: expressed genes, gene fusions, expressed variants and splice variants.
78. The system of claim 69, wherein the multiomics data comprises data from a proteome.
79. The system of claim 77, wherein the one or more measurements are selected from the group consisting of: translation level, phosphorylation state, and protein modification.
80. The system of claim 66, wherein the one or more sources comprise an individual organism.
81. The system of claim 66, wherein the one or more sources comprise cells.
82. The system of claim 66, wherein the cells are mammalian cells, human cells, bacterial cells, cancer cells, an immortalized cell line, a primary patient cell line, or any combination thereof.
83. The system of claim 66, wherein the cells are obtained from a tissue.
84. The system of claim 66, wherein the cells are obtained from a tissue cross-section.
85. The system of claim 66, wherein the biological state comprises a disease state.
86. The system of claim 66, wherein the disease state comprises cancer.
87. The system of claim 66, wherein the algorithm further generates a mechanism based on the data.
88. The system of claim 66, wherein the mechanism is generated by detecting one or more changes in one or more measurements.
89. The system of claim 88, wherein the change comprises a genome DNA change.
90. The system of claim 88, wherein the genome DNA change is selected from the group consisting of SNV*X, CNV*X, translocation, INDEL, frameshift, stop codon, mitochondrial, promoter/enhancer, TCR/BCR, and other change.
91. The system of claim 88, wherein the change comprises a transcript change.
92. The system of claim 88, wherein the transcript change is selected from the group consisting of expression, splice variant, fusion, lncRNA, miRNA, TCR/BCR, promoter, truncated gene, mitochondrial, or mutation.
93. The system of claim 88, wherein the change comprises a protein change.
94. The system of claim 88, wherein the protein change is selected from the group consisting of over/under expressed, truncated, surface bound, frameshift, misfolded, metabolic, ligand independence, conformation, activity change, or fused.
95. The system of claim 88, wherein the mechanism is determined to be one or more of a genomic, transcriptomic, proteomic, methylomic, lipidomic, or metabolomic mechanism.
96. The method or system of any one of the preceding claims, wherein the method or system is capable of detecting a number of RNA variants per cell of at least 750, 1000, 1500, 2000, 2500 or higher.
97. The method or system of any one of the preceding claims, wherein the method or system is capable of detecting a number of genes per cell of from about 1000 to about 8000.
98. The method or system of any one of the preceding claims, wherein the method or system is capable of detecting a number of RNA variants per cell of at least 750, 1000, 1500, 2000, 2500 or higher and a number of genes per cell of from about 1000 to about 8000.
99. The method or system of any one of the preceding claims, comprising full-length synthesis of RNA transcripts in the cell, wherein a plurality of amplification products achieved from performing the method are substantially unbiased over a range of 5'-3' gene body percentiles.
100. The method or system of any one of the preceding claims, capable of amplifying and detecting transcripts over 1 kb, 2 kb, 3 kb, 4 kb, or longer in length.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US202263392580P | 2022-07-27 | 2022-07-27 | |
US63/392,580 | 2022-07-27 |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2024026376A2 (en) | 2024-02-01 |
WO2024026376A3 (en) | 2024-03-07 |
Family
ID=89707350
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2023/071068 WO2024026376A2 (en) | 2022-07-27 | 2023-07-26 | Methods and systems for multiomic analysis |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2024026376A2 (en) |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
MX2022001324A (en) * | 2019-07-31 | 2022-05-19 | Bioskryb Genomics Inc | Single cell analysis. |
CN115298309A (en) * | 2020-03-20 | 2022-11-04 | 基因泰克公司 | System and method for tracking single cell evolution |
- 2023-07-26: WO application PCT/US2023/071068, published as WO2024026376A2; status unknown
Also Published As
Publication number | Publication date |
---|---|
WO2024026376A3 (en) | 2024-03-07 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 23847544; Country of ref document: EP; Kind code of ref document: A2 |